DanceMosaic: High-Fidelity Dance Generation with Multimodal Editability

Foram Shah*, Parshwa Shah*, Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Ahmed Helmy

University of North Carolina at Charlotte (UNCC)

arXiv | Papers With Code | Code (Coming Soon)

Recent advances in dance generation have enabled the automatic synthesis of 3D dance motions. However, existing methods still struggle to produce high-fidelity dance sequences that simultaneously deliver exceptional realism, precise dance-music synchronization, high motion diversity, and physical plausibility. Moreover, existing methods lack the flexibility to edit dance sequences according to diverse guidance signals, such as musical prompts, pose constraints, action labels, and genre descriptions, which significantly restricts their creative utility and adaptability. Unlike existing approaches, DanceMosaic enables fast, high-fidelity dance generation while allowing multimodal motion editing. Specifically, we propose a multimodal masked motion model that fuses a text-to-motion model with music and pose adapters to learn a probabilistic mapping from diverse guidance signals to high-quality dance motion sequences via progressive generative masking training. To further enhance motion generation quality, we propose multimodal classifier-free guidance and an inference-time optimization mechanism that enforce the alignment between the generated motions and the multimodal guidance. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in motion fidelity, motion, and inference efficiency.

* Equal Contribution.
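
The multimodal classifier-free guidance mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common additive formulation, in which each modality's conditional logits are contrasted against the unconditional logits and scaled by a per-modality weight. The function name, the modality keys, the toy logits, and the weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def multimodal_cfg(
    uncond: List[float],
    cond: Dict[str, List[float]],
    scales: Dict[str, float],
) -> List[float]:
    """Combine unconditional and per-modality conditional logits.

    Assumed formulation (not confirmed by the source):
        guided = uncond + sum_m scales[m] * (cond[m] - uncond)
    """
    guided = list(uncond)
    for name, logits in cond.items():
        if len(logits) != len(uncond):
            raise ValueError(f"logit length mismatch for modality {name!r}")
        w = scales.get(name, 0.0)  # a modality without a scale contributes nothing
        for i, (c, u) in enumerate(zip(logits, uncond)):
            guided[i] += w * (c - u)
    return guided


# Toy logits over three hypothetical motion tokens (illustrative values only).
uncond_logits = [0.0, 1.0, 2.0]
cond_logits = {
    "music": [1.0, 1.0, 2.0],
    "text": [0.0, 3.0, 2.0],
}
guidance_scales = {"music": 2.0, "text": 1.0}

guided_logits = multimodal_cfg(uncond_logits, cond_logits, guidance_scales)
print(guided_logits)  # [2.0, 3.0, 2.0]
```

With every scale set to zero the function returns the unconditional logits unchanged, so each weight controls how strongly its modality steers the prediction.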

DanceMosaic Genre-Specified Dance Generation

Street Hiphop

Street Jazz

Mix Korean

Street Popping

Classic Shenyun

Street Hiphop

Krump Dance

Folk Miao

Break Dance

Text-Controlled Dance Editing

In-Between Dance Editing with Instruction: "A person does 3 jumping jacks" from 3-5 seconds

In-Between Dance Editing with Instruction: "A person flaps his elbows like chicken" from 5-7 seconds

In-Between Dance Editing with Instruction: "A person walks in circle" from 4-6 seconds

Complete Dance with Text Instruction: "A person dances keeping hand high in air."

In-Between Dance Editing with Instruction: "A person jumps" from 3-5 seconds

In-Between Dance Editing with Instruction: "A person spin at a place" from 3-5 seconds

In-Between Dance Editing with Instruction: "A person is boxing" from 1-4 seconds

In-Between Dance Editing with Instruction: "A person claps" from 5-6 seconds
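
One plausible way to realize in-between edits like those listed above in a masked motion model is to re-mask only the motion tokens that fall inside the requested time window and regenerate them under the new instruction. The sketch below covers just the window-to-mask step. The frame rate, the number of frames per token, and the function name are illustrative assumptions, not values reported for DanceMosaic.

```python
from typing import List


def edit_window_mask(
    num_tokens: int,
    start_s: float,
    end_s: float,
    fps: int = 30,              # assumed frame rate
    frames_per_token: int = 4,  # assumed temporal downsampling of the tokenizer
) -> List[bool]:
    """Return True for each motion token that overlaps [start_s, end_s)."""
    if not 0 <= start_s < end_s:
        raise ValueError("expected 0 <= start_s < end_s")
    start_frame = int(start_s * fps)
    end_frame = int(end_s * fps)
    mask = []
    for t in range(num_tokens):
        tok_start = t * frames_per_token
        tok_end = tok_start + frames_per_token
        # A token is regenerated if any of its frames lies in the edit window.
        mask.append(tok_start < end_frame and tok_end > start_frame)
    return mask


# An 8-second clip (240 frames, 60 tokens) edited from 3 to 5 seconds.
mask = edit_window_mask(num_tokens=60, start_s=3.0, end_s=5.0)
print(mask.index(True), sum(mask))  # 22 16
```

Tokens marked False keep the original dance, which is what allows the edited segment to blend with the motion before and after it.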

Additional Applications

Action-based Outpainting: Walk in and walk out of the frame

Genre-based Outpainting: Street Hiphop to Mix Korean, then back to Street Hiphop

Lower Body Constrained Dance Generation

Upper Body Constrained Dance Generation

Long Dance Generation

This is an important application because the masked transformer model cannot generate motions longer than 10 seconds in a single forward pass.
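
A common way around such a length limit is to generate the dance in overlapping windows, conditioning each new window on the tail of the previous one. The sketch below only plans the windows. The overlap length and the function name are assumptions, and the source does not state which stitching scheme DanceMosaic uses.

```python
from typing import List, Tuple


def plan_windows(
    total_s: float,
    max_window_s: float = 10.0,  # single-pass limit stated in the text
    overlap_s: float = 2.0,      # assumed context shared by adjacent windows
) -> List[Tuple[float, float]]:
    """Split a long dance into overlapping (start, end) windows in seconds."""
    if total_s <= 0:
        raise ValueError("total_s must be positive")
    if not 0 <= overlap_s < max_window_s:
        raise ValueError("expected 0 <= overlap_s < max_window_s")
    windows = []
    start = 0.0
    while True:
        end = min(start + max_window_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        # The next window starts inside the current one to share context.
        start = end - overlap_s
    return windows


windows = plan_windows(25.0)
print(windows)  # [(0.0, 10.0), (8.0, 18.0), (16.0, 25.0)]
```

Each window stays within the 10-second limit, and the shared two seconds give the generator existing motion to continue from rather than starting each segment cold.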

DanceMosaic Dance Diversity

Same Music, Different Genres: Shenyun Music

Break Dance

Folk Miao

Street Hiphop