DanceMosaic: High-Fidelity Dance Generation with Multimodal Editability
Foram Shah*, Parshwa Shah*, Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Ahmed Helmy
University of North Carolina at Charlotte (UNCC)
Recent advances in dance generation have enabled the automatic synthesis of 3D dance motions. However, existing methods still struggle to produce high-fidelity dance sequences that simultaneously deliver exceptional realism, precise dance-music synchronization, high motion diversity, and physical plausibility. Moreover, they lack the flexibility to edit dance sequences according to diverse guidance signals, such as musical prompts, pose constraints, action labels, and genre descriptions, which significantly restricts their creative utility and adaptability. Unlike existing approaches, DanceMosaic enables fast, high-fidelity dance generation while allowing multimodal motion editing. Specifically, we propose a multimodal masked motion model that fuses a text-to-motion model with music and pose adapters to learn a probabilistic mapping from diverse guidance signals to high-quality dance motion sequences via progressive generative masking training. To further improve generation quality, we propose multimodal classifier-free guidance and an inference-time optimization mechanism that enforce the alignment between the generated motions and the multimodal guidance. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in motion fidelity, motion, and inference efficiency.
* Equal Contribution.
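As a rough illustration of the multimodal classifier-free guidance mentioned in the abstract, the sketch below combines unconditional logits with per-modality conditional logits. It assumes one common additive form, `uncond + sum_m scale_m * (cond_m - uncond)`; the exact formulation used by DanceMosaic is not given on this page, and the function name, modality keys, scales, and toy logits are all hypothetical.

```python
from typing import Dict, List, Sequence


def multimodal_cfg(
    uncond: Sequence[float],
    cond: Dict[str, Sequence[float]],
    scales: Dict[str, float],
) -> List[float]:
    """Combine unconditional and per-modality conditional logits.

    Assumed form (not taken from the paper):
        uncond + sum_m scale_m * (cond_m - uncond)
    """
    guided = list(uncond)
    for name, logits in cond.items():
        if len(logits) != len(uncond):
            raise ValueError(f"logit length mismatch for modality {name!r}")
        w = scales.get(name, 0.0)  # a modality without a scale contributes nothing
        for i, (c, u) in enumerate(zip(logits, uncond)):
            guided[i] += w * (c - u)
    return guided


# Toy logits over a 3-token motion codebook (made-up numbers).
uncond = [0.0, 0.0, 0.0]
cond = {"music": [1.0, 0.0, -1.0], "text": [0.0, 2.0, 0.0]}
guided = multimodal_cfg(uncond, cond, {"music": 2.0, "text": 1.5})
```

With all scales set to zero the result reduces to the unconditional logits, so each scale acts as an independent knob for how strongly one guidance signal pulls the prediction.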
DanceMosaic Genre-Specified Dance Generation
Street Hiphop
Street Jazz
Mix Korean
Street Popping
Classic Shenyun
Street Hiphop
Krump Dance
Folk Miao
Break Dance
Text-Controlled Dance Editing
In-Between Dance Editing with Instruction: "A person does 3 jumping jacks" from 3-5 seconds
In-Between Dance Editing with Instruction: "A person flaps his elbows like chicken" from 5-7 seconds
In-Between Dance Editing with Instruction: "A person walks in circle" from 4-6 seconds
Complete Dance with Text Instruction: "A person dances keeping hand high in air."
In-Between Dance Editing with Instruction: "A person jumps" from 3-5 seconds
In-Between Dance Editing with Instruction: "A person spin at a place" from 3-5 seconds
In-Between Dance Editing with Instruction: "A person is boxing" from 1-4 seconds
In-Between Dance Editing with Instruction: "A person claps" from 5-6 seconds
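The in-between edits above each target a time window such as "from 3-5 seconds". A minimal sketch of how such a window might be turned into a per-frame edit mask for a masked motion model follows. The frame rate, clip length, and function name are assumptions for illustration, not values from DanceMosaic; a model that operates on downsampled motion tokens would additionally divide the frame indices by its downsampling factor.

```python
from typing import List


def inbetween_mask(num_frames: int, start_s: float, end_s: float, fps: float) -> List[bool]:
    """Return a per-frame mask: True = regenerate this frame, False = keep it."""
    if not 0 <= start_s < end_s:
        raise ValueError("expected 0 <= start_s < end_s")
    first = int(round(start_s * fps))
    last = min(int(round(end_s * fps)), num_frames)  # clip the window to the clip length
    return [first <= t < last for t in range(num_frames)]


# Hypothetical 8-second clip at an assumed 30 fps, edited from 3 to 5 seconds.
mask = inbetween_mask(num_frames=240, start_s=3.0, end_s=5.0, fps=30.0)
```

Frames where the mask is True would be re-masked and regenerated under the text instruction, while the remaining frames stay fixed as context.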
Additional Applications
Action-based Outpainting: Walk in and walk out of the frame
Genre-based Outpainting: Street Hiphop to Mix Korean and back to Street Hiphop
Lower Body Constrained Dance Generation
Upper Body Constrained Dance Generation
Long Dance Generation
Long Dance Generation
DanceMosaic Dance Diversity
Same Music and Different Genres -> Shenyun Music
Break Dance
Folk Miao
Street Hiphop