Walk Before You Dance: High-fidelity and Editable Dance Synthesis via Generative Masked Motion Prior
AAAI 2026
Foram Shah*, Parshwa Shah*, Muhammad Usama Saleem, Ekkasit Pinyoanuntapong, Pu Wang, Hongfei Xue, Ahmed Helmy
University of North Carolina at Charlotte (UNCC)
Recent advances in dance generation have enabled the automatic synthesis of 3D dance motions. However, existing methods still face significant challenges in simultaneously achieving high realism, precise dance–music synchronization, diverse motion expression, and physical plausibility. To address these limitations, we propose a novel approach that leverages a generative masked text-to-motion model as a distribution prior to learn a probabilistic mapping from diverse guidance signals, including music, genre, and pose, into high-quality dance motion sequences. Our framework also supports semantic motion editing, such as motion inpainting and body part modification. Specifically, we introduce a multi-tower masked motion model that integrates a text-conditioned masked motion backbone with two parallel, modality-specific branches: a music-guidance tower and a pose-guidance tower. The model is trained using synchronized and progressive masked training, which allows effective infusion of the pretrained text-to-motion prior into the dance synthesis process while enabling each guidance branch to optimize independently through its own loss function, mitigating gradient interference. During inference, we introduce classifier-free logits guidance and pose-guided token optimization to strengthen the influence of music, genre, and pose signals. Extensive experiments demonstrate that our method sets a new state of the art in dance generation, significantly advancing both the quality and editability over existing approaches.
* Equal Contribution.
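The abstract mentions classifier-free logits guidance at inference time. The page does not give the exact formulation, so the following is only a minimal sketch of the standard classifier-free guidance rule applied to token logits, for a single guidance signal. The function names `guided_logits` and `softmax` and the `scale` parameter are illustrative assumptions, not the authors' code.

```python
import math

def guided_logits(cond_logits, uncond_logits, scale):
    """Standard classifier-free guidance on logits (assumed form):
    start from the unconditional logits and move toward the
    conditional ones, amplified by `scale`."""
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example over a 3-entry motion-token vocabulary (values are made up).
cond = [2.0, 0.5, 0.0]    # logits with the music condition
uncond = [1.0, 0.5, 0.0]  # logits with the condition dropped
probs = softmax(guided_logits(cond, uncond, scale=3.0))
```

With `scale` set to 0 the rule falls back to the unconditional logits, and with `scale` set to 1 it returns the conditional logits unchanged; larger values push the token distribution further toward the guidance signal.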
Introductory Video
Text-Controlled Dance Editing
In-Between Dance Editing with Instruction: "A person does 3 jumping jacks" from 3-5 seconds
In-Between Dance Editing with Instruction: "A person flaps his elbows like chicken" from 5-7 seconds
In-Between Dance Editing with Instruction: "A person walks in circle" from 4-6 seconds
Complete Dance with Text Instruction: "A person dances keeping hand high in air."
In-Between Dance Editing with Instruction: "A person jumps" from 3-5 seconds
In-Between Dance Editing with Instruction: "A person spin at a place" from 3-5 seconds
In-Between Dance Editing with Instruction: "A person is boxing" from 1-4 seconds
In-Between Dance Editing with Instruction: "A person claps" from 5-6 seconds
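The in-between edits above each replace a time window of an existing dance. In a masked motion model, one plausible way to set this up is to mask the motion tokens that fall inside the window and let the model refill them under the text instruction. The sketch below shows only the window-to-mask step. The frame rate, the number of frames per token, the `MASK` id, and both function names are hypothetical and are not taken from the paper.

```python
MASK = -1  # hypothetical id marking a token to be regenerated

def seconds_to_token_span(start_s, end_s, fps=30, frames_per_token=4):
    """Map a time window in seconds to a half-open token index range.
    fps and frames_per_token are assumed values for illustration."""
    start = int(start_s * fps) // frames_per_token
    # Ceiling division so a partially covered token is included.
    end = -(-int(end_s * fps) // frames_per_token)
    return start, end

def mask_span(tokens, start, end, mask_id=MASK):
    """Return a copy of tokens with positions [start, end) masked."""
    end = min(end, len(tokens))
    return [mask_id if start <= i < end else t for i, t in enumerate(tokens)]

# Example: mask the 3-5 second window of a 60-token sequence.
tokens = list(range(60))
start, end = seconds_to_token_span(3, 5)
masked = mask_span(tokens, start, end)
```

Tokens outside the window are left untouched, so the surrounding dance is preserved while only the masked span is regenerated.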
Additional Applications
Action-based Outpainting: Walk in and walk out of the frame
Genre-based Outpainting: Street Hiphop to Mix Korean and back to Street Hiphop
Lower Body Constrained Dance Generation
Upper Body Constrained Dance Generation
Long Dance Generation
Long Dance Generation
DanceMosaic Dance Diversity
Same Music, Different Genres (Music: Shenyun)
Break Dance
Folk Miao
Street Hiphop
DanceMosaic Genre-Specified Dance Generation
Street Popping
Classic Shenyun
Street Hiphop
Krump Dance
Folk Miao
Break Dance