Listen, denoise, action! Audio-driven motion synthesis with diffusion models



We present diffusion models for audio-driven motion synthesis, based on the Conformer architecture. We demonstrate that our model can generate full-body gestures and dancing with top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. Finally, we generalise the classifier-free guidance procedure to perform style interpolation, in a manner that has connections to product-of-experts models.

Our denoising diffusion model consists of a stack of Conformer layers, which replace the feedforward networks in Transformers with convolutional layers. Unlike recent diffusion models for motion, we use a translation-invariant method for encoding positional information, which generalises better to long sequences. We trained models for gesture generation, dance synthesis, and path-driven locomotion, all using the same hyperparameters, except for the number of training iterations.
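Conceptually, a Conformer-style layer combines self-attention over time with a convolution module in place of the Transformer's position-wise feedforward block. The following NumPy sketch is a toy, single-head illustration of that idea; the gating, normalisation, multi-head attention, and relative positional encoding of the actual model are omitted, and all weight names are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over the time axis.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def conv_module(x, kernels):
    # Depthwise 1-D convolution along time (one small kernel per channel),
    # standing in for the Transformer's position-wise feedforward block.
    t, _ = x.shape
    k = kernels.shape[0]
    padded = np.pad(x, ((k // 2, k // 2), (0, 0)))
    out = np.empty_like(x)
    for i in range(t):
        out[i] = np.einsum('kd,kd->d', padded[i:i + k], kernels)
    return out

def conformer_layer(x, w_q, w_k, w_v, kernels):
    # Residual attention sub-layer followed by a residual convolution module.
    x = x + self_attention(x, w_q, w_k, w_v)
    x = x + conv_module(x, kernels)
    return x
```

Because the convolution is applied identically at every time step, the layer itself is translation-invariant along the sequence.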

Our models can generate spontaneous gestures from speech

As well as diverse dances for various music genres

Style control for gesticulation

The gesture-generation model can successfully disentangle the stylistic expression in the gestures from the speech. Below we generate gestures in various styles from the same neutral-sounding speech excerpts from the ZeroEGGS dataset:

Speech excerpt #1

Speech excerpt #2

Style control for dancing

Similarly, the dance-generation model can synthesise dances in the selected style, even if the music is unchanged.

Music input #1

Music input #2

Stylised locomotion

In addition to the gesture- and dance-generation models, we also train a locomotion model on the 100STYLE dataset. In the videos below, we generate motion conditioned on the position of the root joint (i.e., the path and the velocity) and the style label.

Guided interpolation

Finally, we propose guided interpolation, an extension of the classifier-free guidance technique that allows us to blend styles. We demonstrate the idea with each model below.
NOTE: We fix the random seed in each of these videos so that only the difference in style is visible. Without fixing the seed, the model would synthesise distinct moves in each sample, owing to its probabilistic nature.
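Abstracting away the denoiser itself, the blending step can be sketched as a weighted sum of style-conditioned noise predictions around the unconditional one. This NumPy sketch shows the functional form only; the function name and weighting scheme are illustrative, not the paper's exact formulation:

```python
import numpy as np

def guided_interpolation(eps_uncond, eps_styles, gammas):
    """Blend several style-conditioned denoiser predictions.

    eps_uncond : unconditional noise prediction (array)
    eps_styles : list of noise predictions, one per style condition
    gammas     : per-style guidance weights; a single style with
                 gamma > 1 recovers ordinary classifier-free guidance
    """
    out = eps_uncond.copy()
    for eps_c, g in zip(eps_styles, gammas):
        # Each style pushes the prediction away from the unconditional
        # baseline in proportion to its weight.
        out = out + g * (eps_c - eps_uncond)
    return out
```

With weights that sum to one, the result interpolates between the styles; scaling all weights up together makes the blended style more pronounced, and using a single style reduces to standard classifier-free guidance.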




Citation

@article{alexanderson2022listen,
  title={Listen, denoise, action! Audio-driven motion synthesis with diffusion models},
  author={Alexanderson, Simon and Nagy, Rajmund and Beskow, Jonas and Henter, Gustav Eje},
  journal={arXiv preprint arXiv:2211.09707},
  year={2022}
}