Listen, denoise, action! Audio-driven motion synthesis with diffusion models



We present diffusion models for audio-driven motion synthesis, based on the Conformer architecture. We demonstrate that our model can generate full-body gestures and dancing with top-of-the-line motion quality, with distinctive styles whose expression can be made more or less pronounced. Finally, we generalise the classifier-free guidance procedure to perform style interpolation, in a manner that has connections to product-of-experts models.

Our denoising diffusion model consists of a stack of Conformer layers, which replace the feedforward networks in Transformers with convolutional layers. Unlike recent diffusion models for motion, we use a translation-invariant method for encoding positional information, which generalises better to long sequences. We trained models for gesture generation, dance synthesis, and path-driven locomotion, all using the same hyperparameters, except for the number of training iterations.
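Conceptually, a Conformer-style layer combines self-attention over time with a convolution module in place of the Transformer's position-wise feedforward block. The following NumPy sketch is a toy, single-head illustration of that idea; the gating, normalisation, multi-head attention, and relative positional encoding of the actual model are omitted, and all weight names are hypothetical:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product self-attention over the time axis.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def conv_module(x, kernels):
    # Depthwise 1-D convolution along time (one small kernel per channel),
    # standing in for the Transformer's position-wise feedforward block.
    t, _ = x.shape
    k = kernels.shape[0]
    padded = np.pad(x, ((k // 2, k // 2), (0, 0)))
    out = np.empty_like(x)
    for i in range(t):
        out[i] = np.einsum('kd,kd->d', padded[i:i + k], kernels)
    return out

def conformer_layer(x, w_q, w_k, w_v, kernels):
    # Residual attention sub-layer followed by a residual convolution module.
    x = x + self_attention(x, w_q, w_k, w_v)
    x = x + conv_module(x, kernels)
    return x
```

Because the convolution is applied identically at every time step, the layer itself is translation-invariant along the sequence.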

Our models can generate spontaneous gestures from speech

As well as diverse dances for various music genres

Style control for gesticulation

The gesture-generation model can successfully disentangle the stylistic expression in the gestures from the speech. Below we generate gestures in various styles from the same neutral-sounding speech excerpts from the ZeroEGGS dataset:

Speech excerpt #1

Speech excerpt #2

Style control for dancing

Similarly, the dance-generation model can synthesise dances in the selected style, even if the music is unchanged.

Music input #1

Music input #2

Stylised locomotion

In addition to the gesture- and dance-generation models, we also train a locomotion model on the 100STYLE dataset. In the videos below, we generate motion conditioned on the position of the root joint (i.e., the path and the velocity) and the style label.

Guided interpolation

Finally, we propose guided interpolation, an extension of the classifier-free guidance technique that allows us to blend styles. We demonstrate the idea with each model below.
NOTE: We fix the random seed in each of these videos so that only the difference in style is visible. Without fixing the seed, the model would synthesise distinct moves in each sample, owing to its probabilistic nature.
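Abstracting away the denoiser itself, the blending step can be sketched as a weighted sum of style-conditioned noise predictions around the unconditional one. This NumPy sketch shows the functional form only; the function name and weighting scheme are illustrative, not the paper's exact formulation:

```python
import numpy as np

def guided_interpolation(eps_uncond, eps_styles, gammas):
    """Blend several style-conditioned denoiser predictions.

    eps_uncond : unconditional noise prediction (array)
    eps_styles : list of noise predictions, one per style condition
    gammas     : per-style guidance weights; a single style with
                 gamma > 1 recovers ordinary classifier-free guidance
    """
    out = eps_uncond.copy()
    for eps_c, g in zip(eps_styles, gammas):
        # Each style pushes the prediction away from the unconditional
        # baseline in proportion to its weight.
        out = out + g * (eps_c - eps_uncond)
    return out
```

With weights that sum to one, the result interpolates between the styles; scaling all weights up together makes the blended style more pronounced, and using a single style reduces to standard classifier-free guidance.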




Citation

@article{alexanderson2022listen,
  title={Listen, denoise, action! Audio-driven motion synthesis with diffusion models},
  author={Alexanderson, Simon and Nagy, Rajmund and Beskow, Jonas and Henter, Gustav Eje},
  journal={arXiv preprint arXiv:2211.09707},
  year={2022}
}