Seminar at Speech, Music and Hearing:
A preliminary analysis of prosodic features for a predictive model of facial movements in speech visualisation
Angelika Hönemann, Beuth University of Technology, Berlin
Abstract
This study investigates the relationship between several prosodic speech features, such as
syllable prominence, and visual cues, such as head and facial movements. To this end, we
created a corpus of audiovisual recordings that were annotated with respect to acoustic and
visual features. On the basis of this dataset we conducted an analysis to investigate the
relationship between acoustic and visual prosodic features.
The insights gained from the study provide the basis for a predictive model that generates
visual cues from speech signals. Such predictive models have many interesting applications,
for example controlling the non-verbal movements of an avatar that visualizes voice
messages.
Our dataset consists of synchronously recorded audio and video signals, as well as motion
capture data, from seven speakers who were asked to recount their last vacation in about
three minutes. Because the narratives were free, the speakers behaved naturally, which makes
an investigation of natural facial expressions possible. On the downside, the materials
produced are unrestricted, so direct comparisons between utterances are impossible. The
stories offer wide prosodic variety, as they contain sentences of different lengths, pauses,
hesitations, etc.
The 3D data were recorded by means of an optical method, using a QUALISYS motion
capture system. Three infrared cameras captured 43 passive markers that were attached to the
head and face of the speaker. In addition to the motion capture, we recorded synchronized
digital video and audio streams.
As a first step, the dataset was used to conduct an empirical analysis of the speakers'
movements and facial expressions. Regions of interest such as the mouth, eyes, and eyebrows,
as well as head movements and emotional expressions of happiness, anger, surprise, etc., were
annotated in the video sequences. The acoustic data were segmented at the syllable level and
annotated with phrases, phrase boundaries, and prominent syllables.
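The kind of acoustic-visual comparison described above can be illustrated with a small sketch: given a marker trajectory and syllable intervals labelled for prominence, one can compare average marker displacement during prominent versus non-prominent syllables. The frame rate, data layout, and function names here are hypothetical stand-ins, not the actual analysis pipeline used in the study.

```python
# Hypothetical sketch: relate annotated syllable prominence to the movement of
# one motion-capture marker. A track is a list of (x, y, z) positions per
# frame; syllables are (start_s, end_s, is_prominent) tuples. All data shapes
# are illustrative assumptions, not the study's real format.
from statistics import mean

def displacement(track, t0, t1, fps=100):
    """Mean frame-to-frame 3D displacement of a marker within [t0, t1) seconds."""
    i0, i1 = int(t0 * fps), int(t1 * fps)
    steps = [
        sum((a - b) ** 2 for a, b in zip(track[i], track[i + 1])) ** 0.5
        for i in range(i0, min(i1, len(track) - 1))
    ]
    return mean(steps) if steps else 0.0

def compare_prominence(track, syllables, fps=100):
    """Average displacement over prominent vs. non-prominent syllables."""
    prom = [displacement(track, t0, t1, fps) for t0, t1, p in syllables if p]
    rest = [displacement(track, t0, t1, fps) for t0, t1, p in syllables if not p]
    return mean(prom), mean(rest)

# Toy usage: a marker moving fast during the first syllable, slowly afterwards.
track = [(0.0, 0.0, z) for z in (0, 2, 4, 6, 8, 10, 10.5, 11, 11.5, 12)]
syllables = [(0.00, 0.05, True), (0.05, 0.09, False)]
prominent_disp, other_disp = compare_prominence(track, syllables)
```

A real analysis would of course aggregate over all markers, speakers, and syllables, and test the difference statistically rather than compare two means.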
15:15 - 17:00
Monday August 20, 2012
The seminar is held in Fantum.