

Seminar at Speech, Music and Hearing:

A preliminary analysis of prosodic features for a predictive model of facial movements in speech visualisation

Angelika Hönemann, Beuth University of Technology, Berlin


This study investigates the relationship between prosodic speech features, such as syllable prominence, and visual cues, such as head and facial movements. To this end we created a corpus of audiovisual recordings annotated with respect to acoustic and visual features. On the basis of this dataset we conducted an analysis of the relationship between acoustic and visual prosodic features. The insights gained from the study provide the basis for a predictive model that generates visual cues from speech signals. Such predictive models have many interesting applications, for example the control of non-verbal movements for the visualisation of voice messages by an avatar.

Our dataset consists of synchronously recorded audio and video signals, as well as motion capture data, from seven speakers who were asked to recount their last vacation in about three minutes. Because the narrative was free, the speakers behaved in a natural way, which makes an investigation of natural facial expressions possible. On the downside, the material produced is unrestricted, so direct comparisons between utterances are impossible. The stories offer wide prosodic variety, as they contain sentences of different lengths, pauses, hesitations, etc. The 3D data were recorded by an optical method, using a QUALISYS motion capture system: three infrared cameras captured 43 passive markers attached to the head and face of the speaker. In addition to the motion capture, we recorded synchronised digital video and audio streams.

As a first step, the dataset was used to conduct an empirical analysis of the speakers' movements and facial expressions. Regions of interest such as the mouth, eyes and eyebrows, as well as head movements and emotional expressions of happiness, anger, surprise, etc., were annotated in the video sequences. The acoustic data were segmented at the syllable level and annotated with phrases, phrase boundaries and prominent syllables.
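The kind of analysis described above, relating syllable-level prosodic annotations to marker trajectories, can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the marker choice, sampling rate, and annotation format are all assumptions made for the example.

```python
# Hypothetical sketch: relate syllable-level prominence annotations to a
# motion-capture marker trajectory (e.g. an eyebrow marker's vertical
# position). The frame rate and data layout are assumptions, not details
# taken from the study.

FRAME_RATE = 100.0  # assumed mocap frames per second


def mean_marker_height(trajectory, start_s, end_s, frame_rate=FRAME_RATE):
    """Average vertical marker position over a time interval in seconds."""
    i0 = int(start_s * frame_rate)
    i1 = max(i0 + 1, int(end_s * frame_rate))
    window = trajectory[i0:i1]
    return sum(window) / len(window)


def prominence_contrast(trajectory, syllables, frame_rate=FRAME_RATE):
    """Difference in mean marker height between prominent and
    non-prominent syllables; syllables are (start_s, end_s, is_prominent)."""
    prominent, other = [], []
    for start_s, end_s, is_prominent in syllables:
        h = mean_marker_height(trajectory, start_s, end_s, frame_rate)
        (prominent if is_prominent else other).append(h)
    return sum(prominent) / len(prominent) - sum(other) / len(other)


# Synthetic 2-second trace: the marker rises by 0.2 units during 0.5-1.0 s,
# which coincides with the one syllable annotated as prominent.
traj = [1.0 + (0.2 if 50 <= i < 100 else 0.0) for i in range(200)]
sylls = [(0.0, 0.5, False), (0.5, 1.0, True), (1.0, 2.0, False)]
print(round(prominence_contrast(traj, sylls), 3))
```

A positive contrast on real data would suggest that the chosen marker tends to be raised during prominent syllables; per-speaker normalisation and statistical testing would of course be needed before drawing such a conclusion.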

15:15 - 17:00
Monday August 20, 2012

The seminar is held in Fantum.


Published by: TMH, Speech, Music and Hearing

Last updated: Wednesday, 23-Jun-2010 09:22:46 MEST