Multimodal speech production

Today's spoken dialogue systems are being considered for new domains such as social and collaborative applications, education and entertainment. These areas call for systems that are increasingly human-like in their conversational behaviour (Hjalmarsson, 2010). In human-human conversation, both parties continuously and simultaneously contribute to the interaction. Listeners contribute actively by providing feedback such as conversational grunts, head nods and eyebrow raises; this feedback signals attention, feelings and understanding, and serves to support the interaction. Speakers produce filled pauses to indicate hesitation and cognitive load, and they manage the conversation using turn-taking cues such as pitch movements, voice quality and eye-gaze movements. The speech group focuses on developing conversational speech synthesizers that can display these important social signals acoustically (SamSynt). Another research area is the development of a speech synthesizer that simulates different Swedish accents using a combination of prosodic rules and data-driven methods (Simulekt).
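The social signals described above, feedback tokens, filled pauses and turn-taking cues, can be thought of as an inventory that a conversational synthesizer draws on and realizes acoustically. As a purely hypothetical illustration (the token lists and function names below are assumptions, not taken from SamSynt), a minimal sketch:

```python
import random

# Hypothetical inventories of social signals a conversational
# synthesizer might realize acoustically (illustrative only).
FILLED_PAUSES = ["eh", "ehm"]
FEEDBACK_TOKENS = ["mhm", "yeah", "okay"]

def add_hesitation(utterance, hesitate=True):
    """Prefix an utterance with a filled pause to signal
    hesitation or cognitive load."""
    if not hesitate:
        return utterance
    return f"{random.choice(FILLED_PAUSES)}, {utterance}"

def backchannel():
    """Pick a short feedback token a listening agent could
    produce to signal attention and understanding."""
    return random.choice(FEEDBACK_TOKENS)
```

In a real system the choice and timing of these tokens would be driven by dialogue state and prosodic context rather than random selection.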

Research in this field also concerns articulatorily motivated, highly natural multimodal parametric synthesis for different voices and speaking styles, as well as the synthesis of non-verbal expressions. This includes a complete 3-D model of the face and speech organs that generates articulatory synthesis for use in embodied conversational agents (ECAs), together with the modelling of non-articulatory facial gestures typical of interactive speech.
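In parametric facial synthesis of this kind, articulatory movements (jaw, lips, tongue) and non-articulatory gestures (eyebrow raises, head nods) are typically controlled as separate parameter tracks and combined frame by frame. A minimal sketch under assumed parameter names (the names and additive blending rule are illustrative, not the actual model's):

```python
def blend_frames(articulation, gesture):
    """Combine an articulatory parameter frame with a
    non-articulatory gesture frame by additive blending;
    parameters absent from one track default to 0.0."""
    keys = set(articulation) | set(gesture)
    return {k: articulation.get(k, 0.0) + gesture.get(k, 0.0) for k in keys}

# Example: jaw opening from speech combined with an eyebrow raise.
speech_frame = {"jaw_open": 0.4, "lip_round": 0.1}
gesture_frame = {"brow_raise": 0.8}
frame = blend_frames(speech_frame, gesture_frame)
```

Keeping the two tracks separate lets the same spoken utterance be rendered with different expressive overlays, which is what allows non-verbal expressions to vary independently of articulation.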

Published by: TMH, Speech, Music and Hearing

Last updated: Friday, 28-Oct-2011 15:24:35 MEST