In everyday face-to-face interaction, we make use of a rich repertoire of vocal signals and facial gestures in addition to speech segments. While the auditory modality is generally seen as primary for phonetic information, the visual modality can also provide segmental cues to articulation. Moreover, visual cues such as head nods, gaze and eye movements, and the movement and shaping of the eyebrows can provide prosodic information, for example signals for prominence and phrasing.
Such visual cues can also signal important aspects of communicative interaction, such as turn-taking, the giving or seeking of feedback, and emotions and attitudes. The goal of this project is the creation of a Swedish multimodal spontaneous-speech database rich enough to capture important variation among speakers and speaking styles, making it of great value for current speech research. The database will make it possible to test hypotheses on the visual and verbal features employed in communicative behavior across a variety of functions. To increase our understanding of traditional prosodic functions such as prominence lending and grouping and phrasing, the database will enable researchers to study visual-acoustic interaction and timing differences across many subjects and dialogue partners. Moreover, human communicative functions such as the signaling of turn-taking, feedback, attitudes and emotion can be studied from a multimodal, dialogue perspective.
In Spontal, we capture two channels of HD video, four channels of high-fidelity audio, and OptiTrack motion capture of the torso and head in 120 half-hour dialogues.
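As a rough illustration of the scale implied by the figures above (120 dialogues of 30 minutes each, with two video channels and four audio channels per dialogue), the following back-of-the-envelope sketch in plain Python totals up the recorded material per modality; it uses only the numbers stated in this description, and the variable names are illustrative:

```python
# Back-of-the-envelope estimate of the Spontal corpus scale, using only
# the figures stated above: 120 dialogues x 30 minutes, with 2 HD video
# channels and 4 audio channels captured per dialogue.

N_DIALOGUES = 120
MINUTES_PER_DIALOGUE = 30
VIDEO_CHANNELS = 2
AUDIO_CHANNELS = 4

total_dialogue_hours = N_DIALOGUES * MINUTES_PER_DIALOGUE / 60
total_video_hours = total_dialogue_hours * VIDEO_CHANNELS
total_audio_hours = total_dialogue_hours * AUDIO_CHANNELS

print(f"Dialogue time:  {total_dialogue_hours:.0f} h")  # 60 h
print(f"Video footage:  {total_video_hours:.0f} h")     # 120 h
print(f"Audio material: {total_audio_hours:.0f} h")     # 240 h
```

That is, the corpus comprises 60 hours of dialogue, yielding 120 hours of video and 240 hours of audio, in addition to the motion-capture streams.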
Group: Speech Communication and Technology
David House (Project leader)
Funding: VR, KFI (VR2006-7482)
Duration: 2007-07-01 - 2010-12-31
KTH research database: http://researchprojects.kth.se/index.php/kb_1/io_9945/io.html
Keywords: massively multimodal corpora, audiovisual speech, dialogue, feedback, gesture, human communication, motion capture, prosody, interaction control
In search of the conversational homunculus - serving to understand spoken human face-to-face interaction. Doctoral dissertation, KTH. [abstract] [pdf]
(2011). Incremental Learning and Forgetting in Stochastic Turn-Taking Models. In Proc. of Interspeech 2011 (pp. 2069-2072). Florence, Italy. [abstract] [pdf]
(2010). Face-to-face interaction and the KTH Cooking Show. In Esposito, A., Campbell, N., Vogel, C., Hussain, A., & Nijholt, A. (Eds.), Development of Multimodal Interfaces: Active Listening and Synchrony (pp. 157-168). Berlin / Heidelberg: Springer.
(2010). Research focus: Interactional aspects of spoken face-to-face communication. In Proc. of Fonetik 2010 (pp. 7-10). Lund, Sweden. [abstract] [pdf]
(2010). Capturing massively multimodal dialogues: affordable synchronization and visualization. In Kipp, M., Martin, J-C., Paggio, P., & Heylen, D. (Eds.), Proc. of Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (MMC 2010) (pp. 160-161). Valletta, Malta. [abstract] [pdf]
(2010). Spontal: a Swedish spontaneous dialogue corpus of audio, video and motion capture. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proc. of the Seventh conference on International Language Resources and Evaluation (LREC'10) (pp. 2992-2995). Valletta, Malta. [abstract] [pdf]
(2010). D64: A corpus of richly recorded conversational interaction. In Kipp, M., Martin, J-C., Paggio, P., & Heylen, D. (Eds.), Proceedings of LREC 2010 Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (pp. 27-30). Valletta, Malta. [abstract] [pdf]
(2010). Spontal-N: A Corpus of Interactional Spoken Norwegian. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proc. of the Seventh conference on International Language Resources and Evaluation (LREC'10) (pp. 2986-2991). Valletta, Malta. [abstract] [pdf]
(2009). Project presentation: Spontal – multimodal database of spontaneous dialog. In Fonetik 2009 (pp. 190-193). Stockholm. [abstract] [pdf]
(2009). Pause and gap length in face-to-face interaction. In Proc. of Interspeech 2009. Brighton, UK. [abstract] [pdf]