Spontal

Multimodal database of spontaneous speech in dialog

Massively multimodal (HD video, hifi sound, and motion capture) database of spontaneous speech in dialog.

In everyday, face-to-face communicative interaction, we make use of rich repertoires of both vocal signals and facial gestures in addition to speech segments. While the auditory modality is generally seen as the primary carrier of phonetic information, the visual modality can also provide segmental cues to articulation. Moreover, visual cues such as head nods, gaze and eye movements, and the movement and shaping of the eyebrows can provide prosodic information such as signals for prominence and phrasing.
Such visual cues can also provide important information for communicative interaction, such as signals for turn-taking, giving or seeking feedback, and conveying emotions and attitudes. The goal of this project is the creation of a Swedish multimodal spontaneous speech database rich enough to capture important variation among speakers and speaking styles, and thereby of great value for current speech research. The database will make it possible to test hypotheses on the visual and verbal features employed in communicative behavior across a variety of functions. To increase our understanding of traditional prosodic functions such as prominence lending, grouping and phrasing, the database will enable researchers to study visual-acoustic interaction and timing differences over several subjects and dialog partners. Moreover, human communicative functions such as the signaling of turn-taking, feedback, attitudes and emotion can be studied from a multimodal, dialog perspective.
In Spontal, we capture two channels of HD video, four channels of hifi sound, and OptiTrack motion capture of the torso and head in 120 half-hour dialogues.

Group: Speech Communication and Technology

Staff:
David House (Project leader)
Jens Edlund
Jonas Beskow
Kahl Hellmer
Kjell Elenius
Sofia Strömbergsson

Funding: VR, KFI (VR2006-7482)

Duration: 2007-07-01 - 2010-12-31

Website: http://www.speech.kth.se/spontal/

KTH research database: http://researchprojects.kth.se/index.php/kb_1/io_9945/io.html

Keywords: massively multimodal corpora, audiovisual speech, dialogue, feedback, gesture, human communication, motion capture, prosody, interaction control

Related publications:

2011

Edlund, J. (2011). In search of the conversational homunculus - serving to understand spoken human face-to-face interaction. Doctoral dissertation, KTH. [abstract] [pdf]

Laskowski, K., Edlund, J., & Heldner, M. (2011). Incremental Learning and Forgetting in Stochastic Turn-Taking Models. In Proc. of Interspeech 2011 (pp. 2069-2072). Florence, Italy. [abstract] [pdf]

2010

Beskow, J., Edlund, J., Granström, B., Gustafson, J., & House, D. (2010). Face-to-face interaction and the KTH Cooking Show. In Esposito, A., Campbell, N., Vogel, C., Hussain, A., & Nijholt, A. (Eds.), Development of Multimodal Interfaces: Active Listening and Synchrony (pp. 157-168). Berlin / Heidelberg: Springer.

Beskow, J., Edlund, J., Gustafson, J., Heldner, M., Hjalmarsson, A., & House, D. (2010). Research focus: Interactional aspects of spoken face-to-face communication. In Proc. of Fonetik 2010 (pp. 7-10). Lund, Sweden. [abstract] [pdf]

Edlund, J., & Beskow, J. (2010). Capturing massively multimodal dialogues: affordable synchronization and visualization. In Kipp, M., Martin, J-C., Paggio, P., & Heylen, D. (Eds.), Proc. of Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (MMC 2010) (pp. 160-161). Valletta, Malta. [abstract] [pdf]

Edlund, J., Beskow, J., Elenius, K., Hellmer, K., Strömbergsson, S., & House, D. (2010). Spontal: a Swedish spontaneous dialogue corpus of audio, video and motion capture. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proc. of the Seventh conference on International Language Resources and Evaluation (LREC'10) (pp. 2992-2995). Valletta, Malta. [abstract] [pdf]

Oertel, C., Cummins, F., Campbell, N., Edlund, J., & Wagner, P. (2010). D64: A corpus of richly recorded conversational interaction. In Kipp, M., Martin, J-C., Paggio, P., & Heylen, D. (Eds.), Proceedings of LREC 2010 Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (pp. 27-30). Valletta, Malta. [abstract] [pdf]

Sikveland, R-O., Öttl, A., Amdal, I., Ernestus, M., Svendsen, T., & Edlund, J. (2010). Spontal-N: A Corpus of Interactional Spoken Norwegian. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proc. of the Seventh conference on International Language Resources and Evaluation (LREC'10) (pp. 2986-2991). Valletta, Malta. [abstract] [pdf]

2009

Beskow, J., Edlund, J., Elenius, K., Hellmer, K., House, D., & Strömbergsson, S. (2009). Project presentation: Spontal – multimodal database of spontaneous dialog. In Fonetik 2009 (pp. 190-193). Stockholm. [abstract] [pdf]

Edlund, J., Heldner, M., & Hirschberg, J. (2009). Pause and gap length in face-to-face interaction. In Proc. of Interspeech 2009. Brighton, UK. [abstract] [pdf]







Published by: TMH, Speech, Music and Hearing

Last updated: 2012-11-09