• Latest news from the multimodal speech technology group (updated September 24, 2001).
  • Video demos from the multimodal speech technology group.


    Click to download video sequence with sound
    August (1998)

    Kattis (2000)

    Urban (1999)

    Per (1998)

    Olga (1996)

    Holger (1995)

    Tongue & teeth

    Multimodal speech synthesis, or audio-visual speech synthesis, deals with the automatic generation of voice and facial animation from arbitrary text. Applications range from research on human communication and perception, via tools for the hearing impaired, to spoken and multimodal agent-based user interfaces. A view of the face can significantly improve the intelligibility of both natural and synthetic speech, especially under degraded acoustic conditions. Moreover, facial expressions can signal emotion, add emphasis to the speech and support the interaction in a dialogue situation.


    Our approach to audio-visual speech synthesis is based on parametric descriptions of both the acoustic and visual speech modalities, in a text-to-speech framework. The visual speech synthesis uses 3D polygon models that are parametrically articulated and deformed. We have a flexible architecture that allows us to create new characters either by adopting a static wireframe model and specifying the required deformation parameters for that model, or by sculpting and reshaping an already parameterized model. A few of the talking heads we have created to date can be seen on the left. We are currently working on improving dynamic articulation modelling, using movement data recorded with an optical tracking system.
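    The parametric deformation described above can be sketched as a linear deformation model: each articulation parameter (jaw opening, lip rounding, etc.) owns a displacement field over the mesh vertices, and the animated mesh is the neutral mesh plus the parameter-weighted sum of those fields. This is a minimal illustrative sketch, not the group's actual implementation; all names and data here are hypothetical.

```python
import numpy as np

def deform(base_vertices, displacement_modes, params):
    """Apply a linear parametric deformation to a polygon mesh.

    base_vertices:      (V, 3) array of resting vertex positions
    displacement_modes: (P, V, 3) array, one displacement field per
                        articulation parameter (e.g. jaw opening)
    params:             (P,) array of parameter values, typically in [0, 1]
    """
    # Each parameter scales its displacement field; the scaled fields
    # are summed over parameters and added to the neutral mesh.
    return base_vertices + np.einsum("p,pvc->vc", params, displacement_modes)

# Toy example: a 2-vertex "mesh" with one parameter that lowers the jaw.
base = np.zeros((2, 3))
jaw_open = np.array([[[0.0, -1.0, 0.0],    # jaw vertex moves down
                      [0.0,  0.0, 0.0]]])  # skull vertex stays fixed
print(deform(base, jaw_open, np.array([0.5])))
```

    Half jaw opening (0.5) moves the jaw vertex halfway along its displacement field; new characters reuse the same parameter set with mesh-specific fields.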


    We are investigating possible uses of audio-visual speech synthesis as a tool for the hard of hearing. This is done in the Synface and Teleface projects. Within the Teleface project we have carried out extensive audio-visual intelligibility tests, in which synthetic and natural voices and faces are presented at varying signal-to-noise ratios to both hearing-impaired and normal-hearing subjects. The results demonstrate the potential value of a communication aid based on multimodal synthesis technology.
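    Intelligibility tests of this kind are typically scored as the percentage of keywords correctly reported per listening condition. The sketch below shows that scoring step only; the condition labels and response data are invented for illustration and are not results from the Teleface tests.

```python
def intelligibility(responses):
    """Percent of keywords correctly reported, per listening condition.

    responses maps a condition label to a list of
    (presented, reported) keyword pairs.
    """
    return {cond: 100.0 * sum(p == r for p, r in pairs) / len(pairs)
            for cond, pairs in responses.items()}

# Hypothetical responses from one subject in two conditions.
data = {
    "audio only, 0 dB SNR":   [("boat", "goat"), ("pine", "pine")],
    "audio + face, 0 dB SNR": [("boat", "boat"), ("pine", "pine")],
}
print(intelligibility(data))
```

    Comparing such scores between audio-only and audio-plus-face conditions at the same signal-to-noise ratio quantifies the intelligibility benefit contributed by the face.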


    Our research is also concerned with interactive aspects of visual speech communication, such as generation of believable facial expressions and gestures for animated talking agents in multimodal spoken dialogue systems. This research is currently being carried out in the framework of the AdApt system. The agent has an associated library of gestures representing communicative functions that can be used in the dialogue. Actions are triggered by the state of the agent in such a way that appropriate gestures are automatically selected when the agent enters, exits or remains in a particular state (examples of states are speaking and attending). State transitions and gestures are controlled by the dialogue manager and are communicated using XML markup.
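    The state-driven gesture selection described above can be sketched as a lookup from (state, event) pairs to candidate gestures, with one candidate picked per trigger. In the real AdApt system the dialogue manager drives this via XML markup; the table, state names and gesture names below are purely illustrative assumptions.

```python
import random

# Hypothetical gesture library: (state, event) -> candidate gestures.
GESTURES = {
    ("attending", "enter"):  ["raise_eyebrows", "turn_to_user"],
    ("speaking",  "enter"):  ["small_nod"],
    ("speaking",  "remain"): ["eyebrow_emphasis", "head_tilt"],
    ("attending", "exit"):   ["look_away"],
}

def select_gesture(state, event, rng=random):
    """Pick a gesture for a state transition; None if nothing is defined.

    Choosing randomly among candidates keeps repeated visits to the
    same state from looking mechanical.
    """
    candidates = GESTURES.get((state, event))
    return rng.choice(candidates) if candidates else None

print(select_gesture("speaking", "enter"))  # -> small_nod
```

    Because selection is keyed on the agent's state rather than scripted per utterance, new communicative functions can be added by extending the table without touching the dialogue logic.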

    Synface. Synthesized talking face derived from speech for hearing disabled users of voice channels

    Teleface. Multimodal Speech Communication for the Hearing Impaired

    The 3D Vocal Tract Project. A three-dimensional vocal tract model for articulatory and visual speech synthesis.

    PER. The doorkeeper Per is the user interface for a speaker verification demonstrator at CTT.

    AdApt. The continuation of the August project.

    WaveSurfer/SpeechSurfer. Speech toolkit development.

    Previous projects

    List of publications from Centre for Speech Technology (CTT)
    Jonas Beskow

    Olov Engwall

    Giampiero Salvi

    Magnus Nordstrand

    Loredana Cerrato

    David House

    Björn Granström

    Computer conversation? Check out this video from the August dialogue system. This is a video playback of a real interaction between August and a user (mpg format, 1.44 MB) or (mpg format, 12 MB) (both in Swedish only). More August-related videos can be found on the August Homepage.

    For more video clips featuring our talking heads, please see our video page.


  • The new EU project SynFace starts October 1, 2001. The goal is to develop a multilingual speech-driven synthetic face that provides essential visual speech information for hearing-impaired telephone users. Project partners are KTH (Sweden), University College London (UK), Babel-Infovox (Sweden), Instituut voor Doven (NL) and Royal National Institute for Deaf People (UK). (September 24, 2001)
  • New masters thesis projects announced! If you're a student looking for an interesting topic, make sure to check out the X-jobs list. (November 1, 2000)
  • Now everybody can interact with our agents at the FrittFr@m exhibition at Tekniska Museet in Stockholm! The initial version of the demonstrator features five agents, each with their own area of expertise: Fritte (talks about FrittFr@m), Urban (the AdApt domain), Kattis (KTH trivia), August (Stockholm trivia and Strindberg quotes) and Holger2000 (spiced-up Holger presenting TMH/CTT research and doing NileCity impersonations) (May 24,  2000)
  • New webpage about our 3D vocal tract model for articulatory and visual speech synthesis! (February 24, 2000)
  • Release of WaveSurfer, a tool suited for a wide range of tasks in speech research and education. Now available for download. (January 19, 2000)
  • The video page has been updated with some fun demos from past Christmas parties at the speech lab. (January 13, 2000)
  • Tobias Öhman presented his Licentiate Thesis: "Vision in Speech Technology. Automatic measurements of visual speech and audio-visual intelligibility of synthetic and natural faces." (January 11, 2000)
  • A female talking head is born!!! Meet her at the video page (December 17, 1999)
  • Homage to prof. em. Gunnar Fant on his 80th birthday (October 8, 1999)
  • The video of the Mr Smoketoomuch dialogue is now available for download! (September 1999)
  • At the MiLaSS Emotions webpage you can see how the students at the MiLaSS summer school defined six basic emotions (and a few not-so-basic ones as well). (September 1999)
  • Birth of our two new Talking Heads, named Gustav and Sven. They were created to run smoothly under Windows NT, and they are our first heads capable of speaking English. The students at the European summer school MiLaSS got the opportunity to work and play with Gustav and Sven during three hands-on sessions of the summer school. Check here later for pictures, videos and more info.
  • Old news
    Facial animation page at UC Santa Cruz, USA

    Talking Heads website hosted by Haskins Laboratories, New Haven, CT, USA.

    Back up to

    Department of Speech, Music and Hearing, KTH 
    Centre for Speech Technology, KTH
    (While we're at it, here's a link to the legendary music group Talking Heads)

    Last updated by Jonas Beskow