This page is no longer maintained. It has been superseded by a new page on Multimodal Speech Synthesis. Please update your links.
This text describes a system for audiovisual speech synthesis that has been developed at the Department of Speech, Music and Hearing, KTH. The system is based on the KTH rule-based text-to-speech synthesis system (Carlson, Granström and Hunnicutt, 1982), which has been extended with a real-time animated 3D model of a human face. The face is controlled from the same text-to-speech rule synthesis system that controls the auditory speech synthesizer. This provides a unified and flexible framework for the development of audiovisual text-to-speech synthesis, allowing the rules that control the two modalities to be written in the same format.

Figure 1: Wireframe and shaded representations of the face.
The face is based on a model developed by Parke (1982). This is a parameterized 3D model of a face with eyes, lips and teeth, made up of about 700 polygons. The model is controlled by 50 parameters, which can be divided into two groups: expression parameters, which control the articulation and facial expressions, and conformation parameters, which control static features of the face such as nose length and jaw width.
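To make the distinction concrete, the sketch below separates the two parameter groups into a time-varying set and a static set. The individual parameter names are invented for illustration and are not taken from the Parke model.

```c
#include <stdio.h>

/* Hypothetical split of the roughly 50 Parke-style parameters: expression
 * parameters change over time during speech, while conformation parameters
 * are set once to shape a particular face. Names are illustrative only. */
typedef struct {
    float jaw_rotation;
    float mouth_width;
    float eyebrow_raise;
    /* ... further time-varying parameters ... */
} ExpressionParams;

typedef struct {
    float nose_length;
    float jaw_width;
    float eye_separation;
    /* ... further static parameters ... */
} ConformationParams;

int main(void)
{
    ConformationParams shape = {1.1f, 0.95f, 1.0f};  /* fixed for one face   */
    ExpressionParams   expr  = {0.2f, 1.0f,  0.0f};  /* updated every frame  */
    printf("nose length %.2f, jaw rotation %.2f\n",
           shape.nose_length, expr.jaw_rotation);
    return 0;
}
```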
A model of a tongue has been created and added to the face. Modeling of the tongue was motivated by the fact that the tip of the tongue is responsible for almost all visually perceivable tongue actions in natural speech. The tongue can therefore be modeled in a fairly simple way, with focus on the movements of the tip, using 64 polygons under the control of four parameters: width, thickness, length and tip raise, where tip raise controls the vertical position of the tongue tip relative to the upper teeth.
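The sketch below illustrates how such a small parameter set could drive a simple tongue mesh. The vertex data, scale factors and function names are assumptions made for the example, not the actual implementation.

```c
#include <stdio.h>

/* Hypothetical tongue parameter set: width, thickness and length scale the
 * mesh, while tip_raise lifts the frontmost vertices towards the upper
 * teeth. Values are normalized to 0..1. */
typedef struct {
    float width;
    float thickness;
    float length;
    float tip_raise;
} TongueParams;

typedef struct { float x, y, z; } Vertex;

/* Apply the parameters to a copy of the neutral tongue mesh.
 * 'is_tip' marks the vertices that belong to the tongue tip. */
static void deform_tongue(const Vertex *neutral, const int *is_tip,
                          Vertex *out, int n, const TongueParams *p)
{
    for (int i = 0; i < n; i++) {
        out[i].x = neutral[i].x * p->width;      /* lateral scaling     */
        out[i].y = neutral[i].y * p->thickness;  /* vertical scaling    */
        out[i].z = neutral[i].z * p->length;     /* front-back scaling  */
        if (is_tip[i])
            out[i].y += p->tip_raise * 0.5f;     /* raise tip towards upper teeth */
    }
}

int main(void)
{
    Vertex neutral[3] = {{0.0f, 0.0f, 1.0f}, {0.2f, 0.1f, 0.9f}, {-0.2f, 0.1f, 0.9f}};
    int is_tip[3] = {1, 0, 0};
    Vertex out[3];
    TongueParams p = {1.0f, 1.0f, 0.9f, 0.6f};

    deform_tongue(neutral, is_tip, out, 3, &p);
    for (int i = 0; i < 3; i++)
        printf("v%d: %.2f %.2f %.2f\n", i, out[i].x, out[i].y, out[i].z);
    return 0;
}
```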
A set of new parameters for articulation control has been developed, including parameters for lip rounding, bilabial occlusion and labiodental occlusion. This allows lip movements to be controlled in a more speech-oriented way than was possible in the original Parke model.
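As a rough illustration of the idea, the following sketch maps such speech-oriented parameters onto a handful of hypothetical low-level lip parameters. The parameter names and weights are invented for the example and do not describe the actual rule set.

```c
#include <stdio.h>

/* Hypothetical high-level articulation parameters (0..1). */
typedef struct {
    float lip_rounding;
    float bilabial_occlusion;
    float labiodental_occlusion;
} Articulation;

/* Hypothetical low-level lip parameters in a Parke-style model. */
typedef struct {
    float mouth_width;
    float lip_protrusion;
    float jaw_rotation;
    float lower_lip_tuck;   /* lower lip drawn towards the upper teeth */
} LipParams;

/* One speech-oriented parameter drives several low-level lip parameters
 * at once; the weights here are arbitrary illustration values. */
static void map_articulation(const Articulation *a, LipParams *lp)
{
    lp->mouth_width    = 1.0f - 0.6f * a->lip_rounding;
    lp->lip_protrusion = 0.8f * a->lip_rounding;
    lp->jaw_rotation   = 0.3f * (1.0f - a->bilabial_occlusion);
    lp->lower_lip_tuck = a->labiodental_occlusion;
}

int main(void)
{
    Articulation o = {0.9f, 0.0f, 0.0f};   /* a rounded vowel such as /o/ */
    LipParams lp;
    map_articulation(&o, &lp);
    printf("width %.2f protrusion %.2f jaw %.2f tuck %.2f\n",
           lp.mouth_width, lp.lip_protrusion, lp.jaw_rotation, lp.lower_lip_tuck);
    return 0;
}
```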
The face is rendered using Gouraud shading, which gives the polygonal surface a smooth appearance. The implementation uses a standardized library of 3D graphics routines called PEXlib, which makes it portable to most UNIX environments.
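Gouraud shading interpolates shading values computed at the vertices across each polygon, so a smooth appearance requires per-vertex normals obtained by averaging the normals of the surrounding polygons. The sketch below shows that averaging step for a triangle list; it is a generic illustration of the technique and omits the PEXlib-specific rendering calls.

```c
#include <stdio.h>
#include <math.h>

typedef struct { float x, y, z; } Vec3;

static Vec3 sub(Vec3 a, Vec3 b) { Vec3 r = {a.x - b.x, a.y - b.y, a.z - b.z}; return r; }
static Vec3 cross(Vec3 a, Vec3 b) {
    Vec3 r = {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x};
    return r;
}
static Vec3 normalize(Vec3 v) {
    float len = sqrtf(v.x*v.x + v.y*v.y + v.z*v.z);
    if (len > 0.0f) { v.x /= len; v.y /= len; v.z /= len; }
    return v;
}

/* Accumulate each triangle's face normal into its three vertices, then
 * normalize: these averaged vertex normals are what a Gouraud-shaded
 * renderer interpolates across the polygons. */
static void vertex_normals(const Vec3 *verts, const int (*tris)[3],
                           int n_tris, Vec3 *normals, int n_verts)
{
    for (int v = 0; v < n_verts; v++) { Vec3 zero = {0, 0, 0}; normals[v] = zero; }
    for (int t = 0; t < n_tris; t++) {
        Vec3 a = verts[tris[t][0]], b = verts[tris[t][1]], c = verts[tris[t][2]];
        Vec3 fn = cross(sub(b, a), sub(c, a));
        for (int k = 0; k < 3; k++) {
            int v = tris[t][k];
            normals[v].x += fn.x; normals[v].y += fn.y; normals[v].z += fn.z;
        }
    }
    for (int v = 0; v < n_verts; v++) normals[v] = normalize(normals[v]);
}

int main(void)
{
    Vec3 verts[4] = {{0,0,0}, {1,0,0}, {0,1,0}, {0,0,1}};
    int tris[2][3] = {{0, 1, 2}, {0, 1, 3}};
    Vec3 normals[4];
    vertex_normals(verts, tris, 2, normals, 4);
    for (int v = 0; v < 4; v++)
        printf("n%d: %.2f %.2f %.2f\n", v, normals[v].x, normals[v].y, normals[v].z);
    return 0;
}
```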
The text-to-audiovisual-speech synthesis system is built from three modules:

- RULSYS parses the text according to the rules and creates a multi-channel data file that is used to control the two synthesizers (presently, 40 parameters control the GLOVE synthesizer and 10 parameters control the face).
- The GLOVE synthesizer module creates an audio sample file based on the parameters in the data file.
- The face synthesis module then plays back the audio file and animates the face in real time according to the values in the data file, in synchrony with the audio playback.
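The sketch below illustrates how the face synthesis module might use the multi-channel data: face parameter values are interpolated between stored frames according to the elapsed audio playback time. The frame rate, file layout and function names are assumptions made for the example, not a description of the actual file format.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical layout of one frame in the multi-channel data file:
 * 40 values for the GLOVE audio synthesizer followed by 10 for the face.
 * The 100-frames-per-second rate is an assumption for the example. */
#define N_GLOVE  40
#define N_FACE   10
#define FRAME_HZ 100.0

typedef struct {
    double glove[N_GLOVE];
    double face[N_FACE];
} Frame;

/* Pick the face parameters for the current audio playback time by linear
 * interpolation between the two surrounding frames. */
static void face_at_time(const Frame *frames, int n_frames,
                         double t_seconds, double *face_out)
{
    double pos = t_seconds * FRAME_HZ;
    int i = (int)pos;
    if (i < 0) i = 0;
    if (i > n_frames - 2) i = n_frames - 2;
    double frac = pos - i;
    if (frac < 0.0) frac = 0.0;
    if (frac > 1.0) frac = 1.0;
    for (int k = 0; k < N_FACE; k++)
        face_out[k] = (1.0 - frac) * frames[i].face[k]
                    + frac * frames[i + 1].face[k];
}

int main(void)
{
    /* Two dummy frames standing in for a parsed data file. */
    Frame frames[2];
    memset(frames, 0, sizeof frames);
    frames[0].face[0] = 0.0;   /* e.g. jaw opening at frame 0 */
    frames[1].face[0] = 1.0;   /* fully open one frame later  */

    double face[N_FACE];
    face_at_time(frames, 2, 0.005, face);   /* halfway between the frames */
    printf("interpolated jaw opening: %.2f\n", face[0]);
    return 0;
}
```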
References

Beskow, J. (1995): "Rule-based visual speech synthesis", Proceedings of Eurospeech '95, Madrid, Spain, pp. 299-302.
Carlson, R., Granström, B., and Hunnicutt, S. (1982): "A multi-language text-to-speech module", Proceedings of ICASSP-82, Paris, France, Vol. 3, pp. 1604-1607.
Carlson, R., Granström, B., and Karlsson, I. (1991): "Experiments with voice modelling in speech synthesis", Speech Communication, 10, pp. 481-489.
Le Goff, B. (1993): "Commandes paramétriques d'un modèle de visage 3D pour animation en temps réel", Master's thesis, Institut National Polytechnique, Grenoble, France.
McGurk, H., and MacDonald, J. (1976): "Hearing lips and seeing voices", Nature, 264, pp. 746-748.
Parke, F. I. (1982): "Parameterized models for facial animation", IEEE Computer Graphics and Applications, 2(9), pp. 61-68.
Here is an MPEG-encoded movie of the speaking face. (1.88 MB)
Note: The above file is a so-called MPEG system file, containing multiplexed audio and video. Use this if your MPEG player can handle multiplexed AV; otherwise, consider the video-only version. (1.73 MB)
More information about computer facial animation and visual speech synthesis can be found at U.C. Santa Cruz.
Department of Speech, Music and Hearing