This page is no longer maintained. It has been superseded by a new page on Multimodal Speech Synthesis. Please update your links.
This text describes a system for audiovisual speech synthesis that has been developed at the Department of Speech, Music and Hearing, KTH. The system is based on the KTH rule-based text-to-speech synthesis system (Carlson, Granström and Hunnicutt, 1982), which has been extended with a real-time animated 3D model of a human face. The face is controlled from the same text-to-speech rule synthesis system that controls the auditory speech synthesizer. This provides a unified and flexible framework for the development of audiovisual text-to-speech synthesis, allowing the rules that control the two modalities to be written in the same format.

Figure 1: Wireframe and shaded representations of the face.
The face is based on a model developed by Parke (1982). This is a parameterized 3D model of a face with eyes, lips and teeth, made up of about 700 polygons. The model is controlled by 50 parameters, which can be divided into two groups: expression parameters, which control the articulation and facial expressions, and conformation parameters, which control static features of the face such as nose length and jaw width.
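To make the distinction concrete, the sketch below separates the two parameter groups into a time-varying set and a static set. The individual parameter names are invented for illustration and are not taken from the Parke model.

```c
#include <stdio.h>

/* Hypothetical split of the roughly 50 Parke-style parameters: expression
 * parameters change over time during speech, while conformation parameters
 * are set once to shape a particular face. Names are illustrative only. */
typedef struct {
    float jaw_rotation;
    float mouth_width;
    float eyebrow_raise;
    /* ... further time-varying parameters ... */
} ExpressionParams;

typedef struct {
    float nose_length;
    float jaw_width;
    float eye_separation;
    /* ... further static parameters ... */
} ConformationParams;

int main(void)
{
    ConformationParams shape = {1.1f, 0.95f, 1.0f};  /* fixed for one face   */
    ExpressionParams   expr  = {0.2f, 1.0f,  0.0f};  /* updated every frame  */
    printf("nose length %.2f, jaw rotation %.2f\n",
           shape.nose_length, expr.jaw_rotation);
    return 0;
}
```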
A model of a tongue has been created and added to the face. Modeling of the tongue was motivated by the fact that the tip of the tongue is responsible for almost all visually perceivable tongue actions in natural speech. The tongue can therefore be modeled in a fairly simple way, with focus on the movements of the tip, using 64 polygons under the control of four parameters: width, thickness, length and tip raise, where tip raise controls the vertical position of the tongue tip relative to the upper teeth.
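The sketch below illustrates how such a small parameter set could drive a simple tongue mesh. The vertex data, scale factors and function names are assumptions made for the example, not the actual implementation.

```c
#include <stdio.h>

/* Hypothetical tongue parameter set: width, thickness and length scale the
 * mesh, while tip_raise lifts the frontmost vertices towards the upper
 * teeth. Values are normalized to 0..1. */
typedef struct {
    float width;
    float thickness;
    float length;
    float tip_raise;
} TongueParams;

typedef struct { float x, y, z; } Vertex;

/* Apply the parameters to a copy of the neutral tongue mesh.
 * 'is_tip' marks the vertices that belong to the tongue tip. */
static void deform_tongue(const Vertex *neutral, const int *is_tip,
                          Vertex *out, int n, const TongueParams *p)
{
    for (int i = 0; i < n; i++) {
        out[i].x = neutral[i].x * p->width;      /* lateral scaling     */
        out[i].y = neutral[i].y * p->thickness;  /* vertical scaling    */
        out[i].z = neutral[i].z * p->length;     /* front-back scaling  */
        if (is_tip[i])
            out[i].y += p->tip_raise * 0.5f;     /* raise tip towards upper teeth */
    }
}

int main(void)
{
    Vertex neutral[3] = {{0.0f, 0.0f, 1.0f}, {0.2f, 0.1f, 0.9f}, {-0.2f, 0.1f, 0.9f}};
    int is_tip[3] = {1, 0, 0};
    Vertex out[3];
    TongueParams p = {1.0f, 1.0f, 0.9f, 0.6f};

    deform_tongue(neutral, is_tip, out, 3, &p);
    for (int i = 0; i < 3; i++)
        printf("v%d: %.2f %.2f %.2f\n", i, out[i].x, out[i].y, out[i].z);
    return 0;
}
```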
A set of new parameters for articulation control has been developed, including parameters for lip rounding, bilabial occlusion and labiodental occlusion. This allows lip movements to be controlled in a more speech-oriented way than was possible in the original Parke model.
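As a rough illustration of the idea, the following sketch maps such speech-oriented parameters onto a handful of hypothetical low-level lip parameters. The parameter names and weights are invented for the example and do not describe the actual rule set.

```c
#include <stdio.h>

/* Hypothetical high-level articulation parameters (0..1). */
typedef struct {
    float lip_rounding;
    float bilabial_occlusion;
    float labiodental_occlusion;
} Articulation;

/* Hypothetical low-level lip parameters in a Parke-style model. */
typedef struct {
    float mouth_width;
    float lip_protrusion;
    float jaw_rotation;
    float lower_lip_tuck;   /* lower lip drawn towards the upper teeth */
} LipParams;

/* One speech-oriented parameter drives several low-level lip parameters
 * at once; the weights here are arbitrary illustration values. */
static void map_articulation(const Articulation *a, LipParams *lp)
{
    lp->mouth_width    = 1.0f - 0.6f * a->lip_rounding;
    lp->lip_protrusion = 0.8f * a->lip_rounding;
    lp->jaw_rotation   = 0.3f * (1.0f - a->bilabial_occlusion);
    lp->lower_lip_tuck = a->labiodental_occlusion;
}

int main(void)
{
    Articulation o = {0.9f, 0.0f, 0.0f};   /* a rounded vowel such as /o/ */
    LipParams lp;
    map_articulation(&o, &lp);
    printf("width %.2f protrusion %.2f jaw %.2f tuck %.2f\n",
           lp.mouth_width, lp.lip_protrusion, lp.jaw_rotation, lp.lower_lip_tuck);
    return 0;
}
```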
The face is rendered using Gouraud shading, which gives the polygonal surface a smooth appearance. The implementation uses a standardized library of 3D graphics routines called PEXlib, which makes it portable to most UNIX environments.
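Gouraud shading interpolates shading values computed at the vertices across each polygon, so a smooth appearance requires per-vertex normals obtained by averaging the normals of the surrounding polygons. The sketch below shows that averaging step for a triangle list; it is a generic illustration of the technique and omits the PEXlib-specific rendering calls.

```c
#include <stdio.h>
#include <math.h>

typedef struct { float x, y, z; } Vec3;

static Vec3 sub(Vec3 a, Vec3 b) { Vec3 r = {a.x - b.x, a.y - b.y, a.z - b.z}; return r; }
static Vec3 cross(Vec3 a, Vec3 b) {
    Vec3 r = {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x};
    return r;
}
static Vec3 normalize(Vec3 v) {
    float len = sqrtf(v.x*v.x + v.y*v.y + v.z*v.z);
    if (len > 0.0f) { v.x /= len; v.y /= len; v.z /= len; }
    return v;
}

/* Accumulate each triangle's face normal into its three vertices, then
 * normalize: these averaged vertex normals are what a Gouraud-shaded
 * renderer interpolates across the polygons. */
static void vertex_normals(const Vec3 *verts, const int (*tris)[3],
                           int n_tris, Vec3 *normals, int n_verts)
{
    for (int v = 0; v < n_verts; v++) { Vec3 zero = {0, 0, 0}; normals[v] = zero; }
    for (int t = 0; t < n_tris; t++) {
        Vec3 a = verts[tris[t][0]], b = verts[tris[t][1]], c = verts[tris[t][2]];
        Vec3 fn = cross(sub(b, a), sub(c, a));
        for (int k = 0; k < 3; k++) {
            int v = tris[t][k];
            normals[v].x += fn.x; normals[v].y += fn.y; normals[v].z += fn.z;
        }
    }
    for (int v = 0; v < n_verts; v++) normals[v] = normalize(normals[v]);
}

int main(void)
{
    Vec3 verts[4] = {{0,0,0}, {1,0,0}, {0,1,0}, {0,0,1}};
    int tris[2][3] = {{0, 1, 2}, {0, 1, 3}};
    Vec3 normals[4];
    vertex_normals(verts, tris, 2, normals, 4);
    for (int v = 0; v < 4; v++)
        printf("n%d: %.2f %.2f %.2f\n", v, normals[v].x, normals[v].y, normals[v].z);
    return 0;
}
```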
The text-to-audiovisual-speech synthesis system is built from three modules:

- RULSYS parses the text according to the rules and creates a multi-channel data file that is used to control the two synthesizers (presently, 40 parameters control the GLOVE synthesizer and 10 parameters control the face).
- The GLOVE synthesizer module creates an audio sample file based on the parameters in the data file.
- The face synthesis module then plays back the audio file and animates the face in real time according to the values in the data file, in synchrony with the audio playback.
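The sketch below illustrates how the face synthesis module might use the multi-channel data: face parameter values are interpolated between stored frames according to the elapsed audio playback time. The frame rate, file layout and function names are assumptions made for the example, not a description of the actual file format.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical layout of one frame in the multi-channel data file:
 * 40 values for the GLOVE audio synthesizer followed by 10 for the face.
 * The 100-frames-per-second rate is an assumption for the example. */
#define N_GLOVE  40
#define N_FACE   10
#define FRAME_HZ 100.0

typedef struct {
    double glove[N_GLOVE];
    double face[N_FACE];
} Frame;

/* Pick the face parameters for the current audio playback time by linear
 * interpolation between the two surrounding frames. */
static void face_at_time(const Frame *frames, int n_frames,
                         double t_seconds, double *face_out)
{
    double pos = t_seconds * FRAME_HZ;
    int i = (int)pos;
    if (i < 0) i = 0;
    if (i > n_frames - 2) i = n_frames - 2;
    double frac = pos - i;
    if (frac < 0.0) frac = 0.0;
    if (frac > 1.0) frac = 1.0;
    for (int k = 0; k < N_FACE; k++)
        face_out[k] = (1.0 - frac) * frames[i].face[k]
                    + frac * frames[i + 1].face[k];
}

int main(void)
{
    /* Two dummy frames standing in for a parsed data file. */
    Frame frames[2];
    memset(frames, 0, sizeof frames);
    frames[0].face[0] = 0.0;   /* e.g. jaw opening at frame 0 */
    frames[1].face[0] = 1.0;   /* fully open one frame later  */

    double face[N_FACE];
    face_at_time(frames, 2, 0.005, face);   /* halfway between the frames */
    printf("interpolated jaw opening: %.2f\n", face[0]);
    return 0;
}
```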
References

Beskow, J. (1995): "Rule-based visual speech synthesis", Proceedings of Eurospeech '95, Madrid, Spain, pp. 299-302.
Carlson, R., Granström, B., and Hunnicutt, S. (1982): "A multi-language text-to-speech module", Proceedings of ICASSP-82, Paris, France, Vol. 3, pp. 1604-1607.
Carlson, R., Granström, B., and Karlsson, I. (1991): "Experiments with voice modelling in speech synthesis", Speech Communication, 10, pp. 481-489.
Le Goff, B. (1993): "Commandes paramétriques d'un modèle de visage 3D pour animation en temps réel", Master's thesis, Institut National Polytechnique, Grenoble, France.
McGurk, H., and MacDonald, J. (1976): "Hearing lips and seeing voices", Nature, 264, pp. 746-748.
Parke, F. I. (1982): "Parameterized models for facial animation", IEEE Computer Graphics and Applications, 2(9), pp. 61-68.
Here is an MPEG-encoded movie of the speaking face. (1.88 MB)
Note: The above file is a so-called MPEG system file, containing multiplexed audio and video. Use this if your MPEG player can handle multiplexed AV; otherwise, consider the video-only version. (1.73 MB)
More information about computer facial animation and visual speech synthesis can be found at U.C. Santa Cruz.
Department of Speech, Music and Hearing