A three-dimensional vocal tract model for articulatory and visual speech synthesis developed within CTT, the Centre for Speech Technology, KTH.


A three-dimensional articulatory model of the vocal tract is being developed at the Department of Speech, Music and Hearing, KTH. The model, described in Engwall (1999a), consists of vocal and nasal tract walls, lips, teeth and tongue. These are rendered as visually distinct articulators, in colours resembling those of a natural human vocal tract.

Because the vocal tract model looks natural, it can be used in speech training for hearing-impaired children or in second language learning, where visual feedback supplements auditory feedback. The 3D model also provides a platform for studies on articulatory synthesis, as the vocal tract geometry can be set with a small number of articulation parameters, and vocal tract cross-sectional areas can be determined directly from the model.

The project involves three-dimensional and dynamical studies of the vocal tract during speech.

The three-dimensional measurements have been made using Magnetic Resonance Imaging (MRI), as described in Engwall & Badin (1999). The MRI data were collected in a collaboration between CTT and the Institut de la Communication Parlée in Grenoble. MR images were taken of the midsagittal plane and in full 3D for one speaker of Swedish, as exemplified by the MR image gallery, the animated gifs of a large part of the midsagittal corpus, and an example of the full 3D collection ('sj' of 'sjutton').

The MRI database was used to reconstruct the vocal tract shapes of Swedish phonemes in 3D. It was also used to identify articulatory strategies and coarticulatory effects. Engwall & Badin (2000) describe a study of the influence of vowel context on Swedish fricatives in VCV sequences, presented at the 5th Speech Production Seminar, Kloster Seeon, in May.
Three-dimensional tongue shapes have been extracted and used to define a tongue model, by statistical analysis of how the articulatory parameters influence the different vertices of the tongue body (Engwall, 2000a). The result of the analysis is illustrated as animated gifs showing the range of variation of the extracted parameters Jaw Height, Tongue Body, Tongue Dorsum, Tongue Tip, Tongue Advance and Tongue Width.
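
To make the idea of parameters influencing vertices concrete, here is a minimal sketch in Python. It assumes that each parameter displaces each vertex linearly from a mean tongue shape; that linear form, the function name `deform`, and all numeric values are illustrative assumptions and are not taken from Engwall (2000a). Only the parameter names come from the text above.

```python
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]

# Parameter names from the text; everything else here is hypothetical.
PARAMETERS = ["jaw_height", "tongue_body", "tongue_dorsum",
              "tongue_tip", "tongue_advance", "tongue_width"]


def deform(mean_shape: List[Vec3],
           basis: Dict[str, List[Vec3]],
           activation: Dict[str, float]) -> List[Vec3]:
    """Displace every vertex of the mean shape by a weighted sum of
    per-parameter displacement vectors (assumed linear model)."""
    shape = []
    for i, (x, y, z) in enumerate(mean_shape):
        for name, value in activation.items():
            dx, dy, dz = basis[name][i]
            x += value * dx
            y += value * dy
            z += value * dz
        shape.append((x, y, z))
    return shape


# Toy data: a "tongue" of two vertices and two of the six parameters.
mean_shape = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
basis = {
    "jaw_height": [(0.0, 1.0, 0.0), (0.0, 0.5, 0.0)],
    "tongue_tip": [(0.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
}

raised = deform(mean_shape, basis, {"jaw_height": 1.0, "tongue_tip": 0.5})
```

In such a scheme, sweeping one activation value while holding the others at zero would produce the kind of single-parameter variation that the animated gifs illustrate.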
To investigate the dynamical aspects that MRI is unable to capture, a combined EMA and EPG study has been carried out for the same reference subject as the MRI study. This work concentrates on temporal aspects of coarticulation (Engwall, 2000d) and on evaluating whether the static MRI measurements can be considered representative of running speech (Engwall, 2000b).
The results from the MRI and EMA/EPG studies have been incorporated in the KTH 3D Vocal Tract model. So far, a kinematic 3D tongue model has been created using parameters determined from MRI and parameter activation determined from EMA (Engwall, 2001c). The EPG data were employed to tune the parameters of the tongue, by comparing the natural contact pattern of the subject with that of the model (Engwall, 2001b).

Animated gifs of the tongue and the face are available: face to see-through face, showing half of the face, and from inside the head.

The model has also been evaluated from the viewpoint of 3D articulatory synthesis of static vowels, by adding vocal tract walls and calculating the resulting formant pattern based on the cross-sectional areas of the model (Engwall, 2001d).
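
As an illustration of how a formant pattern can follow from cross-sectional areas, the sketch below treats the vocal tract as a chain of short lossless tubes, closed at the glottis and ideally open at the lips, and searches for the frequencies at which the volume-velocity transfer function has a pole. This is a textbook simplification (no losses, no lip radiation, no nasal tract) and is not necessarily the calculation used in Engwall (2001d); the function names, the speed of sound and the example area values are assumptions.

```python
import math
from typing import List

SPEED_OF_SOUND = 35000.0  # cm/s, assumed value for warm, moist air


def chain_d(areas: List[float], section_length: float, freq: float) -> float:
    """D element of the chain matrix from glottis to lips.

    Each lossless section has the matrix [[cos, j*Z*sin], [j*sin/Z, cos]]
    with Z proportional to 1/area. Only the real coefficients are tracked;
    the constant rho*c cancels in D, so it is set to 1.
    """
    k = 2.0 * math.pi * freq / SPEED_OF_SOUND
    cos_kl = math.cos(k * section_length)
    sin_kl = math.sin(k * section_length)
    a, b, c, d = 1.0, 0.0, 0.0, 1.0  # identity matrix
    for area in areas:  # areas ordered from glottis to lips
        a2, b2, c2, d2 = cos_kl, sin_kl / area, area * sin_kl, cos_kl
        a, b, c, d = (a * a2 - b * c2, a * b2 + b * d2,
                      c * a2 + d * c2, d * d2 - c * b2)
    return d


def formants(areas: List[float], section_length: float, count: int = 3,
             max_freq: float = 5000.0, step: float = 5.0) -> List[float]:
    """Return the lowest resonance frequencies in Hz, i.e. the zeros of D."""
    found: List[float] = []
    lo = 0.0
    d_lo = chain_d(areas, section_length, lo)
    while lo < max_freq and len(found) < count:
        hi = lo + step
        d_hi = chain_d(areas, section_length, hi)
        if d_lo != 0.0 and d_lo * d_hi <= 0.0:  # zero crossing in (lo, hi]
            left, right, d_left = lo, hi, d_lo
            for _ in range(40):  # refine by bisection
                mid = 0.5 * (left + right)
                d_mid = chain_d(areas, section_length, mid)
                if d_left * d_mid <= 0.0:
                    right = mid
                else:
                    left, d_left = mid, d_mid
            found.append(0.5 * (left + right))
        lo, d_lo = hi, d_hi
    return found


# Uniform tube of 35 sections, 0.5 cm each (17.5 cm in total).
neutral = formants([3.0] * 35, 0.5)
# Narrow back cavity, wide front cavity (made-up, roughly /a/-like values).
open_vowel = formants([1.0] * 17 + [7.0] * 17, 0.5)
```

For the uniform tube the result should agree with the quarter-wavelength formula (2n-1)c/4L, which gives about 500, 1500 and 2500 Hz for the values assumed here. Narrowing the back of the tube and widening the front raises the first formant and lowers the second relative to that neutral pattern.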

Correction algorithms that prevent the tongue from reaching unphysiological configurations, such as protruding through the palate, have also been implemented (Engwall, 2001a).
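
The simplest conceivable correction of this kind is to clamp each tongue vertex so that it never rises above the palate surface directly over it. The Python sketch below shows only that idea; the actual algorithms in Engwall (2001a) are not described on this page and may work quite differently. The names `clamp_to_palate` and `toy_palate`, the dome-shaped palate and the coordinate convention (y pointing upwards, z lateral) are all hypothetical.

```python
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]


def clamp_to_palate(vertices: List[Vec3],
                    palate_height: Callable[[float, float], float],
                    margin: float = 0.0) -> List[Vec3]:
    """Lower any vertex that lies above the palate (minus a safety margin).

    Vertices already below the limit are returned unchanged.
    """
    corrected = []
    for x, y, z in vertices:
        limit = palate_height(x, z) - margin
        corrected.append((x, min(y, limit), z))
    return corrected


def toy_palate(x: float, z: float) -> float:
    """Hypothetical palate: a dome that is highest on the midline (z = 0)."""
    return 2.0 - 0.5 * z * z


tongue = [(0.0, 2.5, 0.0),   # pokes through the palate
          (0.0, 1.0, 1.0)]   # safely below it
fixed = clamp_to_palate(tongue, toy_palate)
```

A plain clamp like this flattens the tongue against the palate without conserving its volume, so a realistic correction would also need to redistribute the displaced tissue.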

The vocal tract model is currently being used in a project to develop a new computer-based pronunciation training system, ARTUR - the ARticulation TUtoR.

Maintained by: Olov Engwall
Last updated Tue, May 03, 2005.