ARTUR - the ARticulation TUtoR
The target groups of the ARTUR project are hearing-impaired or speech-impaired children and second language learners, three categories that may be helped by computer-assisted pronunciation training, which increases the opportunities for individual practice.
Most existing computer-based systems for articulation training offer some kind of feedback on the user's acoustic production.
For a hearing-impaired child with a limited notion of the acoustic targets, or a second language learner whose mother tongue lacks the distinction, it is however often more fruitful to focus on visual or tactile properties of the pronunciation. If a speech therapist is present during training, the therapist interprets the feedback and instructs the child on how to alter the articulation to reach the acoustic target. We want to extend existing computer-based articulation training by providing these instructions directly in a computer program, ARTUR.
Overview of ARTUR
ARTUR is a virtual speech tutor that uses three-dimensional animations of the face and the internal parts of the mouth (tongue, palate, jaw, etc.) to give feedback on how the user's pronunciation deviates from a correct one.
The use of a talking head with internal parts is a key feature, as it can display articulatory features that are hidden in a human speaker.
ARTUR involves several subtasks:
Audio-visual detection of mispronounced speech:
The input to the system is the user's utterances, and the aim is to detect deviations between the user's pronunciation and the target. As the expected input from the user is generally known, it can be compared to a target utterance using forced alignment.
The performance of the speech recognition is improved by adding visual information from the speaker's face.
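The detection step above can be illustrated with a small sketch. It is not the ARTUR implementation; it only shows the common idea of scoring each forced-aligned phone of the known target utterance with a goodness-of-pronunciation (GOP) measure and flagging phones whose target model explains the audio much worse than the best-matching model. All names, scores and the threshold are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical sketch of mispronunciation detection on forced-alignment
# scores (GOP-style). The values below are invented for illustration.

@dataclass
class AlignedPhone:
    label: str           # target phone from the known utterance
    forced_score: float  # avg log-likelihood under the target phone model
    best_score: float    # avg log-likelihood under the best-matching model

def goodness_of_pronunciation(phone: AlignedPhone) -> float:
    """Near 0 when the target model fits the audio about as well as any
    model; strongly negative when the target phone was not produced."""
    return phone.forced_score - phone.best_score

def detect_deviations(phones, threshold=-2.0):
    """Return the target phones whose GOP falls below the threshold."""
    return [p.label for p in phones if goodness_of_pronunciation(p) < threshold]

# Example: the user was asked to say /s o l/ but produced something
# closer to /s o r/ for the final phone.
utterance = [
    AlignedPhone("s", forced_score=-4.1, best_score=-4.0),
    AlignedPhone("o", forced_score=-3.2, best_score=-3.2),
    AlignedPhone("l", forced_score=-9.5, best_score=-4.4),
]
print(detect_deviations(utterance))  # ['l']
```

Because the target utterance is known in advance, only a comparison per phone is needed, not open-vocabulary recognition.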
Marker-less tracking of facial features from video:
The facial data, such as jaw position and mouth opening, is extracted from video images of the face. This can be done either by fitting a three-dimensional model of the face to the face in the video images or by training two-dimensional face appearance models from a large database of face images.
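As a rough illustration of the kind of facial data mentioned above, the sketch below derives scale-invariant articulation parameters (mouth opening, lip spreading) from 2D landmark coordinates such as a fitted appearance model would deliver per video frame. The landmark names, the normalization by interocular distance, and the example coordinates are assumptions, not part of the ARTUR system.

```python
import math

# Hypothetical sketch: articulation-related parameters from 2D facial
# landmarks supplied by a marker-less tracker, one dict per video frame.

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def face_parameters(landmarks):
    """Compute scale-invariant mouth opening and lip spreading.

    landmarks: dict of (x, y) pixel coordinates, e.g. from a 2D face
    appearance model fitted to the current video frame.
    """
    # Interocular distance as a scale reference, so the parameters are
    # comparable across users and camera distances.
    scale = distance(landmarks["eye_left"], landmarks["eye_right"])
    mouth_open = distance(landmarks["lip_upper"], landmarks["lip_lower"]) / scale
    lip_spread = distance(landmarks["mouth_left"], landmarks["mouth_right"]) / scale
    return {"mouth_open": mouth_open, "lip_spread": lip_spread}

frame = {
    "eye_left": (80, 100), "eye_right": (160, 100),
    "lip_upper": (120, 160), "lip_lower": (120, 184),
    "mouth_left": (100, 172), "mouth_right": (140, 172),
}
print(face_parameters(frame))  # {'mouth_open': 0.3, 'lip_spread': 0.5}
```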
Acoustic-to-articulatory inversion:
The next step is to recreate the motion of the user's face and vocal tract from the speech signal and the facial parameters. The visual input is important because the mapping from articulation to acoustics is many-to-one, which means that the articulation cannot be recovered from the speech signal alone. As there is a significant correlation between face and tongue positions, facial data is used to improve the articulatory inversion.
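A minimal sketch of how the face/tongue correlation can be exploited: fit a least-squares predictor of a tongue parameter from a facial parameter (here jaw opening) on synthetic, deliberately correlated data. A real inversion system would regress many articulatory parameters on combined acoustic and facial features; the one-dimensional model, the 0.8 correlation strength and the noise level are all invented for illustration.

```python
import random

# Hypothetical sketch of articulatory inversion aided by facial data:
# ordinary least squares predicting tongue height from jaw opening.

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

random.seed(0)
# Synthetic training data: tongue height partly determined by jaw opening,
# plus noise standing in for everything the face does not reveal.
jaw = [random.uniform(0.0, 1.0) for _ in range(200)]
tongue = [0.8 * j + 0.1 + random.gauss(0.0, 0.05) for j in jaw]

a, b = fit_line(jaw, tongue)
print(round(a, 2), round(b, 2))  # recovers values close to 0.8 and 0.1
```

The residual noise is exactly why facial data alone is not enough either: in practice both the acoustics and the face constrain the tongue, and combining them narrows the many-to-one ambiguity.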
The user's articulation and the correct articulation are synthesized using the models of the face and vocal tract developed at KTH, based on articulation data from, e.g., Magnetic Resonance Imaging (MRI) and Electromagnetic Articulography (EMA) measurements.
Adaptation of the model to the user:
The shape of the face and vocal tract varies between individuals, and the articulatory inversion therefore requires that the model be adapted automatically to each new user.
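One simple form such automatic adaptation could take is estimating a uniform scale factor from a few measurable facial dimensions and rescaling a generic model accordingly. The dimensions, their values, and the single-scale assumption below are illustrative only; a real system would adapt many shape parameters, not one scale.

```python
# Hypothetical sketch of adapting a generic articulatory model to a new
# user via a single least-squares scale factor. All dimensions invented.

GENERIC_MODEL = {
    "face_width": 14.0,   # reference dimensions of the generic model (cm)
    "jaw_length": 10.0,
    "mouth_width": 5.0,
}

def estimate_scale(user_measurements):
    """Least-squares scale s minimizing sum of (s*model - user)^2."""
    num = sum(GENERIC_MODEL[k] * v for k, v in user_measurements.items())
    den = sum(GENERIC_MODEL[k] ** 2 for k in user_measurements)
    return num / den

def adapt(user_measurements):
    """Rescale every model dimension by the estimated user scale."""
    s = estimate_scale(user_measurements)
    return {k: s * v for k, v in GENERIC_MODEL.items()}

# A child's face, roughly 80% of the generic adult model:
child = {"face_width": 11.2, "jaw_length": 8.0, "mouth_width": 4.0}
print(adapt(child))
```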
Design of the feedback:
The output of the system, an articulatory representation of the training utterance, requires careful design: it is crucial that the feedback is comprehensible, useful and motivating for the student. A Wizard-of-Oz study has been carried out as a first step to test and refine the human-computer interface and the feedback display.