As part of the project Computer-Animated LAnguage TEAchers, this
thesis contributes to the creation of a framework for automatic
detection of mispronunciation to be used in a Computer-Assisted
Language Learning (CALL) system. The aim is to provide users with
informative feedback about their pronunciation.
This work relies on the use of time-normalized Discrete Cosine
Transforms (DCTs) to extract audiovisual features so as to process
vowels of different duration and generate visual feature vectors of
constant length. A combination of filter and wrapper feature selection
methods was employed to combine both modalities and proved effective
at reducing the features to a subset suitable for classification.
Support Vector Machines (SVMs) were used as classifiers and made it
possible to work with a sparse dataset. We concluded that
the addition of visual cues contributed to improving the performance
of the classifiers. Each pairwise classifier achieved a correct
recognition rate of 95 to 100%.
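
As an illustration of how a time-normalized DCT can turn vowels of
different duration into feature vectors of constant length, the
following is a minimal sketch and not the implementation used in this
thesis. It assumes a DCT-II basis scaled by the number of frames; the
function name `time_normalized_dct`, the parameter `num_coeffs`, and the
synthetic trajectories are hypothetical and chosen for illustration only.

```python
import math


def time_normalized_dct(trajectory, num_coeffs):
    """Return the first `num_coeffs` DCT-II coefficients of `trajectory`.

    Assumption: each coefficient is divided by the number of frames, so
    trajectories of different duration yield comparable coefficients.
    """
    n = len(trajectory)
    if n == 0:
        raise ValueError("trajectory must contain at least one frame")
    coeffs = []
    for k in range(num_coeffs):
        # Project the trajectory onto the k-th cosine basis function.
        total = sum(
            x * math.cos(math.pi * (i + 0.5) * k / n)
            for i, x in enumerate(trajectory)
        )
        coeffs.append(total / n)  # normalize by duration (frame count)
    return coeffs


# Two synthetic feature trajectories of different duration (made-up data).
short_vowel = [math.sin(math.pi * i / 9) for i in range(10)]
long_vowel = [math.sin(math.pi * i / 24) for i in range(25)]

# Both vowels are mapped to vectors of the same length.
short_features = time_normalized_dct(short_vowel, 4)
long_features = time_normalized_dct(long_vowel, 4)
```

Keeping only the first few coefficients retains the coarse shape of a
trajectory while discarding its duration, which is what allows vectors
from short and long vowels to be compared by a single classifier.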