Seminar at Speech, Music and Hearing:
Accounting for individual speaker properties in automatic speech recognition
Opponent: Torbjørn Svendsen, NTNU
Abstract
In this work, speaker characteristic modeling has been applied in the
fields of automatic speech recognition (ASR) and automatic speaker
verification (ASV). In ASR, a key problem is that acoustic mismatch
between training and test conditions degrades classification performance.
In this work, a child exemplifies a speaker not represented in training
data and methods to reduce the spectral mismatch are devised and
evaluated. To reduce the acoustic mismatch, predictive modeling based on
spectral speech transformation is applied. Following this approach, a
model suitable for a target speaker, not well represented in the
training data, is estimated and synthesized by applying vocal tract
predictive modeling (VTPM). In this thesis, the traditional static
modeling on the utterance level is extended to dynamic modeling. This is
accomplished by operating also on sub-utterance units, such as phonemes,
phone-realizations, sub-phone realizations and sound frames.
Initial experiments show that adaptation of an acoustic model trained
on adult speech significantly reduced the word error rate of ASR for
children, but not to the level of a model trained on children’s speech.
Multi-speaker-group training provided an acoustic model that performed
recognition for both adults and children within the same model at almost
the same accuracy as speaker-group dedicated models, with no added model
complexity. In the analysis of the cause of errors, body height of the
child was shown to be correlated with word error rate.
A further result is that the computationally demanding iterative
recognition process in standard VTLN can be replaced by synthetically
extending the vocal tract length distribution in the training data. A
multi-warp model is trained on the extended data and recognition is
performed in a single pass. The accuracy is similar to that of the
A concluding experiment in ASR shows that the word error rate can be
reduced by extending a static vocal tract length compensation parameter
into a temporal parameter track. A key component to reach this
improvement was provided by a novel joint two-level optimization
process. In the process, the track was determined as a composition of a
static and a dynamic component, which were simultaneously optimized on
the utterance and sub-utterance level respectively. This had the
principal advantage of limiting the modulation amplitude of the track to
what is realistic for an individual speaker. The recognition error rate
was reduced by 10% relative compared with that of a standard
utterance-specific estimation technique.
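As an illustration only, the composition of a static and a dynamic component into a bounded parameter track can be sketched as below. This is a minimal sketch and not the joint two-level optimization described above: the function name, the additive form, and the hard clipping to `max_deviation` are assumptions made for the example, and the numeric values are hypothetical.

```python
def compose_warp_track(static_warp, dynamic_offsets, max_deviation):
    """Build a per-frame warp track from one static value and per-frame offsets.

    Assumption for illustration: the track is the sum of a static,
    utterance-level factor and a dynamic, sub-utterance offset, and the
    offset is clipped so the track stays within +/- max_deviation of the
    static value.
    """
    track = []
    for offset in dynamic_offsets:
        # Limit the modulation amplitude around the static component.
        bounded = max(-max_deviation, min(max_deviation, offset))
        track.append(static_warp + bounded)
    return track


# Hypothetical values: static factor 1.10, three frames, amplitude limit 0.08.
example_track = compose_warp_track(1.10, [0.0, 0.05, -0.20], 0.08)
```

In this sketch the third offset exceeds the limit and is clipped, so the track never moves further from the static value than the chosen bound.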
The techniques devised and evaluated can also be applied to other
speaker characteristic properties that exhibit a dynamic nature.
An excursion into ASV led to the proposal of a statistical speaker
population model. The model represents an alternative approach for
determining the reject/accept threshold in an ASV system instead of the
commonly used direct estimation on a set of client and impostor
utterances. This is especially valuable in applications where a low
false reject or false accept rate is required. In these cases, the
number of errors is often too few to estimate a reliable threshold using
the direct method. The results are encouraging but need to be verified
on a larger database.
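The difference between direct and model-based threshold estimation can be sketched as follows. This is a hedged illustration, not the speaker population model proposed in the thesis: it assumes, purely for the example, that impostor scores follow a single Gaussian, and the function names and score values are hypothetical.

```python
from statistics import NormalDist


def threshold_direct(impostor_scores, target_far):
    """Pick the threshold by counting impostor scores (direct method).

    Returns None when the sample is too small to allow even one false
    accept at the target rate, which is the situation where direct
    estimation becomes unreliable.
    """
    ranked = sorted(impostor_scores, reverse=True)
    allowed = int(target_far * len(ranked))  # false accepts we may tolerate
    if allowed == 0:
        return None
    # Scores strictly above this value are accepted (ties ignored here).
    return ranked[allowed]


def threshold_from_model(impostor_scores, target_far):
    """Pick the threshold from a fitted score distribution.

    Assumption for illustration: impostor scores are Gaussian, so the
    threshold is the (1 - target_far) quantile of the fitted distribution.
    """
    model = NormalDist.from_samples(impostor_scores)
    return model.inv_cdf(1.0 - target_far)


# Hypothetical impostor scores; five trials cannot resolve a 0.1% rate directly.
scores = [-2.0, -1.0, 0.0, 1.0, 2.0]
direct = threshold_direct(scores, 0.001)       # None: too few trials
modelled = threshold_from_model(scores, 0.001)  # extrapolated from the fit
```

With few trials the direct method has no errors to count at a low target rate, while the fitted distribution still yields a threshold, at the cost of depending on how well the distributional assumption holds.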
15:15 - 17:00
Friday April 23, 2010
The seminar is held in Fantum.