

Seminar at Speech, Music and Hearing:

A framework for classification and analysis with discriminative prototypical spectrograms

Daniel Neiberg


In the first part of this PhD seminar, findings of more general interest are presented. Subtle temporal and spectral differences between categorical realizations of, in this case, paralinguistic phenomena of affective states are not always easy to capture and describe. For this purpose, a signal representation based on Time-Varying Constant-Q Cepstral Coefficients (TVCQCC) is derived. The coefficients are invariant to utterance length, and a method that exploits the special properties of the constant-Q transform for mean-frequency estimation and normalization is described. As a special case, a representation for pitch contours is shown. By averaging the coefficients and performing the inverse transformation, we obtain spectrograms that show the prototypical realization of each category. When Support Vector Machines are used for classification, the separating hyperplanes can be projected onto the coefficient axes, which yields the importance of each coefficient. These importance weights can be used to plot discriminative spectrograms, which show both the prototypical realization and the distinctiveness at each point of the spectro-temporal space. Based on these, a technique for the analysis of rhythm and spectral shape is applied to basic affective states.
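The fixed-length property and the prototype spectrograms can be illustrated with a minimal numpy sketch. Here a truncated 2-D DCT of a log-spectrogram stands in for the actual TVCQCC derivation (the real method uses the constant-Q transform and its mean-frequency normalization, which are not reproduced here); all function names and sizes are illustrative assumptions:

```python
import numpy as np

def dct_matrix(n, k):
    """Orthonormal DCT-II matrix mapping n samples to k coefficients."""
    i = np.arange(n)
    M = np.cos(np.pi / n * (i + 0.5)[None, :] * np.arange(k)[:, None])
    M *= np.sqrt(2.0 / n)
    M[0] /= np.sqrt(2.0)
    return M  # shape (k, n)

def fixed_length_cepstra(log_spec, n_ceps=8, n_time=6):
    """Truncated 2-D DCT of a log-spectrogram (freq x time).
    The DCT over frequency gives cepstrum-like coefficients; the DCT
    over time models their trajectories. Truncating both makes the
    representation size independent of utterance length."""
    f, t = log_spec.shape
    Df = dct_matrix(f, n_ceps)
    Dt = dct_matrix(t, n_time)
    return Df @ log_spec @ Dt.T  # shape (n_ceps, n_time)

def prototype_spectrogram(coeff_list, f=40, t=100):
    """Average the coefficients of one category and invert the
    truncated DCTs to visualise a prototypical spectrogram."""
    C = np.mean(coeff_list, axis=0)
    Df = dct_matrix(f, C.shape[0])
    Dt = dct_matrix(t, C.shape[1])
    return Df.T @ C @ Dt  # orthonormal rows => transpose inverts
```

Two utterances of different lengths yield coefficient matrices of identical shape, so they can be averaged per category; with a linear SVM trained on such vectors, the hyperplane normal can be reshaped and inverted the same way to visualise per-coefficient importance.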

The second part investigates the relationship between the acoustic and the articulatory domain. The mapping from the acoustic domain to the actual articulation, here measured by EMA coils, is called acoustic-to-articulatory inversion. The accuracy of the mapping is limited by the degree of non-uniqueness (one-to-many) in the relation. A study that quantifies the non-uniqueness of cluster-based inversion is presented. Finally, a general version of a coupled Hidden Markov/Bayesian Network model for phoneme recognition on acoustic-articulatory data is presented. The model, here denoted the Cross-Modal Coupled Hidden Markov Model, uses knowledge learned from the articulatory measurements, which are available for training, for phoneme recognition on the acoustic input. With optimized parameters, the proposed method shows a slight improvement for two speakers over a baseline phoneme recognition system that does not use articulatory knowledge; however, the improvement is statistically significant for only one of the speakers. An analysis of the results per phonemic group shows an improvement for diphthongs, and to some extent vowels, but a decrease for fricatives.
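The idea behind measuring non-uniqueness in cluster-based inversion can be sketched as follows: quantize the acoustic frames, then look at how widely the paired articulatory frames spread inside each acoustic cluster. This is a toy numpy illustration under assumed data shapes, not the study's actual procedure:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain k-means (Lloyd's algorithm) for acoustic vector quantisation."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return centers, labels

def articulatory_nonuniqueness(acoustic, articulatory, k=4):
    """Cluster the acoustic frames, then average the per-dimension
    standard deviation of the paired articulatory (EMA) frames within
    each acoustic cluster. A large value means acoustically similar
    frames map to dissimilar articulations (a one-to-many mapping)."""
    _, labels = kmeans(acoustic, k)
    spreads = [articulatory[labels == j].std(0).mean()
               for j in range(k) if np.any(labels == j)]
    return float(np.mean(spreads))
```

If the articulation is a deterministic function of the acoustics the measure stays small; articulatory configurations that vary freely within an acoustic cluster drive it up, which is exactly the factor that caps inversion accuracy.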

15:15 - 17:00
Tuesday November 17, 2009

The seminar is held in Fantum.


Published by: TMH, Speech, Music and Hearing

Last updated: Wednesday, 23-Jun-2010 09:22:46 MEST