Biologically inspired statistical methods for flexible automatic speech understanding

The project will develop machine learning methods for speech understanding that more closely resemble the biological approach to learning.

The aim is to make speech understanding systems more flexible. The proposed methods will allow the system to estimate whether a new input agrees with its current state of knowledge. When it does not, the system will be able to modify its state of knowledge to include the new input.

This requires the ability to judge the current model on the basis of new observations, and the ability to learn the new input from the context in a semi-supervised way. For this purpose we will use advanced machine learning techniques, including unsupervised and semi-supervised methods. The methods will be tested on multi-modal speech databases and on a robotic platform that interacts with its environment as well as with its conversational partner.

We have previously studied the application of unsupervised methods to speech processing, both at the signal level, to discover acoustic classes, and with multi-modal sensory input, to associate words with meanings. These two aspects, however, were considered independently.

The project will contribute to the development of more flexible speech interfaces, with the aim of increasing their use by the general public. The project will also contribute to basic research in cognitive science on modelling language acquisition in humans.

Nowadays, everyone has experienced what it is like to talk to a machine. Most mobile phones have voice dialling functionality, and train timetables can be accessed by telephone without ever talking to a real human being. Exciting as this may sound, it often turns into a frustrating experience. Perhaps we have a particular accent, or we are not proficient in the language used by the system. Often, we are required to talk in a mechanical way, phrasing our thoughts as if they were written on paper, and avoiding the hesitations or extra phrases that would make the discourse flow more naturally in a human-to-human conversation.

One reason for the limitations of these systems, in spite of their sophistication, is their inability to use context and to learn from mistakes. The systems are in fact built using machine learning algorithms that allow them to learn from examples, but these algorithms do not share the flexibility that humans have. Speech understanding systems have a hard time, for example, deciding whether a word we say is one they have already learnt or one that is unknown to them. Anyone who has tried to learn a foreign language knows how intuitive this task is for us. It is also relatively easy for us to pick up a new word and learn its pronunciation, provided that we have a patient friend who is willing to repeat the word for us many times. In current settings, this task too is hard for a machine.

The research proposed here aims at investigating methods for automatic speech understanding that more closely resemble the abilities of human speakers. We wish to contribute to the development of systems that can learn the meaning of an utterance from context, and that are flexible enough to revise their current knowledge if it appears to be incomplete, for example by asking us for help when they do not understand what we have just said. The research is important for improving current speech-based applications, but will also contribute to our understanding of how these tasks are performed by humans. Working together with psychologists, neuroscientists and cognitive scientists, we will analyse interactions between humans, with special focus on infants and their parents, and develop models that can explain some of the phenomena taking place in language acquisition.

One of the key concepts when considering human learning processes is "embodiment". The reason why we are so flexible in extracting new information from the environment is that we use several senses and are able to relate them to one another. The meaning we associate with our experience is a complex combination of sensory inputs. If a sensory input is unknown or inconsistent, we are not lost because we can rely on the others. Moreover, we can interact with the environment and learn the properties of objects by observing the effect that certain actions have on them.

To find out whether our methods can cope with this more complex way of learning, we will test them on robotic systems. The robots will be able to interact with their environment and with humans, and their task will be to make associations between visual, tactile and auditory inputs. These associations will constitute their ability to connect spoken utterances to meanings.

Group: Speech Communication and Technology

Giampiero Salvi

Funding: VR (2009-4599)

Duration: 2010 - 2014


KTH research database:

Keywords: speech recognition, machine learning, developmental cognition

Related publications:


Salvi, G., & Vanhainen, N. (2014). The WaveSurfer Automatic Speech Recognition Plugin. In Proceedings of LREC. Reykjavik, Iceland.

Vanhainen, N., & Salvi, G. (2014). Free Acoustic and Language Models for Large Vocabulary Continuous Speech Recognition in Swedish. In Proceedings of LREC. Reykjavik, Iceland.

Vanhainen, N., & Salvi, G. (2014). Pattern Discovery in Continuous Speech Using Block Diagonal Infinite HMM. In Proceedings of ICASSP. Florence, Italy.


Franovic, T., Herman, P., Salvi, G., Benjaminsson, S., & Lansner, A. (2013). Cortex-inspired network architecture for large-scale temporal information processing. Frontiers in Neuroinformatics.

Koniaris, C., Salvi, G., & Engwall, O. (2013). On Mispronunciation Analysis of Individual Foreign Speakers Using Auditory Periphery Models. Speech Communication, 55(5), 691-706.

Neiberg, D., Salvi, G., & Gustafson, J. (2013). Semi-supervised methods for exploring the acoustics of simple productive feedback. Speech Communication, 55(3), 451-469.

Oertel, C., Salvi, G., Götze, J., Edlund, J., Gustafson, J., & Heldner, M. (2013). The KTH Games Corpora: How to Catch a Werewolf. In IVA 2013 Workshop Multimodal Corpora: Beyond Audio and Video - MMC 2013.

Oertel, C., & Salvi, G. (2013). A Gaze-based Method for Relating Group Involvement to Individual Engagement in Multimodal Multiparty Dialogue. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI). Sydney, Australia.

Salvi, G. (2013). Biologically Inspired Methods for Automatic Speech Understanding. In Advances in Intelligent Systems and Computing (AISC) (p. 283). Palermo, Italy.

Saponaro, G., Salvi, G., & Bernardino, A. (2013). Robot Anticipation of Human Intentions through Continuous Gesture Recognition. In Proc. 4th International Workshop on Collaborative Robots and Human Robot Interaction (CR-HRI 2013). San Diego, USA.


Koniaris, C., Engwall, O., & Salvi, G. (2012). Auditory and Dynamic Modeling Paradigms to Detect L2 Mispronunciations. In Interspeech 2012. Portland, OR, USA.

Koniaris, C., Engwall, O., & Salvi, G. (2012). On the Benefit of Using Auditory Modeling for Diagnostic Evaluation of Pronunciations. In Inter. Symp. on Auto. Detect. Errors in Pronunc. Training (IS ADEPT), 2012 (pp. 59-64). Stockholm, Sweden.

Salvi, G., Montesano, L., Bernardino, A., & Santos-Victor, J. (2012). Language bootstrapping: Learning word meanings from perception-action association. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 42(3), 660-671.

Vanhainen, N., & Salvi, G. (2012). Word Discovery with Beta Process Factor Analysis. In Proceedings of Interspeech. Portland, Oregon.


Ananthakrishnan, G., & Salvi, G. (2011). Using Imitation to learn Infant-Adult Acoustic Mappings. In Proceedings of Interspeech (pp. 765-768). Florence, Italy.

Lindblom, B., Diehl, R., Park, S-H., & Salvi, G. (2011). Sound systems are shaped by their users: The recombination of phonetic substance. In Clements, G. N., & Ridouane, R. (Eds.), Where Do Phonological Features Come From?: Cognitive, physical and developmental bases of distinctive speech categories. CNRS & Sorbonne-Nouvelle.

Salvi, G., Tesser, F., Zovato, E., & Cosi, P. (2011). Analisi Gerarchica degli Inviluppi Spettrali Differenziali di una Voce Emotiva. In Contesto comunicativo e variabilità nella produzione e percezione della lingua (AISV). Lecce, Italy.


Salvi, G., Tesser, F., Zovato, E., & Cosi, P. (2010). Cluster Analysis of Differential Spectral Envelopes on Emotional Speech. In Proceedings of Interspeech (pp. 322-325). Makuhari, Japan.


Krunic, V., Salvi, G., Bernardino, A., Montesano, L., & Santos-Victor, J. (2009). Affordance based word-to-meaning association. In IEEE International Conference on Robotics and Automation (ICRA). Kobe, Japan.


Krunic, V., Salvi, G., Bernardino, A., Montesano, L., & Santos-Victor, J. (2008). Associating word descriptions to learned manipulation task models. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Nice, France.

Published by: TMH, Speech, Music and Hearing

Last updated: 2012-11-09