Seminar at Speech, Music and Hearing:

Docent lecture seminar:

Machines learning to speak

Note the time!

Giampiero Salvi


In this talk I will present research aimed at modelling some aspects of early language acquisition. The first goal is to draw inspiration from human learning in order to improve our machine learning methods for speech technology and cognitive systems in general. The second goal is to shed some light on how children learn to speak by simulating some of the phenomena involved in the process and suggesting some possible explanations.

The dream of being able to communicate with computers and robots by speech, often envisioned in science fiction, is beginning to become a reality. Speech-based human-computer interfaces, largely built upon some form of machine learning, have indeed improved significantly in recent years. However, this improvement is, in our opinion, to a large extent due to ever-increasing processing power and ever-larger data collections, rather than to an improved understanding of how speech communication skills are acquired and deployed by humans. As a consequence, speech interfaces still lack the robustness, flexibility and naturalness observable in human-human interactions.

The first significant difference between human and machine learning is that the former is inherently multi-modal. Infants have access to an extremely rich and redundant representation of their own body and of the world surrounding them. In contrast, speech-based interfaces are split, in an engineering fashion, into modular blocks that are developed in isolation and only partially integrated during their use. As an example, automatic speech recognisers are typically developed focusing exclusively on the acoustic speech input. Another fundamental difference is that humans can learn in an active way by exploring and modifying their environment, instead of merely collecting statistics of the observations they are presented with.

The research I will present tries to overcome the above limitations by proposing models that are able to learn from interaction with the environment in multi-sensory settings. An example of this involves a humanoid robot grounding the meaning of words using perception-action associations in the context of a manipulation task. Moreover, we try to reduce the amount of expert knowledge required in the learning process and, to some extent, to escape the simplistic input-output model often assumed in supervised machine learning. An example of this involves finding word candidates in a stream of continuous speech. The proposed methods will hopefully help us steer away from learning paradigms that are bound to find only shallow associations between input data and annotations.

13:15 - 14:00
Friday January 13, 2012

The seminar is held in Fantum.


Published by: TMH, Speech, Music and Hearing

Last updated: Wednesday, 23-Jun-2010 09:22:46 MEST