Annual Report 1999

Björn Granström, Professor in Speech Communication

Rolf Carlson, Professor in Speech Technology

The speech communication and technology group is the largest within the department. The group has now expanded to about 35 researchers and research students, a few of them working part-time. The creation of CTT, the Centre for Speech Technology, in 1996 opened new possibilities for speech research at the department. The second phase started in July 1998 and was evaluated in the summer of 1999, resulting in continued support for the centre. The organisation of CTT is presented separately.

The work in the group, including CTT, covers a wide variety of topics, ranging from detailed theoretical development of speech production models through phonetic analyses to practical applications of speech technology.


MiLaSS, Multimodality in Language and Speech Systems, the 7th European Summer School on Language and Speech Communication, was hosted by TMH in the summer of 1999. The talking face was one of the highlights of this event, together with a number of student exercises using the speech toolbox developed at TMH. The program and the results of the student sessions are available at /milass/

Spoken dialogue

A major research effort on multimodal dialogue systems was carried out during 1999. The motivation is to study speech technology as part of complete systems, together with the interaction between the different modules that such systems include. These systems have served as the platform for data collection, data analysis and research on multimodal human-machine interaction.

The speech group participated in the activities in "Stockholm, the Cultural Capital of Europe '98" with an animated talking agent, AUGUST. Visitors at the CCE'98 were able to communicate with AUGUST. In the simplest configuration, AUGUST presented information about Stockholm, the study program of KTH, the research at CTT and himself. The system was in active use until the spring of 1999. The work resulted in new insights into multimodal interaction, shallow dialogue control and error resolution in human-machine interaction, and in a unique database of spontaneous machine-directed speech from men, women and children. Dialogue aspects such as degree of user initiative, confirmation, turn taking, and repair of misunderstandings have been extensively studied on the spontaneous data.


Building on experience from the AUGUST system, we have now begun specifying and implementing AdApt, a multimodal dialogue system for discussing and evaluating apartments for sale in Stockholm.

During 1999 we were involved in the national project "Spoken Dialogue Systems". The aim of this project is to bring together research groups in Sweden with competence in speech analysis and synthesis, dialogue structure and the interpretation of natural language.

Speech and language databases

We see an expanding interest in studies on speaker variability, especially in the context of speaker-independent and speaker-adaptive recognition. Large text corpora are increasingly important for language technology developments. We have participated in a large effort to build telephone speech databases for speech recognition in several European languages, the SpeechDat project. This corpus consists of telephone speech from 5000 speakers with controlled age and regional accent distributions and is augmented with a 1000-speaker mobile phone database. These databases are now available through ELRA. The AUGUST system was used to collect a database of more than 10,000 spontaneous utterances, which included a number of recordings of child voices.

Speech recognition

New acoustic models have been trained for speech recognition on the large SpeechDat database. We have shown that phone models trained on this large, task-independent speech database can provide high recognition accuracy in a specific application if they are adapted using a much smaller task-specific corpus. This procedure dramatically reduces the effort involved in creating new applications, since only a small amount of task-specific speech data has to be collected. Experiments using glottal source adaptation have shown very promising results. Especially interesting is the work on phonetic categorisation using feature-based context information. HMM methods have been used in comparative experiments on speech signal representations. Special effort has been devoted to modelling speaker consistency within an utterance.

The speech recognition package "StarLight" is an important module in the TMH toolbox and constitutes the speech recognition part of many projects and demonstrators. It features phonetic classification based on artificial neural nets or Markov models. The lexical search procedure applies the A* stack decoding paradigm.
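As a rough illustration of the A* stack decoding paradigm mentioned above, the sketch below searches a toy word lattice for the best-scoring word sequence. It is not the StarLight implementation, whose details are not given here. The lattice format, the function name a_star_decode and the simple per-frame heuristic are all assumptions made for the example.

```python
import heapq


def a_star_decode(edges, n_frames):
    """Find the best word sequence spanning frames 0..n_frames.

    edges is a list of (start, end, word, log_prob) tuples with
    start < end and log_prob <= 0. This lattice format is hypothetical.
    Returns (words, score), or None if no path covers all frames.
    """
    if not edges:
        return None

    by_start = {}
    for start, end, word, log_prob in edges:
        by_start.setdefault(start, []).append((end, word, log_prob))

    # Optimistic estimate of the score still to come: assume every
    # remaining frame scores as well as the best frame of any edge.
    # This never underestimates the true remaining score, so the first
    # complete hypothesis popped from the stack is the best one.
    best_rate = max(lp / (end - start) for start, end, _, lp in edges)

    def estimate(frame):
        return (n_frames - frame) * best_rate

    # The "stack" is a priority queue ordered by score + estimate.
    # heapq is a min-heap, so the priority is negated. The counter
    # breaks ties so that tuples never compare the word sequences.
    stack = [(-estimate(0), 0, 0.0, 0, ())]
    counter = 1
    while stack:
        _, _, score, frame, words = heapq.heappop(stack)
        if frame == n_frames:
            return list(words), score
        for end, word, log_prob in by_start.get(frame, []):
            if end > n_frames:
                continue
            new_score = score + log_prob
            priority = -(new_score + estimate(end))
            heapq.heappush(
                stack, (priority, counter, new_score, end, words + (word,))
            )
            counter += 1
    return None


# Toy lattice over five frames; the words and scores are invented.
toy_lattice = [
    (0, 2, "two", -1.0),
    (0, 2, "to", -1.5),
    (2, 5, "rooms", -2.0),
    (2, 4, "room", -1.2),
    (4, 5, "s", -3.0),
    (0, 5, "tombs", -4.5),
]
best_words, best_score = a_star_decode(toy_lattice, 5)
```

The point of the estimate is that partial hypotheses of different lengths become comparable, so short but promising hypotheses are not crowded out by longer ones. A real decoder would derive the estimate from the acoustic and language models rather than from the lattice edges.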

Speech production models

Our work on improved models of the voice source and its interaction with the vocal tract has led to a detailed understanding of the mechanisms involved. Data on variations in natural speech, expressed in terms of the new model, have also been accumulated, covering both linguistically motivated variations and variations among speakers. Special emphasis has been placed on analysis of the female voice. Generalised observations from the analysis work are now implemented as rules for speech synthesis.

Articulatory models have recently attracted interest in our laboratory. Several ways of describing the vocal tract are being investigated, including a full 3D model. The lack of reliable articulatory reference data still seems to be the most severe bottleneck. Both direct and indirect methods of data collection have been and are being investigated.

Speaker characteristics

In the context of speaker verification, we are engaged in the European COST 250 project and the Esprit PICASSO project.

Work is under way to implement the speaker verification results in the CTT project "CTT Bank". CTT Bank is an experimental demonstrator that will investigate the potential advantages of using speech for identity verification and user commands in bank telephony. The "PER" project is an effort to build an automated entrance receptionist, PER (Prototype Entrance Receptionist). It operates in the central entrance to the department. The purpose is to create and experiment with alternative speech-based means of controlling access to the premises.

In our text-to-speech project, we have increased the efforts on different speaking styles. Both speaker variation and synthesis of attitudes and emotions are studied. The long-standing efforts on improved prosodic models and segmental synthesis continue.

Tools for education and prototyping

Our work on new tools continues. It has resulted in a new set of student labs in speech technology. In one lab, students work with an interactive dialogue system whose functions they can change and expand. A new framework for speech synthesis is the topic of another lab. These labs were used and evaluated in several classes during 1999. These and other software developments at the department have changed the working environment for many projects. Fast prototyping based on modules is now part of general experimental designs.

Multi-modal speech synthesis

The audio-visual face synthesis project has attracted considerable attention. The synthesis is now used in many of the demonstrators under development. Strategies for articulatory synthesis are under development. The expansion of the model to the internals of the speech production apparatus is well under way and will lead to a full 3D articulatory model.

In the Teleface project, we work together with the Hearing group, investigating the usefulness of synthesised faces for hard-of-hearing persons in telecommunication. This project recently received continued financing through KFB, Nutek and Hjälpmedelsinstitutet, one of the CTT partners.

Two scientific exchanges on the doctoral level to prominent international centres have taken place within this area. Jonas Beskow recently returned from Santa Cruz and Olov Engwall has been doing research in Grenoble.

The Teleface project, illustrated by a phone call to a hearing-impaired person equipped with a synthetic face generator to support lip-reading.

Speech technology and disabilities

Speech and language technology for persons with motor disabilities and non-vocal persons is a major research area. Research on communication disability has been designated a priority area at KTH. Several ways of increasing communication speed have been investigated, including interactive text prediction based on linguistic principles. A large national project aiming at computer support programs for persons with reading and writing difficulties has supported part of this work. Our part of the project was concerned with text prediction. The European project ENABL started in 1997 and finished in 1999, with a final review in 2000. It concerns the use of speech recognition for disabled users of CAD systems.
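To give a feel for how interactive text prediction can increase communication speed, the sketch below suggests completions for a partly typed word. It ranks candidates by plain word and word-pair counts, which is a deliberately simplified stand-in for the linguistically based prediction described above and not the method used in the project. The class name WordPredictor, its methods and the sample text are hypothetical.

```python
from collections import Counter, defaultdict


class WordPredictor:
    """Suggest completions using word and word-pair counts.

    A simplified, purely statistical stand-in; all names are hypothetical.
    """

    def __init__(self, training_text):
        words = training_text.lower().split()
        self.unigrams = Counter(words)
        # bigrams[previous][next] counts how often "next" follows "previous".
        self.bigrams = defaultdict(Counter)
        for previous, following in zip(words, words[1:]):
            self.bigrams[previous][following] += 1

    def predict(self, previous_word, prefix, n=3):
        """Return up to n known words that start with prefix.

        Words seen after previous_word rank first, then more frequent
        words, then alphabetical order to keep the result stable.
        """
        prefix = prefix.lower()
        context = self.bigrams.get(previous_word.lower(), Counter())
        candidates = [w for w in self.unigrams if w.startswith(prefix)]
        candidates.sort(key=lambda w: (-context[w], -self.unigrams[w], w))
        return candidates[:n]


# Invented sample text, only to make the example self-contained.
predictor = WordPredictor("the cat sat on the mat the cat ate the meal")
suggestions = predictor.predict("the", "m")
```

If the intended word appears among the suggestions, the user selects it with one action instead of typing the remaining letters, which is where the saving in keystrokes comes from.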

For an extended summary of external activities and projects, see National and International Contacts.

Open source software

The open source software developed in the speech group has been downloaded by many sites. /software/

Snack is an extension to the Tcl/Tk scripting language that adds commands for sound I/O and sound visualization, e.g. waveforms and spectrograms. Snack serves as a general audio platform giving uniform access to the audio hardware on a number of systems.

The NICO Toolkit is an artificial neural network toolkit specifically designed and optimized for automatic speech recognition applications. Networks with both recurrent connections and time-delay windows are easily constructed.

The Broker is a server that forwards function calls, results and error codes between program modules over the Internet. It was designed to be easy to use from programs, to be platform independent and to offer a uniform interface for all modules.
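The forwarding idea can be sketched in a few lines. The example below is an in-process toy, not the Broker's actual protocol or API: it leaves out the network transport entirely, and the class name, method names and error codes are assumptions made for illustration.

```python
# Hypothetical error codes; the real Broker's codes are not given here.
OK = 0
UNKNOWN_MODULE = 1
UNKNOWN_FUNCTION = 2
CALL_FAILED = 3


class Broker:
    """Forward calls to registered modules through one uniform interface."""

    def __init__(self):
        self._modules = {}

    def register(self, module_name, functions):
        """Register a module as a mapping from function name to callable."""
        self._modules[module_name] = dict(functions)

    def call(self, module_name, function_name, *args):
        """Forward a call and return (error_code, result).

        Errors are reported as codes rather than raised, so every caller
        sees the same interface regardless of the target module.
        """
        module = self._modules.get(module_name)
        if module is None:
            return UNKNOWN_MODULE, None
        function = module.get(function_name)
        if function is None:
            return UNKNOWN_FUNCTION, None
        try:
            return OK, function(*args)
        except Exception as error:
            return CALL_FAILED, str(error)


# Two invented modules standing in for, say, synthesis and arithmetic.
broker = Broker()
broker.register("synthesis", {"say": lambda text: "speaking: " + text})
broker.register("math", {"divide": lambda a, b: a / b})
status, result = broker.call("synthesis", "say", "hello")
```

Because callers only ever talk to the broker, a module can be replaced or moved without changing the code that uses it, which fits the uniform-interface goal stated above.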

Published by: TMH, Speech, Music and Hearing

Last updated: Friday, 25-Nov-2005 13:24:40 MET