Annual Report 1999
The speech communication and technology group is the largest within the
department. The group has now expanded to about 35 researchers and research students, a few of them working part-time.
The creation of CTT, the Centre for Speech Technology,
in 1996 opened new possibilities for speech research at the department. The second
phase started in July 1998 and was evaluated in the summer of 1999, resulting in continued support for the centre.
The organisation of CTT is presented separately.
The work in the group, including CTT, covers a wide variety of topics,
ranging from detailed theoretical development of speech production models through phonetic analyses to practical
applications of speech technology.
MiLaSS, Multimodality in Language and Speech Systems, the 7th European
Summer School on Language and Speech Communication, was hosted by TMH in the summer of 1999. The talking face was
one of the highlights of this event together with a number of student exercises using the speech toolbox developed
at TMH. The program and the results of the student sessions can be reached through /milass/.
A major research effort on multimodal dialogue systems has been carried
out during 1999. The motivation is to study speech technology as part of complete systems and the interaction between
the different modules that are included in such systems. These systems have been the platform for data collection,
data analysis and research on multimodal human-machine interaction.
The speech group participated in the activities in
"Stockholm, the Cultural Capital of Europe '98",
with an animated talking agent, AUGUST. Visitors at the CCE'98 were able to communicate with AUGUST. In the simplest
configuration, AUGUST presented information about Stockholm, the study program of KTH, the research at CTT and
himself. The system was in active use until the spring of 1999. The work resulted in new insights into multimodal interaction,
shallow dialogue control, error resolution in human-machine interaction, and a unique database of spontaneous machine-directed
speech from men, women and children. Dialogue aspects such as degree of user initiative, confirmation,
turn-taking, and repair of misunderstandings have been studied extensively on this spontaneous data.
Building on experiences from the August system, we have now initiated
the process of specifying and implementing AdApt, a multimodal dialogue system for discussing and evaluating apartments
for sale in Stockholm.
During 1999, we have been involved in the national project
"Spoken Dialogue Systems".
The aim of this project is to bring together research groups in Sweden who have competence in speech analysis and
synthesis, dialogue structure and the interpretation of natural language.
Speech and language databases
We see an expanding interest in studies on speaker variability, especially
in the context of speaker independent/speaker adaptive recognition. Large text corpora are increasingly important
for language technology developments. We have participated in a large effort to build telephone speech databases
for speech recognition in several European languages, the SpeechDat-project. This corpus consists of telephone
speech from 5000 speakers with controlled age and regional accent distributions and is augmented with a 1000 speaker
mobile phone database. These databases are now available through ELRA. The August system was used to collect a
database of more than 10,000 spontaneous utterances, which included a number of recordings of child voices.
New acoustic models have been trained for speech recognition on the large
SpeechDat database. We have shown that phone models trained on the large, task-independent speech database can
provide high recognition accuracy in a given application if they are adapted using a (much smaller) task-specific
corpus. This procedure dramatically reduces the effort involved in creating new applications, since only a small
amount of adaptation speech data is required. Experiments using glottal source adaptation have shown very
promising results. Especially interesting is the work on phonetic categorisation using feature-based context information.
HMM-methods have been used in comparative experiments on speech signal representations. Special effort has been
devoted to modelling speaker consistency within an utterance. The speech recognition package
"StarLight" is an important
module in the TMH toolbox and constitutes the speech recognition part in many projects and demonstrators. It features
phonetic classification based on artificial neural nets or Markov models. The lexical search procedure applies
the A* stack decoding paradigm.
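The A* stack decoding paradigm mentioned above can be illustrated with a short sketch: best-first search over a word graph, where arc costs stand in for negative log acoustic and language scores. The graph, costs and heuristic below are invented for illustration and do not reflect the actual StarLight implementation.

```python
import heapq

def stack_decode(graph, heuristic, start, goal):
    """A* stack decoding over a toy word graph (illustrative sketch).

    graph maps a state to a list of (next_state, word, cost) arcs, where
    cost plays the role of a negative log acoustic/language score, and
    heuristic[s] is an admissible estimate of the remaining cost from s.
    """
    # Each stack (priority queue) entry: (cost + heuristic, cost, state, words)
    stack = [(heuristic[start], 0.0, start, [])]
    while stack:
        f, g, state, words = heapq.heappop(stack)
        if state == goal:
            return words, g  # first goal hypothesis popped is the best path
        for nxt, word, cost in graph.get(state, []):
            heapq.heappush(stack,
                           (g + cost + heuristic[nxt], g + cost,
                            nxt, words + [word]))
    return None, float("inf")
```

With an admissible heuristic, the first hypothesis popped at the goal state is guaranteed optimal, which is what makes the stack decoder exact rather than greedy.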
Speech production models
Our work on improved models of the voice source and its interaction with
the vocal tract has led to a detailed understanding of the mechanisms. Data, in terms of the new model, on variations
in natural speech have also been accumulated, both concerning linguistically motivated variations and variations
among speakers. Special emphasis has been placed on the analysis of female voices. Generalised observations from the
analysis work are now implemented as rules for speech synthesis.
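A voice source pulse of the kind discussed above can be sketched with the well-known LF (Liljencrants-Fant) model of differentiated glottal flow. This is only an illustrative stand-in for the improved models described in the text: all parameter values below are invented, and the shape parameter alpha is set directly rather than derived from the model's area-balance condition.

```python
import math

def lf_model(Ee=1.0, Tp=0.0045, Te=0.006, Ta=0.0003, Tc=0.008,
             alpha=400.0, fs=16000):
    """One LF-style differentiated glottal flow pulse (illustrative sketch).

    Open phase (0..Te): growing sinusoid E0 * exp(alpha*t) * sin(wg*t).
    Return phase (Te..Tc): exponential recovery with time constant 1/eps.
    E0 is chosen so the pulse reaches its negative peak -Ee at t = Te.
    """
    wg = math.pi / Tp                       # glottal frequency, wg = pi / Tp
    E0 = -Ee / (math.exp(alpha * Te) * math.sin(wg * Te))
    # Solve eps*Ta = 1 - exp(-eps*(Tc - Te)) by fixed-point iteration.
    eps = 1.0 / Ta
    for _ in range(100):
        eps = (1.0 - math.exp(-eps * (Tc - Te))) / Ta
    pulse = []
    for i in range(int(Tc * fs)):
        t = i / fs
        if t <= Te:                         # open phase
            e = E0 * math.exp(alpha * t) * math.sin(wg * t)
        else:                               # return phase
            e = -(Ee / (eps * Ta)) * (math.exp(-eps * (t - Te))
                                      - math.exp(-eps * (Tc - Te)))
        pulse.append(e)
    return pulse
```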
Articulatory models have recently attracted interest in our laboratory.
Several ways of describing the vocal tract are being investigated, including a full 3D model. Reliable articulatory
reference data still seem to be the most severe bottleneck. Both direct and indirect methods of data collection
have been, and are being, investigated.
In the context of speaker verification, we are engaged in the European
COST 250 and Esprit PICASSO projects.
Work is under way on implementing the speaker verification results in
the CTT project "CTT Bank".
CTT Bank is an experimental demonstrator that will investigate the potential advantages
of using speech for identity verification and user commands in bank telephony. The "PER"
project is an effort to build an automated entrance receptionist, PER (Prototype
Entrance Receptionist). It operates in the central entrance to the department. The purpose is to create and experiment
with alternative speech based means of controlling access to the premises.
In our text-to-speech project, we have increased our efforts on different
speaking styles. Both speaker variation and synthesis of attitudes and emotions are studied. The long-standing efforts
on improved prosodic models and segmental synthesis continue.
Tools for education and prototyping
Our work on new tools continues. It has resulted in a new set of student
labs in speech technology. An interactive dialogue system was created in which students can change and expand the
system functions. A new framework for speech synthesis is the topic of another lab. These labs were used and evaluated
in several classes during 1999. This and other software developments at the department have changed the working
environment for many projects. Fast prototyping based on modules is now part of general experimental designs.
Multi-modal speech synthesis
The audio-visual face synthesis project has attracted considerable attention.
The synthesis is now used in many of the demonstrators under development. Strategies for articulatory synthesis
are under development. The expansion of the model to the internals of the speech production apparatus is well under
way and will lead to a full 3D articulatory model.
In the Teleface project, we work together with the Hearing group, investigating
the usefulness of synthesised faces for hard-of-hearing persons in telecommunication. This project recently received
continued financing through KFB, Nutek and Hjälpmedelsinstitutet, one of the CTT partners.
Two scientific exchanges on the doctoral level to prominent international
centres have taken place within this area. Jonas Beskow recently returned from Santa Cruz and Olov Engwall has
been doing research in Grenoble.
The Teleface project illustrated by a phone call to
a hearing-impaired person equipped with a synthetic face generator to support lip-reading.
Speech technology and disabilities
Speech and language technology for motorically disabled and non-vocal
persons is a major research area. Research on communication disability has been designated a priority area at KTH.
Several ways of increasing the communication speed have been investigated including interactive text prediction
based on linguistic principles. A large national project aiming at computer support programs for persons with reading
and writing difficulties has supported part of this work. Our part of the project was concerned with text prediction.
The European project ENABL started in 1997 and was finished in 1999, with a final review in 2000. It concerns the
use of speech recognition for disabled users of CAD systems.
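Interactive text prediction of the kind used to increase communication speed can be sketched as follows. This is a simple frequency-based stand-in, not the linguistically based prediction described above, and the function names and corpus are invented.

```python
from collections import Counter

def build_predictor(corpus):
    """Word-completion predictor from corpus frequencies (illustrative sketch).

    Returns a function that, given the prefix typed so far, proposes the
    k most frequent corpus words starting with that prefix, so the user
    can select a completion instead of typing the whole word.
    """
    counts = Counter(corpus.lower().split())

    def predict(prefix, k=3):
        matches = [w for w in counts if w.startswith(prefix)]
        # Most frequent first; ties broken alphabetically.
        return sorted(matches, key=lambda w: (-counts[w], w))[:k]

    return predict
```

A real system would rank candidates with linguistic knowledge (part of speech, syntax, recency) rather than raw frequency, but the interaction loop is the same: type a prefix, pick a completion.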
For an extended summary of external activities and projects, see
National and International Contacts.
Open source software
The open source software developed in the speech group has been downloaded
by many sites. It is available at /software/.
Snack is an extension
to the Tcl/Tk scripting language that adds commands for sound I/O and sound visualisation, e.g. waveforms and spectrograms.
Snack serves as a general audio platform giving uniform access to the audio hardware on a number of systems.
The NICO Toolkit
is an artificial neural network toolkit specifically designed and optimized for
automatic speech recognition applications. Networks with both recurrent connections and time-delay windows are supported.
The Broker is a server that forwards function calls, results and error
codes between program modules over the Internet. It was designed to be easy to use in programs, platform independent,
and to have a uniform interface for all modules.
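The forwarding idea can be sketched as a minimal in-process dispatcher. This is an assumption-laden stand-in, not the Broker's actual interface: the real Broker routes calls between modules over the network, and all names below are hypothetical.

```python
class Broker:
    """In-process sketch of broker-style call forwarding (illustrative).

    Modules register named functions; calls are dispatched by module and
    function name, and both results and error messages are forwarded back
    to the caller as a (result, error) pair instead of raising.
    """
    def __init__(self):
        self.modules = {}

    def register(self, module, name, func):
        """Make func callable under (module, name)."""
        self.modules[(module, name)] = func

    def call(self, module, name, *args):
        """Forward a call; return (result, None) or (None, error message)."""
        func = self.modules.get((module, name))
        if func is None:
            return None, "unknown function %s.%s" % (module, name)
        try:
            return func(*args), None
        except Exception as exc:    # forward the error instead of raising
            return None, str(exc)
```

A network version would serialise the (module, name, args) triple over a socket, but the uniform call-and-forward interface is the point of the pattern.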