May 30 - June 1, 2005
Sadaoki Furui, Tokyo Institute of Technology, Japan
Speech recognition and applications
1. Recent progress in spontaneous speech recognition
2. Robust speech recognition and understanding
3. Systematization and application of large-scale knowledge resources
September 20-22, 2004
Nick Campbell, ATR Spoken Language Translation Research Labs, Japan
1. Language, Speech, and Meaning
2. Working with a Corpus of Expressive Speech
3. Synthesising Conversational Speech
June 16-18, 2003
Alex Waibel, Univ Karlsruhe and CMU
1. CHIL Computing to Overcome Techno-Clutter
2. Communicating in a Multilingual World
May 22-24, 2002
Mehryar Mohri, AT&T Labs – Research, USA
1. Weighted Automata Algorithms
2. Finite-State Machine Library (FSM Library)
3. Speech Recognition Applications
April 24-26, 2002
Steven Greenberg, International Computer Science Institute, USA
http://www.icsi.berkeley.edu/~steveng/
What are the essential cues for understanding spoken language?
Automatic phonetic and prosodic annotation of spoken language
Beyond the phoneme: A juncture-accent model of spoken language
April 2 – 4, 2001
Marc Swerts, IPO Eindhoven, The Netherlands
Prosody as a marker of information status
September 12 – 14, 2000
Julia Hirschberg, AT&T Labs – Research, USA
Intonational Variation in Spoken Dialogue Systems: Generation and
Understanding
October 11-13, 1999
Sharon Oviatt, Center for Human-Computer Communication, OGI, USA
June 14 – 16, 1999
Gerard Chollet, ENST, France
May 17 – 19, 1999
Anne Cutler, Max Planck Institute for Psycholinguistics, The Netherlands
Three lectures on human recognition of spoken words:
April 26 – 28, 1999
Chin-Hui Lee, Lucent Technologies, USA
Three lectures on automatic speech and speaker recognition:
November 4-5, 1998
Gösta Bruce, Lund University, Sweden
Swedish Prosody
September 11-16, 1998
Klaus Kohler, University of Kiel, Germany
November, 1997
Paul Heisterkamp, Daimler-Benz, Germany
Thinking in systems; Linguistic analysis and language generation for speech
dialogue systems; Speech dialogue management in and for real world applications
November, 1996
Hynek Hermansky, OGI, USA
February, 1996
Susann Luperfoy
Discourse Processing for Spoken Dialogue Systems
Sadaoki Furui, Tokyo Institute of Technology, Japan
Speech recognition and applications
1. Recent progress in spontaneous speech recognition
This talk
overviews recent progress in the development of corpus-based spontaneous speech
recognition technology. Although speech is spontaneous in almost any real situation, recognition of spontaneous speech is an area that has only recently emerged in the field of automatic speech recognition. Broadening the application of speech
recognition depends crucially on raising recognition performance for
spontaneous speech. For this purpose, it
is necessary to build large spontaneous speech corpora for constructing
acoustic and language models. This talk
focuses on various achievements of a Japanese 5-year national project
“Spontaneous Speech: Corpus and Processing Technology” that has
recently been completed. Because of
various spontaneous-speech-specific phenomena, such as filled pauses, repairs,
hesitations, repetitions and disfluencies,
recognition of spontaneous speech requires various new techniques. These new techniques include flexible
acoustic modeling, sentence boundary detection,
pronunciation modeling, acoustic as well as language
model adaptation, and automatic summarization.
In particular, automatic summarization (including indexing), a process which
extracts important and reliable parts of the automatic transcription, is
expected to play an important role in building various speech archives,
speech-based information retrieval systems, and human-computer dialogue
systems.
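The extraction step can be pictured with a minimal sketch (illustrative only; the combination of ASR word confidence with an IDF-style significance weight is an assumption, not the project's actual scoring):

```python
import math

def extract_summary(words, confidences, doc_freq, n_docs, keep_ratio=0.3):
    """Score each recognized word by combining ASR confidence (reliability)
    with an IDF-style significance weight (importance), then keep the
    top-scoring fraction of words in their original order."""
    scores = [c * math.log(n_docs / (1 + doc_freq.get(w, 0)))
              for w, c in zip(words, confidences)]
    k = max(1, int(len(words) * keep_ratio))
    top = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)[:k]
    return [words[i] for i in sorted(top)]
```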
2. Robust speech recognition and understanding
This talk overviews the major methods
that have recently been investigated for making speech recognition systems more
robust at both acoustic and language processing levels. Improved robustness will enable such systems
to work well over a wide range of unexpected and adverse conditions by helping
them to cope with mismatches between training and testing speech
utterances. This talk focuses on the stochastic matching framework for model parameter adaptation, the use of
constraints in adaptation, automatic speaker change detection and adaptation, noisy
speech recognition using tree-structured noise-adapted HMMs,
spontaneous speech recognition, approaches using dynamic Bayesian networks, and
multimodal speech recognition.
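As one concrete instance of model parameter adaptation in this spirit, here is a minimal MAP-style sketch of Gaussian mean adaptation (the stochastic matching framework discussed in the talk is more general; the prior-weight value is an assumption):

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP re-estimate of a Gaussian mean from adaptation data.
    prior_mean: (D,) speaker-independent mean
    frames:     (T, D) adaptation feature frames
    posteriors: (T,) occupation probabilities for this Gaussian
    tau:        prior weight; larger values mean slower adaptation"""
    frames = np.asarray(frames, dtype=float)
    gamma = np.asarray(posteriors, dtype=float)
    weighted_sum = (gamma[:, None] * frames).sum(axis=0)
    return (tau * np.asarray(prior_mean) + weighted_sum) / (tau + gamma.sum())
```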
3. Systematization and application of large-scale knowledge resources
This talk introduces the five-year COE (Center of Excellence) program on the systematization and application of large-scale knowledge resources.
Nick Campbell, ATR Spoken Language Translation Research Labs, Japan
1. Language, Speech, and Meaning
In
this talk, I shall attempt to describe some of the roles played by prosody in
speech communication, and will relate them to the requirements of computer
speech processing. The talk covers phonetic, linguistic and paralinguistic
aspects of speech.
2. Working with a Corpus of Expressive Speech
This
talk describes the JST/CREST Expressive Speech Processing project, introduces a
very large corpus of conversational speech and describes some of the main
findings of our related research. The talk explores the roles of non-verbal and
paralinguistic information in speech communication.
3. Synthesising Conversational Speech
This
talk addresses the issues of synthesising non-verbal speech and describes a
prototype interface for the synthesis of conversational speech. The synthesised
samples are in Japanese, but I believe they are sufficiently interesting that any inherent language difficulties can be overcome by the listener's higher-level, speech-related interests.
June 16-18, 2003
Alex Waibel, Univ Karlsruhe and CMU
1. CHIL Computing to Overcome Techno-Clutter
2. Communicating in a Multilingual World
Title: CHIL Computing to Overcome Techno-Clutter
Abstract:
After building computers that paid no attention to communicating with humans, we have
in recent years developed ever more sophisticated interfaces that put the
"human in the loop" of computers.
These interfaces have improved usability by providing more appealing
output (graphics, animations), easier-to-use input methods (mouse, pointing,
clicking, dragging) and more natural interaction modes (speech, vision,
gesture, etc.). Yet the productivity
gains that have been promised have largely not been seen and human-machine
interaction still remains a partially frustrating and tedious experience, full
of techno-clutter and excessive attention demanded by the technical artifact. In this talk, I will argue that we must
transition to a third paradigm of computer use, in which we let people interact
with people, and move the machine into the background to observe the humans'
activities and to provide services implicitly, that is, to the extent possible, without explicit request.
Putting the "Computer in the Human Interaction Loop" (CHIL),
instead of the other way round, however, brings formidable technical
challenges. The machine must now always
observe and understand humans, model their activities, their interaction with
other humans, the human state as well as the state of the space they are in,
and finally, infer intentions and needs.
From a perceptual user interface point of view, we must process signals
from sensors that are always on, frequently inappropriately positioned, and
subject to much greater variability. We must not only recognize WHAT was seen
or said in a given space, but also a broad range of additional information,
such as the WHO, WHERE, HOW, TO WHOM, WHY, WHEN of human interaction and
engagement. In this talk, I will describe a variety of multimodal interface
technologies that we have developed to answer these questions and some
preliminary CHIL type services that take advantage of such perceptual
interfaces.
Title: Communicating in a Multilingual World
Abstract:
With
the globalization of society, and the opportunities and threats that go along
with it, multilinguality is now emerging both as an urgent technology requirement and as a formidable scientific challenge.
Formerly often viewed as an uninteresting exercise in adapting (mostly) English
speech and language systems to a foreign language, multilinguality
is rapidly becoming an area of active study and research in its own right. Even
though Speech, Language and MT technologies still lag in performance behind
human capabilities in any one language and much research remains to be done,
researchers are now also actively exploring the added problems introduced by multilinguality:
1. Additional languages exhibit peculiarities that have not been investigated in English but need to be handled effectively, including tonality, morphology, orthography and segmentation.
2. Foreign accents and foreign words/names introduce modeling difficulties in monolingual speech systems.
3. The sheer number of languages (~6000 by most estimates) makes traditional porting approaches, via training on large corpora or rule-based programming, impractical and prohibitively expensive.
4. Multilingual and cross-lingual technologies such as Machine Translation and Cross-lingual Retrieval face scaling problems as the demand for language pairs (N²) and domain coverage rises with languages and applications.
In this talk I will
discuss the problems and current approaches on the road to more effective, more
robust and more portable multilingual speech and language systems and services.
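To make the language-pair scaling concrete, a trivial worked example (illustrative only):

```python
def directed_pairs(n):
    """Directed translation pairs among n languages: n * (n - 1), i.e. ~N^2."""
    return n * (n - 1)

for n in (10, 100, 6000):
    print(f"{n} languages -> {directed_pairs(n):,} directed pairs")
# 10 -> 90; 100 -> 9,900; 6000 -> 35,994,000
```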
May 22-24, 2002
Mehryar Mohri, AT&T Labs – Research, USA
Dr Mohri
is Head of the Speech Algorithms Department at AT&T Labs - Research. The
objective of the department is to design general mathematical frameworks,
efficient algorithms, and fundamental software libraries for large-scale speech
processing problems. Its research areas include automata theory, machine
learning, pattern-matching, natural language processing, automatic speech
recognition, and speech synthesis. His short course will concentrate on the
activities within speech at AT&T Labs, including:
1. Weighted Automata Algorithms
2. Finite-State Machine Library (FSM Library)
3. Speech Recognition Applications
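As a small taste of the first topic, here is a minimal sketch of one classic weighted-automata algorithm, single-source shortest distance over the tropical (min, +) semiring; this is illustrative Python, not code from the FSM Library:

```python
import heapq

def shortest_distance(arcs, start, final):
    """Single-source shortest distance in a weighted acceptor over the
    tropical (min, +) semiring, assuming non-negative arc weights.
    arcs: {state: [(label, weight, next_state), ...]}"""
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, state = heapq.heappop(heap)
        if state == final:
            return d                     # weight of the best accepting path
        if d > best.get(state, float("inf")):
            continue                     # stale queue entry
        for _label, w, nxt in arcs.get(state, []):
            nd = d + w                   # semiring "times" is addition
            if nd < best.get(nxt, float("inf")):
                best[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")
```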
For more information, see http://www.research.att.com/~mohri/
April 24-26, 2002
Steven Greenberg, International Computer Science Institute, USA
http://www.icsi.berkeley.edu/~steveng/
What are the essential cues for understanding spoken language?
Automatic phonetic and prosodic annotation of spoken language
Beyond the phoneme: A juncture-accent model of spoken language
What are the essential acoustic cues for understanding spoken language?
Classical models of speech recognition (by both
human and machine) assume that a detailed, short-term analysis of the signal is
essential for accurately decoding spoken language. Although this stratagem may
work well for carefully enunciated speech spoken in a pristine acoustic
environment, it is far less effective for recognizing speech spoken under more
realistic conditions, such as (1) moderate-to-high levels of background noise,
(2) reverberant acoustic environments and (3) spontaneous, informal conversation.
Statistical analysis of a spontaneous speech corpus suggests that the stability of the linguistic representation is largely a consequence of syllable-length (ca. 200 ms) intervals of analysis.
Perceptual experiments lend support to this
conclusion. In one experiment the spectrum of spoken sentences was partitioned
into critical-band-like channels and the onset of each channel shifted in time
relative to the others so as to desynchronize spectral information across the
frequency axis. Human listeners are highly tolerant of cross-channel spectral
asynchrony induced in this fashion. Intelligibility is highly correlated with
the magnitude of the low-frequency (3-6 Hz) modulation spectrum. A second study
partitioned the spectrum of the signal into 1/3-octave channels
("slits") and measured the intelligibility associated with each
channel presented alone and in concert with the others. Four slits distributed
over the speech-audio range (0.3-6 kHz) are sufficient for listeners to decode
sentential material with nearly 90% accuracy, although more than 70% of the
spectrum is missing. Word recognition often remains relatively high (60-83%)
when just two or three channels are presented concurrently, despite the fact
that the intelligibility of these same slits, presented in isolation, is less
than 9%. Such data suggest that intelligibility is based on a compound
"image" of the modulation spectrum distributed across the frequency
spectrum. Because intelligibility seriously degrades when slits are
desynchronized by more than 25 ms, this compound image is likely to be derived
from both the amplitude and phase components of the modulation spectrum. This
conclusion is supported by the results of a separate experiment in which the
phase and amplitude components of the modulation spectrum were dissociated
through the use of locally (20-180 ms) time-reversed speech segments. The
decline in intelligibility is correlated with the complex modulation spectrum,
reflecting the interaction of the phase and magnitude components of the modulation
pattern distributed across the frequency spectrum. The magnitude component of
the modulation spectrum (i.e., the modulation index), by itself, is a poor
predictor of speech intelligibility under these conditions. Such results may
account for why measures of intelligibility based solely on modulation
magnitude (such as the Speech Transmission Index) are not entirely successful
in predicting how well listeners understand spoken language under a wide range
of conditions.
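For readers who want to experiment, a minimal sketch of measuring the share of 3-6 Hz modulation energy in a signal's amplitude envelope (the Hilbert-envelope method here is an assumption, not necessarily the exact procedure used in these studies):

```python
import numpy as np
from scipy.signal import hilbert

def modulation_energy_3_6hz(signal, fs):
    """Fraction of a signal's envelope-modulation energy lying in 3-6 Hz."""
    envelope = np.abs(hilbert(signal))            # amplitude envelope
    envelope = envelope - envelope.mean()         # remove DC
    spectrum = np.abs(np.fft.rfft(envelope))
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)
    band = (freqs >= 3.0) & (freqs <= 6.0)
    return spectrum[band].sum() / spectrum.sum()
```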
Automatic phonetic and prosodic transcription
provides an invaluable source of empirical material with which to train
automatic speech recognition systems and derive fundamental insights into the
nature of spoken language. This presentation focuses on the development of automatic labeling systems for two separate (albeit related) tiers of spoken language: articulatory features (such as voicing, place and manner of articulation) and stress accent (related to the acoustic prominence associated with sequences of syllables across an utterance).
ARTICULATORY-ACOUSTIC FEATURES
A novel framework for automatic articulatory-acoustic feature extraction has been developed
for enhancing the accuracy of place- and manner-of-articulation classification
in spoken language. The "elitist" approach focuses on frames for which multilayer perceptron (neural network) classifiers are highly confident, and discards the rest. Using
this method, it is possible to achieve a frame-level accuracy of 93% for manner
information on a corpus of American English sentences passed through a
telephone network (NTIMIT). Place-of-articulation information is extracted for
each manner class independently, resulting in an appreciable gain in
place-feature classification relative to performance for a manner-independent
system. A comparable gain in classification performance for the elitist
approach is evidenced when applied to a Dutch corpus of quasi-spontaneous
telephone interactions (VIOS). The elitist framework provides a potential means
of automatically annotating a corpus at the phonetic level without recourse to
a word-level transcript and could thus be of utility for developing training
materials for automatic speech recognition and speech synthesis applications,
as well as aiding the empirical study of spoken language.
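The core of the elitist idea, selecting only high-confidence frames, can be sketched in a few lines (illustrative; the 0.9 threshold is an assumed parameter):

```python
import numpy as np

def elitist_select(posteriors, threshold=0.9):
    """Keep only frames where the classifier's top posterior exceeds the
    threshold; discard the rest.
    posteriors: (T, C) per-frame class posteriors from an MLP.
    Returns (indices of confident frames, their predicted classes)."""
    posteriors = np.asarray(posteriors)
    confident = posteriors.max(axis=1) >= threshold
    return np.flatnonzero(confident), posteriors.argmax(axis=1)[confident]
```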
STRESS-ACCENT PATTERNS
There is a systematic relationship between
stress accent and vocalic identity in spontaneous English discourse (the Switchboard
corpus composed of telephone dialogues). Low vowels are much more likely to be
fully accented than their high vocalic counterparts. And conversely, high
vowels are far more likely to lack stress accent than low or mid vocalic
segments. Such patterns imply that stress accent and vocalic identity
(particularly vowel height) are bound together at some level of lexical
representation. Statistical analysis of a manually annotated corpus
(Switchboard) indicates that vocalic duration is likely to serve as an
important acoustic cue for stress accent, particularly for diphthongs and the
low, tense monophthongs. Multilayer perceptrons (MLPs) were trained
on a portion of this annotated material in order to automatically label the
corpus with respect to stress accent. The automatically derived labels are
highly concordant with those of human transcribers (79% concordant within a quarter-step of accent level and 97.5% concordant within a half-step). In order to achieve such a high degree of concordance it is necessary
to include features pertaining not only to the duration and amplitude of the
vocalic nuclei, but also those associated with speaker gender, syllabic
duration and, most importantly, vocalic identity. Such results suggest that vocalic
identity is intimately associated with stress accent in spontaneous American
English (and vice versa), thereby providing a potential foundation with which
to model pronunciation variation for automatic speech recognition.
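The concordance figures quoted above correspond to a simple tolerance-based metric; a minimal sketch (assuming accent levels are encoded as numbers on a common scale):

```python
import numpy as np

def concordance(pred, ref, tolerance):
    """Fraction of syllables whose predicted accent level is within
    `tolerance` of the reference (0.25 = quarter-step, 0.5 = half-step)."""
    pred, ref = np.asarray(pred, float), np.asarray(ref, float)
    return float(np.mean(np.abs(pred - ref) <= tolerance))
```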
Beyond the Phoneme: A Juncture-Accent Model of
Spoken Language
Current-generation speech recognition systems
generally represent words as sequences of phonemes. This "phonemic beads
on a string" approach is based on a fundamental misconception about the
nature of spoken language and makes it difficult (if not well-nigh impossible) to
accurately model the pronunciation variation observed in spontaneous speech.
Five hours of spontaneous dialogue material (from the Switchboard corpus) has
been manually labeled and segmented, and a subset of
this corpus has been manually annotated with respect to stress accent (i.e.,
the prosodic emphasis placed on a syllable). Statistical analysis of the corpus
indicates that much of the pronunciation variation observed at the phonetic-
segment level can be accounted for in terms of stress-accent pattern and
position of the segment within the syllable.
Such analyses imply that spoken language can be
thought of as a sequence of accent peaks (associated with vocalic nuclei)
separated by junctures of variable type and length. These junctures are
associated with what are generally termed "consonants"; however, many consonants do not behave as segments proper,
but rather serve to define the nature of the separation (i.e., juncture)
between adjacent syllables.
A "juncture-accent" model (JAM) of
spoken language has interesting implications for auditory models of speech
processing as well as consequences for developing future-generation speech
recognition systems.
April 2 – 4, 2001
Marc Swerts, IPO Eindhoven, The Netherlands
Prosody as a marker of information status
Prosody can
be defined as the ensemble of suprasegmental speech
features (speech melody, tempo, rhythm, loudness, etc.). The three lectures
will focus on the use of prosody to mark the status of the information which is
exchanged between speaking partners in natural dialogue. First, we will
introduce recent findings of comparative analyses of Dutch, Italian and
Japanese, showing how prosody is exploited in these languages to distinguish
important bits of information from unimportant ones. Then, given that spoken
conversation represents a rather uncertain medium, one
central activity of dialogue participants is to "negotiate" about the
information being transmitted. We will discuss how prosody can be useful in
this process of grounding information. Finally, we
will see how these findings can be exploited to improve the interactions
between humans and a spoken dialogue system, in particular focusing on how
prosody can be useful for error recovery.
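One simple way such findings could be operationalized is a prosodic correction detector; the cues and threshold below are illustrative assumptions, not Swerts's model:

```python
def looks_like_correction(f0_range, duration, base_f0_range, base_duration,
                          factor=1.3):
    """Flag a user turn as a likely correction when its pitch range and
    duration clearly exceed the speaker's baseline values."""
    return f0_range > factor * base_f0_range and duration > factor * base_duration
```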
September 12 – 14, 2000
Julia Hirschberg, AT&T Labs – Research, USA
Intonational Variation in Spoken Dialogue Systems: Generation and
Understanding
Summary
The
variation of prosodic features such as contour, accent, phrasing, pitch range,
loudness, speaking rate, and timing contributes important information to the
interpretation of spoken dialogue. To create dialogue systems that interact
naturally with human users, it is important both to generate natural speech and
to understand what users are saying, and intonational
contributions are important to both. This course of lectures will discuss past
findings and current research on the role of intonational variation in speech synthesis/concept-to-speech and speech recognition/understanding for spoken dialogue systems.
October 11-13, 1999
Sharon Oviatt, Center for Human-Computer Communication, OGI, USA
Lecture 1: Modeling hyperarticulate speech to interactive systems
When using interactive
systems, people adapt their speech during attempts to resolve system
recognition errors. This talk will summarize the two-stage Computer-elicited Hyperarticulate Adaptation Model (CHAM), which accounts for
systematic changes in human speech during interactive error handling. It will
summarize the empirical findings and linguistic theory upon which CHAM is
based, as well as the model's main predictions. Finally, implications of CHAM
will be discussed for designing future interactive systems with improved error
handling.
Lecture 2: Mutual disambiguation of recognition errors in a multimodal architecture
As a new generation of
multimodal systems begins to define itself, researchers are attempting to learn
how to combine different modes into strategically integrated whole systems. In
theory, well designed multimodal systems should be able to integrate
complementary modalities in a manner that supports mutual disambiguation of
errors and leads to more robust performance. In this talk, I will discuss the
results of two different studies that documented mutual disambiguation for
diverse users (i.e., accented speakers) and challenging usage contexts (i.e.,
mobile use in naturalistic noisy environments). Implications will be discussed
for the development of future multimodal architectures that can perform in a
more robust and stable manner than individual recognition technologies.
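The idea of mutual disambiguation can be illustrated with a toy fusion of two n-best lists (all scores and the consistency table are invented for illustration):

```python
# Each recognizer alone may rank the wrong hypothesis first; the best
# *jointly consistent* pair recovers the intended command.
speech_nbest = [("plan left", 0.5), ("pan left", 0.4)]      # speech ranks wrong
gesture_nbest = [("point_map", 0.6), ("circle_region", 0.3)]
consistent = {("pan left", "point_map"), ("plan left", "circle_region")}

def fuse(speech, gesture, consistent):
    pairs = [(s_score * g_score, s, g)
             for s, s_score in speech
             for g, g_score in gesture
             if (s, g) in consistent]
    return max(pairs) if pairs else None

print(fuse(speech_nbest, gesture_nbest, consistent))
# -> (0.24, 'pan left', 'point_map'): the joint score overrides speech's top choice
```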
Lecture 3: Designing and evaluating conversational interfaces with animated characters
There currently is
considerable interest in the development of robust conversational interfaces,
especially ones that make effective use of animated characters. This talk will
discuss the design of effective conversational interfaces, including: (1) new
mobile research infrastructure that we've been using to conduct multimodal
studies in natural field settings, (2) the I SEE! application (Immersive
Science Education for Elementary kids) that has been developed as a test-bed
for studying the design of conversational interfaces that incorporate animated
characters in educational technology, and (3) the results of a study that
compared the spoken language of 6-to-10-year-old children while interacting
with an animated character versus a human adult partner.
May 17 – 19, 1999
Anne Cutler, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
Three lectures on human recognition of spoken words:
Lecture 1: The psycholinguistic approach
Collecting
listening data in the laboratory. Detection tasks, decision tasks, generation
tasks. Examining the time-course of spoken-word recognition
with "on-line" tasks; advantages and disadvantages of reaction-time
data versus "off-line" data. Models of human
word recognition. Simultaneous activation of word
candidates and inter-word competition. The interplay between (computational)
modelling and experiment; evidence for and against current models.
Lecture 2: From input to lexicon
Hypotheses about the
lexical access code: features, phonemes, syllables and other units. The relative weight of matching versus mismatching input. Segmental versus suprasegmental
information. Autonomy versus interaction in prelexical
processing: can feedback from the lexicon help listeners to process the input,
and is there evidence of such feedback? Recognition of words
in isolation versus words in continuous speech.
Lecture 3: Language-specificity
The
structure of the vocabulary and the language-specificity of lexical
representations.
The role in recognition of phonetic sequencing constraints,
word boundary effects, transitional probability, assimilation and elision
phenomena. How vocabulary structure can affect the relative importance
of different types of phonetic information in activation. Native-language,
second-language and bilingual word processing.
November, 1997
Paul Heisterkamp, Daimler-Benz AG, Research and Technology, Speech Understanding, Germany
Thinking in systems
Lecture contents:
Lesson I: Introduction, Speech dialogue in a constructivist perspective, Speech understanding as an abductive process, Speech Dialogue Systems and their components, Natural Language Analysis for speech dialogue
Lesson II: Dialogue Management. Current practice & future perspectives, Finite-State Dialogues (see the sketch at the end of this listing), Dialogue management as rational behaviour, Dialogue management as local optimization, Tools
Lesson III: Natural Language Generation for speech dialogue, Conclusion: Requirements for fully conversational systems