Bullet courses at CTT
June 6th - 8th, 2011
- Control vs Coordination
- Rhythm and Synchronization
- The Big Picture
May 30-June 01, 2005
Sadaoki Furui,
Tokyo Institute of Technology, Japan
1. Recent progress in spontaneous speech recognition slides
2. Robust speech recognition and understanding slides
3. Systematization and application of large-scale knowledge resources slides
Note that the slides are only accessible within TMH and CTT.
More information
September 20-22, 2004
Nick Campbell,
ATR Spoken Language Translation Research Labs, Japan
1. Language, Speech, and Meaning
2. Working with a Corpus of Expressive Speech
3. Synthesising Conversational Speech
More information
June 16-18, 2003
Alex Waibel, Univ Karlsruhe and CMU
1. CHIL: Computing to Overcome Techno-Clutter
2. Communicating in a Multilingual World
More information
May 22-24, 2002
Mehryar Mohri, AT&T Labs – Research, USA
1. Weighted Automata Algorithms
2. Finite-State Machine Library (FSM Library)
3. Speech Recognition Applications
More information
April 24-26, 2002
Steven Greenberg, International Computer Science Institute, Berkeley
http://www.icsi.berkeley.edu/~steveng/
1. What are the essential cues for understanding spoken language?
2. Automatic phonetic and prosodic annotation of spoken language
3. Beyond the phoneme: A juncture-accent model of spoken language
More information
April 2-4, 2001
Marc Swerts, IPO Eindhoven, Holland
Prosody as a marker of information status
More information
September 12-14, 2000
Julia Hirschberg, AT&T Labs – Research, USA
Intonational Variation in Spoken Dialogue Systems: Generation and Understanding
More information
October 11-13, 1999
Sharon Oviatt, Center for Human-Computer Communication, OGI, USA
- Modeling hyperarticulate speech to interactive systems
- Mutual disambiguation of recognition errors in a multimodal architecture
- Designing and evaluating conversational interfaces with animated characters
More information
June 14-16, 1999
Gerard Chollet, ENST, Paris, France
- Speech analysis, coding, synthesis and recognition: time-frequency representations, wavelets, temporal decomposition, time-dependent models, analysis by synthesis, H+N, HMMs, Markov fields, …
- ALISP (Automatic Language Independent Speech Processing): learning from examples, segmental models, speaker normalisation, very low bit rate coding, multilingual speech recognition, …
- Applications: identity verification, decision fusion, interactive voice servers, 'Majordome', multicasting, …
May 17 – 19, 1999
Anne Cutler, Max Planck Institute for Psycholinguistics, Nijmegen, Holland
- The psycholinguistic approach
- From input to lexicon
- Language-specificity
More information
April 26-28, 1999
Chin-Hui Lee, Lucent Technologies, USA
Three lectures on automatic speech and speaker recognition:
- Robust speech recognition -- overview of the statistical pattern recognition approach and the implied robustness problems and solutions.
- Speaker and speech verification -- overview of the statistical pattern verification approach and applications to speaker and speech verification.
- A detection approach to speech recognition and understanding -- a paradigm that combines feature-based detection and state-of-the-art recognition to open up new possibilities.
November 4 - 5, 1998
Gösta Bruce, Lund University, Sweden
Swedish Prosody
September 11-16, 1998
Klaus Kohler, Kiel University, Germany
- Labelled Speech Data Banks and Phonetic Research
- The structure of the Kiel data base
- Examples of segmental data analysis in German
- Multilingual studies of connected speech
- Modeling and labelling prosody
November, 1997
Paul Heisterkamp, Daimler-Benz, Germany
- Thinking in systems
- Linguistic analysis and language generation for speech dialogue systems
- Speech dialogue management in and for real-world applications
More information
November, 1996
Hynek Hermansky, OGI, USA
- Should recognizers have ears?
- Perceptual Linear Prediction
- Neglected temporal domain
- RASTA Processing
- Processing of modulation spectrum of speech
- Towards multi-stream processing of speech
February, 1996
Susann Luperfoy, MITRE, USA
Discourse Processing for Spoken Dialogue Systems
More Information about Bullet Courses at CTT
May 30-June 01, 2005
Sadaoki Furui,
Tokyo Institute of Technology, Japan
- Recent progress in spontaneous speech recognition
This talk overviews recent progress in the development of corpus-based
spontaneous speech recognition technology. Although speech is spontaneous in almost any situation, recognition of spontaneous speech is an area that has only recently emerged in the field of automatic
speech recognition. Broadening the application of speech recognition
depends crucially on raising recognition performance for spontaneous
speech. For this purpose, it is necessary to build large spontaneous
speech corpora for constructing acoustic and language models. This
talk focuses on various achievements of a Japanese 5-year national
project "Spontaneous Speech: Corpus and Processing Technology" that
has recently been completed. Because of various spontaneous-speech
specific phenomena, such as filled pauses, repairs, hesitations,
repetitions and disfluencies, recognition of spontaneous speech
requires various new techniques. These new techniques include flexible
acoustic modeling, sentence boundary detection, pronunciation
modeling, acoustic as well as language model adaptation, and automatic
summarization. In particular, automatic summarization including indexing, a process that extracts important and reliable parts of the automatic transcription, is expected to play an important role in
building various speech archives, speech-based information retrieval
systems, and human-computer dialogue systems.
- Robust speech recognition and understanding
This talk overviews the major methods that have recently been
investigated for making speech recognition systems more robust at both
acoustic and language processing levels. Improved robustness will
enable such systems to work well over a wide range of unexpected and
adverse conditions by helping them to cope with mismatches between
training and testing speech utterances. This talk focuses on the stochastic matching framework for model parameter adaptation, the use of
constraints in adaptation, automatic speaker change detection and
adaptation, noisy speech recognition using tree-structured
noise-adapted HMMs, spontaneous speech recognition, approaches using
dynamic Bayesian networks, and multimodal speech
recognition.
- Systematization and application of large-scale knowledge resources
This talk introduces the five-year COE (Center of Excellence) program
"Framework for Systematization and Application of Large-scale
Knowledge Resources," which was recently launched at Tokyo Institute of
Technology. The project is conducting a wide range of
interdisciplinary research combining humanities and technology to
create a framework for the systematization and the application of
large-scale knowledge resources in electronic forms. Spontaneous
speech, written language, materials for e-learning and multimedia
teaching, classical literature and historical documents, as well as
information on cultural properties are just some examples of the real
knowledge resources being targeted by the project. These resources
will be systematized according to their respective semantic
structures. Pioneering new academic disciplines and educating
knowledge resource researchers are also objectives of the
project. Large-scale systems for computation and information storage,
as well as retrieval, have been installed to support this research and
education.
September 20-22, 2004
Nick Campbell
ATR Spoken Language Translation Research Labs, Japan
- Language, Speech, and Meaning
In this talk, I shall attempt to describe some
of the roles played by prosody in speech communication, and will relate them to
the requirements of computer speech processing. The talk covers phonetic, linguistic
and paralinguistic aspects of speech.
- Working with a Corpus of Expressive Speech
This talk describes the JST/CREST Expressive
Speech Processing project, introduces a very large corpus of conversational
speech and describes some of the main findings of our related research. The
talk explores the roles of non-verbal and paralinguistic information in speech
communication.
- Synthesising Conversational Speech
This talk addresses the issues of synthesising
non-verbal speech and describes a prototype interface for the synthesis of
conversational speech. The synthesised samples are in Japanese, but I believe
that they are sufficiently interesting that any inherent language difficulties might
be overcome by higher-level speech-related interests.
June 16-18, 2003
Alex Waibel,
Univ Karlsruhe and CMU
1. CHIL: Computing to Overcome Techno-Clutter
After building computers that paid no attention to communicating with humans, we have in recent years developed ever more sophisticated interfaces that put the "human in the loop" of computers. These interfaces have improved usability by providing more appealing output (graphics, animations), easier-to-use input methods (mouse, pointing, clicking, dragging) and more natural interaction modes (speech, vision, gesture, etc.). Yet the productivity gains that have
been promised have largely not materialized, and human-machine interaction still remains a partially frustrating and tedious experience, full of techno-clutter and excessive attention required by the technical artifact. In this talk, I will argue that we must transition to a third paradigm of computer use, in which we let people interact with people, and move the machine into the background to observe the humans' activities and to provide services implicitly, that is, to the extent possible, without explicit request. Putting the "Computer in the Human
Interaction Loop" (CHIL), instead of the other way round, however, brings
formidable technical challenges. The machine must now always observe and understand humans, model their
activities, their interaction with other humans, the human state as well as the
state of the space they are in, and finally, infer intentions and needs. From a perceptual user interface point
of view, we must process signals from sensors that are always on, frequently inappropriately positioned, and subject to much greater variability. We must recognize not only WHAT was seen or said in a given space, but also a broad range of additional information, such as the WHO, WHERE, HOW, TO WHOM, WHY and WHEN of human interaction and engagement. In this talk, I will describe a variety of
multimodal interface technologies that we have developed to answer these
questions and some preliminary CHIL type services that take advantage of such
perceptual interfaces.
2. Communicating in a Multilingual World
With the globalization of society, and the
opportunities and threats that go along with it, multilinguality is now
emerging both as an urgent technology requirement as well as a formidable
scientific challenge. Formerly often viewed as an uninteresting exercise in
adapting (mostly) English speech and language systems to a foreign language,
multilinguality is rapidly becoming an area of active study and research in its
own right. Even though Speech, Language and MT technologies still lag in
performance behind human capabilities in any one language and much research
remains to be done, researchers are now also actively exploring the added
problems introduced by multilinguality: 1) additional languages add peculiarities that have not been investigated in English but need to be handled effectively, including tonality, morphology, orthography and segmentation; 2) foreign accents and foreign words/names introduce modeling difficulties in monolingual speech systems; 3) the sheer number of languages (~6000 by most estimates) makes traditional porting approaches via training on large corpora or rule-based programming impractical and prohibitively expensive; 4) multilingual and cross-lingual technologies such as Machine Translation and Cross-lingual Retrieval face scaling problems as demand for language pairs (which grows roughly as N²) and domain coverage rises with the number of languages and applications. In this talk I will discuss the problems and current approaches
on the road to more effective, more robust and more portable multilingual speech
and language systems and services.
May 22-24, 2002
Mehryar Mohri, AT&T Labs – Research, USA
Dr Mohri is Head of the Speech Algorithms Department at AT&T Labs - Research.
The objective of the department is to design general mathematical frameworks, efficient
algorithms, and fundamental software libraries for large-scale speech
processing problems. Its research areas include automata theory, machine learning, pattern matching, natural language processing, automatic speech recognition, and speech synthesis. His short course will concentrate on the
activities within speech at AT&T Labs, including:
- Weighted Automata Algorithms
- Finite-State Machine Library (FSM Library)
- Speech Recognition Applications
http://www.research.att.com/~mohri/
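To give a concrete flavour of the first topic above, here is a toy sketch of a shortest-distance computation over a weighted acceptor in the tropical (min, +) semiring, the kind of operation a weighted-automata toolkit performs when searching for the best-scoring path. The states, arcs and function name are invented for illustration; this is not the FSM Library's actual API.

```python
# Minimal illustration of shortest distance over a weighted acceptor in the
# tropical (min, +) semiring. States are integers; arcs are
# (source, label, weight, destination). NOT the AT&T FSM Library API.
import heapq
from collections import defaultdict

def shortest_distance(arcs, start, final):
    """Return the minimum total weight (negative log score) from start to final."""
    graph = defaultdict(list)
    for src, label, weight, dst in arcs:
        graph[src].append((weight, dst, label))

    best = {start: 0.0}
    queue = [(0.0, start)]            # Dijkstra applies because weights are >= 0
    while queue:
        dist, state = heapq.heappop(queue)
        if state == final:
            return dist
        if dist > best.get(state, float("inf")):
            continue                  # stale queue entry
        for weight, nxt, _ in graph[state]:
            cand = dist + weight      # "+" is the extend operation of the semiring
            if cand < best.get(nxt, float("inf")):   # "min" is the collect operation
                best[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    return float("inf")

# A toy acceptor with two competing paths for the same label sequence.
arcs = [(0, "a", 1.2, 1), (0, "a", 0.7, 2), (1, "b", 0.3, 3), (2, "b", 1.1, 3)]
print(shortest_distance(arcs, start=0, final=3))   # -> 1.5 (path 0-1-3)
```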
April 24-26, 2002
Steven Greenberg, International Computer Science Institute, Berkeley
1. What are the essential acoustic cues for understanding spoken language?
Classical models of speech
recognition (by both human and machine) assume that a detailed, short-term
analysis of the signal is essential for accurately decoding spoken language.
Although this stratagem may work well for carefully enunciated speech spoken in
a pristine acoustic environment, it is far less effective for recognizing
speech spoken under more realistic conditions, such as (1) moderate-to-high
levels of background noise, (2) reverberant acoustic environments and (3)
spontaneous, informal conversation. Statistical analysis of a spontaneous
speech corpus suggests that the stability of the linguistic representation is
largely a consequence of syllabic (ca. 200 ms) intervals of analysis.
Perceptual experiments
lend support to this conclusion. In one experiment the spectrum of spoken
sentences was partitioned into critical-band-like channels and the onset of
each channel shifted in time relative to the others so as to desynchronize
spectral information across the frequency axis. Human listeners are highly
tolerant of cross-channel spectral asynchrony induced in this fashion.
Intelligibility is highly correlated with the magnitude of the low-frequency
(3-6 Hz) modulation spectrum. A second study partitioned the spectrum of the
signal into 1/3-octave channels ("slits") and measured the
intelligibility associated with each channel presented alone and in concert
with the others. Four slits distributed over the speech-audio range (0.3-6
kHz) are sufficient for listeners to decode sentential material with nearly 90% accuracy, although more than 70% of the spectrum is missing. Word
recognition often remains relatively high (60-83%) when just two or three
channels are presented concurrently, despite the fact that the intelligibility
of these same slits, presented in isolation, is less than 9%. Such data suggest
that intelligibility is based on a compound "image" of the modulation
spectrum distributed across the frequency spectrum. Because intelligibility
seriously degrades when slits are desynchronized by more than 25 ms, this
compound image is likely to be derived from both the amplitude and phase
components of the modulation spectrum. This conclusion is supported by the
results of a separate experiment in which the phase and amplitude components of
the modulation spectrum were dissociated through the use of locally (20-180 ms)
time-reversed speech segments. The decline in intelligibility is correlated
with the complex modulation spectrum, reflecting the interaction of the phase
and magnitude components of the modulation pattern distributed across the
frequency spectrum. The magnitude component of the modulation spectrum (i.e.,
the modulation index), by itself, is a poor predictor of speech intelligibility
under these conditions. Such results may account for why measures of
intelligibility based solely on modulation magnitude (such as the Speech
Transmission Index) are not entirely successful in predicting how well
listeners understand spoken language under a wide range of conditions.
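As a rough illustration of the quantity discussed above, the sketch below estimates how much of a signal's envelope modulation energy falls in the 3-6 Hz band: the amplitude envelope is obtained with a Hilbert transform and its spectrum is integrated over that band. This is a generic, simplified computation, not the analysis pipeline used in the experiments described here; the input file name is a placeholder.

```python
# Hedged sketch: estimate the share of low-frequency (3-6 Hz) modulation energy
# in a speech waveform. Mirrors the general idea of a modulation spectrum, not
# the exact analysis used in the experiments described above.
import numpy as np
from scipy.signal import hilbert
from scipy.io import wavfile

def modulation_energy(signal, fs, band=(3.0, 6.0)):
    signal = signal.astype(np.float64)
    envelope = np.abs(hilbert(signal))          # amplitude envelope
    envelope -= envelope.mean()                 # remove DC so 0 Hz does not dominate
    spectrum = np.abs(np.fft.rfft(envelope))
    freqs = np.fft.rfftfreq(len(envelope), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    return spectrum[in_band].sum() / spectrum.sum()   # fraction of energy in 3-6 Hz

if __name__ == "__main__":
    fs, audio = wavfile.read("utterance.wav")   # placeholder file name
    if audio.ndim > 1:                          # fold stereo to mono if needed
        audio = audio.mean(axis=1)
    print(f"3-6 Hz share of modulation energy: {modulation_energy(audio, fs):.3f}")
```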
2. Automatic phonetic and prosodic annotation of spoken language
Automatic phonetic and
prosodic transcription provides an invaluable source of empirical material with
which to train automatic speech recognition systems and derive fundamental
insights into the nature of spoken language. This presentation focuses on
development of automatic labeling systems for two separate (albeit related)
tiers of spoken language: articulatory features (such as voicing, place and manner of articulation) and stress-accent (related to the acoustic prominence associated with sequences of syllables across an utterance).
ARTICULATORY-ACOUSTIC FEATURES
A novel framework for
automatic articulatory-acoustic feature extraction has been developed for
enhancing the accuracy of place- and manner-of-articulation classification in
spoken language. The "elitist" approach focuses on frames for which multilayer perceptron (MLP) neural-network classifiers are highly confident, and discards the
rest. Using this method, it is possible to achieve a frame-level accuracy of
93% for manner information on a corpus of American English sentences passed
through a telephone network (NTIMIT). Place-of-articulation information is
extracted for each manner class independently, resulting in an appreciable gain
in place-feature classification relative to performance for a
manner-independent system. A comparable gain in classification performance for
the elitist approach is evidenced when applied to a Dutch corpus of quasi-
spontaneous telephone interactions (VIOS). The elitist framework provides a
potential means of automatically annotating a corpus at the phonetic level
without recourse to a word-level transcript and could thus be of utility for
developing training materials for automatic speech recognition and speech
synthesis applications, as well as aid the empirical study of spoken language.
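A minimal sketch of the frame-selection idea behind the "elitist" approach follows: frames whose maximum classifier posterior falls below a confidence threshold are discarded before feature labels are assigned. The threshold, class inventory and toy posteriors are illustrative assumptions, not parameters of the system described above.

```python
# Hedged sketch of "elitist" frame selection: keep only frames for which the
# classifier's posterior distribution over manner classes is highly confident.
import numpy as np

MANNER_CLASSES = ["vowel", "stop", "fricative", "nasal", "approximant", "silence"]

def elitist_select(posteriors, threshold=0.9):
    """posteriors: (n_frames, n_classes) array of per-frame MLP outputs.
    Returns (kept_frame_indices, predicted_labels_for_kept_frames)."""
    confidence = posteriors.max(axis=1)
    keep = np.where(confidence >= threshold)[0]          # confident frames only
    labels = [MANNER_CLASSES[i] for i in posteriors[keep].argmax(axis=1)]
    return keep, labels

# Toy posteriors for 4 frames: frames 0 and 3 are confident, 1 and 2 are not.
post = np.array([
    [0.95, 0.01, 0.01, 0.01, 0.01, 0.01],
    [0.40, 0.35, 0.10, 0.05, 0.05, 0.05],
    [0.30, 0.30, 0.20, 0.10, 0.05, 0.05],
    [0.02, 0.92, 0.02, 0.02, 0.01, 0.01],
])
print(elitist_select(post))   # -> (array([0, 3]), ['vowel', 'stop'])
```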
STRESS-ACCENT PATTERNS
There is a systematic
relationship between stress accent and vocalic identity in spontaneous English
discourse (the Switchboard corpus composed of telephone dialogues). Low vowels
are much more likely to be fully accented than their high vocalic counterparts.
And conversely, high vowels are far more likely to lack stress accent than low
or mid vocalic segments. Such patterns imply that stress accent and vocalic
identity (particularly vowel height) are bound together at some level of
lexical representation. Statistical analysis of a manually annotated corpus
(Switchboard) indicates that vocalic duration is likely to serve as an
important acoustic cue for stress accent, particularly for diphthongs and the low,
tense monophthongs. Multilayer perceptrons (MLPs) were trained on a portion of
this annotated material in order to automatically label the corpus with respect
to stress accent. The automatically derived labels are highly concordant with
those of human transcribers (79% concordant within a quarter-step of accent level and 97.5% within a half-step). In order to
achieve such a high degree of concordance it is necessary to include features
pertaining not only to the duration and amplitude of the vocalic nuclei, but
also those associated with speaker gender, syllabic duration and most
importantly, vocalic identity. Such results suggest that vocalic identity is
intimately associated with stress accent in spontaneous American English (and
vice versa), thereby providing a potential foundation with which to model
pronunciation variation for automatic speech recognition.
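To make the feature set concrete, here is a small hypothetical sketch of training an MLP to predict a stress-accent label from the kinds of per-syllable features named above (vocalic duration and amplitude, speaker gender, syllable duration and vocalic identity). The feature encoding, synthetic data and network size are illustrative assumptions, not the configuration used in the study.

```python
# Hedged sketch: an MLP mapping per-syllable features to a binary stress-accent
# label. Feature encoding, synthetic data and network size are illustrative only.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

VOWELS = ["iy", "ih", "eh", "ae", "aa", "ao", "uw"]

def one_hot(values, categories):
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    out[np.arange(len(values)), [index[v] for v in values]] = 1.0
    return out

rng = np.random.default_rng(0)
n = 500
vowels = rng.choice(VOWELS, size=n)                    # vocalic identity
gender = rng.choice(["f", "m"], size=n)                # speaker gender
voc_dur = rng.normal(120, 40, size=n)                  # vocalic duration (ms)
syl_dur = voc_dur + rng.normal(80, 30, size=n)         # syllable duration (ms)
amplitude = rng.normal(60, 8, size=n)                  # vocalic amplitude (dB)

# Toy labels: longer, louder syllables with low vowels are more often accented,
# loosely mimicking the statistical tendencies reported above.
low_vowel = np.isin(vowels, ["ae", "aa", "ao"]).astype(float)
accent = (0.01 * voc_dur + 0.05 * amplitude + low_vowel
          + rng.normal(0, 1, n) > 5.2).astype(int)

X = np.column_stack([voc_dur, amplitude, syl_dur,
                     one_hot(vowels, VOWELS), one_hot(gender, ["f", "m"])])
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0))
clf.fit(X, accent)
print(f"training accuracy on synthetic data: {clf.score(X, accent):.2f}")
```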
3. Beyond the Phoneme: A Juncture-Accent Model of Spoken Language
Current-generation
speech recognition systems generally represent words as sequences of phonemes.
This "phonemic beads on a string" approach is based on a fundamental
misconception about the nature of spoken language and makes it difficult (if not well-nigh impossible) to accurately model the pronunciation variation observed
in spontaneous speech. Five hours of spontaneous dialogue material (from the
Switchboard corpus) has been manually labeled and segmented, and a subset of
this corpus has been manually annotated with respect to stress accent (i.e.,
the prosodic emphasis placed on a syllable). Statistical analysis of the corpus
indicates that much of the pronunciation variation observed at the phonetic-
segment level can be accounted for in terms of stress-accent pattern and
position of the segment within the syllable.
Such analyses imply
that spoken language can be thought of as a sequence of accent peaks
(associated with vocalic nuclei) separated by junctures of variable type and
length. These junctures are associated with what are generally termed "consonants"; however, many consonants do not behave as segments
proper, but rather serve to define the nature of the separation (i.e.,
juncture) between adjacent syllables.
A "juncture-accent" model (JAM) of spoken language has interesting
implications for auditory models of speech processing as well as consequences
for developing future-generation speech recognition systems.
April 2 – 4, 2001
Marc Swerts, IPO Eindhoven, Holland
Prosody as a marker of information status
Prosody can be defined as the ensemble of suprasegmental speech
features (speech melody, tempo, rhythm, loudness, etc.). The three lectures
will focus on the use of prosody to mark the status of the information which is
exchanged between speaking partners in natural dialogue. First, we will
introduce recent findings of comparative analyses of Dutch, Italian and
Japanese, showing how prosody is exploited in these languages to distinguish
important bits of information from unimportant ones. Then, given that spoken
conversation represents a rather uncertain medium, one central activity of
dialogue participants is to "negotiate" about the information being
transmitted. We will discuss how prosody can be useful in this process of
grounding information. Finally, we will see how these findings can be exploited
to improve the interactions between humans and a spoken dialogue system, in
particular focusing on how prosody can be useful for error recovery.
Slides from Day 1, Day 2, Day 3.
September 12 – 14, 2000
Julia Hirschberg, AT&T Labs – Research, USA
Intonational Variation in Spoken Dialogue Systems: Generation and Understanding
The variation of prosodic features such as
contour, accent, phrasing, pitch range, loudness, speaking rate, and timing
contributes important information to the interpretation of spoken dialogue. To
create dialogue systems that interact naturally with human users, it is
important both to generate natural speech and to understand what users are
saying, and intonational contributions are important to both. This course of
lectures will discuss past findings and current research in the role of
intonational variation in speech synthesis/concept-to-speech and speech
recognition/understanding for spoken dialogue systems.
October 11-13, 1999
Sharon Oviatt, Center for Human-Computer Communication, OGI, USA
Lecture 1: Modeling hyperarticulate speech to interactive systems
When using interactive systems, people adapt their speech during attempts to
resolve system recognition errors. This talk will summarize the two-stage
Computer-elicited Hyperarticulate Adaptation Model (CHAM), which accounts for
systematic changes in human speech during interactive error handling. It will
summarize the empirical findings and linguistic theory upon which CHAM is
based, as well as the model's main predictions. Finally, implications of CHAM
will be discussed for designing future interactive systems with improved error
handling.
Lecture 2: Mutual disambiguation of recognition errors in a multimodal architecture
As a new generation of multimodal systems begins to define itself,
researchers are attempting to learn how to combine different modes into
strategically integrated whole systems. In theory, well designed multimodal
systems should be able to integrate complementary modalities in a manner that
supports mutual disambiguation of errors and leads to more robust performance.
In this talk, I will discuss the results of two different studies that
documented mutual disambiguation for diverse users (i.e., accented speakers)
and challenging usage contexts (i.e., mobile use in naturalistic noisy environments). Implications will be discussed for the development of future
multimodal architectures that can perform in a more robust and stable manner
than individual recognition technologies.
Lecture 3: Designing and evaluating conversational interfaces with animated characters
There currently is considerable interest in the development of robust
conversational interfaces, especially ones that make effective use of animated
characters. This talk will discuss the design of effective conversational
interfaces, including: (1) new mobile research infrastructure that we've been
using to conduct multimodal studies in natural field settings, (2) the I SEE!
application (Immersive Science Education for Elementary kids) that has been
developed as a test-bed for studying the design of conversational interfaces
that incorporate animated characters in educational technology, and (3) the
results of a study that compared the spoken language of 6-to-10-year-old
children while interacting with an animated character versus a human adult
partner.
May 17 – 19, 1999
Anne Cutler, Max Planck Institute for Psycholinguistics, Nijmegen, Holland
Three lectures on human recognition of spoken words:
Lecture 1: The psycholinguistic approach
Collecting listening data in the laboratory. Detection tasks, decision
tasks, generation tasks. Examining the time-course of spoken-word recognition
with "on-line" tasks; advantages and disadvantages of reaction-time
data versus "off-line" data. Models of human word recognition.
Simultaneous activation of word candidates and inter-word competition. The interplay
between (computational) modelling and experiment; evidence for and against
current models.
Lecture 2: From input to lexicon
Hypotheses about the lexical access code: features, phonemes, syllables and
other units. The relative weight of matching versus mismatching input.
Segmental versus suprasegmental information. Autonomy versus interaction in
prelexical processing: can feedback from the lexicon help listeners to process
the input, and is there evidence of such feedback? Recognition of words in
isolation versus words in continuous speech.
Lecture 3: Language-specificity
The structure of the vocabulary and the language-specificity of lexical
representations. The role in recognition of phonetic sequencing constraints,
word boundary effects, transitional probability, assimilation and elision
phenomena. How vocabulary structure can affect the relative importance of
different types of phonetic information in activation. Native-language,
second-language and bilingual word processing.
November, 1997
Paul Heisterkamp, Daimler-Benz AG, Research and Technology, Speech Understanding
Thinking in systems
Lecture I: Introduction, Speech dialogue in a constructivist perspective,
Speech understanding as an abductive process, Speech Dialogue Systems and their
components, Natural Language Analysis for speech dialogue
Lecture II: Dialogue Management. Current practice & future perspectives,
Finite-State Dialogues, Dialogue management as rational behaviour, Dialogue
management as local optimization, Tools
Lecture III: Natural Language Generation for speech dialogue, Conclusion -
Requirements for fully conversational systems