Bullet courses at CTT

May 30 – June 1, 2005
Sadaoki Furui, Tokyo Institute of Technology, Japan

Speech recognition and applications

1. Recent progress in spontaneous speech recognition
2. Robust speech recognition and understanding
3. Systematization and application of large-scale knowledge resources
More information 

September 20-22, 2004
Nick Campbell, ATR Spoken Language Translation Research Labs, Japan

1.  Language, Speech, and Meaning
2. Working with a Corpus of Expressive Speech
3. Synthesising Conversational Speech

More information 

June 16-18, 2003
Alex Waibel, Univ Karlsruhe and CMU

1. CHIL Computing to Overcome Techno-Clutter
2. Communicating in a Multilingual World

More information 

May 22-24, 2002
Mehryar Mohri, AT&T Labs – Research, USA

1. Weighted Automata Algorithms
2. Finite-State Machine Library (FSM Library)
3. Speech Recognition Applications

More information 

April 24-26, 2002
Steven Greenberg, International Computer Science Institute, Berkeley
http://www.icsi.berkeley.edu/~steveng/

What are the essential cues for understanding spoken language?

Automatic phonetic and prosodic annotation of spoken language

Beyond the phoneme: A juncture-accent model of spoken language

More information 

April 2 – 4, 2001
Marc Swerts, IPO Eindhoven, Holland
Prosody as a marker of information status

More information 

September 12 – 14, 2000
Julia Hirschberg, AT&T Labs – Research, USA
Intonational Variation in Spoken Dialogue Systems: Generation and Understanding
More information 

October 11-13, 1999
Sharon Oviatt, Center for Human-Computer Communication, OGI, USA

  1. Modeling hyperarticulate speech to interactive systems
  2. Mutual disambiguation of recognition errors in a multimodal architecture
  3. Designing and evaluating conversational interfaces with animated characters

More information 

June 14 – 16, 1999
Gerard Chollet, ENST, Paris, France

  1. Speech analysis, coding, synthesis and recognition: Time-frequency representations, wavelets, Temporal decomposition, Time-dependent models, Analysis by synthesis, H+N, HMMs, Markov fields, …
  2. ALISP (Automatic Language Independent Speech Processing): Learning from examples, Segmental models, Speaker normalisation, Very low bit rate coding, Multilingual speech recognition, …
  3. Applications: Identity verification, decision fusion, Interactive Voice Servers, 'Majordome', Multicasting, …

May 17 – 19, 1999
Anne Cutler, Max Planck Institute for Psycholinguistics, Nijmegen, Holland
Three lectures on human recognition of spoken words:

  1. The psycholinguistic approach
  2. From input to lexicon
  3. Language-specificity

More information 

April 26 – 28, 1999
Chin-Hui Lee, Lucent Technologies, USA
Three lectures on automatic speech and speaker recognition:

  1. Robust speech recognition -- overview of statistical pattern recognition approach and the implied robustness problems and solutions.
  2. Speaker and speech verification -- overview of statistical pattern verification approach and applications to speaker and speech verification.
  3. A detection approach to speech recognition and understanding -- a paradigm to combine feature-based detection and state-of-the-art recognition to open up new possibilities.

November 4-5, 1998
Gösta Bruce, Lund University, Sweden
Swedish Prosody

September 11-16, 1998
Klaus Kohler, Kiel University, Germany

  1. Labelled Speech Data Banks and Phonetic Research
  2. The structure of the Kiel data base: examples of segmental data analysis in German
  3. Multilingual studies of connected speech
  4. Modeling and labelling prosody

November, 1997
Paul Heisterkamp, Daimler Benz, Germany
  1. Thinking in systems
  2. Linguistic analysis and language generation for speech dialogue systems
  3. Speech dialogue management in and for real-world applications
More information 

November, 1996
Hynek Hermansky, OGI, USA

  1. Should recognizers have ears?
  2. Perceptual Linear Prediction
  3. Neglected temporal domain
  4. RASTA Processing
  5. Processing of modulation spectrum of speech
  6. Towards multi-stream processing of speech


February, 1996
Susann Luperfoy, MITRE, USA

Discourse Processing for Spoken Dialogue Systems


More Information about Bullet Courses at CTT


May 30 – June 1, 2005

Sadaoki Furui, Tokyo Institute of Technology, Japan

Speech recognition and applications

1. Recent progress in spontaneous speech recognition

This talk overviews recent progress in the development of corpus-based spontaneous speech recognition technology. Although speech is spontaneous in almost any situation, recognition of spontaneous speech is an area which has only recently emerged in the field of automatic speech recognition. Broadening the application of speech recognition depends crucially on raising recognition performance for spontaneous speech. For this purpose, it is necessary to build large spontaneous speech corpora for constructing acoustic and language models. This talk focuses on various achievements of a Japanese five-year national project, “Spontaneous Speech: Corpus and Processing Technology,” that has recently been completed. Because of various phenomena specific to spontaneous speech, such as filled pauses, repairs, hesitations, repetitions and disfluencies, recognition of spontaneous speech requires various new techniques. These include flexible acoustic modeling, sentence boundary detection, pronunciation modeling, acoustic as well as language model adaptation, and automatic summarization. In particular, automatic summarization including indexing, a process which extracts important and reliable parts of the automatic transcription, is expected to play an important role in building various speech archives, speech-based information retrieval systems, and human-computer dialogue systems.
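
As a rough illustration of the extractive idea behind such summarization, the sketch below scores recognized sentences by a mix of word confidence and term significance and keeps the top-scoring ones. The scoring scheme, weights, and data layout are assumptions for illustration only, not the project's actual summarization method.

```python
# Illustrative sketch only: extractive speech summarization by scoring
# recognized sentences with a mix of ASR word confidence and an IDF-style
# significance weight. All weights and data below are invented for the example.
import math
from collections import Counter

def sentence_score(words, confidences, doc_freq, num_docs, alpha=0.5):
    """Average per-word score combining confidence and term significance."""
    total = 0.0
    for w, c in zip(words, confidences):
        idf = math.log((num_docs + 1) / (doc_freq.get(w, 0) + 1))
        total += alpha * c + (1 - alpha) * idf
    return total / max(len(words), 1)

def summarize(sentences, compression=0.3):
    """Keep the highest-scoring sentences up to the requested compression ratio."""
    doc_freq = Counter(w for s in sentences for w in set(s["words"]))
    num_docs = len(sentences)
    scored = sorted(
        sentences,
        key=lambda s: sentence_score(s["words"], s["conf"], doc_freq, num_docs),
        reverse=True,
    )
    keep = max(1, int(len(sentences) * compression))
    # Restore original order so the extract reads naturally.
    chosen = sorted(scored[:keep], key=lambda s: s["index"])
    return [" ".join(s["words"]) for s in chosen]

sents = [
    {"index": 0, "words": ["the", "budget", "was", "approved"], "conf": [0.9, 0.8, 0.9, 0.7]},
    {"index": 1, "words": ["um", "you", "know"], "conf": [0.4, 0.5, 0.5]},
    {"index": 2, "words": ["spending", "rises", "next", "year"], "conf": [0.8, 0.9, 0.9, 0.9]},
]
print(summarize(sents, compression=0.4))
```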

2. Robust speech recognition and understanding

This talk overviews the major methods that have recently been investigated for making speech recognition systems more robust at both the acoustic and language processing levels. Improved robustness will enable such systems to work well over a wide range of unexpected and adverse conditions by helping them to cope with mismatches between training and testing speech utterances. This talk focuses on the stochastic matching framework for model parameter adaptation, the use of constraints in adaptation, automatic speaker change detection and adaptation, noisy speech recognition using tree-structured noise-adapted HMMs, spontaneous speech recognition, approaches using dynamic Bayesian networks, and multimodal speech recognition.
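
As a reminder of what model parameter adaptation means in its simplest form, here is a minimal sketch of MAP adaptation of a single Gaussian mean toward a handful of adaptation frames. It is a textbook illustration under assumed data shapes, not the stochastic matching framework covered in the talk.

```python
# Minimal sketch of MAP adaptation of a Gaussian mean, the simplest instance
# of model-parameter adaptation; not the specific framework from the lecture.
import numpy as np

def map_adapt_mean(prior_mean, adaptation_frames, tau=10.0):
    """Shift a model mean toward adaptation data, weighted by the prior count tau.

    With no adaptation data the prior mean is returned unchanged; with many
    frames the estimate approaches the sample mean of the new data.
    """
    x = np.asarray(adaptation_frames, dtype=float)
    mu0 = np.asarray(prior_mean, dtype=float)
    n = len(x)
    if n == 0:
        return mu0
    sample_mean = x.mean(axis=0)
    return (tau * mu0 + n * sample_mean) / (tau + n)

# Example: a 2-dimensional mean nudged toward ten frames of mismatched data.
adapted = map_adapt_mean([0.0, 0.0], np.random.randn(10, 2) + 1.0)
print(adapted)
```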

3. Systematization and application of large-scale knowledge resources

This talk introduces the five-year COE (Center of Excellence) program “Framework for Systematization and Application of Large-scale Knowledge Resources,” which was recently launched at Tokyo Institute of Technology. The project is conducting a wide range of interdisciplinary research combining humanities and technology to create a framework for the systematization and application of large-scale knowledge resources in electronic form. Spontaneous speech, written language, materials for e-learning and multimedia teaching, classical literature and historical documents, as well as information on cultural properties are just some examples of the real knowledge resources being targeted by the project. These resources will be systematized according to their respective semantic structures. Pioneering new academic disciplines and educating knowledge resource researchers are also objectives of the project. Large-scale systems for computation and information storage, as well as retrieval, have been installed to support this research and education.

September 20-22, 2004

Nick Campbell, ATR Spoken Language Translation Research Labs, Japan

1.  Language, Speech, and Meaning

In this talk, I shall attempt to describe some of the roles played by prosody in speech communication, and will relate them to the requirements of computer speech processing. The talk covers phonetic, linguistic and paralinguistic aspects of speech. 

2. Working with a Corpus of Expressive Speech

This talk describes the JST/CREST Expressive Speech Processing project, introduces a very large corpus of conversational speech and describes some of the main findings of our related research. The talk explores the roles of non-verbal and paralinguistic information in speech communication. 

3. Synthesising Conversational Speech

This talk addresses the issues of synthesising non-verbal speech and describes a prototype interface for the synthesis of conversational speech. The synthesised samples are in Japanese, but I believe that they are sufficiently interesting that any inherent language difficulties might be overcome by higher-level speech-related interests.

June 16-18, 2003
Alex Waibel, Univ Karlsruhe and CMU

1. CHIL Computing to Overcome Techno-Clutter
2. Communicating in a Multilingual World

Title: CHIL Computing to Overcome Techno-Clutter

Abstract:

After building computers that paid no attention to communicating with humans, we have in recent years developed ever more sophisticated interfaces that put the "human in the loop" of computers. These interfaces have improved usability by providing more appealing output (graphics, animations), easier-to-use input methods (mouse, pointing, clicking, dragging) and more natural interaction modes (speech, vision, gesture, etc.). Yet the productivity gains that have been promised have largely not been seen, and human-machine interaction still remains a partially frustrating and tedious experience, full of techno-clutter and excessive attention required by the technical artifact. In this talk, I will argue that we must transition to a third paradigm of computer use, in which we let people interact with people, and move the machine into the background to observe the humans' activities and to provide services implicitly, that is, to the extent possible, without explicit request. Putting the "Computer in the Human Interaction Loop" (CHIL), instead of the other way round, however, brings formidable technical challenges. The machine must now always observe and understand humans, model their activities, their interaction with other humans, the human state as well as the state of the space they are in, and finally, infer intentions and needs. From a perceptual user interface point of view, we must process signals from sensors that are always on, frequently inappropriately positioned, and subject to much greater variability. We must also recognize not only WHAT was seen or said in a given space, but also a broad range of additional information, such as the WHO, WHERE, HOW, TO WHOM, WHY, WHEN of human interaction and engagement. In this talk, I will describe a variety of multimodal interface technologies that we have developed to answer these questions and some preliminary CHIL-type services that take advantage of such perceptual interfaces.

Title: Communicating in a Multilingual World

Abstract:

With the globalization of society, and the opportunities and threats that go along with it, multilinguality is now emerging both as an urgent technology requirement and as a formidable scientific challenge. Formerly often viewed as an uninteresting exercise in adapting (mostly) English speech and language systems to a foreign language, multilinguality is rapidly becoming an area of active study and research in its own right. Even though Speech, Language and MT technologies still lag in performance behind human capabilities in any one language and much research remains to be done, researchers are now also actively exploring the added problems introduced by multilinguality: (1) additional languages add peculiarities that have not been investigated in English but need to be handled effectively, including tonality, morphology, orthography and segmentation; (2) foreign accent and foreign words/names introduce modeling difficulties in monolingual speech systems; (3) the sheer number of languages (~6000 by most estimates) makes traditional porting approaches via training on large corpora or rule-based programming impractical and prohibitively expensive; (4) multilingual and cross-lingual technologies such as Machine Translation and Cross-lingual Retrieval are faced with scaling problems as the demand for language pairs (which grows as N²) and domain coverage rises with languages and applications. In this talk I will discuss the problems and current approaches on the road to more effective, more robust and more portable multilingual speech and language systems and services.

May 22-24, 2002
Mehryar Mohri, AT&T Labs – Research, USA

Dr Mohri is Head of the Speech Algorithms Department at AT&T Labs - Research. The objective of the department is to design general mathematical frameworks, efficient algorithms, and fundamental software libraries for large-scale speech processing problems. Its research areas include automata theory, machine learning, pattern matching, natural language processing, automatic speech recognition, and speech synthesis. His short course will concentrate on the activities within speech at AT&T Labs, including:

1. Weighted Automata Algorithms
2. Finite-State Machine Library (FSM Library)
3. Speech Recognition Applications
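
For a flavour of what the first topic involves, the sketch below computes the single-source shortest distance over an acyclic weighted automaton in the tropical (min, +) semiring, one of the basic operations such libraries provide. It is a generic, self-contained illustration, not the FSM Library API.

```python
# Generic illustration of a weighted-automaton algorithm: single-source
# shortest distance in the tropical semiring (min, +) over an acyclic
# automaton whose states are numbered in topological order.
import math

def shortest_distance(num_states, arcs, start=0):
    """arcs: list of (src, dst, label, weight); states assumed topologically ordered."""
    dist = [math.inf] * num_states
    dist[start] = 0.0
    for src, dst, _label, w in sorted(arcs, key=lambda a: a[0]):
        if dist[src] + w < dist[dst]:      # tropical "plus" is min,
            dist[dst] = dist[src] + w      # tropical "times" is ordinary +
    return dist

# Tiny example: two paths from state 0 to state 3 with different total weights.
arcs = [(0, 1, "a", 0.5), (1, 3, "b", 0.2), (0, 2, "a", 0.1), (2, 3, "c", 0.9)]
print(shortest_distance(4, arcs))  # state 3 gets min(0.7, 1.0) = 0.7
```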

For more information, see http://www.research.att.com/~mohri/


April 24-26, 2002

Steven Greenberg, International Computer Science Institute, Berkeley

http://www.icsi.berkeley.edu/~steveng/


What are the essential cues for understanding spoken language?

Automatic phonetic and prosodic annotation of spoken language

Beyond the phoneme: A juncture-accent model of spoken language


What are the essential acoustic cues for understanding spoken language?


Classical models of speech recognition (by both human and machine) assume that a detailed, short-term analysis of the signal is essential for accurately decoding spoken language. Although this stratagem may work well for carefully enunciated speech spoken in a pristine acoustic environment, it is far less effective for recognizing speech spoken under more realistic conditions, such as (1) moderate-to-high levels of background noise, (2) reverberant acoustic environments and (3) spontaneous, informal conversation. Statistical analysis of a spontaneous speech corpus suggests that the stability of the linguistic representation is largely a consequence of syllabic (ca. 200 ms) intervals of analysis.


Perceptual experiments lend support to this conclusion. In one experiment the spectrum of spoken sentences was partitioned into critical-band-like channels and the onset of each channel shifted in time relative to the others so as to desynchronize spectral information across the frequency axis. Human listeners are highly tolerant of cross-channel spectral asynchrony induced in this fashion. Intelligibility is highly correlated with the magnitude of the low-frequency (3-6 Hz) modulation spectrum. A second study partitioned the spectrum of the signal into 1/3-octave channels ("slits") and measured the intelligibility associated with each channel presented alone and in concert with the others. Four slits distributed over the speech-audio range (0.3-6 kHz) are sufficient for listeners to decode sentential material with nearly 90% accuracy, although more than 70% of the spectrum is missing. Word recognition often remains relatively high (60-83%) when just two or three channels are presented concurrently, despite the fact that the intelligibility of these same slits, presented in isolation, is less than 9%. Such data suggest that intelligibility is based on a compound "image" of the modulation spectrum distributed across the frequency spectrum. Because intelligibility seriously degrades when slits are desynchronized by more than 25 ms, this compound image is likely to be derived from both the amplitude and phase components of the modulation spectrum. This conclusion is supported by the results of a separate experiment in which the phase and amplitude components of the modulation spectrum were dissociated through the use of locally (20-180 ms) time-reversed speech segments. The decline in intelligibility is correlated with the complex modulation spectrum, reflecting the interaction of the phase and magnitude components of the modulation pattern distributed across the frequency spectrum. The magnitude component of the modulation spectrum (i.e., the modulation index), by itself, is a poor predictor of speech intelligibility under these conditions. Such results may account for why measures of intelligibility based solely on modulation magnitude (such as the Speech Transmission Index) are not entirely successful in predicting how well listeners understand spoken language under a wide range of conditions.
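
To make the recurring notion of the low-frequency modulation spectrum concrete, here is a minimal sketch of how the 3-6 Hz modulation energy of a speech envelope could be estimated. The frame size, envelope measure, and synthetic test signal are illustrative assumptions, not the stimuli or analysis used in the experiments above.

```python
# Illustrative sketch: estimate the fraction of envelope-modulation energy
# lying in the low-frequency (3-6 Hz) band. Parameters are illustrative only.
import numpy as np

def modulation_energy(signal, sample_rate, frame_ms=10, lo_hz=3.0, hi_hz=6.0):
    """Return the fraction of envelope-modulation energy in the lo-hi Hz band."""
    frame = int(sample_rate * frame_ms / 1000)
    n_frames = len(signal) // frame
    # Amplitude envelope: RMS energy per 10 ms frame (envelope rate = 100 Hz).
    env = np.sqrt(np.mean(
        signal[: n_frames * frame].reshape(n_frames, frame) ** 2, axis=1))
    env = env - env.mean()
    spectrum = np.abs(np.fft.rfft(env)) ** 2
    freqs = np.fft.rfftfreq(len(env), d=frame_ms / 1000.0)
    band = (freqs >= lo_hz) & (freqs <= hi_hz)
    return spectrum[band].sum() / max(spectrum.sum(), 1e-12)

# Example with a synthetic 4 Hz amplitude-modulated carrier at 16 kHz.
sr = 16000
t = np.arange(0, 2.0, 1.0 / sr)
x = (1.0 + np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 500 * t)
print(modulation_energy(x, sr))  # most envelope energy falls in the 3-6 Hz band
```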


Automatic phonetic and prosodic annotation of spoken language


Automatic phonetic and prosodic transcription provides an invaluable source of empirical material with which to train automatic speech recognition systems and derive fundamental insights into the nature of spoken language. This presentation focuses on the development of automatic labeling systems for two separate (albeit related) tiers of spoken language - articulatory features (such as voicing, place and manner of articulation) and stress accent (related to the acoustic prominence associated with sequences of syllables across an utterance).


ARTICULATORY-ACOUSTIC FEATURES

A novel framework for automatic articulatory-acoustic feature extraction has been developed for enhancing the accuracy of place- and manner-of-articulation classification in spoken language. The "elitist" approach focuses on frames for which multilayer perceptron (neural network) classifiers are highly confident, and discards the rest. Using this method, it is possible to achieve a frame-level accuracy of 93% for manner information on a corpus of American English sentences passed through a telephone network (NTIMIT). Place-of-articulation information is extracted for each manner class independently, resulting in an appreciable gain in place-feature classification relative to performance for a manner-independent system. A comparable gain in classification performance for the elitist approach is evidenced when applied to a Dutch corpus of quasi-spontaneous telephone interactions (VIOS). The elitist framework provides a potential means of automatically annotating a corpus at the phonetic level without recourse to a word-level transcript and could thus be of utility for developing training materials for automatic speech recognition and speech synthesis applications, as well as aid the empirical study of spoken language.
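
The core of the elitist idea can be sketched in a few lines: keep only the frames on which the classifier is highly confident and label just those. The threshold and array shapes below are illustrative assumptions, not the settings used in the work described above.

```python
# Sketch of the "elitist" frame-selection idea: keep only frames where the
# classifier's best posterior is high, discard the rest. Threshold and data
# are illustrative assumptions.
import numpy as np

def elitist_select(posteriors, threshold=0.9):
    """posteriors: (n_frames, n_classes) per-frame classifier outputs summing to 1.

    Returns the indices of confidently classified frames and their labels;
    the remaining frames are simply left unlabeled.
    """
    best = posteriors.max(axis=1)
    keep = np.where(best >= threshold)[0]
    labels = posteriors[keep].argmax(axis=1)
    return keep, labels

# Example: three frames, only the first exceeds the confidence threshold.
p = np.array([[0.95, 0.03, 0.02],
              [0.50, 0.30, 0.20],
              [0.40, 0.35, 0.25]])
print(elitist_select(p))  # (array([0]), array([0]))
```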


STRESS-ACCENT PATTERNS

There is a systematic relationship between stress accent and vocalic identity in spontaneous English discourse (the Switchboard corpus composed of telephone dialogues). Low vowels are much more likely to be fully accented than their high vocalic counterparts. And conversely, high vowels are far more likely to lack stress accent than low or mid vocalic segments. Such patterns imply that stress accent and vocalic identity (particularly vowel height) are bound together at some level of lexical representation. Statistical analysis of a manually annotated corpus (Switchboard) indicates that vocalic duration is likely to serve as an important acoustic cue for stress accent, particularly for diphthongs and the low, tense monophthongs. Multilayer perceptrons (MLPs) were trained on a portion of this annotated material in order to automatically label the corpus with respect to stress accent. The automatically derived labels are highly concordant with those of human transcribers (79% concordance within a quarter-step of accent level and 97.5% concordant within a half-step of accent level). In order to achieve such a high degree of concordance it is necessary to include features pertaining not only to the duration and amplitude of the vocalic nuclei, but also those associated with speaker gender, syllabic duration and most importantly, vocalic identity. Such results suggest that vocalic identity is intimately associated with stress accent in spontaneous American English (and vice versa), thereby providing a potential foundation with which to model pronunciation variation for automatic speech recognition.
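
The concordance figures quoted above can be read as agreement within a tolerance measured in accent steps. The sketch below shows one plausible way to compute such a measure; the tolerance values and label encoding are assumed for illustration rather than taken from the published evaluation.

```python
# One plausible reading of "concordance within a quarter-step": the fraction of
# syllables whose automatic accent level falls within a given tolerance of the
# manual label. Tolerances and labels below are assumptions for illustration.
import numpy as np

def concordance(automatic, manual, tolerance=0.25):
    """Fraction of labels agreeing to within the given tolerance (in accent steps)."""
    a = np.asarray(automatic, dtype=float)
    m = np.asarray(manual, dtype=float)
    return float(np.mean(np.abs(a - m) <= tolerance))

auto = [1.0, 0.5, 0.0, 1.0]
hand = [1.0, 0.75, 0.0, 0.5]
print(concordance(auto, hand, 0.25))  # 0.75
print(concordance(auto, hand, 0.5))   # 1.0
```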


Beyond the Phoneme: A Juncture-Accent Model of Spoken Language


Current-generation speech recognition systems generally represent words as sequences of phonemes. This "phonemic beads on a string" approach is based on a fundamental misconception about the nature of spoken language and makes it difficult (if not well-nigh impossible) to accurately model the pronunciation variation observed in spontaneous speech. Five hours of spontaneous dialogue material (from the Switchboard corpus) has been manually labeled and segmented, and a subset of this corpus has been manually annotated with respect to stress accent (i.e., the prosodic emphasis placed on a syllable). Statistical analysis of the corpus indicates that much of the pronunciation variation observed at the phonetic-segment level can be accounted for in terms of stress-accent pattern and position of the segment within the syllable.


Such analyses imply that spoken language can be thought of as a sequence of accent peaks (associated with vocalic nuclei) separated by junctures of variable type and length. These junctures are associated with what are generally termed "consonants"; however, many consonants do not behave as segments proper, but rather serve to define the nature of the separation (i.e., juncture) between adjacent syllables.


A "juncture-accent" model (JAM) of spoken language has interesting implications for auditory models of speech processing as well as consequences for developing future-generation speech recognition systems.


April 2 – 4, 2001
 Marc Swerts, IPO Eindhoven, Holland
Prosody as a marker of information status

Prosody can be defined as the ensemble of suprasegmental speech features (speech melody, tempo, rhythm, loudness, etc.). The three lectures will focus on the use of prosody to mark the status of the information which is exchanged between speaking partners in natural dialogue. First, we will introduce recent findings of comparative analyses of Dutch, Italian and Japanese, showing how prosody is exploited in these languages to distinguish important bits of information from unimportant ones. Then, given that spoken conversation represents a rather uncertain medium, one central activity of dialogue participants is to "negotiate" about the information being transmitted. We will discuss how prosody can be useful in this process of grounding information. Finally, we will see how these findings can be exploited to improve the interactions between humans and a spoken dialogue system, in particular focusing on how prosody can be useful for error recovery.


Slides from Day1, Day2, Day3.


September 12 – 14, 2000
 Julia Hirschberg, AT&T Labs – Research, USA
Intonational Variation in Spoken Dialogue Systems: Generation and Understanding

Summary

The variation of prosodic features such as contour, accent, phrasing, pitch range, loudness, speaking rate, and timing contributes important information to the interpretation of spoken dialogue. To create dialogue systems that interact naturally with human users, it is important both to generate natural speech and to understand what users are saying, and intonational contributions are important to both. This course of lectures will discuss past findings and current research in the role of intonational variation in speech synthesis/concept to speech and speech recognition/understanding for spoken dialogue systems.

October 11-13, 1999
Sharon Oviatt, Center for Human-Computer Communication, OGI, USA

Lecture 1: Modeling hyperarticulate speech to interactive systems

When using interactive systems, people adapt their speech during attempts to resolve system recognition errors. This talk will summarize the two-stage Computer-elicited Hyperarticulate Adaptation Model (CHAM), which accounts for systematic changes in human speech during interactive error handling. It will summarize the empirical findings and linguistic theory upon which CHAM is based, as well as the model's main predictions. Finally, implications of CHAM will be discussed for designing future interactive systems with improved error handling.

Lecture 2: Mutual disambiguation of recognition errors in a multimodal architecture

As a new generation of multimodal systems begins to define itself, researchers are attempting to learn how to combine different modes into strategically integrated whole systems. In theory, well designed multimodal systems should be able to integrate complementary modalities in a manner that supports mutual disambiguation of errors and leads to more robust performance. In this talk, I will discuss the results of two different studies that documented mutual disambiguation for diverse users (i.e., accented speakers) and challenging usage contexts (i.e., mobile use in naturalistic noisy environment). Implications will be discussed for the development of future multimodal architectures that can perform in a more robust and stable manner than individual recognition technologies.
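
A toy sketch of the mutual-disambiguation idea follows: each mode's n-best list is combined with the other's, and the jointly best, semantically compatible pair wins, so a hypothesis ranked below the top in one mode can still be recovered. The hypotheses, scores, and compatibility rule are invented for illustration and are not drawn from the studies described above.

```python
# Toy sketch of mutual disambiguation: pick the jointly best, semantically
# compatible pair of hypotheses from two unimodal n-best lists. All hypotheses,
# scores, and the compatibility rule are invented for this illustration.

def mutually_disambiguate(speech_nbest, gesture_nbest, compatible):
    """Each n-best list holds (hypothesis, score) pairs; higher scores are better."""
    best = None
    for s_hyp, s_score in speech_nbest:
        for g_hyp, g_score in gesture_nbest:
            if not compatible(s_hyp, g_hyp):
                continue  # skip semantically impossible pairings
            joint = s_score + g_score
            if best is None or joint > best[2]:
                best = (s_hyp, g_hyp, joint)
    return best

def compatible(s_hyp, g_hyp):
    # Assume a "move map" command only makes sense with an area gesture.
    return not (s_hyp == "move map" and g_hyp.startswith("point"))

# The top speech hypothesis ("move map") is incompatible with the top gesture
# (a point), so the second-ranked "zoom map" wins the joint interpretation.
speech = [("move map", 0.6), ("zoom map", 0.5)]
gesture = [("point:street", 0.7), ("circle:area", 0.4)]
print(mutually_disambiguate(speech, gesture, compatible))
```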

Lecture 3: Designing and evaluating conversational interfaces with animated characters

There currently is considerable interest in the development of robust conversational interfaces, especially ones that make effective use of animated characters. This talk will discuss the design of effective conversational interfaces, including: (1) new mobile research infrastructure that we've been using to conduct multimodal studies in natural field settings, (2) the I SEE! application (Immersive Science Education for Elementary kids) that has been developed as a test-bed for studying the design of conversational interfaces that incorporate animated characters in educational technology, and (3) the results of a study that compared the spoken language of 6-to-10-year-old children while interacting with an animated character versus a human adult partner.

May 17 – 19, 1999
Anne Cutler, Max Planck Institute for Psycholinguistics, Nijmegen, Holland

Three lectures on human recognition of spoken words:

Lecture 1: The psycholinguistic approach

Collecting listening data in the laboratory. Detection tasks, decision tasks, generation tasks. Examining the time-course of spoken-word recognition with "on-line" tasks; advantages and disadvantages of reaction-time data versus "off-line" data. Models of human word recognition. Simultaneous activation of word candidates and inter-word competition. The interplay between (computational) modelling and experiment; evidence for and against current models.

Lecture 2: From input to lexicon

Hypotheses about the lexical access code: features, phonemes, syllables and other units. The relative weight of matching versus mismatching input. Segmental versus suprasegmental information. Autonomy versus interaction in prelexical processing: can feedback from the lexicon help listeners to process the input, and is there evidence of such feedback? Recognition of words in isolation versus words in continuous speech.

Lecture 3: Language-specificity

The structure of the vocabulary and the language-specificity of lexical representations. The role in recognition of phonetic sequencing constraints, word boundary effects, transitional probability, assimilation and elision phenomena. How vocabulary structure can affect the relative importance of different types of phonetic information in activation. Native-language, second-language and bilingual word processing.

November, 1997
Paul Heisterkamp, Daimler-Benz AG, Research and Technology, Speech Understanding
Thinking in systems

Lecture contents.

Lesson I: Introduction, Speech dialogue in a constructivist perspective, Speech understanding as an abductive process, Speech Dialogue Systems and their components, Natural Language Analysis for speech dialogue

Lesson II: Dialogue Management. Current practice & future perspectives, Finite-State Dialogues (a minimal sketch in code follows Lesson III), Dialogue management as rational behaviour, Dialogue management as local optimization, Tools

Lesson III: Natural Language Generation for speech dialogue, Conclusion - Requirements for fully conversational systems
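
To anchor the "Finite-State Dialogues" topic from Lesson II, here is a minimal finite-state dialogue manager. The states, prompts, and slot names are invented for illustration and are not taken from any Daimler-Benz system.

```python
# Minimal finite-state dialogue manager of the kind meant by "Finite-State
# Dialogues" in Lesson II. States, prompts, and slots are invented examples.

DIALOGUE = {
    "start":    {"prompt": "Where would you like to travel?",
                 "next": lambda slots: "ask_date" if slots.get("city") else "start"},
    "ask_date": {"prompt": "On which date?",
                 "next": lambda slots: "confirm" if slots.get("date") else "ask_date"},
    "confirm":  {"prompt": "Shall I book it?",
                 "next": lambda slots: "done" if slots.get("confirmed") else "start"},
    "done":     {"prompt": "Your trip is booked. Goodbye.",
                 "next": lambda slots: "done"},
}

def step(state, slots, understood):
    """Merge the parser's slots into the dialogue context and move to the next state."""
    slots.update(understood)
    return DIALOGUE[state]["next"](slots), slots

# Simulated parse results for three user turns drive the state machine.
state, slots = "start", {}
for parsed in [{"city": "Stockholm"}, {"date": "May 30"}, {"confirmed": True}]:
    print(DIALOGUE[state]["prompt"])
    state, slots = step(state, slots, parsed)
print(DIALOGUE[state]["prompt"])
```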
