COST 249: Work Group Organization
1 Introduction
In order to organize the research and the cooperation among the
different partners as efficiently as possible, it was decided to
set up four work groups, which would be responsible for treating
different types of questions regarding specific aspects of the
continuous speech recognition problem. These four work groups were
named:
- WG1: Concept establishment (B. Pfister)
- WG2: Linguistic processing (J.-L. Cochard)
- WG3: Phonetic decoding (K. Elenius)
- WG4: Acoustic signal processing (T. Svendsen)
The names in parentheses are those of the people appointed as WG
chairmen. It was left to the COST action participants to decide to
which WGs they would contribute. It was also pointed out that all
participants are allowed to attend the meetings of all WGs. In the
subsequent sections, the reader can find a list of participants
intending to contribute to the work of one or more of the WGs, as
well as a detailed list of the topics and questions to be treated
by the WGs, and the institutions that will contribute to the
different topics. Each participant is designated by an abbreviation
of the institution he or she belongs to.
2 List of participants
What follows is an alphabetical list of contributing institutions
(ordered by abbreviation), together with the person responsible
for each contribution.
- AKH: Hungarian Academy of Sciences (Hungary), Klara Vicsi
- BT: British Telecom (United Kingdom), Simon Ringland
- CPK: Center for PersonKommunikation (Denmark), Paul Dalsgaard
- CRIN: Centre de Recherches en Informatique de Nancy (France),
Jean-Paul Haton
- CTU: Czech Technical University (Czech Republic), Jan Uhlir
- DBT: Deutsche Bundespost Telekom (Germany), Bernhard Kaspar
- ETH: Eidgenössische Technische Hochschule (Switzerland),
Beat Pfister
- FEP: Faculty of Engineering in Porto (Portugal), Carlos Espain
- FPM: Faculté Polytechnique de Mons (Belgium), Henri Leich
- FUB: Fondazione Ugo Bordoni (Italy), Lucio Ricotti
- IDIAP: Institut Dalle Molle d'Intelligence Artificielle Perceptive
(Switzerland), Jean-Luc Cochard
- KTH: Speech Communication and Music Acoustics (Sweden), Kjell
Elenius
- NTH: Norwegian Institute of Technology (Norway), Torbjørn Svendsen
- NTR: Norwegian Telecom Research (Norway), Finn Tore Johansen
- PTT: Dutch PTT Research (Netherlands), Paul van Alphen
- PUM: Polytechnic University of Madrid (Spain), Juan Gomez-Mena
- RUG: Rijksuniversiteit Gent (Belgium), Jean-Pierre Martens
- TDK: Tele Denmark (Denmark), Per Rosenbeck
- TUK: Technical University of Kosice (Slovakia), Anton Cizmar
- UMB: University of Maribor (Slovenia), Zdravko Kacic
Obviously, new participants can be added to this list as soon
as they have decided to which WGs they want to contribute.
Furthermore, participants may at any time decide to join a WG
whenever they feel they could contribute to one of the topics
treated by that WG. Therefore, this document will be updated on a
regular basis.
WG1: Concept Establishment
Topic 1: Overall system configuration
Questions:
- Which types of knowledge representation and processing mechanisms
are suitable for the different subtasks of an ASR system?
- Which information should be provided by the acoustic-phonetic
level?
- Where in an ASR system can which types of prosodic information
be used?
- How can prosodic features be used to detect stressed syllables,
phrase boundaries, phrase type, etc.?
- What are the optimal interaction mechanisms between the different
processing levels? Of particular interest is the interface between
the acoustic-phonetic level and the NLP level, as they typically
use different methodologies (statistical and knowledge-based,
respectively).
- What can we learn from human perception (hearing characteristics,
psychoacoustics, etc.) to improve ASR systems?
- Can the findings of cognitive science (neural structure, gestalt
perception, etc.) have significant implications for ASR systems?
Contributors:
DBT, ETH, IDIAP, KTH, NTH, RUG, UMB
Topic 2: Task complexity
Questions:
- How can the complexity of a given speech recognition task
be assessed, taking into account acoustic variations, vocabulary
size, pronunciation variants, syntactic structures, etc.? (An
illustrative perplexity sketch follows the contributor list below.)
- For which types of ASR systems is the size of the search space
an adequate complexity measure?
- Which strategies for search space reduction can be applied
in the different types of ASR systems?
Contributors:
CPK, DBT, ETH, IDIAP
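As announced above, the sketch below illustrates one commonly used
task-complexity measure: the test-set perplexity of a language model,
which approximates the average branching factor of the recognizer's
search space. It is only a minimal example; the add-alpha smoothed
bigram model, the toy corpus and all names are our own assumptions,
not part of the action's work.

import math
from collections import Counter

def bigram_perplexity(train_sents, test_sent, alpha=1.0):
    """Perplexity of an add-alpha smoothed bigram model on one test sentence.

    Perplexity approximates the average branching factor of the search
    space a recognizer faces, and is one candidate task-complexity measure.
    """
    vocab = {w for s in train_sents for w in s} | {"<s>", "</s>"}
    unigrams, bigrams = Counter(), Counter()
    for s in train_sents:
        padded = ["<s>"] + s + ["</s>"]
        unigrams.update(padded[:-1])                      # context counts
        bigrams.update(zip(padded[:-1], padded[1:]))      # bigram counts

    log_prob, n = 0.0, 0
    padded = ["<s>"] + test_sent + ["</s>"]
    for prev, cur in zip(padded[:-1], padded[1:]):
        p = (bigrams[(prev, cur)] + alpha) / (unigrams[prev] + alpha * len(vocab))
        log_prob += math.log2(p)
        n += 1
    return 2 ** (-log_prob / n)

# Toy example (hypothetical data): lower perplexity = less complex task.
train = [["show", "me", "flights", "to", "oslo"],
         ["show", "me", "flights", "to", "madrid"]]
print(bigram_perplexity(train, ["show", "me", "flights", "to", "oslo"]))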
Topic 3: Task and language independence
Questions:
- Which parts of the different system configurations (cf. Topic 1)
are or can be designed in a language-independent way?
- How to design the ASR system in such a way that adaptation
to a new task or a new language can be achieved efficiently?
Contributors:
none yet
Topic 4: Spontaneous speech issues
Questions:
- What problems/consequences arise for the different ASR system
approaches from the use of spontaneous speech? Typical properties
of spontaneous speech are bad pronunciation (slurring), syntactic
incorrectness, hesitations, etc.
Contributors:
none yet
Topic 5: Dialogue modelling
Questions:
- Is there a general formalism (independent of specific applications)
to describe dialogues?
- What are the advantages and drawbacks of procedural vs.
knowledge-based (e.g. plan recognition) dialogue descriptions?
- How can an ASR system appropriately be interfaced to the pragmatic
level?
Contributors:
CPK, DBT, ETH, KTH
WG2: Linguistic Processing
Topic 1: Lexical knowledge
Questions:
- How to handle word coarticulations at the lexical knowledge
representation level?
- How to obtain an optimal lexicon capable of representing word
pronunciation variants? (A small illustrative sketch follows the
contributor list.)
- Is it possible to train better word pronunciation models by taking
into account both true pronunciation variants and phonetic decoding
errors?
- How to tune a lexical access module to allow word selection
controlled by partial phonetic sequences?
Contributors:
ETH, IDIAP, KTH, NTH, NTR, RUG
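As a trivial illustration of how pronunciation variants (second
question above) can be represented at the lexical level, a lexicon
may store weighted variant pronunciations per word. The phoneme
strings and probabilities below are invented for the example and
carry no phonetic authority.

# Toy pronunciation lexicon: each word maps to variant pronunciations
# (phoneme strings) together with estimated variant probabilities.
LEXICON = {
    "and":   [("ae n d", 0.6), ("ax n", 0.3), ("ax n d", 0.1)],
    "seven": [("s eh v ax n", 0.8), ("s eh v n", 0.2)],
}

def variants(word):
    """Return the pronunciation variants of a word, most probable first."""
    return sorted(LEXICON.get(word, []), key=lambda v: -v[1])

print(variants("and"))   # [('ae n d', 0.6), ('ax n', 0.3), ('ax n d', 0.1)]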
Topic 2: Parsing strategies
Questions:
- How can sentence parsing be extended to sentence AND word
parsing?
- How can rules for other regular phenomena be integrated into
the parsing mechanism?
- How can scoring be consistently integrated into parsing?
- What do parsing strategies have to look like if their input
is no longer a given set of basic hypotheses, but controlling
the generation of hypotheses becomes one of their jobs?
- What kind of information do parsers need then to make decisions?
- What kind of decisions should parsers be able to make?
- How to integrate a syntactic chart parser into a word recognition
module?
Contributors:
CPK, ETH, FUB, KTH
Topic 3: Prosody
Questions:
- How can micro-prosodic variations be used to improve the accuracy
of lexical access in a CSR system?
- How to constrain syntactic constituent boundaries with
prosodic information (prosodic word boundaries)?
- How to introduce both phonetic and prosodic features into the
linguistic processing levels?
Contributors:
none yet
Topic 4: Higher order constraints
Questions:
- How to add semantic-based and user plan-based constraints to
syntactic ones?
- How to model semantic constraints and user plans in the
sublanguage of ATIS flight queries?
Contributors:
FUB
Topic 5: Language models
Questions:
- How to represent different kinds of linguistic knowledge in a
speech recognition system?
- How can well-motivated natural language theories, such as GB,
GPSG, etc., be adapted in the context of speech processing to
efficiently reduce the problem solution space?
Contributors:
CPK
Topic 6: Speaker adaptation
Questions:
- Can one adapt to the speaker by imposing articulatory constraints?
- Can one adapt to the speaker by adapting the lexical knowledge
in the speech recognizer?
Contributors:
KTH
WG3: Phonetic Decoding
Topic 1: Neural networks and HMMs
Questions:
- What is the best way of adding context information to an ANN?
- Does it help to supplement raw data with phonetically motivated
features derived from these data in a neural network based classifier?
- What are the best architectures for combining neural networks
(NN) and dynamic programming (DP) or hidden Markov models (HMMs)?
(An illustrative sketch of one such hybrid follows the contributor
list.)
Contributors:
CRIN, ETH, FPM, FEP, IDIAP, KTH, NTR, RUG, TUK
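To make the last question above more concrete, the sketch below shows
one widely studied hybrid scheme in which frame-wise network
posteriors, divided by the state priors, serve as emission scores
inside a Viterbi dynamic-programming search. Array shapes and variable
names are our assumptions; this is an illustrative sketch, not a
reference implementation.

import numpy as np

def viterbi_with_nn_posteriors(posteriors, priors, trans, init):
    """Viterbi DP over HMM states using NN posteriors as emission scores.

    posteriors: (T, S) frame-wise state posteriors P(state | frame) from an ANN
    priors:     (S,)   state prior probabilities estimated on the training data
    trans:      (S, S) HMM state transition probabilities
    init:       (S,)   initial state probabilities
    Scaled likelihoods P(frame | state) proportional to P(state | frame) / P(state)
    replace the usual Gaussian emission densities (hybrid NN/HMM scheme).
    """
    eps = 1e-12
    log_emit = np.log(posteriors + eps) - np.log(priors + eps)   # (T, S)
    T, S = log_emit.shape
    delta = np.log(init + eps) + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(trans + eps)            # prev x cur
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    # Backtrace the best state sequence.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]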
Topic 2: Task and language independence
Questions:
- Can one come up with strategies that allow a nearly automatic
adaptation of an ASR system designed for one task (or language)
into one that is optimal for another task (or language)?
Contributors:
AKH, BT, CPK, ETH, FPM, IDIAP, NTH, NTR, PTT, PUM, RUG
Topic 3: Recognition units
Questions:
- Does it help to perform the acoustic-phonetic decoding in
a cascade scheme: first distinctive features and then phonetic
hypotheses?
- Is it possible to use phonetic features directly for lexical
look-up? Note that features need not be synchronous with phoneme
boundaries.
- What segmental and supra-segmental features (including prosodic
ones) are useful for the classification of phonetic segments?
- What are the best units of speech with regard to detectability,
classification ability, linguistic relevance, ...?
- How to represent phonetic events (transitions between units,
units or clusters of units) in a fuzzy way to take into account
the uncertainty about their precise location in the speech signal?
- How to combine scores for hypotheses about small phonetic events
(e.g. phones) into scores for hypotheses about larger phonetic
events (e.g. syllables)?
- Must speech units be linguistically defined, or can one define
units which are acoustically stable and offer a decodable correspondence
to linguistic units?
- Can one find better methods for segmentation and labelling
of speech, yielding improved training methods for ASR?
- Transition-based systems.
Contributors:
AKH, CPK, CTU, DBT, ETH, FPM, NTH, RUG, TDK, TUK, UMB
WG4: Acoustic Signal Processing
Topic 1: Prosodic feature extraction
Questions:
- How can prosodic features like fundamental frequency, intensity,
duration, stress, focus, ... be determined reliably? (A minimal
F0-estimation sketch follows the contributor list.)
Contributors:
CPK, DBT, IDIAP
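Purely as an illustration of the question above, the sketch below
estimates the fundamental frequency of a single voiced frame from the
normalized autocorrelation peak; the parameter values and the crude
voicing threshold are assumptions made for readability, not
recommendations of the work group.

import numpy as np

def autocorr_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Estimate F0 of one (voiced) frame via the autocorrelation peak.

    frame: 1-D array of samples, fs: sampling rate in Hz.
    Returns an F0 estimate in Hz, or 0.0 if no clear peak is found.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return 0.0
    ac = ac / ac[0]                            # normalize by frame energy
    lag_min = int(fs / f0_max)                 # shortest admissible pitch period
    lag_max = min(int(fs / f0_min), len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max + 1]))
    return fs / lag if ac[lag] > 0.3 else 0.0  # crude voicing threshold

# Hypothetical usage: a 30 ms frame of a 120 Hz synthetic "voiced" sound.
fs = 8000
t = np.arange(int(0.030 * fs)) / fs
print(autocorr_f0(np.sin(2 * np.pi * 120 * t), fs))   # approx. 120 Hz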
Topic 2: Feature extraction
Questions:
- What are the limitations of the current feature extraction
methods regarding the use of ASR systems in real environments?
- To what extent can the principles of sound perception in man
(peripheral as well as central auditory perception principles)
be followed in designing feature extraction methods?
- Can the findings of cognitive science (neural structure, parallel
processing of information, etc.) have significant meaning also
for acoustic processing?
- Does RASTA processing help to increase the robustness of continuous
speech recognizers against moderate additive and convolutional
noise? (A minimal sketch of RASTA-style filtering follows the
contributor list.)
- What types of features are best suited for approaches like
HMM, NN, and others, and if this differs from approach to approach,
what are the fundamental reasons for it?
- Are there any methods for obtaining more speaker-independent
features by means of speaker-adapted transformations?
Contributors:
CPK, CRIN, FPM, NTH, NTR, PUM, TDK, UMB
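To make the RASTA question above more concrete, the following sketch
applies RASTA-style band-pass filtering to the time trajectories of
log critical-band (or log mel) energies, using the commonly cited IIR
filter coefficients; the input shapes and the toy data are our
assumptions.

import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spectra, pole=0.98):
    """Apply RASTA-style band-pass filtering along time to each channel.

    log_spectra: (T, C) array of log critical-band (or log mel) energies.
    Each channel trajectory is filtered with the commonly cited RASTA
    IIR filter, which suppresses slowly varying (convolutional) and very
    fast components of the short-term log spectrum.
    """
    b = np.array([0.2, 0.1, 0.0, -0.1, -0.2])   # band-pass numerator
    a = np.array([1.0, -pole])                  # integrator-like denominator
    return lfilter(b, a, log_spectra, axis=0)

# Hypothetical usage: a constant channel offset (convolutional distortion)
# is largely removed once the filter has settled.
logE = np.random.randn(200, 20) * 0.1 + 5.0     # toy log energies with offset
print(np.abs(rasta_filter(logE)[-1].mean()))    # much smaller than 5.0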
Topic 3: Noise suppression
Questions:
- What are the basic obstacles for feature extraction methods
in the case of acoustic signals emerging from multiple sources
(multiple voices, or one voice plus ambient noise)?
- How can noise suppression, speech enhancement and transmission
line equalization improve speech recognition over the telephone
line? (A basic spectral subtraction sketch follows the contributor
list.)
Contributors:
CPK, CTU, RUG, TUK, UMB
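As one concrete example of a classical enhancement technique relevant
to the second question above, the sketch below performs basic
magnitude spectral subtraction with overlap-add resynthesis. The frame
sizes, the noise-estimation assumption (the first 0.25 s are taken to
be speech-free) and the spectral floor are illustrative choices, not a
recommended configuration.

import numpy as np

def spectral_subtraction(noisy, fs, noise_seconds=0.25,
                         frame_len=256, hop=128, floor=0.01):
    """Very basic magnitude spectral subtraction for speech enhancement.

    The noise magnitude spectrum is estimated from the first
    `noise_seconds` of the signal (assumed speech-free) and subtracted
    from every frame; a spectral floor avoids negative magnitudes.
    """
    window = np.hanning(frame_len)
    n_noise = max(1, int(noise_seconds * fs) // hop)
    frames = [noisy[i:i + frame_len] * window
              for i in range(0, len(noisy) - frame_len, hop)]
    spectra = np.array([np.fft.rfft(f) for f in frames])
    noise_mag = np.abs(spectra[:n_noise]).mean(axis=0)

    mag = np.maximum(np.abs(spectra) - noise_mag, floor * noise_mag)
    enhanced = mag * np.exp(1j * np.angle(spectra))

    # Overlap-add resynthesis of the enhanced signal.
    out = np.zeros(len(frames) * hop + frame_len)
    for i, spec in enumerate(enhanced):
        out[i * hop:i * hop + frame_len] += np.fft.irfft(spec, n=frame_len)
    return out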
Topic 4: Speech corpora
Questions:
- Is it possible to agree upon a common procedure for collecting
similar speech material, recorded over the telephone in the different
European languages?
- Is it possible to manipulate databases not recorded over the
telephone such that they can be used for simulating telephone
applications?
Contributors:
ETH, DBT, FPM, IDIAP, KTH, NTR, PTT, TDK, UMB