COST 249: Work Group Organization

1 Introduction

In order to organize the research and the cooperation among the different partners in the most efficient way, it was decided to set up four work groups (WGs), each responsible for treating questions regarding a specific aspect of the continuous speech recognition problem. The four work groups are:

  1. WG1: Concept establishment (B. Pfister)
  2. WG2: Linguistic processing (J.-L. Cochard)
  3. WG3: Phonetic decoding (K. Elenius)
  4. WG4: Acoustic signal processing (T. Svendsen)

The names in parentheses are those of the people appointed as WG chairmen. It was left to the COST action participants to decide to which WGs they wish to contribute, and it was pointed out that all participants are allowed to attend the meetings of all WGs. In the subsequent sections, the reader can find a list of participants intending to contribute to the work of one or more of the WGs, a detailed list of the topics and questions to be treated by each WG, and the institutions that will contribute to the different topics. Each participant is designated by an abbreviation of the institution he or she belongs to.

2 List of participants

What follows is an alphabetical list of the contributing institutions (ordered by abbreviation) and the person responsible for each contribution.

  1. AKH: Hungarian Academy of Sciences (Hungary), Klara Vicsi
  2. BT: British Telecom (United Kingdom), Simon Ringland
  3. CPK: Center for PersonKommunikation (Denmark), Paul Dalsgaard
  4. CRIN: Centre de Recherches en Informatique de Nancy (France), Jean-Paul Haton
  5. CTU: Czech Technical University (Czech Republic), Jan Uhlir
  6. DBT: Deutsche Bundespost Telekom (Germany), Bernhard Kaspar
  7. ETH: Eidgenössische Technische Hochschule (Switzerland), Beat Pfister
  8. FEP: Faculty of Engineering in Porto (Portugal), Carlos Espain
  9. FPM: Faculté Polytechnique de Mons (Belgium), Henry Leich
  10. FUB: Fondazione Ugo Bordoni (Italy), Lucio Ricotti
  11. IDIAP: Institut Dalle Molle d'Intelligence Artificielle Perceptive (Switzerland), Jean-Luc Cochard
  12. KTH: Royal Institute of Technology, Speech Communication and Music Acoustics (Sweden), Kjell Elenius
  13. NTH: Norwegian Institute of Technology (Norway), Torbjørn Svendsen
  14. NTR: Norwegian Telecom Research (Norway), Finn Tore Johansen
  15. PTT: Dutch PTT Research (Netherlands), Paul van Alphen
  16. PUM: Polytechnic University of Madrid (Spain), Juan Gomez-Mena
  17. RUG: University of Ghent (Belgium), Jean-Pierre Martens
  18. TDK: Tele Denmark (Denmark), Per Rosenbeck
  19. TUK: Technical University of Kosice (Slovakia), Anton Cizmar
  20. UMB: University of Maribor (Slovenia), Zdravko Kacic

Obviously, new participants can be added to this list as soon as they have decided to which WGs they want to contribute. Furthermore, a participant may at any time join a WG whenever he or she feels able to contribute to one of the topics treated by that WG. This document will therefore be updated on a regular basis.

3 WG1: Concept Establishment

Topic 1: Overall system configuration

Questions:

  1. Which types of knowledge representation and processing mechanisms are suitable for the different subtasks of an ASR system?
  2. Which information should be provided by the acoustic-phonetic level?
  3. Where can what type of prosodic information be used in an ASR system?
  4. How can prosodic features be used to detect stressed syllables, phrase boundaries, phrase type, etc.?
  5. What are the optimal interaction mechanisms between the different processing levels? Of particular interest is the interface between the acoustic-phonetic level and the NLP level, as they typically use different methodologies (statistical and knowledge-based, respectively).
  6. What can we learn from human perception (hearing characteristics, psychoacoustics, etc.) to improve ASR systems?
  7. Can the findings of cognitive science (neural structure, gestalt perception, etc.) have significant implications for ASR systems?

Contributors:

DBT, ETH, IDIAP, KTH, NTH, RUG, UMB

Topic 2: Task complexity

Questions:

  1. How can the complexity of a given speech recognition task be assessed, taking into account acoustic variations, vocabulary size, pronunciation variants, syntactic structures, etc.?
  2. For which types of ASR systems is the size of the search space an adequate complexity measure? (A toy illustration follows this topic.)
  3. Which strategies for search space reduction can be applied in the different types of ASR systems?

Contributors:

CPK, DBT, ETH, IDIAP
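
As a toy illustration of question 2 above, the following sketch (all numbers are invented; this is not part of the work programme itself) computes the perplexity of a bigram language model, i.e. the effective branching factor that a grammar substitutes for the raw vocabulary size in a search-space estimate.

```python
import math

# With a vocabulary of V equally likely words, the search space for an
# n-word utterance has V**n paths; a language model with perplexity PP
# reduces the effective branching factor from V to PP.

def perplexity(bigram_probs):
    """Perplexity of one word sequence under a bigram model, where
    bigram_probs holds the P(w_i | w_{i-1}) values along the sequence."""
    avg_log = sum(math.log2(p) for p in bigram_probs) / len(bigram_probs)
    return 2.0 ** (-avg_log)

# Hypothetical numbers: a 1000-word vocabulary, and a grammar in which
# each observed bigram has probability 0.05.
V, n = 1000, 10
pp = perplexity([0.05] * n)
print(f"effective branching factor: {pp:.0f} instead of {V}")
print(f"search space: ~{pp:.0f}**{n} = {pp ** n:.1e} instead of {float(V) ** n:.1e}")
```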

Topic 3: Task and language independence

Questions:

  1. Which parts of the different system configurations (cf. Topic 1) are, or can be, designed in a language-independent way?
  2. How can an ASR system be designed in such a way that adaptation to a new task or a new language can be achieved efficiently?

Contributors:

none yet

Topic 4: Spontaneous speech issues

Questions:

  1. What problems and consequences arise for the different ASR system approaches from the use of spontaneous speech? Typical properties of spontaneous speech are poor pronunciation (slurring), syntactic incorrectness, hesitations, etc.

Contributors:

none yet

Topic 5: Dialogue modelling

Questions:

  1. Is there a general formalism (independent of specific applications) to describe dialogues?
  2. What are the advantages and drawbacks of procedural vs. knowledge-based (e.g. plan recognition) dialogue descriptions? (A sketch of a procedural description follows this topic.)
  3. How can an ASR system appropriately be interfaced to the pragmatic level?

Contributors:

CPK, DBT, ETH, KTH
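
The sketch below illustrates the first alternative of question 2: a purely procedural (finite-state) dialogue description. The application, slots and prompts are hypothetical; a knowledge-based description would instead derive such a sequence from a model of the user's plan.

```python
# Each step of the hypothetical flight-information script is a fixed
# (slot, prompt) pair; the dialogue simply walks the script in order.
SCRIPT = [
    ("destination", "Where do you want to fly to?"),
    ("date",        "On which date?"),
    ("confirm",     "Shall I list the flights? (yes/no)"),
]

def run_dialogue(user_turns):
    """Run the fixed script, pairing each prompt with a user answer."""
    filled = {}
    for (slot, prompt), answer in zip(SCRIPT, user_turns):
        print(f"SYSTEM: {prompt}")
        print(f"USER:   {answer}")
        filled[slot] = answer
    return filled

print(run_dialogue(["London", "next Friday", "yes"]))
```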

4 WG2: Linguistic Processing

Topic 1: Lexical knowledge

Questions:

  1. How can word coarticulation be handled at the lexical knowledge representation level?
  2. How to obtain an optimal lexicon capable of representing word pronunciation variants?
  3. Is it possible to train word pronunciation models that better take into account both true pronunciation variants and phonetic decoding errors?
  4. How can a lexical access module be tuned to allow word selection controlled by a partial phonetic sequence (see the sketch after this topic)?

Contributors:

ETH, IDIAP, KTH, NTH, NTR, RUG
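
The following sketch illustrates questions 2 and 4 under invented words and transcriptions: a lexicon that stores several pronunciation variants per word, and a lexical access function that selects word candidates from a partial phonetic sequence.

```python
# A minimal lexicon mapping each word to its pronunciation variants
# (phone symbols and entries are invented for illustration).
LEXICON = {
    "tomato": [("t", "ax", "m", "aa", "t", "ow"),
               ("t", "ax", "m", "ey", "t", "ow")],   # two variants
    "today":  [("t", "ax", "d", "ey")],
    "decode": [("d", "ih", "k", "ow", "d")],
}

def words_matching_prefix(partial):
    """Return all (word, variant) pairs whose transcription starts with
    the partial phone sequence produced so far by the phonetic decoder."""
    partial = tuple(partial)
    return [(word, variant)
            for word, variants in LEXICON.items()
            for variant in variants
            if variant[:len(partial)] == partial]

# Both 'tomato' variants and 'today' survive the partial sequence /t ax/.
print(words_matching_prefix(["t", "ax"]))
```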

Topic 2: Parsing strategies

Questions:

  1. How can sentence parsing be extended to sentence AND word parsing?
  2. How can rules for other regular phenomena be integrated into the parsing mechanism?
  3. How can scoring be consistently integrated into parsing (see the sketch after this topic)?
  4. What do parsing strategies have to look like if their input is no longer a given set of basic hypotheses, but controlling the generation of hypotheses becomes one of their jobs?
  5. What kind of information do parsers then need to make decisions?
  6. What kind of decisions should parsers be able to make?
  7. How can a syntactic chart parser be integrated into a word recognition module?

Contributors:

CPK, ETH, FUB, KTH
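
As a sketch of question 3, the following toy probabilistic CYK parser (grammar, rule probabilities and input sentence are all invented) combines rule scores into a parse score during parsing, rather than scoring complete parses afterwards.

```python
import math
from collections import defaultdict

RULES = [                        # (lhs, rhs, probability), grammar in CNF
    ("S",   ("NP", "VP"),  1.0),
    ("NP",  ("Det", "N"),  1.0),
    ("VP",  ("V", "NP"),   1.0),
    ("Det", ("the",),      1.0),
    ("N",   ("pilot",),    0.3),
    ("N",   ("flight",),   0.7),
    ("V",   ("booked",),   1.0),
]

def cyk(words):
    """Return the best log score of an S covering the whole input."""
    n = len(words)
    best = defaultdict(lambda: -math.inf)    # (i, j, symbol) -> log score
    for i, w in enumerate(words):            # fill the lexical cells
        for lhs, rhs, p in RULES:
            if rhs == (w,):
                best[i, i + 1, lhs] = max(best[i, i + 1, lhs], math.log(p))
    for span in range(2, n + 1):             # combine adjacent spans
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for lhs, rhs, p in RULES:
                    if len(rhs) == 2:
                        s = (math.log(p) + best[i, k, rhs[0]]
                             + best[k, j, rhs[1]])
                        best[i, j, lhs] = max(best[i, j, lhs], s)
    return best[0, n, "S"]

# Prints log(0.3 * 0.7), the score of the only parse of this sentence.
print(cyk("the pilot booked the flight".split()))
```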

Topic 3: Prosody

Questions:

  1. How can micro-prosodic variations be used to improve the pertinence of lexical access in a CSR system?
  2. How can syntactic constituent boundaries be constrained with prosodic information (prosodic word boundaries)?
  3. How can both phonetic and prosodic features be introduced in the linguistic processing levels?

Contributors:

none yet

Topic 4: Higher order constraints

Questions:

  1. How can semantic-based and user plan-based constraints be added to syntactic ones?
  2. How can semantic constraints and user plans be modelled in the sublanguage of ATIS flight queries?

Contributors:

FUB

Topic 5: Language models

Questions:

  1. How can different kinds of linguistic knowledge be represented in a speech recognition system?
  2. How can well-motivated natural language theories (GB, GPSG, etc.) be adapted in the context of speech processing to reduce the problem solution space efficiently?

Contributors:

CPK

Topic 6: Speaker adaptation

Questions:

  1. Can one adapt to the speaker by imposing articulatory constraints?
  2. Can one adapt to the speaker by adapting the lexical knowledge in the speech recognizer?

Contributors:

KTH

5 WG3: Phonetic Decoding

Topic 1: Neural networks and HMMs

Questions:

  1. What is the best way of adding context information to an ANN?
  2. Does it help to supplement raw data with phonetically motivated features derived from these data in a neural network based classifier?
  3. What are the best architectures for combining neural networks (NNs) with dynamic programming (DP) or hidden Markov models (HMMs)? (A sketch of one classic combination follows this topic.)

Contributors:

CRIN, ETH, FPM, FEP, IDIAP, KTH, NTR, RUG, TUK
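
One classic NN/HMM combination referred to in question 3 is the hybrid in which network outputs serve as HMM emission scores. The sketch below (all probabilities are invented) divides per-frame phone posteriors by the phone priors to obtain scaled likelihoods and runs a Viterbi search over them; it is one possible architecture, not a prescription.

```python
import numpy as np

posteriors = np.array([[0.7, 0.2, 0.1],    # P(q | x_t), one row per frame,
                       [0.5, 0.4, 0.1],    # as a neural network would emit
                       [0.1, 0.6, 0.3]])
priors     = np.array([0.5, 0.3, 0.2])     # P(q) from the training data
trans      = np.array([[0.6, 0.3, 0.1],    # HMM transition matrix A[i, j]
                       [0.1, 0.6, 0.3],
                       [0.1, 0.1, 0.8]])

scaled_lik = posteriors / priors           # proportional to P(x_t | q)

# Viterbi search in the log domain over the scaled likelihoods
# (a uniform initial state distribution is assumed).
logdelta = np.log(scaled_lik[0])
back = []
for t in range(1, len(scaled_lik)):
    scores = logdelta[:, None] + np.log(trans)
    back.append(scores.argmax(axis=0))     # best predecessor per state
    logdelta = scores.max(axis=0) + np.log(scaled_lik[t])

# Backtrace to recover the best state (phone) sequence.
path = [int(logdelta.argmax())]
for bp in reversed(back):
    path.append(int(bp[path[-1]]))
print("best state sequence:", path[::-1])
```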

Topic 2: Task and language independence

Questions:

  1. Can one come up with strategies which allow a nearly automatic adaptation of an ASR system designed for one task (or language) into one that is optimal for another task (or language)?

Contributors:

AKH, BT, CPK, ETH, FPM, IDIAP, NTH, NTR, PTT, PUM, RUG

Topic 3: Recognition units

Questions:

  1. Does it help to perform the acoustic-phonetic decoding in a cascade scheme: first distinctive features, and then phonetic hypotheses?
  2. Is it possible to use phonetic features directly for lexical look-up? Note that features need not be synchronous with phoneme boundaries.
  3. What segmental and supra-segmental features (including prosodic ones) are useful for the classification of phonetic segments?
  4. What are the best units of speech with regard to detectability, classification ability, linguistic relevance, ...?
  5. How to represent phonetic events (transitions between units, units or clusters of units) in a fuzzy way to take into account the uncertainty about their precise location in the speech signal?
  6. How can hypotheses about small phonetic events (e.g. phones) be rescored into scores for hypotheses about larger phonetic events (e.g. syllables)? (See the sketch after this topic.)
  7. Must speech units be linguistically defined, or can one define units which are acoustically stable and offer a decodable correspondence to linguistic units?
  8. Can one find better methods for segmentation and labelling of speech, yielding improved training methods for ASR?
  9. Transition-based systems.

Contributors:

AKH, CPK, CTU, DBT, ETH, FPM, NTH, RUG, TDK, TUK, UMB
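
The sketch below illustrates question 6 under invented phone hypotheses and scores: the log scores of a contiguous phone sequence are accumulated and duration-normalised to yield a score for the enclosing syllable hypothesis.

```python
phone_hyps = [            # (phone, start frame, end frame, acoustic log score)
    ("s",  0,  8, -4.1),  # all symbols and scores invented for illustration
    ("t",  8, 12, -2.3),
    ("r", 12, 18, -3.0),
    ("ay", 18, 30, -5.2),
]

def syllable_score(hyps):
    """Sum the phone log scores and normalise by the syllable duration,
    so that hypotheses of different lengths remain comparable."""
    total = sum(score for _, _, _, score in hyps)
    duration = hyps[-1][2] - hyps[0][1]
    return total / duration

print(f"score of syllable /s t r ay/: {syllable_score(phone_hyps):.3f}")
```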

6 WG4: Acoustic Signal Processing

Topic 1: Prosodic feature extraction

Questions:

  1. How can prosodic features like fundamental frequency, intensity, duration, stress, focus, etc. be determined reliably? (A sketch of autocorrelation-based F0 estimation follows this topic.)

Contributors:

CPK, DBT, IDIAP
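
As one standard approach to question 1, the sketch below estimates the fundamental frequency of a single voiced frame from its autocorrelation peak. Sampling rate, frame length and search range are assumptions; a practical F0 tracker would add voicing decisions and continuity constraints.

```python
import numpy as np

def estimate_f0(frame, fs=8000, fmin=60.0, fmax=400.0):
    """Estimate F0 of one (voiced) frame via the autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)   # lag range to search
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

fs = 8000
t = np.arange(int(0.04 * fs)) / fs            # one 40 ms frame
frame = np.sin(2 * np.pi * 120.0 * t)         # synthetic 120 Hz "voiced" frame
print(f"estimated F0: {estimate_f0(frame, fs):.1f} Hz")
```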

Topic 2: Feature extraction

Questions:

  1. What are the limitations of the current feature extraction methods regarding the use of ASR systems in real environments?
  2. To what extent can the principles of sound perception in man (peripheral as well as central auditory perception principles) be followed in designing feature extraction methods?
  3. Can the findings of cognitive science (neural structure, parallel processing of information, etc.) also have significant implications for acoustic processing?
  4. Does RASTA processing help to increase the robustness of continuous speech recognizers against moderate additive and convolutional noise? (A sketch of the RASTA filter follows this topic.)
  5. What types of features are best suited for approaches like HMMs, NNs and others, and if this differs from approach to approach, what are the fundamental reasons for it?
  6. Are there any methods for obtaining more speaker-independent features by means of speaker-adapted transformations?

Contributors:

CPK, CRIN, FPM, NTH, NTR, PUM, TDK, UMB
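
The sketch below shows the core of RASTA processing (question 4): band-pass filtering each log-spectral band trajectory over time. The coefficients are those of the classical RASTA filter of Hermansky and Morgan (with its pure delay term ignored for simplicity); the shapes and test signal are invented.

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spectrum):
    """Apply the RASTA band-pass filter along the time (frame) axis.
    log_spectrum: array of shape (frames, bands) of log band energies."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])   # FIR (derivative) part
    a = np.array([1.0, -0.98])                        # leaky integrator
    return lfilter(b, a, log_spectrum, axis=0)

# The filter has zero gain at DC, so a constant channel offset (here a
# fake log-gain of 5.0 in one band) decays toward zero at the output,
# which is exactly the convolutional-noise robustness asked about above.
out = rasta_filter(np.full((400, 1), 5.0))
print(f"frame  10: {out[10, 0]:.3f}")    # transient still visible
print(f"frame 399: {out[399, 0]:.3f}")   # close to zero
```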

Topic 3: Noise suppression

Questions:

  1. What are the basic obstacles for feature extraction methods in the case of acoustic signals emerging from multiple sources (multiple voices, or one voice plus ambient noise)?
  2. How can noise suppression, speech enhancement and transmission line equalization improve speech recognition over the telephone line? (A sketch of spectral subtraction follows this topic.)

Contributors:

CPK, CTU, RUG, TUK, UMB
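
As one example of the speech enhancement techniques of question 2, the sketch below performs magnitude spectral subtraction on a single frame. The noise estimate, frame size and spectral floor are assumptions; in practice the noise spectrum would be averaged over speech pauses.

```python
import numpy as np

def spectral_subtraction(noisy_frame, noise_mag, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from one frame,
    keeping the noisy phase and flooring the result to avoid negatives."""
    spec = np.fft.rfft(noisy_frame * np.hanning(len(noisy_frame)))
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return np.fft.irfft(clean_mag * np.exp(1j * phase))

fs, n = 8000, 256
t = np.arange(n) / fs
speechlike = np.sin(2 * np.pi * 500 * t)                # stand-in for speech
noise = 0.3 * np.random.randn(n)
noise_mag = np.abs(np.fft.rfft(noise * np.hanning(n)))  # in practice: averaged
enhanced = spectral_subtraction(speechlike + noise, noise_mag)
print(f"noisy RMS:    {np.sqrt(np.mean((speechlike + noise) ** 2)):.3f}")
print(f"enhanced RMS: {np.sqrt(np.mean(enhanced ** 2)):.3f}")
```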

Topic 4: Speech corpora

Questions:

  1. Is it possible to agree upon a common procedure for collecting similar speech material, recorded over the telephone in the different European languages?
  2. Is it possible to manipulate databases not recorded over the telephone such that they can be used for simulating telephone applications? (A sketch of such a manipulation follows this topic.)

Contributors:

DBT, ETH, FPM, IDIAP, KTH, NTR, PTT, TDK, UMB
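
In its simplest form, the manipulation of question 2 can be approximated by band-limiting wideband recordings to the nominal 300-3400 Hz telephone band, as in the sketch below. Filter order and band edges are the usual assumptions; a realistic simulation would also resample to 8 kHz and add channel distortion.

```python
import numpy as np
from scipy.signal import butter, lfilter

def telephone_bandpass(signal, fs):
    """Band-pass a signal to the nominal analogue telephone band."""
    b, a = butter(4, [300.0, 3400.0], btype="bandpass", fs=fs)
    return lfilter(b, a, signal)

fs = 16000
t = np.arange(fs) / fs                       # one second of test signal
wideband = (np.sin(2 * np.pi * 100 * t)      # below the telephone band
            + np.sin(2 * np.pi * 1000 * t)   # inside the band
            + np.sin(2 * np.pi * 6000 * t))  # above the band
narrowband = telephone_bandpass(wideband, fs)
print(f"RMS before: {np.sqrt(np.mean(wideband ** 2)):.2f}")    # ~1.22
print(f"RMS after:  {np.sqrt(np.mean(narrowband ** 2)):.2f}")  # ~0.71
```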