Annual Report 1999

Table of Contents

Speech Communication and Speech Technology

Nparse - a shallow n-gram-based grammatical-phrase parser

Alice Carlberger

Nparse is a shallow probabilistic unification-based parser for finding grammatical phrases. It is data-driven and robust, allowing both domain-specific and unrestricted-language training. We believe it can constitute an interesting alternative to rudimentary grammars in speech technology applications; one such use is N-best list resorting in a synthesis or recognition front end. Two comparative studies were conducted on Swedish to evaluate the parser. In the first study, it was trained and tested across three domains, using a fine-grained set of grammatical-phrase nodes and grammatical features. The second study involved training and testing the system on one domain (a combination of linguist-generated basic structures and newspaper sentences) across three node systems of increasing complexity. In the process, a tree bank database (including a number grammar) was built and a detailed linguistic assessment performed. The long-term goal is to find the optimal system complexity for accurately establishing phrase boundaries and phrase types in newspaper text and, ultimately, unrestricted language. For this, a combination of iterative manual training and unsupervised training is desirable.

(TMH-QPSR 3-4/1999: 1-9)

Collecting and analysing two- and three-dimensional MRI data for Swedish

Olov Engwall and Pierre Badin*

* Institut de la Communication Parlée, UPRESA CNRS 5009, INPG-Université Stendhal, Grenoble, France

MRI (Magnetic Resonance Imaging) data have been collected for a male speaker of Swedish producing sustained vowels and consonants in VCV-context. The resulting database consists of one 3D set of 43 Swedish articulations, covering the entire vocal tract, and one midsagittal set of 85 articulations. The three-dimensional vocal tract shape has been reconstructed for 26 of the articulations, and the result for the Swedish fricatives and a vowel subset is reported on. The midsagittal image set has been analysed by applying Principal Component Analysis on the vocal tract contours to extract articulatory control parameters. The results of the analysis are presented together with findings on articulation strategies for the subject. A number of articulatory measures have been determined and the co-articulatory influence on these measures has been investigated.
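The Principal Component Analysis step mentioned above can be illustrated with a minimal sketch. It is not the authors' procedure: the contour representation (each contour flattened into a vector of x/y coordinates), the number of contour points, the number of retained components, and the synthetic data are all assumptions made for illustration. Only the count of 85 articulations is taken from the abstract.

```python
import numpy as np

def pca_contours(contours, n_components):
    """PCA on a stack of contours shaped (n_articulations, n_points, 2)."""
    n = contours.shape[0]
    X = contours.reshape(n, -1)           # one flattened contour per row
    mean = X.mean(axis=0)                 # mean contour
    Xc = X - mean                         # centre the data
    # SVD of the centred data: rows of Vt are the principal directions
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    scores = Xc @ components.T            # one parameter set per articulation
    explained = (s ** 2) / np.sum(s ** 2) # fraction of variance per component
    return mean, components, scores, explained[:n_components]

# Synthetic stand-in data (hypothetical): 85 contours of 50 points each,
# generated from two hidden "articulatory" parameters plus small noise.
rng = np.random.default_rng(0)
n_art, n_points = 85, 50
base = rng.normal(size=(1, n_points * 2))
modes = rng.normal(size=(2, n_points * 2))
params = rng.normal(size=(n_art, 2))
noise = 0.01 * rng.normal(size=(n_art, n_points * 2))
contours = (base + params @ modes + noise).reshape(n_art, n_points, 2)

mean, components, scores, explained = pca_contours(contours, n_components=2)
print("variance explained by two components:", explained.sum())
```

In such a sketch, each row of `scores` plays the role of a small set of control parameters from which a contour can be approximately rebuilt as `mean + scores @ components`.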

(TMH-QPSR 3-4/1999: 11-38)

Vocal tract modeling in 3D

Olov Engwall

A three-dimensional model of the vocal tract is under development. The model consists of vocal and nasal tract walls, lips, teeth and tongue, represented as visually distinct articulators by different colours resembling the ones in a natural human vocal tract. The naturalness of the vocal tract model can be used in speech training for hearing impaired children or in second language learning, where the visual feedback supplements the auditory feedback. The 3D model also provides a platform for studies on articulatory synthesis, as the vocal tract geometry can be set with a small number of articulation parameters, and vocal tract cross-sectional areas can be determined directly from the model.

(TMH-QPSR 1-2/1999: 31-38)

A visual input module used in the August spoken dialogue system

Tobias Öhman

We have built a software package for video handling and video processing called MODAL. The package is an extension of the Tcl/Tk scripting language, which makes it easy to use and to integrate into other software. The image processing algorithms in MODAL are intended for reliable and fast movement detection and localisation, which makes the package especially suitable for use in visual input modules for human-machine interfaces.

MODAL was used together with a desktop video camera to build a visual input module for the August spoken dialogue system. The information extracted by the module provides the system with knowledge about the presence and location of a user. The visual output interface of the system includes a 3D animated agent that responds to the visual input information by turning its head and eyes towards an approaching user, letting the user know that it is, to some extent, aware of the scene in front of it.

(TMH-QPSR 1-2/1999: 39-43)

Using HMMs and ANNs for mapping acoustic to visual speech

Tobias Öhman and Giampiero Salvi

In this paper we present two different methods for mapping auditory, telephone quality speech to visual parameter trajectories, specifying the movements of an animated synthetic face. In the first method, Hidden Markov Models (HMMs) were used to obtain phoneme strings and time labels. These were then transformed by rules into parameter trajectories for visual speech synthesis. In the second method, Artificial Neural Networks (ANNs) were trained to directly map acoustic parameters to synthesis parameters. Speaker independent HMMs were trained on a phonetically transcribed telephone speech database. Different underlying units of speech were modelled by the HMMs, such as monophones, diphones, triphones, and visemes. The ANNs were trained on male, female, and mixed speakers.

The HMM method and the ANN method were evaluated through audio-visual intelligibility tests with ten hearing impaired persons, and compared to "ideal" articulations (where no recognition was involved), a natural face, and to the intelligibility of the audio alone. It was found that the HMM method performs considerably better than the audio alone condition (54% and 34% keywords correct, respectively), but not as well as the "ideal" articulating artificial face (64%). The intelligibility for the ANN method was 34% keywords correct.

(TMH-QPSR 1-2/1999: 45-50)

Experiences from building two large telephone speech databases for Swedish

Kjell Elenius

The objective of the EU-funded SpeechDat project was to create large-scale speech databases for voice-driven teleservices. This paper deals with the design of two such Swedish resources: 5000 speakers over the fixed telephone network, and 1000 over the mobile network. It also reports on experiences from speaker recruitment and presents statistics on speaker distribution. Results regarding orthographic labelling of pronunciation, pronunciation errors and non-speech events are also included.

(TMH-QPSR 3-4/1999: 51-56)


Music Acoustics

An effect of body massage on voice loudness and phonation frequency in reading

Sten Ternström, Marie Andersson*, and Ulrika Bergman*

* Scandinavian College of Manual Medicine, Observatorieg. 19-21,
SE-113 29 Stockholm, Sweden

In this experiment, the effect of massage on voice fundamental frequency F0 and sound pressure level SPL was investigated. Subjects were recorded when reading a three-minute passage before and after a 30-minute session of massage administered by a trained naprapathy therapist. Sixteen subjects were given the massage, while fifteen controls rested, lying down in silence for the same amount of time. The subjects were then recorded reading the same passage again. The F0 and SPL averages across the whole passage were measured for the pre-treatment and post-treatment recordings.

In the post-massage recordings, subjects had lowered their F0 by 1.1 semitones and their SPL by 1.0 dB, with very high statistical significance. The drop in F0 was somewhat larger for the males than for the females. The control subjects showed no effect at all.
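To put the reported changes on a linear scale, the standard conversions can be sketched as follows. A semitone interval corresponds to a frequency ratio of 2^(1/12), and a level change in dB corresponds to a sound pressure ratio of 10^(dB/20). The function names are hypothetical and chosen only for this illustration. With these relations, a drop of 1.1 semitones is a fundamental frequency about 6 % lower, and a drop of 1.0 dB is a sound pressure about 11 % lower.

```python
import math

def semitones_between(f_ref_hz, f_hz):
    """Interval from f_ref_hz to f_hz in equal-tempered semitones."""
    return 12.0 * math.log2(f_hz / f_ref_hz)

def shift_by_semitones(f_hz, semitones):
    """Frequency obtained by shifting f_hz by the given number of semitones."""
    return f_hz * 2.0 ** (semitones / 12.0)

def pressure_ratio_from_db(level_change_db):
    """Sound pressure ratio corresponding to a level change in dB."""
    return 10.0 ** (level_change_db / 20.0)

# Changes reported in the abstract, expressed as ratios
print("frequency ratio for -1.1 semitones:", shift_by_semitones(1.0, -1.1))
print("pressure ratio for -1.0 dB:", pressure_ratio_from_db(-1.0))
```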

(TMH-QPSR 3-4/1999: 39-44)

Music technology and audio processing: Rall or accel into the new millennium?

Johan Sundberg, Voice Research Centre KTH

Music acoustic research can provide support in terms of objective knowledge that would further the quickly developing areas of music technology and audio processing. This is illustrated by three examples taken from current projects at KTH. One concerns the development of quality in sound reproduction systems in this century. A test in which expert listeners rated the recording year of phonograms of different ages demonstrated that significant advances were made between 1950 and 1970, while development was rather modest before and after this period. The second example concerns the secrets underlying timbral beauty. Acoustic analyses of recordings of an international opera tenor and a singer with an extremely ugly voice shed some light on basic demands on voice timbre. The ugly voice is found to suffer from pressed phonation, lack of a singer’s formant, irregular vibrato, and insufficient pitch accuracy. The third example elucidates tuning and phrasing differences between deadpan performances of MIDI files played on synthesisers and performances played by musicians on real instruments. The examples suggest that future development in the areas of music technology and audio processing may gain considerably from a close interaction with music acoustics research.

(TMH-QPSR 3-4/1999: 45-53)

Velum behaviour in professional classic operatic singing

Peer Birch1, Bodil Gümoes2, Hanne Stavad2, Svend Prytz3, Eva Björkner4, and Johan Sundberg4

1 Royal Academy of Music, Aarhus
2 Royal Academy of Music, Copenhagen
3 Phoniatric Department, Bispebjerg Hospital, Copenhagen
4 KTH Voice Research Centre, Department of Speech Music Hearing, Stockholm

Many singers regard "nasal resonance" as important to tone production in singing. In this study, we test the hypothesis that professional operatic singers sing with a slight velopharyngeal opening. The opening was estimated from nasofiberscope video recordings of 16 singers of different classifications who sang an ascending triad pattern throughout their pitch range. For each tone, the size of the opening was rated by a panel of experts. Many cases of a clear velopharyngeal opening were found. The singers repeated the same task into a flow mask (Glottal Enterprises), recording oral and nasal airflow separately, and the DC component of these signals was analysed. In addition, the degree of "nasal resonance" was evaluated by a panel of singing teachers. The correlation between velopharyngeal opening, nasal airflow, and degree of "nasal resonance" is analysed.

(TMH-QPSR 3-4/1999: 55-63)

Voice source differences between falsetto and modal registers in counter tenors, tenors and baritones

Johan Sundberg and Carl Högset, Oslo

The voice source of professional singers is analyzed for four counter tenors, five tenors, and four baritones who sang the syllable [pæ:] in soft, middle and loud voice at certain pitches in modal and falsetto/counter tenor register. Subglottal pressure, closed quotient, relative glottal leakage, as well as the relative level of the fundamental are analyzed. The counter tenors all used comparatively low subglottal pressures and showed a closed phase in their flow glottogram waveform, even in soft phonation. For a given value of the closed quotient, the fundamental tended to show a higher relative level in falsetto than in modal register. The flow glottogram differences between the registers seem related to vocal fold thickness, which is greater in modal than in falsetto register. The main characteristic of the counter tenors’ falsetto register seemed to be the presence of a closed phase.

(TMH-QPSR 3-4/1999: 65-74)

Emotive transforms

Johan Sundberg

Emotional expressivity in singing is examined by comparing neutral and expressive performances of a set of music excerpts as performed by a professional baritone singer. Both the neutral and the expressive versions showed considerable deviations from the nominal description represented by the score. Many of these differences can be accounted for in terms of the application of two basic principles: grouping, i.e., marking of the hierarchical structure, and differentiation, i.e., enhancing the differences between tone categories. The expressive versions differed from the neutral versions with respect to a number of acoustic characteristics. In the expressive versions, the structure and the tone category differences were marked more clearly. Furthermore, the singer emphasised semantically important words in the lyrics in the expressive versions. Comparing the means used by the singer for the purpose of emphasis with those used by a professional actor and voice coach reveals striking similarities.

(TMH-QPSR 3-4/1999: 75-85)

Level and center frequency of the singer's formant

Johan Sundberg

The "singer’s formant" is a prominent spectrum envelope peak near 3 kHz, typically found in voice sounds produced by classical operatic singers. According to previous research, it is mainly a resonatory phenomenon produced by a clustering of formants 3, 4 and 5. Its level relative to the first formant peak varies depending on vowel, vocal loudness and other factors. Its dependence on vowel formant frequencies is examined. Applying the acoustic theory of voice production, the level difference between the first and third formant is calculated for some standard vowels. The difference between observed and thus calculated levels is determined for various voices. It is found to vary considerably less between vowels sung by professional singers than by untrained voices. The center frequency of the singer’s formant, determined by long-term-spectrum analysis of gramophone recordings, is found to increase slightly with the pitch range of the voice classification.

(TMH-QPSR 3-4/1999: 87-94)

Consistency of inhalatory breathing patterns in professional operatic singers

Monica Thomasson and Johan Sundberg

Breathing behaviour is generally considered of great importance to an optimised voice production in the classical western singing tradition, the assumption being that it affects phonation. If so, professional operatic singers would be expected to accurately repeat the same breathing patterns when repeating the same phrases. Recently, consistency of phonatory breathing patterns was examined. In this study, inhalatory rib cage (RC) and abdominal wall (AW) movements were documented in five professional operatic singers by means of respiratory inductive plethysmography. Comparisons of inhalatory data gathered for three takes of the same phrase revealed high consistency with regard to lung volume (LV) change and RC movements in all subjects, suggesting great relevance of RC control in singing. Consistency of AW movements was observed in three singers. Consistency measures across phrases in different musical contexts were slightly lower. The observations support the idea that breathing behaviour is important to voice production in singing. In addition, correlation between LV change, on the one hand, and RC and AW movement on the other, was examined. The contribution to LV change from RC and AW varied across singers, thus suggesting that professional operatic singing does not require a uniform breathing strategy.

(TMH-QPSR 1-2/1999: 9-20)

Vocal fold vibrations: high-speed video imaging, kymography and acoustic analysis

Hans Larsson1, Stellan Hertegård2, Per-Åke Lindestad1 and Britta Hammarberg1,3
1 Department of Logopedics and Phoniatrics, Karolinska Institute,
Huddinge University Hospital, Huddinge
2 Phoniatric Department, Karolinska Hospital, Stockholm, Sweden
3 Department of Speech, Music and Hearing, KTH, Stockholm, Sweden

A new analysis system (the High-Speed Tool Box) was developed for studying vocal fold vibrations using a high speed camera. A Weinberger Speedcam system was utilized with a frame rate of 1904/sec. Images were stored and analyzed digitally. Analysis included automatic glottal edge detection and calculation of glottal area variations, as well as kymography. These signals can be compared to the acoustic and/or electroglottographic waveforms. Periodic and aperiodic signals of normal and disordered phonations were studied, and relations between glottal vibratory patterns and the sound waveform are discussed. The findings suggested that the high-speed system is particularly useful for studying details of mucosal movements during abnormal vibrations, such as diplophonia and tremor.
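The two kinds of signal mentioned above, a kymogram and a glottal area waveform, can be sketched on a synthetic image sequence. This is not the High-Speed Tool Box: a simple brightness threshold stands in for the automatic glottal edge detection, and the image size, the threshold, the synthetic glottis shape, and the 119 Hz vibration frequency are assumptions made for illustration. Only the frame rate of 1904 frames per second is taken from the abstract.

```python
import numpy as np

FRAME_RATE = 1904.0  # frames per second, as stated in the abstract

def frames_per_cycle(f0_hz):
    """Number of high-speed frames covering one vibratory cycle."""
    return FRAME_RATE / f0_hz

def kymogram(video, row):
    """Stack one image row from every frame: result is (n_frames, width)."""
    return video[:, row, :]

def glottal_area(video, threshold):
    """Per-frame count of dark pixels, a crude stand-in for edge detection."""
    return (video < threshold).sum(axis=(1, 2))

def synthetic_video(n_frames, height, width, f0_hz):
    """Bright image with a dark central slit whose width oscillates at f0_hz."""
    t = np.arange(n_frames) / FRAME_RATE
    half_width = 0.5 * (width / 4.0) * (1.0 - np.cos(2.0 * np.pi * f0_hz * t))
    x = np.arange(width) - (width - 1) / 2.0          # pixel offsets from centre
    dark = np.abs(x)[None, :] < half_width[:, None]   # (n_frames, width)
    frames = np.where(dark[:, None, :], 0.0, 1.0)     # (n_frames, 1, width)
    return np.broadcast_to(frames, (n_frames, height, width)).copy()

# Hypothetical example: a 119 Hz vibration gives 16 frames per cycle
video = synthetic_video(n_frames=190, height=8, width=32, f0_hz=119.0)
kymo = kymogram(video, row=4)
area = glottal_area(video, threshold=0.5)
print("frames per cycle:", frames_per_cycle(119.0))
print("kymogram shape:", kymo.shape)
```

The area waveform obtained this way is a one-dimensional signal sampled at the camera frame rate, so it can be plotted on the same time axis as an acoustic or electroglottographic recording.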

(TMH-QPSR 1-2/1999: 21-29)

Hearing Technology

Strategies and results from spoken L2 teaching with audio-visual feedback

Anne-Marie Öster

Teaching strategies and positive results from audio-visual training of both perception and production of spoken Swedish with 13 immigrants are reported. The learners participated in a total of six half-hour training sessions twice a week. The training had a positive effect on both perception and production of individual Swedish sounds, stress, intonation and rhythm. The positive results were obtained through audio-visual contrastive feedback shown in real time and provided through a PC running the IBM SpeechViewer software. A split screen provided a comparison of the learner's production with a correct model by the teacher. F0 and intensity were displayed in real time. The learners' opinions of this pronunciation training with immediate audio-visual feedback are also reported.

(TMH-QPSR 1-2/1999: 1-7)

Published by: TMH, Speech, Music and Hearing

Last updated: Monday, 25-Oct-2004 12:08:18 MEST