Annual Report 1999
Table of Contents
Speech Communication and Speech Technology
Nparse - a shallow n-gram-based grammatical-phrase parser
Nparse is a shallow probabilistic unification-based parser for finding
grammatical phrases. It is data-driven and robust, allowing both domain-specific and unrestricted-language training.
We believe it can constitute an interesting alternative to rudimentary grammars in speech technology applications;
one such use is N-best list re-sorting in a synthesis or recognition front end. Two comparative studies were conducted
on Swedish to evaluate the parser. In the first study, it was trained and tested across three domains, using a
fine-grained set of grammatical-phrase nodes and grammatical features. The second study involved training and testing
the system on one domain (a combination of linguist-generated basic structures and newspaper sentences) across
three node systems of increasing complexity. In the process, a tree bank database (including a number grammar)
was built and a detailed linguistic assessment performed. The long-term goal is to find the optimal system complexity
for accurately establishing phrase boundaries and phrase types in newspaper text and, ultimately, unrestricted
language. For this, a combination of iterative manual training and unsupervised training is desirable.
(TMH-QPSR 3-4/1999: 1-9)
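The kind of shallow, probability-driven phrase bracketing described above can be illustrated with a toy bigram chunker. The probabilities, the B/I/O node set, and the part-of-speech labels below are invented for illustration; this is a sketch in the n-gram spirit, not Nparse's actual model or node inventory.

```python
import math

# Toy bigram model: transition P(tag | prev_tag) and emission P(pos | tag).
# All numbers are hypothetical.
TRANS = {("<s>", "B-NP"): 0.8, ("<s>", "O"): 0.2,
         ("B-NP", "I-NP"): 0.7, ("B-NP", "O"): 0.3,
         ("I-NP", "I-NP"): 0.5, ("I-NP", "O"): 0.5,
         ("O", "B-NP"): 0.6, ("O", "O"): 0.4}
EMIT = {("B-NP", "DET"): 0.6, ("B-NP", "NOUN"): 0.4,
        ("I-NP", "NOUN"): 0.7, ("I-NP", "ADJ"): 0.3,
        ("O", "VERB"): 0.9, ("O", "ADV"): 0.1}
TAGS = ["B-NP", "I-NP", "O"]
SMALL = 1e-6  # back-off probability for unseen events

def chunk(pos_seq):
    """Viterbi decoding of phrase tags over a part-of-speech sequence."""
    chart = [{t: (math.log(TRANS.get(("<s>", t), SMALL))
                  + math.log(EMIT.get((t, pos_seq[0]), SMALL)), None)
              for t in TAGS}]
    for pos in pos_seq[1:]:
        prev = chart[-1]
        chart.append({t: max((prev[p][0]
                              + math.log(TRANS.get((p, t), SMALL))
                              + math.log(EMIT.get((t, pos), SMALL)), p)
                             for p in TAGS)
                      for t in TAGS})
    # Backtrace from the best final tag
    tag = max(chart[-1], key=lambda t: chart[-1][t][0])
    path = [tag]
    for i in range(len(chart) - 1, 0, -1):
        tag = chart[i][tag][1]
        path.append(tag)
    return path[::-1]

print(chunk(["DET", "ADJ", "NOUN", "VERB"]))  # → ['B-NP', 'I-NP', 'I-NP', 'O']
```

The chunker brackets "DET ADJ NOUN" as one noun phrase and leaves the verb outside, which is the essence of establishing phrase boundaries and phrase types from local statistics.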
Collecting and analysing two- and three-dimensional MRI data for Swedish
Olov Engwall and Pierre Badin*
* Institut de la Communication Parlée, UPRESA CNRS 5009, INPG-Université
Stendhal, Grenoble, France
MRI (Magnetic Resonance Imaging) data have been collected for a male
speaker of Swedish producing sustained vowels and consonants in VCV-context. The resulting database consists of
one 3D set of 43 Swedish articulations, covering the entire vocal tract, and one midsagittal set of 85 articulations.
The three-dimensional vocal tract shape has been reconstructed for 26 of the articulations, and the result for
the Swedish fricatives and a vowel subset is reported on. The midsagittal image set has been analysed by applying
Principal Component Analysis on the vocal tract contours to extract articulatory control parameters. The results
of the analysis are presented together with findings on articulation strategies for the subject. A number of articulatory
measures have been determined and the co-articulatory influence on these measures has been investigated.
(TMH-QPSR 3-4/1999: 11-38)
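The Principal Component Analysis step, extracting articulatory control parameters from midsagittal contours, can be sketched as follows. The implementation (power iteration on flattened contour vectors) is an illustrative stand-in under assumed toy data, not the analysis code used in the study.

```python
import math
import random

def first_pc(contours, iters=200):
    """Return (direction, scores): the first principal component of a set
    of flattened contour vectors, and each contour's projection onto it
    (the candidate articulatory control parameter)."""
    n, d = len(contours), len(contours[0])
    mean = [sum(c[j] for c in contours) / n for j in range(d)]
    centred = [[c[j] - mean[j] for j in range(d)] for c in contours]
    random.seed(0)
    v = [random.random() for _ in range(d)]
    for _ in range(iters):
        # Apply the covariance matrix implicitly: w = (1/n) X^T (X v)
        proj = [sum(row[j] * v[j] for j in range(d)) for row in centred]
        w = [sum(proj[i] * centred[i][j] for i in range(n)) / n
             for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    scores = [sum(centred[i][j] * v[j] for j in range(d)) for i in range(n)]
    return v, scores

# Four toy "contours" that vary along a single direction
direction, scores = first_pc([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
print([round(x, 3) for x in direction])  # → [0.707, 0.707]
```

On real data, each retained component corresponds to one articulatory degree of freedom, and the per-articulation scores serve as its control parameter values.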
Vocal tract modeling in 3D
A three-dimensional model of the vocal tract is under development. The
model consists of vocal and nasal tract walls, lips, teeth and tongue, represented as visually distinct articulators
in colours resembling those of a natural human vocal tract. The natural appearance of the model
makes it suitable for speech training for hearing-impaired children or for second-language learning, where the visual feedback
supplements the auditory feedback. The 3D model also provides a platform for studies on articulatory synthesis,
as the vocal tract geometry can be set with a small number of articulation parameters, and vocal tract cross-sectional
areas can be determined directly from the model.
(TMH-QPSR 1-2/1999: 31-38)
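The idea of deriving cross-sectional areas from a small set of articulation parameters can be sketched with a toy area function. The two parameters, the section count, and all constants below are hypothetical; the actual model computes areas from a full 3D geometry.

```python
def area_function(jaw, tongue_body, n_sections=8):
    """Toy area function: cross-sectional areas (cm^2) along the tract,
    derived from two articulation parameters in [0, 1]. Purely
    illustrative stand-in for the 3D model's geometry."""
    areas = []
    for i in range(n_sections):
        pos = i / (n_sections - 1)        # 0 = glottis, 1 = lips
        base = 2.0 + 2.0 * jaw * pos      # jaw opening widens the front
        # tongue body raising constricts the tract around its midpoint
        constriction = 1.5 * tongue_body * max(0.0, 1 - abs(pos - 0.5) * 4)
        areas.append(max(0.2, base - constriction))
    return areas

# A raised tongue body narrows the mid-tract sections
print(len(area_function(0.5, 0.5)))  # → 8
```

An area function of this kind is what an articulatory synthesiser needs as input, which is why direct area read-out from the model is useful.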
A visual input module used in the August spoken dialogue system
We have built a software package for video handling and video processing
called MODAL. The package is an extension of the Tcl/Tk scripting language, which makes it easy to use and to integrate
into other software. The image processing algorithms in MODAL are intended for reliable and fast movement detection
and localisation, which makes the package especially suitable for use in visual input modules for human-machine interaction.
MODAL was used together with a desktop video camera to build a visual
input module to the August spoken dialogue system. The information extracted by the module provides the system
with knowledge about the presence and localisation of a user. The visual output interface of the system includes
a 3D animated agent that responds to the visual input information by turning its head and eyes towards an approaching
user, letting the user know that it is, to some extent, aware of the scene in front of it.
(TMH-QPSR 1-2/1999: 39-43)
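Movement detection and localisation of the kind described above can be sketched with plain frame differencing. The function below is a hypothetical illustration on tiny grey-scale frames, not MODAL's actual algorithm, and the threshold is an assumption.

```python
def detect_movement(prev, curr, threshold=30):
    """Frame-differencing movement detector: compare two grey-scale
    frames (2D lists of 0-255 values) and return (moved, centroid),
    where centroid is the mean (row, col) of the changed pixels."""
    changed = [(r, c)
               for r, row in enumerate(curr)
               for c, v in enumerate(row)
               if abs(v - prev[r][c]) > threshold]
    if not changed:
        return False, None
    cy = sum(r for r, _ in changed) / len(changed)
    cx = sum(c for _, c in changed) / len(changed)
    return True, (cy, cx)

# A bright blob appearing in a dark 4x4 frame
prev = [[0] * 4 for _ in range(4)]
curr = [row[:] for row in prev]
curr[1][2] = 255
curr[2][2] = 255
print(detect_movement(prev, curr))  # → (True, (1.5, 2.0))
```

The centroid is the kind of localisation information a dialogue system can use to turn the agent's gaze towards a user.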
Using HMMs and ANNs for mapping acoustic to visual speech
Tobias Öhman and Giampiero Salvi
In this paper we present two different methods for mapping auditory,
telephone quality speech to visual parameter trajectories, specifying the movements of an animated synthetic face.
In the first method, Hidden Markov Models (HMMs) were used to obtain phoneme strings and time labels. These were
then transformed by rules into parameter trajectories for visual speech synthesis. In the second method, Artificial
Neural Networks (ANNs) were trained to directly map acoustic parameters to synthesis parameters. Speaker independent
HMMs were trained on a phonetically transcribed telephone speech database. Different underlying units of speech
were modelled by the HMMs, such as monophones, diphones, triphones, and visemes. The ANNs were trained on male,
female, and mixed speakers.
The HMM method and the ANN method were evaluated through audio-visual
intelligibility tests with ten hearing impaired persons, and compared to "ideal" articulations (where
no recognition was involved), a natural face, and to the intelligibility of the audio alone. It was found that
the HMM method performs considerably better than the audio alone condition (54% and 34% keywords correct, respectively),
but not as well as the "ideal" articulating artificial face (64%). The intelligibility for the ANN method
was 34% keywords correct.
(TMH-QPSR 1-2/1999: 45-50)
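The direct acoustic-to-visual mapping idea can be made concrete with a minimal per-frame regressor. The sketch below fits one visual synthesis parameter as a linear function of acoustic features by stochastic gradient descent; it is a linear stand-in for the ANNs in the study, and the feature choice, learning rate, and epoch count are hypothetical.

```python
def train_mapping(frames, targets, lr=0.01, epochs=2000):
    """Fit one visual synthesis parameter as a linear function of
    per-frame acoustic features, by stochastic gradient descent."""
    d = len(frames[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(frames, targets):
            y = sum(wi * xi for wi, xi in zip(w, x)) + b  # prediction
            err = y - t
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

# Toy data: the target parameter is 0.5 * (a single acoustic feature)
w, b = train_mapping([[0.0], [1.0], [2.0], [3.0]], [0.0, 0.5, 1.0, 1.5])
print(round(w[0], 3))  # → 0.5
```

Unlike the HMM route, no phoneme string is produced; the mapping goes straight from acoustic frames to synthesis parameters, which is the contrast the two methods explore.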
Experiences from building two large telephone speech databases for Swedish
The objective of the EU-funded SpeechDat project was to create large-scale
speech databases for voice-driven teleservices. This paper deals with the design of two such Swedish resources:
one with 5000 speakers recorded over the fixed telephone network and one with 1000 over the mobile network. It also reports on experiences
from speaker recruitment and presents statistics on speaker distribution. Results regarding orthographic labelling
of pronunciation, pronunciation errors and non-speech events are also included.
(TMH-QPSR 3-4/1999: 51-56)
An effect of body massage on voice loudness and phonation frequency
Sten Ternström, Marie Andersson*, and Ulrika Bergman*
* Scandinavian College of Manual Medicine, Observatorieg. 19-21,
SE-113 29 Stockholm, Sweden
In this experiment, the effect of massage on voice fundamental frequency
F0 and sound pressure level SPL was investigated. Subjects were recorded reading a three-minute passage before
and after a 30-minute session of massage administered by a trained naprapathy therapist. Sixteen subjects were
given the massage, while fifteen controls rested, lying down in silence for the same amount of time. The subjects
were then recorded reading the same passage again. The F0 and SPL averages across the whole passage were measured
for the pre-treatment and post-treatment recordings.
In the post-massage recordings, subjects had lowered their F0 by 1.1
semitones and their SPL by 1.0 dB, with very high statistical significance. The drop in F0 was somewhat larger
for the males than for the females. The control subjects showed no effect at all.
(TMH-QPSR 3-4/1999: 39-44)
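The reported 1.1-semitone drop follows the standard interval formula 12·log2(f_after/f_before). A small helper, with hypothetical mean F0 values, makes the computation concrete.

```python
import math

def semitone_diff(f_before, f_after):
    """Interval in semitones between two mean F0 values:
    12 * log2(f_after / f_before); negative values are a drop."""
    return 12 * math.log2(f_after / f_before)

# Hypothetical means: 200 Hz before massage, 187.7 Hz after
print(round(semitone_diff(200.0, 187.7), 1))  # → -1.1
```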
Music technology and audio processing: Rall or accel into the new millennium?
Johan Sundberg, Voice Research Centre KTH
Music acoustic research can provide support in terms of objective knowledge
that would further the quickly developing areas of music technology and audio processing. This is illustrated by
three examples taken from current projects at KTH. One concerns the development of quality in sound reproduction
systems in this century. A test, in which expert listeners rated the recording year of phonograms of different ages,
demonstrated that significant advances were made between 1950 and 1970, while development was rather modest before
and after this period. The second example concerns the secrets underlying timbral beauty. Acoustic analyses of
recordings of an international opera tenor and a singer with an extremely ugly voice shed some light on basic demands
on voice timbre. The ugly voice is found to suffer from pressed phonation, lack of a singer’s formant, irregular
vibrato, and insufficient pitch accuracy. The third example elucidates tuning and phrasing differences between deadpan
performances of Midi files played on synthesisers and performances played by musicians on real instruments. The
examples suggest that future development in the areas of music technology and audio processing may gain considerably
from a close interaction with music acoustics research.
(TMH-QPSR 3-4/1999: 45-53)
Velum behaviour in professional classic operatic singing
Bodil Gümoes2, Hanne Stavad2, Svend Prytz3, Eva Björkner4, and Johan Sundberg4
1 Royal Academy of Music, Aarhus
2 Academy of Music, Copenhagen
3 Department, Bispebjerg Hospital, Copenhagen
4 KTH Voice Research Centre, Department of Speech Music Hearing, Stockholm
Many singers regard "nasal resonance" as important to tone
production in singing. In this study, we test the hypothesis that professional operatic singers sing with a slight
velopharyngeal opening. The opening was estimated from nasofiberscope video recordings of 16 singers of different
classifications who sang an ascending triad pattern throughout their pitch range. For each tone, the size of the
opening was rated by a panel of experts. Many cases of a clear velopharyngeal opening were found. The singers repeated
the same task into a flow mask (Glottal Enterprises), recording oral and nasal airflow separately, and the DC component
of these signals was analysed. In addition, the degree of "nasal resonance" was evaluated by a panel
of singing teachers. The correlation between velopharyngeal opening, nasal airflow, and degree of "nasal resonance" was examined.
(TMH-QPSR 3-4/1999: 55-63)
Voice source differences between falsetto and modal registers in counter
tenors, tenors and baritones
Johan Sundberg and Carl Högset, Oslo
The voice source of professional singers is analyzed for four counter
tenors, five tenors, and four baritones who sang the syllable [pæ:] in soft, middle and loud voice at certain
pitches in modal and falsetto/counter tenor register. Subglottal pressure, closed quotient, relative glottal leakage,
as well as the relative level of the fundamental are analyzed. The counter tenors all used comparatively low subglottal
pressures, and showed a closed phase in their flow glottogram waveform, even in soft phonation. For a given value
of the closed quotient the fundamental tended to show a higher relative level in falsetto than in modal register.
The flow glottogram differences between the registers seem related to vocal fold thickness, which is greater in modal
than in falsetto register. The main characteristic of the counter tenors’ falsetto register seemed to be the presence
of a closed phase.
(TMH-QPSR 3-4/1999: 65-74)
Emotional expressivity in singing is examined by comparing neutral and
expressive performances of a set of music excerpts as performed by a professional baritone singer. Both the neutral
and the expressive versions showed considerable deviations from the nominal description represented by the score.
Much of these differences can be accounted for in terms of the application of two basic principles, grouping, i.e.,
marking of the hierarchical structure, and differentiation, i.e., enhancing the differences between tone categories.
The expressive versions differed from the neutral versions with respect to a number of acoustic characteristics.
In the expressive versions, the structure and the tone category differences were marked more clearly. Furthermore,
the singer emphasised semantically important words in the lyrics in the expressive versions. Comparing the means
used by the singer for emphasis with those used by a professional actor and voice coach reveals striking similarities.
(TMH-QPSR 3-4/1999: 75-85)
Level and center frequency of the singer's formant
The "singer’s formant" is a prominent spectrum envelope peak
near 3 kHz, typically found in voice sounds produced by classical operatic singers. According to previous research,
it is mainly a resonatory phenomenon produced by a clustering of formants 3, 4 and 5. Its level relative to the
first formant peak varies depending on vowel, vocal loudness and other factors. Its dependence on vowel formant
frequencies is examined. Applying the acoustic theory of voice production, the level difference between the first
and third formant is calculated for some standard vowels. The difference between observed and thus calculated levels
is determined for various voices. It is found to vary considerably less between vowels sung by professional singers
than by untrained voices. The center frequency of the singer’s formant, determined by long-term-spectrum analysis
of gramophone recordings, is found to increase slightly with the pitch range of the voice classification.
(TMH-QPSR 3-4/1999: 87-94)
Consistency of inhalatory breathing patterns in professional operatic singers
Monica Thomasson and Johan Sundberg
Breathing behaviour is generally considered of great importance to an
optimised voice production in the classical western singing tradition, the assumption being that it affects phonation.
If so, professional operatic singers would be expected to accurately repeat the same breathing patterns when repeating
the same phrases. Recently, consistency of phonatory breathing patterns was examined. In this study, inhalatory
rib cage (RC) and abdominal wall (AW) movements were documented in five professional operatic singers by means
of respiratory inductive plethysmography. Comparisons of inhalatory data gathered for three takes of the same phrase
revealed high consistency with regard to lung volume (LV) change and RC movements in all subjects, suggesting great
relevance of RC control in singing. Consistency of AW movements was observed in three singers. Consistency measures
across phrases in different musical contexts were slightly lower. The observations support the idea that breathing
behaviour is important to voice production in singing. In addition, correlation between LV change, on the one hand,
and RC and AW movement on the other, was examined. The contribution to LV change from RC and AW varied across singers,
thus suggesting that professional operatic singing does not require a uniform breathing strategy.
(TMH-QPSR 1-2/1999: 9-20)
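The correlation between lung volume change and rib cage / abdominal wall movement can be computed with the ordinary Pearson coefficient; the helper below, with hypothetical per-take values, shows the calculation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length series, e.g. lung
    volume change vs. rib cage movement across takes (values hypothetical)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Perfectly consistent, proportional movements would give r = 1.0
print(round(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 3))  # → 1.0
```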
Vocal fold vibrations: high-speed video imaging, kymography and acoustic analysis
Stellan Hertegård2, Per-Åke Lindestad1, and Britta Hammarberg1,3
1 Department of Logopedics and Phoniatrics, Karolinska Institute, Huddinge University Hospital, Huddinge
2 Department, Karolinska Hospital, Stockholm, Sweden
3 Department of Speech, Music and Hearing, KTH, Stockholm, Sweden
A new analysis system (the High-Speed Tool Box) was developed for studying vocal fold vibrations
using a high-speed camera. A Weinberger Speedcam system was utilized at a frame rate of 1904 frames/sec. Images were
stored and analyzed digitally. Analysis included automatic glottal edge detection and calculation of glottal area
variations, as well as kymography. These signals can be compared to the acoustic and/or electroglottographic waveforms.
Periodic and aperiodic signals of normal and disordered phonations were studied, and relations between glottal
vibratory patterns and the sound waveform are discussed. The findings suggested that the high-speed system is particularly
useful for studying details of mucosal movements during abnormal vibrations, such as diplophonia and tremor.
(TMH-QPSR 1-2/1999: 21-29)
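The glottal-area extraction and kymography steps can be sketched as follows. This is an illustrative reconstruction on tiny grey-scale frames, not the High-Speed Tool Box implementation; the threshold and data layout are assumptions.

```python
def glottal_area(frame, threshold=60):
    """Approximate glottal area as the count of dark pixels below a grey
    threshold - a crude stand-in for automatic glottal edge detection."""
    return sum(1 for row in frame for v in row if v < threshold)

def kymogram(frames, line):
    """Kymography: stack one fixed image line from every frame, so the
    vibratory pattern at that glottal position unfolds over time."""
    return [frame[line] for frame in frames]

# Two tiny 2x3 frames; the dark pixel (value 10) marks the glottal gap
frames = [[[200, 10, 200], [200, 200, 200]],
          [[200, 200, 200], [200, 200, 200]]]
areas = [glottal_area(f) for f in frames]
print(areas)  # → [1, 0]
```

Plotted over the frame sequence, the area series gives the glottal area variation, and the stacked lines give the kymogram that can be compared with acoustic or electroglottographic waveforms.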
Strategies and results from spoken L2 teaching with audio-visual feedback
Teaching strategies and positive results from audio-visual training of
both perception and production of spoken Swedish with 13 immigrants are reported. The learners participated in
a total of six half-hour training sessions twice a week. The training had a positive effect on both perception
and production of individual Swedish sounds, stress, intonation and rhythm. The positive results were obtained
through audio-visual contrastive feedback shown in real time and provided through a PC running the IBM SpeechViewer
software. A split screen provided a comparison of the learner's production with a correct model by the teacher.
F0 and intensity were displayed in real time. The learners' opinions of this pronunciation training with immediate
audio-visual feedback are also reported.
(TMH-QPSR 1-2/1999: 1-7)