TMH-QPSR 1/1996 - Abstracts

The Department of Speech Communication and Music Acoustics has a new name from January 1996:
Department of Speech, Music and Hearing, KTH

TMH-QPSR is the new abbreviation for the Quarterly Progress and Status Report




SPEECH COMMUNICATION

Towards an articulatory speech synthesiser: Model development and simulations
TMH-QPSR 1/1996: 1-16

Mats Båvegård
KTH, Speech, Music and Hearing

The main focus of this thesis is a parameterised production model of an articulatory speech synthesiser. It consists of an introduction and comments on the six papers included in the thesis.

A parameterised vocal-tract area function model has been revised and extended to include consonantal articulations. The extension is based on the theory that any consonant can be regarded as superimposed on a vowel configuration, with the different components defined by the particular coarticulatory pattern. We have modelled target configurations of labial and alveolar nasals, laterals, and apical consonants coarticulated with a front vowel, with promising results.

The analysis procedure for obtaining articulatory parameters for the synthesiser from the acoustic speech signal is studied. Two different studies are discussed here. The first is a real-time application in which a neural network relates a filterbank representation of the speech signal to unit-length tubes of the vocal tract area function.

The second method represents the speech signal as formant frequencies and describes the vocal tract area function by a three-parameter representation. This inversion is performed in two steps: a first approximation is attained from either a codebook or a neural net, and a final optimisation is performed by iterative interpolation to find a perfect or acceptable match.
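As a hedged illustration of the two-step inversion described above, the sketch below pairs a codebook lookup with a simple iterative refinement. All names, the toy acoustic mapping and the three parameters are hypothetical placeholders, not the thesis implementation:

    import numpy as np

    def formants_from_params(params):
        # Hypothetical stand-in for the acoustic model that maps the
        # three area-function parameters to formant frequencies (Hz).
        place, area, lips = params
        f1 = 500.0 + 300.0 * area - 100.0 * place
        f2 = 1500.0 + 600.0 * place - 200.0 * lips
        f3 = 2500.0 + 100.0 * lips
        return np.array([f1, f2, f3])

    def formant_error(params, target):
        return np.sum((formants_from_params(params) - target) ** 2)

    def invert(target, codebook, n_iter=50, step=0.05):
        # Step 1: first approximation from the closest codebook entry.
        params = min(codebook, key=lambda p: formant_error(np.array(p), target))
        params = np.array(params, dtype=float)
        # Step 2: iterative local refinement towards an acceptable match.
        for _ in range(n_iter):
            for i in range(len(params)):
                for delta in (-step, step):
                    trial = params.copy()
                    trial[i] += delta
                    if formant_error(trial, target) < formant_error(params, target):
                        params = trial
        return params

    codebook = [(p, a, l) for p in (0.0, 0.5, 1.0)
                for a in (0.0, 0.5, 1.0) for l in (0.0, 0.5, 1.0)]
    print(invert(np.array([620.0, 1800.0, 2540.0]), codebook))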

There is also a study of voice source characteristics in a fully interactive system incorporating both sub- and supraglottal tracts. One objective has been to test the auditory significance of the moderate amounts of ripple that appear in the inverse-filtered waveform, in contrast to the smoother glottal flow model waveforms. Another part of our study has been concerned with the nonlinear transformation from glottal area function to glottal flow, with and without the constant leakage invoked by a glottal chink.


MUSIC ACOUSTICS


The Sentograph: Input devices and the communication of bodily expression
TMH-QPSR 1/1996: 17-22

Roel Vertegaal & Tamas Ungvary
Dept of Ergonomics, Twente Univ., The Netherlands, and KTH, Speech, Music and Hearing

In this paper, we present a qualitative movement analysis of the Sentograph, a 3D isometric input device. We believe the essence of this device is its capability to directly transduce the rapidly accelerating and decelerating curves of muscular tension naturally associated with the expression of emotion into a form in which they can be communicated musically. The direct correspondence between the controlled parameters and the primary sensory feedback parameters plays an important role in this process.



On the body resonance C3 and its relation to top and back plate stiffness
TMH-QPSR 1/1996: 23-30

Erik V Jansson, Benedykt K Niewczyk & Lars Frydén
KTH, Dept of Speech, Music and Hearing

In an earlier investigation it was found that a dominant peak "C3" in the 500 to 600 Hz range is a major difference between soloist violins and violins of inferior quality. Therefore an answer was sought to the question: what properties of the C3 resonance are optimum for tonal quality, and how are they achieved? An experimental violin was selected. Its plates were stiffened with crossbars glued to the outside, which were then removed in steps. At each step the violin was tested by playing and its acoustical properties were measured. The player liked the violin best without crossbars, indicating that he preferred a violin with a C3 of low frequency, of high level at the left bridge foot, and with a Chladni pattern typical for a violin. Measurements made at the bridge top showed that the stiffness of the back mainly controlled the C3 frequency and the soundpost position mainly controlled its peak level; i.e., the back plate thickness should be adjusted to give the right C3 frequency, and thereafter the soundpost position is adjusted to give the right level. The experimental violin was preferred with similar peak levels of T1 and C3, independently of the C3 frequency.


On the design of digital waveguides for stiff strings
TMH-QPSR 1/1996: 31-48

Magnus Kjellander
KTH, Dept of Speech, Music and Hearing

This study deals with some aspects of the design procedures for filters in a digital waveguide model of stiff strings. The model consists basically of a delay line and two filters: (1) one filter that determines the fine tuning of the string and the dispersion due to string stiffness, and (2) a second filter that models the mechanical impedance of the soundboard. A new method for the design of the filter for fine-tuning and dispersion is proposed, giving better results than filters of the same order designed with traditional techniques. The filter used is a group delay equalizer. For n specified group delay times at n freely specified frequencies, the method gives an all-pass filter of order 2n. The filter approximation of the mechanical impedance of the soundboard was found by minimising an error function, designed to give large values when the mismatch between measured and generated decay rates of partials was large. The minimisation was done by an iterative gradient method.
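A minimal sketch of the group-delay-equalizer idea, assuming scipy (the pole positions are arbitrary illustrations, not the proposed design method): an all-pass filter has unit magnitude everywhere, so its poles shape only the phase, and n conjugate pole pairs give the order-2n filter mentioned above.

    import numpy as np
    from scipy.signal import group_delay

    def allpass_from_poles(poles):
        # All-pass construction: the numerator is the reversed,
        # conjugated denominator, so |H(e^jw)| = 1 and only the phase
        # (hence the group delay) depends on the pole positions.
        a = np.poly(poles)
        b = a[::-1].conj()
        return b, a

    # n = 2 conjugate pole pairs -> an order-4 (2n) all-pass filter.
    poles = [0.90 * np.exp(1j * 0.3), 0.90 * np.exp(-1j * 0.3),
             0.80 * np.exp(1j * 1.1), 0.80 * np.exp(-1j * 1.1)]
    b, a = allpass_from_poles(poles)

    # Inspect the group delay (samples) at two specified frequencies
    # (rad/sample); a design method would adjust the poles until these
    # values match the n specified delay times.
    w, gd = group_delay((b, a), w=[0.3, 1.1])
    print(dict(zip(w, gd)))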



Phonetogram measurements of singers before and after solo singer education
TMH-QPSR 1/1996: 49-56

Dirk Mürbe, Johan Sundberg, Jenny Iwarsson & Friedemann Pabst
KTH, Dept of Speech, Music and Hearing

Phonetograms of 25 singer students were recorded at the beginning and at the end of a full-time solo singer education. Various aspects are analysed: the upper contour, the level in a high-frequency band-pass filter, and the smoothness of these curves. Several significant differences are found. The overall SPL of the vowels [a], [i] and [u], as well as the level in a high-frequency band-pass filter centered at 3.16 kHz, increased. The variation of overall SPL with pitch decreased. The results are discussed and compared with other investigations of phonetograms of singers' voices. The results suggest that phonetogram analysis is useful as an objective assessment of singers' voice characteristics and should be a valuable supplement to the documentation of the subjective evaluation of the progress of solo singer education.



Intonation preferences for major thirds with non-beating ensemble sounds
TMH-QPSR 1/1996: 55-62

Jan Nordmark & Sten Ternström
KTH, Dept of Speech, Music and Hearing

The frequency ratios, or intervals, of the twelve-tone scale can be mathematically defined in several slightly different ways, each of which may be more or less appropriate in different musical contexts. For maximum mobility of musical key, present-day instruments with fixed tuning are typically tuned in equal temperament, except for performances of early music or avant-garde contemporary music. Some contend that pure intonation, being free of beats, is more natural and would be preferred on instruments with variable tuning. The sound of choirs is such that beats are very unlikely to serve as cues for intonation. Choral performers have access to variable tuning, yet have not been shown to prefer pure intonation.

The difference between alternative intonation schemes is largest for the major third interval. Choral directors and other musically expert subjects were asked to adjust to their preference the intonation of 20 major third intervals in synthetic ensemble sounds. The preferred size of the major third was 395.4 cents, with intra-subject averages ranging from 388 to 407 cents.
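For reference, these figures can be placed against the standard definitions of the two tuning systems (facts about the cent measure, not results of the paper); in LaTeX notation:

    c = 1200 \log_2 \frac{f_2}{f_1} \ \text{cents}, \qquad
    \text{pure major third } \tfrac{5}{4}: \ 1200 \log_2 1.25 \approx 386.3 \ \text{cents}, \qquad
    \text{equal-tempered major third}: \ 400 \ \text{cents}.

The preferred mean of 395.4 cents thus lies between the pure and the equal-tempered third.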


HEARING TECHNOLOGY


A comparison between patients using cochlear implants and hearing aids. Part I: Results on speech tests
TMH-QPSR 1/1996: 63-76

Eva Agelfors
KTH, Dept of Speech, Music and Hearing

The speech perception ability reported for profoundly hearing-impaired persons using hearing aids or cochlear implants varies widely. A test battery was constructed that consisted of segmental and suprasegmental tasks and connected speech, presented in three different modalities: visual (V), auditory (A) and audio-visual (AV). Two groups of patients participated. Fifteen subjects with a profound hearing loss (PTA 89.4 dB) used hearing aids (HA) and fifteen subjects used cochlear implants (CI), either a single-channel extracochlear implant or a multi-channel intracochlear implant. The speech reception data showed a great variation among the subjects. In the Connected Discourse Tracking test (CDT), without support of lipreading, the HA group received a higher mean benefit from their conventional hearing aids than the mean for the CI group. Many of the cochlear implantees obtained great benefit from their aids, but others derived little benefit.



A comparison between patients using cochlear implants and hearing aids. Part II: Results on a self-rating scale
TMH-QPSR 1/1996: 77-81

Eva Agelfors
KTH, Dept of Speech, Music and Hearing

Two groups of patients participated (the same as in Part I): fifteen subjects with a profound hearing loss (PTA 89.4 dB) used hearing aids (HA) and fifteen subjects used cochlear implants (CI), either a single-channel extracochlear implant or a multi-channel intracochlear implant. To obtain information about their difficulties in everyday communicative environments, the subjects rated themselves on a scale of 0-6 (never-always) on 74 items, using a multi-category self-rating inventory developed specifically for persons with profound and severe hearing loss (PIPSL; Owens & Raggio, 1988). The speech reception data (Part I) showed a great variation among the subjects. In the Connected Discourse Tracking test (CDT), without support of lipreading, the HA group received a higher mean benefit from their conventional hearing aids than the mean for the CI group. The mean results on the self-rating inventory showed no difference between the two groups for the category "Understanding Speech With No Visual Cues". The mean scores for the categories "Intensity, Environmental Sounds" and "Understanding Speech With Visual Cues" were higher for the CI group.


TMH-QPSR 2/1996

FONETIK 96
Papers presented at the Swedish Phonetics Conference, Nässlingen, 29-31 May, 1996


Reduction of vowel chaos
TMH-QPSR 2/1996: 1-4

Rolf Lindgren & Björn Lindblom
Dept of Linguistics, Stockholm University

A previous study showed that classifying vowel data from spontaneous speech with statistical clustering methods and artificial neural networks gave poor results.

In this study a different approach was tested. A production model (the undershoot model) was used to explain the large variation of F2 in vowel data from spontaneous speech. Expanding the model to include regressive assimilation and formant peak velocity yielded a significant improvement in predicting F2 values.



Phonotactically determined allophones of the j phoneme in Swedish
TMH-QPSR 2/1996: 5-8

Sven Björsten
Dept of Linguistics, Stockholm University

The realization of the j phoneme in Swedish varies between a glide and a fricative. This study describes the phonotactic distribution of these variants in 3 subjects from the Stockholm area. It was found that syllable-initial single j-, and j in initial clusters of the type obstruent + j, vary between a glide and a fricative, while j in initial clusters of the type nasal + j, syllable-final single -j, and j in all final clusters studied are realized as a glide. Either one or two underlying targets for /j/ could give rise to this distribution.


The production of some Swedish coronals
TMH-QPSR 2/1996: 9-12

Per Lindblad & Sture Lundqvist
Dept of Linguistics and Dept of Prosthetic Dentistry, Göteborg University

As a first step in a project aiming to describe the production of Swedish coronal consonants in detail with various methods, /t d l/ were analyzed with EPG in nonsense words spoken by 10 speakers. /t/ was consistently dental, whereas /d/ and /l/ were partly dental, partly alveolar. In each of 4 south Swedish speakers, /d/ was alveolar, or partly alveolar and partly dental. In contrast, 5 of 6 central Swedish speakers had a dental /d/. The dentialveolar tongue contact area was largest for /t/ and equal for /d/ and /l/. In 9 speakers, the total /l/ contact area had side wings extending along most of or the whole palate sides, whereas 1 speaker had an /l/ area with no or very short wings.



Acoustic-phonetic evidence of vowel quantity and quality in Norwegian
TMH-QPSR 2/1996: 13-16

Dawn Behne, Bente Moxness & Anne Nyland
Norwegian University of Science and Technology, Dragvoll

The phonological notions of vowel quantity and vowel quality are acoustically realized in the duration and spectral characteristics of a vowel. This study examines acoustic attributes of six Norwegian vowels, /i, i:, O, O:, A, A:/ and addresses the extent to which vowel quantity and quality each affect vowel duration and the first two formant frequencies of the vowels. Results suggest that while vowel quality affects both spectral characteristics and the duration of a vowel, vowel quantity affects vowel duration without necessarily affecting the vowel spectrum in Norwegian.



From natural phonetics to the rules behind the "rules"
TMH-QPSR 2/1996: 17

Alvar Nyqvist Goës, Bromma

A search for the ultimate exception-free rules of natural language, backed up by elements and techniques from natural phonetics and other relevant hard sciences, led to a basic, highly explanatory grammar. A voltage-parallel calibration of the unmarked bottom of the English stress scale (stress = neuroelectric energy) served as an introduction to the hidden rule system.


The Swedish intonation model in interactive perspective
TMH-QPSR 2/1996: 19-22

Gösta Bruce, Johan Frid, Björn Granström, Kjell Gustafson, Merle Horne & David House
Dept of Linguistics and Phonetics, Lund University, and KTH, Speech, Music and Hearing

In this paper we discuss some recent extensions of the prosody model used in the research project 'Prosodic Segmentation and Structuring of Dialogue'. In our current modelling of dialogue intonation we are using both model-based resynthesis and text-to-speech. The idea is to be able to regulate the influence of discourse and dialogue structure on prosody and intonation. In the present contribution we report on some aspects of overall F0 trends in dialogues and their implications for the model. We also report on our work to incorporate dialogue-related rules in our text-to-speech system.


F0 declination in spontaneous and read-aloud speech
TMH-QPSR 2/1996: 23-24

Marc Swerts, Eva Strangert & Mattias Heldner
Institute for Perception Research (IPO), Eindhoven, The Netherlands, and Department of Phonetics, Umeå University

The study reports a comparison of F0 declination in read-aloud and spontaneous speech using Swedish material. For both speaking styles the analysis revealed negative slopes, resettings at utterance boundaries, and a steepness-duration dependency with declination being less steep in longer utterances than in shorter ones. However, there was a difference in degree of declination between the two speaking styles, read-aloud speech having steeper slopes, stronger resetting and a more apparent time-dependency than spontaneous speech.



Temporal variations in Swedish consonant clusters: preliminary data
TMH-QPSR 2/1996: 25-28

Robert Bannert, Peter Czigler, Nicolette Karst & Tomas Landgren
Department of Phonetics, Umeå University

There is a lack of phonetic knowledge about consonant clusters in Swedish. The research project "Variations within consonant clusters in Swedish" (VaKoS) aims at providing new quantitative insights into this area of Swedish phonology and phonetics.

This report presents the temporal variations of /s, t, n/ in the speech of two Standard Swedish speakers. The material consists of 62 test words including singletons and consonant clusters under various conditions: in initial, medial and final word position; with and without focus accent; and with various degrees of cluster complexity.



Focal accent and subglottal pressure
TMH-QPSR 2/1996: 29-32

Gunnar Fant, Stellan Hertegård & Anita Kruckenberg
KTH, Dept of Speech, Music and Hearing

Earlier studies (Fant & Kruckenberg, 1994) of acoustic correlates of prosodic categories in Swedish have shown that focal accent, when realized with a local F0 peak, may be associated with a minimum in voice excitation amplitude Ee and also in intensity. Measurements from several speakers indicate that Ee rises with F0 up to a frequency F0r, after which Ee decays or stays constant on a plateau. From recordings of subglottal pressure through a tracheal puncture probe we have found that focal accentuation is generally accompanied by a decay of subglottal pressure, starting from a high value at the syllable boundary and decaying within the region of the F0 peak. Our recent recordings also include supraglottal pressure. The relative roles of F0, voice intensity, voice source spectral shape, duration and spectral contrasts as correlates of word stress, accent type and focal accent are discussed. Individual variations are large.



Phonetic correlates of focus accents in Swedish
TMH-QPSR 2/1996: 33-36

Mattias Heldner
Dept of Phonetics, Umeå University

This study reports the results of a production experiment conducted to examine the effect of focus on F0 movements (word accent falls and focus accent rises), segment durations, overall intensities and spectral balance. The experiment will serve to provide data on which to base the construction of stimuli in perception experiments aimed at investigating the acoustic cues to perceived focus.



Evaluation of Swedish prosody within the MULTEXT-SW project
TMH-QPSR 2/1996: 37-40

Eva Strangert & Anna Aasa
Dept of Linguistics, Umeå University

Based on speech data collected within the SAM project, tools for the analysis and coding of F0 used within the MULTEXT project were evaluated using 8 Swedish speakers. Generally, comparison of original and modelled versions of the same utterances using INTSINT coding revealed fairly successful matching. The three main types of mismatches that occurred were: stylistic changes, changes of prominence relations, and shifts of word accent.



Quantal theory of speech timing
TMH-QPSR 2/1996: 41-44

Gunnar Fant & Anita Kruckenberg
KTH, Dept of Speech, Music and Hearing

This is a summary view of temporal patterns we have observed in the analysis of Swedish text reading. Special attention is devoted to trends of quantal structures in the timing of vowels and consonants, syllables, interstress intervals and pauses. It is well known that pause durations increase with increasing syntactic order of boundaries. A recent study supports our previous findings of multiple peaks in the histograms of pause durations and our theory of neural coordination of pause durations and prepause final lengthening. These add up to an integer multiple of a basic time constant of about 0.5 sec, which reflects a local average of interstress duration and preserves a quasi-rhythmical continuity of interstress intervals spanning a pause.
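In symbols, the claimed regularity can be sketched as follows (a paraphrase of the abstract, not a formula quoted from the paper):

    T_{\text{pause}} + T_{\text{final lengthening}} \approx n \cdot T_0,
    \qquad T_0 \approx 0.5 \ \text{s}, \quad n = 1, 2, 3, \ldots

where T_0 reflects a local average of the interstress interval.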

Regularities in the timing of syllables and phonetic segments, with due regard to relative distinctiveness and reading speed, will be discussed and, on a higher level, tempo variations within a sentence.



APEX, an articulatory synthesis model for experimental and computational studies of speech production
TMH-QPSR 2/1996: 45-48

Johan Stark, Björn Lindblom & Johan Sundberg
Dept of Linguistics, Stockholm University, and KTH, Dept of Speech, Music and Hearing

This is a preliminary report of a project in progress, the purpose of which is to create an articulatory synthesis model for studies of speech production. It is realised as a computer program which can control the lips, the shape of the tongue body and apex, and the mandible. An area function may be computed and displayed graphically and numerically. Formant values may be computed and sent to a formant synthesis model for sound production using a DSP hardware module. Parameters may be generated automatically and systematically, and the results sent to a disk file. The program keeps all speaker-dependent data in a disk file, enabling processing of several speakers.



VCV coarticulation experiments with the vocal tract area function model
TMH-QPSR 2/1996: 49-52

Mats Båvegård
KTH, Dept of Speech, Music and Hearing

Many VCV transitions can be regarded as a sequence of two components: first, an initial local change due to the release of the primary consonant articulator; thereafter, slower changes of F2 and F3 as the tongue body and jaw move away from their positions as supporting articulators for the consonant constriction to their positions as primary articulators for the vowels.

A number of VCV coarticulation experiments are described using a vocal tract area function model and a coarticulation function.



Talking heads - communication, articulation and animation
TMH-QPSR 2/1996: 53-56

Jonas Beskow
KTH, Dept of Speech, Music and Hearing

Human speech communication relies not only on audition but also on vision, especially under poor acoustic conditions. The face is an important carrier of both linguistic and extra-linguistic information. Using computer graphics it is possible to synthesize faces and do audio-visual text-to-speech synthesis, a technique that has a number of interesting applications, for example in the area of man-machine interfaces. At KTH, a system for rule-based audio-visual text-to-speech synthesis has been developed. The system is based on the KTH text-to-speech system, which has been complemented with a three-dimensional parametric model of a human face that is animated in real time in synchrony with the auditory speech. The audio-visual text-to-speech synthesis has also been incorporated into a system for spoken man-machine dialogue.



What's in a sign: distributional data from Swedish Sign Language
TMH-QPSR 2/1996: 57-58

Catharina Kylander Unger
Dept of Linguistics, Stockholm University

The major articulatory parameters for sign production are the articulator (one or two hands), place of articulation (neutral space, body locations, or the non-dominant, passive hand) and the articulation (movement in the sign).

In this paper, focus is placed on the relationship and interaction between articulator and place of articulation in Swedish Sign Language (SSL), in particular for two-handed signs with the passive hand as place of articulation.

In discussions regarding markedness and handshape in sign language, it is generally accepted, at least for American Sign Language, that the shape of the passive hand is restricted to one of a small set of unmarked handshapes (highly frequent in the language, acquired early in child language acquisition, present in all other sign languages, and having a high perceptual salience). Recent dependency-based phonological models of sign language maintain this claim. In the present study, a relatively large set of Swedish signs is analyzed to see if the claim holds true for SSL.

The distribution of articulator handshapes over places of articulation was calculated for 2826 Swedish signs. The data show that there is a small set of preferred handshapes for the articulator, no matter where the sign is produced. The data also show that there are preferred places of articulation that are insensitive to the shape of the articulator.

Finally, it can be seen that in two-handed signs with the passive hand as place of articulation, the passive hand has a choice of 17 shapes, some of them fairly complex and low in frequency. Thus, it is not possible to say that there is a small set of unmarked shapes in SSL for the passive hand to choose from. However, there seems to be a mutual restriction on the passive and active articulators, making possible a formalisation that can predict, or at least reduce, the choice of shapes for one hand given the other in a sign.



Language competence among cognitively non-disabled individuals with cerebral palsy
TMH-QPSR 2/1996: 59-62

Tina Magnuson
KTH, Dept of Speech, Music and Hearing

Many persons with congenital brain damage such as cerebral palsy do not experience intellectual or cognitive dysfunction. Nevertheless, in this group we find more individuals with reading and writing difficulties than average. These difficulties may be related to poor underlying basic linguistic knowledge. This study sets out to explore basic linguistic competence in individuals with cerebral palsy who have reading and writing difficulties. Standard tests of language function will be used. The results may help us decide what to include in compensatory language or communication aids.



Detailed observations of laryngectomee voice source characteristics from videofluoroscopy and perceptual-acoustic assessment
TMH-QPSR 2/1996: 63-66

Britta Hammarberg, Elisabet Lundström & Lennart Nord
KTH, Dept of Speech, Music and Hearing, Stockholm, and Dept of Logopedics and Phoniatrics, Huddinge Univ. Hosp.

In this report we focus on how surgical procedures and radiotherapy influence the phonatory ability of laryngectomees. Four tracheo-esophageal (TE) speakers and one esophageal (E) speaker were recorded by videofluoroscopy during phonatory tasks. Audio recordings were made simultaneously on a DAT recorder.

The results showed that the configuration and placement of the vibrating structure in the esophagus have a strong influence on voice quality. A rough, non-hyperfunctional alaryngeal voice quality was found in three of the subjects, all of whom showed the typical bulging pharyngo-esophageal (PE) segment in the back wall in the videofluoroscopic analysis, whereas the two speakers who were judged to have hyperfunctional and weak voices showed deviant vibratory structures in the PE-segment region.



On prosodic variation in child directed speech in Swedish
TMH-QPSR 2/1996: 67-68

Anne-Christine Bredvad-Jensen
Dept. of Linguistics and Phonetics, Lund University

My goal is to describe nonverbal, especially prosodic and paralinguistic, characteristics of child directed speech in order to establish a prototype for child adjustment in Swedish in these respects. An efficient way of gathering speech material for this purpose is to use carefully prepared texts as well as spontaneous speech material. In my oral presentation of this paper I will discuss criteria for naturalness conditions in lab speech situations with reference to my collection of speech material. I will argue that it is possible to find text samples which can be used in everyday communication situations in a natural way. I will report on findings from my corpus containing both child directed and (for comparison) adult directed speech.



Aspects of the relation between the production and perception of a second language
TMH-QPSR 2/1996: 69-72

Robert McAllister
Dept of Linguistics, Stockholm University

Production and perception tests were administered to 24 L2 users. Relationships between language experience, self-assessment, and perception and production test results are presented. The results show the expected patterns in these comparisons, indicating possible support for the test methods.



Coarticulation in apical consonants: acoustic and articulatory analyses of Hindi, Swedish and Tamil
TMH-QPSR 2/1996: 73-76

Diana Krull & Björn Lindblom
Dept of Linguistics, Stockholm University

In Hindi, Swedish and Tamil, the place of articulation of retroflexes is more posterior at the beginning of the closure than at the release. Retroflex place of contact is also vowel-dependent whereas that of dentals is constant. F2 locus parameters fail to support the idea that the degree of vowel-consonant coarticulation varies with place and/or language. The main difference between dental and retroflex is provided by F3, and to a lesser degree F4.
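For orientation, F2 locus parameters are conventionally the slope and intercept of a locus equation fitted across vowel contexts (the formulation below is the standard one, assumed rather than quoted from the abstract):

    F_2^{\text{onset}} = k \, F_2^{\text{vowel}} + c

where a slope k near 1 indicates strong vowel-consonant coarticulation and a slope near 0 an invariant, vowel-independent locus.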



An acoustic-phonetic study of several Swedish dysarthric speakers
TMH-QPSR 2/1996: 77-80

Sheri Hunnicutt, Lennart Nord & Elisabet Rosengren
KTH, Dept of Speech, Music and Hearing

The speech of four speakers with varying degrees of dysarthria was studied, using spectrographic methods. Some interesting results were obtained, including incomplete realizations of consonants and consonant clusters and long voiceless portions of vowels.



Tonal timing in Thai
TMH-QPSR 2/1996: 81-84

David House & Jan-Olof Svantesson
Dept of Linguistics and Phonetics, Lund University

This paper investigates the realization of tones and their timing in relation to vowel onset in Thai, where syllable onsets may consist of consonant clusters such as [kl] and [kw]. Speech material containing test words representing low and falling tones, initial aspirated and unaspirated stops, and initial clusters of stops and the sonorants [l] and [w] was recorded from three native speakers of Standard Thai. Sentence frames were used in which the test words were preceded by high, falling and rising tones and followed by a mid tone to control for tonal coarticulation effects.

Considerable variation in the actual tonal contours was found. However, no major or systematic effects of vowel onset characteristics on tonal timing could be found. This seems to indicate that the timing of the tonal gesture for these two tones is related more to the onset of voicing in the syllable than to the vowel onset.


Tones and non-tones in Kammu dialects
TMH-QPSR 2/1996: 85-88

Jan-Olof Svantesson & David House
Dept of Linguistics and Phonetics, Lund University

This is a preliminary report on the phonetic interaction of tone and consonant voicing in Kammu, a language in which some dialects use F0 for producing distinctive word tones, while others do not have tones but rely on the contrastive voicing of initial consonants to distinguish words which tonal dialects distinguish with tones. Speakers of non-tonal dialects produce no significant F0 differences in words which differ only in tone in tonal dialects, and a perception test showed that they did not use F0 to distinguish such words when listening to a tonal dialect.



Recognition of "identical" words and phrases in French
TMH-QPSR 2/1996: 89-92

Robert Bannert, Pascale Nicolas & Monika Stridfeldt
Dept of Phonetics, Umeå University

In a second-language learning context, students with a Germanic L1 generally experience greater difficulties in understanding spoken French than other Germanic languages. The explanation is sought in the large differences in the coding and decoding of words in French as opposed to these languages.

In an introductory study, the claim was tested that segmentally identical phrases in French result from phonological processes like 'liaison', 'enchaînement' and consonant deletion, resulting in a separate syllable tier. Twenty phrases varying from one to seven syllables and minimally contrasting in pairs or triplets were presented to French and Swedish listeners. The results show variation in several respects. In some cases listeners recognised the original stimulus, but not in others. Swedish listeners performed partly in a different way. Reasons for the divergent behaviours will be given.



Making space for emotions in speech research - a methodological consideration
TMH-QPSR 2/1996: 93-96

Gunilla C Thunberg
Dept of Linguistics, Stockholm University

Research on emotional expressions in speech is mostly built on the assumption that human emotions, the way they are experienced as well as the way they are expressed, can be readily sorted into categories such as anger, joy, fear, and sadness. These four categories are often referred to as our basic or primary emotions.

However, in experimental situations listeners frequently find it difficult to discriminate some of the presented emotional stimuli; e.g. intended joy tends to be confused with anger. This suggests that, although those two expressions represent quite opposite emotions, they probably share at least one common feature.

Furthermore, it suggests that we would need a more complex way to describe emotional speech, in order to make analytic progress. The present paper is a first attempt to outline a multi-dimensional model for the description of emotional speech, thereby allowing features to overlap between the traditionally basic emotional categories.



VOT in stop inventories and in young children's vocalizations: preliminary analyses
TMH-QPSR 2/1996: 97-100

Olle Engstrand & Karen Williams
Dept of Linguistics, Stockholm University

Voice onset time (VOT) was measured in initial stops produced by children at 12 and 18 months of age. VOT values were typically short but increased from front to back places of articulation. These findings parallel the predominance of plain voiceless and unaspirated stops and the underrepresentation of voiced velar stops in the world's languages. This parallel is discussed in terms of language-independent vocal tract and auditory constraints.



Constructing a database for a new word prediction system
TMH-QPSR 2/1996: 101-104

Alice Carlberger, Sheri Hunnicutt, Johan Carlberger, Gunnar Strömstedt & Henrik Wachtmeister
KTH, Dept of Speech, Music and Hearing

A strictly frequency-based adaptive lexical prediction program, used mainly by persons with motoric handicaps or linguistic disabilities such as mild aphasia and dyslexia, is undergoing major development to improve the prediction function. This development includes the change to a probability-based system, extension of the scope, and the addition of grammatical, phrasal, and semantic information to the database.



Perception of discourse boundaries in spontaneous speech in Dutch
TMH-QPSR 2/1996: 105-108

Monique E van Donzel & Florien J Koopmans-van Beinum
Inst of Phonetic Sciences/IFOIT, University of Amsterdam

This paper describes a perception experiment in which listeners were asked to mark various discourse structures in the verbatim transcription of a spontaneously retold story in Dutch, while listening to the spoken version of the story. They marked perceived discourse boundaries by means of conventional punctuation marks. Previously, the transcriptions had been analysed for discourse structure, on independent, non-prosodic grounds.

The aim of the experiment was to see in what way the perceived boundaries coincide with the objectively determined ones, and in what way listeners agree on a certain type of boundary, and in a later stage, to investigate by means of which cues listeners could make their specific decisions.



The dialog component in the Waxholm system
TMH-QPSR 2/1996: 109-112

Rolf Carlson
KTH, Dept of Speech, Music and Hearing

In this paper we will give a short overview of the dialog component in the Waxholm spoken dialog system. Dialog management based on grammar rules and lexical semantic features is implemented in our parser, STINA. The notation to describe the syntactic rules has been expanded to cover some of our special needs to model the dialog. The parser is running with two different time scales corresponding to the words in each utterance and to the turns in the dialog. Topic selection is accomplished based on probabilities calculated from user initiatives. Results from parser performance and topic prediction are included in the presentation.


Creation of unseen triphones from seen triphones, diphones and phones
TMH-QPSR 2/1996: 113-116

Mats Blomberg
KTH, Dept of Speech, Music and Hearing

With limited training data, infrequent triphone models for speech recognition will not be observed in sufficient numbers. In this report, a speech production approach is used to predict the characteristics of unseen triphones by using a transformation technique in the parametric representation of a formant speech synthesiser. Two techniques are currently being tested. In one approach, unseen triphones are created by concatenating monophones and diphones and interpolating the parameter trajectories across the connection points. The second technique combines information from two similar triphones: one with the correct context and one with the correct mid-phone identity. Preliminary experiments are performed on the task of rescoring recognition candidates in an N-best list.
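A minimal sketch of the concatenation-and-interpolation technique (numpy; the frame data, parameter choice and linear blending scheme are illustrative assumptions, not the report's exact procedure):

    import numpy as np

    def concat_interpolate(left, right, n_blend=3):
        # Join two parameter trajectories (frames x parameters) and
        # smooth the junction by linear interpolation across the
        # connection point.
        joined = np.vstack([left, right])
        j = len(left)  # index of the connection point
        lo, hi = max(j - n_blend, 0), min(j + n_blend, len(joined) - 1)
        for k in range(lo + 1, hi):
            t = (k - lo) / (hi - lo)  # 0..1 across the blend region
            joined[k] = (1.0 - t) * joined[lo] + t * joined[hi]
        return joined

    # Toy example: F1/F2 trajectories of two shorter units.
    mono = np.tile([300.0, 2200.0], (5, 1))                # steady unit
    di = np.linspace([500.0, 1500.0], [700.0, 1200.0], 5)  # moving unit
    print(concat_interpolate(mono, di).round(0))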



The Gandalf speaker verification database
TMH-QPSR 2/1996: 117-120

Håkan Melin
KTH, Dept of Speech, Music and Hearing

The Gandalf speech database has been designed for use in research on automatic speaker verification. 86 customer speakers and 100 impostor speakers have been recorded in up to 24 telephone calls per speaker during a period of up to 12 months. In addition to speech files, Gandalf includes a relational database with a twofold function: it stores information on subjects and calls, and it is a tool for making quantitative and qualitative analyses of speaker verification test data.

The customer speaker part of the database is described, and some of the motivation behind the design is given.


A Swedish name pronunciation system
TMH-QPSR 2/1996: 121-122

Joakim Gustafson
KTH, Dept of Speech, Music and Hearing

Names are common in most text-to-speech applications. The letter-to-sound rules included in these applications often cannot handle names, since the rules usually are designed for ordinary words. A multi-lingual dictionary of names occurring in some of the European languages has been developed within the Onomastica project, a European Linguistic Research and Engineering project. The Swedish part of the project was done at KTH, where a transcription system for names was developed. The structures of Swedish names differ from those of non-name words, but their multi-morphemic structure makes them suitable for analysis with a morphological analyser. To develop this analyser, the names in the Swedish telephone directory were studied. The morphological analyser is used together with a set of context-dependent rules to transcribe Swedish names. Names of foreign origin require special transcription methods.



Pronunciation in an internationalized society: a multi-dimensional problem considered
TMH-QPSR 2/1996: 123-126

Robert Eklund & Anders Lindström
Telia Research AB, Haninge

This paper deals with the treatment of foreign words and proper names in Swedish. Preliminary results from a production study are presented, and guidelines are suggested for a broad, phonemic transcription covering alternative pronunciations. Such a transcription scheme is a prerequisite for applications such as speech synthesis and multi-dialectal speaker-independent speech recognition.



Cries and whispers: Acoustic effects of variations in vocal effort
TMH-QPSR 2/1996: 127-130

Anita Andersson, Anders Eriksson & Hartmut Traunmüller
Dept of Linguistics, Stockholm University

This is a preliminary report of an ongoing investigation examining the acoustic effects of the adjustment in vocal effort that is required when the distance between speaker and listener is varied over a large range (0.3 m - 187.5 m). Five kinds of characteristics have been studied in the speech of men and women: segment durations, F0, formant frequencies, sound pressure level, and spectral emphasis. When the distance increased, the following effects were observed: the durations of vowel-like segments increased successively, as distinct from those of most consonants; the mean value of F0 increased by a factor close to two, while its SD in semitones remained roughly constant; and the levels of voiced segments increased by about 30 dB, but less for unvoiced segments. The increase in level was much more pronounced in the upper than in the lower part of the spectrum.



A comparative study of male and female whispered and phonated versions of the long vowels of Swedish
TMH-QPSR 2/1996: 131-134

Ingegerd Eklund & Hartmut Traunmüller
Dept of Linguistics, Stockholm University

Confusions in vowel quality and in speaker sex among whispered and phonated versions of the long vowels of Swedish have been analysed. The recognition rate was higher than in other studies. The recognition of vowel quality was observed to interact with that of speaker sex in the whispered versions, but not in the phonated ones. The paper also reports on F0, F1, F2 and F3, and their dynamics, and on the overall spectral shape of the vowels. The observed upward shift of the lower formants in whispering as well as the spectral level differences agree with those found in other languages. (This is a summary of a paper with the same title, submitted to Phonetica).



LF-frequency domain analysis
TMH-QPSR 2/1996: 135-138

Gunnar Fant & Kjell Gustafson
KTH, Dept of Speech, Music and Hearing

An analysis-by-synthesis frequency domain procedure for deriving LF voice source parameters has been developed. It is based on a detailed analysis of the frequency domain properties of the LF model and the synthesis requirements described in Fant (1995). It operates directly on harmonic spectra from an ordinary tape recording. An additional advantage is that the spectral matching is performed against a well-defined synthesizer configuration, which guarantees a correct resynthesis. Male and female source data are exemplified. The perceptual significance of parameter variations is discussed.
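For orientation, the LF model referred to here is the standard four-parameter model of the differentiated glottal flow (Fant, Liljencrants & Lin, 1985), usually written as

    E(t) = E_0 \, e^{\alpha t} \sin(\omega_g t), \qquad 0 \le t \le t_e,

    E(t) = -\frac{E_e}{\varepsilon t_a}
           \left( e^{-\varepsilon (t - t_e)} - e^{-\varepsilon (t_c - t_e)} \right),
           \qquad t_e < t \le t_c,

where t_e is the instant of main excitation with maximum negative amplitude E_e, t_a controls the return phase, and t_c is the end of the cycle.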



Analysis by synthesis of glottal airflow in a physical model
TMH-QPSR 2/1996: 139-142

Johan Liljencrants
KTH, Dept of Speech, Music and Hearing

A mechanical glottal model of the vocal folds with two basic degrees of freedom is self-oscillating and is driven by a slowly varying lung pressure. A substantial number of static parameters describe its metrics and mechanical properties. The model is executed in an environment of tracheal and vocal tract loads. Contrary to the shape-descriptive LF model, this physical model purports to mimic details in the glottal oscillation process, and the aim of the analysis-by-synthesis process is to find direct physical, anatomical, and articulatory causes of features observed in inverse-filtered waveforms from natural utterances.



Diverse voice qualities: models and data
TMH-QPSR 2/1996: 143-146

Inger Karlsson & Johan Liljencrants
KTH, Dept of Speech, Music and Hearing

The voice source for different voice qualities has been investigated. The glottal signal has been obtained by inverse filtering of the speech signal. Two different methods for modelling the results are demonstrated. An acoustic model, the LF voice source model, has been fitted to the inverse-filtered waveform and the LF parameters have been used to describe the voice pulse. The parameters of a vocal cord model, the Liljencrants model, have been manipulated to produce glottal pulses similar to the natural pulses. The parameter settings are discussed.



Sound symbolism in deictic words
TMH-QPSR 2/1996: 147-150

Hartmut Traunmüller
Dept of Linguistics, Stockholm University

It is shown that pairs of demonstratives in which there is a vocalic opposition have an advantage in their struggle for existence in languages when F2' is higher in the proximal than in the distal form. It is also shown that nasals are preferred in first person pronouns while stops and other obstruents are preferred in second person pronouns. Explanations are suggested for both findings. They involve affinities with the association of pitch with size, the proprioceptive qualities of speech sounds, and oral pointing gestures. (Summary of a paper to appear in H. Aili and P. av Trampe (eds.), Tongues and Texts Unlimited, 1996).


A lexical decision experiment with onomatopoeic, sound symbolic and arbitrary words
TMH-QPSR 2/1996: 151-154

Åsa Abelin
Dept of Linguistics, Göteborg University

A lexical decision test was done on onomatopoeic, sound symbolic and arbitrary words in order to find out if there were differences in processing times and error rates between these words.

The test was used in an unconventional way in order to see if non-words, modelled on onomatopoeic and sound symbolic words, would behave like established words and have shorter reaction times.

The results show, on the contrary, that established onomatopoeic and sound symbolic words have significantly longer reaction times than arbitrary words. They are also more often confused with non-words than are the arbitrary words.



Learning to interpret sociodialectal cues
TMH-QPSR 2/1996: 155-158

Una Cunningham-Andersson
Dept of Linguistics, Stockholm University

This study examines a single aspect of native speaker competence. The questions addressed here are: how well can a given non-native speaker perceive differences between dialects of Swedish? How well can native speakers of Swedish perceive this kind of variation? Does a long period of residence in Sweden and an apparently excellent command of the Swedish language imply that an immigrant's ability to place native speakers geographically approaches the native standard? Is there an upper limit to how good non-native listeners can be, or can they approach the native standard?



The influence of noise on subjective duration
TMH-QPSR 2/1996: 159-162

Wim A van Dommelen
Dept of Linguistics, Norwegian University of Science and Technology, Dragvoll

This paper explores the influence of noise on perceived segment duration. In test words with initial /s(p)l/, occlusion duration was varied, giving rise to either /sl/ or /spl/ percepts. Perceptual phoneme boundaries were obtained under varying noise conditions.

The results showed no monotonic duration modification due to increased noise level. Rather, perceived occlusion duration turned out to be shortened under moderately degraded listening conditions, whilst subjective duration was lengthened at more unfavourable S/N ratios.

It could be shown that the boundary shifts found were not caused by learning effects. Neither were they affected by language/dialect background.


TMH-QPSR 3/1996


SPEECH COMMUNICATION

Prosodic segmentation and structuring of dialogue
TMH-QPSR 3/1996: 1-6

Gösta Bruce, Johan Frid, Björn Granström, Kjell Gustafson, Merle Horne & David House
Dept of Linguistics and Phonetics, Lund Univ., and KTH, Dept of Speech, Music and Hearing

This paper reports on the research conducted within the project Prosodic Segmentation and Structuring of Dialogue. The work involves both the analysis of discourse/dialogue structure (independent of prosody) and prosodic analysis - both auditory analysis in the form of prosodic transcription and acoustic-phonetic analysis (based on F0 and waveform information). The different analysis types are combined and synchronized with each other in the ESPS/Waves+ environment. On the basis of the analyses, we perform F0 modelling and resynthesis in order to evaluate the models. We also integrate the results of the dialogue analyses into a text-to-speech system and a man-machine dialogue system. The implementation of this enhanced prosody model can be used as a research tool in itself. In more practical terms, the prosodic representations which result from this enhanced prosody model allow us to generate more appropriate prosody in automatic dialogue systems. This, in turn, should render the speech synthesis used in these systems more acceptable to the general user.



Cross phone state clustering using lexical stress and context
TMH-QPSR 3/1996: 7-11

Jesper Högberg & Kåre Sjölander
KTH, Dept of Speech, Music and Hearing

This study deals with acoustic phonetic modelling in HMM based continuous speech recognition. Context dependent phone models were derived by a decision tree clustering algorithm. In particular, lexical stress was introduced as a clustering variable in addition to the phonetic context. The parameter sharing model was extended by tying HMM states across different target phones. For instance, one or more states of a tense vowel and the corresponding lax vowel were tied if they proved to be acoustically similar. The results indicate that the use of lexical stress information in acoustic modelling might be fruitful when large amounts of training data are available.
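A minimal sketch of likelihood-based clustering with a stress question (the feature names, single-Gaussian state model and greedy split are illustrative assumptions, not the paper's system):

    import numpy as np

    def pooled_loglik(chunks):
        # Log-likelihood of pooled observations under one ML-fitted
        # diagonal Gaussian: -N/2 * sum_d (log(2*pi*var_d) + 1).
        x = np.vstack(chunks)
        var = x.var(axis=0) + 1e-6
        return -0.5 * len(x) * np.sum(np.log(2.0 * np.pi * var) + 1.0)

    def best_split(items, questions):
        # Greedy choice of the binary question (e.g., lexical stress)
        # that maximises the log-likelihood gain of splitting.
        base = pooled_loglik([x for _, x in items])
        best = None
        for q in questions:
            yes = [x for f, x in items if f[q]]
            no = [x for f, x in items if not f[q]]
            if yes and no:
                gain = pooled_loglik(yes) + pooled_loglik(no) - base
                if best is None or gain > best[1]:
                    best = (q, gain)
        return best

    rng = np.random.default_rng(0)
    items = [({"stressed": s, "left_is_nasal": n},
              rng.normal(loc=2.0 * s - 1.0 * n, size=(20, 3)))
             for s in (0, 1) for n in (0, 1) for _ in range(5)]
    print(best_split(items, ["stressed", "left_is_nasal"]))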



Some studies of the relationship between speech acoustics, articulation and phonetic structure
TMH-QPSR 3/1996: 13-21

Jesper Högberg
KTH, Dept of Speech, Music and Hearing

This paper is the introductory part of a compilation of papers discussing analyses of acoustic and articulatory variability and the relation between acoustics, articulation and phonetic structure.

The first paper addresses problems concerning vocal tract geometry and inter-subject and gender related articulatory and acoustic variability. Midsagittal X-ray and acoustic vowel data were analysed for two French subjects, one male and one female. Area functions of the vocal tract were inferred from the midsagittal data. The result of the analysis indicated that the female subject used more extreme articulations for /a/ and /i/ than the male subject. Large inter-subject differences were observed in area functions pertaining to //, // and //, that is, the three most central vowels in the set. These findings indicate that the subjects used different articulation strategies.

Papers two and three concern the relation between acoustics and articulation. Artificial neural networks (ANN) were used to map acoustic features on the corresponding vocal tract area functions generated by vocal tract models. ANNs were trained and tested on data generated by different vocal tract models. In particular, a modular net configuration was tested that consisted of a number of parallel ANNs. This design improved the results when tested on speech from both a male and a female subject.

The last two papers describe analyses of segmental acoustic quality depending on phonetic structure, where data-driven methods are shown to be useful. A regression tree approach and a normalisation technique were used to derive predictions for F1 and F2 of Swedish front vowels. The database used in these two papers contains some 3000 front vowels from read speech of one male speaker. All vowels were pooled to represent a generic front vowel as a starting point for the analysis. The results of both papers indicate that more than 50% of the standard deviation of F1 and F2 (mel) can be explained by the phonetic factors used.



Creation of unseen triphones from diphones and monophones using a speech production approach
TMH-QPSR 3/1996: 23-27

Mats Blomberg & Kjell Elenius
KTH, Dept of Speech, Music and Hearing

With limited training data, infrequent triphone models for speech recognition will not be observed in sufficient numbers. In this report, a speech production approach is used to predict the characteristics of unseen triphones by concatenating diphones and/or monophones in the parametric representation of a formant speech synthesiser. The parameter trajectories are estimated by interpolation between the endpoints of the original units. The spectral states of the created triphone are generated by the speech synthesiser. Evaluation of the proposed technique has been performed using spectral error measurements and recognition-candidate rescoring of N-best lists. In both cases, the created triphones are shown to perform better than the shorter units from which they were constructed.



MUSIC ACOUSTICS

Towards a musician's cockpit: Transducers, feedback and musical function
TMH-QPSR 3/1996: 29-32

Roel Vertegaal, Tamas Ungvary & Michael Kieslinger
Dept of Ergonomics, Twente Univ, The Netherlands, and KTH, Dept of Speech, Music and Hearing

This paper describes our ongoing theoretical research into the design of computer music instruments. Traditionally, an acoustical instrument is used as an extension of the body. We feel that the importance of this tight relationship between musician and instrument has been underestimated in the design of computer music controllers. In this paper, we show the benefits of a cybernetic approach to the man-instrument system, in which we specify relationships between body parts, different types of transducers, feedback, and the musical function performed. Matching these different requirements is a first step towards bringing computer music instruments within reach of the skilled musician.



Effects of lung volume on the glottal voice source
TMH-QPSR 3/1996: 33-40

Jenny Iwarsson, Monica Thomasson & Johan Sundberg
KTH, Dept of Speech, Music and Hearing

According to experience in voice therapy and singing pedagogy, breathing habits can be used to modify phonation, although this relation has never been experimentally demonstrated. In the present investigation we examine whether lung volume affects phonation. Twenty-four untrained subjects phonated at different pitches and degrees of vocal loudness at different lung volumes. Mean subglottal pressure was measured and voice source characteristics were analyzed by inverse filtering. The main results were that with decreasing lung volume the closed quotient increased, while subglottal pressure, peak-to-peak flow amplitude, and glottal leakage tended to decrease. In addition, some estimates of the glottal adduction force component were examined. Possible explanations of the findings are discussed.



Blowing pressures in reed woodwind instruments
TMH-QPSR 3/1996: 41-56

Leonardo Fuks & Johan Sundberg
KTH, Dept of Speech, Music and Hearing

Blowing pressures during the playing of wind instruments have not been systematically measured in previous research, leaving their dependence on pitch and dynamic level an open question. In the present investigation, we recorded blowing pressures in the mouth cavity of two professional players of each of four reed woodwinds (Bb clarinet, alto saxophone, oboe, bassoon). The players performed three different tasks: (1) a series of isolated tones at four dynamic levels, (2) the same series as crescendo-diminuendo tones, and (3) an ascending-descending musical arpeggio played legato at different dynamic levels (pp, mp, mf, ff). The results show that, within instruments, the players' pressures exhibit similar dependencies on pitch and dynamic level. Between instruments, clear differences were found with regard to the dependence on pitch.

