STL-QPSR ABSTRACTS 1992


SPEECH PRODUCTION


Using artificial neural networks to relate articulation to acoustic features of the vocal tract, No. 1/1992, pp. 33-48

Mats Båvegård & Jesper Högberg

This paper deals with artificial neural networks as a tool to relate vowel articulation to acoustic features of the vocal tract. Different area-model approaches, one articulatory-parameter model and one area-parameter model combined with different speech signal representations, are compared. The sensitivity of the artificial neural net to speaker-dependent variations is tested.


Vocal-tract computation: how to make it more robust and faster, No. 4/1992, pp. 29-42

Qiguang Lin

Numerical methods are proposed to speed up the frequency-domain calculation of the vocal-tract response and effectively determine the associated pole/zero patterns. These methods include algorithms to properly decompose the numerator and the denominator of the transfer function, to estimate formant patterns of a lossy vocal-tract system from its lossless counterpart by means of linear interpolation, and to utilize the information of residues at the system poles to ensure that no pole/zero is missed. It is found that the proposed methods reduce the computation time by at least a factor of 2 while maintaining the accuracy. At the same time the risk of missing a pole or a zero is minimised. These algorithms have been implemented in a computer program, TRACTTALK, an articulation-based speech synthesizer. A brief description of TRACTTALK is presented.
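As an illustration of the kind of frequency-domain calculation discussed above, the following is a minimal sketch and not the TRACTTALK algorithm. It assumes a lossless tract modelled as concatenated uniform tube sections, closed at the glottis and with zero radiation load at the lips, so that the poles are the real frequencies where the D element of the chain matrix vanishes. The function names, the constants, and the grid-plus-linear-interpolation root search are all assumptions made for the example.

```python
import math

C_SOUND = 350.0  # assumed speed of sound in the tract, m/s
RHO = 1.14       # assumed air density, kg/m^3


def chain_d(freq_hz, areas_m2, section_len_m):
    """D element of the lossless chain matrix from glottis to lips.

    For lossless sections A and D are real while B and C are purely
    imaginary, so the matrix is stored as (a, b, c, d) with B = j*b, C = j*c.
    """
    k = 2.0 * math.pi * freq_hz / C_SOUND
    a, b, c, d = 1.0, 0.0, 0.0, 1.0  # identity matrix
    for area in areas_m2:
        z = RHO * C_SOUND / area  # characteristic impedance of the section
        cs = math.cos(k * section_len_m)
        sn = math.sin(k * section_len_m)
        a2, b2, c2, d2 = cs, z * sn, sn / z, cs
        # Matrix product, using j*j = -1 for the b*c cross terms.
        a, b, c, d = (a * a2 - b * c2, a * b2 + b * d2,
                      c * a2 + d * c2, d * d2 - c * b2)
    return d


def lossless_formants(areas_m2, section_len_m, f_max=5000.0, step=10.0):
    """Locate zeros of D on a coarse grid, refined by linear interpolation."""
    formants = []
    f0 = step
    d0 = chain_d(f0, areas_m2, section_len_m)
    f1 = f0 + step
    while f1 <= f_max:
        d1 = chain_d(f1, areas_m2, section_len_m)
        if (d0 < 0) != (d1 < 0):  # sign change brackets a pole
            formants.append(f0 - d0 * (f1 - f0) / (d1 - d0))
        f0, d0 = f1, d1
        f1 = f0 + step
    return formants
```

For a uniform 17.5 cm tube, `lossless_formants([5e-4] * 35, 0.005, f_max=4000.0)` should return values close to the quarter-wavelength resonances at 500, 1500, 2500 and 3500 Hz.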


SPEECH RECOGNITION


A comparison of speech signal representations for speech recognition with hidden Markov models and artificial neural networks, No. 2-3/1992, pp. 1-8

Joakim Borell & Nikko Ström

This paper explores several different speech signal representations in two different speech recognition environments, artificial neural networks (ANN) and hidden Markov models (HMM). The results indicate that an ANN does not benefit from a cepstrum representation of filter bank parameters. Integrating several speech representations results in an improved recognition rate at the cost of significantly longer training/testing time for the network. By performing a linear discriminant analysis (LDA), which generates a new parameter vector, it is possible to reduce the network size and to obtain higher recognition rates. Preliminary tests show that the HMM system models time dynamics well but frequency dynamics poorly, in contrast to the neural network, which, in these experiments, models frequency dynamics well but time dynamics poorly.


Development of a recurrent time-delay neural net speech recognition system, No. 4/1992, pp. 1-15

Nikko Ström

State models, often used to model the dynamics of speech in recognition systems today, may not be flexible enough to model all important dynamic aspects of speech. The present paper investigates an artificial neural network approach to the problem. An artificial neural network architecture is proposed and applied to some small memory problems and a word recognition task. The latter network has two explicit symbolic levels (phonemes and words), and both top-down and bottom-up strategies are learned by an automatic training algorithm.


Vowel Classification Based on Analysis-by-Synthesis, No. 4/1992, pp. 17-27

Rolf Carlson & James Glass

In this paper, we report on a sequence of experiments designed to explore the use of analysis-by-synthesis methods for speech recognition and speech analysis in general. An intermediate representation of the speech signal is formulated in terms of speech-synthesis-like parameters. Using a multi-layer perceptron as a common classifier, we have performed several vowel classification experiments based on these parameters. The results of the experiments indicate that we are able to obtain the same classification performance as a more traditional spectral representation using nearly an order of magnitude fewer dimensions. We have also developed a speaker normalization procedure that improves classification rate compared to the one we obtain with a simple male/female normalization. In our last set of experiments, we have studied the influence of the context on the classification results. The best classification results in our experiments were achieved by a combination of default formants and labels specifying the context together with speaker normalization of the automatically measured synthesis parameters.


SPEECH SYNTHESIS


Inter-cultural variations in gender-based language differences in young children, No. 1/1992, pp. 1-17

Inger Karlsson & Martin Rothenberg

This paper discusses to what degree the differences between male and female voices are genetic, or if they are learnt during childhood or even later. Preadolescent children speaking different languages were recorded. Adult listeners with different language backgrounds were asked to judge the gender of the child. The children, who were between 2.5 and 8 years old, spoke either English, Finnish, or Swedish; a few were bilingual. The mother tongue of the judges was either Chinese, English, Finnish, or Swedish. It seemed to be easier to recognize the gender of the English and Swedish speaking children. About 70% of the answers were right for these children. The Finnish children were much harder to judge; the result was only slightly better than chance for the boys and about chance for the girls. The judges that spoke the same language as the child were slightly better. This indicates that at least part of the difference between female and male speech is learnt; in some cultures it is learnt at an early age. Different acoustic parameters (mean F0, F0 range, vowel formants, speaking rate) were measured for the test subjects. No correlations were found between these acoustic parameters and the gender judgements. The paper also raises the question whether the structure of the particular language the child grows up with influences when the child learns to speak in a more gender-differentiating way. As opposed to Swedish and English, Finnish grammar is without gender, which could explain the apparent lack of gender cues in the Finnish speaking children's voices.


Evaluations of acoustic differences between male and female voices. A pilot study, No. 1/1992, pp. 19-31

Inger Karlsson

A literature survey was performed to identify papers dealing with acoustic differences between male and female voices. The reported differences were duplicated using the OVE III synthesizer. Manipulated copies of a sentence uttered by a female speaker served as test sentences. Listening tests were run where the subjects were asked a) in a comparison, which of two utterances was most female, and b) if an utterance sounded like a child, female or male speaker. The aim of the investigation was to find out if any of the acoustic differences between female and male voices that had been described would enhance the femaleness in a synthetic voice. The results of the listening tests gave some indications about the areas in which future work should be concentrated. None of the tested acoustic differences changed the synthetic voice into a convincing female voice, though.


TECHNICAL PHONIATRICS


Acoustic properties of the Rothenberg mask, No. 2-3/1992, pp. 9-18

Stellan Hertegård & Jan Gauffin

The flow response and possible distortion from the Rothenberg mask system on radiated speech were studied by means of sweep-tone measurements. The flow frequency response was flat within 3 dB up to 1.6 kHz. This frequency range for the speech signal radiated through the mask normally includes the lowest two formants for open vowels and is probably sufficient to describe most aspects of the glottal flow waveform for untrained normal and pathological voices. A pronounced zero between 1.8 and 2 kHz was found. This restricts the use of the mask systems tested here for air flow measurements at higher frequencies. It was shown that this zero could be moved up in frequency, increasing the useful frequency response to nearly 3 kHz. We suggest that the zero is caused by acoustic shunting of the nasal part of the mask. A modified mask design, with wire screen placed also in its nasal part, might substantially improve the frequency response.


Voiced-voiceless distinction in alaryngeal speech - acoustic and articulatory observations, No. 2-3/1992, pp. 19-22

Lennart Nord, Britta Hammarberg & Elisabet Lundström

In an investigation of the speech of laryngectomized speakers we presently focus on their ability to distinguish between voiced and voiceless speech sounds. Some of them are able to produce this distinction clearly in some contexts, which is quite remarkable considering the structure of the pharyngo-esophageal entrance, serving as the post-operative voice source. However, there are a number of acoustic cues that serve to signal the distinction between voiced and voiceless phonemes and an acoustic analysis will tell how this is possible.


MUSIC ACOUSTICS


Breathing behavior during singing, No. 1/1992, pp. 49-64

Johan Sundberg

Together with Curt von Euler and the late Rolf Leanderson, the author has carried out a series of investigations of breathing behavior during the last decade. This article presents the picture of phonatory breathing in singing that has emerged from this research. Subglottal pressure is determined by muscular forces, elasticity forces, and gravitation. The phonatory function of the breathing apparatus is to provide a subglottal pressure. Both in singing and speech this pressure is adjusted according to the intended vocal loudness, but in singing it has to be tailored also to pitch, higher pitches needing higher pressures than lower pitches. As subglottal pressure affects pitch, singers need to develop quite virtuosic breath control. Some singers activate the diaphragm only during inhalation and for reducing subglottal pressure at high lung volumes, while other singers have been found to co-contract it throughout the breath phrase. The tracheal pull, i.e., the pulling force exerted by the trachea on the larynx, is a mechanical link between the breathing and phonatory systems. The magnitude of this force depends on the level of the diaphragm in the trunk, i.e., on the lung volume, but it is also probably increased by a co-contracting diaphragm. The pedagogical implications of these findings are discussed.


The sense of nonsense. Syllable choice in spontaneous nonsense-text singing, No. 1/1992, pp. 65-78

Johan Sundberg & Lars Frydén

When informally singing memorized instrumental or vocal themes, we tend to invent nonsense texts, e.g., duda duda dudaa or ... dibidi bambi dudi. It can be hypothesized that the choice of syllables in such nonsense texts is not completely random. Rather, certain principles reflecting aspects of musical structure are likely to guide this choice. A group of eleven members of a professional orchestra sang six melodical excerpts and their choice of syllables was analyzed. With regard to the phonetic properties the choice showed coherence in certain musical situations. For example, a change of syllable was frequently made at structural boundaries while repetition of syllable was often used for all notes constituting a short motif. Later, one of the same musicians used typical signs, such as dots, dashes, slurs, etc., to mark how these same excerpts should be performed according to his view. This material suggested that the meaning of, e.g., a nasal consonant plus a [p] was an accent, whereas a voiced stop consonant, such as [d] plus vowel, frequently corresponded to a legato. We also compared these performance suggestions with those predicted by our system of performance rules.


Musician's tone glue, No. 1/1992, pp. 79-98

Johan Sundberg

In music performance, musicians rely on a variety of methods to mark the melodic coherence of notes, i.e., that the notes belong together and constitute a melodical Gestalt. Minute departures from the nominal duration of the tone, as given by the score, seem to be particularly efficient but also loudness and timbre are used. Our generative rules for music performance (A. Friberg, Computer Music J. 15, pp. 56-71, 1991) offer some examples: the final tone of a phrase is lengthened, and subphrases are terminated by a micropause. Likewise, a crescendo seems to have the effect of joining the tones involved. A change of timbre seems another efficient method to signal a boundary between tone groups. The improvised nonsense texts that we use when we sing tunes also present examples of musical grouping. The magnitude of the effects needed to elicit a musically useful effect is discussed.


Perceptual analysis of child hoarseness using continuous scales, No. 1/1992, pp. 99-113

Elisabeth Sederholm, Anita McAllister, Johan Sundberg, & Jan Dalkvist

Fifty-eight 10-year-old children were recorded and judged by a panel of voice expert listeners, who rated the voices along 16 voice parameters represented by visual analogue (continuous) scales on a test form. Interjudge reliability was high. Rank ordered rating means revealed a discontinuity in the distribution for most parameters. A factor analysis revealed three factors of major relevance to the perception of these voices. These factors were closely associated with hoarseness, pitch, and phonatory effort. The hoarseness factor was found to have high loadings in gratings, breathiness, hyperfunction, roughness, instability, and voice breaks. A stepwise multiple regression analysis revealed that breathiness, hyperfunction and roughness are good predictors of hoarseness.


Phonetographic aspects of physiological and perceptual voice characteristics in ten-year-old children, No. 2-3/1992, pp. 41-56

Anita McAllister, Elisabeth Sederholm, & Johan Sundberg

Pitch and intensity ranges of 60 children were recorded and plotted in terms of phonetograms. Different aspects of the phonetograms were investigated, such as minimum phonation threshold, pitch range, and maximum dynamic range. The vocal cord status of all children was determined in a phoniatric examination. Seven voice experts listened to recordings of the voices and rated their properties along 16 parameters including hoarseness. Using these ratings, the hoarse children were identified. The phonetogram characteristics of adults and children were compared, as well as those of chronically hoarse and nonhoarse children and ten-year-olds with and without vocal nodules or glottal chinks. The phonetograms of children with beginning mutational voices were compared to those of adults.


A computer-controlled bowing machine (MUMS), No. 2-3/1992, pp. 61-66

Andreas Cronhjort

MUMS is a bowing machine, i.e., a machine that bows (plays) a violin in a controlled and repeatable manner by mechanical means. Traditionally, the main application of bowing machines has been in studying string vibrations and violins under reproducible conditions. MUMS uses a normal bow to excite the violin, which also allows a comparison of different bows and their influence on the string vibrations. The position and velocity of the bow and the force between bow and string ("bow pressure") can be specified and controlled within close limits. MUMS consists of two parts: a converted printer, which contains the mechanical support of the bow and motors for bow motion and force, and a PC, which controls the motion by software servos.


Physiological aspects of a vocal exercise, No. 2-3/1992, pp. 79-87

Ninni Elliot, Johan Sundberg, & Patricia Gramming

The physiological aim of vocal exercises is mostly known in intuitive terms only. This article presents an attempt to document the phonatory behavior induced by a vocal exercise. An elevated vertical position of the larynx is frequently associated with hyperfunctional phonatory habits, presumably because it induces an exaggerated vocal fold adduction. Using the multi-channel electroglottograph (TMEGG) recently presented by Rothenberg (1992, J. of Voice, 6), the larynx position was determined in a group of subjects who performed a voice exercise that contained extremely prolonged versions of the consonant [b:]. This exercise is used by coauthor NE as part of the standard vocal exercise program. Two of the seven subjects were dysphonic phonasthenic patients while the rest were normal trained or untrained persons. Different attempts to calibrate the TMEGG confirmed a linear relation with larynx height, provided that the electrodes were correctly positioned. The results showed that the exercise induced substantial vertical displacements of the larynx. Comparison with the larynx height behavior during other consonants revealed that the [b] in particular tended to lower the larynx.


Tracking multi-channel electroglottograph measurement of larynx height in singers, No. 2-3/1992, pp. 67-78

Friedemann Pabst & Johan Sundberg

Changes in vertical larynx position during singing were studied in professional singer subjects using the two-channel electroglottograph recently presented by Rothenberg (1992). The evaluation of the method and the examination of the laryngeal behaviors of professional singers were the main aims of the study. The significance of pitch, vocal loudness and lung volume to vertical larynx position was analysed.


The making of the "acoustic electric guitar", No. 2-3/1992, pp. 57-60

Richard Wiedenkeller

Is it possible to make a modern electric guitar with the acoustic properties of a half-acoustic guitar? In this work, a comparison between an Ibanez Roadstar II and a traditional Levin half-acoustic guitar was made. The resonances of the Levin guitar body were found from the impulse response, and an electric filter consisting of a number of narrow band-pass filters was built with poles at the resonance frequencies of the half-acoustic guitar. When playing the electric guitar with this filter, the tone quality was clearly changed in the acoustic direction. The attack of the tone sounded more natural. However, this filter caused some unwanted effects. A partial of a played note, which coincided with a resonant frequency of a filter, became too prominent. In the acoustic guitar, the body will add much damping to the string vibration at resonance. Experiments were made with electro-dynamic systems including feedback to synthesize the damping effects of the body. However, no satisfactory solution was found.
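The filter described above was an analog electric circuit. As a rough digital analogue, the sketch below sums the outputs of parallel two-pole resonators, each with a pole pair at one resonance frequency. It is an illustration only: the function name, the bandwidth parameter, and any resonance values passed in are hypothetical and not taken from the paper.

```python
import math


def resonator_bank(signal, resonances, sample_rate):
    """Sum of parallel two-pole resonators.

    resonances: list of (center_hz, bandwidth_hz) pairs (hypothetical values).
    Each resonator implements y[n] = x[n] + a1*y[n-1] + a2*y[n-2], with the
    pole radius set by the bandwidth and the pole angle by the centre frequency.
    """
    out = [0.0] * len(signal)
    for center_hz, bandwidth_hz in resonances:
        radius = math.exp(-math.pi * bandwidth_hz / sample_rate)
        a1 = 2.0 * radius * math.cos(2.0 * math.pi * center_hz / sample_rate)
        a2 = -radius * radius
        y1 = y2 = 0.0  # previous two output samples of this resonator
        for n, x in enumerate(signal):
            y = x + a1 * y1 + a2 * y2
            out[n] += y
            y2, y1 = y1, y
    return out
```

Because each resonator has high gain in a narrow band, a string partial that lands on a resonance is boosted strongly, which is consistent with the over-prominent partials reported in the abstract.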


Observations on the dynamic properties of violin bows, No. 4/1992, pp. 43-49

Anders Askenfelt

Some of the dynamic properties of violin bows have been studied. The normal modes of the bow stick and assembled bow, respectively, have been explored by modal analysis. Mode frequencies and damping ratios have been compared for a set of bows which had ratings ranging from poor to excellent, assigned to them by professional players, and an attempt has been made to connect the modal properties with the quality of the bow. The influence of the tension of the bow hair and the player's holding of the bow are also considered.


Numerical simulations of piano strings, No. 4/1992, pp. 51-72

Antoine Chaigne & Anders Askenfelt

The first attempt to generate musical sounds by solving the equations of vibrating strings by means of Finite Difference Methods (FDM) was made by Hiller & Ruiz [J.Audio Eng.Soc. 19, pp. 462-472, 1971]. It is shown here how their numerical approach and the underlying physical model can be improved in order to simulate the motion of the piano string with a high degree of realism. Starting from the fundamental equations of a damped, stiff string interacting with a nonlinear hammer, a numerical finite difference scheme is derived, from which the time and spatial dependence of string displacement, velocity, and interacting force between hammer and string, as well as the force acting on the bridge, are computed in the time-domain. The strength of the model is illustrated by comparisons between measured and simulated piano tones. After this verification of the accuracy of the method, the model is used as a tool for systematically exploring the influence of string stiffness, relative striking position, and hammer-string mass ratio on string waveforms and spectra.
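To give a feel for the finite difference approach, here is a minimal sketch of the standard explicit scheme for an ideal string with fixed ends. It deliberately omits the stiffness, damping, and nonlinear hammer terms that the paper's model includes, and the function name and parameters are assumptions made for the example.

```python
import math


def simulate_ideal_string(y0, n_steps, courant=1.0):
    """Explicit finite-difference solution of the ideal wave equation.

    y0: initial displacement at equally spaced points (ends treated as fixed
    at zero), with zero initial velocity. courant = c*dt/dx must be <= 1 for
    stability. Returns the displacement after n_steps time steps.
    """
    if not 0.0 < courant <= 1.0:
        raise ValueError("courant number must be in (0, 1]")
    n = len(y0)
    r2 = courant * courant
    prev = list(y0)
    prev[0] = prev[-1] = 0.0
    if n_steps == 0:
        return prev
    # First step uses the zero-velocity condition y(-dt) = y(+dt).
    cur = list(prev)
    for i in range(1, n - 1):
        cur[i] = prev[i] + 0.5 * r2 * (prev[i + 1] - 2.0 * prev[i] + prev[i - 1])
    for _ in range(n_steps - 1):
        nxt = [0.0] * n
        for i in range(1, n - 1):
            nxt[i] = (2.0 * cur[i] - prev[i]
                      + r2 * (cur[i + 1] - 2.0 * cur[i] + cur[i - 1]))
        prev, cur = cur, nxt
    return cur
```

With a Courant number of 1 and a string divided into N segments, one full period of the fundamental takes 2N time steps, so a sinusoidal initial shape should reappear after 2N steps and appear inverted after N steps.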


Measurements of the vibrato rate of ten singers, No. 4/1992, pp. 73-86

Eric Prame

The vibrato rate for ten singers, all singing Schubert's Ave Maria, was measured on sonograms. Commercially available CD records were used to ensure that the vibrato originated in a real musical performance. It was found that the vibrato rate typically increased at the end of each tone, while no typical structure could be found in the beginning of a tone. Disregarding the increase of vibrato rate toward tone endings, the mean rate across singers was 6.1 Hz. The average variation between maximum and minimum rate within an artist is about ±10% of the artist average. The variation across artists between the maximum and minimum personal mean rate was also about ±10% of the group average.


Acoustical measurements of an artificial reverberation system with wooden loudspeakers, No. 4/1992, pp. 87-96

Gunilla Berndtsson

Georg Bolin, a recognized Swedish guitar maker, has developed two types of sound enhancement systems, "tone tables," which are used for reinforcing the sound from certain musical instruments, and "acoustic walls," which are used for increasing the reverberation time so as to improve room acoustics for music performance. The goal of the present investigation was to study some of the acoustical properties of the acoustic walls.


Perception of just noticeable time displacement of a tone presented in a metrical sequence at different tempos, No. 4/1992, pp. 97-108

Anders Friberg & Johan Sundberg

The JND for a perturbation of the timing of a tone appearing in a metrical sequence was examined in an experiment where 30 listeners of varied musical background were asked to adjust the timing of the fourth tone in a sequence of six such that they heard the sequence as perfectly regular. The tones were presented at a constant inter-onset time that was varied between 100 ms and 1000 ms. The average JND was found to be about 10 ms for tones shorter than about 240 ms duration and about 5% of the duration for longer tones. Subjects' musical training did not appear to affect these values.
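The two-regime result above can be written as a small piecewise rule. This is only a sketch of the approximate figures quoted in the abstract, with a hypothetical function name. The two branches do not meet exactly at 240 ms (10 ms versus 12 ms), which reflects the "about" in the reported values.

```python
def approximate_jnd_ms(duration_ms):
    """Approximate timing JND from the figures quoted in the abstract.

    About 10 ms for durations below roughly 240 ms, and about 5% of the
    duration above that.
    """
    if duration_ms <= 0:
        raise ValueError("duration must be positive")
    if duration_ms < 240.0:
        return 10.0
    return 0.05 * duration_ms
```

For example, `approximate_jnd_ms(100)` gives 10 ms and `approximate_jnd_ms(1000)` gives 50 ms.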


Music and locomotion. A study of the perception of tones with level envelopes replicating force patterns of walking, No. 4/1992, pp. 109-122

Johan Sundberg, Anders Friberg, & Lars Frydén

Music listening often produces associations to locomotion. This suggests that some patterns in music are similar to those perceived during locomotion. The present investigation tests the hypothesis that the sound level envelope of tones alludes to force patterns associated with walking and dancing. Six examples of such force patterns were recorded using a force platform, and the vertical components were translated from kg to dB and used as level envelopes for tones. Sequences of four copies of each of these tones were presented with four different fixed inter-onset times. Music students were asked to characterize these sequences in three tests. In one test, the subjects were free to use any expression, and the occurrence of motion words in the responses was examined. In another test, they were asked to describe, if possible, the motion characteristics of the sequences, and the number of blank responses was studied. In the third test, they were asked to describe the sequences along 24 motion adjective scales, and the responses were submitted to a factor analysis. The results from the three tests showed a reasonable degree of coherence, suggesting that associations to locomotion are likely to occur under these conditions, particularly when (1) the inter-onset time is similar to the inter-step time typical of walking, and (2) the inter-onset time agrees with that observed when the gait patterns were recorded. The latter observation suggests that the different motion patterns thus translated to sound level envelopes also may convey information on the type of motion.
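The abstract does not state how the force readings were translated from kg to dB. One plausible reading, sketched here purely as an assumption, is a logarithmic mapping relative to the peak force, 20*log10(F/F_peak), with a floor for samples where the foot is off the platform. The function name, the choice of reference, and the floor value are all hypothetical.

```python
import math


def force_to_level_db(force_kg, floor_db=-60.0):
    """Map a vertical force pattern (kg) to a level envelope in dB re peak.

    Assumed mapping: 20*log10(force / peak force), clipped at floor_db.
    The peak sample maps to 0 dB.
    """
    peak = max(force_kg)
    if peak <= 0:
        raise ValueError("force pattern needs at least one positive sample")
    levels = []
    for force in force_kg:
        if force <= 0:
            levels.append(floor_db)  # no contact: use the floor level
        else:
            levels.append(max(floor_db, 20.0 * math.log10(force / peak)))
    return levels
```

Under this assumed mapping, a sample at half the peak force lies about 6 dB below the peak of the envelope.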


SPEECH AND HEARING DEFECTS AND AIDS


Numerical aspects of the speech tracking procedure, No. 1/1992, pp. 115-130

Karl-Erik Spens, Johan Gnosspelius, Gunilla Öhngren, Geoff Plant, & Arne Risberg

The Speech Tracking procedure developed by De Filippo & Scott (J.Acoust.Soc.Am. 63, 1978) has been used extensively to evaluate different technical aids for deaf people. An important question which should be considered is to what extent results from these evaluations can be compared, as modifications of the test design can have a significant influence on the result. This paper describes a numerical model of the tracking procedure. The model specifies some of the unknown factors which make a comparison of speech tracking scores difficult. It also indicates that very different performances found when using the same aid can be explained by differences in test design. This must also be considered when comparing the tracking results obtained with different aids.


A computer-based speech tracking procedure, No. 1/1992, pp. 131-137

Johan Gnosspelius & Karl-Erik Spens

A computer program was written to facilitate the administration of the Speech Tracking procedure described by De Filippo & Scott (1978) and to make different speech tracking tests more comparable. A record was kept of the total number of blockages, the number of blockages resolved outside the system, the average time spent on blockages, the average time spent on non-blocked words, and the average tracking speed. Experience with the system indicates that special care should be taken in the choice of text as well as in the partitioning of the text into segments.
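The quantities recorded by such a program can be illustrated with a short sketch. This is not the authors' program: the `WordEvent` record, its fields, and the `summarize_tracking` function are hypothetical, and the tracking speed is taken as words tracked per minute of elapsed time.

```python
from collections import namedtuple

# Hypothetical per-word record: time spent on the word, whether it caused a
# blockage, and whether that blockage was resolved outside the system.
WordEvent = namedtuple("WordEvent", ["seconds", "blocked", "resolved_outside"],
                       defaults=(False,))


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def summarize_tracking(events):
    """Compute the summary statistics listed in the abstract."""
    total_seconds = sum(e.seconds for e in events)
    if not events or total_seconds <= 0:
        raise ValueError("need at least one word with positive duration")
    blocked = [e for e in events if e.blocked]
    unblocked = [e for e in events if not e.blocked]
    return {
        "blockages": len(blocked),
        "resolved_outside": sum(1 for e in blocked if e.resolved_outside),
        "avg_blockage_seconds": _mean([e.seconds for e in blocked]),
        "avg_unblocked_seconds": _mean([e.seconds for e in unblocked]),
        "words_per_minute": 60.0 * len(events) / total_seconds,
    }
```

A session of four words taking 1, 2, 9 and 12 seconds, with the last two blocked, would give a tracking speed of 10 words per minute and an average blockage time of 10.5 seconds.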


Touching voices with the Tactilator, No. 2-3/1992, pp. 23-39

Gunilla Öhngren

The purpose of this study is to investigate whether the Tactilator can give vibrational support to speechreading and - if so - what cognitive abilities are needed to take advantage of this additional information. The Tactilator is a single-channel tactile aid. The talker holds a contact-microphone against the throat, and the deafened adult has a bone-conductor in his hand. The vibrations from the talker's larynx are transmitted wirelessly to the bone-conductor. Sixteen deafened adults were tested with speech tracking live, two sentence-based video-tests, and a word decoding video-test. The subjects were tested with or without the Tactilator in all conditions. The Tactilator-supported speechreading proved to be generally superior to speechreading only. Cognitive testing revealed that using the Tactilator was best predicted by a text-based phonological recoding test, a lipped mono-syllabic word decoding test, and a complex working memory test (R=.81; Adj.R=.57). The implications for solving the signal-to-noise ratio problem, by the use of a contact-microphone both for tactile-aid and hearing-aid users, are discussed.