TMH Publications 2011

Note: this list may be incomplete.

Journal Articles

Abelin, Å. (2011). Imitation of bird song in folklore – onomatopoeia or not? TMH-QPSR, 51(1), 13-16. [pdf]Afsun, D., Forsman, E., Halvarsson, C., Jonsson, E., Malmgren, L., Neves, J., & Marklund, U. (2011). Effects of a film-based parental intervention on vocabulary development in toddlers aged 18-21 months. TMH-QPSR, 51(1), 105-108. [pdf]Ananthakrishnan, G., Eklund, R., Peters, G., & Mabiza, E. (2011). An acoustic analysis of lion roars. II: Vocal tract characteristics. TMH-QPSR, 51(1), 5-8. [pdf]Ananthakrishnan, G., & Engwall, O. (2011). Mapping between Acoustic and Articulatory Gestures. Speech Communication, 53(4), 567-589. [abstract] [link]Abstract: This paper proposes a definition for articulatory as well as acoustic gestures along with a method to segment the measured articulatory trajectories and acoustic waveforms into gestures. Using a simultaneously recorded acoustic-articulatory database, the gestures are detected based on finding critical points in the utterance, both in the acoustic and articulatory representations. The acoustic gestures are parameterized using 2-D cepstral coefficients. The articulatory trajectories are essentially the horizontal and vertical movements of Electromagnetic Articulography (EMA) coils placed on the tongue, jaw and lips along the midsagittal plane. The articulatory movements are parameterized using 2D-DCT using the same transformation that is applied on the acoustics. The relationship between the detected acoustic and articulatory gestures in terms of the timing as well as the shape is studied. In order to study this relationship further, acoustic-to-articulatory inversion is performed using GMM-based regression. The accuracy of predicting the articulatory trajectories from the acoustic waveforms are at par with state-of-the-art frame-based methods with dynamical constraints (with an average error of 1.45–1.55 mm for the two speakers in the database). 
In order to evaluate the acoustic-to-articulatory inversion in a more intuitive manner, a method based on the error in estimated critical points is suggested. Using this method, it was noted that the estimated articulatory trajectories using the acoustic-to-articulatory inversion methods were still not accurate enough to be within the perceptual tolerance of audio-visual asynchrony.Ananthakrishnan, G., Wik, P., & Engwall, O. (2011). Detecting confusable phoneme pairs for Swedish language learners depending on their first language. TMH-QPSR, 51(1), 89-92. [pdf]Andersson, I., Gauding, J., Graca, A., Holm, K., Öhlin, L., Marklund, U., & Ericsson, A. (2011). Productive vocabulary size development in children aged 18-24 months – gender differences. TMH-QPSR, 51(1), 109-112. [pdf]Blomberg, M. (2011). Model space size scaling for speaker adaptation. TMH-QPSR, 51(1), 77-80. [pdf]Bresin, R., & Friberg, A. (2011). Emotion rendering in music: Range and characteristic values of seven musical variables. Cortex, 47(9), 1068-1081. [abstract] [link]Abstract: Many studies on the synthesis of emotional expression in music performance have focused on the effect of individual performance variables on perceived emotional quality by making a systematical variation of variables. However, most of the studies have used a predetermined small number of levels for each variable, and the selection of these levels has often been done arbitrarily. The main aim of this research work is to improve upon existing methodologies by taking a synthesis approach. In a production experiment, 20 performers were asked to manipulate values of 7 musical variables simultaneously (tempo, sound level, articulation, phrasing, register, timbre, and attack speed) for communicating 5 different emotional expressions (neutral, happy, scary, peaceful, sad) for each of 4 scores. The scores were compositions communicating four different emotions (happiness, sadness, fear, calmness). 
Emotional expressions and music scores were presented in combination and in random order for each performer for a total of 5 × 4 stimuli. The experiment allowed for a systematic investigation of the interaction between emotion of each score and intended expressed emotions by performers. A two-way analysis of variance (ANOVA), repeated measures, with factors emotion and score was conducted on the participants’ values separately for each of the seven musical factors. There are two main results. The first one is that musical variables were manipulated in the same direction as reported in previous research on emotional expressive music performance. The second one is the identification for each of the five emotions the mean values and ranges of the five musical variables tempo, sound level, articulation, register, and instrument. These values resulted to be independent from the particular score and its emotion. The results presented in this study therefore allow for both the design and control of emotionally expressive computerized musical stimuli that are more ecologically valid than stimuli without performance variations.Dahlby, M., Irmalm, L., Kytöharju, S., Wallander, L., Zachariassen, H., Ericsson, A., & Marklund, U. (2011). Parent-child interaction: Relationship between pause duration and infant vocabulary at 18 months. TMH-QPSR, 51(1), 101-104. [pdf]Echternach, M., Sundberg, J., Baumann, T., Markl, M., & Richter, B. (2011). Vocal tract area functions and formant frequencies in opera tenors’ modal and falsetto registers. J. Acoust. Soc. Am, 129(6), 3955–3963.Echternach, M., Sundberg, J., Baumann, T., Markl, M., & Richter, B. (2011). Vocal tract and register changes analysed by real-time MRI in male professional singers—a pilot study. Logopedics Phoniatrics Vocology, 33(2), 67-73.Echternach, M., Sundberg, J., Zander, M., & Richter, B. (2011). Perturbation Measurements in Untrained Male Voices’ Transitions From Modal to Falsetto Register. 
Journal of Voice, 25(6), 663-669.Eklund, R., Peters, G., Ananthakrishnan, G., & Mabiza, E. (2011). An acoustic analysis of lion roars. I: Data collection and spectrogram and waveform analyses. TMH-QPSR, 51(1), 1-4. [pdf]Engwall, O. (2011). Analysis of and feedback on phonetic features in pronunciation training with a virtual teacher. Computer Assisted Language Learning. [abstract] [link]Abstract: Pronunciation errors may be caused by several different deviations from the target, such as voicing, intonation, insertions or deletions of segments, or that the articulators are placed incorrectly. Computer-animated pronunciation teachers could potentially provide important assistance on correcting all these types of deviations, but they have an additional benefit for articulatory errors. By making parts of the face transparent, they can show the correct position and shape of the tongue and provide audiovisual feedback on how to change erroneous articulations. Such a scenario however requires firstly that the learner's current articulation can be estimated with precision and secondly that the learner is able to imitate the articulatory changes suggested in the audiovisual feedback. This article discusses both these aspects, with one experiment on estimating the important articulatory features from a speaker through acoustic-to-articulatory inversion and one user test with a virtual pronunciation teacher, in which the articulatory changes made by seven learners who receive audiovisual feedback are monitored using ultrasound imaging.Fabiani, M., & Friberg, A. (2011). Influence of pitch, loudness, and timbre on the perception of instrument dynamics. Journal of the Acoustical Society of America - Express Letters, EL193-EL199. [abstract] [pdf]Abstract: The effect of variations in pitch, loudness, and timbre on the perception of the dynamics of isolated instrumental tones is investigated. A full factorial design was used in a listening experiment. 
The subjects were asked to indicate the perceived dynamics of each stimulus on a scale from pianissimo to fortissimo. Statistical analysis shows that for the instruments included (i.e. clarinet, flute, piano, trumpet, and violin) timbre and loudness have equally large effects, while pitch is relevant mostly for the first three. The results confirm our hypothesis that loudness alone is not a reliable estimate of the dynamics of musical tones.Frid, J., Schötz, S., & Löfqvist, A. (2011). Age-related lip movement repetition variability in two phrase positions. TMH-QPSR, 51(1), 21-24. [pdf]Gabrielsson, D., Kirchner, S., Nilsson, K., Norberg, A., & Widlund, C. (2011). Anticipatory lip rounding – a pilot study using The Wave Speech Research System. TMH-QPSR, 51(1), 37-40. [pdf]Hansen, K. F., Fabiani, M., & Bresin, R. (2011). Analysis of the acoustics and playing strategies of turntable scratching. Acta Acustica united with Acustica, 97(2), 303-314. [abstract] [link]Abstract: Scratching performed by a DJ (disk jockey) is a skillful style of playing the turntable with complex musical output. This study focuses on the description of some of the acoustical parameters and playing strategies of typical scratch improvisations, and how these parameters typically are used for expressive performance. Three professional DJs were instructed to express different emotions through improvisations, and both audio and gestural data were recorded. Feature extraction and analysis of the recordings are based on a combination of audio and gestural data, instrument characteristics, and playing techniques. The acoustical and performance parameters extracted from the recordings give a first approximation on the functional ranges within which DJs normally play. Results from the analysis show that parameters which are important for other solo instrument performances, such as pitch, have less influence in scratching. Both differences and commonalities between the DJs' playing styles were found. 
Impact that the findings of this work may have on constructing models for scratch performances are discussed.Heldner, M. (2011). Detection thresholds for gaps, overlaps and no-gap-no-overlaps. Journal of the Acoustical Society of America, 130(1), 508-513. [abstract] [pdf]Abstract: Detection thresholds for gaps and overlaps, that is acoustic and perceived silences and stretches of overlapping speech in speaker changes, were determined. Subliminal gaps and overlaps were categorized as no-gap-no-overlaps. The established gap and overlap detection thresholds both corresponded to the duration of a long vowel, or about 120 ms. These detection thresholds are valuable for mapping the perceptual speaker change categories gaps, overlaps, and no-gap-no-overlaps into the acoustic domain. Furthermore, the detection thresholds allow generation and understanding of gaps, overlaps, and no-gap-no-overlaps in human-like spoken dialogue systems.Hjalmarsson, A. (2011). The additive effect of turn-taking cues in human and synthetic voice. Speech Communication, 53(1), 23-35. [abstract] [link]Abstract: A previous line of research suggests that interlocutors identify appropriate places to speak by cues in the behaviour of the preceding speaker. If used in combination, these cues have an additive effect on listeners’ turn-taking attempts. The present study further explores these findings by examining the effect of such turn-taking cues experimentally. The objective is to investigate the possibilities of generating turn-taking cues with a synthetic voice. Thus, in addition to stimuli realized with a human voice, the experiment included dialogues where one of the speakers is replaced with a synthesis. The turn-taking cues investigated include intonation, phrase-final lengthening, semantic completeness, stereotyped lexical expressions and non-lexical speech production phenomena such as lexical repetitions, breathing and lip-smacks. 
The results show that the turn-taking cues realized with a synthetic voice affect the judgements similar to the corresponding human version and there is no difference in reaction times between these two conditions. Furthermore, the results support Duncan’s findings: the more turn-taking cues with the same pragmatic function, turn-yielding or turn-holding, the higher the agreement among subjects on the expected outcome. In addition, the number of turn-taking cues affects the reaction times for these decisions. Thus, the more cues, the faster the reaction time.Hu, G. (2011). Chinese perception coaching. TMH-QPSR, 51(1), 97-100. [pdf]Jonsson, H., & Eklund, R. (2011). Gender differences in verbal behaviour in a call routing speech application. TMH-QPSR, 51(1), 81-84. [pdf]Kaiser, R. (2011). Do Germans produce and perceive the Swedish word accent contrast? A cross-language analysis. TMH-QPSR, 51(1), 93-96. [pdf]Klintfors, E., Marklund, E., Kallioinen, P., & Lacerda, F. (2011). Cortical N400-Potentials Generated by Adults in Response to Semantic Incongruities. TMH-QPSR, 51(1), 121-124. [pdf]Laukka, P., Neiberg, D., Forsell, M., Karlsson, I., & Elenius, K. (2011). Expression of Affect in Spontaneous Speech: Acoustic Correlates and Automatic Detection of Irritation and Resignation. Computer Speech and Language, 25(1), 84-104. [link]Lindblom, B., & MacNeilage, P. (2011). Coarticulation: A universal phonetic phenomenon with roots in deep time. TMH-QPSR, 51(1), 41-44. [pdf]Lindblom, B., Sundberg, J., Branderud, P., & Djamshidpey, H. (2011). Articulatory modeling and front cavity acoustics. TMH-QPSR, 51(1), 17-20. [pdf]Lindström, F., Persson Waye, K., Södersten, M., McAllister, A., & Ternström, S. (2011). Observations of the relationship between noise exposure and pre-school teacher voice usage in day-care center environments. Journal of Voice, 25(2), 166-172. 
[abstract] [link]Abstract: Summary: Although the relationship between noise exposure and vocal behavior (the Lombard effect) is well established, actual vocal behavior in the workplace is still relatively unexamined. The first purpose of this study was to investigate correlations between noise level and both voice level and voice average fundamental frequency (F0) for a population of preschool teachers in their normal workplace. The second purpose was to study the vocal behavior of each teacher to investigate whether individual vocal behaviors or certain patterns could be identified. Voice and noise data were obtained for female preschool teachers (n = 13) in their workplace, using wearable measurement equipment. Correlations between noise level and voice level, and between voice level and F0, were calculated for each participant and ranged from 0.07 to 0.87 for voice level and from 0.11 to 0.78 for F0. The large spread of the correlation coefficients indicates that the teachers react individually to the noise exposure. For example, some teachers increase their voice-to-noise level ratio when the noise is reduced, whereas others do not. Key Words: Vocal behavior–Noise exposure–Voice level–Voice intensity–Fundamental frequency–Teacher voice–Occupational voice. (Accepted 30 Sept 2009)Lundholm Fors, K. (2011). An investigation of intra-turn pauses in spontaneous speech. TMH-QPSR, 51(1), 65-68. [pdf]McDonnell, M., Sundberg, J., Westerlund, J., Lindestad, P-Å., & Larsson, H. (2011). Vocal Fold Vibration and Phonation Start in Aspirated, Unaspirated, and Staccato Onset. Journal of Voice, 25(5), 526-531.Monson, B., Lotto, A., & Ternström, S. (2011). Detection of high-frequency energy changes in sustained vowels produced by singers. J Acoust Soc Am, 129(4), 2263-2268. [abstract]Abstract: The human voice spectrum above 5 kHz receives little attention. 
However, there are reasons to believe that this high-frequency energy (HFE) may play a role in perceived quality of voice in singing and speech. To fulfill this role, differences in HFE must first be detectable. To determine human ability to detect differences in HFE, the levels of the 8- and 16-kHz center-frequency octave bands were individually attenuated in sustained vowel sounds produced by singers and presented to listeners. Relatively small changes in HFE were in fact detectable, suggesting that this frequency range potentially contributes to the perception of especially the singing voice. Detection ability was greater in the 8-kHz octave than in the 16-kHz octave and varied with band energy level.Neiberg, D. (2011). Visualizing prosodic densities and contours: Forming one from many. TMH-QPSR, 51(1), 57-60. [abstract] [pdf]Abstract: This paper summarizes a flora of explorative visualization techniques for prosody developed at KTH. It is demonstrated how analysis can be made which goes beyond conventional methodology. Examples are given for turn taking, affective speech, response tokens and Swedish accent II.Pabon, P., Ternström, S., & Lamarche, A. (2011). Fourier descriptor analysis and unification of Voice Range Profile contours: method and applications. Journal of Speech, Language, and Hearing Research, 54, 755-776. [abstract]Abstract: Purpose: to describe a method for unified description, statistical modelling, and comparison of voice range profile (VRP) contours, even from diverse sources. Method: A morphologic modelling technique, based on Fourier Descriptors (FDs), is applied to the VRP contour. The technique, which essentially involves resampling of the curve of the contour, is assessed and also compared to density-based VRP averaging methods that use the overlap count. Results: VRP contours can be usefully described and compared using FDs. The method also permits the visualization of the local co-variation along the contour average. 
For example, the FD-based analysis shows that the population variance for ensembles of VRP contours is usually smallest at the upper left-hand part of the VRP. To illustrate the method’s advantages and possible further application, graphs are given that compare the averaged contours from different authors and recording devices, for normal, trained and untrained male and female voices, and child voices. Conclusions: The proposed technique allows any VRP shape to be brought to the same uniform base. On this uniform base, VRP contours or contour elements coming from a variety of sources may be placed within the same graph for comparison and for statistical analysis.Patel, S., Scherer, K., Sundberg, J., & Björkner, E. (2011). Mapping Emotions into Acoustic Space: The Role of Voice Production. Biological Psychology, 87, 93-98.Reidsma, D., de Kok, I., Neiberg, D., Pammi, S., van Straalen, B., Truong, K., & van Welbergen, H. (2011). Continuous Interaction with a Virtual Human. Journal on Multimodal User Interfaces, 4(2), 97-118. [link]Roll, M., Söderström, P., & Horne, M. (2011). Phonetic markedness, turning points, and anticipatory attention. TMH-QPSR, 51(1), 113-116. [pdf]Schlangen, D., & Skantze, G. (2011). A General, Abstract Model of Incremental Dialogue Processing. Dialogue & Discourse, 2(1), 83-111. [pdf]Schötz, S., & Eklund, R. (2011). A comparative acoustic analysis of purring in four cats. TMH-QPSR, 51(1), 9-12. [pdf]Schötz, S., Frid, J., & Löfqvist, A. (2011). Exotic vowels in Swedish: a project description and an articulographic and acoustic pilot study of /i:/. TMH-QPSR, 51(1), 25-28. [pdf]Stenberg, M. (2011). Phonetic transcriptions as a public service. TMH-QPSR, 51(1), 45-48. [pdf]Sundberg, J. (2011). On the expressive code in music performance. In Schmidhofer, A., & Jena, S. (Eds.), Klangfarbe (pp. 113-129). Sundberg, J., Gu, L., Huang, Q., & Huang, P. (2011). Acoustical Study of Classical Peking Opera Singing. 
Journal of Voice, prepubl.Sundberg, J., Lã, F., & Gill, B. (2011). Professional male singers’ formant tuning strategies for the vowel /a/. Logopedics Phoniatrics Vocology, Early Online, 1-12.Sundberg, J., Patel, S., Björkner, E., & Scherer, K. (2011). Interdependencies among voice source parameters in emotional speech. IEEE Transactions on Affective Computing, 2(3), 162-174.Suomi, K., Meister, E., & Ylitalo, R. (2011). Non-contrastive durational patterns in two quantity languages. TMH-QPSR, 51(1), 61-64.Uneson, M., & Schachtenhaufen, R. (2011). Exploring phonetic realization in Danish by Transformation-Based Learning. TMH-QPSR, 51(1), 73-76. [pdf]Wik, P., Husby, O., Øvregaard, Å., Bech, Ø., Albertsen, E., Nefzaoui, S., Skarpnes, E., & Koreman, J. (2011). Contrastive analysis through L1-L2map. TMH-QPSR, 51(1), 49-52. [pdf]Zangger Borch, D., & Sundberg, J. (2011). Phonatory and Resonatory Characteristics of the Rock, Pop, Soul, and Swedish Dance Band Styles of Singing. Journal of Voice, 25(5), 532-537.Zetterholm, E., & Tronnier, M. (2011). Teaching pronunciation in Swedish as a second language. TMH-QPSR, 51(1), 85-88. [pdf]Öhrström, N., Arppe, H., Eklund, L., Eriksson, S., Marcus, D., Mathiassen, T., & Pettersson, L. (2011). Audiovisual integration in binaural, monaural and dichotic listening. TMH-QPSR, 51(1), 29-32. [pdf]

Book Chapters

Al Moubayed, S., Beskow, J., Edlund, J., Granström, B., & House, D. (2011). Animated Faces for Robotic Heads: Gaze and Beyond. In Esposito, A., Vinciarelli, A., Vicsi, K., Pelachaud, C., & Nijholt, A. (Eds.), Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues, Lecture Notes in Computer Science (pp. 19-35). Springer. [pdf]Björkman, B. (2011). Grammar of English as a Lingua Franca. In Chapelle, C. A. (Ed.), The Encyclopedia of Applied Linguistics. Oxford: Wiley-Blackwell.Camargo, Z., Salomão, G. L., & Pinho, S. (2011). Análise acústica e aerodinâmica da voz (Acoustic and aerodynamic assessment of voice). 
In Caldas, S., Melo, J., Martins, R., & Selaimen, S. (Eds.), Tratado de Otorrinolaringologia (Treatise on Otorhinolaryngology) (pp. 794-804). Roca.Engwall, O. (2011). Augmented Reality Talking Heads as a Support for Speech Perception and Production. In Nee, A. (Ed.), Augmented Reality. InTech. [pdf]Friberg, A., Schoonderwaldt, E., & Hedblad, A. (2011). Perceptual ratings of musical parameters. In von Loesch, H., & Weinzierl, S. (Eds.), Gemessene Interpretation - Computergestützte Aufführungsanalyse im Kreuzverhör der Disziplinen (pp. 237-253). Mainz: Schott 2011, (Klang und Begriff 4). [pdf]Lindblom, B., Diehl, R., Park, S-H., & Salvi, G. (2011). Sound systems are shaped by their users: The recombination of phonetic substance. In Clements, G. N., & Ridouane, R. (Eds.), Where Do Phonological Features Come From?: Cognitive, physical and developmental bases of distinctive speech categories. CNRS & Sorbonne-Nouvelle. [abstract] [pdf]Abstract: Computational experiments were run using an optimization criterion based on independently motivated definitions of perceptual contrast, articulatory cost and learning cost. The question: If stop+vowel inventories are seen as adaptations to perceptual, articulatory and developmental constraints what would they be like? Simulations successfully predicted typologically widely observed place preferences and the re-use of place features (‘phonemic coding’) in voiced stop inventories. These results demonstrate the feasibility of user-based accounts of phonological facts and indicate the nature of the constraints that over time might shape the formation of both the formal structure and the intrinsic content of sound patterns. While phonetic factors are commonly invoked to account for substantive aspects of phonology, their explanatory scope is here also extended to a fundamental attribute of its formal organization: the combinatorial re-use of phonetic content. 
Keywords: phonological universals; phonetic systems; formal structure; intrinsic content; behavioral origins; substance-based explanationOertel, C., De Looze, C., Windmann, A., Wagner, P., & Campbell, N. (2011). Towards the Automatic Detection of Involvement in Conversation. In Esposito, A., Vinciarelli, A., Vicsi, K., Pelachaud, C., & Nijholt, A. (Eds.), Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues, Lecture Notes in Computer Science. Springer. [abstract] [pdf]Abstract: Although an increasing amount of research has been carried out into human-machine interaction in the last century, even today we are not able to fully understand the dynamic changes in human interaction. Only when we achieve this, will we be able to go beyond a one-to-one mapping between text and speech and be able to add social information to speech technologies. Social information is expressed to a high degree through prosodic cues and movement of the body and the face. The aim of this paper is to use those cues to make one aspect of social information more tangible; namely participants’ degree of involvement in a conversation. Our results for voice span and intensity, and our preliminary results on the movement of the body and face suggest that these cues are reliable cues for the detection of distinct levels of participants involvement in conversation. This will allow for the development of a statistical model which is able to classify these stages of involvement. Our data indicate that involvement may be a scalar phenomenon.

Proceedings

Al Moubayed, S., Alexanderson, S., Beskow, J., & Granström, B. (2011). A robotic head using projected animated faces. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (pp. 69). [pdf]Al Moubayed, S., & Beskow, J. (2011). A novel Skype interface using SynFace for virtual speech reading support. TMH-QPSR, 51(1), 33-36. [pdf]Al Moubayed, S., & Skantze, G. (2011). 
Effects of 2D and 3D Displays on Turn-taking Behavior in Multiparty Human-Computer Dialog. In Proceedings of SemDial (pp. 192-193). Los Angeles, CA. [pdf]Al Moubayed, S., & Skantze, G. (2011). Turn-taking Control Using Gaze in Multiparty Human-Computer Dialogue: Effects of 2D and 3D Displays. In Proceedings of AVSP. Florence, Italy. [pdf]Ananthakrishnan, G. (2011). Imitating Adult Speech: An Infant's Motivation. In 9th International Seminar on Speech Production (pp. 361-368). [abstract]Abstract: This paper tries to detail two aspects of speech acquisition by infants which are often assumed to be intrinsic or innate knowledge, namely number of degrees of freedom in the articulatory parameters and the acoustic correlates that find the correspondence between adult speech and the speech produced by the infant. The paper shows that being able to distinguish the different vowels in the vowel space of the certain language is a strong motivation for choosing both a certain number of independent articulatory parameters as well as a certain scheme of acoustic normalization between adult and child speech.Ananthakrishnan, G., & Engwall, O. (2011). Resolving Non-uniqueness in the Acoustic-to-Articulatory Mapping. In ICASSP (pp. 4628-4631). Prague, Czech republic.Ananthakrishnan, G., & Salvi, G. (2011). Using Imitation to learn Infant-Adult Acoustic Mappings. In Proceedings of Interspeech (pp. 765-768). Florence, Italy. [abstract] [pdf]Abstract: This paper discusses a model which conceptually demonstrates how infants could learn the normalization between infant-adult acoustics. The model proposes that the mapping can be inferred from the topological correspondences between the adult and infant acoustic spaces that are clustered separately in an unsupervised manner. The model requires feedback from the adult in order to select the right topology for clustering, which is crucial aspect of the model. 
The feedback is in terms of as overall rating of the imitation effort by the infant, rather than a frame-by-frame correspondence. Using synthetic, but continuous speech data, we demonstrate that clusters, which have a good topological correspondence, are perceived to be similar by a phonetically trained listener.Ananthakrishnan, G., Wik, P., Engwall, O., & Abdou, S. (2011). Using an Ensemble of Classifiers for Mispronunciation Feedback. In Strik, H., Delmonte, R., & Russel, M. (Eds.), Proceedings of SLaTE. Venice, Italy.Beskow, J., Alexanderson, S., Al Moubayed, S., Edlund, J., & House, D. (2011). Kinetic Data for Large-Scale Analysis and Modeling of Face-to-Face Conversation. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (pp. 103-106). [pdf]Bisesi, E., Parncutt, R., & Friberg, A. (2011). An accent-based approach to performance rendering: Music theory meets music psychology. In International Symposium on Performance Science (ISPS 2011) (pp. 27-32). [pdf]De Looze, C., Oertel, C., Rauzy, S., & Campbell, N. (2011). Measuring dynamics of mimicry by means of prosodic cues in conversational speech. In ICPhS 2011 (pp. 1294-1297). Hong-Kong, China. [abstract]Abstract: This study presents a method for measuring the dynamics of mimicry in conversational speech by means of prosodic cues. It shows that the more speakers are involved in a conversation, the more they intend to mimic each other’s speech prosody. It supports that mimicry in speech is part of social interaction and that it may be implemented into spoken dialogue systems in order to improve their efficiency.Dubus, G., & Bresin, R. (2011). Sonification of physical quantities throughout history: a meta-study of previous mapping strategies. In Wersényi, G., & Worrall, D. (Eds.), The 17th Annual Conference on Auditory Display, Budapest, Hungary 20-24 June, 2011, Proceedings (pp. 1-8). Budapest, Hungary: OPAKFI Egyesület. 
[abstract] [pdf]Abstract: We introduce a meta-study of previous sonification designs taking physical quantities as input data. The aim is to build a solid foundation for future sonification works so that auditory display researchers would be able to take benefit from former studies, avoiding to start from scratch when beginning new sonification projects. This work is at an early stage and the objective of this paper is rather to introduce the methodology than to come to definitive conclusions. After a historical introduction, we explain how to collect a large amount of articles and extract useful information about mapping strategies. Then, we present the physical quantities grouped according to conceptual dimensions, as well as the sound parameters used in sonification designs and we summarize the current state of the study by listing the couplings extracted from the article database. A total of 54 articles have been examined for the present article. Finally, a preliminary analysis of the results is performed.Edlund, J., Al Moubayed, S., & Beskow, J. (2011). The Mona Lisa Gaze Effect as an Objective Metric for Perceived Cospatiality. In Vilhjálmsson, H. H., Kopp, S., Marsella, S., & Thórisson, K. R. (Eds.), Proc. of the Intelligent Virtual Agents 10th International Conference (IVA 2011) (pp. 439-440). Reykjavík, Iceland: Springer. [abstract] [pdf]Abstract: We propose to utilize the Mona Lisa gaze effect for an objective and repeatable measure of the extent to which a viewer perceives an object as cospatial. Preliminary results suggest that the metric behaves as expected. Proc. of the Intelligent Virtual Agents 10th International Conference (IVA 2011)Edlund, J. (2011). How deeply rooted are the turns we take?. In Proc. of Los Angelogue (SemDial 2011). Los Angeles, CA, US. 
[abstract] [pdf]Abstract: This poster presents preliminary work investigating turn-taking in text-based chat with a view to learn something about how deeply rooted turn-taking is in the human cognition. A connexion is shown between preferred turn-taking patterns and length and type of experience with such chats, which supports the idea that the orderly type of turn-taking found in most spoken conversations is indeed deeply rooted, but not more so than that it can be overcome with training in a situation where such turn-taking is not beneficial to the communication.Elblaus, L., Hansen, K. F., & Unander-Scharin, C. (2011). Exploring the design space: Prototyping "The Throat v3" for the Elephant Man opera. In Zanolla, S., Avanzini, F., Canazza, S., & de Götzen, A. (Eds.), Proceedings of the Sound and Music Computing Conference (pp. 141-147). Padova, Italy: Padova University Press. [abstract] [pdf]Abstract: Developing new technology for artistic practice requires other methods than classical problem solving. Some of the challenges involved in the development of new musical instruments have affinities to the realm of wicked problems. Wicked problems are hard to define and have many different solutions that are good or bad (not true or false). The body of possible solutions to a wicked problem can be called a design space and exploring that space must be the objective of a design process. In this paper we present effective methods of iterative design and participatory design that we have used in a project developed in collaboration between the Royal Institute of Technology (KTH) and the University College of Opera, both in Stockholm. The methods are outlined, and examples are given of how they have been applied in specific situations. The focus lies on prototyping and evaluation with user participation. 
By creating and acting out scenarios with the user, and thus asking the questions through a prototype and receiving the answers through practice and exploration, we removed the bottleneck represented by language and allowed communication beyond verbalizing. Doing this, even so-called tacit knowledge could be activated and brought into the development process.Fabiani, M., Dubus, G., & Bresin, R. (2011). MoodifierLive: interactive and collaborative music performance on mobile devices. In Jensenius, A. R., Tveit, A., Godøy, R. I., & Overholt, D. (Eds.), Proceedings of the International Conference on New Interfaces for Musical Expression (pp. 116-119). Oslo, Norway: University of Oslo and Norwegian Academy of Music. [abstract] [pdf]Abstract: This paper presents MoodifierLive, a mobile phone application for interactive control of rule-based automatic music performance. Five different interaction modes are available, of which one allows for collaborative performances with up to four participants, and two let the user control the expressive performance using expressive hand gestures. Evaluations indicate that the application is interesting, fun to use, and that the gesture modes, especially the one based on data from free expressive gestures, allow for performances whose emotional content matches that of the gesture that produced them.Friberg, A., & Hedblad, A. (2011). A Comparison of Perceptual Ratings and Computed Audio Features. In 8th Sound and Music Computing Conference, Padova, Italy. [pdf]Friberg, A., & Källblad, A. (2011). Experiences from video-controlled sound installations. In Proceedings of New Interfaces for Musical Expression - NIME, Oslo, 2011. [pdf]Hansen, K. F., Dravins, C., & Bresin, R. (2011). Ljudskrapan/The Soundscraper: Sound exploration for children with complex needs, accommodating hearing aids and cochlear implants. In Zanolla, S., Avanzini, F., Canazza, S., & de Götzen, A. (Eds.), Proceedings of the Sound and Music Computing Conference (pp. 
70-76). Padova, Italy: Padova University Press. [abstract] [pdf]Abstract: This paper describes a system for accommodating active listening for persons with hearing aids or cochlear implants, with a special focus on children with complex needs, for instance at an early stage of cognitive development and with additional physical disabilities. The system is called Ljudskrapan (or the Soundscraper in English) and consists of a software part in Pure data and a hardware part using an Arduino microcontroller with a combination of sensors. For both the software and hardware development, one of the most important aspects was to always ensure that the system was flexible enough to cater for the very different conditions that are characteristic of the intended user group. The Soundscraper has been tested with 25 children with good results. An increased attention span was reported, as well as surprising and positive reactions from children where the caregivers were unsure whether they could hear at all. The sound generating models, the sensors and the parameter mapping were simple, but provided a controllable and complex enough sound environment even with limited interaction.Heldner, M., Edlund, J., Hjalmarsson, A., & Laskowski, K. (2011). Very short utterances and timing in turn-taking. In Proceedings of Interspeech 2011 (pp. 2837-2840). Florence, Italy. [abstract] [pdf]Abstract: This work explores the timing of very short utterances in conversations, as well as the effects of excluding intervals adjacent to such utterances from distributions of between-speaker interval durations. The results show that very short utterances are more precisely timed to the preceding utterance than longer utterances in terms of a smaller variance and a larger proportion of no-gap-no-overlaps. Excluding intervals adjacent to very short utterances furthermore results in measures of central tendency closer to zero (i.e. no-gap-no-overlaps) as well as larger variance (i.e. 
relatively longer gaps and overlaps).Hjalmarsson, A., & Laskowski, K. (2011). Measuring final lengthening for speaker-change prediction. In Proceedings of Interspeech 2011 (pp. 2069-2072). Florence, Italy. [abstract] [pdf]Abstract: We explore pre-silence syllabic lengthening as a cue for next-speakership prediction in spontaneous dialogue. When estimated using a transcription-mediated procedure, lengthening is shown to reduce error rates by 25% relative to majority class guessing. This indicates that lengthening should be exploited by dialogue systems. With that in mind, we evaluate an automatic measure of spectral envelope change, Mel-spectral flux (MSF), and show that its performance is at least as good as that of the transcription-mediated measure. Modeling MSF is likely to improve turn uptake in dialogue systems, and to benefit other applications needing an estimate of durational variability in speech.House, D., & Strömbergsson, S. (2011). Self-voice identification in children with phonological impairment. In Proceedings of the ICPhS XVII (pp. 886-889). Hong Kong. [abstract] [pdf]Abstract: We report preliminary data from a study of self-voice identification in children with phonological impairment (PI), where results from 13 children with PI are compared to results from a group of children with typical speech. No difference between the two groups was found, suggesting that a phonological impairment does not affect children’s ability to recognize their recorded voices as their own. We conclude that children with PI indeed recognize their own recorded voice and that the use of recordings in therapy can be supported.Husby, O., Øvregaard, Å., Wik, P., Bech, Ø., Albertsen, E., Nefzaoui, S., Skarpnes, E., & Koreman, J. (2011). Dealing with L1 background and L2 dialects in Norwegian CAPT. In Proceedings of SLaTE. [pdf]Johansson, M., Skantze, G., & Gustafson, J. (2011). Understanding route directions in human-robot dialogue. In Proceedings of SemDial (pp. 19-27). 
Los Angeles, CA. [pdf]Johnson-Roberson, M., Bohg, J., Skantze, G., Gustafson, J., Carlson, R., Rasolzadeh, B., & Kragic, D. (2011). Enhanced Visual Scene Understanding through Human-Robot Dialog. In IEEE/RSJ International Conference on Intelligent Robots and Systems. [pdf]Karlsson, A., House, D., Svantesson, J-O., & Tayanin, D. (2011). Comparison of F0 range in spontaneous speech in Kammu tonal and non-tonal dialects. In Proceedings of the ICPhS XVII (pp. 1026-1029). Hong Kong. [abstract] [pdf]Abstract: The aim of this study is to investigate whether the occurrence of lexical tones in a language imposes restrictions on its pitch range. Kammu, a Mon-Khmer language spoken in Northern Laos, comprises dialects with and without lexical tones and with no other major phonological differences. We use Kammu spontaneous speech to investigate differences in pitch range in the two dialects. The main finding is that tonal speakers exhibit a narrower pitch range. Thus, even at a high degree of engagement found in spontaneous speech, lexical tones impose restrictions on speakers’ pitch variation. Keywords: pitch range, tone, timing, intonation, Kammu, KhmuKarlsson, A., Svantesson, J-O., House, D., & Tayanin, D. (2011). Tone restricts F0 range and variation in Kammu. TMH-QPSR, 51(1), 53-55. [abstract] [pdf]Abstract: The aim of this study is to investigate whether the occurrence of lexical tones in a language imposes restrictions on its pitch range. We use data from Kammu, a Mon-Khmer language spoken in Northern Laos, which has one dialect with, and one without, lexical tones. The main finding is that speakers of the tonal dialect have a narrower pitch range, and also a smaller variation in pitch range.Koniaris, C., & Engwall, O. (2011). Perceptual Differentiation Modeling Explains Phoneme Mispronunciation by Non-Native Speakers. In 2011 IEEE Int. Conf. on Acoust., Speech, Sig. Proc. (ICASSP) (pp. 5704-5707). Prague, Czech Republic. 
[abstract] [link]Abstract: One of the difficulties in second language (L2) learning is the weakness in discriminating between acoustic diversity within an L2 phoneme category and between different categories. In this paper, we describe a general method to quantitatively measure the perceptual difference between a group of native and individual non-native speakers. Normally, this task includes subjective listening tests and/or a thorough linguistic study. We instead use a totally automated method based on a psycho-acoustic auditory model. For a certain phoneme class, we measure the similarity of the Euclidean space spanned by the power spectrum of a native speech signal and the Euclidean space spanned by the auditory model output. We do the same for a non-native speech signal. Comparing the two similarity measurements, we find problematic phonemes for a given speaker. To validate our method, we apply it to different groups of non-native speakers of various first language (L1) backgrounds. Our results are verified by the theoretical findings in literature obtained from linguistic studies.Koniaris, C., & Engwall, O. (2011). Phoneme Level Non-Native Pronunciation Analysis by an Auditory Model-based Native Assessment Scheme. In Interspeech 2011 (pp. 1157-1160). Florence, Italy. [NOMINATED FOR BEST STUDENT PAPER AWARD]. [abstract] [html]Abstract: We introduce a general method for automatic diagnostic evaluation of the pronunciation of individual non-native speakers based on a model of the human auditory system trained with native data stimuli. For each phoneme class, the Euclidean geometry similarity between the native perceptual domain and the non-native speech power spectrum domain is measured. The problematic phonemes for a given second language speaker are found by comparing this measure to the Euclidean geometry similarity for the same phonemes produced by native speakers only. 
The method is applied to different groups of non-native speakers of various language backgrounds and the experimental results are in agreement with theoretical findings of linguistic studies.Koreman, J., Bech, Ø., Husby, O., & Wik, P. (2011). L1-L2map: a tool for multi-lingual contrastive analysis. In Proceedings of ICPhS. [pdf]Landsiedel, C., Edlund, J., Eyben, F., Neiberg, D., & Schuller, B. (2011). Syllabification of conversational speech using bidirectional long-short-term memory neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on (pp. 5256-5259). Prague, Czech Republic. [abstract] [pdf]Abstract: Segmentation of speech signals is a crucial task in many types of speech analysis. We present a novel approach to segmentation on a syllable level, using a Bidirectional Long-Short-Term Memory Neural Network. It performs estimation of syllable nucleus positions based on regression of perceptually motivated input features to a smooth target function. Peak selection is performed to attain valid nuclei positions. Performance of the model is evaluated on the levels of both syllables and the vowel segments making up the syllable nuclei. The general applicability of the approach is illustrated by good results for two common databases - Switchboard and TIMIT - for both read and spontaneous speech, and a favourable comparison with other published results.Laskowski, K., Edlund, J., & Heldner, M. (2011). A single-port non-parametric model of turn-taking in multi-party conversation. In Proc. of ICASSP 2011 (pp. 5600-5603). Prague, Czech Republic. [abstract] [pdf]Abstract: The taking of turns to speak is an intrinsic property of conversation. It is therefore expected that models of turn-taking, providing a prior distribution over conversational form, can usefully reduce the perplexity of what is observed and processed in real-time spoken dialogue systems. 
We propose a conversation-independent single-port model of multi-party turn-taking, one which allows conversants to undertake independent actions but to condition them on the past behavior of all participants. The model is shown to generally outperform an existing multi-port model on a measure of perplexity over subsequently observed speech activity. We quantify the effect of history truncation and the success of predicting distant conversational futures, and argue that the framework is sufficiently accessible and has significant potential to usefully inform the design and behavior of spoken dialogue systems.Laskowski, K., Edlund, J., & Heldner, M. (2011). Incremental Learning and Forgetting in Stochastic Turn-Taking Models. In Proc. of Interspeech 2011 (pp. 2069-2072). Florence, Italy. [abstract] [pdf]Abstract: We present a computational framework for stochastically modeling dyad interaction chronograms. The framework’s most novel feature is the capacity for incremental learning and forgetting. To showcase its flexibility, we design experiments answering four concrete questions about the systematics of spoken interaction. The results show that: (1) individuals are clearly affected by one another; (2) there is individual variation in interaction strategy; (3) strategies wander in time rather than converge; and (4) individuals exhibit similarity with their interlocutors. We expect the proposed framework to be capable of answering many such questions with little additional effort.Laskowski, K., & Jin, Q. (2011). Harmonic structure transform for speaker recognition. In Proc. of Interspeech 2011 (pp. 365-368). Florence, Italy. [abstract] [pdf]Abstract: We evaluate a new filterbank structure, yielding the harmonic structure cepstral coefficients (HSCCs), on a mismatched-session closed-set speaker classification task. The novelty of the filterbank lies in its averaging of energy at frequencies related by harmonicity rather than by adjacency. 
Improvements are presented which achieve a 37%rel reduction in error rate under these conditions. The improved features are combined with a similar Mel-frequency cepstral coefficient (MFCC) system to yield error rate reductions of 32%rel, suggesting that HSCCs offer information which is complementary to that available to today’s MFCC-based systems.Neiberg, D., Ananthakrishnan, G., & Gustafson, J. (2011). Tracking pitch contours using minimum jerk trajectories. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. [abstract] [pdf]Abstract: This paper proposes a fundamental frequency tracker, with the specific purpose of comparing the automatic estimates with pitch contours that are sketched by trained phoneticians. The method uses a frequency domain approach to estimate pitch tracks that form minimum jerk trajectories. This method tries to mimic motor movements of the hand made while sketching. When the fundamental frequencies tracked by the proposed method on the oral and laryngograph signals were compared using the MOCHA-TIMIT database, the correlation was 0.98 and the root mean squared error was 4.0 Hz, which was slightly better than a state-of-the-art pitch tracking algorithm included in the ESPS. We also demonstrate how the proposed algorithm could be applied when comparing with sketches made by phoneticians for the variations in accent II among the Swedish dialects.Neiberg, D., & Gustafson, J. (2011). A Dual Channel Coupled Decoder for Fillers and Feedback. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. [abstract] [pdf]Abstract: This study presents a dual channel decoder capable of modeling cross-speaker dependencies for segmentation and classification of fillers and feedbacks in conversational speech found in the DEAL corpus. 
For the same number of Gaussians per state, we have shown improvement in terms of average F-score for the successive addition of 1) increased frame rate from 10 ms to 50 ms 2) Joint Maximum Cross-Correlation (JMXC) features in a single channel decoder 3) a joint transition matrix which captures dependencies symmetrically across the two channels 4) coupled acoustic model retraining symmetrically across the two channels. The final step gives a relative improvement of over 100% for fillers and feedbacks compared to our previously published results. The F-scores are in the range to make it possible to use the decoder as both a voice activity detector and an illocutionary act decoder for semi-automatic annotation.Neiberg, D., & Gustafson, J. (2011). Predicting Speaker Changes and Listener Responses With And Without Eye-contact. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. [abstract] [pdf]Abstract: This paper compares turn-taking in terms of timing and prediction in human-human conversations under the conditions when participants have eye-contact versus when there is no eye-contact, as found in the HCRC Map Task corpus. By measuring between-speaker intervals it was found that a larger proportion of speaker shifts occurred in overlap for the no eye-contact condition. For prediction we used prosodic and spectral features parametrized by time-varying length-invariant discrete cosine coefficients. With Gaussian Mixture Modeling and variations of classifier fusion schemes, we explored the task of predicting whether there is an upcoming speaker change (SC) or not (HOLD), at the end of an utterance (EOU) with a pause lag of 200 ms. The label SC was further split into LRs (listener responses, e.g. back-channels) and other TURNSHIFTs. 
The prediction was found to be somewhat easier for the eye-contact condition, for which the average recall rates were 60.57%, 66.35%, and 62.00% for TURNSHIFTs, LR and SC respectively.Neiberg, D., Laukka, P., & Elfenbein, H. A. (2011). Intra-, Inter-, and Cross-cultural Classification of Vocal Affect. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association (pp. 1581-1584). Florence, Italy. [abstract] [pdf]Abstract: We present intra-, inter- and cross-cultural classifications of vocal expressions. Stimuli were selected from the VENEC corpus and consisted of portrayals of 11 emotions, each expressed with 3 levels of intensity. Classification (nu-SVM) was based on acoustic measures related to pitch, intensity, formants, voice source and duration. Results showed that mean recall across emotions was around 2.4-3 times higher than chance level for both intra- and inter-cultural conditions. For cross-cultural conditions, the relative performance dropped 26%, 32%, and 34% for high, medium, and low emotion intensity, respectively. This suggests that intra-cultural models were more sensitive to mismatched conditions for low emotion intensity. Preliminary results further indicated that recall rate varied as a function of emotion, with lust and sadness showing the smallest performance drops in the cross-cultural condition.Neiberg, D., & Truong, K. (2011). Online Detection Of Vocal Listener Responses With Maximum Latency Constraints. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on (pp. 5836-5839). Prague, Czech Republic. [abstract] [pdf]Abstract: When human listeners utter Listener Responses (e.g. back-channels or acknowledgments) such as `yeah' and `mmhmm', interlocutors commonly continue to speak or resume their speech even before the listener has finished his/her response. This type of speech interactivity results in frequent speech overlap which is common in human-human conversation. 
To allow for this type of speech interactivity to occur between humans and spoken dialog systems, which will result in more human-like continuous and smoother human-machine interaction, we propose an on-line classifier which can classify incoming speech as Listener Responses. We show that it is possible to detect vocal Listener Responses using maximum latency thresholds of 100-500 ms, thereby obtaining equal error rates ranging from 34% to 28% by using an energy based voice activity detector.Oertel, C., De Looze, C., Vaughan, B., Gilmartin, E., Wagner, P., & Campbell, N. (2011). Using hotspots as a novel method for accessing key events in a large multi-modal corpus. In Workshop on New Tools and Methods for Very-Large-Scale Phonetics Research. Philadelphia, USA. [abstract] [pdf]Abstract: In 2009 we created the D64 corpus, a multi-modal corpus which consists of roughly eight hours of natural, non-directed spontaneous interaction in an informal setting. Five participants feature in the recordings and their conversations were captured by microphones (room, body mounted and head mounted), video cameras and a motion capture system. The large amount of video, audio and motion capture material made it necessary to structure and make available the corpus in such a way that it is easy to browse and query for various types of data that we term primary, secondary and tertiary. While users are able to make simple and highly structured searches, we discuss the use of conversational hotspots as a method of searching the data for key events in the corpus; thus enabling a user to obtain a broad overview of the data. In this paper we present an approach to structuring and presenting a multi-modal corpus based on our experience with the D64 corpus that is accessible over the web, incorporates an interactive front-end and is open to all interested researchers and students.Oertel, C., Scherer, S., & Campbell, N. (2011). 
On the use of multimodal cues for the prediction of degrees of involvement in spontaneous conversation. In Interspeech 2011 (pp. 1541-1544). Florence, Italy. [abstract] [pdf]Abstract: Quantifying the degree of involvement of a group of participants in a conversation is a task which humans accomplish every day, but it is something that, as of yet, machines are unable to do. In this study we first investigate the correlation between visual cues (gaze and blinking rate) and involvement. We then test the suitability of prosodic cues (acoustic model) as well as gaze and blinking (visual model) for the prediction of the degree of involvement by using a support vector machine (SVM). We also test whether the fusion of the acoustic and the visual model improves the prediction. We show that we are able to predict three classes of involvement with a reduction of error rate of 0.30 (accuracy = 0.68).Salvi, G., & Al Moubayed, S. (2011). Spoken Language Identification using Frame Based Entropy Measures. TMH-QPSR, 51(1), 69-72. [pdf]Salvi, G., Tesser, F., Zovato, E., & Cosi, P. (2011). Analisi Gerarchica degli Inviluppi Spettrali Differenziali di una Voce Emotiva. In Contesto comunicativo e variabilità nella produzione e percezione della lingua (AISV). Lecce, Italy.Strömbergsson, S. (2011). Children’s perception of their modified speech – preliminary findings. TMH-QPSR, 51(1), 117-120. [abstract] [pdf]Abstract: This report describes an ongoing investigation of 4-6 year-old children’s perception of synthetically modified versions of their own recorded speech. Recordings of the children’s speech production are automatically modified so that the initial consonant is replaced by a different consonant. The task for the children is to judge the phonological accuracy (correct vs. incorrect) and the speaker identity (me vs. someone else) of each stimulus. 
Preliminary results indicate that children with typical speech generally judge phonological accuracy correctly, of both original recordings and synthetically modified recordings. As a first evaluation of the re-synthesis technique with child users, the results are promising, as the children generally accept the intended phonological form, seemingly without detecting the modification.Strömbergsson, S. (2011). Children’s recognition of their own voice: influence of phonological impairment. In Proceedings of Interspeech 2011 (pp. 2205-2208). Florence, Italy. [abstract] [pdf]Abstract: This study explores the ability to identify the recorded voice as one’s own, in three groups of children: one group of children with phonological impairment (PI) and two groups of children with typical speech and language development; 4-5 year-olds and 7-8 year-olds. High average performance rates in all three groups suggest that these children indeed recognize their recorded voice as their own, with no significant difference between the groups. Signs indicating that children with deviant speech use their speech deviance as a cue to identifying their own voice are discussed.Strömbergsson, S. (2011). Corrective re-synthesis of deviant speech using unit selection. In Sandford Pedersen, B., Nespore, G., & Skadina, I. (Eds.), NODALIDA 2011 Conference Proceedings (pp. 214-217). Riga, Latvia. [abstract] [pdf]Abstract: This report describes a novel approach to modified re-synthesis, by concatenation of speech from different speakers. The system removes an initial voiceless plosive from one utterance, recorded from a child, and replaces it with another voiceless plosive selected from a database of recordings of other child speakers. Preliminary results from a listener evaluation are reported.Strömbergsson, S. (2011). Segmental re-synthesis of child speech using unit selection. In Proceedings of the ICPhS XVII (pp. 1910-1913). Hong Kong. 
[abstract] [pdf]Abstract: This report describes a novel approach to segmental re-synthesis of child speech, by concatenation of speech from different speakers. The re-synthesis builds upon standard methods of unit selection, but instead of using speech from only one speaker, target segments are selected from a speech database of many child speakers. Results from a listener evaluation suggest that the method can be used to generate intelligible speech that is difficult to distinguish from original recordings.Master ThesesAhmad, R. (2011). Con curdino – Utökning av tvärflöjtens tonförråd till kvartstoner genom perturbation av tonhålens geometri. Master's thesis. [pdf]Gu, T-L. (2011). Why are Computer Games so Entertaining? – A study on whether motivational feedback elements from computer games can improve spoken language learning games. Master's thesis, CSC. [abstract] [pdf]Abstract: This master thesis is a part of a project conducted by the Centre of Speech Technology at the Department of Language, Speech and Music at the Royal Institute of Technology. The focus of this master thesis has been on how to make computerized language learning more entertaining and stimulating. Computer games in general can create an addictive and motivational encouraging stimulation for players and it is very interesting to understand what these elements are and whether the elements can be implemented into language learning games. The subjects for this project consisted of both bachelor and master students from the ages 18-35 and some participants being members of a game committee at the school of Information and Communication Technology (ICT). In order to understand what makes normal games so addictive and what these motivational feedback attributes and elements are, we have to examine and explore which game traits, mechanisms and genres are considered appealing by the users. 
The project started by examining what previous works have been made within the fields of motivation, education and games. Afterwards a qualitative survey was created based on previous theoretical works that was sent to the committee and the targeted user group of interest. The results were later evaluated and analyzed, in order to extract the most appealing, important and motivational encouraging feedback elements.Hedblad, A. (2011). Evaluation of Musical Feature Extraction Tools using Perceptual Ratings. Master's thesis. [pdf]Jani, M. (2011). An Interactive Articulation-to-Area-Function Phonetics Modelling Tool. Master's thesis, CSC. [abstract] [pdf]Abstract: For speech synthesis, the concatenative approach is currently the most widely adopted, however, it is recognized that concatenated speech has several inherent disadvantages when compared to what might be achieved using articulatory synthesis. As articulatory speech synthesis receives more attention, many articulatory models are developed. While articulatory models may eventually be used for speech synthesis, they are already valuable as research and pedagogical tools. For example, they can be applied to explore formant – cavity relationships and other articulatory aspects of the human sound production system. The main objectives of this work were to rewrite and modernize APEX, one of several current articulatory models, and at the same time to explore the feasibility of using the SuperCollider development environment as an interactive platform for voice modelling. SuperCollider is a programming environment for composition and sound processing. It follows a client-server model, where the client has an interpreted programming language to control the server, which has natively implemented signal processing functions. Initially, only the client side was used, but later to achieve better performance, the time consuming numeric computations were implemented using native code in the server. 
The architecture of this second version is described in detail in the Implementation section. It was found that real-time simulation in SuperCollider is possible, but only if the code is carefully optimized and structured. Basic speed benchmarks are presented in the results section. The resulting software inherits the portability of SuperCollider, so it should be easy to transfer it to other platforms. The architecture makes it easy to change part of the software, for example to implement a new synthesizer. There are also recommendations for further work, including suggestions for an improved architecture and a discussion of how the project would benefit from a 3D model.PhD ThesesAnanthakrishnan, G. (2011). From Acoustics to Articulation. Doctoral dissertation, School of Computer Science and Communication. [pdf]Edlund, J. (2011). In search of the conversational homunculus - serving to understand spoken human face-to-face interaction. Doctoral dissertation, KTH. [abstract] [pdf]Abstract: In the group of people with whom I have worked most closely, we recently attempted to dress our visionary goal in words: “to learn enough about human face-to-face interaction that we are able to create an artificial conversational partner that is humanlike”. The “conversational homunculus” figuring in the title of this book represents this “artificial conversational partner”. The vision is motivated by an urge to test computationally our understandings of how human-human interaction functions, and the bulk of my work leads towards the conversational homunculus in one way or another. 
This book compiles and summarizes that work: it sets out by presenting and providing background and motivation for the long-term research goal of creating a humanlike spoken dialogue system, and continues along the lines of an initial iteration of an iterative research process towards that goal, beginning with the planning and collection of human-human interaction corpora, continuing with the analysis and modelling of the human-human corpora, and ending in the implementation of, experimentation with and evaluation of humanlike components for human-machine interaction. The studies presented have a clear focus on interactive phenomena at the expense of propositional content and syntactic constructs, and typically investigate the regulation of dialogue flow and feedback, or the establishment of mutual understanding and grounding.Fabiani, M. (2011). Interactive computer-aided expressive music performance - Analysis, control, modification, and synthesis methods. Doctoral dissertation, KTH Royal Institute of Technology. [pdf]Laskowski, K. (2011). Predicting, detecting and explaining the occurrence of vocal activity in multi-party conversation. Doctoral dissertation, Carnegie Mellon University. [abstract] [pdf]Abstract: Understanding a conversation involves many challenges that understanding the speech in that conversation does not. An important source of this discrepancy is the form of the conversation, which emerges from tactical decisions that participants make in how, and precisely when, they choose to participate. An offline conversation understanding system, beyond understanding the spoken sequence of words, must be able to account for that form. In addition, an online system may need to evaluate its competing next-action alternatives, instant by instant, and to adjust its strategy based on the felicity of its past decisions. In circumscribed transient settings, understanding the spoken sequence of words may not be necessary for either type of system. 
This dissertation explores tactical conversational conduct. It adopts an existing laconic representation of conversational form known as the vocal interaction chronogram, which effectively elides a conversation’s text-dependent attributes, such as the words spoken, the language used, the sequence of topics, and any explicit or implicit goals. Chronograms are treated as Markov random fields, and probability density models are developed that characterize the joint behavior of participants. Importantly, these models are independent of a conversation’s duration and the number of its participants. They render overtly dissimilar conversations directly comparable, and enable the training of a single model of conversational form using a large number of dissimilar human-human conversations. The resulting statistical framework is shown to provide a computational counterpart to the qualitative field of conversation analysis, corroborating and elaborating on several sociolinguistic observations. It extends the quantitative treatment of two-party dialogue, as found in anthropology, social psychology, and telecommunications research, to the general multi-party setting. Experimental results indicate that the proposed computational theory benefits the detection and participant-attribution of speech activity. Furthermore, the theory is shown to enable the inference of illocutionary intent, emotional state, and social status, independently of linguistic analysis. Taken together, for conversations of arbitrary duration, participant number, and text-dependent attributes, the results demonstrate a degree of characterization and understanding of the nature of conversational interaction that has not been shown previously.

Wik, P. (2011). The Virtual Language Teacher: Models and applications for language learning using embodied conversational agents. Doctoral dissertation, KTH School of Computer Science and Communication.
[abstract] [pdf]
Abstract: This thesis presents a framework for computer assisted language learning using a virtual language teacher. It is an attempt at creating not only a new type of language learning software, but also a server-based application that collects large amounts of speech material for future research purposes. The motivation for the framework is to create a research platform for computer assisted language learning, and computer assisted pronunciation training. Within the thesis, different feedback strategies and pronunciation error detectors are explored. This is a broad, interdisciplinary approach, combining research from a number of scientific disciplines, such as speech-technology, game studies, cognitive science, phonetics, phonology, and second-language acquisition and teaching methodologies. The thesis discusses the paradigm both from a top-down point of view, where a number of functionally separate but interacting units are presented as part of a proposed architecture, and bottom-up by demonstrating and testing an implementation of the framework.

Technical reports

Borin, L., Brandt, M. D., Edlund, J., Lindh, J., & Parkvall, M. (2011). Languages in the European Information Society - Swedish. Technical Report. [pdf]

Published by: TMH, Speech, Music and Hearing

Last updated: Wednesday, 01-Jun-2011 07:35:23 MEST