CTT Publications

This list contains publications authored or co-authored by members of CTT, the Centre for Speech Technology, during the period that CTT has existed. Some of the papers concern research projects outside CTT.

The papers are sorted by publication year and within each year they are sorted by first author.




2012

Al Moubayed, S., Edlund, J., & Beskow, J. (2012). Taming Mona Lisa: communicating gaze faithfully in 2D and 3D facial projections. ACM Transactions on Interactive Intelligent Systems.
Abstract: The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of the dialogue and participant attention, for deictic referencing, and for the communication of attitude. When developing embodied conversational agents (ECAs) and talking heads, modeling and delivering accurate gaze targets is crucial. Traditionally, systems communicating through talking heads have been displayed to the human conversant using 2D displays, such as flat monitors. This approach introduces severe limitations for an accurate communication of gaze since 2D displays are associated with several powerful effects and illusions, most importantly the Mona Lisa gaze effect, where the gaze of the projected head appears to follow the observer regardless of viewing angle. We describe the Mona Lisa gaze effect and its consequences in the interaction loop, and propose a new approach for displaying talking heads using a 3D projection surface (a physical model of a human head) as an alternative to the traditional flat surface projection. We investigate and compare the accuracy of the perception of gaze direction and the Mona Lisa gaze effect in 2D and 3D projection surfaces in a five-subject gaze perception experiment. The experiment confirms that a 3D projection surface completely eliminates the Mona Lisa gaze effect and delivers very accurate gaze direction that is independent of the observer's viewing angle. Based on the data collected in this experiment, we rephrase the formulation of the Mona Lisa gaze effect. The data, when reinterpreted, confirms the predictions of the new model for both 2D and 3D projection surfaces.
Finally, we discuss the requirements on different spatially interactive systems in terms of gaze direction, and propose new applications and experiments for interaction in human-ECA and human-robot settings made possible by this technology.

Edlund, J., House, D., & Beskow, J. (in press). Gesture movement profiles in dialogues from a Swedish multimodal database of spontaneous speech. To be published in Bergmann, P., Brenning, J., Pfeiffer, M. C., & Reber, E. (Eds.), Prosodic and Visual Resources in Interactional Grammar. de Gruyter.

Edlund, J., House, D., & Strömbergsson, S. (in press). Question types and some prosodic correlates in 600 questions in the Spontal database of Swedish dialogues. To be published in Proc. of Speech Prosody 2012. Shanghai, China.
Abstract: Studies of questions present strong evidence that there is no one-to-one relationship between intonation and interrogative mode. We present initial steps of a larger project investigating and describing intonational variation in the Spontal database of 120 half-hour spontaneous dialogues in Swedish, and testing the hypothesis that the concept of a standard question intonation such as a final pitch rise contrasting a final low declarative intonation is not consistent with the pragmatic use of intonation in dialogue. We report on the extraction of 600 questions from the Spontal corpus, coding and annotation of question typology, and preliminary results concerning some prosodic correlates related to question type.

Engwall, O. (in press). Datoranimerade talande ansikten. To be published in Adelswärd, V., & Forstorp, P-A. (Eds.), Människans ansikten: Emotion, interaktion och konst. Carlssons Bokförlag. [pdf]

2011

Abelin, Å. (2011). Imitation of bird song in folklore – onomatopoeia or not? TMH-QPSR, 51(1), 13-16. [pdf]

Afsun, D., Forsman, E., Halvarsson, C., Jonsson, E., Malmgren, L., Neves, J., & Marklund, U. (2011).
Effects of a film-based parental intervention on vocabulary development in toddlers aged 18-21 months. TMH-QPSR, 51(1), 105-108. [pdf]

Al Moubayed, S., Alexanderson, S., Beskow, J., & Granström, B. (2011). A robotic head using projected animated faces. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (p. 69). [pdf]

Al Moubayed, S., Beskow, J., Edlund, J., Granström, B., & House, D. (2011). Animated Faces for Robotic Heads: Gaze and Beyond. In Esposito, A., Vinciarelli, A., Vicsi, K., Pelachaud, C., & Nijholt, A. (Eds.), Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues, Lecture Notes in Computer Science (pp. 19-35). Springer.

Al Moubayed, S., & Beskow, J. (2011). A novel Skype interface using SynFace for virtual speech reading support. TMH-QPSR, 51(1), 33-36. [pdf]

Al Moubayed, S., & Skantze, G. (2011). Effects of 2D and 3D Displays on Turn-taking Behavior in Multiparty Human-Computer Dialog. In Proceedings of SemDial (pp. 192-193). Los Angeles, CA. [pdf]

Al Moubayed, S., & Skantze, G. (2011). Turn-taking Control Using Gaze in Multiparty Human-Computer Dialogue: Effects of 2D and 3D Displays. In Proceedings of AVSP. Florence, Italy. [pdf]

Ananthakrishnan, G. (2011). From Acoustics to Articulation. Doctoral dissertation, School of Computer Science and Communication. [pdf]

Ananthakrishnan, G. (2011). Imitating Adult Speech: An Infant's Motivation. In 9th International Seminar on Speech Production (pp. 361-368).
Abstract: This paper tries to detail two aspects of speech acquisition by infants which are often assumed to be intrinsic or innate knowledge, namely number of degrees of freedom in the articulatory parameters and the acoustic correlates that find the correspondence between adult speech and the speech produced by the infant.
The paper shows that being able to distinguish the different vowels in the vowel space of a certain language is a strong motivation for choosing both a certain number of independent articulatory parameters as well as a certain scheme of acoustic normalization between adult and child speech.

Ananthakrishnan, G., Eklund, R., Peters, G., & Mabiza, E. (2011). An acoustic analysis of lion roars. II: Vocal tract characteristics. TMH-QPSR, 51(1), 5-8. [pdf]

Ananthakrishnan, G., & Engwall, O. (2011). Mapping between Acoustic and Articulatory Gestures. Speech Communication, 53(4), 567-589. [link]
Abstract: This paper proposes a definition for articulatory as well as acoustic gestures along with a method to segment the measured articulatory trajectories and acoustic waveforms into gestures. Using a simultaneously recorded acoustic-articulatory database, the gestures are detected based on finding critical points in the utterance, both in the acoustic and articulatory representations. The acoustic gestures are parameterized using 2-D cepstral coefficients. The articulatory trajectories are essentially the horizontal and vertical movements of Electromagnetic Articulography (EMA) coils placed on the tongue, jaw and lips along the midsagittal plane. The articulatory movements are parameterized using 2D-DCT using the same transformation that is applied on the acoustics. The relationship between the detected acoustic and articulatory gestures in terms of the timing as well as the shape is studied. In order to study this relationship further, acoustic-to-articulatory inversion is performed using GMM-based regression. The accuracy of predicting the articulatory trajectories from the acoustic waveforms is on par with state-of-the-art frame-based methods with dynamical constraints (with an average error of 1.45–1.55 mm for the two speakers in the database).
In order to evaluate the acoustic-to-articulatory inversion in a more intuitive manner, a method based on the error in estimated critical points is suggested. Using this method, it was noted that the estimated articulatory trajectories using the acoustic-to-articulatory inversion methods were still not accurate enough to be within the perceptual tolerance of audio-visual asynchrony.

Ananthakrishnan, G., & Engwall, O. (2011). Resolving Non-uniqueness in the Acoustic-to-Articulatory Mapping. In ICASSP (pp. 4628-4631). Prague, Czech Republic.

Ananthakrishnan, G., & Salvi, G. (2011). Using Imitation to learn Infant-Adult Acoustic Mappings. In Proceedings of Interspeech (pp. 765-768). Florence, Italy. [pdf]
Abstract: This paper discusses a model which conceptually demonstrates how infants could learn the normalization between infant-adult acoustics. The model proposes that the mapping can be inferred from the topological correspondences between the adult and infant acoustic spaces that are clustered separately in an unsupervised manner. The model requires feedback from the adult in order to select the right topology for clustering, which is a crucial aspect of the model. The feedback is in terms of an overall rating of the imitation effort by the infant, rather than a frame-by-frame correspondence. Using synthetic, but continuous speech data, we demonstrate that clusters, which have a good topological correspondence, are perceived to be similar by a phonetically trained listener.

Ananthakrishnan, G., Wik, P., & Engwall, O. (2011). Detecting confusable phoneme pairs for Swedish language learners depending on their first language. TMH-QPSR, 51(1), 89-92. [pdf]

Ananthakrishnan, G., Wik, P., Engwall, O., & Abdou, S. (2011). Using an Ensemble of Classifiers for Mispronunciation Feedback. In Strik, H., Delmonte, R., & Russel, M. (Eds.), Proceedings of SLaTE. Venice, Italy.

Andersson, I., Gauding, J., Graca, A., Holm, K., Öhlin, L., Marklund, U., & Ericsson, A. (2011).
Productive vocabulary size development in children aged 18-24 months – gender differences. TMH-QPSR, 51(1), 109-112. [pdf]

Beskow, J., Alexanderson, S., Al Moubayed, S., Edlund, J., & House, D. (2011). Kinetic Data for Large-Scale Analysis and Modeling of Face-to-Face Conversation. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (pp. 103-106). [pdf]

Blomberg, M. (2011). Model space size scaling for speaker adaptation. TMH-QPSR, 51(1), 77-80. [pdf]

Dahlby, M., Irmalm, L., Kytöharju, S., Wallander, L., Zachariassen, H., Ericsson, A., & Marklund, U. (2011). Parent-child interaction: Relationship between pause duration and infant vocabulary at 18 months. TMH-QPSR, 51(1), 101-104. [pdf]

Edlund, J., Al Moubayed, S., & Beskow, J. (2011). The Mona Lisa Gaze Effect as an Objective Metric for Perceived Cospatiality. In Vilhjálmsson, H. H., Kopp, S., Marsella, S., & Thórisson, K. R. (Eds.), Proc. of the Intelligent Virtual Agents 10th International Conference (IVA 2011) (pp. 439-440). Reykjavík, Iceland: Springer. [pdf]
Abstract: We propose to utilize the Mona Lisa gaze effect for an objective and repeatable measure of the extent to which a viewer perceives an object as cospatial. Preliminary results suggest that the metric behaves as expected.

Edlund, J. (2011). How deeply rooted are the turns we take? In Proc. of Los Angelogue (SemDial 2011). Los Angeles, CA, US.

Edlund, J. (2011). In search of the conversational homunculus - serving to understand spoken human face-to-face interaction. Doctoral dissertation, KTH. [pdf]
Abstract: In the group of people with whom I have worked most closely, we recently attempted to dress our visionary goal in words: “to learn enough about human face-to-face interaction that we are able to create an artificial conversational partner that is humanlike”.
The “conversational homunculus” figuring in the title of this book represents this “artificial conversational partner”. The vision is motivated by an urge to test computationally our understandings of how human-human interaction functions, and the bulk of my work leads towards the conversational homunculus in one way or another. This book compiles and summarizes that work: it sets out by presenting and providing background and motivation for the long-term research goal of creating a humanlike spoken dialogue system, and continues along the lines of an initial iteration of an iterative research process towards that goal, beginning with the planning and collection of human-human interaction corpora, continuing with the analysis and modelling of the human-human corpora, and ending in the implementation of, experimentation with and evaluation of humanlike components for human-machine interaction. The studies presented have a clear focus on interactive phenomena at the expense of propositional content and syntactic constructs, and typically investigate the regulation of dialogue flow and feedback, or the establishment of mutual understanding and grounding.

Eklund, R., Peters, G., Ananthakrishnan, G., & Mabiza, E. (2011). An acoustic analysis of lion roars. I: Data collection and spectrogram and waveform analyses. TMH-QPSR, 51(1), 1-4. [pdf]

Engwall, O. (2011). Analysis of and feedback on phonetic features in pronunciation training with a virtual teacher. Computer Assisted Language Learning. [link]
Abstract: Pronunciation errors may be caused by several different deviations from the target, such as voicing, intonation, insertions or deletions of segments, or that the articulators are placed incorrectly. Computer-animated pronunciation teachers could potentially provide important assistance on correcting all these types of deviations, but they have an additional benefit for articulatory errors.
By making parts of the face transparent, they can show the correct position and shape of the tongue and provide audiovisual feedback on how to change erroneous articulations. Such a scenario however requires firstly that the learner's current articulation can be estimated with precision and secondly that the learner is able to imitate the articulatory changes suggested in the audiovisual feedback. This article discusses both these aspects, with one experiment on estimating the important articulatory features from a speaker through acoustic-to-articulatory inversion and one user test with a virtual pronunciation teacher, in which the articulatory changes made by seven learners who receive audiovisual feedback are monitored using ultrasound imaging.

Engwall, O. (2011). Augmented Reality Talking Heads as a Support for Speech Perception and Production. In Nee, A. (Ed.), Augmented Reality. InTech. [pdf]

Frid, J., Schötz, S., & Löfqvist, A. (2011). Age-related lip movement repetition variability in two phrase positions. TMH-QPSR, 51(1), 21-24. [pdf]

Gabrielsson, D., Kirchner, S., Nilsson, K., Norberg, A., & Widlund, C. (2011). Anticipatory lip rounding – a pilot study using The Wave Speech Research System. TMH-QPSR, 51(1), 37-40. [pdf]

Heldner, M. (2011). Detection thresholds for gaps, overlaps and no-gap-no-overlaps. Journal of the Acoustical Society of America, 130(1), 508-513. [pdf]
Abstract: Detection thresholds for gaps and overlaps, that is acoustic and perceived silences and stretches of overlapping speech in speaker changes, were determined. Subliminal gaps and overlaps were categorized as no-gap-no-overlaps. The established gap and overlap detection thresholds both corresponded to the duration of a long vowel, or about 120 ms. These detection thresholds are valuable for mapping the perceptual speaker change categories gaps, overlaps, and no-gap-no-overlaps into the acoustic domain.
Furthermore, the detection thresholds allow generation and understanding of gaps, overlaps, and no-gap-no-overlaps in human-like spoken dialogue systems.

Heldner, M., Edlund, J., Hjalmarsson, A., & Laskowski, K. (2011). Very short utterances and timing in turn-taking. In Proceedings of Interspeech 2011. Florence, Italy.
Abstract: This work explores the timing of very short utterances in conversations, as well as the effects of excluding intervals adjacent to such utterances from distributions of between-speaker interval durations. The results show that very short utterances are more precisely timed to the preceding utterance than longer utterances in terms of a smaller variance and a larger proportion of no-gap-no-overlaps. Excluding intervals adjacent to very short utterances furthermore results in measures of central tendency closer to zero (i.e. no-gap-no-overlaps) as well as larger variance (i.e. relatively longer gaps and overlaps).

Hjalmarsson, A. (2011). The additive effect of turn-taking cues in human and synthetic voice. Speech Communication, 53(1), 23-35. [003]
Abstract: A previous line of research suggests that interlocutors identify appropriate places to speak by cues in the behaviour of the preceding speaker. If used in combination, these cues have an additive effect on listeners’ turn-taking attempts. The present study further explores these findings by examining the effect of such turn-taking cues experimentally. The objective is to investigate the possibilities of generating turn-taking cues with a synthetic voice. Thus, in addition to stimuli realized with a human voice, the experiment included dialogues where one of the speakers is replaced with a synthesis. The turn-taking cues investigated include intonation, phrase-final lengthening, semantic completeness, stereotyped lexical expressions and non-lexical speech production phenomena such as lexical repetitions, breathing and lip-smacks.
The results show that the turn-taking cues realized with a synthetic voice affect the judgements similarly to the corresponding human version and there is no difference in reaction times between these two conditions. Furthermore, the results support Duncan’s findings: the more turn-taking cues with the same pragmatic function, turn-yielding or turn-holding, the higher the agreement among subjects on the expected outcome. In addition, the number of turn-taking cues affects the reaction times for these decisions. Thus, the more cues, the faster the reaction time.

Hjalmarsson, A., & Laskowski, K. (2011). Measuring final lengthening for speaker-change prediction. In Proceedings of Interspeech 2011 (pp. 2069-2072). Florence, Italy. [pdf]
Abstract: We explore pre-silence syllabic lengthening as a cue for next-speakership prediction in spontaneous dialogue. When estimated using a transcription-mediated procedure, lengthening is shown to reduce error rates by 25% relative to majority class guessing. This indicates that lengthening should be exploited by dialogue systems. With that in mind, we evaluate an automatic measure of spectral envelope change, Mel-spectral flux (MSF), and show that its performance is at least as good as that of the transcription-mediated measure. Modeling MSF is likely to improve turn uptake in dialogue systems, and to benefit other applications needing an estimate of durational variability in speech.

House, D., & Strömbergsson, S. (2011). Self-voice identification in children with phonological impairment. In Proceedings of the ICPhS XVII (pp. 886-889). Hong Kong. [pdf]
Abstract: We report preliminary data from a study of self-voice identification in children with phonological impairment (PI), where results from 13 children with PI are compared to results from a group of children with typical speech.
No difference between the two groups was found, suggesting that a phonological impairment does not affect children’s ability to recognize their recorded voices as their own. We conclude that children with PI indeed recognize their own recorded voice and that the use of recordings in therapy can be supported.

Hu, G. (2011). Chinese perception coaching. TMH-QPSR, 51(1), 97-100. [pdf]

Husby, O., Øvregaard, Å., Wik, P., Bech, Ø., Albertsen, E., Nefzaoui, S., Skarpnes, E., & Koreman, J. (2011). Dealing with L1 background and L2 dialects in Norwegian CAPT. In Proceedings of SLaTE. [pdf]

Johansson, M., Skantze, G., & Gustafson, J. (2011). Understanding route directions in human-robot dialogue. In Proceedings of SemDial (pp. 19-27). Los Angeles, CA. [pdf]

Johnson-Roberson, M., Bohg, J., Skantze, G., Gustafson, J., Carlson, R., Rasolzadeh, B., & Kragic, D. (2011). Enhanced Visual Scene Understanding through Human-Robot Dialog. In IEEE/RSJ International Conference on Intelligent Robots and Systems. [pdf]

Jonsson, H., & Eklund, R. (2011). Gender differences in verbal behaviour in a call routing speech application. TMH-QPSR, 51(1), 81-84. [pdf]

Kaiser, R. (2011). Do Germans produce and perceive the Swedish word accent contrast? A cross-language analysis. TMH-QPSR, 51(1), 93-96. [pdf]

Karlsson, A., House, D., Svantesson, J-O., & Tayanin, D. (2011). Comparison of F0 range in spontaneous speech in Kammu tonal and non-tonal dialects. In Proceedings of the ICPhS XVII (pp. 1026-1029). Hong Kong. [pdf]
Abstract: The aim of this study is to investigate whether the occurrence of lexical tones in a language imposes restrictions on its pitch range. Kammu, a Mon-Khmer language spoken in Northern Laos, comprises dialects with and without lexical tones and with no other major phonological differences. We use Kammu spontaneous speech to investigate differences in pitch range in the two dialects. The main finding is that tonal speakers exhibit a narrower pitch range.
Thus, even at a high degree of engagement found in spontaneous speech, lexical tones impose restrictions on speakers’ pitch variation. Keywords: pitch range, tone, timing, intonation, Kammu, Khmu

Karlsson, A., Svantesson, J-O., House, D., & Tayanin, D. (2011). Tone restricts F0 range and variation in Kammu. TMH-QPSR, 51(1), 53-55. [pdf]
Abstract: The aim of this study is to investigate whether the occurrence of lexical tones in a language imposes restrictions on its pitch range. We use data from Kammu, a Mon-Khmer language spoken in Northern Laos, which has one dialect with, and one without, lexical tones. The main finding is that speakers of the tonal dialect have a narrower pitch range, and also a smaller variation in pitch range.

Klintfors, E., Marklund, E., Kallioinen, P., & Lacerda, F. (2011). Cortical N400-Potentials Generated by Adults in Response to Semantic Incongruities. TMH-QPSR, 51(1), 121-124. [pdf]

Koniaris, C., & Engwall, O. (2011). Perceptual Differentiation Modeling Explains Phoneme Mispronunciation by Non-Native Speakers. In 2011 IEEE Int. Conf. on Acoust., Speech, Sig. Proc. (ICASSP) (pp. 5704-5707). Prague, Czech Republic. [link]
Abstract: One of the difficulties in second language (L2) learning is the weakness in discriminating between acoustic diversity within an L2 phoneme category and between different categories. In this paper, we describe a general method to quantitatively measure the perceptual difference between a group of native and individual non-native speakers. Normally, this task includes subjective listening tests and/or a thorough linguistic study. We instead use a totally automated method based on a psycho-acoustic auditory model. For a certain phoneme class, we measure the similarity of the Euclidean space spanned by the power spectrum of a native speech signal and the Euclidean space spanned by the auditory model output. We do the same for a non-native speech signal.
Comparing the two similarity measurements, we find problematic phonemes for a given speaker. To validate our method, we apply it to different groups of non-native speakers of various first language (L1) backgrounds. Our results are verified by the theoretical findings in literature obtained from linguistic studies.

Koniaris, C., & Engwall, O. (2011). Phoneme Level Non-Native Pronunciation Analysis by an Auditory Model-based Native Assessment Scheme. In Interspeech 2011 (pp. 1157-1160). Florence, Italy. [NOMINATED FOR BEST STUDENT PAPER AWARD]. [html]
Abstract: We introduce a general method for automatic diagnostic evaluation of the pronunciation of individual non-native speakers based on a model of the human auditory system trained with native data stimuli. For each phoneme class, the Euclidean geometry similarity between the native perceptual domain and the non-native speech power spectrum domain is measured. The problematic phonemes for a given second language speaker are found by comparing this measure to the Euclidean geometry similarity for the same phonemes produced by native speakers only. The method is applied to different groups of non-native speakers of various language backgrounds and the experimental results are in agreement with theoretical findings of linguistic studies.

Koreman, J., Bech, Ø., Husby, O., & Wik, P. (2011). L1-L2map: a tool for multi-lingual contrastive analysis. In Proceedings of ICPhS. [pdf]

Landsiedel, C., Edlund, J., Eyben, F., Neiberg, D., & Schuller, B. (2011). Syllabification of conversational speech using bidirectional long-short-term memory neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on (pp. 5256-5259). Prague, Czech Republic. [pdf]
Abstract: Segmentation of speech signals is a crucial task in many types of speech analysis. We present a novel approach at segmentation on a syllable level, using a Bidirectional Long-Short-Term Memory Neural Network.
It performs estimation of syllable nucleus positions based on regression of perceptually motivated input features to a smooth target function. Peak selection is performed to attain valid nuclei positions. Performance of the model is evaluated on the levels of both syllables and the vowel segments making up the syllable nuclei. The general applicability of the approach is illustrated by good results for two common databases - Switchboard and TIMIT - for both read and spontaneous speech, and a favourable comparison with other published results.

Laskowski, K. (2011). Predicting, detecting and explaining the occurrence of vocal activity in multi-party conversation. Doctoral dissertation, Carnegie Mellon University.
Abstract: Understanding a conversation involves many challenges that understanding the speech in that conversation does not. An important source of this discrepancy is the form of the conversation, which emerges from tactical decisions that participants make in how, and precisely when, they choose to participate. An offline conversation understanding system, beyond understanding the spoken sequence of words, must be able to account for that form. In addition, an online system may need to evaluate its competing next-action alternatives, instant by instant, and to adjust its strategy based on the felicity of its past decisions. In circumscribed transient settings, understanding the spoken sequence of words may not be necessary for either type of system. This dissertation explores tactical conversational conduct. It adopts an existing laconic representation of conversational form known as the vocal interaction chronogram, which effectively elides a conversation’s text-dependent attributes, such as the words spoken, the language used, the sequence of topics, and any explicit or implicit goals. Chronograms are treated as Markov random fields, and probability density models are developed that characterize the joint behavior of participants.
Importantly, these models are independent of a conversation’s duration and the number of its participants. They render overtly dissimilar conversations directly comparable, and enable the training of a single model of conversational form using a large number of dissimilar human-human conversations. The resulting statistical framework is shown to provide a computational counterpart to the qualitative field of conversation analysis, corroborating and elaborating on several sociolinguistic observations. It extends the quantitative treatment of two-party dialogue, as found in anthropology, social psychology, and telecommunications research, to the general multi-party setting. Experimental results indicate that the proposed computational theory benefits the detection and participant-attribution of speech activity. Furthermore, the theory is shown to enable the inference of illocutionary intent, emotional state, and social status, independently of linguistic analysis. Taken together, for conversations of arbitrary duration, participant number, and text-dependent attributes, the results demonstrate a degree of characterization and understanding of the nature of conversational interaction that has not been shown previously.

Laskowski, K., Edlund, J., & Heldner, M. (2011). A single-port non-parametric model of turn-taking in multi-party conversation. In Proc. of ICASSP 2011 (pp. 5600-5603). Prague, Czech Republic. [pdf]
Abstract: The taking of turns to speak is an intrinsic property of conversation. It is therefore expected that models of turn-taking, providing a prior distribution over conversational form, can usefully reduce the perplexity of what is observed and processed in real-time spoken dialogue systems. We propose a conversation-independent single-port model of multi-party turn-taking, one which allows conversants to undertake independent actions but to condition them on the past behavior of all participants.
The model is shown to generally outperform an existing multi-port model on a measure of perplexity over subsequently observed speech activity. We quantify the effect of history truncation and the success of predicting distant conversational futures, and argue that the framework is sufficiently accessible and has significant potential to usefully inform the design and behavior of spoken dialogue systems.

Laskowski, K., Edlund, J., & Heldner, M. (2011). Learning and forgetting in incremental stochastic turn-taking models. In Proc. of Interspeech 2011. Florence, Italy.
Abstract: We present a computational framework for stochastically modeling dyad interaction chronograms. The framework's most novel feature is the capacity for incremental learning and forgetting. To showcase its flexibility, we design experiments answering four concrete questions about the systematics of spoken interaction. The results show that: (1) individuals are clearly affected by one another; (2) there is individual variation in interaction strategy; (3) strategies wander in time rather than converge; and (4) individuals exhibit similarity with their interlocutors. We expect the proposed framework to be capable of answering many such questions with little additional effort.

Laskowski, K., & Jin, Q. (2011). Harmonic structure transform for speaker recognition. In Proc. of Interspeech 2011. Florence, Italy.
Abstract: We evaluate a new filterbank structure, yielding the harmonic structure cepstral coefficients (HSCCs), on a mismatched-session closed-set speaker classification task. The novelty of the filterbank lies in its averaging of energy at frequencies related by harmonicity rather than by adjacency. Improvements are presented which achieve a 37%rel reduction in error rate under these conditions.
The improved features are combined with a similar Mel-frequency cepstral coefficient (MFCC) system to yield error rate reductions of 32%rel, suggesting that HSCCs offer information which is complementary to that available to today’s MFCC-based systems.

Laukka, P., Neiberg, D., Forsell, M., Karlsson, I., & Elenius, K. (2011). Expression of Affect in Spontaneous Speech: Acoustic Correlates and Automatic Detection of Irritation and Resignation. Computer Speech and Language, 25(1), 84-104. [link]

Lindblom, B., & MacNeilage, P. (2011). Coarticulation: A universal phonetic phenomenon with roots in deep time. TMH-QPSR, 51(1), 41-44. [pdf]

Lindblom, B., Sundberg, J., Branderud, P., & Djamshidpey, H. (2011). Articulatory modeling and front cavity acoustics. TMH-QPSR, 51(1), 17-20. [pdf]

Lindblom, B., Diehl, R., Park, S-H., & Salvi, G. (2011). Sound systems are shaped by their users: The recombination of phonetic substance. In Clements, G. N., & Ridouane, R. (Eds.), Where Do Phonological Features Come From?: Cognitive, physical and developmental bases of distinctive speech categories. CNRS & Sorbonne-Nouvelle. [pdf]
Abstract: Computational experiments were run using an optimization criterion based on independently motivated definitions of perceptual contrast, articulatory cost and learning cost. The question: If stop+vowel inventories are seen as adaptations to perceptual, articulatory and developmental constraints what would they be like? Simulations successfully predicted typologically widely observed place preferences and the re-use of place features (‘phonemic coding’) in voiced stop inventories. These results demonstrate the feasibility of user-based accounts of phonological facts and indicate the nature of the constraints that over time might shape the formation of both the formal structure and the intrinsic content of sound patterns.
While phonetic factors are commonly invoked to account for substantive aspects of phonology, their explanatory scope is here also extended to a fundamental attribute of its formal organization: the combinatorial re-use of phonetic content. Keywords: phonological universals; phonetic systems; formal structure; intrinsic content; behavioral origins; substance-based explanation

Lundholm Fors, K. (2011). An investigation of intra-turn pauses in spontaneous speech. TMH-QPSR, 51(1), 65-68. [pdf]

Neiberg, D. (2011). Visualizing prosodic densities and contours: Forming one from many. TMH-QPSR, 51(1), 57-60. [abstract] [pdf]
Abstract: This paper summarizes a flora of explorative visualization techniques for prosody developed at KTH. It is demonstrated how analysis can be made which goes beyond conventional methodology. Examples are given for turn taking, affective speech, response tokens and Swedish accent II.

Neiberg, D., Ananthakrishnan, G., & Gustafson, J. (2011). Tracking pitch contours using minimum jerk trajectories. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. [abstract] [pdf]
Abstract: This paper proposes a fundamental frequency tracker, with the specific purpose of comparing the automatic estimates with pitch contours that are sketched by trained phoneticians. The method uses a frequency domain approach to estimate pitch tracks that form minimum jerk trajectories. This method tries to mimic motor movements of the hand made while sketching. When the fundamental frequencies tracked by the proposed method on the oral and laryngograph signals were compared using the MOCHA-TIMIT database, the correlation was 0.98 and the root mean squared error was 4.0 Hz, which was slightly better than a state-of-the-art pitch tracking algorithm included in the ESPS.
We also demonstrate how the proposed algorithm could be applied when comparing with sketches made by phoneticians for the variations in accent II among the Swedish dialects.

Neiberg, D., & Gustafson, J. (2011). A Dual Channel Coupled Decoder for Fillers and Feedback. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. [abstract] [pdf]
Abstract: This study presents a dual channel decoder capable of modeling cross-speaker dependencies for segmentation and classification of fillers and feedbacks in conversational speech found in the DEAL corpus. For the same number of Gaussians per state, we have shown improvement in terms of average F-score for the successive addition of 1) increased frame rate from 10 ms to 50 ms, 2) Joint Maximum Cross-Correlation (JMXC) features in a single channel decoder, 3) a joint transition matrix which captures dependencies symmetrically across the two channels, and 4) coupled acoustic model retraining symmetrically across the two channels. The final step gives a relative improvement of over 100% for fillers and feedbacks compared to our previous published results. The F-scores are in the range to make it possible to use the decoder as both a voice activity detector and an illocutionary act decoder for semi-automatic annotation.

Neiberg, D., & Gustafson, J. (2011). Predicting Speaker Changes and Listener Responses With And Without Eye-contact. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. [abstract] [pdf]
Abstract: This paper compares turn-taking in terms of timing and prediction in human-human conversations under the conditions when participants have eye-contact versus when there is no eye-contact, as found in the HCRC Map Task corpus. By measuring between-speaker intervals it was found that a larger proportion of speaker shifts occurred in overlap for the no eye-contact condition.
For prediction we used prosodic and spectral features parametrized by time-varying length-invariant discrete cosine coefficients. With Gaussian Mixture Modeling and variations of classifier fusion schemes, we explored the task of predicting whether there is an upcoming speaker change (SC) or not (HOLD), at the end of an utterance (EOU) with a pause lag of 200 ms. The label SC was further split into LRs (listener responses, e.g. back-channels) and other TURN-SHIFTs. The prediction was found to be somewhat easier for the eye-contact condition, for which the average recall rates were 60.57%, 66.35%, and 62.00% for TURN-SHIFTs, LR and SC respectively.

Neiberg, D., Laukka, P., & Elfenbein, H. A. (2011). Intra-, Inter-, and Cross-cultural Classification of Vocal Affect. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association (pp. 1581-1584). Florence, Italy. [abstract] [pdf]
Abstract: We present intra-, inter- and cross-cultural classifications of vocal expressions. Stimuli were selected from the VENEC corpus and consisted of portrayals of 11 emotions, each expressed with 3 levels of intensity. Classification (nu-SVM) was based on acoustic measures related to pitch, intensity, formants, voice source and duration. Results showed that mean recall across emotions was around 2.4-3 times higher than chance level for both intra- and inter-cultural conditions. For cross-cultural conditions, the relative performance dropped 26%, 32%, and 34% for high, medium, and low emotion intensity, respectively. This suggests that intra-cultural models were more sensitive to mismatched conditions for low emotion intensity. Preliminary results further indicated that recall rate varied as a function of emotion, with lust and sadness showing the smallest performance drops in the cross-cultural condition.

Neiberg, D., & Truong, K. (2011). Online Detection Of Vocal Listener Responses With Maximum Latency Constraints.
In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on (pp. 5836-5839). Prague, Czech Republic. [abstract] [pdf]
Abstract: When human listeners utter Listener Responses (e.g. back-channels or acknowledgments) such as ‘yeah’ and ‘mmhmm’, interlocutors commonly continue to speak or resume their speech even before the listener has finished his/her response. This type of speech interactivity results in frequent speech overlap which is common in human-human conversation. To allow for this type of speech interactivity to occur between humans and spoken dialog systems, which will result in more human-like continuous and smoother human-machine interaction, we propose an on-line classifier which can classify incoming speech as Listener Responses. We show that it is possible to detect vocal Listener Responses using maximum latency thresholds of 100-500 ms, thereby obtaining equal error rates ranging from 34% to 28% by using an energy based voice activity detector.

Reidsma, D., de Kok, I., Neiberg, D., Pammi, S., van Straalen, B., Truong, K., & van Welbergen, H. (2011). Continuous Interaction with a Virtual Human. Journal on Multimodal User Interfaces, 4(2), 97-118. [link]

Roll, M., Söderström, P., & Horne, M. (2011). Phonetic markedness, turning points, and anticipatory attention. TMH-QPSR, 51(1), 113-116. [pdf]

Salvi, G., & Al Moubayed, S. (2011). Spoken Language Identification using Frame Based Entropy Measures. TMH-QPSR, 51(1), 69-72.

Salvi, G., Montesano, L., Bernardino, A., & Santos-Victor, J. (2011). Language bootstrapping: Learning word meanings from perception-action association. IEEE Transactions on Systems, Man, and Cybernetics, Part B. [abstract] [pdf]
Abstract: We address the problem of bootstrapping language acquisition for an artificial system similarly to what is observed in experiments with human infants.
Our method works by associating meanings to words in manipulation tasks, as a robot interacts with objects and listens to verbal descriptions of the interactions. The model is based on an affordance network, i.e., a mapping between robot actions, robot perceptions and the perceived effects of these actions upon objects. We extend the affordance model to incorporate spoken words, which allows us to ground the verbal symbols to the execution of actions and the perception of the environment. The model takes verbal descriptions of a task as the input, and uses temporal co-occurrence to create links between speech utterances and the involved objects, actions and effects. We show that the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot’s own understanding of its actions. Thus, they can be directly used to instruct the robot to perform tasks and also allow context to be incorporated in the speech recognition task. We believe that the encouraging results with our approach may afford robots with a capacity to acquire language descriptors in their operation's environment as well as to shed some light as to how this challenging process develops with human infants.

Salvi, G., Tesser, F., Zovato, E., & Cosi, P. (2011). Analisi Gerarchica degli Inviluppi Spettrali Differenziali di una Voce Emotiva. In Contesto comunicativo e variabilità nella produzione e percezione della lingua (AISV). Lecce, Italy.

Schlangen, D., & Skantze, G. (2011). A General, Abstract Model of Incremental Dialogue Processing. Dialogue & Discourse, 2(1), 83-111. [pdf]

Schötz, S., & Eklund, R. (2011). A comparative acoustic analysis of purring in four cats. TMH-QPSR, 51(1), 9-12. [pdf]

Schötz, S., Frid, J., & Löfqvist, A. (2011). Exotic vowels in Swedish: a project description and an articulographic and acoustic pilot study of /i:/.
TMH-QPSR, 51(1), 25-28. [pdf]

Stenberg, M. (2011). Phonetic transcriptions as a public service. TMH-QPSR, 51(1), 45-48. [pdf]

Strömbergsson, S. (2011). Children’s perception of their modified speech – preliminary findings. TMH-QPSR, 51(1), 117-120. [abstract] [pdf]
Abstract: This report describes an ongoing investigation of 4-6 year-old children’s perception of synthetically modified versions of their own recorded speech. Recordings of the children’s speech production are automatically modified so that the initial consonant is replaced by a different consonant. The task for the children is to judge the phonological accuracy (correct vs. incorrect) and the speaker identity (me vs. someone else) of each stimulus. Preliminary results indicate that children with typical speech generally judge phonological accuracy correctly, of both original recordings and synthetically modified recordings. As a first evaluation of the re-synthesis technique with child users, the results are promising, as the children generally accept the intended phonological form, seemingly without detecting the modification.

Strömbergsson, S. (2011). Children’s recognition of their own voice: influence of phonological impairment. In Proceedings of Interspeech 2011 (pp. 2205-2208). Florence, Italy. [abstract] [pdf]
Abstract: This study explores the ability to identify the recorded voice as one’s own, in three groups of children: one group of children with phonological impairment (PI) and two groups of children with typical speech and language development; 4-5 year-olds and 7-8 year-olds. High average performance rates in all three groups suggest that these children indeed recognize their recorded voice as their own, with no significant difference between the groups. Signs indicating that children with deviant speech use their speech deviance as a cue to identifying their own voice are discussed.

Strömbergsson, S. (2011). Corrective re-synthesis of deviant speech using unit selection.
In Sandford Pedersen, B., Nespore, G., & Skadina, I. (Eds.), NODALIDA 2011 Conference Proceedings (pp. 214-217). Riga, Latvia. [abstract] [pdf]
Abstract: This report describes a novel approach to modified re-synthesis, by concatenation of speech from different speakers. The system removes an initial voiceless plosive from one utterance, recorded from a child, and replaces it with another voiceless plosive selected from a database of recordings of other child speakers. Preliminary results from a listener evaluation are reported.

Strömbergsson, S. (2011). Segmental re-synthesis of child speech using unit selection. In Proceedings of the ICPhS XVII (pp. 1910-1913). Hong Kong. [abstract] [pdf]
Abstract: This report describes a novel approach to segmental re-synthesis of child speech, by concatenation of speech from different speakers. The re-synthesis builds upon standard methods of unit selection, but instead of using speech from only one speaker, target segments are selected from a speech database of many child speakers. Results from a listener evaluation suggest that the method can be used to generate intelligible speech that is difficult to distinguish from original recordings.

Suomi, K., Meister, E., & Ylitalo, R. (2011). Non-contrastive durational patterns in two quantity languages. TMH-QPSR, 51(1), 61-64.

Uneson, M., & Schachtenhaufen, R. (2011). Exploring phonetic realization in Danish by Transformation-Based Learning. TMH-QPSR, 51(1), 73-76. [pdf]

Wik, P. (2011). The Virtual Language Teacher: Models and applications for language learning using embodied conversational agents. Doctoral dissertation, KTH School of Computer Science and Communication. [abstract] [pdf]
Abstract: This thesis presents a framework for computer assisted language learning using a virtual language teacher.
It is an attempt at creating, not only a new type of language learning software, but also a server-based application that collects large amounts of speech material for future research purposes. The motivation for the framework is to create a research platform for computer assisted language learning, and computer assisted pronunciation training. Within the thesis, different feedback strategies and pronunciation error detectors are explored. This is a broad, interdisciplinary approach, combining research from a number of scientific disciplines, such as speech-technology, game studies, cognitive science, phonetics, phonology, and second-language acquisition and teaching methodologies. The thesis discusses the paradigm both from a top-down point of view, where a number of functionally separate but interacting units are presented as part of a proposed architecture, and bottom-up by demonstrating and testing an implementation of the framework.

Wik, P., Husby, O., Øvregaard, Å., Bech, Ø., Albertsen, E., Nefzaoui, S., Skarpnes, E., & Koreman, J. (2011). Contrastive analysis through L1-L2map. TMH-QPSR, 51(1), 49-52. [pdf]

Zetterholm, E., & Tronnier, M. (2011). Teaching pronunciation in Swedish as a second language. TMH-QPSR, 51(1), 85-88. [pdf]

Öhrström, N., Arppe, H., Eklund, L., Eriksson, S., Marcus, D., Mathiassen, T., & Pettersson, L. (2011). Audiovisual integration in binaural, monaural and dichotic listening. TMH-QPSR, 51(1), 29-32. [pdf]

2010

Al Moubayed, S., & Ananthakrishnan, G. (2010). Acoustic-to-Articulatory Inversion based on Local Regression. In Proc. Interspeech. Makuhari, Japan. [abstract] [pdf]
Abstract: This paper presents an Acoustic-to-Articulatory inversion method based on local regression. Two types of local regression, a non-parametric and a local linear regression, have been applied on a corpus containing simultaneous recordings of positions of articulators and the corresponding acoustics.
A maximum likelihood trajectory smoothing using the estimated dynamics of the articulators is also applied on the regression estimates. The average root mean square error in estimating articulatory positions, given the acoustics, is 1.56 mm for the nonparametric regression and 1.52 mm for the local linear regression. The local linear regression is found to perform significantly better than regression using Gaussian Mixture Models using the same acoustic and articulatory features.

Al Moubayed, S., Ananthakrishnan, G., & Enflo, L. (2010). Automatic Prominence Classification in Swedish. In Proceedings of Speech Prosody 2010, Workshop on Prosodic Prominence. Chicago, USA. [abstract] [pdf]
Abstract: This study aims at automatically classifying levels of acoustic prominence on a dataset of 200 Swedish sentences of read speech by one male native speaker. Each word in the sentences was categorized by four speech experts into one of three groups depending on the level of prominence perceived. Six acoustic features at a syllable level and seven features at a word level were used. Two machine learning algorithms, namely Support Vector Machines (SVM) and Memory Based Learning (MBL), were trained to classify the sentences into their respective classes. The MBL gave an average word level accuracy of 69.08% and the SVM gave an average accuracy of 65.17% on the test set. These values were comparable with the average accuracy of the human annotators with respect to the average annotations. In this study, word duration was found to be the most important feature required for classifying prominence in Swedish read speech.

Al Moubayed, S., & Beskow, J. (2010). Prominence Detection in Swedish Using Syllable Correlates. In Interspeech'10. Makuhari, Japan. [pdf]

Al Moubayed, S., & Beskow, J. (2010). Perception of Nonverbal Gestures of Prominence in Visual Speech Animation. In FAA, The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation. Edinburgh, UK.
[pdf]

Al Moubayed, S., Beskow, J., & Granström, B. (2010). Auditory-Visual Prominence: From Intelligibility to Behavior. Journal on Multimodal User Interfaces, 3(4), 299-311. [abstract] [pdf]
Abstract: Auditory prominence is defined as when an acoustic segment is made salient in its context. Prominence is one of the prosodic functions that has been shown to be strongly correlated with facial movements. In this work, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted in which speech quality is acoustically degraded and the fundamental frequency is removed from the signal; the speech is then presented to 12 subjects through a lip synchronized talking head carrying head-nod and eyebrow raise gestures, which are synchronized with the auditory prominence. The experiment shows that presenting prominence as facial gestures significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a follow-up study examining the perception of the behavior of the talking heads when gestures are added over pitch accents. Using eye-gaze tracking technology and questionnaires on 10 moderately hearing impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch accents, as opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and the understanding of the talking head.

Al Moubayed, S., Beskow, J., Granström, B., & House, D. (2010). Audio-Visual Prosody: Perception, Detection, and Synthesis of Prominence. In Esposito, A., et al. (Eds.), Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues (pp. 55-71). Springer.
[abstract] [pdf]
Abstract: In this chapter, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study a speech intelligibility experiment is conducted, where speech quality is acoustically degraded, then the speech is presented to 12 subjects through a lip synchronized talking head carrying head-nods and eyebrow raising gestures. The experiment shows that perceiving visual prominence as gestures, synchronized with the auditory prominence, significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a study examining the perception of the behavior of the talking heads when gestures are added at pitch movements. Using eye-gaze tracking technology and questionnaires for 10 moderately hearing impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch movements, as opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and helpfulness of the talking head.

Ananthakrishnan, G., Badin, P., Vargas, J. A. V., & Engwall, O. (2010). Predicting Unseen Articulations from Multi-speaker Articulatory Models. In Proc. Interspeech. Makuhari, Japan. [abstract] [pdf]
Abstract: In order to study inter-speaker variability, this work aims to assess the generalization capabilities of data-based multi-speaker articulatory models. We use various three-mode factor analysis techniques to model the variations of midsagittal vocal tract contours obtained from MRI images for three French speakers articulating 73 vowels and consonants. Articulations of a given speaker for phonemes not present in the training set are then predicted by inversion of the models from measurements of these phonemes articulated by the other subjects.
On average, the prediction RMSE was 5.25 mm for tongue contours, and 3.3 mm for 2D midsagittal vocal tract distances. Besides, this study has established a methodology to determine the optimal number of factors for such models.

Beskow, J., & Al Moubayed, S. (2010). Perception of Gaze Direction in 2D and 3D Facial Projections. In The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation. Edinburgh, UK. [pdf]

Beskow, J., Edlund, J., Granström, B., Gustafson, J., & House, D. (2010). Face-to-face interaction and the KTH Cooking Show. In Esposito, A., Campbell, N., Vogel, C., Hussain, A., & Nijholt, A. (Eds.), Development of Multimodal Interfaces: Active Listening and Synchrony (pp. 157-168). Berlin / Heidelberg: Springer.

Beskow, J., Edlund, J., Gustafson, J., Heldner, M., Hjalmarsson, A., & House, D. (2010). Modelling humanlike conversational behaviour. In Proceedings of SLTC 2010. Linköping, Sweden. [pdf]

Beskow, J., Edlund, J., Gustafson, J., Heldner, M., Hjalmarsson, A., & House, D. (2010). Research focus: Interactional aspects of spoken face-to-face communication. In Proc. of Fonetik 2010 (pp. 7-10). Lund, Sweden. [abstract] [pdf]
Abstract: We have a visionary goal: to learn enough about human face-to-face interaction that we are able to create an artificial conversational partner that is human-like. We take the opportunity here to present four new projects inaugurated in 2010, each adding pieces of the puzzle through a shared research focus: interactional aspects of spoken face-to-face communication.

Beskow, J., & Granström, B. (2010). Goda utsikter för teckenspråksteknologi. In Domeij, R., Breivik, T., Halskov, J., Kirchmeier-Andersen, S., Langgård, P., & Moshagen, S. (Eds.), Språkteknologi för ökad tillgänglighet - Rapport från ett nordiskt seminarium (pp. 77-86). [pdf]

Beskow, J., & Granström, B. (2010). Teckenspråkteknologi - sammanfattning av förstudie. Technical Report, KTH Centrum för Talteknologi.
[pdf]

Branigan, H., Pickering, M., Pearson, J., & McLean, J. (2010). Linguistic alignment between people and computers. Journal of Pragmatics.

Edlund, J., Gustafson, J., & Beskow, J. (2010). Cocktail – a demonstration of massively multi-component audio environments for illustration and analysis. In Proc. of SLTC 2010. Linköping, Sweden. [abstract] [pdf]
Abstract: We present MMAE – Massively Multi-component Audio Environments – a new concept in auditory presentation, and Cocktail – a demonstrator built on this technology. MMAE creates a dynamic audio environment by playing a large number of sound clips simultaneously at different locations in a virtual 3D space. The technique utilizes standard soundboards and is based on the Snack Sound Toolkit. The result is an efficient 3D audio environment that can be modified dynamically, in real time. Applications range from the creation of canned as well as online audio environments for games and entertainment to the browsing, analyzing and comparing of large quantities of audio data. We also demonstrate the Cocktail implementation of MMAE using several test cases as examples.

Edlund, J., & Beskow, J. (2010). Capturing massively multimodal dialogues: affordable synchronization and visualization. In Kipp, M., Martin, J-C., Paggio, P., & Heylen, D. (Eds.), Proc. of Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (MMC 2010) (pp. 160-161). Valletta, Malta. [abstract] [pdf]
Abstract: In this demo, we show (a) affordable and relatively easy-to-implement means to facilitate synchronization of audio, video and motion capture data in post processing, and (b) a flexible tool for 3D visualization of recorded motion capture data aligned with audio and video sequences. The synchronisation is made possible by the use of two simple analogue devices: a turntable and an easy to build electronic clapper board.
The demo shows examples of how the signals from the turntable and the clapper board are traced over the three modalities, using the 3D visualisation tool. We also demonstrate how the visualisation tool shows head and torso movements captured by the motion capture system.

Edlund, J., Beskow, J., Elenius, K., Hellmer, K., Strömbergsson, S., & House, D. (2010). Spontal: a Swedish spontaneous dialogue corpus of audio, video and motion capture. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proc. of the Seventh conference on International Language Resources and Evaluation (LREC'10) (pp. 2992-2995). Valletta, Malta. [abstract] [pdf]
Abstract: We present the Spontal database of spontaneous Swedish dialogues. 120 dialogues of at least 30 minutes each have been captured in high-quality audio, high-resolution video and with a motion capture system. The corpus is currently being processed and annotated, and will be made available for research at the end of the project.

Edlund, J., & Gustafson, J. (2010). Ask the experts - Part II: Analysis. In Juel Henrichsen, P. (Ed.), Linguistic Theory and Raw Sound (pp. 183-198). Samfundslitteratur. [abstract] [pdf]
Abstract: We present work fuelled by an urge to understand speech in its original and most fundamental context: in conversation between people. And what better way than to look to the experts? Regarding human conversation, authority lies with the speakers themselves, and asking the experts is a matter of observing and analyzing what speakers do. This is the second part of a two-part discussion which is illustrated with examples mainly from the work at KTH Speech, Music and Hearing. In this part, we discuss methods of extracting useful information from captured data, with a special focus on raw sound.

Edlund, J., Heldner, M., Al Moubayed, S., Gravano, A., & Hirschberg, J. (2010). Very short utterances in conversation. In Proc. of Fonetik 2010 (pp. 11-16).
Lund, Sweden. [abstract] [pdf]
Abstract: Faced with the difficulties of finding an operationalized definition of backchannels, we have previously proposed an intermediate, auxiliary unit – the very short utterance (VSU) – which is defined operationally and is automatically extractable from recorded or ongoing dialogues. Here, we extend that work in the following ways: (1) we test the extent to which the VSU/NONVSU distinction corresponds to backchannels/non-backchannels in a different data set that is manually annotated for backchannels – the Columbia Games Corpus; (2) we examine the extent to which VSUs capture other short utterances with a vocabulary similar to backchannels; (3) we propose a VSU method for better managing turn-taking and barge-ins in spoken dialogue systems based on detection of backchannels; and (4) we attempt to detect backchannels with better precision by training a backchannel classifier using durations and inter-speaker relative loudness differences as features. The results show that VSUs indeed capture a large proportion of backchannels – large enough that VSUs can be used to improve spoken dialogue system turn-taking; and that building a reliable backchannel classifier working in real time is feasible.

Elenius, D., & Blomberg, M. (2010). Dynamic vocal tract length normalization in speech recognition. In Working Papers 54, Proceedings from Fonetik 2010 (pp. 29-34). Centre for Languages and Literature, Lund University, Sweden. [pdf]

Engwall, O. (2010). Is there a McGurk effect for tongue reading? In Proceedings of AVSP. Hakone, Japan. [pdf]

Gustafson, J., & Neiberg, D. (2010). Directing conversation using the prosody of mm and mhm. In Proceedings of SLTC 2010. Linköping, Sweden.

Gustafson, J., & Neiberg, D. (2010). Prosodic cues to engagement in non-lexical response tokens in Swedish. In Proceedings of DiSS-LPSS Joint Workshop 2010. Tokyo, Japan. [pdf]

Gustafson, J., & Edlund, J. (2010). Ask the experts - Part I: Elicitation.
In Juel Henrichsen, P. (Ed.), Linguistic Theory and Raw Sound (pp. 169-182). Samfundslitteratur. [abstract] [pdf]
Abstract: We present work fuelled by an urge to understand speech in its original and most fundamental context: in conversation between people. And what better way than to look to the experts? Regarding human conversation, authority lies with the speakers themselves, and asking the experts is a matter of observing and analyzing what speakers do. This is the first part of a two-part discussion which is illustrated with examples mainly from the work at KTH Speech, Music and Hearing. In this part, we discuss methods of capturing what people do when they talk to each other.

Heldner, M., Edlund, J., & Hirschberg, J. (2010). Pitch similarity in the vicinity of backchannels. In Proc. of Interspeech 2010 (pp. 3054-3057). Makuhari, Japan. [pdf]

Heldner, M., & Edlund, J. (2010). Pauses, gaps and overlaps in conversations. Journal of Phonetics, 38, 555-568. [abstract] [pdf]
Abstract: This paper explores durational aspects of pauses, gaps and overlaps in three different conversational corpora with a view to challenge claims about precision timing in turn-taking. Distributions of pause, gap and overlap durations in conversations are presented, and methodological issues regarding the statistical treatment of such distributions are discussed. The results are related to published minimal response times for spoken utterances and thresholds for detection of acoustic silences in speech. It is shown that turn-taking is generally less precise than is often claimed by researchers in the field of conversation analysis or interactional linguistics. These results are discussed in the light of their implications for models of timing in turn-taking, and for interaction control models in speech technology.
In particular, it is argued that the proportion of speaker changes that could potentially be triggered by information immediately preceding the speaker change is large enough for reactive interaction control models to be viable in speech technology.

Hirschberg, J., Hjalmarsson, A., & Elhadad, N. (2010). "You're as Sick as You Sound": Using Computational Approaches for Modeling Speaker State to Gauge Illness and Recovery. In Neustein, A. (Ed.), Mobile Environments, Call Centers and Clinics (pp. 305-322). Springer. [abstract]
Abstract: Recently, researchers in computer science and engineering have begun to explore the possibility of finding speech-based correlates of various medical conditions using automatic, computational methods. If such language cues can be identified and quantified automatically, this information can be used to support diagnosis and treatment of medical conditions in clinical settings and to further fundamental research in understanding cognition. This chapter reviews computational approaches that explore communicative patterns of patients who suffer from medical conditions such as depression, autism spectrum disorders, schizophrenia, and cancer. There are two main approaches discussed: research that explores features extracted from the acoustic signal and research that focuses on lexical and semantic features. We also present some applied research that uses computational methods to develop assistive technologies. In the final sections we discuss issues related to and the future of this emerging field of research.

Hjalmarsson, A. (2010). Human interaction as a model for spoken dialogue system behaviour. Doctoral dissertation. [abstract] [pdf]
Abstract: This thesis is a step towards the long-term and high-reaching objective of building dialogue systems whose behaviour is similar to a human dialogue partner.
The aim is not to build a machine with the same conversational skills as a human being, but rather to build a machine that is human enough to encourage users to interact with it accordingly. The behaviours in focus are cue phrases, hesitations and turn-taking cues. These behaviours serve several important communicative functions such as providing feedback and managing turn-taking. Thus, if dialogue systems could use interactional cues similar to those of humans, these systems could be more intuitive to talk to. A major part of this work has been to collect, identify and analyze the target behaviours in human-human interaction in order to gain a better understanding of these phenomena. Another part has been to reproduce these behaviours in a dialogue system context and explore listeners’ perceptions of these phenomena in empirical experiments. The thesis is divided into two parts. The first part serves as an overall background. The issues and motivations of humanlike dialogue systems are discussed. This part also includes an overview of research on human language production and spoken language generation in dialogue systems. The next part presents the data collections, data analyses and empirical experiments that this thesis is concerned with. The first study presented is a listening test that explores human behaviour as a model for dialogue systems. The results show that a version based on human behaviour is rated as more humanlike, polite and intelligent than a constrained version with less variability. Next, the DEAL dialogue system is introduced. DEAL is used as a platform for the research presented in this thesis. The domain of the system is a trade domain and the target audience are second language learners of Swedish who want to practice conversation. Furthermore, a data collection of human-human dialogues in the DEAL domain is presented. Analyses of cue phrases in these data are provided as well as an experimental study of turn-taking cues. 
The results from the turn-taking experiment indicate that turn-taking cues realized with a diphone synthesis affect the expectations of a turn change similar to the corresponding human version. Finally, an experimental study that explores the use of talkspurt-initial cue phrases in an incremental version of DEAL is presented. The results show that the incremental version had shorter response times and was rated as more efficient, more polite and better at indicating when to speak than a non-incremental implementation of the same system.Hjalmarsson, A. (2010). The vocal intensity of turn-initial cue phrases and filled pauses in dialogue. In Proceedings of SIGdial (pp. 225-228). Tokyo, Japan. [abstract] [pdf]Abstract: The present study explores the vocal intensity of turn-initial cue phrases in a corpus of dialogues in Swedish. Cue phrases convey relatively little propositional content, but have several important pragmatic functions. The majority of these entities are frequently occurring monosyllabic words such as “eh”, “mm”, “ja”. Prosodic analysis shows that these words are produced with higher intensity than other turn-initial words are. In light of these results, it is suggested that speakers produce these expressions with high intensity in order to claim the floor. It is further shown that the difference in intensity can be measured as a dynamic inter-speaker relation over the course of a dialogue using the end of the interlocutor’s previous turn as a reference point.Horne, M., House, D., Svantesson, J-O., & Touati, P. (2010). Gösta Bruce 1947-2010 In Memoriam. Phonetica, 67(4), 268-270.Johnson-Roberson, M., Bohg, J., Kragic, D., Skantze, G., Gustafson, J., & Carlson, R. (2010). Enhanced Visual Scene Understanding through Human-Robot Dialog. In Proceedings of AAAI 2010 Fall Symposium: Dialog with Robots. Arlington, VA. [pdf]Karlsson, A., House, D., Svantesson, J-O., & Tayanin, D. (2010). Influence of lexical tones on intonation in Kammu. 
In Interspeech 2010 (pp. 1740-1743). Makuhari, Japan. [abstract] [pdf]Abstract: The aim of this study is to investigate how the presence of lexical tones influences the realization of focal accent and sentence intonation. The language studied is Kammu, a language particularly well suited for the study as it has both tonal and non-tonal dialects. The main finding is that lexical tone exerts an influence on both sentence and focal accent in the tonal dialect to such a strong degree that we can postulate a hierarchy where lexical tone is strongest followed by sentence accent, with focal accent exerting the weakest influence on the F0 contour.Laskowski, K., & Edlund, J. (2010). A Snack implementation and Tcl/Tk interface to the fundamental frequency variation spectrum algorithm. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proc. of the Seventh conference on International Language Resources and Evaluation (LREC'10) (pp. 3742-3749). Valletta, Malta. [abstract] [pdf]Abstract: Intonation is an important aspect of vocal production, used for a variety of communicative needs. Its modeling is therefore crucial in many speech understanding systems, particularly those requiring inference of speaker intent in real-time. However, the estimation of pitch, traditionally the first step in intonation modeling, is computationally inconvenient in such scenarios. This is because it is often, and most optimally, achieved only after speech segmentation and recognition. A consequence is that earlier speech processing components, in today’s state-of-the-art systems, lack intonation awareness by fiat; it is not known to what extent this circumscribes their performance. 
In the current work, we present a freely available implementation of an alternative to pitch estimation, namely the computation of the fundamental frequency variation (FFV) spectrum, which can be easily employed at any level within a speech processing system. It is our hope that the implementation we describe aid in the understanding of this novel acoustic feature space, and that it facilitate its inclusion, as desired, in the front-end routines of speech recognition, dialog act recognition, and speaker recognition systems.Laskowski, K., Heldner, M., & Edlund, J. (2010). Preliminaries to an account of multi-party conversational turn-taking as an antiferromagnetic spin glass. In Proc. of NIPS Workshop on Modeling Human Communication Dynamics. Vancouver, B.C., Canada. [abstract] [pdf]Abstract: We present empirical justification of why logistic regression may acceptably approximate, using the number of currently vocalizing interlocutors, the probabilities returned by a time-invariant, conditionally independent model of turn-taking. The resulting parametric model with 3 degrees of freedom is shown to be identical to an infinite-range Ising antiferromagnet, with slow connections, in an external field; it is suitable for undifferentiated-participant scenarios. In differentiated-participant scenarios, untying parameters results in an infinite-range spin glass whose degrees of freedom scale as the square of the number of participants; it offers an efficient representation of participant-pair synchrony. We discuss the implications of model parametrization and of the thermodynamic and feed-forward perceptron formalisms for easily quantifying aspects of conversational dynamics.Neiberg, D., & Gustafson, J. (2010). Modeling Conversational Interaction Using Coupled Markov Chains. In Proceedings of DiSS-LPSS Joint Workshop 2010. Tokyo, Japan. [pdf]Neiberg, D., & Gustafson, J. (2010). Prosodic Characterization and Automatic Classification of Conversational Grunts in Swedish. 
In Fonetik 2010. Lund.Neiberg, D., & Gustafson, J. (2010). The Prosody of Swedish Conversational Grunts. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association (pp. 2562-2565). Makuhari, Chiba, Japan. [pdf]Neiberg, D., Laukka, P., & Ananthakrishnan, G. (2010). Classification of Affective Speech using Normalized Time-Frequency Cepstra. In Fifth International Conference on Speech Prosody (Speech Prosody 2010). Chicago, Illinois, U.S.A. [abstract] [pdf]Abstract: Subtle temporal and spectral differences between categorical realizations of para-linguistic phenomena (e.g. affective vocal expressions) are hard to capture and describe. In this paper we present a signal representation based on Time Varying Constant-Q Cepstral Coefficients (TVCQCC) derived for this purpose. A method which utilizes the special properties of the constant Q-transform for mean F0 estimation and normalization is described. The coefficients are invariant to utterance length, and as a special case, a representation for prosody is considered. Speaker independent classification results using nu-SVM on the Berlin EMO-DB and two closed sets of basic (anger, disgust, fear, happiness, sadness, neutral) and social/interpersonal (affection, pride, shame) emotions recorded by forty professional actors from two English dialect areas are reported. The accuracy for the Berlin EMO-DB is 71.2%; the accuracy for the first set, including basic emotions, was 44.6%, and for the second set, including basic and social emotions, it was 31.7%. It was found that F0 normalization boosts the performance and a combined feature set shows the best performance.Neiberg, D., & Truong, K. P. (2010). A Maximum Latency Classifier for Listener Responses. In Proceedings of SLTC 2010. Linköping, Sweden. 
[abstract]Abstract: When Listener Responses such as “yeah”, “right” or “mhm” are uttered in a face-to-face conversation, it is not uncommon for the interlocutor to continue to speak in overlap, i.e. before the Listener becomes silent. We propose a classifier which can classify incoming speech as a Listener Response or not before the talk-spurt ends. The classifier is implemented as an upgrade of the Embodied Conversational Agent developed in the SEMAINE project during the eNTERFACE 2010 workshop.Oertel, C., Cummins, F., Campbell, N., Edlund, J., & Wagner, P. (2010). D64: A corpus of richly recorded conversational interaction. In Kipp, M., Martin, J-C., Paggio, P., & Heylen, D. (Eds.), Proceedings of LREC 2010 Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (pp. 27-30). Valletta, Malta. [abstract] [pdf]Abstract: Rich non-intrusive recording of a naturalistic conversation was conducted in a domestic setting. Four (sometimes five) participants engaged in lively conversation over two 4-hour sessions on two successive days. Conversation was not directed, and ranged widely over topics both trivial and technical. The entire conversation, on both days, was richly recorded using 7 video cameras, 10 audio microphones, and the registration of 3-D head, torso and arm motion using an Optitrack system. To add liveliness to the conversation, several bottles of wine were consumed during the final two hours of recording. The resulting corpus will be of immediate interest to all researchers interested in studying naturalistic, ethologically situated, conversational interaction.Picard, S. (2010). A framework for automatic error detection of Swedish vowels based on audiovisual data. Master's thesis, KTH. [pdf]Picard, S., Ananthakrishnan, G., Wik, P., Engwall, O., & Abdou, S. (2010). Detection of Specific Mispronunciations using Audiovisual Features. In International Conference on Auditory-Visual Speech Processing. Kanagawa, Japan. 
[abstract] [pdf]Abstract: This paper introduces a general approach for binary classification of audiovisual data. The intended application is mispronunciation detection for specific phonemic errors, using very sparse training data. The system uses a Support Vector Machine (SVM) classifier with features obtained from a Time Varying Discrete Cosine Transform (TV-DCT) on the audio log-spectrum as well as on the image sequences. The concatenated feature vectors from both the modalities were reduced to a very small subset using a combination of feature selection methods. We achieved 95-100% correct classification for each pair-wise classifier on a database of Swedish vowels with an average of 58 instances per vowel for training. The performance was largely unaffected when tested on data from a speaker who was not included in the training.Salvi, G., Tesser, F., Zovato, E., & Cosi, P. (2010). Cluster Analysis of Differential Spectral Envelopes on Emotional Speech. In Proceedings of Interspeech (pp. 322-325). Makuhari, Japan. [abstract] [pdf]Abstract: This paper reports on the analysis of the spectral variation of emotional speech. Spectral envelopes of time aligned speech frames are compared between emotionally neutral and active utterances. Statistics are computed over the resulting differential spectral envelopes for each phoneme. Finally, these statistics are classified using agglomerative hierarchical clustering and a measure of dissimilarity between statistical distributions, and the resulting clusters are analysed. The results show that there are systematic changes in spectral envelopes when going from neutral to sad or happy speech, and those changes depend on the valence of the emotional content (negative, positive) as well as on the phonetic properties of the sounds such as voicing and place of articulation.Schlangen, D., Baumann, T., Buschmeier, H., Buss, O., Kopp, S., Skantze, G., & Yaghoubzadeh, R. (2010). 
Middleware for Incremental Processing in Conversational Agents. In Proceedings of SigDial. Tokyo, Japan. [pdf]Schötz, S., Beskow, J., Bruce, G., Granström, B., & Gustafson, J. (2010). Simulating Intonation in Regional Varieties of Swedish. In Speech Prosody 2010. Chicago, USA. [pdf]Schötz, S., Beskow, J., Bruce, G., Granström, B., Gustafson, J., & Segerup, M. (2010). Simulating Intonation in Regional Varieties of Swedish. In Fonetik 2010. Lund, Sweden. [pdf]Sikveland, R-O., Öttl, A., Amdal, I., Ernestus, M., Svendsen, T., & Edlund, J. (2010). Spontal-N: A Corpus of Interactional Spoken Norwegian. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proc. of the Seventh conference on International Language Resources and Evaluation (LREC'10) (pp. 2986-2991). Valletta, Malta. [abstract] [pdf]Abstract: Spontal-N is a corpus of spontaneous, interactional Norwegian. To our knowledge, it is the first corpus of Norwegian in which the majority of speakers have spent significant parts of their lives in Sweden, and in which the recorded speech displays varying degrees of interference from Swedish. The corpus consists of studio quality audio- and video-recordings of four 30-minute free conversations between acquaintances, and a manual orthographic transcription of the entire material. On the basis of the orthographic transcriptions, we automatically annotated approximately 50 percent of the material on the phoneme level, by means of a forced alignment between the acoustic signal and pronunciations listed in a dictionary. Approximately seven percent of the automatic transcription was manually corrected. Taking the manual correction as a gold standard, we evaluated several sources of pronunciation variants for the automatic transcription. Spontal-N is intended as a general purpose speech resource that is also suitable for investigating phonetic detail.Skantze, G. (2010). 
Jindigo: a Java-based Framework for Incremental Dialogue Systems. Technical Report, KTH, Stockholm, Sweden. [pdf]Skantze, G., & Hjalmarsson, A. (2010). Towards Incremental Speech Generation in Dialogue Systems. In Proceedings of SIGdial (pp. 1-8). Tokyo, Japan. [abstract] [pdf]Abstract: We present a first step towards a model of speech generation for incremental dialogue systems. The model allows a dialogue system to incrementally interpret spoken input, while simultaneously planning, realising and self-monitoring the system response. The model has been implemented in a general dialogue system framework. Using this framework, we have implemented a specific application and tested it in a Wizard-of-Oz setting, comparing it with a non-incremental version of the same system. The results show that the incremental version, while producing longer utterances, has a shorter response time and is perceived as more efficient by the users.Wik, P., & Granström, B. (2010). Simicry - A mimicry-feedback loop for second language learning. In Proceedings of Second Language Studies: Acquisition, Learning, Education and Technology. Waseda University, Tokyo, Japan. [abstract] [pdf]Abstract: This paper introduces the concept of Simicry, defined as similarity of mimicry, for the purpose of second language acquisition. We apply this method using a computer assisted language learning system called Ville on foreign students learning Swedish. The system deploys acoustic similarity measures between native and non-native pronunciation, derived from duration, syllabicity and pitch. The system uses these measures to give pronunciation feedback in a mimicry-feedback loop exercise which has two variants: a ’say after me’ mimicry exercise, and a ’shadow with me’ exercise. The answers of questionnaires filled out by students after several training sessions spread over a month show that the learning and practicing procedure has a promising potential being very useful and fun.2009Al Moubayed, S. (2009). 
Prosodic Disambiguation in Spoken Systems Output. In Proceedings of Diaholmia'09. Stockholm, Sweden. [abstract] [pdf]Abstract: This paper presents work on using prosody in the output of spoken dialogue systems to resolve possible structural ambiguity of output utterances. An algorithm is proposed to discover ambiguous parses of an utterance and to add prosodic disambiguation events to deliver the intended structure. By conducting a pilot experiment, the automatic prosodic grouping applied to ambiguous sentences shows the ability to deliver the intended interpretation of the sentences.Al Moubayed, S., & Beskow, J. (2009). Effects of Visual Prominence Cues on Speech Intelligibility. In Proceedings of Auditory-Visual Speech Processing AVSP'09. Norwich, England. [abstract] [pdf]Abstract: This study reports experimental results on the effect of visual prominence, presented as gestures, on speech intelligibility. 30 acoustically vocoded sentences, permutated into different gestural conditions were presented audio-visually to 12 subjects. The analysis of correct word recognition shows a significant increase in intelligibility when focally-accented (prominent) words are supplemented with head-nods or with eye-brow raise gestures. The paper also examines coupling other acoustic phenomena to brow-raise gestures. As a result, the paper introduces new evidence on the ability of the non-verbal movements in the visual modality to support audio-visual speech perception.Al Moubayed, S., Beskow, J., Öster, A-M., Salvi, G., Granström, B., van Son, N., Ormel, E., & Herzke, T. (2009). Studies on Using the SynFace Talking Head for the Hearing Impaired. In Proceedings of Fonetik'09. Dept. of Linguistics, Stockholm University, Sweden. [abstract] [pdf]Abstract: SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. 
In this paper we present the large scale hearing impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and the effort scaling when using SynFace by hearing impaired people, where groups of hearing impaired subjects with different impairment levels from mild to severe and cochlear implants are tested. Preliminary analysis of the results does not show significant gain in SRT or in effort scaling. But looking at large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace especially with speech with stereo babble.Al Moubayed, S., Beskow, J., Öster, A., Salvi, G., Granström, B., van Son, N., & Ormel, E. (2009). Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-media Setting. In Proceedings of Interspeech 2009. Al Moubayed, S., Chetouani, M., Baklouti, M., Dutoit, T., Mahdhaoui, A., Martin, J-C., Ondas, S., Pelachaud, C., Urbain, J., & Yilmaz, M. (2009). Generating Robot/Agent Backchannels During a Storytelling Experiment. In Proceedings of (ICRA'09) IEEE International Conference on Robotics and Automation. Kobe, Japan.Ananthakrishnan, G., & Neiberg, D. (2009). Cross-modal Clustering in the Acoustic-Articulatory Space. In Fonetik 2009. Stockholm. [abstract] [pdf]Abstract: This paper explores cross-modal clustering in the acoustic-articulatory space. A method to improve clustering using information from more than one modality is presented. Formants and the Electromagnetic Articulography measurements are used to study corresponding clusters formed in the two modalities. A measure for estimating the uncertainty in correspondences between one cluster in the acoustic space and several clusters in the articulatory space is suggested.Ananthakrishnan, G., Neiberg, D., & Engwall, O. (2009). In search of Non-uniqueness in the Acoustic-to-Articulatory Mapping. 
In INTERSPEECH 2009 - 10th Annual Conference of the International Speech Communication Association (pp. 2799 – 2802). Brighton, UK. [abstract] [pdf]Abstract: This paper explores the possibility and extent of non-uniqueness in the acoustic-to-articulatory inversion of speech, from a statistical point of view. It proposes a technique to estimate the non-uniqueness, based on finding peaks in the conditional probability function of the articulatory space. The paper corroborates the existence of non-uniqueness in a statistical sense, especially in stop consonants, nasals and fricatives. The relationship between the importance of the articulator position and non-uniqueness at each instance is also explored.Beskow, J., Edlund, J., Granström, B., Gustafson, J., Skantze, G., & Tobiasson, H. (2009). The MonAMI Reminder: a spoken dialogue system for face-to-face interaction. In Interspeech 2009. Brighton, U.K. [abstract] [pdf]Abstract: We describe the MonAMI Reminder, a multimodal spoken dialogue system which can assist elderly and disabled people in organising and initiating their daily activities. Based on deep interviews with potential users, we have designed a calendar and reminder application which uses an innovative mix of an embodied conversational agent, digital pen and paper, and the web to meet the needs of those users as well as the current constraints of speech technology. We also explore the use of head pose tracking for interaction and attention control in human-computer face-to-face interaction.Beskow, J., & Gustafson, J. (2009). Experiments with Synthesis of Swedish Dialects. In Proceedings of Fonetik 2009. Beskow, J., Salvi, G., & Al Moubayed, S. (2009). SynFace - Verbal and Non-verbal Face Animation from Audio. In Proceedings of The International Conference on Auditory-Visual Speech Processing AVSP'09. Norwich, England. 
[abstract] [pdf]Abstract: We give an overview of SynFace, a speech-driven face animation system originally developed for the needs of hard-of-hearing users of the telephone. For the 2009 LIPS challenge, SynFace includes not only articulatory motion but also non-verbal motion of gaze, eyebrows and head, triggered by detection of acoustic correlates of prominence and cues for interaction control. In perceptual evaluations, both verbal and non-verbal movements have been found to have positive impact on word recognition scores.Beskow, J., Carlson, R., Edlund, J., Granström, B., Heldner, M., Hjalmarsson, A., & Skantze, G. (2009). Multimodal Interaction Control. In Waibel, A., & Stiefelhagen, R. (Eds.), Computers in the Human Interaction Loop (pp. 143-158). Berlin/Heidelberg: Springer. [pdf]Beskow, J., Edlund, J., Elenius, K., Hellmer, K., House, D., & Strömbergsson, S. (2009). Project presentation: Spontal – multimodal database of spontaneous dialog. In Fonetik 2009 (pp. 190-193). Stockholm. [abstract] [pdf]Abstract: We describe the ongoing Swedish speech database project Spontal: Multimodal database of spontaneous speech in dialog (VR 2006-7482). The project takes as its point of departure the fact that both vocal signals and gesture involving the face and body are important in everyday, face-to-face communicative interaction, and that there is a great need for data with which we can more precisely measure these.Bisitz, T., Herzke, T., Zokoll, M., Öster, A-M., Al Moubayed, S., Granström, B., Ormel, E., Van Son, N., & Tanke, R. (2009). Noise Reduction for Media Streams. In NAG/DAGA'09 International Conference on Acoustics. Rotterdam, Netherlands.Blomberg, M., & Elenius, D. (2009). Estimating speaker characteristics for speech recognition. In Proceedings of Fonetik 2009. Dept. of Linguistics, Stockholm University. [pdf]Blomberg, M., & Elenius, D. (2009). Tree-based Estimation of Speaker Characteristics for Speech Recognition. In Proceedings of Interspeech 2009. 
[abstract]Abstract: Speaker adaptation by means of adjustment of speaker characteristic properties, such as vocal tract length, has the important advantage compared to conventional adaptation techniques that the adapted models are guaranteed to be realistic if the description of the properties are. One problem with this approach is that the search procedure to estimate them is computationally heavy. We address the problem by using a multi-dimensional, hierarchical tree of acoustic model sets. The leaf sets are created by transforming a conventionally trained model set using leaf-specific speaker profile vectors. The model sets of non-leaf nodes are formed by merging the models of their child nodes, using a computationally efficient algorithm. During recognition, a maximum likelihood criterion is followed to traverse the tree. Studies of one- (VTLN) and four-dimensional speaker profile vectors (VTLN, two spectral slope parameters and model variance scaling) exhibit a reduction of the computational load to a fraction compared to that of an exhaustive search. In recognition experiments on children’s connected digits using adult and male models, the one-dimensional tree search performed as well as the exhaustive search. Further reduction was achieved with four dimensions. The best recognition results are 0.93% and 10.2% WER in TIDIGITS and PF-Star-Sw, respectively, using adult models.Blomberg, M., Elenius, K., House, D., & Karlsson, I. (2009). Research Challenges in Speech Technology: A Special Issue in Honour of Rolf Carlson and Björn Granström. Speech Communication, 51(7), 563.Boves, L., Carlson, R., Hinrichs, E., House, D., Krauwer, S., Lemnitzer, L., Vainio, M., & Wittenburg, P. (2009). Resources for Speech Research: Present and Future Infrastructure Needs. In Interspeech (pp. 1803-1806). Brighton, UK. 
[abstract] [pdf]Abstract: This paper introduces the EU-FP7 project CLARIN, a joint effort of over 150 institutions in Europe, aimed at the creation of a sustainable language resources and technology infrastructure for the humanities and social sciences research community. The paper briefly introduces the vision behind the project and how it relates to speech research with a focus on the contributions that CLARIN can and will make to research in spoken language processing.Carlson, R., & Gustafson, K. (2009). Exploring Data Driven Parametric Synthesis. In Fonetik 2009. Stockholm, Sweden. [pdf]Carlson, R., & Hirschberg, J. (2009). Cross-Cultural Perception of Discourse Phenomena. In Interspeech (pp. 1723-1726). Brighton, UK. [pdf]Edlund, J., & Beskow, J. (2009). MushyPeek - a framework for online investigation of audiovisual dialogue phenomena. Language and Speech, 52(2-3), 351-367. [abstract]Abstract: Evaluation of methods and techniques for conversational and multimodal spoken dialogue systems is complex, as is gathering data for the modeling and tuning of such techniques. This article describes MushyPeek, an experiment framework that allows us to manipulate the audiovisual behavior of interlocutors in a setting similar to face-to-face human-human dialogue. The setup connects two subjects to each other over a Voice over Internet Protocol (VoIP) telephone connection and simultaneously provides each of them with an avatar representing the other. We present a first experiment which inaugurates, exemplifies, and validates the framework. The experiment corroborates earlier findings on the use of gaze and head pose gestures in turn-taking.Edlund, J., Heldner, M., & Hirschberg, J. (2009). Pause and gap length in face-to-face interaction. In Proc. of Interspeech 2009. Brighton, UK. [abstract] [pdf]Abstract: It has long been noted that conversational partners tend to exhibit increasingly similar pitch, intensity, and timing behavior over the course of a conversation. 
However, the metrics developed to measure this similarity to date have generally failed to capture the dynamic temporal aspects of this process. In this paper, we propose new approaches to measuring interlocutor similarity in spoken dialogue. We define similarity in terms of convergence and synchrony and propose approaches to capture these, illustrating our techniques on gap and pause production in Swedish spontaneous dialogues.Edlund, J., Heldner, M., & Pelcé, A. (2009). Prosodic features of very short utterances in dialogue. In Vainio, M., Aulanko, R., & Aaltonen, O. (Eds.), Nordic Prosody - Proceedings of the Xth Conference (pp. 57-68). Frankfurt am Main: Peter Lang. [pdf]Elenius, D., & Blomberg, M. (2009). On Extending VTLN to Phoneme-specific Warping in Automatic Speech Recognition. In Proceedings of Fonetik 2009. Dept. of Linguistics, Stockholm University. [pdf]Engwall, O., & Wik, P. (2009). Are real tongue movements easier to speech read than synthesized?. In Proceedings of Interspeech. [pdf]Engwall, O., & Wik, P. (2009). Can you tell if tongue movements are real or synthetic?. In Proceedings of AVSP. [pdf]Engwall, O., & Wik, P. (2009). Real vs. rule-generated tongue movements as an audio-visual speech perception support. In Proceedings of Fonetik 2009. Gustafson, J., & Merkes, M. (2009). Eliciting interactional phenomena in human-human dialogues. In Proceedings of SigDial 2009. Hagström, P. (2009). Predicting visual intelligibility gain in the Synface Application. Master's thesis, KTH. [pdf]Heldner, M., Edlund, J., Laskowski, K., & Pelcé, A. (2009). Prosodic features in the vicinity of pauses, gaps and overlaps. In Vainio, M., Aulanko, R., & Aaltonen, O. (Eds.), Nordic Prosody - Proceedings of the Xth Conference (pp. 95-106). Frankfurt am Main: Peter Lang. [abstract] [pdf]Abstract: In this study, we describe the range of prosodic variation observed in two types of dialogue contexts, using fully automatic methods. 
The first type of context is that of speaker-changes, or transitions from only one participant speaking to only the other, involving either acoustic silences or acoustic overlaps. The second type of context is comprised of mutual silences or overlaps where a speaker change could in principle occur but does not. For lack of a better term, we will refer to these contexts as non-speaker-changes. More specifically, we investigate F0 patterns in the intervals immediately preceding overlaps and silences – in order to assess whether prosody before overlaps or silences may invite or inhibit speaker change.Hincks, R., & Edlund, J. (2009). Using speech technology to promote increased pitch variation in oral presentations. In Proc. of SLaTE Workshop on Speech and Language Technology in Education. Wroxall, UK. [abstract] [pdf]Abstract: This paper reports on an experimental study comparing two groups of seven Chinese students of English who practiced oral presentations with computer feedback. Both groups imitated teacher models and could listen to recordings of their own production. The test group was also shown flashing lights that responded to the standard deviation of the fundamental frequency over the previous two seconds. The speech of the test group increased significantly more in pitch variation than the control group. These positive results suggest that this novel type of feedback could be used in training systems for speakers who have a tendency to speak in a monotone when making oral presentations.Hincks, R., & Edlund, J. (2009). Promoting increased pitch variation in oral presentations with transient visual feedback. Language Learning & Technology, 13(3), 32-50. [abstract]Abstract: This paper investigates learner response to a novel kind of intonation feedback generated from speech analysis. Instead of displays of pitch curves, our feedback is flashing lights that show how much pitch variation the speaker has produced. 
The variable used to generate the feedback is the standard deviation of fundamental frequency as measured in semitones. Flat speech causes the system to show yellow lights, while more expressive speech that has used pitch to give focus to any part of an utterance generates green lights. Participants in the study were 14 Chinese students of English at intermediate and advanced levels. A group that received visual feedback was compared with a group that received audio feedback. Pitch variation was measured at four stages: in a baseline oral presentation; for the first and second halves of three hours of training; and finally in the production of a new oral presentation. Both groups increased their pitch variation with training, and the effect lasted after the training had ended. The test group showed a significantly higher increase than the control group, indicating that the feedback is effective. These positive results imply that the feedback could be beneficially used in a system for practicing oral presentations.Hincks, R., & Edlund, J. (2009). Transient visual feedback on pitch variation for Chinese speakers of English. In Proc. of Fonetik 2009. Stockholm. [abstract] [pdf]Abstract: This paper reports on an experimental study comparing two groups of seven Chinese students of English who practiced oral presentations with computer feedback. Both groups imitated teacher models and could listen to recordings of their own production. The test group was also shown flashing lights that responded to the standard deviation of the fundamental frequency over the previous two seconds. The speech of the test group increased significantly more in pitch variation than the control group. These positive results suggest that this novel type of feedback could be used in training systems for speakers who have a tendency to speak in a monotone when making oral presentations.Hjalmarsson, A. (2009). On cue - additive effects of turn-regulating phenomena in dialogue. In Diaholmia (pp. 
27-34). [abstract] [pdf]Abstract: One line of work on turn-taking in dialogue suggests that speakers react to “cues” or “signals” in the behaviour of the preceding speaker. This paper describes a perception experiment that investigates if such potential turn-taking cues affect the judgments made by non-participating listeners. The experiment was designed as a game where the task was to listen to dialogues and guess the outcome, whether there will be a speaker change or not, whenever the recording was halted. Human-human dialogues as well as dialogues where one of the human voices was replaced by a synthetic voice were used. The results show that simultaneous turn-regulating cues have a reinforcing effect on the listeners’ judgements. The more turn-holding cues, the faster the reaction time, suggesting that the subjects were more confident in their judgments. Moreover, the more cues, regardless if turn-holding or turn-yielding, the higher the agreement among subjects on the predicted outcome. For the re-synthesized voice, responses were made significantly slower; however, the judgments show that the turn-taking cues were interpreted as having similar functions as for the original human voice.House, D., Karlsson, A., Svantesson, J-O., & Tayanin, D. (2009). On utterance-final intonation in tonal and non-tonal dialects of Kammu. In Proceedings of Fonetik 2009 (pp. 78-81). Department of Linguistics, Stockholm University. [abstract] [pdf]Abstract: In this study we investigate utterance-final intonation in two dialects of Kammu, one tonal and one non-tonal. While the general patterns of utterance-final intonation are similar between the dialects, we do find clear evidence that the lexical tones of the tonal dialect restrict the pitch range and the realization of focus. Speaker engagement can have a strong effect on the utterance-final accent in both dialects.House, D., Karlsson, A., Svantesson, J-O., & Tayanin, D. (2009). 
The Phrase-Final Accent in Kammu: Effects of Tone, Focus and Engagement. In Proceedings of Interspeech 2009. [abstract] [pdf]Abstract: The phrase-final accent can typically contain a multitude of simultaneous prosodic signals. In this study, aimed at separating the effects of lexical tone from phrase-final intonation, phrase-final accents of two dialects of Kammu were analyzed. Kammu, a Mon-Khmer language spoken primarily in northern Lao, has dialects with lexical tones and dialects with no lexical tones. Both dialects seem to engage the phrase-final accent to simultaneously convey focus, phrase finality, utterance finality, and speaker engagement. Both dialects also show clear evidence of truncation phenomena. These results have implications for our understanding of the interaction between tone, intonation and phrase-finality.Kjellström, H., & Engwall, O. (2009). Audiovisual-to-articulatory inversion. Speech Communication, 51(3), 195-209. [pdf]Krunic, V., Salvi, G., Bernardino, A., Montesano, L., & Santos-Victor, J. (2009). Affordance based word-to-meaning association. In IEEE International Conference on Robotics and Automation (ICRA). Kobe, Japan. [abstract] [pdf]Abstract: This paper presents a method to associate meanings to words in manipulation tasks. We base our model on an affordance network, i.e., a mapping between robot actions, robot perceptions and the perceived effects of these actions upon objects. We extend the affordance model to incorporate words. Using verbal descriptions of a task, the model uses temporal co-occurrence to create links between speech utterances and the involved objects, actions and effects. We show that the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot’s own understanding of its actions. 
Thus they can be directly used to instruct the robot to perform tasks and also allow to incorporate context in the speech recognition task.Laskowski, K., Heldner, M., & Edlund, J. (2009). A general-purpose 32 ms prosodic vector for Hidden Markov Modeling. In Proc. of Interspeech 2009. Brighton, UK. [abstract] [pdf]Abstract: Prosody plays a central role in communicating via speech, making it important for speech technologies to model. Unfortunately, the application of standard modeling techniques to the acoustics of prosody has been hindered by difficulties in modeling intonation. In this work, we explore the suitability of the recently introduced fundamental frequency variation (FFV) spectrum as a candidate general representation of tone. Experiments on 4 tasks demonstrate that FFV features are complementary to other acoustic measures of prosody and that hidden Markov models offer a suitable modeling paradigm. Proposed improvements yield a 35% relative decrease in error on unseen data and simultaneously reduce time complexity by more than an order of magnitude. The resulting representation is sufficiently mature for general deployment in a broad range of automatic speech processing applications.Laskowski, K., Heldner, M., & Edlund, J. (2009). Exploring the prosody of floor mechanisms in English using the fundamental frequency variation spectrum. In Proceedings of the 2009 European Signal Processing Conference (EUSIPCO-2009). Glasgow, Scotland. [abstract] [pdf]Abstract: A basic requirement for participation in conversation is the ability to jointly manage interaction. Examples of interaction management include indications to acquire, re-acquire, hold, release, and acknowledge floor ownership, and these are often implemented using specialized dialog act (DA) types. In this work, we explore the prosody of one class of such DA types, known as floor mechanisms, using a methodology based on a recently proposed representation of fundamental frequency variation (FFV). 
Models over the representation illustrate significant differences between floor mechanisms and other dialog act types, and lead to automatic detection accuracies in equal-prior test data of up to 75%. Analysis indicates that FFV modeling offers a useful tool for the discovery of prosodic phenomena which are not explicitly labeled in the audio.Merkes, M. (2009). Recording methods for Spoken Dialogue Systems. Master's thesis.Neiberg, D., Ananthakrishnan, G., & Blomberg, M. (2009). On Acquiring Speech Production Knowledge from Articulatory Measurements for Phoneme Recognition. In INTERSPEECH 2009 - 10th Annual Conference of the International Speech Communication Association (pp. 1387-1390). Brighton, UK. [abstract] [pdf]Abstract: The paper proposes a general version of a coupled Hidden Markov/Bayesian Network model for performing phoneme recognition on acoustic-articulatory data. The model uses knowledge learned from the articulatory measurements, available for training, for phoneme recognition on the acoustic input. After training on the articulatory data, the model is able to predict 71.5% of the articulatory state sequences using the acoustic input. Using optimized parameters, the proposed method shows a slight improvement for two speakers over the baseline phoneme recognition system which does not use articulatory knowledge. However, the improvement is only statistically significant for one of the speakers. While there is an improvement in recognition accuracy for the vowels, diphthongs and to some extent the semi-vowels, there is a decrease in accuracy for the remaining phonemes.Neiberg, D., Elenius, K., & Burger, S. (2009). Emotion Recognition. In Waibel, A., & Stiefelhagen, R. (Eds.), Computers in the Human Interaction Loop (pp. 96-105). Berlin/Heidelberg: Springer. [pdf]Pammi, S., & Schröder, M. (2009). Annotating meaning of listener vocalizations for speech synthesis. In International Conference on Affective Computing & Intelligent Interaction. 
Amsterdam, The Netherlands.Salvi, G., Beskow, J., Al Moubayed, S., & Granström, B. (2009). SynFace—Speech-Driven Facial Animation for Virtual Speech-Reading Support. EURASIP Journal on Audio, Speech, and Music Processing, 2009. [abstract] [pdf]Abstract: This paper describes SynFace, a supportive technology that aims at enhancing audio-based spoken communication in adverse acoustic conditions by providing the missing visual information in the form of an animated talking head. Firstly, we describe the system architecture, consisting of a 3D animated face model controlled from the speech input by a specifically optimised phonetic recogniser. Secondly, we report on speech intelligibility experiments with focus on multilinguality and robustness to audio quality. The system, already available for Swedish, English, and Flemish, was optimised for German and for Swedish wide-band speech quality available in TV, radio, and Internet communication. Lastly, the paper covers experiments with nonverbal motions driven from the speech signal. It is shown that turn-taking gestures can be used to affect the flow of human-human dialogues. We have focused specifically on two categories of cues that may be extracted from the acoustic signal: prominence/emphasis and interactional cues (turn-taking/back-channelling).Schlangen, D., & Skantze, G. (2009). A general, abstract model of incremental dialogue processing. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-09). Athens, Greece. [abstract] [pdf]Abstract: We present a general model and conceptual framework for specifying architectures for incremental processing in dialogue systems, in particular with respect to the topology of the network of modules that make up the system, the way information flows through this network, how information increments are 'packaged', and how these increments are processed by the modules. 
This model enables the precise specification of incremental systems and hence facilitates detailed comparisons between systems, as well as giving guidance on designing new systems.Sen, A., Ananthakrishnan, G., Sundaram, S., & Ramakrishnan, A. G. (2009). Dynamic Space Warping of Strokes for Recognition of Online Handwritten Characters. International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 23(5), 925-943. [abstract] [html]Abstract: This paper suggests a scheme for classifying online handwritten characters, based on dynamic space warping of strokes within the characters. A method for segmenting components into strokes using velocity profiles is proposed. Each stroke is a simple arbitrary shape and is encoded using three attributes. Correspondence between various strokes is established using Dynamic Space Warping. A distance measure which reliably differentiates between two corresponding simple shapes (strokes) has been formulated thus obtaining a perceptual distance measure between any two characters. Tests indicate an accuracy of over 85% on two different datasets of characters.Skantze, G., & Gustafson, J. (2009). Attention and interaction control in a human-human-computer dialogue setting. In Proceedings of SigDial 2009. London, UK. [abstract] [pdf]Abstract: This paper presents a simple, yet effective model for managing attention and interaction control in multimodal spoken dialogue systems. The model allows the user to switch attention between the system and other humans, and the system to stop and resume speaking. An evaluation in a tutoring setting shows that the user’s attention can be effectively monitored using head pose tracking, and that this is a more reliable method than using push-to-talk.Skantze, G., & Gustafson, J. (2009). Multimodal interaction control in the MonAMI Reminder. In Proceedings of DiaHolmia (pp. 127-128). Stockholm, Sweden. [pdf]Skantze, G., & Schlangen, D. (2009). 
Incremental dialogue processing in a micro-domain. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-09). Athens, Greece. [abstract] [pdf]Abstract: This paper describes a fully incremental dialogue system that can engage in dialogues in a simple domain, number dictation. Because it uses incremental speech recognition and prosodic analysis, the system can give rapid feedback as the user is speaking, with a very short latency of around 200ms. Because it uses incremental speech synthesis and self-monitoring, the system can react to feedback from the user as the system is speaking. A comparative evaluation shows that naïve users preferred this system over a non-incremental version, and that it was perceived as more human-like.Svantesson, J-O., House, D., Karlsson, A., & Tayanin, D. (2009). Reduplication with fixed tone pattern in Kammu. In Proceedings of Fonetik 2009 (pp. 82-84). Department of Linguistics, Stockholm University. [abstract] [pdf]Abstract: In this paper we show that speakers of both tonal and non-tonal dialects of Kammu use a fixed tone pattern high–low for intensifying reduplication of adjectives, and also that speakers of the tonal dialect retain the lexical tones (high or low) while applying this fixed tone pattern.Wik, P., & Escribano, D. (2009). Say ‘Aaaaa’ Interactive Vowel Practice for Second Language Learning. In Proc. of SLaTE Workshop on Speech and Language Technology in Education. Wroxall, England. [pdf]Wik, P., Hincks, R., & Hirschberg, J. (2009). Responses to Ville: A virtual language teacher for Swedish. In Proc. of SLaTE Workshop on Speech and Language Technology in Education. Wroxall, England. [abstract] [pdf]Abstract: A series of novel capabilities have been designed to extend the repertoire of Ville, a virtual language teacher for Swedish, created at the Centre for Speech Technology at KTH. These capabilities were tested by twenty-seven language students at KTH. 
This paper reports on qualitative surveys and quantitative performance from these sessions which suggest some general lessons for automated language training.Wik, P., & Hjalmarsson, A. (2009). Embodied conversational agents in computer assisted language learning. Speech Communication, 51(10), 1024-1037. [abstract] [pdf]Abstract: This paper describes two systems using embodied conversational agents (ECAs) for language learning. The first system, called Ville, is a virtual language teacher for vocabulary and pronunciation training. The second system, a dialogue system called DEAL, is a role-playing game for practicing conversational skills. Whereas DEAL acts as a conversational partner with the objective of creating and keeping an interesting dialogue, Ville takes the role of a teacher who guides, encourages and gives feedback to the students.
2008
Al Moubayed, S., De Smet, M., & Van Hamme, H. (2008). Lip Synchronization: from Phone Lattice to PCA Eigen-projections using Neural Networks. In Proceedings of Interspeech 2008. Brisbane, Australia. [pdf]Ananthakrishnan, G., & Engwall, O. (2008). Important regions in the articulator trajectory. In Proceedings of International Seminar on Speech Production (pp. 305-308). Strasbourg, France. [pdf]Beskow, J., Bruce, G., Enflo, L., Granström, B., & Schötz, S. (2008). Human Recognition of Swedish Dialects. In Proceedings of Fonetik 2008. Beskow, J., Bruce, G., Enflo, L., Granström, B., & Schötz, S. (2008). Recognizing and Modelling Regional Varieties of Swedish. In Proceedings of Interspeech 2008. [pdf]Beskow, J., & Cerrato, L. (2008). Evaluation of the expressivity of a Swedish talking head in the context of human-machine interaction. In Magno Caldognetto, E., Cavicchio, E., & Cosi, P. (Eds.), Comunicazione Parlata e manifestazione delle emozioni. [pdf]Beskow, J., Edlund, J., Gjermani, T., Granström, B., Gustafson, J., Jonsson, O., Skantze, G., & Tobiasson, H. (2008). Innovative interfaces in MonAMI: the reminder. 
In Proceedings of the 10th international conference on Multimodal interfaces, Chania, Crete, Greece (pp. 199-200). New York, NY, USA: ACM. [abstract] [pdf]Abstract: This demo paper presents an early version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. The Reminder helps users to plan activities and to remember what to do. The prototype merges mobile ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar. Users may ask questions such as “When was I supposed to meet Sara?” or “What’s my schedule today?”Beskow, J., Edlund, J., Granström, B., Gustafson, J., Jonsson, O., & Skantze, G. (2008). Speech technology in the European project MonAMI. In Proceedings of FONETIK 2008 (pp. 33-36). Gothenburg, Sweden. [abstract] [pdf]Abstract: This paper describes the role of speech and speech technology in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. It presents the Reminder, a prototype embodied conversational agent (ECA) which helps users to plan activities and to remember what to do. The prototype merges speech technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar. 
Users may also ask questions such as “When was I supposed to meet Sara?” or “What’s on my schedule today?”Beskow, J., Edlund, J., Granström, B., Gustafson, J., & Skantze, G. (2008). Innovative interfaces in MonAMI: the KTH Reminder. In Perception in Multimodal Dialogue Systems - Proceedings of the 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, PIT 2008, Kloster Irsee, Germany, June 16-18, 2008. (pp. 272-275). Berlin/Heidelberg: Springer. [abstract] [pdf]Abstract: This demo paper presents the first version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. The Reminder helps users to plan activities and to remember what to do. The prototype merges ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. This innovative combination of modalities allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides verbal notifications on what has been written in the calendar. Users may also ask questions such as “When was I supposed to meet Sara?” or “What’s on my schedule today?”Beskow, J., Engwall, O., Granström, B., Nordqvist, P., & Wik, P. (2008). Visualization of speech and audio for hearing-impaired persons. Technology and Disability, 20(2), 97-107. [pdf]Beskow, J., Granström, B., Nordqvist, P., Al Moubayed, S., Salvi, G., Herzke, T., & Schulz, A. (2008). Hearing at Home – Communication support in home environments for hearing impaired persons. In Proceedings of Interspeech 2008. Brisbane, Australia. [abstract] [pdf]Abstract: The Hearing at Home (HaH) project focuses on the needs of hearing-impaired people in home environments. 
The project is researching and developing an innovative media-center solution for hearing support, with several integrated features that support perception of speech and audio, such as individual loudness amplification, noise reduction, audio classification and event detection, and the possibility to display an animated talking head providing real-time speechreading support. In this paper we provide a brief project overview and then describe some recent results related to the audio classifier and the talking head. As the talking head expects clean speech input, an audio classifier has been developed for the task of classifying audio signals as clean speech, speech in noise or other. The mean accuracy of the classifier was 82%. The talking head (based on technology from the SynFace project) has been adapted for German, and a small speech-in-noise intelligibility experiment was conducted where sentence recognition rates increased from 3% to 17% when the talking head was present.Biadsy, F., Rosenberg, A., Carlson, R., Hirschberg, J., & Strangert, E. (2008). A Cross-Cultural Comparison of American, Palestinian, and Swedish. In Speech Prosody 2008. Campinas, Brazil. [pdf]Blomberg, M., & Elenius, D. (2008). Investigating Explicit Model Transformations for Speaker Normalization. In Proceedings of ISCA ITRW Speech Analysis and Processing for Knowledge Discovery. Aalborg, Denmark. [abstract] [pdf]Abstract: In this work we extend the test utterance adaptation technique used in vocal tract length normalization to a larger number of speaker characteristic features. We perform partially joint estimation of four features: the VTLN warping factor, the corner position of the piece-wise linear warping function, spectral tilt in voiced segments, and model variance scaling. In experiments on the Swedish PF-Star children database, joint estimation of warping factor and variance scaling lowered the recognition error rate compared to warping factor alone.Blomberg, M., & Elenius, D. 
(2008). Knowledge-Rich Model Transformations for Speaker Normalization in Speech Recognition. In Proceedings, FONETIK 2008, Department of Linguistics, University of Gothenburg. [abstract] [pdf]Abstract: In this work we extend the test utterance adaptation technique used in vocal tract length normalization to a larger number of speaker characteristic features. We perform partially joint estimation of four features: the VTLN warping factor, the corner position of the piece-wise linear warping function, spectral tilt in voiced segments, and model variance scaling. In experiments on the Swedish PF-Star children database, joint estimation of warping factor and variance scaling lowers the recognition error rate compared to warping factor alone.Blomberg, M., & Elenius, D. (2008). Knowledge-Rich Model Transformations for Speaker Normalization in Speech Recognition. In Eriksson, A., & Lindh, J. (Eds.), Fonetik 2008. Box 200, SE 405 30 Gothenburg. [pdf]Carlson, R., Gustafson, K., & Strangert, E. (2008). Synthesising disfluencies in a dialogue system. In Nordic Prosody. Helsinki, Finland.Edlund, J., Gustafson, J., Heldner, M., & Hjalmarsson, A. (2008). Towards human-like spoken dialogue systems. Speech Communication, 50(8-9), 630-645. [abstract] [pdf]Abstract: This paper presents an overview of methods that can be used to collect and analyse data on user responses to spoken dialogue system components intended to increase human-likeness, and to evaluate how well the components succeed in reaching that goal. Wizard-of-Oz variations, human-human data manipulation, and micro-domains are discussed in this context, as is the use of third-party reviewers to get a measure of the degree of human-likeness. We also present the two-way mimicry target, a model for measuring how well a human-computer dialogue mimics or replicates some aspect of human-human dialogue, including human flaws and inconsistencies. 
Although we have added a measure of innovation, none of the techniques is new in its entirety. Taken together and described from a human-likeness perspective, however, they form a set of tools that may widen the path towards human-like spoken dialogue systems.Elenius, K., Forsbom, E., & Megyesi, B. (2008). Language Resources and Tools for Swedish: A Survey. In Proc of LREC 2008. Marrakech, Morocco. [abstract] [pdf]Abstract: Language resources and tools to create and process these resources are necessary components in human language technology and natural language applications. In this paper, we describe a survey of existing language resources for Swedish, and the need for Swedish language resources to be used in research and real-world applications in language technology as well as in linguistic research. The survey is based on a questionnaire sent to industry and academia, institutions and organizations, and to experts involved in the development of Swedish language resources in Sweden, the Nordic countries and world-wide.Elenius, K., Forsbom, E., & Megyesi, B. (2008). Survey on Swedish Language Resources. Technical Report, Speech, Music and Hearing, KTH and Department of Linguistics and Philology, Uppsala University. [pdf]Engwall, O. (2008). Bättre tala än texta - talteknologi nu och i framtiden. In Domeij, R. (Ed.), Tekniken bakom språket (pp. 98-118). Stockholm: Norstedts Akademiska Förlag.Engwall, O. (2008). Can audio-visual instructions help learners improve their articulation? - an ultrasound study of short term changes. In Proceedings of Interspeech 2008 (pp. 2631-2634). Brisbane, Australia. [pdf]Escribano, D. L. (2008). Pronunciation Training of Swedish Vowels Using Speech Technology, Embodied Conversational Agents and an Interactive Game. Master's thesis, CSC KTH. [pdf]Gustafson, J., & Edlund, J. (2008). expros: a toolkit for exploratory experimentation with prosody in customized diphone voices. 
In Proceedings of Perception and Interactive Technologies for Speech-Based Systems (PIT 2008) (pp. 293-296). Berlin/Heidelberg: Springer. [abstract] [pdf]Abstract: This paper presents a toolkit for experimentation with prosody in diphone voices. Prosodic features play an important role for aspects of human-human spoken dialogue that are largely unexploited in current spoken dialogue systems. The toolkit contains tools for recording utterances for a number of purposes. Examples include extraction of prosodic features such as pitch, intensity and duration for transplantation onto synthetic utterances, and creation of purpose-built customized MBROLA mini-voices.Gustafson, J., & Edlund, J. (2008). EXPROS: Tools for exploratory experimentation with prosody. In Proceedings of FONETIK 2008 (pp. 17-20). Gothenburg, Sweden. [abstract] [pdf]Abstract: This demo paper presents EXPROS, a toolkit for experimentation with prosody in diphone voices. Although prosodic features play an important role in human-human spoken dialogue, they are largely unexploited in current spoken dialogue systems. The toolkit contains tools for a number of purposes: for example extraction of prosodic features such as pitch, intensity and duration for transplantation onto synthetic utterances and creation of purpose-built customized MBROLA mini-voices.Gustafson, J., Heldner, M., & Edlund, J. (2008). Potential benefits of human-like dialogue behaviour in the call routing domain. In Proceedings of Perception and Interactive Technologies for Speech-Based Systems (PIT 2008) (pp. 240-251). Berlin/Heidelberg: Springer. [abstract] [pdf]Abstract: This paper presents a Wizard-of-Oz (Woz) experiment in the call routing domain that took place during the development of a call routing system for the TeliaSonera residential customer care in Sweden. A corpus of 42,000 calls was used as a basis for identifying problematic dialogues and the strategies used by operators to overcome the problems. 
A new Woz recording was made, implementing some of these strategies. The collected data is described and discussed with a view to explore the possible benefits of more human-like dialogue behaviour in call routing applications.Hjalmarsson, A. (2008). Speaking without knowing what to say... or when to end. In Proceedings of SIGdial 2008 (pp. 72-75). Columbus, Ohio, USA. [abstract] [pdf]Abstract: Humans produce speech incrementally and on-line as the dialogue progresses using information from several different sources in parallel. A dialogue system that generates output in a stepwise manner and not in preplanned syntactically correct sentences needs to signal how new dialogue contributions relate to previous discourse. This paper describes a data collection which is the foundation for an effort towards more human-like language generation in DEAL, a spoken dialogue system developed at KTH. Two annotators labelled cue phrases in the corpus with high inter-annotator agreement (kappa coefficient 0.82).Hjalmarsson, A., & Edlund, J. (2008). Human-likeness in utterance generation: effects of variability. In Perception in Multimodal Dialogue Systems - Proceedings of the 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, PIT 2008, Kloster Irsee, Germany, June 16-18, 2008. (pp. 252-255). Berlin/Heidelberg: Springer. [abstract] [pdf]Abstract: There are compelling reasons to endow dialogue systems with human-like conversational abilities, which require modelling of aspects of human behaviour. This paper examines the value of using human behaviour as a target for system behaviour through a study making use of a simulation method. Two versions of system behaviour are compared: a replica of a human speaker’s behaviour and a constrained version with less variability. The version based on human behaviour is rated more human-like, polite and intelligent.Hofer, G., Yamagishi, J., & Shimodaira, H. (2008). 
Speech-Driven Lip Motion Generation with a Trajectory HMM. In Proc. of Interspeech. Brisbane, Australia.Karlsson, A., House, D., & Tayanin, D. (2008). Recognizing phrase and utterance as prosodic units in non-tonal dialects of Kammu. In Proceedings, FONETIK 2008 (pp. 89-92). Department of Linguistics, University of Gothenburg. [pdf]Katsamanis, N., Ananthakrishnan, G., Papandreou, G., Engwall, O., & Maragos, P. (2008). Audiovisual speech inversion by switching dynamical modeling Governed by a Hidden Markov Process. In Proceedings of EUSIPCO. [pdf]Krunic, V., Salvi, G., Bernardino, A., Montesano, L., & Santos-Victor, J. (2008). Associating word descriptions to learned manipulation task models. In IEEE/RSJ International Conference on Intelligent RObots and Systems (IROS). Nice, France.Laskowski, K., Edlund, J., & Heldner, M. (2008). An instantaneous vector representation of delta pitch for speaker-change prediction in conversational dialogue systems. In Proceedings ICASSP 2008 (pp. 5041-5044). Las Vegas, Nevada, US. [abstract] [pdf]Abstract: As spoken dialogue systems become deployed in increasingly complex domains, they face rising demands on the naturalness of interaction. We focus on system responsiveness, aiming to mimic human-like dialogue flow control by predicting speaker changes as observed in real human-human conversations. We derive an instantaneous vector representation of pitch variation and show that it is amenable to standard acoustic modeling techniques. Using a small amount of automatically labeled data, we train models which significantly outperform current state-of-the-art pause-only systems, and replicate to within 1% absolute the performance of our previously published hand-crafted baseline. The new system additionally offers scope for run-time control over the precision or recall of locations at which to speak.Laskowski, K., Edlund, J., & Heldner, M. (2008). Learning prosodic sequences using the fundamental frequency variation spectrum. 
In Proceedings of the Speech Prosody 2008 Conference (pp. 151-154). Campinas, Brazil: Editora RG/CNPq. [abstract] [pdf]Abstract: We investigate a recently introduced vector-valued representation of fundamental frequency variation, whose properties appear to be well-suited for statistical sequence modeling. We show what the representation looks like, and apply hidden Markov models to learn prosodic sequences characteristic of higher-level turn-taking phenomena. Our analysis shows that the models learn exactly those characteristics which have been reported for the phenomena in the literature. Further refinements to the representation lead to 12-17% relative improvement in speaker change prediction for conversational spoken dialogue systems.Laskowski, K., Heldner, M., & Edlund, J. (2008). The fundamental frequency variation spectrum. In Proceedings of FONETIK 2008 (pp. 29-32). Gothenburg, Sweden: Department of Linguistics, University of Gothenburg. [abstract] [pdf]Abstract: This paper describes a recently introduced vector-valued representation of fundamental frequency variation – the FFV spectrum – which has a number of desirable properties. In particular, it is instantaneous, continuous, distributed, and well suited for application of standard acoustic modeling techniques. We show what the representation looks like, and how it can be used to model prosodic sequences.Laskowski, K., Wölfel, M., Heldner, M., & Edlund, J. (2008). Computing the fundamental frequency variation spectrum in conversational spoken dialogue systems. In Proceedings of Acoustics'08 (pp. 3305-3310). Paris, France. [abstract] [pdf]Abstract: Continuous modeling of intonation in natural speech has long been hampered by a focus on modeling fundamental frequency, of which several normative aspects are particularly problematic. 
The latter include, among others, the fact that pitch is undefined in unvoiced segments, that its absolute magnitude is speaker-specific, and that its robust estimation and modeling, at a particular point in time, rely on a patchwork of long-time stability heuristics. In the present work, we continue our analysis of the fundamental frequency variation (FFV) spectrum, a recently proposed instantaneous, continuous, vector-valued representation of pitch variation, which is obtained by comparing the harmonic structure of the frequency magnitude spectra of the left and right half of an analysis frame. We analyze the sensitivity of a task-specific error rate in a conversational spoken dialogue system to the specific definition of the left and right halves of a frame, resulting in operational recommendations regarding the framing policy and window shape.Laukka, P., Elenius, K., Fredriksson, M., Furumark, T., & Neiberg, D. (2008). Vocal Expression in spontaneous and experimentally induced affective speech: Acoustic correlates of anxiety, irritation and resignation. In Workshop on Corpora for Research on Emotion and Affect. Marrakesh, Morocco. [abstract] [pdf]Abstract: We present two studies on authentic vocal affect expressions. In Study 1, the speech of social phobics was recorded in an anxiogenic public speaking task both before and after treatment. In Study 2, the speech material was collected from real life human-computer interactions. All speech samples were acoustically analyzed and subjected to listening tests. Results from Study 1 showed that a decrease in experienced state anxiety after treatment was accompanied by corresponding decreases in a) several acoustic parameters (i.e., mean and maximum F0, proportion of high-frequency components in the energy spectrum, and proportion of silent pauses), and b) listeners’ perceived level of nervousness. 
Both speakers’ self-ratings of state anxiety and listeners’ ratings of perceived nervousness were further correlated with similar acoustic parameters. Results from Study 2 revealed that mean and maximum F0, mean voice intensity and H1-H2 were higher for speech perceived as irritated than for speech perceived as neutral. Also, speech perceived as resigned had lower mean and maximum F0, and mean voice intensity than neutral speech. Listeners’ ratings of irritation, resignation and emotion intensity were further correlated with several acoustic parameters. The results complement earlier studies on vocal affect expression which have been conducted on posed, rather than authentic, emotional speech.Lindblom, B., Diehl, R., Park, S-H., & Salvi, G. (2008). (Re)use of place features in voiced stop systems: Role of phonetic constraints. In Proceedings of Fonetik 2008. University of Gothenburg. [abstract] [pdf]Abstract: Computational experiments focused on place of articulation in voiced stops were designed to generate ‘optimal’ inventories of CV syllables from a larger set of ‘possible CV:s’ in the presence of independently and numerically defined articulatory, perceptual and developmental constraints. Across vowel contexts the most salient places were retroflex, palatal and uvular. This was evident from acoustic measurements and perceptual data. Simulation results using the criterion of perceptual contrast alone failed to produce systems with the typologically widely attested set [b] [d] [g], whereas using articulatory cost as the sole criterion produced inventories in which bilabial, dental/alveolar and velar onsets formed the core. Neither perceptual contrast, nor articulatory cost, (nor the two combined), produced a consistent re-use of place features (‘phonemic coding’). Only systems constrained by ‘target learning’ exhibited a strong recombination of place features.López-Colino, F., Beskow, J., & Colás, J. (2008). 
Mobile SynFace: Ubiquitous visual interface for mobile VoIP telephone calls. In Proceedings of The second Swedish Language Technology Conference (SLTC). Stockholm, Sweden. [pdf]Neiberg, D., & Ananthakrishnan, G. (2008). On the Non-uniqueness of Acoustic-to-Articulatory Mapping. In Fonetik. Göteborg.Neiberg, D., Ananthakrishnan, G., & Engwall, O. (2008). The Acoustic to Articulation Mapping: Non-linear or Non-unique?. In INTERSPEECH 2008 - 9th Annual Conference of the International Speech Communication Association (pp. 1485-1488). Brisbane, Australia. [pdf]Neiberg, D., & Elenius, K. (2008). Automatic Recognition of Anger in Spontaneous Speech. In INTERSPEECH 2008 - 9th Annual Conference of the International Speech Communication Association (pp. 1485-1488). Brisbane, Australia. [pdf]Skantze, G. (2008). Galatea: A discourse modeller supporting concept-level error handling in spoken dialogue systems. In Dybkjær, L., & Minker, W. (Eds.), Recent Trends in Discourse and Dialogue. Springer. [pdf]Strangert, E., & Gustafson, J. (2008). Improving speaker skill in a resynthesis experiment. In Proceedings of FONETIK 2008. Gothenburg, Sweden. [abstract] [pdf]Abstract: A synthesis experiment was conducted based on data from ratings of speaker skill and acoustic measurements in samples of political speech. Features assumed to be important for “being a good speaker” were manipulated in the sample of the lowest rated speaker. Increased F0 dynamics gave the greatest positive effects, but elimination of disfluencies and hesitation pauses, and increased speech rate also played a role for the impression of improved speaker skill.Strangert, E., & Gustafson, J. (2008). Subject ratings, acoustic measurements and synthesis of good-speaker characteristics. In Proceedings of Interspeech 2008. Brisbane, Australia. [pdf]Wik, P., & Engwall, O. (2008). Can visualization of internal articulators support speech perception?. In Proceedings of Interspeech 2008 (pp. 2627-2630). 
Brisbane, Australia. [pdf]Wik, P., & Engwall, O. (2008). Looking at tongues – can it help in speech perception?. In Proceedings of Fonetik 2008. [pdf]2007Abelin, Å. (2007). Emotional McGurk effect in Swedish. Technical Report 1. [pdf]Ambrazaitis, G. (2007). Swedish word accents in a ‘confirmation’ context. Proceedings of Fonetik, TMH-QPSR, 50(1), 49-52. [pdf]Bell, L., & Gustafson, J. (2007). Children’s convergence in referring expressions to graphical objects in a speech-enabled computer game. In Proceedings of Interspeech. Antwerp, Belgium. [pdf]Berg, A., & Brandt, A. (2007). What you Hear is what you See – a study of visual vs. auditive noise. Proceedings of Fonetik, TMH-QPSR, 50(1), 77-80. [pdf]Beskow, J., Granström, B., & House, D. (2007). Analysis and synthesis of multimodal verbal and non-verbal interaction for animated interface agents. In Esposito, A., Faundez-Zanuy, M., Keller, E., & Marinaro, M. (Eds.), Verbal and Nonverbal Communication Behaviours (pp. 250-263). Berlin: Springer-Verlag.Blomberg, M., & Elenius, D. (2007). Vocal tract length compensation in the signal and model domains in child speech recognition. Proceedings of Fonetik, TMH-QPSR, 50(1), 41-44. [pdf]Bodén, P., & Svensson, G. (2007). Linguistic challenges for bilingual schoolchildren in Rosengård. Proceedings of Fonetik, TMH-QPSR, 50(1), 93-96. [pdf]Bruce, G., Schötz, S., & Granström, B. (2007). SIMULEKT – modelling Swedish regional intonation. Proceedings of Fonetik, TMH-QPSR, 50(1), 121-124. [pdf]Brusk, J., Lager, T., Hjalmarsson, A., & Wik, P. (2007). DEAL – Dialogue Management in SCXML for Believable Game Characters. In Proceedings of ACM Future Play 2007 (pp. 137-144). [abstract] [pdf]Abstract: In order for game characters to be believable, they must appear to possess qualities such as emotions, the ability to learn and adapt as well as being able to communicate in natural language. 
With this paper we aim to contribute to the development of believable non-player characters (NPCs) in games, by presenting a method for managing NPC dialogues. We have selected the trade scenario as an example setting since it offers a well-known and limited domain common in games that support ownership, such as role-playing games. We have developed a dialogue manager in State Chart XML, a newly introduced W3C standard, as part of DEAL -- a research platform for exploring the challenges and potential benefits of combining elements from computer games, dialogue systems and language learning.Carlson, R. (2007). Conflicting acoustic cues in stop perception. In Where Do Features Come From ? - Phonological Primitives in the Brain, the Mouth, and the Ear (pp. 63-64). Paris, France. [pdf]Carlson, R. (2007). Using acoustic cues in stop perception. Proceedings of Fonetik, TMH-QPSR, 50(1), 25-28. [pdf]Carlson, R., & Granström, B. (2007). Rule-based Speech Synthesis. In Benesty, J., Sondhi, M. M., & Huang, Y. (Eds.), Springer Handbook of Speech Processing (pp. 429-436). Springer Berlin Heidelberg.Carlson, R., & Hawkins, S. (2007). When is fine phonetic detail a detail?. In ICPhS 2007 (pp. 211-214). Saarbrücken, Germany. [pdf]Edlund, J., & Beskow, J. (2007). Pushy versus meek – using avatars to influence turn-taking behaviour. In Proceedings of Interspeech 2007. Antwerp, Belgium. [abstract] [pdf]Abstract: The flow of spoken interaction between human interlocutors is a widely studied topic. Amongst other things, studies have shown that we use a number of facial gestures to improve this flow – to control the taking of turns. This ought to be useful in systems where an animated talking head is used, be they systems for computer mediated human-human dialogue or spoken dialogue systems, where the computer itself uses speech to interact with users. 
In this article, we show that a small set of simple interaction control gestures and a simple model of interaction can be used to influence users’ behaviour in an unobtrusive manner. The results imply that such a model may improve the flow of computer mediated interaction between humans under adverse circumstances, such as network latency, or to create more human-like spoken human-computer interaction.Edlund, J., Beskow, J., & Heldner, M. (2007). MushyPeek – an experiment framework for controlled investigation of human-human interaction control behaviour. Proceedings of Fonetik, TMH-QPSR, 50(1), 61-64. [abstract] [pdf]Abstract: This paper describes MushyPeek, an experiment framework that allows us to manipulate interaction control behaviour – including turn-taking – in a setting quite similar to face-to-face human-human dialogue. The setup connects two subjects to each other over a VoIP telephone connection and simultaneously provides each of them with an avatar representing the other. The framework is exemplified with the first experiment we tried in it – a test of the effectiveness of interaction control gestures in an animated lip-synchronised talking head.Edlund, J., & Heldner, M. (2007). Underpinning /nailon/ - automatic estimation of pitch range and speaker relative pitch. In Müller, C. (Ed.), Speaker Classification I: Fundamentals, Features, and Methods. Springer. [abstract] [pdf]Abstract: In this study, we explore what is needed to get an automatic estimation of speaker relative pitch that is good enough for many practical tasks in speech technology. We present analyses of fundamental frequency (F0) distributions from eight speakers with a view to examine (i) the effect of semitone transform on the shape of these distributions; (ii) the errors resulting from calculation of percentiles from the means and standard deviations of the distributions; and (iii) the amount of voiced speech required to obtain a robust estimation of speaker relative pitch. 
In addition, we provide a hands-on description of how such an estimation can be obtained under real-time online conditions using /nailon/ - our software for online analysis of prosody.Eklund, R. (2007). Pulmonic ingressive speech: a neglected universal?. Proceedings of Fonetik, TMH-QPSR, 50(1), 21-24. [pdf]Enflo, L. (2007). Threshold Pressure For Vocal Fold Collision. Proceedings of Fonetik, TMH-QPSR, 50(1), 105-108. [pdf]Engwall, O., & Bälter, O. (2007). Pronunciation feedback from real and virtual language teachers. Journal of Computer Assisted Language Learning, 20(3), 235-262. [pdf]Ericsson, C., Klein, J., Sjölander, K., & Sönnebo, L. (2007). Filibuster – a new Swedish text-to-speech system. Proceedings of Fonetik, TMH-QPSR, 50(1), 33-36. [pdf]Eriksson, E., & Sullivan, K. (2007). Dialect recognition in a noisy environment: preliminary data. Proceedings of Fonetik, TMH-QPSR, 50(1), 101-104. [pdf]Fant, G., & Kruckenberg, A. (2007). Co-variation of acoustic parameters in prosody. Proceedings of Fonetik, TMH-QPSR, 50(1), 1-4. [pdf]Forsell, M., Elenius, K., & Laukka, P. (2007). Acoustic correlates of frustration in spontaneous speech. Proceedings of Fonetik, TMH-QPSR, 50(1), 37-40. [pdf]Frid, J. (2007). Automatic classification of 'front' and 'back' pronunciation variants of /r/ in the Götaland dialects of Swedish. Proceedings of Fonetik, TMH-QPSR, 50(1), 113-116. [pdf]Granström, B., & House, D. (2007). Inside out - Acoustic and visual aspects of verbal and non-verbal communication (Keynote Paper). Proceedings of the 16th International Congress of Phonetic Sciences, Saarbrücken, 11-18. [pdf]Granström, B., & House, D. (2007). Modelling and evaluating verbal and non-verbal communication in talking animated interface agents. In Dybkjaer, L., Hemsen, H., & Minker, W. (Eds.), Evaluation of Text and Speech Systems (pp. 65-98). Springer-Verlag Ltd.Heldner, M., & Edlund, J. (2007). What turns speech into conversation? A project description. 
Proceedings of Fonetik, TMH-QPSR, 50(1), 45-48. [abstract] [pdf]Abstract: The project Vad gör tal till samtal? (What turns speech into conversation?) takes as its starting point that while conversation must be considered the primary kind of speech, we are still far better at modelling monologue than dialogue, in theory as well as for speech technology applications. There are also good reasons to assume that conversation contains a number of features that are not found in other kinds of speech, including, among other things, the active cooperation among interlocutors to control the interaction, and to establish common ground. Through this project, we hope to improve the situation by investigating features that are specific to human-human conversation – features that turn speech into conversation. We will focus on acoustic and prosodic aspects of such features.Hjalmarsson, A., Wik, P., & Brusk, J. (2007). Dealing with DEAL: a dialogue system for conversation training. In Proceedings of SIGdial (pp. 132-135). Antwerp, Belgium. [abstract] [pdf]Abstract: We present DEAL, a spoken dialogue system for conversation training under development at KTH. DEAL is a game with a spoken language interface designed for second language learners. The system is intended as a multidisciplinary research platform where challenges and potential benefits of combining elements from computer games, dialogue systems and language learning can be explored.House, D. (2007). Integrating Audio and Visual Cues for Speaker Friendliness in Multimodal Speech Synthesis. In Interspeech 2007 (pp. 1250-1253). Antwerp. [abstract] [pdf]Abstract: This paper investigates interactions between audio and visual cues to friendliness in questions in two perception experiments. In the first experiment, manually edited parametric audio-visual synthesis was used to create the stimuli. 
Results were consistent with earlier findings in that a late, high final focal accent peak was perceived as friendlier than an earlier, lower focal accent peak. Friendliness was also effectively signaled by visual facial parameters such as a smile, head nod and eyebrow raising synchronized with the final accent. Consistent additive effects were found between the audio and visual cues for the subjects as a group and individually showing that subjects integrate the two modalities. The second experiment used data-driven visual synthesis where the database was recorded by an actor instructed to portray anger and happiness. Friendliness was correlated to the happy database, but the effect was not as strong as for the parametric synthesis.House, D., & Granström, B. (2007). Analyzing and modelling verbal and non-verbal communication for talking animated interface agents. In Esposito, A., Bratanic, M., Keller, E., & Marinaro, M. (Eds.), Fundamentals of verbal and nonverbal communication and the biometric issue (pp. 317-331). Amsterdam: IOS Press.Hugot, V. (2007). Eye gaze analysis in human-human interactions. Master's thesis, CSC. [pdf]Hunnicutt, S., & Magnuson, T. (2007). Grammar-Guided Writing for AAC Users. Assistive Technology, 19(3), 128-142. [abstract]Abstract: A method of grammar-guided writing has been devised to guide graphic sign users through the construction of text messages for use in e-mail and other applications with a remote receiver. The purpose is to promote morphologically and syntactically correct sentences as output for graphic sign users.Jande, P. (2007). Spoken language annotation and data-driven modelling of phone-level pronunciation in discourse context. Speech Communication, 50(2), 126-141.Karlsson, A., House, D., Svantesson, J-O., & Tayanin, D. (2007). Boundary signaling in tonal and non-tonal dialects of Kammu. Proceedings of Fonetik, TMH-QPSR, 50(1), 117-120. [pdf]Karlsson, A. M., House, D., Svantesson, J-O., & Tayanin, D. (2007). 
Prosodic Phrasing in Tonal and Non-tonal Dialects of Kammu. Proceedings of the 16th International Congress of Phonetic Sciences, Saarbrücken, 1309-1312. [pdf]Kjellström, H., Engwall, O., Abdou, S., & Bälter, O. (2007). Audio-visual phoneme classification for pronunciation training applications. In Proceedings of Interspeech 2007 (pp. 702-705). Antwerpen, Belgium. [pdf]Klintfors, E., Lacerda, F., & Sundberg, U. (2007). Estimates of Infants’ Vocabulary Composition and the Role of Adult-instructions for Early Word-learning. Proceedings of Fonetik, TMH-QPSR, 50(1), 53-56. [pdf]Krull, D. (2007). The influence of Swedish on prepausal lengthening in Estonian. Proceedings of Fonetik, TMH-QPSR, 50(1), 89-92. [pdf]Kügler, F. (2007). Timing of legal and illegal consonant clusters in Swedish. Proceedings of Fonetik, TMH-QPSR, 50(1), 9-12. [pdf]Lindblom, B., Sundberg, J., Branderud, P., & Djamshidpey, H. (2007). On the acoustics of spread lips. Proceedings of Fonetik, TMH-QPSR, 50(1), 13-16. [pdf]Lindh, J. (2007). Voxalys – a Pedagogical Praat Plugin for Voice Analysis. Proceedings of Fonetik, TMH-QPSR, 50(1), 97-100. [pdf]Palmstierna, C. (2007). A Survey of the Sound Shift in Dublin English. Proceedings of Fonetik, TMH-QPSR, 50(1), 81-84. [pdf]Persson, J., & Westholm, L. (2007). The Parrot Effect – a study of the ability to imitate a foreign language. Proceedings of Fonetik, TMH-QPSR, 50(1). [pdf]Skantze, G. (2007). Error Handling in Spoken Dialogue Systems - Managing Uncertainty, Grounding and Miscommunication. Doctoral dissertation, KTH, Department of Speech, Music and Hearing. [pdf]Skantze, G. (2007). Making grounding decisions: Data-driven estimation of dialogue costs and confidence thresholds. In Proceedings of SigDial (pp. 206-210). Antwerp, Belgium. 
[abstract] [pdf]Abstract: This paper presents a data-driven decision-theoretic approach to making grounding decisions in spoken dialogue systems, i.e., to decide which recognition hypotheses to consider as correct and which grounding action to take. Based on task analysis of the dialogue domain, cost functions are derived, which take dialogue efficiency, consequence of task failure and information gain into account. Dialogue data is then used to estimate speech recognition confidence thresholds that are dependent on the dialogue context.Strangert, E. (2007). What makes a good speaker? Subjective ratings and acoustic measurements. Proceedings of Fonetik, TMH-QPSR, 50(1), 29-32. [pdf]Strömbergsson, S. (2007). Interactional patterns in computer-assisted phonological intervention in children. Proceedings of Fonetik 2007, TMH-QPSR, 50(1), 69-72. [pdf]Suomi, K. (2007). Accentual tonal targets and speaking rate in Northern Finnish. Proceedings of Fonetik, TMH-QPSR, 50(1), 109-112. [pdf]Toivanen, J. (2007). Fall-rise intonation usage in Finnish English second language discourse. Proceedings of Fonetik, TMH-QPSR, 50(1), 85-88. [pdf]Traunmüller, H. (2007). Demodulation, mirror neurons and audiovisual perception nullify the motor theory. Proceedings of Fonetik, TMH-QPSR, 50(1), 17-20. [pdf]Van Dommelen, W., & Ringen, C. (2007). Intervocalic fortis and lenis stops in a Norwegian dialect. Proceedings of Fonetik, TMH-QPSR, 50(1), 5-8. [pdf]Wik, P., & Granström, B. (2007). Att lära sig språk med en virtuell lärare. In Från Vision till praktik, språkutbildning och informationsteknik (pp. 51-70). Nätuniversitetet. [pdf]Wik, P., Hjalmarsson, A., & Brusk, J. (2007). Computer Assisted Conversation Training for Second Language Learners. Proceedings of Fonetik, TMH-QPSR, 50(1), 57-60. [pdf]Wik, P., Hjalmarsson, A., & Brusk, J. (2007). DEAL A Serious Game For CALL Practicing Conversational Skills In The Trade Domain. In Proceedings of SLATE 2007. 
[abstract] [pdf]Abstract: This paper describes work in progress on DEAL, a spoken dialogue system under development at KTH. It is intended as a platform for exploring the challenges and potential benefits of combining elements from computer games, dialogue systems and language learning.Öhgren, S. (2007). Experiment with adaptation and vocal tract length normalization at automatic speech recognition of children's speech. Master's thesis, CSC. [pdf]2006Abou Zliekha, M., Al Moubayed, S., Al Dakkak, O., & Ghneim, N. (2006). Emotional Audio-Visual Arabic Text to Speech. In Proceedings of the XIV European Signal Processing Conference (EUSIPCO). Florence, Italy. [link]Agelfors, E., Beskow, J., Karlsson, I., Kewley, J., Salvi, G., & Thomas, N. (2006). User Evaluation of the SYNFACE Talking Head Telephone. Lecture Notes in Computer Science, 4061, 579-586. [abstract] [pdf]Abstract: The talking-head telephone, Synface, is a lip-reading support for people with hearing-impairment. It has been tested by 49 users with varying degrees of hearing impairment in the UK and Sweden in lab and home environments. Synface was found to give support to the users, especially in perceiving numbers and addresses and an enjoyable way to communicate. A majority deemed Synface to be a useful product.Al Dakkak, O., Ghneim, N., Abou Zliekha, M., & Al Moubayed, S. (2006). Prosodic Feature Introduction and Emotion Incorporation in an Arabic TTS. In Proceedings of IEEE International Conference on Information and Communication Technologies. Damascus, Syria. [pdf]Allwood, J., Cerrato, L., Jokinen, K., Navarretta, C., & Paggio, P. (2006). A coding scheme for the annotation of feedback, turn management and sequencing phenomena. Proceedings of the LREC2006 Workshop on Multimodal Corpora. From Multimodal Behaviour to Usable Models., 38-42.Beskow, J., Granström, B., & House, D. (2006). Focal accent and facial movements in expressive speech. 
In Fonetik 2006, Working Papers 52, General Linguistics and Phonetics, Lund University (pp. 9-12). [pdf]Beskow, J., Granström, B., & House, D. (2006). Visual correlates to prominence in several expressive modes. In Proceedings of Interspeech 2006 (pp. 1272–1275). Pittsburgh, PA. [pdf]Boye, J., Gustafson, J., & Wirén, M. (2006). Robust spoken language understanding in a computer game. Speech Communication, 48(3-4), 335-353. [pdf]Carlson, R., Edlund, J., Heldner, M., Hjalmarsson, A., House, D., & Skantze, G. (2006). Towards human-like behaviour in spoken dialog systems. In Proceedings of Swedish Language Technology Conference (SLTC 2006). Gothenburg, Sweden. [pdf]Carlson, R., Gustafson, K., & Strangert, E. (2006). Cues for Hesitation in Speech Synthesis. In Proceedings of Interspeech 06. Pittsburgh, USA. [pdf]Carlson, R., Gustafson, K., & Strangert, E. (2006). Prosodic Cues for Hesitation. Dept. of Linguistics & Phonetics Working Papers, 52, 21–24.Carlson, R., Gustafson, K., & Strangert, E. (2006). Modelling hesitation for synthesis of spontaneous speech. In Proceedings of Speech Prosody 2006. Dresden. [pdf]Cerrato, L., & D’Imperio, M. (2006). An investigation of the communicative functions of short expressions in Italian and Swedish. Unpublished manuscript.Edlund, J., & Heldner, M. (2006). /nailon/ - online analysis of prosody. In Working Papers 52: Proceedings of Fonetik 2006 (pp. 37-40). Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics. [abstract] [pdf]Abstract: This paper presents /nailon/ - a software package for online real-time prosodic analysis that captures a number of prosodic features relevant for interaction control in spoken dialogue systems. The current implementation captures silence durations; voicing, intensity, and pitch; pseudo-syllable durations; and intonation patterns. The paper provides detailed information on how this is achieved.Edlund, J., & Heldner, M. (2006). 
/nailon/ - software for online analysis of prosody. In Proc of Interspeech 2006 ICSLP (pp. 2022-2025). Pittsburgh PA, USA. [abstract] [pdf]Abstract: This paper presents /nailon/ - a software package for online real-time prosodic analysis that captures a number of prosodic features relevant for interaction control in spoken dialogue systems. The current implementation captures silence durations; voicing, intensity, and pitch; pseudosyllable durations; and intonation patterns. The paper provides detailed information on how this is achieved. As an example application of /nailon/, we demonstrate how it is used to improve the efficiency of identifying relevant places at which a machine can legitimately begin to talk to a human interlocutor, as well as to shorten system response times.Edlund, J., Heldner, M., & Gustafson, J. (2006). Two faces of spoken dialogue systems. In Interspeech 2006 - ICSLP Satellite Workshop Dialogue on Dialogues: Multidisciplinary Evaluation of Advanced Speech-based Interactive Systems. Pittsburgh PA, USA. [abstract] [pdf]Abstract: This paper is intended as a basis for discussion. We propose that users may, knowingly or subconsciously, interpret the events that occur when interacting with spoken dialogue systems in more than one way. Put differently, there is more than one metaphor people may use in order to make sense of spoken human-computer dialogue. We further suggest that different metaphors may not play well together. The analysis is consistent with many observations in human-computer interaction and has implications that may be helpful to researchers and developers alike. For example, developers may want to guide users towards a metaphor of their choice and ensure that the interaction is coherent with that metaphor; researchers may need different approaches depending on the metaphor employed in the system they study; and in both cases one would need to have very good reasons to use mixed metaphors.Ellis, N. (2006). 
Selective Attention and Transfer Phenomena in L2 Acquisition: Contingency, Cue Competition, Salience, Interference, Overshadowing, Blocking, and Perceptual Learning. Applied Linguistics, 27, 164-194.Engwall, O. (2006). Evaluation of speech inversion using an articulatory classifier. In Yehia, H., Demolin, D., & Laboissière, R. (Eds.), Proceedings of the Seventh International Seminar on Speech Production (pp. 469-476). Ubatuba, São Paulo, Brazil. [pdf]Engwall, O. (2006). Assessing MRI measurements: Effects of sustenation, gravitation and coarticulation. In Harrington, J., & Tabain, M. (Eds.), Speech production: Models, Phonetic Processes and Techniques (pp. 301-314). New York: Psychology Press. [pdf]Engwall, O., Bälter, O., Öster, A-M., & Kjellström, H. (2006). Designing the user interface of the computer-based speech training system ARTUR based on early user tests. Journal of Behaviour and Information Technology, 25(4), 353-365. [pdf]Engwall, O., Bälter, O., Öster, A-M., & Kjellström, H. (2006). Feedback management in the pronunciation training system ARTUR. In Proceedings of CHI 2006 (pp. 231-234). Montreal. [pdf]Engwall, O., Delvaux, V., & Metens, T. (2006). Interspeaker Variation in the Articulation of French Nasal Vowels. In Proceedings of the Seventh International Seminar on Speech Production (pp. 3-10). Ubatuba, São Paulo, Brazil. [pdf]Granström, B., & House, D. (2006). Measuring and modeling audiovisual prosody for animated agents. In Proceedings of Speech Prosody 2006. Dresden. [pdf]Heldner, M., & Edlund, J. (2006). Prosodic cues for interaction control in spoken dialogue systems. In Working Papers 52: Proceedings of Fonetik 2006 (pp. 53-56). Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics. 
[abstract] [pdf]Abstract: This paper discusses the feasibility of using prosodic features for interaction control in spoken dialogue systems, and points to experimental evidence that automatically extracted prosodic features can be used to improve the efficiency of identifying relevant places at which a machine can legitimately begin to talk to a human interlocutor, as well as to shorten system response times.Heldner, M., Edlund, J., & Carlson, R. (2006). Interruption impossible. In Bruce, G., & Horne, M. (Eds.), Nordic Prosody, Proceedings of the IXth Conference, Lund 2004 (pp. 97-105). Frankfurt am Main, Germany. [abstract] [pdf]Abstract: Most current work on spoken human-computer interaction has so far concentrated on interactions between a single user and a dialogue system. The advent of ideas of the computer or dialogue system as a conversational partner in a group of humans, for example within the CHIL-project and elsewhere (e.g. Kirchhoff & Ostendorf, 2003), introduces new requirements on the capabilities of the dialogue system. Among other things, the computer as a participant in a multi-part conversation has to appreciate the human turn-taking system, in order to time its own interjections appropriately. As the role of a conversational computer is likely to be to support human collaboration, rather than to guide or control it, it is particularly important that it does not interrupt or disturb the human participants. The ultimate goal of the work presented here is to predict suitable places for turn-takings, as well as positions where it is impossible for a conversational computer to interrupt without irritating the human interlocutors.House, D. (2006). On the interaction of audio and visual cues to friendliness in interrogative prosody. In Proceedings of The Nordic Conference on Multimodal Communication, 2005 (pp. 201-213). Göteborg.House, D. (2006). Perception and production of phrase-final intonation in Swedish questions. In Bruce, G., & Horne, M. 
(Eds.), Nordic Prosody, Proceedings of the IXth Conference, Lund 2004 (pp. 127-136). Frankfurt am Main: Peter Lang.
Höglind, D. (2006). Texture-based expression modelling for a virtual talking head. Master's thesis, KTH. [pdf]
Jande, P-A. (2006). Integrating Linguistic Information from Multiple Sources in Lexicon Development and Spoken Language Annotation. In Proceedings of the LREC workshop on merging and layering linguistic information (pp. 1-8). Genoa, Italy. [pdf]
Jande, P-A. (2006). Modelling Phone-Level Pronunciation in Discourse Context. Doctoral dissertation. [pdf]
Jande, P-A. (2006). Modelling Pronunciation in Discourse Context. In Proceedings of Fonetik (pp. 7-9). Lund, Sweden. [pdf]
Kjellström, H., Engwall, O., & Bälter, O. (2006). Reconstructing Tongue Movements from Audio and Video. In Proc of Interspeech 2006 (pp. 2238–2241). Pittsburgh. [pdf]
Lidestam, B., & Beskow, J. (2006). Motivation and appraisal in perception of poorly specified speech. Scandinavian Journal of Psychology, 47(2), 93-101. [pdf]
Melin, H. (2006). Automatic speaker verification on site and by telephone: methods, applications and assessment. Doctoral dissertation, KTH. [pdf]
Neiberg, D., Elenius, K., Karlsson, I., & Laskowski, K. (2006). Emotion Recognition in Spontaneous Speech. In Working Papers 52: Proceedings of Fonetik 2006 (pp. 101-104). Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics. [pdf]
Neiberg, D., Elenius, K., & Laskowski, K. (2006). Emotion Recognition in Spontaneous Speech Using GMMs. In INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing. Pittsburgh, PA, USA. [pdf]
Salvi, G. (2006). Dynamic behaviour of connectionist speech recognition with strong latency constraints. Speech Communication, 48(7), 802-818. [abstract] [pdf]
Abstract: This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints.
The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency.
Salvi, G. (2006). Mining Speech Sounds, Machine Learning Methods for Automatic Speech Recognition and Analysis. Doctoral dissertation, KTH, School of Computer Science and Communication. [pdf]
Salvi, G. (2006). Segment boundaries in low latency phonetic recognition. Lecture Notes in Computer Science, 3817, 267-276. [abstract] [pdf]
Abstract: The segment boundaries produced by the Synface low latency phoneme recogniser are analysed. The precision in placing the boundaries is an important factor in the Synface system as the aim is to drive the lip movements of a synthetic face for lip-reading support. The recogniser is based on a hybrid of recurrent neural networks and hidden Markov models. In this paper we analyse how the look-ahead length in the Viterbi-like decoder affects the precision of boundary placement. The properties of the entropy of the posterior probabilities estimated by the neural network are also investigated in relation to the distance of the frame from a phonetic transition.
Salvi, G. (2006). Segment boundary detection via class entropy measurements in connectionist phoneme recognition. Speech Communication, 48(12), 1666-1676.
[abstract] [pdf]
Abstract: This article investigates the possibility to use the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. The rationale is that the value of the entropy should increase in proximity of a transition between two segments that are well modelled (known) by the recognition network since it is a measure of uncertainty. The advantage of this measure is its simplicity as the posterior probabilities of each class are available in connectionist phoneme recognition. The entropy and a number of measures based on differentiation of the entropy are used in isolation and in combination. The decision methods for predicting the boundaries range from simple thresholds to neural network based procedures. The different methods are compared with respect to their precision, measured in terms of the ratio between the number C of predicted boundaries within 10 or 20 ms of the reference and the total number of predicted boundaries, and recall, measured as the ratio between C and the total number of reference boundaries.
Skantze, G., Edlund, J., & Carlson, R. (2006). Talking with Higgins: Research challenges in a spoken dialogue system. In André, E., Dybkjaer, L., Minker, W., Neumann, H., & Weber, M. (Eds.), Perception and Interactive Technologies (pp. 193-196). Berlin/Heidelberg: Springer. [abstract] [pdf]
Abstract: This paper presents the current status of the research in the Higgins project and provides background for a demonstration of the spoken dialogue system implemented within the project. The project represents the latest development in the ongoing dialogue systems research at KTH. The practical goal of the project is to build collaborative conversational dialogue systems in which research issues such as error handling techniques can be tested empirically.
Skantze, G., House, D., & Edlund, J. (2006). Grounding and prosody in dialog. In Working Papers 52: Proceedings of Fonetik 2006 (pp.
117-120). Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics. [abstract] [pdf]
Abstract: In a previous study we demonstrated that subjects could use prosodic features (primarily peak height and alignment) to make different interpretations of synthesized fragmentary grounding utterances. In the present study we test the hypothesis that subjects also change their behavior accordingly in a human-computer dialog setting. We report on an experiment in which subjects participate in a color-naming task in a Wizard-of-Oz controlled human-computer dialog in Swedish. The results show that two annotators were able to categorize the subjects' responses based on pragmatic meaning. Moreover, the subjects' response times differed significantly, depending on the prosodic features of the grounding fragment spoken by the system.
Skantze, G., House, D., & Edlund, J. (2006). User responses to prosodic variation in fragmentary grounding utterances in dialogue. In Proceedings of Interspeech 2006 - ICSLP (pp. 2002-2005). Pittsburgh PA, USA. [abstract] [pdf]
Abstract: In this paper, actual user responses to fragmentary grounding utterances in Swedish human-computer dialog are investigated. Building on a previous study which demonstrated that listeners could use prosodic features (primarily peak height and alignment) to make different interpretations of such utterances, we now report on an experiment in which subjects participate in a color-naming task in a Wizard-of-Oz controlled human-computer dialog setting. The results show that two annotators were able to categorize the subjects' responses based on pragmatic meaning. Moreover, the subjects' response times differed significantly, depending on the prosodic features of the grounding fragment spoken by the system.
Svanfeldt, G. (2006). Expressiveness in Virtual Talking Faces. Licentiate dissertation.
Svantesson, J-O., & House, D. (2006). Tone production, tone perception and Kammu tonogenesis.
Phonology, 23, 309-333. [pdf]
Wallers, Å., Edlund, J., & Skantze, G. (2006). The effects of prosodic features on the interpretation of synthesised backchannels. In André, E., Dybkjaer, L., Minker, W., Neumann, H., & Weber, M. (Eds.), Proceedings of Perception and Interactive Technologies (pp. 183-187). Springer. [abstract] [pdf]
Abstract: A study of the interpretation of prosodic features in backchannels (Swedish /a/ and /m/) produced by speech synthesis is presented. The study is part of work-in-progress towards endowing conversational spoken dialogue systems with the ability to produce and use backchannels and other feedback.
Öster, A-M. (2006). Computer-Based Speech Therapy Using Visual Feedback with Focus on Children with Profound Hearing Impairments. Doctoral dissertation, KTH/TMH.

2005

Carlson, R., Hirschberg, J., & Swerts, M. (Eds.). (2005). Special Issue on Error handling in spoken dialogue systems. Speech Communication, 45(3).
Al Dakkak, O., Ghneim, N., Abou Zliekha, M., & Al Moubayed, S. (2005). Emotional Inclusion in An Arabic Text-To-Speech. In Proceedings of the 13th European Signal Processing Conference (EUSIPCO). Antalya, Turkey. [pdf]
Allwood, J., Cerrato, L., Jokinen, K., Navarretta, C., & Paggio, P. (2005). The MUMIN annotation scheme for feedback, turn management and sequencing. In Gothenburg papers in Theoretical Linguistics 92: Proceedings from The Second Nordic Conference on Multimodal Communication (pp. 91-109). Göteborg University, Sweden.
Batliner, A., Blomberg, M., D’Arcy, S., Elenius, D., Giuliani, D., Gerosa, M., Hacker, C., Russell, M., Steidl, S., & Wong, M. (2005). The PF STAR Children’s Speech Corpus. In Proc Interspeech 2005. [pdf]
Beskow, J., Edlund, J., & Nordstrand, M. (2005). A model for multi-modal dialogue system output applied to an animated talking head. In Minker, W., Bühler, D., & Dybkjaer, L. (Eds.), Spoken Multimodal Human-Computer Dialogue in Mobile Environments, Text, Speech and Language Technology (pp. 93-113).
Dordrecht, The Netherlands: Kluwer Academic Publishers. [abstract] [pdf]
Abstract: We present a formalism for specifying verbal and non-verbal output from a multi-modal dialogue system. The output specification is XML-based and provides information about communicative functions of the output, without detailing the realisation of these functions. The aim is to let dialogue systems generate the same output for a wide variety of output devices and modalities. The formalism was developed and implemented in the multi-modal spoken dialogue system AdApt. We also describe how facial gestures in the 3D-animated talking head used within this system are controlled through the formalism.
Beskow, J., & Nordenberg, M. (2005). Data-driven Synthesis of Expressive Visual Speech using an MPEG-4 Talking Head. In Proceedings of Interspeech 2005. Lisbon. [pdf]
Boye, J., & Gustafson, J. (2005). How to do dialogue in a fairy-tale world. In 6th SIGdial Workshop on Discourse and Dialogue. [pdf]
Bälter, O., Engwall, O., Öster, A-M., & Kjellström, H. (2005). Wizard-of-Oz Test of ARTUR - a Computer-Based Speech Training System with Articulation Correction. In Proceedings of the Seventh International ACM SIGACCESS Conference on Computers and Accessibility (pp. 36-43). Baltimore. [pdf]
Carlson, R., & Granström, B. (2005). Data-driven multimodal synthesis. Speech Communication, 47(1-2), 182-193.
Carlson, R., Hirschberg, J., & Swerts, M. (2005). Cues to upcoming Swedish prosodic boundaries: Subjective judgment studies and acoustic correlates. Speech Communication, 46, 326-333.
Cerrato, L. (2005). Linguistic functions of head nods. In Allwood, J., & Dorriots, B. (Eds.), Gothenburg papers in Theoretical Linguistics 92: Proceedings from The Second Nordic Conference on Multimodal Communication (pp. 137-152). Göteborg University, Sweden.
Cerrato, L. (2005). On the acoustic, prosodic and gestural characteristics of “m-like” sounds in Swedish.
Gothenburg papers in theoretical linguistics, Feedback in Spoken Interaction - Nordtalk Symposium 2003, 18-31.
Cerrato, L. (2005). The communicative function of "sì" in Italian and "ja" in Swedish: an acoustic analysis. In Proceedings of Fonetik 2005 (pp. 41-44). Göteborg.
Cerrato, L., & Svanfeldt, G. (2005). A method for the detection of communicative head nods in expressive speech. In Allwood, J., Dorriots, B., & Nicholson, S. (Eds.), Gothenburg papers in Theoretical Linguistics 92: Proceedings from The Second Nordic Conference on Multimodal Communication (pp. 153-165). Göteborg University, Sweden.
Edlund, J., & Heldner, M. (2005). Exploring prosody in interaction control. Phonetica, 62(2-4), 215-226. [abstract] [pdf]
Abstract: This paper investigates prosodic aspects of turn-taking in conversation with a view to improving the efficiency of identifying relevant places at which a machine can legitimately begin to talk to a human interlocutor. It examines the relationship between interaction control, the communicative function of which is to regulate the flow of information between interlocutors, and its phonetic manifestation. Specifically, the listener's perception of such interaction control phenomena is modelled. Algorithms for automatic online extraction of prosodic phenomena liable to be relevant for interaction control, such as silent pauses and intonation patterns, are presented and evaluated in experiments using Swedish Map Task data. We show that the automatically extracted prosodic features can be used to avoid many of the places where current dialogue systems run the risk of interrupting their users, and also to identify suitable places to take the turn.
Edlund, J., Heldner, M., & Gustafson, J. (2005). Utterance segmentation and turn-taking in spoken dialogue systems. In Fisseni, B., Schmitz, H-C., Schröder, B., & Wagner, P. (Eds.), Computer Studies in Language and Speech (pp. 576-587). Frankfurt am Main, Germany: Peter Lang.
[abstract] [pdf]
Abstract: A widely used method for finding places to take the turn in spoken dialogue systems is to assume that an utterance ends where the user ceases to speak. Such endpoint detection normally triggers on a certain amount of silence, or non-speech. However, spontaneous speech frequently contains silent pauses inside sentence-like units, for example when the speaker hesitates. This paper presents /nailon/, an on-line, real-time prosodic analysis tool, and a number of experiments in which end-point detection has been augmented with prosodic analysis in order to segment the speech signal into what humans intuitively perceive as utterance-like units.
Edlund, J., & Hjalmarsson, A. (2005). Applications of distributed dialogue systems: the KTH Connector. In Proceedings of ISCA Tutorial and Research Workshop on Applied Spoken Language Interaction in Distributed Environments (ASIDE 2005). Aalborg, Denmark. [abstract] [pdf]
Abstract: We describe a spoken dialogue system domain: that of the personal secretary. This domain allows us to capitalise on the characteristics that make speech a unique interface; characteristics that humans use regularly, implicitly, and with remarkable ease. We present a prototype system - the KTH Connector - and highlight several dialogue research issues arising in the domain.
Edlund, J., House, D., & Skantze, G. (2005). Prosodic Features in the Perception of Clarification Ellipses. In Proceedings of Fonetik 2005. Gothenburg, Sweden. [abstract] [pdf]
Abstract: We present an experiment where subjects were asked to listen to Swedish human-computer dialogue fragments where a synthetic voice makes an elliptical clarification after a user turn. The prosodic features of the synthetic voice were systematically varied, and subjects were asked to judge the computer's actual intention.
The results show that an early low F0 peak signals acceptance, that a late high peak is perceived as a request for clarification of what was said, and that a mid high peak is perceived as a request for clarification of the meaning of what was said. The study can be seen as the beginnings of a tentative model for intonation of clarification ellipses in Swedish, which can be implemented and tested in spoken dialogue systems.
Edlund, J., House, D., & Skantze, G. (2005). The effects of prosodic features on the interpretation of clarification ellipses. In Proceedings of Interspeech 2005 (pp. 2389-2392). Lisbon, Portugal. [abstract] [pdf]
Abstract: In this paper, the effects of prosodic features on the interpretation of elliptical clarification requests in dialogue are studied. An experiment is presented where subjects were asked to listen to short human-computer dialogue fragments in Swedish, where a synthetic voice was making an elliptical clarification after a user turn. The prosodic features of the synthetic voice were systematically varied, and the subjects were asked to judge what was actually intended by the computer. The results show that an early low F0 peak signals acceptance, that a late high peak is perceived as a request for clarification of what was said, and that a mid high peak is perceived as a request for clarification of the meaning of what was said. The study can be seen as the beginnings of a tentative model for intonation of clarification ellipses in Swedish, which can be implemented and tested in spoken dialogue systems.
Elenius, D., & Blomberg, M. (2005). Adaptation and Normalization Experiments in Speech Recognition for 4 to 8 Year old Children. In Proc Interspeech 2005. [pdf]
Engwall, O. (2005). Articulatory synthesis using corpus-based estimation of line spectrum pairs. In Proceedings of Interspeech 2005. Lisbon, Portugal. [pdf]
Engwall, O. (2005). Introducing visual cues in acoustic-to-articulatory inversion. In Proceedings of Interspeech 2005.
Lisbon, Portugal. [pdf]
Eriksson, E., Bälter, O., Engwall, O., Öster, A-M., & Kjellström, H. (2005). Design Recommendations for a Computer-Based Speech Training System Based on End-User Interviews. In Proceedings of the Tenth International Conference on Speech and Computers (pp. 483-486). Patras, Greece. [pdf]
Granström, B., & House, D. (2005). Audiovisual representation of prosody in expressive speech communication. Speech Communication, 46, 473-484.
Granström, B., & House, D. (2005). Effective Interaction with Talking Animated Agents in Dialogue Systems. In van Kuppevelt, J., Dybkjaer, L., & Bernsen, N. O. (Eds.), Advances in Natural Multimodal Dialogue Systems (pp. 215-243). Springer, Dordrecht, The Netherlands.
Gustafson, J., Boye, J., Fredriksson, M., Johannesson, L., & Königsmann, J. (2005). Providing computer game characters with conversational abilities. In Proceedings of Intelligent Virtual Agent (IVA05). Kos, Greece. [pdf]
Hincks, R. (2005). Computer Support for Learners of Spoken English. Doctoral dissertation, School of Computer Science and Communication. [abstract] [pdf]
Abstract: This thesis concerns the use of speech technology to support the process of learning the English language. It applies theories of computer-assisted language learning and second language acquisition to address the needs of beginning, intermediate and advanced students of English for specific purposes. The thesis includes an evaluation of speech-recognition-based pronunciation software, based on a controlled study of a group of immigrant engineers. The study finds that while the weaker students may have benefited from their software practice, the pronunciation ability of the better students did not improve. The linguistic needs of advanced and intermediate Swedish-native students of English are addressed in a study using multimodal speech synthesis in an interactive exercise demonstrating differences in the placement of lexical stress in two Swedish-English cognates.
A speech database consisting of 28 ten-minute oral presentations made by these learners is described, and an analysis of pronunciation errors is presented. Eighteen of the presentations are further analyzed with regard to the normalized standard deviation of fundamental frequency over 10-second long samples of speech, termed pitch variation quotient (PVQ). The PVQ is found to range from 6% to 34% in samples of speech, with mean levels of PVQ per presentation ranging from 11% to 24%. Males are found to use more pitch variation than females. Females who are more proficient in English use more pitch variation than the less proficient females. A perceptual experiment tests the relationship between PVQ and impressions of speaker liveliness. An overall correlation of .83 is found. Temporal variables in the presentation speech are also studied. A bilingual database where five speakers make the same presentation in both English and Swedish is studied to examine effects of using a second language on presentation prosody. Little intra-speaker difference in pitch variation is found, but these speakers speak on average 20% faster when using their native language. The thesis concludes with a discussion of how the results could be applied in a proposed feedback mechanism for practicing and assessing oral presentations, conceptualized as a ‘speech checker.’ Potential users of the system would include native as well as non-native speakers of English.
Hincks, R. (2005). Measures and perceptions of liveliness in student presentation speech: A proposal for an automatic feedback mechanism. System, 33(4), 575-591. [abstract] [002]
Abstract: This paper analyzes prosodic variables in a corpus of eighteen oral presentations made by students of Technical English, all of whom were native speakers of Swedish.
The focus is on the extent to which speakers were able to use their voices in a lively manner, and the hypothesis tested is that speakers who had high pitch variation as they spoke would be perceived as livelier speakers. A metric (termed PVQ), derived from the standard deviation in fundamental frequency, is proposed as a measure of pitch variation. Composite listener ratings of liveliness for nine 10-s samples of speech per speaker correlate strongly (r = .83, n = 18, p < .01) with the PVQ metric. Liveliness ratings for individual 10-s samples of speech show moderate but significant (n = 81, p < .01) correlations: r = .70 for males and r = .64 for females. The paper also investigates rate of speech and fluency variables in this corpus of L2 English. An application for this research is in presentation skills training, where computer feedback could be provided for speaking rate and the extent to which speakers have been able to use their voices in an engaging manner.
Hincks, R. (2005). Measuring liveliness in presentation speech. In Proceedings of Interspeech 2005 (pp. 765-768). Lisbon. [abstract] [pdf]
Abstract: This paper proposes that speech analysis be used to quantify prosodic variables in presentation speech, and reports the results of a perception test of speaker liveliness. The test material was taken from a corpus of oral presentations made by 18 Swedish native students of Technical English. Liveliness ratings from a panel of eight judges correlated strongly with normalized standard deviation of F0 and, for female speakers, with mean length of runs, which is the number of syllables between pauses of >250 ms. An application of these findings would be in the development of a feedback mechanism for the prosody of public speaking.
Hincks, R. (2005). Presenting in English and Swedish. In Proceedings of Fonetik 2005 (pp. 45-48). Göteborg.
[abstract] [pdf]
Abstract: This paper reports on a comparison of prosodic variables from oral presentations in a first and second language. Five Swedish natives who speak English at the advanced-intermediate level were recorded as they made the same presentation twice, once in English and once in Swedish. Though it was expected that speakers would use more pitch variation when they spoke Swedish, three of the five speakers showed no significant difference between the two languages. All speakers spoke more quickly in Swedish, the mean being 20% faster.
Hjalmarsson, A. (2005). Towards user modelling in conversational dialogue systems: A qualitative study of the dynamics of dialogue parameters. In Proceedings of Interspeech 2005 (pp. 869-872). Lisbon, Portugal. [abstract] [pdf]
Abstract: This paper presents a qualitative study of data from a 26 subject experimental study within the multimodal, conversational dialogue system AdApt. Qualitative analysis of data is used to illustrate the dynamic variation of dialogue parameters over time. The analysis will serve as a foundation for research and future data collections in the area of adaptive dialogue systems and user modelling.
House, D. (2005). Fonetiska undersökningar av kammu. In Lundström, H., & Svantesson, J-O. (Eds.), Kammu - om ett folk i Laos (pp. 164-167). Lund: Lunds universitetshistoriska sällskap.
House, D. (2005). Phrase-final rises as a prosodic feature in wh-questions in Swedish human–machine dialogue. Speech Communication, 46, 268-283. [abstract] [pdf]
Abstract: This paper examines the extent to which optional final rises occur in a set of 200 wh-questions extracted from a large corpus of computer-directed spontaneous speech in Swedish and discusses the function these rises may have in signalling dialogue acts and speaker attitude over and beyond an information question. Final rises occurred in 22% of the utterances, primarily in conjunction with final focal accent.
Children exhibited the largest percentage of final rises (32%), with women second (27%) and men lowest (17%). The distribution of the rises in the material is examined and evidence relating to the final rise as a signal of a social interaction oriented dialogue act is gathered from the distribution. Two separate perception tests were carried out to test the hypothesis that high and late focal accent peaks in a wh-question are perceived as friendlier and more socially interested than low and early peaks. Generally, the results were consistent with these hypotheses when the late peaks were in phrase-final position. Finally, the results of this study are discussed in terms of pragmatic and attitudinal meanings and biological codes.
Imboden, S., Petrone, M., Quadrani, P., Zannoni, C., Mayoral, R., Clapworthy, G. J., Testi, D., Viceconti, M., Neiberg, D., Tsagarakis, N. G., & Caldwell, D. (2005). A Haptic Enabled Multimodal Pre-Operative Planner for Hip Arthroplasty. In WorldHaptics Conference. Pisa, Italy. [pdf]
Jande, P-A. (2005). Annotating Speech Data for Pronunciation Variation Modelling. In Proceedings of Fonetik (pp. 25-27). Göteborg, Sweden. [pdf]
Jande, P-A. (2005). Inducing Decision Tree Pronunciation Variation Models from Annotated Speech Data. In Proceedings of Interspeech (pp. 4-8). Lisbon, Portugal. [pdf]
Johnson, W., Vilhjalmsson, H., & Marsella, S. (2005). Serious games for language learning: How much game, how much AI? In 12th International Conference on Artificial Intelligence in Education. Amsterdam.
Nordenberg, M., Svanfeldt, G., & Wik, P. (2005). Artificial Gaze - Perception experiment of eye gaze in synthetic faces. In Proceedings from the Second Nordic Conference on Multimodal Communication. [pdf]
Oppelstrup, L., Blomberg, M., & Elenius, D. (2005). Scoring Children's Foreign Language Pronunciation. In Proc FONETIK 2005. Department of Linguistics, Göteborg University. [pdf]
Salvi, G. (2005). Advances in regional accent clustering in Swedish.
In Proceedings of European Conference on Speech Communication and Technology (Eurospeech) (pp. 2841-2844). Lisbon, Portugal. [abstract] [pdf]
Abstract: The regional pronunciation variation in Swedish is analysed on a large database. Statistics over each phoneme and for each region of Sweden are computed using the EM algorithm in a hidden Markov model framework to overcome the difficulties of transcribing the whole set of data at the phonetic level. The model representations obtained this way are compared using a distance measure in the space spanned by the model parameters, and hierarchical clustering. The regional variants of each phoneme may group with those of any other phoneme, on the basis of their acoustic properties. The log likelihood of the data given the model is shown to display interesting properties regarding the choice of number of clusters, given a particular level of detail. Discriminative analysis is used to find the parameters that most contribute to the separation between groups, adding an interpretative value to the discussion. Finally a number of examples are given on some of the phenomena that are revealed by examining the clustering tree.
Salvi, G. (2005). Ecological language acquisition via incremental model-based clustering. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech) (pp. 1181-1184). Lisbon, Portugal. [abstract] [pdf]
Abstract: We analyse the behaviour of Incremental Model-Based Clustering on child-directed speech data, and suggest a possible use of this method to describe the acquisition of phonetic classes by an infant. The effects of two factors are analysed, namely the number of coefficients describing the speech signal, and the frame length of the incremental clustering procedure. The results show that, although the number of predicted clusters varies in different conditions, the classifications obtained are essentially consistent.
Different classifications were compared using the variation of information measure.
Salvi, G. (2005). Segment Boundaries in Low Latency Phonetic Recognition. In Proceedings of Non Linear Speech Processing (NOLISP). Barcelona, Spain.
Skantze, G. (2005). Exploring human error recovery strategies: implications for spoken dialogue systems. Speech Communication, 45(3), 325-341. [abstract] [pdf]
Abstract: In this study, an explorative experiment was conducted in which subjects were asked to give route directions to each other in a simulated campus (similar to Map Task). In order to elicit error handling strategies, a speech recogniser was used to corrupt the speech in one direction. This way, data could be collected on how the subjects might recover from speech recognition errors. This method for studying error handling has the advantages that the level of understanding is transparent to the analyser, and the errors that occur are similar to errors in spoken dialogue systems. The results show that when subjects face speech recognition problems, a common strategy is to ask task-related questions that confirm their hypothesis about the situation instead of signalling non-understanding. Compared to other strategies, such as asking for a repetition, this strategy leads to better understanding of subsequent utterances, whereas signalling non-understanding leads to decreased experience of task success.
Skantze, G. (2005). Galatea: a discourse modeller supporting concept-level error handling in spoken dialogue systems. In Proceedings of SigDial (pp. 178-189). Lisbon, Portugal. [abstract] [pdf]
Abstract: In this paper, a discourse modeller for conversational spoken dialogue systems, called GALATEA, is presented. Apart from handling the resolution of ellipses and anaphora, it tracks the “grounding status” of concepts that are mentioned during the discourse, i.e. information about who said what when.
This grounding information also contains concept confidence scores that are derived from the speech recogniser word confidence scores. The discourse model may then be used for concept-level error handling, i.e. grounding of concepts, fragmentary clarification requests, and detection of erroneous concepts in the model at later stages in the dialogue.
Svanfeldt, G., & Olszewski, D. (2005). Perception experiment combining a parametric loudspeaker and a synthetic talking head. In Proceedings of Interspeech (pp. 1721-1724).
Testi, D., Zannoni, C., Caldwell, D., Neiberg, D., Clapworthy, G., & Viceconti, M. (2005). An innovative multisensorial environment for pre-operative planning of total hip replacement. In 5th Annual Meeting of Computer Assisted Orthopaedic Surgery. Helsinki, Finland.

2004

Beskow, J. (2004). Trainable articulatory control models for visual speech synthesis. Journal of Speech Technology, 4(7), 335-349. [pdf]
Beskow, J., Cerrato, L., Cosi, P., Costantini, E., Nordstrand, M., Pianesi, F., Prete, M., & Svanfeldt, G. (2004). Preliminary cross-cultural evaluation of expressiveness in synthetic faces. In André, E., Dybkjaer, L., Minker, W., & Heisterkampf, P. (Eds.), Proc Tutorial and Research Workshop on Affective Dialogue Systems, ADS'04 (pp. 240-243). Kloster Irsee, Germany. [pdf]
Beskow, J., Cerrato, L., Granström, B., House, D., Nordenberg, M., Nordstrand, M., & Svanfeldt, G. (2004). Expressive Animated Agents for Affective Dialogue Systems. In Proc Tutorial and Research Workshop on Affective Dialogue Systems, ADS'04. Kloster Irsee, Germany. [pdf]
Beskow, J., Cerrato, L., Granström, B., House, D., Nordstrand, M., & Svanfeldt, G. (2004). The Swedish PF-Star multimodal corpora. In Proc LREC Workshop on Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces (pp. 34-37). Lisboa. [pdf]
Beskow, J., Karlsson, I., Kewley, J., & Salvi, G. (2004).
SYNFACE - A talking head telephone for the hearing-impaired. In Miesenberger, K., Klaus, J., Zagler, W., & Burger, D. (Eds.), Computers Helping People with Special Needs (pp. 1178-1186). Springer-Verlag. [abstract] [pdf]Abstract: SYNFACE is a telephone aid for hearing-impaired people that shows the lip movements of the speaker at the other telephone synchronised with the speech. The SYNFACE system consists of a speech recogniser that recognises the incoming speech and a synthetic talking head. The output from the recogniser is used to control the articulatory movements of the synthetic head. SYNFACE prototype systems exist for three languages: Dutch, English and Swedish and the rst user trials have just started.Blomberg, M., Elenius, D., & Zetterholm, E. (2004). Speaker verification scores and acoustic analysis of a professional impersonator. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 84-87). Stockholm University. [pdf]Boye, J., Wiren, M., & Gustafson, J. (2004). Contextual reasoning in multimodal dialogue systems: two case studies. In Proceedings of The 8th Workshop on the Semantics and Pragmatics of Dialogue Catalogue'04 (pp. 19-21). Barcelona. [pdf]Carlson, R., Elenius, K., & Swerts, M. (2004). Perceptual judgments of pitch range. In Bel, B., & Marlin, I. (Eds.), Proc. of Intl Conference on Speech Prosody 2004 (pp. 689-692). Nara, Japan. [pdf]Carlson, R., Hirschberg, J., & Swerts, M. (2004). Prediction of upcoming Swedish prosodic boundaries by Swedish and American listeners. In Bel, B., & Marlin, I. (Eds.), Proc of Intl Conference on Speech Prosody 2004 (pp. 329-332). Nara, Japan. [pdf]Cerrato, L. (2004). A coding scheme for the annotation of feedback phenomena in conversational speech. In Martin, J. (Ed.), Proc of LREC Workshop on Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces (pp. 25-28). Lisboa.Cerrato, L. (2004). 
A comparative study of verbal feedback in Italian and Swedish map-task dialogues. In Copenhagen, P., & Hernrichsen, J. (Eds.), Proceedings of the Nordic Symposium on the comparison of spoken languages, Copenhagen Working Papers in LSP (pp. 99-126). Cerrato, L., & Ekeklint, S. (2004). Evaluating users reactions to human-like interfaces: Prosodic and paralinguistic features as new evaluation measures for users satisfaction. In Ruttkay, Z., & Pelachaud, C. (Eds.), Kluwer's Human-Computer Interaction Series: From Brows to Trust Evaluating Embodied Conversational Agents (pp. 101-124). Edlund, J., Skantze, G., & Carlson, R. (2004). Higgins - a spoken dialogue system for investigating error handling techniques. In Proceedings of the International Conference on Spoken Language Processing, ICSLP 04 (pp. 229-231). Jeju, Korea. [abstract] [pdf]Abstract: In this paper, an overview of the Higgins project and the research within the project is presented. The project incorporates studies of error handling for spoken dialogue systems on several levels, from processing to dialogue level. A domain in which a range of different error types can be studied has been chosen: pedestrian navigation and guiding. Several data collections within Higgins have been analysed along with data from Higgins' predecessor, the AdApt system. The error handling research issues in the project are presented in light of these analyses.Elenius, D., & Blomberg, M. (2004). Comparing speech recognition for adults and children. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 156-159). Stockholm University. [pdf]Engwall, O. (2004). From real-time MRI to 3D tongue movements. In Kim, S. H., & Young, D. H. (Eds.), Proc ICSLP 2004 (pp. 1109-1112). Jeju Island, Korea. [pdf]Engwall, O. (2004). Speaker adaptation of a three-dimensional tongue model. In Kim, S. H., & Young, D. H. (Eds.), Proc ICSLP 2004 (pp. 465-468). Jeju Island, Korea. [pdf]Engwall, O., Wik, P., Beskow, J., & Granström, G. 
(2004). Design strategies for a virtual language tutor. In Kim, S. H., & Young, D. H. (Eds.), Proc ICSLP 2004 (pp. 1693-1696). Jeju Island, Korea. [pdf]Granström, B. (2004). Towards a virtual language tutor. In Delmonte, R., Delcloque, P., & Tonelli, S. (Eds.), Proc InSTIL/ICALL2004 NLP and Speech Technologies in Advanced Language Learning (pp. 1-8 (Invited paper)). Venice, Italy. [pdf]Granström, B., & House, D. (2004). Audiovisual representation of prosody in expressive speech communication. In Bel, B., & Marlin, I. (Eds.), Proc of Intl Conference on Speech Prosody 2004 (pp. 393-396). Nara, Japan. [pdf]Gustafson, J., Bell, L., Boye, J., Lindström, A., & Wirén, M. (2004). The NICE fairy-tale game system. In Proceedings of SIGdial. Boston. [pdf]Gustafson, J., & Sjölander, K. (2004). Voice creations for conversational fairy-tale characters. In Proc 5th ISCA speech synthesis workshop (pp. 145-150). Pittsburgh. [pdf]Heldner, M., Edlund, J., & Björkenstam, T. (2004). Automatically extracted F0 features as acoustic correlates of prosodic boundaries. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 52-55). Stockholm University. [abstract] [pdf]Abstract: This work presents preliminary results of an investigation of various automatically extracted F0 features as acoustic correlates of prosodic boundaries. The F0 features were primarily intended to capture phenomena such as boundary tones, F0 resets across boundaries and position in the speaker's F0 range. While there were no correspondences between boundary tones and boundaries, the reset and range features appeared to separate boundaries from no boundaries fairly well.Hincks, R. (2004). Processing the prosody of oral presentations. In Delmonte, R., Delcloque, P., & Tonelli, S. (Eds.), Proc InSTIL/ICALL2004 NLP and Speech Technologies in Advanced Language Learning (pp. 63-66). Venice, Italy. [abstract] [pdf]Abstract: Standard advice to people preparing to speak in public is to use a “lively” voice. 
A lively voice is described as one that varies in intonation, rhythm and loudness: qualities that can be analyzed using speech analysis software. This paper reports on a study analyzing pitch variation as a measure of speaker liveliness. A potential application of this approach for analysis would be for rehearsing or assessing the prosody of oral presentations. While public speaking can be intimidating even to native speakers, second language users are especially challenged, particularly when it comes to using their voices in a prosodically engaging manner. The material is a database of audio recordings of twenty 10-minute student oral presentations, where all speakers were college-age Swedes studying Technical English. The speech has been processed using the analysis software WaveSurfer for pitch extraction. Speaker liveliness has been measured as the standard deviation from the mean fundamental frequency over 10-second periods of speech. The standard deviations have been normalized (by division with the mean frequency) to obtain a value termed the pitch dynamism quotient (PDQ). Mean values (for ten minutes of speech) of PDQ per speaker range from a low of 0.11 to a high of 0.235. Individual values for 10-second segments range from lows of 0.06 to highs of 0.36.Hincks, R. (2004). Standard deviation of F0 in student monologue. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 132-135). Stockholm University. [abstract] [pdf]Abstract: Twenty ten-minute oral presentations made by Swedish students speaking English have been analyzed with respect to the standard deviation of F0 over long stretches of speech. Values have been normalized by division with the mean. Results show a strong correlation between proficiency in English and pitch variation for male speakers but not for females. The results also identify monotone and disfluent speakers.House, D. (2004). Final rises and Swedish question intonation. 
In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 56-59). Stockholm University. [pdf]House, D. (2004). Final rises in spontaneous Swedish computer-directed questions: incidence and function. In Bel, B., & Marlin, I. (Eds.), Proc of Intl Conference on Speech Prosody 2004 (pp. 115-118). Nara, Japan. [pdf]House, D. (2004). Pitch and alignment in the perception of tone and intonation. In Fant, G., Fujisaki, H., Cao, J., & Xu, Y. (Eds.), From Traditional Phonology to Modern Speech Processing (pp. 189-204). Beijing: Foreign Language Teaching and Research Press.House, D. (2004). Pitch and alignment in the perception of tone and intonation: pragmatic signals and biological codes. In Bel, B., & Marlein, I. (Eds.), Proc of International Symposium on Tonal Aspects of Languages: Emphasis on Tone Languages (pp. 93-96). Beijing, China.Hunnicutt, S., Nozadze, L., & Chikoidze, G. (2004). Russian word prediction with morphological support. In 5th International Symposium on Language, Logic and Computation. Tbilisi, Georgia. [pdf]Hunnicutt, S., & Zuurman, M. (2004). Comparison of vocabulary in several symbol sets and Voice of America word list. In 11th Biennial Conference of the International Society for Augmentative and Alternative Communication (extended abstract). Natal, Brazil.Jande, P-A. (2004). Pronunciation variation modelling using decision tree induction from multiple linguistic parameters. In Proceedings of Fonetik (pp. 12-15). Stockholm, Sweden. [pdf]Karnebäck, S. (2004). Spectro-temporal properties of the acoustic speech signal used for speech/music discrimination. Licentiate dissertation. [pdf]Kohler, K. J. (2004). Pragmatic and attitudinal meanings of pitch patterns in German syntactically marked questions. In Fant, G., Fujisaki, H., Cao, J., & Xu, Y. (Eds.), From traditional phonology to modern speech processing (pp. 205-214). Beijing: Foreign Language Teaching and Research Press.Lacerda, F., Sundberg, U., Carlson, R., & Holt, L. (2004). 
Modelling interactive language learning: a project presentation. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 60-63). Stockholm University. [pdf]Magnuson, T., & Hunnicutt, S. (2004). Aided text construction in an e-mail application for symbol users. In 11th Biennial Conference of the International Society for Augmentative and Alternative Communication (extended abstract). Natal, Brazil.Nordenberg, M., & Sundberg, J. (2004). Effect on LTAS of vocal loudness variation. Logopedics Phoniatrics Vocology, 29, 183-191.Nordstrand, M., Svanfeldt, G., Granström, B., & House, D. (2004). Measurements of articulatory variation in expressive speech for a set of Swedish vowels. J Speech Communication - Special Issue on Audio Visual Speech Processing, 1-4(44), 187-196.Pakucs, B. (2004). Butler: a universal speech interface for mobile environments. In Brewster, S., & Dunlop, M. (Eds.), Lecture notes in Computer Science 3160. Mobile HCI 04. 6th International Symposium on Human Computer Interaction with Mobile Devices and Services (pp. 399-403). Glasgow, UK: Springer Verlag.Pakucs, B. (2004). Employing context of use in dialogue processing. In Proc CATALOG '04. 8th Workshop on the Semantics and Pragmatics of Dialogue (pp. 162-163). Barcelona, Spain.Pakucs, B., & Huhta, S. (2004). Developing speech interfaces for frequent users: the DUMAS-calendar prototype. In Proc of COLING 2004, Workshop on Robust and Adaptive Information Processing for Mobile Speech Interfaces (pp. 65-68). Geneva, Switzerland.Seward, A. (2004). A fast HMM match algorithm for very large vocabulary speech recognition. Speech Comm, 42, 191-206.Siciliano, C., Williams, G., Faulkner, A., & Salvi, G. (2004). Intelligibility of an ASR-controlled synthetic talking face. Journal of the Acoustical Society of America, 115(5), 2428.Sjölander, K., & Heldner, M. (2004). Word level precision of the NALIGN automatic segmentation algorithm. 
In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 116-119). Stockholm University. [pdf]Skantze, G., & Edlund, J. (2004). Early error detection on word level. In Proceedings of ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction. Norwich, UK. [abstract] [pdf]Abstract: In this paper two studies are presented in which the detection of speech recognition errors on the word level was examined. In the first study, memory-based and transformation-based machine learning was used for the task, using confidence, lexical, contextual and discourse features. In the second study, we investigated which factors humans benefit from when detecting errors. Information from the speech recogniser (i.e. word confidence scores and 5-best lists) and contextual information were the factors investigated. The results show that word confidence scores are useful and that lexical and contextual (both from the utterance and from the discourse) features further improve performance.Skantze, G., & Edlund, J. (2004). Robust interpretation in the Higgins spoken dialogue system. In Proceedings of ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction. Norwich, UK. [abstract] [pdf]Abstract: This paper describes Pickering, the semantic interpreter developed in the Higgins project - a research project on error handling in spoken dialogue systems. In the project, the initial efforts are centred on the input side of the system. The semantic interpreter combines a rich set of robustness techniques with the production of deep semantic structures. It allows insertions and non-agreement inside phrases, and combines partial results to return a limited list of semantically distinct solutions. 
A preliminary evaluation shows that the interpreter performs well under error conditions, and that the built-in robustness techniques contribute to this performance.Spens, K-E., Agelfors, E., Beskow, J., Granström, B., Karlsson, I., & Salvi, G. (2004). SYNFACE, a talking head telephone for the hearing impaired. In Proc IFHOH 7th World Congress. Helsinki, Finland.Strangert, E., & Carlson, R. (2004). On the modelling and synthesis of conversational speech. In Bruce, G., & Horne, M. (Eds.), Nordic Prosody. Proceedings of the IXth Conference (pp. 255-264). Lund: Peter Lang: Frankfurt am Main.Wik, P. (2004). Designing a virtual language tutor. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 136-139). Stockholm University. [pdf]Wik, P., Nygaard, L., & Fjeld, R. V. (2004). Managing complex and multilingual lexical data with a simple editor. In Proceedings of the Eleventh EURALEX International Congress. Lorient, France. [pdf]Zetterholm, E., Blomberg, M., & Elenius, D. (2004). A comparison between human perception and a speaker verification system score of a voice imitation. In Proc of Tenth Australian International Conference on Speech Science & Technology (pp. 393-397). Macquarie Univ, Sydney, Australia. [pdf]Öhlin, D., & Carlson, R. (2004). Data-driven formant synthesis. In Proc of the XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 160-163). Stockholm University. [pdf]2003Allwood, J., & Cerrato, L. (2003). A study of gestural feedback expressions. In The First Nordic Symposium on Multimodal Communication. Copenhagen.Bell, L. (2003). Linguistic adaptations in spoken human-computer dialogues. Empirical studies of user behavior. Doctoral dissertation. [pdf]Bell, L., & Gustafson, J. (2003). Child and adult speaker adaptation during error resolution in a publicly available spoken dialogue system. In Proc of EuroSpeech 2003 (pp. 613-616). Geneva, Switzerland. [pdf]Bell, L., Gustafson, J., & Heldner, M. (2003). 
Prosodic adaptation in human-computer interaction. In Proc of ICPhS, XV Intl Conference of Phonetic Sciences (pp. 2453-2456). Barcelona, Spain. [pdf]Beskow, J. (2003). Talking heads - Models and applications for multimodal speech synthesis. Doctoral dissertation, KTH.Beskow, J., Engwall, O., & Granström, B. (2003). Resynthesis of Facial and Intraoral Articulation from Simultaneous Measurements. In Solé, M., Recasens, D., & Romero, J. (Eds.), Proceedings of the 15th ICPhS (pp. 431-434). Barcelona, Spain. [pdf]Beskow, J., Engwall, O., & Granström, B. (2003). Simultaneous measurements of facial and intraoral articulation. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 57-60). [pdf]Blomberg, M., & Elenius, D. (2003). Collection and recognition of children's speech in the PF-Star project. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 81-84). [pdf]Carlson, R., & Swerts, M. (2003). Perceptually based prediction of upcoming prosodic breaks in spontaneous Swedish speech materials. In Proc of ICPhS, XV Intl Conference of Phonetic Sciences (pp. 79-82). Barcelona, Spain. [pdf]Carlson, R., & Swerts, M. (2003). Relating perceptual judgments of upcoming prosodic breaks to F0 features. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 181-184). [pdf]Cerrato, L., & D'Imperio, M. (2003). Duration and tonal characteristics of short expressions in Italian. In Proc 15th ICPhS. Cerrato, L., & Paoloni, A. (2003). Utilizzo dei parametric della fonetica acustica nell' identificazione del parlante in ambito forense. In Cosi, P., Magno Caldognetto, E., & Zamboni, A. (Eds.), Voce, Canto Parlato, Studi in Onore di Franco Ferrero (pp. 59-66). Unipress.Cerrato, L., & Skhiri, M. (2003). A method for the analysis and measurement of communicative head movements in human dialogues. 
In Proc of AVSP 2003, ISCA Tutorial and Research Workshop on Audio Visual Speech Processing (pp. 251-256). St Jorioz, France.Engwall, O. (2003). A revisit to the application of MRI to the analysis of speech production. Testing our assumptions. In 6th Intl Seminar on Speech Production (pp. 43-48). Sydney. [pdf]Engwall, O. (2003). Combining MRI, EMA & EPG measurements in a three-dimensional tongue model. Speech Communication, 41, 303-329. [pdf]Engwall, O., & Beskow, J. (2003). Resynthesis of 3D tongue movements from facial data. In Proc EuroSpeech 2003 (pp. 2261-2264). [pdf]Engwall, O., & Beskow, J. (2003). The effect of corpus choice on statistical articulatory modeling. In 7th Intl Seminar on Speech Production (pp. 49-54). Sydney. [pdf]Granqvist, S. (2003). Computer methods for voice analysis. Doctoral dissertation, KTH/TMH.Granström, B., & House, D. (2003). Multimodality and speech technology: Verbal and non-verbal communication in talking agents. In Proc of EuroSpeech 2003 (pp. 2901-2904). Geneva, Switzerland. [pdf]Heldner, M. (2003). On the reliability of overall intensity and spectral emphasis as acoustic correlates of focal accents in Swedish. Journal of Phonetics, 31(1), 39-62.Heldner, M., & Megyesi, B. (2003). Exploring the prosody-syntax interface in conversations. In Proc 15th ICPhS, XV Intl Conference of Phonetic Sciences (pp. 2501-2504). Barcelona, Spain. [pdf]Heldner, M., & Megyesi, B. (2003). The acoustic and morpho-syntactic context of prosodic boundaries in dialogs. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 117-120). [pdf]Hincks, R. (2003). Pronouncing the academic word list: Features of L2 student oral presentations. In Proc of ICPhS, XV Intl Conference of Phonetic Sciences (pp. 1545-1549). Barcelona, Spain. [abstract] [pdf]Abstract: This paper is an analysis of lexical choices, pronunciation errors, and discourse features found in a corpus of student presentation speech. 
The speakers were Swedish natives studying Technical English. Particular emphasis is given to the pronunciation of the words most often used in academic texts. 93% of words used in the corpus came from the most frequent 2570 lexemes of academic written English, 99% of all words were acceptably pronounced, disfluencies occurred at relatively stable inter-student rates, and 30% of all new sentences began with the conjunction ‘and’.Hincks, R. (2003). Speech technologies for pronunciation feedback and evaluation. ReCALL, 1(15), 3-21. [abstract] [link]Abstract: Educators and researchers in the acquisition of L2 phonology have called for empirical assessment of the progress students make after using new methods for learning (Chun, 1998, Morley, 1991). The present study investigated whether unlimited access to a speech-recognition-based language learning program would improve the general standard of pronunciation of a group of middle-aged immigrant professionals studying English in Sweden. Eleven students were given a copy of the program Talk to Me from Auralog as a supplement to a 200-hour course in Technical English, and were encouraged to practise on their home computers. Their development in spoken English was compared with a control group of fifteen students who did not use the program. The program is evaluated in this paper according to Chapelle’s (2001) six criteria for CALL assessment. Since objective human ratings of pronunciation are costly and can be unreliable, our students were pre- and posttested with the automatic PhonePass SET-10 test from Ordinate Corp. Results indicate that practice with the program was beneficial to those students who began the course with a strong foreign accent but was of limited value for students who began the course with better pronunciation. The paper begins with an overview of the state of the art of using speech recognition in L2 applications.Hincks, R. (2003). Tutors, tools and assistants for the L2 user. 
In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 173-176). [abstract] [pdf]Abstract: This paper explores the concept of a speech checker for use in the production of oral presentations. The speech checker would be specially adapted speaker-dependent software to be used as a tool in the rehearsal of a presentation. The speech checker would localize mispronounced words and words unlikely to be known by an audience, give feedback on the speaker’s prosody, and provide a friendly face to listen to the presentation.House, D. (2003). Hesitation and interrogative Swedish intonation. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 185-188). [pdf]House, D. (2003). Perceiving question intonation: the role of pre-focal pause and delayed focal peak. In Proc of ICPhS, XV Intl Conference of Phonetic Sciences (pp. 755-758). Barcelona, Spain. [pdf]House, D. (2003). Perception of tone with particular reference to tonal alignment. In Kaji, S. (Ed.), Proc of the International Symposium on Crosslinguistic Studies of Tonal Phenomena 2002. Tokyo: Institute for Languages and Cultures of Asia and Africa, Tokyo University of Foreign Studies.Jande, P-A. (2003). Evaluating rules for phonological reduction in Swedish. In Proceedings of Fonetik (pp. 149-152). [pdf]Jande, P-A. (2003). Phonological reduction in Swedish. In Proceedings of the International Conference of Phonetic Sciences (ICPhS) (pp. 2557-2560). Barcelona, Spain. [pdf]Karlsson, I. (2003). The SYNFACE project - a status report. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 61-64). [pdf]Karlsson, I., Faulkner, A., & Salvi, G. (2003). SYNFACE - a talking face telephone. In Proc of EuroSpeech 2003 (pp. 1297-1300). Geneva, Switzerland. [abstract] [pdf]Abstract: The SYNFACE project has as its primary goal to facilitate for hearing-impaired people to use an ordinary telephone. 
This will be achieved by using a talking face connected to the telephone. The incoming speech signal will govern the speech movements of the talking face, hence the talking face will provide lip-reading support for the user. The project will define the visual speech information that supports lip-reading, and develop techniques to derive this information from the acoustic speech signal in near real time for three different languages: Dutch, English and Swedish. This requires the development of automatic speech recognition methods that detect information in the acoustic signal that correlates with the speech movements. This information will govern the speech movements in a synthetic face and synchronise them with the acoustic speech signal. A prototype system is being constructed. The prototype contains results achieved so far in SYNFACE. This system will be tested and evaluated for the three languages by hearing-impaired users. SYNFACE is an IST project (IST-2001-33327) with partners from the Netherlands, UK and Sweden. SYNFACE builds on experiences gained in the Swedish Teleface project.Magnuson, T., & Hunnicutt, S. (2003). Support for the construction of sentences and phrases for symbol users. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 85-88). [pdf]Nordstrand, M., Svanfeldt, G., Granström, B., & House, D. (2003). Measurements of articulatory variation and communicative signals in expressive speech. In Proc of AVSP'03, ISCA Tutorial and Research Workshop on Audio Visual Speech Processing (pp. 233-238). St Jorioz, France. [pdf]Pakucs, B. (2003). SesaME: A framework for personalized and adaptive speech interfaces. In Proc of the EACL-03 Workshop on Dialogue Systems: interaction, adaptation and styles of management (pp. 95-102). Budapest. [pdf]Pakucs, B. (2003). Towards dynamic multi-domain dialogue processing. In Proc of EuroSpeech 2003 (pp. 741-744). Geneva, Switzerland. [pdf]Salvi, G. (2003). 
Accent clustering in Swedish using the Bhattacharyya distance. In Proceedings of the International Congress of Phonetic Sciences (ICPhS) (pp. 1149-1152). Barcelona, Spain. [abstract] [pdf]Abstract: In an attempt to improve automatic speech recognition (ASR) models for Swedish, accent variations were considered. These have proved to be important variables in the statistical distribution of the acoustic features usually employed in ASR. The analysis of feature variability has revealed phenomena that are consistent with what is known from phonetic investigations, suggesting that a consistent part of the information about accents could be derived from those features. A graphical interface has been developed to simplify the visualization of the geographical distributions of these phenomena.Salvi, G. (2003). Truncation Error and Dynamics in Very Low Latency Phonetic Recognition. In Proceedings of Non Linear Speech Processing (NOLISP). Le Croisic, France. [abstract] [pdf]Abstract: The truncation error for a two-pass decoder is analyzed in a problem of phonetic speech recognition for very demanding latency constraints (look-ahead length < 100ms) and for applications where successive refinements of the hypotheses are not allowed. This is done empirically in the framework of hybrid MLP/HMM models. The ability of recurrent MLPs, as a posteriori probability estimators, to model time variations is also considered, and its interaction with the dynamic modeling in the decoding phase is shown in the simulations.Salvi, G. (2003). Using accent information in ASR models for Swedish. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech) (pp. 2677-2680). Geneva, Switzerland. [abstract] [pdf]Abstract: In this study accent information is used in an attempt to improve acoustic models for automatic speech recognition (ASR). First, accent dependent Gaussian models were trained independently. 
The Bhattacharyya distance was then used in conjunction with agglomerative hierarchical clustering to define optimal strategies for merging those models. The resulting allophonic classes were analyzed and compared with the phonetic literature. Finally, accent “aware” models were built, in which the parametric complexity for each phoneme corresponds to the degree of variability across accent areas and to the amount of training data available for it. The models were compared to models with the same, but evenly spread, overall complexity showing in some cases a slight improvement in recognition accuracy.Seward, A. (2003). Efficient methods for automatic speech recognition. Doctoral dissertation, Department of Speech, Music and Hearing, KTH.Seward, A. (2003). Low-latency incremental speech transcription in the Synface project. In Proc of EuroSpeech 2003 (pp. 1141-1144). Geneva, Switzerland. [pdf]Siciliano, C., Williams, G., Beskow, J., & Faulkner, A. (2003). Evaluation of a Multilingual Synthetic Talking Face as a communication Aid for the Hearing Impaired. In Proc of ICPhS, XV Intl Conference of Phonetic Sciences (pp. 131-134). Barcelona, Spain. [pdf]Sjölander, K. (2003). An ensemble method for the alignment of sound and its transcription. In Proc of SMAC 03, Stockholm Music Acoustics Conference (pp. 743-746). Sjölander, K. (2003). An HMM-based system for automatic segmentation and alignment of speech. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 93-96). [pdf]Skantze, D., & Dahlbäck, N. (2003). Auditory icons support for navigation in speech-only interfaces for room-based design metaphors. In Proc of ICAD 2003, Intl Community for Auditory Display (pp. 140-143). Boston University, USA. [pdf]Skantze, G. (2003). Exploring human error handling strategies: implications for spoken dialogue systems. In Proceedings of ISCA Tutorial and Research Workshop on Error Handling in Spoken Dialogue Systems (pp. 71-76). 
Chateau-d'Oex-Vaud, Switzerland. [pdf]Svanfeldt, G., Nordstrand, M., Granström, B., & House, D. (2003). Measurements of articulatory variation in expressive speech. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics, PHONUM 9 (pp. 53-56). [pdf]Wik, P. (2003). A Cognitive Science Approach To A Human-Dolphin Dialog Protocol. In 17th International Conference of the European Cetacean Society. Las Palmas, Canary Islands.Wik, P. (2003). Building Common Ground: Communication Across Species Barriers. In First International Conference on Acoustic Communication by Animals. Maryland, USA.Öster, A-M., House, D., Hatzis, A., & Green, P. (2003). Testing a new method for training fricatives using visual maps in the Ortho-Logo-Paedia project (OLP). In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 89-92). [pdf]2002Beskow, J., Edlund, J., & Nordstrand, M. (2002). Specification and realisation of multimodal output in dialogue systems. In Proc of ICSLP 2002 (pp. 181-184). Denver, Colorado, USA. [abstract] [pdf]Abstract: We present a high level formalism for specifying verbal and nonverbal output from a multimodal dialogue system. The output specification is XML-based and provides information about communicative functions of the output without detailing the realisation of these functions. The specification can be used to control an animated character that uses speech and gestures. We give examples from an implementation in a multimodal spoken dialogue system, and describe how facial gestures are implemented in a 3D animated talking agent within this system.Beskow, J., Granström, B., & House, D. (2002). A multimodal speech synthesis tool applied to audio-visual prosody. In Keller, E., Bailly, G., Monaghan, A., Terken, J., & Huckvale, M. (Eds.), Improvements in Speech Synthesis (pp. 372-382). New York: John Wiley & Sons, Inc.Carlson, R., Granström, B., Heldner, M., House, D., Megyesi, B., Strangert, E., & Swerts, M. 
(2002). Boundaries and groupings - the structuring of speech in different communicative situations: a description of the GROG project. In Proc of Fonetik 2002 (pp. 65-68). Stockholm.Cerrato, L. (2002). A comparison between feedback strategies in Human-to-Human and Human-Machine communication. In Proc of ICSLP-2002 (pp. 557-560). Denver, Colorado, USA. [pdf]Cerrato, L. (2002). A Study of Verbal Feedback in Italian. In Proc of NORDTALK Symposium on Relations between Utterances. Cerrato, L., & Ekeklint, S. (2002). Different ways of ending human-machine interactions. In Proc of AAMAS02 Workshop. Bologna. [pdf]Cerrato, L., & Skhiri, M. (2002). Quantifying non-verbal communicative behaviour in face-to-face human dialogues. In First Pan-American/Iberian Meeting on Acoustics Cancun. Mexico.Edlund, J., Beskow, J., & Nordstrand, M. (2002). GESOM - A model for describing and generating multi-modal output. In Proc of ISCA Workshop Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany. [abstract] [pdf]Abstract: This paper describes GESOM, a model for generation of generalised, high-level multi-modal dialogue system output. It aims to let dialogue systems generate output for various output devices and modalities with a minimum of changes to the output generation of the dialogue system. The model was developed and tested within the AdApt spoken dialogue system, from which the bulk of the examples in this paper are taken.Edlund, J., & Nordstrand, M. (2002). Turn-taking gestures and hour-glasses in a multi-modal dialogue system. In Proc of ISCA Workshop Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany. [abstract] [pdf]Abstract: An experiment with 24 subjects was performed. The subjects were split in three groups, and asked to extract information from the AdApt spoken dialogue system for somewhat over 30 minutes per subject. 
The system configuration varied in that one group had turn-taking gestures from an animated talking head, another had an hourglass symbol to signal when the system was busy, and the third had no turn-taking feedback at all. The results show that although the hourglass setup showed no decrease in efficiency compared to the facial gestures, it made the subjects less satisfied. The lack of turn-taking feedback was noticed and mentioned by half of the subjects in that group.Elenius, D., & Blomberg, M. (2002). Characteristics of a low reject mode speaker verification system. In Proc of ICSLP 2002 (pp. 1385-1388). Denver, Colorado, USA. [pdf]Engwall, O. (2002). Evaluation of a system for concatenative articulatory visual speech synthesis. In Proc of ICSLP 2002 (pp. 665-668). Denver, Colorado, USA. [pdf]Engwall, O. (2002). Tongue Talking - Studies in Intraoral Speech Synthesis. Doctoral dissertation, KTH. [pdf]Fant, G., Kruckenberg, A., Gustafson, K., & Liljencrants, J. (2002). A new approach to intonation analysis and synthesis of Swedish. In Proc of Fonetik 2002 (pp. 161-64). Stockholm.Fant, G., Kruckenberg, A., Gustafson, K., Liljencrants, J., & Botinis, A. (2002). Individual variations in prominence correlates. Some observations from lab-speech. In Proc of Fonetik 2003 (pp. 177-180). Stockholm.Graf, H. .., Cosatto, E., Strom, V., & Huang, F. J. (2002). Visual prosody: Facial movements accompanying speech. In Fifth IEEE International Conference on Automatic Face and Gesture Recognition, 2002. Proceedings (pp. 396-401). Granström, B., House, D., & Beskow, J. (2002). Speech and gestures for talking faces in conversational dialogue systems. In Granström, B., House, D., & Karlsson, I. (Eds.), Multimodality in language and speech systems (pp. 209-241). Dordrecht: Kluwer Academic Publishers.Granström, B., House, D., & Karlsson, I. (2002). Multimodality in language and speech systems. Dordrecht: Kluwer Academic Publishers.Granström, B., House, D., & Swerts, M. G. (2002). 
Multimodal feedback cues in human-machine interactions. In Bel, B., & Marlien, I. (Eds.), Proc of the Speech Prosody 2002 Conference (pp. 347-350). Aix-en-Provence: Laboratoire Parole et Langage.Gustafson, J. (2002). Developing multimodal spoken dialogue systems. Empirical studies of spoken human-computer interaction. Doctoral dissertation, KTH. [pdf]Gustafson, J., Bell, L., Boye, J., Edlund, J., & Wirén, M. (2002). Constraint manipulation and visualization in a multimodal dialogue system. In Proc of the ISCA Workshop Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany. [abstract] [pdf]Abstract: When interacting with spoken and multimodal dialogue systems, it is often difficult for users to understand and influence how their input is processed by the system. In this paper, we describe how these problems were addressed in the multimodal real-estate dialogue system AdApt. During the course of a dialogue, the user's constraints are translated into symbolic icons that are visualized on the screen and can be manipulated by drag-and-drop operations. Users are thus given a clear picture of how their utterances are understood, and are given a transparent means of controlling the interaction with the system.Gustafson, J., & Sjölander, K. (2002). Voice transformations for improving children's speech recognition in a publicly available dialogue system. In Proc of ICSLP 2002 (pp. 297-300). Denver, Colorado, USA. [pdf]Gustafson, K., & House, D. (2002). Prosodic parameters of a "fun" speaking style. In Keller, E., Bailly, G., Monaghan, A., Terken, J., & Huckvale, M. (Eds.), Improvements in Speech Synthesis (pp. 264-272). New York: John Wiley & Sons, Inc.Gustafson-Capkova, S., & Megyesi, B. (2002). Silence and discourse context in read speech and dialogues in Swedish. In Bel, B., & Marlien, I. (Eds.), Proc of Speech Prosody 2002 Conference (pp. 363-366). Aix-en-Provence: Laboratoire Parole et Langage. [pdf]Hincks, R. (2002). 
Speech recognition for language teaching and evaluating: a study of existing products. In Proceedings of ICSLP 2002 (pp. 733-736). Denver, Colorado, USA. [abstract] [pdf]Abstract: Educators and researchers in the acquisition of L2 phonology have called for empirical assessment of the progress students make after using new methods for learning [1], [2]. This study investigated whether unlimited access to a speech-recognition-based language learning program would improve the general goodness of pronunciation of a group of middle-aged immigrant professionals studying English in Sweden. Eleven students were given a copy of the program Talk to Me by Auralog as a supplement to a 200-hour course in Technical English, and were encouraged to practice on their home computers. Their development in spoken English was compared with a control group of fifteen students who did not receive software. Talk to Me uses speech recognition to provide conversational practice, phonetic instruction, visual feedback on prosody and scoring of pronunciation. A significant limitation of commercial systems currently available is their inability to diagnose specific articulatory problems. In this course in Technical English, however, students also met at regular intervals with a pronunciation tutor who could steer the student in the right direction for finding the most important sections to practice for his or her particular problems. Students reported high satisfaction with the software and used it for an average of 12.5 hours. Students were pre- and post-tested with the automatic PhonePass SET-10 test from Ordinate Corp. Results indicate that practice with the program was beneficial to those students who began the course with a strong foreign accent but that students who began the course with intermediate pronunciation did not show the same improvement.House, D. (2002). Intonational and visual cues in the perception of interrogative mode in Swedish. In Proc of ICSLP 2002 (pp. 1957-1960). 
Denver, Colorado, USA. [pdf]House, D. (2002). The interaction of pitch range and temporal alignment in the perception of interrogative mode in Swedish. In Hawkins, S., & Nguyen, N. (Eds.), Temporal Integration in the Perception of Speech. University of Cambridge.House, D., & Granström, B. (2002). Multimodal speech synthesis: Improving information flow in dialogue systems using 3D talking heads. In Artificial Intelligence: Methodology, Systems, and Applications, 10th International Conference, AIMSA 2002 (pp. 38362). Berlin: Springer-Verlag.Hunnicutt, S., & Magnuson, T. (2002). Sentences for symbol users: email and echat. In Proc of ISAAC '02, 10th Biennial Conference of the International Society for Augmentative and Alternative Communication (pp. 467-468). Odense, Denmark.Johansson, M., Blomberg, M., Elenius, K., Hoffsten, L-E., & Torberger, A. (2002). A phoneme recognizer for the hearing impaired. In Proc. of ICSLP'2002 (pp. 433-436). Denver, Colorado, USA. [pdf]Johansson, M., Blomberg, M., Elenius, K., Hoffsten, L-E., & Torberger, A. (2002). Phoneme recognition for the hearing impaired. In Proc. of Fonetik 2002 (pp. 109-112). Stockholm.Karnebäck, S. (2002). Expanded examinations of a low frequency modulation feature for speech/music discrimination. In Proc of ICSLP 2002 (pp. 2009-2012). Denver, Colorado, USA. [pdf]Magnuson, T. (2002). Assessment of writing difficulties and evaluation of computerized writing support. Licentiate dissertation, KTH.Megyesi, B. (2002). Data-Driven Syntactic Analysis - Methods and Applications for Swedish. Doctoral dissertation, KTH.Megyesi, B. (2002). Shallow parsing with pos taggers and linguistic knowledge. Journal of Machine Learning Research: Special Issue on Shallow Parsing, 2, 639-668. [pdf]Megyesi, B., & Carlson, R. (2002). Data-driven methods for building a Swedish Treebank. In Proceedings of the Swedish Treebank Symposium. Växjö University, Sweden. [pdf]Megyesi, B., & Gustafson-Capkova, S. (2002). 
Production and perception of pauses and their linguistic context in read and spontaneous speech in Swedish. In Proc of ICSLP'2002 (pp. 2153-2156). Denver, Colorado, USA. [pdf]Pakucs, B. (2002). VoiceXML-based dynamic plug and play dialogue management for mobile environments. In Proc of the ISCA Workshop Multi-Modal Dialogue in Mobile Environments. Kloster, Irsee, Germany. [pdf]Rydin, S. (2002). Building a hyponymy lexicon with hierarchical structure. In Proc of the SIGLEX Workshop on Unsupervised Lexical Acquisition, ACL'02 (pp. 26-33). [pdf]Skantze, G. (2002). Coordination of referring expressions in multimodal human-computer dialogue. In Proceedings of ICSLP 2002 (pp. 553-556). Denver, Colorado, USA. [abstract] [pdf]Abstract: This study examines coordination of referring expressions in multimodal human-computer dialogue, i.e. to what extent users’ choices of referring expressions are affected by the referring expressions that the system is designed to use. An experiment was conducted, using a semi-automatic multimodal dialogue system for apartment seeking. The user and the system could refer to areas and apartments on an interactive map by means of speech and pointing gestures. Results indicate that the referring expressions of the system have great influence on the user’s choice of referring expressions, both in terms of modality and linguistic content. From this follows a number of implications for the design of multimodal dialogue systems.Wik, P. (2002). Building Bridges: A Cognitive Science Approach To A Human-Dolphin Dialog Protocol. Master's thesis, University of Oslo. [pdf]Öster, A-M., House, D., Protopapas, A., & Hatzis, A. (2002). A presentation of a new EU project for speech therapy: OLP (Ortho-Logo-Paedia). In Proc of Fonetik 2002 (pp. 45-48). Stockholm. [pdf]2001Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., & Stent, A. (2001). Towards conversational human-computer interaction. 
AI Magazine, 22, 27-37.Beaugendre, F., House, D., & Hermes, D. J. (2001). Accentuation boundaries in Dutch, French and Swedish. Speech Communication, 33, 305-318.Bell, L., Boye, J., & Gustafson, J. (2001). Real-time handling of fragmented utterances. In Proceedings of NAACL 2001 Workshop: Adaptation in Dialogue Systems. Pittsburgh, PA. [abstract] [pdf]Abstract: In this paper, we discuss an adaptive method of handling fragmented user utterances to a speech-based multimodal dialogue system. Inserted silent pauses between fragments present the following problem: Does the current silence indicate that the user has completed her utterance, or is the silence just a pause between two fragments, so that the system should wait for more input? Our system incrementally classifies user utterances as either closing (more input is unlikely to come) or non-closing (more input is likely to come), partly depending on the current dialogue state. Utterances that are categorized as non-closing allow the dialogue system to await additional spoken or graphical input before responding.Engwall, O. (2001). Considerations in intraoral visual speech synthesis: Data and modelling. In Proc of 4th Intl Speech Motor Conf (pp. 23-26). Nijmegen. [pdf]Engwall, O. (2001). Making the tongue model talk: Merging MRI & EMA Measurements. In Proc of Eurospeech 2001 (pp. 261-264). Aalborg. [pdf]Engwall, O. (2001). Synthesising static vowels and dynamic sounds using a 3D vocal tract model. In Proc of 4th ISCA Tutorial and Research Workshop on Speech Synthesis (pp. 38-41). Perthshire. [pdf]Engwall, O. (2001). Using linguopalatal contact patterns to tune a 3D tongue model. In Proc of Eurospeech 2001 (pp. 1475-1478). Aalborg. [pdf]Granström, B., House, D., Beskow, J., & Lundeberg, M. (2001). Verbal and visual prosody in multimodal speech perception. In van Dommelen, W., & Fretheim, T. (Eds.), Nordic Prosody 2000: Proc of VIII Conf (pp. 77-88). Trondheim, Norway.Granström, B., House, D., & Swerts, M. (2001). 
Multimodal feedback cues in human-machine interactions. In COST258 Meeting. Maastricht, The Netherlands.Gustafson, K., & House, D. (2001). Children's evaluation of expressive synthesis: A web-based experiment. In COST258 Meeting. Prague.Gustafson, K., & House, D. (2001). Expressive synthesis for children, a web-based evaluation. In Karlsson, A., & van de Weijer, J. (Eds.), Papers from Fonetik 2001 (pp. 50-53). Gustafson, K., & House, D. (2001). Fun or boring? A web-based evaluation of expressive synthesis for children. In Proc of Eurospeech 2001 (pp. 565-568). Aalborg, Denmark. [pdf]Gustafson-Capkova, S., & Megyesi, B. (2001). A comparative study of pauses in dialogues and read speech. In Proc of Eurospeech 2001 (pp. 931-934). Aalborg, Denmark. [pdf]Heldner, M. (2001). Focal accent - F0 movements and beyond. Doctoral dissertation, Umeå University.Heldner, M. (2001). On the non-linear lengthening of focally accented Swedish words. In van Dommelen, W., & Fretheim, T. (Eds.), Nordic Prosody: Proc of the VIIIth Conference (pp. 103-112). Trondheim. [pdf]Heldner, M. (2001). Spectral emphasis as an additional source of information in accent detection. In Bacchiani, M., Hirschberg, J., Litman, D., & Ostendorf, M. (Eds.), Prosody 2001: ISCA Tutorial and Research Workshop on Prosody in Speech Recognition and Understanding (pp. 57-60). Red Bank, NJ. [pdf]Heldner, M., & Strangert, E. (2001). Temporal effects of focus in Swedish. Journal of Phonetics, 29(3), 329-361.Hirsch, H-G. (2001). HMM adaptation for applications in telecommunication. Speech Communication, 34, 127-139.House, D. (2001). Focal accent in Swedish: perception of rise properties for Accent 1. In van Dommelen, W., & Fretheim, T. (Eds.), Nordic Prosody 2000: Proc of VIII Conf (pp. 127-136). Trondheim, Norway.House, D., Beskow, J., & Granström, B. (2001). Interaction of visual cues for prominence. In Karlsson, A., & van de Weijer, J. (Eds.), Papers from Fonetik 2001 (pp. 62-65). 
House, D., Beskow, J., & Granström, B. (2001). Timing and interaction of visual cues for prominence in audiovisual speech perception. In Proc of Eurospeech 2001 (pp. 387-390). Aalborg, Denmark. [pdf]Hunnicutt, S., & Carlberger, J. (2001). Markov models and heuristic methods improve a word prediction program. J Augmentative and Alternative Communication, 17, 255-264.Hunnicutt, S., & Magnuson, T. (2001). Linguistic structures for email and echat. In Karlsson, A., & van de Weijer, J. (Eds.), Fonetik 2001 (pp. 66-69). Örenäs, Sweden.Jande, P-A. (2001). Stress patterns in Swedish lexicalised phrases. In Proceedings of Fonetik (pp. 70-73). Lund, Sweden. [pdf]Karnebäck, S. (2001). Discrimination between speech and music based on a low frequency modulation feature. In Proc of Eurospeech 2001 (pp. 1891-1894). [pdf]Megyesi, B. (2001). Comparing data-driven learning algorithms for PoS tagging of Swedish. In Proc of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2001) (pp. 151-158). Carnegie Mellon University, Pittsburgh, PA, USA. [pdf]Megyesi, B. (2001). Data-driven methods for PoS tagging and chunking of Swedish. In 13th Nordic Conference of Computational Linguistics, NoDaLiDa 2001. Uppsala, Sweden. [pdf]Megyesi, B. (2001). Phrasal parsing by datadriven PoS taggers. In Proc of the Conf on Recent Advances in Natural Language Processing, Euro Conf RANLP-2001 (pp. 166-173). Tzigov Chark, Bulgaria. [pdf]Megyesi, B., & Gustafson-Capkova, S. (2001). Pausing in dialogues and read speech: Speaker's production and listeners interpretation. In Proc of the Workshop on Prosody in Speech Recognition and Understanding (pp. 107-113). New Jersey, USA. [pdf]Nordqvist, P., & Leijon, A. (2001). Automatic assessment of Hearing Environments. In Second McMaster-Gennum Workshop on Intelligent Hearing Instruments. Niagara-on-the-Lake, ON.Pakucs, B., & Melin, H. (2001). PER: A speech based automated entrance receptionist. 
In 13th Nordic Conference of Computational Linguistics, NoDaLiDa'01. Uppsala University, Uppsala.Raghavendra, P., Rosengren, E., & Hunnicutt, S. (2001). An investigation of different degrees of dysarthric speech as input to speaker dependent and speaker-adaptive recognition systems. J Augmentative and Alternative Communication, 17, 265-275.Seward, A. (2001). Transducer optimizations for tight-coupled decoding. In Proc of Eurospeech 2001 (pp. 1607-1610). [pdf]Sjölander, K. (2001). Automatic alignment of phonetic segments. In Karlsson, A., & van de Weijer, J. (Eds.), Papers from Fonetik 2001 (pp. 140-143). 2000Bannert, R., Botinis, A., Bruce, G., Engstrand, O., Granström, B., Lindblad, P., & Strangert, E. (2000). Phonetics in Sweden 2000. In Botinis, A., & Torstensson, N. (Eds.), Fonetik 2000, Proc of the Swedish Phonetics Conference (pp. 1-8). Skövde.Bell, L. (2000). Linguistic adaptations in spoken and multimodal dialogue systems. Licentiate dissertation, KTH.Bell, L., Boye, J., Gustafson, J., & Wirén, M. (2000). Modality convergence in a multimodal dialogue system. In Poesio, M., & Traum, D. (Eds.), Proc of Götalog 2000, Fourth Workshop on the Semantics and Pragmatics of Dialogue (pp. 29-34). Gothenburg. [pdf]Bell, L., Eklund, R., & Gustafson, J. (2000). A comparison of disfluency. Distribution in a unimodal and a multimodal speech interface. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 626-629). Beijing. [pdf]Bell, L., & Gustafson, J. (2000). Positive and negative user feedback in a spoken dialogue corpus. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 589-592). Beijing. [pdf]Berthelsen, H., & Megyesi, B. (2000). Ensemble of Classifiers for Noise Detection in PoS Tagged Corpora. In Proc of the Third International Workshop on TEXT, SPEECH and DIALOGUE (pp. 27-32). 
Springer-Verlag, Berlin.Beskow, J., Granström, B., House, D., & Lundeberg, M. (2000). Experiments with verbal and visual conversational signals for an automatic language tutor. In Delcloque, P., & Bramoullé, A. (Eds.), Proc of InSTIL 2000 (pp. 138-142). University of Abertay Dundee, Dundee, Scotland.Beskow, J., Granström, B., House, D., & Lundeberg, M. (2000). Verbal and visual prosody in multimodal speech perception. In Nordic Prosody VIII. Bimbot, F., Blomberg, M., Boves, L., Genoud, D., Hutter, H-P., Jaboulet, C., Koolwaaij, J., Lindberg, J., & Pierrot, J-B. (2000). An overview of the CAVE project research activities in speaker verification. Speech Comm, 31, 155-180.Boye, J., Hockey, B. A., & Rayner, M. (2000). Asynchronous dialogue management: Two case-studies. In Poesio, M., & Traum, D. (Eds.), Proc of Götalog 2000, Fourth Workshop on the Semantics and Pragmatics of Dialogue (pp. 51-55). Gothenburg.Bruce, G., Filipsson, M., Frid, J., Granström, B., Gustafson, K., Horne, M., & House, D. (2000). Modelling of Swedish Text and Discourse Intonation in a Speech Synthesis Framework. In Botinis, A. (Ed.), Intonation: Analysis, Modelling and Technology (pp. 291-320). Dordrecht: Kluwer Academic Publishers.Carlson, R., & House, D. (2000). Prosodic aspects of Swedish question words in computer-directed spontaneous speech. In Nordic Prosody VIII. Eineborg, M., & Lindberg, N. (2000). ILP in part-of-speech tagging. An overview. In Cussens, J., & Dzeroski, S. (Eds.), Learning Language in Logic Workshop (LLL99) (pp. 157-169). Springer.Elenius, K. (2000). Experiences from collecting two Swedish telephone speech databases. Int Journal of Speech Technology, 3, 119-127.Engwall, O. (2000). A 3D tongue model based on MRI data. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 901-904). Beijing. [pdf]Engwall, O. (2000). Are static MRI representative of dynamic speech? 
Results from a comparative study using MRI, EPG, and EMA. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 17-20). Beijing. [pdf]Engwall, O., & Badin, P. (2000). An MRI study of Swedish fricatives: coarticulatory effects. In Hole, P. (Ed.), Proc of 5th Speech Production Seminar: Models and data (pp. 297-300). Kloster Seeon, Germany. [pdf]Granström, B., House, D., Beskow, J., & Lundeberg, M. (2000). Verbal and visual prosody in multimodal speech perception. In Proc 4th Swedish Symposium on Multimodal Communication. Gustafson, J., & Bell, L. (2000). Speech Technology on Trial: Experiences from the August System.. Natural Language Engineering, 6(Special issue on Best Practice in Spoken Dialogue Systems). [pdf]Gustafson, J., Bell, L., Beskow, J., Boye, J., Carlson, R., Edlund, J., Granström, B., House, D., & Wirén, M. (2000). AdApt - a multimodal conversational dialogue system in an apartment domain. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc. of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 134-137). Beijing: China Military Friendship Publish. [abstract] [pdf]Abstract: A general overview of the AdApt project and the research that is performed within the project is presented. In this project various aspects of human-computer interaction in a multimodal conversational dialogue systems are investigated. The project will also include studies on the integration of user/system/dialogue dependent speech recognition and multimodal speech synthesis. A domain in which multimodal interaction is highly useful has been chosen, namely, finding available apartments in Stockholm. A Wizard-of-Oz data collection within this domain is also described.Heldner, M. (2000). Is non-linear lengthening important for the perceived naturalness of focal accented Swedish words?. In Botinis, A., & Torstensson, N. (Eds.), Fonetik 2000, Proc of the Swedish Phonetics Conference (pp. 69-72). Skövde. 
[pdf]Hennebert, J., Melin, H., Petrovska, D., & Genoud, D. (2000). POLYCOST: A telephone-speech database for speaker recognition. Speech Comm, 2-3(31), 265-270.House, D. (2000). Perception of focal accent in Swedish. How necessary is the rise for Accent 1?. In Nordic Prosody VIII. House, D. (2000). Rise alignment in the perception of focal accent and pitch in Swedish. In Botinis, A., & Torstensson, N. (Eds.), Fonetik 2000, Proc of the Swedish Phonetics Conference (pp. 73-76). Skövde.Johansen, F. T., Warakagoda, N., Lindberg, B., Lehtinen, G., Kai, Z., Gank, A., Elenius, K., & Salvi, G. (2000). The COST 249 SpeechDat multilingual reference recogniser. In Gavrilidou, M., Caryannis, G., Markantonatou, S., Piperidis, S., & Stainhaouer, G. (Eds.), Proc. of LREC 2000, 2nd Intl Conf on Language Resources and Evaluation (pp. 1351-1356). Athens, Greece. [pdf]Karlsson, I., Banziger, T., Johnstone, T., Lindberg, J., Melin, H., Nolan, F., & Scherer, K. (2000). Speaker verification with elicited speaking-styles in the VeriVox project. Speech Comm, 2(31), 121-129.Leijon, A., & Nordqvist, P. (2000). Finite-state modelling for specification of non-linear hearing instruments. In International Hearing Aid Research. Lake Tahoe, CA.Lindberg, B., Johansen, F. T., Warakagoda, N., Lehtinen, G., Kai, Z., Gank, A., Elenius, K., & Salvi, G. (2000). A noise robust multilingual reference recogniser based on SpeechDat(II). In Proc of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 370-373). Beijing. [pdf]Lindberg, J., & Blomberg, M. (2000). On the potential threat of using large speech corpora for impostor selection in speaker verification. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc. of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 258-261). Beijing.Lindberg, N. (2000). Data driven methods in natural language processing - Two applications. Licentiate dissertation, KTH/TMH.Magnuson, T., & Blomberg, M. (2000). Acoustic analysis of dysarthric speech. 
In Botinis, A., & Torstensson, N. (Eds.), The Swedish Phonetics Conf (pp. 105-108). Skövde.Mariethoz, J., Lindberg, J., & Bimbot, F. (2000). A MAP approach, with synchronous decoding and unit-based normalization for text-dependent speaker verification. In Proc of ICSLP 2000. Massaro D.W, C., Beskow, J., & Cole, R. (2000). Developing and Evaluating Conversational Agents. In Cassell, J., et al. (Eds.), Embodied Conversational Agents. Cambridge, MA: MIT Press.Melin, H. (2000). Databases for Speaker Recognition: Activities in COST250 Working Group 2. Technical Report, European Commission DG-XIII, Brussels, COST 250 - Speaker Recognition in Telephony. [pdf]Pakucs, B., & Gambäck, B. (2000). Designing a system for Swedish spoken document retrieval. In Nordgård, T. (Ed.), Proc of NODALIDA-99, 12th Nordic Conference in Computational Linguistics (pp. 162-174). Trondheim, Norway.Rosengren, E., Magnuson, T., Hunnicutt, S., & Blomberg, M. (2000). Analysis of dysarthric speech for use with speech recognition. In Proc of ISAAC '00, 9th Biennial Conf of the Intl Society for Augmentative and Alternative Communication (pp. 64-66). Washington, DC, USA.Seward, A. (2000). A tree-trellis N-best decoder for stochastic context-free grammars. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 282-285). Beijing.Sjölander, K., & Beskow, J. (2000). WaveSurfer - an open source speech tool. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proceedings of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 464-467). Beijing. [pdf]Wissing, D., Gustafson, K., & Coetzee, A. (2000). Temporal organisation in some varieties of South African English. In Proc Workshop on Black South African English, Intl Conf on Linguistics in Southern Africa (pp. 59-68). Cape Town, South Africa.Öhman, T. (2000). Vision in speech technology. Automatic measurements of visual speech and audiovisual intelligibility of synthetic and natural faces. 
Licentiate dissertation.1999Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). A Synthetic Face as a Lip-reading Support for Hearing Impaired Telephone Users - Problems and Positive Results. In Proceedings of the 4th European Conference on Audiology. Oulu, Finland.Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). Two methods for Visual Parameter Extraction in the Teleface Project. In Proceedings of Fonetik. Gothenburg, Sweden.Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1999). Artificial video for Hearing-Impaired Telephone Users; A comparison with the No Video and Perfect Video Conditions. In Buhler, C., & Knops, H. (Eds.), Assistive Technology on the Threshold of the New Millennium (pp. 116-121). IOS Press.Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1999). ASR controlled synthetic face as a lipreading support for hearing impaired telephone users. In Cost249 meeting. Prague, Czech Republic.Agelfors, E., Beskow, J., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). Synthetic visual speech driven from auditory speech. In Proceedings of Audio-Visual Speech Processing (AVSP). Santa Cruz, USA. [pdf]Bell, L., & Gustafson, J. (1999). Interaction with an animated agent: an analysis of a Swedish database of spontaneous computer directed speech. In Proc of Eurospeech '99 (pp. 1143-1146). Budapest, Hungary. [pdf]Bell, L., & Gustafson, J. (1999). Repetition and its phonetic realizations: Investigating a Swedish database of spontaneous computer directed speech. In Proceedings of ICPhS-99 (pp. 1221-1224). [pdf]Bell, L., & Gustafson, J. (1999). Repetition in a Swedish database of spontaneous computer-directed speech. In Andersson, R., Abelin, Å., Allwood, J., & Lindblad, P. (Eds.), Proc of Fonetik 99 (pp. 15-18). 
Bell, L., & Gustafson, J. (1999). Utterance types in the August database. In The Third Swedish Symposium on Multimodal Communication. Bell, L., & Gustafson, J. (1999). Utterance types in the August System. In Proc from IDS '99. [pdf]Bell, L., Gustafson, K., House, D., & Johansson, L. (1999). Children's evaluation of prosody in speech synthesis. In Andersson, R., Abelin, Å., Allwood, J., & Lindblad, P. (Eds.), Proc of Fonetik 99 (pp. 19-22). Bengtsson, B., Burgoon, J., Cederberg, C., Bonito, J., & Lundeberg, M. (1999). The impact of anthropomorphic interfaces on influence, understanding and credibility. In Proc 32nd Hawaii Intl Conf on System Sciences. Bimbot, F., Blomberg, M., Boves, L., Chollet, G., Jaboulet, C., Jacob, B., Kharroubi, J., Koolwaaij, J., Lindberg, J., Mariethoz, J., Mokbel, C., & Mokbel, H. (1999). An overview of the PICASSO project research activities in speaker verification for telephone applications. In Proc of Eurospeech 99 (pp. 1963-1967). [pdf]Blomberg, M. (1999). Within-utterance correlation for speech recognition. In Proc of Eurospeech 99 (pp. 2479-2482). Blomberg, M. (1999). Within-utterance correlation in automatic speech recognition. In Proc of Fonetik 99 (pp. 23-26). Carlberger, A. (1999). Grammar and lexicons for a speech-interfaced knowledge-based engineering program (ICAD). In Loncke, F., Clibbens, J., & Lloyd, L. (Eds.), AAC: New directions in research and practice, The ISAAC Research Symposium on Natural Language Processing. Dublin: Whurr Publ, London.Carlberger, A. (1999). Nparse - A Shallow N-Gram-Based Grammatical-Phrase Parser. In Proc of Eurospeech 99 (pp. 2067-2070). Damper, R. I., Marchand, Y., Anderson, M., & Gustafson, K. (1999). Evaluating the pronunciation component of text-to-speech systems for English: a performance comparison of different approaches. Comput Speech Language, 13(2), 155-176.Elenius, K. (1999). Experiences from building two large telephone speech databases for Swedish. In Proc of ICPhS-99 (pp. 
1741-1744). Elenius, K. (1999). Two Swedish SpeechDat databases - some experiences and results. In Proc of Eurospeech 99 (pp. 2243-2246). Elenius, K. (1999). Two Swedish telephone speech databases.. In Proc of Fonetik 99 (pp. 45-48). Engwall, O. (1999). Modeling of the vocal tract in three dimensions. In Proc of Eurospeech 99 (pp. 113-116). Budapest. [pdf]Falcone, M., Melin, H., & Ariyaeeinia, A. (1999). Speaker Recognition Assessment and Dissemination: Activities in COST 250 Working Group 4.. In Proc COST250 Workshop on Speaker Recognition in Telephony.. Granström, B. (1999). Multi-modal speech synthesis with applications.. In Chollet, G., Di Benedetto, M., Esposito, A., & Marinaro, M. (Eds.), Speech Processing, Recognition and Artificial Neural Networks. Proc 3rd Intl School on Neural Nets Eduardo R Cajaniello (pp. 327-346). London: Springer-Verlag Ltd.Granström, B., House, D., & Lundeberg, M. (1999). Eyebrow movements as a cue to prominence.. In The Third Swedish Symposium on Multimodal Communication. Granström, B., House, D., & Lundeberg, M. (1999). Prosodic cues in multi-modal speech perception.. In Proc of ICPhS-99 (pp. 655-658). Granström, B., House, D., & Lundeberg, M. (1999). Visual prominence in multimodal speech perception.. In Proc of Fonetik 99 (pp. 61-64). Gustafson, J., Lindberg, N., & Lundeberg, M. (1999). The August spoken dialogue system. In Proc of Eurospeech 99 (pp. 1151-1154). [pdf]Gustafson, J., Lindberg, N., & Lundeberg, M. (1999). The August spoken dialogue system.. In The Third Swedish Symposium on Multimodal Communication. Gustafson, J., Lundeberg, M., & Liljencrants, J. (1999). Experiences from the development of August - a multimodal spoken dialogue system.. In Proc from IDS '99 (pp. 61-64). [pdf]Gustafson, J., Sjölander, K., Beskow, J., Granström, B., & Carlson, R. (1999). Creating web-based exercises for spoken language technology. In Tutorial session in proceedings of IDS'99 (pp. 165-168). [pdf]Gustafson, K., & House, D. (1999). 
Prosodic parameters of a "fun" speaking style.. In COST 258: The Budapest Meeting, COST Working Papers. Heldner, M., Strangert, E., & Deschamps, T. (1999). A focus detector using overall intensity and high frequency emphasis. In Proceedings of ICPhS-99 (pp. 1491-1494). [pdf]Heldner, M., Strangert, E., & Deschamps, T. (1999). Focus detection using overall intensity and high frequency emphasis. In Proc of Fonetik 99 (pp. 73-76). [pdf]House, D. (1999). Perception of pitch and tonal timing: implications for mechanisms of tonogenesis. In Proc of ICPhS-99 (pp. 1823-1826). House, D., Bell, L., Gustafson, K., & Johansson, L. (1999). Child-directed speech synthesis: evaluation of prosodic variation for an educational computer program. In Proc of Eurospeech 99 (pp. 1843-1846). [pdf]Hunnicutt, S., Carlberger, A., Rosengren, E., Talbot, N., & Bickley, C. (1999). Draw a circle! Spoken man-machine communication for design.. In The Third Swedish Symposium on Multimodal Communication. Hunnicutt, S., Carlson, R., Carlberger, A., & Rosengren, E. (1999). TIDE-ENABL-projektet: Sofistikerad design med hjälp av talförståelse.. In Proc of Konferensen Människa-Handikapp-Livsvillkor Rendez-vous (pp. 161-162). Karlsson, I. (1999). Within-speaker variability in the VeriVox database. In Andersson, R., Abelin, Å., Allwood, J., & Lindblad, P. (Eds.), Proc of Fonetik 99 (pp. 93-96). [pdf]Karlsson, I., & Thornton, S. (1999). Really natural Speech Generation poses some tough challenges. Computer Telephony Europe, August 1999.Liljencrants, J. (1999). Judges of prominence. In Proc of Fonetik 99 (pp. 101-107). Lindberg, J., & Blomberg, M. (1999). Vulnerability in speaker verification. A study of technical impostor techniques. In Proc of Eurospeech 99 (pp. 1211-1214). [pdf]Lindberg, N., & Eineborg, M. (1999). Improving part of speech disambiguation rules by adding linguistic knowledge.. In Dzeroski, S., & Flach, P. A. (Eds.), Inductive Logic Programming (pp. 186-197). 
Berlin: Springer Verlag. Lindberg, N., & Eineborg, M. (1999). Improving POS tagging by adding heuristic background knowledge. In Proc of ILP '99. Lundeberg, M., & Beskow, J. (1999). Developing a 3D-agent for the August dialogue system. In Proc of AVSP 99. [pdf] Massaro, D., Beskow, J., Cohen, M., Fry, C., & Rodriguez, T. (1999). Picture my voice: Audio to visual speech synthesis using artificial neural networks. In Proc of AVSP 99 (pp. 133-138). [pdf] Massaro, D., Cohen, M., & Beskow, J. (1999). From Theory to Practice: Rewards and Challenges. In Proc of ICPhS. [pdf] Melin, H. (1999). Databases for Speaker Recognition: Activities in COST250 Working Group 2. In COST250 Workshop on Speaker Recognition in Telephony. Rome. [pdf] Melin, H., & Lindberg, J. (1999). Variance flooring, scaling and tying for text-dependent speaker verification. In Proc of Eurospeech 99 (pp. 1975-1978). [pdf] Parkvall, M., & Edlund, J. (1999). The creolist archives. Journal of Pidgin and Creole languages, 14(2), 347-350. Salvi, G. (1999). Developing acoustic models for automatic speech recognition in Swedish. The European Student Journal of Language and Speech. [abstract] [pdf] Abstract: This paper is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. This is done employing hidden Markov models and using the SpeechDat database to train their parameters. Acoustic modeling has been worked out at a phonetic level, allowing general speech recognition applications, even though a simplified task (digits and natural number recognition) has been considered for model evaluation. Different kinds of phone models have been tested, including context independent models and two variations of context dependent models. Furthermore many experiments have been done with bigram language models to tune some of the system parameters. 
System performance over various speaker subsets with different sex, age and dialect has also been examined. Results are compared to previous similar studies showing a remarkable improvement. Svensson, E-L. (1999). Pronunciation in KTH:s text-to-speech system. In Proc of Fonetik 99 (pp. 129-132). Öhman, T., & Lundeberg, M. (1999). Differences in speechreading a synthetic and a natural face. In Proc of ICPhS-99.
1998
Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). Synthetic faces as a lipreading support. In Proceedings of ICSLP'98. [pdf] Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). Teleface - the use of a synthetic face for the hard of hearing. In Proc of IVTTA'98. Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). The synthetic face from a hearing impaired view. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 200-203). Stockholm University. [html] Beskow, J. (1998). A tool for teaching and development of parametric speech synthesis. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 162-165). Stockholm University. [pdf] Bickley, C., Carlson, R., Cudd, P., Hunnicutt, S., Reimers, B., & Whiteside, S. (1998). Enabler for engineering software using language and speech. In Proc RESNA. Bimbot, F., Hutter, H-P., Jaboulet, C., Koolwaaij, J., Lindberg, J., & Pierrot, J-B. (1998). An overview of the CAVE project research activities in speaker verification. In Proc of RLA2C (pp. 215-220). [ps] Blomberg, M. (1998). Speech recognition using long-distance relations in an utterance. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 166). Bruce, G., Frid, J., Granström, B., Gustafson, K., Horne, M., & House, D. (1998). 
Prosodic segmentation and structuring of dialogue. In Werner, S. (Ed.), Nordic Prosody, Proceedings of the VIIth Conference (pp. 63-72). Carlberger, A. (1998). Lexicons and grammar for speech recognition in an engineering design program (ICAD). In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 172-175). Stockholm University. Carlberger, J., & Hunnicutt, S. (1998). A probabilistic word prediction program. In Proc of RESNA '98 (pp. 50-52). Minneapolis, Minn. Carlson, R., Granström, B., Gustafson, J., Lewin, E., & Sjölander, K. (1998). Hands-on speech technology on the Web. In Proceedings of Elsnet in Wonderland (pp. 30-36). Cohen, M., Beskow, J., & Massaro, D. (1998). Recent developments in facial animation: an inside view. In Proc of AVSP'98. [pdf] Cole, R., Carmell, T., Conners, P., Macon, M., Wouters, J., de Villiers, J., Tarachow, A., Massaro, D., Cohen, M., Beskow, J., Yang, J., Meier, U., Waibel, A., Stone, P., Fortier, G., Davis, A., & Soland, C. (1998). Intelligent Animated Agents for Interactive Language Training. In Proc of STiLL - ESCA Workshop on Speech Technology in Language Learning. Dybkjær, L., Bernsen, N-O., Carlson, R., Chase, L., Dahlbäck, N., Failenschmid, K., Heid, U., Heisterkamp, P., Jönsson, A., Kamp, H., Karlsson, I., van Kuppevelt, J., Lamel, L., Paroubek, P., & Williams, D. (1998). Comparative evaluation of letter-to-sound conversion techniques for English text-to-speech synthesis. In Proc of the First International Conference on Language Resources and Evaluation. Eineborg, M., & Lindberg, N. (1998). Induction of constraint grammar-rules using Progol. In Proceedings of ILP -98, The Eighth International Conference on Inductive Logic Programming. Madison, Wisconsin. Engwall, O. (1998). A 3D vocal tract model for articulatory and visual speech synthesis. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 196-199). 
Stockholm University. Gawronska, B., & House, D. (1998). Information extraction and text generation of news reports for a Swedish-English bilingual spoken dialogue system. In Proceedings ICSLP 98, Fifth International Conference on Spoken Language Processing (pp. 1139-1142). Sydney, Australia. [pdf] Gustafson, J., Elmberg, P., Carlson, R., & Jönsson, A. (1998). An educational dialogue system with a user controllable dialogue manager. In Proc of ICSLP98, Intl Conference on Spoken Language Processing (pp. 33-36). Sydney, Australia. [pdf] Gustafson, J., & Sjölander, K. (1998). Educational tools for speech technology. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 176-179). Stockholm University. Heldner, M. (1998). Is an F0-rise a necessary or a sufficient cue to perceived focus in Swedish? In Nordic Prosody: Proc of the VIIth Conference (pp. 109-125). [pdf] Heldner, M., & Strangert, E. (1998). On the amount and domain of focal lengthening in Swedish two-syllable words. In Proc of FONETIK 98 (pp. 154-157). [pdf] Hermes, D., Beaugendre, F., & House, D. (1998). Individual differences in accentuation boundaries in Dutch. IPO Annual Progress Report, Eindhoven, 32, 131-138. House, D., Hermes, D., & Beaugendre, F. (1998). Perception of tonal rises and falls for accentuation and phrasing in Swedish. In Proceedings ICSLP 98, Fifth International Conference on Spoken Language Processing (pp. 2799-2802). Sydney, Australia. [pdf] House, D., Hermes, D., & Beaugendre, F. (1998). Perception of tonal rises and falls for accentuation and phrasing: Swedish listener results. In Proceedings of Fonetik '98 (pp. 146-149). Stockholm University. Hunnicutt, S., Rosen, K., Rosengren, E., & Carlberger, A. (1998). TIDE-ENABL: Access to knowledge-based engineering through speech recognition. In Proc of ISAAC Conference (pp. 462-463). 
Dublin, Ireland. Karlsson, I., Banziger, T., Dankovicov, J., Johnstone, T., Lindberg, J., Melin, H., Nolan, F., & Scherer, K. (1998). Speaker verification with elicited speaking-styles in the VeriVox project. In Proc of RLA2C, La Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques (Speaker Recognition and its Commercial and Forensic Applications) (pp. 207-210). Avignon, France. [pdf] Karlsson, I., Banziger, T., Dankovicov, J., Johnstone, T., Lindberg, J., Melin, H., Nolan, F., & Scherer, K. (1998). Within speaker variation due to induced stress. In Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 150-153). Stockholm University. Karlsson, I., Banziger, T., Dankovicov, J., Johnstone, T., Lindberg, J., Melin, H., Nolan, F., & Scherer, K. (1998). Within-speaker variability due to speaking manners. In Mannell, R., & Robert-Ribes, J. (Eds.), Proc of ICSLP98, Intl Conference on Spoken Language Processing (pp. 2379-2382). Sydney, Australia. [pdf] Langlais, P., Öster, A-M., Carlson, R., & Granström, B. (1998). Automatic detection of mispronunciation in non-native Swedish speech. In Proc of STiLL98, ESCA-Workshop on Speech Technology in Language Learning (pp. 41-44). Marholmen, Sweden. Langlais, P., Öster, A-M., & Granström, B. (1998). . In Proc of ICSLP98, Intl Conference on Spoken Language Processing (pp. 1743-1746). Sydney, Australia. Lee, T., Carlson, R., & Granström, B. (1998). Context-dependent duration modeling for continuous speech recognition. In Proc of ICSLP98, Intl Conference on Spoken Language Processing (pp. 2955-2958). Sydney, Australia. Lindberg, J., Koolwaaij, J., Hutter, H-P., Genoud, D., Pierrot, J-B., Blomberg, M., & Bimbot, F. (1998). Techniques for a priori decision threshold estimation in speaker verification. In Proc of RLA2C, La Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques (Speaker Recognition and its Commercial and Forensic Applications) (pp. 89-92). Avignon, France. 
[pdf] Lindberg, N., & Eineborg, M. (1998). Learning constraint grammar-style disambiguation rules using inductive logic programming. In Proc of COLING-ACL -98 (pp. 775-779). Montréal, Quebec, Canada. Lindberg, N., & Eineborg, M. (1998). Learning part of speech disambiguation rules using inductive logic programming. In Keller, B. (Ed.), Proceedings of the ESSLLI -98 Workshop on Automated Acquisition of Syntax and Parsing (pp. 1-8). Saarbrücken, Summer school. Melin, H. (1998). On word boundary detection in digit-based speaker verification. In Proc of RLA2C, La Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques (Speaker Recognition and its Commercial and Forensic Applications) (pp. 46-49). Avignon, France. [pdf] Melin, H. (1998). Optimizing Variance Flooring In HMM-Based Speaker Verification. In COST250 report. Ankara, Turkey. [pdf] Melin, H., Koolwaaij, J., Lindberg, J., & Bimbot, F. (1998). A comparative evaluation of variance flooring techniques in HMM-based speaker verification. In Proc of ICSLP98, 5th Intl Conference on Spoken Language Processing (pp. 1903-1906). Sydney, Australia. [pdf] Nord, L., Hunnicutt, S., & Rosengren, E. (1998). ENABL - Access to design by speech recognition: analysis of dysarthric voices. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 82-85). Stockholm University. Nordström, T., Melin, H., & Lindberg, J. (1998). A comparative study of speaker verification systems using the polycost database. In Proc of ICSLP98, 5th Intl Conference on Spoken Language Processing (pp. 1359-1362). Sydney, Australia. [pdf] Petrovska, D., Hennebert, J., Melin, H., & Genoud, D. (1998). POLYCOST: A telephone-speech database for speaker recognition. In Proc of RLA2C, La Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques (Speaker Recognition and its Commercial and Forensic Applications) (pp. 211-214). 
[pdf] Pierrot, J-B., Lindberg, J., Koolwaaij, J., Hutter, H-P., Genoud, D., Blomberg, M., & Bimbot, F. (1998). A comparison of a priori threshold setting procedures for speaker verification in the CAVE project. In Proc of ICASSP98, Intl Conference on Acoustics, Speech and Signal Processing (pp. 125-128). Seattle, Wash. Rosengren, E., & Hunnicutt, S. (1998). Evaluations in VAESS. Programs for touchscreen use: voices and emotions. In Proc of ISAAC (pp. 490-491). Dublin, Ireland. Salvi, G. (1998). Developing acoustic models for automatic speech recognition. Master's thesis, KTH, TMH, CTT. [pdf] Sjölander, K., Beskow, J., Gustafson, J., Lewin, E., Carlson, R., & Granström, B. (1998). Web-based educational tools for speech technology. In Proc of ICSLP98, 5th Intl Conference on Spoken Language Processing (pp. 3217-3220). Sydney, Australia. Strangert, E., & Heldner, M. (1998). On the amount and domain of focal lengthening in Swedish. In Proc of ICSLP98 (pp. 3305-3308). [pdf] Strangert, E., & Heldner, M. (1998). On the amount and domain of focal lengthening in two-syllable and longer Swedish words. In Proc of FONETIK 98 (pp. 134-137). [pdf] Sundström, A. (1998). Automatic prosody modification as a means for foreign language pronunciation training. In Proc of STiLL98, ESCA Workshop on Speech Technology in Language Learning (pp. 49-52). [pdf] Tamura, M., Masuko, T., Kobayashi, T., & Tokuda, K. (1998). Visual Speech Synthesis Based on Parameter Generation from HMM: Speech-Driven and Text-And-Speech-Driven Approaches. In Proc. of Auditory-Visual Speech Processing (AVSP'98) (pp. 221-226). [abstract] Abstract: This paper describes a technique for synthesizing synchronized lip movements from auditory input speech signal. The technique is based on an algorithm for parameter generation from HMM with dynamic features, which has been successfully applied to text-to-speech synthesis. 
Audio-visual speech unit HMMs, namely, syllable HMMs are trained with parameter vector sequences that represent both auditory and visual speech features. Input speech is recognized using the syllable HMMs and converted into a transcription and a state sequence. A sentence HMM is constructed by concatenating the syllable HMMs corresponding to the transcription for the input speech. Then an optimum visual speech parameter sequence is generated from the sentence HMM in ML sense. Since the generated parameter sequence reflects statistical information of both static and dynamic features of several phonemes before and after the current phonemes, synthetic lip motion becomes smooth and realistic. We show experimental results which demonstrate the effectiveness of the proposed technique.
1997
Beaugendre, F., House, D., & Hermes, D. (1997). Accentuation boundaries in Dutch, French and Swedish. In Botinis, A., Kouroupetroglou, G., & Carayannis, G. (Eds.), Proceedings of ESCA Tutorial and Research Workshop on Intonation: Theory, Models and Applications (pp. 43-46). Athens, Greece. [pdf] Bertenstam, J., Granström, B., Gustafson, K., Hunnicutt, S., Karlsson, I., Meurlinger, C., Nord, L., & Rosengren, E. (1997). The VAESS communicator: a portable communication aid with new voice types and emotions. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 57-60). Beskow, J. (1997). Animation of talking agents. In Benoit, C., & Campbell, R. (Eds.), Proc of ESCA Workshop on Audio-Visual Speech Processing (pp. 149-152). Rhodes, Greece. [pdf] Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1997). The Teleface project - disability, feasibility and intelligibility. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum (pp. 85-88). 
[pdf] Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1997). The Teleface project - Multimodal speech communication for the hearing impaired. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 2003-2006). Rhodes, Greece. [pdf] Beskow, J., Elenius, K., & McGlashan, S. (1997). OLGA - A dialogue system with an animated talking agent. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 1651-1654). Rhodes, Greece. [pdf] Beskow, J., Elenius, K., & McGlashan, S. (1997). The OLGA project: An animated talking agent in a dialogue system. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Phonum 4 (pp. 69-72). Lövånger/Umeå. [pdf] Beskow, J., & McGlashan, S. (1997). OLGA - A conversational agent with gestures. In André, E. (Ed.), Proc of the IJCAI -97 Workshop on Animated Interface Agents: Making them Intelligent (pp. 39-44). Nagoya, Japan. [pdf] Bimbot, F., Hutter, H-P., Jaboulet, C., Koolwaaij, J., Lindberg, J., & Pierrot, J-B. (1997). Speaker verification in the telephone network research activities in the CAVE project. In Proc EUROSPEECH'97 (pp. 971-974). [pdf] Blomberg, M. (1997). Creating unseen triphones by phone concatenation in the spectral, cepstral and formant domains. In Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 41-44). Blomberg, M. (1997). Creating unseen triphones by phone concatenation of diphones and monophones in the spectral, cepstral and formant domains. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 1187-1190). Rhodes, Greece. [ps] Bruce, G., Filipsson, M., Frid, J., Granström, B., Gustafson, K., Horne, M., & House, D. (1997). 
Global features in the modelling of intonation in spontaneous Swedish. In Botinis, A., Kouroupetroglou, G., & Carayannis, G. (Eds.), Proc of ESCA workshop on Intonation: Theory, Models and Applications (pp. 59-62). Athens. [pdf] Bruce, G., Filipsson, M., Frid, J., Granström, B., Gustafson, K., Horne, M., & House, D. (1997). Modelling intonation in spontaneous speech. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 173-174). Bruce, G., Filipsson, M., Frid, J., Granström, B., Gustafson, K., Horne, M., & House, D. (1997). Text-to-intonation in spontaneous Swedish. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech 97, 5th European Conference on Speech Communication and Technology (pp. 215-218). Rhodes, Greece. [pdf] Bruce, G., & Granström, B. (1997). Prosodic modelling in Swedish speech synthesis. In Fant, G., Hirose, K., & Kiritani, S. (Eds.), Analysis, Perception and Processing of Spoken Language. Festschrift for Hiroya Fujisaki (pp. 62-73). Amsterdam, The Netherlands: Elsevier Science B.V. Bruce, G., Granström, B., Gustafson, K., Horne, M., House, D., & Touati, P. (1997). On the analysis of prosody in interaction. In Sagisaka, Y., Campbell, N., & Higuchi, N. (Eds.), Computing Prosody. Computational Models for Processing Spontaneous Speech. New York: Springer. Carlberger, A. (1997). Profet. A new generation of word prediction: An evaluation study. In Copestake, A., Langer, S., & Palazuelos-Cagigas, S. (Eds.), Proc of a Workshop on Natural Language Processing (NLP) for Communication Aids (pp. 23-28). Madrid, Spain. Carlberger, A., Lewin, E., Nord, L., Rosengren, E., Ström, N., Carlson, R., & Hunnicutt, S. (1997). ENABL - access to design by speech recognition. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 89-92). 
Carlberger, A., Magnuson, T., Carlberger, J., Wachtmeister, H., & Hunnicutt, S. (1997). Probability-based word prediction for writing support in dyslexia. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 17-20). Carlson, R., & Granström, B. (1997). Dialogue management and multi-modal output in the Waxholm spoken dialogue system. In Luperfoy, S. (Ed.), Automated Spoken Dialog Systems. MIT Press. Carlson, R., & Granström, B. (1997). Speech synthesis. In Hardcastle, W. J., & Laver, J. (Eds.), The Handbook of Phonetic Science (pp. 768-788). Oxford: Blackwell Publ. Ltd. Damper, R., & Gustafson, K. (1997). Evaluating the pronunciation component of a text-to-speech system. In Speech and Language Technology (SALT) Club Workshop on Evaluation in Speech and Language Technology (pp. 72-79). D'Imperio, M., & House, D. (1997). Perception of questions and statements in Neapolitan Italian. In Proceedings of Eurospeech 97, 5th European Conference on Speech Communication and Technology (pp. 251-254). Rhodes, Greece. [pdf] Eriksson, M., & Gambäck, B. (1997). SVENSK: A toolbox of Swedish language processing resources. In Proc of the 2nd Intl Conf on Recent Advances in Natural Language Processing (RANLP '97) (pp. 336-341). Tzigov Chark, Bulgaria. Fant, G., Hertegård, S., Kruckenberg, A., & Liljencrants, J. (1997). Covariation of subglottal pressure, F0 and glottal parameters. In Proc Eurospeech '97 (pp. 453-456). Granström, B. (1997). Applications of intonation - An overview. In Botinis, A., Kouroupetroglou, G., & Carayannis, G. (Eds.), Proc of ESCA workshop on Intonation: Theory, Models and Applications (pp. 21-24). Athens. Gustafson, J. (1997). How do system questions influence lexical choices in user answers? In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 49-52). Gustafson, J., Larsson, A., Carlson, R., & Hellman, K. (1997). 
How do system questions influence lexical choices in user answers? In Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 2275-2278). Rhodes, Greece. [pdf] Heldner, M. (1997). The contribution of pitch movements to perceived focus. In PHONUM 4 (pp. 109-112). Umeå: Department of Phonetics, Umeå University. Heldner, M. (1997). To what extent is perceived focus determined by F0-cues? In Eurospeech '97 Proceedings (pp. 875-877). Rhodes, Greece: ESCA. Hermes, D., Beaugendre, F., & House, D. (1997). Temporal alignment of accentuation boundaries in Dutch. In Botinis, A., Kouroupetroglou, G., & Carayannis, G. (Eds.), Proceedings of ESCA Tutorial and Research Workshop on Intonation: Theory, Models and Applications (pp. 177-180). Athens, Greece. [pdf] Hirsch, H-G. (1997). Adaptation of HMMs in the presence of additive and convolutional noise. In Furui, S., Juang, B., & Chou, W. (Eds.), Proc IEEE Workshop on Automatic Speech Recognition and Understanding. House, D. (1997). Perceptual thresholds and tonal categories. In Phonum 4 (pp. 179-182). Department of Phonetics, University of Umeå. House, D., Hermes, D., & Beaugendre, F. (1997). Temporal-alignment categories of accent-lending rises and falls. In Proceedings of Eurospeech 97, 5th European Conference on Speech Communication and Technology (pp. 879-882). Rhodes, Greece. [pdf] Högberg, J. (1997). Data driven formant synthesis. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 565-568). Rhodes, Greece. Lieske, C., Bos, J., Emele, M., Gambäck, B., & Rupp, C. J. (1997). Giving prosody a meaning. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 1431-1434). Rhodes, Greece. Lindberg, J., Blomberg, M., & Melin, H. (1997). CAVE - Speaker verification in bank and telecom services. 
In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 65-68). [ps] Lindberg, J., & Melin, H. (1997). Text-Prompted versus Sound-Prompted Passwords in Speaker Verification Systems. In Proc EUROSPEECH'97 (pp. 851-854). [pdf] Melin, H., & Lindberg, J. (1997). Guidelines for experiments on the POLYCOST database. In Proc of COST250 Workshop on Application of speaker recognition techniques in telephony (pp. 59-69). Vigo, Spain. Melin, H., & Lindberg, J. (1997). Prompting of passwords in speaker verification systems. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 45-48). [pdf] Rosengren, E., & Hunnicutt, S. (1997). VAESS-projektet: Ett nytt portabelt kommunikationshjälpmedel med nyutvecklade röster och känslouttryck. In Forskningskonferensen Människa, Handikapp, Livsvillkor (pp. 341-343). Örebro. Sjölander, K., & Gustafson, J. (1997). An integrated system for teaching spoken dialogue systems technology. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 1927-1930). Rhodes, Greece. [pdf] Sjölander, K., & Högberg, J. (1997). Trying to improve phone and word recognition using finely tuned phone-like units. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 125-128). Sjölander, K., & Högberg, J. (1997). Using expanded question sets in decision tree clustering for acoustic modelling. In Furui, S., Juang, B-H., & Chou, W. (Eds.), Proc. of IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 179-184). Santa Barbara, Calif. Ström, N. (1997). A tonotopic artificial neural network architecture for phoneme probability estimation. In Furui, S., Juang, B-H., & Chou, W. 
(Eds.), Proc of IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 156-163). Ström, N. (1997). Automatic continuous speech recognition with rapid speaker adaptation for human/machine interaction. Doctoral dissertation, KTH/TMH. Ström, N. (1997). Phoneme probability estimation with dynamic sparsely connected artificial neural networks. The Free Speech Journal, 1(5), 1-41. Ström, N. (1997). Sparse connection and pruning in large dynamic artificial neural networks. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech 97 (pp. 2807-2810). Öhman, T. (1997). Measuring visual speech. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 53-56).
1996
Blomberg, M., & Elenius, K. (1996). Creation of unseen triphones from diphones and monophones using a speech production approach. In Proc of ICSLP-96, 4th Intl Conference on Spoken Language Processing (pp. 2316-2319). Philadelphia, USA. [pdf] Bruce, G., Filipsson, M., Frid, J., Granström, B., Gustafson, K., Horne, M., House, D., Lastow, B., & Touati, P. (1996). Developing the modelling of Swedish prosody in spontaneous dialogue. In Proc of ICSLP 96 (pp. 370-373). [pdf] Bruce, G., Frid, J., Granström, B., Gustafson, K., Horne, M., & House, D. (1996). Prosodic segmentation and structuring of dialogue. In Proc Nordisk Prosodi VII. Joensuu, Finland. Bruce, G., & Granström, B. (1996). Prosodic modelling in Swedish speech synthesis. In Fant, G., Hirose, K., & Kiritani, S. (Eds.), Analysis, Perception and Processing of Spoken Language. Festschrift for Hiroya Fujisaki (pp. 62-73). Amsterdam: Elsevier. Båvegård, M. (1996). Towards an articulatory speech synthesizer: Model development and simulations. Licentiate dissertation, KTH/TMH. Carlson, R. (1996). The dialog component in the Waxholm system. In LuperFoy, S., Nijholt, A., & Veldhuijzen van Zanten, G. (Eds.), Proc of Twente Workshop on Language Technology. 
Dialogue Management in Natural Language Systems (TWLT 11) (pp. 209-218). [pdf] Carlson, R., & Granström, B. (1996). The Waxholm spoken dialogue system. Phonetica Pragensia IX. Charisteria viro doctissimo Premysl Janota oblata. Acta Universitatis Carolinae Philologica 1, 1996. [pdf] Carlson, R., & Hunnicutt, S. (1996). Generic and domain-specific aspects of the Waxholm NLP and Dialog modules. In Proc of ICSLP-96, 4th Intl Conference on Spoken Language Processing (pp. 677-680). Philadelphia, USA. [pdf] Gustafson, J. (1996). A Swedish name pronunciation system. Licentiate dissertation, KTH/TMH. [pdf] Hennebert, J., Melin, H., Genoud, D., & Petrovska Delacretaz, D. (1996). The POLYCOST 250 Database. Technical Report v1.0, COST250 report. [pdf] House, D. (1996). Differential perception of tonal contours through the syllable. In Proceedings ICSLP 96, Fourth International Conference on Spoken Language Processing (pp. 2048-2051). Philadelphia, PA, USA. [pdf] House, D., & Svantesson, J. (1996). Tonal timing and vowel onset characteristics in Thai. In Proceedings of the Fourth International Symposium on Languages and Linguistics (pp. 1:104-113). Bangkok, Thailand. Högberg, J. (1996). Some studies of the relation between speech acoustics, articulation and phonetic structure. Licentiate dissertation, KTH/TMH. Högberg, J., & Sjölander, K. (1996). Cross phone state clustering using lexical stress information. In Proc of ICSLP-96, 4th Intl Conference on Spoken Language Processing (pp. 474-477). Philadelphia, USA. [pdf] Liljencrants, J. (1996). Experiments with analysis by synthesis of glottal airflow. In Proc of ICSLP'96 (pp. 1289-1292). Melin, H. (1996). Gandalf - A Swedish telephone speaker verification database. In Proc of 4th Intl Conference on Spoken Language Processing, ICSLP-96 (pp. 1954-1957). Philadelphia, USA. [pdf] Melin, H., & Lindberg, J. (1996). Guidelines for experiments on the POLYCOST database. 
In Proc COST250 Workshop on The Application of Speaker Recognition Technologies in Telephony (pp. 59-69). [pdf] Meng, H., Hunnicutt, S., Seneff, S., & Zue, V. (1996). Reversible letter-to-sound/sound-to-letter generation based on parsing word morphology. Speech Comm, 18, 47-63.
1995
Horne, M., Strangert, E., & Heldner, M. (1995). Prosodic boundary strength in Swedish: Final lengthening and silent interval duration. In Proceedings ICPhS 95 (pp. 170-173). Stockholm, Sweden. House, D. (1995). Perception of prepausal tonal contours: implications for automatic stylization of intonation. In Eurospeech 1995. Madrid, Spain. [pdf] House, D. (1995). Speech production by adults using cochlear implants. In Plant, G., & Spens, K-E. (Eds.), Profound Deafness and Speech Communication (pp. 285-296). London: Whurr Publishers. House, D. (1995). The influence of silence on perceiving the preceding tonal contour. In Elenius, K., & Branderud, P. (Eds.), Proceedings of ICPhS 95, 13th International Congress of Phonetic Sciences (pp. 1:122-125). Stockholm, Sweden. Strangert, E., & Heldner, M. (1995). Labelling of boundaries and prominences by phonetically experienced and non-experienced transcribers. In PHONUM 3 (pp. 85-109). Umeå: Department of Phonetics, Umeå University. Strangert, E., & Heldner, M. (1995). The labelling of prominence in Swedish by phonetically experienced transcribers. In Proceedings ICPhS 95 (pp. 204-207). Stockholm, Sweden.
1994
House, D. (1994). Perception and production of mood in speech by cochlear implant users. In Proceedings ICSLP 94, 1994 International Conference on Spoken Language Processing (pp. 2051-2054). Yokohama, Japan. [pdf] Strangert, E., & Heldner, M. (1994). Prosodic labelling and acoustic data. In Working Papers 43: Papers from the Eighth Swedish Phonetics Conference (pp. 120-123). Lund, Sweden.
1993
Heldner, M. (1993). Mustasch på Mona Lisa: Några reflexioner om patafysiken. In Söhrman, I. (Ed.), La culture dans la langue (pp. 71-87). 
Stockholm, Sweden: Almqvist & Wiksell. Nordebo, S., Bengtsson, B., Claesson, I., Nordholm, S., Roxström, A., Blomberg, M., & Elenius, K. (1993). Noise Reduction Using an Adaptive Microphone Array for Speech Recognition in a Car. In RVK -93.
1992
Deutsch, D. (1992). Some new pitch paradoxes and their implications. Philosophical Transactions: Biological Sciences, 336, 391-397. [pdf] House, D. (1992). Changes in the control of fundamental frequency following activation of a cochlear prosthesis. In Proceedings of the Sixth National Swedish Phonetics Conference, Technical Report no. 10 (pp. 79-82). Department of Information Theory, Chalmers University of Technology, Gothenburg. House, D. (1992). Perceptual Constraints and Tonal Features. In Dressler, W., Luschutzky, H., Pfeiffer, O., & Rennison, J. (Eds.), Phonologica 1988 (pp. 111-118). Cambridge: Cambridge University Press. House, D., & Willstedt, U. (1992). Can a cochlear implant and voice training improve voice control? Working Papers in Logopedics and Phoniatrics, Department of Logopedics and Phoniatrics, Lund University, 8, 15-33. House, D., & Willstedt, U. (1992). Changes in control of fundamental frequency and voice quality following cochlear implant activation and speech training. In Risberg, A., Felicetti, S., Plant, G., & Spens, K. (Eds.), Proceedings of the 2nd International Conference on Tactile aids, Hearing Aids, & Cochlear Implants (pp. 201-210). Stockholm, Sweden: KTH. [pdf]
1991
Bruce, G., Granström, B., Gustafson, D., & House, D. (1991). On prosodic phrasing in Swedish. Perilus, Institute of Linguistics, University of Stockholm, XIII, 35-38. Bruce, G., Granström, B., Gustafson, K., & House, D. (1991). Prosodic phrasing in Swedish. Working Papers, Department of Linguistics and Phonetics, Lund University, 38, 5-17. Bruce, G., Granström, B., & House, D. (1991). Strategies for prosodic phrasing in Swedish. In Proc. of the Twelfth International Congress of Phonetic Sciences (pp. 4:182-185). 
Aix-en-Provence, France.
Fant, G. (1991). Units of temporal organization. Stress groups versus syllables and words. In Proceedings of ICPhS, Aix-en-Provence, France (pp. 247-250). [pdf]
Fant, G., Kruckenberg, A., & Nord, L. (1991). Temporal organization and rhythm in Swedish. In Proceedings of ICPhS, Aix-en-Provence, France (pp. 251-256). [pdf]
House, D. (1991). A model of optimal tonal feature perception. In Proc. of the Twelfth International Congress of Phonetic Sciences (pp. 2:102-105). Aix-en-Provence, France.
House, D. (1991). Cochlear implants and the perception of mood in speech. British Journal of Audiology, 26, 198.
House, D. (1991). On hearing impairments, cochlear implants and the perception of mood in speech. In PERILUS XIII (pp. 125-128). Institute of Linguistics, University of Stockholm. [pdf]
House, D., & Willstedt, U. (1991). Fundamental frequency control and voice quality in cochlear implant users. Working Papers, Department of Linguistics and Phonetics, Lund University, 38, 115-132.
Summerfield, Q. (1991). Visual perception of phonetic gestures. In Modularity and the Motor Theory of Speech Perception: Proceedings of a Conference to Honor Alvin M. Liberman (pp. 117).

1990
Bruce, G., Granström, B., & House, D. (1990). Prosodic phrasing in Swedish speech synthesis. In Bailly, G., & Benoit, C. (Eds.), Proc. of the ESCA Workshop on speech synthesis (pp. 125-129). Autrans, France.
Bruce, G., & House, D. (1990). Slutrapport från projektet Prosodisk parsning för igenkänning av svenska. Technical Report Department of Linguistics and Phonetics, Lund University.
House, D. (1990). On the perception of mood in speech. Phonum, 1, 113-116.
House, D. (1990). On the perception of mood in speech: implications for the hearing impaired. Working Papers, Department of Linguistics and Phonetics, Lund University, 36, 99-108. [pdf]
House, D. (1990). Tonal Perception in Speech. Lund: Lund University Press.
House, D., & Bruce, G. (1990).
Word and focal accents in Swedish from a recognition perspective. In Wiik, K., & Raimo, I. (Eds.), Nordic Prosody V (pp. 156-173). Turku University.

1989
House, D. (1989). Automatic recognition of prosodic categories. In Szende, T. (Ed.), Proc. of the Speech Research '89 International Conference (pp. 347-350). Budapest: Linguistics Institute of the Hungarian Academy of Sciences.
House, D. (1989). Cues for the perception of mood in speech: implications for the hearing-impaired. British Journal of Audiology, 23, 171.
House, D. (1989). Perceptual Constraints and Tonal Features. Working Papers, Department of Linguistics and Phonetics, Lund University, 35, 113-120.

1988
House, D. (1988). Perceptual constraints and tonal features. In Sixth International Phonology Meeting, Krems (pp. 38). Vienna.
House, D., Bruce, G., Eriksson, L., & Lacerda, F. (1988). Recognition of Prosodic Categories in Swedish: Rule Implementation. Working Papers, Department of Linguistics and Phonetics, Lund University, 33, 153-161.
House, D., Bruce, G., Eriksson, L., & Lacerda, F. (1988). Recognition of Prosodic Categories in Swedish: Rule Implementation (abridged version). Working Papers, Department of Linguistics and Phonetics, Lund University, 34, 62-65.
House, D., & Horne, M. (1988). Reduction Phenomena in Fast Speech: Influence of Dialect and Focus. In Discussion Papers, Sixth International Phonology Meeting, Krems (pp. 1:20-22). Vienna.

1987
Gårding, E., & House, D. (1987). Production and Perception of Phrases in some Nordic Dialects. In Lilius, P., & Saari, M. (Eds.), The Nordic Languages and Modern Linguistics, 6 (pp. 163-175). Helsinki University Press.
House, D. (1987). Implications of spectral changes for categorising tonal patterns in speech. British Journal of Audiology, 21, 116.
House, D. (1987). Perception of Tonal Patterns in Speech: Implications for Models of Speech Perception. In Proc. of the Eleventh International Congress of Phonetic Sciences (pp. 1:76-79).
Academy of Sciences of the Estonian S.S.R. Tallinn.
House, D. (1987). Speech Perception, Intonation and Memory. In RUUL 17, Papers from the Swedish Phonetics Conference (pp. 72-77). Department of Linguistics, Uppsala University.
House, D., Bruce, G., Lacerda, F., & Lindblom, B. (1987). Automatic Prosodic Analysis for Swedish Speech Recognition. In Laver, J., & Jack, M. (Eds.), European Conference on Speech Technology, Volume 1 (pp. 215-218). Edinburgh. [pdf]
House, D., Bruce, G., Lacerda, F., & Lindblom, B. (1987). Automatic Prosodic Analysis for Swedish Speech Recognition (extended version). Working Papers, Department of Linguistics and Phonetics, Lund University, 31, 87-101.
House, D., Bruce, G., Lacerda, F., & Lindblom, B. (1987). Prosodisk parsning för igenkänning av svenska. In Tal-Ljud-Hörsel 87 (pp. 29-21). Lund University.
House, D., & Gårding, E. (1987). Phrasing in some Nordic dialects. In Nordic Prosody IV (pp. 61-70). Odense University Press.

1986
House, D. (1986). Compensatory Use of Acoustic Speech Cues and Language Structure by Hearing-Impaired Listeners. In Nilsson, L., & Hjelmquist, E. (Eds.), Communication and Handicap - Aspects of Psychological Compensation and Technical Aids. (pp. 39-59). Amsterdam: North Holland.

1985
Gårding, E., & House, D. (1985). Frasintonation, särskilt i svenska. In Svenskans Beskrivning 15 (pp. 205-221). Gothenburg University.
House, D. (1985). Betydelsen av spektrala förändringar för kategorisering av tonala mönster. In Tal-Ljud Hörsel 2. (pp. 168). Gothenburg University.
House, D. (1985). Implications of Rapid Spectral Changes on the Categorization of Tonal Patterns in Speech Perception. Working Papers, Department of Linguistics and Phonetics, Lund University, 28, 69-89.
House, D. (1985). Kompensatoriska perceptionstrategier bland hörselskade lyssnare. In Tal-Ljud Hörsel 2. (pp. 168). Gothenburg University.
House, D. (1985). Sentence Prosody and Syntax in Speech Perception.
Working Papers, Department of Linguistics and Phonetics, Lund University, 28, 91-107.

1984
House, D. (1984). Experiments with Filtered Speech and Hearing-Impaired Listeners. Working Papers, Department of Linguistics and Phonetics, Lund University, 27, 101-123.
House, D. (1984). Transitioner i talperception: implikationer för perceptionsteori och applikationer i hörselvården. In Tal-Ljud-Hörsel 1 (pp. 108). Stockholm University.

1983
House, D. (1983). Perceptual Interaction between Fo Excursions and Spectral Cues. Working Papers, Department of Linguistics and Phonetics, Lund University, 25, 67-74.
House, D. (1983). Transitions in Speech Perception. In Abstracts of The Tenth International Congress of Phonetic Sciences (pp. 511). Foris Publications, Dordrecht, Holland.

1982
House, D. (1982). Identification of Nasals: An Acoustical and Perceptual Analysis of Nasal Consonants. Working Papers, Department of Linguistics and Phonetics, Lund University, 22, 153-162.

1979
Watanabe, A., Felicetti, S., Hedström, B., Surjadi, G., Tannergård, G., Tegerstedt, I., Wejnebring, B., Wetterling, M-B., Andersson, L., Hallsten, L., Kaunisto, M., Murray, T., Eriksson, H., Haapakorpi, M., Karlsson, I., Nord, L., Stålhammar, U., Elenius, K., Blomberg, M., Liljencrants, J., Carlson, R., Granström, B., Risberg, A., Spens, K-E., Agelförs, E., Boberg, G., Mártony, J., Tunblad, T., Öster, A-M., Galyas, K., Gauffin, J., de Serpa-Leitão, A., Askenfelt, A., Jansson, E., & Sundberg, J. (1979). Gunnar Fant 60 years. TMH-QPSR, 20(2), 1-45. [pdf]

1959
Fant, G. (1959). Acoustic analysis and synthesis of speech with applications to Swedish. Technical Report.