CTT Publications

This list contains publications authored or co-authored by members of CTT, the Centre for Speech Technology, during the period that CTT has existed. Some of the papers concern research projects outside CTT.

The papers are sorted by publication year and, within each year, by first author.




2017

Ambrazaitis, G., & House, D. (2017). Multimodal prominences: Exploring the interplay and usage of focal pitch accents, eyebrow beats and head beats in Swedish news readings. In Phonetics and Phonology in Europe 2017. Cologne, Germany: Accepted for publication.
Avramova, V., Yang, F., Li, C., Peters, C., & Skantze, G. (2017). A Virtual Poster Presenter Using Mixed Reality. In Proceedings of International Conference on Intelligent Virtual Agents (pp. 25-28). Stockholm, Sweden.
House, D., Ambrazaitis, G., Alexanderson, S., Ewald, O., & Kelterer, A. (2017). Temporal organization of eyebrow beats, head beats and syllables in multimodal signaling of prominence. In International Conference on Multimodal Communication: Developing New Theories and Methods. Osnabrück, Germany: Accepted for publication.
Johansson, R., Skantze, G., & Jönsson, A. (2017). A psychotherapy training environment with virtual patients implemented using the Furhat robot platform. In Proceedings of International Conference on Intelligent Virtual Agents (pp. 184-187). Stockholm, Sweden.
Karlsson, A., & House, D. (2017). Phonetics and Phonology in Europe 2017. In Phonetics and Phonology in Europe 2017. Cologne, Germany: Accepted for publication.
Karlsson, A., & House, D. (2017). The role of edges in prosodic articulation of discourse in phrase languages. In 15th International Pragmatics Conference. Belfast, Northern Ireland: Accepted for publication.
Lopes, J., Engwall, O., & Skantze, G. (2017). A First Visit to the Robot Language Café. In Proceedings of SLATE. Stockholm, Sweden.
Shore, T., & Skantze, G. (2017). Enhancing reference resolution in dialogue using participant feedback. In Proceedings of International Workshop on Grounding Language Understanding. Stockholm, Sweden.
Skantze, G. (2017). Predicting and Regulating Participation Equality in Human-robot Conversations: Effects of Age and Gender. In Conference on Human-Robot Interaction (HRI2017). Vienna, Austria. [pdf]
Skantze, G. (2017). Towards a General, Continuous Model of Turn-taking in Spoken Dialogue using LSTM Recurrent Neural Networks. In Proceedings of SigDial. Saarbrucken, Germany. [pdf]
Zellers, M., House, D., & Alexanderson, S. (2017). Investigating cooccurring gestural and prosodic cues at turn boundaries. In 15th International Pragmatics Conference. Belfast, Northern Ireland: Accepted for publication.

2016

Alexanderson, S., House, D., & Beskow, J. (2016). Automatic Annotation of Gestural Units in Spontaneous Face-to-Face Interaction. In Proceedings of the Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction (pp. 15-19). Tokyo, Japan. [pdf]
Alexanderson, S., O'Sullivan, C., & Beskow, J. (2016). Robust online motion capture labeling of finger markers. In Proceedings of the 9th International Conference on Motion in Games (pp. 7-13). San Francisco, CA, USA. [pdf]
Ambrazaitis, G., & House, D. (2016). Multimodal levels of prominence - The use of eyebrows and head beats to convey information structure in Swedish news reading. In Seventh Conference of the International Society for Gesture Studies (pp. 310). Paris.
Andersson, J., Berlin, S., Costa, A., Berthelsen, H., Lindgren, H., Lindberg, N., Beskow, J., Edlund, J., & Gustafson, J. (2016). WikiSpeech – enabling open source text-to-speech for Wikipedia. In Proceedings of 9th ISCA Speech Synthesis Workshop (pp. 111-117). Sunnyvale, USA. [pdf]
Arnela, M., Blandin, R., Dabbaghchian, S., Guasch, O., Alías, F., Pelorson, X., Van Hirtum, A., & Engwall, O. (2016).
Influence of lips on the production of vowels based on finite element simulations and experiments. Journal of the Acoustical Society of America, 139(5), 2852–2859. [abstract]Abstract: Three-dimensional (3-D) numerical approaches for voice production are currently being investigated and developed. Radiation losses produced when sound waves emanate from the mouth aperture are one of the key aspects to be modeled. When doing so, the lips are usually removed from the vocal tract geometry in order to impose a radiation impedance on a closed cross-section, which speeds up the numerical simulations compared to free-field radiation solutions. However, lips may play a significant role. In this work, the lips' effects on vowel sounds are investigated by using 3-D vocal tract geometries generated from magnetic resonance imaging. To this aim, two configurations for the vocal tract exit are considered: with lips and without lips. The acoustic behavior of each is analyzed and compared by means of time-domain finite element simulations that allow free-field wave propagation and experiments performed using 3-D-printed mechanical replicas. The results show that the lips should be included in order to correctly model vocal tract acoustics not only at high frequencies, as commonly accepted, but also in the low frequency range below 4 kHz, where plane wave propagation occurs. http://dx.doi.org/10.1121/1.4950698Arnela, M., Dabbaghchian, S., Blandin, R., Guasch, O., Engwall, O., Van Hirtum, A., & Pelorson, X. (2016). Influence of vocal tract geometry simplifications on the numerical simulation of vowel sounds. Journal of the Acoustical Society of America, 140(3), 1707-1718. [abstract]Abstract: For many years, the vocal tract shape has been approximated by one-dimensional (1D) area functions to study the production of voice. More recently, 3D approaches allow one to deal with the complex 3D vocal tract, although area-based 3D geometries of circular cross-section are still in use. However, little is known about the influence of performing such a simplification, and some alternatives may exist between these two extreme options. To this aim, several vocal tract geometry simplifications for vowels [A], [i], and [u] are investigated in this work. Six cases are considered, consisting of realistic, elliptical, and circular cross-sections interpolated through a bent or straight midline. For frequencies below 4–5 kHz, the influence of bending and cross-sectional shape has been found weak, while above these values simplified bent vocal tracts with realistic cross-sections are necessary to correctly emulate higher-order mode propagation. To perform this study, the finite element method (FEM) has been used. FEM results have also been compared to a 3D multimodal method and to a classical 1D frequency domain model. VC 2016 Acoustical Society of America. [http://dx.doi.org/10.1121/1.4962488]Beskow, J., & Berthelsen, H. (2016). A hybrid harmonics-and-bursts modelling approach to speech synthesis. In 9th ISCA Workshop on Speech Synthesis (pp. 225--230). Sunnyvale, CA, USA. [pdf]Dabbaghchian, S., Arnela, M., Engwall, O., Guasch, O., Stavness, I., & Badin, P. (2016). Using a Biomechanical Model and Articulatory Data for the Numerical Production of Vowels. In Proceedings of Interspeech 2016. San Fransisco. [abstract] [pdf]Abstract: We introduce a framework to study speech production using a biomechanical model of the human vocal tract, ArtiSynth. 
Electromagnetic articulography data was used as input to an inverse tracking simulation that estimates muscle activations to generate 3D jaw and tongue postures corresponding to the target articulator positions. For acoustic simulations, the vocal tract geometry is needed, but since the vocal tract is a cavity rather than a physical object, its geometry does not explicitly exist in a biomechanical model. A fully-automatic method to extract the 3D geometry (surface mesh) of the vocal tract by blending geometries of the relevant articulators has therefore been developed. This automatic extraction procedure is essential, since a method with manual intervention is not feasible for large numbers of simulations or for generation of dynamic sounds, such as diphthongs. We then simulated the vocal tract acoustics by using the Finite Element Method (FEM). This requires a high quality vocal tract mesh without irregular geometry or self-intersections. We demonstrate that the framework is applicable to acoustic FEM simulations of a wide range of vocal tract deformations. In particular we present results for cardinal vowel production, with muscle activations, vocal tract geometry, and acoustic simulations.Frid, J., Ambrazaitis, G., Svensson Lundmark, M., & House, D. (2016). Towards classification of head movements in audiovisual recordings of read news. In 4th European and 7th Nordic Symposium on Multimodal Communication. Copenhagen, Denmark.Georgiladakis, S., Athanasopoulou, G., Meena, R., Lopes, J., Chorianopoulou, A., Palogiannidi, E., Iosif, E., Skantze, G., & Potamianos, A. (2016). Root Cause Analysis of Miscommunication Hotspots in Spoken Dialogue Systems. In Interspeech 2016 (pp. 1156--1160). San Francisco, CA, USA. [pdf]House, D., & Alexanderson, S. (2016). Temporal domains of co-speech gestures and speech prosody. In Seventh Conference of the International Society for Gesture Studies (pp. 365). Paris. [pdf]House, D., Karlsson, A., & Svantesson, J. (2016). When epistemic meaning overrides the constraints of lexical tone: a case from Kammu. In The role of prosody in conveying epistemic and evidential meaning. University of Kent, Canterbury, UK.Hultén, M., Artman, H., & House, D. (2016). A model to analyse students’ cooperative idea generation in conceptual design. International Journal of Technology and Design Education, Springer Online First Articles.Johansson, M., Hori, T., Skantze, G., Höthker, A., & Gustafson, J. (2016). Making Turn-taking Decisions for an Active Listening Robot for Memory Training. In Proceedings of International Conference on Social Robotics (ICSR). Kansas City, MO. [pdf]Karlsson, A., House, D., & Svantesson, J. (2016). Tonal cues to topic and comment in spontaneous Japanese and Mongolian. In 7th conference on Tone and Intonation in Europe (TIE). University of Kent, Canterbury, UK.Kumar Dhaka, A., & Salvi, G. (2016). Optimising The Input Window Alignment in CD-DNN Based Phoneme Recognition for Low Latency Processing. CoRR, abs/1606.09163. [link]Kumar Dhaka, A., & Salvi, G. (2016). Semi-supervised Learning with Sparse Autoencoders in Phone Classification. CoRR, abs/1610.00520. [link]Lison, P., & Meena, R. (2016). Automatic Turn Segmentation for Movie & TV Subtitles. In 2016 IEEE Workshop on Spoken Language Technology. San Juan, Puerto Rico. [abstract]Abstract: Movie and TV subtitles contain large amounts of conversational material, but lack an explicit turn structure. This paper present a data-driven approach to the segmentation of subtitles into dialogue turns. 
Training data is first extracted by aligning subtitles with transcripts in order to obtain speaker labels. This data is then used to build a classifier whose task is to determine whether two consecutive sentences are part of the same dialogue turn. The approach relies on linguistic, visual and timing features extracted from the subtitles themselves and does not require access to the audiovisual material -- although speaker diarization can be exploited when audio data is available. The approach also exploits alignments with related subtitles in other languages to further improve the classification performance. The classifier achieves an accuracy of 78% on a held-out test set. A follow-up annotation experiment demonstrates that this task is also difficult for human annotators.Lopes, J., Chorianopoulou, A., Palogiannidi, E., Moniz, H., Abad, A., Louka, K., Iosif, E., & Potamianos, A. (2016). The SpeDial datasets: datasets for Spoken Dialogue Systems analytics. In 10th edition of the Language Resources and Evaluation Conference. [abstract] [pdf]Abstract: The SpeDial consortium is sharing two datasets that were used during the SpeDial project. By sharing them with the community we are providing a resource to reduce the duration of cycle of development of new Spoken Dialogue Systems (SDSs). The datasets include audios and several manual annotations, i.e., miscommunication, anger, satisfaction, repetition, gender and task success. The datasets were created with data from real users and cover two different languages: English and Greek. Detectors for miscommunication, anger and gender were trained for both systems. The detectors were particularly accurate in tasks where humans have high annotator agreement such as miscommunication and gender. As expected due to the subjectivity of the task, the anger detector had a less satisfactory performance. Nevertheless, we proved that the automatic detection of situations that can lead to problems in SDSs is possible and can be a promising direction to reduce the duration of SDS’s development cycle.Meena, R. (2016). Data-driven Methods for Spoken Dialogue Systems - Applications in Language Understanding, Turn-taking, Error Detection, and Knowledge Acquisition. Doctoral dissertation, School of Computer Science and Communication. [pdf]Oertel, C., Gustafson, J., & Black, A. (2016). On Data Driven Parametric Backchannel Synthesis for Expressing Attentiveness in Conversational Agents. In Proceedings of Multimodal Analyses enabling Artificial Agents in Human­-Machine Interaction (MA3HMI), satellite workshop of ICMI 2016. [pdf]Oertel, C., Gustafson, J., & Black, A. (2016). Towards Building an Attentive Artificial Listener: On the Perception of Attentiveness in Feedback Utterances. In Proceedings of Interspeech 2016. San Fransisco, USA. [pdf]Oertel, C., Lopes, J., Yu, Y., Funes, K., Gustafson, J., Black, A., & Odobez, J-M. (2016). Towards Building an Attentive Artificial Listener: On the Perception of Attentiveness in Audio-Visual Feedback Tokens. In Proceedings of the 18th ACM International Conference on Multimodal Interaction (ICMI 2016). Tokyo, Japan. [pdf]Potamianos, A., Tzafestas, C., Iosif, E., Kirstein, F., Maragos, P., Dauthenhahn, K., Gustafson, J., Østergaard, J., Kopp, S., Wik, P., Pietquin, O., & Al Moubayed, S. (2016). BabyRobot - Next Generation Social Robots: Enhancing Communication and Collaboration Development of TD and ASD Children by Developing and Commercially Exploiting the Next Generation of Human-Robot Interaction Technologies. 
In Proceedings of 2nd Workshop on Evaluating Child-Robot Interaction (CRI) at Human-Robot Interaction (HRI'16). Christchurch, New Zealand.
Skantze, G. (2016). Real-time Coordination in Human-robot Interaction using Face and Voice. AI Magazine, 37(4), 19-31. [abstract] [pdf]Abstract: When humans interact and collaborate with each other, they coordinate their turn-taking behaviours using verbal and non-verbal signals, expressed in the face and voice. If robots of the future are supposed to engage in social interaction with humans, it is essential that they can generate and understand these behaviours. In this article, I give an overview of several studies that show how humans in interaction with a human-like robot make use of the same coordination signals typically found in studies on human-human interaction, and that it is possible to automatically detect and combine these cues to facilitate real-time coordination. The studies also show that humans react naturally to such signals when used by a robot, without being given any special instructions. They follow the gaze of the robot to disambiguate referring expressions, they conform when the robot selects the next speaker using gaze, and they respond naturally to subtle cues, such as gaze aversion, breathing, facial gestures and hesitation sounds.
Stefanov, K., & Beskow, J. (2016). A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference (LREC 2016). Portorož, Slovenia. [pdf]
Stefanov, K., Sugimoto, A., & Beskow, J. (2016). Look Who’s Talking - Visual Identification of the Active Speaker in Multi-party Human-robot Interaction. In Proceedings of 2nd Workshop on Advancements in Social Signal Processing for Multimodal Interaction. Tokyo, Japan.
Wedenborn, A., Wik, P., Engwall, O., & Beskow, J. (2016). The effect of a physical robot on vocabulary learning. In Proceedings of the International Workshop on Spoken Dialogue Systems (IWSDS 2016). Saariselkä, Finland. [pdf]
Zellers, M., House, D., & Alexanderson, S. (2016). Prosody and hand gesture at turn boundaries in Swedish. In Proceedings of Speech Prosody 2016 (pp. 832-835). Boston, USA. [abstract] [pdf]Abstract: In order to ensure smooth turn-taking between conversational participants, interlocutors must have ways of providing information to one another about whether they have finished speaking or intend to continue. The current work investigates Swedish speakers’ use of hand gestures in conjunction with turn change or turn hold in unrestricted, spontaneous speech. As has been reported by other researchers, we find that speakers’ gestures end before the end of speech in cases of turn change, while they may extend well beyond the end of a given speech chunk in the case of turn hold. We investigate the degree to which prosodic cues and gesture cues to turn transition in Swedish face-to-face conversation are complementary versus functioning additively. The co-occurrence of acoustic prosodic features and gesture at potential turn boundaries gives strong support for considering hand gestures as part of the prosodic system, particularly in the context of discourse-level information such as maintaining smooth turn transition.

2015

Alexanderson, S., & Beskow, J. (2015). Towards Fully Automated Motion Capture of Signs – Development and Evaluation of a Key Word Signing Avatar. ACM Trans. Access. Comput., 7(2), 7:1-7:17.
[abstract]Abstract: Motion capture of signs provides unique challenges in the field of multimodal data collection. The dense packaging of visual information requires high fidelity and high bandwidth of the captured data. Even though marker-based optical motion capture provides many desirable features such as high accuracy, global fitting, and the ability to record body and face simultaneously, it is not widely used to record finger motion, especially not for articulated and syntactic motion such as signs. Instead, most signing avatar projects use costly instrumented gloves, which require long calibration procedures. In this article, we evaluate the data quality obtained from optical motion capture of isolated signs from Swedish sign language with a large number of low-cost cameras. We also present a novel dual-sensor approach to combine the data with low-cost, five-sensor instrumented gloves to provide a recording method with low manual postprocessing. Finally, we evaluate the collected data and the dual-sensor approach as transferred to a highly stylized avatar. The application of the avatar is a game-based environment for training Key Word Signing (KWS) as augmented and alternative communication (AAC), intended for children with communication disabilities.Ambrazaitis, G., Svensson Lundmark, M., & House, D. (2015). Head beats and eyebrow movements as a function of phonological prominence levels and word accents in Stockholm Swedish news broadcasts. In The 3rd European Symposium on Multimodal Communication. Dublin, Ireland.Ambrazaitis, G., Svensson Lundmark, M., & House, D. (2015). Head Movements, Eyebrows, and Phonological Prosodic Prominence Levels in Stockholm Swedish News Broadcasts. In 13th International Conference on Auditory-Visual Speech Processing (AVSP 2015) (pp. 42). Vienna, Austria.Ambrazaitis, G., Svensson Lundmark, M., & House, D. (2015). Multimodal levels of promincence: a preliminary analysis of head and eyebrow movements in Swedish news broadcasts. In Lundmark Svensson, M., Ambrazaitis, G., & van de Weijer, J. (Eds.), Proceedings of Fonetik 2015 (pp. 11-16). Lund University, Sweden.Artman, H., House, D., & Hultén, M. (2015). Designed by Engineers: An analysis of interactionaries with engineering students. Designs for Learning, Vol. 7(No. 2-2015), 28-56.Artman, H., House, D., Hultén, M., Karlgren, K., & Ramberg, R. (2015). The Interactionary as a didactic format in design education. In Proc. of KTH Scholarship of Teaching and Learning 2015. Stockholm, Sweden.Cuayáhuitl, H., Kazunori, K., & Skantze, G. (2015). Introduction for Speech and language for interactive robots. Computer Speech and Language, 34(1), 83-86.Dabbaghchian, S., Arnela, M., & Engwall, O. (2015). Simplification of Vocal Tract Shapes with Different Levels of Detail. In Proceedings of 18th International Congress of Phonetic Sciences. Glasgow. [abstract]Abstract: We propose a semi-automatic method to regenerate simplified vocal tract geometries from very detailed input (e.g. MRI-based geometry) with the possibility to control the level of detail, while maintaining the overall properties. The simplification procedure controls the number and organization of the vertices in the vocal tract surface mesh and can be assigned to replace complex cross-sections with regular shapes. Six different geometry regenerations are suggested: bent or straight vocal tract centreline, combined with three different types of cross-sections; namely realistic, elliptical or circular. 
The key feature in the simplification is that the cross-sectional areas and the length of the vocal tract are maintained. This method may, for example, be used to facilitate 3D finite element method simulations of vowels and diphthongs and to examine the basic acoustic characteristics of vocal tract in printed physical replicas. Furthermore, it allows for multimodal solutions of the wave equation.Edlund, J., Tånnander, C., & Gustafson, J. (2015). Audience response system-based assessment for analysis-by-synthesis. In Proceedings of ICPhS 2015. Glasgov, UK. [pdf]Gudmandsen, M. (2015). Using a robot head with a 3D face mask as a communication medium for telepresence. Master's thesis, KTH.House, D., Alexanderson, S., & Beskow, J. (2015). On the temporal domain of co-speech gestures: syllable, phrase or talk spurt?. In Lundmark Svensson, M., Ambrazaitis, G., & van de Weijer, J. (Eds.), Proceedings of Fonetik 2015 (pp. 63-68). Lund University, Sweden. [abstract] [pdf]Abstract: This study explores the use of automatic methods to detect and extract hand gesture movement co-occuring with speech. Two spontaneous dyadic dialogues were analyzed using 3D motion-capture techniques to track hand movement. Automatic speech/non-speech detection was performed on the dialogues resulting in a series of connected talk spurts for each speaker. Temporal synchrony of onset and offset of gesture and speech was studied between the automatic hand gesture tracking and talk spurts, and compared to an earlier study of head nods and syllable synchronization. The results indicated onset synchronization between head nods and the syllable in the short temporal domain and between the onset of longer gesture units and the talk spurt in a more extended temporal domain.Johansson, M., & Skantze, G. (2015). Opportunities and Obligations to Take Turns in Collaborative Multi-Party Human-Robot Interaction. In Proceedings of SIGDIAL. Prague, Czech Republic. [abstract] [pdf]Abstract: In this paper we present a data-driven model for detecting opportunities and obligations for a robot to take turns in multi-party discussions about objects. The data used for the model was collected in a public setting, where the robot head Furhat played a collaborative card sorting game together with two users. The model makes a combined detection of addressee and turn-yielding cues, using multi-modal data from voice activity, syntax, prosody, head pose, movement of cards, and dialogue context. The best result for a binary decision is achieved when several modalities are com-bined, giving a weighted F1 score of 0.876 on data from a previously unseen interaction, using only automatically extractable features.Karlsson, A., House, D., & Svantesson, J-O. (2015). Prosodic signaling of information and discourse structure from a typological perspective. In The Scottish Consortium for ICPhS 2015 (Ed.), Proceedings of the 18th International Congress of Phonetic Sciences. Paper number 0013. Glasgow, UK: the University of Glasgow. ISBN 978-0-85261-941-4.Karlsson, A., House, D., & Svantesson, J-O. (2015). Prosodic signaling of information and discourse structure from a typological perspective. In 14th International Pragmatics Conference (pp. 542). Antwerp, Belgium.Laskowski, K., & Hjalmarsson, A. (2015). An information-theoretic framework for automated discovery of prosodic cues to conversational structure. In ICASSP. Brisbane, Australia.Lopes, J., Eskenazi, M., & Trancoso, I. (2015). 
From rule-based to data-driven lexical entrainment models in spoken dialog systems. Computer Speech \& Language, 31(1), 87-112.Lopes, J., Salvi, G., Skantze, G., Abad, A., Gustafson, J., Batista, F., Meena, R., & Trancoso, I. (2015). Detecting Repetitions in Spoken Dialogue Systems Using Phonetic Distances. In Proceedings of Interspeech 2015 (pp. 1805-1809). Dresden, Germany. [abstract] [pdf]Abstract: Repetitions in Spoken Dialogue Systems can be a symptom of problematic communication. Such repetitions are often due to speech recognition errors, which in turn makes it harder to use the output of the speech recognizer to detect repetitions. In this paper, we combine the alignment score obtained using phonetic distances with dialogue-related features to improve repetition detection. To evaluate the method proposed we compare several alignment techniques from edit distance to DTW-based distance, previously used in Spoken-Term detection tasks. We also compare two different methods to compute the phonetic distance: the first one using the phoneme sequence, and the second one using the distance between the phone posterior vectors. Two different datasets were used in this evaluation: a bus-schedule information system (in English) and a call routing system (in Swedish). The results show that approaches using phoneme distances over-perform approaches using Levenshtein distances between ASR outputs for repetition detection.Meena, R., Lopes, J., Skantze, G., & Gustafson, J. (2015). Automatic Detection of Miscommunication in Spoken Dialogue Systems. In Proceedings of 16th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) (pp. 354-363). Prague, Czech Republic. [abstract] [pdf]Abstract: In this paper, we present a data-driven approach for detecting instances of miscommunication in dialogue system interactions. A range of generic features that are both automatically extractable and manually annotated were used to train two models for online detection and one for offline analysis. Online detection could be used to raise the error awareness of the system, whereas offline detection could be used by a system designer to identify potential flaws in the dialogue design. In experimental evaluations on system logs from three different dialogue systems that vary in their dialogue strategy, the proposed models performed substantially better than the majority class baseline models.Oertel, C., Funes, K., Gustafson, J., & Odobez, J-M. (2015). Deciphering the Silent Participant - On the Use of Audio-Visual Cues for the Classification of Listener Categories in Group Discussions. In Proceedings of the 17th ACM International Conference on Multimodal Interaction (ICMI 2015). Seattle, US. [pdf]Salvi, G. (2015). An Analysis of Shallow and Deep Representations of Speech Based on Unsupervised Classification of Isolated Words. In Proceedings of Nonlinear Speech Processing. [abstract]Abstract: We analyse the properties of shallow and deep representa- tions of speech. Mel frequency cepstral coefficients (MFCC) are compared to representations learned by a four layer Deep Belief Network (DBN) in terms of discriminative power and invariance to irrelevant factors such as speaker identity or gender. To avoid the influence of supervised statistical modelling, an unsupervised isolated word classification task is used for the comparison. The deep representations are also obtained with unsupervised training (no back-propagation pass is performed). 
The results show that DBN features provide a more concise clustering and higher match between clusters and word categories in terms of adjusted Rand score. Some of the confusions present with the MFCC features are, however, retained even with the DBN features.Skantze, G., & Johansson, M. (2015). Modelling situated human-robot interaction using IrisTK. In Proceedings of SIGDIAL. Prague, Czech Republic. [abstract]Abstract: In this demonstration we show how situated multi-party human-robot interaction can be modelled using the open source framework IrisTK. We will demonstrate the capabilities of IrisTK by showing an application where two users are playing a collaborative card sorting game together with the robot head Furhat, where the cards are shown on a touch table between the players. The application is interesting from a research perspective, as it involves both multi-party interaction, as well as joint attention to the objects under discussion.Skantze, G., Johansson, M., & Beskow, J. (2015). A Collaborative Human-robot Game as a Test-bed for Modelling Multi-party, Situated Interaction. In Proceedings of IVA. Delft, Netherlands. [abstract] [pdf]Abstract: In this demonstration we present a test-bed for collecting data and testing out models for multi-party, situated interaction between humans and robots. Two users are playing a collaborative card sorting game together with the robot head Furhat. The cards are shown on a touch table between the players, thus constituting a target for joint attention. The system has been exhibited at the Swedish National Museum of Science and Technology during nine days, resulting in a rich multi-modal corpus with users of mixed ages.Skantze, G., Johansson, M., & Beskow, J. (2015). Exploring Turn-taking Cues in Multi-party Human-robot Discussions about Objects. In Proceedings of ICMI. Seattle, Washington, USA. [pdf]Strömbergsson, S., Salvi, G., & House, D. (2015). Acoustic and perceptual evaluation of category goodness of /t/ and /k/ in typical and misarticulated children's speech. Journal of the Acoustic Society of America, 137(6), 3422--3435. [abstract] [pdf]Abstract: This investigation explores perceptual and acoustic characteristics of children's successful and unsuccessful productions of /t/ and /k/, with a specific aim of exploring perceptual sensitivity to phonetic detail, and the extent to which this sensitivity is reflected in the acoustic domain. Recordings were collected from 4- to 8-year-old children with a speech sound disorder (SSD) who misarticulated one of the target plosives, and compared to productions recorded from peers with typical speech development (TD). Perceptual responses were registered with regards to a visual-analog scale, ranging from “clear [t]” to “clear [k].” Statistical models of prototypical productions were built, based on spectral moments and discrete cosine transform features, and used in the scoring of SSD productions. In the perceptual evaluation, “clear substitutions” were rated as less prototypical than correct productions. Moreover, target-appropriate productions of /t/ and /k/ produced by children with SSD were rated as less prototypical than those produced by TD peers. The acoustical modeling could to a large extent discriminate between the gross categories /t/ and /k/, and scored the SSD utterances on a continuous scale that was largely consistent with the category of production. However, none of the methods exhibited the same sensitivity to phonetic detail as the human listeners.Wedenborn, A. (2015). 
A physical robot’s effect on vocabulary learning. Master's thesis, KTH. [pdf]
Wlodarczak, M., Heldner, M., & Edlund, J. (2015). Communicative needs and respiratory constraints. In Proc. of Interspeech 2015. Dresden, Germany.
Zellers, M., & House, D. (2015). Parallels between hand gestures and acoustic prosodic features in turn-taking. In 14th International Pragmatics Conference (pp. 454-455). Antwerp, Belgium. [pdf]

2014

Al Moubayed, S., Beskow, J., Bollepalli, B., Gustafson, J., Hussen-Abdelaziz, A., Johansson, M., Koutsombogera, M., Lopes, J., Novikova, J., Oertel, C., Skantze, G., Stefanov, K., & Varol, G. (2014). Human-robot collaborative tutoring using multiparty multimodal spoken dialogue. In Proc. of HRI'14. Bielefeld, Germany. [abstract] [pdf]Abstract: In this paper, we describe a project that explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robot interaction setup is designed, and a human-human dialogue corpus is collected. The corpus targets the development of a dialogue system platform to study verbal and nonverbal tutoring strategies in multiparty spoken interactions with robots which are capable of spoken dialogue. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. Along with the participants sits a tutor (robot) that helps the participants perform the task, and organizes and balances their interaction. Different multimodal signals captured and auto-synchronized by different audio-visual capture technologies, such as a microphone array, Kinects, and video cameras, were coupled with manual annotations. These are used build a situated model of the interaction based on the participants personalities, their state of attention, their conversational engagement and verbal dominance, and how that is correlated with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. Driven by the analysis of the corpus, we will show also the detailed design methodologies for an affective, and multimodally rich dialogue system that allows the robot to measure incrementally the attention states, and the dominance for each participant, allowing the robot head Furhat to maintain a well-coordinated, balanced, and engaging conversation, that attempts to maximize the agreement and the contribution to solve the task. This project sets the first steps to explore the potential of using multimodal dialogue systems to build interactive robots that can serve in educational, team building, and collaborative task solving applications.
Al Moubayed, S., Beskow, J., Bollepalli, B., Hussen-Abdelaziz, A., Johansson, M., Koutsombogera, M., Lopes, J., Novikova, J., Oertel, C., Skantze, G., Stefanov, K., & Varol, G. (2014). Tutoring Robots: Multiparty multimodal social dialogue with an embodied tutor. In Proceedings of eNTERFACE2013. Springer. [abstract] [pdf]Abstract: This project explores a novel experimental setup towards building spoken, multi-modally rich, and human-like multiparty tutoring agent. A setup is developed and a corpus is collected that targets the development of a dialogue system platform to explore verbal and nonverbal tutoring strategies in multiparty spoken interactions with embodied agents. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game.
With the participants sits a tutor that helps the participants perform the task and organizes and balances their interaction. Different multimodal signals captured and auto-synchronized by different audio-visual capture technologies were coupled with manual annotations to build a situated model of the interaction based on the participants personalities, their temporally-changing state of attention, their conversational engagement and verbal dominance, and the way these are correlated with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. At the end of this chapter we discuss the potential areas of research and developments this work opens and some of the challenges that lie in the road ahead.Al Moubayed, S., Beskow, J., & Skantze, G. (2014). Spontaneous spoken dialogues with the Furhat human-like robot head. In HRI'14. Bielefeld, Germany. [abstract] [pdf]Abstract: We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is an anthropomorphic robot head that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator will showcase a system that is able to carry out social dialogue with multiple interlocutors simultaneously with rich output signals such as eye and head coordination, lips synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations. The dialogue design is performed using the IrisTK [4] dialogue authoring toolkit developed at KTH. The system will also be able to perform a moderator in a quiz-game showing different strategies for regulating spoken situated interactions.Alexanderson, S., & Beskow, J. (2014). Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions. Computer Speech & Language, 28(2), 607-618. [pdf]Alexanderson, S., Beskow, J., & House, D. (2014). Automatic speech/non-speech classification using gestures in dialogue. In The Fifth Swedish Language Technology Conference. Uppsala, Sweden. [pdf]Artman, H., House, D., & Hultén, M. (2014). Design learning opportunities in engineering education: A case study of students solving an interaction - design task. In Proc. 4th International Designs for Learning Conference. Stockholm University. [pdf]Beskow, J., Stafanov, K., Alexandersson, S., Claesson, B., Derbring S., ., & Fredriksson, M. (2014). Tivoli - teckeninlärning via spel och interaktion. Slutrapport projektgenomförande. Technical Report, PTS Innovation för alla. [pdf]Bissiri, M., Zellers, M., & Ding, H. (2014). Perception of glottalization in varying pitch contexts in Mandarin Chinese. In Campbell, N., Gibbon, D., & Hirst, D. (Eds.), Proceedings of Speech Prosody 7 (pp. 633-637). [pdf]Bollepalli, B., Urbain, J., Raitio, T., Gustafson, J., & Cakmak, H. (2014). A Comparative Evaluation of Vocoding Techniques for HMM-based Laughter Synthesis. In Proceedings of ICASSP 2014. [pdf]Boye, J., Fredriksson, M., Götze, J., Gustafson, J., & Königsmann, J. (2014). Walk this way: Spatial grounding for city exploration. In Natural interaction with robots, knowbots and smartphones (pp. 59-67). Springer-Verlag. [pdf]Contardo, I., McAllister, A., & Strömbergsson, S. (2014). 
Real-time registration of listener reactions to unintelligibility in misarticulated child speech. In Heldner, M. (Ed.), Proc. of Fonetik 2014 (pp. 127-132). Stockholm, Sweden. [abstract] [pdf]Abstract: This study explores the relation between misarticulations and their impact on intelligibility. 30 listeners (17 clinicians and 13 untrained listeners) were given the task of clicking a button whenever they perceived something unintelligible during playback of misarticulated child speech samples. No differences were found between the clinicians and the untrained listeners regarding clicking frequency. The distribution of listener clicks correlated strongly with the clinical evaluations of the same samples. The distribution of clicks was also related to manually annotated speech errors, allowing examination of links between events in the speech signal and reactions evoked in listeners. Hereby, we demonstrate a viable approach to ranking speech error types with regards to their impact on intelligibility in conversational speech.Edlund, J., Edelstam, F., & Gustafson, J. (2014). Human pause and resume behaviours for unobtrusive humanlike in-car spoken dialogue systems. In Proceedings of the EACL Satellite Workshop Dialogue In Motion (DIM-2014). Gothenburg, Sweden. [abstract] [pdf]Abstract: This paper presents a first, largely qualitative analysis of a set of human-human dialogues recorded specifically to provide insights in how humans handle pauses and resumptions in situations where the speakers cannot see each other, but have to rely on the acoustic signal alone. The work presented is part of a larger effort to find unobtrusive human dialogue behaviours that can be mimicked and implemented in-car spoken dialogue systems within in the EU project Get Home Safe, a collaboration between KTH, DFKI, Nuance, IBM and Daimler aiming to find ways of driver interaction that minimizes safety issues,. The analysis reveals several human temporal, semantic/pragmatic, and structural behaviours that are good candidates for inclusion in spoken dialogue systems.Edlund, J., Heldner, M., & Wlodarczak, M. (2014). Catching wind of multiparty conversation. In Proceedings of Multimodal Corpora: Combining applied and basic research targets (MMC 2014) (pp. 35-36). Reykjavik, Iceland. [abstract]Abstract: The paper describes the design of a novel multimodal corpus of spontaneous multiparty conversations in Swedish. The corpus is collected with the primary goal of investigating the role of breathing and its perceptual cues for interactive control of interaction. Physiological correlates of breathing are captured by means of respiratory belts, which measure changes in cross sectional area of the rib cage and the abdomen. Additionally, auditory and visual correlates of breathing are recorded in parallel to the actual conversations. The corpus allows studying respiratory mechanisms underlying organisation of spontaneous conversation, especially in connection with turn management. As such, it is a valuable resource both for fundamental research and speech techonology applications.Edlund, J., Heldner, M., & Wlodarczak, M. (2014). Is breathing prosody?. In International Symposium on Prosody to Commemorate Gösta Bruce. Lund, Sweden. [abstract]Abstract: Even though we may not be aware of it, much breathing in face-to-face conversation is both clearly audible and visible. Consequently, it has been suggested that respiratory activity is used in the joint coordination of conversational flow. 
For instance, it has been claimed that inhalation is an interactionally salient cue to speech initiation, that exhalation is a turn yielding device, and that breath holding is a marker of turn incompleteness (e.g. Local & Kelly, 1986; Schegloff, 1996). So far, however, few studies have addressed the interactional aspects of breathing (one notable exeption is McFarland, 2001). In this poster, we will describe our ongoing efforts to fill this gap. We will present the design of a novel corpus of respiratory activity in spontaneous multiparty face-to-face conversations in Swedish. The corpus will contain physiological measurements relevant to breathing, high-quality audio, and video. Minimally, the corpus will be annotated with interactional events derived from voice activity detection and (semi-) automatically detected inhalation and exhalation events in the respiratory data. We will also present initial analyses of the material collected. The question is whether breathing is prosody and relevant to this symposium? What we do know is that the turntaking phenomena that of particular interest to us are closely (almost by definition) related to several prosodic phenomena, and in particular to those associated with prosodic phrasing, grouping and boundaries. Thus, we will learn more about respiratory activity in phrasing (and the like) through analyses of breathing in conversation.Johansson, M., Skantze, G., & Gustafson, J. (2014). Comparison of human-human and human-robot Turn-taking Behaviour in multi-party Situated interaction. In Proceedings of the International Workshop on Understanding and Modeling Multiparty, Multimodal Interactions, at ICMI 2014. Istanbul, Turkey. [pdf]Karlsson, A., Svantesson, J-O., & House, D. (2014). Prosodic boundaries and discourse structure in Kammu. In Heldner, M. (Ed.), Proceedings from FONETIK 2014 (pp. 71-76). Stockholm, Sweden: Stockholm University.Koutsombogera, M., Al Moubayed, S., Beskow, J., Bollepalli, B., Gustafson, J., Hussen-Abdelaziz, A., Johansson, M., Lopes, J., Novikova, J., Oertel, C., Skantze, G., Stefanov, K., & Varol, G. (2014). The Tutorbot Corpus – A Corpus for Studying Tutoring Behaviour in Multiparty Face-to-Face Spoken Dialogue. In Proc. of LREC'14. Reykjavik, Iceland. [abstract]Abstract: This paper describes a novel experimental setup exploiting state-of-the-art capture equipment to collect a multimodally rich game-solving collaborative multiparty dialogue corpus. The corpus is targeted and designed towards the development of a dialogue system platform to explore verbal and nonverbal tutoring strategies in multiparty spoken interactions. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. The participants were paired into teams based on their degree of extraversion as resulted from a personality test. With the participants sits a tutor that helps them perform the task, organizes and balances their interaction and whose behavior was assessed by the participants after each interaction. Different multimodal signals captured and auto-synchronized by different audio-visual capture technologies, together with manual annotations of the tutor’s behavior constitute the Tutorbot corpus. 
This corpus is exploited to build a situated model of the interaction based on the participants’ temporally-changing state of attention, their conversational engagement and verbal dominance, and their correlation with the verbal and visual feedback and conversation regulatory actions generated by the tutor.Lison, P., & Meena, R. (2014). Spoken Dialogue Systems: The new frontier in human-computer interaction. In XRDS Crossroads The ACM Magazine for Students (pp. 46-51). US: ACM. [abstract] [link]Abstract: Wouldn’t it be great if we could simply talk to our technical devices instead of relying on cumbersome displays and keyboards to convey what we want?Meena, R., Boye, J., Skantze, G., & Gustafson, J. (2014). Crowdsourcing Street-level Geographic Information Using a Spoken Dialogue System. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) (pp. 2-11). Philadelphia, PA, USA. [abstract] [pdf]Abstract: We present a technique for crowd-sourcing street-level geographic information using spoken natural language. In particular, we are interested in obtaining first-person-view information about what can be seen from different positions in the city. This information can then for example be used for pedestrian routing services. The approach has been tested in the lab using a fully implemented spoken dialogue system, and is showing promising results.Meena, R., Boye, J., Skantze, G., & Gustafson, J. (2014). Using a Spoken Dialogue System for Crowdsourcing Street-level Geographic Information. In the 2nd Workshop on Action, Perception and Language, SLTC 2014. Uppsala, Sweden. [abstract] [pdf]Abstract: We present a novel scheme for enriching geographic database with street-level geographic information that could be useful for pedestrian navigation. A spoken dialogue system for crowdsourcing street-level geographic details was developed and tested in an in-lab experimentation, and has shown promising results.Meena, R., Dabbaghchian, S., & Stefanov, K. (2014). A data-driven approach to detection of interruptions in human–human conversations. In Heldner, M. (Ed.), Proceedings of FONETIK 2014 (pp. 29-32). Stockholm, Sweden. [abstract] [pdf]Abstract: We report the results of our initial efforts towards automatic detection of user’s interruptions in a spoken human–machine dialogue. In a first step, we explored the use of automatically extractable acoustic features, frequency and intensity, in discriminating listener’s interruptions in human–human conversations. A preliminary analysis of interaction snippets from the HCRC Map Task corpus suggests that for the task at hand, intensity is a stronger feature than frequency, and using intensity in combination with feature loudness offers the best results for a k-means clustering algorithm.Meena, R., Skantze, G., & Gustafson, J. (2014). Data-driven Models for timing feedback responses in a Map Task dialogue system. Computer Speech and Language, 28(4), 903-922. [abstract] [pdf]Abstract: Traditional dialogue systems use a fixed silence threshold to detect the end of users’ turns. Such a simplistic model can result in system behaviour that is both interruptive and unresponsive, which in turn affects user experience. Various studies have observed that human interlocutors take cues from speaker behaviour, such as prosody, syntax, and gestures, to coordinate smooth exchange of speaking turns. 
However, little effort has been made towards implementing these models in dialogue systems and verifying how well they model the turn-taking behaviour in human–computer interactions. We present a data-driven approach to building models for online detection of suitable feedback response locations in the user's speech. We first collected human–computer interaction data using a spoken dialogue system that can perform the Map Task with users (albeit using a trick). On this data, we trained various models that use automatically extractable prosodic, contextual and lexico-syntactic features for detecting response locations. Next, we implemented a trained model in the same dialogue system and evaluated it in interactions with users. The subjective and objective measures from the user evaluation confirm that a model trained on speaker behavioural cues offers both smoother turn-transitions and more responsive system behaviour.Oertel, C., Funes, K., Sheiki, S., Odobez, J-M., & Gustafson, J. (2014). Who Will Get the Grant ? A Multimodal Corpus for the Analysis of Conversational Behaviours in Group Interviews. In International Workshop on Understanding and Modeling Multiparty, Multimodal Interactions, at ICMI 2014. Istanbul, Turkey. [pdf]Pieropan, A., Salvi, G., Pauwels, K., & Kjellström, H. (2014). A dataset of human manipulation actions. In Proc of the IEEE International Conference on Robotics and Automation (ICRA). Hong Kong, China.Pieropan, A., Salvi, G., Pauwels, K., & Kjellström, H. (2014). Audio-Visual Classification and Detection of Human Manipulation Actions. In Proc. of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Chicago, Illinois. [pdf]Salvi, G., & Vanhainen, N. (2014). The WaveSurfer Automatic Speech Recognition Plugin. In Proceedings of LREC. Reykjavik, Iceland. [pdf]Skantze, G., Anna, H., & Oertel, C. (2014). User Feedback in Human-Robot Dialogue: Task Progression and Uncertainty. In Proceedings of the HRI Workshop on Timing in Human-Robot Interaction. Bielefeld, Germany.Skantze, G., Hjalmarsson, A., & Oertel, C. (2014). Turn-taking, Feedback and Joint Attention in Situated Human-Robot Interaction. Speech Communication, 65, 50-66. [abstract] [pdf]Abstract: In this paper, we present a study where a robot instructs a human on how to draw a route on a map. The human and robot are seated face-to-face with the map placed on the table between them. The user’s and the robot’s gaze can thus serve several simultaneous functions: as cues to joint attention, turn-taking, level of understanding and task progression. We have compared this face-to-face setting with a setting where the robot employs a random gaze behaviour, as well as a voice-only setting where the robot is hidden behind a paper board. In addition to this, we have also manipulated turn-taking cues such as completeness and filled pauses in the robot’s speech. By analysing the participants’ subjective rating, task completion, verbal responses, gaze behaviour, and drawing activity, we show that the users indeed benefit from the robot’s gaze when talking about landmarks, and that the robot’s verbal and gaze behaviour has a strong effect on the users’ turn-taking behaviour. We also present an analysis of the users’ gaze and lexical and prosodic realisation of feedback after the robot instructions, and show that these cues reveal whether the user has yet executed the previous instruction, as well as the user’s level of uncertainty.Strömbergsson, S. (2014). 
The /k/s, the /t/s, and the inbetweens: Novel approaches to examining the perceptual consequences of misarticulated speech. Doctoral dissertation, School of Computer Science and Communication. [abstract] [pdf]Abstract: This thesis comprises investigations of the perceptual consequences of children’s misarticulated speech – as perceived by clinicians, by everyday listeners, and by the children themselves. By inviting methods from other areas to the study of speech disorders, this work demonstrates some successful cases of cross-fertilization. The population in focus is children with a phonological disorder (PD), who misarticulate /t/ and /k/. A theoretical assumption underlying this work is that errors in speech production are often paralleled in perception, e.g. that children base their decision on whether a speech sound is a /t/ or a /k/ on other acoustic-phonetic criteria than those employed by proficient language users. This assumption, together with an aim at stimulating self-monitoring in these children, motivated two of the included studies. Through these studies, new insights into children’s perception of their own speech were achieved – insights entailing both clinical and psycholinguistic implications. For example, the finding that children with PD generally recognize themselves as the speaker in recordings of their own utterances lends support to the use of recordings in therapy, to attract children’s attention to their own speech production. Furthermore, through the introduction of a novel method for automatic correction of children’s speech errors, these findings were extended with the observation that children with PD tend to evaluate misarticulated utterances as correct when just having produced them, and to perceive inaccuracies better when time has passed. Another theme in this thesis is the gradual nature of speech perception related to phonological categories, and a concern that perceptual sensitivity is obscured in descriptions based solely on discrete categorical labels. This concern is substantiated by the finding that listeners rate “substitutions” of [t] for /k/ as less /t/-like than correct productions of [t] for intended /t/. Finally, a novel method of registering listener reactions during the continuous playback of misarticulated speech is introduced, demonstrating a viable approach to exploring how different speech errors influence intelligibility and/or acceptability. By integrating such information in the prioritizing of therapeutic targets, intervention may be better directed at those patterns that cause the most problems for the child in his or her everyday life.Strömbergsson, S., Salvi, G., & House, D. (2014). Gradient evaluation of /k/-likeness in typical and misarticulated child speech. In Proc. of ICPLA 2014. Stockholm, Sweden. [abstract] [pdf]Abstract: Phonetic transcription is an important instrument in the evaluation of misarticulated speech. However, this instrument is not sensitive to fine acoustic-phonetic detail – information that can provide insight into the processes underlying speech production [1]. An objective and fine-grained measure of children’s efforts at producing a specific speech target would be clinically valuable, both in assessment and when monitoring progress in therapy. Here, we describe the first steps towards such a measure. This study describes the perceptual and acoustic evaluation of children’s successful and inaccurate efforts at producing /k/. 
A corpus of 2990 recordings of isolated words, beginning with either /tV/ or /kV/, produced by 4-8-year-old children, was used. The recordings were labelled with regards to whether they were a) correct productions, b) clear substitutions (i.e. [t] for /k/, or [k] for /t/), or c) intermediate productions between [t] and [k]. In the perceptual evaluation, 10 adult listeners judged 120 typical and misarticulated productions of /t/ and /k/ with regards to a scale from “clear /t/” to “clear /k/”. The listeners utilized the whole scale, thus exhibiting sensitivity to sub-phonemic detail. This finding demonstrates that listeners perceive more detail than is conveyed in phonetic transcription. However, despite their experience of evaluating misarticulated child speech, the listeners did not discriminate between correct productions and clear substitutions, i.e. they did not distinguish successful productions of [t] for /t/ from cases where [t] was a misarticulated production of /k/ (and vice versa). In order to explore the existence of covert contrasts, i.e. sub-perceptual differentiation between correct productions and clear substitutions, acoustic analysis was performed. Here, a frequently described approach [2] to the analysis of voiceless plosives was compared to more recent methods. We report on the performance of the different methods, regarding how well they modelled the human evaluations, and to their sensitivity to covert contrast.Strömbergsson, S., Tånnander, C., & Edlund, J. (2014). Ranking severity of speech errors by their phonological impact in context. In Proc. of Interspeech 2014. Singapore.Strömbergsson, S., Wengelin, Å., & House, D. (2014). Children’s perception of their synthetically corrected speech production. Clinical Linguistics & Phonetics, 28(6), 373-395. [abstract] [link]Abstract: We explore children’s perception of their own speech – in its online form, in its recorded form, and in synthetically modified forms. Children with Phonological Disorder (PD) and children with typical speech and language development (TD) performed tasks of evaluating accuracy of the different types of speech stimuli, either immediately after having produced the utterance or after a delay. In addition, they performed a task designed to assess their ability to detect synthetic modification. Both groups showed high performance in tasks involving evaluation of other children’s speech, whereas in tasks of evaluating one’s own speech, the children with PD were less accurate than their TD peers. The children with PD were less sensitive to misproductions in immediate conjunction with their production of an utterance, and more accurate after a delay. Within-category modification often passed undetected, indicating a satisfactory quality of the generated speech. Potential clinical benefits of using corrective re-synthesis are discussed.Vanhainen, N., & Salvi, G. (2014). Free Acoustic and Language Models for Large Vocabulary Continuous Speech Recognition in Swedish. In Proceedings of LREC. Reykjavik, Iceland. [pdf]Vanhainen, N., & Salvi, G. (2014). Pattern Discovery in Continuous Speech Using Block Diagonal Infinite HMM. In Proceedings of ICASSP. Florence, Italy. [pdf]Wlodarczak, M., Heldner, M., & Edlund, J. (2014). Breathing in Conversation: an Unwritten History. In Proc. of the 2nd European and the 5th Nordic Symposium on Multimodal Communication. Tartu, Estonia. 
[abstract]Abstract: This paper attempts to draw the attention of the multimodal communication research community to what we consider a long overdue topic, namely respiratory activity in conversation. We submit that a turn towards spontaneous interaction is a natural extension of the recent interest in speech breathing, and is likely to offer valuable insights into mechanisms underlying the organisation of interaction and collaborative human action in general, as well as to advance existing speech technology applications. Particular focus is placed on the role of breathing as a perceptually and interactionally salient turn-taking cue. We also present the recording setup developed in the Phonetics Laboratory at Stockholm University with the aim of studying communicative functions of physiological and audio-visual breathing correlates in spontaneous multiparty interactions.2013Al Moubayed, S. (2013). Towards rich multimodal behavior in spoken dialogues with embodied agents. In IEEE International Conference on Cognitive Infocommunications. Budapest, Hungary. [abstract] [pdf]Abstract: Spoken dialogue frameworks have traditionally been designed to handle a single stream of data – the speech signal. Research on human-human communication has provided extensive evidence quantifying the effects and the importance of a multitude of other multimodal nonverbal signals that people use in their communication, signals that shape and regulate their interaction. Driven by findings from multimodal human spoken interaction, and the advancements of capture devices and robotics and animation technologies, new possibilities are rising for the development of multimodal human-machine interaction that is more affective, social, and engaging. In such face-to-face interaction scenarios, dialogue systems can have a large set of signals at their disposal to infer context and enhance and regulate the interaction through the generation of verbal and nonverbal facial signals. This paper summarizes several design decisions and experiments that we have followed in attempts to build rich and fluent multimodal interactive systems using a newly developed hybrid robotic head called Furhat, and discusses issues and challenges that this effort is facing.Al Moubayed, S., Edlund, J., & Gustafson, J. (2013). Analysis of gaze and speech patterns in three-party quiz game interaction. In Interspeech 2013. Lyon, France. [abstract] [pdf]Abstract: In order to understand and model the dynamics between interaction phenomena such as gaze and speech in face-to-face multiparty interaction between humans, we need large quantities of reliable, objective data of such interactions. To date, this type of data is in short supply. We present a data collection setup using automated, objective techniques in which we capture the gaze and speech patterns of triads deeply engaged in a high-stakes quiz game. The resulting corpus consists of five one-hour recordings, and is unique in that it makes use of three state-of-the-art gaze trackers (one per subject) in combination with a state-of-the-art conical microphone array designed to capture roundtable meetings. Several video channels are also included. In this paper we present the obstacles we encountered and the possibilities afforded by a synchronised, reliable combination of large-scale multi-party speech and gaze data, and an overview of the first analyses of the data.Al Moubayed, S., Beskow, J., & Skantze, G. (2013). The Furhat Social Companion Talking Head. In Interspeech 2013 - Show and Tell. Lyon, France.
[abstract] [pdf]Abstract: In this demonstrator we present the Furhat robot head. Furhat is a highly human-like robot head in terms of dynamics, thanks to its use of back-projected facial animation. Furhat also takes advantage of a complex and advanced dialogue toolkits designed to facilitate rich and fluent multimodal multiparty human-machine situated and spoken dialogue. The demonstrator will present a social dialogue system with Furhat that allows for several simultaneous interlocutors, and takes advantage of several verbal and nonverbal input signals such as speech input, real-time multi-face tracking, and facial analysis, and communicates with its users in a mixed initiative dialogue, using state of the art speech synthesis, with rich prosody, lip animated facial synthesis, eye and head movements, and gestures.Al Moubayed, S., Skantze, G., & Beskow, J. (2013). The Furhat Back-Projected Humanoid Head - Lip reading, Gaze and Multiparty Interaction. International Journal of Humanoid Robotics, 10(1). [abstract] [pdf]Abstract: In this article, we present Furhat – a back-projected human-like robot head using state-of-the art facial animation. Three experiments are presented where we investigate how the head might facilitate human-robot face-to-face interaction. First, we investigate how the animated lips increase the intelligibility of the spoken output, and compare this to an animated agent presented on a flat screen, as well as to a human face. Second, we investigate the accuracy of the perception of Furhat’s gaze in a setting typical for situated interaction, where Furhat and a human are sitting around a table. The accuracy of the perception of Furhat’s gaze is measured depending on eye design, head movement and viewing angle. Third, we investigate the turn-taking accuracy of Furhat in a multi-party interactive setting, as compared to an animated agent on a flat screen. We conclude with some observations from a public setting at a museum, where Furhat interacted with thousands of visitors in a multi-party interaction.Alexanderson, S., House, D., & Beskow, J. (2013). Aspects of co-occurring syllables and head nods in spontaneous dialogue. In Proc. of 12th International Conference on Auditory-Visual Speech Processing (AVSP2013). Annecy, France. [pdf]Alexanderson, S., House, D., & Beskow, J. (2013). Extracting and analysing co-speech head gestures from motion-capture data. In Eklund, R. (Ed.), Proc. of Fonetik 2013 (pp. 1-4). Linköping University, Sweden. [pdf]Alexanderson, S., House, D., & Beskow, J. (2013). Extracting and analyzing head movements accompanying spontaneous dialogue. In Proc. Tilburg Gesture Research Meeting. Tilburg University, The Netherlands. [pdf]Beskow, J., Alexanderson, S., Stefanov, K., Claesson, B., Derbring, S., & Fredriksson, M. (2013). The Tivoli System – A Sign-driven Game for Children with Communicative Disorders. In Proceedings of the 1:st european Symposium on Multimodal Communication. [pdf]Beskow, J., & Stefanov, K. (2013). Web-enabled 3D talking avatars based on WebGL and HTML5. In In Proc. of the 13th International Conference on Intelligent Virtual Agents (IVA). Edinburgh, UK. [pdf]Bollepalli, B., Beskow, J., & Gustafson, J. (2013). Non-Linear Pitch Modification in Voice Conversion using Artificial Neural Networks. In ISCA Workshop on Non-Linear Speech Processing 2013. [pdf]Edlund, J., Al Moubayed, S., Tånnander, C., & Gustafson, J. (2013). Audience response system based annotation of speech. In Fonetik 2013. 
Linköping, Sweden.Edlund, J., Al Moubayed, S., & Beskow, J. (2013). Co-present or not? Embodiment, situatedness and the Mona Lisa gaze effect. In Nakano, Y., Conati, C., & Bader, T. (Eds.), Eye Gaze in Intelligent User Interfaces - Gaze-based Analyses, Models, and Applications. Springer. [abstract]Abstract: The interest in embodying and situating computer programmes took off in the autonomous agents community in the 90s. Today, researchers and designers of programmes that interact with people on human terms endow their systems with humanoid physiognomies for a variety of reasons. In most cases, attempts at achieving this embodiment and situatedness have taken one of two directions: virtual characters and actual physical robots. In addition, a technique that is far from new is gaining ground rapidly: projection of animated faces on head-shaped 3D surfaces. In this chapter, we provide a history of this technique; an overview of its pros and cons; and an in-depth description of the cause and mechanics of the main drawback of 2D displays of 3D faces (and objects): the Mona Lisa gaze effect. We conclude with a description of an experimental paradigm that measures perceived directionality in general and the Mona Lisa gaze effect in particular.Edlund, J., Al Moubayed, S., Tånnander, C., & Gustafson, J. (2013). Temporal precision and reliability of audience response system based annotation. In Proc. of Multimodal Corpora 2013. Edinburgh, UK. [abstract] [pdf]Abstract: Manual annotators are often used to label human interaction data. This is associated with high costs and high time consumption, which is one reason annotation by crowd sourcing is increasing. But in crowd sourcing, control over the conditions is largely lost. To get increased throughput with a higher measure of experimental control, we suggest borrowing from the Audience Response Systems used extensively in the film and television industries. We present (a) a cost-efficient setup for rapid, plenary annotation of human spoken interaction data; and (b) a study that quantifies the temporal precision and reliability of annotations made with the system in the auditory perception domain.Franovic, T., Herman, P., Salvi, G., Benjaminsson, S., & Lansner, A. (2013). Cortex-inspired network architecture for large-scale temporal information processing. In Frontiers in Neuroinformatics. Götze, J., & Boye, J. (2013). Deriving salience models from human route directions. In Proceedings of the Workshop on Computational Models of Spatial Language Interpretation and Generation (pp. 36-41). Heldner, M., Hjalmarsson, A., & Edlund, J. (2013). Backchannel relevance spaces. In Asu, E. L., & Lippus, P. (Eds.), Nordic Prosody: Proceedings of the XIth Conference (pp. 137-146). Frankfurt am Main, Germany: Peter Lang. [pdf]Johansson, M., Skantze, G., & Gustafson, J. (2013). Head Pose Patterns in Multiparty Human-Robot Team-Building Interactions. In International Conference on Social Robotics - ICSR 2013. Bristol, UK. [abstract] [pdf]Abstract: We present a data collection setup for exploring turn-taking in three-party human-robot interaction involving objects competing for attention. The collected corpus comprises 78 minutes in four interactions. Using automated techniques to record head pose and speech patterns, we analyze head pose patterns in turn-transitions. We find that introduction of objects makes addressee identification based on head pose more challenging.
The symmetrical setup also allows us to compare human-human to human-robot behavior within the same interaction. We argue that this symmetry can be used to assess to what extent the system exhibits a human-like behavior.Karlsson, A. M., Svantesson, J-O., & House, D. (2013). Multifunctionality of prosodic boundaries in spontaneous narratives in Kammu. In Proc. Prosody-Discourse Interface 2013, Leuven, Belgium. Koniaris, C., Salvi, G., & Engwall, O. (2013). On Mispronunciation Analysis of Individual Foreign Speakers Using Auditory Periphery Models. Speech Communication, 55(5), 691-706. [abstract] [004]Abstract: In second language (L2) learning, a major difficulty is to discriminate between the acoustic diversity within an L2 phoneme category and that between different categories. We propose a general method for automatic diagnostic assessment of the pronunciation of non-native speakers based on models of the human auditory periphery. Considering each phoneme class separately, the geometric shape similarity between the native auditory domain and the non-native speech domain is measured. The phonemes that deviate the most from the native pronunciation for a set of L2 speakers are detected by comparing the geometric shape similarity measure with that calculated for native speakers on the same phonemes. To evaluate the system, we have tested it with different non-native speaker groups from various language backgrounds. The experimental results are in accordance with linguistic findings and human listeners’ ratings, particularly when both the spectral and temporal cues of the speech signal are utilized in the pronunciation analysis.Lopes, J. (2013). Lexical Entrainment in Spoken Dialog Systems. Doctoral dissertation, Instituto Superior Técnico, Universidade de Lisboa.Meena, R., Skantze, G., & Gustafson, J. (2013). A Data-driven Model for Timing Feedback in a Map Task Dialogue System. In 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) (pp. 375-383). Metz, France. [abstract] [pdf]Abstract: We present a data-driven model for detecting suitable response locations in the user’s speech. The model has been trained on human–machine dialogue data and implemented and tested in a spoken dialogue system that can perform the Map Task with users. To our knowledge, this is the first example of a dialogue system that uses automatically extracted syntactic, prosodic and contextual features for online detection of response locations. A subjective evaluation of the dialogue system suggests that interactions with a system using our trained model were perceived significantly better than those with a system using a model that made decisions at random.Meena, R., Skantze, G., & Gustafson, J. (2013). Human Evaluation of Conceptual Route Graphs for Interpreting Spoken Route Descriptions. In Proceedings of IWCS 2013 Workshop on Computational Models of Spatial Language Interpretation and Generation (CoSLI-3) (pp. 13-18). Potsdam, Germany: Association for Computational Linguistics. [abstract] [pdf]Abstract: We present a human evaluation of the usefulness of conceptual route graphs (CRGs) when it comes to route following using spoken route descriptions. We describe a method for data-driven semantic interpretation of route de-scriptions into CRGs. 
The comparable performances of human participants in sketching a route using the manually transcribed CRGs and the CRGs produced on speech recognized route descriptions indicate the robustness of our method in preserving the vital conceptual information required for route following despite speech recognition errors.Meena, R., Skantze, G., & Gustafson, J. (2013). The Map Task Dialogue System: A Test-bed for Modelling Human-Like Dialogue. In 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) (pp. 366-368). Metz, France. [abstract] [pdf]Abstract: The demonstrator presents a test-bed for collecting data on human–computer dialogue: a fully automated dialogue system that can perform Map Task with a user. In a first step, we have used the test-bed to collect human–computer Map Task dialogue data, and have trained various data-driven models on it for detecting feedback response locations in the user’s speech. One of the trained models has been tested in user interactions and was perceived better in comparison to a system using a random model. The demonstrator will exhibit three versions of the Map Task dialogue system—each using a different trained data-driven model of Response Location Detection.Mirnig, N., Weiss, A., Skantze, G., Al Moubayed, S., Gustafson, J., Beskow, J., Granström, B., & Tscheligi, M. (2013). Face-to-Face with a Robot: What do we actually talk about?. International Journal of Humanoid Robotics, 10(1). [abstract] [link]Abstract: Whereas much of the state-of-the-art research in Human-Robot Interaction (HRI) investigates task-oriented interaction, this paper aims at exploring what people talk about to a robot if the content of the conversation is not predefined. We used the robot head Furhat to explore the conversational behavior of people who encounter a robot in the public setting of a robot exhibition in a scientific museum, but without a predefined purpose. Upon analyzing the conversations, it could be shown that a sophisticated robot provides an inviting atmosphere for people to engage in interaction and to be experimental and challenge the robot’s capabilities. Many visitors to the exhibition were willing to go beyond the guiding questions that were provided as a starting point. Amongst other things, they asked Furhat questions concerning the robot itself, such as how it would define a robot, or if it plans to take over the world. People were also interested in the feelings and likes of the robot and they asked many personal questions - this is how Furhat ended up with its first marriage proposal. People who talked to Furhat were asked to complete a questionnaire on their assessment of the conversation, with which we could show that the interaction with Furhat was rated as a pleasing experience.Neiberg, D., Salvi, G., & Gustafson, J. (2013). Semi-supervised methods for exploring the acoustics of simple productive feedback. Speech Communication, 55(3), 451-469. [007]Oertel, C., Salvi, G., Götze, J., Edlund, J., Gustafson, J., & Heldner, M. (2013). The KTH Games Corpora: How to Catch a Werewolf. In IVA 2013 Workshop Multimodal Corpora: Beyond Audio and Video - MMC 2013. [pdf]Oertel, C., & Salvi, G. (2013). A Gaze-based Method for Relating Group Involvement to Individual Engagement in Multimodal Multiparty Dialogue. In Proceedings of the ACM International Conference on Multimodal Interaction (ICMI). Sydney, Australia.
[abstract] [pdf]Abstract: This paper is concerned with modelling individual engagement and group involvement as well as their relationship in an eight-party, multimodal corpus. We propose a number of features (presence, entropy, symmetry and maxgaze) that summarise different aspects of eye-gaze patterns and allow us to describe individual as well as group behaviour in time. We use these features to define similarities between the subjects and we compare this information with the engagement rankings the subjects expressed at the end of each interaction about themselves and the other participants. We analyse how these features relate to four classes of group involvement and we build a classifier that is able to distinguish between those classes with 71% accuracy.Salvi, G. (2013). Biologically Inspired Methods for Automatic Speech Understanding. In Advances in Intelligent Systems and Computing (AISC) (pp. 283). Palermo, Italy. [abstract]Abstract: Automatic Speech Recognition (ASR) and Understanding (ASU) systems heavily rely on machine learning techniques to solve the problem of mapping spoken utterances into words and meanings. The statistical methods employed, however, greatly deviate from the processes involved in human language acquisition in a number of key aspects. Although ASR and ASU have recently reached a level of accuracy that is sufficient for some practical applications, there are still severe limitations due, for example, to the amount of training data required and the lack of generalization of the resulting models. In our opinion, there is a need for a paradigm shift and speech technology should address some of the challenges that humans face when learning a first language and that are currently ignored by the ASR and ASU methods. In this paper, we point out some of the aspects that could lead to more robust and flexible models, and we describe some of the research we and other researchers have performed in the area.Saponaro, G., Salvi, G., & Bernardino, A. (2013). Robot Anticipation of Human Intentions through Continuous Gesture Recognition. In Proc. 4th International Workshop on Collaborative Robots and Human Robot Interaction (CR-HRI 2013). San Diego, USA. [pdf]Skantze, G., & Hjalmarsson, A. (2013). Towards Incremental Speech Generation in Conversational Systems. Computer Speech & Language, 27(1), 243-262. [abstract] [pdf]Abstract: This paper presents a model of incremental speech generation in practical conversational systems. The model allows a conversational system to incrementally interpret spoken input, while simultaneously planning, realising and self-monitoring the system response. If these processes are time consuming and result in a response delay, the system can automatically produce hesitations to retain the floor. While speaking, the system utilises hidden and overt self-corrections to accommodate revisions in the system. The model has been implemented in a general dialogue system framework. Using this framework, we have implemented a conversational game application. A Wizard-of-Oz experiment is presented, where the automatic speech recognizer is replaced by a Wizard who transcribes the spoken input. In this setting, the incremental model allows the system to start speaking while the user's utterance is being transcribed. In comparison to a non-incremental version of the same system, the incremental version has a shorter response time and is perceived as more efficient by the users.Skantze, G., Hjalmarsson, A., & Oertel, C. (2013).
Exploring the effects of gaze and pauses in situated human-robot interaction. In 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue - SIGDial. Metz, France. [abstract] [pdf]Abstract: In this paper, we present a user study where a robot instructs a human on how to draw a route on a map, similar to a Map Task. This setup has allowed us to study user reactions to the robot’s conversational behaviour in order to get a better understanding of how to generate utterances in incremental dialogue systems. We have analysed the participants' subjective rating, task completion, verbal responses, gaze behaviour, drawing activity, and cognitive load. The results show that users utilise the robot’s gaze in order to disambiguate referring expressions and manage the flow of the interaction. Furthermore, we show that the user’s behaviour is affected by how pauses are realised in the robot’s speech.Skantze, G., Oertel, C., & Hjalmarsson, A. (2013). User feedback in human-robot interaction: Prosody, gaze and timing. In Proceedings of Interspeech. [abstract] [pdf]Abstract: This paper investigates forms and functions of user feedback in a map task dialogue between a human and a robot, where the robot is the instruction-giver and the human is the instruction-follower. First, we investigate how user acknowledgements in task-oriented dialogue signal whether an activity is about to be initiated or has been completed. The parameters analysed include the users’ lexical and prosodic realisation as well as gaze direction and response timing. Second, we investigate the relation between these parameters and the perception of uncertainty.Stefanov, K., & Beskow, J. (2013). A Kinect Corpus of Swedish Sign Language Signs. In Proceedings of the 2013 Workshop on Multimodal Corpora: Beyond Audio and Video. [pdf]Strömbergsson, S. (2013). Children’s recognition of their own recorded voice: Influence of age and phonological impairment. Clinical Linguistics & Phonetics, 27(1), 33-45. [abstract] [link]Abstract: Children with phonological impairment (PI) often have difficulties to perceive the insufficiencies in their own speech. The use of recordings has been suggested as a way of directing the child’s attention towards his/her own speech, despite a lack of evidence that children actually recognize their recorded voice as their own. We present two studies of children’s self-voice identification, one exploring developmental aspects, and one exploring potential effects of having a PI. The results indicate that children from 4 to 8 years recognize their recorded voice well (around 80% accuracy), regardless of whether they have a PI or not. A subtle change in this ability from 4 to 8 years is observed that could be linked to a development in short-term memory. Clinically, one can indeed expect an advantage of using recordings in therapy; this could constitute an intermediate step towards the more challenging task of online self-monitoring.Strömbergsson, S., Hjalmarsson, A., Edlund, J., & House, D. (2013). Timing responses to questions in dialogue. In Proceedings of Interspeech 2013 (pp. 2584-2588). Lyon, France. [abstract] [pdf]Abstract: Questions and answers play an important role in spoken dialogue systems as well as in human-human interaction. A critical concern when responding to a question is the timing of the response. 
While human response times depend on a wide set of features, dialogue systems generally respond as soon as they can, that is, when the end of the question has been detected and the response is ready to be deployed. This paper presents an analysis of how different semantic and pragmatic features affect the response times to questions in two different data sets of spontaneous human-human dialogues: the Swedish Spontal Corpus and the US English Switchboard corpus. Our analysis shows that contextual features such as question type, response type, and conversation topic influence human response times. Based on these results, we propose that more sophisticated response timing can be achieved in spoken dialogue systems by using these features to automatically and deliberately target system response timing.Strömbergsson, S., & Tånnander, C. (2013). Correlates to intelligibility in deviant child speech – comparing clinical evaluations to audience response system-based evaluations by untrained listeners. In Proceedings of Interspeech 2013 (pp. 3717-3721). Lyon, France. [abstract] [pdf]Abstract: The severity of speech impairments can be measured in different ways; whereas some metrics focus on quantifying the specific speech deviations, other focus on the functional effects of the speech impairment, e.g. by rating intelligibility. This report describes the application of a previously untested method to the domain of deviant child speech; an audience response system-based method where listeners’ responses are continuously registered during playback of speech stimuli. 20 adult listeners were given the task of clicking a button whenever they perceived something unintelligible or deviant during playback of child speech stimuli. The untrained listeners’ responses were compared to clinical evaluations of the same speech samples, revealing a strong correlation between the two types of measures. Furthermore, patterns of how listeners’ different experiences influence their clicking responses were explored. Qualitative analysis linking listener clicks to triggering events in the speech samples demonstrates the potential of the click method as an instrument for identification of features in children’s speech that are most detrimental to intelligibility – insights that may have important implications for the selection of speech targets in clinical intervention.2012Al Moubayed, S., Beskow, J., Blomberg, M., Granström, B., Gustafson, J., Mirnig, N., & Skantze, G. (2012). Talking with Furhat - multi-party interaction with a back-projected robot head. In Proceedings of Fonetik'12. Gothenberg, Sweden. [abstract] [pdf]Abstract: This is a condensed presentation of some recent work on a back-projected robotic head for multi-party interaction in public settings. We will describe some of the design strategies and give some preliminary analysis of an interaction database collected at the Robotville exhibition at the London Science MuseumAl Moubayed, S., Beskow, J., Skantze, G., & Granström, B. (2012). Furhat: A Back-projected Human-like Robot Head for Multiparty Human-Machine Interaction. In Esposito, A., Esposito, A., Vinciarelli, A., Hoffmann, R., & C. Müller, V. (Eds.), Cognitive Behavioural Systems. Lecture Notes in Computer Science (pp. 114-130). Springer.Al Moubayed, S., Edlund, J., & Beskow, J. (2012). Taming Mona Lisa: communicating gaze faithfully in 2D and 3D facial projections. ACM Transactions on Interactive Intelligent Systems, 1(2), 25. 
[abstract] [pdf]Abstract: The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of the dialogue and participant attention, for deictic referencing, and for the communication of attitude. When developing embodied conversational agents (ECAs) and talking heads, modeling and delivering accurate gaze targets is crucial. Traditionally, systems communicating through talking heads have been displayed to the human conversant using 2D displays, such as flat monitors. This approach introduces severe limitations for an accurate communication of gaze since 2D displays are associated with several powerful effects and illusions, most importantly the Mona Lisa gaze effect, where the gaze of the projected head appears to follow the observer regardless of viewing angle. We describe the Mona Lisa gaze effect and its consequences in the interaction loop, and propose a new approach for displaying talking heads using a 3D projection surface (a physical model of a human head) as an alternative to the traditional flat surface projection. We investigate and compare the accuracy of the perception of gaze direction and the Mona Lisa gaze effect in 2D and 3D projection surfaces in a five subject gaze perception experiment. The experiment confirms that a 3D projection surface completely eliminates the Mona Lisa gaze effect and delivers very accurate gaze direction that is independent of the observer's viewing angle. Based on the data collected in this experiment, we rephrase the formulation of the Mona Lisa gaze effect. The data, when reinterpreted, confirms the predictions of the new model for both 2D and 3D projection surfaces. Finally, we discuss the requirements on different spatially interactive systems in terms of gaze direction, and propose new applications and experiments for interaction in a human-ECA and a human-robot settings made possible by this technology.Al Moubayed, S., & Skantze, G. (2012). Perception of Gaze Direction for Situated Interaction. In Proc. of the 4th Workshop on Eye Gaze in Intelligent Human Machine Interaction. The 14th ACM International Conference on Multimodal Interaction ICMI. Santa Monica, CA, USA. [abstract] [pdf]Abstract: Accurate human perception of robots’ gaze direction is crucial for the design of a natural and fluent situated multimodal face-to-face interaction between humans and machines. In this paper, we present an experiment targeted at quantifying the effects of different gaze cues synthesized using the Furhat back-projected robot head, on the accuracy of perceived spatial direction of gaze by humans using 18 test subjects. The study first quantifies the accuracy of the perceived gaze direction in a human-human setup, and compares that to the use of synthesized gaze movements in different conditions: viewing the robot eyes frontal or at a 45 degrees angle side view. We also study the effect of 3D gaze by controlling both eyes to indicate the depth of the focal point, the use of gaze or head pose, and the use of static or dynamic eyelids. The findings of the study are highly relevant to the design and control of robots and animated agents in situated face-to-face interactionAl Moubayed, S., Skantze, G., & Beskow, J. (2012). Lip-reading Furhat: Audio Visual Intelligibility of a Back Projected Animated Face. In Proc. of the Intelligent Virtual Agents 10th International Conference (IVA 2012). Santa Cruz, CA, USA: Springer. 
[abstract] [pdf]Abstract: Back projecting a computer animated face onto a three dimensional static physical model of a face is a promising technology that is gaining ground as a solution to building situated, flexible and human-like robot heads. In this paper, we first briefly describe Furhat, a back projected robot head built for the purpose of multimodal multiparty human-machine interaction, and its benefits over virtual characters and robotic heads; and then motivate the need to investigate the contribution that Furhat’s face offers to speech intelligibility. We present an audio-visual speech intelligibility experiment, in which 10 subjects listened to short sentences with a degraded speech signal. The experiment compares the gain in intelligibility between lip reading a face visualized on a 2D screen and a 3D back-projected face, from different viewing angles. The results show that the audio-visual speech intelligibility holds when the avatar is projected onto a static face model (in the case of Furhat), and even, rather surprisingly, exceeds it. This means that despite the movement limitations that back projected animated face models bring about, their audio visual speech intelligibility is equal, or even higher, compared to the same models shown on flat displays. At the end of the paper we discuss several hypotheses on how to interpret the results, and motivate future investigations to better explore the characteristics of visual speech perception of 3D projected animated faces.Al Moubayed, S., Skantze, G., Beskow, J., Stefanov, K., & Gustafson, J. (2012). Multimodal Multiparty Social Interaction with the Furhat Head. In Proc. of the 14th ACM International Conference on Multimodal Interaction ICMI. Santa Monica, CA, USA. [abstract] [pdf]Abstract: We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is a human-like interface that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator will showcase a system that is able to carry out social dialogue with multiple interlocutors simultaneously with rich output signals such as eye and head coordination, lip-synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations.Al Moubayed, S. (2012). Bringing the avatar to life. Doctoral dissertation, School of Computer Science, KTH Royal Institute of Technology. [abstract] [pdf]Abstract: The work presented in this thesis comes in pursuit of the ultimate goal of building spoken and embodied human-like interfaces that are able to interact with humans under human terms. Such interfaces need to employ the subtle, rich and multidimensional signals of communicative and social value that complement the stream of words – signals humans typically use when interacting with each other. The studies presented in the thesis concern facial signals used in spoken communication, and can be divided into two connected groups. The first is targeted towards exploring and verifying models of facial signals that come in synchrony with speech and its intonation. We refer to this as visual-prosody, and as part of visual-prosody, we take prominence as a case study.
We show that the use of prosodically relevant gestures in animated faces results in a more expressive and human-like behaviour. We also show that animated faces supported with these gestures result in more intelligible speech which in turn can be used to aid communication, for example in noisy environments. The other group of studies targets facial signals that complement speech. Spoken language is a relatively poor system for the communication of spatial information, since such information is visual in nature. Hence, the use of visual movements of spatial value, such as gaze and head movements, is important for an efficient interaction. The use of such signals is especially important when the interaction between the human and the embodied agent is situated – that is, when they share the same physical space, and when this space is taken into account in the interaction. We study the perception, the modelling, and the interaction effects of gaze and head pose in regulating situated and multiparty spoken dialogues in two conditions. The first is the typical case where the animated face is displayed on flat surfaces, and the second where it is displayed on a physical three-dimensional model of a face. The results from the studies show that projecting the animated face onto a face-shaped mask results in an accurate perception of the direction of gaze that is generated by the avatar, and hence can allow for the use of these movements in multiparty spoken dialogue. Driven by these findings, the Furhat back-projected robot head is developed. Furhat employs state-of-the-art facial animation that is projected on a 3D printout of that face, and a neck to allow for head movements. Although the mask in Furhat is static, the fact that the animated face matches the design of the mask results in a physical face that is perceived to “move”. We present studies that show how this technique renders a more intelligible, human-like and expressive face. We further present experiments in which Furhat is used as a tool to investigate properties of facial signals in situated interaction. Furhat is built to study, implement, and verify models of situated and multiparty, multimodal Human-Machine spoken dialogue, a study that requires that the face is physically situated in the interaction environment rather than on a two-dimensional screen. It has also received much interest from several communities, and been showcased at several venues, including a robot exhibition at the London Science Museum. We present an evaluation study of Furhat at the exhibition where it interacted with several thousand persons in a multiparty conversation. The analysis of the data from the setup further shows that Furhat can accurately regulate multiparty interaction using gaze and head movements.Al Moubayed, S., Beskow, J., Granström, B., Gustafson, J., Mirnig, N., Skantze, G., & Tscheligi, M. (2012). Furhat goes to Robotville: a large-scale multiparty human-robot interaction data collection in a public space. In Proc. of LREC Workshop on Multimodal Corpora. Istanbul, Turkey. [pdf]Alexanderson, S., & Beskow, J. (2012). Can Anybody Read Me? Motion Capture Recordings for an Adaptable Visual Speech Synthesizer. In Proceedings of The Listening Talker. Edinburgh, UK. [pdf]Ananthakrishnan, G., Engwall, O., & Neiberg, D. (2012). Exploring the Predictability of Non-unique Acoustic-to-Articulatory Mappings. IEEE Transactions on Audio, Speech, and Language Processing.
[abstract] [link]Abstract: This paper explores statistical tools that help analyze the predictability in the acoustic-to-articulatory inversion of speech, using an Electromagnetic Articulography database of simultaneously recorded acoustic and articulatory data. Since it has been shown that speech acoustics can be mapped to non-unique articulatory modes, the variance of the articulatory parameters is not sufficient to understand the predictability of the inverse mapping. We, therefore, estimate an upper bound to the conditional entropy of the articulatory distribution. This provides a probabilistic estimate of the range of articulatory values (either over a continuum or over discrete non-unique regions) for a given acoustic vector in the database. The analysis is performed for different British/Scottish English consonants with respect to which articulators (lips, jaws or the tongue) are important for producing the phoneme. The paper shows that acoustic-articulatory mappings for the important articulators have a low upper bound on the entropy, but can still have discrete non-unique configurations.Blomberg, M., Skantze, G., Al Moubayed, S., Gustafson, J., Beskow, J., & Granström, B. (2012). Children and adults in dialogue with the robot head Furhat - corpus collection and initial analysis. In Proceedings of WOCCI. Portland, OR. [pdf]Bollepalli, B., Beskow, J., & Gustafson, J. (2012). HMM based speech synthesis system for Swedish Language. In The Fourth Swedish Language Technology Conference. Lund, Sweden.Borin, L., Brandt, M. D., Edlund, J., Lindh, J., & Parkvall, M. (2012). The Swedish Language in the Digital Age/Svenska språket i den digitala tidsåldern. Springer. [pdf]Boye, J., Fredriksson, M., Götze, J., Gustafson, J., & Königsmann, J. (2012). Walk this way: Spatial grounding for city exploration. In IWSDS2012 (International Workshop on Spoken Dialog Systems). Csapo, A., Gilmartin, E., Grizou, J., Han, J., Meena, R., Anastasiou, D., Jokinen, K., & Wilcock, G. (2012). Multimodal Conversational Interaction with a Humanoid Robot. In Proceedings of 3rd International Conference on Cognitive Infocommunications (CogInfoCom 2012) (pp. 667-672). Kosice, Slovakia: IEEE. [abstract] [pdf]Abstract: The paper presents a multimodal conversational interaction system for the Nao humanoid robot. The system was developed at the 8th International Summer Workshop on Multimodal Interfaces, Metz, 2012. We implemented WikiTalk, an existing spoken dialogue system for open-domain conversations, on Nao. This greatly extended the robot's interaction capabilities by enabling Nao to talk about an unlimited range of topics. In addition to speech interaction, we developed a wide range of multimodal interactive behaviours by the robot, including face-tracking, nodding, communicative gesturing, proximity detection and tactile interrupts. We made video recordings of user interactions and used questionnaires to evaluate the system. We further extended the robot's capabilities by linking Nao with Kinect.Csapo, A., Gilmartin, E., Grizou, J., Han, J., Meena, R., Anastasiou, D., Jokinen, K., & Wilcock, G. (2012). Open-Domain Conversation with a NAO Robot. In Proceedings of the 3rd International Conference on Cognitive Infocommunications (CogInfoCom 2012). Kosice, Slovakia: IEEE. [abstract] [pdf]Abstract: In this demo, we present a multimodal conversation system, implemented using a Nao robot and Wikipedia. The system was developed at the 8th International Workshop on Multimodal Interfaces in Metz, France, 2012. 
The system is based on an interactive, open-domain spoken dialogue system called WikiTalk, which guides the user through conversations based on the link structure of Wikipedia. In addition to speech interaction, the robot interacts with users by tracking their faces and nodding/gesturing at key points of interest within the Wikipedia text. The proximity detection capabilities of the Nao, as well as its tactile sensors were used to implement context-based interrupts in the dialogue system.Csapo, A., Gilmartin, E., Grizou, J., Han, J., Meena, R., Anastasiou, D., Jokinen, K., & Wilcock, G. (2012). Speech, Gaze and Gesturing: Multimodal Conversational Interaction with Nao Robot. Technical Report, Metz, France. [abstract] [pdf]Abstract: The paper presents a multimodal conversational interaction system for the Nao humanoid robot. The system was developed at the 8th International Summer Workshop on Multi-modal Interfaces, Metz, 2012. We implemented WikiTalk, an existing spoken dialogue system for open-domain conversations, on Nao. This greatly extended the robot’s interaction capabilities by enabling Nao to talk about an unlimited range of topics. In addition to speech interaction, we developed a wide range of multimodal interactive behaviours by the robot, including face-tracking, nodding, communicative gesturing, proximity detection and tactile interrupts. We made video recordings of user interactions and used questionnaires to evaluate the system. We further extended the robot’s capabilities by linking Nao with Kinect.Edlund, J., Alexandersson, S., Beskow, J., Gustavsson, L., Heldner, M., Hjalmarsson, A., Kallionen, P., & Marklund, E. (2012). 3rd party observer gaze as a continuous measure of dialogue flow. In Proc. of LREC 2012. Istanbul, Turkey. [abstract] [pdf]Abstract: We present an attempt at using 3rd party observer gaze to get a measure of how appropriate each segment in a dialogue is for a speaker change. The method is a step away from the current dependency of speaker turns or talkspurts towards a more general view of speaker changes. We show that 3rd party observers do indeed largely look at the same thing (the speaker), and how this can be captured and utilized to provide insights into human communication.Edlund, J., Heldner, M., & Gustafson, J. (2012). On the effect of the acoustic environment on the accuracy of perception of speaker orientation from auditory cues alone. In Proc. of Interspeech 2012. Portland, Oregon, US. [abstract] [pdf]Abstract: The ability of people, and of machines, to determine the position of a sound source in a room is well studied. The related ability to determine the orientation of a directed sound source, on the other hand, is not, but the few studies there are show people to be surprisingly skilled at it. This has bearing for studies of face-to-face interaction and of embodied spoken dialogue systems, as sound source orientation of a speaker is connected to the head pose of the speaker, which is meaningful in a number of ways. The feature most often implicated for detection of sound source orientation is the inter-aural level difference - a feature which it is assumed is more easily exploited in anechoic chambers than in everyday surroundings. We expand here on our previous studies and compare detection of speaker orientation within and outside of the anechoic chamber. 
Our results show that listeners find the task easier, rather than harder, in everyday surroundings, which suggests that inter-aural level differences is not the only feature at play.Edlund, J., Heldner, M., & Gustafson, J. (2012). Who am I speaking at? - perceiving the head orientation of speakers from acoustic cues alone. In Proc. of LREC Workshop on Multimodal Corpora 2012. Istanbul, Turkey. [abstract] [pdf]Abstract: The ability of people, and of machines, to determine the position of a sound source in a room is well studied. The related ability to determine the orientation of a directed sound source, on the other hand, is not, but the few studies there are show people to be surprisingly skilled at it. This has bearing for studies of face-to-face interaction and of embodied spoken dialogue systems, as sound source orientation of a speaker is connected to the head pose of the speaker, which is meaningful in a number of ways. We describe in passing some preliminary findings that led us onto this line of investigation, and in detail a study in which we extend an experiment design intended to measure perception of gaze direction to test instead for perception of sound source orientation. The results corroborate those of previous studies, and further show that people are very good at performing this skill outside of studio conditions as well.Edlund, J., Heldner, M., & Hjalmarsson, A. (2012). 3rd party observer gaze during backchannels. In Proc. of the Interspeech 2012 Interdisciplinary Workshop on Feedback Behaviors in Dialog. Skamania Lodge, WA, USA.Edlund, J., & Hjalmarsson, A. (2012). Is it really worth it? Cost-based selection of system responses to speech-in-overlap. In Proc. of the IVA 2012 workshop on Realtime Conversational Virtual Agents (RCVA 2012). Santa Cruz, CA, USA. [abstract] [pdf]Abstract: For purposes of discussion and feedback, we present a preliminary version of a simple yet powerful cost-based framework for spoken dialogue sys-tems to continuously and incrementally decide whether to speak or not. The framework weighs the cost of producing speech in overlap against the cost of not speaking when something needs saying. Main features include a small number of parameters controlling characteristics that are readily understood, al-lowing manual tweaking as well as interpretation of trained parameter settings; observation-based estimates of expected overlap which can be adapted dynami-cally; and a simple and general method for context dependency. No evaluation has yet been undertaken, but the effects of the parameters; the observation-based cost of expected overlap trained on Switchboard data; and the context de-pendency using inter-speaker intensity differences from the same corpus are demonstrated with generated input data in the context of user barge-ins.Edlund, J., Hjalmarsson, A., & Tånnander, C. (2012). Unconventional methods in perception experiments. In Proc. of Nordic Prosody XI. Tartu, Estonia.Edlund, J., House, D., & Beskow, J. (2012). Gesture movement profiles in dialogues from a Swedish multimodal database of spontaneous speech. In Bergmann, P., Brenning, J., Pfeiffer, M. C., & Reber, E. (Eds.), Prosody and Embodiment in Interactional Grammar (pp. 265-280). Berlin: de Gruyter.Edlund, J., House, D., & Strömbergsson, S. (2012). Question types and some prosodic correlates in 600 questions in the Spontal database of Swedish dialogues. In Ma, Q., Ding, H., & Hirst, D. (Eds.), Proc. of Speech Prosody 2012 (pp. 737-740). Shanghai, China. 
[abstract] [pdf]Abstract: Studies of questions present strong evidence that there is no one-to-one relationship between intonation and interrogative mode. We present initial steps of a larger project investigating and describing intonational variation in the Spontal database of 120 half-hour spontaneous dialogues in Swedish, and testing the hypothesis that the concept of a standard question intonation such as a final pitch rise contrasting a final low declarative intonation is not consistent with the pragmatic use of intonation in dialogue. We report on the extraction of 600 questions from the Spontal corpus, coding and annotation of question typology, and preliminary results concerning some prosodic correlates related to question type.Edlund, J., Oertel, C., & Gustafson, J. (2012). Investigating negotiation for load-time in the GetHomeSafe project. In Proc. of Workshop on Innovation and Applications in Speech Technology (IAST). Dublin, Ireland. [abstract] [pdf]Abstract: This paper describes ongoing work by KTH Speech, Music and Hearing in GetHomeSafe, a newly inaugurated EU project in collaboration with DFKI, Nuance, IBM and Daimler. Under the assumption that drivers will utilize technology while driving regardless of legislation, the project aims at finding out how to make the use of in-car technology as safe as possible rather than prohibiting it. We describe the project in general briefly and our role in some more detail, in particular one of our tasks: to build a system that can ask the driver whether now is a good time to speak about X in an unobtrusive manner, and that knows how to deal with rejection, for example by asking the driver to get back when it is a good time or to schedule a time that will be convenient.Edlund, J., Strömbergsson, S., & House, D. (2012). Telling questions from statements in spoken dialogue systems. In Proc. of SLTC 2012. Lund, Sweden. [pdf]Engwall, O. (2012). Datoranimerade talande ansikten. In Adelswärd, V., & Forstorp, P-A. (Eds.), Människans ansikten: Emotion, interaktion och konst. Carlssons Bokförlag. [pdf]Engwall, O. (2012). Pronunciation analysis by acoustic-to-articulatory feature inversion. In Engwall, O. (Ed.), Proceedings of the International Symposium on Automatic detection of Errors in Pronunciation Training (pp. 79-84). Stockholm. [pdf]Haider, F., & Al Moubayed, S. (2012). Towards speaker detection using lips movements for human-machine multiparty dialogue. In Proceedings of Fonetik'12. Gothenberg, Sweden. [pdf]Hjalmarsson, A., & Oertel, C. (2012). Gaze direction as a Back-Channel inviting Cue in Dialogue. In Proc. of the IVA 2012 workshop on Realtime Conversational Virtual Agents (RCVA 2012). Santa Cruz, CA, USA. [abstract] [pdf]Abstract: In this study, we experimentally explore the relationship between gaze direction and backchannels in face-to-face interaction. The overall motivation is to use gaze direction in a virtual agent as a means to elicit user feedback. The relationship between gaze and backchannels was tested in an experiment in which participants were asked to provide feedback when listening to a story-telling virtual agent. When speaking, the agent shifted her gaze towards the listener at predefined positions in the dialogue. The results show that listeners are more prone to backchannel when the virtual agent’s gaze is directed towards them than when it is directed away.
However, there is a high response variability for different dialogue contexts which suggests that the timing of backchannels cannot be explained by gaze direction alone.House, D. (2012). Response to Fred Cummins: Looking for Rhythm in Speech. Empirical Musicology Review, 7(1-2), 45-48. [abstract] [pdf]Abstract: This commentary briefly reviews three aspects of rhythm in speech. The first concerns the issues of what to measure and how measurements should relate to rhythm's communicative functions. The second relates to how tonal and durational features of speech contribute to the percept of rhythm, noting evidence that indicates such features can be tightly language-specific. The third aspect addressed is how bodily gestures integrate with and enhance the communicative functions of speech rhythm.Johansson, M. (2012). A Data-Driven Semantic Parser for Spoken Route Descriptions. Master's thesis, TMH, KTH. [abstract] [pdf]Abstract: The IURO project is an international multidisciplinary project with the aim to develop a robot that can autonomously navigate in unknown populated urban environments using passers-by as a source of knowledge in addition to sensory data. The problem presented in this thesis is to devise, implement and evaluate a method for interpreting spoken route descriptions into conceptual route graphs. As part of the work, a format for conceptual route graphs is selected and a Concept Error Rate measure is introduced for assessing the performance of route graph interpretations. Additionally, a corpus of Swedish urban route descriptions is collected using a Wizard-of-Oz setting. A data-driven approach is selected for implementing a semantic chunking parser, which divides each route description into conceptual chunks, which are attached to each other to form a hierarchical route graph. The proposed method is implemented and evaluated on the collected corpus, using different machine learning algorithms, including single layer perceptron networks. The results show that the selected route graph representation is applicable and that the method is functional. The best machine learning algorithm performs significantly better than a majority class baseline. The results also indicate that most errors are introduced in the initial chunking process and that performance may be improved by the use of more data.Karlsson, A., House, D., & Svantesson, J-O. (2012). Intonation adapts to lexical tone: The case of Kammu. Phonetica, 69(1-2), 28-47.Karlsson, A. M., Svantesson, J-O., & House, D. (2012). Adaptation of focus to lexical tone and phrasing in Kammu. In Proceedings of the Third International Symposium on Tonal Aspects of Languages (pp. O3-01). Nanjing, China. [pdf]Koniaris, C. (2012). Perceptually Motivated Speech Recognition and Mispronunciation Detection. Doctoral dissertation, School of Computer Science and Communication, KTH, Stockholm, Sweden. [abstract] [link]Abstract: This doctoral thesis is the result of a research effort performed in two fields of speech technology, i.e., speech recognition and mispronunciation detection. Although the two areas are clearly distinguishable, the proposed approaches share a common hypothesis based on psychoacoustic processing of speech signals. The conjecture implies that the human auditory periphery provides a relatively good separation of different sound classes. 
Hence, it is possible to use recent findings from psychoacoustic perception together with mathematical and computational tools to model the auditory sensitivities to small speech signal changes. The performance of an automatic speech recognition system strongly depends on the representation used for the front-end. If the extracted features do not include all relevant information, the performance of the classification stage is inherently suboptimal. The work described in Papers A, B and C is motivated by the fact that humans perform better at speech recognition than machines, particularly for noisy environments. The goal is to make use of knowledge of human perception in the selection and optimization of speech features for speech recognition. These papers show that maximizing the similarity of the Euclidean geometry of the features to the geometry of the perceptual domain is a powerful tool to select or optimize features. Experiments with a practical speech recognizer confirm the validity of the principle. It is also shown an approach to improve mel frequency cepstrum coefficients (MFCCs) through offline optimization. The method has three advantages: i) it is computationally inexpensive, ii) it does not use the auditory model directly, thus avoiding its computational cost, and iii) importantly, it provides better recognition performance than traditional MFCCs for both clean and noisy conditions. The second task concerns automatic pronunciation error detection. The research, described in Papers D, E and F, is motivated by the observation that almost all native speakers perceive, relatively easily, the acoustic characteristics of their own language when it is produced by speakers of the language. Small variations within a phoneme category, sometimes different for various phonemes, do not change significantly the perception of the language's own sounds. Several methods are introduced based on similarity measures of the Euclidean space spanned by the acoustic representations of the speech signal and the Euclidean space spanned by an auditory model output, to identify the problematic phonemes for a given speaker. The methods are tested for groups of speakers from different languages and evaluated according to a theoretical linguistic study showing that they can capture many of the problematic phonemes that speakers from each language mispronounce. Finally, a listening test on the same dataset verifies the validity of these methods.Koniaris, C., Engwall, O., & Salvi, G. (2012). Auditory and Dynamic Modeling Paradigms to Detect L2 Mispronunciations. In Interspeech 2012. Portland, OR, USA. [abstract] [pdf]Abstract: This paper expands our previous work on automatic pronunciation error detection that exploits knowledge from psychoacoustic auditory models. The new system has two additional important features, i.e., auditory and acoustic processing of the temporal cues of the speech signal, and classification feedback from a trained linear dynamic model. We also perform a pronunciation analysis by considering the task as a classification problem. Finally, we evaluate the proposed methods conducting a listening test on the same speech material and compare the judgment of the listeners and the methods. The automatic analysis based on spectro-temporal cues is shown to have the best agreement with the human evaluation, particularly with that of language teachers, and with previous plenary linguistic studies.Koniaris, C., Engwall, O., & Salvi, G. (2012). 
On the Benefit of Using Auditory Modeling for Diagnostic Evaluation of Pronunciations. In Inter. Symp. on Auto. Detect. Errors in Pronunc. Training (IS ADEPT), 2012 (pp. 59-64). Stockholm, Sweden. [abstract] [pdf]Abstract: In this paper we demonstrate that a psychoacoustic model based distance measure performs better than a speech signal distance measure in assessing the pronunciation of individual foreign speakers. The experiments show that the perceptual-based method performs not only quantitatively better than a speech spectrum-based method, but also qualitatively better, hence showing that auditory information is beneficial in the task of pronunciation error detection. We first present the general approach of the method, which uses the dissimilarity between the native perceptual domain and the non-native speech power spectrum domain. The problematic phonemes for a given non-native speaker are determined by the degree of disparity between the dissimilarity measure for the non-native and a group of native speakers. The two methods compared here are applied to different groups of non-native speakers of various language backgrounds and validated against a theoretical linguistic study.Laskowski, K. (2012). Exploiting Loudness Dynamics in Stochastic Models of Turn-Taking. In Proceedings of the 4th IEEE Workshop on Spoken Language Technology (SLT2012). Miami FL, USA. [abstract] [pdf]Abstract: Stochastic turn-taking models have traditionally been implemented as N-grams, which condition predictions on recent binary-valued speech/non-speech contours. The current work re-implements this function using feed-forward neural networks, capable of accepting binary- as well as continuous-valued features; performance is shown to asymptotically approach that of the N-gram baseline as model complexity increases. The conditioning context is then extended to leverage loudness contours. Experiments indicate that the additional sensitivity to loudness considerably decreases average cross entropy rates on unseen data, by 0.03 bits per framing interval of 100 ms. This reduction is shown to make loudness-sensitive conversants capable of better predictions, with attention memory requirements at least 5 times smaller and responsiveness latency at least 10 times shorter than the loudness-insensitive baseline.Laskowski, K., Heldner, M., & Edlund, J. (2012). On the dynamics of overlap in multi-party conversation. In Proc. of Interspeech 2012. Portland, Oregon, US. [abstract] [pdf]Abstract: Overlap, although short in duration, occurs frequently in multiparty conversation. We show that its duration is approximately log-normal, and inversely proportional to the number of simultaneously speaking parties. Using a simple model, we demonstrate that simultaneous talk tends to end simultaneously less frequently than it begins simultaneously, leading to an arrow of time in chronograms constructed from speech activity alone. The asymmetry is significant and discriminative. It appears to be due to dialog acts which do not carry propositional content, and those which are not brought to completion.Lopes, J., Eskenazi, M., & Trancoso, I. (2012). Incorporating ASR information in spoken dialog system confidence score. In Computational Processing of the Portuguese Language. Meena, R., Jokinen, K., & Wilcock, G. (2012). Integration of Gestures and Speech in Human-Robot Interaction. In Proceedings of 3rd International Conference on Cognitive Infocommunications (CogInfoCom 2012) (pp. 673-678). Kosice, Slovakia: IEEE. 
[abstract] [pdf]Abstract: We present an approach to enhance the interaction abilities of the Nao humanoid robot by extending its communicative behavior with non-verbal gestures (hand and head movements, and gaze following). A set of non-verbal gestures was identified that Nao could use for enhancing its presentation and turn-management capabilities in conversational interactions. We discuss our approach for modeling and synthesizing gestures on the Nao robot. A scheme for system evaluation that compares the values of users’ expectations and actual experiences is presented. We found that open arm gestures, head movements and gaze following could significantly enhance Nao’s ability to be expressive and appear lively, and to engage human users in conversational interactions.Meena, R., Skantze, G., & Gustafson, J. (2012). A Chunking Parser for Semantic Interpretation of Spoken Route Directions in Human-Robot Dialogue. In Proceedings of the 4th Swedish Language Technology Conference (SLTC 2012) (pp. 55-56). Lund, Sweden. [abstract] [pdf]Abstract: We present a novel application of the chunking parser for data-driven semantic interpretation of spoken route directions into route graphs that are useful for robot navigation. Various sets of features and machine learning algorithms were explored. The results indicate that our approach is robust to speech recognition errors, and could be easily used in other languages using simple features.Neiberg, D., & Gustafson, J. (2012). Towards letting machines humming in the right way - prosodic analysis of six functions of short feedback tokens in English. In Fonetik 2012. Göteborg, Sweden. [pdf]Neiberg, D. (2012). Modelling Paralinguistic Conversational Interaction: Towards social awareness in spoken human-machine dialogue. Doctoral dissertation, KTH School of Computer Science and Communication. [link]Neiberg, D., & Gustafson, J. (2012). Cues to perceived functions of acted and spontaneous feedback expressions. In The Interdisciplinary Workshop on Feedback Behaviors in Dialog. [abstract] [pdf]Abstract: We present a two-step study where the first part aims to determine the phonemic prior bias (conditioned on “ah”, “m-hm”, “m-m”, “n-hn”, “oh”, “okay”, “u-hu”, “yeah” and “yes”) in subjects’ perception of six feedback functions (acknowledgment, continuer, disagreement, surprise, enthusiasm and uncertainty). The results showed a clear phonemic prior bias for some tokens, e.g. “ah” and “oh” are commonly interpreted as surprise but “yeah” and “yes” less so. The second part aims to examine determinants to judged typicality, or graded structure, within the six functions of “okay”. Typicality was correlated to four determinants: prosodic central tendency within the function (CT); phonemic prior bias as an approximation to frequency instantiation (FI); the posterior, i.e. CT x FI; and judged Ideality (ID), i.e. similarity to ideals associated with the goals served by its function. The results tentatively suggest that acted expressions are more effectively communicated and that the functions of feedback to a greater extent constitute goal-based categories determined by ideals and to a lesser extent a taxonomy determined by CT and FI. However, it is possible to automatically predict typicality with a correlation of r = 0.52 via the posterior.Neiberg, D., & Gustafson, J. (2012). Exploring the implications for feedback of a neurocognitive theory of overlapped speech. In The Interdisciplinary Workshop on Feedback Behaviors in Dialog. 
[abstract] [pdf]Abstract: Neurocognitive evidence suggests that the cognitive load caused by decoding interlocutors' speech while oneself is talking is dependent on two factors: the type of incoming speech, i.e. non-lexical feedback, lexical feedback or non-feedback; and the duration of the speech segment. This predicts that the fraction of overlap should be high for non-lexical feedback, medium for lexical feedback and low for non-feedback, and that short segments have a higher fraction of overlapped speech than long segments. By normalizing for duration, it is indeed shown that the fraction of overlap is 32% for non-lexical feedback, 27% for lexical feedback and 12% for non-feedback. Investigating non-feedback tokens for the durational factor shows that the fraction of overlap can be modeled by linear regression on a logarithmic transform of duration, giving R² = 0.57 (p < 0.01 for F-test) and a slope b(2) = -0.04 (p < 0.01 for T-test). However, it is not enough to take duration into account when modeling overlap in feedback tokens.Niebuhr, O., & Zellers, M. (2012). Late pitch accents in hat and dip intonation patterns. In Niebuhr, O., & Pfitzinger, H. (Eds.), Prosodies: context, function, and communication (pp. 159-186). Berlin/New York: de Gruyter.Oertel, C., Wlodarczak, M., Edlund, J., Wagner, P., & Gustafson, J. (2012). Gaze Patterns in Turn-Taking. In Proc. of Interspeech 2012. Portland, Oregon, US. [abstract] [pdf]Abstract: This paper investigates gaze patterns in turn-taking. We focus on the difference between speaker changes resulting in gaps and overlaps. We also investigate gaze patterns around backchannels and around silences not involving speaker changes.Oertel, C., Cummins, F., Edlund, J., Wagner, P., & Campbell, N. (2012). D64: a corpus of richly recorded conversational interaction. Journal of Multimodal User Interfaces. [abstract]Abstract: In recent years there has been a substantial debate about the need for increasingly spontaneous, conversational corpora of spoken interaction that are not controlled or task directed. In parallel the need arises for the recording of multi-modal corpora which are not restricted to the audio domain alone. With a corpus that would fulfill both needs, it would be possible to investigate the natural coupling, not only in turn-taking and voice, but also in the movement of participants. In the following paper we describe the design and recording of such a corpus and we provide some illustrative examples of how such a corpus might be exploited in the study of dynamic interaction.Renklint, E., Cardell, F., Dahlbäck, J., Edlund, J., & Heldner, M. (2012). Conversational gaze in light and darkness. In Proc. of Fonetik 2012 (pp. 59-60). Gothenburg, Sweden. [abstract] [pdf]Abstract: The way we use our gaze in face-to-face interaction is an important part of our social behavior. This exploratory study investigates the relationship between mutual gaze and joint silences and overlaps, where speaker changes and backchannels often occur. Seven dyadic conversations between two persons were recorded in a studio. Gaze patterns were annotated in ELAN to find instances of mutual gaze. Part of the study was conducted in total darkness as a way to observe what happens to our gaze-patterns when we cannot see our interlocutor, although the physical face-to-face condition is upheld. The results show a difference in the frequency of mutual gaze in conversation in light and darkness.Saad, A., Beskow, J., & Kjellström, H. (2012). 
Visual Recognition of Isolated Swedish Sign Language Signs. http://arxiv.org/abs/1211.3901. [pdf]Salvi, G., Montesano, L., Bernardino, A., & Santos-Victor, J. (2012). Language bootstrapping: Learning word meanings from perception-action association. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 42(3), 660-671. [abstract] [pdf]Abstract: We address the problem of bootstrapping language acquisition for an artificial system similarly to what is observed in experiments with human infants. Our method works by associating meanings to words in manipulation tasks, as a robot interacts with objects and listens to verbal descriptions of the interactions. The model is based on an affordance network, i.e., a mapping between robot actions, robot perceptions and the perceived effects of these actions upon objects. We extend the affordance model to incorporate spoken words, which allows us to ground the verbal symbols to the execution of actions and the perception of the environment. The model takes verbal descriptions of a task as the input, and uses temporal co-occurrence to create links between speech utterances and the involved objects, actions and effects. We show that the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot’s own understanding of its actions. Thus, they can be directly used to instruct the robot to perform tasks and also allow us to incorporate context in the speech recognition task. We believe that the encouraging results with our approach may afford robots the capacity to acquire language descriptors in their operating environment, as well as shed some light on how this challenging process develops in human infants.Schötz, S., Bruce, G., Segerup, M., Beskow, J., Gustafson, J., & Granström, B. (2012). Regional Varieties of Swedish: Models and synthesis. In Niebuhr, O. (Ed.), Understanding Prosody (pp. 119 - 134). De Gruyter.Skantze, G. (2012). A Testbed for Examining the Timing of Feedback using a Map Task. In Proceedings of the Interdisciplinary Workshop on Feedback Behaviors in Dialog. Portland, OR. [abstract] [pdf]Abstract: In this paper, we present a fully automated spoken dialogue system that can perform the Map Task with a user. By implementing a trick, the system can convincingly act as an attentive listener, without any speech recognition. An initial study is presented where we let users interact with the system and recorded the interactions. Using this data, we have then trained a Support Vector Machine on the task of identifying appropriate locations to give feedback, based on automatically extractable prosodic and contextual features. 200 ms after the end of the user’s speech, the model may identify response locations with an accuracy of 75%, as compared to a baseline of 56.3%.Skantze, G., & Al Moubayed, S. (2012). IrisTK: a statechart-based toolkit for multi-party face-to-face interaction. In Proceedings of ICMI. Santa Monica, CA. [pdf]Skantze, G., Al Moubayed, S., Gustafson, J., Beskow, J., & Granström, B. (2012). Furhat at Robotville: A Robot Head Harvesting the Thoughts of the Public through Multi-party Dialogue. In Proceedings of IVA-RCVA. Santa Cruz, CA. [pdf]Skevofylax, A. (2012). A Computer Game for Pronunciation Training. Master's thesis, CSC. 
[abstract]Abstract: The concept to which this Master’s Thesis relates is the influence that the injection of fun has on the learning process of a second language. For this reason a game was developed and used in a study in an attempt to understand the importance of fun in pronunciation training. An investigation of the notion of fun and its use in learning methods was needed before the design and development of the game. The game was based on previous work on the provision of visual feedback in pronunciation training and ideas extracted from the famous Star Wars saga. It embodies a space fight where the player/learner needs to move a spaceship by pronouncing vowels and shoot the enemies by pronouncing small words. In the study conducted, nine people were involved in playing the game and answering a questionnaire. The results revealed, among other things, the interest of learners in educational games that focus on pronunciation training. The observed insufficiency in the time spent on pronunciation training could be amended by the use of games in the learning process.Strömbergsson, S. (2012). Identifying children’s dialect from isolated words. In Proc. of Fonetik 2012. Gothenburg, Sweden. [abstract] [pdf]Abstract: This report describes a pilot study investigating whether adult listeners are able to identify the dialect of child speakers from the recordings of isolated words. The recorded children are either speaking a Stockholm dialect or a Scanian dialect. Despite low agreement between the nine adult listeners, the results show that adult listeners can indeed identify dialect even in isolated short words, and that some features in the word make them more revealing of dialect than others. For example, dialect is identified correctly more often in bisyllabic words than in monosyllabic words, suggesting that intonation is a strong cue to dialect identification.Strömbergsson, S. (2012). Is that me? Self-voice identification in children with deviant speech. In Proc. of ICPLA 2012. Cork, Ireland. [abstract] [pdf]Abstract: In children with deviant speech, a discrepancy between internal and external discrimination is often observed. For example, a child who produces a target word like cat as [tat] might well be able to recognize the same error when someone else produces it (external discrimination), but still fail to perceive the error in his/her own speech (internal discrimination). One focus in speech and language therapy is therefore to strengthen the child’s self-monitoring skills, e.g. through the use of recordings (Hoffman & Norris, 2005). However, it is still not known if children with deviant speech indeed recognize their recorded voice as their own. And if they don’t, we could not expect any advantage from using the child’s own recordings in therapy, as it would merely be another form of external discrimination. The present study aims to explore 1) whether children with phonological impairment (PI) recognize their recorded voice as their own at the same level as children with typical speech, 2) whether the time interval between making a recording and identifying the recording as one’s own influences the children’s performance, and 3) whether the performance of children with PI is dependent on the phonological accuracy of their production. 
The ability to recognize the recorded voice as one’s own was explored in three groups of children: one group of children with PI (N=18) and two groups of children with typical speech and language development, 4-5 year-olds (N=25) and 7-8 year-olds (N=23). The children with PI all exhibited patterns of velar fronting in their speech production. A recording script of 24 words was used for all children, with half of the words beginning with /tV/. Thus, the children with PI were expected to produce half of the words in error. The task for the children was to identify which of four randomly presented child recordings of a word was their own recording. Self-voice identification was tested on two occasions for each child, the first immediately following the child’s production of the recording, and the second 1-2 weeks later. High average performance rates in all three groups suggest that children indeed recognize their recorded voice as their own, with no significant difference between the groups. A significant drop in performance from immediate playback to delayed playback was found. This drop was most pronounced in the older group of children with typical speech. The children with PI did not perform differently on stimuli produced in error and stimuli produced correctly, suggesting that they do not use their speech deviance as a cue to identifying their own voice. A clinical implication of these findings is that one can indeed expect an advantage of using recordings of the child’s own speech in therapeutic settings, as this would be closer to internal discrimination than external discrimination. We argue that the indications that children with PI do not use their speech deviance as a cue to distinguishing their own recording from those of other children reflect the difficulties they have in judging their own speech production accurately, and indicate this as an important focus of intervention.Strömbergsson, S. (2012). Synthetic correction of deviant speech – children’s perception of phonologically modified recordings of their own speech. In Proc. of Interspeech. Portland, Oregon, US. [abstract] [pdf]Abstract: This report describes preliminary data from a study of how children with phonological impairment (PI) perceive automatically corrected versions of their own deviant speech. The results from 8 children with PI are compared to results of a group of 20 children with typical speech and language (nPI). The results indicate group differences only in tasks where the children make judgments of their own recorded (original or modified) speech; here, the children in the nPI group perform significantly better than the children with PI. In tasks where the children judge the phonological accuracy of recordings of other children (original or modified), however, the two groups perform equally well. Furthermore, the results indicate that sub-phonemic modifications of recordings are too subtle for the children in both groups to detect. Technical and clinical implications of these findings are discussed.Strömbergsson, S., Edlund, J., & House, D. (2012). A study of Swedish questions and their prosodic characteristics. In Proc. of Workshop on Innovation and Applications in Speech Technology (IAST). Dublin, Ireland. [abstract] [pdf]Abstract: Studies of questions present strong evidence that there is no one-to-one relationship between intonation and interrogative mode. 
This is one of the fundamental observations that formed the VariQ project, a Swedish national project aiming for a deeper insight into how questions work and indeed what constitutes a question. Here, we present some intermediate project results and outline the way ahead. VariQ looks mainly at the Spontal corpus of conversational speech [1], but exploits other data sets to a limited extent for comparative purposes. We report a recent study, in which we selected 600 questions from the Spontal corpus and annotated these in a theory-independent manner. In a subsequent study we compared some prosodic characteristics of these questions with those of the speech used in seven Swedish spoken dialogue systems. The results reveal differences both in the distributions of question types and in prosodic characteristics of the questions in the two different settings.Strömbergsson, S., Edlund, J., & House, D. (2012). Prosodic measurements and question types in the Spontal corpus of Swedish dialogues. In Proc. of Interspeech 2012 (pp. 839-842). Portland, Oregon, US. [abstract] [pdf]Abstract: Studies of questions present strong evidence that there is no one-to-one relationship between intonation and interrogative mode. In this paper, we describe some aspects of prosodic variation in the Spontal corpus of 120 half-hour spontaneous dialogues in Swedish. The study is part of ongoing work aimed at extracting a database of 600 questions from the corpus, complete with categorization and prosodic descriptions. We report on coding and annotation of question typology and present results concerning some prosodic correlates related to question type for 455 of the questions. A prosodically salient distinction was found between the two categories termed, in our typology, forward- and backward-looking questions.Strömbergsson, S., Edlund, J., & House, D. (2012). Question types and some prosodic correlates in the Spontal corpus of Swedish dialogues. In Proc. of Fonetik 2012. Gothenburg, Sweden. [abstract] [pdf]Abstract: We describe some aspects of prosodic variation in the Spontal corpus of 120 half-hour spontaneous dialogues in Swedish. Coding and annotation of question typology is described and results are presented concerning prosodic correlates related to question type for well over 400 questions. A prosodically salient distinction was found between the two categories termed, in our typology, forward- and backward-looking questions.Strömbergsson, S., Edlund, J., & House, D. (2012). Questions and reported speech in Swedish dialogues. In Proc. of Nordic Prosody XI, Tartu (pp. 373-382). Frankfurt am Main: Peter Lang.Vanhainen, N., & Salvi, G. (2012). Word Discovery with Beta Process Factor Analysis. In Proceedings of Interspeech. Portland, Oregon. [abstract] [pdf]Abstract: We propose the application of a recently developed non-parametric Bayesian method for factor analysis to the problem of word discovery from continuous speech. The method, based on Beta Process priors, has a number of advantages compared to previously proposed methods, such as Non-negative Matrix Factorisation (NMF). Beta Process Factor Analysis (BPFA) is able to estimate the size of the basis, and therefore the number of recurring patterns, or word candidates, found in the data. We compare the results obtained with BPFA and NMF on the TIDigits database, showing that our method is capable of not only finding the correct words, but also the correct number of words. 
We also show that the method can infer the approximate number of words for different vocabulary sizes by testing on randomly generated sequences of words.Vanhainen, N. (2012). Discovering words from continuous speech - Two factor analysis methods. Master's thesis, KTH, School of Computer Science and Communication, Dept for Speech, Music and Hearing. [pdf]2011Abelin, Å. (2011). Imitation of bird song in folklore – onomatopoeia or not? TMH-QPSR, 51(1), 13-16. [pdf]Afsun, D., Forsman, E., Halvarsson, C., Jonsson, E., Malmgren, L., Neves, J., & Marklund, U. (2011). Effects of a film-based parental intervention on vocabulary development in toddlers aged 18-21 months. TMH-QPSR, 51(1), 105-108. [pdf]Al Moubayed, S., Alexanderson, S., Beskow, J., & Granström, B. (2011). A robotic head using projected animated faces. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (pp. 69). [pdf]Al Moubayed, S., Beskow, J., Edlund, J., Granström, B., & House, D. (2011). Animated Faces for Robotic Heads: Gaze and Beyond. In Esposito, A., Vinciarelli, A., Vicsi, K., Pelachaud, C., & Nijholt, A. (Eds.), Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues, Lecture Notes in Computer Science (pp. 19-35). Springer. [pdf]Al Moubayed, S., & Beskow, J. (2011). A novel Skype interface using SynFace for virtual speech reading support. TMH-QPSR, 51(1), 33-36. [pdf]Al Moubayed, S., & Skantze, G. (2011). Effects of 2D and 3D Displays on Turn-taking Behavior in Multiparty Human-Computer Dialog. In Proceedings of SemDial (pp. 192-193). Los Angeles, CA. [pdf]Al Moubayed, S., & Skantze, G. (2011). Turn-taking Control Using Gaze in Multiparty Human-Computer Dialogue: Effects of 2D and 3D Displays. In Proceedings of AVSP. Florence, Italy. [pdf]Ananthakrishnan, G. (2011). From Acoustics to Articulation. Doctoral dissertation, School of Computer Science and Communication. [pdf]Ananthakrishnan, G. (2011). Imitating Adult Speech: An Infant's Motivation. In 9th International Seminar on Speech Production (pp. 361-368). [abstract]Abstract: This paper tries to detail two aspects of speech acquisition by infants which are often assumed to be intrinsic or innate knowledge, namely the number of degrees of freedom in the articulatory parameters and the acoustic correlates that find the correspondence between adult speech and the speech produced by the infant. The paper shows that being able to distinguish the different vowels in the vowel space of a certain language is a strong motivation for choosing both a certain number of independent articulatory parameters and a certain scheme of acoustic normalization between adult and child speech.Ananthakrishnan, G., Eklund, R., Peters, G., & Mabiza, E. (2011). An acoustic analysis of lion roars. II: Vocal tract characteristics. TMH-QPSR, 51(1), 5-8. [pdf]Ananthakrishnan, G., & Engwall, O. (2011). Mapping between Acoustic and Articulatory Gestures. Speech Communication, 53(4), 567-589. [abstract] [link]Abstract: This paper proposes a definition for articulatory as well as acoustic gestures along with a method to segment the measured articulatory trajectories and acoustic waveforms into gestures. Using a simultaneously recorded acoustic-articulatory database, the gestures are detected based on finding critical points in the utterance, both in the acoustic and articulatory representations. The acoustic gestures are parameterized using 2-D cepstral coefficients. 
The articulatory trajectories are essentially the horizontal and vertical movements of Electromagnetic Articulography (EMA) coils placed on the tongue, jaw and lips along the midsagittal plane. The articulatory movements are parameterized with a 2D-DCT, using the same transformation that is applied to the acoustics. The relationship between the detected acoustic and articulatory gestures in terms of the timing as well as the shape is studied. In order to study this relationship further, acoustic-to-articulatory inversion is performed using GMM-based regression. The accuracy of predicting the articulatory trajectories from the acoustic waveforms is on par with state-of-the-art frame-based methods with dynamical constraints (with an average error of 1.45–1.55 mm for the two speakers in the database). In order to evaluate the acoustic-to-articulatory inversion in a more intuitive manner, a method based on the error in estimated critical points is suggested. Using this method, it was noted that the estimated articulatory trajectories using the acoustic-to-articulatory inversion methods were still not accurate enough to be within the perceptual tolerance of audio-visual asynchrony.Ananthakrishnan, G., & Engwall, O. (2011). Resolving Non-uniqueness in the Acoustic-to-Articulatory Mapping. In ICASSP (pp. 4628-4631). Prague, Czech Republic.Ananthakrishnan, G., & Salvi, G. (2011). Using Imitation to learn Infant-Adult Acoustic Mappings. In Proceedings of Interspeech (pp. 765-768). Florence, Italy. [abstract] [pdf]Abstract: This paper discusses a model which conceptually demonstrates how infants could learn the normalization between infant-adult acoustics. The model proposes that the mapping can be inferred from the topological correspondences between the adult and infant acoustic spaces that are clustered separately in an unsupervised manner. The model requires feedback from the adult in order to select the right topology for clustering, which is a crucial aspect of the model. The feedback is in terms of an overall rating of the imitation effort by the infant, rather than a frame-by-frame correspondence. Using synthetic, but continuous speech data, we demonstrate that clusters, which have a good topological correspondence, are perceived to be similar by a phonetically trained listener.Ananthakrishnan, G., Wik, P., & Engwall, O. (2011). Detecting confusable phoneme pairs for Swedish language learners depending on their first language. TMH-QPSR, 51(1), 89-92. [pdf]Ananthakrishnan, G., Wik, P., Engwall, O., & Abdou, S. (2011). Using an Ensemble of Classifiers for Mispronunciation Feedback. In Strik, H., Delmonte, R., & Russel, M. (Eds.), Proceedings of SLaTE. Venice, Italy.Andersson, I., Gauding, J., Graca, A., Holm, K., Öhlin, L., Marklund, U., & Ericsson, A. (2011). Productive vocabulary size development in children aged 18-24 months – gender differences. TMH-QPSR, 51(1), 109-112. [pdf]Beskow, J., Alexanderson, S., Al Moubayed, S., Edlund, J., & House, D. (2011). Kinetic Data for Large-Scale Analysis and Modeling of Face-to-Face Conversation. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (pp. 103-106). [pdf]Blomberg, M. (2011). Model space size scaling for speaker adaptation. In TMH-QPSR Vol. 51, Fonetik 2011 (pp. 77-80). Dept. of Speech, Music and Hearing and Centre for Speech Technology (CTT), KTH, Stockholm. [pdf]Borin, L., Brandt, M. D., Edlund, J., Lindh, J., & Parkvall, M. (2011). Languages in the European Information Society - Swedish. Technical Report. 
[pdf]Dahlby, M., Irmalm, L., Kytöharju, S., Wallander, L., Zachariassen, H., Ericsson, A., & Marklund, U. (2011). Parent-child interaction: Relationship between pause duration and infant vocabulary at 18 months. TMH-QPSR, 51(1), 101-104. [pdf]Edlund, J., Al Moubayed, S., & Beskow, J. (2011). The Mona Lisa Gaze Effect as an Objective Metric for Perceived Cospatiality. In Vilhjálmsson, H. H., Kopp, S., Marsella, S., & Thórisson, K. R. (Eds.), Proc. of the Intelligent Virtual Agents 10th International Conference (IVA 2011) (pp. 439-440). Reykjavík, Iceland: Springer. [abstract] [pdf]Abstract: We propose to utilize the Mona Lisa gaze effect for an objective and repeatable measure of the extent to which a viewer perceives an object as cospatial. Preliminary results suggest that the metric behaves as expected.Edlund, J. (2011). How deeply rooted are the turns we take? In Proc. of Los Angelogue (SemDial 2011). Los Angeles, CA, US. [abstract] [pdf]Abstract: This poster presents preliminary work investigating turn-taking in text-based chat with a view to learning something about how deeply rooted turn-taking is in human cognition. A connexion is shown between preferred turn-taking patterns and the length and type of experience with such chats, which supports the idea that the orderly type of turn-taking found in most spoken conversations is indeed deeply rooted, but not so deeply that it cannot be overcome with training in a situation where such turn-taking is not beneficial to the communication.Edlund, J. (2011). In search of the conversational homunculus - serving to understand spoken human face-to-face interaction. Doctoral dissertation, KTH. [abstract] [pdf]Abstract: In the group of people with whom I have worked most closely, we recently attempted to dress our visionary goal in words: “to learn enough about human face-to-face interaction that we are able to create an artificial conversational partner that is humanlike”. The “conversational homunculus” figuring in the title of this book represents this “artificial conversational partner”. The vision is motivated by an urge to test computationally our understandings of how human-human interaction functions, and the bulk of my work leads towards the conversational homunculus in one way or another. This book compiles and summarizes that work: it sets out by presenting and providing background and motivation for the long-term research goal of creating a humanlike spoken dialogue system, and continues along the lines of an initial iteration of an iterative research process towards that goal, beginning with the planning and collection of human-human interaction corpora, continuing with the analysis and modelling of the human-human corpora, and ending in the implementation of, experimentation with and evaluation of humanlike components for human-machine interaction. The studies presented have a clear focus on interactive phenomena at the expense of propositional content and syntactic constructs, and typically investigate the regulation of dialogue flow and feedback, or the establishment of mutual understanding and grounding.Eklund, R., Peters, G., Ananthakrishnan, G., & Mabiza, E. (2011). An acoustic analysis of lion roars. I: Data collection and spectrogram and waveform analyses. TMH-QPSR, 51(1), 1-4. [pdf]Engwall, O. (2011). Analysis of and feedback on phonetic features in pronunciation training with a virtual teacher. Computer Assisted Language Learning. 
[abstract] [link]Abstract: Pronunciation errors may be caused by several different deviations from the target, such as voicing, intonation, insertions or deletions of segments, or that the articulators are placed incorrectly. Computer-animated pronunciation teachers could potentially provide important assistance on correcting all these types of deviations, but they have an additional benefit for articulatory errors. By making parts of the face transparent, they can show the correct position and shape of the tongue and provide audiovisual feedback on how to change erroneous articulations. Such a scenario, however, requires firstly that the learner's current articulation can be estimated with precision and secondly that the learner is able to imitate the articulatory changes suggested in the audiovisual feedback. This article discusses both these aspects, with one experiment on estimating the important articulatory features from a speaker through acoustic-to-articulatory inversion and one user test with a virtual pronunciation teacher, in which the articulatory changes made by seven learners who receive audiovisual feedback are monitored using ultrasound imaging.Engwall, O. (2011). Augmented Reality Talking Heads as a Support for Speech Perception and Production. In Nee, A. (Ed.), Augmented Reality. InTech. [pdf]Frid, J., Schötz, S., & Löfqvist, A. (2011). Age-related lip movement repetition variability in two phrase positions. TMH-QPSR, 51(1), 21-24. [pdf]Gabrielsson, D., Kirchner, S., Nilsson, K., Norberg, A., & Widlund, C. (2011). Anticipatory lip rounding – a pilot study using The Wave Speech Research System. TMH-QPSR, 51(1), 37-40. [pdf]Gu, T-L. (2011). Why are Computer Games so Entertaining? – A study on whether motivational feedback elements from computer games can improve spoken language learning games. Master's thesis, CSC. [abstract] [pdf]Abstract: This master thesis is a part of a project conducted by the Centre of Speech Technology at the Department of Language, Speech and Music at the Royal Institute of Technology. The focus of this master thesis has been on how to make computerized language learning more entertaining and stimulating. Computer games in general can create an addictive and motivationally encouraging stimulation for players, and it is very interesting to understand what these elements are and whether they can be implemented in language learning games. The subjects for this project consisted of bachelor and master students aged 18-35, some of whom were members of a game committee at the School of Information and Communication Technology (ICT). In order to understand what makes normal games so addictive and what these motivational feedback attributes and elements are, we have to examine and explore what game traits, mechanisms and genres are considered appealing by the users. The project started by examining previous work within the fields of motivation, education and games. Afterwards, a qualitative survey based on previous theoretical works was created and sent to the committee and the targeted user group of interest. The results were later evaluated and analyzed, in order to extract the most appealing, important and motivationally encouraging feedback elements.Heldner, M. (2011). Detection thresholds for gaps, overlaps and no-gap-no-overlaps. Journal of the Acoustical Society of America, 130(1), 508-513. 
[abstract] [pdf]Abstract: Detection thresholds for gaps and overlaps, that is acoustic and perceived silences and stretches of overlapping speech in speaker changes, were determined. Subliminal gaps and overlaps were categorized as no-gap-no-overlaps. The established gap and overlap detection thresholds both corresponded to the duration of a long vowel, or about 120 ms. These detection thresholds are valuable for mapping the perceptual speaker change categories gaps, overlaps, and no-gap-no-overlaps into the acoustic domain. Furthermore, the detection thresholds allow generation and understanding of gaps, overlaps, and no-gap-no-overlaps in human-like spoken dialogue systems.Heldner, M., Edlund, J., Hjalmarsson, A., & Laskowski, K. (2011). Very short utterances and timing in turn-taking. In Proceedings of Interspeech 2011 (pp. 2837-2840). Florence, Italy. [abstract] [pdf]Abstract: This work explores the timing of very short utterances in conversations, as well as the effects of excluding intervals adjacent to such utterances from distributions of between-speaker interval durations. The results show that very short utterances are more precisely timed to the preceding utterance than longer utterances in terms of a smaller variance and a larger proportion of no-gap-no-overlaps. Excluding intervals adjacent to very short utterances furthermore results in measures of central tendency closer to zero (i.e. no-gap-no-overlaps) as well as larger variance (i.e. relatively longer gaps and overlaps).Hjalmarsson, A. (2011). The additive effect of turn-taking cues in human and synthetic voice. Speech Communication, 53(1), 23-35. [abstract] [003]Abstract: A previous line of research suggests that interlocutors identify appropriate places to speak by cues in the behaviour of the preceding speaker. If used in combination, these cues have an additive effect on listeners’ turn-taking attempts. The present study further explores these findings by examining the effect of such turn-taking cues experimentally. The objective is to investigate the possibilities of generating turn-taking cues with a synthetic voice. Thus, in addition to stimuli realized with a human voice, the experiment included dialogues where one of the speakers is replaced with a synthesis. The turn-taking cues investigated include intonation, phrase-final lengthening, semantic completeness, stereotyped lexical expressions and non-lexical speech production phenomena such as lexical repetitions, breathing and lip-smacks. The results show that the turn-taking cues realized with a synthetic voice affect the judgements similar to the corresponding human version and there is no difference in reaction times between these two conditions. Furthermore, the results support Duncan’s findings: the more turn-taking cues with the same pragmatic function, turn-yielding or turn-holding, the higher the agreement among subjects on the expected outcome. In addition, the number of turn-taking cues affects the reaction times for these decisions. Thus, the more cues, the faster the reaction time.Hjalmarsson, A., & Laskowski, K. (2011). Measuring final lengthening for speaker-change prediction. In Proceedings of Interspeech 2011 (pp. 2069-2072). Florence, Italy. [abstract] [pdf]Abstract: We explore pre-silence syllabic lengthening as a cue for next-speakership prediction in spontaneous dialogue. When estimated using a transcription-mediated procedure, lengthening is shown to reduce error rates by 25% relative to majority class guessing. 
This indicates that lengthening should be exploited by dialogue systems. With that in mind, we evaluate an automatic measure of spectral envelope change, Mel-spectral flux (MSF), and show that its performance is at least as good as that of the transcription-mediated measure. Modeling MSF is likely to improve turn uptake in dialogue systems, and to benefit other applications needing an estimate of durational variability in speech.House, D., & Strömbergsson, S. (2011). Self-voice identification in children with phonological impairment. In Proceedings of the ICPhS XVII (pp. 886-889). Hong Kong. [abstract] [pdf]Abstract: We report preliminary data from a study of self-voice identification in children with phonological impairment (PI), where results from 13 children with PI are compared to results from a group of children with typical speech. No difference between the two groups was found, suggesting that a phonological impairment does not affect children’s ability to recognize their recorded voices as their own. We conclude that children with PI indeed recognize their own recorded voice and that the use of recordings in therapy can be supported.Hu, G. (2011). Chinese perception coaching. TMH-QPSR, 51(1), 97-100. [pdf]Husby, O., Øvregaard, Å., Wik, P., Bech, Ø., Albertsen, E., Nefzaoui, S., Skarpnes, E., & Koreman, J. (2011). Dealing with L1 background and L2 dialects in Norwegian CAPT. In Proceedings of SLaTE. [pdf]Johansson, M., Skantze, G., & Gustafson, J. (2011). Understanding route directions in human-robot dialogue. In Proceedings of SemDial (pp. 19-27). Los Angeles, CA. [pdf]Johnson-Roberson, M., Bohg, J., Skantze, G., Gustafson, J., Carlson, R., Rasolzadeh, B., & Kragic, D. (2011). Enhanced Visual Scene Understanding through Human-Robot Dialog. In IEEE/RSJ International Conference on Intelligent Robots and Systems. [pdf]Jonsson, H., & Eklund, R. (2011). Gender differences in verbal behaviour in a call routing speech application. TMH-QPSR, 51(1), 81-84. [pdf]Kaiser, R. (2011). Do Germans produce and perceive the Swedish word accent contrast? A cross-language analysis. TMH-QPSR, 51(1), 93-96. [pdf]Karlsson, A., House, D., Svantesson, J-O., & Tayanin, D. (2011). Comparison of F0 range in spontaneous speech in Kammu tonal and non-tonal dialects. In Proceedings of the ICPhS XVII (pp. 1026-1029). Hong Kong. [abstract] [pdf]Abstract: The aim of this study is to investigate whether the occurrence of lexical tones in a language imposes restrictions on its pitch range. Kammu, a Mon-Khmer language spoken in Northern Laos, comprises dialects with and without lexical tones and with no other major phonological differences. We use Kammu spontaneous speech to investigate differences in pitch range in the two dialects. The main finding is that tonal speakers exhibit a narrower pitch range. Thus, even at a high degree of engagement found in spontaneous speech, lexical tones impose restrictions on speakers’ pitch variation. Keywords: pitch range, tone, timing, intonation, Kammu, KhmuKarlsson, A., Svantesson, J-O., House, D., & Tayanin, D. (2011). Tone restricts F0 range and variation in Kammu. TMH-QPSR, 51(1), 53-55. [abstract] [pdf]Abstract: The aim of this study is to investigate whether the occurrence of lexical tones in a language imposes restrictions on its pitch range. We use data from Kammu, a Mon-Khmer language spoken in Northern Laos, which has one dialect with, and one without, lexical tones. 
The main finding is that speakers of the tonal dialect have a narrower pitch range, and also a smaller variation in pitch range.Klintfors, E., Marklund, E., Kallioinen, P., & Lacerda, F. (2011). Cortical N400-Potentials Generated by Adults in Response to Semantic Incongruities. TMH-QPSR, 51(1), 121-124. [pdf]Koniaris, C., & Engwall, O. (2011). Perceptual Differentiation Modeling Explains Phoneme Mispronunciation by Non-Native Speakers. In 2011 IEEE Int. Conf. on Acoust., Speech, Sig. Proc. (ICASSP) (pp. 5704-5707). Prague, Czech Republic. [abstract] [link]Abstract: One of the difficulties in second language (L2) learning is the weakness in discriminating between acoustic diversity within an L2 phoneme category and between different categories. In this paper, we describe a general method to quantitatively measure the perceptual difference between a group of native and individual non-native speakers. Normally, this task includes subjective listening tests and/or a thorough linguistic study. We instead use a totally automated method based on a psycho-acoustic auditory model. For a certain phoneme class, we measure the similarity of the Euclidean space spanned by the power spectrum of a native speech signal and the Euclidean space spanned by the auditory model output. We do the same for a non-native speech signal. Comparing the two similarity measurements, we find problematic phonemes for a given speaker. To validate our method, we apply it to different groups of non-native speakers of various first language (L1) backgrounds. Our results are verified by the theoretical findings in the literature obtained from linguistic studies.Koniaris, C., & Engwall, O. (2011). Phoneme Level Non-Native Pronunciation Analysis by an Auditory Model-based Native Assessment Scheme. In Interspeech 2011 (pp. 1157-1160). Florence, Italy. [NOMINATED FOR BEST STUDENT PAPER AWARD]. [abstract] [html]Abstract: We introduce a general method for automatic diagnostic evaluation of the pronunciation of individual non-native speakers based on a model of the human auditory system trained with native data stimuli. For each phoneme class, the Euclidean geometry similarity between the native perceptual domain and the non-native speech power spectrum domain is measured. The problematic phonemes for a given second language speaker are found by comparing this measure to the Euclidean geometry similarity for the same phonemes produced by native speakers only. The method is applied to different groups of non-native speakers of various language backgrounds and the experimental results are in agreement with theoretical findings of linguistic studies.Koreman, J., Bech, Ø., Husby, O., & Wik, P. (2011). L1-L2map: a tool for multi-lingual contrastive analysis. In Proceedings of ICPhS. [pdf]Landsiedel, C., Edlund, J., Eyben, F., Neiberg, D., & Schuller, B. (2011). Syllabification of conversational speech using bidirectional long-short-term memory neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on (pp. 5256 - 5259). Prague, Czech Republic. [abstract] [pdf]Abstract: Segmentation of speech signals is a crucial task in many types of speech analysis. We present a novel approach to segmentation at the syllable level, using a Bidirectional Long-Short-Term Memory Neural Network. It performs estimation of syllable nucleus positions based on regression of perceptually motivated input features to a smooth target function. Peak selection is performed to attain valid nuclei positions. 
Performance of the model is evaluated on the levels of both syllables and the vowel segments making up the syllable nuclei. The general applicability of the approach is illustrated by good results for two common databases - Switchboard and TIMIT - for both read and spontaneous speech, and a favourable comparison with other published results.Laskowski, K. (2011). Predicting, detecting and explaining the occurrence of vocal activity in multi-party conversation. Doctoral dissertation, Carnegie Mellon University. [abstract] [pdf]Abstract: Understanding a conversation involves many challenges that understanding the speech in that conversation does not. An important source of this discrepancy is the form of the conversation, which emerges from tactical decisions that participants make in how, and precisely when, they choose to participate. An offline conversation understanding system, beyond understanding the spoken sequence of words, must be able to account for that form. In addition, an online system may need to evaluate its competing next-action alternatives, instant by instant, and to adjust its strategy based on the felicity of its past decisions. In circumscribed transient settings, understanding the spoken sequence of words may not be necessary for either type of system. This dissertation explores tactical conversational conduct. It adopts an existing laconic representation of conversational form known as the vocal interaction chronogram, which effectively elides a conversation’s text-dependent attributes, such as the words spoken, the language used, the sequence of topics, and any explicit or implicit goals. Chronograms are treated as Markov random fields, and probability density models are developed that characterize the joint behavior of participants. Importantly, these models are independent of a conversation’s duration and the number of its participants. They render overtly dissimilar conversations directly comparable, and enable the training of a single model of conversational form using a large number of dissimilar human-human conversations. The resulting statistical framework is shown to provide a computational counterpart to the qualitative field of conversation analysis, corroborating and elaborating on several sociolinguistic observations. It extends the quantitative treatment of two-party dialogue, as found in anthropology, social psychology, and telecommunications research, to the general multi-party setting. Experimental results indicate that the proposed computational theory benefits the detection and participant-attribution of speech activity. Furthermore, the theory is shown to enable the inference of illocutionary intent, emotional state, and social status, independently of linguistic analysis. Taken together, for conversations of arbitrary duration, participant number, and text-dependent attributes, the results demonstrate a degree of characterization and understanding of the nature of conversational interaction that has not been shown previously.Laskowski, K., Edlund, J., & Heldner, M. (2011). A single-port non-parametric model of turn-taking in multi-party conversation. In Proc. of ICASSP 2011 (pp. 5600-5603). Prague, Czech Republic. [abstract] [pdf]Abstract: The taking of turns to speak is an intrinsic property of conversation. It is therefore expected that models of turn-taking, providing a prior distribution over conversational form, can usefully reduce the perplexity of what is observed and processed in real-time spoken dialogue systems. 
We propose a conversation-independent single-port model of multi-party turn-taking, one which allows conversants to undertake independent actions but to condition them on the past behavior of all participants. The model is shown to generally outperform an existing multi-port model on a measure of perplexity over subsequently observed speech activity. We quantify the effect of history truncation and the success of predicting distant conversational futures, and argue that the framework is sufficiently accessible and has significant potential to usefully inform the design and behavior of spoken dialogue systems.Laskowski, K., Edlund, J., & Heldner, M. (2011). Incremental Learning and Forgetting in Stochastic Turn-Taking Models. In Proc. of Interspeech 2011 (pp. 2069-2072). Florence, Italy. [abstract] [pdf]Abstract: We present a computational framework for stochastically modeling dyad interaction chronograms. The framework’s most novel feature is the capacity for incremental learning and forgetting. To showcase its flexibility, we design experiments answering four concrete questions about the systematics of spoken interaction. The results show that: (1) individuals are clearly affected by one another; (2) there is individual variation in interaction strategy; (3) strategies wander in time rather than converge; and (4) individuals exhibit similarity with their interlocutors. We expect the proposed framework to be capable of answering many such questions with little additional effort.Laskowski, K., & Jin, Q. (2011). Harmonic structure transform for speaker recognition. In Proc. of Interspeech 2011 (pp. 365-368). Florence, Italy. [abstract] [pdf]Abstract: We evaluate a new filterbank structure, yielding the harmonic structure cepstral coefficients (HSCCs), on a mismatched-session closed-set speaker classification task. The novelty of the filterbank lies in its averaging of energy at frequencies related by harmonicity rather than by adjacency. Improvements are presented which achieve a 37%rel reduction in error rate under these conditions. The improved features are combined with a similar Mel-frequency cepstral coefficient (MFCC) system to yield error rate reductions of 32%rel, suggesting that HSCCs offer information which is complementary to that available to today’s MFCC-based systems.Laukka, P., Neiberg, D., Forsell, M., Karlsson, I., & Elenius, K. (2011). Expression of Affect in Spontaneous Speech: Acoustic Correlates and Automatic Detection of Irritation and Resignation. Computer Speech and Language, 25(1), 84-104. [link]Lindblom, B., & MacNeilage, P. (2011). Coarticulation: A universal phonetic phenomenon with roots in deep time. TMH-QPSR, 51(1), 41-44. [pdf]Lindblom, B., Sundberg, J., Branderud, P., & Djamshidpey, H. (2011). Articulatory modeling and front cavity acoustics. TMH-QPSR, 51(1), 17-20. [pdf]Lindblom, B., Diehl, R., Park, S-H., & Salvi, G. (2011). Sound systems are shaped by their users: The recombination of phonetic substance. In Clements, G. N., & Ridouane, R. (Eds.), Where Do Phonological Features Come From?: Cognitive, physical and developmental bases of distinctive speech categories. CNRS & Sorbonne-Nouvelle. [abstract] [pdf]Abstract: Computational experiments were run using an optimization criterion based on independently motivated definitions of perceptual contrast, articulatory cost and learning cost. 
The question: If stop+vowel inventories are seen as adaptations to perceptual, articulatory and developmental constraints, what would they be like? Simulations successfully predicted typologically widely observed place preferences and the re-use of place features ('phonemic coding') in voiced stop inventories. These results demonstrate the feasibility of user-based accounts of phonological facts and indicate the nature of the constraints that over time might shape the formation of both the formal structure and the intrinsic content of sound patterns. While phonetic factors are commonly invoked to account for substantive aspects of phonology, their explanatory scope is here also extended to a fundamental attribute of its formal organization: the combinatorial re-use of phonetic content. Keywords: phonological universals; phonetic systems; formal structure; intrinsic content; behavioral origins; substance-based explanationLundholm Fors, K. (2011). An investigation of intra-turn pauses in spontaneous speech. TMH-QPSR, 51(1), 65-68. [pdf]Neiberg, D. (2011). Visualizing prosodic densities and contours: Forming one from many. TMH-QPSR, 51(1), 57-60. [abstract] [pdf]Abstract: This paper summarizes a flora of explorative visualization techniques for prosody developed at KTH. It is demonstrated how analyses can be made that go beyond conventional methodology. Examples are given for turn taking, affective speech, response tokens and Swedish accent II.Neiberg, D., Ananthakrishnan, G., & Gustafson, J. (2011). Tracking pitch contours using minimum jerk trajectories. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. [abstract] [pdf]Abstract: This paper proposes a fundamental frequency tracker, with the specific purpose of comparing the automatic estimates with pitch contours that are sketched by trained phoneticians. The method uses a frequency domain approach to estimate pitch tracks that form minimum jerk trajectories. This method tries to mimic motor movements of the hand made while sketching. When the fundamental frequencies tracked by the proposed method on the oral and laryngograph signals were compared using the MOCHA-TIMIT database, the correlation was 0.98 and the root mean squared error was 4.0 Hz, which was slightly better than a state-of-the-art pitch tracking algorithm included in the ESPS. We also demonstrate how the proposed algorithm could be applied when comparing with sketches made by phoneticians of the variations in accent II among the Swedish dialects.Neiberg, D., & Gustafson, J. (2011). A Dual Channel Coupled Decoder for Fillers and Feedback. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. [abstract] [pdf]Abstract: This study presents a dual channel decoder capable of modeling cross-speaker dependencies for segmentation and classification of fillers and feedbacks in conversational speech found in the DEAL corpus. For the same number of Gaussians per state, we have shown improvement in terms of average F-score for the successive addition of 1) an increased frame rate from 10 ms to 50 ms, 2) Joint Maximum Cross-Correlation (JMXC) features in a single channel decoder, 3) a joint transition matrix which captures dependencies symmetrically across the two channels, and 4) coupled acoustic model retraining symmetrically across the two channels. 
The final step gives a relative improvement of over 100% for fillers and feedbacks compared to our previously published results. The F-scores are in a range that makes it possible to use the decoder as both a voice activity detector and an illocutionary act decoder for semi-automatic annotation.Neiberg, D., & Gustafson, J. (2011). Predicting Speaker Changes and Listener Responses With And Without Eye-contact. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association. Florence, Italy. [abstract] [pdf]Abstract: This paper compares turn-taking in terms of timing and prediction in human-human conversations under the conditions when participants have eye-contact versus when there is no eye-contact, as found in the HCRC Map Task corpus. By measuring between-speaker intervals it was found that a larger proportion of speaker shifts occurred in overlap for the no eye-contact condition. For prediction we used prosodic and spectral features parametrized by time-varying length-invariant discrete cosine coefficients. With Gaussian Mixture Modeling and variations of classifier fusion schemes, we explored the task of predicting whether there is an upcoming speaker change (SC) or not (HOLD), at the end of an utterance (EOU) with a pause lag of 200 ms. The label SC was further split into LRs (listener responses, e.g. back-channels) and other TURN-SHIFTs. The prediction was found to be somewhat easier for the eye-contact condition, for which the average recall rates were 60.57%, 66.35%, and 62.00% for TURN-SHIFTs, LRs and SCs, respectively.Neiberg, D., Laukka, P., & Elfenbein, H. A. (2011). Intra-, Inter-, and Cross-cultural Classification of Vocal Affect. In INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association (pp. 1581-1584). Florence, Italy. [abstract] [pdf]Abstract: We present intra-, inter- and cross-cultural classifications of vocal expressions. Stimuli were selected from the VENEC corpus and consisted of portrayals of 11 emotions, each expressed with 3 levels of intensity. Classification (nu-SVM) was based on acoustic measures related to pitch, intensity, formants, voice source and duration. Results showed that mean recall across emotions was around 2.4-3 times higher than chance level for both intra- and inter-cultural conditions. For cross-cultural conditions, the relative performance dropped 26%, 32%, and 34% for high, medium, and low emotion intensity, respectively. This suggests that intra-cultural models were more sensitive to mismatched conditions for low emotion intensity. Preliminary results further indicated that recall rate varied as a function of emotion, with lust and sadness showing the smallest performance drops in the cross-cultural condition.Neiberg, D., & Truong, K. (2011). Online Detection Of Vocal Listener Responses With Maximum Latency Constraints. In Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on (pp. 5836-5839). Prague, Czech Republic. [abstract] [pdf]Abstract: When human listeners utter Listener Responses (e.g. back-channels or acknowledgments) such as ‘yeah’ and ‘mmhmm’, interlocutors commonly continue to speak or resume their speech even before the listener has finished his/her response. This type of speech interactivity results in frequent speech overlap which is common in human-human conversation.
To allow for this type of speech interactivity to occur between humans and spoken dialog systems, which will result in more human-like, continuous and smoother human-machine interaction, we propose an on-line classifier which can classify incoming speech as Listener Responses. We show that it is possible to detect vocal Listener Responses using maximum latency thresholds of 100-500 ms, thereby obtaining equal error rates ranging from 34% to 28% by using an energy-based voice activity detector.Reidsma, D., de Kok, I., Neiberg, D., Pammi, S., van Straalen, B., Truong, K., & van Welbergen, H. (2011). Continuous Interaction with a Virtual Human. Journal on Multimodal User Interfaces, 4(2), 97-118. [link]Roll, M., Söderström, P., & Horne, M. (2011). Phonetic markedness, turning points, and anticipatory attention. TMH-QPSR, 51(1), 113-116. [pdf]Salvi, G., & Al Moubayed, S. (2011). Spoken Language Identification using Frame Based Entropy Measures. TMH-QPSR, 51(1), 69-72. [pdf]Salvi, G., Tesser, F., Zovato, E., & Cosi, P. (2011). Analisi Gerarchica degli Inviluppi Spettrali Differenziali di una Voce Emotiva. In Contesto comunicativo e variabilità nella produzione e percezione della lingua (AISV). Lecce, Italy.Schlangen, D., & Skantze, G. (2011). A General, Abstract Model of Incremental Dialogue Processing. Dialogue & Discourse, 2(1), 83-111. [pdf]Schötz, S., & Eklund, R. (2011). A comparative acoustic analysis of purring in four cats. TMH-QPSR, 51(1), 9-12. [pdf]Schötz, S., Frid, J., & Löfqvist, A. (2011). Exotic vowels in Swedish: a project description and an articulographic and acoustic pilot study of /i:/. TMH-QPSR, 51(1), 25-28. [pdf]Stenberg, M. (2011). Phonetic transcriptions as a public service. TMH-QPSR, 51(1), 45-48. [pdf]Strömbergsson, S. (2011). Children’s perception of their modified speech – preliminary findings. TMH-QPSR, 51(1), 117-120. [abstract] [pdf]Abstract: This report describes an ongoing investigation of 4-6 year-old children’s perception of synthetically modified versions of their own recorded speech. Recordings of the children’s speech production are automatically modified so that the initial consonant is replaced by a different consonant. The task for the children is to judge the phonological accuracy (correct vs. incorrect) and the speaker identity (me vs. someone else) of each stimulus. Preliminary results indicate that children with typical speech generally judge phonological accuracy correctly, of both original recordings and synthetically modified recordings. As a first evaluation of the re-synthesis technique with child users, the results are promising, as the children generally accept the intended phonological form, seemingly without detecting the modification.Strömbergsson, S. (2011). Children’s recognition of their own voice: influence of phonological impairment. In Proceedings of Interspeech 2011 (pp. 2205-2208). Florence, Italy. [abstract] [pdf]Abstract: This study explores the ability to identify the recorded voice as one’s own, in three groups of children: one group of children with phonological impairment (PI) and two groups of children with typical speech and language development, 4-5 year-olds and 7-8 year-olds. High average performance rates in all three groups suggest that these children indeed recognize their recorded voice as their own, with no significant difference between the groups. Signs indicating that children with deviant speech use their speech deviance as a cue to identifying their own voice are discussed.Strömbergsson, S. (2011).
Corrective re-synthesis of deviant speech using unit selection. In Sandford Pedersen, B., Nespore, G., & Skadina, I. (Eds.), NODALIDA 2011 Conference Proceedings (pp. 214-217). Riga, Latvia. [abstract] [pdf]Abstract: This report describes a novel approach to modified re-synthesis, by concatenation of speech from different speakers. The system removes an initial voiceless plosive from one utterance, recorded from a child, and replaces it with another voiceless plosive selected from a database of recordings of other child speakers. Preliminary results from a listener evaluation are reported.Strömbergsson, S. (2011). Segmental re-synthesis of child speech using unit selection. In Proceedings of the ICPhS XVII (pp. 1910-1913). Hong Kong. [abstract] [pdf]Abstract: This report describes a novel approach to segmental re-synthesis of child speech, by concatenation of speech from different speakers. The re-synthesis builds upon standard methods of unit selection, but instead of using speech from only one speaker, target segments are selected from a speech database of many child speakers. Results from a listener evaluation suggest that the method can be used to generate intelligible speech that is difficult to distinguish from original recordings.Suomi, K., Meister, E., & Ylitalo, R. (2011). Non-contrastive durational patterns in two quantity languages. TMH-QPSR, 51(1), 61-64.Uneson, M., & Schachtenhaufen, R. (2011). Exploring phonetic realization in Danish by Transformation-Based Learning. TMH-QPSR, 51(1), 73-76. [pdf]Wik, P. (2011). The Virtual Language Teacher: Models and applications for language learning using embodied conversational agents. Doctoral dissertation, KTH School of Computer Science and Communication. [abstract] [pdf]Abstract: This thesis presents a framework for computer assisted language learning using a virtual language teacher. It is an attempt at creating, not only a new type of language learning software, but also a server-based application that collects large amounts of speech material for future research purposes. The motivation for the framework is to create a research platform for computer assisted language learning, and computer assisted pronunciation training. Within the thesis, different feedback strategies and pronunciation error detectors are explored. This is a broad, interdisciplinary approach, combining research from a number of scientific disciplines, such as speech technology, game studies, cognitive science, phonetics, phonology, and second-language acquisition and teaching methodologies. The thesis discusses the paradigm both from a top-down point of view, where a number of functionally separate but interacting units are presented as part of a proposed architecture, and bottom-up by demonstrating and testing an implementation of the framework.Wik, P., Husby, O., Øvregaard, Å., Bech, Ø., Albertsen, E., Nefzaoui, S., Skarpnes, E., & Koreman, J. (2011). Contrastive analysis through L1-L2map. TMH-QPSR, 51(1), 49-52. [pdf]Zetterholm, E., & Tronnier, M. (2011). Teaching pronunciation in Swedish as a second language. TMH-QPSR, 51(1), 85-88. [pdf]Öhrström, N., Arppe, H., Eklund, L., Eriksson, S., Marcus, D., Mathiassen, T., & Pettersson, L. (2011). Audiovisual integration in binaural, monaural and dichotic listening. TMH-QPSR, 51(1), 29-32. [pdf]2010Al Moubayed, S., & Ananthakrishnan, G. (2010). Acoustic-to-Articulatory Inversion based on Local Regression. In Proc. Interspeech. Makuhari, Japan.
[abstract] [pdf]Abstract: This paper presents an Acoustic-to-Articulatory inversion method based on local regression. Two types of local regression, a non-parametric and a local linear regression, have been applied to a corpus containing simultaneous recordings of positions of articulators and the corresponding acoustics. A maximum likelihood trajectory smoothing using the estimated dynamics of the articulators is also applied on the regression estimates. The average root mean square error in estimating articulatory positions, given the acoustics, is 1.56 mm for the non-parametric regression and 1.52 mm for the local linear regression. The local linear regression is found to perform significantly better than regression using Gaussian Mixture Models using the same acoustic and articulatory features.Al Moubayed, S., Ananthakrishnan, G., & Enflo, L. (2010). Automatic Prominence Classification in Swedish. In Proceedings of Speech Prosody 2010, Workshop on Prosodic Prominence. Chicago, USA. [abstract] [pdf]Abstract: This study aims at automatically classifying levels of acoustic prominence on a dataset of 200 Swedish sentences of read speech by one male native speaker. Each word in the sentences was categorized by four speech experts into one of three groups depending on the level of prominence perceived. Six acoustic features at a syllable level and seven features at a word level were used. Two machine learning algorithms, namely Support Vector Machines (SVM) and Memory-Based Learning (MBL), were trained to classify the sentences into their respective classes. The MBL gave an average word level accuracy of 69.08% and the SVM gave an average accuracy of 65.17% on the test set. These values were comparable with the average accuracy of the human annotators with respect to the average annotations. In this study, word duration was found to be the most important feature required for classifying prominence in Swedish read speech.Al Moubayed, S., & Beskow, J. (2010). Prominence Detection in Swedish Using Syllable Correlates. In Interspeech'10. Makuhari, Japan. [pdf]Al Moubayed, S., & Beskow, J. (2010). Perception of Nonverbal Gestures of Prominence in Visual Speech Animation. In FAA, The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation. Edinburgh, UK. [pdf]Al Moubayed, S., Beskow, J., & Granström, B. (2010). Auditory-Visual Prominence: From Intelligibility to Behavior. Journal on Multimodal User Interfaces, 3(4), 299-311. [abstract] [pdf]Abstract: Auditory prominence is defined as when an acoustic segment is made salient in its context. Prominence is one of the prosodic functions that has been shown to be strongly correlated with facial movements. In this work, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted: speech quality is acoustically degraded and the fundamental frequency is removed from the signal, and the speech is then presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raise gestures, which are synchronized with the auditory prominence. The experiment shows that presenting prominence as facial gestures significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a follow-up study examining the perception of the behavior of the talking heads when gestures are added over pitch accents.
Using eye-gaze tracking technology and questionnaires on 10 moderately hearing-impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch accents as opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and the understanding of the talking head.Al Moubayed, S., Beskow, J., Granström, B., & House, D. (2010). Audio-Visual Prosody: Perception, Detection, and Synthesis of Prominence. In Esposito, A. e. a. (Ed.), Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues (pp. 55 - 71). Springer. [abstract] [pdf]Abstract: In this chapter, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study a speech intelligibility experiment is conducted, where speech quality is acoustically degraded, and the speech is then presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raising gestures. The experiment shows that perceiving visual prominence as gestures, synchronized with the auditory prominence, significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a study examining the perception of the behavior of the talking heads when gestures are added at pitch movements. Using eye-gaze tracking technology and questionnaires for 10 moderately hearing-impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch movements as opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and helpfulness of the talking head.Ananthakrishnan, G., Badin, P., Vargas, J. A. V., & Engwall, O. (2010). Predicting Unseen Articulations from Multi-speaker Articulatory Models. In Proc. Interspeech. Makuhari, Japan. [abstract] [pdf]Abstract: In order to study inter-speaker variability, this work aims to assess the generalization capabilities of data-based multi-speaker articulatory models. We use various three-mode factor analysis techniques to model the variations of midsagittal vocal tract contours obtained from MRI images for three French speakers articulating 73 vowels and consonants. Articulations of a given speaker for phonemes not present in the training set are then predicted by inversion of the models from measurements of these phonemes articulated by the other subjects. On average, the prediction RMSE was 5.25 mm for tongue contours and 3.3 mm for 2D midsagittal vocal tract distances. In addition, this study has established a methodology to determine the optimal number of factors for such models.Beskow, J., & Al Moubayed, S. (2010). Perception of Gaze Direction in 2D and 3D Facial Projections. In The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation. Edinburgh, UK. [pdf]Beskow, J., Edlund, J., Granström, B., Gustafson, J., & House, D. (2010). Face-to-face interaction and the KTH Cooking Show. In Esposito, A., Campbell, N., Vogel, C., Hussain, A., & Nijholt, A. (Eds.), Development of Multimodal Interfaces: Active Listening and Synchrony (pp. 157 - 168). Berlin / Heidelberg: Springer.
[pdf]Beskow, J., Edlund, J., Gustafson, J., Heldner, M., Hjalmarsson, A., & House, D. (2010). Modelling humanlike conversational behaviour. In Proceedings of SLTC 2010. Linköping, Sweden. [pdf]Beskow, J., Edlund, J., Gustafson, J., Heldner, M., Hjalmarsson, A., & House, D. (2010). Research focus: Interactional aspects of spoken face-to-face communication. In Proc. of Fonetik 2010 (pp. 7-10). Lund, Sweden. [abstract] [pdf]Abstract: We have a visionary goal: to learn enough about human face-to-face interaction that we are able to create an artificial conversational partner that is human-like. We take the opportunity here to present four new projects inaugurated in 2010, each adding pieces of the puzzle through a shared research focus: interactional aspects of spoken face-to-face communication.Beskow, J., & Granström, B. (2010). Goda utsikter för teckenspråksteknologi. In Domeij, R., Breivik, T., Halskov, J., Kirchmeier-Andersen, S., Langgård, P., & Moshagen, S. (Eds.), Språkteknologi för ökad tillgänglighet - Rapport från ett nordiskt seminarium (pp. 77-86). [pdf]Beskow, J., & Granström, B. (2010). Teckenspråkteknologi - sammanfattning av förstudie. Technical Report, KTH Centrum för Talteknologi. [pdf]Branigan, H., Pickering, M., Pearson, J., & McLean, J. (2010). Linguistic alignment between people and computers. Journal of Pragmatics.Edlund, J., Gustafson, J., & Beskow, J. (2010). Cocktail – a demonstration of massively multi-component audio environments for illustration and analysis. In Proc. of SLTC 2010. Linköping, Sweden. [abstract] [pdf]Abstract: We present MMAE – Massively Multi-component Audio Environments – a new concept in auditory presentation, and Cocktail – a demonstrator built on this technology. MMAE creates a dynamic audio environment by playing a large number of sound clips simultaneously at different locations in a virtual 3D space. The technique utilizes standard soundboards and is based on the Snack Sound Toolkit. The result is an efficient 3D audio environment that can be modified dynamically, in real time. Applications range from the creation of canned as well as online audio environments for games and entertainment to the browsing, analyzing and comparing of large quantities of audio data. We also demonstrate the Cocktail implementation of MMAE using several test cases as examples.Edlund, J., & Beskow, J. (2010). Capturing massively multimodal dialogues: affordable synchronization and visualization. In Kipp, M., Martin, J-C., Paggio, P., & Heylen, D. (Eds.), Proc. of Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (MMC 2010) (pp. 160 - 161). Valetta, Malta. [abstract] [pdf]Abstract: In this demo, we show (a) affordable and relatively easy-to-implement means to facilitate synchronization of audio, video and motion capture data in post processing, and (b) a flexible tool for 3D visualization of recorded motion capture data aligned with audio and video sequences. The synchronisation is made possible by the use of two simple, analogue devices: a turntable and an easy-to-build electronic clapper board. The demo shows examples of how the signals from the turntable and the clapper board are traced over the three modalities, using the 3D visualisation tool. We also demonstrate how the visualisation tool shows head and torso movements captured by the motion capture system.Edlund, J., Beskow, J., Elenius, K., Hellmer, K., Strömbergsson, S., & House, D. (2010). Spontal: a Swedish spontaneous dialogue corpus of audio, video and motion capture.
In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proc. of the Seventh conference on International Language Resources and Evaluation (LREC'10) (pp. 2992 - 2995). Valetta, Malta. [abstract] [pdf]Abstract: We present the Spontal database of spontaneous Swedish dialogues. 120 dialogues of at least 30 minutes each have been captured in high-quality audio, high-resolution video and with a motion capture system. The corpus is currently being processed and annotated, and will be made available for research at the end of the project.Edlund, J., & Gustafson, J. (2010). Ask the experts - Part II: Analysis. In Juel Henrichsen, P. (Ed.), Linguistic Theory and Raw Sound (pp. 183-198). Samfundslitteratur. [abstract] [pdf]Abstract: We present work fuelled by an urge to understand speech in its original and most fundamental context: in conversation between people. And what better way than to look to the experts? Regarding human conversation, authority lies with the speakers themselves, and asking the experts is a matter of observing and analyzing what speakers do. This is the second part of a two-part discussion which is illustrated with examples mainly from the work at KTH Speech, Music and Hearing. In this part, we discuss methods of extracting useful information from captured data, with a special focus on raw sound.Edlund, J., Heldner, M., Al Moubayed, S., Gravano, A., & Hirschberg, J. (2010). Very short utterances in conversation. In Proc. of Fonetik 2010 (pp. 11-16). Lund, Sweden. [abstract] [pdf]Abstract: Faced with the difficulties of finding an operationalized definition of backchannels, we have previously proposed an intermediate, auxiliary unit – the very short utterance (VSU) – which is defined operationally and is automatically extractable from recorded or ongoing dialogues. Here, we extend that work in the following ways: (1) we test the extent to which the VSU/NONVSU distinction corresponds to backchannels/non-backchannels in a different data set that is manually annotated for backchannels – the Columbia Games Corpus; (2) we examine the extent to which VSUs capture other short utterances with a vocabulary similar to backchannels; (3) we propose a VSU method for better managing turn-taking and barge-ins in spoken dialogue systems based on detection of backchannels; and (4) we attempt to detect backchannels with better precision by training a backchannel classifier using durations and inter-speaker relative loudness differences as features. The results show that VSUs indeed capture a large proportion of backchannels – large enough that VSUs can be used to improve spoken dialogue system turn-taking; and that building a reliable backchannel classifier working in real time is feasible.Elenius, D., & Blomberg, M. (2010). Dynamic vocal tract length normalization in speech recognition. In Working Papers 54, Proceedings from Fonetik 2010 (pp. 29-34). Centre for Languages and Literature, Lund University, Sweden. [pdf]Engwall, O. (2010). Is there a McGurk effect for tongue reading? In Proceedings of AVSP. Hakone, Japan. [pdf]Gustafson, J., & Neiberg, D. (2010). Directing conversation using the prosody of mm and mhm. In Proceedings of SLTC 2010. Linköping, Sweden.Gustafson, J., & Neiberg, D. (2010). Prosodic cues to engagement in non-lexical response tokens in Swedish. In Proceedings of DiSS-LPSS Joint Workshop 2010. Tokyo, Japan. [pdf]Gustafson, J., & Edlund, J. (2010). Ask the experts - Part I: Elicitation. In Juel Henrichsen, P.
(Ed.), Linguistic Theory and Raw Sound (pp. 169-182). Samfundslitteratur. [abstract] [pdf]Abstract: We present work fuelled by an urge to understand speech in its original and most fundamental context: in conversation between people. And what better way than to look to the experts? Regarding human conversation, authority lies with the speakers themselves, and asking the experts is a matter of observing and analyzing what speakers do. This is the first part of a two-part discussion which is illustrated with examples mainly from the work at KTH Speech, Music and Hearing. In this part, we discuss methods of capturing what people do when they talk to each other.Heldner, M., Edlund, J., & Hirschberg, J. (2010). Pitch similarity in the vicinity of backchannels. In Proc. of Interspeech 2010 (pp. 3054-3057). Makuhari, Japan. [pdf]Heldner, M., & Edlund, J. (2010). Pauses, gaps and overlaps in conversations. Journal of Phonetics, 38, 555-568. [abstract] [pdf]Abstract: This paper explores durational aspects of pauses, gaps and overlaps in three different conversational corpora with a view to challenge claims about precision timing in turn-taking. Distributions of pause, gap and overlap durations in conversations are presented, and methodological issues regarding the statistical treatment of such distributions are discussed. The results are related to published minimal response times for spoken utterances and thresholds for detection of acoustic silences in speech. It is shown that turn-taking is generally less precise than is often claimed by researchers in the field of conversation analysis or interactional linguistics. These results are discussed in the light of their implications for models of timing in turn-taking, and for interaction control models in speech technology. In particular, it is argued that the proportion of speaker changes that could potentially be triggered by information immediately preceding the speaker change is large enough for reactive interaction control models to be viable in speech technology.Hirschberg, J., Hjalmarsson, A., & Elhadad, N. (2010). "You're as Sick as You Sound": Using Computational Approaches for Modeling Speaker State to Gauge Illness and Recovery. In Neustein, A. (Ed.), Mobile Environments, Call Centers and Clinics (pp. 305-322). Springer. [abstract]Abstract: Recently, researchers in computer science and engineering have begun to explore the possibility of finding speech-based correlates of various medical conditions using automatic, computational methods. If such language cues can be identified and quantified automatically, this information can be used to support diagnosis and treatment of medical conditions in clinical settings and to further fundamental research in understanding cognition. This chapter reviews computational approaches that explore communicative patterns of patients who suffer from medical conditions such as depression, autism spectrum disorders, schizophrenia, and cancer. There are two main approaches discussed: research that explores features extracted from the acoustic signal and research that focuses on lexical and semantic features. We also present some applied research that uses computational methods to develop assistive technologies. In the final sections we discuss issues related to, and the future of, this emerging field of research.Hjalmarsson, A. (2010). Human interaction as a model for spoken dialogue system behaviour. Doctoral dissertation.
[abstract] [pdf]Abstract: This thesis is a step towards the long-term and high-reaching objective of building dialogue systems whose behaviour is similar to a human dialogue partner. The aim is not to build a machine with the same conversational skills as a human being, but rather to build a machine that is human enough to encourage users to interact with it accordingly. The behaviours in focus are cue phrases, hesitations and turn-taking cues. These behaviours serve several important communicative functions such as providing feedback and managing turn-taking. Thus, if dialogue systems could use interactional cues similar to those of humans, these systems could be more intuitive to talk to. A major part of this work has been to collect, identify and analyze the target behaviours in human-human interaction in order to gain a better understanding of these phenomena. Another part has been to reproduce these behaviours in a dialogue system context and explore listeners’ perceptions of these phenomena in empirical experiments. The thesis is divided into two parts. The first part serves as an overall background. The issues and motivations of humanlike dialogue systems are discussed. This part also includes an overview of research on human language production and spoken language generation in dialogue systems. The next part presents the data collections, data analyses and empirical experiments that this thesis is concerned with. The first study presented is a listening test that explores human behaviour as a model for dialogue systems. The results show that a version based on human behaviour is rated as more humanlike, polite and intelligent than a constrained version with less variability. Next, the DEAL dialogue system is introduced. DEAL is used as a platform for the research presented in this thesis. The domain of the system is a trade domain and the target audience is second language learners of Swedish who want to practice conversation. Furthermore, a data collection of human-human dialogues in the DEAL domain is presented. Analyses of cue phrases in these data are provided as well as an experimental study of turn-taking cues. The results from the turn-taking experiment indicate that turn-taking cues realized with a diphone synthesis affect the expectations of a turn change similarly to the corresponding human version. Finally, an experimental study that explores the use of talkspurt-initial cue phrases in an incremental version of DEAL is presented. The results show that the incremental version had shorter response times and was rated as more efficient, more polite and better at indicating when to speak than a non-incremental implementation of the same system.Hjalmarsson, A. (2010). The vocal intensity of turn-initial cue phrases and filled pauses in dialogue. In Proceedings of SIGdial (pp. 225-228). Tokyo, Japan. [abstract] [pdf]Abstract: The present study explores the vocal intensity of turn-initial cue phrases in a corpus of dialogues in Swedish. Cue phrases convey relatively little propositional content, but have several important pragmatic functions. The majority of these entities are frequently occurring monosyllabic words such as “eh”, “mm”, “ja”. Prosodic analysis shows that these words are produced with higher intensity than other turn-initial words are. In light of these results, it is suggested that speakers produce these expressions with high intensity in order to claim the floor.
It is further shown that the difference in intensity can be measured as a dynamic inter-speaker relation over the course of a dialogue using the end of the interlocutor’s previous turn as a reference point.Horne, M., House, D., Svantesson, J-O., & Touati, P. (2010). Gösta Bruce 1947-2010 In Memoriam. Phonetica, 67(4), 268-270.Johnson-Roberson, M., Bohg, J., Kragic, D., Skantze, G., Gustafson, J., & Carlson, R. (2010). Enhanced Visual Scene Understanding through Human-Robot Dialog. In Proceedings of AAAI 2010 Fall Symposium: Dialog with Robots. Arlington, VA. [pdf]Karlsson, A., House, D., Svantesson, J-O., & Tayanin, D. (2010). Influence of lexical tones on intonation in Kammu. In Interspeech 2010 (pp. 1740-1743). Makuhari, Japan. [abstract] [pdf]Abstract: The aim of this study is to investigate how the presence of lexical tones influences the realization of focal accent and sentence intonation. The language studied is Kammu, a language particularly well suited for the study as it has both tonal and non-tonal dialects. The main finding is that lexical tone exerts an influence on both sentence and focal accent in the tonal dialect to such a strong degree that we can postulate a hierarchy where lexical tone is strongest followed by sentence accent, with focal accent exerting the weakest influence on the F0 contour.Laskowski, K. (2010). Modeling norms of turn-taking in multi-party conversation. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL2010) (pp. 999-1008). Uppsala, Sweden. [pdf]Laskowski, K., & Edlund, J. (2010). A Snack implementation and Tcl/Tk interface to the fundamental frequency variation spectrum algorithm. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proc. of the Seventh conference on International Language Resources and Evaluation (LREC'10) (pp. 3742 - 3749). Valetta, Malta. [abstract] [pdf]Abstract: Intonation is an important aspect of vocal production, used for a variety of communicative needs. Its modeling is therefore crucial in many speech understanding systems, particularly those requiring inference of speaker intent in real-time. However, the estimation of pitch, traditionally the first step in intonation modeling, is computationally inconvenient in such scenarios. This is because it is often, and most optimally, achieved only after speech segmentation and recognition. A consequence is that earlier speech processing components, in today’s state-of-the-art systems, lack intonation awareness by fiat; it is not known to what extent this circumscribes their performance. In the current work, we present a freely available implementation of an alternative to pitch estimation, namely the computation of the fundamental frequency variation (FFV) spectrum, which can be easily employed at any level within a speech processing system. It is our hope that the implementation we describe aids in the understanding of this novel acoustic feature space, and that it facilitates its inclusion, as desired, in the front-end routines of speech recognition, dialog act recognition, and speaker recognition systems.Laskowski, K., Heldner, M., & Edlund, J. (2010). Preliminaries to an account of multi-party conversational turn-taking as an antiferromagnetic spin glass. In Proc. of NIPS Workshop on Modeling Human Communication Dynamics (pp. 46-49). Vancouver, B.C., Canada.
[abstract] [pdf]Abstract: We present empirical justification of why logistic regression may acceptably approximate, using the number of currently vocalizing interlocutors, the probabilities returned by a time-invariant, conditionally independent model of turn-taking. The resulting parametric model with 3 degrees of freedom is shown to be identical to an infinite-range Ising antiferromagnet, with slow connections, in an external field; it is suitable for undifferentiated-participant scenarios. In differentiated-participant scenarios, untying parameters results in an infinite-range spin glass whose degrees of freedom scale as the square of the number of participants; it offers an efficient representation of participant-pair synchrony. We discuss the implications of model parametrization and of the thermodynamic and feed-forward perceptron formalisms for easily quantifying aspects of conversational dynamics.Neiberg, D., & Gustafson, J. (2010). Modeling Conversational Interaction Using Coupled Markov Chains. In Proceedings of DiSS-LPSS Joint Workshop 2010. Tokyo, Japan. [pdf]Neiberg, D., & Gustafson, J. (2010). Prosodic Characterization and Automatic Classification of Conversational Grunts in Swedish. In Fonetik 2010. Lund.Neiberg, D., & Gustafson, J. (2010). The Prosody of Swedish Conversational Grunts. In INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association (pp. 2562-2565). Makuhari, Chiba, Japan. [pdf]Neiberg, D., Laukka, P., & Ananthakrishnan, G. (2010). Classification of Affective Speech using Normalized Time-Frequency Cepstra. In Fifth International Conference on Speech Prosody (Speech Prosody 2010). Chicago, Illinois, U.S.A. [abstract] [pdf]Abstract: Subtle temporal and spectral differences between categorical realizations of para-linguistic phenomena (e.g. affective vocal expressions) are hard to capture and describe. In this paper we present a signal representation based on Time Varying Constant-Q Cepstral Coefficients (TVCQCC) derived for this purpose. A method which utilizes the special properties of the constant Q-transform for mean F0 estimation and normalization is described. The coefficients are invariant to utterance length, and as a special case, a representation for prosody is considered. Speaker independent classification results using nu-SVM on the Berlin EMO-DB and on two closed sets of basic (anger, disgust, fear, happiness, sadness, neutral) and social/interpersonal (affection, pride, shame) emotions recorded by forty professional actors from two English dialect areas are reported. The accuracy for the Berlin EMO-DB is 71.2%, the accuracy for the first set, including basic emotions, was 44.6%, and for the second set, including basic and social emotions, the accuracy was 31.7%. It was found that F0 normalization boosts the performance and a combined feature set shows the best performance.Neiberg, D., & Truong, K. P. (2010). A Maximum Latency Classifier for Listener Responses. In Proceedings of SLTC 2010. Linköping, Sweden. [abstract]Abstract: When Listener Responses such as “yeah”, “right” or “mhm” are uttered in a face-to-face conversation, it is not uncommon for the interlocutor to continue to speak in overlap, i.e. before the Listener becomes silent. We propose a classifier which can classify incoming speech as a Listener Response or not before the talk-spurt ends.
The classifier is implemented as an upgrade of the Embodied Conversational Agent developed in the SEMAINE project during the eNTERFACE 2010 workshop.Oertel, C., Cummins, F., Campbell, N., Edlund, J., & Wagner, P. (2010). D64: A corpus of richly recorded conversational interaction. In Kipp, M., Martin, J-C., Paggio, P., & Heylen, D. (Eds.), Proceedings of LREC 2010 Workshop on Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (pp. 27 - 30). Valetta, Malta. [abstract] [pdf]Abstract: Rich non-intrusive recording of a naturalistic conversation was conducted in a domestic setting. Four (sometimes five) participants engaged in lively conversation over two 4-hour sessions on two successive days. Conversation was not directed, and ranged widely over topics both trivial and technical. The entire conversation, on both days, was richly recorded using 7 video cameras, 10 audio microphones, and the registration of 3-D head, torso and arm motion using an Optitrack system. To add liveliness to the conversation, several bottles of wine were consumed during the final two hours of recording. The resulting corpus will be of immediate interest to all researchers interested in studying naturalistic, ethologically situated, conversational interaction.Picard, S. (2010). A framework for automatic error detection of Swedish vowels based on audiovisual data. Master's thesis, KTH. [pdf]Picard, S., Ananthakrishnan, G., Wik, P., Engwall, O., & Abdou, S. (2010). Detection of Specific Mispronunciations using Audiovisual Features. In International Conference on Auditory-Visual Speech Processing. Kanagawa, Japan. [abstract] [pdf]Abstract: This paper introduces a general approach for binary classification of audiovisual data. The intended application is mispronunciation detection for specific phonemic errors, using very sparse training data. The system uses a Support Vector Machine (SVM) classifier with features obtained from a Time Varying Discrete Cosine Transform (TV-DCT) on the audio log-spectrum as well as on the image sequences. The concatenated feature vectors from both the modalities were reduced to a very small subset using a combination of feature selection methods. We achieved 95-100% correct classification for each pair-wise classifier on a database of Swedish vowels with an average of 58 instances per vowel for training. The performance was largely unaffected when tested on data from a speaker who was not included in the training.Salvi, G., Tesser, F., Zovato, E., & Cosi, P. (2010). Cluster Analysis of Differential Spectral Envelopes on Emotional Speech. In Proceedings of Interspeech (pp. 322-325). Makuhari, Japan. [abstract] [pdf]Abstract: This paper reports on the analysis of the spectral variation of emotional speech. Spectral envelopes of time aligned speech frames are compared between emotionally neutral and active utterances. Statistics are computed over the resulting differential spectral envelopes for each phoneme. Finally, these statistics are classified using agglomerative hierarchical clustering and a measure of dissimilarity between statistical distributions, and the resulting clusters are analysed.
The results show that there are systematic changes in spectral envelopes when going from neutral to sad or happy speech, and those changes depend on the valence of the emotional content (negative, positive) as well as on the phonetic properties of the sounds such as voicing and place of articulation.Schlangen, D., Baumann, T., Buschmeier, H., Buss, O., Kopp, S., Skantze, G., & Yaghoubzadeh, R. (2010). Middleware for Incremental Processing in Conversational Agents. In Proceedings of SigDial. Tokyo, Japan. [pdf]Schötz, S., Beskow, J., Bruce, G., Granström, B., & Gustafson, J. (2010). Simulating Intonation in Regional Varieties of Swedish. In Speech Prosody 2010. Chicago, USA. [pdf]Schötz, S., Beskow, J., Bruce, G., Granström, B., Gustafson, J., & Segerup, M. (2010). Simulating Intonation in Regional Varieties of Swedish. In Fonetik 2010. Lund, Sweden. [pdf]Sikveland, R-O., Öttl, A., Amdal, I., Ernestus, M., Svendsen, T., & Edlund, J. (2010). Spontal-N: A Corpus of Interactional Spoken Norwegian. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proc. of the Seventh conference on International Language Resources and Evaluation (LREC'10) (pp. 2986 - 2991). Valetta, Malta. [abstract] [pdf]Abstract: Spontal-N is a corpus of spontaneous, interactional Norwegian. To our knowledge, it is the first corpus of Norwegian in which the majority of speakers have spent significant parts of their lives in Sweden, and in which the recorded speech displays varying degrees of interference from Swedish. The corpus consists of studio quality audio- and video-recordings of four 30-minute free conversations between acquaintances, and a manual orthographic transcription of the entire material. On the basis of the orthographic transcriptions, we automatically annotated approximately 50 percent of the material on the phoneme level, by means of a forced alignment between the acoustic signal and pronunciations listed in a dictionary. Approximately seven percent of the automatic transcription was manually corrected. Taking the manual correction as a gold standard, we evaluated several sources of pronunciation variants for the automatic transcription. Spontal-N is intended as a general purpose speech resource that is also suitable for investigating phonetic detail.Skantze, G. (2010). Jindigo: a Java-based Framework for Incremental Dialogue Systems. Technical Report, KTH, Stockholm, Sweden. [pdf]Skantze, G., & Hjalmarsson, A. (2010). Towards Incremental Speech Generation in Dialogue Systems. In Proceedings of SIGdial (pp. 1-8). Tokyo, Japan. [abstract] [pdf]Abstract: We present a first step towards a model of speech generation for incremental dialogue systems. The model allows a dialogue system to incrementally interpret spoken input, while simultaneously planning, realising and self-monitoring the system response. The model has been implemented in a general dialogue system framework. Using this framework, we have implemented a specific application and tested it in a Wizard-of-Oz setting, comparing it with a non-incremental version of the same system. The results show that the incremental version, while producing longer utterances, has a shorter response time and is perceived as more efficient by the users.Wik, P., & Granström, B. (2010). Simicry - A mimicry-feedback loop for second language learning. In Proceedings of Second Language Studies: Acquisition, Learning, Education and Technology. Waseda University, Tokyo, Japan.
[abstract] [pdf]Abstract: This paper introduces the concept of Simicry, defined as similarity of mimicry, for the purpose of second language acquisition. We apply this method to foreign students learning Swedish, using a computer-assisted language learning system called Ville. The system deploys acoustic similarity measures between native and non-native pronunciation, derived from duration, syllabicity and pitch. The system uses these measures to give pronunciation feedback in a mimicry-feedback loop exercise which has two variants: a ‘say after me’ mimicry exercise, and a ‘shadow with me’ exercise. The answers to questionnaires filled out by students after several training sessions, spread over a month, show that the learning and practicing procedure has promising potential, being both useful and fun.2009Al Moubayed, S. (2009). Prosodic Disambiguation in Spoken Systems Output. In Proceedings of Diaholmia'09. Stockholm, Sweden. [abstract] [pdf]Abstract: This paper presents work on using prosody in the output of spoken dialogue systems to resolve possible structural ambiguity of output utterances. An algorithm is proposed to discover ambiguous parses of an utterance and to add prosodic disambiguation events to deliver the intended structure. By conducting a pilot experiment, the automatic prosodic grouping applied to ambiguous sentences shows the ability to deliver the intended interpretation of the sentences.Al Moubayed, S., & Beskow, J. (2009). Effects of Visual Prominence Cues on Speech Intelligibility. In Proceedings of Auditory-Visual Speech Processing AVSP'09. Norwich, England. [abstract] [pdf]Abstract: This study reports experimental results on the effect of visual prominence, presented as gestures, on speech intelligibility. 30 acoustically vocoded sentences, permuted into different gestural conditions, were presented audio-visually to 12 subjects. The analysis of correct word recognition shows a significant increase in intelligibility when focally-accented (prominent) words are supplemented with head-nods or with eyebrow-raise gestures. The paper also examines coupling other acoustic phenomena to brow-raise gestures. As a result, the paper introduces new evidence on the ability of the non-verbal movements in the visual modality to support audio-visual speech perception.Al Moubayed, S., Beskow, J., Öster, A-M., Salvi, G., Granström, B., van Son, N., Ormel, E., & Herzke, T. (2009). Studies on Using the SynFace Talking Head for the Hearing Impaired. In Proceedings of Fonetik'09. Dept. of Linguistics, Stockholm University, Sweden. [abstract] [pdf]Abstract: SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper we present the large-scale hearing-impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and the effort scaling when using SynFace by hearing-impaired people, where groups of hearing-impaired subjects with impairment levels ranging from mild to severe, as well as cochlear implant users, are tested. Preliminary analysis of the results does not show a significant gain in SRT or in effort scaling, but given the large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace, especially for speech in stereo babble.Al Moubayed, S., Beskow, J., Öster, A., Salvi, G., Granström, B., van Son, N., & Ormel, E. (2009).
Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-media Setting. In Proceedings of Interspeech 2009. Al Moubayed, S., Chetouani, M., Baklouti, M., Dutoit, T., Mahdhaoui, A., Martin, J-C., Ondas, S., Pelachaud, C., Urbain, J., & Yilmaz, M. (2009). Generating Robot/Agent Backchannels During a Storytelling Experiment. In Proceedings of (ICRA'09) IEEE International Conference on Robotics and Automation. Kobe, Japan.Ananthakrishnan, G., & Neiberg, D. (2009). Cross-modal Clustering in the Acoustic-Articulatory Space. In Fonetik 2009. Stockholm. [abstract] [pdf]Abstract: This paper explores cross-modal clustering in the acoustic-articulatory space. A method to improve clustering using information from more than one modality is presented. Formants and the Electromagnetic Articulography measurements are used to study corresponding clusters formed in the two modalities. A measure for estimating the uncertainty in correspondences between one cluster in the acoustic space and several clusters in the articulatory space is suggested.Ananthakrishnan, G., Neiberg, D., & Engwall, O. (2009). In search of Non-uniqueness in the Acoustic-to-Articulatory Mapping. In INTERSPEECH 2009 - 10th Annual Conference of the International Speech Communication Association (pp. 2799-2802). Brighton, UK. [abstract] [pdf]Abstract: This paper explores the possibility and extent of non-uniqueness in the acoustic-to-articulatory inversion of speech, from a statistical point of view. It proposes a technique to estimate the non-uniqueness, based on finding peaks in the conditional probability function of the articulatory space. The paper corroborates the existence of non-uniqueness in a statistical sense, especially in stop consonants, nasals and fricatives. The relationship between the importance of the articulator position and non-uniqueness at each instance is also explored.Andréasson, M., Borin, L., Forsberg, M., Beskow, J., Carlson, R., Edlund, J., Elenius, K., Hellmer, K., House, D., Merkel, M., Forsbom, E., Megyesi, B., Eriksson, A., & Strömqvist, S. (2009). Swedish CLARIN activities. In Domeij, R., Koskenniemi, K., Krauwer, S., Maegaard, B., Rögnvaldsson, E., & de Smedt, K. (Eds.), Proc. of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources (pp. 1-5). Northern European Association for Language Technology. [abstract] [pdf]Abstract: Although Sweden has yet to allocate funds specifically intended for CLARIN activities, there are some ongoing activities which are directly relevant to CLARIN, and which are explicitly linked to CLARIN. These activities have been funded by the Committee for Research Infrastructures and its subcommittee DISC (Database Infrastructure Committee) of the Swedish Research Council.Beskow, J., Edlund, J., Granström, B., Gustafson, J., Skantze, G., & Tobiasson, H. (2009). The MonAMI Reminder: a spoken dialogue system for face-to-face interaction. In Interspeech 2009. Brighton, U.K. [abstract] [pdf]Abstract: We describe the MonAMI Reminder, a multimodal spoken dialogue system which can assist elderly and disabled people in organising and initiating their daily activities. Based on deep interviews with potential users, we have designed a calendar and reminder application which uses an innovative mix of an embodied conversational agent, digital pen and paper, and the web to meet the needs of those users as well as the current constraints of speech technology.
We also explore the use of head pose tracking for interaction and attention control in human-computer face-to-face interaction.Beskow, J., & Gustafson, J. (2009). Experiments with Synthesis of Swedish Dialects. In Proceedings of Fonetik 2009. Beskow, J., Salvi, G., & Al Moubayed, S. (2009). SynFace - Verbal and Non-verbal Face Animation from Audio. In Proceedings of The International Conference on Auditory-Visual Speech Processing AVSP'09. Norwich, England. [abstract] [pdf]Abstract: We give an overview of SynFace, a speech-driven face animation system originally developed for the needs of hard-of-hearing users of the telephone. For the 2009 LIPS challenge, SynFace includes not only articulatory motion but also non-verbal motion of gaze, eyebrows and head, triggered by detection of acoustic correlates of prominence and cues for interaction control. In perceptual evaluations, both verbal and non-verbal movements have been found to have a positive impact on word recognition scores.Beskow, J., Carlson, R., Edlund, J., Granström, B., Heldner, M., Hjalmarsson, A., & Skantze, G. (2009). Multimodal Interaction Control. In Waibel, A., & Stiefelhagen, R. (Eds.), Computers in the Human Interaction Loop (pp. 143-158). Berlin/Heidelberg: Springer. [pdf]Beskow, J., Edlund, J., Elenius, K., Hellmer, K., House, D., & Strömbergsson, S. (2009). Project presentation: Spontal – multimodal database of spontaneous dialog. In Fonetik 2009 (pp. 190-193). Stockholm. [abstract] [pdf]Abstract: We describe the ongoing Swedish speech database project Spontal: Multimodal database of spontaneous speech in dialog (VR 2006-7482). The project takes as its point of departure the fact that both vocal signals and gesture involving the face and body are important in everyday, face-to-face communicative interaction, and that there is a great need for data with which we can more precisely measure these.Bisitz, T., Herzke, T., Zokoll, M., Öster, A-M., Al Moubayed, S., Granström, B., Ormel, E., Van Son, N., & Tanke, R. (2009). Noise Reduction for Media Streams. In NAG/DAGA'09 International Conference on Acoustics. Rotterdam, Netherlands.Blomberg, M., & Elenius, D. (2009). Estimating speaker characteristics for speech recognition. In Proceedings of Fonetik 2009. Dept. of Linguistics, Stockholm University. [pdf]Blomberg, M., & Elenius, D. (2009). Tree-based Estimation of Speaker Characteristics for Speech Recognition. In Proceedings of Interspeech 2009. [abstract]Abstract: Speaker adaptation by means of adjustment of speaker characteristic properties, such as vocal tract length, has the important advantage compared to conventional adaptation techniques that the adapted models are guaranteed to be realistic if the description of the properties is. One problem with this approach is that the search procedure to estimate them is computationally heavy. We address the problem by using a multi-dimensional, hierarchical tree of acoustic model sets. The leaf sets are created by transforming a conventionally trained model set using leaf-specific speaker profile vectors. The model sets of non-leaf nodes are formed by merging the models of their child nodes, using a computationally efficient algorithm. During recognition, a maximum likelihood criterion is followed to traverse the tree. Studies of one- (VTLN) and four-dimensional speaker profile vectors (VTLN, two spectral slope parameters and model variance scaling) exhibit a reduction of the computational load to a fraction compared to that of an exhaustive search.
In recognition experiments on children’s connected digits using adult and male models, the one-dimensional tree search performed as well as the exhaustive search. Further reduction was achieved with four dimensions. The best recognition results are 0.93% and 10.2% WER in TIDIGITS and PF-Star-Sw, respectively, using adult models.Blomberg, M., Elenius, K., House, D., & Karlsson, I. (2009). Research Challenges in Speech Technology: A Special Issue in Honour of Rolf Carlson and Bjorn Granstrom. Speech Communication, 51(7), 563.Boves, L., Carlson, R., Hinrichs, E., House, D., Krauwer, S., Lemnitzer, L., Vainio, M., & Wittenburg, P. (2009). Resources for Speech Research: Present and Future Infrastructure Needs. In Interspeech (pp. 1803-1806). Brighton, UK. [abstract] [pdf]Abstract: This paper introduces the EU-FP7 project CLARIN, a joint effort of over 150 institutions in Europe, aimed at the creation of a sustainable language resources and technology infrastructure for the humanities and social sciences research community. The paper briefly introduces the vision behind the project and how it relates to speech research with a focus on the contributions that CLARIN can and will make to research in spoken language processing.Carlson, R., & Gustafson, K. (2009). Exploring Data Driven Parametric Synthesis. In Fonetik 2009. Stockholm, Sweden. [pdf]Carlson, R., & Hirschberg, J. (2009). Cross-Cultural Perception of Discourse Phenomena. In Interspeech (pp. 1723-1726). Brighton, UK. [pdf]Edlund, J., & Beskow, J. (2009). MushyPeek - a framework for online investigation of audiovisual dialogue phenomena. Language and Speech, 52(2-3), 351-367. [abstract]Abstract: Evaluation of methods and techniques for conversational and multimodal spoken dialogue systems is complex, as is gathering data for the modeling and tuning of such techniques. This article describes MushyPeek, an experiment framework that allows us to manipulate the audiovisual behavior of interlocutors in a setting similar to face-to-face human-human dialogue. The setup connects two subjects to each other over a Voice over Internet Protocol (VoIP) telephone connection and simultaneously provides each of them with an avatar representing the other. We present a first experiment which inaugurates, exemplifies, and validates the framework. The experiment corroborates earlier findings on the use of gaze and head pose gestures in turn-taking.Edlund, J., Heldner, M., & Hirschberg, J. (2009). Pause and gap length in face-to-face interaction. In Proc. of Interspeech 2009. Brighton, UK. [abstract] [pdf]Abstract: It has long been noted that conversational partners tend to exhibit increasingly similar pitch, intensity, and timing behavior over the course of a conversation. However, the metrics developed to measure this similarity to date have generally failed to capture the dynamic temporal aspects of this process. In this paper, we propose new approaches to measuring interlocutor similarity in spoken dialogue. We define similarity in terms of convergence and synchrony and propose approaches to capture these, illustrating our techniques on gap and pause production in Swedish spontaneous dialogues.Edlund, J., Heldner, M., & Pelcé, A. (2009). Prosodic features of very short utterances in dialogue. In Vainio, M., Aulanko, R., & Aaltonen, O. (Eds.), Nordic Prosody - Proceedings of the Xth Conference (pp. 57 - 68). Frankfurt am Main: Peter Lang. [pdf]Elenius, D., & Blomberg, M. (2009). On Extending VTLN to Phoneme-specific Warping in Automatic Speech Recognition.
In Proceedings of Fonetik 2009. Dept. of Linguistics, Stockholm University. [pdf]Engwall, O., & Wik, P. (2009). Are real tongue movements easier to speech read than synthesized?. In Proceedings of Interspeech. [pdf]Engwall, O., & Wik, P. (2009). Can you tell if tongue movements are real or synthetic?. In Proceedings of AVSP. [pdf]Engwall, O., & Wik, P. (2009). Real vs. rule-generated tongue movements as an audio-visual speech perception support. In Proceedings of Fonetik 2009. Gustafson, J., & Merkes, M. (2009). Eliciting interactional phenomena in human-human dialogues. In Proceedings of SigDial 2009.. [pdf]Hagström, P. (2009). Predicting visual intelligibility gain in the Synface Application. Master's thesis, KTH. [pdf]Heldner, M., Edlund, J., Laskowski, K., & Pelcé, A. (2009). Prosodic features in the vicinity of pauses, gaps and overlaps. In Vainio, M., Aulanko, R., & Aaltonen, O. (Eds.), Nordic Prosody - Proceedings of the Xth Conference (pp. 95 - 106). Frankfurt am Main: Peter Lang. [abstract] [pdf]Abstract: In this study, we describe the range of prosodic variation observed in two types of dialogue contexts, using fully automatic methods. The first type of context is that of speaker-changes, or transitions from only one participant speaking to only the other, involving either acoustic silences or acoustic overlaps. The second type of context is comprised of mutual silences or overlaps where a speaker change could in principle occur but does not. For lack of a better term, we will refer to these contexts as non-speaker-changes. More specifically, we investigate F0 patterns in the intervals immediately preceding overlaps and silences – in order to assess whether prosody before overlaps or silences may invite or inhibit speaker change.Hincks, R., & Edlund, J. (2009). Using speech technology to promote increased pitch variation in oral presentations. In Proc. of SLaTE Workshop on Speech and Language Technology in Education. Wroxall, UK. [abstract] [pdf]Abstract: This paper reports on an experimental study comparing two groups of seven Chinese students of English who practiced oral presentations with computer feedback. Both groups imitated teacher models and could listen to recordings of their own production. The test group was also shown flashing lights that responded to the standard deviation of the fundamental frequency over the previous two seconds. The speech of the test group increased significantly more in pitch variation than the control group. These positive results suggest that this novel type of feedback could be used in training systems for speakers who have a tendency to speak in a monotone when making oral presentations.Hincks, R., & Edlund, J. (2009). Promoting increased pitch variation in oral presentations with transient visual feedback. Language Learning & Technology, 13(3), 32-50. [abstract] [pdf]Abstract: This paper investigates learner response to a novel kind of intonation feedback generated from speech analysis. Instead of displays of pitch curves, our feedback is flashing lights that show how much pitch variation the speaker has produced. The variable used to generate the feedback is the standard deviation of fundamental frequency as measured in semitones. Flat speech causes the system to show yellow lights, while more expressive speech that has used pitch to give focus to any part of an utterance generates green lights. Participants in the study were 14 Chinese students of English at intermediate and advanced levels. 
A group that received visual feedback was compared with a group that received audio feedback. Pitch variation was measured at four stages: in a baseline oral presentation; for the first and second halves of three hours of training; and finally in the production of a new oral presentation. Both groups increased their pitch variation with training, and the effect lasted after the training had ended. The test group showed a significantly higher increase than the control group, indicating that the feedback is effective. These positive results imply that the feedback could be beneficially used in a system for practicing oral presentations.Hincks, R., & Edlund, J. (2009). Transient visual feedback on pitch variation for Chinese speakers of English. In Proc. of Fonetik 2009. Stockholm. [abstract] [pdf]Abstract: This paper reports on an experimental study comparing two groups of seven Chinese students of English who practiced oral presentations with computer feedback. Both groups imitated teacher models and could listen to recordings of their own production. The test group was also shown flashing lights that responded to the standard deviation of the fundamental frequency over the previous two seconds. The speech of the test group increased significantly more in pitch variation than the control group. These positive results suggest that this novel type of feedback could be used in training systems for speakers who have a tendency to speak in a monotone when making oral presentations.Hjalmarsson, A. (2009). On cue - additive effects of turn-regulating phenomena in dialogue. In Diaholmia (pp. 27-34). [abstract] [pdf]Abstract: One line of work on turn-taking in dialogue suggests that speakers react to “cues” or “signals” in the behaviour of the preceding speaker. This paper describes a perception experiment that investigates if such potential turn-taking cues affect the judgments made by non-participating listeners. The experiment was designed as a game where the task was to listen to dialogues and guess the outcome, whether there will be a speaker change or not, whenever the recording was halted. Human-human dialogues as well as dialogues where one of the human voices was replaced by a synthetic voice were used. The results show that simultaneous turn-regulating cues have a reinforcing effect on the listeners’ judgements. The more turn-holding cues, the faster the reaction time, suggesting that the subjects were more confident in their judgments. Moreover, the more cues, regardless if turn-holding or turn-yielding, the higher the agreement among subjects on the predicted outcome. For the re-synthesized voice, responses were made significantly slower; however, the judgments show that the turn-taking cues were interpreted as having similar functions as for the original human voice.House, D., Karlsson, A., Svantesson, J-O., & Tayanin, D. (2009). On utterance-final intonation in tonal and non-tonal dialects of Kammu. In Proceedings of Fonetik 2009 (pp. 78-81). Department of Linguistics, Stockholm University. [abstract] [pdf]Abstract: In this study we investigate utterance-final intonation in two dialects of Kammu, one tonal and one non-tonal. While the general patterns of utterance-final intonation are similar between the dialects, we do find clear evidence that the lexical tones of the tonal dialect restrict the pitch range and the realization of focus. Speaker engagement can have a strong effect on the utterance-final accent in both dialects.House, D., Karlsson, A., Svantesson, J-O., & Tayanin, D.
(2009). The Phrase-Final Accent in Kammu: Effects of Tone, Focus and Engagement. In Proceedings of Interspeech 2009. [abstract] [pdf]Abstract: The phrase-final accent can typically contain a multitude of simultaneous prosodic signals. In this study, aimed at separating the effects of lexical tone from phrase-final intonation, phrase-final accents of two dialects of Kammu were analyzed. Kammu, a Mon-Khmer language spoken primarily in northern Laos, has dialects with lexical tones and dialects with no lexical tones. Both dialects seem to engage the phrase-final accent to simultaneously convey focus, phrase finality, utterance finality, and speaker engagement. Both dialects also show clear evidence of truncation phenomena. These results have implications for our understanding of the interaction between tone, intonation and phrase-finality.Kjellström, H., & Engwall, O. (2009). Audiovisual-to-articulatory inversion. Speech Communication, 51(3), 195-209. [pdf]Krunic, V., Salvi, G., Bernardino, A., Montesano, L., & Santos-Victor, J. (2009). Affordance based word-to-meaning association. In IEEE International Conference on Robotics and Automation (ICRA). Kobe, Japan. [abstract] [pdf]Abstract: This paper presents a method to associate meanings to words in manipulation tasks. We base our model on an affordance network, i.e., a mapping between robot actions, robot perceptions and the perceived effects of these actions upon objects. We extend the affordance model to incorporate words. Using verbal descriptions of a task, the model uses temporal co-occurrence to create links between speech utterances and the involved objects, actions and effects. We show that the robot is able to form useful word-to-meaning associations, even without considering grammatical structure in the learning process and in the presence of recognition errors. These word-to-meaning associations are embedded in the robot’s own understanding of its actions. Thus they can be directly used to instruct the robot to perform tasks and also allow the incorporation of context in the speech recognition task.Laskowski, K., Heldner, M., & Edlund, J. (2009). A general-purpose 32 ms prosodic vector for Hidden Markov Modeling. In Proc. of Interspeech 2009. Brighton, UK. [abstract] [pdf]Abstract: Prosody plays a central role in communicating via speech, making it important for speech technologies to model. Unfortunately, the application of standard modeling techniques to the acoustics of prosody has been hindered by difficulties in modeling intonation. In this work, we explore the suitability of the recently introduced fundamental frequency variation (FFV) spectrum as a candidate general representation of tone. Experiments on 4 tasks demonstrate that FFV features are complementary to other acoustic measures of prosody and that hidden Markov models offer a suitable modeling paradigm. Proposed improvements yield a 35% relative decrease in error on unseen data and simultaneously reduce time complexity by more than an order of magnitude. The resulting representation is sufficiently mature for general deployment in a broad range of automatic speech processing applications.Laskowski, K., Heldner, M., & Edlund, J. (2009). Exploring the prosody of floor mechanisms in English using the fundamental frequency variation spectrum. In Proceedings of the 2009 European Signal Processing Conference (EUSIPCO-2009). Glasgow, Scotland. [abstract] [pdf]Abstract: A basic requirement for participation in conversation is the ability to jointly manage interaction.
Examples of interaction management include indications to acquire, re-acquire, hold, release, and acknowledge floor ownership, and these are often implemented using specialized dialog act (DA) types. In this work, we explore the prosody of one class of such DA types, known as floor mechanisms, using a methodology based on a recently proposed representation of fundamental frequency variation (FFV). Models over the representation illustrate significant differences between floor mechanisms and other dialog act types, and lead to automatic detection accuracies in equal-prior test data of up to 75%. Analysis indicates that FFV modeling offers a useful tool for the discovery of prosodic phenomena which are not explicitly labeled in the audio.Merkes, M. (2009). Recording methods for Spoken Dialogue Systems. Master's thesis.Neiberg, D., Ananthakrishnan, G., & Blomberg, M. (2009). On Acquiring Speech Production Knowledge from Articulatory Measurements for Phoneme Recognition. In INTERSPEECH 2009 - 10th Annual Conference of the International Speech Communication Association (pp. 1387 – 1390). Brighton, UK. [abstract] [pdf]Abstract: The paper proposes a general version of a coupled Hidden Markov/Bayesian Network model for performing phoneme recognition on acoustic-articulatory data. The model uses knowledge learned from the articulatory measurements, available for training, for phoneme recognition on the acoustic input. After training on the articulatory data, the model is able to predict 71.5% of the articulatory state sequences using the acoustic input. Using optimized parameters, the proposed method shows a slight improvement for two speakers over the baseline phoneme recognition system which does not use articulatory knowledge. However, the improvement is only statistically significant for one of the speakers. While there is an improvement in recognition accuracy for the vowels, diphthongs and to some extent the semi-vowels, there is a decrease in accuracy for the remaining phonemes.Neiberg, D., Elenius, K., & Burger, S. (2009). Emotion Recognition. In Waibel, A., & Stiefelhagen, R. (Eds.), Computers in the Human Interaction Loop (pp. 96-105). Berlin/Heidelberg: Springer. [pdf]Pammi, S., & Schröder, M. (2009). Annotating meaning of listener vocalizations for speech synthesis. In International Conference on Affective Computing & Intelligent Interaction.. Amsterdam, The Netherlands.Ronge, K. (2009). Automatic Methods for Dialogue Classification and Prediction. Master's thesis, KTH. [abstract] [pdf]Abstract: The problem presented in this thesis is to specify and implement a solution which could be used to automatically classify previously completed calls to a call routing service, and furthermore, to specify and implement a solution which could be used to automatically predict the outcome of calls to this service “on-line”, i.e. as they happen. The specific service on which to investigate this problem is a residential customer care service designed for a triple-play provider, providing TV, telephone, and broadband solutions to end consumers. The entrance to this service is a call routing system, to which end consumers can call and be routed to the appropriate customer service (for example, technical support for broadband, or the customer service for invoice issues). As the triple-play provider has large-scale operations, and offers a wide variety of products, their need for an easy-to-use, dynamic, maintainable and inexpensive call routing system is obvious. 
The method used to classify the calls to the service yields a classifier with an accuracy of 86%, which is a vast improvement on the majority-guess baseline of 27%. The method used to predict dialogues in calls to the service yields a predictor with an accuracy of 96% after three dialogue turns, which is a relative improvement on the majority-guess baseline by over 35%. The method is also used to successfully predict transaction success after three dialogue turns, with an accuracy of 82%, which is a relative improvement on the majority-guess baseline by over 5%. Although the work presented in this thesis was carried out on data collected from one specific service, it is the author's claim that the results and conclusions found in this thesis are largely applicable to any similar voice controlled call routing service.Salvi, G., Beskow, J., Al Moubayed, S., & Granström, B. (2009). SynFace—Speech-Driven Facial Animation for Virtual Speech-Reading Support. EURASIP Journal on Audio, Speech, and Music Processing, 2009. [abstract] [pdf]Abstract: This paper describes SynFace, a supportive technology that aims at enhancing audio-based spoken communication in adverse acoustic conditions by providing the missing visual information in the form of an animated talking head. Firstly, we describe the system architecture, consisting of a 3D animated face model controlled from the speech input by a specifically optimised phonetic recogniser. Secondly, we report on speech intelligibility experiments with focus on multilinguality and robustness to audio quality. The system, already available for Swedish, English, and Flemish, was optimised for German and for Swedish wide-band speech quality available in TV, radio, and Internet communication. Lastly, the paper covers experiments with nonverbal motions driven from the speech signal. It is shown that turn-taking gestures can be used to affect the flow of human-human dialogues. We have focused specifically on two categories of cues that may be extracted from the acoustic signal: prominence/emphasis and interactional cues (turn-taking/back-channelling).Schlangen, D., & Skantze, G. (2009). A general, abstract model of incremental dialogue processing. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-09). Athens, Greece. [abstract] [pdf]Abstract: We present a general model and conceptual framework for specifying architectures for incremental processing in dialogue systems, in particular with respect to the topology of the network of modules that make up the system, the way information flows through this network, how information increments are 'packaged', and how these increments are processed by the modules. This model enables the precise specification of incremental systems and hence facilitates detailed comparisons between systems, as well as giving guidance on designing new systems.Sen, A., Ananthakrishnan, G., Sundaram, S., & Ramakrishnan, A. G. (2009). Dynamic Space Warping of Strokes for Recognition of Online Handwritten Characters. International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 23(5), 925-943. [abstract] [html]Abstract: This paper suggests a scheme for classifying online handwritten characters, based on dynamic space warping of strokes within the characters. A method for segmenting components into strokes using velocity profiles is proposed. Each stroke is a simple arbitrary shape and is encoded using three attributes. 
Correspondence between various strokes is established using Dynamic Space Warping. A distance measure which reliably differentiates between two corresponding simple shapes (strokes) has been formulated thus obtaining a perceptual distance measure between any two characters. Tests indicate an accuracy of over 85% on two different datasets of characters.Skantze, G., & Gustafson, J. (2009). Attention and interaction control in a human-human-computer dialogue setting. In Proceedings of SigDial 2009. London, UK. [abstract] [pdf]Abstract: This paper presents a simple, yet effective model for managing attention and interaction control in multimodal spoken dialogue systems. The model allows the user to switch attention between the system and other humans, and the system to stop and resume speaking. An evaluation in a tutoring setting shows that the user’s attention can be effectively monitored using head pose tracking, and that this is a more reliable method than using push-to-talk.Skantze, G., & Gustafson, J. (2009). Multimodal interaction control in the MonAMI Reminder. In Proceedings of DiaHolmia (pp. 127-128). Stockholm, Sweden. [pdf]Skantze, G., & Schlangen, D. (2009). Incremental dialogue processing in a micro-domain. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-09). Athens, Greece. [abstract] [pdf]Abstract: This paper describes a fully incremental dialogue system that can engage in dialogues in a simple domain, number dictation. Because it uses incremental speech recognition and prosodic analysis, the system can give rapid feedback as the user is speaking, with a very short latency of around 200ms. Because it uses incremental speech synthesis and self-monitoring, the system can react to feedback from the user as the system is speaking. A comparative evaluation shows that naïve users preferred this system over a non-incremental version, and that it was perceived as more human-like.Svantesson, J-O., House, D., Karlsson, A., & Tayanin, D. (2009). Reduplication with fixed tone pattern in Kammu. In Proceeding of Fonetik 2009 (pp. 82-84). Department of Linguistics, Stockholm University. [abstract] [pdf]Abstract: In this paper we show that speakers of both tonal and non-tonal dialects of Kammu use a fixed tone pattern high–low for intensifying re-duplication of adjectives, and also that speakers of the tonal dialect retain the lexical tones (high or low) while applying this fixed tone pattern.Wik, P., & Escribano, D. (2009). Say ‘Aaaaa’ Interactive Vowel Practice for Second Language Learning. In Proc. of SLaTE Workshop on Speech and Language Technology in Education. Wroxall, England. [pdf]Wik, P., Hincks, R., & Hirschberg, J. (2009). Responses to Ville: A virtual language teacher for Swedish. In Proc. of SLaTE Workshop on Speech and Language Technology in Education. Wroxall, England. [abstract] [pdf]Abstract: A series of novel capabilities have been designed to extend the repertoire of Ville, a virtual language teacher for Swedish, created at the Centre for Speech technology at KTH. These capabilities were tested by twenty-seven language students at KTH. This paper reports on qualitative surveys and quantitative performance from these sessions which suggest some general lessons for automated language training.Wik, P., & Hjalmarsson, A. (2009). Embodied conversational agents in computer assisted language learning. Speech Communication, 51(10), 1024-1037. 
[abstract] [pdf]Abstract: This paper describes two systems using embodied conversational agents (ECAs) for language learning. The first system, called Ville, is a virtual language teacher for vocabulary and pronunciation training. The second system, a dialogue system called DEAL, is a role-playing game for practicing conversational skills. Whereas DEAL acts as a conversational partner with the objective of creating and keeping an interesting dialogue, Ville takes the role of a teacher who guides, encourages and gives feedback to the students.2008Al Moubayed, S., De Smet, M., & Van Hamme, H. (2008). Lip Synchronization: from Phone Lattice to PCA Eigen-projections using Neural Networks. In Proceedings of Interspeech 2008. Brisbane, Australia. [pdf]Ananthakrishnan, G., & Engwall, O. (2008). Important regions in the articulator trajectory. In Proceedings of International Seminar on Speech Production (pp. 305-308). Strasbourg, France. [pdf]Beskow, J., Bruce, G., Enflo, L., Granström, B., & Schötz, S. (2008). Human Recognition of Swedish Dialects. In Proceedings of Fonetik 2008. Beskow, J., Bruce, G., Enflo, L., Granström, B., & Schötz, S. (2008). Recognizing and Modelling Regional Varieties of Swedish. In Proceedings of Interspeech 2008. [pdf]Beskow, J., & Cerrato, L. (2008). Evaluation of the expressivity of a Swedish talking head in the context of human-machine interaction. In Magno Caldognetto, E., Cavicchio, E., & Cosi, P. (Eds.), Comunicazione Parlata e manifestazione delle emozioni. [pdf]Beskow, J., Edlund, J., Gjermani, T., Granström, B., Gustafson, J., Jonsson, O., Skantze, G., & Tobiasson, H. (2008). Innovative interfaces in MonAMI: the reminder. In Proceedings of the 10th international conference on Multimodal interfaces, Chania, Crete, Greece (pp. 199-200). New York, NY, USA: ACM. [abstract] [pdf]Abstract: This demo paper presents an early version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. The Reminder helps users to plan activities and to remember what to do. The prototype merges mobile ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar. Users may ask questions such as “When was I supposed to meet Sara?” or “What’s my schedule today?”Beskow, J., Edlund, J., Granström, B., Gustafson, J., Jonsson, O., & Skantze, G. (2008). Speech technology in the European project MonAMI. In Proceedings of FONETIK 2008 (pp. 33-36). Gothenburg, Sweden. [abstract] [pdf]Abstract: This paper describes the role of speech and speech technology in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. It presents the Reminder, a prototype embodied conversational agent (ECA) which helps users to plan activities and to remember what to do. The prototype merges speech technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar.
Users may also ask questions such as “When was I supposed to meet Sara?” or “What’s on my schedule today?”Beskow, J., Edlund, J., Granström, B., Gustafson, J., & Skantze, G. (2008). Innovative interfaces in MonAMI: the KTH Reminder. In Perception in Multimodal Dialogue Systems - Proceedings of the 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, PIT 2008, Kloster Irsee, Germany, June 16-18, 2008. (pp. 272-275). Berlin/Heidelberg: Springer. [abstract] [pdf]Abstract: This demo paper presents the first version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. The Reminder helps users to plan activities and to remember what to do. The prototype merges ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. This innovative combination of modalities allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides verbal notifications on what has been written in the calendar. Users may also ask questions such as “When was I supposed to meet Sara?” or “What’s on my schedule today?”Beskow, J., Engwall, O., Granström, B., Nordqvist, P., & Wik, P. (2008). Visualization of speech and audio for hearing-impaired persons. Technology and Disability, 20(2), 97-107. [pdf]Beskow, J., Granström, B., Nordqvist, P., Al Moubayed, S., Salvi, G., Herzke, T., & Schulz, A. (2008). Hearing at Home – Communication support in home environments for hearing impaired persons. In Proceedings of Interspeech 2008. Brisbane, Australia. [abstract] [pdf]Abstract: The Hearing at Home (HaH) project focuses on the needs of hearing-impaired people in home environments. The project is researching and developing an innovative media-center solution for hearing support, with several integrated features that support perception of speech and audio, such as individual loudness amplification, noise reduction, audio classification and event detection, and the possibility to display an animated talking head providing real-time speechreading support. In this paper we provide a brief project overview and then describe some recent results related to the audio classifier and the talking head. As the talking head expects clean speech input, an audio classifier has been developed for the task of classifying audio signals as clean speech, speech in noise or other. The mean accuracy of the classifier was 82%. The talking head (based on technology from the SynFace project) has been adapted for German, and a small speech-in-noise intelligibility experiment was conducted where sentence recognition rates increased from 3% to 17% when the talking head was present.Beskow, J., Edlund, J., Granström, B., Gustafson, J., Jonsson, O., Skantze, G., & Tobiasson, H. (2008). The MonAMI Reminder system. In Proc. of SLTC 2008 (pp. 13-14). Stockholm. [abstract] [pdf]Abstract: This paper presents an early version of the Reminder, a prototype ECA developed in the European project MonAMI. The Reminder helps users to plan activities and to remember what to do. The prototype merges mobile ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. 
The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar.Biadsy, F., Rosenberg, A., Carlson, R., Hirschberg, J., & Strangert, E. (2008). A Cross-Cultural Comparison of American, Palestinian, and Swedish. In Speech Prosody 2008. Campinas, Brazil. [pdf]Blomberg, M., & Elenius, D. (2008). Investigating Explicit Model Transformations for Speaker Normalization. In Proceedings of ISCA ITRW Speech Analysis and Processing for Knowledge Discovery. Aalborg, Denmark. [abstract] [pdf]Abstract: In this work we extend the test utterance adaptation technique used in vocal tract length normalization to a larger number of speaker characteristic features. We perform partially joint estimation of four features: the VTLN warping factor, the corner position of the piece-wise linear warping function, spectral tilt in voiced segments, and model variance scaling. In experiments on the Swedish PF-Star children database, joint estimation of warping factor and variance scaling lowered the recognition error rate compared to warping factor alone.Blomberg, M., & Elenius, D. (2008). Knowledge-Rich Model Transformations for Speaker Normalization in Speech Recognition. In Proc. of Fonetik 2008. Dept. of Linguistics, Göteborg University. [pdf]Carlson, R., Gustafson, K., & Strangert, E. (2008). Synthesising disfluencies in a dialogue system. In Nordic Prosody. Helsinki, Finland.Edlund, J., Gustafson, J., Heldner, M., & Hjalmarsson, A. (2008). Towards human-like spoken dialogue systems. Speech Communication, 50(8-9), 630-645. [abstract] [pdf]Abstract: This paper presents an overview of methods that can be used to collect and analyse data on user responses to spoken dialogue system components intended to increase human-likeness, and to evaluate how well the components succeed in reaching that goal. Wizard-of-Oz variations, human-human data manipulation, and micro-domains are discussed in this context, as is the use of third-party reviewers to get a measure of the degree of human-likeness. We also present the two-way mimicry target, a model for measuring how well a human-computer dialogue mimics or replicates some aspect of human-human dialogue, including human flaws and inconsistencies. Although we have added a measure of innovation, none of the techniques is new in its entirety. Taken together and described from a human-likeness perspective, however, they form a set of tools that may widen the path towards human-like spoken dialogue systems.Edlund, J. (2008). Incremental speech synthesis. In Proc. of SLTC 2008 (pp. 53-54). Stockholm. [abstract] [pdf]Abstract: This demo paper describes a proof-of-concept demonstrator highlighting some possibilities and advantages of using incrementality in speech synthesis for spoken dialogue systems. A first version of the application was developed within the European project CHIL and displayed publicly on several occasions. The current version focuses on different aspects, but uses similar technology.Elenius, K., Forsbom, E., & Megyesi, B. (2008). Language Resources and Tools for Swedish: A Survey. In Proc. of LREC 2008. Marrakech, Morocco. [abstract] [pdf]Abstract: Language resources and tools to create and process these resources are necessary components in human language technology and natural language applications.
In this paper, we describe a survey of existing language resources for Swedish, and the need for Swedish language resources to be used in research and real-world applications in language technology as well as in linguistic research. The survey is based on a questionnaire sent to industry and academia, institutions and organizations, and to experts involved in the development of Swedish language resources in Sweden, the Nordic countries and world-wide.Elenius, K., Forsbom, E., & Megyesi, B. (2008). Survey on Swedish Language Resources. Technical Report, Speech, Music and Hearing, KTH and Department of Linguistics and Philology, Uppsala University. [pdf]Engwall, O. (2008). Bättre tala än texta - talteknologi nu och i framtiden. In Domeij, R. (Ed.), Tekniken bakom språket (pp. 98-118). Stockholm: Norstedts Akademiska Förlag.Engwall, O. (2008). Can audio-visual instructions help learners improve their articulation? - an ultrasound study of short term changes. In Proceedings of Interspeech 2008 (pp. 2631-2634). Brisbane, Australia. [pdf]Escribano, D. L. (2008). Pronunciation Training of Swedish Vowels Using Speech Technology, Embodied Conversational Agents and an Interactive Game. Master's thesis, CSC KTH. [pdf]Gustafson, J., & Edlund, J. (2008). expros: a toolkit for exploratory experimentation with prosody in customized diphone voices. In Proceedings of Perception and Interactive Technologies for Speech-Based Systems (PIT 2008) (pp. 293-296). Berlin/Heidelberg: Springer. [abstract] [pdf]Abstract: This paper presents a toolkit for experimentation with prosody in diphone voices. Prosodic features play an important role for aspects of human-human spoken dialogue that are largely unexploited in current spoken dialogue systems. The toolkit contains tools for recording utterances for a number of purposes. Examples include extraction of prosodic features such as pitch, intensity and duration for transplantation onto synthetic utterances, and creation of purpose-built customized MBROLA mini-voices.Gustafson, J., & Edlund, J. (2008). EXPROS: Tools for exploratory experimentation with prosody. In Proceedings of FONETIK 2008 (pp. 17-20). Gothenburg, Sweden. [abstract] [pdf]Abstract: This demo paper presents EXPROS, a toolkit for experimentation with prosody in diphone voices. Although prosodic features play an important role in human-human spoken dialogue, they are largely unexploited in current spoken dialogue systems. The toolkit contains tools for a number of purposes: for example extraction of prosodic features such as pitch, intensity and duration for transplantation onto synthetic utterances and creation of purpose-built customized MBROLA mini-voices.Gustafson, J., Heldner, M., & Edlund, J. (2008). Potential benefits of human-like dialogue behaviour in the call routing domain. In Proceedings of Perception and Interactive Technologies for Speech-Based Systems (PIT 2008) (pp. 240-251). Berlin/Heidelberg: Springer. [abstract] [pdf]Abstract: This paper presents a Wizard-of-Oz (Woz) experiment in the call routing domain that took place during the development of a call routing system for the TeliaSonera residential customer care in Sweden. A corpus of 42,000 calls was used as a basis for identifying problematic dialogues and the strategies used by operators to overcome the problems. A new Woz recording was made, implementing some of these strategies. 
The collected data is described and discussed with a view to explore the possible benefits of more human-like dialogue behaviour in call routing applications.Hincks, R., & Edlund, J. (2008). Improving presentation intonation with feedback on pitch variation. In Proc. of SLTC 2008 (pp. 55-56). Stockholm. [pdf]Hjalmarsson, A. (2008). Speaking without knowing what to say... or when to end. In Proceedings of SIGdial 2008 (pp. 72-75). Columbus, Ohio, USA. [abstract] [pdf]Abstract: Humans produce speech incrementally and on-line as the dialogue progresses using information from several different sources in parallel. A dialogue system that generates output in a stepwise manner and not in preplanned syntactically correct sentences needs to signal how new dialogue contributions relate to previous discourse. This paper describes a data collection which is the foundation for an effort towards more human-like language generation in DEAL, a spoken dialogue system developed at KTH. Two annotators labelled cue phrases in the corpus with high inter-annotator agreement (kappa coefficient 0.82).Hjalmarsson, A., & Edlund, J. (2008). Human-likeness in utterance generation: effects of variability. In Perception in Multimodal Dialogue Systems - Proceedings of the 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, PIT 2008, Kloster Irsee, Germany, June 16-18, 2008. (pp. 252-255). Berlin/Heidelberg: Springer. [abstract] [pdf]Abstract: There are compelling reasons to endow dialogue systems with human-like conversational abilities, which require modelling of aspects of human behaviour. This paper examines the value of using human behaviour as a target for system behaviour through a study making use of a simulation method. Two versions of system behaviour are compared: a replica of a human speaker’s behaviour and a constrained version with less variability. The version based on human behaviour is rated more human-like, polite and intelligent.Hofer, G., Yamagishi, J., & Shimodaira, H. (2008). Speech-Driven Lip Motion Generation with a Trajectory HMM. In Proc. of Interspeech. Brisbane, Australia.Karlsson, A., House, D., & Tayanin, D. (2008). Recognizing phrase and utterance as prosodic units in non-tonal dialects of Kammu. In Proceedings, FONETIK 2008 (pp. 89-92). Department of Linguistics, University of Gothenburg. [pdf]Katsamanis, N., Ananthakrishnan, G., Papandreou, G., Engwall, O., & Maragos, P. (2008). Audiovisual speech inversion by switching dynamical modeling Governed by a Hidden Markov Process. In Proceedings of EUSIPCO. [pdf]Krunic, V., Salvi, G., Bernardino, A., Montesano, L., & Santos-Victor, J. (2008). Associating word descriptions to learned manipulation task models. In IEEE/RSJ International Conference on Intelligent RObots and Systems (IROS). Nice, France.Laskowski, K., Edlund, J., & Heldner, M. (2008). An instantaneous vector representation of delta pitch for speaker-change prediction in conversational dialogue systems. In Proceedings ICASSP 2008 (pp. 5041-5044). Las Vegas, Nevada, US. [abstract] [pdf]Abstract: As spoken dialogue systems become deployed in increasingly complex domains, they face rising demands on the naturalness of interaction. We focus on system responsiveness, aiming to mimic human-like dialogue flow control by predicting speaker changes as observed in real human-human conversations. We derive an instantaneous vector representation of pitch variation and show that it is amenable to standard acoustic modeling techniques. 
Using a small amount of automatically labeled data, we train models which significantly outperform current state-of-the-art pause-only systems, and replicate to within 1% absolute the performance of our previously published hand-crafted baseline. The new system additionally offers scope for run-time control over the precision or recall of locations at which to speak.Laskowski, K., Edlund, J., & Heldner, M. (2008). Learning prosodic sequences using the fundamental frequency variation spectrum. In Proceedings of the Speech Prosody 2008 Conference (pp. 151-154). Campinas, Brazil: Editora RG/CNPq. [abstract] [pdf]Abstract: We investigate a recently introduced vector-valued representation of fundamental frequency variation, whose properties appear to be well-suited for statistical sequence modeling. We show what the representation looks like, and apply hidden Markov models to learn prosodic sequences characteristic of higher-level turn-taking phenomena. Our analysis shows that the models learn exactly those characteristics which have been reported for the phenomena in the literature. Further refinements to the representation lead to 12-17% relative improvement in speaker change prediction for conversational spoken dialogue systems.Laskowski, K., Heldner, M., & Edlund, J. (2008). The fundamental frequency variation spectrum. In Proceedings of FONETIK 2008 (pp. 29-32). Gothenburg, Sweden: Department of Linguistics, University of Gothenburg. [abstract] [pdf]Abstract: This paper describes a recently introduced vector-valued representation of fundamental frequency variation – the FFV spectrum – which has a number of desirable properties. In particular, it is instantaneous, continuous, distributed, and well suited for application of standard acoustic modeling techniques. We show what the representation looks like, and how it can be used to model prosodic sequences.Laskowski, K., Wölfel, M., Heldner, M., & Edlund, J. (2008). Computing the fundamental frequency variation spectrum in conversational spoken dialogue systems. In Proceedings of Acoustics'08 (pp. 3305-3310). Paris, France. [abstract] [pdf]Abstract: Continuous modeling of intonation in natural speech has long been hampered by a focus on modeling fundamental frequency, of which several normative aspects are particularly problematic. The latter include, among others, the fact that pitch is undefined in unvoiced segments, that its absolute magnitude is speaker-specific, and that its robust estimation and modeling, at a particular point in time, rely on a patchwork of long-time stability heuristics. In the present work, we continue our analysis of the fundamental frequency variation (FFV) spectrum, a recently proposed instantaneous, continuous, vector-valued representation of pitch variation, which is obtained by comparing the harmonic structure of the frequency magnitude spectra of the left and right half of an analysis frame. We analyze the sensitivity of a task-specific error rate in a conversational spoken dialogue system to the specific definition of the left and right halves of a frame, resulting in operational recommendations regarding the framing policy and window shape.Laukka, P., Elenius, K., Fredriksson, M., Furumark, T., & Neiberg, D. (2008). Vocal Expression in spontaneous and experimentally induced affective speech: Acoustic correlates of anxiety, irritation and resignation. In Workshop on Corpora for Research on Emotion and Affect. Marrakesh, Morocco.
[abstract] [pdf]Abstract: We present two studies on authentic vocal affect expressions. In Study 1, the speech of social phobics was recorded in an anxiogenic public speaking task both before and after treatment. In Study 2, the speech material was collected from real life human-computer interactions. All speech samples were acoustically analyzed and subjected to listening tests. Results from Study 1 showed that a decrease in experienced state anxiety after treatment was accompanied by corresponding decreases in a) several acoustic parameters (i.e., mean and maximum F0, proportion of high-frequency components in the energy spectrum, and proportion of silent pauses), and b) listeners’ perceived level of nervousness. Both speakers’ self-ratings of state anxiety and listeners’ ratings of perceived nervousness were further correlated with similar acoustic parameters. Results from Study 2 revealed that mean and maximum F0, mean voice intensity and H1-H2 was higher for speech perceived as irritated than for speech perceived as neutral. Also, speech perceived as resigned had lower mean and maximum F0, and mean voice intensity than neutral speech. Listeners’ ratings of irritation, resignation and emotion intensity were further correlated with several acoustic parameters. The results complement earlier studies on vocal affect expression which have been conducted on posed, rather than authentic, emotional speech.Lindblom, B., Diehl, R., Park, S-H., & Salvi, G. (2008). (Re)use of place features in voiced stop systems: Role of phonetic constraints. In Proceedings of Fonetik 2008. University of Gothenburg. [abstract] [pdf]Abstract: Computational experiments focused on place of articulation in voiced stops were designed to generate ‘optimal’ inventories of CV syllables from a larger set of ‘possible CV:s’ in the presence of independently and numerically defined articulatory, perceptual and developmental constraints. Across vowel contexts the most salient places were retroflex, palatal and uvular. This was evident from acoustic measurements and perceptual data. Simulation results using the criterion of perceptual contrast alone failed to produce systems with the typologically widely attested set [b] [d] [g], whereas using articulatory cost as the sole criterion produced inventories in which bilabial, dental/alveolar and velar onsets formed the core. Neither perceptual contrast, nor articulatory cost, (nor the two combined), produced a consistent re-use of place features (‘phonemic coding’). Only systems constrained by ‘target learning’ exhibited a strong recombination of place features.López-Colino, F., Beskow, J., & Colás, J. (2008). Mobile SynFace: Ubiquitous visual interface for mobile VoIP telephone calls. In Proceedings of The second Swedish Language Technology Conference (SLTC). Stockholm, Sweden.. [pdf]Neiberg, D., & Ananthakrishnan, G. (2008). On the Non-uniqueness of Acoustic-to-Articulatory Mapping. In Fonetik. Göteborg.Neiberg, D., Ananthakrishnan, G., & Engwall, O. (2008). The Acoustic to Articulation Mapping: Non-linear or Non-unique?. In INTERSPEECH 2008 - 9th Annual Conference of the International Speech Communication Association (pp. 1485-1488). Brisbane, Australia. [pdf]Neiberg, D., & Elenius, K. (2008). Automatic Recognition of Anger in Spontaneous Speech. In INTERSPEECH 2008 - 9th Annual Conference of the International Speech Communication Association (pp. 1485-1488). Brisbane, Australia. [pdf]Skantze, G. (2008). 
Galatea: A discourse modeller supporting concept-level error handling in spoken dialogue systems. In Dybkjær, L., & Minker, W. (Eds.), Recent Trends in Discourse and Dialogue. Springer. [pdf]Strangert, E., & Gustafson, J. (2008). Improving speaker skill in a resynthesis experiment. In Proceedings of FONETIK 2008. Gothenburg, Sweden. [abstract] [pdf]Abstract: A synthesis experiment was conducted based on data from ratings of speaker skill and acoustic measurements in samples of political speech. Features assumed to be important for “being a good speaker” were manipulated in the sample of the lowest rated speaker. Increased F0 dynamics gave the greatest positive effects, but elimination of disfluencies and hesitation pauses, and increased speech rate also played a role for the impression of improved speaker skill.Strangert, E., & Gustafson, J. (2008). Subject ratings, acoustic measurements and synthesis of good-speaker characteristics. In Proceedings of Interspeech 2008. Brisbane, Australia. [pdf]Wik, P., & Engwall, O. (2008). Can visualization of internal articulators support speech perception?. In Proceedings of Interspeech 2008 (pp. 2627-2630). Brisbane, Australia. [pdf]Wik, P., & Engwall, O. (2008). Looking at tongues – can it help in speech perception?. In Proceedings of Fonetik 2008. [pdf]2007Abelin, Å. (2007). Emotional McGurk effect in Swedish. Technical Report 1. [pdf]Ambrazaitis, G. (2007). Swedish word accents in a ‘confirmation’ context. Proceedings of Fonetik, TMH-QPSR, 50(1), 49-52. [pdf]Bell, L., & Gustafson, J. (2007). Children’s convergence in referring expressions to graphical objects in a speech-enabled computer game. In Proceedings of Interspeech. Antwerp, Belgium. [pdf]Berg, A., & Brandt, A. (2007). What you Hear is what you See – a study of visual vs. auditive noise. Proceedings of Fonetik, TMH-QPSR, 50(1), 77-80. [pdf]Beskow, J., Granström, B., & House, D. (2007). Analysis and synthesis of multimodal verbal and non-verbal interaction for animated interface agents. In Esposito, A., Faundez-Zanuy, M., Keller, E., & Marinaro, M. (Eds.), Verbal and Nonverbal Communication Behaviours (pp. 250-263). Berlin: Springer-Verlag.Blomberg, M., & Elenius, D. (2007). Vocal tract length compensation in the signal and model domains in child speech recognition. Proceedings of Fonetik, TMH-QPSR, 50(1), 41-44. [pdf]Bodén, P., & Svensson, G. (2007). Linguistic challenges for bilingual schoolchildren in Rosengård. Proceedings of Fonetik, TMH-QPSR, 50(1), 93-96. [pdf]Bruce, G., Schötz, S., & Granström, B. (2007). SIMULEKT – modelling Swedish regional intonation. Proceedings of Fonetik, TMH-QPSR, 50(1), 121-124. [pdf]Brusk, J., Lager, T., Hjalmarsson, A., & Wik, P. (2007). DEAL – Dialogue Management in SCXML for Believable Game Characters. In Proceedings of ACM Future Play 2007 (pp. 137-144). [abstract] [pdf]Abstract: In order for game characters to be believable, they must appear to possess qualities such as emotions, the ability to learn and adapt as well as being able to communicate in natural language. With this paper we aim to contribute to the development of believable non-player characters (NPCs) in games, by presenting a method for managing NPC dialogues. We have selected the trade scenario as an example setting since it offers a well-known and limited domain common in games that support ownership, such as role-playing games.
We have developed a dialogue manager in State Chart XML, a newly introduced W3C standard, as part of DEAL – a research platform for exploring the challenges and potential benefits of combining elements from computer games, dialogue systems and language learning.Carlson, R. (2007). Conflicting acoustic cues in stop perception. In Where Do Features Come From? - Phonological Primitives in the Brain, the Mouth, and the Ear (pp. 63-64). Paris, France. [pdf]Carlson, R. (2007). Using acoustic cues in stop perception. Proceedings of Fonetik, TMH-QPSR, 50(1), 25-28. [pdf]Carlson, R., & Granström, B. (2007). Rule-based Speech Synthesis. In Benesty, J., Sondhi, M. M., & Huang, Y. (Eds.), Springer Handbook of Speech Processing (pp. 429-436). Springer Berlin Heidelberg.Carlson, R., & Hawkins, S. (2007). When is fine phonetic detail a detail?. In ICPhS 2007 (pp. 211-214). Saarbrücken, Germany. [pdf]Edlund, J., & Beskow, J. (2007). Pushy versus meek – using avatars to influence turn-taking behaviour. In Proceedings of Interspeech 2007. Antwerp, Belgium. [abstract] [pdf]Abstract: The flow of spoken interaction between human interlocutors is a widely studied topic. Amongst other things, studies have shown that we use a number of facial gestures to improve this flow – to control the taking of turns. This ought to be useful in systems where an animated talking head is used, be they systems for computer mediated human-human dialogue or spoken dialogue systems, where the computer itself uses speech to interact with users. In this article, we show that a small set of simple interaction control gestures and a simple model of interaction can be used to influence users’ behaviour in an unobtrusive manner. The results imply that such a model may improve the flow of computer mediated interaction between humans under adverse circumstances, such as network latency, or to create more human-like spoken human-computer interaction.Edlund, J., Beskow, J., & Heldner, M. (2007). MushyPeek – an experiment framework for controlled investigation of human-human interaction control behaviour. Proceedings of Fonetik, TMH-QPSR, 50(1), 61-64. [abstract] [pdf]Abstract: This paper describes MushyPeek, an experiment framework that allows us to manipulate interaction control behaviour – including turn-taking – in a setting quite similar to face-to-face human-human dialogue. The setup connects two subjects to each other over a VoIP telephone connection and simultaneously provides each of them with an avatar representing the other. The framework is exemplified with the first experiment we tried in it – a test of the effectiveness of interaction control gestures in an animated lip-synchronised talking head.Edlund, J., & Heldner, M. (2007). Underpinning /nailon/ - automatic estimation of pitch range and speaker relative pitch. In Müller, C. (Ed.), Speaker Classification II, Selected Projects (pp. 229-242). Springer. [abstract] [pdf]Abstract: In this study, we explore what is needed to get an automatic estimation of speaker relative pitch that is good enough for many practical tasks in speech technology. We present analyses of fundamental frequency (F0) distributions from eight speakers with a view to examine (i) the effect of semitone transform on the shape of these distributions; (ii) the errors resulting from calculation of percentiles from the means and standard deviations of the distributions; and (iii) the amount of voiced speech required to obtain a robust estimation of speaker relative pitch.
In addition, we provide a hands-on description of how such an estimation can be obtained under real-time online conditions using /nailon/ - our software for online analysis of prosody.Eklund, R. (2007). Pulmonic ingressive speech: a neglected universal?. Proceedings of Fonetik, TMH-QPSR, 50(1), 21-24. [pdf]Enflo, L. (2007). Threshold Pressure For Vocal Fold Collision. Proceedings of Fonetik, TMH-QPSR, 50(1), 105-108. [pdf]Engwall, O., & Bälter, O. (2007). Pronunciation feedback from real and virtual language teachers. Journal of Computer Assisted Language Learning, 20(3), 235-262. [pdf]Ericsson, C., Klein, J., Sjölander, K., & Sönnebo, L. (2007). Filibuster – a new Swedish text-to-speech system. Proceedings of Fonetik, TMH-QPSR, 50(1), 33-36. [pdf]Eriksson, E., & Sullivan, K. (2007). Dialect recognition in a noisy environment: preliminary data. Proceedings of Fonetik, TMH-QPSR, 50(1), 101-104. [pdf]Fant, G., & Kruckenberg, A. (2007). Co-variation of acoustic parameters in prosody. Proceedings of Fonetik, TMH-QPSR, 50(1), 1-4. [pdf]Forsell, M., Elenius, K., & Laukka, P. (2007). Acoustic correlates of frustration in spontaneous speech. Proceedings of Fonetik, TMH-QPSR, 50(1), 37-40. [pdf]Frid, J. (2007). Automatic classification of 'front' and 'back' pronunciation variants of /r/ in the Götaland dialects of Swedish. Proceedings of Fonetik, TMH-QPSR, 50(1), 113-116. [pdf]Granström, B., & House, D. (2007). Inside out - Acoustic and visual aspects of verbal and non-verbal communication (Keynote Paper). Proceedings of the 16th International Congress of Phonetic Sciences, Saarbrücken, 11-18. [pdf]Granström, B., & House, D. (2007). Modelling and evaluating verbal and non-verbal communication in talking animated interface agents. In Dybkjaer, L., Hemsen, H., & Minker, W. (Eds.), Evaluation of Text and Speech Systems (pp. 65-98). Springer-Verlag Ltd.Heldner, M., & Edlund, J. (2007). What turns speech into conversation? A project description. Proceedings of Fonetik, TMH-QPSR, 50(1), 45-48. [abstract] [pdf]Abstract: The project Vad gör tal till samtal? (What turns speech into conversation?) takes as its starting point that while conversation must be considered the primary kind of speech, we are still far better at modelling monologue than dialogue, in theory as well as for speech technology applications. There are also good reasons to assume that conversation contains a number of features that are not found in other kinds of speech, including, among other things, the active cooperation among interlocutors to control the interaction, and to establish common ground. Through this project, we hope to improve the situation by investigating features that are specific to human-human conversation – features that turn speech into conversation. We will focus on acoustic and prosodic aspects of such features.Hjalmarsson, A., Wik, P., & Brusk, J. (2007). Dealing with DEAL: a dialogue system for conversation training. In Proceedings of SIGdial (pp. 132-135). Antwerp, Belgium. [abstract] [pdf]Abstract: We present DEAL, a spoken dialogue system for conversation training under development at KTH. DEAL is a game with a spoken language interface designed for second language learners. The system is intended as a multidisciplinary research platform where challenges and potential benefits of combining elements from computer games, dialogue systems and language learning can be explored.House, D. (2007). Integrating Audio and Visual Cues for Speaker Friendliness in Multimodal Speech Synthesis. In Interspeech 2007 (pp.
1250-1253). Antwerp. [abstract] [pdf]Abstract: This paper investigates interactions between audio and visual cues to friendliness in questions in two perception experiments. In the first experiment, manually edited parametric audio-visual synthesis was used to create the stimuli. Results were consistent with earlier findings in that a late, high final focal accent peak was perceived as friendlier than an earlier, lower focal accent peak. Friendliness was also effectively signaled by visual facial parameters such as a smile, head nod and eyebrow raising synchronized with the final accent. Consistent additive effects were found between the audio and visual cues for the subjects as a group and individually showing that subjects integrate the two modalities. The second experiment used data-driven visual synthesis where the database was recorded by an actor instructed to portray anger and happiness. Friendliness was correlated to the happy database, but the effect was not as strong as for the parametric synthesis.House, D., & Granström, B. (2007). Analyzing and modelling verbal and non-verbal communication for talking animated interface agents. In Esposito, A., Bratanic, M., Keller, E., & Marinaro, M. (Eds.), Fundamentals of verbal and nonverbal communication and the biometric issue (pp. 317-331). Amsterdam: IOS Press.Hugot, V. (2007). Eye gaze analysis in human-human interactions. Master's thesis, CSC. [pdf]Hunnicutt, S., & Magnuson, T. (2007). Grammar-Guided Writing for AAC Users. Assistive Technology, 19(3), 128-142. [abstract]Abstract: A method of grammar-guided writing has been devised to guide graphic sign users through the construction of text messages for use in e-mail and other applications with a remote receiver. The purpose is to promote morphologically and syntactically correct sentences as output for graphic sign users.Jande, P. (2007). Spoken language annotation and data-driven modelling of phone-level pronunciation in discourse context. Speech Communication, 50(2), 126-141.Karlsson, A., House, D., Svantesson, J-O., & Tayanin, D. (2007). Boundary signaling in tonal and non-tonal dialects of Kammu. Proceedings of Fonetik, TMH-QPSR, 50(1), 117-120. [pdf]Karlsson, A. M., House, D., Svantesson, J-O., & Tayanin, D. (2007). Prosodic Phrasing in Tonal and Non-tonal Dialects of Kammu. Proceedings of the 16th International Congress of Phonetic Sciences, Saarbrücken, 1309-1312. [pdf]Kjellström, H., Engwall, O., Abdou, S., & Bälter, O. (2007). Audio-visual phoneme classification for pronunciation training applications. In Proceedings of Interspeech 2007 (pp. 702-705). Antwerpen, Belgium. [pdf]Klintfors, E., Lacerda, F., & Sundberg, U. (2007). Estimates of Infants’ Vocabulary Composition and the Role of Adult-instructions for Early Word-learning. Proceedings of Fonetik, TMH-QPSR, 50(1), 53-56. [pdf]Krull, D. (2007). The influence of Swedish on prepausal lengthening in Estonian. Proceedings of Fonetik, TMH-QPSR, 50(1), 89-92. [pdf]Kügler, F. (2007). Timing of legal and illegal consonant clusters in Swedish. Proceedings of Fonetik, TMH-QPSR, 50(1), 9-12. [pdf]Lindblom, B., Sundberg, J., Branderud, P., & Djamshidpey, H. (2007). On the acoustics of spread lips. Proceedings of Fonetik, TMH-QPSR, 50(1), 13-16. [pdf]Lindh, J. (2007). Voxalys – a Pedagogical Praat Plugin for Voice Analysis. Proceedings of Fonetik, TMH-QPSR, 50(1), 97-100. [pdf]Palmstierna, C. (2007). A Survey of the Sound Shift in Dublin English. Proceedings of Fonetik, TMH-QPSR, 50(1), 81-84. [pdf]Persson, J., & Westholm, L. (2007). 
The Parrot Effect – a study of the ability to imitate a foreign language. Proceedings of Fonetik, TMH-QPSR, 50(1). [pdf]Skantze, G. (2007). Error Handling in Spoken Dialogue Systems - Managing Uncertainty, Grounding and Miscommunication. Doctoral dissertation, KTH, Department of Speech, Music and Hearing. [pdf]Skantze, G. (2007). Making grounding decisions: Data-driven estimation of dialogue costs and confidence thresholds. In Proceedings of SigDial (pp. 206-210). Antwerp, Belgium. [abstract] [pdf]Abstract: This paper presents a data-driven decision-theoretic approach to making grounding decisions in spoken dialogue systems, i.e., to decide which recognition hypotheses to consider as correct and which grounding action to take. Based on task analysis of the dialogue domain, cost functions are derived, which take dialogue efficiency, consequence of task failure and information gain into account. Dialogue data is then used to estimate speech recognition confidence thresholds that are dependent on the dialogue context.Strangert, E. (2007). What makes a good speaker? Subjective ratings and acoustic measurements. Proceedings of Fonetik, TMH-QPSR, 50(1), 29-32. [pdf]Strömbergsson, S. (2007). Interactional patterns in computer-assisted phonological intervention in children. Proceedings of Fonetik 2007, TMH-QPSR, 50(1), 69-72. [pdf]Suomi, K. (2007). Accentual tonal targets and speaking rate in Northern Finnish. Proceedings of Fonetik, TMH-QPSR, 50(1), 109-112. [pdf]Toivanen, J. (2007). Fall-rise intonation usage in Finnish English second language discourse. Proceedings of Fonetik, TMH-QPSR, 50(1), 85-88. [pdf]Traunmüller, H. (2007). Demodulation, mirror neurons and audiovisual perception nullify the motor theory. Proceedings of Fonetik, TMH-QPSR, 50(1), 17-20. [pdf]Van Dommelen, W., & Ringen, C. (2007). Intervocalic fortis and lenis stops in a Norwegian dialect. Proceedings of Fonetik, TMH-QPSR, 50(1), 5-8. [pdf]Wik, P., & Granström, B. (2007). Att lära sig språk med en virtuell lärare. In Från Vision till praktik, språkutbildning och informationsteknik (pp. 51-70). Nätuniversitetet. [pdf]Wik, P., Hjalmarsson, A., & Brusk, J. (2007). Computer Assisted Conversation Training for Second Language Learners. Proceedings of Fonetik, TMH-QPSR, 50(1), 57-60. [pdf]Wik, P., Hjalmarsson, A., & Brusk, J. (2007). DEAL A Serious Game For CALL Practicing Conversational Skills In The Trade Domain. In Proceedings of SLATE 2007. [abstract] [pdf]Abstract: This paper describes work in progress on DEAL, a spoken dialogue system under development at KTH. It is intended as a platform for exploring the challenges and potential benefits of combining elements from computer games, dialogue systems and language learning.Öhgren, S. (2007). Experiment with adaptation and vocal tract length normalization at automatic speech recognition of children's speech. Master's thesis, CSC. [pdf]2006Agelfors, E., Beskow, J., Karlsson, I., Kewley, J., Salvi, G., & Thomas, N. (2006). User Evaluation of the SYNFACE Talking Head Telephone. Lecture Notes in Computer Science, 4061, 579-586. [abstract] [pdf]Abstract: The talking-head telephone, Synface, is a lip-reading support for people with hearing-impairment. It has been tested by 49 users with varying degrees of hearing-impaired in UK and Sweden in lab and home environments. Synface was found to give support to the users, especially in perceiving numbers and addresses and an enjoyable way to communicate. 
A majority deemed Synface to be a useful productAllwood, J., Cerrato, L., Jokinen, K., Navarretta, C., & Paggio, P. (2006). A coding scheme for the annotation of feedback, turn management and sequencing phenomena. Proceedings of the LREC2006 Workshop on Multimodal Corpora. From Multimodal Behaviour to Usable Models., 38-42.Beskow, J., Granström, B., & House, D. (2006). Focal accent and facial movements in expressive speech. In Fonetik 2006, Working Papers 52, General Linguistics and Phonetics, Lund University (pp. 9-12). [pdf]Beskow, J., Granström, B., & House, D. (2006). Visual correlates to prominence in several expressive modes. In Proceedings of Interspeech 2006 (pp. 1272–1275). Pittsburg, PA. [pdf]Boye, J., Gustafson, J., & Wirén, M. (2006). Robust spoken language understanding in a computer game. Speech Communication, 48(3-4), 335-353. [pdf]Carlson, R., Edlund, J., Heldner, M., Hjalmarsson, A., House, D., & Skantze, G. (2006). Towards human-like behaviour in spoken dialog systems. In Proceedings of Swedish Language Technology Conference (SLTC 2006). Gothenburg, Sweden. [pdf]Carlson, R., Gustafson, K., & Strangert, E. (2006). Cues for Hesitation in Speech Synthesis. In Proceedings of Interspeech 06. Pittsburgh, USA. [pdf]Carlson, R., Gustafson, K., & Strangert, E. (2006). Prosodic Cues for Hesitation. Dept. of Linguistics & Phonetics Working Papers, 52, 21–24.Carlson, R., Gustafsson, K., & Strangert, E. (2006). Modelling hesitation for synthesis of spontaneous speech. In Proceedings of Speech Prosody 2006. Dresden. [pdf]Cerrato, L., & D’Imperio, M. (2006). An investigation of the communicative functions of short expressions in Italian and Swedish. Unpublished manuscript.Edlund, J., & Heldner, M. (2006). /nailon/ - online analysis of prosody. In Working Papers 52: Proceedings of Fonetik 2006 (pp. 37-40). Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics. [abstract] [pdf]Abstract: This paper presents /nailon/ - a software package for online real-time prosodic analysis that captures a number of prosodic features relevant for interaction control in spoken dialogue systems. The current implementation captures silence durations; voicing, intensity, and pitch; pseudo-syllable durations; and intonation patterns. The paper provides detailed information on how this is achieved.Edlund, J., & Heldner, M. (2006). /nailon/ - software for online analysis of prosody. In Proc of Interspeech 2006 ICSLP (pp. 2022-2025). Pittsburgh PA, USA. [abstract] [pdf]Abstract: This paper presents /nailon/ - a software package for online real-time prosodic analysis that captures a number of prosodic features relevant for interaction control in spoken dialogue systems. The current implementation captures silence durations; voicing, intensity, and pitch; pseudosyllable durations; and intonation patterns. The paper provides detailed information on how this is achieved. As an example application of /nailon/, we demonstrate how it is used to improve the efficiency of identifying relevant places at which a machine can legitimately begin to talk to a human interlocutor, as well as to shorten system response times.Edlund, J., Heldner, M., & Gustafson, J. (2006). Two faces of spoken dialogue systems. In Interspeech 2006 - ICSLP Satellite Workshop Dialogue on Dialogues: Multidisciplinary Evaluation of Advanced Speech-based Interactive Systems. Pittsburgh PA, USA. [abstract] [pdf]Abstract: This paper is intended as a basis for discussion. 
We propose that users may, knowingly or subconsciously, interpret the events that occur when interacting with spoken dialogue systems in more than one way. Put differently, there is more than one metaphor people may use in order to make sense of spoken human-computer dialogue. We further suggest that different metaphors may not play well together. The analysis is consistent with many observations in human-computer interaction and has implications that may be helpful to researchers and developers alike. For example, developers may want to guide users towards a metaphor of their choice and ensure that the interaction is coherent with that metaphor; researchers may need different approaches depending on the metaphor employed in the system they study; and in both cases one would need to have very good reasons to use mixed metaphors.Ellis, N. (2006). Selective Attention and Transfer Phenomena in L2 Acquisition: Contingency, Cue Competition, Salience, Interference, Overshadowing, Blocking, and Perceptual Learning. Applied Linguistics, 27, 164-194.Engwall, O. (2006). Evaluation of speech inversion using an articulatory classifier. In Yehia, H., Demolin, D., & Laboissière, R. (Eds.), In Proceedings of the Seventh International Seminar on Speech Production (pp. 469-476). Ubatuba, Sao Paolo, Brazil. [pdf]Engwall, O. (2006). Assessing MRI measurements: Effects of sustenation, gravitation and coarticulation. In Harrington, J., & Tabain, M. (Eds.), Speech production: Models, Phonetic Processes and Techniques (pp. 301-314). New York: Psychology Press. [pdf]Engwall, O., Bälter, O., Öster, A-M., & Kjellström, H. (2006). Designing the user interface of the computer-based speech training system ARTUR based on early user tests. Journal of Behaviour and Information Technology, 25(4), 353-365. [pdf]Engwall, O., Bälter, O., Öster, A-M., & Kjellström, H. (2006). Feedback management in the pronunciation training system ARTUR. In Proceedings of CHI 2006 (pp. 231-234). Montreal. [pdf]Engwall, O., Delvaux, V., & Metens, T. (2006). Interspeaker Variation in the Articulation of French Nasal Vowels. In In Proceedings of the Seventh International Seminar on Speech Production (pp. 3-10). Ubatuba, Sao Paolo, Brazil. [pdf]Granström, B., & House, D. (2006). Measuring and modeling audiovisual prosody for animated agents. In Proceedings of Speech Prosody 2006. Dresden. [pdf]Heldner, M., & Edlund, J. (2006). Prosodic cues for interaction control in spoken dialogue systems. In Working Papers 52: Proceedings of Fonetik 2006 (pp. 53-56). Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics. [abstract] [pdf]Abstract: This paper discusses the feasibility of using prosodic features for interaction control in spoken dialogue systems, and points to experimental evidence that automatically extracted prosodic features can be used to improve the efficiency of identifying relevant places at which a machine can legitimately begin to talk to a human interlocutor, as well as to shorten system response times.Heldner, M., Edlund, J., & Carlson, R. (2006). Interruption impossible. In Bruce, G., & Horne, M. (Eds.), Nordic Prosody, Proceedings of the IXth Conference, Lund 2004 (pp. 97-105). Frankfurt am Main, Germany. [abstract] [pdf]Abstract: Most current work on spoken human-computer interaction has so far concentrated on interactions between a single user and a dialogue system. 
The advent of ideas of the computer or dialogue system as a conversational partner in a group of humans, for example within the CHIL project and elsewhere (e.g. Kirchhoff & Ostendorf, 2003), introduces new requirements on the capabilities of the dialogue system. Among other things, the computer as a participant in a multi-party conversation has to appreciate the human turn-taking system, in order to time its own interjections appropriately. As the role of a conversational computer is likely to be to support human collaboration, rather than to guide or control it, it is particularly important that it does not interrupt or disturb the human participants. The ultimate goal of the work presented here is to predict suitable places for turn-taking, as well as positions where it is impossible for a conversational computer to interrupt without irritating the human interlocutors.House, D. (2006). On the interaction of audio and visual cues to friendliness in interrogative prosody. In Proceedings of The Nordic Conference on Multimodal Communication, 2005 (pp. 201-213). Göteborg.House, D. (2006). Perception and production of phrase-final intonation in Swedish questions. In Bruce, G., & Horne, M. (Eds.), Nordic Prosody, Proceedings of the IXth Conference, Lund 2004 (pp. 127-136). Frankfurt am Main: Peter Lang.Höglind, D. (2006). Texture-based expression modelling for a virtual talking head. Master's thesis, KTH. [pdf]Jande, P-A. (2006). Integrating Linguistic Information from Multiple Sources in Lexicon Development and Spoken Language Annotation. In Proceedings of the LREC workshop on merging and layering linguistic information (pp. 1-8). Genoa, Italy. [pdf]Jande, P-A. (2006). Modelling Phone-Level Pronunciation in Discourse Context. Doctoral dissertation. [pdf]Jande, P-A. (2006). Modelling Pronunciation in Discourse Context. In Proceedings of Fonetik (pp. 7-9). Lund, Sweden. [pdf]Kjellström, H., Engwall, O., & Bälter, O. (2006). Reconstructing Tongue Movements from Audio and Video. In Proc of Interspeech 2006 (pp. 2238–2241). Pittsburgh. [pdf]Lidestam, B., & Beskow, J. (2006). Motivation and appraisal in perception of poorly specified speech. Scandinavian Journal of Psychology, 47(2), 93-101. [pdf]Lidestam, B., & Beskow, J. (2006). Visual phonemic ambiguity and speechreading. Journal of Speech, Language and Hearing Research, 49(4), 835-847. [pdf]Melin, H. (2006). Automatic speaker verification on site and by telephone: methods, applications and assessment. Doctoral dissertation, KTH. [pdf]Neiberg, D., Elenius, K., Karlsson, I., & Laskowski, K. (2006). Emotion Recognition in Spontaneous Speech. In Working Papers 52: Proceedings of Fonetik 2006 (pp. 101-104). Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics. [pdf]Neiberg, D., Elenius, K., & Laskowski, K. (2006). Emotion Recognition in Spontaneous Speech Using GMMs. In INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing. Pittsburgh, PA, USA. [pdf]Salvi, G. (2006). Dynamic behaviour of connectionist speech recognition with strong latency constraints. Speech Communication, 48(7), 802-818. [abstract] [pdf]Abstract: This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser.
Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency.Salvi, G. (2006). Mining Speech Sounds, Machine Learning Methods for Automatic Speech Recognition and Analysis. Doctoral dissertation, KTH, School of Computer Science and Communication. [pdf]Salvi, G. (2006). Segment boundaries in low latency phonetic recognition. Lecture Notes in Computer Science, 3817, 267-276. [abstract] [pdf]Abstract: The segment boundaries produced by the Synface low latency phoneme recogniser are analysed. The precision in placing the boundaries is an important factor in the Synface system as the aim is to drive the lip movements of a synthetic face for lip-reading support. The recogniser is based on a hybrid of recurrent neural networks and hidden Markov models. In this paper we analyse how the look-ahead length in the Viterbi-like decoder affects the precision of boundary placement. The properties of the entropy of the posterior probabilities estimated by the neural network are also investigated in relation to the distance of the frame from a phonetic transition.Salvi, G. (2006). Segment boundary detection via class entropy measurements in connectionist phoneme recognition. Speech Communication, 48(12), 1666-1676. [abstract] [pdf]Abstract: This article investigates the possibility of using the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. The rationale is that the value of the entropy should increase in proximity of a transition between two segments that are well modelled (known) by the recognition network since it is a measure of uncertainty. The advantage of this measure is its simplicity, as the posterior probabilities of each class are available in connectionist phoneme recognition. The entropy and a number of measures based on differentiation of the entropy are used in isolation and in combination. The decision methods for predicting the boundaries range from simple thresholds to neural network based procedures. The different methods are compared with respect to their precision, measured in terms of the ratio between the number C of predicted boundaries within 10 or 20 ms of the reference and the total number of predicted boundaries, and recall, measured as the ratio between C and the total number of reference boundaries.Skantze, G., Edlund, J., & Carlson, R. (2006). Talking with Higgins: Research challenges in a spoken dialogue system. In André, E., Dybkjaer, L., Minker, W., Neumann, H., & Weber, M. (Eds.), Perception and Interactive Technologies (pp. 193-196). Berlin/Heidelberg: Springer. [abstract] [pdf]Abstract: This paper presents the current status of the research in the Higgins project and provides background for a demonstration of the spoken dialogue system implemented within the project. The project represents the latest development in the ongoing dialogue systems research at KTH.
The practical goal of the project is to build collaborative conversational dialogue systems in which research issues such as error handling techniques can be tested empirically.Skantze, G., House, D., & Edlund, J. (2006). Grounding and prosody in dialog. In Working Papers 52: Proceedings of Fonetik 2006 (pp. 117-120). Lund, Sweden: Lund University, Centre for Languages & Literature, Dept. of Linguistics & Phonetics. [abstract] [pdf]Abstract: In a previous study we demonstrated that subjects could use prosodic features (primarily peak height and alignment) to make different interpretations of synthesized fragmentary grounding utterances. In the present study we test the hypothesis that subjects also change their behavior accordingly in a human-computer dialog setting. We report on an experiment in which subjects participate in a color-naming task in a Wizard-of-Oz controlled human-computer dialog in Swedish. The results show that two annotators were able to categorize the subjects' responses based on pragmatic meaning. Moreover, the subjects' response times differed significantly, depending on the prosodic features of the grounding fragment spoken by the system.Skantze, G., House, D., & Edlund, J. (2006). User responses to prosodic variation in fragmentary grounding utterances in dialogue. In Proceedings of Interspeech 2006 - ICSLP (pp. 2002-2005). Pittsburgh PA, USA. [abstract] [pdf]Abstract: In this paper, actual user responses to fragmentary grounding utterances in Swedish human-computer dialog are investigated. Building on a previous study which demonstrated that listeners could use prosodic features (primarily peak height and alignment) to make different interpretations of such utterances, we now report on an experiment in which subjects participate in a color-naming task in a Wizard-of-Oz controlled human-computer dialog setting. The results show that two annotators were able to categorize the subjects' responses based on pragmatic meaning. Moreover, the subjects' response times differed significantly, depending on the prosodic features of the grounding fragment spoken by the system.Svanfeldt, G. (2006). Expressiveness in Virtual Talking Faces. Licentiate dissertation.Svantesson, J-O., & House, D. (2006). Tone production, tone perception and Kammu tonogenesis. Phonology, 23, 309-333. [pdf]Wallers, Å., Edlund, J., & Skantze, G. (2006). The effects of prosodic features on the interpretation of synthesised backchannels. In André, E., Dybkjaer, L., Minker, W., Neumann, H., & Weber, M. (Eds.), Proceedings of Perception and Interactive Technologies (pp. 183-187). Springer. [abstract] [pdf]Abstract: A study of the interpretation of prosodic features in backchannels (Swedish /a/ and /m/) produced by speech synthesis is presented. The study is part of work-in-progress towards endowing conversational spoken dialogue systems with the ability to produce and use backchannels and other feedback.Öster, A-M. (2006). Computer-Based Speech Therapy Using Visual Feedback with Focus on Children with Profound Hearing Impairments. Doctoral dissertation, KTH/TMH.2005Carlson, R., Hirschberg, J., & Swerts, M. (Eds.). (2005). Special Issue on Error handling in spoken dialogue systems. Speech Communication, 45(3).Allwood, J., Cerrato, L., Jokinen, K., Navarretta, C., & Paggio, P. (2005). The MUMIN annotation scheme for feedback, turn management and sequencing. In Gothenburg papers in Theoretical Linguistics 92: Proceedings from The Second Nordic Conference on Multimodal Communication (pp. 91-109). 
Göteborg University, Sweden.Batliner, A., Blomberg, M., D’Arcy, S., Elenius, D., Giuliani, D., Gerosa, M., Hacker, C., Russell, M., Steidl, S., & Wong, M. (2005). The PF STAR Children’s Speech Corpus. In Proc Interspeech 2005. [pdf]Beskow, J., Edlund, J., & Nordstrand, M. (2005). A model for multi-modal dialogue system output applied to an animated talking head. In Minker, W., Bühler, D., & Dybkjaer, L. (Eds.), Spoken Multimodal Human-Computer Dialogue in Mobile Environments, Text, Speech and Language Technology (pp. 93-113). Dordrecht, The Netherlands: Kluwer Academic Publishers. [abstract] [pdf]Abstract: We present a formalism for specifying verbal and non-verbal output from a multi-modal dialogue system. The output specification is XML-based and provides information about communicative functions of the output, without detailing the realisation of these functions. The aim is to let dialogue systems generate the same output for a wide variety of output devices and modalities. The formalism was developed and implemented in the multi-modal spoken dialogue system AdApt. We also describe how facial gestures in the 3D-animated talking head used within this system are controlled through the formalism.Beskow, J., & Nordenberg, M. (2005). Data-driven Synthesis of Expressive Visual Speech using an MPEG-4 Talking Head. In Proceedings of Interspeech 2005. Lisbon. [pdf]Boye, J., & Gustafson, J. (2005). How to do dialogue in a fairy-tale world. In 6th SIGdial &th Workshop on Discourse and Dialogue. [pdf]Bälter, O., Engwall, O., Öster, A-M., & Kjellström, H. (2005). Wizard-of-Oz Test of ARTUR - a Computer-Based Speech Training System with Articulation Correction. In Proceedings of the Seventh International ACM SIGACCESS Conference on Computers and Accessibility (pp. 36-43). Baltimore. [pdf]Carlson, R., & Granström, B. (2005). Data-driven multimodal synthesis. Speech Communication, 47(1-2), 182-193.Carlson, R., Hirschberg, J., & Swerts, M. (2005). Cues to upcoming Swedish prosodic boundaries: Subjective judgment studies and acoustic correlates. Speech Communication, 46, 326-333.Cerrato, L. (2005). Linguistic functions of head nods. In Allwood, J., & Dorriots, B. (Eds.), Gothenburg papers in Theoretical Linguistics 92: Proceedings from The Second Nordic Conference on Multimodal Communication (pp. 137-152). Göteborg University, Sweden.Cerrato, L. (2005). On the acoustic, prosodic and gestural characteristics of “m-like” sounds in Swedish. Gothenburg papers in theoretical linguistics, Feedback in Spoken Interaction- Nordtalk Symposium 2003, 18-31.Cerrato, L. (2005). The communicative function of "sì" in Italian and "ja" in Swedish: an acoustic analysis. In Proceedings of Fonetik 2005 (pp. 41-44). Göteborg.Cerrato, L., & Svanfeldt, G. (2005). A method for the detection of communicative head nods in expressive speech. In Allwood, J., Dorriots, B., & Nicholson, S. (Eds.), Gothenburg papers in Theoretical Linguistics 92: Proceedings from The Second Nordic Conference on Multimodal Communication (pp. 153-165). Göteborg University, Sweden.Edlund, J., & Heldner, M. (2005). Exploring prosody in interaction control. Phonetica, 62(2-4), 215-226. [abstract] [pdf]Abstract: This paper investigates prosodic aspects of turn-taking in conversation with a view to improving the efficiency of identifying relevant places at which a machine can legitimately begin to talk to a human interlocutor. 
It examines the relationship between interaction control, the communicative function of which is to regulate the flow of information between interlocutors, and its phonetic manifestation. Specifically, the listener's perception of such interaction control phenomena is modelled. Algorithms for automatic online extraction of prosodic phenomena liable to be relevant for interaction control, such as silent pauses and intonation patterns, are presented and evaluated in experiments using Swedish Map Task data. We show that the automatically extracted prosodic features can be used to avoid many of the places where current dialogue systems run the risk of interrupting their users, and also to identify suitable places to take the turn.Edlund, J., Heldner, M., & Gustafson, J. (2005). Utterance segmentation and turn-taking in spoken dialogue systems. In Fisseni, B., Schmitz, H-C., Schröder, B., & Wagner, P. (Eds.), Computer Studies in Language and Speech (pp. 576-587). Frankfurt am Main, Germany: Peter Lang. [abstract] [pdf]Abstract: A widely used method for finding places to take turn in spoken dialogue systems is to assume that an utterance ends where the user ceases to speak. Such endpoint detection normally triggers on a certain amount of silence, or non-speech. However, spontaneous speech frequently contains silent pauses inside sentencelike units, for example when the speaker hesitates. This paper presents /nailon/, an on-line, real-time prosodic analysis tool, and a number of experiments in which end-point detection has been augmented with prosodic analysis in order to segment the speech signal into what humans intuitively perceive as utterance-like units.Edlund, J., & Hjalmarsson, A. (2005). Applications of distributed dialogue systems: the KTH Connector. In Proceedings of ISCA Tutorial and Research Workshop on Applied Spoken Language Interaction in Distributed Environments (ASIDE 2005). Aalborg, Denmark. [abstract] [pdf]Abstract: We describe a spoken dialogue system domain: that of the personal secretary. This domain allows us to capitalise on the characteristics that make speech a unique interface; characteristics that humans use regularly, implicitly, and with remarkable ease. We present a prototype system - the KTH Connector - and highlight several dialogue research issues arising in the domain.Edlund, J., House, D., & Skantze, G. (2005). Prosodic Features in the Perception of Clarification Ellipses. In Proceedings of Fonetik 2005. Gothenburg, Sweden. [abstract] [pdf]Abstract: We present an experiment where subjects were asked to listen to Swedish human-computer dialogue fragments where a synthetic voice makes an elliptical clarification after a user turn. The prosodic features of the synthetic voice were systematically varied, and subjects were asked to judge the computer's actual intention. The results show that an early low F0 peak signals acceptance, that a late high peak is perceived as a request for clarification of what was said, and that a mid high peak is perceived as a request for clarification of the meaning of what was said. The study can be seen as the beginnings of a tentative model for intonation of clarification ellipses in Swedish, which can be implemented and tested in spoken dialogue systems.Edlund, J., House, D., & Skantze, G. (2005). The effects of prosodic features on the interpretation of clarification ellipses. In Proceedings of Interspeech 2005 (pp. 2389-2392). Lisbon, Portugal. 
[abstract] [pdf]Abstract: In this paper, the effects of prosodic features on the interpretation of elliptical clarification requests in dialogue are studied. An experiment is presented where subjects were asked to listen to short human-computer dialogue fragments in Swedish, where a synthetic voice was making an elliptical clarification after a user turn. The prosodic features of the synthetic voice were systematically varied, and the subjects were asked to judge what was actually intended by the computer. The results show that an early low F0 peak signals acceptance, that a late high peak is perceived as a request for clarification of what was said, and that a mid high peak is perceived as a request for clarification of the meaning of what was said. The study can be seen as the beginnings of a tentative model for intonation of clarification ellipses in Swedish, which can be implemented and tested in spoken dialogue systems.Elenius, D., & Blomberg, M. (2005). Adaptation and Normalization Experiments in Speech Recognition for 4 to 8 Year old Children. In Proc Interspeech 2005. [pdf]Engwall, O. (2005). Articulatory synthesis using corpus-based estimation of line spectrum pairs. In Proceedings of Interspeech 2005. Lisbon, Portugal. [pdf]Engwall, O. (2005). Introducing visual cues in acoustic-to-articulatory inversion. In Proceedings of Interspeech 2005. Lisbon, Portugal. [pdf]Eriksson, E., Bälter, O., Engwall, O., Öster, A-M., & Kjellström, H. (2005). Design Recommendations for a Computer-Based Speech Training System Based on End-User Interviews. In Proceedings of the Tenth International Conference on Speech and Computers (pp. 483-486). Patras, Greece. [pdf]Granström, B., & House, D. (2005). Audiovisual representation of prosody in expressive speech communication. Speech Communication, 46, 473-484.Granström, B., & House, D. (2005). Effective Interaction with Talking Animated Agents in Dialogue Systems. In van Kuppevelt, J., Dybkjaer, L., & Bernsen, N. O. (Eds.), Advances in Natural Multimodal Dialogue Systems (pp. 215-243). Springer, Dordrecht, The Netherlands.Gustafson, J., Boye, J., Fredriksson, M., Johannesson, L., & Königsmann, J. (2005). Providing computer game characters with conversational abilities. In Proceedings of Intelligent Virtual Agent (IVA05). Kos, Greece. [pdf]Hincks, R. (2005). Computer Support for Learners of Spoken English. Doctoral dissertation, School of Computer Science and Communication. [abstract] [pdf]Abstract: This thesis concerns the use of speech technology to support the process of learning the English language. It applies theories of computer-assisted language learning and second language acquisition to address the needs of beginning, intermediate and advanced students of English for specific purposes. The thesis includes an evaluation of speech-recognition-based pronunciation software, based on a controlled study of a group of immigrant engineers. The study finds that while the weaker students may have benefited from their software practice, the pronunciation ability of the better students did not improve. The linguistic needs of advanced and intermediate Swedish-native students of English are addressed in a study using multimodal speech synthesis in an interactive exercise demonstrating differences in the placement of lexical stress in two Swedish-English cognates. A speech database consisting of 28 ten-minute oral presentations made by these learners is described, and an analysis of pronunciation errors is presented.
Eighteen of the presentations are further analyzed with regard to the normalized standard deviation of fundamental frequency over 10-second long samples of speech, termed pitch variation quotient (PVQ). The PVQ is found to range from 6% to 34% in samples of speech, with mean levels of PVQ per presentation ranging from 11% to 24%. Males are found to use more pitch variation than females. Females who are more proficient in English use more pitch variation than the less proficient females. A perceptual experiment tests the relationship between PVQ and impressions of speaker liveliness. An overall correlation of .83 is found. Temporal variables in the presentation speech are also studied. A bilingual database where five speakers make the same presentation in both English and Swedish is studied to examine effects of using a second language on presentation prosody. Little intra-speaker difference in pitch variation is found, but these speakers speak on average 20% faster when using their native language. The thesis concludes with a discussion of how the results could be applied in a proposed feedback mechanism for practicing and assessing oral presentations, conceptualized as a ‘speech checker.’ Potential users of the system would include native as well as non-native speakers of English.Hincks, R. (2005). Measures and perceptions of liveliness in student presentation speech: A proposal for an automatic feedback mechanism. System, 33(4), 575-591. [abstract] [002]Abstract: This paper analyzes prosodic variables in a corpus of eighteen oral presentations made by students of Technical English, all of whom were native speakers of Swedish. The focus is on the extent to which speakers were able to use their voices in a lively manner, and the hypothesis tested is that speakers who had high pitch variation as they spoke would be perceived as livelier speakers. A metric (termed PVQ), derived from the standard deviation in fundamental frequency, is proposed as a measure of pitch variation. Composite listener ratings of liveliness for nine 10-s samples of speech per speaker correlate strongly (r = .83, n = 18, p < .01) with the PVQ metric. Liveliness ratings for individual 10-s samples of speech show moderate but significant (n = 81, p < .01) correlations: r = .70 for males and r = .64 for females. The paper also investigates rate of speech and fluency variables in this corpus of L2 English. An application for this research is in presentation skills training, where computer feedback could be provided for speaking rate and the extent to which speakers have been able to use their voices in an engaging manner.Hincks, R. (2005). Measuring liveliness in presentation speech. In Proceedings of Interspeech 2005 (pp. 765-768). Lisbon. [abstract] [pdf]Abstract: This paper proposes that speech analysis be used to quantify prosodic variables in presentation speech, and reports the results of a perception test of speaker liveliness. The test material was taken from a corpus of oral presentations made by 18 Swedish native students of Technical English. Liveliness ratings from a panel of eight judges correlated strongly with normalized standard deviation of F0 and, for female speakers, with mean length of runs, which is the number of syllables between pauses of >250 ms. An application of these findings would be in the development of a feedback mechanism for the prosody of public speaking.Hincks, R. (2005). Presenting in English and Swedish. In Proceedings of Fonetik 2005 (pp. 45-48). Göteborg.
[abstract] [pdf]Abstract: This paper reports on a comparison of prosodic variables from oral presentations in a first and second language. Five Swedish natives who speak English at the advanced-intermediate level were recorded as they made the same presentation twice, once in English and once in Swedish. Though it was expected that speakers would use more pitch variation when they spoke Swedish, three of the five speakers showed no significant difference between the two languages. All speakers spoke more quickly in Swedish, the mean being 20% faster.Hjalmarsson, A. (2005). Towards user modelling in conversational dialogue systems: A qualitative study of the dynamics of dialogue parameters. In Proceedings of Interspeech 2005 (pp. 869-872). Lisbon, Portugal. [abstract] [pdf]Abstract: This paper presents a qualitative study of data from a 26 subject experimental study within the multimodal, conversational dialogue system AdApt. Qualitative analysis of data is used to illustrate the dynamic variation of dialogue parameters over time. The analysis will serve as a foundation for research and future data collections in the area of adaptive dialogue systems and user modelling.House, D. (2005). Fonetiska undersökningar av kammu. In Lundström, H., & Svantesson, J-O. (Eds.), Kammu - om ett folk i Laos (pp. 164-167). Lund: Lunds universitetshistoriska sällskap.House, D. (2005). Phrase-final rises as a prosodic feature in wh-questions in Swedish human–machine dialogue. Speech Communication, 46, 268-283. [abstract] [pdf]Abstract: This paper examines the extent to which optional final rises occur in a set of 200 wh-questions extracted from a large corpus of computer-directed spontaneous speech in Swedish and discusses the function these rises may have in signalling dialogue acts and speaker attitude over and beyond an information question. Final rises occurred in 22% of the utterances, primarily in conjunction with final focal accent. Children exhibited the largest percentage of final rises (32%), with women second (27%) and men lowest (17%). The distribution of the rises in the material is examined and evidence relating to the final rise as a signal of a social interaction oriented dialogue act is gathered from the distribution. Two separate perception tests were carried out to test the hypothesis that high and late focal accent peaks in a wh-question are perceived as friendlier and more socially interested than low and early peaks. Generally, the results were consistent with these hypotheses when the late peaks were in phrase-final position. Finally, the results of this study are discussed in terms of pragmatic and attitudinal meanings and biological codes.Imboden, S., Petrone, M., Quadrani, P., Zannoni, C., Mayoral, R., Clapworthy, G. J., Testi, D., Viceconti, M., Neiberg, D., Tsagarakis, N. G., & Caldwell, D. (2005). A Haptic Enabled Multimodal Pre-Operative Planner for Hip Arthroplasty. In WorldHaptics Conference. Pisa, Italy. [pdf]Jande, P-A. (2005). Annotating Speech Data for Pronunciation Variation Modelling. In Proceedings of Fonetik (pp. 25-27). Göteborg, Sweden. [pdf]Jande, P-A. (2005). Inducing Decision Tree Pronunciation Variation Models from Annotated Speech Data. In Proceedings of Interspeech (pp. 4-8). Lisbon, Portugal. [pdf]Johnson, W., Vilhjalmsson, H., & Marsella, S. (2005). Serious games for language learning: How much game, how much AI?. In 12: th International Conference on Artificial Intelligence in Education. Amsterdam.Nordenberg, M., Svanfeldt, G., & Wik, P. (2005). 
Artificial Gaze - Perception experiment of eye gaze in synthetic faces. In Proceedings from the Second Nordic Conference on Multimodal Communication. [PDF]Oppelstrup, L., Blomberg, M., & Elenius, D. (2005). Scoring Children's Foreign Language Pronunciation. In Proc FONETIK 2005. Department of Linguistics, Göteborg University. [pdf]Salvi, G. (2005). Advances in regional accent clustering in Swedish. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech) (pp. 2841-2844). Lisbon, Portugal. [abstract] [pdf]Abstract: The regional pronunciation variation in Swedish is analysed on a large database. Statistics over each phoneme and for each region of Sweden are computed using the EM algorithm in a hidden Markov model framework to overcome the difficulties of transcribing the whole set of data at the phonetic level. The model representations obtained this way are compared using a distance measure in the space spanned by the model parameters, and hierarchical clustering. The regional variants of each phoneme may group with those of any other phoneme, on the basis of their acoustic properties. The log likelihood of the data given the model is shown to display interesting properties regarding the choice of number of clusters, given a particular level of details. Discriminative analysis is used to find the parameters that most contribute to the separation between groups, adding an interpretative value to the discussion. Finally a number of examples are given on some of the phenomena that are revealed by examining the clustering tree.Salvi, G. (2005). Ecological language acquisition via incremental model-based clustering. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech) (pp. 1181-1184). Lisbon, Portugal. [abstract] [pdf]Abstract: We analyse the behaviour of Incremental Model-Based Clustering on child-directed speech data, and suggest a possible use of this method to describe the acquisition of phonetic classes by an infant. The effects of two factors are analysed, namely the number of coefficients describing the speech signal, and the frame length of the incremental clustering procedure. The results show that, although the number of predicted clusters vary in different conditions, the classifications obtained are essentially consistent. Different classifications were compared using the variation of information measure.Salvi, G. (2005). Segment Boundaries in Low Latency Phonetic Recognition. In Proceedings of Non Linear Speech Processing (NOLISP). Barcelona, Spain.Skantze, G. (2005). Exploring human error recovery strategies: implications for spoken dialogue systems. Speech Communication, 45(3), 325-341. [abstract] [pdf]Abstract: In this study, an explorative experiment was conducted in which subjects were asked to give route directions to each other in a simulated campus (similar to Map Task). In order to elicit error handling strategies, a speech recogniser was used to corrupt the speech in one direction. This way, data could be collected on how the subjects might recover from speech recognition errors. This method for studying error handling has the advantages that the level of understanding is transparent to the analyser, and the errors that occur are similar to errors in spoken dialogue systems. The results show that when subjects face speech recognition problems, a common strategy is to ask task-related questions that confirm their hypothesis about the situation instead of signalling non-understanding. 
Compared to other strategies, such as asking for a repetition, this strategy leads to better understanding of subsequent utterances, whereas signalling non-understanding leads to decreased experience of task success.Skantze, G. (2005). Galatea: a discourse modeller supporting concept-level error handling in spoken dialogue systems. In Proceedings of SigDial (pp. 178-189). Lisbon, Portugal. [abstract] [pdf]Abstract: In this paper, a discourse modeller for conversational spoken dialogue systems, called GALATEA, is presented. Apart from handling the resolution of ellipses and anaphora, it tracks the “grounding status” of concepts that are mentioned during the discourse, i.e. information about who said what when. This grounding information also contains concept confidence scores that are derived from the speech recogniser word confidence scores. The discourse model may then be used for concept-level error handling, i.e. grounding of concepts, fragmentary clarification requests, and detection of erroneous concepts in the model at later stages in the dialogue.Svanfeldt, G., & Olszewski, D. (2005). Perception experiment combining a parametric loudspeaker and a synthetic talking head. In Proceedings of Interspeech (pp. 1721-1724). Testi, D., Zannoni, C., Caldwell, D., Neiberg, D., Clapworthy, G., & Viceconti, M. (2005). An innovative multisensorial environment for pre-operative planning of total hip replacement. In 5th Annual Meeting of Computer Assisted Orthopaedic Surgery. Helsinki, Finland.2004Beskow, J. (2004). Trainable articulatory control models for visual speech synthesis. Journal of Speech Technology, 4(7), 335-349. [pdf]Beskow, J., Cerrato, L., Cosi, P., Costantini, E., Nordstrand, M., Pianesi, F., Prete, M., & Svanfeldt, G. (2004). Preliminary cross-cultural evaluation of expressiveness in synthetic faces. In André, E., Dybkjaer, L., Minker, W., & Heisterkampf, P. (Eds.), Proc Tutorial and Research Workshop on Affective Dialogue Systems, ADS'04 (pp. 240-243). Kloster Irsee, Germany. [pdf]Beskow, J., Cerrato, L., Granström, B., House, D., Nordenberg, M., Nordstrand, M., & Svanfeldt, G. (2004). Expressive Animated Agents for Affective Dialogue Systems. In Proc Tutorial and Research Workshop on Affective Dialogue Systems, ADS'04. Kloster Irsee, Germany. [pdf]Beskow, J., Cerrato, L., Granström, B., House, D., Nordstrand, M., & Svanfeldt, G. (2004). The Swedish PF-Star multimodal corpora. In Proc LREC Workshop on Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces (pp. 34-37). Lisbon. [pdf]Beskow, J., Karlsson, I., Kewley, J., & Salvi, G. (2004). SYNFACE - A talking head telephone for the hearing-impaired. In Miesenberger, K., Klaus, J., Zagler, W., & Burger, D. (Eds.), Computers Helping People with Special Needs (pp. 1178-1186). Springer-Verlag. [abstract] [pdf]Abstract: SYNFACE is a telephone aid for hearing-impaired people that shows the lip movements of the speaker at the other telephone synchronised with the speech. The SYNFACE system consists of a speech recogniser that recognises the incoming speech and a synthetic talking head. The output from the recogniser is used to control the articulatory movements of the synthetic head. SYNFACE prototype systems exist for three languages: Dutch, English and Swedish, and the first user trials have just started.Blomberg, M., Elenius, D., & Zetterholm, E. (2004).
Speaker verification scores and acoustic analysis of a professional impersonator. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 84-87). Stockholm University. [pdf]Boye, J., Wiren, M., & Gustafson, J. (2004). Contextual reasoning in multimodal dialogue systems: two case studies. In Proceedings of The 8th Workshop on the Semantics and Pragmatics of Dialogue Catalogue'04 (pp. 19-21). Barcelona. [pdf]Carlson, R., Elenius, K., & Swerts, M. (2004). Perceptual judgments of pitch range. In Bel, B., & Marlin, I. (Eds.), Proc. of Intl Conference on Speech Prosody 2004 (pp. 689-692). Nara, Japan. [pdf]Carlson, R., Hirschberg, J., & Swerts, M. (2004). Prediction of upcoming Swedish prosodic boundaries by Swedish and American listeners. In Bel, B., & Marlin, I. (Eds.), Proc of Intl Conference on Speech Prosody 2004 (pp. 329-332). Nara, Japan. [pdf]Cerrato, L. (2004). A coding scheme for the annotation of feedback phenomena in conversational speech. In Martin, J. (Ed.), Proc of LREC Workshop on Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces (pp. 25-28). Lisboa.Cerrato, L. (2004). A comparative study of verbal feedback in Italian and Swedish map-task dialogues. In Copenhagen, P., & Hernrichsen, J. (Eds.), Proceedings of the Nordic Symposium on the comparison of spoken languages, Copenhagen Working Papers in LSP (pp. 99-126). Cerrato, L., & Ekeklint, S. (2004). Evaluating users reactions to human-like interfaces: Prosodic and paralinguistic features as new evaluation measures for users satisfaction. In Ruttkay, Z., & Pelachaud, C. (Eds.), Kluwer's Human-Computer Interaction Series: From Brows to Trust Evaluating Embodied Conversational Agents (pp. 101-124). Edlund, J., Skantze, G., & Carlson, R. (2004). Higgins - a spoken dialogue system for investigating error handling techniques. In Proceedings of the International Conference on Spoken Language Processing, ICSLP 04 (pp. 229-231). Jeju, Korea. [abstract] [pdf]Abstract: In this paper, an overview of the Higgins project and the research within the project is presented. The project incorporates studies of error handling for spoken dialogue systems on several levels, from processing to dialogue level. A domain in which a range of different error types can be studied has been chosen: pedestrian navigation and guiding. Several data collections within Higgins have been analysed along with data from Higgins' predecessor, the AdApt system. The error handling research issues in the project are presented in light of these analyses.Elenius, D., & Blomberg, M. (2004). Comparing speech recognition for adults and children. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 156-159). Stockholm University. [pdf]Engwall, O. (2004). From real-time MRI to 3D tongue movements. In Kim, S. H., & Young, D. H. (Eds.), Proc ICSLP 2004 (pp. 1109-1112). Jeju Island, Korea. [pdf]Engwall, O. (2004). Speaker adaptation of a three-dimensional tongue model. In Kim, S. H., & Young, D. H. (Eds.), Proc ICSLP 2004 (pp. 465-468). Jeju Island, Korea. [pdf]Engwall, O., Wik, P., Beskow, J., & Granström, G. (2004). Design strategies for a virtual language tutor. In Kim, S. H., & Young, D. H. (Eds.), Proc ICSLP 2004 (pp. 1693-1696). Jeju Island, Korea. [pdf]Granström, B. (2004). Towards a virtual language tutor. In Delmonte, R., Delcloque, P., & Tonellli, S. (Eds.), Proc InSTIL/ICALL2004 NLP and Speech Technologies in Advanced Language Learning (pp. 1-8 (Invited paper)). Venice, Italy. 
[pdf]Granström, B., & House, D. (2004). Audiovisual representation of prosody in expressive speech communication. In Bel, B., & Marlin, I. (Eds.), Proc of Intl Conference on Speech Prosody 2004 (pp. 393-396). Nara, Japan. [pdf]Gustafson, J., Bell, L., Boye, J., Lindström, A., & Wirén, M. (2004). The NICE fairy-tale game system. In Proceedings of SIGdial. Boston. [pdf]Gustafson, J., & Sjölander, K. (2004). Voice creations for conversational fairy-tale characters. In Proc 5th ISCA speech synthesis workshop (pp. 145-150). Pittsburgh. [pdf]Heldner, M., Edlund, J., & Björkenstam, T. (2004). Automatically extracted F0 features as acoustic correlates of prosodic boundaries. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 52-55). Stockholm University. [abstract] [pdf]Abstract: This work presents preliminary results of an investigation of various automatically extracted F0 features as acoustic correlates of prosodic boundaries. The F0 features were primarily intended to capture phenomena such as boundary tones, F0 resets across boundaries and position in the speaker's F0 range. While there were no correspondences between boundary tones and boundaries, the reset and range features appeared to separate boundaries from no boundaries fairly well.Hincks, R. (2004). Processing the prosody of oral presentations. In Delmonte, R., Delcloque, P., & Tonellli, S. (Eds.), Proc InSTIL/ICALL2004 NLP and Speech Technologies in Advanced Language Learning (pp. 63-66). Venice, Italy. [abstract] [pdf]Abstract: Standard advice to people preparing to speak in public is to use a “lively” voice. A lively voice is described as one that varies in intonation, rhythm and loudness: qualities that can be analyzed using speech analysis software. This paper reports on a study analyzing pitch variation as a measure of speaker liveliness. A potential application of this approach for analysis would be for rehearsing or assessing the prosody of oral presentations. While public speaking can be intimidating even to native speakers, second language users are especially challenged, particularly when it comes to using their voices in a prosodically engaging manner. The material is a database of audio recordings of twenty 10-minute student oral presentations, where all speakers were college-age Swedes studying Technical English. The speech has been processed using the analysis software WaveSurfer for pitch extraction. Speaker liveliness has been measured as the standard deviation from the mean fundamental frequency over 10-second periods of speech. The standard deviations have been normalized (by division with the mean frequency) to obtain a value termed the pitch dynamism quotient (PDQ). Mean values (for ten minutes of speech) of PDQ per speaker range from a low of 0.11 to a high of 0.235. Individual values for 10-second segments range from lows of 0.06 to highs of 0.36.Hincks, R. (2004). Standard deviation of F0 in student monologue. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 132-135). Stockholm University. [abstract] [pdf]Abstract: Twenty ten-minute oral presentations made by Swedish students speaking English have been analyzed with respect to the standard deviation of F0 over long stretches of speech. Values have been normalized by division with the mean. Results show a strong correlation between proficiency in English and pitch variation for male speakers but not for females. The results also identify monotone and disfluent speakers.House, D. (2004).
Final rises and Swedish question intonation. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 56-59). Stockholm University. [pdf]House, D. (2004). Final rises in spontaneous Swedish computer-directed questions: incidence and function. In Bel, B., & Marlin, I. (Eds.), Proc of Intl Conference on Speech Prosody 2004 (pp. 115-118). Nara, Japan. [pdf]House, D. (2004). Pitch and alignment in the perception of tone and intonation. In Fant, G., Fujisaki, H., Cao, J., & Xu, Y. (Eds.), From Traditional Phonology to Modern Speech Processing (pp. 189-204). Beijing: Foreign Language Teaching and Research Press.House, D. (2004). Pitch and alignment in the perception of tone and intonation: pragmatic signals and biological codes. In Bel, B., & Marlein, I. (Eds.), Proc of International Symposium on Tonal Aspects of Languages: Emphasis on Tone Languages (pp. 93-96). Beijng, China.Hunnicutt, S., Nozadze, L., & Chikoidze, G. (2004). Russian word prediction with morphological support. In 5th International Symposium on Language, Logic and Computation. Tbilisi, Georgia. [pdf]Hunnicutt, S., & Zuurman, M. (2004). Comparison of vocabulary in several symbol sets and Voice of America word list. In 11th Biennial Conference of the International Society for Augmentative and Alternative Communication (extended abstract). Natal, Brazil.Jande, P-A. (2004). Pronunciation variation modelling using decision tree induction from multiple linguistic parameters. In Proceedings of Fonetik (pp. 12-15). Stockholm, Sweden. [pdf]Karnebäck, S. (2004). Spectro-temporal properties of the acoustic speech signal used for speech/music discrimination. Licentiate dissertation. [pdf]Kohler, K. J. (2004). Pragmatic and attitudinal meanings of pitch patterns in German syntactically marked questions. In Fant, G., Fujisaki, H., Cao, J., & Xu, Y. (Eds.), From traditional phonology to modern speech processing (pp. 205-214). Beijing: Foreign Language Teaching and Research Press.Lacerda, F., Sundberg, U., Carlson, R., & Holt, L. (2004). Modelling interactive language learning: a project presentation. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 60-63). Stockholm University. [pdf]Magnuson, T., & Hunnicutt, S. (2004). Aided text construction in an e-mail application for symbol users. In 11th Biennial Conference of the International Society for Augmentative and Alternative Communication (extended abstract). Natal, Brazil.Nordenberg, M., & Sundberg, J. (2004). Effect on LTAS of vocal loudness variation. Logopedics Phoniatrics Vocology, 29, 183-191.Nordstrand, M., Svanfeldt, G., Granström, B., & House, D. (2004). Measurements of articulatory variation in expressive speech for a set of Swedish vowels. J Speech Communication - Special Issue on Audio Visual Speech Processing, 1-4(44), 187-196.Pakucs, B. (2004). Butler: a universal speech interface for mobile environments. In Brewster, S., & Dunlop, M. (Eds.), Lecture notes in Computer Science 3160. Mobile HCI 04. 6th International Symposium on Human Computer Interaction with Mobile Devices and Services (pp. 399-403). Glasgow, UK: Springer Verlag.Pakucs, B. (2004). Employing context of use in dialogue processing. In Proc CATALOG '04. 8th Workshop on the Semantics and Pragmatics of Dialogue (pp. 162-163). Barcelona, Spain.Pakucs, B., & Huhta, S. (2004). Developing speech interfaces for frequent users: the DUMAS-calendar prototype. In Proc of COLING 2004, Workshop on Robust and Adaptive Information Processing for Mobile Speech Interfaces (pp. 65-68). 
Geneva, Switzerland.Seward, A. (2004). A fast HMM match algorithm for very large vocabulary speech recognition. Speech Comm, 42, 191-206.Siciliano, C., Williams, G., Faulkner, A., & Salvi, G. (2004). Intelligibility of an ASR-controlled synthetic talking face. Journal of the Acoustical Society of America, 115(5), 2428.Sjölander, K., & Heldner, M. (2004). Word level precision of the NALIGN automatic segmentation algorithm. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 116-119). Stockholm University. [pdf]Skantze, G., & Edlund, J. (2004). Early error detection on word level. In Proceedings of ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction. Norwich, UK. [abstract] [pdf]Abstract: In this paper two studies are presented in which the detection of speech recognition errors on the word level was examined. In the first study, memory-based and transformation-based machine learning was used for the task, using confidence, lexical, contextual and discourse features. In the second study, we investigated which factors humans benefit from when detecting errors. Information from the speech recogniser (i.e. word confidence scores and 5-best lists) and contextual information were the factors investigated. The results show that word confidence scores are useful and that lexical and contextual (both from the utterance and from the discourse) features further improve performance.Skantze, G., & Edlund, J. (2004). Robust interpretation in the Higgins spoken dialogue system. In Proceedings of ISCA Tutorial and Research Workshop (ITRW) on Robustness Issues in Conversational Interaction. Norwich, UK. [abstract] [pdf]Abstract: This paper describes Pickering, the semantic interpreter developed in the Higgins project - a research project on error handling in spoken dialogue systems. In the project, the initial efforts are centred on the input side of the system. The semantic interpreter combines a rich set of robustness techniques with the production of deep semantic structures. It allows insertions and non-agreement inside phrases, and combines partial results to return a limited list of semantically distinct solutions. A preliminary evaluation shows that the interpreter performs well under error conditions, and that the built-in robustness techniques contribute to this performance.Spens, K-E., Agelfors, E., Beskow, J., Granström, B., Karlsson, I., & Salvi, G. (2004). SYNFACE, a talking head telephone for the hearing impaired. In Proc IFHOH 7th World Congress. Helsinki, Finland.Strangert, E., & Carlson, R. (2004). On the modelling and synthesis of conversational speech. In Bruce, G., & Horne, M. (Eds.), Nordic Prosody. Proceedings of the IXth Conference (pp. 255-264). Lund: Peter Lang: Frankfurt am Main.Wik, P. (2004). Designing a virtual language tutor. In Proc of The XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 136-139). Stockholm University. [pdf]Wik, P., Nygaard, L., & Fjeld, R. V. (2004). Managing complex and multilingual lexical data with a simple editor. In Proceedings of the Eleventh EURALEX International Congress. Lorient, France. [pdf]Zetterholm, E., Blomberg, M., & Elenius, D. (2004). A comparison between human perception and a speaker verification system score of a voice imitation. In Proc of Tenth Australian International Conference on Speech Science & Technology (pp. 393-397). Macquarie Univ, Sydney, Australia. [pdf]Öhlin, D., & Carlson, R. (2004). Data-driven formant synthesis. 
In Proc of the XVIIth Swedish Phonetics Conference, Fonetik 2004 (pp. 160-163). Stockholm University. [pdf]2003Allwood, J., & Cerrato, L. (2003). A study of gestural feedback expressions. In The First Nordic Symposium on Multimodal Communication. Copenhagen.Batliner, A., Fischer, K., Huber, R., Spilker, J., & Nöth, E. (2003). How to find trouble in communication. Speech Communication, 40, 117-143.Bell, L. (2003). Linguistic adaptations in spoken human-computer dialogues. Empirical studies of user behavior. Doctoral dissertation. [pdf]Bell, L., & Gustafson, J. (2003). Child and adult speaker adaptation during error resolution in a publicly available spoken dialogue system. In Proc of EuroSpeech 2003 (pp. 613-616). Geneva, Switzerland. [pdf]Bell, L., Gustafson, J., & Heldner, M. (2003). Prosodic adaptation in human-computer interaction. In Proc of ICPhS, XV Intl Conference of Phonetic Sciences (pp. 2453-2456). Barcelona, Spain. [pdf]Beskow, J. (2003). Talking heads - Models and applications for multimodal speech synthesis. Doctoral dissertation, KTH.Beskow, J., Engwall, O., & Granström, B. (2003). Resynthesis of Facial and Intraoral Articulation from Simultaneous Measurements. In Solé, M., Recasens, D., & Romero, J. (Eds.), Proceedings of the 15th ICPhS (pp. 431-434). Barcelona, Spain. [pdf]Beskow, J., Engwall, O., & Granström, B. (2003). Simultaneous measurements of facial and intraoral articulation. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 57-60). [pdf]Blomberg, M., & Elenius, D. (2003). Collection and recognition of children's speech in the PF-Star project. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 81-84). [pdf]Carlson, R., & Swerts, M. (2003). Perceptually based prediction of upcoming prosodic breaks in spontaneous Swedish speech materials. In Proc of ICPhS, XV Intl Conference of Phonetic Sciences (pp. 79-82). Barcelona, Spain. [pdf]Carlson, R., & Swerts, M. (2003). Relating perceptual judgments of upcoming prosodic breaks to F0 features. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 181-184). [pdf]Cerrato, L., & D'Imperio, M. (2003). Duration and tonal characteristics of short expressions in Italian.. In Proc 15th ICPhS. Cerrato, L., & Paoloni, A. (2003). Utilizzo dei parametri della fonetica acustica nell'identificazione del parlante in ambito forense. In Cosi, P., Magno Caldognetto, E., & Zamboni, A. (Eds.), Voce, Canto Parlato, Studi in Onore di Franco Ferrero (pp. 59-66). Unipress.Cerrato, L., & Skhiri, M. (2003). A method for the analysis and measurement of communicative head movements in human dialogues. In Proc of AVSP 2003, ISCA Tutorial and Research Workshop on Audio Visual Speech Processing (pp. 251-256). St Jorioz, France.Engwall, O. (2003). A revisit to the application of MRI to the analysis of speech production. Testing our assumptions. In 6th Intl Seminar on Speech Production (pp. 43-48). Sydney. [pdf]Engwall, O. (2003). Combining MRI, EMA & EPG measurements in a three-dimensional tongue model. Speech Communication, 41, 303-329. [pdf]Engwall, O., & Beskow, J. (2003). Resynthesis of 3D tongue movements from facial data. In Proc EuroSpeech 2003 (pp. 2261-2264). [pdf]Engwall, O., & Beskow, J. (2003). The effect of corpus choice on statistical articulatory modeling. 
In 7th Intl Seminar on Speech Production (pp. 49-54). Sydney. [pdf]Granqvist, S. (2003). Computer methods for voice analysis. Doctoral dissertation, KTH/TMH.Granström, B., & House, D. (2003). Multimodality and speech technology: Verbal and non-verbal communication in talking agents. In Proc of EuroSpeech 2003 (pp. 2901-2904). Geneva, Switzerland. [pdf]Heldner, M. (2003). On the reliability of overall intensity and spectral emphasis as acoustic correlates of focal accents in Swedish. Journal of Phonetics, 31(1), 39-62.Heldner, M., & Megyesi, B. (2003). Exploring the prosody-syntax interface in conversations. In Proc 15th ICPhS, XV Intl Conference of Phonetic Sciences (pp. 2501-2504). Barcelona, Spain. [pdf]Heldner, M., & Megyesi, B. (2003). The acoustic and morpho-syntactic context of prosodic boundaries in dialogs. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 117-120). [pdf]Hincks, R. (2003). Pronouncing the academic word list: Features of L2 student oral presentations. In Proc of ICPhS, XV Intl Conference of Phonetic Sciences (pp. 1545-1549). Barcelona, Spain. [abstract] [pdf]Abstract: This paper is an analysis of lexical choices, pronunciation errors, and discourse features found in a corpus of student presentation speech. The speakers were Swedish natives studying Technical English. Particular emphasis is given to the pronunciation of the words most often used in academic texts. 93% of words used in the corpus came from the most frequent 2570 lexemes of academic written English, 99% of all words were acceptably pronounced, disfluencies occurred at relatively stable inter-student rates, and 30% of all new sentences began with the conjunction ‘and’.Hincks, R. (2003). Speech technologies for pronunciation feedback and evaluation. ReCALL, 1(15), 3-21. [abstract] [link]Abstract: Educators and researchers in the acquisition of L2 phonology have called for empirical assessment of the progress students make after using new methods for learning (Chun, 1998, Morley, 1991). The present study investigated whether unlimited access to a speech-recognition-based language learning program would improve the general standard of pronunciation of a group of middle-aged immigrant professionals studying English in Sweden. Eleven students were given a copy of the program Talk to Me from Auralog as a supplement to a 200-hour course in Technical English, and were encouraged to practise on their home computers. Their development in spoken English was compared with a control group of fifteen students who did not use the program. The program is evaluated in this paper according to Chapelle’s (2001) six criteria for CALL assessment. Since objective human ratings of pronunciation are costly and can be unreliable, our students were pre- and posttested with the automatic PhonePass SET-10 test from Ordinate Corp. Results indicate that practice with the program was beneficial to those students who began the course with a strong foreign accent but was of limited value for students who began the course with better pronunciation. The paper begins with an overview of the state of the art of using speech recognition in L2 applications.Hincks, R. (2003). Tutors, tools and assistants for the L2 user. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 173-176). [abstract] [pdf]Abstract: This paper explores the concept of a speech checker for use in the production of oral presentations. 
The speech checker would be specially adapted speaker-dependent software to be used as a tool in the rehearsal of a presentation. The speech checker would localize mispronounced words and words unlikely to be known by an audience, give feedback on the speaker’s prosody, and provide a friendly face to listen to the presentation.House, D. (2003). Hesitation and interrogative Swedish intonation. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 185-188). [pdf]House, D. (2003). Perceiving question intonation: the role of pre-focal pause and delayed focal peak. In Proc of ICPhS, XV Intl Conference of Phonetic Sciences (pp. 755-758). Barcelona, Spain. [pdf]House, D. (2003). Perception of tone with particular reference to tonal alignment. In Kaji, S. (Ed.), Proc of the International Symposium on Crosslinguistic Studies of Tonal Phenomena 2002. Tokyo: Institute for Languages and Cultures of Asia and Africa, Tokyo University of Foreign Studies.Jande, P-A. (2003). Evaluating rules for phonological reduction in Swedish. In Proceedings of Fonetik (pp. 149-152). [pdf]Jande, P-A. (2003). Phonological reduction in Swedish. In Proceedings of the International Conference of Phonetic Sciences (ICPhS) (pp. 2557-2560). Barcelona, Spain. [pdf]Karlsson, I. (2003). The SYNFACE project - a status report. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 61-64). [pdf]Karlsson, I., Faulkner, A., & Salvi, G. (2003). SYNFACE - a talking face telephone. In Proc of EuroSpeech 2003 (pp. 1297-1300). Geneva, Switzerland. [abstract] [pdf]Abstract: The SYNFACE project has as its primary goal to facilitate for hearing-impaired people to use an ordinary telephone. This will be achieved by using a talking face connected to the telephone. The incoming speech signal will govern the speech movements of the talking face, hence the talking face will provide lip-reading support for the user. The project will define the visual speech information that supports lip-reading, and develop techniques to derive this information from the acoustic speech signal in near real time for three different languages: Dutch, English and Swedish. This requires the development of automatic speech recognition methods that detect information in the acoustic signal that correlates with the speech movements. This information will govern the speech movements in a synthetic face and synchronise them with the acoustic speech signal. A prototype system is being constructed. The prototype contains results achieved so far in SYNFACE. This system will be tested and evaluated for the three languages by hearing-impaired users. SYNFACE is an IST project (IST-2001-33327) with partners from the Netherlands, UK and Sweden. SYNFACE builds on experiences gained in the Swedish Teleface project.Magnuson, T., & Hunnicutt, S. (2003). Support for the construction of sentences and phrases for symbol users. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 85-88). [pdf]Nordstrand, M., Svanfeldt, G., Granström, B., & House, D. (2003). Measurements of articulatory variation and communicative signals in expressive speech. In Proc of AVSP'03, ISCA Tutorial and Research Workshop on Audio Visual Speech Processing (pp. 233-238). St Jorioz, France. [pdf]Pakucs, B. (2003). SesaME: A framework for personalized and adaptive speech interfaces. In Proc of the EACL-03 Workshop on Dialogue Systems: interaction, adaptation and styles of management (pp. 95-102). Budapest. 
[pdf]Pakucs, B. (2003). Towards dynamic multi-domain dialogue processing. In Proc of EuroSpeech 2003 (pp. 741-744). Geneva, Switzerland. [pdf]Salvi, G. (2003). Accent clustering in Swedish using the Bhattacharyya distance. In Proceedings of the International Congress of Phonetic Sciences (ICPhS) (pp. 1149-1152). Barcelona, Spain. [abstract] [pdf]Abstract: In an attempt to improve automatic speech recognition (ASR) models for Swedish, accent variations were considered. These have proved to be important variables in the statistical distribution of the acoustic features usually employed in ASR. The analysis of feature variability have revealed phenomena that are consistent with what is known from phonetic investigations, suggesting that a consistent part of the information about accents could be derived form those features. A graphical interface has been developed to simplify the visualization of the geographical distributions of these phenomena.Salvi, G. (2003). Truncation Error and Dynamics in Very Low Latency Phonetic Recognition. In Proceedings of Non Linear Speech Processing (NOLISP). Le Croisic, France. [abstract] [pdf]Abstract: The truncation error for a two-pass decoder is analyzed in a problem of phonetic speech recognition for very demanding latency constraints (look-ahead length < 100ms) and for applications where successive refinements of the hypotheses are not allowed. This is done empirically in the framework of hybrid MLP/HMM models. The ability of recurrent MLPs, as a posteriori probability estimators, to model time variations is also considered, and its interaction with the dynamic modeling in the decoding phase is shown in the simulations.Salvi, G. (2003). Using accent information in ASR models for Swedish. In Proceedings of European Conference on Speech Communication and Technology (Eurospeech) (pp. 2677-2680). Geneva, Switzerland. [abstract] [pdf]Abstract: In this study accent information is used in an attempt to improve acoustic models for automatic speech recognition (ASR). First, accent dependent Gaussian models were trained independently. The Bhattacharyya distance was then used in conjunction with agglomerative hierarchical clustering to define optimal strategies for merging those models. The resulting allophonic classes were analyzed and compared with the phonetic literature. Finally, accent “aware” models were built, in which the parametric complexity for each phoneme corresponds to the degree of variability across accent areas and to the amount of training data available for it. The models were compared to models with the same, but evenly spread, overall complexity showing in some cases a slight improvement in recognition accuracy.Seward, A. (2003). Efficient methods for automatic speech recognition. Doctoral dissertation, Department of Speech, Music and Hearing, KTH.Seward, A. (2003). Low-latency incremental speech transcription in the Synface project. In Proc of EuroSpeech 2003 (pp. 1141-1144). Geneva, Switzerland. [pdf]Siciliano, C., Williams, G., Beskow, J., & Faulkner, A. (2003). Evaluation of a Multilingual Synthetic Talking Face as a communication Aid for the Hearing Impaired. In Proc of ICPhS, XV Intl Conference of Phonetic Sciences (pp. 131-134). Barcelona, Spain. [pdf]Sjölander, K. (2003). An ensemble method for the alignment of sound and its transcription. In Proc of SMAC 03, Stockholm Music Acoustics Conference (pp. 743-746). Sjölander, K. (2003). An HMM-based system for automatic segmentation and alignment of speech. 
In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 93-96). [pdf]Skantze, D., & Dahlbäck, N. (2003). Auditory icons support for navigation in speech-only interfaces for room-based design metaphors. In Proc of ICAD 2003, Intl Community for Auditory Display (pp. 140-143). Boston University, USA. [pdf]Skantze, G. (2003). Exploring human error handling strategies: implications for spoken dialogue systems. In Proceedings of ISCA Tutorial and Research Workshop on Error Handling in Spoken Dialogue Systems (pp. 71-76). Chateau-d'Oex-Vaud, Switzerland. [pdf]Svanfeldt, G., Nordstrand, M., Granström, B., & House, D. (2003). Measurements of articulatory variation in expressive speech. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics, PHONUM 9 (pp. 53-56). [pdf]Wik, P. (2003). A Cognitive Science Approach To A Human-Dolphin Dialog Protocol. In 17th International Conference of the European Cetacean Society. Las Palmas, Canary Islands.Wik, P. (2003). Building Common Ground: Communication Across Species Barriers. In First International Conference on Acoustic Communication by Animals. Maryland, USA.Öster, A-M., House, D., Hatzis, A., & Green, P. (2003). Testing a new method for training fricatives using visual maps in the Ortho-Logo-Paedia project (OLP). In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 89-92). [pdf]2002Beskow, J., Edlund, J., & Nordstrand, M. (2002). Specification and realisation of multimodal output in dialogue systems. In Proc of ICSLP 2002 (pp. 181-184). Denver, Colorado, USA. [abstract] [pdf]Abstract: We present a high level formalism for specifying verbal and nonverbal output from a multimodal dialogue system. The output specification is XML-based and provides information about communicative functions of the output without detailing the realisation of these functions. The specification can be used to control an animated character that uses speech and gestures. We give examples from an implementation in a multimodal spoken dialogue system, and describe how facial gestures are implemented in a 3D animated talking agent within this system.Beskow, J., Granström, B., & House, D. (2002). A multimodal speech synthesis tool applied to audio-visual prosody. In Keller, E., Bailly, G., Monaghan, A., Terken, J., & Huckvale, M. (Eds.), Improvements in Speech Synthesis (pp. 372-382). New York: John Wiley & Sons, Inc.Carlson, R., Granström, B., Heldner, M., House, D., Megyesi, B., Strangert, E., & Swerts, M. (2002). Boundaries and groupings - the structuring of speech in different communicative situations: a description of the GROG project. In Proc of Fonetik 2002 (pp. 65-68). Stockholm.Cerrato, L. (2002). A comparison between feedback strategies in Human-to-Human and Human-Machine communication. In Proc of ICSLP-2002 (pp. 557-560). Denver, Colorado, USA. [pdf]Cerrato, L. (2002). A Study of Verbal Feedback in Italian.. In Proc of NORDTALK Symposium on Relations between Utterances. Cerrato, L., & Ekeklint, S. (2002). Different ways of ending human-machine interactions. In Proc of AAMAS02 Workshop. Bologna. [pdf]Cerrato, L., & Skhiri, M. (2002). Quantifying non-verbal communicative behaviour in face-to-face human dialogues. In First Pan-American/Iberian Meeting on Acoustics. Cancun, Mexico.Edlund, J., Beskow, J., & Nordstrand, M. (2002). GESOM - A model for describing and generating multi-modal output. In Proc of ISCA Workshop Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany. 
[abstract] [pdf]Abstract: This paper describes GESOM, a model for generation of generalised, high-level multi-modal dialogue system output. It aims to let dialogue systems generate output for various output devices and modalities with a minimum of changes to the output generation of the dialogue system. The model was developed and tested within the AdApt spoken dialogue system, from which the bulk of the examples in this paper are taken.Edlund, J., & Nordstrand, M. (2002). Turn-taking gestures and hour-glasses in a multi-modal dialogue system. In Proc of ISCA Workshop Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany. [abstract] [pdf]Abstract: An experiment with 24 subjects was performed. The subjects were split in three groups, and asked to extract information from the AdApt spoken dialogue system for somewhat over 30 minutes per subject. The system configuration varied in that one group had turn-taking gestures from an animated talking head, another had an hourglass symbol to signal when the system was busy, and the third had no turn-taking feedback at all. The results show that although the hourglass setup showed no decrease in efficiency compared to the facial gestures, it made the subjects less satisfied. The lack of turn-taking feedback was noticed and mentioned by half of the subjects in that group.Elenius, D., & Blomberg, M. (2002). Characteristics of a low reject mode speaker verification system. In Proc of ICSLP 2002 (pp. 1385-1388). Denver, Colorado, USA. [pdf]Engwall, O. (2002). Evaluation of a system for concatenative articulatory visual speech synthesis. In Proc of ICSLP 2002 (pp. 665-668). Denver, Colorado, USA. [pdf]Engwall, O. (2002). Tongue Talking - Studies in Intraoral Speech Synthesis. Doctoral dissertation, KTH. [pdf]Fant, G., Kruckenberg, A., Gustafson, K., & Liljencrants, J. (2002). A new approach to intonation analysis and synthesis of Swedish. In Proc of Fonetik 2002 (pp. 161-64). Stockholm.Fant, G., Kruckenberg, A., Gustafson, K., Liljencrants, J., & Botinis, A. (2002). Individual variations in prominence correlates. Some observations from lab-speech. In Proc of Fonetik 2003 (pp. 177-180). Stockholm.Graf, H. .., Cosatto, E., Strom, V., & Huang, F. J. (2002). Visual prosody: Facial movements accompanying speech. In Fifth IEEE International Conference on Automatic Face and Gesture Recognition, 2002. Proceedings (pp. 396-401). Granström, B., House, D., & Beskow, J. (2002). Speech and gestures for talking faces in conversational dialogue systems. In Granström, B., House, D., & Karlsson, I. (Eds.), Multimodality in language and speech systems (pp. 209-241). Dordrecht: Kluwer Academic Publishers.Granström, B., House, D., & Karlsson, I. (2002). Multimodality in language and speech systems. Dordrecht: Kluwer Academic Publishers.Granström, B., House, D., & Swerts, M. G. (2002). Multimodal feedback cues in human-machine interactions. In Bel, B., & Marlien, I. (Eds.), Proc of the Speech Prosody 2002 Conference (pp. 347-350). Aix-en-Provence: Laboratoire Parole et Langage.Gustafson, J. (2002). Developing multimodal spoken dialogue systems. Empirical studies of spoken human-computer interaction. Doctoral dissertation, KTH. [pdf]Gustafson, J., Bell, L., Boye, J., Edlund, J., & Wiren, M. (2002). Constraint manipulation and visualization in a multimodal dialogue system. In Proc of the ISCA Workshop Multi-Modal Dialogue in Mobile Environments. Kloster, Irsee, Germany. 
[abstract] [pdf]Abstract: When interacting with spoken and multimodal dialogue systems, it is often difficult for users to understand and influence how their input is processed by the system. In this paper, we describe how these problems were addressed in the multimodal real-estate dialogue system AdApt. During the course of a dialogue, the user's contraints are translated into symbolic icons that are visualized on the screen and can be manipulated by drag-and-drop operations. Users are thus given a clear picture of how their utterances are understood, and are given a transparent means of controlling the interaction with the system.Gustafson, J., & Sjölander, K. (2002). Voice transformations for improving children's speech recognition in a publicly available dialogue system. In Proc of ICSLP 2002 (pp. 297-300). Denver, Colorado, USA. [pdf]Gustafson, K., & House, D. (2002). Prosodic parameters of a "fun" speaking style. In Keller, E., Bailly, G., Monaghan, A., Terken, J., & Huckvale, M. (Eds.), Improvements in Speech Synthesis (pp. 264-272). New York: John Wiley & Sons, Inc.Gustafson-Capkova, S., & Megyesi, B. (2002). Silence and discourse context in read speech and dialogues in Swedish. In Bel, B., & Marlien, I. (Eds.), Proc of Speech Prosody 2002 Conference (pp. 363-366). Aix-en-Provence: Laboratoire Parole et Langage. [pdf]Hincks, R. (2002). Speech recognition for language teaching and evaluating: a study of existing products. In Proceedings of ICSLP 2002 (pp. 733-736). Denver, Colorado, USA. [abstract] [pdf]Abstract: Educators and researchers in the acquisition of L2 phonology have called for empirical assessment of the progress students make after using new methods for learning [1], [2]. This study investigated whether unlimited access to a speech-recognition-based language learning program would improve the general goodness of pronunciation of a group of middle-aged immigrant professionals studying English in Sweden. Eleven students were given a copy of the program Talk to Me by Auralog as a supplement to a 200-hour course in Technical English, and were encouraged to practice on their home computers. Their development in spoken English was compared with a control group of fifteen students who did not receive software. Talk to Me uses speech recognition to provide conversational practice, phonetic instruction, visual feedback on prosody and scoring of pronunciation. A significant limitation of commercial systems currently available is their inability to diagnose specific articulatory problems. In this course in Technical English, however, students also met at regular intervals with a pronunciation tutor who could steer the student in the right direction for finding the most important sections to practice for his or her particular problems. Students reported high satisfaction with the software and used it for an average of 12.5 hours. Students were pre- and post-tested with the automatic PhonePass SET-10 test from Ordinate Corp. Results indicate that practice with the program was beneficial to those students who began the course with a strong foreign accent but that students who began the course with intermediate pronunciation did not show the same improvement.House, D. (2002). Intonational and visual cues in the perception of interrogative mode in Swedish. In Proc of ICSLP 2002 (pp. 1957-1960). Denver, Colorado, USA. [pdf]House, D. (2002). The interaction of pitch range and temporal alignment in the perception of interrogative mode in Swedish. In Hawkins, S., & Nguyen, N. 
(Eds.), Temporal Integration in the Perception of Speech. University of Cambridge.House, D., & Granström, B. (2002). Multimodal speech synthesis: Improving information flow in dialogue systems using 3D talking heads. In Artificial Intelligence: Methodology, Systems, and Applications, 10th International Conference, AIMSA 2002 (pp. 38362). Berlin: Springer-Verlag.Hunnicutt, S., & Magnuson, T. (2002). Sentences for symbol users: email and echat. In Proc of ISAAC «02, 10th Biennial Conference of the International Society for Augmentative and Alternative Communication (pp. 467-468). Odense, Denmark.Johansson, M., Blomberg, M., Elenius, K., Hoffsten, L-E., & Torberger, A. (2002). A phoneme recognizer for the hearing impaired. In Proc. of ICSLP'2002 (pp. 433-436). Denver, Colorado, USA. [pdf]Karnebäck, S. (2002). Expanded examinations of a low frequency modulation feature for speech/music discrimination. In Proc of ICSLP 2002 (pp. 2009-2012). Denver, Colorado, USA. [pdf]Magnuson, T. (2002). Assessment of writing difficulties and evaluation of computerized writing support. Licentiate dissertation, KTH.Megyesi, B. (2002). Data-Driven Syntactic Analysis - Methods and Applications for Swedish. Doctoral dissertation, KTH.Megyesi, B. (2002). Shallow parsing with pos taggers and linguistic knowledge. Journal of Machine Learning Research: Special Issue on Shallow Parsing, 2, 639-668. [pdf]Megyesi, B., & Carlson, R. (2002). Data-driven methods for building a Swedish Treebank. In Proceedings of the Swedish Treebank Symposium. Växjö University, Sweden. [pdf]Megyesi, B., & Gustafson-Capkova, S. (2002). Production and perception of pauses and their linguistic context in read and spontaneous speech in Swedish. In Proc of ICSLP'2002 (pp. 2153-2156). Denver, Colorado, USA. [pdf]Pakucs, B. (2002). VoiceXML-based dynamic plug and play dialogue management for mobile environments. In Proc of the ISCA Workshop Multi-Modal Dialogue in Mobile Environments. Kloster, Irsee, Germany. [pdf]Rydin, S. (2002). Building a hyponymy lexicon with hierarchical structure. In Proc of the SIGLEX Workshop on Unsupervised Lexical Acquisition, ACL'02 (pp. 26-33). [pdf]Skantze, G. (2002). Coordination of referring expressions in multimodal human-computer dialogue. In Proceedings of ICSLP 2002 (pp. 553-556). Denver, Colorado, USA. [abstract] [pdf]Abstract: This study examines coordination of referring expressions in multimodal human-computer dialogue, i.e. to what extent users’ choices of referring expressions are affected by the referring expressions that the system is designed to use. An experiment was conducted, using a semi-automatic multimodal dialogue system for apartment seeking. The user and the system could refer to areas and apartments on an interactive map by means of speech and pointing gestures. Results indicate that the referring expressions of the system have great influence on the user’s choice of referring expressions, both in terms of modality and linguistic content. From this follows a number of implications for the design of multimodal dialogue systems.Wik, P. (2002). Building Bridges: A Cognitive Science Approach To A Human-Dolphin Dialog Protocol. Master's thesis, University of Oslo. [pdf]Öster, A-M., House, D., Protopapas, A., & Hatzis, A. (2002). A presentation of a new EU project for speech therapy: OLP (Ortho-Logo-Paedia). In Proc of Fonetik 2002 (pp. 45-48). Stockholm. [pdf]2001Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., & Stent, A. (2001). Towards conversational human-computer interaction. 
AI Magazine, 22, 27-37.Beaugendre, F., House, D., & Hermes, D. J. (2001). Accentuation boundaries in Dutch, French and Swedish. Speech Communication, 33, 305-318.Bell, L., Boye, J., & Gustafson, J. (2001). Real-time handling of fragmented utterances. In Proceedings of NAACL 2001 Workshop: Adaptation in Dialogue Systems. Pittsburgh, PA. [abstract] [pdf]Abstract: In this paper, we discuss an adaptive method of handling fragmented user utterances to a speech-based multimodal dialogue system. Inserted silent pauses between fragments present the following problem: Does the current silence indicate that the user has completed her utterance, or is the silence just a pause between two fragments, so that the system should wait for more input? Our system incrementally classifies user utterances as either closing (more input is unlikely to come) or non-closing (more input is likely to come), partly depending on the current dialogue state. Utterances that are categorized as non-closing allow the dialogue system to await additional spoken or graphical input before responding.Engwall, O. (2001). Considerations in intraoral visual speech synthesis: Data and modelling. In Proc of 4th Intl Speech Motor Conf (pp. 23-26). Nijmegen. [pdf]Engwall, O. (2001). Making the tongue model talk: Merging MRI & EMA Measurements. In Proc of Eurospeech 2001 (pp. 261-264). Aalborg. [pdf]Engwall, O. (2001). Synthesising static vowels and dynamic sounds using a 3D vocal tract model. In Proc of 4th ISCA Tutorial and Research Workshop on Speech Synthesis (pp. 38-41). Perthshire. [pdf]Engwall, O. (2001). Using linguopalatal contact patterns to tune a 3D tongue model. In Proc of Eurospeech 2001 (pp. 1475-1478). Aalborg. [pdf]Granström, B., House, D., Beskow, J., & Lundeberg, M. (2001). Verbal and visual prosody in multimodal speech perception. In van Dommelen, W., & Fretheim, T. (Eds.), Nordic Prosody 2000: Proc of VIII Conf (pp. 77-88). Trondheim, Norway.Granström, B., House, D., & Swerts, M. (2001). Multimodal feedback cues in human-machine interactions. In COST258 Meeting. Maastricht, The Netherlands.Gustafson, K., & House, D. (2001). Children's evaluation of expressive synthesis: A web-based experiment. In COST258 Meeting. Prague.Gustafson, K., & House, D. (2001). Expressive synthesis for children, a web-based evaluation. In Karlsson, A., & van de Weijer, J. (Eds.), Papers from Fonetik 2001 (pp. 50-53). Gustafson, K., & House, D. (2001). Fun or boring? A web-based evaluation of expressive synthesis for children. In Proc of Eurospeech 2001 (pp. 565-568). Aalborg, Denmark. [pdf]Gustafson-Capkova, S., & Megyesi, B. (2001). A comparative study of pauses in dialogues and read speech. In Proc of Eurospeech 2001 (pp. 931-934). Aalborg, Denmark. [pdf]Heldner, M. (2001). Focal accent - F0 movements and beyond. Doctoral dissertation, Umeå University.Heldner, M. (2001). On the non-linear lengthening of focally accented Swedish words. In van Dommelen, W., & Fretheim, T. (Eds.), Nordic Prosody: Proc of the VIIIth Conference (pp. 103-112). Trondheim. [pdf]Heldner, M. (2001). Spectral emphasis as an additional source of information in accent detection. In Bacchiani, M., Hirschberg, J., Litman, D., & Ostendorf, M. (Eds.), Prosody 2001: ISCA Tutorial and Research Workshop on Prosody in Speech Recognition and Understanding (pp. 57-60). Red Bank, NJ. [pdf]Heldner, M., & Strangert, E. (2001). Temporal effects of focus in Swedish. Journal of Phonetics, 29(3), 329-361.Hirsch, H-G. (2001). HMM adaptation for applications in telecommunication. 
Speech Communication, 34, 127-139.House, D. (2001). Focal accent in Swedish: perception of rise properties for Accent 1. In van Dommelen, W., & Fretheim, T. (Eds.), Nordic Prosody 2000: Proc of VIII Conf (pp. 127-136). Trondheim, Norway.House, D., Beskow, J., & Granström, B. (2001). Interaction of visual cues for prominence. In Karlsson, A., & van de Weijer, J. (Eds.), Papers from Fonetik 2001 (pp. 62-65). House, D., Beskow, J., & Granström, B. (2001). Timing and interaction of visual cues for prominence in audiovisual speech perception. In Proc of Eurospeech 2001 (pp. 387-390). Aalborg, Denmark. [pdf]Hunnicutt, S., & Carlberger, J. (2001). Markov models and heuristic methods improve a word prediction program. J Augmentative and Alternative Communication, 17, 255-264.Hunnicutt, S., & Magnuson, T. (2001). Linguistic structures for email and echat. In Karlsson, A., & van de Weijer, J. (Eds.), Fonetik 2001 (pp. 66-69). Örenäs, Sweden.Jande, P-A. (2001). Stress patterns in Swedish lexicalised phrases. In Proceedings of Fonetik (pp. 70-73). Lund, Sweden. [pdf]Karnebäck, S. (2001). Discrimination between speech and music based on a low frequency modulation feature. In Proc of Eurospeech 2001 (pp. 1891-1894). [pdf]Megyesi, B. (2001). Comparing data-driven learning algorithms for PoS tagging of Swedish. In Proc of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2001) (pp. 151-158). Carnegie Mellon University, Pittsburgh, PA, USA. [pdf]Megyesi, B. (2001). Data-driven methods for PoS tagging and chunking of Swedish. In 13th Nordic Conference of Computational Linguistics, NoDaLiDa 2001. Uppsala, Sweden. [pdf]Megyesi, B. (2001). Phrasal parsing by datadriven PoS taggers. In Proc of the Conf on Recent Advances in Natural Language Processing, Euro Conf RANLP-2001 (pp. 166-173). Tzigov Chark, Bulgaria. [pdf]Megyesi, B., & Gustafson-Capkova, S. (2001). Pausing in dialogues and read speech: Speaker's production and listeners interpretation. In Proc of the Workshop on Prosody in Speech Recognition and Understanding (pp. 107-113). New Jersey, USA. [pdf]Nordqvist, P., & Leijon, A. (2001). Automatic assessment of Hearing Environments. In Second McMaster-Gennum Workshop on Intelligent Hearing Instruments. Niagara-on-the-Lake, ON.Pakucs, B., & Melin, H. (2001). PER: A speech based automated entrance receptionist. In 13th Nordic Conference of Computational Linguistics, NoDaLiDa'01. Uppsala University, Uppsala.Raghavendra, P., Rosengren, E., & Hunnicutt, S. (2001). An investigation of different degrees of dysarthric speech as input to speaker dependent and speaker-adaptive recognition systems. J Augmentative and Alternative Communication, 17, 265-275.Seward, A. (2001). Transducer optimizations for tight-coupled decoding. In Proc of Eurospeech 2001 (pp. 1607-1610). [pdf]Sjölander, K. (2001). Automatic alignment of phonetic segments. In Karlsson, A., & van de Weijer, J. (Eds.), Papers from Fonetik 2001 (pp. 140-143). 2000Bannert, R., Botinis, A., Bruce, G., Engstrand, O., Granström, B., Lindblad, P., & Strangert, E. (2000). Phonetics in Sweden 2000. In Botinis, A., & Torstensson, N. (Eds.), Fonetik 2000, Proc of the Swedish Phonetics Conference (pp. 1-8). Skövde.Bell, L. (2000). Linguistic adaptations in spoken and multimodal dialogue systems. Licentiate dissertation, KTH.Bell, L., Boye, J., Gustafson, J., & Wirén, M. (2000). Modality convergence in a multimodal dialogue system. In Poesio, M., & Traum, D. 
(Eds.), Proc of Götalog 2000, Fourth Workshop on the Semantics and Pragmatics of Dialogue (pp. 29-34). Gothenburg. [pdf]Bell, L., Eklund, R., & Gustafson, J. (2000). A comparison of disfluency distribution in a unimodal and a multimodal speech interface. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 626-629). Beijing. [pdf]Bell, L., & Gustafson, J. (2000). Positive and negative user feedback in a spoken dialogue corpus. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 589-592). Beijing. [pdf]Berthelsen, H., & Megyesi, B. (2000). Ensemble of Classifiers for Noise Detection in PoS Tagged Corpora. In Proc of the Third International Workshop on TEXT, SPEECH and DIALOGUE (pp. 27-32). Springer-Verlag, Berlin.Beskow, J., Granström, B., House, D., & Lundeberg, M. (2000). Experiments with verbal and visual conversational signals for an automatic language tutor. In Delcloque, P., & Bramoullé, A. (Eds.), Proc of InSTIL 2000 (pp. 138-142). University of Abertay Dundee, Dundee, Scotland.Beskow, J., Granström, B., House, D., & Lundeberg, M. (2000). Verbal and visual prosody in multimodal speech perception.. In Nordic Prosody VIII. Bimbot, F., Blomberg, M., Boves, L., Genoud, D., Hutter, H-P., Jaboulet, C., Koolwaaij, J., Lindberg, J., & Pierrot, J-B. (2000). An overview of the CAVE project research activities in speaker verification. Speech Comm, 31, 155-180.Boye, J., Hockey, B. A., & Rayner, M. (2000). Asynchronous dialogue management: Two case-studies. In Poesio, M., & Traum, D. (Eds.), Proc of Götalog 2000, Fourth Workshop on the Semantics and Pragmatics of Dialogue (pp. 51-55). Gothenburg.Bruce, G., Filipsson, M., Frid, J., Granström, B., Gustafson, K., Horne, M., & House, D. (2000). Modelling of Swedish Text and Discourse Intonation in a Speech Synthesis Framework. In Botinis, A. (Ed.), Intonation: Analysis, Modelling and Technology (pp. 291-320). Dordrecht: Kluwer Academic Publishers.Carlson, R., & House, D. (2000). Prosodic aspects of Swedish question words in computer-directed spontaneous speech.. In Nordic Prosody VIII. Eineborg, M., & Lindberg, N. (2000). ILP in part-of-speech tagging. An overview. In Cussens, J., & Dzeroski, S. (Eds.), Learning Language in Logic Workshop (LLL99) (pp. 157-169). Springer.Elenius, K. (2000). Experiences from collecting two Swedish telephone speech databases.. Int Journal of Speech Technology, 3, 119-127.Engwall, O. (2000). A 3D tongue model based on MRI data. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 901-904). Beijing. [pdf]Engwall, O. (2000). Are static MRI representative of dynamic speech? Results from a comparative study using MRI, EPG, and EMA. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 17-20). Beijing. [pdf]Engwall, O., & Badin, P. (2000). An MRI study of Swedish fricatives: coarticulatory effects. In Hoole, P. (Ed.), Proc of 5th Speech Production Seminar: Models and data (pp. 297-300). Kloster Seeon, Germany. [pdf]Granström, B., House, D., Beskow, J., & Lundeberg, M. (2000). Verbal and visual prosody in multimodal speech perception. In Proc 4th Swedish Symposium on Multimodal Communication. Gustafson, J., & Bell, L. (2000). Speech Technology on Trial: Experiences from the August System.. Natural Language Engineering, 6(Special issue on Best Practice in Spoken Dialogue Systems). 
[pdf]Gustafson, J., Bell, L., Beskow, J., Boye, J., Carlson, R., Edlund, J., Granström, B., House, D., & Wirén, M. (2000). AdApt - a multimodal conversational dialogue system in an apartment domain. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc. of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 134-137). Beijing: China Military Friendship Publish. [abstract] [pdf]Abstract: A general overview of the AdApt project and the research that is performed within the project is presented. In this project various aspects of human-computer interaction in a multimodal conversational dialogue system are investigated. The project will also include studies on the integration of user/system/dialogue dependent speech recognition and multimodal speech synthesis. A domain in which multimodal interaction is highly useful has been chosen, namely, finding available apartments in Stockholm. A Wizard-of-Oz data collection within this domain is also described.Heldner, M. (2000). Is non-linear lengthening important for the perceived naturalness of focal accented Swedish words?. In Botinis, A., & Torstensson, N. (Eds.), Fonetik 2000, Proc of the Swedish Phonetics Conference (pp. 69-72). Skövde. [pdf]Hennebert, J., Melin, H., Petrovska, D., & Genoud, D. (2000). POLYCOST: A telephone-speech database for speaker recognition. Speech Comm, 2-3(31), 265-270.House, D. (2000). Perception of focal accent in Swedish. How necessary is the rise for Accent 1?. In Nordic Prosody VIII. House, D. (2000). Rise alignment in the perception of focal accent and pitch in Swedish. In Botinis, A., & Torstensson, N. (Eds.), Fonetik 2000, Proc of the Swedish Phonetics Conference (pp. 73-76). Skövde.Johansen, F. T., Warakagoda, N., Lindberg, B., Lehtinen, G., Kačič, Z., Žgank, A., Elenius, K., & Salvi, G. (2000). The COST 249 SpeechDat multilingual reference recogniser. In Gavrilidou, M., Carayannis, G., Markantonatou, S., Piperidis, S., & Stainhaouer, G. (Eds.), Proc. of LREC 2000, 2nd Intl Conf on Language Resources and Evaluation (pp. 1351-1356). Athens, Greece. [pdf]Karlsson, I., Banziger, T., Johnstone, T., Lindberg, J., Melin, H., Nolan, F., & Scherer, K. (2000). Speaker verification with elicited speaking-styles in the VeriVox project. Speech Comm, 2(31), 121-129.Leijon, A., & Nordqvist, P. (2000). Finite-state modelling for specification of non-linear hearing instruments. In International Hearing Aid Research. Lake Tahoe, CA.Lindberg, B., Johansen, F. T., Warakagoda, N., Lehtinen, G., Kačič, Z., Žgank, A., Elenius, K., & Salvi, G. (2000). A noise robust multilingual reference recogniser based on SpeechDat(II). In Proc of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 370-373). Beijing. [pdf]Lindberg, J., & Blomberg, M. (2000). On the potential threat of using large speech corpora for impostor selection in speaker verification. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc. of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 258-261). Beijing.Lindberg, N. (2000). Data driven methods in natural language processing - Two applications. Licentiate dissertation, KTH/TMH.Magnuson, T., & Blomberg, M. (2000). Acoustic analysis of dysarthric speech. In Botinis, A., & Torstensson, N. (Eds.), The Swedish Phonetics Conf (pp. 105-108). Skövde. [pdf]Mariethoz, J., Lindberg, J., & Bimbot, F. (2000). A MAP approach, with synchronous decoding and unit-based normalization for text-dependent speaker verification.. In Proc of ICSLP 2000. Massaro, D. W., Cohen, M., Beskow, J., & Cole, R. (2000). 
Developing and Evaluating Conversational Agents. In Cassell, J., et al. (Eds.), Embodied Conversational Agents. Cambridge, MA: MIT Press.Melin, H. (2000). Databases for Speaker Recognition: Activities in COST250 Working Group 2. Technical Report, European Commission DG-XIII, Brussels, COST 250 - Speaker Recognition in Telephony. [pdf]Pakucs, B., & Gambäck, B. (2000). Designing a system for Swedish spoken document retrieval. In Nordgård, T. (Ed.), Proc of NODALIDA-99, 12th Nordic Conference in Computational Linguistics (pp. 162-174). Trondheim, Norway.Rosengren, E., Magnuson, T., Hunnicutt, S., & Blomberg, M. (2000). Analysis of dysarthric speech for use with speech recognition. In Proc of ISAAC '00, 9th Biennial Conf of the Intl Society for Augmentative and Alternative Communication (pp. 64-66). Washington, DC, USA.Seward, A. (2000). A tree-trellis N-best decoder for stochastic context-free grammars. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 282-285). Beijing.Sjölander, K., & Beskow, J. (2000). WaveSurfer - an open source speech tool. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proceedings of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 464-467). Beijing. [pdf]Wissing, D., Gustafson, K., & Coetzee, A. (2000). Temporal organisation in some varieties of South African English. In Proc Workshop on Black South African English, Intl Conf on Linguistics in Southern Africa (pp. 59-68). Cape Town, South Africa.Öhman, T. (2000). Vision in speech technology. Automatic measurements of visual speech and audiovisual intelligibility of synthetic and natural faces. Licentiate dissertation.1999Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). A Synthetic Face as a Lip-reading Support for Hearing Impaired Telephone Users - Problems and Positive Results. In Proceedings of the 4th European Conference on Audiology. Oulu, Finland.Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). Two methods for Visual Parameter Extraction in the Teleface Project. In Proceedings of Fonetik. Gothenburg, Sweden.Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1999). Artificial video for Hearing-Impaired Telephone Users; A comparison with the No Video and Perfect Video Conditions. In Bühler, C., & Knops, H. (Eds.), Assistive Technology on the Threshold of the New Millennium (pp. 116-121). IOS Press.Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1999). ASR controlled synthetic face as a lipreading support for hearing impaired telephone users. In Cost249 meeting. Prague, Czech Republic.Agelfors, E., Beskow, J., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). Synthetic visual speech driven from auditory speech. In Proceedings of Audio-Visual Speech Processing (AVSP). Santa Cruz, USA. [pdf]Bell, L., & Gustafson, J. (1999). Interaction with an animated agent: an analysis of a Swedish database of spontaneous computer directed speech. In Proc of Eurospeech '99 (pp. 1143-1146). Budapest, Hungary. [pdf]Bell, L., & Gustafson, J. (1999). Repetition and its phonetic realizations: Investigating a Swedish database of spontaneous computer directed speech. In Proceedings of ICPhS-99 (pp. 1221-1224). [pdf]Bell, L., & Gustafson, J. (1999). 
Repetition in a Swedish database of spontaneous computer-directed speech.. In Andersson, R., Abelin, Å., Allwood, J., & Lindblad, P. (Eds.), Proc of Fonetik 99 (pp. 15-18). Bell, L., & Gustafson, J. (1999). Utterance types in the August database.. In The Third Swedish Symposium on Multimodal Communication. Bell, L., & Gustafson, J. (1999). Utterance types in the August System. In Proc from IDS '99. [pdf]Bell, L., Gustafson, K., House, D., & Johansson, L. (1999). Children's evaluation of prosody in speech synthesis. In Andersson, R., Abelin, Å., Allwood, J., & Lindblad, P. (Eds.), Proc of Fonetik 99 (pp. 19-22). Bengtsson, B., Burgoon, J., Cederberg, C., Bonito, J., & Lundeberg, M. (1999). The impact of anthropomorphic interfaces on influence, understanding and credibility.. In Proc 32nd Hawaii Intl Conf on System Sciences. Bimbot, F., Blomberg, M., Boves, L., Chollet, G., Jaboulet, C., Jacob, B., Kharroubi, J., Koolwaaij, J., Lindberg, J., Mariethoz, J., Mokbel, C., & Mokbel, H. (1999). An overview of the PICASSO project research activities in speaker verification for telephone applications. In Proc of Eurospeech 99 (pp. 1963-1967). [pdf]Blomberg, M. (1999). Within-utterance correlation for speech recognition. In Proc of Eurospeech 99 (pp. 2479-2482). Blomberg, M. (1999). Within-utterance correlation in automatic speech recognition.. In Proc of Fonetik 99 (pp. 23-26). [pdf]Carlberger, A. (1999). Grammar and lexicons for a speech-interfaced knowledge-based engineering program (ICAD). In Loncke, F., Clibbens, J., & Lloyd, L. (Eds.), AAC: New directions in research and practice, The ISAAC Research Symposium on Natural Language Processing. Dublin: Whurr Publ, London.Carlberger, A. (1999). Nparse - A Shallow N-Gram-Based Grammatical-Phrase Parser. In Proc of Eurospeech 99 (pp. 2067-2070). Damper, R. I., Marchand, Y., Anderson, M., & Gustafson, K. (1999). Evaluating the pronunciation component of text-to-speech systems for English: a performance comparison of different approaches.. Comput Speech Language, 13(2), 155-176.Elenius, K. (1999). Experiences from building two large telephone speech databases for Swedish.. In Proc of ICPhS-99 (pp. 1741-1744). Elenius, K. (1999). Two Swedish SpeechDat databases - some experiences and results. In Proc of Eurospeech 99 (pp. 2243-2246). Elenius, K. (1999). Two Swedish telephone speech databases.. In Proc of Fonetik 99 (pp. 45-48). Engwall, O. (1999). Modeling of the vocal tract in three dimensions. In Proc of Eurospeech 99 (pp. 113-116). Budapest. [pdf]Falcone, M., Melin, H., & Ariyaeeinia, A. (1999). Speaker Recognition Assessment and Dissemination: Activities in COST 250 Working Group 4.. In Proc COST250 Workshop on Speaker Recognition in Telephony.. Granström, B. (1999). Multi-modal speech synthesis with applications.. In Chollet, G., Di Benedetto, M., Esposito, A., & Marinaro, M. (Eds.), Speech Processing, Recognition and Artificial Neural Networks. Proc 3rd Intl School on Neural Nets Eduardo R. Caianiello (pp. 327-346). London: Springer-Verlag Ltd.Granström, B., House, D., & Lundeberg, M. (1999). Eyebrow movements as a cue to prominence.. In The Third Swedish Symposium on Multimodal Communication. Granström, B., House, D., & Lundeberg, M. (1999). Prosodic cues in multi-modal speech perception.. In Proc of ICPhS-99 (pp. 655-658). Granström, B., House, D., & Lundeberg, M. (1999). Visual prominence in multimodal speech perception.. In Proc of Fonetik 99 (pp. 61-64). Gustafson, J., Lindberg, N., & Lundeberg, M. (1999). The August spoken dialogue system. 
In Proc of Eurospeech 99 (pp. 1151-1154). [pdf]Gustafson, J., Lindberg, N., & Lundeberg, M. (1999). The August spoken dialogue system.. In The Third Swedish Symposium on Multimodal Communication. Gustafson, J., Lundeberg, M., & Liljencrants, J. (1999). Experiences from the development of August - a multimodal spoken dialogue system.. In Proc from IDS '99 (pp. 61-64). [pdf]Gustafson, J., Sjölander, K., Beskow, J., Granström, B., & Carlson, R. (1999). Creating web-based exercises for spoken language technology. In Tutorial session in proceedings of IDS'99 (pp. 165-168). [pdf]Gustafson, K., & House, D. (1999). Prosodic parameters of a "fun" speaking style.. In COST 258: The Budapest Meeting, COST Working Papers. Heldner, M., Strangert, E., & Deschamps, T. (1999). A focus detector using overall intensity and high frequency emphasis. In Proceedings of ICPhS-99 (pp. 1491-1494). [pdf]Heldner, M., Strangert, E., & Deschamps, T. (1999). Focus detection using overall intensity and high frequency emphasis. In Proc of Fonetik 99 (pp. 73-76). [pdf]House, D. (1999). Perception of pitch and tonal timing: implications for mechanisms of tonogenesis. In Proc of ICPhS-99 (pp. 1823-1826). House, D., Bell, L., Gustafson, K., & Johansson, L. (1999). Child-directed speech synthesis: evaluation of prosodic variation for an educational computer program. In Proc of Eurospeech 99 (pp. 1843-1846). [pdf]Hunnicutt, S., Carlberger, A., Rosengren, E., Talbot, N., & Bickley, C. (1999). Draw a circle! Spoken man-machine communication for design.. In The Third Swedish Symposium on Multimodal Communication. Hunnicutt, S., Carlson, R., Carlberger, A., & Rosengren, E. (1999). TIDE-ENABL-projektet: Sofistikerad design med hjälp av talförståelse.. In Proc of Konferensen Människa-Handikapp-Livsvillkor Rendez-vous (pp. 161-162). Karlsson, I. (1999). Within-speaker variability in the VeriVox database. In Andersson, R., Abelin, Å., Allwood, J., & Lindblad, P. (Eds.), Proc of Fonetik 99 (pp. 93-96). [pdf]Karlsson, I., & Thornton, S. (1999). Really natural Speech Generation poses some tough challenges. Computer Telephony Europe, August 1999.Liljencrants, J. (1999). Judges of prominence. In Proc of Fonetik 99 (pp. 101-107). Lindberg, J., & Blomberg, M. (1999). Vulnerability in speaker verification. A study of technical impostor techniques. In Proc of Eurospeech 99 (pp. 1211-1214). [pdf]Lindberg, N., & Eineborg, M. (1999). Improving part of speech disambiguation rules by adding linguistic knowledge.. In Dzeroski, S., & Flach, P. A. (Eds.), Inductive Logic Programming (pp. 186-197). Berlin: Springer Verlag.Lindberg, N., & Eineborg, M. (1999). Improving POS tagging by adding heuristic background knowledge.. In Proc of ILP '99. Lundeberg, M., & Beskow, J. (1999). Developing a 3D-agent for the August dialogue system. In Proc of AVSP 99. [pdf]Massaro, D., Beskow, J., Cohen, M., Fry, C., & Rodriguez, T. (1999). Picture my voice: Audio to visual speech synthesis using artificial neural networks. In Proc of AVSP 99 (pp. 133-138). [pdf]Massaro, D., Cohen, M., & Beskow, J. (1999). From Theory to Practice: Rewards and Challenges. In Proc of ICPhS. [pdf]Melin, H. (1999). Databases for Speaker Recognition: Activities in COST250 Working Group 2.. In COST250 Workshop on Speaker Recognition in Telephony. Rom. [pdf]Melin, H., & Lindberg, J. (1999). Variance flooring, scaling and tying for text-dependent speaker verification. In Proc of Eurospeech 99 (pp. 1975-1978). [pdf]Parkvall, M., & Edlund, J. (1999). The creolist archives. 
Journal of Pidgin and Creole Languages, 14(2), 347-350.Salvi, G. (1999). Developing acoustic models for automatic speech recognition in Swedish. The European Student Journal of Language and Speech. [abstract] [pdf]Abstract: This paper is concerned with automatic continuous speech recognition using trainable systems. The aim of this work is to build acoustic models for spoken Swedish. This is done employing hidden Markov models and using the SpeechDat database to train their parameters. Acoustic modeling has been worked out at a phonetic level, allowing general speech recognition applications, even though a simplified task (digits and natural number recognition) has been considered for model evaluation. Different kinds of phone models have been tested, including context independent models and two variations of context dependent models. Furthermore many experiments have been done with bigram language models to tune some of the system parameters. System performance over various speaker subsets with different sex, age and dialect has also been examined. Results are compared to previous similar studies showing a remarkable improvement.Svensson, E-L. (1999). Pronunciation in KTH's text-to-speech system.. In Proc of Fonetik 99 (pp. 129-132). Öhman, T., & Lundeberg, M. (1999). Differences in speechreading a synthetic and a natural face.. In Proc of ICPhS-99. 1998Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). Synthetic faces as a lipreading support. In Proceedings of ICSLP'98. [pdf]Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). Teleface - the use of a synthetic face for the hard of hearing.. In Proc of IVTTA'98. Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). The synthetic face from a hearing impaired view. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 200-203). Stockholm University. [html]Beskow, J. (1998). A tool for teaching and development of parametric speech synthesis. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 162-165). Stockholm University. [pdf]Bickley, C., Carlson, R., Cudd, P., Hunnicutt, S., Reimers, B., & Whiteside, S. (1998). Enabler for engineering software using language and speech. In Proc RESNA. Bimbot, F., Hutter, H-P., Jaboulet, C., Koolwaaij, J., Lindberg, J., & Pierrot, J-B. (1998). An overview of the CAVE project research activities in speaker verification.. In Proc of RLA2C (pp. 215-220). [ps]Blomberg, M. (1998). Speech recognition using long-distance relations in an utterance. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (p. 166, abstract; full paper in link). [pdf]Bruce, G., Frid, J., Granström, B., Gustafson, K., Horne, M., & House, D. (1998). Prosodic segmentation and structuring of dialogue. In Werner, S. (Ed.), Nordic Prosody, Proceedings of the VIIth Conference (pp. 63-72). Carlberger, A. (1998). Lexicons and grammar for speech recognition in an engineering design program (ICAD). In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 172-175). Stockholm University.Carlberger, J., & Hunnicutt, S. (1998). A probabilistic word prediction program. In Proc of RESNA '98 (pp. 50-52). Minneapolis, Minn.Carlson, R., Granström, B., Gustafson, J., Lewin, E., & Sjölander, K. (1998). 
Hands-on speech technology on the Web. In Proceedings of Elsnet in Wonderland (pp. 30-36). Cohen, M., Beskow, J., & Massaro, D. (1998). Recent developments in facial animation: an inside view. In Proc of AVSP'98. [pdf]Cole, R., Carmell, T., Conners, P., Macon, M., Wouters, J., de Villiers, J., Tarachow, A., Massaro, D., Cohen, M., Beskow, J., Yang, J., Meier, U., Waibel, A., Stone, P., Fortier, G., Davis, A., & Soland, C. (1998). Intelligent Animated Agents for Interactive Language Training. In Proc of STiLL - ESCA Workshop on Speech Technology in Language Learning. Dybkjær, L., Bernsen, N-O., Carlson, R., Chase, L., Dahlbäck, N., Failenschmid, K., Heid, U., Heisterkamp, P., Jönsson, A., Kamp, H., Karlsson, I., van Kuppevelt, J., Lamel, L., Paroubek, P., & Williams, D. (1998). Comparative evaluation of letter-to-sound conversion techniques for English text-to-speech synthesis. In Proc of the First International Conference on Language Resources and Evaluation. Eineborg, M., & Lindberg, N. (1998). Induction of constraint grammar-rules using Progol. In Proceedings of ILP -98, The Eighth International Conference on Inductive Logic Programming. Madison, Wisconsin.Engwall, O. (1998). A 3D vocal tract model for articulatory and visual speech synthesis. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 196-199). Stockholm University.Gawronska, B., & House, D. (1998). Information extraction and text generation of news reports for a Swedish-English bilingual spoken dialogue system. In Proceedings ICSLP 98, Fifth International Conference on Spoken Language Processing (pp. 1139-1142). Sidney, Australia. [pdf]Gustafson, J., Elmberg, P., Carlson, R., & Jönsson, A. (1998). An educational dialogue system with a user controllable dialogue manager. In Proc of ICSLP98, Intl Conference on Spoken Language Processing (pp. 33-36). Sydney, Australia. [pdf]Gustafson, J., & Sjölander, K. (1998). Educational tools for speech technology. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 176-179). Stockholm University.Heldner, M. (1998). Is an F0-rise a necessary or a sufficient cue to perceived focus in Swedish?. In Nordic Prosody: Proc of the VIIth Conference (pp. 109-125). [pdf]Heldner, M., & Strangert, E. (1998). On the amount and domain of focal lengthening in Swedish two-syllable words.. In Proc of FONETIK 98 (pp. 154-157). [pdf]Hermes, D., Beaugendre, F., & House, D. (1998). Individual differences in accentuation boundaries in Dutch. IPO Annual Progress Report, Eindhoven, 32, 131-138.House, D., Hermes, D., & Beaugendre, F. (1998). Perception of tonal rises and falls for accentuation and phrasing in Swedish. In Proceedings ICSLP 98, Fifth International Conference on Spoken Language Processing (pp. 2799-2802). Sydney, Australia. [pdf]House, D., Hermes, D., & Beaugendre, F. (1998). Perception of tonal rises and falls for accentuation and phrasing: Swedish listener results. In Proceedings of Fonetik ’98 (pp. 146-149). Stockholm University.Hunnicutt, S., Rosen, K., Rosengren, E., & Carlberger, A. (1998). TIDE-ENABL: Access to knowledge-based engineering through speech recognition. In Proc of ISAAC Conference (pp. 462-463). Dublin, Ireland.Karlsson, I., Banziger, T., Dankovicov, J., Johnstone, T., Lindberg, J., Melin, H., Nolan, F., & Scherer, K. (1998). Speaker verification with elicited speaking-styles in the VeriVox project. 
In Proc of RLA2C, La Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques (Speaker Recognition and its Commercial and Forensic Applications) (pp. 207-210). Avignon, France. [pdf]Karlsson, I., Banziger, T., Dankovicov, J., Johnstone, T., Lindberg, J., Melin, H., Nolan, F., & Scherer, K. (1998). Within speaker variation due to induced stress. In Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 150-153). Stockholm University.Karlsson, I., Banziger, T., Dankovicov, J., Johnstone, T., Lindberg, J., Melin, H., Nolan, F., & Scherer, K. (1998). Within-speaker variability due to speaking manners. In Mannell, R., & Robert-Ribes, J. (Eds.), Proc of ICSLP98, Intl Conference on Spoken Language Processing (pp. 2379-2382). Sydney, Australia. [pdf]Langlais, P., Öster, A-M., Carlson, R., & Granström, B. (1998). Automatic detection of mispronunciation in non-native Swedish speech. In Proc of STiLL98, ESCA-Workshop on Speech Technology in Language Learning (pp. 41-44). Marholmen, Sweden.Langlais, P., Öster, A-M., & Granström, B. (1998). . In Proc of ICSLP98, Intl Conference on Spoken Language Processing (pp. 1743-1746). Sydney, Australia.Lee, T., Carlson, R., & Granström, B. (1998). Context-dependent duration modeling for continuous speech recognition. In Proc of ICSLP98, Intl Conference on Spoken Language Processing (pp. 2955-2958). Sydney, Australia.Lindberg, J., Koolwaaij, J., Hutter, H-P., Genoud, D., Pierrot, J-B., Blomberg, M., & Bimbot, F. (1998). Techniques for a priori decision threshold estimation in speaker verification. In Proc of RLA2C, La Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques (Speaker Recognition and its Commercial and Forensic Applications) (pp. 89-92). Avignon, France. [pdf]Lindberg, N., & Eineborg, M. (1998). Learning constraint grammar-style disambiguation rules using inductive logic programming. In Proc of COLING-ACL -98 (pp. 775-779). Montréal, Quebec, Canada.Lindberg, N., & Eineborg, M. (1998). Learning part of speech disambiguation rules using inductive logic programming. In Keller, B. (Ed.), Proceedings of the ESSLLI -98 Workshop on Automated Acquisition of Syntax and Parsing (pp. 1-8). Saarbrücken Summer School.Melin, H. (1998). On word boundary detection in digit-based speaker verification. In Proc of RLA2C, La Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques (Speaker Recognition and its Commercial and Forensic Applications) (pp. 46-49). Avignon, France. [pdf]Melin, H. (1998). Optimizing Variance Flooring In HMM-Based Speaker Verification. In COST250 report. Ankara, Turkey. [pdf]Melin, H., Koolwaaij, J., Lindberg, J., & Bimbot, F. (1998). A comparative evaluation of variance flooring techniques in HMM-based speaker verification. In Proc of ICSLP98, 5th Intl Conference on Spoken Language Processing (pp. 1903-1906). Sydney, Australia. [pdf]Nord, L., Hunnicutt, S., & Rosengren, E. (1998). ENABL - Access to design by speech recognition: analysis of dysarthric voices. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 82-85). Stockholm University.Nordström, T., Melin, H., & Lindberg, J. (1998). A comparative study of speaker verification systems using the POLYCOST database. In Proc of ICSLP98, 5th Intl Conference on Spoken Language Processing (pp. 1359-1362). Sydney, Australia. [pdf]Oviatt, S., MacEachern, M., & Levow, G. A. (1998). Predicting Hyperarticulate Speech During Human-Computer Error Resolution.
Speech Communication, 24(2), 87-110.Petrovska, D., Hennebert, J., Melin, H., & Genoud, D. (1998). POLYCOST: A telephone-speech database for speaker recognition. In Proc of RLA2C, La Reconnaissance du Locuteur et ses Applications Commerciales et Criminalistiques (Speaker Recognition and its Commercial and Forensic Applications) (pp. 211-214). [pdf]Pierrot, J-B., Lindberg, J., Koolwaaij, J., Hutter, H-P., Genoud, D., Blomberg, M., & Bimbot, F. (1998). A comparison of a priori threshold setting procedures for speaker verification in the CAVE project. In Proc of ICASSP98, Intl Conference on Acoustics, Speech and Signal Processing (pp. 125-128). Seattle, Wash.Rosengren, E., & Hunnicutt, S. (1998). Evaluations in VAESS. Programs for touchscreen use: voices and emotions. In Proc of ISAAC (pp. 490-491). Dublin, Ireland.Salvi, G. (1998). Developing acoustic models for automatic speech recognition. Master's thesis, KTH, TMH, CTT. [pdf]Sjölander, K., Beskow, J., Gustafson, J., Lewin, E., Carlson, R., & Granström, B. (1998). Web-based educational tools for speech technology. In Proc of ICSLP98, 5th Intl Conference on Spoken Language Processing (pp. 3217-3220). Sydney, Australia.Strangert, E., & Heldner, M. (1998). On the amount and domain of focal lengthening in Swedish. In Proc of ICSLP98 (pp. 3305-3308). [pdf]Strangert, E., & Heldner, M. (1998). On the amount and domain of focal lengthening in two-syllable and longer Swedish words. In Proc of FONETIK 98 (pp. 134-137). [pdf]Sundström, A. (1998). Automatic prosody modification as a means for foreign language pronunciation training. In Proc of STiLL98, ESCA Workshop on Speech Technology in Language Learning (pp. 49-52). [pdf]Tamura, M., Masuko, T., Kobayashi, T., & Tokuda, K. (1998). Visual Speech Synthesis Based on Parameter Generation from HMM: Speech-Driven and Text-And-Speech-Driven Approaches. In Proc. of Auditory-Visual Speech Processing (AVSP'98) (pp. 221-226). [abstract]Abstract: This paper describes a technique for synthesizing synchronized lip movements from auditory input speech signal. The technique is based on an algorithm for parameter generation from HMM with dynamic features, which has been successfully applied to text-to-speech synthesis. Audio-visual speech unit HMMs, namely, syllable HMMs are trained with parameter vector sequences that represent both auditory and visual speech features. Input speech is recognized using the syllable HMMs and converted into a transcription and a state sequence. A sentence HMM is constructed by concatenating the syllable HMMs corresponding to the transcription for the input speech. Then an optimum visual speech parameter sequence is generated from the sentence HMM in ML sense. Since the generated parameter sequence reflects statistical information of both static and dynamic features of several phonemes before and after the current phonemes, synthetic lip motion becomes smooth and realistic. We show experimental results which demonstrate the effectiveness of the proposed technique.1997Beaugendre, F., House, D., & Hermes, D. (1997). Accentuation boundaries in Dutch, French and Swedish. In Botinis, A., Kouroupetroglou, G., & Carayannis, G. (Eds.), Proceedings of ESCA Tutorial and Research Workshop on Intonation: Theory, Models and Applications (pp. 43-46). Athens, Greece. [pdf]Bertenstam, J., Granström, B., Gustafson, K., Hunnicutt, S., Karlsson, I., Meurlinger, C., Nord, L., & Rosengren, E. (1997). The VAESS communicator: a portable communication aid with new voice types and emotions.
In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 57-60). Beskow, J. (1997). Animation of talking agents. In Benoit, C., & Campbell, R. (Eds.), Proc of ESCA Workshop on Audio-Visual Speech Processing (pp. 149-152). Rhodes, Greece. [pdf]Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1997). The Teleface project - disability, feasibility and intelligibility. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 85-88). [pdf]Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1997). The Teleface project - Multimodal speech communication for the hearing impaired. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 2003-2006). Rhodes, Greece. [pdf]Beskow, J., Elenius, K., & McGlashan, S. (1997). OLGA - A dialogue system with an animated talking agent. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 1651-1654). Rhodes, Greece. [pdf]Beskow, J., Elenius, K., & McGlashan, S. (1997). The OLGA project: An animated talking agent in a dialogue system. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Phonum 4 (pp. 69-72). Lövånger/Umeå. [pdf]Beskow, J., & McGlashan, S. (1997). OLGA - A conversational agent with gestures. In André, E. (Ed.), Proc of the IJCAI -97 Workshop on Animated Interface Agents: Making them Intelligent (pp. 39-44). Nagoya, Japan. [pdf]Bimbot, F., Hutter, H-P., Jaboulet, C., Koolwaaij, J., Lindberg, J., & Pierrot, J-B. (1997). Speaker verification in the telephone network: research activities in the CAVE project. In Proc EUROSPEECH'97 (pp. 971-974). [pdf]Blomberg, M. (1997). Creating unseen triphones by phone concatenation in the spectral, cepstral and formant domains. In Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 41-44). [pdf]Blomberg, M. (1997). Creating unseen triphones by phone concatenation of diphones and monophones in the spectral, cepstral and formant domains. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 1187-1190). Rhodes, Greece. [ps]Bruce, G., Filipsson, M., Frid, J., Granström, B., Gustafson, K., Horne, M., & House, D. (1997). Global features in the modelling of intonation in spontaneous Swedish. In Botinis, A., Kouroupetroglou, G., & Carayannis, G. (Eds.), Proc of ESCA workshop on Intonation: Theory, Models and Applications (pp. 59-62). Athens. [pdf]Bruce, G., Filipsson, M., Frid, J., Granström, B., Gustafson, K., Horne, M., & House, D. (1997). Modelling intonation in spontaneous speech. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 173-174). Bruce, G., Filipsson, M., Frid, J., Granström, B., Gustafson, K., Horne, M., & House, D. (1997). Text-to-intonation in spontaneous Swedish. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech 97, 5th European Conference on Speech Communication and Technology (pp. 215-218). Rhodes, Greece. [pdf]Bruce, G., & Granström, B. (1997). Prosodic modelling in Swedish speech synthesis. In Fant, G., Hirose, K., & Kiritani, S.
(Eds.), Analysis, Perception and Processing of Spoken Language. Festschrift for Hiroya Fujisaki. (pp. 62-73). Amsterdam, The Netherlands: Elsevier Science B.V.Bruce, G., Granström, B., Gustafson, K., Horne, M., House, D., & Touati, P. (1997). On the analysis of prosody in interaction. In Sagisaka, Y., Campbell, N., & Higuchi, N. (Eds.), Computing Prosody. Computational Models for Processing Spontaneous Speech. New York: Springer.Carlberger, A. (1997). Profet. A new generation of word prediction: An evaluation study. In Copestake, A., Langer, S., & Palazuelos-Cagigas, S. (Eds.), Proc of a Workshop on Natural Language Processing (NLP) for Communication Aids (pp. 23-28). Madrid, Spain.Carlberger, A., Lewin, E., Nord, L., Rosengren, E., Ström, N., Carlson, R., & Hunnicutt, S. (1997). ENABL - access to design by speech recognition. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 89-92). Carlberger, A., Magnuson, T., Carlberger, J., Wachtmeister, H., & Hunnicutt, S. (1997). Probability-based word prediction for writing support in dyslexia. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 17-20). Carlson, R., & Granström, B. (1997). Dialogue management and multi-modal output in the Waxholm spoken dialogue system. In Luperfoy, S. (Ed.), Automated Spoken Dialog Systems. MIT Press.Carlson, R., & Granström, B. (1997). Speech synthesis. In Hardcastle, W. J., & Laver, J. (Eds.), The Handbook of Phonetic Sciences (pp. 768-788). Oxford: Blackwell Publ. Ltd.Damper, R., & Gustafson, K. (1997). Evaluating the pronunciation component of a text-to-speech system. In Speech and Language Technology (SALT) Club Workshop on Evaluation in Speech and Language Technology (pp. 72-79). D’Imperio, M., & House, D. (1997). Perception of questions and statements in Neapolitan Italian. In Proceedings of Eurospeech 97, 5th European Conference on Speech Communication and Technology (pp. 251-254). Rhodes, Greece. [pdf]Eriksson, M., & Gambäck, B. (1997). SVENSK: A toolbox of Swedish language processing resources. In Proc of the 2nd Intl Conf on Recent Advances in Natural Language Processing (RANLP '97) (pp. 336-341). Tzigov Chark, Bulgaria.Fant, G., Hertegård, S., Kruckenberg, A., & Liljencrants, J. (1997). Covariation of subglottal pressure, F0 and glottal parameters. In Proc Eurospeech '97 (pp. 453-456). Granström, B. (1997). Applications of intonation - An overview. In Botinis, A., Kouroupetroglou, G., & Carayannis, G. (Eds.), Proc of ESCA workshop on Intonation: Theory, Models and Applications (pp. 21-24). Athens.Gustafson, J. (1997). How do system questions influence lexical choices in user answers? In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 49-52). Gustafson, J., Larsson, A., Carlson, R., & Hellman, K. (1997). How do system questions influence lexical choices in user answers? In Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 2275-2278). Rhodes, Greece. [pdf]Heldner, M. (1997). The contribution of pitch movements to perceived focus. In PHONUM 4 (pp. 109-112). Umeå: Department of Phonetics, Umeå University.Heldner, M. (1997). To what extent is perceived focus determined by F0-cues? In Eurospeech '97 Proceedings (pp. 875-877). Rhodes, Greece: ESCA.Hermes, D., Beaugendre, F., & House, D. (1997).
Temporal alignment of accentuation boundaries in Dutch. In Botinis, A., Kouroupetroglou, G., & Carayannis, G. (Eds.), Proceedings of ESCA Tutorial and Research Workshop on Intonation: Theory, Models and Applications (pp. 177-180). Athens, Greece. [pdf]Hirsch, H-G. (1997). Adaptation of HMMs in the presence of additive and convolutional noise. In Furui, S., Juang, B., & Chou, W. (Eds.), Proc IEEE Workshop on Automatic Speech Recognition and Understanding. House, D. (1997). Perceptual thresholds and tonal categories. In Phonum 4 (pp. 179-182). Department of Phonetics, University of Umeå.House, D., Hermes, D., & Beaugendre, F. (1997). Temporal-alignment categories of accent-lending rises and falls. In Proceedings of Eurospeech 97, 5th European Conference on Speech Communication and Technology (pp. 879-882). Rhodes, Greece. [pdf]Högberg, J. (1997). Data driven formant synthesis. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 565-568). Rhodes, Greece.Lieske, C., Bos, J., Emele, M., Gambäck, B., & Rupp, C. J. (1997). Giving prosody a meaning. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 1431-1434). Rhodes, Greece.Lindberg, J., Blomberg, M., & Melin, H. (1997). CAVE - Speaker verification in bank and telecom services. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 65-68). [ps]Lindberg, J., & Melin, H. (1997). Text-Prompted versus Sound-Prompted Passwords in Speaker Verification Systems. In Proc EUROSPEECH'97 (pp. 851-854). [pdf]Melin, H., & Lindberg, J. (1997). Guidelines for experiments on the POLYCOST database. In Proc of COST250 Workshop on Application of speaker recognition techniques in telephony (pp. 59-69). Vigo, Spain.Melin, H., & Lindberg, J. (1997). Prompting of passwords in speaker verification systems. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 45-48). [pdf]Rosengren, E., & Hunnicutt, S. (1997). VAESS-projektet: Ett nytt portabelt kommunikationshjälpmedel med nyutvecklade röster och känslouttryck. In Forskningskonferensen Människa, Handikapp, Livsvillkor (pp. 341-343). Örebro.Sjölander, K., & Gustafson, J. (1997). An integrated system for teaching spoken dialogue systems technology. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 1927-1930). Rhodes, Greece. [pdf]Sjölander, K., & Högberg, J. (1997). Trying to improve phone and word recognition using finely tuned phone-like units. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 125-128). Sjölander, K., & Högberg, J. (1997). Using extended question sets in decision tree clustering for acoustic modelling. In Furui, S., Juang, B-H., & Chou, W. (Eds.), Proc. of IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 179-184). Santa Barbara, Calif.Ström, N. (1997). A tonotopic artificial neural network architecture for phoneme probability estimation. In Furui, S., Juang, B-H., & Chou, W. (Eds.), Proc of IEEE Workshop on Automatic Speech Recognition and Understanding (pp. 156-163). Ström, N. (1997).
Automatic continuous speech recognition with rapid speaker adaptation for human/machine interaction. Doctoral dissertation, KTH/TMH.Ström, N. (1997). Phoneme probability estimation with dynamic sparsely connected artificial neural networks. The Free Speech Journal, 1(5), 1-41.Ström, N. (1997). Sparse connection and pruning in large dynamic artificial neural networks. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech 97 (pp. 2807-2810). Öhman, T. (1997). Measuring visual speech. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum 4 (pp. 53-56). 1996Blomberg, M., & Elenius, K. (1996). Creation of unseen triphones from diphones and monophones using a speech production approach. In Proc of ICSLP-96, 4th Intl Conference on Spoken Language Processing (pp. 2316-2319). Philadelphia, USA. [pdf]Bruce, G., Filipsson, M., Frid, J., Granström, B., Gustafson, K., Horne, M., House, D., Lastow, B., & Touati, P. (1996). Developing the modelling of Swedish prosody in spontaneous dialogue. In Proc of ICSLP 96 (pp. 370-373). [pdf]Bruce, G., Frid, J., Granström, B., Gustafson, K., Horne, M., & House, D. (1996). Prosodic segmentation and structuring of dialogue. In Proc Nordisk Prosodi VII. Joensuu, Finland.Bruce, G., & Granström, B. (1996). Prosodic modelling in Swedish speech synthesis. In Fant, G., Hirose, K., & Kiritani, S. (Eds.), Analysis, Perception and Processing of Spoken Language. Festschrift for Hiroya Fujisaki. (pp. 62-73). Amsterdam: Elsevier.Båvegård, M. (1996). Towards an articulatory speech synthesizer: Model development and simulations. Licentiate dissertation, KTH/TMH.Carlson, R. (1996). The dialog component in the Waxholm system. In LuperFoy, S., Nijholt, A., & Veldhuijzen van Zanten, G. (Eds.), Proc of Twente Workshop on Language Technology. Dialogue Management in Natural Language Systems (TWLT 11) (pp. 209-218). [pdf]Carlson, R., & Granström, B. (1996). The Waxholm spoken dialogue system. Phonetica Pragensia IX. Charisteria viro doctissimo Premysl Janota oblata. Acta Universitatis Carolinae Philologica 1, 1996. [pdf]Carlson, R., & Hunnicutt, S. (1996). Generic and domain-specific aspects of the Waxholm NLP and Dialog modules. In Proc of ICSLP-96, 4th Intl Conference on Spoken Language Processing (pp. 677-680). Philadelphia, USA. [pdf]Gustafson, J. (1996). A Swedish name pronunciation system. Licentiate dissertation, KTH/TMH. [pdf]Hennebert, J., Melin, H., Genoud, D., & Petrovska Delacretaz, D. (1996). The POLYCOST 250 Database. Technical Report v1.0, COST250 report. [pdf]House, D. (1996). Differential perception of tonal contours through the syllable. In Proceedings ICSLP 96, Fourth International Conference on Spoken Language Processing (pp. 2048-2051). Philadelphia, PA, USA. [pdf]House, D., & Svantesson, J. (1996). Tonal timing and vowel onset characteristics in Thai. In Proceedings of the Fourth International Symposium on Languages and Linguistics (pp. 1:104-113). Bangkok, Thailand.Högberg, J. (1996). Some studies of the relation between speech acoustics, articulation and phonetic structure. Licentiate dissertation, KTH/TMH.Högberg, J., & Sjölander, K. (1996). Cross phone state clustering using lexical stress information. In Proc of ICSLP-96, 4th Intl Conference on Spoken Language Processing (pp. 474-477). Philadelphia, USA. [pdf]Liljencrants, J. (1996). Experiments with analysis by synthesis of glottal airflow. In Proc of ICSLP'96 (pp. 1289-1292). Melin, H. (1996).
Gandalf - A Swedish telephone speaker verification database. In Proc of 4th Intl Conference on Spoken Language Processing, ICSLP-96 (pp. 1954-1957). Philadelphia, USA. [pdf]Melin, H., & Lindberg, J. (1996). Guidelines for experiments on the POLYCOST database. In Proc COST250 Workshop on The Application of Speaker Recognition Technologies in Telephony (pp. 59-69). [pdf]Meng, H., Hunnicutt, S., Seneff, S., & Zue, V. (1996). Reversible letter-to-sound/sound-to-letter generation based on parsing word morphology. Speech Communication, 18, 47-63.1995Anderson, O., Boves, L., Dalsgaard, P., Darsinos, V., Granström, B., Gustafson, J., van den Heuvel, H., Jack, M., Kokkinakis, G., Konst, E., Logothetis, M., Mascarenhas, I., Mengel, A., Molbaek, P., Ottensen, G., Pardo, J., Pirrelli, V., Schmidt, M., Sutherland, A., Trancoso, I., Valverde, F., Viana, C., & Yvon, F. (1995). The Onomastica interlanguage pronunciation lexicon. In Eurospeech (pp. 829–832). Madrid, Spain. [pdf]Horne, M., Strangert, E., & Heldner, M. (1995). Prosodic boundary strength in Swedish: Final lengthening and silent interval duration. In Proceedings ICPhS 95 (pp. 170-173). Stockholm, Sweden.House, D. (1995). Perception of prepausal tonal contours: implications for automatic stylization of intonation. In Eurospeech 1995. Madrid, Spain. [pdf]House, D. (1995). Speech production by adults using cochlear implants. In Plant, G., & Spens, K-E. (Eds.), Profound Deafness and Speech Communication (pp. 285-296). London: Whurr Publishers.House, D. (1995). The influence of silence on perceiving the preceding tonal contour. In Elenius, K., & Branderud, P. (Eds.), Proceedings of ICPhS 95, 13th International Congress of Phonetic Sciences (pp. 1:122-125). Stockholm, Sweden.Strangert, E., & Heldner, M. (1995). Labelling of boundaries and prominences by phonetically experienced and non-experienced transcribers. In PHONUM 3 (pp. 85-109). Umeå: Department of Phonetics, Umeå University.Strangert, E., & Heldner, M. (1995). The labelling of prominence in Swedish by phonetically experienced transcribers. In Proceedings ICPhS 95 (pp. 204-207). Stockholm, Sweden.1994House, D. (1994). Perception and production of mood in speech by cochlear implant users. In Proceedings ICSLP 94, 1994 International Conference on Spoken Language Processing (pp. 2051-2054). Yokohama, Japan. [pdf]Strangert, E., & Heldner, M. (1994). Prosodic labelling and acoustic data. In Working Papers 43: Papers from the Eighth Swedish Phonetics Conference (pp. 120-123). Lund, Sweden.1993Blomberg, M., & Carlson, R. (1993). Automatic labelling of speech given its text representation. In Papers from the Seventh Swedish Phonetics Conference, RUUL 23 (pp. 61-64). Dept. of Linguistics, Uppsala University, Uppsala. [pdf]Heldner, M. (1993). Mustasch på Mona Lisa: Några reflexioner om patafysiken. In Söhrman, I. (Ed.), La culture dans la langue (pp. 71-87). Stockholm, Sweden: Almqvist & Wiksell.Nordebo, S., Bengtsson, B., Claesson, I., Nordholm, S., Roxström, A., Blomberg, M., & Elenius, K. (1993). Noise Reduction Using an Adaptive Microphone Array for Speech Recognition in a Car. In RVK -93. Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann. 1992Deutsch, D. (1992). Some new pitch paradoxes and their implications. Philosophical Transactions: Biological Sciences, 336, 391-397. [pdf]House, D. (1992). Changes in the control of fundamental frequency following activation of a cochlear prosthesis. In Proceedings of the Sixth National Swedish Phonetics Conference, Technical Report no. 10 (pp. 79-82).
Department of Information Theory, Chalmers University of Technology, Gothenburg.House, D. (1992). Perceptual Constraints and Tonal Features. In Dressler, W., Luschutzky, H., Pfeiffer, O., & Rennison, J. (Eds.), Phonologica 1988 (pp. 111-118). Cambridge: Cambridge University Press.House, D., & Willstedt, U. (1992). Can a cochlear implant and voice training improve voice control? Working Papers in Logopedics and Phoniatrics, Department of Logopedics and Phoniatrics, Lund University, 8, 15-33.House, D., & Willstedt, U. (1992). Changes in control of fundamental frequency and voice quality following cochlear implant activation and speech training. In Risberg, A., Felicetti, S., Plant, G., & Spens, K. (Eds.), Proceedings of the 2nd International Conference on Tactile aids, Hearing Aids, & Cochlear Implants (pp. 201-210). Stockholm, Sweden: KTH. [pdf]1991Blomberg, M. (1991). Articulatory inter-timing variation in speech: modelling in a recognition system. In Perilus XIII, Papers from the Fifth Swedish Phonetics Conference (pp. 93-96). Inst. of Linguistics, University of Stockholm. [pdf]Bruce, G., Granström, B., Gustafson, K., & House, D. (1991). On prosodic phrasing in Swedish. Perilus, Institute of Linguistics, University of Stockholm, XIII, 35-38.Bruce, G., Granström, B., Gustafson, K., & House, D. (1991). Prosodic phrasing in Swedish. Working Papers, Department of Linguistics and Phonetics, Lund University, 38, 5-17.Bruce, G., Granström, B., & House, D. (1991). Strategies for prosodic phrasing in Swedish. In Proc. of the Twelfth International Congress of Phonetic Sciences (pp. 4:182-185). Aix-en-Provence, France.Fant, G. (1991). Units of temporal organization. Stress groups versus syllables and words. In Proceedings of ICPhS, Aix-en-Provence, France (pp. 247-250). [pdf]Fant, G., Kruckenberg, A., & Nord, L. (1991). Temporal organization and rhythm in Swedish. In Proceedings of ICPhS, Aix-en-Provence, France (pp. 251-256). [pdf]House, D. (1991). A model of optimal tonal feature perception. In Proc. of the Twelfth International Congress of Phonetic Sciences (pp. 2:102-105). Aix-en-Provence, France.House, D. (1991). Cochlear implants and the perception of mood in speech. British Journal of Audiology, 26, 198.House, D. (1991). On hearing impairments, cochlear implants and the perception of mood in speech. In PERILUS XIII (pp. 125-128). Institute of Linguistics, University of Stockholm. [pdf]House, D., & Willstedt, U. (1991). Fundamental frequency control and voice quality in cochlear implant users. Working Papers, Department of Linguistics and Phonetics, Lund University, 38, 115-132.Summerfield, Q. (1991). Visual perception of phonetic gestures. In Modularity and the Motor Theory of Speech Perception: Proceedings of a Conference to Honor Alvin M. Liberman (pp. 117). 1990Blomberg, M. (1990). Automatic detection of the phoneme boundaries in an utterance given its phonetic representation. In Phonum 1, Fonetik -90, The Fourth Swedish Phonetics Conference (pp. 100-103). Dept. of Phonetics, University of Umeå, Umeå. [pdf]Blomberg, M., Carlson, R., Elenius, K., Granström, B., & Hunnicutt, S. (1990). Word recognition using synthesized templates. In Journal of the American Voice I/O Society (AVIOS), Vol. 8 (pp. 43-57). Bruce, G., Granström, B., & House, D. (1990). Prosodic phrasing in Swedish speech synthesis. In Bailly, G., & Benoit, C. (Eds.), Proc. of the ESCA Workshop on speech synthesis (pp. 125-129). Autrans, France.Bruce, G., & House, D. (1990).
Slutrapport från projektet Prosodisk parsning för igenkänning av svenska. Technical Report, Department of Linguistics and Phonetics, Lund University.House, D. (1990). On the perception of mood in speech. Phonum, 1, 113-116.House, D. (1990). On the perception of mood in speech: implications for the hearing impaired. Working Papers, Department of Linguistics and Phonetics, Lund University, 36, 99-108. [pdf]House, D. (1990). Tonal Perception in Speech. Lund: Lund University Press.House, D., & Bruce, G. (1990). Word and focal accents in Swedish from a recognition perspective. In Wiik, K., & Raimo, I. (Eds.), Nordic Prosody V (pp. 156-173). Turku University.1989House, D. (1989). Automatic recognition of prosodic categories. In Szende, T. (Ed.), Proc. of the Speech Research '89 International Conference (pp. 347-350). Budapest: Linguistics Institute of the Hungarian Academy of Sciences.House, D. (1989). Cues for the perception of mood in speech: implications for the hearing-impaired. British Journal of Audiology, 23, 171.House, D. (1989). Perceptual Constraints and Tonal Features. Working Papers, Department of Linguistics and Phonetics, Lund University, 35, 113-120.1988Blomberg, M. (1988). Synthetic Phoneme Prototypes in Speech Recognition. In Wood, S. (Ed.), Working Papers 34, Papers from the Second Swedish Phonetics Conference (pp. 9-12). Dept. of Linguistics and Phonetics, Lund University. [pdf]Blomberg, M., Elenius, K., Lundström, B., & Neovius, L. (1988). Speech Recognizer for Voice Control of Mobile Telephone. In Wood, S. (Ed.), Working Papers 34, Papers from the Second Swedish Phonetics Conference (pp. 13-16). Dept. of Linguistics and Phonetics, Lund University. [pdf]House, D. (1988). Perceptual constraints and tonal features. In Sixth International Phonology Meeting, Krems (pp. 38). Vienna.House, D., Bruce, G., Eriksson, L., & Lacerda, F. (1988). Recognition of Prosodic Categories in Swedish: Rule Implementation. Working Papers, Department of Linguistics and Phonetics, Lund University, 33, 153-161.House, D., Bruce, G., Eriksson, L., & Lacerda, F. (1988). Recognition of Prosodic Categories in Swedish: Rule Implementation (abridged version). Working Papers, Department of Linguistics and Phonetics, Lund University, 34, 62-65.House, D., & Horne, M. (1988). Reduction Phenomena in Fast Speech: Influence of Dialect and Focus. In Discussion Papers, Sixth International Phonology Meeting, Krems (pp. 1:20-22). Vienna.1987Gårding, E., & House, D. (1987). Production and Perception of Phrases in some Nordic Dialects. In Lilius, P., & Saari, M. (Eds.), The Nordic Languages and Modern Linguistics, 6 (pp. 163-175). Helsinki University Press.House, D. (1987). Implications of spectral changes for categorising tonal patterns in speech. British Journal of Audiology, 21, 116.House, D. (1987). Perception of Tonal Patterns in Speech: Implications for Models of Speech Perception. In Proc. of the Eleventh International Congress of Phonetic Sciences (pp. 1:76-79). Academy of Sciences of the Estonian S.S.R., Tallinn.House, D. (1987). Speech Perception, Intonation and Memory. In RUUL 17, Papers from the Swedish Phonetics Conference (pp. 72-77). Department of Linguistics, Uppsala University.House, D., Bruce, G., Lacerda, F., & Lindblom, B. (1987). Automatic Prosodic Analysis for Swedish Speech Recognition. In Laver, J., & Jack, M. (Eds.), European Conference on Speech Technology, Volume 1 (pp. 215-218). Edinburgh. [pdf]House, D., Bruce, G., Lacerda, F., & Lindblom, B. (1987).
Automatic Prosodic Analysis for Swedish Speech Recognition (extended version). Working Papers, Department of Linguistics and Phonetics, Lund University, 31, 87-101.House, D., Bruce, G., Lacerda, F., & Lindblom, B. (1987). Prosodisk parsning för igenkänning av svenska. In Tal-Ljud-Hörsel 87 (pp. 29-21). Lund University.House, D., & Gårding, E. (1987). Phrasing in some Nordic dialects. In Nordic Prosody IV (pp. 61-70). Odense University Press.1986House, D. (1986). Compensatory Use of Acoustic Speech Cues and Language Structure by Hearing-Impaired Listeners. In Nilsson, L., & Hjelmquist, E. (Eds.), Communication and Handicap - Aspects of Psychological Compensation and Technical Aids (pp. 39-59). Amsterdam: North Holland.1985Gårding, E., & House, D. (1985). Frasintonation, särskilt i svenska. In Svenskans Beskrivning 15 (pp. 205-221). Gothenburg University.House, D. (1985). Betydelsen av spektrala förändringar för kategorisering av tonala mönster. In Tal-Ljud Hörsel 2 (pp. 168). Gothenburg University.House, D. (1985). Implications of Rapid Spectral Changes on the Categorization of Tonal Patterns in Speech Perception. Working Papers, Department of Linguistics and Phonetics, Lund University, 28, 69-89.House, D. (1985). Kompensatoriska perceptionstrategier bland hörselskade lyssnare. In Tal-Ljud Hörsel 2 (pp. 168). Gothenburg University.House, D. (1985). Sentence Prosody and Syntax in Speech Perception. Working Papers, Department of Linguistics and Phonetics, Lund University, 28, 91-107.1984House, D. (1984). Experiments with Filtered Speech and Hearing-Impaired Listeners. Working Papers, Department of Linguistics and Phonetics, Lund University, 27, 101-123.House, D. (1984). Transitioner i talperception: implikationer för perceptionsteori och applikationer i hörselvården. In Tal-Ljud-Hörsel 1 (pp. 108). Stockholm University.1983House, D. (1983). Perceptual Interaction between Fo Excursions and Spectral Cues. Working Papers, Department of Linguistics and Phonetics, Lund University, 25, 67-74.House, D. (1983). Transitions in Speech Perception. In Abstracts of The Tenth International Congress of Phonetic Sciences (pp. 511). Foris Publications, Dordrecht, Holland.1982House, D. (1982). Identification of Nasals: An Acoustical and Perceptual Analysis of Nasal Consonants. Working Papers, Department of Linguistics and Phonetics, Lund University, 22, 153-162.1979Watanabe, A., Felicetti, S., Hedström, B., Surjadi, G., Tannergård, G., Tegerstedt, I., Wejnebring, B., Wetterling, M-B., Andersson, L., Hallsten, L., Kaunisto, M., Murray, T., Eriksson, H., Haapakorpi, M., Karlsson, I., Nord, L., Stålhammar, U., Elenius, K., Blomberg, M., Liljencrants, J., Carlson, R., Granström, B., Risberg, A., Spens, K-E., Agelförs, E., Boberg, G., Mártony, J., Tunblad, T., Öster, A-M., Galyas, K., Gauffin, J., de Serpa-Leitão, A., Askenfelt, A., Jansson, E., & Sundberg, J. (1979). Gunnar Fant 60 years. TMH-QPSR, 20(2), 1-45. [pdf]1978Blomberg, M., & Elenius, K. (1978). A phonetically based isolated word recognition system. J. Acoust. Soc. Am., 64(Suppl. No. 1), S181.1964Brady, P. T. (1964). A technique for investigating on-off patterns of speech. The Bell System Technical Journal, 44(1), 1-22.1959Fant, G. (1959). Acoustic analysis and synthesis of speech with applications to Swedish. Technical Report.