Publications by Jonas Beskow
2016

(2016). Automatic Annotation of Gestural Units in Spontaneous Face-to-Face Interaction. In Proceedings of the Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction (pp. 15-19). Tokyo, Japan. [pdf]

(2016). Robust online motion capture labeling of finger markers. In Proceedings of the 9th International Conference on Motion in Games (pp. 7-13). San Francisco, CA, USA. [pdf]

(2016). WikiSpeech – enabling open source text-to-speech for Wikipedia. In Proceedings of the 9th ISCA Speech Synthesis Workshop (pp. 111-117). Sunnyvale, USA. [pdf]

(2016). A hybrid harmonics-and-bursts modelling approach to speech synthesis. In 9th ISCA Workshop on Speech Synthesis (pp. 225-230). Sunnyvale, CA, USA. [pdf]

(2016). A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference (LREC 2016). Portorož, Slovenia. [pdf]

(2016). Look Who’s Talking - Visual Identification of the Active Speaker in Multi-party Human-robot Interaction. In Proceedings of the 2nd Workshop on Advancements in Social Signal Processing for Multimodal Interaction. Tokyo, Japan.

(2016). The effect of a physical robot on vocabulary learning. In Proceedings of the International Workshop on Spoken Dialogue Systems (IWSDS 2016). Saariselkä, Finland. [pdf]
2015

(2015). Towards Fully Automated Motion Capture of Signs -- Development and Evaluation of a Key Word Signing Avatar. ACM Trans. Access. Comput., 7(2), 7:1-7:17. [abstract]
Abstract: Motion capture of signs presents unique challenges in the field of multimodal data collection. The dense packaging of visual information requires high fidelity and high bandwidth of the captured data. Even though marker-based optical motion capture provides many desirable features such as high accuracy, global fitting, and the ability to record body and face simultaneously, it is not widely used to record finger motion, especially not for articulated and syntactic motion such as signs. Instead, most signing avatar projects use costly instrumented gloves, which require long calibration procedures. In this article, we evaluate the data quality obtained from optical motion capture of isolated signs from Swedish sign language with a large number of low-cost cameras. We also present a novel dual-sensor approach that combines the data with low-cost, five-sensor instrumented gloves to provide a recording method requiring little manual postprocessing. Finally, we evaluate the collected data and the dual-sensor approach as transferred to a highly stylized avatar. The application of the avatar is a game-based environment for training Key Word Signing (KWS) as augmentative and alternative communication (AAC), intended for children with communication disabilities.

(2015). On the temporal domain of co-speech gestures: syllable, phrase or talk spurt?. In Lundmark Svensson, M., Ambrazaitis, G., & van de Weijer, J. (Eds.), Proceedings of Fonetik 2015 (pp. 63-68). Lund University, Sweden. [abstract] [pdf]
Abstract: This study explores the use of automatic methods to detect and extract hand gesture movement co-occurring with speech. Two spontaneous dyadic dialogues were analyzed using 3D motion-capture techniques to track hand movement. Automatic speech/non-speech detection was performed on the dialogues, resulting in a series of connected talk spurts for each speaker. Temporal synchrony of onset and offset of gesture and speech was studied between the automatic hand gesture tracking and talk spurts, and compared to an earlier study of head nods and syllable synchronization. The results indicated onset synchronization between head nods and the syllable in the short temporal domain, and between the onset of longer gesture units and the talk spurt in a more extended temporal domain.

(2015). A Collaborative Human-robot Game as a Test-bed for Modelling Multi-party, Situated Interaction. In Proceedings of IVA. Delft, Netherlands. [abstract] [pdf]
Abstract: In this demonstration we present a test-bed for collecting data and testing out models for multi-party, situated interaction between humans and robots. Two users play a collaborative card-sorting game together with the robot head Furhat. The cards are shown on a touch table between the players, thus constituting a target for joint attention. The system was exhibited at the Swedish National Museum of Science and Technology for nine days, resulting in a rich multi-modal corpus with users of mixed ages.

(2015). Exploring Turn-taking Cues in Multi-party Human-robot Discussions about Objects. In Proceedings of ICMI. Seattle, Washington, USA. [pdf]
2014

(2014). Human-robot collaborative tutoring using multiparty multimodal spoken dialogue. In Proc. of HRI'14. Bielefeld, Germany. [abstract] [pdf]
Abstract: In this paper, we describe a project that explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robot interaction setup is designed, and a human-human dialogue corpus is collected. The corpus targets the development of a dialogue system platform to study verbal and nonverbal tutoring strategies in multiparty spoken interactions with robots which are capable of spoken dialogue. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. Along with the participants sits a tutor (robot) that helps the participants perform the task, and organizes and balances their interaction. Different multimodal signals, captured and auto-synchronized by different audio-visual capture technologies such as a microphone array, Kinects, and video cameras, were coupled with manual annotations. These are used to build a situated model of the interaction based on the participants' personalities, their state of attention, their conversational engagement and verbal dominance, and how these are correlated with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. Driven by the analysis of the corpus, we also show the detailed design methodologies for an affective, multimodally rich dialogue system that allows the robot to measure incrementally the attention states and the dominance of each participant, allowing the robot head Furhat to maintain a well-coordinated, balanced, and engaging conversation that attempts to maximize the agreement and the contribution to solving the task. This project sets the first steps to explore the potential of using multimodal dialogue systems to build interactive robots that can serve in educational, team-building, and collaborative task-solving applications.

(2014). Tutoring Robots: Multiparty multimodal social dialogue with an embodied tutor. In Proceedings of eNTERFACE2013. Springer. [abstract] [pdf]
Abstract: This project explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring agent. A setup is developed and a corpus is collected that targets the development of a dialogue system platform to explore verbal and nonverbal tutoring strategies in multiparty spoken interactions with embodied agents. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. With the participants sits a tutor that helps the participants perform the task and organizes and balances their interaction. Different multimodal signals, captured and auto-synchronized by different audio-visual capture technologies, were coupled with manual annotations to build a situated model of the interaction based on the participants' personalities, their temporally-changing state of attention, their conversational engagement and verbal dominance, and the way these are correlated with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. At the end of this chapter we discuss the potential areas of research and development this work opens and some of the challenges that lie in the road ahead.

(2014). Spontaneous spoken dialogues with the Furhat human-like robot head. In HRI'14. Bielefeld, Germany. [abstract] [pdf]
Abstract: We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is an anthropomorphic robot head that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator showcases a system that is able to carry out social dialogue with multiple interlocutors simultaneously, with rich output signals such as eye and head coordination, lip-synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations. The dialogue design is performed using the IrisTK dialogue authoring toolkit developed at KTH. The system is also able to act as a moderator in a quiz game, showing different strategies for regulating spoken situated interactions.

(2014). Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions. Computer Speech & Language, 28(2), 607-618. [pdf]

(2014). Automatic speech/non-speech classification using gestures in dialogue. In The Fifth Swedish Language Technology Conference. Uppsala, Sweden. [pdf]

(2014). Tivoli - teckeninlärning via spel och interaktion. Slutrapport projektgenomförande. Technical Report, PTS Innovation för alla. [pdf]

(2014). The Tutorbot Corpus – A Corpus for Studying Tutoring Behaviour in Multiparty Face-to-Face Spoken Dialogue. In Proc. of LREC'14. Reykjavik, Iceland. [abstract]
Abstract: This paper describes a novel experimental setup exploiting state-of-the-art capture equipment to collect a multimodally rich, game-solving, collaborative multiparty dialogue corpus. The corpus is targeted and designed towards the development of a dialogue system platform to explore verbal and nonverbal tutoring strategies in multiparty spoken interactions. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. The participants were paired into teams based on their degree of extraversion as measured by a personality test. With the participants sits a tutor that helps them perform the task, organizes and balances their interaction, and whose behavior was assessed by the participants after each interaction. Different multimodal signals, captured and auto-synchronized by different audio-visual capture technologies, together with manual annotations of the tutor's behavior, constitute the Tutorbot corpus. This corpus is exploited to build a situated model of the interaction based on the participants' temporally-changing state of attention, their conversational engagement and verbal dominance, and their correlation with the verbal and visual feedback and conversation regulatory actions generated by the tutor.
2013

(2013). The Furhat Social Companion Talking Head. In Interspeech 2013 - Show and Tell. Lyon, France. [abstract] [pdf]
Abstract: In this demonstrator we present the Furhat robot head. Furhat is a highly human-like robot head in terms of dynamics, thanks to its use of back-projected facial animation. Furhat also takes advantage of a complex and advanced dialogue toolkit designed to facilitate rich and fluent multimodal multiparty human-machine situated and spoken dialogue. The demonstrator presents a social dialogue system with Furhat that allows for several simultaneous interlocutors, takes advantage of several verbal and nonverbal input signals such as speech input, real-time multi-face tracking, and facial analysis, and communicates with its users in a mixed-initiative dialogue, using state-of-the-art speech synthesis with rich prosody, lip-animated facial synthesis, eye and head movements, and gestures.

(2013). The Furhat Back-Projected Humanoid Head - Lip reading, Gaze and Multiparty Interaction. International Journal of Humanoid Robotics, 10(1). [abstract] [pdf]
Abstract: In this article, we present Furhat – a back-projected human-like robot head using state-of-the-art facial animation. Three experiments are presented where we investigate how the head might facilitate human-robot face-to-face interaction. First, we investigate how the animated lips increase the intelligibility of the spoken output, and compare this to an animated agent presented on a flat screen, as well as to a human face. Second, we investigate the accuracy of the perception of Furhat's gaze in a setting typical for situated interaction, where Furhat and a human are sitting around a table. The accuracy of the perception of Furhat's gaze is measured depending on eye design, head movement and viewing angle. Third, we investigate the turn-taking accuracy of Furhat in a multi-party interactive setting, as compared to an animated agent on a flat screen. We conclude with some observations from a public setting at a museum, where Furhat interacted with thousands of visitors in multi-party interaction.

(2013). Aspects of co-occurring syllables and head nods in spontaneous dialogue. In Proc. of 12th International Conference on Auditory-Visual Speech Processing (AVSP2013). Annecy, France. [pdf]

(2013). Extracting and analysing co-speech head gestures from motion-capture data. In Eklund, R. (Ed.), Proc. of Fonetik 2013 (pp. 1-4). Linköping University, Sweden. [pdf]

(2013). Extracting and analyzing head movements accompanying spontaneous dialogue. In Proc. Tilburg Gesture Research Meeting. Tilburg University, The Netherlands. [pdf]

(2013). The Tivoli System – A Sign-driven Game for Children with Communicative Disorders. In Proceedings of the 1st European Symposium on Multimodal Communication. [pdf]

(2013). Web-enabled 3D talking avatars based on WebGL and HTML5. In Proc. of the 13th International Conference on Intelligent Virtual Agents (IVA). Edinburgh, UK. [pdf]

(2013). Non-Linear Pitch Modification in Voice Conversion using Artificial Neural Networks. In ISCA Workshop on Non-Linear Speech Processing 2013. [pdf]

(2013). Co-present or not? Embodiment, situatedness and the Mona Lisa gaze effect. In Nakano, Y., Conati, C., & Bader, T. (Eds.), Eye Gaze in Intelligent User Interfaces - Gaze-based Analyses, Models, and Applications. Springer. [abstract]
Abstract: The interest in embodying and situating computer programmes took off in the autonomous agents community in the 90s. Today, researchers and designers of programmes that interact with people on human terms endow their systems with humanoid physiognomies for a variety of reasons. In most cases, attempts at achieving this embodiment and situatedness have taken one of two directions: virtual characters and actual physical robots. In addition, a technique that is far from new is gaining ground rapidly: projection of animated faces on head-shaped 3D surfaces. In this chapter, we provide a history of this technique; an overview of its pros and cons; and an in-depth description of the cause and mechanics of the main drawback of 2D displays of 3D faces (and objects): the Mona Lisa gaze effect. We conclude with a description of an experimental paradigm that measures perceived directionality in general and the Mona Lisa gaze effect in particular.

(2013). Face-to-Face with a Robot: What do we actually talk about?. International Journal of Humanoid Robotics, 10(1). [abstract] [link]
Abstract: Whereas much of the state-of-the-art research in Human-Robot Interaction (HRI) investigates task-oriented interaction, this paper aims at exploring what people talk about to a robot if the content of the conversation is not predefined. We used the robot head Furhat to explore the conversational behavior of people who encounter a robot in the public setting of a robot exhibition in a scientific museum, but without a predefined purpose. Upon analyzing the conversations, it could be shown that a sophisticated robot provides an inviting atmosphere for people to engage in interaction and to be experimental and challenge the robot's capabilities. Many visitors to the exhibition were willing to go beyond the guiding questions that were provided as a starting point. Amongst other things, they asked Furhat questions concerning the robot itself, such as how it would define a robot, or if it plans to take over the world. People were also interested in the feelings and likes of the robot and they asked many personal questions - this is how Furhat ended up with its first marriage proposal. People who talked to Furhat were asked to complete a questionnaire on their assessment of the conversation, with which we could show that the interaction with Furhat was rated as a pleasing experience.

(2013). A Kinect Corpus of Swedish Sign Language Signs. In Proceedings of the 2013 Workshop on Multimodal Corpora: Beyond Audio and Video. [pdf]
2012

(2012). Talking with Furhat - multi-party interaction with a back-projected robot head. In Proceedings of Fonetik'12. Gothenburg, Sweden. [abstract] [pdf]
Abstract: This is a condensed presentation of some recent work on a back-projected robotic head for multi-party interaction in public settings. We describe some of the design strategies and give some preliminary analysis of an interaction database collected at the Robotville exhibition at the London Science Museum.

(2012). Furhat: A Back-projected Human-like Robot Head for Multiparty Human-Machine Interaction. In Esposito, A., Esposito, A., Vinciarelli, A., Hoffmann, R., & Müller, V. C. (Eds.), Cognitive Behavioural Systems. Lecture Notes in Computer Science (pp. 114-130). Springer.

(2012). Taming Mona Lisa: communicating gaze faithfully in 2D and 3D facial projections. ACM Transactions on Interactive Intelligent Systems, 1(2), 25. [abstract] [pdf]
Abstract: The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of the dialogue and participant attention, for deictic referencing, and for the communication of attitude. When developing embodied conversational agents (ECAs) and talking heads, modeling and delivering accurate gaze targets is crucial. Traditionally, systems communicating through talking heads have been displayed to the human conversant using 2D displays, such as flat monitors. This approach introduces severe limitations for an accurate communication of gaze, since 2D displays are associated with several powerful effects and illusions, most importantly the Mona Lisa gaze effect, where the gaze of the projected head appears to follow the observer regardless of viewing angle. We describe the Mona Lisa gaze effect and its consequences in the interaction loop, and propose a new approach for displaying talking heads using a 3D projection surface (a physical model of a human head) as an alternative to the traditional flat-surface projection. We investigate and compare the accuracy of the perception of gaze direction and the Mona Lisa gaze effect in 2D and 3D projection surfaces in a five-subject gaze perception experiment. The experiment confirms that a 3D projection surface completely eliminates the Mona Lisa gaze effect and delivers very accurate gaze direction that is independent of the observer's viewing angle. Based on the data collected in this experiment, we rephrase the formulation of the Mona Lisa gaze effect. The data, when reinterpreted, confirm the predictions of the new model for both 2D and 3D projection surfaces. Finally, we discuss the requirements on different spatially interactive systems in terms of gaze direction, and propose new applications and experiments for interaction in human-ECA and human-robot settings made possible by this technology.

(2012). Lip-reading Furhat: Audio Visual Intelligibility of a Back Projected Animated Face. In Proc. of the Intelligent Virtual Agents 10th International Conference (IVA 2012). Santa Cruz, CA, USA: Springer. [abstract] [pdf]
Abstract: Back-projecting a computer-animated face onto a three-dimensional static physical model of a face is a promising technology that is gaining ground as a solution to building situated, flexible and human-like robot heads. In this paper, we first briefly describe Furhat, a back-projected robot head built for the purpose of multimodal multiparty human-machine interaction, and its benefits over virtual characters and robotic heads; we then motivate the need to investigate the contribution to speech intelligibility that Furhat's face offers. We present an audio-visual speech intelligibility experiment, in which 10 subjects listened to short sentences with a degraded speech signal. The experiment compares the gain in intelligibility from lip-reading a face visualized on a 2D screen with that from a 3D back-projected face, and from different viewing angles. The results show that the audio-visual speech intelligibility holds when the avatar is projected onto a static face model (in the case of Furhat), and even, rather surprisingly, exceeds it. This means that despite the movement limitations that back-projected animated face models bring about, their audio-visual speech intelligibility is equal, or even higher, compared to the same models shown on flat displays. At the end of the paper we discuss several hypotheses on how to interpret the results, and motivate future investigations to better explore the characteristics of visual speech perception of 3D projected animated faces.

(2012). Multimodal Multiparty Social Interaction with the Furhat Head. In Proc. of the 14th ACM International Conference on Multimodal Interaction ICMI. Santa Monica, CA, USA. [abstract] [pdf]
Abstract: We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is a human-like interface that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator showcases a system that is able to carry out social dialogue with multiple interlocutors simultaneously, with rich output signals such as eye and head coordination, lip-synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations.

(2012). Furhat goes to Robotville: a large-scale multiparty human-robot interaction data collection in a public space. In Proc of LREC Workshop on Multimodal Corpora. Istanbul, Turkey. [pdf]

(2012). Can Anybody Read Me? Motion Capture Recordings for an Adaptable Visual Speech Synthesizer. In Proceedings of The Listening Talker. Edinburgh, UK. [pdf]

(2012). Children and adults in dialogue with the robot head Furhat - corpus collection and initial analysis. In Proceedings of WOCCI. Portland, OR. [pdf]

(2012). HMM based speech synthesis system for Swedish Language. In The Fourth Swedish Language Technology Conference. Lund, Sweden.

(2012). 3rd party observer gaze as a continuous measure of dialogue flow. In Proc. of LREC 2012. Istanbul, Turkey. [abstract] [pdf]
Abstract: We present an attempt at using 3rd party observer gaze to get a measure of how appropriate each segment in a dialogue is for a speaker change. The method is a step away from the current dependency on speaker turns or talkspurts towards a more general view of speaker changes. We show that 3rd party observers do indeed largely look at the same thing (the speaker), and how this can be captured and utilized to provide insights into human communication.

(2012). Gesture movement profiles in dialogues from a Swedish multimodal database of spontaneous speech. In Bergmann, P., Brenning, J., Pfeiffer, M. C., & Reber, E. (Eds.), Prosody and Embodiment in Interactional Grammar (pp. 265-280). Berlin: de Gruyter.

(2012). Visual Recognition of Isolated Swedish Sign Language Signs. http://arxiv.org/abs/1211.3901. [pdf]

(2012). Regional Varieties of Swedish: Models and synthesis. In Niebuhr, O. (Ed.), Understanding Prosody (pp. 119-134). De Gruyter.

(2012). Furhat at Robotville: A Robot Head Harvesting the Thoughts of the Public through Multi-party Dialogue. In Proceedings of IVA-RCVA. Santa Cruz, CA. [pdf]
2011

(2011). A robotic head using projected animated faces. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (pp. 69). [pdf]

(2011). Animated Faces for Robotic Heads: Gaze and Beyond. In Esposito, A., Vinciarelli, A., Vicsi, K., Pelachaud, C., & Nijholt, A. (Eds.), Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues, Lecture Notes in Computer Science (pp. 19-35). Springer. [pdf]

(2011). A novel Skype interface using SynFace for virtual speech reading support. TMH-QPSR, 51(1), 33-36. [pdf]

(2011). Kinetic Data for Large-Scale Analysis and Modeling of Face-to-Face Conversation. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (pp. 103-106). [pdf]

(2011). The Mona Lisa Gaze Effect as an Objective Metric for Perceived Cospatiality. In Vilhjálmsson, H. H., Kopp, S., Marsella, S., & Thórisson, K. R. (Eds.), Proc. of the Intelligent Virtual Agents 10th International Conference (IVA 2011) (pp. 439-440). Reykjavík, Iceland: Springer. [abstract] [pdf]
Abstract: We propose to utilize the Mona Lisa gaze effect for an objective and repeatable measure of the extent to which a viewer perceives an object as cospatial. Preliminary results suggest that the metric behaves as expected.
2010

(2010). Prominence Detection in Swedish Using Syllable Correlates. In Interspeech'10. Makuhari, Japan. [pdf]

(2010). Perception of Nonverbal Gestures of Prominence in Visual Speech Animation. In FAA, The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation. Edinburgh, UK. [pdf]

(2010). Auditory-Visual Prominence: From Intelligibility to Behavior. Journal on Multimodal User Interfaces, 3(4), 299-311. [abstract] [pdf]
Abstract: Auditory prominence is defined as when an acoustic segment is made salient in its context. Prominence is one of the prosodic functions that has been shown to be strongly correlated with facial movements. In this work, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted: speech quality is acoustically degraded and the fundamental frequency is removed from the signal, then the speech is presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raise gestures, which are synchronized with the auditory prominence. The experiment shows that presenting prominence as facial gestures significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a follow-up study examining the perception of the behavior of the talking heads when gestures are added over pitch accents. Using eye-gaze tracking technology and questionnaires on 10 moderately hearing-impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch accents, as opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and the understanding of the talking head.

(2010). Audio-Visual Prosody: Perception, Detection, and Synthesis of Prominence. In Esposito, A. et al. (Eds.), Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues (pp. 55-71). Springer. [abstract] [pdf]
Abstract: In this chapter, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted, where speech quality is acoustically degraded and the speech is then presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raising gestures. The experiment shows that perceiving visual prominence as gestures, synchronized with the auditory prominence, significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a study examining the perception of the behavior of the talking heads when gestures are added at pitch movements. Using eye-gaze tracking technology and questionnaires for 10 moderately hearing-impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch movements, as opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and helpfulness of the talking head.

(2010). Perception of Gaze Direction in 2D and 3D Facial Projections. In The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation. Edinburgh, UK. [pdf]

(2010). Face-to-face interaction and the KTH Cooking Show. In Esposito, A., Campbell, N., Vogel, C., Hussain, A., & Nijholt, A. (Eds.), Development of Multimodal Interfaces: Active Listening and Synchrony (pp. 157-168). Berlin / Heidelberg: Springer. [pdf]

(2010). Modelling humanlike conversational behaviour. In Proceedings of SLTC 2010. Linköping, Sweden. [pdf]

(2010). Research focus: Interactional aspects of spoken face-to-face communication. In Proc. of Fonetik 2010 (pp. 7-10). Lund, Sweden. [abstract] [pdf]
Abstract: We have a visionary goal: to learn enough about human face-to-face interaction that we are able to create an artificial conversational partner that is human-like. We take the opportunity here to present four new projects inaugurated in 2010, each adding pieces of the puzzle through a shared research focus: interactional aspects of spoken face-to-face communication.

(2010). Goda utsikter för teckenspråksteknologi. In Domeij, R., Breivik, T., Halskov, J., Kirchmeier-Andersen, S., Langgård, P., & Moshagen, S. (Eds.), Språkteknologi för ökad tillgänglighet - Rapport från ett nordiskt seminarium (pp. 77-86). [pdf]

(2010). Teckenspråkteknologi - sammanfattning av förstudie. Technical Report, KTH Centrum för Talteknologi. [pdf]

(2010). Cocktail – a demonstration of massively multi-component audio environments for illustration and analysis. In Proc. of SLTC 2010. Linköping, Sweden. [abstract] [pdf]
Abstract: We present MMAE – Massively Multi-component Audio Environments – a new concept in auditory presentation, and Cocktail – a demonstrator built on this technology. MMAE creates a dynamic audio environment by playing a large number of sound clips simultaneously at different locations in a virtual 3D space. The technique utilizes standard soundboards and is based on the Snack Sound Toolkit. The result is an efficient 3D audio environment that can be modified dynamically, in real time. Applications range from the creation of canned as well as online audio environments for games and entertainment to the browsing, analyzing and comparing of large quantities of audio data. We also demonstrate the Cocktail implementation of MMAE using several test cases as examples.

(2010). Capturing massively multimodal dialogues: affordable synchronization and visualization. In Kipp, M., Martin, J-C., Paggio, P., & Heylen, D. (Eds.), Proc. of Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (MMC 2010) (pp. 160-161). Valletta, Malta. [abstract] [pdf]
Abstract: In this demo, we show (a) affordable and relatively easy-to-implement means to facilitate synchronization of audio, video and motion capture data in post-processing, and (b) a flexible tool for 3D visualization of recorded motion capture data aligned with audio and video sequences. The synchronisation is made possible by the use of two simple analogue devices: a turntable and an easy-to-build electronic clapper board. The demo shows examples of how the signals from the turntable and the clapper board are traced over the three modalities, using the 3D visualisation tool. We also demonstrate how the visualisation tool shows head and torso movements captured by the motion capture system.

(2010). Spontal: a Swedish spontaneous dialogue corpus of audio, video and motion capture. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odjik, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proc. of the Seventh conference on International Language Resources and Evaluation (LREC'10) (pp. 2992-2995). Valletta, Malta. [abstract] [pdf]
Abstract: We present the Spontal database of spontaneous Swedish dialogues. 120 dialogues of at least 30 minutes each have been captured in high-quality audio, high-resolution video and with a motion capture system. The corpus is currently being processed and annotated, and will be made available for research at the end of the project.

(2010). Simulating Intonation in Regional Varieties of Swedish. In Speech Prosody 2010. Chicago, USA. [pdf]

(2010). Simulating Intonation in Regional Varieties of Swedish. In Fonetik 2010. Lund, Sweden. [pdf]
2009Proceedings of Auditory-Visual Speech Processing AVSP'09. Norwich, England. [abstract] [pdf] (2009). Effects of Visual Prominence Cues on Speech Intelligibility. In Abstract: This study reports experimental results on the effect of visual prominence, presented as gestures, on speech intelligibility. 30 acoustically vocoded sentences, permuted into different gestural conditions, were presented audio-visually to 12 subjects. The analysis of correct word recognition shows a significant increase in intelligibility when focally-accented (prominent) words are supplemented with head-nods or with eyebrow-raise gestures. The paper also examines coupling other acoustic phenomena to brow-raise gestures. As a result, the paper introduces new evidence on the ability of the non-verbal movements in the visual modality to support audio-visual speech perception.Proceedings of Fonetik'09. Dept. of Linguistics, Stockholm University, Sweden. [abstract] [pdf] (2009). Studies on Using the SynFace Talking Head for the Hearing Impaired. In Abstract: SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper we
present the large-scale hearing-impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and the effort scaling when hearing-impaired people use SynFace; groups of hearing-impaired subjects with impairment levels ranging from mild to severe, including cochlear implant users, were tested. Preliminary analysis of the results does not show a significant gain in SRT or in effort scaling. However, given the large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace, especially for speech in stereo babble.Proceedings of Interspeech 2009. (2009). Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-media Setting. In Proc. of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources (pp. 1-5). Northern European Association for Language Technology. [abstract] [pdf] (2009). Swedish CLARIN activities. In Domeij, R., Koskenniemi, K., Krauwer, S., Maegaard, B., Rögnvaldsson, E., & de Smedt, K. (Eds.), Abstract: Although Sweden has yet to allocate
funds specifically intended for CLARIN activities, there are some ongoing activities which are directly relevant to CLARIN, and which are explicitly linked to CLARIN. These activities have been funded by the Committee for Research Infrastructures and its subcommittee DISC (Database Infrastructure Committee) of the Swedish Research Council.Interspeech 2009. Brighton, U.K. [abstract] [pdf] (2009). The MonAMI Reminder: a spoken dialogue system for face-to-face interaction. In Abstract: We describe the MonAMI Reminder, a multimodal spoken dialogue system which can assist elderly and disabled people in organising and initiating their daily activities. Based on deep interviews with potential users, we have designed a calendar and reminder application which uses an innovative mix of an embodied conversational agent, digital pen and paper, and the web to meet the needs of those users as well as the current constraints of speech technology. We also explore the use of head pose tracking for interaction and attention control in human-computer face-to-face interaction.Proceedings of Fonetik 2009. (2009). Experiments with Synthesis of Swedish Dialects. In Proceedings of The International Conference on Auditory-Visual Speech Processing AVSP'09. Norwich, England. [abstract] [pdf] (2009). SynFace - Verbal and Non-verbal Face Animation from Audio. In Abstract: We give an overview of SynFace, a speech-driven
face animation system originally developed for the needs of hard-of-hearing users of the telephone. For the 2009 LIPS challenge, SynFace includes not only articulatory motion but also non-verbal motion of gaze, eyebrows and head, triggered by detection of acoustic correlates of prominence and cues for interaction control. In perceptual evaluations, both verbal and non-verbal movements have been found to have a
positive impact on word recognition scores.Computers in the Human Interaction Loop (pp. 143-158). Berlin/Heidelberg: Springer. [pdf] (2009). Multimodal Interaction Control. In Waibel, A., & Stiefelhagen, R. (Eds.), Fonetik 2009 (pp. 190-193). Stockholm. [abstract] [pdf] (2009). Project presentation: Spontal – multimodal database of spontaneous dialog. In Abstract: We describe the ongoing Swedish speech database project Spontal: Multimodal database of spontaneous speech in dialog (VR 2006-7482). The project takes as its point of departure the fact that both vocal signals and gesture involving the face and body are important in everyday, face-to-face communicative interaction, and that there is a great need for data with which we can more precisely measure these.Language and Speech, 52(2-3), 351-367. [abstract] (2009). MushyPeek - a framework for online investigation of audiovisual dialogue phenomena. Abstract: Evaluation of methods and techniques for conversational and multimodal spoken dialogue systems is complex, as is gathering data for the modeling and tuning of such techniques. This article describes MushyPeek, an experiment framework that allows us to manipulate the audiovisual behavior of interlocutors in a setting similar to face-to-face human-human dialogue. The setup connects two subjects to each other over a Voice over Internet Protocol (VoIP) telephone connection and simultaneously provides each of them with an avatar representing the other. We present a first experiment which inaugurates, exemplifies, and validates the framework. The experiment corroborates earlier findings on the use of gaze and head pose gestures in turn-taking.EURASIP Journal on Audio, Speech, and Music Processing, 2009. [abstract] [pdf] (2009). SynFace—Speech-Driven Facial Animation for Virtual Speech-Reading Support. Abstract: This paper describes SynFace, a supportive technology that aims at enhancing audio-based spoken communication in adverse acoustic conditions by providing the missing visual information in the form of an animated talking head. Firstly, we describe the system architecture, consisting of a 3D animated face model controlled from the speech input by a specifically optimised phonetic recogniser. Secondly, we report on speech intelligibility experiments with focus on multilinguality and robustness to audio quality. The system, already available for Swedish, English, and Flemish, was optimised for German and for Swedish wide-band speech quality available in TV, radio, and Internet communication. Lastly, the paper covers experiments with nonverbal motions driven from the speech signal. 
It is shown that turn-taking gestures can be used to affect the flow of human-human dialogues. We have focused specifically on two categories of cues that may be extracted from the acoustic signal: prominence/emphasis and interactional cues (turn-taking/back-channelling).
2008Proceedings of The second Swedish Language Technology Conference (SLTC). Stockholm, Sweden. [abstract] [pdf] (2008). SynFace Phone Recognizer for Swedish Wideband and Narrowband Speech. In Abstract: In this paper, we present new results and comparisons of the real-time lip-synchronized talking head SynFace on different Swedish databases and bandwidths. The work involves training SynFace on narrow-band telephone speech from the Swedish SpeechDat, and on the narrow-band and wide-band Speecon corpus. Auditory perceptual tests are being established for SynFace as an audio-visual hearing support for the hearing-impaired. Preliminary results show high recognition accuracy compared to other languages.Proceedings of Fonetik 2008. (2008). Human Recognition of Swedish Dialects. In Proceedings of Interspeech 2008. [pdf] (2008). Recognizing and Modelling Regional Varieties of Swedish. In Comunicazione Parlata e manifestazione delle emozioni. [pdf] (2008). Evaluation of the expressivity of a Swedish talking head in the context of human-machine interaction. In Magno Caldognetto, E., Cavicchio, E., & Cosi, P. (Eds.), Proceedings of the 10th international conference on Multimodal interfaces, Chania, Crete, Greece (pp. 199-200). New York, NY, USA: ACM. [abstract] [pdf] (2008). Innovative interfaces in MonAMI: the reminder. In Abstract: This demo paper presents an early version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. The Reminder helps users to plan activities and to remember what to do. The prototype merges mobile ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. 
The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar. Users may ask questions such as “When was I supposed to meet Sara?” or “What’s my schedule today?”Proceedings of FONETIK 2008 (pp. 33-36). Gothenburg, Sweden. [abstract] [pdf] (2008). Speech technology in the European project MonAMI. In Abstract: This paper describes the role of speech and speech technology in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. It presents the Reminder, a prototype embodied conversational agent (ECA) which helps users to plan activities and to remember what to do. The prototype merges speech technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar. Users may also ask questions such as “When was I supposed to meet Sara?” or “What’s on my schedule today?”Perception in Multimodal Dialogue Systems - Proceedings of the 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, PIT 2008, Kloster Irsee, Germany, June 16-18, 2008. (pp. 272-275). Berlin/Heidelberg: Springer. [abstract] [pdf] (2008). Innovative interfaces in MonAMI: the KTH Reminder. In Abstract: This demo paper presents the first version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. The Reminder helps users to plan activities and to remember what to do. 
The prototype merges ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. This innovative combination of modalities allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides verbal notifications on what has been written in the calendar. Users may also ask questions such as “When was I supposed to meet Sara?” or “What’s on my schedule today?”Technology and Disability, 20(2), 97-107. [pdf] (2008). Visualization of speech and audio for hearing-impaired persons. Proceedings of Interspeech 2008. Brisbane, Australia. [abstract] [pdf] (2008). Hearing at Home – Communication support in home environments for hearing impaired persons. In Abstract: The Hearing at Home (HaH) project focuses on the needs of
hearing-impaired people in home environments. The project is researching and developing an innovative media-center solution for hearing support, with several integrated features that support perception of speech and audio, such as individual loudness amplification, noise reduction, audio classification and event detection, and the possibility to display an animated talking head providing real-time speechreading support. In this paper we provide a brief project overview and then describe some recent results related to the audio classifier and the talking head. As the talking head expects clean speech input, an audio classifier has been developed for the task of classifying audio signals as clean speech, speech in noise or other. The mean accuracy of the classifier was 82%. The talking head (based on technology from the SynFace project) has been adapted for German, and a small speech-in-noise intelligibility experiment was conducted where sentence recognition rates increased from 3% to 17% when the talking head was present.Proc. of SLTC 2008 (pp. 13-14). Stockholm. [abstract] [pdf] (2008). The MonAMI Reminder system. In Abstract: This paper presents an early version of the Reminder, a prototype ECA developed in the European project MonAMI. The Reminder helps users to plan activities and to
remember what to do. The prototype merges mobile ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar.Proceedings of The second Swedish Language Technology Conference (SLTC). Stockholm, Sweden. [pdf] (2008). Mobile SynFace: Ubiquitous visual interface for mobile VoIP telephone calls. In
2007Verbal and Nonverbal Communication Behaviours (pp. 250-263). Berlin: Springer-Verlag. (2007). Analysis and synthesis of multimodal verbal and non-verbal interaction for animated interface agents. In Esposito, A., Faundez-Zanuy, M., Keller, E., & Marinaro, M. (Eds.), Proceedings of Interspeech 2007. Antwerp, Belgium. [abstract] [pdf] (2007). Pushy versus meek – using avatars to influence turn-taking behaviour. In Abstract: The flow of spoken interaction between human interlocutors
is a widely studied topic. Amongst other things, studies have shown that we use a number of facial gestures to improve this flow – to control the taking of turns. This ought to be useful in systems where an animated talking head is used, be they systems for computer mediated human-human dialogue or spoken dialogue systems, where the computer itself uses speech to interact with users. In this article, we show that a small set of simple interaction control gestures and a simple model of interaction can be used to influence users’ behaviour
in an unobtrusive manner. The results imply that such a model may improve the flow of computer-mediated interaction between humans under adverse circumstances, such as network latency, or to create more human-like spoken human-computer interaction.Proceedings of Fonetik, TMH-QPSR, 50(1), 61-64. [abstract] [pdf] (2007). MushyPeek – an experiment framework for controlled investigation of human-human interaction control behaviour. Abstract: This paper describes MushyPeek, an experiment framework that allows us to manipulate interaction control behaviour – including turn-taking – in a setting quite similar to face-to-face human-human dialogue. The setup connects two subjects to each other over a VoIP telephone connection and simultaneously provides each of them with an avatar representing the other. The framework is exemplified with the first experiment we tried in it – a test of the effectiveness of interaction control gestures in an animated lip-synchronised talking head.
2006Lecture Notes in Computer Science, 4061, 579-586. [abstract] [pdf] (2006). User Evaluation of the SYNFACE Talking Head Telephone. Abstract: The talking-head telephone, Synface, is a lip-reading support for people with hearing impairment. It has been tested by 49 users with varying degrees of hearing impairment in the UK and Sweden in lab and home environments. Synface was found to give support to the users, especially in perceiving numbers and addresses, and to be an enjoyable way to communicate. A majority deemed Synface to be a useful product.Fonetik 2006, Working Papers 52, General Linguistics and Phonetics, Lund University (pp. 9-12). [pdf] (2006). Focal accent and facial movements in expressive speech. In Proceedings of Interspeech 2006 (pp. 1272–1275). Pittsburgh, PA. [pdf] (2006). Visual correlates to prominence in several expressive modes. In Scandinavian Journal of Psychology, 47(2), 93-101. [pdf] (2006). Motivation and appraisal in perception of poorly specified speech. Journal of Speech, Language and Hearing Research, 49(4), 835-847. [pdf] (2006). Visual phonemic ambiguity and speechreading.
2005Spoken Multimodal Human-Computer Dialogue in Mobile Environments, Text, Speech and Language Technology (pp. 93-113). Dordrecht, The Netherlands: Kluwer Academic Publishers. [abstract] [pdf] (2005). A model for multi-modal dialogue system output applied to an animated talking head. In Minker, W., Bühler, D., & Dybkjaer, L. (Eds.), Abstract: We present a formalism for specifying verbal and non-verbal output from a multi-modal dialogue system. The output specification is XML-based and provides information about communicative functions of the output, without detailing the realisation of these functions. The aim is to let dialogue systems generate the same output for a wide variety of output devices and modalities. The formalism was developed and implemented in the multi-modal spoken dialogue system AdApt. We also describe how facial gestures in the 3D-animated talking head used within this system are controlled through the formalism.Proceedings of Interspeech 2005. Lisbon. [pdf] (2005). Data-driven Synthesis of Expressive Visual Speech using an MPEG-4 Talking Head. In
2004Journal of Speech Technology, 4(7), 335-349. [pdf] (2004). Trainable articulatory control models for visual speech synthesis. Proc Tutorial and Research Workshop on Affective Dialogue Systems, ADS'04 (pp. 240-243). Kloster Irsee, Germany. [pdf] (2004). Preliminary cross-cultural evaluation of expressiveness in synthetic faces. In André, E., Dybkjaer, L., Minker, W., & Heisterkamp, P. (Eds.), Proc Tutorial and Research Workshop on Affective Dialogue Systems, ADS'04. Kloster Irsee, Germany. [pdf] (2004). Expressive Animated Agents for Affective Dialogue Systems. Proc LREC Workshop on Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces (pp. 34-37). Lisbon. [pdf] (2004). The Swedish PF-Star multimodal corpora. In Computers Helping People with Special Needs (pp. 1178-1186). Springer-Verlag. [abstract] [pdf] (2004). SYNFACE - A talking head telephone for the hearing-impaired. In Miesenberger, K., Klaus, J., Zagler, W., & Burger, D. (Eds.), Abstract: SYNFACE is a telephone aid for hearing-impaired people
that shows the lip movements of the speaker at the other telephone synchronised with the speech. The SYNFACE system consists of a speech recogniser that recognises the incoming speech and a synthetic talking head. The output from the recogniser is used to control the articulatory movements of the synthetic head. SYNFACE prototype systems exist for three languages: Dutch, English and Swedish, and the first user trials have just started.Proc ICSLP 2004 (pp. 1693-1696). Jeju Island, Korea. [pdf] (2004). Design strategies for a virtual language tutor. In Kim, S. H., & Young, D. H. (Eds.), Proc IFHOH 7th World Congress. Helsinki, Finland. (2004). SYNFACE, a talking head telephone for the hearing impaired. In
2003Talking heads - Models and applications for multimodal speech synthesis. Doctoral dissertation, KTH. (2003). Proceedings of the 15th ICPhS (pp. 431-434). Barcelona, Spain. [pdf] (2003). Resynthesis of Facial and Intraoral Articulation from Simultaneous Measurements. In Solé, M., Recasens, D., & Romero, J. (Eds.), Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 57-60). [pdf] (2003). Simultaneous measurements of facial and intraoral articulation. In Proc EuroSpeech 2003 (pp. 2261-2264). [pdf] (2003). Resynthesis of 3D tongue movements from facial data. In 7th Intl Seminar on Speech Production (pp. 49-54). Sydney. [pdf] (2003). The effect of corpus choice on statistical articulatory modeling. In Proc of ICPhS, XV Intl Conference of Phonetic Sciences (pp. 131-134). Barcelona, Spain. [pdf] (2003). Evaluation of a Multilingual Synthetic Talking Face as a communication Aid for the Hearing Impaired. In
2002Proc of ICSLP 2002 (pp. 181-184). Denver, Colorado, USA. [abstract] [pdf] (2002). Specification and realisation of multimodal output in dialogue systems. In Abstract: We present a high-level formalism for specifying verbal and nonverbal output from a multimodal dialogue system. The output specification is XML-based and provides information about communicative functions of the output without detailing the realisation of these functions. The specification can be used to control an animated character that uses speech and gestures. We give examples from an implementation in a multimodal spoken dialogue system, and describe how facial gestures are implemented in a 3D-animated talking agent within this system.Improvements in Speech Synthesis (pp. 372-382). New York: John Wiley & Sons, Inc. (2002). A multimodal speech synthesis tool applied to audio-visual prosody. In Keller, E., Bailly, G., Monaghan, A., Terken, J., & Huckvale, M. (Eds.), Proceedings of Fonetik, TMH-QPSR, 44(1), 097-100. [pdf] (2002). Articulation strength - Readability experiments with a synthetic talking face. Proc of ISCA Workshop Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany. [abstract] [pdf] (2002). GESOM - A model for describing and generating multi-modal output. In Abstract: This paper describes GESOM, a model for generation of generalised, high-level multi-modal dialogue system output. It aims to let dialogue systems generate output for various output devices and modalities with a minimum of changes to the output generation of the dialogue system. The model was developed and tested within the AdApt spoken dialogue system, from which the bulk of the examples in this paper are taken.Multimodality in language and speech systems (pp. 209-241). Dordrecht: Kluwer Academic Publishers. (2002). Speech and gestures for talking faces in conversational dialogue systems. In Granström, B., House, D., & Karlsson, I. (Eds.),
2001Nordic Prosody 2000: Proc of VIII Conf (pp. 77-88). Trondheim, Norway. (2001). Verbal and visual prosody in multimodal speech perception. In van Dommelen, W., & Fretheim, T. (Eds.), Papers from Fonetik 2001 (pp. 62-65). (2001). Interaction of visual cues for prominence. In Karlsson, A., & van de Weijer, J. (Eds.), Proc of Eurospeech 2001 (pp. 387-390). Aalborg, Denmark. [pdf] (2001). Timing and interaction of visual cues for prominence in audiovisual speech perception. In
2000Proc of InSTIL 2000 (pp. 138-142). University of Abertay Dundee, Dundee, Scotland. (2000). Experiments with verbal and visual conversational signals for an automatic language tutor. In Delcloque, P., & Bramoullé, A. (Eds.), Nordic Prosody VIII. (2000). Verbal and visual prosody in multimodal speech perception. In Proc 4th Swedish Symposium on Multimodal Communication. (2000). Verbal and visual prosody in multimodal speech perception. In Proc. of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 134-137). Beijing: China Military Friendship Publish. [abstract] [pdf] (2000). AdApt - a multimodal conversational dialogue system in an apartment domain. In Yuan, B., Huang, T., & Tang, X. (Eds.), Abstract: A general overview of the AdApt project and the research that is performed within the project is presented. In this project various aspects of human-computer interaction in multimodal conversational dialogue systems are investigated. The project will also include studies on the integration of user/system/dialogue dependent speech recognition and multimodal speech synthesis. A domain in which multimodal interaction is highly useful has been chosen, namely, finding available apartments in Stockholm. A Wizard-of-Oz data collection within this domain is also described.Embodied Conversational Agents. Cambridge, MA: MIT Press. (2000). Developing and Evaluating Conversational Agents. In Cassell, J., et al. (Eds.), Proceedings of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 464-467). Beijing. [pdf] (2000). WaveSurfer - an open source speech tool. In Yuan, B., Huang, T., & Tang, X. (Eds.),
1999Proceedings of the 4th European Conference on Audiology. Oulu, Finland. (1999). A Synthetic Face as a Lip-reading Support for Hearing Impaired Telephone Users - Problems and Positive Results. In Proceedings of Fonetik. Gothenburg, Sweden. (1999). Two methods for Visual Parameter Extraction in the Teleface Project. In Assistive Technology on the Threshold of the New Millennium (pp. 116-121). IOS Press. (1999). Artificial video for Hearing-Impaired Telephone Users; A comparison with the No Video and Perfect Video Conditions. In Bühler, C., & Knops, H. (Eds.), Cost249 meeting. Prague, Czech Republic. (1999). ASR controlled synthetic face as a lipreading support for hearing impaired telephone users. In Proceedings of Audio-Visual Speech Processing (AVSP). Santa Cruz, USA. [pdf] (1999). Synthetic visual speech driven from auditory speech. In Tutorial session in proceedings of IDS'99 (pp. 165-168). [pdf] (1999). Creating web-based exercises for spoken language technology. In Proc of AVSP 99. [pdf] (1999). Developing a 3D-agent for the August dialogue system. In Proc of AVSP 99 (pp. 133-138). [pdf] (1999). Picture my voice: Audio to visual speech synthesis using artificial neural networks. In Proc of ICPhS. [pdf] (1999). From Theory to Practice: Rewards and Challenges. In Proc MATISSE - ESCA/SOCRATES Workshop on Method and Tool Innovations for Speech Science Education. London. (1999). Web-based educational tools for speech technology. In
1998Proceedings of ICSLP'98. [pdf] (1998). Synthetic faces as a lipreading support. In Proc of IVTTA'98. (1998). Teleface - the use of a synthetic face for the hard of hearing. In Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 200-203). Stockholm University. [html] (1998). The synthetic face from a hearing impaired view. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 162-165). Stockholm University. [pdf] (1998). A tool for teaching and development of parametric speech synthesis. In Branderud, P., & Traunmüller, H. (Eds.), Proc of AVSP'98. [pdf] (1998). Recent developments in facial animation: an inside view. In Proc of STiLL - ESCA Workshop on Speech Technology in Language Learning. (1998). Intelligent Animated Agents for Interactive Language Training. In Proc of ICSLP98, 5th Intl Conference on Spoken Language Processing (pp. 3217-3220). Sydney, Australia. (1998). Web-based educational tools for speech technology. In
1997Proc of ESCA Workshop on Audio-Visual Speech Processing (pp. 149-152). Rhodes, Greece. [pdf] (1997). Animation of talking agents. In Benoit, C., & Campbell, R. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum (pp. 85-88). [pdf] (1997). The Teleface project - disability, feasibility and intelligibility. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 2003-2006). Rhodes, Greece. [pdf] (1997). The Teleface project - Multimodal speech communication for the hearing impaired. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 1651-1654). Rhodes, Greece. [pdf] (1997). OLGA - A dialogue system with an animated talking agent. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Phonum 4 (pp. 69-72). Lövånger/Umeå. [pdf] (1997). The OLGA project: An animated talking agent in a dialogue system. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), TMH-QPSR, 38(2-3), 001-006. [pdf] (1997). OLGA - A dialogue system with an animated talking agent. Proc of the IJCAI -97 Workshop on Animated Interface Agents: Making them Intelligent (pp. 39-44). Nagoya, Japan. [pdf] (1997). OLGA - A conversational agent with gestures. In André, E. (Ed.),
1996TMH-QPSR, 37(2), 053-056. [pdf] (1996). Talking heads - communication, articulation and animation.
1995Proc of ESCA Workshop on Spoken Dialogue Systems (pp. 281-284). Vigsø, Denmark. [pdf] (1995). The Waxholm system - a progress report. In Dalsgaard, P. (Ed.), Regelstyrd visuell talsyntes. Master's thesis, KTH, TMH. (1995). Proc of the 4th European Conference on Speech Communication and Technology (EUROSPEECH '95) (pp. 299-302). Madrid.
(1995). Rule-based visual speech synthesis. In Pardo, J. (Ed.),