Publications by Jonas Beskow

2016

Alexanderson, S., House, D., & Beskow, J. (2016). Automatic Annotation of Gestural Units in Spontaneous Face-to-Face Interaction. In Proceedings of the Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction (pp. 15-19). Tokyo, Japan. [pdf]

Alexanderson, S., O'Sullivan, C., & Beskow, J. (2016). Robust online motion capture labeling of finger markers. In Proceedings of the 9th International Conference on Motion in Games (pp. 7-13). San Francisco, CA, USA. [pdf]

Andersson, J., Berlin, S., Costa, A., Berthelsen, H., Lindgren, H., Lindberg, N., Beskow, J., Edlund, J., & Gustafson, J. (2016). WikiSpeech – enabling open source text-to-speech for Wikipedia. In Proceedings of the 9th ISCA Speech Synthesis Workshop (pp. 111-117). Sunnyvale, USA. [pdf]

Beskow, J., & Berthelsen, H. (2016). A hybrid harmonics-and-bursts modelling approach to speech synthesis. In 9th ISCA Workshop on Speech Synthesis (pp. 225-230). Sunnyvale, CA, USA. [pdf]

Stefanov, K., & Beskow, J. (2016). A Multi-party Multi-modal Dataset for Focus of Visual Attention in Human-human and Human-robot Interaction. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference (LREC 2016). Portorož, Slovenia. [pdf]

Stefanov, K., Sugimoto, A., & Beskow, J. (2016). Look Who's Talking - Visual Identification of the Active Speaker in Multi-party Human-robot Interaction. In Proceedings of the 2nd Workshop on Advancements in Social Signal Processing for Multimodal Interaction. Tokyo, Japan.

Wedenborn, A., Wik, P., Engwall, O., & Beskow, J. (2016). The effect of a physical robot on vocabulary learning. In Proceedings of the International Workshop on Spoken Dialogue Systems (IWSDS 2016). Saariselkä, Finland. [pdf]

2015

Alexanderson, S., & Beskow, J. (2015). Towards Fully Automated Motion Capture of Signs – Development and Evaluation of a Key Word Signing Avatar. ACM Trans. Access. Comput., 7(2), 7:1-7:17. [abstract] Abstract: Motion capture of signs provides unique challenges in the field of multimodal data collection. The dense packaging of visual information requires high fidelity and high bandwidth of the captured data. Even though marker-based optical motion capture provides many desirable features such as high accuracy, global fitting, and the ability to record body and face simultaneously, it is not widely used to record finger motion, especially not for articulated and syntactic motion such as signs. Instead, most signing avatar projects use costly instrumented gloves, which require long calibration procedures. In this article, we evaluate the data quality obtained from optical motion capture of isolated signs from Swedish sign language with a large number of low-cost cameras. We also present a novel dual-sensor approach to combine the data with low-cost, five-sensor instrumented gloves to provide a recording method with low manual postprocessing. Finally, we evaluate the collected data and the dual-sensor approach as transferred to a highly stylized avatar. The application of the avatar is a game-based environment for training Key Word Signing (KWS) as augmented and alternative communication (AAC), intended for children with communication disabilities.

House, D., Alexanderson, S., & Beskow, J. (2015). On the temporal domain of co-speech gestures: syllable, phrase or talk spurt?. In Lundmark Svensson, M., Ambrazaitis, G., & van de Weijer, J. (Eds.), Proceedings of Fonetik 2015 (pp. 63-68). Lund University, Sweden. [abstract] [pdf] Abstract: This study explores the use of automatic methods to detect and extract hand gesture movement co-occurring with speech. Two spontaneous dyadic dialogues were analyzed using 3D motion-capture techniques to track hand movement. Automatic speech/non-speech detection was performed on the dialogues, resulting in a series of connected talk spurts for each speaker. Temporal synchrony of onset and offset of gesture and speech was studied between the automatic hand gesture tracking and talk spurts, and compared to an earlier study of head nods and syllable synchronization. The results indicated onset synchronization between head nods and the syllable in the short temporal domain and between the onset of longer gesture units and the talk spurt in a more extended temporal domain.

Skantze, G., Johansson, M., & Beskow, J. (2015). A Collaborative Human-robot Game as a Test-bed for Modelling Multi-party, Situated Interaction. In Proceedings of IVA. Delft, Netherlands. [abstract] [pdf] Abstract: In this demonstration we present a test-bed for collecting data and testing out models for multi-party, situated interaction between humans and robots. Two users are playing a collaborative card sorting game together with the robot head Furhat. The cards are shown on a touch table between the players, thus constituting a target for joint attention. The system has been exhibited at the Swedish National Museum of Science and Technology over nine days, resulting in a rich multi-modal corpus with users of mixed ages.

Skantze, G., Johansson, M., & Beskow, J. (2015). Exploring Turn-taking Cues in Multi-party Human-robot Discussions about Objects. In Proceedings of ICMI. Seattle, Washington, USA. [pdf]

2014

Al Moubayed, S., Beskow, J., Bollepalli, B., Gustafson, J., Hussen-Abdelaziz, A., Johansson, M., Koutsombogera, M., Lopes, J., Novikova, J., Oertel, C., Skantze, G., Stefanov, K., & Varol, G. (2014). Human-robot collaborative tutoring using multiparty multimodal spoken dialogue. In Proc. of HRI'14. Bielefeld, Germany. [abstract] [pdf] Abstract: In this paper, we describe a project that explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robot interaction setup is designed, and a human-human dialogue corpus is collected. The corpus targets the development of a dialogue system platform to study verbal and nonverbal tutoring strategies in multiparty spoken interactions with robots which are capable of spoken dialogue. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. Along with the participants sits a tutor (robot) that helps the participants perform the task, and organizes and balances their interaction. Different multimodal signals captured and auto-synchronized by different audio-visual capture technologies, such as a microphone array, Kinects, and video cameras, were coupled with manual annotations. These are used to build a situated model of the interaction based on the participants' personalities, their state of attention, their conversational engagement and verbal dominance, and how these are correlated with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. Driven by the analysis of the corpus, we also show the detailed design methodologies for an affective and multimodally rich dialogue system that allows the robot to incrementally measure the attention states and the dominance of each participant, allowing the robot head Furhat to maintain a well-coordinated, balanced, and engaging conversation that attempts to maximize the agreement and the contribution to solving the task. This project sets the first steps to explore the potential of using multimodal dialogue systems to build interactive robots that can serve in educational, team building, and collaborative task solving applications.

Al Moubayed, S., Beskow, J., Bollepalli, B., Hussen-Abdelaziz, A., Johansson, M., Koutsombogera, M., Lopes, J., Novikova, J., Oertel, C., Skantze, G., Stefanov, K., & Varol, G. (2014). Tutoring Robots: Multiparty multimodal social dialogue with an embodied tutor. In Proceedings of eNTERFACE2013. Springer. [abstract] [pdf] Abstract: This project explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring agent. A setup is developed and a corpus is collected that targets the development of a dialogue system platform to explore verbal and nonverbal tutoring strategies in multiparty spoken interactions with embodied agents. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. With the participants sits a tutor that helps the participants perform the task and organizes and balances their interaction. Different multimodal signals captured and auto-synchronized by different audio-visual capture technologies were coupled with manual annotations to build a situated model of the interaction based on the participants' personalities, their temporally-changing state of attention, their conversational engagement and verbal dominance, and the way these are correlated with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. At the end of this chapter we discuss the potential areas of research and development this work opens and some of the challenges that lie in the road ahead.

Al Moubayed, S., Beskow, J., & Skantze, G. (2014). Spontaneous spoken dialogues with the Furhat human-like robot head. In HRI'14. Bielefeld, Germany. [abstract] [pdf] Abstract: We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is an anthropomorphic robot head that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator will showcase a system that is able to carry out social dialogue with multiple interlocutors simultaneously, with rich output signals such as eye and head coordination, lip-synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations. The dialogue design is performed using the IrisTK [4] dialogue authoring toolkit developed at KTH. The system will also be able to act as a moderator in a quiz-game, showing different strategies for regulating spoken situated interactions.

Alexanderson, S., & Beskow, J. (2014). Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions. Computer Speech & Language, 28(2), 607-618. [pdf]

Alexanderson, S., Beskow, J., & House, D. (2014). Automatic speech/non-speech classification using gestures in dialogue. In The Fifth Swedish Language Technology Conference. Uppsala, Sweden. [pdf]

Beskow, J., Stefanov, K., Alexanderson, S., Claesson, B., Derbring, S., & Fredriksson, M. (2014). Tivoli - teckeninlärning via spel och interaktion. Slutrapport projektgenomförande. Technical Report, PTS Innovation för alla. [pdf]

Koutsombogera, M., Al Moubayed, S., Beskow, J., Bollepalli, B., Gustafson, J., Hussen-Abdelaziz, A., Johansson, M., Lopes, J., Novikova, J., Oertel, C., Skantze, G., Stefanov, K., & Varol, G. (2014). The Tutorbot Corpus – A Corpus for Studying Tutoring Behaviour in Multiparty Face-to-Face Spoken Dialogue. In Proc. of LREC'14. Reykjavik, Iceland. [abstract] Abstract: This paper describes a novel experimental setup exploiting state-of-the-art capture equipment to collect a multimodally rich game-solving collaborative multiparty dialogue corpus. The corpus is targeted and designed towards the development of a dialogue system platform to explore verbal and nonverbal tutoring strategies in multiparty spoken interactions. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. The participants were paired into teams based on their degree of extraversion as determined by a personality test. With the participants sits a tutor that helps them perform the task, organizes and balances their interaction, and whose behavior was assessed by the participants after each interaction. Different multimodal signals captured and auto-synchronized by different audio-visual capture technologies, together with manual annotations of the tutor's behavior, constitute the Tutorbot corpus. This corpus is exploited to build a situated model of the interaction based on the participants' temporally-changing state of attention, their conversational engagement and verbal dominance, and their correlation with the verbal and visual feedback and conversation regulatory actions generated by the tutor.

2013

Al Moubayed, S., Beskow, J., & Skantze, G. (2013). The Furhat Social Companion Talking Head. In Interspeech 2013 - Show and Tell. Lyon, France. [abstract] [pdf] Abstract: In this demonstrator we present the Furhat robot head. Furhat is a highly human-like robot head in terms of dynamics, thanks to its use of back-projected facial animation. Furhat also takes advantage of a complex and advanced dialogue toolkit designed to facilitate rich and fluent multimodal multiparty human-machine situated and spoken dialogue. The demonstrator will present a social dialogue system with Furhat that allows for several simultaneous interlocutors, and takes advantage of several verbal and nonverbal input signals such as speech input, real-time multi-face tracking, and facial analysis, and communicates with its users in a mixed initiative dialogue, using state-of-the-art speech synthesis, with rich prosody, lip-animated facial synthesis, eye and head movements, and gestures.

Al Moubayed, S., Skantze, G., & Beskow, J. (2013). The Furhat Back-Projected Humanoid Head - Lip reading, Gaze and Multiparty Interaction. International Journal of Humanoid Robotics, 10(1). [abstract] [pdf] Abstract: In this article, we present Furhat – a back-projected human-like robot head using state-of-the-art facial animation. Three experiments are presented where we investigate how the head might facilitate human-robot face-to-face interaction. First, we investigate how the animated lips increase the intelligibility of the spoken output, and compare this to an animated agent presented on a flat screen, as well as to a human face. Second, we investigate the accuracy of the perception of Furhat's gaze in a setting typical for situated interaction, where Furhat and a human are sitting around a table. The accuracy of the perception of Furhat's gaze is measured depending on eye design, head movement and viewing angle. Third, we investigate the turn-taking accuracy of Furhat in a multi-party interactive setting, as compared to an animated agent on a flat screen. We conclude with some observations from a public setting at a museum, where Furhat interacted with thousands of visitors in a multi-party interaction.

Alexanderson, S., House, D., & Beskow, J. (2013). Aspects of co-occurring syllables and head nods in spontaneous dialogue. In Proc. of the 12th International Conference on Auditory-Visual Speech Processing (AVSP2013). Annecy, France. [pdf]

Alexanderson, S., House, D., & Beskow, J. (2013). Extracting and analysing co-speech head gestures from motion-capture data. In Eklund, R. (Ed.), Proc. of Fonetik 2013 (pp. 1-4). Linköping University, Sweden. [pdf]

Alexanderson, S., House, D., & Beskow, J. (2013). Extracting and analyzing head movements accompanying spontaneous dialogue. In Proc. Tilburg Gesture Research Meeting. Tilburg University, The Netherlands. [pdf]

Beskow, J., Alexanderson, S., Stefanov, K., Claesson, B., Derbring, S., & Fredriksson, M. (2013). The Tivoli System – A Sign-driven Game for Children with Communicative Disorders. In Proceedings of the 1st European Symposium on Multimodal Communication. [pdf]

Beskow, J., & Stefanov, K. (2013). Web-enabled 3D talking avatars based on WebGL and HTML5. In Proc. of the 13th International Conference on Intelligent Virtual Agents (IVA). Edinburgh, UK. [pdf]

Bollepalli, B., Beskow, J., & Gustafson, J. (2013). Non-Linear Pitch Modification in Voice Conversion using Artificial Neural Networks. In ISCA Workshop on Non-Linear Speech Processing 2013. [pdf]

Edlund, J., Al Moubayed, S., & Beskow, J. (2013). Co-present or not? Embodiment, situatedness and the Mona Lisa gaze effect. In Nakano, Y., Conati, C., & Bader, T. (Eds.), Eye Gaze in Intelligent User Interfaces - Gaze-based Analyses, Models, and Applications. Springer. [abstract] Abstract: The interest in embodying and situating computer programmes took off in the autonomous agents community in the 90s. Today, researchers and designers of programmes that interact with people on human terms endow their systems with humanoid physiognomies for a variety of reasons. In most cases, attempts at achieving this embodiment and situatedness have taken one of two directions: virtual characters and actual physical robots. In addition, a technique that is far from new is gaining ground rapidly: projection of animated faces on head-shaped 3D surfaces. In this chapter, we provide a history of this technique; an overview of its pros and cons; and an in-depth description of the cause and mechanics of the main drawback of 2D displays of 3D faces (and objects): the Mona Lisa gaze effect. We conclude with a description of an experimental paradigm that measures perceived directionality in general and the Mona Lisa gaze effect in particular.

Mirnig, N., Weiss, A., Skantze, G., Al Moubayed, S., Gustafson, J., Beskow, J., Granström, B., & Tscheligi, M. (2013). Face-to-Face with a Robot: What do we actually talk about?. International Journal of Humanoid Robotics, 10(1). [abstract] [link] Abstract: Whereas much of the state-of-the-art research in Human-Robot Interaction (HRI) investigates task-oriented interaction, this paper aims at exploring what people talk about to a robot if the content of the conversation is not predefined. We used the robot head Furhat to explore the conversational behavior of people who encounter a robot in the public setting of a robot exhibition in a scientific museum, but without a predefined purpose. Upon analyzing the conversations, it could be shown that a sophisticated robot provides an inviting atmosphere for people to engage in interaction and to be experimental and challenge the robot's capabilities. Many visitors to the exhibition were willing to go beyond the guiding questions that were provided as a starting point. Amongst other things, they asked Furhat questions concerning the robot itself, such as how it would define a robot, or if it plans to take over the world. People were also interested in the feelings and likes of the robot and they asked many personal questions - this is how Furhat ended up with its first marriage proposal. People who talked to Furhat were asked to complete a questionnaire on their assessment of the conversation, with which we could show that the interaction with Furhat was rated as a pleasing experience.

Stefanov, K., & Beskow, J. (2013). A Kinect Corpus of Swedish Sign Language Signs. In Proceedings of the 2013 Workshop on Multimodal Corpora: Beyond Audio and Video. [pdf]

2012

Al Moubayed, S., Beskow, J., Blomberg, M., Granström, B., Gustafson, J., Mirnig, N., & Skantze, G. (2012). Talking with Furhat - multi-party interaction with a back-projected robot head. In Proceedings of Fonetik'12. Gothenburg, Sweden. [abstract] [pdf] Abstract: This is a condensed presentation of some recent work on a back-projected robotic head for multi-party interaction in public settings. We will describe some of the design strategies and give some preliminary analysis of an interaction database collected at the Robotville exhibition at the London Science Museum.

Al Moubayed, S., Beskow, J., Skantze, G., & Granström, B. (2012). Furhat: A Back-projected Human-like Robot Head for Multiparty Human-Machine Interaction. In Esposito, A., Esposito, A., Vinciarelli, A., Hoffmann, R., & Müller, V. C. (Eds.), Cognitive Behavioural Systems. Lecture Notes in Computer Science (pp. 114-130). Springer.

Al Moubayed, S., Edlund, J., & Beskow, J. (2012). Taming Mona Lisa: communicating gaze faithfully in 2D and 3D facial projections. ACM Transactions on Interactive Intelligent Systems, 1(2), 25. [abstract] [pdf] Abstract: The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of the dialogue and participant attention, for deictic referencing, and for the communication of attitude. When developing embodied conversational agents (ECAs) and talking heads, modeling and delivering accurate gaze targets is crucial. Traditionally, systems communicating through talking heads have been displayed to the human conversant using 2D displays, such as flat monitors. This approach introduces severe limitations for an accurate communication of gaze since 2D displays are associated with several powerful effects and illusions, most importantly the Mona Lisa gaze effect, where the gaze of the projected head appears to follow the observer regardless of viewing angle. We describe the Mona Lisa gaze effect and its consequences in the interaction loop, and propose a new approach for displaying talking heads using a 3D projection surface (a physical model of a human head) as an alternative to the traditional flat surface projection. We investigate and compare the accuracy of the perception of gaze direction and the Mona Lisa gaze effect in 2D and 3D projection surfaces in a five subject gaze perception experiment. The experiment confirms that a 3D projection surface completely eliminates the Mona Lisa gaze effect and delivers very accurate gaze direction that is independent of the observer's viewing angle. Based on the data collected in this experiment, we rephrase the formulation of the Mona Lisa gaze effect. The data, when reinterpreted, confirms the predictions of the new model for both 2D and 3D projection surfaces. Finally, we discuss the requirements on different spatially interactive systems in terms of gaze direction, and propose new applications and experiments for interaction in human-ECA and human-robot settings made possible by this technology.

Al Moubayed, S., Skantze, G., & Beskow, J. (2012). Lip-reading Furhat: Audio Visual Intelligibility of a Back Projected Animated Face. In Proc. of the Intelligent Virtual Agents 10th International Conference (IVA 2012). Santa Cruz, CA, USA: Springer. [abstract] [pdf] Abstract: Back projecting a computer animated face onto a three dimensional static physical model of a face is a promising technology that is gaining ground as a solution to building situated, flexible and human-like robot heads. In this paper, we first briefly describe Furhat, a back projected robot head built for the purpose of multimodal multiparty human-machine interaction, and its benefits over virtual characters and robotic heads; and then motivate the need to investigate the contribution to speech intelligibility that Furhat's face offers. We present an audio-visual speech intelligibility experiment, in which 10 subjects listened to short sentences with a degraded speech signal. The experiment compares the gain in intelligibility between lip reading a face visualized on a 2D screen compared to a 3D back-projected face and from different viewing angles. The results show that the audio-visual speech intelligibility holds when the avatar is projected onto a static face model (in the case of Furhat), and even, rather surprisingly, exceeds it. This means that despite the movement limitations back projected animated face models bring about, their audio-visual speech intelligibility is equal, or even higher, compared to the same models shown on flat displays. At the end of the paper we discuss several hypotheses on how to interpret the results, and motivate future investigations to better explore the characteristics of visual speech perception of 3D projected animated faces.

Al Moubayed, S., Skantze, G., Beskow, J., Stefanov, K., & Gustafson, J. (2012). Multimodal Multiparty Social Interaction with the Furhat Head. In Proc. of the 14th ACM International Conference on Multimodal Interaction ICMI. Santa Monica, CA, USA. [abstract] [pdf] Abstract: We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is a human-like interface that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator will showcase a system that is able to carry out social dialogue with multiple interlocutors simultaneously, with rich output signals such as eye and head coordination, lip-synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations.

Al Moubayed, S., Beskow, J., Granström, B., Gustafson, J., Mirnig, N., Skantze, G., & Tscheligi, M. (2012). Furhat goes to Robotville: a large-scale multiparty human-robot interaction data collection in a public space. In Proc. of LREC Workshop on Multimodal Corpora. Istanbul, Turkey. [pdf]

Alexanderson, S., & Beskow, J. (2012). Can Anybody Read Me? Motion Capture Recordings for an Adaptable Visual Speech Synthesizer. In Proceedings of The Listening Talker. Edinburgh, UK. [pdf]

Blomberg, M., Skantze, G., Al Moubayed, S., Gustafson, J., Beskow, J., & Granström, B. (2012). Children and adults in dialogue with the robot head Furhat - corpus collection and initial analysis. In Proceedings of WOCCI. Portland, OR. [pdf]

Bollepalli, B., Beskow, J., & Gustafson, J. (2012). HMM based speech synthesis system for Swedish Language. In The Fourth Swedish Language Technology Conference. Lund, Sweden.

Edlund, J., Alexandersson, S., Beskow, J., Gustavsson, L., Heldner, M., Hjalmarsson, A., Kallionen, P., & Marklund, E. (2012). 3rd party observer gaze as a continuous measure of dialogue flow. In Proc. of LREC 2012. Istanbul, Turkey. [abstract] [pdf] Abstract: We present an attempt at using 3rd party observer gaze to get a measure of how appropriate each segment in a dialogue is for a speaker change. The method is a step away from the current dependency on speaker turns or talkspurts towards a more general view of speaker changes. We show that 3rd party observers do indeed largely look at the same thing (the speaker), and how this can be captured and utilized to provide insights into human communication.

Edlund, J., House, D., & Beskow, J. (2012). Gesture movement profiles in dialogues from a Swedish multimodal database of spontaneous speech. In Bergmann, P., Brenning, J., Pfeiffer, M. C., & Reber, E. (Eds.), Prosody and Embodiment in Interactional Grammar (pp. 265-280). Berlin: de Gruyter.

Saad, A., Beskow, J., & Kjellström, H. (2012). Visual Recognition of Isolated Swedish Sign Language Signs. http://arxiv.org/abs/1211.3901. [pdf]

Schötz, S., Bruce, G., Segerup, M., Beskow, J., Gustafson, J., & Granström, B. (2012). Regional Varieties of Swedish: Models and synthesis. In Niebuhr, O. (Ed.), Understanding Prosody (pp. 119-134). De Gruyter.

Skantze, G., Al Moubayed, S., Gustafson, J., Beskow, J., & Granström, B. (2012). Furhat at Robotville: A Robot Head Harvesting the Thoughts of the Public through Multi-party Dialogue. In Proceedings of IVA-RCVA. Santa Cruz, CA. [pdf]

2011

Al Moubayed, S., Alexanderson, S., Beskow, J., & Granström, B. (2011). A robotic head using projected animated faces. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (p. 69). [pdf]

Al Moubayed, S., Beskow, J., Edlund, J., Granström, B., & House, D. (2011). Animated Faces for Robotic Heads: Gaze and Beyond. In Esposito, A., Vinciarelli, A., Vicsi, K., Pelachaud, C., & Nijholt, A. (Eds.), Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues, Lecture Notes in Computer Science (pp. 19-35). Springer. [pdf]

Al Moubayed, S., & Beskow, J. (2011). A novel Skype interface using SynFace for virtual speech reading support. TMH-QPSR, 51(1), 33-36. [pdf]

Beskow, J., Alexanderson, S., Al Moubayed, S., Edlund, J., & House, D. (2011). Kinetic Data for Large-Scale Analysis and Modeling of Face-to-Face Conversation. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (pp. 103-106). [pdf]

Edlund, J., Al Moubayed, S., & Beskow, J. (2011). The Mona Lisa Gaze Effect as an Objective Metric for Perceived Cospatiality. In Vilhjálmsson, H. H., Kopp, S., Marsella, S., & Thórisson, K. R. (Eds.), Proc. of the Intelligent Virtual Agents 10th International Conference (IVA 2011) (pp. 439-440). Reykjavík, Iceland: Springer. [abstract] [pdf] Abstract: We propose to utilize the Mona Lisa gaze effect for an objective and repeatable measure of the extent to which a viewer perceives an object as cospatial. Preliminary results suggest that the metric behaves as expected.

2010

Al Moubayed, S., & Beskow, J. (2010). Prominence Detection in Swedish Using Syllable Correlates. In Interspeech'10. Makuhari, Japan. [pdf]

Al Moubayed, S., & Beskow, J. (2010). Perception of Nonverbal Gestures of Prominence in Visual Speech Animation. In FAA, The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation. Edinburgh, UK. [pdf]

Al Moubayed, S., Beskow, J., & Granström, B. (2010). Auditory-Visual Prominence: From Intelligibility to Behavior. Journal on Multimodal User Interfaces, 3(4), 299-311. [abstract] [pdf] Abstract: Auditory prominence is defined as when an acoustic segment is made salient in its context. Prominence is one of the prosodic functions that has been shown to be strongly correlated with facial movements. In this work, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted: speech quality is acoustically degraded and the fundamental frequency is removed from the signal, then the speech is presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raise gestures, which are synchronized with the auditory prominence. The experiment shows that presenting prominence as facial gestures significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a follow-up study examining the perception of the behavior of the talking heads when gestures are added over pitch accents. Using eye-gaze tracking technology and questionnaires on 10 moderately hearing impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch accents, as opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and the understanding of the talking head.

Al Moubayed, S., Beskow, J., Granström, B., & House, D. (2010). Audio-Visual Prosody: Perception, Detection, and Synthesis of Prominence. In Esposito, A. et al. (Ed.), Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues (pp. 55-71). Springer. [abstract] [pdf] Abstract: In this chapter, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study a speech intelligibility experiment is conducted, where speech quality is acoustically degraded, then the speech is presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raising gestures. The experiment shows that perceiving visual prominence as gestures, synchronized with the auditory prominence, significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a study examining the perception of the behavior of the talking heads when gestures are added at pitch movements. Using eye-gaze tracking technology and questionnaires for 10 moderately hearing impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch movements, as opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and helpfulness of the talking head.

Beskow, J., & Al Moubayed, S. (2010). Perception of Gaze Direction in 2D and 3D Facial Projections. In The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation. Edinburgh, UK. [pdf]

Beskow, J., Edlund, J., Granström, B., Gustafson, J., & House, D. (2010). Face-to-face interaction and the KTH Cooking Show. In Esposito, A., Campbell, N., Vogel, C., Hussain, A., & Nijholt, A. (Eds.), Development of Multimodal Interfaces: Active Listening and Synchrony (pp. 157-168). Berlin / Heidelberg: Springer. [pdf]

Beskow, J., Edlund, J., Gustafson, J., Heldner, M., Hjalmarsson, A., & House, D. (2010). Modelling humanlike conversational behaviour. In Proceedings of SLTC 2010. Linköping, Sweden. [pdf]

Beskow, J., Edlund, J., Gustafson, J., Heldner, M., Hjalmarsson, A., & House, D. (2010). Research focus: Interactional aspects of spoken face-to-face communication. In Proc. of Fonetik 2010 (pp. 7-10). Lund, Sweden. [abstract] [pdf] Abstract: We have a visionary goal: to learn enough about human face-to-face interaction that we are able to create an artificial conversational partner that is human-like. We take the opportunity here to present four new projects inaugurated in 2010, each adding pieces of the puzzle through a shared research focus: interactional aspects of spoken face-to-face communication.

Beskow, J., & Granström, B. (2010). Goda utsikter för teckenspråksteknologi. In Domeij, R., Breivik, T., Halskov, J., Kirchmeier-Andersen, S., Langgård, P., & Moshagen, S. (Eds.), Språkteknologi för ökad tillgänglighet - Rapport från ett nordiskt seminarium (pp. 77-86). [pdf]

Beskow, J., & Granström, B. (2010). Teckenspråkteknologi - sammanfattning av förstudie. Technical Report, KTH Centrum för Talteknologi. [pdf]

Edlund, J., Gustafson, J., & Beskow, J. (2010). Cocktail – a demonstration of massively multi-component audio environments for illustration and analysis. In Proc. of SLTC 2010. Linköping, Sweden. [abstract] [pdf] Abstract: We present MMAE – Massively Multi-component Audio Environments – a new concept in auditory presentation, and Cocktail – a demonstrator built on this technology. MMAE creates a dynamic audio environment by playing a large number of sound clips simultaneously at different locations in a virtual 3D space. The technique utilizes standard soundboards and is based on the Snack Sound Toolkit. The result is an efficient 3D audio environment that can be modified dynamically, in real time. Applications range from the creation of canned as well as online audio environments for games and entertainment to the browsing, analyzing and comparing of large quantities of audio data. We also demonstrate the Cocktail implementation of MMAE using several test cases as examples.

Edlund, J., & Beskow, J. (2010). Capturing massively multimodal dialogues: affordable synchronization and visualization. In Kipp, M., Martin, J-C., Paggio, P., & Heylen, D. (Eds.), Proc. of Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (MMC 2010) (pp. 160-161). Valetta, Malta. [abstract] [pdf] Abstract: In this demo, we show (a) affordable and relatively easy-to-implement means to facilitate synchronization of audio, video and motion capture data in post processing, and (b) a flexible tool for 3D visualization of recorded motion capture data aligned with audio and video sequences. The synchronisation is made possible by the use of two simple analogue devices: a turntable and an easy-to-build electronic clapper board. The demo shows examples of how the signals from the turntable and the clapper board are traced over the three modalities, using the 3D visualisation tool. We also demonstrate how the visualisation tool shows head and torso movements captured by the motion capture system.

Edlund, J., Beskow, J., Elenius, K., Hellmer, K., Strömbergsson, S., & House, D. (2010). Spontal: a Swedish spontaneous dialogue corpus of audio, video and motion capture. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odjik, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proc. of the Seventh conference on International Language Resources and Evaluation (LREC'10) (pp. 2992-2995). Valetta, Malta. [abstract] [pdf] Abstract: We present the Spontal database of spontaneous Swedish dialogues. 120 dialogues of at least 30 minutes each have been captured in high-quality audio, high-resolution video and with a motion capture system. The corpus is currently being processed and annotated, and will be made available for research at the end of the project.

Schötz, S., Beskow, J., Bruce, G., Granström, B., & Gustafson, J. (2010). Simulating Intonation in Regional Varieties of Swedish. In Speech Prosody 2010. Chicago, USA. [pdf]

Schötz, S., Beskow, J., Bruce, G., Granström, B., Gustafson, J., & Segerup, M. (2010). Simulating Intonation in Regional Varieties of Swedish. In Fonetik 2010. Lund, Sweden. [pdf]

2009

Al Moubayed, S., & Beskow, J. (2009). Effects of Visual Prominence Cues on Speech Intelligibility. In Proceedings of Auditory-Visual Speech Processing AVSP'09. Norwich, England. [abstract] [pdf] Abstract: This study reports experimental results on the effect of visual prominence, presented as gestures, on speech intelligibility. 30 acoustically vocoded sentences, permutated into different gestural conditions, were presented audio-visually to 12 subjects. The analysis of correct word recognition shows a significant increase in intelligibility when focally-accented (prominent) words are supplemented with head-nod or eyebrow-raise gestures. The paper also examines coupling other acoustic phenomena to brow-raise gestures. As a result, the paper introduces new evidence on the ability of non-verbal movements in the visual modality to support audio-visual speech perception.

Al Moubayed, S., Beskow, J., Öster, A-M., Salvi, G., Granström, B., van Son, N., Ormel, E., & Herzke, T. (2009). Studies on Using the SynFace Talking Head for the Hearing Impaired. In Proceedings of Fonetik'09. Dept. of Linguistics, Stockholm University, Sweden. [abstract] [pdf] Abstract: SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper we present the large-scale hearing-impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and the effort scaling when SynFace is used by hearing-impaired people, where groups of hearing-impaired subjects with different impairment levels from mild to severe, as well as cochlear implant users, are tested. Preliminary analysis of the results does not show a significant gain in SRT or in effort scaling. But looking at the large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace, especially with speech in stereo babble.

Al Moubayed, S., Beskow, J., Öster, A., Salvi, G., Granström, B., van Son, N., & Ormel, E. (2009). Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-media Setting. In Proceedings of Interspeech 2009.

Andréasson, M., Borin, L., Forsberg, M., Beskow, J., Carlson, R., Edlund, J., Elenius, K., Hellmer, K., House, D., Merkel, M., Forsbom, E., Megyesi, B., Eriksson, A., & Strömqvist, S. (2009). Swedish CLARIN activities. In Domeij, R., Koskenniemi, K., Krauwer, S., Maegaard, B., Rögnvaldsson, E., & de Smedt, K. (Eds.), Proc. of the NODALIDA 2009 workshop Nordic Perspectives on the CLARIN Infrastructure of Language Resources (pp. 1-5). Northern European Association for Language Technology. [abstract] [pdf] Abstract: Although Sweden has yet to allocate funds specifically intended for CLARIN activities, there are some ongoing activities which are directly relevant to CLARIN, and which are explicitly linked to CLARIN. These activities have been funded by the Committee for Research Infrastructures and its subcommittee DISC (Database Infrastructure Committee) of the Swedish Research Council.

Beskow, J., Edlund, J., Granström, B., Gustafson, J., Skantze, G., & Tobiasson, H. (2009). The MonAMI Reminder: a spoken dialogue system for face-to-face interaction. In Interspeech 2009. Brighton, U.K. [abstract] [pdf] Abstract: We describe the MonAMI Reminder, a multimodal spoken dialogue system which can assist elderly and disabled people in organising and initiating their daily activities. Based on deep interviews with potential users, we have designed a calendar and reminder application which uses an innovative mix of an embodied conversational agent, digital pen and paper, and the web to meet the needs of those users as well as the current constraints of speech technology. We also explore the use of head pose tracking for interaction and attention control in human-computer face-to-face interaction.

Beskow, J., & Gustafson, J. (2009). Experiments with Synthesis of Swedish Dialects. In Proceedings of Fonetik 2009.

Beskow, J., Salvi, G., & Al Moubayed, S. (2009). SynFace - Verbal and Non-verbal Face Animation from Audio. In Proceedings of The International Conference on Auditory-Visual Speech Processing AVSP'09. Norwich, England. [abstract] [pdf] Abstract: We give an overview of SynFace, a speech-driven face animation system originally developed for the needs of hard-of-hearing users of the telephone. For the 2009 LIPS challenge, SynFace includes not only articulatory motion but also non-verbal motion of gaze, eyebrows and head, triggered by detection of acoustic correlates of prominence and cues for interaction control. In perceptual evaluations, both verbal and non-verbal movements have been found to have positive impact on word recognition scores.

Beskow, J., Carlson, R., Edlund, J., Granström, B., Heldner, M., Hjalmarsson, A., & Skantze, G. (2009). Multimodal Interaction Control. In Waibel, A., & Stiefelhagen, R. (Eds.), Computers in the Human Interaction Loop (pp. 143-158). Berlin/Heidelberg: Springer. [pdf]

Beskow, J., Edlund, J., Elenius, K., Hellmer, K., House, D., & Strömbergsson, S. (2009). Project presentation: Spontal – multimodal database of spontaneous dialog. In Fonetik 2009 (pp. 190-193). Stockholm. [abstract] [pdf] Abstract: We describe the ongoing Swedish speech database project Spontal: Multimodal database of spontaneous speech in dialog (VR 2006-7482). The project takes as its point of departure the fact that both vocal signals and gesture involving the face and body are important in everyday, face-to-face communicative interaction, and that there is a great need for data with which we can more precisely measure these.

Edlund, J., & Beskow, J. (2009). MushyPeek - a framework for online investigation of audiovisual dialogue phenomena. Language and Speech, 52(2-3), 351-367. [abstract] Abstract: Evaluation of methods and techniques for conversational and multimodal spoken dialogue systems is complex, as is gathering data for the modeling and tuning of such techniques. This article describes MushyPeek, an experiment framework that allows us to manipulate the audiovisual behavior of interlocutors in a setting similar to face-to-face human-human dialogue. The setup connects two subjects to each other over a Voice over Internet Protocol (VoIP) telephone connection and simultaneously provides each of them with an avatar representing the other. We present a first experiment which inaugurates, exemplifies, and validates the framework. The experiment corroborates earlier findings on the use of gaze and head pose gestures in turn-taking.

Salvi, G., Beskow, J., Al Moubayed, S., & Granström, B. (2009). SynFace—Speech-Driven Facial Animation for Virtual Speech-Reading Support. EURASIP Journal on Audio, Speech, and Music Processing, 2009. [abstract] [pdf] Abstract: This paper describes SynFace, a supportive technology that aims at enhancing audio-based spoken communication in adverse acoustic conditions by providing the missing visual information in the form of an animated talking head. Firstly, we describe the system architecture, consisting of a 3D animated face model controlled from the speech input by a specifically optimised phonetic recogniser. Secondly, we report on speech intelligibility experiments with focus on multilinguality and robustness to audio quality. The system, already available for Swedish, English, and Flemish, was optimised for German and for Swedish wide-band speech quality available in TV, radio, and Internet communication. Lastly, the paper covers experiments with nonverbal motions driven from the speech signal. It is shown that turn-taking gestures can be used to affect the flow of human-human dialogues. We have focused specifically on two categories of cues that may be extracted from the acoustic signal: prominence/emphasis and interactional cues (turn-taking/back-channelling).

2008

Al Moubayed, S., Beskow, J., & Salvi, G. (2008). SynFace Phone Recognizer for Swedish Wideband and Narrowband Speech. In Proceedings of The second Swedish Language Technology Conference (SLTC). Stockholm, Sweden. [abstract] [pdf] Abstract: In this paper, we present new results and comparisons of the real-time lip-synchronized talking head SynFace on different Swedish databases and bandwidths. The work involves training SynFace on narrow-band telephone speech from the Swedish SpeechDat, and on the narrow-band and wide-band Speecon corpus. Auditory perceptual tests are being established for SynFace as an audio-visual hearing support for the hearing-impaired. Preliminary results show high recognition accuracy compared to other languages.

Beskow, J., Bruce, G., Enflo, L., Granström, B., & Schötz, S. (2008). Human Recognition of Swedish Dialects. In Proceedings of Fonetik 2008.

Beskow, J., Bruce, G., Enflo, L., Granström, B., & Schötz, S. (2008). Recognizing and Modelling Regional Varieties of Swedish. In Proceedings of Interspeech 2008. [pdf]

Beskow, J., & Cerrato, L. (2008). Evaluation of the expressivity of a Swedish talking head in the context of human-machine interaction. In Magno Caldognetto, E., Cavicchio, E., & Cosi, P. (Eds.), Comunicazione Parlata e manifestazione delle emozioni. [pdf]

Beskow, J., Edlund, J., Gjermani, T., Granström, B., Gustafson, J., Jonsson, O., Skantze, G., & Tobiasson, H. (2008). Innovative interfaces in MonAMI: the reminder. In Proceedings of the 10th international conference on Multimodal interfaces, Chania, Crete, Greece (pp. 199-200). New York, NY, USA: ACM. [abstract] [pdf] Abstract: This demo paper presents an early version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. The Reminder helps users to plan activities and to remember what to do. The prototype merges mobile ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar. Users may ask questions such as “When was I supposed to meet Sara?” or “What’s my schedule today?”

Beskow, J., Edlund, J., Granström, B., Gustafson, J., Jonsson, O., & Skantze, G. (2008). Speech technology in the European project MonAMI. In Proceedings of FONETIK 2008 (pp. 33-36). Gothenburg, Sweden. [abstract] [pdf] Abstract: This paper describes the role of speech and speech technology in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. It presents the Reminder, a prototype embodied conversational agent (ECA) which helps users to plan activities and to remember what to do. The prototype merges speech technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar. Users may also ask questions such as “When was I supposed to meet Sara?” or “What’s on my schedule today?”

Beskow, J., Edlund, J., Granström, B., Gustafson, J., & Skantze, G. (2008). Innovative interfaces in MonAMI: the KTH Reminder. In Perception in Multimodal Dialogue Systems - Proceedings of the 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, PIT 2008, Kloster Irsee, Germany, June 16-18, 2008 (pp. 272-275). Berlin/Heidelberg: Springer. [abstract] [pdf] Abstract: This demo paper presents the first version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. The Reminder helps users to plan activities and to remember what to do. The prototype merges ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. This innovative combination of modalities allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides verbal notifications on what has been written in the calendar. Users may also ask questions such as “When was I supposed to meet Sara?” or “What’s on my schedule today?”

Beskow, J., Engwall, O., Granström, B., Nordqvist, P., & Wik, P. (2008). Visualization of speech and audio for hearing-impaired persons. Technology and Disability, 20(2), 97-107. [pdf]

Beskow, J., Granström, B., Nordqvist, P., Al Moubayed, S., Salvi, G., Herzke, T., & Schulz, A. (2008). Hearing at Home – Communication support in home environments for hearing impaired persons. In Proceedings of Interspeech 2008. Brisbane, Australia. [abstract] [pdf] Abstract: The Hearing at Home (HaH) project focuses on the needs of hearing-impaired people in home environments. The project is researching and developing an innovative media-center solution for hearing support, with several integrated features that support perception of speech and audio, such as individual loudness amplification, noise reduction, audio classification and event detection, and the possibility to display an animated talking head providing real-time speechreading support. In this paper we provide a brief project overview and then describe some recent results related to the audio classifier and the talking head. As the talking head expects clean speech input, an audio classifier has been developed for the task of classifying audio signals as clean speech, speech in noise or other. The mean accuracy of the classifier was 82%. The talking head (based on technology from the SynFace project) has been adapted for German, and a small speech-in-noise intelligibility experiment was conducted where sentence recognition rates increased from 3% to 17% when the talking head was present.

Beskow, J., Edlund, J., Granström, B., Gustafson, J., Jonsson, O., Skantze, G., & Tobiasson, H. (2008). The MonAMI Reminder system. In Proc. of SLTC 2008 (pp. 13-14). Stockholm. [abstract] [pdf] Abstract: This paper presents an early version of the Reminder, a prototype ECA developed in the European project MonAMI. The Reminder helps users to plan activities and to remember what to do. The prototype merges mobile ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar.

López-Colino, F., Beskow, J., & Colás, J. (2008). Mobile SynFace: Ubiquitous visual interface for mobile VoIP telephone calls. In Proceedings of The second Swedish Language Technology Conference (SLTC). Stockholm, Sweden. [pdf]

2007

Beskow, J., Granström, B., & House, D. (2007). Analysis and synthesis of multimodal verbal and non-verbal interaction for animated interface agents. In Esposito, A., Faundez-Zanuy, M., Keller, E., & Marinaro, M. (Eds.), Verbal and Nonverbal Communication Behaviours (pp. 250-263). Berlin: Springer-Verlag.

Edlund, J., & Beskow, J. (2007). Pushy versus meek – using avatars to influence turn-taking behaviour. In Proceedings of Interspeech 2007. Antwerp, Belgium. [abstract] [pdf] Abstract: The flow of spoken interaction between human interlocutors is a widely studied topic. Amongst other things, studies have shown that we use a number of facial gestures to improve this flow – to control the taking of turns. This ought to be useful in systems where an animated talking head is used, be they systems for computer mediated human-human dialogue or spoken dialogue systems, where the computer itself uses speech to interact with users. In this article, we show that a small set of simple interaction control gestures and a simple model of interaction can be used to influence users’ behaviour in an unobtrusive manner. The results imply that such a model may improve the flow of computer mediated interaction between humans under adverse circumstances, such as network latency, or create more human-like spoken human-computer interaction.

Edlund, J., Beskow, J., & Heldner, M. (2007). MushyPeek – an experiment framework for controlled investigation of human-human interaction control behaviour. Proceedings of Fonetik, TMH-QPSR, 50(1), 61-64. [abstract] [pdf] Abstract: This paper describes MushyPeek, an experiment framework that allows us to manipulate interaction control behaviour – including turn-taking – in a setting quite similar to face-to-face human-human dialogue. The setup connects two subjects to each other over a VoIP telephone connection and simultaneously provides each of them with an avatar representing the other. The framework is exemplified with the first experiment we tried in it – a test of the effectiveness of interaction control gestures in an animated lip-synchronised talking head.

2006

Agelfors, E., Beskow, J., Karlsson, I., Kewley, J., Salvi, G., & Thomas, N. (2006). User Evaluation of the SYNFACE Talking Head Telephone. Lecture Notes in Computer Science, 4061, 579-586. [abstract] [pdf] Abstract: The talking-head telephone, Synface, is a lip-reading support for people with hearing impairment. It has been tested by 49 users with varying degrees of hearing impairment in the UK and Sweden, in lab and home environments. Synface was found to give support to the users, especially in perceiving numbers and addresses, and to be an enjoyable way to communicate. A majority deemed Synface to be a useful product.

Beskow, J., Granström, B., & House, D. (2006). Focal accent and facial movements in expressive speech. In Fonetik 2006, Working Papers 52, General Linguistics and Phonetics, Lund University (pp. 9-12). [pdf]

Beskow, J., Granström, B., & House, D. (2006). Visual correlates to prominence in several expressive modes. In Proceedings of Interspeech 2006 (pp. 1272-1275). Pittsburgh, PA. [pdf]

Lidestam, B., & Beskow, J. (2006). Motivation and appraisal in perception of poorly specified speech. Scandinavian Journal of Psychology, 47(2), 93-101. [pdf]

Lidestam, B., & Beskow, J. (2006). Visual phonemic ambiguity and speechreading. Journal of Speech, Language and Hearing Research, 49(4), 835-847. [pdf]

2005

Beskow, J., Edlund, J., & Nordstrand, M. (2005). A model for multi-modal dialogue system output applied to an animated talking head. In Minker, W., Bühler, D., & Dybkjaer, L. (Eds.), Spoken Multimodal Human-Computer Dialogue in Mobile Environments, Text, Speech and Language Technology (pp. 93-113). Dordrecht, The Netherlands: Kluwer Academic Publishers. [abstract] [pdf] Abstract: We present a formalism for specifying verbal and non-verbal output from a multi-modal dialogue system. The output specification is XML-based and provides information about communicative functions of the output, without detailing the realisation of these functions. The aim is to let dialogue systems generate the same output for a wide variety of output devices and modalities. The formalism was developed and implemented in the multi-modal spoken dialogue system AdApt. We also describe how facial gestures in the 3D-animated talking head used within this system are controlled through the formalism.

Beskow, J., & Nordenberg, M. (2005). Data-driven Synthesis of Expressive Visual Speech using an MPEG-4 Talking Head. In Proceedings of Interspeech 2005. Lisbon. [pdf]

2004

Beskow, J. (2004). Trainable articulatory control models for visual speech synthesis. International Journal of Speech Technology, 7(4), 335-349. [pdf]Beskow, J., Cerrato, L., Cosi, P., Costantini, E., Nordstrand, M., Pianesi, F., Prete, M., & Svanfeldt, G. (2004). Preliminary cross-cultural evaluation of expressiveness in synthetic faces. In André, E., Dybkjaer, L., Minker, W., & Heisterkamp, P. (Eds.), Proc Tutorial and Research Workshop on Affective Dialogue Systems, ADS'04 (pp. 240-243). Kloster Irsee, Germany. [pdf]Beskow, J., Cerrato, L., Granström, B., House, D., Nordenberg, M., Nordstrand, M., & Svanfeldt, G. (2004). Expressive Animated Agents for Affective Dialogue Systems. In Proc Tutorial and Research Workshop on Affective Dialogue Systems, ADS'04. Kloster Irsee, Germany. [pdf]Beskow, J., Cerrato, L., Granström, B., House, D., Nordstrand, M., & Svanfeldt, G. (2004). The Swedish PF-Star multimodal corpora. In Proc LREC Workshop on Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces (pp. 34-37). Lisbon. [pdf]Beskow, J., Karlsson, I., Kewley, J., & Salvi, G. (2004). SYNFACE - A talking head telephone for the hearing-impaired. In Miesenberger, K., Klaus, J., Zagler, W., & Burger, D. (Eds.), Computers Helping People with Special Needs (pp. 1178-1186). Springer-Verlag. [abstract] [pdf]Abstract: SYNFACE is a telephone aid for hearing-impaired people that shows the lip movements of the speaker at the other telephone synchronised with the speech. The SYNFACE system consists of a speech recogniser that recognises the incoming speech and a synthetic talking head. The output from the recogniser is used to control the articulatory movements of the synthetic head. SYNFACE prototype systems exist for three languages: Dutch, English and Swedish, and the first user trials have just started.Engwall, O., Wik, P., Beskow, J., & Granström, B. (2004). Design strategies for a virtual language tutor. In Kim, S. H., & Young, D. H. (Eds.), Proc ICSLP 2004 (pp. 1693-1696). Jeju Island, Korea. [pdf]Spens, K-E., Agelfors, E., Beskow, J., Granström, B., Karlsson, I., & Salvi, G. (2004). SYNFACE, a talking head telephone for the hearing impaired. In Proc IFHOH 7th World Congress. Helsinki, Finland.

2003

Beskow, J. (2003). Talking heads - Models and applications for multimodal speech synthesis. Doctoral dissertation, KTH.Beskow, J., Engwall, O., & Granström, B. (2003). Resynthesis of Facial and Intraoral Articulation from Simultaneous Measurements. In Solé, M., Recasens, D., & Romero, J. (Eds.), Proceedings of the 15th ICPhS (pp. 431-434). Barcelona, Spain. [pdf]Beskow, J., Engwall, O., & Granström, B. (2003). Simultaneous measurements of facial and intraoral articulation. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 57-60). [pdf]Engwall, O., & Beskow, J. (2003). Resynthesis of 3D tongue movements from facial data. In Proc EuroSpeech 2003 (pp. 2261-2264). [pdf]Engwall, O., & Beskow, J. (2003). The effect of corpus choice on statistical articulatory modeling. In 7th Intl Seminar on Speech Production (pp. 49-54). Sydney. [pdf]Siciliano, C., Williams, G., Beskow, J., & Faulkner, A. (2003). Evaluation of a Multilingual Synthetic Talking Face as a Communication Aid for the Hearing Impaired. In Proc of ICPhS, XV Intl Congress of Phonetic Sciences (pp. 131-134). Barcelona, Spain. [pdf]

2002

Beskow, J., Edlund, J., & Nordstrand, M. (2002). Specification and realisation of multimodal output in dialogue systems. In Proc of ICSLP 2002 (pp. 181-184). Denver, Colorado, USA. [abstract] [pdf]Abstract: We present a high-level formalism for specifying verbal and nonverbal output from a multimodal dialogue system. The output specification is XML-based and provides information about communicative functions of the output without detailing the realisation of these functions. The specification can be used to control an animated character that uses speech and gestures. We give examples from an implementation in a multimodal spoken dialogue system, and describe how facial gestures are implemented in a 3D-animated talking agent within this system.Beskow, J., Granström, B., & House, D. (2002). A multimodal speech synthesis tool applied to audio-visual prosody. In Keller, E., Bailly, G., Monaghan, A., Terken, J., & Huckvale, M. (Eds.), Improvements in Speech Synthesis (pp. 372-382). New York: John Wiley & Sons, Inc.Beskow, J., Granström, B., & Spens, K-E. (2002). Articulation strength - Readability experiments with a synthetic talking face. Proceedings of Fonetik, TMH-QPSR, 44(1), 097-100. [pdf]Edlund, J., Beskow, J., & Nordstrand, M. (2002). GESOM - A model for describing and generating multi-modal output. In Proc of ISCA Workshop on Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany. [abstract] [pdf]Abstract: This paper describes GESOM, a model for generation of generalised, high-level multi-modal dialogue system output. It aims to let dialogue systems generate output for various output devices and modalities with a minimum of changes to the output generation of the dialogue system. The model was developed and tested within the AdApt spoken dialogue system, from which the bulk of the examples in this paper are taken.Granström, B., House, D., & Beskow, J. (2002). Speech and gestures for talking faces in conversational dialogue systems. In Granström, B., House, D., & Karlsson, I. (Eds.), Multimodality in language and speech systems (pp. 209-241). Dordrecht: Kluwer Academic Publishers.

2001

Granström, B., House, D., Beskow, J., & Lundeberg, M. (2001). Verbal and visual prosody in multimodal speech perception. In van Dommelen, W., & Fretheim, T. (Eds.), Nordic Prosody 2000: Proc of VIII Conf (pp. 77-88). Trondheim, Norway.House, D., Beskow, J., & Granström, B. (2001). Interaction of visual cues for prominence. In Karlsson, A., & van de Weijer, J. (Eds.), Papers from Fonetik 2001 (pp. 62-65). House, D., Beskow, J., & Granström, B. (2001). Timing and interaction of visual cues for prominence in audiovisual speech perception. In Proc of Eurospeech 2001 (pp. 387-390). Aalborg, Denmark. [pdf]

2000

Beskow, J., Granström, B., House, D., & Lundeberg, M. (2000). Experiments with verbal and visual conversational signals for an automatic language tutor. In Delcloque, P., & Bramoullé, A. (Eds.), Proc of InSTIL 2000 (pp. 138-142). University of Abertay Dundee, Dundee, Scotland.Beskow, J., Granström, B., House, D., & Lundeberg, M. (2000). Verbal and visual prosody in multimodal speech perception. In Nordic Prosody VIII. Granström, B., House, D., Beskow, J., & Lundeberg, M. (2000). Verbal and visual prosody in multimodal speech perception. In Proc 4th Swedish Symposium on Multimodal Communication. Gustafson, J., Bell, L., Beskow, J., Boye, J., Carlson, R., Edlund, J., Granström, B., House, D., & Wirén, M. (2000). AdApt - a multimodal conversational dialogue system in an apartment domain. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc. of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 134-137). Beijing: China Military Friendship Publish. [abstract] [pdf]Abstract: A general overview of the AdApt project and the research that is performed within the project is presented. In this project various aspects of human-computer interaction in a multimodal conversational dialogue system are investigated. The project will also include studies on the integration of user/system/dialogue dependent speech recognition and multimodal speech synthesis. A domain in which multimodal interaction is highly useful has been chosen, namely, finding available apartments in Stockholm. A Wizard-of-Oz data collection within this domain is also described.Massaro, D. W., Cohen, M., Beskow, J., & Cole, R. (2000). Developing and Evaluating Conversational Agents. In Cassell, J., et al. (Eds.), Embodied Conversational Agents. Cambridge, MA: MIT Press.Sjölander, K., & Beskow, J. (2000). WaveSurfer - an open source speech tool. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proceedings of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 464-467). Beijing. [pdf]

1999

Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). A Synthetic Face as a Lip-reading Support for Hearing Impaired Telephone Users - Problems and Positive Results. In Proceedings of the 4th European Conference on Audiology. Oulu, Finland.Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). Two methods for Visual Parameter Extraction in the Teleface Project. In Proceedings of Fonetik. Gothenburg, Sweden.Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1999). Artificial video for Hearing-Impaired Telephone Users; A comparison with the No Video and Perfect Video Conditions. In Bühler, C., & Knops, H. (Eds.), Assistive Technology on the Threshold of the New Millennium (pp. 116-121). IOS Press.Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1999). ASR controlled synthetic face as a lipreading support for hearing impaired telephone users. In Cost249 meeting. Prague, Czech Republic.Agelfors, E., Beskow, J., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). Synthetic visual speech driven from auditory speech. In Proceedings of Audio-Visual Speech Processing (AVSP). Santa Cruz, USA. [pdf]Gustafson, J., Sjölander, K., Beskow, J., Granström, B., & Carlson, R. (1999). Creating web-based exercises for spoken language technology. In Tutorial session in proceedings of IDS'99 (pp. 165-168). [pdf]Lundeberg, M., & Beskow, J. (1999). Developing a 3D-agent for the August dialogue system. In Proc of AVSP 99. [pdf]Massaro, D., Beskow, J., Cohen, M., Fry, C., & Rodriguez, T. (1999). Picture my voice: Audio to visual speech synthesis using artificial neural networks. In Proc of AVSP 99 (pp. 133-138). [pdf]Massaro, D., Cohen, M., & Beskow, J. (1999). From Theory to Practice: Rewards and Challenges. In Proc of ICPhS. [pdf]Sjölander, K., Gustafson, J., Beskow, J., Granström, B., & Carlson, R. (1999). Web-based educational tools for speech technology. In Proceedings of MATISSE - ESCA/SOCRATES Workshop on Method and Tool Innovations for Speech Science Education. London.

1998

Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). Synthetic faces as a lipreading support. In Proceedings of ICSLP'98. [pdf]Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). Teleface - the use of a synthetic face for the hard of hearing. In Proc of IVTTA'98. Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). The synthetic face from a hearing impaired view. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 200-203). Stockholm University. [html]Beskow, J. (1998). A tool for teaching and development of parametric speech synthesis. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 162-165). Stockholm University. [pdf]Cohen, M., Beskow, J., & Massaro, D. (1998). Recent developments in facial animation: an inside view. In Proc of AVSP'98. [pdf]Cole, R., Carmell, T., Conners, P., Macon, M., Wouters, J., de Villiers, J., Tarachow, A., Massaro, D., Cohen, M., Beskow, J., Yang, J., Meier, U., Waibel, A., Stone, P., Fortier, G., Davis, A., & Soland, C. (1998). Intelligent Animated Agents for Interactive Language Training. In Proc of STiLL - ESCA Workshop on Speech Technology in Language Learning. Sjölander, K., Beskow, J., Gustafson, J., Lewin, E., Carlson, R., & Granström, B. (1998). Web-based educational tools for speech technology. In Proc of ICSLP98, 5th Intl Conference on Spoken Language Processing (pp. 3217-3220). Sydney, Australia.

1997

Beskow, J. (1997). Animation of talking agents. In Benoit, C., & Campbell, R. (Eds.), Proc of ESCA Workshop on Audio-Visual Speech Processing (pp. 149-152). Rhodes, Greece. [pdf]Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1997). The Teleface project - disability, feasibility and intelligibility. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum (pp. 85-88). [pdf]Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1997). The Teleface project - Multimodal speech communication for the hearing impaired. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 2003-2006). Rhodes, Greece. [pdf]Beskow, J., Elenius, K., & McGlashan, S. (1997). OLGA - A dialogue system with an animated talking agent. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 1651-1654). Rhodes, Greece. [pdf]Beskow, J., Elenius, K., & McGlashan, S. (1997). The OLGA project: An animated talking agent in a dialogue system. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Phonum 4 (pp. 69-72). Lövånger/Umeå. [pdf]Beskow, J., Elenius, K. O. E., & McGlashan, S. (1997). OLGA - A dialogue system with an animated talking agent. TMH-QPSR, 38(2-3), 001-006. [pdf]Beskow, J., & McGlashan, S. (1997). OLGA - A conversational agent with gestures. In André, E. (Ed.), Proc of the IJCAI -97 Workshop on Animated Interface Agents: Making them Intelligent (pp. 39-44). Nagoya, Japan. [pdf]

1996

Beskow, J. (1996). Talking heads - communication, articulation and animation. TMH-QPSR, 37(2), 053-056. [pdf]

1995

Bertenstam, J., Beskow, J., Blomberg, M., Carlson, R., Elenius, K., Granström, B., Gustafson, J., Hunnicutt, S., Högberg, J., Lindell, R., Neovius, L., Nord, L., de Serpa-Leitao, A., & Ström, N. (1995). The Waxholm system - a progress report. In Dalsgaard, P. (Ed.), Proc of ESCA Workshop on Spoken Dialogue Systems (pp. 281-284). Vigsø, Denmark. [pdf]Beskow, J. (1995). Regelstyrd visuell talsyntes [Rule-based visual speech synthesis]. Master's thesis, KTH, TMH.Beskow, J. (1995). Rule-based visual speech synthesis. In Pardo, J. (Ed.), Proc of the 4th European Conference on Speech Communication and Technology (EUROSPEECH '95) (pp. 299-302). Madrid.