Publications by Jonas Beskow
2012
Al Moubayed, S., Edlund, J., & Beskow, J. (2012). Taming Mona Lisa: communicating gaze faithfully in 2D and 3D facial projections. ACM Transactions on Interactive Intelligent Systems.
Abstract: The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of the dialogue and participant attention, for deictic referencing, and for the communication of attitude. When developing embodied conversational agents (ECAs) and talking heads, modeling and delivering accurate gaze targets is crucial. Traditionally, systems communicating through talking heads have been displayed to the human conversant using 2D displays, such as flat monitors. This approach introduces severe limitations for an accurate communication of gaze since 2D displays are associated with several powerful effects and illusions, most importantly the Mona Lisa gaze effect, where the gaze of the projected head appears to follow the observer regardless of viewing angle. We describe the Mona Lisa gaze effect and its consequences in the interaction loop, and propose a new approach for displaying talking heads using a 3D projection surface (a physical model of a human head) as an alternative to the traditional flat surface projection. We investigate and compare the accuracy of the perception of gaze direction and the Mona Lisa gaze effect in 2D and 3D projection surfaces in a five-subject gaze perception experiment. The experiment confirms that a 3D projection surface completely eliminates the Mona Lisa gaze effect and delivers very accurate gaze direction that is independent of the observer's viewing angle. Based on the data collected in this experiment, we rephrase the formulation of the Mona Lisa gaze effect. The data, when reinterpreted, confirms the predictions of the new model for both 2D and 3D projection surfaces. Finally, we discuss the requirements on different spatially interactive systems in terms of gaze direction, and propose new applications and experiments for interaction in human-ECA and human-robot settings made possible by this technology.
Edlund, J., House, D., & Beskow, J. (in press). Gesture movement profiles in dialogues from a Swedish multimodal database of spontaneous speech. To be published in Bergmann, P., Brenning, J., Pfeiffer, M. C., & Reber, E. (Eds.), Prosodic and Visual Resources in Interactional Grammar. de Gruyter.
2011
Al Moubayed, S., Alexanderson, S., Beskow, J., & Granström, B. (2011). A robotic head using projected animated faces. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (p. 69). [pdf]
Al Moubayed, S., Beskow, J., Edlund, J., Granström, B., & House, D. (2011). Animated Faces for Robotic Heads: Gaze and Beyond. In Esposito, A., Vinciarelli, A., Vicsi, K., Pelachaud, C., & Nijholt, A. (Eds.), Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues, Lecture Notes in Computer Science (pp. 19-35). Springer.
Al Moubayed, S., & Beskow, J. (2011). A novel Skype interface using SynFace for virtual speech reading support. TMH-QPSR, 51(1), 33-36. [pdf]
Beskow, J., Alexanderson, S., Al Moubayed, S., Edlund, J., & House, D. (2011). Kinetic Data for Large-Scale Analysis and Modeling of Face-to-Face Conversation. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (pp. 103-106). [pdf]
Edlund, J., Al Moubayed, S., & Beskow, J. (2011). The Mona Lisa Gaze Effect as an Objective Metric for Perceived Cospatiality. In Vilhjálmsson, H. H., Kopp, S., Marsella, S., & Thórisson, K. R. (Eds.), Proc. of the Intelligent Virtual Agents 10th International Conference (IVA 2011) (pp. 439-440). Reykjavík, Iceland: Springer. [pdf]
Abstract: We propose to utilize the Mona Lisa gaze effect for an objective and repeatable measure of the extent to which a viewer perceives an object as cospatial. Preliminary results suggest that the metric behaves as expected.
2010
Al Moubayed, S., & Beskow, J. (2010). Prominence Detection in Swedish Using Syllable Correlates. In Interspeech'10. Makuhari, Japan. [pdf]
Al Moubayed, S., & Beskow, J. (2010). Perception of Nonverbal Gestures of Prominence in Visual Speech Animation. In FAA, The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation. Edinburgh, UK. [pdf]
Al Moubayed, S., Beskow, J., & Granström, B. (2010). Auditory-Visual Prominence: From Intelligibility to Behavior. Journal on Multimodal User Interfaces, 3(4), 299-311. [pdf]
Abstract: Auditory prominence is defined as when an acoustic segment is made salient in its context. Prominence is one of the prosodic functions that has been shown to be strongly correlated with facial movements. In this work, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted: speech quality is acoustically degraded and the fundamental frequency is removed from the signal; the speech is then presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raise gestures, which are synchronized with the auditory prominence. The experiment shows that presenting prominence as facial gestures significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a follow-up study examining the perception of the behavior of the talking heads when gestures are added over pitch accents. Using eye-gaze tracking technology and questionnaires on 10 moderately hearing impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch accents, as opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and the understanding of the talking head.
Al Moubayed, S., Beskow, J., Granström, B., & House, D. (2010). Audio-Visual Prosody: Perception, Detection, and Synthesis of Prominence. In Esposito, A., et al. (Eds.), Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues (pp. 55-71). Springer. [pdf]
Abstract: In this chapter, we investigate the effects of facial prominence
cues, in terms of gestures, when synthesized on animated talking heads. In the first study a speech intelligibility experiment is conducted, where speech quality is acoustically degraded, then the speech is presented to 12 subjects through a lip synchronized talking head carrying head-nods and eyebrow raising gestures. The experiment shows that perceiving visual prominence as gestures, synchronized with the auditory prominence, significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a study examining the perception of the behavior of the talking heads when gestures are added at pitch movements. Using eye-gaze tracking technology and questionnaires for 10 moderately hearing impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch movements opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and helpfulness of
the talking head.
Beskow, J., & Al Moubayed, S. (2010). Perception of Gaze Direction in 2D and 3D Facial Projections. In The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation. Edinburgh, UK. [pdf]
Beskow, J., Edlund, J., Granström, B., Gustafson, J., & House, D. (2010). Face-to-face interaction and the KTH Cooking Show. In Esposito, A., Campbell, N., Vogel, C., Hussain, A., & Nijholt, A. (Eds.), Development of Multimodal Interfaces: Active Listening and Synchrony (pp. 157-168). Berlin / Heidelberg: Springer.
Beskow, J., Edlund, J., Gustafson, J., Heldner, M., Hjalmarsson, A., & House, D. (2010). Modelling humanlike conversational behaviour. In Proceedings of SLTC 2010. Linköping, Sweden. [pdf]
Beskow, J., Edlund, J., Gustafson, J., Heldner, M., Hjalmarsson, A., & House, D. (2010). Research focus: Interactional aspects of spoken face-to-face communication. In Proc. of Fonetik 2010 (pp. 7-10). Lund, Sweden. [pdf]
Abstract: We have a visionary goal: to learn enough about human face-to-face interaction that we are able to create an artificial conversational partner that is human-like. We take the opportunity here to present four new projects inaugurated in 2010, each adding pieces of the puzzle through a shared research focus: interactional aspects of spoken face-to-face communication.
Beskow, J., & Granström, B. (2010). Goda utsikter för teckenspråksteknologi. In Domeij, R., Breivik, T., Halskov, J., Kirchmeier-Andersen, S., Langgård, P., & Moshagen, S. (Eds.), Språkteknologi för ökad tillgänglighet - Rapport från ett nordiskt seminarium (pp. 77-86). [pdf]
Beskow, J., & Granström, B. (2010). Teckenspråkteknologi - sammanfattning av förstudie. Technical Report, KTH Centrum för Talteknologi. [pdf]
Edlund, J., Gustafson, J., & Beskow, J. (2010). Cocktail – a demonstration of massively multi-component audio environments for illustration and analysis. In Proc. of SLTC 2010. Linköping, Sweden. [pdf]
Abstract: We present MMAE – Massively Multi-component Audio Environments – a new concept in auditory presentation, and Cocktail – a demonstrator built on this technology. MMAE creates a dynamic audio environment by playing a large number of sound clips simultaneously at different locations in a virtual 3D space. The technique utilizes standard soundboards and is based in the Snack Sound Toolkit. The result is an efficient 3D audio environment that can be modified dynamically, in real time. Applications range from the creation of canned as well as online audio environments for games and entertainment to the browsing, analyzing and comparing of large quantities of audio data. We also demonstrate the Cocktail implementation of MMAE using several test cases as examples.
Edlund, J., & Beskow, J. (2010). Capturing massively multimodal dialogues: affordable synchronization and visualization. In Kipp, M., Martin, J-C., Paggio, P., & Heylen, D. (Eds.), Proc. of Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality (MMC 2010) (pp. 160-161). Valletta, Malta. [pdf]
Abstract: In this demo, we show (a) affordable and relatively easy-to-implement means to facilitate synchronization of audio, video and motion capture data in post processing, and (b) a flexible tool for 3D visualization of recorded motion capture data aligned with audio and video sequences. The synchronisation is made possible by the use of two simple analogue devices: a turntable and an easy-to-build electronic clapper board. The demo shows examples of how the signals from the turntable and the clapper board are traced over the three modalities, using the 3D visualisation tool. We also demonstrate how the visualisation tool shows head and torso movements captured by the motion capture system.
Edlund, J., Beskow, J., Elenius, K., Hellmer, K., Strömbergsson, S., & House, D. (2010). Spontal: a Swedish spontaneous dialogue corpus of audio, video and motion capture. In Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., & Tapias, D. (Eds.), Proc. of the Seventh conference on International Language Resources and Evaluation (LREC'10) (pp. 2992-2995). Valletta, Malta. [pdf]
Abstract: We present the Spontal database of spontaneous Swedish dialogues. 120 dialogues of at least 30 minutes each have been captured in high-quality audio, high-resolution video and with a motion capture system. The corpus is currently being processed and annotated, and will be made available for research at the end of the project.
Schötz, S., Beskow, J., Bruce, G., Granström, B., & Gustafson, J. (2010). Simulating Intonation in Regional Varieties of Swedish. In Speech Prosody 2010. Chicago, USA. [pdf]
Schötz, S., Beskow, J., Bruce, G., Granström, B., Gustafson, J., & Segerup, M. (2010). Simulating Intonation in Regional Varieties of Swedish. In Fonetik 2010. Lund, Sweden. [pdf]
2009
Al Moubayed, S., & Beskow, J. (2009). Effects of Visual Prominence Cues on Speech Intelligibility. In Proceedings of Auditory-Visual Speech Processing AVSP'09. Norwich, England. [pdf]
Abstract: This study reports experimental results on the effect of visual prominence, presented as gestures, on speech intelligibility. 30 acoustically vocoded sentences, permutated into different gestural conditions, were presented audio-visually to 12 subjects. The analysis of correct word recognition shows a significant increase in intelligibility when focally-accented (prominent) words are supplemented with head-nods or with eyebrow-raise gestures. The paper also examines coupling other acoustic phenomena to brow-raise gestures. As a result, the paper introduces new evidence on the ability of the non-verbal movements in the visual modality to support audio-visual speech perception.
Al Moubayed, S., Beskow, J., Öster, A-M., Salvi, G., Granström, B., van Son, N., Ormel, E., & Herzke, T. (2009).
Studies on Using the SynFace Talking Head for the Hearing Impaired. In Proceedings of Fonetik'09. Dept. of Linguistics, Stockholm University, Sweden. [pdf]
Abstract: SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper we present the large scale hearing impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and the effort scaling when using SynFace by hearing impaired people, where groups of hearing impaired subjects with different impairment levels from mild to severe and cochlear implants are tested. Preliminary analysis of the results does not show significant gain in SRT or in effort scaling. But looking at large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace, especially with speech with stereo babble.
Al Moubayed, S., Beskow, J., Öster, A., Salvi, G., Granström, B., van Son, N., & Ormel, E. (2009). Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-media Setting. In Proceedings of Interspeech 2009.
Beskow, J., Edlund, J., Granström, B., Gustafson, J., Skantze, G., & Tobiasson, H. (2009). The MonAMI Reminder: a spoken dialogue system for face-to-face interaction. In Interspeech 2009. Brighton, U.K. [pdf]
Abstract: We describe the MonAMI Reminder, a multimodal spoken dialogue system which can assist elderly and disabled people in organising and initiating their daily activities. Based on deep interviews with potential users, we have designed a calendar and reminder application which uses an innovative mix of an embodied conversational agent, digital pen and paper, and the web to meet the needs of those users as well as the current constraints of speech technology. We also explore the use of head pose tracking for interaction and attention control in human-computer face-to-face interaction.
Beskow, J., & Gustafson, J. (2009). Experiments with Synthesis of Swedish Dialects. In Proceedings of Fonetik 2009.
Beskow, J., Salvi, G., & Al Moubayed, S. (2009). SynFace - Verbal and Non-verbal Face Animation from Audio. In Proceedings of The International Conference on Auditory-Visual Speech Processing AVSP'09. Norwich, England. [pdf]
Abstract: We give an overview of SynFace, a speech-driven
face animation system originally developed for the needs of hard-of-hearing users of the telephone. For the 2009 LIPS challenge, SynFace includes not only articulatory motion but also non-verbal motion of gaze, eyebrows and head, triggered by detection of acoustic correlates of prominence and cues for interaction control. In perceptual evaluations, both verbal and non-verbal movements have been found to have positive impact on word recognition scores.
Beskow, J., Carlson, R., Edlund, J., Granström, B., Heldner, M., Hjalmarsson, A., & Skantze, G. (2009). Multimodal Interaction Control. In Waibel, A., & Stiefelhagen, R. (Eds.), Computers in the Human Interaction Loop (pp. 143-158). Berlin/Heidelberg: Springer. [pdf]
Beskow, J., Edlund, J., Elenius, K., Hellmer, K., House, D., & Strömbergsson, S. (2009). Project presentation: Spontal – multimodal database of spontaneous dialog. In Fonetik 2009 (pp. 190-193). Stockholm. [pdf]
Abstract: We describe the ongoing Swedish speech database project Spontal: Multimodal database of spontaneous speech in dialog (VR 2006-7482). The project takes as its point of departure the fact that both vocal signals and gesture involving the face and body are important in everyday, face-to-face communicative interaction, and that there is a great need for data with which we can more precisely measure these.
Edlund, J., & Beskow, J. (2009). MushyPeek - a framework for online investigation of audiovisual dialogue phenomena. Language and Speech, 52(2-3), 351-367.
Abstract: Evaluation of methods and techniques for conversational and multimodal spoken dialogue systems is complex, as is gathering data for the modeling and tuning of such techniques. This article describes MushyPeek, an experiment framework that allows us to manipulate the audiovisual behavior of interlocutors in a setting similar to face-to-face human-human dialogue. The setup connects two subjects to each other over a Voice over Internet Protocol (VoIP) telephone connection and simultaneously provides each of them with an avatar representing the other. We present a first experiment which inaugurates, exemplifies, and validates the framework. The experiment corroborates earlier findings on the use of gaze and head pose gestures in turn-taking.
Salvi, G., Beskow, J., Al Moubayed, S., & Granström, B. (2009). SynFace - Speech-Driven Facial Animation for Virtual Speech-Reading Support. EURASIP Journal on Audio, Speech, and Music Processing, 2009. [pdf]
Abstract: This paper describes SynFace, a supportive technology that aims at enhancing audio-based spoken communication in adverse acoustic conditions by providing the missing visual information in the form of an animated talking head. Firstly, we describe the system architecture, consisting of a 3D animated face model controlled from the speech input by a specifically optimised phonetic recogniser. Secondly, we report on speech intelligibility experiments with focus on multilinguality and robustness to audio quality. The system, already available for Swedish, English, and Flemish, was optimised for German and for Swedish wide-band speech quality available in TV, radio, and Internet communication.
Lastly, the paper covers experiments with nonverbal motions driven from the speech signal. It is shown that turn-taking gestures can be used to affect the flow of human-human dialogues. We have focused specifically on two categories of cues that may be extracted from the acoustic signal: prominence/emphasis and interactional cues (turn-taking/back-channelling).
2008
Al Moubayed, S., Beskow, J., & Salvi, G. (2008). SynFace Phone Recognizer for Swedish Wideband and Narrowband Speech. In Proceedings of The second Swedish Language Technology Conference (SLTC). Stockholm, Sweden. [pdf]
Abstract: In this paper, we present new results and comparisons of the real-time lip-synchronized talking head SynFace on different Swedish databases and bandwidths. The work involves training SynFace on narrow-band telephone speech from the Swedish SpeechDat, and on the narrow-band and wide-band Speecon corpus. Auditory perceptual tests are getting established for SynFace as an audio-visual hearing support for the hearing-impaired. Preliminary results show high recognition accuracy compared to other languages.
Beskow, J., Bruce, G., Enflo, L., Granström, B., & Schötz, S. (2008). Human Recognition of Swedish Dialects. In Proceedings of Fonetik 2008.
Beskow, J., Bruce, G., Enflo, L., Granström, B., & Schötz, S. (2008). Recognizing and Modelling Regional Varieties of Swedish. In Proceedings of Interspeech 2008. [pdf]
Beskow, J., & Cerrato, L. (2008). Evaluation of the expressivity of a Swedish talking head in the context of human-machine interaction. In Magno Caldognetto, E., Cavicchio, E., & Cosi, P. (Eds.), Comunicazione Parlata e manifestazione delle emozioni. [pdf]
Beskow, J., Edlund, J., Gjermani, T., Granström, B., Gustafson, J., Jonsson, O., Skantze, G., & Tobiasson, H. (2008). Innovative interfaces in MonAMI: the reminder. In Proceedings of the 10th international conference on Multimodal interfaces, Chania, Crete, Greece (pp. 199-200). New York, NY, USA: ACM. [pdf]
Abstract: This demo paper presents an early version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. The Reminder helps users to plan activities and to remember what to do. The prototype merges mobile ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar. Users may ask questions such as “When was I supposed to meet Sara?” or “What’s my schedule today?”
Beskow, J., Edlund, J., Granström, B., Gustafson, J., Jonsson, O., & Skantze, G. (2008). Speech technology in the European project MonAMI. In Proceedings of FONETIK 2008 (pp. 33-36). Gothenburg, Sweden. [pdf]
Abstract: This paper describes the role of speech and speech technology in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. It presents the Reminder, a prototype embodied conversational agent (ECA) which helps users to plan activities and to remember what to do. The prototype merges speech technology with other, existing technologies: Google Calendar and a digital pen and paper. The solution allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides notifications on what has been written in the calendar. Users may also ask questions such as “When was I supposed to meet Sara?” or “What’s on my schedule today?”
Beskow, J., Edlund, J., Granström, B., Gustafson, J., & Skantze, G. (2008). Innovative interfaces in MonAMI: the KTH Reminder.
In Perception in Multimodal Dialogue Systems - Proceedings of the 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, PIT 2008, Kloster Irsee, Germany, June 16-18, 2008 (pp. 272-275). Berlin/Heidelberg: Springer. [pdf]
Abstract: This demo paper presents the first version of the Reminder, a prototype ECA developed in the European project MonAMI, which aims at “mainstreaming accessibility in consumer goods and services, using advanced technologies to ensure equal access, independent living and participation for all”. The Reminder helps users to plan activities and to remember what to do. The prototype merges ECA technology with other, existing technologies: Google Calendar and a digital pen and paper. This innovative combination of modalities allows users to continue using a paper calendar in the manner they are used to, whilst the ECA provides verbal notifications on what has been written in the calendar. Users may also ask questions such as “When was I supposed to meet Sara?” or “What’s on my schedule today?”
Beskow, J., Engwall, O., Granström, B., Nordqvist, P., & Wik, P. (2008). Visualization of speech and audio for hearing-impaired persons. Technology and Disability, 20(2), 97-107. [pdf]
Beskow, J., Granström, B., Nordqvist, P., Al Moubayed, S., Salvi, G., Herzke, T., & Schulz, A. (2008). Hearing at Home – Communication support in home environments for hearing impaired persons. In Proceedings of Interspeech 2008. Brisbane, Australia. [pdf]
Abstract: The Hearing at Home (HaH) project focuses on the needs of hearing-impaired people in home environments. The project is researching and developing an innovative media-center solution for hearing support, with several integrated features that support perception of speech and audio, such as individual loudness amplification, noise reduction, audio classification and event detection, and the possibility to display an animated talking head providing real-time speechreading support. In this paper we provide a brief project overview and then describe some recent results related to the audio classifier and the talking head. As the talking head expects clean speech input, an audio classifier has been developed for the task of classifying audio signals as clean speech, speech in noise or other. The mean accuracy of the classifier was 82%. The talking head (based on technology from the SynFace project) has been adapted for German, and a small speech-in-noise intelligibility experiment was conducted where sentence recognition rates increased from 3% to 17% when the talking head was present.
López-Colino, F., Beskow, J., & Colás, J. (2008). Mobile SynFace: Ubiquitous visual interface for mobile VoIP telephone calls. In Proceedings of The second Swedish Language Technology Conference (SLTC). Stockholm, Sweden. [pdf]
2007
Beskow, J., Granström, B., & House, D. (2007). Analysis and synthesis of multimodal verbal and non-verbal interaction for animated interface agents. In Esposito, A., Faundez-Zanuy, M., Keller, E., & Marinaro, M. (Eds.), Verbal and Nonverbal Communication Behaviours (pp. 250-263). Berlin: Springer-Verlag.
Edlund, J., & Beskow, J. (2007). Pushy versus meek – using avatars to influence turn-taking behaviour. In Proceedings of Interspeech 2007. Antwerp, Belgium. [pdf]
Abstract: The flow of spoken interaction between human interlocutors
is a widely studied topic. Amongst other things, studies have shown that we use a number of facial gestures to improve this flow – to control the taking of turns. This ought to be useful in systems where an animated talking head is used, be they systems for computer mediated human-human dialogue or spoken dialogue systems, where the computer itself uses speech to interact with users. In this article, we show that a small set of simple interaction control gestures and a simple model of interaction can be used to influence users’ behaviour in an unobtrusive manner. The results imply that such a model may improve the flow of computer mediated interaction between humans under adverse circumstances, such as network latency, or to create more human-like spoken human-computer interaction.
Edlund, J., Beskow, J., & Heldner, M. (2007). MushyPeek – an experiment framework for controlled investigation of human-human interaction control behaviour. Proceedings of Fonetik, TMH-QPSR, 50(1), 61-64. [pdf]
Abstract: This paper describes MushyPeek, an experiment framework that allows us to manipulate interaction control behaviour – including turn-taking – in a setting quite similar to face-to-face human-human dialogue. The setup connects two subjects to each other over a VoIP telephone connection and simultaneously provides each of them with an avatar representing the other. The framework is exemplified with the first experiment we tried in it – a test of the effectiveness of interaction control gestures in an animated lip-synchronised talking head.
2006
Agelfors, E., Beskow, J., Karlsson, I., Kewley, J., Salvi, G., & Thomas, N. (2006). User Evaluation of the SYNFACE Talking Head Telephone. Lecture Notes in Computer Science, 4061, 579-586. [pdf]
Abstract: The talking-head telephone, Synface, is a lip-reading support for people with hearing impairment. It has been tested by 49 users with varying degrees of hearing impairment in the UK and Sweden in lab and home environments. Synface was found to give support to the users, especially in perceiving numbers and addresses, and to be an enjoyable way to communicate. A majority deemed Synface to be a useful product.
Beskow, J., Granström, B., & House, D. (2006). Focal accent and facial movements in expressive speech. In Fonetik 2006, Working Papers 52, General Linguistics and Phonetics, Lund University (pp. 9-12). [pdf]
Beskow, J., Granström, B., & House, D. (2006). Visual correlates to prominence in several expressive modes. In Proceedings of Interspeech 2006 (pp. 1272–1275). Pittsburgh, PA. [pdf]
Lidestam, B., & Beskow, J. (2006). Motivation and appraisal in perception of poorly specified speech. Scandinavian Journal of Psychology, 47(2), 93-101. [pdf]
2005
Beskow, J., Edlund, J., & Nordstrand, M. (2005). A model for multi-modal dialogue system output applied to an animated talking head. In Minker, W., Bühler, D., & Dybkjaer, L. (Eds.), Spoken Multimodal Human-Computer Dialogue in Mobile Environments, Text, Speech and Language Technology (pp. 93-113). Dordrecht, The Netherlands: Kluwer Academic Publishers. [pdf]
Abstract: We present a formalism for specifying verbal and non-verbal output from a multi-modal dialogue system. The output specification is XML-based and provides information about communicative functions of the output, without detailing the realisation of these functions. The aim is to let dialogue systems generate the same output for a wide variety of output devices and modalities. The formalism was developed and implemented in the multi-modal spoken dialogue system AdApt. We also describe how facial gestures in the 3D-animated talking head used within this system are controlled through the formalism.
Beskow, J., & Nordenberg, M. (2005). Data-driven Synthesis of Expressive Visual Speech using an MPEG-4 Talking Head. In Proceedings of Interspeech 2005. Lisbon. [pdf]
2004
Beskow, J. (2004). Trainable articulatory control models for visual speech synthesis. Journal of Speech Technology, 4(7), 335-349. [pdf]
Beskow, J., Cerrato, L., Cosi, P., Costantini, E., Nordstrand, M., Pianesi, F., Prete, M., & Svanfeldt, G. (2004). Preliminary cross-cultural evaluation of expressiveness in synthetic faces. In André, E., Dybkjaer, L., Minker, W., & Heisterkamp, P. (Eds.), Proc Tutorial and Research Workshop on Affective Dialogue Systems, ADS'04 (pp. 240-243). Kloster Irsee, Germany. [pdf]
Beskow, J., Cerrato, L., Granström, B., House, D., Nordenberg, M., Nordstrand, M., & Svanfeldt, G. (2004).
Expressive Animated Agents for Affective Dialogue Systems.. Proc Tutorial and Research Workshop on Affective Dialogue Systems, ADS'04, . Kloster Irsee, Tyskland.. [pdf]Beskow, J., Cerrato, L., Granström, B., House, D., Nordstrand, M., & Svanfeldt, G. (2004). The Swedish PF-Star multimodal corpora. In Proc LREC Workshop on Models of Human Behaviour for the Specification and Evaluation of Multimodal Input and Output Interfaces (pp. 34-37). Lisboa. [pdf]Beskow, J., Karlsson, I., Kewley, J., & Salvi, G. (2004). SYNFACE - A talking head telephone for the hearing-impaired. In Miesenberger, K., Klaus, J., Zagler, W., & Burger, D. (Eds.), Computers Helping People with Special Needs (pp. 1178-1186). Springer-Verlag. [abstract] [pdf]Abstract: SYNFACE is a telephone aid for hearing-impaired people
that shows the lip movements of the speaker at the other telephone synchronised with the speech. The SYNFACE system consists of a speech recogniser that recognises the incoming speech and a synthetic talking head. The output from the recogniser is used to control the articulatory movements of the synthetic head. SYNFACE prototype systems exist for three languages: Dutch, English and Swedish and the rst user trials have just started.Engwall, O., Wik, P., Beskow, J., & Granström, G. (2004). Design strategies for a virtual language tutor. In Kim, S. H., & Young, D. H. (Eds.), Proc ICSLP 2004 (pp. 1693-1696). Jeju Island, Korea. [pdf]Spens, K-E., Agelfors, E., Beskow, J., Granström, B., Karlsson, I., & Salvi, G. (2004). SYNFACE, a talking head telephone for the hearing impaired. In Proc IFHOH 7th World Congress. Helsinki, Finland.2003Beskow, J. (2003). Talking heads - Models and applications for multimodal speech synthesis. Doctoral dissertation, KTH.Beskow, J., Engwall, O., & Granström, B. (2003). Resynthesis of Facial and Intraoral Articulation from Simultaneous Measurements. In Solé, M., Recasens, D., & Romero, J. (Eds.), Proceedings of the 15th ICPhS (pp. 431-434). Barcelona, Spain. [pdf]Beskow, J., Engwall, O., & Granström, B. (2003). Simultaneous measurements of facial and intraoral articulation. In Proc of Fonetik 2003, Umeå University, Dept of Philosophy and Linguistics PHONUM 9 (pp. 57-60). [pdf]Engwall, O., & Beskow, J. (2003). Resynthesis of 3D tongue movements from facial data. In Proc EuroSpeech 2003 (pp. 2261-2264). [pdf]Engwall, O., & Beskow, J. (2003). The effect of corpus choice on statistical articulatory modeling. In 7th Intl Seminar on Speech Production (pp. 49-54). Sydney. [pdf]Siciliano, C., Williams, G., Beskow, J., & Faulkner, A. (2003). Evaluation of a Multilingual Synthetic Talking Face as a communication Aid for the Hearing Impaired. In Proc of ICPhS, XV Intl Conference of Phonetic Sciences (pp. 131-134). Barcelona, Spain. 
[pdf]
2002
Beskow, J., Edlund, J., & Nordstrand, M. (2002). Specification and realisation of multimodal output in dialogue systems. In Proc of ICSLP 2002 (pp. 181-184). Denver, Colorado, USA. [pdf]
Abstract: We present a high level formalism for specifying verbal and nonverbal output from a multimodal dialogue system. The output specification is XML-based and provides information about communicative functions of the output without detailing the realisation of these functions. The specification can be used to control an animated character that uses speech and gestures. We give examples from an implementation in a multimodal spoken dialogue system, and describe how facial gestures are implemented in a 3D-animated talking agent within this system.
Beskow, J., Granström, B., & House, D. (2002). A multimodal speech synthesis tool applied to audio-visual prosody. In Keller, E., Bailly, G., Monaghan, A., Terken, J., & Huckvale, M. (Eds.), Improvements in Speech Synthesis (pp. 372-382). New York: John Wiley & Sons, Inc.
Beskow, J., Granström, B., & Spens, K-E. (2002). Articulation strength - Readability experiments with a synthetic talking face. Proceedings of Fonetik, TMH-QPSR, 44(1), 097-100. [pdf]
Edlund, J., Beskow, J., & Nordstrand, M. (2002). GESOM - A model for describing and generating multi-modal output. In Proc of ISCA Workshop Multi-Modal Dialogue in Mobile Environments. Kloster Irsee, Germany. [pdf]
Abstract: This paper describes GESOM, a model for generation of generalised, high-level multi-modal dialogue system output. It aims to let dialogue systems generate output for various output devices and modalities with a minimum of changes to the output generation of the dialogue system. The model was developed and tested within the AdApt spoken dialogue system, from which the bulk of the examples in this paper are taken.
Granström, B., House, D., & Beskow, J. (2002). Speech and gestures for talking faces in conversational dialogue systems.
In Granström, B., House, D., & Karlsson, I. (Eds.), Multimodality in language and speech systems (pp. 209-241). Dordrecht: Kluwer Academic Publishers.
2001
Granström, B., House, D., Beskow, J., & Lundeberg, M. (2001). Verbal and visual prosody in multimodal speech perception. In van Dommelen, W., & Fretheim, T. (Eds.), Nordic Prosody 2000: Proc of VIII Conf (pp. 77-88). Trondheim, Norway.
House, D., Beskow, J., & Granström, B. (2001). Interaction of visual cues for prominence. In Karlsson, A., & van de Weijer, J. (Eds.), Papers from Fonetik 2001 (pp. 62-65).
House, D., Beskow, J., & Granström, B. (2001). Timing and interaction of visual cues for prominence in audiovisual speech perception. In Proc of Eurospeech 2001 (pp. 387-390). Aalborg, Denmark. [pdf]
2000
Beskow, J., Granström, B., House, D., & Lundeberg, M. (2000). Experiments with verbal and visual conversational signals for an automatic language tutor. In Delcloque, P., & Bramoullé, A. (Eds.), Proc of InSTIL 2000 (pp. 138-142). University of Abertay Dundee, Dundee, Scotland.
Beskow, J., Granström, B., House, D., & Lundeberg, M. (2000). Verbal and visual prosody in multimodal speech perception. In Nordic Prosody VIII.
Granström, B., House, D., Beskow, J., & Lundeberg, M. (2000). Verbal and visual prosody in multimodal speech perception. In Proc 4th Swedish Symposium on Multimodal Communication.
Gustafson, J., Bell, L., Beskow, J., Boye, J., Carlson, R., Edlund, J., Granström, B., House, D., & Wirén, M. (2000). AdApt - a multimodal conversational dialogue system in an apartment domain. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proc. of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 134-137). Beijing: China Military Friendship Publish. [pdf]
Abstract: A general overview of the AdApt project and the research that is performed within the project is presented. In this project various aspects of human-computer interaction in multimodal conversational dialogue systems are investigated.
The project will also include studies on the integration of user/system/dialogue dependent speech recognition and multimodal speech synthesis. A domain in which multimodal interaction is highly useful has been chosen, namely, finding available apartments in Stockholm. A Wizard-of-Oz data collection within this domain is also described.
Massaro D.W, C., Beskow, J., & Cole, R. (2000). Developing and Evaluating Conversational Agents. In Cassell, J., et al. (Eds.), Embodied Conversational Agents. Cambridge, MA: MIT Press.
Sjölander, K., & Beskow, J. (2000). WaveSurfer - an open source speech tool. In Yuan, B., Huang, T., & Tang, X. (Eds.), Proceedings of ICSLP 2000, 6th Intl Conf on Spoken Language Processing (pp. 464-467). Beijing. [pdf]
1999
Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). A Synthetic Face as a Lip-reading Support for Hearing Impaired Telephone Users - Problems and Positive Results. In Proceedings of the 4th European Conference on Audiology. Oulu, Finland.
Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999). Two methods for Visual Parameter Extraction in the Teleface Project. In Proceedings of Fonetik. Gothenburg, Sweden.
Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1999). Artificial video for Hearing-Impaired Telephone Users; A comparison with the No Video and Perfect Video Conditions. In Bühler, C., & Knops, H. (Eds.), Assistive Technology on the Threshold of the New Millennium (pp. 116-121). IOS Press.
Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1999). ASR controlled synthetic face as a lipreading support for hearing impaired telephone users. In Cost249 meeting. Prague, Czech Republic.
Agelfors, E., Beskow, J., Granström, B., Lundeberg, M., Salvi, G., Spens, K-E., & Öhman, T. (1999).
Synthetic visual speech driven from auditory speech. In Proceedings of Audio-Visual Speech Processing (AVSP). Santa Cruz, USA. [pdf]
Gustafson, J., Sjölander, K., Beskow, J., Granström, B., & Carlson, R. (1999). Creating web-based exercises for spoken language technology. In Tutorial session in proceedings of IDS'99 (pp. 165-168). [pdf]
Lundeberg, M., & Beskow, J. (1999). Developing a 3D-agent for the August dialogue system. In Proc of AVSP 99. [pdf]
Massaro, D., Beskow, J., Cohen, M., Fry, C., & Rodriguez, T. (1999). Picture my voice: Audio to visual speech synthesis using artificial neural networks. In Proc of AVSP 99 (pp. 133-138). [pdf]
Massaro, D., Cohen, M., & Beskow, J. (1999). From Theory to Practice: Rewards and Challenges. In Proc of ICPhS. [pdf]
Sjölander, K., Gustafson, J., Beskow, J., Granström, B., & Carlson, R. (1999). Web-based educational tools for speech technology. In Proc MATISSE - ESCA/SOCRATES Workshop on Method and Tool Innovations for Speech Science Education. London.
1998
Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). Synthetic faces as a lipreading support. In Proceedings of ICSLP'98. [pdf]
Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). Teleface - the use of a synthetic face for the hard of hearing. In Proc of IVTTA'98.
Agelfors, E., Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1998). The synthetic face from a hearing impaired view. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 200-203). Stockholm University. [html]
Beskow, J. (1998). A tool for teaching and development of parametric speech synthesis. In Branderud, P., & Traunmüller, H. (Eds.), Proc of Fonetik -98, The Swedish Phonetics Conference (pp. 162-165). Stockholm University. [pdf]
Cohen, M., Beskow, J., & Massaro, D. (1998).
Recent developments in facial animation: an inside view. In Proc of AVSP'98. [pdf]
Cole, R., Carmell, T., Conners, P., Macon, M., Wouters, J., de Villiers, J., Tarachow, A., Massaro, D., Cohen, M., Beskow, J., Yang, J., Meier, U., Waibel, A., Stone, P., Fortier, G., Davis, A., & Soland, C. (1998). Intelligent Animated Agents for Interactive Language Training. In Proc of STiLL - ESCA Workshop on Speech Technology in Language Learning.
Sjölander, K., Beskow, J., Gustafson, J., Lewin, E., Carlson, R., & Granström, B. (1998). Web-based educational tools for speech technology. In Proc of ICSLP98, 5th Intl Conference on Spoken Language Processing (pp. 3217-3220). Sydney, Australia.
1997
Beskow, J. (1997). Animation of talking agents. In Benoit, C., & Campbell, R. (Eds.), Proc of ESCA Workshop on Audio-Visual Speech Processing (pp. 149-152). Rhodes, Greece. [pdf]
Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1997). The Teleface project - disability, feasibility and intelligibility. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P. (Eds.), Proc of Fonetik -97, Dept of Phonetics, Umeå Univ., Phonum (pp. 85-88). [pdf]
Beskow, J., Dahlquist, M., Granström, B., Lundeberg, M., Spens, K-E., & Öhman, T. (1997). The Teleface project - Multimodal speech communication for the hearing impaired. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 2003-2006). Rhodes, Greece. [pdf]
Beskow, J., Elenius, K., & McGlashan, S. (1997). OLGA - A dialogue system with an animated talking agent. In Kokkinakis, G., Fakotakis, N., & Dermatas, E. (Eds.), Proc of Eurospeech '97, 5th European Conference on Speech Communication and Technology (pp. 1651-1654). Rhodes, Greece. [pdf]
Beskow, J., Elenius, K., & McGlashan, S. (1997). The OLGA project: An animated talking agent in a dialogue system. In Bannert, R., Heldner, M., Sullivan, K., & Wretling, P.
(Eds.), Proc of Fonetik -97, Dept of Phonetics, Phonum 4 (pp. 69-72). Lövånger/Umeå. [pdf]
Beskow, J., Elenius, K. O. E., & McGlashan, S. (1997). OLGA - A dialogue system with an animated talking agent. TMH-QPSR, 38(2-3), 001-006. [pdf]
Beskow, J., & McGlashan, S. (1997). OLGA - A conversational agent with gestures. In André, E. (Ed.), Proc of the IJCAI -97 Workshop on Animated Interface Agents: Making them Intelligent (pp. 39-44). Nagoya, Japan. [pdf]
1996
Beskow, J. (1996). Talking heads - communication, articulation and animation. TMH-QPSR, 37(2), 053-056. [pdf]
1995
Bertenstam, J., Beskow, J., Blomberg, M., Carlson, R., Elenius, K., Granström, B., Gustafson, J., Hunnicutt, S., Högberg, J., Lindell, R., Neovius, L., Nord, L., de Serpa-Leitao, A., & Ström, N. (1995). The Waxholm system - a progress report. In Dalsgaard, P. (Ed.), Proc of ESCA Workshop on Spoken Dialogue Systems (pp. 281-284). Vigsø, Denmark. [pdf]
Beskow, J. (1995). Regelstyrd visuell talsyntes. Master's thesis, KTH, TMH.
Beskow, J. (1995). Rule-based visual speech synthesis. In Pardo, J. (Ed.), Proc of the 4th European Conference on Speech Communication and Technology (EUROSPEECH '95) (pp. 299-302). Madrid.