Publications


2014

Al Moubayed, S., Bohus, D., Esposito, A., Heylen, D., & Koutsombogera, M. (2014). UM3I: International Workshop on Understanding and Modeling Multiparty, Multimodal Interactions. ICMI 2014, Istanbul, Turkey.

Al Moubayed, S., Beskow, J., Bollepalli, B., Gustafson, J., Hussen-Abdelaziz, A., Johansson, M., Koutsombogera, M., Lopes, J., Novikova, J., Oertel, C., Skantze, G., Stefanov, K., & Varol, G. (2014). Human-robot collaborative tutoring using multiparty multimodal spoken dialogue. In Proc. of HRI'14. Reykjavik. [abstract] [pdf]
Abstract: In this paper, we describe a project that explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring robot. A human-robot interaction setup is designed, and a human-human dialogue corpus is collected. The corpus targets the development of a dialogue system platform to study verbal and nonverbal tutoring strategies in multiparty spoken interactions with robots which are capable of spoken dialogue. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. Along with the participants sits a tutor (robot) that helps the participants perform the task, and organizes and balances their interaction. Different multimodal signals, captured and auto-synchronized by different audio-visual capture technologies such as a microphone array, Kinects, and video cameras, were coupled with manual annotations. These are used to build a situated model of the interaction based on the participants' personalities, their state of attention, their conversational engagement and verbal dominance, and how these correlate with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. Driven by the analysis of the corpus, we will also show the detailed design methodologies for an affective and multimodally rich dialogue system that allows the robot to incrementally measure the attention state and dominance of each participant, allowing the robot head Furhat to maintain a well-coordinated, balanced, and engaging conversation that attempts to maximize the agreement and the contribution to solving the task. This project sets the first steps to explore the potential of using multimodal dialogue systems to build interactive robots that can serve in educational, team building, and collaborative task solving applications.

Al Moubayed, S., Beskow, J., Bollepalli, B., Hussen-Abdelaziz, A., Johansson, M., Koutsombogera, M., Lopes, J., Novikova, J., Oertel, C., Skantze, G., Stefanov, K., & Varol, G. (2014). Tutoring Robots: Multiparty multimodal social dialogue with an embodied tutor. In Proceedings of eNTERFACE2013. Springer. [abstract] [pdf]
Abstract: This project explores a novel experimental setup towards building a spoken, multi-modally rich, and human-like multiparty tutoring agent. A setup is developed and a corpus is collected that targets the development of a dialogue system platform to explore verbal and nonverbal tutoring strategies in multiparty spoken interactions with embodied agents. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. With the participants sits a tutor that helps the participants perform the task and organizes and balances their interaction. Different multimodal signals, captured and auto-synchronized by different audio-visual capture technologies, were coupled with manual annotations to build a situated model of the interaction based on the participants' personalities, their temporally-changing state of attention, their conversational engagement and verbal dominance, and the way these are correlated with the verbal and visual feedback, turn-management, and conversation regulatory actions generated by the tutor. At the end of this chapter we discuss the potential areas of research and development this work opens up, and some of the challenges that lie on the road ahead.

Al Moubayed, S., Beskow, J., & Skantze, G. (2014). Spontaneous spoken dialogues with the Furhat human-like robot head. In HRI'14. Bielefeld, Germany. [abstract] [pdf]
Abstract: We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is an anthropomorphic robot head that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator will showcase a system that is able to carry out social dialogue with multiple interlocutors simultaneously, with rich output signals such as eye and head coordination, lip-synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations. The dialogue design is performed using the IrisTK [4] dialogue authoring toolkit developed at KTH. The system will also be able to act as a moderator in a quiz game, showing different strategies for regulating spoken situated interactions.

Persson, A., Al Moubayed, S., & Loutfi, A. (2014). Fluent Human-Robot Dialogues About Grounded Objects in Home Environments. Journal on Cognitive Computation, 1-14.

Koutsombogera, M., Al Moubayed, S., Beskow, J., Bollepalli, B., Gustafson, J., Hussen-Abdelaziz, A., Johansson, M., Lopes, J., Novikova, J., Oertel, C., Skantze, G., Stefanov, K., & Varol, G. (2014). The Tutorbot Corpus – A Corpus for Studying Tutoring Behaviour in Multiparty Face-to-Face Spoken Dialogue. To be published in Proc. of LREC'14. Reykjavik, Iceland. [abstract]
Abstract: This paper describes a novel experimental setup exploiting state-of-the-art capture equipment to collect a multimodally rich, game-solving, collaborative multiparty dialogue corpus. The corpus is targeted and designed towards the development of a dialogue system platform to explore verbal and nonverbal tutoring strategies in multiparty spoken interactions. The dialogue task is centered on two participants involved in a dialogue aiming to solve a card-ordering game. The participants were paired into teams based on their degree of extraversion, as determined by a personality test. With the participants sits a tutor that helps them perform the task, organizes and balances their interaction, and whose behavior was assessed by the participants after each interaction. Different multimodal signals captured and auto-synchronized by different audio-visual capture technologies, together with manual annotations of the tutor's behavior, constitute the Tutorbot corpus. This corpus is exploited to build a situated model of the interaction based on the participants' temporally-changing state of attention, their conversational engagement and verbal dominance, and their correlation with the verbal and visual feedback and conversation regulatory actions generated by the tutor.

2013

Al Moubayed, S. (2013). Towards rich multimodal behavior in spoken dialogues with embodied agents. In IEEE Int'l Conference on Cognitive Infocommunications. Budapest, Hungary. [abstract] [pdf]
Abstract: Spoken dialogue frameworks have traditionally been designed to handle a single stream of data – the speech signal. Research on human-human communication has provided ample evidence for, and quantified the effects and importance of, a multitude of other multimodal nonverbal signals that people use in their communication and that shape and regulate their interaction. Driven by findings from multimodal human spoken interaction, and by the advancement of capture devices and of robotics and animation technologies, new possibilities are arising for the development of multimodal human-machine interaction that is more affective, social, and engaging. In such face-to-face interaction scenarios, dialogue systems can have a large set of signals at their disposal to infer context and to enhance and regulate the interaction through the generation of verbal and nonverbal facial signals. This paper summarizes several design decisions and experiments that we have followed in attempts to build rich and fluent multimodal interactive systems using a newly developed hybrid robotic head called Furhat, and discusses issues and challenges that this effort is facing.

Al Moubayed, S., Edlund, J., & Gustafson, J. (2013). Analysis of gaze and speech patterns in three-party quiz game interaction. In Interspeech 2013. Lyon, France. [abstract] [pdf]
Abstract: In order to understand and model the dynamics between interaction phenomena such as gaze and speech in face-to-face multiparty interaction between humans, we need large quantities of reliable, objective data of such interactions. To date, this type of data is in short supply. We present a data collection setup using automated, objective techniques in which we capture the gaze and speech patterns of triads deeply engaged in a high-stakes quiz game. The resulting corpus consists of five one-hour recordings, and is unique in that it makes use of three state-of-the-art gaze trackers (one per subject) in combination with a state-of-the-art conical microphone array designed to capture roundtable meetings. Several video channels are also included. In this paper we present the obstacles we encountered and the possibilities afforded by a synchronised, reliable combination of large-scale multi-party speech and gaze data, and an overview of the first analyses of the data.

Al Moubayed, S., Beskow, J., & Skantze, G. (2013). The Furhat Social Companion Talking Head. In Interspeech 2013 - Show and Tell. Lyon, France. [abstract] [pdf]
Abstract: In this demonstrator we present the Furhat robot head. Furhat is a highly human-like robot head in terms of dynamics, thanks to its use of back-projected facial animation. Furhat also takes advantage of a complex and advanced dialogue toolkit designed to facilitate rich and fluent multimodal multiparty human-machine situated and spoken dialogue. The demonstrator will present a social dialogue system with Furhat that allows for several simultaneous interlocutors, takes advantage of several verbal and nonverbal input signals such as speech input, real-time multi-face tracking, and facial analysis, and communicates with its users in a mixed-initiative dialogue, using state-of-the-art speech synthesis with rich prosody, lip-animated facial synthesis, eye and head movements, and gestures.

Al Moubayed, S., Skantze, G., & Beskow, J. (2013). The Furhat Back-Projected Humanoid Head - Lip reading, Gaze and Multiparty Interaction. International Journal of Humanoid Robotics, 10(1). [abstract] [link]
Abstract: In this article, we present Furhat – a back-projected human-like robot head using state-of-the-art facial animation. Three experiments are presented where we investigate how the head might facilitate human-robot face-to-face interaction. First, we investigate how the animated lips increase the intelligibility of the spoken output, and compare this to an animated agent presented on a flat screen, as well as to a human face. Second, we investigate the accuracy of the perception of Furhat's gaze in a setting typical for situated interaction, where Furhat and a human are sitting around a table. The accuracy of the perception of Furhat's gaze is measured depending on eye design, head movement and viewing angle. Third, we investigate the turn-taking accuracy of Furhat in a multi-party interactive setting, as compared to an animated agent on a flat screen. We conclude with some observations from a public setting at a museum, where Furhat interacted with thousands of visitors in a multi-party interaction.

Edlund, J., Al Moubayed, S., Tånnander, C., & Gustafson, J. (2013). Audience response system based annotation of speech. In Fonetik 2013. Linköping, Sweden.

Edlund, J., Al Moubayed, S., & Beskow, J. (2013). Co-present or not? Embodiment, situatedness and the Mona Lisa gaze effect. In Nakano, Y., Conati, C., & Bader, T. (Eds.), Eye Gaze in Intelligent User Interfaces - Gaze-based Analyses, Models, and Applications. Springer. [abstract]
Abstract: The interest in embodying and situating computer programmes took off in the autonomous agents community in the 90s. Today, researchers and designers of programmes that interact with people on human terms endow their systems with humanoid physiognomies for a variety of reasons. In most cases, attempts at achieving this embodiment and situatedness have taken one of two directions: virtual characters and actual physical robots. In addition, a technique that is far from new is gaining ground rapidly: projection of animated faces on head-shaped 3D surfaces. In this chapter, we provide a history of this technique; an overview of its pros and cons; and an in-depth description of the cause and mechanics of the main drawback of 2D displays of 3D faces (and objects): the Mona Lisa gaze effect. We conclude with a description of an experimental paradigm that measures perceived directionality in general and the Mona Lisa gaze effect in particular.

Edlund, J., Al Moubayed, S., Tånnander, C., & Gustafson, J. (2013). Temporal precision and reliability of audience response system based annotation. In Proc. of Multimodal Corpora 2013. Edinburgh, UK. [abstract] [pdf]
Abstract: Manual annotators are often used to label human interaction data. This is associated with high costs and high time consumption, which is one reason annotation by crowdsourcing is increasing. But in crowdsourcing, control over the conditions is largely lost. To get increased throughput with a higher measure of experimental control, we suggest borrowing from the Audience Response Systems used extensively in the film and television industries. We present (a) a cost-efficient setup for rapid, plenary annotation of human spoken interaction data; and (b) a study that quantifies the temporal precision and reliability of annotations made with the system in the auditory perception domain.

Mirnig, N., Weiss, A., Skantze, G., Al Moubayed, S., Gustafson, J., Beskow, J., Granström, B., & Tscheligi, M. (2013). Face-to-Face with a Robot: What do we actually talk about?. International Journal of Humanoid Robotics, 10(1). [abstract] [link]
Abstract: Whereas much of the state-of-the-art research in Human-Robot Interaction (HRI) investigates task-oriented interaction, this paper aims at exploring what people talk about to a robot if the content of the conversation is not predefined. We used the robot head Furhat to explore the conversational behavior of people who encounter a robot in the public setting of a robot exhibition in a scientific museum, but without a predefined purpose. Upon analyzing the conversations, it could be shown that a sophisticated robot provides an inviting atmosphere for people to engage in interaction, to be experimental, and to challenge the robot's capabilities. Many visitors to the exhibition were willing to go beyond the guiding questions that were provided as a starting point. Amongst other things, they asked Furhat questions concerning the robot itself, such as how it would define a robot, or if it plans to take over the world. People were also interested in the feelings and likes of the robot and they asked many personal questions - this is how Furhat ended up with its first marriage proposal. People who talked to Furhat were asked to complete a questionnaire on their assessment of the conversation, with which we could show that the interaction with Furhat was rated as a pleasing experience.

2012

Al Moubayed, S., Beskow, J., Blomberg, M., Granström, B., Gustafson, J., Mirnig, N., & Skantze, G. (2012). Talking with Furhat - multi-party interaction with a back-projected robot head. In Proceedings of Fonetik'12. Gothenburg, Sweden. [abstract] [pdf]
Abstract: This is a condensed presentation of some recent work on a back-projected robotic head for multi-party interaction in public settings. We will describe some of the design strategies and give some preliminary analysis of an interaction database collected at the Robotville exhibition at the London Science Museum.

Al Moubayed, S., Beskow, J., Skantze, G., & Granström, B. (2012). Furhat: A Back-projected Human-like Robot Head for Multiparty Human-Machine Interaction. In Esposito, A., Esposito, A., Vinciarelli, A., Hoffmann, R., & Müller, V. C. (Eds.), Cognitive Behavioural Systems. Lecture Notes in Computer Science (pp. 114-130). Springer.

Al Moubayed, S., Edlund, J., & Beskow, J. (2012). Taming Mona Lisa: communicating gaze faithfully in 2D and 3D facial projections. ACM Transactions on Interactive Intelligent Systems, 1(2), 25. [abstract] [pdf]
Abstract: The perception of gaze plays a crucial role in human-human interaction. Gaze has been shown to matter for a number of aspects of communication and dialogue, especially for managing the flow of the dialogue and participant attention, for deictic referencing, and for the communication of attitude. When developing embodied conversational agents (ECAs) and talking heads, modeling and delivering accurate gaze targets is crucial. Traditionally, systems communicating through talking heads have been displayed to the human conversant using 2D displays, such as flat monitors. This approach introduces severe limitations for an accurate communication of gaze since 2D displays are associated with several powerful effects and illusions, most importantly the Mona Lisa gaze effect, where the gaze of the projected head appears to follow the observer regardless of viewing angle. We describe the Mona Lisa gaze effect and its consequences in the interaction loop, and propose a new approach for displaying talking heads using a 3D projection surface (a physical model of a human head) as an alternative to the traditional flat surface projection. We investigate and compare the accuracy of the perception of gaze direction and the Mona Lisa gaze effect in 2D and 3D projection surfaces in a five-subject gaze perception experiment. The experiment confirms that a 3D projection surface completely eliminates the Mona Lisa gaze effect and delivers very accurate gaze direction that is independent of the observer's viewing angle. Based on the data collected in this experiment, we rephrase the formulation of the Mona Lisa gaze effect. The data, when reinterpreted, confirms the predictions of the new model for both 2D and 3D projection surfaces. Finally, we discuss the requirements on different spatially interactive systems in terms of gaze direction, and propose new applications and experiments for interaction in human-ECA and human-robot settings made possible by this technology.

Al Moubayed, S., & Skantze, G. (2012). Perception of Gaze Direction for Situated Interaction. In Proc. of the 4th Workshop on Eye Gaze in Intelligent Human Machine Interaction, at the 14th ACM International Conference on Multimodal Interaction (ICMI). Santa Monica, CA, USA. [abstract] [pdf]
Abstract: Accurate human perception of robots' gaze direction is crucial for the design of a natural and fluent situated multimodal face-to-face interaction between humans and machines. In this paper, we present an experiment targeted at quantifying the effects of different gaze cues, synthesized using the Furhat back-projected robot head, on the accuracy of perceived spatial direction of gaze by humans, using 18 test subjects. The study first quantifies the accuracy of the perceived gaze direction in a human-human setup, and compares that to the use of synthesized gaze movements in different conditions: viewing the robot's eyes frontally or at a 45-degree side view. We also study the effect of 3D gaze by controlling both eyes to indicate the depth of the focal point, the use of gaze or head pose, and the use of static or dynamic eyelids. The findings of the study are highly relevant to the design and control of robots and animated agents in situated face-to-face interaction.

Al Moubayed, S., Skantze, G., & Beskow, J. (2012). Lip-reading Furhat: Audio Visual Intelligibility of a Back Projected Animated Face. In Proc. of the Intelligent Virtual Agents 10th International Conference (IVA 2012). Santa Cruz, CA, USA: Springer. [abstract] [pdf]
Abstract: Back-projecting a computer-animated face onto a three-dimensional static physical model of a face is a promising technology that is gaining ground as a solution to building situated, flexible and human-like robot heads. In this paper, we first briefly describe Furhat, a back-projected robot head built for the purpose of multimodal multiparty human-machine interaction, and its benefits over virtual characters and robotic heads, and then motivate the need to investigate the contribution that Furhat's face offers to speech intelligibility. We present an audio-visual speech intelligibility experiment, in which 10 subjects listened to short sentences with a degraded speech signal. The experiment compares the gain in intelligibility from lip-reading a face visualized on a 2D screen with that from a 3D back-projected face, and from different viewing angles. The results show that the audio-visual speech intelligibility holds when the avatar is projected onto a static face model (in the case of Furhat), and even, rather surprisingly, exceeds it. This means that despite the movement limitations that back-projected animated face models bring about, their audio-visual speech intelligibility is equal to, or even higher than, that of the same models shown on flat displays. At the end of the paper we discuss several hypotheses on how to interpret the results, and motivate future investigations to better explore the characteristics of visual speech perception of 3D-projected animated faces.

Al Moubayed, S., Skantze, G., Beskow, J., Stefanov, K., & Gustafson, J. (2012). Multimodal Multiparty Social Interaction with the Furhat Head. In Proc. of the 14th ACM International Conference on Multimodal Interaction (ICMI). Santa Monica, CA, USA. [abstract] [pdf]
Abstract: We will show in this demonstrator an advanced multimodal and multiparty spoken conversational system using Furhat, a robot head based on projected facial animation. Furhat is a human-like interface that utilizes facial animation for physical robot heads using back-projection. In the system, multimodality is enabled using speech and rich visual input signals such as multi-person real-time face tracking and microphone tracking. The demonstrator will showcase a system that is able to carry out social dialogue with multiple interlocutors simultaneously, with rich output signals such as eye and head coordination, lip-synchronized speech synthesis, and non-verbal facial gestures used to regulate fluent and expressive multiparty conversations.

Blomberg, M., Skantze, G., Al Moubayed, S., Gustafson, J., Beskow, J., & Granström, B. (2012). Children and adults in dialogue with the robot head Furhat - corpus collection and initial analysis. In Proceedings of WOCCI. Portland, OR. [pdf]

Haider, F., & Al Moubayed, S. (2012). Towards speaker detection using lips movements for human-machine multiparty dialogue. In Proceedings of Fonetik'12. Gothenburg, Sweden. [pdf]

Skantze, G., & Al Moubayed, S. (2012). IrisTK: a statechart-based toolkit for multi-party face-to-face interaction. In Proceedings of ICMI. Santa Monica, CA. [pdf]

Skantze, G., Al Moubayed, S., Gustafson, J., Beskow, J., & Granström, B. (2012). Furhat at Robotville: A Robot Head Harvesting the Thoughts of the Public through Multi-party Dialogue. In Proceedings of IVA-RCVA. Santa Cruz, CA. [pdf]

2011

Al Moubayed, S., Alexanderson, S., Beskow, J., & Granström, B. (2011). A robotic head using projected animated faces. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (pp. 69). [pdf]

Al Moubayed, S., Beskow, J., Edlund, J., Granström, B., & House, D. (2011). Animated Faces for Robotic Heads: Gaze and Beyond. In Esposito, A., Vinciarelli, A., Vicsi, K., Pelachaud, C., & Nijholt, A. (Eds.), Analysis of Verbal and Nonverbal Communication and Enactment: The Processing Issues, Lecture Notes in Computer Science (pp. 19-35). Springer. [pdf]

Al Moubayed, S., & Beskow, J. (2011). A novel Skype interface using SynFace for virtual speech reading support. TMH-QPSR, 51(1), 33-36. [pdf]

Al Moubayed, S., & Skantze, G. (2011). Effects of 2D and 3D Displays on Turn-taking Behavior in Multiparty Human-Computer Dialog. In Proceedings of SemDial (pp. 192-193). Los Angeles, CA. [pdf]

Al Moubayed, S., & Skantze, G. (2011). Turn-taking Control Using Gaze in Multiparty Human-Computer Dialogue: Effects of 2D and 3D Displays. In Proceedings of AVSP. Florence, Italy. [pdf]

Beskow, J., Alexanderson, S., Al Moubayed, S., Edlund, J., & House, D. (2011). Kinetic Data for Large-Scale Analysis and Modeling of Face-to-Face Conversation. In Salvi, G., Beskow, J., Engwall, O., & Al Moubayed, S. (Eds.), Proceedings of AVSP2011 (pp. 103-106). [pdf]

Edlund, J., Al Moubayed, S., & Beskow, J. (2011). The Mona Lisa Gaze Effect as an Objective Metric for Perceived Cospatiality. In Vilhjálmsson, H. H., Kopp, S., Marsella, S., & Thórisson, K. R. (Eds.), Proc. of the Intelligent Virtual Agents 10th International Conference (IVA 2011) (pp. 439-440). Reykjavík, Iceland: Springer. [abstract] [pdf]
Abstract: We propose to utilize the Mona Lisa gaze effect for an objective and repeatable measure of the extent to which a viewer perceives an object as cospatial. Preliminary results suggest that the metric behaves as expected.

Salvi, G., & Al Moubayed, S. (2011). Spoken Language Identification using Frame Based Entropy Measures. TMH-QPSR, 51(1), 69-72. [pdf]

2010

Al Moubayed, S., & Ananthakrishnan, G. (2010). Acoustic-to-Articulatory Inversion based on Local Regression. In Proc. Interspeech. Makuhari, Japan. [abstract] [pdf]
Abstract: This paper presents an Acoustic-to-Articulatory inversion method based on local regression. Two types of local regression, a non-parametric and a local linear regression, have been applied on a corpus containing simultaneous recordings of positions of articulators and the corresponding acoustics. A maximum likelihood trajectory smoothing using the estimated dynamics of the articulators is also applied on the regression estimates. The average root mean square error in estimating articulatory positions, given the acoustics, is 1.56 mm for the non-parametric regression and 1.52 mm for the local linear regression. The local linear regression is found to perform significantly better than regression using Gaussian Mixture Models with the same acoustic and articulatory features.

Al Moubayed, S., Ananthakrishnan, G., & Enflo, L. (2010). Automatic Prominence Classification in Swedish. In Proceedings of Speech Prosody 2010, Workshop on Prosodic Prominence. Chicago, USA. [abstract] [pdf]
Abstract: This study aims at automatically classifying levels of acoustic prominence on a dataset of 200 Swedish sentences of read speech by one male native speaker. Each word in the sentences was categorized by four speech experts into one of three groups depending on the level of prominence perceived. Six acoustic features at the syllable level and seven features at the word level were used. Two machine learning algorithms, namely Support Vector Machines (SVM) and Memory-Based Learning (MBL), were trained to classify the sentences into their respective classes. The MBL gave an average word-level accuracy of 69.08% and the SVM gave an average accuracy of 65.17% on the test set. These values were comparable with the average accuracy of the human annotators with respect to the average annotations. In this study, word duration was found to be the most important feature required for classifying prominence in Swedish read speech.

Al Moubayed, S., & Beskow, J. (2010). Prominence Detection in Swedish Using Syllable Correlates. In Interspeech'10. Makuhari, Japan. [pdf]

Al Moubayed, S., & Beskow, J. (2010). Perception of Nonverbal Gestures of Prominence in Visual Speech Animation. In FAA, The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation. Edinburgh, UK. [pdf]

Al Moubayed, S., Beskow, J., & Granström, B. (2010). Auditory-Visual Prominence: From Intelligibility to Behavior. Journal on Multimodal User Interfaces, 3(4), 299-311. [abstract] [pdf]
Abstract: Auditory prominence is defined as when an acoustic segment is made salient in its context. Prominence is one of the prosodic functions that has been shown to be strongly correlated with facial movements. In this work, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted: speech quality is acoustically degraded and the fundamental frequency is removed from the signal, and the speech is then presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raise gestures, which are synchronized with the auditory prominence. The experiment shows that presenting prominence as facial gestures significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a follow-up study examining the perception of the behavior of the talking heads when gestures are added over pitch accents. Using eye-gaze tracking technology and questionnaires on 10 moderately hearing-impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch accents, as opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and the understanding of the talking head.

Al Moubayed, S., Beskow, J., Granström, B., & House, D. (2010). Audio-Visual Prosody: Perception, Detection, and Synthesis of Prominence. In Esposito, A. et al. (Eds.), Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces: Theoretical and Practical Issues (pp. 55-71). Springer. [abstract] [pdf]
Abstract: In this chapter, we investigate the effects of facial prominence cues, in terms of gestures, when synthesized on animated talking heads. In the first study, a speech intelligibility experiment is conducted where speech quality is acoustically degraded, and the speech is then presented to 12 subjects through a lip-synchronized talking head carrying head-nod and eyebrow-raising gestures. The experiment shows that perceiving visual prominence as gestures, synchronized with the auditory prominence, significantly increases speech intelligibility compared to when these gestures are randomly added to speech. We also present a study examining the perception of the behavior of the talking heads when gestures are added at pitch movements. Using eye-gaze tracking technology and questionnaires for 10 moderately hearing-impaired subjects, the results of the gaze data show that users look at the face in a similar fashion to when they look at a natural face when gestures are coupled with pitch movements, as opposed to when the face carries no gestures. From the questionnaires, the results also show that these gestures significantly increase the naturalness and helpfulness of the talking head.

Beskow, J., & Al Moubayed, S. (2010). Perception of Gaze Direction in 2D and 3D Facial Projections. In The ACM / SSPNET 2nd International Symposium on Facial Analysis and Animation. Edinburgh, UK. [pdf]

Edlund, J., Heldner, M., Al Moubayed, S., Gravano, A., & Hirschberg, J. (2010). Very short utterances in conversation. In Proc. of Fonetik 2010 (pp. 11-16). Lund, Sweden. [abstract] [pdf]
Abstract: Faced with the difficulties of finding an operationalized definition of backchannels, we have previously proposed an intermediate, auxiliary unit – the very short utterance (VSU) – which is defined operationally and is automatically extractable from recorded or ongoing dialogues. Here, we extend that work in the following ways: (1) we test the extent to which the VSU/non-VSU distinction corresponds to backchannels/non-backchannels in a different data set that is manually annotated for backchannels – the Columbia Games Corpus; (2) we examine the extent to which VSUs capture other short utterances with a vocabulary similar to backchannels; (3) we propose a VSU method for better managing turn-taking and barge-ins in spoken dialogue systems based on detection of backchannels; and (4) we attempt to detect backchannels with better precision by training a backchannel classifier using durations and inter-speaker relative loudness differences as features. The results show that VSUs indeed capture a large proportion of backchannels – large enough that VSUs can be used to improve spoken dialogue system turn-taking; and that building a reliable backchannel classifier working in real time is feasible.

2009

Al Moubayed, S. (2009). Prosodic Disambiguation in Spoken Systems Output. In Proceedings of Diaholmia'09. Stockholm, Sweden. [abstract] [pdf]
Abstract: This paper presents work on using prosody in the output of spoken dialogue systems to resolve possible structural ambiguity of output utterances. An algorithm is proposed to discover ambiguous parses of an utterance and to add prosodic disambiguation events to deliver the intended structure. In a pilot experiment, the automatic prosodic grouping applied to ambiguous sentences shows the ability to deliver the intended interpretation of the sentences.

Al Moubayed, S., & Beskow, J. (2009). Effects of Visual Prominence Cues on Speech Intelligibility. In Proceedings of Auditory-Visual Speech Processing AVSP'09. Norwich, England. [abstract] [pdf]
Abstract: This study reports experimental results on the effect of visual prominence, presented as gestures, on speech intelligibility. 30 acoustically vocoded sentences, permuted into different gestural conditions, were presented audio-visually to 12 subjects. The analysis of correct word recognition shows a significant increase in intelligibility when focally-accented (prominent) words are supplemented with head-nod or eyebrow-raise gestures. The paper also examines coupling other acoustic phenomena to brow-raise gestures. As a result, the paper introduces new evidence on the ability of non-verbal movements in the visual modality to support audio-visual speech perception.

Al Moubayed, S., Beskow, J., Öster, A-M., Salvi, G., Granström, B., van Son, N., Ormel, E., & Herzke, T. (2009). Studies on Using the SynFace Talking Head for the Hearing Impaired. In Proceedings of Fonetik'09. Dept. of Linguistics, Stockholm University, Sweden. [abstract] [pdf]
Abstract: SynFace is a lip-synchronized talking agent which is optimized as a visual reading support for the hearing impaired. In this paper we present the large-scale hearing-impaired user studies carried out for three languages in the Hearing at Home project. The user tests focus on measuring the gain in Speech Reception Threshold in Noise and in effort scaling when SynFace is used by hearing-impaired people; groups of hearing-impaired subjects with impairment levels from mild to severe, as well as cochlear implant users, are tested. Preliminary analysis of the results does not show a significant gain in SRT or in effort scaling, but given the large cross-subject variability in both tests, it is clear that many subjects benefit from SynFace, especially with speech in stereo babble.

Al Moubayed, S., Beskow, J., Öster, A., Salvi, G., Granström, B., van Son, N., & Ormel, E. (2009). Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-media Setting. In Proceedings of Interspeech 2009.

Al Moubayed, S., Chetouani, M., Baklouti, M., Dutoit, T., Mahdhaoui, A., Martin, J-C., Ondas, S., Pelachaud, C., Urbain, J., & Yilmaz, M. (2009). Generating Robot/Agent Backchannels During a Storytelling Experiment. In Proceedings of (ICRA'09) IEEE International Conference on Robotics and Automation. Kobe, Japan.

Beskow, J., Salvi, G., & Al Moubayed, S. (2009). SynFace - Verbal and Non-verbal Face Animation from Audio. In Proceedings of The International Conference on Auditory-Visual Speech Processing AVSP'09. Norwich, England. [abstract] [pdf]
Abstract: We give an overview of SynFace, a speech-driven face animation system originally developed for the needs of hard-of-hearing users of the telephone. For the 2009 LIPS challenge, SynFace includes not only articulatory motion but also non-verbal motion of gaze, eyebrows and head, triggered by detection of acoustic correlates of prominence and cues for interaction control. In perceptual evaluations, both verbal and non-verbal movements have been found to have a positive impact on word recognition scores.

Bisitz, T., Herzke, T., Zokoll, M., Öster, A-M., Al Moubayed, S., Granström, B., Ormel, E., Van Son, N., & Tanke, R. (2009). Noise Reduction for Media Streams. In NAG/DAGA'09 International Conference on Acoustics. Rotterdam, Netherlands.

Salvi, G., Beskow, J., Al Moubayed, S., & Granström, B. (2009). SynFace—Speech-Driven Facial Animation for Virtual Speech-Reading Support. EURASIP Journal on Audio, Speech, and Music Processing, 2009. [abstract] [pdf]
Abstract: This paper describes SynFace, a supportive technology that aims at enhancing audio-based spoken communication in adverse acoustic conditions by providing the missing visual information in the form of an animated talking head. Firstly, we describe the system architecture, consisting of a 3D animated face model controlled from the speech input by a specifically optimised phonetic recogniser. Secondly, we report on speech intelligibility experiments with a focus on multilinguality and robustness to audio quality. The system, already available for Swedish, English, and Flemish, was optimised for German and for the Swedish wide-band speech quality available in TV, radio, and Internet communication. Lastly, the paper covers experiments with nonverbal motions driven from the speech signal. It is shown that turn-taking gestures can be used to affect the flow of human-human dialogues. We have focused specifically on two categories of cues that may be extracted from the acoustic signal: prominence/emphasis and interactional cues (turn-taking/back-channelling).

2008

Al Moubayed, S., Beskow, J., & Salvi, G. (2008). SynFace Phone Recognizer for Swedish Wideband and Narrowband Speech. In Proceedings of the Second Swedish Language Technology Conference (SLTC). Stockholm, Sweden. [abstract] [pdf]
Abstract: In this paper, we present new results and comparisons of the real-time lip-synchronized talking head SynFace on different Swedish databases and bandwidths. The work involves training SynFace on narrow-band telephone speech from the Swedish SpeechDat, and on the narrow-band and wide-band Speecon corpus. Auditory perceptual tests are being established for SynFace as an audio-visual hearing support for the hearing-impaired. Preliminary results show high recognition accuracy compared to other languages.

Al Moubayed, S., De Smet, M., & Van Hamme, H. (2008). Lip Synchronization: from Phone Lattice to PCA Eigen-projections using Neural Networks. In Proceedings of Interspeech 2008. Brisbane, Australia. [pdf]

Beskow, J., Granström, B., Nordqvist, P., Al Moubayed, S., Salvi, G., Herzke, T., & Schulz, A. (2008). Hearing at Home – Communication support in home environments for hearing impaired persons. In Proceedings of Interspeech 2008. Brisbane, Australia. [abstract] [pdf]

[disclaimer] This is a personal homepage. Opinions expressed here, or implied by links provided, do not represent the official views of KTH.