


2014

Skantze, G., Hjalmarsson, A., & Oertel, C. (2014). Turn-taking, Feedback and Joint Attention in Situated Human-Robot Interaction. Speech Communication, 65, 50-66.
Abstract: In this paper, we present a study where a robot instructs a human on how to draw a route on a map. The human and robot are seated face-to-face with the map placed on the table between them. The user’s and the robot’s gaze can thus serve several simultaneous functions: as cues to joint attention, turn-taking, level of understanding and task progression. We have compared this face-to-face setting with a setting where the robot employs a random gaze behaviour, as well as a voice-only setting where the robot is hidden behind a paper board. In addition to this, we have also manipulated turn-taking cues such as completeness and filled pauses in the robot’s speech. By analysing the participants’ subjective rating, task completion, verbal responses, gaze behaviour, and drawing activity, we show that the users indeed benefit from the robot’s gaze when talking about landmarks, and that the robot’s verbal and gaze behaviour has a strong effect on the users’ turn-taking behaviour. We also present an analysis of the users’ gaze and lexical and prosodic realisation of feedback after the robot instructions, and show that these cues reveal whether the user has yet executed the previous instruction, as well as the user’s level of uncertainty.

2013

Heldner, M., Hjalmarsson, A., & Edlund, J. (2013). Backchannel relevance spaces. In Asu, E. L., & Lippus, P. (Eds.), Nordic Prosody: Proceedings of the XIth Conference (pp. 137-146). Frankfurt am Main, Germany: Peter Lang.

Skantze, G., & Hjalmarsson, A. (2013). Towards Incremental Speech Generation in Conversational Systems. Computer Speech & Language, 27(1), 243-262.
Abstract: This paper presents a model of incremental speech generation in practical conversational systems.
The model allows a conversational system to incrementally interpret spoken input, while simultaneously planning, realising and self-monitoring the system response. If these processes are time consuming and result in a response delay, the system can automatically produce hesitations to retain the floor. While speaking, the system utilises hidden and overt self-corrections to accommodate revisions in the system. The model has been implemented in a general dialogue system framework. Using this framework, we have implemented a conversational game application. A Wizard-of-Oz experiment is presented, where the automatic speech recognizer is replaced by a Wizard who transcribes the spoken input. In this setting, the incremental model allows the system to start speaking while the user's utterance is being transcribed. In comparison to a non-incremental version of the same system, the incremental version has a shorter response time and is perceived as more efficient by the users.

Skantze, G., Hjalmarsson, A., & Oertel, C. (2013). Exploring the effects of gaze and pauses in situated human-robot interaction. In 14th Annual Meeting of the Special Interest Group on Discourse and Dialogue - SIGDial. Metz, France. (*)
(*) Nominated for Best Paper Award at SIGdial 2013
Abstract: In this paper, we present a user study where a robot instructs a human on how to draw a route on a map, similar to a Map Task. This setup has allowed us to study user reactions to the robot’s conversational behaviour in order to get a better understanding of how to generate utterances in incremental dialogue systems. We have analysed the participants' subjective rating, task completion, verbal responses, gaze behaviour, drawing activity, and cognitive load. The results show that users utilise the robot’s gaze in order to disambiguate referring expressions and manage the flow of the interaction.
Furthermore, we show that the user’s behaviour is affected by how pauses are realised in the robot’s speech.

Skantze, G., Oertel, C., & Hjalmarsson, A. (2013). User feedback in human-robot interaction: Prosody, gaze and timing. In Proceedings of Interspeech.
Abstract: This paper investigates forms and functions of user feedback in a map task dialogue between a human and a robot, where the robot is the instruction-giver and the human is the instruction-follower. First, we investigate how user acknowledgements in task-oriented dialogue signal whether an activity is about to be initiated or has been completed. The parameters analysed include the users’ lexical and prosodic realisation as well as gaze direction and response timing. Second, we investigate the relation between these parameters and the perception of uncertainty.

Strömbergsson, S., Hjalmarsson, A., Edlund, J., & House, D. (2013). Timing responses to questions in dialogue. In Proceedings of Interspeech 2013 (pp. 2584-2588). Lyon, France.
Abstract: Questions and answers play an important role in spoken dialogue systems as well as in human-human interaction. A critical concern when responding to a question is the timing of the response. While human response times depend on a wide set of features, dialogue systems generally respond as soon as they can, that is, when the end of the question has been detected and the response is ready to be deployed. This paper presents an analysis of how different semantic and pragmatic features affect the response times to questions in two different data sets of spontaneous human-human dialogues: the Swedish Spontal Corpus and the US English Switchboard corpus. Our analysis shows that contextual features such as question type, response type, and conversation topic influence human response times.
Based on these results, we propose that more sophisticated response timing can be achieved in spoken dialogue systems by using these features to automatically and deliberately target system response timing.

2012

Edlund, J., Alexandersson, S., Beskow, J., Gustavsson, L., Heldner, M., Hjalmarsson, A., Kallionen, P., & Marklund, E. (2012). 3rd party observer gaze as a continuous measure of dialogue flow. In Proc. of LREC 2012. Istanbul, Turkey.
Abstract: We present an attempt at using 3rd party observer gaze to get a measure of how appropriate each segment in a dialogue is for a speaker change. The method is a step away from the current dependency of speaker turns or talkspurts towards a more general view of speaker changes. We show that 3rd party observers do indeed largely look at the same thing (the speaker), and how this can be captured and utilized to provide insights into human communication.

Edlund, J., Heldner, M., & Hjalmarsson, A. (2012). 3rd party observer gaze during backchannels. In Proc. of the Interspeech 2012 Interdisciplinary Workshop on Feedback Behaviors in Dialog. Skamania Lodge, WA, USA.

Edlund, J., & Hjalmarsson, A. (2012). Is it really worth it? Cost-based selection of system responses to speech-in-overlap. In Proc. of the IVA 2012 workshop on Realtime Conversational Virtual Agents (RCVA 2012). Santa Cruz, CA, USA.
Abstract: For purposes of discussion and feedback, we present a preliminary version of a simple yet powerful cost-based framework for spoken dialogue systems to continuously and incrementally decide whether to speak or not. The framework weighs the cost of producing speech in overlap against the cost of not speaking when something needs saying.
Main features include a small number of parameters controlling characteristics that are readily understood, allowing manual tweaking as well as interpretation of trained parameter settings; observation-based estimates of expected overlap which can be adapted dynamically; and a simple and general method for context dependency. No evaluation has yet been undertaken, but the effects of the parameters; the observation-based cost of expected overlap trained on Switchboard data; and the context dependency using inter-speaker intensity differences from the same corpus are demonstrated with generated input data in the context of user barge-ins.

Edlund, J., Hjalmarsson, A., & Tånnander, C. (2012). Unconventional methods in perception experiments. In Proc. of Nordic Prosody XI. Tartu, Estonia.

Hjalmarsson, A., & Oertel, C. (2012). Gaze direction as a Back-Channel inviting Cue in Dialogue. In Proc. of the IVA 2012 workshop on Realtime Conversational Virtual Agents (RCVA 2012). Santa Cruz, CA, USA.
Abstract: In this study, we experimentally explore the relationship between gaze direction and backchannels in face-to-face interaction. The overall motivation is to use gaze direction in a virtual agent as a means to elicit user feedback. The relationship between gaze and backchannels was tested in an experiment in which participants were asked to provide feedback when listening to a story-telling virtual agent. When speaking, the agent shifted her gaze towards the listener at predefined positions in the dialogue. The results show that listeners are more prone to backchannel when the virtual agent’s gaze is directed towards them than when it is directed away. However, there is a high response variability for different dialogue contexts which suggests that the timing of backchannels cannot be explained by gaze direction alone.

2011

Heldner, M., Edlund, J., Hjalmarsson, A., & Laskowski, K. (2011). Very short utterances and timing in turn-taking.
In Proceedings of Interspeech 2011 (pp. 2837-2840). Florence, Italy.
Abstract: This work explores the timing of very short utterances in conversations, as well as the effects of excluding intervals adjacent to such utterances from distributions of between-speaker interval durations. The results show that very short utterances are more precisely timed to the preceding utterance than longer utterances in terms of a smaller variance and a larger proportion of no-gap-no-overlaps. Excluding intervals adjacent to very short utterances furthermore results in measures of central tendency closer to zero (i.e. no-gap-no-overlaps) as well as larger variance (i.e. relatively longer gaps and overlaps).

Hjalmarsson, A. (2011). The additive effect of turn-taking cues in human and synthetic voice. Speech Communication, 53(1), 23-35.
Abstract: A previous line of research suggests that interlocutors identify appropriate places to speak by cues in the behaviour of the preceding speaker. If used in combination, these cues have an additive effect on listeners’ turn-taking attempts. The present study further explores these findings by examining the effect of such turn-taking cues experimentally. The objective is to investigate the possibilities of generating turn-taking cues with a synthetic voice. Thus, in addition to stimuli realized with a human voice, the experiment included dialogues where one of the speakers is replaced with a synthesis. The turn-taking cues investigated include intonation, phrase-final lengthening, semantic completeness, stereotyped lexical expressions and non-lexical speech production phenomena such as lexical repetitions, breathing and lip-smacks. The results show that the turn-taking cues realized with a synthetic voice affect the judgements similar to the corresponding human version and there is no difference in reaction times between these two conditions.
Furthermore, the results support Duncan’s findings: the more turn-taking cues with the same pragmatic function, turn-yielding or turn-holding, the higher the agreement among subjects on the expected outcome. In addition, the number of turn-taking cues affects the reaction times for these decisions. Thus, the more cues, the faster the reaction time.

Hjalmarsson, A., & Laskowski, K. (2011). Measuring final lengthening for speaker-change prediction. In Proceedings of Interspeech 2011 (pp. 2069-2072). Florence, Italy.
Abstract: We explore pre-silence syllabic lengthening as a cue for next-speakership prediction in spontaneous dialogue. When estimated using a transcription-mediated procedure, lengthening is shown to reduce error rates by 25% relative to majority class guessing. This indicates that lengthening should be exploited by dialogue systems. With that in mind, we evaluate an automatic measure of spectral envelope change, Mel-spectral flux (MSF), and show that its performance is at least as good as that of the transcription-mediated measure. Modeling MSF is likely to improve turn uptake in dialogue systems, and to benefit other applications needing an estimate of durational variability in speech.

2010

Beskow, J., Edlund, J., Gustafson, J., Heldner, M., Hjalmarsson, A., & House, D. (2010). Modelling humanlike conversational behaviour. In Proceedings of SLTC 2010. Linköping, Sweden.

Beskow, J., Edlund, J., Gustafson, J., Heldner, M., Hjalmarsson, A., & House, D. (2010). Research focus: Interactional aspects of spoken face-to-face communication. In Proc. of Fonetik 2010 (pp. 7-10). Lund, Sweden.
Abstract: We have a visionary goal: to learn enough about human face-to-face interaction that we are able to create an artificial conversational partner that is human-like.
We take the opportunity here to present four new projects inaugurated in 2010, each adding pieces of the puzzle through a shared research focus: interactional aspects of spoken face-to-face communication.

Hirschberg, J., Hjalmarsson, A., & Elhadad, N. (2010). "You're as Sick as You Sound" Using Computational Approaches for Modeling Speaker State to Gauge Illness and Recovery. In Neustein, A. (Ed.), Mobile Environments, Call Centers and Clinics (pp. 305-322). Springer.
Abstract: Recently, researchers in computer science and engineering have begun to explore the possibility of finding speech-based correlates of various medical conditions using automatic, computational methods. If such language cues can be identified and quantified automatically, this information can be used to support diagnosis and treatment of medical conditions in clinical settings and to further fundamental research in understanding cognition. This chapter reviews computational approaches that explore communicative patterns of patients who suffer from medical conditions such as depression, autism spectrum disorders, schizophrenia, and cancer. There are two main approaches discussed: research that explores features extracted from the acoustic signal and research that focuses on lexical and semantic features. We also present some applied research that uses computational methods to develop assistive technologies. In the final sections we discuss issues related to and the future of this emerging field of research.

Hjalmarsson, A. (2010). Human interaction as a model for spoken dialogue system behaviour. Doctoral dissertation.
Abstract: This thesis is a step towards the long-term and high-reaching objective of building dialogue systems whose behaviour is similar to a human dialogue partner. The aim is not to build a machine with the same conversational skills as a human being, but rather to build a machine that is human enough to encourage users to interact with it accordingly.
The behaviours in focus are cue phrases, hesitations and turn-taking cues. These behaviours serve several important communicative functions such as providing feedback and managing turn-taking. Thus, if dialogue systems could use interactional cues similar to those of humans, these systems could be more intuitive to talk to. A major part of this work has been to collect, identify and analyze the target behaviours in human-human interaction in order to gain a better understanding of these phenomena. Another part has been to reproduce these behaviours in a dialogue system context and explore listeners’ perceptions of these phenomena in empirical experiments. The thesis is divided into two parts. The first part serves as an overall background. The issues and motivations of humanlike dialogue systems are discussed. This part also includes an overview of research on human language production and spoken language generation in dialogue systems. The next part presents the data collections, data analyses and empirical experiments that this thesis is concerned with. The first study presented is a listening test that explores human behaviour as a model for dialogue systems. The results show that a version based on human behaviour is rated as more humanlike, polite and intelligent than a constrained version with less variability. Next, the DEAL dialogue system is introduced. DEAL is used as a platform for the research presented in this thesis. The domain of the system is a trade domain and the target audience are second language learners of Swedish who want to practice conversation. Furthermore, a data collection of human-human dialogues in the DEAL domain is presented. Analyses of cue phrases in these data are provided as well as an experimental study of turn-taking cues. The results from the turn-taking experiment indicate that turn-taking cues realized with a diphone synthesis affect the expectations of a turn change similar to the corresponding human version.
Finally, an experimental study that explores the use of talkspurt-initial cue phrases in an incremental version of DEAL is presented. The results show that the incremental version had shorter response times and was rated as more efficient, more polite and better at indicating when to speak than a non-incremental implementation of the same system.

Hjalmarsson, A. (2010). The vocal intensity of turn-initial cue phrases and filled pauses in dialogue. In Proceedings of SIGdial (pp. 225-228). Tokyo, Japan.
Abstract: The present study explores the vocal intensity of turn-initial cue phrases in a corpus of dialogues in Swedish. Cue phrases convey relatively little propositional content, but have several important pragmatic functions. The majority of these entities are frequently occurring monosyllabic words such as “eh”, “mm”, “ja”. Prosodic analysis shows that these words are produced with higher intensity than other turn-initial words are. In light of these results, it is suggested that speakers produce these expressions with high intensity in order to claim the floor. It is further shown that the difference in intensity can be measured as a dynamic inter-speaker relation over the course of a dialogue using the end of the interlocutor’s previous turn as a reference point.

Skantze, G., & Hjalmarsson, A. (2010). Towards Incremental Speech Generation in Dialogue Systems. In Proceedings of SIGdial (pp. 1-8). Tokyo, Japan. (*)
(*) Best Paper Award at SIGdial 2010
Abstract: We present a first step towards a model of speech generation for incremental dialogue systems. The model allows a dialogue system to incrementally interpret spoken input, while simultaneously planning, realising and self-monitoring the system response. The model has been implemented in a general dialogue system framework.
Using this framework, we have implemented a specific application and tested it in a Wizard-of-Oz setting, comparing it with a non-incremental version of the same system. The results show that the incremental version, while producing longer utterances, has a shorter response time and is perceived as more efficient by the users.

2009

Beskow, J., Carlson, R., Edlund, J., Granström, B., Heldner, M., Hjalmarsson, A., & Skantze, G. (2009). Multimodal Interaction Control. In Waibel, A., & Stiefelhagen, R. (Eds.), Computers in the Human Interaction Loop (pp. 143-158). Berlin/Heidelberg: Springer.

Hjalmarsson, A. (2009). On cue - additive effects of turn-regulating phenomena in dialogue. In Diaholmia (pp. 27-34).
Abstract: One line of work on turn-taking in dialogue suggests that speakers react to “cues” or “signals” in the behaviour of the preceding speaker. This paper describes a perception experiment that investigates if such potential turn-taking cues affect the judgments made by non-participating listeners. The experiment was designed as a game where the task was to listen to dialogues and guess the outcome, whether there will be a speaker change or not, whenever the recording was halted. Human-human dialogues as well as dialogues where one of the human voices was replaced by a synthetic voice were used. The results show that simultaneous turn-regulating cues have a reinforcing effect on the listeners’ judgements. The more turn-holding cues, the faster the reaction time, suggesting that the subjects were more confident in their judgments. Moreover, the more cues, regardless if turn-holding or turn-yielding, the higher the agreement among subjects on the predicted outcome. For the re-synthesized voice, responses were made significantly slower; however, the judgments show that the turn-taking cues were interpreted as having similar functions as for the original human voice.

Wik, P., & Hjalmarsson, A. (2009).
Embodied conversational agents in computer assisted language learning. Speech Communication, 51(10), 1024-1037.
Abstract: This paper describes two systems using embodied conversational agents (ECAs) for language learning. The first system, called Ville, is a virtual language teacher for vocabulary and pronunciation training. The second system, a dialogue system called DEAL, is a role-playing game for practicing conversational skills. Whereas DEAL acts as a conversational partner with the objective of creating and keeping an interesting dialogue, Ville takes the role of a teacher who guides, encourages and gives feedback to the students.

2008

Edlund, J., Gustafson, J., Heldner, M., & Hjalmarsson, A. (2008). Towards human-like spoken dialogue systems. Speech Communication, 50(8-9), 630-645.
Abstract: This paper presents an overview of methods that can be used to collect and analyse data on user responses to spoken dialogue system components intended to increase human-likeness, and to evaluate how well the components succeed in reaching that goal. Wizard-of-Oz variations, human-human data manipulation, and micro-domains are discussed in this context, as is the use of third-party reviewers to get a measure of the degree of human-likeness. We also present the two-way mimicry target, a model for measuring how well a human-computer dialogue mimics or replicates some aspect of human-human dialogue, including human flaws and inconsistencies. Although we have added a measure of innovation, none of the techniques is new in its entirety. Taken together and described from a human-likeness perspective, however, they form a set of tools that may widen the path towards human-like spoken dialogue systems.

Hjalmarsson, A. (2008). Speaking without knowing what to say... or when to end. In Proceedings of SIGdial 2008 (pp. 72-75). Columbus, Ohio, USA.
Abstract: Humans produce speech incrementally and on-line as the dialogue progresses using information from several different sources in parallel. A dialogue system that generates output in a stepwise manner and not in preplanned syntactically correct sentences needs to signal how new dialogue contributions relate to previous discourse. This paper describes a data collection which is the foundation for an effort towards more human-like language generation in DEAL, a spoken dialogue system developed at KTH. Two annotators labelled cue phrases in the corpus with high inter-annotator agreement (kappa coefficient 0.82).

Hjalmarsson, A., & Edlund, J. (2008). Human-likeness in utterance generation: effects of variability. In Perception in Multimodal Dialogue Systems - Proceedings of the 4th IEEE Tutorial and Research Workshop on Perception and Interactive Technologies for Speech-Based Systems, PIT 2008, Kloster Irsee, Germany, June 16-18, 2008. (pp. 252-255). Berlin/Heidelberg: Springer.
Abstract: There are compelling reasons to endow dialogue systems with human-like conversational abilities, which require modelling of aspects of human behaviour. This paper examines the value of using human behaviour as a target for system behaviour through a study making use of a simulation method. Two versions of system behaviour are compared: a replica of a human speaker’s behaviour and a constrained version with less variability. The version based on human behaviour is rated more human-like, polite and intelligent.

2007

Brusk, J., Lager, T., Hjalmarsson, A., & Wik, P. (2007). DEAL – Dialogue Management in SCXML for Believable Game Characters. In Proceedings of ACM Future Play 2007 (pp. 137-144).
Abstract: In order for game characters to be believable, they must appear to possess qualities such as emotions, the ability to learn and adapt as well as being able to communicate in natural language.
With this paper we aim to contribute to the development of believable non-player characters (NPCs) in games, by presenting a method for managing NPC dialogues. We have selected the trade scenario as an example setting since it offers a well-known and limited domain common in games that support ownership, such as role-playing games. We have developed a dialogue manager in State Chart XML, a newly introduced W3C standard, as part of DEAL -- a research platform for exploring the challenges and potential benefits of combining elements from computer games, dialogue systems and language learning.

Hjalmarsson, A., Wik, P., & Brusk, J. (2007). Dealing with DEAL: a dialogue system for conversation training. In Proceedings of SIGdial (pp. 132-135). Antwerp, Belgium.
Abstract: We present DEAL, a spoken dialogue system for conversation training under development at KTH. DEAL is a game with a spoken language interface designed for second language learners. The system is intended as a multidisciplinary research platform where challenges and potential benefits of combining elements from computer games, dialogue systems and language learning can be explored.

Wik, P., Hjalmarsson, A., & Brusk, J. (2007). Computer Assisted Conversation Training for Second Language Learners. Proceedings of Fonetik, TMH-QPSR, 50(1), 57-60.

Wik, P., Hjalmarsson, A., & Brusk, J. (2007). DEAL A Serious Game For CALL Practicing Conversational Skills In The Trade Domain. In Proceedings of SLATE 2007.
Abstract: This paper describes work in progress on DEAL, a spoken dialogue system under development at KTH. It is intended as a platform for exploring the challenges and potential benefits of combining elements from computer games, dialogue systems and language learning.

2006

Carlson, R., Edlund, J., Heldner, M., Hjalmarsson, A., House, D., & Skantze, G. (2006). Towards human-like behaviour in spoken dialog systems.
In Proceedings of Swedish Language Technology Conference (SLTC 2006). Gothenburg, Sweden.

2005

Edlund, J., & Hjalmarsson, A. (2005). Applications of distributed dialogue systems: the KTH Connector. In Proceedings of ISCA Tutorial and Research Workshop on Applied Spoken Language Interaction in Distributed Environments (ASIDE 2005). Aalborg, Denmark.
Abstract: We describe a spoken dialogue system domain: that of the personal secretary. This domain allows us to capitalise on the characteristics that make speech a unique interface; characteristics that humans use regularly, implicitly, and with remarkable ease. We present a prototype system - the KTH Connector - and highlight several dialogue research issues arising in the domain.

Hjalmarsson, A. (2005). Towards user modelling in conversational dialogue systems: A qualitative study of the dynamics of dialogue parameters. In Proceedings of Interspeech 2005 (pp. 869-872). Lisbon, Portugal.
Abstract: This paper presents a qualitative study of data from a 26 subject experimental study within the multimodal, conversational dialogue system AdApt. Qualitative analysis of data is used to illustrate the dynamic variation of dialogue parameters over time. The analysis will serve as a foundation for research and future data collections in the area of adaptive dialogue systems and user modelling.

2002

Hjalmarsson, A. (2002). Att utvärdera AdApt, ett multimodalt konverserande dialogsystem, med PARADISE. Master's thesis, KTH, TMH, CTT.




Published by: Anna Hjalmarsson, Speech, Music and Hearing, KTH