Human interaction as a model for spoken dialogue system behaviour
Doctoral Thesis in Speech Communication, KTH, Stockholm, Sweden
The thesis was defended September 3, 2010
Opponent: David Schlangen
Reviewing commitee: Robin Cooper, Christina Hellman, Hedvig Kjellström
This thesis is a step towards the long-term and high-reaching objective of building dialogue systems whose behaviour is similar to a human dialogue partner. The aim is not to build a machine with the same conversational skills as a human being, but rather to build a machine that is human enough to encourage users to interact with it accordingly. The behaviours in focus are cue phrases, hesitations and turn-taking cues. These behaviours serve several important communicative functions such as providing feedback and managing turn-taking. Thus, if dialogue systems could use interactional cues similar to those of humans, these systems could be more intuitive to talk to. A major part of this work has been to collect, identify and analyze the target behaviours in human-human interaction in order to gain a better understanding of these phenomena. Another part has been to reproduce these behaviours in a dialogue system context and explore listeners’ perceptions of these phenomena in empirical experiments.
The thesis is divided into two parts. The first part serves as an overall background. The issues and motivations of humanlike dialogue systems are discussed. This part also includes an overview of research on human language production and spoken language generation in dialogue systems.
The next part presents the data collections, data analyses and empirical experiments that this thesis is concerned with. The first study presented is a listening test that explores human behaviour as a model for dialogue systems. The results show that a version based on human behaviour is rated as more humanlike, polite and intelligent than a constrained version with less variability. Next, the DEAL dialogue system is introduced. DEAL is used as a platform for the research presented in this thesis. The domain of the system is a trade domain and the target audience are second language learners of Swedish who want to practice conversation. Furthermore, a data collection of human-human dialogues in the DEAL domain is presented. Analyses of cue phrases in these data are provided as well as an experimental study of turn-taking cues. The results from the turn-taking experiment indicate that turn-taking cues realized with a diphone synthesis affect the expectations of a turn change similar to the corresponding human version.
Finally, an experimental study that explores the use of talkspurt-initial cue phrases in an incremental version of DEAL is presented. The results show that the incremental version had shorter response times and was rated as more efficient, more polite and better at indicating when to speak than a non-incremental implementation of the same system.