The Role of Creaky Voice in Turn-Taking and the Perception of Speaker Stance: Experiments Using Controllable TTS

Harm Lameris, Éva Székely, Joakim Gustafson

LREC-COLING 2024, Torino, Italia, 20-25 May, 2024

Abstract

Recent work in spontaneous text-to-speech (TTS) has achieved naturalistic synthesis of creaky voice, a voice quality that has many pragmatic as well as paralinguistic functions. In this paper, we investigate the perception of two types of synthesized creaky voice by non-expert listeners. We annotated a spontaneous speech corpus using an automatic creaky phonation detection tool and modified a Tacotron 2 TTS engine with a creaky phonation embedding, which makes it possible to synthesize a creaky voice style. We performed two subjective listening experiments. Compared with stimuli synthesized with modal phonation, participants rated non-positional creak as sounding less certain, less positive, and more turn-final, while positional creak at the end of an utterance was rated as more turn-final. We also performed an objective analysis of the two types of creaky phonation and of modal phonation using a creak detection tool, which indicated significant differences in the level of creaky phonation present for each phonation type.
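As a rough illustration of the kind of objective comparison described in the abstract, the sketch below summarizes the level of creak per phonation type. It assumes that a creak detector yields one creak probability per frame and that a frame counts as creaky at a threshold of 0.5. Both are assumptions made for this example, as are the function names and the input values. It is not the analysis code used in the paper, and it does not include a significance test.

```python
from statistics import mean


def creak_proportion(frame_probs, threshold=0.5):
    """Return the fraction of frames whose creak probability is >= threshold.

    frame_probs is a hypothetical list of per-frame creak probabilities;
    the threshold of 0.5 is an assumption, not a value from the paper.
    """
    if not frame_probs:
        return 0.0
    creaky = sum(1 for p in frame_probs if p >= threshold)
    return creaky / len(frame_probs)


def mean_creak_by_condition(stimuli, threshold=0.5):
    """Average the per-stimulus creak proportion within each condition.

    stimuli is an iterable of (condition_label, frame_probs) pairs.
    """
    by_condition = {}
    for condition, frame_probs in stimuli:
        by_condition.setdefault(condition, []).append(
            creak_proportion(frame_probs, threshold)
        )
    return {c: mean(values) for c, values in by_condition.items()}


# Made-up illustrative values, NOT data from the paper.
example = [
    ("no creak", [0.1, 0.2, 0.1, 0.3]),
    ("terminal creak", [0.1, 0.2, 0.8, 0.9]),
    ("non-terminal creak", [0.7, 0.4, 0.9, 0.6]),
]
summary = mean_creak_by_condition(example)
```

With real detector output, the per-stimulus proportions for each phonation type could then be passed to a statistical test of one's choice.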

Stimuli for the Subjective Listening Experiments

Stimulus	No creak	Non-terminal creak	No creak	Terminal creak
1
2
3
4
5
6
7
8
9
10
11
12

Model and further stimuli

If you are interested in replicating this experiment, feel free to reach out to me by email to get access to the model and the remaining stimuli.