Lameris, H., Gustafson, J. and Székely, É. (2023) "Beyond Style: Synthesizing Speech with Pragmatic Functions". Proceedings of Interspeech 2023, Dublin, Ireland.

Abstract

With recent advances in generative modeling, conversational systems are becoming more lifelike and capable of longer, more nuanced interactions. Text-to-Speech (TTS) is being tested in territories requiring natural-sounding speech that can mimic the complexities of human conversation. Hyper-realistic speech generation has been achieved, but a gap remains between the verbal behavior required for upscaled conversation, such as paralinguistic information and pragmatic functions, and comprehension of the acoustic prosodic correlates underlying these. Without this knowledge, reproducing these functions in speech has little value. We use prosodic correlates including spectral peaks, spectral tilt, and creak percentage for speech synthesis with the pragmatic functions of small talk, self-directed speech, advice, and instructions. We perform a MOS evaluation and a suitability experiment in which our system outperforms both a read-speech baseline and a conversational baseline.
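As a rough illustration of how a MOS evaluation like the one mentioned above is commonly summarized, the sketch below computes a mean opinion score with a normal-approximation 95% confidence interval per system. The function name `mos_with_ci`, the system labels, and all ratings are hypothetical and invented for illustration; they are not the paper's data, and the paper's actual analysis may differ.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, CI half-width) for a list of 1-5 opinion ratings.

    Uses a normal approximation (z = 1.96 for roughly 95% coverage),
    which is an assumption of this sketch, not the paper's stated method.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    # Standard error of the mean from the sample standard deviation.
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, z * sem


# Hypothetical ratings, invented for illustration only.
ratings_by_system = {
    "pragmatic": [4, 5, 4, 3, 4, 5],
    "spontaneous": [4, 3, 4, 3, 4, 4],
    "read_speech": [3, 3, 4, 2, 3, 4],
}

for name, ratings in ratings_by_system.items():
    mos, half_width = mos_with_ci(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {half_width:.2f}")
```

With few ratings per system, a t-distribution interval or a bootstrap would likely be a safer choice than the fixed z value used here.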

Sentences for the Pragmatic Function evaluation

Function | Pragmatic | Spontaneous | Read Speech
Self-directed (3 sentences)
Advice (3 sentences)
Small talk (3 sentences)
Instructions (3 sentences)