KTH Department of Speech, Music and Hearing
Context-aware speech synthesis for conversational AI — moving beyond generic read-speech delivery toward spontaneous, adaptive voices that breathe, hesitate, and engage like humans.
Traditional text-to-speech systems deliver speech the way a newscaster reads from a script — clear, neutral, and utterly disconnected from the flow of real conversation.
Connected set out to change that. Funded by the Swedish Research Council (VR) from 2020 to 2025, the project asked: what does it take to build a speech synthesiser that does not just read text, but speaks — breathing naturally between phrases, inserting fillers to convey hesitation, and adapting its voice quality to the social and pragmatic context of a conversation?
The name captures the core ambition: a synthesiser that is connected — to its own previous output, to its conversational partner, and to the communicative situation it finds itself in. Rather than treating each sentence as an isolated unit, the project trained models on continuous breath-group sequences drawn from spontaneous speech corpora, capturing the natural flow of connected discourse.
The research spanned computational modelling of spontaneous speech phenomena, fine-grained control of paralinguistic cues, joint speech and gesture synthesis for embodied agents, and the development of novel evaluation paradigms that go beyond the traditional Mean Opinion Score test.
Have you ever tried making small talk with a smart speaker? Today's voice assistants sound natural reading the news, but the illusion shatters in real conversation: they speak from a script, never hesitating or breathing naturally. Traditional synthesis is trained on perfectly read, isolated sentences. The Connected project instead trained its models on spontaneous speech segmented at natural inhalations, pairing consecutive breath groups into bigrams, and produced a voice that breathes and pauses the way humans do. Where a filler like "uh" falls in a sentence, combined with pitch and speaking rate, directly shapes how confident or uncertain a speaker sounds. Machines can now be tuned to express exactly the right degree of certainty for any situation.
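To make the breath-group idea concrete, here is a minimal sketch of how a corpus might be segmented and paired into bigrams. The BreathGroup structure, the timestamps, and the transcripts are illustrative assumptions; the project's actual breath-detection pipeline is not reproduced here.

```python
# Sketch: build breath-group "bigram" training units from a
# spontaneous-speech corpus. The BreathGroup records stand in for
# the output of an automated breath detector.
from dataclasses import dataclass

@dataclass
class BreathGroup:
    start: float   # seconds, end of the previous inhalation
    end: float     # seconds, start of the next inhalation
    text: str      # transcript of the speech between breaths

def breath_group_bigrams(groups: list[BreathGroup]):
    """Pair consecutive breath groups so the model sees each
    phrase together with the breath and phrase that follow it."""
    return [(groups[i], groups[i + 1]) for i in range(len(groups) - 1)]

# Example: three breath groups yield two overlapping training units,
# each spanning one inhalation in context.
groups = [
    BreathGroup(0.0, 2.1, "so I was thinking"),
    BreathGroup(2.5, 5.0, "uh maybe we could try it"),
    BreathGroup(5.4, 7.2, "and see what happens"),
]
for a, b in breath_group_bigrams(groups):
    print(f"[{a.text}] <inhale> [{b.text}]")
```

Training on such overlapping pairs means the model sees every inhalation in its surrounding context, rather than only at utterance boundaries.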
Smooth conversation depends on knowing whose turn it is to speak. The project modelled creaky voice (the subtle low rattle heard at the end of utterances) and smiling voice, the audible warmth that enters speech when we grin. Creaky voice turned out to be a powerful turn-yielding cue: when added to the end of an utterance, listeners instinctively understood the speaker was done, reducing the risk of unwanted interruptions. A synthesised smile in the voice similarly changes how the spoken content is perceived, opening the door to more empathetic systems. Together, fillers, pitch, speech rate, and voice quality form a rich expressive palette: the very same words can sound confident or hesitant, warm or detached, depending entirely on how they are delivered.
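One common way to give a synthesiser access to such cues is to mark them up in the input text; the sketch below illustrates that control-tag idea. The tag inventory and markup format are hypothetical assumptions, not the project's actual interface.

```python
# Sketch: exposing voice quality to a synthesiser as control tags
# on the input text. Tag names and markup format are hypothetical;
# they illustrate the control interface, not the project's system.

VOICE_QUALITY_TAGS = {"creaky", "smiling", "modal"}

def tag_span(text: str, quality: str) -> str:
    """Wrap a span of text in a voice-quality tag pair."""
    if quality not in VOICE_QUALITY_TAGS:
        raise ValueError(f"unknown voice quality: {quality}")
    return f"<{quality}>{text}</{quality}>"

# Creak on the final words can signal "I'm done talking";
# a smiling quality on a greeting adds audible warmth.
utterance = "That should work, " + tag_span("I think so anyway", "creaky")
greeting = tag_span("Nice to see you again!", "smiling")
print(utterance)
print(greeting)
```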
When we speak we use our whole body; gestures and lip movements are tightly synchronised with our voice. Until recently, robots had to rely on pipelining: audio was generated first, then a separate system guessed appropriate gestures after the fact, producing clumsy, late movements. The Connected project created ISG, an Integrated Speech and Gesture system: a single deep-learning architecture that generates both simultaneously from text, with 3.5 times fewer parameters than the pipelined approach. The team also developed control over articulatory effort, letting a robot overarticulate a joke's set-up and then mumble the punchline, a step toward embodied agents that feel genuinely present.
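As a schematic of the integrated idea, the sketch below shares one text encoder between a speech head and a gesture head, so both outputs are generated from the same representation. The layer choices, sizes, and names are illustrative assumptions; this is not the ISG architecture itself.

```python
# Schematic sketch of a joint speech-and-gesture model: one text
# encoder feeds two output heads, so acoustic frames and motion
# are generated together from a shared representation.
import torch
import torch.nn as nn

class JointSpeechGesture(nn.Module):
    def __init__(self, vocab=256, d=256, n_mels=80, n_joints=45):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)   # shared text encoder
        self.speech_head = nn.Linear(d, n_mels)         # acoustic frames
        self.gesture_head = nn.Linear(d, n_joints * 3)  # joint rotations

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.speech_head(h), self.gesture_head(h)

model = JointSpeechGesture()
tokens = torch.randint(0, 256, (1, 20))    # one 20-token utterance
mel, motion = model(tokens)
print(mel.shape, motion.shape)             # (1, 20, 80) (1, 20, 135)
```

Sharing one encoder between the two heads also illustrates how a joint model can get by with far fewer parameters than two separate pipelined systems.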
Standard evaluation asks people to rate isolated sentences — measuring audio quality but revealing nothing about real conversation. The project introduced interactive tests instead: participants played the board game Pandemic or engaged in an autonomous Twenty Questions game with a voice that adapted dynamically to each turn. People strongly preferred the spontaneous, context-aware voice over polished but static alternatives. The stakes are high: a voice that sounds confident even when wrong misleads users. By calibrating delivery to actual certainty — slowing down, inserting an "um" when unsure — AI becomes more transparent, trustworthy, and ready for roles in healthcare and education.
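A minimal sketch of that calibration idea is shown below: the system's actual confidence is mapped to delivery controls such as speaking rate, pauses, and filler insertion. The threshold values and the control dictionary are assumptions for illustration, not the project's implementation.

```python
# Sketch: calibrate delivery to the system's actual confidence.
# Thresholds and the returned control dictionary are illustrative
# assumptions, not the project's implementation.

def delivery_controls(confidence: float) -> dict:
    """Map confidence in [0, 1] to prosody and filler controls."""
    if confidence < 0.4:
        # Unsure: slow down and hedge audibly with a filler.
        return {"rate": 0.85, "prefix": "um, ", "pause_scale": 1.3}
    if confidence < 0.7:
        return {"rate": 0.95, "prefix": "", "pause_scale": 1.1}
    # Confident: normal fluent delivery.
    return {"rate": 1.0, "prefix": "", "pause_scale": 1.0}

controls = delivery_controls(0.35)
text = controls["prefix"] + "it's the third door on the left"
print(controls, text)
```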
The project investigated how to build synthesisers that speak continuously, adaptively, and in ways that can be meaningfully evaluated in conversational contexts.
How can a TTS system capture the physiological rhythms of real speech: breaths, disfluencies, and the planning cues that speakers embed in their voice? The project trained models on connected breath-group bigrams rather than isolated sentences, enabling contextually appropriate inhalations and naturally inserted fillers.
How can a synthesiser adjust its voice quality and style to the social situation? This goal drove research into paralinguistic cues — smiling voice, creaky phonation, and filled pauses — and their role in conveying confidence, personality, and turn-taking intent. It also led to joint speech and gesture models for embodied conversational agents.
Standard listening tests measure isolated utterances, not conversational speech. The project developed novel evaluation paradigms in which synthesis is judged in the conversational context it is targeted for: how it signals turn-taking, how well it can deliver jokes, and how it performs in interactive evaluations inside fully autonomous spoken dialogue systems. All of these methods aim to capture aspects of the user experience that isolated sentences in online tests cannot.
Five years of research produced a body of work that advanced spontaneous TTS from proof-of-concept to fully interactive dialogue system integration.
A novel training paradigm using automated breath and filler detection enables the synthesiser to insert salient inhalations and disfluencies in contextually appropriate positions, producing speech that flows like natural connected discourse rather than concatenated sentences.
Creaky voice and smiling voice were successfully modelled and perceptually validated. Synthesised creaky voice was shown to significantly influence listeners' perception of speaker certainty and to function as a turn-yielding cue — a finding with direct implications for dialogue systems.
Going beyond generic style transfer, the project synthesised speech tailored to specific pragmatic functions — small talk, advice-giving, instructional speech, and self-directed speech — and demonstrated their integration into embodied conversational agents and social robots.
Self-supervised learning (SSL) speech representations were shown to outperform mel-spectrograms for spontaneous TTS. Prosody-controllable neural HMMs enabled fine-grained manipulation of pitch, rate, and filler density within a single coherent synthesis framework.
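To illustrate what using SSL representations as intermediate features means in practice, here is a short sketch that extracts wav2vec 2.0 hidden states with the HuggingFace transformers library. The checkpoint choice and the surrounding training setup are illustrative, not the project's exact pipeline.

```python
# Sketch: extract wav2vec 2.0 hidden states to serve as the acoustic
# target instead of mel-spectrogram frames.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000)                 # 1 s of 16 kHz audio (dummy)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    ssl_features = model(**inputs).last_hidden_state   # (1, frames, 768)

# A TTS acoustic model would be trained to predict ssl_features from
# text, with a separate vocoder mapping them back to a waveform.
print(ssl_features.shape)
```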
A novel joint deep-learning architecture simultaneously generates speech and co-speech gestures, enabling more coherent and natural embodied agents. Methods were also developed to control articulatory effort and synchronise lip animations with synthesised audio.
TTS was evaluated in a simulated board game and a Twenty Questions spoken dialogue system. Providing dialogue context revealed that speech style is as consequential as propositional content, a finding that challenges the dominance of context-free MOS tests in the TTS community.
Specific results that demonstrate the breadth of the Connected project's impact on the field.
Filler words combined with variations in pitch and speech rate were shown to convey distinct levels of confidence and personality traits — demonstrating that how a system speaks matters as much as what it says.
Synthesised creaky phonation was perceived as a turn-yielding cue by listeners, demonstrating that non-modal voice qualities can be instrumentalised to manage conversational floor changes in dialogue systems.
In a study with a humanoid robot delivering self-defeating jokes, spontaneous speech synthesis — with controlled articulatory effort and facial animation — was preferred over standard TTS, showing the value of embodied spontaneous synthesis.
Self-supervised speech representations (e.g., from wav2vec 2.0) outperformed traditional mel-spectrogram features as intermediate representations for spontaneous text-to-speech synthesis.
In a board-game evaluation, users given dialogue context rated a spontaneous TTS on par with a leading commercial system, a strength that was invisible in isolated listening and emerged only in context. Context-free MOS tests would have missed this entirely.
In a fully autonomous Twenty Questions guessing game, a TTS model that dynamically adapted its prosody to game progression, becoming faster and higher-pitched on correct guesses and more subdued on wrong ones, was strongly preferred over static alternatives. Users spontaneously described it as “human-like”, “interactive”, and “less robotic”, demonstrating that prosodic context-adaptation is a key driver of user preference in dialogue systems.
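A rough sketch of how such event-driven adaptation can be wired up is shown below; the event names, control parameters, and numeric offsets are hypothetical, chosen only to mirror the described behaviour.

```python
# Sketch: adapt prosody controls to dialogue events in a guessing
# game. Event names and offsets are hypothetical; they illustrate
# the adaptation policy, not the project's model.

BASE = {"pitch_shift": 0.0, "rate": 1.0}   # semitones, rate multiplier

EVENT_ADJUSTMENTS = {
    "correct_guess": {"pitch_shift": 2.0, "rate": 1.10},   # lively
    "wrong_guess":   {"pitch_shift": -1.0, "rate": 0.92},  # subdued
    "neutral_turn":  {"pitch_shift": 0.0, "rate": 1.00},
}

def prosody_for(event: str) -> dict:
    """Return prosody controls for a game event, defaulting to BASE."""
    adjustments = EVENT_ADJUSTMENTS.get(event, {})
    return {key: adjustments.get(key, value) for key, value in BASE.items()}

print(prosody_for("correct_guess"))   # {'pitch_shift': 2.0, 'rate': 1.1}
```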
20+ peer-reviewed papers at top venues in speech synthesis, spoken language processing, and human-robot interaction. Demo pages and PDFs are linked where available.
All major publications from the Connected project include interactive audio demonstration pages where you can listen to synthesised speech examples and compare systems side by side. Visit the KTH Speech Synthesis Demo Hub to explore the full collection.
Browse all demos