VR Project · 2020–2025

KTH Department of Speech, Music and Hearing

Connected
Speech Synthesis

Context-aware speech synthesis for conversational AI — moving beyond generic read-speech delivery toward spontaneous, adaptive voices that breathe, hesitate, and engage like humans.

20+
Publications
5
Years of Research
10+
Demo Systems
Connected – context-aware speech synthesis
Spontaneous TTS
Breath-group bigrams
Creaky voice control
Filler insertion
Gesture synthesis
Novel evaluation
About the Project

Why conversational speech synthesis?

Traditional text-to-speech systems deliver speech the way a newscaster reads from a script — clear, neutral, and utterly disconnected from the flow of real conversation.

Connected set out to change that. Funded by the Swedish Research Council (VR) from 2020 to 2025, the project asked: what does it take to build a speech synthesiser that does not just read text, but speaks — breathing naturally between phrases, inserting fillers to convey hesitation, and adapting its voice quality to the social and pragmatic context of a conversation?

The name captures the core ambition: a synthesiser that is connected — to its own previous output, to its conversational partner, and to the communicative situation it finds itself in. Rather than treating each sentence as an isolated unit, the project trained models on continuous breath-group sequences drawn from spontaneous speech corpora, capturing the natural flow of connected discourse.

The research spanned computational modelling of spontaneous speech phenomena, fine-grained control of paralinguistic cues, joint speech and gesture synthesis for embodied agents, and the development of novel evaluation paradigms that go beyond the traditional Mean Opinion Score test.

Project Details

Funder: Swedish Research Council (VR)
Grant number: 2019-05003
Duration: 2020–2025
Host: KTH Royal Institute of Technology
Department: Speech, Music and Hearing (TMH)
Popular Science

Teaching machines to speak like humans

From scripted reading to spontaneous speech

Have you ever tried making small talk with a smart speaker? Today's voice assistants sound natural reading the news, but the illusion shatters in real conversation — they speak from a script, never hesitating or breathing naturally. Traditional synthesis is trained on perfectly read, isolated sentences. The Connected project trained AI on spontaneous speech divided by natural inhalations — breath-group bigrams — producing a voice that breathes and pauses the way humans do. Where a filler like "uh" falls in a sentence, combined with pitch and speaking rate, directly shapes how confident or uncertain the speaker sounds. Machines can now be tuned to express exactly the right degree of certainty for any situation.

Creaky voice, smiling voice, and knowing when to stop

Smooth conversation depends on knowing whose turn it is to speak. The project modelled creaky voice — the subtle low rattle heard at the end of utterances — and smiling voice, the audible warmth that enters speech when we grin. Creaky voice turned out to be a powerful turn-yielding cue: when added to the end of an utterance, listeners instinctively understood the speaker was done, dramatically reducing unwanted interruptions. A synthesised smile in the voice similarly changes how the spoken content is perceived, opening the door to more empathetic systems. Together, fillers, pitch, speech rate, and voice quality form a rich expressive palette — the very same words can sound confident or hesitant, warm or detached, depending entirely on how they are delivered.

More than words: speech and body speaking together

When we speak we use our whole body — gestures and lip movements are tightly synchronised with our voice. Until recently, robots relied on a pipeline: the audio is generated first, and a separate system then guesses appropriate gestures after the fact, producing clumsy, poorly timed movements. The Connected project created ISG — an Integrated Speech and Gesture system — a single deep-learning architecture that generates both simultaneously from text, using roughly 3.5 times fewer parameters than the pipelined alternative. The team also developed control over articulatory effort, letting a robot over-articulate the set-up of a joke and then mumble the punchline — a step toward embodied agents that feel genuinely present.

Testing voices in the real world — and why it matters

Standard evaluation asks people to rate isolated sentences — measuring audio quality but revealing nothing about real conversation. The project introduced interactive tests instead: participants played the board game Pandemic or engaged in an autonomous Twenty Questions game with a voice that adapted dynamically to each turn. People strongly preferred the spontaneous, context-aware voice over polished but static alternatives. The stakes are high: a voice that sounds confident even when wrong misleads users. By calibrating delivery to actual certainty — slowing down, inserting an "um" when unsure — AI becomes more transparent, trustworthy, and ready for roles in healthcare and education.

Research Goals

Three central questions

The project investigated how to build synthesisers that speak continuously, adaptively, and in ways that can be meaningfully evaluated in conversational contexts.

Modelling continuous spontaneous speech

How can a TTS system capture the physiological rhythms of real speech — breaths, disfluencies, and the planning cues that speakers embed in their voice? The project trained models on connected breath-group bigrams rather than isolated sentences, enabling contextually appropriate inhalation breaths and naturally inserted fillers.
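
As a rough illustration of the idea, the sketch below pairs breath-delimited stretches of a transcript into overlapping bigram training units. The "[breath]" token, the transcript format, and the function names are illustrative assumptions rather than the project's actual data pipeline.

```python
# Illustrative sketch: pairing breath-delimited utterances into overlapping
# bigram training units. Assumes transcripts in which inhalations are
# already marked with a "[breath]" token (format and names are hypothetical).

def breath_groups(transcript: str, breath_token: str = "[breath]"):
    """Split a transcript into breath groups at marked inhalations."""
    groups = [g.strip() for g in transcript.split(breath_token)]
    return [g for g in groups if g]

def breath_group_bigrams(transcript: str, breath_token: str = "[breath]"):
    """Pair consecutive breath groups so every training unit keeps the
    preceding group as context, with the inhalation preserved in between."""
    groups = breath_groups(transcript, breath_token)
    return [f"{prev} {breath_token} {curr}"
            for prev, curr in zip(groups, groups[1:])]

example = "so yeah I think [breath] uh we could try that [breath] maybe tomorrow"
for unit in breath_group_bigrams(example):
    print(unit)
# so yeah I think [breath] uh we could try that
# uh we could try that [breath] maybe tomorrow
```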

Adapting to pragmatic context and partners

How can a synthesiser adjust its voice quality and style to the social situation? This goal drove research into paralinguistic cues — smiling voice, creaky phonation, and filled pauses — and their role in conveying confidence, personality, and turn-taking intent. It also led to joint speech and gesture models for embodied conversational agents.

Evaluating conversational TTS beyond MOS

Standard listening tests measure isolated utterances, not conversational speech. The project developed novel evaluation paradigms in which synthesis is judged in the conversational context it is intended for: how well it signals turn-taking, whether it can deliver jokes, and how it performs in interactive evaluations inside fully autonomous spoken dialogue systems. All of these methods aim to capture aspects of the user experience that isolated sentences in online tests cannot.

Key Results

Scientific contributions

Five years of research produced a body of work that advanced spontaneous TTS from proof of concept to full integration in interactive dialogue systems.

01

Breath-group bigram synthesis

A novel training paradigm using automated breath and filler detection enables the synthesiser to insert salient inhalation breaths and disfluencies in contextually appropriate positions — producing speech that flows like natural connected discourse rather than concatenated sentences.
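
To make the detection step concrete, the sketch below flags candidate breath regions in a recording as stretches whose energy sits between silence and speech for a plausible breath duration. This energy-band heuristic and its thresholds are invented stand-ins; the project's actual automated breath and filler detectors are not reproduced here.

```python
# Crude, illustrative breath-candidate detector: flags stretches whose
# RMS energy sits between silence and speech levels for 0.2-1.0 s.
# Thresholds, durations, and the overall approach are assumptions.
import numpy as np
import librosa

def breath_candidates(path, sr=16000, hop=512,
                      low_db=-45.0, high_db=-25.0,
                      min_dur=0.2, max_dur=1.0):
    y, sr = librosa.load(path, sr=sr)
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    db = librosa.amplitude_to_db(rms, ref=np.max(rms))
    in_band = (db > low_db) & (db < high_db)

    spans, start = [], None
    for i, flag in enumerate(np.append(in_band, False)):
        if flag and start is None:
            start = i                      # entering a candidate region
        elif not flag and start is not None:
            t0 = librosa.frames_to_time(start, sr=sr, hop_length=hop)
            t1 = librosa.frames_to_time(i, sr=sr, hop_length=hop)
            if min_dur <= t1 - t0 <= max_dur:
                spans.append((float(t0), float(t1)))
            start = None
    return spans                           # list of (start_s, end_s) pairs
```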

02

Controllable voice quality

Creaky voice and smiling voice were successfully modelled and perceptually validated. Synthesised creaky voice was shown to significantly influence listeners' perception of speaker certainty and to function as a turn-yielding cue — a finding with direct implications for dialogue systems.

03

Pragmatic speech synthesis

Going beyond generic style transfer, the project synthesised speech tailored to specific pragmatic functions — small talk, advice-giving, instructional speech, and self-directed speech — and demonstrated their integration into embodied conversational agents and social robots.

04

SSL representations & prosody control

Self-supervised learning (SSL) speech representations were shown to outperform mel-spectrograms for spontaneous TTS. Prosody-controllable neural HMMs enabled fine-grained manipulation of pitch, rate, and filler density within a single coherent synthesis framework.
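
As a small example of what replacing mel-spectrograms with SSL features can look like in practice, the snippet below extracts wav2vec 2.0 hidden states with the HuggingFace transformers library. The checkpoint, layer choice, and file name are illustrative and not necessarily the project's exact setup.

```python
# Minimal sketch: wav2vec 2.0 hidden states as an intermediate acoustic
# representation instead of mel-spectrograms. Checkpoint and layer choice
# are illustrative assumptions.
import torch
import librosa
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

wav, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical file
inputs = extractor(wav, sampling_rate=sr, return_tensors="pt")

with torch.no_grad():
    # shape (1, n_frames, 768): one vector per ~20 ms frame of audio
    features = model(**inputs).last_hidden_state

print(features.shape)
```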

05

Integrated speech and gesture

A novel joint deep-learning architecture simultaneously generates speech and co-speech gestures, enabling more coherent and natural embodied agents. Methods were also developed to control articulatory effort and synchronise lip animations with synthesised audio.
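
To make the idea of joint generation concrete, here is a toy sketch of a single shared encoder feeding two output heads — one for acoustic frames, one for motion frames — so that speech and gesture are predicted from the same hidden state. Dimensions and layer choices are invented for illustration and do not reproduce the ISG architecture itself.

```python
# Toy sketch of joint speech-and-gesture prediction from one shared
# encoding. All sizes and layers are invented; this is not the ISG model.
import torch
import torch.nn as nn

class JointSpeechGestureSketch(nn.Module):
    def __init__(self, vocab_size=256, d_model=256, n_mels=80, n_joints=45):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.speech_head = nn.Linear(d_model, n_mels)     # acoustic frames
        self.gesture_head = nn.Linear(d_model, n_joints)  # joint rotations

    def forward(self, token_ids):
        hidden, _ = self.encoder(self.embed(token_ids))
        return self.speech_head(hidden), self.gesture_head(hidden)

tokens = torch.randint(0, 256, (1, 32))            # dummy phoneme IDs
speech, gesture = JointSpeechGestureSketch()(tokens)
print(speech.shape, gesture.shape)                 # (1, 32, 80) (1, 32, 45)
```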

06

Task-embedded TTS evaluation

TTS was evaluated in a simulated board game and a 20 Questions spoken dialogue system. Providing dialogue context revealed that speech style is as consequential as propositional content — a finding that challenges the dominance of context-free MOS tests in the TTS community.

Research Highlights

Selected findings

Specific results that demonstrate the breadth of the Connected project's impact on the field.

Fillers shape personality perception

Filler words combined with variations in pitch and speech rate were shown to convey distinct levels of confidence and personality traits — demonstrating that how a system speaks matters as much as what it says.

Creaky voice signals turn-yielding

Synthesised creaky phonation was perceived by listeners as a turn-yielding cue, demonstrating that non-modal voice qualities can be used deliberately to manage conversational floor changes in dialogue systems.

Robot speech style matters

In a study with a humanoid robot delivering self-defeating jokes, spontaneous speech synthesis — with controlled articulatory effort and facial animation — was preferred over standard TTS, showing the value of embodied spontaneous synthesis.

SSL beats mel for spontaneous TTS

Self-supervised speech representations (e.g., from wav2vec 2.0) outperformed traditional mel-spectrogram features as intermediate representations for spontaneous text-to-speech synthesis.

Context reveals preference

In a board-game evaluation, users who were given the dialogue context rated a spontaneous TTS on par with a leading commercial system — a result that isolated listening hides but that emerges clearly in context, and one that conventional MOS tests would have missed entirely.

Prosody control enables context-aware dialogue

In a fully autonomous “20 Questions” guessing game, a TTS model that dynamically adapted its prosody to game progression — becoming faster and higher-pitched on correct guesses, more subdued on wrong ones — was strongly preferred over static alternatives. Users spontaneously described it as “human-like”, “interactive”, and “less robotic”, demonstrating that prosodic context-adaptation is a key driver of user preference in dialogue systems.
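
As a simplified illustration of this kind of context adaptation, the sketch below maps dialogue events to prosody controls that a prosody-controllable synthesiser could accept. The event names, value ranges, and the synthesize() call are hypothetical.

```python
# Rule-based sketch: map game events to prosody controls for a
# prosody-controllable TTS. Names, ranges, and the API are hypothetical.

PROSODY_RULES = {
    "correct_guess": {"rate": 1.15, "pitch_shift": 2.0},   # faster, higher
    "wrong_guess":   {"rate": 0.90, "pitch_shift": -1.5},  # slower, subdued
    "neutral":       {"rate": 1.00, "pitch_shift": 0.0},
}

def prosody_for(event: str) -> dict:
    """Return a rate multiplier and pitch shift (semitones) for a game event."""
    return PROSODY_RULES.get(event, PROSODY_RULES["neutral"])

# Usage with a hypothetical controllable synthesiser:
# audio = tts.synthesize("Is it an animal?", **prosody_for("correct_guess"))
```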

Publications

Research output

20+ peer-reviewed papers at top venues in speech synthesis, spoken language processing, and human-robot interaction. Demo pages and PDFs are linked where available.

2025
Interspeech
VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech
Lameris, H., Gustafson, J., & Székely, É.
Voice Conversion Voice Quality PDF GitHub
2024
Interspeech
Contextual Interactive Evaluation of TTS Models in Dialogue Systems
Wang, S., Székely, É., & Gustafson, J.
Evaluation Dialogue Systems PDF Demo
LREC-COLING
The Role of Creaky Voice in Turn Taking and the Perception of Speaker Stance: Experiments Using Controllable TTS
Lameris, H., Székely, É., & Gustafson, J.
Creaky Voice Turn-Taking PDF Demo
LREC-COLING
Revisiting Three Text-to-Speech Synthesis Experiments with a Web-Based Audience Response System
Tånnander, C., Edlund, J., & Gustafson, J.
Evaluation ARS PDF
2023
Interspeech
Beyond Style: Synthesizing Speech with Pragmatic Functions
Lameris, H., Gustafson, J., & Székely, É.
Pragmatics Style Transfer PDF Demo
Interspeech
Automatic Evaluation of Turn-Taking Cues in Conversational Speech Synthesis
Ekstedt, E., Wang, S., Székely, É., Gustafson, J., & Skantze, G.
Turn-Taking Evaluation PDF
ICASSP
Prosody-Controllable Spontaneous TTS with Neural HMMs
Lameris, H., Mehta, S., Henter, G. E., Gustafson, J., & Székely, É.
Neural HMMs Prosody PDF Demo
ICASSPW
A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
Wang, S., Henter, G. E., Gustafson, J., & Székely, É.
SSL Spontaneous TTS PDF Demo
ICPhS
Neural Speech Synthesis with Controllable Creaky Voice Style
Lameris, H., Włodarczak, M., Gustafson, J., & Székely, É.
Creaky Voice Voice Quality PDF
IVA
Generation of Speech and Facial Animation with Controllable Articulatory Effort for Amusing Conversational Characters
Gustafson, J., Székely, É., & Beskow, J.
Embodied Agents Facial Animation PDF Demo
RO-MAN
Hi Robot, It's Not What You Say, It's How You Say It
Miniota, J., Wang, S., Beskow, J., Gustafson, J., Székely, É., & Pereira, A.
HRI Robot Speech PDF Demo
SSW
On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis
Wang, S., Henter, G. E., Gustafson, J., & Székely, É.
SSL Representations PDF Demo
SSW
Stuck in the MOS Pit: A Critical Analysis of MOS Test Methodology in TTS Evaluation
Kirkland, A., Mehta, S., Lameris, H., Henter, G. E., Székely, É., & Gustafson, J.
Evaluation MOS Methodology PDF
2022
Interspeech
Where's the Uh, Hesitation? The Interplay between Filled Pause Location, Speech Rate and F0 in Perception of Confidence
Kirkland, A., Lameris, H., Székely, É., & Gustafson, J.
Filled Pauses Perception PDF Demo
LREC
Evaluating Sampling-Based Filler Insertion with Spontaneous TTS
Wang, S., Gustafson, J., & Székely, É.
Filler Insertion Evaluation PDF Demo
2021
ICMI
Integrated Speech and Gesture Synthesis
Wang, S., Alexanderson, S., Gustafson, J., Beskow, J., Henter, G. E., & Székely, É.
Gesture Synthesis Multimodal PDF Demo
SSW
Perception of Smiling Voice in Spontaneous Speech Synthesis
Kirkland, A., Włodarczak, M., Gustafson, J., & Székely, É.
Smiling Voice Voice Quality PDF Demo
SSW
Personality in the Mix: Investigating the Contribution of Fillers and Speaking Style to Perception of Spontaneous Speech Synthesis
Gustafson, J., Beskow, J., & Székely, É.
Personality Filled Pauses PDF Demo
2020
ICASSP
Breathing and Speech Planning in Spontaneous Speech Synthesis
Székely, É., Henter, G. E., Beskow, J., & Gustafson, J.
Breathing Speech Planning PDF Demo
LREC
Augmented Prompt Selection for Evaluation of Spontaneous Speech Synthesis
Székely, É., Edlund, J., & Gustafson, J.
Evaluation Prompt Design PDF Demo
Audio Demonstrations

Hear the research in action

All major publications from the Connected project include interactive audio demonstration pages where you can listen to synthesised speech examples and compare systems side by side. Visit the KTH Speech Synthesis Demo Hub to explore the full collection.

Browse all demos
Project Team

The researchers behind Connected