KTH Department of Speech, Music and Hearing
Context-aware speech synthesis for conversational AI — moving beyond generic read-speech delivery toward spontaneous, adaptive voices that breathe, hesitate, and engage like humans.
Traditional text-to-speech systems deliver speech the way a newscaster reads from a script — clear, neutral, and utterly disconnected from the flow of real conversation.
Connected set out to change that. Funded by the Swedish Research Council (VR) from 2020 to 2025, the project asked: what does it take to build a speech synthesiser that does not just read text, but speaks — breathing naturally between phrases, inserting fillers to convey hesitation, and adapting its voice quality to the social and pragmatic context of a conversation?
The name captures the core ambition: a synthesiser that is connected — to its own previous output, to its conversational partner, and to the communicative situation it finds itself in. Rather than treating each sentence as an isolated unit, the project trained models on continuous breath-group sequences drawn from spontaneous speech corpora, capturing the natural flow of connected discourse.
The research spanned computational modelling of spontaneous speech phenomena, fine-grained control of paralinguistic cues, joint speech and gesture synthesis for embodied agents, and the development of novel evaluation paradigms that go beyond the traditional Mean Opinion Score test.
Have you ever tried making small talk with a smart speaker? Today's voice assistants sound natural reading the news, but the illusion shatters in real conversation: they speak from a script, never hesitating or breathing naturally. Traditional synthesis is trained on perfectly read, isolated sentences. The Connected project instead trained its models on spontaneous speech segmented at natural inhalations, pairing consecutive breath groups into bigrams, and produced a voice that breathes and pauses the way humans do. Where a filler like "uh" falls in a sentence, combined with pitch and speaking rate, directly shapes how confident or uncertain a speaker sounds. Machines can now be tuned to express exactly the right degree of certainty for any situation.
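To make the breath-group idea concrete, here is a minimal sketch of how a corpus might be segmented and paired into bigrams. The BreathGroup structure, the timestamps, and the transcripts are illustrative assumptions; the project's actual breath-detection pipeline is not reproduced here.

```python
# Sketch: build breath-group "bigram" training units from a
# spontaneous-speech corpus. The BreathGroup records stand in for
# the output of an automated breath detector.
from dataclasses import dataclass

@dataclass
class BreathGroup:
    start: float   # seconds, end of the previous inhalation
    end: float     # seconds, start of the next inhalation
    text: str      # transcript of the speech between breaths

def breath_group_bigrams(groups: list[BreathGroup]):
    """Pair consecutive breath groups so the model sees each
    phrase together with the breath and phrase that follow it."""
    return [(groups[i], groups[i + 1]) for i in range(len(groups) - 1)]

# Example: three breath groups yield two overlapping training units,
# each spanning one inhalation in context.
groups = [
    BreathGroup(0.0, 2.1, "so I was thinking"),
    BreathGroup(2.5, 5.0, "uh maybe we could try it"),
    BreathGroup(5.4, 7.2, "and see what happens"),
]
for a, b in breath_group_bigrams(groups):
    print(f"[{a.text}] <inhale> [{b.text}]")
```

Training on such overlapping pairs means the model sees every inhalation in its surrounding context, rather than only at utterance boundaries.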
Smooth conversation depends on knowing whose turn it is to speak. The project modelled creaky voice (the subtle low rattle heard at the end of utterances) and smiling voice, the audible warmth that enters speech when we grin. Creaky voice turned out to be a powerful turn-yielding cue: when added to the end of an utterance, listeners instinctively understood the speaker was done, reducing the risk of unwanted interruptions. A synthesised smile in the voice similarly changes how the spoken content is perceived, opening the door to more empathetic systems. Together, fillers, pitch, speech rate, and voice quality form a rich expressive palette: the very same words can sound confident or hesitant, warm or detached, depending entirely on how they are delivered.
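One common way to give a synthesiser access to such cues is to mark them up in the input text; the sketch below illustrates that control-tag idea. The tag inventory and markup format are hypothetical assumptions, not the project's actual interface.

```python
# Sketch: exposing voice quality to a synthesiser as control tags
# on the input text. Tag names and markup format are hypothetical;
# they illustrate the control interface, not the project's system.

VOICE_QUALITY_TAGS = {"creaky", "smiling", "modal"}

def tag_span(text: str, quality: str) -> str:
    """Wrap a span of text in a voice-quality tag pair."""
    if quality not in VOICE_QUALITY_TAGS:
        raise ValueError(f"unknown voice quality: {quality}")
    return f"<{quality}>{text}</{quality}>"

# Creak on the final words can signal "I'm done talking";
# a smiling quality on a greeting adds audible warmth.
utterance = "That should work, " + tag_span("I think so anyway", "creaky")
greeting = tag_span("Nice to see you again!", "smiling")
print(utterance)
print(greeting)
```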
When we speak we use our whole body; gestures and lip movements are tightly synchronised with our voice. Until recently, robots had to rely on pipelining: audio was generated first, then a separate system guessed appropriate gestures after the fact, producing clumsy, late movements. The Connected project created ISG, an Integrated Speech and Gesture system: a single deep-learning architecture that generates both simultaneously from text, with 3.5 times fewer parameters than the pipelined approach. The team also developed control over articulatory effort, letting a robot overarticulate a joke's set-up and then mumble the punchline, a step toward embodied agents that feel genuinely present.
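As a schematic of the integrated idea, the sketch below shares one text encoder between a speech head and a gesture head, so both outputs are generated from the same representation. The layer choices, sizes, and names are illustrative assumptions; this is not the ISG architecture itself.

```python
# Schematic sketch of a joint speech-and-gesture model: one text
# encoder feeds two output heads, so acoustic frames and motion
# are generated together from a shared representation.
import torch
import torch.nn as nn

class JointSpeechGesture(nn.Module):
    def __init__(self, vocab=256, d=256, n_mels=80, n_joints=45):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)   # shared text encoder
        self.speech_head = nn.Linear(d, n_mels)         # acoustic frames
        self.gesture_head = nn.Linear(d, n_joints * 3)  # joint rotations

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.speech_head(h), self.gesture_head(h)

model = JointSpeechGesture()
tokens = torch.randint(0, 256, (1, 20))    # one 20-token utterance
mel, motion = model(tokens)
print(mel.shape, motion.shape)             # (1, 20, 80) (1, 20, 135)
```

Sharing one encoder between the two heads also illustrates how a joint model can get by with far fewer parameters than two separate pipelined systems.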
Standard evaluation asks people to rate isolated sentences — measuring audio quality but revealing nothing about real conversation. The project introduced interactive tests instead: participants played the board game Pandemic or engaged in an autonomous Twenty Questions game with a voice that adapted dynamically to each turn. People strongly preferred the spontaneous, context-aware voice over polished but static alternatives. The stakes are high: a voice that sounds confident even when wrong misleads users. By calibrating delivery to actual certainty — slowing down, inserting an "um" when unsure — AI becomes more transparent, trustworthy, and ready for roles in healthcare and education.
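A minimal sketch of that calibration idea is shown below: the system's actual confidence is mapped to delivery controls such as speaking rate, pauses, and filler insertion. The threshold values and the control dictionary are assumptions for illustration, not the project's implementation.

```python
# Sketch: calibrate delivery to the system's actual confidence.
# Thresholds and the returned control dictionary are illustrative
# assumptions, not the project's implementation.

def delivery_controls(confidence: float) -> dict:
    """Map confidence in [0, 1] to prosody and filler controls."""
    if confidence < 0.4:
        # Unsure: slow down and hedge audibly with a filler.
        return {"rate": 0.85, "prefix": "um, ", "pause_scale": 1.3}
    if confidence < 0.7:
        return {"rate": 0.95, "prefix": "", "pause_scale": 1.1}
    # Confident: normal fluent delivery.
    return {"rate": 1.0, "prefix": "", "pause_scale": 1.0}

controls = delivery_controls(0.35)
text = controls["prefix"] + "it's the third door on the left"
print(controls, text)
```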
The project investigated how to build synthesisers that speak continuously, adaptively, and in ways that can be meaningfully evaluated in conversational contexts.
How can a TTS system capture the physiological rhythms of real speech: breaths, disfluencies, and the planning cues that speakers embed in their voice? The project trained models on connected breath-group bigrams rather than isolated sentences, enabling contextually appropriate inhalations and naturally inserted fillers.
How can a synthesiser adjust its voice quality and style to the social situation? This goal drove research into paralinguistic cues — smiling voice, creaky phonation, and filled pauses — and their role in conveying confidence, personality, and turn-taking intent. It also led to joint speech and gesture models for embodied conversational agents.
Standard listening tests measure isolated utterances, not conversational speech. The project developed novel evaluation paradigms in which synthesis is judged in the conversational context it is targeted for: how it signals turn-taking, how well it can deliver jokes, and how it performs in interactive evaluations inside fully autonomous spoken dialogue systems. All of these methods aim to capture aspects of the user experience that isolated sentences in online tests cannot.
Five years of research produced a body of work that advanced spontaneous TTS from proof-of-concept to fully interactive dialogue system integration.
A novel training paradigm using automated breath and filler detection enables the synthesiser to insert salient inhalations and disfluencies in contextually appropriate positions, producing speech that flows like natural connected discourse rather than concatenated sentences.
Creaky voice and smiling voice were successfully modelled and perceptually validated. Synthesised creaky voice was shown to significantly influence listeners' perception of speaker certainty and to function as a turn-yielding cue — a finding with direct implications for dialogue systems.
Going beyond generic style transfer, the project synthesised speech tailored to specific pragmatic functions — small talk, advice-giving, instructional speech, and self-directed speech — and demonstrated their integration into embodied conversational agents and social robots.
Self-supervised learning (SSL) speech representations were shown to outperform mel-spectrograms for spontaneous TTS. Prosody-controllable neural HMMs enabled fine-grained manipulation of pitch, rate, and filler density within a single coherent synthesis framework.
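To illustrate what using SSL representations as intermediate features means in practice, here is a short sketch that extracts wav2vec 2.0 hidden states with the HuggingFace transformers library. The checkpoint choice and the surrounding training setup are illustrative, not the project's exact pipeline.

```python
# Sketch: extract wav2vec 2.0 hidden states to serve as the acoustic
# target instead of mel-spectrogram frames.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

waveform = torch.randn(16000)                 # 1 s of 16 kHz audio (dummy)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    ssl_features = model(**inputs).last_hidden_state   # (1, frames, 768)

# A TTS acoustic model would be trained to predict ssl_features from
# text, with a separate vocoder mapping them back to a waveform.
print(ssl_features.shape)
```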
A novel joint deep-learning architecture simultaneously generates speech and co-speech gestures, enabling more coherent and natural embodied agents. Methods were also developed to control articulatory effort and synchronise lip animations with synthesised audio.
TTS was evaluated in a simulated board game and a Twenty Questions spoken dialogue system. Providing dialogue context revealed that speech style is as consequential as propositional content, a finding that challenges the dominance of context-free MOS tests in the TTS community.
Specific results that demonstrate the breadth of the Connected project's impact on the field.
Filler words combined with variations in pitch and speech rate were shown to convey distinct levels of confidence and personality traits — demonstrating that how a system speaks matters as much as what it says.
Synthesised creaky phonation was perceived as a turn-yielding cue by listeners, demonstrating that non-modal voice qualities can be instrumentalised to manage conversational floor changes in dialogue systems.
In a study with a humanoid robot delivering self-defeating jokes, spontaneous speech synthesis — with controlled articulatory effort and facial animation — was preferred over standard TTS, showing the value of embodied spontaneous synthesis.
Self-supervised speech representations (e.g., from wav2vec 2.0) outperformed traditional mel-spectrogram features as intermediate representations for spontaneous text-to-speech synthesis.
In a board-game evaluation, users given dialogue context rated a spontaneous TTS on par with a leading commercial system, a strength that was invisible in isolated listening and emerged only in context. Context-free MOS tests would have missed this entirely.
In a fully autonomous Twenty Questions guessing game, a TTS model that dynamically adapted its prosody to game progression, becoming faster and higher-pitched on correct guesses and more subdued on wrong ones, was strongly preferred over static alternatives. Users spontaneously described it as “human-like”, “interactive”, and “less robotic”, demonstrating that prosodic context-adaptation is a key driver of user preference in dialogue systems.
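A rough sketch of how such event-driven adaptation can be wired up is shown below; the event names, control parameters, and numeric offsets are hypothetical, chosen only to mirror the described behaviour.

```python
# Sketch: adapt prosody controls to dialogue events in a guessing
# game. Event names and offsets are hypothetical; they illustrate
# the adaptation policy, not the project's model.

BASE = {"pitch_shift": 0.0, "rate": 1.0}   # semitones, rate multiplier

EVENT_ADJUSTMENTS = {
    "correct_guess": {"pitch_shift": 2.0, "rate": 1.10},   # lively
    "wrong_guess":   {"pitch_shift": -1.0, "rate": 0.92},  # subdued
    "neutral_turn":  {"pitch_shift": 0.0, "rate": 1.00},
}

def prosody_for(event: str) -> dict:
    """Return prosody controls for a game event, defaulting to BASE."""
    adjustments = EVENT_ADJUSTMENTS.get(event, {})
    return {key: adjustments.get(key, value) for key, value in BASE.items()}

print(prosody_for("correct_guess"))   # {'pitch_shift': 2.0, 'rate': 1.1}
```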
20+ peer-reviewed papers at top venues in speech synthesis, spoken language processing, and human-robot interaction. Demo pages and PDFs are linked where available.
All major publications from the Connected project include interactive audio demonstration pages where you can listen to synthesised speech examples and compare systems side by side. Visit the KTH Speech Synthesis Demo Hub to explore the full collection.
Browse all demos