KTH Department of Speech, Music and Hearing
Not just what you say — but how you say it. Building expressive, controllable voices for people who depend on them.
Millions of people worldwide rely on Speech Generating Devices (SGDs) for communication. Current technology gives them intelligible speech — but not expressive speech. A terse synthesized message can sound rude when the user meant it neutrally. A joke lands flat. Sarcasm disappears. Uncertainty, warmth, emphasis: all lost.
Our research group at KTH is working to change this — giving SGD users not just a voice, but their voice: one they can shape, control, and rely on to convey what they actually mean. Two complementary projects address this challenge. RAPPORT focuses on the full conversational system — the speed, timing, and context awareness that make real-time dialogue possible. Personalized Voices focuses on the expressive engine underneath — the voice itself, its prosody, its affect, and the controls that let a user shape it.
Real-time context-aware speech prosthesis for conversational interaction. SGD users communicate at 5–15 words per minute; natural conversation runs at 140–200. RAPPORT narrows this gap with three components: large language models that expand brief user input into full utterances, gaze and facial-expression detection for real-time prosody control, and turn-taking models that let users start composing a response before their conversation partner has finished speaking.
Developing the expressive TTS engine at the heart of next-generation SGDs. Personalized Voices builds a speech synthesis architecture with independently controllable dimensions: voice identity (who you sound like), prosodic realization (how you deliver a message), and communicative intent (what emotion or attitude you convey). The research focuses on the design of control mechanisms and evaluation methods that hold regardless of which underlying TTS technology is used.
Our research is organized around four themes that together define what a genuinely useful personalized voice system must achieve.
A core contribution of the project is a richly annotated conversational speech corpus in both Swedish and English, recorded with voice actors covering a broad range of speaking styles, affective states, speech acts, disfluencies, and turn-taking cues. Two-dimensional voice maps are created for each speaker to identify phonation regions and guide modelling of voice quality dimensions such as creakiness and breathiness. The corpus and voice profiles will be released as open-source resources for the research community.
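The voice maps can be pictured as phonetogram-style histograms. Below is a minimal sketch, assuming a map over fundamental frequency and frame level computed with librosa; the project's actual voice-map method, axes, and metrics may differ, and the function name here is illustrative only.

```python
# Sketch of a two-dimensional voice map (phonetogram-style), assuming
# axes of fundamental frequency (F0) and frame level. The project's
# actual voice-map construction is not specified here.
import numpy as np
import librosa

def voice_map(wav_path, fmin=65.0, fmax=600.0, hop=512):
    """Bin voiced frames of a recording into an F0-by-level histogram."""
    y, sr = librosa.load(wav_path, sr=None)
    # Frame-wise F0 with pYIN; unvoiced frames come back as NaN.
    f0, voiced, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr,
                                 hop_length=hop)
    # Frame-wise RMS level in dB (relative level, not calibrated SPL).
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-8))
    # Align frame counts and keep voiced frames only.
    n = min(len(f0), len(level_db))
    mask = voiced[:n] & ~np.isnan(f0[:n])
    # 2D histogram: log-spaced F0 bins by linear level bins.
    f0_bins = np.geomspace(fmin, fmax, 40)
    lvl_bins = np.linspace(-60, 0, 30)
    hist, _, _ = np.histogram2d(f0[:n][mask], level_db[:n][mask],
                                bins=[f0_bins, lvl_bins])
    return hist
```

In a map like this, dense regions indicate the speaker's habitual phonation; a clinical voice map would use calibrated SPL rather than relative RMS level.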
Natural conversation requires speed and timing. RAPPORT addresses this with LLM-based text expansion (turning brief input into full, contextually appropriate utterances), style prediction from conversational context, and turn-taking models that let the system hold the floor while the user composes, so responses land at the right moment rather than after the conversation has moved on.
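As a rough illustration of the text-expansion step, here is a minimal sketch assuming an OpenAI-style chat API; the project does not specify which model, interface, or prompt design it uses, so the model name, prompt wording, and expand_utterance function are hypothetical stand-ins.

```python
# Sketch of LLM-based text expansion for an SGD: terse keyboard input
# plus recent dialogue context in, one complete spoken sentence out.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_utterance(brief_input: str, dialogue_history: list[str]) -> str:
    """Expand terse SGD input into a full, contextually apt utterance."""
    context = "\n".join(dialogue_history[-6:])  # last few turns as context
    messages = [
        {"role": "system",
         "content": ("You expand a few keywords typed by an AAC user into "
                     "one complete spoken sentence, in their first-person "
                     "voice. Preserve their intent; add no new information.")},
        {"role": "user",
         "content": f"Conversation so far:\n{context}\n\n"
                    f"Keywords: {brief_input}"},
    ]
    resp = client.chat.completions.create(model="gpt-4o-mini",
                                          messages=messages)
    return resp.choices[0].message.content.strip()

# e.g. expand_utterance("coffee no thanks later",
#                       ["A: Would you like a coffee?"])
# might return: "No thanks, maybe later."
```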
We build TTS architectures that separate three independently controllable dimensions — voice identity, prosodic realization, and communicative intent — allowing each to be set via audio reference, text description, or direct manipulation. A verifier-based feedback loop ensures generated speech matches the user's intended affect, regardless of which underlying synthesis engine is used.
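To make the verifier loop concrete, here is a minimal sketch with toy stand-ins for the synthesizer and the verifier; the project's actual engine, verifier model, and control dimensions are unspecified, so the arousal/valence controls and every function name below are illustrative.

```python
# Sketch of a verifier-based feedback loop: synthesize, measure the
# realized affect, and nudge the controls until the verifier agrees
# with the user's intended affect.
import numpy as np

def synthesize(text: str, controls: dict) -> np.ndarray:
    """Toy stand-in for any TTS engine exposing affect controls.
    Here the 'audio' just carries the control values so the loop runs."""
    return np.array([controls.get("arousal", 0.0),
                     controls.get("valence", 0.0)])

def predict_affect(audio: np.ndarray) -> dict:
    """Toy stand-in for a verifier, e.g. a speech emotion recognizer.
    Simulates an engine whose realized affect undershoots its controls."""
    return {"arousal": 0.7 * audio[0], "valence": 0.7 * audio[1]}

def synthesize_verified(text: str, target: dict,
                        max_iters: int = 10, tol: float = 0.05):
    """Adjust controls until measured affect matches the intended affect."""
    controls = dict(target)          # start from the intended affect
    for _ in range(max_iters):
        audio = synthesize(text, controls)
        measured = predict_affect(audio)
        gap = {k: target[k] - measured[k] for k in target}
        if max(abs(g) for g in gap.values()) < tol:
            return audio, controls   # verified: output matches intent
        for k, g in gap.items():     # nudge controls toward the target
            controls[k] += 0.8 * g
    return audio, controls           # best effort after max_iters

audio, controls = synthesize_verified("See you soon!",
                                      {"arousal": 0.6, "valence": 0.8})
```

The key design point is that the loop treats the synthesis engine as a black box: only the controls and the verifier's judgment matter, which is what lets the mechanism hold regardless of the underlying TTS technology.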
We evaluate beyond isolated listening tests — in real conversational tasks with actual SGD users. Both projects include structured studies with people living with ALS, cerebral palsy, and ASD, in collaboration with clinical partners. The question we ask is not "does it sound good?" but "does it help people communicate what they intend?"
These projects build on the PI's previous work on spontaneous speech synthesis with prosody control, carried out in the VR-funded project Connected: Context-Aware Speech Synthesis (2020–2025). That project developed and evaluated a conversational TTS framework with controllable speaking style, breathing, fillers, and voice quality, and resulted in 20+ publications at Interspeech, ICASSP, SSW, and other leading venues.
Early publications from the current projects. The list will grow as research progresses.
Based at the Department of Speech, Music and Hearing (TMH), KTH Royal Institute of Technology, Stockholm.
We are looking for researchers to join the team. If you are passionate about speech synthesis, voice technology, or assistive communication, we would love to hear from you.
A postdoctoral position in the Vetenskapsrådet-funded Personalized Voices project, working on speech synthesis technology that enables expressive communication for users of Speech Generating Devices (SGDs). The work spans corpus recording, neural TTS modelling, and user studies with real SGD users.