Funded by WASP & Vetenskapsrådet

KTH Department of Speech, Music and Hearing

Personalized AI Voices for Communication Aids

Not just what you say — but how you say it. Building expressive, controllable voices for people who depend on them.

2 Active Projects
4+ Publications

The Research Programme

Millions of people worldwide rely on Speech Generating Devices (SGDs) for communication. Current technology gives them intelligible speech — but not expressive speech. A terse synthesized message can sound rude when the user meant it neutrally. A joke lands flat. Sarcasm disappears. Uncertainty, warmth, emphasis: all lost.

Our research group at KTH is working to change this — giving SGD users not just a voice, but their voice: one they can shape, control, and rely on to convey what they actually mean. Two complementary projects address this challenge. RAPPORT focuses on the full conversational system — the speed, timing, and context awareness that make real-time dialogue possible. Personalized Voices focuses on the expressive engine underneath — the voice itself, its prosody, its affect, and the controls that let a user shape it.

RAPPORT

WASP-funded 2024 – 2029

Real-time context-aware speech prosthesis for conversational interaction. SGD users communicate at 5–15 words per minute; natural conversation runs at 140–200. RAPPORT closes this gap by using large language models to expand brief user input into full utterances, gaze and facial expression detection for real-time prosody control, and turn-taking models that let users start composing a response before their conversation partner has finished speaking.

Personalized Voices

Vetenskapsrådet-funded 2026 – 2029

Developing the expressive TTS engine at the heart of next-generation SGDs. Personalized Voices builds a speech synthesis architecture with independently controllable dimensions: voice identity (who you sound like), prosodic realization (how you deliver a message), and communicative intent (what emotion or attitude you convey). The research focuses on the design of control mechanisms and evaluation methods that hold regardless of which underlying TTS technology is used.

What We Are Working Towards

Our research is organized around four themes that together define what a genuinely useful personalized voice system must achieve.

Goal 1

Corpus Creation and Voice Mapping

A core contribution of the project is a richly annotated conversational speech corpus in both Swedish and English, recorded with voice actors covering a broad range of speaking styles, affective states, speech acts, disfluencies, and turn-taking cues. Two-dimensional voice maps are created for each speaker to identify phonation regions and guide modelling of voice quality dimensions such as creakiness and breathiness. The corpus and voice profiles will be released as open-source resources for the research community.

Goal 2

Real-Time Conversational Integration

Natural conversation requires speed and timing. RAPPORT addresses this with LLM-based text expansion (turning brief input into full contextually appropriate utterances), style prediction from conversational context, and turn-taking models that let the system hold the floor while the user composes — so responses land at the right moment, not after the conversation has moved on.
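The expansion step described above can be sketched as a prompt builder that combines conversation history, the user's brief input, and a predicted style. This is an illustrative sketch only: the function and field names (`Turn`, `build_expansion_prompt`) are hypothetical and do not reflect the project's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One turn of the conversation so far."""
    speaker: str
    text: str

def build_expansion_prompt(history, brief_input, style="neutral"):
    """Assemble an LLM prompt that expands terse SGD input into a
    full, contextually appropriate utterance in a given style."""
    context = "\n".join(f"{t.speaker}: {t.text}" for t in history)
    return (
        "Expand the user's brief input into one natural, contextually "
        f"appropriate utterance in a {style} style.\n"
        f"Conversation so far:\n{context}\n"
        f"User input: {brief_input}\n"
        "Expanded utterance:"
    )
```

In a running system, the resulting prompt would be sent to an LLM while the turn-taking model holds the floor, so the expanded utterance is ready when the user's turn arrives.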

Goal 3

Expressive Prosody and Affect Control

We build TTS architectures that separate three independently controllable dimensions — voice identity, prosodic realization, and communicative intent — allowing each to be set via audio reference, text description, or direct manipulation. A verifier-based feedback loop ensures generated speech matches the user's intended affect, regardless of which underlying synthesis engine is used.
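The engine-agnostic verifier loop described above can be sketched as follows. All names here (`SynthesisRequest`, `synthesize_with_verifier`, the callback signatures) are hypothetical illustrations under the stated design, not the project's code; any TTS engine and affect classifier could be plugged in.

```python
from dataclasses import dataclass

@dataclass
class SynthesisRequest:
    # The three independently controllable dimensions (illustrative names):
    voice_identity: str   # who you sound like, e.g. a speaker embedding ID
    prosody: dict         # how the message is delivered (rate, pitch range, ...)
    intent: str           # communicative intent, e.g. "warm", "ironic"

def synthesize_with_verifier(request, synthesize, verify_affect, max_retries=3):
    """Generate speech, then check the realized affect against the intent.

    `synthesize` and `verify_affect` stand in for any underlying TTS
    engine and affect classifier; the loop itself is engine-agnostic.
    """
    for _ in range(max_retries):
        audio = synthesize(request)
        if verify_affect(audio) == request.intent:
            return audio  # verified: realized affect matches intent
    return audio  # fall back to the last attempt after max_retries
```

Because the loop only depends on the two callbacks, swapping the synthesis engine leaves the control and verification logic unchanged, which is the portability property the paragraph above describes.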

Goal 4

Evaluation with Real Users

We evaluate beyond isolated listening tests — in real conversational tasks with actual SGD users. Both projects include structured studies with people living with ALS, cerebral palsy, and ASD, in collaboration with clinical partners. The question we ask is not "does it sound good?" but "does it help people communicate what they intend?"

Recent Work

These projects build upon the PI's previous work in spontaneous speech synthesis with prosody control carried out in the VR-funded project Connected: Context-Aware Speech Synthesis (2020–2025), which developed and evaluated a conversational TTS framework with controllable speaking style, breathing, fillers, and voice quality — resulting in 20+ publications across Interspeech, ICASSP, SSW, and other leading venues.

Early publications from the current projects. The list will grow as research progresses.

2025
Interpolating Speaker Identities for Synthetic Voice Generation
Juliana Francis, Joakim Gustafson, Éva Székely
13th Speech Synthesis Workshop (SSW 2025), Leeuwarden, The Netherlands
Voice Personalization · Speaker Embedding · Zero-Shot TTS
2025
From Static to Dynamic: Enhancing AAC with Generative Imagery and Zero-Shot TTS
Juliana Francis, Joakim Gustafson, Éva Székely
Interspeech 2025, Rotterdam, The Netherlands
AAC · Voice Cloning · ASD · Generative AI
2025
On the Production and Perception of a Single Speaker's Gender
Robin Netzorg, Naomi Carvalho, Andrea Guzman, Lydia Wang, Juliana Francis, Klo Vivienne Garoute, Keith Johnson, Gopala Krishna Anumanchipalli
Interspeech 2025, Rotterdam, The Netherlands
Voice Quality · Gender Perception · Acoustic Analysis
2024
ConnecTone: A Modular AAC System Prototype with Contextual Generative Text Prediction and Style-Adaptive Conversational TTS
Juliana Francis, Éva Székely, Joakim Gustafson
Interspeech 2024, Kos, Greece
AAC · LLM · Conversational TTS · Open Source

Researchers

Based at the Department of Speech, Music and Hearing (TMH), KTH Royal Institute of Technology, Stockholm.

Joakim Gustafson
Professor, Head of Department
PI — both projects
Personal Homepage
Éva Székely
Assistant Professor
co-PI — RAPPORT
KTH Profile
Jonas Beskow
Professor, Vice Head of Department
co-PI — Personalized Voices
KTH Profile
Juliana Francis
Researcher
RAPPORT
KTH Profile

Open Positions

We are looking for researchers to join the team. If you are passionate about speech synthesis, voice technology, or assistive communication, we would love to hear from you.

Postdoctoral Researcher — Personalized AI Voices for Communication Aids
Full-time · 2–4 years · Stockholm, Sweden · Personalized Voices
Apply by 25 April 2026

A postdoctoral position in the Vetenskapsrådet-funded Personalized Voices project, working on speech synthesis technology that enables expressive communication for users of Speech Generating Devices (SGDs). The work spans corpus recording, neural TTS modelling, and user studies with real SGD users.

Responsibilities
  • Plan and lead recordings of an annotated speech corpus in Swedish and English with voice actors
  • Develop neural TTS models with multi-dimensional prosody control
  • Implement verifier-based feedback systems for affect verification
  • Conduct user studies with SGD users in collaboration with clinical partners
  • Publish findings at leading venues (Interspeech, ICASSP, SSW, etc.)
Required Qualifications
  • Doctoral degree in speech technology, machine learning, computational linguistics, or related field
  • Experience with neural speech synthesis and/or prosody control in TTS
  • Strong Python and PyTorch skills
  • Fluent English communication skills
Preferred
  • Recent doctoral degree (awarded within three years of the application deadline)
  • Speech corpus collection and annotation experience
  • Experience with end-user evaluation
Apply at KTH →
Questions? Contact Joakim Gustafson  ·  Ref: PA-2026-1130