KTH Department of Speech, Music and Hearing
Not just what you say — but how you say it. Building expressive, controllable voices for people who depend on them.
Millions of people worldwide rely on Speech Generating Devices (SGDs) for communication. Current technology gives them intelligible speech — but not expressive speech. A terse synthesized message can sound rude when the user meant it neutrally. A joke lands flat. Sarcasm disappears. Uncertainty, warmth, emphasis: all lost.
Our research group at KTH is working to change this — giving SGD users not just a voice, but their voice: one they can shape, control, and rely on to convey what they actually mean. Two complementary projects address this challenge. RAPPORT focuses on the full conversational system — the speed, timing, and context awareness that make real-time dialogue possible. Personalized Voices focuses on the expressive engine underneath — the voice itself, its prosody, its affect, and the controls that let a user shape it.
Real-time context-aware speech prosthesis for conversational interaction. SGD users communicate at 5–15 words per minute; natural conversation runs at 140–200. RAPPORT narrows this gap with three components: large language models that expand brief user input into full utterances, gaze and facial-expression detection for real-time prosody control, and turn-taking models that let users start composing a response before their conversation partner has finished speaking.
Developing the expressive TTS engine at the heart of next-generation SGDs. Personalized Voices builds a speech synthesis architecture with independently controllable dimensions: voice identity (who you sound like), prosodic realization (how you deliver a message), and communicative intent (what emotion or attitude you convey). The research focuses on the design of control mechanisms and evaluation methods that hold regardless of which underlying TTS technology is used.
Our research is organized around four themes that together define what a genuinely useful personalized voice system must achieve.
A core contribution of the project is a richly annotated conversational speech corpus in both Swedish and English, recorded with voice actors covering a broad range of speaking styles, affective states, speech acts, disfluencies, and turn-taking cues. Two-dimensional voice maps are created for each speaker to identify phonation regions and guide modelling of voice quality dimensions such as creakiness and breathiness. The corpus and voice profiles will be released as open-source resources for the research community.
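The voice maps can be pictured as phonetogram-style histograms. Below is a minimal sketch, assuming a map over fundamental frequency and frame level computed with librosa; the project's actual voice-map method, axes, and metrics may differ, and the function name here is illustrative only.

```python
# Sketch of a two-dimensional voice map (phonetogram-style), assuming
# axes of fundamental frequency (F0) and frame level. The project's
# actual voice-map construction is not specified here.
import numpy as np
import librosa

def voice_map(wav_path, fmin=65.0, fmax=600.0, hop=512):
    """Bin voiced frames of a recording into an F0-by-level histogram."""
    y, sr = librosa.load(wav_path, sr=None)
    # Frame-wise F0 with pYIN; unvoiced frames come back as NaN.
    f0, voiced, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr,
                                 hop_length=hop)
    # Frame-wise RMS level in dB (relative level, not calibrated SPL).
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]
    level_db = 20.0 * np.log10(np.maximum(rms, 1e-8))
    # Align frame counts and keep voiced frames only.
    n = min(len(f0), len(level_db))
    mask = voiced[:n] & ~np.isnan(f0[:n])
    # 2D histogram: log-spaced F0 bins by linear level bins.
    f0_bins = np.geomspace(fmin, fmax, 40)
    lvl_bins = np.linspace(-60, 0, 30)
    hist, _, _ = np.histogram2d(f0[:n][mask], level_db[:n][mask],
                                bins=[f0_bins, lvl_bins])
    return hist
```

In a map like this, dense regions indicate the speaker's habitual phonation; a clinical voice map would use calibrated SPL rather than relative RMS level.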
Natural conversation requires speed and timing. RAPPORT addresses this with LLM-based text expansion (turning brief input into full, contextually appropriate utterances), style prediction from conversational context, and turn-taking models that let the system hold the floor while the user composes, so responses land at the right moment rather than after the conversation has moved on.
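As a rough illustration of the text-expansion step, here is a minimal sketch assuming an OpenAI-style chat API; the project does not specify which model, interface, or prompt design it uses, so the model name, prompt wording, and expand_utterance function are hypothetical stand-ins.

```python
# Sketch of LLM-based text expansion for an SGD: terse keyboard input
# plus recent dialogue context in, one complete spoken sentence out.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def expand_utterance(brief_input: str, dialogue_history: list[str]) -> str:
    """Expand terse SGD input into a full, contextually apt utterance."""
    context = "\n".join(dialogue_history[-6:])  # last few turns as context
    messages = [
        {"role": "system",
         "content": ("You expand a few keywords typed by an AAC user into "
                     "one complete spoken sentence, in their first-person "
                     "voice. Preserve their intent; add no new information.")},
        {"role": "user",
         "content": f"Conversation so far:\n{context}\n\n"
                    f"Keywords: {brief_input}"},
    ]
    resp = client.chat.completions.create(model="gpt-4o-mini",
                                          messages=messages)
    return resp.choices[0].message.content.strip()

# e.g. expand_utterance("coffee no thanks later",
#                       ["A: Would you like a coffee?"])
# might return: "No thanks, maybe later."
```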
We build TTS architectures that separate three independently controllable dimensions — voice identity, prosodic realization, and communicative intent — allowing each to be set via audio reference, text description, or direct manipulation. A verifier-based feedback loop ensures generated speech matches the user's intended affect, regardless of which underlying synthesis engine is used.
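To make the verifier loop concrete, here is a minimal sketch with toy stand-ins for the synthesizer and the verifier; the project's actual engine, verifier model, and control dimensions are unspecified, so the arousal/valence controls and every function name below are illustrative.

```python
# Sketch of a verifier-based feedback loop: synthesize, measure the
# realized affect, and nudge the controls until the verifier agrees
# with the user's intended affect.
import numpy as np

def synthesize(text: str, controls: dict) -> np.ndarray:
    """Toy stand-in for any TTS engine exposing affect controls.
    Here the 'audio' just carries the control values so the loop runs."""
    return np.array([controls.get("arousal", 0.0),
                     controls.get("valence", 0.0)])

def predict_affect(audio: np.ndarray) -> dict:
    """Toy stand-in for a verifier, e.g. a speech emotion recognizer.
    Simulates an engine whose realized affect undershoots its controls."""
    return {"arousal": 0.7 * audio[0], "valence": 0.7 * audio[1]}

def synthesize_verified(text: str, target: dict,
                        max_iters: int = 10, tol: float = 0.05):
    """Adjust controls until measured affect matches the intended affect."""
    controls = dict(target)          # start from the intended affect
    for _ in range(max_iters):
        audio = synthesize(text, controls)
        measured = predict_affect(audio)
        gap = {k: target[k] - measured[k] for k in target}
        if max(abs(g) for g in gap.values()) < tol:
            return audio, controls   # verified: output matches intent
        for k, g in gap.items():     # nudge controls toward the target
            controls[k] += 0.8 * g
    return audio, controls           # best effort after max_iters

audio, controls = synthesize_verified("See you soon!",
                                      {"arousal": 0.6, "valence": 0.8})
```

The key design point is that the loop treats the synthesis engine as a black box: only the controls and the verifier's judgment matter, which is what lets the mechanism hold regardless of the underlying TTS technology.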
We evaluate beyond isolated listening tests — in real conversational tasks with actual SGD users. Both projects include structured studies with people living with ALS, cerebral palsy, and ASD, in collaboration with clinical partners. The question we ask is not "does it sound good?" but "does it help people communicate what they intend?"
These projects build on the PI's previous work on spontaneous speech synthesis with prosody control, carried out in the VR-funded project Connected: Context-Aware Speech Synthesis (2020–2025). That project developed and evaluated a conversational TTS framework with controllable speaking style, breathing, fillers, and voice quality, and resulted in 20+ publications at Interspeech, ICASSP, SSW, and other leading venues.
Early publications from the current projects. The list will grow as research progresses.
Based at the Department of Speech, Music and Hearing (TMH), KTH Royal Institute of Technology, Stockholm.
We are looking for researchers to join the team. If you are passionate about speech synthesis, voice technology, or assistive communication, we would love to hear from you.
A postdoctoral position in the Vetenskapsrådet-funded Personalized Voices project, working on speech synthesis technology that enables expressive communication for users of Speech Generating Devices (SGDs). The work spans corpus recording, neural TTS modelling, and user studies with real SGD users.