VR-funded project · 2020–2025

STANCE

Perception of Speaker Stance

How do listeners infer meaning from prosody? STANCE investigates the acoustic and contextual cues through which speakers signal attitudes, intentions, and social meanings in spontaneous conversation — using neural speech synthesis as a scientific instrument.

27 Publications · 5 Years · KTH Speech Group
Research themes
Disfluency & Hesitation: How filled pauses, repetitions, and their placement signal confidence and competence
Voice Quality & Phonation: Perceptual effects of creaky, breathy, and tense voice in conversation
Gender & Identity: Inclusive voice design for gender diversity and assistive communication
Pragmatics & Discourse: Discourse markers, politeness, and turn-taking cues in synthesis
About the project

Why study speaker stance in spontaneous speech?

Human speakers constantly signal their attitudes, certainty, and social positioning through prosody — the way they vary pitch, timing, and voice quality. Yet systematic study of these signals has been hampered by a fundamental challenge: in natural speech, it is impossible to vary a single acoustic cue while holding everything else constant.

STANCE addresses this by developing neural text-to-speech systems built on ecologically valid spontaneous conversational speech, with independent, fine-grained control over prosodic features. These systems serve as a scientific instrument — an analysis-by-synthesis methodology that enables controlled perception experiments with stimuli that mirror the richness of natural conversation.
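As a concrete illustration of the single-cue logic (the project's own stimuli come from controllable neural TTS, so treat this as a minimal sketch rather than the STANCE pipeline), the Praat-based parselmouth Python library can rescale F0 alone while leaving duration and spectral detail untouched; the input file name is a hypothetical placeholder:

```python
# Minimal sketch of single-cue stimulus manipulation using parselmouth
# (a Python interface to Praat). STANCE's actual stimuli come from
# controllable neural TTS; this only illustrates the idea of varying
# one prosodic cue while holding the rest constant.
# "utterance.wav" is a hypothetical input file.
import parselmouth
from parselmouth.praat import call

sound = parselmouth.Sound("utterance.wav")

for factor in (0.8, 1.0, 1.2):  # lowered, original, raised pitch
    # Build a fresh Manipulation object (time step, F0 floor/ceiling in Hz)
    # so each variant scales the original contour, not a previous variant.
    manipulation = call(sound, "To Manipulation", 0.01, 75, 600)
    pitch_tier = call(manipulation, "Extract pitch tier")
    # Scale all F0 values by `factor`; timing and spectrum stay untouched.
    call(pitch_tier, "Multiply frequencies", sound.xmin, sound.xmax, factor)
    call([pitch_tier, manipulation], "Replace pitch tier")
    resynthesis = call(manipulation, "Get resynthesis (overlap-add)")
    resynthesis.save(f"utterance_f0x{factor}.wav", "WAV")
```

Overlap-add resynthesis of this kind can only rescale what is already in the recording; the neural systems developed in STANCE go further by generating spontaneous-sounding variants of cues that signal stance, such as filled pauses and voice quality.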

The project's outputs span TTS methodology, speech perception, pragmatics, gender studies, and assistive communication, united by the question of how prosody conveys meaning beyond the words themselves.

Project details
PI: Éva Székely
Co-PI: Joakim Gustafson
Funder: Swedish Research Council (Vetenskapsrådet)
Duration: 2020–2025
Host: KTH Royal Institute of Technology, Division of Speech, Music and Hearing
Research goals

Three central questions

STANCE pursues a unified scientific agenda across prosody, perception, and speech technology, grounded in three complementary research goals.

🔬
Analysis by Synthesis

Develop neural TTS systems with independent, fine-grained prosodic control — enabling controlled perception experiments with spontaneous-sounding stimuli that vary a single feature at a time.

🎙️
Prosody & Pragmatic Meaning

Investigate how timing, pitch, and voice quality interact to signal speaker attitudes — confidence, hesitation, stance on topic, and pragmatic functions such as discourse markers and politeness.

🌐
Context, Speaker & Identity

Examine how discourse context, speaker gender, voice identity, and listener background shape the perception of stance — with applications in inclusive voice design and assistive communication.

What we found

Key findings

Six cross-cutting themes drawn from the conclusions and discussions of STANCE publications.

🎚️
Prosody causally shapes social perception

Prosodic choices directly and measurably affect how speakers are judged. Politeness conveyed through voice increases request compliance. Mid-utterance filled pauses lower confidence ratings more than initial pauses. The duration of a discourse marker like "well" shifts its perceived polarity — functioning as hedging or reluctant agreement depending on context.

🗣️
Voice quality carries pragmatic meaning

Voice quality features signal specific interactional functions beyond style. Non-positional creaky voice makes speakers sound less certain and more turn-final. Breathy voice maps reliably onto two distinct constructions: self-directed musings and grounding attempts. Synthesized smiling voice, created by training on speech occurring near laughter, is perceived as smiling without degrading naturalness.

🤔
Disfluency effects are nuanced, not simply negative

Disfluencies do not uniformly make speakers sound worse. Repetitions harm perceived competence and sincerity most, but explicit listener forewarning significantly reduces these effects — a finding relevant for L2 speakers and speakers with ASD. Pause-internal phonetic particles (including tongue clicks) reduce perceived certainty and can now be synthesized for controlled experiments.

⚧️
Synthetic voices both reflect and construct gender

Gender-ambiguous TTS can serve as a research instrument for exposing implicit bias in speech perception. A voice palette built on non-binary speakers is positively received by AAC users seeking gender-expansive options. Speech-LLMs encode gender bias at the semantic token level during text encoding — making speaker assignment a direct diagnostic window into training data bias.
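The speaker-assignment diagnostic can be pictured as a small counting probe. Everything below is a hypothetical sketch: `assign_speaker`, the prompt templates, and the role list stand in for whichever speech-LLM and prompt set are under test, not for any published STANCE code:

```python
# Hypothetical sketch of a speaker-assignment probe: feed role-referring
# but gender-neutral prompts to a speech-LLM and tally the gender of the
# voice it selects. `assign_speaker` is a placeholder, not a real API.
from collections import Counter

TEMPLATES = [
    "The {role} said the results were ready.",
    "The {role} asked a follow-up question.",
]
ROLES = ["doctor", "nurse", "engineer", "teacher"]

def assign_speaker(prompt: str) -> str:
    """Placeholder: return the perceived gender ('female'/'male') of the
    voice the speech-LLM under test picks when synthesizing `prompt`."""
    raise NotImplementedError("plug in the model under test")

def probe_bias() -> dict[str, Counter]:
    """Count voice-gender assignments per role across prompt templates."""
    counts: dict[str, Counter] = {role: Counter() for role in ROLES}
    for role in ROLES:
        for template in TEMPLATES:
            counts[role][assign_speaker(template.format(role=role))] += 1
    return counts
```

A skew in these counts (for instance, "nurse" prompts consistently receiving a female voice) points back to associations absorbed from the training data, which is what makes speaker assignment a diagnostic window rather than merely a usability issue.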

🌐
AI voices may reshape how humans speak

Widespread interaction with synthetic voices constitutes an emergent sociolinguistic pressure. The influence is complex, mediated by identity, engagement, and context — and demands a transdisciplinary response. The legal, ethical, and societal dimensions of voice-based AI shaping human speech norms remain largely unaddressed.

📊
Standard TTS evaluation fails conversational speech

MOS does not correlate with SSL vocoding loss, and MOS predictors trained on read speech fail on spontaneous speech without fine-tuning. Discrete-token models like Bark produce natural prosody and spontaneous behaviours but are poor in robustness and speaker consistency — dimensions MOS cannot capture. Evaluation must be contextual to reflect what matters in real conversational use.
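One way to picture the transfer failure is to correlate a MOS predictor's scores with human ratings separately for read and spontaneous test sets; a low rank correlation on the spontaneous set signals the mismatch. This is an illustrative sketch with made-up numbers, and `transfer_check` is a hypothetical helper rather than project code:

```python
# Illustrative transfer check: correlate predicted and human MOS
# separately on read and spontaneous test sets. The rating lists are
# dummies; a real check would use a pretrained MOS predictor's output.
from scipy.stats import spearmanr

def transfer_check(human: list[float], predicted: list[float]) -> float:
    """Spearman rank correlation between human and predicted MOS."""
    rho, _pvalue = spearmanr(human, predicted)
    return rho

# Dummy numbers for illustration only: a predictor trained on read
# speech tracks read-speech ratings but not spontaneous ones.
read_human, read_pred = [4.2, 3.8, 3.1, 4.5], [4.0, 3.7, 3.3, 4.4]
spon_human, spon_pred = [4.1, 3.9, 3.0, 4.4], [3.2, 4.1, 3.9, 3.1]

print("read:", transfer_check(read_human, read_pred))
print("spontaneous:", transfer_check(spon_human, spon_pred))
```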

Research output

Publications

27 peer-reviewed publications from the STANCE project, spanning speech synthesis, speech perception, pragmatics, and inclusive voice design.

2025
Interspeech
VoiceQualityVC: A Voice Conversion System for Studying the Perceptual Effects of Voice Quality in Speech
Lameris, H., Gustafson, J., & Székely, É. (2025). Proceedings of Interspeech 2025, 2295–2299. https://doi.org/10.21437/Interspeech.2025-902
Interspeech
Voices of 'Cyborg Awesomeness': Posthuman Embodiment of Nonbinary Gender Expression in AI Speech Technologies
Hope, M., & Székely, É. (2025). Proceedings of Interspeech 2025, 689–693. https://doi.org/10.21437/Interspeech.2025-2229
Interspeech
Voice Reconstruction through Large-Scale TTS Models: Comparing Zero-Shot and Fine-tuning Approaches to Personalise TTS in Assistive Communication
Székely, É., Mihajlik, P., Kádár, M. S., & Tóth, L. (2025). Proceedings of Interspeech 2025, 2735–2739. https://doi.org/10.21437/Interspeech.2025-1726
IWSDS
Will AI Shape the Way We Speak? The Emerging Sociolinguistic Influence of Synthetic Voices
Székely, É., Miniota, J., & Hejná, M. (2025). Proceedings of IWSDS 2025, 335–340. https://aclanthology.org/2025.iwsds-1.37
Interspeech
Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM
Puhach, D., Payberah, A. H., & Székely, É. (2025). Proceedings of Interspeech 2025. https://arxiv.org/abs/2508.13603
2024
Interspeech
An Inclusive Approach to Creating a Palette of Synthetic Voices for Gender Diversity
Székely, É., & Hope, M. (2024). Proceedings of Interspeech 2024, 3070–3074. https://doi.org/10.21437/Interspeech.2024-1631
Interspeech
CreakVC: A Voice Conversion Tool for Modulating Creaky Voice
Lameris, H., Gustafson, J., & Székely, É. (2024). Proceedings of Interspeech 2024, 1005–1006. https://doi.org/10.21437/Interspeech.2024-1534
Interspeech
"Well", What Can You Do with Messy Data? Exploring the Prosody and Pragmatic Function of the Discourse Marker "Well" with Found Data and Speech Synthesis
O'Mahony, J., Lai, C., & Székely, É. (2024). Proceedings of Interspeech 2024, 4084–4088. https://doi.org/10.21437/Interspeech.2024-2122
Interspeech
Contextual Interactive Evaluation of TTS Models in Dialogue Systems
Wang, S., Székely, É., & Gustafson, J. (2024). Proceedings of Interspeech 2024, 2965–2969. https://doi.org/10.21437/Interspeech.2024-1008
SIGDIAL
Voice and Choice: Investigating the Role of Prosodic Variation in Request Compliance and Perceived Politeness Using Conversational TTS
Székely, É., Higginbotham, J., & Possemato, F. (2024). Proceedings of SIGDIAL 2024, 466–476. https://aclanthology.org/2024.sigdial-1.40
LREC-COLING
The Role of Creaky Voice in Turn Taking and the Perception of Speaker Stance: Experiments Using Controllable TTS
Lameris, H., Székely, É., & Gustafson, J. (2024). Proceedings of LREC-COLING 2024, 16058–16065. https://aclanthology.org/2024.lrec-main.1396
LREC-COLING
Evaluating Text-to-Speech Synthesis from a Large Discrete Token-Based Speech Language Model
Wang, S., & Székely, É. (2024). Proceedings of LREC-COLING 2024, 6464–6474. https://aclanthology.org/2024.lrec-main.573
2023
Interspeech
Prosody-Controllable Gender-Ambiguous Speech Synthesis: A Tool for Investigating Implicit Bias in Speech Perception
Székely, É., Gustafson, J., & Torre, I. (2023). Proceedings of Interspeech 2023, 1234–1238. https://doi.org/10.21437/Interspeech.2023-2086
Interspeech
So-to-Speak: An Exploratory Platform for Investigating the Interplay Between Style and Prosody in TTS
Székely, É., Wang, S., & Gustafson, J. (2023). Proceedings of Interspeech 2023, 2016–2017.
Interspeech
Pardon My Disfluency: The Impact of Disfluency Effects on the Perception of Speaker Competence and Confidence
Kirkland, A., Gustafson, J., & Székely, É. (2023). Proceedings of Interspeech 2023, 5217–5221. https://doi.org/10.21437/Interspeech.2023-887
Interspeech
OverFlow: Putting Flows on Top of Neural Transducers for Better TTS
Mehta, S., Kirkland, A., Lameris, H., Beskow, J., Székely, É., & Henter, G. E. (2023). Proceedings of Interspeech 2023, 4279–4283. https://doi.org/10.21437/Interspeech.2023-2134
Interspeech
A Comparative Study of Self-Supervised Speech Representations in Read and Spontaneous TTS
Wang, S., Henter, G. E., Gustafson, J., & Székely, É. (2023). Proceedings of Interspeech 2023, 4289–4293. https://doi.org/10.21437/Interspeech.2023-2272
Interspeech
Beyond Style: Synthesizing Speech with Pragmatic Functions
Lameris, H., Gustafson, J., & Székely, É. (2023). Proceedings of Interspeech 2023, 3382–3386. https://doi.org/10.21437/Interspeech.2023-2072
Interspeech
Situating Speech Synthesis: Investigating Contextual Factors in the Evaluation of Conversational TTS
Lameris, H., Kirkland, A., Gustafson, J., & Székely, É. (2023). Proceedings of Interspeech 2023.
Interspeech
The Impact of Pause-Internal Phonetic Particles on Recall in Synthesized Lectures
Elmers, M., & Székely, É. (2023). Proceedings of Interspeech 2023, 3387–3391. https://doi.org/10.21437/Interspeech.2023-1491
ICPhS
Prosody-Controllable Spontaneous TTS with Neural HMMs
Lameris, H., Mehta, S., Henter, G. E., Gustafson, J., & Székely, É. (2023). Proceedings of ICPhS 2023, 3141–3145. https://arxiv.org/abs/2211.13533
ICPhS
Neural Speech Synthesis with Controllable Creaky Voice Style
Lameris, H., Włodarczak, M., Gustafson, J., & Székely, É. (2023). Proceedings of ICPhS 2023.
IVA / ACM
Generation of Speech and Facial Animation with Controllable Articulatory Effort for Amusing Conversational Characters
Gustafson, J., Székely, É., & Beskow, J. (2023). Proceedings of IVA 2023. https://doi.org/10.1145/3570945.3607289
SSW
On the Use of Self-Supervised Speech Representations in Spontaneous Speech Synthesis
Wang, S., Henter, G. E., Gustafson, J., & Székely, É. (2023). Proceedings of the 12th ISCA Speech Synthesis Workshop (SSW), Grenoble, France.
2022
Interspeech
Where's the Uh, Hesitation? The Interplay Between Filled Pause Location, Speech Rate and Fundamental Frequency in Perception of Confidence
Kirkland, A., Lameris, H., Székely, É., & Gustafson, J. (2022). Proceedings of Interspeech 2022, 4990–4994. https://doi.org/10.21437/Interspeech.2022-10973
Speech Prosody
Two Pragmatic Functions of Breathy Voice in American English Conversation
Ward, N. G., Kirkland, A., Włodarczak, M., & Székely, É. (2022). Proceedings of Speech Prosody 2022, 82–86. https://doi.org/10.21437/SpeechProsody.2022-17
2021
SSW
Perception of Smiling Voice in Spontaneous Speech Synthesis
Kirkland, A., Włodarczak, M., Gustafson, J., & Székely, É. (2021). Proceedings of SSW11, 108–112. https://doi.org/10.21437/SSW.2021-19
The team

The researchers behind STANCE

É. Székely, Principal Investigator, KTH
J. Gustafson, Co-PI, KTH
Collaborators
J. Beskow, KTH
G. Eje Henter, KTH
H. Lameris, KTH
S. Wang, KTH
A. Kirkland, KTH
S. Mehta, KTH
J. Miniota, KTH
I. Torre, KTH
A. H. Payberah, KTH
M. Włodarczak, Stockholm University
D. Puhach, Uppsala University
M. Hope, University of Delaware, USA
P. Mihajlik, University of Budapest, Hungary
M. S. Kádár, University of Budapest, Hungary
L. Tóth, University of Szeged, Hungary
M. Hejná, Aarhus University, Denmark
J. O'Mahony, University of Edinburgh, UK
C. Lai, University of Edinburgh, UK
J. Higginbotham, University at Buffalo, USA
F. Possemato, Rijksuniversiteit Groningen, The Netherlands
M. Elmers, Saarland University, Germany
N. G. Ward, University of Texas, USA