Perception of Speaker Stance
How do listeners infer meaning from prosody? STANCE investigates the acoustic and contextual cues through which speakers signal attitudes, intentions, and social meanings in spontaneous conversation — using neural speech synthesis as a scientific instrument.
Human speakers constantly signal their attitudes, certainty, and social positioning through prosody — the way they vary pitch, timing, and voice quality. Yet systematic study of these signals has been hampered by a fundamental challenge: in natural speech, it is impossible to vary a single acoustic cue while holding everything else constant.
STANCE addresses this by developing neural text-to-speech systems built on ecologically valid spontaneous conversational speech, with independent, fine-grained control over prosodic features. These systems serve as a scientific instrument in an analysis-by-synthesis methodology, enabling controlled perception experiments with stimuli that mirror the richness of natural conversation.
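A minimal sketch of that single-feature manipulation, assuming a hypothetical prosody-controllable interface: the Prosody fields, the value grids, and the synthesize placeholder below are illustrative assumptions, not the actual STANCE systems' API.

```python
# Hypothetical one-feature-at-a-time stimulus grid for a perception experiment.
# Parameter names and value ranges are illustrative assumptions.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Prosody:
    speech_rate: float = 1.0   # relative duration scaling
    pitch_shift: float = 0.0   # semitones relative to the speaker's baseline
    creak: float = 0.0         # 0 = modal voice, 1 = maximally creaky

BASELINE = Prosody()

# Values probed for each feature while all other features stay at baseline.
MANIPULATIONS = {
    "speech_rate": [0.8, 1.0, 1.2],
    "pitch_shift": [-2.0, 0.0, 2.0],
    "creak": [0.0, 0.5, 1.0],
}

def stimulus_conditions():
    """Yield (label, Prosody) pairs that differ from BASELINE in exactly one feature."""
    for feature, values in MANIPULATIONS.items():
        for value in values:
            yield f"{feature}={value}", replace(BASELINE, **{feature: value})

def synthesize(text: str, prosody: Prosody) -> bytes:
    """Placeholder for the prosody-controllable TTS system."""
    raise NotImplementedError

if __name__ == "__main__":
    for label, prosody in stimulus_conditions():
        print(label, prosody)  # in an experiment: synthesize("Well, maybe.", prosody)
```

Because every condition shares the same text and baseline settings, any difference in listener judgements can be attributed to the one manipulated feature.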
The project's outputs span TTS methodology, speech perception, pragmatics, gender studies, and assistive communication, united by the question of how prosody conveys meaning beyond the words themselves.
STANCE pursues a unified scientific agenda across prosody, perception, and speech technology, grounded in three complementary research goals.
Develop neural TTS systems with independent, fine-grained prosodic control — enabling controlled perception experiments with spontaneous-sounding stimuli that vary a single feature at a time.
Investigate how timing, pitch, and voice quality interact to signal speaker attitudes — confidence, hesitation, stance on topic, and pragmatic functions such as discourse markers and politeness.
Examine how discourse context, speaker gender, voice identity, and listener background shape the perception of stance — with applications in inclusive voice design and assistive communication.
Six cross-cutting themes drawn from the conclusions and discussions of STANCE publications.
Prosodic choices directly and measurably affect how speakers are judged. Politeness conveyed through voice increases request compliance. Mid-utterance filled pauses lower confidence ratings more than initial pauses. The duration of a discourse marker like "well" shifts its perceived polarity — functioning as hedging or reluctant agreement depending on context.
Voice quality features signal specific interactional functions beyond style. Non-positional creaky voice makes speakers sound less certain and more turn-final. Breathy voice maps reliably onto two distinct constructions: self-directed musings and grounding attempts. Synthesized smiling voice, created by training on near-laughter data, is perceived as smiling without degrading naturalness.
Disfluencies do not uniformly make speakers sound worse. Repetitions harm perceived competence and sincerity most, but explicit listener forewarning significantly reduces these effects — a finding relevant for L2 speakers and speakers with ASD. Pause-internal phonetic particles (including tongue clicks) reduce perceived certainty and can now be synthesized for controlled experiments.
Gender-ambiguous TTS can serve as a research instrument for exposing implicit bias in speech perception. A voice palette built on non-binary speakers is positively received by AAC users seeking gender-expansive options. Speech-LLMs encode gender bias at the semantic token level during text encoding — making speaker assignment a direct diagnostic window into training data bias.
Widespread interaction with synthetic voices constitutes an emergent sociolinguistic pressure. The influence is complex, mediated by identity, engagement, and context — and demands a transdisciplinary response. The legal, ethical, and societal dimensions of voice-based AI shaping human speech norms remain largely unaddressed.
Mean opinion scores (MOS) do not correlate with self-supervised (SSL) vocoding loss, and MOS predictors trained on read speech fail on spontaneous speech without fine-tuning. Discrete-token models like Bark produce natural prosody and spontaneous behaviours but lack robustness and speaker consistency, dimensions MOS cannot capture. Evaluation must be contextual to reflect what matters in real conversational use.
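One way to make the first claim concrete is a rank-correlation check between listener MOS and an objective metric; the sketch below uses made-up placeholder numbers, not STANCE data.

```python
# Illustrative check of whether an objective metric tracks listener MOS.
# The scores are placeholders, not project results.
import numpy as np
from scipy.stats import spearmanr

mos_scores = np.array([3.8, 4.1, 3.2, 4.4, 3.6])          # mean listener ratings per system
vocoding_loss = np.array([0.42, 0.38, 0.41, 0.45, 0.36])  # objective metric (lower = better)

rho, p_value = spearmanr(mos_scores, vocoding_loss)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A weak or non-significant rho is the pattern reported here for MOS versus
# SSL vocoding loss: the objective number does not predict perceived quality.
```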
27 peer-reviewed publications from the STANCE project, spanning speech synthesis, speech perception, pragmatics, and inclusive voice design.