An inclusive approach to creating a palette of synthetic voices for gender diversity

Interspeech, 2024

Éva Székely¹, Maxwell Hope²

¹KTH Royal Institute of Technology, ²University of Delaware, USA

szekely@kth.se, maxhope@udel.edu

Abstract

Mainstream text-to-speech (TTS) technologies predominantly rely on binary, cisgender speech, failing to adequately
represent the diversity of gender-expansive (e.g., transgender and/or nonbinary) people. This poses challenges,
particularly for users of Speech Generating Devices (SGDs) seeking TTS voices that authentically reflect their
identity and desired expressive nuances. This paper introduces a novel approach for constructing a palette of
controllable gender-expansive TTS voices using recordings from 14 gender-expansive speakers.
We employ Constrained PCA to extract gender-independent speaker identity vectors from x-vectors, using acoustic
Vocal Tract Length (aVTL) as a known component.
The result is applied as a speaker embedding in neural TTS, allowing control over the aVTL and several emergent
properties captured as a representation of the vocal space across speakers. In addition to quantitative metrics,
we present a community evaluation conducted by nonbinary SGD users.
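
The decomposition described in the abstract can be illustrated with a minimal sketch. It assumes one simple reading of Constrained PCA: regress the known variable (aVTL) out of the centred x-vectors, then run ordinary PCA on the residual to obtain the emergent components. The function names, the toy 16-dimensional embeddings, and the random placeholder aVTL values are all hypothetical; the formulation used in the paper may differ in normalisation and other details.

```python
import numpy as np

def constrained_pca(X, g, n_components=3):
    """Split embeddings into a part explained by a known variable and residual PCs.

    X: (n_speakers, dim) speaker embeddings (stand-ins for x-vectors).
    g: (n_speakers,) known per-speaker variable (stand-in for aVTL).
    """
    Xc = X - X.mean(axis=0)            # centre the embeddings
    gc = g - g.mean()                  # centre the known variable
    b = (gc @ Xc) / (gc @ gc)          # least-squares direction explained by g, shape (dim,)
    residual = Xc - np.outer(gc, b)    # what g cannot explain linearly
    _, _, Vt = np.linalg.svd(residual, full_matrices=False)
    components = Vt[:n_components]     # orthonormal residual directions
    scores = residual @ components.T   # per-speaker coordinates on those directions
    return b, components, scores

def adjust_embedding(x, b, components, delta_g=0.0, delta_scores=None):
    """Shift one embedding along the known-variable axis and/or the residual components."""
    if delta_scores is None:
        delta_scores = np.zeros(components.shape[0])
    return x + delta_g * b + np.asarray(delta_scores) @ components

# Toy data: 14 speakers, 16-dimensional embeddings, random placeholder values.
rng = np.random.default_rng(0)
n_speakers, dim = 14, 16
g = rng.normal(size=n_speakers)
X = rng.normal(size=(n_speakers, dim)) + np.outer(g, rng.normal(size=dim))

b, components, scores = constrained_pca(X, g)

# Example adjustment: raise the known variable and component 1 for the first speaker.
x_shifted = adjust_embedding(X[0], b, components, delta_g=0.5,
                             delta_scores=[1.0, 0.0, 0.0])
```

In this sketch the residual scores are linearly uncorrelated with the known variable by construction, which is what lets the two kinds of adjustment be applied separately at inference.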

Samples

Adjustment of the emergent properties at inference

Component 1 decreased (SP11)
Speaker 11 unadjusted
Component 1 increased (SP11)
Component 1 decreased (SP12)
Speaker 12 unadjusted
Component 1 increased (SP12)
Component 2 decreased (SP08)
Speaker 08 unadjusted
Component 2 increased (SP08)
Component 2 decreased (SP09)
Speaker 09 unadjusted
Component 2 increased (SP09)
Component 3 decreased (SP02)
Speaker 02 unadjusted
Component 3 increased (SP02)
Component 3 decreased (SP13)
Speaker 13 unadjusted
Component 3 increased (SP13)

Adjustment of the acoustic Vocal Tract Length at inference

Vocal Tract Length decreased (SP02)
Vocal Tract Length increased (SP02)
Vocal Tract Length decreased (SP06)
Vocal Tract Length increased (SP06)
Vocal Tract Length decreased (SP13)
Vocal Tract Length increased (SP13)