Interspeech, 2024
Éva Székely1, Maxwell Hope2
1KTH Royal Institute of Technology, 2University of Delaware, USA
szekely@kth.se, maxhope@udel.edu
Mainstream text-to-speech (TTS) technologies predominantly rely on binary, cisgender speech and fail to adequately represent the diversity of gender-expansive (e.g., transgender and/or nonbinary) people. This poses challenges, particularly for users of Speech Generating Devices (SGDs) seeking TTS voices that authentically reflect their identity and desired expressive nuances. This paper introduces a novel approach for constructing a palette of controllable gender-expansive TTS voices using recordings from 14 gender-expansive speakers. We employ Constrained PCA to extract gender-independent speaker identity vectors from x-vectors, using acoustic Vocal Tract Length (aVTL) as a known component. The resulting representation is applied as a speaker embedding in neural TTS, allowing control over the aVTL and over several emergent properties that together represent the vocal space across speakers. In addition to quantitative metrics, we present a community evaluation conducted by nonbinary SGD users.
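The abstract does not spell out the Constrained PCA formulation, so the following is only a minimal sketch of one common reading: regress the known variable (here aVTL) out of the speaker embeddings, then run ordinary PCA on the residual. The function names `constrained_pca` and `adjust`, the 16-dimensional toy embeddings, and the random aVTL values are all hypothetical stand-ins, not the authors' data or implementation.

```python
import numpy as np


def constrained_pca(X, known, n_components=3):
    """Split embeddings into a known direction plus residual principal components.

    X: (n_speakers, dim) embeddings; known: (n_speakers,) known variable (e.g. aVTL).
    Assumed formulation: least-squares regression on `known`, then PCA on the residual.
    """
    X = np.asarray(X, dtype=float)
    k = np.asarray(known, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                      # centre the embeddings
    kc = k - k.mean()                  # centre the known variable
    # Embedding-space direction explained by one unit of the known variable.
    direction = Xc.T @ kc / (kc @ kc)
    # Remove the part of each embedding that the known variable accounts for.
    residual = Xc - np.outer(kc, direction)
    # PCA on the residual via SVD; rows of vt are orthonormal components.
    _, _, vt = np.linalg.svd(residual, full_matrices=False)
    return mean, direction, vt[:n_components], residual


def adjust(x, direction, components, delta_known=0.0, delta_components=None):
    """Shift one embedding along the known direction and/or the residual components."""
    x_new = np.asarray(x, dtype=float) + delta_known * direction
    if delta_components is not None:
        x_new = x_new + np.asarray(delta_components, dtype=float) @ components
    return x_new


# Toy data only: 14 speakers, 16-dim embeddings, made-up aVTL values.
rng = np.random.default_rng(0)
X = rng.normal(size=(14, 16))
avtl = rng.normal(loc=15.0, scale=1.0, size=14)

mean, direction, components, residual = constrained_pca(X, avtl, n_components=3)
adjusted = adjust(X[0], direction, components,
                  delta_known=1.0, delta_components=[0.5, 0.0, 0.0])
```

Under this reading, moving an embedding along `direction` changes only the part tied to the known variable, while the rows of `components` give separate axes for whatever variation remains across speakers.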
[Samples. Component 1 decreased / unadjusted / increased: SP11, SP12. Component 2 decreased / unadjusted / increased: SP08, SP09. Component 3 decreased / unadjusted / increased: SP02, SP13.]
[Samples. Vocal tract length decreased / increased: SP02, SP06, SP13.]