Beyond graphemes and phonemes: continuous phonological features in neural text-to-speech synthesis

Interspeech, 2024

Christina Tånnander1,2, Shivam Mehta1, Jonas Beskow1, Jens Edlund1

1KTH Royal Institute of Technology, 2Swedish Agency for Accessible Media

christina.tannander@mtm.se, {smehta,beskow}@kth.se, edlund@speech.kth.se

Abstract

We introduce continuous phonological features as input to TTS with the dual objective of more precise control over
phonological aspects and better potential for exploration of latent features in TTS models for speech science purposes.
In our framework, the TTS is conditioned on continuous values between 0.0 and 1.0, where each phoneme has a specified
position on each feature axis. We chose 11 features to represent US English and trained a voice with Matcha-TTS.
Effectiveness was assessed by investigating two selected features in two ways:
through a categorical perception experiment confirming the expected alignment of feature positions and phoneme perception,
and through analysis of acoustic correlates confirming a gradual, monotonic change of acoustic features consistent with changes in the phonemic input features.
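
To make the input representation concrete, the following is a minimal sketch of how a phoneme could be encoded as a position on each feature axis, and how a 9-step continuum along one axis could be generated for a perception test. The feature names, the two example phonemes, their positions, and the `continuum` helper are all illustrative assumptions. The abstract does not list the 11 features or their values, and this is not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical feature axes: the paper uses 11 features, which are not listed here.
FEATURES: List[str] = ["vowel_height", "vowel_backness", "place_of_articulation"]

# Hypothetical phoneme positions, one value in [0.0, 1.0] per feature axis.
PHONEMES: Dict[str, List[float]] = {
    "open_front_vowel": [0.0, 0.0, 0.5],
    "close_front_vowel": [1.0, 0.0, 0.5],
}


def continuum(start: List[float], end: List[float], axis: str, steps: int = 9) -> List[List[float]]:
    """Return `steps` feature vectors that move linearly from `start` to `end`
    along a single feature axis, keeping all other axes at their `start` value."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    idx = FEATURES.index(axis)
    vectors = []
    for i in range(steps):
        t = i / (steps - 1)  # interpolation weight, 0.0 at the first step, 1.0 at the last
        vec = list(start)
        vec[idx] = start[idx] + t * (end[idx] - start[idx])
        vectors.append(vec)
    return vectors


height_steps = continuum(
    PHONEMES["open_front_vowel"], PHONEMES["close_front_vowel"], "vowel_height"
)
for number, vec in enumerate(height_steps, start=1):
    print(number, vec)
```

In this sketch each of the nine vectors would replace the categorical phoneme label at the manipulated position in the input sequence. How the vectors are actually fed to Matcha-TTS is not specified in the abstract.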

Examples from the categorical perception test

9 degrees of front vowel height (from most open to most closed)

(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)

9 degrees of place of articulation for voiceless stops (from most back to most front)

(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)