Statistical voice transformation in Unit Selection TTS framework
Fabio Tesser, CNR-ISTC Padova, Italy
Unit Selection Text-To-Speech synthesizers follow the "Choose the best to modify the least" strategy, which attempts to minimize signal modification of the originally recorded speech. While this approach yields excellent signal quality, the synthesized speech is limited to the recorded units present in the database. Creating a new Unit Selection voice requires a large amount of audio recording, speech segmentation and analysis, which makes building new TTS characters (a new speaker or a different expressive style) very costly.
A possible solution to this lack of flexibility is to use a voice transformation system to modify the speech signal generated by the Unit Selection synthesizer. Using data-driven statistical analysis and signal processing techniques, it is possible to transform some spectral characteristics of the synthesized speech so that they match those of the chosen target.
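As a rough illustration of what a data-driven spectral transformation can look like, the sketch below fits an affine mapping from source spectral frames to time-aligned target frames by least squares. The abstract does not say which statistical model the prototype uses, so the linear regression here is only a simple stand-in; the function names, the feature dimension and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def fit_affine_transform(source, target):
    """Least-squares fit of target ~= source @ W + b from time-aligned frames."""
    # Append a column of ones so the bias b is estimated together with W.
    ones = np.ones((source.shape[0], 1))
    augmented = np.hstack([source, ones])
    params, *_ = np.linalg.lstsq(augmented, target, rcond=None)
    return params[:-1], params[-1]

def transform(frames, W, b):
    """Apply the learned affine mapping to a matrix of spectral frames."""
    return frames @ W + b

# Synthetic stand-in for aligned spectral features (hypothetical data:
# 200 frames, 4 coefficients per frame). Real systems would extract
# these from parallel recordings of the source and target voices.
rng = np.random.default_rng(0)
source_frames = rng.normal(size=(200, 4))
true_W = rng.normal(size=(4, 4))
true_b = rng.normal(size=4)
target_frames = source_frames @ true_W + true_b

W, b = fit_affine_transform(source_frames, target_frames)
converted = transform(source_frames, W, b)
```

In this toy setting the target is an exact affine function of the source, so the fit recovers it; with real speech the mapping is only approximate and richer statistical models are normally needed.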
A project investigating this direction has recently been launched, and a first prototype of a statistical voice transformation module will be shown.