Annual Report 1999

Table of Contents

Bastiaan Kleijn
Professor of
Speech Signal Processing



The research in the speech signal processing (SSP) group is aimed at the development of better algorithms for speech and audio coding, speech synthesis, and speech enhancement. Research in these areas has made great strides in the last few decades, and its results are now a part of everyday life. For example, speech coding is an enabling technology for modern mobile telephony, and audio coding is now used in many consumer-electronics devices. Despite these accomplishments, the pace of advancement remains very high. We are still far removed from any fundamental bounds on performance, and the economic incentives for more efficient algorithms with better sound quality are strong. Moreover, rapid changes in the telecommunication infrastructure, and in particular the growth of the Internet, have led to new challenges that must be addressed by new algorithms.

The research of the SSP group for 1999 can be separated into five topics:

1) the waveform interpolation speech model (for both coding and synthesis),
2) sinusoidal audio coding,
3) auditory modeling and auditory-model-motivated coding methods,
4) speech and audio coding for the Internet environment, and
5) speech bandwidth extension.

In the following, we describe each of these topics in some detail.

1) Waveform-Interpolation Based Coding and Synthesis

Waveform interpolation (WI) is a state-of-the-art method for modeling the speech signal, with applications in speech coding and synthesis. During 1999, our work on this topic focused on improving the basic model and on adapting it for speech synthesis purposes.

We have made significant modifications to the WI method in terms of basic modeling accuracy. Some of these improvements in the modeling paradigm apply to sinusoidal coding as well. Conventional sinusoidal and waveform-interpolation coders have a modeling error which remains present even when the quantization error is set to zero. In our new method, frame representations that are very similar to those of conventional implementations of the aforementioned coders are used to eliminate the modeling errors. A second problem common to both WI and sinusoidal coders is that their time-frequency localization of the unvoiced speech component is often insufficient to characterize the speech signal in a perceptually accurate manner. We have developed a frame-based method which makes it possible to preserve the time-frequency location of the unvoiced component even when it is characterized with only statistical parameters.

Our work in text-to-speech synthesis is performed jointly with the speech communication group in the department. We are developing a concatenative speech synthesis system based on WI. As in conventional concatenative systems, our synthesis system uses as input a sequence of phonemes, with associated durations and pitch and energy contours, which is produced from text by existing software. Based on this input, our synthesis system selects diphone segments from a pre-recorded database. These diphones are then modified using the WI representation and concatenated so that the resulting speech signal has the correct prosody. The objective is that the speech modification and concatenation procedures do not affect the naturalness of the original speech segments. We expect that the newest WI representation will be particularly effective for speech modification.
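As a rough illustration of the pipeline above, the selection-and-concatenation step can be sketched as follows (all names and the toy diphone database are hypothetical, and the prosody modification via the WI representation is omitted):

```python
# Hypothetical sketch of the concatenative synthesis pipeline:
# phoneme sequence -> diphone selection -> concatenation.

def phonemes_to_diphones(phonemes):
    """Turn a phoneme sequence into the diphone units to retrieve."""
    return [(phonemes[i], phonemes[i + 1]) for i in range(len(phonemes) - 1)]

def synthesize(phonemes, database):
    """Select pre-recorded diphone segments and concatenate them.
    In the real system, each segment would first be modified (pitch,
    duration, energy) via the WI representation before concatenation."""
    samples = []
    for diphone in phonemes_to_diphones(phonemes):
        samples.extend(database[diphone])   # look up the recorded segment
    return samples

# Toy database mapping diphones to short "waveforms" (lists of samples).
db = {("s", "a"): [0.1, 0.2], ("a", "t"): [0.3, 0.1]}
print(synthesize(["s", "a", "t"], db))  # [0.1, 0.2, 0.3, 0.1]
```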

2) Audio Coding

Traditionally, audio coders have used nonparametric descriptions of the signal based on filter banks. However, within the last five years, parametric coding techniques have been shown to facilitate efficient coding of audio signals, particularly at low rates. We have contributed in this area in a collaborative project with Delft University of Technology and Philips Research in Eindhoven, both in the Netherlands.

The project is organized around the joint development of an efficient low-rate audio coder. In this coder, the audio signal is described as the sum of a set of sinusoids and a remainder signal. The remainder signal describes the audio components which cannot be modeled efficiently with sinusoids. The sinusoids are selected using a matching pursuit approach. We have developed a new method where the matching pursuit algorithm uses the same amplitude-complementary windows as are used in the overlap-add synthesis. Enforcing such consistency between analysis and synthesis improves performance significantly over earlier methods. Our sinusoidal estimation is based on a perceptual criterion which includes the effects of neighboring segments. Thus, when analyzing the current segment we take advantage of the forward masking effect due to the estimated sinusoids from the previous segments (possibly overlapping with the current segment). Our experimental results indicate that the number of sinusoids can be reduced significantly with our time masking model, without introducing perceptual artifacts in the reconstructed signal.
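The greedy atom selection at the heart of matching pursuit can be sketched as follows (a minimal illustration with a cosine dictionary; the amplitude-complementary analysis windows and the perceptual criterion described above are omitted):

```python
import math

def matching_pursuit(signal, freqs, n_atoms, sr=8000):
    """Greedy matching pursuit over a dictionary of unit-norm cosines:
    repeatedly pick the atom with the largest correlation to the
    residual, then subtract its contribution."""
    residual = list(signal)
    n = len(signal)
    selected = []
    for _ in range(n_atoms):
        best = None
        for f in freqs:
            atom = [math.cos(2 * math.pi * f * t / sr) for t in range(n)]
            norm = math.sqrt(sum(a * a for a in atom))
            atom = [a / norm for a in atom]
            corr = sum(r * a for r, a in zip(residual, atom))
            if best is None or abs(corr) > abs(best[0]):
                best = (corr, f, atom)
        corr, f, atom = best
        residual = [r - corr * a for r, a in zip(residual, atom)]
        selected.append((f, corr))
    return selected, residual

# A signal that is exactly one dictionary cosine is matched in one step.
sr, n = 8000, 64
sig = [math.cos(2 * math.pi * 500 * t / sr) for t in range(n)]
picked, res = matching_pursuit(sig, [250, 500, 1000], n_atoms=1)
print(picked[0][0])                     # 500 (Hz) is selected first
print(max(abs(r) for r in res) < 1e-6)  # True: residual is near zero
```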

3) Auditory Modeling and Coding

In speech and audio coding, understanding the human perception of the signals is of fundamental importance. Knowledge of human perception may lead to new quantitative criteria for speech quality, and even to new coding algorithms with better sound quality. Our work currently focuses on two areas: the description of phase in voiced speech, which is relevant for low-rate coding, and the development of a new coding paradigm where we code in the perceptual domain rather than in the speech domain.

In our work on phase, we have estimated quantitative measures of the ability of the human auditory system to perceive distortions of the Fourier phase spectrum of the pitch cycle. In our first set of studies, we defined the perceptual phase capacity to be the size of a codebook of phase spectra necessary to represent all possible phase spectra in a perceptually accurate manner. Our experiments indicated that the perceptual phase capacity is much higher for low-pitched speech than for high-pitched speech. These results are consistent with the well-known fact that speech coding schemes which preserve the phase accurately work better for male voices, while coders which put more emphasis on the amplitude spectrum of the speech signal yield better quality for female speech. While the perceptual phase capacity aids our understanding of the human auditory system, it is only a first step toward determining the bit rate required for a perceptually accurate representation of the phase in voiced speech. We are currently working on measuring the perceptual phase entropy, which provides a direct estimate of the bit-rate requirement of the phase information.
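The connection between a measured phase capacity and a bit rate is the base-2 logarithm of the codebook size; a minimal illustration (the codebook size below is hypothetical, not a measured value):

```python
import math

def phase_rate_bits(codebook_size):
    """Bits needed to index one entry of a phase-spectrum codebook
    of the given size, i.e. the rate implied by a phase capacity."""
    return math.log2(codebook_size)

print(phase_rate_bits(1024))  # 10.0 bits per pitch-cycle phase spectrum
```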

The inclusion of more sophisticated auditory models in existing speech coding paradigms is generally fraught with difficulties. This was the motivation for a collaborative project with Vienna University of Technology. In this project, we are exploring new coding procedures which facilitate the usage of such auditory models. We have worked towards a new speech coding paradigm in which the coding is performed in a domain where the single-letter squared-error criterion forms an accurate measure of perceived distortion. For this purpose, we need a model of the auditory periphery which is accurate, can be inverted with relatively low computational effort, and which represents the signal with relatively few parameters. The results of our first coding system which uses this principle indicate that the new paradigm in general and the particular auditory model we developed form a promising basis for the coding of both speech and audio at low bit rates.

4) Source Coding for the Internet

For terrestrial voice transmission, networks based on the Internet protocol are taking over from circuit-switched networks. In the future, the same transition is expected for wireless networks. On the Internet, the speech signal is transmitted in the form of packets, and packets can be lost or delayed. This leads to quality degradation, particularly in real-time tasks such as voice transmission. These problems can be addressed by increasing the network capacity, by introducing various levels of service quality, and by designing new coding and buffering procedures optimized for this type of environment. In practice, the latter approach can provide a major boost to network efficiency, particularly for wireless networks. However, existing speech coders do not function well on the Internet. In the Internet environment, the capacity is governed by an average rate rather than a peak rate, and the error characteristics differ greatly from those of circuit-switched networks. Thus, there is a need for a new generation of variable-rate source coders which exhibit graceful degradation in quality when packets are lost. We have started work on several new source-coding paradigms designed to perform well in this new environment.

One of our efforts towards source coding for the Internet is based on the coding paradigm motivated by auditory modeling. We demonstrated that our approach to speech and audio coding in a perceptual domain provides an implicit forward error concealment mechanism to handle packet erasures in a packet network. To this end, the individual acoustic subchannels of our auditory model are grouped into different transport subchannels or packets. Due to the strongly overlapping, redundant filter-bank structure of the model, reconstruction of speech without audible degradation becomes possible even if a significant percentage of channels is erased.
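The grouping of acoustic subchannels into transport packets can be sketched as a simple interleaving (a hypothetical assignment for illustration; the actual grouping used in our coder is not specified above):

```python
def group_subchannels(n_channels, n_packets):
    """Interleave acoustic subchannels across transport packets, so that
    losing one packet erases only every n_packets-th subchannel rather
    than a contiguous band."""
    packets = [[] for _ in range(n_packets)]
    for ch in range(n_channels):
        packets[ch % n_packets].append(ch)
    return packets

def surviving_channels(packets, lost):
    """Subchannels still available after the packets in `lost` are erased."""
    return sorted(ch for i, p in enumerate(packets) if i not in lost
                  for ch in p)

pkts = group_subchannels(8, 4)
print(pkts)                           # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(surviving_channels(pkts, {2}))  # [0, 1, 3, 4, 5, 7]
```

With a redundant, strongly overlapping filter bank, the surviving interleaved subchannels still cover the whole spectrum, which is what makes reconstruction without audible degradation possible.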

5) Speech Enhancement

Our work on speech enhancement has mostly been related to bandwidth extension of speech. Speech transmission over conventional telephone networks is limited to less than 4 kHz of bandwidth, which creates the typical telephone sound quality. While it is well known that wide-band speech sounds significantly better than this narrow-band signal, the existing infrastructure has prevented the widespread introduction of wide-band signals. Thus, there is a strong incentive to develop effective bandwidth extension methods, which create wide-band speech from narrow-band speech. We have been working on new methods for regenerating the high-frequency band (4 to 7 kHz). Our methods are based on vector quantization of the mel-frequency cepstral coefficients. We found that these procedures performed better than alternative methods. More importantly, our tests demonstrate that the reconstructed wide-band speech is significantly more pleasant to the human ear than the original narrow-band speech, even though some artifacts in the reconstructed wide-band signal remain audible. We expect to remove most of these artifacts in the future.
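The codebook mapping at the core of such a vector-quantization scheme can be sketched as follows (the toy two-entry codebooks stand in for trained mel-frequency cepstral codebooks, and all values are illustrative):

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared error)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2
                                 for c, v in zip(codebook[i], vec)))

def extend_envelope(nb_features, nb_codebook, hb_codebook):
    """Toy sketch of VQ-based bandwidth extension: quantize the
    narrow-band feature vector, then output the high-band spectral
    envelope paired with the winning codebook entry."""
    return hb_codebook[nearest(nb_codebook, nb_features)]

nb_cb = [[0.0, 0.0], [1.0, 1.0]]        # narrow-band feature centroids
hb_cb = [[-10.0, -20.0], [-3.0, -6.0]]  # paired high-band envelopes (dB)
print(extend_envelope([0.9, 1.1], nb_cb, hb_cb))  # [-3.0, -6.0]
```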

The success of the bandwidth extension has led us to quantify the information shared between the original and reconstructed signals. In particular, we investigated the mutual information in speech between the spectral envelope of the high-frequency band and low-frequency bands of various widths. Direct computation of the mutual information often requires an excessive amount of data, even for modest problem sizes. We addressed this problem by using techniques based on quantized descriptions. Our simulations show that the slope and the energy of the high band (4 to 7 kHz) share less than 1 bit of information with the low-frequency band (0 to 4 kHz) of the speech signal. These results suggest that very little information about the high band suffices for the creation of a perceptually satisfactory wide-band signal.
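The quantization-based approach can be sketched with a plug-in (histogram) estimate of mutual information over quantized sample pairs (a generic textbook estimator, not our exact procedure):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X;Y) in bits from quantized (x, y) samples:
    I = sum p(x,y) * log2(p(x,y) / (p(x) p(y)))."""
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum(c / n * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# Perfectly dependent quantized variables share log2(2) = 1 bit...
print(mutual_information([(0, 0), (1, 1)] * 50))                  # 1.0
# ...while independent ones share none.
print(mutual_information([(0, 0), (0, 1), (1, 0), (1, 1)] * 25))  # 0.0
```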

Published by: TMH, Speech, Music and Hearing

Last updated: 2004-10-25