Annual Report 1999

Bastiaan Kleijn
Professor of
Speech Signal Processing
Introduction
The research in the speech signal processing (SSP) group is aimed at
the development of better algorithms for speech and audio coding, speech synthesis, and speech enhancement. Research
in these areas has made great strides in the last few decades and its results are now a part of everyday life.
For example, speech coding is an enabling technology for modern mobile telephony and audio coding is now used in
many consumer-electronics devices. Despite these accomplishments of the recent past, the pace of advancement remains
very high. We are still far removed from any fundamental bounds on performance and the economic incentives for
more efficient algorithms with better sound quality are strong. Moreover, rapid changes in the telecommunication
infrastructure, and in particular the growth of the Internet, have led to new challenges which must be addressed
by new algorithms.
The research of the SSP group for 1999 can be separated into five topics:
1) the waveform interpolation speech model (for both coding and synthesis),
2) sinusoidal audio coding, 3) auditory modeling and auditory-model-motivated coding methods, 4) speech and audio
coding for the Internet environment, and 5) speech bandwidth extension. In the following, we will describe these
individual topics in some detail.
1) Waveform-Interpolation Based Coding and Synthesis
Waveform interpolation (WI) is a state-of-the-art method of modeling
the speech signal with applications in speech coding and synthesis. During 1999, our work on this topic focused
on improving the basic model and on adapting the model for speech synthesis purposes.
We have made significant modifications to the WI method in terms of basic
modeling accuracy. Some of these improvements in the modeling paradigm apply to sinusoidal coding as well. Conventional
sinusoidal and waveform interpolation coders have a modeling error which remains present even when the quantization
error is set to zero. In our new method, frame representations, which are very similar to conventional implementations
of the aforementioned coders, are used to eliminate the modeling errors. A second problem common to both WI and
sinusoidal coders is that their time-frequency localization of the unvoiced speech component is often insufficient
to characterize the speech signal in a perceptually accurate manner. We have developed a frame-based method which
makes it possible to preserve the time-frequency location of the unvoiced component even when it is characterized
with only statistical parameters.
Our work in text-to-speech synthesis is performed jointly with the speech communication group in
the department. We are developing a concatenative speech synthesis system based on WI. As in conventional concatenative
systems, our synthesis system uses as input a sequence of phonemes, with associated durations and pitch and energy
contours, which is produced from text by existing software. Based on this input, our synthesis system selects diphone
segments from a pre-recorded database. These diphones are then modified using the WI representation and concatenated
to obtain the correct prosody of the speech signal. The objective is that the speech modification and concatenation
procedures do not affect the naturalness of the original speech segments. We expect that the newest WI representation
will be particularly effective for speech modification.
2) Audio Coding
Traditionally, audio coders have used nonparametric descriptions of the
signal based on filter banks. However, within the last five years, parametric coding techniques have been shown
to facilitate efficient coding of audio signals, particularly at low rates. We have contributed in this area in
a collaborative project with Delft University of Technology and Philips Research in Eindhoven, both in the Netherlands.
The project is arranged around the joint development of an efficient
low-rate audio coder. In this coder, the audio signal is described as the sum of a set of sinusoids and a remainder
signal. The remainder signal describes the audio components which cannot be modeled efficiently with sinusoids.
The sinusoids are selected using a matching pursuit approach. We have developed a new method where the matching
pursuit algorithm uses the same amplitude-complementary windows as are used in the overlap-add synthesis. Using
such consistency between analysis and synthesis improves performance significantly over earlier methods. Our sinusoidal
estimation is based on a perceptual criterion which includes the effects of neighboring segments. Thus, when analyzing
the current segment we take advantage of the forward masking effect due to the estimated sinusoids from the previous
segments (possibly overlapping with the current segment). Our experimental results indicate that the number of
sinusoids can be reduced significantly with our time masking model, without introducing perceptual artifacts in
the reconstructed signal.
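The greedy selection of sinusoids described above can be illustrated with a minimal sketch. It is not the coder developed in the project: it omits the perceptual criterion and the time-masking model entirely, uses a plain energy criterion instead, and restricts frequencies to an integer-bin grid. The function names (hann_window, matching_pursuit) and the choice of a periodic Hann window are assumptions made for illustration; the Hann window is used only because its half-overlapped copies sum to one, which makes it amplitude complementary for overlap-add synthesis. Each step fits a windowed cosine/sine pair to the residual by least squares and keeps the pair that removes the most energy.

```python
import math

def hann_window(n_samples):
    """Periodic Hann window; copies shifted by half a frame sum to one."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_samples)
            for n in range(n_samples)]

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(segment, window, n_sinusoids):
    """Greedily pick windowed sinusoids on an integer-bin frequency grid.

    Returns (picks, residual); picks holds (bin, cos_amp, sin_amp) tuples.
    The selection criterion is plain energy reduction (an assumption of
    this sketch), not a perceptual measure.
    """
    n_samples = len(segment)
    residual = list(segment)
    picks = []
    for _ in range(n_sinusoids):
        best = None
        for k in range(1, n_samples // 2):
            omega = 2.0 * math.pi * k / n_samples
            # Atoms are shaped by the same window used for synthesis.
            c = [window[n] * math.cos(omega * n) for n in range(n_samples)]
            s = [window[n] * math.sin(omega * n) for n in range(n_samples)]
            cc, ss, cs = dot(c, c), dot(s, s), dot(c, s)
            rc, rs = dot(residual, c), dot(residual, s)
            det = cc * ss - cs * cs
            if det < 1e-12:
                continue  # degenerate pair, skip this frequency
            # Least-squares amplitudes from the 2x2 normal equations.
            a = (rc * ss - rs * cs) / det
            b = (rs * cc - rc * cs) / det
            gain = a * rc + b * rs  # energy removed from the residual
            if best is None or gain > best[0]:
                best = (gain, k, a, b, c, s)
        _, k, a, b, c, s = best
        residual = [r - a * ci - b * si for r, ci, si in zip(residual, c, s)]
        picks.append((k, a, b))
    return picks, residual
```

A perceptually motivated variant would replace the energy gain with a masking-weighted measure, so that sinusoids already masked by components from previous segments are not selected.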
3) Auditory Modeling and Coding
In speech and audio coding, understanding the human perception of the
signals is of fundamental importance. Knowledge of human perception may lead to new quantitative criteria for speech
quality, and even to new coding algorithms with better sound quality. Our work currently focuses on two areas:
the description of phase in voiced speech, which is relevant for low-rate coding, and the development of a new
coding paradigm where we code in the perceptual domain rather than in the speech domain.
In our work on phase, we have made estimates of quantitative measures
of the ability of the human auditory system to perceive distortions of the Fourier phase spectrum of the pitch
cycle. In our first set of studies, we defined the perceptual phase capacity to be the size of a codebook of phase
spectra necessary to represent all possible phase spectra in a perceptually accurate manner. Our experiments indicated
that the perceptual phase capacity for low-pitched speech is much higher than it is for high-pitched speech. These
results are consistent with the well-known fact that speech coding schemes which preserve the phase accurately
work better for male voices, while coders which put more emphasis on the amplitude spectrum of the speech signal
result in better quality for female speech. While the perceptual phase capacity measure aids in our understanding
of the human auditory system, it is only a first step in determining the bit rate required for a perceptually accurate
representation of the phase in voiced speech. We are currently working on measuring the perceptual phase entropy
which provides a direct estimate of the bit-rate requirement of the phase information.
The inclusion of more sophisticated auditory models in existing speech
coding paradigms is generally fraught with difficulties. This was the motivation for a collaborative project with
Vienna University of Technology. In this project, we are exploring new coding procedures which facilitate the usage
of such auditory models. We have worked towards a new speech coding paradigm in which the coding is performed in
a domain where the single-letter squared-error criterion forms an accurate measure of perceived distortion. For
this purpose, we need a model of the auditory periphery which is accurate, can be inverted with relatively low
computational effort, and which represents the signal with relatively few parameters. The results of our first
coding system which uses this principle indicate that the new paradigm in general and the particular auditory model
we developed form a promising basis for the coding of both speech and audio at low bit rates.
4) Source Coding for the Internet
For terrestrial transmission of voice, networks based on the Internet
protocol are taking over from transmission over circuit-switched networks. In the future, the same transition is
expected for wireless networks. In the Internet, the speech signal is transmitted in the form of packets, and the
packets can be lost or delayed. This leads to quality degradation, particularly in real-time tasks, such as voice
transmission. These problems can be addressed by increasing the network capacity, introducing various service quality
levels, and by designing new coding and buffering procedures which are optimized for this type of environment.
In practice, the last of these approaches can provide a major boost to network efficiency, particularly for wireless networks.
However, existing speech coders do not function well on the Internet. In the Internet environment, the capacity
is governed by an average rate rather than the peak rate, and the properties of errors are very different from
those in circuit-switched networks. Thus, there is a need for a new generation of variable-rate source coders,
which exhibit graceful degradation in quality when packets are lost. We have started work on several new source
coding paradigms which are designed to perform well in the new environment.
One of our efforts towards source coding for the Internet is based on
the coding paradigm motivated by auditory modeling. We demonstrated that our approach to speech and audio coding
in a perceptual domain provides an implicit forward error concealment mechanism to handle packet erasures in a
packet network. To this end, the individual acoustic subchannels of our auditory model are grouped into different
transport subchannels or packets. Due to the strongly overlapping, redundant filter-bank structure of the model,
reconstruction of speech without audible degradation becomes possible even if a significant percentage of channels
is erased.
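The principle that a redundant representation tolerates channel erasures can be shown with a toy sketch. It is not the auditory model: the "signal" is a two-dimensional vector, the "channels" are projections onto evenly spaced unit vectors, and all names (analysis_frame, analyze, reconstruct) are hypothetical. Lost channels are simply left out of a least-squares reconstruction, which remains exact as long as the surviving analysis vectors still span the signal space.

```python
import math

def analysis_frame(n_channels):
    """Redundant set of unit vectors in the plane (toy stand-in for a filter bank)."""
    return [(math.cos(math.pi * i / n_channels),
             math.sin(math.pi * i / n_channels))
            for i in range(n_channels)]

def analyze(x, frame):
    """One coefficient per channel: the projection of x onto that channel's vector."""
    return [f[0] * x[0] + f[1] * x[1] for f in frame]

def reconstruct(coeffs, frame, received):
    """Least-squares reconstruction using only the channels listed in `received`."""
    g00 = g01 = g11 = b0 = b1 = 0.0
    for i in received:
        f0, f1 = frame[i]
        g00 += f0 * f0
        g01 += f0 * f1
        g11 += f1 * f1
        b0 += f0 * coeffs[i]
        b1 += f1 * coeffs[i]
    det = g00 * g11 - g01 * g01
    if abs(det) < 1e-12:
        raise ValueError("surviving channels do not span the signal space")
    # Solve the 2x2 normal equations for the signal estimate.
    return ((b0 * g11 - b1 * g01) / det, (b1 * g00 - b0 * g01) / det)

frame = analysis_frame(6)
coeffs = analyze((0.3, -1.2), frame)
# Four of the six channels are erased; channels 1 and 4 still determine the signal.
estimate = reconstruct(coeffs, frame, received=[1, 4])
```

In a real system the coefficients are quantized, so reconstruction after erasures is approximate rather than exact, and the achievable quality depends on how much redundancy remains.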
5) Speech Enhancement
Our work on speech enhancement has mostly been related to bandwidth extension
of speech. Speech transmission over conventional telephone networks is limited to a bandwidth of less than 4 kHz,
thus creating the typical telephone sound quality. While it is well known that wide-band speech sounds significantly
better than this narrow-band signal, the existing infrastructure has prevented the widespread introduction of wide-band
signals. Thus, there is a strong incentive for the development of effective bandwidth extension methods, which
can create wide-band speech from narrow-band speech. We have been working on new methods for regenerating the high
frequency band (4 to 7 kHz). Our methods have been based on vector quantization of the mel-frequency cepstral coefficients.
We found that these procedures performed better than alternative methods. More importantly, our tests demonstrate
that the reconstructed wide-band speech is significantly more pleasant to the human ear than the original narrow-band
speech, despite the fact that some artifacts in the reconstructed wide-band signal remain audible. We expect to
remove most of these artifacts in the future.
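One generic way in which vector quantization can be used for bandwidth extension is a paired codebook: the narrow-band feature vector is matched to its nearest codeword, and the high-band description stored under the same index is returned. The sketch below shows only this lookup step. It is an illustration under assumptions, not a description of the system reported here; the function names and the tiny codebooks in the usage lines are invented, and real codebooks would be trained on feature vectors such as mel-frequency cepstral coefficients.

```python
def nearest_index(vector, codebook):
    """Index of the codeword closest to `vector` in squared Euclidean distance."""
    def sqdist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda i: sqdist(vector, codebook[i]))

def extend_band(narrow_features, narrow_codebook, high_codebook):
    """Return the high-band entry paired with the best-matching narrow-band codeword."""
    return high_codebook[nearest_index(narrow_features, narrow_codebook)]

# Toy codebooks with made-up values, for illustration only.
narrow_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
high_codebook = [[-10.0], [-20.0], [-30.0]]
high_band = extend_band([0.9, 1.2], narrow_codebook, high_codebook)
```

Here the input is closest to the second narrow-band codeword, so the second high-band entry is selected.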
The success of the bandwidth extension has led us to measure quantitatively
the shared information between the original and reconstructed signals. In particular, we investigated the mutual
information in speech between the spectral envelope of the high frequency band and low frequency bands of various
widths. Direct methods for computing the mutual information often require an excessive amount of data
even for modest situations. We addressed this problem by using techniques based on quantized descriptions.
Our simulations show that the slope and the energy of the high band (4 to 7 kHz) share less than 1 bit of information
with the low frequency band (0 to 4 kHz) in the speech signal. These results suggest that very little information
about the high band suffices for the creation of a perceptually satisfactory wide-band signal.
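To make the idea of estimating mutual information from quantized descriptions concrete, the following sketch computes the standard plug-in estimate from paired quantizer indices. It is a generic estimator given for illustration, not necessarily the procedure used in our simulations; the function name mutual_information_bits and the input layout (a list of index pairs, one index for the low band and one for the high band per frame) are assumptions.

```python
import math
from collections import Counter

def mutual_information_bits(pairs):
    """Plug-in estimate of I(X;Y) in bits from a list of (x_index, y_index) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written in terms of counts.
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi
```

Two caveats apply to any estimate of this kind. Quantizing the variables cannot increase their mutual information, so with enough data the result is a lower bound on the value for the unquantized variables. With few samples per cell, however, the plug-in estimate is biased upward, which is why the number of quantization cells must be kept small relative to the amount of data.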