Kåre Sjölander, Jonas Beskow, Joakim Gustafson, Erland Lewin, Rolf Carlson, and Björn Granström
Department of Speech, Music, and Hearing, KTH, Sweden


This paper describes the efforts at KTH in creating educational tools for speech technology. The demand for such tools is increasing with the advent of speech as a medium for man-machine communication. The World Wide Web was chosen as our platform in order to increase the usability and accessibility of our computer exercises. The aim was to provide dedicated educational software instead of exercises based on complex research tools. Currently, the set of exercises comprises basic speech analysis, multi-modal speech synthesis and spoken dialogue systems. Students access web pages in which the exercises have been embedded as applets. This makes it possible to use them in a classroom setting, as well as from the students' home computers.


The speech group at KTH has developed a number of speech technology tools for use in the education of undergraduate students or researchers in the speech field. Many of these tools have so far been developed for a specific computer environment and have also required teacher guidance. During the last year, development work was started on a toolkit for spoken language technology that can be used over the Internet. The aim is to free the students from the need to use a particular computer at a particular time and place.

A speech technology toolkit that serves as a basis for the creation of spoken language systems has been developed. This toolkit is partly based on the software technology in our existing spoken dialogue systems. A new addition is the Broker Architecture [1], an architecture for communication between programs on different computers using the Internet. We have used the speech toolkit to build three web-based educational modules.

Our courses on speech technology include an introductory section on basic phonetics and speech analysis. A set of exercises for this section has been developed in which students analyze their own speech in various ways.

An interactive tool for working with parametric speech synthesis has been developed. The tool facilitates editing of parameter tracks, and it provides real-time feedback of the synthesized speech. It serves as an interface to KTH's multilingual rule-based synthesis system, and can be used to control a formant synthesizer as well as a 3-D "talking head".

The integrated lab environment GULAN [2] for spoken dialogue systems has seen further development in the area of dialogue management [3] in cooperation with the NLP lab at the University of Linköping. The system has also been redesigned for web deployment.



The Broker Architecture relays function calls, results and error conditions between modules in text form over standard TCP internet connections. A number of programming languages are used in the modules of our toolkit.

Java [4], a powerful object-oriented programming language, is used in components that require complex data structures.

C/C++ is used where maximum performance is required, for example, for the speech recognition and speech synthesis engines.

The Tcl/Tk language [5] is used for user interfaces and as a glue language in some modules. It was chosen because of several useful characteristics. It is platform independent and makes it simple to integrate existing modules and applications. It has powerful and easy-to-use networking facilities. The accompanying Tk toolkit provides facilities for quickly creating GUIs. Because Tcl is a scripting language, it is easy to change and extend the functionality of applications, as well as to maintain them. The main drawbacks of the Tcl language are its execution speed and primitive syntax. However, these can be overcome by implementing complex and time-critical code in a more powerful language. Also, it is possible to run scripts which are embedded in web pages and which download quickly because of their relatively compact text representation. In all, this makes it an ideal solution for computer-based instruction and distance learning.
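Because the Broker Architecture relays function calls, results and error conditions as text, modules written in different languages can interoperate. The following is a minimal sketch of that idea in Python, not the actual broker protocol: the one-JSON-object-per-line wire format, the `expose` registry, and the `lexicon.lookup` function with its toy entry are all assumptions made for illustration.

```python
import json

# Hypothetical registry of functions that modules expose for remote calls.
REGISTRY = {}


def expose(name):
    """Register a function under a module-qualified name."""
    def wrap(func):
        REGISTRY[name] = func
        return func
    return wrap


@expose("lexicon.lookup")
def lookup(word):
    # Toy stand-in for a real module; the entry below is made up.
    entries = {"hej": "h e j"}
    return entries[word]


def encode_call(name, *args):
    """Serialize a call as one line of text (assumed format: JSON per line)."""
    return json.dumps({"call": name, "args": list(args)}) + "\n"


def handle_line(line):
    """Decode one request line, run the call, and return one reply line."""
    try:
        request = json.loads(line)
        result = REGISTRY[request["call"]](*request["args"])
        reply = {"result": result}
    except Exception as exc:
        # Error conditions are relayed to the caller as text as well.
        reply = {"error": f"{type(exc).__name__}: {exc}"}
    return json.dumps(reply) + "\n"
```

In a deployment along the lines described above, each request and reply line would travel over a TCP connection between modules; in this sketch the two functions are simply called directly, for example `handle_line(encode_call("lexicon.lookup", "hej"))`.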


One of our extensions to the Tcl language is the Snack speech analysis module [6]. It provides a uniform interface to the audio hardware on a number of platforms, adding commands to play, record and manipulate sound in many audio formats, as well as disk I/O in common audio file formats. Also, it has streaming audio capabilities, which make it easy to create client/server audio applications. There are commands to visualize sounds using waveforms, spectrograms, and spectrum sections. The Snack module serves as a basis when creating customized recording tools, speech analysis applications, audio annotation tools, demonstrators, and the like. The module has a powerful and intuitive way of handling sounds as objects. A spectrogram object connected to a sound object will update automatically and in real time as the sound data changes. The module also supports PostScript printing, which makes it possible to create hard copies or, for example, illustrations. Currently, it is possible to write platform-independent scripts which run on Unix (Linux, Solaris, HP-UX, IRIX) and Windows 95/NT using the Snack module. It is also possible to run scripts embedded in web pages through the use of the Tcl plug-in. The Snack module can be freely downloaded from http://www.speech.kth.se/SNACK/download.html. A short example of how this module can be used follows:



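package require snack

# Create an initially empty sound object
sound snd
# Spectrogram linked to the sound; it redraws as the sound data changes
pack [spectrogram .s -sound snd -height 200]
# Buttons that invoke the record and stop commands of the sound object
pack [button .r -text Record -command {snd record}]
pack [button .t -text Stop -command {snd stop}]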

The example creates a simple real-time spectrogram application. A sound object called snd is created, which is initially empty. Next, a spectrogram is created and linked to the sound. Finally, two buttons labeled Record and Stop are created. When clicked, these buttons execute the record and stop commands, respectively, of the sound object. As recording commences, the spectrogram updates in real time to reflect the changing contents of the sound object. There are numerous options to handle analysis bandwidth, scales, and similar properties. The example script could easily be extended with, for example, the ability to play a recording:


pack [button .p -text Play -command {snd play}]

In order to be able to save the recording the following line could be added:


pack [button .w -text Save -command {snd write [tk_getSaveFile]}]


Also, the script above would run without modification if embedded in a web page, except for the save function, which would need special privileges.


3.1. Speech Analysis Exercises

In our courses on speech technology we have an introductory section on basic phonetics and speech analysis. For this section a set of exercises was developed in which students analyze their own speech in various ways. These exercises are accessed through web pages, in which simple speech analysis tools have been embedded as small applications (applets) dedicated to the task at hand (http://www.speech.kth.se/labs/analysis/). In this way it was possible to make these exercises available to students working in our laboratory, at Linköping University, and from their home computers. The big advantage of using a web browser as a platform is that all installation issues are solved, except for the download and installation of plug-ins. Instructions and other useful information can also accompany the tools in a natural and easily accessible way, using HTML. A screenshot of one of the exercises is shown in Figure 1. The exercises covered measurements of vowel formant frequencies, comparisons of speakers and speaking styles, Swedish word accent, and phonetic segmentation.


An interactive tool for working with parametric speech synthesis has been developed. The tool facilitates editing of parameter tracks, and it provides real-time feedback of the synthesized speech. It serves as an interface to KTH's multilingual rule-based synthesis system [7, 8], and can be used to control a formant synthesizer as well as a 3-D "talking head" [9]. The tool can run either as a stand-alone application or as an applet in a web browser. It has been used in research and education during the past two years.

The tool gives full control over all parameters involved in the formant synthesis process, including formant frequencies and bandwidths, fundamental frequency and voice source parameters. The user can select a language (Swedish, French, American English or German) and synthesize arbitrary text, either in orthographic or phonetic mode. Once the phonetic transcription is generated, the synthesizer produces the control parameter tracks in a two-step process. First, the phonetic rules generate a series of control points for each parameter that define a target track. Next, filters are applied to the target track to create a smoothed continuous track to be output to the synthesizer. The filter type and coefficients may differ between parameters, and the filter coefficients for a given parameter may be time varying, under the control of another parameter.
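The two-step track generation described above can be pictured with a small sketch. It is written in Python for illustration and is not the KTH rule system: the step-wise target track, the one-pole low-pass filter, the 5 ms frame length, and the F2 control point values are all assumptions, and the time-varying filter coefficients mentioned above are not modelled.

```python
def target_track(points, n_frames, frame_ms=5.0):
    """Step 1: hold each control point's value until the next point."""
    points = sorted(points)
    track = []
    idx = 0
    for frame in range(n_frames):
        t = frame * frame_ms
        # Advance to the last control point at or before time t.
        while idx + 1 < len(points) and points[idx + 1][0] <= t:
            idx += 1
        track.append(points[idx][1])
    return track


def smooth(track, coeff):
    """Step 2: one-pole low-pass filter, y[n] = c*y[n-1] + (1-c)*x[n]."""
    out = []
    prev = track[0]
    for x in track:
        prev = coeff * prev + (1.0 - coeff) * x
        out.append(prev)
    return out


# Made-up F2 control points as (time in ms, frequency in Hz).
f2_points = [(0.0, 1100.0), (50.0, 1700.0)]
f2_target = target_track(f2_points, n_frames=20)
f2_smooth = smooth(f2_target, coeff=0.6)
```

The target track jumps abruptly from 1100 Hz to 1700 Hz at 50 ms, whereas the smoothed track approaches the new target gradually. A larger `coeff` gives a slower transition, which hints at why different parameters may call for different filter settings.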

The main interface, shown in Figure 2, consists of a number of panels that display the parameter tracks, a time scale, a menu bar, a horizontal scrollbar and a status bar. Each of the panels has an associated value-scale and controls for vertical zooming and scrolling. The panels are stacked and aligned vertically, in such a way that all panels share the same time scale, and horizontal scrolling affects all panels. Typically, related parameters or parameters of the same unit are displayed together in one panel. The default configuration contains three panels, displaying formant parameters, fundamental frequency and source parameters respectively. The parameter tracks in the formant panel can be overlaid on top of a spectrogram. Parameter tracks are edited in an intuitive way by dragging the control points. Control points can be freely inserted or deleted, and segment durations can be lengthened or shortened using a time scale at the bottom of the display.


4.1. Speech Synthesis Exercises

Students are given a number of tasks to accomplish using the editing tool. The first task is to change the identity of the consonant in a synthesized CV syllable by manipulating the formant transitions. For example, to change /ba/ to /da/, the transitions of the second and third formants into the vowel part have to be changed from rising to falling. In the second task, the students apply the knowledge gained in the previous speech analysis exercises, where they studied the pitch contour of Swedish tones. Using a set of minimal pairs with respect to tone, the task is to synthesize the first word in the pair and manipulate the F0 contour to arrive at the other word. A similar exercise involves changing the meaning of a word by modifying vowel length, vowel quality and stress. The last task is to experiment with prosodic modifications at sentence level, such as changing a statement into a question.
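The first task can be illustrated with a short sketch in Python. It only shows what changing a transition from rising to falling means for a single formant track; the linear shape and all frequency values are illustrative assumptions, not measurements or output of the KTH synthesizer.

```python
def transition(onset_hz, vowel_hz, n_frames):
    """Linear formant track from the consonant release to the vowel target."""
    step = (vowel_hz - onset_hz) / (n_frames - 1)
    return [onset_hz + step * i for i in range(n_frames)]


def direction(track):
    """Classify a transition by comparing its first and last values."""
    if track[-1] > track[0]:
        return "rising"
    if track[-1] < track[0]:
        return "falling"
    return "level"


# Illustrative F2 values in Hz (assumed, not measured): both syllables end
# at the same vowel target, and only the onset of the transition differs.
VOWEL_F2 = 1200.0
ba_f2 = transition(900.0, VOWEL_F2, n_frames=6)   # rises into the vowel
da_f2 = transition(1700.0, VOWEL_F2, n_frames=6)  # falls into the vowel

print(direction(ba_f2), direction(da_f2))
```

In the editing tool the same change is made by dragging the control points at the start of the syllable, while the vowel target is left untouched.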

Formant-based synthesis is sometimes compared to more commercially popular synthesis methods based on concatenation of segments recorded from natural speech. Concatenation-based synthesis can offer high voice quality, but is limited in flexibility. Typically, only pitch and duration can be altered freely. In contrast, we feel that formant-based synthesis has significant pedagogical value. By using a parametric synthesis paradigm based on a familiar phonetic representation and controlled from an intuitive graphical interface, exercises can be designed that provide the students with a deeper understanding not only of fundamental speech synthesis techniques, but also of acoustic-phonetic correlates in general.


The educational dialogue system GULAN [2] has been redesigned into an application which is accessed through a web page. In this dialogue system, users can use speech to make simple queries on selected topics in the web-based Yellow Pages. Results are presented using a combination of synthesized speech and an interactive map. Our aim is to give the students hands-on experience by letting them use the system on their own, examine it in detail, and extend its functionality. In this way, we hope to give them an understanding of the problems and issues involved in building dialogue systems and to spur their interest in spoken dialogue technology and its possibilities. Recently, the system has been redesigned with an improved dialogue manager described in [3]. This system also makes extensive use of the broker for modules such as recognition and synthesis in order to make the system more lightweight.

5.1. Dialogue System Exercises

The students were given a set of tasks to complete. First, they had to use the system in order to figure out what it could and could not do. This also included experimenting with the speech recognition component itself in order to understand its current limitations regarding, for example, speaking style, vocabulary, and grammar. The principal task was to extend the system with new fields from the Yellow Pages. New words and phrases had to be added to the lexicon, and the grammar had to be modified accordingly. This was an interactive process in which the students could listen to the transcriptions using the text-to-speech system. Immediately after loading the updated lexicon into the running recognizer, they could use the new words. They also had to extend the text generation capabilities in order to handle the new fields. The students could also modify the prosodic patterns of the synthesized responses.


In this paper, some of our efforts in creating web-based educational tools for speech technology have been presented. Freeing our students from the need to use certain computers or special laboratory set-ups at certain hours has proved extremely useful and also necessary, considering that the total number of students has risen from about 20 to 150 in two years, and especially bearing in mind that one of our courses was given at Linköping University.

Much remains to be done, but the basic framework has shown its strength. Our current systems will be continuously developed and updated and their scope is also going to be widened. The speech analysis module will be extended in order to make resynthesis possible in conjunction with the speech synthesis tool. The educational dialogue system will be improved. Modules for multimodal synthesis and prosodic analysis will be added, as well as dialogue dependent speech recognition and speech synthesis. A main focus will be the continued development of the highly flexible dialogue manager in cooperation with Linköping University.

We believe that the Internet will play an increasingly important role in making speech technology available anywhere for educational and co-operative purposes. Our investment in the web-based modular approach has already paid off in terms of effortless portability and easy implementation of demonstrators.


This work was in part supported by the Centre for Speech Technology (CTT) and the Swedish Language Technology Program.


  1. Lewin, E., "The Broker Architecture", http://www.speech.kth.se/proj/broker/.
  2. Sjölander, K., and Gustafson, J., "An Integrated System for Teaching Spoken Dialogue Systems Technology", Proceedings of Eurospeech '97, Rhodes, Greece, 1997.
  3. Gustafson, J., Elmberg, P., Carlson, R., and Jönsson, A., "An educational dialogue system with a user controllable dialogue manager", Proceedings of ICSLP '98, Sydney, Australia, 1998.
  4. Sun Microsystems, Inc., "The Java Technology Homepage", http://java.sun.com/.
  5. Ousterhout, J. K., "Tcl and the Tk Toolkit", Addison Wesley, ISBN 3-89319-793-1, 1994.
  6. Sjölander, K., "The Snack Sound Visualization Module", http://www.speech.kth.se/SNACK/.
  7. Carlson, R., Granström, B., and Hunnicutt, S., "A multi-language text-to-speech module", Proceedings of ICASSP '82, Paris, Vol. 3, pp 1604-1607, 1982.
  8. Carlson, R., Granström, B., and Karlsson, I., "Experiments with voice modelling in speech synthesis", Speech Communication 10, pp 481-489, 1991.
  9. Beskow, J., "Rule-based Visual Speech Synthesis", Proceedings of Eurospeech '95, Madrid, Spain, September 1995.