Version 2.0; December 20th, 1999
(View version 1.01 of this document)
The most recent version of this document can be found at http://www.speech.kth.se/cost250/polycost/be/latest.
Four baseline experiments are defined: text-dependent speaker verification (SV) on a fixed password sentence in English, text-prompted SV on a digit sequence spoken in English, text-independent SV on free speech in the subject's mother tongue, and finally text-independent speaker identification on the same free speech. The definition of a baseline experiment includes the definition of client and impostor speakers and speakers for training a non-client model; sessions for enrollment and test; which speech items to use and how to compute and present results.
The present version (2.0) of this document is a revision of version 1.0. Some major changes have been made to make the experiments more difficult and to reduce the number of impostor attempts. The specification of how to compute and present results has also been changed. The experiments were made more difficult by decreasing the amount of enrollment data; the number of impostor attempts was reduced by removing less important tests, such as cross-sex and cross-language impostor attempts; and speaker verification results with a posteriori thresholds are now to be computed with speaker-independent rather than speaker-dependent thresholds.
A consequence of these changes is that results from version 2.0 experiments are not directly comparable to results produced with the previous version. However, since the number of results published with the previous version is not large, the changes are believed to be important enough to justify this loss of comparability.
This document and the guidelines it presents are the result of cooperative work by many researchers within various working groups (WG) in the COST250 action. The specification of standardized baseline experiments for public speech databases is central to good assessment of speaker recognition technology, especially in the context of cooperative and competitive research. This document therefore has its natural home in WG4 [7], which dealt with assessment topics. The task of WG3 [8] was to research parameters and algorithms for speaker recognition, and this is where the actual experiments were run. POLYCOST and these specifications were used in several of the presented experiments. The POLYCOST database itself is a result of work in WG2 [9], which dealt with databases for speaker recognition.
The first version of this document was published in January 1997. A number of experimental results have been presented within COST250 for POLYCOST and the first version of these guidelines. Some of them have also been published at open conferences, for instance [1][2]. The results from those experiments looked very good, but they turned out to be overly optimistic. In [2], results were presented for POLYCOST and two other databases (Gandalf and SESP) on similar recognition tasks, and the POLYCOST error rates were much lower. In [1] some of the particularities of POLYCOST were investigated, especially the importance of the fact that subjects in the database have different mother tongues. Error rates roughly doubled when only same-language impostor attempts were used, compared to when cross-language impostor attempts were also included. This effect was observed even for BE1 and BE2, where all subjects speak English. There was also a large difference in error rate between the three speaker verification baselines. BE1 and BE2 (text-dependent tasks) were much easier than BE3, the text-independent task, and generated very few errors. When the enrollment sets were reduced for the two easier tasks, the absolute number of errors observed in those tasks increased accordingly. Note that for a reliable comparison between two recognition systems, the number of observed errors must be large enough.
The above-mentioned results led to a suggestion to modify the specification of baseline experiments [1]. The suggestions were later accepted within COST250 and the current document describes the new version of the baseline experiments. These are the changes from version 1.0 to 2.0:
Experiments with the new baseline specification confirm that the resulting error rates are higher and comparable to those from similar tasks on other databases [3].
The organization of this paper is as follows: section 2 briefly summarizes some features of the POLYCOST database. Section 3 defines the conditions for the four baseline experiments, including the choice of task, speakers, enrollment material and test material. Some comments on the implementation of baseline experiments are then made in section 4. Finally, the computation and presentation of results are defined in section 5.
The baseline experiments defined in this document shall be applied to version 2.0 of POLYCOST (or, equivalently, version 1.0 (from July 96) with all 4 known bugs fixed). Version 2.0 of POLYCOST will soon be available through ELRA.
Go to the home page of POLYCOST for more information on the database.
When building a non-client model, exactly sessions 01 to 05 for all off-line speakers shall be used (except for M045 where there is only one session). Which speech items to use is defined for the respective experiment in sections 3.2 to 3.5.
Off-line speakers were chosen according to the following criterion:
"Pick the male and the female speaker with the least number of recording sessions from each country".
Some exceptions were made to this rule.
| Subject | PIN | Language | Code |
|---|---|---|---|
| M045 | 6592714 | French | fr |
| F017 | 4172956 | French | fr |
| M058 | 7925416 | French | fr |
| M023 | 4762195 | Danish | da |
| F049 | 7962451 | Danish | da |
| M057 | 7691542 | Catalan | ca |
| F011 | 2541679 | Spanish | es |
| F050 | 9154276 | French | fr |
| M005 | 1724695 | French | fr |
| F030 | 5612497 | English | en |
| M037 | 5972641 | English | en |
| M059 | 7946215 | Italian | it |
| F006 | 1956274 | Italian | it |
| F039 | 6751942 | Dutch | nl |
| M016 | 2941765 | Dutch | nl |
| M018 | 2975461 | Portuguese | pt |
| F044 | 7561249 | Swedish | sv |
| M073 | 9671524 | Swedish | sv |
| M010 | 2475916 | Turkish | tr |
| F018 | 4295716 | Turkish | tr |
| F058 | 9745216 | English | en |
| M056 | 7625149 | English | en |
True-identity tests shall be made on session 05 and later sessions. With existing sessions for the 110 client speakers, this gives 664 true-identity tests.
The reasons for excluding speakers with fewer than 5 sessions and for using only sessions 05 and later for tests in all experiments are, firstly, to leave room for comparative experiments with more than two enrollment sessions while keeping the test material invariant; and secondly, to keep similarities to version 1.0 of the experiment specifications.
To make the definition of which impostor tests to include unambiguous, a file specifying the mother tongue of each subject has been provided. The contents of this file are based on what the subject claims to be his or her mother tongue in the MOT01 items of the database. Should an entry in that file turn out to contain an error, the (unchanged) contents of the file are still to be used. The file has three columns: the subject identifier; a code for the country of origin of the subject's calls (or, to be precise, the "country" according to the MOT01 statements, which may in some cases be the subject's home country rather than the country the calls were placed from; ISO3166 Country Codes are used); and a code for the language (mother tongue; ISO639 Language Codes are used).
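The use of this file for selecting impostor tests could be sketched as follows. This is a minimal Python illustration and not part of the specification. It assumes that the three columns are separated by whitespace, and it assumes, following the summary of changes given earlier in this document, that only same-sex and same-language impostor attempts are kept. The names `parse_mother_tongue` and `impostor_pairs` are hypothetical, and the entries in `SAMPLE` are invented for the example.

```python
from typing import Dict, List, NamedTuple, Sequence, Tuple


class Subject(NamedTuple):
    ident: str     # e.g. "M010"; the first letter gives the sex
    country: str   # ISO 3166 country code
    language: str  # ISO 639 language code (mother tongue)


def parse_mother_tongue(text: str) -> Dict[str, Subject]:
    """Parse lines of three whitespace-separated columns (assumed layout)."""
    subjects: Dict[str, Subject] = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumption: blank, comment and malformed lines are skipped.
        if len(fields) != 3 or line.lstrip().startswith("#"):
            continue
        ident, country, language = fields
        subjects[ident] = Subject(ident, country.lower(), language.lower())
    return subjects


def impostor_pairs(
    subjects: Dict[str, Subject],
    clients: Sequence[str],
    impostors: Sequence[str],
) -> List[Tuple[str, str]]:
    """Return (impostor, claimed identity) pairs of same sex and language."""
    pairs: List[Tuple[str, str]] = []
    for target in clients:
        for imp in impostors:
            if imp == target:
                continue  # a speaker is never an impostor against itself
            same_sex = imp[0] == target[0]
            same_language = subjects[imp].language == subjects[target].language
            if same_sex and same_language:
                pairs.append((imp, target))
    return pairs


# Invented entries, for illustration only.
SAMPLE = """\
M001 ch fr
M002 fr fr
F003 se sv
M004 se sv
"""
```

With the invented sample, only the two French-speaking male subjects form impostor pairs; the other two subjects have no impostor of the same sex and mother tongue.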
Recordings from session 05 only shall be used for impostor tests. If a certain speaker has not recorded a session 05, this speaker will also not be used for impostor tests. This excludes speakers F035 and M042 and gives a total of 110 impostor speakers, the same set of speakers as those used as clients.
There are two reasons for choosing session 05 for impostor tests rather than, for example, session 01, which would give more available impostor speakers. Firstly, it is assumed that later sessions will contain fewer speaking errors because subjects are learning the recording protocol. Secondly, since session 05 is used for true-identity tests and as a test session in the speaker identification task, it is possible to compare the outcome of these with the outcome of impostor tests.
For enrollment, annotation and segmentation information may be used in BE2 as defined in section 3.3.1, but not in the other experiments. The reason for the choice of using annotation information in experiment 2 at enrollment time is again easy implementation of the experiments. The manually verified annotation files provided with the POLYCOST database should be used.
A client model for speaker X shall be built from the first 2 sessions for that speaker, namely from files X/0{1,2}/SEN01. One verification test shall then be performed on each of the remaining recordings, namely on files X/{05,...}/SEN01. Models may not be adapted to test files.
To simulate impostor attempts, the SEN01-file from session 05 from the impostor speakers shall be used.
The choice of training material for a non-client model in this experiment is not obvious. In the rather unlikely case of a system where all users have the same password phrase, training on SEN01 would be the natural choice. It is not realistic in the case of user-individual password phrases, however, since a non-client model cannot be trained for each existing password phrase in a system. The closest alternative would perhaps be to build a non-client model from sub-word components. This is done for instance in [4], where Parthasarathy & Rosenberg conclude that it is important that the non-client model captures the text contents of the spoken utterance. Choosing SEN01 for non-client model training can be seen as the ideal case of such a synthesized non-client model. This approach has therefore been chosen for this baseline experiment.
An alternative approach, which has been abandoned, is to train a non-client model from the MOT02-item. This is more realistic in the sense that the password sentence is not represented in the training material. However, it is not realistic to train phoneme models on this small off-line data and the performance of a global fully text-independent non-client model can be questioned. The speech in MOT02 is also in mother tongue while SEN01 is in English.
Each session contains recordings of five ten-digit sequences, shown in Table 2. In this experiment, two sequences taken from the first two sessions shall be used when building a speaker model for speaker X, namely files X/01/DIG01 and X/02/DIG02. This gives two occurrences of each digit for enrollment.
One verification test shall then be done on sequence 5 in each of the sessions 05 and later, namely on files X/{05,...}/DIG05. Models may not be adapted to test files.
To simulate impostor attempts, the DIG05-file from session 05 from the impostor speakers shall be used, and a non-client model, if used, shall be trained from files DIG0{1,2,3,4} in sessions 01 to 05 for the off-line speakers.
| Item | contents |
|---|---|
| DIG01 | 0 1 2 3 4 5 6 7 8 9 |
| DIG02 | 8 3 9 4 6 1 7 2 0 5 |
| DIG03 | 5 0 6 9 2 8 1 3 7 4 |
| DIG04 | 9 8 7 6 5 4 3 2 1 0 |
| DIG05 | 1 0 2 9 3 8 4 7 5 6 |
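The statement that enrollment on DIG01 and DIG02 gives two occurrences of each digit can be checked with a short Python sketch. The digit strings are transcribed from Table 2; the names `DIGIT_ITEMS` and `digit_counts` are hypothetical.

```python
from collections import Counter
from typing import Iterable

# Digit sequences transcribed from Table 2.
DIGIT_ITEMS = {
    "DIG01": "0 1 2 3 4 5 6 7 8 9",
    "DIG02": "8 3 9 4 6 1 7 2 0 5",
    "DIG03": "5 0 6 9 2 8 1 3 7 4",
    "DIG04": "9 8 7 6 5 4 3 2 1 0",
    "DIG05": "1 0 2 9 3 8 4 7 5 6",
}


def digit_counts(items: Iterable[str]) -> Counter:
    """Count how often each digit occurs in the given items."""
    counts: Counter = Counter()
    for item in items:
        counts.update(DIGIT_ITEMS[item].split())
    return counts


# Enrollment material for BE2: one sequence from each of the first two sessions.
enrollment_counts = digit_counts(["DIG01", "DIG02"])
```

Since every item in Table 2 contains each of the ten digits exactly once, any two items give two occurrences per digit.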
During the test phase, on the other hand, segmentation can usually be done implicitly as part of the decoder operation in the speaker verification module. Hence the choice of using segmentation information in the enrollment but not in the test phase. This strategy was used for instance within the CAVE project for experiments on YOHO and SESP databases [5].
A speaker model for speaker X shall be built from the unconstrained speech item from the first two sessions, namely files X/0{1,2}/MOT02. Each of these items contains up to 20 seconds of free speech. One verification test shall then be done on the somewhat constrained speech item in each of the sessions 05 and later, namely on files X/{05,...}/MOT01. Models may not be adapted to test files. To simulate impostor attempts, the MOT01-file from session 05 from the impostor speakers shall be used. A non-client model, if used, shall be trained from the MOT02-files.
For item MOT01, subjects were asked to speak their name, Christian name, sex (female/male), town, country and mother tongue. This constraint means that subjects will say roughly the same thing in each test, which normalizes test utterances on text contents. The task is still text-independent since enrollment is made on unrelated text and models are not updated.
The knowledge of what the subject is saying in the test files may not be used a priori in the test.
| BE | Task | Speech | Non-client model | Enrollment | Tests (FR) | Tests (FA) |
|---|---|---|---|---|---|---|
| 1 | ver | fixed sentence | 0{1-5}/SEN01 * | 0{1,2}/SEN01 | {05,...}/SEN01 | 05/SEN01 |
| 2 | ver | prompted digits | 0{1-5}/DIG0{1-4} * | 01/DIG01,02/DIG02 * | {05,...}/DIG05 | 05/DIG05 |
| 3 | ver | free, mother tongue | 0{1-5}/MOT02 * | 0{1,2}/MOT02 | {05,...}/MOT01 | 05/MOT01 |
| 4 | id | free, mother tongue | 0{1-5}/MOT02 * | 0{1,2}/MOT02 | {05,...}/MOT01 | - |
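The enrollment and test columns of the table above can be turned into file tags mechanically. The following Python sketch shows one way to do this, assuming the `speaker/session/filename` tag format used in the experiment specification files and two-digit session numbers. The dictionary layout and all function names are illustrative assumptions, and the list of sessions available for a speaker must be supplied by the caller.

```python
from typing import Dict, Iterable, List, Tuple

# (session, item) pairs for enrollment, transcribed from the table above.
ENROLLMENT: Dict[str, List[Tuple[int, str]]] = {
    "BE1": [(1, "SEN01"), (2, "SEN01")],
    "BE2": [(1, "DIG01"), (2, "DIG02")],
    "BE3": [(1, "MOT02"), (2, "MOT02")],
    "BE4": [(1, "MOT02"), (2, "MOT02")],
}
# Speech item used for tests in each experiment.
TEST_ITEM: Dict[str, str] = {
    "BE1": "SEN01", "BE2": "DIG05", "BE3": "MOT01", "BE4": "MOT01",
}
FIRST_TEST_SESSION = 5  # true-identity tests use session 05 and later
IMPOSTOR_SESSION = 5    # impostor tests use session 05 only


def file_tag(speaker: str, session: int, item: str) -> str:
    """Build a tag such as 'M010/05/DIG05' (two-digit session assumed)."""
    return f"{speaker}/{session:02d}/{item}"


def enrollment_tags(be: str, speaker: str) -> List[str]:
    """File tags used to build the model for one speaker."""
    return [file_tag(speaker, s, item) for s, item in ENROLLMENT[be]]


def true_identity_tags(be: str, speaker: str, sessions: Iterable[int]) -> List[str]:
    """One test per existing session from session 05 onwards."""
    return [
        file_tag(speaker, s, TEST_ITEM[be])
        for s in sorted(sessions)
        if s >= FIRST_TEST_SESSION
    ]


def impostor_tag(be: str, impostor: str) -> str:
    """File tag an impostor speaker contributes to a verification experiment."""
    if be == "BE4":
        raise ValueError("BE4 is an identification task without impostor tests")
    return file_tag(impostor, IMPOSTOR_SESSION, TEST_ITEM[be])
```

For example, `enrollment_tags("BE2", "F031")` yields `F031/01/DIG01` and `F031/02/DIG02`, matching the BE2 row of the table.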
The format of the experiment specification files is described in section 4.1. It is the same format used by The COST250 Speaker Recognition Reference System [12]. Identical experiment specification files are provided as part of this reference system.
All experiment specification files listed in Table 4 can also be downloaded in one package (unix-type experiment.tar.gz or WinZip-type experiment.zip).
| file | description |
|---|---|
| es.exp (BE1, 2, 3) | lists files to use in each enrollment operation |
| ts.exp (BE1, 2, 3) | lists files to use in each access test operation |
| os_wld.exp (BE1, 2, 3) | lists files to use when training a single "world" non-client model |
| os_gen.exp (BE1, 2, 3) | lists files to use when training a gender-dependent non-client model. This file lists exactly the same files as os_wld.exp, but male and female speakers are listed separately. |
An enrollment operation involves training a speaker model for a certain identity from a set of files. A line that defines an enrollment operation has the following format:
enroll identity file1 ... fileN
A verification operation involves a speaker claiming an identity,
using a set of files to support the claim. A line that defines a
verification operation has the following format:
speaker identity file1 ... fileN
speaker and identity are strings like M010 or F031, where M indicates a male and F a female speaker. Each of the entries file1 to fileN is a file tag rather than a complete file name. To synthesize a complete file name, the file tag must be prefixed by the name of the database's base directory and suffixed by a file name extension. For POLYCOST, the file tag has the format
speaker/session/filename
where filename is a string like DIG05.
The case of letters in the directory names and file tags may or may not correspond to case in file names in your file system. This depends on your operating system, especially if you copied sample files directly from CD-ROM. This should not cause any ambiguity problems with POLYCOST, but may cause practical problems.
Empty lines and lines beginning with a hash mark (#) shall be ignored. They are used for comments in the file and to make the file more readable for a human reader.
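A reader for the verification lines of such a file could look like the following Python sketch. It is an illustration under stated assumptions rather than the reference system's parser: fields are assumed to be separated by whitespace, leading whitespace before a hash mark is tolerated, and the base directory and file name extension passed to `tag_to_path` are placeholders to be set for the local installation. The names `VerificationTest`, `parse_tests` and `tag_to_path` are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class VerificationTest(NamedTuple):
    speaker: str          # the speaker who actually produced the files
    identity: str         # the identity being claimed
    file_tags: List[str]  # tags of the form speaker/session/filename


def parse_tests(lines: Iterable[str]) -> List[VerificationTest]:
    """Parse 'speaker identity file1 ... fileN' lines, skipping comments."""
    tests: List[VerificationTest] = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # empty lines and comment lines are ignored
        fields = line.split()
        if len(fields) < 3:
            raise ValueError(f"expected speaker, identity and files: {raw!r}")
        tests.append(VerificationTest(fields[0], fields[1], fields[2:]))
    return tests


def tag_to_path(tag: str, base_dir: str, extension: str) -> str:
    """Prefix the base directory and append the extension to a file tag."""
    return base_dir.rstrip("/") + "/" + tag + extension


def is_impostor_test(test: VerificationTest) -> bool:
    """A test is an impostor attempt when speaker and claimed identity differ."""
    return test.speaker != test.identity
```

Because the case of directory and file names may differ between the specification files and the local file system, a real implementation might need to normalize the case of the path returned by `tag_to_path`.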
| file | speaker set |
|---|---|
| be_cli1.lst / be_cli2.lst | client speakers on disc 1 and 2 |
| be_imp1.lst / be_imp2.lst | impostor speakers on disc 1 and 2 |
| be_off1.lst / be_off2.lst | off-line speakers on disc 1 and 2 |
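With a speaker-independent threshold, many of the equations in [10] can be simplified. From equation (11.37), the test set false rejection rate is calculated as the number of falsely rejected true-identity tests divided by the total number of true-identity tests.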
The test set EER is generally defined as the error rate at the point where the test set false rejection rate and the test set false acceptance rate are equal.
Software for plotting a DET-curve from the ROC data can be retrieved from the NIST web site.
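As an illustration of error rates pooled over the whole test set with a single speaker-independent threshold, the following Python sketch computes the false rejection and false acceptance rates and estimates an a posteriori EER. It assumes that a higher score means stronger support for the claimed identity and that a claim is accepted when the score is at or above the threshold. Because both rates change in steps on a finite test set, the sketch takes the threshold where the two rates are closest and reports their mean, which is one common convention and not necessarily the one prescribed in [10]. All names and the example scores are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def error_rates(
    true_scores: Sequence[float],
    impostor_scores: Sequence[float],
    threshold: float,
) -> Tuple[float, float]:
    """Return (false rejection rate, false acceptance rate) over all tests."""
    false_rejections = sum(1 for s in true_scores if s < threshold)
    false_acceptances = sum(1 for s in impostor_scores if s >= threshold)
    return (
        false_rejections / len(true_scores),
        false_acceptances / len(impostor_scores),
    )


def equal_error_rate(
    true_scores: Sequence[float],
    impostor_scores: Sequence[float],
) -> Tuple[float, float]:
    """Return (EER estimate, threshold) where the two rates are closest."""
    candidates = sorted(set(true_scores) | set(impostor_scores))
    candidates.append(candidates[-1] + 1.0)  # a threshold that rejects every test
    best: Optional[Tuple[float, float, float]] = None
    for threshold in candidates:
        frr, far = error_rates(true_scores, impostor_scores, threshold)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2.0, threshold)
    assert best is not None
    return best[1], best[2]


# Invented scores, for illustration only.
EXAMPLE_TRUE = [2.0, 1.5, 0.4, 1.1]
EXAMPLE_IMPOSTOR = [0.1, 0.5, 0.3, 1.2]
```

With the invented scores, a threshold of 1.1 rejects one of four true-identity tests and accepts one of four impostor tests, so both rates and the EER estimate are 0.25.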
[2] Melin H., Koolwaaij J.W., Lindberg J., Bimbot F. (1998). "A Comparative Evaluation of Variance Flooring Techniques in HMM-based Speaker Verification", ICSLP'98, Sydney, Australia, Nov. 30 - Dec. 4, pp. 1903-1906.
[3] Melin H., Lindberg J. (1999). "Variance Flooring, Scaling and Tying for Text-Dependent Speaker Verification", EUROSPEECH'99, Budapest, Hungary, September 5-9, pp. 1975-1978.
[4] Parthasarathy S., Rosenberg A.E. (1996). "General Phrase Speaker Verification Using Sub-word Background Models and Likelihood-ratio Scoring", ICSLP-96, Philadelphia, USA, pp. 2403-2406.
[5] Bimbot F., Hutter H.P., Jaboulet C., Koolwaaij J., Lindberg J., Pierrot J.B. (1998). "An Overview of The CAVE Project Research Activities in Speaker Verification", RLA2C, Avignon, France, April 20-23, pp. 215-220.
[6] Petrovska D., Hennebert J., Melin H., Genoud D. (1998). "POLYCOST: a Telephone-Speech Database for Speaker Recognition", RLA2C, Avignon, France, April 20-23, pp. 211-214.
[7] Falcone M. (1999). "COST250 Working Group 4: Speaker Recognition Assessment and Dissemination", In: COST250 Final Report.
[8] Olsen J., Lindberg B. (1999). "Algorithms & Parameters for Speaker Recognition: Activities in COST250 Working Group 3", In: COST250 Final Report.
[9] Melin H. (1999). "Databases for Speaker Recognition: Activities in COST250 Working Group 2", In: COST250 Final Report.
[10] Bimbot F., Chollet G. (1997). "Assessment of speaker verification systems", In: Handbook of Standards and Resources for Spoken Language Systems, Gibbon D., Moore R., Winski R. (Eds.), Mouton de Gruyter, ISBN 3-11-015366-1.
[11] Martin A., Doddington G., Kamm T., Ordowski M., Przybocki M. (1997). "The DET Curve in Assessment of Detection Task Performance", Proc. Eurospeech-97, Rhodes, Greece, September, pp. 1895-1898.
[12] Melin H., Ariyaeeinia A., Falcone M. (1999). "The COST250 Speaker Recognition Reference System", In: COST250 Final Report.
lists directory are now lower-case. File names in Table 4 were changed from upper-case to lower-case.
Version 1.01, 14/10/1997
Changes from version 1.0:
Version 1.0, 8/1/1997
Changes from version 1.0b1:
Version 1.0b1, 25/11/1996