Guidelines for experiments on the POLYCOST database

Version 1.01; October 14th, 1997
This is not the latest version. The latest version can be found here.


Abstract

The purpose of this document is to define a common ground for speaker recognition experiments on the POLYCOST database. This is done by defining a set of baseline experiments for which results should always be included when presenting evaluations made on this database. By including these results and by presenting the differences introduced in new experiments, a comparison between systems tested at different sites is made possible.

Four baseline experiments are defined: text-dependent speaker verification (SV) on a fixed password sentence, text-prompted SV on a digit sequence, text-independent SV on free speech in the subject's mother tongue and finally text-independent speaker identification on the same free speech. The definition of each baseline experiment includes the definition of client and impostor speakers and speakers for training a world model; sessions for enrollment and test; which speech items to use; and how to compute and present results.




1. Introduction

The purpose of this document is to define a common ground for speaker recognition experiments on the POLYCOST database. This is done by defining a set of baseline experiments for which results should always be included when presenting evaluations made on this database. By including these results and by presenting the differences introduced in new experiments, a comparison between systems tested at different sites is made possible.

The organization of this paper is as follows: section 2 briefly summarizes some features of the POLYCOST database. Section 3 defines the conditions for four baseline experiments, including the choice of task, speakers, enrollment material and test material. Some comments on the implementation of baseline experiments are then made in section 4. Finally, the computation and presentation of results are defined in section 5.


2. The database

The POLYCOST database was recorded as a common initiative within the COST 250 action during January-March 1996. It contains around 10 sessions recorded by each of 134 subjects from 14 countries. Each session contains 14 items: 4 repetitions of a seven-digit client code (CLI), 5 ten-digit sequences (DIG), 2 fixed sentences (SEN), 1 international phone number (PHO), and 2 items with speech in the subject's mother tongue (MOT). The language in all items except the last two is English.

The baseline experiments defined in this document shall be applied to version 1.0 of POLYCOST (from July 96) with bug 1 fixed. The skeleton scripts for implementing these baseline experiments, which are described in section 4.1, will automatically fix this bug if data is read directly from CD-ROM.

See the POLYCOST homepage for more information on the database.


3. Definition of experiments

A set of four baseline experiments is defined. Three of the experiments are on speaker verification tasks and the fourth on a closed-set speaker identification task. The experimental conditions were chosen to keep experiments realistic, well-defined and easy to implement.

3.1 General guidelines

As far as possible, the four baseline experiments have been defined with equal conditions. For instance, the same speakers and sessions are always used in the test phase. Apart from the obvious difference in recognition task (text-dependent, digit-prompted and text-independent), the only major difference between conditions for the four experiments is that experiment 1 uses four enrollment sessions instead of two. This difference has been introduced for the following reason: while two enrollment sessions are believed to be more "realistic" than four, in experiment 1 it was not possible to use only two sessions because the speech item used occurs only once per session.

This section first covers some issues common to all baseline experiments and then defines the four experiments one by one. Table 3 summarizes details of the baseline experiments.

3.1.1 Choice of data for off-line training

A set of 22 speakers, as shown in table 1, has been set aside for use as an off-line database. These speakers can be used, for instance, to build world-models and to simulate impostor tests when setting a threshold during enrollment of a new client speaker.

When building a world-model, exactly sessions 01 to 05 for all off-line speakers shall be used (except for M045 where there is only one session). Which speech items to use is defined for the respective experiment in sections 3.2 to 3.5.

Off-line speakers were chosen according to the following criterion:

"Pick the male and the female speaker with the least number of recording sessions from each country".

Some exceptions were made to this rule.


Subject  PIN      Country  #sessions  Language
M045     6592714  BE       1          French
F017     4172956  BE       7          French
M058     7925416  CH       5          French
M023     4762195  DK       6          Danish
F049     7962451  DK       7          Danish
M057     7691542  ES       9          Catalan
F011     2541679  ES       10         Spanish
F050     9154276  FR       8          French
M005     1724695  FR       6          French
F030     5612497  IR       10         English
M037     5972641  IR       10         English
M059     7946215  IT       9          Italian
F006     1956274  IT       10         Italian
F039     6751942  NL       8          Dutch
M016     2941765  NL       9          Dutch
M018     2975461  PT       9          Portuguese
F044     7561249  SE       9          Swedish
M073     9671524  SE       9          Swedish
M010     2475916  TR       9          Turkish
F018     4295716  TR       8          Turkish
F058     9745216  UK       7          English
M056     7625149  UK       7          English
Table 1. The 22 speakers set aside for the off-line database. There are 12 male and 10 female speakers. The language entry was derived from listening to what the subject says in the MOT01 speech items.

3.1.2 Client speakers

All speakers in the database who have not been set aside as part of the off-line database and who have at least 5 sessions recorded shall be used as client speakers in the experiments. This amounts to 110 speakers; speakers F035 and M042 have been excluded because they have too few sessions.

True-identity tests shall be made on session 05 and later sessions. With existing sessions for the 110 client speakers, this gives 666 true-identity tests.

The reason for excluding speakers with fewer than 5 sessions and for using only sessions 05 and later for tests in all experiments is the following: Experiment 1 defines 4 training sessions and hence, to allow for at least one test, 5 sessions are needed. To make test conditions more similar between different baseline experiments, the same speakers and test sessions are therefore used for all experiments. Also, when comparing different numbers of enrollment sessions, up to four enrollment sessions can be tested while keeping the test material invariant.

3.1.3 Impostor speakers

To simulate impostor attempts against speaker X in the speaker verification experiments, recordings from all speakers in the database except speaker X and the world-model speakers shall be used. This scheme includes both same-sex and cross-sex impostor attempts, but the proposed scoring software (section 5.1) will report results separately for those two cases.

Recordings from session 05 only shall be used for impostor tests. If a certain speaker has not recorded a session 05, this speaker will also not be used for impostor tests. This excludes speakers F035 and M042 and gives a total of 110 impostor speakers, the same set of speakers as those used as clients. With 109 impostor tests per client, there will be 11990 impostor tests in each baseline experiment.

There are two reasons for choosing session 05 for impostor tests rather than, for example, session 01, which would give more available impostor speakers. Firstly, it is assumed that later sessions will contain fewer speaking errors because subjects are learning the recording protocol. Secondly, since session 05 is used for true-identity tests and as a test session in the speaker identification task, it is possible to compare the outcome of these with the outcome of impostor tests.

The choice of impostor speakers defined above will be referred to as "the impostor speakers".

3.1.4 Number of tests

Given the available data in version 1.0 of the database and the choice of client and impostor speakers as defined in the previous sections, the total number of true identity tests will be 666 and the number of impostor tests will be 11990. These numbers apply to all speaker verification experiments (1-3), while only the true identity tests are relevant to the closed-set speaker identification task in experiment 4.

The number of impostor tests is quite large and will cause large processing times when running experiments. For example, if one verification test takes 5 seconds and sequential processing is used, the total processing time for all tests is about 17½ hours!
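The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the constants are taken from sections 3.1.2 and 3.1.3, while the variable and function names are illustrative and not part of any distributed script.

```python
# Counts taken from sections 3.1.2 and 3.1.3 of this document.
N_CLIENTS = 110        # client speakers (the same set is used as impostors)
N_TRUE_TESTS = 666     # true-identity tests on sessions 05 and later

# Each client is attacked once by every other impostor speaker (session 05).
n_impostor_tests = N_CLIENTS * (N_CLIENTS - 1)


def total_hours(seconds_per_test: float) -> float:
    """Sequential processing time in hours for all verification tests."""
    n_tests = N_TRUE_TESTS + n_impostor_tests
    return n_tests * seconds_per_test / 3600.0


print(n_impostor_tests)               # 11990
print(round(total_hours(5.0), 1))     # 17.6
```

At 5 seconds per test, the 12656 tests take about 17.6 hours, in line with the estimate above.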

3.1.5 The use of annotation files

For all baseline experiments defined here, a test shall be made independently of the actual contents of the test file, that is, even if the scripted utterance is not there or if the file contains only silence. This way, annotations for the test files are not needed, and may not be used, to produce a result on a given baseline experiment. The main reasons for this choice are to make experiments easy to implement and to avoid relying on annotations of test data. The fact that a portion of the errors produced in an experiment will be due to speaking errors rather than errors made by the classifier should instead be addressed during the analysis of test results. This is not yet done by the scoring software presented in section 5, however.

For enrollment, annotation and segmentation information may be used in experiment 2 as defined in section 3.3.1, but not in the other experiments. The reason for the choice of using annotation information in experiment 2 at enrollment time is again easy implementation of the experiments.

3.2 Experiment 1: Text-dependent speaker verification on SEN-files

The task in this experiment is speaker verification on a fixed password phrase which is common to all speakers. There are two such phrases in POLYCOST: SEN01 with text "Joe took father's green shoe bench out" and SEN02 with text "He eats several light tacos". This baseline experiment shall be done on the first phrase, SEN01, only.

A client model for speaker X shall be built from the first 4 sessions for that speaker, namely from files X/0{1,2,3,4}/SEN01. One verification test shall then be performed on each of the remaining recordings, namely on files X/{05,...}/SEN01. Models may not be adapted to test files.

To simulate impostor attempts, the SEN01-file from session 05 from the impostor speakers shall be used.

The choice of training material for a world-model in this experiment is not obvious. In the rather unlikely case of a system where all users have the same password phrase, training on SEN01 would be the natural choice. It is not realistic in the case of user-individual password phrases, however, since a world-model cannot be trained for each existing password phrase in a system. The closest alternative would perhaps be to build a world model from subword components. This is done for instance in [2], where Parthasarathy & Rosenberg conclude that it is important that the world model captures the text contents of the spoken utterance. Choosing SEN01 for world-model training can be seen as the ideal case of such a synthesized world-model. This approach has therefore been chosen for this baseline experiment.

An alternative approach, which has been abandoned, is to train a world model from the MOT02-item. This is more realistic in the sense that the password sentence is not represented in the training material. However, it is not realistic to train phoneme models on such a small amount of off-line data, and the performance of a global, fully text-independent world-model can be questioned. The speech in MOT02 is also in the subject's mother tongue while SEN01 is in English.

3.3 Experiment 2: Digit-prompted speaker verification on DIG-files

The task in this experiment is speaker verification on a sequence of digits which was not represented in the enrollment material. Hence, it is a simulation of a verification system where a sequence of digits is prompted to the client at the time of the test. Prompting in this case is done by means of text display as opposed to audio prompting.

Each session contains recordings of five ten-digit sequences, shown in table 2. In this experiment, the first four sequences from the first two sessions shall be used when building a speaker model for speaker X, namely files X/0{1,2}/DIG0{1,2,3,4}. This gives eight occurrences of each digit for enrollment.

One verification test shall then be done on sequence 5 in each of the sessions 05 and later, namely on files X/{05,...}/DIG05. Models may not be adapted to test files.

To simulate impostor attempts, the DIG05-file from session 05 from the impostor speakers shall be used and a world-model, if used, shall be trained from files DIG0{1,2,3,4} in sessions 01 to 05 for the off-line speakers.


Item   Contents
DIG01  0 1 2 3 4 5 6 7 8 9
DIG02  8 3 9 4 6 1 7 2 0 5
DIG03  5 0 6 9 2 8 1 3 7 4
DIG04  9 8 7 6 5 4 3 2 1 0
DIG05  1 0 2 9 3 8 4 7 5 6
Table 2. Prescribed contents of the DIG-items.
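The enrollment digit count and the fact that the test sequence is not among the enrollment sequences can be checked from the table contents. The following Python sketch copies the sequences from table 2; the variable names are illustrative only.

```python
from collections import Counter

# Prescribed contents of the DIG items, copied from table 2.
DIG = {
    "DIG01": "0123456789",
    "DIG02": "8394617205",
    "DIG03": "5069281374",
    "DIG04": "9876543210",
    "DIG05": "1029384756",
}
ENROL_ITEMS = ["DIG01", "DIG02", "DIG03", "DIG04"]
ENROL_SESSIONS = ["01", "02"]

# Count how often each digit occurs in the enrollment material.
counts = Counter()
for _session in ENROL_SESSIONS:
    for item in ENROL_ITEMS:
        counts.update(DIG[item])

for digit in "0123456789":
    print(digit, counts[digit])   # every digit occurs 8 times

# The test sequence (DIG05) is not one of the enrollment sequences.
print(DIG["DIG05"] not in [DIG[i] for i in ENROL_ITEMS])   # True
```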

3.3.1 The use of annotation files

In this experiment, models for individual digits should be trained from sequences of digits. In a real situation the recognition system would of course have to produce segmentations on its own (if segmentations are required). In the experiment defined here, however, the goal is to test the speaker verification part of the system. Therefore, we make the assumption that before the enrollment is started the speech has been put through an "ideal" digit segmenter. With this approach a system under test does not need to have the segmenter component. We also eliminate the influence on the built model from differences in segmenting modules.

During the test phase, on the other hand, segmentation can usually be done implicitly as part of the decoder operation in the speaker verification module. Hence the choice of using segmentation information in the enrollment but not in the test phase. This strategy was used within the CAVE project for experiments on YOHO and SESP databases.

3.4 Experiment 3: Text-independent speaker verification on MOT-files

The task in this experiment is speaker verification in a text-independent manner on text spoken in the speaker's mother tongue.

A speaker model for speaker X shall be built from the unconstrained speech item from the first two sessions, namely files X/0{1,2}/MOT02. Each of these items contains up to 20 seconds of free speech. One verification test shall then be done on the somewhat constrained speech item in each of the sessions 05 and later, namely on files X/{05,...}/MOT01. Models may not be adapted to test files. To simulate impostor attempts, the MOT01-file from session 05 from the impostor speakers shall be used. A world-model, if used, shall be trained from the MOT02-files.

For item MOT01 subjects were asked to speak their name, Christian name, sex (female/male), town, country and mother tongue. This constraint means that subjects will say roughly the same thing in each test, which normalizes test utterances on text contents. The task is still text-independent since enrollment is made on unrelated text and models are not updated.

The knowledge of what the subject is saying in the test files may not be used a priori in the test.

3.5 Experiment 4: Text-independent speaker identification on MOT-files

This experiment is defined in all applicable aspects exactly the same as experiment 3, but with the task of closed-set speaker identification. Hence, speaker model X shall be built on files X/0{1,2}/MOT02 and each of files X/{05,...}/MOT01 shall be used for speaker identification tests. Adaptation on test utterances is not allowed. All speakers in the database shall be registered as clients and, thus, the task is closed-set speaker identification and there is no need for impostor tests.


BE  Task  Speech               World model         Enrollment          Tests (FR)      Tests (FA)
1   ver   Fixed sentence       0{1-5}/SEN01 *      0{1-4}/SEN01        {05,...}/SEN01  05/SEN01
2   ver   Prompted digits      0{1-5}/DIG0{1-4} *  0{1,2}/DIG0{1-4} *  {05,...}/DIG05  05/DIG05
3   ver   Free, mother tongue  0{1-5}/MOT02 *      0{1,2}/MOT02        {05,...}/MOT01  05/MOT01
4   id    Free, mother tongue  0{1-5}/MOT02 *      0{1,2}/MOT02        {05,...}/MOT01  -
Table 3. Summary of test conditions for the four baseline experiments.
*) annotation information may be used.
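The conditions in table 3 can be turned into file lists mechanically. The Python sketch below is one possible encoding, not part of the distributed skeleton scripts: the dictionary layout and function names are hypothetical, paths follow the X/session/item notation used in this document, and real file names may carry an extension not shown here.

```python
# Conditions copied from table 3; the data layout itself is hypothetical.
EXPERIMENTS = {
    1: {"enrol_sessions": ["01", "02", "03", "04"],
        "enrol_items": ["SEN01"], "test_item": "SEN01", "impostors": True},
    2: {"enrol_sessions": ["01", "02"],
        "enrol_items": ["DIG01", "DIG02", "DIG03", "DIG04"],
        "test_item": "DIG05", "impostors": True},
    3: {"enrol_sessions": ["01", "02"],
        "enrol_items": ["MOT02"], "test_item": "MOT01", "impostors": True},
    4: {"enrol_sessions": ["01", "02"],
        "enrol_items": ["MOT02"], "test_item": "MOT01", "impostors": False},
}


def enrol_files(exp, speaker):
    """Files used to build the client model for `speaker`."""
    cond = EXPERIMENTS[exp]
    return [f"{speaker}/{s}/{i}"
            for s in cond["enrol_sessions"] for i in cond["enrol_items"]]


def true_identity_files(exp, speaker, recorded_sessions):
    """Test files from session 05 and later, given the sessions that exist."""
    item = EXPERIMENTS[exp]["test_item"]
    return [f"{speaker}/{s}/{item}"
            for s in sorted(recorded_sessions) if int(s) >= 5]


def impostor_file(exp, impostor):
    """Session 05 file used for impostor attempts, or None for experiment 4."""
    cond = EXPERIMENTS[exp]
    if not cond["impostors"]:
        return None
    return f"{impostor}/05/{cond['test_item']}"


print(enrol_files(1, "M006"))
print(true_identity_files(3, "F024", ["01", "02", "05", "06"]))
print(impostor_file(2, "M034"))
```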


4. Implementation

The implementation of the baseline experiments as defined above is in principle left to the user. As a help, a set of skeleton csh-scripts (for UNIX environments) is provided, plus a set of list files which define the different speaker sets. Note that these are not a ready-to-go set of scripts but only a suggestion for how the basic procedures of the baseline tests can be implemented.

4.1 Csh Scripts

There are three csh-scripts: runenrol, runtest and trainworld. They all use the file polydefs, which contains definitions of file paths plus enrollment and test sets for the experiments. runenrol is supposed to build all client models while runtest will run all true-identity and all impostor tests. The last script, trainworld, can be used to build a world (impostor) model. All baseline experiments can be handled by the same scripts. The scripts are made such that speech data can be read either from the two distribution CD-ROMs or from a hard disk where all data has been stored. In the first case, the script will halt for the user to change the disc in the CD-ROM reader.

The provided csh skeleton scripts will fix one known bug in the v 1.0 release of POLYCOST when reading data directly from the CD-ROMs, namely by swapping sessions F006/01 and F031/01. When data is read from a hard disk instead, the scripts assume that the bug has already been fixed and the sessions are therefore not swapped. Other bugs have since been discovered in v 1.0, so new bug fixing procedures should be implemented in the csh scripts for reading from the CD-ROMs. This has not yet been done, so take care. The easiest way to avoid those bugs is to make a local copy of the distribution and fix the bugs manually as described on the known bug page.
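The session swap can be expressed as a small lookup applied only when data comes from the CD-ROMs. This Python sketch is illustrative and is not taken from the csh scripts; the function name and the from_cdrom flag are assumptions.

```python
# Sessions that are swapped on the v 1.0 CD-ROMs according to this section.
SWAPPED = {
    ("F006", "01"): ("F031", "01"),
    ("F031", "01"): ("F006", "01"),
}


def source_location(speaker, session, from_cdrom=True):
    """Return the (speaker, session) directory to read for the requested data.

    When reading from a hard disk copy, the bug is assumed to be fixed
    already, so no swap is applied.
    """
    if from_cdrom:
        return SWAPPED.get((speaker, session), (speaker, session))
    return (speaker, session)


print(source_location("F006", "01"))                    # ('F031', '01')
print(source_location("F006", "02"))                    # ('F006', '02')
print(source_location("F006", "01", from_cdrom=False))  # ('F006', '01')
```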

4.2 Perl Scripts

Perl versions of the csh scripts have been created. There are three Perl scripts: runenrol.pl, runtest.pl and trainworld.pl. They all use the file polydefs.pl, which contains definitions of file paths plus enrollment and test sets for the experiments.

The Perl scripts have exactly the same functionality as the csh ones, except that no bug fixing has been implemented. When reading data directly from the v 1.0 CD-ROMs, a bug fixing procedure should be implemented in the script. This has not yet been done in the Perl scripts distributed here, so take care. When reading data from a local copy (hard disk), make sure the bugs described on the known bug page for the v 1.0 release of POLYCOST have been repaired manually on the local copy.

4.3 List files

There are three pairs of list files as shown in table 4. Files are ordinary text files with one speaker per line.


File                       Speaker set
BE_CLI1.LST / BE_CLI2.LST  client speakers on disc 1 and 2
BE_IMP1.LST / BE_IMP2.LST  impostor speakers on disc 1 and 2
BE_OFF1.LST / BE_OFF2.LST  off-line speakers on disc 1 and 2
Table 4. List files which define three different speaker sets for the baseline experiments. Disc numbers refer to the v 1.0 distribution CD-ROM discs.


5. Scoring

In order to be able to compare different experiments done on the POLYCOST database, all results must be presented in a similar way. Otherwise two quite comparable experiments might end up with incomparable results due to the way the error rates are calculated and the results are presented. A common scoring software was developed within the CAVE project for calculating false acceptance (FA), false rejection (FR) and equal error rates (EER) based on likelihood values and thresholds from an experiment. This implementation is based on the methods described in the EAGLES handbook [1]. This program has also been distributed to COST 250 partners.

5.1 Scoring procedures

The scoring software distributed to the COST 250 members consists of two parts, one for static scoring and one for dynamic scoring. Static scoring requires two input files, one containing likelihood scores and one containing thresholds; dynamic scoring requires only the file with likelihood scores. The static evaluation program computes FA and FR rates for an experiment with a priori fixed thresholds, while the dynamic evaluation program adjusts the thresholds by equalising the false acceptance and false rejection rates a posteriori. The dynamic scoring program then reports these equal error rates.

5.1.1 Likelihood file

The likelihood file is a text file with four items per line. Each line is the result of an access attempt. On each line must be given, in the following order:

This file must have the same number of lines as the number of attempts in the experiment and must have the extension llk in its filename (example.llk).

Example of likelihood file:
The first two lines are genuine attempts, the third line is a same-sex impostor attempt against M011 and the last line is a cross-sex impostor attempt against M011.

M006 M006  -1.016141  -2.066694
M006 M006  -1.063104  -2.082051
M034 M011  -1.693927  -2.413161
F024 M011  -1.914679  -1.981914
...
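The example lines can be read and classified with a short Python sketch. It assumes, based only on the description of the example above, that the first field is the speaker who actually spoke and the second is the claimed identity, and that the sex is given by the first letter of the speaker label. The meaning of the two numeric fields is not assumed here, so they are kept under the neutral names score_a and score_b.

```python
from typing import NamedTuple


class Attempt(NamedTuple):
    speaker: str    # who actually spoke (assumed: first field)
    claimed: str    # claimed identity (assumed: second field)
    score_a: float  # third field, meaning not assumed here
    score_b: float  # fourth field, meaning not assumed here


def parse_line(line: str) -> Attempt:
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 items per line, got {len(fields)}")
    return Attempt(fields[0], fields[1], float(fields[2]), float(fields[3]))


def kind(attempt: Attempt) -> str:
    if attempt.speaker == attempt.claimed:
        return "genuine"
    if attempt.speaker[0] == attempt.claimed[0]:
        return "same-sex impostor"
    return "cross-sex impostor"


EXAMPLE = """\
M006 M006  -1.016141  -2.066694
M006 M006  -1.063104  -2.082051
M034 M011  -1.693927  -2.413161
F024 M011  -1.914679  -1.981914
"""

attempts = [parse_line(l) for l in EXAMPLE.splitlines() if l.strip()]
for a in attempts:
    print(a.speaker, a.claimed, kind(a))
```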

5.1.2 Threshold file

The threshold file is a text file containing the list of all enrolled speakers together with their corresponding decision threshold on the log likelihood ratio. Each line contains the speaker identity followed by the threshold for this speaker. This file must have exactly the same number of lines as the number of enrolled speakers and must have extension thr in its filename (example.thr).

Example of threshold file:

M006 0.745374
M011 0.638556
F024 0.578569
F031 0.578722
...
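A threshold file in this format can be read into a dictionary and checked against the list of enrolled speakers. The Python sketch below uses illustrative function names and is not part of the scoring software.

```python
def parse_thresholds(text: str) -> dict:
    """Map each speaker identity to its decision threshold."""
    thresholds = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        speaker, value = line.split()
        if speaker in thresholds:
            raise ValueError(f"duplicate threshold for {speaker}")
        thresholds[speaker] = float(value)
    return thresholds


def covers_exactly(thresholds: dict, enrolled: list) -> bool:
    """True if there is one threshold per enrolled speaker and no extras."""
    return set(thresholds) == set(enrolled) and len(thresholds) == len(enrolled)


EXAMPLE = """\
M006 0.745374
M011 0.638556
F024 0.578569
F031 0.578722
"""

thresholds = parse_thresholds(EXAMPLE)
print(thresholds["M011"])                                          # 0.638556
print(covers_exactly(thresholds, ["M006", "M011", "F024", "F031"]))  # True
```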

5.1.3 Static evaluation

The static evaluation program calculates FA and FR rates for a system with a priori fixed thresholds. It requires both the input files mentioned above.

For each claimed speaker, an FR rate is computed. These individual scores are then averaged over all male, all female and all speakers in order to give the male, female and sex independent FR. The test-set false rejection rate is computed as the total number of false rejections over the total number of genuine attempts, regardless of the distribution of speakers by gender, and of the number of tests per speaker.
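The difference between the by-gender averages and the test-set rate can be illustrated with a small Python sketch. It is not the CAVE scoring software: the input is a hypothetical list of already-made accept/reject decisions for genuine attempts, and the combined figure is computed as the mean of the male and female averages, which is an assumption that matches the example output in section 5.2.1 but is not stated explicitly in the text.

```python
from collections import defaultdict
from statistics import mean


def false_rejection_rates(genuine):
    """genuine: list of (speaker_id, accepted) for true-identity tests."""
    tries = defaultdict(int)
    rejects = defaultdict(int)
    for speaker, accepted in genuine:
        tries[speaker] += 1
        if not accepted:
            rejects[speaker] += 1
    # One FR rate (in percent) per claimed speaker.
    per_speaker = {s: 100.0 * rejects[s] / tries[s] for s in tries}
    male = mean([v for s, v in per_speaker.items() if s.startswith("M")])
    female = mean([v for s, v in per_speaker.items() if s.startswith("F")])
    return {
        "male": male,
        "female": female,
        "gender_balanced": (male + female) / 2.0,  # assumption, see lead-in
        "test_set": 100.0 * sum(rejects.values()) / sum(tries.values()),
    }


# Hypothetical decisions: M006 1 of 4 rejected, M011 1 of 2, F024 0 of 4.
decisions = (
    [("M006", True)] * 3 + [("M006", False)]
    + [("M011", True), ("M011", False)]
    + [("F024", True)] * 4
)
rates = false_rejection_rates(decisions)
print(rates)
```

In this toy data the by-gender figures (37.5 and 0.0) differ from the test-set rate (20.0) because speakers contribute different numbers of tests.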

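For each pair of claimed speaker and impostor, a separate false acceptance rate is computed. These scores are then averaged over 4 different subsets: male claimed speaker with male impostor (MM), female claimed speaker with female impostor (FF), male claimed speaker with female impostor (MF) and female claimed speaker with male impostor (FM).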

Scores obtained for MM and FF are then averaged to form the "Same Sex" FA, whereas MF and FM scores are averaged into the "Cross Sex" FA. The "Sex Independent" FA is obtained as the average of the Same Sex and Cross Sex score. The test-set FA is obtained as the proportion of false acceptance over the total number of impostor trials, regardless of the claimed speaker's and the impostor's identities and of the number of tests per speaker.
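The averaging scheme can be sketched in Python as follows. This is an illustration rather than the distributed scoring program: the input is a hypothetical list of accept/reject decisions for impostor attempts, and rates are averaged directly over pairs within each subset, which is one possible reading of the description. Subset labels put the sex of the claimed speaker first.

```python
from collections import defaultdict
from statistics import mean


def false_acceptance_rates(impostor_trials):
    """impostor_trials: list of (claimed_id, impostor_id, accepted)."""
    tries = defaultdict(int)
    accepts = defaultdict(int)
    for claimed, impostor, accepted in impostor_trials:
        tries[(claimed, impostor)] += 1
        if accepted:
            accepts[(claimed, impostor)] += 1
    # One FA rate (in percent) per (claimed speaker, impostor) pair.
    subsets = {"MM": [], "FF": [], "MF": [], "FM": []}
    for (claimed, impostor), n in tries.items():
        rate = 100.0 * accepts[(claimed, impostor)] / n
        subsets[claimed[0] + impostor[0]].append(rate)
    avg = {name: mean(values) for name, values in subsets.items()}
    same = (avg["MM"] + avg["FF"]) / 2.0
    cross = (avg["MF"] + avg["FM"]) / 2.0
    return {
        **avg,
        "same_sex": same,
        "cross_sex": cross,
        "sex_independent": (same + cross) / 2.0,
        "test_set": 100.0 * sum(accepts.values()) / sum(tries.values()),
    }


# Hypothetical decisions covering all four subsets.
trials = [
    ("M011", "M034", True),
    ("M006", "M034", False),
    ("F024", "F031", False),
    ("M011", "F024", False),
    ("F024", "M006", True),
]
fa = false_acceptance_rates(trials)
print(fa)
```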

5.1.4 Dynamic evaluation

The dynamic evaluation program computes several sets of a posteriori optimal thresholds in order to equalise FA and FR in several test configurations. It requires only the likelihood file as input, but provides 3 different threshold files as output. For each speaker, three distinct ROC curves are considered, from which are derived various average EERs.

The male-male (MM) average EER is obtained as the average same sex EER over all male speakers, while the female-female (FF) average EER is obtained similarly for female speakers. The average of these two provides the same-sex EER.

The male-female average EER is obtained as the average cross sex EER over all male speakers, while the female-male average EER is obtained similarly for female speakers. The average of both these provides the cross-sex EER.

The sex-independent EER is estimated for each speaker from the gender-balanced ROC and the sex-independent average EER is obtained by averaging these scores in a gender-balanced way.
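For illustration, an equal error rate for a single speaker can be approximated by sweeping a threshold over the observed scores and taking the point where FA and FR are closest. The Python sketch below is a generic approximation and not the procedure used by the dynamic evaluation program, whose handling of the ROC curve is not detailed here. The scores are hypothetical, and an attempt is assumed to be accepted when its score is at or above the threshold.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Return (approximate EER in percent, threshold) for one speaker."""
    candidates = sorted(set(genuine_scores) | set(impostor_scores))
    best = None
    for t in candidates:
        # False rejection: genuine attempts scoring below the threshold.
        fr = sum(s < t for s in genuine_scores) / len(genuine_scores)
        # False acceptance: impostor attempts scoring at or above it.
        fa = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(fa - fr)
        if best is None or gap < best[0]:
            best = (gap, 100.0 * (fa + fr) / 2.0, t)
    return best[1], best[2]


# Hypothetical scores for one client.
genuine = [1.2, 0.9, 0.4, 1.5]
impostor = [0.1, 0.5, -0.3, 0.2, 0.0]
eer, threshold = equal_error_rate(genuine, impostor)
print(eer, threshold)
```

With few genuine tests per speaker, FA and FR rarely become exactly equal, which is why the sketch reports the mean of the two rates at the closest point.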

5.2 Presentation of results

The scoring software presents all calculated scores in text files. The format of these text files is presented below. It is recommended to always report the same-sex and sex-independent EER when performing dynamic evaluation on the POLYCOST database. When performing static evaluation, the same-sex and sex-independent FA and the male and female FR should be reported. It is recommended to include the full output from the scoring program as an appendix to a report.

5.2.1 Static evaluation

An example of the output from the static evaluation software is shown below (file example.sta):

by-gender average false rejection rate

      -------------------------
      |  30.416 (M) |         |
      --------------|  29.396 |
      |  28.375 (F) |         |
      -------------------------

test-set false rejection rate

      -----------
      |  29.989 |
      -----------


(XY) : X=claimed Y=true

by-gender average of average false acceptance rates

      -----------------------------------------------------------
      |  15.369 (MM) |                     |                    |
      ---------------|  20.229 (Same Sex)  |                    |
      |  25.090 (FF) |                     |                    |
      -------------------------------------|  14.538 (Sex Ind.) |
      |  10.184 (MF) |                     |                    |
      ---------------|   8.847 (Cross Sex) |                    |
      |   7.509 (FM) |                     |                    |
      -----------------------------------------------------------

test set false acceptance rate

      -----------
      |  12.359 |
      -----------

5.2.2 Dynamic evaluation

An example of the output from the dynamic evaluation software is shown below (file example.dyn):

EER:
    -----------------------------------------------------------
    |  19.883 (MM) |                     |                    |
    ---------------|  22.550 (Same Sex)  |                    |
    |  25.217 (FF) |                     |                    |
    -------------------------------------|  19.598 (Sex Ind.) |
    |  15.534 (MF) |                     |                    |
    ---------------|  15.737 (Cross Sex) |                    |
    |  15.941 (FM) |                     |                    |
    -----------------------------------------------------------

6. Conclusions

A common ground for experiments on the POLYCOST database has been established through the definition of a set of four baseline experiments and procedures for computing and presenting test results. The purpose of the guidelines is to standardize testing on this database and thus enable comparison between experiments made at different test sites.

These guidelines should be used in experiments until the next meeting of the COST 250 in April 1997, where a new version of guidelines will be discussed.


References

[1] F. Bimbot, G. Chollet (1995). "Assessment of speaker verification systems", In: Spoken Language Resources and Assessment, EAGLES Handbook on Spoken Language Systems, draft version of 18 May 1995.

[2] S. Parthasarathy, A.E. Rosenberg (1996). "General Phrase Speaker Verification Using Sub-word Background Models and Likelihood-ratio Scoring", ICSLP-96, Philadelphia.


Appendix A: Revision history

Changes made from the draft version of November 25th, 1996:
Section 5 on scoring procedures and presentation of results has been added. The template scripts referred to in section 4.1 have been slightly changed. None of the basic guidelines have been changed.

October 14th 1997 : Template Perl scripts have been added in section 4.


Håkan Melin, Johan Lindberg
KTH, Dept. of Speech, Music and Hearing (TMH)

Jean Hennebert (for the Perl scripts), CIRC, EPFL