Guidelines for experiments on the POLYCOST database
Version 1.01; October 14th, 1997
This is not the latest version. The latest version can be
found here.
Abstract
The purpose of this document is to define a common ground for speaker recognition
experiments on the POLYCOST database. This is done by defining a set of baseline
experiments for which results should always be included when presenting evaluations made
on this database. By including these results and by presenting the differences
introduced in new experiments, a comparison between systems tested on different
sites is made possible.
Four baseline experiments are defined: text-dependent speaker verification (SV) on a
fixed password sentence, text-prompted SV on a digit sequence, text-independent SV on free speech
in the subject's mother tongue, and finally text-independent speaker identification on the same free speech.
The definition of the baseline experiments includes the choice of client and impostor
speakers and of speakers for training a world model; the sessions for enrollment and test;
which speech items to use; and how to compute and present results.
The purpose of this document is to define a common ground for speaker recognition
experiments on the POLYCOST database. This is done by defining a set of baseline
experiments for which results should always be included when presenting evaluations made
on this database. By including these results and by presenting the differences
introduced in new experiments, a comparison between systems tested on different
sites is made possible.
The organization of this paper is as follows: section 2 briefly summarizes some
features of the POLYCOST database. Section 3 defines the conditions for four baseline
experiments, including the choice of task, speakers, enrollment material and test
material. Some comments on the implementation of the baseline experiments are then
made in section 4. Finally, the computation and presentation of results are defined in
section 5.
The POLYCOST database was recorded as a common initiative within the COST 250
action during January-March 1996. It contains around 10 sessions per subject, recorded by 134 subjects
from 14 countries. Each session contains 14 items: 4 repetitions of a seven-digit client
code (CLI), 5 ten-digit sequences (DIG), 2 fixed sentences (SEN), 1 international
phone number (PHO), and 2 items with speech in the subject's mother tongue (MOT).
The language in all items except the last two (MOT) is English.
The baseline experiments defined in this document shall be applied to version 1.0 of
POLYCOST (from July 96) with bug 1 fixed. The skeleton scripts for implementing these
baseline experiments, which are described in section 4.1, will automatically fix this
bug if data is read directly from CD-ROM.
See the POLYCOST homepage for more information on the database.
A set of four baseline experiments is defined. Three of the experiments are on speaker
verification tasks and the fourth on a closed-set speaker identification task. The
experimental conditions were chosen to keep experiments realistic, well-defined
and easy to implement.
As far as possible, the four baseline experiments have been defined with equal
conditions. For instance, the same speakers and sessions are always used in the test
phase. Apart from the obvious difference in recognition task (text-dependent,
digit-prompted and text-independent), the only major difference between conditions for the
four experiments is that experiment 1 uses four enrollment sessions instead of two.
This difference has been introduced for the following reason: while two enrollment
sessions are believed to be more "realistic" than four, in experiment 1 it was
not possible to use only two sessions because the speech item used occurs in only one
repetition per session.
This section first covers some issues common to all baseline experiments and then
defines the four experiments one by one. Table 3
summarizes details of the baseline experiments.
A set of 22 speakers, shown in table 1, has been set aside for use as an off-line
database. These speakers can be used, for instance, to build world-models and to
simulate impostor tests when setting a threshold during enrollment of a new client
speaker.
When building a world-model, exactly sessions 01 to 05 for all off-line speakers shall
be used (except for M045 where there is only one session). Which speech items to use is
defined for the respective experiment in sections 3.2 to 3.5.
Off-line speakers were chosen according to the following criterion:
"Pick the male and the female speaker with the least number of recording sessions
from each country".
Some exceptions were made to this rule; they are documented on a separate page.
| Subject | PIN | Country | #sessions | Language |
| M045 | 6592714 | BE | 1 | French |
| F017 | 4172956 | BE | 7 | French |
| M058 | 7925416 | CH | 5 | French |
| M023 | 4762195 | DK | 6 | Danish |
| F049 | 7962451 | DK | 7 | Danish |
| M057 | 7691542 | ES | 9 | Catalan |
| F011 | 2541679 | ES | 10 | Spanish |
| F050 | 9154276 | FR | 8 | French |
| M005 | 1724695 | FR | 6 | French |
| F030 | 5612497 | IR | 10 | English |
| M037 | 5972641 | IR | 10 | English |
| M059 | 7946215 | IT | 9 | Italian |
| F006 | 1956274 | IT | 10 | Italian |
| F039 | 6751942 | NL | 8 | Dutch |
| M016 | 2941765 | NL | 9 | Dutch |
| M018 | 2975461 | PT | 9 | Portuguese |
| F044 | 7561249 | SE | 9 | Swedish |
| M073 | 9671524 | SE | 9 | Swedish |
| M010 | 2475916 | TR | 9 | Turkish |
| F018 | 4295716 | TR | 8 | Turkish |
| F058 | 9745216 | UK | 7 | English |
| M056 | 7625149 | UK | 7 | English |
Table 1. The 22 speakers set aside for the off-line database. There are 12 male and 10
female speakers. The language entry was derived from listening to what the subject
says in the MOT01 speech items.
All speakers in the database who have not been set aside as part of the off-line
database and who have at least 5 sessions recorded shall be used as client speakers in
the experiments. This amounts to 110 speakers; speakers F035 and M042 have
been excluded because they have too few sessions.
True-identity tests shall be made on session 05 and later sessions. With existing
sessions for the 110 client speakers, this gives 666 true-identity tests.
The reason for excluding speakers with fewer than 5 sessions and for using only
sessions 05 and later for tests in all experiments is the following: Experiment 1
defines 4 training sessions and hence to allow for at least one test, 5 sessions are
needed. To make test conditions more similar between different baseline experiments,
the same speakers and test sessions will therefore be used for all those experiments.
Also, when comparing different numbers of enrollment sessions, up to four enrollment
sessions can be tested while keeping the test material invariant.
To simulate impostor attempts against speaker X in the speaker verification
experiments, recordings from all speakers in the database except speaker X and the
world-model speakers shall be used. This scheme includes both same-sex and
cross-sex impostor attempts, but the proposed scoring software (section 5.1) will report
results separately for those two cases.
Recordings from session 05 only shall be used for impostor
tests. If a certain speaker has not recorded a session 05, this speaker will also not be
used for impostor tests. This excludes speakers F035 and M042 and gives a total of
110 impostor speakers, the same set of speakers as those used as clients. With 109
impostor tests per client, there will be 11990 impostor tests in each baseline
experiment.
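The impostor test count follows from pairing each of the 110 speakers with the 109 others. The following is a minimal sketch of that enumeration; the identifiers in `speakers` are placeholders, not the real POLYCOST speaker list, and the function name is hypothetical.

```python
from itertools import permutations


def impostor_trials(speakers):
    """Return (impostor, claimed) pairs: each speaker attacks every other speaker."""
    return list(permutations(speakers, 2))


# Placeholder identifiers standing in for the 110 client/impostor speakers.
speakers = [f"S{i:03d}" for i in range(110)]
trials = impostor_trials(speakers)
print(len(trials))  # 110 * 109 impostor tests
```

Each pair corresponds to one session-05 recording of the impostor scored against the claimed speaker's model.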
There are two reasons for choosing session 05 for impostor tests rather than, for
example, session 01, which would give more available impostor speakers. Firstly, it is assumed
that later sessions will contain fewer speaking errors because subjects are learning the
recording protocol. Secondly, since session 05 is used for true-identity tests and as a test
session in the speaker identification task, it is possible to compare the outcome of
these with the outcome of impostor tests.
The choice of impostor speakers defined above will be referred to as "the impostor
speakers".
Given the available data in version 1.0 of the database and the choice of client and
impostor speakers as defined in the previous sections, the total number of true identity
tests will be 666 and the number of impostor tests will be 11990. These numbers
apply to all speaker verification experiments (1-3), while only the true identity tests
are relevant to the closed-set speaker identification task in experiment 4.
The number of impostor tests is quite large and will cause large processing times
when running experiments. For example, if one verification test takes 5 seconds
and sequential processing is used, the total processing time for all tests is
about 17½ hours!
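The processing-time estimate can be checked with a few lines of arithmetic. This sketch only restates the numbers given above; the 5-second figure is the example value from the text, not a measurement.

```python
TRUE_IDENTITY_TESTS = 666
IMPOSTOR_TESTS = 11990
SECONDS_PER_TEST = 5  # example value used in the text

total_tests = TRUE_IDENTITY_TESTS + IMPOSTOR_TESTS
hours = total_tests * SECONDS_PER_TEST / 3600
print(total_tests, round(hours, 2))
```

Since the tests are independent of each other, the total scales linearly with the time per test and can be divided across machines.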
For all baseline experiments defined here, a test shall be made independently of the
actual contents of the test file, that is, even if the prescribed utterance is not there or if
the file contains only silence. This way, annotations for the test files are not needed,
nor may they be used, to produce a result on a given baseline experiment. The main
reasons for this choice are to make experiments easy to implement and to avoid relying on
annotations of test data. The fact that a portion of the errors produced in an experiment
will be due to speaking errors rather than errors made by the classifier should instead
be taken into account during the analysis of test results. This is not yet done by the scoring
software presented in section 5, however.
For enrollment, annotation and segmentation information may be used in experiment
2 as defined in section 3.3.1, but not in the other experiments. The reason
for the choice of using annotation information in experiment 2 at enrollment time is
again easy implementation of the experiments.
The task in this experiment is speaker verification on a fixed password phrase which
is common to all speakers. There are two such phrases in POLYCOST: SEN01 with text
"Joe took father's green shoe bench out" and SEN02 with text "He eats several light
tacos". This baseline experiment shall be done on the first phrase, SEN01, only.
A client model for speaker X shall be built from the first 4 sessions for that speaker,
namely from files X/0{1,2,3,4}/SEN01. One verification test shall then be performed
on each of the remaining recordings, namely on files X/{05,...}/SEN01. Models may
not be adapted to test files.
To simulate impostor attempts, the SEN01-file from session 05 from the impostor
speakers shall be used.
The choice of training material for a world-model in this experiment is not obvious.
In the rather unlikely case of a system where all users have the same password phrase,
training on SEN01 would be the natural choice. It is not realistic in the case of
user-individual password phrases, however, since a world-model cannot be trained
for each existing password phrase in a system.
The closest alternative would perhaps be to build a world model from subword components.
This is done for instance in [2], where Parthasarathy & Rosenberg conclude that it is important that
the world model captures the text contents of the spoken utterance.
Choosing SEN01 for world-model training can be seen as the ideal case of such a
synthesized world-model. This approach has therefore been chosen for this baseline experiment.
An alternative approach, which has been abandoned, is to train a world model from
the MOT02-item.
This is more realistic in the sense that the password sentence is not represented in the
training material. However, it is not realistic to train phoneme models on this small
off-line data and the performance of a global fully text-independent world-model can
be questioned.
The speech in MOT02 is also in mother tongue while SEN01 is in English.
The task in this experiment is speaker verification on a sequence of digits which was
not represented in the enrollment material. Hence, it is a simulation of a verification
system where a sequence of digits is prompted to the client at the time of the test.
Prompting in this case is done by means of text display as opposed to audio
prompting.
Each session contains recordings of five ten-digit sequences, shown
in table 2. In this experiment, the first four sequences from the first two sessions shall
be used when building a speaker model for speaker X, namely files
X/0{1,2}/DIG0{1,2,3,4}. This gives eight occurrences of each digit for enrollment.
One verification test shall then be done on sequence 5 in each of the sessions 05 and
later, namely on files X/{05,...}/DIG05. Models may not be adapted to test files.
To simulate impostor attempts, the DIG05-file from session 05 from the impostor
speakers shall be used and a world-model, if used, shall be trained from files
DIG0{1,2,3,4} in sessions 01 to 05 for the off-line speakers.
| Item | contents |
| DIG01 | 0 1 2 3 4 5 6 7 8 9 |
| DIG02 | 8 3 9 4 6 1 7 2 0 5 |
| DIG03 | 5 0 6 9 2 8 1 3 7 4 |
| DIG04 | 9 8 7 6 5 4 3 2 1 0 |
| DIG05 | 1 0 2 9 3 8 4 7 5 6 |
Table 2. Prescribed contents of the DIG-items.
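The statement that enrollment provides eight occurrences of each digit can be verified from the contents in table 2. The sketch below counts digits over sequences DIG01 to DIG04 in sessions 01 and 02; the variable names are illustrative only.

```python
from collections import Counter

# Prescribed contents of the DIG items, copied from table 2.
DIG = {
    "DIG01": "0 1 2 3 4 5 6 7 8 9",
    "DIG02": "8 3 9 4 6 1 7 2 0 5",
    "DIG03": "5 0 6 9 2 8 1 3 7 4",
    "DIG04": "9 8 7 6 5 4 3 2 1 0",
    "DIG05": "1 0 2 9 3 8 4 7 5 6",
}
ENROL_SESSIONS = ["01", "02"]
ENROL_ITEMS = ["DIG01", "DIG02", "DIG03", "DIG04"]

counts = Counter()
for _session in ENROL_SESSIONS:
    for item in ENROL_ITEMS:
        counts.update(DIG[item].split())

print(dict(counts))
```

Every sequence is a permutation of the ten digits, so the test item DIG05 also contains each digit exactly once, in an order not seen during enrollment.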
In this experiment, models for individual digits should be trained from sequences of
digits. In a real situation the recognition system would of course have to produce
segmentations on its own (if segmentations are required). In the experiment defined
here, however, the goal is to test the speaker verification part of the system. Therefore,
we make the assumption that before the enrollment is started the speech has been put
through an "ideal" digit segmenter. With this approach a system under test does not
need to have the segmenter component. We also eliminate the influence on the built
model from differences in segmenting modules.
During the test phase, on the other hand, segmentation can usually be done implicitly
as part of the decoder operation in the speaker verification module. Hence the choice
of using segmentation information in the enrollment but not in the test phase. This
strategy was used within the CAVE project for experiments on the YOHO and SESP databases.
The task in this experiment is speaker verification in a text-independent manner on
text spoken in the speaker's mother tongue.
A speaker model for speaker X shall be built from the unconstrained speech item from
the first two sessions, namely files X/0{1,2}/MOT02. Each of these items contains up
to 20 seconds of free speech.
One verification test shall then be done on the somewhat constrained speech item in
each of the sessions 05 and later, namely on files X/{05,...}/MOT01. Models may not
be adapted to test files.
To simulate impostor attempts, the MOT01-file from session 05 from the impostor
speakers shall be used. A world-model, if used, shall be trained from the MOT02-files.
For item MOT01 subjects were asked to speak their name, Christian name, sex
(female/male), town, country and mother tongue. This constraint means that subjects
will say roughly the same thing in each test, which normalizes test utterances on text
contents. The task is still text-independent since enrollment is made on unrelated text
and models are not updated.
The knowledge of what the subject is saying in the test files may not be used a priori
in the test.
This experiment is defined in all applicable aspects exactly the same as experiment 3,
but with the task of closed-set speaker identification. Hence, speaker model X shall be
built on files X/0{1,2}/MOT02 and each of files X/{05,...}/MOT01 shall be used
for speaker identification tests. Adaptation on test utterances is not allowed. All
speakers in the database shall be registered as clients and, thus, the task is closed-set
speaker identification and there is no need for impostor tests.
| BE | Task | Speech | World model | Enrollment | Tests (FR) | Tests (FA) |
| 1 | ver | Fixed sentence | 0{1-5}/SEN01 * | 0{1-4}/SEN01 | {05,...}/SEN01 | 05/SEN01 |
| 2 | ver | Prompted digits | 0{1-5}/DIG0{1-4} * | 0{1,2}/DIG0{1-4} * | {05,...}/DIG05 | 05/DIG05 |
| 3 | ver | Free, mother tongue | 0{1-5}/MOT02 * | 0{1,2}/MOT02 | {05,...}/MOT01 | 05/MOT01 |
| 4 | id | Free, mother tongue | 0{1-5}/MOT02 * | 0{1,2}/MOT02 | {05,...}/MOT01 | - |
Table 3. Summary of test conditions for the four baseline experiments.
*) annotation information may be used.
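As an illustration of how table 3 translates into file lists, the following sketch encodes the enrollment and test columns and expands them for one speaker. The dictionary layout and function names are hypothetical, the `X/session/item` paths follow the notation used in this document rather than any actual directory layout, and the list of recorded sessions in the usage lines is invented for the example.

```python
# Hypothetical encoding of the enrollment and test columns of table 3.
EXPERIMENTS = {
    1: {"task": "ver", "enrol_sessions": [1, 2, 3, 4],
        "enrol_items": ["SEN01"], "test_item": "SEN01"},
    2: {"task": "ver", "enrol_sessions": [1, 2],
        "enrol_items": ["DIG01", "DIG02", "DIG03", "DIG04"], "test_item": "DIG05"},
    3: {"task": "ver", "enrol_sessions": [1, 2],
        "enrol_items": ["MOT02"], "test_item": "MOT01"},
    4: {"task": "id", "enrol_sessions": [1, 2],
        "enrol_items": ["MOT02"], "test_item": "MOT01"},
}


def enrollment_files(speaker, exp):
    """Files used to build the model for one client speaker."""
    cfg = EXPERIMENTS[exp]
    return [f"{speaker}/{s:02d}/{item}"
            for s in cfg["enrol_sessions"] for item in cfg["enrol_items"]]


def true_identity_files(speaker, exp, recorded_sessions):
    """One test file per recorded session numbered 05 or later."""
    item = EXPERIMENTS[exp]["test_item"]
    return [f"{speaker}/{s:02d}/{item}" for s in sorted(recorded_sessions) if s >= 5]


def impostor_file(speaker, exp):
    """Session-05 file used for impostor attempts; None for the identification task."""
    cfg = EXPERIMENTS[exp]
    if cfg["task"] != "ver":
        return None
    return f"{speaker}/05/{cfg['test_item']}"


# Usage with an invented session list for speaker M006.
print(enrollment_files("M006", 1))
print(true_identity_files("M006", 3, [1, 2, 3, 4, 5, 6, 7]))
print(impostor_file("M034", 2))
```

World-model training files are left out of the sketch because they depend on the off-line speaker list rather than on the client.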
The implementation of the baseline experiments as defined above is in principle left to
the user. To help, a set of skeleton csh-scripts (for UNIX environments) is
provided, plus a set of list files which define the different speaker sets. Note that
these are not a ready-to-go set of scripts but only a suggestion for how the basic
procedures of the baseline tests can be implemented.
There are three csh-scripts: runenrol, runtest and trainworld.
They all use the file polydefs, which contains definitions of file paths plus enrollment and test sets for
the experiments.
runenrol is intended to build all client models, while runtest
will run all true-identity and all impostor tests.
The last script, trainworld, can be used to
build a world (impostor) model.
All baseline experiments can be handled by the same scripts.
The scripts are made such that speech data can be read
either from the two distribution CD-ROMs or from a hard disk where all data has
been stored. In the first case, the script will halt for the user to change disc in the
CD-ROM reader.
The provided csh skeleton scripts will fix one known bug in the v 1.0 release of POLYCOST
when reading data directly from the CD-ROMs, namely by swapping sessions F006/01 and F031/01.
When data is read from a hard disk instead, the scripts assume that the bug has already been fixed
and the sessions are therefore not swapped. Other bugs have been discovered in v 1.0, and
new bug fixing procedures should therefore be implemented in the csh scripts for reading from the
CD-ROMs. This has not yet been done, so take care. The easiest way to avoid those bugs is
to make a local copy of the distribution and fix the bugs manually as described on the known bug page.
Perl versions of the csh scripts have also been created. There are three Perl scripts: runenrol.pl, runtest.pl and trainworld.pl.
They all use the file polydefs.pl, which contains definitions of file paths plus enrollment and test sets for
the experiments.
The Perl scripts have exactly the same functionality as the csh ones, except that
no bug fixing has been implemented. When reading data directly from v 1.0 of
the CD-ROMs, a bug fixing procedure should be implemented in the script. This has not
yet been done in the Perl scripts distributed here, so take care.
When reading data from a local copy (hard disk), make sure the bugs described on the
known bug page for the v 1.0 release of POLYCOST have been repaired manually on the local copy.
There are three pairs of list files as shown in table 4. Files are ordinary text files with
one speaker per line.
Table 4. List files which define three different speaker sets for the baseline
experiments. Disc numbers refer to the v 1.0 distribution CD-ROM discs.
In order to be able to compare different experiments done on the POLYCOST database, all results must be presented in a similar way. Otherwise two quite comparable experiments might end up with incomparable results due to the way the error rates are calculated and the results are presented. A common scoring software was developed within the CAVE project for calculating false acceptance (FA), false rejection (FR) and equal error rates (EER) based on likelihood values and thresholds from an experiment.
This implementation is based on the methods described in the EAGLES handbook [1].
This program has also been distributed to COST 250 partners.
The scoring software distributed to the COST 250 members consists of two parts, one for static scoring and one for dynamic scoring. Static scoring requires two input files: one containing likelihood scores and one containing thresholds. Dynamic scoring requires only the file with likelihood scores. The static evaluation program computes FA and FR rates for an experiment with a priori fixed thresholds, while the dynamic evaluation program adjusts the thresholds by equalising a posteriori the false acceptance and false rejection rates. The dynamic scoring program then reports these equal error rates.
The likelihood file is a text file with four items per line. Each line is the result of an access attempt. On each line must be given, in the following order:
- The identity of the true speaker, sex and number
- The identity of the claimed speaker, sex and number
- The log likelihood of the claimed speaker model
- The log likelihood of an impostor (non-speaker) model
This file must have the same number of lines as the number of attempts in the experiment and must have the extension llk in its filename (example.llk).
Example of likelihood file:
The first two lines are genuine attempts, the third line is a same-sex impostor attempt against M011 and the last line is a cross-sex impostor attempt against M011.
M006 M006 -1.016141 -2.066694
M006 M006 -1.063104 -2.082051
M034 M011 -1.693927 -2.413161
F024 M011 -1.914679 -1.981914
...
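The line format can be read with a few lines of code. The sketch below parses the four example lines and labels each attempt; it is not part of the distributed scoring software, and it assumes that the first character of a speaker identity gives the sex (M or F), as in the identities shown in this document.

```python
from typing import NamedTuple


class Attempt(NamedTuple):
    true_id: str
    claimed_id: str
    llk_claimed: float   # log likelihood of the claimed speaker model
    llk_impostor: float  # log likelihood of the impostor (non-speaker) model


def parse_llk_line(line):
    """Parse one whitespace-separated line of a likelihood file."""
    true_id, claimed_id, llk_claimed, llk_impostor = line.split()
    return Attempt(true_id, claimed_id, float(llk_claimed), float(llk_impostor))


def attempt_kind(attempt):
    """Classify an attempt; assumes the identity's first letter encodes the sex."""
    if attempt.true_id == attempt.claimed_id:
        return "genuine"
    if attempt.true_id[0] == attempt.claimed_id[0]:
        return "same-sex impostor"
    return "cross-sex impostor"


EXAMPLE = """\
M006 M006 -1.016141 -2.066694
M006 M006 -1.063104 -2.082051
M034 M011 -1.693927 -2.413161
F024 M011 -1.914679 -1.981914
"""

attempts = [parse_llk_line(line) for line in EXAMPLE.splitlines() if line.strip()]
kinds = [attempt_kind(a) for a in attempts]
print(kinds)
```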
The threshold file is a text file containing the list of all enrolled speakers together with their corresponding decision threshold on the log likelihood ratio. Each line contains the speaker identity followed by the threshold for this speaker. This file must have exactly the same number of lines as the number of enrolled speakers and must have extension thr in its filename (example.thr).
Example of threshold file:
M006 0.745374
M011 0.638556
F024 0.578569
F031 0.578722
...
The static evaluation program calculates FA and FR rates for a system with a priori fixed thresholds. It requires both the input files mentioned above.
For each claimed speaker, an FR rate is computed. These individual scores are then averaged over all male, all female and all speakers in order to give the male, female and sex independent FR. The test-set false rejection rate is computed as the total number of false rejections over the total number of genuine attempts, regardless of the distribution of speakers by gender, and of the number of tests per speaker.
For each pair of claimed speaker and impostor, a separate false acceptance rate is computed. These scores are then averaged over 4 different subsets:
- MM : male claimed speaker * male impostor
- FF : female claimed speaker * female impostor
- FM : female claimed speaker * male impostor
- MF : male claimed speaker * female impostor
Scores obtained for MM and FF are then averaged to form the "Same Sex" FA, whereas MF and FM scores are averaged into the "Cross Sex" FA. The "Sex Independent" FA is obtained as the average of the Same Sex and Cross Sex scores. The test-set FA is obtained as the proportion of false acceptances over the total number of impostor trials, regardless of the claimed speaker's and the impostor's identities and of the number of tests per speaker.
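The averaging scheme for static scores can be sketched as follows. This is an illustration of the description above, not the distributed scoring program. It assumes that an attempt is accepted when the log likelihood ratio (claimed minus impostor log likelihood) is at least the claimed speaker's threshold, and that the first letter of an identity gives the sex. The speaker identities and scores are invented for the example.

```python
from collections import defaultdict
from statistics import mean


def accepted(llk_claimed, llk_impostor, threshold):
    # Assumption: accept when the log likelihood ratio reaches the threshold.
    return (llk_claimed - llk_impostor) >= threshold


def group_means(per_key_errors, group_of):
    """Average the per-key error rates within each group."""
    grouped = defaultdict(list)
    for key, errors in per_key_errors.items():
        grouped[group_of(key)].append(mean(errors))
    return {group: mean(rates) for group, rates in grouped.items()}


def static_scores(attempts, thresholds):
    fr = defaultdict(list)  # claimed speaker -> 1 per false rejection, else 0
    fa = defaultdict(list)  # (claimed, impostor) -> 1 per false acceptance, else 0
    for true_id, claimed_id, llk_c, llk_i in attempts:
        ok = accepted(llk_c, llk_i, thresholds[claimed_id])
        if true_id == claimed_id:
            fr[claimed_id].append(0 if ok else 1)
        else:
            fa[(claimed_id, true_id)].append(1 if ok else 0)

    # Per-speaker FR averaged by sex; per-pair FA averaged by (claimed, impostor) sex.
    scores = {"FR_" + sex: r for sex, r in group_means(fr, lambda spk: spk[0]).items()}
    scores.update(group_means(fa, lambda pair: pair[0][0] + pair[1][0]))
    # Test-set rates pool all attempts regardless of speaker.
    scores["FR_test_set"] = mean(e for errs in fr.values() for e in errs)
    scores["FA_test_set"] = mean(e for errs in fa.values() for e in errs)
    if "MM" in scores and "FF" in scores:
        scores["same_sex"] = (scores["MM"] + scores["FF"]) / 2
    if "MF" in scores and "FM" in scores:
        scores["cross_sex"] = (scores["MF"] + scores["FM"]) / 2
    if "same_sex" in scores and "cross_sex" in scores:
        scores["sex_independent"] = (scores["same_sex"] + scores["cross_sex"]) / 2
    return scores


# Invented data: (true speaker, claimed speaker, llk claimed, llk impostor).
thresholds = {"M001": 0.5, "M002": 0.5, "F001": 0.5, "F002": 0.5}
attempts = [
    ("M001", "M001", 1.0, 0.0), ("M001", "M001", 0.2, 0.0),
    ("M002", "M002", 1.0, 0.0), ("F001", "F001", 0.0, 0.0),
    ("F002", "F002", 1.0, 0.0),
    ("M002", "M001", 0.9, 0.0), ("M001", "M002", 0.1, 0.0),
    ("F002", "F001", 0.1, 0.0), ("F001", "M001", 0.6, 0.0),
    ("M001", "F001", 0.0, 0.0),
]
scores = static_scores(attempts, thresholds)
print(scores)
```

Averaging per speaker (or per pair) before averaging per sex keeps speakers with many test sessions from dominating the result, which is why these figures can differ from the pooled test-set rates.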
The dynamic evaluation program computes several sets of a posteriori optimal thresholds in order to equalise FA and FR in several test configurations. It requires only the likelihood file as input, but provides 3 different threshold files as output. For each speaker, three distinct ROC curves are considered, from which are derived various average EERs.
- One ROC obtained by considering the false acceptance rate averaged over all impostor speakers of the SAME SEX as speaker X as a function of the false rejection rate for X, when the threshold of X varies
- One ROC obtained by considering the false acceptance rate averaged over all impostor speakers of the OPPOSITE SEX as speaker X as a function of the false rejection rate for X, when the threshold of X varies
- One ROC obtained by considering the false acceptance rate averaged in a GENDER-BALANCED way over the entire population of impostors against speaker X.
The male-male (MM) average EER is obtained as the average same sex EER over all male speakers, while the female-female (FF) average EER is obtained similarly for female speakers. The average of these two provides the same-sex EER.
The male-female average EER is obtained as the average cross sex EER over all male speakers, while the female-male average EER is obtained similarly for female speakers. The average of these two provides the cross-sex EER.
The sex-independent EER is estimated for each speaker from the gender-balanced ROC and the sex-independent average EER is obtained by averaging these scores in a gender-balanced way.
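For a single speaker, an equal error rate can be approximated by sweeping the threshold over the observed scores and picking the point where FA and FR are closest. The sketch below shows this common approximation; the distributed program may use a different procedure. It works on log likelihood ratios, pools the impostor scores instead of averaging per impostor speaker (equivalent when each impostor makes one attempt), and uses invented scores.

```python
def equal_error_rate(genuine, impostor):
    """Return (approximate EER, threshold) for one speaker.

    genuine and impostor are lists of log likelihood ratios; an attempt is
    assumed to be accepted when its score is at least the threshold.
    """
    best = None
    for thr in sorted(set(genuine) | set(impostor)):
        fr = sum(score < thr for score in genuine) / len(genuine)
        fa = sum(score >= thr for score in impostor) / len(impostor)
        gap = abs(fa - fr)
        if best is None or gap < best[0]:
            best = (gap, (fa + fr) / 2, thr)
    return best[1], best[2]


# Invented scores for one client speaker.
genuine_scores = [2.0, 1.5, 1.0, 0.2]
impostor_scores = [1.2, 0.5, 0.1, -0.3]
eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(eer, threshold)
```

Running the function separately on same-sex and opposite-sex impostor scores gives the per-speaker values that are then averaged into the MM, FF, MF and FM figures.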
The scoring software presents all calculated scores in text files. The format of these text files is presented below. It is recommended to always report the same-sex and sex-independent EER when performing dynamic evaluation on the POLYCOST database. When performing static evaluation the same-sex and sex-independent FA and the male and female FR should be reported. It is recommended to include the full output from the scoring program as an appendix to a report.
An example of the output from the static evaluation software is shown below (file example.sta):
by-gender average false rejection rate
-------------------------
| 30.416 (M) | |
--------------| 29.396 |
| 28.375 (F) | |
-------------------------
test-set false rejection rate
-----------
| 29.989 |
-----------
(XY) : X=claimed Y=true
by-gender average of average false acceptance rates
-----------------------------------------------------------
| 15.369 (MM) | | |
---------------| 20.229 (Same Sex) | |
| 25.090 (FF) | | |
-------------------------------------| 14.538 (Sex Ind.) |
| 10.184 (MF) | | |
---------------| 8.847 (Cross Sex) | |
| 7.509 (FM) | | |
-----------------------------------------------------------
test set false acceptance rate
-----------
| 12.359 |
-----------
An example of the output from the dynamic evaluation software is shown below (file example.dyn):
EER:
-----------------------------------------------------------
| 19.883 (MM) | | |
---------------| 22.550 (Same Sex) | |
| 25.217 (FF) | | |
-------------------------------------| 19.598 (Sex Ind.) |
| 15.534 (MF) | | |
---------------| 15.737 (Cross Sex) | |
| 15.941 (FM) | | |
-----------------------------------------------------------
A common ground for experiments on the POLYCOST database has been
established through the definition of a set of four baseline experiments
and procedures for computing and presenting results for tests.
The purpose of the guidelines is to standardize testing on this database
and thus enable comparison between experiments made in different test sites.
These guidelines should be used in experiments until the next meeting
of the COST 250 in April 1997, where a new version of guidelines will
be discussed.
[1] F. Bimbot, G. Chollet (1995). "Assessment of speaker verification systems",
In: Spoken Language Resources and Assessment, EAGLES Handbook.
(links: 1. EAGLES on-line, 2. EAGLES Handbook on Spoken Language Systems, DRAFT Version of 18 May 1995)
[2] S. Parthasarathy, A.E. Rosenberg (1996). "General Phrase Speaker
Verification Using Sub-word Background Models and Likelihood-ratio
Scoring", ICSLP-96, Philadelphia.
Changes made from the draft version of November 25th, 1996:
Section 5 on scoring procedures and presentation of results has been
added. The template scripts referred to in section 4.1 have been slightly
changed.
None of the basic guidelines have been changed.
October 14th 1997: Template Perl scripts have been added in section 4.
Håkan Melin, Johan Lindberg
KTH, Dept. of Speech, Music and Hearing (TMH)
Jean Hennebert (for the Perl scripts),
CIRC, EPFL