Approach





Data Collection:

All voice samples were recorded on a Linux Pentium 150 with an SB16 sound card using the program "brec."  A standard computer microphone was used (the one with the little tiny head).  All samples were saved as Microsoft RIFF wav files, recorded at 8000 Hz and 16 bits for one second each.  The samples were gathered on three separate occasions and varied in pitch, inflection, and speed.  No attempt was made to filter out microphone artifacts.
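As a rough sketch of the recording format described above, the snippet below writes a one-second, 8000 Hz, 16-bit mono RIFF wav of silence with Python's standard wave module and reads it back. The filename is illustrative; brec itself was the actual recording tool.

```python
import wave

RATE, BITS, SECONDS = 8000, 16, 1  # format used for every voice sample

# Write one second of 16-bit mono silence in RIFF wav format.
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(BITS // 8)      # 2 bytes = 16 bits per sample
    w.setframerate(RATE)
    w.writeframes(b"\x00\x00" * RATE * SECONDS)

# Read it back and confirm the sample count matches one second of audio.
with wave.open("sample.wav", "rb") as r:
    n_samples = r.getnframes()     # 8000 samples = one second at 8 kHz
```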
The Samples:
The network was trained on a total of 8 speakers, 4 male and 4 female, with 15 utterances recorded for each.  An additional 3 speakers were recorded for the purposes of testing: two samples each from one male and one female speaker, and one utterance from a small domestic feline.


Data Processing:
 

I used Jialong He's feature extraction program to do LPC and MFCC cepstral analysis on the sound files.  The analysis was carried out in 512-sample windows with an overlap of 256 samples, and 16 coefficients were calculated for each method.  The resulting data formed a 912-dimensional vector.  This was then inserted into a pattern file that could be read in by the network, associated with an 8-dimensional one-hot encoded vector identifying the speaker.  Since the MFCC method proved superior, it was used exclusively in the later trials.
Network Training:
Because my project was completely original (a long literature search turned up nothing), I decided to begin with toolkit code and chose Lars Linden's Art Gallery.  I used the provided "Art_Sim" program to carry out training.  Each network was an ARTMAP with the ART-A type set to FUZZY and the ART-B type set to none.  The network had 1824 real-valued inputs, since complement coding was used (otherwise the norms of the prototype weights degrade severely), and 8 binary outputs.  After some experimentation, all networks were trained with an ART-A vigilance of 0.6 and a map field vigilance of 1.0.  The recode rate was always set to 1.0 (fast learning), as it was not a parameter in the project.  The networks were trained on anywhere from 2 to 10 samples.
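Complement coding is why the 912-dimensional feature vector becomes 1824 network inputs: each component a (assumed scaled into [0, 1]) is paired with 1 - a, so the L1 norm of every input is constant and fuzzy ART category weights cannot erode toward zero. A minimal sketch:

```python
import numpy as np

def complement_code(a):
    # Fuzzy ART complement coding: pair each component a_i (in [0, 1])
    # with 1 - a_i.  The L1 norm of the result is always len(a),
    # which keeps prototype weight norms from degrading during learning.
    a = np.asarray(a, dtype=float)
    return np.concatenate([a, 1.0 - a])

# A 912-dimensional feature vector becomes 1824 real-valued inputs.
I = complement_code(np.linspace(0.0, 1.0, 912))
```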
Testing:
 
Testing was carried out with an ART-A vigilance of 0.2, which experimentation showed to give the best results.  I modified Lars Linden's code to display the ART-A category for each selection so that I could deduce the identity of the speaker.  In later trials, I implemented a simple voting system.  This system takes as input three trained networks and a test pattern.  If two of the networks agree that an utterance was spoken by a particular speaker, it outputs that speaker.  Otherwise, it outputs "Unknown."
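The voting rule reduces to a small function over the three networks' per-pattern predictions. The speaker labels below are made up for illustration; the original code operated on ART-A category outputs inside Lars Linden's toolkit.

```python
from collections import Counter

def vote(predictions):
    # predictions: speaker labels produced by three independently
    # trained networks for the same test pattern.  If at least two
    # agree, return that speaker; otherwise report "Unknown".
    label, count = Counter(predictions).most_common(1)[0]
    return label if count >= 2 else "Unknown"

agreed = vote(["speaker_2", "speaker_2", "speaker_5"])   # two of three agree
split = vote(["speaker_1", "speaker_4", "speaker_7"])    # no majority
```

A usage note: with three voters and a two-vote threshold, a unanimous result and a 2-1 split are treated the same, which is a deliberate simplification.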
 
 