Results

General Findings:

Several interesting aspects of using ARTMAP for speaker recognition came out during the project:
  • MFCC coefficients seemed to perform better than LPC coefficients
  • Order of training sample presentation has a large impact on the resultant network
  • Using a high vigilance for training results in a proliferation of prototypes, which allows for good coverage.  Decreasing the vigilance before testing allows the network to generalize.
  • Because the criteria ART uses for evaluating how close a match is and whether a match is close enough are not scaled the same, varying the vigilance can affect results in a non-linear manner, but this effect is usually minor.
  • The more samples are used, the better the network does in absolute terms, but the improvement seems logarithmic, so throwing more and more samples at a network won't catch the occasional odd case.
  • Using voting did not improve the number of speakers correctly identified, but it did cause the networks to replace some incorrect identifications with the "Unknown" message, which could be useful.
  • The network was terrible at realizing when it hadn't been trained on somebody.  Even with voting it gave an identification for the two speakers it hadn't been trained on, and it even thought it recognized a cat.
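
The vigilance behavior described above can be illustrated with the standard Fuzzy ART choice and match functions (a generic sketch for illustration, not the project's actual code; variable names and the example vectors are assumptions).  Note that the choice function normalizes by the prototype's size while the vigilance test normalizes by the input's size, which is the scaling mismatch mentioned above:

```python
# Minimal sketch of Fuzzy ART's category choice and vigilance test
# (illustrative only; not this project's implementation).

def fuzzy_and(a, b):
    """Component-wise minimum: the fuzzy AND used by Fuzzy ART."""
    return [min(x, y) for x, y in zip(a, b)]

def norm(v):
    """City-block norm |v| = sum of components."""
    return sum(v)

def choice(input_vec, weight, alpha=0.001):
    """Choice function T_j = |I ^ w_j| / (alpha + |w_j|).
    Scaled by the prototype's own size, not the input's."""
    return norm(fuzzy_and(input_vec, weight)) / (alpha + norm(weight))

def passes_vigilance(input_vec, weight, rho):
    """Match criterion |I ^ w_j| / |I| >= rho, scaled by the input's size.
    A high rho during training forces many new prototypes (good coverage);
    lowering rho before testing lets existing prototypes generalize."""
    return norm(fuzzy_and(input_vec, weight)) / norm(input_vec) >= rho

# Example: a near-match passes a low vigilance but fails a high one.
I = [0.9, 0.1, 0.8, 0.2]   # assumed complement-coded-style input
w = [0.8, 0.1, 0.7, 0.2]   # assumed prototype weights
print(passes_vigilance(I, w, rho=0.7))   # True  (match ratio is 0.9)
print(passes_vigilance(I, w, rho=0.95))  # False
```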

  "Best of Breed" Results:

    Test                                                   #Incorrect                            Performance (% correct)
    Trained on 2, Tested on 3 (for each speaker)           12/24                                 50%
    Trained on 3, Tested on 2                              7/16                                  56.25%
    Trained on 10, Tested on 5                             10/40                                 75%
    Trained on 10, Tested on 5, three ways with voting     9/40 incorrect, 1/40 unclassified     72.5% correct, 2.5% unclassified
    Trained on 8, Tested on 7                              9/56                                  84%
    Trained on 8, Tested on 7 with voting*                 14/56 incorrect, 9/56 unclassified    59% correct, 16% unable to classify

    *Note: The best-performing network was not used.  Instead, three networks with performances of 59%, 68%, and 68% were used.
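
The effect of voting seen in the table (some wrong answers become "Unknown" rather than misidentifications) can be sketched as a simple strict-majority vote across independently trained networks.  This is an assumed illustration of the idea, not the project's actual voting code, and the speaker labels are made up:

```python
from collections import Counter

def vote(predictions):
    """Combine several networks' answers for one test sample.
    Returns the winning speaker label, or "Unknown" when no label
    achieves a strict majority -- turning some would-be wrong
    identifications into an explicit failure to classify."""
    counts = Counter(predictions)
    label, n = counts.most_common(1)[0]
    if n > len(predictions) / 2:
        return label
    return "Unknown"

print(vote(["alice", "alice", "bob"]))   # "alice": 2 of 3 networks agree
print(vote(["alice", "bob", "carol"]))   # "Unknown": no majority
```

With three networks this trades a few correct single-network answers for the ability to flag disagreement, which matches the drop from 84% to 59% correct alongside the new 16% "unable to classify" column.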


    Conclusion:
     

    FUZZY ARTMAP's main advantage is its incredible speed.  Total time to do three-way voting, including loading in the networks (about 750K each), doing all the tests, and processing 884K of data, is 6 seconds.  Training usually takes from 1 to 10 seconds, even for 10 samples.  Its main disadvantage is its unpredictability: the same set of samples produced anywhere from a 59% to an 84% success rate depending on the order in which they were presented.  Still, especially with voting, the system could be used where adaptability and speed are big factors and absolute identification is not.

    Furthermore, the results are consistently much better than chance, and in the best cases approach those of LVQ methods trained with fairly large codebooks.  Also, considering the small number of samples, the network does well compared to humans.  Two days ago, I talked to one of the participants for a good 10 sentences before recognizing her voice, and I've known her for a year and a half now.  Overall, this technique is interesting and potentially worthy of further research.
     
