Korean Phoneme Discrimination

Ben Lickly
Fall 2006
CS 152 Final Project


Project Introduction

Motivation

Certain sounds in Korean are particularly difficult for native English speakers to distinguish. This is often one of the most difficult early steps in Korean language learning. In particular, the sounds for the consonants "siot" and "ssang-siot", which sound distinct to Korean speakers, both sound like the English letter "s" to native English speakers.

Example sound files (from the International Phonetic Association):

Korean word for flesh (begins with siot).

Korean word for rice (begins with ssang-siot).

Clearly, a computer program that can distinguish between these sounds would be a very useful tool for a learner of Korean. In my project, I have attempted to do just that: in particular, I have trained a feedforward backpropagation neural network to perform this task.

Data

The first step was to gather sampled data. I gathered sample data from two native speakers, one male and one female, each speaking words that contained either siot or ssang-siot. In addition, I searched through recordings from the Language Labs at Indiana University for those containing the siot and ssang-siot sounds.

In total, 52 samples were obtained: 26 of siot and 26 of ssang-siot.

After gathering enough raw data, I edited these recordings to get rid of vowels and extraneous consonant sounds, leaving each one as either a single siot or ssang-siot phoneme.

Network Setup

The network used was a feedforward network with three layers: an input layer, a hidden layer, and an output layer. The output layer had two neurons: the first was trained to go high when the input to the network was a siot sound, and the second when the input was a ssang-siot sound.

Program

The neural network was coded in MATLAB, using the MATLAB Neural Networks Toolbox as well as a function from the third-party Auditory Toolbox for MATLAB.
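As a rough illustration, the three-layer network described above could be constructed with the toolbox's newff function along the following lines. This is only a sketch: the transfer functions, the training algorithm, and the use of minmax on the feature matrix are assumptions, not necessarily the choices made in the actual program.

    % Sketch of constructing the three-layer network with the 2006-era
    % Neural Networks Toolbox.  trainInputs is assumed to be a 26-by-N
    % matrix of MFCC features (see Training); the transfer functions and
    % training algorithm below are illustrative assumptions.
    nHidden = 5;                                   % e.g. 5 hidden neurons
    net = newff(minmax(trainInputs), ...           % input ranges taken from the data
                [nHidden 2], ...                   % hidden and output layer sizes
                {'tansig', 'logsig'}, ...          % layer transfer functions
                'trainscg');                       % scaled conjugate gradient training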

The source code for my program can be downloaded here.

The data on which I trained the network is located here. Note the naming convention of "s_*.wav" for siot sound files and "ss_*.wav" for ssang-siot files. This is required by the program.
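To illustrate the naming convention, the targets for the two output neurons could be derived from the file names roughly as follows. The [1; 0] / [0; 1] encoding shown here is an assumption based on the output layer described above, not necessarily the encoding used in the original program.

    % Sketch: build a 2-by-N target matrix from the naming convention
    % (s_*.wav = siot, ss_*.wav = ssang-siot).
    files = dir('*.wav');
    targets = zeros(2, numel(files));
    for k = 1:numel(files)
        if strncmp(files(k).name, 'ss_', 3)
            targets(2, k) = 1;                     % second neuron high: ssang-siot
        elseif strncmp(files(k).name, 's_', 2)
            targets(1, k) = 1;                     % first neuron high: siot
        end
    end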

Finally, copies of the trained neural networks with 5 and 2 hidden-layer neurons are available as .mat files here and here, respectively.
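Either saved network can then be loaded and applied to a new 26-element feature vector (see Training below) with the toolbox's sim function, for example as sketched here; the .mat file name and the variable name stored inside it are hypothetical.

    % Sketch: classify one feature vector with a saved network.
    load('trained_net_5hidden.mat', 'net');        % hypothetical file/variable names
    out = sim(net, features);                      % features: 26-by-1 column vector
    if out(1) > out(2)
        disp('siot');
    else
        disp('ssang-siot');
    end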

Training

Since raw audio data is too high-dimensional for the network to handle directly, each audio sample was reduced to its Mel Frequency Cepstral Coefficients (MFCCs), a popular form of feature extraction that approximates the frequency resolution of human hearing.

Each piece of audio data was broken into 13 cepstral coefficients for each half of the sound interval. This resulted in a total of 26 coefficients to be input into the network.
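A minimal sketch of this feature extraction, assuming the Auditory Toolbox's mfcc function (which returns 13 coefficients per frame): the frame rate and the averaging of frames within each half of the recording are my interpretation of the procedure, not necessarily what the original program does.

    % Sketch: reduce one recording to a 26-element feature vector using
    % Slaney's Auditory Toolbox.  Frame rate and mean-pooling over each
    % half of the sound are assumptions.
    [snd, fs] = wavread('s_example.wav');          % hypothetical file name
    ceps = mfcc(snd, fs, 100);                     % 13-by-nFrames cepstral coefficients
    mid  = floor(size(ceps, 2) / 2);
    features = [mean(ceps(:, 1:mid), 2); ...       % 13 coefficients, first half
                mean(ceps(:, mid+1:end), 2)];      % 13 coefficients, second half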

In each round, the neural network was trained for a maximum of 10,000 epochs, with 10% of the data withheld for validation. If the network reached a mean squared error of 0.1 or below within the epoch limit, the round ended early.

If a round of training did not converge, or performance on the withheld set suggested that the network had overfit to the training set, another round of training was started.
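The round-based training described above might look roughly like the following. The hold-out split, the exact convergence and overfitting tests, and the variable names are assumptions made for the sake of illustration.

    % Sketch of the training loop: train for up to 10,000 epochs per round
    % with an MSE goal of 0.1, and start a new round with fresh weights if
    % the network fails to converge or does poorly on the held-out 10%.
    net.trainParam.epochs = 10000;
    net.trainParam.goal   = 0.1;
    converged = false;
    while ~converged
        net = init(net);                                   % reinitialize weights each round
        [net, tr]  = train(net, trainInputs, trainTargets);
        trainMSE   = tr.perf(end);
        heldOutMSE = mse(sim(net, testInputs) - testTargets);
        converged  = (trainMSE <= 0.1) && (heldOutMSE <= 0.1);
    end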

Results

With 5 hidden-layer neurons, the network converged in 120 rounds over a total of 1,013,135 epochs, in about 3 hours.

The smallest number of hidden-layer neurons for which the network converged was two. With this setup, the network was able to converge over 101 rounds of training, or 933,468 total epochs. This took about 2.5 hours to train.

With only a single hidden-layer neuron, making the overall network only as powerful as an adaline, the network did not converge, even when run for 12 hours. I suspect that the siot and ssang-siot data are not linearly separable, but I do not know for certain.

Possible Extensions

Right now the network classifies clearly pronounced Korean phonemes with very little error. The biggest problem is that its confidence diminishes rapidly when given less clearly pronounced input. The most straightforward way to improve this would be to add more diversity to the training data, including sounds that are not perfectly pronounced.

It might be possible to get a native speaker to give a more diverse range of samples. An easier way to accomplish this, however, would be to incorporate training data spoken by non-native speakers, but classified by a native Korean speaker.

References

1. Ahmad, A.M.; Ismail, S.; Samaon, D.F. Recurrent neural network with backpropagation through time for speech recognition. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1412458
2. Hosom, John-Paul; Cole, Ron; Fanty, Mark. Speech Recognition Using Neural Networks at the Center for Spoken Language Understanding. http://cslu.cse.ogi.edu/tutordemos/nnet_recog/recog.html
3. Koizumi, T.; Mori, M.; Taniguchi, S.; Maruya, M. Recurrent neural networks for phoneme recognition. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=607119
4. Marshall, Austin. Artificial Neural Network for Speech Recognition. http://www.utdallas.edu/~austinwm/ANNSpeechRecognition.pdf
5. Slaney, Malcolm. MATLAB Auditory Toolbox. http://cobweb.ecn.purdue.edu/~malcolm/interval/1998-010/
6. International Phonetic Association. Handbook of the IPA: Sound Recordings. http://www.arts.gla.ac.uk/IPA/sounds.html
7. Indiana University Korean Language Lab. http://languagelab.bh.indiana.edu/korean101.html

Other Files

My original project proposal

My midyear presentation

My final presentation