9/98 - 12/98 Summary
During the second semester of my project, I ran an experiment of my own design. I recorded my own EEG data, but based the program of stimuli shown to the subject on the program used by Charles Anderson at CSU, so the format of my EEG data is very similar to his.
EEG Signal Recording
The data for this experiment were taken using the neuroscience facilities at Pomona College, which are located in a small two-story house that has been converted into a laboratory. The subject (me) was seated in a closet with dim lighting, a comfortable chair, and a computer running the program used to present visual stimuli to the subject: NeuroScan Inc.'s STIM 2.0. STIM takes "sequence" files as input, which designate which images or sounds are to be displayed, and in what order. Each image or sound is coupled with a typecode, which is sent to the EEG recorder when the image or sound is presented. This typecode gets marked on the EEG so that it can later be determined which part of the recording corresponds to a particular stimulus. The sequence files and images I used are available here. The EEG data I took are available in a variety of formats here. The most useful format is probably the one in the "matlab" directory. The contents of that directory are available as a gzip'd tar file here. The Matlab files have 12 channels of data each. The *beeps.txt files indicate the sample number at which a beep occurred.
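To give a concrete picture of how the data are organized, here is a minimal Matlab sketch of loading one recording and cutting out the five-second concentration epochs that follow each beep. The file and variable names (task1.mat, eeg, task1beeps.txt) are placeholders, not necessarily the names used in the actual files.

```matlab
% Minimal sketch: load one recording and extract the 5 s epochs following
% each beep.  File and variable names are placeholders.
fs = 250;                            % sampling rate in Hz
load('task1.mat');                   % hypothetical: defines eeg, a 12 x N matrix
beeps = load('task1beeps.txt');      % sample numbers at which the beeps occurred

epochs = cell(1, length(beeps));
for k = 1:length(beeps)
    idx = beeps(k) : beeps(k) + 5*fs - 1;   % 1250 samples after the beep
    epochs{k} = eeg(:, idx);                % one 12 x 1250 concentration epoch
end
```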
The computer used to present the stimuli had internal fans that produced some slight noise, but other than that, the closet in which the subject was seated was fairly silent. A QuikCap-64 was used to record from positions FPZ, F3, FZ, F4, FCZ, C3, CZ, C4, PZ, P3, POZ, and P4, as defined by the 10-20 system of electrode placement. These twelve channels were referenced to electrically linked mastoids at M1 and M2. The impedance of all electrodes was kept below 20 kΩ. The data were recorded at a sampling rate of 250 Hz with a SynAmps Model 5038 EEG amplifier, which uses a 16-bit A/D converter. A serial cable connected the stimulus presentation computer to the SynAmps EEG amplifier, and was used to signal when a stimulus was presented. The SynAmps was programmed to do analog bandpass filtering from 0.15-30 Hz, and was calibrated with a known voltage before the recording session. Eye blinks were detected by means of a separate channel of data recorded from an electrode placed below the subject's left eye (VEOG).
Data were recorded from one subject, a 21-year-old, right-handed male college student (me). Five different programs of stimulus were presented using STIM, each displaying a total of ten images. The subject was given written instructions at the beginning of each program of ten images. In general, the instructions were to view an image related to a particular mental task, and to concentrate on that task, after hearing an audible tone, until the next image was presented. Each program of stimulus took the following format:
1. An image was presented for 5 seconds.
2. A blank (dark) screen was presented for 5 seconds.
3. A 1 kHz tone (beep) sounded.
4. The blank screen continued for another 5 seconds.
5. The next image in the program was presented.
All tasks were performed with the subject's eyes open. The tasks used in this experiment are the same as those chosen by Keirn and Aunon in [1] to invoke hemispheric brainwave asymmetry. The five tasks were:
Baseline Task: The instructions given to the subject preceding the stimulus program were not to perform a specific mental task, but to relax as much as possible and think of nothing in particular. This task is considered a baseline task for alpha wave production and was used as a control measure of the EEG. The ten images presented were all exactly the same, and consisted of a white (blank) screen.
Letter Task: The subject was shown images consisting of a black word on a white background. Each word was indicative of a friend or family member (e.g., "father", "mother", "aunt", "uncle", etc.), and the subject was asked to mentally compose a letter to that person without vocalizing or making any physical movements.
Math Task: The subject was shown images consisting of nontrivial multiplication problems, such as 89 times 67, and was asked to solve them without vocalizing or making any physical movements. The problems were designed so that they could not be solved in the time allowed. Although the problems were repeated, the subject did not solve any of them to completion.
Geometric Figure Rotation: The subject was shown images of three-dimensional figures (rendered and shaded), and asked to visualize them rotating about an axis.
Visual Counting: The subject was shown an image of black Arabic numerals on a white background, and asked to visualize similar numerals being written on a blackboard one after another, sequentially in ascending order, with each numeral being erased before the next was written.
Data were recorded for 15 seconds per image, or 150 seconds per program. Thus, one recording of all five programs resulted in 750 seconds of data. The five programs were each recorded twice, giving a total of 1500 seconds of data containing 100 beeps, each beep indicating the start of a five-second period during which the subject was to concentrate on a particular brain state.
After I had some time to examine the data, it became apparent that CPZ, channel 14 on the QuikCap, had not been recording properly. However, all the other channels were quite clear, and since CPZ is along the center line, discarding it would not have an adverse effect on the asymmetry ratios if I wanted to use them later in the experiment. This left me with 11 channels of scalp data, and a single channel for detecting eye movements.
Artifact Removal
Contamination of EEG activity by eye movements, blinks, cardiac signals, and muscle and line noise is a serious problem for EEG interpretation and analysis. One way of dealing with this problem is to simply reject segments of EEG with unacceptable amounts of noise. However, this may result in an unacceptable amount of data loss. Fortunately, there are algorithmic alternatives to discarding data. One algorithm in particular stands out from the rest: Independent Component Analysis (ICA). To understand what it does and why it serves our purpose, it helps to have some context about the type of data we are dealing with when we analyze EEG.
The signals we hope to measure when we record voltage potentials from a subject's scalp are those that result from the activity of neurons some significant distance away from the electrode taking the measurement. Each electrode "hears" a summation of all the neural activity in its vicinity. The recordings differ from one another because the electrodes sit at different locations on the scalp: neural activity close to one electrode will be "louder" in that electrode's recording than activity farther from it. Thus, in an ideal situation, each electrode would detect a unique linear mixture of all the neural activity happening in the subject's brain.
Unfortunately, this ideal linear mixture is augmented by other electrical activity that does not come from neurons firing. Typically, these noise signals are much greater in amplitude than the signals of interest, and have the effect of obliterating a good amount of useful information. Some of the noise signals, such as those resulting from eye blinks and other muscle movements, are infrequent enough that the segments of data in which they appear can simply be discarded without losing too much. Others, such as cardiac signals and eye movements, are regular enough to make obtaining useful data a cumbersome task. The problem of removing this noise from the interesting signal can be stated as follows: from N unique linear mixtures of an undetermined number of sources, can we somehow separate out N statistically independent mixtures? In other words, can we "unmix" the statistically unrelated noise onto a separate channel from the interesting signals? In fact, it has been known for some time that this is possible.
Independent Component Analysis, proposed by Bell and Sejnowski in [2], is a simple neural algorithm that blindly separates mixtures of independent sources using infomax. In [2], they show that maximizing the joint entropy of the output of a neural processor minimizes the mutual information among the output components. Bell and Sejnowski offer the following two reasons for why ICA is suitable for performing blind source separation on EEG data: (1) it is plausible that EEG data recorded at multiple scalp electrodes are linear sums of temporally independent components arising from spatially fixed, distinct, or overlapping brain or extra-brain networks, and, (2) spatial smearing of EEG data by volume conduction does not involve significant time delays [2].
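To make the algorithm concrete, here is a small Matlab sketch of the Bell-Sejnowski natural-gradient infomax learning rule applied to a [channels x samples] data matrix. This is only an illustration of the rule from [2], not the code used for the experiment; in practice an existing, well-tested implementation would normally be used, and the parameter values here are arbitrary.

```matlab
% Illustrative sketch of infomax ICA (Bell & Sejnowski [2], natural-gradient
% form).  X is a [channels x samples] matrix of EEG data.
function [W, S] = infomax_ica(X, lrate, nepochs, block)
    [n, T] = size(X);
    X = X - mean(X, 2) * ones(1, T);   % remove each channel's mean
    W = eye(n);                        % unmixing matrix, initialized to identity
    for epoch = 1:nepochs
        perm = randperm(T);            % visit samples in random order
        for b = 1:block:(T - block + 1)
            x = X(:, perm(b:b+block-1));
            u = W * x;                 % current estimates of the sources
            y = 1 ./ (1 + exp(-u));    % logistic nonlinearity
            % natural-gradient infomax update
            W = W + lrate * (block*eye(n) + (1 - 2*y)*u') * W;
        end
    end
    S = W * X;                         % estimated independent components
end
```

Applied to the 12-channel recordings, the rows of S are the source activations; the row dominated by eye and muscle activity can be dropped and the remaining rows kept for further processing.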
Representation of EEG Signals
The key to training a neural network to do a reliable discrimination is finding a suitable representation of the EEG signals. Since the early days of automatic EEG processing in the medical community, representations based on a Fourier transform have been most commonly applied to the problem of discriminating and classifying EEG patterns. This approach builds upon earlier observations that there are some characteristic waveforms that fall primarily within four frequency bands: delta (1-3 Hz), theta (4-7 Hz), alpha (8-13 Hz), and beta (14-20 Hz).
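As an aside, once a power spectrum has been computed, the power in each of these bands can be read off directly. The snippet below is purely illustrative (it is not part of the representation I used); psd and freqs are assumed to be a single channel's power spectrum and its matching frequency axis in Hz.

```matlab
% Illustrative: total power in the four classical EEG bands, given a power
% spectrum psd and its frequency axis freqs (both assumed to exist).
bands = [1 3; 4 7; 8 13; 14 20];           % delta, theta, alpha, beta (Hz)
band_power = zeros(1, size(bands, 1));
for b = 1:size(bands, 1)
    in_band = freqs >= bands(b,1) & freqs <= bands(b,2);
    band_power(b) = sum(psd(in_band));
end
```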
In related work, Anderson, Devulapalli, and Stolz [3] found that a frequency-band representation yielded the best results of the four methods that they tried. Others have had success with similar representations as well. I wanted to use ICA for artifact removal, but I realized that if I did so, I would not be able to include the asymmetry ratios, which had shown very positive results in the past, in my signal representation. This is because the sources computed by ICA do not have the same spatial relationship to the skull as the signals derived from the electrodes. However, I reasoned that although the asymmetry ratios certainly emphasized the differences between mental states in certain frequency bands, these differences would still be present in any frequency-band representation of the data, regardless of whether or not I precomputed them and presented them to the network explicitly.
Thus, I decided to use a representation based on the power spectral densities of the sources computed by ICA. With a sample rate of 250 Hz and 12 channels of data, each five-second window of time during which the subject was to be concentrating on a particular brain state contained 15,000 data points. After computing the ICA sources and discarding the one that was representative of eye and muscle movements, I was left with 11 channels of data.
Within each period of concentration, I took ten windows of 11-channel EEG data, each offset by 50 samples from the one before it. For example, the first of the ten started at the beep, the second started 50 samples after the beep, and so on. Since there were 100 windows per task per session (ten images, ten windows each), and two sessions of each mental state, each mental state was represented by 200 feature vectors. Of these, half were used for training and half for validation. Because of the 50-sample offset, windows longer than 50 samples overlapped the previous window, while shorter ones did not.
The length of the window was varied from one half second (125 samples) down to one nineteenth of a second (12 samples). For each window, I computed the Discrete Fourier Transform of each channel, which left me with 11 vectors, each containing a number of values equal to the number of samples in the time domain. The power spectral density was computed by taking the element-wise product of each Fourier transform with its complex conjugate and dividing the resulting vector by the number of values in the transform. These 11 power spectral density vectors were concatenated together to form a presentation vector.
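Putting the last two paragraphs together, the feature extraction looks roughly like the sketch below. Here sources is assumed to be the 11 x N matrix of retained ICA source activations for one recording and beep the sample index at which one concentration period began; the names and the specific window length are illustrative.

```matlab
% Sketch of building the presentation vectors for one concentration period.
winlen = 125;                      % 0.5 s at 250 Hz (62, 31, and 13 were also tried)
offset = 50;                       % shift between successive windows
nwin   = 10;                       % windows per concentration period
nchan  = size(sources, 1);         % 11 retained ICA sources

features = zeros(nwin, nchan * winlen);
for w = 1:nwin
    first = beep + (w - 1) * offset;
    seg   = sources(:, first:first + winlen - 1);   % nchan x winlen window
    X     = fft(seg, [], 2);                        % DFT of each channel (row-wise)
    psd   = real(X .* conj(X)) / winlen;            % power spectral density
    features(w, :) = reshape(psd.', 1, []);         % concatenate the 11 PSDs
end
```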
Pattern Classification
Three-layer feedforward artificial neural networks were trained using Matlab with a modified version of the standard backprop algorithm (tbpx). The learning rate was dynamically adjusted as the network trained: as the sum-squared error (SSE) of the network decreased, the learning rate increased, and when the SSE increased, the learning rate fell back down to a preset minimum. With regard to the number of hidden nodes in each hidden layer, a variety of thoughtfully selected configurations were tried. Some of the best performing configurations were 40-5, 50-10, 100-10, 250-50, and 1000-100, where 40-5 indicates that there were 40 hidden nodes in the first layer and 5 in the second. All networks were trained on Turing, the Harvey Mudd College Computer Science Department's six-processor Sun Ultra Enterprise 3000 with 1.5 GB of memory.
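The learning-rate schedule amounts to something like the fragment below; the growth factor, minimum rate, and the train_one_epoch helper are placeholders for illustration only, not the actual tbpx settings.

```matlab
% Sketch of the adaptive learning-rate rule described above.  The values and
% the train_one_epoch helper are placeholders, not the actual tbpx settings.
lr       = 0.01;                 % current learning rate
lr_min   = 0.01;                 % preset minimum
lr_grow  = 1.05;                 % multiplicative increase while the error falls
sse_prev = Inf;

for epoch = 1:5000
    sse = train_one_epoch(lr);   % hypothetical: one backprop pass, returns SSE
    if sse < sse_prev
        lr = lr * lr_grow;       % error decreased: speed up
    else
        lr = lr_min;             % error increased: fall back to the minimum
    end
    sse_prev = sse;
end
```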
Networks were trained to differentiate between all pairs of mental states, and all triples of mental states. One-hot encoding was used to enumerate mental states, and a "correct" classification was one in which the correct output was larger than all other outputs. Half of the total number of feature vectors were used as a validation set to prevent over-fitting, and training was stopped when the SSE of the validation set did not decrease for 200 epochs. Once this occurred, the network weights were reset to their values at the point when the network received the highest "score" upon evaluation of the validation set, where the "score" of the network was the number of validation patterns classified correctly.
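The validation bookkeeping described above reduces to roughly the following fragment, run at the end of each training epoch. Here valout is the [classes x patterns] matrix of network outputs on the validation set, valtarg the matching one-hot targets, and the best_*/stall variables are assumed to be initialized before the training loop; all names are illustrative.

```matlab
% Sketch of the validation score and early-stopping bookkeeping.
val_sse = sum(sum((valout - valtarg).^2));   % validation sum-squared error
[mx, predicted] = max(valout, [], 1);        % index of the largest output
[mx, actual]    = max(valtarg, [], 1);       % index of the 1 in each one-hot target
score = sum(predicted == actual);            % validation patterns classified correctly

if score > best_score                        % remember the best-scoring weights
    best_score   = score;
    best_weights = weights;
end
if val_sse < best_sse                        % track epochs without SSE improvement
    best_sse = val_sse;
    stall    = 0;
else
    stall = stall + 1;
end
if stall >= 200                              % stop and restore the best network
    weights = best_weights;
end
```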
Results
Table 1 shows the results of differentiation between two mental tasks. The notation used in the "Best Network" column indicates the number of hidden nodes in each hidden layer; for example, "40-5" indicates that there were 40 nodes in the first hidden layer and 5 in the second. In the "Best Classification" column, the percentage indicates the proportion of test patterns classified correctly, shown together with the pair of mental tasks that produced that result. Clearly, some pairs of tasks are more easily differentiated than others.
Window Length (sec, samples) | Best Network | Best Classification | Worst Classification | Training Time (min) |
0.5, 125 | 40-5 | geom, mult: 86% | count, mult: 77% | 7 |
0.25, 62 | 40-5 | base, letter: 90% | count, letter: 69% | 6 |
0.125, 31 | 100-10 | base, count: 82% | count, letter: 69% | 3 |
0.05, 13 | 1000-100 | geom, mult: 85% | count, letter: 67% | 10 |
Table 2 shows the results of differentiation between three mental tasks. The classification accuracies indicate that this was a much harder problem for the network to learn. Another indication of the difficulty of the problem is that networks with a greater number of hidden nodes performed better. When differentiating between only two mental tasks, it was sufficient to use fewer hidden nodes: beyond a certain threshold, classification accuracy did not increase as the number of hidden nodes was increased. After finishing the results for three-way differentiation, it became clear that it would not be worth the training time to compute them for four-way differentiation.
Window Length (sec, samples) | Best Network | Best Classification | Worst Classification | Training Time (min) |
0.5, 125 | 100-10 | base, letter, mult: 86% | count, letter, mult: 71% | 30 |
0.25, 62 | 250-50 | count, geom, letter: 77% | count, geom, letter: 63% | 16 |
0.125, 31 | 250-50 | geom, letter, mult: 74% | count, letter, mult: 56% | 8 |
0.05, 13 | 250-50 | geom, letter, mult: 66% | count, letter, mult: 51% | 6 |
Conclusion
Accurate, two-way differentiation can be done using a short window of EEG data. This is probably the most significant result of the experiment, because most applications for control systems have real-time requirements. For example, with a recognition rate of one symbol per second, it would be very difficult to steer a wheelchair or compose a letter on the computer. On the other hand, with a rate of sixteen symbols per second it might be possible to accomplish something.
Increasing the number of hidden nodes increases the accuracy of classification, although the increase in accuracy is very gradual after a point. Unfortunately, since larger networks take longer to train, there is a threshold at which the return in accuracy does not justify the investment in training time.
Finally, ICA is fast and useful for removing artifacts without discarding useful data. This experiment is one more verification that the ICA algorithm works as a method of removing artifacts from EEG data. Furthermore, the processing time required to run the ICA algorithm was insignificant compared to the time required to train the neural networks using backprop.
[1] Zachary A. Keirn and Jorge I. Aunon. A New Mode of Communication Between Man and His Surroundings. IEEE Transactions on Biomedical Engineering, 37(12):1209-1214, December 1990.
[2] A. J. Bell and T. J. Sejnowski. An Information Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation, 7:1129-1159, 1995.
[3] C. W. Anderson, S. V. Devulapalli, and E. A. Stolz. EEG Signal Classification with Different Signal Representations. Neural Networks for Signal Processing V, pages 475-483. IEEE Service Center, Piscataway, NJ, 1995.