Setting Up SpamAssassin

SpamAssassin is a mail filter that runs a series of tests on email headers and bodies to identify spam. It has only limited ability to manage mail on its own, so it works best when coupled with Procmail. Each test has an associated point value. After all the tests have run, SpamAssassin adds up the points the email "earned" and, if the total is higher than a set limit, handles the email according to the settings you give it.

SpamAssassin is installed on Turing, but it is not used by default. This document describes the basics of setting up and using SpamAssassin to filter your mail. If you are already using SpamAssassin, please note that we are using the spamd daemon to improve system performance, so change $HOME/local/bin/spamassassin (or wherever the SpamAssassin install you're using is) in your .procmailrc file to /usr/local/bin/spamc.

The .procmailrc File

To enable SpamAssassin on your account, you'll need a .procmailrc file in your home directory. If you don't know how to use Procmail, I suggest reading the Procmail qref. To get Procmail to use SpamAssassin, see the example provided by the makers of SpamAssassin. I've modified the example to work correctly on Turing by making it run the daemonized version of SpamAssassin instead of the default.

This example handles spam by moving mail with a score of 15 or greater to "almost-certainly-spam" and mail with a score between your set threshold (which I'll explain how to set later) and 15 to "probably-spam". It also skips really long messages (spam is usually reasonably short) and includes a workaround for a bug in Procmail. Here, I'll explain the two most important parts of the example.
:0fw: spamassassin.lock
:0:
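
The folder decision described above (15 points or more, between your threshold and 15, everything else) can be sketched in a few lines of Python. This is only an illustration of the logic, not the Procmail recipe that runs on Turing: the function name choose_folder, the "inbox" label, and the default threshold of 5.0 are assumptions made for the example.

```python
# Sketch of the folder decision described in the text.
# Only the cutoff of 15 and the two spam folder names come from the document.
ALMOST_CERTAIN_CUTOFF = 15.0


def choose_folder(score, threshold=5.0):
    """Return the folder a message with this spam score would be filed in."""
    if score >= ALMOST_CERTAIN_CUTOFF:
        return "almost-certainly-spam"
    if score >= threshold:
        return "probably-spam"
    return "inbox"  # hypothetical label for normal delivery


if __name__ == "__main__":
    for s in (2.1, 7.0, 15.0, 23.4):
        print(s, "->", choose_folder(s))
```

Raising the threshold only moves the lower boundary; the "almost-certainly-spam" cutoff stays at 15 regardless of your personal setting.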
SpamAssassin's User Prefs

To configure how SpamAssassin works for you, you'll need a user_prefs file in $HOME/.spamassassin/. Fortunately, there's a really cool SpamAssassin Configuration Generator online that will make this file for you. It is very well documented, so you shouldn't have too much trouble. To make SpamAssassin work on Turing, some options should be disabled; I note them below.

The generator is divided into four sections: Threshold and Report Options, Bayes Options, Network Test Options, and Language Options.

The Threshold and Report Options section tells SpamAssassin how strict to be and what to do with spam; set these options as you see fit. One of the most important settings is the threshold: the score an e-mail needs to be considered "spam". The generator gives you three options (5.0, 7.5 and 10.0), but you can change the setting to whatever you want by editing the file the generator creates. AC uses a rather conservative score of 9.0 on their mail servers. I use 6.5, and have yet to get a false positive (a good e-mail marked as spam). Others, however, have gotten quite a few false positives even with thresholds higher than 7.0, so you should play with this setting for a while to see what's right for you.

The Bayes section sets up the Bayesian analysis system (see below). This can be very helpful, but it can sometimes create rather large data files, up to 34MB in a few instances (I train on a lot of spam, and the files are around 8MB), so you may want to disable it if disk space is tight.

The Network Test options should also be disabled. RBL checks use up extra network bandwidth, and SpamAssassin works great without them. Also, we don't have the software for the network checksum tests installed, so enabling those options won't do any good anyway.

The Language Options section tells SpamAssassin how to deal with messages written in different languages, usually based on which character set is used. I can't remember ever getting a non-English spam, so I don't know how well these options work, but feel free to set them up however you want.

The Configuration Generator makes a very well commented user_prefs file, so you don't need to go back to the generator every time you want to modify your settings. Just use your favorite text editor to change the variables in the file, which should be reasonably self-explanatory. To get more information, type perldoc Mail::SpamAssassin::Conf on Turing.
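
To see why the threshold matters, here is a small Python sketch of the scoring idea: every test a message matches contributes its points, and the total is compared with your threshold. The test names and point values are made up for illustration and are not real SpamAssassin rules; only the thresholds 5.0, 6.5 and 9.0 come from the text above.

```python
# Hypothetical tests and point values, for illustration only.
EXAMPLE_POINTS = {
    "SUBJECT_ALL_CAPS": 1.5,
    "MENTIONS_FREE_OFFER": 2.0,
    "FORGED_SENDER": 3.5,
}


def total_score(matched_tests, points=EXAMPLE_POINTS):
    """Add up the points of every test the message matched."""
    return sum(points[name] for name in matched_tests)


def is_spam(matched_tests, threshold):
    """A message counts as spam once its total reaches the threshold."""
    return total_score(matched_tests) >= threshold


if __name__ == "__main__":
    matched = ["SUBJECT_ALL_CAPS", "MENTIONS_FREE_OFFER", "FORGED_SENDER"]
    for threshold in (5.0, 6.5, 9.0):
        print(threshold, is_spam(matched, threshold))
```

With these made-up numbers the message totals 7.0 points, so it is treated as spam at thresholds of 5.0 and 6.5 but delivered normally at 9.0.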
Using Bayesian Filtering Effectively

To turn on the Bayesian filter, you must have the line use_bayes 1 in your user_prefs file. You likely want the line auto_learn 1 there too.

To train the Bayesian filter, use the sa-learn command. (Training will not work until you have use_bayes 1 in the user_prefs file!) See man sa-learn for the various options; the most important are --spam, which indicates that you are providing messages that should be considered spam, and --ham, which indicates that you are training on non-spam. The other important part of the command is the location of the message(s) to train on. Options are:
For example, my .cshrc file contains the line

    alias learnspam 'sa-learn --spam ~/Maildir/.Spam/cur'

because I have procmail put messages that are possibly spam (for me, scores between 5 and 8) into a Spam IMAP directory. If you have the default mail configuration on Turing, then this might be

    alias learnspam 'sa-learn --spam --mbox ~/Mail/probably-spam'

or whatever your spam folder is named. Every day or two I check that nothing important has fallen into the Spam folder, run learnspam to train on accumulated spam, and clear the folder. (I also have a clearspam alias to delete the spam-folder messages, and a learnham alias that trains by assuming my inbox contains non-spam.)

If you have a very large database and sa-learn is crashing, you might need to run unlimit datasize first so that the process can allocate enough memory.

Bayesian filtering will only start after enough spam and non-spam messages have been sent through sa-learn (200 spam and 200 non-spam). I have found that with adequate training the Bayesian filter is rather accurate. Therefore, I prefer to boost the penalty for high-probability messages by putting the following lines in my user_prefs file:

    score BAYES_90 0 0 4.2 4.2
    score BAYES_99 0 0 5 5

This sets the penalty for being 90%-sure to 4.2, and the penalty for being 99%-sure to 5. (The four values on each line are the scores for SpamAssassin's four score sets; the last two are the ones used when Bayes is enabled.) My required_hits cutoff is 5, which means that messages scoring 99% will always be considered spam. You may wish to adjust these penalties up or down according to your preferences.

Additional Information

Here are some additional sources of information:
HMC Computer Science Department
Last Modified Saturday, 11-Sep-2004 03:03:25 PDT