

(C) Seewald Solutions, 1180 Wien, Austria. Commercial use prohibited.



Spam Filtering

From mid-2004 to the end of 2005 I tackled the spam problem at my former employer, OFAI, as an ongoing project. Around 94% of incoming mail at the institute is spam, and without proper filtering life is practically unbearable. I am in the process of setting up an institute-wide model in cooperation with the local system administration.

In the process I have collected a set of 77,286 ham and spam mails from eight users, developed my own training methods and evaluated them against commercial and free spam filtering systems (TR-2005-04), including Symantec BrightMail 6. A more recent report, to be published in the IDA journal in Spring 2007, puts the work in a larger context, adds new empirical results and proposes a convenient approach for mail data collection.

Scripts to train SpamAssassin

One of the outcomes of this research was a way to adapt SpamAssassin to a specific mail collection. Unlike sa-learn, which only trains the NaiveBayes model, SA-Train.pl additionally learns a set of score values via a linear SVM. I found out early that a single set of score values is not sufficient for all applications. The approach outlined here is similar to SA-Train with the training methodology "simple".

SA-Train.pl uses Algorithm::SVM V0.12 instead of the Perl port of WEKA's SMO which I used before. It should be around three times faster. Memory usage has also been optimized and is now about twenty times lower. Until Algorithm::SVM V0.12 appears on CPAN, you can download the latest version here (source code).

Bugs, comments and extension requests to alex@seewald.at. If you use this for research purposes, please cite Seewald A.K.: An Evaluation of Naive Bayes Variants in Content-Based Learning for Spam Filtering, Intelligent Data Analysis 11(5), 2007.

Performance

Since the latest version, SA-Train can also use simplified training procedures -S and -B, which work as follows: -S ignores the NaiveBayes model and just learns the linear SVM, which is somewhat similar to the process for obtaining the default scores for each new version of SpamAssassin; -B outputs static weights for the BAYES_* tests and ignores all other tests (i.e. sets their score to 0), which essentially reduces to a pure NaiveBayes learner (similar to SpamBayes).
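To make the effect of a score set concrete, here is a minimal sketch of SpamAssassin-style scoring: the scores of all tests that fire on a mail are summed and compared against a threshold. The rule names are real SpamAssassin tests, but every score value below is made up for illustration, and the helper names are hypothetical. The second score set only imitates the shape of what -B is described as producing (non-zero scores for BAYES_* tests, 0 for everything else); SA-Train learns its weights from data, which this sketch does not do.

```python
# Illustrative score set; the numbers are invented, not SA-Train output.
FULL_SCORES = {
    "BAYES_00": -2.5,
    "BAYES_50": 0.75,
    "BAYES_99": 3.5,
    "HTML_MESSAGE": 0.25,
    "MISSING_DATE": 1.5,
}


def zero_non_bayes(scores):
    """Return a copy in which every test outside BAYES_* has score 0."""
    return {
        name: (value if name.startswith("BAYES_") else 0.0)
        for name, value in scores.items()
    }


def total_score(fired_tests, scores):
    """Sum the scores of the tests that fired on one mail."""
    return sum(scores.get(name, 0.0) for name in fired_tests)


def is_spam(fired_tests, scores, threshold=5.0):
    """Classify as spam when the summed score reaches the threshold."""
    return total_score(fired_tests, scores) >= threshold


if __name__ == "__main__":
    fired = ["BAYES_99", "HTML_MESSAGE", "MISSING_DATE"]
    bayes_only = zero_non_bayes(FULL_SCORES)
    print(total_score(fired, FULL_SCORES))  # all tests contribute
    print(total_score(fired, bayes_only))   # only BAYES_99 contributes
```

With a Bayes-only score set the total depends on a single BAYES_* test per mail, so the choice of threshold matters more than with the full score set.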

Based on my sample of 77,286 mails with a spam/ham ratio of roughly 1, and using five-fold cross-validation, these are the results for the three main settings: -S, -B and the default one (neither -B nor -S specified). All values are in percent (i.e. multiplied by 100).

                        (default)      -S             -B
Ham error (FP rate)     0.420±0.047    1.495±0.171    0.089±0.015
Spam error (FN rate)    0.423±0.030    3.550±0.232    2.549±0.276
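For readers who want to reproduce figures of this kind on their own mail, the following sketch shows one way to compute ham error (ham classified as spam) and spam error (spam classified as ham) in percent, and to summarize per-fold values. It assumes that the ± values denote the standard deviation across the five folds, which the text above does not state; the function names and the 'ham'/'spam' label encoding are likewise assumptions.

```python
from statistics import mean, stdev


def error_rates(labels, predictions):
    """Return (ham_error, spam_error) in percent.

    labels and predictions are equal-length sequences of 'ham' / 'spam'.
    Ham error is the share of ham classified as spam (false positives);
    spam error is the share of spam classified as ham (false negatives).
    """
    pairs = list(zip(labels, predictions))
    ham_total = sum(1 for y, _ in pairs if y == "ham")
    spam_total = sum(1 for y, _ in pairs if y == "spam")
    false_pos = sum(1 for y, p in pairs if y == "ham" and p == "spam")
    false_neg = sum(1 for y, p in pairs if y == "spam" and p == "ham")
    return 100.0 * false_pos / ham_total, 100.0 * false_neg / spam_total


def summarize(fold_values):
    """Mean and sample standard deviation over per-fold error rates."""
    return mean(fold_values), stdev(fold_values)
```

In a five-fold setup, error_rates would be called once per held-out fold and summarize once per error type over the five resulting values.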

Another way to look at the performance is with ROC curves. In this case, I have plotted ham error vs. spam error at all possible thresholds. This needs a logarithmic scale; otherwise the differences between the three settings would be too small to see.
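A curve of this kind can be produced by sweeping the threshold over all scores that actually occur and recording both error rates at each step. The sketch below is a simplified illustration, not the script used for the plot: it assumes that a mail counts as spam when its score is greater than or equal to the threshold, and the function and variable names are hypothetical.

```python
def error_curve(ham_scores, spam_scores):
    """Return a list of (threshold, ham_error, spam_error), errors in percent.

    Every distinct observed score is used once as a threshold.
    A mail is treated as spam when its score >= threshold (assumption).
    """
    thresholds = sorted(set(ham_scores) | set(spam_scores))
    curve = []
    for t in thresholds:
        # Ham at or above the threshold is wrongly flagged as spam.
        ham_err = 100.0 * sum(s >= t for s in ham_scores) / len(ham_scores)
        # Spam below the threshold slips through as ham.
        spam_err = 100.0 * sum(s < t for s in spam_scores) / len(spam_scores)
        curve.append((t, ham_err, spam_err))
    return curve


if __name__ == "__main__":
    for point in error_curve([0.0, 1.0, 2.0, 6.0], [1.0, 5.0, 7.0, 9.0]):
        print(point)
```

Points where one of the two error rates is exactly zero cannot be drawn on a logarithmic axis and would have to be dropped or clipped before plotting.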

The plot shows that -B and the default setting are rather close. Since -B is the fastest, I would suggest trying it first. For -B, the default threshold (required_hits) of 5 reduces ham error at the cost of increased spam error. If you want performance similar to the default setting, change the threshold to 2.

You could also try the default setting with different complexity parameters for the SVM (i.e. the -c parameter), which you should vary systematically. Using different thresholds for SVM output - as we did here - is not really sensible, because the resulting classifier is no longer the optimal solution of the underlying optimization problem.
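One common way to vary a complexity parameter systematically is a logarithmic grid, evaluating each value by cross-validation and keeping the best one. The sketch below only shows the selection logic. The evaluate callback is hypothetical: it stands in for whatever runs the training with a given -c value and returns cross-validated ham and spam error. Ranking candidates by the plain sum of both errors is also an assumption; if false positives are costlier, ham error should be weighted more heavily.

```python
def log_grid(low_exp, high_exp, base=10.0):
    """Candidate values base**low_exp ... base**high_exp, one per exponent."""
    return [base ** e for e in range(low_exp, high_exp + 1)]


def pick_best_c(c_values, evaluate):
    """Evaluate every candidate and return (best_c, all_results).

    evaluate(c) must return (ham_error, spam_error) in percent.
    The best candidate minimizes ham_error + spam_error (assumption).
    """
    results = {c: evaluate(c) for c in c_values}
    best = min(results, key=lambda c: sum(results[c]))
    return best, results
```

A coarse grid such as log_grid(-3, 3) can be followed by a finer grid around the best value if the error changes strongly between neighbouring candidates.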

What works worst is not using the Bayesian model, which is what -S does. This is reasonably similar to what the SpamAssassin developers do to compute the default scores, which also perform very badly on this sample.


Mail data collection

How mails should be collected at an institute - i.e. generating a model for a large(r) number of users.

This collection procedure gives excellent results and is less costly than incremental training (no more checking tens of thousands of spam mails for the elusive false positive), and the filter should work from day one with the error rates estimated via cross-validation. Humans are of course not perfect at this task: estimates of human error range from 0.25% to 1%, so any filter within this range is competitive.

Datasets

For confidentiality reasons, I cannot give away my full mailbox collection. However, I have chosen to contribute a small part of my personal emails in a reasonably safe form: as word vectors of their contents, as results from SpamAssassin rule sets, and as full sender email addresses. Available here.

Future work

Unfortunately, spam filters do not really solve the spam problem. A nice metaphor is a dam at a lake where the water keeps rising. Even if the error rate were close to zero, there is always a point where enough spam gets through to make reading mail a nuisance. The approach outlined above will keep you dry for at least a few more years. In the meantime I am working on a way to fight back against spam, which I have called proactive spam filtering, and which just might be able to stem the tide. See also my recent research project (sorry, only available in German for now).