To use Grindstone4Spam for optimization, we recommend following these four simple steps (a complete command sketch follows the list):
- Compile a corpus: You should gather spam and legitimate messages from your environment. Please note that the legitimate messages received at a pharmaceutical company differ from those received at a financial company, so using real messages from the target environment is better than compiling other kinds of messages (distribution lists, publicly available corpora, etc.) to improve spam filters. You should add new messages to your corpus periodically to avoid the effects of concept drift. Place spam messages in src_data/spam and ham messages in src_data/ham. A collection of more than 5000 messages is a good choice for optimizing your filters.
- Build a cross-validation: Once you have compiled the e-mail corpus, we recommend executing a cross-validation scheme. Cross-validation makes it possible to include Naïve Bayes rules in the optimization and yields a statistically sounder result. This step takes less than 5 seconds, so there is little reason to skip it. Simply execute ./makexvalidation; the cross-validation is generated in the xval directory.
- Generate logs: You should generate the files spam.log and ham.log, which are used by the optimization process. For each message, the log files record the rules that match, avoiding the need to classify a single message many times by calling the SpamAssassin client (spamc). This process takes a long time (53 minutes for 9300 messages and 902 rules). Simply run ./checkmassesxval.
- Execute the optimization and evaluation processes: Once the log files have been computed, you can run the optimization and/or evaluation by executing ./optimizefilter --ham-log ham.log --spam-log spam.log and/or ./evaluatefilter --ham-log ham.log --spam-log spam.log. The best optimized scores are printed to stdout and stored in 52_optimized_scores.cf. To install this score file, replace the original SpamAssassin /usr/share/spamassassin/50_scores.cf with the generated one.
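The whole workflow can be summarized as the following shell session. This is a minimal sketch: the source mail paths, the backup copy of the original score file and the use of sudo are illustrative assumptions, while the Grindstone4Spam commands and file names are those described in the steps above.

cp /path/to/your/spam/* src_data/spam/    # step 1: spam messages from your environment
cp /path/to/your/ham/* src_data/ham/      # step 1: legitimate messages from your environment
./makexvalidation                         # step 2: builds the cross-validation in xval
./checkmassesxval                         # step 3: generates spam.log and ham.log (slow)
./optimizefilter --ham-log ham.log --spam-log spam.log   # step 4: optimization
./evaluatefilter --ham-log ham.log --spam-log spam.log   # step 4: evaluation
sudo cp /usr/share/spamassassin/50_scores.cf 50_scores.cf.orig        # keep a backup (assumption)
sudo cp 52_optimized_scores.cf /usr/share/spamassassin/50_scores.cf   # install the new scores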
Every time you update the compiled messages, you should remove spam.log, ham.log and the xval directory, and return to the second step.
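For example, assuming everything is kept in the Grindstone4Spam working directory, the refresh cycle is:

rm -f spam.log ham.log    # discard the cached rule-match logs
rm -rf xval               # discard the old cross-validation
./makexvalidation         # rebuild the cross-validation (second step)
./checkmassesxval         # regenerate spam.log and ham.log (third step)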
During filter operation (especially when autolearn is active), you will notice that the Bayes database keeps growing. This growth is caused by the use of logical deletions and the continuous storage of new information, and it significantly increases the time needed to filter a message. Executing a backup-clean-restore cycle with the sa-learn tool reduces the space wasted by logically deleted information and improves filter performance. You can increase performance even further by removing useless terms from your filter. nsa-learn is a modification of the original sa-learn that identifies and removes such terms; it takes a parameter indicating the estimated minimum utility a term must have in order not to be erased. We recommend a value of 60 for this parameter. Use the following commands to execute the filter maintenance process at least once per week:
./nsa-learn --optimize 60               # remove terms whose estimated utility is below 60
./nsa-learn --backup > backup.bayes_db  # dump the Bayes database to a backup file
./nsa-learn --clear                     # clear the fragmented database
./nsa-learn --restore backup.bayes_db   # restore the compacted data
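To run this maintenance unattended, you can schedule it with cron. The following wrapper script is only a sketch: the cron.weekly location and the installation path /opt/grindstone4spam are assumptions to adapt to your system; the nsa-learn calls are exactly the ones listed above.

#!/bin/sh
# /etc/cron.weekly/bayes-maintenance (hypothetical name and location)
cd /opt/grindstone4spam || exit 1        # assumed installation path; adjust as needed
./nsa-learn --optimize 60                # drop low-utility terms
./nsa-learn --backup > backup.bayes_db   # dump the Bayes database
./nsa-learn --clear                      # clear the fragmented database
./nsa-learn --restore backup.bayes_db    # restore the compacted data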