To use Grindstone4Spam for optimization, we recommend following these four simple steps (a complete command sketch follows the list):
- Compile a corpus: You should gather spam and legitimate messages from your environment. Please note that the legitimate messages received at a pharmaceutical company differ from those received at a financial company, so using real messages from the target environment is better than compiling other kinds of messages (distribution lists, publicly available corpora, etc.) to improve spam filters. You should add new messages to your corpus periodically to avoid the effects of concept drift. Place spam messages in src_data/spam and ham messages in src_data/ham. A collection of more than 5000 messages is a good choice for optimizing your filters.
- Build a cross-validation: Once you have compiled the e-mail corpus, we recommend executing a cross-validation scheme. Cross-validation makes it possible to include Naïve Bayes rules in the optimization and yields a statistically sounder result. This step takes less than 5 seconds, so there is little reason to skip it. Simply execute ./makexvalidation; the cross-validation is generated in the xval directory.
- Generate logs: You should generate the files spam.log and ham.log, which are used by the optimization process. For each message, the log files record the rules that match, avoiding the need to classify a single message many times by calling the SpamAssassin client (spamc). This process takes a long time (53 minutes for 9300 messages and 902 rules). Simply run ./checkmassesxval.
- Execute the optimization and evaluation processes: Once the log files have been computed, you can run the optimization and/or evaluation by executing ./optimizefilter --ham-log ham.log --spam-log spam.log and/or ./evaluatefilter --ham-log ham.log --spam-log spam.log. The best optimized scores are printed to stdout and stored in 52_optimized_scores.cf. To install this score file, replace the original SpamAssassin /usr/share/spamassassin/50_scores.cf with the generated one.
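The whole workflow can be summarized as the following shell session. This is a minimal sketch: the source mail paths, the backup copy of the original score file and the use of sudo are illustrative assumptions, while the Grindstone4Spam commands and file names are those described in the steps above.

cp /path/to/your/spam/* src_data/spam/    # step 1: spam messages from your environment
cp /path/to/your/ham/* src_data/ham/      # step 1: legitimate messages from your environment
./makexvalidation                         # step 2: builds the cross-validation in xval
./checkmassesxval                         # step 3: generates spam.log and ham.log (slow)
./optimizefilter --ham-log ham.log --spam-log spam.log   # step 4: optimization
./evaluatefilter --ham-log ham.log --spam-log spam.log   # step 4: evaluation
sudo cp /usr/share/spamassassin/50_scores.cf 50_scores.cf.orig        # keep a backup (assumption)
sudo cp 52_optimized_scores.cf /usr/share/spamassassin/50_scores.cf   # install the new scores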
Every time you update the compiled messages, you should remove spam.log, ham.log and the xval directory, and return to the second step.
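For example, assuming everything is kept in the Grindstone4Spam working directory, the refresh cycle is:

rm -f spam.log ham.log    # discard the cached rule-match logs
rm -rf xval               # discard the old cross-validation
./makexvalidation         # rebuild the cross-validation (second step)
./checkmassesxval         # regenerate spam.log and ham.log (third step)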
During filter operation (especially when autolearn is active), you will notice that the Bayes database keeps growing. This growth is caused by the use of logical deletions and the continuous storage of new information, and it significantly increases the time needed to filter a message. Executing a backup-clean-restore cycle with the sa-learn tool reduces the space wasted by logically deleted information and improves filter performance. You can increase performance even further by removing useless terms from your filter. nsa-learn is a modification of the original sa-learn that identifies and removes such terms; it takes a parameter indicating the estimated minimum utility a term must have in order not to be erased. We recommend a value of 60 for this parameter. Use the following commands to execute the filter maintenance process at least once per week:
./nsa-learn --optimize 60               # remove terms whose estimated utility is below 60
./nsa-learn --backup > backup.bayes_db  # dump the Bayes database to a backup file
./nsa-learn --clear                     # clear the fragmented database
./nsa-learn --restore backup.bayes_db   # restore the compacted data
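To run this maintenance unattended, you can schedule it with cron. The following wrapper script is only a sketch: the cron.weekly location and the installation path /opt/grindstone4spam are assumptions to adapt to your system; the nsa-learn calls are exactly the ones listed above.

#!/bin/sh
# /etc/cron.weekly/bayes-maintenance (hypothetical name and location)
cd /opt/grindstone4spam || exit 1        # assumed installation path; adjust as needed
./nsa-learn --optimize 60                # drop low-utility terms
./nsa-learn --backup > backup.bayes_db   # dump the Bayes database
./nsa-learn --clear                      # clear the fragmented database
./nsa-learn --restore backup.bayes_db    # restore the compacted data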