Taking Tiger Server's spam protection up a notch


I was quite happy to see that Apple decided to include amavis / clamav / spamassassin in OS X 10.4 Server, but although I found the virus protection nice, the spam protection looked weak in comparison to my previous experience with spamassassin. So, here's how I fixed that...

A few things about the anti-spam configuration are worth noting ahead of time so that some of the information later is more easily understood:

  • amavisd is run by the user clamav - this will dictate the locations of many things
  • clamav's HOME is /private/var/clamav
  • the user amavisd doesn't appear to have much purpose, but its home directory is /private/var/virusmails (where incoming antivirus stuff goes)
  • /etc/mail/spamassassin is still the right directory to look in for global spamassassin configuration parameters
  • all /etc/ configuration files that you change may end up being overwritten by the system if you change anything in the mail tab or upgrade the OS, or install a security patch, or blink... so back up your changes
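Because the system may overwrite these files, it helps to keep dated copies of anything you edit. Here is a minimal sketch for a POSIX shell; the function name backup_conf and the .bak-YYYYMMDD suffix are my own illustrative choices, not something Apple or spamassassin provides.

```shell
# backup_conf FILE...: keep a dated copy next to each config file.
# The function name and the ".bak-YYYYMMDD" suffix are illustrative choices.
backup_conf() {
    stamp=$(date +%Y%m%d)
    for f in "$@"; do
        # -p preserves mode and timestamps (and ownership when run as root)
        cp -p "$f" "$f.bak-$stamp" || return 1
    done
}
```

Run as root, something like backup_conf /etc/mail/spamassassin/local.cf after each change should be enough to recover from an overwrite.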

Let's start with what is "wrong" with the implementation from Apple. I'm going to just say that it is much more conservative than I'd like to be in spam protection. Out of the box it does the basic spamassassin tests for content and known baddies, and will handle Bayesian filtering and auto-whitelisting. However, for these to work, there's some training to be done, and it's not clear how that training is supposed to be carried out.

Bayesian Filtering

Step one: ignore the manual on manual training. It tells you to learn ham and spam using sa-learn from the prompt, but it leaves out the fact that the spam and ham databases need to be where amavisd, running as the user clamav, will look for them. For this reason, you're much better off running sa-learn as the user clamav and pointing it at the database directory explicitly. To do so, log in as root and execute the following:

su - clamav -c "sa-learn --spam --no-sync --dbpath /var/amavis/.spamassassin /*."
su - clamav -c "sa-learn --ham --no-sync --dbpath /var/amavis/.spamassassin /*."
su - clamav -c "sa-learn --dbpath /var/amavis/.spamassassin --sync"

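That will charge the filter with spam and ham and then sync the databases. In the first two commands, prefix the trailing /* with the directory where you keep your saved spam (or ham) messages. If you don't want to be bored watching nothing happen, you can add --showdots to either or both of the learning commands.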

Once you've charged the system, you'll be getting Bayesian filtering, which is a whole lot better than what you're getting right now. Spamassassin only starts using the Bayes results once it has learned enough messages of each kind (200 spam and 200 ham by default), so if you've fed it a few hundred of each, you should now be seeing BAYES_XX indicators showing up in the X-Spam-Status header line in mail messages.
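To check this without squinting at raw messages, you can pull the header out of a saved message. This is a rough sketch assuming a POSIX shell and awk; spam_status is a name I made up, and the message file is whatever you have exported from your mail client.

```shell
# spam_status FILE: print the X-Spam-Status header of one saved message,
# joining folded continuation lines into a single line.
# (spam_status is an illustrative name, not a spamassassin tool.)
spam_status() {
    awk '
        /^$/ { exit }                       # blank line ends the headers
        /^[ \t]/ {                          # continuation of previous header
            if (show) { sub(/^[ \t]+/, " "); printf "%s", $0 }
            next
        }
        { if (show) print ""; show = 0 }    # a new header starts here
        tolower($0) ~ /^x-spam-status:/ { show = 1; printf "%s", $0 }
        END { if (show) print "" }
    ' "$1"
}
```

Calling spam_status on a saved message prints the header on one line, so the BAYES_XX entry (if any) is easy to spot or to pipe into grep.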

RBLs

If you've used spamassassin in the past, you know that it has a special way of dealing with RBLs. Basically, it treats an RBL listing as one more input to the score that decides whether a message is spam, as opposed to actually "Blocking" the message, as an MTA would.

This is really cool, and much safer than doing it at the MTA (such as what Apple encourages in their Mail configuration panel). However, out of the box, Apple's configuration won't work, because it is missing the Net::DNS libraries, which are used by spamassassin to do DNS lookups.

Installing them requires Apple's Developer Tools (they're on the install disks). If you haven't installed those yet, do so now.

Then run cpan as root (sudo cpan will suffice) and configure it if necessary. At the cpan prompt, type install Net::DNS. It will spit out a bunch of output, and probably ask you if it is OK to install two or three other libraries. It is.

This isn't quite enough, though, because spamassassin is particularly picky about getting very fast responses to its DNS queries. If its startup check decides DNS isn't available, it skips the RBL tests, so you will probably need to add dns_available yes to your /etc/mail/spamassassin/local.cf to force them on.
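If you'd rather not open an editor as root, the setting can be appended from the shell. The sketch below assumes a POSIX shell; ensure_sa_option is a hypothetical helper of mine that only appends when the key isn't already set, so running it twice doesn't duplicate the line.

```shell
# ensure_sa_option FILE KEY VALUE: append "KEY VALUE" to FILE unless a line
# setting KEY already exists (existing lines are left untouched).
# (ensure_sa_option is an illustrative helper, not part of spamassassin.)
ensure_sa_option() {
    file=$1
    key=$2
    value=$3
    if grep -q "^[[:space:]]*${key}[[:space:]]" "$file" 2>/dev/null; then
        echo "$key already set in $file; edit it by hand if needed"
    else
        printf '%s %s\n' "$key" "$value" >> "$file"
    fi
}
```

As root: ensure_sa_option /etc/mail/spamassassin/local.cf dns_available yes. Afterwards, running spamassassin --lint is a cheap way to catch a typo in local.cf before you restart anything.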

Now, you're going to have to restart amavisd in order to make it pick up the new spamassassin settings. So far, the only way I have found to do this is to find the process with ps, kill it, and then run it again as the clamav user (su - clamav -c "amavisd" &). Once this is running, much of your spam should now show matches on rules with RBL in their names.
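The ps-and-kill dance can be sketched roughly as follows. The helper name amavisd_pids is my own, and the commented steps assume amavisd stops cleanly on a plain kill; you may see more than one PID if amavisd is running child processes.

```shell
# amavisd_pids: read ps output (PID in the first column) on stdin and print
# the PID of every line that mentions amavisd. The [a] in the pattern keeps
# this awk command itself from matching when it shows up in the ps listing.
amavisd_pids() {
    awk '/[a]mavisd/ { print $1 }'
}

# Typical use, as root (left commented so nothing is killed by accident):
#   ps -axo pid,command | amavisd_pids    # list the amavisd PIDs
#   kill <pid>                            # stop the one you want gone
#   su - clamav -c "amavisd" &            # start it again as clamav
```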

Razor and DCC

Have you heard of Cloudmark? They've got an anti-spam product for Windows that runs on the client. It's based on a system called Vipul's Razor. Cloudmark continues to allow "mere mortals" to interact with the service because it's in their best interest (the bigger the database of spam, the more effective their filtering), but they've indicated that they may stop at some point.

However, for the time being, Razor provides a real-time content-based filter that is updated very regularly. So, why not use it? To do so, make sure you have the Developer Tools installed (they came with OS X Server), then download the Razor software. The razor installation instructions are pretty straightforward. The only step that's a little weird is that I usually install the razor configuration in /etc/mail/spamassassin/.razor so that it comes across as more system-wide. I would imagine that you could put this in ~clamav/.razor or ~clamav/.spamassassin/.razor instead, but I put it in /etc/mail/spamassassin/.razor for ease of finding.

Either way, we're going to use the /etc/mail/spamassassin/local.cf to tell razor where to look. Do this by adding the following line to the local.cf file:

razor_config /etc/mail/spamassassin/.razor/razor-agent.conf

or wherever your configuration file is.

Before restarting amavisd, though, we're going to add support for another service of this type, DCC (the Distributed Checksum Clearinghouse). This one seems to have more items listed and isn't as good at predicting spam on its own (I gather too many people mark opt-in mailing list items as spam on DCC), but when used in conjunction with the other services, it can for the most part distinguish messages sent to you by a human from those sent by a machine.

To install this, you'll need to download the code and follow the DCC installation instructions. Skip over step 3, since we're not using the "milter" configuration. Steps 6-8 may also be skipped, as we really want to use the public servers for now (unless you have atrociously high volumes or special needs).

Since dcc is a daemon, you'll need to configure it to start at startup. For now, I'm going to leave that (and the autolearner) as an exercise for the reader. However, I hope to add it later.

The last step is to configure DCC in spamassassin. This step is really unnecessary, because spamassassin checks for DCC in the "usual location" when it starts, but I like suspenders and a belt sometimes, so I added dcc_home /var/dcc to my /etc/mail/spamassassin/local.cf file.

Once again, we need to restart the amavisd process, as indicated above. If you have succeeded, you'll start to see RAZOR_ and DCC_ entries in your X-Spam-Status headers for spam.

Conclusion

By adding the DNS-based RBLs, DCC, and Razor, and loading up your Bayesian filtering database, you've now got a much more effective spam filter.

I like to pair this with a sieve filter that automatically drops incoming spam into a separate imap mailbox, so that I don't have to retrieve the spam on low-volume messaging clients like my phone.