spamassassin-users: Re: Bayes and MySQL - does it actually work?

From: Kris Deugau <kdeugau_at_nospam>
Date: Wed Dec 21 2011 - 18:10:27 GMT
To: spamassassin-users <users@spamassassin.apache.org>

Marc Perkel wrote:
> I've been trying for a long time to get bayes/mysql to actually work.
> Running a dedicated server with MySQL. Several servers running SA
> configured to talk to it.
>
> I'm running big servers with lots of ram and raid 0 flash drives for
> speed. Also using InnoDB. I'm beginning to wonder if it is ever going to
> work and if someone is going to fix it?

I'm not sure what official testing has been done, but some testing I did
about a year ago when upgrading the SA cluster here showed pretty much
the same IO load for a global Bayes no matter what combination of
MyISAM, InnoDB, generic SQL, or MySQL-specific SA modules I used.

Enabling MySQL replication also bogged things down pretty badly.

Performance with the database on physical disks simply wasn't keeping up
with more than about double the average message rate (if that...), so I
fell back to the "good enough" setup of putting the SA database on a
RAMdisk, and tweaking the MySQL init script to reload the database on
startup. A database dump is done once a day, about a half-hour after a
Bayes expiry run.
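
A minimal sketch of the dump-and-reload cycle described above, in Python. This is not the actual init script tweak: the database name, the dump path, and the assumption that credentials come from a client option file such as ~/.my.cnf are all hypothetical. dump() would run from a daily cron job after the Bayes expiry, and reload() from the MySQL init script once the server is up on the empty RAMdisk.

```python
import subprocess

# Hypothetical names; adjust to the local setup.
BAYES_DB = "sa_bayes"
DUMP_PATH = "/var/backups/sa_bayes.sql"


def dump_command(db=BAYES_DB):
    """Build the argv for the daily dump of the Bayes database."""
    # --databases makes the dump include CREATE DATABASE and USE statements,
    # so it can be replayed into a freshly started, empty server.
    # --single-transaction gives a consistent snapshot for InnoDB tables only;
    # it does not guarantee consistency for MyISAM.
    return ["mysqldump", "--single-transaction", "--databases", db]


def reload_command():
    """Build the argv for replaying the dump; the SQL is fed on stdin."""
    return ["mysql"]


def dump(path=DUMP_PATH):
    """Write the dump to persistent storage (not the RAMdisk)."""
    with open(path, "wb") as out:
        subprocess.run(dump_command(), stdout=out, check=True)


def reload(path=DUMP_PATH):
    """Restore the last dump into the RAMdisk-backed server at startup."""
    with open(path, "rb") as src:
        subprocess.run(reload_command(), stdin=src, check=True)
```

The tradeoff is the one implied above: anything learned since the last dump is lost if the machine goes down, which is acceptable for a "good enough" Bayes store.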

This is handling ~250K messages/day, although with some tweaks to
serialize mail delivery a little more to level off the extreme peaks in
messages/second it should probably be able to handle a lot more volume.

We also have several SA instances - on the inbound side, the first pass
has ~25 of the top-scoring only-hits-spam rules (mostly DNSBLs) to skim
off the junk that would usually score 15+ on a full ruleset. Anything
that gets past that is then passed to a full SA instance with a long
list of local rules targeted at the ones reported as missed spam by
customers. That first pass tags more than 80% of the junk for far less
processing cost than feeding it all through the full ruleset.
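
The two-pass flow above can be sketched roughly as follows. This is only an illustration of the control flow, not how the instances here are wired together: cheap_scan and full_scan are hypothetical stand-ins for the two SA instances (each takes a message and returns a score), and the threshold of 5.0 is an assumed value.

```python
def two_pass_scan(message, cheap_scan, full_scan, threshold=5.0):
    """Run the cheap rules first; use the full ruleset only if needed.

    Returns a (verdict, stage) tuple, where stage records which pass
    made the decision.
    """
    # First pass: a short list of high-scoring, spam-only rules.
    if cheap_scan(message) >= threshold:
        return "spam", "first-pass"
    # Second pass: the expensive full ruleset, only for what got through.
    score = full_scan(message)
    return ("spam" if score >= threshold else "ham"), "full"
```

The saving comes from full_scan never being called for anything the first pass already tagged.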

Occasional mail spikes[1] sometimes cause SA to sloooooooowwwww
dooowwwnnn due to CPU contention (60+ spamd threads are simply going to
take a while to chew through mail if you've only got 16 logical CPU
cores), but otherwise a pair of dual-socket, quad-core Xeon E5630
machines with 12G of RAM are mostly idle. (RAM usage is fairly steady
at just over 4G.) Average scan times are just under a second.
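
As a rough back-of-the-envelope for the contention case, here is a small Python calculation. It assumes every scan is purely CPU-bound and that the scheduler shares cores evenly, neither of which is strictly true for SA (network tests spend much of their time waiting), so treat the result as an upper-bound illustration rather than a measurement.

```python
def contended_wall_time(cpu_seconds, workers, cores):
    """Estimate wall-clock time per scan when workers share the cores evenly."""
    # Each worker gets at most one full core; with more workers than cores
    # it gets a cores/workers fraction of one.
    share = min(1.0, cores / workers)
    return cpu_seconds / share


# 60 busy spamd children on 16 logical cores, 1 second of CPU per scan:
print(round(contended_wall_time(1.0, 60, 16), 2))
```

Under those assumptions a scan that normally takes about a second stretches to nearly four during a spike.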

-kgd

[1] I'm looking at you, Rocket Science Group - hundreds of messages per
second from netblocks all over the US, all nominally operated by (AKA
"tagged in WHOIS for") the same group - and quite a lot of it spam.
Unfortunately MailChimp seems to buy rack space, hosting, or managed
email servers from them or I'd drop all of their netblocks in the local
reject-at-the-border DNSBL and be done with it.