spamassassin-users December 2011 archive
spamassassin-users: Re: Bayes and MySQL - does it actually work?

From: Jernej Porenta <jernej.porenta_at_nospam>
Date: Fri Dec 23 2011 - 09:59:41 GMT
To: Robert Schetterer <robert@schetterer.org>

On Dec 23, 2011, at 8:15 AM, Robert Schetterer wrote:

> On 23.12.2011 02:45, Marc Perkel wrote:
>>>> This is handling ~250K messages/day, although with some tweaks to
>>>> serialize mail delivery a little more to level off the extreme peaks in
>>>> messages/second it should probably be able to handle a lot more volume.
>>>>
>>>> We also have several SA instances - on the inbound side, the first pass
>>>> has ~25 of the top-scoring only-hits-spam rules (mostly DNSBLs) to skim
>>>> off the junk that would usually score 15+ on a full ruleset. Anything
>>>> that gets past that is then passed to a full SA instance with a long
>>>> list of local rules targeted at the ones reported as missed spam by
>>>> customers. That first pass tags more than 80% of the junk for far less
>>>> processing cost than feeding it all through the full ruleset.

We are processing 300k+ mails/day (peaks up to 1M/day) with 3 mail servers plus 1 dedicated MySQL server that replicates to one older server. So far we haven't seen any performance degradation from keeping Bayes in MySQL with the InnoDB engine. The mail servers are dual-socket Xeons with 8G RAM; the MySQL server is a dual-socket Xeon with 48G RAM, but SA Bayes is not the most heavily used database on that server. We are using amavisd-new instead of spamd.

However, we did see some degradation when we moved to a new MySQL server, and some tweaking helped:
- correctly sizing the InnoDB engine
- optimizing MySQL buffer sizes
- disabling the RAID battery auto-learn cycle
- optimizing the I/O scheduler
- tuning kernel network parameters
- tuning the kernel swappiness level
- using Mail::SpamAssassin::BayesStore::MySQL instead of Mail::SpamAssassin::BayesStore::SQL
- manually pruning auto-whitelisting data and bayes data
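
For readers who want to try the BayesStore::MySQL switch mentioned in the list, a minimal local.cf sketch could look like the following. The host name, database name, username and password are placeholders (assumptions, not values from this setup), and turning off automatic expiry is only one possible way to go with manual pruning, not necessarily what was done here.

```
# Use the MySQL-specific Bayes backend instead of the generic SQL one
bayes_store_module  Mail::SpamAssassin::BayesStore::MySQL

# Placeholder connection details (hypothetical host, database and credentials)
bayes_sql_dsn       DBI:mysql:bayes:db.example.com
bayes_sql_username  sa_user
bayes_sql_password  changeme

# Assumption: skip opportunistic expiry during scans and prune
# on a schedule instead, e.g. with "sa-learn --force-expire"
bayes_auto_expire   0
```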

Currently our MySQL Bayes database holds over 2M tokens and we don't see any performance impact on SpamAssassin. Our backups run against the replicated database, so they put no load on our primary MySQL server.

I don't have any numbers to compare MySQL and PostgreSQL, but I believe that newer versions of MySQL and its derivatives (Percona Server etc.) have improved quite a lot compared to older ones.

regards, Jernej