spamassassin-users December 2011 archive
spamassassin-users: Re: Bayes and MySQL - does it actually work?

From: Marc Perkel <support_at_nospam>
Date: Fri Dec 23 2011 - 01:45:12 GMT
To: Robert Schetterer

On 12/21/2011 10:58 AM, Robert Schetterer wrote:
> On 21.12.2011 19:10, Kris Deugau wrote:
>> Marc Perkel wrote:
>>> I've been trying for a long time to get bayes/mysql to actually work.
>>> Running a dedicated server with MySQL. Several servers running SA
>>> configured to talk to it.
>>> I'm running big servers with lots of RAM and RAID 0 flash drives for
>>> speed. Also using InnoDB. I'm beginning to wonder whether it is ever
>>> going to work and whether someone is going to fix it.
>> I'm not sure what official testing has been done, but some testing I did
>> about a year ago when upgrading the SA cluster here showed pretty much
>> the same IO load for a global Bayes no matter what combination of
>> MyISAM, InnoDB, generic SQL, or MySQL-specific SA modules I used.
>> Enabling MySQL replication also bogged things down pretty badly.
>> Performance with the database on physical disks simply wasn't keeping up
>> with more than about double the average message rate (if that...), so I
>> fell back to the "good enough" setup of putting the SA database on a
>> RAMdisk, and tweaking the MySQL init script to reload the database on
>> startup. A database dump is done once a day, about a half-hour after a
>> Bayes expiry run.
>> This is handling ~250K messages/day, although with some tweaks to
>> serialize mail delivery a little more to level off the extreme peaks in
>> messages/second it should probably be able to handle a lot more volume.
>> We also have several SA instances - on the inbound side, the first pass
>> has ~25 of the top-scoring only-hits-spam rules (mostly DNSBLs) to skim
>> off the junk that would usually score 15+ on a full ruleset. Anything
>> that gets past that is then passed to a full SA instance with a long
>> list of local rules targeted at the ones reported as missed spam by
>> customers. That first pass tags more than 80% of the junk for far less
>> processing cost than feeding it all through the full ruleset.
>> Occasional mail spikes[1] sometimes cause SA to sloooooooowwwww
>> dooowwwnnn due to CPU contention (60+ spamd threads are simply going to
>> take a while to chew through mail if you've only got 16 logical CPU
>> cores), but otherwise a pair of dual-socket, quad-core Xeon E5630
>> machines with 12G of RAM are mostly idle. (RAM usage is fairly steady
>> at just over 4G.) Average scan times are just under a second.
>> -kgd
>> [1] I'm looking at you, Rocket Science Group - hundreds of messages per
>> second from netblocks all over the US, all nominally operated by (AKA
>> "tagged in WHOIS for") the same group - and quite a lot of it spam.
>> Unfortunately MailChimp seems to buy rack space, hosting, or managed
>> email servers from them or I'd drop all of their netblocks in the local
>> reject-at-the-border DNSBL and be done with it.
> Interesting info. By the way, does anyone know whether PostgreSQL
> performs better, e.g. with Bayes clusters?
> At least here, using postscreen has helped to stop bots, so those mails
> never reach spamd.
> But for sure, in large mail systems a SpamAssassin setup always has to
> be configured very carefully and analysed at runtime to find
> performance tweaks.
> However, 250K messages/day does not seem like that much to me.
> Scanning outbound mail with spamd was slow here too; I only use
> clamav-milter with sanesecurity for that, and also for inbound mail
> before spamass-milter.
> But no flames: for performance issues you always need to look at the
> whole mail setup. In most cases there is no clear right or wrong;
> only analysing the bottlenecks will help.

Maybe it's time for me to try PostgreSQL. Can you provide a link on how
to optimize SA for it?

-- Marc Perkel - Sales/Support Junk Email Filter dot com 415-992-3400
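
For anyone following the thread, the two-pass arrangement Kris describes (a short list of high-scoring, spam-only rules first, with the full ruleset run only on what gets past it) can be sketched roughly as below. This is only an illustration in Python, not how SpamAssassin or spamd works internally. The rule names, the scores, the blocklist set standing in for DNSBL lookups, and the 5.0 threshold are all assumptions made up for the example.

```python
from typing import Callable, Dict, List, Tuple

Message = Dict[str, str]
# A rule is (name, points, predicate). All rules here are hypothetical.
Rule = Tuple[str, float, Callable[[Message], bool]]


def score(message: Message, rules: List[Rule]) -> float:
    """Sum the points of every rule whose predicate matches the message."""
    return sum(points for _name, points, matches in rules if matches(message))


def classify(message: Message, cheap_rules: List[Rule],
             full_rules: List[Rule], threshold: float = 5.0):
    """Run the cheap first pass; fall back to the full ruleset only if needed."""
    first = score(message, cheap_rules)
    if first >= threshold:
        # Obvious junk is tagged here and never reaches the expensive pass.
        return ("spam", "first-pass", first)
    total = score(message, full_rules)
    verdict = "spam" if total >= threshold else "ham"
    return (verdict, "full-pass", total)


# Stand-in for a DNSBL lookup: a local set of listed client addresses.
BLOCKLIST = {"192.0.2.10"}

CHEAP_RULES: List[Rule] = [
    ("LISTED_IP", 15.0, lambda m: m.get("client_ip") in BLOCKLIST),
]

# The full ruleset includes the cheap rules plus slower content checks.
FULL_RULES: List[Rule] = CHEAP_RULES + [
    ("PHRASE_ACT_NOW", 3.0, lambda m: "act now" in m.get("body", "").lower()),
    ("SUBJECT_ALL_CAPS", 2.5, lambda m: m.get("subject", "").isupper()),
]

if __name__ == "__main__":
    listed = {"client_ip": "192.0.2.10", "subject": "hi", "body": "hello"}
    print(classify(listed, CHEAP_RULES, FULL_RULES))
```

The point of the split is cost: the first pass only pays for a handful of checks, and by Kris's numbers that is enough to tag more than 80% of the junk before the full ruleset is ever involved.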