spamassassin-users December 2011 archive
spamassassin-users: Re: Bayes and MySQL - does it actually work?

From: Robert Schetterer <robert_at_nospam>
Date: Wed Dec 21 2011 - 18:58:25 GMT
To: users@spamassassin.apache.org

On 21.12.2011 19:10, Kris Deugau wrote:
> Marc Perkel wrote:
>> I've been trying for a long time to get bayes/mysql to actually work.
>> Running a dedicated server with MySQL. Several servers running SA
>> configured to talk to it.
>>
>> I'm running big servers with lots of RAM and RAID 0 flash drives for
>> speed. Also using InnoDB. I'm beginning to wonder if it is ever going to
>> work and if someone is going to fix it?
>
> I'm not sure what official testing has been done, but some testing I did
> about a year ago when upgrading the SA cluster here showed pretty much
> the same IO load for a global Bayes no matter what combination of
> MyISAM, InnoDB, generic SQL, or MySQL-specific SA modules I used.
>
> Enabling MySQL replication also bogged things down pretty badly.
>
> Performance with the database on physical disks simply wasn't keeping up
> with more than about double the average message rate (if that...), so I
> fell back to the "good enough" setup of putting the SA database on a
> RAMdisk, and tweaking the MySQL init script to reload the database on
> startup. A database dump is done once a day, about a half-hour after a
> Bayes expiry run.
>
> This is handling ~250K messages/day, although with some tweaks to
> serialize mail delivery a little more to level off the extreme peaks in
> messages/second it should probably be able to handle a lot more volume.
>
> We also have several SA instances - on the inbound side, the first pass
> has ~25 of the top-scoring only-hits-spam rules (mostly DNSBLs) to skim
> off the junk that would usually score 15+ on a full ruleset. Anything
> that gets past that is then passed to a full SA instance with a long
> list of local rules targeted at the ones reported as missed spam by
> customers. That first pass tags more than 80% of the junk for far less
> processing cost than feeding it all through the full ruleset.
>
> Occasional mail spikes[1] sometimes cause SA to sloooooooowwwww
> dooowwwnnn due to CPU contention (60+ spamd threads are simply going to
> take a while to chew through mail if you've only got 16 logical CPU
> cores), but otherwise a pair of dual-socket, quad-core Xeon E5630
> machines with 12G of RAM are mostly idle. (RAM usage is fairly steady
> at just over 4G.) Average scan times are just under a second.
>
> -kgd
>
> [1] I'm looking at you, Rocket Science Group - hundreds of messages per
> second from netblocks all over the US, all nominally operated by (AKA
> "tagged in WHOIS for") the same group - and quite a lot of it spam.
> Unfortunately MailChimp seems to buy rack space, hosting, or managed
> email servers from them or I'd drop all of their netblocks in the local
> reject-at-the-border DNSBL and be done with it.
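
The two-pass arrangement quoted above (a handful of cheap, high-scoring
rules first, the full ruleset only for what gets through) can be sketched
roughly as below. This is only an illustration of the idea, not
SpamAssassin code: the rule names, scores, threshold and the example IP
are all made up.

```python
# Hypothetical two-pass filter. Rule names, scores, THRESHOLD and the
# listed IP are illustrative assumptions, not SpamAssassin's API or rules.
from typing import Callable, Dict, Tuple

Rule = Tuple[Callable[[dict], bool], float]  # (predicate, score)

# First pass: a few cheap rules that only ever hit spam (e.g. DNSBL hits).
CHEAP_RULES: Dict[str, Rule] = {
    "ON_LOCAL_DNSBL": (lambda m: m.get("client_ip") in {"192.0.2.10"}, 15.0),
}

# Second pass: the full (more expensive) ruleset, including local rules.
FULL_RULES: Dict[str, Rule] = {
    **CHEAP_RULES,
    "SUBJECT_ALL_CAPS": (lambda m: m.get("subject", "").isupper(), 2.5),
    "BODY_HAS_PILLS": (lambda m: "pills" in m.get("body", "").lower(), 3.0),
}

THRESHOLD = 5.0


def score(msg: dict, rules: Dict[str, Rule]) -> float:
    """Sum the scores of all rules whose predicate matches the message."""
    return sum(points for predicate, points in rules.values() if predicate(msg))


def classify(msg: dict) -> Tuple[str, str]:
    """Return (verdict, pass_name); obvious junk never reaches the full pass."""
    if score(msg, CHEAP_RULES) >= THRESHOLD:
        return "spam", "first-pass"
    if score(msg, FULL_RULES) >= THRESHOLD:
        return "spam", "full"
    return "ham", "full"
```

The point of the design is that the expensive `FULL_RULES` scan only runs
for mail the cheap pass could not already tag.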

Interesting info. By the way, does anyone know whether PostgreSQL
performs better, e.g. with Bayes clusters?

At least here, using postscreen has helped to stop bots, so those mails
never reach spamd. But in large mail systems a SpamAssassin setup
always has to be configured very carefully, and analysed at runtime to
find performance tweaks. That said, 250K messages/day does not seem
like much to me.

Scanning outbound mail with spamd was slow here too. I only use
clamav-milter with Sanesecurity for that, and also for inbound mail
before spamass-milter.

But no flames: for performance issues you always need to look at the
whole mail setup. In most cases there is no simple right or wrong; only
analysing the bottlenecks will help.
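
To put the 250K messages/day figure in perspective, here is a small
back-of-the-envelope calculation. It assumes a scan time of 1.0 s
(the quoted mail says "just under a second") and, purely for
illustration, a spike of 200 messages/second; the function names are my
own.

```python
# Rough capacity arithmetic. scan_seconds=1.0 and the 200 msg/s spike are
# assumptions for illustration, not measured values.
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate(messages_per_day: float) -> float:
    """Average arrival rate in messages per second."""
    return messages_per_day / SECONDS_PER_DAY


def busy_workers(rate_per_second: float, scan_seconds: float) -> float:
    """Little's law: mean scans in progress = arrival rate * scan time."""
    return rate_per_second * scan_seconds


avg = average_rate(250_000)           # about 2.9 messages/second
avg_busy = busy_workers(avg, 1.0)     # about 2.9 scans in progress on average
spike_busy = busy_workers(200, 1.0)   # 200 scans in progress during the spike

print(f"average rate: {avg:.2f} msg/s, busy scanners: {avg_busy:.2f}")
print(f"assumed spike: {spike_busy:.0f} concurrent scans")
```

So on average only about three scans are in progress at once, while the
assumed spike would need far more than the 60 spamd threads mentioned
above. The average volume is not the problem; the peaks are.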

-- 
Best Regards
Robert Schetterer
Germany/Munich/Bavaria