spamassassin-users December 2011 archive
spamassassin-users: Re: Bayes and MySQL - does it actually work?

From: Robert Schetterer <robert_at_nospam>
Date: Fri Dec 23 2011 - 07:15:56 GMT
To: Marc Perkel <support@junkemailfilter.com>

On 23.12.2011 02:45, Marc Perkel wrote:
>
>
> On 12/21/2011 10:58 AM, Robert Schetterer wrote:
>> On 21.12.2011 19:10, Kris Deugau wrote:
>>> Marc Perkel wrote:
>>>> I've been trying for a long time to get bayes/mysql to actually work.
>>>> Running a dedicated server with MySQL. Several servers running SA
>>>> configured to talk to it.
>>>>
>>>> I'm running big servers with lots of RAM and RAID 0 flash drives for
>>>> speed. Also using InnoDB. I'm beginning to wonder if it is ever going to
>>>> work and if someone is going to fix it?
>>> I'm not sure what official testing has been done, but some testing I did
>>> about a year ago when upgrading the SA cluster here showed pretty much
>>> the same IO load for a global Bayes no matter what combination of
>>> MyISAM, InnoDB, generic SQL, or MySQL-specific SA modules I used.
>>>
>>> Enabling MySQL replication also bogged things down pretty badly.
>>>
>>> Performance with the database on physical disks simply wasn't keeping up
>>> with more than about double the average message rate (if that...), so I
>>> fell back to the "good enough" setup of putting the SA database on a
>>> RAMdisk, and tweaking the MySQL init script to reload the database on
>>> startup. A database dump is done once a day, about a half-hour after a
>>> Bayes expiry run.
>>>
>>> This is handling ~250K messages/day, although with some tweaks to
>>> serialize mail delivery a little more to level off the extreme peaks in
>>> messages/second it should probably be able to handle a lot more volume.
>>>
>>> We also have several SA instances - on the inbound side, the first pass
>>> has ~25 of the top-scoring only-hits-spam rules (mostly DNSBLs) to skim
>>> off the junk that would usually score 15+ on a full ruleset. Anything
>>> that gets past that is then passed to a full SA instance with a long
>>> list of local rules targeted at the ones reported as missed spam by
>>> customers. That first pass tags more than 80% of the junk for far less
>>> processing cost than feeding it all through the full ruleset.
>>>
>>> Occasional mail spikes[1] sometimes cause SA to sloooooooowwwww
>>> dooowwwnnn due to CPU contention (60+ spamd threads are simply going to
>>> take a while to chew through mail if you've only got 16 logical CPU
>>> cores), but otherwise a pair of dual-socket, quad-core Xeon E5630
>>> machines with 12G of RAM are mostly idle. (RAM usage is fairly steady
>>> at just over 4G.) Average scan times are just under a second.
>>>
>>> -kgd
>>>
>>> [1] I'm looking at you, Rocket Science Group - hundreds of messages per
>>> second from netblocks all over the US, all nominally operated by (AKA
>>> "tagged in WHOIS for") the same group - and quite a lot of it spam.
>>> Unfortunately MailChimp seems to buy rack space, hosting, or managed
>>> email servers from them or I'd drop all of their netblocks in the local
>>> reject-at-the-border DNSBL and be done with it.
>> Interesting info. By the way,
>> does anyone know whether PostgreSQL performs better, e.g. with Bayes
>> clusters? At least using postscreen has helped here in stopping bots,
>> so those mails never reach spamd.
>> But for sure, in large mail systems a SpamAssassin setup
>> always has to be configured very carefully, and analysed during runtime
>> to find performance tweaks.
>> However, 250K messages/day does not seem like that much to me.
>> Scanning outbound mail with spamd was slow here too; I only use
>> clamav-milter with Sanesecurity for that, and also for inbound before
>> spamass-milter.
>>
>> But no flames: for performance issues, a look at the total mail setup
>> is always needed. There is no straight right or wrong in most cases;
>> only analysing the bottlenecks will help.
>>
>
> Maybe it's time for me to try postgresql. Can you provide a link to how
> to optimize SA for it?
>

Sorry, no, I have no links besides the official ones,
but I was told by good DB people that PostgreSQL
is handier in cluster setups.
But as I said, try to limit the amount of mail
coming to SpamAssassin by using other filter techniques in front of it.
This should help anyway, apart from the DB stuff.
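
To illustrate the idea of filtering in front of SpamAssassin, here is a minimal sketch in Python. It is not SpamAssassin or postscreen code. The names `Message`, `BLOCKED_IPS`, `cheap_check`, `full_scan` and `filter_mail` are hypothetical, the blocklist is a hard-coded stand-in for a DNSBL lookup, and `full_scan` is only a placeholder for the expensive full-ruleset pass. It only shows that mail caught by a cheap test never reaches the costly stage.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Message:
    client_ip: str
    body: str


# Hypothetical stand-in for a DNSBL lookup (addresses from the
# documentation range); a real setup would query DNS instead.
BLOCKED_IPS = {"192.0.2.10", "192.0.2.11"}


def cheap_check(msg: Message) -> bool:
    """Return True if the message is obvious junk, using only a cheap test."""
    return msg.client_ip in BLOCKED_IPS


def full_scan(msg: Message) -> float:
    """Placeholder for the expensive full scan; returns a made-up score."""
    return 6.0 if "viagra" in msg.body.lower() else 0.5


def filter_mail(
    messages: List[Message], threshold: float = 5.0
) -> Tuple[List[Tuple[Message, str]], int]:
    """Run the cheap check first; only survivors reach the full scan.

    Returns the per-message verdicts and the number of full scans performed.
    """
    results: List[Tuple[Message, str]] = []
    full_scans = 0
    for msg in messages:
        if cheap_check(msg):
            # Stopped early: the expensive stage is never invoked.
            results.append((msg, "rejected-early"))
            continue
        full_scans += 1
        verdict = "spam" if full_scan(msg) >= threshold else "ham"
        results.append((msg, verdict))
    return results, full_scans
```

Calling `filter_mail` on a batch returns the verdicts plus a count of how many messages needed the full scan, so you can see how much load the cheap stage removes before the Bayes database is touched at all.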

--
Best Regards
Robert Schetterer
Germany/Munich/Bavaria