spamassassin-users January 2011 archive

Re: Training Bayes on outbound mail

From: Karsten Bräckelmann <guenther_at_nospam>
Date: Sat Jan 29 2011 - 00:57:50 GMT
To: users@spamassassin.apache.org

On Fri, 2011-01-28 at 18:10 +0000, Dominic Benson wrote:
> Recently, in order to balance the ham/spam ratio given to sa-learn, I
> have started to pass mail submitted by authenticated users to sa-learn
> --ham.
> The thinking here is that users would generally want to receive mail
> that they send, and many messages will either be replies or replied to,
> so this is likely to have a fair amount in common with legitimate mail
> coming in.
> The existing bayes training was from auto-learn, on 60k ham and 360k
> spam; since starting to do this, nearly twice as much ham as spam has
> been learned.
>
> I haven't seen any mention of this strategy on-list or on the web, so
> I'm interested in whether (a) anyone else does this, and (b) is there a
> good reason not to do it that I haven't thought of?

This topic does come up occasionally. It has been discussed, and some
caveats to be aware of have been mentioned -- but IIRC no one ever came
back to report any substantial changes, or whether it worked for them
at all.

Besides some good points (err, caveats) already raised...

Given your numbers of ham and spam, you seem to be under the impression
that the ratio should be 1:1 for best results. While there are no hard
numbers I know of, that most likely is not the best ratio to aim at.
Though I do see how the docs might imply that.

The commonly advised training ratio is *not* 1:1 spam vs ham, but
rather one that keeps both numbers in the neighborhood of your actual
in-stream. If you receive 10 times more spam than ham, this means you
can (and probably should) learn more spam.

Personally, I have seen ratios of 50:1 or even higher that just work
perfectly. Why is that? Probably because there is a rather limited set
of hammy tokens, but a disproportionately larger set of spammy tokens.
The latter grows rapidly when obfuscation techniques enter the scene.

Also, spam changes *much* faster over time than ham. The latter can be
assumed to be almost static over a couple years.

In other words, there likely is nothing wrong with your initial spam to
ham ratio of 6:1, and even 12:1 later. Unless you notice a significant
rise in your ham's Bayes score, there's probably no need for such a
proactive countermeasure.
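
For anyone who wants to check their own learned ratio, here is a minimal
sketch. It assumes the counts come from `sa-learn --dump magic`, whose
"non-token data" lines carry the value in the third column as far as I
recall. The sample text below is illustrative only, built from the 60k
ham / 360k spam figures quoted above, and the function names are made up
for this example.

```python
def parse_learned_counts(dump_text):
    """Pull nspam / nham out of `sa-learn --dump magic` style output.

    Assumed line layout (an assumption, check against your own output):
        0.000          0     360000          0  non-token data: nspam
    """
    counts = {}
    for line in dump_text.splitlines():
        parts = line.split()
        # Expect: prob, 0, value, 0, "non-token", "data:", name
        if len(parts) >= 7 and parts[4:6] == ["non-token", "data:"]:
            name = parts[6]
            if name in ("nspam", "nham"):
                counts[name] = int(parts[2])
    return counts


def spam_to_ham_ratio(counts):
    """Return learned spam messages per learned ham message."""
    if counts.get("nham", 0) == 0:
        raise ValueError("no ham learned yet, ratio is undefined")
    return counts["nspam"] / counts["nham"]


# Illustrative sample, not real output from any particular system.
SAMPLE_DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0     360000          0  non-token data: nspam
0.000          0      60000          0  non-token data: nham
"""

if __name__ == "__main__":
    learned = parse_learned_counts(SAMPLE_DUMP)
    print("spam:ham = %.1f:1" % spam_to_ham_ratio(learned))
```

The resulting number is only useful when compared against the spam to
ham ratio of your actual in-stream, which you would have to count
separately, e.g. from your mail logs.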

I guess in most cases it won't hurt. But in most cases, it isn't worth
the effort either.

Since you mentioned replies -- sure, true, they most likely are ham. ;)
However, odds are the relevant tokens already do score hammy. The
additional training of sent mail is unlikely to make much of a
difference.

And most of the replies themselves are likely to be auto-learned as ham
anyway, no?

Hmm, this turned out longer than the "additional note" I anticipated...

-- char *t="\10pse\0r\0dtu\0.@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4"; main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1: (c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}