spamassassin-users December 2011 archive
spamassassin-users: Re: Problems with Cyrillic spam

From: Martin Gregorie <martin_at_nospam>
Date: Thu Dec 15 2011 - 18:00:35 GMT
To: users@spamassassin.apache.org

On Thu, 2011-12-15 at 10:57 -0500, darxus@chaosreigns.com wrote:
> On 12/15, Martin Gregorie wrote:
> > The problem that needs addressing is that the ok_locales configuration
> > parameter doesn't work. This appears to be because it thinks the
> > sender's choice of (in Windows terms) the character translation code
> > page is a reliable indication of the sender's locale. I accept that this
>
> I'd argue that ok_locales is defined by the way it functions, which was
> dependent on the fact that at one time it was useful to differentiate
> languages by character set. And TextCat's functionality is basically
> exactly what you're looking for. So it would make less sense to redefine
> ok_locales, and more sense to fix TextCat.
>
In that case I'm missing some information: how to write a rule that can
interpret the value(s) returned by TextCat.

Why wouldn't it be sensible to rewrite ok_locales to compare TextCat
return value(s) against its list of OK codes?
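
To make the suggestion concrete, here is a minimal sketch in Python (not SA's Perl, and not how ok_locales or TextCat is actually implemented) of comparing a list of detected language codes against a list of OK codes. The function name is_unwanted and both parameters are hypothetical, and treating "all" as a wildcard and an empty detection result as acceptable are assumptions made for illustration.

```python
def is_unwanted(detected, ok_codes):
    """Return True when none of the detected language codes is acceptable.

    detected -- language codes a detector reported for the message
                (hypothetical input, e.g. ["ru", "uk"])
    ok_codes -- codes the administrator accepts (hypothetical, e.g. ["en", "fr"])
    """
    ok = {code.lower() for code in ok_codes}
    # Assumption: "all" disables the check, and a message with no
    # detected language is given the benefit of the doubt.
    if "all" in ok or not detected:
        return False
    # The message passes if any detected language is on the OK list.
    return not any(lang.lower() in ok for lang in detected)
```

With this sketch, is_unwanted(["ru", "uk"], ["en", "fr"]) flags the message, while is_unwanted(["ru", "en"], ["en"]) does not, because one of the candidate languages is acceptable.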

> I don't think your comment will help either way. Cyrillic character sets
> aren't hard to find, and all the devs are aware of the problem.
>
Then why has ok_locales not been fixed already? This is not a criticism,
just a request for information. Is it something that's difficult to do
efficiently? I'd imagine that language recognition by looking at codepoint
values is possible but not necessarily fast or unambiguous.
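
As a rough illustration of the codepoint approach, the Python sketch below guesses the dominant script of already-decoded text from Unicode character names. It is only a sketch under stated assumptions: the function names script_counts and dominant_script are hypothetical, and it identifies a script, not a language, which is exactly where the ambiguity comes in.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters per script, keyed by the first word of each character name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # Names look like "CYRILLIC SMALL LETTER A" or "LATIN CAPITAL LETTER B".
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def dominant_script(text):
    """Return the most frequent script name, or None if the text has no letters."""
    counts = script_counts(text)
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

Both dominant_script("Привет, мир") (Russian) and dominant_script("Привіт, світе") (Ukrainian) return "CYRILLIC", so codepoints alone cannot separate languages that share a script. The per-character name lookup is also unlikely to be fast on large message bodies.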

> If, on the other hand, you want to fix TextCat, or otherwise implement a
> solution to the problem, and attach a patch to a bugzilla comment, that
> would be awesome.
>
I've no time ATM, and in any case I'm a middling-to-poor Perl coder. Now,
if SA were written in C or Java...

Martin