spamassassin-users December 2011 archive

Re: Problems with Cyrillic spam

From: Martin Gregorie <martin_at_nospam>
Date: Thu Dec 15 2011 - 11:36:09 GMT

On Wed, 2011-12-14 at 23:36 -0500, wrote:
> On 12/15, Martin Gregorie wrote:
> > Could somebody with access to the SA Bugzilla kindly add a comment to
> > bug 4078 saying that this is also an issue with Cyrillic encoded in
> > UTF-8? I'm asking because at present #4078 only mentions Windows code
> > pages and koi8. There is nothing to indicate that this is also a problem
> > with UTF-8.
> Although as Karsten pointed out, bug 4078 isn't actually
> related, since that bug is actually related to character sets primarily in
> another language. Which UTF8 is not. Bug 6364 is probably exactly the
> same as your issue, just in a different language - needing TextCat fixed /
> rewritten.
The actual problem is that bug 4078 is over-restrictive in its
applicability: it merely says that CHARSET_FARAWAY_HEADER isn't returned
if a message body is in Hebrew.

The problem that needs addressing is that the ok_locales configuration
parameter doesn't work. This appears to be because it treats the
sender's choice of character set (in Windows terms, the code page) as a
reliable indication of the sender's locale. I accept that this used to
work, but since UTF-8 and other Unicode encodings came into widespread
use, a single character set can carry text in any language, so any such
assumption is deeply flawed.
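
To illustrate the point, here is a minimal Python sketch. The
CHARSET_LOCALE table and locale_from_charset() are hypothetical names
made up for this example and are not SpamAssassin's actual code or data.
The same Russian word is encoded three ways: the two legacy character
sets hint at a locale, while the UTF-8 label gives no hint at all.

```python
# Hypothetical charset-to-locale table, for illustration only.
# This is NOT SpamAssassin's real mapping.
CHARSET_LOCALE = {
    "koi8-r": "ru",
    "windows-1251": "ru",
    "iso-8859-1": "en",
}


def locale_from_charset(charset):
    """Guess a locale from the declared charset alone.

    Returns None when the charset says nothing about the language,
    which is the case for UTF-8 and other Unicode encodings.
    """
    return CHARSET_LOCALE.get(charset.lower())


text = "Привет"  # a Russian word, used as sample body text

for charset in ("koi8-r", "windows-1251", "utf-8"):
    raw = text.encode(charset)
    # All three encodings carry exactly the same text...
    assert raw.decode(charset) == text
    # ...but only the legacy charset names imply a locale.
    print(charset, "->", locale_from_charset(charset))
```

Any check that relies on the charset label alone has nothing to go on
once the message is sent as UTF-8.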

The same comments also apply to textcat (bug 6364).

There are really only two possibilities for resolving these bugs:
1) Fix bug 6364 by rewriting the code textcat uses to recognise the
   predominant language used in body text. Fix bug 4078 by rationalising
   ok_locales to use the revised textcat code to determine the locale
   used by the sender before comparing this with the list of acceptable
   locales.
2) Declare textcat and ok_locales to be irretrievably broken and
   remove them from future versions of SA.
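
As a rough illustration of option 1, the sketch below decides from the
decoded body text itself rather than from the charset label. It is
Python, not SA's Perl, and predominant_script() and
script_is_acceptable() are hypothetical helpers invented for this
example. It also only identifies the writing script (Cyrillic, Latin,
Hebrew and so on) from Unicode character names, which is much cruder
than the language recognition textcat attempts, so treat it as a sketch
under those assumptions.

```python
import unicodedata
from collections import Counter


def predominant_script(text):
    """Return the most common script among the letters in text.

    The script is taken as the first word of each letter's Unicode
    name, e.g. "CYRILLIC CAPITAL LETTER PE" gives "CYRILLIC".
    Returns None if the text contains no letters.
    """
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split(" ")[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def script_is_acceptable(text, ok_scripts):
    """Compare the detected script with a list of acceptable ones.

    Text without any letters is treated as acceptable, since there
    is nothing to judge it by.
    """
    script = predominant_script(text)
    return script is None or script in ok_scripts


body = "Привет, world of мир и спам"
print(predominant_script(body))
print(script_is_acceptable(body, {"LATIN"}))
```

Because the check works on decoded characters, it gives the same answer
whether the message arrived as koi8-r, a Windows code page or UTF-8.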

That said, I'm happy to become a bugzilla user, but before I add
anything to it, I'd like to know if you'd prefer me to add comments to
4078 and/or 6364 or if it would be best to raise a new bug containing my
suggestion #1. I've kept an example message that I can provide as