spamassassin-dev: Re: Mass-check Corpora (once was: Re: Update M

Re: Mass-check Corpora (once was: Re: Update Mirror Issues)

From: Warren Togami Jr. <wtogami_at_nospam>
Date: Thu Feb 03 2011 - 05:14:34 GMT
To: João Gouveia <>

On 2/2/2011 5:25 PM, Karsten Bräckelmann wrote:
> Spam to live accounts strongly preferred, human reviewed by "trained
> monkeys". Emphasis on trained. ;) Some crap like backscatter should be
> filtered from the trap data, if possible, and trap volume kept lower --
> best done by random sampling, rather than dupe elimination.
> How much will that add to the corpus? In particular, how much would the
> first class be, without trap data at all?

Karsten brings up a good point about two types of spam. How about
something like:

* We want a total of 70K spam in your nightly corpus over the past week.
  This means 10K spam per day.
* 3K spam on Monday is from trained monkeys. Include 7K from a random
  selection of trap spam.
* 2K spam on Tuesday is from trained monkeys. Include 8K from a random
  selection of trap spam.
* etc.

You could even split it into two separate masscheck runs.
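
A rough sketch of the daily split described above, assuming the
human-reviewed and trap messages are available as two Python lists. The
function name build_daily_corpus and its parameters are hypothetical and
not part of masscheck; this only illustrates the quota arithmetic.

```python
import random

DAILY_TARGET = 10_000  # 70K spam per week / 7 days, as proposed above


def build_daily_corpus(reviewed, trap, target=DAILY_TARGET, rng=None):
    """Return (reviewed_subset, trap_subset) totalling at most `target`.

    All human-reviewed spam is kept (up to the target); the remainder of
    the quota is filled with a random sample of trap spam, drawn without
    replacement rather than by duplicate elimination.
    """
    if rng is None:
        rng = random.Random()
    chosen_reviewed = list(reviewed)[:target]
    shortfall = target - len(chosen_reviewed)
    trap = list(trap)
    # Never ask for more trap messages than the pool actually holds.
    chosen_trap = rng.sample(trap, min(shortfall, len(trap)))
    return chosen_reviewed, chosen_trap
```

With 3K reviewed messages on a given day this picks 7K trap messages;
with 2K reviewed it picks 8K. The two returned lists can also be fed to
two separate masscheck runs.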

> Given we're talking original figures of 1 million spam per *day*,
> already discussing ways to cut that down to 50-100k -- over a period of
> up to 2 months for spam, 60 days, mind you -- which is less than 2k a
> day...

It seems his spam is lacking spamassassin headers, so without "reuse" we
are unable to determine delivery-time status of the network rules. I
suggested that as long as his mail is lacking spamassassin headers,
perhaps his random sample should be limited to the past week. Although
not perfect, the past week might be closest to "reuse" in results.

A better alternative would be to add spamassassin headers to each
message at the moment it is selected for the nightly masscheck corpus.
The random subset of trap spam would have headers from seconds after
delivery, and trained-monkey spam headers would be from whenever it was
sorted.
"reuse" would then be possible, and the age of spam included in the
nightly masscheck can be calibrated based upon how much this corpus
overwhelms everyone else's recent spam.