Pantek Library

Re: A different approach to scoring spamassassin hits

From: Tom Allison <tom(at)tacocat.net>
Date: Sat Jun 30 2007 - 16:56:33 EDT

On Jun 30, 2007, at 2:55 PM, Bart Schaefer wrote:

>
> On 6/29/07, Tom Allison <tom@tacocat.net> wrote:
>>
>> The thought I had, and have been working on for a while, is changing
>> how the scoring is done. Rather than making Bayes a part of the
>> scoring process, make the scoring process a part of the Bayes
> >> statistical engine. As an example, you would simply feed the
> >> indications of scoring hits (binary yes/no) into the Bayesian
> >> process as tokens, to be examined next to the other tokens in the
> >> message.
>
> There are a few problems with this.
>
> (1) It assumes that Bayesian (or similar) classification is more
> accurate than SA's scoring system. Either that, or you're willing to
> give up accuracy in the name of removing all those confusing knobs you
> don't want to touch, but it would seem to me to be better to have the
> knobs and just not touch them.
>

I know that without SA you can get >99.9% accuracy with pure Bayesian classification.
But there are specific non-Bayes things that SpamAssassin rules make visible and that a typical Bayes process can't catch (very well, or at all). The whole issue of "knobs" is moot under a statistical approach, because each user's training will determine the real importance of each particular rule hit.

> (2) For many SA rules you would be, in effect, double-counting some
> tokens. An SA scoring rule that matches a phrase, for example, is
> effectively matching a collection of tokens that are also being fed
> individually to the Bayes engine. In theory, you should not
> second-guess the system by passing such compound tokens to Bayes;
> instead it should be allowed to learn what combinations of tokens are
> meaningful when they appear together.

Bayes does not match a phrase, only words. At least that is what most Bayes filters do.
There are some approaches that do use multiple words, but not a "phrase". Therefore I think the overlap between Bayes tokens and SpamAssassin rules is going to be small.
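
To make the words-versus-phrase distinction concrete, here is a minimal sketch in Python. The tokenizer and the "act now" rule are both hypothetical illustrations, not SpamAssassin's actual tokenizer or any real rule. It shows two messages that share the word tokens "act" and "now" while only one of them contains the phrase.

```python
import re

def word_tokens(body):
    """Unordered set of lowercase word tokens, as a simple word-based filter might see them."""
    return set(re.findall(r"[a-z0-9']+", body.lower()))

# Hypothetical phrase rule, for illustration only.
PHRASE_RULE = re.compile(r"\bact now\b", re.IGNORECASE)

msg_a = "Act now to claim your prize"
msg_b = "Now is the time to act, claim your prize"

# Both messages contribute the individual tokens "act" and "now" ...
shared = word_tokens(msg_a) & word_tokens(msg_b)

# ... but only the first one contains the words as an adjacent phrase.
hit_a = PHRASE_RULE.search(msg_a) is not None
hit_b = PHRASE_RULE.search(msg_b) is not None
```

The phrase hit therefore carries word-order information that the unordered tokens do not. The individual words are still counted once as tokens and once more inside the rule hit, so this narrows the double-counting concern rather than removing it.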

> (It might be worthwhile, though, to e.g. add tokens that are not
> otherwise present in the message, such as for the results of network
> tests.)

This is what I'm interested in, and what I mentioned in paragraph one. There are a lot of things you can do with SpamAssassin that Bayes alone will never do. It is exactly this type of work that I think would be most interesting to pursue.
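
A rough sketch of the idea in Python follows. It assumes a deliberately simplified naive Bayes classifier (Laplace smoothing, only tokens present in the message are scored), not SpamAssassin's own Bayes implementation. The rule name HYPOTHETICAL_DNSBL_HIT, the "RULE:" token prefix, and the training messages are all made up for illustration. Each rule hit becomes one extra binary token, and the training data, not a hand-set score, decides how much that token matters.

```python
import math
from collections import Counter

def tokens_for(body, rule_hits):
    """Word tokens plus one synthetic token per rule hit (present or absent)."""
    words = set(body.lower().split())
    return words | {"RULE:" + name for name in rule_hits}

class TinyBayes:
    """Simplified naive Bayes over token presence; an illustration only."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.msgs = {"spam": 0, "ham": 0}

    def train(self, label, tokens):
        self.msgs[label] += 1
        self.counts[label].update(tokens)  # tokens is a set, so each counts once

    def _rates(self, token):
        # Laplace-smoothed fraction of spam/ham messages containing the token.
        ps = (self.counts["spam"][token] + 1) / (self.msgs["spam"] + 2)
        ph = (self.counts["ham"][token] + 1) / (self.msgs["ham"] + 2)
        return ps, ph

    def token_spamminess(self, token):
        ps, ph = self._rates(token)
        return ps / (ps + ph)

    def spam_probability(self, tokens):
        # Sum of log-likelihood ratios under the naive independence assumption.
        log_odds = math.log((self.msgs["spam"] + 1) / (self.msgs["ham"] + 1))
        for token in tokens:
            ps, ph = self._rates(token)
            log_odds += math.log(ps / ph)
        return 1.0 / (1.0 + math.exp(-log_odds))

# One user's (made-up) training mail: the network-test hit shows up only in spam.
clf = TinyBayes()
clf.train("spam", tokens_for("cheap pills now", ["HYPOTHETICAL_DNSBL_HIT"]))
clf.train("spam", tokens_for("meeting notes attached", ["HYPOTHETICAL_DNSBL_HIT"]))
clf.train("ham", tokens_for("meeting notes attached", []))
clf.train("ham", tokens_for("lunch on friday", []))

# Same body text, scored with and without the rule-hit token.
with_hit = clf.spam_probability(tokens_for("meeting notes attached", ["HYPOTHETICAL_DNSBL_HIT"]))
without_hit = clf.spam_probability(tokens_for("meeting notes attached", []))
```

In this toy data the body text alone is ambiguous, and the rule-hit token is what moves the result. A user whose legitimate mail also triggered that rule would end up with a weaker weight for the same token, which is the sense in which the "knobs" are set by training.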

> (3) It introduces a bootstrapping problem, as has already been noted.
> Everyone has to train the engine and re-train it when new rules are
> developed.
>
> I've thought of a few more, but they all have to do with the benefits
> of having all those "knobs" and if you've already adopted the basic
> premise that they should be removed there doesn't seem to be any
> reason to argue that part.
>
> To summarize my opinion: If what you want is to have a Bayesian-type
> engine make all the decisions, then you should install a Bayesian
> engine and work on ways to feed it the right tokens; you should not
> install SpamAssassin and then work on ways to remove the scoring.

That approach makes sense. However, it would not make sense to try to reinvent the enormous amount of useful work that has come out of SpamAssassin; that would take a very long time. SpamAssassin has some really good ways of finding the right tokens. Why would I try to duplicate all that effort?

Received on Sat Jun 30 16:57:14 2007

This archive was generated by hypermail 2.1.8 : Sat Jun 30 2007 - 17:00:04 EDT