Re: A different approach to scoring spamassassin hits

From: Loren Wilton <lwilton(at)earthlink.net>
Date: Sat Jun 30 2007 - 18:29:38 EDT


> And after typing all this I'm thinking you might be right. But part of
> this approach is to run all these rules in YES/NO fashion and see if the
> probability is significant. For example: If I tested for SOME_TEST=NO
> and found it was scoring a probability of ~0.500 then it's indisputable
> that you are right.

Well, this still doesn't make much sense to me. It seems equivalent to the Bayes poison that spammers stick into their spam: a bunch of words totally unrelated to the mail, in the hope of outweighing the useful terms. In practice their trick backfires: the words they pick aren't common in my ham mail, so they act as a good spam indicator rather than as poison. I'm not immediately convinced that will hold for the usage you intend. Maybe. Maybe not.

However, if you want to do this, remember that Bayes works on tokens and has a tokenizer. So SOME_RULE=YES will probably be split into two or three tokens, and you will end up scoring on the probability of YES and NO, along with the frequency of the rule names, which will be 1 because every rule name shows up in every message. So you probably want to use single tokens such as NO_SOME_RULE and YES_OTHER_RULE, or the like, when you build the insert list. Again, though, I'm not sure I see the point of the yes and no factors; the presence or absence of a word in the mail seems like a pretty good yes/no indication to me.
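To illustrate the tokenization point, here is a minimal Python sketch. The `tokenize` function is a hypothetical stand-in that simply splits on anything other than letters, digits, and underscores; it is not SpamAssassin's actual tokenizer, which is written in Perl and has more elaborate rules. It only shows why a form like SOME_RULE=YES may fall apart into separate tokens while YES_SOME_RULE stays whole.

```python
import re

def tokenize(text):
    """Hypothetical simplified tokenizer (an assumption, not SpamAssassin's):
    split on any run of characters that are not letters, digits, or underscores."""
    return [tok for tok in re.split(r"[^A-Za-z0-9_]+", text) if tok]

# "RULE=YES" style: the '=' acts as a separator, so YES and NO become
# free-standing tokens shared by every rule.
split_form = tokenize("SOME_RULE=YES OTHER_RULE=NO")

# "YES_RULE" style: each rule/outcome pair survives as one distinct token.
joined_form = tokenize("YES_SOME_RULE NO_OTHER_RULE")

print(split_form)
print(joined_form)
```

With a splitter like this, the first form yields bare YES and NO tokens whose probabilities say nothing about any particular rule, which is the problem described above.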

Were I doing it I'd try it both ways and see if there is any difference in results.
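A toy version of such a comparison might look like the following Python sketch. Everything in it is an assumption made for illustration: the rule names, the four-message corpus, and the simple per-token estimate (spam rate divided by spam rate plus ham rate), which is not the formula SpamAssassin's Bayes code uses. It only shows what each encoding would feed to the classifier.

```python
def token_spam_probability(token, spam_msgs, ham_msgs):
    """Simple per-token estimate (an assumption for this sketch):
    spam rate / (spam rate + ham rate), or 0.5 if the token is never seen."""
    spam_rate = sum(token in msg for msg in spam_msgs) / len(spam_msgs)
    ham_rate = sum(token in msg for msg in ham_msgs) / len(ham_msgs)
    if spam_rate + ham_rate == 0:
        return 0.5
    return spam_rate / (spam_rate + ham_rate)

def encode_presence(hits, all_rules):
    """Presence-only encoding: one token per rule that hit."""
    return set(hits)

def encode_yes_no(hits, all_rules):
    """Yes/no encoding: one token per rule, prefixed by its outcome."""
    return {("YES_" if rule in hits else "NO_") + rule for rule in all_rules}

# Hypothetical rules and rule hits per message.
ALL_RULES = ["RULE_A", "RULE_B"]
spam_hits = [{"RULE_A"}, {"RULE_A", "RULE_B"}]
ham_hits = [{"RULE_B"}, set()]

def probabilities(encode):
    """Encode every message, then estimate a probability for each token seen."""
    spam = [encode(h, ALL_RULES) for h in spam_hits]
    ham = [encode(h, ALL_RULES) for h in ham_hits]
    tokens = set().union(*spam, *ham)
    return {t: token_spam_probability(t, spam, ham) for t in sorted(tokens)}

presence_probs = probabilities(encode_presence)
yes_no_probs = probabilities(encode_yes_no)
print(presence_probs)
print(yes_no_probs)
```

In this made-up corpus, RULE_B hits equally often in spam and ham, so both YES_RULE_B and NO_RULE_B land at 0.5 and carry no information either way. Whether the extra NO_ tokens help on real mail is exactly what running it both ways would show.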

        Loren

Received on Sat Jun 30 18:30:18 2007

