Pantek Library

Re: A different approach to scoring spamassassin hits

From: Tom Allison <tom(at)tacocat.net>
Date: Sat Jun 30 2007 - 21:34:02 EDT

On Jun 30, 2007, at 6:29 PM, Loren Wilton wrote:

>
>> And after typing all this I'm thinking you might be right. But
>> part of this approach is to run all these rules in YES/NO fashion
>> and see if the probability is significant. For example: If I
>> tested for SOME_TEST=NO and found it was scoring a probability of
>> ~0.500 then it's indisputable that you are right.
>
> Well, this still doesn't make any real sense to me; it seems
> equivalent to the attempts at bayes poison that spammers stick into
> their spams: a bunch of words totally unrelated to the mail in the
> hopes of outweighing the useful terms. Now their trick works as a
> good spam indication because the words they pick aren't common to
> my ham mails, so it is really a good spam indication rather than
> poison. I'm not immediately convinced that will hold for the usage
> you intend. Maybe. Maybe not.
>
> However, if you want to do this, remember that bayes works on
> tokens and has a tokenizer. So SOME_RULE=YES is probably either
> two or three tokens, and you will end up scoring on the probability
> of YES and NO, along with the frequency of the rule names, which
> will be 1. So you probably want to do NO_SOME_RULE and
> YES_OTHER_RULE or the like when you build the insert list. Again
> though I'm not sure I see the point in the yes and no factors; the
> presence or absence of a word in the mail seems like a pretty good
> yes/no indication to me.
>
> Were I doing it I'd try it both ways and see if there is any
> difference in results.

I agree with you that it's probably not going to be very effective to use a binary token (e.g. SOME_RULE=YES vs. SOME_RULE=NO) compared to just using the presence of the rule (if SOME_RULE appears, that already implies SOME_RULE=YES).
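To illustrate why a token whose probability sits near 0.500 adds nothing, here is a minimal Python sketch. It assumes the common naive product rule for combining per-token spam probabilities, which may not be what any particular Bayes engine in this thread actually uses. The function name `combine` and the sample probabilities are made up for the example.

```python
from math import prod, isclose

def combine(probs):
    """Combine per-token spam probabilities with the naive product rule.

    Assumes every probability is strictly between 0 and 1.
    """
    spam = prod(probs)                  # evidence for spam
    ham = prod(1.0 - p for p in probs)  # evidence for ham
    return spam / (spam + ham)

# Two informative tokens, then the same two plus a token at exactly 0.5.
without_neutral = combine([0.9, 0.8])
with_neutral = combine([0.9, 0.8, 0.5])

# The 0.5 token scales spam and ham evidence equally, so it cancels out.
print(isclose(without_neutral, with_neutral))
```

Under this rule, a SOME_RULE=NO token that trains to roughly 0.5 would leave the combined score essentially unchanged, so the presence-only form loses nothing.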

So the method:

        $list = $status->get_names_of_tests_hit ()

may cover everything that is required to evaluate this approach.
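As a rough sketch of the two token schemes being compared, the Python below starts from a comma-separated string of rule names, which is the form get_names_of_tests_hit returns. The function `rule_tokens`, the `known_rules` list, and the choice of rule names are assumptions made for illustration. The YES_/NO_ prefix form follows the suggestion quoted above, so that each rule becomes a single token rather than being split at the "=".

```python
def rule_tokens(hits, known_rules=None, binary=False):
    """Turn a comma-separated list of rule names into Bayes tokens.

    binary=False: one token per rule that hit (presence only).
    binary=True:  one YES_/NO_ token for every rule in known_rules.
    """
    hit = {name.strip() for name in hits.split(",") if name.strip()}
    if not binary:
        return sorted(hit)
    if known_rules is None:
        raise ValueError("binary tokens need the full list of rules")
    return [("YES_" if rule in hit else "NO_") + rule for rule in known_rules]

# Hypothetical input standing in for the string returned for one message.
hits = "HTML_MESSAGE,MISSING_DATE"
known_rules = ["HTML_MESSAGE", "MISSING_DATE", "SUBJ_ALL_CAPS"]

print(rule_tokens(hits))
print(rule_tokens(hits, known_rules, binary=True))
```

The presence-only form needs nothing beyond the hit list, while the binary form also needs the complete set of rules, and it adds a NO_ token to every message for every rule that did not fire.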

Unfortunately I'm not using the SpamAssassin Bayes modules. I wrote my own Bayes engine because I wanted to, and then thought about including the rule results from SpamAssassin. I don't know where this might be going, but it seems to be working extremely well for me, based on a training set of just a couple hundred emails in total.

Received on Sat Jun 30 21:34:40 2007

This archive was generated by hypermail 2.1.8 : Sat Jun 30 2007 - 21:40:03 EDT
