Re: A different approach to scoring spamassassin hits
On Jun 30, 2007, at 6:29 PM, Loren Wilton wrote:
>> And after typing all this I'm thinking you might be right.  But
>> part of this approach is to run all these rules in YES/NO fashion
>> and see if the probability is significant.  For example: If I
>> tested for SOME_TEST=NO and found it was scoring a probability of
>> ~0.500 then it's indisputable that you are right.
>
> Well, this still doesn't make any real sense to me; it seems
> equivalent to the attempts at bayes poison that spammers stick into
> their spams: a bunch of words totally unrelated to the mail in the
> hopes of outweighing the useful terms.  Now their trick works as a
> good spam indication because the words they pick aren't common to
> my ham mails, so it is really a good spam indication rather than
> poison.  I'm not immediately convinced that will hold for the usage
> you intend.  Maybe.  Maybe not.
>
> However, if you want to do this, remember that bayes works on
> tokens and has a tokenizer.  So SOME_RULE=YES is probably either
> two or three tokens, and you will end up scoring on the probability
> of YES and NO, along with the frequency of the rule names, which
> will be 1.  So you probably want to do NO_SOME_RULE and
> YES_OTHER_RULE or the like when you build the insert list.  Again
> though I'm not sure I see the point in the yes and no factors; the
> presence or absence of a word in the mail seems like a pretty good
> yes/no indication to me.
>
> Were I doing it I'd try it both ways and see if there is any
> difference in results.

I agree with you that a binary token (e.g. SOME_RULE=YES vs.
SOME_RULE=NO) is probably not going to be much more effective than
simply recording the presence of the rule (if SOME_RULE appears in
the list, that already implies SOME_RULE=YES).
So the method:

    $list = $status->get_names_of_tests_hit ();

which returns the names of the tests that fired as a comma-separated
string, may cover everything that is required to evaluate this approach.
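
To make the two token styles concrete, here is a minimal sketch in Python (not SpamAssassin code, and not my engine). It assumes the rule names arrive as one comma-separated string, which is what get_names_of_tests_hit returns. The helper name rule_tokens, the sample rule list, and the YES_/NO_ prefixes (borrowed from Loren's suggestion so that each rule stays a single token) are all illustrative assumptions.

```python
def rule_tokens(hit_list, all_rules=None):
    """Turn a comma-separated string of rule names into Bayes tokens.

    hit_list  -- e.g. the string returned by get_names_of_tests_hit
    all_rules -- optional list of every rule we care about; when given,
                 each rule yields a YES_ or NO_ token (binary style),
                 otherwise only the rules that fired become tokens.
    """
    hits = {name.strip() for name in hit_list.split(",") if name.strip()}
    if all_rules is None:
        # Presence-only style: a rule that fired is itself the token.
        return sorted(hits)
    # Binary style: one token per known rule, prefixed so the tokenizer
    # cannot split "SOME_RULE=YES" into separate pieces.
    return sorted(("YES_" if rule in hits else "NO_") + rule
                  for rule in all_rules)


# Hypothetical sample data, for illustration only.
sample = "BAYES_99,HTML_MESSAGE,MISSING_DATE"
known = ["BAYES_99", "HTML_MESSAGE", "RCVD_IN_XBL"]

print(rule_tokens(sample))
print(rule_tokens(sample, all_rules=known))
```

In the binary style every message carries one token per known rule, so the token count per message grows with the size of the rule set. The presence-only style only adds tokens for rules that actually fired.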
Unfortunately I'm not using the SpamAssassin Bayes modules. I wrote my
own Bayes engine because I wanted to, and then thought about feeding
it the rule results from SpamAssassin. I don't know where this might
be going, but it seems to be working extremely well for me, based on
a training set of just a couple hundred emails in total.
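
For anyone who wants to try the idea without SpamAssassin's Bayes modules, the following is a rough Python sketch of a token-based Bayes scorer that accepts rule names as tokens. It is not my engine. The class name TinyBayes, the add-one smoothing, the equal-prior assumption, and the log-odds combination are all choices made just for this illustration.

```python
import math
from collections import Counter


class TinyBayes:
    """Toy naive-Bayes scorer over sets of tokens (illustrative only)."""

    def __init__(self):
        self.docs = {"spam": 0, "ham": 0}
        self.counts = {"spam": Counter(), "ham": Counter()}

    def train(self, tokens, label):
        # Count each token at most once per message (presence, not frequency).
        self.docs[label] += 1
        self.counts[label].update(set(tokens))

    def token_prob(self, token):
        """Estimate P(spam | token) with add-one smoothing and equal priors."""
        p_spam = (self.counts["spam"][token] + 1) / (self.docs["spam"] + 2)
        p_ham = (self.counts["ham"][token] + 1) / (self.docs["ham"] + 2)
        return p_spam / (p_spam + p_ham)

    def spam_prob(self, tokens):
        # Combine the per-token probabilities by summing their log-odds.
        log_odds = 0.0
        for token in set(tokens):
            p = self.token_prob(token)
            log_odds += math.log(p) - math.log(1 - p)
        return 1 / (1 + math.exp(-log_odds))


# Hypothetical training data: rule names used directly as tokens.
engine = TinyBayes()
for _ in range(2):
    engine.train(["MISSING_DATE", "HTML_MESSAGE"], "spam")
    engine.train(["HTML_MESSAGE"], "ham")

print(engine.token_prob("HTML_MESSAGE"))   # seen equally in spam and ham
print(engine.token_prob("MISSING_DATE"))   # seen only in spam
```

A token that shows up equally often in spam and ham lands at 0.5 and contributes nothing to the combined score. That is the same check as the ~0.500 test for SOME_TEST=NO mentioned in the quoted text.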
Received on Sat Jun 30 21:34:40 2007