Re: A different approach to scoring spamassassin hits

From: Tom Allison <tom(at)tacocat.net>
Date: Sat Jun 30 2007 - 06:33:32 EDT

On Jun 30, 2007, at 4:46 AM, John Andersen wrote:

>
> On Friday 29 June 2007, Tom Allison wrote:
>
>> It would be the Bayes process that determines the effective number of
>> points you assign for each HIT based on what it's learned about it
>> from you. So the tags of: ADVANCE_FEE_1, ADVANCE_FEE_2 would be
>> represented as a token of format:
>> ADVANCE_FEE_1=YES or NO
>> ADVANCE_FEE_2=YES or NO
>> and each of these tokens would then be evaluated based on your
>> learning process.
>
> Sort of like a multiple linear regression analysis, where you simply
> start dropping terms with low coefficients to simplify the calculation.
>
> Interesting Idea.
>
> You have a bit of a chicken-and-egg problem at the start, until
> some learning takes place in the system.
>

For a purely Bayesian filter this is always the case. But I have found, through mailing lists and personal experience, that it can be mitigated in several ways.

The first approach is to implement SA only after you have trained it on a corpus of past mail you've captured. Opinions on how many messages you need to be effective vary from tens to thousands. This is strictly a YMMV issue.

Personally, I use train on error: never auto-train or train on everything, only train on the minimum needed to get the classification right. With that, about 10 emails get me above 90%. But my scoring is a little vague, since I use a ternary Yes, No, Maybe scoring process. If I exclude the Maybe results I have 100% success in very short order. Including Maybe I have 98% success after training on ~100 messages. But the worst is over in the first day.
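
To make the train-on-error idea concrete, here is a minimal sketch in Python. It is not SA's Bayes code and not the filter described above. The class name RuleBayes, the add-one smoothing, the equal class priors, and the 0.25/0.75 band for "maybe" are all assumptions chosen for illustration. Rule hits become tokens of the form RULE=YES or RULE=NO, and a message is only learned when the current verdict does not match its true label.

```python
import math
from collections import defaultdict


class RuleBayes:
    """Toy naive Bayes over rule-hit tokens such as 'ADVANCE_FEE_1=YES'."""

    def __init__(self, low=0.25, high=0.75):
        # Per-class token counts and number of trained messages.
        self.counts = {"spam": defaultdict(int), "ham": defaultdict(int)}
        self.totals = {"spam": 0, "ham": 0}
        # Hypothetical thresholds: anything strictly between them is "maybe".
        self.low, self.high = low, high

    @staticmethod
    def tokens(hits, all_rules):
        """Turn the set of rules that fired into RULE=YES / RULE=NO tokens."""
        return [f"{rule}={'YES' if rule in hits else 'NO'}" for rule in all_rules]

    def spam_probability(self, toks):
        """Combine per-token likelihood ratios (equal priors, add-one smoothing)."""
        log_odds = 0.0
        for tok in toks:
            p_spam = (self.counts["spam"].get(tok, 0) + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"].get(tok, 0) + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(self, toks):
        """Ternary verdict: 'yes' (spam), 'no' (ham), or 'maybe'."""
        p = self.spam_probability(toks)
        if p >= self.high:
            return "yes"
        if p <= self.low:
            return "no"
        return "maybe"

    def train(self, toks, label):
        """Unconditionally learn one message; label is 'spam' or 'ham'."""
        self.totals[label] += 1
        for tok in toks:
            self.counts[label][tok] += 1

    def train_on_error(self, toks, label):
        """Learn only when the verdict is wrong or 'maybe'; return True if trained."""
        expected = "yes" if label == "spam" else "no"
        if self.classify(toks) == expected:
            return False
        self.train(toks, label)
        return True
```

An untrained filter answers "maybe" for everything, which is the chicken-and-egg problem in miniature. Each call to train_on_error on a message it got wrong, or was unsure about, narrows that band.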

Another method would be to simply seed the data from a SQL script that preloads certain tokens and values. It's kind of a "hack" in my opinion, but it would be effective, and any discrepancies would be quickly resolved by training. In the case of SA I would seed the rules into the tables for the simplest yet effective results.
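
A seeding script along these lines might look like the sketch below, shown with Python's sqlite3 module so it runs on its own. The table rule_token, its columns, and the seed counts are hypothetical. They are not SA's actual Bayes schema or measured values, only placeholders that lean each rule token toward spam or ham before any real training has happened.

```python
import sqlite3

# Hypothetical schema and seed counts, for illustration only.
SEED_SQL = """
CREATE TABLE IF NOT EXISTS rule_token (
    token      TEXT PRIMARY KEY,
    spam_count INTEGER NOT NULL DEFAULT 0,
    ham_count  INTEGER NOT NULL DEFAULT 0
);
INSERT OR IGNORE INTO rule_token (token, spam_count, ham_count) VALUES
    ('ADVANCE_FEE_1=YES', 9, 1),
    ('ADVANCE_FEE_1=NO',  1, 9),
    ('ADVANCE_FEE_2=YES', 9, 1),
    ('ADVANCE_FEE_2=NO',  1, 9);
"""


def seed(conn):
    """Preload token counts; INSERT OR IGNORE keeps existing rows untouched."""
    conn.executescript(SEED_SQL)
    conn.commit()


def record(conn, token, is_spam):
    """Add one training observation on top of whatever was seeded."""
    column = "spam_count" if is_spam else "ham_count"  # fixed set, never user input
    conn.execute("INSERT OR IGNORE INTO rule_token (token) VALUES (?)", (token,))
    conn.execute(
        f"UPDATE rule_token SET {column} = {column} + 1 WHERE token = ?", (token,)
    )
    conn.commit()
```

Because the seed rows are ordinary counts, later calls to record simply add to them, so a bad initial guess is outweighed as real training data accumulates.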

This archive was generated by hypermail 2.1.8 : Sat Jun 30 2007 - 06:40:02 EDT
