
Re: A different approach to scoring spamassassin hits

From: Marc Perkel <marc(at)perkel.com>
Date: Sat Jun 30 2007 - 01:20:00 EDT

Tom Allison wrote:
> For some years now there has been a lot of effective spam filtering
> using statistical approaches with variations on Bayesian theory. Some
> of these are inverse chi-square modifications to Naive Bayes, and even
> CRM114 and other "languages" have been developed to improve the
> scoring of statistical analysis of spam. For all statistical
> processes the spamicity is always between 0 and 1.
>
> Before this, and alongside this, has been the approach of
> spamassassin, wherein every email is evaluated against a library of
> rules and for each rule a number of points is assigned to it. Given
> enough points, the email is ham/spam. To accommodate the Bayesian
> process, SA was modified with a Bayes engine and the ability to add
> points depending on where the Bayesian score fell (>.85, >.95...).
> And for all of these processes the score is between something negative
> and something positive depending on the total number of hits and the
> points assigned to them.
>
> It occurred to me that this process of assigning points to each "HIT"
> (either addition or subtraction of points) is slightly arbitrary.
> There is a long process of evaluating for the "most effective score"
> for each rule and then providing that as the default. The Mail Admin
> has the option to retune these various parameters as needed. To me,
> this looks like a lot of knobs I can turn on a very complex machine I
> will probably never really understand. In short, if I touch it, I
> will break it. But the arbitrary part of the process is this manual
> balancing act between how many points to apply to something and
> getting the call from the CEO about his overabundance of East
> European teenage solicitors (or lack thereof).
>
> The thought I had, and have been working on for a while, is changing
> how the scoring is done. Rather than making Bayes a part of the
> scoring process, make the scoring process a part of the Bayes
> statistical engine. As an example, you would simply feed into the
> Bayesian process, as tokens, the indications of scoring hits (binary
> yes/no), which would be examined next to the other tokens in the message.
>
> It would be the Bayes process that determines the effective number of
> points you assign for each HIT based on what it's learned about it
> from you. So the tags of: ADVANCE_FEE_1, ADVANCE_FEE_2 would be
> represented as a token of format:
> ADVANCE_FEE_1=YES or NO
> ADVANCE_FEE_2=YES or NO
> and each of these tokens would then be evaluated based on your
> learning process.
>
> An advantage of this would be the elimination of the process to
> determine the best number of points to assign or to determine if you
> even want a rule included.
>
> Point assignments would be determined based on the statistical hits
> (number of spam, number of ham) and would be tuned on a per-site
> or per-user basis depending on the Bayes engine configuration. Each
> user, by means of their feedback, would tune the importance of each
> rule applied.
>
> Determining if you wanted to include a rule would be automatically
> determined for you based on the resulting scoring. If you have a rule
> that has an overall historical performance of 0.499 then it's pretty
> obvious that it's incapable of "seeing" your kind of spam/ham. But if
> you throw together a rule and run it for a week and find it's scoring
> 0.001 or 0.999 then you have evidence of how effective the rule is and
> can continue to use it. It is conceivable that you could start with
> all known rules and later on remove all the rules that are nominally
> 0.500 to improve performance by an objective process. It would also
> apply to any of the networked rules like botnet, dcc, razor because
> they just have a tagline and a YES/NO indication.
>
> I've been working on something like this myself with great effect, but
> it would be far more practical to utilize much of the knowledge and
> capability that already exists in spamassassin. But I'm not familiar
> enough with spamassassin to know how to gain visibility into all the
> rules run and all their results (hits are easy in PerMsgStatus, but
> misses are not). If someone would be willing to give me some pointers
> to a roadmap of sorts it would be appreciated.
>
> Many thanks to those of you who have read this far, for your patience
> and consideration.
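
The proposal quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SpamAssassin code: the names rule_tokens, TokenBayes and uninformative_rules are hypothetical, the combiner is a plain Laplace-smoothed naive Bayes with equal priors rather than the chi-square combining mentioned above, and the 0.05 band around 0.5 is an arbitrary choice.

```python
import math
from collections import Counter


def rule_tokens(all_rules, hits):
    """Turn rule results into tokens such as 'ADVANCE_FEE_1=YES' or '...=NO'."""
    return [f"{rule}={'YES' if rule in hits else 'NO'}" for rule in sorted(all_rules)]


class TokenBayes:
    """Hypothetical naive Bayes over tokens (rule tokens and any others)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.msgs = {"spam": 0, "ham": 0}

    def learn(self, tokens, label):
        # Count each token at most once per message.
        self.msgs[label] += 1
        self.counts[label].update(set(tokens))

    def token_spamicity(self, token):
        # Laplace-smoothed P(token|spam) and P(token|ham), equal priors assumed.
        p_spam = (self.counts["spam"][token] + 1) / (self.msgs["spam"] + 2)
        p_ham = (self.counts["ham"][token] + 1) / (self.msgs["ham"] + 2)
        return p_spam / (p_spam + p_ham)

    def score(self, tokens):
        # Combine per-token spamicities in log-odds space; result is in (0, 1).
        log_odds = 0.0
        for token in set(tokens):
            p = self.token_spamicity(token)
            log_odds += math.log(p) - math.log(1.0 - p)
        log_odds = max(-700.0, min(700.0, log_odds))  # avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-log_odds))


def uninformative_rules(model, all_rules, band=0.05):
    """Rules whose YES and NO tokens both sit within `band` of 0.5."""
    weak = []
    for rule in sorted(all_rules):
        yes = model.token_spamicity(f"{rule}=YES")
        no = model.token_spamicity(f"{rule}=NO")
        if abs(yes - 0.5) <= band and abs(no - 0.5) <= band:
            weak.append(rule)
    return weak


# Toy training data: NOISE is a made-up rule that fires equally on spam and ham.
RULES = {"ADVANCE_FEE_1", "NOISE"}
model = TokenBayes()
for hits in ({"ADVANCE_FEE_1", "NOISE"}, {"ADVANCE_FEE_1", "NOISE"},
             {"ADVANCE_FEE_1"}, {"ADVANCE_FEE_1"}):
    model.learn(rule_tokens(RULES, hits), "spam")
for hits in ({"NOISE"}, {"NOISE"}, set(), set()):
    model.learn(rule_tokens(RULES, hits), "ham")
```

In this toy data the learned weight of each rule comes entirely from the training counts: ADVANCE_FEE_1=YES ends up well above 0.5, while both NOISE tokens land at 0.5 and would be candidates for removal. In a real setup the rule tokens would be learned alongside the ordinary message tokens.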

Tom, I suggested something similar to that years ago and I'd still like to see it tried out. I wonder what would happen if you stripped out the body and ran Bayes just on the headers and the rules and let Bayes figure it out. You do have to have some points to start with to get Bayes pointed in the right direction, but you could use blacklists and whitelists to do the Bayes training. It also needs more rules to identify ham, not just rules to identify spam.

Received on Sat Jun 30 01:20:41 2007
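
One way the bootstrapping idea in this reply might look, as a rough sketch only: messages from senders on a blacklist or whitelist are labelled automatically, the body is dropped, and only header words plus rule hits are kept as training tokens. The function names, the (sender, raw_message, rule_hits) tuple layout and the plain-set lists are all assumptions made for illustration; only Python's standard email module is used for parsing.

```python
import email


def header_tokens(raw_message):
    """Tokens from header fields only; the body is ignored."""
    msg = email.message_from_string(raw_message)
    tokens = []
    for name, value in msg.items():
        for word in value.split():
            tokens.append(f"{name.lower()}:{word.lower()}")
    return tokens


def bootstrap_label(sender, blacklist, whitelist):
    """Return 'spam', 'ham', or None when the lists give no clear answer."""
    on_black = sender in blacklist
    on_white = sender in whitelist
    if on_black == on_white:  # on neither list, or (conflict) on both
        return None
    return "spam" if on_black else "ham"


def training_examples(messages, blacklist, whitelist):
    """Yield (tokens, label) pairs for messages the lists can label.

    `messages` is assumed to be an iterable of
    (sender, raw_message, rule_hits) tuples.
    """
    for sender, raw_message, rule_hits in messages:
        label = bootstrap_label(sender, blacklist, whitelist)
        if label is None:
            continue
        rule_toks = [f"{rule}=YES" for rule in sorted(rule_hits)]
        yield header_tokens(raw_message) + rule_toks, label
```

The resulting pairs could be fed to whatever Bayes learner is in use. Messages from senders on neither list are skipped, so the lists only seed the training and do not decide every message.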

This archive was generated by hypermail 2.1.8 : Sat Jun 30 2007 - 01:30:03 EDT

