RE: Global Bayes and AWL
From: Giampaolo Tomassoni <g.tomassoni(at)libero.it>
Date: Sat Oct 13 2007 - 12:43:05 EDT
It is not impossible, but it would carry its own speed cost. Some time ago I wished to have a three-level layered Bayes: site level, organization level and mailbox level. The idea was to reshape the bayes DB store code and, probably, the scoring code, such that
From a store standpoint, this means that tokens shouldn't have any ham/spam

When an incoming mail is scanned and tokens are extracted, for each token the code should "count" how many times the user (auto-)reported that token as being ham or spam or, if there are no occurrences of that token in the user layer, how many times that token had been reported as ham or spam at organizational level (that is: by all users in a domain/organization). Then, if there is still no occurrence of the token, how many times that token had been tagged as spammy at site level (that is: by every user in every organization), if any.

This reasoning could even be changed somehow in order to statistically prioritize user preferences over organizational ones over site ones, which would be much preferable to the previous idea, since simply splitting the mail corpus into three levels would easily result in an unreliably small user-level, and even organization-level, virtual corpus. However, this would mean tuning the well-known Bayes classification equations to this need, which should be done carefully and not released before a review by someone versed in Bayesian theory.

A further benefit stemming from a multi-layer approach would be easy and reliable expiration of bayes entries, by simply deleting mails that arrived before the expiry period, then tokens no longer referred to by any e-mail. This is something most serious SQL servers could even do automatically after deleting any token whose last-seen time is before a given threshold.

Also, AWL currently has its own table to do its work. This design could instead use two further fields in the "mails" table holding the source mail address and IP address, and a further field in the usermails table holding the computed SA score. AWL could use this data to do its dirty job, thereby obtaining data expiration for free.
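The per-token fallback described above (user layer first, then organization, then site) might be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the `Counts` type, the `layered_counts` function and the in-memory dictionaries are hypothetical stand-ins for whatever the SQL store would return, and none of it is part of the SpamAssassin code.

```python
from collections import namedtuple

# Hypothetical record: how many times a token was reported as ham / spam.
Counts = namedtuple("Counts", ["ham", "spam"])


def layered_counts(token, user_layer, org_layer, site_layer):
    """Return (layer_name, Counts) from the most specific layer that knows the token.

    Each layer is assumed to be a mapping token -> Counts. The user layer is
    consulted first, then the organization layer, then the site layer.
    """
    layers = (("user", user_layer), ("org", org_layer), ("site", site_layer))
    for name, layer in layers:
        counts = layer.get(token)
        # A token with zero reports counts as "no occurrence" in that layer.
        if counts is not None and counts.ham + counts.spam > 0:
            return name, counts
    # Token unknown at every level.
    return None, Counts(0, 0)


# Made-up example data, purely for illustration.
user = {"viagra": Counts(ham=0, spam=3)}
org = {"invoice": Counts(ham=12, spam=1), "viagra": Counts(ham=1, spam=40)}
site = {"lottery": Counts(ham=2, spam=250)}

for tok in ("viagra", "invoice", "lottery", "unseen"):
    print(tok, layered_counts(tok, user, org, site))
```

Note that this is the plain "first layer that has the token wins" variant. The statistically weighted variant, where the user layer is prioritized over the others rather than hiding them, would need the modified Bayes equations mentioned above and is not attempted here.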
Of course, since this would have so much impact on the Bayes code, I would certainly prefer this design to be in the mainstream SA code, in order to avoid "reinventing the wheel" each time I had to update SA. The problem is that this design would be much more complex than the current one, and the question is: would it be usable by anybody but the tiniest ISPs using SA? It would probably be fine for me, with some hundreds of e-mails received per day. But what if one has to scan 10,000,00 mails/day? Sure, one can use smart SQL servers with statistical query optimizers and the like, but even this way computing the bayes score of an incoming mail would probably take a couple of seconds on average, as opposed to the current few tenths of a second...

So, flexibility often comes at the expense of speed, and I guess many on this list would not appreciate that.

Giampaolo

Received on Sat Oct 13 12:46:38 2007

This archive was generated by hypermail 2.1.8 : Fri Jul 04 2008 - 15:01:19 EDT