Re: Some thoughts on Baysian Setup...
From: Chris St. Pierre <stpierre(at)NebrWesleyan.edu>
Date: Mon Aug 27 2007 - 10:46:51 EDT
> 1. Most users don't know how, arn't allowed, or can't be bothered to train

Disagree. With proper training -- or if you make it trivially easy, like GMail/Yahoo's "Report as Spam" links -- then users will train Bayes.

> 2. Most people would consider the same emails to be SPAM. 90% of what I

Strongly disagree. Many users consider anything they don't want to be spam, including all sorts of solicited email. I had one user who, rather than turn off email notifications from Facebook, reported them as spam until they started getting blocked. Since we implemented a system where reporting a message as spam automatically blacklists the sender for the reporting user, I've had a number of reports of students blacklisting their professors because they didn't want some notification they were sent. Perhaps you and I might agree on what spam is, but Joe User does _not_.

> 3. The emails which we would disagree on would probably be newsletters and

Again, you and I would probably find ourselves in this situation, but you and Joe User (or I and Joe User) would not.

> 4. Site wide bayes saves disk space and more importantly it saves

Not sure on this one. None of the performance statistics I gather showed any noticeable hit when I switched from sitewide to per-user.

> 5. A larger database leads to more accurate baysian identification - I am

"It depends." :) With Bayes poisoning all the rage, it sometimes helps to avoid a really huge database. A few months ago, we started over and, for the first week or two, spam went up, but then it dropped to below previous levels; cleaning out the crap can help from time to time. So what's important is having a well-tuned database -- not necessarily a large database. If Joe and Jane User get different kinds of mail, disagree on what spam is, etc., then they should have different databases. (What if Joe receives a legitimate newsletter on stock tips, for instance?)

> 1. What I think of as HAM emails could be widely different from what you

I again disagree. We retain all of the messages that users report as FPs and FNs, and, in general, the FPs are more obvious and certainly easier to agree on. I would never use the FNs as a spam corpus, for the aforementioned reasons, but I think the FPs would be pretty reliable.

> 2. If a server has one customer who is a plumber and one who is an artist,

Agree, mostly. If you have one customer who is a day trader and one who works with Pfizer Canada, then they'll constantly be fighting each other because the former doesn't want spam about Viagra from our neighbors to the north and the latter doesn't want spam about the latest stock that's about to blow up. (This is obviously a contrived example, but you get the idea.) With a diverse user base, any sort of one-size-fits-all filtering is bound to increase FPs and FNs.

> 3. If per user bayes is chosen then bayes_00 will only fire on emails

...or were expressly learned by the user. Agree.

> 4. If a HAM email is misclasified as SPAM then users are more likely to

For some value of "few," I agree.

> SPAM tokens are stored on a server wide basis - can be a LARGE database if

I think users would be just as adept at poisoning such a split database as they would be at poisoning a unified, site-wide database. In any reasonably diverse user base, what my fellow user thinks is spam should not affect what I get in my mailbox.
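
To make the day trader / Pfizer example concrete, here is a minimal sketch. It assumes a toy token-count model with equal priors and add-one smoothing, which is not how SpamAssassin's Bayes actually scores mail; the names TokenDB, learn, spam_probability and compare_databases are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


class TokenDB:
    """Toy Bayes token store (hypothetical, not SpamAssassin's format)."""

    def __init__(self):
        self.spam = Counter()  # token -> number of spam messages containing it
        self.ham = Counter()   # token -> number of ham messages containing it
        self.nspam = 0
        self.nham = 0

    def learn(self, text, is_spam):
        # Count each distinct token once per message.
        tokens = set(text.lower().split())
        if is_spam:
            self.spam.update(tokens)
            self.nspam += 1
        else:
            self.ham.update(tokens)
            self.nham += 1

    def spam_probability(self, text):
        # Sum per-token log-odds with add-one smoothing, assuming equal priors.
        log_odds = 0.0
        for tok in set(text.lower().split()):
            p_spam = (self.spam[tok] + 1) / (self.nspam + 2)
            p_ham = (self.ham[tok] + 1) / (self.nham + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1.0 / (1.0 + math.exp(-log_odds))


def compare_databases(message):
    """Score one message against two per-user databases and a shared one."""
    trader, pharma, sitewide = TokenDB(), TokenDB(), TokenDB()

    # The day trader wants stock mail and reports Viagra mail as spam.
    for db in (trader, sitewide):
        db.learn("cheap viagra pills", is_spam=True)
        db.learn("stock tips newsletter", is_spam=False)

    # The pharma employee wants Viagra mail and reports stock mail as spam.
    for db in (pharma, sitewide):
        db.learn("hot stock tips", is_spam=True)
        db.learn("viagra shipment schedule", is_spam=False)

    return {
        "trader": trader.spam_probability(message),
        "pharma": pharma.spam_probability(message),
        "sitewide": sitewide.spam_probability(message),
    }


if __name__ == "__main__":
    for name, score in compare_databases("stock tips").items():
        print(f"{name}: {score:.2f}")
```

In this toy setup the two per-user databases give opposite verdicts on "stock tips", while the shared database has seen the same tokens equally often as spam and as ham and so cannot decide either way. That is the fighting-each-other effect described above, in miniature.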
Chris St. Pierre
This archive was generated by hypermail 2.1.8 : Thu Oct 25 2007 - 22:21:15 EDT