$Id: README,v 1.2 2003/04/28 16:42:21 karl Exp $

- What?

Another implementation of Paul Graham's algorithm for spam detection
(http://www.paulgraham.com/spam.html).  I also looked extensively at
Eric Raymond's bogofilter source while writing this, and stole some
ideas from there.

- Where?

http://www.cs.umb.edu/~karl/kspam and kspam.tar.gz.
Prerequisites: perl, procmail, Unix.

- Why?

Mainly because bogofilter (http://www.tuxedo.org/~esr/bogofilter)
started to think one day that just about everything coming in was spam,
and I couldn't figure out how to debug it -- no logging facilities, and
the code had enough #ifdef's that it wasn't obvious to me what was
actually executing.

bogofilter is optimized for speed.  kspam is very, very, very slow.  I
don't mind this, because I get maybe 300 messages on a busy day -- if it
takes a few seconds to process a message, that's fine by me.  Obviously
kspam would not be suitable for a big shared server with hundreds of
users.  (There are lots of other Graham implementations for this case,
q.v.)

On the upside, kspam is pretty small (the whole thing is 500 lines of
perl), and it has lots of debugging and logging.  I suppose if it ever
needs to be fast, I could rewrite it in C.

- How?

See INSTALL for a distillation of this (but you should read this file
anyway, to understand what's happening -- basically, I'm distributing my
own personal mail setup, and it's highly unlikely you'll want to use it
as-is).

** Part 1: seeding the word lists.

All probabilistic spam detection algorithms need some input (both spam
and nonspam) to start with.  You don't get spam detection starting with
message #1 -- more like message #10000.

Nonspam is easy: you can just use all your saved messages from
correspondents (it's ok if there are a few spams sprinkled in).  For
spam, I had assiduously saved all my junk mail for quite a while,
carefully removing all real messages from the junk.
Then I ran the included program kseed to seed the word lists
(zcat spamfiles | kseed --spam; zcat nonspamfiles | kseed --nonspam).
This took many hours, but that was ok with me; I let it run overnight.

** Part 2: classifying incoming mail.

Ok, given some reasonable word lists, we can now run incoming mail
through the algorithm to filter out the spam.  I do this through
procmail; kspam doesn't do mail delivery itself, of course.  It uses
lockfile(1), from procmail, to lock against other running copies of
itself.

See the included procmailrc for an example (this is my real
.procmailrc).  It does the following:

1) weed out duplicate messages first, using formail -D.

2) don't call kspam directly; instead call a script ~/bin/testspam (also
   included).  This is because kspam is not flexible enough -- sometimes
   I want to just do some greps to make sure some messages are treated
   as nonspam (whitelist) or spam (blacklist).

3) if a message is spam, save it in ~/misc/caughtspam.  I don't want to
   delete anything outright because of possible misclassifications, etc.

** Part 3: checks and balances.

When a spam gets through the filter, I save it (in a folder
~/mail/spam).  Then nightly I run a cron job (caughtspam, included)
which reclassifies those messages using kspam --SPAM.  An alternative
would be to reclassify the message on the spot (as esr does), but I
prefer the batch approach.

Another nightly activity is to report on the spam that *was* caught by
the filter (checkspam, included).  This gives me some hope of catching
false positives.  All the mail is saved away (in
~/misc/old/caughtspam/YYYYMMDD) just in case I need to seed a spam
generator or something :).

Of course, you should try all this out using some test account first,
not just let it loose on your real mail.

--karl@cs.umb.edu
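
Appendix: a sketch of the wrapper idea from Part 2, step 2.

This is not the included testspam script, just a minimal illustration of
"grep for a whitelist and a blacklist first, then fall back to the
statistical filter".  Everything named here is an assumption made for
the example: the function is_spam, the WHITELIST and BLACKLIST pattern
files (one extended regex per line), the CLASSIFIER variable, and the
convention that exit status 0 means spam and 1 means nonspam.  How kspam
itself is invoked and what it returns is deliberately left out; put the
real command in CLASSIFIER.

```shell
#!/bin/sh
# Hypothetical whitelist/blacklist wrapper, in the spirit of testspam.
# Assumed convention: exit status 0 = spam, 1 = nonspam.

WHITELIST=${WHITELIST:-$HOME/.spam-whitelist}   # assumed file, one regex per line
BLACKLIST=${BLACKLIST:-$HOME/.spam-blacklist}   # assumed file, one regex per line
CLASSIFIER=${CLASSIFIER:-false}                 # stand-in; replace with the real filter

# Reads one message on stdin; returns 0 for spam, 1 for nonspam.
is_spam() {
    # If no temp file can be made, fail open and call the message nonspam.
    msg=$(mktemp) || return 1
    cat > "$msg"

    # Patterns are matched against the header block only, i.e. everything
    # up to the first blank line.  Missing or empty pattern files are skipped.
    if [ -s "$WHITELIST" ] &&
       sed '/^$/q' "$msg" | grep -q -i -E -f "$WHITELIST"; then
        rc=1    # whitelisted: never spam
    elif [ -s "$BLACKLIST" ] &&
         sed '/^$/q' "$msg" | grep -q -i -E -f "$BLACKLIST"; then
        rc=0    # blacklisted: always spam
    else
        # No pattern matched: let the statistical filter decide.
        # Left unquoted on purpose so CLASSIFIER may carry arguments.
        $CLASSIFIER < "$msg"
        rc=$?
    fi

    rm -f "$msg"
    return $rc
}
```

With the function loaded, "is_spam < message" can be tested by hand on
saved mail before anything that does delivery is pointed at it.  The
whitelist is checked first so that a known correspondent is never lost
to a blacklist pattern or to a bad day in the word lists.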