Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification
by Jonathan A. Zdziarski
No Starch Press
Price: $39.99 U.S.
In case you haven’t noticed, I’ve agreed to do many book reviews. I enjoy reading and learning, so the idea of free books appeals to me, especially on topics I enjoy, and spam has always been an interest of mine.
First, I must say that Jonathan Zdziarski picked a great publisher. Many publishers just throw promo copies of books to the four winds to fulfill their contractual obligations, or count on their name or the brand recognition of the book series, and you never hear from them again. Not so with No Starch Press. Patricia Witkin does her job: she was all over me, making sure I did my review, and was available for input. Some writers get whiny about a publisher holding their feet to the fire, but I am in awe that, in this jaded world, there are people like her doing their job so well. Seriously, if I put out my own technical book, I’m going to target them first; even if they aren’t the largest publisher, they do their job of promotion.
I found the font easy to read, and there were nice little notes explaining the pros and cons of a technique and what it could be used for. The book is black and white and sparse on illustrations; they’d probably be out of place in this kind of book, but their absence still added to the dryness. I wasn’t fond of the way the text shifted between code/email and body text: both were quite readable and efficient, but they were too close in weight, style and size for the switch to be visually obvious. The first paragraph or hundred words of every chapter was set in a larger size, and that actually bugged me; it felt like a newbie who has just discovered font sizes and styles and abuses them. But this book isn’t about dazzling presentation or page layout techniques; it’s about the info, baby!
It is a well-laid-out book, broken into 14 chapters across 3 parts, plus an appendix. These are as follows:
• Part I – Introduction to Spam Filtering
    • History of Spam
    • Historical Approaches to fighting Spam
    • Language Classification Concepts
    • Statistical Filtering Fundamentals
• Part II – Fundamentals of Statistical Filtering
    • Decoding: Uncombobulating messages
    • Tokenization: The building blocks of Spam
    • Low-down dirty tricks of Spammers
    • Data storage for a zillion records
    • Scaling in large environments
• Part III – Advanced Concepts of Statistical Filtering
    • Testing Theory
    • Concept Identification: Advanced Tokenization
    • Fifth-order Markovian Discrimination
    • Intelligent Feature Set Reduction
    • Collaborative Algorithms
• Appendix: Examples of Filtering
Are you getting from the contents that this is no “Dummies’ Guide to Spam”? I didn’t find the book any harder to read than the author’s name, but it is not a light read for those casually interested in the topic of spam. As the full, intimidating title and the contents should indicate, this is really more of a textbook for understanding different ways to address the problem of spam. Sort of a well-written Master’s or Doctoral thesis (without all the supporting references or proofs), if that isn’t an oxymoron or two. Don’t get me wrong, Jonathan does a good job of bringing highly technical concepts down to earth, and there’s lots that laypeople can learn about spam and anti-spam from the book. But the book has far more heft than its measly 287 pages would lead you to believe.
Chapter one was very good, but I was a little let down that it didn’t go into where the term “spam” came from. Amusing little asides about Monty Python would have been a refreshing break from the meatier topics. However, light and fluffy was not the goal; the litany of facts and information is the meat and potatoes of this book. The tone continued this way throughout. I’m not sure if Jonathan is a dry, fact-based instructor, but the tone is all fact, observation and information, with some very dry jokes that only an uber-nerd would crack a smile at. Even then, I wished I had one of my nerdier friends around to ask whether I should be smiling at that.
Chapter two will be very useful to the masses, explaining blacklists, whitelists, challenge/response, throttling, collaborative filtering, address obfuscation, Sender Policy Framework, litigation, spam fingerprinting, and intellectual property.
By chapter three, it starts getting deep and goes beyond where most people would find value, like on old maps where it says, “beyond here there be monsters”. I’m a nerd, and I normally blast through technical books at about 5-10 pages per minute, but already I was taking my foot off the gas and slowing way down. Whoa there, hairpin ahead. Yeah, yeah, tokenizers, datasets, analysis, I get that. Training methods, decision matrices, and classification. Uh huh. Smoke is starting to leak out. And we were just getting started.
Chapter four had mathematical representations of what things were doing (later chapters had simple flow charts), but since the book is targeted at coders, I was left wishing for actual code examples. I can read both, but most programmers should understand C or even a metalanguage, so I felt that, even though it would greatly increase the page count, some of us would have gotten more value from code examples, or even a CD with sample implementations of many of the concepts. Maybe that’s just a personal preference. Still, from this chapter on it was all new ground for me and, I assume, for all but the savviest spam programmers. And it just kept going. And going.
If I were writing programs to do this, it would have taken me a week or more to go through the book in detail. As I was reading for information, it went a bit faster. Still, there was information that was useful in general, as well as specifically for approaching the problems of spamming and spam blocking; enough that I think Computer Science or Information Systems students should go through these topics, or a book like this, just for that. I really liked the discussions on scaling, distributing, testing and so on, as those applied to me as a Web programmer and would apply to others as well, not to mention anyone using heuristic or Bayesian techniques for advanced searching or other applications. Did I mention this wasn’t a casual read?
The cover’s casual look and softcover format belie the content within, which feels more textbookish. That feeling is magnified by the tone, and by little politically correct touches like censoring what an ASCII-art spammer might show (a bad representation of breasts and a bum). Seriously, if it had sample questions and self-tests at the end of every chapter, a course could and should be built around it for Information Systems or Computer Science degrees. I seriously think I deserve a few more credits for going through it. Kudos to Jonathan for getting this kind of book published in a world that loves fluff, as this was as fluffy as a cement pillow.
The book says in the intro who it is for: those interested in creating their own spam filter, nerds who just need to know, and spammers who want to figure out how to defeat current anti-spam techniques. I agree, and just hope most of the readers are in the first two categories. That should give you an indication of whether this book is for you: if you want to really understand how you would approach the problem of attacking spam, or need to know how others are doing it, then this book is totally for you.
This book is down and dirty, loaded with information, and will make your head hurt (in a good way). It isn’t a hard read, in that Jonathan did a good job of bringing the material down to a mortal’s level (assuming nerds are mortal), but it’s just not a light subject, so I wouldn’t think of giving it to the casually interested. To be fair, though, even the casually interested could get something out of the first few chapters and the appendix, and if they don’t give up there, they’ll find neat tidbits throughout, with many more for the casually interested programmer.
Anyone with a thirst for knowledge and a little commitment will generate enough heat off their head while reading this book to make their beanie propeller spin at a high rate of speed. And that makes it easily worth the $40 entry fee.