Practical Technology

for practical people.

Cleaning the Spam Cesspool


Is there anyone left in the world unacquainted with the plight of the Abachas of Nigeria? The Abacha family fortune is not only down to a mere US $30 million, but it’s inaccessible, frozen in certain bank accounts that can be unlocked only with your Urgent Assistance, which usually consists of disclosing a bank account number, to your ultimate detriment.

The Abacha e-mails are just one of the many waves of Nigerian-based fraud schemes, which date back to faxes in the 1980s. And in turn, they’re just the tip of the spam sewage tsunami that, by some estimates, makes up 90 percent or more of all e-mail, and its volume keeps growing.

For years people have been trying, with only limited success, to sort the good mail wheat from the spam chaff in a variety of ways: creating lists of known spammers, or of mail servers that harbor known spammers (blacklisting), and creating lists of known spam messages (filtering). When done conservatively, such methods still let a lot of spam through. When done aggressively, they block legitimate messages as well as spam.

But these are hacksaws, when what’s needed are scalpels. And sure enough, a new technique has come along recently that promises to shunt almost all one’s unwanted messages to the virtual trash bin without also zapping any of the mail you want to read. It works by a sort of mathematical induction: you identify which messages are spam, and pattern-matching software, based on principles of probability theory first formulated by Thomas Bayes in the 18th century, finds commonalities among the bad messages, and among the remaining good ones as well. Rules are then formulated that generalize from those particulars.

Because it’s the content of spam messages that’s analyzed, not just the headers, it’s much harder for spammers to circumvent these filters. As Paul Graham, a self-employed software engineer and the leading exponent of the new Bayesian approach, notes, the only way spammers can ultimately evade the filters is to make their messages indistinguishable from legitimate mail. Of course, if that happened, their messages would be legitimate mail.

And just in time. According to Ferris Research, a San Francisco and London-based e-mail and groupware analysis firm, “spam will cost $140 billion worldwide in 2008, of which $42 billion will be in the United States alone.”

Sounds unbelievable? Ha! Many ISPs think that Ferris’ numbers are on the low side. According to MessageLabs in May 2009,

Pinning Spam Down

The problem with spam is that we can’t really state what it is—it’s one of those we-know-it-when-we-see-it things. What’s spam for me might not be spam for you. Is a political ad from 2008 presidential candidate John McCain spam? A Democrat might say yes, a Republican, no.

Is a notice from Amazon or Orbitz that a favorite author has a new book out, or that flights to your hometown are now available for $100 less, spam? Ummm... maybe. A note from your cousin written from a new e-mail address? A contract offer from a friend of a former client? Sure, you know those last two aren’t spam, but does your spam program?

There’s the other difficulty. Not only can something be spam for one person and not for another, but even a small possibility of falsely identifying a critical message (a job offer, say) as spam, and having it filtered out before you ever see it, is a risk we might not want to take.

Even those ads promising to enlarge body parts that half the recipient population doesn’t even have aren’t spam to someone; while spam is cheap to send, it’s not entirely cost-free. Without at least a few buyers, the sellers would eventually stop.

It is, though, distressingly cheap for the spammer. The only costs are maintaining an Internet connection and a mail server. Thus, whereas a physical-mail marketer has to pay for paper, ink, and postage, the e-mail spammer has none of those expenses. Junk mail that shows up in your physical mailbox has to earn a response rate of 1-2 percent to be cost-effective. But spam can be highly profitable at a response rate of just 0.01 percent.

In short, as Jared Blank, a senior analyst with Jupiter Research (Darien, Ct.), told me several years ago: “The true problem is that spam is effective.”

You’re not the only one who hates spam

Given those economic facts, we’re only going to see more spam. That’s a problem for all of us, none more so than the ISPs. All that spam eats up bandwidth and disk space. To keep up, ISPs need ever more servers, and ways to balance loads between them.

This isn’t cheap. Corporate mail administrators face the same problem. Mail harvesting attacks, in which a program attempts to send mail to a corporate domain and records which addresses don’t bounce, are an old, tried-and-true way to grab internal e-mail addresses.

Stopping the spam

One popular idea is to make spam illegal. Nice idea, but it hasn’t been that effective. While 2003’s CAN-SPAM law has just put Alan Ralsky, the world’s highest-profile spammer, in jail, spam continues with barely a check.

That’s largely because spam hasn’t been bound to the U.S. in years. Instead, spam comes from botnets. These are made up of anywhere from dozens to tens of thousands of malware-infected Windows PCs that their controllers use to spread spam around the world. The botnet masters are scattered around the world, but many of them seem to be in the former Soviet Union. No U.S. court will ever see them.

Another legal problem is that spam is often characterized as unsolicited commercial e-mail (UCE). But that’s not right, said Keith Lynch, maintainer of the Spam Timeline, in 2003. “Most spam isn’t commercial anyway, at least not in the sense of being about a real product or service. Much of it consists of things like chain letters; ‘stealth’ spamming software and lists of millions of e-mail addresses from people who obviously have no sense of irony; porno web pages and psychic hotlines; pyramid schemes; and just plain old scams, such as the Nigerian messages from the Abachas.”

Other legal or financial notions, such as placing a surcharge on outgoing e-mail, are as unlikely to be adopted today as they were when they were first suggested a decade ago. They tend to fall into one or more of three categories: they’re impossible to enforce, users would object, or they would break the fundamental design of the Internet. If we are to end spam, or at least reduce it to manageable levels, it is to technology that we must look.

The good news is that there are many ways to kill spam, and they’re being adopted by ISPs and corporations around the world. They involve looking at the addresses that mail is coming from and either blocking ones believed to be spam-related or letting in only ones known not to be; or looking at the content of each e-mail and seeing if it contains words and phrases that match known spam messages.

The bad news is that all these methods also kill good e-mail. For example, a friend might send you a real message with the subject “Great Free Movie Offer,” about a two-for-one deal at your favorite local theater, and many spam filters would blast it on sight.

The Spam Killers

Since the late 1990s, though, there has been a more effective method of scooping up spam that promises to identify almost all of it, while reducing the number of good fish accidentally caught in its nets to nearly zero: Bayesian analysis. Instead of manually identifying the characteristics of spam, this strategy uses the computer to do what it does best: make millions of calculations. Take every word in a few hundred or thousand e-mails and, instead of guessing, find out which words they have in common.

The results are striking. Paul Graham, who was one of the first programmers to seriously apply this approach, found that the word “sexy” in his e-mail means a message has a .99 probability of being spam. So does “ff0000,” which is the HTML code for the bright red color that so many spam messages use. Put them together and it’s odds-on that you just nailed a spam message.

A probability of .99 sounds pretty good, until you realize that if the filter flags 1,000 e-mails this week on that basis, about 10 of them will be legitimate messages that you lose. (And Murphy’s Law ensures that one of them is that job offer.) So the real genius of this method comes from two other techniques.
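
The per-word numbers Graham quotes come from plain counting, which can be sketched in a few lines of Python. This is a minimal illustration and not Graham’s actual code: the function names, the tokenizing rule, and the 0.01/0.99 clamps are all assumptions made for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into word-like tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def word_spam_probabilities(spam_msgs, good_msgs):
    """Estimate how strongly each token points toward spam.

    Assumes both lists are non-empty. Each token is counted at most once
    per message, so one repetitive message cannot dominate the statistics.
    """
    spam_counts = Counter(t for m in spam_msgs for t in set(tokenize(m)))
    good_counts = Counter(t for m in good_msgs for t in set(tokenize(m)))
    probs = {}
    for token in set(spam_counts) | set(good_counts):
        spam_rate = spam_counts[token] / len(spam_msgs)
        good_rate = good_counts[token] / len(good_msgs)
        raw = spam_rate / (spam_rate + good_rate)
        # Clamp so that no single word counts as absolute proof (an assumption).
        probs[token] = min(0.99, max(0.01, raw))
    return probs


# Tiny hand-made corpora, purely for illustration.
spam = ["sexy singles free offer", "free sexy pics"]
good = ["meeting agenda for monday", "free lunch on monday"]
probs = word_spam_probabilities(spam, good)
print(probs["sexy"], probs["monday"])  # 0.99 0.01
```

Note that “free” lands in between, at about .67, because it shows up in one of the good messages too. With only a handful of messages the numbers are crude; the approach depends on having hundreds or thousands of sorted e-mails to count.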

Graham and others realized that the goal of spam filtering was to make a probabilistic inference about something new, based on known probabilities in the past. As luck would have it, there’s a 250-year-old edifice of mathematics for doing just that sort of thing: Bayesian analysis.

Without delving too deeply, Bayes’ Theorem provides a way of combining probabilities. If there’s a .99 probability that an e-mail with the word “sexy” in it is spam, and a .97 probability for the word “sex,” then the likelihood that a message with both words is spam is 99.97%.
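
The arithmetic behind that figure can be checked with a short Python sketch. It assumes the simplest form of the combination rule: the words are treated as independent, and spam and legitimate mail are taken to be equally likely beforehand. The function name `combine` is made up for this example.

```python
from math import prod


def combine(probabilities):
    """Combine per-word spam probabilities into one score for the message.

    Assumes the words are independent and that spam and legitimate mail
    are equally likely beforehand. Each probability should be strictly
    between 0 and 1, otherwise the denominator can be zero.
    """
    spam = prod(probabilities)
    ham = prod(1.0 - p for p in probabilities)
    return spam / (spam + ham)


# (.99 * .97) / (.99 * .97 + .01 * .03) = .9603 / .9606
print(f"{combine([0.99, 0.97]):.2%}")  # 99.97%
```

Two strong hints together are more convincing than either one alone, which is why the combined figure is higher than both .99 and .97.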

Graham’s other insight was to look at the reverse probabilities as well—what are the words that never occur in spam? Since Graham’s methods use the entire e-mail message, the same e-mail addresses that would get passed by a whitelist become highly reliable indicators that something isn’t spam. In fact, looked at that way, Bayesian filtering becomes almost a superset of all of the other spam-catching techniques—blacklists, whitelists, and rule-based filtering.

Graham was not the first to look at Bayesian tactics to attack spam. That honor goes to two different sets of authors, both of whom presented papers at the 1998 annual conference of the American Association for Artificial Intelligence: Patrick Pantel and Dekang Lin, at the time at the University of Manitoba; and Mehran Sahami, Stanford University, and three Microsoft researchers, Susan Dumais, David Heckerman, and Eric Horvitz. Graham, an Internet entrepreneur who cashed out before the dot-com crash, took the basic idea, made a practical implementation of it in LISP, and popularized it with a web site and freely available source code.

Like all filtering software, Graham’s has to balance between getting as many spams as possible while avoiding false positives. But his numbers are better. Much better. At zero false positives he claims to typically catch 99.5 percent of the spam.

A Bayesian filter can keep up with spammers because you can feed it the rare spam that makes it through. Graham says he hasn’t even deployed all the possible weapons in the Bayesian arsenal, such as using pairs of words instead of just individual ones, pooling filters among multiple users, or using a simple whitelist first and using a Bayesian filter on those that don’t pass.
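
The first of those ideas, scoring pairs of words rather than single words, is easy to picture. Here is a hedged Python sketch of how adjacent word pairs might be produced before counting; the `word_pairs` helper is hypothetical and is not taken from Graham’s code.

```python
def word_pairs(text):
    """Return each pair of adjacent words in the text as a single token."""
    words = text.lower().split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


print(word_pairs("Great free movie offer"))
# ['great free', 'free movie', 'movie offer']
```

A pair such as “free movie” can then be counted separately from “free” on its own, so a word that is suspicious in one context need not be suspicious in another.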

At the same time though, Graham is quick to note that “spammers haven’t yet made a serious effort to spoof statistical filters.”

When they do, though, they’ll face several other Bayesian efforts to stop them in their tracks. Eric S. Raymond, one of the leaders of the open source software movement, is working on his own program, Bogofilter. Raymond’s doing this because he believes in it. “I went with Paul’s technique because it looks a lot more robust against spammer attempts to game against it than pattern-matching approaches.”

Many others have worked on perfecting the Bayesian model for spam prevention. Bill Yerazunis, a research scientist for Mitsubishi Electric Research Lab, has come up with what Graham says is “the best spam filtering performance I’ve heard of to date.” An October 2002 test of the software, which has a mouthful of a name, the Controllable Regex Mutilator, had an accuracy of 99.87 percent. One of its three errors was a false positive, which Yerazunis says would have been caught if he had used a whitelist in conjunction with the Bayesian approach.

“Why Am I Getting Spam?”

If you’ve ever been in a Texas duststorm, you know that no matter how much you seal up the windows and doorframes, dust will get in. Similarly, even a new e-mail account will soon start filling up with spam. How does it happen?

Spammers will “harvest” addresses from Usenet postings, Web pages, and even e-mail lists, if they’re publicly archived (and a surprising number of them are). Spammers are nothing if not extremely clever, and know that often the mail servers on which even private mailing lists reside are configured in such a way that they can be queried for list recipients’ e-mail addresses.

Many companies that require free registrations to view their sites will then sell lists of e-mail addresses. Often, if you look at the fine print of their user agreement, you’ll see that you’ve actually consented to this. Likewise, many companies and organizations, such as a bank, church, or e-commerce site, will sell your e-mail address or otherwise make it available to their “business partners.”

Answering spam, even to opt out from getting more of it, will sometimes (though not always) result in getting more. That’s because your address is now “validated”—known to have a live person on the other end of the line.

Whitelists, Blacklists, and Shades of Grey

Blacklists: This is one of the most famous approaches and, for the last few years, one of the least effective. Creators of blackhole lists (now usually shortened to ‘blacklists’) try to stop spam by determining the domain names or IP addresses of known spammers and of ISPs that tolerate spammers. E-mail administrators then use the lists to block e-mail from those addresses. The biggest problem is that errors can be made, and those errors can keep ordinary users from sending e-mail. This kind of error is called a false positive, and it can be almost impossible to fix.

For example, a popular mailing list, Politech, run by the journalist Declan McCullagh, was listed several times in 2002 by the commercial blacklist company SpamCop, not because McCullagh had ever spammed anyone. Rather, the ISP McCullagh used to send out his e-mails was in an IP address range that belonged to Rackspace, a Web hosting company that had an open mail relay, meaning anyone could send mail, including spam, anonymously through it.

Other ISPs that used SpamCop’s blacklist chose to refuse mail from anyone on Rackspace. Thus customers of those ISPs couldn’t get McCullagh’s e-mails even though they were subscribed to the Politech list.

This guilt-by-association is supposed to enrage people like McCullagh, who in turn are supposed to put pressure on their ISP to close its open mail relay. Sometimes that works, sometimes it doesn’t. The collateral damage to McCullagh and his readers is, by the blacklist theory, an inevitable consequence of the war between spammers and system administrators.

But even the best of the blacklists may not be that effective. Blacklists based on just ISP address ranges will always block some legal mail; it’s just a question of how much. In addition, since botnets are now the primary spam senders, blacklisting has become almost useless for stopping spam. A single IP address from, say, a DSL-connected Windows PC may send a flood of spam on April 1st and then not send any more until November. With millions of infected Windows systems at their disposal, smart spammers have no need to hang around a particular Internet address range.
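
A range-based blacklist check is simple to write, which is part of its appeal, and the same few lines show where the collateral damage comes from. The Python sketch below uses the standard `ipaddress` module; the blocked range and sender addresses are reserved documentation addresses chosen for illustration, not entries from any real blacklist.

```python
import ipaddress

# Hypothetical blacklist: one address range that contains an open relay.
BLOCKED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def is_blacklisted(sender_ip, blocked_ranges=BLOCKED_RANGES):
    """Return True if the sending address falls inside any blocked range."""
    address = ipaddress.ip_address(sender_ip)
    return any(address in network for network in blocked_ranges)


print(is_blacklisted("192.0.2.10"))    # the open relay itself: True
print(is_blacklisted("192.0.2.77"))    # an innocent neighbor in the same range: True
print(is_blacklisted("198.51.100.5"))  # a sender elsewhere: False
```

The second result is the Politech problem in miniature: the check knows nothing about who is sending, only where the mail comes from.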

Whitelists: The opposite of a blacklist is a whitelist. Instead of trying to turn away all the bad guys at the door, picture a guard checking an invitation list to see if you can enter the ballroom.

Of course, eventually a legitimate party-goer will show up without an invitation, for example that cousin who wrote from her new e-mail account. So whitelist systems usually auto-reply to unfamiliar senders, requiring them to perform some task that, while trivial, requires sentience (such as answering the question “What is 2+3?”). Since most spammers use commercial software incapable of understanding the auto-reply and responding correctly to it, you never hear from them again, while your sentient cousin, who does answer correctly, gets her new e-mail address added to the whitelist.

Whitelists are not very complicated, and they work. But many users find them too cumbersome. The proof is in the using: very few people use them. In fact, many e-mail users will never even have encountered one.
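
The guard-at-the-door idea, including the auto-reply challenge, can be sketched as a small Python class. Everything here is hypothetical: the class name, the return values, and the e-mail addresses are invented for illustration, and a real system would have to send and receive the challenge by mail.

```python
class ChallengeResponseWhitelist:
    """Hold mail from unknown senders until they answer a simple question."""

    def __init__(self, approved=()):
        self.approved = {address.lower() for address in approved}
        self.pending = {}  # sender address -> messages held for that sender

    def receive(self, sender, message):
        """Deliver mail from approved senders; hold and challenge the rest."""
        sender = sender.lower()
        if sender in self.approved:
            return "deliver"
        self.pending.setdefault(sender, []).append(message)
        return "challenge: What is 2+3?"

    def answer(self, sender, reply):
        """On a correct answer, approve the sender and release held mail."""
        sender = sender.lower()
        if sender in self.pending and reply.strip() == "5":
            self.approved.add(sender)
            return self.pending.pop(sender)
        return []


guard = ChallengeResponseWhitelist(["friend@example.com"])
print(guard.receive("cousin@example.net", "My new address!"))  # challenge: What is 2+3?
print(guard.answer("cousin@example.net", "5"))                 # ['My new address!']
print(guard.receive("cousin@example.net", "Hello again"))      # deliver
```

The extra round trip in the middle is the cumbersome part: every new correspondent has to do a little work before the first message gets through.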

Rule-based filters: Programs looking for specific words found in popular spam (like “Debt Free” or “Size does matter”) are probably the most common approach to filtering out spam.

The problem with this methodology is that it’s always one step behind; it requires programmers to actually look at e-mail and see what features are typical of spam. Once again, the difference between sentience-enabled eyeballs, and mechanical rules, makes this anything but trivial.

You or I can tell at a glance that a message with the subject of “F R E E S E X” is likely to be spam. But a program using heuristics might miss it unless it’s been programmed to ignore white space. Which, in turn, won’t stop a message with “F*R*E*E**S*E*X” as the subject. It seems easy to implement rule-based filters, but getting them right and keeping them up to date isn’t easy.
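
Here is a small Python sketch of that cat-and-mouse game. The phrase list and both function names are made up for illustration; no real filter’s rule set is being quoted.

```python
import re

# Hypothetical rule set: phrases that mark a subject line as spam.
SPAM_PHRASES = ["free sex", "debt free", "size does matter"]


def naive_match(subject):
    """Flag the subject only if a phrase appears exactly as written."""
    lowered = subject.lower()
    return any(phrase in lowered for phrase in SPAM_PHRASES)


def normalized_match(subject):
    """Strip everything except letters and digits before comparing."""
    squeezed = re.sub(r"[^a-z0-9]", "", subject.lower())
    return any(phrase.replace(" ", "") in squeezed for phrase in SPAM_PHRASES)


print(naive_match("F R E E S E X"))                  # False
print(normalized_match("F R E E S E X"))             # True
print(normalized_match("F*R*E*E**S*E*X"))            # True
print(normalized_match("Carefree sextet concert"))   # True
```

The last line shows the price of the fix: once separators are ignored, an innocent subject can match by accident. Every patch to the rules tends to open a new hole somewhere else.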

And, of course, the spammers are always coming up with something new to get around the latest set of rules. For example, there are programs out there that identify words, phrases, and patterns that are likely to trigger spam filters and recommend alternatives that avoid them.

It’s a never-ending war of attack and counterattack, so many e-mail systems, such as Google’s Gmail, include a ‘training’ mode by which you can report spam to improve the anti-spam mechanism’s performance.

At the same time, the more complete and complicated a heuristic rule set grows, the more system resources it requires. So end-user filtering programs can make your mail program run more and more slowly. For ISPs that same slow performance means more delays, which eventually leads to more costs to upgrade anti-spam servers.

Make no mistake about it; trainable rule-based filters are good. They’re just always a step behind. And they come with a built-in, slowly growing performance hit.

This story is based on a tale I wrote in 2003 that was published under the title ‘Saving Private Ryan’ in IEEE Spectrum. As you can tell, things haven’t gotten much better with spam prevention since then.