Building a better spam-blocking CAPTCHA

January 23, 2009 by sjvn01

How do you let people create user accounts or post comments on your Web site without letting spam bots in? Simple — make your users prove they’re human. Many Web sites use CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) technology to try to tell the bots from the people.

CAPTCHA’s idea is simple enough. It presents users with an image showing an obfuscated string of letters that they must type in to get an e-mail or social networking account, for instance, or to enter a comment on an online forum. The theory is that only humans can decipher the letters hidden in the image and type in the correct code, and for a time it was an effective tool to keep the bots out.

But while no one has yet come up with a computer that can fool people into thinking it’s another person, computers are great at fooling other computers. These days, malware makers and spammers regularly trick the CAPTCHA systems at big-name Web sites such as Yahoo Mail, Gmail and Craigslist, and use these sites to automate their attacks.

So what can we do? Can CAPTCHA be saved?

The rise and fall of CAPTCHA

CAPTCHA was created in 2000 by researchers at Carnegie Mellon University, and by 2007, the technology was being used almost everywhere on the Web. For example, if you try to leave a comment on this story, you’ll need to jump through a CAPTCHA hoop before you can leave a message.

Unfortunately, beginning in early 2008, crackers started getting the better of the CAPTCHA systems. In short order, Yahoo Mail’s, Gmail’s and Hotmail’s CAPTCHA defenses were cracked.

Then, adding insult to injury, the crackers started releasing their work in the form of do-it-yourself CAPTCHA cracking software that anyone could use. For example, a program called CL Auto Posting Tool attempts to post bogus ads to Craigslist while automatically overcoming Craigslist’s antispam protections.

These programs work by using OCR (optical character recognition) software to try to make sense of CAPTCHA’s disguised text. If they fail, they try again. They take advantage of the fact that some CAPTCHA systems don’t automatically give users a new CAPTCHA image to puzzle out. Instead, they’ll let you, or a cracker program, keep working at the hidden text until it’s solved.

Get one of these programs, aim it at the site you want to have bogus accounts on, and you can start spreading spam, anonymously flaming people you don’t like, and sending thousands of people links to your malware-infested site.

It’s not that the OCR-based cracker programs are that good. They’re not. As CAPTCHA expert Sumeet Prasad from security firm Websense explained in a blog posting, while only 10% to 15% of the attempts on Hotmail are successful, a CAPTCHA cracker program needs only six seconds per attack. If a site allows an unlimited number of chances to crack a single image, that means it will take, on average, about a minute or less to break in.
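The arithmetic behind that estimate can be sketched in a few lines. This is only a back-of-the-envelope model: it assumes every attempt succeeds independently with the same probability, so the expected number of tries is one divided by the success rate. The function name and the independence assumption are illustrative, not something taken from the Websense posting.

```python
def expected_seconds_to_crack(success_rate: float,
                              seconds_per_attempt: float = 6.0) -> float:
    """Average time to break one CAPTCHA when retries are unlimited.

    Assumes independent attempts, so the expected number of tries
    is 1 / success_rate (the mean of a geometric distribution).
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be greater than 0 and at most 1")
    return seconds_per_attempt / success_rate


# The two ends of the 10% to 15% range quoted above, at six seconds per try.
for rate in (0.10, 0.15):
    seconds = expected_seconds_to_crack(rate)
    print(f"{rate:.0%} success rate: about {seconds:.0f} seconds")
```

Under those assumptions, the quoted range works out to roughly 40 to 60 seconds per cracked image.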

Because they are clearly insecure, CAPTCHA systems that allow unlimited or multiple attempts are becoming uncommon. Still, today’s automated bots are capable of breaking even those systems that make users respond to a new CAPTCHA image after the first or second unsuccessful attempt. (On average, of course, the bots’ efforts are less likely to succeed against one-try CAPTCHA systems.) That said, simple CAPTCHA systems, such as the ones that use random, undistorted letters against a simple background, are still in common use and are easily breakable.

Another way to crack a badly designed CAPTCHA program is to reuse the session identification URL of a solved CAPTCHA image. In this case, either the cracker, or more likely a cracking program, first gets the right answer to a CAPTCHA. It then reconnects to the Web site with a URL containing the solved session identification information with a new username. Presto! You have an automated site cracker with a 100% success rate until the session ID eventually expires.
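One plausible defense against both the retry attack and the session-reuse attack is to make every challenge token single-use, so that it is thrown away after one answer, right or wrong. The sketch below is a minimal in-memory illustration. The ChallengeStore class and its method names are hypothetical, and a real site would also need token expiry and shared storage across servers.

```python
import hmac
import secrets


class ChallengeStore:
    """Hypothetical in-memory store of outstanding CAPTCHA challenges."""

    def __init__(self) -> None:
        self._answers = {}  # maps session token -> expected answer

    def issue(self, answer: str) -> str:
        """Register a new challenge and return an unguessable session token."""
        token = secrets.token_urlsafe(16)
        self._answers[token] = answer
        return token

    def verify(self, token: str, response: str) -> bool:
        """Check a response, consuming the token whether or not it is right."""
        expected = self._answers.pop(token, None)  # one attempt per token
        if expected is None:
            return False  # unknown, already used, or replayed token
        # Case-insensitive comparison done in constant time.
        return hmac.compare_digest(
            expected.lower().encode("utf-8"),
            response.strip().lower().encode("utf-8"),
        )


store = ChallengeStore()
token = store.issue("x7Kq")
print(store.verify(token, "x7kq"))  # first use of the token, correct answer
print(store.verify(token, "x7kq"))  # replay of the same solved token
```

The first call is accepted and the replay is rejected, because the token no longer exists once it has been checked. A wrong answer consumes the token too, which forces a fresh image on every attempt.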

Breaking into CAPTCHA-protected systems isn’t just something that individual crackers do for fun and financial gain. CAPTCHA cracking, believe it or not, has become a business in its own right. For example, India-based company DeCaptcher.com will solve CAPTCHAs for your spamming needs at a rate of $2 per 1,000 successfully cracked CAPTCHAs. The site explains:

“Using the advertisement in blogs, social networks, etc. significantly increases the efficiency of the business. Many services use pictures called CAPTCHAs in order to prevent automated use of these services. Solve CAPTCHAs with the help of this portal, increase your business efficiency now!”

Is it any wonder that CAPTCHA, while still popular, is becoming almost as useful a security technique as locking the barn door after the horse has been stolen?

A second chance for CAPTCHA?

So with all that, can CAPTCHA be saved? According to Carnegie Mellon computer scientists, the answer is yes. The first of their redesigns of CAPTCHA, according to Luis von Ahn, a professor of computer science at the university, is the aptly named reCAPTCHA.

This system, von Ahn said, works in conjunction with the Google Books Project and the Internet Archive, two projects that are converting paper books to digital format using OCR software. As explained above, OCR software often doesn’t read words accurately. When the projects’ OCR programs flag a word as unreadable, it’s saved as an image and used on the Web as a CAPTCHA test.

This has two positive results. First, these CAPTCHAs are already known to be resistant to OCR attacks, making Web sites that use reCAPTCHA less vulnerable to CAPTCHA crackers. Second, human users are decoding the words that the book projects’ OCR software can’t read, and thus helping to complete the two projects’ accurate conversion of older books to digital formats.

How does reCAPTCHA know that the human got a word right? By using a control word, where the system already knows the correct spelling, along with the unknown word. Von Ahn explains, “If a user enters the correct answer to the control word, the user’s other answer is recorded as a plausible guess for the unknown word. If the first three human guesses match each other, but differ from the OCRs’ guesses, the word is marked as correct and becomes a potential control word.”
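The rule von Ahn describes can be sketched roughly as follows. This is an illustration of the quoted logic only, not reCAPTCHA's actual code: the UnknownWord class, the submit function and the lowercase normalization are all assumptions, and the sketch does nothing about the case where the first three guesses disagree, since the quote doesn't cover it.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class UnknownWord:
    """A scanned word the OCR programs could not read (hypothetical model)."""
    ocr_guesses: Set[str]                       # lowercase OCR readings
    human_guesses: List[str] = field(default_factory=list)
    accepted: Optional[str] = None              # set once the word is marked correct


def submit(control_answer: str, control_input: str,
           word: UnknownWord, unknown_input: str) -> bool:
    """Return True if the user passes; record their guess for the unknown word."""
    if control_input.strip().lower() != control_answer.lower():
        return False  # control word wrong: fail the test, discard the other answer

    if word.accepted is None:
        word.human_guesses.append(unknown_input.strip().lower())
        first_three = word.human_guesses[:3]
        agree = len(first_three) == 3 and len(set(first_three)) == 1
        # Marked correct only if the first three humans agree with each other
        # and disagree with every OCR guess.
        if agree and first_three[0] not in word.ocr_guesses:
            word.accepted = first_three[0]
    return True


word = UnknownWord(ocr_guesses={"rnodern"})
for _ in range(3):
    submit("upon", "upon", word, "modern")
print(word.accepted)
```

After three matching human answers that differ from the OCR reading, the word is accepted as "modern" and could then serve as a control word itself.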

Image-based CAPTCHA

The Carnegie Mellon crew is also looking at image-based CAPTCHA. The first of these, ESP-PIX, requires users to pick a word that describes all four objects in an image. The newest of them, SQ-PIX, requires users to first pick out the right image from three and then trace the outline of the object within the image. For example, you might see an image of a cat, one of a flower and one of a balloon, with the instruction “Trace all balloons.”

These tests do have their shortcomings. For starters, what is clear to the designers may not be clear to users. In the ESP-PIX test, for example, the answer “girl” for three images of adult women and one of a young girl doesn’t make much sense. And the SQ-PIX test may require a degree of manual dexterity that not all users have. My editor, who is right-handed but uses a trackball with her left hand, found that the test failed her more often than it passed her. However, these are works in progress; Carnegie Mellon doesn’t have a scheduled completion date.

Carnegie Mellon isn’t the only group looking at image-based CAPTCHA. Penn State developers are working on Imagination CAPTCHA. In this system, a user must first pick out the geometric center of a distorted image from a page that’s filled with similar overlapping pictures.

If you get that right, you’re presented with another carefully distorted image and asked to pick a word to describe what you’re seeing. The Imagination system is based on ALIPR (Automatic Linguistic Indexing of Pictures), an automated image-tagging and searching technology.

The core idea, as the developers explain on their site, is that image recognition is a harder problem for computers to solve than text recognition, making the Imagination system more secure than text-based CAPTCHAs. In fact, the developers welcome attempts to crack the system: “If you think a robot can also pass our test without random guessing, give it a try and we’d love to know how far your robot can get.”

Unfortunately, color-blind users are likely to face problems with the Imagination system. (Blind and visually impaired people, of course, will have problems with all image-based CAPTCHAs.)

Image-based CAPTCHAs still aren’t in widespread use. A few simple ones, such as KittenAuth, are starting to see use. (For example, some phpBB online forum systems are using KittenAuth.) With KittenAuth, users are presented with a grid of 12 pictures of animals and then asked to pick out, for example, the ones containing — you guessed it — kittens.
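The check behind a picture grid like this is simple enough to sketch. The code below is not KittenAuth's own; the function names and the cell numbering are made up for illustration. The odds calculation assumes a bot that marks each of the 12 pictures at random, like a coin flip, and gets only one try.

```python
from typing import Iterable


def selection_is_correct(kitten_cells: Iterable[int],
                         selected_cells: Iterable[int]) -> bool:
    """Pass only if the user picked every kitten picture and nothing else."""
    return set(selected_cells) == set(kitten_cells)


def blind_guess_odds(grid_size: int = 12) -> float:
    """Chance that marking each picture at random matches the exact answer."""
    return 1 / 2 ** grid_size


# Hypothetical grid where cells 1, 4 and 9 hold kittens.
print(selection_is_correct({1, 4, 9}, [9, 1, 4]))
print(f"1 in {round(1 / blind_guess_odds()):,}")
```

Under that coin-flip assumption, a blind guess matches a 12-picture grid about once in 4,096 tries, which is why limiting retries matters here just as much as it does for text CAPTCHAs.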

Microsoft Research has taken the same idea for its ASIRRA (Animal Species Image Recognition for Restricting Access) technology. ASIRRA uses a larger pool of images from PetFinder.com, but otherwise this Web service CAPTCHA is essentially a KittenAuth clone. While to my knowledge no major sites are currently using ASIRRA, Microsoft has made PHP, Python, C#, Perl, Visual Basic and JScript code available, as well as a WordPress plug-in, so it shouldn’t be long before multiple Web sites are giving ASIRRA a try.

Sneaky CAPTCHA tricks

Stephen Moseley, a Web designer and developer at media production company Hannisdal Express, has a sneaky way of stopping CAPTCHA bot attackers: incorporate a hidden field with CSS (Cascading Style Sheets). The field is coded so that human users never see it. Bots, however, read the page’s code and note that there is a field to be filled in, and proceed to do so. That, of course, is enough to mark the visitor as a potential cracking program rather than an actual user.

“The bots should fill it in, and if you compare the inputted value to the value you start with, you can quit execution right there,” says Moseley. “You do, however, have to make sure to label this so that people with screen readers can understand not to fill it in. I’ve used this on some nonhigh traffic forms and it works pretty well. It probably won’t stop serious spam bots for a large site, though.”
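Here is a minimal sketch of the hidden-field check Moseley describes, assuming the submitted form arrives on the server as a simple mapping of field names to values. The field name "website", the CSS class and the function name are invented for illustration; they are not taken from Moseley's own forms.

```python
from typing import Mapping

HONEYPOT_FIELD = "website"  # hypothetical field name, hidden from people with CSS
HONEYPOT_INITIAL = ""       # the value the server rendered into the field

# Markup the page might serve. The label is there for anyone whose browser
# or screen reader exposes the field despite the CSS.
FORM_SNIPPET = """
<style>.extra-field { display: none; }</style>
<div class="extra-field">
  <label for="website">Leave this field blank</label>
  <input type="text" id="website" name="website" value="">
</div>
"""


def looks_like_bot(form: Mapping[str, str]) -> bool:
    """Flag the submission if the hidden field no longer holds its initial value."""
    submitted = form.get(HONEYPOT_FIELD, HONEYPOT_INITIAL)
    return submitted != HONEYPOT_INITIAL


print(looks_like_bot({"comment": "Nice article", "website": ""}))
print(looks_like_bot({"comment": "Cheap pills", "website": "http://spam.example"}))
```

The first submission passes and the second is flagged. In this sketch a submission that omits the field entirely is treated as unchanged, which is a design choice; a stricter site might treat a missing field as suspicious too.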

Moseley also suggests using simple math problems in CAPTCHA tests. As he explains, though, this approach has two problems: “possible discrimination against the mentally handicapped and the fact that you would need to make the questions random (i.e., you don’t want it to always be 2 + 2).”
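The randomness Moseley calls for might look something like this. It is a sketch only: the function names, the single-digit range and the use of addition alone are assumptions. The expected answer would have to stay on the server, tied to the visitor's session, rather than be sent to the browser.

```python
import random
from typing import Optional, Tuple


def make_math_challenge(rng: Optional[random.Random] = None) -> Tuple[str, int]:
    """Return a randomly generated question and the answer to keep server-side."""
    if rng is None:
        rng = random.SystemRandom()  # operating-system randomness, not seedable
    a, b = rng.randint(1, 9), rng.randint(1, 9)
    return f"What is {a} + {b}?", a + b


def check_math_answer(expected: int, response: str) -> bool:
    """Accept the response only if it parses as the expected whole number."""
    try:
        return int(response.strip()) == expected
    except ValueError:
        return False


question, answer = make_math_challenge()
print(question)
print(check_math_answer(answer, str(answer)))
```

Because a question this simple is also easy for a program to parse, this approach mostly deters unsophisticated bots, much like the hidden-field trick.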

The bottom line

What all these variations on CAPTCHA mean for Web administrators is that CAPTCHA will continue to be useful. However, the old, simple CAPTCHA systems are hopelessly obsolete.

And even the improved CAPTCHA strategies may not be useful for long. Carnegie Mellon’s von Ahn believes that, for the immediate future, image-based CAPTCHAs will be effective. Eventually though, within 50 years at the most, von Ahn thinks that computers will be bright enough to solve any form of CAPTCHA.

But what about right now? To secure a Web site in 2009, companies would be well advised to look at reCAPTCHA, which comes with a wide variety of application and programming plug-ins and an open API (application programming interface). With these, no matter what software you’re running on your Web site, you should be able to easily add reCAPTCHA protection to your Web-based applications.

Looking ahead, you should start following image-based CAPTCHA technologies. They promise to have a longer effective life.

All that said, it should also be kept in mind that, even as bot-based CAPTCHA attacks are held at bay, there’s no effective defense against humans breaking CAPTCHAs for money. All that any CAPTCHA system, or any other security measure, can really do is slow down would-be crackers.

At the end of the day, Web security must be concerned not only with keeping out attackers, but with minimizing the damage they can cause when they have broken into a site.

A version of this story first appeared in ComputerWorld.
