You're using your computer to purchase tickets to see a concert at a local venue. Before you can buy the tickets, you first have to pass a test. It's not a hard test -- in fact, that's the point. For you, the test should be simple and straightforward. But for a computer, the test should be almost impossible to solve.
This sort of test is a CAPTCHA, an acronym that stands for Completely Automated Public Turing Test to Tell Computers and Humans Apart. They're also known as a type of Human Interaction Proof (HIP). You've probably seen CAPTCHA tests on lots of Web sites. The most common form of CAPTCHA is an image of several distorted letters. It's your job to type the correct series of letters into a form. If your letters match the ones in the distorted image, you pass the test.
Why would anyone need to create a test that can tell humans and computers apart? It's because of people trying to game the system -- they want to exploit weaknesses in the computers running the site. While these individuals probably make up a minority of all the people on the Internet, their actions can affect millions of users and Web sites. For example, a free e-mail service might find itself bombarded by account requests from an automated program. That automated program could be part of a larger attempt to send out spam mail to millions of people. The CAPTCHA test helps identify which users are real human beings and which ones are computer programs.
One interesting thing about CAPTCHA tests is that the people who design the tests aren't always upset when their tests fail. That's because for a CAPTCHA test to fail, someone has to find a way to teach a computer how to solve the test. In other words, every CAPTCHA failure is really an advance in artificial intelligence.
Let's take a closer look at exactly what a CAPTCHA is in the next section.
CAPTCHAs and the Turing Test
CAPTCHA technology has its foundation in an experiment called the Turing Test. Alan Turing, sometimes called the father of modern computing, proposed the test as a way to examine whether or not machines can think -- or appear to think -- like humans. The classic test is a game of imitation. In this game, an interrogator asks two participants a series of questions. One of the participants is a machine and the other is a human. The interrogator can't see or hear the participants and has no way of knowing which is which. If the interrogator is unable to figure out which participant is a machine based on the responses, the machine passes the Turing Test.
Of course, with a CAPTCHA, the goal is to create a test that humans can pass easily but machines can't. It's also important that the CAPTCHA application is able to present different CAPTCHAs to different users. If a visual CAPTCHA presented a static image that was the same for every user, it wouldn't take long before a spammer spotted the form, deciphered the letters, and programmed an application to type in the correct answer automatically.
Most, but not all, CAPTCHAs rely on a visual test. Computers lack the sophistication that human beings have when it comes to processing visual data. We can look at an image and pick out patterns more easily than a computer. The human mind sometimes perceives patterns even when none exist, a quirk we call pareidolia. Ever see a shape in the clouds or a face on the moon? That's your brain trying to associate random information into patterns and shapes.
But not all CAPTCHAs rely on visual patterns. In fact, it's important to have an alternative to a visual CAPTCHA. Otherwise, the Web site administrator runs the risk of disenfranchising any Web user who has a visual impairment. One alternative to a visual test is an audible one. An audio CAPTCHA usually presents the user with a series of spoken letters or numbers. It's not unusual for the program to distort the speaker's voice, and it's also common for the program to include background noise in the recording. This helps thwart voice recognition programs.
Another option is to create a CAPTCHA that asks the reader to interpret a short passage of text. A contextual CAPTCHA quizzes the reader and tests comprehension skills. While computer programs can pick out key words in text passages, they aren't very good at understanding what those words actually mean.
In the next section, we'll take a closer look at the kinds of sites that use CAPTCHA to verify whether or not you have a pulse.
Who Uses CAPTCHA
One common application of CAPTCHA is for verifying online polls. In fact, a former Slashdot poll serves as an example of what can go wrong if pollsters don't implement filters on their surveys. In 1999, Slashdot published a poll that asked visitors to choose the graduate school that had the best program in computer science. Students from two universities -- Carnegie Mellon and MIT -- created automated programs called bots to vote repeatedly for their respective schools. While those two schools received thousands of votes, the other schools only had a few hundred each. If it's possible to create a program that can vote in a poll, how can we trust online poll results at all? A CAPTCHA form can help prevent programmers from taking advantage of the polling system.
Registration forms on Web sites often use CAPTCHAs. For example, free Web-based e-mail services like Hotmail, Yahoo! Mail or Gmail allow people to create an e-mail account free of charge. Usually, users must provide some personal information when creating an account, but the services typically don't verify this information. They use CAPTCHAs to try to prevent spammers from using bots to generate hundreds of spam mail accounts.
Ticket brokers like Ticketmaster also use CAPTCHA applications. These applications help prevent ticket scalpers from bombarding the service with massive ticket purchases for big events. Without some sort of filter, it's possible for a scalper to use a bot to place hundreds or thousands of ticket orders in a matter of seconds. Legitimate customers become victims as events sell out minutes after tickets become available. Scalpers then try to sell the tickets above face value. While CAPTCHA applications don't prevent scalping, they do make it more difficult to scalp tickets on a large scale.
Some Web pages have message boards or contact forms that allow visitors to either post messages to the site or send them directly to the Web administrators. To prevent an avalanche of spam, many of these sites have a CAPTCHA program to filter out the noise. A CAPTCHA won't stop someone who is determined to post a rude message or harass an administrator, but it will help prevent bots from posting messages automatically.
The most common form of CAPTCHA requires visitors to type in a word or series of letters and numbers that the application has distorted in some way. Some CAPTCHA creators came up with a way to increase the value of such an application: digitizing books. An application called reCAPTCHA harnesses users' responses in CAPTCHA fields to verify the contents of a scanned piece of paper. Because computers aren't always able to identify words from a digital scan, humans have to verify what a printed page says. Then it's possible for search engines to search and index the contents of a scanned document.
Here's how it works: First, the administrator of the reCAPTCHA program digitally scans a book. Then, the reCAPTCHA program selects two words from the digitized image. The application already recognizes one of the words. If the visitor types that word into a field correctly, the application assumes the second word the user types is also correct. That second word goes into a pool of words that the application will present to other users. As each user types in a word, the application compares the word to the original answer. Eventually, the application receives enough responses to verify the word with a high degree of certainty. That word can then go into the verified pool.
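The double-word flow described above can be sketched in a few lines of Python. This is an illustrative sketch only; the class, its names, and the agreement threshold are assumptions for the example, not the actual reCAPTCHA implementation.

```python
# Sketch of the reCAPTCHA double-word logic: one control word the
# system already knows, one unknown word that humans verify by consensus.
from collections import Counter

class WordVerifier:
    def __init__(self, known_word, unknown_word_id, required_agreements=3):
        self.known_word = known_word            # word the application already recognizes
        self.unknown_word_id = unknown_word_id  # scan of the word OCR failed to read
        self.responses = Counter()              # tally of human answers for the unknown word
        self.required = required_agreements     # how many matching answers count as "verified"
        self.verified = None

    def submit(self, known_answer, unknown_answer):
        # If the control word is wrong, distrust the whole submission.
        if known_answer.lower() != self.known_word.lower():
            return False
        # The control word matched, so tally the answer for the unknown word.
        self.responses[unknown_answer.lower()] += 1
        word, count = self.responses.most_common(1)[0]
        if count >= self.required:
            self.verified = word  # enough users agree: move it to the verified pool
        return True

v = WordVerifier("morning", unknown_word_id="scan_042")
v.submit("morning", "quixote")
v.submit("morning", "quixote")
v.submit("mornin", "wrong")     # control word mistyped: response ignored
v.submit("morning", "quixote")
print(v.verified)               # "quixote"
```

The design choice to reject the whole submission when the control word fails is what lets the system trust the second answer at all: a user who read one distorted word correctly probably read the other one correctly too.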
It sounds time consuming, but remember that in this case the CAPTCHA is pulling double duty. Not only is it verifying the contents of a digitized book, it's also verifying that the people filling out the form are actually people. In turn, those people are gaining access to a service they want to use.
Next, we'll take a look at the process that goes into creating a CAPTCHA.
Creating a CAPTCHA
The first step to creating a CAPTCHA is to look at the different ways humans and machines process information. Machines follow sets of instructions. If something falls outside the realm of those instructions, the machine isn't able to compensate. A CAPTCHA designer has to take this into account when creating a test. For example, it's easy to build a program that looks at metadata -- the information on the Web that's invisible to humans but machines can read. If you create a visual CAPTCHA and the image's metadata includes the solution, your CAPTCHA will be broken in no time.
Similarly, it's unwise to build a CAPTCHA that doesn't distort letters and numbers in some way. An undistorted series of characters isn't very secure. Many computer programs can scan an image and recognize simple shapes like letters and numbers.
One way to create a CAPTCHA is to pre-determine the images and solutions it will use. This approach requires a database that includes all the CAPTCHA solutions, which can compromise the reliability of the test. According to Microsoft Research experts Kumar Chellapilla and Patrice Simard, humans should have an 80 percent success rate at solving any particular CAPTCHA, but machines should only have a 0.01 percent success rate [source: Chellapilla and Simard]. If a spammer managed to find a list of all CAPTCHA solutions, he or she could create an application that bombards the CAPTCHA with every possible answer in a brute force attack. The database would need more than 10,000 possible CAPTCHAs to meet the qualifications of a good CAPTCHA.
Other CAPTCHA applications create random strings of letters and numbers. You aren't likely to ever get the same series twice. Using randomization eliminates the possibility of a brute force attack -- the odds of a bot entering the correct series of random letters are very low. The longer the string of characters, the less likely a bot will get lucky.
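Here's what a random-challenge generator might look like, along with the brute-force odds the paragraph alludes to. This is a minimal sketch under stated assumptions: a 36-symbol alphabet and a six-character challenge, not any particular CAPTCHA product's scheme.

```python
# Generate an unpredictable challenge string and compute the odds
# of a bot guessing it blindly in one try.
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits  # 36 possible symbols

def new_challenge(length=6):
    # secrets provides cryptographically strong randomness, so
    # challenges are unpredictable and effectively never repeat.
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

challenge = new_challenge()
print(challenge)  # e.g. "K3R9ZP" (different every run)

# Odds of guessing one six-character challenge blindly:
odds = 1 / len(ALPHABET) ** 6
print(f"{odds:.2e}")  # roughly 4.6e-10; each extra character divides the odds by 36
```

This is why the article notes that longer strings are safer: every additional character multiplies the search space by the alphabet size.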
CAPTCHAs take different approaches to distorting words. Some stretch and bend letters in weird ways, as if you're looking at the word through melted glass. Others put the word behind a crosshatched pattern of bars to break up the shape of the letters. A few use different colors or a field of dots to achieve the same effect. In the end, the goal is the same: to make it really hard for a computer to figure out what's in the CAPTCHA.
Designers can also create puzzles or problems that are easy for humans to solve. Some CAPTCHAs rely on pattern recognition and extrapolation. For example, a CAPTCHA might include a series of shapes and ask the user which shape among several choices would logically come next. The problem with this approach is that not all humans are good with these kinds of problems and the success rate for a human user can drop below 80 percent.
Next, we'll take a look at how computers can break CAPTCHAs.
Breaking a CAPTCHA
The challenge in breaking a CAPTCHA isn't figuring out what a message says -- after all, humans should have at least an 80 percent success rate. The really hard task is teaching a computer how to process information in a way similar to how humans think. In many cases, people who break CAPTCHAs concentrate not on making computers smarter, but on reducing the complexity of the problem posed by the CAPTCHA.
Let's assume you've protected an online form using a CAPTCHA that displays English words. The application warps the font slightly, stretching and bending the letters in unpredictable ways. In addition, the CAPTCHA includes a randomly generated background behind the word.
A programmer wishing to break this CAPTCHA could approach the problem in phases. He or she would need to write an algorithm -- a set of instructions that directs a machine to follow a certain series of steps. In this scenario, one step might be to convert the image to grayscale. That means the application removes all the color from the image, taking away one of the levels of obfuscation the CAPTCHA employs.
Next, the algorithm might tell the computer to detect patterns in the black and white image. The program compares each pattern to a normal letter, looking for matches. If the program can only match a few of the letters, it might cross reference those letters with a database of English words. Then it would plug in likely candidates into the submit field. This approach can be surprisingly effective. It might not work 100 percent of the time, but it can work often enough to be worthwhile to spammers.
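The dictionary cross-reference step can be sketched in pure Python. In this hypothetical example, the pattern matcher has confidently identified some letters and marked the unreadable ones as wildcards; the cracker then narrows the possibilities against a word list. The function name and the tiny dictionary are illustrative assumptions, not part of any real cracking tool.

```python
# Cross-reference partially recognized letters against a dictionary.
import re

def candidate_words(partial, dictionary):
    # "partial" looks like "c?pt??a": "?" marks letters the
    # pattern matcher couldn't read with confidence.
    pattern = re.compile("^" + partial.replace("?", "[a-z]") + "$")
    return [word for word in dictionary if pattern.match(word)]

dictionary = ["captcha", "capture", "culture", "cant", "caption"]
print(candidate_words("c?pt??a", dictionary))  # ['captcha']
```

Even recognizing just a few letters shrinks the candidate list dramatically, which is why CAPTCHAs built on real dictionary words are easier to attack than random strings.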
What about more complex CAPTCHAs? The Gimpy CAPTCHA displays 10 English words with warped fonts across an irregular background. The CAPTCHA arranges the words in pairs and the words of each pair overlap one another. Users have to type in three correct words in order to move forward. How reliable is this approach?
As it turns out, with the right CAPTCHA-cracking algorithm, it's not terribly reliable. Greg Mori and Jitendra Malik published a paper detailing their approach to cracking the Gimpy version of CAPTCHA. One thing that helped them was that the Gimpy approach uses actual words rather than random strings of letters and numbers. With this in mind, Mori and Malik designed an algorithm that tried to identify words by examining the beginning and end of the string of letters. They also used Gimpy's 500-word dictionary.
Mori and Malik ran a series of tests using their algorithm. They found that their algorithm could correctly identify the words in a Gimpy CAPTCHA 33 percent of the time [source: Mori and Malik]. While that's far from perfect, it's also significant. Spammers can easily afford a one-third success rate if they set bots to attack CAPTCHAs several hundred times every minute.
You'd think that the inventors of CAPTCHA would be upset that their hard work is being picked apart by hackers, but you'd be wrong. Find out why in the next section.
CAPTCHA and Artificial Intelligence
Luis von Ahn of Carnegie Mellon University is one of the inventors of CAPTCHA. In a 2006 lecture, von Ahn talked about the relationship between things like CAPTCHA and the field of artificial intelligence (AI). Because CAPTCHA is a barrier between spammers or hackers and their goal, these people have dedicated time and energy toward breaking CAPTCHAs. Their successes mean that machines are getting more sophisticated. Every time someone figures out how to teach a machine to defeat a CAPTCHA, we move one step closer to artificial intelligence.
As people find new ways to get around CAPTCHA, computer scientists like von Ahn develop CAPTCHAs that address other challenges in the field of AI. A step backward for CAPTCHA is still a step forward for AI -- every defeat is also a victory [source: Human Computation].
But what about Web administrators? They might not find von Ahn's philosophy to be nearly as attractive. From their perspective, they still have to deal with a massive problem -- spammers and hackers. People who maintain Web sites or create online polls need to be aware that several CAPTCHA systems are no longer effective. It's important to do a little research on which CAPTCHA applications are still reliable. And it's equally important to keep up to date on the subject. If one CAPTCHA system fails, the administrator might need to remove the code from his or her site and replace it with another version.
As for CAPTCHA designers, they have to walk a fine line. As computers become more sophisticated, the testing method must also evolve. But if the test evolves to the point where humans can no longer solve a CAPTCHA with a decent success rate, the system as a whole fails. The answer may not involve warping or distorting text -- it might require users to solve a mathematical equation or answer questions about a short story. And as these tests get more complicated, there's a risk of losing user interest. How many people will still want to post a reply to a message board if they must first solve a quadratic equation?
In 2014, Google (which acquired reCAPTCHA in 2009) started phasing out the classic service. In its place, it asked you to check a box with the words "I am not a robot." This was called No CAPTCHA. In 2017, Google announced it was getting rid of No CAPTCHA. Instead, the service would rely on techniques like noticing how you move an onscreen pointer or analyzing your browsing habits to determine whether you are human or robot. This is called Invisible reCAPTCHA. If you seem suspicious (perhaps you are in fact a robot), you'll see one of the old reCAPTCHA challenges to solve as further verification [source: Titcomb].
More Great Links
- Chellapilla, Kumar and Patrice Simard. "Using Machine Learning to Break Visual Human Interaction Proofs (HIPS)." Microsoft Research. (Aug. 6, 2008) http://research.microsoft.com/~kumarc/pubs/chellapilla_nips04.pdf
- Chew, Monica and J.D. Tygar. "Collaborative Filtering CAPTCHAs." In Human Interactive Proofs: Second International Workshop. 2005. (Aug. 4, 2008) http://www.cs.berkeley.edu/~tygar/papers/Collaborative_filtering_CAPTCHAs.pdf
- Jongsma, Carl. "Breaking Google's audio CAPTCHA." Computerworld. May 2, 2008. (Aug. 4, 2008) http://www.networkworld.com/news/2008/050208-breaking-googles-audio.html
- Leyden, John. "Spammers crack Gmail CAPTCHA." The Register. Feb. 25, 2008. (Aug. 5, 2008) http://www.theregister.co.uk/2008/02/25/gmail_captcha_crack/
- Mori, Greg and Jitendra Malik. "Breaking a Visual CAPTCHA." Simon Fraser University. (Aug. 4, 2008) http://www.cs.sfu.ca/~mori/research/gimpy/
- Oppy, Graham and David Dowe. "The Turing Test." Stanford Encyclopedia of Philosophy. April 9, 2003. (Aug. 5, 2008) http://plato.stanford.edu/entries/turing-test/
- Thompson, Clive. "For Certain Tasks, the Cortex Still Beats the CPU." Wired. June 25, 2007. (Aug. 5, 2008) http://www.wired.com/techbiz/it/magazine/15-07/ff_humancomp
- Vaughan-Nichols, Steven J. "How CAPTCHA got trashed." Computerworld. July 15, 2008. (Aug. 5, 2008) http://www.computerworld.com.au/index.php/id;489635775;fp;;fpid;
- Von Ahn, Luis. "Human Computation." Google TechTalks. July 26, 2006. (Aug. 6, 2008) http://video.google.com/videoplay?docid=-8246463980976635143
- Von Ahn, Luis, Manuel Blum and John Langford. "Telling Humans and Computers Apart Automatically." Communications of the ACM. Feb. 2004. Vol. 47, No. 2. (Aug. 4, 2008) http://www.cs.cmu.edu/~biglou/captcha_cacm.pdf
- Von Ahn, Luis, Manuel Blum and John Langford. "Using Hard AI Problems for Security." Computer Science Department -- Carnegie Mellon University. (Aug. 4, 2008) http://www.captcha.net/captcha_crypt.pdf
- W3C Working Group Note. "Inaccessibility of CAPTCHA." W3C. Nov. 23, 2005. (Aug. 4, 2008) http://www.w3.org/TR/turingtest/