How 3-D Gestures Work

The ZCam camera from 3DV Systems was a motion-sensitive predecessor to today's 3-D gesture system technology.
AP Photo/Paul Sakuma

How do you redefine a user interface? What steps do you need to take to change the way people interact with technology? It's not just a matter of developing the right tools. You also have to take into account the way people want to use gadgets. The most technologically advanced interface means nothing if it just doesn't feel right when you're taking it out for a spin.

But we're entering an era in which we need to revisit user interfaces. Computers pop up in more gadgets and applications each year. Within a decade, even the most basic appliance might house a type of computer. And with a growing emphasis on 3-D video, taking advantage of that third dimension calls for an innovative approach to input as well.


A 3-D gesture system is one way to tackle this challenge. At its most basic level, a 3-D gesture system interprets motions within a physical space as commands. Applications for such technology fall across the spectrum of computing from video games to data management. But creating a workable 3-D gesture system presents a host of challenges.

Several engineers have tried to create systems that can interpret our movements as computer commands. But what kinds of applications will these systems make possible? And what kinds of components are necessary to put together a 3-D gesture system?


The Dimensions of a 3-D Gesture System

The Xbox Kinect uses infrared light to project a grid in front of the camera view -- sensors measure the grid as it deforms and register the data as movement.
Michal Czerwonka/Getty Images

You can divide the parts of a 3-D gesture system into two main categories: hardware and software. Together, these elements interpret your movements and translate them into commands. You might be able to blast zombies in a video game, navigate menus while looking for the next blockbuster to watch on movie night or even get to work on the next great American novel just by moving around.

On the hardware side, you'll want a camera system, a computer and a display. The camera system may have additional elements built in to sense depth -- it's common to use an infrared projector and an infrared sensor. The computer takes the data gathered by the camera and sensors, crunches the numbers and pushes the image to the display so that you can see the results. The display presents the data in a way that lets you judge how far you need to move to manipulate what's going on.


On the software side, you'll need applications that actually convert the information gathered by the hardware into meaningful results. Not every movement will become a command -- sometimes you might make an accidental motion that the computer mistakes for an instruction. To prevent unintended commands, 3-D gesture software has error-correction algorithms.

Why worry about error correction? A gesture may need to meet a threshold of confidence before the software will register it as a command. Otherwise, using the system could be an exercise in frustration. Imagine that you're working on an important three-dimensional drawing by moving your hands to change its size and shape. Suddenly, you sneeze and the delicate work you've done so far is ruined as your involuntary actions cause the drawing to distort dramatically.

Error-correction algorithms require your actions to match pre-assigned gestures within a certain level of confidence before the action is carried out. If the software detects that your movements don't meet the required level of confidence, it can ignore those motions rather than translate them into commands. This also means you may have to perform a gesture in a very specific way before the system will recognize it.

Some commands may not be as sensitive as others. These would have a much lower threshold of confidence. For example, flipping between images by moving your hand to the left or right isn't really a mission-critical command. With a lower confidence requirement, the system will accept commands more readily.
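The threshold idea above can be sketched in a few lines of code. This is a hypothetical illustration, not any real gesture SDK -- the gesture names, confidence scores and threshold values are all invented for the example.

```python
# Hypothetical sketch: per-gesture confidence thresholds.
# A destructive command demands high confidence; casual navigation does not.
THRESHOLDS = {
    "resize_drawing": 0.95,  # mission-critical edit: be very sure
    "swipe_left": 0.60,      # harmless navigation: accept readily
    "swipe_right": 0.60,
}

def interpret(gesture, confidence):
    """Return the command name, or None if the motion is ignored."""
    threshold = THRESHOLDS.get(gesture)
    if threshold is None or confidence < threshold:
        return None  # treat as accidental motion (e.g. a sneeze)
    return gesture

print(interpret("swipe_left", 0.7))      # accepted at 0.7 confidence
print(interpret("resize_drawing", 0.7))  # rejected: 0.7 < 0.95
```

The same recognizer output (a 0.7-confidence match) flips a photo but never resizes your drawing, which is exactly the sneeze-protection described above.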


Detection and Projection

Recognizing gestures is just part of the software's job. It also has to interface with applications so that the gestures you make translate into meaningful actions on the screen. With some applications, this is pretty straightforward. Flipping through a photo album may only rely on a few gestures to navigate pictures and zoom in or out of views. Each of those gestures may be fairly simple.

But other programs might require a greater variety of complex gestures. Let's say you've just come home with the newest version of "Extreme Table Tennis Pro Elite" and you're ready to test your skills against the toughest computer opponents to ever pick up a paddle. You pop your game into a console system that has a 3-D gesture component and pick up a real paddle of your own. What happens next?


The system analyzes the scene in front of it. It detects the presence of the paddle in your hand. As the game begins, you watch the screen and wait for your opponent to volley for serve. As the digital ball screams toward you, the 3-D gesture system determines where the ball would really go within the context of your physical space if it were an actual solid object.

You make your move, preparing a wicked return with crazy backspin. Now the 3-D system has to analyze your reaction, plot it against the flight path of the ball and determine if you made contact or if you completely whiffed it. Assuming your amazing table tennis skills haven't failed you, you hit the ball successfully. Now the system has to determine where the digital ball would go based upon your real physical movements. The software projects a flight path and the ball follows it.
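Projecting that flight path boils down to stepping the ball's position forward under simple physics. The sketch below uses basic ballistic motion with invented starting values; a real game engine would also model spin, air drag and the bounce off the table.

```python
# Hypothetical sketch: projecting the digital ball's flight path after
# the system decides you made contact with the paddle.

GRAVITY = -9.8  # m/s^2, pulling on the vertical (y) axis

def project_flight(position, velocity, dt=0.01, steps=50):
    """Step the ball forward from the contact point; return its path."""
    x, y, z = position
    vx, vy, vz = velocity
    path = []
    for _ in range(steps):
        x += vx * dt
        y += vy * dt
        z += vz * dt
        vy += GRAVITY * dt  # gravity curves the path downward
        path.append((x, y, z))
    return path

# Ball leaves the paddle 1 m up, moving away at 5 m/s with a little lift.
path = project_flight(position=(0.0, 1.0, 0.0), velocity=(0.0, 1.0, 5.0))
print(path[-1])  # where the ball is after half a second
```

Each frame, the game draws the ball at the next point on the projected path, which is why a well-timed swing appears to send the ball exactly where your physical motion aimed it.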

Some games may not involve a physical prop. Your progress through the game will depend entirely upon the movements you make with your body. The system's job is to make sure the actions you take impact the progression of the game appropriately. And all of these actions have to be accounted for within the game itself. It's a big job! That's why some applications require you to move in a specific way to calibrate the system before you get started.


Going Deep

A typical camera captures the world as a two-dimensional image. The single lens directs light to a sensor and a recording device captures the data. While we can infer how far away or close an object is to the camera based on its size, we can't really make out a three-dimensional image from a two-dimensional camera system.

This limitation creates a problem with gesture-based interfaces. If you stand in front of a normal camera and wave your arms around, the camera can capture the horizontal and vertical movement. A computer with the proper software might be able to interpret those motions as commands. But what if you move your hands closer to the camera? A 2-D system can't interpret these motions. And 2-D systems can have a hard time distinguishing between a user and the background.


So how can you teach a camera to see in three dimensions? One way is to add a second camera -- this is called a stereo camera system. Each camera captures images within the same physical space. The streams of data from the two cameras travel into a single computer, which compares the images and draws conclusions about depth based on the information. The two cameras don't have to be next to one another -- you might position one to look at a room head on and the second camera could be positioned looking down at the floor from the ceiling.

In a way, this mimics how humans perceive depth. We tend to judge how far something is from us based on several visual cues. One of those comes from parallax. This refers to how both eyes perceive the same scene from slightly different angles. If you were to draw straight lines from your eyes to an object within your frame of vision, you'd see the two lines converge. Our brains combine the information from our eyes to create an image within our minds.
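For a rectified stereo pair, the parallax described above reduces to one formula: depth Z = f x B / d, where f is the focal length in pixels, B is the baseline between the cameras and d is the disparity (how many pixels apart the same point appears in the two images). The numbers below are illustrative, not from any particular camera.

```python
# Sketch of stereo depth recovery from parallax. Focal length and
# baseline values here are hypothetical.

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in meters: Z = f * B / d for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("zero disparity means the point is at infinity")
    return focal_px * baseline_m / disparity_px

# A hand seen 40 pixels apart by two cameras with a 700 px focal
# length and a 7 cm baseline:
z = depth_from_disparity(focal_px=700, baseline_m=0.07, disparity_px=40)
print(round(z, 3))  # 1.225 meters
```

Notice the inverse relationship: the closer the hand, the larger the disparity -- which is also why stereo depth gets less precise for faraway objects, where the disparity shrinks toward zero.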


A Little Light Gesturing

What travels at 299,792,458 meters per second in a vacuum? No, it's not a dust bunny. It's light. It might seem like trivia to you, but the speed of light comes in handy when you're building a 3-D gesture system, particularly if it's a time-of-flight arrangement.

This type of 3-D gesture system pairs a depth sensor and a projector with the camera. The projector emits light in pulses -- typically it's infrared light, which is outside the spectrum of visible light for humans. The sensor detects the infrared light reflected off everything in front of the projector. A timer measures how long it takes for the light to leave the projector, reflect off objects and return to the sensor. As objects move, the amount of time it takes the light to travel will vary and the computer interprets the data as movements and commands.
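The arithmetic behind time-of-flight is straightforward: the light travels to the object and back, so the distance is the speed of light times the round-trip time, divided by two. A minimal sketch:

```python
# Sketch of the time-of-flight principle. The division by 2 accounts
# for the round trip: projector -> object -> sensor.

SPEED_OF_LIGHT = 299_792_458.0  # meters per second, in a vacuum

def distance_from_round_trip(round_trip_seconds):
    """Distance to the reflecting object, in meters."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0

# A pulse that returns after 10 nanoseconds bounced off something
# about a meter and a half away.
d = distance_from_round_trip(10e-9)
print(round(d, 3))
```

The timescales involved are tiny -- light crosses a living room in a few nanoseconds -- which is why time-of-flight sensors need very precise timing hardware rather than an ordinary clock.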


Imagine you're playing a tennis video game using a 3-D gesture system. You stand at the ready, waiting to receive a serve from your highly seeded computer opponent. The 3-D gesture system takes note of where you are in relation to your surroundings -- the infrared light hits you and reflects back to the sensor, giving the computer all the data it needs to know your position.

Your opponent serves the ball and you spring into motion, swinging your arm forward to intercept the ball. During this time, the projector continues to fire out pulses of infrared light millions of times per second. As your hand moves away from and then toward the camera, the amount of time it takes for the infrared light to reach the sensor changes. These changes are interpreted by the computer's software as movement and further interpreted as video game commands. Your video game representation returns the serve, wins a point and the virtual crowd goes wild.

Another way to map out a three-dimensional body is to use a method called structured light. With this approach, a projector emits light -- again outside the spectrum of visible light -- in a grid pattern. As the grid encounters physical objects, it distorts. A sensor detects this distortion and sends the data to a computer, which measures the distortion. As you move about, your movements will cause the grid to distort in different ways. These differences create the data that the computer needs to interpret your movements as commands.
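Conceptually, structured light works by comparing where each projected grid dot lands against a reference recorded with nothing in the scene. The sketch below is purely illustrative: the dot coordinates and the pixels-per-meter conversion factor are invented, and a real system would calibrate that relationship per camera.

```python
# Hypothetical sketch of structured light: each grid dot's shift from
# its flat-scene reference position reveals a depth change at that spot.

# Pixel column where each (row, col) grid dot lands with an empty scene...
REFERENCE = {(0, 0): 100.0, (0, 1): 120.0, (1, 0): 100.5, (1, 1): 120.5}
# ...and where the sensor actually sees them with you standing there.
OBSERVED = {(0, 0): 100.0, (0, 1): 120.0, (1, 0): 104.5, (1, 1): 125.0}

PX_PER_METER = 20.0  # hypothetical calibration: pixel shift per meter of depth

def depth_map(reference, observed):
    """Convert each dot's pixel shift into a depth offset in meters."""
    return {dot: (observed[dot] - reference[dot]) / PX_PER_METER
            for dot in reference}

for dot, depth in sorted(depth_map(REFERENCE, OBSERVED).items()):
    print(dot, round(depth, 3))
```

Here the top row of dots is undistorted (nothing in front of it), while the bottom row has shifted -- the computer reads that as an object occupying the lower half of the scene.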

A 3-D gesture system doesn't have to rely on a single technological approach. Some systems could use a combination of multiple technologies to figure out where you are and what you're doing.


Beyond the Lens

The Kinect is probably the most recognizable 3-D gesture system on the consumer market right now, but many more products will be joining it soon.
Kiyoshi Ota/Getty Images

Is 3-D gesture control the interface of the future? That will depend upon the ingenuity of the engineers, the efficiency of the various systems and the behavior of users. Designing a workable user interface is no small task -- there are hundreds of failed products that at one time or another were going to revolutionize the way we interact with machines. For 3-D gesture systems to avoid the same fate, they'll have to be useful and reliable. That depends not just on technology but also on user psychology.

If a particular gesture doesn't make sense to a user, he or she may not be willing to use the system as a whole. You probably wouldn't want to have to perform the "Hokey Pokey" just to change the channel -- but if you do, it's OK, we don't judge you. Creating a good system means not only perfecting the technology but also predicting how people will want to use it. That's not always easy.


There are a few 3-D gesture systems on the market already. Microsoft's Kinect is probably the system most familiar to the average consumer. It lets you control your Xbox 360 with gestures and voice commands. In 2012, Microsoft announced plans to incorporate Kinect-like functionality into Windows 8 machines. And the hacking community has really embraced the Kinect, manipulating it for projects ranging from 3-D scanning technology to robotics.

At CES 2012, several companies showcased devices that included 3-D gesture recognition. One company, SoftKinetic, demonstrated a time-of-flight system that remained accurate even when objects were just a few inches away from the camera. If companies want to include gesture-recognition functions in a computer or tablet, they'll need to rely on systems that can handle gestures made close to the lens.

In the future, we may see tablets with a form of this gesture-recognition software. Imagine propping a tablet up on your desk and placing your hands in front of it. The tablet's camera and sensors detect the location of your hands and map out a virtual keyboard. Then you can just type away on your desktop as if you had an actual keyboard under your fingertips, and the system tracks every finger movement.

The real test for 3-D gesture systems comes with 3-D displays. Adding depth to our displays gives us the opportunity to explore new ways to manipulate data. For example, imagine a 3-D display showing data arranged in the form of stacked boxes extending in three dimensions. With a 3-D gesture display, you could select a specific box even if it weren't at the top of a stack just by reaching toward the camera. These gesture and display systems could create a virtual world that is as immersive as it is flexible.

Will these systems take the place of the tried-and-true interfaces we've grown used to? If they do, it'll probably take a few years. But with the right engineering and research, they could help change the stereotypical image of the stationary computer nerd into an active data wizard.


Author's Note

I got the idea for this article after my visit to CES 2012. It seems like there's a new emerging trend at the show every year. In 2012, that trend was the reinvention of the user interface. It seemed like every company was trying to add in gesture and voice control systems into products. But don't get too excited -- it might take a year or two for those innovations to make their way into common consumer electronics.



Frequently Answered Questions

How do gesture sensors work?
Gesture sensors detect the position and movement of objects -- usually a person's hands or body -- in three-dimensional space. A camera paired with a depth-sensing technique such as time-of-flight, structured light or stereo imaging captures the scene, and software translates the movements it sees into commands.
What is gesture control used for?
Gesture control lets you operate electronic devices -- such as game consoles, computers and tablets -- by moving your body instead of pressing buttons or touching a screen.