A typical camera captures the world as a two-dimensional image. A single lens directs light onto a sensor, and a recording device captures the data. While we can guess how near or far an object is from its apparent size, a two-dimensional camera system doesn't record depth, so we can't reconstruct a three-dimensional image from it.
This limitation creates a problem with gesture-based interfaces. If you stand in front of a normal camera and wave your arms around, the camera can capture the horizontal and vertical movement. A computer with the proper software might be able to interpret those motions as commands. But what if you move your hands closer to the camera? A 2-D system can't reliably interpret motion toward or away from the lens. And 2-D systems can have a hard time distinguishing between a user and the background.
So how can you teach a camera to see in three dimensions? One way is to add a second camera -- this is called a stereo camera system. Both cameras capture images of the same physical space from different positions. The two streams of data travel into a single computer, which compares the images and estimates depth from the differences between them. The two cameras don't have to be next to one another -- you might position one to look at a room head on and the second to look down at the floor from the ceiling.
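To make the comparison concrete, here is a minimal sketch of the simplest case only: two identical cameras mounted side by side and pointing the same way. In that setup, a point appears shifted horizontally between the two images, and the size of that shift (the disparity) gives the distance. The function name and the rig numbers (a 700-pixel focal length and cameras 0.1 meters apart) are made up for illustration, and a real system would also need calibration and a way to match points between images.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Estimate distance (meters) to a point seen by two parallel cameras.

    focal_length_px: camera focal length, expressed in pixels
    baseline_m:      distance between the two cameras, in meters
    disparity_px:    horizontal shift of the point between the two images
    """
    if disparity_px <= 0:
        # Zero shift means the point is too far away to measure.
        raise ValueError("disparity must be positive")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical rig: 700-pixel focal length, cameras 0.1 m apart.
for disparity in (70, 35, 7):
    distance = depth_from_disparity(700, 0.1, disparity)
    print(f"shift of {disparity} px -> about {distance:.1f} m away")
```

The pattern to notice is that a large shift means a nearby object and a small shift means a distant one, which is why a hand moving toward the cameras becomes measurable.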
In a way, this mimics how humans perceive depth. We judge how far something is from us based on several visual cues. One of those comes from parallax, which refers to how each eye perceives the same scene from a slightly different angle. If you were to draw straight lines from your eyes to an object within your field of vision, you'd see the two lines converge. The closer the object, the wider the angle between the lines and the more the two views differ. Our brains combine the information from both eyes to create a single image with a sense of depth.
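The converging lines can be put into rough numbers. The sketch below computes the angle between the two lines of sight for an object straight ahead, assuming the eyes are about 6.3 centimeters apart (a commonly quoted adult average, used here only as an illustrative value). The function name is hypothetical, and this is simple geometry rather than a model of how the brain actually works.

```python
import math


def convergence_angle_deg(eye_separation_m, distance_m):
    """Angle (degrees) between the two lines of sight that meet at a
    point straight ahead, distance_m away from the midpoint of the eyes."""
    half_angle = math.atan((eye_separation_m / 2) / distance_m)
    return math.degrees(2 * half_angle)


EYE_SEPARATION_M = 0.063  # assumed value for illustration

for distance in (0.5, 2.0, 5.0):
    angle = convergence_angle_deg(EYE_SEPARATION_M, distance)
    print(f"object {distance} m away -> lines meet at {angle:.2f} degrees")
```

The angle shrinks quickly as the object moves away, which helps explain why this cue is most useful at close range.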