Video Based Sensing In Reactive Performance Spaces
Robb Lovell
Technical University of British Columbia
robblovell@nexus.techbc.ca
A reactive performance space is a theatrical environment that enables physical actions to affect and manipulate electronic media. These spaces allow performers to improvise with media through a variety of means.
Electronic media consists of any media that can be controlled from a computer. These are generally divided into four categories: visuals, light, sound, and mechanical systems. Physical actions within the space consist of anything that can be sensed and interpreted by a computer. This includes video based sensing, tracking systems, sound sampling, pitch detection, and analog sensors (heat, touch, bend, acceleration, etc.).
Video based sensing is one important component of reactive spaces: it provides the computer with the means to interpret what is happening within the space. This paper presents concepts around how a computer interprets reality through video based input.
Image understanding (often called image processing) is the field in computer science that tries to give the computer the ability to understand what is happening visually through the use of cameras. It is important to realize that, to date, no computer has really understood in a general way what goes on in a video scene. To a computer, the objects contained within a video scene, and the movements of those objects, are nothing more than blobs that move. This means that telling the computer to follow a hand moving in an image is not possible as a starting point.
So how does the computer follow a hand moving in an image? This is the problem that image understanding tries to solve. Perhaps it is more correct to say to the computer: “follow the largest moving thing in the image, assume that the camera view is constrained to looking over the shoulder of a person (say a conductor) of just the area that hand can reach, and that there is a constant background”. In a more sophisticated technique it might be: “follow the blobs in the image that match multiple previously recorded views of a hand, within the scene lit in a certain way, with a background that is subtracted from the current scene”.
From the previous examples it can be seen that these techniques can easily be fooled by removing the context in which they were created. For instance, if the camera views the conductor from the front, or if the camera is too far away, the technique might fail.
The hope of this paper is to give performers, designers, and artists an insight into the concepts and processes that go into making a computer understand visually based input. Many of these concepts can be applied to other types of sensors.
Cameras see space in distorted ways. They see space in much the same way as humans do, except that no distance information is available (except through difficult computation). Because of this, camera geometry is not corrected for distortions related to size and distance from the camera lens, and there is no peripheral vision near the camera. Objects in a scene that are close to the camera appear big, and objects that are far from it appear small.
What this means in a performance environment is that the actions of performers are distorted by their physical relationship to the camera. Actions that cut across the camera’s view appear different from actions that move toward or away from the camera. Actions performed close to the camera are different from those performed at a distance from it.
There is a major difference between a camera mounted overhead looking down and a camera mounted looking from the side. Overhead camera views foreshorten human bodies that are directly underneath the camera and lengthen those that are toward the edges of the camera’s view. An algorithm designed for an overhead camera will not have to deal much with proportion problems (assuming that all the objects it deals with are on the floor). A side-viewing camera, however, will have to deal with both distance and proportion problems.
These kinds of distortions are true of any kind of sensor. Infrared distance sensors are only sensitive to distance within a cone emanating from the front of the sensor. A bend sensor only sees data where it is bent, not where the object it is attached to is bent. It is important to realize that sensors, while sensitive to reality, do not represent reality exactly, but only as shadows of it.
A camera does not see objects in space, nor does it distinguish between bodies or boxes or tables. This might seem obvious, but underlying the statement is a fact that is not as obvious: cameras only see light, not shape. The shape of something is only obtained (if you’re lucky) by processing the output from a camera. Because cameras only see light, it is important that performers working with camera based systems know that how light falls on their bodies determines how they are seen by the computer.
Another fact that comes out of this realization is that changes in lighting are seen by the camera as changes of movement. If a light is turned off, for instance, then the camera will see something completely different in composition. Humans can easily take this into account because we are constantly able to recognize the content of what we see, but computers do not have access to this kind of information (at least not with current techniques).
The type of camera and type of lighting can make a difference in what the computer perceives, and how much the performer or designer has to pay attention to lighting anomalies.
So what kind of information can be extracted from a camera by a computer easily? The answer to this question is complex and is dependent upon the environment the camera is used within, and the techniques used to process the data streaming from the camera.
There are several processing techniques that can get at particular kinds of information. This information is general in scope and can be used to infer things based on the environmental setup. These general “operators” process light to extract some property of the scene. These operators include but are not limited to: motion, presence, background, and objects.
The motion operator does not extract speed, although it implies it. Motion is calculated by subtracting successive images from each other and counting the number of pixels that have changed. Motion is light changing, but under constant lighting conditions, motion is the change in surface area of objects in the scene. This precise definition is needed because motion does not unambiguously extract the speed of an object. To see why this is so, consider the motion of a hand just in front of a camera lens and the motion of a hand 10 meters from the camera. At the camera lens, a hand is big, and any movement by the hand causes many pixels in the image plane of the camera to change, thus the motion detected is large. At 10 meters, the hand is very small in the image plane, and as it moves, it causes only minimal changes in the image.
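As a minimal sketch of this idea (assuming, for illustration, that frames arrive as 8-bit grayscale NumPy arrays of equal size, and that a change of more than about 30 gray levels counts as a changed pixel), a motion operator can be written as a frame difference followed by a count of changed pixels:

    import numpy as np

    def motion(previous_frame, current_frame, change_threshold=30):
        """Count how many pixels changed between two grayscale frames.

        The count measures changing light (changing surface area under
        constant lighting), not the speed of any particular object.
        """
        # Subtract as signed integers so decreases in brightness are kept.
        difference = current_frame.astype(np.int16) - previous_frame.astype(np.int16)
        # A pixel "moved" if its brightness changed by more than the threshold.
        changed = np.abs(difference) > change_threshold
        return int(np.count_nonzero(changed))

The returned count rises and falls with the changing surface area described above: a hand near the lens changes many pixels, while the same hand at 10 meters changes very few.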
The presence operator detects the absence or presence of light. Under constant lighting conditions, it can imply the absence or presence of a body or any other object that reflects light. The presence operator implies a size for the objects that are seen that depends on how far from the camera the objects are placed. Changes in the size of the objects are seen as motion. It is important to realize that anything with a texture (something with a pattern in it, like a checked table cloth) will show up as many objects to the computer.
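A presence operator can be sketched in the same spirit. Under the assumption that lit objects are brighter than the background, counting the pixels above a brightness threshold gives a rough measure of how much of the view is occupied (the threshold of 128 is illustrative and would be tuned to the actual lighting):

    import numpy as np

    def presence(frame, light_threshold=128):
        """Return the fraction of the image occupied by lit pixels.

        Under constant lighting this implies the presence and apparent
        size of bodies or objects that reflect light toward the camera.
        """
        lit = frame > light_threshold
        return np.count_nonzero(lit) / frame.size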
The background operator is used to enhance the sensitivity of the presence and motion operators. It is simply an operator that tries to determine what is background and what is foreground. The simplest background technique is to grab a snapshot of a scene with nothing but the background contained in it. Later, this snapped scene can be subtracted from the current one to show the objects that are not the background. Other more sophisticated techniques involve slowly accumulating the background over time, or more complicated statistical techniques. By subtracting the background from an incoming scene taken from a camera, the objects that are in the foreground show up clearly in an image. However, if an object has the same color and intensity as the background, it will remain invisible to the computer.
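Both techniques mentioned here can be sketched briefly: the snapshot version subtracts a stored reference frame, while the accumulation version slowly mixes each new frame into a running estimate. The difference threshold and the mixing rate below are assumed, illustrative values:

    import numpy as np

    def foreground_mask(frame, background, threshold=25):
        """Pixels that differ enough from the background are foreground.

        An object with the same color and intensity as the background
        will not appear in the mask.
        """
        difference = np.abs(frame.astype(np.int16) - background.astype(np.int16))
        return difference > threshold

    def accumulate_background(background, frame, rate=0.02):
        """Slowly fold the current frame into a running background estimate."""
        updated = (1.0 - rate) * background.astype(np.float32) + rate * frame.astype(np.float32)
        return updated.astype(np.uint8)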
The object operator tries to find objects that are distinct as single entities within the physical space. The result of this operator is a list of things that look different to the computer in some way. There are a vast number of ways that things can look different to the computer. The most common is through the division of light things from darker ones, or through the quantification of different color spectrums. Once an object list is extracted by the computer, there are many types of information that, in theory, could be extracted from it: size, speed, acceleration, and even recognition of what the object is. In practice, these parameters are difficult to obtain reliably because of something called the “correspondence problem” and, in the case of recognition, ambiguity in comparing stored models with the current scene. (The correspondence problem is the problem of matching the previous scene’s objects with the next scene’s, or of matching objects across the views of two cameras.)
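A minimal object operator, assuming a foreground mask such as the hypothetical foreground_mask above and using connected-component labeling from SciPy to group touching pixels into blobs, might look like this:

    import numpy as np
    from scipy import ndimage

    def find_objects(mask, minimum_pixels=50):
        """Group touching foreground pixels into blobs and report each blob.

        Returns a list of (area, (row_center, column_center)) tuples.
        Small blobs, often caused by texture or noise, are discarded.
        """
        labels, count = ndimage.label(mask)      # label connected regions
        blobs = []
        for label in range(1, count + 1):
            rows, cols = np.nonzero(labels == label)
            if rows.size < minimum_pixels:
                continue                          # ignore texture and noise blobs
            blobs.append((rows.size, (rows.mean(), cols.mean())))
        return blobs

Tracking these blobs from frame to frame is where the correspondence problem appears; this sketch only reports what is visible in a single frame.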
A threshold is a quantity that the computer uses to divide things into categories. Simple thresholds might divide a group of numbers into two categories or bins. Thresholds are a key tool for extracting meaning from an image.
For instance, a threshold that divides light from dark tells the computer what in a scene is lit and what is not. To see this, consider the values of pixels in a gray level image. Each pixel in an image takes on a value between 0 and 255 that directly corresponds to an intensity of light in the scene. A value of 0 is dark, and a value of 255 is light. In an environment that uses the full range of the camera’s capabilities, a threshold value of 128 will divide the light objects from the dark objects.
When a motion operator is applied to an image by subtracting successive frames (producing what is called a difference image), a threshold must be applied to the result to determine whether something is moving or not. To see this, consider the following analysis. There are three possible scenarios that produce values in a difference image. The first is where an object, say at brightness 193, doesn’t move; the resulting subtraction is 0 from frame to frame. The other two occur when the object moves across a dark background (say intensity 41). These two cases are the leading and trailing edges of the object: 193 - 41 = 152 represents the trailing edge of the object, and 41 - 193 = -152 represents the leading edge. Here, two thresholds must be applied to determine how much motion is due to the object; something around 100 and -100 might do the trick. Any pixel value greater than 100 or less than -100 is a pixel corresponding to some movement.
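The leading- and trailing-edge analysis can be written directly as a pair of thresholds over a signed difference image. This sketch follows the subtraction order used in the text (previous frame minus current frame) and the illustrative thresholds of 100 and -100:

    import numpy as np

    def edge_motion(previous_frame, current_frame, high=100, low=-100):
        """Split a signed difference image into trailing and leading edges.

        With the difference taken as previous minus current, values above
        `high` are trailing-edge pixels of a bright object and values below
        `low` are leading-edge pixels; values in between count as no motion.
        """
        difference = previous_frame.astype(np.int16) - current_frame.astype(np.int16)
        trailing = difference > high
        leading = difference < low
        return int(np.count_nonzero(trailing) + np.count_nonzero(leading))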
One of the tricks of the trade in image understanding is knowing how to set up an environment in order to enable the computer to make assumptions about what it is seeing. Computers can extract information much more easily within highly structured environments because the computer can assume certain things are true about the environment, at least most of the time.
Perhaps the two most important parameters that can be structured are the lighting conditions and the content of the background. There are no hard and fast rules, or situations that are better than others. In general, each situation where the computer must extract information is different in some way.
The best way to understand the process of discovering a technique for extracting something from an environment is to look at an example.
Consider a situation where the desired information to be extracted from an environment is the position of someone within a room. Generally, if the room is empty, this can be accomplished in a non-precise way by having a camera with a wide angle lens mounted overhead, viewing the room from above. Three assumptions are made about the environment and the objects found in the camera’s image. First, it is assumed that people will show up as objects that are of a different color from the background. This is not always true, but in general, people tend not to wear the colors of the floor. Another assumption is that the objects that are seen are people, and that these people are on the floor and not flying through the room. This is only true if people aren’t jumping up and down or suspended from cables attached to the ceiling. Finally, it is assumed that the lighting will remain constant and that any changes that do occur are the result of people moving and not lighting changes. Because people are on the floor, the position in the room of a person’s feet directly relates to a point in the image; in this way the person’s location can be reliably extracted as long as no assumptions are violated.
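Given a blob centroid found by an object operator (such as the hypothetical find_objects sketch above), the position extraction itself is a simple mapping from image coordinates to floor coordinates. This sketch assumes the overhead camera’s view covers the whole floor of a room of known size:

    def floor_position(blob_center, image_shape, room_size):
        """Map an image-plane blob centroid to approximate floor coordinates.

        blob_center: (row, column) of the person's blob in the overhead image.
        image_shape: (rows, columns) of the camera image.
        room_size:   (depth, width) of the room in meters.

        Valid only while the assumptions hold: the person is on the floor,
        reads as a different color than the floor, and lighting is constant.
        """
        row, column = blob_center
        rows, columns = image_shape
        depth, width = room_size
        return (row / rows * depth, column / columns * width)

For example, with a 640 by 480 image of a 6 by 8 meter room, floor_position((240, 320), (480, 640), (6.0, 8.0)) places the person at (3.0, 4.0), the center of the floor.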
But what happens if the lighting condition is violated, such as when the room contains moving projectable surfaces, as with the art piece called Trajets (http://ccii.banff.org/trajets/)? In this case the technique described above will not work: too much interference from the screens prevents people from being identified as objects distinct from the screens. This problem is solved in Trajets by changing the environment and the sensing configuration. In Trajets, moving screens hang about 1 foot off the ground, allowing cameras to look underneath the screens without interference. The addition of rope lights around the outside edge of the piece allows the cameras to “see” in an otherwise dark environment. The cameras see people’s feet back lit by the rope lights underneath the moving screens. Once someone’s feet are seen in both cameras, it is easy to calculate their position in the room. However, this technique breaks down as more people enter the room because of difficulties matching objects from two camera views.
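The Trajets implementation itself is not detailed here; as a rough, hedged sketch of how two such sightlines could in principle be turned into a position, each camera’s view of a backlit foot can be reduced to a bearing angle on the floor plane, and the two bearings intersected (all names and parameters below are illustrative, not the actual Trajets code):

    import math

    def intersect_bearings(camera1, angle1, camera2, angle2):
        """Intersect two bearing rays on the floor plane.

        camera1, camera2: (x, y) floor positions of the two cameras.
        angle1, angle2:   bearing of the sighted foot from each camera, in radians.
        Returns the (x, y) intersection, or None if the bearings are parallel.
        """
        x1, y1 = camera1
        x2, y2 = camera2
        d1 = (math.cos(angle1), math.sin(angle1))
        d2 = (math.cos(angle2), math.sin(angle2))
        denominator = d1[0] * d2[1] - d1[1] * d2[0]
        if abs(denominator) < 1e-9:
            return None                  # parallel sightlines, no position fix
        t = ((x2 - x1) * d2[1] - (y2 - y1) * d2[0]) / denominator
        return (x1 + t * d1[0], y1 + t * d1[1])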
In summary, the process of creating a situation where the computer can understand what is happening in an environment is a creative effort that involves several kinds of activities. In general these activities are as follows. Write down what is known about the environment and what can be assumed. Based on these assumptions, decide how the computer is going to distinguish interesting objects from uninteresting objects; this will constrain camera views and positions, and background content. Impose any environmental changes that you can in order to enhance the computer’s ability to distinguish interesting objects; this involves changing lighting, background, or costumes. Then, decide on the processing technique that will work best in the environment.
Once information is extracted from a scene, this information is used by the computer to make decisions. Often this involves transforming the data from one range to another, processes that decide when actions or activities occur, or the application of a series of ongoing tests that fire rules.
The transformation process is where the computer takes extracted environmental information and converts this into intentions for action. It is the part of the process where everything is represented virtually in the computer. Environmental information has been abstracted to a set of numbers that are representations of the real state of the environment. Actions that the computer will take as a result are also created and manipulated as numbers and algorithms that are implemented by controller and rendering processes down the line.
Because numbers are abstractions of real things, there are difficulties in matching up one abstraction with another. For instance, if a relationship between pixel values and a sequence of video is desired, there is a difference between their abstracted representations that is handled in the transformation step. If pixels in images are values between 0 and 255, and DVD frame numbers are values between 34,000 and 34,600, some correlation needs to be established. This relationship could be as simple as: “if the pixel values go above 128 then play frames 34,000 to 34,600”. This rule establishes a relationship between pixel intensity and frame numbers (most likely within a particular time frame established by another rule).
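Written as code, the transformation step is just such small mapping functions. This sketch uses the illustrative numbers from the text; the function names are made up for the example:

    def scale(value, in_low, in_high, out_low, out_high):
        """Linearly transform a value from one range to another."""
        ratio = (value - in_low) / (in_high - in_low)
        return out_low + ratio * (out_high - out_low)

    def frames_to_play(pixel_value, threshold=128, first_frame=34000, last_frame=34600):
        """The rule from the text: above the threshold, play the whole clip."""
        if pixel_value > threshold:
            return (first_frame, last_frame)
        return None

A call like scale(200, 0, 255, 34000, 34600) maps a single pixel intensity onto a single frame number instead of triggering the whole clip, which is the kind of range-to-range transformation mentioned above.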
The goal of a vision understanding system is to provide the computer with a means to interpret actions that are occurring in the real world. This is a difficult task because the computer is unable to recognize with any detail what is happening in a video image. The person creating the means for a computer to understand part of an environment must make assumptions about the structure and content of the environment in order to create algorithms to extract information for the computer to use.