Toward Computer Cognition of Gesture
Robb E. Lovell
Institute for Studies in the Arts
Arizona State University
Intelligent Stage Research Lab
Matthews Center, Rm. 222
Tempe, AZ 85287-3302
(602) 965-1060
http://isrl.fa.asu.edu
robb.lovell@asu.edu
Abstract
Rudolf Laban's movement analysis [Laban], and the work of other researchers [Bartenieff et al.], [Dell], are useful starting points for defining different qualities and modes of movement for gesture recognition. They suggest ways in which the computer can recognize different aspects of movement, and they define a means to approach recognition of gesture systematically. Movement theory is necessary in order to focus the computer's analytical activity within the framework of real-time recognition, and by approaching the recognition of movement in this way the computer can gain a wider range of understanding. The purpose of this research is to find a means of using gesture to control activities by the computer, in contrast to matching points to a 3-D model in order to move or animate figures for film. Along with extracting quantitative information, an attempt is made to have the computer understand the qualities of movement being performed. Also explored are various ways gesture can be defined in linguistic terms. The desire is to create a new type of interaction with the computer that augments current interface technologies.
Background
The Institute for Studies in the Arts is performing research into perceptive performance spaces. The Intelligent Stage is a theatrical space within the Institute that is being created to respond to the actions of artists as they move, allowing them to control electronic elements such as sound, lighting, graphics, video, and robotics. This is accomplished through the processing of digitized video, speech, photoelectric switches, contact switches, and other types of sensors. The system's primary sensing occurs through a video-based image-understanding program called Eyes [Lovell et al.]. Responses are generated through several controller computers that manipulate electronic media through custom control programs.
Introduction
If a computer is given the ability to sense its environment and the dexterity to respond in some way to the sensed information, it can make decisions that influence, even change, the direction of performances. Most computers have this ability in limited ways: a keyboard represents the senses, and the response displayed on a monitor is its dexterity.
Many performances and installations have followed this paradigm. A user (designer, viewer, artist, or performer) enters information about what the computer should do, either before or during a performance or installation, and the computer uses that information to control or respond with media. The computer can exhibit many intelligent behaviors in this environment, depending upon how it is programmed. Most of the time, responses are based upon randomized manipulations or heuristic road maps for the computer to follow.
Discussion
In an artistic setting, a keyboard and monitor are not necessarily the desired forms of interactivity. The computer should be able to interact in more complicated physical environments, seeing the actions of performers and hearing what is spoken in order to respond more fluidly. Seeing and hearing imply understanding of the environment, at least in an instinctual way, so that intelligent, coherent responses can occur. While the computer can't approach this kind of perception, it can understand limited things about the outside world through image processing techniques, knowledge representations, and assumptions about the real world.
One way for the computer to understand what is happening is for it to be shown or told something to look for in a future time frame. A movement or sound is "recorded" by extracting observed phenomena through the sensors, and statistics are computed that describe the action occurring. Using the same set of observed phenomena, the computer compares current activity to the given model and looks for a match.
Several abilities are required of a computer for it to be able to recognize gestures. The computer must be able to extract and track multiple points on the body in real time, distinguishing each point from the others and mapping the motions of those points over time. The computer must also have techniques for effectively matching models of specific gestures to currently occurring gestures and determining when a match occurs.
Just how the computer can accomplish all of these tasks is an open question. However, a limited form of gesture recognition can be achieved with less than perfect computer abilities. By structuring an environment that makes it easier for the computer to understand what is happening, the techniques for extracting and matching points on the body become more computable.
The programming must follow a set of facts or assumptions about what is being seen because the computer has no built-in cognitive understanding of the environment. In effect, the program that tracks movement will be built by constructing a sketch of the scene through assumptions that can be made from information given in a video image. The environment in which the program operates will be made to conform to the assumptions used by the program.
The environment will, as a given, be ambiguous in nature. The information contained within a video image is incomplete and limited, and since this is the computer's view of the world, its representation of the current state of its surroundings will be unreliable. Even a rich abstract representation of the world will do little to solve this problem because of the limited ability of the computer to recognize patterns and apply knowledge fast enough at a given point in time. In addition, the information the computer can extract from an image will be a small subset of the data potentially available, because of processing limitations. Add to this the complication of having to match two views of the same scene and the difficulty of repeating movements precisely, and the ambiguity of representing a gesture in a scene increases.
Gesture Recognition
Gesture recognition is defined in this paper as the matching of a previously performed or abstractly described movement of an object or objects with a movement currently being performed. Objects can be three-dimensionally mapped body sites, inanimate objects, or two-dimensionally mapped image blobs or sprites.
The movement of an object is described through perceived features extracted from the object by the computer's sensors as it moves. These features are anything that can be extracted by the computer as information about the object, for instance, velocity, position, size, or luminance.
The way a feature behaves over time is called its pathway. A pathway consists of a set of measurements that describe the way a feature changes as the object moves. A pathway can be represented as simply as a single constant or as complexly as a modeled behavior, yet it is always based upon a set of values stored at discrete moments in time.
A gesture then becomes definable to the computer as a set of particular pathways. Each individual instantiated pathway of this set is a gesture phoneme, and can be defined in a period of time, independent of time, or over different time frames.
The recording or defining of the phonemes of a gesture must be accomplished in a "fuzzy" way. It can be very precisely specified, but repeated performances are unlikely to be matched given the imprecision of the body, lighting, environment, feature extraction, and recording devices. Each recorded statistic is bracketed with a window of plus or minus certain threshold values in order to facilitate matching. A threshold represents values that define the boundaries between what is similar enough to be regarded as identical and what is not.
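As a concrete illustration, the following minimal sketch (in C++, the implementation language of the Eyes system, though the types and function names here are hypothetical rather than actual Eyes code) shows how a recorded pathway might be stored and matched within plus-or-minus thresholds:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // One feature sample captured at a discrete moment in time.
    struct Sample {
        double time;   // seconds since recording began
        double value;  // e.g., x position, area, or luminance
    };

    // A pathway: how one feature behaves over time, plus its threshold.
    struct Pathway {
        std::vector<Sample> samples;
        double threshold;  // +/- window used for "fuzzy" matching
    };

    // True if a currently observed value sequence stays within the
    // recorded pathway's threshold window, sample by sample.
    bool matches(const Pathway& recorded, const std::vector<double>& observed) {
        if (observed.size() != recorded.samples.size()) return false;
        for (std::size_t i = 0; i < observed.size(); ++i)
            if (std::fabs(observed[i] - recorded.samples[i].value) >
                recorded.threshold)
                return false;
        return true;
    }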
In addition, a gesture often consists of undefined elements. A particular gesture might not specify a time period or might leave out a set of extracted features. For example, a gesture can consist of an object moving a distance of 3 meters in the x direction in an unspecified time and y-z space.
Relative vs. Absolute
An important categorization needs to be made distinguishing between relative and absolute measurements. For instance, a spatial path of a feature can be defined in an absolute space as its x, y, z location, and in a relative space as its change in x, change in y, and change in z (distance traveled). This distinction is important because an absolute description is confined to a particular location in an environment. In contrast, a relative description can occur at any location in an environment.
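A brief sketch of the distinction, again with hypothetical names: differencing an absolute pathway of x, y, z locations yields a relative pathway of per-frame changes, which can then match a gesture performed anywhere in the space.

    #include <cstddef>
    #include <vector>

    struct Point3 { double x, y, z; };

    // Differencing an absolute pathway (x, y, z locations) produces a
    // relative pathway (change in x, y, z per frame); the relative form
    // is not tied to any particular location in the environment.
    std::vector<Point3> toRelative(const std::vector<Point3>& absolute) {
        std::vector<Point3> deltas;
        for (std::size_t i = 1; i < absolute.size(); ++i)
            deltas.push_back({absolute[i].x - absolute[i - 1].x,
                              absolute[i].y - absolute[i - 1].y,
                              absolute[i].z - absolute[i - 1].z});
        return deltas;
    }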
Gesture Space
A gesture space is defined as the set of possible gestures for a given pathway or feature set. For instance, a "positional" gesture space could consist of the set of features: x location, y location, and z location.
The goal of this research is to discover gesture spaces within structured environments that are useful for communicating movement of the human body to the computer. To accomplish this, two structured environments will be defined in which the computer will make assumptions that will help it to extract the motion of particular body sites over time.
Based on this structured environment, a system of interpretation of movement will be set up that specifies categories of gestures. These categorizations will then define the dividing lines between one gesture and another, not in a rigid, unchanging manner, but as a set of building blocks that can be used to define which set of motions means one gesture rather than another.
In the search for useful gesture spaces it is best to start by exploring a very simple gesture space that consists of a single object and its pathways defined over time. This model may then be expanded to include multiple objects, where a gesture may include other types of pathways between the objects.
Environment 1
The simplest environment consists of one high-contrast object. The computer assumes that the first bright blob it sees in the image is the object in the world. If the world conforms to this definition, then there is no ambiguity, and the computer will be able to track the object in three dimensions. This is because two camera views of the object will be easily matched both spatially and over time (a view is the content of a video image seen from one camera).
Features that can be extracted from this environment are the object's location, area, extents, shape, and luminosity. What can't be extracted is the object's orientation (unless the object is color coded in some way to indicate an orientation). Values that can be calculated from the extracted features are the rates of change and acceleration of each extracted feature (displacement and velocity for the location feature).
In this setting where only one object exists, there is only a global, fixed reference point. This means that the notion of body-centered motion does not exist because it implies a relative spatial measurement to a second object in the space (the center of the body). Measurements of the object are instead based on a fixed reference point in the environment.
Laban Theories
Given this environment, several models of gesture can be built based on Laban's movement scales (dimensional, diagonal, and others), the Laban crystalline forms, and Labanotation [Denny et al.]. The notation system and scales were created to describe inherently body-centered movement, where each motion is toward, away, around, or in some way related to the center of the body. Yet they provide useful starting points for analyzing movement in an absolute space because they can also indicate the relative motion of an object in relation to an arbitrary fixed point over time. These theories of movement also define a low level of resolution for separating one pathway from another.
Dimensional Gesture Space Example
Starting with the dimensional scale, motion along a coordinate axis is defined in relation to the body. In the environment described above, the analog is movement along one of the global axes. Motion is either positive (moving away from the body) or negative (moving toward the body). A gesture is a combination of movements along each axis and toward or away from a starting point.
The dimensional scale clarifies the idea of a gesture space where only motions within a predefined feature space are recognized. The computer limits itself to classifying all motion into one of six distinct orientations: +x, -x, +y, -y, +z, and -z. In this system, distance and velocity are not important; only the orientation of the movement is calculated. An example of a gesture might be the phonemes: the hand moves left, right, backward, and then right again (+x, -x, +y, and -x), representing a series of specific direction changes. The six directions are thought of as movement phonemes, or distinguishable parts of an overall gesture. A gesture space allows the computer to restrict its attention to the relevant information it needs to accomplish a particular goal, which in this case is to observe the general direction in which movement is traveling.
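One plausible way to compute these phonemes, sketched below with hypothetical names, is to classify each per-frame displacement by its dominant axis, ignoring displacements below a noise floor (the dominant-axis rule is an assumption, not part of the dimensional scale itself):

    #include <cmath>

    enum class Phoneme { PlusX, MinusX, PlusY, MinusY, PlusZ, MinusZ, None };

    // Classify one per-frame displacement into a dimensional phoneme by
    // its dominant axis. Distance and speed are discarded; only the
    // orientation of the motion survives. Displacements smaller than
    // the noise floor produce no phoneme.
    Phoneme classify(double dx, double dy, double dz, double noise) {
        double ax = std::fabs(dx), ay = std::fabs(dy), az = std::fabs(dz);
        if (ax < noise && ay < noise && az < noise) return Phoneme::None;
        if (ax >= ay && ax >= az) return dx > 0 ? Phoneme::PlusX : Phoneme::MinusX;
        if (ay >= az)             return dy > 0 ? Phoneme::PlusY : Phoneme::MinusY;
        return                           dz > 0 ? Phoneme::PlusZ : Phoneme::MinusZ;
    }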
Linguistics
By defining a gesture space as described above, movement of the human body is perceived not as a detailed construction of the actual movement, but as a classification into directions traveled. In more abstract terms, the gesture is recognized by classification into a set of features and pathways. The information in the dimensional gesture space is useful to the computer as practical knowledge about directions: up, down, left, right, forward, and backward. As a result, we can instruct the computer to look for a motion that is "moving stage left" rather than formulating a more obscure instruction to look for motion in "x".
By classifying movement into linguistic phrases, a language is built which is used to describe gestures and states of motion to the computer. Each gesture space defines an action or a procedure the computer can use to recognize what is happening in response to a linguistic request.
This approach is different from one in which a gesture is recorded and the computer then looks for the same gesture repeated, by matching current activity with the pre-recorded one. An advantage of this approach is the built-in "fuzziness" of describing something in words. It also provides the human participant with a more natural interface for indicating direction, place, and gross movement. Finally, it enhances the "recording and comparing" method by providing a means to communicate added intent.
Instantaneous Measurements
More parameters can be added to the dimensional gesture space that increase the complexity of what can be described to the computer. The absolute position of the object can be measured, leading to the calculated instantaneous parameters of position, displacement, velocity, and acceleration. Any measurements requiring more than the time between video frames, and twisting motions related to the orientation of the object, are ignored at this point.
Measurement of instantaneous values of displacement implies two samples over a small discrete time period. For the Eyes sensing system this time unit is the time between processed frames of video. Calculation of velocity and acceleration from displacement is at best an approximation because the time unit between frames is fairly large, and because the camera's low resolution introduces perceived variations in measurements of the displacement.
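A minimal sketch of these finite-difference estimates, assuming a fixed frame period dt and three consecutive position samples along one axis (the names are illustrative):

    // Displacement, velocity, and acceleration estimated by finite
    // differences from three consecutive position samples; dt is the
    // time between processed video frames. With a large dt and a
    // low-resolution camera these are approximations at best.
    struct Kinematics { double displacement, velocity, acceleration; };

    Kinematics estimate(double prev2, double prev1, double current, double dt) {
        double v1 = (prev1 - prev2) / dt;    // earlier velocity estimate
        double v2 = (current - prev1) / dt;  // later velocity estimate
        return { current - prev1, v2, (v2 - v1) / dt };
    }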
Without a reference point these parameters are not very useful, except when phrases like "at a given moment if the object moves 1 foot" are used. In this case, "a given moment" implies that the speaker is referring to an instantaneous parameter.
Like the dimensional phonemes, position can be quantized at a low resolution in order to extract a simple linguistic description. A useful starting point based on Labanotation and the dimensional scale is: low, medium, high, left (stage left), right (stage right), forward (down stage), backward (up stage), and center (center stage). The computer correlates a global position of +z directly to an object being in a high space relative to a fixed location (i.e., a non-body-centered location).
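A sketch of such a quantization, assuming stage coordinates normalized to [0, 1] and hypothetical three-band boundaries:

    #include <string>

    // Map normalized stage coordinates in [0, 1] to low-resolution
    // linguistic labels. The three-band boundaries are assumptions.
    std::string lateral(double x) {   // 0 = stage right, 1 = stage left
        if (x < 0.33) return "stage right";
        if (x > 0.67) return "stage left";
        return "center";
    }

    std::string height(double z) {    // 0 = floor, 1 = full reach
        if (z < 0.33) return "low";
        if (z > 0.67) return "high";
        return "medium";
    }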
The diagonal scale and Laban's cube structure add new spatial paths to the dimensional gesture space. They can also define an independent system of movement along diagonals moving away from and toward the center of the body. In the environment defined for one object, these directions represent movement along the diagonals in absolute Cartesian space. This expands the vocabulary describing position to include combinations of words like forward-left, low-left, low-right, forward-high-left, forward-high-right, etc.
Words like "quarter" or "halfway" are also useful when describing the position of something in a performance space. References using these words modify or specify more precisely where something is located. For instance, "stage right a quarter of the way in" tells the computer where on stage right to look.
Environment 2
A second simple environment that has been explored consists of two high contrast, distinctly colored objects. The computer assumes that the first blob in the image of a particular color is one of the objects and that the first blob of another color is the second object.
Along with features extracted from the individual objects, relational features can also be calculated, namely the orientation of one object in relation to the other and the separation distance between the two objects.
If one of the objects is the center of the body, orientation and separation become body-centered gestures. With a two-object system it is possible to use phrases like "moving away from" or "moving toward", or words like "expanding", "contracting", or "spinning".
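A sketch of the two relational features, with hypothetical names; a separation that falls over successive frames would read as "moving toward" or "contracting", while a rising one would read as "expanding":

    #include <cmath>

    struct Point3 { double x, y, z; };

    // Separation distance between the two tracked objects.
    double separation(const Point3& a, const Point3& b) {
        double dx = b.x - a.x, dy = b.y - a.y, dz = b.z - a.z;
        return std::sqrt(dx * dx + dy * dy + dz * dz);
    }

    // Orientation of object b as seen from object a, as an azimuth
    // angle (radians) in the x-y plane.
    double azimuth(const Point3& a, const Point3& b) {
        return std::atan2(b.y - a.y, b.x - a.x);
    }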
Accumulated Features
Until now, this paper has defined information that can be recorded to let the computer recognize the state of the body or its movement at particular moments in time. By concatenating phonemes of this information together, a gesture is described. The overall gesture does not have to be performed within any given time frame; instead, only specific states have to be reached, independent of the duration of the actions.
Along with information about instantaneous states, other features can be extracted which require time points and intervals. These measurements describe pathways that imply more complex gestures which are time dependent.
Values that are averaged or accumulated over time require a specified interval of time in order to be calculated. These values include accumulated distance and displacement; average, minimum, and maximum velocity; average, minimum and maximum acceleration; as well as other features.
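These computations are straightforward; the sketch below works along one axis, with illustrative names, one position sample per processed frame, and a fixed frame period dt:

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Accumulated features over a specified interval: total distance
    // traveled (path length) versus net displacement, plus average and
    // peak speed along one axis.
    struct Accumulated { double distance, displacement, avgSpeed, maxSpeed; };

    Accumulated accumulate(const std::vector<double>& positions, double dt) {
        Accumulated a{0.0, 0.0, 0.0, 0.0};
        for (std::size_t i = 1; i < positions.size(); ++i) {
            double step = std::fabs(positions[i] - positions[i - 1]);
            a.distance += step;
            a.maxSpeed = std::max(a.maxSpeed, step / dt);
        }
        if (positions.size() > 1) {
            a.displacement = positions.back() - positions.front();
            a.avgSpeed = a.distance / (dt * (positions.size() - 1));
        }
        return a;
    }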
Ambiguities
A problem is created when distance traveled or average velocity is incorporated into the system. The new system can represent phrases like "moved 5 feet to the left", but now there is a hidden ambiguity about what a movement of 5 feet means. There are two possible interpretations: "the object moved 5 feet to the left from a starting position and never moved in another direction" and "the object eventually moved 5 feet to the left in one direction after moving in many other directions". The first interpretation is a subset of the second and is the most likely meaning; however, the second is just as valid and is also represented by the gesture space.
The inclusion of the word "eventually" implies that the leftward distance above is accumulated even across other direction changes, whereas if the word is not included, the distance in the desired direction is "zeroed" with each direction change.
This kind of ambiguity is a potential source of miscommunication between what the computer is expected to extract and what it actually observes. Most of the time, instructions to the computer are not so specific as to imply one interpretation over the other.
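The two interpretations can be made explicit in code. The following sketch formalizes one plausible reading of each, taking leftward motion as negative change in x (an assumption):

    #include <vector>

    // "Moved 5 feet to the left", two readings. Without "eventually",
    // progress toward the target is zeroed by any reversal; with it,
    // leftward steps keep accumulating across reversals.
    bool movedLeft(const std::vector<double>& dxs, double target, bool eventually) {
        double progress = 0.0;
        for (double dx : dxs) {
            if (dx < 0.0)
                progress += -dx;   // a leftward step
            else if (!eventually)
                progress = 0.0;    // strict reading: reversal resets
            if (progress >= target) return true;
        }
        return false;
    }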
Time and Reference Points
Accumulated features introduce two concepts that have to be more precisely defined: reference points and intervals of time. Displacement is described with reference to a particular point in time where the movement starts. Velocity as a feature also requires reference points: a velocity can be instantaneous or average, depending on whether the constraint is made with reference to a time interval. The direction the object is traveling can also be specified in relation to time by phrases like "it faced or moved in the x direction for five seconds".
Time can be constrained in three ways: as a reference point where something starts or ends, as an interval within which or outside of which something happens, or as a combination of reference points and intervals.
How a reference point is picked has arbitrary effects on the measurement involved. Ideally, one should be able to define particular time-related things in a fuzzy way, as a range of intervals or a window around a starting or ending point. What is most important is that time is described in a non-precise way. For instance, if one specifies that a gesture could happen a minute after another gesture, one would like the computer to interpret that not as exactly 60 seconds from the time the previous gesture occurs, but rather as anywhere from 15 to 200 seconds. The system includes a correlation between time length and percentage fuzziness that shrinks as time intervals increase: for instance, "one hour" implies ±20 minutes, while "one minute" implies ±30 seconds. By including words like "exactly" or "close to" in the gesture description, this fuzziness can be overridden if more precise timing is required.
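A hypothetical correlation function matching the two data points just given (one minute implies ±30 seconds, one hour implies roughly ±20 minutes); the linear interpolation between those anchors is an assumption, purely for illustration:

    // Tolerance for a fuzzy time interval. The fractional fuzziness
    // shrinks as intervals grow, anchored so that 60 s yields +/-30 s
    // and 3600 s yields roughly +/-20 min.
    double toleranceSeconds(double intervalSeconds) {
        double fraction = 0.5 - 0.17 * (intervalSeconds - 60.0) / 3540.0;
        if (fraction > 0.50) fraction = 0.50;  // short intervals
        if (fraction < 0.33) fraction = 0.33;  // long intervals
        return fraction * intervalSeconds;
    }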
Order of Events
Another variable that arises when putting together movement phonemes is the order in which they occur. One can say, "when the dancer has traveled downstage left and downstage right" and imply no ordering of the events, or one could add the words "and then" and imply that one follows the other. One event could also be time based, being required to be performed at an absolute time, while another event in the same gesture could be independent of time.
Effort Shape
How can the language of Laban be used to define movement to the computer so that the computer can then extract the desired quality of movement?
Laban's effort-shape vocabulary describes several qualities: flow (bound vs. free), space (direct vs. indirect), time (quick vs. slow), and weight (light vs. heavy); only time and space can be directly measured. Flow and weight might be extractable from some unknown combination of observed features, but this is unlikely because they depend more upon muscle state, mental intent, and subtle differences in shape and movement than on any visually extractable feature.
One advantage to incorporating the effort-shape vocabulary into gesture descriptions is that it describes qualities of movement as modifiers to other features or pathways. Saying that a gesture "takes an indirect path from point a to point b" implies that if there are numerous direction changes outside the noise of the background, then an indirect path was taken. The quality of indirectness is important, not the time it takes to traverse the pathway or the overall direction.
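One plausible formalization, not a definitive one: count the direction changes along a path that exceed the background noise and call the path indirect when the count passes a threshold. Both thresholds here are assumptions that would need tuning against the sensors:

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Point2 { double x, y; };

    const double kPi = 3.14159265358979323846;

    // Count the direction changes along a 2-D path that exceed minTurn
    // radians; declare the path "indirect" when the count reaches
    // minTurns.
    bool isIndirect(const std::vector<Point2>& p, double minTurn, int minTurns) {
        int turns = 0;
        for (std::size_t i = 2; i < p.size(); ++i) {
            double a1 = std::atan2(p[i-1].y - p[i-2].y, p[i-1].x - p[i-2].x);
            double a2 = std::atan2(p[i].y - p[i-1].y, p[i].x - p[i-1].x);
            double turn = std::fabs(a2 - a1);
            if (turn > kPi) turn = 2.0 * kPi - turn;  // wrap to [0, pi]
            if (turn > minTurn) ++turns;
        }
        return turns >= minTurns;
    }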
The words quick and slow imply a division of movement into two velocity levels, very fast and very slow, and perhaps things that are neither quick nor slow. More likely, they also imply how short or long a time interval it takes to perform a gesture. An issue that must be addressed here is how to define quick and slow quantifiably. The computer cannot make a subjective decision without parameters that describe the relationship of the two extremes. One way to solve this problem is to provide examples of the gesture performed slowly and then quickly and have the computer use these as benchmarks for future reference. Another way is to have the computer start out with pre-defined quantities and then modify its benchmarks as it experiences the gesture over time, as in the sketch below.
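A sketch of that second strategy; the starting benchmarks and the adaptation rate of 0.1 are assumptions:

    // Adaptive benchmarks for "quick" and "slow". Classify each
    // observed gesture duration against the midpoint, then nudge the
    // matching benchmark toward the observation so the boundary
    // adapts to the performer over time.
    struct TempoBenchmarks {
        double quickSeconds = 0.5;  // assumed starting point
        double slowSeconds  = 4.0;  // assumed starting point

        const char* classify(double duration) {
            double midpoint = 0.5 * (quickSeconds + slowSeconds);
            if (duration < midpoint) {
                quickSeconds += 0.1 * (duration - quickSeconds);
                return "quick";
            }
            slowSeconds += 0.1 * (duration - slowSeconds);
            return "slow";
        }
    };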
Conclusion
This paper presents the initial research into ways of describing movement to a computer so that it can use those descriptions to understand gesture. The goal is to correlate descriptive concepts of gestures to processing actions by the computer that allow the computer to recognize gesture. The Laban systems point the way to a "low-resolution" approach to understanding movement, often dividing directions or motions into only two or three distinct categories such as high, medium, and low; left-right, up-down, and forward-backward; or fast and slow. The scales define useful gesture spaces that can be used to create a linguistic system for describing gestures and movement states to the computer.
References
Bartenieff, Irmgard with Dori Lewis. (1980), "Body Movement: Coping with the Environment." New York: Gordon and Breach.
Denny, Lovell, Mitchell. (1996), "Virtual Space Harmony: The Value of Rudolf Laban's Theory of Space Harmony as a Means to Navigate a Virtual Stage Environment (VSE)." Institute for Studies in the Arts.
Dell, Cecily, revised by Aileen Crow and Irmgard Bartenieff. (1977), "Space Harmony: Basic Terms." New York: Dance Notation Bureau Press.
Dell, Cecily. (1977 revised), "A Primer for Movement Description: Using Effort-Shape and Supplementary Concepts." New York: Dance Notation Bureau Press.
Laban, Rudolf. (1966), "Choreutics" Annotated and edited by Lisa Ullmann. London: Macdonald & Evans.
Lovell, Mitchell (1996), "Using Human Movement to Control Activities in Theatrical Environments", Third International Conference on Dance and Technology.
Hardware
Indigo2, 150 MHz, 128 MB RAM, 3 GB disk, Galileo frame grabber
Macintosh IIfx, 40 MHz, 64 MB RAM, 1 GB disk
3 Sony DXC-1514 Cameras
Software
MAX 3.0
OMS 2.1
Special F/X (Indigo2)
Eyes (custom C++): image manipulation and understanding.
Gesture Matching (custom MAX): patches for analyzing feature data.
Credits
Arizona State University
Institute for Studies in the Arts