Computers can do a lot of things, but they can’t “see” video lacking a title, keyword, or description. And that can pose a challenge for law-enforcement officers seeking suspects or for investigators combing through hours of security tape.
In the wake of the Boston Marathon bombing, for example, hundreds of officers worked around the clock to look at thousands of hours of video to find the suspects. It took almost three days. Now imagine the alternative: the ability to search “baseball cap, black backpack” and quickly find footage of people who match that description, regardless of how the video was tagged.
A group of electrical and computer engineers led by professor Leslie Collins wants to show that computers can have that kind of “sight.” The engineers have created a computer-vision algorithm speedy enough to understand video and make decisions about it in something approaching real time.
The algorithm proved its abilities on a PS3 video game system. Five computers cranking away on interlocking algorithms, using vision from a single Web cam, drove a virtual red sports car around the curves of a mountain road at 130 miles per hour. The hazards were many: Sparks flew as the car grazed the guardrails, and there were skids from avoiding the occasional oncoming truck. Yet the car kept going reasonably well, while the computers made on-the-fly choices about steering, brakes, and acceleration.
Key to the success was a reinforcement learning algorithm, an adaptive, responsive piece of computer code. It tries different responses to maximize its reward and then evaluates what worked and what didn’t so it can do better next time. As it builds a history of success and failure through repeated trials, the algorithm reveals the boundaries of appropriate responses to the problem.
One of the five computers is the master decision maker, sending choices about steering, acceleration, and braking to the PS3. The other four split up the visual tasks. After making each choice, the visual computers ask the camera for another frame of video to analyze, but none of them is keeping up with thirty frames per second yet, according to Kenny Morton, an assistant research professor in Collins’ lab.
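For readers who want to see the trial-and-error idea in miniature, here is a rough Python sketch of reinforcement learning. It is not the team's code. The three steering actions, the reward numbers, and the exploration rate are all invented for illustration, and a small simulated reward function stands in for the video game.

```python
import random

# Hypothetical action set; the real system also chose braking and acceleration.
ACTIONS = ["steer_left", "straight", "steer_right"]


def simulated_reward(action, rng):
    """Invented stand-in for the game: each action pays a noisy reward."""
    mean = {"steer_left": 0.2, "straight": 0.8, "steer_right": 0.4}[action]
    return mean + rng.uniform(-0.1, 0.1)


def learn(trials=500, explore_rate=0.1, seed=0):
    """Estimate each action's average reward through repeated trials."""
    rng = random.Random(seed)
    value = {a: 0.0 for a in ACTIONS}  # running estimate of reward per action
    count = {a: 0 for a in ACTIONS}    # how many times each action was tried

    for _ in range(trials):
        untried = [a for a in ACTIONS if count[a] == 0]
        if untried:
            action = untried[0]                   # try everything at least once
        elif rng.random() < explore_rate:
            action = rng.choice(ACTIONS)          # occasionally experiment
        else:
            action = max(ACTIONS, key=value.get)  # otherwise use what worked

        reward = simulated_reward(action, rng)

        # Fold the new result into the running average for that action.
        count[action] += 1
        value[action] += (reward - value[action]) / count[action]

    return value


estimates = learn()
best = max(estimates, key=estimates.get)
print(best, estimates)
```

The occasional random choice is what lets the program keep testing alternatives instead of settling on the first response that earns any reward at all. A driving system would face a far harder version of this problem, since the best action changes with every frame of video.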
Machines are learning to see, but they still have a long way to go. ATMs, for instance, can reliably read handwritten amounts on a check and correctly credit an account. A tollbooth can shoot a picture of your license plate flashing past at 60 mph. Still, to truly see and interpret the detail and variety of video, to recognize an individual face or a moving landscape in real time, is an unconquered task.
For now. Collins says her team’s next goal is to use the vision algorithm to pick an individual out of a crowd scene based on a match to a video database of different gaits. And while the algorithm’s performance continues to improve, the machines still struggle to recognize the varied and shifting shapes of the roadway and the other cars. The learning curve is the toughest challenge of all. “Just recognizing vehicles in all their various orientations, occlusions, and shapes is wicked hard,” says team member and assistant research professor Peter Torrione M.S. ’02, Ph.D. ’08.