Astute followers of artificial intelligence may recall a moment from three years ago, when Google announced it had birthed unto the world a computer able to recognize cats using only videos uploaded by YouTube users. At the time, this represented something of a high-water mark in AI. To get an idea of how far we have come since then, one has only to reflect on recent advances in the RoboWatch project, an endeavor that is teaching computers to learn complex tasks using instructional videos posted on YouTube.
That innocent “learn to play guitar” clip you posted to YouTube last week? It may someday contribute to putting Carlos Santana out of a job. That’s probably pushing it; it’s more likely that thousands of home nurses and domestic staff will be axed long before guitar gods have to compete with robots. A recent groundswell of interest in bringing robots into the marketplace as caregivers for the elderly and infirm, fueled in part by graying populations throughout the developed world, has created a need to teach robots simple household tasks. Enter the RoboWatch project.
Most advanced forms of AI currently in use rely on supervised machine learning, in which a system is “trained” on a large dataset. The basic idea is that, given a sufficiently large database of labeled examples, the computer can learn what differentiates the items within the training set, and later apply that classifying ability to new instances it encounters. The main drawback of this approach is its dependence on those labeled examples, which are not always available and often take considerable human curation to create.
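To make the supervised idea concrete, here is a minimal sketch in Python. It is not how Google's cat detector or any production system works; it is a toy nearest-centroid classifier, and the two features (ear pointiness and whisker length) and all of the numbers are made up for illustration. The point is only that every training example arrives with a human-supplied label.

```python
from collections import defaultdict


def train_centroids(examples):
    """Average the feature vectors for each label.

    examples is a list of (feature_vector, label) pairs.
    Returns a dict mapping each label to its mean feature vector.
    """
    sums = {}
    counts = defaultdict(int)
    for features, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        for i, value in enumerate(features):
            sums[label][i] += value
        counts[label] += 1
    return {
        label: [total / counts[label] for total in vector]
        for label, vector in sums.items()
    }


def classify(centroids, features):
    """Return the label whose centroid is closest to the new example."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda label: squared_distance(centroids[label], features))


# Hypothetical hand-labeled data: (ear pointiness, whisker length) -> label.
labeled = [
    ([0.9, 0.8], "cat"),
    ([0.8, 0.9], "cat"),
    ([0.2, 0.1], "not cat"),
    ([0.1, 0.3], "not cat"),
]

centroids = train_centroids(labeled)
print(classify(centroids, [0.85, 0.7]))
```

Someone had to write "cat" or "not cat" next to every row of that training set. With four examples that is trivial; with millions of images or hours of video, it becomes the bottleneck described above.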
RoboWatch is taking a different tack, using what’s called unsupervised learning to discover the important steps in YouTube instructional videos without any previous labeling of data. Take for instance a YouTube video on omelet making. Using the RoboWatch method, the computer successfully parsed the video and cataloged the important steps without having first been trained on labeled examples.
It was able to do this by looking at a large number of instructional omelet-making videos on YouTube and creating a universal storyline from their audio and video signals. As it turns out, most of these videos share certain steps, such as cracking the eggs, whisking them in a bowl, and so on. When presented with enough video footage, the RoboWatch algorithm can tease out which parts of the process are essential and which are arbitrary, creating a kind of archetypal omelet formula. It’s easy to see how unsupervised learning could quickly enable a robot to gain a vast assortment of practical household know-how while keeping human instruction to a minimum.
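The intuition behind that "universal storyline" can be sketched in a few lines of Python. This is not the RoboWatch algorithm, which works on raw audio and video and has to discover the steps themselves. The sketch assumes, as a large simplification, that each video has already been reduced to a list of step names; the function name, the sample videos, and the 75 percent threshold are all hypothetical. Nobody tells the program which steps matter; it infers that from how often they recur.

```python
from collections import defaultdict


def common_storyline(videos, min_share=0.75):
    """Keep steps that appear in at least min_share of the videos,
    ordered by their average relative position within a video."""
    positions = defaultdict(list)
    for steps in videos:
        seen = set()
        for index, step in enumerate(steps):
            if step in seen:
                continue  # count each step once per video
            seen.add(step)
            # Relative position: 0.0 is the start of the video, 1.0 the end.
            positions[step].append(index / max(len(steps) - 1, 1))

    kept = [
        step for step, found in positions.items()
        if len(found) / len(videos) >= min_share
    ]
    return sorted(kept, key=lambda step: sum(positions[step]) / len(positions[step]))


# Hypothetical omelet videos, each already reduced to step names.
videos = [
    ["crack eggs", "whisk", "heat pan", "pour", "fold"],
    ["intro chat", "crack eggs", "whisk", "heat pan", "pour", "fold"],
    ["crack eggs", "whisk", "add cheese", "heat pan", "pour", "fold"],
    ["crack eggs", "whisk", "heat pan", "pour", "fold", "garnish"],
]

print(common_storyline(videos))
```

The one-off steps (the intro chat, the cheese, the garnish) each show up in a single video and fall below the threshold, while the five steps every cook performs survive in their typical order.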
The RoboWatch project follows similar advances in video captioning pioneered at Carnegie Mellon University. Earlier this year, we reported on a project headed by Dr. Eric Xing, which seeks to use real-time video summarization to detect unusual activity in video feeds. This could lead to surveillance cameras with the built-in ability to detect suspicious activity. Putting these developments together, it’s clear unsupervised learning models using video footage are likely to pave the way for the next breakthrough in artificial intelligence, one that will see robots entering our lives in ways that are likely to both scare and fascinate us.