# See-and-Tell AI Machine Can Describe Objects It Observes

A young child can look at whatever is in front of them and describe what they see. But for artificial intelligence systems, that is a daunting task, because it combines two separate skills: the ability to recognize objects and the ability to generate sentences describing the scene. Scientists at the University of Toronto and the University of Montreal have developed software, modeled on brain cell networks, that they claim can take any image, generate a caption, and get it right most of the time. Their approach builds on earlier work involving natural language processing: the ability to convert speech or text from one language to another or, more generally, to extract meaning from words and sentences.

"It's about the combination of image information with natural language," says Richard Zemel, a computer scientist at the University of Toronto. "That's what's new here: the marriage of image and text. We think of it as a translation problem," he notes. "When you're trying to translate a sentence from English to French, you have to get the meaning of the sentence in English first and then convert it to French. Here, you need the meaning, the content of the image; then you can translate it into text."

But how does the software model know what is in the image in the first place? Before the system can process an unfamiliar picture, it is trained on a massive data set, actually three different data sets containing more than 120,000 already captioned images. The model also needs to have some sense of what words are likely to be found alongside other words in normal English sentences. For example, an image that causes the model to generate the word "boat" is likely to also use the word "water," because those words usually go together. Moreover, it has some idea of what is important in an image. Zemel points out, for example, that if an image has a person in it, the model tends to mention that in the caption.

Often the results are dead-on. For one image, it generated the caption "a stop sign on a road with a mountain in the background," just as the image showed; it was also accurate for "a woman is throwing a Frisbee in a park" and "a giraffe standing in a forest with trees in the background." But occasionally it stumbles. When an image contained two giraffes near one another but far from the camera, it identified them as "a large white bird." And a vendor behind a vegetable stand yielded the caption "a woman is sitting at a table with a large pizza." Sometimes similar-looking objects are simply mistaken for one another; a sandwich wrapped in tinfoil can be misidentified as a cell phone, for example (especially if someone is holding it near their face). In their tests, Zemel says, the model came up with captions that "could be mistaken for that of a human" about 70 percent of the time.

One potential application might be as an aid for the visually impaired, Zemel says. A blind person might snap a photo of whatever is in front of them and ask the system to produce a sentence describing the scene. It could also help with the otherwise laborious task of labeling images. A media outlet might want to instantly locate all archival images of, say, children playing hockey or cars being assembled in a factory (a daunting task if the thousands of images on one's hard drive haven't been labeled).

Is the model "thinking"? "There are analogies to be made between what the model is doing and what brains are doing," Zemel says, particularly in terms of representing the outside world and in devoting "attention" to specific parts of a scene.
"It's getting toward what we're trying to achieve, which is to get a machine to be able to construct a representation of our everyday world in a way which is reflective of understanding." Zemel and his colleagues will be presenting a paper on the work, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention," at the International Conference on Machine Learning in July.
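To make the ideas in the article more concrete, here is a minimal sketch, not the authors' code, of the general approach the paper describes: a convolutional network encodes the image into a grid of feature vectors, and at each decoding step a recurrent network "attends" to a weighted mix of those regions before predicting the next caption word. All layer sizes, class names, and the toy vocabulary below are illustrative assumptions.

```python
# Sketch of soft-attention image captioning (illustrative, not the published model).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoftAttention(nn.Module):
    """Scores each image region against the decoder state (additive attention)."""

    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (batch, num_regions, feat_dim); hidden: (batch, hidden_dim)
        e = self.score(torch.tanh(self.feat_proj(feats) + self.state_proj(hidden).unsqueeze(1)))
        alpha = F.softmax(e.squeeze(-1), dim=1)             # attention weights over regions
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # weighted sum of region features
        return context, alpha


class CaptionDecoder(nn.Module):
    """LSTM decoder that emits one word per step, conditioned on attended image context."""

    def __init__(self, vocab_size, feat_dim=512, embed_dim=256, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attention = SoftAttention(feat_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, feats, captions):
        # feats: (batch, num_regions, feat_dim); captions: (batch, seq_len) of word ids
        batch, seq_len = captions.shape
        h = feats.new_zeros(batch, self.lstm.hidden_size)
        c = feats.new_zeros(batch, self.lstm.hidden_size)
        logits = []
        for t in range(seq_len):
            context, _ = self.attention(feats, h)            # look at relevant image regions
            x = torch.cat([self.embed(captions[:, t]), context], dim=1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                    # (batch, seq_len, vocab_size)


if __name__ == "__main__":
    # Toy forward pass: a 14x14 grid of 512-d CNN features and a 3-word caption prefix.
    feats = torch.randn(2, 14 * 14, 512)
    captions = torch.randint(0, 1000, (2, 3))
    model = CaptionDecoder(vocab_size=1000)
    print(model(feats, captions).shape)  # torch.Size([2, 3, 1000])
```

In this sketch, training on the captioned data sets the article mentions would mean minimizing the cross-entropy between the predicted word distributions and the reference captions; the attention weights are what give the model its sense of which parts of the scene matter for each word.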