Robots are becoming more common as costs fall and their skills expand. Retail locations now use robots for a variety of in-store functions, and they have become even more useful in the struggle against the coronavirus pandemic.
Most robots are equipped with some combination of cameras and sensors to give them the power of “sight.” But researchers at Carnegie Mellon University questioned whether sound would be helpful as well, organizing what they said was the first large-scale study of the interactions between sound and robotic action.
Lerrel Pinto, a robotics researcher who recently finished his Ph.D. at Carnegie Mellon University, said in an interview that this was some of the first work exploring how sound can help robots better understand the world around them.
“Most of the time our robots have a camera on their head or in their hands and through these cameras, they are able to see and perceive the world. But why only cameras? Humans have eyes, ears and so we have more than one modality of sensing the world. So why just vision? Why not use sound as well?” Pinto said.
“I see this work as some of the first steps of using sound in robotics. We are getting a better understanding of how sound can be used along with a robot that is interacting with the world.”
Working out of the Robotics Institute at Carnegie Mellon University, Pinto and fellow researchers Dhiraj Gandhi and Abhinav Gupta presented their findings during the virtual Robotics: Science and Systems conference last month. The three started the project last June, according to a release from the university.
“We present three key contributions in this paper: (a) we create the largest sound-action-vision robotics dataset; (b) we demonstrate that we can perform fine grained object recognition using only sound; and (c) we show that sound is indicative of action, both for post-interaction prediction, and pre-interaction forward modeling,” they write in the study.
“In some domains like forward model learning, we show that sound in fact provides more information than can be obtained from visual information alone.”
In the published study, the three researchers said sound helped a robot differentiate between objects and predict the physical properties of new objects. They also found that hearing helped robots determine what type of action caused a particular sound. Robots using sound were able to successfully classify objects roughly 76% of the time, according to Pinto.
Using toy blocks, hand tools, shoes, apples, tennis balls, and other objects, they created “the largest available sound-action-vision dataset with 15,000 interactions on 60 objects using the robotic platform Tilt-Bot,” according to the study. The “Tilt-Bot” is a square tray attached to the arm of a Sawyer robot.
“By tilting objects and allowing them to crash into the walls of a robotic tray, we collect rich four-channel audio information. Using this data, we explore the synergies between sound and action and present three key insights. First, sound is indicative of fine-grained object class information, e.g., sound can differentiate a metal screwdriver from a metal wrench,” the study said.
“Second, sound also contains information about the causal effects of an action, i.e. given the sound produced, we can predict what action was applied to the object. Finally, object representations derived from audio embeddings are indicative of implicit physical properties. We demonstrate that on previously unseen objects, audio embeddings generated through interactions can predict forward models 24% better than passive visual embeddings.”
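The pipeline the quote describes — record the sound of an interaction, turn it into a feature, and recognize the object from that feature alone — can be sketched with synthetic data. The sample rate, "ring" frequencies, log-spectrum feature, and nearest-centroid classifier below are all illustrative assumptions, not the authors' actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
SR = 4000  # sample rate in Hz, assumed for this toy example

def impact_sound(freq, n=2048):
    """Synthesize a decaying 'clink': a damped sinusoid plus noise."""
    t = np.arange(n) / SR
    return np.sin(2 * np.pi * freq * t) * np.exp(-8 * t) + 0.05 * rng.standard_normal(n)

def feature(audio):
    """Log-magnitude spectrum as a crude stand-in for an audio embedding."""
    return np.log1p(np.abs(np.fft.rfft(audio)))

# Two imaginary objects that ring at different pitches when they hit
# the tray wall (say, a metal wrench vs. a wooden block).
objects = {"wrench": 900.0, "block": 300.0}

# "Collect" several interactions per object and average the features
# into one centroid per class.
centroids = {
    name: np.mean([feature(impact_sound(f)) for _ in range(10)], axis=0)
    for name, f in objects.items()
}

def classify(audio):
    """Nearest-centroid classification in feature space."""
    x = feature(audio)
    return min(centroids, key=lambda name: np.linalg.norm(x - centroids[name]))

print(classify(impact_sound(900.0)))  # expect: wrench
print(classify(impact_sound(300.0)))  # expect: block
```

The real study trains a learned model on 15,000 Tilt-Bot interactions rather than averaging spectra, but the shape of the problem — mapping impact audio to a fine-grained object label — is the same.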
They note that other researchers have looked into how sound can be used to estimate the amount of a granular material, such as rice, by shaking its container. But Pinto said their study showed just how broadly helpful sound could be to a robot.
“I think what was really exciting was that when it failed, it would fail on things you expect it to fail on,” he said, adding that a robot could use what it learned about the sound of one set of objects to make predictions about the physical properties of previously unseen objects. While the robot in their experiment could not tell the difference between two different colored blocks, it could tell the difference between a block and a cup.
Sound, they wrote in the study, captures rich object information that is often imperceptible through visual or force data. Few systems, algorithms, or datasets exploit sound as a vehicle for building physical understanding, according to their research, in part because the intricacies of sound make it difficult to extract information that is useful for robotics.
“The first insight is that sound is indicative of fine-grained object information. This implies that just from the sound an object makes, a learned model can identify the object with 79.2% accuracy from a set of diverse 60 objects,” the study said.
“Our second insight is that sound is indicative of action. This implies that just from hearing the sound of an object, a learned model can predict what action was applied to the object. Our final insight is that sound is indicative of the physical properties of an object. This implies that just from hearing the sound an object makes, a learned model can infer the implicit physical properties of the object.”
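The third insight — that audio embeddings beat passive visual embeddings for forward-model learning — can be illustrated with a toy regression. Everything here is invented for illustration: a hidden mass drives how far an object slides under a tilt action, an audio feature is assumed to correlate with that mass, and a visual feature (color) is assumed not to:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hidden physical property: each object's mass (never observed directly).
mass = rng.uniform(0.5, 2.0, n)
tilt = rng.uniform(0.1, 1.0, n)                       # action applied to the tray
slide = tilt / mass + 0.02 * rng.standard_normal(n)   # outcome a forward model predicts

# Audio embedding: impact pitch tracks mass (an illustrative assumption).
audio_feat = 1.0 / mass + 0.05 * rng.standard_normal(n)
# Passive visual embedding: color, unrelated to mass in this toy world.
visual_feat = rng.uniform(0.0, 1.0, n)

def forward_model_rmse(feat):
    """Fit slide ~ a*(feat*tilt) + b*tilt + c by least squares; return RMSE."""
    X = np.column_stack([feat * tilt, tilt, np.ones(n)])
    coef, *_ = np.linalg.lstsq(X, slide, rcond=None)
    resid = slide - X @ coef
    return float(np.sqrt(np.mean(resid ** 2)))

err_audio = forward_model_rmse(audio_feat)
err_visual = forward_model_rmse(visual_feat)
print(f"audio-embedding forward model RMSE:  {err_audio:.3f}")
print(f"visual-embedding forward model RMSE: {err_visual:.3f}")
```

Because the audio feature encodes the physics that governs the outcome while the visual feature does not, the audio-based forward model fits far better — a simplified version of the 24% improvement the study reports on unseen objects.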
According to Carnegie Mellon University, the study received support from the Defense Advanced Research Projects Agency and the Office of Naval Research.
Pinto told TechRepublic that there are multiple ways this study could advance robotics. Determining the composition, size, and weight of an object from sound could be key when a robot's vision is impaired or restricted.
“The next step will be learning whether we can use the sound of an object to better manipulate the object. What that means is, maybe you have a cup and hear the sound of the cup, so will this help you in maybe picking up the cup and moving it without spilling things that are in the cup?” Pinto asked, noting that it may be smart to equip future robots with canes that could be used to tap certain objects.
“A lot of this work is in the foundation stages so it is unclear when it will be used in the real world. But we are sharing it with other researchers for further study.”