Associative memory for navigation

Suppose a robot were built that could traverse the interior of a building and, in doing so, build an associative memory map of the building's visual layout -- using some kind of neural net. With the map built, it could tell, given a picture of anywhere in the house, where it (the robot) is currently located. For example, it receives as input a picture of the kitchen table, and from that it would know that to the left of the table is a door, and on the other side of that door is the living room (assuming it somehow knew what portion of its map was designated the living room).

Is this even remotely possible?

If so, then would it be feasible to, in some input format, tell the robot, for example, to "go to room x" (assuming somehow the robot had learned to associate "room x" with a particular portion of its internal map)?

Reply to
chad.d.johnson

Yes. Humans can do it, so it's possible without a doubt. :) It's not easy, however.

Sure.

I'm not aware of anyone that has done just what you are thinking of, but there's been a lot of work on related ideas.

I know I've seen robotics mapping projects where the goal of the robot was to create a 2D map of its environment as it moved around a building, and to use that map to locate itself in the environment. But as I recall, the sensor supplying the data was not visual but something more easily usable, like laser distance measurements in 360 degrees around the robot. That gives the robot distances to the walls, from which it builds up a map of the walls and doors. The project I'm thinking of made heavy use of statistical techniques.

Actually, now that I think about it, it might have been given a simple 2D map of the building, and its goal was to move around and try to figure out where it was on that map. I think it worked by estimating the probability of its location at all points on the map (down to some resolution), and using the sensor data to update the probability of being at each location until it had received enough data to estimate its location with high probability. I recall seeing a video of the computer screen showing its best guess as to its current location on the map as it moved around. It started off not knowing where it was, quickly narrowed that down to a few possible areas, and then refined that to its actual location.
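If I remember it right, the core update loop was conceptually something like this toy 1D version. Everything here (the "map", the sensor model, the noise numbers) is made up purely to show the idea of a belief distribution being sharpened by sensor readings:

import numpy as np

# Toy 1D Markov localization.  The "map" is a row of cells, each either
# wall (1) or door (0).  The robot senses wall/door (noisily) and moves
# one cell to the right each step.  All values are illustrative only.
world = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 1])   # hypothetical corridor
belief = np.ones(len(world)) / len(world)          # start out completely lost

P_HIT, P_MISS = 0.8, 0.2                           # assumed sensor model

def sense(belief, measurement):
    """Bayes update: boost cells whose map value matches the reading."""
    likelihood = np.where(world == measurement, P_HIT, P_MISS)
    posterior = belief * likelihood
    return posterior / posterior.sum()

def move(belief):
    """Shift the belief one cell to the right, with a little motion noise."""
    return 0.9 * np.roll(belief, 1) + 0.05 * belief + 0.05 * np.roll(belief, 2)

# Feed it a few simulated readings; the belief collapses toward the cells
# consistent with the wall/door sequence seen so far.
for z in [1, 0, 1]:
    belief = move(sense(belief, z))
    print(np.round(belief, 3))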

Trying to do the same sort of thing from visual data alone would be much harder. I'm not aware of any project which has done that.

Once it has a map, and the ability to locate itself on the map, then simply pointing to different locations on the map is a simple way to tell it where you want it to go. Typing English-like commands such as "go to the kitchen" would be a bit more complex, depending on how flexible you want the command language to be. If, for example, you want to talk to it and tell it things like "this is the kitchen", and then later be able to tell it "go get me a beer from the fridge in the kitchen", then you are at a very different level of problem than using a mouse to point to a location on its map to tell it where to go.
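At the point-on-the-map level, the command side really can be trivial -- just a lookup from a learned label to map coordinates, which then gets handed to whatever path planner you have. A purely hypothetical sketch:

# Hypothetical sketch: once rooms have been labelled on the robot's map,
# "go to the kitchen" reduces to a dictionary lookup plus path planning.
room_locations = {                  # labels taught by the user -> map coordinates
    "kitchen": (4.2, 1.5),
    "living room": (8.0, 3.1),
}

def handle_command(command):
    if command.startswith("go to the "):
        label = command[len("go to the "):]
        goal = room_locations.get(label)
        if goal is None:
            return f"I don't know where '{label}' is yet."
        return f"Planning a path to {label} at {goal}."   # plan_path(goal) on a real robot
    return "Sorry, I only understand 'go to the <room>'."

print(handle_command("go to the kitchen"))

The "go get me a beer" level of command is where it explodes into a full language-understanding problem.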

Reply to
Curt Welch

Cool. Was it able to navigate pretty well?

Yea, I think using image data alone would not work well. I think the robot would need some form of radar and be able to ping its surrounding environment in order to understand how far away objects are. It would then need to be able to (somehow) associate, in the neural net, that distance data with actual objects and specific positions in the image map it constructs.

I guess if I were doing this with a human I would point and say, "That is the refrigerator," "That is the kitchen table," "This is the living room couch." So maybe I could have some UI where I could type the object name into a textbox, click a button when the robot starts scanning that object/area, and click another button when it's done to stop the phrase/object association process.

Is my head too far in the clouds? Is something like this needed in the robotics field (would it be useful)? Is there a better way than what I've described? I'd like to do it as closely as possible to how humans perform navigation and understand their positions (generally speaking).

Reply to
Chad Johnson

It would be useful if it worked well. I've never seen such a thing that worked well enough to be useful however.

No one really knows how humans do these things. At least not well enough to be able to build a machine to duplicate our skills in these tasks.

If you try to train the robot by example, you have to give it a lot of examples for it to work. For example, if you point at the fridge and tell it "that's the refrigerator," what's going to stop it from thinking that all large white areas are to be known as "the refrigerator"? Or what if there's a magnet on the fridge and the robot assumes the magnet is the fridge? Then when you say "go to the fridge," the robot runs over to the white-board, which also has magnets on it. Before a robot (or a human) can understand such a message, it needs some way of parsing the environment into objects, plus a large base of commonsense experience about what the speaker is most likely trying to communicate.

The basic concept of association is very simple and powerful and a fundamental part of what makes humans intelligent. But the hard part of the problem is understanding how to decode raw sensor data into "things" that the associations can be made with. I don't know of any AI project that has really solved that part of the problem.

When we, as humans, look at a kitchen, we don't just see raw 2D pixel data. We see a 3D room full of 3D objects. Before you can use basic high-level associations like telling the robot the big white box is a refrigerator, the robot first has to decode that raw 2D data into a description of a 3D room full of objects, so that the fridge you understand is the same one the robot already understands before you try to give it a name.

But how do we, as humans, learn to see the kitchen in this way? Some people believe a lot of that ability is the result of millions of years of evolution building custom hardware in us that decodes visual data in that way (and they believe we have different hardware for decoding visual data than for decoding sound data). So they believe each part of the brain simply has complex hardware for processing each type of data and performing each function. To duplicate that in an AI project, we would need to develop a lot of different complex modules and make them all work together.

I happen to believe that most of that work is done by one generic type of hardware which is able to adapt to, and decode, whatever type of data you send it -- the way an ANN can learn to decode any data it's been trained on.

There are a lot of people looking at these sorts of problems from all different directions, but there simply are no solutions that match what humans can do.

Reply to
Curt Welch

Yes, it is possible to do this with associative memory and neural nets.

It is also possible to do it far more efficiently and reliably by forgetting about the neural nets, and using an object recognition algorithm such as SIFT:

formatting link
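OpenCV ships a SIFT implementation if you want to play with the idea; matching two images is only a few lines. A rough sketch (the file names are just placeholders):

import cv2

# Minimal SIFT matching sketch using OpenCV; file names are placeholders.
img1 = cv2.imread("kitchen_reference.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("current_view.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)   # keypoints + 128-D descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Brute-force matching with Lowe's ratio test to drop ambiguous matches.
matcher = cv2.BFMatcher()
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# A large number of good matches suggests both images show the same object/scene.
print(len(good), "good matches")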

Reply to
Bob

That algorithm looks like it could be useful. One reason I'm considering a neural net is that it would allow phrases (and, in the future, other things) to be associated with objects it (somehow) finds. I guess I have a few options here, some of which are:

  • Integrate this algorithm with an ANN. Store the image data in whatever data structure is used by this SIFT algorithm, and in the neural net store phrases and associate those phrases with parts of the SIFT algorithm's data structure (indexes or whatnot).
  • Integrate this algorithm with an ANN, using it (in some way) in place of some core internal comparison algorithms (so have a hybrid ANN-SIFT system). So I'd send raw image data as input to the ANN, the ANN would have SIFT do its comparison magic, and the ANN would return the results of SIFT.
  • Use the SIFT algorithm exclusively. I don't know whether phrases could be associated with the image data, though -- see the rough sketch after this list for one way it might work.
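Actually, thinking about that last option a little more, maybe the association part could be as simple as a dictionary from phrase to stored SIFT descriptors, with recognition done by matching a new view against each entry. Everything below (names, thresholds) is purely hypothetical:

import cv2

# Hypothetical sketch of the SIFT-only option: associate phrases with
# stored SIFT descriptors instead of with a neural net's weights.
sift = cv2.SIFT_create()
matcher = cv2.BFMatcher()
phrase_memory = {}                       # phrase -> descriptor array

def learn(phrase, image):
    """Remember the descriptors of the object the user is naming."""
    _, des = sift.detectAndCompute(image, None)
    phrase_memory[phrase] = des

def recognize(image, min_good=15):
    """Return the stored phrase whose descriptors best match this view."""
    _, des = sift.detectAndCompute(image, None)
    best_phrase, best_count = None, 0
    for phrase, stored in phrase_memory.items():
        matches = matcher.knnMatch(stored, des, k=2)
        good = [m for m, n in matches if m.distance < 0.75 * n.distance]
        if len(good) > best_count:
            best_phrase, best_count = phrase, len(good)
    return best_phrase if best_count >= min_good else None

The same dictionary could later map phrases to sound clips or anything else, which is part of what I was hoping a neural net would buy me.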

Maybe someone has done something like this already. I'll have to research this.

Would SIFT actually store image data in some data structure, or do you just pass the raw image data to the algorithm each time you want to compare images? Would I need some supplementary data structure to store this data?

How flexible are SIFT's association capabilities? Is it as capable as a neural net? What if, say, I did actually get somewhere with this robot, and I used SIFT. What if I wanted to associate sounds with the images in addition to text (obviously I'd need a sound-based algorithm for processing the sound)? Could I do this without using a neural net?

Reply to
chad.d.johnson

Isn't being able to distinguish objects only going to be an issue if an object's location changes or if the object is moving? What is the disadvantage if an object is identified based on the image and radar readings of it and its surrounding environment? It's less human-like, yes, but would the robot not still be able to locate the object?

On a different note, about distinguishing objects from one another and from the environment: in my room I have a bookcase, and to the right of it I have two computers stacked on top of one another. How do I know that the computers are not part of the bookcase -- that I have three separate objects? Some things I notice are (and I am sort of just thinking to myself here):

  • The bookcase area is colored differently than the computer area
  • The bookcase does not have buttons or lights or internal shapes like the computers do; the computers have distinctly different features than the bookcase area does.

Now, how do I know there are *two* computers in the computer area of the image rather than just one? The computers look different, but suppose they looked exactly the same and were aligned perfectly so that they looked like one object. I would have a more difficult time realizing that there are two computers in that area and not just one. It would take me a little longer to make this determination, and I think there is a chance that I might not realize that there actually are two objects.

To make this determination, I think the biggest factor would be that I would realize that I am seeing the same or a similar thing twice, and with the number two in my mind, I would poll any existing knowledge about computer cases, and, assuming I had any, I would likely determine that the height of the area is too tall to be just one computer. So basically I'm doing a size comparison against my existing knowledge of computer cases and determining whether any case I've seen has ever been that tall. If the results are around 50/50, I might inspect the cases more closely (e.g., try separating the two repeated areas).

It seems that distinguishing whether an area in an image is one object or multiple objects involves closeness matching. I'd bet this could be done with some algorithm -- maybe one that could separate the image data into multiple areas based on various hard-coded (or even learned) characteristics, such as size, color, shape, shadows, internal characteristics (shapes, colors), etc. Then other non-image inputs could be used as complements, such as physical dimensions from radar.
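Just to convince myself the dumb version of that is at least expressible in code, something like this (with totally arbitrary thresholds and a placeholder file name) would carve an image into candidate regions based on nothing but edges and blob size:

import cv2

# Crude illustration: split an image into candidate "object" regions using
# only hard-coded cues (edges + blob size).  The thresholds are arbitrary,
# and real scenes would need far more than this.
img = cv2.imread("room.png")                      # placeholder file name
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
edges = cv2.Canny(gray, 50, 150)
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

regions = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w * h > 2000:                              # ignore tiny fragments
        regions.append((x, y, w, h))

print(len(regions), "candidate object regions")
# Radar/depth readings could then be checked against each bounding box to
# help decide whether two adjacent regions are really one object or two.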

Any thoughts? :)

Reply to
Chad Johnson

Don't let terminology confuse you. Neural networks are simply one family of learning systems -- with all the problems and benefits that brings. They are large multidimensional nonlinear dynamic feedback systems; they are not a magic bullet.

A simple language parser/command interface will get you much further. Rely on specialized numerical algorithms for vision processing, put a language grammar on top for human interaction, and forget about the magic bullet.

Read about it. It's rather elegant and well optimized.

In many ways, SIFT is more capable than a neural network. Why? Because SIFT was designed and optimized with years of mathematical insight. As were numerous other vision algorithms.

Neural networks are nothing without training. Even with good training, they may fail to converge, go unstable, or result in a highly inefficient pattern-matching system. Many simple problems (e.g., calculating the square root of x, or identifying pairs x and x^2) just don't map well to a neural-network architecture.

Regarding visual mapping, the OpenSLAM project

formatting link
might be of interest.

Wishful, uncritical thinking led to the AI Winter -- a fate predicted years before by Drew McDermott in his 1976 ACM SIGART paper entitled "Artificial Intelligence Meets Natural Stupidity". This mindset is to be avoided like the plague; it causes confusion and blindness, where reality is masked by ignorant dreams.

Apologies for the rant (meant for NN hype rather than you personally), Daniel

Reply to
D Herring

I have to agree with Bob here -- I spent a lot of time in grad school working with neural networks, and from that experience, I will argue that they're of little practical use. The training time for them is far too long, and most of the time the resulting network is brittle (either overgeneral or overspecific). There are many other learning algorithms which train faster and work better.

But Bob, thanks for the link to SIFT; it looks cool. The newer competitor, SURF, looks even more interesting for robotics, as it's speedy and a robot vision system surely needs that. Makes me want to (finally) finish a laptop bot, or some other bot with a real computer on it, and start doing some real vision work!

Best,

- Joe

Reply to
Joe Strout

My point was more to do with the simple idea that the robot wouldn't be able to understand your question. It wouldn't be able to learn what object you were talking about if it didn't have a concept of objects that were close to your concept of objects.

You just can't know that there are two computers there simply by looking at the image. How, for example, do you know that those two computers stacked on top of each other are not actually a foam sculpture carefully shaped and painted to look exactly like two computers stacked together?

When you look at those computers, your brain doesn't parse it as a foam sculpture sitting next to a bookcase. Your brain parses it as two computers, which means you can lift the top one without the bottom one moving, and that they have an expected weight which is much heavier than a foam sculpture's.

Part of your ability to see those as two computers might, for you, come from the fact that you put them there in the first place. But even if I walked into your office, knowing nothing about it, I too would probably see them as two computers, and not a single foam sculpture. This happens because of my long experience interacting with similar environments. Every time I've seen something like that in the past, and then interacted with it, I found it had typical computer-like properties.

So the job of parsing an image into objects is not something you can do very accurately without a long history of interacting with similar environments. Parsing an image into objects is a job of statistical probability, based on past experience.

If you attempt to hard-code some computer algorithm to parse interior scenes accurately into bookcases, and books, and stacks of computers, you will likely find that when that algorithm moves outside into a forest, almost nothing is parsed correctly. You would have to make all sorts of additions to the algorithm for it to even get close to correctly parsing trees and shadows and leaves, and dogs with spots, and patches of snow on the ground.

So, I think the key to getting a robot to understand sensory data like we understand it, is developing adaptive decoding algorithms that tune themselves based on experience.

In addition, a huge advantage we have is that we process temporal data. We extract our information not from static data (like a photograph), but from how the sensory data changes over time. Our expectations are based on probabilistic predictions of how sensory data is likely to change over time because of how the sensory data has changed in the past.

For example, when we see a book with a title printed on it, we see the book as one item, instead of seeing the cover as one item, and the title as being a different item. Part of why we can do this is our expectations of how this image of the book is likely to change over time. If the book moves to the right, we expect the title to move to the right at the same time. If the title moves, we expect the book to move. We see it as one object, because we expect the book, and the title, to have temporally correlated motions. We expect them to change in similar ways, over time.

This is not something any algorithm could know without exposure to many books in the past. How does it know the title is printed on the book, and not just paper which has been cut out and is lying on top of the book? How does it know the paper title might not blow off in a second with a slight gust of wind? It knows because everything it's seen in the past which looked similar to this book makes the sensory data decoding system predict that the most likely event is for the title to move when the rest of the book's features move. As a result, all those different visual features get parsed as "one object". They are temporally predictive of each other, and that's what (I believe) causes the sensory data processing system to associate them as "one object" in the effective parse tree of the pixel data.

I don't know how to actually implement this concept in hardware, but it's the type of concept I believe we need to look at to understand how humans deal with sensory data and how we can "understand" sensory data in terms of objects.
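Just to caricature the grouping step (not the learned expectations, which are the hard part), imagine you already had feature points tracked over a few frames; you could then lump together the features whose motions correlate strongly over time. Everything in this sketch is made up for illustration:

import numpy as np

# Sketch: group tracked feature points into "objects" by how strongly their
# frame-to-frame motions correlate.  `tracks` would come from a feature
# tracker (e.g. optical flow); here it's just a made-up array of shape
# (num_features, num_frames, 2).
def group_by_motion(tracks, threshold=0.9):
    vel = np.diff(tracks, axis=1).reshape(len(tracks), -1)   # dx,dy flattened over time
    n = len(tracks)
    labels = list(range(n))                                  # start: every feature alone

    def find(i):
        while labels[i] != i:
            i = labels[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.corrcoef(vel[i], vel[j])[0, 1] > threshold:  # they move together...
                labels[find(i)] = find(j)                      # ...so merge their groups

    return [find(i) for i in range(n)]

# Two features riding on the same "book", plus one independent feature:
book = np.cumsum(np.random.randn(1, 10, 2), axis=1)
independent = np.cumsum(np.random.randn(1, 10, 2), axis=1)
tracks = np.concatenate([book, book + 0.5, independent])
print(group_by_motion(tracks))   # the first two should share a label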

Reply to
Curt Welch

Yea, I think past experience is definitely a critical factor. Thinking about it more, I guess if I were looking at the computers for the first time (with no machine/computer experience) it would take me a while to figure out that there are two of these objects.

I'll have to keep thinking about these things. Thanks for being open to discussing them with me though -- it would have taken me a lot of time to come close to even considering many of the things you mentioned.

Reply to
Chad Johnson

Yes. It's called Simultaneous Localization and Mapping, or SLAM. There have been major developments in this area since 2000 or so, and there are some algorithms available that work quite well.

Read "SLAM for Dummies" to get started:

formatting link
John Nagle

Reply to
John Nagle

This is a key principle of Jeff Hawkins's "Hierarchical Temporal Memory" theory, which is the basis of the AI work being done by his new company, Numenta. (If that sentence doesn't have enough search terms for anyone's google-fu, I don't know what would!)

Best,

- Joe

Reply to
Joe Strout
