Suppose a robot were constructed which could traverse the interior of a
building, and in doing so build an associative memory map of the
visual layout of that building -- using some kind of neural net. So with
the map built, it could tell, given a picture of anywhere in the
house, where it (the robot) is currently located (relatively) in the
house (i.e. it receives as input a picture of a table (say the kitchen
table), and from that it would know that to the left of that is a
door, and on the other side of that door is the living room (assuming
it somehow knew what portion of its map was designated to be the
Is this even remotely possible?
If so, then would it be feasible to, in some input format, tell the
robot, for example, to "go to room x" (assuming somehow the robot had
learned to associate "room x" with a particular portion of its
Yes. Humans can do it, so it's possible without a doubt. :) It's not easy
I'm not aware of anyone that has done just what you are thinking of, but
there's been a lot of work on related ideas.
I know I've seen robotics mapping projects where the goal of the robot was
to create a 2D map of its environment as it moved around a building, and to
use that map to locate itself in the environment. But as I recall, the
sensor supplying the data was not visual but instead something more easily
usable like laser distance measurements in 360 degrees around the robot.
This would give it distance measurements to the walls. It would then build
up a map of the walls and doors. The project I'm thinking of made heavy
use of statistical techniques. Actually, now that I'm thinking about it,
it might have been given a simple 2D map of the building, and its goal
was to move around and try to figure out where it was on the map. I
think it worked by estimating the probability of its location at all
points on the map (down to some resolution), and using the sensor data to
update the probability of it being at each location on the map until it
received enough data to estimate its location to a high probability. I
recall seeing a video of the computer screen which represented its best
guess as to its current location on the map as it moved around. It
started off not knowing where it was and then quickly narrowed it down to a
few possible areas and then refined that to its actual location.
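To make the probability-grid idea concrete, here is a minimal sketch of that style of localization on a toy one-dimensional circular corridor. This is not the code from the project described above: the corridor layout, the sensor and motion probabilities, and the function names `sense` and `move` are all assumptions made up for illustration.

```python
# Toy grid localization: keep one probability per map cell and update it
# from sensor readings and motion. All numbers here are illustrative.

def sense(belief, world, reading, p_hit=0.8, p_miss=0.2):
    """Sensor update: weight each cell by how well it explains the reading."""
    weighted = [b * (p_hit if cell == reading else p_miss)
                for b, cell in zip(belief, world)]
    total = sum(weighted)
    return [w / total for w in weighted]

def move(belief, step, p_exact=0.8, p_slip=0.1):
    """Motion update on a circular corridor; the robot may under- or overshoot."""
    n = len(belief)
    return [p_exact * belief[(i - step) % n]
            + p_slip * belief[(i - step - 1) % n]   # overshot by one cell
            + p_slip * belief[(i - step + 1) % n]   # undershot by one cell
            for i in range(n)]

# The map the robot is given: what its sensor should see at each cell.
world = ["door", "wall", "wall", "door", "wall",
         "wall", "wall", "door", "door", "wall"]

# Start with no idea where the robot is: uniform probability.
belief = [1.0 / len(world)] * len(world)

# Simulated ground truth: the robot starts at cell 2 and moves right.
true_pos = 2
for _ in range(6):
    belief = sense(belief, world, world[true_pos])
    belief = move(belief, 1)
    true_pos = (true_pos + 1) % len(world)

best = max(range(len(world)), key=belief.__getitem__)
print("best guess:", best, "true position:", true_pos)
```

The belief starts flat, then concentrates on the few cells consistent with the readings, and finally on one cell, which is the behaviour described in the video. A real system would use a 2D grid and a range-sensor model, but the two update steps stay the same.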
Trying to do the same sort of thing from visual data alone would be much
harder. I'm not aware of any project which has done that.
Once it has a map, and the ability to locate itself on the map, then simply
pointing to different locations on the map is a simple way to tell it where
you want it to go. Typing English-like words such as "go to the kitchen"
however would be a bit more complex depending on how flexible you wanted
the command language to be. If, for example, you wanted to talk to it and tell
it things like, "this is the kitchen", and then later be able to tell it,
"go get me a beer from the fridge in the kitchen", then you are at a very
different level of problem than using a mouse to point to a location on
its map to tell it where to go.
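As a rough illustration of the easy end of that spectrum, the sketch below handles only two fixed sentence patterns, "this is the X" and "go to the X", and ties each name to a map coordinate. The class name `PlaceCommands`, the tuple return format, and the coordinates are hypothetical, and anything beyond these two patterns is simply rejected.

```python
import re

class PlaceCommands:
    """Very small command language: name places, then ask to go to them."""

    def __init__(self):
        self.places = {}  # place name -> map position

    def handle(self, text, current_position):
        text = text.strip().lower().rstrip(".!")
        # "this is the kitchen" -> remember the current map position
        m = re.fullmatch(r"this is (?:the )?(.+)", text)
        if m:
            self.places[m.group(1)] = current_position
            return ("learned", m.group(1), current_position)
        # "go to the kitchen" -> look up the stored position
        m = re.fullmatch(r"go to (?:the )?(.+)", text)
        if m:
            name = m.group(1)
            if name in self.places:
                return ("goto", name, self.places[name])
            return ("unknown_place", name, None)
        return ("not_understood", text, None)

robot = PlaceCommands()
print(robot.handle("This is the kitchen", (3, 4)))
print(robot.handle("go to the kitchen.", (0, 0)))
print(robot.handle("go get me a beer from the fridge in the kitchen", (0, 0)))
```

The last command falls through to "not_understood", which is the point: the beer request needs object knowledge and planning that no amount of pattern matching on the sentence provides.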
On Sep 6, 2:40 pm, email@example.com (Curt Welch) wrote:
Cool. Was it able to navigate pretty well?
Yea, I think using image data alone would not work well. I think the
robot would need to be able to have some form of radar and be able to
ping its surrounding environment in order to understand how far away
objects are. It would need to then be able to (somehow) in the neural
net associate distance data with actual objects and specific positions
in the image map it constructs.
I guess if I were doing this with a human I would point and say, "That
is the refrigerator," "That is the kitchen table," "This is the living
room couch." So maybe I could have some UI where I could type
the object name into a textbox, click a button when the robot starts
scanning that object/area, and click another button when it's done to stop
the phrase/object association process.
Is my head too far in the clouds? Is something like this needed in the
robotics field (would it be useful)? Is there a better way than what
I've described? I'd like to do it as closely as possible to how humans
perform navigation and understand their positions (generally speaking).
It would be useful if it worked well. I've never seen such a thing that
worked well enough to be useful however.
No one really knows how humans do these things. At least not well enough
to be able to build a machine to duplicate our skills in these tasks.
If you try to train the robot by example, you have to give it a lot of
examples for it to work. For example, if you point at the fridge and tell
it, that's the refrigerator, what's going to stop it from thinking that all
large white areas are to be known as "the refrigerator"? Or what if
there's a magnet on the fridge and the robot makes the assumption the
magnet is the fridge? So when you say go to the fridge, the robot runs
over to the white-board which also has magnets on it. Before we can
understand such a message, we need to create some parsing of the
environment into objects and have a large base of commonsense experience
about what the person is most likely to be trying to communicate to us.
The basic concept of association is very simple and powerful and a
fundamental part of what makes humans intelligent. But the hard part of
the problem is understanding how to decode raw sensor data into "things"
that the associations can be made with. I don't know of any AI project
that has really solved that part of the problem.
When we, as humans, look at a kitchen, we don't just see raw 2D pixel data.
We see a 3D room full of 3D objects. Before you can use basic high-level
associations like telling the robot the big white box is a refrigerator,
the robot first has to decode that raw 2D data into a description of a 3D
room full of objects, so that the fridge you have in mind is an object the
robot already perceives before you try to give it a name.
But how do we, as humans, learn to see the kitchen in this way? Some
people believe a lot of that ability in us is the result of millions of
years of evolution building custom hardware in us that decodes the visual
data for us in that way (and they believe that we have different hardware
for decoding visual data, than for decoding sound data). So they believe
each part of the brain simply has complex hardware for processing each type
of data and performing each function. So to duplicate that in an AI
project, we need to develop a lot of different complex modules and make
them all work together.
I happen to believe that most of that work is done by one generic type of
hardware which is able to adapt to, and decode whatever type of data you
send to it - like an ANN can learn to decode any data it's been trained to
There are a lot of people looking at these sorts of problems from all
different directions, but there simply are no solutions that match what
humans can do.
Isn't being able to distinguish objects only going to be an issue if
an object's location changes or if the object is moving? What is the
disadvantage if an object is identified based on the image and radar
readings of it and its surrounding environment? It's less human-like,
yes, but would the robot not still be able to locate the object?
On a different note, about the distinguishing objects from one another
and the environment: in my room I have a bookcase, and to the right I
have two computers stacked on top of one another. How do I know that
the computers are not part of the bookshelf -- that I have 3 separate
objects? Some things I notice are (and I am sort of just thinking to
* The bookshelf area is colored differently than the computer area
* The bookshelf does not have buttons or lights or internal shapes
like the computer does; the computer has distinctly different
features than the bookshelf area does.
Now, how do I know there are *two* computers in the computer area of the
image data rather than just one? The computers look different, but
suppose they looked exactly the same and were aligned perfectly so
that they looked like one object. I would have a more difficult time
realizing that there are two computers in that area and not just one.
It would take me a little longer to make this determination, and I
think there is the chance that I may not realize that there actually
are two objects. So to make this determination, I think the biggest
factor would be that I would realize that I am seeing the same or
similar thing twice, and with the number two in my mind, I would
likely poll any existing knowledge about computer cases, and, assuming
I had any, I would likely determine that the height of the area is too
tall to be just one computer. So basically I'm doing a size comparison
against my existing computer casing-related knowledge and determining
whether any cases I've seen have ever been that tall. If the results
are around 50/50, I may inspect the cases closer (e.g. try separating
the two repeated areas).
It seems that distinguishing whether an area in an image is one or
multiple objects involves closeness matching. I very much bet this
could be done with some algorithm. Maybe one that could separate the
image data into multiple areas based on various hard-coded (or even
learned) characteristics, such as size, color, shape, shadows,
internal characteristics (shapes, colors) etc. Then other non-image
inputs could be used as complements, such as physical dimensions from
Any thoughts? :)
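The simplest hard-coded version of the "separate the image into areas" idea is to group touching pixels of the same color. The sketch below does that with a flood fill over a tiny hand-written scene; the scene, the one-letter "colors" and the function name `segment` are assumptions for illustration only.

```python
from collections import deque

def segment(image):
    """Label 4-connected regions of equal value; returns (labels, region_count)."""
    rows, cols = len(image), len(image[0])
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                continue  # already assigned to a region
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not labels[ny][nx]
                            and image[ny][nx] == image[y][x]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count

# B = bookcase color, '.' = wall, C = computer-case color.
# The top two C rows are one computer, the bottom two are another.
scene = [
    "BBBB.CC",
    "BBBB.CC",
    "BBBB.CC",
    "BBBB.CC",
]

labels, count = segment(scene)
print("regions found:", count)
```

Color alone separates the bookcase from the computer area, but the two identical stacked computers come out as a single region, which is exactly the ambiguity described above. Resolving it needs something more, such as the size comparison against prior knowledge of computer cases.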
My point was more to do with the simple idea that the robot wouldn't be
able to understand your question. It wouldn't be able to learn what object
you were talking about if it didn't have a concept of objects that were
close to your concept of objects.
You just can't know that there are two computers there simply by looking at
the image. How, for example, do you know that those two computers stacked
on top of each other are not actually a foam sculpture carefully shaped and
painted to look exactly like two computers stacked together?
When you look at those computers, your brain doesn't parse it as a foam
sculpture sitting next to a bookcase. Your brain parses it as two
computers, which means you can lift the top one, and the bottom one won't
move, and that they have an expected weight which is much heavier than a
Part of your ability to see those as two computers might, for you, come
from the fact that you put them there in the first place. But even if
I walked into your office, knowing nothing about it, I too would probably
see them as two computers, and not a single foam sculpture. This happens
because of my long experience interacting with similar environments. Every
time I've seen something like that in the past, and then interacted with
it, I found it had typical computer-like properties.
So the job of parsing an image into objects, is not something you can do
very accurately without a long history of interacting with similar
environments in the past. The job of parsing the image into objects, is a
job of statistical probability, based on past experience.
If you attempt to hard-code some computer algorithm to parse interior
scenes accurately into bookcases, and books, and stacks of computers, you
will likely find that when that algorithm moves outside into a forest,
almost nothing is parsed correctly. You would have to make all sorts of
additions to the algorithm for it to even get close to correctly parsing
trees and shadows and leaves, and dogs with spots, and patches of snow on
So, I think the key to getting a robot to understand sensory data like we
understand it, is developing adaptive decoding algorithms that tune
themselves based on experience.
In addition, a huge advantage we have is that we process temporal data. We
extract our information not from static data (like a photograph), but from
how the sensory data changes over time. Our expectations are based on
probabilistic predictions of how sensory data is likely to change over time
because of how the sensory data has changed in the past.
For example, when we see a book with a title printed on it, we see the book
as one item, instead of seeing the cover as one item, and the title as
being a different item. Part of why we can do this is our expectations of
how this image of the book is likely to change over time. If the book
moves to the right, we expect the title to move to the right at the same
time. If the title moves, we expect the book to move. We see it as one
object, because we expect the book, and the title, to have temporally
correlated motions. We expect them to change in similar ways, over time.
This is not something any algorithm could know without exposure to many
books in the past. How does it know the title is printed on the book, and
not just paper which is cut out and lying on top of the book? How does it
know the paper title might not blow off in a second with a slight gust of
wind? It knows it because everything it's seen in the past which looked
similar to this book, makes the sensory data decoding system predict that
the most likely event is for the title to move when the rest of the
features of the book move, so as a result, all those different visual
features get parsed as "one object". They are temporally predictive of
each other and that's what (I believe) causes the sensory data processing
system to associate them as being "one object" in the effective parse tree
of the pixel data.
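A toy way to see the "moves together, so it is one object" idea: track a few visual features over several frames and group the ones whose frame-to-frame motion matches. This is only a hand-coded sketch with a fixed tolerance, not the learned, predictive system argued for above; the feature names, positions and the helper names are invented for illustration.

```python
def displacements(track):
    """Frame-to-frame change in position for one tracked feature."""
    return [b - a for a, b in zip(track, track[1:])]

def moves_together(track_a, track_b, tol=0.5):
    """True if the two features shifted by (nearly) the same amount every frame."""
    return all(abs(x - y) <= tol
               for x, y in zip(displacements(track_a), displacements(track_b)))

def group_features(tracks, tol=0.5):
    """Greedily group features whose motion matches the first member of a group."""
    groups = []
    for name, track in tracks.items():
        for group in groups:
            if moves_together(tracks[group[0]], track, tol):
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Horizontal position of each feature over five frames (made-up data).
tracks = {
    "book_cover_edge": [10, 12, 15, 15, 13],
    "book_title":      [14, 16, 19, 19, 17],
    "table_corner":    [40, 40, 40, 40, 40],
    "loose_paper":     [20, 22, 25, 31, 38],  # blows away after frame 3
}

for group in group_features(tracks):
    print(group)
```

The cover edge and the title end up in one group because their motion is identical, while the loose paper separates as soon as it moves on its own. The open problem is predicting that grouping from a single look, before anything has moved, which is where past experience comes in.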
I don't know how to actually implement this concept in hardware, but it's
the type of concept I believe we need to look at to understand how humans
deal with sensory data and how we can "understand" sensory data in terms of
On Sep 6, 11:30 pm, firstname.lastname@example.org (Curt Welch) wrote:
Yea, I think past experience is definitely a critical factor. Thinking
about it more, I guess if I looked at the computers the first time (no
machine/computer experience) it would take me a while to figure out
there are two of these objects.
I'll have to keep thinking about these things. Thanks for being open
to discussing them with me though -- it would have taken me a lot of
time to come close to even considering many of the things you
This is a key principle of Jeff Hawkins's "Hierarchical Temporal Memory"
theory, which is the basis of the AI work being done by his new
company Numenta. (If that sentence doesn't have enough search terms for
anyone's google-fu, I don't know what would!)
On Sep 6, 12:01 pm, email@example.com wrote:
Yes, it is possible to do this with associative memory and neural
It is also possible to do it far more efficiently and reliably by
about the neural nets, and using an object recognition algorithm such
That algorithm looks like it could be useful. One reason I'm
considering a neural net is because it would allow phrases (and in the
future, other things) to be associated with objects it (somehow)
finds. I guess I have a few options here, some of which are:
* Integrate this algorithm with an ANN. Store the image data in
whatever data structure is used by this SIFT algorithm, and in the
neural net store phrases and associate those phrases with parts of the
SIFT algorithm's data structure (indexes or whatnot).
* Integrate this algorithm with an ANN, using it (in some way) in
place of some core internal comparison algorithms (so have a hybrid
ANN-SIFT system). So I'd send raw image data as input to the ANN, the
ANN would have SIFT do its comparison magic, and the ANN would return
the results of SIFT.
* Use the SIFT algorithm exclusively. I don't know whether phrases
could be associated with the image data, though.
Maybe someone has done something like this already. I'll have to
Would SIFT actually store image data in some data structure, or do you
just pass the raw image data for the images to the algorithm each time
you want to compare images? Would I need some supplementary data
structure to store this data?
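One possible shape for such a supplementary structure, sketched without any vision library: keep the feature descriptors extracted from each named object next to its phrase, and identify a new view by nearest-neighbour matching. Nothing here is SIFT itself. The three-number descriptors are made up (real SIFT descriptors are 128-dimensional), the class name `LabeledFeatureStore` is hypothetical, and the 0.8 distance-ratio check is borrowed from the matching step usually paired with SIFT.

```python
import math

def distance(a, b):
    """Euclidean distance between two descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class LabeledFeatureStore:
    """Stores (phrase, descriptor) pairs and votes on the best-matching phrase."""

    def __init__(self):
        self.entries = []

    def add(self, phrase, descriptors):
        for d in descriptors:
            self.entries.append((phrase, tuple(d)))

    def identify(self, query_descriptors, ratio=0.8):
        votes = {}
        for q in query_descriptors:
            ranked = sorted(self.entries, key=lambda e: distance(q, e[1]))
            if len(ranked) < 2:
                continue
            best, second = ranked[0], ranked[1]
            # Accept a match only if it is clearly better than the runner-up.
            if distance(q, best[1]) < ratio * distance(q, second[1]):
                votes[best[0]] = votes.get(best[0], 0) + 1
        return max(votes, key=votes.get) if votes else None

store = LabeledFeatureStore()
store.add("refrigerator", [(0.9, 0.1, 0.0), (0.8, 0.2, 0.1)])
store.add("kitchen table", [(0.1, 0.9, 0.3), (0.0, 0.8, 0.5)])

print(store.identify([(0.88, 0.12, 0.02)]))
print(store.identify([(0.1, 0.88, 0.32)]))
```

Because the phrase is just a key stored beside the descriptors, a sound label or any other tag could be attached the same way, with no neural net involved.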
How flexible are SIFT's association capabilities? Is it as capable as
a neural net? What if, say, I did actually get somewhere with this
robot, and I used SIFT. What if I wanted to associate sounds with the
images in addition to text (obviously I'd need a sound-based algorithm
for processing the sound)? Could I do this without using a neural net?
Don't let terminology confuse you. "Neural networks" are simply one
form of machine learning -- with all the problems and benefits
that brings. They are large multidimensional nonlinear dynamic
feedback systems; they are not a magic bullet.
A simple language parser/command interface will get you much further.
Rely on specialized numerical algorithms for vision processing, put
a language grammar on top for human interaction, and forget about the
Read about it. It's rather elegant and well optimized.
In many ways, SIFT is more capable than a neural network. Why?
Because SIFT was designed and optimized with years of mathematical
insight. As were numerous other vision algorithms.
Neural networks are nothing without training. Even with good
training, they may fail to converge, go unstable, or result in a
highly inefficient pattern matching system. Many simple problems
(e.g. calculate the square root of x <~> identify pairs x and x^2)
just don't map well to a neural-network architecture.
In regards to visual mapping, the OpenSLAM project
(http://openslam.org/) might be of interest.
Wishful, uncritical thinking led to the AI Winter -- a fate predicted
years before by Drew McDermott in his 1976 ACM SIGART paper entitled
"Artificial Intelligence Meets Natural Stupidity". This mindset is to
be avoided like the plague; it causes confusion and blindness where
reality is masked by ignorant dreams.
Apologies for the rant (meant for NN hype rather than you personally),
I have to agree with Bob here -- I spent a lot of time in grad school
working with neural networks, and from that experience, I will argue
that they're of little practical use. The training time for them is far
too long, and most of the time the resulting network is brittle (either
overgeneral or overspecific). There are many other learning algorithms
which train faster and work better.
But Bob, thanks for the link to SIFT; it looks cool. The newer
competitor, SURF, looks even more interesting for robotics, as it's
speedy and a robot vision system surely needs that. Makes me want to
(finally) finish a laptop bot, or some other bot with a real computer on
it, and start doing some real vision work!
Yes. It's called Simultaneous Localization and Mapping, or SLAM.
There have been major developments in this area since 2000 or so, and
there are some algorithms available that work quite well.
Read "SLAM for Dummies" to get started: