I have been planning a robotics project and will base my system on a Mini-ITX board with a 1 GHz VIA CPU. I will also use a single USB webcam (most probably the Logitech QuickCam Pro 4000) for its vision. Everything will be programmed in Java using JMF. But before I start building the thing, I want to work more on the "cognitive" tasks of the robot first.
My main task for the first part of the project is to have the robot respond to a speech command (using IBM ViaVoice 10) asking it to identify the items it can see and read them out loud (speech via AT&T Natural Voices). I can basically develop all of this on a standard computer today.
I have looked into some vision algorithms, and the ones that seem to have a good degree of accuracy are those based on feature learning and recognition with the AdaBoost algorithm (which I do not have deep knowledge of yet). These seem to be based on training a classifier so that it learns what e.g. a face looks like from simple, tiny primitives (contrasts in the image). However, the papers I have read lack the "hands-on" details I'd like to see. If anyone has code examples of these algorithms (and hopefully a better explanation) I would be very happy to get some references.
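To make the idea concrete for myself, here is a rough Java sketch of what I understand AdaBoost to do, heavily simplified: each "feature" is just a single number per sample (rather than a real Haar-like image feature), and each weak classifier is a threshold on one feature. The data and all the names here are made-up toy illustrations, not the real detector.

```java
import java.util.Arrays;

// Minimal AdaBoost sketch: weak classifiers are decision stumps
// (a threshold on one feature), combined by weighted vote.
public class AdaBoostSketch {

    // One weak classifier: feature index, threshold, polarity, vote weight.
    static class Stump {
        int feature; double threshold; int polarity; double alpha;
        int predict(double[] x) {
            return polarity * x[feature] < polarity * threshold ? 1 : -1;
        }
    }

    // Train `rounds` stumps on samples X with labels y in {-1, +1}.
    static Stump[] train(double[][] X, int[] y, int rounds) {
        int n = X.length, d = X[0].length;
        double[] w = new double[n];
        Arrays.fill(w, 1.0 / n);               // start with uniform sample weights
        Stump[] stumps = new Stump[rounds];
        for (int t = 0; t < rounds; t++) {
            Stump best = null; double bestErr = Double.MAX_VALUE;
            // exhaustive search over feature / threshold / polarity
            for (int f = 0; f < d; f++)
                for (int i = 0; i < n; i++)
                    for (int p : new int[]{-1, 1}) {
                        Stump s = new Stump();
                        s.feature = f; s.threshold = X[i][f]; s.polarity = p;
                        double err = 0;
                        for (int j = 0; j < n; j++)
                            if (s.predict(X[j]) != y[j]) err += w[j];
                        if (err < bestErr) { bestErr = err; best = s; }
                    }
            bestErr = Math.max(bestErr, 1e-10);
            best.alpha = 0.5 * Math.log((1 - bestErr) / bestErr);
            // re-weight: boost the samples this stump got wrong
            double sum = 0;
            for (int j = 0; j < n; j++) {
                w[j] *= Math.exp(-best.alpha * y[j] * best.predict(X[j]));
                sum += w[j];
            }
            for (int j = 0; j < n; j++) w[j] /= sum;   // normalise weights
            stumps[t] = best;
        }
        return stumps;
    }

    // Strong classifier: sign of the weighted vote of all stumps.
    static int classify(Stump[] stumps, double[] x) {
        double vote = 0;
        for (Stump s : stumps) vote += s.alpha * s.predict(x);
        return vote >= 0 ? 1 : -1;
    }

    public static void main(String[] args) {
        // toy data: class +1 has a larger value in feature 0
        double[][] X = {{0.9, 0.2}, {0.8, 0.7}, {0.1, 0.6}, {0.2, 0.1}};
        int[] y = {1, 1, -1, -1};
        Stump[] model = train(X, y, 5);
        for (double[] x : X) System.out.println(classify(model, x));
    }
}
```

In the face-detection papers the features are rectangle contrasts instead of raw numbers, and the trained stumps are chained into a cascade, but the weighting/re-weighting loop above is, as far as I understand, the core of the algorithm.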
I have previously worked with neural nets using the back-propagation learning algorithm, which worked quite well (used for image compression work by Prof. Munro at the University of Pittsburgh), and it too learns features in an image. I might try this method as well and see how I can adapt it to vision. The learning process can take a long time, but it is important that the identification runs close to real-time. The work around AdaBoost seems to address this (using an integral image to quickly calculate features). I am just not sure how to go about training the classifier and identifying the features found. More details about sliding the recognition window across the image at different scales are also needed. I have some ideas, but why re-invent the wheel?
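The integral-image trick at least is simple enough that I can sketch it in Java: after one pass over the image, the sum of any rectangle of pixels can be read off with four lookups, which is what makes sliding a detection window across the frame at many scales cheap. (A rectangle contrast feature is then just the difference of two such sums.) The image here is a made-up 3x3 example.

```java
// Integral image: constant-time rectangle sums after one pass over the image.
public class IntegralImage {

    // ii[y][x] = sum of all pixels above and to the left of (x, y), exclusive,
    // with an extra row and column of zeros so the lookups need no edge cases.
    static long[][] build(int[][] img) {
        int h = img.length, w = img[0].length;
        long[][] ii = new long[h + 1][w + 1];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                ii[y + 1][x + 1] = img[y][x] + ii[y][x + 1] + ii[y + 1][x] - ii[y][x];
        return ii;
    }

    // Sum of the rectangle with top-left (x, y), width w, height h: four lookups.
    static long rectSum(long[][] ii, int x, int y, int w, int h) {
        return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x];
    }

    public static void main(String[] args) {
        int[][] img = {
            {1, 2, 3},
            {4, 5, 6},
            {7, 8, 9},
        };
        long[][] ii = build(img);
        System.out.println(rectSum(ii, 0, 0, 3, 3)); // whole image: 45
        System.out.println(rectSum(ii, 1, 1, 2, 2)); // 5 + 6 + 8 + 9 = 28
    }
}
```

Since `rectSum` costs the same regardless of rectangle size, the scanning window can be grown directly instead of shrinking the image, which I gather is why the papers claim near real-time rates.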
Also, it seems that some parameters from the environment context can assist the recognizer. E.g. if it can see a face, there is a higher probability that there is a body underneath it. Coffee cups most probably stand upright rather than lie on their side. If other items indicate that it's indoors (or the robot might already know this), then the thing it observes on the table is not likely to be a dolphin (even if that got a high match). Of course these rules are quite manual in the sense that we need to weight them, and there are all sorts of exceptions (there might be a picture of a dolphin on the wall). A context graph of the relations between objects seems to be the way to go. E.g. the rule "Cup - lies on top of - Table" might be an indication that if it has seen a cup, the "blob" it has observed underneath might be a table. A mental memory map can be created based on these relations too.
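A first cut at such a context graph could be as simple as weighted relations that nudge the recognizer's raw score up or down depending on what else has been observed. All the object names, relations and weights below are made-up illustrations, not a real knowledge base:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Toy context graph: relations carry a weight that adjusts a raw match score.
public class ContextGraph {

    // boost.get(object).get(context) = how much seeing `context` changes belief in `object`
    private final Map<String, Map<String, Double>> boost = new HashMap<>();

    void addRelation(String object, String context, double weight) {
        boost.computeIfAbsent(object, k -> new HashMap<>()).put(context, weight);
    }

    // Combine the recognizer's raw score with context boosts, clamped to [0, 1].
    double adjustedScore(String object, double rawScore, Set<String> observed) {
        double score = rawScore;
        Map<String, Double> rel = boost.getOrDefault(object, Map.of());
        for (String ctx : observed)
            score += rel.getOrDefault(ctx, 0.0);
        return Math.max(0.0, Math.min(1.0, score));
    }

    public static void main(String[] args) {
        ContextGraph g = new ContextGraph();
        g.addRelation("face", "body", 0.25);        // a body below makes a face more likely
        g.addRelation("table", "cup", 0.25);        // "Cup - lies on top of - Table"
        g.addRelation("dolphin", "indoors", -0.5);  // indoors, a dolphin match is suspect

        Set<String> seen = Set.of("cup", "indoors");
        System.out.println(g.adjustedScore("table", 0.5, seen));    // 0.75
        System.out.println(g.adjustedScore("dolphin", 0.75, seen)); // 0.25
    }
}
```

The additive weights are of course a crude stand-in for real probabilities, but the same graph structure could later hold the "mental memory map" of observed relations.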
The use-case needs to be limited to certain tasks at first, though. If I can get it to respond to the voice command "Tell me what you see" with the spoken answer "I see two faces, a ball and a cup", I would be very satisfied!
As for "intelligent" conversation I have been looking at ALICE, which uses AIML, an XML-based language for simulating intelligent conversation, where the robot can learn about its surroundings from input (which I hope to be able to do with speech recognition). For instance, it should be able to recognize a face in its vision and ask "Hello, what is your name?", to which your reply will be memorized and used in later conversation. It should also act like a simple intelligent agent that I could ask for information like "What time is it?" and "What is the weather forecast?" (in which case it would use its WLAN access to get weather information).
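As a sketch of what I have in mind, the name-learning exchange might look something like this in AIML (the predicate name `username` is just my own choice, not anything mandated by the language):

```xml
<category>
  <pattern>MY NAME IS *</pattern>
  <template>
    Nice to meet you, <set name="username"><star/></set>!
  </template>
</category>

<category>
  <pattern>WHAT IS MY NAME</pattern>
  <template>Your name is <get name="username"/>.</template>
</category>
```

The `<set>`/`<get>` pair is the memory mechanism; hooking that up to the speech recognizer and to the vision trigger ("a new face appeared, so ask for a name") would be my job.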
I guess there are many ways I could take this, but as a start I'd like to do as much as possible using a normal computer and a webcam, and then work more on the hardware and building the robot. Most people seem to do it the other way around though.
Best regards, JC