Wednesday, June 1, 2011

In Conclusion

My original goal was to create a Tic-Tac-Toe game that users play with their hands. I did not meet this goal, but I got pretty close. I finished ALL of the computer vision aspects of my project; all that is left is translating MATLAB code into C# (not a lot, just if-else statements), programming a tic-tac-toe game (3 hours of work max), and manually translating bytes into useful information (ughhh, lots of work). The big setback in my project was how difficult it was to do anything with the live video feed coming from the Kinect. It's the type of thing where, if my computer hadn't been broken for 2 weeks (1/5 of the time of this class), I might have been able to figure it out, but realistically it would have taken me more time.

Even without finishing, I feel I was incredibly successful in this project. Near the end I shifted more toward the computer vision aspects and greatly improved my code. Instead of building the project just for a tic-tac-toe game, I went a lot more generic. I created code that can literally be used to detect any part of the human body in any pose or position. If I combined all my scripts it would literally be as easy as

"python runme.py [dir_of_pos_data] [dir_of_neg_data] [dir_of_test_data] [num_trees]", let it run for 20 minutes to 3 days (depending ;) and then boom. A classifier that dynamically shows you how well it works, that is incredibly generic.

If I had wanted to I could have, as the Prof said, "gotten away with murder" in terms of detecting where in a 3x3 grid someone's hands were, and then, based on whose turn it was, calling it an X or an O. But instead of focusing on the tic-tac-toe game part of the project, I focused on the computer vision aspects, and I feel I learned a lot more that way and have a better final product.

The questions I was looking to answer by engaging in this project were all about the Kinect and, essentially, what it can do. What I learned is that the Kinect is flat-out amazing: an incredibly cheap depth and RGB sensor with remarkable accuracy. I feel that in a few years the Kinect might become a standard for computer vision projects dealing with objects between 1.5 and 6 meters away, once better-documented drivers are made (-_-) and more people play with it.

Tuesday, May 31, 2011

Improvements learned by messing with O data

When I retraced the steps I had taken over the first 6 weeks with my X data and translated them to training a classifier for an O, I discovered multiple things I can do better, and I even fixed a bug! (And by fixing the bug I got even better accuracy :D)

Improvements:
1) Intra-rater reliability, just like David. But instead of a trained dermatologist classifying wrinkles (and ranking them in a different order each time), it is me labeling a positive X or a positive O. Every time I extract a positive X or a positive O from an image, I extract slightly different areas.

To attempt to compensate for this, for each training frame I extract the positive example twice. What I learned from this is that the area that I, as a human, label as the X or the O being made by the person can vary by up to 5% in size/location. While not much, I felt it was significant enough to matter.
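As a rough illustration (not my actual labeling code), the variation between two labelings of the same positive can be measured like this; the (x, y, w, h) box format and the sample numbers are just assumptions for the example.

```python
import numpy as np

def label_variation(box_a, box_b):
    """Rough measure of how much two hand-labelings of the same positive
    example differ in size and location. Boxes are (x, y, w, h); the format
    is assumed for illustration."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Relative difference in area.
    size_diff = abs(aw * ah - bw * bh) / float(aw * ah)
    # Center offset as a fraction of the box diagonal.
    center_diff = np.hypot((ax + aw / 2.0) - (bx + bw / 2.0),
                           (ay + ah / 2.0) - (by + bh / 2.0)) / np.hypot(aw, ah)
    return size_diff, center_diff

# Two labelings of the same X, off by a couple of pixels (~4% area difference).
print(label_variation((100, 80, 64, 48), (102, 81, 63, 47)))
```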

2) Throwing out the top and bottom 10% of the data is not ALWAYS doing what I want it to do. What I was trying to achieve with this was a quick hack to throw out most false positives. Whether it works depends on where in the image the X or O is and on the amount of data, but for the most part it does not do this.

Proof of concept, and an example where it does work: with very small amounts of data, throwing out the top and bottom 10% does what I want it to do. For the following I only used 4 frames of a sequence.

Without throwing out data: (the computer thinks the center of the O is somewhere within the green).

Throwing out top and bottom 10%



As you can see, a couple of false positives GREATLY reduce the accuracy when dealing with a small amount of data. Even though throwing out the top and bottom 10% means the correct area is not selected, the answer is a lot more precise. Again, though, this is with a crazy small amount of data.


I changed this to weighting each positive based on how far it is from the mean (a rough sketch of one way to do this follows the result images below).
Without any weighting:

First attempt: Some improvements, but not as much as I hoped.

Second attempt: Ooohhh yeaahhhh.





Sample final result over 10 frames (note: this result is FAR BETTER THAN THE AVERAGE RESULT WILL BE), and it is pretty much as good as I will be able to make it.
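The exact weighting scheme isn't spelled out above, but as a rough sketch, one way to weight each positive by its distance from the mean (a simple inverse-distance scheme, assumed here purely for illustration) looks like this:

```python
import numpy as np

def weighted_center(centers):
    """Estimate the X/O center from detected box centers, weighting each
    detection by how close it is to the mean. The real weighting used in
    the project may differ; inverse distance is an assumption."""
    centers = np.asarray(centers, dtype=float)   # shape (N, 2): (x, y)
    mean = centers.mean(axis=0)
    dist = np.linalg.norm(centers - mean, axis=1)
    weights = 1.0 / (1.0 + dist)                 # far-away positives count less
    weights /= weights.sum()
    return (centers * weights[:, None]).sum(axis=0)

# Detections pooled over a few frames, with one false positive far from the cluster.
hits = [(120, 95), (122, 97), (118, 94), (121, 96), (300, 40)]
print(weighted_center(hits))
```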


3) Use the positive examples I trained on to determine the aspect ratio of the boxes to scan when looking for the X or O. What this means is that when I labeled all my positive X's, they all had roughly the same width-to-height ratio. This ratio did not really change with distance from the Kinect or hand size (this is a good thing). When scanning frames for an X I was using a ratio of 4/3 (1.3333); the actual ratio is 1.336. Close enough. However, for Os the actual ratio is 1.848, which is quite different from 1.33.
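As a quick sketch, the scan ratio can be pulled straight out of the labeled positives; the (x, y, w, h) box format and the sample numbers here are assumptions for illustration.

```python
import numpy as np

def scan_ratio(boxes):
    """Median width-to-height ratio of the hand-labeled positive boxes.
    Boxes are (x, y, w, h); the format is assumed for illustration."""
    ratios = [w / float(h) for (_, _, w, h) in boxes]
    return np.median(ratios)

# A few hypothetical O labels: wider than tall, roughly 1.8:1.
o_boxes = [(50, 60, 92, 50), (48, 58, 95, 51), (52, 61, 90, 49)]
print(scan_ratio(o_boxes))   # ~1.84, so scan windows for Os should use this ratio
```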


Bug Fixed!

Due to a bug caused by performing a shallow copy instead of a deep copy, I was incorrectly passing in data to be scanned by the X classifier. The exact bug was that once I found a positive on an image, the pixels were modified to label this region based on the tree that detected it as a positive, and this modified data was then passed on for further scans.
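The actual code is in MATLAB/C#, but the same kind of shallow-copy bug is easy to reproduce with numpy, where a slice is just a view into the original frame:

```python
import numpy as np

frame = np.zeros((480, 640), dtype=np.uint8)   # a stand-in frame

# BUG: slicing gives a view (shallow copy), so "labeling" the detected
# region also modifies the frame that later windows are scanned from.
window = frame[100:150, 200:260]
window[:] = 255                                # mark the detection
print(frame[120, 230])                         # 255 -- the original frame changed

# FIX: take a deep copy before modifying, so later scans see clean data.
frame = np.zeros((480, 640), dtype=np.uint8)
window = frame[100:150, 200:260].copy()
window[:] = 255
print(frame[120, 230])                         # 0 -- original frame untouched
```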

O Data

Over the past week or so I have been retracing all the steps I used to successfully create 3 ADTrees that accurately determine an X given a series of images (aka a live video stream), except I have been doing it to determine an O. Overall it has been just as successful, or even more successful, than determining the location of an X.

Also, retracing all my steps has given me a chance to refine my pipeline, and in the process I have discovered better ways to do what I am doing (explained in the next post).

Example of an O:


Results:
Red, green, and blue correspond to the red, green, and blue sections from when I was doing the Xs. They were pulled from the same subset of random numbers.

For Xs, usefulness was Blue >> Green >>>>> Red

For Os, it is something like this: Red >>>>>>>>>>>>> Green > Blue.

Lol random numbers.
Here are some sample result images:

[TODO: Upload whole album]













And just to prove that blue and green exist....



They just never fire. It got to the point where it made me triple-check everything to make sure I was not doing anything wrong with the blue and green classifiers, and as far as I can tell I am not. If I lower the threshold for what counts as a positive, blue and green fire a lot more, but then red fires too much and has a lot of false positives.



Monday, May 16, 2011

Math is cool

This week I worked on locating the most probable location for an X (or an O) based on the data from the multiple trees. When given a live stream of data from the Kinect, this will probably be the last second or so of data (30 frames). However, since my desktop (which I have been using for the live data from the Kinect) is still broken, I worked with what I had: the static images and MATLAB to experiment with.

This is what I came up with, and it seems to be working the best. I do not know if it will be able to run in real time, but if not it will at least be close, and should only require a few modifications to make it real-time runnable.


1) Scan over the whole image with 3 different block sizes: 1/6, 1/7, and 1/8 of the image. Run each block through all 3 trees. Scan the boxes with a 25% overlap. This will result in a lot of positives, some of which will be false positives. However, the highest concentration of positives should be the correct area.
2) Keep the data from the last however many frames, and for each new frame add its data to this collection.
3) Throw out the top and bottom 10% of the data.
4) Find the area where the CENTER of the X (or O) should be by using the mean of the remaining data and the standard deviation. So for the results below, the computer's best guess for the CENTER of the X is the region within the green box; the box does NOT outline the whole area of the X. (A condensed sketch of steps 2-4 follows.)
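Here is a condensed sketch of steps 2-4 operating on a pre-pooled list of detection centers (the sliding-window scanning and tree evaluation from step 1 are omitted); the per-axis trimming and the box construction are simplified for illustration.

```python
import numpy as np

def estimate_center(hit_centers, trim=0.10):
    """Steps 2-4: pool detections from recent frames, drop the top and
    bottom 10% in each coordinate, then report mean +/- one standard
    deviation as the box where the CENTER of the X (or O) should be."""
    pts = np.asarray(hit_centers, dtype=float)           # (N, 2) box centers
    lo, hi = np.percentile(pts, [100 * trim, 100 * (1 - trim)], axis=0)
    keep = np.all((pts >= lo) & (pts <= hi), axis=1)     # trim outliers per axis
    trimmed = pts[keep]
    mean, std = trimmed.mean(axis=0), trimmed.std(axis=0)
    # Green box as (x_min, y_min, x_max, y_max) around the estimated center.
    return (mean - std).tolist() + (mean + std).tolist()

# Detections pooled over recent frames, including one false positive.
hits = [(130, 210), (128, 214), (133, 208), (131, 212), (40, 300), (129, 211)]
print(estimate_center(hits))
```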

Results!
Note: Since I did not have access to a live stream of data, I manually picked out images where the X's are close together. The final image that the box is drawn on is just the first image of the series that was passed in.

Some of the images that were passed in (18 images were passed in; I am not going to upload them all):






Some more result images.






Since I was not using a live stream of data, and was instead eyeballing which images to group together based on them appearing to have Xs in the same area, the final results are not as good as they will be once real live-stream data is used. Also, as you can see from the results, the computer is less certain about the Y coordinate than about the X coordinate. This makes sense, as most false positives are heads, elbows, and knees/upper legs; these throw off the Y coordinate more than the X.

Monday, May 9, 2011

Paper

Here is the paper I have been referencing in class

Progress on the real time data stream

Unfortunately this isn't going as well as I would have hoped. Manipulating the Kinect data, at least with the drivers I am using, is rough. But I am learning lots and making progress; it's just not as fast as I would have hoped.

The main problem I am having is how to access and process the data from the Kinect. The way the driver I am using works is that shared memory is constantly being written to by the Kinect and then read to be displayed. What I am going to have to do is manually go into that memory, hopefully figure out how the video stream is being stored, and manually parse the data to extract the hue value from the depth stream. Once I have this it will hopefully be easy to segment it, extract the values I need, and then pass them through the ADTrees.
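As a rough sketch of what that parsing might eventually look like, assuming the shared memory can be read out as a raw byte buffer and that the colorized depth stream is stored as packed BGR pixels of a known size (both of which are guesses about the driver's memory layout):

```python
import numpy as np
import cv2

WIDTH, HEIGHT = 640, 480   # assumed frame size of the colorized depth stream

def hue_from_raw(raw_bytes):
    """Interpret a raw byte buffer (e.g. read out of the driver's shared
    memory) as a packed BGR frame and pull out the hue channel, which is
    what encodes depth in the colorized depth stream. The BGR layout and
    frame size are assumptions, not confirmed details of the driver."""
    frame = np.frombuffer(raw_bytes, dtype=np.uint8)
    frame = frame.reshape((HEIGHT, WIDTH, 3))             # rows x cols x BGR
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    return hsv[:, :, 0]                                   # hue per pixel

# Fake buffer standing in for one frame of shared memory.
fake = np.random.randint(0, 256, WIDTH * HEIGHT * 3, dtype=np.uint8).tobytes()
print(hue_from_raw(fake).shape)   # (480, 640)
```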

Current:

Mockup:

Also, the computer I have been doing all this development on is going to be out of commission for about the next week, so I will not be able to work on this part of the project at all.

3 ADTrees!

1) Last time I ended by talking about how 100,000 comparisons was too much. I dropped it down to 10,000 and it became much more manageable. I then did this three times, each with different randomly generated numbers, to create three trees for detecting whether a square contains an X or not.

Out of the 10,000 comparisons passed in, they were reduced down to:

Tree 1) 48 comparisons
Tree 2) 52 comparisons
Tree 3) 58 comparisons

After I got these nice results, and after some trial and error (off-by-1 errors are lame), I extracted all the 'useful' numbers out of the random numbers into their own files and manually altered the alternating decision trees to work with these. The result: fast extraction and comparison of pixels from a section of an image. Also, because I was curious, I mapped the comparisons onto an image so I could see what they looked like. To decide whether or not a section of a picture is an X, the endpoints of each line are compared.
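As a rough sketch (not the actual tree code), evaluating one tree's pixel-pair comparisons on a scanned window boils down to something like this; the coordinate pairs below are made up, since the real ~50 pairs per tree live in the extracted "useful numbers" files.

```python
import numpy as np

# Each comparison is two pixel locations inside the scanned window; the tree
# branches on which of the two pixels has the larger value. These pairs are
# placeholders for illustration only.
COMPARISONS = [((5, 12), (20, 33)), ((8, 40), (25, 10)), ((30, 30), (2, 45))]

def comparison_features(window):
    """Return +1/-1 for each pixel-pair comparison in a scanned window."""
    return [1 if window[r1, c1] > window[r2, c2] else -1
            for (r1, c1), (r2, c2) in COMPARISONS]

# A fake 48x64 window cut out of a frame.
window = np.random.randint(0, 256, (48, 64))
print(comparison_features(window))
```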





Red is tree 1
Green is tree 2
Blue is tree 3

As we can see... one of the trees is much better than the other two. Random numbers are random.
And the results: Here