Wednesday, June 1, 2011

In Conclusion

My original goal was to create a Tic-Tac-Toe game that users play with their hands. I did not meet this goal, but I got pretty close. I got ALL of the computer vision aspects of my project done, and all that is left is translating MATLAB code into C# (not a lot, just if-else statements), programming a tic-tac-toe game (3 hours of work max), and manually translating bytes into useful information (ughhh, lots of work). The big setback in my project was how difficult it was to do anything with the live video feed coming from the Kinect. It's the type of thing where, if my computer hadn't been broken for 2 weeks (1/5 of the length of this class), I might have been able to figure it out, but realistically it would have taken me more time.

Even without finishing, I feel the project was incredibly successful. Near the end I transitioned more into the computer vision aspects and greatly improved my code. Instead of writing code just for a tic-tac-toe game, I went a lot more generic: I created code that can be used to detect any part of the human body in any pose or position. If I combined all my scripts it would literally be as easy as

"python runme.py [dir_of_pos_data] [dir_of_neg_data] [dir_of_test_data] [num_trees]", let it run for 20 minutes to 3 days (depending ;) and then boom. A classifier that dynamically shows you how well it works, that is incredibly generic.

If I had wanted to, I could have, as the Prof said, "gotten away with murder" in terms of detecting where in a 3x3 grid someone's hands were, and then, based on whose turn it was, calling it an X or an O. But instead of focusing on the tic-tac-toe game part of the project, I focused on the computer vision aspects; I feel I learned a lot more that way and have a better final product.

The questions I was looking to answer by engaging in this project were all about the Kinect: essentially, what can it do? What I learned is that the Kinect is flat-out amazing. It is an incredibly cheap depth and RGB sensor with remarkable accuracy. I feel that in a few years the Kinect might become a standard for computer vision projects dealing with objects between 1.5 and 6 meters away, once better-documented drivers are made (-_-) and more people play with it.

Tuesday, May 31, 2011

Improvements learned by messing with O data

When I retraced the steps I had taken over the first 6 weeks with my X data and applied them to training a classifier for an O, I discovered multiple things I can do better, and I even fixed a bug! (And by fixing the bug I got even better accuracy :D)

Improvements:
1) Intra-rater reliability, just like David. But instead of a trained dermatologist classifying wrinkles (and ranking them in a different order each time), it is me labeling a positive X or a positive O. Every time I extract a positive X or a positive O from an image, I extract slightly different areas.

To attempt to compensate for this, for each training frame I extract the positive example twice. What I learned from this is that the area that I, as a human, classify as the X or the O being made by the person can vary by up to 5% in its size/location. While not much, I felt it was significant enough to matter.

2) Throwing out the top and bottom 10% of the data does not ALWAYS do what I want it to do. What I was trying to achieve with it was a quick hack to throw out most false positives. Whether it actually does this depends on where in the image the X or O is and on the amount of data, but for the most part it doesn't.

Proof of concept, and an example where it does work: with very small amounts of data, throwing out the top and bottom 10% does what I want it to do. For the following I only used 4 frames of a sequence.

Without throwing out any data (the computer thinks the center of the O is somewhere within the green):

Throwing out top and bottom 10%



As you can see, a couple of false positives GREATLY reduce the accuracy when dealing with a small amount of data. Even though throwing out the top and bottom 10% means the correct area is not quite selected, the answer is a lot more precise. Again, though, this is with a crazy small amount of data.


I changed this to weighting each positive based on how far it is from the mean (a rough sketch of both approaches is after the result images below).
Without any weighting:

First attempt: Some improvements, but not as much as I hoped.

Second attempt: Ooohhh yeaahhhh.





Sample final result of 10 frames (Note: this result is FAR BETTER than the average result will be), and it is pretty much as good as I will be able to make it.
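For reference, here is a minimal sketch of the two center estimates I have been comparing: throwing out the top and bottom 10% of the detections versus weighting each detection by how far it is from the mean. This is numpy rather than my actual MATLAB code, and the inverse-distance weighting is just one way of implementing the weighting idea.

import numpy as np

def trimmed_center(points, trim=0.10):
    # Drop the top and bottom `trim` fraction along each axis, then average.
    points = np.asarray(points, dtype=float)
    lo, hi = np.quantile(points, [trim, 1.0 - trim], axis=0)
    kept = points[np.all((points >= lo) & (points <= hi), axis=1)]
    return kept.mean(axis=0), kept.std(axis=0)

def weighted_center(points):
    # Weight each detection by the inverse of its distance from the mean,
    # so far-away false positives contribute much less to the estimate.
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    dist = np.linalg.norm(points - mean, axis=1)
    weights = 1.0 / (dist + 1e-6)
    weights /= weights.sum()
    center = (points * weights[:, None]).sum(axis=0)
    spread = np.sqrt((weights[:, None] * (points - center) ** 2).sum(axis=0))
    return center, spread

# Detection centers pooled over a few frames, with one obvious false positive.
detections = [(120, 85), (118, 88), (122, 86), (119, 84), (240, 200)]
print(trimmed_center(detections))
print(weighted_center(detections))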


3) Use the positive training examples to determine the aspect ratio of the boxes to scan when looking for an X or an O. What this means is that when I labeled all my positive Xs, they all had roughly the same width-to-height ratio. This ratio did not really change with distance from the Kinect or hand size (this is a good thing). When scanning frames for an X I was using a ratio of 4/3 width-to-height (1.333); the actual ratio is 1.336. Close enough. For Os, however, the actual ratio is 1.848, which is quite different from 1.33.
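As a rough sketch of how that ratio falls out of the labeled data (the box format and the numbers here are made up, not my real labels):

import numpy as np

def mean_aspect_ratio(boxes):
    # boxes: list of (x, y, width, height) crops labeled as positives.
    boxes = np.asarray(boxes, dtype=float)
    return float(np.mean(boxes[:, 2] / boxes[:, 3]))   # width / height

x_boxes = [(10, 20, 160, 120), (40, 35, 172, 129)]     # made-up example labels
o_boxes = [(12, 18, 185, 100), (50, 44, 184, 99)]
print(mean_aspect_ratio(x_boxes))   # ~1.33 with these made-up boxes
print(mean_aspect_ratio(o_boxes))   # ~1.85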


Bug Fixed!

Due to a bug where I performed a shallow copy instead of a deep copy, I was incorrectly passing data into the X classifier. The exact bug was that once I found a positive on an image, the pixels were modified to label that region based on the tree that detected it as a positive, and this modified data was then passed on for further scanning.
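A simplified illustration of the problem in numpy terms (my actual code is structured differently): a slice of an array is a view, not a copy, so drawing the detection label onto the scan window also modified the frame that later scans read.

import numpy as np

frame = np.zeros((480, 640), dtype=np.uint8)

# Buggy version: slicing returns a view, so "labeling" the detected window
# also modifies `frame`, which later scans (and the other trees) then read.
window = frame[100:220, 200:360]
window[:] = 255
print(frame.max())                       # 255 -- the original frame was altered

# Fixed version: classify a copy of the region, and draw any labels on a
# separate display image rather than on the data being scanned.
frame = np.zeros((480, 640), dtype=np.uint8)
display = frame.copy()                   # labels/markup go here
window = frame[100:220, 200:360].copy()  # classifier sees untouched data
window[:] = 255
print(frame.max())                       # 0 -- the scan data stays clean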

O Data

Over the past week or so I have been retracing all the steps I used to successfully create 3 ADTrees that really accurately determine an X given a series of images (aka a live video stream), except this time for an O. Overall it has been just as successful as, or even more successful than, determining the location of an X.

Also, retracing all my steps has given me a chance to refine my pipeline, and in the process I have discovered better ways to do what I am doing (explained in the next post).

Example of an O:


Results:
Red, Green, and Blue correspond to the red, green, and blue sections from when I was doing the Xs. They were pulled from the same subset of random numbers.

For Xs, usefulness was Blue >> Green >>>>> Red

For Os, it is something like this. Red >>>>>>>>>>>>> Green > Blue.

Lol random numbers.
Here are some sample result images:

[TODO: Upload whole album]













And just to prove that blue and green exist....



They just never fire. It got to the point where it made me triple-check everything to make sure I was not doing anything wrong with the blue and green classifiers, and as far as I can tell I am not. If I lower the threshold for what counts as a positive, blue and green fire a lot more, but then red fires too much and has a lot of false positives.



Monday, May 16, 2011

Math is cool

This week I worked on locating the most probable location for an X (or an O) based on the data from the multiple trees. When given a live stream of data from the Kinect, this will probably use the last second or so of data (30 frames). However, since my desktop (which I have been using for the live Kinect data) is still broken, I worked with what I had and used the static images plus MATLAB to experiment.

This is what I came up with, and it seems to be working the best. I do not know if it will be able to run in real time, but if not it will at least be close and should only require a few modifications to make it real-time runnable.


1) Scan over the whole image with 3 different block sizes: 1/6, 1/7, and 1/8 of the image. Run each block through all 3 trees, scanning the boxes with a 25% overlap (a sketch of this step is right after this list). This will result in a lot of positives, some of them false positives; however, the highest concentration of positives should be the correct area.
2) Keep the data from the last however many frames, and for each new frame add its data to the collection.
3) Throw out the top and bottom 10% of the data.
4) Find the area where the CENTER of the X (or O) should be using the mean and standard deviation of the remaining data. So for the results below, the computer's best guess for the CENTER of the X is the region within the green box; the box does NOT outline the whole area of the X.
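Here is the sketch mentioned in step 1 of how the scan windows could be generated at the three block sizes with the 25% overlap. This is Python rather than my MATLAB experiment, and classify is just a stand-in for running a window through the three trees.

def scan_windows(img_w, img_h, fractions=(1/6.0, 1/7.0, 1/8.0), overlap=0.25):
    # Yield (x, y, w, h) boxes covering the image at several block sizes,
    # stepping so that neighboring boxes overlap by roughly `overlap`.
    for frac in fractions:
        w = int(img_w * frac)
        h = int(img_h * frac)
        step_x = max(1, int(w * (1.0 - overlap)))
        step_y = max(1, int(h * (1.0 - overlap)))
        for y in range(0, img_h - h + 1, step_y):
            for x in range(0, img_w - w + 1, step_x):
                yield (x, y, w, h)

# positives = [box for box in scan_windows(640, 480) if classify(frame, box)]
# The centers of the positive boxes from the last ~30 frames are what get
# pooled, trimmed, and averaged in steps 2-4.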

Results!
Note: Since I did not have access to a live stream of data, I manually picked out images where the Xs are close to each other. The final image that the box is drawn on is just the first image of the series that was passed in.

Some of the images that were passed in for this (18 images were passed in; I'm not going to upload them all):






Some more result images.






Since I was not using a live stream of data, and was instead eyeballing which images to group together based on whether they appeared to have Xs in the same area of the image, the final results are not as good as they will be once real live-stream data is used. Also, as you can see in the results, the computer is less certain about the Y coordinate than about the X coordinate, where it is generally pretty sure. This makes sense, as most false positives are heads, elbows, and knees/upper legs; these throw off the Y coordinate more than the X.

Monday, May 9, 2011

Paper

Here is the paper I have been referencing in class

Progress on the real time data stream

Unfortunately this isn't going as well as I would have hoped. Manipulating the Kinect data, at least with the drivers I am using, is rough. But I am learning lots and making progress; it's just not as fast as I would have hoped.

The main problem I am having is figuring out how to access and process the data from the Kinect. The driver I am using works through shared memory, which is constantly written to by the Kinect and then read to be displayed. What I am going to have to do is manually go into that memory, hopefully figure out how the video stream is being stored, and manually parse the data to extract the hue value from the depth stream. Once I have that, it will hopefully be easy to segment it, extract the values I need, and then pass them through the ADTrees.

Current:

Mockup:

Also, the computer I have been doing all this development on is going to be out of commission for about the next week, so I will not be able to work on this part of the project at all.

3 ADTrees!

1) Last time I ended by talking about how 100,000 comparisons was too many. I dropped it down to 10,000 and it became much more manageable. I then did this three times, each time with different randomly generated numbers, to create three trees for detecting whether a square is an X or not.

Out of the 10,000 comparisons passed in, the trees were reduced down to:

Tree 1) 48 comparisons
Tree 2) 52 comparisons
Tree 3) 58 comparisons

After I got these nice results, and after some trial and error (off-by-one errors are lame), I extracted all the 'useful' numbers out of the random numbers into their own files and manually altered the alternating decision trees to work with these. The result: fast extraction and comparison of pixels from a section of an image. Also, because I was curious, I mapped the comparisons onto an image so I could see what they looked like. To decide whether or not a section of a picture is an X, the endpoints of each line drawn below are compared.
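The mapping itself is simple; here is a rough sketch (OpenCV assumed, and the comparison values are made up) of drawing each surviving comparison as a line between its two endpoints, in the color of the tree that uses it. Running this for each tree's comparisons on the same canvas gives an overlay like the images below.

import cv2
import numpy as np

def draw_comparisons(image, comparisons, color):
    # comparisons: list of (x1, y1, x2, y2) in [0, 1] relative coordinates.
    h, w = image.shape[:2]
    for x1, y1, x2, y2 in comparisons:
        p1 = (int(x1 * w), int(y1 * h))
        p2 = (int(x2 * w), int(y2 * h))
        cv2.line(image, p1, p2, color, 1)
    return image

canvas = np.zeros((240, 320, 3), dtype=np.uint8)
tree1 = [(0.1, 0.2, 0.8, 0.7), (0.3, 0.9, 0.6, 0.1)]   # made-up comparisons
draw_comparisons(canvas, tree1, (0, 0, 255))            # red = tree 1 (BGR)
cv2.imwrite("comparisons.png", canvas)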





Red is tree 1
Green is tree 2
Blue is tree 3

As we can see... one of the trees is much better than the other two. Random numbers are random.
And the results: Here

Monday, April 25, 2011

Update on the 3 todo's

1) I started working on the final code. Through www.dreamspark.com I was able to get Visual Studio Pro for free (cheers for being a student). Progress has been slow so far as I have never coded a Windows application before, have never used C#, etc. But I am learning lots!

2) I started collecting data for Os, and finished labeling data for Xs. For Xs I also mirrored all the data I have to effectively double it, leaving me with 400 positive and currently 1600 negative examples (I might add more if I need to bootstrap further).



3) I decided to go all out with my data collection... and am using 100,000 random comparisons. Progress has been slow because a lot of the scripts I was using before don't work so well with this much data... they just crash (and not very gracefully). I was at the point of doing the machine learning when one of my sticks of RAM died (down to 8 gigs from 12), so that is also going to hinder things. At this point, though, I am really curious how much better, if at all, using so much data will work compared to only 3000 original data points. Hopefully I will know soon!

Sunday, April 17, 2011

This Week and Beyond's Todo List

1) Start working on the final code. Figure out what I need, whether I'm going to do it on Windows or Linux, what libraries I will need, etc. I believe that I am in a good enough position for detecting an X from a non-X that it is time to move on from that sole problem. There are some things I can do with real-time data when detecting an X (like averaging if there is an X over multiple frames) that are not possible when just detecting an X in a single image.


2) Collect data for Os and finish labeling data for Xs.


3) Out of the 3000 random pixel comparisons I extract and pass in for detecting an X versus a non-X, only 31 are used in the current ADT. Wut.

So what this means is that I can use 10,000, or maybe even 100,000, random comparisons, have jboost crunch numbers overnight, and then just have it tell me which comparisons are the most useful. I can then extract those useful comparisons from the random values, pass in only those, and have extremely fast, extremely accurate code. When checking for an X there will be no need to extract all 3000 comparisons, pass them all in, and then have ONLY 31 be used.

With the current 3000 random pixel comparisons the results are pretty good, but there are probably some comparisons that are better. By using 10,000 or 100,000 comparisons, those better comparisons will be picked out and give even better results.
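The extraction step itself is trivial once I know which comparisons the tree kept; something like this (the file names and indices here are hypothetical):

# Indices of the comparisons the trained ADTree actually tests, read off
# from the jboost output by hand (these particular numbers are made up).
used_indices = [3, 17, 42, 981, 4005]

with open("random_comparisons.txt") as src:      # hypothetical file names
    rows = src.readlines()

with open("useful_comparisons.txt", "w") as dst:
    for i in used_indices:
        dst.write(rows[i])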

More Progress

This last week I tried a couple of things. First, instead of comparing lots of random points, I tried comparing random lines. This did not work anywhere near as well as the random points did. The first major setback was how slow it was compared to random points. Maybe this was just because it was done in MATLAB, but comparing 100 random lines was slower than 4000 random points (and I usually don't even use 4000 points). And with only 100 random lines, there was not enough data for effective detection.

After that I went with another suggestion from class: background subtraction. The Kinect gives you something you don't get from RGB data: you can make assumptions about what is the background and then get rid of it. So I modified my image-converting script to do background subtraction.

Example of an image with background subtraction. The background is gone! (and replaced with black)
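A minimal sketch of the idea (OpenCV assumed; the filename and cutoff are made up): since in my converted images closer objects are lighter, anything darker than some cutoff can be treated as background and blacked out.

import cv2

depth_img = cv2.imread("frame_0001.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
if depth_img is not None:
    foreground = depth_img.copy()
    foreground[depth_img < 60] = 0     # 60 is an arbitrary example cutoff
    cv2.imwrite("frame_0001_bgsub.png", foreground)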

I also collected a lot more data, labeling the negatives and positives that I believed would help my results the most. The most recent results were run with 120 positive and 600 negative labels. Some of the test images included Xs that WERE NOT in the positive training examples, and those Xs were detected just as well as the Xs that were in the training data. Results will be in a future post.

Monday, April 11, 2011

Results!

So, results. So far the best way to classify Xs has been by:

Step 1) Generate a bunch of random numbers. I did 40,000, which allows for 10,000 comparisons in a single image (an x and y for each point, two points per comparison). These numbers are between 0 and 1 and represent a percentage, so to get the pixel you take the percentage value and multiply it by the width or height of the image/section you are looking at.

Step 2) Compare the darkness of each pair of pixels and label the comparison as 1, 0, or -1 depending on how they compare.

Step 3) Repeat. A lot.

Step 4) Allow jboost to do its magic.
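A minimal sketch of steps 1 and 2 in numpy (not my exact scripts): scale the random numbers by the section's width and height to pick pixel pairs, then record the sign of each darkness comparison as the feature that gets handed to jboost.

import numpy as np

rng = np.random.RandomState(0)
NUM_COMPARISONS = 2000
# Four numbers per comparison: (x1, y1, x2, y2) as fractions of the section size.
random_points = rng.rand(NUM_COMPARISONS, 4)

def comparison_features(section):
    # section: 2-D grayscale array (the region of the depth image being checked).
    h, w = section.shape
    x1 = (random_points[:, 0] * (w - 1)).astype(int)
    y1 = (random_points[:, 1] * (h - 1)).astype(int)
    x2 = (random_points[:, 2] * (w - 1)).astype(int)
    y2 = (random_points[:, 3] * (h - 1)).astype(int)
    # Sign of the darkness difference: 1, 0, or -1 for each pair.
    return np.sign(section[y1, x1].astype(int) - section[y2, x2].astype(int))

example_section = (rng.rand(120, 160) * 255).astype(np.uint8)
print(comparison_features(example_section)[:10])   # each feature is -1, 0, or 1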


For the results below I only used 2 thousand random comparisons.

Results 1

As you can see, I was scanning regions that were too big. Even with this fairly big error, the results were surprisingly good. There are two regions that it constantly detects as positives that clearly are not: the person's crotch and the random wall segment to the person's bottom right. For the next round of training, more negative examples of these regions will be passed in to hopefully correct this issue.


Results 2


Scanning a smaller region, for whatever reason, gave worse results than the bigger region, but they were still fairly good. Also, in both result sets it was better at detecting my girlfriend's Xs than mine. Not fair.


Next things todo:
1) More data. I have a feeling that with 200 positive and 1000 negative examples (compared to the 100 positive and 300 negative for these results) my results will be significantly better, but I will have to see. This is partially because I will focus more of the training on the specific regions that I can see need the most help.

2) Instead of just two points, try comparing along a line (top half vs. bottom half).

3) Start working on making it real time in C/C#/C++ with real-time data off of the Kinect.



I tried graphing out relevant data but couldn't get any graphs that showed anything useful.

TODO List: Completed

1) Data from the Kinect is now displayed and interpreted in a MUCH more consistent manner: the closer an object is to the Kinect, the lighter it is; the farther away, the darker.

Example:

After collecting the data from the Kinect, I converted the image to HSV colorspace, set the Hue data as the Value data, and set Hue and Saturation to 0. The result: it makes some things visible that were invisible before.
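A rough sketch of that conversion in OpenCV (the filename is hypothetical, and my converter script may not do it exactly this way):

import cv2
import numpy as np

img = cv2.imread("kinect_depth_frame.png")             # hypothetical file name
if img is not None:
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    # Copy Hue into Value, zero out Hue and Saturation.
    remapped = cv2.merge([np.zeros_like(h), np.zeros_like(s), h])
    out = cv2.cvtColor(remapped, cv2.COLOR_HSV2BGR)
    cv2.imwrite("kinect_depth_consistent.png", out)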

------>






2) More data has been collected! The results in the next post were done with 100 positive and 300 negative examples. I am slowly working my way up; I think I am going to go to 200 positive and 1000 negative (currently at 100/415). With only 100 positive examples I still don't feel like I have enough positive data.


3) To label positive examples I cropped out the regions from the top of the fingers (whichever finger was higher) to the bottom of the wrists.

Examples:


4) Tried multiple things. I'll explain more in the next post (the results post). The best-working way (so far) has been to simply compare the darkness of two random pixels, then label this relationship as 1 or -1 depending on which pixel is darker, or 0 if they are the same.

Wednesday, April 6, 2011

TODO

1) Figure out a better way to parse the data from Kinect in a more consistent fashion.

2) Get more, better data.

Microsoft uses a million images for three trees. (http://research.microsoft.com/pubs/145347/BodyPartRecognition.pdf)

Go up to 100 positive and 500 negative examples for now.

3) Better, more consistent way of labeling positive examples. Probably something like:
or



4) Try random lines, try Viola-Jones style, try other things!

Sunday, April 3, 2011

Machines learning what an X is

Today I worked on attempting to detect a user creating an 'X' with their hands/arms based solely on the depth data from the Kinect camera. I went through about 3 or 4 iterations to get to where I am now, and unfortunately none of them have had much success :(.

Essentially I was feeding in positive and negative samples of an 'X', labeling them as such, and then trying to figure out a good way to classify these samples such that in the future a computer would be good at discovering them.

Examples (first 3 positive, last 3 are negative):








At first I was just doing general histograms of the Value channel (of the image in HSV format), without much success at all, since histograms have no relation to where the values are located. Several iterations later I was breaking the positive and negative images down into 3x3 grids and averaging the Value in each cell, so the features relate to where the bright areas are, with some minor success...
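A minimal sketch of that 3x3-grid feature in numpy (not the MATLAB I was actually using): split the Value channel into a 3x3 grid and use each cell's average brightness as a feature.

import numpy as np

def grid_averages(value_channel, grid=3):
    # value_channel: 2-D array (the V channel of the HSV image).
    h, w = value_channel.shape
    feats = []
    for row in range(grid):
        for col in range(grid):
            cell = value_channel[row * h // grid:(row + 1) * h // grid,
                                 col * w // grid:(col + 1) * w // grid]
            feats.append(float(cell.mean()))
    return feats   # 9 numbers tied to where the bright (near) pixels are

example = (np.random.rand(90, 120) * 255).astype(np.uint8)
print(grid_averages(example))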





Yah.... going to need to do something different.