The ability to detect and read text in natural scenes is important for a robot to explore, localize, and navigate autonomously in unknown environments. Comprehending text provides meaningful context for a robot to understand its surroundings, and it can be very useful for household robots to recognize labeled objects. Beyond understanding the scene, the ability to speak provides interactivity between the robot and the real world, which is also useful in indoor settings. The goal of this project is to apply state-of-the-art text detection and recognition algorithms to the PR2 robot so that it can read out printed text and signs in indoor environments. The deliverables are a video demonstrating the robustness of the system and a releasable software package integrated with ROS.
To make the robot literate, it must first be able to detect text in its surroundings, so an accurate text detection algorithm is essential. In our project, we implement text detection based on the stroke width transform, which turns out to be very accurate in most settings. We integrate the detection algorithm with an OCR text recognition engine on the ROS platform and perform a live demo on the PR2 robot, making it literate!
Stroke Width Transform
In order to find text in an image like the following:
Following the paper, the image is first converted into an edge map using Canny edge detection. The thresholds are chosen so that letter edges are not missed while avoiding too many spurious edges.
Then the stroke width transform is performed to estimate, for each pixel, the width of the stroke it may belong to. Starting from each edge point, we trace a ray along the gradient direction and stop when the ray hits another edge point. If the angle between the gradients of the two edge points is within a certain tolerance (different from the original paper, I chose +/- pi/2 in my implementation, because a smaller tolerance tends to break strokes into parts, which causes trouble in the following steps), every pixel on the ray is assigned the length of the ray as its stroke width, keeping the minimum if it already has a smaller value. After one pass over all edge points, a second pass sets any pixel whose stroke width exceeds the median value along its ray to that median, which handles the corner parts of letters.
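The ray-tracing pass described above can be sketched in a few lines of NumPy. This is a minimal single-pass sketch, not the project's implementation: it uses simple forward-difference gradients instead of Canny, and omits the median-clamping second pass.

```python
import numpy as np

def stroke_width_transform(gray, dark_on_light=True):
    """Minimal one-pass SWT sketch (forward-difference edges instead of Canny)."""
    g = gray.astype(float)
    gx = np.zeros_like(g)
    gy = np.zeros_like(g)
    gx[:, :-1] = g[:, 1:] - g[:, :-1]   # horizontal gradient
    gy[:-1, :] = g[1:, :] - g[:-1, :]   # vertical gradient
    mag = np.hypot(gx, gy)
    edges = mag > 0.5 * mag.max()       # crude stand-in for Canny
    h, w = g.shape
    swt = np.full((h, w), np.inf)
    for y, x in zip(*np.nonzero(edges)):
        # Unit gradient; for dark text on a light background we march
        # against the gradient, i.e. into the stroke.
        ux, uy = gx[y, x] / mag[y, x], gy[y, x] / mag[y, x]
        dx, dy = (-ux, -uy) if dark_on_light else (ux, uy)
        ray, cx, cy = [(y, x)], float(x), float(y)
        while True:
            cx, cy = cx + dx, cy + dy
            ix, iy = int(round(cx)), int(round(cy))
            if not (0 <= ix < w and 0 <= iy < h):
                break                    # ray left the image: discard it
            ray.append((iy, ix))
            if edges[iy, ix]:
                # Opposite edge found; accept when the two gradients are
                # roughly antiparallel (the +/- pi/2 tolerance above).
                if gx[y, x] * gx[iy, ix] + gy[y, x] * gy[iy, ix] < 0:
                    width = np.hypot(cx - x, cy - y)
                    for ry, rx in ray:
                        swt[ry, rx] = min(swt[ry, rx], width)
                break
    # (The median-clamping second pass over each ray is omitted here.)
    return swt
```

On a synthetic image containing a dark vertical bar five pixels wide, every pixel of the bar comes out with a stroke width of 5.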
Note that letters can be brighter or darker than their background, so the stroke width transform has to be performed in two passes, forward and backward along the edge gradient. The following are the results of the two passes:
Connected Component Analysis
After the stroke width transform, pixels are clustered into components using connected component analysis: neighboring pixels are grouped together if the ratio of their stroke width values lies in [0.3, 3]. The following are the connected components for the two passes.
From the observation that the distance between edges is close to constant within each letter, the stroke width values inside a letter should not vary much. So letter candidates are accepted only when the stroke width variance within a component is below a threshold (the threshold is set based on overall performance on various data sets).
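The grouping and variance-filtering steps above can be sketched as follows. The [0.3, 3] ratio bounds come from the text; the variance threshold (standard deviation at most half the mean) is an illustrative guess, not the tuned value from the project.

```python
import numpy as np
from collections import deque

def swt_components(swt, ratio_lo=0.3, ratio_hi=3.0):
    """Flood-fill pixels with finite stroke width; a neighbour joins a
    component when the ratio of their stroke widths lies in [ratio_lo, ratio_hi]."""
    h, w = swt.shape
    label = np.full((h, w), -1, dtype=int)
    comps = []
    for sy, sx in zip(*np.nonzero(np.isfinite(swt))):
        if label[sy, sx] != -1:
            continue
        label[sy, sx] = len(comps)
        comp, queue = [], deque([(sy, sx)])
        while queue:
            y, x = queue.popleft()
            comp.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < h and 0 <= nx < w and label[ny, nx] == -1
                        and np.isfinite(swt[ny, nx])
                        and ratio_lo <= swt[y, x] / swt[ny, nx] <= ratio_hi):
                    label[ny, nx] = len(comps)
                    queue.append((ny, nx))
        comps.append(comp)
    return comps

def letter_candidates(swt, comps, max_std_over_mean=0.5):
    """Keep components whose stroke width is nearly constant; the 0.5
    threshold is illustrative, not the tuned value from the project."""
    return [c for c in comps
            if np.std([swt[p] for p in c]) <=
               max_std_over_mean * np.mean([swt[p] for p in c])]
```

A uniform-width blob survives the filter, while a component with one outlier stroke width is rejected.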
Having all the letter candidates, two candidates are grouped into a pair if they lie on the same line, are close to each other, and have similar mean stroke width, similar gray-scale values, and similar heights.
Pairs of letters are then chained together if they share an end. My implementation uses an iterative merge procedure so that pairs are chained into sets corresponding to candidate words in the image. Each set is associated with a bounding box in the original image.
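The iterative merge can be sketched as below; letters are represented by indices and pairs by 2-tuples, which is an illustrative encoding rather than the project's actual data structure.

```python
def chain_pairs(pairs):
    """Iteratively merge letter pairs that share an end until no two
    chains overlap; each resulting set is a candidate word."""
    chains = [set(p) for p in pairs]
    changed = True
    while changed:
        changed = False
        for i in range(len(chains)):
            # walk backwards so popping does not shift unvisited indices
            for j in range(len(chains) - 1, i, -1):
                if chains[i] & chains[j]:
                    chains[i] |= chains.pop(j)
                    changed = True
    return chains
```

For example, the pairs (0,1) and (1,2) chain into the word {0, 1, 2}, while (3,4) stays a separate word.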
Since the stroke width transform is performed in two passes, we obtain results for both bright and dark fonts. Interestingly, because of the relatively relaxed rules imposed in the letter filtering step, the algorithm captures regularities even between letters. To combine the two results, we overlay the bounding boxes from the two passes and reject an overlay if it introduces too many non-letter-candidate components.
Output: MacBook Pro
Font Similarity Correction
Sometimes the results from OCR are far from perfect: the engine is designed for document images and has trouble with natural scenes. So we propose a correction process based on font similarity. Correlation scores are generated by convolving each pair of font templates, for example:
We then modified the edit distance algorithm to measure the distance between a pair of words based on how similar they look: the substitution penalty in the modified edit distance is set from the correlation measure between two letters (or digits).
We provide a dictionary of common words, linearly search it while measuring the word distance, and take the closest match as the correction result.
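The correction step above can be sketched with a standard dynamic-programming edit distance whose substitution cost is pluggable. The toy cost table below stands in for the real font-template correlation scores, which the project computes from convolutions.

```python
def visual_edit_distance(a, b, sub_cost):
    """Edit distance whose substitution penalty is a visual-similarity cost
    (in the project this comes from the font-template correlation scores)."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1.0,                       # deletion
                          d[i][j - 1] + 1.0,                       # insertion
                          d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
    return d[m][n]

def correct(word, dictionary, sub_cost):
    """Linear scan over the dictionary; return the visually closest entry."""
    return min(dictionary, key=lambda w: visual_edit_distance(word, w, sub_cost))

# Toy cost table standing in for the real correlation measure:
LOOKALIKES = {frozenset(p) for p in [("O", "0"), ("I", "1"), ("S", "5")]}

def toy_sub_cost(x, y):
    if x == y:
        return 0.0
    return 0.2 if frozenset((x, y)) in LOOKALIKES else 1.0
```

With this cost, a misread like "EX1T" is cheap to substitute into "EXIT", so the dictionary search recovers the intended word.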
Text to Speech
To create interaction between the robot and its environment, we add a step that lets the robot read aloud what it sees. Here we need a text-to-speech system to convert words into sound; in our experiments we use Festival to handle the conversion.
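Festival can be driven from a script by piping plain text to its `--tts` mode. A minimal wrapper might look like this (the `dry_run` flag is a hypothetical convenience, not part of the project code):

```python
import subprocess

def speak(text, dry_run=False):
    """Pipe recognized text to festival's --tts mode, which reads
    plain text from stdin and synthesizes speech."""
    cmd = ["festival", "--tts"]
    if dry_run:                  # lets us inspect the command without festival
        return cmd
    subprocess.run(cmd, input=text.encode(), check=True)
```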
As its name indicates, the project is carried out on the PR2 robot from Willow Garage, and the software is implemented on the ROS platform (Diamondback release). We use the high-resolution Prosilica camera to capture image frames from the robot's viewpoint, perform the detection stage, and have the robot read the signs.
Detection fails when there is too much motion blur from the camera, because motion blur significantly weakens the edges and ruins the first step of the algorithm; such images are hard even for humans. To have the PR2 read only when it is stable, we read the robot state message to obtain the velocity and start detection only after the velocity drops to zero.
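The gating logic is a simple predicate on the velocity reported in the robot state message; a sketch (the function name and epsilon are illustrative, not the project's actual code):

```python
def is_stationary(linear, angular, eps=1e-3):
    """True when all velocity components from the robot state message
    are numerically zero, so text detection is allowed to start."""
    return all(abs(v) < eps for v in list(linear) + list(angular))
```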
Here is the video demonstrating PR2 reading signs and more:
For more information and the code, please visit: http://www.ros.org/wiki/literate_pr2
If you are lucky enough to have a PR2 at your side, check out the whole stack and make your PR2 literate! If not, you can still check out the read_text package and perform detection on images.
There are many improvements that can be made to the project.
In terms of the text detection algorithm, accuracy is good but still far from perfect. It can detect many fonts and different languages, but it has trouble with artistic fonts, since they do not keep a consistent stroke width within each letter and thus fail the letter filtering stage of the algorithm.
The OCR engine is primarily designed for document text, so it is trained on particular sets of fonts; we could train it on our own set, especially the fonts seen in our dataset. We could also integrate the letter detection results from the stroke width transform into the OCR page analysis for better recognition.
The correction stage we have right now is based on the word itself. We could use more information from the context for better correction results; one way is to use an HMM (Hidden Markov Model) trained on word sequences to provide another prior.
1. Boris Epshtein, Eyal Ofek, Yonatan Wexler, "Detecting text in natural scenes with stroke width transform," Computer Vision and Pattern Recognition (CVPR) 2010.
Feedback to Presenters
Great project definition, and an interesting way of watching soccer. Will it be extensible to 3D space, like players passing the ball on the ground?
The demo looks good! From it I can see the "arm" moving very smoothly. I would love to see the formulation in detail.
I really like the idea of coordination between different types of robots. The video is also fun. I think this work could benefit from vision-based door knob detection. Also, if the door is spring-loaded, you might want to have the PR2 go through the door while keeping it from closing.
The simulation of planning in high dimensions looks pretty good, with the UAV flying over the obstacles. It might be better to combine 3D planning locally with 2D planning globally, which could be much faster.
Interesting project; I am looking forward to the future results. Will it lead to robot cooking?
A very practical and useful project, applying planning algorithms to clinical use; I had never thought of that before. Is it possible to have the "robot" turn towards the object?
I am sure it is a good project for mobile robot networks, although I didn't have enough time to get the idea behind the equations.
I think planning with coordination in 4D space is a tough job, so you are doing great! I wonder why the two robots do not start at the same time, and why one pauses when they get close.
It is a fun project to do on the PR2. Since someone in GRASP is already doing this work without a vision component, you might want to put in your detection part to make it more robust.
I like the idea of using HOG and a part model. I would love to see how to make the part model work, since viewpoint changes the appearance of the bag and its parts. Maybe using 3D data can help a lot. Also, you might want to consider the characteristics of the bag surface: it is not smooth, has parallel curves, etc.
The simulation looks good; the UAVs almost cover the whole area. It would be better if you could explain the algorithm more clearly.
It's pretty impressive to see the arm's turning speed improve to twice the original. It would be nice to see it actually implemented on the arm.
I like the video of the two robots coordinating, and the shooting angle is great. Is it possible for the two robots to face the same direction? Maybe that would make it easier for them to move forward together.
It is nice to see an Android application! You might want to think about using the phone camera as a global path planner for the iRobot.
Both the robot and the busy man did a great job! It's impressive to see the system being so reactive. Handling obstacles with real volume, different colors, and non-smooth surfaces in the future would be great!