Presentation comments below...

Backpack Detection using Image and Depth



We are interested in the general problem of object category detection, i.e. we know the kind of object we are looking for, and we would like to find instances of that kind of object in any given scene. An autonomous robot designed to perform a set of tasks would benefit from a robust object detection capability since many tasks involve a small, known set of objects that have been previously learned during task training.

Detecting objects of a known category combines object detection and object categorization. In contrast to object recognition, where one seeks to identify all known object instances within a scene, object categorization seeks to recognize not only specific instances of objects but their kind as well, even if a specific instance hasn't been seen before (e.g. all bikes, all airplanes, etc.). While we are ultimately interested in a scalable object detection and categorization system, this paper focuses on detecting a single object class within cluttered environments: specifically, backpacks in an indoor office environment.

Why do we focus on backpacks? Detecting backpacks in a cluttered environment is an especially difficult task: backpacks are deformable objects that appear in many configurations. A bag can look considerably different when full versus empty, and it can be found in a variety of places within an indoor setting: on someone's back, on the floor, on a desktop, standing up or lying down, sitting on a chair, etc. In addition, the detection of deformable containers like backpacks and duffels by autonomous systems (whether mobile or not) has a variety of applications in safety- or security-sensitive scenarios, e.g. airport and building search.

The rest of this paper discusses the first steps in exploring the detection of a standard backpack in a cluttered indoor environment using the image and point cloud data produced by a low-cost, commercial vision sensor, the MS Kinect. We describe an object detection system that utilizes both color image data from the Kinect camera and point cloud data from the Kinect's structured light stereo system. We obtain reasonable results using a single prototype backpack image and several windowing schemes tested on a challenging set of recorded capture sequences.

Figure 1: An example detection, visualized as a segmented point cloud. The red sphere indicates the detection hypothesis.

Related Work

Until human-level recognition is achieved, there will always be a tremendous amount of work in object detection, recognition and categorization. We consider similar work along several representative dimensions: bag of words, dense descriptors, deformable parts, and combined image and depth approaches.

One of the simplest approaches to object detection is the bag of words model. Inspired by the information retrieval community's use of document vectors [Baeza-Yates1999], the bag of words model extracts distinctive visual features across many images in a dataset, and represents an object's appearance within an image by a histogram of these features. A representative description of this approach can be found in Csurka et al.'s comparison of two classifiers using a bag of Harris affine interest points [Csurka2004]. Representing appearance by an unordered set of invariant features has the benefits of simplicity and robustness to pose variation, but doesn't always capture interclass variation or handle low-feature objects. There is therefore a trend to incorporate geometry, even into the bag of words approach, as additional features. This can be done in a weak fashion as in [Csurka2006], where the geometric features used are the collocation of features in a region, along with their scale and orientation. Everingham et al. [Everingham2009] mention additional methods for including geometric information, such as tiling or pyramids, used in the context of the PASCAL challenge.

The bag of words model has its drawbacks, though: even with geometric information, the individual features are often not discriminative enough in cluttered scenes. A more holistic approach that represents objects as a whole may be more robust, and that is where the histogram of oriented gradients (HOG) descriptor is useful. Dalal and Triggs popularized the use of this descriptor in [Dalal2005] for their work on people detection. They achieved superior results on an existing data set using an SVM-based weighting scheme on their descriptor elements, which prompted a new, more challenging data set to be created. Others have made use of and modified the dense descriptor approach; in [Villamizar2009], Villamizar et al. attempt to overcome some limitations of the descriptor arising from cast shadows in an outdoor environment. Wang, Han and Yan merge HOG with local binary patterns (LBP) to explicitly handle occlusion. Hinterstoisser et al. provide an interesting variant of the HOG descriptor called Dominant Orientation Templates [Hinterstoisser2010], which they claim is as discriminative as HOG but faster.

While dense descriptors like the HOG can be quite robust to pose variation, explicitly handling pose variation for deformable objects may be better. This is the main concept behind deformable parts models, exemplified by the work done by Fergus et al. and Felzenszwalb et al. In [Fergus2003], a constellation approach that considers both appearance and shape using a Gaussian density of feature locations is used. Felzenszwalb makes use of multi-level HOG descriptors in [Felzenszwalb2010] by cascading an object-level descriptor with part-level descriptors that are constrained via a learned part model represented as a spring model a la the original parts model described by Fischler and Elschlager [Fischler1973].

Others have done some work using 3D point clouds, but not in the manner we present in this paper. For example, at Willow Garage, Steder uses a 3D feature he developed to classify objects directly from point clouds [Steder2010]. In [Golovinskiy2009], Golovinskiy et al. perform segmentation on large urban point clouds to aid in the classification of large objects like homes, vehicles and lamp posts. Lai and Fox [Lai2010] make use of Google's 3D object repository to train a classifier using spin image features and then apply it to point clouds taken from indoor and outdoor scenes. The common theme here is object detection and classification within only the point cloud, neglecting the image information that may be captured at the same time. For example, in [Lai2010], images are used to help a human hand-label the point cloud, but the image data is then discarded during classification. We think it would be more beneficial to use both, and this work indicates the idea has potential.

Figure 2: Our approach for detecting backpacks using the data available from the Kinect sensor.


For this paper, we focus on integrating point cloud data with image-based HOG matching using a variety of sliding window schemes. The high-level algorithm is illustrated in Figure 2 and described as follows:

  1. Retrieve image and point cloud data from Kinect sensor
  2. Filter point cloud:
    1. smooth points
    2. estimate normals
  3. Segment connected components (distance, normals, color)
  4. Extract ROI from scene image using segment bounds on image plane
  5. Apply HOG detector to each segment image (requires generating HOG descriptor for segment images)
    1. Scale segment
    2. Slide a single-scale window
    3. Slide a multi-scale window
  6. Score detection based on histogram intersection and weighted by segment radius

Point cloud filtering

Dense point clouds derived from stereo are notoriously noisy, and the Kinect data is no exception. Even with the subpixel accuracy taken into consideration, disparity discontinuities at a distance of even four meters present an interesting challenge. This work does not attempt to derive any new technique for this problem, so we handle it with existing software provided by the Point Cloud Library (PCL) stack in ROS using a voxel grid filter and the moving least squares algorithm.

The voxel grid filter reduces the size of the point cloud while ensuring that there is a representative point from the original cloud for each occupied voxel in an overlaid grid of user-specified size. While this process may discard useful data points, it lowers the computational cost of the later processing steps.
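A minimal numpy sketch of the voxel grid idea described above, keeping the per-voxel centroid as the representative point (the leaf size and the centroid choice here are illustrative; the PCL implementation we actually use has more options):

```python
import numpy as np

def voxel_grid_filter(points, leaf_size=0.02):
    """Downsample an (N, 3) point cloud so each occupied voxel of side
    `leaf_size` (meters) keeps one representative point: the centroid
    of the points that fall inside it."""
    # Integer voxel coordinates for every point.
    idx = np.floor(points / leaf_size).astype(np.int64)
    # Group points by voxel and average each group.
    voxels, inverse = np.unique(idx, axis=0, return_inverse=True)
    out = np.zeros((len(voxels), 3))
    counts = np.bincount(inverse).astype(float)
    for dim in range(3):
        out[:, dim] = np.bincount(inverse, weights=points[:, dim]) / counts
    return out
```

With a 2 cm leaf on Kinect-range data, this typically shrinks the cloud by an order of magnitude before the more expensive smoothing and segmentation steps.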

The moving least squares algorithm smooths the point cloud by formulating an optimization that minimizes the squared error between a polynomial surface and the points within a neighborhood of each target point. As a by-product of the optimization, the algorithm generates the normal of this surface at each point, which we make use of in subsequent processing.
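The per-point normals that MLS produces can be illustrated with a simpler stand-in: fit a plane to a point's neighborhood and take the direction of least variance. This is not the polynomial MLS fit PCL performs, just the plane-fit special case:

```python
import numpy as np

def estimate_normal(neighborhood):
    """Estimate a surface normal from a (K, 3) neighborhood: the
    eigenvector of the neighborhood covariance with the smallest
    eigenvalue, i.e. the direction of least variance."""
    centered = neighborhood - neighborhood.mean(axis=0)
    cov = centered.T @ centered / len(neighborhood)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    return eigvecs[:, 0]  # direction with the smallest eigenvalue
```

The sign of the returned normal is ambiguous; in practice it is flipped to point toward the sensor viewpoint.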


To find interesting objects within the scene as candidates for detection, we segment the point cloud with a connected components approach. We construct a graph representing connectivity of points in the cloud based on proximity, normal similarity and color similarity. Since the previous filtering algorithms destroy the image structure of the point cloud as it's received from the Kinect, we use a kd-tree and perform an approximate nearest neighbor search to find the neighbors of each point in the cloud. This handles proximity, while normal similarity is calculated using the dot product of the point normals estimated in the filtering step. To measure color similarity, we convert the RGB colorspace to the L*a*b* colorspace and threshold the Euclidean distance between colors. Finally, if two points are similar enough, we add an edge between them in the connectivity graph. A standard disjoint sets data structure maintains the connected components incrementally as we iterate through the points. These components are then processed into segments if they exceed a user-specified minimum point count.
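The disjoint-sets bookkeeping above can be sketched as follows. The thresholds are illustrative (not the tuned values we use), and the O(n²) pair loop stands in for the kd-tree neighbor search:

```python
import numpy as np

def find(parent, i):
    # Path-compressing find for the disjoint-set forest.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def segment(points, normals, colors, radius=0.05,
            normal_thresh=0.9, color_thresh=10.0, min_points=2):
    """Toy connected-components segmentation: link two points when they
    are close, their normals agree (dot product above `normal_thresh`),
    and their colors are near in whatever space `colors` lives in."""
    n = len(points)
    parent = list(range(n))
    for i in range(n):
        for j in range(i + 1, n):  # a kd-tree replaces this O(n^2) loop
            if (np.linalg.norm(points[i] - points[j]) < radius
                    and abs(np.dot(normals[i], normals[j])) > normal_thresh
                    and np.linalg.norm(colors[i] - colors[j]) < color_thresh):
                parent[find(parent, i)] = find(parent, j)
    # Collect components that meet the minimum size.
    comps = {}
    for i in range(n):
        comps.setdefault(find(parent, i), []).append(i)
    return [c for c in comps.values() if len(c) >= min_points]
```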

Once a segment is generated, we extract the centroid and maximum distance from the centroid over all points in the segment and extract the image plane-aligned bounding box using the camera matrix. This bounding box is used to extract an image of the segmented region from the original scene image.
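A hypothetical simplification of the bounding box extraction: project the centroid through the intrinsic matrix and take a square of apparent radius f·r/z around it (our implementation projects the segment bounds; this sketch only shows the geometry):

```python
import numpy as np

def segment_roi(centroid, radius, K):
    """Image-plane bounding box for a segment from its 3D centroid and
    bounding radius. K is the 3x3 camera intrinsic matrix; the box is a
    square of half-width f * r / z around the projected centroid."""
    x, y, z = centroid
    u = K[0, 0] * x / z + K[0, 2]  # pinhole projection of the centroid
    v = K[1, 1] * y / z + K[1, 2]
    half = K[0, 0] * radius / z    # apparent radius in pixels
    return (u - half, v - half, u + half, v + half)
```

The resulting box should be clipped to the image bounds before cropping the scene image.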

Figure 3: An example point cloud visualization, colored according to the connected components segmentation. In this case, the bag is the purple segment approximately in the center of the image.

Figure 4: Visualization of the HOG descriptor (on the right). The upper and lower left image show visualizations of the gradient magnitude and gradient direction, respectively.

Histogram of oriented gradients

The histogram of oriented gradients (or HOG) descriptor is a robust descriptor representing a dense, normalized grid of gradient orientations over an image region. It is robust in that it tolerates small translations and rotations in object pose, and it is fairly insensitive to the color of the original object. However, it operates directly on the image of the object and is therefore sensitive to gradients imposed by lighting and shadow variations (especially prevalent in outdoor situations).

For simplicity, we implement the R-HOG, or rectangular HOG, block structure, but otherwise calculate the HOG descriptor following Dalal and Triggs, minus the gamma correction. This involves the following steps:

  1. Calculate the image gradients, \nabla_{x,y}(\mathcal{I}), using the simple derivative filters (-1,0,1) and (-1,0,1)^{T}.
  2. Calculate the magnitude and direction of the gradient at each pixel
  3. Using 8\times8 pixel cells, bin the gradient orientations into 9 unsigned bins (i.e. 0-180 degrees, 20 degree increments)
  4. Using a block size of 2\times2 cells, normalize the 36-element histogram using a truncated L_{2} norm (called L_{2}-Hys in [Dalal2005]: after the initial normalization, all elements are clipped to 0.2 and the vector is renormalized).

Since the blocks are shifted one cell at a time horizontally and vertically, most cells (except the borders) contribute to four different normalized blocks. The normalization helps to reduce the influence of outlier gradients caused by lighting or pose variation.
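Steps 1-4 can be sketched compactly, assuming a grayscale float image whose sides are multiples of the cell size. This omits refinements of the full Dalal-Triggs implementation (interpolated binning, Gaussian block weighting), so it is a structural sketch rather than a drop-in replacement:

```python
import numpy as np

def hog(image, cell=8, bins=9, clip=0.2):
    """Minimal R-HOG: (-1, 0, 1) gradients, `bins` unsigned orientation
    bins per `cell` x `cell` cell, 2x2-cell blocks with L2-Hys."""
    # 1-2. Gradients, magnitude and unsigned orientation.
    gx = np.zeros_like(image); gy = np.zeros_like(image)
    gx[:, 1:-1] = image[:, 2:] - image[:, :-2]
    gy[1:-1, :] = image[2:, :] - image[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # fold into 0-180 degrees

    # 3. Magnitude-weighted orientation histogram per cell.
    ch, cw = image.shape[0] // cell, image.shape[1] // cell
    hist = np.zeros((ch, cw, bins))
    b = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for i in range(ch):
        for j in range(cw):
            sl = np.s_[i*cell:(i+1)*cell, j*cell:(j+1)*cell]
            hist[i, j] = np.bincount(b[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)

    # 4. Overlapping 2x2-cell blocks with L2-Hys normalization.
    blocks = []
    for i in range(ch - 1):
        for j in range(cw - 1):
            v = hist[i:i+2, j:j+2].ravel()          # 36 elements
            v = v / (np.linalg.norm(v) + 1e-6)      # L2 normalize
            v = np.minimum(v, clip)                 # clip at 0.2
            v = v / (np.linalg.norm(v) + 1e-6)      # renormalize
            blocks.append(v)
    return np.concatenate(blocks)
```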


To perform detection, we need two HOG descriptors and a similarity measure. In our current approach, one HOG descriptor is generated from a single model image of a backpack, and the other is calculated from the region of interest in the scene image defined by the candidate point cloud segment. The model descriptor defines a window that slides over the segment image, and we return the maximum response of the histogram intersection metric:

I(H_{1},H_{2})=\frac{2}{\sum H_{1}+\sum H_{2}}\sum_{i=1}^{D}\min(H_{1}^{i},H_{2}^{i})
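The metric is a one-liner in numpy; for identical histograms it returns 1, and for histograms with disjoint support it returns 0:

```python
import numpy as np

def histogram_intersection(h1, h2):
    """Normalized histogram intersection:
    2 * sum_i min(h1_i, h2_i) / (sum(h1) + sum(h2))."""
    return 2.0 * np.minimum(h1, h2).sum() / (h1.sum() + h2.sum())
```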

Windows for Detection

We explore several variants of the window-based detector described below:

  1. Target scaling: in this variant, the image defined by the point cloud segment is scaled (maintaining aspect ratio) so that one dimension matches the size of the default prototype image. The HOG descriptor is generated for each image and the score is simply the histogram intersection metric.
  2. Single scale sliding window: we slide the prototype over the segment image and the score is the maximum intersection score for each position.
  3. Multi-scale sliding window: we scale the prototype image from 0.4 to 1.0 at 0.1 increments, and slide each of these images over the segment image and return the maximum intersection score for each position over each scale.
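The multi-scale variant can be sketched as below. The `descriptor` and `score` callables are stand-ins for the HOG machinery, the cell-sized step of 8 pixels and the nearest-neighbor resize are illustrative choices, not our exact implementation:

```python
import numpy as np

def best_window_score(prototype, segment_img, descriptor, score,
                      scales=(0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Scale the prototype, slide each scaled copy over the segment
    image, and keep the best intersection score over all positions
    and scales."""
    best = 0.0
    ph, pw = prototype.shape
    for s in scales:
        h, w = int(ph * s), int(pw * s)
        # Nearest-neighbor resize of the prototype (illustrative only).
        rows = np.arange(h) * ph // h
        cols = np.arange(w) * pw // w
        proto_s = prototype[np.ix_(rows, cols)]
        d_proto = descriptor(proto_s)
        for y in range(0, segment_img.shape[0] - h + 1, 8):
            for x in range(0, segment_img.shape[1] - w + 1, 8):
                d_win = descriptor(segment_img[y:y+h, x:x+w])
                best = max(best, score(d_proto, d_win))
    return best
```

Restricting `scales` to `(1.0,)` recovers the single-scale sliding window variant.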

Size Prior

Since backpacks are often similarly sized, we weight the intersection score by how well the segment radius matches our estimated backpack size. Here, radius is defined as the maximum distance from the segment point cloud centroid to a point in the cloud. The weight is derived from a Gaussian distribution centered on the mean estimated radius (in our case, 0.15 meters, with a standard deviation of 0.10 meters).
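The prior described above amounts to evaluating an unnormalized Gaussian at the segment radius, so a perfect size match leaves the intersection score unchanged:

```python
import numpy as np

def size_weight(radius, mean=0.15, sigma=0.10):
    """Gaussian size prior: weight a detection by how close the segment
    radius (meters) is to the expected backpack radius. Unnormalized so
    a perfect match scores 1.0."""
    return float(np.exp(-0.5 * ((radius - mean) / sigma) ** 2))
```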

Baseline Detector Response

In addition to the windows, we examine the performance of the HOG detector run directly on the entire scene image, disregarding any point cloud segmentation data. We use the same multi-scale sliding window approach as above to get baseline detection results for comparison with the inclusion of segment priors.

Figure 5: ROS-based system architecture.


We evaluate the detector using a single prototype backpack on thirteen data segments each featuring at least one backpack appearance captured within our office environment. A data segment includes a recorded Kinect color video feed and Kinect depth data automatically transformed into colored point clouds using the Robot Operating System (ROS) openni_camera module.

Several of the segments are extremely challenging (even for a human): some include the target backpack at a significant distance (at least 6 meters), some have dark backpacks on a dark background, and some have the backpack occluded by other objects (e.g. an umbrella or desk leg). Each segment was hand-labeled with a simple axis-aligned bounding box. We consider a detection successful if the center of the detection region is contained within the labeled box. This is a relatively loose criterion, but it provides a simple Boolean response and allows for flexibility in the detection, since some variations of the algorithm may detect only a portion of the bag.
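The center-in-box criterion reduces to a pair of interval checks:

```python
def is_correct_detection(det_box, label_box):
    """Loose success criterion: a detection counts if the center of its
    box lies inside the hand-labeled axis-aligned box. Boxes are
    (x_min, y_min, x_max, y_max) tuples in pixel coordinates."""
    cx = (det_box[0] + det_box[2]) / 2.0
    cy = (det_box[1] + det_box[3]) / 2.0
    lx0, ly0, lx1, ly1 = label_box
    return lx0 <= cx <= lx1 and ly0 <= cy <= ly1
```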

The data is recorded using an off-the-shelf Kinect sensor in two ways: several of the data segments were collected by hand (simply holding the sensor), and the remaining segments were collected by teleoperating a robot with a mounted Kinect. No observable difference was detected in this application due to the collection method.

The current system was architected with a view towards near real-time evaluation on a robot. Figure 5 shows the connectivity and data transfer graph between the nodes of the system. The bold components are the primary nodes in the system; the segment module runs a background thread that processes the incoming Kinect data from a pre-recorded segment or directly off the live camera feed while the detector node calls a service interface published by the segment module to retrieve the next segment. The detector utilizes the HOG processor code to do the detection, and then publishes the detections to an external client or visualization tool.

Results and Analysis

Table 1 lists the results for the single prototype detector run on the thirteen datasets. The scores for four of the datasets are not included, since no detector registered more than one or two frame detections on them.

Table 1: Results for the single-prototype detector. We include a comparison of the detector when using just the image, as well as three different window methods for the image+segment detector. Scores are given as the fraction of frames with a correct detection. (SW) stands for sliding window.

Comparison to Baseline

The top row of Table 1 shows the results for the baseline detector using only the scene image and no point cloud segmentation results or size prior. With the exception of segment 2, the segmentation-based detector outperforms the image-only detector in a majority of the window schemes. We feel this is strong evidence that including the three dimensional spatial information (even with subpar segmentation results) is an important element in future object detector algorithms.

Window Analysis

It's quite surprising that the single-scale sliding window outperforms the multiscale sliding window on five of the nine capture segments. Examining the nature of the segments, primarily the visual size of the backpacks and their distance to the camera, it's clear the multiscale detector performed better when the bag was farther away, while the single-scale detector did better when the bags had a size that more closely matched the prototype image size (in this case 300\times300 pixels).

One question warrants further investigation: the multiscale detector includes the single-scale detector as a subset of its algorithm, so why doesn't it perform at least as well? The answer has to do with how the detector picks the final detection: in the multiscale case, a smaller scale may match an alternate object better, giving it a higher score than the unit-scale descriptor. This implies that more discriminative scoring is required, meaning more features or perhaps learning a sequential decision process for selecting detector stages.

Sensitivity to Segmentation

When watching the detections in real time, it's clear the results are sensitive to the segmentation, since the system assumes the segments represent a useful scene decomposition. This is often not the case. In several of the capture segments, the backpack completely fails to be segmented from a nearby object (a cubicle wall or desk drawers). This means that even if the HOG detector produces a reasonable intersection, the size of the segment (if it's bigger than the average backpack) will reduce the score.

Backpacks composed of multiple colors were often split into multiple segments. We included the color criterion in segmentation to counteract the normal estimation and surface smoothing, which often blended two separate but adjacent objects together. Unfortunately, this negatively impacted performance, since the detector only checked the portions of the image defined by the segments.

In other cases, capture segments were challenging due to lighting or distance. This is a problem that occurs with all vision sensors, and we were unable to spend any significant amount of time to address this issue of robustness.

Good Results

Figure 6 shows sample frames from three of the six highest performing capture segments (1-4, 10, and 13). While we would have hoped for 100% accuracy on these capture segments, the results are reasonable considering the variation in bag appearance and pose, as well as the fact that we are using a relatively simple HOG detector. Note that in each of the frames in the figure, the backpack is reasonably well lit and mostly unoccluded. This is not the case for most of the low-scoring segments.

Figure 6: Example segments with good detection results. Segments 1, 4, and 13 are represented here.

Bad Results

Figure 7 shows sample frames from the data segments with the lowest detection rates. In almost every case, the bag was partially occluded or in shadow. A more advanced detection scheme that considers these situations — using multiple prototypes, making decisions based on some confidence metric for each type of data (point cloud, image), and modeling occlusions and pose variations explicitly — could significantly improve performance.

Figure 7: Example segments with poor detection results. Segments 5-9 and 12 are represented here. There is at least a portion of a bag in each of the images!

Current Limitations and Future Work

This research is preliminary in many ways and will benefit from future investigation. Several components in the detection framework are primitive and could benefit from leveraging ideas from the related research mentioned in the section on related work. For example, instead of using a simple maximum response during detection, we could handle multiple hypotheses by using non-maximum suppression with an appropriately learned threshold.

In addition, we are not utilizing the information available from the point clouds to the fullest extent possible. For example, our prototype could include a point cloud based representation of the object as well as the image. As described in [Lai2010], we could use a ''soup of segments'', i.e. multiple segmentations of the scene instead of the single segmentation we currently use, to compensate for poor initial segmentations. Merging proximal segments of initial hypotheses could improve detection accuracy by overcoming the limitations mentioned in the section on sensitivity to segmentation.

Finally, and perhaps the most obvious next step, we need to develop and learn a part model to use for second level detection. In addition to improving detection accuracy, we will have more information for estimating pose and aligning important 3D model components for grasp and manipulation work (since the logical successor to object detection is manipulation).


We have presented an initial evaluation of a combined image and point cloud object detection system using HOG descriptors. The experimental results show that the system consistently generates more detections when using the point cloud derived object segmentation image regions than when using only the image-based approach. We are optimistic that incorporating the additional research and components discussed in the previous section will only improve the performance and take us one step closer to a feasible quickly-trained object category detection system for mobile robots.


Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern Information Retrieval. New York: Addison Wesley.

Csurka, G., Dance, C., Perronnin, F., & Willamowski, J. (2006). Generic Visual Categorization Using Weak Geometry. Toward Category-Level Object Recognition, 207-224. Springer.

Csurka, G., Dance, C., Fan, L., & Willamowski, J. (2004). Visual categorization with bags of keypoints. ECCV International Workshop on Statistical Learning in Computer Vision.

Dalal, N., & Triggs, B. (2005). Histograms of Oriented Gradients for Human Detection. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), 886-893. IEEE. doi: 10.1109/CVPR.2005.177.

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2009). The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 88(2), 303-338. doi: 10.1007/s11263-009-0275-4.

Fei-Fei, L., Fergus, R., & Perona, P. (2006). One-shot learning of object categories. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), 594-611. doi: 10.1109/TPAMI.2006.79.

Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Pictorial Structures for Object Recognition. International Journal of Computer Vision, 61(1), 55-79. doi: 10.1023/B:VISI.0000042934.15159.49.

Fergus, R., Perona, P., & Zisserman, A. (2003). Object class recognition by unsupervised scale-invariant learning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, II-264-II-271. IEEE. doi: 10.1109/CVPR.2003.1211479.

Fischler, M. A., & Elschlager, R. A. (1973). The Representation and Matching of Pictorial Structures. IEEE Transactions on Computers, C-22(1), 67-92. doi: 10.1109/T-C.1973.223602.

Golovinskiy, A., Kim, V. G., & Funkhouser, T. (2009). Shape-based recognition of 3D point clouds in urban environments. 2009 IEEE 12th International Conference on Computer Vision, 2154-2161. IEEE. doi: 10.1109/ICCV.2009.5459471.

Hinterstoisser, S., Lepetit, V., Ilic, S., Fua, P., & Navab, N. (2010). Dominant orientation templates for real-time detection of texture-less objects. IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE.

Lai, K., & Fox, D. (2010). Object Recognition in 3D Point Clouds Using Web Data and Domain Adaptation. The International Journal of Robotics Research, 29(8), 1019-1037. doi: 10.1177/0278364910369190.

Steder, B. (2010). 3D Point Cloud Based Object Recognition System. Willow Garage.

Szeliski, R. (2010). Computer Vision: Algorithms and Applications. Springer.

Villamizar, M., Scandaliaris, J., Sanfeliu, A., & Andrade-Cetto, J. (2009). Combining color-based invariant gradient detector with HoG descriptors for robust image detection in scenes under cast shadows. 2009 IEEE International Conference on Robotics and Automation, 1997-2002. IEEE. doi: 10.1109/ROBOT.2009.5152429.

Final Presentations

Anirudha Mahumdar

The problem domain is quite interesting in its own right. I enjoyed the high-level overview of under-actuated robots. I wonder if it would be useful to present some real-world examples of these kinds of robots in context. You answered the questions presented to you clearly.

Avik De

I appreciated your visualizations and the overall clarity of your presentation. I found the way you presented the material in a developing fashion to be quite useful in understanding your motivation and direction. A lot of information in a short space, but well presented. I'm slow with some of the implications of your math, so a tad more hand-holding would be nice, but it was great for the time limit.

Ben Charrow

Yay - I love your vision (and I share it)! Robots coexisting in peace :-) Good explanation of contribution and approach of leveraging existing work. Excellent progress and demo! It would be nice to address how you might automatically determine the parameters you had to spend time discovering?

Brian MacAllister

Clear, straightforward presentation of goals for the planner. I enjoyed watching the quad-rotor simulation demo. I think the detailed step-through of the algorithm is probably unnecessary for this presentation (although I had something very similar... so the comment applies to me too.) Watch out for the "ums" :-)

Caitlin Powers

I'm always fascinated with the work being done with online parameter estimation, and your presentation was an interesting instance. I liked the smooth fallback to the 2-dof controller, but I would have liked to hear why you think it failed (just a bug?) on the WAM.

Fei Miao

This was a fun presentation on path planning for two robots. I found your overview and comparison of the two methods particularly enlightening. I wasn't completely clear what your goal was in the beginning; perhaps you can consider clarifying this up front, and be conscious of projecting your voice.

Kartik Mohta

Unfortunately, I missed the first part of your presentation, but I always like hearing about useful applications of Voronoi tessellations (since I've used them myself). You had a nice presentation of the algorithm. Please feel free to take a bit more time to answer questions (although it's hard in a time-constrained situation...) :-)

Menglong Zhu

Very exciting project and excellent project definition. I think you've made great progress and I'd love to get my hands on your code. It would be nice to do some additional spelling checks on your slides to fix the last minute typos. Overall, impressive results and I look forward to hearing more about it.

Philip Dames

I enjoyed your discussion of the centralized controller for maximizing merged mutual information. I would have liked to have more analysis of your demo's behavior and how it compares to other navigation-based and information-based exploration algorithms.

Stephen McGill

Great "presentation" and slide design; a lot of text, but the demo videos made up for it. You've made good progress and had a nice overview of walking / coordination strategies. I wonder if you've considered whether the robots could communicate and negotiate within a shared spatial system when they are performing the initial "side" choosing?

Chinwendu Enyioha

- A lot of information in a SHORT presentation.
- Formulation looks interesting.

Ceridwen Magee

- Use path planning to help manipulation at the micro level.
- Cool visualizations.

Cem Karan

- Radio-constrained planning.
- Why did the graphics take so long? Can't wait to see more results.

Ian McMahon

- Magnet structure building; love the idea.
- Do you think it'll work with Legos?

Monica Lui

- Target localization & capture.

Shuai Li

- Optimize trajectory for WAM throwing at max speed using constraints on joint velocity.

Teyvonia Thomas

- iRobot control with an Android phone brain? Sweet, plop a phone on a Create and have it grab a snack :-)

Yida Zhang

- Cool demo video.
- Have you considered handling moving objects (i.e. Rolling Balls of Havoc)?