Object detection is a computer vision task whose goal is to localize objects in an image or video and classify them. It is a building block of many computer vision pipelines such as segmentation and recognition. Given the many advances made in object detection over the years, choosing an algorithm can be overwhelming. In this work we therefore summarize the most important advances in the field.


R-CNN

R-CNN, or Region-based Convolutional Neural Network, proposed in 2014, was a crucial step toward real-time object detection. Its most important contribution is the use of region proposals to localize objects in an image.


As seen in the image, R-CNN uses the selective search algorithm to propose roughly 2000 possible object locations, called bounding boxes. Each proposal with an IoU of at least 0.5 with a ground-truth box is labelled with the corresponding class, proposals with an IoU below 0.3 are labelled background, and the rest are ignored. Since the proposals can have different sizes, they are resized to a fixed size so that they can be fed to a deep learning architecture for feature extraction. One SVM is then trained on the extracted features for each target class. To improve localization, a bounding-box regressor predicts the centre coordinates and the height and width of the bounding box.
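A minimal sketch in Python of the labelling rule described above. The function names are illustrative, not from the original paper; boxes are assumed to be (x1, y1, x2, y2) tuples and the 0.5 / 0.3 thresholds follow the text.

```python
def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_proposal(proposal, gt_box, gt_class):
    """R-CNN rule: IoU >= 0.5 -> class label, IoU < 0.3 -> background, else ignored."""
    overlap = iou(proposal, gt_box)
    if overlap >= 0.5:
        return gt_class
    if overlap < 0.3:
        return "background"
    return "ignored"
```

For example, a proposal that exactly matches a ground-truth dog box is labelled "dog", while a proposal far from any ground truth becomes "background".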

Fast R-CNN

One drawback of R-CNN is its speed, so Fast R-CNN was proposed in 2015 to improve upon it. Fast R-CNN speeds up prediction by projecting the selective-search proposals onto the feature maps extracted by the deep learning architecture.


The image is fed directly to the deep learning architecture to extract feature maps. The proposals from selective search are then projected onto these feature maps, so, unlike R-CNN, the deep learning architecture runs only once per image. Because Fast R-CNN uses densely connected layers for prediction, ROI pooling is needed to extract a fixed number of features from each variable-sized region of interest.
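A toy sketch of ROI pooling on a 2-D feature map, assuming the region is max-pooled into a fixed out_h * out_w grid. The bin-boundary rounding here is one reasonable choice, not necessarily the exact scheme of any particular implementation.

```python
def roi_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool a variable-sized region of a 2-D feature map into out_h x out_w bins."""
    x1, y1, x2, y2 = roi                  # inclusive cell coordinates on the feature map
    h, w = y2 - y1 + 1, x2 - x1 + 1
    pooled = []
    for i in range(out_h):
        row = []
        # bin boundaries: floor for the start, ceiling for the end
        ys = y1 + (i * h) // out_h
        ye = y1 + ((i + 1) * h + out_h - 1) // out_h
        for j in range(out_w):
            xs = x1 + (j * w) // out_w
            xe = x1 + ((j + 1) * w + out_w - 1) // out_w
            row.append(max(feature_map[y][x]
                           for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

Whatever the size of the input region, the output is always out_h * out_w values, which is what the fixed-size dense layers require.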

Faster R-CNN

Fast R-CNN improved upon the speed of R-CNN, but it can be made faster still: the CNN is bottlenecked by the selective search algorithm. Faster R-CNN, proposed in 2016, overcomes this by using the features extracted by the backbone to propose the regions themselves.


The features extracted by the backbone are fed to an n * n convolution (the original paper uses n = 3), followed by two 1 * 1 convolutions with output sizes 4k and 2k (where k is the number of anchor boxes at different scales and aspect ratios), which predict the bounding-box coordinates and the object/not-object probability respectively. Only predictions of a certain size, and the top-scoring predictions after the non-max suppression algorithm, are kept. An ROI pooling layer then extracts a fixed number of features, and two densely connected layers predict the class probabilities and the bounding box respectively.
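The non-max suppression step mentioned above can be sketched as a greedy loop: keep the highest-scoring box, discard boxes that overlap it too much, and repeat. This is a minimal stdlib sketch; the 0.5 threshold in the usage below is illustrative.

```python
def iou(a, b):
    """Intersection-over-Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.7):
    """Greedy non-max suppression: return indices of kept boxes, best first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # drop remaining boxes that overlap the kept box too strongly
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

With two heavily overlapping boxes and one distant box, only the higher-scoring of the overlapping pair survives.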


YOLO

You Only Look Once (YOLO) was proposed in 2016 for real-time object detection. It achieves this by using a single neural network to predict both the classification scores and the bounding boxes of detected objects.


The algorithm divides the image into an S * S grid. Each grid cell predicts B bounding boxes, giving an output tensor of shape S * S * (B*5 + C) for C classes. Along with each bounding box, the model predicts a confidence score, defined as the IoU between the predicted box and the ground truth.
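The output-tensor size follows directly from the formula above; each of the B boxes carries 5 values (x, y, w, h, confidence). A quick sketch, using the original paper's Pascal VOC setting (S=7, B=2, C=20) as defaults:

```python
def yolo_output_size(S=7, B=2, C=20):
    """Total number of values in YOLO's S x S x (B*5 + C) output tensor.

    Each of the B boxes per cell carries 5 values: x, y, w, h, confidence;
    the C class probabilities are shared by the whole cell.
    """
    return S * S * (B * 5 + C)
```

For S=7, B=2, C=20 this gives 7 * 7 * 30 = 1470 output values.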


As seen in the above image, the model is trained with a weighted loss that gives more importance to grid cells containing objects. Also, when computing the loss for the bounding boxes, the error is taken between the square roots of the predicted and actual width and height, so that the same absolute error matters more for small boxes than for large ones.
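The width/height term of that loss can be sketched as below; this is only the size term of the full loss, and the function name is illustrative.

```python
import math

def box_size_loss(pred_w, pred_h, true_w, true_h):
    """Squared error on the square roots of box width and height.

    Taking square roots means a fixed absolute error costs more for a
    small box than for a large one.
    """
    return ((math.sqrt(pred_w) - math.sqrt(true_w)) ** 2
            + (math.sqrt(pred_h) - math.sqrt(true_h)) ** 2)
```

For instance, overshooting the width by 5 units on a small box yields a larger loss than the same overshoot on a large box.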


YOLO V2

To improve upon the speed and accuracy of YOLO, YOLO V2 was proposed in 2016, claiming to be Better, Faster and Stronger. Below we examine each claim individually.

  • Better
    • Using BatchNormalization after each layer improved mAP by 2%.
    • Fine-tuning the classification network on ImageNet at 448 * 448 resolution gave a 4% improvement in mAP.
    • One max-pooling layer is removed from the backbone, and the network is shrunk to operate on 416 * 416 input images. This yields 13 * 13 output feature maps, so the centre of a large object falls in a single cell rather than four. YOLO V2 also uses anchor boxes, which decreases mAP slightly but increases recall.
    • It also uses K-means clustering to choose anchor box priors that lead to good IoU scores.
    • The model predicts the centre of the object relative to its grid cell, using a logistic activation to constrain the prediction between 0 and 1. Together with the better priors selected by K-means, this increases mAP by 5%.
    • To improve predictions for smaller objects, the 26 * 26 feature maps are reshaped to 13 * 13 resolution and concatenated with the final 13 * 13 feature maps for detection.
    • To teach the model to predict well at all scales, every 10 batches a new image dimension is randomly chosen, the model is resized, and training continues.
  • Faster: To improve detection speed, DarkNet-19 is used, since its performance is comparable to ResNet50 with fewer parameters.
  • Stronger: The algorithm is trained on detection as well as classification datasets. When an image comes from a classification dataset, the loss is propagated only to the classification-specific parts of the architecture. Detection datasets only have general labels (dog, boat), while classification datasets have a wide range of fine-grained labels (Norfolk terrier, Yorkshire terrier), so a WordTree hierarchy is used and conditional probabilities are predicted.
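The logistic-constrained centre prediction described under "Better" can be sketched as the standard YOLO V2 box decoding: the sigmoid keeps the centre offset inside its grid cell, and width/height scale an anchor prior. The function name is illustrative; (cx, cy) is the cell's top-left corner in grid units and (pw, ph) the anchor prior.

```python
import math

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Decode raw network outputs (tx, ty, tw, th) into a box, YOLO V2 style."""
    sig = lambda t: 1.0 / (1.0 + math.exp(-t))
    bx = cx + sig(tx)          # centre x: sigmoid pins it inside the cell
    by = cy + sig(ty)          # centre y, in grid-cell units
    bw = pw * math.exp(tw)     # width scales the anchor prior
    bh = ph * math.exp(th)     # height scales the anchor prior
    return bx, by, bw, bh
```

With all raw outputs at zero, the decoded box sits at the centre of its cell with exactly the anchor's width and height, which is what makes training start from sensible priors.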