YOLO V3 made some small improvements over the YOLO V2 algorithm. When predicting class labels, sigmoid is used instead of softmax; this helps when training on datasets such as the Open Images Dataset, where overlapping labels (e.g. Woman and Person) are present. Apart from using the last-layer feature map, it takes feature maps from earlier layers, upsamples them, and merges them using concatenation. Convolution layers then process the combined semantic and fine-grained information. This process is repeated with earlier layers to obtain the features for the final prediction.
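The difference between the two classification heads can be sketched with a small numpy example (the class names and logit values here are hypothetical):

```python
import numpy as np

def softmax(logits):
    """Softmax makes class scores compete: they must sum to 1."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def sigmoid(logits):
    """Independent sigmoids let several labels be active at once."""
    return 1.0 / (1.0 + np.exp(-logits))

# Hypothetical logits for the classes ["Person", "Woman", "Car"] on one box.
logits = np.array([4.0, 3.5, -3.0])

# Softmax forces "Person" and "Woman" to split the probability mass,
# even though both labels are correct for the same object.
p_softmax = softmax(logits)

# With sigmoid each label is scored independently, so both "Person"
# and "Woman" can exceed a 0.5 detection threshold simultaneously.
p_sigmoid = sigmoid(logits)
```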


YOLO V4 made improvements in different parts of the YOLO V3 architecture. It consists of:

  • Backbone

    CSPResNeXt50, EfficientNet-B3, and CSPDarknet53 were all considered; CSPDarknet53 was chosen because of its higher prediction speed.

  • Neck

    Spatial pyramid pooling increases the receptive field and is useful for separating the most significant context features without significantly decreasing the network speed. PANet is used to aggregate features from earlier backbone layers.

  • Head

    The algorithm uses the same anchor based head as YOLO V3.

  • Bag of Freebies

    • Backbone

      • CutMix augmentation

        In this augmentation a part of the image is removed and replaced with a patch from another image. The labels are also mixed in proportion to the number of pixels that were removed and replaced.
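A minimal numpy sketch of the label mixing (the toy images, labels, and patch coordinates are illustrative, not from the original paper):

```python
import numpy as np

def cutmix(img_a, label_a, img_b, label_b, box):
    """Paste a rectangular patch of img_b into img_a and mix the
    one-hot labels in proportion to the number of replaced pixels.
    `box` is (y1, y2, x1, x2)."""
    y1, y2, x1, x2 = box
    mixed = img_a.copy()
    mixed[y1:y2, x1:x2] = img_b[y1:y2, x1:x2]
    h, w = img_a.shape[:2]
    lam = 1.0 - ((y2 - y1) * (x2 - x1)) / (h * w)   # fraction kept from img_a
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed, mixed_label

# Toy 8x8 single-channel "images" and one-hot labels for a 2-class setup.
img_a, img_b = np.zeros((8, 8)), np.ones((8, 8))
out, label = cutmix(img_a, np.array([1.0, 0.0]),
                    img_b, np.array([0.0, 1.0]), (0, 4, 0, 4))
# A quarter of the pixels were replaced, so the label becomes 0.75 / 0.25.
```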

      • Mosaic augmentation

        Mosaic augmentation combines four different images into one image. This way the model does not need a large mini-batch, and it helps the model learn to predict at smaller scales.

      • Dropblock regularization

        This regularization technique turns off the neurons in a contiguous patch of the feature map rather than dropping them independently.
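A heavily simplified sketch: the real DropBlock drops many blocks at a rate derived from a keep probability, whereas this illustrative version zeroes a single random block on a 2-D map.

```python
import numpy as np

def dropblock(fmap, block_size=3, seed=0):
    """Simplified DropBlock on an (H, W) feature map: pick one random
    block centre and zero the block_size x block_size patch around it.
    (The real method drops many blocks governed by a keep probability.)"""
    h, w = fmap.shape
    rng = np.random.default_rng(seed)
    cy = rng.integers(block_size // 2, h - block_size // 2)
    cx = rng.integers(block_size // 2, w - block_size // 2)
    out = fmap.copy()
    out[cy - block_size // 2:cy + block_size // 2 + 1,
        cx - block_size // 2:cx + block_size // 2 + 1] = 0.0
    return out

out = dropblock(np.ones((8, 8)))
# Exactly one 3x3 patch of the map has been zeroed.
```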

      • Class Label Smoothing

        This is a regularization technique that introduces noise into the labels and accounts for mistakes that may be present in the dataset. It also prevents the model from predicting too confidently and generalizing poorly.
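Label smoothing is a one-liner in practice; a small numpy sketch (the smoothing factor 0.1 is a common choice, not a value taken from the YOLO V4 paper):

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Class label smoothing: shrink the hard 1 to 1 - eps and spread
    eps uniformly over all K classes, so the target is 1 - eps + eps/K
    for the true class and eps/K elsewhere."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k

target = smooth_labels(np.array([0.0, 0.0, 1.0, 0.0]), eps=0.1)
# Targets become 0.025 for the wrong classes and 0.925 for the true class.
```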

    • Detector

      • CIoU loss

        Also known as Complete IoU loss, it improves the model by considering three important geometric factors when calculating the loss:

        • the area of overlap between the predicted and ground truth boxes
        • the Euclidean distance between the centre points of the predicted and ground truth bounding boxes
        • the consistency of the aspect ratios of the bounding boxes
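A minimal single-pair sketch of the three terms for axis-aligned (x1, y1, x2, y2) boxes, following the published CIoU formulation:

```python
import numpy as np

def ciou_loss(pred, gt):
    """CIoU loss = 1 - IoU + (centre distance term) + (aspect ratio term)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # Overlap term (IoU).
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / ((px2 - px1) * (py2 - py1) + (gx2 - gx1) * (gy2 - gy1) - inter)
    # Centre-distance term, normalised by the enclosing box diagonal.
    rho2 = (((px1 + px2) - (gx1 + gx2)) ** 2 + ((py1 + py2) - (gy1 + gy2)) ** 2) / 4
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2
    # Aspect-ratio consistency term.
    v = (4 / np.pi ** 2) * (np.arctan((gx2 - gx1) / (gy2 - gy1))
                            - np.arctan((px2 - px1) / (py2 - py1))) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)
    return 1 - iou + rho2 / c2 + alpha * v

loss_same = ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))     # perfect match: ~0
loss_shifted = ciou_loss((0, 0, 10, 10), (5, 5, 15, 15))  # shifted box: > 0
```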
      • Cross mini-Batch Normalization

        This variation of batch normalization calculates its statistics over the last K mini-batches instead of a single mini-batch.

      • Self-adversarial Training

        Self-adversarial training is a new data augmentation technique in which the network executes an adversarial attack on itself, altering the image so that the object is no longer visible to the network. Later in training the altered image is fed back to the network, forcing the model to confront its own vulnerabilities.

      • Eliminate grid sensitivity

        In YOLO V3 the model’s predictions tx and ty are fed to a sigmoid activation to limit them to the interval (0, 1). The model therefore has to predict extreme values of tx and ty when the object centre lies near the border of the grid cell. YOLO V4 solves this by multiplying the output of the sigmoid activation by a factor greater than one.
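A sketch of the decoding (the scale factor 2.0 is a common choice in later YOLO implementations; the exact value is configurable):

```python
import numpy as np

def decode_center(t, scale=2.0):
    """Decode a raw centre-offset logit. With scale=1.0 (YOLO V3) the
    sigmoid output only approaches 0 or 1 asymptotically, so huge logits
    are needed for centres near the cell border. Scaling by a factor > 1
    (and re-centring) removes that grid sensitivity."""
    s = 1.0 / (1.0 + np.exp(-t))
    return scale * s - (scale - 1.0) / 2.0

# With scale=2 a modest logit already reaches (and slightly passes)
# the cell border, while scale=1 can never reach it.
at_border = decode_center(3.0)             # > 1.0
old_style = decode_center(3.0, scale=1.0)  # < 1.0, however large t gets
```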

      • Uses multiple anchors for a single ground truth.

      • Cosine annealing scheduler

        A cosine annealing scheduler decreases the learning rate gradually and then increases it suddenly. This sudden increase is useful when the model is stuck in a local minimum.
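The schedule can be sketched as follows (the cycle length and learning-rate bounds are illustrative defaults, not values from the YOLO V4 paper):

```python
import math

def cosine_annealing_lr(step, cycle_len, lr_max=1e-2, lr_min=1e-4):
    """Cosine annealing with warm restarts: within a cycle the rate
    decays from lr_max to lr_min along a cosine curve, then jumps
    back up to lr_max at the start of the next cycle."""
    t = step % cycle_len   # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))

# Start of a cycle: lr_max; end of a cycle: close to lr_min;
# the next cycle restarts at lr_max again.
start, end, restart = (cosine_annealing_lr(s, 100) for s in (0, 99, 100))
```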

  • Bag of Specials

    • Backbone

      • Mish

        Mish is used to mitigate the vanishing gradient problem, since it is a smooth, continuously differentiable activation function.

      • Cross-stage partial connections

        Cross-stage partial connections help reduce computational bottlenecks by copying the feature maps from an earlier layer and passing them unaltered to later layers.

    • Detector

      • Spatial Pyramid pooling(SPP)

        Since the original spatial pyramid pooling outputs a one-dimensional feature vector, it cannot be applied within a fully convolutional architecture. YOLO V4 therefore uses a modified SPP that applies max pooling layers with different window sizes and concatenates the results.
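A numpy sketch of this modified SPP block (the kernel sizes 5/9/13 match common YOLO configurations; the pooling loop is a naive reference implementation):

```python
import numpy as np

def maxpool_same(fmap, k):
    """Naive stride-1 max pooling with 'same' padding on an (H, W, C) map."""
    pad = k // 2
    padded = np.pad(fmap, ((pad, pad), (pad, pad), (0, 0)), constant_values=-np.inf)
    h, w, _ = fmap.shape
    out = np.empty_like(fmap)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].max(axis=(0, 1))
    return out

def spp_block(fmap, kernels=(5, 9, 13)):
    """YOLO V4-style SPP: max-pool the same map with several window sizes
    (stride 1, 'same' padding) and concatenate along the channel axis, so
    the spatial size is preserved while the receptive field grows."""
    return np.concatenate([fmap] + [maxpool_same(fmap, k) for k in kernels], axis=-1)

fmap = np.random.default_rng(0).normal(size=(13, 13, 4))
out = spp_block(fmap)   # shape (13, 13, 16): spatial size unchanged, 4x channels
```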

      • Spatial Attention Module(SAM)

        YOLO V4 uses a modified version of the Spatial Attention Module from CBAM: the inputs are fed to a convolution layer followed by a sigmoid activation, and the result is multiplied with the input.

      • Path Aggregation Network (PAN)

        Used for feature aggregation; the original PAN adds its multiple inputs, whereas in YOLO V4 the multiple inputs are concatenated.

      • DIoU-NMS

        Used to suppress duplicate predictions. The original non-maximum suppression calculates the IoU between high-confidence predictions and lower-confidence predictions; if this IoU exceeds a certain threshold, the lower-confidence prediction is removed. DIoU-NMS works the same way but uses DIoU (IoU penalized by the Euclidean distance between the centres of the two predictions) instead of plain IoU.
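A minimal sketch of the greedy suppression loop with the DIoU criterion (boxes are (x1, y1, x2, y2); the example boxes and scores are invented):

```python
import numpy as np

def diou(box_a, box_b):
    """DIoU = IoU - (centre distance)^2 / (enclosing-box diagonal)^2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0
    rho2 = (((ax1 + ax2) - (bx1 + bx2)) ** 2 + ((ay1 + ay2) - (by1 + by2)) ** 2) / 4
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    return iou - rho2 / (cw ** 2 + ch ** 2)

def diou_nms(boxes, scores, threshold=0.5):
    """Greedy NMS that suppresses a lower-scoring box only when its
    DIoU with an already-kept box exceeds the threshold."""
    keep = []
    for i in np.argsort(scores)[::-1]:
        if all(diou(boxes[i], boxes[j]) <= threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = np.array([0.9, 0.8, 0.7])
keep = diou_nms(boxes, scores)
# The heavily overlapping second box is suppressed; the distant one survives.
```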


Proposed in 2020, YOLO V5 uses various augmentations (geometric distortion, random scaling, cropping, translation, shearing, and rotation) along with unique augmentations such as MixUp, Mosaic, and CutMix. The algorithm uses CSPDarknet53 with an SPP layer as the backbone, PANet as the neck, and the YOLO detection head. To improve the model’s predictions on tiny objects and make it robust to variance in object scale, the model uses a total of four heads.


As seen in the image, some changes are made to the backbone to improve its classification performance. To capture global and spatial information, some convolution blocks are replaced with transformer encoders; to limit the computation and memory cost, transformer encoder blocks are applied only at the head. The convolutional block attention module (CBAM) is a lightweight attention module that can be trained end to end and easily integrated into various CNN architectures. Experimentally, CBAM provides a large improvement when integrated into the model, because it produces attention maps along both the spatial and channel dimensions for feature refinement. To reduce the variance of the neural network, this model uses an ensembling technique, weighted box fusion (WBF), which, unlike NMS, merges all boxes to form the final result. Five different models are trained; each performs scaling and horizontal flips on the test image and applies non-maximum suppression to fuse the resulting predictions, and weighted box fusion is then applied across the five models. On the VisDrone2021 dataset the model localized objects well but classified them poorly, especially on the hard categories, so an extra self-trained classifier (ResNet18) was added to gain a 0.8 to 1 percent AP improvement.

Single Shot MultiBox Detector

Proposed in 2016, the Single Shot MultiBox Detector (SSD) is as accurate as Faster R-CNN and faster than YOLO. The algorithm uses a feed-forward convolutional network to predict a fixed number of bounding boxes and scores for the presence of object class instances in those boxes. A truncated base network forms the early network layers, to which convolution layers are added to progressively decrease the dimensions of the feature maps. This allows the model to make predictions from multiple feature maps at different scales. For an m * n feature map, an output of m * n * (c + 4) * k is calculated for c classes and k default bounding boxes. These default boxes vary over location, aspect ratio, and scale, and are matched to any ground truth box with a Jaccard overlap greater than 0.5. The scale of the default boxes varies across the feature maps so that each feature map is responsive to a particular scale of object. The aspect ratios are varied to give six default boxes per location. Using multiple default boxes produces far more negative samples than positive ones, so only the most confident negative samples are kept, such that negatives outnumber positives by at most three to one. The algorithm uses a weighted sum of the localization loss and the class confidence loss.

DEtection TRansformer

Proposed in 2020, DETR uses a backbone followed by a transformer encoder-decoder architecture for object detection.


The backbone uses an ImageNet-pretrained ResNet model to extract features of size 2048 x H/32 x W/32. Since a transformer cannot take these as input, they are flattened to shape C x HW and fed to the transformer encoder after a 1 * 1 convolution that reduces the number of channels. Because flattening causes the feature maps to lose spatial information, positional encodings are added to them before they are fed to the transformer. The decoder takes these features and is supplemented by object queries, one query per object slot; these queries are randomly initialized and learnt during training. The output of the decoder is fed to a feed-forward network (FFN), i.e. three densely connected layers with ReLU activations, to predict the normalized centre coordinates, height, and width of the box w.r.t. the input image, and to a densely connected layer with a softmax activation to predict the class label. Since a predefined number N of bounding boxes is predicted, an extra background class is needed. Hungarian matching is used to assign predictions to the ground truth when computing the loss.
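The shape bookkeeping can be sketched in numpy (the input size, d_model = 256, and the random projection standing in for the learned 1 * 1 convolution are illustrative assumptions):

```python
import numpy as np

# How DETR prepares backbone features for the transformer (shapes only).
rng = np.random.default_rng(0)
feats = rng.normal(size=(2048, 25, 34))   # (C, H/32, W/32) for an ~800x1088 input
proj = rng.normal(size=(2048, 256))       # stand-in for the 1x1 conv to d_model=256

c, h, w = feats.shape
tokens = feats.reshape(c, h * w).T @ proj  # (H*W, d_model): one token per location
pos = rng.normal(size=(h * w, 256))        # positional encodings restore spatial info
encoder_input = tokens + pos               # (850, 256) sequence for the encoder
```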

Swin Transformer

The use of transformers has revolutionized the natural language processing field; to apply them to image data, the Swin Transformer was proposed in 2021.


The algorithm uses a patch splitting module to split an RGB image into patches. Each patch is treated as a token, and its feature is the concatenation of the raw pixel RGB values; for example, a 4 * 4 patch yields a token with 48 (4 * 4 * 3) features. A linear embedding layer is then used to change the number of features. Augmented by relative positional embeddings, these features are fed to a Swin Transformer block followed by a patch merging block. In Swin Transformer blocks the multi-head self-attention is replaced by a shifted-window-based multi-head self-attention, with the other parts unchanged. Since computing attention between a token and all other tokens would be computationally expensive, window-based self-attention computes attention only between tokens within a local window. To introduce cross-window connections while maintaining efficiency, the window is shifted by [M/2, M/2], where M is the window size; this shifting is done in a cyclic manner. The patch merging layer concatenates the features of each group of 2 * 2 neighbouring patches, followed by a linear layer that reduces the 4 * C concatenated features (C features per patch) to 2 * C. This reduces the number of tokens by a factor of 4 and doubles the number of features per token.
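The patch merging step can be sketched in numpy; the fixed random matrix here stands in for the learned 4C -> 2C linear layer (an illustrative assumption):

```python
import numpy as np

def patch_merge(x):
    """Swin-style patch merging on an (H, W, C) token grid: concatenate
    each 2x2 neighbourhood into a 4C vector (4x fewer tokens), then apply
    a 4C -> 2C projection (a random matrix stands in for the learned layer)."""
    h, w, c = x.shape
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]], axis=-1
    )                                              # (H/2, W/2, 4C)
    reduction = np.random.default_rng(0).normal(size=(4 * c, 2 * c))
    return merged @ reduction                      # (H/2, W/2, 2C)

tokens = np.zeros((8, 8, 96))   # 64 tokens with C = 96 features each
out = patch_merge(tokens)       # (4, 4, 192): 4x fewer tokens, 2x channels
```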

Small Object Detection using Context and Attention

The algorithm improves upon SSD using attention along with a feature fusion mechanism. The feature fusion mechanism supplies contextual information to a feature map: later-layer feature maps are passed through a deconvolution layer to match the spatial size and to reduce the number of channels to half that of the target feature map, followed by normalization (because features from different layers have different scales) and ReLU. After passing through a convolution layer, the target feature maps are concatenated with them. A residual attention mechanism consisting of a trunk branch and a mask branch is used. The trunk branch has two residual blocks, each of three convolution layers. To produce the attention map, the mask branch performs down-sampling and up-sampling with a residual connection, followed by two residual blocks and a sigmoid. After the attention map is multiplied with the output of the trunk branch, the result is passed through a residual block, normalization, and a ReLU activation. When combining the two modules, the target feature maps are passed through the attention module instead of the convolution layer; the algorithm can thus utilize context information from the target layer as well as later layers.

Focal Loss for Dense Object Detection

The choice of loss has a large impact on how well an algorithm performs, yet for object detection little work had been done in this regard. Focal loss, proposed in 2018, improves the loss by giving more weight to hard examples.


When an example is misclassified, i.e. the predicted probability for the true class is low, the loss is nearly unaffected; when the example is well classified and the predicted probability approaches 1, the loss is down-weighted. The work applies this loss to a custom neural network architecture (RetinaNet). RetinaNet uses a Feature Pyramid Network (FPN) on top of a ResNet backbone to obtain feature maps at multiple scales, along with two task-specific subnetworks.
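The down-weighting effect is easy to verify numerically with the published formula FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t) (gamma = 2 and alpha = 0.25 are the defaults reported in the paper):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    gamma = 0 recovers (alpha-weighted) cross-entropy; larger gamma
    down-weights well-classified examples more strongly."""
    p_t = p if y == 1 else 1 - p
    alpha_t = alpha if y == 1 else 1 - alpha
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# An easy positive (p = 0.9) contributes far less loss than a hard
# positive (p = 0.1), which is the point of the (1 - p_t)^gamma factor.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```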


The classification subnet uses four 3 * 3 convolutions with 256 channels and ReLU activations, followed by a convolution with K * A filters and a sigmoid, for K classes and A anchors. The box regression subnet uses the same structure but terminates with 4 * A filters.

Feature Pyramid Networks for Object Detection

The algorithm improves the backbone by producing predictions at multiple scales. It consists of two parts: a bottom-up pathway and a top-down pathway.


The bottom-up pathway is the backbone used for feature extraction, where the spatial size decreases. Since many layers can produce feature maps of the same spatial size, the algorithm uses the features from the last layer that produces a feature map of each particular size. The top-down pathway upsamples the low-resolution feature maps, which are then added to the corresponding bottom-up feature maps after a 1 * 1 convolution that matches the channel counts. A 3 * 3 convolution is then applied to each merged map to produce the final feature maps used for prediction at each scale.
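One merge step can be sketched in numpy; the channel counts are typical ResNet/FPN values, and a plain projection matrix stands in for the 1 * 1 convolution (both illustrative assumptions):

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_merge(top_down, lateral, w_lateral):
    """One FPN merge step: upsample the coarser top-down map and add the
    bottom-up (lateral) map after a channel projection standing in for
    the 1x1 convolution that matches the channel counts."""
    return upsample2x(top_down) + lateral @ w_lateral

rng = np.random.default_rng(0)
p5 = rng.normal(size=(4, 4, 256))    # coarse top-down feature map
c4 = rng.normal(size=(8, 8, 512))    # finer bottom-up map with more channels
w = rng.normal(size=(512, 256))      # stand-in for the 1x1 conv weights
p4 = fpn_merge(p5, c4, w)            # (8, 8, 256) merged map
```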

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Deep neural network architectures require fixed-size inputs. This work considers that requirement detrimental to a model’s performance: since inputs have to be resized, the object of interest may no longer be recognizable after the change. Convolution layers work well with variable-sized inputs, but densely connected layers require a fixed size.


As shown, the feature maps are passed through a fixed grid of bins; each bin takes the max or average value of its region, and the bin outputs are combined to give a fixed number of features. This lets the algorithm accept inputs of any size while still using densely connected layers.
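A numpy sketch of the fixed-output pooling (the pyramid levels 1/2/4 are illustrative; the paper uses a similar multi-level grid):

```python
import numpy as np

def spp_fixed(fmap, levels=(1, 2, 4)):
    """Original SPP: for each pyramid level split the (H, W, C) map into
    an n x n grid, max-pool each cell, and concatenate everything, giving
    a fixed-length vector regardless of the input's spatial size."""
    h, w, c = fmap.shape
    feats = []
    for n in levels:
        ys = np.linspace(0, h, n + 1).astype(int)
        xs = np.linspace(0, w, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                feats.append(fmap[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(0, 1)))
    return np.concatenate(feats)   # (1 + 4 + 16) * C values

# Two different input sizes produce the same output length.
a = spp_fixed(np.random.default_rng(0).normal(size=(13, 13, 8)))
b = spp_fixed(np.random.default_rng(1).normal(size=(20, 17, 8)))
```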

EfficientDet: Scalable and Efficient Object Detection

This work, proposed in 2020, aims to improve the backbone for object detection. It does so by improving PANet into BiFPN and by introducing a compound scaling method that scales the resolution, depth, and width of the backbone, feature network, and box/class prediction networks at the same time. PANet improves upon FPN by adding an additional bottom-up pathway.


BiFPN improves upon PANet by removing nodes that have only one input edge: since no feature fusion happens in these nodes, their contribution to the final prediction is minimal. If the original input and output nodes are at the same level, an extra edge fuses them, since this can be done with minimal additional computation; multiple bottom-up layers are also added. Because features from multiple scales are being fused, and the authors observe that features at different resolutions contribute differently, they use fast normalized fusion to weight each feature so the model can learn the importance of each.


Fast normalized fusion keeps each normalized weight between 0 and 1, but since no softmax operation is involved, it is much more efficient.
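The fusion rule from the paper, O = sum_i (w_i / (eps + sum_j w_j)) * I_i with ReLU-constrained weights, can be sketched as:

```python
import numpy as np

def fast_normalized_fusion(features, weights, eps=1e-4):
    """BiFPN fast normalized fusion: each (ReLU-constrained) weight is
    divided by the sum of all weights plus a small eps, so the normalized
    weights fall in [0, 1] without the cost of a softmax."""
    w = np.maximum(weights, 0.0)        # ReLU keeps the learnable weights >= 0
    norm = w / (w.sum() + eps)
    return sum(n * f for n, f in zip(norm, features))

f1, f2 = np.ones((4, 4)), 3 * np.ones((4, 4))
fused = fast_normalized_fusion([f1, f2], np.array([1.0, 1.0]))
# Equal weights give (approximately) the mean of the two maps.
```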


As seen in the image, EfficientDet uses a backbone to extract features, which are fed to the BiFPN. A compound scaling method jointly scales the different parameters: the width/depth of the backbone, the depth of the BiFPN (increased linearly) and its width (increased exponentially), the width of the prediction network (increased with the BiFPN width) and its depth (increased linearly), and the input resolution (increased linearly).

Opinion and Future Direction

In this work we explored different object detection algorithms. The most recent algorithms, such as the Swin Transformer and YOLO V5, are capable of processing hundreds of images per second with high mAP. However, these algorithms are still primitive when compared to human vision. Future research in object detection may focus on, but is not limited to, the following aspects:

  • Adapting to different domains: In the real world the environment changes constantly, so the algorithm used must be able to keep up with, recover from, and adapt to these changes. Adversarial training and GANs are among the latest advances made to address this problem and may be of great help to object detection in the future.

  • Constrained conditions: Currently a large amount of data is needed to train these algorithms, and annotating objects in images is a time-consuming, expensive, and inefficient process. Thus weakly supervised detection techniques, where detectors are trained with only image-level annotations or with limited bounding box annotations, are of great importance.

  • Object detection in other modalities: Most algorithms work with 2D images. Developing algorithms that work with RGB-D images, 3D point clouds, and LiDAR would benefit autonomous vehicles, unmanned aerial vehicles, and robotics.