Unlike machines, we see an image and instantly know what objects are in it and where they are. The human visual system is fast and accurate. AI researchers have long been searching for an efficient method for real-time object detection. In 2015, researchers from the University of Washington, the Allen Institute for AI, and Facebook AI Research developed one of the fastest object detection models, YOLO (You Only Look Once).
Object detection is a general term for a collection of related computer vision and image processing tasks that involve identifying objects in a given frame. It is widely used in applications such as face recognition, tracking the ball during a football match, image annotation, etc.
How does YOLO work?
YOLO takes a completely different approach to detecting objects in a given frame than traditional models. Earlier systems repurposed classifiers to perform detection, running them at many locations and scales, which made their pipelines slow and complex. YOLO is a single convolutional neural network that predicts bounding boxes around objects and class probabilities directly from full images in one evaluation. As the authors put it, they frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.
From the above figure we can see that YOLO processes images in a simple and straightforward manner: the system first resizes the input image to 448 × 448, then runs a single convolutional network on the image, and finally thresholds the resulting detections by the model’s confidence.
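These three steps can be sketched in plain Python. This is only an illustration, not the real implementation: `run_network` is a stub standing in for the trained model, and the nearest-neighbour resize is a stand-in for a proper image resizer.

```python
import numpy as np

def resize_nearest(image, size=448):
    """Nearest-neighbour resize to size x size (stand-in for a real resizer)."""
    h, w = image.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return image[rows][:, cols]

def run_network(image):
    """Stub for the trained YOLO network: returns a dummy 7x7x30 tensor."""
    return np.zeros((7, 7, 30))

def detect(image, conf_threshold=0.25):
    resized = resize_nearest(image)      # 1. resize the input to 448x448
    preds = run_network(resized)         # 2. single forward pass over the image
    # 3. keep grid cells whose (first) box confidence clears the threshold
    confidences = preds[:, :, 4]
    return np.argwhere(confidences > conf_threshold)
```

With the stub network every confidence is zero, so `detect` returns no detections; plugging in a real model is where the actual work lives.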
Let’s discuss this in detail.
The YOLO architecture is inspired by the GoogLeNet (Inception) image classification model, and its convolutional layers are pretrained on ImageNet data. YOLO consists of 24 convolutional layers followed by 2 fully connected layers. Alternating 1 × 1 convolutional layers reduce the feature space from the preceding layers. The last convolutional layer outputs a tensor of shape (7, 7, 1024), which is flattened and passed through the 2 fully connected layers; these act as a form of linear regression and output the 7 × 7 × 30 prediction tensor. The network runs at 45 frames per second with no batch processing on a Titan X GPU, and a fast version runs at more than 150 fps. This means it can process streaming video in real time with less than 25 milliseconds of latency.
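The 7 × 7 × 30 output shape falls directly out of the paper's settings: an S × S grid with S = 7, B = 2 boxes per cell (5 numbers each), and C = 20 PASCAL VOC classes. A quick check:

```python
S, B, C = 7, 2, 20            # grid size, boxes per cell, PASCAL VOC classes
depth = B * 5 + C             # each box predicts x, y, w, h, confidence
print(S, S, depth)            # → 7 7 30
print(S * S * depth)          # → 1470 total output values
```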
The YOLO network uses features from the entire image to predict each bounding box, and it predicts all bounding boxes across all classes for an image simultaneously. It divides the image into an S × S grid. If the centre of an object falls into a grid cell, that grid cell is responsible for detecting that object.
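Which cell is "responsible" can be computed directly from the object's centre coordinates. A minimal sketch, assuming a 448 × 448 input and S = 7 (so each cell covers 64 × 64 pixels):

```python
def responsible_cell(cx, cy, image_size=448, S=7):
    """Return (row, col) of the grid cell containing an object's centre (cx, cy)."""
    cell_size = image_size / S           # 64 pixels per cell here
    return int(cy // cell_size), int(cx // cell_size)

print(responsible_cell(100, 350))        # centre at x=100, y=350 → cell (5, 1)
```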
Each grid cell then predicts B bounding boxes along with confidence scores for those boxes. The confidence score tells how confident the model is that the box contains an object. If a grid cell does not contain any object, the confidence score should be zero.
Each bounding box is described by five predictions: x, y, w, h, and confidence. (x, y) is the centre coordinate of the box, and w, h are its width and height. Finally, the confidence prediction represents the IOU (intersection over union) between the predicted box and any ground truth box.
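The IOU in that confidence target is straightforward to compute from two boxes in the same (x, y, w, h) centre format. A minimal sketch:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes, (x, y) being the centre."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    # intersection rectangle (zero area if the boxes do not overlap)
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

print(iou((50, 50, 100, 100), (50, 50, 100, 100)))  # identical boxes → 1.0
```

An IOU of 1 means a perfect overlap with the ground truth, 0 means no overlap at all, which is why it makes a natural confidence target.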
From the above discussion and figure, we can see that the model runs on the full image and detects all objects simultaneously, which makes it much faster than other models.
That said, YOLO has some limitations:
- In YOLO, each grid cell predicts only two boxes, which makes it harder to detect small objects that appear in groups.
- It struggles to generalize to objects in new or unusual aspect ratios or configurations.
References:
- You Only Look Once: Unified, Real-Time Object Detection, 2015.
- YOLO9000: Better, Faster, Stronger, 2016.
- YOLOv3: An Incremental Improvement, 2018.
I will post object detection code and a performance comparison of YOLO with other models in an upcoming blog. Stay tuned!