Create YOLOv3 using PyTorch from scratch (Part-1)

2 Understanding the YOLO model

2.1 What does YOLO do

In a nutshell, the YOLO model takes an image, or multiple images, and detects objects in the images. The output of the model consists of:

Figure 1: Demonstration of the YOLOv3 detection result. The thin yellow grid lines divide the entire image into 13 * 13 cells. Each cell makes its own predictions. 2 detected objects, the dog and the truck, are shown by their bounding boxes. The center of the dog box is located in cell (8, 3), and the center of the truck box is located in cell (2, 9). Note that I made up these boxes and locations for illustration purposes; the true values for this sample may be somewhat different, but the principle is the same.

The number of objects it recognizes, which could be one, many, or none.
For each recognized object, it outputs:
1. The x- and y- coordinates, and the width and height, of the bounding box enclosing the object. With this location and size information, one can locate the detected object in the image and visualize the detection by drawing the bounding box on top of the image, as in the example in Figure 1.
2. A confidence score measuring how confident the model is that the detected object exists. This score is in the range 0 to 1, and can be used to filter out less certain predictions.
3. If the model is trained on data containing multiple types of objects, or, more formally, multiple classes of objects, it also outputs a confidence score for the detected object belonging to each class. Typically these classes are mutually exclusive, i.e. an object belongs to only 1 class (a multi-class problem). But YOLO9000, a variant of YOLOv2, was designed to detect objects that belong to more than 1 class (a multi-label problem). E.g. a Norfolk terrier is labeled both as a “terrier” and as a “dog”.
  
  In our subsequent discussion and implementation, we restrict ourselves to the multi-class model: the classes are mutually exclusive.

So the model performs multiple tasks, localization and classification, in a single pass through the network, hence the name You Only Look Once. This also makes YOLO a multi-task model.

2.2 Input and output data

Given the above, the input data to the YOLO model are the images. When represented in numerical format, these are N-dimensional arrays/tensors of the shape:

[Bt, C, H, W]

where,

Bt: batch-size.
C: size of the channel or feature dimension. For images in RGB format, C = 3. The ordering of the 3 color channels doesn’t really matter, as long as you stick to the same ordering during training and inference time.
H and W are the height and width of the image, in number of pixels. These either need to be standardized to a fixed size, e.g. 416 x 416, or can be randomly perturbed as a form of data augmentation during the training stage, e.g. randomly sampled within a range. Either way, one typically sets a reasonable upper bound at training time to save computation.

Also note that this [Bt, C, H, W] ordering follows the PyTorch convention. In TensorFlow the default ordering is [Bt, H, W, C].

When making inferences/predictions, the model outputs a number of detection proposals, because there could be multiple objects in a single image. How these proposals are arranged will be covered in just a minute. But we could already make an educated guess about what each proposal would contain. Based on the previous section, it should provide an array like this:

[x, y, w, h, obj, c1, c2, ..., ck]

where,

x, y: give information about the x- and y- coordinates of the detection.
w, h: are the width and height information.
obj: is the object confidence score.
c1 to ck: the confidence score for each class.

Therefore, each proposed object detection is an array/tensor of length 5 + Nc, where Nc is the number of classes to classify the detected object.

For instance, the COCO detection dataset has 80 different classes, so Nc = 80, and each detection is represented by a tensor of size 5 + 80 = 85.

For the Pascal VOC dataset, Nc = 20, and each detection is a tensor of size 5 + 20 = 25.
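To make the layout concrete, here is a minimal sketch in plain Python that splits one detection vector into its parts. The function name `split_detection` and the sample numbers are made up for illustration; a real implementation would slice a tensor instead of a list.

```python
def split_detection(det, num_classes):
    """Split one detection [x, y, w, h, obj, c1, ..., ck] into named parts."""
    expected = 5 + num_classes
    if len(det) != expected:
        raise ValueError(f"expected {expected} values, got {len(det)}")
    x, y, w, h, obj = det[:5]
    class_scores = list(det[5:])
    return {"box": (x, y, w, h), "obj": obj, "class_scores": class_scores}


# Pascal VOC has 20 classes, so each detection holds 5 + 20 = 25 values.
# The numbers below are placeholders, not real model outputs.
voc_det = [0.5, 0.5, 0.2, 0.3, 0.9] + [0.0] * 20
parts = split_detection(voc_det, num_classes=20)
```

Passing a vector of the wrong length (say, an 85-element COCO detection with `num_classes=20`) raises a `ValueError`, which is a cheap way to catch a mismatched class count early.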

The ordering of the x, y etc. elements is not crucial as long as it is kept consistent. But I also don’t see any good reason to break this convention, so I will use the same ordering as the original YOLO model.

2.3 Arrange the detections, horizontally

The reasoning is fairly intuitive.

There are typically multiple objects in a single image, so we need to make multiple predictions.

Different objects are located at different places in the image, so we place different detections at different locations.

In the realm of numerically represented images, it is most natural to encode locations as elements in a matrix/array.

That’s how YOLO deals with this: it divides a “prediction matrix/array” into Cy number of rows and Cx number of columns, so a total of Cy * Cx cells, and each cell makes its own predictions, corresponding to different places in the image.

In the example shown in Figure 1, the thin grid lines denote such cells, and the dog target object is bounded by a bounding box, whose center point is located in cell (8,3). Another object, the truck, is located at a different cell (2,9).
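As a small sketch of this idea, the snippet below finds which cell a box center falls into, given its pixel location and the cell size in pixels. The helper name `cell_of_point` and the pixel values are invented for illustration (they are chosen so that the results match the made-up cells in Figure 1), and the cell is reported as (row, column).

```python
def cell_of_point(x_pixel, y_pixel, cell_size):
    """Return the (row, column) of the grid cell that contains an image point."""
    cx = int(x_pixel // cell_size)  # column index, counted along x
    cy = int(y_pixel // cell_size)  # row index, counted along y
    return cy, cx


# A 416 * 416 image divided into 13 * 13 cells gives cells of 32 * 32 pixels.
# The center points below are hypothetical.
dog_cell = cell_of_point(112, 272, cell_size=32)
truck_cell = cell_of_point(304, 80, cell_size=32)
print(dog_cell, truck_cell)  # (8, 3) (2, 9)
```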

2.4 Multiple detections in the same cell

A natural question to ask is: what if two objects overlap with each other and fall into the same cell?

One way to solve this problem is to make the cell size smaller, to reduce the chance that multiple objects would land in the same cell. This is particularly effective for large-sized objects.

Additionally, YOLO also allows each cell to predict multiple objects. And this is done slightly differently in YOLOv1 and later versions (up to v3 at least, I haven’t read about v4 or later).

In YOLOv1, each cell makes B predictions, corresponding to B bounding boxes. For instance, for evaluation on the Pascal VOC data, the authors set B = 2. So the total number of predictions is Cy * Cx * B. But the B predictions in each cell have to be of the same class, so the output tensor size is [Cy, Cx, B * 5 + Nc].

Since YOLOv2, the concept of anchor boxes was introduced. These can be understood as prescribed bounding box templates. They come with different sizes and aspect ratios. For instance, one of those used in YOLOv3 has a dimension of 116 * 90, measured in number of pixels. We will go deeper into the size computation shenanigans later, but for now, it suffices to know that each cell can predict more than 1 object. In YOLOv1, each prediction is associated with a bounding box. In later versions, each prediction is associated with an anchor box.

The number of anchor boxes B in each cell can be changed. In YOLOv2 they used 5, and in YOLOv3 3. Different from v1, the B number of predictions in each cell can be of different classes in v2 and later versions. So, the output tensor size is now [Cy, Cx, B, 5 + Nc].
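The difference between the two layouts is easy to check with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and the 7 * 7 grid used for the YOLOv1 example is the grid size from the YOLOv1 paper's Pascal VOC setup rather than something stated above.

```python
def yolov1_output_shape(cy, cx, num_boxes, num_classes):
    """YOLOv1: the B boxes in a cell share one set of class scores."""
    return (cy, cx, num_boxes * 5 + num_classes)


def anchor_output_shape(cy, cx, num_anchors, num_classes):
    """YOLOv2 and later: every anchor box carries its own class scores."""
    return (cy, cx, num_anchors, 5 + num_classes)


v1_shape = yolov1_output_shape(7, 7, num_boxes=2, num_classes=20)
v3_shape = anchor_output_shape(13, 13, num_anchors=3, num_classes=80)
print(v1_shape)  # (7, 7, 30)
print(v3_shape)  # (13, 13, 3, 85)
```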

2.5 Multi-scale detections

The authors of YOLO admitted that the v1 version struggled at detecting small objects, particularly those that come in groups:

Our model struggles with small objects that appear in groups, such as flocks of birds.

(From the YOLOv1 paper.)

Why is this the case?

The limited number of bounding boxes in each cell, and the relatively small number of predicting cells mentioned previously are part of the reason.

It is also because the network of YOLOv1 makes predictions using the outputs only from the last model layer.

Because the information from the input images has gone through multiple convolution and pooling layers, the feature maps become smaller and smaller in the width and height dimensions. What is left at the end of the convolution layers is a highly distilled representation of the image, with fine-grained details largely lost in the process. Small-sized objects are particularly susceptible to such a loss.

So, to counter this, YOLOv2 included a passthrough layer that connects feature maps at a 26 * 26 resolution with those at a 13 * 13 resolution, thus providing some fine-grained features to the detection-making layers. The connection stacks adjacent spatial features into different channels (a space-to-depth rearrangement, i.e. the inverse of a pixel shuffle). We will not expand on this because YOLOv3 does this differently.

YOLOv3 achieves multi-scale detections, by producing predictions at 3 different scale levels:

Large scale: for the detection of large-sized objects. These outputs are taken at the end of the convolution network (see Figure 2 for a schematic), with a stride of 32. I.e. if the input image is 416 * 416 pixels, feature maps at stride-32 have a size of 13 * 13 (416 / 32 = 13).
Mid scale: for detecting mid-size objects. These are taken from the middle of the network, with a stride of 16. I.e. feature maps are 26 * 26.
Small scale: for detecting small-sized objects. These are taken from an even earlier layer in the network, with a stride of 8. I.e. feature maps are 52 * 52.

Again, to complement the mid scale and small scale detections with fine-grained features, passthrough connections are made:

For mid scale detections, the feature map from a layer towards the end of the network is taken. This feature map is at stride-32 (13 * 13 in size). It is up-sampled (by interpolation) by a factor of 2, to stride-16 (26 * 26 in size), and then concatenated along the channel dimension with the feature map from the last layer with stride=16 (layer number 61). The concatenated feature map is passed through a few layers of convolutions before outputting a prediction.
For small scale detections, the feature map from the side-branch created above at stride-16 is taken, up-sampled to stride=8, and concatenated with the output from layer 36 at a stride of 8. The concatenated feature map is passed through a few more conv layers before outputting the prediction for small objects.
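The up-sample-then-concatenate step can be sketched without any deep learning library, using nested lists in [C][H][W] layout. Everything here is a toy: the helper names are made up, the channel counts and sizes are tiny stand-ins rather than the real YOLOv3 ones, and nearest-neighbour repetition is assumed as the simplest form of interpolation.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a [C][H][W] nested list by repetition."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat along width
            rows.append(wide)
            rows.append(list(wide))  # repeat along height
        out.append(rows)
    return out


def concat_channels(a, b):
    """Concatenate two [C][H][W] maps along the channel dimension."""
    if len(a[0]) != len(b[0]) or len(a[0][0]) != len(b[0][0]):
        raise ValueError("height and width must match before concatenation")
    return a + b


def shape(fmap):
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))


# Toy stand-ins: a "deep" map (2 channels, 2 * 2) and a "shallow" map
# (3 channels, 4 * 4) at twice the resolution.
deep = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
shallow = [[[0] * 4 for _ in range(4)] for _ in range(3)]

merged = concat_channels(upsample_nearest_2x(deep), shallow)
print(shape(merged))  # (5, 4, 4)
```

Note that the spatial sizes have to match before concatenation, which is exactly why the deeper map is up-sampled by a factor of 2 first.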

Figure 2 below gives an illustration of this process. Note that when making the small scale predictions, it is incorporating information at 3 scales: stride 8, 16 and 32.

Figure 2: Structure of the YOLOv3 model. Blue boxes represent convolution layers, with their stride level labeled. Prediction outputs are shown as red boxes, and there are 3 of them, with different stride levels. Pass-through connections are labeled as “Route”, and the layers from which they are taken are put in parentheses (e.g. Layer 61. Indexing starts from 0).

This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map.

So, with 3 scales, the output tensor size is now [Cy1, Cx1, B, 5 + Nc] + [Cy2, Cx2, B, 5 + Nc] + [Cy3, Cx3, B, 5 + Nc].

Where Cy1 = Cx1 = 13, Cy2 = Cx2 = 26 and Cy3 = Cx3 = 52.
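A quick sketch to verify these sizes, and to count how many predictions the model makes in total for one image. The function names and default arguments are assumptions for illustration (a 416 * 416 input and the 80 COCO classes).

```python
def yolov3_output_shapes(image_size=416, strides=(32, 16, 8),
                         num_anchors=3, num_classes=80):
    """Return one [Cy, Cx, B, 5 + Nc] shape per scale level."""
    shapes = []
    for stride in strides:
        if image_size % stride != 0:
            raise ValueError("image size must be a multiple of the stride")
        cells = image_size // stride
        shapes.append((cells, cells, num_anchors, 5 + num_classes))
    return shapes


def count_predictions(shapes):
    """Total number of candidate boxes across all scales."""
    return sum(cy * cx * b for cy, cx, b, _ in shapes)


shapes = yolov3_output_shapes()
print(shapes)                     # [(13, 13, 3, 85), (26, 26, 3, 85), (52, 52, 3, 85)]
print(count_predictions(shapes))  # 10647
```

So a single 416 * 416 image yields 10647 candidate boxes, most of which are later discarded based on their confidence scores.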

Recall that in YOLOv3, B = 3 prescribed anchor boxes are associated with each scale level. These are:

(116, 90), (156, 198), (373, 326) for large objects,
(30, 61), (62, 45), (59, 119) for mid objects,
(10, 13), (16, 30), (33, 23) for small objects.

These come from a K-Means clustering of the bounding boxes in the training data, and are measured in pixels. This leads to the size and location computations detailed in the following sub-section.

2.6 The localization task

Let’s dive deeper into how YOLO predicts the location of an object.

Firstly, the location of an object is represented by the location of its bounding box, so we need only 4 numbers, in either of these 2 ways:

(x, y) coordinate of a corner point and (x, y) coordinate of the diagonal point: [x1, y1, x2, y2]. Or
(x, y) coordinate of the center point and (width, height) size: [x, y, w, h].

YOLOv3 takes the 2nd, [x, y, w, h], representation for bounding box predictions. But both formats will be used at different places, so we need some housekeeping code to do the conversions. It should be trivial to implement though.
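A minimal sketch of such housekeeping code, for a single box stored as a tuple. The function names are my own choice; a tensor-based version would do the same arithmetic on whole columns at once.

```python
def xywh_to_xyxy(box):
    """Convert [x_center, y_center, width, height] to [x1, y1, x2, y2]."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def xyxy_to_xywh(box):
    """Convert [x1, y1, x2, y2] to [x_center, y_center, width, height]."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


corners = xywh_to_xyxy((100, 80, 40, 20))
print(corners)                # (80.0, 70.0, 120.0, 90.0)
print(xyxy_to_xywh(corners))  # (100.0, 80.0, 40.0, 20.0)
```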

Now about the sizes. I found it beneficial to first get the coordinate systems straight. Because YOLO implicitly uses 3 different coordinate systems, it is easy to get confused about which one is used at different places, particularly during the training stage.

The original, image coordinate: measured in pixels. This is the coordinate system we use to locate a pixel in an image, x- for counting the columns, and y- for rows.
The feature map coordinate: again, x- for counting columns and y- for rows, but here we are counting the feature map cells. We used the term cells previously, and it helps to stick to it to distinguish from the pixels unit used in the image coordinate. This distinction is easily observed in Figure 1, where the image is divided into 13 * 13 cells, each of which obviously contains more than 1 pixel. At stride=32, the feature map has a size of 13 * 13, so the feature map coordinate at this particular stride level measures offsets within a 13 * 13 matrix. Similarly, at stride=16, the feature map coordinate measures offsets within a 26 * 26 matrix, and 1 unit of offset here is half as large as 1 unit of offset at the stride=32 level (16 pixels versus 32 pixels), in the image coordinate distance sense. Also note that offsets in the feature map coordinate can have decimal places. E.g. 1.5 means 1 and a half units of offset, with respect to the corner point of a feature map at a certain stride level.
The fractional coordinate: for both x- and y- dimensions, this is a positive float. It could be measuring offsets, like the x- and y- locations, or width/height sizes. For instance, the bounding box labels in training data are typically encoded in fractional coordinates. E.g. suppose a label is in [x_center, y_center, width, height, class] format, and has values of [0.5, 0.51, 0.1, 0.12, 10], then the first 4 floats tell the bounding box location/size measured in fractions of that image.
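The conversions between the 3 coordinate systems are just multiplications and divisions. The sketch below uses hypothetical helper names and assumes a square 416 * 416 network input; it converts the x_center (0.5) and width (0.1) of the example label above.

```python
def fractional_to_image(value, image_size):
    """Fraction of the image -> pixels."""
    return value * image_size


def image_to_feature_map(value, stride):
    """Pixels -> feature map cells at the given stride."""
    return value / stride


def fractional_to_feature_map(value, image_size, stride):
    """Fraction of the image -> feature map cells at the given stride."""
    return image_to_feature_map(fractional_to_image(value, image_size), stride)


x_pixels = fractional_to_image(0.5, 416)                # 208 pixels
x_cells_32 = fractional_to_feature_map(0.5, 416, 32)    # 6.5 cells at stride 32
x_cells_16 = fractional_to_feature_map(0.5, 416, 16)    # 13 cells at stride 16
w_cells_32 = fractional_to_feature_map(0.1, 416, 32)    # about 1.3 cells
```

The same fractional location maps to a different number of cells at each stride level, which is why it pays to be explicit about which coordinate system a value lives in.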

Now let’s look at how YOLOv3 locates a bounding box.

First, this is the equation given in the YOLOv2 paper (same holds for YOLOv3):

$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w e^{t_w} \\
b_h &= p_h e^{t_h}
\end{aligned}
$$

where:

ty, tx are the raw model outputs about the y- and x- locations of the bounding box, produced by a cell at location (cy,cx) in the feature map.

σ() is the sigmoid function. So σ(ty) and σ(tx) are floats in the (0, 1) range, and are fractional offsets with respect to the corner of the cell at (cy,cx).

When added onto the integer cell counts of cy and cx, the resultant bounding box center location (by,bx) is using the feature map coordinate system.

tw and th are the raw model outputs about the width and height sizes of the bounding box, again produced by the same cell at (cy,cx) in the feature map.

pw and ph are the width and height of an anchor box, so e^tw and e^th are in the fractional coordinate, and act as positive scaling factors that resize the associated anchor box to match the object being detected. Since bx and by are in feature map coordinates, bw and bh should be as well.

But, the prescribed anchor boxes, e.g. the one with a size of (116, 90), are measured in image coordinates using units of pixels. Therefore, we need to convert 116 to a width measure in the feature map coordinate, and similarly for the height of 90. To do so, we need to divide them by the stride of the feature map in question. For instance, for large scale detections, pw may be 116/32=3.625 and ph may be 90/32=2.8125.
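Putting the equations and the anchor conversion together, a minimal decoding sketch for a single prediction could look like this. The function name `decode_box` and its argument names are assumptions for illustration, and the raw outputs are all set to 0 so that the result is easy to check by hand.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, anchor_wh_pixels, stride):
    """Turn raw outputs into [bx, by, bw, bh] in feature map coordinates."""
    # Anchor sizes are given in pixels, so convert them to cells first.
    p_w = anchor_wh_pixels[0] / stride
    p_h = anchor_wh_pixels[1] / stride
    b_x = sigmoid(t_x) + c_x
    b_y = sigmoid(t_y) + c_y
    b_w = p_w * math.exp(t_w)
    b_h = p_h * math.exp(t_h)
    return b_x, b_y, b_w, b_h


# Cell (cy, cx) = (8, 3) at stride 32, with the (116, 90) anchor box.
box = decode_box(0.0, 0.0, 0.0, 0.0, c_x=3, c_y=8,
                 anchor_wh_pixels=(116, 90), stride=32)
print(box)  # (3.5, 8.5, 3.625, 2.8125)
```

With all raw outputs at 0, the box center sits in the middle of the cell (sigmoid(0) = 0.5) and the box has exactly the anchor's size (e^0 = 1).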

Why “may be”? Recall that a single cell has 3 anchor boxes in YOLOv3, so it could have been the anchor box of (156, 198), or the (373, 326) one, in the case of large scale detections. During inference time, all 3 anchor boxes are used. During training time, only those with the closest match with the ground truth label will be picked. If the ground truth label object is not a “large” object in the first place, maybe none of the 3 are picked. We will come back to the training process in a later post.

Hopefully I’m not over-complicating things, but I found it helpful to map out these different coordinate systems to better understand YOLO’s localization mechanisms.

Figure 3 below gives an illustration of the relationships between the 3 coordinate systems, and the conversions of some variables in between them.

Figure 3: Relationships between the 3 coordinate systems used in YOLOv3. Ellipses show some variables in the coordinate system as the corresponding color, and arrows denote operations applied to convert from 1 coordinate to another.

Having got the predictions in [bx, by, bw, bh] format, we only need to multiply with the current stride level to get back to the image coordinate, measured in pixels, and the results could be plotted out. (NOTE that if you have re-sized the image, for instance, to the standard 416 * 416 size, there is an extra step of re-scaling needed.)
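A sketch of this last step, including the extra re-scaling. The function names are hypothetical, and the re-scaling assumes the image was plainly resized to the network input size; if padding (letterboxing) was used instead, the padding offsets would have to be removed first.

```python
def feature_map_box_to_pixels(box, stride):
    """Scale [bx, by, bw, bh] from feature map cells to network-input pixels."""
    return tuple(v * stride for v in box)


def rescale_to_original(box_pixels, network_size, original_size):
    """Map a box from the resized network input back to the original image.

    Assumes a plain resize without padding. Sizes are (width, height).
    """
    net_w, net_h = network_size
    orig_w, orig_h = original_size
    sx = orig_w / net_w
    sy = orig_h / net_h
    x, y, w, h = box_pixels
    return (x * sx, y * sy, w * sx, h * sy)


# A box at stride 32 in feature map coordinates (made-up values).
box_pixels = feature_map_box_to_pixels((3.5, 8.5, 3.625, 2.8125), stride=32)
print(box_pixels)  # (112.0, 272.0, 116.0, 90.0)

# Suppose the original image was 832 * 624 before being resized to 416 * 416.
box_original = rescale_to_original(box_pixels, (416, 416), (832, 624))
print(box_original)  # (224.0, 408.0, 232.0, 135.0)
```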

So that is, very superficially, how YOLOv3 predicts object locations. Exactly how it produces the correct numbers of [tx, ty, tw, th] such that simple coordinate transformations lead to a correct bounding box is beyond me. And frankly, I don’t think we understand neural networks well enough to clearly decipher this black box. All we can say is that when we feed the model a sufficient amount of correctly formulated training data, it somehow learns to build the correct associations, with a certain degree of generalizability beyond the data it has seen.

2.7 Confidence and classification scores

In addition to localization, YOLO also predicts a confidence score for the existence of an object, and a probability of the object belonging to each of the Nc possible classes, should there be an object in the first place.

Formally, the equation for confidence score prediction is:

$$
P_o = \sigma(t_o)
$$

where t_o is the raw model output. Using the

[x, y, w, h, obj, c1, c2, ..., ck]

arrangement, it is the obj term.

Again, σ() is the sigmoid function, and its output can be interpreted as a probability prediction.

The probability prediction for classes is conditioned on the objectness score:

Pr(Class_i | Object) * Pr(Object)

Note that this is equation (1) in the YOLOv1 paper, where it is stated that:

At test time we multiply the conditional class probabilities and the individual box confidence predictions,

Pr(Class_i | Object) * Pr(Object) * IOU(truth, pred) = Pr(Class_i) * IOU(truth, pred)

I’m wondering if there is some error in this statement: we don’t really have the truth term to compare against at test time. Based on the YOLOv3 PyTorch implementation by eriklindernoren, I assume that there shouldn’t be an IOU term at test/inference time.

Again, under the exclusive classes assumption, class prediction is just a multi-class classification task. The only difference is that each conditional class probability is multiplied by the objectness probability.
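Under the reading above (no IOU term at inference time), the final class scores can be sketched as below. This is an illustration, not the original implementation: the function names and the threshold value are made up, and the conditional class probabilities are taken as given.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def class_scores(t_obj, class_probs):
    """Pr(Class_i | Object) * Pr(Object) for every class."""
    p_obj = sigmoid(t_obj)
    return [p_obj * p for p in class_probs]


def best_class(t_obj, class_probs, threshold):
    """Return (class index, score) of the top class, or None if below threshold."""
    scores = class_scores(t_obj, class_probs)
    idx = max(range(len(scores)), key=scores.__getitem__)
    if scores[idx] < threshold:
        return None
    return idx, scores[idx]


# Raw objectness of 0 gives Pr(Object) = 0.5; the class probabilities are
# placeholder values for a 3-class problem.
print(best_class(0.0, [0.1, 0.8, 0.1], threshold=0.3))  # (1, 0.4)
print(best_class(0.0, [0.1, 0.8, 0.1], threshold=0.5))  # None
```

A detection with a confident class prediction can still be dropped if the objectness is low, since the two are multiplied together.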

I think I’ll talk more about objectness and class predictions in the Training the model post later.
