Create YOLOv3 using PyTorch from scratch (Part-1)
source link: https://numbersmithy.com/create-yolov3-using-pytorch-from-scratch-part-1/
2 Understanding the YOLO model
2.1 What does YOLO do
In a nutshell, the YOLO model takes an image, or multiple images, and detects objects in the images. The output of the model consists of:
- The number of objects it recognizes, which could be 1, many, or none.
- For each recognized object, it outputs:
  - The x- and y- coordinates and the width and height of the bounding box enclosing the object. Using this coordinate and size information, one can locate the detected object in the image, and visualize the detection by drawing the bounding box on top of the image, as in the example in Figure 1.
  - A confidence score measuring how confident the model is about the existence of this detected object. This is a score in the range of 0 to 1, and can be used to filter out the less certain predictions.
Figure 1: Demonstration of the YOLOv3 detection result. The thin yellow grid lines divide the entire image into 13 * 13 cells. Each cell makes its own predictions. 2 detected objects, the dog and the truck, are shown by their bounding boxes. The center of the dog box is located in the cell (8, 3), and the center of the truck box is located in cell (2, 9). Note that I made up these boxes and locations for illustration purposes; the true values for this sample may be somewhat different. But the principle is the same.
If the model is trained on data containing multiple types of objects, or, using a more formal phrase, multiple classes of objects, it also outputs a confidence score for the detected object belonging to each class. Typically these classes are mutually exclusive, i.e. an object belongs to only 1 class (a multi-class problem). But a variant of YOLOv2, YOLO9000, was designed to detect objects that belong to more than 1 class (a multi-label problem). E.g. a Norfolk terrier is labeled both as a “terrier” and as a “dog”.
In our subsequent discussion and implementation, we will be restricted to the multi-class model: the classes are mutually exclusive.
So, the model performs multiple tasks, localization and classification, in a single pass through the network, thus its name You Only Look Once. This also makes YOLO a multi-task model.
2.2 Input and output data
Given the above, the input data to the YOLO model are the images. When represented in numerical format, these are N-dimensional arrays/tensors of the shape:
[Bt, C, H, W]
where,
- Bt: batch size.
- C: size of the channel or feature dimension. For images in RGB format, C = 3. The ordering of the 3 color channels doesn’t really matter, as long as you stick to the same ordering during training and inference time.
- H and W: the height and width of the image, in number of pixels. These need to be standardized to a fixed size, e.g. 416 x 416, or randomly perturbed as a data augmentation method during the training stage, e.g. randomly sampled within a range. But typically one has to set a reasonable upper bound during training time, to save computations.
Also note that this [Bt, C, H, W] ordering follows the PyTorch convention. In TensorFlow it is ordered as [Bt, H, W, C].
When making inferences/predictions, the model outputs a number of detection proposals, because there could be multiple objects in a single image. How these proposals are arranged will be covered in just a minute. But we could already make an educated guess about what each proposal would contain. Based on the previous section, it should provide an array like this:
[x, y, w, h, obj, c1, c2, ..., ck]
where,
- x, y: give information about the x- and y- coordinates of the detection.
- w, h: are the width and height information.
- obj: is the object confidence score.
- c1 to ck: the confidence score for each class.
Therefore, each proposed object detection is an array/tensor of length 5 + Nc, where Nc is the number of classes to classify the detected object.
For instance, the COCO detection dataset has 80 different classes, so Nc = 80, and each detection is represented by a tensor of size 5 + 80 = 85.
For the Pascal VOC dataset, Nc = 20, and each detection is a tensor of size 5 + 20 = 25.
The ordering of the x, y etc. elements is not crucial as long as it is kept consistent. But I also don’t see any good reason to break this convention, so I will use the same ordering as the original YOLO model.
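To make the layout concrete, here is a minimal sketch that splits one proposal into its parts. The helper name split_proposal and the sample numbers are made up for illustration; only the [x, y, w, h, obj, c1, ..., ck] ordering comes from the text above.

```python
def split_proposal(proposal, num_classes):
    """Split one detection proposal into its named parts.

    Assumes the [x, y, w, h, obj, c1, ..., ck] ordering used in this post.
    """
    expected = 5 + num_classes
    if len(proposal) != expected:
        raise ValueError(f"expected {expected} values, got {len(proposal)}")
    x, y, w, h, obj = proposal[:5]
    class_scores = list(proposal[5:])
    return {"box": (x, y, w, h), "obj": obj, "classes": class_scores}


# A made-up proposal for a 3-class problem (Nc = 3), so its length is 5 + 3 = 8.
example = [0.5, 0.4, 0.2, 0.3, 0.9, 0.1, 0.7, 0.2]
parts = split_proposal(example, num_classes=3)
print(parts["box"], parts["obj"], parts["classes"])
```

With Nc = 80 (COCO) the same helper would expect 85 values per proposal, and with Nc = 20 (Pascal VOC) it would expect 25.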
2.3 Arrange the detections, horizontally
The reasoning is fairly intuitive.
There are typically multiple objects in a single image, so we need to make multiple predictions.
Different objects are located at different places in the image, so we place different detections at different locations.
In the realm of numerically represented images, it is most natural to encode locations as elements in a matrix/array.
That’s how YOLO deals with this: it divides a “prediction matrix/array” into Cy rows and Cx columns, so a total of Cy * Cx cells, and each cell makes its own predictions, corresponding to different places in the image.
In the example shown in Figure 1, the thin grid lines denote such cells, and the dog target object is bounded by a bounding box, whose center point is located in cell (8, 3). Another object, the truck, is located at a different cell (2, 9).
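As a rough sketch of how a box center maps to a cell, the snippet below assumes a square 416-pixel image divided into 13 * 13 cells and returns the cell as (row, column), i.e. (cy, cx). The function name locate_cell and the sample center point are hypothetical.

```python
def locate_cell(x_center, y_center, image_size=416, num_cells=13):
    """Return (cy, cx): the row and column of the cell containing a box center.

    x_center and y_center are in pixels; the image is assumed to be square.
    """
    cell_size = image_size / num_cells  # 416 / 13 = 32 pixels per cell
    # Clamp so a center lying exactly on the far edge stays in the last cell.
    cx = min(int(x_center // cell_size), num_cells - 1)
    cy = min(int(y_center // cell_size), num_cells - 1)
    return cy, cx


# A made-up center point at x = 120 px, y = 270 px.
print(locate_cell(120.0, 270.0))  # (8, 3)
```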
2.4 Multiple detections in the same cell
A natural question to ask is: what if two objects overlap with each other and fall into the same cell?
One way to solve this problem is to make the cell size smaller, to reduce the chance that multiple objects would land in the same cell. This is particularly effective for large-sized objects.
Additionally, YOLO also allows each cell to predict multiple objects. And this is done slightly differently in YOLOv1 and later versions (up to v3 at least, I haven’t read about v4 or later).
In YOLOv1, each cell makes B predictions, corresponding to B bounding boxes. For instance, for evaluation on the Pascal VOC data, the author set B = 2. So the total number of predictions is Cy * Cx * B. But the B predictions in each cell have to be of the same class, so the output tensor size is [Cy, Cx, B * 5 + Nc].
Since YOLOv2, the concept of anchor boxes was introduced. These can be understood as prescribed bounding box templates. They come with different sizes and aspect ratios. For instance, one of them used in YOLOv3 has a dimension of 116 * 90, measured in number of pixels. We will go deeper into the size computation shenanigans later, but for now, it suffices to know that each cell can predict more than 1 object. In YOLOv1, each prediction is associated with a bounding box. In later versions, each prediction is associated with an anchor box.
The number of anchor boxes B in each cell can be changed. YOLOv2 used 5, and YOLOv3 uses 3 at each scale level. Unlike in v1, the B predictions in each cell can be of different classes in v2 and later versions. So, the output tensor size is now [Cy, Cx, B, 5 + Nc].
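The two output layouts can be compared with a small sketch. The function names are made up, and the 7 * 7 grid used for YOLOv1 is taken from the YOLOv1 paper rather than from this post.

```python
def yolov1_output_shape(cy, cx, num_boxes, num_classes):
    """YOLOv1 layout: the B boxes of a cell share one set of class scores."""
    return (cy, cx, num_boxes * 5 + num_classes)


def anchor_output_shape(cy, cx, num_anchors, num_classes):
    """YOLOv2/v3 layout: every anchor box carries its own class scores."""
    return (cy, cx, num_anchors, 5 + num_classes)


# YOLOv1 on Pascal VOC: B = 2, Nc = 20 (7 * 7 grid assumed from the paper).
print(yolov1_output_shape(7, 7, num_boxes=2, num_classes=20))     # (7, 7, 30)
# YOLOv3 on COCO at the 13 * 13 scale: B = 3, Nc = 80.
print(anchor_output_shape(13, 13, num_anchors=3, num_classes=80))  # (13, 13, 3, 85)
```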
2.5 Multi-scale detections
The author of YOLO admitted that the v1 version struggled at detecting small objects, particularly those that come in groups:
Our model struggles with small objects that appear in groups, such as flocks of birds.
(From the YOLOv1 paper.)
Why is it the case?
The limited number of bounding boxes in each cell, and the relatively small number of predicting cells mentioned previously are part of the reason.
It is also because the network of YOLOv1 makes predictions using the outputs only from the last model layer.
Because the information from input images has gone through multiple convolution layers and pooling layers, the feature maps become smaller and smaller in the width and height dimensions. What is left at the end of the convolution layers is a highly distilled representation of the image, with fine-grained details largely lost during the process. And small-sized objects are particularly susceptible to such a loss.
So, to counter this, YOLOv2 included a passthrough layer that connects feature maps with a 26 * 26 resolution with those with 13 * 13 resolution, thus providing some fine-grained features to the detection-making layers. The connection is done by stacking adjacent spatial features into different channels (a space-to-depth rearrangement, the reverse of a pixel shuffle). We will not expand on this because YOLOv3 does this differently.
YOLOv3 achieves multi-scale detections by producing predictions at 3 different scale levels:
- Large scale: for the detection of large-sized objects. These outputs are taken at the end of the convolution network (see Figure 2 for a schematic), with a stride of 32. I.e. if the input image is 416 * 416 pixels, feature maps at stride-32 have a size of 416 / 32 = 13.
- Mid scale: for detecting mid-size objects. These are taken from the middle of the network, with a stride of 16. I.e. feature maps are 26 * 26.
- Small scale: for detecting small-sized objects. These are taken from an even earlier layer in the network, with a stride of 8. I.e. feature maps are 52 * 52.
Again, to complement the mid scale and small scale detections with fine-grained features, passthrough connections are made:
- For mid scale detections, the feature map from a layer towards the end of the network is taken. This feature map is at stride-32 (13 * 13 in size). It is up-sampled (by interpolation) by a factor of 2, to stride-16 (26 * 26 in size), and then concatenated along the channel dimension with the feature map from the last layer with stride=16 (layer number 61). This concatenated feature map is passed through a few layers of convolutions before outputting a prediction.
- For small scale detections, the feature map from the above created side-branch at stride-16 is taken, up-sampled to stride=8, and concatenated with the output from layer 36 at a stride of 8. This concatenated feature map is passed through a few more conv layers before outputting the prediction for small objects.
Figure 2 below gives an illustration of this process. Note that when making the small scale predictions, it is incorporating information at 3 scales: stride 8, 16 and 32.
Figure 2: Structure of the YOLOv3 model. Blue boxes represent convolution layers, with their stride levels labeled. Prediction outputs are shown as red boxes, and there are 3 of them, with different stride levels. Pass-through connections are labeled as “Route”, and the layers they are taken from are given in parentheses (e.g. Layer 61. Indexing starts from 0).
This method allows us to get more meaningful semantic information from the upsampled features and finer-grained information from the earlier feature map.
So, with 3 scales, the output tensor size is now [Cy1, Cx1, B, 5 + Nc] + [Cy2, Cx2, B, 5 + Nc] + [Cy3, Cx3, B, 5 + Nc], where Cy1 = Cx1 = 13, Cy2 = Cx2 = 26 and Cy3 = Cx3 = 52.
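The arithmetic behind these sizes can be checked with a short sketch. It assumes a square 416 * 416 input, B = 3 and the 80 COCO classes; the function names are made up for illustration.

```python
def grid_sizes(image_size=416, strides=(32, 16, 8)):
    """Feature-map side length at each stride, for a square input image."""
    for stride in strides:
        if image_size % stride != 0:
            raise ValueError(f"image size {image_size} is not a multiple of {stride}")
    return [image_size // stride for stride in strides]


def output_shapes(image_size=416, num_anchors=3, num_classes=80):
    """One [Cy, Cx, B, 5 + Nc] shape per scale level."""
    return [(g, g, num_anchors, 5 + num_classes) for g in grid_sizes(image_size)]


def count_predictions(image_size=416, num_anchors=3):
    """Total number of box predictions summed over the 3 scale levels."""
    return sum(g * g * num_anchors for g in grid_sizes(image_size))


print(grid_sizes())         # [13, 26, 52]
print(output_shapes())      # [(13, 13, 3, 85), (26, 26, 3, 85), (52, 52, 3, 85)]
print(count_predictions())  # 10647
```

So a single 416 * 416 image yields 10647 candidate boxes, most of which are later discarded by their confidence scores.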
Recall that in YOLOv3, B = 3 prescribed anchor boxes are associated with each scale level. These are:
- (116, 90), (156, 198), (373, 326) for large objects,
- (30, 61), (62, 45), (59, 119) for mid objects,
- (10, 13), (16, 30), (33, 23) for small objects.
These are taken from a K-Means clustering of the bounding boxes in the training data, and the numbers are in units of pixels. This leads to size and location computations, detailed in the following sub-section.
2.6 The localization task
Let’s dive deeper into how YOLO predicts the location of an object.
Firstly, the location of an object is represented by the location of its bounding box, so we need only 4 numbers, in either of these 2 ways:
- (x, y) coordinates of a corner point and (x, y) coordinates of the diagonally opposite corner: [x1, y1, x2, y2]. Or
- (x, y) coordinates of the center point and the (width, height) size: [x, y, w, h].
YOLOv3 takes the 2nd, [x, y, w, h], representation for bounding box predictions.
But both formats will be used at different places, so we need some housekeeping code to convert between them. It should be trivial to implement though.
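A minimal sketch of that housekeeping code could look like the following. The function names are made up, and it assumes [x1, y1] is the top-left corner and [x2, y2] the bottom-right one.

```python
def xywh_to_xyxy(box):
    """Convert [x_center, y_center, w, h] to [x1, y1, x2, y2]."""
    x, y, w, h = box
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)


def xyxy_to_xywh(box):
    """Convert [x1, y1, x2, y2] to [x_center, y_center, w, h]."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


corners = xywh_to_xyxy((50, 40, 20, 10))
print(corners)                # (40.0, 35.0, 60.0, 45.0)
print(xyxy_to_xywh(corners))  # (50.0, 40.0, 20.0, 10.0)
```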
Now about the sizes. I found it beneficial to first get the coordinate systems straight. Because YOLO implicitly uses 3 different coordinate systems, it is easy to get confused about which one is used at different places, particularly during the training stage.
- The original, image coordinate: measured in pixels. This is the coordinate system we use to locate a pixel in an image, x- for counting the columns, and y- for rows.
- The feature map coordinate: again, x- for counting columns and y- for rows, but here we are counting the feature map cells. Recall that we used the term cells previously; it helps to stick to this term to distinguish it from the pixels unit used in the image coordinate. This distinction is easily observed in Figure 1, where the image is divided into 13 * 13 cells, but each cell obviously contains more than 1 pixel. At stride=32, the feature map has a size of 13 * 13. So the feature map coordinate at this particular stride level measures offsets within a 13 * 13 matrix. Similarly, at stride=16, the feature map measures offsets within a 26 * 26 matrix, and 1 unit of offset here is half as large as 1 unit of offset at the stride=32 level, in the image coordinate distance sense (16 pixels instead of 32). Also note that offsets in feature map coordinate can have decimal places. E.g. 1.5 means 1 and a half units of offset, with respect to the corner point of a feature map at a certain stride level.
- The fractional coordinate: for both x- and y- dimensions, this is a positive float, expressed as a fraction of the image width or height. It could be measuring offsets, like the x- and y- locations, or width/height sizes. For instance, the bounding box labels in training data are typically encoded in fractional coordinates. E.g. suppose a label is in [x_center, y_center, width, height, class] format, and has values of [0.5, 0.51, 0.1, 0.12, 10], then the first 4 floats tell the bounding box location/size measured in fractions of that image.
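The conversions between the 3 coordinate systems are simple multiplications and divisions. The sketch below uses the example label from the list above and assumes a 416 * 416 image; the function names are hypothetical.

```python
def fractional_to_pixels(box, image_w, image_h):
    """Convert a fractional [x, y, w, h] box to image coordinates (pixels)."""
    x, y, w, h = box
    return (x * image_w, y * image_h, w * image_w, h * image_h)


def pixels_to_feature_map(box, stride):
    """Convert a pixel [x, y, w, h] box to feature map coordinates (cells)."""
    return tuple(value / stride for value in box)


label = [0.5, 0.51, 0.1, 0.12, 10]   # the last element is the class index
box_px = fractional_to_pixels(label[:4], image_w=416, image_h=416)
box_cells = pixels_to_feature_map(box_px, stride=32)
print(box_px)     # roughly (208.0, 212.16, 41.6, 49.92)
print(box_cells)  # roughly (6.5, 6.63, 1.3, 1.56)
```

At stride 32 the center of this label falls in the cell at row 6, column 6, with fractional offsets of about 0.63 and 0.5 inside that cell.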
Now let’s look at how YOLOv3 locates a bounding box.
First, this is the equation given in the YOLOv2 paper (same holds for YOLOv3):
$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x \\
b_y &= \sigma(t_y) + c_y \\
b_w &= p_w e^{t_w} \\
b_h &= p_h e^{t_h}
\end{aligned}
$$
where:
ty, tx are the raw model outputs for the y- and x- locations of the bounding box, produced by a cell at location (cy, cx) in the feature map.
σ() is the sigmoid function. So σ(ty) and σ(tx) are floats in the (0, 1) range, and are fractional offsets with respect to the corner of the cell at (cy, cx).
When added onto the integer cell counts of cy and cx, the resultant bounding box center location (by, bx) is in the feature map coordinate system.
tw and th are the raw model outputs for the width and height of the bounding box, again produced by the same cell at (cy, cx) in the feature map.
pw and ph are the width and height of an anchor box, so e^tw and e^th are in the fractional coordinate, and act as positive scaling factors that resize this associated anchor box to match the object being detected. Because bx and by are in feature map coordinates, so should bw and bh.
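The four equations translate almost line by line into code. The following is a minimal sketch for a single prediction, with the anchor size pw, ph assumed to be given already in feature map units; the function names and the sample values are made up.

```python
import math


def sigmoid(t):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Turn raw outputs into a box [bx, by, bw, bh] in feature map coordinates."""
    bx = sigmoid(tx) + cx   # offset inside the cell plus the cell column
    by = sigmoid(ty) + cy   # offset inside the cell plus the cell row
    bw = pw * math.exp(tw)  # anchor width scaled by a positive factor
    bh = ph * math.exp(th)  # anchor height scaled by a positive factor
    return bx, by, bw, bh


# With all raw outputs at 0, the box sits at the cell center and keeps the
# anchor size, because sigmoid(0) = 0.5 and exp(0) = 1.
print(decode_box(0.0, 0.0, 0.0, 0.0, cx=3, cy=8, pw=2.0, ph=4.0))  # (3.5, 8.5, 2.0, 4.0)
```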
But the prescribed anchor boxes, e.g. the one with a size of (116, 90), are measured in image coordinates using units of pixels. Therefore, we need to convert 116 to a width measure in the feature map coordinate, and similarly for the height of 90. To do so, we need to divide them by the stride of the feature map in question. For instance, for large scale detections, pw may be 116/32=3.625 and ph may be 90/32=2.8125.
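The same division applies to every anchor at every scale. Here is a small sketch, using the anchor sizes listed earlier; the dictionary layout and the function name are my own choices for illustration.

```python
# Anchor sizes in pixels (width, height), keyed by the stride of their scale.
ANCHORS_PX = {
    32: [(116, 90), (156, 198), (373, 326)],  # large objects
    16: [(30, 61), (62, 45), (59, 119)],      # mid objects
    8: [(10, 13), (16, 30), (33, 23)],        # small objects
}


def anchors_in_cells(anchors_px, stride):
    """Convert anchor sizes from pixels to feature map units at one stride."""
    return [(w / stride, h / stride) for w, h in anchors_px]


for stride, anchors in ANCHORS_PX.items():
    print(stride, anchors_in_cells(anchors, stride))
# The first anchor at stride 32 becomes (3.625, 2.8125).
```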
Why “may be”? Recall that a single cell has 3 anchor boxes in YOLOv3, so it could have been the anchor box of (156, 198), or the (373, 326) one, in the case of large scale detections. During inference time, all 3 anchor boxes are used. During training time, only those with the closest match with the ground truth label will be picked. If the ground truth object is not a “large” object in the first place, maybe none of the 3 are picked. We will come back to the training process in a later post.
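One common way to define the “closest match” is to compare only widths and heights, as if the ground truth box and the anchor shared the same center, and pick the anchor with the highest IoU. The sketch below illustrates that idea under this assumption; it is not necessarily the exact rule used in the training code of a later post, and the function names and the sample ground truth size are made up.

```python
def wh_iou(wh_a, wh_b):
    """IoU of two boxes that share the same center, so only sizes matter."""
    inter = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - inter
    return inter / union


def best_anchor(gt_wh, anchors):
    """Return (index, iou) of the anchor that best matches a ground truth size."""
    ious = [wh_iou(gt_wh, anchor) for anchor in anchors]
    best = max(range(len(anchors)), key=lambda i: ious[i])
    return best, ious[best]


large_anchors = [(116, 90), (156, 198), (373, 326)]
# A made-up ground truth box of 150 * 200 pixels matches the 2nd anchor best.
index, iou = best_anchor((150, 200), large_anchors)
print(index, round(iou, 3))
```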
Hopefully I’m not over-complicating things, but I found it helpful to map out these different coordinate systems to better understand YOLO’s localization mechanisms.
Figure 3 below gives an illustration of the relationships between the 3 coordinate systems, and the conversions of some variables in between them.
Figure 3: Relationships between the 3 coordinate systems used in YOLOv3. Ellipses show some variables in the coordinate system of the corresponding color, and arrows denote operations applied to convert from 1 coordinate system to another.
Having got the predictions in [bx, by, bw, bh] format, we only need to multiply by the current stride level to get back to the image coordinate, measured in pixels, and the results can be plotted out. (NOTE that if you have re-sized the image, for instance, to the standard 416 * 416 size, there is an extra step of re-scaling needed.)
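Both steps can be sketched as follows. The re-scaling function assumes the original image was simply stretched to the 416 * 416 network size; if padding (letterboxing) was used instead, the offsets of the padding would also have to be removed. All names and sample values here are hypothetical.

```python
def to_image_pixels(box_cells, stride):
    """Convert [bx, by, bw, bh] from feature map units to network-input pixels."""
    return tuple(value * stride for value in box_cells)


def rescale_to_original(box_px, net_size, orig_w, orig_h):
    """Map a box from the resized network input back to the original image.

    Assumes a plain stretch resize to net_size * net_size, without padding.
    """
    x, y, w, h = box_px
    scale_x = orig_w / net_size
    scale_y = orig_h / net_size
    return (x * scale_x, y * scale_y, w * scale_x, h * scale_y)


box_px = to_image_pixels((3.5, 8.5, 3.625, 2.8125), stride=32)
print(box_px)  # (112.0, 272.0, 116.0, 90.0)
print(rescale_to_original(box_px, net_size=416, orig_w=832, orig_h=624))
# (224.0, 408.0, 232.0, 135.0)
```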
So that is, very superficially, how YOLOv3 predicts object locations. Exactly how it produces the correct numbers of [tx, ty, tw, th] such that simple coordinate transformations lead to a correct bounding box is beyond me. And frankly, I don’t think we understand neural networks well enough to clearly decipher this black-box. All we can say is that when we feed the model with enough correctly formulated training data, it somehow learns to build the correct associations, with a certain degree of generalizability beyond the data it has seen.
2.7 Confidence and classification scores
In addition to localization, YOLO also predicts a confidence score for the existence of an object, and a probability for the object belonging to each of the Nc possible classes, should there be an object in the first place.
Formally, the equation for confidence score prediction is:
$$
P_o = \sigma(t_o)
$$
where t_o is the raw model output. Using the [x, y, w, h, obj, c1, c2, ..., ck] arrangement, it is the obj term.
Again, σ() is the sigmoid function, and its output can be interpreted as a probability prediction.
The probability prediction for classes is conditioned on the objectness score:
Pr(Class_i | Object) * Pr(Object)
Note that this is the equation (1) in the YOLOv1 paper, and it is stated that:
At test time we multiply the conditional class probabilities and the individual box confidence predictions, Pr(Class_i | Object) * Pr(Object) * IOU(truth, pred) = Pr(Class_i) * IOU(truth, pred)
I’m wondering if there is some error in this statement: we don’t really have the truth term to compare against at test time. Based on the YOLOv3 PyTorch implementation by eriklindernoren, I assume that there shouldn’t be an IOU term at test/inference time.
Again, under the exclusive classes assumption, class prediction is just a multi-class classification task, and the only difference is that each class probability is multiplied by the objectness probability it is conditioned on.
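As a rough sketch of the inference-time scoring, without the IOU term, the snippet below multiplies the objectness probability with conditional class probabilities that are taken as given. The function names, the 0.5 threshold and the sample numbers are assumptions for illustration, not values from any particular implementation.

```python
import math


def sigmoid(t):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def class_scores(raw_obj, class_probs):
    """Pr(Class_i | Object) * Pr(Object) for every class."""
    p_obj = sigmoid(raw_obj)  # objectness probability from the raw output
    return [p_obj * p for p in class_probs]


def best_class(raw_obj, class_probs, threshold=0.5):
    """Return (class index, score) of the top class, or None if below threshold."""
    scores = class_scores(raw_obj, class_probs)
    index = max(range(len(scores)), key=lambda i: scores[i])
    if scores[index] < threshold:
        return None
    return index, scores[index]


# A confident object (raw objectness 4.0) whose 2nd class is the most likely.
print(best_class(4.0, [0.2, 0.8]))
# An uncertain object (raw objectness 0.0) is filtered out by the threshold.
print(best_class(0.0, [0.2, 0.8]))  # None
```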
I think I’ll talk more about objectness and class predictions in the Training the model post later.