Create YOLOv3 using PyTorch from scratch (Part-4)

In this post we create 3 essential tools in the object detection task: IoU (Intersection-over-Union), NMS (Non-Maximum suppression) and mAP (mean Average Precision).

1 Overview

This is Part-4 of the series on building a YOLOv3 model from scratch.

Here is an overview of the series:

Understanding the YOLO model.
Building the model backbone.
Load pre-trained weights
Get the tools ready: this post.

Before getting into the training part, it might be helpful to first get some utilities ready, including the codes to compute IoU (Intersection-Over-Union), NMS (Non-Maximum suppression) and mAP (mean Average Precision).
Training data preparation.
Train the model.

2 Intersection-Over-Union (IoU)

2.1 Concept

IoU is a common metric in object detection tasks. It is a ratio of, as its name suggests, intersection over union. Figure 1 gives an illustration.

Figure 1: Schematic of IoU between 2 boxes. The intersection region is labeled “Inter”. Union is denoted by hatching.

2.2 Python code

The implementation is also fairly simple. Code first:

def compute_IOU(bbox1, bbox2, x1y1x2y2=False):
    '''Compute IoU of 2 groups of bounding boxes

    Args:
        bbox1 (ndarray): bounding box 1, in either [x_center, y_center, w, h]
            format, or [x1, y1, x2, y2] format.
        bbox2 (ndarray): bounding box 2, in either [x_center, y_center, w, h]
            format, or [x1, y1, x2, y2] format.
    Keyword Args:
        x1y1x2y2 (bool): If True, assumes bbox1 and bbox2 are in [x1, y1, x2, y2] format.
    Returns:
        iou (ndarray): intersection over union ratios.
    '''

    if not x1y1x2y2:
        bbox1 = to_x1y1x2y2(bbox1)
        bbox2 = to_x1y1x2y2(bbox2)

    b1x1, b1y1, b1x2, b1y2 = bbox1[:,0], bbox1[:,1], bbox1[:,2], bbox1[:,3]
    b2x1, b2y1, b2x2, b2y2 = bbox2[:,0], bbox2[:,1], bbox2[:,2], bbox2[:,3]

    # intersection coordinates
    inter_x1 = np.maximum(b1x1, b2x1)
    inter_x2 = np.minimum(b1x2, b2x2)
    inter_y1 = np.maximum(b1y1, b2y1)
    inter_y2 = np.minimum(b1y2, b2y2)

    # intersection area
    inter = np.clip(inter_x2 - inter_x1, 0, None) * np.clip(inter_y2 - inter_y1, 0, None)

    w1, h1 = (b1x2 - b1x1) , (b1y2 - b1y1)
    w2, h2 = (b2x2 - b2x1) , (b2y2 - b2y1)
    area1 = w1 * h1
    area2 = w2 * h2

    # union area
    union = area1 + area2 - inter
    # iou
    iou = inter / (union + 1e-7)

    # check if box1 inside box2
    idx = abs(area1 / (area2 + 1e-7) - iou) <= 1e-3
    iou[idx] = (inter / (area1 + 1e-7))[idx]
    # check if box2 inside box1
    idx = abs(area2 / (area1 + 1e-7) - iou) <= 1e-3
    iou[idx] = (inter / (area2 + 1e-7))[idx]

    return iou

Some points to note:

We put this compute_IOU() function in the utils.py script in the YOLOv3_pytorch project folder. For a structure of the folder, refer back to the Create the Darknet-53 model section of part-2.
We assume the input arguments bbox1 and bbox2 are bounding boxes already converted numpy arrays, and have columns of [xcenter, ycenter, width, height], or [x1, y1, x2, y2].

Sometimes, a small bounding box may be entirely enclosed by a larger one, then the IoU will be the areal ratio between the 2, and could be a fairly small number. This may cause trouble when doing the NMS filtering. We will see why it is an issue in the next section. But to prevent such an issue, we manually change the IoU values of such cases to a number close to 1:

    # check if box1 inside box2
    idx = abs(area1 / (area2 + 1e-7) - iou) <= 1e-3
    iou[idx] = (inter / (area1 + 1e-7))[idx]
    # check if box2 inside box1
    idx = abs(area2 / (area1 + 1e-7) - iou) <= 1e-3
    iou[idx] = (inter / (area2 + 1e-7))[idx]

To convert bounding boxes in [xcenter, ycenter, width, height] format to [x1, y1, x2, y2], we use a to_x1y1x2y2() function:

    def to_x1y1x2y2(bbox):
        '''Convert from [x_center, y_center, w, h] -> [x1, y1, x2, y2]
        '''
        x1 = bbox[:,0] - bbox[:,2] / 2
        x2 = bbox[:,0] + bbox[:,2] / 2
        y1 = bbox[:,1] - bbox[:,3] / 2
        y2 = bbox[:,1] + bbox[:,3] / 2

        return np.c_[x1, y1, x2, y2]

3 Non-Maximum suppression (NMS)

3.1 Concept

NMS is often used to remove duplicate predictions, where multiple similar bounding boxes are predicted around a same target. Figure 2 gives an illustration of the process.

The steps of NMS:

Sort the predictions by their confidence scores, from highest to lowest, and put them into a “waiting list”. In the example of Figure 2, we have:

box index confidence score
1 0.99
4 0.98
2 0.90
3 0.81
5 0.70
6 0.50
Select the top-ranking bounding box (box-1), remove it from the “waiting list” and put it into the result list.
Compute IoU scores between the just selected bounding box with all remaining ones in the “waiting list”. Remove boxes with IoU scores greater than a prescribed threshold (typically 0.5) from the “waiting list”.

In the example of Figure 2, let’s assume that box-2, and -3 overlap with box-1 with IoUs >= 0.5, so they are removed. Box-4, -5 and -6 have little to none overlap with box-1, so they stay in the “waiting list”. This makes sense since we are picking a “winner” predictor by suppressing similar competitors, but leaving those irrelevant predictors alone.
Repeat step-2 and -3 until the “waiting list” is empty.

box index	confidence score
1	0.99
4	0.98
2	0.90
3	0.81
5	0.70
6	0.50

Figure 2: Schematic of the NMS process. Confidence score of each bounding box is shown in parentheses in the box label.

In the example in Figure 2, after having selected box-1 and removed box-2 and -3, we pick box-4 as the next top-ranking box. Then we compute the IoUs between box-4 with -5 and -6.

Notice that box-6 is entirely enclosed by box-4, so the “original” IoU ratio will be fairly small, and there is a high chance that it is below the 0.5 threshold, and thus doesn’t get removed.

That’s why when enclosing bounding boxes are detected, we manually change the IoU to the ratio of intersection over the area of the smaller box, which essentially gives an IoU of 1.0.

3.2 Python code

Code first:

def NMS(predictions, conf_thres, iou_thres, verbose=True):
    '''Non-maximum suppression on predictions of a single image

    Args:
        predictions (ndarray): bbox predictions. In shape (n, 5+n_classes).
            First 5 columns are: [x_c, y_c, w, h, conf_score]. Last n_classes
            columns are classification predictions.
        conf_thres (float): filter out detections with object confidence score
            lower than this.
        iou_thres (float): float in range [0, 1]. Suppress bboxes with overlaps
            greater than this ratio.
    Returns:
        results (ndarray): filtered detections, in shape (m, 6). m is the number
            of filtered detections. Columns are: [x_c, y_c, w, h, conf_score, class_idx].
            class_idx is np.argmax(results[:,5:], axis=1).
    '''

    results = []
    conf_idx = 4

    if verbose:
        print('Number of predictions before confidence filtering:', len(predictions))

    # select by conf
    idx = predictions[:, conf_idx] >= conf_thres
    predictions = predictions[idx]
    if len(predictions) == 0:
        return results

    if verbose:
        print('Number of predictions after confidence filtering:', len(predictions))

    # conditional on objectness
    predictions[:, 5:] *= predictions[:, 4:5]

    # select by class prob. idx_box select bboxes, idx_cls select classes
    idx_box, idx_cls = np.where(predictions[:, 5:] >= conf_thres)
    boxes = predictions[idx_box, :4]
    confs = predictions[idx_box, idx_cls+5]
    clss = idx_cls[:, None]
    pred = np.c_[boxes, confs, clss]  # [m, 6]

    if verbose:
        print('Number of predictions after class prob filtering:', len(pred))
    if len(pred) == 0:
        return results

    # sort by conf
    idx = np.argsort(pred[:, conf_idx])[::-1]
    pred = pred[idx]
    # NMS
    while True:
        p1, pred = pred[0:1], pred[1:]
        results.append(p1)
        if len(pred) == 0:
            break

        # compute iou with others
        ious = compute_IOU(p1[:4], pred[:, :4], x1y1x2y2=False)
        # remove overlaps
        pred = pred[ious < iou_thres]

    results = np.vstack(results)

    return results

Some points:

The low confident predictions are first removed by a conf_thres threshold. This will drastically reduced the number of candidates to go through. Recall that with an input size of 416 x 416 and 3 size scales, YOLOv3 outputs (52 * 52 + 26 * 26 + 13 * 13) * 3 = 10647 predictions per image. With a confidence thresholding we can reduce that down to about a hundred, or even fewer, depending the chosen threshold.
The classification probabilities are conditioned on objectness confidence score:
```
  predictions[:, 5:] *= predictions[:, 4:5]
```

4 mean Average Precision (mAP)

4.1 mAP is the average of APs, and AP is parameterized on IoU threshold

mAP (mean Average Precision), is an averaged score across multiple classes. E.g. across the 80 different classes in the COCO detection dataset. For each class, we compute an AP (Average Precision), and the results of all classes are averaged to get the mAP.

The AP score is parameterized by the IoU threshold. For instance, the “traditional” way, also the way used in PASCAL VOC metric, uses a IoU threshold of 0.5. Thus computed AP is denoted as AP@IoU=0.5.

Instead of taking a single IoU=0.5 threshold, one could take multiple IoU thresholds and average the results. For instance, repeat the similar computations at 10 different IoU thresholds, starting from 0.5 to 0.95, with a step of 0.05. The averaged final result is denoted AP@[0.5:0.05:0.95]. This is the so-called COCO primary challenge metric. But let’s leave this multiple-IoU threshold thing to the very end.

Note that the IoU threshold here is NOT the same thing as the threshold used in the NMS process:

The IoU threshold in the NMS process is used to flag duplicates.
the IoU threshold for AP computation is used to flag true positive and false positive predictions.

4.2 True and false positive predictions, precision and recall

So how do we decide whether a prediction is true or false?

Recall that AP is conditioned on a specific class, so the predicted object has to be of the correct class.

Then the localization of the prediction has to be reasonably good. This is measured by the IoU score: if the predicted bounding box overlaps with a ground truth label of the specific class at IoU >= IoU_threshold, then it is flagged as a true positive prediction. Otherwise, it is a false positive prediction.

An example is given in Figure 3. The predicted boxes have gone through the NMS process, and there are 4 boxes left: box-1, -2, -7 and -8.

Figure 3: Demo for AP computation for the “person” class. 2 ground truth labels are given as “label-1” and “label-2”. 4 preditions are given as “box-1”, “box-4”, “box-7” and “box-8”.

Here, we are looking at the AP for the “person” class, so box-7 that predicts a laptop is irrelevant, and we ignore it.

There are 2 ground truth “person” labels in the this image. Box-1 is a close match with label-1, so it is a true positive. Similarly for box-4.

Box-8 also predicts a “person”, but by too poor a localization (IoU with label-2 < the IoU threshold, which, for instance, was set to 0.5). So box-8 is flagged as false positive.

But how do we know box-1 should be compared against label-1, and box-8 against label-2? In fact, we compare each prediction with all labels of class “person”, and chose the best match with the highest IoU.

Then, for this image and class “person”, we have 2 true positives: TP=2, and 1 false positive: FP=1.

So, precision and recall can be computed:

precision: P = TP / (TP + FP) = 2 / 3 = 0.666.
recall: R = TP / #labels = 2 / 2 = 1.

4.3 Area Under the P-R Curve (AUC)

Precision and recall are 2 competing scores, and typically one has to scarifies one for another. Only a hypothetically ideal predictor can achieve precision = recall = 1.

To measure the combined effect of precision and recall, we can compute the Area Under the Curve (AUC), with Precision (P) on the y-axis, and Recall (R) on the x-axis. And this area is defined as the AP for this class.

So we need to first construct such a P-R curve.

Here is where I think the 2 posts I gave above are inaccurate: they seem to suggest that one constructs a P-R curve by iterating through a range of IoU thresholds. Doing that does give us a list of (P, R) pairs, but that is in contradiction to the fact one can compute the AUC, and the AP score, at a fixed IoU=0.5 threshold.

So I believe the following method is correct:

Pick an IoU threshold level, e.g. IoU=0.5.
Pick a class, e.g. “person”.
Get an image been predicted, and its ground truth label. Select detections of the “person” class, and labels of the “person” class.
Decide each of the detections is true or false positive, as done in the previous sub-section.
Repeat 3-4 for all the images to be evaluated.
Put results in a table like below:

conf score (float)	TP (1 or 0)	FP (1 or 0)	Accum TP (int)	Accum FP (int)	Precision (float)	Recall (float)
0.98	1	0	1	0	1 / 1	1 / N
0.88	1	0	2	0	2 / 2	2 / N
0.80	0	1	2	1	2 / 3	2 / N
0.77	0	1	2	2	2 / 4	2 / N
0.74	1	0	3	2	3 / 5	3 / N
0.62	1	0	4	2	4 / 6	4 / N
0.55	0	1	4	3	4 / 7	4 / N
0.54	1	0	5	3	5 / 8	5 / N
0.44	0	1	5	4	5 / 9	5 / N

(Note that I’m filling the table with made-up numbers).

Some explanations about the table:

conf score is the objectness confidence score of each prediction. And the table rows are sorted by this score, from high to low.
TP and FP are binary flags of the predictions, judged by the IoU score being above the 0.5 threshold or not.
Accum TP is the accumulative sum of TP, similarly for Accum FP.
Precision is computed as Accum TP / (Accum TP + Accum FP).
Recall is computed as Accum TP / N, where N is the total number labels for the “person” class within these evaluated images.

Then, the last 2 columns, Precision and Recall, give us the desired P-R curve. We only need to prepend the sequence with a beginning point of (recall=0,precision=1)(recall=0,precision=1).

When plotted out, the curve looks like the blue curve in Figure 4 blow (suppose N = 15.)

Notice that there are zig-zag patterns in the curve. To smooth these out, an 11-point interpolation method was devised, and it works as follows:

Divide the recall range of [0, 1] to 11 equally spaced intervals: 0, 0.1, 0.2, ... 1.

At each interpolated recall level r∈{0,0.1,⋯,1}r∈{0,0.1,⋯,1}, get an interpolated precision using:

Pinterp(r)=maxr˜,r˜≥rP(r˜)Pinterp(r)=maxr~,r~≥rP(r~)

where P(r˜)P(r~) is the precision value taken at recall level r˜r~ in the table above. So the above equation is taking the maximum observed precision level, at and to the right of a sampled recall level rr.

When plotted out, the 11-point interpolated P-R curve is the red curve in Figure 4.

Finally, the AP is computed as:

AP=111∑r∈{0,0.1,⋯,1}Pinterp(r))AP=111∑r∈{0,0.1,⋯,1}Pinterp(r))

Figure 4: Demo P-R curve. Blue curve is the computed P-R values taken from Table 1 (assuming N = 15). Red curve is the 11-point interpolated curve.

Below is the Python code to generate Figure 4:

import numpy as np
import matplotlib.pyplot as plt

tp = np.array([1, 1, 0, 0, 1, 1, 0, 1, 0]) # true positive flags
fp = 1 - tp                                # false positive flags
acc_tp = np.cumsum(tp)                     # accumulative sum
acc_fp = np.cumsum(fp)

N = 15                                     # total number of labels
pre = acc_tp / (acc_tp + acc_fp)           # precision
rec = acc_tp / N                           # recall

pre = np.r_[1., pre]                       # add a beginning point
rec = np.r_[0., rec]

# do 11-point interpolation
rec_interp = np.linspace(0, 1, 11, endpoint=True)
pre_interp = []
for rii in rec_interp:
    p_on_right = pre[rec>=rii]
    if len(p_on_right) > 0:
        pii = np.max(p_on_right)
    else:
        pii = 0
    pre_interp.append(pii)

#-------------------Plot------------------------
figure = plt.figure()
ax = figure.add_subplot(1,1,1)

ax.plot(rec, pre, 'b-o', ms=5, label='P-R')
ax.step(rec_interp, pre_interp, where='post', color='r')
ax.plot(rec_interp, pre_interp, marker='^', ms=8, ls='none', color='r', label='P-R (11-point interp)')
ax.set_xlabel('Recall')
ax.set_ylabel('Precision')
ax.set_ylim(-0.2, 1.2)
ax.grid(True, axis='both')
ax.legend()
figure.show()

4.4 Python code for AP computation

Code first:

def compute_AP(pred, label, class_idx, conf_thres=0.25, iou_thres=0.5, interp=11):
    '''Compute Average Precision for a class

    Args:
        pred (ndarray): predictions, in shape (n, 7). n is the number of detections.
            The columns are: [batch_idx, x_c, y_c, w, h, conf, clss].
        label (ndarray): ground truth labels, in shape (m, 6). m is the number
            of labels. The columns are: [batch_idx, x_c, y_c, w, h, clss].
        class_idx (int): target class idx.
    Keyword Args:
        conf_thres (float): optionally filter prediction by confidence score.
        iou_thres (float): IoU score threshold to flag true/false positives.
        interp (int or None): if None, compute Area Under Curve using trapz
            integration. If 11, smooth the P-R curve using 11-point interpolation.
    Returns:
        precision (ndarray): 1d array, precision values.
        recall (ndarray): 1d array, recall values.
        ap (float): average precision.
    '''

    # select class
    pred = pred[pred[:, -1] == class_idx]
    label = label[label[:, -1] == class_idx]

    if len(pred) == 0:
        return 0, 0, 0

    # filter by confidence
    conf_idx = np.where(pred[:, 5] >= conf_thres)[0]
    if len(conf_idx) == 0:
        # all false positive
        return 0, 0, 0

    npred = len(pred)
    nlabel = len(label)
    tp = np.zeros(npred, dtype='int')   # true positive flags
    label_visited = np.zeros(nlabel, dtype='int')  # record labels that have been matched with a detection

    # loop through detections
    for idxii in conf_idx:
        pii = pred[[idxii]]

        # select labels in same image
        labidx = np.where(label[:, 0] == pii[:, 0])[0]
        labii = label[labidx]
        if len(labii) == 0:
            # no matching label, false positive
            continue

        # compute iou with labels
        ious = compute_IOU(pii[:, 1:5], labii[:, 1:5], x1y1x2y2=False)
        # get the  closest label
        best_iou_idx = np.argmax(ious)
        best_iou = ious[best_iou_idx]
        all_labidx = labidx[best_iou_idx]

        if best_iou >= iou_thres and label_visited[all_labidx] == 0:
            tp[idxii] = 1                  # mark this prediction true positive
            label_visited[all_labidx] = 1  # mark this label as visited

    # build the AP computing table
    # sort by conf
    conf_idx = np.argsort(pred[:, 5])[::-1]
    pred = pred[conf_idx]
    tp = tp[conf_idx]
    fp = 1 - tp
    acc_tp = np.cumsum(tp)
    acc_fp = np.cumsum(fp)

    precision = acc_tp / (acc_tp + acc_fp + 1e-6)
    recall = acc_tp / nlabel

    # add starting point
    precision = np.r_[1., precision]
    recall = np.r_[0., recall]

    # do interpolation if needed
    if not interp:
        ap = np.trapz(precision, recall)
    elif interp == 11:
        rr = np.linspace(0, 1, 11, endpoint=True)
        ap = 0
        for rii in rr:
            pre_on_right = precision[recall >= rii]
            if len(pre_on_right):
                ap += np.max(pre_on_right)
        ap /= 11
    else:
        raise Exception("<interp> is None or 11.")

    return precision, recall, ap

Some explanations:

Remember that AP is class-specific, so we start by selecting the correct class. If nothing gets selected, then true positive detections, precision and AP are all 0:

# select class
pred = pred[pred[:, -1] == class_idx]
label = label[label[:, -1] == class_idx]

if len(pred) == 0:
    return 0, 0, 0

We initialize an 0-valued label_visited array to record the “visited” labels. This is because if a ground truth label is assigned to a prediction (by having the highest IoU score among all labels), it can not be assigned to a second prediction in the future, to avoid double counting.

So, when we flag a prediction as true positive, we also flag that matched label as “visited”:

# compute iou with labels
ious = compute_IOU(pii[:, 1:5], labii[:, 1:5], x1y1x2y2=False)
# get the  closest label
best_iou_idx = np.argmax(ious)
best_iou = ious[best_iou_idx]
all_labidx = labidx[best_iou_idx]

if best_iou >= iou_thres and label_visited[all_labidx] == 0:
    tp[idxii] = 1                  # mark this prediction true positive
    label_visited[all_labidx] = 1  # mark this label as visited

4.5 Python code for mAP computation

Having got the AP computation code, it is trivial to get mAP: we simply loop through all classes to get an average:

def compute_mAP(pred, label, conf_thres=0.25, iou_thres=0.5, interp=11):
    '''Compute mean Average Precision across classes

    Args:
        pred (ndarray): predictions, in shape (n, 7). n is the number of detections.
            The columns are: [batch_idx, x_c, y_c, w, h, conf, clss].
        label (ndarray): ground truth labels, in shape (m, 6). m is the number
            of labels. The columns are: [batch_idx, x_c, y_c, w, h, clss].
    Keyword Args:
        conf_thres (float): optionally filter prediction by confidence score.
        iou_thres (float): IoU score threshold to flag true/false positives.
        interp (int or None): if None, compute Area Under Curve using trapz
            integration. If 11, smooth the P-R curve using 11-point interpolation.
    Returns:
        mAP (float): mean Average Precision.
    '''

    class_idx = np.unique(label[:, -1])
    class_aps = []
    # loop through classes
    for clsii in class_idx:
        _, _, apii = compute_AP(pred, label, clsii, conf_thres, iou_thres, interp)
        class_aps.append(apii)

    mAP = np.mean(class_aps)

    return mAP

4.6 Multiple IoU thresholds

Now it is the correct point to extend into multiple IoU threshold values.

Having got the mAP using IoU=0.5, we can iterate through the IoU range of [0.5, 0.95], with a step of 0.05. The average mAP across these 10 IoU levels is called mAP@[0.5:0.05:0.95]. Or, using the COCO naming convention, AP@[0.5:0.05:0.95]: they treat AP and mAP as synonyms (thanks COCO for the confusion).

5 Summary

In this post we create 3 essential tools in object detection task:

Intersection-Over-Union (IoU): a goodness of match metric between a bounding box prediction and a ground truth label.
Non-Maximum suppression (NMS): a method to filter out duplicate detections.
mean Average Precision (mAP): a metric of the overall accuracy of a detection model.

All these will be used in later posts when we write the model-training code, and it is a good idea to store these functions in the utils.py script in the YOLOv3_pytorch project folder.

Author: guangzhi

Created: 2022-06-17 Fri 16:00

Validate

Create YOLOv3 using PyTorch from scratch (Part-4)

Create YOLOv3 using PyTorch from scratch (Part-4)

1 Overview

2 Intersection-Over-Union (IoU)

2.1 Concept

2.2 Python code

3 Non-Maximum suppression (NMS)

3.1 Concept

3.2 Python code

4 mean Average Precision (mAP)

4.1 mAP is the average of APs, and AP is parameterized on IoU threshold

4.2 True and false positive predictions, precision and recall

4.3 Area Under the P-R Curve (AUC)

4.4 Python code for AP computation

4.5 Python code for mAP computation

4.6 Multiple IoU thresholds

5 Summary

Recommend

android-uevent 简记

HIDL示例-JAVA服务创建-Client验证-Android10.0 HwBinder通信原理(四)

轻松筹、水滴筹联合声明严厉打击冒充平台收取“推广费”行为

Fanatics hires former Dick Clark Productions CEO to lead Topps, NFTs

HwServiceManager篇-Android10.0 HwBinder通信原理(五)

MBP Sale at Best Buy

How To Easily Build And Support Tables In Figma

第三届中国西安区块链产业发展论坛于今日下午召开

起底IP代理生意链

京东汽车：以价值共生树立消费信赖_业界_科技频道首页_财经网 - CAIJING.COM.CN

About Joyk