April 5, 2020 By Abhisek Jana 2 Comments

Support Vector Machines for Beginners – Duality Problem

The Objective Function of Primal Problem works fine for Linearly Separable Dataset, however doesn’t solve Non-Linear Dataset. In this Support Vector Machines for Beginners – Duality Problem article we will dive deep into transforming the Primal Problem into Dual Problem and solving the objective functions using Quadratic Programming. Don’t worry if this sounds too complicated, I will explain the concepts in a step by step approach.

Quick Recap

We have gone though the understanding of Support Vectors and Margins. Then used the concepts to build the Objective Functions for Hard and Soft Margin Classifier in our last tutorial. We have also learned that the Objective Function which we have defined is known as Primal Problem.

Objective Function : minβ,b,ξi{||β2||2+Cn∑i=1(ξi)k}s.t Linear Constraint : yi(βTxi+b)≥1–ξi,where ξi≥0

Here are some important points to remember from last tutorial,

Use training data to define optimal Geometric Margin so that we can identify the support vectors.
Introduced of Slack Variable and defined the objective function for Soft Margin Classifier.
Convert the inequality constraint to equality constraint, so that we use Gradient Descent.

In case you need to refer the previous tutorial, here is the link,

Support Vector Machines for Beginners – Linear SVM

Primal Problem

Primal Problem is helpful in solving Linear SVM using SGD. Remember we need to optimize D+1 ( where D is the dimension ) parameters in Primal Problem.

However our final goal is to solve Non-Linear SVM where Primal Problem is not helpful. In order to find a solution, we need to use several mathematical tricks. We will go through all of them one after another.

Duality

In mathematical optimization theory, duality means that optimization problems may be viewed from either of two perspectives, the primal problem or the dual problem (the duality principle). The solution to the dual problem provides a lower bound to the solution of the primal (minimization) problem.
Wikipedia

I am going to explain in a very simple way so that you can understand without the need to have a strong mathematical background.

Primal Problem is something that we want to minimize. In the diagram, P* minimizes the Primal Objective P.
Dual Problem is something we want to maximize. Here we want to convert the Primal Problem (P) to a Dual Problem (D). D* maximizes the Dual Objective D.
Sometimes solving the Dual Problem is same as solving the Primal Problem.
(P∗−D∗) is called as the Duality Gap.
If P∗−D∗>0 we can say weak duality holds.

The goal will be to find in which condition we can have P∗−D∗=0 and we can say that strong duality holds.
In this picture, we can see that maximizing the Dual Problem is same as Minimizing the Primal Problem.
There are some conditions named KKT condition, needs to hold in order to have P∗−D∗=0
Next, we will talk about Lagrange Multiplier, which will help us to determine when we can find strong duality.

Lagrange Multiplier

We can find the maximum or minimum of a multivariable function with some constraint using the Lagrange Multiplier. It’s important understand Lagrange Multiplier to solve constraint optimization problems, like we have in SVM.

If you recall our objective function, we do have one constrain such that yi(βTxi+b)≥1–ξi for the objective function,

minβ,b,ξi{||β2||2+Cn∑i=1(ξi)k}

Concepts

We will start with a generalized version. Let’s find out how to solve problem such as:

maxx,yf(x,y)such that g(x,y)=c

Refer the below plot where f(x,y) function has been represented using the contour plot ( surface ) and g(x,y) has been shown as a line ( green ).

In case we didn’t have the constraint g(x,y), we could have just taken the derivative of f(x,y) w.r.t x and y, then set them to 0 to find solution for x and y.

Support-Vector-Machine-for-Beginners-Duality-Problem-adeveloperdiary.com-4.jpg?resize=589%2C423

However, now we have a constraint given by the function g(x,y) ( represented using the green line). Hence we need to find the point on the green curve for which f(x,y) is maximum.

Notice that the solution is the point where f(x,y) and g(x,y) both are parallel (highlighted in yellow circle). Mathematically, gradient vectors at that point of tangency are oriented along the same direction.

We are going to introduce a new variable α as Lagrange Multiplier. Then express the following as a function named, Lagrangian.

L(x,y,α)=f(x,y)–α(g(x,y)–c)

The solution is to find (x,y,α) so that ΔL=0

δx,y,αL(x,y,α)=0

Let’s first take the partial derivative w.r.t α

δαL(x,y,α)=ddα(f(x,y)–α(g(x,y)–c))=–(g(x,y)−c)

Hence we can say,

δαL(x,y,α)=–g(x,y)+c=0

We can also take the partial derivate w.r.t x

δαL(x,y,α)=ddα(f(x,y)–α(g(x,y)–c))=0δxf(x,y)–αδxg(x,y)=0δxf(x,y)=αδxg(x,y)

Similarly,

δyf(x,y)=αδyg(x,y)

We can write them using one equation,

δx,yf(x,y)=αδx,yg(x,y)

This matches with the idea that δf(x,y) and δg(x,y) are both pointing to the same direction at the solution ( yellow circle ) . This means the gradients are indeed parallel.

Note, even if they point to the same direction, they may not have the same magnitude, hence the gradient of g(x,y) is scaled by α

Now we have 3 equations and 3 unknowns. We can easily solve them.

If you have come this far and the above idea didn’t make a lot of sense, don’t worry, finish the entire article and read again few more times and everything starts to make sense. Feel free to post questions in the comment section, I will try my best to answer them.

We will now solve one sample equation using Lagrange Multiplier.

Example

This probably is the most common example, it’s really easy to understand hence I haven’t changed it.

Problem

Find,

maxx,yx+ys.t. x2+y2=1

Solution

Let’s identify the possible solutions visually, before we use Lagrange Multiplier.

Here is the plot of the constraint x2+y2=1 in red, which is a circle.
x+y is basically the entire plane.
The blue line is an arbitrary representation of x+y=c

Support-Vector-Machine-for-Beginners-Duality-Problem-adeveloperdiary.com-5.png?resize=359%2C352

x+y needs to be on the circle as per the defined constraint.
We can visually identify two points ( among all the points ) where the value of x+y needs to touch the circle at two different points.
These points are the max and min values.

Support-Vector-Machine-for-Beginners-Duality-Problem-adeveloperdiary.com-6.png?resize=380%2C428

Now let’s use Lagrange Multiplier to solve this mathematically.

L(x,y,α)=f(x,y)–α(g(x,y)–c)=x+y−α(x2+y2−1)

Take the derivatives w.r.t x,y,α.

δxL(x,y,α)=1−2αx=0x=12α

δyL(x,y,α)=1−2αy=0y=12α

δαL(x,y,α)=x2+y2−1=012α2+12α2−1=012α2=1α=±1√2

Using this, we can derive the values of x and y.

x=±121√2x=1√2similarly, y=1√2

Since we are tying to maximize x+y, we will consider only the positive values.

So we have the final result as,

f(x,y)=x+y=1√2+1√2=√2

Support-Vector-Machine-for-Beginners-Duality-Problem-adeveloperdiary.com-7.png?resize=350%2C381

Multiple Constraints

Lagrange Multiplier can be used for vectorized implementation as well. Assume x has D dimensions, then there will be total D+1 unknowns for the following function,

maxxf(x)s.t. g(x)=0

You also may have more than one constraints. If you have n constraints then there will be total D+n unknowns.

maxxf(x)s.t. g1(x)=0,g2(x)=0,…,gn(x)=0

We can then define the Lagrangian as following,

L(x1,…,xD,α1,…,αn)=f(x)–n∑i=1αigi(x)

Inequality Constraint

Lagrange Multiplier works with Inequality Constraints as well. We can always add inequality constraints along with equality constraint. Let’s taken an example,

maxxf(x)s.t. gi(x)≤0 , ∀i=1..nhi(x)=0 , ∀j=1..m

We can define the Lagrangian as (One constant for each constraint),

L(x,α,λ)=f(x)+n∑i=1αigi(x)+m∑j=1λjhj(x)

However, in order to find the solution for the variables, it wont be enough only to take the gradients and set those to zero due to the Inequality Constraints.

Setting Δx,λL=0 still gives two of the conditions, but for the Inequality Constraint, we need to have 3 additional conditions. Hence instead of total 3, we will now have total 5 conditions.

αigi(x)=0 , ∀i=1..ngi(x)≤0 , ∀i=1..nαi≥0 , ∀i=1..nδxdL=0 , ∀d=1..DδλjL=0 , ∀j=1..m

Strong Duality

KKT Conditions

These above five conditions are called as KKT (Karush–Kuhn–Tucker) conditions and they must be met for strong duality, i.e for P∗−D∗=0 to be true.

Duality: Hard Margin Classifier

We will now change our Hard Margin Classifier’s Objective Function from Primal Problem to Dual Problem using Lagrange Multiplier and KKT Conditions.

Recall the optimization problem,

Objective Function : minβ,b{||β2||2}s.t Linear Constraint : yi(βTxi+b)≥1 , ∀xi∈D

The constraint can be redefined as following (This is required for the 2nd KKT Condition gi(x)≤0),

s.t Linear Constraint : 1–yi(βTxi+b)≤0 , ∀xi∈D

The Lagrangian can defined as following,

min L=||β2||2+n∑i=1αi(1−yi(βTxi+b))=||β2||2–n∑i=1αi(yi(βTxi+b)−1)

Here are the remaining 4 KKT Conditions, ( gi(x)≤0 already defined above)

αi(1−yi(βTxi+b))=0and αi≥0

Taking derivatives w.r.t β and b,

δβL=β–n∑i=1αiyixi=0β=n∑i=1αiyixiand δbL=n∑i=1αiyi=0

Plugging these we get the Dual Lagrangian Objective Function,

Ldual=||β2||2–n∑i=1αi(yi(βTxi+b)−1)=12βTβ–n∑i=1αiyiβTxi–n∑i=1αiyib+n∑i=1αi=12βTβ–βT(n∑i=1αiyixi)–b(n∑i=1αiyi)+n∑i=1αi=12βTβ–βT(β)–b(0)+n∑i=1αi=–12βTβ+n∑i=1αi=n∑i=1αi–12n∑i=1n∑j=1αiαjyiyjxTixj

L should be minimized w.r.t β and b, and should be maximized w.r.t αi. Hence, instead of minimizing the Primal Problem, we can now maximize the Dual Problem (P*=D* as the KKT conditions are now satisfied). So we can write the following,

Objective Function: maxαLdual=n∑i=1αi–12n∑i=1n∑j=1αiαjyiyjxTixjLinear Constraints: αi≥0,∀i∈D, and n∑i=1αiyi=0

The Ldual is a Convex Quadratic programming problem due to the αiαj term and can be solved using standard optimization techniques.

Weight Vector and Bias

Once we have got the αi values, we can calculate the weight vector β and bias b. As per the KKT conditions we had,

αi(1−yi(βTxi+b))=0

Above equation provides two cases,

yi(βTxi+b)=1

Now, if yi(βTxi+b)=1 then αi>0, which means the point xi must be a support vector.

Otherwise if αi=0 then yi(βTxi+b)≥1, which means if the point is not support vector then αi=0

The above two intuitions are important, this indicates that non support vectors do not have a role in finding the weight and bias.

Once we know αi for all points, we can compute the weight vector β by summing only for the support vectors.

β=∑i,αi≥0αiyixi

In order to compute the bias b, we need to get one solution bi per support vector, then average them.

αi(1−yi(βTxi+b))=0yi(βTxi+b)=1bi=1yi−βTxi=yi–βTxi[ since yi∈{−1,+1},yi=1yi]b=avgαi≥1{bi}

Hard Margin Classifier

We can now find the optimal Hyperplane given β and b. For any new point z, we can predict the class using following,

ˆy=sign(βTz+b)

Duality: Soft Margin Classifier

Similar to the Hard Margin Classifier, we will derive the Dual Problem for Soft Margin Classifier also. Let’s recap the Primal Objective function,

Objective Function : minβ,b,ξi{||β2||2+Cn∑i=1(ξi)k}s.t Linear Constraint : yi(βTxi+b)≥1–ξi,where ξi≥0

The value of k could be set to either 1 (Hinge Loss) or 2 (Quadratic Loss).

Hinge Loss

We can compute the Lagrangian by introducing two Lagrange multipliers αi and λi as we have two inequality constraints. (Remember the 2nd KKT Condition gi(x)≤0 ? )

αi(1–ξi−yi(βTxi+b))=0, and αi≥0λi(0−ξi)=0, and λi≥0

The Lagrangian can be then defined as,

L=||β2||2+Cn∑i=1ξi+n∑i=1αi(1–ξi−yi(βTxi+b))+n∑i=1λi(0−ξi)=||β2||2+Cn∑i=1ξi–n∑i=1αi(yi(βTxi+b)–1+ξi)–n∑i=1λiξi

We will take partial derivative w.r.t β, b and ξi to turn this to dual Lagrangian.

δβL=β–n∑i=1αiyixi=0β=n∑i=1αiyixiδbL=n∑i=1αiyi=0δξiL=C−αi−λi=0

δξiL does not have the summation (∑) term, as we are taking partial derivative against ξi and not ξ.

Plugging these to the Lagrangian we get,

Ldual=||β2||2+Cn∑i=1ξi–n∑i=1αi(yi(βTxi+b)–1+ξi)–n∑i=1λiξi=12βTβ−βT(n∑i=1αiyixi)−bn∑i=1αiyi+n∑i=1αi+n∑i=1(C–αi–λi)ξi=12βTβ−βT(β)−b(0)+n∑i=1αi+n∑i=1(0)ξi=n∑i=1αi–12βTβ=n∑i=1αi–12n∑i=1n∑j=1αiαjyiyjxTixj

The Dual Objective can be written as,

Objective Function: maxαLdual=n∑i=1αi–12n∑i=1n∑j=1αiαjyiyjxTixjLinear Constraints: 0≤αi≤0,∀i∈D, and n∑i=1αiyi=0

Notice the objective function is same as the Hard Margin Classifier, however the inequality constraint is different.

Since αi+λi=C and αi≥0,λi≥0, we can say 0≤αi≤0. There is no constraint on λ as its not part of the final equation.

Weight Vector and Bias

Similar to the Hard Margin Classifier, we can obtain the weight vector from the support vectors as before. Now the Support Vectors include all the points that are on the margin ( Zero Slack ξi=0 ) and also all the points with positive Slack ξi>0

β=∑i,αi≥0αiyixi

We can now solve for ξi,

λi(0−ξi)=0(C–αi)(0−ξi)=0,[λi from KKT Condition ]ξi(C–αi)=0

Now we have two cases for the support vectors with αi>0

If ξi>0, then (C−αi)=0. We can say αi=C
If (C−αi)>0, then αi<C and ξi=0. We can say these support vectors are on the margin.

Using those support vectors that are on the margin, that is 0≤αi≤0 and ξi=0, we can solve for bi,

αi(1–ξi−yi(βTxi+b))=0αi(1–0–yi(βTxi+b))=0αi(yi(βTxi+b)−1)=0yi(βTxi+b)=1bi=1yi−βTxi=yi–βTxi

We will average all the bi to compute b. Notice both β and b can be computed without computing the slack value ξi for each points.

Quadratic Loss

We need to make few adjustments to the objective function for the Quadratic Loss.

Drop the constraint ξi≥0 as ∑i=1nξ2i is always positive.
Any negative value of the slack will be replaced by zero during optimization due to the fact that ξi=0 leads to a smaller value of the primary objective.

Objective Function : minβ,b,ξi{||β2||2+Cn∑i=1ξ2i}s.t Linear Constraint : yi(βTxi+b)≥1–ξi

The Lagrangian can be defined as below. Notice we only need one Lagrange Multiplier due to the dropped constraint.

L=||β2||2+Cn∑i=1ξ2i–n∑i=1αi(yi(βTxi+b)–1+ξi)

Now we can calculate based on the KKT Condition like before. Take partial Derivative w.r.t β,b and ξi and set them to zero.

β=n∑i=1αiyixin∑i=1αiyi=0ξi=12Cαi

Plugging in those into the Lagrangian gives us the Dual Objective. I am expanding the derivation only for the last part since the rest is same as earlier.

Ldual=||β2||2+Cn∑i=1ξ2i–n∑i=1αi(yi(βTxi+b)–1+ξi)=n∑i=1αi–12n∑i=1n∑j=1αiαjyiyjxTixj–n∑i=1αiξi=n∑i=1αi–12n∑i=1n∑j=1αiαjyiyjxTixj+Cn∑i=1(12Cαi)2–n∑i=1αi(12Cαi)=n∑i=1αi–12n∑i=1n∑j=1αiαjyiyjxTixj+Cn∑i=1(14C2α2i)–n∑i=112Cα2i=n∑i=1αi–12n∑i=1n∑j=1αiαjyiyjxTixj+14Cn∑i=1α2i–12Cn∑i=1α2i=n∑i=1αi–12n∑i=1n∑j=1αiαjyiyjxTixj–14Cn∑i=1α2i=n∑i=1αi–12n∑i=1n∑j=1αiαjyiyj(xTixj+12Cδij)

The δ is a function can be defined as δi,j=1 if i=j, otherwise δi,j=0

The Dual Objective can be written as,

maxαLdual=n∑i=1αi–12n∑i=1n∑j=1αiαjyiyj(xTixj+12Cδij)s.t constraint : αi≥0,∀i∈D, and n∑i=1αiyi=0

Weight Vector and Bias

Follow the same approach to compute the weight and bias as,

β=∑i,αi≥0αiyixib= avgi,C≤αi≤0{yi–βTxi}

Conclusion

We have gone through plethora of equations and derivations in this tutorial. If you are learning SVM for the first time, you must be utterly confused on why we need to convert Primal Problem to Dual Problem. If you look that Dual Objective function we have now, the most important part is xTixj.

We will use something called Kernel Trick to represent xTixj as inner product in order to achieve Non-Linearity. This is the main reason that we went though all these computation just to define an objective function.

In the next tutorial we will learn briefly about Kernel and use it in SVM Dual Problem.

Support Vector Machines for Beginners – Duality Problem

Support Vector Machines for Beginners – Duality Problem

Quick Recap

Primal Problem

Duality

Lagrange Multiplier

Concepts

Example

Problem

Solution

Multiple Constraints

Inequality Constraint

KKT Conditions

Duality: Hard Margin Classifier

Weight Vector and Bias

Hard Margin Classifier

Duality: Soft Margin Classifier

Hinge Loss

Weight Vector and Bias

Quadratic Loss

Weight Vector and Bias

Conclusion

Recommend

Vitalik: 什么是“共享安全性”，为何它如此重要？

边缘计算的数据模式，与现有系统的整合和共存

爱奇艺起诉租号平台获赔103万

科技向善，莫让算法“算计”人

MassTransit Mediator with ASP.NET Core

Vitalik: 两个 slot 的提议者/构建者分离方案

Review: Panic’s Nova 2021 – a VS Code and Sublime Text beater?

Agile Cincinnati, Cincinnati, USA, April 25 2022

WebRTC 如何在安卓系统上采集视频数据 | 社区征文

Arch linux系列安装 haroopad markdown 编辑器

About Joyk