Support Vector Machines for Beginners – Duality Problem
source link: http://www.adeveloperdiary.com/data-science/machine-learning/support-vector-machines-for-beginners-duality-problem/?amp%3Butm_medium=rss&%3Butm_campaign=support-vector-machines-for-beginners-duality-problem
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
April 5, 2020 By Abhisek Jana 2 Comments
Support Vector Machines for Beginners – Duality Problem
The Objective Function of Primal Problem works fine for Linearly Separable Dataset, however doesn’t solve Non-Linear Dataset. In this Support Vector Machines for Beginners – Duality Problem article we will dive deep into transforming the Primal Problem into Dual Problem and solving the objective functions using Quadratic Programming. Don’t worry if this sounds too complicated, I will explain the concepts in a step by step approach.
Quick Recap
We have gone though the understanding of Support Vectors and Margins. Then used the concepts to build the Objective Functions for Hard and Soft Margin Classifier in our last tutorial. We have also learned that the Objective Function which we have defined is known as Primal Problem.
Objective Function : minβ,b,ξi{||β2||2+Cn∑i=1(ξi)k}s.t Linear Constraint : yi(βTxi+b)≥1–ξi,where ξi≥0
Here are some important points to remember from last tutorial,
- Use training data to define optimal Geometric Margin so that we can identify the support vectors.
- Introduced of Slack Variable and defined the objective function for Soft Margin Classifier.
- Convert the inequality constraint to equality constraint, so that we use Gradient Descent.
In case you need to refer the previous tutorial, here is the link,
Primal Problem
Primal Problem is helpful in solving Linear SVM using SGD. Remember we need to optimize D+1
( where D is the dimension ) parameters in Primal Problem.
However our final goal is to solve Non-Linear SVM where Primal Problem is not helpful. In order to find a solution, we need to use several mathematical tricks. We will go through all of them one after another.
Duality
In mathematical optimization theory, duality means that optimization problems may be viewed from either of two perspectives, the primal problem or the dual problem (the duality principle). The solution to the dual problem provides a lower bound to the solution of the primal (minimization) problem.
Wikipedia
I am going to explain in a very simple way so that you can understand without the need to have a strong mathematical background.
- Primal Problem is something that we want to minimize. In the diagram, P* minimizes the Primal Objective P.
- Dual Problem is something we want to maximize. Here we want to convert the Primal Problem (P) to a Dual Problem (D). D* maximizes the Dual Objective D.
- Sometimes solving the Dual Problem is same as solving the Primal Problem.
- (P∗−D∗) is called as the Duality Gap.
- If P∗−D∗>0 we can say weak duality holds.
- The goal will be to find in which condition we can have P∗−D∗=0 and we can say that strong duality holds.
- In this picture, we can see that maximizing the Dual Problem is same as Minimizing the Primal Problem.
- There are some conditions named KKT condition, needs to hold in order to have P∗−D∗=0
- Next, we will talk about Lagrange Multiplier, which will help us to determine when we can find strong duality.
Lagrange Multiplier
We can find the maximum or minimum of a multivariable function with some constraint using the Lagrange Multiplier. It’s important understand Lagrange Multiplier to solve constraint optimization problems, like we have in SVM.
If you recall our objective function, we do have one constrain such that yi(βTxi+b)≥1–ξi for the objective function,
Concepts
We will start with a generalized version. Let’s find out how to solve problem such as:
Refer the below plot where f(x,y) function has been represented using the contour plot ( surface ) and g(x,y) has been shown as a line ( green ).
In case we didn’t have the constraint g(x,y), we could have just taken the derivative of f(x,y) w.r.t x and y, then set them to 0 to find solution for x and y.
However, now we have a constraint given by the function g(x,y) ( represented using the green line). Hence we need to find the point on the green curve for which f(x,y) is maximum.
Notice that the solution is the point where f(x,y) and g(x,y) both are parallel (highlighted in yellow circle). Mathematically, gradient vectors at that point of tangency are oriented along the same direction.
We are going to introduce a new variable α as Lagrange Multiplier. Then express the following as a function named, Lagrangian.
The solution is to find (x,y,α) so that ΔL=0
Let’s first take the partial derivative w.r.t α
Hence we can say,
We can also take the partial derivate w.r.t x
Similarly,
We can write them using one equation,
This matches with the idea that δf(x,y) and δg(x,y) are both pointing to the same direction at the solution ( yellow circle ) . This means the gradients are indeed parallel.
Note, even if they point to the same direction, they may not have the same magnitude, hence the gradient of g(x,y) is scaled by α
Now we have 3 equations and 3 unknowns. We can easily solve them.
If you have come this far and the above idea didn’t make a lot of sense, don’t worry, finish the entire article and read again few more times and everything starts to make sense. Feel free to post questions in the comment section, I will try my best to answer them.
We will now solve one sample equation using Lagrange Multiplier.
Example
This probably is the most common example, it’s really easy to understand hence I haven’t changed it.
Problem
Find,
Solution
Let’s identify the possible solutions visually, before we use Lagrange Multiplier.
- Here is the plot of the constraint x2+y2=1 in red, which is a circle.
- x+y is basically the entire plane.
- The blue line is an arbitrary representation of x+y=c
- x+y needs to be on the circle as per the defined constraint.
- We can visually identify two points ( among all the points ) where the value of x+y needs to touch the circle at two different points.
- These points are the max and min values.
Now let’s use Lagrange Multiplier to solve this mathematically.
Take the derivatives w.r.t x,y,α.
Using this, we can derive the values of x and y.
Since we are tying to maximize x+y, we will consider only the positive values.
So we have the final result as,
Multiple Constraints
Lagrange Multiplier can be used for vectorized implementation as well. Assume x has D dimensions, then there will be total D+1
unknowns for the following function,
You also may have more than one constraints. If you have n
constraints then there will be total D+n
unknowns.
We can then define the Lagrangian as following,
Inequality Constraint
Lagrange Multiplier works with Inequality Constraints as well. We can always add inequality constraints along with equality constraint. Let’s taken an example,
We can define the Lagrangian as (One constant for each constraint),
However, in order to find the solution for the variables, it wont be enough only to take the gradients and set those to zero due to the Inequality Constraints.
Setting Δx,λL=0 still gives two of the conditions, but for the Inequality Constraint, we need to have 3 additional conditions. Hence instead of total 3, we will now have total 5 conditions.
KKT Conditions
These above five conditions are called as KKT (Karush–Kuhn–Tucker) conditions and they must be met for strong duality, i.e for P∗−D∗=0 to be true.
Duality: Hard Margin Classifier
We will now change our Hard Margin Classifier’s Objective Function from Primal Problem to Dual Problem using Lagrange Multiplier and KKT Conditions.
Recall the optimization problem,
The constraint can be redefined as following (This is required for the 2nd KKT Condition gi(x)≤0),
The Lagrangian can defined as following,
Here are the remaining 4 KKT Conditions, ( gi(x)≤0 already defined above)
Taking derivatives w.r.t β and b,
Plugging these we get the Dual Lagrangian Objective Function,
L should be minimized w.r.t β and b, and should be maximized w.r.t αi. Hence, instead of minimizing the Primal Problem, we can now maximize the Dual Problem (P*=D* as the KKT conditions are now satisfied). So we can write the following,
The Ldual is a Convex Quadratic programming problem due to the αiαj term and can be solved using standard optimization techniques.
Weight Vector and Bias
Once we have got the αi values, we can calculate the weight vector β and bias b. As per the KKT conditions we had,
Above equation provides two cases,
- yi(βTxi+b)=1
Now, if yi(βTxi+b)=1 then αi>0, which means the point xi must be a support vector.
Otherwise if αi=0 then yi(βTxi+b)≥1, which means if the point is not support vector then αi=0
The above two intuitions are important, this indicates that non support vectors do not have a role in finding the weight and bias.
Once we know αi for all points, we can compute the weight vector β by summing only for the support vectors.
In order to compute the bias b, we need to get one solution bi per support vector, then average them.
Hard Margin Classifier
We can now find the optimal Hyperplane given β and b. For any new point z, we can predict the class using following,
Duality: Soft Margin Classifier
Similar to the Hard Margin Classifier, we will derive the Dual Problem for Soft Margin Classifier also. Let’s recap the Primal Objective function,
The value of k could be set to either 1 (Hinge Loss) or 2 (Quadratic Loss).
Hinge Loss
We can compute the Lagrangian by introducing two Lagrange multipliers αi and λi as we have two inequality constraints. (Remember the 2nd KKT Condition gi(x)≤0 ? )
The Lagrangian can be then defined as,
We will take partial derivative w.r.t β, b and ξi to turn this to dual Lagrangian.
δξiL does not have the summation (∑) term, as we are taking partial derivative against ξi and not ξ.
Plugging these to the Lagrangian we get,
The Dual Objective can be written as,
Notice the objective function is same as the Hard Margin Classifier, however the inequality constraint is different.
Since αi+λi=C and αi≥0,λi≥0, we can say 0≤αi≤0. There is no constraint on λ as its not part of the final equation.
Weight Vector and Bias
Similar to the Hard Margin Classifier, we can obtain the weight vector from the support vectors as before. Now the Support Vectors include all the points that are on the margin ( Zero Slack ξi=0 ) and also all the points with positive Slack ξi>0
We can now solve for ξi,
Now we have two cases for the support vectors with αi>0
- If ξi>0, then (C−αi)=0. We can say αi=C
- If (C−αi)>0, then αi<C and ξi=0. We can say these support vectors are on the margin.
Using those support vectors that are on the margin, that is 0≤αi≤0 and ξi=0, we can solve for bi,
We will average all the bi to compute b. Notice both β and b can be computed without computing the slack value ξi for each points.
Quadratic Loss
We need to make few adjustments to the objective function for the Quadratic Loss.
- Drop the constraint ξi≥0 as ∑i=1nξ2i is always positive.
- Any negative value of the slack will be replaced by zero during optimization due to the fact that ξi=0 leads to a smaller value of the primary objective.
The Lagrangian can be defined as below. Notice we only need one Lagrange Multiplier due to the dropped constraint.
Now we can calculate based on the KKT Condition like before. Take partial Derivative w.r.t β,b and ξi and set them to zero.
Plugging in those into the Lagrangian gives us the Dual Objective. I am expanding the derivation only for the last part since the rest is same as earlier.
The δ is a function can be defined as δi,j=1 if i=j
, otherwise δi,j=0
The Dual Objective can be written as,
Weight Vector and Bias
Follow the same approach to compute the weight and bias as,
Conclusion
We have gone through plethora of equations and derivations in this tutorial. If you are learning SVM for the first time, you must be utterly confused on why we need to convert Primal Problem to Dual Problem. If you look that Dual Objective function we have now, the most important part is xTixj.
We will use something called Kernel Trick to represent xTixj as inner product in order to achieve Non-Linearity. This is the main reason that we went though all these computation just to define an objective function.
In the next tutorial we will learn briefly about Kernel and use it in SVM Dual Problem.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK