
Dynamic Programming Algorithm for Segmented Least Squares

source link: https://stackoverflow.com/questions/4084437/dynamic-programming-algorithm-for-segmented-least-squares/70221342


I've been trying to implement this algorithm in Python for a few days now. I keep coming back to it, giving up, and getting frustrated. I don't know what's going on. I don't have anyone to ask or anywhere to go for help, so I've come here.

PDF Warning: http://www.cs.uiuc.edu/class/sp08/cs473/Lectures/lec10.pdf

I don't think it's clearly explained, and I certainly don't understand it.

My understanding of what's happening is this:

We have a set of points (x1,y1), (x2,y2), ... and we want to find some lines that best fit this data. We can have multiple straight lines, and these lines come from the given formulas for a and b (y = ax + b).

Now the algorithm starts at the end (?) and assumes that a point p(x_i, y_i) is part of the last line segment. Then the notes say that the optimal solution 'is optimal solution for {p1, . . . pi−1} plus (best) line through {pi , . . . pn}'. Which just means to me that we go to the point p(x_i, y_i), go backwards, and find another line segment through the rest of the points. Now the optimal solution is both these line segments.

Then it takes a logical jump I can't follow, and says "Suppose the last point pn is part of a segment that starts at p_i. If Opt(j) denotes the cost of the first j points and e(j,k) the error of the best line through points j to k then Opt(n) = e(i, n) + C + Opt(i − 1)"

Then there is the pseudocode for the algorithm, which I don't understand. I understand that we want to iterate through the list of points and find the points which minimize the OPT(n) formula, but I just don't follow it. It's making me feel stupid.

I know this question is a pain in the ass and that it's not easy to answer but I'm just looking for some guidance towards understanding this algorithm. I apologize for the PDF but I don't have a neater way of getting the crucial information to the reader.

Thank you for your time and reading this, I appreciate it.

asked Nov 3 '10 at 5:28

The least squares problem asks to find a single line of best fit for the given points and has a nice closed form solution. This solution is used as a primitive to solve the segmented least squares problem.
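That closed-form solution can be sketched in a few lines (the helper name and structure are mine; the formulas are the standard normal equations for slope and intercept):

```python
def fit_line(points):
    """Closed-form least squares fit: returns (a, b) for y = a*x + b
    minimizing the sum of squared vertical errors over (x, y) points."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)  # slope (assumes x values not all equal)
    b = (sy - a * sx) / n                          # intercept
    return a, b

# Points that lie exactly on y = 2x + 1 recover the line exactly:
fit_line([(0, 1), (1, 3), (2, 5)])  # → (2.0, 1.0)
```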

In the segmented least squares problem, we can have any number of segments to fit the given points and we have to pay a cost for each new segment used. If the cost of using a new segment was 0, we could perfectly fit all points by passing a separate line through each point.

Now suppose we are trying to find the set of segments that is a best fit to the n given points. If we knew the best solutions for n-1 subproblems: best fit for first point, best fit for first 2 points, ..., best fit for first n-1 points, then we could compute the best solution for the original problem with n points as follows:

The nth point lies on a single segment and this segment starts at some earlier point i, for some i = 1, 2, ..., n. We have already solved all smaller subproblems, i.e., those with fewer than n points, so we can find the best fit for n points as the choice of i that minimizes:

cost of best fit for first i-1 points (already computed) + cost of single line that is best fit for points i to n (using least squares problem) + cost of using a new segment

The value of i that minimizes the above quantity gives us the solution: use the best fit to subproblem i-1 and the segment through point i to n.
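That minimization can be sketched directly in Python (a minimal bottom-up version; the names are mine, and err(i, j) stands in for the least-squares error of a single line through points i..j, zero-indexed):

```python
def segmented_ls_cost(points, C, err):
    """Minimum total cost of covering `points` with segments.
    err(i, j): squared error of the best single line through
    points[i..j] inclusive; C: fixed cost per segment used."""
    n = len(points)
    OPT = [0.0] * (n + 1)  # OPT[j] = best cost for the first j points
    for j in range(1, n + 1):
        # the last segment starts at some point i (1-indexed)
        OPT[j] = min(err(i - 1, j - 1) + C + OPT[i - 1]
                     for i in range(1, j + 1))
    return OPT[n]

# With a free perfect fit (err always 0), one segment is optimal:
segmented_ls_cost([(0, 0), (1, 1), (2, 2)], 5.0, lambda i, j: 0.0)  # → 5.0
```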

If you need more help, I've written an explanation of the algorithm and provided C++ implementation here: http://kartikkukreja.wordpress.com/2013/10/21/segmented-least-squares-problem/

answered Nov 27 '13 at 9:35

The tricky part, the dynamic programming part, is the section

for j = 1 to n
    M[j] = min_{1 <= i <= j} ( e(i, j) + C + M[i-1] )

where M[0] = 0 is done earlier and n is the total number of data points. Also, the underscore means that the section in braces afterwards should be subscripted. The professor could well have used OPT instead of M, and this is done in some other universities' lectures about this, which you can find on the web (and which look almost identical). Now, if you look at my code block above and the one in the lecture, you will notice a difference. I used M[i-1] and your professor used M[j-1]. It's a typo in your professor's pseudocode. You can even look back at the slide before and see that he had it correct in the error function there.

Basically, for each j, you are finding the point i to draw a line to j from such that the error of it, plus the cost of adding this extra line (C), plus the cost of making all line segments up to i (which has already been chosen optimally) is minimized.

Also, remember that e(i,i) = 0 and e(i,i+1) = 0, because fitting a line to a single point, or to just two points, gives no error.

answered Nov 3 '10 at 6:11

Starting from point 1, the best solution up until a point j must include the new end-point j in the last line segment, so the problem becomes: where do I have to place the last split to minimize the cost of this last line segment?

Fortunately the cost is calculated in terms of subproblems of the same problem you are trying to solve, and fortunately you've already solved these smaller subproblems by the time you've moved to the next point j. So at the new point j you are trying to find an optimal point i, between points 1 and j, to split off a new line segment that includes j, and minimizes the cost: optimal_cost_up_to(i) + cost_of_split + cost_of_lsq_fit(i+1,j). Now the confusing part is that at any point it might seem like you are just finding a single split, but in reality all the previous splits are determined by optimal_cost_up_to(i), and since you've already calculated the optimal cost for all these points leading up to j, then you just need to memoize the answers so that the algorithm doesn't have to recalculate these costs each time it advances a point.

In python I'd probably use a dictionary to store the results, although for this dynamic programming algorithm an array might be better...

anyways...

    def optimalSolution(points, split_cost):
        # solutions[j] holds the best cost and split points for points[0..j]
        solutions = {0: {'cost': 0, 'splits': []}}
        for j in range(1, len(points)):
            # default: a single segment through points 0..j, with no split
            best_split = None
            best_cost = lsqFitCost(points, 0, j)
            for i in range(0, j):
                cost = solutions[i]['cost'] + split_cost + lsqFitCost(points, i + 1, j)
                if cost < best_cost:
                    best_cost = cost
                    best_split = i
            if best_split is not None:
                solutions[j] = {'cost': best_cost,
                                'splits': solutions[best_split]['splits'] + [best_split]}
            else:
                solutions[j] = {'cost': best_cost, 'splits': []}
        return solutions

it's not pretty, and I haven't checked it (there might be an error involving the case where no split is the best cost), but hopefully it'll help get you on the right path? Note that lsqFitCost has to do a lot of work on each iteration, but for a least squares line fit like this you can make this cost a lot less by maintaining running sums used in the calculation... you should Google least squares line fitting for more info. This could make lsqFitCost constant so the overall time would be O(N^2).
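The running-sums idea mentioned above can be sketched like this (the class and method names are mine): precompute prefix sums of x, y, x*x, x*y, and y*y once in O(N), after which each least-squares error query is O(1), so the whole DP is O(N^2):

```python
class FitCost:
    """O(1) least-squares error queries via prefix sums.
    After O(n) setup, err(i, j) returns the squared error of the
    best line through points[i..j] inclusive."""
    def __init__(self, points):
        n = len(points)
        self.sx = [0.0] * (n + 1); self.sy = [0.0] * (n + 1)
        self.sxx = [0.0] * (n + 1); self.sxy = [0.0] * (n + 1)
        self.syy = [0.0] * (n + 1)
        for k, (x, y) in enumerate(points):
            self.sx[k+1] = self.sx[k] + x
            self.sy[k+1] = self.sy[k] + y
            self.sxx[k+1] = self.sxx[k] + x * x
            self.sxy[k+1] = self.sxy[k] + x * y
            self.syy[k+1] = self.syy[k] + y * y

    def err(self, i, j):
        # range sums over points i..j from the prefix sums
        m = j - i + 1
        sx = self.sx[j+1] - self.sx[i]
        sy = self.sy[j+1] - self.sy[i]
        sxx = self.sxx[j+1] - self.sxx[i]
        sxy = self.sxy[j+1] - self.sxy[i]
        syy = self.syy[j+1] - self.syy[i]
        denom = m * sxx - sx * sx
        if denom == 0:          # single point (or all x equal): zero error
            return 0.0
        a = (m * sxy - sx * sy) / denom
        b = (sy - a * sx) / m
        # sum (y - a*x - b)^2 expanded in terms of the range sums
        return syy - 2*a*sxy - 2*b*sy + a*a*sxx + 2*a*b*sx + m*b*b
```

The err formula is just the squared error expanded algebraically so it can be evaluated from the five range sums, without touching the individual points.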

answered Nov 3 '10 at 7:19

The basic premise of dynamic programming is to optimize (lower the 'cost' or in this case, error) of a problem recursively while storing the cost of subproblems as you go so they are not recomputed (this is called memoization).

It's a little late so I'm not gonna get too detailed, but it seems the part you have the most problem with is the core DP itself. DP is needed here because of the 'segmented' quality. As your PDF shows, finding the best-fit line by least squares is simple and does not require DP.

Opt(n) = e(i, n) + C + Opt(i-1) --- our cost function, where
C is the constant cost of a new line segment (without which the problem is trivial and we would just make new line segments for every two points)
e(i, n) is the error of the points i to n with ONE segment (not recursive)
Opt(i-1) is the least cost, recursively, from the first point to the (i-1)th

Now the algorithm

Ensure the list of points is sorted by x values
M[0] = 0 --- self explanatory
For all pairs (i, j) with i < j: find e(i,j) ---- (this will require nested for/foreach loops, or a comprehension kind of structure. Store these values in a 2D array!)
For (j=1..n): M[j] = min_{1<=i<=j} ( e(i,j) + C + M[i-1] )

So basically: find the one-line cost between any two possible points, and store it in e.
Then find the least cost up to j, for j between 1 and n. Values along the way will help with later computation, so store them! Note that i is a parameter to Opt as well. M[n] is the total optimized cost.

Question for you - How can you determine where segmentation occurs? How can you use this, and M, to determine the set of line segments once it's all over?

Hope this helps!

answered Nov 3 '10 at 6:10

Here is how the dynamic programming for the segmented least squares problem is formulated: M[j] = min_{1<=i<=j} ( e(i,j) + c + M[i-1] ), with M[0] = 0.

Here, M[j] represents the minimum total cost of fitting segments to the first j points, where e(i,j) is the error of the regression line fitted on points i to j; we can track the starting point i of the minimum-error last segment by keeping a backpointer array alongside the dynamic programming array. Also, c denotes the cost to draw a new line (it acts as a penalty on the number of lines fit). The optimal substructure property is given by the Bellman equation.

Here is my Python implementation of the above DP algorithm, on a 2D dataset with points (xs, ys):

import numpy as np

def ls_fit(xs, ys, m):
    # closed-form least squares fit of y = a*x + b to m points
    a = (m*sum(xs*ys)-sum(xs)*sum(ys)) / (m*sum(xs**2)-sum(xs)**2)
    b = (sum(ys)-a*sum(xs)) / m
    return a, b

def compute_errors(xs, ys):
    # e[i,j] = squared error of the best single line through points i..j
    n = len(xs)
    e = np.zeros((n,n))
    for j in range(n):
        for i in range(j+1):
            m = j-i+1
            if m > 1:
                a, b = ls_fit(xs[i:i+m], ys[i:i+m], m)
                e[i,j] = sum((ys[i:i+m] - a*xs[i:i+m] - b)**2)
    return e

def build_DP_table(e, n, c):
    M = np.zeros(n)
    p = np.zeros(n, dtype=int) # backpointers: start of the last segment
    for j in range(1, n):
        # cost of ending with a segment i..j, plus the best cost before i
        cost = [e[i,j] + c + (M[i-1] if i > 0 else 0) for i in range(j)]
        M[j] = np.min(cost)
        p[j] = np.argmin(cost)
    return M, p
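The segment boundaries can also be recovered directly from the backpointer array p built above, by walking it backwards from the last point (a sketch; the function name and the (start, end) output convention are mine):

```python
def recover_segments(p, n):
    """Split indices 0..n-1 into optimal segments by following
    backpointers: p[j] is the start of the segment ending at j."""
    segments = []
    j = n - 1
    while j > 0:
        i = p[j]              # the last segment covers points i..j
        segments.append((i, j))
        j = i - 1             # continue with the points before it
    if j == 0:                # a single leftover point forms its own segment
        segments.append((0, 0))
    return segments[::-1]     # report segments left to right

# e.g. backpointers [0, 0, 0, 2, 2] split 5 points into two segments:
recover_segments([0, 0, 0, 2, 2], 5)  # → [(0, 1), (2, 4)]
```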

Now plot the least square line segments obtained with the dynamic programming formulation:

import matplotlib.pyplot as plt

c = 10
tol = 2   # group points whose backpointers lie within tol of a segment start
starts = np.unique(p)
drawn = set([])
plt.plot(xs, ys, 'g.')
for start in starts:
    indices = np.where(abs(p-start) < tol)[0]
    a, b = ls_fit(xs[indices], ys[indices], len(indices))
    if not (a, b) in drawn:
        plt.plot([xs[min(indices)], xs[max(indices)]],
                 [a*xs[min(indices)]+b, a*xs[max(indices)]+b], linewidth=3,
                 label='line: ({:.2f}, {:.2f})'.format(a, b))
        drawn.add((a, b))
plt.legend()

As expected, the DP found the 3 optimal least-squares lines (L=3) fitted on the data.

answered Dec 3 at 22:29
