source link: https://arxiv.org/abs/2207.07949

[Submitted on 16 Jul 2022]

A Nearly Tight Analysis of Greedy k-means++

The famous k-means++ algorithm of Arthur and Vassilvitskii [SODA 2007] is the most popular way of solving the k-means problem in practice. The algorithm is very simple: it samples the first center uniformly at random, and each of the following k-1 centers is a data point sampled with probability proportional to its squared distance to the closest center chosen so far. Afterward, Lloyd's iterative algorithm is run. The k-means++ algorithm is known to return a \Theta(\log k)-approximate solution in expectation.
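To make the seeding step concrete, here is a minimal NumPy sketch of k-means++ D^2 sampling; the function name and interface are our own illustration, not code from the paper:

import numpy as np

def kmeanspp_seeding(X, k, rng=None):
    """k-means++ seeding via D^2 sampling: the first center is chosen
    uniformly at random; each subsequent center is a data point sampled
    with probability proportional to its squared distance to the
    closest center chosen so far."""
    if rng is None:
        rng = np.random.default_rng()
    n = X.shape[0]
    centers = [X[rng.integers(n)]]                   # first center: uniform
    d2 = ((X - centers[0]) ** 2).sum(axis=1)         # current cost per point
    for _ in range(k - 1):
        idx = rng.choice(n, p=d2 / d2.sum())         # D^2 distribution
        centers.append(X[idx])
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)

In a full run, these k centers would then be handed to Lloyd's algorithm as its initialization.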
In their seminal work, Arthur and Vassilvitskii [SODA 2007] asked about the guarantees for the following \emph{greedy} variant: in every step, we sample \ell candidate centers instead of one and then pick the one that minimizes the new cost. This is also how k-means++ is implemented in, e.g., the popular scikit-learn library [Pedregosa et al.; JMLR 2011].
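A minimal sketch of this greedy variant, extending the function above: in every step we draw \ell candidates from the same D^2 distribution and keep the one that minimizes the resulting cost. Again, the name and interface are our own illustration:

def greedy_kmeanspp_seeding(X, k, ell, rng=None):
    """Greedy k-means++ seeding: in every step, draw ell candidates from
    the D^2 distribution and keep the one minimizing the new k-means
    cost (the sum of squared distances to the closest chosen center)."""
    if rng is None:
        rng = np.random.default_rng()
    n = X.shape[0]
    centers = [X[rng.integers(n)]]
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        cand = rng.choice(n, size=ell, p=d2 / d2.sum())   # ell candidate centers
        # squared distances from every point to every candidate, shape (ell, n)
        dist2 = ((X[None, :, :] - X[cand][:, None, :]) ** 2).sum(axis=2)
        new_d2 = np.minimum(d2[None, :], dist2)           # cost vector per candidate
        best = int(np.argmin(new_d2.sum(axis=1)))         # cheapest candidate wins
        centers.append(X[cand[best]])
        d2 = new_d2[best]
    return np.array(centers)

For reference, scikit-learn exposes this candidate count as the n_local_trials parameter of sklearn.cluster.kmeans_plusplus; to the best of our knowledge it defaults to 2 + floor(log k).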
We present nearly matching lower and upper bounds for greedy k-means++: we prove that it is an O(\ell^3 \log^3 k)-approximation algorithm. On the other hand, we prove a lower bound of \Omega(\ell^3 \log^3 k / \log^2(\ell \log k)). Previously, only an \Omega(\ell \log k) lower bound was known [Bhattacharya, Eube, Röglin, Schmidt; ESA 2020], and no upper bound was known at all.

Subjects: Data Structures and Algorithms (cs.DS); Machine Learning (cs.LG)
Cite as: arXiv:2207.07949 [cs.DS]
  (or arXiv:2207.07949v1 [cs.DS] for this version)
  https://doi.org/10.48550/arXiv.2207.07949
