A Go implementation of TensorFlow's streaming quantiles estimator
source link: https://www.tuicool.com/articles/hit/jq6Nfib
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
quantiles - Optimal Quantile Approximation in Streams
This is a translation of TensorFlow's quantile helper class , it aims to compute approximate quantiles with error bound guarantees for weighted data sets. This implementation is an adaptation of techniques from the following papers:
- (2001) Space-efficient online computation of quantile summaries .
- (2004) Power-conserving computation of order-statistics over sensor networks .
- (2007) A fast algorithm for approximate quantiles in high speed data streams .
- (2016) XGBoost: A Scalable Tree Boosting System .
The key ideas at play are the following:
-
Maintain an in-memory multi-level quantile summary in a way to guarantee
a maximum approximation error of
eps * W
per bucket whereW
is the total weight across all points in the input dataset. -
Two base operations are defined:
MERGE
andCOMPRESS
.MERGE
combines two summaries guaranteeing aepsNew = max(eps1, eps2)
.COMPRESS
compresses a summary tob + 1
elements guaranteeingepsNew = epsOld + 1/b
. -
b * sizeof(summary entry)
must ideally be small enough to fit in an average CPU L2 cache. -
To distribute this algorithm with maintaining error bounds, we need
the worker-computed summaries to have no more than
eps / h
error where h is the height of the distributed computation graph which is 2 for an MR with no combiner.
We mainly want to max out IO bw by ensuring we're not compute-bound and using a reasonable amount of RAM.
Complexity:
O(n * log(1/eps * log(eps * n))) O(1/eps * log^2(eps * n))
An epsilon value of zero would make the algorithm extremely inefficent and therefore, is disallowed.
Example Usage
package quantiles_test import ( "fmt" "github.com/axiomhq/quantiles" ) func Example() { sketch := quantiles.NewDefault() for i := 0.0; i < 1e6; i++ { if err := sketch.Push(i, 1.0); err != nil { panic(err) } } fmt.Print("ApproximationError:") fmt.Println(sketch.ApproximationError(1)) // 0 <nil> fmt.Print("Finalize:") fmt.Println(sketch.Finalize()) // <nil> fmt.Print("GenerateQuantiles(4):") fmt.Println(sketch.GenerateQuantiles(4)) // [0 251865 503730 746595 999999] <nil> fmt.Print("GenerateQuantiles(10):") fmt.Println(sketch.GenerateQuantiles(10)) // [0 98946 197892 296838 395789 503730 602676 701622 800568 899514 999999] <nil> sum, err := sketch.FinalSummary() if err != nil { panic(err) } fmt.Print("GenerateQuantiles(4):") fmt.Println(sum.GenerateQuantiles(4)) // [0 251865 503730 746595 999999] }
TODO
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK