(An interesting project) -- what techniques does it take to quickly aggregate the data in a large file (1 billion rows)?

 8 months ago
source link: https://www.v2ex.com/t/1007019

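  lsk569937453 · 4 hours 32 minutes ago · 1195 views
I just came across an interesting project on GitHub: https://github.com/gunnarmorling/1brc . The project is open, so anyone can submit an entry; the task is to aggregate 1 billion rows of data.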

The submission currently ranked first is
```
https://github.com/gunnarmorling/1brc/blob/main/src/main/java/dev/morling/onebrc/CalculateAverage_royvanrijn.java
```
The comments in that solution read as follows:

```
* Initial submission: 62000 ms
* Chunked reader: 16000 ms
* Optimized parser: 13000 ms
* Branchless methods: 11000 ms
* Adding memory mapped files: 6500 ms (based on bjhara's submission)
* Skipping string creation: 4700 ms
* Custom hashmap... 4200 ms
* Added SWAR token checks: 3900 ms
* Skipped String creation: 3500 ms (idea from kgonia)
* Improved String skip: 3250 ms
* Segmenting files: 3150 ms (based on spullara's code)
* Not using SWAR for EOL: 2850 ms
* Inlining hash calculation: 2450 ms
* Replacing branchless code: 2200 ms (sometimes we need to kill the things we love)
* Added unsafe memory access: 1900 ms (keeping the long[] small and local)
```
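To make the "SWAR token checks" entry in the list above concrete, here is a minimal sketch of the general SWAR (SIMD within a register) idea: test 8 bytes at once for a separator byte using plain `long` arithmetic. It is not taken from the linked submission; the class name `SwarSketch` and both helper names are hypothetical, and the `;` separator is an assumption about the input format.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SwarSketch {
    // Hypothetical helper: read 8 bytes starting at `offset` as a little-endian long,
    // so that byte 0 of the input ends up in the least significant byte.
    static long loadWord(byte[] bytes, int offset) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong(offset);
    }

    // Returns the index (0..7) of the first byte in `word` equal to `target`,
    // or 8 if none of the 8 bytes matches.
    static int firstIndexOf(long word, byte target) {
        // Broadcast the target byte into all 8 byte lanes.
        long pattern = (target & 0xFFL) * 0x0101010101010101L;
        // After XOR, every lane that matched the target is exactly 0x00.
        long x = word ^ pattern;
        // Classic "has zero byte" trick: sets the high bit of each zero lane.
        // The lowest set bit always marks the first zero lane.
        long hit = (x - 0x0101010101010101L) & ~x & 0x8080808080808080L;
        // The lowest set bit sits at bit 8*i + 7; dividing by 8 gives i.
        // numberOfTrailingZeros(0) is 64, which maps to 8 (no match).
        return Long.numberOfTrailingZeros(hit) >>> 3;
    }
}
```

The point of the technique is that one subtraction, one AND-NOT and one mask replace a loop of eight byte comparisons, which is presumably where a gain like the one listed comes from.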
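I find this quite interesting. What I would actually like to know:
1. Won't memory mapped files blow up the memory?
2. Why is unsafe memory access so fast, and can it cause memory leaks?
3. Wouldn't it be more of a challenge if memory were limited?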
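On question 1, a minimal sketch of reading a file through memory-mapped segments may help frame the discussion. It is not the linked submission's code: the class name `MappedLineCount`, the method `countLines` and the `chunkSize` parameter are hypothetical. A mapping reserves virtual address space outside the Java heap, and the operating system pages the data in on demand, so mapping a large file does not by itself load all of it into memory. A single `FileChannel.map` call is limited to `Integer.MAX_VALUE` bytes, which is one reason to map segment by segment.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedLineCount {
    // Counts '\n' bytes in `file`, mapping at most `chunkSize` bytes at a time.
    // Assumption: 0 < chunkSize <= Integer.MAX_VALUE.
    static long countLines(Path file, long chunkSize) {
        long lines = 0;
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            long size = ch.size();
            for (long pos = 0; pos < size; pos += chunkSize) {
                long len = Math.min(chunkSize, size - pos);
                // Read-only mapping of the segment [pos, pos + len).
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    if (buf.get() == '\n') {
                        lines++;
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```

Counting newline bytes works across segment boundaries because each byte is inspected exactly once. A real parser would also have to handle a record that straddles two segments, which this sketch deliberately leaves out.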
