Speeding up llama.cpp's load time

I saw this on Hacker News: "Llama.cpp 30B runs with only 6GB of RAM now (github.com/ggerganov)". The original pull request is "Make loading weights 10-100x faster #613".

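The PR's author, Justine Tunney, explains in the PR that the model file format was changed so that the weights can be loaded with mmap(). This greatly reduces the up-front read time, because loading becomes lazy, and it lets the process use the kernel's page cache directly, which avoids the double-buffering problem: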

This was accomplished by changing the file format so we can mmap() weights directly into memory without having to read() or copy them thereby ensuring the kernel can make its file cache pages directly accessible to our inference processes; and secondly, that the file cache pages are much less likely to get evicted (which would force loads to hit disk) because they're no longer competing with memory pages that were needlessly created by gigabytes of standard i/o.
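
To make the difference concrete, here is a minimal Python sketch of the same idea, not the PR's actual C++ code. It maps a file read-only and slices out a few bytes, so the kernel only has to page in the region that is touched. The helper names `make_demo_file` and `read_slice_mmap` and the 4 MiB demo file are hypothetical stand-ins for a real weights file.

```python
import mmap
import os
import tempfile


def make_demo_file(size=4 * 1024 * 1024):
    """Create a hypothetical stand-in for a weights file.

    The file holds a repeating 0..255 byte pattern, so the byte at
    offset k is always k % 256.
    """
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        f.write(bytes(range(256)) * (size // 256))
    return path


def read_slice_mmap(path, offset, length):
    """Return `length` bytes starting at `offset` using a read-only mapping.

    Nothing is read() into a private buffer up front; pages are faulted
    in from the kernel's file cache only when the slice is accessed.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return mm[offset:offset + length]


if __name__ == "__main__":
    demo_path = make_demo_file()
    try:
        print(read_slice_mmap(demo_path, 1_000_000, 4))
    finally:
        os.remove(demo_path)
```

A plain `f.read()` of the whole file would instead copy every byte into process memory, in addition to whatever the kernel keeps in its file cache, which is the double buffering the PR description refers to.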

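This reminds me of the database world, where some storage engines (LMDB, for example) map their data files with mmap() for a similar reason. PostgreSQL, by contrast, reads data files into its own shared buffers, so it is the double-buffering case rather than the mmap() one.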

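Tunney also mentions an unexpected observation in a comment on the PR: surprisingly little of the model is actually touched during inference. With a simple test prompt, only about 1.6GB of the 20GB weights file for the 30B model was used on an Intel machine: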

If I run 30B on my Intel machine:

[...]

As we can see, 400k page faults happen, which means only 1.6 gigabytes ((411522 * 4096) / (1024 * 1024)) of the 20 gigabyte weights file actually needed to be used.
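
The arithmetic in the quote can be checked with a few lines of Python. This is only a sketch of the quoted calculation: it assumes, as the quote does, that each page fault corresponds to one distinct 4096-byte page of the weights file. The function names are hypothetical.

```python
PAGE_SIZE = 4096  # bytes per page, the value used in the quoted calculation


def touched_bytes(page_faults, page_size=PAGE_SIZE):
    """Bytes touched, assuming one distinct page per fault (an assumption)."""
    return page_faults * page_size


def to_mib(num_bytes):
    """Convert bytes to MiB (1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


if __name__ == "__main__":
    faults = 411522  # page-fault count from the quote
    mib = to_mib(touched_bytes(faults))
    print(f"{mib:.1f} MiB, about {mib / 1024:.2f} GiB")
```

The formula in the quote divides by 1024 * 1024, so its result is in MiB (about 1607.5), which the comment rounds to "1.6 gigabytes".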

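Tunney is still unsure whether this is caused by a bug in the change, but for now thinks that is unlikely and has not found anything that points to one: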

Now, since my change is so new, it's possible my theory is wrong and this is just a bug. I don't actually understand the inner workings of LLaMA 30B well enough to know why it's sparse. Maybe we made some kind of rare mistake where llama.cpp is somehow evaluating 30B as though it were the 7B model. Anything's possible, however I don't think it's likely. I was pretty careful in writing this change, to compare the deterministic output of the LLaMA model, before and after the Git commit occurred. I haven't however actually found the time to reconcile the output of LLaMA C++ with something like PyTorch. It'd be great if someone could help with that, and possibly help us know why, from more a data science (rather than systems engineering perspective) why 30B is sparse.

If it is not a bug, this is an interesting signal: could these models be slimmed down even further?


Author: Gea-Suan Lin. Posted on April 1, 2023. Categories: Computer, Murmuring, Software. Tags: ai, buffer, cache, cpp, double, learning, llama, machine, mmap, performance, speed, time.
