Performance of Flash Attention and torch.compile()

source link: https://donghao.org/2024/03/01/performance-of-flash-attention-and-torch-compile/
I am building a small repo of multi-modal models (CLIP, ALBEF, BLIP, etc.); the GPT code is mainly from nanoGPT. Along the way I became curious about the performance impact of "Flash Attention" and "torch.compile()".
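For reference, the baseline run uses the usual manual causal self-attention, roughly like the sketch below (nanoGPT-style; the function and tensor names are my own illustration, not the exact code in the repo):

import math
import torch
import torch.nn.functional as F

def manual_causal_attention(q, k, v):
    # q, k, v: (batch, n_head, seq_len, head_dim)
    T = q.size(-2)
    # scaled dot-product attention scores
    att = (q @ k.transpose(-2, -1)) / math.sqrt(k.size(-1))
    # causal mask: each position attends only to itself and earlier positions
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool, device=q.device))
    att = att.masked_fill(~mask, float('-inf'))
    att = F.softmax(att, dim=-1)
    return att @ v  # (batch, n_head, seq_len, head_dim)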

The metrics with my original code (w/o Flash Attention, w/o torch.compile()):

[100] loss: 4.0315 time 23.7708
[200] loss: 4.0020 time 23.9010
[300] loss: 3.8115 time 23.9407
[400] loss: 3.7021 time 23.9785
[500] loss: 3.6626 time 24.0076
[600] loss: 3.7109 time 24.0060

The metrics after adding Flash Attention:

[100] loss: 4.1204 time 23.0655
[200] loss: 3.8950 time 23.2243
[300] loss: 3.9116 time 23.2714
[400] loss: 3.7837 time 23.2864
[500] loss: 3.8313 time 23.2993
[600] loss: 3.9138 time 23.3255
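Here "adding Flash Attention" means the same change nanoGPT makes: swapping the manual attention for torch.nn.functional.scaled_dot_product_attention (PyTorch >= 2.0), which dispatches to a fused FlashAttention kernel on supported GPUs. A minimal sketch:

import torch.nn.functional as F

def flash_causal_attention(q, k, v):
    # q, k, v: (batch, n_head, seq_len, head_dim)
    # is_causal=True applies the causal mask inside the fused kernel,
    # so no (seq_len, seq_len) mask tensor is ever materialized
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

The fused kernel mainly avoids writing the full attention matrix to GPU memory; the modest speedup here suggests attention was not the dominant cost at this model size.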

The metrics after adding Flash Attention and torch.compile():

[100] loss: 3.9969 time 14.8842
[200] loss: 3.8506 time 15.0004
[300] loss: 3.8702 time 15.0050
[400] loss: 3.7977 time 15.0061
[500] loss: 3.7374 time 15.0492
[600] loss: 3.6589 time 15.0661
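torch.compile(), by contrast, is a one-line wrapper around the whole model: it traces the forward (and backward) pass and emits fused kernels, cutting Python overhead and memory traffic across every layer, not just attention. A runnable sketch (the tiny Sequential model is just a stand-in for the GPT):

import torch

model = torch.nn.Sequential(     # stand-in for the nanoGPT model
    torch.nn.Linear(64, 64),
    torch.nn.GELU(),
)
compiled = torch.compile(model)  # requires PyTorch >= 2.0

x = torch.randn(8, 64)
y = compiled(x)  # first call pays a one-time compilation cost;
                 # later calls reuse the optimized code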

It seems "torch.compile()" is much more powerful than "Flash Attention" here: Flash Attention alone shaved the per-100-step time from roughly 24.0 s to 23.3 s (about 3%), while adding torch.compile() cut it to roughly 15.1 s, a ~1.6x speedup over the baseline, at essentially the same loss.
