Running Large Language Models Natively on Mobile and Laptops

May 07, 2023 2 min read

MLC LLM is a new open-source project that aims to enable the deployment of large language models on a variety of hardware platforms and applications. It also includes a framework to optimize model performance for each specific use case.

Our mission is to enable everyone to develop, optimize and deploy AI models natively on everyone's devices. Everything runs locally with no server support and accelerated with local GPUs on your phone and laptops.

At the foundation of MLC LLM lies an approach called machine learning compilation (MLC), which combines ML programming abstractions, learning-driven search, compilation, and an optimized library runtime for ease of deployment.
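
To give a flavor of what learning-driven search means in practice, here is a minimal Python sketch of the general idea: generate several candidate schedules for a kernel, measure each one on the target, and keep the fastest. The function names are hypothetical placeholders for illustration and are not the actual Apache TVM Unity API.

```python
import time

# Illustrative sketch of "learning-driven search": try candidate schedules
# for an operator, time each one, keep the fastest. candidate_schedules()
# and compile_for() are hypothetical placeholders, not the TVM Unity API.

def candidate_schedules(op):
    # e.g. different tile sizes for a matrix-multiply kernel
    return [{"tile": t} for t in (8, 16, 32, 64)]

def compile_for(op, schedule, target):
    # placeholder: a real compiler would emit device code for the target here
    def kernel():
        sum(range(schedule["tile"] * 10_000))  # stand-in for real device work
    return kernel

def tune(op, target):
    best, best_time = None, float("inf")
    for schedule in candidate_schedules(op):
        kernel = compile_for(op, schedule, target)
        start = time.perf_counter()
        kernel()
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = schedule, elapsed
    return best

print(tune("matmul", target="metal"))  # picks the fastest measured variant
```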

In comparison with the more controlled case of deployment on server-class systems, the main complexity the project faces is the heterogeneity of the supported hardware. This includes supporting different models of CPUs, GPUs, and potentially other co-processors and accelerators; addressing memory constraints; and dealing with OS environment variation, where dependencies such as Python or specific packages cannot always be taken for granted.

To achieve these goals, MLC LLM is based on Apache TVM Unity, a compiler stack for deep learning systems, and leverages tokenizers from Hugging Face and Google as well as open-source LLMs such as Llama, Vicuna, Dolly, and others.
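
As an illustration of the tokenization step that sits in front of a compiled model, the short sketch below uses the Hugging Face transformers library directly; MLC LLM bundles its own tokenizer support, so this only shows the encode/decode round trip, and the model id is merely an example.

```python
# Sketch of the encode/decode round trip a tokenizer performs around a model.
# This calls the Hugging Face transformers library directly; MLC LLM ships its
# own tokenizer bindings, so treat this purely as an illustration. The model
# id is an example and loading it requires the sentencepiece package.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.3")

prompt = "What is machine learning compilation?"
token_ids = tokenizer.encode(prompt)   # text -> token ids fed to the model
print(token_ids)
print(tokenizer.decode(token_ids))     # token ids -> text coming back out
```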


The project includes both a C++ CLI tool and an iOS chat app showcasing how to integrate the compiled artifacts and the required pre/post-processing.
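
The sketch below gives a rough idea of what that pre/post-processing looks like around a compiled artifact: apply a conversation template, tokenize, run one decoding step at a time, and detokenize the result. CompiledModel and StubTokenizer are hypothetical stand-ins, not the actual MLC LLM runtime interface.

```python
# Hypothetical sketch of the pre/post-processing a chat front end wraps around
# a compiled model artifact. CompiledModel and StubTokenizer are illustrative
# stand-ins, not the actual MLC LLM runtime API.

EOS = 2  # assumed end-of-sequence token id

class StubTokenizer:
    def encode(self, text):
        return [ord(c) for c in text]
    def decode(self, ids):
        return "".join(chr(i) for i in ids)

class CompiledModel:
    def decode_step(self, token_ids):
        # placeholder: a real artifact would run one GPU decoding step here
        return EOS

def chat(model, tokenizer, prompt, max_new_tokens=128):
    # pre-processing: wrap the user turn in a conversation template
    token_ids = tokenizer.encode(f"USER: {prompt}\nASSISTANT:")
    generated = []
    for _ in range(max_new_tokens):
        next_id = model.decode_step(token_ids + generated)
        if next_id == EOS:
            break
        generated.append(next_id)
    # post-processing: turn the generated token ids back into text
    return tokenizer.decode(generated)

print(chat(CompiledModel(), StubTokenizer(), "Hello!"))
```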

MLC LLM can be deployed on recent Apple Silicon devices, including the iPhone 14 Pro, iPad Pro models with the M1 or A12Z chip, and M1-based MacBook Pro and later models; AMD GPUs including the Radeon Pro 5300M, the AMD GPU in the Steam Deck, the RX 6800 with 16GB of VRAM, and others; NVIDIA GPUs including the GTX 1060 (6GB), RTX 3080, RTX 2080 Ti, and others; and the Intel UHD Graphics 630 GPU. Support for Android devices is in the works.

Performance varies significantly across the supported hardware, with several NVIDIA GPUs, the AMD RX 6800, and the 2021 MacBook Pro M1 Max scoring above 20 tokens/second. For comparison, the M1 iPad Pro reaches 10.6 tokens/second and the iPhone 14 Pro 7.2 tokens/second.
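
For context, a tokens/second figure of this kind is typically obtained by dividing the number of generated tokens by the wall-clock decoding time, as in the following minimal sketch, where fake_decode_step is a placeholder for a real model call.

```python
import time

# Minimal sketch of how a tokens/second figure is typically computed: count
# the generated tokens and divide by wall-clock decode time.
# fake_decode_step() is a placeholder standing in for a real model call.

def fake_decode_step():
    time.sleep(0.05)  # pretend one decoding step takes 50 ms
    return 42         # some token id

start = time.perf_counter()
generated = [fake_decode_step() for _ in range(20)]
elapsed = time.perf_counter() - start

print(f"{len(generated) / elapsed:.1f} tokens/second")  # roughly 20 tok/s here
```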

According to the project maintainers, MLC LLM makes it possible to run quick experiments, try out compiler optimizations, and eventually deploy to the desired targets with little friction.

If you want to find out more about MLC, you can check out the official documentation, which will guide you through the key abstractions used to represent machine learning programs, automatic optimization techniques, and how to optimize for dependencies, memory, and performance.

As a related note, MLC LLM has a companion project, WebLLM, which focuses on running LLMs in web browsers.

About the Author

Sergio De Simone

Sergio De Simone is a software engineer. Sergio has been working as a software engineer for over fifteen years across a range of different projects and companies, including such different work environments as Siemens, HP, and small startups. For the last few years, his focus has been on development for mobile platforms and related technologies. He is currently working for BigML, Inc., where he leads iOS and OS X development.
