Running Large Language Models Natively on Mobile and Laptops

May 07, 2023 2 min read

MLC LLM is a new open-source project that aims to enable the deployment of large language models on a variety of hardware platforms and applications. It also includes a framework to optimize model performance for each specific use case.

Our mission is to enable everyone to develop, optimize and deploy AI models natively on everyone's devices. Everything runs locally with no server support and accelerated with local GPUs on your phone and laptops.

At the foundation of MLC LLM lies an approach called machine learning compilation (MLC), which combines ML programming abstractions, learning-driven search, compilation, and an optimized library runtime for ease of deployment.
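
To give a flavor of what learning-driven search means in practice, here is a minimal Python sketch of the general idea: generate several candidate schedules for a kernel, measure each one on the target, and keep the fastest. The function names are hypothetical placeholders for illustration and are not the actual Apache TVM Unity API.

```python
import time

# Illustrative sketch of "learning-driven search": try candidate schedules
# for an operator, time each one, keep the fastest. candidate_schedules()
# and compile_for() are hypothetical placeholders, not the TVM Unity API.

def candidate_schedules(op):
    # e.g. different tile sizes for a matrix-multiply kernel
    return [{"tile": t} for t in (8, 16, 32, 64)]

def compile_for(op, schedule, target):
    # placeholder: a real compiler would emit device code for the target here
    def kernel():
        sum(range(schedule["tile"] * 10_000))  # stand-in for real device work
    return kernel

def tune(op, target):
    best, best_time = None, float("inf")
    for schedule in candidate_schedules(op):
        kernel = compile_for(op, schedule, target)
        start = time.perf_counter()
        kernel()
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = schedule, elapsed
    return best

print(tune("matmul", target="metal"))  # picks the fastest measured variant
```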

In comparison with the more controlled case of deployment on server-class systems, the main complexity the project faces is the heterogeneity of the supported hardware. This includes supporting different models of CPUs, GPUs, and potentially other co-processors and accelerators; addressing memory constraints; and dealing with OS environment variation, where dependencies such as Python or specific packages cannot always be taken for granted.

To achieve these goals, MLC LLM is based on Apache TVM Unity, a compiler stack for deep learning systems, and leverages tokenizers from Hugging Face and Google as well as open-source LLMs such as Llama, Vicuna, Dolly, and others.
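
As an illustration of the tokenization step that sits in front of a compiled model, the short sketch below uses the Hugging Face transformers library directly; MLC LLM bundles its own tokenizer support, so this only shows the encode/decode round trip, and the model id is merely an example.

```python
# Sketch of the encode/decode round trip a tokenizer performs around a model.
# This calls the Hugging Face transformers library directly; MLC LLM ships its
# own tokenizer bindings, so treat this purely as an illustration. The model
# id is an example and loading it requires the sentencepiece package.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmsys/vicuna-7b-v1.3")

prompt = "What is machine learning compilation?"
token_ids = tokenizer.encode(prompt)   # text -> token ids fed to the model
print(token_ids)
print(tokenizer.decode(token_ids))     # token ids -> text coming back out
```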


The project includes both a C++ CLI tool and an iOS chat app showcasing how to integrate the compiled artifacts and the required pre/post-processing.
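
The sketch below gives a rough idea of what that pre/post-processing looks like around a compiled artifact: apply a conversation template, tokenize, run one decoding step at a time, and detokenize the result. CompiledModel and StubTokenizer are hypothetical stand-ins, not the actual MLC LLM runtime interface.

```python
# Hypothetical sketch of the pre/post-processing a chat front end wraps around
# a compiled model artifact. CompiledModel and StubTokenizer are illustrative
# stand-ins, not the actual MLC LLM runtime API.

EOS = 2  # assumed end-of-sequence token id

class StubTokenizer:
    def encode(self, text):
        return [ord(c) for c in text]
    def decode(self, ids):
        return "".join(chr(i) for i in ids)

class CompiledModel:
    def decode_step(self, token_ids):
        # placeholder: a real artifact would run one GPU decoding step here
        return EOS

def chat(model, tokenizer, prompt, max_new_tokens=128):
    # pre-processing: wrap the user turn in a conversation template
    token_ids = tokenizer.encode(f"USER: {prompt}\nASSISTANT:")
    generated = []
    for _ in range(max_new_tokens):
        next_id = model.decode_step(token_ids + generated)
        if next_id == EOS:
            break
        generated.append(next_id)
    # post-processing: turn the generated token ids back into text
    return tokenizer.decode(generated)

print(chat(CompiledModel(), StubTokenizer(), "Hello!"))
```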

MLC LLM can be deployed on recent Apple Silicon devices, including the iPhone 14 Pro, iPad Pro models with the M1 or A12Z chip, and M1-based MacBook Pro and later models; AMD GPUs including the Radeon Pro 5300M, the AMD GPU in the Steam Deck, the RX 6800 with 16GB of VRAM, and others; NVIDIA GPUs including the GTX 1060 (6GB), RTX 3080, RTX 2080 Ti, and others; and the Intel UHD Graphics 630 GPU. Support for Android devices is in the works.

Performance varies significantly across the supported hardware, with several NVIDIA GPUs, the AMD RX 6800, and the 2021 MacBook Pro M1 Max scoring above 20 tokens/second. For comparison, the M1 iPad Pro reaches 10.6 tokens/second and the iPhone 14 Pro 7.2 tokens/second.
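
For context, a tokens/second figure of this kind is typically obtained by dividing the number of generated tokens by the wall-clock decoding time, as in the following minimal sketch, where fake_decode_step is a placeholder for a real model call.

```python
import time

# Minimal sketch of how a tokens/second figure is typically computed: count
# the generated tokens and divide by wall-clock decode time.
# fake_decode_step() is a placeholder standing in for a real model call.

def fake_decode_step():
    time.sleep(0.05)  # pretend one decoding step takes 50 ms
    return 42         # some token id

start = time.perf_counter()
generated = [fake_decode_step() for _ in range(20)]
elapsed = time.perf_counter() - start

print(f"{len(generated) / elapsed:.1f} tokens/second")  # roughly 20 tok/s here
```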

According to the project maintainers, MLC LLM makes it possible to run quick experiments, try out compiler optimizations, and eventually deploy to the desired targets with little friction.

If you want to find out more about MLC, you can check out the official documentation, which will guide you through the key abstractions used to represent machine learning programs, automatic optimization techniques, and how to optimize for dependencies, memory, and performance.

As a related note, MLC LLM has a companion project, WebLLM, which focuses on running LLMs in web browsers.

About the Author

Sergio De Simone

Sergio De Simone is a software engineer. Sergio has been working as a software engineer for over fifteen years across a range of different projects and companies, including such different work environments as Siemens, HP, and small startups. For the last few years, his focus has been on development for mobile platforms and related technologies. He is currently working for BigML, Inc., where he leads iOS and OS X development.
