1

You can now train a 70b language model at home

 6 months ago
source link: http://www.answer.ai/posts/2024-03-06-fsdp-qlora.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Background

The big idea

There are two very different levels of hardware used to train deep learning models. There is the data center class hardware, such as H100s and A100s, costing hundreds of thousands of dollars. Then there are desktop computers containing gaming GPUs, such as dual 4090s, costing under $10,000 (and which can be assembled from 2nd hand parts for less than half the price of a pre-built system).

But here’s the key point: the gaming GPUs have similar performance to the data center GPUs that cost over 10x more! It would be great if we could use these 10x cheaper (but nearly as fast) cards to train large language models, but we can’t, because they have much less memory. The best currently available data center cards have 80GB RAM, whilst gaming cards max out at 24GB RAM. Since only the largest models produce the best results, creating the best models has been largely inaccessible to most people.

We realized that there’s actually no intrinsic reason for this. The super fast hardware is there, waiting to be used – we just need a way to feed it with the model and the data in a way that meets its memory constraints. The obvious question is: why hasn’t this been done then? All the big industry labs have the 10x more expensive hardware already, so they don’t really have the incentive to figure this out.

The big idea here is simple: figure out how to use these cheaper, lower-memory gaming GPUs to train the best available open source models. So the goal is this: train1 a 70 billion parameter (70b) model using only gaming GPUs, which means our per-GPU memory will be at most 24GB. It’ll be a challenge, because each parameter normally takes 16 bits (2 bytes), so that’s 70*2=140GB to even store the weights – and that’s without including all the other data such as activations, gradients, and optimization state!

Why this project?

Answer.AI is a very unusual type of organization – a for-profit R&D lab closer in spirit to 19th century electricity labs than to today’s AI research groups. Figuring out how to make large model training inexpensive and accessible is just the kind of thing Eric Ries and Jeremy Howard hoped we’d be able to do when the organization was launched at NeurIPS last year.

Solving this problem is hard. It requires understanding many separate libraries (e.g bitsandbytes, PEFT, Transformers, Accelerate, and PyTorch), and computer science and math concepts (e.g discretization, distributed computing, GPU programming, linear algebra, SGD concepts such as gradient checkpointing), and how they all interact.

Academia is full of brilliant people that solve hard problems. But academia hasn’t solved this particular problem. That’s because it’s difficult for university researchers to justify spending time on this kind of work. Combining existing tools and techniques together isn’t generally considered “novel” enough to result in publication in a high impact journal, but that’s the currency that academics need. Furthermore, academics are generally expected to become highly specialized within their field, making it challenging to bring together so many pieces into a single solution.

And, of course, big tech companies are also full of brilliant people that solve hard problems. But this particular problem, training models with consumer GPUs, isn’t a problem they need to solve – they’ve already bought the big expensive GPUs! Many startups are also full of brilliant people that solve hard problems! But, as Eric Ries explains, “today’s financial market forces businesses to prioritize short-term gains over everything else”. It’s extremely hard for a startup to justify to investors why they’re spending their funds on open source software and public research.

Whilst academia, big tech, and startups had good reasons for not solving this problem, these are the exact reasons that this problem was a great fit for Answer.AI. Everyone who works at the company has built the kinds of systems that we had to work with on this problem, so we were able to understand how all the pieces fit together. People who love to both deeply understand the foundations of software and AI, and also love to hack at fun and interesting end-to-end systems are the kinds of people who are drawn to Answer.AI, and vice versa.

The problems we choose to solve together are selected by the same people that will do the solving. So we tend to pick up projects that involve bringing together multiple ideas together to create practically useful solutions. And because we’re a public benefit company with a charter to produce long term benefit from AI, open source software and public research are directly in line with our mission.

QLoRA: Train bigger models on a single GPU

Two projects have been released recently that took the first critical steps towards making this a reality: QLoRA (by Tim Dettmers et al), and FSDP (by Meta’s PyTorch team).

QLoRA is a simple but brilliant combination of two critically important advances in modern neural networks: quantization, and LoRA. Quantization is a technique where, instead of using 16 or even 32 bits to store the weights of a neural network, 4 (or even fewer) bits are used. There are only 16 possible values of a 4 bit number, but Dettmers and Zettlemoyer showed that this can be enough in the large language models that are popular today. Tim Dettmers made these 4-bit “quantized” models easy to create, thanks to his bitsandbytes library, and recently Hugging Face has stepped in to help maintain and document this library, particularly thanks to the initiative of Titus von Koeller.

Unfortunately, once a model is quantized, it can not be trained any further with regular approaches – with just 16 possible values, the gradient descent method used for model training will observe zero gradients nearly everywhere, so it can’t make any updates to the quantized weights. This is a major problem, because it means that quantization can only be used for inference, not for continued pre-training or fine-tuning. Whilst inference is useful and important, it’s really just consuming models. But we want everybody to be able to contribute to creating models!

The trick to avoiding this limitation is to use LoRA – “Low-Rank Adaptation of Large Language Models”. LoRA doesn’t train the whole large language model at all, but instead adds “adaptors”, which are very small matrices (generally smaller than 1% of the full model) that are trained, whilst keeping the rest of the model constant. If you’ve played with models like Stable Diffusion, you will have probably seen these adapters many times; it’s how those models are generally shared, and why they are so small and fast to download.

Tim realized that LoRA can be combined with quantization: use a quantized base model, which is not changed at all by the training, and add trainable LoRA adaptors that are not quantized. This combination is QLoRA. Tim’s team was able to use this to, for the first time, train a model that (unquantized) is larger than the GPU: they trained a 65b model (which is 130GB unquantized) on a 48GB card.

Hugging Face stepped in once again here, creating the PEFT library, which made LoRA training far simpler, and also integrating it directly with bitsandbytes to allow anyone to use QLoRA with just a few lines of code. The Hugging Face team has been working tirelessly behind the scenes to ensure that the open source community can use these technologies to train their models. If you’ve ever used Transformers to load a 4-bit model using a single function argument, then you’ve got them to thank (and even if you haven’t, you’ve almost certainly used the work of folks that have built their model with this ecosystem).

QLoRA didn’t quite slay the problem we set out to solve, to train a 70b model on 24GB cards, but it got closer than anything before. When quantized to 4 bits (which is 0.5 bytes), the 70b model takes 70/2 = 35 GB, which is larger than the 24GB gaming GPUs we want to use.

There are other limitations to QLoRA. A 48GB card is very expensive, and training a 65b model only just fits on such a card. That can be a problem, because we need to store lots of other things too, including the activations2, gradients, and optimization state of the model during training. If there’s not much memory left over after loading the model weights, there’s not enough working memory to support training.

For instance, one of the benefits of language models is that we can use them to “chat” with, or understand, or analyze long documents or conversations. To make models that can handle long sequences like that, we need to show them examples of long sequences during training. The longest sequence used in training is called the “sequence length”. Trying to use anything but a short sequence length will cause an error when training a 65b QLoRA model on a 48GB card, because there isn’t enough memory to store all the information about the sequence; nearly all the memory is used just to store the model itself.

Furthermore, if the model can only look at a single sequence at a time, it’s going to take a really long time to get through all the data in our training set. So instead we want to be able to “batch” a few sequences together at a time. The number of sequences included is the “batch size”. When there’s very little space left on the GPU after loading the model weights, we can only use very small batch sizes, resulting in extremely slow training.

FSDP: Scale training to multiple GPUs

One obvious solution to the problem of the RAM limitations of a single consumer GPU, is to use more than one GPU! A very common approach in the open source community is to simply place a few layers of the model on each card. So then to train, you run the first few layers on the first GPU, then the next few on the second GPU, and so forth. For instance, a 70b (140GB) model could be spread over 8 24GB GPUs, using 17.5GB on each. There’s even a convenient setting in Hugging Face Transformers, device_map=’auto’, which you may well have used; that’s what this is actually doing behind the scenes. This does the job, but there’s a giant downside: only one GPU is ever active at a time, as all the others wait for their “turn”. That means that ⅞ of the compute is wasted.

Distributed Data Parallel (DDP) was previously the gold standard approach to training models across multiple GPUs efficiently. This requires keeping the full model on each GPU – if you have a small model (e.g. a 2b model, which takes 4GB RAM) you can simply load the whole thing onto each GPU separately, and have each GPU then churn through training examples in parallel. So for instance, if you had 4 GPUs, that’s a 4x training speedup. But DDP doesn’t work if the model doesn’t fit onto a GPU, with enough room to spare for the data needed for the training process.

So we need something that can split a model across GPUs (like device_map=’auto’) and also use them in parallel (like DPP). This is where Meta’s Fully Sharded Data Parallel (FSDP) library comes in. It “shards” a large model, by splitting its parameters across multiple GPUs, allowing all the GPUs to be used simultaneously. When a layer of the neural network is calculated on a particular GPU during training, all the required shards are copied there. Then, the calculation is made, and finally the copied parts are deleted from that GPU. Whilst this sounds terribly inefficient, actually by being smart about copying the data of the next layer at the same time the current layer is busy calculating, it’s possible for this approach to result in no slowdown compared to DDP.

FSDP’s ability to bring the performance of DDP to models that are larger than any one GPU has been a revelation. For instance, a 70b (70 billion parameter) unquantized model takes 140GB of RAM (because each parameter is stored as 16 bits, which is 2 bytes), but even NVIDIA’s H100 card (which costs around $40,000 for a single card!) falls short of what’s needed, with its 80GB RAM. But with FSDP, four H100 GPUs can be combined for a total of 320GB RAM.

(Mind you, such a machine is going to set you back around $150,000…)


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK