扩展说明: 指令微调 Llama 2 - HuggingFace - JOYK Joy of Geek, Geek News, Link all geek

这篇博客是一篇来自 Meta AI，关于指令微调 Llama 2 的扩展说明。旨在聚焦构建指令数据集，有了它，我们则可以使用自己的指令来微调 Llama 2 基础模型。

目标是构建一个能够基于输入内容来生成指令的模型。这么做背后的逻辑是，模型如此就可以由其他人生成自己的指令数据集。这在当想开发私人个性化定制模型，如发送推特、写邮件等，时很方便。这也意味着你可以通过你的邮件来生成一个指令数据集，然后用它来训练一个模型来为你写邮件。

好，那我们来开始吧？我们将进行:

定义应用场景细节并创建指令的提示词模板
构建指令数据集
使用 trl 与 SFTTrainer 指令微调 Llama 2
测试模型、进行推理

1. 定义应用场景细节并创建指令的提示词模板

在描述应用场景前，我们要更好的理解一下究竟什么是指令。

指令是一段文本或提供给大语言模型，类似 Llama，GPT-4 或 Claude，使用的提示词，用来指导它去生成回复。指令可以让人们做到把控对话，约束模型输出更自然、实用的输出，并使这些结果能够对齐用户的目的。制作清晰的、整洁的指令则是生成高质量对话的关键。

指令的例子如下表所示。

能力	示例指令
头脑风暴	提供一系列新口味的冰淇淋的创意。
分类	根据剧情概要，将这些电影归类为喜剧、戏剧或恐怖片。
确定性问答	用一个单词回答“法国的首都是哪里？”
生成	用罗伯特·弗罗斯特的风格写一首关于大自然和季节变化的诗。
信息提取	从这篇短文中提取主要人物的名字。
开放性问答	为什么树叶在秋天会变色？用科学的理由解释一下。
摘要	用 2-3 句话概括一下这篇关于可再生能源最新进展的文章。

如开头所述，我们想要微调模型，以便根据输入 (或输出) 生成指令。我们希望将其用作创建合成数据集的方法，以赋予 LLM 和代理个性化能力。

把这个想法转换成一个基础的提示模板，按照 Alpaca 格式.

### Instruction:Use the Input below to create an instruction, which could have been used to generate the input using an LLM.  ### Input:Dear [boss name], I'm writing to request next week, August 1st through August 4th,off as paid time off. I have some personal matters to attend to that week that require me to be out of the office. I wanted to give you as much advance notice as possible so you can plan accordingly while I am away. Please let me know if you need any additional information from me or have any concerns with me taking next week off. I appreciate you considering this request. Thank you, [Your name] ### Response:Write an email to my boss that I need next week 08/01 - 08/04 off.

2. 创建指令数据集

在定义了我们的应用场景和提示模板后，我们需要创建自己的指令数据集。创建高质量的指令数据集是获得良好模型性能的关键。研究表明，“对齐，越少越好” 表明，创建高质量、低数量 (大约 1000 个样本) 的数据集可以达到与低质量、高数量的数据集相同的性能。

创建指令数据集有几种方法，包括:

使用现有数据集并将其转换为指令数据集，例如 FLAN
使用现有的 LLM 创建合成指令数据集，例如 Alpaca
人力创建指令数据集，例如 Dolly。

每种方法都有其优缺点，这取决于预算、时间和质量要求。例如，使用现有数据集是最简单的，但可能不适合您的特定用例，而使用人力可能是最准确的，但必然耗时、昂贵。也可以结合几种不同方法来创建指令数据集，如 Orca: Progressive Learning from Complex Explanation Traces of GPT-4.。

为了简单起见，我们将使用 Dolly，这是一个开源的指令跟踪记录数据集，由数千名 Databricks 员工在 InstructGPT paper 中描述的几个行为类别中生成，包括头脑风暴、分类、确定性回答、生成、信息提取、开放性回答和摘要。

开始编程吧，首先，我们来安装依赖项。

!pip install "transformers==4.31.0" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "trl==0.4.7" "safetensors>=0.3.1" --upgrade

我们使用 🤗 Datasets library 的 load_dataset() 方法加载 databricks/databricks-dolly-15k 数据集。

from datasets import load_datasetfrom random import randrange # 从hub加载数据集dataset = load_dataset("databricks/databricks-dolly-15k", split="train") print(f"dataset size: {len(dataset)}")print(dataset[randrange(len(dataset))])# dataset size: 15011

为了指导我们的模型，我们需要将我们的结构化示例转换为通过指令描述的任务集合。我们定义一个 formatting_function ，它接受一个样本并返回一个符合格式指令的字符串。

def format_instruction(sample):	return f"""### Instruction:Use the Input below to create an instruction, which could have been used to generate the input using an LLM.  ### Input:{sample['response']} ### Response:{sample['instruction']}"""

我们来在一个随机的例子上测试一下我们的结构化函数。

from random import randrange print(format_instruction(dataset[randrange(len(dataset))]))

3. 使用 `trl` 和`SFTTrainer` 指令微调 Llama 2

我们将使用最近在由 Tim Dettmers 等人的发表的论文“QLoRA: Quantization-aware Low-Rank Adapter Tuning for Language Generation”中介绍的方法。QLoRA 是一种新的技术，用于在微调期间减少大型语言模型的内存占用，且并不会降低性能。QLoRA 的 TL;DR; 是这样工作的:

将预训练模型量化为 4bit 位并冻结它。
附加轻量化的、可训练的适配器层。(LoRA)
在使用冻结的量化模型基于文本内容进行微调时，仅微调适配器层参数。

如果您想了解有关 QLoRA 及其工作原理的更多信息，我建议您阅读 Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA 博客文章。

Flash Attention (快速注意力)

Flash Attention 是一种经过重新排序的注意力计算方法，它利用经典技术 (排列、重计算) 来显著加快速度，将序列长度的内存使用量从二次降低到线性。它基于论文“FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”。

TL;DR; 将训练加速了 3 倍。在这儿获得更多信息 FlashAttention。 Flash Attention 目前仅支持 Ampere (A10, A40, A100, …) & Hopper (H100, …) GPU。你可以检查一下你的 GPU 是否支持，并用下面的命令来安装它:

注意: 如果您的机器的内存小于 96GB，而 CPU 核心数足够多，请减少 MAX_JOBS 的数量。在我们使用的 g5.2xlarge 上，我们使用了 4 。

python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"pip install ninja packagingMAX_JOBS=4 pip install flash-attn --no-build-isolation

安装 flash attention 是会需要一些时间 (10-45 分钟)。

该示例支持对所有 Llama 检查点使用 Flash Attention，但默认是未启用的。要开启 Flash Attention，请取消代码块中这段的注释， # COMMENT IN TO USE FLASH ATTENTION 。

import torchfrom transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig use_flash_attention = False # COMMENT IN TO USE FLASH ATTENTION# replace attention with flash attention # if torch.cuda.get_device_capability()[0] >= 8:#     from utils.llama_patch import replace_attn_with_flash_attn#     print("Using flash attention")#     replace_attn_with_flash_attn()#     use_flash_attention = True  # Hugging Face 模型idmodel_id = "NousResearch/Llama-2-7b-hf" # non-gated# model_id = "meta-llama/Llama-2-7b-hf" # gated  # BitsAndBytesConfig int-4 config bnb_config = BitsAndBytesConfig(    load_in_4bit=True,    bnb_4bit_use_double_quant=True,    bnb_4bit_quant_type="nf4",    bnb_4bit_compute_dtype=torch.bfloat16) # 加载模型与分词器model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=False, device_map="auto")model.config.pretraining_tp = 1  # 通过对比doc中的字符串，验证模型是在使用flash attentionif use_flash_attention:    from utils.llama_patch import forward        assert model.model.layers[0].self_attn.forward.__doc__ == forward.__doc__, "Model is not using flash attention"  tokenizer = AutoTokenizer.from_pretrained(model_id)tokenizer.pad_token = tokenizer.eos_tokentokenizer.padding_side = "right"

SFTTrainer 支持与 peft 的本地集成，这使得高效地指令微调LLM变得非常容易。我们只需要创建 LoRAConfig 并将其提供给训练器。

from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model # 基于 QLoRA 论文来配置 LoRApeft_config = LoraConfig(        lora_alpha=16,        lora_dropout=0.1,        r=64,        bias="none",        task_type="CAUSAL_LM", )  # 为训练准备好模型model = prepare_model_for_kbit_training(model)model = get_peft_model(model, peft_config)

在开始训练之前，我们需要定义自己想要的超参数 (TrainingArguments)。

from transformers import TrainingArguments args = TrainingArguments(    output_dir="llama-7-int4-dolly",    num_train_epochs=3,    per_device_train_batch_size=6 if use_flash_attention else 4,    gradient_accumulation_steps=2,    gradient_checkpointing=True,    optim="paged_adamw_32bit",    logging_steps=10,    save_strategy="epoch",    learning_rate=2e-4,    bf16=True,    tf32=True,    max_grad_norm=0.3,    warmup_ratio=0.03,    lr_scheduler_type="constant",    disable_tqdm=True # 当配置的参数都正确后可以关闭tqdm)

我们现在有了用来训练模型 SFTTrainer 所需要准备的每一个模块。

from trl import SFTTrainer max_seq_length = 2048 # 数据集的最大长度序列 trainer = SFTTrainer(    model=model,    train_dataset=dataset,    peft_config=peft_config,    max_seq_length=max_seq_length,    tokenizer=tokenizer,    packing=True,    formatting_func=format_instruction,     args=args,)

通过调用 Trainer 实例上的 train() 方法来训练我们的模型。

# 训练trainer.train() # tqdm关闭后将不显示进度条信息 # 保存模型trainer.save_model()

不使用 Flash Attention 的训练过程在 g5.2xlarge 上花费了 03:08:00。实例的成本为 1,212$/h ，总成本为 3.7$ 。

使用 Flash Attention 的训练过程在 g5.2xlarge 上花费了 02:08:00。实例的成本为 1,212$/h ，总成本为 2.6$ 。

使用 Flash Attention 的结果令人满意，速度提高了 1.5 倍，成本降低了 30%。

4. 测试模型、进行推理

在训练完成后，我们想要运行和测试模型。我们会使用 peft 和 transformers 将 LoRA 适配器加载到模型中。

if use_flash_attention:    # 停止 flash attention    from utils.llama_patch import unplace_flash_attn_with_attn    unplace_flash_attn_with_attn()    import torchfrom peft import AutoPeftModelForCausalLMfrom transformers import AutoTokenizer  args.output_dir = "llama-7-int4-dolly" # 加载基础LLM模型与分词器model = AutoPeftModelForCausalLM.from_pretrained(    args.output_dir,    low_cpu_mem_usage=True,    torch_dtype=torch.float16,    load_in_4bit=True,) tokenizer = AutoTokenizer.from_pretrained(args.output_dir)

我们来再次用随机样本加载一次数据集，试着来生成一条指令。

from datasets import load_dataset from random import randrange  # 从hub加载数据集并得到一个样本dataset = load_dataset("databricks/databricks-dolly-15k", split="train")sample = dataset[randrange(len(dataset))] prompt = f"""### Instruction:Use the Input below to create an instruction, which could have been used to generate the input using an LLM.  ### Input:{sample['response']} ### Response:""" input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()# with torch.inference_mode():outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.9) print(f"Prompt:\n{sample['response']}\n")print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")print(f"Ground truth:\n{sample['instruction']}")

太好了！我们的模型可以工作了！如果想要加速我们的模型，我们可以使用 Text Generation Inference 部署它。因此我们需要将我们适配器的参数合并到基础模型中去。

from peft import AutoPeftModelForCausalLM model = AutoPeftModelForCausalLM.from_pretrained(    args.output_dir,    low_cpu_mem_usage=True,)  # 合并 LoRA 与 base modelmerged_model = model.merge_and_unload() # 保存合并后的模型merged_model.save_pretrained("merged_model",safe_serialization=True)tokenizer.save_pretrained("merged_model") # push合并的模型到hub上# merged_model.push_to_hub("user/repo")# tokenizer.push_to_hub("user/repo")

原文作者: Philschmid

原文链接: https://www.philschmid.de/instruction-tune-llama-2

译者: Xu Haoran

扩展说明: 指令微调 Llama 2 - HuggingFace

1. 定义应用场景细节并创建指令的提示词模板

2. 创建指令数据集

3. 使用 `trl` 和`SFTTrainer` 指令微调 Llama 2

Flash Attention (快速注意力)

4. 测试模型、进行推理

Recommend

The mysterious litany of require_dependecy calls

Gemini Advanced Is a Central Part of Google's Subscription Future | WIRED

开挂了！股价暴增466.6%，市值突破12万亿

Meeting C++ 2023: tracks A and B are online!

DIY｜ikbc C87 机械键盘有线改蓝牙小结

The Husqvarna Logo History, Colors, Font, and Meaning

WebXR Sites Are Apple Vision Pro's First Cross-Compatible Destinations

Taylor Swift joins Elon Musk in trying to silence student who tracks celebrity j...

使用Bot-Hosting搭建免费节点：4k视频高速秒开，解锁流媒体、ChatGPT|| 2024年免费节...

Google rebrands Bard to Gemini, launches a paid version based on a more powerful...

About Joyk

扩展说明: 指令微调 Llama 2 - HuggingFace

1. 定义应用场景细节并创建指令的提示词模板

2. 创建指令数据集

3. 使用 trl 和SFTTrainer 指令微调 Llama 2

Flash Attention (快速注意力)

4. 测试模型、进行推理

Recommend

About Joyk

3. 使用 `trl` 和`SFTTrainer` 指令微调 Llama 2