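Data Thinking (数据思维) | Fine-Tuning Large Language Models (LLMs)

Article link: https://blog.csdn.net/csdn_xmj/article/details/147409565

This article comes from the WeChat official account “数据思维” (Data Thinking) and is shared for academic purposes only; it will be removed if it infringes any rights.

Original article: Fine-Tuning Large Language Models (LLMs)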

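Large language models (LLMs) have transformed natural language processing (NLP), performing well on tasks such as text generation, translation, summarization, and question answering. However, these models are not always a good fit for a specific domain or task.

Fine-tuning addresses this problem. It customizes a pretrained LLM by training it further on a smaller, task-specific dataset so that it is better suited to specialized applications. This improves the model's performance on the target task while preserving its broad language capabilities.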

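Fine-tuning in large language models (LLMs)
How is fine-tuning performed?
Implementation: fine-tuning an LLM on the DialogSum dataset
Types of fine-tuning methods
Prompt engineering vs. retrieval-augmented generation (RAG) vs. fine-tuning
When to use fine-tuning?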

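Fine-tuning in large language models (LLMs)

Fine-tuning is the process of taking a pretrained model and adapting it to a specific task by training it further on a smaller, domain-specific dataset. It is a form of transfer learning: it refines the model's capabilities and improves its accuracy on specialized tasks without requiring a massive dataset or expensive compute resources.

Fine-tuning allows us to:

Steer the model toward the best possible performance on a specific task.
Ensure that model outputs match the results expected in real-world applications.
Reduce model hallucinations and improve the relevance and reliability of outputs.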

[Figure: Fine-tuning a large language model]

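How is fine-tuning performed?

The general fine-tuning process can be broken down into the following steps:

Select a base model
Choose a pretrained model based on your task and compute budget.
Choose a fine-tuning method
Pick the most suitable approach for your task and dataset (for example, supervised fine-tuning, instruction-based methods, or parameter-efficient fine-tuning (PEFT)).
Prepare the dataset
Structure the data for task-specific training, making sure its format matches what the model expects.
Train
Fine-tune the model with a framework such as TensorFlow or PyTorch, or with a high-level library such as Transformers.
Evaluate and iterate
Test the model, refine it where necessary, and retrain to improve performance.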

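Implementation: fine-tuning an LLM on the DialogSum dataset

Let's fine-tune a model using Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning (PEFT) method. We will use the flan-t5-base model and the DialogSum dataset.

Flan-T5 is an instruction-tuned version of the T5 model released by Google.
DialogSum is a large-scale dialogue summarization dataset consisting of 13,460 dialogues (plus 100 held-out examples for topic generation) with corresponding human-annotated summaries and topics.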

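Step 1: Install the required libraries

The following commands install the libraries needed for this task, including Hugging Face's Transformers, Datasets, and PEFT (Parameter-Efficient Fine-Tuning) libraries, along with Evaluate and Accelerate. Together they handle model loading, training, and fine-tuning.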

!pip install datasets
!pip install transformers
!pip install evaluate
!pip install accelerate -U
!pip install transformers[torch]
!pip install peft

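Step 2: Set up the environment

Configure the compute device, using a GPU if one is available. Import the libraries needed for handling the dataset, loading the model, tokenization, and evaluation.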

import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'

from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer, GenerationConfig
import evaluate
import pandas as pd
import numpy as np

Step 3: Load the dataset

Load the dialogue summarization dataset from Hugging Face. This example uses the "knkarthick/dialogsum" dataset.

huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)

Step 4: Load the pretrained model and tokenizer

Load the pretrained Flan-T5 model (google/flan-t5-base) for sequence-to-sequence learning and initialize its tokenizer.

model_name = "google/flan-t5-base"
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

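Step 5: Check the trainable parameters

Define a function that counts the model's trainable parameters and reports them as a percentage of all parameters.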

def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(base_model))

Output:

trainable model parameters: 247577856
all model parameters: 247577856
percentage of trainable model parameters: 100.00%

Step 6: Run baseline inference

Before fine-tuning, run the pretrained model on one sample from the test set to assess its baseline performance.

i = 20
dialogue = dataset['test'][i]['dialogue']
summary = dataset['test'][i]['summary']

prompt = f"Summarize the following dialogue  {dialogue}  Summary:"

input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = tokenizer.decode(base_model.generate(input_ids, max_new_tokens=200)[0], skip_special_tokens=True)

print(f"Input Prompt : {prompt}")
print("--------------------------------------------------------------------")
print("Human evaluated summary ---->")
print(summary)
print("---------------------------------------------------------------------")
print("Baseline model generated summary : ---->")
print(output)

Output:

Input Prompt : Summarize the following dialogue  #Person1#: What's wrong with you? Why are you scratching so much?
#Person2#: I feel itchy! I can't stand it anymore! I think I may be coming down with something. I feel lightheaded and weak.
#Person1#: Let me have a look. Whoa! Get away from me!
#Person2#: What's wrong?
#Person1#: I think you have chicken pox! You are contagious! Get away! Don't breathe on me!
#Person2#: Maybe it's just a rash or an allergy! We can't be sure until I see a doctor.
#Person1#: Well in the meantime you are a biohazard! I didn't get it when I was a kid and I've heard that you can even die if you get it as an adult!
#Person2#: Are you serious? You always blow things out of proportion. In any case, I think I'll go take an oatmeal bath.  Summary:
--------------------------------------------------------------------
Human evaluated summary ---->
#Person1# thinks #Person2# has chicken pox and warns #Person2# about the possible hazards but #Person2# thinks it will be fine.
---------------------------------------------------------------------
Baseline model generated summary : ---->
Person1 is scratching so much that he can't stand it anymore.

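Step 7: Tokenize the dataset

Tokenize the dataset to prepare it for training. The function produces input and label IDs and truncates or pads them to a fixed length. To keep the demonstration fast, the final filter keeps only every 100th example.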

def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return example

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary'])
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)
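
One detail of the tokenization above is worth knowing: the labels are padded to a fixed length and the padding IDs are left in place, so the loss scores them like real tokens. A common practice with Hugging Face sequence-to-sequence models is to replace padding positions in the labels with -100, which the cross-entropy loss ignores. The sketch below shows the idea on plain Python lists. `mask_pad_labels` is a hypothetical helper, the token IDs are made up, and the pad ID of 0 is assumed (it is T5's usual pad token ID) rather than read from the tokenizer.

```python
def mask_pad_labels(label_ids, pad_token_id=0, ignore_index=-100):
    """Replace padding IDs in each label sequence with ignore_index."""
    return [
        [ignore_index if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Two toy label sequences padded to length 6 (token IDs are made up).
labels = [[571, 19, 8, 1, 0, 0],
          [37, 1, 0, 0, 0, 0]]
masked = mask_pad_labels(labels)
print(masked)
```

In a real pipeline, `tokenizer.pad_token_id` would be the safer source for the pad ID.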

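Step 8: Apply PEFT with a LoRA configuration

PEFT (Parameter-Efficient Fine-Tuning) minimizes training time and resource use by training only a small number of parameters. Here, LoRA adds small trainable low-rank matrices to selected layers while the original weights stay frozen.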

from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)

peft_model_train = get_peft_model(base_model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model_train))

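Output (these counts appear to correspond to a LoRA rank of 32; with r=8 as configured above and PEFT's default T5 target modules, the trainable count would be 884,736):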

trainable model parameters: 3538944
all model parameters: 251116800
percentage of trainable model parameters: 1.41%
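
The trainable-parameter count can be sanity-checked by hand. The sketch below rests on assumptions that the article does not state: that flan-t5-base has a hidden size of 768 with 12 encoder and 12 decoder blocks, and that PEFT's default LoRA targets for T5 are the query and value projections (`q` and `v`). Each adapted 768 × 768 projection then gains two low-rank matrices with 2 × 768 × r parameters in total. The helper name `lora_trainable_params` is illustrative.

```python
def lora_trainable_params(r, d_model=768, enc_layers=12, dec_layers=12):
    """Estimate LoRA trainable parameters for a T5-style model (q and v only)."""
    # Encoder blocks have self-attention; decoder blocks have self- and cross-attention.
    attention_blocks = enc_layers + 2 * dec_layers
    adapted_modules = 2 * attention_blocks          # q and v in every attention block
    params_per_module = r * d_model + d_model * r   # A is r x d_model, B is d_model x r
    return adapted_modules * params_per_module

for rank in (8, 32):
    print(rank, lora_trainable_params(rank))
```

Under these assumptions the formula gives 884,736 trainable parameters for r=8 and 3,538,944 for r=32.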

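Step 9: Define the training arguments

Set up the training configuration, including automatic batch-size selection, the learning rate, and the number of epochs.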

output_dir = "./peft-dialogue-summary-training"

peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=5,
)

Step 10: Train the model

Use the Hugging Face Trainer API to train the PEFT-enabled model.

peft_trainer = Trainer(
    model=peft_model_train,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)

peft_trainer.train()

Output:

TrainOutput(global_step=160, training_loss=3.586883544921875,
metrics={'train_runtime': 150.5997,
'train_samples_per_second': 4.15,
'train_steps_per_second': 1.062,
'total_flos': 434768117760000.0,
'train_loss': 3.586883544921875, 'epoch': 5.0})
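
The fields of this TrainOutput can be cross-checked against each other with a little arithmetic. The sketch below uses only the numbers printed above. The variable names are illustrative, and the inferred batch size is an estimate, since `auto_find_batch_size=True` lets the Trainer choose it at run time.

```python
import math

train_runtime = 150.5997      # seconds, from 'train_runtime'
samples_per_second = 4.15     # from 'train_samples_per_second'
global_step = 160
epochs = 5

# Total examples processed across all epochs, then per epoch.
total_samples = round(train_runtime * samples_per_second)
samples_per_epoch = total_samples // epochs
steps_per_epoch = global_step // epochs

# Batch size consistent with that many optimizer steps per epoch.
batch_size = math.ceil(samples_per_epoch / steps_per_epoch)

print(total_samples, samples_per_epoch, steps_per_epoch, batch_size)
```

The result, roughly 125 training examples per epoch in 32 steps, is in line with the 1-in-100 subsampling applied during tokenization and points to an effective batch size of about 4.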

Step 11: Save the fine-tuned model

Save the trained PEFT model and the tokenizer for later use.

peft_model_path = "./peft-dialogue-summary-checkpoint-local"
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

Output:

('./peft-dialogue-summary-checkpoint-local/tokenizer_config.json',
'./peft-dialogue-summary-checkpoint-local/special_tokens_map.json',
'./peft-dialogue-summary-checkpoint-local/spiece.model',
'./peft-dialogue-summary-checkpoint-local/added_tokens.json',
'./peft-dialogue-summary-checkpoint-local/tokenizer.json')

Step 12: Load and test the fine-tuned model

Load the fine-tuned model and test it on the same input prompt.

from peft import PeftModel

peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
peft_model = PeftModel.from_pretrained(peft_model_base, peft_model_path, is_trainable=False)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(f"Input Prompt : {prompt}")
print("--------------------------------------------------------------------")
print("Human evaluated summary ---->")
print(summary)
print("---------------------------------------------------------------------")
print("Baseline model generated summary : ---->")
print(output)
print("---------------------------------------------------------------------")
print("Peft model generated summary : ---->")
print(peft_model_text_output)
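
A single example gives only an impression of quality. Summaries are usually scored against references with ROUGE; the article imports `evaluate` but does not show this step. The sketch below is a simplified, dependency-free unigram-overlap score in the style of ROUGE-1 F1, not the official implementation, applied to the reference and baseline summaries printed in Step 6. The function name `rouge1_f1` and the lowercase, whitespace-split tokenization are assumptions made for illustration.

```python
from collections import Counter

def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: unigram overlap on lowercased, whitespace-split tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)

reference = ("#Person1# thinks #Person2# has chicken pox and warns #Person2# "
             "about the possible hazards but #Person2# thinks it will be fine.")
baseline = "Person1 is scratching so much that he can't stand it anymore."

print(f"baseline ROUGE-1 F1: {rouge1_f1(reference, baseline):.3f}")
```

For a real evaluation, `evaluate.load("rouge")` provides the standard metric, and it should be computed over the whole test set for both the baseline and the fine-tuned model rather than on one example.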

Complete code

# Step 1: Install Necessary Libraries
!pip install datasets
!pip install transformers
!pip install evaluate
!pip install accelerate -U
!pip install transformers[torch]
!pip install peft

# Step 2: Import Libraries
import torch
from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments, Trainer, GenerationConfig
import pandas as pd
import numpy as np
from peft import LoraConfig, get_peft_model, TaskType, PeftModel

# Step 3: Configure Device
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Step 4: Load Dataset
huggingface_dataset_name = "knkarthick/dialogsum"
dataset = load_dataset(huggingface_dataset_name)

# Step 5: Load Pre-trained Model and Tokenizer
model_name = "google/flan-t5-base"
base_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Step 6: Define Function to Count Trainable Parameters
def print_number_of_trainable_model_parameters(model):
    trainable_model_params = 0
    all_model_params = 0
    for _, param in model.named_parameters():
        all_model_params += param.numel()
        if param.requires_grad:
            trainable_model_params += param.numel()
    return f"trainable model parameters: {trainable_model_params}\nall model parameters: {all_model_params}\npercentage of trainable model parameters: {100 * trainable_model_params / all_model_params:.2f}%"

print(print_number_of_trainable_model_parameters(base_model))

# Step 7: Perform Baseline Inference
i = 20
dialogue = dataset['test'][i]['dialogue']
summary = dataset['test'][i]['summary']

prompt = f"Summarize the following dialogue  {dialogue}  Summary:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = tokenizer.decode(base_model.generate(input_ids, max_new_tokens=200)[0], skip_special_tokens=True)

print(f"Input Prompt : {prompt}")
print("--------------------------------------------------------------------")
print("Human evaluated summary ---->")
print(summary)
print("---------------------------------------------------------------------")
print("Baseline model generated summary : ---->")
print(output)

# Step 8: Tokenize Dataset
def tokenize_function(example):
    start_prompt = 'Summarize the following conversation.\n\n'
    end_prompt = '\n\nSummary: '
    prompt = [start_prompt + dialogue + end_prompt for dialogue in example["dialogue"]]
    example['input_ids'] = tokenizer(prompt, padding="max_length", truncation=True, return_tensors="pt").input_ids
    example['labels'] = tokenizer(example["summary"], padding="max_length", truncation=True, return_tensors="pt").input_ids
    return example

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(['id', 'topic', 'dialogue', 'summary'])
tokenized_datasets = tokenized_datasets.filter(lambda example, index: index % 100 == 0, with_indices=True)

# Step 9: Apply PEFT with LoRA Configuration
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)

peft_model_train = get_peft_model(base_model, lora_config)
print(print_number_of_trainable_model_parameters(peft_model_train))

# Step 10: Define Training Arguments
output_dir = "./peft-dialogue-summary-training"
peft_training_args = TrainingArguments(
    output_dir=output_dir,
    auto_find_batch_size=True,
    learning_rate=1e-3,
    num_train_epochs=5,
)

# Step 11: Train the Model
peft_trainer = Trainer(
    model=peft_model_train,
    args=peft_training_args,
    train_dataset=tokenized_datasets["train"],
)
peft_trainer.train()

# Step 12: Save the Fine-Tuned Model
peft_model_path = "./peft-dialogue-summary-checkpoint-local"
peft_trainer.model.save_pretrained(peft_model_path)
tokenizer.save_pretrained(peft_model_path)

# Step 13: Load and Test Fine-Tuned Model
peft_model_base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
peft_model = PeftModel.from_pretrained(peft_model_base, peft_model_path, is_trainable=False)

peft_model_outputs = peft_model.generate(input_ids=input_ids, generation_config=GenerationConfig(max_new_tokens=200, num_beams=1))
peft_model_text_output = tokenizer.decode(peft_model_outputs[0], skip_special_tokens=True)

print(f"Input Prompt : {prompt}")
print("--------------------------------------------------------------------")
print("Human evaluated summary ---->")
print(summary)
print("---------------------------------------------------------------------")
print("Baseline model generated summary : ---->")
print(output)
print("---------------------------------------------------------------------")
print("Peft model generated summary : ---->")
print(peft_model_text_output)

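Types of fine-tuning methods

Supervised fine-tuning

Supervised fine-tuning further trains a pretrained model on a task-specific dataset of labeled input-output pairs. This teaches the model how to map inputs to outputs for the given dataset.

Process:

Start from a pretrained model.
Prepare a dataset of input-output pairs in the format the model expects.
Adjust the pretrained weights during fine-tuning so that the model adapts to the new task.

Supervised fine-tuning is well suited to tasks for which labeled datasets are available, such as sentiment analysis, text classification, and named entity recognition.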

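Instruction fine-tuning

Instruction fine-tuning augments input-output examples with detailed instructions in a prompt template. This helps the model generalize better to new tasks, especially those expressed as natural-language instructions.

Process:

Start from a pretrained model.
Prepare a dataset of instruction-response pairs.
Train the model on these pairs; the training procedure itself is the same as ordinary neural-network training.

Instruction fine-tuning is commonly used to build chatbots, question-answering systems, and other applications that require natural-language interaction.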
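
As a concrete illustration, an instruction dataset is typically built by wrapping each raw example in a prompt template. The sketch below reuses the instruction wording from the tokenization step of the tutorial above. `build_instruction_example` is a hypothetical helper, and the sample dialogue and summary are made up.

```python
INSTRUCTION = "Summarize the following conversation.\n\n"
RESPONSE_CUE = "\n\nSummary: "

def build_instruction_example(dialogue, summary):
    """Turn a raw (dialogue, summary) pair into an instruction-response pair."""
    return {
        "prompt": INSTRUCTION + dialogue + RESPONSE_CUE,
        "response": summary,
    }

example = build_instruction_example(
    "#Person1#: Are you free at noon?\n#Person2#: Yes, let's have lunch.",
    "#Person1# and #Person2# agree to have lunch at noon.",
)
print(example["prompt"] + example["response"])
```

Using the same template at training and inference time keeps the model's inputs consistent with what it was tuned on.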

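Parameter-efficient fine-tuning (PEFT)

Training a full model is resource-intensive. PEFT methods modify only a subset of the model's parameters, or add a small number of new ones, which significantly reduces the memory and compute needed for training.

PEFT methods:

Selective methods
Freeze most of the model's layers and fine-tune only specific layers.
Reparameterization methods (LoRA)
Reparameterize the model weights with low-rank matrices: freeze the original weights and add small trainable parameters.
For example, a weight matrix of size 512 × 64 has 32,768 parameters to update under full fine-tuning. With LoRA at rank 8, this drops to 8 × (512 + 64) = 4,608 trainable parameters.
Additive methods
Add new layers on the encoder or decoder side of the model and train only those layers for the specific task.
Soft prompting
Train only new tokens added to the model's prompt, keeping all other tokens and weights frozen.

PEFT is useful when working with large models that would otherwise exceed memory limits, and it lowers training cost and resource requirements.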
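
The arithmetic in the LoRA example above can be reproduced directly. The sketch assumes a rank of r = 8, which is the value that yields the 4,608 figure, and the function name is illustrative. A d × k weight matrix stays frozen, and its update is factored into B (d × r) and A (r × k), so only r × (d + k) parameters are trained.

```python
def lora_shapes(d, k, r):
    """Parameter counts for a frozen d x k weight and its LoRA update B @ A."""
    full = d * k          # parameters in the frozen d x k weight matrix
    b_params = d * r      # B has shape d x r
    a_params = r * k      # A has shape r x k
    return full, a_params + b_params

full, lora = lora_shapes(512, 64, 8)
print(full)                  # 32768
print(lora)                  # 4608
print(f"{lora / full:.1%}")  # 14.1%
```

The saving grows with matrix size: for square matrices of width d, full fine-tuning scales with d² while LoRA scales with 2 × d × r.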

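Reinforcement learning from human feedback (RLHF)

RLHF uses reinforcement learning to align the outputs of a fine-tuned model with human preferences. It refines the model's behavior further after the initial fine-tuning stage.

Process:

Prepare the dataset
Generate prompt-completion pairs and have human evaluators rank them according to alignment criteria.
Train a reward model
Build a reward model that scores completions based on the human feedback.
Update the model
Use reinforcement learning, typically the Proximal Policy Optimization (PPO) algorithm, to update the model weights according to the reward model.

RLHF is well suited to tasks that require human-like outputs, such as generating text that meets user expectations or ethical guidelines.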
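
To make the reward-model step more concrete: one commonly used objective, which this article does not specify, is a pairwise logistic loss that pushes the reward of the human-preferred completion above that of the rejected one. The sketch below computes that loss for scalar rewards in plain Python. The function name and the example reward values are illustrative assumptions.

```python
import math

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), written in a numerically stable form."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

print(pairwise_reward_loss(2.0, 0.5))   # preferred completion scored higher: small loss
print(pairwise_reward_loss(0.5, 2.0))   # ranking violated: larger loss
```

In practice the two rewards would come from a neural reward model evaluated on the ranked completions, and the loss would be averaged over many pairs.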

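Prompt engineering vs. retrieval-augmented generation (RAG) vs. fine-tuning

Let's look at the differences between prompt engineering, retrieval-augmented generation (RAG), and fine-tuning.

Criterion	Prompt engineering	Retrieval-augmented generation (RAG)	Fine-tuning
Purpose	Focuses on writing effective prompts that elicit the best possible output for a given task.	Retrieves information relevant to a given prompt from an external database.	Focuses on training and adapting the model for a specific task.
Model	Model weights are not updated; the effort goes into crafting effective prompts.	Model weights are not updated; the effort goes into building context for a given prompt.	Model weights are updated.
Complexity	No technical expertise is required.	Less complex than fine-tuning; it requires only skills related to vector databases and retrieval mechanisms.	Technical expertise is required.
Compute cost	Very low; only the cost of API calls.	Cost-effective compared with fine-tuning.	Depending on model and dataset size, specialized hardware may be needed to train the model.
Knowledge	The model does not learn new data.	New data is supplied to the prompt in the form of context.	The model learns new data.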
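
The last row of the table can be illustrated with a toy example: in RAG, new information reaches the model through the prompt rather than through weight updates. The sketch below uses naive word overlap as a stand-in for a real vector search. The documents, the `retrieve` and `build_rag_prompt` helpers, and the prompt wording are all made up for illustration.

```python
def retrieve(query, documents, top_k=1):
    """Rank documents by the number of lowercase words shared with the query."""
    query_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:top_k]

def build_rag_prompt(query, documents):
    """Prepend the retrieved context to the question; model weights are untouched."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "LoRA freezes the base weights and trains small low-rank matrices.",
    "DialogSum is a dialogue summarization dataset.",
]
print(build_rag_prompt("What does LoRA train?", docs))
```

A production system would replace the word-overlap scoring with embedding similarity over a vector database, but the prompt-assembly step has the same shape.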