Building Your Own Large Language Model (LLM) from Scratch, Step by Step: Study Notes, Chapter 5: Training a GPT-Style Large Language Model

Article link: https://blog.csdn.net/chenchihwen/article/details/145124773

Chapter 5: Pretraining on Unlabeled Data

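This chapter walks through a complete workflow: configuring and initializing the model, training, evaluating, and optimizing it, controlling text generation with several decoding techniques, and finally loading pretrained weights. It is a typical example of using the GPT architecture for a natural language processing task and a practical way to study deep learning and NLP.

Importing the required libraries and modules:
- Several common Python libraries are imported, including matplotlib, numpy, tiktoken, torch, and tensorflow, and the version function from importlib.metadata is used to report their versions.
- A number of functions are defined, including text_to_token_ids, token_ids_to_text, calc_loss_batch, calc_loss_loader, train_model_simple, evaluate_model, generate_and_print_sample, plot_losses, softmax_with_temperature, print_sampled_tokens, generate, assign, and load_weights_into_gpt. They cover data processing, model training, evaluation, text generation, loss computation, and weight loading.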

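Overview

Model configuration and initialization:
- GPT_CONFIG_124M defines the GPT model configuration: vocabulary size, context length, embedding dimension, number of attention heads, number of layers, dropout rate, and so on.
- A GPT model is initialized with the GPTModel class (imported from the previous_chapters module). torch.manual_seed(123) is set for reproducibility, and model.eval() puts the model in evaluation mode so that dropout is disabled during inference.
- The generate_text_simple function (from previous_chapters) is used to generate text, and text_to_token_ids and token_ids_to_text convert between text and token IDs.
- The initially generated text is poor because the model has not been trained yet.
Calculating the text generation loss: cross-entropy and perplexity:
- The cross-entropy loss is computed by passing the input tokens through the model to obtain logits, converting the logits to probabilities, and taking the negative average log-probability assigned to the target tokens. torch.argmax is used to inspect which tokens the model currently predicts.
- PyTorch's torch.nn.functional.cross_entropy function carries out these computation steps internally.
- Perplexity is the exponential of the cross-entropy loss. It expresses more intuitively how uncertain the model is over the vocabulary; the lower it is, the closer the model's predictions are to the actual distribution.
Calculating the training and validation set losses:
- The text file the-verdict.txt is downloaded from GitHub and split into a training set and a validation set.
- create_dataloader_v1 (from previous_chapters) creates the data loaders that turn the text into batches, and calc_loss_batch and calc_loss_loader compute the loss for a single batch and for a data loader.
- The device (GPU or CPU) is selected, and the model and data are moved to that device.
Model training:
- The train_model_simple function implements a simple training run: the training loop, gradient computation, weight updates, evaluation, and printing of generated samples.
- train_model_simple is called with num_epochs set to 10 and the AdamW optimizer. Watching the training and validation losses and the generated text during training shows the model producing meaningless strings at first and more grammatical sentences later, but it eventually overfits because the training set is so small.
- Finally, the plot_losses function plots the training and validation losses.
Decoding strategies to control randomness:
- generate_text_simple is deterministic when generating text, so temperature scaling and top-k sampling are introduced to control the randomness and diversity of the generated text.
- Temperature scaling divides the logits by a number greater than 0, which changes the probability distribution after softmax. A temperature above 1 makes the distribution more uniform; a temperature below 1 makes it sharper.
- Top-k sampling keeps the k most likely tokens, sets the remaining logits to negative infinity, and then applies softmax.
- The generate function is modified to combine temperature scaling and top-k sampling, producing more diverse text.
Saving and loading model weights in PyTorch:
- torch.save saves the model weights (model.state_dict()), and torch.load loads them.
- When an adaptive optimizer such as AdamW is used, the optimizer state can be saved and loaded as well.
Loading pretrained weights from OpenAI:
- The GPT-2 pretrained weights are downloaded from OpenAI and loaded with TensorFlow (because OpenAI used TensorFlow).
- The load_weights_into_gpt function assigns the downloaded weights to a custom GPTModel instance, checking that shapes match and mapping each weight to its counterpart.
- After loading the weights, the generate function is used to generate text and verify that the model was loaded correctly.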

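Key concepts and techniques

Natural language processing: text generation with the GPT architecture, including tokenization, generation, and evaluation.
Deep learning fundamentals: model training, the loss function (cross-entropy), optimizers (such as AdamW), and evaluation metrics (such as perplexity).
Deep learning techniques: layer normalization (LayerNorm), dropout, and residual connections, as well as how to avoid overfitting.
Data processing: converting text into tokens, batching them, and using data loaders for training and validation.
Generation strategies: controlling the randomness and diversity of generated text with temperature scaling and top-k sampling.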

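Code examples explained

Cross-entropy loss example:

# Input and target tensors holding the input token IDs and the token IDs we want the model to produce
inputs = torch.tensor([[16833, 3626, 6100], [40, 1107, 588]])
targets = torch.tensor([[3626, 6100, 345], [1107, 588, 11311]])
# Forward pass through the model to obtain the logits
with torch.no_grad():
    logits = model(inputs)
# Convert the logits to probabilities (for inspection only; the loss below uses the logits)
probas = torch.softmax(logits, dim=-1)
# Compute the cross-entropy loss
logits_flat = logits.flatten(0, 1)
targets_flat = targets.flatten()
loss = torch.nn.functional.cross_entropy(logits_flat, targets_flat)

Here, the inputs are first passed through the model to obtain the logits, and the logits are converted to a probability distribution for inspection. The loss itself is computed by PyTorch's cross_entropy function directly from the logits, which must first be flattened together with the targets to match the input shapes the function expects.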

Training loop example:

def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs, eval_freq, eval_iter, start_context, tokenizer):
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1
    for epoch in range(num_epochs):
        model.train()
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad()
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward()
            optimizer.step()
            tokens_seen += input_batch.numel()
            global_step += 1
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")
        generate_and_print_sample(model, tokenizer, device, start_context)
    return train_losses, val_losses, track_tokens_seen

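In train_model_simple, each epoch puts the model into training mode and iterates over the training data loader, computing the loss, backpropagating the gradients, and updating the weights. Every eval_freq steps it evaluates the training and validation losses, and after each epoch it prints a generated sample.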

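Temperature scaling example:

def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)
# Scaling with different temperatures
temperatures = [1, 0.1, 5]
scaled_probas = [softmax_with_temperature(next_token_logits, T) for T in temperatures]

The softmax_with_temperature function divides the logits by the temperature and then applies softmax. Different temperatures yield different probability distributions, which affects the subsequent token sampling.

Overall, the code demonstrates the complete workflow: model configuration and initialization, training, evaluation, and optimization, controlled text generation, and finally loading pretrained weights.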

In this chapter, we implement the training loop and code for basic model evaluation to pretrain an LLM
At the end of this chapter, we also load openly available pretrained weights from OpenAI into our model

The topics covered in this chapter are shown below

5.1 Evaluating generative text models

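This section begins with a recap of the code used to initialize a GPT model in the previous chapter. It then discusses basic evaluation metrics for LLMs, and finally applies these metrics to a training set and a validation set.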

5.1.1 Using GPT to generate text

We initialize a GPT model using the code from the previous chapter

In [2]:

import torch
from previous_chapters import GPTModel

GPT_CONFIG_124M = {
    "vocab_size": 50257,   # Vocabulary size
    "context_length": 256, # Shortened context length (orig: 1024)
    "emb_dim": 768,        # Embedding dimension
    "n_heads": 12,         # Number of attention heads
    "n_layers": 12,        # Number of layers
    "drop_rate": 0.1,      # Dropout rate
    "qkv_bias": False      # Query-key-value bias
}

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.eval();  # Disable dropout during inference

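We use a dropout rate of 0.1 above, but it is relatively common to train LLMs without dropout nowadays. Modern LLMs also do not use bias vectors in the linear layers for the query, key, and value matrices (unlike earlier GPT models), which is achieved by setting "qkv_bias" to False.

To reduce the computational resources needed for training, we set the context length (context_length) to only 256 tokens, whereas the original 124-million-parameter GPT-2 model used 1024 tokens. This lets more readers run the code examples on a laptop. The context length can be increased to 1024 tokens without any code changes, and later we will load a pretrained model with a context length of 1024.

Next, we use the generate_text_simple function from the previous chapter to generate text. We also define two convenience functions, text_to_token_ids and token_ids_to_text, for converting between token and text representations; they are used throughout this chapter.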
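The two conversion helpers can be sketched as follows. This is a minimal sketch, not necessarily the notebook's exact code: ToyTokenizer is a hypothetical whitespace tokenizer that stands in for the GPT-2 tiktoken encoding so the example runs without downloading a vocabulary, and the only assumption about the tokenizer is that it exposes encode and decode.

```python
import torch


class ToyTokenizer:
    """Hypothetical whitespace tokenizer standing in for the GPT-2 tiktoken encoding."""

    def __init__(self, vocab):
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}
        self.id_to_token = {i: tok for tok, i in self.token_to_id.items()}

    def encode(self, text):
        return [self.token_to_id[tok] for tok in text.split()]

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)


def text_to_token_ids(text, tokenizer):
    encoded = tokenizer.encode(text)
    return torch.tensor(encoded).unsqueeze(0)  # add a batch dimension: (1, num_tokens)


def token_ids_to_text(token_ids, tokenizer):
    flat = token_ids.squeeze(0)  # remove the batch dimension again
    return tokenizer.decode(flat.tolist())


tokenizer = ToyTokenizer(["Every", "effort", "moves", "you"])
token_ids = text_to_token_ids("Every effort moves you", tokenizer)
round_trip = token_ids_to_text(token_ids, tokenizer)
print(token_ids.shape, round_trip)
```

The batch dimension matters because the model expects input of shape (batch_size, num_tokens), even for a single prompt.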

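As we can see above, the model does not produce good text because it has not been trained yet. How do we measure or capture what "good text" is in numeric form so that we can track it during training? The next subsection introduces metrics for calculating a loss for the generated outputs, which we can use to measure training progress. The later chapters on finetuning LLMs will also introduce additional ways to measure model quality.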

5.1.2 Calculating the text generation loss: cross-entropy and perplexity

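Suppose we have an inputs tensor containing the token IDs for 2 training examples (rows). Corresponding to the inputs, the targets contain the desired token IDs that we want the model to generate. Note that the targets are the inputs shifted by 1 position, as explained in chapter 2 when we implemented the data loader.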

In [4]:

inputs = torch.tensor([[16833, 3626, 6100],   # ["every effort moves",
                       [40,    1107, 588]])   #  "I really like"]

targets = torch.tensor([[3626, 6100, 345  ],  # [" effort moves you",
                        [1107,  588, 11311]]) #  " really like chocolate"]

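Feeding the inputs to the model, we obtain the logits vectors for the 2 input examples, each consisting of 3 tokens. Each token is represented by a 50,257-dimensional vector corresponding to the size of the vocabulary. Applying the softmax function, we can turn the logits tensor into a tensor of the same dimension containing probability scores.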

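The figure below, which uses a very small vocabulary for illustration purposes, outlines how we convert the probability scores back into text, as discussed at the end of the previous chapter.

As discussed in the previous chapter, we can apply the argmax function to convert the probability scores into predicted token IDs. The softmax function above produced a 50,257-dimensional vector for each token; the argmax function returns the position of the highest probability score in this vector, which is the predicted token ID for the given token.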
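The softmax and argmax steps can be illustrated with random logits. This is a sketch under stated assumptions: the logits are random stand-ins for a model output, and the vocabulary has only 7 entries instead of 50,257 to keep the tensors small.

```python
import torch

torch.manual_seed(123)
vocab_size = 7  # toy vocabulary size (assumption); the notes use 50,257

# Stand-in for model(inputs): 2 examples, 3 tokens each, one score per vocabulary entry
logits = torch.randn(2, 3, vocab_size)

probas = torch.softmax(logits, dim=-1)        # same shape as logits, each row sums to 1
predicted_ids = torch.argmax(probas, dim=-1)  # one predicted token ID per position

print("Probabilities shape:", probas.shape)
print("Predicted token IDs:\n", predicted_ids)
```

Because softmax preserves the ordering of the scores, applying argmax to the logits directly would give the same token IDs.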

Since we have 2 input examples with 3 tokens each, we obtain 2 by 3 predicted token IDs:

If we decode these tokens, we find that they are quite different from the tokens we want the model to predict, namely the target tokens.

That's because the model wasn't trained yet
To train the model, we need to know how far it is away from the correct predictions (targets)

The token probabilities corresponding to the target indices are as follows:

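We want to maximize all these values, bringing them close to a probability of 1. In mathematical optimization, it is easier to maximize the logarithm of the probability score than the probability score itself; this is out of the scope of this book, but I have recorded a lecture with more details: L8.2 Logistic Regression Loss Function

torch.log: a PyTorch function that computes the natural logarithm (base e) of each element of a tensor.

torch.cat((target_probas_1, target_probas_2)): a PyTorch function that concatenates tensors along a given dimension. Here, the two tensors target_probas_1 and target_probas_2 are joined into a single new tensor.

log_probas: the variable that stores the resulting log-probabilities.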

Next, we compute the average log probability:

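Cross-entropy loss
The goal is to make the average log-probability as large as possible by optimizing the model weights. Because of the logarithm, the largest possible value is 0, and we are currently far away from 0. In deep learning, instead of maximizing the average log-probability, it is standard convention to minimize the negative average log-probability. In our case, instead of maximizing -10.7722 so that it approaches 0, we minimize 10.7722 so that it approaches 0. The negative of -10.7722, that is, 10.7722, is also called the cross-entropy loss in deep learning.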
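The manual computation described above can be sketched on toy data. The probabilities and targets below are made-up values over a 4-token vocabulary (an assumption for illustration), not outputs of the GPT model.

```python
import torch

# Toy softmax outputs: 2 examples x 3 positions x 4 vocabulary entries (made-up values)
probas = torch.tensor([
    [[0.1, 0.6, 0.2, 0.1], [0.3, 0.3, 0.3, 0.1], [0.25, 0.25, 0.25, 0.25]],
    [[0.7, 0.1, 0.1, 0.1], [0.2, 0.5, 0.2, 0.1], [0.1, 0.1, 0.1, 0.7]],
])
targets = torch.tensor([[1, 2, 0], [0, 1, 3]])

# Pick the probability the model assigns to each target token
positions = torch.arange(targets.shape[1])
target_probas_1 = probas[0, positions, targets[0]]
target_probas_2 = probas[1, positions, targets[1]]

# Log-probabilities of all target tokens, then their average
log_probas = torch.log(torch.cat((target_probas_1, target_probas_2)))
avg_log_proba = torch.mean(log_probas)

# The negative average log-probability is the cross-entropy loss
neg_avg_log_proba = -avg_log_proba
print(avg_log_proba, neg_avg_log_proba)
```

The closer the target probabilities get to 1, the closer both values get to 0.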

PyTorch already implements a cross_entropy function that carries out the previous steps

Before we apply the cross_entropy function, let's check the shape of the logits and targets

For the cross_entropy function in PyTorch, we want to flatten these tensors by combining them over the batch dimension:
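The batch dimension is the dimension along which a batch of examples is stacked. Flattening here merges the batch and token dimensions, so the logits go from shape (2, 3, 50257) to (6, 50257) and the targets from (2, 3) to (6,), which is the layout cross_entropy expects.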

Note that the targets are the token IDs, which also represent the index positions in the logits tensors that we want to maximize
The cross_entropy function in PyTorch will automatically take care of applying the softmax and log-probability computation internally over those token indices in the logits that are to be maximized
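A short check can make the flattening and the internal softmax/log steps concrete. This sketch uses random logits and a 10-token vocabulary as assumptions; it compares torch.nn.functional.cross_entropy with a manual log-softmax computation.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(123)
vocab_size = 10  # toy vocabulary size (assumption)
logits = torch.randn(2, 3, vocab_size)          # (batch, tokens, vocab)
targets = torch.tensor([[1, 4, 7], [0, 9, 3]])  # (batch, tokens)

# Merge the batch and token dimensions
logits_flat = logits.flatten(0, 1)  # (6, vocab)
targets_flat = targets.flatten()    # (6,)

loss = F.cross_entropy(logits_flat, targets_flat)

# Manual equivalent: log-softmax, pick the target entries, negate the mean
log_probas = torch.log_softmax(logits_flat, dim=-1)
rows = torch.arange(targets_flat.numel())
manual_loss = -log_probas[rows, targets_flat].mean()

print(logits_flat.shape, targets_flat.shape)
print(loss, manual_loss)
```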

A concept related to the cross-entropy loss is the perplexity of an LLM
The perplexity is simply the exponential of the cross-entropy loss

The perplexity is often considered more interpretable because it can be understood as the effective vocabulary size that the model is uncertain about at each step (in the example above, that'd be 48,725 words or tokens)
In other words, perplexity provides a measure of how well the probability distribution predicted by the model matches the actual distribution of the words in the dataset
Similar to the loss, a lower perplexity indicates that the model predictions are closer to the actual distribution
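The "effective vocabulary size" reading of perplexity can be checked with a worked example. Assume a model that is maximally uncertain and assigns the same probability to each of the 50,257 tokens; its cross-entropy loss is then ln(50257), and the perplexity recovers the vocabulary size. The sketch uses Python's math module; with tensors, torch.exp(loss) plays the same role.

```python
import math

vocab_size = 50257

# A uniform model assigns probability 1/vocab_size to the correct token at every step
uniform_prob = 1.0 / vocab_size
loss = -math.log(uniform_prob)  # cross-entropy loss of the uniform model
perplexity = math.exp(loss)     # exponential of the loss

print(f"loss = {loss:.4f}, perplexity = {perplexity:.1f}")
```

A trained model concentrates probability on fewer candidates, so both the loss and the perplexity drop below these values.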

5.1.3 Calculating the training and validation set losses

We use a relatively small dataset for training the LLM (in fact, only one short story)
The reasons are:
- You can run the code examples in a few minutes on a laptop computer without a suitable GPU
- The training finishes relatively fast (minutes instead of weeks), which is good for educational purposes
- We use a text from the public domain, which can be included in this GitHub repository without violating any usage rights or bloating the repository size
For comparison, Llama 2 7B required 184,320 GPU hours on A100 GPUs to be trained on 2 trillion tokens
- At the time of this writing, the hourly cost of an 8xA100 cloud server at AWS is approximately $30
- So, via a back-of-the-envelope calculation, training this LLM would cost 184,320 / 8 * $30 = $691,200, or roughly $690,000
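The cost estimate above is simple arithmetic and can be reproduced directly. The variable names are illustrative; the figures are the ones quoted in the text.

```python
gpu_hours = 184_320        # A100 GPU hours reported for Llama 2 7B
gpus_per_server = 8        # one 8xA100 cloud server
server_cost_per_hour = 30  # approximate hourly price in USD

server_hours = gpu_hours / gpus_per_server
estimated_cost = server_hours * server_cost_per_hour

print(f"{server_hours:,.0f} server hours -> ${estimated_cost:,.0f}")
```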
Below, we use the same dataset we used in chapter 2

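With 5,145 tokens, the text is very short for training an LLM, but it serves educational purposes, and we will load pretrained weights later. Next, we divide the dataset into a training set and a validation set and use the data loaders from chapter 2 to prepare the batches for LLM training. For visualization purposes, the figure below assumes a max_length of 6, but for the training loader, we set max_length equal to the context length that the LLM supports. For simplicity, the figure below only shows the input tokens. Since we train the LLM to predict the next word in the text, the targets look the same as these inputs, except that they are shifted by one position.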
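The split and the shifted input/target windows can be sketched in plain Python. This is an illustration under assumptions: the 90/10 split ratio, the make_windows helper, and the integer stand-in for token IDs are hypothetical and are not taken from create_dataloader_v1.

```python
# Stand-in for a tokenized text: 20 consecutive token IDs (assumption)
token_ids = list(range(20))

# Hold out the last 10% for validation (the ratio is an assumption)
train_ratio = 0.9
split_idx = int(train_ratio * len(token_ids))
train_ids, val_ids = token_ids[:split_idx], token_ids[split_idx:]


def make_windows(ids, max_length, stride):
    """Hypothetical helper: cut ids into (inputs, targets) pairs shifted by one position."""
    pairs = []
    for start in range(0, len(ids) - max_length, stride):
        inputs = ids[start:start + max_length]
        targets = ids[start + 1:start + max_length + 1]
        pairs.append((inputs, targets))
    return pairs


pairs = make_windows(train_ids, max_length=6, stride=6)
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

Setting the stride equal to max_length gives non-overlapping input windows; a smaller stride would produce overlapping ones.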

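We use a relatively small batch size to reduce the computational resource demand, and because the dataset is very small to begin with. For comparison, Llama 2 7B was trained with a batch size of 1024.

Next, we implement a utility function to calculate the cross-entropy loss of a given batch. In addition, we implement a second utility function to compute the loss for a user-specified number of batches in a data loader.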
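The two utility functions might look like the sketch below. The signatures follow the calls that appear elsewhere in these notes (calc_loss_batch(input_batch, target_batch, model, device) and calc_loss_loader(loader, model, device, num_batches=...)), but the bodies are a plausible reconstruction rather than the notebook's verified code. The toy model and toy loader are hypothetical stand-ins so the example runs on its own.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def calc_loss_batch(input_batch, target_batch, model, device):
    input_batch, target_batch = input_batch.to(device), target_batch.to(device)
    logits = model(input_batch)  # (batch, tokens, vocab)
    return F.cross_entropy(logits.flatten(0, 1), target_batch.flatten())


def calc_loss_loader(data_loader, model, device, num_batches=None):
    if len(data_loader) == 0:
        return float("nan")
    if num_batches is None:
        num_batches = len(data_loader)
    else:
        # Never request more batches than the loader contains
        num_batches = min(num_batches, len(data_loader))
    total_loss = 0.0
    for i, (input_batch, target_batch) in enumerate(data_loader):
        if i >= num_batches:
            break
        total_loss += calc_loss_batch(input_batch, target_batch, model, device).item()
    return total_loss / num_batches


# Hypothetical stand-ins for GPTModel and a real DataLoader
torch.manual_seed(123)
vocab_size = 11
toy_model = nn.Sequential(nn.Embedding(vocab_size, 8), nn.Linear(8, vocab_size))
toy_loader = [
    (torch.randint(0, vocab_size, (2, 4)), torch.randint(0, vocab_size, (2, 4)))
    for _ in range(3)
]
device = torch.device("cpu")

with torch.no_grad():  # no gradients needed for evaluation
    avg_loss = calc_loss_loader(toy_loader, toy_model, device, num_batches=2)
print(f"Average loss over 2 batches: {avg_loss:.3f}")
```

Limiting num_batches keeps evaluation during training cheap, at the cost of a noisier loss estimate.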

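If you have a machine with a CUDA-supported GPU, the LLM will train on the GPU without any changes to the code. Via the device setting, we ensure that the data is loaded onto the same device as the LLM.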
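The device setting can be sketched as follows. The linear layer is a stand-in for the GPT model (an assumption for brevity); the pattern of moving both the model and each batch to the same device is what matters.

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)  # stand-in for GPTModel(GPT_CONFIG_124M)
batch = torch.randn(3, 4).to(device)      # data must live on the same device as the model

output = model(batch)
print("Running on:", device)
```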

5.2 Training an LLM

In this section, we finally implement the code for training the LLM. We focus on a simple training function; if you are interested in augmenting this training function with more advanced techniques, such as learning rate warmup, cosine annealing, and gradient clipping, please refer to Appendix D.

def train_model_simple(model, train_loader, val_loader, optimizer, device, num_epochs,
                       eval_freq, eval_iter, start_context, tokenizer):
    # Initialize lists to track losses and tokens seen
    train_losses, val_losses, track_tokens_seen = [], [], []
    tokens_seen, global_step = 0, -1

    # Main training loop
    for epoch in range(num_epochs):
        model.train()  # Set model to training mode
        
        for input_batch, target_batch in train_loader:
            optimizer.zero_grad() # Reset loss gradients from previous batch iteration
            loss = calc_loss_batch(input_batch, target_batch, model, device)
            loss.backward() # Calculate loss gradients
            optimizer.step() # Update model weights using loss gradients
            tokens_seen += input_batch.numel()
            global_step += 1

            # Optional evaluation step
            if global_step % eval_freq == 0:
                train_loss, val_loss = evaluate_model(
                    model, train_loader, val_loader, device, eval_iter)
                train_losses.append(train_loss)
                val_losses.append(val_loss)
                track_tokens_seen.append(tokens_seen)
                print(f"Ep {epoch+1} (Step {global_step:06d}): "
                      f"Train loss {train_loss:.3f}, Val loss {val_loss:.3f}")

        # Print a sample text after each epoch
        generate_and_print_sample(
            model, tokenizer, device, start_context
        )

    return train_losses, val_losses, track_tokens_seen


def evaluate_model(model, train_loader, val_loader, device, eval_iter):
    model.eval()
    with torch.no_grad():
        train_loss = calc_loss_loader(train_loader, model, device, num_batches=eval_iter)
        val_loss = calc_loss_loader(val_loader, model, device, num_batches=eval_iter)
    model.train()
    return train_loss, val_loss


def generate_and_print_sample(model, tokenizer, device, start_context):
    model.eval()
    context_size = model.pos_emb.weight.shape[0]
    encoded = text_to_token_ids(start_context, tokenizer).to(device)
    with torch.no_grad():
        token_ids = generate_text_simple(
            model=model, idx=encoded,
            max_new_tokens=50, context_size=context_size
        )
    decoded_text = token_ids_to_text(token_ids, tokenizer)
    print(decoded_text.replace("\n", " "))  # Compact print format
    model.train()

Now, let's train the LLM using the training function defined above:

# Note:
# Uncomment the following code to calculate the execution time
# import time
# start_time = time.time()

torch.manual_seed(123)
model = GPTModel(GPT_CONFIG_124M)
model.to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

num_epochs = 10
train_losses, val_losses, tokens_seen = train_model_simple(
    model, train_loader, val_loader, optimizer, device,
    num_epochs=num_epochs, eval_freq=5, eval_iter=5,
    start_context="Every effort moves you", tokenizer=tokenizer
)

# Note:
# Uncomment the following code to show the execution time
# end_time = time.time()
# execution_time_minutes = (end_time - start_time) / 60
# print(f"Training completed in {execution_time_minutes:.2f} minutes.")

The run above was executed in a CPU notebook; training on a GPU is much faster.

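- Looking at the results above, we can see that the model starts out generating incomprehensible strings of words, whereas towards the end it is able to produce more or less grammatically correct sentences.
- However, based on the training and validation set losses, we can see that the model starts overfitting. If we check a few passages it writes towards the end, we find that they are contained in the training set verbatim; the model simply memorizes the training data.
- Later, we will cover decoding strategies that can mitigate this memorization to a certain degree.
- The overfitting here occurs because the training set is very small and we iterate over it many times.
- The LLM training here primarily serves educational purposes; we mainly want to see that the model can learn to produce coherent text.
- Instead of spending weeks or months training this model on vast amounts of expensive hardware, we load pretrained weights later on.

If you are interested in augmenting this training function with more advanced techniques, such as learning rate warmup, cosine annealing, and gradient clipping, please refer to Appendix D. If you are interested in a larger training dataset and a longer training run, see ../03_bonus_pretraining_on_gutenberg.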

5.3 Decoding strategies to control randomness

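Inference is relatively cheap with a relatively small LLM such as the GPT model we trained above, so there is no need to use a GPU for it even if you used a GPU for training. Using the generate_text_simple function (from the previous chapter) that we used earlier inside the simple training function, we can generate new text one word (or token) at a time. As explained in section 5.1.2, the next generated token is the token with the largest probability score among all tokens in the vocabulary.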

5.3.1 Temperature scaling

Previously, we always selected the token with the highest probability as the next token using torch.argmax. To add variety, we can instead sample the next token with torch.multinomial(probas, num_samples=1), which draws from the softmax distribution so that each index is chosen with a chance equal to its probability in the input tensor. Temperature scaling, introduced below, then adjusts how peaked that distribution is before sampling.

For illustration purposes, let's see what happens when we sample the next token 1,000 times using the original softmax probabilities.
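The difference between greedy selection and sampling can be sketched with a toy example. The five-word vocabulary and the logits are made-up values (assumptions for illustration), and the 1,000 draws are taken in a single torch.multinomial call with replacement, which follows the same distribution as 1,000 separate draws with num_samples=1.

```python
import torch

torch.manual_seed(123)

# Hypothetical toy vocabulary and next-token logits
vocab = ["closer", "every", "effort", "forward", "pizza"]
next_token_logits = torch.tensor([4.5, 0.5, -1.0, 6.0, -2.0])
probas = torch.softmax(next_token_logits, dim=0)

# Greedy decoding always returns the same token
greedy_id = torch.argmax(probas).item()
print("Greedy choice:", vocab[greedy_id])

# Sampling picks each token in proportion to its probability
samples = torch.multinomial(probas, num_samples=1000, replacement=True)
counts = torch.bincount(samples, minlength=len(vocab))
for token, count in zip(vocab, counts.tolist()):
    print(f"{count:4d} x {token}")
```

The most likely token still dominates the counts, but less likely tokens are selected some of the time, which is where the added variety comes from.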

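We can control the distribution and selection process via a concept called temperature scaling. "Temperature scaling" is just a fancy way of saying that we divide the logits by a number greater than 0. Temperatures greater than 1 result in more uniformly distributed token probabilities after applying the softmax. Temperatures smaller than 1 result in more confident (sharper or more peaky) distributions after applying the softmax.

We can see that rescaling via temperature 0.1 results in a sharper distribution, approaching torch.argmax, such that the most likely word is almost always selected.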

The rescaled probabilities via temperature 5 are more uniformly distributed:

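Assuming the LLM input is "every effort moves you", the approach above can sometimes produce nonsensical text such as "every effort moves you pizza", which happens 3.2% of the time (32 out of 1000 samples).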
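The effect of the temperature can be checked numerically. This sketch reuses the softmax_with_temperature function from these notes on made-up logits (an assumption for illustration) and prints the probability of the most likely token at each temperature.

```python
import torch


def softmax_with_temperature(logits, temperature):
    scaled_logits = logits / temperature
    return torch.softmax(scaled_logits, dim=0)


# Made-up next-token logits for a five-token vocabulary
next_token_logits = torch.tensor([4.5, 0.5, -1.0, 6.0, -2.0])

scaled = {T: softmax_with_temperature(next_token_logits, T) for T in (1.0, 0.1, 5.0)}
for T, probas in scaled.items():
    print(f"T={T}: max probability = {probas.max().item():.3f}")
```

The ranking of the tokens never changes; only how much probability mass the top token takes from the others.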

5.3.2 Top-k sampling

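Top-k sampling is a common sampling method in natural language processing. When a language model generates text, it computes a probability distribution over the next token and samples from it. Top-k sampling keeps only the k most probable tokens and samples the next token from among them. This avoids low-probability, unusual, or implausible tokens and improves the quality and readability of the generated text. For example, with k = 5, the next token is sampled only from the 5 most probable tokens.

(Please note that the numbers in this figure are truncated to two digits after the decimal point to reduce visual clutter. The values in the Softmax row should add up to 1.0.)

The code creates a tensor new_logits with the same shape as next_token_logits, filled with negative infinity (-torch.inf). It then copies the values of next_token_logits at the indices in top_pos into the corresponding positions of new_logits. More details can be found at the linked reference.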
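The masking step described above can be sketched on toy logits. The six logit values and top_k = 3 are assumptions for illustration; the variable names follow the description in the text.

```python
import torch

# Made-up next-token logits for a six-token vocabulary
next_token_logits = torch.tensor([4.5, 0.5, -1.0, 6.0, -2.0, 5.2])
top_k = 3

# Values and positions of the k largest logits
top_logits, top_pos = torch.topk(next_token_logits, top_k)

# Start from all -inf and copy back only the top-k logits
new_logits = torch.full_like(next_token_logits, float("-inf"))
new_logits[top_pos] = next_token_logits[top_pos]

# Softmax assigns zero probability to the masked positions
topk_probas = torch.softmax(new_logits, dim=0)
print(topk_probas)
```

Because exp(-inf) is 0, the masked tokens can never be sampled, and the remaining k probabilities are renormalized to sum to 1.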

5.3.3 Modifying the text generation function

The previous two subsections introduced temperature sampling and top-k sampling. Let's use these two concepts to modify the generate_text_simple function we used to generate text via the LLM earlier, creating a new generate function.

def generate(model, idx, max_new_tokens, context_size, temperature=0.0, top_k=None, eos_id=None):

    # For-loop is the same as before: Get logits, and only focus on last time step
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -context_size:]
        with torch.no_grad():
            logits = model(idx_cond)
        logits = logits[:, -1, :]

        # New: Filter logits with top_k sampling
        if top_k is not None:
            # Keep only top_k values
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1:]  # (batch_size, 1), so the comparison broadcasts per row
            logits = torch.where(logits < min_val, torch.tensor(float("-inf")).to(logits.device), logits)

        # New: Apply temperature scaling
        if temperature > 0.0:
            logits = logits / temperature

            # Apply softmax to get probabilities
            probs = torch.softmax(logits, dim=-1)  # (batch_size, vocab_size)

            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (batch_size, 1)

        # Otherwise same as before: get idx of the vocab entry with the highest logits value
        else:
            idx_next = torch.argmax(logits, dim=-1, keepdim=True)  # (batch_size, 1)

        # Stop generating early if end-of-sequence token is encountered and eos_id is specified
        if eos_id is not None and (idx_next == eos_id).all():
            break

        # Same as before: append sampled index to the running sequence
        idx = torch.cat((idx, idx_next), dim=1)  # (batch_size, num_tokens+1)

    return idx

5.4 Loading and saving model weights in PyTorch

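In PyTorch, a model's weights are the parameters it learns during training. Loading weights lets you use an already trained model for prediction or further training, and saving weights lets you reload the model later to continue training or make predictions without retraining it from scratch.

For example, you can initialize your own model with pretrained weights to speed up training or improve performance, or save the weights periodically during training so that you can restore an earlier state if something goes wrong.

The recommended way in PyTorch is to save the model weights, the so-called state_dict, by applying the torch.save function to the .state_dict() method.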

In [42]:

torch.save(model.state_dict(), "model.pth")

Then we can load the model weights into a new GPTModel model instance.
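A save-and-reload round trip can be sketched as follows. To keep the example self-contained, a small linear layer stands in for GPTModel and the file is written to a temporary directory; both are assumptions for illustration.

```python
import os
import tempfile

import torch

torch.manual_seed(123)
model = torch.nn.Linear(4, 2)  # stand-in for GPTModel(GPT_CONFIG_124M)

# Save only the weights (the state_dict), not the whole model object
checkpoint_path = os.path.join(tempfile.mkdtemp(), "model.pth")
torch.save(model.state_dict(), checkpoint_path)

# Create a fresh instance with the same architecture and load the weights into it
restored = torch.nn.Linear(4, 2)
restored.load_state_dict(torch.load(checkpoint_path, map_location="cpu"))
restored.eval()  # disable dropout and similar layers for inference
```

The map_location argument lets weights that were saved on a GPU be loaded on a machine without one.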

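It is common to train LLMs with adaptive optimizers such as Adam or AdamW instead of regular stochastic gradient descent (SGD). These adaptive optimizers store additional parameters for each model weight, so it makes sense to save them as well if we plan to continue the pretraining later.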
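Saving the optimizer state alongside the weights can be sketched like this. The linear layer stands in for GPTModel, and the dictionary keys model_state_dict and optimizer_state_dict are a naming choice made for this example rather than a requirement of PyTorch.

```python
import os
import tempfile

import torch

torch.manual_seed(123)
model = torch.nn.Linear(4, 2)  # stand-in for GPTModel
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0004, weight_decay=0.1)

# Take one training step so that the optimizer has per-weight state to save
loss = model(torch.randn(3, 4)).sum()
loss.backward()
optimizer.step()

# Save model weights and optimizer state together in one file
path = os.path.join(tempfile.mkdtemp(), "model_and_optimizer.pth")
torch.save(
    {
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    path,
)

# Restore both into fresh objects before continuing training
checkpoint = torch.load(path, map_location="cpu")
new_model = torch.nn.Linear(4, 2)
new_model.load_state_dict(checkpoint["model_state_dict"])
new_optimizer = torch.optim.AdamW(new_model.parameters(), lr=0.0004, weight_decay=0.1)
new_optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
new_model.train()
```

Without the optimizer state, AdamW would restart its running averages from zero, which can disturb the first steps after resuming.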

5.5 Loading pretrained weights from OpenAI

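Previously, we only trained a small GPT-2 model on a very small short-story book for educational purposes. Interested readers can find a longer pretraining run on the complete Project Gutenberg book corpus in the bonus materials. Fortunately, we do not have to spend tens of thousands of dollars to pretrain the model on a large pretraining corpus; we can load the pretrained weights provided by OpenAI instead. An alternative way to load the weights from the Hugging Face Hub is also available in the bonus materials.

First, some boilerplate code downloads the files from OpenAI and loads the weights into Python. Since OpenAI used TensorFlow, we have to install and use TensorFlow to load the weights; tqdm is a progress bar library. Uncomment and run the next cell to install the required libraries.

Next, we can download the weights of the 124-million-parameter model as follows.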

In [49]:

settings, params = download_and_load_gpt2(model_size="124M", models_dir="gpt2")

File already exists and is up-to-date: gpt2/124M/checkpoint
File already exists and is up-to-date: gpt2/124M/encoder.json
File already exists and is up-to-date: gpt2/124M/hparams.json
File already exists and is up-to-date: gpt2/124M/model.ckpt.data-00000-of-00001
File already exists and is up-to-date: gpt2/124M/model.ckpt.index
File already exists and is up-to-date: gpt2/124M/model.ckpt.meta
File already exists and is up-to-date: gpt2/124M/vocab.bpe

In [50]:

print("Settings:", settings)

Settings: {'n_vocab': 50257, 'n_ctx': 1024, 'n_embd': 768, 'n_head': 12, 'n_layer': 12}

In [51]:

print("Parameter dictionary keys:", params.keys())

Parameter dictionary keys: dict_keys(['blocks', 'b', 'g', 'wpe', 'wte'])

In [52]:

print(params["wte"])
print("Token embedding weight tensor dimensions:", params["wte"].shape)

[[-0.11010301 -0.03926672  0.03310751 ... -0.1363697   0.01506208
   0.04531523]
 [ 0.04034033 -0.04861503  0.04624869 ...  0.08605453  0.00253983
   0.04318958]
 [-0.12746179  0.04793796  0.18410145 ...  0.08991534 -0.12972379
  -0.08785918]
 ...
 [-0.04453601 -0.05483596  0.01225674 ...  0.10435229  0.09783269
  -0.06952604]
 [ 0.1860082   0.01665728  0.04611587 ... -0.09625227  0.07847701
  -0.02245961]
 [ 0.05135201 -0.02768905  0.0499369  ...  0.00704835  0.15519823
   0.12067825]]
Token embedding weight tensor dimensions: (50257, 768)

Alternatively, "355M", "774M", and "1558M" are also supported model_size arguments
The difference between these differently sized models is summarized in the figure below:

Above, we loaded the 124M GPT-2 model weights into Python, however we still need to transfer them into our GPTModel instance
First, we initialize a new GPTModel instance
Note that the original GPT model initialized the linear layers for the query, key, and value matrices in the multi-head attention module with bias vectors, which is not required or recommended; however, to be able to load the weights correctly, we have to enable them in our implementation as well by setting qkv_bias to True
We are also using the 1024 token context length that was used by the original GPT-2 model(s)

In [53]:

# Define model configurations in a dictionary for compactness
model_configs = {
    "gpt2-small (124M)": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

# Copy the base configuration and update with specific model settings
model_name = "gpt2-small (124M)"  # Example model name
NEW_CONFIG = GPT_CONFIG_124M.copy()
NEW_CONFIG.update(model_configs[model_name])
NEW_CONFIG.update({"context_length": 1024, "qkv_bias": True})

gpt = GPTModel(NEW_CONFIG)
gpt.eval();

The next task is to assign the OpenAI weights to the corresponding weight tensors in our GPTModel instance

In [54]:

def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))
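The behavior of assign can be tried out on a small layer. In this sketch, the linear layer and the NumPy arrays are hypothetical stand-ins for a GPTModel weight and a downloaded parameter array.

```python
import numpy as np
import torch


def assign(left, right):
    if left.shape != right.shape:
        raise ValueError(f"Shape mismatch. Left: {left.shape}, Right: {right.shape}")
    return torch.nn.Parameter(torch.tensor(right))


layer = torch.nn.Linear(3, 2)  # its weight has shape (2, 3)

# Matching shapes: the array replaces the layer's weight as a trainable parameter
pretrained_weight = np.ones((2, 3), dtype=np.float32)
layer.weight = assign(layer.weight, pretrained_weight)

# Mismatching shapes: the check fails loudly instead of silently loading wrong weights
try:
    assign(layer.weight, np.ones((3, 2), dtype=np.float32))
    mismatch_detected = False
except ValueError as err:
    mismatch_detected = True
    print(err)
```

The shape check is what catches mistakes such as a forgotten transpose when mapping checkpoint arrays onto model layers.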

In [55]:

import numpy as np

def load_weights_into_gpt(gpt, params):
    gpt.pos_emb.weight = assign(gpt.pos_emb.weight, params['wpe'])
    gpt.tok_emb.weight = assign(gpt.tok_emb.weight, params['wte'])
    
    for b in range(len(params["blocks"])):
        q_w, k_w, v_w = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["w"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.weight = assign(
            gpt.trf_blocks[b].att.W_query.weight, q_w.T)
        gpt.trf_blocks[b].att.W_key.weight = assign(
            gpt.trf_blocks[b].att.W_key.weight, k_w.T)
        gpt.trf_blocks[b].att.W_value.weight = assign(
            gpt.trf_blocks[b].att.W_value.weight, v_w.T)

        q_b, k_b, v_b = np.split(
            (params["blocks"][b]["attn"]["c_attn"])["b"], 3, axis=-1)
        gpt.trf_blocks[b].att.W_query.bias = assign(
            gpt.trf_blocks[b].att.W_query.bias, q_b)
        gpt.trf_blocks[b].att.W_key.bias = assign(
            gpt.trf_blocks[b].att.W_key.bias, k_b)
        gpt.trf_blocks[b].att.W_value.bias = assign(
            gpt.trf_blocks[b].att.W_value.bias, v_b)

        gpt.trf_blocks[b].att.out_proj.weight = assign(
            gpt.trf_blocks[b].att.out_proj.weight, 
            params["blocks"][b]["attn"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].att.out_proj.bias = assign(
            gpt.trf_blocks[b].att.out_proj.bias, 
            params["blocks"][b]["attn"]["c_proj"]["b"])

        gpt.trf_blocks[b].ff.layers[0].weight = assign(
            gpt.trf_blocks[b].ff.layers[0].weight, 
            params["blocks"][b]["mlp"]["c_fc"]["w"].T)
        gpt.trf_blocks[b].ff.layers[0].bias = assign(
            gpt.trf_blocks[b].ff.layers[0].bias, 
            params["blocks"][b]["mlp"]["c_fc"]["b"])
        gpt.trf_blocks[b].ff.layers[2].weight = assign(
            gpt.trf_blocks[b].ff.layers[2].weight, 
            params["blocks"][b]["mlp"]["c_proj"]["w"].T)
        gpt.trf_blocks[b].ff.layers[2].bias = assign(
            gpt.trf_blocks[b].ff.layers[2].bias, 
            params["blocks"][b]["mlp"]["c_proj"]["b"])

        gpt.trf_blocks[b].norm1.scale = assign(
            gpt.trf_blocks[b].norm1.scale, 
            params["blocks"][b]["ln_1"]["g"])
        gpt.trf_blocks[b].norm1.shift = assign(
            gpt.trf_blocks[b].norm1.shift, 
            params["blocks"][b]["ln_1"]["b"])
        gpt.trf_blocks[b].norm2.scale = assign(
            gpt.trf_blocks[b].norm2.scale, 
            params["blocks"][b]["ln_2"]["g"])
        gpt.trf_blocks[b].norm2.shift = assign(
            gpt.trf_blocks[b].norm2.shift, 
            params["blocks"][b]["ln_2"]["b"])

    gpt.final_norm.scale = assign(gpt.final_norm.scale, params["g"])
    gpt.final_norm.shift = assign(gpt.final_norm.shift, params["b"])
    gpt.out_head.weight = assign(gpt.out_head.weight, params["wte"])
    
    
load_weights_into_gpt(gpt, params)
gpt.to(device);
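The query, key, and value handling in load_weights_into_gpt can be illustrated in isolation. The checkpoint stores the three projections in one combined c_attn array of shape (emb_dim, 3 * emb_dim), and torch.nn.Linear keeps its weight as (out_features, in_features), which is why the function splits the array and transposes each part. The sketch below uses a tiny emb_dim of 4 and consecutive numbers as made-up weights.

```python
import numpy as np

emb_dim = 4  # toy size (assumption); GPT-2 small uses 768

# Stand-in for params["blocks"][b]["attn"]["c_attn"]["w"], shape (emb_dim, 3 * emb_dim)
c_attn_w = np.arange(emb_dim * 3 * emb_dim, dtype=np.float32).reshape(emb_dim, 3 * emb_dim)

# Split the combined array into three equal parts along the last axis
q_w, k_w, v_w = np.split(c_attn_w, 3, axis=-1)
print(q_w.shape, k_w.shape, v_w.shape)

# Transpose each part to match the (out_features, in_features) layout of torch.nn.Linear
q_w_for_linear = q_w.T
print(q_w_for_linear.shape)
```

With a square emb_dim x emb_dim block, the shape check in assign cannot detect a missing transpose, so the transpose has to be applied deliberately.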

If the model is loaded correctly, we can use it to generate new text using our previous generate function:

In [56]:

torch.manual_seed(123)

token_ids = generate(
    model=gpt,
    idx=text_to_token_ids("Every effort moves you", tokenizer).to(device),
    max_new_tokens=25,
    context_size=NEW_CONFIG["context_length"],
    top_k=50,
    temperature=1.5
)

print("Output text:\n", token_ids_to_text(token_ids, tokenizer))

Output text:
 Every effort moves you toward finding an ideal new way to practice something!

What makes us want to be on top of that?

We know that we loaded the model weights correctly because the model can generate coherent text; if we made even a small mistake, the model would not be able to do that

For an alternative way to load the weights from the Hugging Face Hub, see ../02_alternative_weight_loading
If you are interested in seeing how the GPT architecture compares to the Llama architecture (a popular LLM developed by Meta AI), see the bonus content at ../07_gpt_to_llama

Summary and takeaways

See the ./gpt_train.py script, a self-contained script for training
The ./gpt_generate.py script loads pretrained weights from OpenAI and generates text based on a prompt
You can find the exercise solutions in ./exercise-solutions.ipynb