This chapter has no code; it is based on a YouTube video. The notes below were taken from the (English) talk.
The slides walk through the workflow for developing a large language model (LLM), split into three parts: building, training, and finetuning.
Building an LLM
- Start from the "Tokenized text" at the lower left.
- Pass it through the token embedding layer and then the positional embedding layer.
- Then repeat the sequence of dropout, masked multi-head attention, LayerNorm, and feed-forward operations.
- Finally, a linear output layer produces the result (a sketch of this ordering follows below).
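This layer ordering can be captured in a compact PyTorch sketch. It is a minimal illustration assuming a GPT-2-style decoder; the module names (`MiniGPT`, `TransformerBlock`) and default hyperparameters are hypothetical, not code from the talk:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, emb_dim, n_heads, drop_rate=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(emb_dim)
        self.attn = nn.MultiheadAttention(emb_dim, n_heads,
                                          dropout=drop_rate, batch_first=True)
        self.norm2 = nn.LayerNorm(emb_dim)
        self.ff = nn.Sequential(nn.Linear(emb_dim, 4 * emb_dim), nn.GELU(),
                                nn.Linear(4 * emb_dim, emb_dim))

    def forward(self, x):
        # Causal mask: each position may only attend to earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool,
                                     device=x.device), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                # residual around masked multi-head attention
        x = x + self.ff(self.norm2(x))  # residual around the feed-forward layers
        return x

class MiniGPT(nn.Module):
    def __init__(self, vocab_size=50257, ctx_len=1024,
                 emb_dim=768, n_layers=12, n_heads=12, drop_rate=0.1):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, emb_dim)  # token embedding layer
        self.pos_emb = nn.Embedding(ctx_len, emb_dim)     # positional embedding layer
        self.drop = nn.Dropout(drop_rate)
        self.blocks = nn.ModuleList(
            TransformerBlock(emb_dim, n_heads, drop_rate) for _ in range(n_layers))
        self.final_norm = nn.LayerNorm(emb_dim)
        self.out_head = nn.Linear(emb_dim, vocab_size, bias=False)  # linear output layer

    def forward(self, token_ids):  # token_ids: (batch, seq_len) of token IDs
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.drop(self.tok_emb(token_ids) + self.pos_emb(positions))
        for block in self.blocks:  # the Transformer block repeated n_layers times
            x = block(x)
        return self.out_head(self.final_norm(x))  # logits over the vocabulary
```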
Foundation model
- Once built and trained, the LLM becomes the foundation model.
Finetuning
- The foundation model can be finetuned in two ways:
  - With a dataset with class labels, into a classifier.
  - With an instruction dataset, into a personal assistant.
The talk itself is linked below; a summary of its content comes first.
https://www.youtube.com/watch?v=kPGTx4wcm_w
A detailed overview of large language model (LLM) development, covering everything from dataset preparation through model training and finetuning to evaluation and practical applications:
1. Overview of LLM Development
- Development stages
  - Building: prepare the dataset and implement the code for the attention mechanism and model architecture.
  - Pre-training: train the model on a large dataset to form a foundation model, evaluating and saving the model weights along the way.
  - Finetuning: adapt the model to specific tasks (such as classification, question answering, or building a chatbot) using dedicated instruction datasets.
- Current use cases
  - Public or proprietary services: accessing models such as ChatGPT and Gemini through public APIs.
  - Running custom models locally: interacting with open-source models (such as Llama 3) on your own machine.
  - Deploying custom models on external servers: for product development and integration into applications.
2. Dataset Preparation
- Importance
  - Dataset preparation is a key first step in developing an LLM; the data must be sampled so that it is representative and effective for training.
  - How data is fed into the model is critical to how well it learns.
- Training approach
  - LLMs are usually trained with next-token prediction on large datasets containing billions of words.
  - In practice, inputs are batched for efficiency: each batch holds inputs of the same length (typically as tensors), and pre-training input lengths usually range from 256 to 1,024 tokens or more, as in the sketch below.
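As a rough illustration of batching, the sketch below chops one long token stream into equal-length input/target windows for next-token prediction and groups them into batches; the function name and defaults are hypothetical:

```python
import torch

def make_batches(token_ids, ctx_len=256, batch_size=8):
    """Slice a long token stream into (input, target) windows where the
    target is the input shifted one position to the right."""
    ids = torch.tensor(token_ids)
    n_windows = (len(ids) - 1) // ctx_len
    inputs = ids[:n_windows * ctx_len].view(n_windows, ctx_len)
    targets = ids[1:n_windows * ctx_len + 1].view(n_windows, ctx_len)
    # Every batch is a pair of identically shaped (batch_size, ctx_len) tensors.
    for i in range(0, n_windows, batch_size):
        yield inputs[i:i + batch_size], targets[i:i + batch_size]
```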
3. Word Prediction in Language Models
- Process
  - A language model performs one prediction per iteration, generating text word by word.
  - Starting from the given input text, the model predicts the next word from the context; each generated word is fed back into the model as new context for the following prediction, until an end-of-text token is produced or a token limit is reached (see the decoding sketch below).
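A minimal greedy-decoding sketch of this loop. It assumes a batch size of 1, a model that returns logits of shape (batch, seq_len, vocab_size) (the `MiniGPT` sketch above fits), and GPT-2's end-of-text token ID 50256:

```python
import torch

@torch.no_grad()
def generate(model, token_ids, max_new_tokens=50, eos_id=50256):
    """Repeatedly predict the next token and append it to the context,
    stopping at the end-of-text token or the token limit."""
    for _ in range(max_new_tokens):
        logits = model(token_ids)                 # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        token_ids = torch.cat([token_ids, next_id], dim=1)  # feed the output back in
        if next_id.item() == eos_id:              # "end of text" token generated
            break
    return token_ids
```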
- Tokenization
  - Language models operate on tokens rather than words; a token can be part of a word or a punctuation mark.
  - Input text is tokenized, and each token is converted to a token ID for the model's computations; tokenization usually relies on a vocabulary built from the training data (see the example below).
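The tokenization step can be tried directly with the `tiktoken` library, which ships the BPE vocabulary used by GPT-2 (50,257 entries):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("gpt2")      # GPT-2's BPE vocabulary

ids = enc.encode("this is an example")   # text -> token IDs
print(ids)                               # one integer ID per token
print([enc.decode([i]) for i in ids])    # the individual tokens behind the IDs
print(enc.decode(ids))                   # token IDs -> text round trip
```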
4. Trends in Training Dataset Size
- Growth
  - Datasets for training language models keep growing: GPT-3 was trained on roughly 500 billion tokens, while more recent models use datasets exceeding a trillion tokens.
  - Transparency has declined, however: early models documented their data sources in detail, whereas newer models often report only a total token count without naming sources.
- Balancing data and model performance
  - Smaller datasets may let a model retain more capacity for important reasoning and understanding; some models (such as Microsoft's Phi) are built on the idea that leaving out trivial data (e.g., game scores) lets the model focus on more significant information and reason more effectively.
5. Understanding LLM Architecture
- Core architecture
  - The core architecture of LLMs such as GPT-2 and GPT-3 consists of masked multi-head attention modules, feed-forward layers, and positional embedding layers, with a Transformer block that is repeated many times (between 12 and 64, depending on model size).
  - Smaller and larger models differ mainly in how many times the Transformer block is repeated and in the number of heads in the multi-head attention mechanism (analogous to channels in a convolutional neural network).
- Model size differences
  - Within the GPT-2 family, the architectures differ mainly in the number of Transformer block repetitions; parameter counts range from 124 million to 1.5 billion, with embedding dimensions around 768 for the smaller models and up to roughly 4,000 for much larger ones (see the configurations below).
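For reference, the published GPT-2 configurations show how little changes between the sizes apart from depth, width, and head count (these numbers come from the GPT-2 release, not the talk):

```python
# Published GPT-2 family configurations: the architecture is identical;
# only depth (n_layers), width (emb_dim), and head count differ.
GPT2_CONFIGS = {
    "gpt2-small (124M)":  {"emb_dim": 768,  "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)":  {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1.5B)":     {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}
```

Any of these can be unpacked into the earlier `MiniGPT` sketch, e.g. `MiniGPT(**GPT2_CONFIGS["gpt2-xl (1.5B)"])`.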
- Advances in layer normalization and training
  - Modern architectures (such as LLaMA 2) replace traditional layer normalization with root mean square (RMS) normalization, which works well for multi-GPU training (a sketch follows below).
  - Training improvements include relative positional embeddings, which let models handle longer inputs (from 1,024 up to 4,000 tokens); models are typically trained for 1 to 2 epochs, often sampling randomly from large datasets.
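A minimal RMSNorm sketch; unlike LayerNorm it skips mean-centering and, in its basic form, the bias term:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root mean square normalization: scale activations by their RMS
    instead of standardizing them as LayerNorm does."""
    def __init__(self, emb_dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(emb_dim))  # learnable scale only

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```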
6. Controlling Randomness in Language Models
- Importance
  - Controlling randomness in language model training and generation is crucial for getting the intended results; managing it prevents the model from merely reproducing training data instead of generating new, coherent text (see the sampling sketch below).
  - With large datasets, longer training usually produces more proficient text generation; with small datasets, training too long can cause overfitting, so the model needs close monitoring.
- Challenges
  - It is important to know which memorization is desirable: facts such as historical dates should be remembered, while unnecessary memorization of the training set should be avoided to preserve the model's generality.
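At generation time this randomness is commonly controlled with temperature scaling and top-k truncation; the sketch below follows that standard recipe (not code from the talk) and expects a 1-D logits vector for a single next-token prediction:

```python
import torch

def sample_next_token(logits, temperature=0.8, top_k=50):
    """Sample the next token ID from 1-D logits of shape (vocab_size,)."""
    logits = logits / temperature                     # <1 sharpens, >1 flattens
    top_vals, top_idx = torch.topk(logits, top_k)     # keep the k most likely tokens
    probs = torch.softmax(top_vals, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)  # randomness enters here
    return top_idx[choice]                            # map back to a vocabulary ID
```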
7. Pre-training Language Models
- Cost and alternatives
  - Pre-training an LLM demands enormous compute and time, typically weeks on powerful GPUs; for most tasks it is more practical to adapt pretrained weights shared by various institutions (see the loading sketch below).
  - Understanding architectural variations matters because different architectures affect performance on specific tasks; tools and libraries that support a range of model weights make exploring these differences easier.
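As one concrete route, openly shared weights can be loaded through Hugging Face's `transformers` library (shown here with GPT-2; any library that reads published checkpoints works the same way in spirit):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")  # downloads pretrained weights

inputs = tokenizer("Every effort moves you", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```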
8. Finetuning Language Models for Classification
- Adjusting the output layer
  - Finetuning for classification means shrinking the output layer to match the number of classes (e.g., two classes for spam detection); this improves efficiency and can also improve performance.
  - During finetuning, both the loss and the classification accuracy are monitored: the loss drives the optimization, while accuracy evaluates performance on the target classification task.
- How many layers to update
  - Finetuning does not require updating every layer; in many cases updating only the last few layers gives good results. For example, updating just the output layer plus the last two Transformer blocks can match the performance of updating all layers while being faster and more resource-efficient (see the sketch below).
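A sketch of both ideas using the hypothetical `MiniGPT` from earlier: replace the vocabulary-sized output layer with a two-class head, then unfreeze only the new head, the final norm, and the last two Transformer blocks:

```python
import torch.nn as nn

model = MiniGPT()  # the hypothetical model sketched earlier (emb_dim=768)

# Replace the 50,257-way linear output layer with a 2-class head
# (e.g., spam vs. not spam).
model.out_head = nn.Linear(768, 2)

# Freeze everything ...
for param in model.parameters():
    param.requires_grad = False
# ... then unfreeze only the parts to be finetuned.
for module in (model.out_head, model.final_norm,
               model.blocks[-1], model.blocks[-2]):
    for param in module.parameters():
        param.requires_grad = True
```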
9. Instruction Datasets
- Size and examples
  - Instruction datasets teach an LLM to generate responses that follow specific instructions; they typically contain between 50,000 and 100,000 examples.
  - For instance, the Alpaca dataset (about 51,000 examples) was one of the first publicly available instruction datasets, while the LIMA dataset has only 1,000 examples yet achieved strong results, suggesting that data quality can matter more than sheer quantity (the common prompt format is sketched below).
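Instruction examples are usually serialized with a prompt template before training. The sketch below uses the widespread Alpaca-style field layout (exact template wording varies between projects); the sentence itself is a hypothetical stand-in for the passive-voice instruction mentioned in the talk:

```python
# One instruction example in the common Alpaca-style field layout.
example = {
    "instruction": "Rewrite the following sentence in passive voice.",
    "input": "The researchers trained the model.",
    "output": "The model was trained by the researchers.",
}

# Serialize it into a prompt string for instruction finetuning.
prompt = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    f"### Instruction:\n{example['instruction']}\n\n"
    f"### Input:\n{example['input']}\n\n"
    f"### Response:\n{example['output']}"
)
print(prompt)
```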
10. Preference Tuning
- Purpose
  - Preference tuning refines the responses an LLM generates, steering the model toward desired behavior; it usually follows instruction finetuning.
  - The goal is to optimize outputs along specific dimensions, making responses more technical or more user-friendly as required, and guiding the model toward helpfulness or safety.
11. Evaluating Language Models
- MMLU score
  - Definition: the MMLU score, usually between 0 and 100, ranks language models by their performance on multiple-choice questions.
  - Limitations: it only measures performance in multiple-choice settings and cannot fully capture an LLM's broader capabilities.
- Other evaluation tools
  - Beyond MMLU, tools such as AlpacaEval provide comparative measures of conversational performance alongside other benchmarks.
- Comparing models
  - Models are compared by computing pairwise rankings. In some evaluations Gemini Ultra scores slightly better than GPT-4, but GPT-4 remains highly competitive overall; which model is most effective depends on the user's specific needs and use case.
12. Pre-training and Finetuning Strategies
- Pre-training
  - From scratch: expensive and usually unnecessary, mainly relevant for research or for large companies building foundation models.
  - Continued pre-training: continuing to pretrain an existing model is more practical, letting it acquire new knowledge from a smaller dataset; well suited to keeping a language model up to date.
- Finetuning
  - For special-purpose uses (such as spam classification or building a chatbot), finetuning is essential; preference tuning additionally improves a model's safety and helpfulness.
13. Combining Training Techniques in Model Development
- Development process
  - Building a model usually combines pre-training, continued pre-training, and finetuning to achieve specialized capabilities; Meta AI's Code Llama is a good example.
  - Code Llama's development stages:
    - Pre-training on general language data.
    - Continued, code-focused training.
    - Instruction finetuning to better understand and respond to user input, illustrating the iterative nature of model improvement.
14. Development Environment and Resources
- Lightning AI Studio
  - A useful environment for model training, offering the flexibility to switch seamlessly between CPUs and multiple GPUs.
  - The platform provides ready-to-use templates, similar to GitHub repositories, so developers can build and experiment with language models without upfront installation or dependency handling.
Selected notes transcribed from the YouTube video:
Developing an LLM: Building, Training, Finetuning
Overview of Large Language Model Development
"Today I want to talk about developing a large language model, focusing on the three stages: building, training, and fine-tuning."
- The discussion highlights the key processes involved in developing large language models (LLMs), which can be categorized into three main stages: building, training, and fine-tuning. Understanding each stage is vital for effective LLM development.
Current Use Cases for LLMs
"The most popular way to use large language models these days is via public or proprietary services like public APIs."
- Large language models are primarily accessed through public or proprietary services, such as public APIs. Examples include ChatGPT and Gemini, where users can submit queries and receive responses.
- Another common use case involves running custom LLMs locally, thanks to open-source models like Llama 3, enabling users to interact with the models directly on their machines.
- Organizations may also deploy custom LLMs on external servers, allowing for API access, particularly beneficial for product development and integration into applications.
Stages of Developing LLMs
"Today, the talk is about what goes into developing LLMs, particularly the stages involved."
- The development of LLMs can be broken down into three primary stages: building, pre-training, and fine-tuning.
- The building stage involves preparing datasets and implementing essential coding that integrates attention mechanisms and the overall architecture of the model.
- The pre-training stage involves training the model on large datasets to form a foundational model, with mechanisms in place to evaluate and save the model's weights for later use.
- Fine-tuning tailors the model for specific tasks, making it suitable for classification, question answering, or creating chatbots, utilizing specialized instruction datasets.
Preparing Datasets for LLMs
"Understanding how an LLM works requires understanding what it works with, particularly the data set."
- An important initial step in developing an LLM is the preparation of the dataset, which needs sampling to ensure it's representative and effective for training.
- The process of feeding data into the LLM is critical, as it determines how well the model learns. This stage is foundational to understanding the architecture and functions of the LLM.
- Typically, LLMs are trained to predict the next word in a sequence, utilizing a method known as next token prediction. Training is performed on vast datasets, often involving billions of words.
Efficient Training Methods in LLMs
"In practice, it would be inefficient to feed it just one sentence or text at a time."
- Practically, training consists of batching input data to enhance efficiency, similar to methodologies used in general deep learning.
- Each batch must contain inputs of the same length, typically structured as tensors, making it possible to process multiple inputs simultaneously.
- Batching allows for more effective learning and faster training times, with typical input lengths for pre-training spanning from 256 to 1024 tokens or more, depending on the model's capacity.
Word Prediction in Language Models
"Language models typically perform a one-word prediction task per iteration, generating text word by word."
- Language models, such as those used in AI applications, generate text one word at a time. The process begins with a given input, which in this case is some text. The model predicts the next word based on the input context.
- After generating a word, this output is fed back into the language model, allowing it to use the new context for subsequent predictions. For instance, an output may be "this is," which the model would process to predict the next word.
- This methodology continues iteratively until either an "end of text" token is generated or a specific limit on the number of tokens is reached.
The Mechanics of Tokenization
"There's a distinction between words and tokens in language models, as models tokenize input text before processing."
- It's important to understand that language models do not strictly operate on words; they work with tokens. This distinction is crucial as tokens can include parts of words or punctuation.
- When input text is processed, it undergoes tokenization, which breaks the text into smaller components such as individual word tokens or punctuation marks. For instance, the phrase "this is an example" could be split into tokens like "this," "is," "an," and "example."
- Each token is then converted into a token ID, which the model uses for further computation. The tokenization often relies on a vocabulary built from training data.
Trends in Dataset Size for Training Language Models
"Training datasets for language models are steadily increasing in size, and now reach trillions of tokens."
- The trend in training language models shows that datasets have grown significantly over time. For example, the model GPT-3 was trained on approximately 500 billion tokens, while more recent models are using datasets that exceed a trillion tokens.
- Despite the growth in data size, there is a noticeable shift in transparency, with fewer details shared about the datasets used for training. While early models, like GPT-3, detailed the types of data sources they incorporated, newer models often only mention collective token counts without specifying sources.
Balancing Data and Model Performance
"Smaller datasets can allow language models to retain more capacity for important reasoning and understanding."
- Interestingly, there's an emerging perspective that smaller training datasets may enhance model performance by reserving cognitive capacity for crucial reasoning tasks.
- Some models, such as Microsoft's Phi, suggest that not including every piece of data (like trivial game results) could allow models to focus on more significant information that aids in effective reasoning.
- This approach argues for a balance between the amount of data used for training and the quality of the model's output, which might spark future research into optimizing training strategies.
Understanding LLM Architecture
"We need to really see more research to say whether this is an actual thing here with capacity."
- The video discusses the fundamental architecture behind developing large language models (LLMs) like GPT-2 and GPT-3. It emphasizes that while there may be variations, the core architecture remains largely consistent across these models.
- The architecture includes components such as masked multi-head attention modules, feed-forward layers, and positional embedding layers. Notably, there is a Transformer block that is repeated numerous times, generally between 12 and 64 times depending on the model size.
- The discussion notes that the differences between smaller and larger models primarily lie in the repetition of Transformer blocks and the number of heads in the multi-head attention mechanism, reminiscent of channels in convolutional neural networks.
Variations in Model Sizes
"The difference between these architectures is really minor; it's just the number of times you repeat this Transformer block."
- In examining different sizes of the GPT-2 model, the speaker highlights that the architectures differ minimally in terms of the number of times the Transformer block is repeated.
- For example, models can range from 124 million to 1.5 billion parameters, and this variation results from adjusting the depth of the network rather than altering the architecture significantly.
- The speaker also compares the embedding dimensions across models, with smaller models having dimensions around 768, while larger ones reach up to 4,000.
Advances in Layer Norm and Training Techniques
"RMS norm is basically what most modern architectures use."
- The discussion shifts to modern practices where architectures like LLaMA 2 replace traditional layer norm with root mean square (RMS) norm, which is advantageous for multi-GPU training.
- Key enhancements in training involve using relative positional embeddings instead of absolute ones, allowing models to handle larger token inputs, increased from 1,024 to 4,000 tokens.
- Furthermore, the video notes that while models are commonly trained for 1 to 2 epochs, the specific approach to training can vary significantly, often involving random sampling from large datasets due to feasibility concerns.
Controlling Randomness in Language Models
"You can control the amount of randomness so it doesn't regenerate the training data."
- Controlling randomness in the training of language models (LMs) is crucial for achieving desired outcomes. By managing how much randomness is introduced, one can prevent the model from merely reproducing the training data rather than generating new, coherent text.
- An effective strategy when dealing with larger datasets is to allow for longer training, as it typically results in more proficient text generation. However, if the dataset is small, extended training may lead to overfitting, and it is advisable to monitor the model closely to ensure it generates coherent text before stopping.
- A significant challenge in training LMs is understanding what type of memorization is desirable. Certain facts, like historical dates, should be retained while avoiding unnecessary memorization of other parts of the training set to maintain the model's versatility.
Pre-training Language Models
"Pre-training takes a long time; it's usually not necessary if you are interested in adapting an LLM for a certain task."
- Pre-training large language models requires extensive computational resources and time, often involving weeks of training on powerful GPUs. This makes it an impractical step for every task, especially when adapting a pre-trained model is usually sufficient.
- For practical use, many practitioners prefer to leverage pre-trained weights shared by various institutions, enabling them to focus on adapting the model to specific downstream tasks rather than starting from scratch.
- It is important to understand the architecture variations in LMs, as different architectures can affect performance for specific tasks. Tools and libraries that support a range of model weights can facilitate exploring these differences while maintaining code clarity.
Fine-tuning Language Models for Classification
"For text classification, you have to replace the output layer to match your specific classification needs."
- Fine-tuning language models for classification tasks involves adjusting the output layer, often reducing its size to match the specific number of classes needed (e.g., two classes for spam detection).
- This adjustment not only improves efficiency but also enhances model performance by eliminating unnecessary mapping to a larger vocabulary.
- During fine-tuning, it is common to monitor both the loss and classification accuracy. While loss is differentiable and used for optimization, accuracy provides a straightforward measure of performance, enabling clearer evaluation of the model on the target classification task.
Updating Layers During Fine-tuning
"It's not necessary to fine-tune all the layers; only updating the last few layers can yield good results."
- Fine-tuning does not require all layers of the model to be updated; in many cases, focusing on the last few layers can lead to substantial performance gains while significantly reducing training time.
- Practical experiments have shown that updating just the last output layer and two additional transformer blocks can yield performance that is comparable to updating all layers. This approach is typically faster and more resource-efficient.
- This strategy can help businesses implement models more efficiently, emphasizing the practical applications of fine-tuning in various tasks beyond just simple classification, such as optimizing chatbot functionalities.
Instruction Data Set Overview
"The instruction data set looks like, usually the data set sizes range between 50,000 to 100,000 examples."
- The video discusses the input fed into large language models (LLMs) to generate responses based on specific instructions formatted in a structured manner. An example given illustrates an instruction to rewrite a sentence in passive voice.
- The foundational size of instruction data sets is highlighted, noting that they typically vary from 50,000 to 100,000 examples. A notable dataset, Alpaca, comprised around 51,000 examples and was one of the first publicly available instruction datasets.
- There is a mention of another dataset named LIMA, which contained only 1,000 examples but achieved commendable results. This suggests that the quality of input data may have more significant implications than sheer volume.
Preference Tuning Explanation
"Preference tuning is to refine the responses by the LLM, essentially steering the model towards the desired behavior."
- The video introduces preference tuning, emphasizing its role in refining the responses generated by LLMs. Preference tuning typically follows instruction finetuning, although it is suggested that discussing it in detail would require more time.
- This process aims to refine output according to specific features, which can lead to responses that may either be technical or user-friendly, guiding the model to behave in a desired manner, focusing on elements that optimize usefulness or safety in responses.
- The speaker provides personal anecdotes about laptop usage, illustrating how varied user preferences can shape different responses from the model regarding purchasing advice.
Evaluating Language Models
"MMU score usually falls between zero and 100, used to rank LLMs based on their performance in answering multiple-choice questions."
-
The video delves into evaluating language models (LLMs) and the implications of using the MMU score. This score, which ranges from zero to 100, is commonly presented during introductions to new models and indicates the model's effectiveness at answering multiple-choice questions.
-
It draws attention to the limitations of relying solely on MMU scores, as they primarily gauge performance in a narrow context—multiple-choice scenarios—while the broader capabilities of LLMs are not adequately assessed by this metric alone.
-
Additional tools like Alpaca Eval are mentioned, which provide a comparative measure of conversational performance alongside other benchmarks.
Comparing Language Models
"According to the evaluations, both Gemini Ultra and GPT-4 performed well, with GPT-4 being noted as a strong model."
- Evaluations of language models generally involve computing pairwise rankings, which assess the strengths of different models against each other.
- Gemini Ultra is found to be slightly better than GPT-4 in certain tests, but GPT-4 remains a competitive option overall.
- The effectiveness of each model can vary based on specific user needs and use cases.
Pre-Training and Fine-Tuning Strategies
"Pre-training from scratch is expensive and generally unnecessary; it's more effective to continue pre-training with an existing model."
- Pre-training from scratch is a costly and seldom required process, suitable mainly for research or large companies wishing to create foundational models.
- Continued pre-training enables a model to acquire new knowledge on a smaller dataset, making it a practical solution for updating language models.
- Fine-tuning is essential for special use cases, such as spam classification or creating chatbots, while preference tuning enhances model safety and helpfulness.
Application of Training Techniques in Model Development
"The process of building models often combines pre-training, continued pre-training, and fine-tuning to achieve specialized capabilities."
- The development of specialized models, such as Code Llama by Meta AI, illustrates the application of pre-training techniques tailored for specific tasks like coding.
- The model underwent several stages, starting with pre-training on general language, followed by focused training on code, demonstrating the iterative nature of model improvement.
- Instruction fine-tuning further refines the model to better understand and respond to user inputs.
Development Environment and Resources
"Using tools like Lightning AI Studio allows for flexible development across multiple GPUs and facilitates quick starting without installation hassles."
- Lightning AI Studio is a useful environment for training models, offering the flexibility to switch between CPUs and GPUs seamlessly.
- The studio provides ready-to-use templates similar to GitHub repositories, eliminating the need for initial installations and dependencies.
- These resources make it easier for developers to build and experiment with language models without extensive setup.
Citation, as requested by the author:
@book{build-llms-from-scratch-book,
  author    = {Sebastian Raschka},
  title     = {Build a Large Language Model (From Scratch)},
  publisher = {Manning},
  year      = {2023},
  isbn      = {978-1633437166},
  url       = {https://www.manning.com/books/build-a-large-language-model-from-scratch},
  note      = {Work in progress},
  github    = {https://github.com/rasbt/LLMs-from-scratch}
}