Building a 2 Billion Parameter LLM from Scratch
We will train a 2-billion-parameter LLM from scratch on the Pile dataset. The result is an LLM that outputs correct grammar and punctuation, and whose shorter stretches of text make sense, even if the response as a whole does not.
Previously, I wrote a Medium article about building a 2.3-million-parameter LLM on the Tiny Shakespeare dataset, but its output made no sense. Here is a sample:
# 2.3 Million Parameter LLM Output
ZELBETH:
Sey solmenter! tis tonguerered if
Vurint as steolated have loven OID the queend refore
Are been, good plmp:
Proforne, wiftes swleen, was no blunderesd a a quain beath!
Tybell is my gateer stalk smend as be matious dazest
I wondered: what if I made the Transformer architecture smaller and simpler, and the training data more diverse? How large a model could a single person with an almost end-of-life GPU build whose parameters produce correct grammar and generate meaningful text?
Here is the output of the model trained by following this post:
I found that 13+ million parameters are already enough for grammar and punctuation to start working, which is encouraging. It means we can take the model we trained and fine-tune it further on a very specific dataset for a narrower task. We could end up with a model of under 1 billion, or even around 500 million, parameters that is perfect for a particular use case, especially for running it securely on private data.
I recommend first training a 13+ million parameter model with the scripts in my GitHub repository. You will get results within a day instead of waiting much longer, which also helps if your local GPU is not powerful enough to train a billion-parameter model.
1. GitHub Code Overview
All the code is available in my GitHub repository:
The codebase is organized as follows:
train-llm-from-scratch/
├── src/
│ ├── models/
│ │ ├── mlp.py # Definition of the Multi-Layer Perceptron (MLP) module
│ │ ├── attention.py # Definitions for attention mechanisms (single-head, multi-head)
│ │ ├── transformer_block.py # Definition of a single Transformer block
│ │ ├── transformer.py # Definition of the main Transformer model
├── config/
│ └── config.py # Contains default configurations (model parameters, file paths, etc.)
├── data_loader/
│ └── data_loader.py # Contains functions for creating data loaders/iterators
├── scripts/
│ ├── train_transformer.py # Script for training the Transformer model
│ ├── data_download.py # Script for downloading the dataset
│ ├── data_preprocess.py # Script for preprocessing the downloaded data
│ ├── generate_text.py # Script for generating text using a trained model
├── data/ # Directory to store the dataset
│ ├── train/ # Contains training data
│ └── val/ # Contains validation data
├── models/ # Directory where trained models are saved
- The scripts/ directory contains scripts for downloading the dataset, preprocessing the data, training the model, and generating text with a trained model.
- The src/models/ directory contains the implementations of the key components: the Transformer model, the multi-layer perceptron (MLP), the attention mechanisms, and the Transformer block.
- The config/ directory contains the configuration file that specifies the project's default parameters.
- The data_loader/ directory provides the functions for creating data loaders and iterators.
2. Prerequisites and Training Time
Make sure you have a basic understanding of object-oriented programming (OOP) and neural networks (NN). Familiarity with PyTorch also helps with the coding.
You need a GPU to train your model. A Colab or Kaggle T4 can handle the 13+ million parameter model, but it cannot train the billion-parameter model.
3. Installing Modules
Make sure Git is installed in your environment. First, clone the repository:
git clone https://github.com/FareedKhan-dev/train-llm-from-scratch.git
cd train-llm-from-scratch
Then install the required dependencies:
pip install -r requirements.txt
4. Importing Libraries
Let's import the required libraries that we will use throughout this post:
# PyTorch for deep learning functions and tensors
import torch
import torch.nn as nn
import torch.nn.functional as F
# Numerical operations and arrays handling
import numpy as np
# Handling HDF5 files
import h5py
# Operating system and file management
import os
# Command-line argument parsing
import argparse
# HTTP requests and interactions
import requests
# Progress bar for loops
from tqdm import tqdm
# JSON handling
import json
# Zstandard compression library
import zstandard as zstd
# Tokenization library for large language models
import tiktoken
# Math operations (used for advanced math functions)
import math
5. Preparing the Training Data
Our training dataset needs to be diverse and contain information from different domains, and The Pile is the right choice for that. Although it is 825 GB in size, we will only use a small slice of it, roughly 5–10%. Let's download the dataset first and see how it is structured. I will download the version available on HuggingFace.
# Download validation dataset
!wget https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/val.jsonl.zst
# Download the first part of the training dataset
!wget https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/00.jsonl.zst
# Download the second part of the training dataset
!wget https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/01.jsonl.zst
# Download the third part of the training dataset
!wget https://huggingface.co/datasets/monology/pile-uncopyrighted/resolve/main/train/02.jsonl.zst
The download takes a while, but you can also limit the training dataset to a single file, 00.jsonl.zst, instead of three. The data is already split into train/val/test. Once it is done, make sure the files end up in their respective directories.
import os
import shutil
import glob

# Define directory structure
train_dir = "data/train"
val_dir = "data/val"

# Create directories if they don't exist
os.makedirs(train_dir, exist_ok=True)
os.makedirs(val_dir, exist_ok=True)

# Move all downloaded files (e.g., val.jsonl.zst, 00.jsonl.zst, 01.jsonl.zst, ...)
files = glob.glob("*.jsonl.zst")
for file in files:
    if file.startswith("val"):
        # Move validation file
        dest = os.path.join(val_dir, file)
    else:
        # Move training file
        dest = os.path.join(train_dir, file)
    shutil.move(file, dest)
Our dataset is in the .jsonl.zst format, a compressed format commonly used for storing large datasets. It combines JSON Lines (.jsonl), where each line is a valid JSON object, with Zstandard (.zst) compression. Let's read a sample from one of the downloaded files and see what it looks like.
in_file = "data/val/val.jsonl.zst" # Path to our validation file
with zstd.open(in_file, 'r') as in_f:
for i, line in tqdm(enumerate(in_f)): # Read first 5 lines
data = json.loads(line)
print(f"Line {i}: {data}") # Print the raw data for inspection
if i == 2:
break
#### OUTPUT ####
Line: 0
{
"text": "Effect of sleep quality ... epilepsy.",
"meta": {
"pile_set_name": "PubMed Abstracts"
}
}
Line: 1
{
"text": "LLMops a new GitHub Repository ...",
"meta": {
"pile_set_name": "Github"
}
}
Now we need to encode (tokenize) the dataset. Our goal is an LLM that can at least output correct words, so we will use an existing tokenizer: OpenAI's open-source tiktoken library, with the r50k_base encoding used by the GPT-3 family of models.
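Before wrapping the tokenizer in a processing function, here is a quick, optional check of what r50k_base produces; the ids in the comments are what this encoding returns for the example string, and <|endoftext|> maps to id 50256:

# Quick check of the r50k_base encoding (optional)
enc = tiktoken.get_encoding('r50k_base')
ids = enc.encode("Hello world<|endoftext|>", allowed_special={'<|endoftext|>'})
print(ids)              # [15496, 995, 50256] -- 50256 is the <|endoftext|> token
print(enc.decode(ids))  # "Hello world<|endoftext|>" -- decoding round-trips the text
print(enc.n_vocab)      # 50257 tokens in the vocabulary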
Since we have to tokenize both the training and the validation dataset, we will wrap this in a function to avoid repetition.
def process_files(input_dir, output_file):
    """
    Process all .zst files in the specified input directory and save encoded tokens to an HDF5 file.

    Args:
        input_dir (str): Directory containing input .zst files.
        output_file (str): Path to the output HDF5 file.
    """
    with h5py.File(output_file, 'w') as out_f:
        # Create an expandable dataset named 'tokens' in the HDF5 file
        dataset = out_f.create_dataset('tokens', (0,), maxshape=(None,), dtype='i')
        start_index = 0

        # Iterate through all .zst files in the input directory
        for filename in sorted(os.listdir(input_dir)):
            if filename.endswith(".jsonl.zst"):
                in_file = os.path.join(input_dir, filename)
                print(f"Processing: {in_file}")

                # Open the .zst file for reading
                with zstd.open(in_file, 'r') as in_f:
                    # Iterate through each line in the compressed file
                    for line in tqdm(in_f, desc=f"Processing {filename}"):
                        # Load the line as JSON
                        data = json.loads(line)

                        # Append the end-of-text token to the text and encode it
                        text = data['text'] + "<|endoftext|>"
                        encoded = enc.encode(text, allowed_special={'<|endoftext|>'})
                        encoded_len = len(encoded)

                        # Calculate the end index for the new tokens
                        end_index = start_index + encoded_len

                        # Expand the dataset size and store the encoded tokens
                        dataset.resize(dataset.shape[0] + encoded_len, axis=0)
                        dataset[start_index:end_index] = encoded

                        # Update the start index for the next batch of tokens
                        start_index = end_index
There are two important points about this function:
- We store the tokenized data in an HDF5 file, which gives us the flexibility to access the data quickly while training the model.
- We append the <|endoftext|> token to mark the end of each text sequence, signaling to the model that it has reached the end of a meaningful context, which helps it generate coherent output.
Now we can simply encode our training and validation datasets with:
# Define tokenized data output directories
out_train_file = "data/train/pile_train.h5"
out_val_file = "data/val/pile_dev.h5"
# Load the tokenizer (r50k_base, the encoding used by GPT-2/GPT-3 models)
enc = tiktoken.get_encoding('r50k_base')
# Process training data
process_files(train_dir, out_train_file)
# Process validation data
process_files(val_dir, out_val_file)
Let's look at a sample of the tokenized data:
with h5py.File(out_val_file, 'r') as file:
    # Access the 'tokens' dataset
    tokens_dataset = file['tokens']

    # Print the dtype of the dataset
    print(f"Dtype of 'tokens' dataset: {tokens_dataset.dtype}")

    # Load and print the first few elements of the dataset
    print("First few elements of the 'tokens' dataset:")
    print(tokens_dataset[:10])  # First 10 tokens
#### OUTPUT ####
Dtype of 'tokens' dataset: int32
First few elements of the 'tokens' dataset:
[ 2725 6557 83 23105 157 119 229 77 5846 2429]
Our dataset is now ready for training. Next, we will code the Transformer architecture and cover the relevant theory as we go.
6. Transformer Overview
Let's take a quick look at how the Transformer architecture processes and understands text. It works by breaking text into small pieces called tokens and predicting the next token in the sequence. A Transformer consists of many layers, called Transformer blocks, stacked on top of each other, with a final layer at the end that makes the prediction.
Each Transformer block has two main components:
- Self-attention heads: they determine which parts of the input matter most to the model. For example, when processing a sentence, the attention heads can highlight relationships between words, such as how a pronoun relates to the noun it refers to.
- MLP (multi-layer perceptron): a simple feed-forward neural network. It takes the information emphasized by the attention heads and processes it further. The MLP has an input layer that receives the data from the attention heads, a hidden layer that adds processing capacity, and an output layer that passes the result to the next Transformer block.
Together, the attention heads act as the "what to think about" part, while the MLP is the "how to think about it" part. Stacking many Transformer blocks lets the model understand complex patterns and relationships in text, although that is never guaranteed.
Instead of the diagram from the original paper, let's visualize a simpler architecture diagram that matches what we are going to code.
Here is the architecture flow we will implement (a rough shape trace follows the list):
- Input tokens are converted to embeddings and combined with positional information.
- The model has 64 identical Transformer blocks that process the data sequentially.
- Each block first runs multi-head attention to look at relationships between tokens.
- Each block then processes the data through an MLP, which expands and then compresses the data.
- Every step uses residual connections (shortcuts) to help information flow.
- Layer normalization is used throughout to stabilize training.
- The attention mechanism computes which tokens should attend to each other.
- The MLP expands the data to 4x its size, applies ReLU, and compresses it back.
- The model uses 16 attention heads to capture different types of relationships.
- The final layer converts the processed data into predictions over the vocabulary.
- The model generates text by repeatedly predicting the next most likely token.
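To make this flow concrete, here is a rough shape trace for a single forward pass; the batch size and sequence length are illustrative, while the hidden sizes match the billion-parameter configuration defined later in this post:

# Rough shape trace (illustrative batch/sequence sizes)
B, T = 32, 512                            # batch size, sequence length
n_embed, n_head, n_blocks = 2048, 16, 64
# tokens:      (B, T)            -- integer token ids
# embeddings:  (B, T, n_embed)   -- token + position embeddings
# each of the 64 blocks:
#   attention:  16 heads of size n_embed // n_head = 128, concatenated back to (B, T, n_embed)
#   MLP:        (B, T, n_embed) -> (B, T, 4 * n_embed) -> (B, T, n_embed)
# final LayerNorm + linear head: (B, T, n_embed) -> (B, T, vocab_size) logits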
7. Multi-Layer Perceptron (MLP)
The MLP is a fundamental building block of the Transformer's feed-forward network. Its role is to introduce non-linearity and learn complex relationships within the embedded representations. When defining the MLP module, an important parameter is n_embed, which defines the dimensionality of the input embedding.
The MLP typically consists of a hidden linear layer that expands the input dimension by some factor (usually 4, which is what we will use), followed by a non-linear activation, usually ReLU. This structure allows the network to learn richer features. Finally, a projection linear layer maps the expanded representation back to the original embedding dimension. This sequence of transformations lets the MLP refine the representations learned by the attention mechanism.
# --- MLP (Multi-Layer Perceptron) Class ---
class MLP(nn.Module):
    """
    A simple Multi-Layer Perceptron with one hidden layer.

    This module is used within the Transformer block for feed-forward processing.
    It expands the input embedding size, applies a ReLU activation, and then projects it back
    to the original embedding size.
    """
    def __init__(self, n_embed):
        super().__init__()
        self.hidden = nn.Linear(n_embed, 4 * n_embed)  # Linear layer to expand embedding size
        self.relu = nn.ReLU()                          # ReLU activation function
        self.proj = nn.Linear(4 * n_embed, n_embed)    # Linear layer to project back to original size

    def forward(self, x):
        """
        Forward pass through the MLP.

        Args:
            x (torch.Tensor): Input tensor of shape (B, T, C), where B is batch size,
                              T is sequence length, and C is embedding size.

        Returns:
            torch.Tensor: Output tensor of the same shape as the input.
        """
        x = self.forward_embedding(x)
        x = self.project_embedding(x)
        return x

    def forward_embedding(self, x):
        """
        Applies the hidden linear layer followed by ReLU activation.

        Args:
            x (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: Output after the hidden layer and ReLU.
        """
        x = self.relu(self.hidden(x))
        return x

    def project_embedding(self, x):
        """
        Applies the projection linear layer.

        Args:
            x (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: Output after the projection layer.
        """
        x = self.proj(x)
        return x
We have just written the MLP, whose init method sets up a hidden linear layer that expands the input embedding size (n_embed) and a projection layer that shrinks it back, with a ReLU activation applied after the hidden layer. The forward method defines the data flow through these layers: forward_embedding applies the hidden layer and the ReLU, and project_embedding applies the projection layer.
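As a quick sanity check (with a small, hypothetical embedding size), the MLP preserves the input shape:

mlp = MLP(n_embed=128)
x = torch.randn(2, 16, 128)   # (batch, sequence, embedding)
print(mlp(x).shape)           # torch.Size([2, 16, 128]) -- same shape as the input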
8. Single-Head Attention
The attention head is the core of our model. Its purpose is to focus on the relevant parts of the input sequence. When defining the Head module, the important parameters are head_size, n_embed, and context_length. head_size determines the dimensionality of the key, query, and value projections, which affects the representational capacity of the attention mechanism.
The input embedding dimension n_embed defines the input size of these projection layers. context_length is used to build the causal mask, which ensures the model attends only to preceding tokens.
Inside Head, the linear layers (nn.Linear) for key, query, and value are initialized without bias. A lower-triangular matrix (tril) of size context_length x context_length is registered as a buffer to implement causal masking, preventing the attention mechanism from attending to future tokens.
# --- Attention Head Class ---
class Head(nn.Module):
    """
    A single attention head.

    This module calculates attention scores and applies them to the values.
    It includes key, query, and value projections, and uses causal masking
    to prevent attending to future tokens.
    """
    def __init__(self, head_size, n_embed, context_length):
        super().__init__()
        self.key = nn.Linear(n_embed, head_size, bias=False)    # Key projection
        self.query = nn.Linear(n_embed, head_size, bias=False)  # Query projection
        self.value = nn.Linear(n_embed, head_size, bias=False)  # Value projection
        # Lower triangular matrix for causal masking
        self.register_buffer('tril', torch.tril(torch.ones(context_length, context_length)))

    def forward(self, x):
        """
        Forward pass through the attention head.

        Args:
            x (torch.Tensor): Input tensor of shape (B, T, C).

        Returns:
            torch.Tensor: Output tensor after applying attention.
        """
        B, T, C = x.shape
        k = self.key(x)    # (B, T, head_size)
        q = self.query(x)  # (B, T, head_size)
        scale_factor = 1 / math.sqrt(C)
        # Calculate attention weights: (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        attn_weights = q @ k.transpose(-2, -1) * scale_factor
        # Apply causal masking
        attn_weights = attn_weights.masked_fill(self.tril[:T, :T] == 0, float('-inf'))
        attn_weights = F.softmax(attn_weights, dim=-1)
        v = self.value(x)  # (B, T, head_size)
        # Apply attention weights to values
        out = attn_weights @ v  # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out
The init method of the attention head initializes the linear layers for the key, query, and value projections, each mapping the input embedding (n_embed) down to head_size, and registers a lower-triangular matrix based on context_length for causal masking. The forward method computes attention weights as the scaled dot product of queries and keys, applies the causal mask, normalizes the weights with softmax, and takes a weighted sum of the values to produce the attention output.
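A quick sanity check with hypothetical sizes shows that a single head maps the embedding dimension down to head_size:

head = Head(head_size=32, n_embed=128, context_length=64)
x = torch.randn(2, 16, 128)
print(head(x).shape)          # torch.Size([2, 16, 32])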
9. Multi-Head Attention
To capture different kinds of relationships in the input sequence, we use multi-head attention. The MultiHeadAttention module manages multiple independent attention heads running in parallel.
The key parameter here is n_head, which sets the number of parallel attention heads. The input embedding dimension (n_embed) and context_length are also needed to instantiate the individual heads. Each head processes the input independently, projecting it into a lower-dimensional subspace of size n_embed // n_head. With multiple heads, the model can attend to different aspects of the input at the same time.
# --- Multi-Head Attention Class ---
class MultiHeadAttention(nn.Module):
    """
    Multi-Head Attention module.

    This module combines multiple attention heads in parallel. The outputs of each head
    are concatenated to form the final output.
    """
    def __init__(self, n_head, n_embed, context_length):
        super().__init__()
        self.heads = nn.ModuleList([Head(n_embed // n_head, n_embed, context_length) for _ in range(n_head)])

    def forward(self, x):
        """
        Forward pass through the multi-head attention.

        Args:
            x (torch.Tensor): Input tensor of shape (B, T, C).

        Returns:
            torch.Tensor: Output tensor after concatenating the outputs of all heads.
        """
        # Concatenate the output of each head along the last dimension (C)
        x = torch.cat([h(x) for h in self.heads], dim=-1)
        return x
We have now defined the MultiHeadAttention class, which combines multiple attention heads. Its init method builds a list of n_head Head instances, each with a head_size of n_embed // n_head, and its forward method applies each head to the input x and concatenates their outputs along the last dimension, merging the information each head has learned.
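A quick check with hypothetical sizes: each head produces a head_size-dimensional output, and concatenating the four heads restores the full embedding dimension:

mha = MultiHeadAttention(n_head=4, n_embed=128, context_length=64)
x = torch.randn(2, 16, 128)
print(mha.heads[0](x).shape)  # torch.Size([2, 16, 32]) -- one head
print(mha(x).shape)           # torch.Size([2, 16, 128]) -- all heads concatenated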
10. Transformer Block
To build a billion-parameter model we definitely need a deep architecture, so we code a single Transformer block and then stack many of them. The key parameters of a block are n_head, n_embed, and context_length. Each block contains a multi-head attention layer and a feed-forward network (MLP), with layer normalization applied before each sub-layer and a residual connection around each of them.
Layer normalization, parameterized by the embedding dimension n_embed, helps stabilize training. As before, the multi-head attention takes n_head, n_embed, and context_length, and the MLP also uses the embedding dimension n_embed. These components work together to process the input and learn complex patterns.
# --- Transformer Block Class ---
class Block(nn.Module):
    """
    A single Transformer block.

    This block consists of a multi-head attention layer followed by an MLP,
    with layer normalization and residual connections.
    """
    def __init__(self, n_head, n_embed, context_length):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embed)
        self.attn = MultiHeadAttention(n_head, n_embed, context_length)
        self.ln2 = nn.LayerNorm(n_embed)
        self.mlp = MLP(n_embed)

    def forward(self, x):
        """
        Forward pass through the Transformer block.

        Args:
            x (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: Output tensor after the block.
        """
        # Apply multi-head attention with residual connection
        x = x + self.attn(self.ln1(x))
        # Apply MLP with residual connection
        x = x + self.mlp(self.ln2(x))
        return x

    def forward_embedding(self, x):
        """
        Forward pass focusing on the embedding and attention parts.

        Args:
            x (torch.Tensor): Input tensor.

        Returns:
            tuple: A tuple containing the output after the MLP hidden layer and the residual.
        """
        res = x + self.attn(self.ln1(x))
        x = self.mlp.forward_embedding(self.ln2(res))
        return x, res
Our Block class represents a single Transformer block. Its init method initializes the layer-normalization layers (ln1, ln2), a MultiHeadAttention module, and an MLP module, all parameterized by n_head, n_embed, and context_length.
The forward method implements the block's forward pass: layer normalization followed by multi-head attention with a residual connection, then another layer normalization and the MLP, again with a residual connection. The forward_embedding method offers an alternative forward pass that stops at the attention output and the MLP's hidden activation.
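A quick check with hypothetical sizes: the residual connections keep the block's output shape equal to its input, while forward_embedding returns the 4x-wide MLP hidden activation together with the post-attention residual:

block = Block(n_head=4, n_embed=128, context_length=64)
x = torch.randn(2, 16, 128)
print(block(x).shape)                  # torch.Size([2, 16, 128])
hidden, res = block.forward_embedding(x)
print(hidden.shape, res.shape)         # torch.Size([2, 16, 512]) torch.Size([2, 16, 128])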
11. The Final Model
So far we have coded the individual components of the Transformer. Next, we combine the token and position embeddings with a stack of Transformer blocks to perform sequence modeling. To do this we need a few key parameters: n_head, n_embed, context_length, vocab_size, and N_BLOCKS.
vocab_size determines the size of the token embedding layer, which maps each token to a dense vector of size n_embed. context_length matters for the position embedding layer, which encodes the position of each token in the input sequence, also with dimension n_embed. The number of attention heads (n_head) and the number of blocks (N_BLOCKS) determine the depth and capacity of the network.
Together these parameters define the architecture and capacity of the Transformer model, so let's code it.
# --- Transformer Model Class ---
class Transformer(nn.Module):
    """
    The main Transformer model.

    This class combines token and position embeddings with a sequence of Transformer blocks
    and a final linear layer for language modeling.
    """
    def __init__(self, n_head, n_embed, context_length, vocab_size, N_BLOCKS):
        super().__init__()
        self.context_length = context_length
        self.N_BLOCKS = N_BLOCKS
        self.token_embed = nn.Embedding(vocab_size, n_embed)
        self.position_embed = nn.Embedding(context_length, n_embed)
        self.attn_blocks = nn.ModuleList([Block(n_head, n_embed, context_length) for _ in range(N_BLOCKS)])
        self.layer_norm = nn.LayerNorm(n_embed)
        self.lm_head = nn.Linear(n_embed, vocab_size)
        self.register_buffer('pos_idxs', torch.arange(context_length))

    def _pre_attn_pass(self, idx):
        """
        Combines token and position embeddings.

        Args:
            idx (torch.Tensor): Input token indices.

        Returns:
            torch.Tensor: Sum of token and position embeddings.
        """
        B, T = idx.shape
        tok_embedding = self.token_embed(idx)
        pos_embedding = self.position_embed(self.pos_idxs[:T])
        return tok_embedding + pos_embedding

    def forward(self, idx, targets=None):
        """
        Forward pass through the Transformer.

        Args:
            idx (torch.Tensor): Input token indices.
            targets (torch.Tensor, optional): Target token indices for loss calculation. Defaults to None.

        Returns:
            tuple: Logits and loss (if targets are provided).
        """
        x = self._pre_attn_pass(idx)
        for block in self.attn_blocks:
            x = block(x)
        x = self.layer_norm(x)
        logits = self.lm_head(x)
        loss = None
        if targets is not None:
            B, T, C = logits.shape
            flat_logits = logits.view(B * T, C)
            targets = targets.view(B * T).long()
            loss = F.cross_entropy(flat_logits, targets)
        return logits, loss

    def forward_embedding(self, idx):
        """
        Forward pass focusing on the embedding and attention blocks.

        Args:
            idx (torch.Tensor): Input token indices.

        Returns:
            tuple: Output after attention blocks and the residual.
        """
        x = self._pre_attn_pass(idx)
        residual = x
        for block in self.attn_blocks:
            x, residual = block.forward_embedding(x)
        return x, residual

    def generate(self, idx, max_new_tokens):
        """
        Generates new tokens given a starting sequence.

        Args:
            idx (torch.Tensor): Initial sequence of token indices.
            max_new_tokens (int): Number of tokens to generate.

        Returns:
            torch.Tensor: The extended sequence of tokens.
        """
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.context_length:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
The init method of our Transformer class sets up the token and position embedding layers (token_embed, position_embed), a stack of Block modules (attn_blocks), a final layer-normalization layer (layer_norm), and the linear language-modeling head (lm_head).
The _pre_attn_pass method combines token and position embeddings. The forward method runs the input sequence through the embedding layers and the stack of Transformer blocks, applies the final layer normalization, and produces logits; if targets are provided it also computes the loss. The forward_embedding method gives an intermediate forward pass up to the output of the attention blocks, and the generate method implements token-by-token generation.
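Before scaling up, a toy-sized instantiation (hypothetical hyperparameters) verifies that the wiring works end to end:

tiny = Transformer(n_head=2, n_embed=64, context_length=32, vocab_size=100, N_BLOCKS=2)
idx = torch.randint(0, 100, (1, 8))                 # one sequence of 8 token ids
logits, loss = tiny(idx, targets=idx)
print(logits.shape, loss.item())                    # torch.Size([1, 8, 100]) and a scalar loss
print(tiny.generate(idx, max_new_tokens=4).shape)   # torch.Size([1, 12])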
12. Batching
When we train a deep learning model on a large dataset, we process it in batches because of limited GPU memory. So let's create a get_batch_iterator function that takes the data_path of an HDF5 file, the desired batch_size, the context_length of each sequence, and the device to load the data onto.
batch_size determines how many sequences are processed in parallel during training, context_length specifies the length of each input sequence, and data_path points to the location of the training data.
# --- Data Loading Utility ---
def get_batch_iterator(data_path, batch_size, context_length, device="cpu"):
    """
    Creates an iterator for generating batches of data from an HDF5 file.

    Args:
        data_path (str): Path to the HDF5 file containing tokenized data.
        batch_size (int): Number of sequences in each batch.
        context_length (int): Length of each sequence.
        device (str, optional): Device to load the data onto ('cpu' or 'cuda'). Defaults to "cpu".

    Yields:
        tuple: A tuple containing input sequences (xb) and target sequences (yb).
    """
    # Open the HDF5 file in read mode
    with h5py.File(data_path, 'r') as hdf5_file:
        # Extract the dataset of tokenized sequences
        dataset = hdf5_file['tokens']

        # Get the total size of the dataset
        dataset_size = dataset.shape[0]

        # Calculate the number of examples (sequences) that can be made from the data
        n_examples = (dataset_size - 1) // context_length

        # Create an array of indices for examples and shuffle them for randomness
        example_idxs = np.arange(n_examples)
        np.random.shuffle(example_idxs)

        # Initialize epoch counter and example counter
        epochs = 0
        counter = 0

        while True:
            # Check if the current batch exceeds the number of available examples
            if counter + batch_size > n_examples:
                # Shuffle the indices again and reset the counter to 0
                np.random.shuffle(example_idxs)
                counter = 0
                print(f"Finished epoch {epochs}")  # Print epoch number when an epoch finishes
                epochs += 1  # Increment the epoch counter

            # Select a batch of example indices and convert them to token offsets
            random_indices = example_idxs[counter:counter + batch_size] * context_length

            # Retrieve sequences from the dataset based on the random indices
            random_samples = torch.tensor(np.array([dataset[idx:idx + context_length + 1] for idx in random_indices]))

            # Separate the input sequences (xb) and target sequences (yb)
            xb = random_samples[:, :context_length].to(device)       # Input sequence (all but the last token)
            yb = random_samples[:, 1:context_length + 1].to(device)  # Target sequence (shifted by one token)

            # Increment the counter to move to the next batch
            counter += batch_size

            # Yield the input and target sequences as a tuple for the current batch
            yield xb, yb
Our get_batch_iterator function handles loading and batching the training data. It takes data_path, batch_size, context_length, and device as inputs, opens the HDF5 file, shuffles the example indices, and then enters an infinite loop that yields batches. On each iteration it selects a random subset of the data to form a batch of input sequences (xb) and their corresponding target sequences (yb).
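A minimal usage sketch (it assumes the tokenized validation file created earlier exists at this path):

val_iter = get_batch_iterator("data/val/pile_dev.h5", batch_size=4, context_length=16, device="cpu")
xb, yb = next(val_iter)
print(xb.shape, yb.shape)     # torch.Size([4, 16]) torch.Size([4, 16])
print(xb[0, 1] == yb[0, 0])   # tensor(True) -- the targets are the inputs shifted by one token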
13. Training Parameters
Now that the model is coded, we need to define the training parameters, such as the number of heads and blocks, along with the data paths.
# --- Configuration ---
# Define vocabulary size and transformer configuration
VOCAB_SIZE = 50304    # Number of unique tokens in the vocabulary
CONTEXT_LENGTH = 512  # Maximum sequence length for the model
N_EMBED = 2048        # Dimension of the embedding space
N_HEAD = 16           # Number of attention heads in each transformer block
N_BLOCKS = 64         # Number of transformer blocks in the model

# Paths to training and development datasets (the HDF5 files created during tokenization)
TRAIN_PATH = "data/train/pile_train.h5"  # File path for the training dataset
DEV_PATH = "data/val/pile_dev.h5"        # File path for the validation dataset

# Transformer training parameters
T_BATCH_SIZE = 32        # Number of samples per training batch
T_CONTEXT_LENGTH = 16    # Context length for training batches
T_TRAIN_STEPS = 200000   # Total number of training steps
T_EVAL_STEPS = 1000      # Frequency (in steps) to perform evaluation
T_EVAL_ITERS = 250       # Number of iterations to evaluate the model
T_LR_DECAY_STEP = 50000  # Step at which to decay the learning rate
T_LR = 5e-4              # Initial learning rate for training
T_LR_DECAYED = 5e-5      # Learning rate after decay
T_OUT_PATH = "models/transformer_B.pt"  # Path to save the trained model

# Device configuration
DEVICE = 'cuda'

# Store all configurations in a dictionary for easy access and modification
default_config = {
    'vocab_size': VOCAB_SIZE,
    'context_length': CONTEXT_LENGTH,
    'n_embed': N_EMBED,
    'n_head': N_HEAD,
    'n_blocks': N_BLOCKS,
    'train_path': TRAIN_PATH,
    'dev_path': DEV_PATH,
    't_batch_size': T_BATCH_SIZE,
    't_context_length': T_CONTEXT_LENGTH,
    't_train_steps': T_TRAIN_STEPS,
    't_eval_steps': T_EVAL_STEPS,
    't_eval_iters': T_EVAL_ITERS,
    't_lr_decay_step': T_LR_DECAY_STEP,
    't_lr': T_LR,
    't_lr_decayed': T_LR_DECAYED,
    't_out_path': T_OUT_PATH,
    'device': DEVICE,
}
For most parameters I used the most common values and stored them in a dictionary for easy access. The values above are for the billion-parameter model. If you want to train a model with millions of parameters instead, reduce the main parameters, namely CONTEXT_LENGTH, N_EMBED, N_HEAD, and N_BLOCKS (see the sketch below); alternatively, you can simply run the million-parameter model script from my GitHub repository.
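As a hedged sketch (these are illustrative values, not the exact millions-parameter configuration from the repository), scaling the model down just means overriding a few entries:

small_config = dict(default_config)
small_config.update({
    'context_length': 256,   # shorter context
    'n_embed': 512,          # narrower embeddings
    'n_head': 8,             # fewer attention heads
    'n_blocks': 8,           # much shallower stack
    't_out_path': "models/transformer_M.pt",
})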
14. Training the Model
Let's initialize our Transformer model and check its total number of parameters.
# --- Initialize the Model and Print Parameters ---
config = default_config  # use the default configuration defined above

model = Transformer(
    n_head=config['n_head'],
    n_embed=config['n_embed'],
    context_length=config['context_length'],
    vocab_size=config['vocab_size'],
    N_BLOCKS=config['n_blocks']
).to(config['device'])

# Print the total number of parameters
total_params = sum(p.numel() for p in model.parameters())
print(f"Total number of parameters in the model: {total_params:,}")
#### OUTPUT ####
2,141,346,251
Now that we have our 2-billion-parameter model, we need to define the AdamW optimizer and a loss-tracking helper that will let us follow the model's progress throughout training.
# --- Optimizer Setup and Loss Tracking ---
# Set up the AdamW optimizer with the specified learning rate.
optimizer = torch.optim.AdamW(model.parameters(), lr=config['t_lr'])

# List to track loss values during training.
losses = []

# Define a window size for averaging recent losses in the training loop.
AVG_WINDOW = 64

# Helper function to estimate the average loss for training and development data.
@torch.no_grad()
def estimate_loss(steps):
    """
    Evaluate the model on training and development datasets and calculate average loss.

    Args:
        steps (int): Number of steps to evaluate.

    Returns:
        dict: Dictionary containing average losses for 'train' and 'dev' splits.
    """
    out = {}
    model.eval()  # Set the model to evaluation mode.
    for split in ['train', 'dev']:
        # Select the appropriate data path for the current split.
        data_path = config['train_path'] if split == 'train' else config['dev_path']

        # Create a batch iterator for evaluation.
        batch_iterator_eval = get_batch_iterator(
            data_path, config['t_batch_size'], config['t_context_length'], device=config['device']
        )

        # Initialize a tensor to track loss values for each evaluation step.
        losses_eval = torch.zeros(steps)
        for k in range(steps):
            try:
                # Fetch a batch and calculate the loss.
                xb, yb = next(batch_iterator_eval)
                _, loss = model(xb, yb)
                losses_eval[k] = loss.item()
            except StopIteration:
                # Handle the case where the data iterator ends early.
                print(f"Warning: Iterator for {split} ended early.")
                break

        # Compute the mean loss for the current split.
        out[split] = losses_eval[:k + 1].mean()

    model.train()  # Restore the model to training mode.
    return out
Now we initialize the batch iterator and the training loop, which starts the training.
# --- Training Loop ---
# Create a batch iterator for the training data.
batch_iterator = get_batch_iterator(
    config['train_path'],
    config['t_batch_size'],
    config['t_context_length'],
    device=config['device']
)

# Create a progress bar to monitor training progress.
pbar = tqdm(range(config['t_train_steps']))
for step in pbar:
    try:
        # Fetch a batch of input and target data.
        xb, yb = next(batch_iterator)

        # Perform a forward pass and compute the loss.
        _, loss = model(xb, yb)

        # Record the loss for tracking.
        losses.append(loss.item())
        pbar.set_description(f"Train loss: {np.mean(losses[-AVG_WINDOW:]):.4f}")

        # Backpropagate the loss and update the model parameters.
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        optimizer.step()

        # Periodically evaluate the model on training and development data.
        if step % config['t_eval_steps'] == 0:
            train_loss, dev_loss = estimate_loss(config['t_eval_iters']).values()
            print(f"Step: {step}, Train loss: {train_loss:.4f}, Dev loss: {dev_loss:.4f}")

        # Decay the learning rate at the specified step.
        if step == config['t_lr_decay_step']:
            print('Decaying learning rate')
            for g in optimizer.param_groups:
                g['lr'] = config['t_lr_decayed']
    except StopIteration:
        # Handle the case where the training data iterator ends early.
        print("Training data iterator finished early.")
        break
15. Saving the Trained Model
Because the training loop catches errors, if it stops early it still falls through to the save step below, so a partially trained model is not lost. Once training is finished, we save the trained model so we can use it for inference later.
# --- Save Model and Final Evaluation ---
# Perform a final evaluation of the model on training and development datasets.
train_loss, dev_loss = estimate_loss(200).values()

# Ensure unique model save path in case the file already exists.
modified_model_out_path = config['t_out_path']
save_tries = 0
while os.path.exists(modified_model_out_path):
    save_tries += 1
    model_out_name = os.path.splitext(config['t_out_path'])[0]
    modified_model_out_path = model_out_name + f"_{save_tries}" + ".pt"

# Save the model's state dictionary, optimizer state, and training metadata.
torch.save(
    {
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
        'losses': losses,
        'train_loss': train_loss,
        'dev_loss': dev_loss,
        'steps': len(losses),
    },
    modified_model_out_path
)
print(f"Saved model to {modified_model_out_path}")
print(f"Finished training. Train loss: {train_loss:.4f}, Dev loss: {dev_loss:.4f}")
The billion-parameter model finished with a training loss of 0.2314 and a dev loss of 0.643.
16. Training Loss
When I plot the losses of the million- and billion-parameter models, they look very different.
The billion-parameter model's loss starts much higher and fluctuates a lot. It drops quickly at first, then oscillates before smoothing out. This suggests that the larger model has a harder time finding the right direction early in training; it may need more data and a more careful setup. Once the learning rate is decayed (the red line in the plot), the loss decreases more steadily, which indicates the decay helps it fine-tune.
The million-parameter model's loss decreases more easily from the start and does not fluctuate as much as the larger model's. When the learning rate is decayed, its curve barely changes, probably because the smaller model is easier to train and finds a good solution faster. The large gap shows how hard it is to train very large models: they need a different approach, and perhaps more training time, to learn well.
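If you want to reproduce such a plot from the losses list recorded during training, a minimal sketch (assuming matplotlib is installed; the smoothing window is illustrative) looks like this:

import matplotlib.pyplot as plt

window = 64
smoothed = np.convolve(losses, np.ones(window) / window, mode='valid')
plt.plot(smoothed, label='train loss (smoothed)')
plt.axvline(config['t_lr_decay_step'], color='red', linestyle='--', label='LR decay')
plt.xlabel('Training step')
plt.ylabel('Loss')
plt.legend()
plt.show()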
We now have the saved models. We can finally use them for inference and see how they generate text. 😓
17. Generating Text
Let's create a function that generates text from a saved model. It takes the saved model path and the input text as inputs and returns the generated text.
def generate_text(model_path, input_text, max_length=512, device="cuda"):
    """
    Generate text using a pre-trained model based on the given input text.

    Args:
        model_path (str): Path to the model checkpoint.
        input_text (str): The input text to seed the generation.
        max_length (int, optional): Maximum number of tokens to generate. Defaults to 512.
        device (str, optional): Device to load the model on ('cpu' or 'cuda'). Defaults to "cuda".

    Returns:
        str: The generated text.
    """
    # Load the model checkpoint
    checkpoint = torch.load(model_path)

    # Initialize the model with the same hyperparameters the checkpoint was trained with
    model = Transformer(
        n_head=config['n_head'],
        n_embed=config['n_embed'],
        context_length=config['context_length'],
        vocab_size=config['vocab_size'],
        N_BLOCKS=config['n_blocks']
    ).to(device)

    # Load the model's state dictionary
    model.load_state_dict(checkpoint['model_state_dict'])

    # Load the tokenizer used during training ('r50k_base')
    enc = tiktoken.get_encoding('r50k_base')

    # Encode the input text along with the end-of-text token
    input_ids = torch.tensor(
        enc.encode(input_text, allowed_special={'<|endoftext|>'}),
        dtype=torch.long
    )[None, :].to(device)  # Add batch dimension and move to the specified device

    # Generate text with the model using the encoded input
    with torch.no_grad():
        # Generate up to 'max_length' tokens of text
        generated_output = model.generate(input_ids, max_length)

    # Decode the generated tokens back into text
    generated_text = enc.decode(generated_output[0].tolist())
    return generated_text
Here we instantiate the Transformer we defined earlier, using the same hyperparameters the checkpoint was trained with, and then load the saved weights into that architecture as its state dict.
Let's first observe what the million- and billion-parameter models generate without providing any real prompt, to see what they produce freely.
# Defining the file paths for the pre-trained models
Billion_model_path = 'models/transformer_B.pt' # Path to the Billion model
Million_model_path = 'models/transformer_M.pt' # Path to the Million model
# Using '<|endoftext|>' as input to the models (acts as a prompt that allows the models to generate text freely)
input_text = "<|endoftext|>"
# Call the function to generate text based on the input text using the Billion model
B_output = generate_text(Billion_model_path, input_text)
# Call the function to generate text based on the input text using the Million model
M_output = generate_text(Million_model_path, input_text)
# Print the output generated by both models
print(B_output) # Output from the Billion model
print(M_output) # Output from the Million model
When the context is short, both LLMs manage to produce clear, correct words. For example, in the million-parameter output a sentence along the lines of "villages in China are directly connected to cities" makes sense and conveys a clear idea; it is easy to understand and logically links villages with cities.
However, as the context gets longer and more complex, the clarity starts to fade. In the billion-parameter output, sentences like "two miles on the east coast, 1037 and 73 million refugees (hypotetus)" and "blacksmiths, musicians and boutique hotels that inspire Canadian pressure" become much harder to follow. The ideas seem disconnected and the sentence structure is unnatural. The individual words may still be valid, but the overall meaning becomes muddled and unclear.
On the positive side, the 13+ million parameter LLM also starts producing somewhat meaningful content with correctly spelled words. For example, when I give it an email subject as input, it starts writing an email for me. The wider text obviously does not stay meaningful, but take a look at the output:
# Input text
input_text = "Subject: "

# Call the Million parameter model
m_output = generate_text(Million_model_path, input_text)
print(m_output)  # Output from the Million model
Our million-parameter model is encouraging: we can build a very narrow, goal-directed sub-1B LLM. Our 1B training run, on the other hand, shows that the architecture of a billion-scale model needs to be designed in depth and with care; otherwise it does not improve on the million-parameter model's training or performance and will simply overfit the data.
18. What's Next
I suggest you first create the 13+ million parameter model and then start scaling it up gradually, adding parameters and improving its ability to handle shorter contexts. How many parameters you train for a specific task is up to you. Then, staying under 1B parameters, try fine-tuning the model on domain-specific data (for example, writing emails or essays) and see how it generates text. A rough fine-tuning sketch follows.
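This is a hedged fine-tuning sketch, not the repository's fine-tuning script: the domain dataset path is hypothetical, the learning rate and step count are illustrative, and the model must be rebuilt with the same hyperparameters the checkpoint was trained with.

# Rebuild the architecture with the hyperparameters used for the saved checkpoint
model = Transformer(
    n_head=config['n_head'],
    n_embed=config['n_embed'],
    context_length=config['context_length'],
    vocab_size=config['vocab_size'],
    N_BLOCKS=config['n_blocks']
).to(config['device'])
checkpoint = torch.load("models/transformer_B.pt")
model.load_state_dict(checkpoint['model_state_dict'])

# Iterate over a domain-specific dataset tokenized with the same process_files step
finetune_iter = get_batch_iterator(
    "data/train/emails.h5",       # hypothetical domain-specific HDF5 file
    config['t_batch_size'],
    config['t_context_length'],
    device=config['device'],
)

# Use a much smaller learning rate for fine-tuning
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for step in range(1000):          # a short, illustrative fine-tuning run
    xb, yb = next(finetune_iter)
    _, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()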
Original article: Building a 2 Billion Parameter LLM from Scratch Using Python