A Guide to Deploying Large Models with MLflow

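In today's machine learning landscape, large language models (LLMs) such as GPT-3 and other Transformer-based models are changing how we interact with data. Deploying these models at scale for inference (that is, for making predictions) can be challenging, especially when a model is not supported out of the box by a platform such as MLflow.

In this post, we show how to deploy a large language model (LLM) with MLflow using a custom Python function (pyfunc), which lets us handle models with special requirements. By the end, you will understand how to save, deploy, and serve an LLM with MLflow, including non-standard configurations and dependencies.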

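Why a Custom PyFunc?

Before diving into the solution, let's first look at why we need a custom pyfunc in MLflow.

MLflow supports a variety of model flavors. One of the most common is the Transformers flavor, which supports models from the HuggingFace library. However, not every model and configuration fits the default setup. Some models, especially larger ones or those that need special dependencies, require custom handling.

A custom pyfunc in MLflow lets us define our own workflow: how the model is loaded, how it makes predictions, and how it interacts with data. This is especially useful for LLMs, which may have unusual dependencies or data interfaces.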

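1. Hardware Recommendations and Prerequisites

When deploying a large language model such as MPT-7B (a 7-billion-parameter model), the hardware used for inference plays a crucial role:

GPU requirement: running a model like this efficiently calls for a CUDA-capable GPU with at least 64GB of VRAM.
CPU warning: the model can run on a CPU, but it will be very slow, with a single prediction taking tens of minutes.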
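As a rough back-of-the-envelope check on why memory matters, the sketch below estimates the space taken by the weights alone. It assumes exactly 7e9 parameters and ignores activations, the generation cache, and framework overhead, all of which come on top, so treat it as a lower bound rather than a sizing rule. weight_memory_gib is a hypothetical helper name.

```python
def weight_memory_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in GiB."""
    return n_params * bytes_per_param / 1024**3


N_PARAMS = 7e9  # assumption: a round 7 billion parameters

bf16_gib = weight_memory_gib(N_PARAMS, 2)  # bfloat16 uses 2 bytes per parameter
fp32_gib = weight_memory_gib(N_PARAMS, 4)  # float32 uses 4 bytes per parameter

print(f"bfloat16 weights: {bf16_gib:.1f} GiB")
print(f"float32 weights:  {fp32_gib:.1f} GiB")
```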

To get started, you need to install a few Python packages. These dependencies are required for model loading, attention computation, and inference:

pip install xformers==0.0.20 einops==0.6.1 flash-attn==v1.0.3.post0 triton-pre-mlir@git+https://github.com/vchiley/triton.git@triton_pre_mlir#subdirectory=python

In addition, install mlflow, torch, transformers, and huggingface_hub:

pip install mlflow torch transformers huggingface_hub

2. Download the Model and Tokenizer

Next, we download the model and its tokenizer from the HuggingFace Hub. Here is how:

from huggingface_hub import snapshot_download

# Download the MPT-7B instruct model and tokenizer to a local directory cache
snapshot_location = snapshot_download(repo_id="mosaicml/mpt-7b-instruct", local_dir="mpt-7b")

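The snapshot_download function downloads the model and tokenizer files and returns the local path to them. We use that path (snapshot_location) in the following steps to set up the custom pyfunc.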
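Because the download is large, it can be worth checking that the snapshot directory looks complete before logging it. The sketch below is one possible check, not part of the original workflow: missing_snapshot_files is a hypothetical helper, and the default list only names config.json, so extend it with whatever files your model needs.

```python
import tempfile
from pathlib import Path


def missing_snapshot_files(snapshot_dir, required=("config.json",)):
    """Return the names in `required` that are not files inside snapshot_dir."""
    root = Path(snapshot_dir)
    return [name for name in required if not (root / name).is_file()]


# Demo on a throwaway directory so the snippet runs without downloading anything.
with tempfile.TemporaryDirectory() as tmp:
    before = missing_snapshot_files(tmp)            # empty directory
    (Path(tmp) / "config.json").write_text("{}")
    after = missing_snapshot_files(tmp)             # file now present

print(before, after)
```

With the real download in place, the call would be missing_snapshot_files(snapshot_location).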

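3. Define the Custom PyFunc

Now let's define our custom Python function (pyfunc). A pyfunc lets you define how the model is loaded, how inference requests are handled, and how the model interacts with data. In this example, we create an MPT class that extends mlflow.pyfunc.PythonModel.

3.1 Loading the Model

First, let's define how the model and tokenizer are loaded when MLflow loads our pyfunc.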

import mlflow
import torch
import transformers

class MPT(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        """
        This method initializes the tokenizer and language model
        using the specified model snapshot directory.
        """
        # Initialize tokenizer and language model
        self.tokenizer = transformers.AutoTokenizer.from_pretrained(
            context.artifacts["snapshot"], padding_side="left"
        )

        config = transformers.AutoConfig.from_pretrained(
            context.artifacts["snapshot"], trust_remote_code=True
        )

        self.model = transformers.AutoModelForCausalLM.from_pretrained(
            context.artifacts["snapshot"],
            config=config,
            torch_dtype=torch.bfloat16,
            trust_remote_code=True,
        )

        # Keep the model on CPU here; pass device="cuda" instead if a GPU is available
        self.model.to(device="cpu")
        self.model.eval()

Let's break the code down line by line.

class MPT(mlflow.pyfunc.PythonModel):

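Here we define a new class named MPT (short for MosaicML Pretrained Transformer). A class is a blueprint or template for creating objects; in this case, the object is a machine learning model loaded with transformers. Subclassing mlflow.pyfunc.PythonModel means the class extends the MLflow base class for defining custom models.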

def load_context(self, context):

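This line defines a function (also called a method) named load_context. MLflow calls it when the model is loaded, so this is where we set the model up and get it ready for use. The context argument carries the information the method needs, most importantly context.artifacts, which maps artifact names to local file paths.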

self.tokenizer = transformers.AutoTokenizer.from_pretrained(
context.artifacts["snapshot"], padding_side="left")

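This line loads the tokenizer from the saved copy of the model (the "snapshot"). The tokenizer splits text, such as a sentence, into smaller pieces (tokens) and maps them to the integer IDs the model understands. You can think of it as a translator between human-readable text and the model's input format. padding_side="left" means that when inputs are padded to a common length, the padding tokens are added on the left side of the text.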

config = transformers.AutoConfig.from_pretrained(
context.artifacts["snapshot"], trust_remote_code=True)

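Here we load the model's configuration. The configuration holds settings that control how the model is built, such as its size and architecture. It is a bit like reading the manual before using the model. trust_remote_code=True allows transformers to run the custom Python code that ships with the model repository, so only enable it for sources you trust.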

self.model = transformers.AutoModelForCausalLM.from_pretrained(
context.artifacts["snapshot"], config=config,
torch_dtype=torch.bfloat16, trust_remote_code=True)

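This line loads the actual model. The model is the "brain" that performs the task, such as answering questions or generating text. from_pretrained tells it to load the version saved in the snapshot. torch_dtype=torch.bfloat16 loads the weights in the 16-bit bfloat16 format, which roughly halves memory use compared with 32-bit floats. Finally, trust_remote_code=True again means we trust the external source of this model's code and configuration.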

self.model.eval()

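This line switches the model to evaluation mode. During training, layers such as dropout behave randomly to help the model learn; in evaluation mode that training-only behavior is turned off, so the model is ready for real predictions (inference).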

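3.2 Prediction Logic

Next, let's define the predict method, which handles the prediction process. It takes the user's input, builds a prompt, and passes it to the model to generate a response.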

    def _build_prompt(self, instruction):
        """
        This method generates the prompt for the model.
        """
        INSTRUCTION_KEY = "### Instruction:"
        RESPONSE_KEY = "### Response:"
        INTRO_BLURB = (
            "Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request."
        )

        return f"""{INTRO_BLURB}
        {INSTRUCTION_KEY}
        {instruction}
        {RESPONSE_KEY}
        """

    def predict(self, context, model_input, params=None):
        """
        This method generates prediction for the given input.
        """
        prompt = model_input["prompt"][0]
        temperature = params.get("temperature", 0.1) if params else 0.1
        max_tokens = params.get("max_tokens", 1000) if params else 1000

        prompt = self._build_prompt(prompt)

        # Encode the input and generate prediction
        encoded_input = self.tokenizer.encode(prompt, return_tensors="pt").to("cpu")
        output = self.model.generate(
            encoded_input,
            do_sample=True,
            temperature=temperature,
            max_new_tokens=max_tokens,
        )

        # Remove the prompt from the generated text
        prompt_length = len(self.tokenizer.encode(prompt, return_tensors="pt")[0])
        generated_response = self.tokenizer.decode(
            output[0][prompt_length:], skip_special_tokens=True
        )

        return {"candidates": [generated_response]}

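Earlier we used transformers and MLflow to set up a machine learning model (MPT) that can take natural-language input and generate a corresponding output. Now let's look at how the model handles the input (the instruction), builds the task prompt, and then produces a response.

The _build_prompt method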

def _build_prompt(self, instruction):
    """
    This method generates the prompt for the model.
    """
    INSTRUCTION_KEY = "### Instruction:"
    RESPONSE_KEY = "### Response:"
    INTRO_BLURB = (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request."
    )

    return f"""{INTRO_BLURB}
    {INSTRUCTION_KEY}
    {instruction}
    {RESPONSE_KEY}
    """

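Purpose: this method builds the text prompt that is given to the model. The prompt combines a task description, the instruction (the user's input), and a label marking where the model should write its response.

Explanation:

INSTRUCTION_KEY = "### Instruction:" and RESPONSE_KEY = "### Response:" are predefined labels.
INTRO_BLURB introduces the task, telling the model that the input is an instruction and the output should be a response.
The final return statement formats the prompt in the layout the model expects.

Example:

If the instruction passed in is "Translate the sentence 'Hello, how are you?' into French.", the generated prompt is as follows (the triple-quoted string in the method also carries its source indentation onto every line after the first, which is omitted here):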

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Translate the sentence 'Hello, how are you?' into French.
### Response:
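If you want the prompt to match the layout above exactly, with no leading spaces on any line, one option is to join the parts explicitly. This is a sketch of an alternative, not the method used in the class; build_prompt is a hypothetical standalone function that reuses the same three constants.

```python
INSTRUCTION_KEY = "### Instruction:"
RESPONSE_KEY = "### Response:"
INTRO_BLURB = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)


def build_prompt(instruction: str) -> str:
    """Assemble the prompt line by line so no source indentation leaks in."""
    # The trailing empty string makes the prompt end with a newline.
    return "\n".join([INTRO_BLURB, INSTRUCTION_KEY, instruction, RESPONSE_KEY, ""])


prompt = build_prompt("Translate the sentence 'Hello, how are you?' into French.")
print(prompt)
```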

The predict method

def predict(self, context, model_input, params=None):
    """
    This method generates prediction for the given input.
    """
    prompt = model_input["prompt"][0]
    temperature = params.get("temperature", 0.1) if params else 0.1
    max_tokens = params.get("max_tokens", 1000) if params else 1000

    prompt = self._build_prompt(prompt)

    # Encode the input and generate prediction
    encoded_input = self.tokenizer.encode(prompt, return_tensors="pt").to("cpu")
    output = self.model.generate(
        encoded_input,
        do_sample=True,
        temperature=temperature,
        max_new_tokens=max_tokens,
    )

    # Remove the prompt from the generated text
    prompt_length = len(self.tokenizer.encode(prompt, return_tensors="pt")[0])
    generated_response = self.tokenizer.decode(
        output[0][prompt_length:], skip_special_tokens=True
    )

    return {"candidates": [generated_response]}

Purpose: the predict method generates a prediction (response) from the model for the given input.

Step by step:

a) Extract the prompt from the input

prompt = model_input["prompt"][0]

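model_input holds the input prompts in a "prompt" column (typically a pandas DataFrame when the model is called through MLflow). The first element ([0]) is extracted as the prompt the model will respond to, so only one prompt is handled per call.

Example: if model_input is {"prompt": ["Translate the sentence 'Hello, how are you?' into French."]}, prompt will be "Translate the sentence 'Hello, how are you?' into French.".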

b) Set the parameters (temperature, max_tokens)

temperature = params.get("temperature", 0.1) if params else 0.1
max_tokens = params.get("max_tokens", 1000) if params else 1000

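temperature controls how creative or random the model's responses are. A lower temperature (for example 0.1) makes the responses more predictable.

max_tokens limits the length of the generated output. If not specified, it defaults to 1000 tokens.

Example: if params is {"temperature": 0.5, "max_tokens": 150}, the model will generate a somewhat more creative response of at most 150 tokens.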
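The defaulting logic can be isolated and tried without a model. The sketch below is an equivalent way to express it, assuming the same two defaults as the predict method; resolve_params and DEFAULTS are hypothetical names. Unlike the original two lines, it also passes through any extra keys the caller supplies.

```python
DEFAULTS = {"temperature": 0.1, "max_tokens": 1000}


def resolve_params(params=None):
    """Overlay caller-supplied params on the defaults; None means 'use defaults'."""
    merged = dict(DEFAULTS)      # copy so the defaults are never mutated
    merged.update(params or {})  # None or {} leaves the defaults untouched
    return merged


no_params = resolve_params()
partial = resolve_params({"temperature": 0.5})
full = resolve_params({"temperature": 0.5, "max_tokens": 150})
print(no_params, partial, full)
```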

c) Build the full prompt

prompt = self._build_prompt(prompt)

The prompt is passed to the _build_prompt method, which formats it with the instruction and response labels.

Example: the final prompt looks like this

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
Translate the sentence 'Hello, how are you?' into French.
### Response:

d) Encode the input prompt

encoded_input = self.tokenizer.encode(prompt, return_tensors="pt").to("cpu")

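The tokenizer encodes prompt into token IDs, the format the model understands.

return_tensors="pt" returns the encoded input as a tensor, the data structure PyTorch uses (recall that we imported torch).

to("cpu") ensures the input is processed on the CPU (you can switch to a GPU if one is available).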
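Switching to a GPU means using the same device string in both places: where the model is moved in load_context and where the encoded input is moved in predict. A minimal sketch of picking that string is shown below. pick_device is a hypothetical helper, and the importlib guard is only there so the snippet also runs on a machine without PyTorch installed.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable CUDA GPU, otherwise "cpu"."""
    if importlib.util.find_spec("torch") is None:
        return "cpu"  # PyTorch is not installed, so fall back to CPU
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)

# The two hard-coded "cpu" strings in the MPT class would then become:
#   self.model.to(device=device)                                        (in load_context)
#   self.tokenizer.encode(prompt, return_tensors="pt").to(device)       (in predict)
```

The model and its input tensors have to live on the same device, which is why both calls need to change together.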

e) Generate the output

output = self.model.generate(
    encoded_input,
    do_sample=True,
    temperature=temperature,
    max_new_tokens=max_tokens,
)

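The model is now asked to generate a response based on encoded_input.

do_sample=True means the model samples from its predicted token distribution, which allows more varied responses, instead of always picking the most likely token.

temperature and max_new_tokens control the creativity and length of the response.

Example: if the input is the translation instruction, the model might generate a French translation such as "Bonjour, comment ça va?"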

f) Remove the prompt from the generated output

prompt_length = len(self.tokenizer.encode(prompt, return_tensors="pt")[0])
generated_response = self.tokenizer.decode(
    output[0][prompt_length:], skip_special_tokens=True
)

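Because the output of generate contains the prompt tokens followed by the response tokens, we need to remove the prompt portion.

We compute the prompt's length in tokens and use it to slice off the prompt, leaving only the model's response.

Example: if the output contains the original prompt followed by Bonjour, comment ça va?, we slice off the prompt and return only Bonjour, comment ça va?.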
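The slicing step is easy to see with plain Python lists standing in for the token tensors. The token IDs below are made up for illustration and do not come from a real tokenizer; strip_prompt is a hypothetical helper name.

```python
def strip_prompt(output_ids, prompt_ids):
    """Drop the leading prompt tokens, keeping only the newly generated ones."""
    prompt_length = len(prompt_ids)
    return output_ids[prompt_length:]


prompt_ids = [101, 7592, 2088]          # made-up IDs standing in for the encoded prompt
output_ids = prompt_ids + [9999, 4242]  # prompt tokens first, then the new tokens

new_ids = strip_prompt(output_ids, prompt_ids)
print(new_ids)
```

In the real method, the remaining IDs are then passed to tokenizer.decode to turn them back into text.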

g) Return the response

return {"candidates": [generated_response]}

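The generated response is returned in a dictionary under the key candidates. This format leaves room for returning several responses if needed (for example, if you asked for more than one).

Example: the output might look like this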

{"candidates": ["Bonjour, comment ça va?"]}

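4. Define the Model Input and Output Schema

For MLflow to track and serve the model correctly, we need to define input and output schemas. These describe the data structure the model expects and the data it returns.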

from mlflow.models.signature import ModelSignature
from mlflow.types import ColSpec, DataType, Schema

input_schema = Schema([ColSpec(DataType.string, "prompt")])
output_schema = Schema([ColSpec(DataType.string, "candidates")])

signature = ModelSignature(inputs=input_schema, outputs=output_schema)

We also define an input example, which is saved with the model and illustrates the input format expected when the model is served.

import pandas as pd

input_example = pd.DataFrame({"prompt": ["What is machine learning?"]})

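5. Log the Model in MLflow

With the custom pyfunc defined, the next step is to log the model to MLflow's tracking server. This step saves the model together with its metadata, so it is ready to be served.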

import mlflow

# Set the experiment
mlflow.set_experiment(experiment_name="mpt-7b-instruct-evaluation")

# Start an MLflow run
with mlflow.start_run():
    model_info = mlflow.pyfunc.log_model(
        "mpt-7b-instruct",
        python_model=MPT(),
        artifacts={"snapshot": snapshot_location},
        pip_requirements=[
            f"torch=={torch.__version__}",
            f"transformers=={transformers.__version__}",
            "einops",
            "sentencepiece",
        ],
        input_example=input_example,
        signature=signature,
    )

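This call logs the custom pyfunc to MLflow tracking and associates it with the specified experiment. The model is now available for serving and inference.

6. Load and Serve the Model

Once the model is logged, you can load it for use with the following command: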

loaded_model = mlflow.pyfunc.load_model(model_info.model_uri)

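This loads the model that was just logged, using the URI returned by log_model (the walkthrough does not register the model in the Model Registry). You can now use it for inference.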
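For serving over HTTP, one common route is MLflow's built-in scoring server, for example started with the mlflow models serve command. The sketch below only builds the JSON request body for such a server and sends nothing over the network. It assumes the MLflow 2.x request format with a dataframe_split key; build_invocation_payload is a hypothetical helper, and the host and port in the comment are placeholders.

```python
import json


def build_invocation_payload(prompts, params=None):
    """Build a JSON body for the scoring server's /invocations endpoint."""
    payload = {
        "dataframe_split": {
            "columns": ["prompt"],
            "data": [[p] for p in prompts],  # one row per prompt
        }
    }
    if params:
        payload["params"] = params  # optional inference-time parameters
    return json.dumps(payload)


body = build_invocation_payload(["What is machine learning?"], {"temperature": 0.6})
print(body)

# The body would be sent as a POST with Content-Type: application/json to an
# address such as http://127.0.0.1:5000/invocations (placeholder host and port).
```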

7. Test the Model for Inference

Finally, let's test the model with a sample input. If a GPU is available, running the model on it gives much faster results.

# Run a test prediction (very slow on CPU; a GPU is strongly recommended)
loaded_model.predict(pd.DataFrame({"prompt": ["What is machine learning?"]}), params={"temperature": 0.6})

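This returns the response generated by the LLM.

8. Closing Remarks

In this tutorial, we walked through the steps needed to deploy a large language model (LLM) with a custom MLflow pyfunc. By defining a custom class, handling model loading and inference, and using MLflow to track and serve the model, we built an end-to-end solution for deploying complex models that do not fit the standard MLflow pipeline.

With this approach, you can deploy a wide range of models with custom dependencies and requirements, while keeping a simple, user-friendly interface for users to interact with your model in production.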

Original article: Deploying (LLMs) with MLflow: A Step-by-Step Guide

Translated and compiled by 汇智网. Please credit the source when reposting.