Fine-Tuning Phi-3.5 on an E-Commerce Classification Dataset

With the launch of the Phi-3.5 series, Microsoft has joined Meta AI in the competitive landscape of large language models. The series includes a small language model, a vision-language model, and a mixture-of-experts model designed to deliver top-tier performance.

In this tutorial, we will explore the Microsoft Phi-3.5 family of models. We will load the Phi-3.5-mini-instruct model and fine-tune it to classify e-commerce products based on their text descriptions. In the final steps, we will demonstrate how to merge the LoRA (Low-Rank Adaptation) adapter with the base model and push it to Hugging Face. This enables efficient cloud deployment, making the model available for a wide range of applications.

1. Microsoft Phi-3.5

Microsoft Phi-3.5 introduces three new models: Phi-3.5-mini, Phi-3.5-vision, and the latest addition, Phi-3.5-MoE (Mixture of Experts).

Phi-3.5-mini is optimized for enhanced multilingual support and features an impressive 128K context length. Despite its small size, it performs on par with larger models thanks to rigorous refinement through supervised fine-tuning, proximal policy optimization, and direct preference optimization, which ensures precise instruction following.

Phi-3.5-vision is a cutting-edge, lightweight multimodal model trained on a dataset composed of synthetic data and filtered public websites. It excels at multi-frame image understanding and reasoning, making it well suited for detailed image comparison, multi-image summarization/storytelling, and video summarization, with broad application potential.

The standout model, Phi-3.5-MoE, uses a mixture-of-experts architecture with 16 experts and 6.6 billion active parameters. It delivers excellent performance, lower latency, and robust safety, along with comprehensive multilingual support.

The Phi-3.5 model family offers the open-source community cost-effective, high-performance solutions, advancing both small language models and generative AI.

2. Accessing the Phi-3.5 Models

In this section, we will load the Phi-3.5-mini-instruct model and run model inference on the Kaggle platform.

Start a session with the T4 x2 GPUs enabled.

Install all the necessary Python packages using the pip command.

%%capture
%pip install -U transformers accelerate

Load the full model and tokenizer using the Transformers library.

Create a text generation pipeline with the model and tokenizer.

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

base_model = "microsoft/Phi-3.5-mini-instruct"

tokenizer = AutoTokenizer.from_pretrained(base_model)

model = AutoModelForCausalLM.from_pretrained(
        base_model,
        return_dict=True,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
)


pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

Create the messages with a system prompt and a user query, then convert them into a chat prompt using the chat template.

Generate a response by feeding the prompt to the pipeline.

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the tallest building in the world?"}
]

# Apply the chat template to the messages
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generate the output using the pipeline
outputs = pipe(prompt, max_new_tokens=120, do_sample=True)

# Print the generated text
print(outputs[0]["generated_text"])

We got an accurate and detailed result.

<|system|>
You are a helpful assistant.<|end|>
<|user|>
What is the tallest building in the world?<|end|>
<|assistant|>
 As of my knowledge cutoff in 2023, the tallest building in the world is the Burj Khalifa, located in Dubai, United Arab Emirates. It stands at a remarkable 828 meters (2,716.5 feet) tall once its antenna is included. Completed in January 2010, the Burj Khalifa marks a significant achievement in architecture and engineering, setting numerous records. It provides office space, luxury condominiums, and various leisure facilities. This landmark continues to

We can even feed the text generation pipeline a custom prompt containing instructions and a call transcript, to check whether the model follows the user's commands.

prompt = """
        In a call center environment, classify customer interactions as 'Fraudulent' or 'Non-Fraudulent'. 
        Consider factors such as the nature of the inquiry, caller verification details, transaction history, and any red flags raised during the call.
        [Lisa Adams, contacts the call center claiming unauthorized transactions on her credit card statement. She demands a full refund, asserting that she has never visited the merchant in question.] =
        """
outputs = pipe(prompt, max_new_tokens=120, do_sample=True)
print(outputs[0]["generated_text"])

As we can see, the model performed quite well, labeling the call as fraudulent and providing an explanation.

 In a call center environment, classify customer interactions as 'Fraudulent' or 'Non-Fraudulent'. 
        Consider factors such as the nature of the inquiry, caller verification details, transaction history, and any red flags raised during the call.
        [Lisa Adams, contacts the call center claiming unauthorized transactions on her credit card statement. She demands a full refund, asserting that she has never visited the merchant in question.] =
        
        Call Interaction Classification: Fraudulent
        
        Explanation: 
        The situation described by Lisa Adams indicates a potential case of credit card fraud. There are several red flags in this interaction that suggest the customer might be reporting unauthorized transactions:

1. The caller claims unauthorized transactions - This is a common indicator of fraud, especially if the transactions were for places or services the customer did not recognize or didn't patronize according to their personal knowledge or documented transaction history (e.g., no visits to the

If you run into issues running the model on the Kaggle platform, refer to the Phi-3.5 Kaggle notebook for simple model inference. It comes with a prebuilt setup, code, and outputs.

3. Fine-Tuning Phi-3.5-mini-instruct

In this guide, we will learn to load and process the e-commerce text classification data. We will also load the model and tokenizer, evaluate the model on the test dataset before fine-tuning, build the trainer, fine-tune the model on the training set, and test the model after fine-tuning.

3.1 Setup

Launch a new Kaggle notebook with GPU acceleration enabled. Then make sure you have set your Hugging Face and Weights & Biases API tokens as environment variables using Kaggle Secrets.

Install all the necessary Python packages.

%%capture
%pip install -U bitsandbytes
%pip install -U transformers
%pip install -U accelerate
%pip install -U peft
%pip install -U trl

Log in to the Weights & Biases service with your API key, then initialize a new project.

import wandb

from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

wb_token = user_secrets.get_secret("wandb")

wandb.login(key=wb_token)
run = wandb.init(
    project='Fine-tune Phi-3.5-it on Ecommerce Text Classification', 
    job_type="training", 
    anonymous="allow"
)

Finally, load all the necessary Python packages and functions that we will use during fine-tuning and evaluation.

import numpy as np
import pandas as pd
import os
from tqdm import tqdm
import bitsandbytes as bnb
import torch
import torch.nn as nn
import transformers
from datasets import Dataset
from peft import LoraConfig, PeftConfig
from trl import SFTTrainer
from trl import setup_chat_format
from transformers import (AutoModelForCausalLM, 
                          AutoTokenizer, 
                          BitsAndBytesConfig, 
                          TrainingArguments, 
                          pipeline, 
                          logging)
from sklearn.metrics import (accuracy_score, 
                             classification_report, 
                             confusion_matrix)
from sklearn.model_selection import train_test_split

3.2 Loading and Processing the Dataset

Add the Ecommerce Text Classification dataset to your notebook as shown below. The dataset consists of two columns: the label (the e-commerce category) and the text description of the product.

Load the CSV file, process it, and view the first 5 rows.

df = pd.read_csv("/kaggle/input/ecommerce-text-classification/ecommerceDataset.csv")
df.columns = ["label","text"]
df.loc[:,'label'] = df.loc[:,'label'].str.replace('Clothing & Accessories','Clothing')
df.head()

The dataset consists of product descriptions and category labels.
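Before sampling, it can also help to check the class balance. A minimal, optional check on the DataFrame loaded above:

# Inspect how many products fall into each category
print(df["label"].value_counts())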

Shuffle the dataset and select only the first 2000 rows. This is a sample guide meant to speed up the fine-tuning process by fine-tuning the model on a limited set of samples.

Next, we will split the data into training, evaluation, and test datasets.

# Shuffle the DataFrame and select only 2000 rows
df = df.sample(frac=1, random_state=85).reset_index(drop=True).head(2000)

# Split the DataFrame
train_size = 0.8
eval_size = 0.1

# Calculate sizes
train_end = int(train_size * len(df))
eval_end = train_end + int(eval_size * len(df))

# Split the data
X_train = df[:train_end].copy()   # .copy() avoids SettingWithCopyWarning when we modify these splits later
X_eval = df[train_end:eval_end].copy()
X_test = df[eval_end:].copy()
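As a quick sanity check, you can confirm the 80/10/10 split sizes; with 2,000 rows this should print 1600, 200, and 200:

# Verify the sizes of the train, eval, and test splits
print(len(X_train), len(X_eval), len(X_test))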

We will create two functions. The generate_prompt function converts the text column into a prompt that includes the instruction, the text description, and the label. The generate_test_prompt function does the same but without the label.

# Define the prompt generation functions
def generate_prompt(data_point):
    return f"""
            Classify the E-commerce text into Electronics, Household, Books and Clothing.
text: {data_point["text"]}
label: {data_point["label"]}""".strip()

def generate_test_prompt(data_point):
    return f"""
            Classify the E-commerce text into Electronics, Household, Books and Clothing.
text: {data_point["text"]}
label: """.strip()

# Generate prompts for training and evaluation data
X_train.loc[:,'text'] = X_train.apply(generate_prompt, axis=1)
X_eval.loc[:,'text'] = X_eval.apply(generate_prompt, axis=1)

# Generate test prompts and extract true labels
y_true = X_test.loc[:,'label']
X_test = pd.DataFrame(X_test.apply(generate_test_prompt, axis=1), columns=["text"])
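Optionally, print one generated test prompt to confirm that the template ends with an empty label field for the model to complete:

# The test prompt should end with "label:" and no answer after it
print(X_test.iloc[0]["text"])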

Convert the pandas training and evaluation DataFrames into Hugging Face Datasets.

# Convert to datasets
train_data = Dataset.from_pandas(X_train[["text"]])
eval_data = Dataset.from_pandas(X_eval[["text"]])

train_data['text'][3]

The text consists of the system instruction, the product text description, and the label.

3.3 Loading the Model and Tokenizer

Load the 4-bit quantized model and the tokenizer from the Hugging Face Hub using the repository ID. Then, configure the model and tokenizer so they are ready to use.

base_model_name = "microsoft/Phi-3.5-mini-instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="float16",
)

model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype="float16",
    quantization_config=bnb_config, 
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(base_model_name)

tokenizer.pad_token_id = tokenizer.eos_token_id
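As an optional check that 4-bit quantization worked, you can print the model's memory footprint with the standard get_memory_footprint method from Transformers; for a 3.8B-parameter model in 4-bit, roughly 2-3 GB is expected:

# Print the approximate memory used by the quantized model weights
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")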

3.4 Evaluating the Model Before Fine-Tuning

We need to evaluate the base model before fine-tuning so we can determine whether fine-tuning improves the results. To do that, we will create a predict function that takes the test dataset and generates the e-commerce category based on the product text description.

def predict(test, model, tokenizer):
    y_pred = []
    categories = ["Electronics", "Household", "Books", "Clothing"]
    
    # Build the pipeline once, outside the loop, instead of recreating it per sample
    pipe = pipeline(task="text-generation", 
                    model=model, 
                    tokenizer=tokenizer, 
                    max_new_tokens=4, 
                    temperature=0.1)
    
    for i in tqdm(range(len(test))):
        prompt = test.iloc[i]["text"]
        result = pipe(prompt)
        answer = result[0]['generated_text'].split("label:")[-1].strip()
        
        # Determine the predicted category
        for category in categories:
            if category.lower() in answer.lower():
                y_pred.append(category)
                break
        else:
            y_pred.append("none")
    
    return y_pred

y_pred = predict(X_test, model, tokenizer)

We now have a list of predicted categories, which we will compare with the actual categories to generate a model evaluation report. The evaluate function takes the lists of predicted and actual categories and produces a detailed evaluation report. The report includes the overall accuracy, the accuracy for each individual category, a classification report, and a confusion matrix.

def evaluate(y_true, y_pred):
    labels = ["Electronics", "Household", "Books", "Clothing"]
    mapping = {label: idx for idx, label in enumerate(labels)}
    
    def map_func(x):
        return mapping.get(x, -1)  # Map to -1 if not found, but should not occur with correct data
    
    y_true_mapped = np.vectorize(map_func)(y_true)
    y_pred_mapped = np.vectorize(map_func)(y_pred)
    
    # Calculate accuracy
    accuracy = accuracy_score(y_true=y_true_mapped, y_pred=y_pred_mapped)
    print(f'Accuracy: {accuracy:.3f}')
    
    # Generate accuracy report
    unique_labels = set(y_true_mapped)  # Get unique labels
    
    for label in unique_labels:
        label_indices = [i for i in range(len(y_true_mapped)) if y_true_mapped[i] == label]
        label_y_true = [y_true_mapped[i] for i in label_indices]
        label_y_pred = [y_pred_mapped[i] for i in label_indices]
        label_accuracy = accuracy_score(label_y_true, label_y_pred)
        print(f'Accuracy for label {labels[label]}: {label_accuracy:.3f}')
        
    # Generate classification report
    class_report = classification_report(y_true=y_true_mapped, y_pred=y_pred_mapped, target_names=labels, labels=list(range(len(labels))))
    print('\nClassification Report:')
    print(class_report)
    
    # Generate confusion matrix
    conf_matrix = confusion_matrix(y_true=y_true_mapped, y_pred=y_pred_mapped, labels=list(range(len(labels))))
    print('\nConfusion Matrix:')
    print(conf_matrix)


evaluate(y_true, y_pred)

We achieved an average accuracy of 65%. Let's see whether fine-tuning can improve this score.

Accuracy: 0.645
Accuracy for label Electronics: 0.950
Accuracy for label Household: 0.531
Accuracy for label Books: 0.561
Accuracy for label Clothing: 0.658

Classification Report:
              precision    recall  f1-score   support

 Electronics       0.46      0.95      0.62        40
   Household       0.83      0.53      0.65        81
       Books       0.96      0.56      0.71        41
    Clothing       0.86      0.66      0.75        38

   micro avg       0.69      0.65      0.67       200
   macro avg       0.78      0.67      0.68       200
weighted avg       0.79      0.65      0.67       200


Confusion Matrix:
[[38  0  1  0]
 [33 43  0  2]
 [ 9  6 23  2]
 [ 2  3  0 25]]

3.5 Setting Up the Model

Extract the names of the linear modules from the model.

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16 bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)
modules = find_all_linear_names(model)
modules
['gate_up_proj', 'down_proj', 'qkv_proj', 'o_proj']

Use the linear module names to create the LoRA adapter. We will fine-tune only the LoRA adapter and leave the rest of the model frozen to save memory and speed up training.

Next, configure the model's hyperparameters for the Kaggle environment. You can change these parameters to improve accuracy or reduce training time depending on your machine. To learn about each hyperparameter, follow the Fine-Tuning Llama 2 tutorial.

We will now set up a supervised fine-tuning (SFT) trainer, providing the training and evaluation datasets, the LoRA configuration, the training arguments, the tokenizer, and the model.

output_dir="Phi-3.5-mini-instruct"

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=modules,
)

training_arguments = TrainingArguments(
    output_dir=output_dir,                    # directory to save and repository id
    num_train_epochs=1,                       # number of training epochs
    per_device_train_batch_size=1,            # batch size per device during training
    gradient_accumulation_steps=4,            # number of steps before performing a backward/update pass
    gradient_checkpointing=True,              # use gradient checkpointing to save memory
    optim="paged_adamw_8bit",
    logging_steps=1,                         
    learning_rate=2e-5,                       # learning rate, based on QLoRA paper
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,                        # max gradient norm based on QLoRA paper
    max_steps=-1,
    warmup_ratio=0.03,                        # warmup ratio based on QLoRA paper
    group_by_length=False,
    lr_scheduler_type="cosine",               # use cosine learning rate scheduler
    report_to="wandb",                  # report metrics to w&b
    eval_strategy="steps",              # save checkpoint every epoch
    eval_steps = 0.2
)

trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=train_data,
    eval_dataset=eval_data,
    peft_config=peft_config,
    dataset_text_field="text",
    tokenizer=tokenizer,
    max_seq_length=512,
    packing=False,
    dataset_kwargs={
    "add_special_tokens": False,
    "append_concat_token": False,
    }
)
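Before training, you can optionally confirm that only the LoRA adapter weights are trainable; when a peft_config is passed, SFTTrainer wraps the model in a PeftModel, which exposes print_trainable_parameters:

# Only the LoRA adapter parameters should be marked as trainable
trainer.model.print_trainable_parameters()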

3.6 Model Training

We will start the fine-tuning process using the .train() method.

# Train model
trainer.train()

The loss gradually decreased; with a few more epochs of training, we could have achieved even better results.
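If you want to inspect the loss curve outside of Weights & Biases, the trainer also keeps an in-memory log. A minimal sketch using the standard trainer.state.log_history attribute:

# Collect the training loss recorded at each logging step
train_losses = [log["loss"] for log in trainer.state.log_history if "loss" in log]
print(train_losses[:5], "...", train_losses[-5:])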

Finish the Weights & Biases run to generate the evaluation report.

wandb.finish()
model.config.use_cache = True

You can analyze the model's performance by visiting the Weights & Biases website, selecting your project, and viewing the training analytics.

Save the model and tokenizer locally so that we can later use them to merge the adapter with the base model and push it to a remote server.

# Save trained model and tokenizer
trainer.save_model(output_dir)
tokenizer.save_pretrained(output_dir)
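Since we fine-tuned with LoRA, trainer.save_model stores only the adapter weights and configuration, not the full base model. You can verify this by listing the output directory (exact file names may vary with your peft version):

import os

# Expect adapter files such as adapter_config.json and adapter_model.safetensors
print(os.listdir(output_dir))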

3.7 Testing the Model After Fine-Tuning

Let's test whether the model's performance improved after fine-tuning. We will first generate a list of predicted labels and supply it to the evaluate function along with the actual labels.

y_pred = predict(X_test, model, tokenizer)
evaluate(y_true, y_pred)

The model's accuracy improved from roughly 65% to 86%, a relative gain of about 32%, and the remaining performance metrics also look strong. Apart from the Books category, the model was able to identify categories quite accurately.

Accuracy: 0.860
Accuracy for label Electronics: 0.825
Accuracy for label Household: 0.926
Accuracy for label Books: 0.683
Accuracy for label Clothing: 0.947

Classification Report:
              precision    recall  f1-score   support

 Electronics       0.97      0.82      0.89        40
   Household       0.88      0.93      0.90        81
       Books       0.90      0.68      0.78        41
    Clothing       0.88      0.95      0.91        38

   micro avg       0.90      0.86      0.88       200
   macro avg       0.91      0.85      0.87       200
weighted avg       0.90      0.86      0.88       200


Confusion Matrix:
[[33  6  1  0]
 [ 1 75  2  3]
 [ 0  3 28  2]
 [ 0  1  0 36]]

Make sure to save your Kaggle notebook by clicking the "Save Version" button in the top right. Then, select the Quick Save option and change the save output option to include the model files and all of the code.

If you run into issues while fine-tuning the model, refer to the Fine-tune Phi-3.5 for Text Classification Kaggle notebook.

4. Merging and Exporting the Fine-Tuned Model

To merge the model and export it to Hugging Face, we will first create a new Kaggle notebook and add the saved notebook as input to access the fine-tuned model and tokenizer.

Adding another Kaggle notebook is similar to adding a dataset or a model. Click the "+ Add Input" button, paste the link to the notebook, and press the Add button.

Set the Hugging Face API token as an environment variable using Kaggle Secrets, and install all the Python packages needed to load and merge the model.

%%capture
%pip install -U bitsandbytes
%pip install -U transformers
%pip install -U accelerate
%pip install -U peft

Set the base model and fine-tuned model variables using the base model ID and the location of the fine-tuned adapter.

# Model
base_model = "microsoft/Phi-3.5-mini-instruct"
fine_tuned_model = "/kaggle/input/fine-tune-phi-3-5-for-text-classification/Phi-3.5-mini-instruct/"

Load the full base model along with the tokenizer from the Hugging Face Hub.

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import PeftModel
import torch
# Reload tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(base_model)

base_model_reload = AutoModelForCausalLM.from_pretrained(
        base_model,
        return_dict=True,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map="auto",
        trust_remote_code=True,
)

Merge the base model with the adapter using just two lines of code.

# Merge adapter with base model
model = PeftModel.from_pretrained(base_model_reload, fine_tuned_model)
model = model.merge_and_unload()

To test whether the model was merged successfully, we will create a text generation pipeline with the merged model and tokenizer, then pass it a sample prompt to generate a response.

text = "Inalsa Dazzle Glass Top, 3 Burner Gas Stove with Rust Proof Powder Coated Body, Black Toughened Glass Top, 2 Medium and 1 Small High Efficiency Brass Burners, Aluminum Mixing Tubes, Powder Coated Body, Inbuilt Stainless Steel Drip Trays, 360 degree Swivel Nozzle,Bigger Legs to Facilitate Cleaning Under Cooktop"
prompt = f"""Classify the E-commerce text into Electronics, Household, Books and Clothing.
text: {text}
label: """.strip()

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipe(prompt, max_new_tokens=4, do_sample=True, temperature=0.1)
print(outputs[0]["generated_text"].split("label: ")[-1].strip())1].strip())

The model accurately predicted the product category:

Household

We will save the complete model and tokenizer locally by providing a model directory.

model_dir = "Phi-3.5-mini-instruct-Ecommerce-Text-Classification"
model.save_pretrained(model_dir)
tokenizer.save_pretrained(model_dir)

Push the saved model to the Hugging Face Hub. First, log in to the Hugging Face CLI using the API key pulled from Kaggle Secrets, then use the push_to_hub function to push the model and tokenizer.

from huggingface_hub import login
from kaggle_secrets import UserSecretsClient
user_secrets = UserSecretsClient()

hf_token = user_secrets.get_secret("HUGGINGFACE_TOKEN")
login(token = hf_token)


model.push_to_hub(model_dir, use_temp_dir=False)
tokenizer.push_to_hub(model_dir, use_temp_dir=False)

This creates a new model repository and pushes all the files to it on Hugging Face.
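Once pushed, anyone can load the merged model straight from the Hub. A minimal sketch, assuming a hypothetical username your-username (replace it with your own):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Hypothetical repository ID; substitute your own Hugging Face username
repo_id = "your-username/Phi-3.5-mini-instruct-Ecommerce-Text-Classification"

model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(repo_id)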

If you run into issues while merging and exporting the model, check out the Phi-3.5 Fine-tuning Adapter to Full Model Kaggle notebook.

5. Conclusion

Large language models are becoming smaller and more efficient, reducing operating costs and improving adaptability across domains. In this tutorial, we explored the Phi-3.5 Mini, Vision, and MoE models. We also learned how to access the Phi-3.5 Mini model using Kaggle Notebooks.

Next, we fine-tuned the model on classification data and evaluated its performance, achieving a significant improvement in accuracy from 65% to 86%, a remarkable result. Performance like this cannot be achieved through RAG or function calling alone.

Finally, we merged the LoRA adapter with the base model and exported the full model to the Hugging Face Hub, where it is available for everyone to use.


Original article: Fine-Tuning Phi-3.5 on E-Commerce Classification Dataset
