A Guide to Building a Fine-Tuning Dataset for Qwen-2-VL
Fine-tuning a large language model (LLM) for a specialized task usually requires a carefully curated dataset, especially when working with a vision-language model such as Qwen-2-VL. Qwen-2-VL is a powerful tool for understanding and interpreting both text and images, making it well suited to scenarios such as document analysis and visual question answering (VQA). Creating a custom dataset that meets the requirements of such a model, however, can be challenging.
In this article, I will walk you through the entire process of creating a vision-language dataset for fine-tuning Qwen-2-VL using LLaMA-Factory, an open-source library designed for training and fine-tuning models. We will cover everything from preparing the data to uploading it to Hugging Face and finally integrating it into the fine-tuning script.
Prerequisites
Before diving in, make sure you have:
- Basic knowledge of Python programming.
- OpenAI and Hugging Face accounts, along with their API tokens.
- A basic understanding of fine-tuning, LLaMA-Factory, and Qwen-2-VL. If you are unfamiliar with these tools, check their respective documentation:
- Qwen-2-VL fine-tuning script
- LLaMA-Factory
1. Setting Up Your Environment
Before starting, make sure the required libraries are installed:
pip install openai pillow pandas datasets huggingface_hub
For this example, we are building a dataset for document VQA, where each image is a contract document and the dataset contains question-answer pairs derived from those images.
2. Preparing the Images
- Store all contract images in a folder named images.
- Resize the images to make sure the model can handle them:
from PIL import Image

def resize_image(image_path, max_size=1024):
    with Image.open(image_path) as img:
        aspect_ratio = img.width / img.height
        new_width = max_size if img.width > img.height else int(max_size * aspect_ratio)
        new_height = max_size if img.height > img.width else int(max_size / aspect_ratio)
        resized_img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
        resized_img.save(image_path)
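For example, to resize a single file in place (contract_001.png is a hypothetical filename):
# Resize one image in place; loop over the folder to process them all
resize_image("images/contract_001.png", max_size=1024)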
3. Generating Question-Answer Pairs
For each image, we use an LLM to generate question-answer pairs:
- Example question: "What is the effective date of the contract?"
- Example answer: "January 1, 2023"
Here is the script that processes the images and generates the QA pairs:
import os
import csv
import base64
from PIL import Image
from openai import OpenAI

OPENAI_API_KEY = ""

# Initialize OpenAI client
client = OpenAI(api_key=OPENAI_API_KEY)

def resize_image(image_path, max_size=1024):
    """
    Resize the image while maintaining aspect ratio.
    Overwrites the original image with the resized version.
    """
    with Image.open(image_path) as img:
        # Calculate the aspect ratio
        aspect_ratio = img.width / img.height
        # Determine new dimensions while maintaining aspect ratio
        if img.width > img.height:
            new_width = max_size
            new_height = int(max_size / aspect_ratio)
        else:
            new_height = max_size
            new_width = int(max_size * aspect_ratio)
        # Resize and save the image back to the original path
        resized_img = img.resize((new_width, new_height), Image.Resampling.LANCZOS)
        resized_img.save(image_path)
        print(f"Resized and saved: {image_path} to {new_width}x{new_height}")

def generate_question_answer_pairs(image_base64):
    """
    Call the GPT-4o model to generate question-answer pairs from the given image.
    """
    # Prepend the required prefix for base64 image data
    image_data_url = f"data:image/png;base64,{image_base64}"
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": """# ROLE: Vision language model dataset generator
# Mission: Analyze the given image to generate a list of question-answer pairs based on the text and information present.
The questions should focus on key information present or not present in the document.
For each question, if the answer is present in the document, provide the exact answer.
If the information is not present or cannot be identified from the document, use the answer "Not Present."
Avoid using specific names of individuals when possible.
The goal is to create both positive and negative answer sets to train the model to understand and distinguish between available and unavailable information.
# Steps
1. Examine the document for key pieces of information.
2. Identify the following elements where applicable. For example:
   - Organization names.
   - Titles and roles.
   - Dates (effective date, expiration date, etc.).
   - Signatures.
   - Specific contract terms, phrases, or numbers.
3. Formulate questions based on these identified elements.
4. Determine the answer for each question, whether it is directly available or "Not Present" if absent.
# Output Format
- CSV format with two columns: "Question" and "Answer".
- Each row should represent one question-answer pair.
- Format each entry as follows: "Question","Answer"
- Ensure the output is structured with each question-answer pair on a new line.
- Enclose within ```csv and ``` for post processing.
"""
                    }
                ]
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": image_data_url
                        }
                    }
                ]
            }
        ],
        temperature=1,
        max_tokens=2048,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        response_format={"type": "text"}
    )
    # Extract and clean the response text
    response_text = response.choices[0].message.content
    # Remove the ```csv delimiters
    clean_csv = response_text.replace("```csv", "").replace("```", "").strip()
    # Collapse any accidental triple quotes into a single quote character
    clean_csv = clean_csv.replace('"""', '"')
    # Filter out header lines like "Question","Answer"
    cleaned_lines = [
        line for line in clean_csv.splitlines()
        if not line.lower().strip().startswith(("question", "answer"))
    ]
    return "\n".join(cleaned_lines)

def get_processed_images(output_csv_path):
    """
    Reads the output CSV file and returns a set of image names that have already been processed.
    """
    processed_images = set()
    if os.path.exists(output_csv_path):
        with open(output_csv_path, mode='r', newline='') as csvfile:
            csv_reader = csv.reader(csvfile)
            next(csv_reader, None)  # Skip the header; None guards against an empty file
            for row in csv_reader:
                if row:
                    processed_images.add(row[0])  # Image name is in the first column
    return processed_images

def process_images_in_folder(folder_path, output_csv_path):
    """
    Scan a folder for images, resize and process each one with the GPT-4o model,
    and save the results into a CSV file.
    Skips images that have already been processed.
    """
    # Get the set of processed images
    processed_images = get_processed_images(output_csv_path)
    # Open or create the CSV file for appending data
    with open(output_csv_path, mode='a', newline='') as csvfile:
        csv_writer = csv.writer(csvfile)
        # Write the header if the file is empty
        if os.stat(output_csv_path).st_size == 0:
            csv_writer.writerow(['Image Name', 'Question', 'Answer'])
        # Iterate through each image in the specified folder
        for image_filename in os.listdir(folder_path):
            if image_filename.lower().endswith(('.png', '.jpg', '.jpeg')):
                # Skip the image if it has already been processed
                if image_filename in processed_images:
                    print(f"Skipping already processed image: {image_filename}")
                    continue
                image_path = os.path.join(folder_path, image_filename)
                # Resize the image to a manageable size
                resize_image(image_path, max_size=1024)
                # Convert the resized image to base64
                with open(image_path, "rb") as image_file:
                    image_base64 = base64.b64encode(image_file.read()).decode('utf-8')
                # Generate question-answer pairs using the model
                qa_pairs = generate_question_answer_pairs(image_base64)
                # Split the clean CSV data into rows and write each row with the image name
                for row in qa_pairs.splitlines():
                    if ',' not in row:
                        continue  # Skip malformed lines without a comma
                    question, answer = row.split(',', 1)
                    csv_writer.writerow([image_filename, question.strip(), answer.strip()])

# Define folder path and output CSV path
input_folder_path = "images"
output_csv_path = "output.csv"

# Process the images and generate the CSV
process_images_in_folder(input_folder_path, output_csv_path)
This Python code processes a folder of images, resizes them, uses GPT-4o to generate question-answer pairs from each image's content, and saves the results to a CSV file. It follows a systematic approach to ensure that no image is processed more than once, resizing each image to a manageable size before sending it to the model. The image-to-text conversion works by base64-encoding each image, sending the encoded data to GPT-4o, and parsing the response to extract the question-answer pairs.
Here is a detailed breakdown of what it does:
- Dependencies: the code imports several libraries, including os for file and directory handling, csv for reading and writing CSV files, base64 for encoding images, PIL for image processing, and OpenAI for interacting with the OpenAI API.
- API initialization: it initializes the OpenAI client with an API key to enable interaction with the GPT-4o model.
- resize_image function: takes an image path, resizes the image to a maximum dimension (1024 pixels) while maintaining its aspect ratio, and overwrites the original image.
- generate_question_answer_pairs function: sends the base64-encoded image to the GPT-4o model. The model is instructed to analyze the image and generate question-answer pairs, focusing on key information present in the image, or marking the answer as "Not Present" when the information is missing. The output is structured as CSV data.
- get_processed_images function: reads the output CSV to collect the set of images that have already been processed, preventing redundant work.
- process_images_in_folder function: iterates over the images in the specified folder, resizes and encodes each one, generates question-answer pairs with the GPT-4o model, and appends the results to the CSV file. It skips images that have already been processed, keeping the run efficient.
- File paths: the folder containing the images and the output CSV file are defined as input_folder_path and output_csv_path.
- Execution: the code processes all images in the specified folder, generates question-answer pairs, and saves them to the CSV file.
This approach is well suited to building question-answer datasets from image content, especially for tasks that involve extracting information from visual documents such as contracts and IDs.
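For reference, after a run output.csv looks something like this (rows are illustrative, following the example question and answer above):
Image Name,Question,Answer
contract_001.png,"What is the effective date of the contract?","January 1, 2023"
contract_001.png,"What is the expiration date of the contract?",Not Present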
4. Uploading the Dataset to the Hugging Face Hub
Now that we have the dataset in CSV format, let's prepare it for upload to the Hugging Face Hub.
import os
import pandas as pd
from datasets import Dataset, Features, Image, Value
from huggingface_hub import HfApi
Read the CSV and group the data:
# Read the CSV file
csv_file_path = 'output.csv' # Replace with the path to your uploaded CSV file
images_folder_path = 'images' # Replace with the path to your images folder
df = pd.read_csv(csv_file_path)
grouped_data = df.groupby('Image Name')
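A quick check confirms the grouping (one group per image):
# Each group corresponds to one image and its question-answer pairs
print(f"{grouped_data.ngroups} images, {len(df)} QA pairs in total")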
Prepare the data for the Hugging Face dataset:
data_list = []
for image_name, group in grouped_data:
    messages = []
    # Build the image path (no need to open the image with PIL here)
    image_path = os.path.normpath(os.path.join(images_folder_path, image_name))
    for _, row in group.iterrows():
        # Append an <image> tag to every user question
        user_message = row['Question'] + "<image>"
        messages.append({"role": "user", "content": user_message})
        messages.append({"role": "assistant", "content": row['Answer']})
    entry = {
        "messages": messages,
        "images": [image_path] * len(group)  # Repeat the image path once per question to match the <image> tags
    }
    data_list.append(entry)

# Define dataset features
features = Features({
    'messages': [{'role': Value('string'), 'content': Value('string')}],  # Each message is a role/content pair
    'images': [Image()]  # The 'images' feature is a list of Image type
})

# Convert to a Hugging Face Dataset
dataset = Dataset.from_list(data_list, features=features)
Inserting the <image> tag is an important step. Without it, you will run into a "ValueError: The number of images does not match the number of <image> tokens" error, which means the number of images provided does not match the number of <image> tokens in the dataset messages. This usually happens when the data structure does not line up as expected during processing.
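An optional sanity check before uploading can catch this mismatch early. A minimal sketch that counts <image> tags against the images list for each entry:
# Verify that every entry has exactly one <image> tag per image
for i, entry in enumerate(data_list):
    n_tags = sum(m["content"].count("<image>") for m in entry["messages"])
    n_images = len(entry["images"])
    if n_tags != n_images:
        print(f"Entry {i}: {n_tags} <image> tags vs {n_images} images")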
Upload to the Hugging Face Hub:
# Define Hugging Face repository details
dataset_repo_id = "hfusername/DOCVQA-dataset" # Replace with your Hugging Face username and dataset name
hf_token = "" # Replace with your Hugging Face token
# Push the dataset to Hugging Face Hub
api = HfApi()
api.create_repo(repo_id=dataset_repo_id, token=hf_token, repo_type="dataset", exist_ok=True, private=True)
dataset.push_to_hub(dataset_repo_id, token=hf_token)
print(f"Dataset has been uploaded to {dataset_repo_id} on the Hugging Face Hub.")
# Optional: Print a sample to verify the structure
print("\nSample entry from the dataset:")
print(dataset[1])
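To double-check the upload, you can load the dataset back from the Hub (a sketch; the repo was created as private, so pass your token — depending on your datasets version the auth argument is token or use_auth_token):
from datasets import load_dataset

# Load the uploaded dataset back from the Hub to verify its structure
ds = load_dataset(dataset_repo_id, split="train", token=hf_token)
print(ds)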
5. Fine-Tuning Qwen-2-VL with LLaMA-Factory
Now that the dataset is ready and uploaded, let's configure the fine-tuning process using LLaMA-Factory.
Colab notebook: link
Update dataset_info.json: in LLaMA-Factory/data/dataset_info.json, add:
"mycustomdataset": {
"hf_hub_url": "your_username/DOCVQA-dataset",
"formatting": "sharegpt",
"columns": {
"messages": "messages",
"images": "images"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant"
}
}
Fine-tuning script: update the dataset parameter in the script's args to fine-tune Qwen-2-VL:
args = dict(
    stage="sft",
    do_train=True,
    model_name_or_path="Qwen/Qwen2-VL-2B-Instruct",
    dataset="mycustomdataset",
    template="qwen2_vl",
    finetuning_type="lora",
    lora_target="all",
    output_dir="qwen2vl_lora",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    logging_steps=10,
    warmup_ratio=0.1,
    save_steps=1000,
    learning_rate=5e-5,
    num_train_epochs=3.0,
    max_samples=500,
    max_grad_norm=1.0,
    loraplus_lr_ratio=16.0,
    fp16=True,
    use_liger_kernel=True,
)
Run the script: execute the script to launch the fine-tuning process, as sketched below. Adjust the hyperparameters as needed for optimal performance.
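If you are following the LLaMA-Factory notebook pattern, one way to launch training is to dump args to a JSON config and hand it to llamafactory-cli (a sketch, assuming LLaMA-Factory is installed and the CLI is on your PATH):
import json

# Write the training arguments to a JSON config file
with open("train_qwen2vl.json", "w", encoding="utf-8") as f:
    json.dump(args, f, indent=2)

# Then, from a shell or notebook cell:
#   llamafactory-cli train train_qwen2vl.json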
6. Closing Thoughts
By following this guide, you now have a custom vision-language dataset and a setup for fine-tuning the Qwen-2-VL model with LLaMA-Factory. The process can be adapted to a wide range of vision-language tasks beyond document VQA, making it a versatile approach for building specialized models.