Multimodal RAG in Practice with Milvus

How to build your own multimodal RAG system with Milvus and use the GPT-4o language model to refine the output.

admin

Nov 29, 2024 • 13 min read

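Being limited to a single data format is no longer enough. As businesses rely more and more on information to make critical decisions, they need to be able to compare data across formats. Fortunately, traditional AI systems restricted to a single data type have given way to multimodal systems that can understand and process complex information.

Multimodal search and multimodal retrieval-augmented generation (RAG) systems have recently made great progress in this area. These systems process several types of data, including text, images and audio, to provide context-aware responses.

In this post, we discuss how developers can build their own multimodal RAG system with Milvus. We also walk you through building such a system: one that handles text and image data, performs similarity search, and uses a language model to refine the output. Let's get started.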

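1. What is Milvus?

A vector database is a special type of database for storing, indexing and retrieving vector embeddings. Vector embeddings are mathematical representations of data that let you compare items not only for equality but also for semantic similarity. Milvus is an open-source, high-performance vector database built for scale. You can find it on GitHub, where it is released under the Apache-2.0 license and has more than 30,000 stars.

Milvus gives developers a flexible solution for managing and querying large-scale vector data. Its efficiency makes it a good fit for developers building applications on deep learning models, such as retrieval-augmented generation (RAG), multimodal search, recommendation engines and anomaly detection.

Milvus offers several deployment options. Milvus Lite is a lightweight version that runs inside a Python application and is well suited to prototyping in a local environment. Milvus Standalone and Milvus Distributed are the scalable, production-ready options.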
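From the client's point of view, these options differ mainly in the address you connect to. The sketch below maps each mode to the kind of URI that `MilvusClient` accepts. The helper names (`milvus_uri`, `connect`), the local file name and the server address `http://localhost:19530` are assumptions for illustration; a managed Zilliz Cloud instance would use the endpoint and token shown in its console instead.

```python
def milvus_uri(mode: str) -> str:
    """Return an illustrative connection URI for a deployment mode (hypothetical helper)."""
    uris = {
        # Milvus Lite: a local file path; the data lives in that file.
        "lite": "./milvus_demo.db",
        # Standalone or Distributed: a server address (assumed to run locally).
        "server": "http://localhost:19530",
    }
    if mode not in uris:
        raise ValueError(f"unknown mode: {mode!r}")
    return uris[mode]


def connect(mode: str = "lite"):
    """Create a client for the chosen mode; pymilvus is imported only when needed."""
    from pymilvus import MilvusClient

    return MilvusClient(uri=milvus_uri(mode))
```

Because only the URI changes, code prototyped against Milvus Lite can usually be pointed at a server later without rewriting the rest of the application.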

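2. Multimodal RAG: beyond text

Before building the system, it helps to understand traditional text-based RAG and how it evolved into multimodal RAG.

Retrieval-augmented generation (RAG) is a method that retrieves contextual information from external sources and uses it to get more accurate output from a large language model (LLM). Traditional RAG is an effective strategy for improving LLM output, but it is still limited to text data. In many real-world applications the data goes beyond text: images, charts and other modalities provide critical context.

Multimodal RAG addresses this limitation by allowing different data types to be used, giving the LLM better context.

In short, in a multimodal RAG system the retrieval component searches for relevant information across data modalities, and the generation component produces more accurate results based on what was retrieved.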
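The split between the two components can be sketched in a few lines. This is a toy illustration, not the system built later in this post: the `Item` class, the two-dimensional embeddings and the function names are all made up for the example, and a real system would call an embedding model and an LLM instead.

```python
from dataclasses import dataclass


@dataclass
class Item:
    modality: str           # e.g. "text" or "image"
    content: str            # the text itself, or a path to an image
    embedding: list[float]  # vector in a shared embedding space (made-up values)


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: list[float], items: list[Item], k: int = 2) -> list[Item]:
    """Retrieval component: rank items of any modality by similarity to the query."""
    return sorted(items, key=lambda it: dot(query_vec, it.embedding), reverse=True)[:k]


def build_prompt(question: str, context: list[Item]) -> str:
    """Input for the generation component: the question plus the retrieved context."""
    lines = [f"[{it.modality}] {it.content}" for it in context]
    return "Context:\n" + "\n".join(lines) + f"\n\nQuestion: {question}"


items = [
    Item("text", "Leopard-print phone case, fits model X", [0.9, 0.1]),
    Item("image", "images/leopard_case.jpg", [0.8, 0.3]),
    Item("text", "Stainless steel kettle", [0.1, 0.9]),
]
top = retrieve([1.0, 0.0], items, k=2)
prompt = build_prompt("Which product matches a leopard theme?", top)
```

The point of the sketch is that text and image items sit in one list and are ranked by the same similarity function, which only works when both modalities are embedded into the same vector space.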

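Vector embeddings and similarity search are the two basic concepts behind multimodal RAG. Let's look at both.

3. Vector embeddings

As mentioned earlier, vector embeddings are numerical representations of data. Machines use this representation to capture the semantic meaning of different data types such as text, images and audio.

In natural language processing (NLP), document chunks are converted into vectors, and semantically similar words are mapped to nearby points in the vector space. The same applies to images, where the embedding represents semantic features. This lets us capture properties such as color, texture and object shape in numerical form.

The main goal of using vector embeddings is to preserve the relationships and similarities between different pieces of data.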
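To make "nearby points" concrete, here is a small sketch with hand-written three-dimensional vectors. The numbers are invented for illustration; a real embedding model produces vectors with hundreds of dimensions, and the helper names are hypothetical.

```python
import math

# Hypothetical embeddings: "cat" and "kitten" point in similar directions, "car" does not.
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "kitten": [0.85, 0.9, 0.15],
    "car": [0.1, 0.2, 0.95],
}


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word: str) -> str:
    """Return the other word whose embedding is closest in direction."""
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine(embeddings[word], embeddings[w]))
```

With these values, `most_similar("cat")` picks `"kitten"`, because their vectors are nearly parallel while the vector for `"car"` points elsewhere.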

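4. Similarity search

Similarity search finds the items in a dataset that are most like a given query. In the context of vector embeddings, it finds the vectors in the dataset that are closest to the query vector.

The following measures are commonly used to quantify the similarity between vectors:

Euclidean distance: the straight-line distance between two points in the vector space.
Cosine similarity: the cosine of the angle between two vectors (it focuses on their direction rather than their magnitude).
Dot product: the sum of the products of corresponding elements.

Which similarity measure to choose usually depends on the application's data and on how the developer approaches the problem.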
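The three measures above can disagree about which vector is "closest". The following sketch, written with only the standard library, shows a case where Euclidean distance and cosine similarity rank two candidates differently; the vectors are chosen by hand for illustration.

```python
import math


def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot_product(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


a = [1.0, 2.0]
b = [2.0, 4.0]  # same direction as a, twice the length
c = [2.0, 1.0]  # same length as a, different direction

# Euclidean distance prefers c (smaller is closer) ...
closer_by_distance = "c" if euclidean_distance(a, c) < euclidean_distance(a, b) else "b"
# ... while cosine similarity prefers b (larger is more similar).
closer_by_cosine = "b" if cosine_similarity(a, b) > cosine_similarity(a, c) else "c"
```

This is why the choice of metric should match how the embedding model was trained: if vector length carries no meaning, cosine similarity is usually the safer default.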

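Running similarity search over large datasets requires a great deal of compute and resources. This is where approximate nearest neighbor (ANN) algorithms come in. ANN algorithms trade a small amount of accuracy for a significant speedup, which makes them a good choice for large-scale applications.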
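The accuracy-for-speed trade can be illustrated with a toy partition-based search in the spirit of IVF indexes. This is a simplified sketch, not how Milvus implements its indexes: the centroids are fixed by hand here, whereas real indexes learn them from the data, and all function names are hypothetical.

```python
import math
import random


def exact_search(query, vectors):
    """Brute force: compare the query with every vector and return the best index."""
    return min(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))


def build_partitions(vectors, centroids):
    """Assign each vector index to the bucket of its nearest centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for i, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: math.dist(v, centroids[c]))
        buckets[nearest].append(i)
    return buckets


def approximate_search(query, vectors, centroids, buckets, nprobe=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in buckets[c]]
    if not candidates:
        return None, 0
    best = min(candidates, key=lambda i: math.dist(query, vectors[i]))
    return best, len(candidates)


random.seed(0)
vectors = [(random.random(), random.random()) for _ in range(1000)]
centroids = [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]
buckets = build_partitions(vectors, centroids)

query = (0.3, 0.3)
exact = exact_search(query, vectors)
approx, scanned = approximate_search(query, vectors, centroids, buckets, nprobe=1)
```

With `nprobe=1` only about a quarter of the vectors are scanned. The result can differ from the exact answer when the true nearest neighbor sits just across a bucket boundary; raising `nprobe` scans more buckets and recovers accuracy at the cost of speed.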

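Milvus uses advanced ANN algorithms, including HNSW and DiskANN, to run efficient similarity search over large embedding datasets, so developers can find relevant data points quickly. It also supports other index types such as IVF and CAGRA.

5. Building multimodal RAG with Milvus

Now that we have covered the concepts, it is time to build a multimodal RAG system with Milvus. In this example we use Milvus Lite (the lightweight version of Milvus, well suited to experimentation and prototyping) for vector storage and retrieval, the Visualized BGE model for image processing and embedding, and GPT-4o for reranking the results.

5.1 Prerequisites

First, you need a Milvus instance to store the data. You can set up Milvus Lite with pip, run a local instance with Docker, or sign up for a free hosted Milvus account through Zilliz Cloud.

Second, you need an LLM for the RAG pipeline, so head to OpenAI and get an API key. The free tier is enough to run this code.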
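One common way to keep the key out of the source code is to read it from an environment variable. The sketch below is an optional convenience, not part of the original tutorial; the helper name and the variable name `OPENAI_API_KEY` are assumptions.

```python
import os


def load_openai_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the API key from the environment and fail with a clear message if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running the tutorial.")
    return key
```

The returned value can then be assigned to the `openai_api_key` variable used later in this post instead of pasting the key into the script.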

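Next, create a new directory and a Python virtual environment, or follow whatever steps you normally use to manage Python.

For this tutorial you also need to install the pymilvus library (the official Python SDK for Milvus) and a few common tools.

5.2 Setting up Milvus Lite

Install the pymilvus package with the following command: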

pip install -U pymilvus

Install the dependencies:

pip install --upgrade pymilvus openai datasets opencv-python timm einops ftfy peft tqdm

git clone https://github.com/FlagOpen/FlagEmbedding.git
pip install -e FlagEmbedding

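5.3 Downloading the data

The following commands download the sample data and extract it to the local folder ./images_folder, which contains:

Images: a subset of Amazon Reviews 2023 with about 900 images from the "Appliances", "Cell Phones and Accessories" and "Electronics" categories.
Sample query image: leopard.jpg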

wget https://github.com/milvus-io/bootcamp/releases/download/data/amazon_reviews_2023_subset.tar.gz
tar -xzf amazon_reviews_2023_subset.tar.gz

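5.4 Loading the embedding model

We will use the Visualized BGE model "bge-visualized-base-en-v1.5" to generate embeddings for both images and text.

First download the weights from HuggingFace: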

wget https://huggingface.co/BAAI/bge-visualized/resolve/main/Visualized_base_en_v1.5.pth

Then build an encoder:

import torch
from visual_bge.modeling import Visualized_BGE


class Encoder:
    def __init__(self, model_name: str, model_path: str):
        self.model = Visualized_BGE(model_name_bge=model_name, model_weight=model_path)
        self.model.eval()

    def encode_query(self, image_path: str, text: str) -> list[float]:
        with torch.no_grad():
            query_emb = self.model.encode(image=image_path, text=text)
        return query_emb.tolist()[0]

    def encode_image(self, image_path: str) -> list[float]:
        with torch.no_grad():
            query_emb = self.model.encode(image=image_path)
        return query_emb.tolist()[0]


model_name = "BAAI/bge-base-en-v1.5"
model_path = "./Visualized_base_en_v1.5.pth"  # Change to your own value if using a different model path
encoder = Encoder(model_name, model_path)

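5.6 Generating image embeddings

This section shows how to load the sample images and their embeddings into our database.

We need to create an embedding for every image in the dataset. Load all images from the data directory and convert them to embeddings.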

import os
from glob import glob

from tqdm import tqdm

data_dir = (
    "./images_folder"  # Change to your own value if using a different data directory
)
image_list = glob(
    os.path.join(data_dir, "images", "*.jpg")
)  # We will only use images ending with ".jpg"

image_dict = {}
for image_path in tqdm(image_list, desc="Generating image embeddings: "):
    try:
        image_dict[image_path] = encoder.encode_image(image_path)
    except Exception as e:
        # Report the reason so unreadable or corrupt files are easy to spot
        print(f"Failed to generate embedding for {image_path}: {e}. Skipped.")
        continue

print("Number of encoded images:", len(image_dict))
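The search step in the next section refers to `milvus_client` and `collection_name`, which this text never defines. The following is a minimal sketch of how the embeddings might be stored, assuming Milvus Lite with a local database file. The file name `./milvus_demo.db`, the collection name `multimodal_rag_demo` and both helper names are assumptions, and the sketch relies on `"vector"` being the default vector field name of `create_collection`.

```python
def build_rows(image_dict: dict) -> list[dict]:
    """Turn {image_path: embedding} into one row per image."""
    return [{"image_path": path, "vector": vec} for path, vec in image_dict.items()]


def store_embeddings(
    image_dict: dict,
    uri: str = "./milvus_demo.db",                 # assumed Milvus Lite file
    collection_name: str = "multimodal_rag_demo",  # assumed collection name
):
    """Create a collection and insert all image embeddings; returns the client."""
    from pymilvus import MilvusClient  # imported here so build_rows works without it

    rows = build_rows(image_dict)
    dim = len(rows[0]["vector"])  # all embeddings share one dimension

    client = MilvusClient(uri=uri)
    client.create_collection(
        collection_name=collection_name,
        dimension=dim,
        metric_type="COSINE",       # matches the metric used at search time
        auto_id=True,               # let Milvus generate primary keys
        enable_dynamic_field=True,  # allows the extra "image_path" field
    )
    client.insert(collection_name=collection_name, data=rows)
    return client
```

With the variables from this tutorial, it could be used as `collection_name = "multimodal_rag_demo"` followed by `milvus_client = store_embeddings(image_dict, collection_name=collection_name)`.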

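Next, we first search for relevant images with a multimodal query, then use an LLM service to rerank the retrieved results and pick the best one together with an explanation.

5.7 Running the multimodal search

We are now ready to run a multimodal search with a query composed of an image and a text instruction.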

query_image = os.path.join(
    data_dir, "leopard.jpg"
)  # Change to your own query image path
query_text = "phone case with this image theme"

query_vec = encoder.encode_query(image_path=query_image, text=query_text)

# milvus_client and collection_name must refer to the collection that holds
# the image embeddings, each stored with an "image_path" field
search_results = milvus_client.search(
    collection_name=collection_name,
    data=[query_vec],
    output_fields=["image_path"],
    limit=9,  # Max number of search results to return
    search_params={"metric_type": "COSINE", "params": {}},  # Search parameters
)[0]

retrieved_images = [hit.get("entity").get("image_path") for hit in search_results]
print(retrieved_images)

The result looks like this:

['./images_folder/images/518Gj1WQ-RL._AC_.jpg', 
'./images_folder/images/41n00AOfWhL._AC_.jpg'
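Besides the image path, each hit also carries a score. The sketch below pairs every path with its score, assuming each hit behaves like a dictionary with `id`, `distance` and `entity` keys, as the `.get("entity")` access above suggests. The helper name and the example values are made up.

```python
def summarize_hits(hits: list[dict]) -> list[tuple[str, float]]:
    """Return (image_path, score) pairs in ranked order."""
    return [(hit["entity"]["image_path"], hit["distance"]) for hit in hits]


# Hand-written example in the shape of one query's results (values are invented).
example_hits = [
    {"id": 1, "distance": 0.83, "entity": {"image_path": "./images_folder/images/a.jpg"}},
    {"id": 2, "distance": 0.79, "entity": {"image_path": "./images_folder/images/b.jpg"}},
]
summary = summarize_hits(example_hits)
```

With the COSINE metric a larger score means a closer match, so printing the scores next to the paths is a quick way to see how sharply the ranking separates the top results.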

5.8 Reranking the results with GPT-4o

Now we use GPT-4o to rank the retrieved images and find the best match. The LLM also explains why it ranked them that way.

Create a panoramic view:

import cv2
import numpy as np
from PIL import Image

img_height = 300
img_width = 300
row_count = 3


def create_panoramic_view(query_image_path: str, retrieved_images: list) -> np.ndarray:
    """
    creates a panoramic view image: the query image in a left-hand column,
    followed by a 3x3 grid of the retrieved images, each labeled with its index

    args:
        query_image_path: path of the query image
        retrieved_images: list of paths of the retrieved images (at most 9)

    returns:
        np.ndarray: the panoramic view image (BGR channel order)
    """
    panoramic_width = img_width * row_count
    panoramic_height = img_height * row_count
    panoramic_image = np.full(
        (panoramic_height, panoramic_width, 3), 255, dtype=np.uint8
    )

    # create and resize the query image with a blue border
    query_image_null = np.full((panoramic_height, img_width, 3), 255, dtype=np.uint8)
    query_image = Image.open(query_image_path).convert("RGB")
    query_array = np.array(query_image)[:, :, ::-1]  # RGB -> BGR for OpenCV
    resized_image = cv2.resize(query_array, (img_width, img_height))
    border_size = 10
    blue = (255, 0, 0)  # blue color in BGR
    bordered_query_image = cv2.copyMakeBorder(
        resized_image,
        border_size,
        border_size,
        border_size,
        border_size,
        cv2.BORDER_CONSTANT,
        value=blue,
    )
    # place the query image in the bottom cell of the left-hand column
    query_image_null[img_height * 2 : img_height * 3, 0:img_width] = cv2.resize(
        bordered_query_image, (img_width, img_height)
    )

    # add text "query" just above the query image
    # (the canvas ends at the image's bottom edge, so there is no room below it)
    text = "query"
    font_scale = 1
    font_thickness = 2
    text_org = (10, img_height * 2 - 10)
    cv2.putText(
        query_image_null,
        text,
        text_org,
        cv2.FONT_HERSHEY_SIMPLEX,
        font_scale,
        blue,
        font_thickness,
        cv2.LINE_AA,
    )

    # combine the rest of the images into the panoramic view
    retrieved_imgs = [
        np.array(Image.open(img).convert("RGB"))[:, :, ::-1] for img in retrieved_images
    ]
    for i, image in enumerate(retrieved_imgs):
        image = cv2.resize(image, (img_width - 4, img_height - 4))
        row = i // row_count
        col = i % row_count
        start_row = row * img_height
        start_col = col * img_width
        border_size = 2
        bordered_image = cv2.copyMakeBorder(
            image,
            border_size,
            border_size,
            border_size,
            border_size,
            cv2.BORDER_CONSTANT,
            value=(0, 0, 0),
        )
        panoramic_image[
            start_row : start_row + img_height, start_col : start_col + img_width
        ] = bordered_image

        # add red index numbers to each image, on a white background box
        text = str(i)
        org = (start_col + 50, start_row + 30)
        (font_width, font_height), baseline = cv2.getTextSize(
            text, cv2.FONT_HERSHEY_SIMPLEX, 1, 2
        )
        top_left = (org[0] - 48, start_row + 2)
        bottom_right = (org[0] - 48 + font_width + 5, org[1] + baseline + 5)
        cv2.rectangle(
            panoramic_image, top_left, bottom_right, (255, 255, 255), cv2.FILLED
        )
        cv2.putText(
            panoramic_image,
            text,
            (start_col + 10, start_row + 30),
            cv2.FONT_HERSHEY_SIMPLEX,
            1,
            (0, 0, 255),  # red color in BGR
            2,
            cv2.LINE_AA,
        )

    # combine the query image with the panoramic view
    panoramic_image = np.hstack([query_image_null, panoramic_image])
    return panoramic_image
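The grid layout in this function ties the search limit to the canvas size: with three tiles per row and three rows, there is room for exactly nine results, which matches `limit=9` in the search call. The following sketch isolates that index arithmetic; the helper names are hypothetical and the defaults mirror `img_width`, `img_height` and `row_count` above.

```python
def tile_origin(index: int, tile_size: int = 300, tiles_per_row: int = 3) -> tuple[int, int]:
    """Top-left (x, y) pixel of a result tile in the grid, before the query column is prepended."""
    row, col = divmod(index, tiles_per_row)
    return col * tile_size, row * tile_size


def grid_capacity(tiles_per_row: int = 3) -> int:
    """Number of result images that fit in the square grid."""
    return tiles_per_row * tiles_per_row
```

If you raise the search limit, the tile count per row needs to grow with it; otherwise the extra results would fall outside the canvas.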

Combine the query image and the retrieved images, with their indices, into one panoramic view:

from PIL import Image

combined_image_path = os.path.join(data_dir, "combined_image.jpg")
panoramic_image = create_panoramic_view(query_image, retrieved_images)
cv2.imwrite(combined_image_path, panoramic_image)

combined_image = Image.open(combined_image_path)
show_combined_image = combined_image.resize((300, 300))
show_combined_image.show()

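Rerank the results and give an explanation:

We send the combined image, together with a suitable prompt, to a multimodal LLM service to rank the retrieved results and explain the choice. Note: to use GPT-4o as the LLM, you need to have your OpenAI API key ready.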

import base64

import requests

openai_api_key = "sk-***"  # Change to your OpenAI API Key


def generate_ranking_explanation(
    combined_image_path: str, caption: str, infos: dict = None
) -> tuple[list[int], str]:
    # encode the combined image so it can be embedded in the request body
    with open(combined_image_path, "rb") as image_file:
        base64_image = base64.b64encode(image_file.read()).decode("utf-8")

    information = (
        "You are responsible for ranking results for a Composed Image Retrieval. "
        "The user retrieves an image with an 'instruction' indicating their retrieval intent. "
        "For example, if the user queries a red car with the instruction 'change this car to blue,' a similar type of car in blue would be ranked higher in the results. "
        "Now you would receive instruction and query image with blue border. Every item has its red index number in its top left. Do not misunderstand it. "
        f"User instruction: {caption} \n\n"
    )

    # add additional information for each image
    if infos:
        for i, info in enumerate(infos["product"]):
            information += f"{i}. {info}\n"

    information += (
        "Provide a new ranked list of indices from most suitable to least suitable, followed by an explanation for the top 1 most suitable item only. "
        "The format of the response has to be 'Ranked list: []' with the indices in brackets as integers, followed by 'Reasons:' plus the explanation why this most fit user's query intent."
    )

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai_api_key}",
    }

    payload = {
        "model": "gpt-4o",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": information},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                    },
                ],
            }
        ],
        "max_tokens": 300,
    }

    response = requests.post(
        "https://api.openai.com/v1/chat/completions", headers=headers, json=payload
    )
    response.raise_for_status()  # fail early on HTTP errors such as an invalid API key
    result = response.json()["choices"][0]["message"]["content"]

    # parse the ranked indices from the response
    start_idx = result.find("[")
    end_idx = result.find("]")
    ranked_indices_str = result[start_idx + 1 : end_idx].split(",")
    ranked_indices = [int(index.strip()) for index in ranked_indices_str]

    # extract explanation
    explanation = result[end_idx + 1 :].strip()

    return ranked_indices, explanation
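The parsing at the end of this function assumes the model always answers in exactly the requested format. Since LLM output can drift, a more defensive parser may be useful. The sketch below is one possible variant, not part of the original tutorial: the function name and the decision to drop out-of-range or repeated indices are assumptions.

```python
import re


def parse_ranking(response_text: str, num_items: int) -> tuple[list[int], str]:
    """Extract ranked indices and the explanation from text such as
    'Ranked list: [2, 0, 1] Reasons: ...'. Returns ([], text) if no list is found."""
    match = re.search(r"\[([^\]]*)\]", response_text)
    if match is None:
        return [], response_text.strip()

    indices: list[int] = []
    for token in re.findall(r"\d+", match.group(1)):
        idx = int(token)
        # keep only indices that exist and have not been seen yet
        if idx < num_items and idx not in indices:
            indices.append(idx)

    explanation = response_text[match.end():].strip()
    return indices, explanation
```

An empty index list then signals that the caller should fall back to the original retrieval order instead of raising an exception.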

Get the ranked image indices and the reason for the best result:

ranked_indices, explanation = generate_ranking_explanation(
    combined_image_path, query_text
)

Show the best result with its explanation:

print(explanation)

best_index = ranked_indices[0]
best_img = Image.open(retrieved_images[best_index])
best_img = best_img.resize((150, 150))
best_img.show()
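If you want the whole reranked list rather than only the top item, the indices can be mapped back onto the retrieved paths. The helper below is a hypothetical sketch: it ignores indices that are out of range or repeated, and appends any images the model left out in their original order.

```python
def apply_ranking(retrieved: list[str], ranked_indices: list[int]) -> list[str]:
    """Reorder retrieved items by the LLM ranking; unranked items keep their order at the end."""
    seen: set[int] = set()
    ordered: list[str] = []
    for idx in ranked_indices:
        if 0 <= idx < len(retrieved) and idx not in seen:
            seen.add(idx)
            ordered.append(retrieved[idx])
    ordered.extend(item for i, item in enumerate(retrieved) if i not in seen)
    return ordered
```

For example, `apply_ranking(retrieved_images, ranked_indices)` would return all nine paths with the model's preferred image first.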

Result:

Reasons: The most suitable item for the user's query intent is index 6 because the instruction specifies a phone case with the theme of the image, which is a leopard. The phone case with index 6 has a thematic design resembling the leopard pattern, making it the closest match to the user's request for a phone case with the image theme.

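See the full code in this notebook. To learn how to start an online demo with this tutorial, refer to the example application.

6. Conclusion

In this post we discussed building a multimodal RAG system with Milvus, an open-source vector database. We covered how developers can set up Milvus, load image data, run similarity search, and use an LLM to rerank the retrieved results for more accurate responses.

Multimodal RAG solutions open up many possibilities for AI systems that can readily understand and process several forms of data. Common ones include improved image search engines and better context-driven results.

Original article: Want to Search for Something With an Image and a Text Description? Try a Multimodal RAG

Translated and compiled by 汇智网; please credit the source when reposting.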