
LlamaParse: An Advanced PDF Parser

LlamaParse is a generative-AI-enabled document parsing technology designed for complex documents that contain embedded objects such as tables and figures.

admin

Nov 14, 2024 • 8 min read

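The core focus of retrieval-augmented generation (RAG) is connecting the data you care about to a large language model (LLM). This ties the capabilities of generative AI to your data, enabling sophisticated question answering and LLM-generated insights grounded in your specific dataset. My hypothesis is that RAG systems will be useful not only for the chatbot-style applications we commonly see, but will also be integrated into innovative AI applications aimed at improving business decisions and making predictions.

The usefulness of RAG is beyond doubt, and as the technology advances we can expect more transformative applications that change how we learn from and interact with information.

But...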

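1. The PDF Problem

Important semi-structured data is often stored in complex file types such as the notoriously hard-to-handle PDF. Consider how often important documents arrive as PDFs: earnings call transcripts, investor reports, news articles, 10-K/10-Q filings, and research papers on arXiv, to name a few. We need a way to cleanly and efficiently extract embedded information such as text, tables, images, and charts from these PDFs so that this data can be ingested into a RAG pipeline.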

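The solution:

2. LlamaParse

LlamaParse is a generative-AI-enabled document parsing technology designed for complex documents that contain embedded objects such as tables and figures.

LlamaParse's core function is to enable retrieval systems over these complex documents, such as PDFs. It does so by extracting data from the documents and converting it into an easily ingestible format such as markdown or text. Once converted, the data can be embedded and loaded into your RAG pipeline.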

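LlamaParse feature overview:

Supported file types: PDF, .pptx, .docx, .rtf, .pages, .epub, and more
Output types: markdown, text
Extraction capabilities: text, tables, images, charts, comic books, mathematical equations
Custom parsing instructions: because LlamaParse is LLM-enabled, you can pass it instructions just as you would prompt an LLM. You can use this prompt to describe the document and give the LLM more context while parsing, to specify how you want the output to look, or to ask the LLM to do preprocessing during parsing, such as sentiment analysis, language translation, or summarization.
JSON mode: outputs the full structure of the document, extracts images with size and position metadata, and extracts tables in JSON format for easy analysis. This is well suited to custom RAG applications in which document structure and metadata are used to maximize the informational value of the documents and to cite the source of retrieved nodes within a document.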
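
To make the JSON mode description more concrete, here is a minimal sketch of walking such a result to collect tables and image metadata. The dictionary is a hand-written stand-in: the field names (`pages`, `items`, `type`, `rows`, `images`, `width`, `height`) are assumptions chosen for illustration, not a guaranteed LlamaParse schema, so inspect real output before relying on them.

```python
import json

# Hand-written stand-in for a JSON-mode result; the field names are assumptions.
sample_result = json.loads("""
{
  "pages": [
    {
      "page": 1,
      "items": [
        {"type": "heading", "value": "Results"},
        {"type": "table", "rows": [["Model", "Recall"], ["A", "0.91"], ["B", "0.78"]]}
      ],
      "images": [{"name": "fig1.png", "width": 640, "height": 480}]
    },
    {
      "page": 2,
      "items": [{"type": "text", "value": "Discussion of the results."}],
      "images": []
    }
  ]
}
""")


def collect_tables(result):
    """Return (page_number, rows) for every table item in the result."""
    tables = []
    for page in result.get("pages", []):
        for item in page.get("items", []):
            if item.get("type") == "table":
                tables.append((page.get("page"), item.get("rows", [])))
    return tables


def collect_images(result):
    """Return (page_number, name, width, height) for every image in the result."""
    images = []
    for page in result.get("pages", []):
        for image in page.get("images", []):
            images.append(
                (page.get("page"), image.get("name"), image.get("width"), image.get("height"))
            )
    return images


tables = collect_tables(sample_result)
images = collect_images(sample_result)
print(tables)
print(images)
```

Keeping the page number next to each table or image is what makes it possible to cite where a retrieved element came from.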

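Advantages of markdown:

Having LlamaParse convert a PDF to markdown has some distinct advantages. Markdown expresses the inherent structure of a document by marking structural elements such as titles, headers, subsections, tables, and images. This may seem trivial, but because markdown identifies these elements, we can easily split a document into smaller chunks along structural lines using specialized parsers in LlamaIndex, such as MarkdownElementNodeParser(). Representing a PDF in markdown lets us extract each element of the PDF and ingest it into a RAG pipeline.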
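
As a rough illustration of why markdown structure helps, the sketch below splits a markdown string into text chunks and table chunks using only the standard library. It is a simplified stand-in for the idea, not how MarkdownElementNodeParser is implemented; the function name `split_markdown` and the sample text are made up for this example.

```python
def split_markdown(md_text):
    """Split markdown into ("text", body) and ("table", body) chunks.

    A line is treated as part of a table when it starts with "|".
    A heading line always starts a new text chunk.
    """
    chunks = []
    current_kind = None
    current_lines = []

    def flush():
        # Store the chunk collected so far and reset the accumulator.
        nonlocal current_kind, current_lines
        if current_lines:
            chunks.append((current_kind, "\n".join(current_lines)))
        current_kind, current_lines = None, []

    for line in md_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines only separate content
        kind = "table" if stripped.startswith("|") else "text"
        starts_section = kind == "text" and stripped.startswith("#")
        if kind != current_kind or starts_section:
            flush()
            current_kind = kind
        current_lines.append(stripped)
    flush()
    return chunks


sample = """# Results

Recall varies by model.

| Model | Recall |
|-------|--------|
| A     | 0.91   |

## Discussion

Prompt wording matters.
"""

chunks = split_markdown(sample)
for kind, body in chunks:
    print(kind, "->", repr(body))
```

Because tables come out as their own chunks, they can be summarized, embedded, and retrieved separately from the surrounding prose.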

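3. Code Example

The following code walks through an implementation of a RAG pipeline that uses LlamaParse to ingest a PDF file.

View the full notebook on our GitHub, or open it in Colab.

Install and import the libraries: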

!pip install llama-index
!pip install llama-index-core
!pip install llama-index-embeddings-openai
!pip install llama-parse
!pip install llama-index-vector-stores-kdbai
!pip install pandas
!pip install llama-index-postprocessor-cohere-rerank
!pip install kdbai_client

from llama_parse import LlamaParse
from llama_index.core import Settings
from llama_index.core import StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores.kdbai import KDBAIVectorStore
from llama_index.postprocessor.cohere_rerank import CohereRerank
from getpass import getpass
import kdbai_client as kdbai

Set the API keys for LlamaCloud, OpenAI, and Cohere:

# llama-parse is async-first, running the async code in a notebook requires the use of nest_asyncio
import nest_asyncio
nest_asyncio.apply()


import os
# API access to llama-cloud
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-"

# Using OpenAI API for embeddings/llms
os.environ["OPENAI_API_KEY"] = "sk-"

# Using Cohere for reranking
os.environ["COHERE_API_KEY"] = "xyz..."

Set up the KDB.AI vector database (free sign-up available):

# Set up the KDB.AI endpoint and API key
KDBAI_ENDPOINT = (
    os.environ["KDBAI_ENDPOINT"]
    if "KDBAI_ENDPOINT" in os.environ
    else input("KDB.AI endpoint: ")
)
KDBAI_API_KEY = (
    os.environ["KDBAI_API_KEY"]
    if "KDBAI_API_KEY" in os.environ
    else getpass("KDB.AI API key: ")
)

#connect to KDB.AI
session = kdbai.Session(api_key=KDBAI_API_KEY, endpoint=KDBAI_ENDPOINT)

Connect to the "default" database, create the schema for the KDB.AI table, define the index, and create the table:

# Connect with kdbai database
db = session.database("default")

# The schema contains two metadata columns (document_id, text) and one embeddings column
schema = [
        dict(name="document_id", type="bytes"),
        dict(name="text", type="bytes"),
        dict(name="embeddings", type="float32s"),
    ]

# indexFlat defines the index name, its type, the column it applies to (embeddings),
# and params, which include the search metric (L2, i.e. Euclidean distance) and dims
indexFlat = {
        "name": "flat",
        "type": "flat",
        "column": "embeddings",
        "params": {'dims': 1536, 'metric': 'L2'},
    }

KDBAI_TABLE_NAME = "LlamaParse_Table"

# First ensure the table does not already exist
try:
    db.table(KDBAI_TABLE_NAME).drop()
except kdbai.KDBAIException:
    pass

#Create the table
table = db.create_table(KDBAI_TABLE_NAME, schema, indexes=[indexFlat])
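
For intuition about what a flat index with the L2 metric does, here is a minimal pure-Python sketch of exhaustive nearest-neighbor search by Euclidean distance. It illustrates the concept only and is not KDB.AI's implementation; the function names and the tiny 3-dimensional vectors are invented for the example (the index defined above uses 1536 dimensions).

```python
import math


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(query, vectors, k):
    """Compare the query against every stored vector and return the k closest.

    Returns a list of (row_id, distance) pairs sorted by ascending distance.
    """
    scored = [(row_id, l2_distance(query, vec)) for row_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy "table" of stored embeddings keyed by a hypothetical chunk id.
stored = {
    "chunk-a": [0.0, 0.0, 0.0],
    "chunk-b": [1.0, 0.0, 0.0],
    "chunk-c": [0.0, 3.0, 4.0],
}

nearest = flat_search([0.9, 0.0, 0.0], stored, k=2)
print(nearest)
```

A flat index trades speed for exactness: every query touches every vector, which is fine for a single paper's worth of chunks but scales linearly with table size.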

Download the sample PDF, or bring your own. This PDF is an excellent paper titled "LLM In-Context Recall is Prompt Dependent" by Daniel Machlab and Rick Battle of the VMware NLP Lab:

!wget 'https://arxiv.org/pdf/2404.08865' -O './LLM_recall.pdf'

Let's set up LlamaParse and LlamaIndex with the LLM and the embedding model:

EMBEDDING_MODEL  = "text-embedding-3-small"
GENERATION_MODEL = "gpt-4o"

llm = OpenAI(model=GENERATION_MODEL)
embed_model = OpenAIEmbedding(model=EMBEDDING_MODEL)

Settings.llm = llm
Settings.embed_model = embed_model

pdf_file_name = './LLM_recall.pdf'
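
The embedding model and the index have to agree on vector size: `text-embedding-3-small` produces 1536-dimensional vectors by default, which matches `dims: 1536` in the index definition. A small guard like the following sketch can catch a mismatch before any data is inserted. The lookup table and the function name `check_dims` are illustrative helpers, not part of LlamaIndex or KDB.AI.

```python
# Default output sizes of some OpenAI embedding models.
DEFAULT_EMBEDDING_DIMS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def check_dims(model_name, index_definition):
    """Raise ValueError if the index dims differ from the model's default size."""
    expected = DEFAULT_EMBEDDING_DIMS[model_name]
    actual = index_definition["params"]["dims"]
    if expected != actual:
        raise ValueError(
            f"{model_name} emits {expected}-dim vectors, but the index expects {actual}"
        )
    return expected


# Same shape as the index definition used for the KDB.AI table.
index_flat = {
    "name": "flat",
    "type": "flat",
    "column": "embeddings",
    "params": {"dims": 1536, "metric": "L2"},
}

print(check_dims("text-embedding-3-small", index_flat))
```

If you switch to a different embedding model, the `dims` value in the index definition needs to change with it.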

Create custom parsing instructions to pass to LlamaParse:

parsing_instructions = '''The document titled "LLM In-Context Recall is Prompt Dependent" is an academic preprint from April 2024, authored by Daniel Machlab and Rick Battle from the VMware NLP Lab. It explores the in-context recall capabilities of Large Language Models (LLMs) using a method called "needle-in-a-haystack," where a specific factoid is embedded in a block of unrelated text. The study investigates how the recall performance of various LLMs is influenced by the content of prompts and the biases in their training data. The research involves testing multiple LLMs with varying context window sizes to assess their ability to recall information accurately when prompted differently. The paper includes detailed methodologies, results from numerous tests, discussions on the impact of prompt variations and training data, and conclusions on improving LLM utility in practical applications. It contains many tables. Answer questions using the information in this article and be precise.'''

Run LlamaParse and print some of the markdown output:

# LlamaParse takes the instruction text through the `parsing_instruction` keyword (singular)
documents = LlamaParse(result_type="markdown", parsing_instruction=parsing_instructions).load_data(pdf_file_name)
print(documents[0].text[:1000])

Extract the base nodes (text) and object nodes (tables) from the markdown documents:

# Parse the documents using MarkdownElementNodeParser;
# llm and num_workers are passed directly to the constructor
node_parser = MarkdownElementNodeParser(llm=llm, num_workers=8)

# Retrieve nodes (text) and objects (table)
nodes = node_parser.get_nodes_from_documents(documents)

base_nodes, objects = node_parser.get_nodes_and_objects(nodes)

Create an index that uses KDB.AI:

vector_store = KDBAIVectorStore(table)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

#Create the index, inserts base_nodes and objects into KDB.AI
recursive_index = VectorStoreIndex(
    nodes= base_nodes + objects, storage_context=storage_context
)

# Query KDB.AI to ensure the nodes were inserted
table.query()

Create a LlamaIndex query engine to run the RAG pipeline. We use the Cohere reranker to help improve the results:

### Define reranker
cohere_rerank = CohereRerank(top_n=10)

### Create the query_engine to execute RAG pipeline using LlamaIndex, KDB.AI, and Cohere reranker
query_engine = recursive_index.as_query_engine(similarity_top_k=20,
                                               node_postprocessors=[cohere_rerank],
                                               vector_store_kwargs={
                                                    "index" : "flat",
                                                },
                                            )
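
The query engine above works in two stages: the vector store returns the 20 most similar nodes (`similarity_top_k=20`), and the reranker keeps the best 10 of those (`top_n=10`). The sketch below mimics that flow with toy scoring functions in plain Python. The word-overlap scorers stand in for embedding similarity and for the Cohere reranking model; they are assumptions made purely for illustration.

```python
def first_stage_score(query, doc):
    """Cheap, recall-oriented score: number of shared words (stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def rerank_score(query, doc):
    """Finer score: shared words normalized by document length (stand-in for a reranker)."""
    words = doc.lower().split()
    if not words:
        return 0.0
    return first_stage_score(query, doc) / len(words)


def retrieve_then_rerank(query, docs, similarity_top_k, top_n):
    """Take the top-k candidates by the cheap score, then keep the top-n by the finer score."""
    candidates = sorted(docs, key=lambda d: first_stage_score(query, d), reverse=True)
    candidates = candidates[:similarity_top_k]
    reranked = sorted(candidates, key=lambda d: rerank_score(query, d), reverse=True)
    return reranked[:top_n]


docs = [
    "needle in a haystack testing measures recall",
    "the haystack is filler text and the needle is a factoid hidden in the filler text of the prompt",
    "needle placement",
    "unrelated notes about database indexes",
]

results = retrieve_then_rerank("needle haystack recall", docs, similarity_top_k=3, top_n=2)
print(results)
```

Retrieving more candidates than you finally keep gives the second, more precise scorer room to reorder them, which is the reason for pairing a larger `similarity_top_k` with a smaller `top_n`.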

Let's try it out:

query_1 = "describe the needle in a haystack method only using the provided information"

response_1 = query_engine.query(query_1)

print(str(response_1))

Output:

>>>The needle-in-a-haystack method involves embedding a factoid (referred to as the “needle”) within a block of filler text (referred to as the “haystack”). The model is then tasked with retrieving this embedded factoid. The recall performance of the model is evaluated across various haystack lengths and with different placements of the needle to identify patterns in performance. This method demonstrates that an LLM’s ability to recall information is influenced not only by the content of the prompt but also by potential biases in its training data. Adjustments to the model’s architecture, training strategy, or fine-tuning can enhance its recall performance, providing insights into LLM behavior for more effective applications.

query_1 = "list the LLMs that are evaluated with needle-in-a-haystack testing?"

response_1 = query_engine.query(query_1)

print(str(response_1))

Output (this output is drawn from a table in the PDF document):

>>>Llama 2 13B, Llama 2 70B, GPT-4 Turbo, GPT-3.5 Turbo 1106, GPT-3.5 Turbo 0125, Mistral v0.1, Mistral v0.2, WizardLM, and Mixtral are the LLMs evaluated with needle-in-a-haystack testing.

query_1 = "what is the best thing to do in San Francisco? "

response_1 = query_engine.query(query_1)

print(str(response_1))

Output (this output is drawn from a table in the PDF document):

4. Summary

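In this walkthrough, we explored how to build a retrieval-augmented generation pipeline over a complex PDF document. We used LlamaParse to convert the PDF to markdown, extracted the text and tables, and ingested them into KDB.AI for retrieval through a LlamaIndex query engine.

As RAG systems move into production, it is important that they can draw on the knowledge held in complex document types, and LlamaParse makes that possible.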

Original article: RAG + LlamaParse: Advanced PDF Parsing for Retrieval
