LIBRARY

ExtractThinker 文档智能处理库

ExtractThinker是一个灵活的文档智能库，可帮助你从各种文档中提取和分类结构化数据，就像文档处理工作流的 ORM 一样。

admin

Nov 22, 2024 • 13 min read

在本文中，我们将探索如何使用 ExtractThinker 高效地大规模处理文档。我们将讨论何时使用不同的模型（如 O1、GPT4o 及其迷你版本）、如何处理 OCR、提取图表以及使用异步批处理管理重负载。

1、ExtractThinker 简介

ExtractThinker是一个灵活的文档智能库，可帮助你从各种文档中提取和分类结构化数据，就像文档处理工作流的 ORM 一样。你说的一个短语是“LLM 的文档智能”或“智能文档处理的 LangChain”。其动机是创建文档处理所需的利基功能，例如拆分大型文档和高级分类。

下图映射了将要讨论的所有内容：

DocumentLoader

DocumentLoader 是文档和 LLM 之间的连接，通常在 SOTA OCR 上完成。支持多种文档加载器，包括 Tesseract OCR、Azure Form Recognizer、AWS Textract、Google Document AI 等。

LLM

它是模型的装饰器。它建立在 LiteLLM 和 Instructor 等工具之上，以方便不可知论者使用。是围绕文档智能的需求而设计的。

Contract

也是一个装饰器，但是是 Pydantic 的。目标是包括自定义功能，例如验证器和提示工程，以便自动注入和处理。

Extractor

协调文档加载器和 LLM 之间的交互以提取结构化数据。

Process

表示跨文件的流程。建立在上述组件之上。你可以为某些用例选择 DocumentLoaders以及 Extractors。

还有其他较小的组件，例如 Splitters 和 Classifications，但我们将结合适当的示例来查看它们。

2、选择正确的模型

选择合适的模型对于平衡性能、准确性和成本至关重要。首先让我们来看看成本：

GPT-4o mini

用例：基本文本提取任务，类似于 OCR。

非常适合从文档中提取文本，你必须将图像或 PDF 转换为机器可读的文本。经济高效且快速，适合大批量处理。

GPT-4o

用例：分类和拆分。

GPT-4o 模型让你对文档的内容和结构有了更多的了解。非常适合对文档进行分类、将组合文档拆分为单独的部分以及执行复杂的分类任务。

何时使用：

将文档分类为发票、合同或收据等类型。
根据内容将多页文档拆分为单独的部分。
高级分类，其中了解上下文和细微差别很重要。

o1 和 o1-mini 模型

用于：需要推理和从数据中得出结论的高级提取任务。

o1 和 o1-mini 模型专为复杂的提取场景而设计，其中模型需要执行更深入的分析和推理。例如，从图表中提取数据、解释值以及根据提取的坐标计算人均 GDP 等聚合指标。

何时使用：

根据提取的数据执行计算或生成见解。

上面的详细描述可以打包在下图中：

3、使用 DocumentLoader

在 ExtractThinker 中， DocumentLoader 是连接文档和 LLM 的关键组件。它使用 SOTA OCR 技术或直接文本提取工具从各种文档格式中提取文本和布局信息。

OCR vs. Pure Vision

仅使用 LLM 就可以完美地提取数据，但问题有两个方面：幻觉和精度。如果数据不够清晰或不够可见，就会出现幻觉。OCR 会为你提供所需的数据，而 LLM 会为你提供结构。例如，在某些带有签名的文档中，精度会成为问题，而 OCR 可以很好地处理这种情况。

因此，在生产案例中，请尝试使用启用 Vision 的 OCR。这会将 OCR 文本 + 图像添加到 LLM 请求中。

可用的 DocumentLoaders

ExtractThinker 提供多种 DocumentLoaders，包括：

DocumentLoaderTesseract：使用 Tesseract OCR 从图像或扫描的 PDF 中提取文本。
DocumentLoaderPyPdf：使用 PyPDF 直接从 PDF 中提取文本，适用于数字生成的 PDF。
DocumentLoaderAWSTextract：与 AWS Textract 集成以实现高级 OCR 功能。
DocumentLoaderAzureForm：利用 Azure 表单识别器提取结构化数据。
DocumentLoaderGoogleDocumentAI：连接到 Google Document AI 进行 OCR 和数据提取。

它包含两个主要方法，在 extract() 上调用的 load 和在 split() 中调用的 load_content_list。

从 DocumentLoader 获取内容：

import os
from extract_thinker.document_loader import DocumentLoaderTesseract

# Set the path to your Tesseract executable
tesseract_path = os.getenv('TESSERACT_PATH')
if not tesseract_path:
    raise ValueError('TESSERACT_PATH environment variable is not set')

# Gets the content in JSON or just in text
content = loader.load(test_file_path)

# Gets a JSON with the content and images, per page
content = loader.load_content_list(test_file_path)

4、Extractor：提取结构化数据和图表

Extractor 是 ExtractThinker 中的核心组件，可协调 DocumentLoader 和 LLM 之间的交互，以从文档中提取结构化数据。它利用 LLM 的功能根据预定义的数据结构（称为合约：Contract）解释和组织提取的文本。

合约是 Pydantic 模型，用于定义你要从文档中提取的数据的结构。它们的作用类似于 Extractor 和 LLM 用来解析和组织提取的信息的模式。

4.1 从发票中提取数据

首先定义发票合约：

from extract_thinker import Contract
from pydantic import Field
from typing import List

class InvoiceLineItem(Contract):
    description: str = Field(description="Description of the item")
    quantity: int = Field(description="Quantity of the item")
    unit_price: float = Field(description="Unit price of the item")
    amount: float = Field(description="Total amount for the item")

class InvoiceContract(Contract):
    invoice_number: str = Field(description="Invoice number")
    invoice_date: str = Field(description="Date of the invoice")
    total_amount: float = Field(description="Total amount of the invoice")
    line_items: List[InvoiceLineItem] = Field(description="List of line items in the invoice")

此合约指定我们要提取发票编号、日期、总金额和行项目，每个项目都包含其详细信息。

从发票中提取数据：

import os
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderPyPdf  # Or any other suitable DocumentLoader

# Initialize the Extractor
extractor = Extractor()

# Load the DocumentLoader
extractor.load_document_loader(DocumentLoaderPyPdf())

# Load the LLM
extractor.load_llm('gpt-4o-mini')  # Use the appropriate model for your use case

# Define the path to your document
test_file_path = 'path/to/your/invoice.pdf'

# Perform the extraction
result = extractor.extract(test_file_path, InvoiceContract)

# Access the extracted data
print("Invoice Number:", result.invoice_number)
print("Invoice Date:", result.invoice_date)
print("Total Amount:", result.total_amount)
for item in result.line_items:
    print(f"Item: {item.description}, Quantity: {item.quantity}, Unit Price: {item.unit_price}, Amount: {item.amount}")

4.2 从图表中提取数据

从图表中提取数据需要更高级的合约，它可以处理图表的结构，包括其类型、描述和数据点。

定义图表合约：

from extract_thinker import Contract
from pydantic import Field
from typing import List, Literal

class XYCoordinate(Contract):
    x: float = Field(description='Value on the x-axis')
    y: float = Field(description='Value on the y-axis')

class Chart(Contract):
    classification: Literal['line', 'bar', 'pie'] = Field(description='Type of the chart')
    description: str = Field(description='Description of the chart')
    coordinates: List[XYCoordinate] = Field(description='Data points in the chart')
    gdp_variation: str = Field(description='Description of the GDP variation')

class ChartWithContent(Contract):
    content: str = Field(description='Content of the page without the chart')
    chart: Chart = Field(description='Extracted chart data')

这个合约允许我们不仅提取文本内容，还可以提取图表的详细信息，包括其数据点。

提取图表数据：

import os
from extract_thinker import Extractor
from extract_thinker.document_loader import DocumentLoaderTesseract  # If working with images

# Initialize the Extractor
extractor = Extractor()

# Load the DocumentLoader
tesseract_path = os.getenv('TESSERACT_PATH')
if not tesseract_path:
    raise ValueError('TESSERACT_PATH environment variable is not set')
extractor.load_document_loader(DocumentLoaderTesseract(tesseract_path))

# Load the LLM (use O1 or GPT-4o for complex tasks)
extractor.load_llm("o1-preview")  # Use 'o1' for advanced reasoning

# Define the path to your document
test_file_path = 'path/to/your/document_with_chart.png'

# Perform the extraction
result = extractor.extract(test_file_path, ChartWithContent, vision=True)

# Access the extracted data
print("Content without Chart:", result.content)
print("Chart Type:", result.chart.classification)
print("Chart Description:", result.chart.description)
print("GDP Variation:", result.chart.gdp_variation)
print("Data Points:")
for coord in result.chart.coordinates:
    print(f"X: {coord.x}, Y: {coord.y}")

注意：选择模型时，请记住经验法则。如果结论需要根据数据进行计算，在本例中，计算 GDP，必须使用 o1 模型。

5、流程：拆分和分类

在 ExtractThinker 中， Process 组件表示一个工作流，它协调从文档中加载、拆分、分类和提取数据。这种模块化方法允许你高效地处理复杂的文档处理任务。

流程简单介绍：

目的：管理文档的一系列操作，包括加载、拆分、分类和提取。
组件：结合文档加载器、拆分器、分类和提取器以创建灵活的处理管道。
灵活性：通过混合和匹配不同的组件，根据您的特定需求定制工作流。

5.1 拆分文档

处理多页或组合文档时，将它们拆分成单独的部分或页面对于准确处理至关重要。 ExtractThinker 提供拆分策略来有效地处理此问题。

急切拆分：此策略一次处理整个文档，预先识别所有拆分点。它最适合适合模型上下文窗口的小型到中型文档，为较小的输入提供更简单的实现和更快的处理速度。
延迟拆分：这种方法以增量方式处理文档，评估较小的块以确定拆分位置。它非常适合超出模型上下文窗口的大型文档，使其成为处理大量数据的可扩展且高效的选项。

5.2 使用拆分器

ExtractThinker 提供不同的拆分器（例如 ImageSplitter 和 TextSplitter）来处理拆分逻辑。

设置拆分器：

from extract_thinker import Process, SplittingStrategy
from extract_thinker.splitter import ImageSplitter
from extract_thinker.document_loader import DocumentLoaderTesseract

# Initialize the Process
process = Process()

# Load the DocumentLoader
tesseract_path = os.getenv('TESSERACT_PATH')
process.load_document_loader(DocumentLoaderTesseract(tesseract_path))

# Load the Splitter with the desired model and strategy
process.load_splitter(
    ImageSplitter('gpt-4o', strategy=SplittingStrategy.EAGER)
)

5.3 分类

分类是关于识别你正在处理的文档或部分的类型，例如发票、合同或驾驶执照。当不同类型的文档需要不同的提取逻辑时，这一点至关重要。

分类使用分类类定义，指定名称、描述和相关合同以及要使用的提取器。

使用具有多个提取器的分类：

from extract_thinker import Classification
from extract_thinker import Extractor

# Define your Contracts (as previously defined)
class InvoiceContract(Contract):
    invoice_number: str
    total_amount: float
    # ... other fields

class DriverLicenseContract(Contract):
    name: str
    license_number: str
    # ... other fields

# Initialize Extractors for each classification if needed
invoice_extractor = Extractor()
invoice_extractor.load_document_loader(DocumentLoaderPyPdf())
invoice_extractor.load_llm('gpt-4o-mini')

license_extractor = Extractor()
license_extractor.load_document_loader(DocumentLoaderTesseract(tesseract_path))
license_extractor.load_llm('gpt-4o-mini')

# Define Classifications
classifications = [
    Classification(
        name="Invoice",
        description="This is an invoice document",
        contract=InvoiceContract,
        extractor=invoice_extractor
    ),
    Classification(
        name="Driver License",
        description="This is a driver's license document",
        contract=DriverLicenseContract,
        extractor=license_extractor
    )
]

result = process.classify(
    test_file_path,
    classifications,
)

5.4 高级分类策略

ExtractThinker 支持高级分类策略以提高准确性和可靠性。

分类策略：

共识：结合多个分类器的结果以达成共识决策。
高阶：使用高阶推理进行更准确的分类。
阈值：应用置信度阈值来确定分类确定性。

高级分类：

from extract_thinker import ClassificationStrategy

# Initialize multiple Extractors for classification
extractor1 = Extractor()
extractor1.load_document_loader(DocumentLoaderTesseract(tesseract_path))
extractor1.load_llm('gpt-4o')

extractor2 = Extractor()
extractor2.load_document_loader(DocumentLoaderPyPdf())
extractor2.load_llm('gpt-4o-mini')

# Add classifiers to the process
process.add_classify_extractor([[extractor1], [extractor2]])

# Perform classification with a strategy
result = process.classify(
    test_file_path,
    classifications,
    strategy=ClassificationStrategy.CONSENSUS,
    threshold=0.8
)

print("Document classified as:", result.name)

5.5 合并流程中的分裂和分类

通过结合拆分和分类，您可以高效地处理包含多种内容类型的复杂文档。

完成流程工作流：

# Initialize the Process and load components
process = Process()
process.load_document_loader(DocumentLoaderTesseract(tesseract_path))
process.load_splitter(
    ImageSplitter('gpt-4o', strategy=SplittingStrategy.EAGER)
)

# Process the document
test_file_path = 'path/to/your/multi_page_document.pdf'
split_content = process.load_file(test_file_path)\
    .split(classifications)\
    .extract()

# Access the extracted data
for content in split_content:
    if isinstance(content, InvoiceContract):
        print("Extracted Invoice:")
        print("Invoice Number:", content.invoice_number)
        print("Total Amount:", content.total_amount)
    elif isinstance(content, DriverLicenseContract):
        print("Extracted Driver License:")
        print("Name:", content.name)
        print("License Number:", content.license_number)

说明：

load_file()：加载文档。
split()：根据分类拆分文档。
extract()：根据每个分类部分定义的合约提取数据。

6、异步批处理，适用于重负载

ExtractThinker 提供批处理功能，利用异步执行有效处理重负载。当响应时间不是问题时，这允许你以较低的价格处理文档。

使用批量请求：

...
# Setting the extractor as usual
path = 'path/to/your/document.pdf'
batch_job = extractor.extract_batch(
  path,
  InvoiceContract,
)

# can be "queued", "processing", "completed" or "failed"
status = await batch_job.get_status()

# await for the result 
result = await batch_job.get_result()

说明：

extract_batch：启动批量提取过程。
BatchJob：表示批处理作业，允许您监视其状态并检索结果。
get_status：检查批处理作业的当前状态。
get_result：作业完成后检索结果。

处理批处理作业状态

批处理作业可以有几种状态：

queued：作业在队列中，即将开始处理。
processing：作业当前正在处理中。
completed：作业已成功完成处理。
failed：作业处理失败。

批处理一次处理一个文件，因此你可以自行控制批处理数量。批处理作业管理 OpenAI API 需要创建的所有 JSONL 文件，输出也是如此。完成后，无论成功与否，都会删除文件。

总而言之，如果请求时间不是制约因素，你可以使用此 ExtractThinker 功能轻松节省 50% 的成本。

7、结束语

在数据为王的世界里，ExtractThinker 使你能够充分发挥文档的潜力。通过智能地在 GPT-4o Mini（用于快速文本提取）、GPT-4o（用于高级分类）和 O1（用于深度推理任务）等模型之间进行选择，你可以定制工作流程以实现最高效率和准确性。我们探索了选择正确的模型、使用 DocumentLoaders、使用 Extractor 提取结构化数据、使用 Processes 管理复杂的工作流程以及使用异步批处理处理重负载。

这是我在获得 API 中的 O1 模型访问权限后创建的一篇文章，它解决了几个棘手的问题，例如需要在单独代理中完成的数据聚合和计算。

ExtractThinker 即将发布，这是文档内容的汇编。

原文链接：Scaling Document Extraction with o1, GPT-4o & Mini | ExtractThinker

汇智网翻译整理，转载请标明出处