Text Extraction Methods for Scanned PDFs

Scanned PDFs are a common obstacle for anyone trying to extract information from documents. Unlike digitally native PDFs, scanned documents typically render text as images, which makes simple copy-and-paste unreliable. This article explores two powerful approaches to the problem, especially for complex layouts and tables: leveraging vision language models (VLMs), and combining document segmentation with OCR for accurate text extraction.

The Challenge of Scanned PDFs

A scanned PDF has no underlying text layer, so simple copy-and-paste does not work: you are dealing with images of text, not actual text data. Complex layouts, tables, and varying text sizes make extraction even harder. A solution must be able to understand the visual layout and convert the images into meaningful, structured text.
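Before choosing a method, it helps to confirm that a PDF really lacks a text layer. Below is a minimal heuristic sketch; `looks_scanned` is pure Python, while the commented usage assumes PyMuPDF is installed in your environment:

```python
def looks_scanned(page_texts, min_chars=25, threshold=0.8):
    """Heuristic: treat a PDF as scanned if most of its pages
    yield almost no extractable text (i.e., no usable text layer)."""
    if not page_texts:
        return True
    near_empty = sum(1 for t in page_texts if len(t.strip()) < min_chars)
    return near_empty / len(page_texts) > threshold

# Hypothetical usage with PyMuPDF (pip install pymupdf):
# import fitz
# doc = fitz.open("test.pdf")
# print(looks_scanned([page.get_text() for page in doc]))
```

A digitally native PDF will usually return substantial text per page, so the heuristic lets you fall back to plain text extraction when OCR is unnecessary.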

1. Using Vision Language Models (VLMs)

Vision language models (VLMs) are the ideal choice when the page count is small and high-quality output matters. These models can interpret images directly, understanding both text and tables with remarkable precision.

Implementation using Google's Gemini 1.5 Pro:

import PIL.Image
import os
import google.generativeai as genai
from pdf2image import convert_from_path

# Replace with your API key
GOOGLE_API_KEY = "YOUR_API_KEY"
genai.configure(api_key=GOOGLE_API_KEY)
pdf_path = "test.pdf" # Change this path to point to your pdf
pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
# Create the output directory if it doesn't exist
output_dir = "GeminiResult"
os.makedirs(output_dir, exist_ok=True)
# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-pro")
prompt = """
    Extract all text content and tabular data from this image, strictly preserving the original reading order as they appear on the page.
    1. **Reading Order:** Process the content strictly based on the reading order within the image. Do not rearrange or reorder blocks or tables.
    2. **Text Blocks:** Extract distinct blocks of text and represent each block as a separate entity, separated by double newlines ("\\n\\n").
    3. **Tables:** Identify any tables present in the image. For each table, output it in a structured, comma-separated format (.csv). Each row of the table should be on a new line, with commas separating column values.
        - Include the header row, if present.
        - Ensure that all columns of each row are comma separated values.
    4. **Output Format:**
        - Output text blocks and tables in the order they are read on the page. When a table is encountered while reading the page, output it in CSV format at that point in the output.
    5. If there is no text and no tables, return an empty string.
    6. If a table contains only one row, return the text of that row as comma-separated values.
    """
try:
    # Convert all pages of the PDF to PIL image objects
    images = convert_from_path(pdf_path)

    if not images:
        raise ValueError("Could not convert the PDF to images")
    for i, img in enumerate(images):
        page_number = i + 1
        output_file_path = os.path.join(output_dir, f"{pdf_name}_{page_number}.txt")

        try:
            response = model.generate_content([prompt, img], generation_config={"max_output_tokens": 4096})
            response.resolve()
            with open(output_file_path, "w", encoding="utf-8") as f:
                f.write(response.text)
            print(f"Processed page {page_number} and saved to {output_file_path}")

        except Exception as page_err:
            print(f"Error processing page {page_number}: {page_err}")
            with open(output_file_path, "w", encoding="utf-8") as f:
                f.write(f"Error: An error occurred during processing of page {page_number}: {page_err}")
except FileNotFoundError as e:
    print(f"Error: Could not find file: {e}")
except Exception as e:
    print(f"Error: An error occurred during processing: {e}")
  1. Setup: Import the libraries and configure the Google Gemini API.
  2. PDF conversion: Use pdf2image to convert each page of the PDF into an image.
  3. Prompt engineering: Instruct the model to extract text in the correct reading order, separate text blocks, and structure tables.
  4. VLM processing: The Gemini model analyzes each image and extracts the text according to the prompt.
  5. Output: Save the extracted text and tables to one text file per page.
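When a PDF has many pages, the per-page API call in step 4 can hit rate limits or transient errors. A generic retry wrapper with exponential backoff is a common safeguard; this is a sketch, not part of the Gemini SDK:

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying with exponential backoff plus jitter
    on any exception; re-raise after the final attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))

# Hypothetical usage around the per-page call from the script above:
# response = with_backoff(lambda: model.generate_content([prompt, img]))
```

In production you would narrow the `except` clause to the SDK's rate-limit and server-error exceptions rather than retrying on everything.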

Pros:

  • High accuracy for a small number of pages.
  • Simple to implement via API calls.
  • Handles tables well and outputs them as CSV.

Cons:

  • Relatively high API cost.
  • Limited scalability for large PDFs.

2. Document Segmentation + OCR for Scalability

For larger PDFs, combining document segmentation with OCR is more efficient. This approach first detects the text blocks on each page, then applies OCR to each region.

Implementation using YOLO and DocTr:

from ultralytics import YOLO
import fitz  # PyMuPDF
import os
import pathlib
from PIL import Image, ImageEnhance
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

# List of sample PDF files to process
pdf_list = ['test.pdf']
# Load the document segmentation model
docseg_model = YOLO('yolov8x-doclaynet-epoch64-imgsz640-initiallr1e-4-finallr1e-5.pt')
# Initialize a dictionary to store results
mydict = {}

def enhance_image(img):
    """Apply image enhancements for better quality."""
    # Enhance sharpness
    enhancer = ImageEnhance.Sharpness(img)
    img = enhancer.enhance(1.5)
    
    # Enhance contrast
    enhancer = ImageEnhance.Contrast(img)
    img = enhancer.enhance(1.2)
    
    # Enhance color
    enhancer = ImageEnhance.Color(img)
    img = enhancer.enhance(1.1)
    
    return img

def process_pdf_page(pdf_path, page_num, docseg_model, output_dir):
    """Processes a single page of a PDF with maximum quality settings."""
    
    pdf_doc = fitz.open(pdf_path)
    page = pdf_doc[page_num]
    
    # Increase the resolution matrix for maximum quality
    zoom = 4  # Increased zoom factor for higher resolution
    matrix = fitz.Matrix(zoom, zoom)
    
    # Use high-quality rendering options
    pix = page.get_pixmap(
        matrix=matrix,
        alpha=False,  # Disable alpha channel for clearer images
        colorspace=fitz.csRGB  # Force RGB colorspace
    )
    
    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
    
    # Apply image enhancements
    img = enhance_image(img)
    
    # Resize with high-quality settings
    if zoom != 1:
        original_size = (int(page.rect.width), int(page.rect.height))
        img = img.resize(original_size, Image.Resampling.LANCZOS)
    # Generate a temporary filename for the page image
    temp_img_filename = os.path.join(output_dir, f"temp_page_{page_num}.png")
    
    # Save with maximum quality settings
    img.save(
        temp_img_filename,
        "PNG",
        quality=100,
        optimize=False,
        dpi=(300, 300)  # Set high DPI
    )
    # Run the model on the image
    results = docseg_model(source=temp_img_filename, save=True, show_labels=True, show_conf=True, boxes=True)
    # Extract the results
    page_width = page.rect.width
    one_third_width = page_width / 3

    all_coords = []

    for entry in results:
        all_coords.extend(entry.boxes.xyxy.numpy())

    # Sort the coordinates into two groups and then sort each group by y1
    left_group = []
    right_group = []
    for bbox in all_coords:
        x1 = bbox[0]
        if x1 < one_third_width:
            left_group.append(bbox)
        else:
            right_group.append(bbox)

    left_group = sorted(left_group, key=lambda bbox: bbox[1])
    right_group = sorted(right_group, key=lambda bbox: bbox[1])
    
    sorted_coords = left_group + right_group

    mydict[f"{pdf_path} Page {page_num}"] = sorted_coords
    # Clean up the temporary image
    os.remove(temp_img_filename)
    pdf_doc.close()
    
   
# Process each PDF in the list
for pdf_path in pdf_list:
    try:
        pdf_doc = fitz.open(pdf_path)
        num_pages = pdf_doc.page_count
        pdf_doc.close()
        output_dir = os.path.splitext(pdf_path)[0] + "_output"
        os.makedirs(output_dir, exist_ok=True)
        for page_num in range(num_pages):
            process_pdf_page(pdf_path, page_num, docseg_model, output_dir)
    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")
# Create the 'tmp' directory if it doesn't exist
tmp_dir = 'tmp'
os.makedirs(tmp_dir, exist_ok=True)
# Iterate through the results and save cropped images with maximum quality
for key, coords in mydict.items():
    pdf_name, page_info = key.split(" Page ")
    page_number = int(page_info)
    pdf_doc = fitz.open(pdf_name)
    page = pdf_doc[page_number]
    
    zoom = 4
    matrix = fitz.Matrix(zoom, zoom)
    for i, bbox in enumerate(coords):
        # The page image was resized back to the page's own coordinate
        # space, so the detected coordinates can be used directly
        xmin, ymin, xmax, ymax = bbox

        # Create a rectangle from the bounding box
        rect = fitz.Rect(xmin, ymin, xmax, ymax)
            
        # Crop using get_pixmap with a maximum resolution matrix
        cropped_pix = page.get_pixmap(
            clip=rect,
            matrix=matrix,
            alpha=False,
            colorspace=fitz.csRGB
        )
        
        cropped_img = Image.frombytes("RGB", [cropped_pix.width, cropped_pix.height], cropped_pix.samples)
        cropped_img = enhance_image(cropped_img)
        
        output_filename = os.path.join(tmp_dir, f"{os.path.splitext(os.path.basename(pdf_name))[0]}_page{page_number}_{i}.png")
        
        # Save the cropped image
        cropped_img.save(output_filename, "PNG", quality=100, optimize=False, dpi=(300, 300))
    pdf_doc.close()

def extract_text_from_image(image_path, model):
    """Extracts text from a single image using DocTr."""
    doc = DocumentFile.from_images(image_path)
    result = model(doc)
    text_content = ""
    for page in result.pages:
        for block in page.blocks:
            for line in block.lines:
                for word in line.words:
                    text_content += word.value + " "
            text_content += "\n"
    return text_content.strip()

def process_cropped_images(tmp_dir, pdf_list):
    """Iterates through cropped images, extracts text using DocTr and stores the text in text files."""
    
    doctr_model = ocr_predictor(pretrained=True)
    
    for pdf_path in pdf_list:
        pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
        output_txt_path = f"{pdf_name}_extracted_text.txt"
        
        with open(output_txt_path, 'w', encoding='utf-8') as outfile:
            
            pdf_doc = fitz.open(pdf_path)
            num_pages = pdf_doc.page_count
            pdf_doc.close()
            for page_num in range(num_pages):
                
                outfile.write(f"Page: {page_num}\n")
                
                # Sort filenames of cropped images by chunk order
                cropped_images_for_page = sorted([
                    f for f in os.listdir(tmp_dir)
                    if f.startswith(f"{pdf_name}_page{page_num}_") and f.endswith(".png")
                ], key=lambda f: int(f.split("_")[-1].split(".")[0]))
                
                for i, image_filename in enumerate(cropped_images_for_page):
                    image_path = os.path.join(tmp_dir, image_filename)
                    text = extract_text_from_image(image_path, doctr_model)
                    outfile.write(f"  Chunk {i}: {text}\n")
        print(f"Text extracted from {pdf_name} saved to {output_txt_path}")

# Example usage:
tmp_dir = 'tmp' # Make sure your tmp directory exists
pdf_list = ['test.pdf'] # Your list of PDFs
process_cropped_images(tmp_dir, pdf_list)
  1. Setup: Import the libraries, load the YOLO model for document segmentation, and load DocTr for OCR.
  2. Image preprocessing: Load the PDF with fitz (PyMuPDF), enhance image quality, and save temporary page images.
  3. Document segmentation: The YOLO model detects and localizes the text blocks.
  4. Cropping: Crop each region according to its bounding box and save the cropped images.
  5. OCR: Use DocTr to extract text from each cropped region.
  6. Output: Save the extracted text, numbering each chunk on every page.
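The reading-order heuristic used in the segmentation step can be isolated as a small pure function, which makes the two-column assumption easy to test and tune (the function name is mine, not from the script above):

```python
def reading_order(boxes, page_width):
    """Sort xyxy bounding boxes into two-column reading order:
    boxes whose left edge falls in the first third of the page
    form the left column; each column is read top to bottom."""
    third = page_width / 3
    left = [b for b in boxes if b[0] < third]
    right = [b for b in boxes if b[0] >= third]
    return sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])
```

The one-third split is a heuristic that works for typical two-column pages; a more general solution would cluster boxes by their x-coordinates instead of using a fixed threshold.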

Pros:

  • Scales well to large PDFs.
  • High accuracy on complex layouts.
  • Open source and customizable.

Cons:

  • More complex to implement.
  • Slightly less accurate than VLMs on very complex layouts.

Let's test it:

Output:

Fund Summary

• Investors in our Bitcoin Discovery Fund will become shareholders of a company that
owns fractional interests in bitcoin mines. As these mines discover bitcoin, the bitcoin will
be periodically distributed to owners, in proportion to their overall interest in the fund.

• As an investor, you will enjoy the accumulation of bitcoin at potentially below-market
rates, possible protection from inflation and opportunities for tax advantages. You will be
able to choose whether to receive your distributions in either BTC or in USD.


ENERGYFUNDERS BITCOIN DISCOVERY FUND POTENTIAL PROFIT SCENARIOS


Assumptions in Financial Forecast
• Each scenario reflects a forecast of potential profit
outcome. Scenarios assume initial bitcoin prices of
$20k, $60k, and $100k per bitcoin, plus investment
levels of $10k, $250k, and $1M.
• Natural gas price assumptions at $5.50/MCF.
Bitcoin network difficulty rate of 26.6 T.
• Our expected production cost ranges from $15,000
to 30,000 per bitcoin. Our anticipated cost of off-grid
power generation (natural gas) = $0.03-$0.07/kwh.
• We assume a 2.75% monthly increase in bitcoin
price until May 2024. In 2024, we assume the rate of
issuance of the bitcoin from the network to the
miners will halve from 6.25 to 3.125 BTC per block.
This halving has historically occurred every four
years, leading to pricing increases and supply
shocks. After May 2024, we project a monthly
bitcoin price increase of 5.5%.


$20,000 Bitcoin
$60,000 Bitcoin
$100,000 Bitcoin

Initial Investment,$10,000,$250,000,$1,000,000
Year 1 Return,$295,$8,620,$41,978
Year 2 Return,($178),($4,443),($17,772)
Year 3 Return,($1,002),($25,043),($100,171)
Month 37 Return,$297,$7,413,$29,652
Total Profit,($10,588),($263,453),($1,046,313)
IRR,0.00%,0.00%,0.00%
Total Return,-105.88%,-105.38%,-104.63%

Initial Investment,$10,000,$250,000,$1,000,000
Year 1 Return,$6,518,$164,188,$671,750
Year 2 Return,$4,527,$133,172,$452,688
Year 3 Return,$2,279,$56,997,$227,906
Month 37 Return,$4,133,$103,318,$413,273
Total Profit,$7,456,$187,654,$765,618
IRR,35.99%,36.43%,37.79%
Total Return,74.56%,75.06%,76.56%

Initial Investment,$10,000,$250,000,$1,000,000
Year 1 Return,$13,703,$343,830,$1,382,821
Year 2 Return,$12,118,$304,699,$1,218,795
Year 3 Return,$6,012,$150,305,$601,221
Month 37 Return,$7,835,$195,876,$783,503
Total Profit,$29,738,$744,710,$2,986,341
IRR,153.09%,154.32%,158.12%
Total Return,297.38%,297.88%,298.63%


The internal rate of return (IRR) is the annual rate of growth that an investment is expected to generate. The calculation excludes external variables, including inflation, the cost of capital, or the risk-free rate of
return. Total profit refers to the total dollar value returned, minus the upfront investment, providing a measure of net profit. Total return is a measure of the total profit relative to the total upfront investment.

*For illustrative purposes only. Based on uncertain and imperfect assumptions and future projections. No implicit or explicit guarantee of performance.


IF YOU'D LIKE TO LEARN MORE, VISIT OUR WEBSITE AT ENERGYFUNDERS.COM OR REACH OUT TO OUR TEAM AT INFO@ENERGYFUNDERS.COM

3. Conclusion

Extracting text from complex scanned PDFs is challenging, but far from impossible. Vision language models (VLMs) offer higher accuracy for smaller documents, while document segmentation combined with OCR is better suited to larger ones. The best choice depends on your requirements, the page count of your PDFs, and the resources available.


Original article: Decoding the Scanned Page: Best Methods for Extracting Text from Complex Scanned PDFs

Translated and compiled by 汇智网; please credit the source when republishing.