OpenAI: JSON Mode vs. Structured Outputs

LIBRARY Nov 9, 2024

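Imagine you work at a law firm and are trying to streamline contract review. Each contract contains many clauses, and each one needs to be organized so that elements such as jurisdiction, summary, key terms, and legal implications stand out. You start by reading every contract by hand, but quickly realize that the process is labor-intensive and inconsistent.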

Exploring automation: extracting structured data from contracts

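To speed things up, you decide to use AI to extract structured data from the unstructured contract text automatically. Generating JSON output that captures all the necessary information would save the team hours of manual work.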

1. First attempt: JSON mode

You start with JSON mode. It looks promising: it guarantees valid JSON output, which is exactly what you need, right? Your JSON mode setup might look like this:

from openai import OpenAI
import json

client = OpenAI()

# Define the contract text
contract_text = """
Title: Service Agreement between ABC Corp and XYZ Inc.

Introduction: This Service Agreement ("Agreement") is made and entered into on the 15th day of January 2024, 
by and between ABC Corp, a corporation located in California, and XYZ Inc., a corporation located in Texas.

Scope of Services: XYZ Inc. will provide software development and maintenance services to ABC Corp as outlined in Appendix A.

Term: The Agreement shall commence on January 15, 2024, and shall continue for a period of one year, 
with an option for renewal upon mutual agreement.

Confidentiality: Both parties agree to keep all exchanged information confidential.

Termination: Either party may terminate the Agreement with a 30-day written notice under conditions outlined in Section 9.

Jurisdiction: This Agreement shall be governed by the laws of the State of California.

Signatures: This Agreement is signed by representatives of ABC Corp and XYZ Inc.
"""

# Set up the JSON mode request
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a legal assistant who summarizes contracts in JSON format."},
        {
            "role": "user",
            "content": f"""Analyze the following contract and return in JSON format the title, summary, 
            key terms, jurisdiction, and legal implications. Ensure all text is valid JSON.

            Contract text:
            {contract_text}
            """
        }
    ],
    response_format={"type": "json_object"}
)

# Parse and print the JSON response
json_response = completion.choices[0].message.content
parsed_response = json.loads(json_response)
print(json.dumps(parsed_response, indent=4))

A run returns an object like the following:

{
    "title": "Service Agreement between ABC Corp and XYZ Inc.",
    "summary": "This Service Agreement is established between ABC Corp and XYZ Inc. for software development and maintenance services to be provided by XYZ Inc. to ABC Corp. The agreement is effective from January 15, 2024, for one year with the possibility of renewal. Both parties are obligated to keep information confidential, and either party can terminate the agreement with a 30-day notice. The governing law is that of the State of California.",
    "key_terms": {
        "parties": [
            "ABC Corp",
            "XYZ Inc."
        ],
        "effective_date": "January 15, 2024",
        "duration": "1 year",
        "renewal": "Option for renewal upon mutual agreement",
        "services": "Software development and maintenance services",
        "confidentiality": "Both parties agree to keep all exchanged information confidential.",
        "termination": "30-day written notice",
        "signatures": "Signed by representatives of both ABC Corp and XYZ Inc."
    },
    "jurisdiction": "State of California",
    "legal_implications": "Both parties must adhere to confidentiality agreements and recognize the jurisdiction is limited to California state law. The termination clause allows either party an ability to end the contract with due notice, offering flexibility in the event of unforeseen circumstances."
}

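The output is valid JSON! But after a few test runs you notice a problem: the structure of the output varies. Sometimes the field names differ slightly, or key terms are missing. You realize that JSON mode does not enforce any particular format. It only guarantees valid JSON, which leaves room for inconsistency. 😕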
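One way to make this kind of drift visible is to compare the top-level keys of each parsed response against the fields you expected. The sketch below is illustrative only: `run_a` and `run_b` are hand-written stand-ins for two JSON mode responses (not real API output), and `EXPECTED_FIELDS` and `report_drift` are hypothetical names introduced here.

```python
import json

# Hypothetical: the top-level fields we hoped every response would contain.
EXPECTED_FIELDS = {"title", "summary", "key_terms", "jurisdiction", "legal_implications"}


def report_drift(raw_json: str) -> dict:
    """Return the expected keys that are missing and the keys that are unexpected."""
    parsed = json.loads(raw_json)  # JSON mode output parses, but its shape is not guaranteed
    keys = set(parsed)
    return {
        "missing": sorted(EXPECTED_FIELDS - keys),
        "unexpected": sorted(keys - EXPECTED_FIELDS),
    }


# Hand-written stand-ins for two runs of the same prompt (not real API output).
run_a = '{"title": "T", "summary": "S", "key_terms": {}, "jurisdiction": "CA", "legal_implications": "L"}'
run_b = '{"title": "T", "summary": "S", "keyTerms": [], "jurisdiction": "CA"}'

print(report_drift(run_a))
print(report_drift(run_b))
```

A check like this only detects inconsistency after the fact; it does nothing to prevent it.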

2. The solution: Structured Outputs

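Then you discover Structured Outputs. Unlike JSON mode, Structured Outputs enforces a strict schema, ensuring that every response conforms to the structure you define and that all required fields are present.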

You define a schema for the contract summary:

import json
from openai import OpenAI

client = OpenAI()

# Contract summarization prompt and schema setup
contract_summarizer_prompt = '''
    You are an AI legal assistant. Given a legal contract, summarize its key points in a structured JSON format. 
    Include the title, a brief summary, a list of key legal terms, jurisdiction, and any legal implications or important clauses.
'''

MODEL = "gpt-4o"

def get_contract_summary(contract_text):
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {
                "role": "system", 
                "content": contract_summarizer_prompt
            },
            {
                "role": "user", 
                "content": contract_text
            }
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "legal_summary",
                "schema": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "summary": {"type": "string"},
                        "key_terms": {"type": "array", "items": {"type": "string"}},
                        "jurisdiction": {"type": "string"},
                        "implications": {"type": "string"}
                    },
                    "required": ["title", "summary", "key_terms", "jurisdiction", "implications"],
                    "additionalProperties": False
                },
                "strict": True
            }
        }
    )
    return response.choices[0].message

# Example contract text
contract_text = """
Title: Service Agreement between ABC Corp and XYZ Inc.

Introduction: This Service Agreement ("Agreement") is made and entered into on the 15th day of January 2024, 
by and between ABC Corp, a corporation located in California, and XYZ Inc., a corporation located in Texas.

Scope of Services: XYZ Inc. will provide software development and maintenance services to ABC Corp as outlined in Appendix A.

Term: The Agreement shall commence on January 15, 2024, and shall continue for a period of one year, 
with an option for renewal upon mutual agreement.

Confidentiality: Both parties agree to keep all exchanged information confidential.

Termination: Either party may terminate the Agreement with a 30-day written notice under conditions outlined in Section 9.

Jurisdiction: This Agreement shall be governed by the laws of the State of California.

Signatures: This Agreement is signed by representatives of ABC Corp and XYZ Inc.
"""

# Run the function and print the structured output
result = get_contract_summary(contract_text)

# Parse the JSON content
parsed_content = json.loads(result.content)

# Print the parsed content with indentation for readability
print(json.dumps(parsed_content, indent=4))

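This returns the following JSON object. The values can differ from run to run, but every run of this completion request reliably returns exactly this structure: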

{
    "title": "Service Agreement between ABC Corp and XYZ Inc.",
    "summary": "This Service Agreement is between ABC Corp and XYZ Inc. for the provision of software development and maintenance services. It is effective from January 15, 2024, and will last one year with an option for renewal. The agreement includes confidentiality obligations and allows termination with a 30-day notice. It is governed by California law.",
    "key_terms": [
        "Service Agreement",
        "Scope of Services",
        "Term",
        "Confidentiality",
        "Termination",
        "Jurisdiction"
    ],
    "jurisdiction": "California",
    "implications": "The Agreement mandates confidentiality, includes a renewal option, and allows either party to terminate with notice. It is subject to California state laws, which means any legal disputes will be resolved under California jurisdiction."
}
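Before parsing, it may be worth guarding against two cases in which the content is not schema-conforming JSON: the response was cut off at the token limit (`finish_reason` of `"length"`), or the model declined the request and set `message.refusal`. The following is a minimal sketch, assuming a `choice` object shaped like `response.choices[0]` from the SDK; `parse_structured_choice` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for real responses.

```python
import json
from types import SimpleNamespace


def parse_structured_choice(choice) -> dict:
    """Parse a structured-output choice, failing loudly on truncation or refusal."""
    if choice.finish_reason == "length":
        # The output hit the token limit, so the JSON may be incomplete.
        raise ValueError("response truncated before the JSON was complete")
    refusal = getattr(choice.message, "refusal", None)
    if refusal:
        raise ValueError(f"model refused: {refusal}")
    return json.loads(choice.message.content)


# Stand-ins for SDK objects, for illustration only.
ok = SimpleNamespace(
    finish_reason="stop",
    message=SimpleNamespace(refusal=None, content='{"title": "Service Agreement"}'),
)
refused = SimpleNamespace(
    finish_reason="stop",
    message=SimpleNamespace(refusal="I can't help with that.", content=None),
)

print(parse_structured_choice(ok)["title"])
try:
    parse_structured_choice(refused)
except ValueError as err:
    print(err)
```

To use a helper like this with the function above, `get_contract_summary` would need to return `response.choices[0]` rather than only the message.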

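With Structured Outputs, every response follows the schema exactly, giving you consistent and reliable summaries. Every contract review includes the title, summary, key terms, jurisdiction, and legal implications, with no surprises and no missing fields. You and your legal team are happy with the predictable format, which makes it easy to build tools for automatically searching and classifying contracts.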
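As a rough illustration of such a tool, the sketch below loads schema-conforming summaries into a dataclass and builds a simple jurisdiction index and key-term search. The names `ContractSummary`, `load_summary`, `group_by_jurisdiction`, and `search_by_term` are hypothetical, and the two sample records are hand-written placeholders rather than real API output.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractSummary:
    title: str
    summary: str
    key_terms: list
    jurisdiction: str
    implications: str


def load_summary(raw_json: str) -> ContractSummary:
    # The schema requires all five fields and forbids extras, so ** unpacking lines up.
    return ContractSummary(**json.loads(raw_json))


def group_by_jurisdiction(summaries) -> dict:
    """Map each jurisdiction to the titles of the contracts governed by it."""
    index = defaultdict(list)
    for s in summaries:
        index[s.jurisdiction].append(s.title)
    return dict(index)


def search_by_term(summaries, term: str) -> list:
    """Return titles of contracts with a key term containing `term` (case-insensitive)."""
    term = term.lower()
    return [s.title for s in summaries if any(term in t.lower() for t in s.key_terms)]


# Hand-written placeholder records that follow the schema (not real API output).
samples = [
    '{"title": "Service Agreement", "summary": "placeholder", '
    '"key_terms": ["Confidentiality", "Termination"], '
    '"jurisdiction": "California", "implications": "placeholder"}',
    '{"title": "Hypothetical Lease", "summary": "placeholder", '
    '"key_terms": ["Rent", "Renewal"], '
    '"jurisdiction": "Texas", "implications": "placeholder"}',
]
summaries = [load_summary(raw) for raw in samples]

print(group_by_jurisdiction(summaries))
print(search_by_term(summaries, "confidential"))
```

Because the field names never change, `load_summary` needs no per-response special cases; with JSON mode output, the same unpacking could fail whenever a key was renamed or omitted.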

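In summary, JSON mode guarantees only that the output is valid JSON, while Structured Outputs also guarantees that the output matches the schema you define, with every required field present.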


Original article: Transform Unstructured Legal Text into Organized Data with OpenAI’s Structured Outputs

Translated and compiled by 汇智网; please credit the source when reposting.
