A Hands-On AI Agent for Data Visualization

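Plotly is my favorite data visualization library. After writing extensively about building advanced visualizations with Plotly, I started to wonder: could I teach a language model to build the visualizations I like by giving it nothing more than a DataFrame and natural-language instructions? This project is the result of experimenting with that idea, and I'm happy to share what I found.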

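1. Why Build an AI Agent?

If you have tried LLMs such as ChatGPT, you know they can generate code for almost any language or package. Relying on an LLM alone has limits, though. These are the key problems I set out to solve by building an agent:

  • Describing your data: An LLM knows nothing about the specifics of your dataset, such as column names and row-level details. Supplying that information by hand is tedious, especially as datasets grow. Without this context, the LLM may hallucinate or invent column names, which leads to broken visualizations.
  • Styling and preferences: Data visualization is an art form, and everyone has aesthetic preferences that vary by chart type and message. Repeating detailed instructions to the LLM for every visualization gets tedious. An agent equipped with styling information streamlines this and keeps the visual output consistent and personalized.
  • Agent reasoning: A ReAct agent can "reason" and carry out tasks, which yields more accurate responses and fewer hallucinations. This prompt-engineering technique has been shown to produce more robust and reliable results. You can read more about ReAct agents in the ReAct paper.

Building an agent mitigates these problems and offers a more efficient, more tailored approach to data visualization and other tasks.

The baseline below shows the response when I asked Llama3:70B (the same LLM I use for the final agent) to build a visualization:
[Figure: baseline response from Llama3:70B]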

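2. System Design

To build this application, we equip the LLM agent with two tools that help it generate better data visualizations. One tool provides information about the dataset; the other contains information about styling.

Llama-index allows any query engine to be used as an agent tool. Since both tools involve information retrieval, query engine tools fit our needs.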

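3. DataFrame Index

The purpose of this tool is to analyze the DataFrame and store information about its contents in an index. The indexed data includes column names, data types, and the minimum, maximum, and mean of the values. This helps the agent understand what kinds of variables it is working with.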

This example uses data from layoffs.fyi, but the tool can handle any DataFrame.
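To make the idea concrete, here is a minimal sketch on a made-up three-row DataFrame. The column names and values are invented for illustration and are not taken from the layoffs data, and `summarize_column` is a hypothetical helper rather than the project's own function. It produces the kind of per-column record described above.

```python
import json
import pandas as pd

def summarize_column(series: pd.Series) -> dict:
    """Return a JSON-friendly summary of one column (hypothetical helper)."""
    if pd.api.types.is_numeric_dtype(series):
        info = {
            "min": float(series.min()),
            "max": float(series.max()),
            "mean": float(series.mean()),
        }
    elif pd.api.types.is_datetime64_any_dtype(series):
        # Timestamps are not JSON serializable, so store them as strings
        info = {"min": str(series.min()), "max": str(series.max())}
    else:
        # Categorical/text column: the 10 most frequent values and their counts
        info = {str(k): int(v) for k, v in series.value_counts().head(10).items()}
    return {
        "column_name": series.name,
        "type": str(series.dtype),
        "variable_information": info,
    }

# Toy data, invented for this example
toy = pd.DataFrame({
    "company": ["A", "B", "A"],
    "workers": [10, 30, 20],
    "notice_date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
})

summary = {col: summarize_column(toy[col]) for col in toy.columns}
print(json.dumps(summary, indent=2))
```

A record like this is small enough to retrieve in a single query, yet it tells the agent the exact column names and the rough scale of each variable.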

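4. Preprocessing

Preprocessing is essential and varies from dataset to dataset. I recommend converting the data to appropriate types (for example, turning numeric strings into integers or floats) and handling null values.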

#Optional pre-processing
import pandas as pd
import numpy as np


df = pd.read_csv('WARN Notices California_Omer Arain - Sheet1.csv')

# Convert date-like columns to datetime
df['Received Date'] = pd.to_datetime(df['Received Date'])
df['Effective Date'] = pd.to_datetime(df['Effective Date'])
# Convert numbers stored as strings (e.g. "1,200") into numeric values;
# anything that cannot be parsed becomes NaN
df['Number of Workers'] = pd.to_numeric(df['Number of Workers'].astype(str).str.replace(',', ''), errors='coerce')
# Replace NULL values with 0
df = df.replace(np.nan, 0)
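
Because preprocessing differs per dataset, the cleanup can also be wrapped in a reusable function. The sketch below is one possible approach and not part of the original project: `clean_frame` and its parameters are hypothetical names, it assumes commas are thousands separators, and it drops incomplete rows instead of filling them with 0.

```python
import pandas as pd

def clean_frame(df, date_cols=(), numeric_cols=()):
    """Return a cleaned copy: parse dates, strip commas from numbers, drop incomplete rows."""
    out = df.copy()
    for col in date_cols:
        # Unparseable dates become NaT instead of raising an error
        out[col] = pd.to_datetime(out[col], errors="coerce")
    for col in numeric_cols:
        as_text = out[col].astype(str).str.replace(",", "", regex=False)
        # Unparseable numbers become NaN instead of raising an error
        out[col] = pd.to_numeric(as_text, errors="coerce")
    # Drop rows where any of the converted columns is missing
    return out.dropna(subset=list(date_cols) + list(numeric_cols))

# Toy data, invented for this example
raw = pd.DataFrame({
    "Received Date": ["2024-01-05", "not a date", "2024-03-01"],
    "Number of Workers": ["1,200", "35", None],
})
clean = clean_frame(raw, date_cols=["Received Date"], numeric_cols=["Number of Workers"])
print(clean)
```

Whether to drop or fill missing values is a judgment call: dropping keeps summary statistics such as the mean honest, while filling with 0 keeps every row available for plotting.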

5. Storing Dataset Information in an Index

Here is how to implement the DataFrame index:

from llama_index.core.readers.json import JSONReader
from llama_index.core import VectorStoreIndex
import json

# Function that stores the max, min & mean for numerical values
def return_vals(df, c):
    if pd.api.types.is_numeric_dtype(df[c]):
        # Cast to plain floats so json.dump can serialize them
        return [float(df[c].max()), float(df[c].min()), float(df[c].mean())]
    # For datetime we need to store that information as strings
    elif pd.api.types.is_datetime64_any_dtype(df[c]):
        return [str(df[c].max()), str(df[c].min()), str(df[c].mean())]
    else:
        # For categorical variables, store the 10 most frequent items and their counts
        return {str(k): int(v) for k, v in df[c].value_counts().head(10).items()}

# Build a dictionary describing every column
dict_ = {}
for c in df.columns:
    # Store the column name, data type and content summary
    dict_[c] = {'column_name': c, 'type': str(type(df[c].iloc[0])), 'variable_information': return_vals(df, c)}

# Write the information to dataframe.json so it can be loaded
# into a llama-index Document
with open("dataframe.json", "w") as fp:
    json.dump(dict_, fp)


reader = JSONReader()
# Load data from JSON file
documents = reader.load_data(input_file='dataframe.json')

# Creating an Index
dataframe_index =  VectorStoreIndex.from_documents(documents)

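6. Styling Index

The styling tool acts as a document store containing natural-language instructions on how to style different chart types in Plotly. I encourage you to experiment with the instructions. Here is how I wrote them for line charts and bar charts: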

from llama_index.core import Document
from llama_index.core import VectorStoreIndex

styling_instructions = [
    Document(text="""
        Do not ignore any of these instructions.
        For a line chart always use plotly_white template, reduce x axes & y axes line to 0.2 & x & y grid width to 1.
        Always give a title and make bold using html tag axis label and try to use multiple colors if more than one line
        Annotate the min and max of the line
        Display numbers in thousand(K) or Million(M) if larger than 1000/1000000
        Show percentages in 2 decimal points with '%' sign
        """),
    Document(text="""
        Do not ignore any of these instructions.
        For a bar chart always use plotly_white template, reduce x axes & y axes line to 0.2 & x & y grid width to 1.
        Always give a title and make bold using html tag axis label and try to use multiple colors if more than one bar series
        Always display numbers in thousand(K) or Million(M) if larger than 1000/1000000. Add annotations x values
        Annotate the values on the y variable
        If variable is a percentage show in 2 decimal points with '%' sign.
        """),
    # You should fill in instructions for other charts and play around with these instructions
    Document(text="""
        General chart instructions
        Do not ignore any of these instructions.
        Always use plotly_white template, reduce x & y axes line to 0.2 & x & y grid width to 1.
        Always give a title and make bold using html tag axis label
        Always display numbers in thousand(K) or Million(M) if larger than 1000/1000000. Add annotations x values
        If variable is a percentage show in 2 decimal points with '%' sign.
        """),
]
# Creating an Index
style_index =  VectorStoreIndex.from_documents(styling_instructions)
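
The "thousand(K) or Million(M)" and percentage instructions leave the exact formatting up to the LLM. For reference, here is a minimal sketch of what such label formatting might look like in plain Python. `format_number` and `format_percent` are hypothetical helpers, the thresholds of 1,000 and 1,000,000 are my reading of the instruction, and treating percentages as fractions (0.125 for 12.5%) is an assumption.

```python
def format_number(value: float) -> str:
    """Abbreviate large numbers: 1500 -> '1.5K', 2000000 -> '2M'."""
    for threshold, suffix in ((1_000_000, "M"), (1_000, "K")):
        if abs(value) >= threshold:
            scaled = value / threshold
            # One decimal place, with a trailing ".0" removed
            return f"{scaled:.1f}".rstrip("0").rstrip(".") + suffix
    return f"{value:g}"

def format_percent(fraction: float) -> str:
    """Render a fraction as a percentage with two decimals (assumes 0.125 means 12.5%)."""
    return f"{fraction * 100:.2f}%"

for sample in (999, 1500, 100000, 2000000):
    print(sample, "->", format_number(sample))
print(format_percent(0.125))
```

Strings like these can be passed to a chart as annotation or tick text, which is one way to check whether the agent's generated code actually follows the styling document.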

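7. Building the Agent

The indexes we created need to be made available to the agent as tools. Llama-Index lets you build a query engine from an index and use it as a tool.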

#All imports for this section
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool
from llama_index.core.tools import  ToolMetadata
from llama_index.llms.groq import Groq


# Build query engines over your indexes
# It makes sense to only retrieve one document per query 
# However, you may play around with this if you need multiple charts
# Or have two or more dataframes with similar column names
dataframe_engine = dataframe_index.as_query_engine(similarity_top_k=1)
styling_engine = style_index.as_query_engine(similarity_top_k=1)

# Builds the tools
query_engine_tools = [
    QueryEngineTool(
        query_engine=dataframe_engine,
        # The description helps the agent decide which tool to use
        metadata=ToolMetadata(
            name="dataframe_index",
            description="Provides information about the data in the data frame. Only use column names in this tool",
        ),
    ),
    QueryEngineTool(
        # Play around with the description to see if it leads to better results
        query_engine=styling_engine,
        metadata=ToolMetadata(
            name="Styling",
            description="Provides instructions on how to style your Plotly plots. "
            "Use a detailed plain text question as input to the tool.",
        ),
    ),
]

# I used open-source models via Groq but you can use OpenAI/Google/Mistral models as well
llm = Groq(model="llama3-70b-8192", api_key="<your_api_key>")

# initialize ReAct agent
agent = ReActAgent.from_tools(query_engine_tools, llm=llm, verbose=True)

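8. Tuning the Agent Prompt

Llama-Index and other orchestration packages ship with default prompts that may not suit your specific use case. While experimenting, I found that slightly adjusting the prompt helped prevent hallucinations.

[Figure: the default prompt of the ReAct agent]
[Figure: the adjusted prompt, with the changes highlighted in yellow]

from llama_index.core import PromptTemplate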

new_prompt_txt= """You are designed to help with building data visualizations in Plotly. You may do all sorts of analyses and actions using Python

## Tools

You have access to a wide variety of tools. You are responsible for using the tools in any sequence you deem appropriate to complete the task at hand.
This may require breaking the task into subtasks and using different tools to complete each subtask.

You have access to the following tools, use these tools to find information about the data and styling:
{tool_desc}


## Output Format

Please answer in the same language as the question and use the following format:

```
Thought: The current language of the user is: (user's language). I need to use a tool to help me answer the question.
Action: tool name (one of {tool_names}) if using a tool.
Action Input: the input to the tool, in a JSON format representing the kwargs (e.g. {{"input": "hello world", "num_beams": 5}})
```

Please ALWAYS start with a Thought.

Please use a valid JSON format for the Action Input. Do NOT do this {{'input': 'hello world', 'num_beams': 5}}.

If this format is used, the user will respond in the following format:

```
Observation: tool response
```

You should keep repeating the above format till you have enough information to answer the question without using any more tools. At that point, you MUST respond in the one of the following two formats:

```
Thought: I can answer without using any more tools. I'll use the user's language to answer
Answer: [your answer here (In the same language as the user's question)]
```

```
Thought: I cannot answer the question with the provided tools.
Answer: [your answer here (In the same language as the user's question)]
```

## Current Conversation

Below is the current conversation consisting of interleaving human and assistant messages."""

# Adding the prompt text into PromptTemplate object
new_prompt = PromptTemplate(new_prompt_txt)

# Updating the prompt
agent.update_prompts({'agent_worker:system_prompt':new_prompt})
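
Note the doubled braces in the prompt text: `{tool_desc}` and `{tool_names}` are placeholders that get filled in, while `{{...}}` is meant to come out as literal braces in the JSON example. The sketch below uses plain `str.format` as a stand-in to show that convention. This is an assumption for illustration: it does not call Llama-Index, and the template is a shortened, made-up excerpt.

```python
# Shortened stand-in for the prompt text; not the full prompt
template = (
    "Available tools: {tool_names}\n"
    'Action Input example: {{"input": "hello world"}}'
)

# Single braces are substituted, doubled braces collapse to literal braces
rendered = template.format(tool_names="dataframe_index, Styling")
print(rendered)
```

If you edit the prompt and add your own JSON examples, forgetting to double the braces is an easy mistake to make, so it is worth rendering the template once before handing it to the agent.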

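9. Visualization

Now for the fun part. In the first section, I showed how Llama 3 responded to my request to build a visualization. Let's send a similar request to the agent.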

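response = agent.chat("Give Plotly code for a line chart for Number of Workers get information from the dataframe about the correct column names and make sure to style the plot properly and also give a title")

You can see how the agent breaks the request down and finally responds with Python code. You can build an output parser for the response, or simply copy, paste, and run the code.
In the chart produced by running the generated code, the annotations, labels/title, and axis formatting match the styling instructions exactly. The agent also does not make up data, because it already has information about the DataFrame.
When asked for a bar chart, the generated code was able to use groupby and compute the mean. However, it did not generate annotations for the bar values.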
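
The output parser mentioned above can be quite small. Here is a minimal sketch, assuming the agent wraps its code in Markdown-style triple-backtick fences, which is typical but not guaranteed. `extract_python_code` is a hypothetical helper and the sample reply is made up.

```python
import re

# Three backticks, built programmatically so this example contains no literal fence
FENCE = "`" * 3

def extract_python_code(reply: str) -> str:
    """Return the first fenced code block in reply, or the whole reply if none is found."""
    pattern = re.compile(FENCE + r"(?:python)?\s*\n(.*?)" + FENCE, re.DOTALL)
    match = pattern.search(reply)
    return match.group(1).strip() if match else reply.strip()

# Made-up agent reply for illustration
sample_reply = (
    "Here is the chart:\n"
    + FENCE + "python\n"
    + "import plotly.express as px\n"
    + "fig = px.line(df, x='Received Date', y='Number of Workers')\n"
    + FENCE + "\nLet me know if you need changes."
)

code = extract_python_code(sample_reply)
print(code)
```

In this setup you would presumably pass the text of the agent's response (for example `str(response)`) to the helper. Review the extracted code before running it, since it is generated by the model.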

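With styling instructions and a tuned agent prompt, the responses are noticeably better. The project still has a long way to go, but it can already save you time and give you better visualization code.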

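10. Future Plans

The next phase of the project involves refining the prompts and handling common failure cases. The end goal is a suite of agent tools that saves me, as a data scientist, time at work.


Original article: Building an Agent for Data Visualization (Plotly)

Translated and edited by 汇智网; please credit the source when republishing.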