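LLM-Based Splitter for Large Documents
When a large body of text does not fit into the context window of a large language model (LLM), it has to be split into smaller pieces. This is common practice when building retrieval-augmented generation (RAG) applications. However, existing text splitters may fail to preserve the semantic relationships between chunks.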
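In this article, we look at the limitations of RecursiveCharacterTextSplitter and introduce a new solution, the LLM-based context splitter, which uses a large language model to preserve context and improve retrieval accuracy.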
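1. RecursiveCharacterTextSplitter
LangChain provides several text splitters. Ideally, a splitter keeps semantically related pieces of text together. What counts as "semantically related" can depend on the type of text.
Some LangChain text splitters work well on Markdown documents or code files, but the most popular one, RecursiveCharacterTextSplitter, is a poor fit in many cases.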
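RecursiveCharacterTextSplitter is the default text splitter that LangChain recommends. It takes a list of separator strings. It tries to split on the first separator, and if any resulting piece is still too large, it moves on to the next separator, and so on. By default, the separators it tries are ["\n\n", "\n", " ", ""].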
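The fallback behavior described above can be illustrated with a minimal pure-Python sketch. This is not LangChain's actual implementation: the function name `recursive_split` and its greedy merging of small pieces are assumptions made for illustration, and only the default separator list comes from the text.

```python
from typing import List, Sequence

# Default separators as listed in the article.
DEFAULT_SEPARATORS = ("\n\n", "\n", " ", "")


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = DEFAULT_SEPARATORS) -> List[str]:
    """Illustrative sketch only; not LangChain's actual implementation."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separator left (or the empty separator): cut into fixed-width slices.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in (p for p in text.split(sep) if p):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep growing the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is too large on its own: retry with the next separator.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\ncccc dddd eeee\n\nffff"
print(recursive_split(sample, chunk_size=10))
```

Note that nothing in this procedure looks at meaning: the cut points depend only on separators and lengths, which is the limitation the rest of the article is concerned with.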
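Besides the separators, several other aspects can be controlled:
- length_function: how the length of a chunk is computed. The default simply counts characters, but it is common to pass a token counter here.
- chunk_size: the maximum size of a chunk (as measured by the length function).
- chunk_overlap: the maximum overlap between chunks. Some overlap helps maintain continuity between chunks (for example, a sliding window).
- add_start_index: whether to include each chunk's starting position in the original document in its metadata.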
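To make the interaction between length_function, chunk_size, and chunk_overlap concrete, here is a small word-level sketch. It is a hypothetical helper (`chunk_words` is not a LangChain function) that only borrows the parameter names from the list above; it does not implement add_start_index.

```python
from typing import Callable, List


def chunk_words(text: str, chunk_size: int, chunk_overlap: int = 0,
                length_function: Callable[[str], int] = len) -> List[str]:
    """Greedy word-level chunking with overlap (illustrative sketch only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    words = text.split()
    chunks: List[str] = []
    start = 0
    while start < len(words):
        end = start + 1  # always take one word, even if it alone is too long
        # Grow the window while the joined text still fits into chunk_size.
        while (end < len(words)
               and length_function(" ".join(words[start:end + 1])) <= chunk_size):
            end += 1
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        # Step back over trailing words that fit into chunk_overlap, so the
        # next chunk repeats them; always advance by at least one word.
        back = end
        while (back - 1 > start
               and length_function(" ".join(words[back - 1:end])) <= chunk_overlap):
            back -= 1
        start = back
    return chunks


def count_tokens(s: str) -> int:
    """Stand-in token counter: whitespace-separated words."""
    return len(s.split())


text = "one two three four five six"
print(chunk_words(text, chunk_size=13))                      # length in characters
print(chunk_words(text, chunk_size=2, chunk_overlap=1,
                  length_function=count_tokens))             # length in tokens
```

With the character-based default, chunk_size=13 means 13 characters; with the stand-in token counter, chunk_size=2 means two words, and chunk_overlap=1 makes consecutive chunks share one word.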
Below is a small sample input used to test how RecursiveCharacterTextSplitter behaves. The document is kept small so that it is easy to follow, but the same ideas apply to larger texts:
The Amanita phalloides has a large and imposing epigeous (above ground) fruiting body (basidiocarp).
A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all white.
A. phalloides, a.k.a. Death Cap, is one of the most poisonous of all known mushrooms.
Gala apples are a popular variety known for their sweet flavor and crisp texture.
They have a distinctive reddish-orange skin with yellow striping, making them visually appealing in fruit displays.
Originally developed in New Zealand in the 1930s, they have since become a favorite in many countries and are widely cultivated for consumption.
Their versatility makes them perfect for both eating fresh and using in various culinary dishes.
Radishes are small, root vegetables with a sharp, peppery flavor that can range from mild to spicy.
They are usually round or cylindrical in shape and can come in various colors, including red, white, purple, and black.
Rich in vitamins and minerals, radishes are often consumed raw in salads, but can also be cooked or pickled for different culinary applications.
Their crunchy texture and vibrant color make them a popular addition to dishes, adding both taste and aesthetic appeal.
The text above covers three different topics: mushrooms, apples, and radishes. For this run we use chunk_size = 100 and chunk_overlap = 0. Here is the result:
The Amanita phalloides has a large and imposing epigeous (above ground) fruiting body (basidiocarp).
-----
A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all white.
-----
A. phalloides, a.k.a. Death Cap, is one of the most poisonous of all known mushrooms.
-----
Gala apples are a popular variety known for their sweet flavor and crisp texture.
-----
They have a distinctive reddish-orange skin with yellow striping, making them visually appealing in
-----
fruit displays.
-----
Originally developed in New Zealand in the 1930s, they have since become a favorite in many
-----
countries and are widely cultivated for consumption.
-----
Their versatility makes them perfect for both eating fresh and using in various culinary dishes.
-----
Radishes are small, root vegetables with a sharp, peppery flavor that can range from mild to spicy.
-----
They are usually round or cylindrical in shape and can come in various colors, including red,
-----
white, purple, and black.
-----
Rich in vitamins and minerals, radishes are often consumed raw in salads, but can also be cooked or
-----
pickled for different culinary applications.
-----
Their crunchy texture and vibrant color make them a popular addition to dishes, adding both taste
-----
and aesthetic appeal.
-----
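As you can see, RecursiveCharacterTextSplitter fails to preserve context: each topic is scattered across several chunks, and some sentences are cut in the middle. This matters most for chunk 3, which says that Phalloides (the Death Cap) is one of the most poisonous known mushrooms. You want to be sure that chunks 1 and 2 are never separated from chunk 3.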
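2. LLM-Based Context Splitter
To address the limitations of RecursiveCharacterTextSplitter, I propose the LLM-based context splitter. This new text splitter uses an LLM during indexing to split text into chunks while preserving context.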
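The LLM-based context splitter first uses RecursiveCharacterTextSplitter to split the content into chunks. It then uses an LLM to compare each chunk with the next one and to score their similarity with an index. If the index exceeds a defined threshold, the two chunks are considered part of the same context. The process continues as a sliding window until the similarity index drops below the threshold or the chunk size approaches the desired maximum chunk size.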
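The merging loop described above can be sketched as follows. This is one possible reading of the description, not the author's code: the names `merge_by_context` and `word_overlap` are hypothetical, the pairwise comparison of each original chunk with its successor is an interpretation, and the word-overlap score is only a stand-in for the LLM call so that the example runs without any external service.

```python
from typing import Callable, List


def word_overlap(a: str, b: str) -> float:
    """Stand-in for the LLM similarity index: Jaccard overlap of word sets."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def merge_by_context(chunks: List[str],
                     similarity: Callable[[str, str], float],
                     threshold: float,
                     max_chunk_size: int) -> List[str]:
    """Merge neighboring chunks that the scorer judges to share a context."""
    if not chunks:
        return []
    merged: List[str] = []
    current = chunks[0]    # the group being built
    previous = chunks[0]   # last original chunk added to the group
    for nxt in chunks[1:]:
        fits = len(current) + 1 + len(nxt) <= max_chunk_size
        if fits and similarity(previous, nxt) >= threshold:
            current = current + " " + nxt  # same context: extend the group
        else:
            merged.append(current)         # context changed or size limit hit
            current = nxt
        previous = nxt
    merged.append(current)
    return merged


pieces = ["red apples are sweet", "green apples are sour", "radishes taste peppery"]
print(merge_by_context(pieces, word_overlap, threshold=0.3, max_chunk_size=100))
```

In a real pipeline, `similarity` would wrap a prompt that asks the model to rate how closely two chunks are related; the threshold and the maximum chunk size would then need tuning for the model and corpus at hand.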
Here is the result of applying this algorithm to the original input text:
The Amanita phalloides has a large and imposing epigeous (above ground) fruiting body (basidiocarp). A mushroom with a large fruiting body is the Amanita phalloides. Some varieties are all white. A. phalloides, a.k.a. Death Cap, is one of the most poisonous of all known mushrooms.
-----
Gala apples are a popular variety known for their sweet flavor and crisp texture. They have a distinctive reddish-orange skin with yellow striping, making them visually appealing in fruit displays. Originally developed in New Zealand in the 1930s, they have since become a favorite in many countries and are widely cultivated for consumption. Their versatility makes them perfect for both eating fresh and using in various culinary dishes.
-----
Radishes are small, root vegetables with a sharp, peppery flavor that can range from mild to spicy. They are usually round or cylindrical in shape and can come in various colors, including red, white, purple, and black. Rich in vitamins and minerals, radishes are often consumed raw in salads, but can also be cooked or pickled for different culinary applications. Their crunchy texture and vibrant color make them a popular addition to dishes, adding both taste and aesthetic appeal.
-----
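With this approach, the LLM effectively detects and preserves the context within each split. As a result, the retrieval accuracy of the algorithms used in a RAG pipeline (such as vector similarity search) improves significantly.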
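A toy illustration of why context-preserving chunks help retrieval is given below. It is not a measurement: the bag-of-words vectors stand in for a real embedding model, the helper names (`embed`, `cosine`, `top_chunk`) are hypothetical, and the query is made up for the example.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words counts over lowercase letter runs."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_chunk(query: str, chunks: List[str]) -> str:
    """Return the chunk whose vector is closest to the query vector."""
    q = embed(query)
    return max(chunks, key=lambda c: cosine(q, embed(c)))


split_chunks = [
    "A mushroom with a large fruiting body is the Amanita phalloides.",
    "A. phalloides, a.k.a. Death Cap, is one of the most poisonous of all known mushrooms.",
    "Gala apples are a popular variety known for their sweet flavor and crisp texture.",
]
# The same content after context-aware merging of the two mushroom chunks.
merged_chunks = [split_chunks[0] + " " + split_chunks[1], split_chunks[2]]

query = "Is the mushroom with a large fruiting body poisonous?"
print(top_chunk(query, split_chunks))
print(top_chunk(query, merged_chunks))
```

With the split chunks, the best match describes the fruiting body but says nothing about toxicity, so the answer to the question is missing from the retrieved text. With the merged chunks, the top result carries both facts.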
Original article: LLM based context splitter for large documents