Preface
This section covers how to evaluate the performance of our LLM application. (Video length: 15:07)
Note: All example code files are available on the course website (completely free), with a pre-configured Jupyter Notebook environment and OPENAI_API_KEY, so there is no need to pay for your own. We recommend running the code directly on the course website. Course website
Also, LLM outputs are not always identical: when you run the code, you may get answers that differ slightly from those in the video.
Main Content
Evaluating the performance of our LLM application is critical. We will cover evaluation from three angles:
- Example generation
- Manual evaluation (and debugging)
- LLM-assisted evaluation
Setup
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()
# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)
# Set the model variable based on the current date
if current_date > target_date:
llm_model = "gpt-3.5-turbo"
else:
llm_model = "gpt-3.5-turbo-0301"
Example generation
Create our Q&A application
First, we create a Q&A LLM application to use for the demonstration.
1. Import the required libraries.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch
2. Load the outdoor-clothing dataset OutdoorClothingCatalog_1000.csv.
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()
3. Create a vector store index over the loaded dataset.
index = VectorstoreIndexCreator(
vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])
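Under the hood, the index creator embeds each document and stores the vectors so that queries can later be matched by similarity. A toy sketch of that idea (the character-count `toy_embed` below is a hypothetical stand-in for a real embedding model, not what LangChain actually uses):

```python
# Toy sketch of what a vector store index does: embed each document,
# then answer queries by nearest-neighbor search over the vectors.
def toy_embed(text):
    # Hypothetical stand-in for a real embedding model: character counts.
    return [text.count(c) for c in "abcde"]

docs = ["a jacket", "a pullover", "dog mat"]
index = [(toy_embed(d), d) for d in docs]

def search(query):
    q = toy_embed(query)
    # Return the document whose embedding has the largest dot product with the query.
    return max(index, key=lambda pair: sum(x * y for x, y in zip(pair[0], q)))[1]
```

The real index replaces the toy embedding with an OpenAI embedding model and the list with DocArrayInMemorySearch, but the retrieve-by-similarity idea is the same.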
4. Build the document-based LLM Q&A application using the Stuff method covered in the previous section.
llm = ChatOpenAI(temperature = 0.0, model=llm_model)
qa = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=index.vectorstore.as_retriever(),
verbose=True,
chain_type_kwargs = {
"document_separator": "<<<<>>>>>"
}
)
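The `document_separator` passed above is the string the `stuff` chain places between retrieved documents when it concatenates them into a single prompt; a distinctive separator makes document boundaries easy to spot when debugging. A minimal sketch of that concatenation (simplified; the real chain also fills in a prompt template):

```python
# Simplified sketch (assumed) of how a "stuff" chain builds its context:
# the retrieved documents are joined with document_separator into one string.
docs = ["Cozy Comfort Pullover Set: pull-on pants...", "Ultra-Lofty 850 Jacket: DownTek..."]
separator = "<<<<>>>>>"
context = separator.join(docs)
# The distinctive separator makes each document's boundary easy to spot in traces.
```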
5. Inspect the loaded data.
data[10]
data[11]
Hard-coded examples
Here we manually create some evaluation examples, `examples` (as dictionaries).
examples = [
{
"query": "Do the Cozy Comfort Pullover Set\
have side pockets?",
"answer": "Yes"
},
{
"query": "What collection is the Ultra-Lofty \
850 Stretch Down Hooded Jacket from?",
"answer": "The DownTek collection"
}
]
LLM-Generated examples
Here we use an LLM to help generate some evaluation examples (as dictionaries).
1. Import the required class, QAGenerateChain, used to generate Q&A text.
from langchain.evaluation.qa import QAGenerateChain
2. Create an example-generation chain, example_gen_chain.
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))
3. Generate five examples from the first five rows of the loaded file.
new_examples = example_gen_chain.apply_and_parse(
[{"doc": t} for t in data[:5]]
)
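Note that `apply_and_parse` is deprecated in newer LangChain releases, and the output shape of `QAGenerateChain` has also changed across versions: each generated pair may come back wrapped under a `qa_pairs` key rather than as a flat dict. A small sketch of normalizing both shapes before use (the `raw` data below is hypothetical):

```python
# Hypothetical outputs: newer LangChain versions of QAGenerateChain may wrap
# each generated pair in a "qa_pairs" key, while older versions return it flat.
raw = [
    {"qa_pairs": {"query": "Q1", "answer": "A1"}},  # newer, wrapped format
    {"query": "Q2", "answer": "A2"},                # older, flat format
]

# Normalize both shapes to flat {"query": ..., "answer": ...} dicts.
flat = [ex.get("qa_pairs", ex) for ex in raw]
```

If you hit a KeyError on "query" later in this notebook, this version difference is the first thing to check.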
4. Inspect the generated examples.
new_examples[0]
5. View the original text corresponding to the generated example above.
data[0]
Combine examples
1. Merge the hand-written examples and the generated examples into examples.
examples += new_examples
2. Run the question from the first example (examples[0]) and get the LLM's answer.
qa.run(examples[0]["query"])
To summarize: to evaluate an LLM application, we need a dataset of sample questions and answers. We can build this dataset by hand, generate it from our data with an LLM using LangChain's tooling, or combine both approaches.
Manual evaluation
Here we perform manual evaluation.
1. To see the details of the chain's execution and analyze the application's output, set langchain.debug = True, which prints the full execution trace.
import langchain
langchain.debug = True
2. Run a sample question and get a response. We can now see every intermediate step the LLM takes before producing its answer. By comparing the LLM's answer against the answer in our dataset, we can evaluate our Q&A application.
qa.run(examples[0]["query"])
3. Once we understand what happens during this process, turn off debug mode.
# Turn off the debug mode
langchain.debug = False
LLM-assisted evaluation
Here we use an LLM to assist in evaluating our LLM application.
1. Generate answers for all seven sample questions in our examples dataset and store them in predictions.
predictions = qa.apply(examples)
2. Import the QAEvalChain Q&A evaluation chain.
from langchain.evaluation.qa import QAEvalChain
3. Create a Q&A evaluation chain, eval_chain.
llm = ChatOpenAI(temperature=0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm)
4. Running the evaluation chain compares each generated answer against the answer in our examples and assigns a grade.
graded_outputs = eval_chain.evaluate(examples, predictions)
5. Format and print the results, as shown below.
for i, eg in enumerate(examples):
print(f"Example {i}:")
print("Question: " + predictions[i]['query'])
print("Real Answer: " + predictions[i]['answer'])
print("Predicted Answer: " + predictions[i]['result'])
print("Predicted Grade: " + graded_outputs[i]['text'])
print()
Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: The Cozy Comfort Pullover Set does have side pockets.
Predicted Grade: CORRECT
Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT
Example 2:
Question: According to the document, what is the approximate weight of the Women's Campside Oxfords per pair?
Real Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Grade: CORRECT
Example 3:
Question: What are the dimensions available for the Recycled Waterhog Dog Mat, Chevron Weave?
Real Answer: The small size has dimensions of 18" x 28" and the medium size has dimensions of 22.5" x 34.5".
Predicted Answer: The dimensions available for the Recycled Waterhog Dog Mat, Chevron Weave are:
- Small: 18" x 28"
- Medium: 22.5" x 34.5"
Predicted Grade: CORRECT
Example 4:
Question: What are some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece as described in the document?
Real Answer: The key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece include bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover no-slip straps, fully lined bottom for secure fit and maximum coverage.
Predicted Answer: Some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece are:
- Bright colors, ruffles, and exclusive whimsical prints
- Four-way-stretch and chlorine-resistant fabric
- UPF 50+ rated fabric for high sun protection
- Crossover no-slip straps for a secure fit
- Fully lined bottom for maximum coverage
- Machine washable and line dry for best results
Predicted Grade: CORRECT
Example 5:
Question: What is the fabric composition of the Refresh Swimwear V-Neck Tankini Contrasts top?
Real Answer: The body of the tankini top is made of 82% recycled nylon and 18% Lycra® spandex, while the lining is composed of 90% recycled nylon and 10% Lycra® spandex.
Predicted Answer: The fabric composition of the Refresh Swimwear V-Neck Tankini Contrasts top is 82% recycled nylon with 18% Lycra® spandex for the body, and 90% recycled nylon with 10% Lycra® spandex for the lining.
Predicted Grade: CORRECT
Example 6:
Question: What technology is featured in the EcoFlex 3L Storm Pants that makes them more breathable?
Real Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which offers the most breathability ever tested.
Predicted Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which is designed to offer the most breathability tested by the brand.
Predicted Grade: CORRECT
6. Let's take Example 0 to illustrate what LLM-assisted evaluation buys us. Comparing the Predicted Answer with the Real Answer: both express the same meaning for the same question (the Cozy Comfort Pullover Set does have side pockets), yet their wording differs completely and they share no keywords. A keyword-matching evaluation would fail in this case. This is the advantage of LLM-assisted evaluation: an LLM can compare texts by their semantics, so it generalizes much better and does not break down when two answers happen to share no keywords.
Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: The Cozy Comfort Pullover Set does have side pockets.
Predicted Grade: CORRECT
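The failure mode of string-based grading is easy to demonstrate: both of the naive graders sketched below (illustrative assumptions, not LangChain APIs) reject this semantically correct answer.

```python
real = "Yes"
predicted = "The Cozy Comfort Pullover Set does have side pockets."

# Naive exact-match grading: fails on semantically equivalent answers.
def exact_match_grade(real, predicted):
    return "CORRECT" if real.strip().lower() == predicted.strip().lower() else "INCORRECT"

# Naive keyword overlap: also fails here, since the two answers share no words.
def keyword_overlap(real, predicted):
    return bool(set(real.lower().split()) & set(predicted.lower().split()))
```

QAEvalChain sidesteps both failure modes by asking an LLM whether the two answers mean the same thing, which is why it grades this example CORRECT.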
Summary
This section covered using LangChain to evaluate an LLM Q&A application. The process has two main parts: building the dataset and running the evaluation. The dataset can be written by hand or generated with an LLM's help, which improves efficiency. For the evaluation itself we can likewise enlist an LLM, and LLMs perform well when judging text.
One final note: there is currently no unified standard or method for evaluating LLMs, so this remains an open area for research.