Preface
This section covers how to evaluate the performance of our LLM application. (Video length: 15:07)
Note: All example code files are available on the course website (completely free), with a pre-configured Jupyter Notebook environment and OPENAI_API_KEY, so there is no need to pay for your own. We recommend running the code directly on the course website. Course website
Also, LLM outputs are not always identical: when you run the code, you may get answers that differ slightly from those in the video.
Main Content
Evaluating the performance of our LLM application is critical. We will cover evaluation from three angles:
- Example generation
- Manual evaluation (and debugging)
- LLM-assisted evaluation
Setup
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
# account for deprecation of LLM model
import datetime
# Get the current date
current_date = datetime.datetime.now().date()
# Define the date after which the model should be set to "gpt-3.5-turbo"
target_date = datetime.date(2024, 6, 12)
# Set the model variable based on the current date
if current_date > target_date:
llm_model = "gpt-3.5-turbo"
else:
llm_model = "gpt-3.5-turbo-0301"
Example generation
Create our Q&A application
First, we create a Q&A LLM application to use for the demonstration.
1. Import the required libraries.
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import CSVLoader
from langchain.indexes import VectorstoreIndexCreator
from langchain.vectorstores import DocArrayInMemorySearch
2. Load the outdoor-clothing dataset OutdoorClothingCatalog_1000.csv.
file = 'OutdoorClothingCatalog_1000.csv'
loader = CSVLoader(file_path=file)
data = loader.load()
3. Create a vector store index over the loaded dataset.
index = VectorstoreIndexCreator(
vectorstore_cls=DocArrayInMemorySearch
).from_loaders([loader])
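Under the hood, the index creator embeds each document and stores the vectors so that queries can later be matched by similarity. A toy sketch of that idea (the character-count `toy_embed` below is a hypothetical stand-in for a real embedding model, not what LangChain actually uses):

```python
# Toy sketch of what a vector store index does: embed each document,
# then answer queries by nearest-neighbor search over the vectors.
def toy_embed(text):
    # Hypothetical stand-in for a real embedding model: character counts.
    return [text.count(c) for c in "abcde"]

docs = ["a jacket", "a pullover", "dog mat"]
index = [(toy_embed(d), d) for d in docs]

def search(query):
    q = toy_embed(query)
    # Return the document whose embedding has the largest dot product with the query.
    return max(index, key=lambda pair: sum(x * y for x, y in zip(pair[0], q)))[1]
```

The real index replaces the toy embedding with an OpenAI embedding model and the list with DocArrayInMemorySearch, but the retrieve-by-similarity idea is the same.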
4. Build the document-based LLM Q&A application using the Stuff method covered in the previous section.
llm = ChatOpenAI(temperature = 0.0, model=llm_model)
qa = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=index.vectorstore.as_retriever(),
verbose=True,
chain_type_kwargs = {
"document_separator": "<<<<>>>>>"
}
)
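The `document_separator` passed above is the string the `stuff` chain places between retrieved documents when it concatenates them into a single prompt; a distinctive separator makes document boundaries easy to spot when debugging. A minimal sketch of that concatenation (simplified; the real chain also fills in a prompt template):

```python
# Simplified sketch (assumed) of how a "stuff" chain builds its context:
# the retrieved documents are joined with document_separator into one string.
docs = ["Cozy Comfort Pullover Set: pull-on pants...", "Ultra-Lofty 850 Jacket: DownTek..."]
separator = "<<<<>>>>>"
context = separator.join(docs)
# The distinctive separator makes each document's boundary easy to spot in traces.
```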
5. Inspect the loaded data.
data[10]
data[11]
Hard-coded examples
Here we manually create some evaluation examples, `examples` (as dictionaries).
examples = [
{
"query": "Do the Cozy Comfort Pullover Set\
have side pockets?",
"answer": "Yes"
},
{
"query": "What collection is the Ultra-Lofty \
850 Stretch Down Hooded Jacket from?",
"answer": "The DownTek collection"
}
]
LLM-Generated examples
Here we use an LLM to help generate some evaluation examples (as dictionaries).
1. Import the required class, QAGenerateChain, used to generate Q&A text.
from langchain.evaluation.qa import QAGenerateChain
2. Create an example-generation chain, example_gen_chain.
example_gen_chain = QAGenerateChain.from_llm(ChatOpenAI(model=llm_model))
3. Generate five examples from the first five rows of the loaded file.
new_examples = example_gen_chain.apply_and_parse(
[{"doc": t} for t in data[:5]]
)
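Note that `apply_and_parse` is deprecated in newer LangChain releases, and the output shape of `QAGenerateChain` has also changed across versions: each generated pair may come back wrapped under a `qa_pairs` key rather than as a flat dict. A small sketch of normalizing both shapes before use (the `raw` data below is hypothetical):

```python
# Hypothetical outputs: newer LangChain versions of QAGenerateChain may wrap
# each generated pair in a "qa_pairs" key, while older versions return it flat.
raw = [
    {"qa_pairs": {"query": "Q1", "answer": "A1"}},  # newer, wrapped format
    {"query": "Q2", "answer": "A2"},                # older, flat format
]

# Normalize both shapes to flat {"query": ..., "answer": ...} dicts.
flat = [ex.get("qa_pairs", ex) for ex in raw]
```

If you hit a KeyError on "query" later in this notebook, this version difference is the first thing to check.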
4. Inspect the generated examples.
new_examples[0]
5. View the original text corresponding to the generated example above.
data[0]
Combine examples
1. Merge the hand-written examples and the generated examples into examples.
examples += new_examples
2. Run the question from the first example (examples[0]) and get the LLM's answer.
qa.run(examples[0]["query"])
To summarize: to evaluate an LLM application, we need a dataset of sample questions and answers. We can build this dataset by hand, generate it from our data with an LLM using LangChain's tooling, or combine both approaches.
Manual evaluation
Here we perform manual evaluation.
1. To see the details of the chain's execution and analyze the application's output, set langchain.debug = True, which prints the full execution trace.
import langchain
langchain.debug = True
2. Run a sample question and get a response. We can now see every intermediate step the LLM takes before producing its answer. By comparing the LLM's answer against the answer in our dataset, we can evaluate our Q&A application.
qa.run(examples[0]["query"])
3. Once we understand what happens during this process, turn off debug mode.
# Turn off the debug mode
langchain.debug = False
LLM-assisted evaluation
Here we use an LLM to assist in evaluating our LLM application.
1. Generate answers for all seven sample questions in our examples dataset and store them in predictions.
predictions = qa.apply(examples)
2. Import the QAEvalChain Q&A evaluation chain.
from langchain.evaluation.qa import QAEvalChain
3. Create a Q&A evaluation chain, eval_chain.
llm = ChatOpenAI(temperature=0, model=llm_model)
eval_chain = QAEvalChain.from_llm(llm)
4. Running the evaluation chain compares each generated answer against the answer in our examples and assigns a grade.
graded_outputs = eval_chain.evaluate(examples, predictions)
5. Format and print the results, as shown below.
for i, eg in enumerate(examples):
print(f"Example {i}:")
print("Question: " + predictions[i]['query'])
print("Real Answer: " + predictions[i]['answer'])
print("Predicted Answer: " + predictions[i]['result'])
print("Predicted Grade: " + graded_outputs[i]['text'])
print()
Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: The Cozy Comfort Pullover Set does have side pockets.
Predicted Grade: CORRECT
Example 1:
Question: What collection is the Ultra-Lofty 850 Stretch Down Hooded Jacket from?
Real Answer: The DownTek collection
Predicted Answer: The Ultra-Lofty 850 Stretch Down Hooded Jacket is from the DownTek collection.
Predicted Grade: CORRECT
Example 2:
Question: According to the document, what is the approximate weight of the Women's Campside Oxfords per pair?
Real Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Answer: The approximate weight of the Women's Campside Oxfords per pair is 1 lb. 1 oz.
Predicted Grade: CORRECT
Example 3:
Question: What are the dimensions available for the Recycled Waterhog Dog Mat, Chevron Weave?
Real Answer: The small size has dimensions of 18" x 28" and the medium size has dimensions of 22.5" x 34.5".
Predicted Answer: The dimensions available for the Recycled Waterhog Dog Mat, Chevron Weave are:
- Small: 18" x 28"
- Medium: 22.5" x 34.5"
Predicted Grade: CORRECT
Example 4:
Question: What are some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece as described in the document?
Real Answer: The key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece include bright colors, ruffles, exclusive whimsical prints, four-way-stretch and chlorine-resistant fabric, UPF 50+ rated fabric for sun protection, crossover no-slip straps, fully lined bottom for secure fit and maximum coverage.
Predicted Answer: Some key features of the Infant and Toddler Girls' Coastal Chill Swimsuit, Two-Piece are:
- Bright colors, ruffles, and exclusive whimsical prints
- Four-way-stretch and chlorine-resistant fabric
- UPF 50+ rated fabric for high sun protection
- Crossover no-slip straps for a secure fit
- Fully lined bottom for maximum coverage
- Machine washable and line dry for best results
Predicted Grade: CORRECT
Example 5:
Question: What is the fabric composition of the Refresh Swimwear V-Neck Tankini Contrasts top?
Real Answer: The body of the tankini top is made of 82% recycled nylon and 18% Lycra® spandex, while the lining is composed of 90% recycled nylon and 10% Lycra® spandex.
Predicted Answer: The fabric composition of the Refresh Swimwear V-Neck Tankini Contrasts top is 82% recycled nylon with 18% Lycra® spandex for the body, and 90% recycled nylon with 10% Lycra® spandex for the lining.
Predicted Grade: CORRECT
Example 6:
Question: What technology is featured in the EcoFlex 3L Storm Pants that makes them more breathable?
Real Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which offers the most breathability ever tested.
Predicted Answer: The EcoFlex 3L Storm Pants feature TEK O2 technology, which is designed to offer the most breathability tested by the brand.
Predicted Grade: CORRECT
6. Let's take Example 0 to illustrate what LLM-assisted evaluation buys us. Comparing the Predicted Answer with the Real Answer: both express the same meaning for the same question (the Cozy Comfort Pullover Set does have side pockets), yet their wording differs completely and they share no keywords. A keyword-matching evaluation would fail in this case. This is the advantage of LLM-assisted evaluation: an LLM can compare texts by their semantics, so it generalizes much better and does not break down when two answers happen to share no keywords.
Example 0:
Question: Do the Cozy Comfort Pullover Set have side pockets?
Real Answer: Yes
Predicted Answer: The Cozy Comfort Pullover Set does have side pockets.
Predicted Grade: CORRECT
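The failure mode of string-based grading is easy to demonstrate: both of the naive graders sketched below (illustrative assumptions, not LangChain APIs) reject this semantically correct answer.

```python
real = "Yes"
predicted = "The Cozy Comfort Pullover Set does have side pockets."

# Naive exact-match grading: fails on semantically equivalent answers.
def exact_match_grade(real, predicted):
    return "CORRECT" if real.strip().lower() == predicted.strip().lower() else "INCORRECT"

# Naive keyword overlap: also fails here, since the two answers share no words.
def keyword_overlap(real, predicted):
    return bool(set(real.lower().split()) & set(predicted.lower().split()))
```

QAEvalChain sidesteps both failure modes by asking an LLM whether the two answers mean the same thing, which is why it grades this example CORRECT.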
Summary
This section covered using LangChain to evaluate an LLM Q&A application. The process has two main parts: building the dataset and running the evaluation. The dataset can be written by hand or generated with an LLM's help, which improves efficiency. For the evaluation itself we can likewise enlist an LLM, and LLMs perform well when judging text.
One final note: there is currently no unified standard or method for evaluating LLMs, so this remains an open area for research.