torchtext Documentation Translation

liu_chengwei

Tags: python, deep learning, natural language processing, pytorch

Article link: https://blog.csdn.net/liu_chengwei/article/details/106869812

Introduction to the torchtext.data components (with reference to the PyTorch documentation)

Link: https://torchtext.readthedocs.io/en/latest/data.html#

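torchtext.data provides the following functionality:

Defining a preprocessing pipeline.
Batching, padding, and numericalizing (including building a vocabulary object).
Wrapping dataset splits (train, validation, test).
Loading a custom NLP dataset.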
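
To make the steps above concrete, here is a minimal pure-Python sketch of a pipeline that tokenizes, builds a vocabulary, numericalizes, and pads. It does not use torchtext itself; the function names (tokenize, build_vocab, numericalize_and_pad) and the special tokens are assumptions chosen for illustration, not the library's API.

```python
from collections import Counter

def tokenize(text):
    # Preprocessing step: lowercase and split on whitespace.
    return text.lower().split()

def build_vocab(tokenized, specials=("<unk>", "<pad>")):
    # Special tokens come first, then tokens by descending frequency
    # (ties broken alphabetically so the result is deterministic).
    counts = Counter(tok for sent in tokenized for tok in sent)
    itos = list(specials) + sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: i for i, tok in enumerate(itos)}

def numericalize_and_pad(tokenized, stoi):
    # Map tokens to indices and pad every sequence to the longest one.
    max_len = max(len(sent) for sent in tokenized)
    unk, pad = stoi["<unk>"], stoi["<pad>"]
    return [
        [stoi.get(tok, unk) for tok in sent] + [pad] * (max_len - len(sent))
        for sent in tokenized
    ]

sentences = ["The cat sat", "The dog sat down quietly"]
tokens = [tokenize(s) for s in sentences]
stoi = build_vocab(tokens)
batch = numericalize_and_pad(tokens, stoi)
print(batch)
```

Both rows of batch have the same length, so they could be stacked into a single tensor.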

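Components of torchtext

Dataset inherits from PyTorch's Dataset and is used to load data; Batch defines a batch of examples together with their Fields; Example defines a single training or test example and stores each of its columns as an attribute.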
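
The idea of "each column stored as an attribute" can be sketched with a simplified stand-in class. This is not torchtext's Example: here each field is a hypothetical (name, callable) pair, where the callable plays the role of a Field's preprocessing.

```python
class Example:
    # Simplified stand-in: stores each column of one record as an attribute.
    @classmethod
    def fromlist(cls, values, fields):
        ex = cls()
        for (name, preprocess), value in zip(fields, values):
            # Run the column's preprocessing, then attach it by name.
            setattr(ex, name, preprocess(value))
        return ex

# Hypothetical fields: plain callables stand in for Field objects.
fields = [("label", str.strip), ("content", str.split)]
ex = Example.fromlist([" pos ", "a fine film"], fields)
print(ex.label, ex.content)
```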

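① The Dataset class
class torchtext.data.Dataset(examples, fields, filter_pred=None)
Defines a dataset composed of Examples along with its Fields.

Variables:
sort_key (callable) - A key used to sort the dataset's examples so that examples of similar length are batched together, minimizing padding.
examples (list(Example)) - The examples in the dataset.
fields (dict[str, Field]) - The name of each column or field, together with the corresponding Field object. Two fields that share the same Field object will share a vocabulary.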
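
Why sorting by sort_key reduces padding can be checked with a small pure-Python sketch. The helpers padding_cost and make_batches are hypothetical names introduced here, and using len as the sort key is an assumption; in practice the key usually measures the length of one text field.

```python
def make_batches(examples, batch_size):
    # Cut the list into consecutive chunks of batch_size examples.
    return [examples[i:i + batch_size] for i in range(0, len(examples), batch_size)]

def padding_cost(batches):
    # Number of pad positions needed to bring each batch up to its longest example.
    return sum(max(map(len, b)) * len(b) - sum(map(len, b)) for b in batches)

# Toy examples with lengths 9, 2, 8, 1, 10, 3.
examples = [["tok"] * n for n in (9, 2, 8, 1, 10, 3)]
sort_key = len  # assumed sort key: the length of the example

unsorted_cost = padding_cost(make_batches(examples, 2))
sorted_cost = padding_cost(make_batches(sorted(examples, key=sort_key), 2))
print(unsorted_cost, sorted_cost)
```

With these toy lengths, batching in the original order needs 21 pad positions, while batching after sorting needs 7.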

Functions
__init__(examples, fields, filter_pred=None)
Creates a dataset from a list of Examples and Fields.
Parameters:
examples - A list of Examples.
fields (List(tuple(str, Field))) - The Fields to use, given as tuples. str is the field name, and Field is the associated field, for example:
fields = [('label', LABEL), ('content', CONTENT)]
filter_pred (callable or None) - Use only the examples for which filter_pred(example) is True; if None, use all examples. Default: None.
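
The effect of filter_pred can be sketched without torchtext. Below, SimpleNamespace objects stand in for Examples, and apply_filter is a hypothetical helper that mirrors the behaviour described above; it is not the library's code.

```python
from types import SimpleNamespace

# Stand-ins for Example objects: each column is stored as an attribute.
examples = [
    SimpleNamespace(label="pos", content=["great", "movie"]),
    SimpleNamespace(label="neg", content=[]),
    SimpleNamespace(label="neg", content=["too", "long"]),
]

def apply_filter(examples, filter_pred=None):
    # None keeps every example; otherwise keep those where the predicate is true.
    if filter_pred is None:
        return list(examples)
    return [ex for ex in examples if filter_pred(ex)]

# Drop examples whose content column is empty.
non_empty = apply_filter(examples, filter_pred=lambda ex: len(ex.content) > 0)
everything = apply_filter(examples)
print(len(non_empty), len(everything))
```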

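Filter function
filter_examples(field_names)
Removes unknown words from the dataset's examples for the given fields.
Parameters:
field_names (list(str)) - Within each example, only the parts whose field names appear in field_names have their unknown words removed.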
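
A minimal sketch of this behaviour, under stated assumptions: a dict of plain Python sets (vocabs) stands in for each Field's vocabulary, SimpleNamespace stands in for Example, and the free function below only imitates the method described above.

```python
from types import SimpleNamespace

def filter_examples(examples, vocabs, field_names):
    # For each listed field, keep only the words found in that field's vocabulary.
    for ex in examples:
        for name in field_names:
            known = vocabs[name]
            setattr(ex, name, [w for w in getattr(ex, name) if w in known])

examples = [SimpleNamespace(content=["good", "zzz", "film"], title=["zzz"])]
vocabs = {"content": {"good", "film"}, "title": {"good"}}

# Only the "content" field is filtered; "title" is left untouched.
filter_examples(examples, vocabs, ["content"])
print(examples[0].content, examples[0].title)
```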

Split function
split(split_ratio=0.7, stratified=False, strata_field='label',