NLP (51): Training a Multi-Label Text Classification Model with PyTorch

Last modified: 2023-07-10 17:43:57

License: CC 4.0 BY-SA

First published: 2022-03-17 23:21:31

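This article describes how to train a multi-label text classification model with PyTorch.
In multi-label text classification, a text may belong to several categories at once rather than to a single one. This differs from multi-class text classification: a multi-class model also has several categories to choose from, but each text belongs to exactly one of them, whereas in the multi-label setting a text can belong to more than one category.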
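The distinction can be made concrete with a minimal, library-free sketch. The label names and logit values below are made up purely for illustration and do not come from the article's dataset. A multi-class model normalizes its scores with softmax and picks exactly one label, while a multi-label model applies a sigmoid to each score independently and keeps every label above a threshold (0.5 here, which is an assumed default, not a value taken from the article).

```python
import math

# Hypothetical label set, used only for this illustration.
LABELS = ["sports", "politics", "technology", "finance"]


def sigmoid(x: float) -> float:
    """Map a single logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits: list[float]) -> list[float]:
    """Map logits to probabilities that sum to 1 across all labels."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_multiclass(logits: list[float]) -> list[str]:
    """Multi-class: exactly one label, the one with the highest probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return [LABELS[best]]


def predict_multilabel(logits: list[float], threshold: float = 0.5) -> list[str]:
    """Multi-label: every label whose own sigmoid probability passes the threshold."""
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]


# Made-up model outputs for one text.
logits = [2.1, -1.3, 0.4, -3.0]
multiclass_result = predict_multiclass(logits)
multilabel_result = predict_multilabel(logits)
print(multiclass_result)
print(multilabel_result)
```

In PyTorch the same idea usually shows up in the choice of loss: `torch.nn.CrossEntropyLoss` with an integer class index for the multi-class case, and `torch.nn.BCEWithLogitsLoss` with a multi-hot float target for the multi-label case.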

Dataset

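The dataset used in this article is a collection of English-language research papers. Reference URL: https://datahack.analyticsvidhya.com/contest/janatahack-independence-day-2020-ml-hackathon (downloading the data requires a connection that can get past the firewall; readers can also refer to the project's GitHub repository given later). The dataset is actually competition data, provided for participants to try out their models. The data here is in English; the same principles apply to Chinese text, with only minor adjustments.
The dataset provides each paper's title (TITLE) and abstract (ABSTRACT), and the task is to predict which topics the paper belongs to. It contains 20,972 training samples and six topics: Computer Science, Physics, Mathematics, Statistics, Quantitative Biology, Quantitative Finance. A sample record is shown below: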

TITLE : Many-Body Localization: Stability and Instability
ABSTRACT: Rare regions with weak disorder (Griffiths regions) have the potential to spoil localization. We describe a non-perturbative construction of local integrals of motion (LIOMs) for a weakly interacting spin chain in one dimension, under a physically reasonable assumption on the statistics of eigenvalues. We discuss ideas about the situation in higher dimensions, where one can no longer ensure that interactions involving the Griffiths regions are much smaller than the typical energy-level spacing for such regions. We argue that ergodicity is restored in dimension d > 1, although equilibration should be extremely slow, similar to the dynamics of glasses.
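As a rough sketch of how such a record could be turned into a model input and a multi-hot target, the snippet below parses a CSV row using only the standard library. Several things here are assumptions rather than details confirmed by the article: that the file has an `ID` column and one 0/1 column per topic, that title and abstract are simply concatenated into one text, and the helper name `row_to_example`. The row itself is a made-up placeholder, not a real record from the dataset.

```python
import csv
import io

# The six topics listed in the article, in a fixed order for the target vector.
TOPICS = [
    "Computer Science",
    "Physics",
    "Mathematics",
    "Statistics",
    "Quantitative Biology",
    "Quantitative Finance",
]

# Made-up placeholder row; the column layout is an assumption about the CSV file.
SAMPLE_CSV = (
    "ID,TITLE,ABSTRACT," + ",".join(TOPICS) + "\n"
    "1,Example title,Example abstract,1,0,0,1,0,0\n"
)


def row_to_example(row: dict[str, str]) -> tuple[str, list[float]]:
    """Hypothetical helper: build (text, multi-hot target) from one CSV row."""
    # Assumption: title and abstract are joined into a single input text.
    text = row["TITLE"].strip() + " " + row["ABSTRACT"].strip()
    # One float per topic: 1.0 if the paper has that topic, else 0.0.
    target = [float(row[topic]) for topic in TOPICS]
    return text, target


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
examples = [row_to_example(row) for row in reader]
text, target = examples[0]
print(text)
print(target)
```

The target is kept as floats because binary cross-entropy losses such as `torch.nn.BCEWithLogitsLoss` expect float targets of the same shape as the logits.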