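Overview
The main task of this article is to classify user reviews by sentiment based on their text, into two classes (positive and negative). Sample data is shown below: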
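Project directory and links
The data used in this article:
Stop words (hit_stopwords.txt), source:
Stop-word project directory preview - stopwords - GitCode
Source of all data under the data directory:
Project home page - chinese_text_cnn - GitCode
All project code:
text_classificationWithLSTM: text classification based on LSTM and CNN (gitee.com)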
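1: Data preprocessing (data_set.py)
First, stop words are removed from the raw data using hit_stopwords.txt. Samples that are left with only whitespace or symbols after stop-word removal are treated as invalid and deleted. Finally, the script generates train.txt and test.txt, which are needed to train the model.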
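As a minimal sketch of the cleaning rule described above, the following runs on toy data rather than the real dataset. The stop-word set, the `clean_tokens` and `clean_dataset` helpers, and the regular expression that decides what counts as "symbols only" are all assumptions made for illustration; they are not taken from the project. The sketch builds new lists instead of removing items from a list while looping over it, because removing during iteration skips the element right after each removal.

```python
import re

# Hypothetical toy stop-word set; the real list comes from hit_stopwords.txt.
STOP_WORDS = {"的", "了", ",", "!"}

# Assumption: a token carries content if it has at least one CJK character,
# ASCII letter, or digit. Anything else is treated as whitespace/symbols.
_HAS_CONTENT = re.compile(r"[\u4e00-\u9fffA-Za-z0-9]")


def clean_tokens(tokens, stop_words=STOP_WORDS):
    """Return a new list without stop words or symbol/whitespace-only tokens."""
    return [t for t in tokens if t not in stop_words and _HAS_CONTENT.search(t)]


def clean_dataset(samples, labels, stop_words=STOP_WORDS):
    """Clean every sample and drop those left empty, keeping labels aligned."""
    kept_x, kept_y = [], []
    for tokens, label in zip(samples, labels):
        cleaned = clean_tokens(tokens, stop_words)
        if cleaned:  # skip samples with nothing meaningful left
            kept_x.append(cleaned)
            kept_y.append(label)
    return kept_x, kept_y


samples = [["味道", "的", "的", "不错", "!"], ["的", " ", "!!"], ["太", "差", "了"]]
labels = [1, 0, 0]
clean_x, clean_y = clean_dataset(samples, labels)
print(clean_x, clean_y)
```

Dropping a sample and its label together matters: if only the text were removed, the remaining texts and labels would no longer line up.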
import pandas as pd
import jieba


# Read a TSV file: the last column is returned as the text (data_x),
# the column at index 1 as the label (data_y)
def load_tsv(file_path):
    data = pd.read_csv(file_path, sep='\t')
    data_x = data.iloc[:, -1]
    data_y = data.iloc[:, 1]
    return data_x, data_y


# Load the stop words into a set for fast membership tests
with open('./hit_stopwords.txt', 'r', encoding='UTF8') as f:
    stop_words = {word.strip() for word in f}
print('Successfully')


def drop_stopword(datas):
    # Build new lists instead of calling remove() while iterating,
    # which would skip the token right after each removed one
    return [[word for word in data if word not in stop_words] for data in datas]


def save_data(datax, path):
    # Write one sample per line, with its tokens separated by commas
    with open(path, 'w', encoding="UTF8") as f:
        for lines in datax:
            f.write(','.join(str(line) for line in lines))
            f.write('\n')


if __name__ == '__main__':
    train_x, train_y = load_tsv("./data/train.tsv")
    test_x, test_y = load_tsv("./data/test.tsv")
    # Segment each review into a list of tokens with jieba
    train_x = [list(jieba.cut(x)) for x in train_x]
    test_x = [list(jieba.cut(x)) for x in test_x]
    train_x = drop_stopword(train_x)
    test_x = drop_stopword(test_x)
    save_data(train_x, './train.txt')
    save_data(test_x, './test.txt')
    print('Successfully')
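The script above writes one sample per line with comma-separated tokens. How the training script reads these files back is not shown in this section, so the following is only a sketch of one possible loader. `save_tokens` and `load_tokens` are hypothetical names, and the sketch assumes that no token itself contains a comma.

```python
import os
import tempfile


def save_tokens(samples, path):
    """Write one sample per line, tokens joined by commas."""
    with open(path, "w", encoding="utf-8") as f:
        for tokens in samples:
            f.write(",".join(str(t) for t in tokens) + "\n")


def load_tokens(path):
    """Read the file back into a list of token lists (hypothetical loader)."""
    samples = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            text = line.rstrip("\n")
            # An empty line means the sample had no tokens left
            samples.append(text.split(",") if text else [])
    return samples


samples = [["味道", "不错"], [], ["太", "差"]]
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.txt")
    save_tokens(samples, path)
    restored = load_tokens(path)
print(restored == samples)
```

Keeping empty lines as empty lists preserves the line-to-sample correspondence, which matters if labels are later matched to samples by position.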
2: LSTM model training
import pandas as pd
import torch
from torch import nn
import jieba