Implementing a Neural Network Language Model in Python (Code Walkthrough)

瑞雪兆我心

Last modified: 2024-03-17 16:30:26

Tags: python, neural network language model

First published: 2024-02-28 21:30:44

Article link: https://blog.csdn.net/contributed_l/article/details/136352148

Introduction

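A neural network language model (NNLM) is a method that uses a neural network to learn word vectors. Given the preceding words \(w_{t-n+1}, \dots, w_{t-1}\), it predicts the word \(w_t\); in other words, it uses the previous \(n-1\) words to predict the \(n\)-th word.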
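
To make the "previous n-1 words predict the n-th word" idea concrete, here is a minimal sketch of how training pairs could be cut out of a token sequence. The helper name `make_ngram_pairs` is hypothetical and is not part of the article's code; it only illustrates the sliding-window idea.

```python
def make_ngram_pairs(tokens, n):
    """Return (context, target) pairs.

    Each context holds the n-1 tokens that come right before the target.
    Hypothetical helper for illustration only.
    """
    pairs = []
    for t in range(n - 1, len(tokens)):
        context = tokens[t - n + 1:t]  # the n-1 tokens before position t
        target = tokens[t]             # the token to predict
        pairs.append((context, target))
    return pairs


# With n = 3, a three-character sentence yields exactly one training pair.
pairs = make_ngram_pairs(list("我爱你"), n=3)
print(pairs)  # [(['我', '爱'], '你')]
```

A sequence shorter than n tokens produces no pairs, so very short sentences contribute nothing to training under this scheme.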

1. Architecture Diagram

2. Code

For background, see an explanation of TensorDataset and DataLoader.
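
As a quick refresher on those two classes, the sketch below wraps a pair of tensors in a `TensorDataset` and iterates over it in mini-batches with a `DataLoader`. The toy index values are made up for illustration and are not the article's data.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy data: 4 samples, each a context of 2 token indices,
# paired with the index of the token to predict.
inputs = torch.tensor([[0, 1], [1, 2], [2, 3], [3, 4]])
targets = torch.tensor([2, 3, 4, 5])

# TensorDataset pairs up tensors along their first dimension.
dataset = TensorDataset(inputs, targets)

# DataLoader groups samples into batches; shuffle=False keeps the order fixed.
loader = DataLoader(dataset, batch_size=2, shuffle=False)

batches = [(x, y) for x, y in loader]
for x, y in batches:
    print(x.shape, y.shape)
```

Each iteration yields one batch of inputs together with the matching batch of targets, which is the shape a training loop usually expects.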

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm
import numpy as np
import re

sentences = ["我爱你", "喜羊羊", "灰太狼"]  # three training sentences, each 3 characters long

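# Split a sentence into individual Chinese characters.
# The regex captures each Chinese character, so split() keeps the characters
# as separate items; empty or whitespace-only pieces are then dropped.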
def seg_char(sent):
    pattern = re.compile(r'([\u4e00-\u9fa5])')
    chars = pattern.split(sent)
    chars = [w for w in chars if len(w.strip()) > 0]
    return chars

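# Segment every sentence into characters. Because all sentences here have the
# same length, the result is a 2-D array with one row per sentence.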
chars = np.array([seg_char(i) for i in sentences])  # chars = [['我' '爱' '你'], ['喜' '羊' '羊'], ['灰' '太' '狼']]

# Reshape into a single row of shape (1, N); the array is still 2-D at this point.
chars = chars.reshape(1, -1)  # chars = [['我' '爱' '你' '喜' '羊' '羊' '灰' '太' '狼']]

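# Squeeze out the length-1 leading axis to get a 1-D array of characters.
# np.squeeze does not remove duplicates, so repeated characters are still present.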
word_list = np.squeeze(chars)  # word_list = ['我' '爱' '你' '喜' '羊' '羊'