Keras is a simple-to-use but powerful deep learning library for Python. This beginner-focused three-part series offers anyone an easy way to start solving real machine learning problems. Before diving into two popular variants, recurrent neural networks (RNNs) and convolutional neural networks (CNNs), the series introduces the classic, complete neural network for beginners.
01
Download the dataset
After downloading and extracting it, you'll end up with data organized roughly like the structure below (exact file names vary; this sketch assumes the standard layout that text_dataset_from_directory expects, with one subdirectory per label):
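train/
    pos/    (positive reviews, one .txt file per review)
    neg/    (negative reviews, one .txt file per review)
test/
    pos/
    neg/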
Next, install TensorFlow:
pip install tensorflow
TensorFlow gives us a very simple way to read in the dataset, text_dataset_from_directory, so we use it like this:
from tensorflow.keras.preprocessing import text_dataset_from_directory
train_data = text_dataset_from_directory("./train")
test_data = text_dataset_from_directory("./test")
train_data and test_data are now TensorFlow Dataset objects that we can work with later!
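To sanity-check what's inside, you can pull one batch out of the Dataset (the same inspection snippet appears in the full code at the end):
for text_batch, label_batch in train_data.take(1):
    print(text_batch.numpy()[0])   # the raw review text (as bytes)
    print(label_batch.numpy()[0])  # its label: 0 = negative, 1 = positive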
There's one more thing to do first. If you browse through the dataset, you'll notice that some of the reviews include <br /> markers, which are HTML line breaks. We want to get rid of those, so we'll modify our data preparation slightly:
from tensorflow.keras.preprocessing import text_dataset_from_directory
from tensorflow.strings import regex_replace

def prepareData(dir):
    data = text_dataset_from_directory(dir)
    return data.map(
        lambda text, label: (regex_replace(text, '<br />', ' '), label),
    )

train_data = prepareData('./train')
test_data = prepareData('./test')
02
Build the model
We'll use the Sequential class, which represents a linear stack of layers. First, we instantiate an empty Sequential model and define its input type:
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input
model = Sequential()
model.add(Input(shape=(1,), dtype="string"))
03
Text vectorization
Our first layer is a TextVectorization layer, which processes the input string and turns it into a sequence of integers, each representing one token.
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

max_tokens = 1000
max_len = 100
vectorize_layer = TextVectorization(
    max_tokens=max_tokens,
    output_mode="int",
    output_sequence_length=max_len,
)
To initialize the layer, we need to call .adapt() on it:
train_texts = train_data.map(lambda text, label: text)
vectorize_layer.adapt(train_texts)
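As a quick optional sanity check (this snippet is my addition, not part of the original walkthrough), you can run a sample string through the adapted layer and inspect the integer sequence it produces:
import tensorflow as tf

# Hypothetical sample string; should yield a (1, max_len) integer tensor,
# zero-padded out to max_len tokens.
print(vectorize_layer(tf.constant(["i loved this movie"])))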
Then add the layer to the model:
model.add(vectorize_layer)
04
Embedding
Our next layer is an Embedding layer, which turns the integers produced by the previous layer into fixed-length vectors.
from tensorflow.keras.layers import Embedding

# Note the max_tokens + 1: an out-of-vocabulary (OOV) token gets added
# to the vocabulary, so the embedding table needs one extra row.
model.add(Embedding(max_tokens + 1, 128))
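As another optional check (again my addition, not from the original), you can confirm the shapes so far; after vectorization and embedding, each review should be a max_len x 128 matrix of embedding vectors:
# Expected: (None, 100, 128), i.e. (batch, max_len, embedding_dim)
print(model.output_shape)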
05
Recurrent layer
Finally, we're ready for the recurrent layer that makes our network an RNN! We'll use a Long Short-Term Memory (LSTM) layer, a popular choice for this kind of problem. The implementation is very simple:
from tensorflow.keras.layers import LSTM
model.add(LSTM(64))
To finish off our network, we add a standard fully-connected (Dense) layer and an output layer with a sigmoid activation:
from tensorflow.keras.layers import Dense
model.add(Dense(64, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
The sigmoid activation makes the model output a number between 0 and 1, which is perfect for our problem: 0 represents a negative review and 1 represents a positive one.
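At this point it's worth printing a quick summary to double-check the layer stack and parameter counts:
model.summary()  # lists each layer with its output shape and parameter count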
06
Compile the model
Before we can begin training, we need to configure the training process with compile(). We decide a few key factors during this compilation step, including:
The optimizer. We'll stick with a solid default: the Adam gradient-based optimizer. Keras has many other optimizers you could look into as well.
The loss function. Since we only have 2 output classes (positive and negative), we'll use the binary cross-entropy loss. See Keras's docs for the full list of losses.
A list of metrics. Since this is a classification problem, we just have Keras report the accuracy metric.
Putting that together, compilation looks like this:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy'],
)
07
Training
Training a model with Keras is very simple:
model.fit(train_data, epochs=10)
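If you want the per-epoch numbers afterwards, note that fit() also returns a History object (a small optional addition on my part, not in the original code):
history = model.fit(train_data, epochs=10)
print(history.history['loss'])      # training loss for each epoch
print(history.history['accuracy'])  # training accuracy for each epoch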
Putting together all the code we've written so far and running it gives results like these:
Epoch 1/10
loss: 0.6441 - accuracy: 0.6281
Epoch 2/10
loss: 0.5544 - accuracy: 0.7250
Epoch 3/10
loss: 0.5670 - accuracy: 0.7200
Epoch 4/10
loss: 0.4505 - accuracy: 0.7919
Epoch 5/10
loss: 0.4221 - accuracy: 0.8062
Epoch 6/10
loss: 0.4051 - accuracy: 0.8156
Epoch 7/10
loss: 0.3870 - accuracy: 0.8247
Epoch 8/10
loss: 0.3694 - accuracy: 0.8339
Epoch 9/10
loss: 0.3530 - accuracy: 0.8406
Epoch 10/10
loss: 0.3365 - accuracy: 0.8502
We've reached 85% accuracy after 10 epochs!
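Training accuracy alone can be optimistic; as in the full code at the end, we can also evaluate on the held-out test set:
model.evaluate(test_data)  # reports loss and accuracy on test_data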
08
Using the model
Now that we have a working, trained model, let's put it to use. The first thing we'll do is save it to disk so we can reload it any time:
model.save_weights('rnn')
We can now reload the trained model whenever we want by rebuilding it and loading in the saved weights:
model = Sequential()
# ... add the same layers as before (Input, vectorize_layer, Embedding, LSTM, Dense) ...
model.load_weights('rnn')
Making predictions with the trained model is easy: we pass a string to predict() and it outputs a score.
# Should print a very high score like 0.98.
print(model.predict([
    "i loved it! highly recommend it to anyone and everyone looking for a great movie to watch.",
]))

# Should print a very low score like 0.01.
print(model.predict([
    "this was awful! i hated it so much, nobody should watch this. the acting was terrible, the music was terrible, overall it was just bad.",
]))
Full code
from tensorflow.keras.preprocessing import text_dataset_from_directory
from tensorflow.strings import regex_replace
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization
from tensorflow.keras.models import Sequential
from tensorflow.keras import Input
from tensorflow.keras.layers import Dense, LSTM, Embedding
def prepareData(dir):
    data = text_dataset_from_directory(dir)
    return data.map(
        lambda text, label: (regex_replace(text, '<br />', ' '), label),
    )
# Assumes you're in the root level of the dataset directory.
# If you aren't, you'll need to change the relative paths here.
train_data = prepareData('./train')
test_data = prepareData('./test')
for text_batch, label_batch in train_data.take(1):
    print(text_batch.numpy()[0])
    print(label_batch.numpy()[0])  # 0 = negative, 1 = positive
model = Sequential()
# ----- 1. INPUT
# We need this to use the TextVectorization layer next.
model.add(Input(shape=(1,), dtype="string"))
# ----- 2. TEXT VECTORIZATION
# This layer processes the input string and turns it into a sequence of
# max_len integers, each of which maps to a certain token.
max_tokens = 1000
max_len = 100
vectorize_layer = TextVectorization(
    # Max vocab size. Any words outside of the max_tokens most common ones
    # will be treated the same way: as "out of vocabulary" (OOV) tokens.
    max_tokens=max_tokens,
    # Output integer indices, one per string token.
    output_mode="int",
    # Always pad or truncate to exactly this many tokens.
    output_sequence_length=max_len,
)
# Call adapt(), which fits the TextVectorization layer to our text dataset.
# This is when the max_tokens most common words (i.e. the vocabulary) are selected.
train_texts = train_data.map(lambda text, label: text)
vectorize_layer.adapt(train_texts)
model.add(vectorize_layer)
# ----- 3. EMBEDDING
# This layer turns each integer (representing a token) from the previous layer
# into an embedding. Note that we're using max_tokens + 1 here, since there's an
# out-of-vocabulary (OOV) token that gets added to the vocab.
model.add(Embedding(max_tokens + 1, 128))
# ----- 4. RECURRENT LAYER
model.add(LSTM(64))
# ----- 5. DENSE HIDDEN LAYER
model.add(Dense(64, activation="relu"))
# ----- 6. OUTPUT
model.add(Dense(1, activation="sigmoid"))
# Compile and train the model.
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
model.fit(train_data, epochs=10)
model.save_weights('rnn')
model.load_weights('rnn')
# Try the model on our test dataset.
model.evaluate(test_data)
# Should print a very high score like 0.98.
print(model.predict([
    "i loved it! highly recommend it to anyone and everyone looking for a great movie to watch.",
]))
# Should print a very low score like 0.01.
print(model.predict([
    "this was awful! i hated it so much, nobody should watch this. the acting was terrible, the music was terrible, overall it was just bad.",
]))