LSTM, CNN, and Transformer Each Have Their Own Strengths

This post gives an overview of the innovations the Transformer brought to machine translation, including non-sequential processing, the self-attention mechanism, and positional encodings. It contrasts the sequential dependencies of RNNs/LSTMs with the local dependencies of CNNs, and emphasizes that Transformers avoid long-dependency problems by processing the whole sentence at once. It also points out the Transformer's limitations, such as the fixed input length, and the improvements made by the newer Transformer-XL model.


I'll list some bullet points of the main innovations introduced by Transformers, followed by bullet points of the main characteristics of the other architectures you mentioned, so we can then compare them.

Transformers

Transformers (Attention Is All You Need) were introduced in the context of machine translation with the purpose of avoiding recurrence, both to allow parallel computation (and so reduce training time) and to reduce the drop in performance caused by long dependencies. The main characteristics are:

Non-sequential: sentences are processed as a whole rather than word by word.
Self-attention: the newly introduced 'unit' used to compute similarity scores between the words in a sentence.
Positional embeddings: another innovation introduced to replace recurrence. The idea is to use fixed or learned weights that encode information about the position of a token in the sentence.
The first point is the main reason why Transformers do not suffer from long-dependency issues. The original Transformer does not rely on past hidden states to capture dependencies on previous words; it processes a sentence as a whole, so there is no risk of losing (or "forgetting") past information. Moreover, multi-head attention and positional embeddings both provide information about the relationship between different words; a minimal sketch of both follows below.
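To make the "whole sentence at once" point concrete, here is a minimal NumPy sketch, assuming random stand-in embeddings. It is not the paper's implementation (which uses multiple heads, masking, and learned projections inside larger layers); the function names `positional_encoding` and `self_attention` are my own. It only shows that single-head scaled dot-product attention plus fixed sinusoidal position encodings compute every output position in one pass, with no recurrence.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Fixed sinusoidal position encodings in the style of 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    i = np.arange(d_model)[None, :]                        # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions: cosine
    return pe

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the whole sentence at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                # pairwise similarity between all words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over each row
    return weights @ V                                     # every output sees every input position

# toy example: 5 "words", embedding size 8 (random stand-ins for real word embeddings)
rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): all positions computed in one matrix pass, no time steps
```

The key property is in the last line: nothing in `self_attention` loops over time, so the attention between the first and the last word is computed directly rather than being relayed through intermediate hidden states.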

RNN / LSTM

Recurrent neural networks and long short-term memory models, as far as this question is concerned, are almost identical in their core properties:

Sequential processing: sentences must be processed word by word.
Past information retained through hidden states: sequence-to-sequence models follow the Markov property, in that each state is assumed to depend only on the previously seen state.
The first property is the reason why RNNs and LSTMs cannot be trained in parallel: to encode the second word in a sentence, I need the previously computed hidden state of the first word, so I have to compute that first. The second property is a bit more subtle, but not hard to grasp conceptually. Information in RNNs and LSTMs is retained through previously computed hidden states. The point is that the encoding of a specific word is carried forward only via the next time step, which means the encoding of a word strongly affects only the representation of the next word, so its influence is quickly lost after a few time steps. LSTMs (and also GRU-based RNNs) can extend the dependency range they can learn somewhat, thanks to deeper processing of the hidden states through dedicated gating units (at the cost of more parameters to train), but the problem is nevertheless inherent to recurrence. Another way people have mitigated this problem is to use bidirectional models, which encode the same sentence both from start to end and from end to start, allowing words at the end of a sentence to have a stronger influence on the hidden representation. However, this is a workaround rather than a real solution for very long dependencies. The sketch below illustrates the strictly sequential unrolling.
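The sketch below uses a vanilla RNN cell rather than a full LSTM, purely to keep it short; the sequential-dependency argument is the same. The function name `rnn_forward` and the toy dimensions are my own choices for illustration.

```python
import numpy as np

def rnn_forward(X, Wx, Wh, b):
    """Unroll a vanilla RNN over a sentence: each step needs the previous hidden state."""
    seq_len, _ = X.shape
    d_hidden = Wh.shape[0]
    h = np.zeros(d_hidden)
    hidden_states = []
    for t in range(seq_len):                    # strictly sequential: step t depends on step t-1
        h = np.tanh(X[t] @ Wx + h @ Wh + b)     # the previous h is the only carrier of past context
        hidden_states.append(h)
    return np.stack(hidden_states)

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 5, 8, 16
X = rng.normal(size=(seq_len, d_in))            # stand-in embeddings for one 5-word sentence
Wx = rng.normal(size=(d_in, d_hidden))
Wh = rng.normal(size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)
H = rnn_forward(X, Wx, Wh, b)
print(H.shape)  # (5, 16): computed one time step at a time, unlike self-attention
```

The loop is the whole point: the hidden state for word 5 cannot be computed until the hidden states for words 1 through 4 exist, and any information about word 1 reaches word 5 only after being squeezed through four successive nonlinear updates.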

CNN

Convolutional neural networks are also widely used in NLP, since they are quite fast to train and effective with short texts. The way they tackle dependencies is by applying different kernels to the same sentence; indeed, since their first application to text (Convolutional Neural Networks for Sentence Classification), they have been implemented as multichannel CNNs. Why do different kernels allow the model to learn dependencies? Because a kernel of size 2, for example, learns relationships between pairs of words, a kernel of size 3 captures relationships between triplets of words, and so on. The evident problem is that the number of different kernels required to capture dependencies among all possible combinations of words in a sentence would be enormous and impractical, because the number of combinations grows exponentially as the maximum input sentence length increases. The sketch below shows how kernel width maps to dependency range.
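Here is a single-filter sketch of that idea, assuming random stand-in embeddings; real sentence-classification CNNs (as in the paper cited above) use many filters per width plus max-pooling, so treat this only as an illustration of how the kernel width bounds the dependency range.

```python
import numpy as np

def conv1d_over_tokens(X, kernel):
    """Slide one kernel over the token embeddings; each output mixes k adjacent words."""
    k, _ = kernel.shape
    seq_len = X.shape[0]
    return np.array([np.sum(X[t:t + k] * kernel) for t in range(seq_len - k + 1)])

rng = np.random.default_rng(0)
seq_len, d_model = 10, 8
X = rng.normal(size=(seq_len, d_model))            # one sentence of 10 token embeddings

# kernels of width 2, 3 and 4: each width only "sees" that many adjacent words at a time
for k in (2, 3, 4):
    kernel = rng.normal(size=(k, d_model))
    feature_map = conv1d_over_tokens(X, kernel)
    print(k, feature_map.shape)                    # widths 2/3/4 -> 9/8/7 local features
```

A width-3 kernel can never directly relate word 1 to word 10; covering every possible span would require ever wider (or ever more stacked) kernels, which is exactly the scaling problem described above.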

To summarize, Transformers are better than the other architectures because they avoid recurrence entirely, processing sentences as a whole and learning relationships between words through multi-head attention and positional embeddings. Nevertheless, it must be pointed out that Transformers, too, can capture only dependencies within the fixed input length used to train them: if the maximum sentence length is 50, the model will not be able to capture dependencies between the first word of a sentence and words that occur more than 50 tokens later, for example in another paragraph. Newer Transformers like Transformer-XL try to overcome exactly this issue by re-introducing a form of recurrence: hidden states of already-encoded segments are stored and reused when encoding the following segments, as in the conceptual sketch below.
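The following is only a rough conceptual sketch of that segment-level recurrence, not Transformer-XL's actual implementation (which also uses relative positional encodings, multiple layers and heads, and stops gradients through the cached states); `attend_with_memory` and the toy sizes are my own.

```python
import numpy as np

def attend_with_memory(X, memory, Wq, Wk, Wv):
    """Queries from the current segment attend over cached states plus current states."""
    context = X if memory is None else np.concatenate([memory, X], axis=0)
    Q = X @ Wq                                     # queries only for the new segment
    K, V = context @ Wk, context @ Wv              # keys/values also cover the cached segment
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d_model, seg_len = 8, 4
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

memory = None
for segment in range(3):                           # process a long text as consecutive segments
    X = rng.normal(size=(seg_len, d_model))        # stand-in embeddings for the current segment
    out = attend_with_memory(X, memory, Wq, Wk, Wv)
    memory = out                                   # cache for the next segment (the real model stops gradients here)
    print(segment, out.shape)                      # (4, 8): attention reaches back into the previous segment
```

The cached `memory` plays the role of a hidden state carried across segments, which is why this can be seen as re-introducing recurrence at the segment level while keeping full attention inside each segment.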
