LSTM, CNN, and Transformer Each Have Their Own Strengths

This post gives an overview of the innovations the Transformer brought to machine translation, including non-sequential processing, the self-attention mechanism, and positional encodings. It contrasts the sequential dependencies of RNNs/LSTMs with the local dependencies of CNNs, and emphasizes that Transformers avoid long-dependency problems by processing the whole sentence at once. It also points out the Transformer's limitations, such as the fixed input length, and the improvements made by the newer Transformer-XL model.


I'll list some bullet points of the main innovations introduced by Transformers, followed by bullet points of the main characteristics of the other architectures you mentioned, so we can then compare them.

Transformers

Transformers (Attention Is All You Need) were introduced in the context of machine translation with the purpose of avoiding recurrence, both to allow parallel computation (and so reduce training time) and to reduce the drop in performance caused by long dependencies. The main characteristics are:

Non-sequential: sentences are processed as a whole rather than word by word.
Self-attention: the newly introduced 'unit' used to compute similarity scores between the words in a sentence.
Positional embeddings: another innovation introduced to replace recurrence. The idea is to use fixed or learned weights that encode information about the position of a token in the sentence.
The first point is the main reason why Transformers do not suffer from long-dependency issues. The original Transformer does not rely on past hidden states to capture dependencies on previous words; it processes a sentence as a whole, so there is no risk of losing (or "forgetting") past information. Moreover, multi-head attention and positional embeddings both provide information about the relationship between different words; a minimal sketch of both follows below.
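To make the "whole sentence at once" point concrete, here is a minimal NumPy sketch, assuming random stand-in embeddings. It is not the paper's implementation (which uses multiple heads, masking, and learned projections inside larger layers); the function names `positional_encoding` and `self_attention` are my own. It only shows that single-head scaled dot-product attention plus fixed sinusoidal position encodings compute every output position in one pass, with no recurrence.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Fixed sinusoidal position encodings in the style of 'Attention Is All You Need'."""
    pos = np.arange(seq_len)[:, None]                      # (seq_len, 1)
    i = np.arange(d_model)[None, :]                        # (1, d_model)
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])                  # odd dimensions: cosine
    return pe

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over the whole sentence at once."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])                # pairwise similarity between all words
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # softmax over each row
    return weights @ V                                     # every output sees every input position

# toy example: 5 "words", embedding size 8 (random stand-ins for real word embeddings)
rng = np.random.default_rng(0)
seq_len, d_model = 5, 8
X = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (5, 8): all positions computed in one matrix pass, no time steps
```

The key property is in the last line: nothing in `self_attention` loops over time, so the attention between the first and the last word is computed directly rather than being relayed through intermediate hidden states.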

RNN / LSTM

Recurrent neural networks and long short-term memory models, as far as this question is concerned, are almost identical in their core properties:

Sequential processing: sentences must be processed word by word.
Past information retained through hidden states: sequence-to-sequence models follow the Markov property, in that each state is assumed to depend only on the previously seen state.
The first property is the reason why RNNs and LSTMs cannot be trained in parallel: to encode the second word in a sentence, I need the previously computed hidden state of the first word, so I have to compute that first. The second property is a bit more subtle, but not hard to grasp conceptually. Information in RNNs and LSTMs is retained through previously computed hidden states. The point is that the encoding of a specific word is carried forward only via the next time step, which means the encoding of a word strongly affects only the representation of the next word, so its influence is quickly lost after a few time steps. LSTMs (and also GRU-based RNNs) can extend the dependency range they can learn somewhat, thanks to deeper processing of the hidden states through dedicated gating units (at the cost of more parameters to train), but the problem is nevertheless inherent to recurrence. Another way people have mitigated this problem is to use bidirectional models, which encode the same sentence both from start to end and from end to start, allowing words at the end of a sentence to have a stronger influence on the hidden representation. However, this is a workaround rather than a real solution for very long dependencies. The sketch below illustrates the strictly sequential unrolling.
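The sketch below uses a vanilla RNN cell rather than a full LSTM, purely to keep it short; the sequential-dependency argument is the same. The function name `rnn_forward` and the toy dimensions are my own choices for illustration.

```python
import numpy as np

def rnn_forward(X, Wx, Wh, b):
    """Unroll a vanilla RNN over a sentence: each step needs the previous hidden state."""
    seq_len, _ = X.shape
    d_hidden = Wh.shape[0]
    h = np.zeros(d_hidden)
    hidden_states = []
    for t in range(seq_len):                    # strictly sequential: step t depends on step t-1
        h = np.tanh(X[t] @ Wx + h @ Wh + b)     # the previous h is the only carrier of past context
        hidden_states.append(h)
    return np.stack(hidden_states)

rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 5, 8, 16
X = rng.normal(size=(seq_len, d_in))            # stand-in embeddings for one 5-word sentence
Wx = rng.normal(size=(d_in, d_hidden))
Wh = rng.normal(size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)
H = rnn_forward(X, Wx, Wh, b)
print(H.shape)  # (5, 16): computed one time step at a time, unlike self-attention
```

The loop is the whole point: the hidden state for word 5 cannot be computed until the hidden states for words 1 through 4 exist, and any information about word 1 reaches word 5 only after being squeezed through four successive nonlinear updates.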

CNN

Convolutional neural networks are also widely used in NLP, since they are quite fast to train and effective with short texts. The way they tackle dependencies is by applying different kernels to the same sentence; indeed, since their first application to text (Convolutional Neural Networks for Sentence Classification), they have been implemented as multichannel CNNs. Why do different kernels allow the model to learn dependencies? Because a kernel of size 2, for example, learns relationships between pairs of words, a kernel of size 3 captures relationships between triplets of words, and so on. The evident problem is that the number of different kernels required to capture dependencies among all possible combinations of words in a sentence would be enormous and impractical, because the number of combinations grows exponentially as the maximum input sentence length increases. The sketch below shows how kernel width maps to dependency range.
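Here is a single-filter sketch of that idea, assuming random stand-in embeddings; real sentence-classification CNNs (as in the paper cited above) use many filters per width plus max-pooling, so treat this only as an illustration of how the kernel width bounds the dependency range.

```python
import numpy as np

def conv1d_over_tokens(X, kernel):
    """Slide one kernel over the token embeddings; each output mixes k adjacent words."""
    k, _ = kernel.shape
    seq_len = X.shape[0]
    return np.array([np.sum(X[t:t + k] * kernel) for t in range(seq_len - k + 1)])

rng = np.random.default_rng(0)
seq_len, d_model = 10, 8
X = rng.normal(size=(seq_len, d_model))            # one sentence of 10 token embeddings

# kernels of width 2, 3 and 4: each width only "sees" that many adjacent words at a time
for k in (2, 3, 4):
    kernel = rng.normal(size=(k, d_model))
    feature_map = conv1d_over_tokens(X, kernel)
    print(k, feature_map.shape)                    # widths 2/3/4 -> 9/8/7 local features
```

A width-3 kernel can never directly relate word 1 to word 10; covering every possible span would require ever wider (or ever more stacked) kernels, which is exactly the scaling problem described above.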

To summarize, Transformers are better than the other architectures because they avoid recurrence entirely, processing sentences as a whole and learning relationships between words through multi-head attention and positional embeddings. Nevertheless, it must be pointed out that Transformers, too, can capture only dependencies within the fixed input length used to train them: if the maximum sentence length is 50, the model will not be able to capture dependencies between the first word of a sentence and words that occur more than 50 tokens later, for example in another paragraph. Newer Transformers like Transformer-XL try to overcome exactly this issue by re-introducing a form of recurrence: hidden states of already-encoded segments are stored and reused when encoding the following segments, as in the conceptual sketch below.
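The following is only a rough conceptual sketch of that segment-level recurrence, not Transformer-XL's actual implementation (which also uses relative positional encodings, multiple layers and heads, and stops gradients through the cached states); `attend_with_memory` and the toy sizes are my own.

```python
import numpy as np

def attend_with_memory(X, memory, Wq, Wk, Wv):
    """Queries from the current segment attend over cached states plus current states."""
    context = X if memory is None else np.concatenate([memory, X], axis=0)
    Q = X @ Wq                                     # queries only for the new segment
    K, V = context @ Wk, context @ Wv              # keys/values also cover the cached segment
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d_model, seg_len = 8, 4
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

memory = None
for segment in range(3):                           # process a long text as consecutive segments
    X = rng.normal(size=(seg_len, d_model))        # stand-in embeddings for the current segment
    out = attend_with_memory(X, memory, Wq, Wk, Wv)
    memory = out                                   # cache for the next segment (the real model stops gradients here)
    print(segment, out.shape)                      # (4, 8): attention reaches back into the previous segment
```

The cached `memory` plays the role of a hidden state carried across segments, which is why this can be seen as re-introducing recurrence at the segment level while keeping full attention inside each segment.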
