Studying the Paper "Attention Is All You Need" (12)

Latest recommended article published on 2025-06-05 23:40:33

CS创新实验室

Article link: https://blog.csdn.net/qiwsir/article/details/148312517

Original text 23

3.5 Positional Encoding

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence. To this end, we add “positional encodings” to the input embeddings at the bottoms of the encoder and decoder stacks. The positional encodings have the same dimension $d_{model}$ as the embeddings, so that the two can be summed. There are many choices of positional encodings, learned and fixed [9].

Translation (Chinese)

3.5 位置编码

由于我们的模型不包含循环结构或卷积操作，为了让模型能够利用序列的顺序，我们必须注入一些信息，这些信息关乎序列中各个标记的相对位置或绝对位置。为此，我们在编码器和解码器堆栈底部的输入嵌入中添加了“位置编码”。这些位置编码的维度 $d_{model}$ 与嵌入向量的维度相同，因此二者可以直接相加。位置编码有多种选择，可以是习得的，也可以是固定的[9]。
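The summation described in this paragraph can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper name `add_positional_encoding` and all numeric values are made up for the example and do not come from the paper. It shows why the embeddings and the positional encodings must share the same dimension $d_{model}$.

```python
def add_positional_encoding(embeddings, encodings):
    """Element-wise sum of token embeddings and positional encodings.

    Both arguments are lists of vectors (one per position). The helper name and
    the toy data below are illustrative assumptions, not from the paper.
    """
    if len(embeddings) != len(encodings):
        raise ValueError("need one positional encoding per token")
    summed = []
    for emb, enc in zip(embeddings, encodings):
        # The sum is only defined when both vectors have d_model entries.
        if len(emb) != len(enc):
            raise ValueError("embedding and encoding dimensions differ")
        summed.append([e + p for e, p in zip(emb, enc)])
    return summed


# Toy example: 2 tokens, d_model = 4 (all values are made up).
token_embeddings = [[0.5, -1.0, 0.25, 2.0], [1.5, 0.0, -0.75, 1.0]]
position_encodings = [[0.0, 1.0, 0.0, 1.0], [0.5, 0.5, 0.25, 1.0]]
model_input = add_positional_encoding(token_embeddings, position_encodings)
```

Because the combination is a plain sum rather than a concatenation, the result keeps the dimension $d_{model}$, so the layers above need no change in width.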

Key sentence analysis

Since our model contains no recurrence and no convolution, in order for the model to make use of the order of the sequence, we must inject some information about the relative or absolute position of the tokens in the sequence.

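[Analysis]

This is a complex sentence built as "adverbial clause of reason + adverbial of purpose (a complex infinitive construction) + main clause".

The main clause is "we must inject some information about…", and its core is "we must inject some information". Here "must inject" is a predicate formed by "modal verb + base form of the verb", with the subject before it and the object after it. The prepositional phrase "about…" is a postmodifier of "information". Within it, "the relative or absolute" is an attributive modifying "position"; "of the tokens" is a postmodifier that also modifies "position"; and the final prepositional phrase "in the sequence" is a postmodifier of "the tokens".

The sentence opens with an adverbial clause of reason introduced by "since". Its core, "our model contains no recurrence and no convolution", is a subject-verb-object structure: "contains" is the predicate verb, with the subject before it and the object after it. The object consists of two coordinate noun phrases joined by "and", namely "no recurrence and no convolution". The clause could equally be rewritten as: "Since our model doesn't contain any recurrence or convolution".

The middle part, "in order for the model to make use of the order of the sequence", is a complex infinitive construction serving as an adverbial of purpose. It follows the pattern "in order for sb. to do sth." (为了让某人能够做某事). Here "the model" corresponds to "sb." and is the doer (logical subject) of the infinitive. The infinitive "to make use of the order of the sequence" corresponds to "to do sth.", in which "make use of" is a fixed phrase meaning "to utilize" (利用), and "the order of the sequence" is rendered as 序列的顺序. "of the sequence" is a postmodifier of "the order".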

[Reference translation]

由于我们的模型不包含循环结构或卷积操作，为了让模型能够利用序列的顺序，我们必须注入一些信息，这些信息关乎序列中这些标记的相对位置或绝对位置。

Original text 24

In this work, we use sine and cosine functions of different frequencies:
$$\begin{aligned} PE_{(pos,2i)}&=\sin(pos/10000^{2i/d_{model}}) \\ PE_{(pos,2i+1)}&=\cos(pos/10000^{2i/d_{model}}) \end{aligned}$$
where $pos$ is the position and $i$ is the dimension. That is, each dimension of the positional encoding corresponds to a sinusoid. The wavelengths form a geometric progression from $2\pi$ to $10000 \cdot 2\pi$. We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

Translation (Chinese)

在这项工作中，我们采用不同频率的正弦与余弦函数：
$$\begin{aligned} PE_{(pos,2i)}&=\sin(pos/10000^{2i/d_{model}}) \\ PE_{(pos,2i+1)}&=\cos(pos/10000^{2i/d_{model}}) \end{aligned}$$
其中，$pos$ 表示位置，$i$ 表示维度。也就是说，位置编码的每个维度都对应一个正弦曲线。其波长构成从 $2\pi$ 到 $10000 \cdot 2\pi$ 的等比数列。我们之所以选择该函数，是因为我们假设它能让模型轻松学会基于相对位置进行注意力分配：对于任意固定偏移量 $k$，$PE_{pos+k}$ 都可以表示为 $PE_{pos}$ 的线性函数。
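The two formulas translate almost directly into code. The following is a minimal sketch in plain Python, assuming an even $d_{model}$; the function names `positional_encoding` and `wavelength` and the toy size `d_model = 8` are choices made for this illustration, not anything prescribed by the paper.

```python
import math


def positional_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding of position `pos`.

    Even index 2i uses sin, odd index 2i+1 uses cos, both with the angle
    pos / 10000**(2i / d_model). Assumes d_model is even (sketch assumption).
    """
    encoding = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        encoding.append(math.sin(angle))  # dimension 2i
        encoding.append(math.cos(angle))  # dimension 2i + 1
    return encoding


def wavelength(i, d_model):
    """Wavelength (in positions) of the sinusoid used by dimensions 2i and 2i+1."""
    return 2 * math.pi * 10000 ** (2 * i / d_model)


d_model = 8  # toy size chosen for readability; the paper's base model uses 512
pe0 = positional_encoding(0, d_model)  # sin(0) = 0 and cos(0) = 1 alternate
pe1 = positional_encoding(1, d_model)
wavelengths = [wavelength(i, d_model) for i in range(d_model // 2)]
```

Successive entries of `wavelengths` differ by the constant factor $10000^{2/d_{model}}$, which is what makes them a geometric progression. The first wavelength is exactly $2\pi$. With a finite $d_{model}$, the largest one is $10000^{(d_{model}-2)/d_{model}} \cdot 2\pi$, which approaches $10000 \cdot 2\pi$ as $d_{model}$ grows.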

Key sentence analysis

We chose this function because we hypothesized it would allow the model to easily learn to attend by relative positions, since for any fixed offset $k$, $PE_{pos+k}$ can be represented as a linear function of $PE_{pos}$.

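[Analysis]

The sentence has the structure: main clause + adverbial clause of reason (because…) + adverbial clause of reason (since…). Specifically, "We chose this function" is the main clause, followed by a reason clause introduced by "because" and then another introduced by "since".

The because-clause is itself a complex sentence containing an object clause: "we hypothesized" is followed by an object clause whose introductory word "that" is omitted (when "that" introduces an object clause it plays no grammatical role in the clause and can often be dropped). The object clause uses the pattern "allow sb./sth. to do (sth.)". "attend by relative positions" means "allocate attention based on relative positions" (基于相对位置进行注意力分配). The basic sense of "attend" is "to deal with, to look after" (处理，照料); here it can be rendered freely as "allocate attention" (进行注意力分配). The basic sense of "by" is "through, by means of" (通过，凭借); here it can be rendered freely as "based on" (基于).

In the since-clause, "for any fixed offset $k$" is an adverbial that specifies what the statement applies to; "for" means "with respect to" (对于) and "offset" means 偏移量. "$PE_{pos+k}$ can be represented as…" is the body of the clause, where "be represented as…" means "(be) expressed as" ((被)表示为). In "a linear function of $PE_{pos}$", the prepositional phrase "of $PE_{pos}$" is a postmodifier of "a linear function".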

[Reference translation]

我们之所以选择该函数，是因为我们假设它能让模型轻松学会基于相对位置进行注意力分配：对于任意固定偏移量 $k$，$PE_{pos+k}$ 都可以表示为 $PE_{pos}$ 的线性函数。
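The linearity claim in this sentence can be checked with the angle-addition identities. What follows is a sketch of the standard derivation, which the paper itself does not spell out; the shorthand $\omega_i = 1/10000^{2i/d_{model}}$ is notation introduced here for convenience.

$$
\begin{aligned}
PE_{(pos+k,2i)} &= \sin(\omega_i (pos+k)) = \sin(\omega_i\, pos)\cos(\omega_i k) + \cos(\omega_i\, pos)\sin(\omega_i k) \\
PE_{(pos+k,2i+1)} &= \cos(\omega_i (pos+k)) = \cos(\omega_i\, pos)\cos(\omega_i k) - \sin(\omega_i\, pos)\sin(\omega_i k)
\end{aligned}
$$

Written in matrix form for each pair of dimensions $(2i, 2i+1)$:

$$
\begin{pmatrix} PE_{(pos+k,2i)} \\ PE_{(pos+k,2i+1)} \end{pmatrix}
=
\begin{pmatrix} \cos(\omega_i k) & \sin(\omega_i k) \\ -\sin(\omega_i k) & \cos(\omega_i k) \end{pmatrix}
\begin{pmatrix} PE_{(pos,2i)} \\ PE_{(pos,2i+1)} \end{pmatrix}
$$

The $2 \times 2$ matrix depends only on the offset $k$ and the dimension index $i$, not on $pos$. For a fixed $k$, the same linear map therefore takes $PE_{pos}$ to $PE_{pos+k}$ at every position.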

Original text 25

We also experimented with using learned positional embeddings [9] instead, and found that the two versions produced nearly identical results (see Table 3 row (E)). We chose the sinusoidal version because it may allow the model to extrapolate to sequence lengths longer than the ones encountered during training.
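One way to picture the extrapolation argument is to compare a fixed formula with a lookup table. The sketch below is a simplified illustration, not the paper's experiment: `max_train_len`, `d_model`, `unseen_pos`, and the zero-filled `learned_table` are hypothetical stand-ins (real learned embeddings would be trained parameters).

```python
import math


def sinusoidal_encoding(pos, d_model):
    """Sinusoidal encoding computed from the formula, defined for any pos >= 0."""
    encoding = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding


max_train_len = 4  # hypothetical longest sequence seen in training
d_model = 6        # toy dimension (assumed even)

# Stand-in for a learned embedding table: one row per trained position.
# Zeros are placeholders for what would be trained parameters.
learned_table = [[0.0] * d_model for _ in range(max_train_len)]

unseen_pos = 10    # a position beyond anything in the table

# The fixed formula still yields a well-defined vector ...
fixed_vector = sinusoidal_encoding(unseen_pos, d_model)

# ... while the lookup table simply has no row for that position.
try:
    learned_vector = learned_table[unseen_pos]
except IndexError:
    learned_vector = None
```

This only shows that the sinusoidal encoding is defined for positions beyond the training length. Whether the model actually generalizes to such lengths is the hypothesis the authors hedge with "may".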