Attention is all you need
Intro & Background
The intro section first discusses RNN-like models and their drawbacks in terms of both time and space:
- Since RNN-like models have an inherently sequential nature, i.e., the hidden state \(h_t\) depends on the previous hidden state \(h_{t-1}\), they cannot be parallelized across time steps and are therefore time-consuming.
- RNN-like models also have to store the hidden states of all previous time steps for back-propagation, which is space-consuming and limits batching across examples.
The background section then discusses convolutional networks for sequence modeling:
- The CNN operation is fully parallelizable and hence faster than RNN-like models, and it is also space-efficient.
- However, a CNN struggles to capture dependencies between input and output positions when the distance between them is large, because each layer has a limited receptive field. Relating two distant positions requires stacking multiple CNN layers, and the number of operations grows with the distance (linearly for ConvS2S, logarithmically for ByteNet), which is not efficient.
To address the drawbacks above, the authors propose the Transformer, which is based solely on attention mechanisms:
- No recurrence is involved, so the model is fully parallelizable.
- The attention mechanism can draw global dependencies between input and output positions no matter how far apart they are.
Scaled dot-product attention
The core principle of attention is how to map a query and a set of key-value pairs to an output. The output is a weighted sum of the values: \[\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V\] Note that dot-product attention is much faster and more space-efficient in practice than additive attention, since it can be implemented with highly optimized matrix multiplication.
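As a quick illustration of this formula, here is a minimal NumPy sketch (the function name and the shapes \(n\), \(m\), \(d_k\), \(d_v\) are assumptions for illustration; real implementations work on batched tensors and also support masking, which is covered below):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Illustrative sketch of the attention formula above.

    Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values.
    Returns an (n, d_v) matrix whose i-th row is a weighted sum of the rows of V.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (n, m) scaled scores
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the key dimension
    return weights @ V                                        # weighted sum of values
```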
- query \(\mathbf{q}\): A row of the matrix \(Q\), so \(Q\) has size \(n\times d_k\), where \(n\) is the number of queries. Each query vector has dimension \(d_k\) and corresponds to one embedded token; \(d_k\) is typically much smaller than the dimension of the embedded vector. The vector \(\mathbf{q}\) is a linear projection of the corresponding embedded vector \(\mathbf{e}\): \[\mathbf{q}=W_q\mathbf{e}\] in which \(W_q\) is a learnable matrix.
- key \(\mathbf{k}\): The key vector also has dimension \(d_k\) for each unit and is computed by \[\mathbf{k}=W_k\mathbf{e}\] in which \(W_k\) is also learnable.
- The dot product of \(\mathbf{q}\) and \(\mathbf{k}\): Intuitively, \(\mathbf{q}\) serves as an active query on behalf of each unit (token, patch, ...) and \(\mathbf{k}\) is a key waiting to be queried by \(\mathbf{q}\). The scalar \(\langle\mathbf{q},\mathbf{k}\rangle\) measures the extent to which the unit represented by \(\mathbf{q}\) matches the one represented by \(\mathbf{k}\). Through back-propagation and optimization, the matrices \(W_q\) and \(W_k\) learn to capture the proper dependencies between units. \(\langle\mathbf{q},\mathbf{k}\rangle\) is also called the attention score.
- Division by \(\sqrt{d_k}\): The authors note that for large values of \(d_k\), the dot products grow large in magnitude, pushing the softmax function into regions where it has extremely small gradients. Intuitively, if the components of \(\mathbf{q}\) and \(\mathbf{k}\) are independent with mean \(0\) and variance \(1\), their dot product has variance \(d_k\), so dividing by \(\sqrt{d_k}\) rescales the scores back to unit variance.
- Softmax: The softmax is applied to \(\frac{QK^T}{\sqrt{d_k}}\), which is an \(n\times n\) matrix, along the key dimension, i.e., for a given \(\mathbf{q}_i\), the normalized weights \(a_{ij}=\operatorname{softmax}_j\!\left(\frac{\langle\mathbf{q}_i,\mathbf{k}_j\rangle}{\sqrt{d_k}}\right)\) satisfy \(\sum_{j=1}^{n} a_{ij}=1\).
- Mask operation: \(\mathbf{q}_i\) should only query keys located at or before its own position, i.e., \[s_{ij}=\left\{\begin{array}{ll}\langle\mathbf{q}_i,\mathbf{k}_j\rangle, & j\le i\\ -\infty, & j>i\end{array}\right.\] The mask must be applied before the softmax; setting the masked scores to \(-\infty\) makes the corresponding attention weights zero after the softmax.
- value \(\mathbf{v}\): The value vector has dimension \(d_v\), equal to the dimension of the embedded vectors, for each unit; \(\mathbf{v}\) is calculated by \[\mathbf{v}=W_v\mathbf{e}\] Since \(d_k \ll d_v\), for the three matrices \({\left(W_v\right)}_{d_v\times d_v},\ {\left(W_k\right)}_{d_k\times d_v},\ {\left(W_q\right)}_{d_k\times d_v}\) we have \[\operatorname{param}(W_v)\gg\operatorname{param}(W_q)+\operatorname{param}(W_k)\] so it is more parameter-efficient to decompose \(W_v\) into the product of two low-rank matrices (e.g., a \(d_v\times d_k\) matrix times a \(d_k\times d_v\) matrix). Each value \(\mathbf{v}_i\) is associated with the corresponding key \(\mathbf{k}_i\). The output of attention for each unit is the weighted sum of all value vectors, serving as an update to the embedded vectors: \[\operatorname{Attention}(Q,K,V)=\operatorname{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V=\left[\operatorname{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)\right]_{n\times n}V_{n\times d_v}=\begin{bmatrix}\Delta E_1\\ \Delta E_2\\ \vdots\\ \Delta E_n\end{bmatrix}_{n\times d_v}\] Computing \(E_i+\Delta E_i\) gives the updated, context-aware embedding of unit \(i\); a minimal end-to-end sketch follows.
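Putting the pieces together, below is a minimal NumPy sketch of single-head causal self-attention over a sequence of embeddings. It is illustrative only, not the paper's code: the helper name `causal_self_attention`, the toy dimensions, and the random data are assumptions, and it uses the row-vector convention, so the projection matrices have the transposed shapes of the \(W_q, W_k, W_v\) above.

```python
import numpy as np

def causal_self_attention(E, W_q, W_k, W_v):
    """Sketch of single-head causal self-attention over embeddings E.

    E:   (n, d_v) embedded vectors, one row per unit (token, patch, ...)
    W_q: (d_v, d_k), W_k: (d_v, d_k), W_v: (d_v, d_v) learnable projections
    Returns E + Delta_E, the updated embeddings.
    """
    n = E.shape[0]
    d_k = W_q.shape[1]

    Q = E @ W_q                      # (n, d_k) queries
    K = E @ W_k                      # (n, d_k) keys
    V = E @ W_v                      # (n, d_v) values

    scores = Q @ K.T / np.sqrt(d_k)  # (n, n) scaled attention scores
    # Causal mask: position i may only attend to positions j <= i.
    # Masked scores are set to -inf *before* the softmax so their weights become 0.
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)

    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # each row sums to 1

    delta_E = weights @ V            # (n, d_v): one Delta E_i per unit
    return E + delta_E

# Hypothetical usage with random data
rng = np.random.default_rng(0)
n, d_v, d_k = 5, 16, 4
E = rng.normal(size=(n, d_v))
W_q = rng.normal(size=(d_v, d_k)) / np.sqrt(d_v)
W_k = rng.normal(size=(d_v, d_k)) / np.sqrt(d_v)
W_v = rng.normal(size=(d_v, d_v)) / np.sqrt(d_v)
print(causal_self_attention(E, W_q, W_k, W_v).shape)  # (5, 16)
```

The causal mask corresponds to the masked self-attention in the decoder; encoder self-attention is the same computation with the mask omitted.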