Query - what I am interested in
Key - what I can offer
Value - what I give you
In general, attention is a communication mechanism between nodes.
Every node looks at the other nodes and aggregates weighted information about them.
Initially, no notion of space. (solved by positionally encoding the nodes)
No communication across batch :)
masking decides which nodes are allowed to communicate with which others
self-attention <-> queries, keys, and values are all computed from the same source.
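A minimal sketch of a single self-attention head in PyTorch (sizes B, T, C and head_size are illustrative); note that q, k, v are all projections of the same x, and the batch dimension B never mixes:

```python
import torch
import torch.nn as nn

B, T, C = 4, 8, 32          # batch, time (nodes), embedding channels (illustrative)
head_size = 16

x = torch.randn(B, T, C)    # in self-attention, Q, K, V all come from this same x

query = nn.Linear(C, head_size, bias=False)
key   = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q, k, v = query(x), key(x), value(x)            # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) / head_size**0.5  # (B, T, T) affinities, scaled (next note)
wei = torch.softmax(wei, dim=-1)                # each node's weights over the other nodes
out = wei @ v                                   # (B, T, head_size) aggregated information
# attention mixes information across T only, never across the batch dimension B
```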
scaled attention : divide the scores by sqrt(head_size) so they keep unit variance (stability)
ex. N(0,1) + N(0,1) ~ N(0,2)
SoftMax of N(0,1)-scale logits stays diffuse, but SoftMax of N(0,100)-scale logits collapses toward OHE
OHE (one-hot) means all attention goes to just one particular node
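An illustrative check of both points (all numbers and sizes are made up):

```python
import torch

# why we scale by 1/sqrt(head_size): raw dot products have variance ~head_size
head_size = 64
q = torch.randn(1000, head_size)
k = torch.randn(1000, head_size)
raw = (q * k).sum(-1)                    # unscaled dot products
scaled = raw / head_size**0.5            # scaled dot products
print(raw.var(), scaled.var())           # ~head_size vs ~1

# softmax of high-variance logits collapses toward one-hot
logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
print(torch.softmax(logits, dim=-1))         # diffuse distribution
print(torch.softmax(logits * 100, dim=-1))   # ~one-hot: almost all mass on one node
```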
encoder blocks allow every node to communicate; decoder blocks don't allow attending to the future
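A sketch of decoder-style (causal) masking with torch.tril; an encoder block would simply skip the masked_fill (sizes illustrative):

```python
import torch

T = 5
wei = torch.randn(T, T)                          # illustrative raw affinities
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))  # block communication with the future
wei = torch.softmax(wei, dim=-1)                 # future positions get exactly 0 weight
```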
To solve the previous issues with RNNs, we want to eliminate recurrence!
Thus, to feed all tokens at once, we need to encode position into the embeddings
To do this, we combine the token embeddings with some positional signal
As a result, we get position-aware encodings
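A sketch of one common way to do this, a learned position-embedding table added to the token embeddings (the original Transformer uses fixed sinusoids instead; all sizes here are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, block_size, C = 1000, 8, 32    # illustrative sizes

tok_emb_table = nn.Embedding(vocab_size, C)
pos_emb_table = nn.Embedding(block_size, C)

idx = torch.randint(0, vocab_size, (4, block_size))  # (B, T) token ids
tok_emb = tok_emb_table(idx)                          # (B, T, C) "what" the token is
pos_emb = pos_emb_table(torch.arange(block_size))     # (T, C)    "where" it sits
x = tok_emb + pos_emb                                 # position-aware embeddings
```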
Attention scores (after softmax) lie in [0,1] and sum to 1
Example: He tossed the tennis ball to serve
we pay attention to: tennis, ball, serve
At step 3, we keep only the embeddings we should pay attention to
At step 4, we actually compute the features we make predictions on (the output)
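A toy illustration with made-up scores, just to show the weights landing in [0,1] and concentrating on tennis/ball/serve:

```python
import torch

tokens = ["He", "tossed", "the", "tennis", "ball", "to", "serve"]
scores = torch.tensor([0.1, 0.3, 0.0, 2.0, 2.2, 0.0, 1.8])  # hypothetical raw scores
weights = torch.softmax(scores, dim=-1)                      # in [0,1], sum to 1
for tok, w in zip(tokens, weights):
    print(f"{tok:>7s}  {w.item():.2f}")
```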
Attention (Stanford intro)
Attention input:
attn_from : hidden state of decoder (h_d[x])
attn_to : hidden state of encoder (h_e[x])
Attention function output (black ellipse):
score (how important h_e0 is for the decoder when decoding h_d0)
Attention algo output (context):
weighted sum over all encoder states h_e[x], with weights given by the softmax of the scores
Note: the attention function (black ellipse) might be any arbitrary function, for example:
Attn(h_e0, h_d0) = np.dot(h_e0, h_d0)
Attn(h_e0, h_d0) = linear(torch.cat([h_e0, h_d0])) etc.
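A sketch of the two scoring options above plus the context computation (shapes and the scorer layer are assumptions):

```python
import torch
import torch.nn as nn

H = 64                                   # illustrative hidden size
h_e = torch.randn(10, H)                 # encoder hidden states h_e[0..9]
h_d0 = torch.randn(H)                    # one decoder hidden state

# option 1: dot-product score
scores_dot = h_e @ h_d0                  # (10,)

# option 2: a small learned scorer over the concatenation
scorer = nn.Linear(2 * H, 1)
scores_mlp = scorer(torch.cat([h_e, h_d0.expand(10, H)], dim=-1)).squeeze(-1)

# context: weighted sum of all encoder states, weights = softmax over the scores
alpha = torch.softmax(scores_dot, dim=-1)   # (10,), each in [0,1], sums to 1
context = alpha @ h_e                       # (H,)
```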
It solves:
Forgetting : no need to squeeze everything into a single memory at once
Bottleneck : the decoder now has access to any encoder state
What is new to Seq2Seq?
the decoder has access to all encoder hidden states (not only the last one)
Why does the context sum over all hidden states?
one output might depend on multiple input hidden states
Self-attention: the encoder attends to the encoder (itself), or the decoder attends to the decoder (itself)
1. Compute self-attention for Input_1
2. Integrate self-A_1 into Hidden_state_1
3. Compute self-attention for Input_2
4. Integrate self-A_2 into Hidden_state_2
Note: self-A uses only the initial hidden_state (not the updated one) => ✔️ parallelization (see the sketch after this list)
5. Decoder self-attention is the same
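A sketch of why this parallelizes: since every self-A_t reads only the initial hidden states, all positions update in one matrix multiply (sizes illustrative):

```python
import torch

T, H = 6, 32                          # illustrative sizes
hidden = torch.randn(T, H)            # initial hidden states for Input_1..Input_T

scores = hidden @ hidden.T / H**0.5   # (T, T): every position scored against every other
alpha = torch.softmax(scores, dim=-1)
updated = alpha @ hidden              # (T, H): all Hidden_state_t updated in parallel
```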
Multi-head attention: multiple attention mechanisms run in parallel over the same sentence.
Each attention head aims to extract a particular relationship between two tokens.
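A minimal multi-head self-attention sketch (class names and sizes are illustrative, not a particular library's API): several heads run in parallel on the same input and their outputs are concatenated, then mixed by a projection.

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    def __init__(self, C, head_size):
        super().__init__()
        self.q = nn.Linear(C, head_size, bias=False)
        self.k = nn.Linear(C, head_size, bias=False)
        self.v = nn.Linear(C, head_size, bias=False)
        self.head_size = head_size

    def forward(self, x):                                  # x: (B, T, C)
        wei = self.q(x) @ self.k(x).transpose(-2, -1) / self.head_size**0.5
        return torch.softmax(wei, dim=-1) @ self.v(x)      # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    def __init__(self, C, num_heads):
        super().__init__()
        self.heads = nn.ModuleList([Head(C, C // num_heads) for _ in range(num_heads)])
        self.proj = nn.Linear(C, C)   # mix the heads' outputs back together

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))

mha = MultiHeadAttention(C=32, num_heads=4)
out = mha(torch.randn(2, 8, 32))      # (B=2, T=8, C=32)
```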
Readings: