Query - what I am interested in
Key - what I can offer
Value - what I give you
In general, attention is a communication mechanism between nodes.
Every node looks at the other nodes and aggregates weighted information about them.
Initially, no notion of space. (solved by positionally encoding the nodes)
No communication across batch :)
masking decides which nodes are allowed to communicate with which others
self-attention <-> queries, keys, and values are all computed from the same source.
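A minimal sketch of a single self-attention head in PyTorch (sizes B, T, C and head_size are illustrative); note that q, k, v are all projections of the same x, and the batch dimension B never mixes:

```python
import torch
import torch.nn as nn

B, T, C = 4, 8, 32          # batch, time (nodes), embedding channels (illustrative)
head_size = 16

x = torch.randn(B, T, C)    # in self-attention, Q, K, V all come from this same x

query = nn.Linear(C, head_size, bias=False)
key   = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q, k, v = query(x), key(x), value(x)            # each (B, T, head_size)
wei = q @ k.transpose(-2, -1) / head_size**0.5  # (B, T, T) affinities, scaled (next note)
wei = torch.softmax(wei, dim=-1)                # each node's weights over the other nodes
out = wei @ v                                   # (B, T, head_size) aggregated information
# attention mixes information across T only, never across the batch dimension B
```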
scaled attention : divide the scores by sqrt(head_size) so they keep unit variance (stability)
ex. N(0,1) + N(0,1) ~ N(0,2)
SoftMax of N(0,1)-scale logits stays diffuse, but SoftMax of N(0,100)-scale logits collapses toward OHE
OHE (one-hot) means all attention goes to just one particular node
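An illustrative check of both points (all numbers and sizes are made up):

```python
import torch

# why we scale by 1/sqrt(head_size): raw dot products have variance ~head_size
head_size = 64
q = torch.randn(1000, head_size)
k = torch.randn(1000, head_size)
raw = (q * k).sum(-1)                    # unscaled dot products
scaled = raw / head_size**0.5            # scaled dot products
print(raw.var(), scaled.var())           # ~head_size vs ~1

# softmax of high-variance logits collapses toward one-hot
logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
print(torch.softmax(logits, dim=-1))         # diffuse distribution
print(torch.softmax(logits * 100, dim=-1))   # ~one-hot: almost all mass on one node
```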
encoder blocks allow every node to communicate; decoder blocks don't allow attending to the future
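A sketch of decoder-style (causal) masking with torch.tril; an encoder block would simply skip the masked_fill (sizes illustrative):

```python
import torch

T = 5
wei = torch.randn(T, T)                          # illustrative raw affinities
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))  # block communication with the future
wei = torch.softmax(wei, dim=-1)                 # future positions get exactly 0 weight
```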
To solve the previous issues with RNNs, we want to eliminate recurrence!
Thus, to feed all tokens at once, we need to encode position into the embeddings
To do this, we combine the token embeddings with some positional signal
As a result, we get position-aware encodings
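A sketch of one common way to do this, a learned position-embedding table added to the token embeddings (the original Transformer uses fixed sinusoids instead; all sizes here are illustrative):

```python
import torch
import torch.nn as nn

vocab_size, block_size, C = 1000, 8, 32    # illustrative sizes

tok_emb_table = nn.Embedding(vocab_size, C)
pos_emb_table = nn.Embedding(block_size, C)

idx = torch.randint(0, vocab_size, (4, block_size))  # (B, T) token ids
tok_emb = tok_emb_table(idx)                          # (B, T, C) "what" the token is
pos_emb = pos_emb_table(torch.arange(block_size))     # (T, C)    "where" it sits
x = tok_emb + pos_emb                                 # position-aware embeddings
```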
Attention scores (after softmax) lie in [0,1] and sum to 1
Example: He tossed the tennis ball to serve
we pay attention to: tennis, ball, serve
At step 3, we keep only the embeddings we should pay attention to
At step 4, we actually compute the features we make predictions on (the output)
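A toy illustration with made-up scores, just to show the weights landing in [0,1] and concentrating on tennis/ball/serve:

```python
import torch

tokens = ["He", "tossed", "the", "tennis", "ball", "to", "serve"]
scores = torch.tensor([0.1, 0.3, 0.0, 2.0, 2.2, 0.0, 1.8])  # hypothetical raw scores
weights = torch.softmax(scores, dim=-1)                      # in [0,1], sum to 1
for tok, w in zip(tokens, weights):
    print(f"{tok:>7s}  {w.item():.2f}")
```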
Attention (Stanford intro)
Attention input:
attn_from : hidden state of decoder (h_d[x])
attn_to : hidden state of encoder (h_e[x])
Attention function output (black ellipse):
score (how important h_e0 is for the decoder when decoding h_d0)
Attention algo output (context):
weighted sum over all encoder states h_e[x], with weights given by the softmax of the scores
Note: the attention function (black ellipse) might be any arbitrary function, for example:
Attn(h_e0, h_d0) = np.dot(h_e0, h_d0)
Attn(h_e0, h_d0) = linear(torch.cat([h_e0, h_d0])) etc.
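A sketch of the two scoring options above plus the context computation (shapes and the scorer layer are assumptions):

```python
import torch
import torch.nn as nn

H = 64                                   # illustrative hidden size
h_e = torch.randn(10, H)                 # encoder hidden states h_e[0..9]
h_d0 = torch.randn(H)                    # one decoder hidden state

# option 1: dot-product score
scores_dot = h_e @ h_d0                  # (10,)

# option 2: a small learned scorer over the concatenation
scorer = nn.Linear(2 * H, 1)
scores_mlp = scorer(torch.cat([h_e, h_d0.expand(10, H)], dim=-1)).squeeze(-1)

# context: weighted sum of all encoder states, weights = softmax over the scores
alpha = torch.softmax(scores_dot, dim=-1)   # (10,), each in [0,1], sums to 1
context = alpha @ h_e                       # (H,)
```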
It solves:
Forgetting : no need to squeeze everything into a single memory at once
Bottleneck : the decoder now has access to any encoder state
What is new to Seq2Seq?
the decoder has access to all encoder hidden states (not only the last one)
Why does the context sum over all hidden states?
one output might depend on multiple input hidden states
Self-attention: the encoder attends to the encoder (itself), or the decoder attends to the decoder (itself)
1. Compute self-attention for Input_1
2. Integrate self-A_1 into Hidden_state_1
3. Compute self-attention for Input_2
4. Integrate self-A_2 into Hidden_state_2
Note: self-A uses only the initial hidden_state (not the updated one) => ✔️ parallelization (see the sketch after this list)
5. Decoder self-attention is the same
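A sketch of why this parallelizes: since every self-A_t reads only the initial hidden states, all positions update in one matrix multiply (sizes illustrative):

```python
import torch

T, H = 6, 32                          # illustrative sizes
hidden = torch.randn(T, H)            # initial hidden states for Input_1..Input_T

scores = hidden @ hidden.T / H**0.5   # (T, T): every position scored against every other
alpha = torch.softmax(scores, dim=-1)
updated = alpha @ hidden              # (T, H): all Hidden_state_t updated in parallel
```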
Multi-head attention: multiple attention mechanisms run in parallel over the same sentence.
Each attention head aims to extract a particular relationship between two tokens.
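A minimal multi-head self-attention sketch (class names and sizes are illustrative, not a particular library's API): several heads run in parallel on the same input and their outputs are concatenated, then mixed by a projection.

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    def __init__(self, C, head_size):
        super().__init__()
        self.q = nn.Linear(C, head_size, bias=False)
        self.k = nn.Linear(C, head_size, bias=False)
        self.v = nn.Linear(C, head_size, bias=False)
        self.head_size = head_size

    def forward(self, x):                                  # x: (B, T, C)
        wei = self.q(x) @ self.k(x).transpose(-2, -1) / self.head_size**0.5
        return torch.softmax(wei, dim=-1) @ self.v(x)      # (B, T, head_size)

class MultiHeadAttention(nn.Module):
    def __init__(self, C, num_heads):
        super().__init__()
        self.heads = nn.ModuleList([Head(C, C // num_heads) for _ in range(num_heads)])
        self.proj = nn.Linear(C, C)   # mix the heads' outputs back together

    def forward(self, x):
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))

mha = MultiHeadAttention(C=32, num_heads=4)
out = mha(torch.randn(2, 8, 32))      # (B=2, T=8, C=32)
```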
Readings: