Pros:
constant path length between any two tokens
(unlike an RNN, where the first token is far from the last one)
parallelization: all tokens of a sequence are processed at once instead of one after another
Cons:
self-attention: quadratic in time and space with respect to sequence length (scaling to long sequences is an issue)
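A minimal sketch of where both the constant path length and the quadratic cost come from. It assumes a single attention head with identity projections (Q = K = V = x), which is a simplification for illustration and not the full Transformer layer; the function name self_attention is hypothetical.

```python
import math

def self_attention(x):
    """Single-head self-attention with identity projections (assumption: Q = K = V = x).

    x is a list of n token vectors, each a list of d floats.
    Returns (outputs, weights), where weights is the n x n attention matrix.
    """
    n, d = len(x), len(x[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for i in range(n):
        # One score per (query i, key j) pair: n scores per token, n * n in total.
        scores = [scale * sum(a * b for a, b in zip(x[i], x[j])) for j in range(n)]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    # Each output mixes all n tokens directly, so any token reaches any other in one step.
    outputs = [
        [sum(weights[i][j] * x[j][k] for j in range(n)) for k in range(d)]
        for i in range(n)
    ]
    return outputs, weights

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
print(len(weights), "x", len(weights[0]), "attention weights for", len(tokens), "tokens")
```

The weights matrix has n * n entries, so doubling the sequence length quadruples both the work and the memory. Each row of that matrix can be computed independently of the others, which is what makes the parallelization possible.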
History
result sample: a bunch of nonsense, but the words come from the same "space"
result sample: still nonsense, but it has a "flow"; we can read it as a correct sentence.
result sample: a bit of sense; some samples, even long ones, might be interpreted in some intelligent way.
sentences make sense, though they might be wrong in logic. nonsense may still occur.
the model can produce consistent sentences, coherent across many paragraphs, forming a story. nonsense may still occur.
fully makes sense. is able to flow across many paragraphs. inherits the style of the text (poetic aspect).