Predict the next word given the context:
P(next word) = relative frequency of that word in the training data
Independence assumption: words are treated as independent of each other
❌ Words unseen in training get zero probability.
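A minimal sketch of the frequency model above, assuming it is a plain unigram model estimated by counting. The function name `unigram_probs` and the toy sentence are made up for illustration. It shows that a word missing from the training data ends up with probability zero.

```python
from collections import Counter


def unigram_probs(tokens):
    """Estimate P(word) as count(word) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Toy training data (hypothetical).
train_tokens = "the cat sat on the mat".split()
probs = unigram_probs(train_tokens)

p_the = probs["the"]            # "the" occurs 2 times out of 6 tokens
p_dog = probs.get("dog", 0.0)   # "dog" never occurs in training

print(p_the)
print(p_dog)
```

Because "dog" has no count, any sentence containing it would also get probability zero under this model.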
💡 Split unseen words into parts
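One way to split a word into parts is greedy longest-match against a vocabulary of known pieces. This is only a sketch of the idea, not a specific tokenizer from the notes: the helper `split_into_parts` and the tiny `vocab` are hypothetical, and real subword methods learn their vocabulary from data.

```python
def split_into_parts(word, vocab):
    """Greedily split `word` into the longest pieces found in `vocab`.

    Falls back to a single character when no known piece matches,
    so every word can be segmented.
    """
    parts = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                parts.append(word[i:j])
                i = j
                break
        else:
            # No known piece starts here: emit one character and move on.
            parts.append(word[i])
            i += 1
    return parts


# Hypothetical vocabulary of known pieces.
vocab = {"un", "seen", "play", "ing", "ed"}

print(split_into_parts("unseen", vocab))
print(split_into_parts("playing", vocab))
print(split_into_parts("xing", vocab))
```

With this fallback, a word never seen in training is still represented by pieces the model has counts for, instead of getting zero probability as a whole.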
Why do we model words, not characters?
Characters might have problems with grammar, white space, and decoding.