Basics of NLP. Count-based approach (non-neural)
NLP structure
Before NNs, NLP used a count-based (aka statistical) approach, mainly Bayesian, which couldn't:
work with unseen words
From raw text to a "vocabulary" of arbitrary tokens like "change", "go", "=" etc.
note: for now "change" and "changes" are two different tokens.
Tokenization
Stemming : cut off the word's ending (suffix)
Lemmatization : reduce the word to its standard (dictionary) form
stop words : "the", "is", "and", etc.
non-informative words : "hello", "best regards", etc.
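A minimal preprocessing sketch of the steps above, assuming NLTK and its 'wordnet'/'stopwords' data are available; the token list is made up purely for illustration:

```python
# Assumes: pip install nltk; nltk.download('wordnet'); nltk.download('stopwords')
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import stopwords

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

tokens = ["the", "changes", "are", "going", "well"]   # toy tokens (assumption)

print([stemmer.stem(t) for t in tokens])                    # stemming: cut endings, e.g. "changes" -> "chang"
print([lemmatizer.lemmatize(t, pos="v") for t in tokens])   # lemmatization: base form, e.g. "going" -> "go"
print([t for t in tokens if t not in stop_words])           # drop stop words like "the", "are"
```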
"Hello world" example of NLP model : token → OHE → FC → Softmax → next word distribution
Example : ['Text of the very first new sentence with the first words in sentence.', 'Text of the second sentence.','Number three with lot of words words words.','Short text, less words.',]
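A sketch of turning this example corpus into a vocabulary, using a naive regex tokenizer (an assumption; real pipelines use proper tokenizers):

```python
import re

corpus = [
    'Text of the very first new sentence with the first words in sentence.',
    'Text of the second sentence.',
    'Number three with lot of words words words.',
    'Short text, less words.',
]

# lower-case, strip punctuation, keep alphabetic tokens only
tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in corpus]
vocab = sorted({tok for doc in tokenized for tok in doc})
print(vocab)
```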
Def. Embedding of word X = any arbitrary transformation of word X into a numerical representation (vector)
OHE (one-hot encoding) places a 1 at the word's index in the vocabulary and 0 everywhere else (see the sketch after the problems list below).
Problems:
OHE vectors don't take context into account
can't define a meaningful distance metric between two words
memory grows with vocabulary size: an OHE vector can have billions of axes, each carrying a single bit of information
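A one-hot encoding sketch over a toy vocabulary (the vocabulary itself is an assumption), which also shows why no meaningful similarity metric exists between such vectors:

```python
import numpy as np

vocab = ["cat", "dog", "shark", "space"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_idx[word]] = 1.0
    return vec

print(one_hot("shark"))                  # [0. 0. 1. 0.]
# Every pair of distinct one-hot vectors is orthogonal and equally distant,
# regardless of how similar the words are in meaning:
print(one_hot("cat") @ one_hot("dog"))   # 0.0
```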
A sentence is represented as the count of every vocabulary word in it.
Problems: word order and context are not taken into account.
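A sketch of this sentence-as-counts representation (commonly called bag-of-words), assuming scikit-learn is installed; the two sentences are taken from the example corpus above:

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'Text of the second sentence.',
    'Number three with lot of words words words.',
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
print(counts.toarray())   # e.g. "words" counted 3 times in the second sentence
```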
Encoding weights help to encode a word so that its embedding better represents the meaning and importance of the word.
TF-IDF tells how important a particular word is for a document.
TF-IDF uses the hypothesis that less frequent words are more important.
So TF-IDF lowers the weights of common words ('this', 'a', 'she') and raises the weights of rare, important words ('space', 'nature', 'shark').
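A TF-IDF sketch, assuming scikit-learn is installed: tf-idf(t, d) = tf(t, d) * idf(t), where idf grows for rarer terms, so common words get lower weights; the sentences come from the example corpus above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'Text of the second sentence.',
    'Number three with lot of words words words.',
    'Short text, less words.',
]

vectorizer = TfidfVectorizer()
weights = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
# terms unique to one sentence get a larger idf than terms shared across sentences
print(weights.toarray().round(2))
```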
PMI tells how likely two (or N) words are to come together.
ex. "Puerto" and "Rico" are more likely to come together than "Puerto" and "Cat"
Check Embeddings page
Main Hypothesis: similar words (by meaning) should have similar vectors (by distance)
Def. Context embedding of word X = counts of co-occurrences of word X with surrounding words within a window
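A context-embedding sketch: count co-occurrences of each word with its neighbours inside a symmetric window (the sentence and window size m=1 are assumptions for brevity):

```python
from collections import defaultdict

tokens = "the quick brown fox jumps over the lazy dog".split()
m = 1                                        # window size: one word left and right
cooc = defaultdict(lambda: defaultdict(int))

for i, w in enumerate(tokens):
    for j in range(max(0, i - m), min(len(tokens), i + m + 1)):
        if j != i:
            cooc[w][tokens[j]] += 1

print(dict(cooc["the"]))   # the co-occurrence counts form the context vector of "the"
```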
Word2Vec transforms individual words into numerical representations (aka embeddings)
It relies on the main NLP hypothesis: similar words (by meaning) should have similar vectors (by distance)
Unlike OHE, word2vec takes context into account. It has two main architectures: Skip-gram and CBOW
CBOW predicts a target word from a list of context words
✔️ fast, as it predicts only 1 distribution
⚠️ the order of words in the context is not important
Skip-Gram predicts context words from a target word
✔️ works well for rare target words
❌ slow, hard to train
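A training sketch with gensim (assuming it is installed): sg=0 gives CBOW, sg=1 gives Skip-gram; the corpus and hyperparameters are toy assumptions:

```python
from gensim.models import Word2Vec

sentences = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["a", "quick", "brown", "dog"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)   # Skip-gram

print(skipgram.wv.most_similar("quick", topn=2))   # nearest words by cosine similarity
```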
Cross-Entropy Loss (CEL)
CEL tells how far predicted distributions are from ground truth.
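A cross-entropy sketch: CE(y, p) = -sum_i y_i * log(p_i); with a one-hot ground truth it reduces to -log of the probability assigned to the true word (the distributions below are made up):

```python
import numpy as np

y_true = np.array([0.0, 1.0, 0.0, 0.0])        # one-hot ground truth (assumption)
p_pred = np.array([0.1, 0.7, 0.1, 0.1])        # predicted next-word distribution (assumption)

cross_entropy = -np.sum(y_true * np.log(p_pred))
print(round(cross_entropy, 3))                  # ≈ 0.357; approaches 0 as p for the true word -> 1
```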
Skip-Gram structure
Last layer: SoftMax to predict the distribution
...of the (context) word o given the center word c:
$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}$$
where:
v - center word vector
u - context word vector
Skip-Gram likelihood to maximize:
$$L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)$$
which gives the negative log-likelihood to minimize:
$$J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)$$
Fig 3. End-to-end Skip-gram model training on the sentence “The quick brown fox”. Window m=1 (only left and right)
Skip-gram captures the meaning of words given their local context
Problem: in the sentence "The cat...", the tokens "The" and "cat" may often appear together, but Skip-gram doesn't know whether "The" is just a common word or a word specifically related to "cat".
Solution: the GloVe model takes into account the frequency of "The" in the global text and, in particular, together with the word "cat".
The GloVe model takes into account both the local context and the global statistics of words in the text.
Main idea: focus on co-occurrence probabilities:
how often word j appears within the context of word i
X - co-occurrence matrix
X_ij = number of times word j appears in the context of word i.
Co-occurrence probabilities: $P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}$, where $X_i = \sum_k X_{ik}$
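A sketch of computing these co-occurrence probabilities P(j | i) = X_ij / X_i; the matrix X below is a tiny made-up example, not real corpus statistics:

```python
import numpy as np

words = ["the", "cat", "fox"]
X = np.array([
    [0, 10, 4],    # co-occurrence counts of "the" with each word (assumption)
    [10, 0, 1],    # "cat"
    [4, 1, 0],     # "fox"
], dtype=float)

P = X / X.sum(axis=1, keepdims=True)   # row-normalise: P[i, j] = P(j | i)
print(P.round(2))
```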