Goal: Given a user question about a product, extract relevant documents and answer the question.
Model: MiniLM (deepset/minilm-uncased-squad2)
Dataset: SubjQA question & answer pairs
Steps:
Perform QA for (query, context)
Perform extractive QA for (query) only
Set up a Retriever that selects a few contexts from the entire dataset
Set up a Reader that answers the query based on the selected contexts
Example: Extractive QA for an e-commerce website
DATASET: SubjQA electronics
Train examples: 1,295
Validation examples: 255
Test examples: 358
Example Context:
I really like this keyboard. I give it 4 stars because it doesn’t have a CAPS LOCK key so I never know if my caps are on. But for the price, it really suffices as a wireless keyboard. I have very large hands and this keyboard is compact, but I have no complaints.
Example Question:
Is the keyboard lightweight?
Example Answer:
this keyboard is compact
MODEL
BERT architecture
33M params: 12-layers, 384-hidden, 12-heads, 1536-d_ff
2.7x faster than BERT
Fine-tuned on SQuAD 2.0
pre-trained model: MiniLM
"microsoft/MiniLM-L12-H384-uncased"
Span classification
Common QA frameworks:
Span classification : predict start & end position within the context.
Free text generation
Multiple-Choice QA
Fill in the blanks of an answer template with the correct words
Yes/No QA
Span classification QA framework:
Example : The Wright brothers flew the motor-operated airplane on December 17, 1903. Their aircraft, the W-Flyer, used ailerons for control and had a 12-horsepower engine.
It works well!
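Under the hood, span classification produces a start logit and an end logit for every token; a minimal sketch, assuming the same deepset/minilm-uncased-squad2 checkpoint as above:

```python
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

model_id = "deepset/minilm-uncased-squad2"  # assumed: same checkpoint as above
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForQuestionAnswering.from_pretrained(model_id)

question = "When did the Wright brothers fly?"
context = ("The Wright brothers flew the motor-operated airplane on December 17, 1903. "
           "Their aircraft, the W-Flyer, used ailerons for control and had a 12-horsepower engine.")

inputs = tokenizer(question, context, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Naive decoding: most likely start and end positions (production code also
# checks start <= end and masks out the question tokens)
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
print(tokenizer.decode(inputs["input_ids"][0][start : end + 1]))  # e.g. "december 17, 1903"
```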
However, in real life we usually only have the question 🤔 So we need to find the relevant passages in the entire corpus.
The simplest way: concatenate all reviews into one huge context. But this would have unacceptable latency ☹️ A smarter way is:
Retriever-Reader architecture
Retriever: embed the contexts and select the ones with the highest dot-product similarity to the query
sparse retriever: high-dim sparse vector (ex. Bag Of Words, TF-IDF, BM25)
dense retriever: low-dim dense vector (ex. BERT, RoBERTa); see the sketch after this list
Reader: extract the answer from the best documents provided by the retriever
Document store: document database provided to the retriever at query time
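A minimal sketch of the dense-retriever idea with sentence-transformers; the encoder name and the toy reviews are illustrative, not from the notes:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative dense encoder; any QA-oriented embedding model works similarly
encoder = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

reviews = [  # toy contexts standing in for the review corpus
    "The cord is about six feet long, plenty for my desk.",
    "Battery life is great, I charge it once a month.",
    "The keys feel mushy after a few weeks of use.",
]
query = "What is the length of the cord?"

doc_emb = encoder.encode(reviews, convert_to_tensor=True)
query_emb = encoder.encode(query, convert_to_tensor=True)

# Rank contexts by dot-product similarity to the query
scores = util.dot_score(query_emb, doc_emb)[0]
best = int(scores.argmax())
print(reviews[best], float(scores[best]))
```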
Set up Retriever
We use a BM25 retriever (an improved variant of TF-IDF).
Let's query "What is the length of the cord?"
The retriever manages to surface related reviews where a potential answer might be found (see the table)
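A minimal BM25 sketch with the rank_bm25 package; the toy reviews are illustrative placeholders, the real setup indexes the SubjQA reviews in a document store:

```python
from rank_bm25 import BM25Okapi

# Toy reviews standing in for the SubjQA electronics corpus
reviews = [
    "The cord is about six feet long, plenty for my desk.",
    "Battery life is great, I charge it once a month.",
    "The keys feel mushy after a few weeks of use.",
]
bm25 = BM25Okapi([r.lower().split() for r in reviews])

query = "What is the length of the cord?"
top_docs = bm25.get_top_n(query.lower().split(), reviews, n=2)
print(top_docs)  # the cord-related review should rank first
```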
Set up Reader
The Reader is basically the framework's abstraction around the model we already played with (MiniLM)
Example : The Wright brothers flew the motor-operated airplane on December 17, 1903. Their aircraft, the W-Flyer, used ailerons for control and had a 12-horsepower engine.
Extractive QA
Finally, we achieved our Goal!
Product: Amazon Kindle e-reader (code: B0074BW614)
Query : "Is it good for reading?"
Retriever: BM25
Document dataset: SubjQA
Reader: MiniLM (BERT-based) model
Top 3 answers for the query "Is it good for reading?" and the Kindle e-reader product
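A minimal retrieve-then-read sketch putting both pieces together; the Kindle reviews below are illustrative placeholders, not the SubjQA data:

```python
from rank_bm25 import BM25Okapi
from transformers import pipeline

# Illustrative Kindle reviews; in the notes these come from SubjQA (product B0074BW614)
reviews = [
    "I love reading on this Kindle, the screen is easy on the eyes.",
    "Battery lasts weeks, but the case I bought separately is flimsy.",
    "Great for reading at night thanks to the built-in light.",
]

retriever = BM25Okapi([r.lower().split() for r in reviews])
reader = pipeline("question-answering", model="deepset/minilm-uncased-squad2")

query = "Is it good for reading?"
top_contexts = retriever.get_top_n(query.lower().split(), reviews, n=3)

# Extract an answer from each retrieved review and rank by reader confidence
answers = [reader(question=query, context=ctx) for ctx in top_contexts]
for ans in sorted(answers, key=lambda a: a["score"], reverse=True)[:3]:
    print(f"{ans['score']:.3f}  {ans['answer']}")
```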
Evaluate Retriever
The answer quality mainly comes from the neural Reader (BERT-based), but the Retriever sets the upper bound by providing the relevant documents.
Evaluation:
retrieve top_k documents
check if the answer is present in each retrieved document
compute recall (see the sketch after this list)
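A minimal sketch of this recall@k computation; the retrieve function and the (question, answer) pairs are placeholders:

```python
def recall_at_k(examples, retrieve, k=3):
    """examples: (question, gold_answer) pairs; retrieve: question -> ranked docs."""
    hits = 0
    for question, gold_answer in examples:
        top_docs = retrieve(question)[:k]
        # The retriever "hits" if any retrieved document contains the gold answer span
        if any(gold_answer.lower() in doc.lower() for doc in top_docs):
            hits += 1
    return hits / len(examples)

# Illustrative usage with the BM25 retriever sketched earlier:
# recall = recall_at_k(qa_pairs, lambda q: bm25.get_top_n(q.lower().split(), reviews, n=10), k=3)
```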
Result:
The optimal choice of top_k is around k=3
Sparse (BM25) vs Dense (DPR) retriever evaluation
Evaluate Reader
Extractive QA has two standard reader metrics: Exact Match (EM) and F1.
Reporting both gives a representative, balanced score.
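A simplified sketch of how EM and F1 are computed per example (without the full SQuAD answer normalization):

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> int:
    # 1 if the (lightly normalized) strings are identical, else 0
    return int(prediction.strip().lower() == reference.strip().lower())

def f1_score(prediction: str, reference: str) -> float:
    # Token-level overlap between prediction and reference
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("this keyboard is compact", "compact"))         # 0
print(round(f1_score("this keyboard is compact", "compact"), 2))  # 0.4
```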
Domain Adaptation
EM and F1 scores on the SubjQA dataset for 3 models.
base model: MiniLM-L12-H384-uncased
MiniLM (FT on SQuAD only) generalizes poorly to QA on SubjQA
MiniLM (FT on SQuAD, then FT on SubjQA) adapts well to SubjQA
MiniLM (FT on SubjQA only) overfits due to the small SubjQA dataset
Evaluate QA Pipeline
Let's compare the Reader alone vs the Retriever & Reader pipeline.
We can see the Retriever's impact on overall performance.
Reader-only vs Retriever & Reader scores on SubjQA
Generative QA
We implemented only extractive QA, which tries to find the answer's start and end tokens in the context (token classification).
Generative QA can synthesize the answer from parts scattered across the entire context.
Retrieval-Augmented Generation (RAG):
Reader is now a generator (encoder-decoder like T5 or BART)
The generator receives the query together with the retrieved documents as input; the documents are treated as latent variables
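A minimal sketch of generative QA with the RAG classes in transformers; the facebook/rag-token-nq checkpoint and the dummy index are illustrative choices, not the notes' setup:

```python
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration

model_id = "facebook/rag-token-nq"
tokenizer = RagTokenizer.from_pretrained(model_id)
# use_dummy_dataset avoids downloading the full Wikipedia index; answers will be rough
retriever = RagRetriever.from_pretrained(model_id, index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained(model_id, retriever=retriever)

inputs = tokenizer("Is the Kindle good for reading?", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```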
Conclusion
Pre-trained & fine-tuned models for QA might still lack of generalization. Domain adaptation can boost the performance.