Goal: named entity recognition (NER) for a Swiss user (4 languages)
Model: base : XLM-RoBERTa + head : token-classification
Dataset: multilingual PANX dataset (DE, FR, IT and EN)
Agenda:
Fine-tune XLM-R on PANX DE only and zero-shot on FR, IT, EN.
👉 Latin languages are similar + provide good zero-shot accuracy
Fine-tune XLM-R on PANX FR with 250, 500, 1k, 2k and 4k examples
👉 the DE zero-shot baseline is already a very strong predictor
Fine-tune XLM-R on PANX DE + FR, compare results
Fine-tune XLM-R on PANX DE + FR + IT + EN. Draw conclusions
Sentences are labeled in IOB format.
We import PANX for DE, FR, IT and EN in proportions matching Swiss language usage (loading sketch below):
Swiss = 63% DE + 23% FR + 8% IT + 6% EN.
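A minimal loading sketch, assuming the PAN-X configs of the XTREME dataset on the Hugging Face Hub (seed and rounding are assumptions):

from collections import defaultdict
from datasets import DatasetDict, load_dataset

langs = ["de", "fr", "it", "en"]
fracs = [0.63, 0.23, 0.08, 0.06]          # Swiss language proportions above
panx_ch = defaultdict(DatasetDict)
for lang, frac in zip(langs, fracs):
    ds = load_dataset("xtreme", name=f"PAN-X.{lang}")
    for split in ds:
        # shuffle, then keep a Swiss-proportioned slice of each split
        panx_ch[lang][split] = (ds[split]
                                .shuffle(seed=0)
                                .select(range(int(frac * ds[split].num_rows))))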
Total 7 tags : O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC
Example: "Jeff Dean works at Google in California" → B-PER I-PER O O B-ORG O B-LOC
XLM-RoBERTa is a multilingual version of RoBERTa
XLM-R is pre-trained on 2.5TB of CommonCrawl data covering 100 languages
nn.Dropout(config.hidden_dropout_prob)
nn.Linear(config.hidden_size, num_labels)   # num_labels = 7; not a kwarg of nn.Linear
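A sketch of how this head sits on the XLM-R body; class and variable names are illustrative (in practice AutoModelForTokenClassification wires this up for you):

import torch.nn as nn
from transformers import AutoConfig, AutoModel

class XLMRTokenClassifier(nn.Module):
    def __init__(self, model_name="xlm-roberta-base", num_labels=7):
        super().__init__()
        config = AutoConfig.from_pretrained(model_name)
        self.body = AutoModel.from_pretrained(model_name)            # XLM-R encoder
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        # one hidden state per subword token -> one logit vector per subword token
        hidden = self.body(input_ids, attention_mask=attention_mask).last_hidden_state
        return self.classifier(self.dropout(hidden))                 # (batch, seq, 7)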
Preprocess (sketch after this list):
Tokenizer : AutoTokenizer.from_pretrained("xlm-roberta-base")
tag subsequent subwords with IGN (-100, ignored by the loss)
wrap with <s> and </s>
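A preprocessing sketch using the fast tokenizer's word_ids() to implement the three steps above (the function name is ours):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_and_align(batch):
    # tokenizes pre-split words; XLM-R adds <s> ... </s> by itself
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        prev, ids = None, []
        for wid in enc.word_ids(batch_index=i):
            # IGN (-100) for special tokens and for subsequent subwords
            ids.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        labels.append(ids)
    enc["labels"] = labels
    return enc

# usage: panx_de_encoded = panx_ch["de"].map(tokenize_and_align, batched=True)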
FT dataset : PANX.DE['train']
data size: 12.6k
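A fine-tuning sketch with the Trainer API; the hyperparameters are assumptions, and panx_de_encoded is PANX.DE mapped through tokenize_and_align above:

from transformers import (AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          Trainer, TrainingArguments)

model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-base",
                                                        num_labels=7)
args = TrainingArguments(output_dir="xlmr-panx-de",   # settings are guesses
                         num_train_epochs=3,
                         per_device_train_batch_size=24)
trainer = Trainer(model=model, args=args,
                  data_collator=DataCollatorForTokenClassification(tokenizer),
                  train_dataset=panx_de_encoded["train"],
                  eval_dataset=panx_de_encoded["validation"])
trainer.train()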
Results:
F1 score on PANX.DE['test'] : 87%
zero-shot F1 PANX.FR['test'] : 71%
zero-shot F1 PANX.IT['test'] : 68%
zero-shot F1 PANX.EN['test'] : 59%
Conclusion: simply querying FR on the DE-trained model (zero-shot) already yields 71% F1. 💪Latin languages🤝
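A sketch of how such zero-shot F1 scores can be computed with seqeval, assuming the trainer above and an encoded FR test split (panx_fr_encoded is an assumption):

import numpy as np
from seqeval.metrics import f1_score

tags = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
index2tag = dict(enumerate(tags))

def eval_f1(trainer, dataset):
    out = trainer.predict(dataset)
    preds = np.argmax(out.predictions, axis=2)
    y_true, y_pred = [], []
    for p_row, l_row in zip(preds, out.label_ids):
        # keep only real word positions (label != -100 / IGN)
        keep = [(index2tag[l], index2tag[p])
                for p, l in zip(p_row, l_row) if l != -100]
        y_true.append([t for t, _ in keep])
        y_pred.append([p for _, p in keep])
    return f1_score(y_true, y_pred)

# zero-shot: DE-trained trainer, FR test split
print(eval_f1(trainer, panx_fr_encoded["test"]))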
Finetuned dataset : PANX.FR['train']
data size: 250, 500, 1k, 2k and 4k
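A sketch of the sweep; train_and_eval() is a hypothetical helper wrapping the Trainer setup above and returning test-set F1:

f1_by_size = {}
for n in [250, 500, 1000, 2000, 4000]:
    subset = panx_fr_encoded["train"].shuffle(seed=0).select(range(n))
    # train_and_eval: hypothetical helper = Trainer setup above + eval_f1
    f1_by_size[n] = train_and_eval(subset, panx_fr_encoded["test"])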
Results:
With only a few hundred FR examples, fine-tuning on FR alone underperforms the DE zero-shot baseline
NER knowledge transfers across languages
Finetuned dataset :
12.6k PANX.DE['train'] + 4.6k PANX.FR['train']
Conclusion: by fine-tuning on DE + FR instead of DE alone, we not only maintain DE performance but greatly improve FR, IT and EN.
Finetuned dataset :
12.6k PANX.DE['train']
4.6k PANX.FR['train']
1.6k PANX.IT['train']
1.2k PANX.EN['train']
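A sketch of building the multilingual training set with datasets.concatenate_datasets (the same pattern, minus IT/EN, covers the DE + FR run above); panx_encoded holding each language's encoded splits is an assumption:

from datasets import concatenate_datasets

# panx_encoded[lang]: each language's PANX run through tokenize_and_align (assumed)
train_multi = concatenate_datasets(
    [panx_encoded[lang]["train"] for lang in ["de", "fr", "it", "en"]]
).shuffle(seed=0)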
Notes:
English is the low-resource language here (only 6% of the data)
yet English still gets a significant performance gain
Cross-lingual transfer is extremely beneficial for low-resource (rare) languages
The more distant the linguistic groups, the smaller the benefit from cross-lingual transfer
Multilingual training may cost a little single-language accuracy but improves overall performance across languages