May 19, 2020 via Amazon Science
When unsupervised training pays off in natural-language processing
Why it matters
Critical insight for AI teams building language models: at smaller vocabulary sizes, unsupervised tokenizers outperform their supervised counterparts, potentially reducing training costs and complexity for certain use cases.
Key signals
- Unsupervised tokenizers outperform supervised ones at smaller vocabulary sizes
- Research conducted by the Amazon Science team
- Findings apply specifically to natural language processing applications
The hook
Amazon's research pinpoints when unsupervised training actually beats supervised methods in NLP: at smaller vocabulary sizes, tokenizers trained on unannotated data come out ahead.
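For context, "unsupervised" here means the tokenizer learns its subword vocabulary directly from raw text, with no annotated segmentation data. Below is a minimal sketch of that setup using the open-source SentencePiece library; the corpus path, model prefix, and vocabulary size are illustrative placeholders, not values from the study.

```python
# Minimal sketch: train an unsupervised subword tokenizer at a small
# vocabulary size. File names and vocab_size are illustrative
# placeholders, not details from the Amazon study.
import sentencepiece as spm

# Learn a unigram-LM tokenizer from raw, unannotated text -- no labeled
# segmentation data is required, which is what makes it "unsupervised".
spm.SentencePieceTrainer.train(
    input="corpus.txt",        # plain text, one sentence per line (placeholder)
    model_prefix="unsup_tok",  # writes unsup_tok.model and unsup_tok.vocab
    vocab_size=4000,           # deliberately small: the regime the finding concerns
    model_type="unigram",      # unsupervised unigram LM; "bpe" is another option
)

# Load the trained model and segment a sentence into subword pieces.
sp = spm.SentencePieceProcessor(model_file="unsup_tok.model")
print(sp.encode("Unsupervised tokenizers shine at small vocabularies.", out_type=str))
```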
Relevance score: 75/100