Model Wars · February 14, 2024 · via Google Research Blog

Learning the importance of training data under concept drift

Why it matters

As real-world data constantly evolves, Google Research shows that intelligently prioritizing training data by age and content—not treating all data equally—unlocks significant performance gains. This matters for any company deploying AI in nonstationary environments where yesterday's data may be irrelevant tomorrow.

Key signals

  • 15% relative accuracy gains on large-scale nonstationary learning benchmark (39M photos over 10 years)
  • Tested across 7+ datasets spanning photos, satellite imagery, social media text, medical records, sensor data, and tabular data
  • Method separates instance-specific and age-related decay contributions using multiple fixed timescales
  • Outperforms offline training and standard continual learning approaches on photo categorization task
  • Reduces accuracy degradation in the test period vs. baseline methods, addressing the catastrophic forgetting problem
  • Published by Google Research, February 2024

The hook

A 15% relative accuracy gain: that's what Google Research achieved by reweighting training data for models facing concept drift across 39M photos.

Posted by Nishant Jain, Pre-doctoral Researcher, and Pradeep Shenoy, Research Scientist, Google Research

The constantly changing nature of the world around us poses a significant challenge for the development of AI models. Often, models are trained on longitudinal data with the hope that the training data will accurately represent the inputs the model may receive in the future. More generally, the default assumption that all training data are equally relevant often breaks down in practice. For example, the figure below shows images from the CLEAR nonstationary learning benchmark, illustrating how the visual features of objects evolve significantly over a 10-year span (a phenomenon we refer to as slow concept drift), posing a challenge for object categorization models.

[Figure: Sample images from the CLEAR benchmark. (Adapted from Lin et al.)]

Common approaches, such as online and continual learning, repeatedly update a model with small amounts of recent data in order to keep it current. This implicitly prioritizes recent data, as the learnings from past data are gradually erased by subsequent updates. However, in the real world, different kinds of information lose relevance at different rates, which exposes two key issues: 1) by design, these methods focus exclusively on the most recent data and lose any signal from older data that is erased; 2) contributions from data instances decay uniformly over time, irrespective of the contents of the data.

In our recent work, "Instance-Conditional Timescales of Decay for Non-Stationary Learning", we propose assigning each instance an importance score during training in order to maximize model performance on future data. To accomplish this, we employ an auxiliary model that produces these scores using both the training instance and its age. This auxiliary model is learned jointly with the primary model. Our approach addresses both of the above challenges and achieves significant gains over other robust learning methods on a range of benchmark datasets for nonstationary learning. For in...
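
The core mechanic, weighting each training example's loss by an importance score derived from the instance and its age, can be sketched as follows. This is a minimal stand-in, not the paper's method: here the auxiliary scorer is a fixed exponential decay in age, whereas in the paper the scorer is itself a learned model optimized jointly with the primary model. The variable names and setup are invented for illustration.

```python
import numpy as np

# Minimal sketch: weighted logistic regression where each example's loss
# contribution is scaled by an importance score from an auxiliary scorer.
# Stand-in scorer: importance decays exponentially with example age.

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
ages = rng.uniform(0, 10, size=n)            # how old each example is
y = (X @ rng.normal(size=d) > 0).astype(float)

def scorer(age, tau=3.0):
    """Stand-in auxiliary model: importance decays with age.
    (In the paper, this model is learned jointly with the primary one.)"""
    return np.exp(-age / tau)

w_example = scorer(ages)                     # per-example importance
theta = np.zeros(d)

for _ in range(500):                         # weighted gradient descent
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))   # predicted probabilities
    grad = X.T @ (w_example * (p - y)) / w_example.sum()
    theta -= 0.5 * grad

acc = np.mean(((X @ theta) > 0) == (y > 0.5))
```

The weighted gradient means stale examples contribute little to the update, while recent ones dominate, exactly the kind of soft, content-and-age-dependent prioritization the post describes, in contrast to continual learning's hard cutoff on older data.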
Relevance score: 78/100
