Data Distribution Mismatch

202502012211
tags: #machine-learning #data-quality #distribution

Data distribution mismatch occurs when training data comes from a different distribution than the data you'll encounter in production.

Common scenarios:

Training on high-quality studio photos, deploying on mobile phone photos
Training on formal text, deploying on social media text
Training on historical data, deploying in changing market conditions

Diagnosis using Training-Dev set:

Create training-dev set from same distribution as training data
Compare performance: training vs training-dev vs dev vs test
If training-dev error >> training error: variance problem
If dev error >> training-dev error: data mismatch problem

Solutions:

Collect more data from target distribution
Artificial Data Synthesis (but beware of introducing artifacts)
Domain adaptation techniques
Feature engineering to make model more robust

Important: Always ensure dev and test sets come from the same distribution as your target application.

Data mismatch is often more impactful than algorithmic improvements.

Reference

Machine Learning Yearning by Andrew Ng