Data Distribution Mismatch

202502012211
tags: #machine-learning #data-quality #distribution

Data distribution mismatch occurs when training data comes from a different distribution than the data you'll encounter in production.

Common scenarios:

Diagnosis using Training-Dev set:

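  1. Create a training-dev set: hold out a slice of data from the same distribution as the training data, and do not train on it
  2. Compare error across the sets: training vs training-dev vs dev vs test
  3. If training-dev error >> training error: variance problem (the model fails to generalize even within the training distribution)
  4. If dev error >> training-dev error: data mismatch problem (the model generalizes within the training distribution but not to the target distribution)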
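
The diagnosis steps above can be sketched in Python. This is a minimal illustration, not a method from the reference: the function names `split_training_dev` and `diagnose`, the 10% hold-out fraction, and the 0.02 gap threshold are all assumptions chosen for the example. What counts as a "much larger" error depends on the task.

```python
import random


def split_training_dev(examples, fraction=0.1, seed=0):
    """Hold out a random slice of the training data as the training-dev set.

    The held-out slice shares the training distribution but is never trained on.
    The 10% default fraction is an arbitrary illustrative choice.
    """
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * fraction)
    return shuffled[cut:], shuffled[:cut]  # (train, training_dev)


def diagnose(train_err, train_dev_err, dev_err, gap_threshold=0.02):
    """Name the large error gaps between consecutive sets.

    gap_threshold is a hypothetical cutoff for "much larger".
    """
    findings = []
    # Gap within the training distribution: variance.
    if train_dev_err - train_err > gap_threshold:
        findings.append("variance")
    # Gap between training distribution and target distribution: data mismatch.
    if dev_err - train_dev_err > gap_threshold:
        findings.append("data mismatch")
    return findings or ["no large gap"]


# Made-up error rates for illustration only.
print(diagnose(train_err=0.01, train_dev_err=0.015, dev_err=0.10))
print(diagnose(train_err=0.01, train_dev_err=0.09, dev_err=0.10))
```

In the first call the error jumps only between training-dev and dev, which points to data mismatch. In the second call the jump already appears on the training-dev set, which points to variance.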

Solutions:

Important: Always ensure dev and test sets come from the same distribution as your target application.

Fixing a data mismatch often improves production performance more than algorithmic improvements do.


Reference

Machine Learning Yearning by Andrew Ng