Artificial Data Synthesis

202502012218
tags: #machine-learning #data-augmentation #data-quality

Artificial data synthesis creates new training examples from existing data to address Data Distribution Mismatch or insufficient training data.
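
A minimal sketch of the idea for audio, assuming the target distribution is noisy speech and only clean recordings are available. The function name `mix_at_snr`, the toy sine-wave "speech", and the Gaussian "background" are illustrative assumptions, not part of the original note.

```python
import math
import random

def mix_at_snr(clean, noise, snr_db, rng):
    """Add a random slice of `noise` to `clean`, scaled to a target SNR in dB."""
    if len(noise) < len(clean):
        raise ValueError("noise clip must be at least as long as the clean clip")
    # Pick a random offset so repeated calls do not always reuse the same slice.
    start = rng.randrange(len(noise) - len(clean) + 1)
    segment = noise[start:start + len(clean)]

    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in segment) / len(segment)
    if noise_power == 0:
        return list(clean)
    # Choose scale so clean_power / (scale**2 * noise_power) == 10**(snr_db / 10).
    scale = math.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return [c + scale * n for c, n in zip(clean, segment)]

rng = random.Random(0)
clean = [math.sin(2 * math.pi * 5 * t / 200) for t in range(200)]  # toy "speech"
noise = [rng.gauss(0.0, 1.0) for _ in range(1000)]                 # toy "background"
synthetic = mix_at_snr(clean, noise, snr_db=10, rng=rng)
```

Drawing a random offset into a long noise recording is a deliberate choice: reusing one short noise clip for every example is an easy way to create a repeated pattern the model can memorize.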

Common techniques:

Image data:

Text data:

Audio data:

Tabular data:

Critical warning: Artificial data can introduce subtle artifacts that don't exist in real data. The model may learn to recognize these artifacts instead of the true underlying patterns.
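
One way to probe for such artifacts, not taken from the reference, is to train a classifier to tell real from synthetic examples (sometimes called adversarial validation). Held-out accuracy near 0.5 suggests no obvious artifact; accuracy near 1.0 means something separates the two sets. The sketch below uses a nearest-centroid classifier, so it only detects shifts in feature means; the `leaky` and `faithful` pipelines are hypothetical toy data.

```python
import random

def centroid(rows):
    """Per-feature mean of a list of feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def real_vs_synthetic_accuracy(real, synthetic, rng, train_frac=0.5):
    """Held-out accuracy of a nearest-centroid real-vs-synthetic classifier."""
    real, synthetic = real[:], synthetic[:]
    rng.shuffle(real)
    rng.shuffle(synthetic)
    r_cut = int(len(real) * train_frac)
    s_cut = int(len(synthetic) * train_frac)
    # Fit: one centroid per class, using only the training split.
    c_real = centroid(real[:r_cut])
    c_syn = centroid(synthetic[:s_cut])
    # Evaluate on the held-out split (label 0 = real, 1 = synthetic).
    test = [(x, 0) for x in real[r_cut:]] + [(x, 1) for x in synthetic[s_cut:]]
    correct = sum(
        (1 if sq_dist(x, c_syn) < sq_dist(x, c_real) else 0) == label
        for x, label in test
    )
    return correct / len(test)

rng = random.Random(0)
real = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(400)]
# Hypothetical synthesis pipeline that leaks a constant offset into feature 1.
leaky = [[rng.gauss(0, 1), rng.gauss(0, 1) + 3.0] for _ in range(400)]
# Hypothetical pipeline whose output matches the real distribution.
faithful = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(400)]

acc_leaky = real_vs_synthetic_accuracy(real, leaky, rng)
acc_faithful = real_vs_synthetic_accuracy(real, faithful, rng)
```

A high score for `acc_leaky` and a near-chance score for `acc_faithful` would flag the first pipeline for inspection. A near-chance score is not proof of safety, since a stronger classifier may find artifacts this one misses.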

Best practices:

Synthesis helps with Data Distribution Mismatch but should supplement, not replace, collecting real target-distribution data.


Reference

Machine Learning Yearning by Andrew Ng