Data Curation



Data curation is the most critical ingredient in a machine learning workflow. A pipeline can ingest a wide variety of data representations, and the raw data is often noisy, dirty, coarse, random, or chaotic, requiring substantial effort to wrangle and munge. Values can be missing or impossible. Curation is often the most expensive and laborious step, requiring cleaning, preprocessing, selection, normalization, transformation, reduction, scaling, augmentation, sorting, formatting, and structuring. The data is almost never delivered in the structure needed for training set input.

Some tips:

Randomly shuffle data instances. This eliminates bias introduced by the order in which the data was collected or stored.
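A minimal sketch of shuffling, assuming the data is held as two parallel Python lists. The helper name shuffle_together and the toy values are hypothetical. The point it illustrates is that features and labels must be reordered with the same permutation, and that a fixed seed keeps the shuffle reproducible.

```python
import random


def shuffle_together(features, labels, seed=0):
    """Return features and labels reordered by one shared random permutation."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)  # seeded so the result is reproducible
    return [features[i] for i in order], [labels[i] for i in order]


# Hypothetical toy data that arrives sorted by class.
features = [[0.1], [0.2], [0.3], [0.9], [1.0], [1.1]]
labels = [0, 0, 0, 1, 1, 1]

shuffled_x, shuffled_y = shuffle_together(features, labels, seed=42)
```

Shuffling an index list rather than the data itself keeps each instance paired with its label.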

Discard corrupt or inconsistent values. This can include outliers if they are presumed to be bugs rather than features of the data.
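One way this might look in practice, assuming a single numeric column of ages. The valid range of 0 to 120 and the 1.5 x IQR rule are illustrative assumptions, not rules from the text. The sketch first drops values that are missing or impossible, then flags statistical outliers for review rather than deleting them automatically, since an outlier may be a genuine observation.

```python
import math
import statistics


def drop_invalid(values, lower=0.0, upper=120.0):
    """Remove missing, non-finite, or out-of-range values."""
    return [
        v for v in values
        if v is not None and math.isfinite(v) and lower <= v <= upper
    ]


def iqr_bounds(values, k=1.5):
    """Return (low, high) fences using the interquartile-range rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread


# Hypothetical raw column: None and NaN are missing, -5 is impossible.
ages = [34, 29, None, 41, -5, 38, float("nan"), 36, 31, 119, 33]

valid = drop_invalid(ages)
low, high = iqr_bounds(valid)
flagged = [v for v in valid if not (low <= v <= high)]  # candidates for review
```

Here the value 119 passes the range check but falls outside the IQR fences, so it ends up in flagged. Whether it is a bug or a feature remains a judgment call.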

Start with a prototype dataset before scaling to the full dataset. This reduces computational overhead while you iterate on changes in preparation for full-scale production.
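A prototype dataset can be as simple as a reproducible random subset. The sketch below assumes the full dataset fits in a Python list. The function name prototype_sample and the 5% fraction are hypothetical choices for illustration.

```python
import random


def prototype_sample(rows, fraction=0.1, seed=0):
    """Return a reproducible random subset containing about `fraction` of rows."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    size = min(len(rows), max(1, round(len(rows) * fraction)))
    return random.Random(seed).sample(rows, size)  # sampling without replacement


# Hypothetical stand-in for a large dataset.
full_dataset = list(range(10_000))
prototype = prototype_sample(full_dataset, fraction=0.05, seed=7)
```

Fixing the seed means every iteration runs against the same subset, so a change in results reflects a change in the pipeline rather than in the sample.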

Measure performance against naïve or simple baseline approaches. This often illuminates assumptions made by the inference model.
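As an illustration, one of the simplest baselines for classification always predicts the most frequent training label. The sketch below uses hypothetical labels and helper names. Any trained model would then be compared against the resulting baseline accuracy.

```python
from collections import Counter


def majority_class(train_labels):
    """Return the most frequent label in the training set."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical labels for a spam filter.
train_labels = ["ham"] * 8 + ["spam"] * 2
test_labels = ["ham", "ham", "spam", "ham", "ham"]

baseline_label = majority_class(train_labels)
baseline_predictions = [baseline_label] * len(test_labels)
baseline_accuracy = accuracy(baseline_predictions, test_labels)
```

In this toy case the baseline reaches 80% accuracy without learning anything, so a model scoring near 80% has added little.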

Take precautions against imbalanced data. For instance, a classifier may fail to predict positive and negative outputs accurately if the two classes are represented in significantly different proportions in the dataset.
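A first precaution is simply to measure the imbalance. The sketch below counts the classes and derives inverse-frequency weights, a common heuristic in which each class is weighted by n_samples / (n_classes * class_count). The labels are hypothetical, and weighting is only one possible remedy; resampling is another.

```python
from collections import Counter


def class_weights(labels):
    """Weight each class inversely to its frequency in `labels`."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical binary labels: 95 negatives and 5 positives.
labels = [0] * 95 + [1] * 5

counts = Counter(labels)
imbalance_ratio = max(counts.values()) / min(counts.values())
weights = class_weights(labels)
```

With these labels the minority class receives a weight of 10 and the majority class a weight of roughly 0.53, so errors on rare positives count for more if the weights are passed to a loss function that supports them.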

Minimize overfitting. Cross-validate. Regularize. Evaluate with a confusion matrix.
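Two of these steps can be sketched without any library: splitting indices into k folds for cross-validation, and tallying a binary confusion matrix. The function names and sample labels are hypothetical. Regularization is not shown because it depends on the model being trained. The fold split assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # The first (n_samples % k) folds receive one extra sample.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


def confusion_matrix(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


folds = list(k_fold_indices(n_samples=10, k=3))

# Hypothetical labels and predictions from one validation fold.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
matrix = confusion_matrix(actual, predicted)
```

Each sample appears in exactly one validation fold, so every instance is used for evaluation once. The confusion matrix separates false positives from false negatives, which a single accuracy figure hides.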