
HoloClean:
Weakly Supervised Data Repairing

Post by Theo Rekatsinas, Ihab Ilyas, and Chris Ré


Noisy and erroneous data is a major bottleneck in analytics. Data cleaning and repairing account for about 60% of the work of data scientists. To address this bottleneck, we recently introduced HoloClean, a semi-automated data repairing framework that relies on statistical learning and inference to repair errors in structured data. In HoloClean, we build upon the paradigm of weak supervision and demonstrate how to leverage diverse signals, including user-defined heuristic rules (such as generalized data integrity constraints) and external dictionaries, to repair erroneous data.

HoloClean has three key properties:

1. It detects and repairs diverse errors in structured data, ranging from conflicting and misspelled values to outliers and null entries.
2. It relies only on weak supervision: users specify high-level assertions, such as integrity constraints, instead of labeling individual cells.
3. It scales to large datasets by relaxing complex constraints into features over individual cells, yielding a tractable probabilistic model.

Detecting and Repairing Erroneous Data

HoloClean can fix diverse errors in structured datasets, ranging from conflicting and misspelled values to outliers and null entries.
All too often, the data collected by companies, organizations, and researchers is filled with mistakes, errors, and incomplete values. This is referred to as dirty data, and it can represent a formidable obstacle to downstream applications. Take for example a snippet from the Food Inspections dataset published by the City of Chicago:

[Figure: a snippet of the Food Inspections dataset with erroneous cells highlighted. Errors can vary!]

Errors in this dataset range from misspelled entries (e.g., "Chicago" is spelled "Cicago") and conflicting zip code values for the same address (e.g., "60608" versus "60609") to outlier values for key attributes (e.g., "Johnnyo's" instead of "John Veliotis Sr.").

In HoloClean, we focus on structured datasets such as the one shown above. Our goal is to identify and repair all cells whose initial, observed value is different from their true value, which is unknown. We term these erroneous cells. Given the above, data cleaning is separated into two tasks: (i) error detection, whose goal is to identify erroneous cells, and (ii) data repairing, whose goal is to infer the true value of detected erroneous cells.
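To make the two tasks concrete, below is a minimal sketch of the error-detection step using a single functional dependency (an address determines its zip code). The rows and function are illustrative, not HoloClean's actual API:

```python
from collections import defaultdict

# Toy rows echoing the example above: the same address appears with two
# different zip codes, so at least one Zip cell must be erroneous.
rows = [
    {"Address": "3465 S Morgan ST", "City": "Chicago", "Zip": "60608"},
    {"Address": "3465 S Morgan ST", "City": "Cicago",  "Zip": "60609"},
]

def detect_fd_violations(rows, lhs, rhs):
    """Flag cells that violate the functional dependency lhs -> rhs."""
    groups = defaultdict(set)
    for row in rows:
        groups[row[lhs]].add(row[rhs])
    conflicting = {key for key, values in groups.items() if len(values) > 1}
    # Every rhs cell in a conflicting group is a candidate error; repairing
    # (inferring the true value) is handled by the second task.
    return [(i, rhs) for i, row in enumerate(rows) if row[lhs] in conflicting]

print(detect_fd_violations(rows, "Address", "Zip"))  # [(0, 'Zip'), (1, 'Zip')]
```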

Data cleaning is a statistical learning and inference problem.

HoloClean casts data cleaning as a statistical learning and inference problem. Each cell of an input, dirty dataset is associated with a random variable. That random variable can either have a fixed value if the corresponding cell was not detected to be erroneous, or an unknown value, if the corresponding cell was detected to be erroneous. HoloClean uses random variables with fixed values as training data to learn a probabilistic model for repairing erroneous cells, whose random variables have unknown values.
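As a rough sketch of this setup (the class and function below are ours, not HoloClean's API), every cell maps to a variable that is either fixed or unknown, and the fixed ones supply the training data:

```python
from dataclasses import dataclass, field

@dataclass
class CellVariable:
    tuple_id: int
    attribute: str
    observed_value: str
    is_erroneous: bool   # verdict of the error-detection step
    candidates: list = field(default_factory=list)  # possible repairs

def split_cells(cells):
    """Cells that passed error detection keep their observed value and act
    as labeled training data; flagged cells are the unknowns whose true
    values the learned model must infer."""
    fixed = [c for c in cells if not c.is_erroneous]
    unknown = [c for c in cells if c.is_erroneous]
    return fixed, unknown
```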

Data Cleaning via Weak Supervision

In HoloClean, users only need to specify high-level assertions that capture their domain expertise with respect to invariants that the input data needs to satisfy. No other supervision is required!

How can we train a probabilistic model for data cleaning efficiently? As with any other large-scale machine learning problem, users cannot afford to iterate over all cells in a dataset with millions of tuples to identify erroneous cells and suggest repairs. This is where weak supervision shines!

[Figure: the weak signals HoloClean takes as input.]

HoloClean unifies heterogeneous weak signals that provide evidence on the correct value of structured data to detect and repair errors.
HoloClean leverages a variety of weak signals to address error detection and data repairing: All signals described above are used to automatically generate a probabilistic model for data cleaning:

[Figure: overview of the HoloClean framework.]
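To give a feel for how such heterogeneous signals come together, here is a hedged sketch in which each weak signal contributes one feature for a candidate repair; the signal sources and names are illustrative stand-ins for HoloClean's actual featurizers:

```python
def featurize(cell, candidate, dictionary, cooccur_counts, violates):
    """Describe one candidate repair for one cell via weak signals; learned
    weights later combine these 'votes' into a probability."""
    return {
        # External data: is the candidate a known value for this attribute?
        "in_dictionary": candidate in dictionary.get(cell.attribute, set()),
        # Dataset statistics: how often the candidate co-occurs with the
        # values of the other attributes in the same tuple.
        "cooccurrence": cooccur_counts.get((cell.attribute, candidate), 0),
        # Integrity constraints: would assigning the candidate introduce a
        # violation with respect to the rest of the dataset?
        "violates_constraint": violates(cell, candidate),
    }
```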

Overall, HoloClean is a data cleaning framework that takes as input a dirty dataset, a collection of integrity constraints, and potentially a collection of external data, and forms a probabilistic model for data cleaning. HoloClean builds upon DeepDive, our in-house general-purpose inference engine, to execute learning and inference over its model. For each random variable, HoloClean estimates its maximum a posteriori (MAP) assignment as well as the marginal distribution over the values in its domain. The latter can be used to identify repairs with low confidence and solicit additional user feedback in a principled manner.
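As an illustration of how the marginals can drive that interaction, the sketch below (with an arbitrary confidence threshold, not a value from the paper) picks the MAP repair for each cell and flags uncertain ones for user review:

```python
def propose_repairs(marginals, threshold=0.9):
    """marginals: {cell: {candidate_value: probability}}. Returns the MAP
    repair per cell, flagging low-confidence ones for user feedback."""
    repairs = {}
    for cell, dist in marginals.items():
        value, prob = max(dist.items(), key=lambda kv: kv[1])
        repairs[cell] = (value, prob, prob < threshold)  # True => ask the user
    return repairs

example = {("t2", "Zip"): {"60608": 0.95, "60609": 0.05}}
print(propose_repairs(example))  # {('t2', 'Zip'): ('60608', 0.95, False)}
```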

HoloClean in Practice

In our paper, we evaluate HoloClean over a variety of real-world datasets, including the Food Inspections dataset presented above, and compare it with various state-of-the-art data cleaning methods. All prior methods are designed to use each of the signals presented above in isolation. In contrast, due to the flexibility and extensibility of probabilistic models, HoloClean can combine all signals in a unified framework.

[Figure: F1-score of HoloClean versus state-of-the-art data cleaning methods.]

In our experiments, we find that HoloClean produces data repairs with an average precision of ~90% and an average recall above 76% across a diverse array of datasets exhibiting different types of errors. This yields an average F1-score improvement of more than 2x over state-of-the-art methods.

Scaling Probabilistic Inference

Hard constraints (e.g., integrity constraints) lead to complex and non-scalable repairing models.

The main technical challenge in HoloClean is scaling inference over the probabilistic model used for data cleaning. It is well known that exact inference in the presence of constraints is #P-complete, because inference must consider all possible joint assignments over sets of correlated random variables. For example, consider the dataset shown in the figure below:

[Figure: an example dataset and integrity constraint that correlate the cells of tuples t1 and t3, alongside the relaxed model HoloClean builds for them.]

The user-specified integrity constraint shown in the example introduces a correlation across the four random variables corresponding to the cells of tuples t1 and t3. If we naively encode this correlation by converting the integrity constraint to a first-order logic constraint, we need to enumerate all possible assignments over these four random variables. For complex constraints and data cleaning instances with millions of tuples and random variables with large domains, this naive approach clearly does not scale.
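A back-of-the-envelope count makes the blow-up concrete: grounding a constraint over k correlated random variables, each with a domain of size d, forces the model to reason over d^k joint assignments:

```python
from itertools import product

domain = ["60608", "60609", "60610"]  # candidate values for each cell
k = 4                                 # e.g., the correlated cells of t1 and t3

# Naive grounding enumerates every joint assignment: d**k of them,
# exponential in the number of cells a constraint touches.
joint_assignments = list(product(domain, repeat=k))
print(len(joint_assignments))         # 3**4 = 81, and it only grows from here
```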

HoloClean relaxes constraints over sets of data cells to simple features over individual data cells. This yields a scalable repairing model over independent random variables.

To ensure scalability, HoloClean applies the integrity constraints to the input dataset to identify tuples that provide conflicting information, and then uses those constraints to learn features over the random variables associated with the cells of the conflicting tuples. The final probabilistic model generated by HoloClean corresponds to a voting model over independent random variables that ensures the local consistency of the values assigned to different cells.
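The sketch below is our simplified picture of such a voting model: constraints act only through weighted per-cell features (like those produced by featurize above), so each cell decouples from the rest and repairing reduces to an independent argmax per cell:

```python
def repair_cell(cell, candidates, features, weights):
    """Pick the highest-scoring candidate value for a single cell.
    Because constraint violations enter only as per-cell features, no
    joint assignments need to be enumerated."""
    def score(value):
        return sum(weights[name] * feature
                   for name, feature in features(cell, value).items())
    return max(candidates, key=score)
```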

In our paper, we empirically find that when there is sufficient redundancy in observing the correct value of cells in a dataset, HoloClean's relaxed model obtains more accurate repairs than a model that encodes the constraints directly, and is more robust to misspecification of the domains of its random variables.

Next Steps

A few things we are excited about: