cleanlab 2.1 adds Multi-Annotator Analysis and Outlier Detection: toward a broad framework for Data-Centric AI

09/21/2022

Curtis Northcutt
Jonas Mueller

cleanlab 2.1 is a major step toward a standard open-source framework for Data-Centric AI that engineers and data scientists can use in diverse applications. It extends cleanlab beyond finding label errors in classification data to several new data-centric ML tasks, including multi-annotator analysis, out of distribution detection, and token classification.

CleanLearning, which finds label issues and trains robust ML models on datasets with label errors, now works out of the box with many data formats, including pandas, PyTorch, and TensorFlow datasets. Often in one line of code, CleanLearning enables dozens of data-centric AI workflows with almost any model and data format; an example using HuggingFace Transformers, Keras, and TensorFlow datasets is available here.

cleanlab v2.1 adds multi-annotator analysis, out of distribution detection, token classification, and CleanLearning support for: pandas, pytorch, tensorflow, keras, and many other data formats + models.

Advancing open-source Data-Centric AI:

Two newsworthy aspects of this release:

cleanlab 2.1 is the most effective Python package to analyze multi-annotator (crowdsourcing) data for annotator and label quality (paper forthcoming).
cleanlab has grown quickly over the last year. cleanlab is the first tool that detects data and label issues in most supervised learning datasets, including image, text, audio, and token classification data. cleanlab 2.1 is also useful for other core data-centric tasks such as out of distribution detection, dataset curation, and robust learning with noisy labels.

Major new functionalities added in 2.1:

CROWDLAB algorithms for analyzing data labeled by multiple annotators to:
- Accurately infer the best consensus label for each example in your dataset
- Estimate the quality of each consensus label (how likely it is to be correct)
- Estimate the quality of each annotator (how trustworthy are their suggested labels)
Out of Distribution Detection based on either:
- Feature values/embeddings
- Predicted class probabilities
Label error detection for Token Classification
- Supports NLP tasks like entity recognition
CleanLearning can now:
- Run on non-array data types including pandas DataFrames and PyTorch/TensorFlow Datasets
- Utilize any Keras model (supporting sequential and functional APIs)

Other developer-focused improvements:

Added an FAQ with advice for common questions
Added many additional tutorial and example notebooks at: docs.cleanlab.ai and github.com/cleanlab/examples
Reduced dependencies: e.g. scipy is no longer needed

Code Examples and New Workflows in cleanlab 2.1:

1. Detect out of distribution examples

Detect out of distribution examples in a dataset based on its numeric feature embeddings

from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using feature matrix train_feature_embeddings
ood_train_feature_scores = ood.fit_score(features=train_feature_embeddings)

# To get outlier scores for additional test_data using feature matrix test_feature_embeddings
ood_test_feature_scores = ood.score(features=test_feature_embeddings)
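
To build intuition for feature-based outlier scoring, here is a minimal standalone sketch of one common heuristic: score each example by its average distance to its k nearest neighbors in the training embeddings. This is an illustration under our own assumptions, not a description of what OutOfDistribution computes internally; the function name knn_outlier_scores, the toy embeddings, and the choice k=2 are all hypothetical. In this sketch a larger score means a more atypical example, so check the cleanlab documentation for the direction and scale of its own scores.

```python
import math


def knn_outlier_scores(train, query, k=2, exclude_self=False):
    """Average Euclidean distance from each query point to its k nearest training points.

    Larger values suggest a more atypical point. Generic kNN heuristic only;
    not cleanlab's internal scoring rule.
    """
    scores = []
    for q in query:
        dists = sorted(math.dist(q, t) for t in train)
        if exclude_self:
            dists = dists[1:]  # drop the zero distance of a point to itself
        scores.append(sum(dists[:k]) / k)
    return scores


# Hypothetical 2-D embeddings: a tight cluster of training points.
train_feature_embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
# One test point inside the cluster, one far away from it.
test_feature_embeddings = [(0.05, 0.05), (5.0, 5.0)]

# Score training points against each other, ignoring each point's match with itself.
train_scores = knn_outlier_scores(
    train_feature_embeddings, train_feature_embeddings, exclude_self=True
)
# Score new points against the training set.
test_scores = knn_outlier_scores(train_feature_embeddings, test_feature_embeddings)
```

The far-away test point receives a much larger score than the one inside the cluster, which is the behavior an outlier score should have.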

Detect out of distribution examples in a dataset based on predicted class probabilities from a trained classifier

from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using predicted class probabilities (from a trained classifier) and given class labels
ood_train_predictions_scores = ood.fit_score(pred_probs=train_pred_probs, labels=labels)

# To get outlier scores for additional test_data using predicted class probabilities
ood_test_predictions_scores = ood.score(pred_probs=test_pred_probs)
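
As a rough illustration of how predicted probabilities can signal atypical examples, the sketch below scores each example by the normalized entropy of its predicted class distribution. Entropy is one widely used uncertainty measure; we are not claiming this is exactly how OutOfDistribution scores examples, and the function name entropy_ood_scores and the toy probabilities are hypothetical. Unlike the fit_score call above, this sketch does not use the given labels at all.

```python
import math


def entropy_ood_scores(pred_probs):
    """Normalized entropy of each predicted distribution, in [0, 1].

    Values near 1 mean the classifier's prediction is close to uniform,
    i.e. it is very unsure which class the example belongs to.
    """
    scores = []
    for probs in pred_probs:
        num_classes = len(probs)
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        scores.append(entropy / math.log(num_classes))
    return scores


# Hypothetical predictions for two examples over K=3 classes.
test_pred_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform prediction
]
scores = entropy_ood_scores(test_pred_probs)
```

The nearly uniform prediction gets a score close to 1, while the confident prediction gets a score close to 0.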

2. Multi-annotator data

For data labeled by multiple annotators (stored as matrix multiannotator_labels whose rows correspond to examples, columns to each annotator’s chosen labels), cleanlab 2.1 can: find improved consensus labels, score their quality, and assess annotators, all by leveraging predicted class probabilities pred_probs from any trained classifier.

from cleanlab.multiannotator import get_label_quality_multiannotator

get_label_quality_multiannotator(multiannotator_labels, pred_probs)
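
For comparison, here is a minimal sketch of the simplest baseline for this kind of data: majority-vote consensus, plus each annotator's rate of agreement with that consensus. It ignores pred_probs entirely, which is exactly the extra information cleanlab's multi-annotator methods exploit, so treat it as a reference point rather than a stand-in for CROWDLAB. The function names, the use of None for missing labels, and the toy label matrix are all assumptions made for this illustration.

```python
from collections import Counter


def majority_vote(multiannotator_labels):
    """Return (consensus_labels, agreement) for a list of per-example label rows.

    None marks an example that an annotator did not label.
    """
    consensus, agreement = [], []
    for row in multiannotator_labels:
        votes = [label for label in row if label is not None]
        label, count = Counter(votes).most_common(1)[0]
        consensus.append(label)
        agreement.append(count / len(votes))  # fraction of votes for the winner
    return consensus, agreement


def annotator_agreement(multiannotator_labels, consensus):
    """Fraction of each annotator's labels that match the consensus label.

    Assumes every annotator labeled at least one example.
    """
    num_annotators = len(multiannotator_labels[0])
    rates = []
    for a in range(num_annotators):
        pairs = [
            (row[a], c)
            for row, c in zip(multiannotator_labels, consensus)
            if row[a] is not None
        ]
        rates.append(sum(given == c for given, c in pairs) / len(pairs))
    return rates


# Hypothetical labels: rows are examples, columns are annotators.
multiannotator_labels = [
    [0, 0, 1],
    [1, 1, None],
    [2, 2, 2],
    [0, None, 0],
]
consensus_labels, consensus_agreement = majority_vote(multiannotator_labels)
annotator_rates = annotator_agreement(multiannotator_labels, consensus_labels)
```

In this toy data the third annotator disagrees with the consensus on the first example, so their agreement rate is lower than that of the other two.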

3. Entity Recognition and Token Classification

cleanlab 2.1 can now find label issues in token classification (text) data, where each word in a sentence is labeled with one of K classes (e.g. entity recognition).

from cleanlab.token_classification.filter import find_label_issues
from cleanlab.token_classification.summary import display_issues

issues = find_label_issues(per_token_labels, per_token_pred_probs)
display_issues(issues, tokens, pred_probs=per_token_pred_probs, given_labels=per_token_labels,
               class_names=optional_list_of_ordered_class_names)

Example inputs (for dataset with K=2 classes) might look like this:

tokens = [..., ["I", "love", "cleanlab"], ...]
per_token_labels = [..., [1, 0, 0], ...]
per_token_pred_probs = [..., np.array([[0.8,0.2], [0.9,0.1], [0.3,0.7]]), ...]  # predictions from model
optional_list_of_ordered_class_names = ["not-person", "person"]

Running this code on the CoNLL-2003 named entity recognition dataset uncovers many label errors, such as the following sentence:

Little change from today’s weather expected.

where Little is wrongly labeled as a PERSON entity in CoNLL.
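
To make the per-token input format concrete, the sketch below flags tokens whose given label receives low predicted probability from the model. This is a simplified heuristic with a fixed threshold chosen for illustration; find_label_issues applies its own criteria, and the function name low_confidence_tokens and the threshold of 0.5 are hypothetical. Plain nested lists stand in for the numpy arrays shown in the example inputs.

```python
def low_confidence_tokens(tokens, per_token_labels, per_token_pred_probs, threshold=0.5):
    """Flag tokens whose given label has predicted probability below threshold.

    Returns (sentence_index, token_index, token, given_label, prob) tuples,
    sorted so the least-supported labels come first.
    """
    flagged = []
    sentences = zip(tokens, per_token_labels, per_token_pred_probs)
    for i, (sentence, labels, probs) in enumerate(sentences):
        for j, (token, label, token_probs) in enumerate(zip(sentence, labels, probs)):
            prob_of_given_label = token_probs[label]
            if prob_of_given_label < threshold:
                flagged.append((i, j, token, label, prob_of_given_label))
    return sorted(flagged, key=lambda item: item[-1])


# One sentence from the example inputs, with K=2 classes.
tokens = [["I", "love", "cleanlab"]]
per_token_labels = [[1, 0, 0]]
per_token_pred_probs = [[[0.8, 0.2], [0.9, 0.1], [0.3, 0.7]]]

suspects = low_confidence_tokens(tokens, per_token_labels, per_token_pred_probs)
```

Here "I" and "cleanlab" are flagged because the model assigns probabilities of only 0.2 and 0.3 to their given labels, while "love" is not flagged.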

4. CleanLearning can now operate directly on non-array dataset formats like tensorflow/pytorch datasets and use Keras models

import numpy as np
import tensorflow as tf
from cleanlab.classification import CleanLearning
from cleanlab.experimental.keras import KerasWrapperModel

dataset = tf.data.Dataset.from_tensor_slices((features_np_array, labels_np_array))  # example tensorflow dataset created from numpy arrays
dataset = dataset.shuffle(buffer_size=len(features_np_array)).batch(32)

def make_model(num_features, num_classes):
    inputs = tf.keras.Input(shape=(num_features,))
    outputs = tf.keras.layers.Dense(num_classes)(inputs)
    return tf.keras.Model(inputs=inputs, outputs=outputs, name="my_keras_model")

model = KerasWrapperModel(make_model, model_kwargs={"num_features": features_np_array.shape[1], "num_classes": len(np.unique(labels_np_array))})
cl = CleanLearning(model)
cl.fit(dataset, labels_np_array)  # variant of model.fit() that is more robust to noisy labels
robust_predictions = cl.predict(dataset)  # equivalent to model.predict() after training on cleaner data

More details in the official release notes.

Beyond cleanlab v2.1

While cleanlab 2.1 finds data issues, an interface is needed to efficiently fix these issues in your dataset. Cleanlab Studio finds and fixes errors automatically in a (very cool) no-code platform. Export your corrected dataset in a single click to train better ML models on better data.

Try Cleanlab Studio at https://app.cleanlab.ai/.

Learn more about Cleanlab

How Google, Wells Fargo, and others use Cleanlab.
Step-by-step tutorials to find issues in your data and train robust ML models:
- Image | Text | Audio | Outliers | Dataset Curation | Multi-annotator Data
Ways to try out Cleanlab:
- Open-source: GitHub
- No-code, automatic platform (easy mode): Cleanlab Studio
- Learn how Cleanlab works: Cleanlab Vizzy
Documentation | Blogs | Research Publications | Cleanlab History | Team

Join our community of scientists and engineers to help build the future of open-source Data-Centric AI: Cleanlab Slack Community

Contributors

A big thank you to the data-centric jedi who contributed code for cleanlab 2.1 (in no particular order): Aravind Putrevu, Jonas Mueller, Anish Athalye, Johnson Kuan, Wei Jing Lok, Caleb Chiam, Hui Wen Goh, Ulyana Tkachenko, Curtis Northcutt, Rushi Chaudhari, Elías Snorrason, Shuangchi He, Eric Wang, and Mattia Sangermano.

We thank the individuals who contributed bug reports or feature requests. If you’re interested in contributing to cleanlab, check out our contributing guide!