Blog
Company updates, tutorials, research, and more!
Detecting Errors in Numerical Data via any Regression Model
09/18/2023
New algorithms to identify values in a numerical data column that are likely incorrect (e.g. due to noise from erroneous sensors, data entry/processing mistakes, imperfect human estimates).
- Jonas Mueller
- Mayank Kumar
- Hui Wen Goh
- Hang Zhou
The Future of Relevance Determination: Leveraging AI for Enhanced E-Discovery
08/03/2023
Use AI software to automatically identify mis-categorized legal documents and provide more accurate relevance determination.
- Chris Mauck
Most AI & Analytics are impaired by data issues. Now AI can help you fix them.
07/31/2023
Data is the fuel for AI (and Analytics), but it is messy in real enterprise applications. Here’s how to use AI to refine that data, allowing your company to build a Data Engine as powerful as those at the heart of today’s biggest tech companies.
- Jonas Mueller
- Curtis Northcutt
- Anish Athalye
Automated Data Quality at Scale
07/27/2023
A fully-automated analysis of errors in the ImageNet training set.
- Anish Athalye
- Angela Liu
How To Train and Deploy Reliable Models on Messy Real-World Data With a Few Clicks
07/24/2023
Introducing an entirely automated solution that trains cutting-edge ML models on raw data, uses these models to detect various issues in the data, corrects these issues, trains better models on the improved data, and deploys them to serve reliable predictions in applications.
- Hui Wen Goh
- Jonas Mueller
- Anish Athalye
Letter from the CEO: Announcing Our Seed Funding and the Launch of Cleanlab Studio for Enterprise
07/20/2023
Cleanlab Studio for Enterprise launches to automate data curation for LLMs and the modern AI stack with $5 million in seed funding from Bain Capital Ventures.
- Curtis Northcutt
Assessing the Quality of Synthetic Data with Cleanlab Studio
07/12/2023
Use AI to measure the quality of LLM-generated data, automatically detecting unrealistic synthetic examples and underrepresented tails of the real data distribution.
- Elías Snorrason
Enhancing Product Analytics and E-commerce with Cleanlab Studio
07/06/2023
Using AI to analyze product listings for errors, and how this boosts the accuracy of product categorization and analytics efforts.
- Sanjana Garg
Beware of Unreliable Data in Model Evaluation: A LLM Prompt Selection case study with Flan-T5
06/29/2023
You may choose suboptimal prompts for your LLM (or make other suboptimal choices via model evaluation) unless you clean your test data.
- Chris Mauck
- Jonas Mueller
Improving Legal Judgement Prediction with Cleanlab Studio
06/27/2023
A legal sector case study using Cleanlab Studio to produce better models for making predictions (e.g. of final judgements) based on court case documents.
- Hui Wen Goh
Improving any OpenAI Language Model by Systematically Improving its Data
06/01/2023
Reduce LLM prediction error by 37% via data-centric AI.
- Chris Mauck
- Jonas Mueller
Datalab: A Linter for ML Datasets
05/16/2023
Catch issues in your data/labels. This unified audit uses your ML model to automatically detect various problems in real-world datasets that can be fixed to produce a better model.
- Elías Snorrason
- Sanjana Garg
- Hui Wen Goh
- Jesse Cummings
- Jonas Mueller
CleanVision: Audit your Image Data for better Computer Vision
03/22/2023
Introducing an open-source Python package to automatically identify common issues in image datasets.
- Sanjana Garg
- Ulyana Tkachenko
- Yiming Chen
- Elías Snorrason
- Jonas Mueller
ActiveLab: Active Learning with Data Re-Labeling
03/02/2023
ActiveLab helps you optimally choose which data to (re)label, lowering the cost to train an accurate ML model.
- Hui Wen Goh
- Jonas Mueller
Cleanlab: The History, Present, and Future
04/01/2022
How an MIT grad student project became a company with tech used by Google, Amazon, Tesla, Uber, Facebook, and companies around the world.
- Curtis Northcutt
Cleanlab Studio: Issues Found in Popular Datasets
The Cleanlab Studio Audit uses AI to auto-detect problems in given data. Explore all sorts of issues found in popular datasets by Cleanlab Studio!
Detecting Dataset Drift and Non-IID Sampling: A k-Nearest Neighbors approach that works for Image/Text/Audio/Numeric Data
05/30/2023
A simple method to determine if a dataset violates the IID assumption in common ways (e.g. temporal drift, or interaction between almost adjacent datapoints).
- Jesse Cummings
- Elías Snorrason
- Jonas Mueller
Effectively Annotate Text Data for Transformers via Active Learning + Re-labeling
05/22/2023
Use ActiveLab to efficiently choose which data to (re)label to train the best Transformer model.
- Chris Mauck
Training Transformer Networks in Scikit-Learn?!
03/08/2023
Learn how to easily make any TensorFlow/Keras model compatible with scikit-learn.
- Hui Wen Goh
cleanlab 2.3 adds support for Active Learning, Tensorflow/Keras models made sklearn-compatible, and highly scalable Label Error Detection
03/01/2023
Highlighting what's new in cleanlab 2.3.
- Jonas Mueller
Handling Mislabeled Tabular Data to Improve Your XGBoost Model
02/06/2023
Learn how to reduce prediction errors by 70% using data-centric techniques with cleanlab.
- Chris Mauck
Automatic Error Detection for Image/Text Tagging and Multi-label Datasets
11/29/2022
Introducing new data quality algorithms for multi-label classification in cleanlab v2.2.
- Aditya Thyagarajan
- Elías Snorrason
- Curtis Northcutt
- Jonas Mueller
Out-of-Distribution Detection via Embeddings or Predictions
10/19/2022
Introducing cleanlab's two new methods to detect outliers and how they perform on real image data.
- Ulyana Tkachenko
- Jonas Mueller
A Simple Adjustment Improves Out-of-Distribution Detection for Any Classifier
10/19/2022
Exploring new ways to identify outliers based on probabilistic predictions from a trained classifier.
- Ulyana Tkachenko
- Jonas Mueller
- Curtis Northcutt
Detecting Label Errors in Entity Recognition Data
10/12/2022
Understanding cleanlab's new methods for text-based token classification tasks.
- Wei-Chen (Eric) Wang
- Elías Snorrason
- Jonas Mueller
CROWDLAB: Simple and effective algorithms to handle data labeled by multiple annotators
10/05/2022
Understanding cleanlab's new methods for multi-annotator data and what makes them effective.
- Hui Wen Goh
- Ulyana Tkachenko
- Jonas Mueller
cleanlab 2.1 adds Multi-Annotator Analysis and Outlier Detection: toward a broad framework for Data-Centric AI
09/21/2022
Highlighting new features available in cleanlab 2.1.
- Curtis Northcutt
- Jonas Mueller
How we built Cleanlab Vizzy
08/17/2022
How we built an in-browser visualization of Cleanlab's Confident Learning algorithm.
- Caleb Chiam
- Luke Mainwaring
- Yiming Chen
Handling Label Errors in Text Classification Datasets
05/10/2022
Learn how to find label issues in text datasets and improve NLP models.
- Wei Jing Lok
- Jonas Mueller
- Hui Wen Goh
Finding Label Issues in Audio Classification Datasets
04/27/2022
Learn how to find label issues in any audio classification dataset.
- Johnson Kuan
- Jonas Mueller
- Anish Athalye
Finding Label Issues in Image Classification Datasets
04/21/2022
Learn how to automatically find label issues in any image classification dataset.
- Wei Jing Lok
- Jonas Mueller
cleanlab 2.0: Automatically Find Errors in ML Datasets
04/21/2022
Announcing cleanlab 2.0: an open-source framework for machine learning and analytics with messy, real-world data.
- Curtis Northcutt
- Jonas Mueller
- Anish Athalye