Blog

Company updates, tutorials, research, and more!

Detecting Errors in Numerical Data via any Regression Model

09/18/2023

New algorithms to identify values in a numerical data column that are likely incorrect (e.g. due to noise from erroneous sensors, data entry/processing mistakes, or imperfect human estimates).

Jonas Mueller
Mayank Kumar
Hui Wen Goh
Hang Zhou

The Future of Relevance Determination: Leveraging AI for Enhanced E-Discovery

08/03/2023

Use AI software to automatically identify mis-categorized legal documents and provide more accurate relevance determination.

Chris Mauck

Most AI & Analytics are impaired by data issues. Now AI can help you fix them.

07/31/2023

Data is the fuel for AI (and Analytics), but it is messy in real enterprise applications. Here’s how to use AI to refine that data as well, allowing your company to build a Data Engine as powerful as those at the heart of today’s biggest tech companies.

Jonas Mueller
Curtis Northcutt
Anish Athalye

Automated Data Quality at Scale

07/27/2023

A fully-automated analysis of errors in the ImageNet training set.

Anish Athalye
Angela Liu

How To Train and Deploy Reliable Models on Messy Real-World Data With a Few Clicks

07/24/2023

Introducing a fully automated solution that trains cutting-edge ML models on raw data, uses these models to detect various issues in the data, corrects these issues, trains better models on the improved data, and deploys them to serve reliable predictions in applications.

Hui Wen Goh
Jonas Mueller
Anish Athalye

Letter from the CEO: Announcing Our Seed Funding and the Launch of Cleanlab Studio for Enterprise

07/20/2023

Cleanlab Studio for Enterprise launches to automate data curation for LLMs and the modern AI stack with $5 million in seed funding from Bain Capital Ventures.

Curtis Northcutt

Assessing the Quality of Synthetic Data with Cleanlab Studio

07/12/2023

Use AI to measure the quality of LLM-generated data, automatically detecting unrealistic synthetic examples and underrepresented tails of the real data distribution.

Elías Snorrason

Enhancing Product Analytics and E-commerce with Cleanlab Studio

07/06/2023

Using AI to analyze product listings for errors, and how this boosts the accuracy of product categorization and analytics efforts.

Sanjana Garg

Beware of Unreliable Data in Model Evaluation: An LLM Prompt Selection case study with Flan-T5

06/29/2023

You may choose suboptimal prompts for your LLM (or make other suboptimal choices via model evaluation) unless you clean your test data.

Chris Mauck
Jonas Mueller

Improving Legal Judgement Prediction with Cleanlab Studio

06/27/2023

A legal sector case study using Cleanlab Studio to produce better models for making predictions (e.g. of final judgements) based on court case documents.

Hui Wen Goh

Improving any OpenAI Language Model by Systematically Improving its Data

06/01/2023

Reduce LLM prediction error by 37% via data-centric AI.

Chris Mauck
Jonas Mueller

Datalab: A Linter for ML Datasets

05/16/2023

Catch issues in your data/labels. This unified audit uses your ML model to automatically detect various problems in real-world datasets that can be fixed to produce a better model.

Elías Snorrason
Sanjana Garg
Hui Wen Goh
Jesse Cummings
Jonas Mueller

CleanVision: Audit your Image Data for better Computer Vision

03/22/2023

Introducing an open-source Python package to automatically identify common issues in image datasets.

Sanjana Garg
Ulyana Tkachenko
Yiming Chen
Elías Snorrason
Jonas Mueller

ActiveLab: Active Learning with Data Re-Labeling

03/02/2023

ActiveLab helps you optimally choose which data to (re)label, lowering the cost to train an accurate ML model.

Hui Wen Goh
Jonas Mueller

Cleanlab: The History, Present, and Future

04/01/2022

How an MIT grad student project became a company with tech used by Google, Amazon, Tesla, Uber, Facebook, and companies around the world.

Curtis Northcutt

Cleanlab Studio: Issues Found in Popular Datasets

The Cleanlab Studio Audit uses AI to auto-detect problems in given data. Explore all sorts of issues found in popular datasets by Cleanlab Studio!

Detecting Dataset Drift and Non-IID Sampling: A k-Nearest Neighbors approach that works for Image/Text/Audio/Numeric Data

05/30/2023

A simple method to determine if a dataset violates the IID assumption in common ways (e.g. temporal drift, or interaction between almost adjacent datapoints).