Using Cleanlab Studio to Audit Public Datasets

The Cleanlab Studio Audit (CSA) uses AI to auto-detect problems in a given dataset. Like a public service announcement (PSA), the CSA is a recurring series that informs the community about issues in popular datasets, all found and corrected automatically with Cleanlab Studio.

Cleanlab Studio can just as easily help you improve your own image, text, or tabular/CSV/Excel dataset. Try it now!

If you find interesting issues in any dataset, they can be featured here! Just fill out this form.

The Fashion MNIST Dataset (cited in 2,200+ papers) Contains Hundreds of Miscategorized Items

06/09/2023

The Cleanlab Studio Audit uses AI to auto-detect problems in a given dataset. Here we report issues found in the famous Fashion MNIST image classification dataset. Such issues can impair product identification and other business intelligence efforts.

Ganesh Tata
Chris Mauck

The Stanford Cars Dataset, aka Cars196 (cited in 1,000+ papers), Contains Many Fine-Grained Errors

05/24/2023

The Cleanlab Studio Audit uses AI to auto-detect problems in a given dataset. Here we report issues found in the Stanford Cars196 image classification dataset. Such issues can impair product categorization, product identification, and other business intelligence efforts.

Chris Mauck

The Office-Home Dataset (cited in 600+ papers) Contains Hundreds of Incorrect Labels and Outliers

04/21/2023

The Cleanlab Studio Audit uses AI to auto-detect problems in a given dataset. Here we report findings for the Office-Home image classification dataset.

Chris Mauck
Jonas Mueller

Use Cleanlab to Improve LLMs: Find Errors in Human Feedback in the Anthropic RLHF Dataset

04/11/2023

The Cleanlab Studio Audit uses AI to auto-detect problems in a given dataset. Here we report findings for a popular Reinforcement Learning from Human Feedback (RLHF) dataset.

Chris Mauck
Jonas Mueller

View more errors detected with Cleanlab in famous ML benchmark datasets at labelerrors.com