Using Cleanlab Studio to Audit Public Datasets

The Cleanlab Studio Audit (CSA) uses AI to automatically detect problems in a given dataset. Like a PSA (public service announcement), the CSA is a recurring series that informs the community about issues in popular datasets, all found and corrected automatically with Cleanlab Studio.

Cleanlab Studio can just as easily help you improve your own image, text, or tabular/CSV/Excel dataset. Try it now!

If you find interesting issues in any dataset, they can be featured here! Just fill out this form.

The Fashion MNIST Dataset (cited in 2,200+ papers) contains Hundreds of Miscategorized Items

06/09/2023

The Cleanlab Studio Audit uses AI to auto-detect problems in given data. Here we report issues found in the famous Fashion MNIST image classification dataset, which can impair product identification and other business intelligence efforts.

  • Ganesh Tata
  • Chris Mauck
The Stanford Cars Dataset aka Cars196 (cited in 1000+ papers) contains many Fine-Grained Errors

05/24/2023

The Cleanlab Studio Audit uses AI to auto-detect problems in given data. Here we report issues found in the Stanford Cars196 image classification dataset, which can impair product categorization, product identification, and other business intelligence efforts.

  • Chris Mauck
The Office-Home Dataset (cited by 600+ papers) contains hundreds of incorrect labels and outliers.

04/21/2023

The Cleanlab Studio Audit uses AI to auto-detect problems in given data -- here we report findings for the Office-Home image classification dataset.

  • Chris Mauck
  • Jonas Mueller
Use Cleanlab to Improve LLMs: Find Errors in Human Feedback in the Anthropic RLHF Dataset

04/11/2023

The Cleanlab Studio Audit uses AI to auto-detect problems in given data -- here we report findings for a popular Reinforcement Learning from Human Feedback dataset.

  • Chris Mauck
  • Jonas Mueller


View more errors detected with Cleanlab in famous ML benchmark datasets at labelerrors.com