About the Dataset

RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) is a large-scale document image classification dataset containing 400,000 grayscale images across 16 categories, with 25,000 images per category:

advertisement, budget, email, file folder, form, handwritten, invoice, letter, memo, news article, presentation, questionnaire, resume, scientific publication, scientific report, specification

The images are split into training (320,000), validation (40,000), and test (40,000) sets.

This tool hosts the full dataset on S3 and provides a browser for viewing images and reviewing label quality. Where the original label was found to be incorrect or ambiguous, some images have been annotated with a correction.


Using the Browser

Filters

Use the dropdown menus at the top of the browse page to narrow results:

  • Split — Show only train, val, or test images.
  • Original Label — Filter by the label assigned in the original dataset.
  • Correctness — Show images marked as correct, incorrect, or unknown.
  • Mixed — Show images that contain content from multiple categories.
  • Mixed With / Fixed Label — Further filter by the secondary category or corrected label.

Click Apply to update results. Filters combine with AND logic, so an image is shown only if it matches every filter you have selected.
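
As a rough illustration of what AND-combined filters mean, the sketch below treats each image as a record and keeps only those that match every selected filter. The field names (`split`, `original_label`, `correctness`, `mixed`), the sample records, and the `apply_filters` helper are hypothetical and chosen to mirror the dropdowns; they are not the tool's actual schema or API.

```python
from typing import Any

# Hypothetical records; field names mirror the dropdowns, not the tool's real schema.
IMAGES: list[dict[str, Any]] = [
    {"id": 1, "split": "train", "original_label": "memo", "correctness": "correct", "mixed": False},
    {"id": 2, "split": "train", "original_label": "memo", "correctness": "incorrect", "mixed": False},
    {"id": 3, "split": "test", "original_label": "letter", "correctness": "incorrect", "mixed": True},
]


def apply_filters(records: list[dict[str, Any]], filters: dict[str, Any]) -> list[dict[str, Any]]:
    """Keep records that match every active filter (AND logic).

    A filter whose value is None is treated as "not selected" and ignored.
    """
    active = {key: value for key, value in filters.items() if value is not None}
    return [r for r in records if all(r.get(key) == value for key, value in active.items())]


# Split = train AND Correctness = incorrect; the Mixed dropdown is left unset.
result = apply_filters(IMAGES, {"split": "train", "correctness": "incorrect", "mixed": None})
print([r["id"] for r in result])
```

With no filters selected, every record passes, which corresponds to browsing the unfiltered dataset.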

Viewing Images

Click any thumbnail to open a detail view showing the full-size image and its metadata (label, split, correctness, notes). If editing is enabled, you can update the correctness flag, fixed label, and mixed status from this view.

Pagination

Results are shown 48 per page. Use the page controls at the bottom to navigate.
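
To give a sense of how many pages a result set spans, here is a minimal sketch of the arithmetic behind 48-per-page pagination. It assumes pages are numbered from 1; the `page_count` and `page_bounds` helpers are illustrative names, not part of the tool.

```python
import math

PAGE_SIZE = 48  # results per page, as stated above


def page_count(total: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages needed for `total` results (at least one page)."""
    return max(1, math.ceil(total / page_size))


def page_bounds(page: int, total: int, page_size: int = PAGE_SIZE) -> tuple[int, int]:
    """Return the half-open [start, end) result index range for a 1-indexed page."""
    last_page = page_count(total, page_size)
    if not 1 <= page <= last_page:
        raise ValueError(f"page must be between 1 and {last_page}")
    start = (page - 1) * page_size
    return start, min(start + page_size, total)


# Example: browsing the full 40,000-image test split with no other filters.
print(page_count(40_000))
print(page_bounds(1, 40_000))
```

The last page is usually shorter than 48 results, which is why the end index is capped at the total.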