RVL-CDIP Model Hyperparameters

Training hyperparameters for the models reported in the paper. For models trained by us, two variants exist: normal (original RVL-CDIP labels) and fixed (label-cleaned RVL-CDIP). Models marked "pretrained" are evaluated using the published authors' fine-tuned checkpoint as-is.

CNNs: AlexNet, GoogLeNet, ResNet-50, ResNeXt-50, VGG-16

ImageNet-pretrained backbones from torch.hub (pytorch/vision:v0.10.0). Final classification layer replaced with Linear(*, 16).

Setting Value
Optimizer SGD, lr=1e-3, momentum=0.9
LR schedule LambdaLR, λ(epoch) = (1 − epoch/60)^0.5
Loss CrossEntropyLoss
Epochs 60
Batch size 64
Image preproc Resize 224×224, normalize (mean=0.9199, std=0.1853 — RVL-CDIP grayscale stats), grayscale replicated to 3 channels

BERT

Setting Value
HF model ID google-bert/bert-base-uncased
Optimizer AdamW (HF Trainer defaults: lr=5e-5, weight_decay=0.0)
LR schedule linear with no warmup (HF Trainer default)
Loss CrossEntropyLoss
Epochs 50
Batch size 64
Max token length 512 (truncate + pad)
Input per-document Textract OCR text

RoBERTa

Setting Value
HF model ID FacebookAI/roberta-base
Optimizer AdamW (HF Trainer defaults: lr=5e-5, weight_decay=0.0)
LR schedule linear with no warmup (HF Trainer default)
Loss CrossEntropyLoss
Epochs 50
Batch size 64
Max token length 512 (truncate + pad)
Input per-document Textract OCR text

BigBird

Both normal and fixed variants use the same hyperparameters; only the train/val/test splits differ.

Setting Value
HF model ID google/bigbird-roberta-base
Optimizer AdamW (HF Trainer default)
LR schedule linear with warmup
Loss CrossEntropyLoss
Learning rate 3e-5
Warmup ratio 0.1
Weight decay 0.01
Epochs 5
Per-device batch size 4
Gradient accumulation 8 (effective batch 64, 2× A6000)
Precision bf16
Attention type block_sparse, block_size=64, num_random_blocks=3
Max token length 2048 (truncate + pad to max_length)
Seed 42
Best-model selection eval_accuracy, loaded at end
Input per-document Textract OCR text

Longformer

Trained with HF Trainer. The normal variant uses the original RVL-CDIP labels (bf16). The fixed variant is a fp32 fix-forward warm-started from the normal checkpoint after a bf16 single-step overflow killed the from-scratch fixed-labels run.

Setting Normal Fixed (fp32 fix-forward)
HF model ID allenai/longformer-base-4096 allenai/longformer-base-4096
Optimizer AdamW (HF Trainer default) AdamW (HF Trainer default)
LR schedule linear with warmup linear with warmup
Loss CrossEntropyLoss CrossEntropyLoss
Learning rate 3e-5 1e-5
Warmup ratio 0.1 0.05
Weight decay 0.01 0.01
Epochs 5 2 (continuing from normal checkpoint)
Per-device batch size 16 8
Gradient accumulation 2 (effective batch 32) 4 (effective batch 32)
Precision bf16 fp32
Max token length 2048 (covers 99.9% of docs) 2048
Seed 42 42
Best-model selection eval_accuracy, loaded at end eval_accuracy, loaded at end
Input per-document Textract OCR text per-document Textract OCR text

LayoutLM v1

Setting Value
HF model ID microsoft/layoutlm-base-uncased
Optimizer AdamW, lr=5e-5, grad-norm clip 1.0
LR schedule linear with warmup, warmup_ratio=0.1
Loss CrossEntropyLoss
Epochs 5
Batch size 16
Max sequence length 512
Input Textract OCR tokens + bboxes (normalized to 0–1000)

LayoutLMv3

Setting Value
HF model ID microsoft/layoutlmv3-base
Optimizer Adam, lr=1e-5
LR schedule none
Loss CrossEntropyLoss
Epochs 10
Batch size 16
Precision fp16
Input Textract OCR tokens + bboxes + RGB image (processor with apply_ocr=False)

DocXClassifier-B (pretrained)

Authors' published checkpoint, no in-house fine-tuning.

Setting Value
Source RvlCdip_docxclassifier_base.pt (DocXClassifier release)
Backbone ConvNeXt-base
Image preproc CV2 resize, ImageNet mean/std, grayscale replicated to 3 channels

DiT-base (pretrained)

Authors' published checkpoint, no in-house fine-tuning.

Setting Value
HF model ID microsoft/dit-base-finetuned-rvlcdip
Architecture BEiT, 86M params
Image preproc BeitImageProcessor defaults (224×224, ImageNet mean/std)

DiT-large (pretrained)

Authors' published checkpoint, no in-house fine-tuning.

Setting Value
HF model ID microsoft/dit-large-finetuned-rvlcdip
Architecture BEiT, 304M params
Image preproc BeitImageProcessor defaults (224×224, ImageNet mean/std)

Donut (pretrained)

Authors' published checkpoint, no in-house fine-tuning. OCR-free; class parsed from the autoregressive decode.

Setting Value
HF model ID naver-clova-ix/donut-base-finetuned-rvlcdip
Architecture Swin encoder + BART decoder
Image preproc DonutProcessor defaults
Decoding greedy autoregressive

LLM Zero-Shot Classifiers (Bedrock API)

All LLM zero-shot results are produced by querying models through the AWS Bedrock converse API (bedrock-runtime, us-east-1). Input is per-document Textract OCR text prepended to a prompt; the model returns a JSON prediction.

Model key Bedrock model ID
maverick us.meta.llama4-maverick-17b-instruct-v1:0
mistral-7b mistral.mistral-7b-instruct-v0:2
openai-20 openai.gpt-oss-20b-1:0
openai-120 openai.gpt-oss-120b-1:0
qwen qwen.qwen3-32b-v1:0
Inference setting Value
temperature 0.0
topP 0.9
maxTokens 8192

Prompts

Three prompt variants were tested. The paper reports results using Prompt 1 only; analysis of Prompts 2 and 3 is left to future work.

Prompt 1 — bare category list, no descriptions:

You are a document categorization system.

Classify the text of a document into one of the following categories:

    1. advertisement
    2. budget
    3. email
    4. file_folder
    5. form
    6. handwritten
    7. invoice
    8. letter
    9. memo
    10. news article
    11. presentation
    12. questionnaire
    13. resume
    14. scientific publication
    15. scientific report
    16. specification

Respond only with the category name. Do not provide an explanation.

Respond with the following JSON format:
{"prediction": <category name>}

Here is the document text to classify:

Prompt 2 — categories with domain-specific descriptions derived from the RVL-CDIP label definitions:

You are a document categorization system.

Classify the text of a document into one of the following categories:

    1. advertisement: advertisements from print-form media like newspapers, magazines, and radio/television scripts.
    2. budget: various budget documents such as expense, spending, sales, cash, and accounting reports and forecasts; campaign contribution requests; checks and check stubs.
    3. email: printed emails.
    4. file_folder: folders and binders. These will typically have little text.
    5. form: form documents with form-like elements. These include fax forms.
    6. handwritten: handwriten documents.
    7. invoice: invoices, bills, quotations, and estimates.
    8. letter: letters, often with letterhead and "Dear..." salutations.
    9. memo: memo or memoranda documents or inter-office correspondence documents, often with clear "TO", "FROM" headings.
    10. news_article: news articles in the form of clippings from newspapers and other print-form news media.
    11. presentation: presentation and overhead slides, transcripts of speeches and statements, and press releases.
    12. questionnaire: customer surveys and questionnaires, survey prompts.
    13. resume: resumes, curricula vitae (CVs), biographical sketches, executive biographies, business cards.
    14. scientific_publication: scientific publications or articles form scientific journals and book chapters; includes book title pages.
    15. scientific_report: scientific reports like bioassay, pathology, test reports, charts, graphs, tables, research progress repors, research proposals.
    16. specification: specifications like data sheets (including material safety data sheets); product, material, and test specifications.

Respond only with the category name. Do not provide an explanation.

Respond with the following JSON format:
{"prediction": <category name>}

Here is the document text to classify:

Prompt 3 — categories with concise dictionary-style definitions:

You are a document categorization system.

Classify the text of a document into one of the following categories:

    1. advertisement: a notice or announcement in a public medium promoting a product.
    2. budget: an estimate, often itemized, of expected income and expense for a given period in the future.
    3. email: a message sent by email.
    4. file_folder: a folded sheet of light cardboard used to cover or hold documents.
    5. form: a document with blank spaces to be filled in with particulars before it is executed.
    6. handwritten: handwritten documents.
    7. invoice: an itemized bill for goods sold or services provided, containing individual prices, the total charge, and the terms.
    8. letter: a written or printed communication addressed to a person or organization and usually transmitted by mail.
    9. memo: an informal message, especially one sent between two or more employees of the same company, concerning company business.
    10: news_article: the presentation of a report on recent or new events in a newspaper or other periodical or on radio or television.
    11. presentation: a document for a formal talk given to an audience to share information, persuade, inspire, or demonstrate something, often using visuals like slides to support the spoken message.
    12. questionnaire: a list of questions, usually printed, submitted for replies that can be analyzed for usable information.
    13. resume: a brief written account of personal, educational, and professional qualifications and experience, as that prepared by an applicant for a job.
    14. scientific_publication: periodical that shares new scientific research and knowledge with the scientific community and the public.
    15. scientific_report: document that details a scientific experiment or research project, presenting the methodology, data, results, and conclusions to inform readers and allow for reproduction of the study.
    16. specification: a detailed description or assessment of requirements, dimensions, materials, etc.

Respond only with the category name. Do not provide an explanation.

Respond with the following JSON format:
{"prediction": <category name>}

Here is the document text to classify: