RVL-CDIP Model Hyperparameters

Training hyperparameters for the models reported in the paper. For models trained by us, two variants exist: normal (original RVL-CDIP labels) and fixed (label-cleaned RVL-CDIP). Models marked "pretrained" are evaluated using the published authors' fine-tuned checkpoint as-is.

CNNs: AlexNet, GoogLeNet, ResNet-50, ResNeXt-50, VGG-16

ImageNet-pretrained backbones from torch.hub (pytorch/vision:v0.10.0). Final classification layer replaced with Linear(*, 16).

Setting	Value
Optimizer	SGD, lr=1e-3, momentum=0.9
LR schedule	`LambdaLR`, λ(epoch) = (1 − epoch/60)^0.5
Loss	CrossEntropyLoss
Epochs	60
Batch size	64
Image preproc	Resize 224×224, normalize (mean=0.9199, std=0.1853 — RVL-CDIP grayscale stats), grayscale replicated to 3 channels

BERT

Setting	Value
HF model ID	`google-bert/bert-base-uncased`
Optimizer	AdamW (HF `Trainer` defaults: lr=5e-5, weight_decay=0.0)
LR schedule	linear with no warmup (HF `Trainer` default)
Loss	CrossEntropyLoss
Epochs	50
Batch size	64
Max token length	512 (truncate + pad)
Input	per-document Textract OCR text

RoBERTa

Setting	Value
HF model ID	`FacebookAI/roberta-base`
Optimizer	AdamW (HF `Trainer` defaults: lr=5e-5, weight_decay=0.0)
LR schedule	linear with no warmup (HF `Trainer` default)
Loss	CrossEntropyLoss
Epochs	50
Batch size	64
Max token length	512 (truncate + pad)
Input	per-document Textract OCR text

BigBird

Both normal and fixed variants use the same hyperparameters; only the train/val/test splits differ.

Setting	Value
HF model ID	`google/bigbird-roberta-base`
Optimizer	AdamW (HF `Trainer` default)
LR schedule	linear with warmup
Loss	CrossEntropyLoss
Learning rate	3e-5
Warmup ratio	0.1
Weight decay	0.01
Epochs	5
Per-device batch size	4
Gradient accumulation	8 (effective batch 64, 2× A6000)
Precision	bf16
Attention type	`block_sparse`, block_size=64, num_random_blocks=3
Max token length	2048 (truncate + pad to max_length)
Seed	42
Best-model selection	`eval_accuracy`, loaded at end
Input	per-document Textract OCR text

Longformer

Trained with HF Trainer. The normal variant uses the original RVL-CDIP labels (bf16). The fixed variant is a fp32 fix-forward warm-started from the normal checkpoint after a bf16 single-step overflow killed the from-scratch fixed-labels run.

Setting	Normal	Fixed (fp32 fix-forward)
HF model ID	`allenai/longformer-base-4096`	`allenai/longformer-base-4096`
Optimizer	AdamW (HF `Trainer` default)	AdamW (HF `Trainer` default)
LR schedule	linear with warmup	linear with warmup
Loss	CrossEntropyLoss	CrossEntropyLoss
Learning rate	3e-5	1e-5
Warmup ratio	0.1	0.05
Weight decay	0.01	0.01
Epochs	5	2 (continuing from normal checkpoint)
Per-device batch size	16	8
Gradient accumulation	2 (effective batch 32)	4 (effective batch 32)
Precision	bf16	fp32
Max token length	2048 (covers 99.9% of docs)	2048
Seed	42	42
Best-model selection	`eval_accuracy`, loaded at end	`eval_accuracy`, loaded at end
Input	per-document Textract OCR text	per-document Textract OCR text

LayoutLM v1

Setting	Value
HF model ID	`microsoft/layoutlm-base-uncased`
Optimizer	AdamW, lr=5e-5, grad-norm clip 1.0
LR schedule	linear with warmup, warmup_ratio=0.1
Loss	CrossEntropyLoss
Epochs	5
Batch size	16
Max sequence length	512
Input	Textract OCR tokens + bboxes (normalized to 0–1000)

LayoutLMv3

Setting	Value
HF model ID	`microsoft/layoutlmv3-base`
Optimizer	Adam, lr=1e-5
LR schedule	none
Loss	CrossEntropyLoss
Epochs	10
Batch size	16
Precision	fp16
Input	Textract OCR tokens + bboxes + RGB image (processor with `apply_ocr=False`)

DocXClassifier-B (pretrained)

Authors' published checkpoint, no in-house fine-tuning.

Setting	Value
Source	`RvlCdip_docxclassifier_base.pt` (DocXClassifier release)
Backbone	ConvNeXt-base
Image preproc	CV2 resize, ImageNet mean/std, grayscale replicated to 3 channels

DiT-base (pretrained)

Authors' published checkpoint, no in-house fine-tuning.

Setting	Value
HF model ID	`microsoft/dit-base-finetuned-rvlcdip`
Architecture	BEiT, 86M params
Image preproc	`BeitImageProcessor` defaults (224×224, ImageNet mean/std)

DiT-large (pretrained)

Authors' published checkpoint, no in-house fine-tuning.

Setting	Value
HF model ID	`microsoft/dit-large-finetuned-rvlcdip`
Architecture	BEiT, 304M params
Image preproc	`BeitImageProcessor` defaults (224×224, ImageNet mean/std)

Donut (pretrained)

Authors' published checkpoint, no in-house fine-tuning. OCR-free; class parsed from the autoregressive decode.

Setting	Value
HF model ID	`naver-clova-ix/donut-base-finetuned-rvlcdip`
Architecture	Swin encoder + BART decoder
Image preproc	`DonutProcessor` defaults
Decoding	greedy autoregressive

LLM Zero-Shot Classifiers (Bedrock API)

All LLM zero-shot results are produced by querying models through the AWS Bedrock converse API (bedrock-runtime, us-east-1). Input is per-document Textract OCR text prepended to a prompt; the model returns a JSON prediction.

Model key	Bedrock model ID
`maverick`	`us.meta.llama4-maverick-17b-instruct-v1:0`
`mistral-7b`	`mistral.mistral-7b-instruct-v0:2`
`openai-20`	`openai.gpt-oss-20b-1:0`
`openai-120`	`openai.gpt-oss-120b-1:0`
`qwen`	`qwen.qwen3-32b-v1:0`

Inference setting	Value
`temperature`	0.0
`topP`	0.9
`maxTokens`	8192

Prompts

Three prompt variants were tested. The paper reports results using Prompt 1 only; analysis of Prompts 2 and 3 is left to future work.

Prompt 1 — bare category list, no descriptions:

You are a document categorization system.

Classify the text of a document into one of the following categories:

    1. advertisement
    2. budget
    3. email
    4. file_folder
    5. form
    6. handwritten
    7. invoice
    8. letter
    9. memo
    10. news article
    11. presentation
    12. questionnaire
    13. resume
    14. scientific publication
    15. scientific report
    16. specification

Respond only with the category name. Do not provide an explanation.

Respond with the following JSON format:
{"prediction": <category name>}

Here is the document text to classify:

Prompt 2 — categories with domain-specific descriptions derived from the RVL-CDIP label definitions:

You are a document categorization system.

Classify the text of a document into one of the following categories:

    1. advertisement: advertisements from print-form media like newspapers, magazines, and radio/television scripts.
    2. budget: various budget documents such as expense, spending, sales, cash, and accounting reports and forecasts; campaign contribution requests; checks and check stubs.
    3. email: printed emails.
    4. file_folder: folders and binders. These will typically have little text.
    5. form: form documents with form-like elements. These include fax forms.
    6. handwritten: handwriten documents.
    7. invoice: invoices, bills, quotations, and estimates.
    8. letter: letters, often with letterhead and "Dear..." salutations.
    9. memo: memo or memoranda documents or inter-office correspondence documents, often with clear "TO", "FROM" headings.
    10. news_article: news articles in the form of clippings from newspapers and other print-form news media.
    11. presentation: presentation and overhead slides, transcripts of speeches and statements, and press releases.
    12. questionnaire: customer surveys and questionnaires, survey prompts.
    13. resume: resumes, curricula vitae (CVs), biographical sketches, executive biographies, business cards.
    14. scientific_publication: scientific publications or articles form scientific journals and book chapters; includes book title pages.
    15. scientific_report: scientific reports like bioassay, pathology, test reports, charts, graphs, tables, research progress repors, research proposals.
    16. specification: specifications like data sheets (including material safety data sheets); product, material, and test specifications.

Respond only with the category name. Do not provide an explanation.

Respond with the following JSON format:
{"prediction": <category name>}

Here is the document text to classify:

Prompt 3 — categories with concise dictionary-style definitions:

You are a document categorization system.

Classify the text of a document into one of the following categories:

    1. advertisement: a notice or announcement in a public medium promoting a product.
    2. budget: an estimate, often itemized, of expected income and expense for a given period in the future.
    3. email: a message sent by email.
    4. file_folder: a folded sheet of light cardboard used to cover or hold documents.
    5. form: a document with blank spaces to be filled in with particulars before it is executed.
    6. handwritten: handwritten documents.
    7. invoice: an itemized bill for goods sold or services provided, containing individual prices, the total charge, and the terms.
    8. letter: a written or printed communication addressed to a person or organization and usually transmitted by mail.
    9. memo: an informal message, especially one sent between two or more employees of the same company, concerning company business.
    10: news_article: the presentation of a report on recent or new events in a newspaper or other periodical or on radio or television.
    11. presentation: a document for a formal talk given to an audience to share information, persuade, inspire, or demonstrate something, often using visuals like slides to support the spoken message.
    12. questionnaire: a list of questions, usually printed, submitted for replies that can be analyzed for usable information.
    13. resume: a brief written account of personal, educational, and professional qualifications and experience, as that prepared by an applicant for a job.
    14. scientific_publication: periodical that shares new scientific research and knowledge with the scientific community and the public.
    15. scientific_report: document that details a scientific experiment or research project, presenting the methodology, data, results, and conclusions to inform readers and allow for reproduction of the study.
    16. specification: a detailed description or assessment of requirements, dimensions, materials, etc.

Respond only with the category name. Do not provide an explanation.

Respond with the following JSON format:
{"prediction": <category name>}

Here is the document text to classify: