Training hyperparameters for the models reported in the paper. For models trained by us, two variants exist: normal (original RVL-CDIP labels) and fixed (label-cleaned RVL-CDIP). Models marked "pretrained" are evaluated using the published authors' fine-tuned checkpoint as-is.
ImageNet-pretrained backbones from torch.hub (pytorch/vision:v0.10.0). Final classification layer replaced with Linear(*, 16).
| Setting | Value |
|---|---|
| Optimizer | SGD, lr=1e-3, momentum=0.9 |
| LR schedule | LambdaLR, λ(epoch) = (1 − epoch/60)^0.5 |
| Loss | CrossEntropyLoss |
| Epochs | 60 |
| Batch size | 64 |
| Image preproc | Resize 224×224, normalize (mean=0.9199, std=0.1853 — RVL-CDIP grayscale stats), grayscale replicated to 3 channels |
| Setting | Value |
|---|---|
| HF model ID | google-bert/bert-base-uncased |
| Optimizer | AdamW (HF Trainer defaults: lr=5e-5, weight_decay=0.0) |
| LR schedule | linear with no warmup (HF Trainer default) |
| Loss | CrossEntropyLoss |
| Epochs | 50 |
| Batch size | 64 |
| Max token length | 512 (truncate + pad) |
| Input | per-document Textract OCR text |
| Setting | Value |
|---|---|
| HF model ID | FacebookAI/roberta-base |
| Optimizer | AdamW (HF Trainer defaults: lr=5e-5, weight_decay=0.0) |
| LR schedule | linear with no warmup (HF Trainer default) |
| Loss | CrossEntropyLoss |
| Epochs | 50 |
| Batch size | 64 |
| Max token length | 512 (truncate + pad) |
| Input | per-document Textract OCR text |
Both normal and fixed variants use the same hyperparameters; only the train/val/test splits differ.
| Setting | Value |
|---|---|
| HF model ID | google/bigbird-roberta-base |
| Optimizer | AdamW (HF Trainer default) |
| LR schedule | linear with warmup |
| Loss | CrossEntropyLoss |
| Learning rate | 3e-5 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Epochs | 5 |
| Per-device batch size | 4 |
| Gradient accumulation | 8 (effective batch 64, 2× A6000) |
| Precision | bf16 |
| Attention type | block_sparse, block_size=64, num_random_blocks=3 |
| Max token length | 2048 (truncate + pad to max_length) |
| Seed | 42 |
| Best-model selection | eval_accuracy, loaded at end |
| Input | per-document Textract OCR text |
Trained with HF Trainer. The normal variant uses the original RVL-CDIP labels (bf16). The fixed variant is a fp32 fix-forward warm-started from the normal checkpoint after a bf16 single-step overflow killed the from-scratch fixed-labels run.
| Setting | Normal | Fixed (fp32 fix-forward) |
|---|---|---|
| HF model ID | allenai/longformer-base-4096 |
allenai/longformer-base-4096 |
| Optimizer | AdamW (HF Trainer default) |
AdamW (HF Trainer default) |
| LR schedule | linear with warmup | linear with warmup |
| Loss | CrossEntropyLoss | CrossEntropyLoss |
| Learning rate | 3e-5 | 1e-5 |
| Warmup ratio | 0.1 | 0.05 |
| Weight decay | 0.01 | 0.01 |
| Epochs | 5 | 2 (continuing from normal checkpoint) |
| Per-device batch size | 16 | 8 |
| Gradient accumulation | 2 (effective batch 32) | 4 (effective batch 32) |
| Precision | bf16 | fp32 |
| Max token length | 2048 (covers 99.9% of docs) | 2048 |
| Seed | 42 | 42 |
| Best-model selection | eval_accuracy, loaded at end |
eval_accuracy, loaded at end |
| Input | per-document Textract OCR text | per-document Textract OCR text |
| Setting | Value |
|---|---|
| HF model ID | microsoft/layoutlm-base-uncased |
| Optimizer | AdamW, lr=5e-5, grad-norm clip 1.0 |
| LR schedule | linear with warmup, warmup_ratio=0.1 |
| Loss | CrossEntropyLoss |
| Epochs | 5 |
| Batch size | 16 |
| Max sequence length | 512 |
| Input | Textract OCR tokens + bboxes (normalized to 0–1000) |
| Setting | Value |
|---|---|
| HF model ID | microsoft/layoutlmv3-base |
| Optimizer | Adam, lr=1e-5 |
| LR schedule | none |
| Loss | CrossEntropyLoss |
| Epochs | 10 |
| Batch size | 16 |
| Precision | fp16 |
| Input | Textract OCR tokens + bboxes + RGB image (processor with apply_ocr=False) |
Authors' published checkpoint, no in-house fine-tuning.
| Setting | Value |
|---|---|
| Source | RvlCdip_docxclassifier_base.pt (DocXClassifier release) |
| Backbone | ConvNeXt-base |
| Image preproc | CV2 resize, ImageNet mean/std, grayscale replicated to 3 channels |
Authors' published checkpoint, no in-house fine-tuning.
| Setting | Value |
|---|---|
| HF model ID | microsoft/dit-base-finetuned-rvlcdip |
| Architecture | BEiT, 86M params |
| Image preproc | BeitImageProcessor defaults (224×224, ImageNet mean/std) |
Authors' published checkpoint, no in-house fine-tuning.
| Setting | Value |
|---|---|
| HF model ID | microsoft/dit-large-finetuned-rvlcdip |
| Architecture | BEiT, 304M params |
| Image preproc | BeitImageProcessor defaults (224×224, ImageNet mean/std) |
Authors' published checkpoint, no in-house fine-tuning. OCR-free; class parsed from the autoregressive decode.
| Setting | Value |
|---|---|
| HF model ID | naver-clova-ix/donut-base-finetuned-rvlcdip |
| Architecture | Swin encoder + BART decoder |
| Image preproc | DonutProcessor defaults |
| Decoding | greedy autoregressive |
All LLM zero-shot results are produced by querying models through the AWS Bedrock converse API (bedrock-runtime, us-east-1). Input is per-document Textract OCR text prepended to a prompt; the model returns a JSON prediction.
| Model key | Bedrock model ID |
|---|---|
maverick |
us.meta.llama4-maverick-17b-instruct-v1:0 |
mistral-7b |
mistral.mistral-7b-instruct-v0:2 |
openai-20 |
openai.gpt-oss-20b-1:0 |
openai-120 |
openai.gpt-oss-120b-1:0 |
qwen |
qwen.qwen3-32b-v1:0 |
| Inference setting | Value |
|---|---|
temperature |
0.0 |
topP |
0.9 |
maxTokens |
8192 |
Three prompt variants were tested. The paper reports results using Prompt 1 only; analysis of Prompts 2 and 3 is left to future work.
Prompt 1 — bare category list, no descriptions:
You are a document categorization system.
Classify the text of a document into one of the following categories:
1. advertisement
2. budget
3. email
4. file_folder
5. form
6. handwritten
7. invoice
8. letter
9. memo
10. news article
11. presentation
12. questionnaire
13. resume
14. scientific publication
15. scientific report
16. specification
Respond only with the category name. Do not provide an explanation.
Respond with the following JSON format:
{"prediction": <category name>}
Here is the document text to classify:
Prompt 2 — categories with domain-specific descriptions derived from the RVL-CDIP label definitions:
You are a document categorization system.
Classify the text of a document into one of the following categories:
1. advertisement: advertisements from print-form media like newspapers, magazines, and radio/television scripts.
2. budget: various budget documents such as expense, spending, sales, cash, and accounting reports and forecasts; campaign contribution requests; checks and check stubs.
3. email: printed emails.
4. file_folder: folders and binders. These will typically have little text.
5. form: form documents with form-like elements. These include fax forms.
6. handwritten: handwriten documents.
7. invoice: invoices, bills, quotations, and estimates.
8. letter: letters, often with letterhead and "Dear..." salutations.
9. memo: memo or memoranda documents or inter-office correspondence documents, often with clear "TO", "FROM" headings.
10. news_article: news articles in the form of clippings from newspapers and other print-form news media.
11. presentation: presentation and overhead slides, transcripts of speeches and statements, and press releases.
12. questionnaire: customer surveys and questionnaires, survey prompts.
13. resume: resumes, curricula vitae (CVs), biographical sketches, executive biographies, business cards.
14. scientific_publication: scientific publications or articles form scientific journals and book chapters; includes book title pages.
15. scientific_report: scientific reports like bioassay, pathology, test reports, charts, graphs, tables, research progress repors, research proposals.
16. specification: specifications like data sheets (including material safety data sheets); product, material, and test specifications.
Respond only with the category name. Do not provide an explanation.
Respond with the following JSON format:
{"prediction": <category name>}
Here is the document text to classify:
Prompt 3 — categories with concise dictionary-style definitions:
You are a document categorization system.
Classify the text of a document into one of the following categories:
1. advertisement: a notice or announcement in a public medium promoting a product.
2. budget: an estimate, often itemized, of expected income and expense for a given period in the future.
3. email: a message sent by email.
4. file_folder: a folded sheet of light cardboard used to cover or hold documents.
5. form: a document with blank spaces to be filled in with particulars before it is executed.
6. handwritten: handwritten documents.
7. invoice: an itemized bill for goods sold or services provided, containing individual prices, the total charge, and the terms.
8. letter: a written or printed communication addressed to a person or organization and usually transmitted by mail.
9. memo: an informal message, especially one sent between two or more employees of the same company, concerning company business.
10: news_article: the presentation of a report on recent or new events in a newspaper or other periodical or on radio or television.
11. presentation: a document for a formal talk given to an audience to share information, persuade, inspire, or demonstrate something, often using visuals like slides to support the spoken message.
12. questionnaire: a list of questions, usually printed, submitted for replies that can be analyzed for usable information.
13. resume: a brief written account of personal, educational, and professional qualifications and experience, as that prepared by an applicant for a job.
14. scientific_publication: periodical that shares new scientific research and knowledge with the scientific community and the public.
15. scientific_report: document that details a scientific experiment or research project, presenting the methodology, data, results, and conclusions to inform readers and allow for reproduction of the study.
16. specification: a detailed description or assessment of requirements, dimensions, materials, etc.
Respond only with the category name. Do not provide an explanation.
Respond with the following JSON format:
{"prediction": <category name>}
Here is the document text to classify: