After selecting an appropriate model for structured extraction, output inconsistencies persisted within the ReceiptFlow pipeline. Further analysis revealed that the issue was not primarily due to model limitations but to the format of the input itself. This article explores how different input representations, ranging from raw OCR HTML to structured formats, directly impacted model performance. Experiments conducted with Qwen models in a local llama.cpp setup demonstrate that reducing input complexity and explicitly defining structure significantly improves extraction accuracy and consistency.
Introduction
After selecting a suitable model, I observed that output quality was still inconsistent across receipts. The root cause was not the model; it was the input format. This article explores how different input representations affected extraction performance and why structured inputs significantly improved results.
Initial Input: Raw OCR HTML
The OCR system (LightOnOCR) generated HTML outputs containing nested tags, layout artifacts, inconsistent formatting, and irrelevant metadata. While this format preserves layout information, it also introduces complexity that is not directly useful for a language model. In theory, HTML should help the model understand structure (tables, rows, columns). However, in practice, the model struggles to interpret noisy or inconsistent markup, especially when the HTML is not clean or standardized.
Example Input
CASH RECEIPT
****
STORE NAME
Store Address Here
+01234567890
*********************
Date:01.01.22 Time:13.45
Cashier:John Doe
<table>
<tbody>
<tr><td>Cheese</td><td>3.59</td></tr>
<tr><td>Bread x4</td><td>4.40</td></tr>
<tr><td>Chicken Wings</td><td>12.40</td></tr>
<tr><td>Coffee Creamer</td><td>3.20</td></tr>
<tr><td>Soap x1</td><td>1.10</td></tr>
<tr><td>Tax</td><td>3.10</td></tr>
<tr><td>Total</td><td>24.10</td></tr>
</tbody>
</table>
Credit Card
Number:9999 9999 9999 9999
THANK YOU FOR SHOPPING
Barcode: [Barcode Image]
---
Total: 93.35
Sub Total: 117.2
Tax: 5.86
Order Total: 123.06
This example highlights a common real-world issue: multiple conflicting totals and mixed structures within the same receipt.
Problem
Feeding raw HTML directly into the model resulted in several issues:
- Incorrect grouping of items: the model struggled to distinguish between actual line items and summary rows like tax or totals.
- Confusion between totals and line items: fields such as "Total", "Sub Total", and "Order Total" were often misinterpreted or duplicated.
- Hallucinated fields: in some cases, the model introduced fields that were not present in the input, especially when the structure was ambiguous.
- Inconsistent JSON structure: outputs varied significantly between runs, making downstream processing unreliable.

These issues were not random; they were a direct consequence of high input entropy and unclear relationships between elements.
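The structural drift between runs is easy to detect mechanically. As a minimal illustration (not part of the original pipeline; the schema and field names here are hypothetical), a strict schema check over the model's JSON output makes missing, mistyped, and hallucinated fields visible:

```python
import json

# Hypothetical target schema: field name -> expected Python type.
EXPECTED_SCHEMA = {
    "merchant": str,
    "date": str,
    "items": list,
    "total": float,
}

def schema_errors(raw_json: str) -> list[str]:
    """Return a list of problems found in one model output."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            errors.append(f"wrong type for {field}: {type(data[field]).__name__}")
    # Fields the model invented, i.e. not in the schema at all.
    errors.extend(f"unexpected field: {f}" for f in data if f not in EXPECTED_SCHEMA)
    return errors

# Two runs over the same receipt: one clean, one drifted.
good = '{"merchant": "STORE NAME", "date": "01.01.22", "items": [], "total": 24.10}'
bad = '{"merchant": "STORE NAME", "items": [], "total": "24.10", "order_total": 24.10}'
```

Running a check like this after every extraction turns "the output looks different each time" into a concrete list of violations that can be logged and counted.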
Experiments
To better understand the impact of input format, I conducted a series of controlled experiments.
Experiment 1 — Raw HTML
Input: Direct OCR output without modification
Result:
- High noise
- Poor extraction accuracy
- Frequent misclassification of fields
The model spent most of its capacity trying to interpret structure rather than extracting useful information.
Experiment 2 — Markdown Conversion
In this step, HTML was converted into a simplified markdown-like format to improve readability.
Result:
- Improved readability for the model
- Slight improvement in extraction
- Still ambiguous relationships between items and totals
While this reduced noise, it did not fully solve the structural ambiguity problem.
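The article does not show the exact conversion code; a minimal stdlib-only sketch of the idea, using Python's `html.parser` to flatten the OCR table into markdown-style rows, might look like this (class and function names are my own):

```python
from html.parser import HTMLParser

class TableToMarkdown(HTMLParser):
    """Collect <td> cell text and emit one markdown-style row per <tr>."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
        elif tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append("| " + " | ".join(self._row) + " |")

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

def html_table_to_markdown(html: str) -> str:
    """Strip tags and layout noise, keeping only the cell contents."""
    parser = TableToMarkdown()
    parser.feed(html)
    return "\n".join(parser.rows)

receipt_html = (
    "<table><tbody>"
    "<tr><td>Cheese</td><td>3.59</td></tr>"
    "<tr><td>Bread x4</td><td>4.40</td></tr>"
    "</tbody></table>"
)
```

This removes the markup tokens entirely, but a flat `| label | amount |` row still does not tell the model whether a row is a purchased item or a summary line, which is the ambiguity the next experiment addresses.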
Experiment 3 — Structured Item List
The input was reformatted into a clearly defined structure:
- Header:
- merchant name
- date
- Items:
- description
- quantity
- price
- total
Result:
This significantly improved:
- extraction accuracy
- consistency across runs
- reduction in hallucination
The model no longer needed to "guess" relationships; they were explicitly provided.
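Rendering that structure is a plain string-formatting step. A sketch of it under my own assumptions (the item records and the exact layout are illustrative, not the pipeline's actual code):

```python
# Hypothetical item records as produced by an earlier parsing step.
items = [
    {"description": "Cheese", "quantity": 1, "price": 3.59},
    {"description": "Bread", "quantity": 4, "price": 1.10},
]

def build_structured_input(merchant: str, date: str, items: list[dict]) -> str:
    """Render the explicit header + itemized-list format described above."""
    lines = [
        "Header:",
        f"  merchant: {merchant}",
        f"  date: {date}",
        "Items:",
    ]
    for it in items:
        # Per-line total is computed here, so the model never has to infer it.
        total = round(it["quantity"] * it["price"], 2)
        lines.append(
            f"  - {it['description']} | qty: {it['quantity']} "
            f"| price: {it['price']:.2f} | total: {total:.2f}"
        )
    return "\n".join(lines)

prompt_block = build_structured_input("STORE NAME", "01.01.22", items)
```

Because every field is labeled and every relationship (item to quantity to price to total) is stated on one line, the model's job collapses from structure inference to transcription.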
Final Input Strategy
- The final approach used a hybrid format: a minimal header for metadata and a strict itemized list for transactions.
- This ensured that important fields were preserved, unnecessary noise was removed, and relationships were explicitly defined.
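The key preprocessing step behind this hybrid format is separating real line items from summary rows before the model ever sees them. A sketch of that separation (the keyword list and function names are my own assumptions, not the pipeline's actual code):

```python
# Labels that mark summary rows rather than purchased items (assumed list).
SUMMARY_KEYWORDS = ("total", "sub total", "subtotal", "tax", "order total")

def split_rows(rows: list[tuple[str, float]]):
    """Split OCR table rows into real line items and summary metadata."""
    items, summary = [], {}
    for label, amount in rows:
        key = label.strip().lower()
        if key in SUMMARY_KEYWORDS:
            # Summary rows go into the header metadata, not the item list.
            summary[key] = amount
        else:
            items.append((label.strip(), amount))
    return items, summary

rows = [
    ("Cheese", 3.59),
    ("Bread x4", 4.40),
    ("Tax", 3.10),
    ("Total", 24.10),
]
items, summary = split_rows(rows)
```

With this split done in code, the earlier failure mode of the model mixing "Total" into the item list simply cannot occur: totals never appear in the item section of the prompt.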
Why This Works
LLMs perform better when:
- Input entropy is low: less noise means the model can focus on relevant information instead of parsing irrelevant tokens.
- Relationships are explicit: clearly defined structures reduce ambiguity and improve consistency.
- Noise is minimized: removing redundant or conflicting information reduces hallucination.

In essence, the model performs best when the input is already close to the desired output format.
Key Insight
Input representation has a greater impact than model size in structured extraction tasks.
Even smaller models performed significantly better when given clean, structured input compared to larger models working on noisy data.
Conclusion
Optimizing input format was the single biggest improvement in pipeline performance. Instead of increasing model size or complexity, restructuring the input provided a more efficient and reliable solution. This highlights an important principle in LLM pipelines: better inputs lead to better outputs.
Next Step
Even with better inputs, outputs still contained inconsistencies that required further debugging and correction.
Q&A Section
Q1. Why did raw HTML perform poorly?
Because it introduced noise, ambiguity, and conflicting structures that the model struggled to interpret.
Q2. Did converting to markdown solve the problem?
Partially. It improved readability but did not eliminate structural ambiguity.
Q3. What worked best?
A structured input format with clearly defined fields and relationships.
Q4. What is the main takeaway?
Input representation has a greater impact than model size for structured extraction tasks.
Q5. Can smaller models perform well with better input?
Yes. Even smaller models improved significantly when given clean, structured input.
References
- llama.cpp Documentation
- Qwen Model Documentation

→ See 04-debugging-llm-output.md