Input Representation vs Model Performance in OCR + LLM Pipelines

After selecting an appropriate model for structured extraction, output inconsistencies persisted in the ReceiptFlow pipeline. Further analysis revealed that the issue was not primarily a model limitation but the format of the input itself. This article explores how different input representations, ranging from raw OCR HTML to structured formats, directly impacted model performance. Experiments with Qwen models in a local llama.cpp setup show that reducing input complexity and explicitly defining structure significantly improves extraction accuracy and consistency.

Introduction

After selecting a suitable model, I observed that output quality was still inconsistent across receipts. The root cause was not the model; it was the input format. This article explores how different input representations affected extraction performance and why structured inputs significantly improved results.

Initial Input: Raw OCR HTML

The OCR system (LightOnOCR) generated HTML outputs containing nested tags, layout artifacts, inconsistent formatting, and irrelevant metadata. While this format preserves layout information, it also introduces complexity that is not directly useful for a language model. In theory, HTML should help the model understand structure (tables, rows, columns). However, in practice, the model struggles to interpret noisy or inconsistent markup, especially when the HTML is not clean or standardized.

Example Input

  CASH RECEIPT

  ****

  STORE NAME
  Store Address Here
  +01234567890
  *********************

  Date:01.01.22    Time:13.45
  Cashier:John Doe

  <table>
    <tbody>
      <tr><td>Cheese</td><td>3.59</td></tr>
      <tr><td>Bread x4</td><td>4.40</td></tr>
      <tr><td>Chicken Wings</td><td>12.40</td></tr>
      <tr><td>Coffee Creamer</td><td>3.20</td></tr>
      <tr><td>Soap x1</td><td>1.10</td></tr>
      <tr><td>Tax</td><td>3.10</td></tr>
      <tr><td>Total</td><td>24.10</td></tr>
    </tbody>
  </table>

  Credit Card
  Number:9999 9999 9999 9999

  THANK YOU FOR SHOPPING

  Barcode: [Barcode Image]

  ---

  Total: 93.35
  Sub Total: 117.2
  Tax: 5.86
  Order Total: 123.06

This example highlights a common real-world issue: multiple conflicting totals and mixed structures within the same receipt.
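Conflicts like these can be surfaced mechanically before the text ever reaches the model. A minimal sketch (the regex and function name are illustrative, not part of the actual pipeline) that lists every line resembling a total:

```python
import re

# Illustrative pre-check for the conflicting-totals issue: collect every
# line that looks like a total so conflicts can be flagged up front.
TOTAL_RE = re.compile(
    r"^\s*(total|sub total|order total)\s*:?\s*([0-9]+(?:\.[0-9]+)?)",
    re.IGNORECASE,
)

def find_total_lines(text: str) -> list[tuple[str, float]]:
    """Return (label, amount) for each total-like line in the OCR text."""
    hits = []
    for line in text.splitlines():
        m = TOTAL_RE.match(line)
        if m:
            hits.append((m.group(1), float(m.group(2))))
    return hits
```

Running this over the example above yields three different "totals", which is exactly the kind of ambiguity the model was left to resolve on its own.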

Problem

Feeding raw HTML directly into the model resulted in several issues:

  1. Incorrect grouping of items
    The model struggled to distinguish between actual line items and summary rows like tax or totals.
  2. Confusion between totals and line items
    Fields such as “Total”, “Sub Total”, and “Order Total” were often misinterpreted or duplicated.
  3. Hallucinated fields
    In some cases, the model introduced fields that were not present in the input, especially when the structure was ambiguous.
  4. Inconsistent JSON structure
    Outputs varied significantly between runs, making downstream processing unreliable.

These issues were not random; they were a direct consequence of high input entropy and unclear relationships between elements.
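Problems 3 and 4 in the list above can at least be detected automatically. A minimal sketch of an output validator, assuming a hypothetical expected schema (the key names here are illustrative, not the pipeline's actual fields):

```python
import json

# Hypothetical target schema for one extracted receipt.
EXPECTED_KEYS = {"merchant", "date", "items", "total"}

def validate_output(raw_json: str) -> list[str]:
    """Return a list of problems found in one model response."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    extra = set(data) - EXPECTED_KEYS    # possible hallucinated fields
    missing = EXPECTED_KEYS - set(data)  # dropped fields
    if extra:
        problems.append(f"unexpected keys: {sorted(extra)}")
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    return problems
```

A check like this does not fix the inputs, but it quantifies how often the structure drifts between runs, which is what motivated the experiments below.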

Experiments

To better understand the impact of input format, I conducted a series of controlled experiments.

Experiment 1 — Raw HTML

Input: Direct OCR output without modification

Result:

  • High noise
  • Poor extraction accuracy
  • Frequent misclassification of fields

The model spent most of its capacity trying to interpret structure rather than extracting useful information.

Experiment 2 — Markdown Conversion

In this step, HTML was converted into a simplified markdown-like format to improve readability.

Result:

  1. Improved readability for the model
  2. Slight improvement in extraction
  3. Still ambiguous relationships between items and totals

While this reduced noise, it did not fully solve the structural ambiguity problem.
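The conversion step can be sketched with the standard library alone. This flattener is a simplification of what such a converter might look like, not the pipeline's actual code; it turns each table row into one pipe-separated line and discards the markup:

```python
from html.parser import HTMLParser

class TableFlattener(HTMLParser):
    """Flatten <tr>/<td> structure into 'cell | cell' lines."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cells = []
        self._buf = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._cells.append("".join(self._buf).strip())
            self._in_cell = False
        elif tag == "tr":
            self.rows.append(" | ".join(self._cells))
            self._cells = []

    def handle_data(self, data):
        if self._in_cell:
            self._buf.append(data)

def html_table_to_markdown(html: str) -> str:
    parser = TableFlattener()
    parser.feed(html)
    return "\n".join(parser.rows)
```

Applied to the receipt table above, this yields lines like `Cheese | 3.59`, which are easier for the model to read but still leave "Tax" and "Total" rows indistinguishable from line items.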

Experiment 3 — Structured Item List

The input was reformatted into a clearly defined structure:

  1. Header:
    • merchant name
    • date
  2. Items:
    • description
    • quantity
    • price
    • total

Result:
This significantly improved:

  1. extraction accuracy
  2. consistency across runs
  3. reduction in hallucination

The model no longer needed to “guess” relationships; they were provided explicitly.

Final Input Strategy

  1. The final approach used a hybrid format: a minimal header for metadata and a strict itemized list for transactions.
  2. This ensured that important fields were preserved, unnecessary noise was removed, and relationships were explicitly defined.
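A sketch of what building that hybrid format could look like. The field names and line layout are illustrative assumptions, not the pipeline's exact format:

```python
def build_model_input(merchant: str, date: str, items: list[dict]) -> str:
    """Render a minimal header plus a strict itemized list for the LLM."""
    lines = [f"MERCHANT: {merchant}", f"DATE: {date}", "ITEMS:"]
    for item in items:
        qty = item.get("quantity", 1)
        # Per-line totals are precomputed so the model never has to infer them.
        lines.append(
            f"- {item['description']} | qty={qty} "
            f"| unit={item['price']:.2f} | total={qty * item['price']:.2f}"
        )
    return "\n".join(lines)
```

Because quantities and per-line totals are spelled out, the model's job reduces to transcription rather than inference.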

Why This Works

LLMs perform better when:

  1. Input entropy is low
    Less noise means the model can focus on relevant information instead of parsing irrelevant tokens.
  2. Relationships are explicit
    Clearly defined structures reduce ambiguity and improve consistency.
  3. Noise is minimized
    Removing redundant or conflicting information reduces hallucination.

In essence, the model performs best when the input is already close to the desired output format.
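The "input entropy" point can be made concrete with a toy measurement. Shannon entropy over characters is a crude proxy for how varied the input is, but it illustrates why markup-free text is cheaper for the model to process:

```python
import math
from collections import Counter

def char_entropy(text: str) -> float:
    """Shannon entropy (bits per character) of a string."""
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

# The same fact expressed as raw markup vs. a cleaned line:
raw = "<tr><td>Cheese</td><td>3.59</td></tr>"
clean = "Cheese | 3.59"
# The raw version is both longer and more varied per character, so the
# model spends capacity on tokens that carry no receipt information.
```

This is only a rough analogy (LLMs operate on tokens, not characters), but the direction of the effect matches what the experiments showed.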

Key Insight

Input representation has a greater impact than model size in structured extraction tasks.

Even smaller models performed significantly better when given clean, structured input compared to larger models working on noisy data.

Conclusion

Optimizing input format was the single biggest improvement in pipeline performance. Instead of increasing model size or complexity, restructuring the input provided a more efficient and reliable solution. This highlights an important principle in LLM pipelines: better inputs lead to better outputs.

Next Step

Even with better inputs, outputs still contained inconsistencies that required further debugging and correction.

Q&A Section

Q1. Why did raw HTML perform poorly?

Because it introduced noise, ambiguity, and conflicting structures that the model struggled to interpret.

Q2. Did converting to markdown solve the problem?

Partially. It improved readability but did not eliminate structural ambiguity.

Q3. What worked best?

A structured input format with clearly defined fields and relationships.

Q4. What is the main takeaway?

Input representation has a greater impact than model size for structured extraction tasks.

Q5. Can smaller models perform well with better input?

Yes. Even smaller models improved significantly when given clean, structured input.

References

  • llama.cpp Documentation
  • Qwen Model Documentation

Next: see 04-debugging-llm-output.md
