Extracting structured data from real-world receipts sounds simple at first, but it quickly turns into a messy problem.
Receipts are all over the place: different formats, layouts, fonts, sometimes even different languages. And even when you run OCR on them, what you get back is usually noisy, inconsistent, and not directly usable.
So instead of trying to “fix everything in one step”, I built this project, **Review Tool by Kashish**, as a pipeline.
The idea was simple: don’t rely on one model to do everything. Break the problem into stages, and let each stage handle one responsibility properly.
This article doesn’t just show the final system; it also documents the process of getting there: what failed, what worked, and what actually made a difference.
How this project is structured
I didn’t treat this like a single script. I broke it into a series of focused writeups, each solving one part of the problem:
- Tool calling didn’t work → 01-tool-calling-failure
- Model comparison → 02-model-evaluation
- Input format experiments → 03-input-format-optimization
- Debugging LLM outputs → 04-debugging-llm-output
- Final validation logic → 05-validation
Each one builds on the previous one, so it’s more like a system evolution than isolated docs.
Pipeline Overview
Image → OCR (LightOnOCR) → HTML → LLM (Qwen via llama.cpp) → JSON → Cleaning → Validation

Why this pipeline?
OCR alone isn’t enough.
It can read text, sure, but it doesn’t understand structure. And receipts are not just text; they’re semi-structured documents with relationships (items, totals, tax, etc.).
System Execution
Running the LLM Server

This runs a local inference server using llama.cpp. This step initializes a **multimodal model (LightOnOCR)** capable of processing images. The `--mmproj` projector maps visual features into the language model’s embedding space.
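A typical launch command looks something like this (the model filenames and port are placeholders, not the project’s actual paths; the flags are llama.cpp’s `llama-server` options):

```shell
# Start a local multimodal inference server with llama.cpp.
# --mmproj loads the vision projector that maps image features
# into the language model's embedding space.
llama-server \
  -m LightOnOCR.gguf \
  --mmproj mmproj-LightOnOCR.gguf \
  --host 127.0.0.1 --port 8080
```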
Example Workflow
Input: Receipt Image
Receipts are highly unstructured inputs. Variations in layout, font, and formatting introduce noise that must be normalized before structured extraction.

Running LightOnOCR with llama.cpp
LightOnOCR converts visual input into structured HTML, not just plain text. This is important because:
- HTML preserves layout relationships
- Tables and rows are maintained
- it improves downstream extraction by the LLM
Image → HTML script:
```powershell
$prompt = "Extract all text from this receipt as HTML."

for ($i = 1; $i -le 100; $i++) {
    Write-Host "Processing $i.png..."
    $img = "C:\mymodeldir\samples\$i.png"

    # Encode the image as base64 so it can be embedded in the request body
    $b64 = [Convert]::ToBase64String([IO.File]::ReadAllBytes($img))

    $body = @{
        messages = @(
            @{
                role    = "user"
                content = @(
                    @{ type = "text"; text = $prompt },
                    @{ type = "image_url"; image_url = @{ url = "data:image/png;base64,$b64" } }
                )
            }
        )
    } | ConvertTo-Json -Depth 5

    $response = Invoke-RestMethod -Uri "http://127.0.0.1:8080/v1/chat/completions" `
        -Method Post `
        -Body $body `
        -ContentType "application/json"

    $output = $response.choices[0].message.content
    $output | Out-File "C:\mymodeldir\ocr_outputs\$i.html"
    Write-Host "Saved $i.html"
}
```
This script performs:
- image loading
- base64 encoding
- API communication with the OCR model
Base64 encoding is required because llama.cpp expects image input as a URL or an encoded string; encoding lets us embed the image data directly in the request.
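The same payload construction can be sketched in Python (a minimal illustration of the message shape the script sends; the function name and the sample bytes are mine, not part of the project):

```python
# Sketch: embedding an image as a base64 data URL inside an
# OpenAI-style chat message, as the llama.cpp server expects.
import base64

def build_vision_message(image_bytes: bytes, prompt: str) -> dict:
    # Base64 turns raw bytes into text that can live inside a JSON body.
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }

msg = build_vision_message(b"abc", "Extract all text from this receipt as HTML.")
print(msg["content"][1]["image_url"]["url"])  # data:image/png;base64,YWJj
```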
OCR extracts structured HTML:
```
CASH RECEIPT
STORE NAME Store Address Here +01234567890
Date:01.01.22 Time:13.45 Cashier:John Doe
| Cheese | 3.59 |
| Bread x4 | 4.40 |
| Chicken Wings | 12.40 |
| Coffee Creamer | 3.20 |
| Soap x1 | 1.10 |
| Tax | 3.10 |
| Total | 24.10 |
Credit Card Number:9999 9999 9999 9999
THANK YOU FOR SHOPPING
Barcode: [Barcode Image]
Total: 93.35 Sub Total: 117.2 Tax: 5.86 Order Total: 123.06
```
The OCR output is intentionally kept as HTML because:
- it preserves structure (tables, rows)
- provides semantic grouping of items
- reduces ambiguity compared to plain text
However, this output is still noisy and requires interpretation.
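To see why the HTML form helps, here is a minimal Python sketch (standard library only; the sample rows are illustrative, not actual OCR output). Each `<tr>` gives an unambiguous item boundary that flat text simply doesn’t have:

```python
# Sketch: pulling line items out of an HTML table with simple regexes.
import re

html = """
<table>
  <tr><td>Cheese</td><td>3.59</td></tr>
  <tr><td>Bread x4</td><td>4.40</td></tr>
  <tr><td>Total</td><td>24.10</td></tr>
</table>
"""

# Split on <tr> boundaries first, then extract each cell.
rows = [re.findall(r"<td>(.*?)</td>", tr)
        for tr in re.findall(r"<tr>(.*?)</tr>", html, re.S)]
print(rows)  # [['Cheese', '3.59'], ['Bread x4', '4.40'], ['Total', '24.10']]
```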
Running Qwen3.5 with llama.cpp:

This step uses a text-only LLM (Qwen) to interpret the structured HTML. Unlike the OCR model, it does not “see” images; it performs semantic parsing and reasoning. This separation improves modularity:
- OCR handles perception
- the LLM handles understanding
HTML → JSON script:
```powershell
for ($i = 1; $i -le 100; $i++) {
    Write-Host "Processing $i.html..."
    $htmlPath = "C:\mymodeldir\ocr_outputs\$i.html"
    if (!(Test-Path $htmlPath)) { continue }

    $html = Get-Content $htmlPath -Raw

    # Cleaning: strip currency markers, markdown bold, and collapse whitespace
    $html = $html -replace "RM|SR|\$", ""
    $html = $html -replace "\*\*", ""
    $html = $html -replace "`r|`n", " "
    $html = $html -replace "\s+", " "

    # Note: the closing "@ of a here-string must start at column 1
    $prompt = @"
Extract structured receipt data.
Return ONLY JSON:
{
  "merchant_name": "string",
  "merchant_tax_id": "string",
  "date": "string",
  "invoice_no": "string",
  "currency": "string",
  "total_amount": "string",
  "tax_amount": "string",
  "line_items": [
    {
      "item_desc": "string",
      "item_qty": number,
      "item_total": "string"
    }
  ]
}
INPUT:
$html
"@

    $body = @{
        temperature = 0
        max_tokens  = 700
        messages    = @(
            @{
                role    = "user"
                content = $prompt
            }
        )
    } | ConvertTo-Json -Depth 6

    $response = Invoke-RestMethod -Uri "http://127.0.0.1:8081/v1/chat/completions" `
        -Method Post `
        -Body $body `
        -ContentType "application/json"

    $output = $response.choices[0].message.content
    if (![string]::IsNullOrWhiteSpace($output)) {
        $output | Out-File "C:\mymodeldir\json_outputs\$i.json"
        Write-Host "Saved $i.json"
    }
}
```
LLM converts HTML → JSON:
```json
{
  "merchant_name": "ore Name",
  "address": "ore Address Here",
  "phone_number": "+01234567890",
  "date": "01.01.22",
  "time": "13:45",
  "invoice_number": "not present in receipt",
  "tax_id": "not present in receipt",
  "currency": "not present in receipt",
  "items": [
    { "name": "Cheese", "quantity": "1", "price": "3.59" },
    { "name": "Bread x4", "quantity": "4", "price": "4.40" },
    { "name": "Chicken Wings", "quantity": "1", "price": "12.40" },
    { "name": "Coffee Creamer", "quantity": "1", "price": "3.20" },
    { "name": "Soap x1", "quantity": "1", "price": "1.10" },
    { "name": "Tax", "quantity": "1", "price": "3.10" },
    { "name": "Total", "quantity": "1", "price": "24.10" }
  ],
  "subtotal": "117.2",
  "tax": "5.86",
  "total": "24.10",
  "payment_method": "Credit Card",
  "change": "not present in receipt",
  "discounts": "not present in receipt",
  "barcode": "[Barcode Image]"
}
```
This stage introduces deterministic correction, which is critical.
LLMs:
- generate approximate outputs
- may include noise or errors
The cleaning layer ensures:
- numeric consistency
- removal of invalid entries
- normalization of values
This is more efficient than re-running the model.
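Before any numeric cleanup can run, the model’s raw text has to parse as JSON at all. A small Python sketch of a tolerant loader (the fence-stripping and the first-object fallback are assumptions about how the model wraps its answer, not part of the article’s scripts):

```python
# Sketch: tolerant JSON loading for LLM output that may be wrapped
# in ```json fences or surrounded by stray prose.
import json
import re

def parse_llm_json(text: str):
    # Strip markdown code fences if present.
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the first {...} block if prose surrounds the JSON.
        match = re.search(r"\{.*\}", text, re.S)
        return json.loads(match.group(0)) if match else None

print(parse_llm_json('```json\n{"total_amount": "24.10"}\n```'))
```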
The cleaning layer fixes these inconsistencies with the following script:
```powershell
for ($i = 1; $i -le 100; $i++) {
    $path = "C:\mymodeldir\json_outputs\$i.json"
    if (!(Test-Path $path)) { continue }

    # Skip files that are not valid JSON
    try {
        $json = Get-Content $path -Raw | ConvertFrom-Json
    } catch { continue }

    $newItems = @()
    $newSum = 0

    foreach ($item in $json.line_items) {
        # Keep only digits and the decimal point
        $clean = $item.item_total -replace "[^0-9\.]", ""
        if ($clean -eq "") { continue }
        # Drop entries with implausibly short descriptions
        if ($item.item_desc.Length -lt 3) { continue }

        $item.item_total = $clean
        $newItems += $item
        $newSum += [double]$clean
    }

    # Recompute the total from the surviving line items
    $json.line_items = $newItems
    $json.total_amount = [math]::Round($newSum, 2)
    $json | ConvertTo-Json -Depth 6 | Out-File $path
}
```
This is the most important step for reliability.
Validation ensures:
- sum of items ≈ total amount
- financial consistency is maintained
Formula used: | Σ(items) – total | < tolerance
This compensates for:
- rounding errors
- OCR inconsistencies
Validation layer verifies correctness:
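A minimal Python sketch of the check above (field names follow the prompt’s schema; the tolerance value and function name are my assumptions):

```python
# Sketch of the validation rule |sum(items) - total| < tolerance.
def validate_receipt(receipt: dict, tolerance: float = 0.05) -> bool:
    item_sum = sum(float(i["item_total"]) for i in receipt.get("line_items", []))
    total = float(receipt.get("total_amount", 0))
    # Accept small discrepancies from rounding or OCR noise.
    return abs(item_sum - total) < tolerance

receipt = {
    "line_items": [
        {"item_desc": "Cheese", "item_total": "3.59"},
        {"item_desc": "Bread x4", "item_total": "4.40"},
    ],
    "total_amount": "7.99",
}
print(validate_receipt(receipt))  # True
```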
