When building ReceiptFlow, the goal was simple: take messy OCR output from receipts and convert it into clean, structured JSON. Tool calling seemed like the perfect solution because it promised strict structure and predictable outputs, reducing the need for heavy post-processing.
However, in a local setup using llama.cpp and Qwen 2.5 (3B), it failed consistently. Instead of producing structured JSON, the model ignored constraints, hallucinated data, and generated unreliable outputs. This article walks through the exact setup, experiments, observed failures, and the key realization that ultimately changed the direction of the pipeline.
Keywords
tool calling failure, llama.cpp structured output, Qwen 2.5 limitations, OCR to JSON extraction, LLM hallucination receipts, local LLM pipeline, JSON extraction errors
Introduction
When I started building a receipt processing pipeline using local LLMs, my initial approach was to use tool calling to enforce structured output. The idea was simple: define a strict schema and force the model to respond in that format. This is similar to how function calling works in APIs like OpenAI's, where the model is constrained to return structured JSON. However, when I implemented this in a local setup (llama.cpp + Qwen 2.5 3B), the results fell far short of expectations. This article documents the exact setup, experiments, failures, and why tool calling did not work reliably in this environment.
System Setup
The inference pipeline was built using:
- Model: Qwen 2.5 (3B, GGUF format)
- Runtime: llama.cpp (llama-server)
- Endpoint: http://127.0.0.1:8081/v1/chat/completions
Server Execution
./llama-server -m qwen-3b.gguf --port 8081
This setup allowed local inference without relying on external APIs, which was important for experimentation and control.
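As a sanity check, the endpoint can be exercised with a minimal OpenAI-style chat request before wiring in the real pipeline. Below is a quick sketch using only Python's standard library; the payload is illustrative, not the production prompt:

import json
import urllib.request

# Minimal OpenAI-compatible chat request against the local llama-server.
payload = {
    "messages": [{"role": "user", "content": "Reply with the single word: ready"}],
    "temperature": 0,
}
req = urllib.request.Request(
    "http://127.0.0.1:8081/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])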
Initial Approach: Tool Calling
The idea was to define a strict schema inside the prompt and instruct the model to ONLY output that structure.
Tool Schema
<tool_call>
  <tool_name>receipt_parser</tool_name>
  <arguments>
  {
    "merchant_name": "string",
    "date": "string",
    "total_amount": "string",
    "items": [...]
  }
  </arguments>
</tool_call>
Prompt Constraints:
- DO NOT explain anything
- ONLY output tool_call
- DO NOT hallucinate
- USE ONLY provided receipt
This was combined with OCR-extracted HTML as input.
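Concretely, each request looked roughly like the sketch below. The endpoint, schema fields, and constraints come from the setup above; the exact prompt wording and the shape of the helper are illustrative, not the verbatim production code:

import json
import urllib.request

ENDPOINT = "http://127.0.0.1:8081/v1/chat/completions"

TOOL_SCHEMA = """<tool_call>
<tool_name>receipt_parser</tool_name>
<arguments>
{"merchant_name": "string", "date": "string", "total_amount": "string", "items": []}
</arguments>
</tool_call>"""

def parse_receipt(ocr_html: str) -> str:
    # The schema and constraints live entirely in the prompt text;
    # nothing at the runtime level enforces them.
    prompt = (
        "DO NOT explain anything. ONLY output a tool_call. "
        "DO NOT hallucinate. USE ONLY the provided receipt.\n\n"
        f"Schema:\n{TOOL_SCHEMA}\n\nReceipt (OCR HTML):\n{ocr_html}"
    )
    payload = {"messages": [{"role": "user", "content": prompt}], "temperature": 0}
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

Note that the return value is whatever free text the model chose to produce; the failures below all start from that fact.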
Observed Behavior
Despite strict constraints, the model consistently failed to comply.
Common Failure Patterns
- Ignoring the tool schema
  - Output included explanations
  - Extra text outside the JSON
- Hallucinating data
  - Generated fake receipts
  - Ignored the input content
- Malformed outputs
  - Broken JSON
  - Missing fields
Example Failure
In several cases, the model responded with:
Since no receipt was provided, I will create a hypothetical example...
This clearly shows that the model was not respecting the input or constraints.
Root Cause Analysis
After multiple iterations, it became clear that the failure was not random but systemic. Four factors stood out:

1. Lack of tool calling enforcement. Unlike hosted APIs such as OpenAI's, the llama.cpp server as used here does not enforce tool calling. The schema is treated as plain text, and no structural constraints exist at runtime. In other words, the model is "suggested" to follow the format, not forced.
2. Stateless inference. Each request was independent: no conversation memory, no reinforcement of the output format. The model had to interpret the schema from scratch every time.
3. Model size limitations. At 3B parameters, the model has limited ability to strictly follow structured instructions and tends to prioritize natural language over a rigid format. This gets worse when the input is noisy (OCR HTML) and the prompt is complex.
4. Input complexity. The OCR HTML itself contained nested tags, inconsistent structure, and irrelevant tokens, which increased the cognitive load on the model.
What Changed
Instead of forcing tool calling, I simplified the approach: I removed the tool schema entirely and asked the model to output JSON directly.
Example:
Extract the following receipt into JSON with fields: merchant_name, date, items, total
Result After Change:
This change led to more consistent JSON output, reduced hallucination, and easier downstream processing.
While the model still made mistakes, outputs were predictable enough to fix.
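In practice, "predictable enough to fix" meant salvaging the JSON object from whatever text the model wrapped around it. A minimal sketch, assuming the output contains at most one top-level object:

import json
from typing import Optional

def salvage_json(model_output: str) -> Optional[dict]:
    # Take the span from the first '{' to the last '}' and try to parse it,
    # ignoring any explanation text the model added around the JSON.
    start = model_output.find("{")
    end = model_output.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        return json.loads(model_output[start:end + 1])
    except json.JSONDecodeError:
        return None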
Key Insight
Tool calling is not inherently flawed, but it requires:
- API-level enforcement
- Strong model alignment
- Structured runtime support
In local setups like llama.cpp: Prompt simplicity > schema rigidity
Practical Takeaway
For local LLM pipelines, avoid over-constraining the model, prefer simple, structured prompts, and handle strict validation in post-processing, as sketched below.
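For example, post-processing validation can be as simple as checking required keys after parsing and rejecting the response (for a retry or manual review) when they are missing. The field names below come from the prompt above; the pass/fail policy is an assumption:

REQUIRED_FIELDS = ("merchant_name", "date", "items", "total")

def validate_receipt(parsed) -> bool:
    # Reject anything that is not a dict carrying all required keys;
    # the caller can then re-prompt or flag the receipt for review.
    if not isinstance(parsed, dict):
        return False
    return all(field in parsed for field in REQUIRED_FIELDS)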
Conclusion
Tool calling failed not because of incorrect prompting, but because the underlying system does not support enforcing structured outputs. Switching to prompt-based JSON extraction proved to be more reliable and practical.
FAQ
Why did tool calling fail in this setup?
Because llama.cpp does not enforce structured outputs.
Was the issue related to prompt design?
No, the issue was primarily due to system limitations rather than prompt quality.
Did model size affect performance?
Yes, the 3B model struggled with strict structured constraints.
What worked better than tool calling?
Simple JSON extraction prompts without rigid schema enforcement.
Can tool calling work in local environments?
Only if the runtime provides enforcement mechanisms, which were not present here.
Next Step
Once tool calling was removed, the next challenge was selecting the right model for extraction.