How Markdown, JSON-LD and Schema.org Improve Vector Search RAGs and NLWeb

by Tim

Imagine you’re a fair-code developer who wants to twin yourself into an NLWeb chatbot. Picture feeding all your knowledge, in the form of code, blog posts, READMEs, and more, into AI training or indexing to create a digital AI twin that knows what you know. For NLWeb to index your content effectively, and for your digital AI twin to understand all the markdown files you have written, you need structured Schema.org JSON-LD data. In this article, we dig into why this matters so much and how you can get the most out of markdown for AI indexing.

An AI application that uses vector search and RAG needs to understand the context of its content. When you enrich that content with JSON-LD labels, the AI results get much better.

Boosting AI content search with Markdown and JSON-LD

Markdown: an underrated content type in AI indexing

Markdown is the underrated cornerstone of modern documentation, powering millions of GitHub READMEs, wikis, and technical guides. Its simplicity and readability make markdown a favorite documentation tool among developers. However, feeding markdown into AI-powered platforms like Microsoft’s NLWeb poses a significant challenge, because markdown is not optimized for it. The AI logic beneath NLWeb, Retrieval-Augmented Generation (RAG), the associated vector search, and similar systems relies on structured data, such as Schema.org JSON-LD or plain JSON, to improve natural language analysis and to capture relationships and context, which enables better conversational web experiences.

At iunera, we needed to index our markdown files into NLWeb as well as possible in order to build a knowledge base about OCTL, our copyleft, blockchain-based licensing technology that enables all developers and AI training dataset providers to earn fair compensation for their work. In doing so, we discovered that, to our knowledge, there is no tool that transforms markdown into JSON-LD structured content based on Schema.org guidelines.

Structured content that complies with Schema.org is more or less required if you want good results from any kind of vector database search, such as the one NLWeb uses. Providing information in a contextually and semantically readable way is the best insurance for surfacing relevant content, which in turn usually leads to a much better AI experience.

To help others create an outstanding AI experience, we created json-ld-markdown, a tool for AI indexing that transforms markdown into JSON-LD with Schema.org properties, providing the ultimate AI training or indexing data (yes, I know it is really vector database search optimization, but a bit of marketing slang is required to be found :)).

In general, this effort continues our previous contributions to make Microsoft’s NLWeb easier to use. So far, we have provided a tutorial-style guide for setting up NLWeb yourself, the nlweb-js-client, a Java library for structured data JSON-LD/Schema.org mapping, a Kubernetes Helm chart, a Docker container (under review in NLWeb PR #141), and now this markdown to Schema.org transformer library. More NLWeb material from iunera is planned for the near future.

Let us explore why structured data is important for AI and NLWeb, how JSON-LD enhances RAG and NLWeb, and how you can use our tool to create a digital AI twin of your knowledge base.

The Challenge: Markdown’s Limitations for AI and Structured Data

Markdown’s human-readable format, with headings, lists, and links, is ideal for documentation but problematic for AI. As IBM’s guide on structured vs. unstructured data explains, unstructured data like markdown requires complex processing to extract meaning, often leading to errors in entity recognition.

For example, a markdown README might describe a “feature” without clarifying whether it’s a software component or a product benefit, which confuses the indexing process of a vector database or an AI.

Without tools for transforming markdown to JSON-LD, developers face time-consuming manual annotation and transformation. iunera’s json-ld-markdown automates the creation of JSON-LD from markdown, making markdown files easily accessible as AI training material or, more precisely, as “vectorizable” documents. As a side effect, the transformed markdown pages also enhance traditional SEO through rich results.

Background: How Structured Data Powers NLWeb and RAG

To understand the value of JSON-LD, let’s explore how NLWeb and RAG applications work. NLWeb, detailed in its GitHub repository, enables AI-driven web interactions by ingesting structured data (e.g., JSON-LD, RSS) into vector databases like Qdrant or Azure AI Search. These databases support vector search, which converts text into numerical vectors to find similar content, and Retrieval-Augmented Generation (RAG), which combines retrieval with language model responses for accurate answers.

Without structured data, vector search lacks context, reducing accuracy. For instance, a markdown phrase like “bug fix” might be confused with “insect repellent” without metadata. JSON-LD for RAG provides this context, as Schema App argues, enabling precise entity recognition and knowledge graph integration. This is critical for NLWeb ingestion, ensuring conversational responses are relevant and effective.

Vector Search Example: With and Without Schema.org JSON-LD

As a foundation, one needs to understand how vector search is used in a RAG application or in NLWeb.

How does vector search work (simplified)

Think of how an AI uses vector search. A text, like this article, is split into chunks, and an embedding model turns each chunk into a vector, a list of numbers that captures its meaning and topics, in other words what the text is about. A prompt is turned into a vector of meaning and intention in the same way. The documents whose vectors are most similar to the prompt vector are then retrieved, and they serve as context for the AI to generate a good answer to the query.
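To make the idea concrete, here is a minimal sketch of the ranking step. The three-dimensional vectors are hand-made stand-ins for what an embedding model would produce (real embeddings have hundreds or thousands of dimensions), and the document names are purely illustrative assumptions.

```javascript
// Cosine similarity: close to 1 means similar direction (similar meaning),
// close to 0 means unrelated.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical toy embeddings; the dimensions loosely stand for
// [software setup, hardware, cooking]. A real model would compute these.
const documents = [
  { id: "project-x-readme", vector: [0.9, 0.1, 0.0] },
  { id: "printer-manual", vector: [0.2, 0.9, 0.0] },
  { id: "pasta-recipe", vector: [0.0, 0.1, 0.9] },
];

// Rank documents by similarity to the query vector, best match first.
function rankBySimilarity(queryVector, docs) {
  return docs
    .map((doc) => ({ id: doc.id, score: cosineSimilarity(queryVector, doc.vector) }))
    .sort((x, y) => y.score - x.score);
}

// Toy embedding of a query such as "How do I install Project X?"
const queryVector = [0.8, 0.2, 0.0];
const ranking = rankBySimilarity(queryVector, documents);
console.log(ranking.map((r) => r.id));
```

The top-ranked documents would then be passed to the language model as context.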

How does text search work (very simplified)

Imagine every document is simply analyzed word by word, and the words are counted. A prompt is analyzed for the words it contains in the same way. Then the documents in which the prompt’s words appear most often, and in combination, are retrieved.
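A rough sketch of this word-counting approach could look as follows. The scoring function and sample texts are illustrative assumptions; real text search engines use more refined weighting. The second document shows the weakness: it answers the question but shares no words with the query.

```javascript
// Split text into lowercase words, dropping punctuation.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

// Score a document by how often the query words occur in it.
function keywordScore(query, documentText) {
  const queryWords = new Set(tokenize(query));
  let score = 0;
  for (const word of tokenize(documentText)) {
    if (queryWords.has(word)) score += 1;
  }
  return score;
}

const query = "setup steps";
const withSameWords = "Setup: install Node.js and run the setup script.";
const withSynonyms = "Installation guide: clone the repo and install dependencies.";

console.log(keywordScore(query, withSameWords)); // 2 ("setup" appears twice)
console.log(keywordScore(query, withSynonyms)); // 0 (relevant, but no shared words)
```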

TL;DR: Vector search matches meaning rather than exact wording, which is why it is generally the better fit for conversational queries.

Let’s compare NLWeb’s vector search performance with and without JSON-LD.

Scenario: A user queries an NLWeb chatbot: “What are the setup steps for Project X?”

Vector search without JSON-LD:

Input: Raw markdown:

## Setup
Install Node.js, clone the repo, and run `npm install`.

Vector Search Process: NLWeb converts the text to vectors based on word patterns. Without metadata, “Setup” and “Install” are generic, potentially matching unrelated content (e.g., hardware setup guides).
Result: The chatbot retrieves irrelevant responses, such as setup instructions for another project.
Quality: Low precision due to lack of entity context.

Vector search with JSON-LD:

Input: Markdown processed by iunera’s json-ld-markdown tool:

{
  "@context": "http://schema.org",
  "@type": "HowTo",
  "name": "Setup for Project X",
  "step": [
    { "@type": "HowToStep", "text": "Install Node.js" },
    { "@type": "HowToStep", "text": "Clone the repo" },
    { "@type": "HowToStep", "text": "Run npm install" }
  ]
}

Vector Search Process: NLWeb recognizes “HowTo” and “HowToStep” entities, enriching vectors with metadata. “Setup” is clearly linked to software installation for “Project X.”
Result: The chatbot delivers precise setup steps, directly answering the query.
Quality: High precision, as Schema.org for AI clarifies intent and context.

This example shows how transforming markdown to JSON-LD improves vector search, making content AI-ready.
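One way a pipeline could carry this context into the vectors is to render each step together with its type and parent title before embedding it. The sketch below assumes exactly that; `howToToChunks` is a hypothetical helper, and NLWeb’s actual ingestion may work differently.

```javascript
// The HowTo object from the scenario above.
const howTo = {
  "@context": "http://schema.org",
  "@type": "HowTo",
  name: "Setup for Project X",
  step: [
    { "@type": "HowToStep", text: "Install Node.js" },
    { "@type": "HowToStep", text: "Clone the repo" },
    { "@type": "HowToStep", text: "Run npm install" },
  ],
};

// Hypothetical helper: one chunk per step, with the type and title kept
// in the text to embed and in the metadata stored next to the vector.
function howToToChunks(doc) {
  return doc.step.map((step, index) => ({
    text: `${doc["@type"]}: ${doc.name}. Step ${index + 1}: ${step.text}`,
    metadata: { type: step["@type"], parent: doc.name, position: index + 1 },
  }));
}

const chunks = howToToChunks(howTo);
console.log(chunks[0].text);
// prints: HowTo: Setup for Project X. Step 1: Install Node.js
```

Because every chunk now mentions “Project X” and its role as a setup step, a generic word like “Install” is less likely to match unrelated content.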

Markdown’s AI problem: Markdown does not provide semantic annotations

We have now seen why markdown needs to be transformed to JSON-LD. This is less straightforward than it first seems, because apart from structural information a markdown file seems to contain nothing to build on. Or does it?

One can extract semantic annotations by using an AI to extract the meaning from the text and then ingesting the result into a vector database, which is often done. The same can be done with a prompt. This improves the results, but automatic text understanding has limits. Therefore, it helps when the meaning and the context of the text are provided explicitly.

Schema.org JSON-LD solves this by providing machine-readable metadata that labels content with specific types, such as “Article,” “Person,” or “HowTo.” According to Google’s Structured Data Guide, JSON-LD also helps ordinary search engines and AI understand content, improving discoverability and functionality.

Proposed standardized mapping of Markdown to JSON-LD

iunera’s json-ld-markdown tool empowers developers and businesses to make markdown AI-ready. The goal is simply to derive meaning from the document structure and from commonly used trigger words. Markdown to JSON-LD for AI is developed as an open prototype, licensed under Fair Code.

It automatically maps markdown structures (e.g., headings, lists) to Schema.org types like “Article,” “FAQPage,” or “SoftwareApplication” and tries to retrieve as much semantic intent as possible from the markdown (e.g., by extracting linked entities and splitting an article into parts). Advanced users can even use inline annotations to label AI training data precisely with Schema.org, ensuring accurate metadata.
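As an illustration of what extracting linked entities and splitting an article into parts could mean, here is a minimal sketch. It is not the json-ld-markdown implementation: the function name `markdownToArticle` and the choice of `hasPart`, `WebPageElement`, and `mentions` are assumptions about one possible mapping.

```javascript
// Hypothetical sketch, not the json-ld-markdown implementation.
function markdownToArticle(markdown) {
  const lines = markdown.split("\n");

  // The first level-1 heading becomes the headline.
  const titleLine = lines.find((line) => /^# +/.test(line));
  const headline = titleLine ? titleLine.replace(/^# +/, "").trim() : undefined;

  // Each level-2 heading starts a new part; following lines become its text.
  const parts = [];
  for (const line of lines) {
    const heading = line.match(/^## +(.*)$/);
    if (heading) {
      parts.push({ "@type": "WebPageElement", name: heading[1].trim(), text: "" });
    } else if (parts.length > 0 && line.trim() !== "") {
      const current = parts[parts.length - 1];
      current.text = current.text ? `${current.text} ${line.trim()}` : line.trim();
    }
  }

  // Inline links of the form [label](url) become mentioned entities.
  const mentions = [];
  for (const match of markdown.matchAll(/\[([^\]]+)\]\(([^)\s]+)\)/g)) {
    mentions.push({ "@type": "Thing", name: match[1], url: match[2] });
  }

  return {
    "@context": "http://schema.org",
    "@type": "Article",
    headline,
    hasPart: parts,
    mentions,
  };
}

const sample = [
  "# Project X",
  "## Setup",
  "Install [Node.js](https://nodejs.org) first.",
  "## Usage",
  "Run the CLI.",
].join("\n");

const article = markdownToArticle(sample);
console.log(JSON.stringify(article, null, 2));
```

The sketch deliberately ignores nested headings, images, and code blocks, which a complete transformer would have to handle.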

Example of Markdown to Schema.org

Here is a simple example of automatically detecting and mapping an FAQ section:

## Frequently Asked Questions
### What is Project X?
Project X is a tool for data analysis.

The FAQ section is detected from the heading, the questions from the subheadings, and the answers from the text below them. Ultimately this results in:

{
  "@context": "http://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {
      "@type": "Question",
      "name": "What is Project X?",
      "acceptedAnswer": {
        "@type": "Answer",
        "text": "Project X is a tool for data analysis."
      }
    }
  ]
}
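The heading-based detection can be sketched in a few lines. This is a simplified illustration under our own assumptions (the function name `markdownFaqToJsonLd` and the trigger words are hypothetical), not the tool’s actual code.

```javascript
// Hypothetical sketch of heading-based FAQ detection.
function markdownFaqToJsonLd(markdown) {
  const questions = [];
  let inFaq = false;
  let current = null;

  for (const rawLine of markdown.split("\n")) {
    const line = rawLine.trim();
    const h2 = line.match(/^## +(.*)$/);
    const h3 = line.match(/^### +(.*)$/);

    if (h2) {
      // A level-2 heading opens an FAQ section only if it contains a trigger word.
      inFaq = /frequently asked questions|faq/i.test(h2[1]);
      current = null;
    } else if (inFaq && h3) {
      // Each level-3 heading inside the FAQ section is a question.
      current = {
        "@type": "Question",
        name: h3[1],
        acceptedAnswer: { "@type": "Answer", text: "" },
      };
      questions.push(current);
    } else if (inFaq && current && line !== "") {
      // Text below a question is collected as its answer.
      const answer = current.acceptedAnswer;
      answer.text = answer.text ? `${answer.text} ${line}` : line;
    }
  }

  if (questions.length === 0) return null;
  return { "@context": "http://schema.org", "@type": "FAQPage", mainEntity: questions };
}

const faqMarkdown = [
  "## Frequently Asked Questions",
  "### What is Project X?",
  "Project X is a tool for data analysis.",
].join("\n");

const faqJsonLd = markdownFaqToJsonLd(faqMarkdown);
console.log(JSON.stringify(faqJsonLd, null, 2));
```

For the markdown above, the sketch produces the same FAQPage structure as shown in the example.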

Key benefits of standardizing markdown mapping to structured data

Automation: Eliminates manual JSON-LD creation, saving time.
AI Compatibility: Enhances JSON-LD for RAG and structured data for NLWeb, improving vector search and conversational responses.
SEO Boost: Enables rich results as a side effect.

A complete open specification waiting for feedback

We have created a specification for how to transform markdown to JSON-LD, inferring semantic context and structuring the JSON-LD according to the structure intended by the markdown author. Additionally, we created a proposal for semantic annotations for JSON-LD, which makes interlinking documents easy.

Default Markdown to Structured JSON-LD Data Transformation

Suggestion: Extended Schema.org Markdown Annotation Format

We would be glad to receive feedback or contributions that extend our approach.
The complete grammar of the transformation is described in detail, so that transformers can easily be generated in any language.

Try markdown to Json-LD in action:

Let us know what you think, or test the markdown to Schema.org JSON-LD conversion online with the Markdown-to-JSON-LD Converter demo website.

The resulting JSON-LD is intended to be ready for NLWeb or SEO rich snippets, but we all make mistakes, so validate it with Google’s Rich Results Test to confirm compatibility. Share feedback or suggestions on json-ld-markdown GitHub to help us refine this prototype for transforming markdown to JSON-LD.
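To test the output as a code snippet, or to publish it on a web page, JSON-LD is usually embedded in a script element of type application/ld+json. The following sketch shows one way to produce such a snippet; `toJsonLdScriptTag` is a hypothetical helper and not part of json-ld-markdown.

```javascript
// Hypothetical helper: wrap a JSON-LD object in the script element
// that is used to embed structured data in HTML.
function toJsonLdScriptTag(jsonLd) {
  // Escape "<" so that a "</script>" inside a string value cannot end the element early.
  const json = JSON.stringify(jsonLd, null, 2).replace(/</g, "\\u003c");
  return `<script type="application/ld+json">\n${json}\n</script>`;
}

const faqPage = {
  "@context": "http://schema.org",
  "@type": "FAQPage",
  mainEntity: [
    {
      "@type": "Question",
      name: "What is Project X?",
      acceptedAnswer: { "@type": "Answer", text: "Project X is a tool for data analysis." },
    },
  ],
};

const snippet = toJsonLdScriptTag(faqPage);
console.log(snippet);
```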

For Developers: Integrate with the npm Package

Developers can integrate markdown to AI training data conversion into their projects using the json-ld-markdown npm package (check json-ld-markdown GitHub for the latest package name, as it’s in development). This library enables programmatic conversion of markdown to JSON-LD, ideal for NLWeb or RAG workflows. Example:

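// The package name may change while the tool is in development;
// see the json-ld-markdown GitHub repository for the current one.
const { convertMarkdownToJsonLd } = require('json-ld-markdown');

// Markdown source to convert: a title followed by a one-line description.
const markdown = '# Project X\nA data analysis tool.';

// Convert the markdown into a Schema.org JSON-LD object.
const jsonLd = convertMarkdownToJsonLd(markdown);
console.log(jsonLd); // Outputs Schema.org JSON-LD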

Conclusion: Building a Semantic Web with JSON-LD and Schema.org – together

We believe transforming markdown to JSON-LD is a stepping stone to a semantic web where AI systems like NLWeb deliver precise, conversational experiences. As Schema App explains, structured data is critical for LLMs, supporting voice search, knowledge graphs, and personalized interactions. By using Schema.org for AI, developers can make markdown content, such as GitHub READMEs or wikis, queryable, creating digital AI twins that answer questions with accuracy.

iunera’s contributions, including the nlweb-js-client, the Java library for JSON-LD Schema.org data types, the NLWeb K8s Helm chart, the Docker container, and pull requests to the official NLWeb implementation, show our commitment to advancing semantic search and AI (public or private).

The json-ld-markdown tool is open for contributions thanks to its fair licensing, and we invite contributions that enhance its Schema.org support, integrate it with other RAG platforms, or enable knowledge graph generation. Get involved by:

Testing at markdown-to-jsonld-ai.iunera.com.
Exploring our Java JSON-LD structured data transformation library for enterprise data.
Following Microsoft’s NLWeb for updates.

With markdown to JSON-LD, we believe we contribute to both private company RAGs and public RAGs, and that it can help enterprises build even more meaningful applications in which AI understands your content deeply.

We ask you to join us today:

Test the tool at markdown-to-jsonld-ai.iunera.com and validate with Google’s Rich Results Test
Integrate the npm package (see json-ld-markdown GitHub) with nlweb-js-client and test the JSON-LD output for NLWeb indexing.
Give us your feedback and suggestions, share the tool with friends, or even get involved in the development.

Let us know your challenges or support us by sharing the article

Check iunera.com to learn more about what we do!
