lastcrawler.xyz

2026-03-12

5 min read

RAG · LLMs · Embeddings

Structured Data for Embeddings: The Missing Piece in Your RAG Pipeline

Most RAG pipelines start the same way: scrape a bunch of pages, chunk the text, embed it, throw it in a vector store. Then you wonder why retrieval quality is mediocre.

The problem isn't your embedding model or your chunking strategy. It's the data going in.

Why web scraping for RAG pipelines produces garbage embeddings

When you scrape a webpage for RAG and dump the raw text into your pipeline, you're embedding navigation links, footer text, cookie consent copy, and ad placeholders right alongside the actual content. Your vector store doesn't know the difference between a product description and a sidebar widget.

```python
import requests
from bs4 import BeautifulSoup

# What most RAG pipelines do
html = requests.get(url).text
text = BeautifulSoup(html, "html.parser").get_text()  # 80% noise, 20% signal
chunks = split_into_chunks(text, 512)  # fixed-size windows, blind to structure
embeddings = embed(chunks)  # now your noise is searchable
```

The result: when your user asks "What's the price of the Pro plan?", the retriever pulls back a chunk that includes the pricing table, but also the navigation menu, a testimonial, and half a FAQ answer. The LLM has to figure out which part matters.

What structured extraction changes

Instead of feeding raw text into your pipeline, extract the specific fields you care about first. Define a schema, get clean data, then embed only what matters. You can convert any URL to clean markdown as a starting point, but for RAG the real gains come from structured JSON extraction.

```python
# What you should do instead
data = crawler.json(url, schema={
    "plans": [{
        "name": "string",
        "price": "string",
        "features": ["string"],
        "limits": "string"
    }]
})

# Now embed structured, clean data
for plan in data["plans"]:
    chunk = f"{plan['name']}: {plan['price']}. {', '.join(plan['features'])}"
    embed_and_store(chunk, metadata=plan)
```

The difference is stark. Your chunks are clean, your metadata is structured, and your retriever returns what the user actually asked for.

Better chunking through structure

The best chunking strategy isn't a fixed token window — it's semantic boundaries that your data already has. When you extract structured data, those boundaries come for free.

A product has a name, description, price, and features. A blog post has a title, author, date, and sections. An API reference has endpoints, parameters, and response schemas. Each of these is a natural chunk boundary.
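To make this concrete, here's a minimal sketch of chunking a blog post along its field boundaries. The field names (`title`, `author`, `sections`) are illustrative, not a fixed schema — use whatever your extraction schema actually returns.

```python
def chunks_from_post(post: dict) -> list[dict]:
    """One chunk per section of a blog post, with shared metadata attached."""
    meta = {"title": post["title"], "author": post["author"], "date": post["date"]}
    return [
        {
            "text": f"{post['title']}: {s['heading']}\n{s['body']}",
            "metadata": {**meta, "heading": s["heading"]},
        }
        for s in post["sections"]  # each section is a natural chunk boundary
    ]

post = {
    "title": "Release notes",
    "author": "Dana",
    "date": "2026-03-01",
    "sections": [
        {"heading": "Breaking changes", "body": "The v1 API is removed."},
        {"heading": "New features", "body": "Streaming uploads are supported."},
    ],
}
chunks = chunks_from_post(post)
print(len(chunks))  # one chunk per section: 2
```

No token counting, no overlap tuning — the structure decides where one chunk ends and the next begins.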

Metadata makes retrieval smarter

Structured extraction gives you metadata you can filter on before vector search even happens. Instead of searching all 50,000 chunks, filter to category: "pricing" first, then do similarity search on 200 chunks.

This hybrid approach (metadata filtering plus vector search) beats pure semantic search in most production RAG systems.
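A toy illustration of that two-stage retrieval, with made-up 2-dimensional vectors and categories — nothing here assumes a particular vector database:

```python
import numpy as np

# Toy index: each entry holds an embedding plus the structured metadata
# attached at ingest time. Vectors and categories are invented for illustration.
index = [
    {"vec": np.array([0.9, 0.1]), "meta": {"category": "pricing"}, "text": "Pro plan: $49/mo"},
    {"vec": np.array([0.2, 0.8]), "meta": {"category": "blog"},    "text": "Our 2026 roadmap"},
    {"vec": np.array([0.8, 0.3]), "meta": {"category": "pricing"}, "text": "Free plan: $0/mo"},
]

def search(query_vec: np.ndarray, category: str, k: int = 2) -> list[dict]:
    # 1. Cheap metadata filter narrows the candidate set first.
    candidates = [e for e in index if e["meta"]["category"] == category]

    # 2. Cosine similarity only over the survivors.
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    return sorted(candidates, key=lambda e: cos(query_vec, e["vec"]), reverse=True)[:k]

results = search(np.array([1.0, 0.0]), category="pricing")
print([r["text"] for r in results])  # the blog entry never enters the ranking
```

At 50,000 chunks the filter is what keeps the similarity search over a few hundred candidates instead of the whole store.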

The pipeline we recommend

  1. Crawl the source pages with structured extraction
  2. Store the raw structured data in a database (you'll want it later)
  3. Chunk by semantic boundaries, not token counts
  4. Embed clean text with structured metadata attached
  5. Retrieve with metadata filters plus vector similarity

The crawling step is where most teams cut corners. Don't. Clean data in means useful answers out. For a complete walkthrough of this flow, see our guide on building a web crawl to vector database pipeline.
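The five steps above can be sketched as a runnable miniature, with in-memory stand-ins (`fake_extract`, a toy `embed`) replacing the real crawler, database, and vector store:

```python
def fake_extract(url: str) -> dict:
    # Step 1 stand-in: pretend structured extraction returned clean JSON.
    return {"plans": [{"name": "Pro", "price": "$49/mo", "features": ["SSO"]}]}

def embed(text: str) -> list[int]:
    # Toy "embedding" (length + word-gap count), just to make the flow executable.
    return [len(text), text.count(" ")]

db, store = {}, []  # stand-ins for the database (step 2) and vector store (step 4)

for url in ["https://example.com/pricing"]:
    record = fake_extract(url)                      # 1. structured extraction
    db[url] = record                                # 2. keep the raw structured data
    for plan in record["plans"]:                    # 3. chunk by semantic boundary
        text = f"{plan['name']}: {plan['price']}. {', '.join(plan['features'])}"
        store.append({
            "vec": embed(text),                     # 4. embed clean text...
            "meta": {"category": "pricing", **plan},  # ...with metadata attached
            "text": text,
        })

# 5. At query time, filter on metadata before any similarity search.
pricing = [e for e in store if e["meta"]["category"] == "pricing"]
print(len(pricing))
```

Swap the stand-ins for your actual extraction call, database, and vector store and the shape of the pipeline stays the same.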

FAQ

Why does web scraping for RAG pipelines produce poor retrieval quality?

Most pipelines scrape raw HTML, strip the tags, and embed everything — including navigation links, footers, cookie banners, and ad copy. The vector store can't distinguish signal from noise, so retrieval pulls back chunks full of irrelevant content alongside what the user actually asked for.

How does structured data improve embeddings for RAG?

When you extract specific fields via a JSON schema before embedding, you control exactly what goes into each chunk. The result is clean, coherent text with structured metadata you can filter on, which improves both retrieval precision and LLM answer quality.

What's the best chunking strategy for a RAG pipeline built on web data?

Use semantic boundaries from your data structure rather than fixed token windows. A product's name, description, and feature list is a natural chunk. A blog post's section is a natural chunk. These boundaries come for free when you extract structured data — you don't need to guess where to split.
