2026-03-27
9 min read
Web Scraping for RAG Pipelines: From URL to Vector Store
RAG pipelines are only as good as what's in your vector store. Clean, well-structured content produces precise, grounded answers. Raw HTML full of navigation menus, cookie banners, and footer text produces an LLM that hallucinates confidently about your privacy policy.
The web scraping step is where most RAG pipelines go wrong. Not because scraping is hard, but because the default output of scraping tools — raw HTML or barely-cleaned text — is the wrong input format for embeddings.
Why raw HTML destroys RAG quality
Consider a typical product documentation page. The actual content — the documentation — is maybe 40% of the page's text. The rest is:
- Navigation menus and breadcrumbs
- Footer links and legal text
- Sidebar with related articles
- Cookie consent banners
- Search bars and form elements
- Header with logo and account links
- Social sharing buttons
- Comment sections
When you embed this page as-is, 60% of your vectors represent navigation cruft. When a user asks "how do I configure authentication?", the retriever might return chunks that include "Home > Docs > API > Authentication | Sign up | Log in | Cookie Policy" mixed in with the actual answer.
This isn't a retrieval algorithm problem. It's a data quality problem.
The extraction pipeline that works
Step 1: Extract clean content, not HTML
The first step is getting clean content from each URL. You need the meaningful text, properly structured, without the boilerplate.
Using markdown extraction:
```python
import requests

def extract_markdown(url: str) -> str:
    response = requests.post("https://lastcrawler.xyz/api/markdown", json={
        "url": url
    })
    response.raise_for_status()  # fail loudly on auth walls and server errors
    return response.json()["markdown"]
```
Markdown extraction strips navigation, ads, and boilerplate, giving you just the content with heading structure intact. That's what you want for RAG: the heading hierarchy maps well to semantic chunks. For more on how this endpoint works, see our guide on the URL to Markdown API.
Using JSON extraction for structured data:
```python
DOC_SCHEMA = {
    "title": "string",
    "sections": [{
        "heading": "string",
        "content": "string",
        "code_examples": ["string"]
    }],
    "metadata": {
        "last_updated": "string",
        "author": "string",
        "category": "string"
    }
}

def extract_structured(url: str, schema: dict = DOC_SCHEMA) -> dict:
    response = requests.post("https://lastcrawler.xyz/api/json", json={
        "url": url,
        "schema": schema
    })
    response.raise_for_status()
    return response.json()
```
JSON extraction gives you pre-chunked content with metadata. Each section is already a natural chunk boundary. Learn more about schema-driven extraction in our URL to JSON API guide.
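Flattening those pre-chunked sections into embeddable strings is a one-liner per section. A minimal sketch, assuming the documentation schema from Step 1 (the `sections` and `code_examples` keys come from that schema):

```python
def sections_to_chunks(doc: dict) -> list[str]:
    """Turn each extracted section into one chunk, heading first for context."""
    chunks = []
    for section in doc.get("sections", []):
        text = f"{section['heading']}\n\n{section['content']}"
        # Keep code examples with their section so retrieval returns them together
        for example in section.get("code_examples", []):
            text += f"\n\n{example}"
        chunks.append(text)
    return chunks
```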
Step 2: Smart chunking
Don't chunk blindly by token count. Use the document structure.
```python
def chunk_markdown(markdown: str, max_tokens: int = 512) -> list[str]:
    """Chunk markdown by heading boundaries, respecting max token count.

    Note: word count (str.split) is used as a cheap proxy for token count.
    """
    sections = []
    current_section = {"heading": "", "content": "", "level": 0}
    for line in markdown.split("\n"):
        if line.startswith("#"):
            # New heading — flush current section
            if current_section["content"].strip():
                sections.append(current_section.copy())
            level = len(line) - len(line.lstrip("#"))
            current_section = {
                "heading": line.lstrip("# ").strip(),
                "content": "",
                "level": level
            }
        else:
            current_section["content"] += line + "\n"
    if current_section["content"].strip():
        sections.append(current_section)

    # Split oversized sections by paragraph
    chunks = []
    for section in sections:
        text = f"{section['heading']}\n\n{section['content']}"
        if len(text.split()) <= max_tokens:
            chunks.append(text)
        else:
            paragraphs = section["content"].split("\n\n")
            current_chunk = section["heading"] + "\n\n"
            for para in paragraphs:
                if len((current_chunk + para).split()) > max_tokens:
                    chunks.append(current_chunk.strip())
                    current_chunk = section["heading"] + "\n\n" + para + "\n\n"
                else:
                    current_chunk += para + "\n\n"
            if current_chunk.strip():
                chunks.append(current_chunk.strip())
    return chunks
```
The heading hierarchy from markdown extraction gives you natural chunk boundaries. Each chunk starts with its section heading, providing context for the embedding.
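Deeply nested sections can still lose context when only the nearest heading is kept. One optional refinement, sketched here rather than part of the pipeline above, tracks the full heading path and prefixes each chunk with a breadcrumb:

```python
def heading_path(headings: list[tuple[int, str]]) -> str:
    """Collapse a sequence of (level, heading) pairs into the breadcrumb
    for the last heading, e.g. 'API > Authentication > Tokens'."""
    path: list[tuple[int, str]] = []
    for level, heading in headings:
        # Pop siblings and deeper levels before descending into this heading
        while path and path[-1][0] >= level:
            path.pop()
        path.append((level, heading))
    return " > ".join(h for _, h in path)
```

Prefixing chunks with this path gives the embedding model the same orientation a reader gets from the page's table of contents.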
Step 3: Embed and store
```python
from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("docs")

def ingest_url(url: str):
    # Extract clean markdown
    markdown = extract_markdown(url)
    # Chunk by heading structure
    chunks = chunk_markdown(markdown)
    # Embed and store
    for i, chunk in enumerate(chunks):
        embedding = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk
        ).data[0].embedding
        collection.add(
            documents=[chunk],
            embeddings=[embedding],
            ids=[f"{url}_{i}"],
            metadatas=[{"source_url": url, "chunk_index": i}]
        )
```
Step 4: Batch ingest with crawling
For ingesting an entire site:
```python
def ingest_site(base_url: str, max_pages: int = 100):
    # Crawl the site to get all URLs
    crawl_response = requests.post("https://lastcrawler.xyz/api/links", json={
        "url": base_url
    })
    urls = crawl_response.json()["links"][:max_pages]
    for url in urls:
        try:
            ingest_url(url)
            print(f"Ingested: {url}")
        except Exception as e:
            print(f"Failed: {url} — {e}")
```
Markdown vs JSON extraction for RAG
Use markdown when:
- Ingesting documentation, blog posts, or articles
- Content is primarily text with heading structure
- You want to preserve the natural reading flow
- Chunk boundaries should follow document structure
Use JSON when:
- Ingesting product catalogs, directories, or listings
- You need specific fields extracted consistently
- Content is structured data (prices, ratings, specs)
- You want pre-defined chunk boundaries via schema
Example: Product catalog RAG
```python
# JSON extraction for structured product data
products = extract_structured("https://store.example.com/catalog", schema={
    "products": [{
        "name": "string",
        "description": "string",
        "price": "number",
        "specs": [{"key": "string", "value": "string"}],
        "category": "string"
    }]
})

# Each product becomes a naturally-bounded chunk
for i, product in enumerate(products["products"]):
    chunk = f"""Product: {product['name']}
Category: {product['category']}
Price: ${product['price']}
Description: {product['description']}
Specs: {', '.join(f"{s['key']}: {s['value']}" for s in product.get('specs', []))}"""
    # Embed and store with rich metadata
    collection.add(
        documents=[chunk],
        ids=[f"catalog_{i}"],
        metadatas=[{
            "product_name": product["name"],
            "price": product["price"],
            "category": product["category"]
        }]
    )
```
Common mistakes
1. Embedding raw HTML
Every <nav>, <footer>, and <script> tag in your embeddings is noise that degrades retrieval quality. Always extract clean content first.
2. Fixed-size chunking without structure awareness
Splitting by 512 tokens regardless of content structure means chunks that start mid-sentence and end mid-paragraph. Use heading boundaries.
3. No metadata
Store the source URL, extraction date, section heading, and content type as metadata. Without metadata, you can't filter, update, or debug your vector store.
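Metadata is what makes filtered retrieval possible; most vector stores expose this as a `where`-style filter applied before similarity ranking. A plain-Python stand-in showing the idea (the record shape here is hypothetical, not a real store's API):

```python
def filter_by_metadata(records: list[dict], **criteria) -> list[dict]:
    """Keep records whose metadata matches every criterion, mimicking
    a vector store's metadata filter."""
    return [
        r for r in records
        if all(r["metadata"].get(key) == value for key, value in criteria.items())
    ]
```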
4. Not handling extraction failures
Some pages will fail: auth walls, CAPTCHAs, server errors. Log failures and retry. Don't silently skip pages or you'll end up with blind spots in your knowledge base.
5. Stale data
Web content changes. Set up periodic re-crawls and update your vector store. The /crawl endpoint with a schedule handles this at the infrastructure level. For a full walkthrough of this pipeline, see web crawl to vector database.
FAQ
Q: How often should I re-crawl for RAG freshness?
A: Depends on how fast the source content changes. Documentation sites: weekly. News/blogs: daily. Product catalogs: daily or more. Use the lastModified field from crawl results to skip unchanged pages.
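The skip-unchanged idea can be sketched as follows, assuming each crawl result carries the lastModified value mentioned above and you track what you last ingested (timestamps shown as ISO date strings, which compare correctly as plain strings):

```python
def urls_to_refresh(crawl_results: list[dict], last_ingested: dict[str, str]) -> list[str]:
    """Return URLs that are new, or changed since the last ingest."""
    stale = []
    for page in crawl_results:
        url = page["url"]
        modified = page.get("lastModified", "")
        # New URLs are always ingested; known URLs only if modified more recently
        if url not in last_ingested or modified > last_ingested[url]:
            stale.append(url)
    return stale
```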
Q: Should I use markdown or plain text for embeddings?
A: Markdown. The heading markers (#, ##, ###) provide structural signals that help embedding models understand document hierarchy. They also make retrieved chunks more readable when passed as LLM context.
Q: How many tokens per chunk is optimal?
A: 256-512 tokens for most use cases. Smaller chunks give more precise retrieval but lose context. Larger chunks provide more context but reduce precision. Start at 512 and adjust based on retrieval quality testing.