lastcrawler.xyz


2026-03-27

9 min read

RAG · Web Scraping · AI · Vector Database · LLM

Web Scraping for RAG Pipelines: From URL to Vector Store

RAG pipelines are only as good as what's in your vector store. Clean, well-structured content produces precise, grounded answers. Raw HTML full of navigation menus, cookie banners, and footer text produces an LLM that hallucinates confidently about your privacy policy.

The web scraping step is where most RAG pipelines go wrong. Not because scraping is hard, but because the default output of scraping tools — raw HTML or barely-cleaned text — is the wrong input format for embeddings.

Why raw HTML destroys RAG quality

Consider a typical product documentation page. The actual content — the documentation — is maybe 40% of the page's text. The rest is:

- navigation menus and breadcrumbs
- header and footer links
- cookie banners and sign-up prompts
- sidebar widgets, "related articles" blocks, and ads

When you embed this page as-is, 60% of your vectors represent navigation cruft. When a user asks "how do I configure authentication?", the retriever might return chunks that include Home > Docs > API > Authentication | Sign up | Log in | Cookie Policy mixed in with the actual answer.

This isn't a retrieval algorithm problem. It's a data quality problem.

The extraction pipeline that works

Step 1: Extract clean content, not HTML

The first step is getting clean content from each URL. You need the meaningful text, properly structured, without the boilerplate.

Using markdown extraction:

python

import requests

def extract_markdown(url: str) -> str:
    response = requests.post("https://lastcrawler.xyz/api/markdown", json={
        "url": url
    })
    response.raise_for_status()
    return response.json()["markdown"]

Markdown extraction strips navigation, ads, and boilerplate, giving you just the content with heading structure intact. That's what you want for RAG -- the heading hierarchy maps well to semantic chunks. For more on how this endpoint works, see our guide on the URL to Markdown API.
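If the extraction API is unreachable, you can approximate boilerplate removal locally while you retry. A minimal sketch using only the standard library — the tag list here is a heuristic of my choosing, not a full readability algorithm:

```python
from html.parser import HTMLParser

# Elements whose entire subtree is treated as boilerplate (heuristic).
BOILERPLATE_TAGS = {"nav", "footer", "header", "aside", "script", "style"}

class ContentExtractor(HTMLParser):
    """Collects text content, skipping anything inside boilerplate elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # >0 while inside a boilerplate subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())

def strip_boilerplate(html: str) -> str:
    """Crude local fallback: text content minus nav/footer/script subtrees."""
    parser = ContentExtractor()
    parser.feed(html)
    return "\n".join(parser.parts)
```

This loses heading structure (everything becomes flat text), which is exactly why dedicated markdown extraction is the better default for RAG.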

Using JSON extraction for structured data:

python

def extract_structured(url: str, schema: dict | None = None) -> dict:
    response = requests.post("https://lastcrawler.xyz/api/json", json={
        "url": url,
        "schema": schema or {
            "title": "string",
            "sections": [{
                "heading": "string",
                "content": "string",
                "code_examples": ["string"]
            }],
            "metadata": {
                "last_updated": "string",
                "author": "string",
                "category": "string"
            }
        }
    })
    response.raise_for_status()
    return response.json()

JSON extraction gives you pre-chunked content with metadata. Each section is already a natural chunk boundary. Learn more about schema-driven extraction in our URL to JSON API guide.
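Converting that response into chunks is mechanical. A sketch of the mapping — the field names follow the schema in the example above, and the response shape is an assumption:

```python
def sections_to_chunks(doc: dict) -> list[dict]:
    """Turn a structured extraction result into (text, metadata) chunks."""
    chunks = []
    for i, section in enumerate(doc.get("sections", [])):
        # Prefix the page title and heading so each chunk carries its context.
        text = f"{doc.get('title', '')}: {section['heading']}\n\n{section['content']}"
        for code in section.get("code_examples", []):
            text += f"\n\nCode:\n{code}"
        chunks.append({
            "text": text,
            "metadata": {
                "title": doc.get("title", ""),
                "heading": section["heading"],
                "section_index": i,
                **doc.get("metadata", {}),
            },
        })
    return chunks
```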

Step 2: Smart chunking

Don't chunk blindly by token count. Use the document structure.

python

def chunk_markdown(markdown: str, max_tokens: int = 512) -> list[str]:
    """Chunk markdown by heading boundaries, respecting max token count.

    Token counts are approximated by whitespace-separated word counts.
    """
    sections = []
    current_section = {"heading": "", "content": "", "level": 0}

    for line in markdown.split("\n"):
        if line.startswith("#"):
            # New heading — flush current section
            if current_section["content"].strip():
                sections.append(current_section.copy())

            level = len(line) - len(line.lstrip("#"))
            current_section = {
                "heading": line.lstrip("# ").strip(),
                "content": "",
                "level": level
            }
        else:
            current_section["content"] += line + "\n"

    if current_section["content"].strip():
        sections.append(current_section)

    # Split oversized sections by paragraph
    chunks = []
    for section in sections:
        text = f"{section['heading']}\n\n{section['content']}"
        if len(text.split()) <= max_tokens:
            chunks.append(text)
        else:
            paragraphs = section["content"].split("\n\n")
            current_chunk = section["heading"] + "\n\n"
            for para in paragraphs:
                if len((current_chunk + para).split()) > max_tokens:
                    chunks.append(current_chunk.strip())
                    current_chunk = section["heading"] + "\n\n" + para + "\n\n"
                else:
                    current_chunk += para + "\n\n"
            if current_chunk.strip():
                chunks.append(current_chunk.strip())

    return chunks

The heading hierarchy from markdown extraction gives you natural chunk boundaries. Each chunk starts with its section heading, providing context for the embedding.
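Sentences that straddle a chunk boundary are a common retrieval failure; a small amount of overlap between adjacent chunks mitigates it. A sketch of an optional post-processing step (the word-based overlap is, again, a rough proxy for tokens):

```python
def add_overlap(chunks: list[str], overlap_words: int = 50) -> list[str]:
    """Prepend the tail of the previous chunk to each chunk for continuity."""
    result = []
    for i, chunk in enumerate(chunks):
        if i == 0:
            result.append(chunk)
            continue
        tail = " ".join(chunks[i - 1].split()[-overlap_words:])
        result.append(tail + "\n\n" + chunk)
    return result
```

Overlap trades a little storage and embedding cost for robustness at boundaries; 10-15% of the chunk size is a common starting point.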

Step 3: Embed and store

python

from openai import OpenAI
import chromadb

client = OpenAI()
chroma = chromadb.Client()
collection = chroma.create_collection("docs")

def ingest_url(url: str):
    # Extract clean markdown
    markdown = extract_markdown(url)

    # Chunk by heading structure
    chunks = chunk_markdown(markdown)

    # Embed and store
    for i, chunk in enumerate(chunks):
        embedding = client.embeddings.create(
            model="text-embedding-3-small",
            input=chunk
        ).data[0].embedding

        collection.add(
            documents=[chunk],
            embeddings=[embedding],
            ids=[f"{url}_{i}"],
            metadatas=[{"source_url": url, "chunk_index": i}]
        )

Step 4: Batch ingest with crawling

For ingesting an entire site:

python

def ingest_site(base_url: str, max_pages: int = 100):
    # Crawl the site to get all URLs
    crawl_response = requests.post("https://lastcrawler.xyz/api/links", json={
        "url": base_url
    })
    urls = crawl_response.json()["links"][:max_pages]

    for url in urls:
        try:
            ingest_url(url)
            print(f"Ingested: {url}")
        except Exception as e:
            print(f"Failed: {url} — {e}")

Markdown vs JSON extraction for RAG

Use markdown when:

- the content is prose or documentation, and the heading hierarchy is enough to chunk by
- you want a single clean text field per page with no schema to maintain

Use JSON when:

- you need specific fields (titles, prices, authors, categories) rather than free text
- you want pre-chunked sections with structured metadata already attached

Example: Product catalog RAG

python

# JSON extraction for structured product data
products = extract_structured("https://store.example.com/catalog", schema={
    "products": [{
        "name": "string",
        "description": "string",
        "price": "number",
        "specs": {"key": "string", "value": "string"},
        "category": "string"
    }]
})

# Each product becomes a naturally-bounded chunk
for product in products["products"]:
    chunk = f"""Product: {product['name']}
Category: {product['category']}
Price: ${product['price']}
Description: {product['description']}
Specs: {', '.join(f"{s['key']}: {s['value']}" for s in product.get('specs', []))}"""

    # Embed and store with rich metadata
    collection.add(
        documents=[chunk],
        metadatas=[{
            "product_name": product["name"],
            "price": product["price"],
            "category": product["category"]
        }]
    )

Common mistakes

1. Embedding raw HTML

Every <nav>, <footer>, and <script> tag in your embeddings is noise that degrades retrieval quality. Always extract clean content first.

2. Fixed-size chunking without structure awareness

Splitting by 512 tokens regardless of content structure means chunks that start mid-sentence and end mid-paragraph. Use heading boundaries.

3. No metadata

Store the source URL, extraction date, section heading, and content type as metadata. Without metadata, you can't filter, update, or debug your vector store.

4. Not handling extraction failures

Some pages will fail -- auth walls, CAPTCHAs, server errors. Log failures and retry. Don't silently skip pages or you'll end up with blind spots in your knowledge base.
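A minimal retry wrapper with exponential backoff — the attempt count and delays are illustrative defaults:

```python
import time

def with_retries(fn, *args, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(*args), retrying with exponential backoff on failure."""
    for attempt in range(attempts):
        try:
            return fn(*args)
        except Exception as e:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error, don't swallow it
            delay = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {delay:.0f}s")
            time.sleep(delay)
```

Used as `with_retries(ingest_url, url)`, this turns transient failures into log lines and persistent failures into visible exceptions.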

5. Stale data

Web content changes. Set up periodic re-crawls and update your vector store. The /crawl endpoint with a schedule handles this at the infrastructure level. For a full walkthrough of this pipeline, see web crawl to vector database.
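The freshness check itself is a simple comparison. A sketch — the `lastModified` field follows the FAQ below, and the shape of the stored bookkeeping dict is an assumption:

```python
def urls_to_refresh(crawl_results: list[dict], stored: dict[str, str]) -> list[str]:
    """Return URLs whose lastModified differs from what we last ingested."""
    stale = []
    for page in crawl_results:
        url = page["url"]
        last_modified = page.get("lastModified", "")
        if stored.get(url) != last_modified:
            stale.append(url)
    return stale
```

Skipping unchanged pages keeps re-crawl runs cheap: only the stale subset is re-extracted, re-chunked, and re-embedded.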

FAQ

Q: How often should I re-crawl for RAG freshness?

A: Depends on how fast the source content changes. Documentation sites: weekly. News/blogs: daily. Product catalogs: daily or more. Use the lastModified field from crawl results to skip unchanged pages.

Q: Should I use markdown or plain text for embeddings?

A: Markdown. The heading markers (#, ##, ###) provide structural signals that help embedding models understand document hierarchy. They also make retrieved chunks more readable when passed as LLM context.

Q: How many tokens per chunk is optimal?

A: 256-512 tokens for most use cases. Smaller chunks give more precise retrieval but lose context. Larger chunks provide more context but reduce precision. Start at 512 and adjust based on retrieval quality testing.
