
2026-03-10

8 min read

RAG · Vector Database · Tutorial

From Web Crawl to Vector Database: Building a RAG Pipeline

Most RAG tutorials start at step three: you already have clean text, now chunk it. That's not how it works in practice. In practice, you have a list of URLs and a deadline.

This is the full pipeline: URL in, retrievable knowledge out.

The pipeline at a glance

  1. Crawl pages and extract clean content
  2. Chunk the content at semantic boundaries
  3. Generate embeddings
  4. Store in a vector database with metadata
  5. Query with hybrid search

Each step has decisions that compound. Cut corners early and every downstream step gets worse.

Step 1: Crawl and extract content

Resist the urge to requests.get your way through a site. As we covered in web scraping for RAG pipelines, raw HTML is noisy -- navigation, footers, cookie banners -- and that noise ends up in your embeddings.

Use batch crawl to pull markdown from multiple pages in one call. Our URL to markdown API strips most HTML structure, so markdown is a reasonable starting point. For structured content (product pages, API docs, pricing), use JSON extraction with a schema instead.

python

import httpx

pages = [
    "https://docs.example.com/getting-started",
    "https://docs.example.com/api-reference",
    "https://docs.example.com/examples",
]

response = httpx.post(
    "https://your-crawler/api/browser/crawl",
    json={
        "urls": pages,
        "format": "markdown",
        "onlyMainContent": True,
    }
)

results = response.json()  # list of {url, content, title}

The onlyMainContent flag tells the crawler to ignore navigation and boilerplate. It's not perfect, but it removes the worst offenders.

For a docs site, this gives you clean prose per page. For product or pricing pages, switch to format: "json" with a schema so you get structured fields instead of flat text.
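
As a sketch of what that request might look like (the exact schema payload shape depends on your crawler's API, so treat the field names here as assumptions):

python

# Hypothetical request -- the "schema" payload shape is an assumption;
# check your crawler's docs for the exact parameter format.
response = httpx.post(
    "https://your-crawler/api/browser/crawl",
    json={
        "urls": ["https://shop.example.com/pricing"],
        "format": "json",
        "schema": {
            "plan_name": "string",
            "monthly_price": "number",
            "features": "list of strings",
        },
    },
)
structured = response.json()  # fields match the schema instead of flat markdown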

Step 2: Clean and chunk the content

Fixed token windows are a lazy default. They split sentences mid-thought, break code examples in half, and ignore every semantic signal in the document.

Structured extraction gives you better chunks for free. A page with H2 sections is already pre-chunked. Each section is a natural unit of meaning.

python

def chunk_markdown(content: str, source_url: str) -> list[dict]:
    """Split markdown by H2/H3 headings, preserve metadata."""
    chunks = []
    current_heading = "Introduction"
    current_text = []

    for line in content.splitlines():
        if line.startswith("## ") or line.startswith("### "):
            if current_text:
                chunks.append({
                    "text": "\n".join(current_text).strip(),
                    "heading": current_heading,
                    "source": source_url,
                })
            current_heading = line.lstrip("#").strip()
            current_text = []
        else:
            current_text.append(line)

    # Don't forget the last section
    if current_text:
        chunks.append({
            "text": "\n".join(current_text).strip(),
            "heading": current_heading,
            "source": source_url,
        })

    return [c for c in chunks if len(c["text"]) > 100]  # drop stubs


all_chunks = []
for page in results:
    chunks = chunk_markdown(page["content"], page["url"])
    all_chunks.extend(chunks)

print(f"{len(all_chunks)} chunks from {len(results)} pages")

The source and heading fields become metadata you can filter on later. Keep them.

For content without clear heading structure, fall back to a sliding window with overlap — but make it large enough to preserve context. 512 tokens with a 64-token overlap is a reasonable floor.
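
When you do need the fallback, here is a minimal sliding-window sketch. It approximates tokens with whitespace-split words to stay dependency-free; swap in a real tokenizer if you need exact counts:

python

def sliding_window_chunks(text: str, source_url: str,
                          window: int = 512, overlap: int = 64) -> list[dict]:
    """Fallback chunker: fixed window with overlap, words as a token proxy."""
    words = text.split()
    step = window - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = " ".join(words[start : start + window])
        if len(piece) > 100:  # same stub filter as chunk_markdown
            chunks.append({"text": piece, "heading": "Untitled", "source": source_url})
    return chunks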

Step 3: Generate embeddings

The embedding model matters less than people think at this stage. Pick one that fits your language and domain, then stay consistent -- you can't mix embeddings from different models in the same index.

python

import openai

client = openai.OpenAI()

def embed_chunks(chunks: list[dict], batch_size: int = 100) -> list[dict]:
    """Add embeddings to chunks in batches."""
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i : i + batch_size]
        texts = [f"{c['heading']}\n\n{c['text']}" for c in batch]

        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=texts,
        )

        for chunk, embedding_obj in zip(batch, response.data):
            chunk["embedding"] = embedding_obj.embedding

    return chunks


embedded_chunks = embed_chunks(all_chunks)

Prepending the section heading to the text before embedding improves retrieval for hierarchical content. The heading gives the model context that the body text often lacks.

Alternatives: Cohere embed-english-v3.0 is competitive and has a native reranker you can add later. For self-hosted, BAAI/bge-large-en-v1.5 runs well on a single GPU and matches OpenAI small on most benchmarks.
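
If you go self-hosted, a minimal sketch with sentence-transformers (bge models work best with normalized embeddings):

python

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def embed_chunks_local(chunks: list[dict]) -> list[dict]:
    """Same interface as embed_chunks, but runs locally."""
    texts = [f"{c['heading']}\n\n{c['text']}" for c in chunks]
    vectors = model.encode(texts, normalize_embeddings=True, batch_size=64)
    for chunk, vector in zip(chunks, vectors):
        chunk["embedding"] = vector.tolist()
    return chunks

Note that bge-large-en-v1.5 outputs 1024-dimensional vectors, so the index dimension in step 4 changes accordingly.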

Step 4: Store in a vector database

Pinecone is managed, fast, and has good filtering support. Good default if you don't want to run infrastructure.

Chroma is open-source and runs in-process for development, which makes local iteration fast. Less battle-tested at scale.

Weaviate is open-source with built-in BM25 hybrid search, which is useful if you want keyword and vector search in one query without extra plumbing.

The code below uses Pinecone, but the pattern is the same for all three: upsert vectors with an ID, the embedding, and metadata.

python

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="your-api-key")

index_name = "docs-rag"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # text-embedding-3-small output size
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index(index_name)

vectors = [
    {
        "id": f"{chunk['source']}-{i}",
        "values": chunk["embedding"],
        "metadata": {
            "text": chunk["text"],
            "heading": chunk["heading"],
            "source": chunk["source"],
        },
    }
    for i, chunk in enumerate(embedded_chunks)
]

# Upsert in batches of 100
for i in range(0, len(vectors), 100):
    index.upsert(vectors=vectors[i : i + 100])

print(f"Upserted {len(vectors)} vectors")

The metadata.text field stores the raw chunk text so you can return it directly without a separate lookup. That's fine for moderate data volumes. For large corpora, store text in a separate key-value store and retrieve by ID.
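
For comparison, the same pattern in Chroma -- a minimal local-development sketch, using Chroma's documents field for the chunk text:

python

import chromadb

chroma = chromadb.PersistentClient(path="./chroma-data")
collection = chroma.get_or_create_collection("docs-rag")

collection.add(
    ids=[f"{c['source']}-{i}" for i, c in enumerate(embedded_chunks)],
    embeddings=[c["embedding"] for c in embedded_chunks],
    documents=[c["text"] for c in embedded_chunks],
    metadatas=[{"heading": c["heading"], "source": c["source"]} for c in embedded_chunks],
)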

Step 5: Query and retrieve

Pure vector search retrieves semantically similar chunks but misses exact-match cases. Narrowing with a metadata filter before running vector similarity -- the pattern below -- is more precise in practice. True hybrid search adds a keyword (BM25) score on top of the vector score; Weaviate does that natively.

python

def query_rag(question: str, source_filter: str | None = None, top_k: int = 5) -> list[dict]:
    """Retrieve relevant chunks for a question."""
    # Embed the question
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=[question],
    )
    query_vector = response.data[0].embedding

    # Build filter (optional)
    filter_dict = None
    if source_filter:
        filter_dict = {"source": {"$eq": source_filter}}

    # Query the index
    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True,
        filter=filter_dict,
    )

    return [
        {
            "text": match.metadata["text"],
            "heading": match.metadata["heading"],
            "source": match.metadata["source"],
            "score": match.score,
        }
        for match in results.matches
    ]


# Natural language query, optionally scoped to a section of the docs
chunks = query_rag("How do I authenticate API requests?")
for chunk in chunks:
    print(f"[{chunk['score']:.2f}] {chunk['heading']} — {chunk['source']}")
    print(chunk["text"][:200])
    print()

Pass the retrieved chunks to your LLM as context. Include the source URL in the prompt so the model can cite it.
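
A minimal generation sketch, assuming the OpenAI client from step 3 (the model name is a placeholder):

python

def answer(question: str) -> str:
    """Retrieve chunks, then ask the LLM to answer with citations."""
    retrieved = query_rag(question)
    context = "\n\n".join(
        f"[{c['source']}] {c['heading']}\n{c['text']}" for c in retrieved
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder -- any chat model works
        messages=[
            {"role": "system", "content": "Answer using only the context below. "
             "Cite the source URL for each claim.\n\n" + context},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content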

The complete pipeline in ~40 lines

python

import httpx, openai
from pinecone import Pinecone

PAGES = ["https://docs.example.com/getting-started", "https://docs.example.com/api-reference"]

# 1. Crawl
pages = httpx.post("https://your-crawler/api/browser/crawl", json={"urls": PAGES, "format": "markdown", "onlyMainContent": True}).json()

# 2. Chunk
def chunk(content, url):
    sections, heading, lines = [], "Introduction", []
    for line in content.splitlines():
        if line.startswith("## ") or line.startswith("### "):
            if lines: sections.append({"text": "\n".join(lines).strip(), "heading": heading, "source": url})
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    if lines: sections.append({"text": "\n".join(lines).strip(), "heading": heading, "source": url})
    return [s for s in sections if len(s["text"]) > 100]

chunks = [c for page in pages for c in chunk(page["content"], page["url"])]

# 3. Embed
oai = openai.OpenAI()
texts = [f"{c['heading']}\n\n{c['text']}" for c in chunks]
embeddings = oai.embeddings.create(model="text-embedding-3-small", input=texts).data
for c, e in zip(chunks, embeddings): c["embedding"] = e.embedding

# 4. Store
index = Pinecone(api_key="your-key").Index("docs-rag")  # index created in step 4
index.upsert(vectors=[{"id": f"{c['source']}-{i}", "values": c["embedding"], "metadata": {"text": c["text"], "source": c["source"]}} for i, c in enumerate(chunks)])

# 5. Query
q_vec = oai.embeddings.create(model="text-embedding-3-small", input=["How do I authenticate?"]).data[0].embedding
results = index.query(vector=q_vec, top_k=5, include_metadata=True)
for r in results.matches: print(r.metadata["text"][:200])

Performance tips

Cache crawl results. Re-crawling a 500-page site every time you rebuild your index is slow and wasteful. Store the raw markdown keyed by URL and last-modified timestamp. Only re-crawl pages that have changed.
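
One way to do it -- a sketch using a local JSON file as the cache, assuming your crawler returns a last-modified value (hash the content if it doesn't):

python

import json
import pathlib

CACHE_PATH = pathlib.Path("crawl-cache.json")

def load_cache() -> dict:
    """URL -> {content, last_modified} from the previous run."""
    return json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}

def save_cache(cache: dict) -> None:
    CACHE_PATH.write_text(json.dumps(cache))

# Skip URLs whose last_modified matches the cached value; crawl the rest.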

Incremental updates. Track which URLs you've indexed and when. On each run, crawl only changed or new pages, delete stale vectors by source URL, and upsert the new ones. Pinecone supports delete by metadata filter (delete(filter={"source": url})) on pod-based indexes, but not on serverless ones -- there, delete by ID prefix instead, as sketched below.
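
A sketch of the prefix-delete approach (it works because step 4's IDs start with the source URL):

python

def delete_page_vectors(index, url: str) -> None:
    """Serverless-safe cleanup: list vector IDs by prefix, then delete them."""
    for id_batch in index.list(prefix=f"{url}-"):
        index.delete(ids=id_batch)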

Handle large sites in batches. Crawling 1,000 pages in a single request will time out. Break it into batches of 50–100 URLs, run them in parallel with a semaphore to avoid overwhelming the crawler, and checkpoint progress so a failure doesn't lose all work.
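
A sketch of the batching pattern with asyncio and httpx (checkpointing omitted for brevity):

python

import asyncio
import httpx

async def crawl_all(urls: list[str], batch_size: int = 50,
                    max_concurrent: int = 4) -> list[dict]:
    """Crawl URL batches in parallel, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)
    batches = [urls[i : i + batch_size] for i in range(0, len(urls), batch_size)]

    async def crawl_batch(batch: list[str]) -> list[dict]:
        async with semaphore:
            async with httpx.AsyncClient(timeout=120) as http:
                response = await http.post(
                    "https://your-crawler/api/browser/crawl",
                    json={"urls": batch, "format": "markdown", "onlyMainContent": True},
                )
                return response.json()

    results = await asyncio.gather(*(crawl_batch(b) for b in batches))
    return [page for batch in results for page in batch]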

Token budgets. Embedding calls charge by token, and most models cap input around 8,000 tokens (text-embedding-3-small caps at 8,191); longer inputs are rejected or truncated depending on the client. Keep chunks between 200 and 1,000 tokens for best retrieval performance and predictable cost.
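
To enforce the budget, count tokens before embedding -- a sketch with tiktoken (cl100k_base is the encoding OpenAI's embedding models use):

python

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def oversized(chunks: list[dict], max_tokens: int = 1000) -> list[dict]:
    """Flag chunks that blow the token budget before they hit the API."""
    return [c for c in chunks if len(enc.encode(c["text"])) > max_tokens]

too_big = oversized(all_chunks)
print(f"{len(too_big)} chunks need splitting before embedding")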

FAQ

What's the best way to build a web crawl to vector database pipeline?

Crawl pages to markdown or structured JSON, chunk by semantic boundaries (headings or logical sections rather than fixed token counts), embed each chunk with a consistent model, and upsert to a vector database with metadata attached. The most important step is keeping metadata (source URL, section heading, content type) alongside each vector so you can filter before doing similarity search.

Is this a complete RAG pipeline tutorial?

Yes. The five steps above cover everything from raw URLs to a queryable vector index: crawling, chunking, embedding, storing, and retrieving. The only step not covered here is the LLM generation step — take the retrieved chunks and pass them as context in your system prompt.

What's the difference between web scraping for embeddings vs. standard web scraping?

Standard web scraping extracts data for direct use — you need a product price, you scrape the product page. Scraping for embeddings has a different constraint: the text you extract determines retrieval quality. Noise (navigation, footers, ads) gets embedded alongside signal and degrades results. Structured data extraction solves this by letting you define exactly which fields to pull, so you're embedding only the content that answers user questions.
