lastcrawler.xyz

2026-03-27

6 min read

Markdown · API · LLM · Content Extraction · RAG

URL to Markdown API: Clean Content Extraction for LLMs

LLMs don't need HTML. They need text -- clean, structured text with heading hierarchy preserved and boilerplate stripped. Every `<nav>`, `<footer>`, cookie banner, and ad block that ends up in your LLM's context window is wasted tokens.

URL to Markdown solves this in one API call: send a URL, get back clean markdown with the content preserved and the cruft removed.

The problem with raw HTML as LLM context

A typical webpage is 80KB of HTML. The actual content — the article, the documentation, the product info — is maybe 5KB of that. The rest is:

- Site headers, navigation, and breadcrumbs
- Sidebars, footers, and search widgets
- Cookie banners and ad blocks
- `<div>` wrappers, CSS classes, and inline scripts

Feeding 80KB of HTML to an LLM when you need 5KB of content is a 16x token waste. At GPT-4o pricing (~$2.50/1M input tokens) and roughly 4 characters per token, that's about $0.05 vs $0.003 per page. Over 10K pages, roughly $500 vs $30.

Worse, the noise hurts response quality. The LLM has to figure out what's content and what's boilerplate, and it doesn't always get it right.
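As a back-of-envelope check of those numbers (assuming ~4 characters per token, a common rough heuristic; the constants below are illustrative, not measured):

```python
# Rough cost comparison: raw HTML vs extracted markdown as LLM context.
# Assumes ~4 characters per token, a common heuristic for English text.

CHARS_PER_TOKEN = 4
PRICE_PER_TOKEN = 2.50 / 1_000_000  # ~$2.50 per 1M input tokens

def context_cost(size_bytes: int) -> float:
    """Approximate input cost of feeding `size_bytes` of text to the model."""
    return size_bytes / CHARS_PER_TOKEN * PRICE_PER_TOKEN

html_cost = context_cost(80_000)  # full page HTML
md_cost = context_cost(5_000)     # extracted content only

print(f"HTML: ${html_cost:.4f}/page, markdown: ${md_cost:.4f}/page")
print(f"Over 10K pages: ${html_cost * 10_000:.0f} vs ${md_cost * 10_000:.0f}")
```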

How markdown extraction works

```bash
curl -X POST https://lastcrawler.xyz/api/markdown \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/authentication"}'
```

The API:

  1. Renders the page in a real Chrome browser (handles JavaScript, dynamic content)
  2. Identifies the main content area using layout analysis
  3. Strips navigation, footers, sidebars, ads, and boilerplate
  4. Converts the remaining content to markdown with preserved structure
  5. Returns clean markdown with headings, lists, code blocks, and links intact

Input: a URL. Output: clean markdown:

````markdown
# Authentication

## Getting Started

To authenticate API requests, include your API key in the `Authorization` header:

```bash
curl -H "Authorization: Bearer sk-your-api-key" https://api.example.com/data
```

## API Keys

Generate API keys from your dashboard. Each key has configurable permissions:

- **Read-only** — can fetch data but not modify it
- **Read-write** — full access to CRUD operations
- **Admin** — includes user management and billing

## Rate Limits

| Plan | Requests/minute | Burst limit |
|------|----------------|-------------|
| Free | 60 | 10 |
| Pro | 600 | 100 |
| Enterprise | 6,000 | 1,000 |
````

Compare this to the raw HTML of the same page, which would include the site header, navigation breadcrumbs, version selector, sidebar, search widget, footer, and dozens of `<div>` wrappers with CSS classes.

Use cases

LLM context loading

```python
import requests
from openai import OpenAI

def answer_from_url(url: str, question: str) -> str:
    # Get clean content
    md_response = requests.post("https://lastcrawler.xyz/api/markdown", json={
        "url": url
    })
    markdown = md_response.json()["markdown"]

    # Use as LLM context
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on this content:\n\n{markdown}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
```

RAG document ingestion

Markdown is the ideal input format for RAG pipelines. For a complete walkthrough of chunking, embedding, and storing web content, see web scraping for RAG pipelines.

```python
import requests

def ingest_docs(urls: list[str]):
    for url in urls:
        markdown = requests.post(
            "https://lastcrawler.xyz/api/markdown",
            json={"url": url}
        ).json()["markdown"]

        # Chunk by heading for natural boundaries
        chunks = split_by_headings(markdown)

        for chunk in chunks:
            embedding = embed(chunk)
            vector_store.upsert(
                text=chunk,
                embedding=embedding,
                metadata={"source": url}
            )
```
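The `split_by_headings` helper above is left abstract (as are `embed` and `vector_store`); a minimal sketch of it, splitting at h1/h2 headings with a regex lookahead:

```python
import re

def split_by_headings(markdown: str) -> list[str]:
    """Split markdown into chunks at h1/h2 headings, keeping each
    heading together with the text that follows it."""
    # Split at newlines that are immediately followed by '# ' or '## '.
    parts = re.split(r"\n(?=#{1,2} )", markdown)
    return [p.strip() for p in parts if p.strip()]
```

A production version would also avoid splitting inside fenced code blocks, but this is enough to get heading-aligned chunk boundaries.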

Content archival and analysis

```python
from datetime import datetime

def archive_article(url: str):
    markdown = extract_markdown(url)

    # Store as .md file with a front-matter header
    filename = url_to_filename(url)
    with open(f"archive/{filename}.md", "w") as f:
        f.write(f"---\nsource: {url}\ndate: {datetime.now().isoformat()}\n---\n\n")
        f.write(markdown)
```
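`url_to_filename` is likewise an assumed helper, not part of the API; a simple slug-based sketch:

```python
import re
from urllib.parse import urlparse

def url_to_filename(url: str) -> str:
    """Turn a URL into a filesystem-safe slug, e.g.
    https://docs.example.com/auth/keys -> docs-example-com-auth-keys."""
    parsed = urlparse(url)
    slug = f"{parsed.netloc}{parsed.path}".rstrip("/")
    # Collapse every run of non-alphanumeric characters into a single dash.
    return re.sub(r"[^a-zA-Z0-9]+", "-", slug).strip("-")
```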

Content diff monitoring

```python
def check_for_changes(url: str, previous_md: str) -> bool:
    current_md = extract_markdown(url)
    if current_md != previous_md:
        diff = compute_diff(previous_md, current_md)
        notify(f"Content changed on {url}:\n{diff}")
        return True
    return False
```
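The snippet treats `compute_diff` as a black box; a minimal sketch backed by the standard library's `difflib`:

```python
import difflib

def compute_diff(old: str, new: str) -> str:
    """Unified diff of two markdown snapshots, line by line."""
    return "\n".join(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="previous", tofile="current", lineterm=""
    ))
```

Diffing markdown rather than HTML is the point here: a changed tracking parameter or rotated CSS class would make raw-HTML snapshots differ on every fetch, while the extracted markdown only changes when the content does.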

Markdown vs. raw text vs. HTML

FormatToken efficiencyStructure preservedLLM compatibility
Raw HTMLPoor (16x overhead)Full DOM structureLLM must parse HTML
Plain textGoodNo structureLoses hierarchy
MarkdownGoodHeadings, lists, code, linksNative LLM format

Markdown is the sweet spot: minimal token overhead with preserved document structure. LLMs handle markdown well -- it's the format most training data uses for structured text.

Comparison with other tools

Jina Reader (r.jina.ai/)

Jina Reader is the fastest way to get markdown from a URL — just prepend `r.jina.ai/` to the URL. No API key needed. For quick, one-off extractions, it's excellent.

Last Crawler's difference: runs on a global edge network for better success on protected sites, offers additional endpoints (JSON, screenshot, PDF, crawl) in the same API, and provides batch crawling for processing entire sites.

Mozilla Readability

The open-source library that Firefox Reader Mode uses. Works well for articles but struggles with non-article content (product pages, documentation, forums).

Last Crawler's difference: handles all content types, not just articles. Renders JavaScript before extraction. Works on SPAs and dynamically loaded content.

BeautifulSoup + manual extraction

Maximum control, maximum maintenance. You write the selectors, you handle the edge cases, you update when sites change.

Last Crawler's difference: no selectors to maintain. Content extraction adapts to page layout on its own. When you need typed fields instead of prose, the URL to JSON API uses the same rendering pipeline with schema-based extraction.

FAQ

Q: Does markdown extraction work on JavaScript-heavy sites?

A: Yes. The page is rendered in a real Chrome browser before content extraction. React apps, Next.js sites, SPAs with client-side routing — all render fully before the content is extracted. This same rendering capability powers all extraction endpoints, including those used by AI agents for web scraping.

Q: How is the "main content" determined?

A: The extraction layer uses a combination of DOM analysis (content density, semantic HTML tags like `<article>` and `<main>`) and layout heuristics to identify the primary content area. It's not perfect on every page, but it handles the vast majority of common page layouts correctly.

Q: Can I get markdown from a page that requires scrolling to load all content?

A: The browser rendering handles infinite scroll and lazy-loaded content. The full page content is loaded before markdown extraction begins.
