2026-03-27
6 min read
URL to Markdown API: Clean Content Extraction for LLMs
LLMs don't need HTML. They need text -- clean, structured text with heading hierarchy preserved and boilerplate stripped. Every `<nav>`, `<footer>`, cookie banner, and ad block that ends up in your LLM's context window is wasted tokens.
URL to Markdown solves this in one API call: send a URL, get back clean markdown with the content preserved and the cruft removed.
The problem with raw HTML as LLM context
A typical webpage is 80KB of HTML. The actual content — the article, the documentation, the product info — is maybe 5KB of that. The rest is:
- Document structure (`<html>`, `<head>`, `<body>`)
- CSS classes and inline styles
- JavaScript for interactivity
- Navigation and footer (repeated on every page)
- Ads, tracking pixels, analytics scripts
- Cookie consent popups
- Social sharing widgets
- Related content sidebars
Feeding 80KB of HTML to an LLM when you need 5KB of content is a 16x token waste. At GPT-4o pricing (~$2.50/1M input tokens) and roughly 4 characters per token, that's about $0.003 vs $0.05 per page. Over 10K pages, roughly $31 vs $500.
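The arithmetic is easy to sanity-check (assuming the common ~4 characters per token heuristic; exact counts vary by tokenizer and page):

```python
GPT4O_INPUT_PRICE = 2.50 / 1_000_000  # approximate dollars per input token

def estimated_cost(num_bytes: int, pages: int = 1) -> float:
    """Rough input cost, assuming ~4 characters (bytes) per token."""
    tokens = num_bytes / 4
    return tokens * GPT4O_INPUT_PRICE * pages

html_cost = estimated_cost(80_000, pages=10_000)  # raw HTML: ~$500
md_cost = estimated_cost(5_000, pages=10_000)     # extracted markdown: ~$31
```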
Worse, the noise hurts response quality. The LLM has to figure out what's content and what's boilerplate, and it doesn't always get it right.
How markdown extraction works
```bash
curl -X POST https://lastcrawler.xyz/api/markdown \
  -H "Content-Type: application/json" \
  -d '{"url": "https://docs.example.com/authentication"}'
```
The API:
- Renders the page in a real Chrome browser (handles JavaScript, dynamic content)
- Identifies the main content area using layout analysis
- Strips navigation, footers, sidebars, ads, and boilerplate
- Converts the remaining content to markdown with preserved structure
- Returns clean markdown with headings, lists, code blocks, and links intact
Input: a URL. Output: clean markdown:
````markdown
# Authentication

## Getting Started

To authenticate API requests, include your API key in the `Authorization` header:

```bash
curl -H "Authorization: Bearer sk-your-api-key" https://api.example.com/data
```

## API Keys

Generate API keys from your dashboard. Each key has configurable permissions:

- **Read-only** — can fetch data but not modify it
- **Read-write** — full access to CRUD operations
- **Admin** — includes user management and billing

## Rate Limits

| Plan | Requests/minute | Burst limit |
|------|----------------|-------------|
| Free | 60 | 10 |
| Pro | 600 | 100 |
| Enterprise | 6,000 | 1,000 |
````
Compare this to the raw HTML of the same page, which would include the site header, navigation breadcrumbs, version selector, sidebar, search widget, footer, and dozens of `<div>` wrappers with CSS classes.
Use cases
LLM context loading
```python
import requests
from openai import OpenAI

def answer_from_url(url: str, question: str) -> str:
    # Get clean content
    md_response = requests.post("https://lastcrawler.xyz/api/markdown", json={
        "url": url
    })
    markdown = md_response.json()["markdown"]

    # Use as LLM context
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Answer based on this content:\n\n{markdown}"},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content
```
RAG document ingestion
Markdown is the ideal input format for RAG pipelines. For a complete walkthrough of chunking, embedding, and storing web content, see web scraping for RAG pipelines.
```python
import requests

def ingest_docs(urls: list[str]):
    for url in urls:
        markdown = requests.post(
            "https://lastcrawler.xyz/api/markdown",
            json={"url": url}
        ).json()["markdown"]

        # Chunk by heading for natural boundaries
        chunks = split_by_headings(markdown)
        for chunk in chunks:
            embedding = embed(chunk)  # your embedding model
            vector_store.upsert(      # your vector database client
                text=chunk,
                embedding=embedding,
                metadata={"source": url}
            )
```
Content archival and analysis
```python
from datetime import datetime

def archive_article(url: str):
    markdown = extract_markdown(url)  # wraps the POST call shown above

    # Store as .md file with frontmatter
    filename = url_to_filename(url)
    with open(f"archive/{filename}.md", "w") as f:
        f.write(f"---\nsource: {url}\ndate: {datetime.now().isoformat()}\n---\n\n")
        f.write(markdown)
```
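The `url_to_filename` helper is assumed above; one possible sketch that flattens a URL into a filesystem-safe stem:

```python
import re
from urllib.parse import urlparse

def url_to_filename(url: str) -> str:
    """Turn a URL into a safe, readable filename stem.

    e.g. https://docs.example.com/auth/keys -> docs-example-com-auth-keys
    """
    parsed = urlparse(url)
    stem = f"{parsed.netloc}{parsed.path}".rstrip("/")
    # Collapse anything that isn't alphanumeric into a single dash.
    stem = re.sub(r"[^a-zA-Z0-9]+", "-", stem).strip("-")
    return stem.lower() or "index"
```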
Content diff monitoring
```python
def check_for_changes(url: str, previous_md: str) -> bool:
    current_md = extract_markdown(url)
    if current_md != previous_md:
        diff = compute_diff(previous_md, current_md)
        notify(f"Content changed on {url}:\n{diff}")
        return True
    return False
```
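`compute_diff` can be as simple as a unified diff from the standard library; a minimal sketch:

```python
import difflib

def compute_diff(previous_md: str, current_md: str) -> str:
    """Unified diff of two markdown snapshots, for change notifications."""
    lines = difflib.unified_diff(
        previous_md.splitlines(keepends=True),
        current_md.splitlines(keepends=True),
        fromfile="previous",
        tofile="current",
    )
    return "".join(lines)
```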
Markdown vs. raw text vs. HTML
| Format | Token efficiency | Structure preserved | LLM compatibility |
|---|---|---|---|
| Raw HTML | Poor (16x overhead) | Full DOM structure | LLM must parse HTML |
| Plain text | Good | No structure | Loses hierarchy |
| Markdown | Good | Headings, lists, code, links | Native LLM format |
Markdown is the sweet spot: minimal token overhead with preserved document structure. LLMs handle markdown well -- it's the format most training data uses for structured text.
Comparison with other tools
Jina Reader (r.jina.ai/)
Jina Reader is the fastest way to get markdown from a URL — just prepend `r.jina.ai/` to the address. No API key needed. For quick, one-off extractions, it's excellent.
Last Crawler's difference: runs on a global edge network for better success on protected sites, offers additional endpoints (JSON, screenshot, PDF, crawl) in the same API, and provides batch crawling for processing entire sites.
Mozilla Readability
The open-source library that Firefox Reader Mode uses. Works well for articles but struggles with non-article content (product pages, documentation, forums).
Last Crawler's difference: handles all content types, not just articles. Renders JavaScript before extraction. Works on SPAs and dynamically loaded content.
BeautifulSoup + manual extraction
Maximum control, maximum maintenance. You write the selectors, you handle the edge cases, you update when sites change.
Last Crawler's difference: no selectors to maintain. Content extraction adapts to page layout on its own. When you need typed fields instead of prose, the URL to JSON API uses the same rendering pipeline with schema-based extraction.
FAQ
Q: Does markdown extraction work on JavaScript-heavy sites?
A: Yes. The page is rendered in a real Chrome browser before content extraction. React apps, Next.js sites, SPAs with client-side routing — all render fully before the content is extracted. This same rendering capability powers all extraction endpoints, including those used by AI agents for web scraping.
Q: How is the "main content" determined?
A: The extraction layer uses a combination of DOM analysis (content density, semantic HTML tags like `<article>` and `<main>`) and layout heuristics to identify the primary content area. It's not perfect on every page, but it handles the vast majority of common page layouts correctly.
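As a toy illustration of the content-density idea (not Last Crawler's actual algorithm), here's a stdlib sketch that weighs text inside semantic main tags against text elsewhere, ignoring obvious boilerplate containers:

```python
from html.parser import HTMLParser

SEMANTIC_MAIN = {"article", "main"}
BOILERPLATE = {"nav", "footer", "header", "aside", "script", "style"}

class DensityScorer(HTMLParser):
    """Toy scorer: counts text chars inside semantic main tags vs elsewhere,
    skipping text inside boilerplate containers entirely."""

    def __init__(self):
        super().__init__()
        self.in_main = 0
        self.in_boiler = 0
        self.main_chars = 0
        self.other_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in SEMANTIC_MAIN:
            self.in_main += 1
        if tag in BOILERPLATE:
            self.in_boiler += 1

    def handle_endtag(self, tag):
        if tag in SEMANTIC_MAIN:
            self.in_main = max(0, self.in_main - 1)
        if tag in BOILERPLATE:
            self.in_boiler = max(0, self.in_boiler - 1)

    def handle_data(self, data):
        text = data.strip()
        if not text or self.in_boiler:
            return  # boilerplate text doesn't count toward either bucket
        if self.in_main:
            self.main_chars += len(text)
        else:
            self.other_chars += len(text)
```

A real extractor layers many more signals on top (link density, visual layout, repeated blocks across pages), but the skeleton is the same: score regions, keep the densest content area.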
Q: Can I get markdown from a page that requires scrolling to load all content?
A: The browser rendering handles infinite scroll and lazy-loaded content. The full page content is loaded before markdown extraction begins.
Last Crawler