2026-03-27
8 min read
# Web Scraping APIs for AI Agents: What You Actually Need
AI agents are only as useful as the data they can access. An agent that can reason about a product's pricing but can't actually read the product page is doing expensive inference on stale context. The challenge is feeding live web data to your LLM agent without drowning it in noise.
The problem isn't that web scraping is hard. It's that most scraping tools were built for batch data collection, not for real-time agent tool calls. What an agent needs from a scraping API is pretty different from what a data team running nightly crawls needs.
## What agents need from a web data API
### 1. Structured output, not HTML
An agent doesn't want `<div class="price-wrapper"><span class="currency">$</span><span class="amount">299</span></div>`. It wants `{"price": 299, "currency": "USD"}`.
Every token of HTML context you feed an LLM is money wasted. Agents need typed JSON that matches a schema the agent understands — a proper URL to JSON API handles this in a single call. No parsing step, no regex, no BeautifulSoup in the tool chain.
```python
# What the agent needs
tool_result = scrape(url="https://store.example.com/product/123", schema={
    "name": "string",
    "price": "number",
    "in_stock": "boolean",
    "reviews": [{"rating": "number", "text": "string"}]
})
# Returns typed JSON the agent can reason about directly
# {"name": "Widget Pro", "price": 299, "in_stock": true, "reviews": [...]}
```
### 2. Sub-second latency for real-time tool calls
When an agent calls a tool, the user is waiting. A scraping API that takes 10 seconds to return data makes the agent feel broken. Batch crawling latency is irrelevant here — what matters is single-page extraction speed.
The target: under 2 seconds from API call to structured data returned. Anything slower and the agent's response time degrades to the point where users lose trust.
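One way to hold that line is to enforce the budget in the tool wrapper itself: run the extraction in a worker thread and hand the agent a structured timeout error instead of hanging. A minimal sketch, where `slow_scrape` is a stand-in for the real API call:

```python
import concurrent.futures
import time

def slow_scrape(url: str) -> dict:
    # Stand-in for the real extraction call; sleeps to simulate network latency.
    time.sleep(0.1)
    return {"url": url, "price": 299}

def extract_with_budget(url: str, budget_s: float = 2.0) -> dict:
    """Run the extraction in a worker and give up once the budget is spent."""
    # A pool per call is wasteful; a real wrapper would reuse one executor.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_scrape, url)
        try:
            return future.result(timeout=budget_s)
        except concurrent.futures.TimeoutError:
            # Return a structured error the agent can reason about.
            return {"error": "timeout", "url": url, "budget_s": budget_s}
```

Returning `{"error": "timeout"}` as data lets the agent decide whether to retry, skip the source, or tell the user, rather than surfacing a raw exception mid-conversation.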
### 3. Reliability on sites that matter
Agents get sent to sites they've never seen before. A user asks "compare prices for X across Amazon, Best Buy, and Walmart" — the agent needs to successfully extract from all three, not fail on two out of three because they have bot protection.
Success rate on protected sites isn't a nice-to-have for agents. It's the difference between a tool that works and one that says "sorry, I couldn't access that page."
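The same logic argues for graceful degradation in the tool wrapper: collect whatever succeeds and report failures as data, so the agent can still compare the two retailers it did reach. A sketch, with a stub `fetch_source` standing in for the real extraction call:

```python
def fetch_source(url: str) -> dict:
    # Stub standing in for a real extraction call; here, "protected"
    # hosts simulate sites that block the request.
    if "protected" in url:
        raise RuntimeError("blocked by bot protection")
    return {"url": url, "price": 299}

def gather_sources(urls: list[str]) -> dict:
    """Collect successful extractions and record failures instead of aborting."""
    results, failures = [], []
    for url in urls:
        try:
            results.append(fetch_source(url))
        except RuntimeError as exc:
            failures.append({"url": url, "reason": str(exc)})
    return {"results": results, "failures": failures}
```

The agent then sees "2 of 3 sources succeeded, 1 blocked" as structured output instead of a dead tool call.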
### 4. No state management
Agents are stateless tool callers. They shouldn't need to manage browser sessions, handle cookies across requests, or maintain crawl state. The API should be a pure function: URL + schema in, data out.
## Why existing tools fall short
### Firecrawl
Firecrawl's markdown output works well as LLM context, but agents need structured data, not markdown. The /extract endpoint exists but uses a 5x credit multiplier, making it expensive for frequent agent calls. Anti-bot failures on protected sites (Amazon, LinkedIn) break agent workflows unpredictably.
### Crawl4AI
Excellent for batch crawling but requires self-hosted infrastructure. An agent running in a serverless function can't spin up a local Crawl4AI instance per request. No built-in anti-bot protection means protected sites fail consistently.
### Selenium / Playwright directly
Raw browser automation gives you full control but no structure. You'd need to manage browser instances, handle anti-bot detection, parse HTML into structured data, and deal with timeouts and failures. That's 500 lines of glue code the agent framework shouldn't need. If you're going down this path, read about how to build web tools for AI agents that sidestep these problems.
### Jina Reader
Great for quick markdown extraction, but no custom schema support. You get back markdown, then need another LLM call to extract structured data from it. Two LLM calls where one should suffice.
## What a proper agent-to-web interface looks like
The API an agent needs is simple:
```python
# Tool definition for the agent
{
    "name": "extract_web_data",
    "description": "Extract structured data from any URL",
    "parameters": {
        "url": {"type": "string", "description": "The URL to extract from"},
        "schema": {"type": "object", "description": "JSON schema defining desired output"}
    }
}
```
```python
# Implementation using Last Crawler
import requests

def extract_web_data(url: str, schema: dict) -> dict:
    response = requests.post(
        "https://lastcrawler.xyz/api/json",
        json={"url": url, "schema": schema},
        timeout=10,  # keep the agent's tool call bounded
    )
    response.raise_for_status()
    return response.json()
```
That's the entire tool implementation. The agent defines what data it wants via the schema, and gets typed JSON back. No HTML parsing, no proxy management, no browser session lifecycle.
## Real-world agent workflow
Here's a practical example — a competitive pricing agent:
```python
# Agent receives: "Compare iPhone 16 Pro pricing across major retailers"

# Step 1: Agent decides to check multiple sources
sources = [
    "https://www.apple.com/shop/buy-iphone/iphone-16-pro",
    "https://www.amazon.com/dp/B0DGHJ...",
    "https://www.bestbuy.com/site/apple-iphone-16-pro/...",
]

pricing_schema = {
    "product_name": "string",
    "price": "number",
    "availability": "string",
    "shipping": "string",
    "condition": "string"
}

# Step 2: Agent calls the tool for each source
results = []
for source in sources:
    data = extract_web_data(url=source, schema=pricing_schema)
    results.append({"source": source, "data": data})

# Step 3: Agent reasons about the structured results
# All data is typed JSON — no parsing, no extraction hallucinations
```
## Integration patterns
### LangChain
```python
import json

import requests
from langchain.tools import tool

@tool
def web_extract(url: str, fields: str) -> str:
    """Extract structured data from a URL. Fields is a comma-separated list."""
    schema = {field.strip(): "string" for field in fields.split(",")}
    response = requests.post(
        "https://lastcrawler.xyz/api/json",
        json={"url": url, "schema": schema},
        timeout=10,
    )
    return json.dumps(response.json(), indent=2)
```
### OpenAI Function Calling
```python
tools = [{
    "type": "function",
    "function": {
        "name": "extract_web_data",
        "description": "Extract structured data from any webpage",
        "parameters": {
            "type": "object",
            "properties": {
                "url": {"type": "string"},
                "schema": {"type": "object"}
            },
            "required": ["url", "schema"]
        }
    }
}]
```
### Claude Tool Use
```python
tools = [{
    "name": "extract_web_data",
    "description": "Extract structured data from any URL using AI",
    "input_schema": {
        "type": "object",
        "properties": {
            "url": {"type": "string", "description": "URL to extract from"},
            "schema": {"type": "object", "description": "Desired output structure"}
        },
        "required": ["url", "schema"]
    }
}]
```
## The latency equation
For agent tool calls, every second of latency is a second the user waits. Here's how the options stack up for single-page extraction:
| Approach | Typical latency | Structured output? |
|---|---|---|
| Last Crawler /json | ~1.2s | Yes, schema-based |
| Firecrawl /extract | ~3-5s | Yes, but 5x credits |
| Jina Reader + LLM parse | ~4-8s (two calls) | Requires extra LLM call |
| Playwright + LLM parse | ~5-15s | Requires browser + LLM |
| Crawl4AI local | ~3-8s | Requires local infra |
In an agent workflow, 1.2s feels responsive. 8s feels broken. Users don't care why it's slow.
## FAQ
Q: Can I use a web scraping API as an MCP server for Claude or Cursor?
A: Yes. Wrapping a scraping API as an MCP (Model Context Protocol) server gives AI coding tools direct web access. Last Crawler's simple POST API makes it straightforward to build an MCP server that any compatible tool can use.
Q: How do I handle rate limits when multiple agents call the same API?
A: Use API key-level rate limiting and implement exponential backoff in your tool wrapper. For high-concurrency agent deployments, batch requests where possible and cache results for identical URLs within a time window.
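That pattern can be sketched in a few lines: a TTL cache keyed by URL plus exponential backoff between retries. Here `fetch` is whatever callable actually hits the scraping API; the delay and TTL values are illustrative defaults, not recommendations from any vendor.

```python
import time

_cache: dict[str, tuple[float, dict]] = {}

def cached_extract(url: str, fetch, ttl_s: float = 60.0, retries: int = 3) -> dict:
    """Serve identical URLs from a short-lived cache; back off exponentially on failure."""
    now = time.monotonic()
    hit = _cache.get(url)
    if hit and now - hit[0] < ttl_s:
        return hit[1]  # fresh enough: skip the network entirely
    delay = 0.5
    for attempt in range(retries):
        try:
            data = fetch(url)
            _cache[url] = (time.monotonic(), data)
            return data
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: let the caller see the failure
            time.sleep(delay)
            delay *= 2  # exponential backoff between retries
```

With many agents sharing one API key, the cache absorbs duplicate lookups of the same URL and the backoff keeps a transient 429 from cascading into a burst of immediate retries.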
Q: Should my agent use markdown or JSON extraction?
A: JSON for tool calls where the agent needs to reason about specific fields. Markdown for context loading where the agent needs to understand the full page content (e.g., answering questions about an article) — see how a URL to Markdown API works for that use case. Use both endpoints for different purposes.
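A tool wrapper can encode that split directly. A sketch, assuming a markdown endpoint alongside `/api/json`; the `/api/markdown` path and the task names are assumptions for illustration, not documented routes:

```python
def choose_endpoint(task: str) -> str:
    # Tasks that need specific typed fields go to the JSON endpoint;
    # everything else loads full-page context as markdown.
    # NOTE: the /api/markdown path is an assumed route for illustration.
    structured_tasks = {"compare_prices", "check_stock", "extract_fields"}
    if task in structured_tasks:
        return "https://lastcrawler.xyz/api/json"
    return "https://lastcrawler.xyz/api/markdown"
```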
Last Crawler