lastcrawler.xyz


2026-03-14


AI Agents · LLMs · Tutorial

How to Feed Live Web Data to Your LLM Agent

LLMs have a knowledge cutoff. Agents built on them inherit that limitation -- unless you give them tools to reach out and get current information. Real-time web access is one of the most useful capabilities you can add to an agent, and also one of the most commonly botched.

Here's how to do it right.

Why agents need web access

An agent operating on training data alone can't:

- answer questions about anything after its knowledge cutoff
- check a current price, inventory level, or exchange rate
- verify that a page, product, or API still says what it said at training time

For any agent doing research or decision-making on real-world topics, web access isn't optional. Without it, you've got an expensive autocomplete. The problem is that most web scraping APIs weren't built for AI agents -- they were built for humans writing scripts.

The wrong way: raw HTML

The obvious approach is to fetch page content and pass it to the agent:

```python
import requests

def browse(url: str) -> str:
    """Fetch a URL and return the content."""
    resp = requests.get(url)
    return resp.text  # 50,000 tokens of HTML the agent has to parse
```

This fails for several reasons:

- Raw HTML is mostly markup, scripts, and navigation -- tens of thousands of tokens of noise for a few hundred tokens of content.
- The agent burns context window and reasoning effort parsing page structure instead of answering the question.
- JS-rendered sites return an empty or incomplete initial document, so the content may not even be in the response.
- The output is untyped text, so the agent has to re-extract every price, date, and boolean itself.

Some teams try cleaning the HTML first -- stripping tags, extracting body text -- but this still produces unstructured text walls that waste tokens and don't give the agent typed data to work with. A URL-to-markdown API is a step up from raw HTML, but for agent tool use, schema-driven JSON extraction is better.

The right way: structured extraction

Give the agent a tool that accepts a URL and a schema, and returns data that matches the schema. The agent specifies what it needs; the tool figures out how to extract it.

```python
import requests

def extract_from_url(url: str, schema: dict) -> dict:
    """Extract structured data from a URL matching the given schema."""
    resp = requests.post("https://lastcrawler.com/api/json", json={
        "url": url,
        "schema": schema
    })
    resp.raise_for_status()
    return resp.json()
```

Now the agent gets typed fields instead of text. A price is a number. A boolean is a boolean. A list of features is a list. The model can work with these directly.
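To make that concrete, here's the kind of schema the agent might send and the typed result it gets back. The field names, the `"string[]"` list notation, and the response values are all illustrative -- the exact shape depends on your extraction API:

```python
# A hypothetical schema the agent might send:
schema = {
    "product_name": "string",
    "price": "number",
    "in_stock": "boolean",
    "features": "string[]",  # hypothetical notation for a list of strings
}

# An illustrative result the tool would return for that schema:
result = {
    "product_name": "WH-1000XM6",
    "price": 348.0,
    "in_stock": True,
    "features": ["ANC", "30h battery"],
}

# The agent can use these values directly, without parsing prose:
assert isinstance(result["price"], (int, float))
assert isinstance(result["in_stock"], bool)
assert isinstance(result["features"], list)
```

Compare this to regex-mining a markdown dump for a number that looks like a price: the schema does the disambiguation once, upstream of the model.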

Building the web tool for your agent

Define the interface

Tool definitions need two things to work well: a precise description and typed output. The description is how the agent decides when to call the tool. Vague descriptions cause the agent to call the wrong tool or use it incorrectly.

```python
from anthropic import Anthropic

client = Anthropic()

tools = [
    {
        "name": "fetch_web_data",
        "description": (
            "Extract structured data from any public webpage. "
            "Provide a URL and a schema describing the fields you want. "
            "Returns a JSON object with the extracted data. "
            "Use this when you need current information from a specific website."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "url": {
                    "type": "string",
                    "description": "The full URL to extract data from"
                },
                "schema": {
                    "type": "object",
                    "description": "JSON schema describing the fields to extract"
                }
            },
            "required": ["url", "schema"]
        }
    }
]
```

Implement with schema-driven extraction

The tool implementation should handle the HTTP call and return data or a structured error:

```python
import requests

def fetch_web_data(url: str, schema: dict) -> dict:
    try:
        resp = requests.post(
            "https://lastcrawler.com/api/json",
            json={"url": url, "schema": schema},
            timeout=30
        )
        resp.raise_for_status()
        return {"success": True, "data": resp.json()}
    except requests.exceptions.Timeout:
        return {
            "success": False,
            "error": "timeout",
            "message": "Request timed out after 30 seconds. The site may be slow or blocking access."
        }
    except requests.exceptions.HTTPError as e:
        return {
            "success": False,
            "error": "http_error",
            "message": f"HTTP {e.response.status_code}. The page may require authentication or not exist."
        }
    except Exception as e:
        return {
            "success": False,
            "error": "unknown",
            "message": str(e)
        }
```

Handle errors so the agent can reason about them

The agent needs to understand what went wrong and what to do next. "Request failed" is useless. "Page requires authentication — try the public pricing page instead" gives the agent something to work with.

Return errors as structured data, not exceptions. Include the error type, a human-readable message, and when possible a suggested alternative action.
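One possible shape for that, extending the error dicts above with a `suggestion` field (the `auth_error` helper and its field names are hypothetical, not part of any API):

```python
def auth_error(url: str) -> dict:
    """Build a structured error the agent can act on (illustrative shape)."""
    return {
        "success": False,
        "error": "auth_required",
        "message": f"{url} requires authentication.",
        # A concrete next step turns a dead end into a recoverable state:
        "suggestion": "Try a public page on the same domain, e.g. the pricing page.",
    }
```

When the agent sees `suggestion`, it can retry with a different URL instead of giving up or hallucinating a result.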

Context window management

Web data can get large fast. Three strategies that keep costs down:

Extract only what you need. The schema constrains output to exactly the fields specified. If you only need {price, currency, in_stock}, that's all you get back — not the entire product description, reviews, and related items.

Cache repeated lookups. Agents often revisit the same URL multiple times in a reasoning chain. Cache extraction results with a short TTL (5–10 minutes for most use cases). The second call is instant and free.

```python
import time

cache = {}

def cached_fetch(url: str, schema: dict, ttl: int = 300) -> dict:
    key = (url, str(sorted(schema.items())))
    if key in cache:
        result, timestamp = cache[key]
        if time.time() - timestamp < ttl:
            return result
    result = fetch_web_data(url, schema)
    cache[key] = (result, time.time())
    return result
```

Summarize before passing to the agent. For content-heavy extractions (articles, documentation), extract the full text but summarize it before including it in the agent's context. The agent cares about the conclusion, not every word.
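A minimal sketch of that compaction step, with plain truncation standing in for a real summarization call (`compact_for_context` is a hypothetical helper, not part of any library):

```python
def compact_for_context(extracted: dict, max_chars: int = 2000) -> dict:
    """Shrink long text fields before they enter the agent's context.

    A production pipeline might replace truncation with a cheap
    summarization-model call; truncation is the zero-dependency fallback.
    """
    compact = {}
    for key, value in extracted.items():
        if isinstance(value, str) and len(value) > max_chars:
            compact[key] = value[:max_chars] + " ...[truncated]"
        else:
            compact[key] = value
    return compact
```

Run extraction results through this before appending them as tool results, and a 50 KB article body stops costing 50 KB of context on every subsequent turn.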

Example: a price comparison agent

Here's a complete example of an agent that compares product prices across multiple sites.

```python
import json
from anthropic import Anthropic

client = Anthropic()

PRICE_SCHEMA = {
    "product_name": "string",
    "price": "number",
    "currency": "string",
    "in_stock": "boolean",
    "shipping_estimate": "string"
}

def run_price_comparison_agent(product: str, urls: list[str]):
    system_prompt = f"""You are a price comparison assistant.
    The user wants to find the best price for: {product}

    Use the fetch_web_data tool to check prices at each URL provided.
    Compare results and recommend the best option, considering price, availability, and shipping."""

    messages = [
        {"role": "user", "content": f"Check prices at these URLs: {', '.join(urls)}"}
    ]

    while True:
        response = client.messages.create(
            model="claude-opus-4-6",
            max_tokens=4096,
            system=system_prompt,
            tools=tools,
            messages=messages
        )

        if response.stop_reason == "end_turn":
            # Extract final text response
            for block in response.content:
                if block.type == "text":
                    print(block.text)
            break

        if response.stop_reason == "tool_use":
            messages.append({"role": "assistant", "content": response.content})

            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = fetch_web_data(block.input["url"], block.input["schema"])
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result)
                    })

            messages.append({"role": "user", "content": tool_results})

# Usage
run_price_comparison_agent(
    product="Sony WH-1000XM6 headphones",
    urls=["https://sony.com/...", "https://amazon.com/...", "https://bestbuy.com/..."]
)
```

The agent calls fetch_web_data for each URL, gets back structured price data, then reasons across all results to make a recommendation. No HTML parsing, no token waste on boilerplate, no brittle selectors to maintain.

FAQ

Q: Why not just use an existing AI web browsing tool?

A: Most "web browsing" tools return markdown or text extracted from a page. That's better than raw HTML but still forces the agent to parse unstructured output. Schema-driven extraction returns typed fields the agent can reason about directly — prices as numbers, booleans as booleans, lists as lists. The agent doesn't need to understand the page; it just gets the data it asked for.

Q: How do I handle sites that require JavaScript to render content?

A: Standard HTTP clients only get the initial HTML, which is often empty or incomplete for JS-heavy sites. You need a real browser to execute JavaScript and extract from the rendered DOM. Browser-based extraction APIs handle this automatically — you pass a URL and get back rendered, extracted data regardless of whether the site is a static page or a React SPA.

Q: What's the right granularity for tool schemas?

A: Match the schema to the task, not the page. If your agent is doing price comparison, the schema should return price fields. Don't try to extract everything on the page just in case. A tighter schema means smaller responses, faster extraction, and less noise for the agent to process. If your agent has multiple distinct web tasks, build multiple task-specific tools with dedicated schemas rather than one generic "browse" tool.
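As a sketch of that split, here are two narrow tool definitions instead of one generic "browse" tool. The tool names, descriptions, and fields are hypothetical:

```python
# A price-checking tool: tight description, single required input.
PRICE_TOOL = {
    "name": "fetch_price",
    "description": "Get current price and availability for a product page URL.",
    "input_schema": {
        "type": "object",
        "properties": {"url": {"type": "string"}},
        "required": ["url"],
    },
}

# A separate documentation-lookup tool with its own dedicated purpose.
DOCS_TOOL = {
    "name": "fetch_doc_section",
    "description": "Get the text of a documentation page section by URL.",
    "input_schema": {
        "type": "object",
        "properties": {"url": {"type": "string"}},
        "required": ["url"],
    },
}
```

Two precise descriptions give the model a much easier routing decision than one tool whose description has to cover every case.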
