lastcrawler.xyz


2026-03-27

7 min read

JSON · API · Web Scraping · Structured Data · Developer Tools

URL to JSON API: Extract Structured Data from Any Webpage

The traditional web scraping workflow looks like this: fetch the page, parse the HTML with BeautifulSoup or Cheerio, find the right CSS selectors, extract the text, clean it, handle edge cases, and hope the site doesn't change its layout next week.

Schema-based extraction flips this. You define the shape of the data you want as a JSON schema. The API figures out where that data lives on the page and returns typed JSON matching your schema. No selectors, no parsing, no maintenance when the site changes its HTML. If you're building autonomous workflows, this works particularly well for AI agents that need structured web data.

How it works

1. Define your schema

A schema describes the structure you want, not where to find it:

json

{
  "schema": {
    "company_name": "string",
    "pricing_plans": [{
      "name": "string",
      "price_monthly": "number",
      "features": ["string"],
      "is_popular": "boolean"
    }],
    "total_plans": "integer"
  }
}

This is a contract between you and the API. You'll get back JSON that matches this exact shape, with the right types.

2. Send the request

bash

curl -X POST https://lastcrawler.xyz/api/json \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "schema": {
      "company_name": "string",
      "pricing_plans": [{
        "name": "string",
        "price_monthly": "number",
        "features": ["string"],
        "is_popular": "boolean"
      }],
      "total_plans": "integer"
    }
  }'

3. Get typed JSON back

json

{
  "company_name": "Acme Corp",
  "pricing_plans": [
    {
      "name": "Starter",
      "price_monthly": 9,
      "features": ["5 projects", "1GB storage", "Email support"],
      "is_popular": false
    },
    {
      "name": "Pro",
      "price_monthly": 29,
      "features": ["Unlimited projects", "100GB storage", "Priority support", "API access"],
      "is_popular": true
    },
    {
      "name": "Enterprise",
      "price_monthly": 99,
      "features": ["Everything in Pro", "SSO", "Dedicated account manager", "Custom integrations"],
      "is_popular": false
    }
  ],
  "total_plans": 3
}

Notice: price_monthly is a number, not "$29". is_popular is a boolean, not "Yes". features is an array. The API handles the type coercion.
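That contract can be checked mechanically on your side. Here's a minimal sketch (not part of the API, just illustration) that verifies a payload against the type strings used in this article, treating null as acceptable for any field since missing data comes back as null:

```python
# Verify that an extracted payload matches a schema's declared types.
# Covers the conventions used in this article: "string", "number",
# "integer", "boolean", lists, and nested objects.

TYPE_MAP = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
}

def matches(schema, value):
    if isinstance(schema, str):
        # Missing fields come back as null, so None is always acceptable.
        return value is None or isinstance(value, TYPE_MAP[schema])
    if isinstance(schema, list):  # e.g. ["string"] or [{"name": "string"}]
        return isinstance(value, list) and all(matches(schema[0], v) for v in value)
    if isinstance(schema, dict):  # nested object
        return isinstance(value, dict) and all(
            matches(sub, value.get(key)) for key, sub in schema.items()
        )
    return False

pricing_schema = {
    "company_name": "string",
    "pricing_plans": [{
        "name": "string",
        "price_monthly": "number",
        "features": ["string"],
        "is_popular": "boolean",
    }],
    "total_plans": "integer",
}

response = {
    "company_name": "Acme Corp",
    "pricing_plans": [
        {"name": "Starter", "price_monthly": 9,
         "features": ["5 projects"], "is_popular": False},
    ],
    "total_plans": 1,
}

print(matches(pricing_schema, response))  # True
```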

Why schemas beat CSS selectors

Selectors break when HTML changes

python

# This breaks when the site redesigns
price = soup.select_one('.pricing-card .price-amount span.dollars').text

The class name changes, the nesting changes, the tag changes. Every selector is a bet that the HTML structure stays the same. That bet loses regularly.

Schemas are intent-based

json

{"price_monthly": "number"}

This says "I want the monthly price as a number." It doesn't say where on the page it is, what HTML element contains it, or what class name wraps it. The AI extraction layer figures that out. For more on why schemas work well as a way to describe what you want from a page, see JSON schemas as an extraction language.

When the site redesigns, the price is still on the page -- just in a different <div>. Schema-based extraction keeps working because it's looking for meaning, not DOM structure.

Selectors require per-site maintenance

Every target site needs its own set of selectors, and every layout change on any of those sites means another scraper to patch. Schema-based extraction uses the same schema across different sites that have similar data.

python

# Same schema works on any pricing page.
# extract() is shorthand for a POST to the /json endpoint
# (a full Python client appears later in this article).
pricing_schema = {
    "plans": [{
        "name": "string",
        "price": "number",
        "features": ["string"]
    }]
}

# Works on site A
data_a = extract("https://site-a.com/pricing", pricing_schema)

# Works on site B with completely different HTML
data_b = extract("https://site-b.com/pricing", pricing_schema)

Practical examples

E-commerce product extraction

For a complete walkthrough of extracting product catalogs at scale, see our guide on extracting product data from e-commerce sites.

python

product_schema = {
    "name": "string",
    "price": "number",
    "original_price": "number",
    "discount_percentage": "number",
    "rating": "number",
    "review_count": "integer",
    "in_stock": "boolean",
    "variants": [{
        "color": "string",
        "size": "string",
        "available": "boolean"
    }],
    "description": "string",
    "specifications": [{"key": "string", "value": "string"}]
}
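Because the fields come back typed, downstream code can use them directly in arithmetic, with no "$" stripping or string parsing. A quick sketch using a sample payload of the shape above (the values are made up, not real API output):

```python
# Work with a typed product payload directly.
# Sample data stands in for a real API response.
product = {
    "name": "Widget Pro",
    "price": 79.0,
    "original_price": 99.0,
    "rating": 4.6,
    "review_count": 312,
    "in_stock": True,
    "variants": [
        {"color": "black", "size": "M", "available": True},
        {"color": "black", "size": "L", "available": False},
    ],
}

# Derive the discount from the typed prices.
discount_pct = round(100 * (1 - product["price"] / product["original_price"]))

# Count variants actually in stock.
available = [v for v in product["variants"] if v["available"]]

print(discount_pct)    # 20
print(len(available))  # 1
```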

Job listing extraction

python

job_schema = {
    "title": "string",
    "company": "string",
    "location": "string",
    "salary_range": {
        "min": "number",
        "max": "number",
        "currency": "string"
    },
    "employment_type": "string",
    "experience_required": "string",
    "skills": ["string"],
    "posted_date": "string",
    "remote": "boolean"
}
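Typed booleans and nested salary objects make filtering a batch of extracted listings straightforward. A sketch over sample data (stand-ins for real API responses; note the null salary_range, which is how a missing field arrives):

```python
# Filter extracted job listings using typed fields.
# Sample data stands in for real API responses.
jobs = [
    {"title": "Backend Engineer", "remote": True,
     "salary_range": {"min": 120000, "max": 160000, "currency": "USD"},
     "skills": ["Python", "PostgreSQL"]},
    {"title": "Data Analyst", "remote": False,
     "salary_range": {"min": 80000, "max": 100000, "currency": "USD"},
     "skills": ["SQL"]},
    {"title": "Platform Engineer", "remote": True,
     "salary_range": None,  # field missing on the page -> null
     "skills": ["Go", "Kubernetes"]},
]

def min_salary(job):
    rng = job["salary_range"]
    return rng["min"] if rng else 0

remote_jobs = [j for j in jobs if j["remote"] and min_salary(j) >= 100000]
print([j["title"] for j in remote_jobs])  # ['Backend Engineer']
```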

News article extraction

python

article_schema = {
    "headline": "string",
    "author": "string",
    "published_date": "string",
    "summary": "string",
    "body_text": "string",
    "tags": ["string"],
    "related_articles": [{
        "title": "string",
        "url": "string"
    }]
}

SaaS comparison extraction

python

comparison_schema = {
    "tools": [{
        "name": "string",
        "pricing": "string",
        "free_tier": "boolean",
        "key_features": ["string"],
        "best_for": "string",
        "limitations": ["string"]
    }]
}

Python, JavaScript, and curl examples

Python

python

import requests

def extract_json(url: str, schema: dict) -> dict:
    response = requests.post(
        "https://lastcrawler.xyz/api/json",
        json={"url": url, "schema": schema}
    )
    response.raise_for_status()
    return response.json()

# Usage
data = extract_json(
    "https://example.com/products/widget",
    {"name": "string", "price": "number", "in_stock": "boolean"}
)
print(f"{data['name']}: ${data['price']}")
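In production you'll probably also want a timeout on the request (requests supports a timeout= argument on post) and simple retries for transient failures. A generic retry helper, sketched with arbitrary backoff values and demonstrated on a stub rather than a live call:

```python
import time

def with_retries(fn, attempts=3, backoff=0.5):
    """Call fn(), retrying on exceptions with linear backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (attempt + 1))

# Usage with an extraction helper like extract_json above:
# data = with_retries(lambda: extract_json(url, schema))

# Demonstration with a stub that fails twice, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return {"ok": True}

print(with_retries(flaky, backoff=0.01))  # {'ok': True}
```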

JavaScript / TypeScript

typescript

async function extractJson(url: string, schema: Record<string, unknown>) {
  const response = await fetch("https://lastcrawler.xyz/api/json", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, schema }),
  });
  return response.json();
}

// Usage
const data = await extractJson(
  "https://example.com/products/widget",
  { name: "string", price: "number", in_stock: "boolean" }
);
console.log(`${data.name}: $${data.price}`);

curl

bash

curl -s -X POST https://lastcrawler.xyz/api/json \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/products/widget", "schema": {"name": "string", "price": "number"}}' \
  | jq .

When to use JSON extraction vs. other formats

| Format | Use when | Use case |
| --- | --- | --- |
| JSON (/json) | You need specific fields with types | Structured data extraction |
| Markdown (/markdown) | You need readable content for LLMs | Content/article extraction |
| Screenshot (/screenshot) | You need visual capture | Visual testing, archiving |
| Raw HTML (/scrape) | You want to parse HTML yourself | Custom parsing pipelines |
| Links (/links) | You need to discover pages | Sitemap building, crawling |

FAQ

Q: How does the AI know where the data is on the page?

A: The extraction layer renders the page in a real browser, then uses AI to identify which visible content matches each field in your schema. It's not pattern-matching HTML -- it reads the rendered page and understands what the content means.

Q: What happens if a field doesn't exist on the page?

A: Fields that can't be found are returned as null. The response always matches your schema shape; missing data doesn't break the structure.
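In practice that means downstream code should treat every field as possibly None. A small sketch of defensive access over a sample payload (not real output):

```python
# Fields the extractor couldn't find come back as null (None in Python).
# Guard before using them.
data = {
    "name": "Widget Pro",
    "price": 79.0,
    "original_price": None,  # not shown on the page
    "in_stock": True,
}

price = data.get("price")
original = data.get("original_price")

# Only compute a discount when both prices are present.
if price is not None and original is not None:
    label = f"{round(100 * (1 - price / original))}% off"
else:
    label = "no discount info"

print(label)  # no discount info
```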

Q: Can I extract data from pages that require JavaScript to render?

A: Yes. Last Crawler uses real Chrome instances via edge-native browser rendering. JavaScript-heavy SPAs, client-rendered content, and dynamically loaded data are all rendered before extraction.

Q: How does this compare to OpenAI's structured output?

A: Different layer. OpenAI's structured output constrains LLM generation to match a schema. Last Crawler's JSON extraction reads a real webpage and extracts existing data into a schema. You'd use Last Crawler to get the data, then optionally use OpenAI to reason about it.
