2026-03-27
URL to JSON API: Extract Structured Data from Any Webpage
The traditional web scraping workflow looks like this: fetch the page, parse the HTML with BeautifulSoup or Cheerio, find the right CSS selectors, extract the text, clean it, handle edge cases, and hope the site doesn't change its layout next week.
Schema-based extraction flips this. You define the shape of the data you want as a JSON schema. The API figures out where that data lives on the page and returns typed JSON matching your schema. No selectors, no parsing, no maintenance when the site changes its HTML. This works particularly well for AI agents and autonomous workflows that need structured web data.
How it works
1. Define your schema
A schema describes the structure you want, not where to find it:
```json
{
  "schema": {
    "company_name": "string",
    "pricing_plans": [{
      "name": "string",
      "price_monthly": "number",
      "features": ["string"],
      "is_popular": "boolean"
    }],
    "total_plans": "integer"
  }
}
```
This is a contract between you and the API. You'll get back JSON that matches this exact shape, with the right types.
2. Send the request
```bash
curl -X POST https://lastcrawler.xyz/api/json \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "schema": {
      "company_name": "string",
      "pricing_plans": [{
        "name": "string",
        "price_monthly": "number",
        "features": ["string"],
        "is_popular": "boolean"
      }],
      "total_plans": "integer"
    }
  }'
```
3. Get typed JSON back
```json
{
  "company_name": "Acme Corp",
  "pricing_plans": [
    {
      "name": "Starter",
      "price_monthly": 9,
      "features": ["5 projects", "1GB storage", "Email support"],
      "is_popular": false
    },
    {
      "name": "Pro",
      "price_monthly": 29,
      "features": ["Unlimited projects", "100GB storage", "Priority support", "API access"],
      "is_popular": true
    },
    {
      "name": "Enterprise",
      "price_monthly": 99,
      "features": ["Everything in Pro", "SSO", "Dedicated account manager", "Custom integrations"],
      "is_popular": false
    }
  ],
  "total_plans": 3
}
```
Notice: price_monthly is a number, not "$29". is_popular is a boolean, not "Yes". features is an array. The API handles the type coercion.
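Because the response is typed, you can sanity-check it client-side before trusting it downstream. Here's a minimal sketch of such a check; the `check_types` helper and the type map are illustrative, not part of the API:

```python
# Map the schema's type names onto Python types.
TYPE_MAP = {
    "string": str,
    "number": (int, float),  # "number" accepts ints and floats
    "integer": int,
    "boolean": bool,
}

def check_types(schema, data):
    """Recursively verify that `data` matches the schema's shape and types.

    None is accepted for any leaf field, since missing data comes back as null.
    """
    if isinstance(schema, str):
        return data is None or isinstance(data, TYPE_MAP[schema])
    if isinstance(schema, list):  # [item_schema] means "array of item_schema"
        return isinstance(data, list) and all(check_types(schema[0], d) for d in data)
    if isinstance(schema, dict):
        return isinstance(data, dict) and all(
            check_types(sub, data.get(key)) for key, sub in schema.items()
        )
    return False

schema = {
    "company_name": "string",
    "pricing_plans": [{"name": "string", "price_monthly": "number",
                       "features": ["string"], "is_popular": "boolean"}],
    "total_plans": "integer",
}
response = {
    "company_name": "Acme Corp",
    "pricing_plans": [{"name": "Pro", "price_monthly": 29,
                       "features": ["API access"], "is_popular": True}],
    "total_plans": 3,
}
print(check_types(schema, response))  # True
```

A failed check is a strong signal that the page changed in a way worth investigating, even though the API keeps the schema shape intact.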
Why schemas beat CSS selectors
Selectors break when HTML changes
```python
# This breaks when the site redesigns
price = soup.select_one('.pricing-card .price-amount span.dollars').text
```
The class name changes, the nesting changes, the tag changes. Every selector is a bet that the HTML structure stays the same. That bet loses regularly.
Schemas are intent-based
```json
{"price_monthly": "number"}
```
This says "I want the monthly price as a number." It doesn't say where on the page it is, what HTML element contains it, or what class name wraps it. The AI extraction layer figures that out. For more on why schemas work well as a way to describe what you want from a page, see JSON schemas as an extraction language.
When the site redesigns, the price is still on the page -- just in a different <div>. Schema-based extraction keeps working because it's looking for meaning, not DOM structure.
Selectors require per-site maintenance
Every target site needs its own set of selectors, so ten sites means ten scrapers, each breaking on its own schedule. Schema-based extraction reuses the same schema across different sites that expose similar data.
```python
# Same schema works on any pricing page
pricing_schema = {
    "plans": [{
        "name": "string",
        "price": "number",
        "features": ["string"]
    }]
}

# Works on site A
data_a = extract("https://site-a.com/pricing", pricing_schema)

# Works on site B with completely different HTML
data_b = extract("https://site-b.com/pricing", pricing_schema)
```
Practical examples
E-commerce product extraction
For a complete walkthrough of extracting product catalogs at scale, see our guide on extracting product data from e-commerce sites.
```python
product_schema = {
    "name": "string",
    "price": "number",
    "original_price": "number",
    "discount_percentage": "number",
    "rating": "number",
    "review_count": "integer",
    "in_stock": "boolean",
    "variants": [{
        "color": "string",
        "size": "string",
        "available": "boolean"
    }],
    "description": "string",
    "specifications": [{"key": "string", "value": "string"}]
}
```
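Numeric fields make downstream arithmetic trivial. For instance, you can cross-check the reported discount against the two extracted prices; the record below uses illustrative sample values in the shape the product schema would return:

```python
# Sample record shaped like a product_schema response (values illustrative)
product = {
    "name": "Widget",
    "price": 79.0,
    "original_price": 99.0,
    "discount_percentage": 20.2,
    "in_stock": True,
}

def implied_discount(p):
    """Discount implied by the two prices, as a percentage."""
    return round((1 - p["price"] / p["original_price"]) * 100, 1)

print(implied_discount(product))  # 20.2, matching discount_percentage
```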
Job listing extraction
```python
job_schema = {
    "title": "string",
    "company": "string",
    "location": "string",
    "salary_range": {
        "min": "number",
        "max": "number",
        "currency": "string"
    },
    "employment_type": "string",
    "experience_required": "string",
    "skills": ["string"],
    "posted_date": "string",
    "remote": "boolean"
}
```
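Once listings come back as typed records, filtering and ranking are plain Python. A quick sketch over illustrative sample records:

```python
# Sample records shaped like job_schema responses (values illustrative)
jobs = [
    {"title": "Backend Engineer", "remote": True,
     "salary_range": {"min": 120000, "max": 160000, "currency": "USD"}},
    {"title": "Office Manager", "remote": False,
     "salary_range": {"min": 55000, "max": 70000, "currency": "USD"}},
]

# Remote roles, sorted by salary midpoint (highest first)
remote = sorted(
    (j for j in jobs if j["remote"]),
    key=lambda j: (j["salary_range"]["min"] + j["salary_range"]["max"]) / 2,
    reverse=True,
)
print([j["title"] for j in remote])  # ['Backend Engineer']
```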
News article extraction
```python
article_schema = {
    "headline": "string",
    "author": "string",
    "published_date": "string",
    "summary": "string",
    "body_text": "string",
    "tags": ["string"],
    "related_articles": [{
        "title": "string",
        "url": "string"
    }]
}
```
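Note that `published_date` comes back as a string, so normalize it yourself. A sketch assuming the page yields an ISO-8601 date (real pages may need explicit format strings or a library like dateutil):

```python
from datetime import datetime

# Sample record shaped like an article_schema response (values illustrative)
article = {"headline": "Acme raises Series B", "published_date": "2026-03-27"}

published = datetime.fromisoformat(article["published_date"])
print(published.year, published.month)  # 2026 3
```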
SaaS comparison extraction
```python
comparison_schema = {
    "tools": [{
        "name": "string",
        "pricing": "string",
        "free_tier": "boolean",
        "key_features": ["string"],
        "best_for": "string",
        "limitations": ["string"]
    }]
}
```
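With typed booleans and arrays, building a shortlist from a comparison page is a one-liner. Sample data below is illustrative:

```python
# Sample records shaped like comparison_schema's "tools" (values illustrative)
tools = [
    {"name": "AlphaTool", "free_tier": True, "key_features": ["API", "Webhooks"]},
    {"name": "BetaTool", "free_tier": False, "key_features": ["SSO"]},
]

# Tools that offer a free tier
free = [t["name"] for t in tools if t["free_tier"]]
print(free)  # ['AlphaTool']
```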
Python, JavaScript, and curl examples
Python
```python
import requests

def extract_json(url: str, schema: dict) -> dict:
    response = requests.post(
        "https://lastcrawler.xyz/api/json",
        json={"url": url, "schema": schema}
    )
    response.raise_for_status()
    return response.json()

# Usage
data = extract_json(
    "https://example.com/products/widget",
    {"name": "string", "price": "number", "in_stock": "boolean"}
)
print(f"{data['name']}: ${data['price']}")
```
JavaScript / TypeScript
```typescript
async function extractJson(url: string, schema: Record<string, unknown>) {
  const response = await fetch("https://lastcrawler.xyz/api/json", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url, schema }),
  });
  if (!response.ok) {
    throw new Error(`Extraction failed: ${response.status}`);
  }
  return response.json();
}

// Usage
const data = await extractJson(
  "https://example.com/products/widget",
  { name: "string", price: "number", in_stock: "boolean" }
);
console.log(`${data.name}: $${data.price}`);
```
curl
```bash
curl -s -X POST https://lastcrawler.xyz/api/json \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com/products/widget", "schema": {"name": "string", "price": "number"}}' \
  | jq .
```
When to use JSON extraction vs. other formats
| Format | Use when | Example use cases |
|---|---|---|
| JSON (/json) | You need specific fields with types | Structured data extraction |
| Markdown (/markdown) | You need readable content for LLMs | Content/article extraction |
| Screenshot (/screenshot) | You need visual capture | Visual testing, archiving |
| Raw HTML (/scrape) | You want to parse HTML yourself | Custom parsing pipelines |
| Links (/links) | You need to discover pages | Sitemap building, crawling |
FAQ
Q: How does the AI know where the data is on the page?
A: The extraction layer renders the page in a real browser, then uses AI to identify which visible content matches each field in your schema. It's not pattern-matching HTML -- it reads the rendered page and understands what the content means.
Q: What happens if a field doesn't exist on the page?
A: Fields that can't be found are returned as null. The response always matches your schema shape; missing data doesn't break the structure.
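In practice this means guarding optional fields before using them. A small sketch in Python, with illustrative sample data:

```python
# A response where the page had no ratings section: those fields arrive as None
record = {"name": "Widget", "rating": None, "review_count": None}

# Format with a fallback instead of assuming the field is present
rating = "no ratings yet" if record["rating"] is None else f"{record['rating']}/5"
print(f"{record['name']}: {rating}")  # Widget: no ratings yet
```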
Q: Can I extract data from pages that require JavaScript to render?
A: Yes. Last Crawler uses real Chrome instances via edge-native browser rendering. JavaScript-heavy SPAs, client-rendered content, and dynamically loaded data are all rendered before extraction.
Q: How does this compare to OpenAI's structured output?
A: Different layer. OpenAI's structured output constrains LLM generation to match a schema. Last Crawler's JSON extraction reads a real webpage and extracts existing data into a schema. You'd use Last Crawler to get the data, then optionally use OpenAI to reason about it.