lastcrawler.xyz

2026-03-05

JSON, AI Extraction, Developer Tools

JSON Schema Web Scraping: Turn Any Website into a Typed JSON API

SQL lets you query databases. GraphQL lets you query APIs. What lets you query webpages?

Until recently, the answer was "a mess of HTTP requests, HTML parsing, and regex." But there's a better primitive emerging: JSON schemas as a declarative extraction language.

The idea is simple

You describe the shape of data you want. The system returns data in that shape. You don't specify where on the page the data lives or how to extract it. Just what you want.

json

{
  "schema": {
    "company": "string",
    "founded": "number",
    "employees": "string",
    "headquarters": "string",
    "products": [{
      "name": "string",
      "category": "string"
    }]
  }
}

Point this at any company's about page and you get structured data back. No selectors, no parsing logic, no site-specific code. This is the same principle behind our URL to JSON API — send a URL and a schema, get typed data back.
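To make "typed data back" concrete, here is a minimal sketch of a checker for the shorthand schema format shown above. It is an illustration, not the service's implementation: the `Schema` type and `findMismatches` are hypothetical names, and the `"boolean"` type is an assumption, since the example above only uses `"string"` and `"number"`.

```typescript
// Shorthand schema as used in this article: a type name, a one-element
// array describing every item, or an object of nested schemas.
// ("boolean" is an assumption; the article only shows string and number.)
type Schema =
  | "string"
  | "number"
  | "boolean"
  | [Schema]
  | { [key: string]: Schema };

// Returns human-readable mismatches; an empty list means the value fits.
function findMismatches(value: unknown, schema: Schema, path: string = "$"): string[] {
  // Leaf: compare the runtime type with the declared type name.
  if (typeof schema === "string") {
    return typeof value === schema
      ? []
      : [`${path}: expected ${schema}, got ${typeof value}`];
  }

  // Array: every item must match the single item schema.
  if (Array.isArray(schema)) {
    if (!Array.isArray(value)) return [`${path}: expected array`];
    const problems: string[] = [];
    for (let i = 0; i < value.length; i++) {
      problems.push(...findMismatches(value[i], schema[0], `${path}[${i}]`));
    }
    return problems;
  }

  // Object: check each declared field; undeclared extra fields are ignored.
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return [`${path}: expected object`];
  }
  const record = value as Record<string, unknown>;
  const problems: string[] = [];
  for (const key of Object.keys(schema)) {
    problems.push(...findMismatches(record[key], schema[key], `${path}.${key}`));
  }
  return problems;
}
```

Running a check like this on every response gives you one place to catch a wrong type or a missing field before it reaches the rest of your application.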

Why schemas beat prompts

You could ask an LLM "extract the company info from this page" in natural language. Sometimes it works. But the output format is unpredictable — sometimes you get markdown, sometimes JSON, sometimes a mix. Field names vary. Types are inconsistent. Nested structures are flattened or hallucinated.

Schemas fix this by constraining the output space. The AI knows what fields to look for, what types they should be, and how they nest. You get predictable structure with AI doing the hard part.

javascript

// Natural language: unpredictable output
const freeform = await llm("Extract company info from: " + pageContent);
// Could be anything

// Schema: predictable, typed output
const company = await crawler.json(url, { schema: companySchema });
// Always matches your schema
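For contrast, here is a small sketch of the defensive parsing a prompt-only pipeline tends to accumulate. `parseLooseJson` is a hypothetical helper, not part of any library: it pulls the outermost `{...}` span out of a free-form reply and gives up when that span is not valid JSON.

```typescript
// Try to recover a JSON object from a free-form model reply that may wrap
// it in prose or markdown. Returns undefined when nothing parseable is found.
function parseLooseJson(raw: string): unknown {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) return undefined;
  try {
    return JSON.parse(raw.slice(start, end + 1));
  } catch {
    // The braces were there, but the text between them was not valid JSON.
    return undefined;
  }
}
```

Even when this succeeds, it says nothing about field names or types. That gap is the part a schema is meant to close.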

Schemas compose naturally

Need data from multiple pages? Your schemas compose. Extract a list of product URLs from a category page, then extract product data from each product page. The output of one extraction feeds the input of the next.

javascript

// Step 1: Get product URLs
const listing = await crawler.json(categoryUrl, {
  schema: { products: [{ url: "string", name: "string" }] }
});

// Step 2: Get details for each
const details = await Promise.all(
  listing.products.map(p =>
    crawler.json(p.url, {
      schema: {
        name: "string",
        price: "number",
        specs: [{ key: "string", value: "string" }],
        reviews_count: "number",
        avg_rating: "number"
      }
    })
  )
);

This is a query plan, not a scraping script. The difference matters: each step declares what it should return, and nothing in it depends on a particular page's markup.
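One practical caveat with the two-step plan: `Promise.all` starts every request at once and rejects as soon as any single extraction fails. A minimal sketch of one alternative is below, processing items in fixed-size batches and recording failures per item. `chunk`, `mapInBatches`, and `Settled` are hypothetical names introduced for illustration, and the batch size is something you would tune yourself.

```typescript
// Per-item outcome, so one failed page does not discard the others.
type Settled<T> = { ok: true; value: T } | { ok: false; error: unknown };

// Split a list into consecutive batches of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError("size must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Run `task` over every item, one batch at a time. Results keep input order.
async function mapInBatches<T, R>(
  items: T[],
  size: number,
  task: (item: T) => Promise<R>
): Promise<Settled<R>[]> {
  const results: Settled<R>[] = [];
  for (const batch of chunk(items, size)) {
    const settled = await Promise.all(
      batch.map((item) =>
        task(item).then(
          (value): Settled<R> => ({ ok: true, value }),
          (error: unknown): Settled<R> => ({ ok: false, error })
        )
      )
    );
    results.push(...settled);
  }
  return results;
}
```

With the example above, step 2 could become something like `mapInBatches(listing.products, 5, p => crawler.json(p.url, { schema: productSchema }))`, where `productSchema` stands for the inline product schema.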

Type safety all the way down

When your extraction is schema-driven, you get TypeScript types for free. Define your schema once, infer the type, and get autocomplete and type checking throughout your application.

Your data pipeline goes from "parse HTML, hope for the best, add runtime checks everywhere" to "define schema, get typed data, use it with confidence."
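Here is one way "define your schema once, infer the type" could look for the shorthand schema format used in this article. This is a sketch under assumptions: `Infer` is a hypothetical helper type written for illustration, and whether a given SDK ships an equivalent is not something this post specifies.

```typescript
// Map the shorthand schema to a TypeScript type at compile time.
type Infer<S> =
  S extends "string" ? string :
  S extends "number" ? number :
  S extends "boolean" ? boolean :
  S extends readonly [infer Item] ? Infer<Item>[] :
  S extends object ? { -readonly [K in keyof S]: Infer<S[K]> } :
  never;

// `as const` keeps the literal type names ("string", "number") so they can be inferred from.
const companySchema = {
  company: "string",
  founded: "number",
  products: [{ name: "string", category: "string" }],
} as const;

// Resolves to:
// { company: string; founded: number; products: { name: string; category: string }[] }
type Company = Infer<typeof companySchema>;

// Code that consumes extracted data gets autocomplete and type checking.
function describeCompany(c: Company): string {
  return `${c.company} (founded ${c.founded}) sells ${c.products.length} product(s)`;
}

const acme: Company = {
  company: "Acme",
  founded: 1999,
  products: [{ name: "Anvil", category: "Tools" }],
};

// Rejected by the compiler, because `founded` must be a number:
// const broken: Company = { company: "Acme", founded: "1999", products: [] };
```

The inferred type only describes what the schema promises. A runtime check on the response is still what confirms the data actually arrived in that shape.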

JSON schema web scraping turns the web into a queryable database

We think of the web as a collection of documents. But with schema-driven extraction, it starts to look more like a database. Any URL is a table. Your schema is a SELECT statement. The AI is the query engine.

That mental model changes how you build applications. Instead of maintaining scrapers for each data source, you maintain schemas. Schemas are declarative, versionable, testable, and composable. Hand-written scrapers, tied to each site's markup, rarely manage all four. This is why schema-driven extraction is showing up as the base layer in web scraping APIs built for AI agents.

FAQ

What is JSON schema web scraping?

It's an approach where you define the shape of data you want as a JSON schema — field names, types, and nesting — and an AI extracts exactly that from any webpage. You describe what you want, not where it is on the page or how to parse it.

How does JSON schema extraction differ from using an LLM prompt to extract data?

Prompts produce unpredictable output formats — sometimes JSON, sometimes markdown, sometimes a mix, with inconsistent field names and types. A schema constrains the output space, so the AI always returns data that matches your exact structure. It's the difference between "extract this" and "extract this into exactly this shape."

Can I use this to turn a website into a JSON API?

Yes. Point a schema at any URL and you get back structured JSON in the shape you defined. Wrap that in a simple API route and you have a fully typed endpoint backed by live web data — no official API from the site needed.
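As a rough sketch of that wrapper, the route logic can be written independently of any web framework. Everything here is illustrative: `Extractor` is a hypothetical signature standing in for a call like `crawler.json`, and `companyRoute` and `isHttpUrl` are made-up names.

```typescript
// Hypothetical extractor signature, standing in for a call like crawler.json.
type Extractor = (url: string, options: { schema: unknown }) => Promise<unknown>;

type RouteResponse = { status: number; body: unknown };

const companySchema = { company: "string", founded: "number", headquarters: "string" };

// Only forward absolute http(s) URLs to the extractor.
function isHttpUrl(candidate: string): boolean {
  return /^https?:\/\/[^\s/]+/i.test(candidate);
}

// Transport-agnostic route logic: takes the `url` query parameter and returns
// a status code plus a JSON-serializable body.
async function companyRoute(
  target: string | undefined,
  extract: Extractor
): Promise<RouteResponse> {
  if (target === undefined || !isHttpUrl(target)) {
    return {
      status: 400,
      body: { error: "Query parameter 'url' must be an absolute http(s) URL" },
    };
  }
  try {
    const data = await extract(target, { schema: companySchema });
    return { status: 200, body: data };
  } catch {
    // Upstream page or extraction failure: report it as a bad gateway.
    return { status: 502, body: { error: "Extraction failed" } };
  }
}
```

Any HTTP framework can call `companyRoute` from its handler and serialize the returned body. Keeping the logic separate from the transport also makes it easy to test with a stub extractor.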

Last Crawler
