lastcrawler.xyz

2026-03-17

5 min read

Tutorial · Ecommerce · JSON Extraction

How to Extract Product Data from Any Ecommerce Site

Every ecommerce site has product data. Getting it out in a consistent, structured format is a different problem entirely. Sites use different URL patterns, different page structures, different frameworks. Some render server-side, some are full SPAs. Some show prices in structured markup, some in inline JavaScript, some generated dynamically from inventory APIs.

The traditional answer is: write a custom scraper per site. Identify the selectors, handle the edge cases, maintain it forever. This is how you end up with 40 scrapers that each break on a different schedule.

The better approach: define what you want, once, and let AI figure out where it lives on each page. This is exactly what a URL to JSON API does — you send a URL and a schema, and get structured data back.

Step 1: Define Your Product Schema

Start by describing the data you want. Don't think about where it lives on the page — just define the shape of the output.

```json
{
  "name": "string",
  "price": "number",
  "currency": "string",
  "in_stock": "boolean",
  "sku": "string",
  "brand": "string",
  "rating": "number",
  "review_count": "number",
  "images": ["string"],
  "description": "string",
  "specs": {
    "weight": "string",
    "dimensions": "string",
    "color": "string",
    "material": "string"
  }
}
```

Use the most specific types you can. price: "number" is better than price: "string" because it forces numeric output you can sort and compare. in_stock: "boolean" is better than availability: "string" because you don't have to parse "In Stock", "Ships in 3-5 days", "Only 2 left" into a boolean yourself.
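To see what the boolean type saves you, here's a hypothetical sketch of the string parsing you would otherwise write yourself (the phrasings are examples, not an exhaustive list):

```javascript
// Hypothetical sketch: the normalization a string availability field forces
// on you. With in_stock: "boolean" in the schema, none of this is needed.
function parseAvailability(text) {
  const t = text.trim().toLowerCase();
  // Phrases that mean the product cannot be purchased right now
  if (t.includes('out of stock') || t.includes('sold out') || t.includes('unavailable')) {
    return false;
  }
  // "In Stock", "Ships in 3-5 days", "Only 2 left" all mean purchasable
  return true;
}
```

Every store phrases availability differently, so this list is never done — which is exactly why pushing the interpretation into the schema is the better deal.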

Step 2: Extract from a Single Product Page

With your schema defined, extracting from a product URL is one API call:

```javascript
import LastCrawler from 'last-crawler';

const client = new LastCrawler({ apiKey: process.env.LAST_CRAWLER_API_KEY });

const product = await client.json('https://example-store.com/products/widget-pro', {
  schema: {
    name: 'string',
    price: 'number',
    currency: 'string',
    in_stock: 'boolean',
    sku: 'string',
    brand: 'string',
    rating: 'number',
    review_count: 'number',
    images: ['string'],
    description: 'string'
  }
});

console.log(product);
// {
//   name: "Widget Pro 3000",
//   price: 49.99,
//   currency: "USD",
//   in_stock: true,
//   sku: "WP-3000-BLK",
//   brand: "Acme",
//   rating: 4.3,
//   review_count: 847,
//   images: ["https://..."],
//   description: "..."
// }
```

The same schema works on a different store's product page. You don't rewrite the schema — you just change the URL.

Step 3: Crawl a Category Page for Product URLs

Before you can extract all products, you need their URLs. Category pages list products — extract the links:

```javascript
const categoryLinks = await client.json(
  'https://example-store.com/category/widgets',
  {
    schema: {
      products: [
        {
          url: 'string',
          name: 'string',
          price: 'number'
        }
      ],
      next_page: 'string'
    }
  }
);

const productUrls = categoryLinks.products.map(p => p.url);
// Also captures price/name from the listing as a bonus
// next_page gives you the pagination URL if there is one
```

Repeat with next_page until it's null to get all pages.
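That loop can be sketched as a small helper. Here fetchPage stands in for a call to client.json with the category schema above; the helper's name and the maxPages guard are our additions:

```javascript
// Sketch of the pagination loop. fetchPage is any function that takes a URL
// and resolves to { products: [...], next_page: string | null } — in
// practice, a thin wrapper around client.json with the category schema.
async function collectProductUrls(fetchPage, startUrl, maxPages = 100) {
  const urls = [];
  let pageUrl = startUrl;
  while (pageUrl && maxPages-- > 0) { // maxPages guards against pagination loops
    const page = await fetchPage(pageUrl);
    urls.push(...page.products.map(p => p.url));
    pageUrl = page.next_page; // null (or missing) on the last page
  }
  return urls;
}
```

The cap matters more than it looks: a site that links page 1 from its last page would otherwise keep you crawling forever.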

Step 4: Batch Extract All Products

With a list of product URLs, run extractions in parallel with a concurrency limit:

```javascript
import pLimit from 'p-limit';

const limit = pLimit(5); // 5 concurrent requests

// productSchema is the product schema object defined in Step 2
const results = await Promise.all(
  productUrls.map(url =>
    limit(() =>
      client.json(url, { schema: productSchema })
        .then(data => ({ url, data, error: null }))
        .catch(err => ({ url, data: null, error: err.message }))
    )
  )
);

const successful = results.filter(r => r.error === null);
const failed = results.filter(r => r.error !== null);

console.log(`Extracted ${successful.length} products, ${failed.length} failed`);
```

Failed extractions are worth retrying once before giving up — occasional rendering failures happen, and a single retry resolves most of them.
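A single retry pass over the failed list can be sketched like this (retryFailed is our name; extract stands in for client.json with your schema):

```javascript
// One retry pass over failed extractions before giving up for good.
// extract is any function that takes a URL and resolves to extracted data.
async function retryFailed(extract, failed) {
  const recovered = [];
  const stillFailed = [];
  for (const { url } of failed) {
    try {
      recovered.push({ url, data: await extract(url), error: null });
    } catch (err) {
      stillFailed.push({ url, data: null, error: err.message });
    }
  }
  return { recovered, stillFailed };
}
```

Retries run sequentially here on purpose: the original failures often came from load, so there's no reason to re-parallelize them.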

Handling Variations

Different site structures: The schema stays the same. The AI reads the page by meaning, so whether the price is in a <span class="price">, a JSON-LD block, or a React component's state doesn't matter. If the data is visible on the page, it gets extracted.

Single-page applications: Sites built with React, Vue, or Next.js that load content dynamically are handled automatically. The crawler waits for the page to fully render before extracting.

Sites requiring auth or session cookies: Pass cookies in the request options. Most product pages are publicly accessible, but if you're extracting account-specific pricing or inventory, you can authenticate first.

Products with variants (size, color): Capture the variant data explicitly in your schema:

```json
{
  "name": "string",
  "base_price": "number",
  "variants": [
    {
      "name": "string",
      "sku": "string",
      "price": "number",
      "in_stock": "boolean"
    }
  ]
}
```
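Once variants are captured, downstream logic is plain array work. For instance, a sketch of picking the cheapest purchasable variant (cheapestInStock is our name, operating on the variant shape above):

```javascript
// Find the cheapest in-stock variant of an extracted product, or null if
// every variant is sold out. Operates on the variants schema shown above.
function cheapestInStock(product) {
  const available = product.variants.filter(v => v.in_stock);
  if (available.length === 0) return null;
  return available.reduce((best, v) => (v.price < best.price ? v : best));
}
```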

Multi-currency sites: Include currency in your schema. The AI will extract whichever currency is displayed. If you need a specific currency, filter by it or use a geo-targeted request.
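Filtering a batch down to one currency is a one-liner over the result objects from the batch step (filterByCurrency is our name):

```javascript
// Keep only results whose extracted data reports the given currency.
// Each result has the { url, data, error } shape from the batch step.
function filterByCurrency(results, currency) {
  return results.filter(r => r.data && r.data.currency === currency);
}
```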

Real-World Schema Patterns

Electronics:

```json
{
  "name": "string",
  "price": "number",
  "currency": "string",
  "brand": "string",
  "model_number": "string",
  "specs": {
    "processor": "string",
    "ram": "string",
    "storage": "string",
    "display": "string",
    "battery_life": "string"
  },
  "in_stock": "boolean",
  "warranty": "string"
}
```

Apparel:

```json
{
  "name": "string",
  "price": "number",
  "sale_price": "number",
  "brand": "string",
  "available_sizes": ["string"],
  "available_colors": ["string"],
  "material": "string",
  "care_instructions": "string",
  "in_stock": "boolean"
}
```

Marketplace listings (eBay, Etsy):

```json
{
  "title": "string",
  "price": "number",
  "condition": "string",
  "seller_name": "string",
  "seller_rating": "number",
  "location": "string",
  "shipping_cost": "number",
  "quantity_available": "number"
}
```

The rule is: define what you actually need. Don't try to capture everything. A focused schema produces cleaner output than one with 30 optional fields. For a broader look at how teams use this for pricing and market analysis, see our guide on competitive intelligence with crawling.

FAQ

Can I extract product data from any ecommerce website?

For most publicly accessible product pages, yes. The main exceptions are sites that require login to view products, or sites that actively block all automated access. For the vast majority of ecommerce pages — even JavaScript-heavy ones — structured extraction works reliably.

How do I handle ecommerce sites that block scrapers?

Traditional scrapers get blocked because they look like bots: they use datacenter IPs, they don't render JavaScript, they request pages too fast with no browser fingerprint. AI-powered extraction runs in real browsers with normal rendering, which avoids most bot detection. You still shouldn't hammer a site with hundreds of concurrent requests — use a reasonable concurrency limit and add delays if you see rate limit responses.
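A simple way to add those delays is exponential backoff around each request (withBackoff and the timings are illustrative, not part of the client):

```javascript
// Retry a request with exponential backoff: wait 1s, 2s, 4s... between
// attempts. fn is any function returning a promise, e.g. a client.json call.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function withBackoff(fn, retries = 3, baseMs = 1000) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // out of attempts, surface the error
      await sleep(baseMs * 2 ** attempt);
    }
  }
}
```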

Is this approach better than using a site's official API?

If a site has a product API, use it. Official APIs are stable, documented, and legal. But most ecommerce sites don't expose product data via API, or their API requires a partnership agreement you can't get. For sites without APIs, schema-driven web extraction is the most maintainable alternative to building and maintaining custom per-site scrapers. We compare the top options in our best web scraping API in 2026 roundup.
