lastcrawler.xyz

2026-03-28

12 min read

Web Crawling · Batch Processing · Scale · API · Tutorial

Batch Web Crawling at Scale: How to Process Thousands of Pages in Minutes

A thread on r/webscraping hit 854 upvotes last month. The poster was running 50 Raspberry Pi 5 nodes, crawling 3.9 million records. The setup was impressive -- custom orchestration, distributed queues, hours of tuning. But the top comment cut right through the spectacle: "He's undercutting his possible performance by an unfathomable amount... 64 cores server vs 50 Pi5s."

Meanwhile, another thread with 482 upvotes showed someone building scraping agents with n8n -- chaining together dozens of tools just to get reliable page extraction at a few hundred pages per hour.

Both threads show smart engineers solving the wrong problem. Batch web crawling shouldn't require a cluster. It shouldn't require proxy lists. It definitely shouldn't require 50 single-board computers zip-tied to a shelf.

Why crawling falls apart at scale

Single-page scraping is deceptively easy. Fetch a URL, parse the HTML, extract what you need. Ten pages? No problem. A hundred pages? Still manageable. But somewhere between 500 and 5,000 pages, everything breaks at once.

Chrome eats your server alive

Every headless Chrome instance consumes 2-3 GB of RAM -- and that's per isolated browser context, not per tab. If you want to crawl 50 pages concurrently (a modest number for batch work), you need 100-150 GB of RAM just for the browsers. On a typical $40/month VPS with 8 GB of RAM, you can run maybe 3 concurrent sessions before the OOM killer starts murdering processes.
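The arithmetic is worth making explicit. A quick sketch -- the per-session figure is the article's estimate, and the OS overhead constant is an assumption:

```python
# Back-of-envelope RAM budget for concurrent headless Chrome sessions.
# GB_PER_SESSION is the article's estimate; OS_OVERHEAD_GB is an assumption.
GB_PER_SESSION = 2.5   # midpoint of 2-3 GB per isolated session
OS_OVERHEAD_GB = 0.5   # headroom for the OS and your own process

def max_concurrent_sessions(total_ram_gb: float) -> int:
    usable = total_ram_gb - OS_OVERHEAD_GB
    return max(0, int(usable // GB_PER_SESSION))

print(max_concurrent_sessions(8))    # the 8 GB VPS: ~3 sessions
print(max_concurrent_sessions(128))  # a box sized for ~50 sessions
```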

The workaround is sequential processing. Crawl one page, close the browser, open a new one. This is reliable but brutally slow. At 3-5 seconds per page including browser startup, 10,000 pages takes 8-14 hours.
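The same back-of-envelope math for the sequential case:

```python
# Wall-clock estimate for sequential crawling at the article's
# 3-5 seconds per page, browser startup included.
def sequential_hours(pages: int, secs_per_page: float) -> float:
    return pages * secs_per_page / 3600

low, high = sequential_hours(10_000, 3), sequential_hours(10_000, 5)
print(f"{low:.1f}-{high:.1f} hours")  # roughly 8-14 hours for 10,000 pages
```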

Rate limiting compounds the problem

Even if you solve the memory issue, sites fight back. Hit a site with 50 concurrent requests from the same IP and you'll get blocked within minutes. So you add proxy rotation -- but now you're managing a pool of residential proxies, handling failures when proxies go down mid-request, and paying $5-15 per GB of bandwidth.

The smarter approach is respecting rate limits with delays between requests. But delays + sequential processing = glacial crawl speeds. You can parallelize across multiple proxy IPs, but now you're back to managing infrastructure complexity.
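The delay-based approach fits in a few lines. A sketch -- the injectable fetch hook is only for illustration and testing; a real crawler would also handle retries and per-domain limits:

```python
import time

import requests

def polite_crawl(urls, delay_secs=1.0, fetch=None):
    """Fetch URLs one at a time with a fixed delay between requests.

    Reliable and unlikely to trigger blocks, but throughput is capped at
    roughly 1/delay_secs pages per second before fetch time even counts.
    """
    fetch = fetch or (lambda u: requests.get(u, timeout=30).text)
    results = {}
    for url in urls:
        results[url] = fetch(url)
        time.sleep(delay_secs)  # the politeness delay that makes this glacial
    return results
```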

Anti-bot systems don't care about your architecture

Modern anti-bot systems detect headless browsers through dozens of signals -- browser fingerprints, TLS fingerprints, mouse movement patterns, JavaScript execution timing. Solving this yourself means maintaining a constantly evolving set of evasion techniques. What worked last month fails this month. What works on one site fails on another.

This is maintenance burden that scales linearly with the number of sites you crawl. More sites means more breakage means more hours spent debugging why your crawler suddenly returns empty pages.

How the /crawl endpoint changes the math

Last Crawler's /crawl endpoint takes a fundamentally different approach. Instead of you managing browsers, proxies, and concurrency -- you send a URL and get back the crawled content for every linked page.

python

import requests

response = requests.post("https://lastcrawler.xyz/api/crawl", json={
    "url": "https://docs.example.com",
    "maxPages": 500
})

pages = response.json()
print(f"Crawled {len(pages)} pages")

That's it. Behind that single API call, the service discovers the linked pages, renders each one in a full browser session, and fetches them in parallel. There's no Chrome to install, no memory to budget, no proxies to rotate, no concurrency to tune -- the parallelism is built in at the infrastructure level.

URL discovery with /links

Sometimes you want to know what you're about to crawl before you crawl it. The /links endpoint gives you exactly that -- a list of all discoverable URLs from a starting page.

python

# Discover all URLs first
links_response = requests.post("https://lastcrawler.xyz/api/links", json={
    "url": "https://store.example.com/products"
})

urls = links_response.json()["links"]
print(f"Found {len(urls)} product URLs")

# Filter to just product pages
product_urls = [u for u in urls if "/products/" in u]
print(f"Filtered to {len(product_urls)} product pages")

This lets you plan your crawl -- estimate costs, filter URLs, and decide what's worth extracting before spending compute on it.
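A small planning helper makes this concrete. The per-page price below is a placeholder, not Last Crawler's actual rate -- substitute your plan's figure:

```python
def plan_crawl(urls, include_pattern, price_per_page=0.0005):
    """Filter discovered URLs and estimate crawl cost before committing.

    price_per_page is a placeholder -- substitute your plan's actual rate.
    """
    targets = [u for u in urls if include_pattern in u]
    return {
        "total_discovered": len(urls),
        "to_crawl": len(targets),
        "estimated_cost_usd": round(len(targets) * price_per_page, 2),
        "urls": targets,
    }

plan = plan_crawl(
    ["https://store.example.com/products/a",
     "https://store.example.com/about"],
    include_pattern="/products/",
)
```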

Practical patterns: batch crawl + extract

The real power comes from combining /crawl with other endpoints. Here are three patterns that cover most batch crawling use cases.

Pattern 1: Crawl entire site to markdown (for RAG ingestion)

This is the most common batch crawling use case right now. You have a documentation site, a knowledge base, or a content library, and you need all of it as clean text for your RAG pipeline.

python

import requests
import json

def crawl_site_to_markdown(base_url: str, max_pages: int = 1000) -> list[dict]:
    """Crawl an entire site and extract clean markdown from every page."""

    # Step 1: Crawl and get markdown for all pages
    response = requests.post("https://lastcrawler.xyz/api/crawl", json={
        "url": base_url,
        "maxPages": max_pages,
        "formats": ["markdown"]
    })

    pages = response.json()
    results = []

    for page in pages:
        results.append({
            "url": page["url"],
            "markdown": page["markdown"],
            "title": page.get("title", ""),
        })

    return results


# Crawl a documentation site
docs = crawl_site_to_markdown("https://docs.example.com", max_pages=500)

# Save for RAG ingestion
with open("docs_corpus.jsonl", "w") as f:
    for doc in docs:
        f.write(json.dumps(doc) + "\n")

print(f"Saved {len(docs)} pages to docs_corpus.jsonl")

The markdown output strips navigation, sidebars, footers -- all the boilerplate that corrupts embeddings. You get clean, structured text with heading hierarchy intact. From there, chunking and embedding is straightforward. We covered that full pipeline in our web scraping for RAG pipelines guide.
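From there, a minimal heading-aware chunker is only a few lines. This is a sketch; production pipelines typically add token-based sizing and chunk overlap:

```python
import re

def chunk_markdown(markdown: str, max_chars: int = 2000) -> list[str]:
    """Split markdown into chunks at heading boundaries, merging small
    sections until a chunk approaches max_chars."""
    # Zero-width split: each chunk starts at a line beginning with #
    sections = re.split(r"(?m)^(?=#{1,6} )", markdown)
    chunks, current = [], ""
    for section in sections:
        if current and len(current) + len(section) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += section
    if current.strip():
        chunks.append(current.strip())
    return chunks
```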

Pattern 2: Crawl + extract structured product data

Ecommerce catalogs are a natural fit for batch crawling. You need product names, prices, specs, and availability from hundreds or thousands of product pages. Doing this one page at a time is slow. Doing it with batch crawling and JSON extraction is fast.

python

def crawl_product_catalog(store_url: str) -> list[dict]:
    """Crawl an ecommerce site and extract structured product data."""

    # Step 1: Discover product URLs
    links_response = requests.post("https://lastcrawler.xyz/api/links", json={
        "url": store_url
    })
    all_urls = links_response.json()["links"]

    # Filter to product pages (adjust pattern per site)
    product_urls = [u for u in all_urls if "/product" in u or "/item" in u]
    print(f"Found {len(product_urls)} product URLs")

    # Step 2: Extract structured data from each product page
    products = []
    for url in product_urls:
        response = requests.post("https://lastcrawler.xyz/api/json", json={
            "url": url,
            "schema": {
                "name": "string",
                "price": "number",
                "currency": "string",
                "in_stock": "boolean",
                "sku": "string",
                "brand": "string",
                "rating": "number",
                "review_count": "number",
                "description": "string",
                "specs": [{"key": "string", "value": "string"}]
            }
        })
        product = response.json()
        product["source_url"] = url
        products.append(product)

    return products


catalog = crawl_product_catalog("https://store.example.com")
print(f"Extracted {len(catalog)} products")

For a deeper walkthrough of the extraction schema design, see our guide on extracting product data from ecommerce sites.
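One practical note on the loop above: the per-URL /json calls are network-bound, so a thread pool speeds them up considerably. A sketch -- the injectable post hook is there for testing, and max_workers should stay within your plan's rate limits:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def extract_many(urls, schema, max_workers=10, post=requests.post):
    """Run /json extraction over many URLs concurrently.

    max_workers caps in-flight requests; keep it within your plan's limits.
    The post hook is injectable purely for testing.
    """
    def extract(url):
        resp = post("https://lastcrawler.xyz/api/json",
                    json={"url": url, "schema": schema})
        record = resp.json()
        record["source_url"] = url
        return record

    # pool.map preserves input order, so results line up with urls
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract, urls))
```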

Pattern 3: Crawl + screenshot for visual monitoring

Sometimes you need visual proof that pages look right -- after a deployment, for compliance monitoring, or to track competitor changes. Batch crawling with screenshots gives you a visual snapshot of an entire site.

javascript

import LastCrawler from 'last-crawler';

const client = new LastCrawler({ apiKey: process.env.LAST_CRAWLER_API_KEY });

async function screenshotSite(baseUrl, maxPages = 100) {
  // Get all URLs
  const { links } = await client.links(baseUrl);

  const screenshots = [];
  for (const url of links.slice(0, maxPages)) {
    const screenshot = await client.screenshot(url, {
      fullPage: true,
      format: 'png'
    });

    screenshots.push({
      url,
      image: screenshot.base64,
      timestamp: new Date().toISOString()
    });
  }

  return screenshots;
}

const snapshots = await screenshotSite('https://mysite.com');
console.log(`Captured ${snapshots.length} screenshots`);

Run this on a schedule and you have a visual regression system that catches layout breaks, missing images, and broken pages across your entire site.
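Catching "layout breaks" means comparing runs. A minimal change detector, assuming you keep the previous run's images keyed by URL -- byte-hash comparison flags any change at all, where a real visual-regression tool would use perceptual diffing to ignore rendering noise:

```python
import hashlib

def changed_pages(previous: dict, current: dict) -> list[str]:
    """Compare two snapshot runs (url -> image bytes) and return URLs
    whose screenshots changed or are new since the previous run."""
    def digest(data: bytes) -> str:
        return hashlib.sha256(data).hexdigest()

    return [url for url, img in current.items()
            if url not in previous or digest(previous[url]) != digest(img)]
```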

Performance: what to actually expect

The question everyone asks is: how fast?

In our testing across a range of sites with varying complexity:

Pages | Typical time | Notes
100 | 15-30 seconds | Simple content pages
500 | 1-2 minutes | Mixed content, some JS-heavy pages
1,000 | 2-4 minutes | Full site crawl with extraction
5,000 | 8-15 minutes | Large documentation sites
10,000 | 15-30 minutes | Enterprise-scale catalogs

These numbers assume reasonably sized pages. Extremely heavy SPAs or pages with complex JavaScript take longer per page. But the parallelism means the wall-clock time scales sub-linearly -- crawling 1,000 pages takes way less than 10x the time of crawling 100 pages.

Compare this to sequential single-browser crawling at 3-5 seconds per page: 1,000 pages takes 50-85 minutes. 10,000 pages takes 8-14 hours. Batch crawling at the edge collapses that by an order of magnitude.

The cost comparison nobody wants to do

Let's put real numbers on this. The Reddit thread about 50 Raspberry Pi 5 nodes is a perfect case study.

Self-hosted: the 50 Pi setup

Component | Cost
50x Raspberry Pi 5 (8GB) | $4,000 upfront
SD cards, power supplies, networking | $800 upfront
Electricity (~5W each, 24/7) | $25/month
Residential proxies | $100-500/month
Developer time (setup: 40+ hours) | $3,000-6,000 one-time
Developer time (maintenance: 5-10 hrs/week) | $1,500-3,000/month

First-year total cost of ownership: roughly $27,000-$53,000 for hardware, setup, and maintenance. That's before you account for the 50 nodes sitting idle when you're not crawling. For a deeper breakdown of these hidden costs, see the true cost of web scraping in 2026.
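Summing the table's line items, taking the low and high end of each range:

```python
# First-year TCO for the 50-Pi setup, summed from the line items above.
UPFRONT = 4_000 + 800                 # Pis + SD cards, PSUs, networking

def first_year_tco(setup_labor, monthly):
    return UPFRONT + setup_labor + 12 * monthly

low = first_year_tco(3_000, 25 + 100 + 1_500)    # best case
high = first_year_tco(6_000, 25 + 500 + 3_000)   # worst case
print(low, high)
```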

And here's the kicker from that Reddit thread -- the commenter was right. A single 64-core server would outperform 50 Pi 5s for this workload. ARM cores on a Pi running headless Chrome are not where you want to be.

API-first: batch crawling as a service

With Last Crawler's pricing, the same workload looks different:

Volume | Approximate cost
10,000 pages/month | ~$5
100,000 pages/month | ~$50
1,000,000 pages/month | ~$400

No hardware. No proxies. No maintenance. No idle capacity. You pay for pages crawled and nothing else.

The math is brutally simple. Unless you're crawling billions of pages per month and have a dedicated team to run the infrastructure, the API approach costs less by a wide margin. And your team spends zero hours debugging why Chrome segfaulted on a Pi at 3am.

When to batch crawl vs single-page extract

Batch crawling and single-page extraction solve different problems. Using the wrong one wastes either time or money.

Use batch crawling (/crawl) when:

- You need every page of a site or a large section of it -- documentation, knowledge bases, product catalogs
- Throughput matters more than per-page latency
- You want link discovery and crawling handled in a single call

Use single-page extraction (/json, /markdown) when:

- You already know the exact URLs you need
- You need low-latency, on-demand results for individual pages
- You only care about a handful of pages, not a whole site

Use /links first, then single-page extraction when:

- You want to filter URLs or estimate costs before spending compute
- Only a subset of the site matters (say, just the product pages)
- You need per-page control over which endpoint or schema to apply

FAQ

How many pages can I crawl in a single /crawl request?

The practical limit depends on the site and your plan. For most sites, you can crawl up to several thousand pages in a single request. The maxPages parameter controls the upper bound. Start with a smaller number to test, then scale up once you've verified the output looks right.

Does batch crawling respect robots.txt?

Yes. The crawler follows robots.txt directives by default. If a site disallows crawling certain paths, those pages will be skipped. This is both ethically correct and practically smart -- sites that see you respecting robots.txt are less likely to block you. For more on the legal side, see our guide on whether web scraping is legal in 2026.

Can I crawl sites that require JavaScript rendering?

Yes. Every page gets a full browser session with complete JavaScript execution. SPAs, dynamically loaded content, lazy-loading images -- all handled. This is the same rendering pipeline as the single-page endpoints, just applied at scale across multiple pages in parallel.

What happens if some pages fail during a batch crawl?

Failed pages (timeouts, server errors, anti-bot blocks) are reported in the response with their error status. Successfully crawled pages are still returned. You don't lose the entire batch because a few pages had issues. Retry just the failed URLs if you need them.
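A retry pass is easy to script. This sketch assumes each failed page record carries an error field and re-crawls through the /markdown single-page endpoint -- check the actual response shape for your plan:

```python
import requests

def refetch(url):
    """Single-page re-crawl via the /markdown endpoint."""
    return requests.post("https://lastcrawler.xyz/api/markdown",
                         json={"url": url}).json()

def retry_failures(pages, fetch=refetch, max_retries=2):
    """Re-crawl pages that came back with an error, up to max_retries passes.

    Assumes each page record has a "url" key and failed records carry
    an "error" key -- verify against the real response shape.
    """
    results = {p["url"]: p for p in pages}
    for _ in range(max_retries):
        failed = [u for u, p in results.items() if p.get("error")]
        if not failed:
            break
        for url in failed:
            results[url] = fetch(url)
    return list(results.values())
```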

How does batch crawling handle pagination?

The /crawl endpoint follows links from your starting URL. If a paginated listing links to page 2, page 3, etc. in the HTML, those pages will be discovered and crawled. For AJAX-based infinite scroll pagination, you may need to use /links on the base page first and then construct the paginated URLs yourself.
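For query-parameter pagination, constructing the URLs yourself is a one-liner -- the parameter name varies by site, so treat "page" as a placeholder:

```python
def paginated_urls(base_url: str, pages: int, param: str = "page") -> list[str]:
    """Build explicit page URLs for listings that paginate via a query
    parameter ("page" is a common convention, not a guarantee)."""
    sep = "&" if "?" in base_url else "?"
    return [base_url] + [f"{base_url}{sep}{param}={n}"
                         for n in range(2, pages + 1)]
```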

Can I combine batch crawling with JSON schema extraction?

Yes, and this is one of the most powerful patterns. Crawl a site to get all the pages, then use the /json endpoint with a schema to extract structured data from each page. The product catalog example above shows exactly this workflow. For schema design tips, see our URL to JSON API guide.

Is batch crawling suitable for real-time applications?

Not really. Batch crawling is optimized for throughput, not latency. A batch of 1,000 pages might complete in 2-4 minutes total, but you get results when the batch finishes (or as pages stream in). For real-time extraction where you need sub-second responses, use the single-page endpoints instead.

How much does batch crawling cost compared to running my own infrastructure?

At moderate scale (10K-100K pages/month), API-based batch crawling typically costs $5-50/month. The equivalent self-hosted setup -- servers, proxies, Chrome, maintenance time -- runs $500-3,000/month. The gap narrows at extreme scale (millions of pages/month) but doesn't close until you hit billions. We go deep on these numbers in the true cost of web scraping.
