2026-03-16
6 min read
Web Scraping Without Getting Blocked: What Actually Works in 2026
Bot detection has gotten serious. Sites that used to be easy to scrape now run TLS fingerprinting, behavioral analysis, mouse movement heuristics, and ML-based anomaly detection on every request. The 2022 playbook gets you blocked in 2026.
Here's what still works, what doesn't, and why.
Why scrapers get blocked
Sites block scrapers using a layered approach:
- TLS/HTTP fingerprinting — your HTTP/2 settings, cipher suites, and header order uniquely identify your client library. requests, curl, and aiohttp each have distinct fingerprints. Detection happens before you even send a meaningful request.
- Browser fingerprinting — Canvas rendering, WebGL output, font enumeration, screen resolution, and dozens of other browser properties are checked against expected profiles. Headless Chrome without modification has a known fingerprint.
- Behavioral signals — real users scroll, move their mouse, pause between clicks, and follow unpredictable navigation patterns. Scrapers move in straight lines: fetch, parse, repeat.
- IP reputation — datacenter IPs and known proxy ranges are flagged automatically. Cloudflare, Akamai, and DataDome all maintain databases of bad IP ranges.
- Rate limiting — too many requests from one IP in a short window triggers temporary or permanent blocks.
None of these signals alone is conclusive. Sites combine them into a confidence score, and once you cross a threshold, you get a CAPTCHA, a 403, or bad data returned silently.
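To make the layering concrete, here is a toy sketch of how weak signals might be rolled into one score. The signal names, weights, and threshold are invented for illustration; real vendors feed far more inputs into ML models.

```python
# Hypothetical illustration only: real detection uses ML over hundreds of signals,
# but the shape of the decision (weak signals combined into one score) is the same.
SIGNAL_WEIGHTS = {
    "tls_fingerprint_mismatch": 0.35,   # ClientHello doesn't match the claimed browser
    "headless_browser_markers": 0.30,   # Canvas/WebGL/plugin anomalies
    "no_mouse_or_scroll_events": 0.20,  # behavioral signals absent
    "datacenter_ip": 0.10,              # poor IP reputation
    "burst_request_rate": 0.05,         # naive rate-limit trigger
}
BLOCK_THRESHOLD = 0.60

def bot_confidence(signals: dict) -> float:
    """Combine whichever signals fired into a single confidence score."""
    return sum(weight for name, weight in SIGNAL_WEIGHTS.items() if signals.get(name))

score = bot_confidence({
    "tls_fingerprint_mismatch": True,
    "no_mouse_or_scroll_events": True,
    "datacenter_ip": True,
})
print(score >= BLOCK_THRESHOLD)  # True: this request gets a CAPTCHA or a 403
```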
The traditional toolkit and why it's failing
Proxy rotation
Rotating residential proxies used to work. You'd cycle through a pool of real user IPs, distribute requests, and avoid rate limits.
The problem: proxy providers sell the same IP pools to thousands of customers. Sites track IP reputation at scale and rotate their block lists just as fast as you rotate proxies. High-quality residential proxies that aren't already flagged cost $5–15/GB. For anything serious, you're spending hundreds of dollars a month to stay marginally ahead.
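For reference, the rotation pattern itself is trivial; it's the IP quality, not the code, that fails. A minimal sketch, with placeholder proxy URLs:

```python
import itertools
import requests

# Placeholder pool; in practice these come from a paid residential proxy provider.
PROXIES = [
    "http://user:pass@residential-1.example-proxy.com:8000",
    "http://user:pass@residential-2.example-proxy.com:8000",
]
rotation = itertools.cycle(PROXIES)

def fetch_via_next_proxy(url, timeout=10):
    """Send each request through the next proxy in the pool."""
    proxy = next(rotation)
    return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=timeout)
```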
Headless browsers
Playwright and Puppeteer execute real JavaScript and pass most basic bot checks. They also consume 200–500MB of RAM per instance, take 2–5 seconds to spin up, and still fail fingerprinting checks without significant modification.
Getting Playwright to look like a real user requires patching WebGL, faking screen resolution, injecting mouse movement, randomizing timing, and staying ahead of whatever new signals detection vendors just added. It's a full-time job.
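A rough sketch of what some of that patching looks like with Playwright's Python API; this covers only a fraction of what real evasion requires, and the specific values (viewport, user agent, timings) are arbitrary examples:

```python
import random
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(
        viewport={"width": 1366, "height": 768},  # a common laptop resolution
        user_agent=(
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
        ),
        locale="en-US",
        timezone_id="America/New_York",
    )
    # Hide the single most obvious headless marker; vendors probe many more.
    context.add_init_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    page = context.new_page()
    page.goto("https://example.com")
    # Fake some human-ish behavior: wander the mouse and pause irregularly.
    for _ in range(5):
        page.mouse.move(random.randint(0, 1365), random.randint(0, 767), steps=20)
        page.wait_for_timeout(random.uniform(300, 1200))
    html = page.content()
    browser.close()
```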
User-agent rotation
Setting User-Agent: Mozilla/5.0 ... doesn't help when your HTTP/2 fingerprint says you're Python requests. Sites stopped relying on user-agent strings years ago; the mismatch between the claimed browser and the actual wire-level fingerprint is detected instantly.
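To see why, look at what actually changes when you set the header; only the string does:

```python
import requests

chrome_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
# The header now claims Chrome, but the TLS ClientHello underneath is still
# urllib3's, and requests never negotiates HTTP/2 at all, so the wire-level
# fingerprint contradicts the claimed identity on the very first request.
resp = requests.get("https://example.com", headers={"User-Agent": chrome_ua})
```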
Rate limiting / polite scraping
Adding delays between requests helps with naive rate limiting. It does nothing for fingerprinting, IP reputation, or behavioral detection. You'll just get blocked more slowly.
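A typical polite-scraping loop looks something like this: jittered pauses between pages and exponential backoff on HTTP 429 responses (the URLs are placeholders):

```python
import random
import time
import requests

def polite_get(url, max_retries=4):
    delay = 2.0
    for _ in range(max_retries):
        resp = requests.get(url, timeout=10)
        if resp.status_code != 429:
            return resp
        # Rate limited: back off exponentially, with jitter so retries don't align.
        time.sleep(delay + random.uniform(0, 1))
        delay *= 2
    return None

for page_url in ["https://example.com/page/1", "https://example.com/page/2"]:
    polite_get(page_url)
    time.sleep(random.uniform(1, 3))  # pause between pages like a slow human reader
```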
The real problem: an arms race you can't win
Bot detection is a billion-dollar industry. Cloudflare, Akamai, PerimeterX, and DataDome are continuously updating their models. Every bypass technique gets patched within weeks of becoming public.
If you're trying to maintain a scraper against serious anti-bot protection, you're committing to an ongoing engineering investment with no finish line. Each new deployment of the detection vendor's model can break your scraper overnight without warning.
The only strategy that holds up long-term is to stop fighting on their terms.
What actually works: infrastructure that's already trusted
Sites don't block all browsers; they block non-human access patterns. The solution is requests that look like real browser traffic because they are real browser traffic.
This means:
- Real browser sessions running at the edge — actual Chrome instances, not headless Chrome in a datacenter. Running from 300+ locations with IPs from clean residential and commercial ranges that haven't been poisoned by resellers. We explain how edge-native browser rendering makes this possible.
- Request profiles that match real users — TLS fingerprints, HTTP/2 settings, browser APIs that all match expected values because they come from the same binary users run.
- No shared IP pools — distributed edge infrastructure that isn't reselling the same IP ranges to everyone.
When your requests come from this kind of infrastructure, they don't trigger bot detection because they don't look like bot requests.
```python
# Traditional approach: 30 lines of fragile configuration
import requests
from fake_useragent import UserAgent
import time
import random

ua = UserAgent()
proxies_pool = load_proxy_pool()  # placeholder for your provider's pool: expensive, requires maintenance
session = requests.Session()

def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        proxy = random.choice(proxies_pool)
        headers = {
            "User-Agent": ua.random,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
        }
        try:
            resp = session.get(url, headers=headers, proxies={"https": proxy}, timeout=10)
            if resp.status_code == 200:
                return resp.text
        except Exception:
            time.sleep(random.uniform(1, 3))
    return None
```
```python
# API approach: 5 lines, works on sites the above can't touch
import requests

def scrape(url, schema):
    resp = requests.post("https://lastcrawler.com/api/json", json={
        "url": url,
        "schema": schema,
    })
    return resp.json()
```
The API call runs in a real browser session at the edge. No proxy management, no fingerprint patching, no behavioral simulation. You get structured data back directly. If you're evaluating options, our best web scraping API comparison for 2026 breaks down the top choices, and our Firecrawl alternatives guide covers why many teams are switching.
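For example, pulling structured product data with the scrape helper above might look like this; the schema shape shown is illustrative, not the endpoint's documented format:

```python
# Illustrative only: check the API docs for the exact schema format accepted.
product_schema = {
    "name": "string",
    "price": "number",
    "in_stock": "boolean",
}

data = scrape("https://example.com/product/123", product_schema)
print(data)  # structured JSON extracted by a real browser session at the edge
```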
FAQ
Q: Can I really scrape any site without getting blocked?
A: No tool guarantees 100% success on every site. Some sites require authentication, use CAPTCHAs designed specifically to resist automation, or carry legal restrictions on scraping. What edge-based real browser infrastructure does is eliminate the most common failure modes — IP reputation, fingerprinting, behavioral detection — so you're not blocked on sites that are technically scrapable.
Q: How is this different from just using a proxy with headless Chrome?
A: Proxies change your IP but not your fingerprint. Running headless Chrome through a proxy still presents a headless Chrome fingerprint, which detection vendors identify through browser API probing, Canvas rendering differences, missing browser plugins, and dozens of other signals. Edge-based browser infrastructure uses unmodified browser builds with clean IPs and realistic behavioral profiles — the combination that actually passes modern detection.
Q: Does bypassing anti-bot detection violate terms of service?
A: Almost certainly yes for sites that explicitly prohibit automated access. Whether that matters depends on what you're scraping and why. Public data, price monitoring, research, and competitive intelligence have different risk profiles than scraping behind a login or in ways that harm the target site. Read the ToS, assess your risk, and make an informed decision.
Last Crawler