Fetching Data from a Large URL List: The Complete Decision Guide

TinyFishie

You have a list of 500 URLs — competitor product pages, supplier portals, job listings, or real estate listings. You need the data from each one.

The answer to "which tool fetches this data reliably" depends on what's in that list — not on how many URLs there are.

What's in your list → which tool:

  1. All static HTML, no strict automation requirements → httpx, or requests with a thread pool (fastest, cheapest)
  2. JavaScript-rendered content, no strict automation requirements → Playwright or Crawlee
  3. Mixed list with some protected sites → Playwright + proxy rotation
  4. Protected or authenticated URLs at scale → TinyFish Web Agent
  5. Massive volume (100K+) of public pages → Scrapy

The Tool That Fits the List

Static HTML at Volume: httpx + asyncio

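If your URLs are documentation pages, blog posts, static product catalogs, or any content that loads fully in the initial HTML response, a plain HTTP client with concurrent execution is the fastest and cheapest option, often by a large margin. Note that requests itself is synchronous: for asyncio-based concurrency use httpx (or aiohttp), or run requests in a thread pool.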
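A minimal sketch of concurrent fetching, assuming httpx is installed. The names `fetch_all` and `fetch_static`, the concurrency of 20, and the 10-second timeout are illustrative choices, not values taken from a benchmark.

```python
import asyncio


async def fetch_all(urls, fetch_one, concurrency=20):
    """Run fetch_one(url) for every URL with at most `concurrency` in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(url):
        async with semaphore:
            try:
                return url, await fetch_one(url), None
            except Exception as exc:  # record the failure and keep going
                return url, None, exc

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(u) for u in urls))


async def fetch_static(urls, concurrency=20):
    """Fetch static pages with httpx (third-party, assumed installed)."""
    import httpx  # imported lazily so fetch_all stays dependency-free

    async with httpx.AsyncClient(timeout=10.0, follow_redirects=True) as client:

        async def fetch_one(url):
            response = await client.get(url)
            response.raise_for_status()
            return response.text

        return await fetch_all(urls, fetch_one, concurrency)


# Usage (performs real network requests):
# results = asyncio.run(fetch_static(["https://example.com/"]))
```

Each result is a `(url, body, error)` tuple, so failed URLs can be collected and retried or escalated to a different tool instead of aborting the whole run.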

In our testing, this handles 1,000 static URLs in under a minute on a standard laptop. For 100K+ URLs, Scrapy's built-in scheduler, downloader middleware, and item pipeline make more sense: deduplication, retry logic, and output formatting are handled by the framework rather than by your own code.

Where this breaks down: Any URL that requires JavaScript execution. If the page shows a loading spinner and populates content after load, requests returns the spinner HTML, not the content.

JavaScript Content: Playwright with Batching

For lists where content loads via JavaScript—React SPAs, infinite scroll, dynamic filtering, price tables that render after an API call—you need a real browser.
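A minimal sketch of batched rendering, assuming Playwright for Python and its Chromium build are installed. The function names, the batch size of 5, the `networkidle` wait condition, and the 30-second timeout are assumptions for illustration.

```python
import asyncio


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


async def render(browser, url):
    """Open one page, wait for network activity to settle, return the rendered HTML."""
    page = await browser.new_page()
    try:
        await page.goto(url, wait_until="networkidle", timeout=30_000)
        return await page.content()
    finally:
        await page.close()


async def render_all(urls, batch_size=5):
    """Render URLs in sequential batches so only `batch_size` pages are open at once."""
    from playwright.async_api import async_playwright  # third-party, assumed installed

    results = {}
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            for batch in batches(urls, batch_size):
                # return_exceptions=True keeps one failed page from failing the batch.
                rendered = await asyncio.gather(
                    *(render(browser, url) for url in batch),
                    return_exceptions=True,
                )
                results.update(zip(batch, rendered))
        finally:
            await browser.close()
    return results


# Usage (launches a real browser):
# html_by_url = asyncio.run(render_all(["https://example.com/"]))
```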

Keep concurrency low (3–8 pages) when running locally—each headless Chromium instance consumes 100–300MB. For larger lists, cloud browser infrastructure (Browserless, Browserbase) handles the browser pool so you're not resource-limited on your machine.

Where this breaks down: Sites that apply automation detection at the network and behavioral level. JavaScript-level automation plugins help at low volume; at scale, sites with enterprise-grade detection become hard to handle reliably.

Sites with Strict Requirements or Authenticated Access: TinyFish

This is where simple HTTP requests stop being sufficient. Your list includes:

  • Product pages that return different content to automation than to browsers
  • Pricing pages that require login using your own authorized account
  • Sites with strict automation requirements that affect reliability at scale
  • Authenticated portals where each URL requires an authorized session

For these, maintaining a Playwright-based crawler means:

  • Managing automation configuration that needs ongoing updates as site requirements evolve
  • Building session management for authenticated URLs
  • Handling multi-step login flows and session state
  • Debugging failures that change based on site configurations you don't control

AI web agents handle this at the infrastructure level. You pass a URL and a goal; the agent handles rendering, infrastructure-level request handling, and authentication for sites where you have authorized access.

The concurrency limit is determined by your plan—10 concurrent agents on Starter, 50 on Pro. For a 1,000-URL list on Pro, that's 20 sequential batches of 50.
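The batch arithmetic can be checked with a few lines. The plan limits are the ones stated above; the function name is illustrative.

```python
import math

# Concurrency limits as stated in the text above.
PLAN_CONCURRENCY = {"starter": 10, "pro": 50}


def sequential_batches(url_count, plan):
    """Number of back-to-back batches needed when each batch fills the plan's limit."""
    return math.ceil(url_count / PLAN_CONCURRENCY[plan])


print(sequential_batches(1000, "pro"))      # 20
print(sequential_batches(1000, "starter"))  # 100
```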

When the math shifts: requests and Playwright are cheaper per-URL on cooperative, stable sites. TinyFish makes sense when you factor in what Playwright-at-scale actually costs: server infrastructure, proxy subscriptions, and the engineering hours spent maintaining scrapers as sites change. For mixed or complex URL lists, that total cost typically exceeds TinyFish's per-step pricing before you hit production scale.

Handling the Mixed List

Real URL lists are rarely uniform. A supplier monitoring list might include:

  • 60% static pricing pages (requests would work)
  • 30% JavaScript-rendered product tables (Playwright needed)
  • 10% authenticated portals with strict automation requirements (agents needed)

The practical approach: categorize your list before you crawl it. A quick GET probe or a sample run reveals which URLs respond to simple HTTP requests, which require rendering, and which block automation (a HEAD request shows the status code but not whether the body needs rendering). Route each category to the appropriate tool. The 10% that requires agents is where reliability actually matters: authentication failures and automation blocks are what stall production workflows, not the cooperative pages.

To classify URLs before routing them, a quick probe is faster than a full crawl:

A 429 response means rate-limited — retry with backoff before escalating. A 403 indicates access is blocked or restricted; retrying with the same tool won't help. A near-empty response or JS framework marker means JS rendering is needed. Clean HTML with visible <p> tags is static.
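The rules above can be sketched as a small classifier. This is a heuristic under stated assumptions: the marker strings, the 200-character threshold, and the category names are illustrative, and `probe` assumes httpx is installed.

```python
import re

# Assumed markers that often indicate a client-side rendered shell.
JS_MARKERS = ('id="root"', 'id="app"', "__NEXT_DATA__", "ng-version")
MIN_BODY_CHARS = 200  # assumed cutoff for a "near-empty" response


def classify(status_code, body):
    """Map a probe response to a routing category."""
    if status_code == 429:
        return "rate_limited"  # retry with backoff before escalating
    if status_code == 403:
        return "blocked"  # same tool will not help; escalate
    if status_code >= 400:
        return "error"
    text = body.strip()
    if len(text) < MIN_BODY_CHARS or any(marker in text for marker in JS_MARKERS):
        return "needs_js"
    if re.search(r"<p[\s>]", text, re.IGNORECASE):
        return "static"
    return "unknown"  # inspect manually or include in a sample run


def probe(url):
    """Fetch one URL with httpx (third-party, assumed installed) and classify it."""
    import httpx

    response = httpx.get(url, timeout=10.0, follow_redirects=True)
    return classify(response.status_code, response.text)
```

Keeping `classify` free of network code makes it easy to tune the markers against saved responses from your own list.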

Scale Considerations

For very large lists (100K+), distributed architecture matters regardless of tool—whether that's Scrapy's built-in scheduler, a task queue like Celery, or submitting batches to an async agent API and polling for results.

Test TinyFish against the protected or authenticated URLs in your list — 500 free steps, no credit card.

**Start your free trial →**

FAQ

What's the fastest way to fetch data from a large URL list in Python?

For static HTML content, httpx with asyncio is the fastest approach—you can process 20–50 URLs simultaneously with a single machine and finish 1,000 URLs in under a minute. The key is async execution: sequential requests would take 10–15x longer for the same list. For JavaScript-rendered content, Playwright in async mode with 5–10 concurrent browser pages is the practical ceiling before memory constraints become a factor on standard hardware.

How do I improve reliability when fetching data from many URLs?

Rate limiting is the first line of defense: 1–2 requests per second per domain for most sites, slower for sites with aggressive protection. Rotate user agents across requests. For moderate protection, requests with a realistic user agent and reasonable delays is enough. For sites with enterprise-grade automation detection, JavaScript-level automation plugins help at low volume but degrade at scale; TinyFish provides infrastructure-level browser sessions that are more reliable for protected sites at production scale.
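A minimal sketch of per-domain rate limiting plus exponential backoff, using only the standard library. The class and function names, the default interval, and the backoff parameters are assumptions; a `min_interval` of 0.5 to 1.0 seconds corresponds to the 1–2 requests per second mentioned above.

```python
import asyncio
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Space out requests so each domain sees at most one per `min_interval` seconds."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._next_slot = {}  # domain -> earliest monotonic time for the next request

    async def wait(self, url):
        domain = urlparse(url).netloc.lower()
        now = time.monotonic()
        slot = max(now, self._next_slot.get(domain, now))
        # Reserve the slot before sleeping so concurrent tasks queue up behind it.
        self._next_slot[domain] = slot + self.min_interval
        await asyncio.sleep(slot - now)


def backoff_delay(attempt, base=1.0, cap=60.0):
    """Exponential backoff after a 429: base, 2*base, 4*base, ... capped at `cap`."""
    return min(cap, base * 2 ** attempt)
```

Call `await throttle.wait(url)` immediately before each request; different domains do not delay each other.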

Should I use Scrapy or Playwright for a large URL list?

Scrapy if your URLs return static HTML and you need high volume (10K+) with built-in scheduling, retry logic, and output pipelines. Playwright if URLs require JavaScript execution. The two aren't mutually exclusive: the scrapy-playwright plugin (a download handler rather than a middleware) handles JS rendering within Scrapy's architecture. For lists with mixed content types, start with Scrapy for the static subset and use a separate Playwright job for the JS-heavy URLs.
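For reference, enabling scrapy-playwright is mostly a settings change. The fragment below is a sketch of a project's settings.py, assuming both scrapy and scrapy-playwright are installed; check the plugin's README for the options that apply to your version.

```python
# settings.py fragment: route HTTP(S) downloads through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In a spider, opt individual requests into browser rendering via meta:
#   yield scrapy.Request(url, meta={"playwright": True})
```

Requests without the `playwright` meta key still go through Scrapy's regular downloader, so static and JS-heavy URLs can share one spider.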

How do I deduplicate URLs before crawling?

Normalize URLs first: lowercase the scheme and domain, sort query parameters alphabetically, strip tracking parameters (utm_*, ref=, fbclid=), and resolve relative URLs to absolute. Python's urllib.parse.urlparse plus a set for deduplication handles most cases. For large lists with near-duplicate URLs (same page, different session IDs), w3lib helps: w3lib.url.canonicalize_url normalizes encoding and parameter order, and w3lib.url.url_query_cleaner can drop named parameters such as session IDs.
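The normalization steps above can be sketched with the standard library alone. The function names and the exact tracking-parameter list are illustrative; extend the list for your own sources.

```python
from urllib.parse import parse_qsl, urlencode, urljoin, urlparse, urlunparse

TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"ref", "fbclid"}


def normalize(url, base=None):
    """Return a canonical form of `url` for deduplication."""
    if base:
        url = urljoin(base, url)  # resolve relative URLs against a base
    parts = urlparse(url)
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    query.sort()  # alphabetical parameter order
    return urlunparse((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,
        parts.params,
        urlencode(query),
        parts.fragment,
    ))


def dedupe(urls, base=None):
    """Normalize every URL and keep the first occurrence of each, in input order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url, base)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```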

When does crawling a URL list require authentication?

When the target pages are behind login walls that your team has authorized access to—supplier pricing portals, internal tools, subscription content, or any page that redirects to a login page for unauthenticated requests. Signs your list needs auth: all results return the same HTML (the login page), response sizes are suspiciously uniform, or you see redirect chains ending at /login. For authenticated list crawling at scale, session management becomes the primary complexity—handling login flows, session expiry, and re-authentication across many concurrent workers. TinyFish handles session management and multi-step login flows for sites where you have authorized account access — you provide credentials, the agent handles the rest.
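The three warning signs can be checked on a sample run before committing to a full crawl. This is a rough sketch: the function name, the 2% size tolerance, and the `/login` path check are assumptions, and the input is a list of `(final_url, body)` pairs collected after redirects.

```python
import hashlib
from urllib.parse import urlparse


def auth_signals(results, size_tolerance=0.02):
    """Summarize signs that a sample of responses is actually a login wall."""
    hashes = {hashlib.sha256(body.encode("utf-8")).hexdigest() for _, body in results}
    sizes = [len(body) for _, body in results]
    login_redirects = sum(
        1
        for final_url, _ in results
        if urlparse(final_url).path.rstrip("/").endswith("/login")
    )
    several = len(results) > 1
    return {
        # Every sampled URL returned byte-identical HTML.
        "identical_bodies": several and len(hashes) == 1,
        # Response sizes differ by no more than the tolerance.
        "uniform_sizes": several and max(sizes) - min(sizes) <= size_tolerance * max(sizes),
        # Share of URLs whose redirect chain ended at a /login path.
        "login_redirect_share": login_redirects / len(results) if results else 0.0,
    }
```

If any of these signals fire on a sample of 10 to 20 URLs, treat the list as authenticated and plan session handling before scaling up.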
