Skip to main content

1242. Web Crawler Multithreaded

mediumAsked at Twilio

Web Crawler Multithreaded is Twilio's concurrency probe: crawl all URLs reachable from a startUrl that share the same hostname, returning the list. Twilio's API-platform team uses this to grade concurrent BFS, work-distribution, and visited-set synchronization. The grading bar is a thread-safe visited set plus a work-stealing queue or async pool.

By Alex Chen, Founder, InterviewChamp.AI · Last verified

Source citations

Public interview reports confirming this problem appears in Twilio loops.

  • Glassdoor (2026-Q1)Reported in Twilio platform-team on-site rounds as a 'concurrency in code' question.
  • LeetCode Discuss (2025-11)Cited in Twilio interview reports — explicitly framed as a 'distributed-systems flavor' problem.

Problem

Given a URL startUrl and an interface HtmlParser, implement a Multi-threaded web crawler to crawl all links that are under the same hostname as startUrl. Return all URLs obtained by your web crawler in any order. Your crawler should: start from the page startUrl, call HtmlParser.getUrls(url) to get all URLs from a webpage of a given URL, do not crawl the same link twice, and only crawl links that are under the same hostname as startUrl.

Constraints

  • 1 <= urls.length <= 1000
  • 1 <= urls[i].length <= 300
  • startUrl is one of the urls.
  • Hostname label must be from 1 to 63 characters long.
  • There may be duplicates in the URL library.

Examples

Example 1

Input
urls = [...], edges = [...], startUrl = "http://news.yahoo.com/news/topics/"
Output
All URLs reachable from startUrl that share the news.yahoo.com hostname

Explanation: Crawler starts at startUrl, follows links, returns only same-hostname URLs.

Approaches

1. Sequential BFS (rejected — single-threaded baseline)

Standard BFS with a queue and visited set; one URL fetched at a time.

Time
O(V + E) with high wall-clock latency
Space
O(V)
function crawlSeq(startUrl, htmlParser) {
  const host = new URL(startUrl).hostname;
  const visited = new Set([startUrl]);
  const queue = [startUrl];
  while (queue.length > 0) {
    const url = queue.shift();
    for (const next of htmlParser.getUrls(url)) {
      if (!visited.has(next) && new URL(next).hostname === host) {
        visited.add(next);
        queue.push(next);
      }
    }
  }
  return [...visited];
}

Tradeoff: Correct but serial — wall-clock dominated by network latency. Twilio's question is specifically about parallelism; this version is the baseline you propose before the concurrent one.

2. Promise pool with thread-safe visited (optimal in JS)

Bounded concurrency pool; each fetched URL produces more URLs that go back into the pool. Use a Set (single-threaded in JS but still safe under cooperative scheduling) for visited.

Time
O((V + E) / concurrency)
Space
O(V)
async function crawl(startUrl, htmlParser) {
  const host = new URL(startUrl).hostname;
  const visited = new Set([startUrl]);
  // In JS, the event loop is single-threaded — no race on Set.
  // For multi-threaded runtimes (Java/Python), wrap in a mutex or use ConcurrentHashMap.set.
  const inflight = new Set();

  async function visit(url) {
    const links = await htmlParser.getUrls(url);
    const toQueue = [];
    for (const next of links) {
      if (visited.has(next)) continue;
      try {
        if (new URL(next).hostname !== host) continue;
      } catch { continue; }
      visited.add(next);
      toQueue.push(next);
    }
    // Launch follow-ups; track them so the outer driver can await everything
    return Promise.all(toQueue.map(u => {
      const p = visit(u);
      inflight.add(p);
      p.finally(() => inflight.delete(p));
      return p;
    }));
  }

  await visit(startUrl);
  // Drain any straggling promises
  while (inflight.size > 0) await Promise.race(inflight);
  return [...visited];
}

Tradeoff: Leverages JS's single-threaded event loop — no mutex needed because there's no true preemption between Set.add and the visited check. For Java/Python multi-threaded runtimes you must wrap visited in a synchronized set or ConcurrentHashMap. Bounded concurrency would add a semaphore around the visit() entry.

3. Worker pool with explicit work queue (production-grade)

N worker threads (or Web Workers) draining a shared queue. Visited set behind a lock. Crawl ends when queue is empty AND no worker is processing.

Time
O((V + E) / N)
Space
O(V)
// Sketch for a Java/Python-style design — JS Web Workers can match this.
// 1. Create N workers. Each worker pulls from a shared queue.
// 2. Visited set behind a ConcurrentHashMap (Java) or asyncio.Lock (Python).
// 3. Termination condition: queue.empty() && activeWorkers == 0.
// 4. Sentinel pattern: each worker writes 'done' when it completes its task;
//    coordinator checks the empty-and-all-done condition.

Tradeoff: The production answer for real concurrency. The hardest part is termination detection — naively waiting on an empty queue races with a worker that's currently processing and about to enqueue more work. The 'activeWorkers' counter solves it.

Twilio-specific tips

Twilio's API-platform team grades two things on Web Crawler Multithreaded: (1) correct synchronization on the visited set, (2) correct TERMINATION detection (the most common bug — exiting too early because the queue happened to be empty while a worker was still processing). Walk through both before coding. The JS event-loop angle is a strong signal — explicitly note that single-threaded JS doesn't have a visited-set race, but porting to Java/Python does.

Common mistakes

  • Exiting when the queue is empty without checking that no worker is currently processing — terminates the crawl mid-stream when a worker is about to enqueue more.
  • Checking the visited set after work begins instead of when work is enqueued — multiple workers can enqueue the same URL between check-and-enqueue.
  • Forgetting to filter by hostname on every enqueue — crawls externally-linked sites and explodes.

Follow-up questions

An interviewer at Twilio may pivot to one of these next:

  • How would you implement distributed crawling across machines? (Hash-partition URLs by hostname; each machine owns a slice. Workers cross-enqueue via a message queue.)
  • How would you respect robots.txt and rate limits? (Per-hostname token bucket; consult robots.txt on first hit per hostname.)
  • How would you handle crawl persistence across restarts? (Materialize the queue and visited set to disk; resume on startup.)

Solve it now

Free. No sign-up. Python and JavaScript run instantly in your browser.

Output

Press Run or Cmd+Enter to execute

FAQ

Why doesn't JS need a mutex on the visited set?

Because JavaScript's event loop is single-threaded. Between any two synchronous operations (the .has() check and the .add()), no other code can run. The race only opens when you spawn Web Workers OR port to a multi-threaded language.

What is the canonical 'queue empty but worker still processing' bug?

A worker dequeues a URL, the queue becomes empty, and a naive coordinator returns. Then the worker fetches the page and finds 100 more URLs to enqueue — but the crawl already ended. The fix is to track active worker count and only terminate when (queue.empty() && active == 0).

Free learning resources

Curated free links for this problem.

Practice these live with InterviewChamp.AI

Drill Web Crawler Multithreaded and other Twilio interview questions under real-loop conditions with instant feedback on your reasoning, complexity claims, and code.

Practice these live with InterviewChamp.AI →