Design a News Aggregator (Google News)

System Design
Medium
Google

Design a service that aggregates articles from various sources. Focus on scraping/crawling, article clustering, and deduplication.

Why Interviewers Ask This

Interviewers ask this to evaluate your ability to architect scalable distributed systems handling massive unstructured data. They specifically assess your understanding of web crawling constraints, real-time deduplication strategies using hashing or embeddings, and clustering algorithms for grouping similar stories. At Google, they also look for your capacity to balance trade-offs between consistency, latency, and cost in a high-throughput environment.

How to Answer This Question

1. Clarify requirements by defining scale (articles per second), freshness (real-time vs. batch), and core features like deduplication and clustering.
2. Estimate system capacity using back-of-the-envelope calculations for storage and bandwidth.
3. Design the high-level architecture, starting with a crawler layer that respects robots.txt and per-host rate limits.
4. Detail the ingestion pipeline, focusing on text normalization and near-duplicate detection using MinHash signatures bucketed with locality-sensitive hashing (LSH).
5. Explain the clustering logic, for example hierarchical agglomerative clustering or similarity search over vector embeddings, to group related articles into stories.
6. Discuss infrastructure choices, such as Bigtable for bulk article storage and Spanner where strong consistency is needed, and explain why each fits rather than only naming Google's internal tooling.
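The back-of-the-envelope estimate in step 2 can be sketched as a few lines of arithmetic. Every input below (source count, publishing rate, article size, retention) is an illustrative assumption, not a figure from the question; in an interview you would state your own numbers and adjust them with the interviewer.

```python
# Back-of-the-envelope capacity estimate.
# All inputs are illustrative assumptions, not measured figures.
SOURCES = 1_000_000               # assumed number of crawled sources
ARTICLES_PER_SOURCE_PER_DAY = 5   # assumed average publishing rate
AVG_ARTICLE_BYTES = 50 * 1024     # assumed average size of fetched HTML
RETENTION_DAYS = 365              # assumed retention window
SECONDS_PER_DAY = 86_400


def estimate(sources, per_source_per_day, avg_bytes, retention_days):
    """Return rough ingest rate, bandwidth, and raw storage needs."""
    articles_per_day = sources * per_source_per_day
    bytes_per_day = articles_per_day * avg_bytes
    return {
        "articles_per_day": articles_per_day,
        "articles_per_sec": articles_per_day / SECONDS_PER_DAY,
        # Average download bandwidth for new articles only (no re-crawls).
        "ingest_mbit_per_sec": bytes_per_day * 8 / SECONDS_PER_DAY / 1e6,
        # Raw storage before compression and replication.
        "storage_tb": bytes_per_day * retention_days / 1e12,
    }


if __name__ == "__main__":
    result = estimate(SOURCES, ARTICLES_PER_SOURCE_PER_DAY,
                      AVG_ARTICLE_BYTES, RETENTION_DAYS)
    for name, value in result.items():
        print(f"{name}: {value:,.2f}")
```

With these assumed inputs the average works out to roughly 58 new articles per second and about 93 TB of raw storage per year. Peak traffic during breaking news, re-crawls of unchanged pages, and replication would all push the real numbers higher.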

Key Points to Cover

  • Explicitly addressing the challenge of detecting near-duplicate content beyond simple string matching
  • Proposing specific techniques such as MinHash signatures bucketed with locality-sensitive hashing (LSH) for efficient large-scale deduplication
  • Demonstrating knowledge of Google-scale infrastructure patterns such as sharding and async queues
  • Balancing the trade-off between real-time processing speed and computational cost
  • Including a concrete strategy for handling crawl politeness and source reliability
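The near-duplicate detection point can be illustrated with a minimal sketch of MinHash signatures plus LSH banding. The signature length, band count, shingle size, and all names here (`minhash_signature`, `LSHIndex`) are hypothetical choices for illustration; a production system would tune them and would likely use an existing library rather than hand-rolled hashing.

```python
import hashlib
import re
from collections import defaultdict

NUM_HASHES = 64             # signature length (assumed)
BANDS = 16                  # LSH bands (assumed); 16 bands x 4 rows
ROWS = NUM_HASHES // BANDS


def shingles(text, k=3):
    """Normalize text and return its set of k-word shingles."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k])
            for i in range(max(1, len(words) - k + 1))}


def _hash(seed, shingle):
    """Seeded 64-bit hash; each seed acts as one MinHash permutation."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(text):
    """For every seed, keep the minimum hash over all shingles."""
    sh = shingles(text)
    return [min(_hash(seed, s) for s in sh) for seed in range(NUM_HASHES)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


class LSHIndex:
    """Buckets signatures by band so lookups avoid all-pairs comparison."""

    def __init__(self):
        self.buckets = defaultdict(set)

    def _keys(self, sig):
        for band in range(BANDS):
            yield band, tuple(sig[band * ROWS:(band + 1) * ROWS])

    def candidates(self, sig):
        """Return ids that share at least one full band with sig."""
        found = set()
        for key in self._keys(sig):
            found |= self.buckets.get(key, set())
        return found

    def add(self, doc_id, sig):
        for key in self._keys(sig):
            self.buckets[key].add(doc_id)


if __name__ == "__main__":
    index = LSHIndex()
    original = minhash_signature("Central bank raises interest rates by a quarter point.")
    index.add("article-1", original)
    copy = minhash_signature("central bank RAISES interest rates, by a quarter point!")
    print(index.candidates(copy), estimated_jaccard(original, copy))
```

Each new article is compared only against the candidates returned by the index, and `estimated_jaccard` (or an exact check) then confirms or rejects each candidate. More bands with fewer rows raise recall at the cost of more false candidates.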

Sample Answer

To design a News Aggregator like Google News, I'd first clarify the scope: we need to handle millions of sources with sub-second latency for breaking news. The system starts with a distributed crawler, likely built on Go…

Common Mistakes to Avoid

  • Ignoring the 'robots.txt' protocol and legal implications of aggressive web scraping
  • Focusing solely on database schema without explaining how to process unstructured text
  • Suggesting O(n^2) all-pairs comparison for deduplication, which won't scale to billions of articles
  • Overlooking the need for a feedback loop to improve clustering accuracy over time
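To avoid the first mistake in this list, a crawler can check robots.txt rules and space out requests per host before fetching anything. The sketch below uses Python's standard `urllib.robotparser`; the `PolitenessGate` class, the user-agent string, and the one-second default delay are hypothetical, and the robots.txt text is passed in directly so the example makes no network calls.

```python
import time
from urllib import robotparser
from urllib.parse import urlsplit

USER_AGENT = "example-news-bot"  # hypothetical crawler name


class PolitenessGate:
    """Decides whether and when a URL may be fetched."""

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay  # assumed fallback, in seconds
        self.rules = {}         # host -> parsed robots.txt
        self.next_allowed = {}  # host -> earliest time of the next fetch

    def load_rules(self, host, robots_txt):
        """Parse robots.txt content that was fetched elsewhere."""
        parser = robotparser.RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.rules[host] = parser

    def seconds_until_allowed(self, url, now=None):
        """Return None if the URL must not be fetched, else the wait time.

        A successful call reserves the host's next fetch slot.
        """
        host = urlsplit(url).netloc
        parser = self.rules.get(host)
        # Hosts without loaded rules are refused until robots.txt is known.
        if parser is None or not parser.can_fetch(USER_AGENT, url):
            return None
        now = time.monotonic() if now is None else now
        delay = parser.crawl_delay(USER_AGENT) or self.default_delay
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + delay
        return start - now


if __name__ == "__main__":
    gate = PolitenessGate()
    gate.load_rules("news.example.com",
                    "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n")
    print(gate.seconds_until_allowed("https://news.example.com/world/a"))
    print(gate.seconds_until_allowed("https://news.example.com/private/x"))
```

In a distributed crawler the per-host state would live in a shared store or be partitioned by host, so that two workers never hit the same site at once. Refusing hosts whose rules are not yet loaded is a conservative design choice, not a requirement of the protocol.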
