Design a System for Monitoring E-commerce Price Changes
Design a service that continuously scrapes competitor websites, detects price changes, and stores the history efficiently. Focus on politeness and change detection algorithms.
Why Interviewers Ask This
Interviewers at Amazon ask this to evaluate your ability to balance technical scalability with ethical constraints like politeness. They specifically test if you can design a distributed scraping system that handles high concurrency, detects price deltas efficiently without redundant storage, and respects rate limits to avoid IP bans while maintaining data integrity for competitive analysis.
How to Answer This Question
1. Clarify requirements: Define scale (e.g., millions of SKUs), latency needs for real-time alerts, and the specific definition of 'politeness' regarding request rates.
2. Architecture overview: Propose a microservices architecture with a scheduler, scraper workers, a change detection engine, and a time-series database.
3. Politeness strategy: Detail how to implement randomized delays, per-domain rate limits, and dynamic backoff algorithms so that scraping does not overload competitor sites or trigger blocking.
4. Change detection logic: Explain using content hashing or DOM diffing to identify actual price changes versus noise before writing to storage.
5. Data modeling: Describe storing only deltas in a partitioned NoSQL store such as DynamoDB (key-value) or Cassandra (wide-column), keyed by SKU and ordered by timestamp, to optimize reads of historical trends.
6. Scalability: Discuss horizontal scaling of scraper nodes and partitioning strategies based on product categories.
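The politeness step can be made concrete with a small per-domain state object. The following is a minimal sketch, not a definitive implementation: the class name `DomainPoliteness`, the 2-second base delay, the 300-second cap, and the choice of 429 and 503 as throttling signals are all assumptions made for illustration. A real system would likely also honor `robots.txt` and any `Retry-After` header.

```python
import random
from dataclasses import dataclass


@dataclass
class DomainPoliteness:
    """Tracks when the next request to a single domain may be sent (hypothetical helper)."""
    base_delay: float = 2.0    # assumed minimum gap between requests, in seconds
    max_delay: float = 300.0   # assumed cap on the backoff delay, in seconds
    jitter: float = 0.5        # adds up to +50% random extra delay
    failures: int = 0          # consecutive throttling responses seen
    next_allowed: float = 0.0  # timestamp before which no request should be sent

    def current_delay(self) -> float:
        # Exponential backoff: double the gap per consecutive throttle, capped.
        return min(self.base_delay * (2 ** self.failures), self.max_delay)

    def record(self, status: int, now: float, rng: random.Random) -> float:
        """Update state after a response and return the next allowed send time."""
        if status in (429, 503):   # treated here as "slow down" signals
            self.failures += 1
        else:
            self.failures = 0
        delay = self.current_delay()
        delay += rng.uniform(0, self.jitter * delay)  # randomized delay
        self.next_allowed = now + delay
        return self.next_allowed

    def can_send(self, now: float) -> bool:
        return now >= self.next_allowed
```

The scheduler would keep one such object per competitor domain and only dispatch a task when `can_send` returns true. Passing the clock value and random generator in as arguments keeps the logic easy to test.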
Key Points to Cover
- Explicitly define a polite scraping strategy using dynamic backoff and rate limiting
- Propose a hash-based change detection mechanism to minimize unnecessary storage writes
- Select a time-series or partitioned NoSQL database optimized for historical price trends
- Demonstrate understanding of distributed systems scaling through worker pools and task queues
- Address the trade-off between data freshness and server load on competitor sites
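The hash-based change detection point can be illustrated with a short sketch. It assumes the price and currency have already been extracted from the page; `price_fingerprint` and `PriceHistory` are hypothetical names, and the in-memory dictionaries stand in for whatever store the design actually uses. For a single price field a direct comparison would work just as well; a fingerprint becomes more useful when several extracted fields are compared at once.

```python
import hashlib
from decimal import Decimal


def price_fingerprint(price: Decimal, currency: str) -> str:
    """Hash a normalized price so formatting noise does not look like a change."""
    normalized = f"{currency.upper()}:{price.quantize(Decimal('0.01'))}"
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class PriceHistory:
    """In-memory stand-in for a delta store: one row per actual price change."""

    def __init__(self) -> None:
        self.last_fingerprint: dict[str, str] = {}
        self.deltas: dict[str, list[tuple[int, Decimal, str]]] = {}

    def observe(self, sku: str, price: Decimal, currency: str, observed_at: int) -> bool:
        """Record the observation only if it differs from the last stored price.

        Returns True when a new delta row was written.
        """
        fp = price_fingerprint(price, currency)
        if self.last_fingerprint.get(sku) == fp:
            return False  # unchanged: skip the storage write
        self.last_fingerprint[sku] = fp
        self.deltas.setdefault(sku, []).append((observed_at, price, currency))
        return True
```

For example, observing 19.99 USD, then 19.990 usd, then 17.49 USD for the same SKU would write two rows, since the second observation normalizes to the same fingerprint as the first.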
Sample Answer
To design this system, I would start by defining the scope: monitoring thousands of competitors across millions of products with sub-minute latency. The core architecture would consist of a central Scheduler that assigns…
Common Mistakes to Avoid
- Ignoring the 'politeness' constraint and proposing aggressive scraping that would get IPs banned
- Storing full HTML pages instead of just price deltas, leading to massive storage bloat
- Failing to explain how to handle race conditions when multiple workers scrape the same item simultaneously
- Overlooking error handling for network failures or CAPTCHA challenges during the scraping process
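One way to address the race-condition mistake is a short-lived lease per item, so that only one worker scrapes a given SKU at a time. The sketch below is an assumption-laden illustration: `LeaseTable` is a hypothetical in-memory stand-in for a shared store that supports conditional writes, and the 60-second TTL is arbitrary. The lease expires on its own so that a crashed worker cannot block an item forever.

```python
import threading


class LeaseTable:
    """In-memory stand-in for a shared store with conditional writes (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # sku -> (worker id, lease expiry timestamp)
        self._leases: dict[str, tuple[str, float]] = {}

    def try_claim(self, sku: str, worker: str, now: float, ttl: float = 60.0) -> bool:
        """Claim the SKU unless another worker holds an unexpired lease."""
        with self._lock:
            holder = self._leases.get(sku)
            if holder is not None and holder[1] > now and holder[0] != worker:
                return False  # someone else is already scraping this item
            self._leases[sku] = (worker, now + ttl)
            return True

    def release(self, sku: str, worker: str) -> None:
        """Release the lease, but only if this worker still owns it."""
        with self._lock:
            holder = self._leases.get(sku)
            if holder is not None and holder[0] == worker:
                del self._leases[sku]
```

A worker would call `try_claim` before fetching a page and `release` after writing the result. Combined with skipping unchanged prices at write time, a duplicate scrape that slips through after a lease expires does not produce a duplicate history row.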