Design a System to Detect Plagiarism
Design a service to detect textual similarity across millions of documents. Focus on document fingerprinting (shingling) and indexing (MinHash/LSH).
Why Interviewers Ask This
Interviewers at Google ask this to evaluate your ability to handle massive-scale data problems where exact matching is computationally impossible. They specifically test your knowledge of probabilistic algorithms like MinHash and LSH, your capacity to balance precision with recall, and your skill in designing distributed systems that can process millions of documents efficiently within strict latency constraints.
How to Answer This Question
1. Clarify Requirements: Immediately define scale (document count, size), latency expectations, and the definition of 'similarity' (exact vs. near-duplicate). Ask if real-time or batch processing is required.
2. High-Level Architecture: Propose a pipeline separating ingestion, fingerprinting, indexing, and querying. Emphasize distributed storage like Bigtable or Spanner for scalability.
3. Core Algorithm Design: Detail the shingling process (k-grams) to break text into chunks. Explain how MinHash converts these sets into compact signatures while preserving Jaccard similarity.
4. Scaling via LSH: Describe using Locality Sensitive Hashing to group similar signatures into buckets, avoiding O(N^2) comparisons. Discuss partitioning strategies across clusters.
5. Refinement & Trade-offs: Address false positives/negatives, handling updates, and storage optimization. Conclude by discussing how this aligns with Google's focus on efficient, scalable infrastructure.
Key Points to Cover
- Explicitly mention MinHash and LSH as the core techniques for scaling beyond brute force
- Demonstrate understanding of Shingling (k-grams) as the preprocessing step for text normalization
- Discuss the trade-off between computational efficiency and detection accuracy (false positives)
- Propose a distributed architecture capable of handling petabytes of data and high throughput
- Address the requirement for incremental updates rather than full reprocessing for new documents
Sample Answer
To design a plagiarism detection system for millions of documents, we first clarify that exact string matching is infeasible due to scale. We need an approximate solution that identifies near-duplicates efficiently.
Fi…
Common Mistakes to Avoid
- Focusing solely on exact string matching algorithms like Rabin-Karp without addressing the impossibility of O(N^2) comparisons at scale
- Ignoring the preprocessing step of shingling, which is critical for detecting paraphrased or slightly modified content
- Overlooking the need for LSH to reduce the search space, leading to a proposal that would crash under real-world load
- Failing to discuss how to handle updates or new document ingestion, resulting in a static system design
Sound confident on this question in 5 minutes
Answer once and get a 30-second AI critique of your structure, content, and delivery. First attempt is free — no signup needed.