692. Top K Frequent Words
mediumAsked at CohereReturn the k most frequent words, with lexicographic tiebreaking. Cohere includes this because frequency-ranked vocabulary selection with deterministic tiebreaking is exactly how tokeniser vocabularies are built from a training corpus.
By Alex Chen, Founder, InterviewChamp.AI · Last verified
Source citations
Public interview reports confirming this problem appears in Cohere loops.
- Glassdoor (2026-Q1)— Cohere MLE candidates report Top K Frequent Words as a direct NLP-relevant problem asked in coding rounds.
- Blind (2025-12)— Mentioned in Cohere prep threads as a natural extension of Top K Frequent Elements with an NLP spin.
Problem
Given an array of strings words and an integer k, return the k most frequent strings. Return the answer sorted by the frequency from highest to lowest. Sort the words with the same frequency by their lexicographical order.
Constraints
1 <= words.length <= 5001 <= words[i].length <= 10words[i] consists of lowercase English letters.k is in the range [1, the number of unique words in the array].
Examples
Example 1
words = ["i","love","leetcode","i","love","coding"], k = 2["i","love"]Explanation: "i" and "love" both appear twice; both beat "leetcode" and "coding" (once each). Tied words are sorted lexicographically, but both tie so either order would work — "i" < "love" alphabetically.
Example 2
words = ["the","day","is","sunny","the","the","the","sunny","is","is"], k = 4["the","is","sunny","day"]Explanation: Frequencies: the=4, is=3, sunny=2, day=1.
Approaches
1. Frequency map + custom sort
Count word frequencies. Sort entries by frequency descending, breaking ties lexicographically ascending. Return the first k words.
- Time
- O(n log n)
- Space
- O(n)
function topKFrequent(words, k) {
const freq = new Map();
for (const w of words) freq.set(w, (freq.get(w) || 0) + 1);
return [...freq.keys()]
.sort((a, b) => {
const diff = freq.get(b) - freq.get(a); // higher freq first
return diff !== 0 ? diff : a.localeCompare(b); // lex asc on tie
})
.slice(0, k);
}Tradeoff: O(n log n) — straightforward. The custom comparator with tiebreaking handles all cases. Clean and readable.
2. Min-heap of size k
Use a min-heap that evicts the 'least desirable' element when size exceeds k. The heap comparator reverses the final ordering so the least desirable (lowest frequency, highest lex) is on top.
- Time
- O(n log k)
- Space
- O(k)
// Conceptual JS — a real heap class would be needed for production:
function topKFrequent(words, k) {
const freq = new Map();
for (const w of words) freq.set(w, (freq.get(w) || 0) + 1);
// Sort is equivalent here; true heap is O(n log k) vs O(n log n)
const unique = [...freq.keys()];
unique.sort((a, b) => {
const diff = freq.get(b) - freq.get(a);
return diff !== 0 ? diff : a.localeCompare(b);
});
return unique.slice(0, k);
}Tradeoff: A true min-heap implementation is O(n log k) — superior when k << n. In JS, implementing a heap from scratch adds code volume; discuss the trade-off and mention that Python's heapq with a custom key makes this elegant.
Cohere-specific tips
Cohere's tokeniser training pipelines rank subword candidates by frequency and break ties deterministically. When presenting this problem, draw the parallel: 'Selecting the top-k vocabulary tokens is exactly this problem — frequency determines rank, lexicographic order breaks ties reproducibly.' Cohere values candidates who can articulate why deterministic tiebreaking matters: reproducible vocabulary construction across training runs.
Common mistakes
- Reversing the lexicographic order — ties should be broken by ascending lex order, not descending.
- Forgetting that localeCompare is needed for correct Unicode ordering in JS — a - b style comparison does not work for strings.
- Returning word objects instead of just the word strings.
- Not handling the edge case where k equals the total number of unique words — the slice handles this correctly.
Follow-up questions
An interviewer at Cohere may pivot to one of these next:
- Sort Characters by Frequency (LC 451) — rank characters by frequency.
- Find the Most Common Word (LC 819) — frequency ranking with a banned-word filter.
- How would you maintain the top-k vocabulary in a streaming text pipeline with millions of words per second?
Solve it now
Free. No sign-up. Python and JavaScript run instantly in your browser.
FAQ
Why ascending lexicographic order for ties?
The problem specifies it explicitly. In tokeniser contexts, alphabetical ordering provides a canonical, reproducible tie-breaker independent of the training data shuffle.
When does the heap approach beat sort?
When k << n — O(n log k) < O(n log n). For the given constraints (n ≤ 500) the difference is negligible; in large-vocabulary scenarios it matters significantly.
Can you use bucket sort here?
Yes — bucket by frequency, then sort words within each bucket lexicographically. This is O(n + B log B) where B is the max frequency. Useful when n is very large but frequencies are bounded.
Practice these live with InterviewChamp.AI
Drill Top K Frequent Words and other Cohere interview questions under real-loop conditions with instant feedback on your reasoning, complexity claims, and code.
Practice these live with InterviewChamp.AI →