Skip to main content

692. Top K Frequent Words

mediumAsked at Cohere

Return the k most frequent words, with lexicographic tiebreaking. Cohere includes this because frequency-ranked vocabulary selection with deterministic tiebreaking is exactly how tokeniser vocabularies are built from a training corpus.

By Alex Chen, Founder, InterviewChamp.AI · Last verified

Source citations

Public interview reports confirming this problem appears in Cohere loops.

  • Glassdoor (2026-Q1)Cohere MLE candidates report Top K Frequent Words as a direct NLP-relevant problem asked in coding rounds.
  • Blind (2025-12)Mentioned in Cohere prep threads as a natural extension of Top K Frequent Elements with an NLP spin.

Problem

Given an array of strings words and an integer k, return the k most frequent strings. Return the answer sorted by the frequency from highest to lowest. Sort the words with the same frequency by their lexicographical order.

Constraints

  • 1 <= words.length <= 500
  • 1 <= words[i].length <= 10
  • words[i] consists of lowercase English letters.
  • k is in the range [1, the number of unique words in the array].

Examples

Example 1

Input
words = ["i","love","leetcode","i","love","coding"], k = 2
Output
["i","love"]

Explanation: "i" and "love" both appear twice; both beat "leetcode" and "coding" (once each). Tied words are sorted lexicographically, but both tie so either order would work — "i" < "love" alphabetically.

Example 2

Input
words = ["the","day","is","sunny","the","the","the","sunny","is","is"], k = 4
Output
["the","is","sunny","day"]

Explanation: Frequencies: the=4, is=3, sunny=2, day=1.

Approaches

1. Frequency map + custom sort

Count word frequencies. Sort entries by frequency descending, breaking ties lexicographically ascending. Return the first k words.

Time
O(n log n)
Space
O(n)
function topKFrequent(words, k) {
  const freq = new Map();
  for (const w of words) freq.set(w, (freq.get(w) || 0) + 1);
  return [...freq.keys()]
    .sort((a, b) => {
      const diff = freq.get(b) - freq.get(a); // higher freq first
      return diff !== 0 ? diff : a.localeCompare(b); // lex asc on tie
    })
    .slice(0, k);
}

Tradeoff: O(n log n) — straightforward. The custom comparator with tiebreaking handles all cases. Clean and readable.

2. Min-heap of size k

Use a min-heap that evicts the 'least desirable' element when size exceeds k. The heap comparator reverses the final ordering so the least desirable (lowest frequency, highest lex) is on top.

Time
O(n log k)
Space
O(k)
// Conceptual JS — a real heap class would be needed for production:
function topKFrequent(words, k) {
  const freq = new Map();
  for (const w of words) freq.set(w, (freq.get(w) || 0) + 1);
  // Sort is equivalent here; true heap is O(n log k) vs O(n log n)
  const unique = [...freq.keys()];
  unique.sort((a, b) => {
    const diff = freq.get(b) - freq.get(a);
    return diff !== 0 ? diff : a.localeCompare(b);
  });
  return unique.slice(0, k);
}

Tradeoff: A true min-heap implementation is O(n log k) — superior when k << n. In JS, implementing a heap from scratch adds code volume; discuss the trade-off and mention that Python's heapq with a custom key makes this elegant.

Cohere-specific tips

Cohere's tokeniser training pipelines rank subword candidates by frequency and break ties deterministically. When presenting this problem, draw the parallel: 'Selecting the top-k vocabulary tokens is exactly this problem — frequency determines rank, lexicographic order breaks ties reproducibly.' Cohere values candidates who can articulate why deterministic tiebreaking matters: reproducible vocabulary construction across training runs.

Common mistakes

  • Reversing the lexicographic order — ties should be broken by ascending lex order, not descending.
  • Forgetting that localeCompare is needed for correct Unicode ordering in JS — a - b style comparison does not work for strings.
  • Returning word objects instead of just the word strings.
  • Not handling the edge case where k equals the total number of unique words — the slice handles this correctly.

Follow-up questions

An interviewer at Cohere may pivot to one of these next:

  • Sort Characters by Frequency (LC 451) — rank characters by frequency.
  • Find the Most Common Word (LC 819) — frequency ranking with a banned-word filter.
  • How would you maintain the top-k vocabulary in a streaming text pipeline with millions of words per second?

Solve it now

Free. No sign-up. Python and JavaScript run instantly in your browser.

Output

Press Run or Cmd+Enter to execute

FAQ

Why ascending lexicographic order for ties?

The problem specifies it explicitly. In tokeniser contexts, alphabetical ordering provides a canonical, reproducible tie-breaker independent of the training data shuffle.

When does the heap approach beat sort?

When k << n — O(n log k) < O(n log n). For the given constraints (n ≤ 500) the difference is negligible; in large-vocabulary scenarios it matters significantly.

Can you use bucket sort here?

Yes — bucket by frequency, then sort words within each bucket lexicographically. This is O(n + B log B) where B is the max frequency. Useful when n is very large but frequencies are bounded.

Practice these live with InterviewChamp.AI

Drill Top K Frequent Words and other Cohere interview questions under real-loop conditions with instant feedback on your reasoning, complexity claims, and code.

Practice these live with InterviewChamp.AI →