692. Top K Frequent Words

Q: Why ascending lexicographic order for ties?

The problem specifies it explicitly. In tokeniser contexts, alphabetical ordering provides a canonical, reproducible tie-breaker independent of the training data shuffle.

Q: When does the heap approach beat sort?

When k << n — O(n log k) < O(n log n). For the given constraints (n ≤ 500) the difference is negligible; in large-vocabulary scenarios it matters significantly.

Q: Can you use bucket sort here?

Yes — bucket by frequency, then sort words within each bucket lexicographically. This is O(n + B log B) where B is the max frequency. Useful when n is very large but frequencies are bounded.

mediumAsked at Cohere

Return the k most frequent words from a list, breaking frequency ties lexicographically — a counting problem whose difficulty hides entirely in the tiebreak and the heap's comparator direction. Cohere candidates report it because frequency-ranked vocabulary selection with deterministic tiebreaking is exactly how tokeniser vocabularies get built from a training corpus, making it a natural fit for NLP-platform coding rounds.

By Sam K., Founder, InterviewChamp.AI · Last verified 2026-05-22

Source citations

Public interview reports confirming this problem appears in Cohere loops.

Glassdoor (2026-Q1)— Cohere MLE candidates report Top K Frequent Words as a direct NLP-relevant problem asked in coding rounds.
Blind (2025-12)— Mentioned in Cohere prep threads as a natural extension of Top K Frequent Elements with an NLP spin.

Problem

Given an array of strings words and an integer k, return the k most frequent strings. Return the answer sorted by the frequency from highest to lowest. Sort the words with the same frequency by their lexicographical order.

Constraints

1 <= words.length <= 500
1 <= words[i].length <= 10
words[i] consists of lowercase English letters.
k is in the range [1, the number of unique words in the array].

Examples

Example 1

Input

words = ["i","love","leetcode","i","love","coding"], k = 2

Output

["i","love"]

Explanation: "i" and "love" both appear twice; both beat "leetcode" and "coding" (once each). Tied words are sorted lexicographically, but both tie so either order would work — "i" < "love" alphabetically.

Example 2

Input

words = ["the","day","is","sunny","the","the","the","sunny","is","is"], k = 4

Output

["the","is","sunny","day"]

Explanation: Frequencies: the=4, is=3, sunny=2, day=1.

Approaches

1. Frequency map + custom sort

Count word frequencies. Sort entries by frequency descending, breaking ties lexicographically ascending. Return the first k words.

Time: O(n log n)
Space: O(n)

function topKFrequent(words, k) {
  const freq = new Map();
  for (const w of words) freq.set(w, (freq.get(w) || 0) + 1);
  return [...freq.keys()]
    .sort((a, b) => {
      const diff = freq.get(b) - freq.get(a); // higher freq first
      return diff !== 0 ? diff : a.localeCompare(b); // lex asc on tie
    })
    .slice(0, k);
}

Tradeoff: O(n log n) — straightforward. The custom comparator with tiebreaking handles all cases. Clean and readable.

2. Min-heap of size k

Use a min-heap that evicts the 'least desirable' element when size exceeds k. The heap comparator reverses the final ordering so the least desirable (lowest frequency, highest lex) is on top.

Time: O(n log k)
Space: O(k)

// Conceptual JS — a real heap class would be needed for production:
function topKFrequent(words, k) {
  const freq = new Map();
  for (const w of words) freq.set(w, (freq.get(w) || 0) + 1);
  // Sort is equivalent here; true heap is O(n log k) vs O(n log n)
  const unique = [...freq.keys()];
  unique.sort((a, b) => {
    const diff = freq.get(b) - freq.get(a);
    return diff !== 0 ? diff : a.localeCompare(b);
  });
  return unique.slice(0, k);
}

Tradeoff: A true min-heap implementation is O(n log k) — superior when k << n. In JS, implementing a heap from scratch adds code volume; discuss the trade-off and mention that Python's heapq with a custom key makes this elegant.

Cohere-specific tips

Cohere's tokeniser training pipelines rank subword candidates by frequency and break ties deterministically. When presenting this problem, draw the parallel: 'Selecting the top-k vocabulary tokens is exactly this problem — frequency determines rank, lexicographic order breaks ties reproducibly.' Cohere values candidates who can articulate why deterministic tiebreaking matters: reproducible vocabulary construction across training runs.

Common mistakes

Reversing the lexicographic order — ties should be broken by ascending lex order, not descending.
Forgetting that localeCompare is needed for correct Unicode ordering in JS — a - b style comparison does not work for strings.
Returning word objects instead of just the word strings.
Not handling the edge case where k equals the total number of unique words — the slice handles this correctly.

Follow-up questions

An interviewer at Cohere may pivot to one of these next:

Sort Characters by Frequency (LC 451) — rank characters by frequency.
Find the Most Common Word (LC 819) — frequency ranking with a banned-word filter.
How would you maintain the top-k vocabulary in a streaming text pipeline with millions of words per second?

Solve it now

Free. No sign-up. Python and JavaScript run instantly in your browser.

Output

Press Run or Cmd+Enter to execute

FAQ

Why ascending lexicographic order for ties?

The problem specifies it explicitly. In tokeniser contexts, alphabetical ordering provides a canonical, reproducible tie-breaker independent of the training data shuffle.

When does the heap approach beat sort?

When k << n — O(n log k) < O(n log n). For the given constraints (n ≤ 500) the difference is negligible; in large-vocabulary scenarios it matters significantly.

Can you use bucket sort here?

Yes — bucket by frequency, then sort words within each bucket lexicographically. This is O(n + B log B) where B is the max frequency. Useful when n is very large but frequencies are bounded.

692. Top K Frequent Words

Source citations

Problem

Constraints

Examples

Approaches

1. Frequency map + custom sort

2. Min-heap of size k

Cohere-specific tips

Common mistakes

Follow-up questions

Solve it now

Editor Settings

FAQ

More Cohere coding interview questions

Companies that also ask Top K Frequent Words