Count Unique Matches: grep, sort, uniq -c Explained (2026)

grep -o 'pattern' file | sort | uniq -c | sort -rn is the one-liner I reach for whenever a question starts with "how many of each". It pulls every match out of a file, groups identical matches together, prefixes each group with its count, and ranks the result so the most frequent thing is on top. It is the workhorse of ad-hoc log analysis: top IP addresses, the spread of HTTP status codes, the error message that fired the most times.

The pipeline looks dense, but it is four small tools doing one job each. The single rule that trips everyone up is the sort before uniq: uniq only collapses adjacent duplicate lines, so if the input is not sorted the counts come out wrong. Get that ordering right and the rest is mechanical.

The one-liner

bash· Linux (GNU)

grep -oE 'pattern' file | sort | uniq -c | sort -rn

That prints one line per distinct match, each prefixed with how many times it occurred, sorted with the highest count first. The rest of this page breaks down what each stage contributes and why the order is not negotiable.

Step by step: what each stage does

grep -o 'pattern' file prints only the matched portion of each line, one match per line, instead of the whole line. This is the part most people forget. Plain grep prints the entire matching line, so counting "GET requests" by piping plain grep into uniq actually counts whole log lines, which are nearly all distinct. -o reduces every line to just the thing you want to tally. The -E adds extended-regex syntax so (GET|POST) style alternation works without backslashes.

sort orders the matches alphabetically. On its own this does nothing useful, but it is mandatory because of how uniq works (next paragraph). After sort, every run of identical matches is contiguous.

uniq -c collapses each run of identical adjacent lines into a single line and prefixes it with the count of how many lines were in that run. Without -c it just dedups; with -c you get the frequency table. The output looks like 42 GET, with the count right-aligned behind leading spaces.

sort -rn sorts that frequency table numerically (-n) in reverse (-r), so the largest count lands at the top. uniq -c emits its output in the input order (alphabetical, from the first sort), which is rarely the order you want to read. The second sort turns the table into a ranked leaderboard.
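To make the four stages concrete, here is a small sketch that captures the output of each one separately. The sample log lines and the stage1 to stage4 variable names are made up for illustration and are not part of any real log format.

```bash
# Hypothetical sample log, created only for this illustration.
sample=$(mktemp)
printf '%s\n' \
  'GET /index.html' \
  'POST /login' \
  'GET /about.html' \
  'GET /index.html' \
  'POST /login' > "$sample"

# Stage 1: -o keeps only the matched token, one per line.
stage1=$(grep -oE 'GET|POST' "$sample")

# Stage 2: sort makes identical tokens adjacent.
stage2=$(printf '%s\n' "$stage1" | sort)

# Stage 3: uniq -c collapses each run and prefixes its length.
stage3=$(printf '%s\n' "$stage2" | uniq -c)

# Stage 4: sort -rn ranks the table, largest count first.
stage4=$(printf '%s\n' "$stage3" | sort -rn)

printf '%s\n' "$stage4"
rm -f "$sample"
```

Printing stage1 through stage3 in turn is a quick way to debug a pipeline whose final counts look off.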

Why sort must come before uniq

uniq is a streaming deduplicator. It reads one line at a time and compares each line only to the line immediately before it. It never holds the whole file in memory and never sees non-adjacent lines together. So given this input:

code

GET
POST
GET
GET
POST

uniq -c produces:

code

   1 GET
   1 POST
   2 GET
   1 POST

Four groups, because the two POST lines and the GET at the top are separated by other values. That is not a count of unique matches, it is a count of unbroken runs. Sorting first puts all the GET lines together and all the POST lines together, so uniq -c sees two runs and reports 3 GET and 2 POST. Sorting is what makes uniq global instead of local. This is the single most common mistake with this pipeline, and the symptom is duplicate keys in the output with smaller-than-expected counts.
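The difference is easy to reproduce. The sketch below feeds the same five lines through uniq -c with and without a sort; the awk step only normalizes the padding in front of the count so the result looks the same across uniq implementations. The variable names are illustrative.

```bash
# The five-line input from the example above.
input=$(printf '%s\n' GET POST GET GET POST)

# Without sort: uniq -c counts unbroken runs, so keys repeat.
runs=$(printf '%s\n' "$input" | uniq -c | awk '{print $1, $2}')

# With sort: every key forms exactly one run.
totals=$(printf '%s\n' "$input" | sort | uniq -c | awk '{print $1, $2}')

# Diagnostic: any key listed here appeared more than once in the
# unsorted table, which is the symptom of a missing sort.
repeated_keys=$(printf '%s\n' "$runs" | awk '{print $2}' | sort | uniq -d)

printf 'unsorted:\n%s\nsorted:\n%s\n' "$runs" "$totals"
```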

Build a top-N leaderboard

uniq -c puts the count first, which is exactly the shape sort -rn wants. Add head to cap the output at the top entries:

bash· Linux (GNU)

grep -oE 'pattern' file | sort | uniq -c | sort -rn | head -10

head -10 keeps the top 10. Swap it for head -20 to see more entries, or pipe to tail instead if you want the rarest matches. This is the standard "top talkers" report.
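If you run this often, the pipeline can be wrapped in two small shell functions. This is only a sketch: the names top_matches and rare_matches are hypothetical, and the rare variant sorts ascending and takes the head, which lists the rarest match first rather than last as tail would.

```bash
# top_matches PATTERN FILE [N]: the N most frequent matches (default 10).
top_matches() {
  grep -oE -e "$1" -- "$2" | sort | uniq -c | sort -rn | head -n "${3:-10}"
}

# rare_matches PATTERN FILE [N]: the N least frequent matches, rarest first.
rare_matches() {
  grep -oE -e "$1" -- "$2" | sort | uniq -c | sort -n | head -n "${3:-10}"
}
```

A call such as top_matches 'GET|POST' access.log 5 would then print the five most frequent methods, assuming access.log exists in the current directory.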

Worked example: top IPs in an access log

A combined-format access log starts each line with the client IP. grep -oE with an IP-shaped pattern pulls just that field, and the pipeline ranks the noisiest clients:

bash· Linux (GNU)

grep -oE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log | sort | uniq -c | sort -rn | head -10

The ^ anchors the match to the start of the line so a stray IP in a URL query string does not get counted. The output is a clean ranked list of clients, the fastest way to spot a scraper or a misbehaving health check.

Worked example: count each HTTP status code

The status code sits in a known column of the access log. When the thing you want to count is a whole field rather than a substring, extract it with awk instead of a regex:

bash· Linux (GNU)

awk '{print $9}' access.log | sort | uniq -c | sort -rn

In the combined log format the status code is field 9, so awk '{print $9}' prints just that column. The rest of the pipeline is identical. This awk '{print $N}' trick is the column-extraction companion to grep -o; the grep and print column article covers picking out fields in more depth.

Worked example: most common error message

To rank error messages, match the lines that contain ERROR and carve out the message text in one step. Here grep -oE prints the ERROR token and everything after it on the line:

bash

grep -oE 'ERROR.*' app.log | sort | uniq -c | sort -rn | head

If the messages contain variable parts (timestamps, request IDs, user names), the raw counts fragment because every line is technically distinct. In that case strip the variable parts with sed before sorting, or count a stable prefix of each message rather than the whole thing.
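As a sketch of the sed approach, assume a made-up log format where each line starts with a timestamp and ends with a request ID written as req= followed by hex digits. The timestamp sits before ERROR, so grep -o already drops it; sed only has to remove the trailing ID.

```bash
# Hypothetical log lines; the req=<hex> suffix is an assumed format.
sample=$(mktemp)
printf '%s\n' \
  '2026-01-05 10:00:01 ERROR timeout talking to db req=a1f3' \
  '2026-01-05 10:00:07 ERROR timeout talking to db req=9c2e' \
  '2026-01-05 10:01:12 ERROR disk full on /var req=77b0' \
  '2026-01-05 10:02:30 INFO started worker req=1234' \
  '2026-01-05 10:03:44 ERROR timeout talking to db req=e5d1' > "$sample"

# Strip the variable suffix so identical messages collapse into one key.
ranked=$(grep -oE 'ERROR.*' "$sample" \
  | sed -E 's/ req=[0-9a-f]+$//' \
  | sort | uniq -c | sort -rn)

printf '%s\n' "$ranked"
rm -f "$sample"
```

Without the sed stage, the three timeout lines would each count once because their request IDs differ.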

Worked example: unique users in a log

Sometimes you do not want counts at all, just the distinct values. sort -u is the shortcut: it sorts and deduplicates in one step, no uniq needed.

bash· Linux (GNU)

grep -oE 'user=[a-z0-9_]+' app.log | sort -u

Pipe that to wc -l and you have the count of distinct users:

bash

grep -oE 'user=[a-z0-9_]+' app.log | sort -u | wc -l

sort -u is equivalent to sort | uniq but a little faster and one process shorter. Reach for it whenever you want the unique list and do not care how many times each value appeared.
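A quick way to convince yourself of the equivalence is to run both forms on the same input. The sample lines below are invented; the tr -d ' ' step is there because some wc implementations pad the number with spaces, which matters if you compare or reuse the value in a script.

```bash
# Hypothetical application log for illustration.
sample=$(mktemp)
printf '%s\n' \
  'login user=alice ok' \
  'login user=bob ok' \
  'logout user=alice' \
  'login user=carol_9 ok' > "$sample"

# One step: sort and deduplicate together.
distinct=$(grep -oE 'user=[a-z0-9_]+' "$sample" | sort -u)

# Two steps: should yield the same list.
two_step=$(grep -oE 'user=[a-z0-9_]+' "$sample" | sort | uniq)

# Number of distinct users, with any padding from wc removed.
count=$(printf '%s\n' "$distinct" | wc -l | tr -d ' ')

printf '%s distinct users\n' "$count"
rm -f "$sample"
```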

uniq has more than -c

uniq -c is the count flag, but two siblings are worth knowing. Both still require sorted input:

Flag	What it keeps
`uniq -c`	Every distinct line, prefixed with its count
`uniq -d`	Only lines that appear more than once (the duplicates)
`uniq -u`	Only lines that appear exactly once (the non-repeated lines)
`uniq` (no flag)	One copy of each distinct line, no count

uniq -d answers "which values are duplicated at all", useful for spotting repeated IDs in a column that should be unique. uniq -u answers the inverse, "which values are one-offs". Combine uniq -d with -c (uniq -cd) to count only the duplicated values and skip the singletons.
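The three behaviors are easiest to compare side by side on one sorted list. The IDs below are made up, and awk is used only to normalize the padding in front of the counts.

```bash
# Six hypothetical IDs, sorted so that duplicates are adjacent.
ids=$(printf '%s\n' 1001 1002 1001 1003 1002 1001 | sort)

# Values that occur more than once.
dups=$(printf '%s\n' "$ids" | uniq -d)

# Values that occur exactly once.
singles=$(printf '%s\n' "$ids" | uniq -u)

# Duplicated values only, each with its count.
dup_counts=$(printf '%s\n' "$ids" | uniq -cd | awk '{print $1, $2}')

printf 'dups:\n%s\nsingles:\n%s\ncounts:\n%s\n' "$dups" "$singles" "$dup_counts"
```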

macOS BSD vs GNU

For this pipeline the platform difference is small. sort and uniq behave the same on macOS (BSD) and Linux (GNU) for the flags used here: -c, -d, -u on uniq, and -r, -n, -u on sort all work identically. The one divergence is sort -h (human-numeric sort, so 2K sorts below 1M): that flag is GNU and only landed in newer BSD sort, so a script targeting older macOS should not assume it. The grep -o and grep -oE flags are portable across BSD and GNU grep. The only grep gotcha on macOS is the missing -P (PCRE), and this pipeline never needs it; -E is enough.
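If a script has to run on both older and newer systems, one option is to probe for sort -h at runtime instead of assuming it. This is a minimal sketch; the have_sort_h variable name is arbitrary.

```bash
# Probe whether this system's sort accepts -h by trying it on one line.
if printf '1K\n' | sort -h >/dev/null 2>&1; then
  have_sort_h=yes
else
  have_sort_h=no
fi

printf 'sort -h available: %s\n' "$have_sort_h"
```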

Common mistakes

1. Forgetting to sort before uniq. Covered above and worth repeating because it is silent: the command runs, produces output that looks plausible, and the counts are simply wrong. Any time you see the same key twice in uniq -c output, you skipped the sort.

2. Expecting uniq alone to dedup an unsorted file. uniq file.txt on an unsorted file only collapses accidental adjacent repeats. It is not a set operation. Use sort -u for actual deduplication.

3. Case sensitivity. sort and uniq are byte-comparison by default, so GET and get count as two different things. Add -i to uniq (uniq -ic) for case-insensitive grouping, and pass -f to sort so the runs line up. Or normalize case earlier in the pipeline with tr A-Z a-z.

4. Leading whitespace from uniq -c confusing the next stage. uniq -c right-aligns the count by padding it on the left with spaces, so its output lines start with whitespace. That is fine for sort -rn, which skips leading blanks. But if you pipe uniq -c output into awk or cut expecting the count in field 1, remember the leading spaces: awk handles them, cut -f1 with a tab delimiter does not.

5. Counting whole lines instead of matches. Skipping -o on grep means you count distinct log lines, almost all of which are unique, so every count comes back as 1. The -o is what makes the pipeline a frequency table instead of a line-uniqueness check.
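Mistakes 3 and 4 can both be handled inside the pipeline. The sketch below shows two ways to group case variants and uses awk to re-emit the count without its leading padding. The input tokens are made up. In the second variant only the counts are kept, because which spelling uniq prints for a mixed-case group depends on the order sort leaves them in.

```bash
# Hypothetical tokens with inconsistent case.
tokens=$(printf '%s\n' GET get Post POST GET)

# Option A: fold case up front so sort and uniq see identical bytes.
# awk re-prints "count key" with no leading spaces.
folded=$(printf '%s\n' "$tokens" | tr 'A-Z' 'a-z' \
  | sort | uniq -c | sort -rn | awk '{print $1, $2}')

# Option B: sort -f keeps case variants adjacent, uniq -i treats them as equal.
grouped=$(printf '%s\n' "$tokens" | sort -f | uniq -ci | sort -rn \
  | awk '{print $1}')

printf 'folded:\n%s\ngrouped counts:\n%s\n' "$folded" "$grouped"
```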

When NOT to use this

The grep | sort | uniq -c pipeline is great for files up to a few hundred megabytes. Past that, the sort stage becomes the bottleneck: it has to buffer and order the entire stream (spilling to temp files on disk) before uniq sees a single line. For large inputs, count in one pass with an awk associative array, which never sorts:

bash

awk '{c[$0]++} END {for (k in c) print c[k], k}' big.log | sort -rn | head

awk builds a hash table keyed by each line, incrementing the count as it streams through once. Only the small final table gets sorted, not the whole input. For a multi-gigabyte log this is dramatically faster and uses far less memory. The tradeoff is that the output is unordered until the final sort -rn, and awk holds one hash entry per distinct value, so it is the wrong tool if the cardinality itself is huge (millions of distinct keys).
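The same idea works for a single field instead of the whole line. The sketch below counts the third column of a made-up three-column log both ways and stores the results so they can be compared; the column layout is an assumption for the example.

```bash
# Hypothetical log: method, path, status.
sample=$(mktemp)
printf '%s\n' \
  'GET /a 200' \
  'GET /b 404' \
  'POST /c 200' \
  'GET /d 200' \
  'GET /e 500' \
  'POST /f 404' > "$sample"

# One pass: hash table keyed by the status field, sorted only at the end.
via_awk=$(awk '{c[$3]++} END {for (k in c) print c[k], k}' "$sample" | sort -rn)

# Classic pipeline: sorts every extracted value before counting.
via_sort=$(awk '{print $3}' "$sample" | sort | uniq -c | sort -rn \
  | awk '{print $1, $2}')

printf '%s\n' "$via_awk"
rm -f "$sample"
```

Both variables should hold the same ranked table; only the amount of data that passes through sort differs.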

Beyond that, if you are running the same count every day, stop shell-scripting it. A real analytics pipeline, a log aggregator, or a database query is the right home for recurring reporting. The pipeline on this page is for the ad-hoc question you ask once.

FAQ

Why do I have to sort before uniq?

uniq only collapses duplicate lines that are adjacent to each other. It reads the input as a stream and compares each line to the one immediately before it, so it never sees the whole file at once. If identical values are scattered through the file, uniq treats each separated run as its own group and the counts come out wrong.

Sorting first puts every identical line next to its twins, which makes uniq behave as if it were counting globally. Skipping the sort is the most common bug with this pipeline, and the symptom is the same value appearing more than once in the output.

What does the -o flag do in grep -o?

grep -o prints only the part of the line that matched the pattern, one match per line, instead of the entire matching line. This is essential for counting: without it you are counting whole log lines, which are nearly all distinct, so every count comes back as one.

With -o, each line is reduced to just the token you want to tally (an IP, a status code, a username), so the downstream sort | uniq -c produces a real frequency table.

How do I get just the count of unique values, not a breakdown?

Pipe the matches through sort -u to get the distinct list, then through wc -l to count them. sort -u sorts and deduplicates in one step, so you do not need a separate uniq.

If you instead want a per-value breakdown (how many times each value appeared), use uniq -c after a plain sort, then sort -rn to rank by frequency.

What is the difference between uniq -c, uniq -d, and uniq -u?

uniq -c keeps every distinct line and prefixes it with how many times it occurred. uniq -d keeps only the lines that appeared more than once (the duplicates). uniq -u keeps only the lines that appeared exactly once (the non-repeated lines).

All three need sorted input for the same reason: uniq compares adjacent lines only. Use uniq -cd to count just the duplicated values and drop the one-offs.

Why are my counts wrong even though I sorted?

Three usual causes. First, case: sort and uniq compare bytes, so GET and get are different. Normalize with tr A-Z a-z early, or pass -f to sort and -i to uniq. Second, trailing whitespace or hidden characters making lines that look identical actually differ; pipe through sed to trim. Third, you forgot grep -o and are counting whole lines instead of the matched token.

When should I use awk instead of sort and uniq?

Use awk when the file is large enough that the sort stage becomes slow. The sort | uniq -c pipeline must sort the entire input before counting; an awk associative array counts in a single streaming pass with no sort.

The pattern is an awk program that increments a hash entry per distinct line, then prints the table at the end. For a multi-gigabyte log it is far faster. The tradeoff: the output is unordered until you add a final sort -rn, and awk holds one hash slot per distinct value, so it struggles when there are millions of unique keys.

Ishan Karunaratne