TechEarl

How to Count Unique Matches with grep, sort, and uniq

The grep -o 'pattern' file | sort | uniq -c | sort -rn pipeline is the classic log-analysis one-liner. Why sort must come before uniq, how each stage works, worked examples for top IPs and status codes, the awk one-pass alternative for huge files, and the BSD vs GNU notes.

Ishan Karunaratne · 12 min read

grep -o 'pattern' file | sort | uniq -c | sort -rn is the one-liner I reach for whenever a question starts with "how many of each". It pulls every match out of a file, groups identical matches together, prefixes each group with its count, and ranks the result so the most frequent thing is on top. It is the workhorse of ad-hoc log analysis: top IP addresses, the spread of HTTP status codes, the error message that fired the most times.

The pipeline looks dense, but it is four small tools doing one job each. The single rule that trips everyone up is the sort before uniq: uniq only collapses adjacent duplicate lines, so if the input is not sorted the counts come out wrong. Get that ordering right and the rest is mechanical.

The one-liner

bash· Linux (GNU)
grep -oE 'pattern' file | sort | uniq -c | sort -rn

That prints one line per distinct match, each prefixed with how many times it occurred, sorted with the highest count first. The rest of this page breaks down what each stage contributes and why the order is not negotiable.

Step by step: what each stage does

grep -o 'pattern' file prints only the matched portion of each line, with every match on its own output line, instead of the whole line. This is the part most people forget. Plain grep prints the entire matching line, so counting "GET requests" by piping plain grep into sort | uniq -c actually counts whole log lines, which are nearly all distinct. -o reduces every line to just the thing you want to tally. The -E flag enables extended-regex syntax so (GET|POST) style alternation works without backslashes.

sort orders the matches alphabetically. On its own this does nothing useful, but it is mandatory because of how uniq works (next paragraph). After sort, every run of identical matches is contiguous.

uniq -c collapses each run of identical adjacent lines into a single line and prefixes it with the count of how many lines were in that run. Without -c it just dedups; with -c you get the frequency table. The output looks like 42 GET, with the count right-aligned and padded on the left with spaces.

sort -rn sorts that frequency table numerically (-n) in reverse (-r), so the largest count lands at the top. uniq -c emits its output in the input order (alphabetical, from the first sort), which is rarely the order you want to read. The second sort turns the table into a ranked leaderboard.
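To watch the stages hand data to each other, here is a minimal sketch on a made-up five-line sample. The file contents and the temporary file are assumptions for illustration, not taken from a real log.

```shell
# Hypothetical sample: five request lines written to a temporary file.
sample=$(mktemp)
printf '%s\n' \
  'GET /index.html' \
  'POST /login' \
  'GET /about' \
  'GET /index.html' \
  'POST /login' > "$sample"

# Stage 1: -o keeps only the matched text, each match on its own line.
matches=$(grep -oE 'GET|POST' "$sample")

# Stages 2-4: group identical matches, count each group, rank by count.
ranked=$(printf '%s\n' "$matches" | sort | uniq -c | sort -rn)

printf '%s\n' "$ranked"
rm -f "$sample"
```

The result should list GET with a count of 3 above POST with a count of 2; the exact width of the padding in front of the counts varies between implementations.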

Why sort must come before uniq

uniq is a streaming deduplicator. It reads one line at a time and compares each line only to the line immediately before it. It never holds the whole file in memory and never sees non-adjacent lines together. So given this input:

code
GET
POST
GET
GET
POST

uniq -c produces:

code
   1 GET
   1 POST
   2 GET
   1 POST

Four groups, because the two POST lines and the GET at the top are separated by other values. That is not a count of unique matches, it is a count of unbroken runs. Sorting first puts all the GET lines together and all the POST lines together, so uniq -c sees two runs and reports 3 GET and 2 POST. Sorting is what makes uniq global instead of local. This is the single most common mistake with this pipeline, and the symptom is duplicate keys in the output with smaller-than-expected counts.
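The difference is easy to reproduce. This sketch feeds the same five lines to uniq -c twice, once raw and once sorted; the variable names are only for illustration.

```shell
# The five-line input from the example above.
input=$(printf '%s\n' GET POST GET GET POST)

# Without sort, uniq -c counts unbroken runs.
unsorted_counts=$(printf '%s\n' "$input" | uniq -c)

# With sort, every value forms exactly one run, so the counts are global.
sorted_counts=$(printf '%s\n' "$input" | sort | uniq -c)

printf 'without sort:\n%s\n\nwith sort:\n%s\n' "$unsorted_counts" "$sorted_counts"
```

The first table has four rows with repeated keys; the second has one row per distinct value.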

Build a top-N leaderboard

uniq -c puts the count first, which is exactly the shape sort -rn wants. Add head to cap the output at the top entries:

bash· Linux (GNU)
grep -oE 'pattern' file | sort | uniq -c | sort -rn | head -10

head -10 keeps the top 10 (head -n 10 is the POSIX spelling). Use head -20 for a longer list, or replace head with tail to see the rarest matches instead. This is the standard "top talkers" report.

Worked example: top IPs in an access log

A combined-format access log starts each line with the client IP. grep -oE with an IP-shaped pattern pulls just that field, and the pipeline ranks the noisiest clients:

bash· Linux (GNU)
grep -oE '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' access.log | sort | uniq -c | sort -rn | head -10

The ^ anchors the match to the start of the line so a stray IP in a URL query string does not get counted. The output is a clean ranked list of clients, the fastest way to spot a scraper or a misbehaving health check.
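To see what the anchor buys you, this sketch runs the pattern with and without ^ on a hypothetical three-line log. The log lines are invented and the addresses come from the reserved documentation ranges.

```shell
# Hypothetical log; the first request carries an IP inside its query string.
log=$(mktemp)
printf '%s\n' \
  '192.0.2.10 - - [01/Jan/2024:00:00:01 +0000] "GET /?from=203.0.113.7 HTTP/1.1" 200 512' \
  '192.0.2.10 - - [01/Jan/2024:00:00:02 +0000] "GET / HTTP/1.1" 200 512' \
  '198.51.100.4 - - [01/Jan/2024:00:00:03 +0000] "GET / HTTP/1.1" 404 128' > "$log"

pattern='[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+'

# Anchored: only the client IP at the start of each line is counted.
anchored=$(grep -oE "^$pattern" "$log" | sort | uniq -c | sort -rn)

# Unanchored: the address inside the query string is counted as well.
unanchored=$(grep -oE "$pattern" "$log" | sort | uniq -c | sort -rn)

printf 'anchored:\n%s\n\nunanchored:\n%s\n' "$anchored" "$unanchored"
rm -f "$log"
```

The anchored run reports two clients, while the unanchored run adds a third "client" that never connected.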

Worked example: count each HTTP status code

The status code sits in a known column of the access log. When the thing you want to count is a whole field rather than a substring, extract it with awk instead of a regex:

bash· Linux (GNU)
awk '{print $9}' access.log | sort | uniq -c | sort -rn

In the combined log format the status code is field 9, so awk '{print $9}' prints just that column. The rest of the pipeline is identical. This awk '{print $N}' trick is the column-extraction companion to grep -o; the grep and print column article covers picking out fields in more depth.

Worked example: most common error message

To rank error messages, match from the ERROR token to the end of the line. grep -oE 'ERROR.*' skips lines without ERROR and, for the lines that have it, prints ERROR plus everything after it:

bash
grep -oE 'ERROR.*' app.log | sort | uniq -c | sort -rn | head

If the messages contain variable parts (timestamps, request IDs, user names), the raw counts fragment because every line is technically distinct. In that case strip the variable parts with sed before sorting, or count a stable prefix of each message rather than the whole thing.
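One way to do that normalization is a sed stage between grep and sort. The sketch below assumes a made-up log format with a req= hex ID and a millisecond duration; the placeholders ID and N are arbitrary choices, and a real log will need its own substitutions.

```shell
# Hypothetical app log: the same timeout error with varying IDs and durations.
log=$(mktemp)
printf '%s\n' \
  '2024-01-01T00:00:01 ERROR timeout after 3021 ms req=8f3a' \
  '2024-01-01T00:00:02 ERROR timeout after 2999 ms req=77b1' \
  '2024-01-01T00:00:03 INFO started worker' \
  '2024-01-01T00:00:04 ERROR disk full on /var req=01cd' \
  '2024-01-01T00:00:05 ERROR timeout after 3107 ms req=c4e0' > "$log"

# Replace the variable parts with fixed placeholders before grouping:
#   req=<hex id>    becomes req=ID
#   runs of digits  become N
normalized=$(grep -oE 'ERROR.*' "$log" \
  | sed -E -e 's/req=[0-9a-f]+/req=ID/' -e 's/[0-9]+/N/g' \
  | sort | uniq -c | sort -rn)

printf '%s\n' "$normalized"
rm -f "$log"
```

Without the sed stage every ERROR line here is distinct and each count is 1. With it, the three timeouts collapse into a single row. The req= substitution runs first so the digit rule cannot mangle the ID before it is replaced.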

Worked example: unique users in a log

Sometimes you do not want counts at all, just the distinct values. sort -u is the shortcut: it sorts and deduplicates in one step, no uniq needed.

bash· Linux (GNU)
grep -oE 'user=[a-z0-9_]+' app.log | sort -u

Pipe that to wc -l and you have the count of distinct users:

bash
grep -oE 'user=[a-z0-9_]+' app.log | sort -u | wc -l

sort -u is equivalent to sort | uniq but a little faster and one process shorter. Reach for it whenever you want the unique list and do not care how many times each value appeared.

uniq has more than -c

uniq -c is the count flag, but two siblings are worth knowing. Both still require sorted input:

Flag              What it keeps
uniq -c           Every distinct line, prefixed with its count
uniq -d           Only lines that appear more than once (the duplicates)
uniq -u           Only lines that appear exactly once (the non-repeated lines)
uniq (no flag)    One copy of each distinct line, no count

uniq -d answers "which values are duplicated at all", useful for spotting repeated IDs in a column that should be unique. uniq -u answers the inverse, "which values are one-offs". Combine uniq -d with -c (uniq -cd) to count only the duplicated values and skip the singletons.
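A small sketch of the three variants on a hypothetical ID column, where a appears three times, b once, and c twice:

```shell
# Hypothetical IDs, sorted first because every uniq flag needs adjacent duplicates.
ids=$(printf '%s\n' a b a c a c | sort)

dupes=$(printf '%s\n' "$ids" | uniq -d)         # values seen more than once
singles=$(printf '%s\n' "$ids" | uniq -u)       # values seen exactly once
dupe_counts=$(printf '%s\n' "$ids" | uniq -cd)  # counts for duplicated values only

printf 'duplicated: %s\n' $dupes
printf 'one-off:    %s\n' $singles
printf '%s\n' "$dupe_counts"
```

Here uniq -d yields a and c, uniq -u yields b, and uniq -cd yields the counts for a and c with no row for b.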

macOS BSD vs GNU

For this pipeline the platform difference is small. sort and uniq behave the same on macOS (BSD) and Linux (GNU) for the flags used here: -c, -d, -u on uniq, and -r, -n, -u on sort all work identically. The one divergence is sort -h (human-numeric sort, so 2K sorts below 1M): that flag is GNU and only landed in newer BSD sort, so a script targeting older macOS should not assume it. The grep -o and grep -oE flags are portable across BSD and GNU grep. The only grep gotcha on macOS is the missing -P (PCRE), and this pipeline never needs it; -E is enough.

Common mistakes

1. Forgetting to sort before uniq. Covered above and worth repeating because it is silent: the command runs, produces output that looks plausible, and the counts are simply wrong. Any time you see the same key twice in uniq -c output, you skipped the sort.

2. Expecting uniq alone to dedup an unsorted file. uniq file.txt on an unsorted file only collapses accidental adjacent repeats. It is not a set operation. Use sort -u for actual deduplication.

3. Case sensitivity. sort and uniq compare case-sensitively by default, so GET and get count as two different things. Add -i to uniq (uniq -ci) for case-insensitive grouping, and pass -f to sort too so the runs line up. Or normalize case earlier in the pipeline with tr A-Z a-z.

4. Leading whitespace from uniq -c confusing the next stage. uniq -c right-aligns the count by padding it on the left with spaces, so its output lines start with whitespace. That is fine for sort -rn, which skips leading blanks. But if you pipe uniq -c output into awk or cut expecting the count in field 1, remember the leading spaces: awk handles them, but cut does not. With the default tab delimiter cut -f1 finds nothing to split on and prints the whole line, and cut -d' ' -f1 returns an empty first field.

5. Counting whole lines instead of matches. Skipping -o on grep means you count distinct log lines, almost all of which are unique, so nearly every count comes back as 1. The -o is what makes the pipeline a frequency table instead of a line-uniqueness check.
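Mistakes 3 and 4 can be sidestepped together. This is a minimal sketch on made-up mixed-case input: it normalizes case before sorting, then lets awk re-emit the table as tab-separated value and count. It assumes each counted value is a single word, since awk's $2 would only capture the first word of a longer value.

```shell
# Hypothetical mixed-case input.
methods=$(printf '%s\n' GET get Get POST post)

# Normalize case first so sort and uniq both see identical text.
folded=$(printf '%s\n' "$methods" | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn)

# awk splits on runs of blanks, so the leading padding from uniq -c
# does not shift the field numbers: $1 is the count, $2 is the value.
tsv=$(printf '%s\n' "$folded" | awk '{print $2 "\t" $1}')

printf '%s\n' "$tsv"
```

The output has one tab-separated row per value, which cut -f1 and cut -f2 can then split reliably.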

When NOT to use this

The grep | sort | uniq -c pipeline is great for files up to a few hundred megabytes. Past that, the sort stage becomes the bottleneck: it has to buffer and order the entire stream (spilling to temp files on disk) before uniq sees a single line. For large inputs, count in one pass with an awk associative array, which never sorts:

bash
awk '{c[$0]++} END {for (k in c) print c[k], k}' big.log | sort -rn | head

awk builds a hash table keyed by each line, incrementing the count as it streams through once. Only the small final table gets sorted, not the whole input. For a multi-gigabyte log with few distinct values this is typically much faster and uses far less memory. As written it counts whole lines ($0); key the array on a field such as $9 to count a column instead. The tradeoff is that the output is unordered until the final sort -rn, and awk holds one hash entry per distinct value, so it is the wrong tool if the cardinality itself is huge (millions of distinct keys).
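The same one-pass idea extends from whole lines to a field or a regex match. The sketch below uses a hypothetical three-line access log; the second command is a rough awk counterpart of grep -o that only takes the first match on each line.

```shell
# Hypothetical access log in combined-style format.
log=$(mktemp)
printf '%s\n' \
  '192.0.2.10 - - [01/Jan/2024:00:00:01 +0000] "GET / HTTP/1.1" 200 512' \
  '192.0.2.10 - - [01/Jan/2024:00:00:02 +0000] "GET /a HTTP/1.1" 404 128' \
  '198.51.100.4 - - [01/Jan/2024:00:00:03 +0000] "GET / HTTP/1.1" 200 512' > "$log"

# Key the array on field 9 (the status code) instead of the whole line.
status_counts=$(awk '{c[$9]++} END {for (k in c) print c[k], k}' "$log" | sort -rn)

# Key it on a regex match: match() sets RSTART and RLENGTH, and substr()
# cuts the matched text out of the line (first match per line only).
ip_counts=$(awk 'match($0, /^[0-9.]+/) {c[substr($0, RSTART, RLENGTH)]++}
                 END {for (k in c) print c[k], k}' "$log" | sort -rn)

printf '%s\n\n%s\n' "$status_counts" "$ip_counts"
rm -f "$log"
```

Because awk prints the count with a single space and no padding, the output is also easier to feed into cut than uniq -c output.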

Beyond that, if you are running the same count every day, stop shell-scripting it. A real analytics pipeline, a log aggregator, or a database query is the right home for recurring reporting. The pipeline on this page is for the ad-hoc question you ask once.

Tags: grep, sort, uniq, awk, CLI, Linux, macOS, Log Analysis

Ishan Karunaratne

Tech Architect · Software Engineer · AI/DevOps

Tech architect and software engineer with 20+ years across software, Linux systems, DevOps, and infrastructure — and a more recent focus on AI. Currently Chief Technology Officer at a tech startup in the healthcare space.
