
How to remove duplicate lines from a text file?

The question is – I have a large text file and I want to remove all duplicate lines of text from the file. Is there a Linux / Unix command that can help me do this?

There are three commands that, when working together, can remove duplicate lines from a text file.

The commands are cat, sort and uniq.

The cat command prints the file to standard output. The sort command sorts the lines in lexicographic order. And the uniq command takes the output of sort and collapses adjacent duplicate lines, so each line is displayed only once. Note that uniq only compares adjacent lines, which is why the file must be sorted first.

The command that will remove all duplicate lines is as follows:

cat file.txt | sort | uniq
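
As an aside, cat is not strictly needed here, since sort can read the file directly. The following two equivalent commands produce the same result, the second using sort's -u option to combine the sort and uniq steps:

sort file.txt | uniq
sort -u file.txt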

For example, if this is the file:

This is a line of text
This is another line
This is the third line
This is a line of text
This is another line
This is the fourth line of text
This is a duplicate line
This is a duplicate line

The output of the above command is as follows:

This is a duplicate line
This is a line of text
This is another line
This is the fourth line of text
This is the third line

Note that the output order has changed, because sort arranged the lines alphabetically.
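
If you want to remove duplicates while preserving the original line order instead, a common alternative is the classic awk idiom, which prints each line only the first time it is seen:

awk '!seen[$0]++' file.txt

Here seen is an associative array keyed by the whole line; the expression is true only when the line has not been encountered before.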

On the other hand, if you want to display only the lines that are repeated, you can use the -d option of uniq:

cat file.txt | sort | uniq -d

The output of this command is as follows:

This is a duplicate line
This is a line of text
This is another line

These are the three lines that were repeated in the file!
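
Relatedly, if you want to know how many times each line occurs, pass the -c option to uniq, which prefixes every line with its count:

cat file.txt | sort | uniq -c

For the example file above, this prints the three duplicated lines with a count of 2 and the remaining lines with a count of 1.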

Note that sort performs an external merge sort, spilling to temporary files when needed, so the pipeline also works on files larger than available memory.
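
Finally, to save the de-duplicated result rather than just display it, redirect the output to a new file (deduped.txt here is just an example name). Do not redirect to the input file itself, because the shell truncates it before sort gets a chance to read it:

cat file.txt | sort | uniq > deduped.txt

Alternatively, sort's -o option writes the output to a named file, and it is documented to be safe even when that file is the same as the input:

sort -u -o file.txt file.txt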
