Find duplicate files
Is it possible to find duplicate files on my disk which are bit-for-bit identical but have different file names?
files duplicate-files
3
Note that any possible method of doing this will invariably have to compare every single file on your system to every single other file. So this is going to take a long time, even when taking shortcuts.
– Shadur
Apr 4 '13 at 14:02
4
@Shadur if one is ok with checksums, it boils down to comparing just the hashes, which on most systems is on the order of 10^(5±1) entries of usually under 64 bytes each. Of course, you have to read the data at least once. :)
– peterph
Apr 4 '13 at 14:57
15
@Shadur That's not true. You can reduce the time by checking for matching st_sizes, eliminating those with only one of the same, and then only calculating md5sums for matching st_sizes.
– Chris Down
Apr 4 '13 at 16:36
6
@Shadur even an incredibly silly approach disallowing any hash operations could do this in Θ(n log n) compares—not Θ(n²)—using any of several sort algorithms (based on file content).
– derobert
Apr 4 '13 at 17:09
1
@ChrisDown Yes, size matching would be one of the shortcuts I had in mind.
– Shadur
Apr 4 '13 at 19:38
7 Answers
fdupes can do this. From man fdupes:
Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.
In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.
To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.
As asked in the comments, you can get the largest duplicates by doing the following:
fdupes -r . | {
while IFS= read -r file; do
[[ $file ]] && du "$file"
done
} | sort -n
This will break if your filenames contain newlines.
Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?
– student
Apr 5 '13 at 9:31
@student: use something along the lines of (make sure fdupes just outputs the filenames with no extra information, or use cut or sed to keep just that): fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in human-readable format and only those with a size in megabytes or gigabytes. Change the command to suit the real outputs.
– Olivier Dulac
Apr 5 '13 at 12:27
2
@OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.
– Chris Down
Apr 5 '13 at 13:13
@student - Once you have the filenames, du piped to sort will tell you.
– Chris Down
Apr 5 '13 at 13:14
@ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^. I love those pages you link to, btw (I've been reading them for a few months, and they're full of useful info).
– Olivier Dulac
Apr 5 '13 at 14:05
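Putting those comments together: a minimal sketch (not part of the original answer) that lists each duplicate fdupes reports with a human-readable size, largest last. It assumes GNU du and sort (which support the -h flags) and file names without embedded newlines.
fdupes -r . | while IFS= read -r file; do
    # Skip the blank lines fdupes prints between duplicate groups.
    [[ $file ]] && du -h "$file"
done | sort -h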
Another good tool is fslint:
fslint is a toolset to find various problems with filesystems,
including duplicate files and problematic filenames
etc.
Individual command line tools are available in addition to the GUI and to access them, one can change to, or add to
$PATH the /usr/share/fslint/fslint directory on a standard install. Each of these commands in that directory have a
--help option which further details its parameters.
findup - find DUPlicate files
On Debian-based systems, you can install it with:
sudo apt-get install fslint
You can also do this manually if you don't want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:
find / -type f -exec md5sum {} \; > md5sums
gawk '{print $1}' md5sums | sort | uniq -d > dupes
while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes
Sample output (the file names in this example are the same, but it will also work when they are different):
$ while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes
---
/usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
/usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
---
/usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
/usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
---
/usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
/usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
---
This will be much slower than the dedicated tools already mentioned, but it will work.
3
It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.
– Chris Down
Apr 4 '13 at 16:34
@ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.
– terdon♦
Apr 4 '13 at 16:37
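As a concrete illustration of that suggestion (a sketch only, not part of the original answer): group files by size first and hash only the sizes that occur more than once. It assumes GNU find, awk, xargs and coreutils, a placeholder $target directory, and file names without embedded newlines or tabs.
find "$target" -type f -printf '%s\t%p\n' | sort -n > sizes
# Keep only paths whose size appears more than once, then hash just those.
awk -F'\t' 'NR==FNR { count[$1]++; next } count[$1] > 1 { print $2 }' sizes sizes \
    | xargs -d '\n' md5sum | sort | uniq --all-repeated=separate --check-chars=32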
Short answer: yes.
Longer version: have a look at the Wikipedia fdupes entry, it sports quite a nice list of ready-made solutions. Of course you can write your own; it's not that difficult - standard programs like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.
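For example, a minimal one-liner in that spirit (a sketch, assuming GNU coreutils; it compares only checksums, so it inherits the usual caveats about hash collisions and about file names containing newlines):
find . -type f -exec sha256sum {} + | sort | uniq --all-repeated=separate --check-chars=64
Each blank-line-separated group in the output is one set of files with identical content (up to hash collisions).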
If you believe a hash function (here MD5) is collision-free on your domain:
find "$target" -type f -exec md5sum '{}' + | sort \
    | uniq --all-repeated --check-chars=32 | cut --characters=35-
Want the identical files grouped? Write a simple script not_uniq.sh to format the output:
#!/bin/bash

# Reads sorted md5sum output on stdin and prints groups of duplicate files,
# each group preceded by a "=======" separator.
last_checksum=0
while IFS= read -r line; do
    checksum=${line:0:32}
    filename=${line:34}
    if [ "$checksum" == "$last_checksum" ]; then
        if [ "${last_filename:-0}" != '0' ]; then
            echo "$last_filename"
            unset last_filename
        fi
        echo "$filename"
    else
        if [ "${last_filename:-0}" == '0' ]; then
            echo "======="
        fi
        last_filename=$filename
    fi
    last_checksum=$checksum
done
Then change the find command to use your script:
chmod +x not_uniq.sh
find "$target" -type f -exec md5sum '{}' + | sort | ./not_uniq.sh
This is the basic idea. You will probably need to adjust the find command if your file names contain certain characters (e.g. spaces or newlines).
I'd like to add jdupes, a recent enhanced fork of fdupes, which promises to be faster and more feature-rich than fdupes (e.g. it has a size filter):
jdupes . -rS -X size-:50m > myjdups.txt
This will recursively find duplicated files bigger than 50MB in the current directory and output the resulting list to myjdups.txt.
Note that the output is not sorted by size; since that appears not to be built in, I have adapted @Chris_Down's answer above to achieve this:
jdupes -r . -X size-:50m | {
while IFS= read -r file; do
[[ $file ]] && du "$file"
done
} | sort -n > myjdups_sorted.txt
Wikipedia had an article (http://en.wikipedia.org/wiki/List_of_duplicate_file_finders) with a list of available open source software for this task, but it has now been deleted.
I will add that the GUI version of fslint is very interesting, allowing you to use a mask to select which files to delete. Very useful for cleaning up duplicated photos.
On Linux you can use:
- FSLint: http://www.pixelbeat.org/fslint/
- FDupes: https://en.wikipedia.org/wiki/Fdupes
- DupeGuru: https://www.hardcoded.net/dupeguru/
The last two work on many systems (Windows, Mac and Linux); I have not checked for FSLint.
5
It is better to provide actual information here and not just a link; the link might change and then the answer has no value left.
– Anthon
Jan 29 '14 at 11:22
2
Wikipedia page is empty.
– ihor_dvoretskyi
Sep 10 '15 at 9:01
yes, it has been cleaned, what a pity shake...
– MordicusEtCubitus
Dec 21 '15 at 16:23
I've edited it with these 3 tools
– MordicusEtCubitus
Dec 21 '15 at 16:30
Here's my take on that:
find -type f -size +3M -print0 | while IFS= read -r -d '' i; do
    echo -n '.'
    # Skip files that were already hashed on a previous run.
    if grep -qF "$i" md5-partial.txt; then echo -e "\n$i ---- Already counted, skipping."; continue; fi
    # Hash only the first 1 MB of the file.
    MD5=$(dd bs=1M count=1 if="$i" status=noxfer | md5sum)
    MD5=$(echo "$MD5" | cut -d' ' -f1)
    if grep -F "$MD5" md5-partial.txt; then echo -e "\n$i ---- Possible duplicate"; fi
    echo "$MD5 $i" >> md5-partial.txt
done
It's different in that it only hashes up to the first 1 MB of the file.
This has a few issues / features:
- There might be a difference after the first 1 MB, so the result is rather a candidate to check. I might fix that later.
- Checking by file size first could speed this up.
- Only takes files larger than 3 MB.
I use it to compare video clips so this is enough for me.
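For comparison, here is a single-pass variant of the same idea (a sketch only, not the author's script): compute the partial hash of each file once and let awk group repeated hashes in memory, instead of re-reading a growing text file for every file. It assumes GNU dd, md5sum and awk, and file names without embedded newlines.
find . -type f -size +3M -print0 | while IFS= read -r -d '' f; do
    # Hash only the first 1 MB of each file, as above.
    h=$(dd bs=1M count=1 if="$f" 2>/dev/null | md5sum | cut -d' ' -f1)
    printf '%s  %s\n' "$h" "$f"
done | awk '{
    h = substr($0, 1, 32); f = substr($0, 35)
    if (h in first) {
        # Second or later file with this partial hash: print the whole group.
        if (first[h] != "") { print "---"; print first[h]; first[h] = "" }
        print f
    } else {
        first[h] = f
    }
}'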