Find duplicate files
Is it possible to find duplicate files on my disk which are bit-for-bit identical but have different file names?
files duplicate-files
3
Note that any possible method of doing this will invariably have to compare every single file on your system to every single other file. So this is going to take a long time, even when taking shortcuts.
– Shadur
Apr 4 '13 at 14:02
4
@Shadur if one is ok with checksums, it boils down to comparing just the hashes, which on most systems is on the order of 10^(5±1) entries of usually under 64 bytes each. Of course, you have to read the data at least once. :)
– peterph
Apr 4 '13 at 14:57
15
@Shadur That's not true. You can reduce the time by checking for matching st_sizes, eliminating those with only one of the same, and then only calculating md5sums for matching st_sizes.
– Chris Down
Apr 4 '13 at 16:36
6
@Shadur even an incredibly silly approach disallowing any hash operations could do this in Θ(n log n) compares—not Θ(n²)—using any of several sort algorithms (based on file content).
– derobert
Apr 4 '13 at 17:09
1
@ChrisDown Yes, size matching would be one of the shortcuts I had in mind.
– Shadur
Apr 4 '13 at 19:38
7 Answers
fdupes can do this. From man fdupes:
Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.
In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.
To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.
As asked in the comments, you can get the largest duplicates by doing the following:
fdupes -r . | {
while IFS= read -r file; do
[[ $file ]] && du "$file"
done
} | sort -n
This will break if your filenames contain newlines.
Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?
– student
Apr 5 '13 at 9:31
@student: use something along the lines of (make sure fdupes just outputs the filenames with no extra information, or use cut or sed to keep just that): fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in human-readable format and only those with a size in megabytes or gigabytes. Change the command to suit the real outputs.
– Olivier Dulac
Apr 5 '13 at 12:27
2
@OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.
– Chris Down
Apr 5 '13 at 13:13
@student - Once you have the filenames, du piped to sort will tell you.
– Chris Down
Apr 5 '13 at 13:14
@ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^. I love those pages you link to, btw (I've been reading them for a few months, and they're full of useful info).
– Olivier Dulac
Apr 5 '13 at 14:05
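Putting those comments together: a minimal sketch (not part of the original answer) that lists each duplicate fdupes reports with a human-readable size, largest last. It assumes GNU du and sort (which support the -h flags) and file names without embedded newlines.
fdupes -r . | while IFS= read -r file; do
    # Skip the blank lines fdupes prints between duplicate groups.
    [[ $file ]] && du -h "$file"
done | sort -h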
Another good tool is fslint:
fslint is a toolset to find various problems with filesystems,
including duplicate files and problematic filenames
etc.
Individual command line tools are available in addition to the GUI and to access them, one can change to, or add to
$PATH the /usr/share/fslint/fslint directory on a standard install. Each of these commands in that directory have a
--help option which further details its parameters.
findup - find DUPlicate files
On Debian-based systems, you can install it with:
sudo apt-get install fslint
You can also do this manually if you don't want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:
find / -type f -exec md5sum {} \; > md5sums
gawk '{print $1}' md5sums | sort | uniq -d > dupes
while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes
Sample output (the file names in this example are the same, but it will also work when they are different):
$ while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes
---
/usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
/usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
---
/usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
/usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
---
/usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
/usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
---
This will be much slower than the dedicated tools already mentioned, but it will work.
3
It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.
– Chris Down
Apr 4 '13 at 16:34
@ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.
– terdon♦
Apr 4 '13 at 16:37
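As a concrete illustration of that suggestion (a sketch only, not part of the original answer): group files by size first and hash only the sizes that occur more than once. It assumes GNU find, awk, xargs and coreutils, a placeholder $target directory, and file names without embedded newlines or tabs.
find "$target" -type f -printf '%s\t%p\n' | sort -n > sizes
# Keep only paths whose size appears more than once, then hash just those.
awk -F'\t' 'NR==FNR { count[$1]++; next } count[$1] > 1 { print $2 }' sizes sizes \
    | xargs -d '\n' md5sum | sort | uniq --all-repeated=separate --check-chars=32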
Short answer: yes.
Longer version: have a look at the Wikipedia fdupes entry, it sports quite a nice list of ready-made solutions. Of course you can write your own; it's not that difficult - standard programs like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.
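For example, a minimal one-liner in that spirit (a sketch, assuming GNU coreutils; it compares only checksums, so it inherits the usual caveats about hash collisions and about file names containing newlines):
find . -type f -exec sha256sum {} + | sort | uniq --all-repeated=separate --check-chars=64
Each blank-line-separated group in the output is one set of files with identical content (up to hash collisions).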
If you believe a hash function (here MD5) is collision-free on your domain:
find "$target" -type f -exec md5sum '{}' + | sort \
    | uniq --all-repeated --check-chars=32 | cut --characters=35-
Want the identical files grouped? Write a simple script not_uniq.sh to format the output:
#!/bin/bash

# Reads sorted md5sum output on stdin and prints groups of duplicate files,
# each group preceded by a "=======" separator.
last_checksum=0
while IFS= read -r line; do
    checksum=${line:0:32}
    filename=${line:34}
    if [ "$checksum" == "$last_checksum" ]; then
        if [ "${last_filename:-0}" != '0' ]; then
            echo "$last_filename"
            unset last_filename
        fi
        echo "$filename"
    else
        if [ "${last_filename:-0}" == '0' ]; then
            echo "======="
        fi
        last_filename=$filename
    fi
    last_checksum=$checksum
done
Then change the find command to use your script:
chmod +x not_uniq.sh
find "$target" -type f -exec md5sum '{}' + | sort | ./not_uniq.sh
This is the basic idea. You will probably need to adjust the find command if your file names contain certain characters (e.g. spaces or newlines).
I'd like to add jdupes, a recent enhanced fork of fdupes, which promises to be faster and more feature-rich than fdupes (e.g. it has a size filter):
jdupes . -rS -X size-:50m > myjdups.txt
This will recursively find duplicated files bigger than 50MB in the current directory and output the resulting list to myjdups.txt.
Note that the output is not sorted by size; since that appears not to be built in, I have adapted @Chris_Down's answer above to achieve this:
jdupes -r . -X size-:50m | {
while IFS= read -r file; do
[[ $file ]] && du "$file"
done
} | sort -n > myjdups_sorted.txt
Wikipedia had an article (http://en.wikipedia.org/wiki/List_of_duplicate_file_finders) with a list of available open source software for this task, but it has now been deleted.
I will add that the GUI version of fslint is very interesting, allowing you to use a mask to select which files to delete. Very useful for cleaning up duplicated photos.
On Linux you can use:
- FSLint: http://www.pixelbeat.org/fslint/
- FDupes: https://en.wikipedia.org/wiki/Fdupes
- DupeGuru: https://www.hardcoded.net/dupeguru/
The last two work on many systems (Windows, Mac and Linux); I have not checked for FSLint.
5
It is better to provide actual information here and not just a link; the link might change and then the answer has no value left.
– Anthon
Jan 29 '14 at 11:22
2
Wikipedia page is empty.
– ihor_dvoretskyi
Sep 10 '15 at 9:01
yes, it has been cleaned, what a pity shake...
– MordicusEtCubitus
Dec 21 '15 at 16:23
I've edited it with these 3 tools
– MordicusEtCubitus
Dec 21 '15 at 16:30
Here's my take on that:
find -type f -size +3M -print0 | while IFS= read -r -d '' i; do
    echo -n '.'
    # Skip files that were already hashed on a previous run.
    if grep -qF "$i" md5-partial.txt; then echo -e "\n$i ---- Already counted, skipping."; continue; fi
    # Hash only the first 1 MB of the file.
    MD5=$(dd bs=1M count=1 if="$i" status=noxfer | md5sum)
    MD5=$(echo "$MD5" | cut -d' ' -f1)
    if grep -F "$MD5" md5-partial.txt; then echo -e "\n$i ---- Possible duplicate"; fi
    echo "$MD5 $i" >> md5-partial.txt
done
It's different in that it only hashes up to the first 1 MB of the file.
This has a few issues / features:
- There might be a difference after the first 1 MB, so the result is rather a candidate to check. I might fix that later.
- Checking by file size first could speed this up.
- Only takes files larger than 3 MB.
I use it to compare video clips so this is enough for me.
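For comparison, here is a single-pass variant of the same idea (a sketch only, not the author's script): compute the partial hash of each file once and let awk group repeated hashes in memory, instead of re-reading a growing text file for every file. It assumes GNU dd, md5sum and awk, and file names without embedded newlines.
find . -type f -size +3M -print0 | while IFS= read -r -d '' f; do
    # Hash only the first 1 MB of each file, as above.
    h=$(dd bs=1M count=1 if="$f" 2>/dev/null | md5sum | cut -d' ' -f1)
    printf '%s  %s\n' "$h" "$f"
done | awk '{
    h = substr($0, 1, 32); f = substr($0, 35)
    if (h in first) {
        # Second or later file with this partial hash: print the whole group.
        if (first[h] != "") { print "---"; print first[h]; first[h] = "" }
        print f
    } else {
        first[h] = f
    }
}'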