Find duplicate files












Is it possible to find duplicate files on my disk which are bit-for-bit identical but have different file names?










files duplicate-files

asked Apr 4 '13 at 13:18 by student
edited 35 mins ago by Jeff Schaller
  • 3  Note that any possible method of doing this will invariably have to compare every single file on your system to every single other file. So this is going to take a long time, even when taking shortcuts. – Shadur, Apr 4 '13 at 14:02

  • 4  @Shadur If one is OK with checksums, it boils down to comparing just the hashes - which on most systems is on the order of 10^(5±1) entries, usually under 64 bytes each. Of course, you have to read the data at least once. :) – peterph, Apr 4 '13 at 14:57

  • 15  @Shadur That's not true. You can reduce the time by checking for matching st_sizes, eliminating sizes that occur only once, and then only calculating md5sums for files with matching st_sizes. – Chris Down, Apr 4 '13 at 16:36

  • 6  @Shadur Even an incredibly silly approach disallowing any hash operations could do this in Θ(n log n) comparisons - not Θ(n²) - using any of several sort algorithms (based on file content). – derobert, Apr 4 '13 at 17:09

  • 1  @ChrisDown Yes, size matching would be one of the shortcuts I had in mind. – Shadur, Apr 4 '13 at 19:38













7 Answers


















101














fdupes can do this. From man fdupes:




Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.




In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.



To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.



As asked in the comments, you can get the largest duplicates by doing the following:



fdupes -r . | {
    while IFS= read -r file; do
        [[ $file ]] && du "$file"
    done
} | sort -n


This will break if your filenames contain newlines.
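If you also want the sizes to be human readable, as asked in the comments below, a minimal variation of the same loop (assuming GNU du and sort, which provide -h and human-numeric sorting) is:

fdupes -r . | {
    while IFS= read -r file; do
        # blank lines separate fdupes groups; skip them
        [[ $file ]] && du -h "$file"
    done
} | sort -h

The same caveat about newlines in filenames applies here.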






answered Apr 4 '13 at 13:24 by Chris Down; edited Aug 14 '17 at 17:38 by genpfault


























  • Thanks. How can I filter out the largest dupe? How can I make the sizes human readable? – student, Apr 5 '13 at 9:31

  • @student: use something along the lines of (make sure fdupes just outputs the filenames with no extra information, or use cut or sed to keep just that): fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in human-readable format and only those with sizes in megabytes or gigabytes. Change the command to suit the real output. – Olivier Dulac, Apr 5 '13 at 12:27

  • 2  @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives. – Chris Down, Apr 5 '13 at 13:13

  • @student - Once you have the filenames, du piped to sort will tell you. – Chris Down, Apr 5 '13 at 13:14

  • @ChrisDown: it's true it's a bad habit and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^. I love those pages you link to, btw (been reading them for a few months; they're full of useful info). – Olivier Dulac, Apr 5 '13 at 14:05



















22














Another good tool is fslint:




fslint is a toolset to find various problems with filesystems,
including duplicate files and problematic filenames
etc.



Individual command line tools are available in addition to the GUI and to access them, one can change to, or add to
$PATH the /usr/share/fslint/fslint directory on a standard install. Each of these commands in that directory have a
--help option which further details its parameters.



   findup - find DUPlicate files



On Debian-based systems, you can install it with:



sudo apt-get install fslint
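Once installed, the duplicate finder can be invoked directly from that directory. A minimal sketch, assuming the standard install path quoted above; /path/to/search is just a placeholder for the directory you want to scan (see findup --help for the exact options):

/usr/share/fslint/fslint/findup /path/to/search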




You can also do this manually if you don't want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:



find / -type f -exec md5sum {} \; > md5sums
gawk '{print $1}' md5sums | sort | uniq -d > dupes
while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes


Sample output (the file names in this example are the same, but it will also work when they are different):



$ while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes 
---
/usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
/usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
---
/usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
/usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
---
/usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
/usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
---


This will be much slower than the dedicated tools already mentioned, but it will work.
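To sketch the speed-up suggested in the comments below (hash only files whose size occurs more than once), something along these lines should work; it assumes GNU find (for -printf), uses a scratch file named sizes, and, like the commands above, breaks on file names containing newlines:

# list every regular file as "<size> <path>" (GNU find)
find / -type f -printf '%s %p\n' > sizes

# second pass: keep only files whose size was seen more than once, then hash just those
awk 'NR==FNR { seen[$1]++; next } seen[$1] > 1 { sub(/^[0-9]+ /, ""); print }' sizes sizes |
    while IFS= read -r f; do md5sum "$f"; done > md5sums

The resulting md5sums file can then be fed to the gawk/uniq/grep steps above unchanged.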






answered Apr 4 '13 at 16:00 by terdon; edited Apr 4 '13 at 16:06





















  • 3  It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size. – Chris Down, Apr 4 '13 at 16:34

  • @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer. – terdon, Apr 4 '13 at 16:37



















8














Short answer: yes.



Longer version: have a look at the Wikipedia fdupes entry, which sports quite a nice list of ready-made solutions. Of course you can write your own; it's not that difficult - standard tools like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.
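For example, one such one-liner (just a sketch, assuming GNU sort and uniq; sha256sum prints a 64-character hash, which is what -w64 groups on) could be:

find . -type f -exec sha256sum {} + | sort | uniq -w64 --all-repeated=separate

It prints only the files that have at least one bit-identical twin, grouped by content and separated by blank lines.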






answered Apr 4 '13 at 13:25 by peterph































    5














If you believe a hash function (here MD5) is collision-free on your domain:

find $target -type f -exec md5sum '{}' + | sort | uniq --all-repeated --check-chars=32 \
    | cut --characters=35-

Want the identical files grouped in the output? Write a simple script not_uniq.sh to format the output:

#!/bin/bash

# reads "md5sum  filename" lines (sorted by checksum) and prints groups of
# duplicates separated by "=======" lines
last_checksum=0
while read -r line; do
    checksum=${line:0:32}
    filename=${line:34}
    if [ "$checksum" == "$last_checksum" ]; then
        # duplicate of the previous line: print the held-back filename once, then this one
        if [ "${last_filename:-0}" != '0' ]; then
            echo "$last_filename"
            unset last_filename
        fi
        echo "$filename"
    else
        # new checksum: print a separator (at the start and after each duplicate group),
        # and hold this filename back in case the next line matches it
        if [ "${last_filename:-0}" == '0' ]; then
            echo "======="
        fi
        last_filename=$filename
    fi

    last_checksum=$checksum
done

Then change the find command to use your script:

chmod +x not_uniq.sh
find $target -type f -exec md5sum '{}' + | sort | ./not_uniq.sh

This is the basic idea. You should probably change the find command if your file names contain certain characters (e.g. spaces).






answered Apr 13 '13 at 15:39 by xin; edited Feb 21 '17 at 18:15 by Wayne Werner

































      3














I'd like to add a recent enhanced fork of fdupes, jdupes, which promises to be faster and more feature-rich than fdupes (e.g. a size filter):

jdupes . -rS -X size-:50m > myjdups.txt

This will recursively find duplicated files bigger than 50MB in the current directory and output the resulting list in myjdups.txt.

Note that the output is not sorted by size and, since this appears not to be built in, I have adapted @Chris_Down's answer above to achieve this:

jdupes -r . -X size-:50m | {
    while IFS= read -r file; do
        [[ $file ]] && du "$file"
    done
} | sort -n > myjdups_sorted.txt





answered Nov 23 '17 at 17:27 by Sebastian Müller































        2














Wikipedia had an article (http://en.wikipedia.org/wiki/List_of_duplicate_file_finders) with a list of available open source software for this task, but it's now been deleted.

I will add that the GUI version of fslint is very interesting, allowing you to use a mask to select which files to delete. Very useful for cleaning duplicated photos.

On Linux you can use:

- FSLint: http://www.pixelbeat.org/fslint/
- FDupes: https://en.wikipedia.org/wiki/Fdupes
- DupeGuru: https://www.hardcoded.net/dupeguru/

The last two work on many systems (Windows, Mac and Linux); I've not checked for FSLint.
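For what it's worth, the first two are in the standard repositories already mentioned in the earlier answers, so on Debian-based systems something like this should be enough:

sudo apt-get install fslint fdupes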



























  • 5  It is better to provide actual information here and not just a link; the link might change and then the answer has no value left. – Anthon, Jan 29 '14 at 11:22

  • 2  Wikipedia page is empty. – ihor_dvoretskyi, Sep 10 '15 at 9:01

  • yes, it has been cleaned, what a pity shake... – MordicusEtCubitus, Dec 21 '15 at 16:23

  • I've edited it with these 3 tools – MordicusEtCubitus, Dec 21 '15 at 16:30



















        0














Here's my take on that:

find -type f -size +3M -print0 | while IFS= read -r -d '' i; do
    echo -n '.'
    # skip files whose path already appears in md5-partial.txt
    if grep -q "$i" md5-partial.txt; then echo -e "\n$i ---- Already counted, skipping."; continue; fi
    # hash only the first 1 MB of the file
    MD5=`dd bs=1M count=1 if="$i" status=noxfer | md5sum`
    MD5=`echo $MD5 | cut -d' ' -f1`
    if grep "$MD5" md5-partial.txt; then echo -e "\n$i ---- Possible duplicate"; fi
    echo "$MD5 $i" >> md5-partial.txt
done

It's different in that it only hashes up to the first 1 MB of each file.

This has a few issues / features:

• There might be a difference after the first 1 MB, so the result is rather a candidate to check. I might fix that later.
• Checking by file size first could speed this up.
• It only takes files larger than 3 MB.

I use it to compare video clips, so this is enough for me.




























          protected by Community Jan 14 '16 at 12:14



          Thank you for your interest in this question.
          Because it has attracted low-quality or spam answers that had to be removed, posting an answer now requires 10 reputation on this site (the association bonus does not count).



          Would you like to answer one of these unanswered questions instead?














          7 Answers
          7






          active

          oldest

          votes








          7 Answers
          7






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          101














          fdupes can do this. From man fdupes:




          Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.




          In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.



          To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.



          As asked in the comments, you can get the largest duplicates by doing the following:



          fdupes -r . | {
          while IFS= read -r file; do
          [[ $file ]] && du "$file"
          done
          } | sort -n


          This will break if your filenames contain newlines.






          share|improve this answer


























          • Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?

            – student
            Apr 5 '13 at 9:31











          • @student: use something along the line of (make sure fdupes just outputs the filenames with no extra informatinos, or cut or sed to just keep that) : fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in Human readable format and only those with size in Megabytes or Gigabytes. Change the command to suit the real outputs.

            – Olivier Dulac
            Apr 5 '13 at 12:27








          • 2





            @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.

            – Chris Down
            Apr 5 '13 at 13:13











          • @student - Once you have the filenames, du piped to sort will tell you.

            – Chris Down
            Apr 5 '13 at 13:14











          • @ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^ . I love those pages you link to, btw (been reading them since a few months, and full of many usefull infos)

            – Olivier Dulac
            Apr 5 '13 at 14:05
















          101














          fdupes can do this. From man fdupes:




          Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.




          In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.



          To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.



          As asked in the comments, you can get the largest duplicates by doing the following:



          fdupes -r . | {
          while IFS= read -r file; do
          [[ $file ]] && du "$file"
          done
          } | sort -n


          This will break if your filenames contain newlines.






          share|improve this answer


























          • Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?

            – student
            Apr 5 '13 at 9:31











          • @student: use something along the line of (make sure fdupes just outputs the filenames with no extra informatinos, or cut or sed to just keep that) : fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in Human readable format and only those with size in Megabytes or Gigabytes. Change the command to suit the real outputs.

            – Olivier Dulac
            Apr 5 '13 at 12:27








          • 2





            @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.

            – Chris Down
            Apr 5 '13 at 13:13











          • @student - Once you have the filenames, du piped to sort will tell you.

            – Chris Down
            Apr 5 '13 at 13:14











          • @ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^ . I love those pages you link to, btw (been reading them since a few months, and full of many usefull infos)

            – Olivier Dulac
            Apr 5 '13 at 14:05














          101












          101








          101







          fdupes can do this. From man fdupes:




          Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.




          In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.



          To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.



          As asked in the comments, you can get the largest duplicates by doing the following:



          fdupes -r . | {
          while IFS= read -r file; do
          [[ $file ]] && du "$file"
          done
          } | sort -n


          This will break if your filenames contain newlines.






          share|improve this answer















          fdupes can do this. From man fdupes:




          Searches the given path for duplicate files. Such files are found by comparing file sizes and MD5 signatures, followed by a byte-by-byte comparison.




          In Debian or Ubuntu, you can install it with apt-get install fdupes. In Fedora/Red Hat/CentOS, you can install it with yum install fdupes. On Arch Linux you can use pacman -S fdupes, and on Gentoo, emerge fdupes.



          To run a check descending from your filesystem root, which will likely take a significant amount of time and memory, use something like fdupes -r /.



          As asked in the comments, you can get the largest duplicates by doing the following:



          fdupes -r . | {
          while IFS= read -r file; do
          [[ $file ]] && du "$file"
          done
          } | sort -n


          This will break if your filenames contain newlines.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Aug 14 '17 at 17:38









          genpfault

          1357




          1357










          answered Apr 4 '13 at 13:24









          Chris DownChris Down

          80.6k14189202




          80.6k14189202













          • Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?

            – student
            Apr 5 '13 at 9:31











          • @student: use something along the line of (make sure fdupes just outputs the filenames with no extra informatinos, or cut or sed to just keep that) : fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in Human readable format and only those with size in Megabytes or Gigabytes. Change the command to suit the real outputs.

            – Olivier Dulac
            Apr 5 '13 at 12:27








          • 2





            @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.

            – Chris Down
            Apr 5 '13 at 13:13











          • @student - Once you have the filenames, du piped to sort will tell you.

            – Chris Down
            Apr 5 '13 at 13:14











          • @ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^ . I love those pages you link to, btw (been reading them since a few months, and full of many usefull infos)

            – Olivier Dulac
            Apr 5 '13 at 14:05



















          • Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?

            – student
            Apr 5 '13 at 9:31











          • @student: use something along the line of (make sure fdupes just outputs the filenames with no extra informatinos, or cut or sed to just keep that) : fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in Human readable format and only those with size in Megabytes or Gigabytes. Change the command to suit the real outputs.

            – Olivier Dulac
            Apr 5 '13 at 12:27








          • 2





            @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.

            – Chris Down
            Apr 5 '13 at 13:13











          • @student - Once you have the filenames, du piped to sort will tell you.

            – Chris Down
            Apr 5 '13 at 13:14











          • @ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^ . I love those pages you link to, btw (been reading them since a few months, and full of many usefull infos)

            – Olivier Dulac
            Apr 5 '13 at 14:05

















          Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?

          – student
          Apr 5 '13 at 9:31





          Thanks. How can I filter out the largest dupe? How can I make the sizes human readable?

          – student
          Apr 5 '13 at 9:31













          @student: use something along the line of (make sure fdupes just outputs the filenames with no extra informatinos, or cut or sed to just keep that) : fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in Human readable format and only those with size in Megabytes or Gigabytes. Change the command to suit the real outputs.

          – Olivier Dulac
          Apr 5 '13 at 12:27







          @student: use something along the line of (make sure fdupes just outputs the filenames with no extra informatinos, or cut or sed to just keep that) : fdupes ....... | xargs ls -alhd | egrep 'M |G ' to keep files in Human readable format and only those with size in Megabytes or Gigabytes. Change the command to suit the real outputs.

          – Olivier Dulac
          Apr 5 '13 at 12:27






          2




          2





          @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.

          – Chris Down
          Apr 5 '13 at 13:13





          @OlivierDulac You should never parse ls. Usually it's worse than your use case, but even in your use case, you risk false positives.

          – Chris Down
          Apr 5 '13 at 13:13













          @student - Once you have the filenames, du piped to sort will tell you.

          – Chris Down
          Apr 5 '13 at 13:14





          @student - Once you have the filenames, du piped to sort will tell you.

          – Chris Down
          Apr 5 '13 at 13:14













          @ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^ . I love those pages you link to, btw (been reading them since a few months, and full of many usefull infos)

          – Olivier Dulac
          Apr 5 '13 at 14:05





          @ChrisDown: it's true it's a bad habit, and can give false positives. But in that case (interactive use, and for display only, no "rm" or anything of the sort directly relying on it) it's fine and quick ^^ . I love those pages you link to, btw (been reading them since a few months, and full of many usefull infos)

          – Olivier Dulac
          Apr 5 '13 at 14:05













          22














          Another good tool is fslint:




          fslint is a toolset to find various problems with filesystems,
          including duplicate files and problematic filenames
          etc.



          Individual command line tools are available in addition to the GUI and to access them, one can change to, or add to
          $PATH the /usr/share/fslint/fslint directory on a standard install. Each of these commands in that directory have a
          --help option which further details its parameters.



             findup - find DUPlicate files



          On debian-based systems, youcan install it with:



          sudo apt-get install fslint




          You can also do this manually if you don't want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:



          find / -type f -exec md5sum {} ; > md5sums
          gawk '{print $1}' md5sums | sort | uniq -d > dupes
          while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes


          Sample output (the file names in this example are the same, but it will also work when they are different):



          $ while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes 
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
          /usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
          ---


          This will be much slower than the dedicated tools already mentioned, but it will work.






          share|improve this answer





















          • 3





            It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.

            – Chris Down
            Apr 4 '13 at 16:34













          • @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

            – terdon
            Apr 4 '13 at 16:37
















          22














          Another good tool is fslint:




          fslint is a toolset to find various problems with filesystems,
          including duplicate files and problematic filenames
          etc.



          Individual command line tools are available in addition to the GUI and to access them, one can change to, or add to
          $PATH the /usr/share/fslint/fslint directory on a standard install. Each of these commands in that directory have a
          --help option which further details its parameters.



             findup - find DUPlicate files



          On debian-based systems, youcan install it with:



          sudo apt-get install fslint




          You can also do this manually if you don't want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:



          find / -type f -exec md5sum {} ; > md5sums
          gawk '{print $1}' md5sums | sort | uniq -d > dupes
          while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes


          Sample output (the file names in this example are the same, but it will also work when they are different):



          $ while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes 
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
          /usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
          ---


          This will be much slower than the dedicated tools already mentioned, but it will work.






          share|improve this answer





















          • 3





            It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.

            – Chris Down
            Apr 4 '13 at 16:34













          • @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

            – terdon
            Apr 4 '13 at 16:37














          22












          22








          22







          Another good tool is fslint:




          fslint is a toolset to find various problems with filesystems,
          including duplicate files and problematic filenames
          etc.



          Individual command line tools are available in addition to the GUI and to access them, one can change to, or add to
          $PATH the /usr/share/fslint/fslint directory on a standard install. Each of these commands in that directory have a
          --help option which further details its parameters.



             findup - find DUPlicate files



          On debian-based systems, youcan install it with:



          sudo apt-get install fslint




          You can also do this manually if you don't want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:



          find / -type f -exec md5sum {} ; > md5sums
          gawk '{print $1}' md5sums | sort | uniq -d > dupes
          while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes


          Sample output (the file names in this example are the same, but it will also work when they are different):



          $ while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes 
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
          /usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
          ---


          This will be much slower than the dedicated tools already mentioned, but it will work.






          share|improve this answer















          Another good tool is fslint:




          fslint is a toolset to find various problems with filesystems,
          including duplicate files and problematic filenames
          etc.



          Individual command line tools are available in addition to the GUI and to access them, one can change to, or add to
          $PATH the /usr/share/fslint/fslint directory on a standard install. Each of these commands in that directory have a
          --help option which further details its parameters.



             findup - find DUPlicate files



          On debian-based systems, youcan install it with:



          sudo apt-get install fslint




          You can also do this manually if you don't want to or cannot install third party tools. The way most such programs work is by calculating file checksums. Files with the same md5sum almost certainly contain exactly the same data. So, you could do something like this:



          find / -type f -exec md5sum {} ; > md5sums
          gawk '{print $1}' md5sums | sort | uniq -d > dupes
          while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes


          Sample output (the file names in this example are the same, but it will also work when they are different):



          $ while read d; do echo "---"; grep $d md5sums | cut -d ' ' -f 2-; done < dupes 
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/if_bonding.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/if_bonding.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/linux/route.h
          /usr/src/linux-headers-3.2.0-4-common/include/linux/route.h
          ---
          /usr/src/linux-headers-3.2.0-3-common/include/drm/Kbuild
          /usr/src/linux-headers-3.2.0-4-common/include/drm/Kbuild
          ---


          This will be much slower than the dedicated tools already mentioned, but it will work.







          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Apr 4 '13 at 16:06

























          answered Apr 4 '13 at 16:00









          terdonterdon

          131k32258436




          131k32258436








          • 3





            It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.

            – Chris Down
            Apr 4 '13 at 16:34













          • @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

            – terdon
            Apr 4 '13 at 16:37














          • 3





            It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.

            – Chris Down
            Apr 4 '13 at 16:34













          • @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

            – terdon
            Apr 4 '13 at 16:37








          3




          3





          It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.

          – Chris Down
          Apr 4 '13 at 16:34







          It would be much, much faster to find any files with the same size as another file using st_size, eliminating any that only have one file of this size, and then calculating md5sums only between files with the same st_size.

          – Chris Down
          Apr 4 '13 at 16:34















          @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

          – terdon
          Apr 4 '13 at 16:37





          @ChrisDown yeah, just wanted to keep it simple. What you suggest will greatly speed things up of course. That's why I have the disclaimer about it being slow at the end of my answer.

          – terdon
          Apr 4 '13 at 16:37











          8














          Short answer: yes.



          Longer version: have a look at the wikipedia fdupes entry, it sports quite nice list of ready made solutions. Of course you can write your own, it's not that difficult - hashing programs like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.






          share|improve this answer




























            8














            Short answer: yes.



            Longer version: have a look at the wikipedia fdupes entry, it sports quite nice list of ready made solutions. Of course you can write your own, it's not that difficult - hashing programs like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.






            share|improve this answer


























              8












              8








              8







              Short answer: yes.



              Longer version: have a look at the wikipedia fdupes entry, it sports quite nice list of ready made solutions. Of course you can write your own, it's not that difficult - hashing programs like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.






              share|improve this answer













              Short answer: yes.



              Longer version: have a look at the wikipedia fdupes entry, it sports quite nice list of ready made solutions. Of course you can write your own, it's not that difficult - hashing programs like diff, sha*sum, find, sort and uniq should do the job. You can even put it on one line, and it will still be understandable.







              share|improve this answer












              share|improve this answer



              share|improve this answer










              answered Apr 4 '13 at 13:25









              peterphpeterph

              23.7k24457




              23.7k24457























                  5














                  If you believe a hash function (here MD5) is collision-free on your domain:



                  find $target -type f -exec md5sum '{}' + | sort | uniq --all-repeated --check-chars=32 
                  | cut --characters=35-


                  Want identical file names grouped? Write a simple script not_uniq.sh to format output:



                  #!/bin/bash

                  last_checksum=0
                  while read line; do
                  checksum=${line:0:32}
                  filename=${line:34}
                  if [ $checksum == $last_checksum ]; then
                  if [ ${last_filename:-0} != '0' ]; then
                  echo $last_filename
                  unset last_filename
                  fi
                  echo $filename
                  else
                  if [ ${last_filename:-0} == '0' ]; then
                  echo "======="
                  fi
                  last_filename=$filename
                  fi

                  last_checksum=$checksum
                  done


                  Then change find command to use your script:



                  chmod +x not_uniq.sh
                  find $target -type f -exec md5sum '{}' + | sort | not_uniq.sh


                  This is basic idea. Probably you should change find if your file names containing some characters. (e.g space)






                  share|improve this answer






























                    5














                    If you believe a hash function (here MD5) is collision-free on your domain:



                    find $target -type f -exec md5sum '{}' + | sort | uniq --all-repeated --check-chars=32 
                    | cut --characters=35-


                    Want identical file names grouped? Write a simple script not_uniq.sh to format output:



                    #!/bin/bash

                    last_checksum=0
                    while read line; do
                    checksum=${line:0:32}
                    filename=${line:34}
                    if [ $checksum == $last_checksum ]; then
                    if [ ${last_filename:-0} != '0' ]; then
                    echo $last_filename
                    unset last_filename
                    fi
                    echo $filename
                    else
                    if [ ${last_filename:-0} == '0' ]; then
                    echo "======="
                    fi
                    last_filename=$filename
                    fi

                    last_checksum=$checksum
                    done


                    Then change find command to use your script:



                    chmod +x not_uniq.sh
                    find $target -type f -exec md5sum '{}' + | sort | not_uniq.sh


                    This is basic idea. Probably you should change find if your file names containing some characters. (e.g space)






                    share|improve this answer




























                      5












                      5








                      5







                      If you believe a hash function (here MD5) is collision-free on your domain:



                      find $target -type f -exec md5sum '{}' + | sort | uniq --all-repeated --check-chars=32 
                      | cut --characters=35-


                      Want identical file names grouped? Write a simple script not_uniq.sh to format output:



                      #!/bin/bash

                      last_checksum=0
                      while read line; do
                      checksum=${line:0:32}
                      filename=${line:34}
                      if [ $checksum == $last_checksum ]; then
                      if [ ${last_filename:-0} != '0' ]; then
                      echo $last_filename
                      unset last_filename
                      fi
                      echo $filename
                      else
                      if [ ${last_filename:-0} == '0' ]; then
                      echo "======="
                      fi
                      last_filename=$filename
                      fi

                      last_checksum=$checksum
                      done


                      Then change find command to use your script:



                      chmod +x not_uniq.sh
                      find $target -type f -exec md5sum '{}' + | sort | not_uniq.sh


                      This is basic idea. Probably you should change find if your file names containing some characters. (e.g space)






                      share|improve this answer















                      If you believe a hash function (here MD5) is collision-free on your domain:



                      find $target -type f -exec md5sum '{}' + | sort | uniq --all-repeated --check-chars=32 
                      | cut --characters=35-


                      Want identical file names grouped? Write a simple script not_uniq.sh to format output:



                      #!/bin/bash

                      last_checksum=0
                      while read line; do
                      checksum=${line:0:32}
                      filename=${line:34}
                      if [ $checksum == $last_checksum ]; then
                      if [ ${last_filename:-0} != '0' ]; then
                      echo $last_filename
                      unset last_filename
                      fi
                      echo $filename
                      else
                      if [ ${last_filename:-0} == '0' ]; then
                      echo "======="
                      fi
                      last_filename=$filename
                      fi

                      last_checksum=$checksum
                      done


                      Then change find command to use your script:



                      chmod +x not_uniq.sh
                      find $target -type f -exec md5sum '{}' + | sort | not_uniq.sh


                      This is basic idea. Probably you should change find if your file names containing some characters. (e.g space)







                      share|improve this answer














                      share|improve this answer



                      share|improve this answer








                      edited Feb 21 '17 at 18:15









                      Wayne Werner

                      6,26851736




                      6,26851736










                      answered Apr 13 '13 at 15:39









                      xinxin

                      29929




                      29929























                          3














                          I thought to add a recent enhanced fork of fdupes, jdupes, which promises to be faster and more feature rich than fdupes (e.g. size filter):



                          jdupes . -rS -X size-:50m > myjdups.txt


                          This will recursively find duplicated files bigger than 50MB in the current directory and output the resulted list in myjdups.txt.



                          Note, the output is not sorted by size and since it appears not to be build in, I have adapted @Chris_Down answer above to achieve this:



                          jdupes -r . -X size-:50m | {
                          while IFS= read -r file; do
                          [[ $file ]] && du "$file"
                          done
                          } | sort -n > myjdups_sorted.txt





                          share|improve this answer




























                            3














                            I thought to add a recent enhanced fork of fdupes, jdupes, which promises to be faster and more feature rich than fdupes (e.g. size filter):



                            jdupes . -rS -X size-:50m > myjdups.txt


                            This will recursively find duplicated files bigger than 50MB in the current directory and output the resulted list in myjdups.txt.



                            Note, the output is not sorted by size and since it appears not to be build in, I have adapted @Chris_Down answer above to achieve this:



                            jdupes -r . -X size-:50m | {
                            while IFS= read -r file; do
                            [[ $file ]] && du "$file"
                            done
                            } | sort -n > myjdups_sorted.txt





                            share|improve this answer


























                              3












                              3








                              3







                              I thought to add a recent enhanced fork of fdupes, jdupes, which promises to be faster and more feature rich than fdupes (e.g. size filter):



                              jdupes . -rS -X size-:50m > myjdups.txt


                              This will recursively find duplicated files bigger than 50MB in the current directory and output the resulted list in myjdups.txt.



                              Note, the output is not sorted by size and since it appears not to be build in, I have adapted @Chris_Down answer above to achieve this:



                              jdupes -r . -X size-:50m | {
                              while IFS= read -r file; do
                              [[ $file ]] && du "$file"
                              done
                              } | sort -n > myjdups_sorted.txt





                              share|improve this answer













                              I thought to add a recent enhanced fork of fdupes, jdupes, which promises to be faster and more feature rich than fdupes (e.g. size filter):



                              jdupes . -rS -X size-:50m > myjdups.txt


                              This will recursively find duplicated files bigger than 50MB in the current directory and output the resulted list in myjdups.txt.



                              Note, the output is not sorted by size and since it appears not to be build in, I have adapted @Chris_Down answer above to achieve this:



                              jdupes -r . -X size-:50m | {
                              while IFS= read -r file; do
                              [[ $file ]] && du "$file"
                              done
                              } | sort -n > myjdups_sorted.txt






                              share|improve this answer












                              share|improve this answer



                              share|improve this answer










                              answered Nov 23 '17 at 17:27









                              Sebastian MüllerSebastian Müller

                              1714




                              1714























                                  2














                                  Wikipedia had an article (http://en.wikipedia.org/wiki/List_of_duplicate_file_finders), with a list of available open source software for this task, but it's now been deleted.



                                  I will add that the GUI version of fslint is very interesting, allowing to use mask to select which files to delete. Very useful to clean duplicated photos.



                                  On Linux you can use:



                                  - FSLint: http://www.pixelbeat.org/fslint/

                                  - FDupes: https://en.wikipedia.org/wiki/Fdupes

                                  - DupeGuru: https://www.hardcoded.net/dupeguru/


                                  The 2 last work on many systems (windows, mac and linux) I 've not checked for FSLint






edited Jul 3 '17 at 10:09 – Stéphane Chazelas
answered Jan 29 '14 at 11:01 – MordicusEtCubitus
• 5

  It is better to provide actual information here and not just a link; the link might change and then the answer has no value left.

  – Anthon
  Jan 29 '14 at 11:22

• 2

  Wikipedia page is empty.

  – ihor_dvoretskyi
  Sep 10 '15 at 9:01

• Yes, it has been cleaned, what a pity...

  – MordicusEtCubitus
  Dec 21 '15 at 16:23

• I've edited it with these 3 tools.

  – MordicusEtCubitus
  Dec 21 '15 at 16:30














                                  0














Here's my take on that:

touch md5-partial.txt    # make sure the cache file exists before the first grep
find . -type f -size +3M -print0 | while IFS= read -r -d '' i; do
    echo -n '.'
    # skip files whose path is already in the cache (match it as a fixed string)
    if grep -qF -- "$i" md5-partial.txt; then echo -e "\n$i ---- Already counted, skipping."; continue; fi
    # hash only the first 1 MB of the file; silence dd's record statistics
    MD5=$(dd bs=1M count=1 if="$i" status=noxfer 2>/dev/null | md5sum | cut -d' ' -f1)
    if grep -F "$MD5" md5-partial.txt; then echo -e "\n$i ---- Possible duplicate"; fi
    echo "$MD5 $i" >> md5-partial.txt
done

It's different in that it only hashes up to the first 1 MB of each file.

This has a few issues / features:

• There might be a difference after the first 1 MB, so the result is only a candidate to check. I might fix that later.

• Checking by file size first could speed this up (see the sketch below).

• It only takes files larger than 3 MB.

I use it to compare video clips, so this is enough for me.
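
A minimal sketch of that size-first idea (not part of the script above; it assumes GNU find/sort/uniq and paths without newlines or tabs, and sizes.txt, dup-sizes.txt and candidate-hashes.txt are just example file names): only files whose size occurs more than once get hashed at all.

# list every regular file as "<size><TAB><path>", sorted by size
find . -type f -printf '%s\t%p\n' | sort -n > sizes.txt

# sizes that appear more than once are the only hashing candidates
cut -f1 sizes.txt | uniq -d > dup-sizes.txt

# hash just those candidates
awk -F'\t' 'NR==FNR {dup[$1]; next} $1 in dup {print $2}' dup-sizes.txt sizes.txt |
while IFS= read -r f; do
    md5sum -- "$f"
done | sort > candidate-hashes.txt

Hashes that occur more than once in candidate-hashes.txt then point at likely duplicates; with GNU uniq you can pull them out with uniq -w32 --all-repeated=separate candidate-hashes.txt.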






answered Jun 2 '17 at 1:50 – Ondra Žižka



















