Is there an easy way to replace duplicate files with hardlinks?












129















I'm looking for an easy way (a command or series of commands, probably involving find) to find duplicate files in two directories, and replace the files in one directory with hardlinks of the files in the other directory.



Here's the situation: This is a file server which multiple people store audio files on, each user having their own folder. Sometimes multiple people have copies of the exact same audio files. Right now, these are duplicates. I'd like to make it so they're hardlinks, to save hard drive space.










files hard-link deduplication duplicate-files






edited 38 mins ago by Jeff Schaller
asked Oct 12 '10 at 19:23 by Josh








  • 20





    One problem you may run into with hardlinks is that if somebody decides to do something to one of their music files that you've hard-linked, they could inadvertently affect other people's access to their music.

    – Steven D
    Oct 13 '10 at 2:48






  • 4





    Another problem is that two different files containing "Some Really Great Tune", even if taken from the same source with the same encoder, will very likely not be bit-for-bit identical.

    – msw
    Oct 13 '10 at 2:57








  • 3





    A better solution might be to have a public music folder...

    – Stefan
    Oct 13 '10 at 7:08






  • 3





    related: superuser.com/questions/140819/ways-to-deduplicate-files

    – David Cary
    Mar 16 '11 at 23:59






  • 1





    @tante: Using symlinks solves no problem. When a user "deletes" a file, the number of links to it gets decremented; when the count reaches zero, the file gets really deleted, that's all. So deletion is no problem with hardlinked files; the only problem is a user trying to edit the file (improbable indeed) or to overwrite it (quite possible if logged in).

    – maaartinus
    Mar 14 '12 at 3:56














18 Answers
41














There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:




Traverse all directories named on the command line, compute MD5 checksums and find files with identical MD5. If they are equal, do a real comparison; if they are really equal, replace the second of two files with a hard link to the first one.
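For reference, a minimal invocation sketch; the download step and the two directory paths are illustrative assumptions, since the script simply takes the directories to scan as command-line arguments, as described above:

# Fetch the script, then pass the directories to deduplicate on the command line
# (the /srv/audio/... paths are placeholders)
wget http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl
perl trimtrees.pl /srv/audio/alice /srv/audio/bob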







answered Oct 12 '10 at 20:04 by fschmitt
























  • Sounds perfect, thanks!! I'll try it and accept if it works as described!

    – Josh
    Oct 12 '10 at 20:09






  • 3





    This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to go, since I did find that the files had slight differences so only a few could be hardlinked.

    – Josh
    Dec 8 '10 at 20:13






  • 10





    Upvoted this, but after researching some more, I kind of wish I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

    – oligofren
    Jan 3 '15 at 13:42













  • @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This script seems to be the only thing that handles that.

    – phunehehe
    Jun 26 '15 at 6:59






  • 3





    Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

    – Charles Duffy
    Feb 1 '16 at 16:56





















73














rdfind does exactly what you ask for (and in the order johny why lists). It makes it possible to delete duplicates, or to replace them with either soft or hard links. When using symlinks, you can also make the symlink either absolute or relative. You can even pick the checksum algorithm (md5 or sha1).



Since it is compiled it is faster than most scripted solutions: time on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this



9.99s user 3.61s system 66% cpu 20.543 total


(using md5).



Available in most package handlers (e.g. MacPorts for Mac OS X).
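A usage sketch based on the -dryrun true and -makehardlinks true options mentioned in the comments below; the directory path is a placeholder:

rdfind -dryrun true /srv/audio          # preview only: reports what would be done
rdfind -makehardlinks true /srv/audio   # replace duplicates with hard links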






answered Jul 5 '13 at 8:15 by d-b, edited Jul 5 '13 at 9:22 by Tobias Kienzler





















  • 10





    +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

    – Daniel Trebbien
    Dec 29 '13 at 20:49











  • oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

    – oligofren
    Jan 3 '15 at 13:38











  • Very smart and fast algorithm.

    – ndemou
    Oct 30 '15 at 12:53






  • 1





    I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

    – cdhowie
    May 31 '18 at 21:19



















49














Use the fdupes tool:



fdupes -r /path/to/folder gives you a list of duplicates in the directory (-r makes it recursive). The output looks like this:





filename1

filename2



filename3

filename4

filename5





with filename1 and filename2 being identical and filename3, filename4 and filename5 also being identical.
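As the comments below note, some fdupes builds also have a -L option that hardlinks the duplicates it finds (it is missing from certain packaged versions); a sketch, with the path a placeholder:

fdupes -r -L /path/to/folder   # recurse and hardlink duplicates, if your build supports -L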






answered Oct 12 '10 at 20:03 by tante



















  • 1





    Ubuntu Note: As of September 2013, it hasn't had a stable release (it is on 1.50-PR2-3), so the update doesn't appear in ubuntu yet.

    – Stuart Axon
    Aug 28 '13 at 14:19






  • 11





    I just tried installing fdupes_1.50-PR2-4 on both Ubuntu and Debian, neither has the -L flag. Luckily building from github.com/tobiasschulz/fdupes was super easy.

    – neu242
    Aug 30 '13 at 15:07






  • 3





    Try rdfind - like fdupes, but faster and available on OS X and Cygwin as well.

    – oligofren
    Jan 3 '15 at 13:43











  • Or if you just require Linux compatibility, install rmlint which is blazingly fast, and has lots of nice options. Truly a modern alternative.

    – oligofren
    Jan 3 '15 at 14:28






  • 3





    fdupes seems to only find duplicates, not replace them with hardlinks, so not an answer to the question IMO.

    – Calimo
    Nov 8 '17 at 15:58



















22














I use hardlink from http://jak-linux.org/projects/hardlink/
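A minimal sketch, assuming the tool is invoked with the directories to scan as arguments (the path is a placeholder; check hardlink --help for the dry-run and verbosity flags in your build):

hardlink /srv/audio   # scan the tree and replace identical files with hard links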

























  • 1





    Nice hint, I am using on a regular basis code.google.com/p/hardlinkpy but this was not updated for a while...

    – meduz
    Apr 11 '12 at 19:09






  • 2





    This appears to be similar to the original hardlink on Fedora/RHEL/etc.

    – Jack Douglas
    Jun 21 '12 at 8:43






  • 1





    hardlink is now a native binary in many Linux package systems (since ~2014) and extremely fast. For 1,2M files (320GB), it just took 200 seconds (linking roughly 10% of the files).

    – Marcel Waldvogel
    Feb 5 '17 at 19:13











  • FWIW, the above hardlink was created by Julian Andres Klode while the Fedora hardlink was created by Jakub Jelinek (source: pagure.io/hardlink - Fedora package name: hardlink)

    – maxschlepzig
    Jan 4 at 17:52



















18














This is one of the functions provided by "fslint" --
http://en.flossmanuals.net/FSlint/Introduction



Click the "Merge" button:



Screenshot
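For the command-line route, a sketch based on the findup invocation given in the comments below (Ubuntu package layout assumed; -m merges duplicates using hard links):

sudo apt-get install fslint
/usr/share/fslint/fslint/findup -m /your/directory/tree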



























  • 4





    The -m will hardlink duplicates together, -d will delete all but one, and -t will dry run, printing what it would do

    – Azendale
    Oct 29 '12 at 5:57






  • 1





    On Ubuntu here is what to do: sudo apt-get install fslint /usr/share/fslint/fslint/findup -m /your/directory/tree (directory /usr/share/fslint/fslint/ is not in $PATH by default)

    – Jocelyn
    Sep 8 '13 at 15:38





















14














Since your main target is to save disk space, there is another solution: de-duplication (and probably compression) on file system level. Compared with the hard-link solution, it does not have the problem of inadvertently affecting other linked files.



ZFS has had dedup (block-level, not file-level) since pool version 23, and it has had compression for a long time.
If you are using Linux, you may try zfs-fuse, or if you use BSD, it is natively supported.
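For illustration, deduplication and compression are per-dataset properties in ZFS; a hedged sketch, with the pool/dataset name a placeholder (note the RAM caveats discussed in the comments below):

zfs set dedup=on tank/audio         # enable block-level deduplication for the dataset
zfs set compression=on tank/audio   # enable compression as well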






























  • This is probably the way I'll go eventually, however, does BSD's ZFS implementation do dedup? I thought it did not.

    – Josh
    Dec 8 '10 at 20:14











  • In addition, the HAMMER filesystem on DragonFlyBSD has deduplication support.

    – hhaamu
    Jul 15 '12 at 17:48








  • 11





    ZFS dedup is the friend of nobody. Where ZFS recommends 1Gb ram per 1Tb usable disk space, you're friggin' nuts if you try to use dedup with less than 32Gb ram per 1Tb usable disk space. That means that for a 1Tb mirror, if you don't have 32 Gb ram, you are likely to encounter memory bomb conditions sooner or later that will halt the machine due to lack of ram. Been there, done that, still recovering from the PTSD.

    – killermist
    Sep 22 '14 at 18:51






  • 3





    To avoid the excessive RAM requirements with online deduplication (i.e., check on every write), btrfs uses batch or offline deduplication (run it whenever you consider it useful/necessary) btrfs.wiki.kernel.org/index.php/Deduplication

    – Marcel Waldvogel
    Feb 5 '17 at 19:18






  • 2





    Update seven years later: I eventually did move to ZFS and tried deduplication -- I found that its RAM requirements were indeed just far too high. Crafty use of ZFS snapshots provided the solution I ended up using. (Copy one user's music, snapshot and clone, copy the second user's music into the clone using rsync --inplace so only changed blocks are stored)

    – Josh
    Sep 13 '17 at 13:54



















7














On modern Linux these days there's https://github.com/g2p/bedup which de-duplicates on a btrfs filesystem, but 1) without as much of the scan overhead, 2) files can diverge easily again afterwards.






























  • Background and more information is listed on btrfs.wiki.kernel.org/index.php/Deduplication (including reference to cp --reflink, see also below)

    – Marcel Waldvogel
    Feb 5 '17 at 19:22



















5














To find duplicate files you can use duff.




Duff is a Unix command-line utility
for quickly finding duplicates in a
given set of files.




Simply run:



duff -r target-folder


To create hardlinks to those files automatically, you will need to parse the output of duff with bash or some other scripting language.
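As a hedged sketch of that parsing step, assuming duff's default output of a header line per cluster followed by the member file names (test with echo in place of ln first):

duff -r target-folder | while IFS= read -r line; do
    case "$line" in
        [0-9]*' files in cluster '*) first="" ;;   # cluster header: reset the reference file
        *) if [ -z "$first" ]; then
               first="$line"                       # keep the first member as the link target
           else
               ln -f "$first" "$line"              # hard-link later members to it
           fi ;;
    esac
done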






























  • Really slow though -- see rdfind.pauldreik.se/#g0.6

    – ndemou
    Oct 30 '15 at 12:52



















5














aptitude show hardlink


Description: Hardlinks multiple copies of the same file
Hardlink is a tool which detects multiple copies of the same file and replaces them with hardlinks.



The idea has been taken from http://code.google.com/p/hardlinkpy/, but the code has been written from scratch and licensed under the MIT license.
Homepage: http://jak-linux.org/projects/hardlink/
































  • The only program mentioned here available for Gentoo without unmasking and with hardlink support, thanks!

    – Jorrit Schippers
    Mar 9 '15 at 13:48



















4














I've used many of the hardlinking tools for Linux mentioned here.
I too am stuck with ext4 fs, on Ubuntu, and have been using its cp -l and -s for hard/soft linking. But lately I noticed the lightweight copy option in the cp man page, which suggests it would spare the redundant disk space until one side gets modified:



   --reflink[=WHEN]
          control clone/CoW copies. See below

   When --reflink[=always] is specified, perform a lightweight copy, where the
   data blocks are copied only when modified. If this is not possible the
   copy fails, or if --reflink=auto is specified, fall back to a standard copy.
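A usage sketch (the file names are placeholders; as the comments below point out, reflinks need a copy-on-write filesystem such as btrfs, and --reflink=auto silently falls back to a normal copy elsewhere):

cp --reflink=auto original.flac copy.flac   # CoW clone where supported, ordinary copy otherwise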





























  • I think I will update my cp alias to always include the --reflink=auto parameter now

    – Marcos
    Mar 14 '12 at 14:08






  • 1





    Does ext4 really support --reflink?

    – Jack Douglas
    Jun 21 '12 at 8:42






  • 7





    This is supported on btrfs and OCFS2. It is only possible on copy-on-write filesystems, which ext4 is not. btrfs is really shaping up. I love using it because of reflink and snapshots, makes you less scared to do mass operations on big trees of files.

    – clacke
    Jul 3 '12 at 18:57



















3














It seems to me that checking the filename first could speed things up. If two files don't have the same filename, then in many cases I would not consider them to be duplicates. It seems that the quickest method would be to compare, in order:




  • filename

  • size

  • md5 checksum

  • byte contents


Do any methods do this? Look at duff, fdupes, rmlint, fslint, etc.



The following method was top-voted on commandlinefu.com: Find Duplicate Files (based on size first, then MD5 hash)



Can filename comparison be added as a first step, size as a second step?



find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |
xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum |
sort | uniq -w32 --all-repeated=separate
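One hedged way to sketch that filename-plus-size first pass (GNU find assumed; it breaks on file names containing newlines or glob characters, so it is only an illustration):

# List size and basename, keep combinations that occur more than once,
# then checksum only those candidate files.
find . -not -empty -type f -printf '%s %f\n' | sort | uniq -d |
  while read -r size name; do
      find . -type f -size "${size}c" -name "$name" -exec md5sum {} +
  done | sort | uniq -w32 --all-repeated=separate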


























  • 2





    I've used duff, fdupes and rmlint, and strongly recommend readers to look at the third of these. It has an excellent option set (and documentation). With it, I was able to avoid a lot of the post-processing I needed to use with the other tools.

    – dubiousjim
    Sep 2 '15 at 6:32








  • 2





    In my practice, filename is the least reliable factor to look at, and I've completely removed it from any efforts I make at de-duping. How many install.sh files can be found on an active system? I can't count the number of times I've saved a file and had a name clash, with some on-the-fly renaming to save it. Flip side: no idea how many times I've downloaded something from different sources, on different days, only to find they are the same file with different names. (Which also kills the timestamp reliability.) 1: Size, 2: Digest, 3: Byte contents.

    – Gypsy Spellweaver
    Jan 28 '17 at 6:40











  • @GypsySpellweaver: (1) depends on personal use-case, wouldn't you agree? In my case, i have multiple restores from multiple backups, where files with same name and content exist in different restore-folders. (2) Your comment seems to assume comparing filename only. I was not suggesting to eliminate other checks.

    – johny why
    Mar 8 '17 at 21:50



















2














I made a Perl script that does something similar to what you're talking about:



http://pastebin.com/U7mFHZU7



Basically, it just traverses a directory, calculating the SHA1sum of the files in it, hashing it and linking matches together. It's come in handy on many, many occasions.

























  • 2





    I hope to get around to trying this soon... why not upload it on CPAN... App::relink or something

    – xenoterracide
    Feb 7 '11 at 11:12






  • 1





    @xenoterracide: because of all the similar and more mature solutions that already exist. see the other answers, especially rdfind.

    – oligofren
    Jan 3 '15 at 13:36








  • 1





    @oligofren I don't doubt better solutions exist. TMTOWTDI I guess.

    – amphetamachine
    Jan 5 '15 at 15:49



















2














Since I'm not a fan of Perl, here's a bash version:



#!/bin/bash

DIR="/path/to/big/files"

find "$DIR" -type f -exec md5sum {} \; | sort > /tmp/sums-sorted.txt

OLDSUM=""
IFS=$'\n'
for i in `cat /tmp/sums-sorted.txt`; do
    NEWSUM=`echo "$i" | sed 's/ .*//'`
    NEWFILE=`echo "$i" | sed 's/^[^ ]* *//'`
    if [ "$OLDSUM" == "$NEWSUM" ]; then
        echo ln -f "$OLDFILE" "$NEWFILE"
    else
        OLDSUM="$NEWSUM"
        OLDFILE="$NEWFILE"
    fi
done


This finds all files with the same checksum (whether they're big, small, or already hardlinks), and hardlinks them together.



This can be greatly optimized for repeated runs with additional find flags (eg. size) and a file cache (so you don't have to redo the checksums each time). If anyone's interested in the smarter, longer version, I can post it.



NOTE: As has been mentioned before, hardlinks work as long as the files never need modification, or to be moved across filesystems.
































  • How can I change your script, so that instead of hardlinking, it will just delete the duplicate files and add an entry to a CSV file mapping deleted file -> linked file???

    – MR.GEWA
    Jan 12 '13 at 12:17













  • Sure. The hard link line: echo ln -f "$OLDFILE" "$NEWFILE" just replaces the duplicate file with a hard link, so you could change it to rm the $NEWFILE instead.

    – seren
    Jan 13 '13 at 4:15













  • and how, on the next line, to write $OLDFILE -> $NEWFILE to some text file somehow???

    – MR.GEWA
    Jan 13 '13 at 13:12













  • Ahh, right. Yes, add a line after the rm such as: echo "$NEWFILE" >> /var/log/deleted_duplicate_files.log

    – seren
    Jan 14 '13 at 19:28








  • 1





    Don't friggin reinvent the wheel. There are more mature solutions available, like rdfind, that works at native speeds and just requires brew install rdfind or apt-get install rdfind to get installed.

    – oligofren
    Jan 3 '15 at 13:46



















2














If you want to replace duplicates with hard links on Mac or any UNIX-based system, you can try SmartDupe (http://sourceforge.net/projects/smartdupe/), which I am developing.

























  • 3





    Can you expand on how “smart” it is?

    – Stéphane Gimenez
    Nov 4 '12 at 13:25






  • 1





    How can I compare files of two different directories?

    – Burcardo
    May 31 '16 at 8:26



















1














The application FSlint (http://www.pixelbeat.org/fslint/) can find all equal files in any folder (by content) and create hardlinks. Give it a try!



Jorge Sampaio






























  • It hangs scanning a 1TB, almost-full ext3 hard disk and brings the entire system to a crawl. Aborted after 14 hours of "searching".

    – Angsuman Chakraborty
    Sep 12 '16 at 11:09



















0














If you are going to use hard links, pay attention to the rights on the file. Note that owner, group, mode, extended attributes, time and ACLs (if you use them) are stored in the inode. Only the file names differ, because they are stored in the directory structure and point to the inode's properties. For this reason, all file names linked to the same inode have the same access rights. You should prevent modification of that file, because any user can damage the file for everyone else: it is enough for any user to put another file under the same name. The inode number is then kept, and the original file content is destroyed (replaced) for all of the hardlinked names.

A better way is deduplication at the filesystem layer. You can use BTRFS (very popular lately), OCFS or the like. Look at the page https://en.wikipedia.org/wiki/Comparison_of_file_systems, especially the "Features" table and the "data deduplication" column. You can click it and sort :)

Look especially at the ZFS filesystem. It is available as FUSE, but that way it is very slow. If you want native support, look at the page http://zfsonlinux.org/. Then you must patch the kernel and install the zfs tools for management. I don't understand why Linux doesn't ship it as drivers; it works that way for many other operating systems/kernels.

File systems support deduplication in two ways: deduplicating files, or deduplicating blocks. ZFS supports blocks. This means that the same content that repeats within the same file can be deduplicated. The other distinction is when the data are deduplicated: this can be online (ZFS) or offline (BTRFS).

Note that deduplication consumes RAM. This is why writing files to a ZFS volume mounted with FUSE causes dramatically slow performance; it is described in the documentation. But you can turn deduplication on and off online on a volume. If you see that some data should be deduplicated, you simply switch deduplication on, rewrite the files to a temporary location and finally replace the originals; afterwards you can switch deduplication off and restore full performance. Of course, you can also add cache disks to the storage. These can be very fast rotating disks or SSDs, and they can be quite small; in real work they are a replacement for RAM :)

Under Linux you should take care with ZFS, because not everything works as it should, especially when you manage the filesystem, make snapshots, etc., but if you set up the configuration and don't change it, it all works properly. Otherwise you should change Linux to OpenSolaris, which natively supports ZFS :) What is very nice with ZFS is that it works both as a filesystem and as a volume manager similar to LVM; you do not need LVM when you use ZFS. See the documentation if you want to know more.

Note the difference between ZFS and BTRFS. ZFS is older and more mature, but unfortunately only under Solaris and OpenSolaris (unfortunately strangled by Oracle). BTRFS is younger, but lately it has been very well supported. I recommend a fresh kernel. ZFS has online deduplication, which slows down writes because everything is calculated online. BTRFS supports offline deduplication; this saves performance, but means that when the host has nothing to do, you periodically run a tool to perform the deduplication. And BTRFS was created natively under Linux. Maybe this is the better FS for you :)

























  • 1





    I do like the offline (or batch) deduplication approach btrfs has. Excellent discussion of the options (including the cp --reflink option) here: btrfs.wiki.kernel.org/index.php/Deduplication

    – Marcel Waldvogel
    Feb 5 '17 at 19:42











  • ZFS is not Solaris or OpenSolaris only. It's natively supported in FreeBSD. Also, ZFS on Linux is device driver based; ZFS on FUSE is a different thing.

    – KJ Seefried
    Mar 29 '18 at 19:07



















0














Hard links might not be the best idea; if one user changes the file, it affects both. However, deleting a hard link doesn't delete both files. Plus, I am not entirely sure if Hard Links take up the same amount of space (on the hard disk, not the OS) as multiple copies of the same file; according to Windows (with the Link Shell Extension), they do. Granted, that's Windows, not Unix...



My solution would be to create a "common" file in a hidden folder, and replace the actual duplicates with symbolic links... then, the symbolic links would be embedded with metadata or alternate file streams that only records however the two "files" are different from each other, like if one person wants to change the filename or add custom album art or something else like that; it might even be useful outside of database applications, like having multiple versions of the same game or software installed and testing them independently with even the smallest differences.





































    0














    The easiest way is to use the dedicated program
    dupeGuru



    dupeGuru Preferences Screenshot



    as the documentation says:




    Deletion Options



    These options affect how duplicate deletion takes place.
    Most of the time, you don’t need to enable any of them.



    Link deleted files:



    The deleted files are replaced by a link to the reference file.
    You have a choice of replacing it either with a symlink or a hardlink.
    ...
    a symlink is a shortcut to the file’s path.
    If the original file is deleted or moved, the link is broken.
    A hardlink is a link to the file itself.
    That link is as good as a “real” file.
    Only when all hardlinks to a file are deleted is the file itself deleted.



    On OSX and Linux, this feature is supported fully,
    but under Windows, it’s a bit complicated.
    Windows XP doesn’t support it, but Vista and up support it.
    However, for the feature to work,
    dupeGuru has to run with administrative privileges.







    share|improve this answer























      Your Answer








      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "106"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f3037%2fis-there-an-easy-way-to-replace-duplicate-files-with-hardlinks%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      18 Answers
      18






      active

      oldest

      votes








      18 Answers
      18






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      41














      There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:




      Traverse all directories named on the
      command line, compute MD5 checksums
      and find files with identical MD5. IF
      they are equal, do a real comparison
      if they are really equal, replace the
      second of two files with a hard link
      to the first one.







      share|improve this answer
























      • Sounds perfect, thanks!! I'll try it and accept if it works as described!

        – Josh
        Oct 12 '10 at 20:09






      • 3





        This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.

        – Josh
        Dec 8 '10 at 20:13






      • 10





        Upvoted this, but after researching some more, I kind of which I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

        – oligofren
        Jan 3 '15 at 13:42













      • @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This scrips seems to be the only thing that handles that.

        – phunehehe
        Jun 26 '15 at 6:59






      • 3





        Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

        – Charles Duffy
        Feb 1 '16 at 16:56


















      41














      There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:




      Traverse all directories named on the
      command line, compute MD5 checksums
      and find files with identical MD5. IF
      they are equal, do a real comparison
      if they are really equal, replace the
      second of two files with a hard link
      to the first one.







      share|improve this answer
























      • Sounds perfect, thanks!! I'll try it and accept if it works as described!

        – Josh
        Oct 12 '10 at 20:09






      • 3





        This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.

        – Josh
        Dec 8 '10 at 20:13






      • 10





        Upvoted this, but after researching some more, I kind of which I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

        – oligofren
        Jan 3 '15 at 13:42













      • @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This scrips seems to be the only thing that handles that.

        – phunehehe
        Jun 26 '15 at 6:59






      • 3





        Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

        – Charles Duffy
        Feb 1 '16 at 16:56
















      41












      41








      41







      There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:




      Traverse all directories named on the
      command line, compute MD5 checksums
      and find files with identical MD5. IF
      they are equal, do a real comparison
      if they are really equal, replace the
      second of two files with a hard link
      to the first one.







      share|improve this answer













      There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:




      Traverse all directories named on the
      command line, compute MD5 checksums
      and find files with identical MD5. IF
      they are equal, do a real comparison
      if they are really equal, replace the
      second of two files with a hard link
      to the first one.








      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered Oct 12 '10 at 20:04









      fschmittfschmitt

      7,6313043




      7,6313043













      • Sounds perfect, thanks!! I'll try it and accept if it works as described!

        – Josh
        Oct 12 '10 at 20:09






      • 3





        This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.

        – Josh
        Dec 8 '10 at 20:13






      • 10





        Upvoted this, but after researching some more, I kind of which I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

        – oligofren
        Jan 3 '15 at 13:42













      • @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This scrips seems to be the only thing that handles that.

        – phunehehe
        Jun 26 '15 at 6:59






      • 3





        Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

        – Charles Duffy
        Feb 1 '16 at 16:56





















      • Sounds perfect, thanks!! I'll try it and accept if it works as described!

        – Josh
        Oct 12 '10 at 20:09






      • 3





        This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.

        – Josh
        Dec 8 '10 at 20:13






      • 10





        Upvoted this, but after researching some more, I kind of which I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

        – oligofren
        Jan 3 '15 at 13:42













      • @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This scrips seems to be the only thing that handles that.

        – phunehehe
        Jun 26 '15 at 6:59






      • 3





        Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

        – Charles Duffy
        Feb 1 '16 at 16:56



















      Sounds perfect, thanks!! I'll try it and accept if it works as described!

      – Josh
      Oct 12 '10 at 20:09





      Sounds perfect, thanks!! I'll try it and accept if it works as described!

      – Josh
      Oct 12 '10 at 20:09




      3




      3





      This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.

      – Josh
      Dec 8 '10 at 20:13





      This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.

      – Josh
      Dec 8 '10 at 20:13




      10




      10





      Upvoted this, but after researching some more, I kind of which I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

      – oligofren
      Jan 3 '15 at 13:42







      Upvoted this, but after researching some more, I kind of which I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

      – oligofren
      Jan 3 '15 at 13:42















      @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This scrips seems to be the only thing that handles that.

      – phunehehe
      Jun 26 '15 at 6:59





      @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This scrips seems to be the only thing that handles that.

      – phunehehe
      Jun 26 '15 at 6:59




      3




      3





      Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

      – Charles Duffy
      Feb 1 '16 at 16:56







      Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

      – Charles Duffy
      Feb 1 '16 at 16:56















      73














      rdfind does exactly what you ask for (and in the order johny why lists). Makes it possible to delete duplicates, replace them with either soft or hard links. Combined with symlinks you can also make the symlink either absolute or relative. You can even pick checksum algorithm (md5 or sha1).



      Since it is compiled it is faster than most scripted solutions: time on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this



      9.99s user 3.61s system 66% cpu 20.543 total


      (using md5).



      Available in most package handlers (e.g. MacPorts for Mac OS X).






      share|improve this answer





















      • 10





        +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

        – Daniel Trebbien
        Dec 29 '13 at 20:49











      • oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

        – oligofren
        Jan 3 '15 at 13:38











      • Very smart and fast algorithm.

        – ndemou
        Oct 30 '15 at 12:53






      • 1





        I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

        – cdhowie
        May 31 '18 at 21:19
















      73














      rdfind does exactly what you ask for (and in the order johny why lists). Makes it possible to delete duplicates, replace them with either soft or hard links. Combined with symlinks you can also make the symlink either absolute or relative. You can even pick checksum algorithm (md5 or sha1).



      Since it is compiled it is faster than most scripted solutions: time on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this



      9.99s user 3.61s system 66% cpu 20.543 total


      (using md5).



      Available in most package handlers (e.g. MacPorts for Mac OS X).






      share|improve this answer





















      • 10





        +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

        – Daniel Trebbien
        Dec 29 '13 at 20:49











      • oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

        – oligofren
        Jan 3 '15 at 13:38











      • Very smart and fast algorithm.

        – ndemou
        Oct 30 '15 at 12:53






      • 1





        I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

        – cdhowie
        May 31 '18 at 21:19














      73












      73








      73







      rdfind does exactly what you ask for (and in the order johny why lists). Makes it possible to delete duplicates, replace them with either soft or hard links. Combined with symlinks you can also make the symlink either absolute or relative. You can even pick checksum algorithm (md5 or sha1).



      Since it is compiled it is faster than most scripted solutions: time on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this



      9.99s user 3.61s system 66% cpu 20.543 total


      (using md5).



      Available in most package handlers (e.g. MacPorts for Mac OS X).






      share|improve this answer















      rdfind does exactly what you ask for (and in the order johny why lists). Makes it possible to delete duplicates, replace them with either soft or hard links. Combined with symlinks you can also make the symlink either absolute or relative. You can even pick checksum algorithm (md5 or sha1).



      Since it is compiled it is faster than most scripted solutions: time on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this



      9.99s user 3.61s system 66% cpu 20.543 total


      (using md5).



      Available in most package handlers (e.g. MacPorts for Mac OS X).







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Jul 5 '13 at 9:22









      Tobias Kienzler

      4,349104589




      4,349104589










      answered Jul 5 '13 at 8:15









      d-bd-b

      94878




      94878








      • 10





        +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

        – Daniel Trebbien
        Dec 29 '13 at 20:49











      • oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

        – oligofren
        Jan 3 '15 at 13:38











      • Very smart and fast algorithm.

        – ndemou
        Oct 30 '15 at 12:53






      • 1





        I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

        – cdhowie
        May 31 '18 at 21:19














      • 10





        +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

        – Daniel Trebbien
        Dec 29 '13 at 20:49











      • oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

        – oligofren
        Jan 3 '15 at 13:38











      • Very smart and fast algorithm.

        – ndemou
        Oct 30 '15 at 12:53






      • 1





        I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

        – cdhowie
        May 31 '18 at 21:19








      10




      10





      +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

      – Daniel Trebbien
      Dec 29 '13 at 20:49





      +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

      – Daniel Trebbien
      Dec 29 '13 at 20:49













      oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

      – oligofren
      Jan 3 '15 at 13:38





      oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

      – oligofren
      Jan 3 '15 at 13:38













      Very smart and fast algorithm.

      – ndemou
      Oct 30 '15 at 12:53





      Very smart and fast algorithm.

      – ndemou
      Oct 30 '15 at 12:53




      1




      1





      I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

      – cdhowie
      May 31 '18 at 21:19





      I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

      – cdhowie
      May 31 '18 at 21:19











      49














      Use the fdupes tool:



      fdupes -r /path/to/folder gives you a list of duplicates in the directory (-r makes it recursive). The output looks like this:





      filename1

      filename2



      filename3

      filename4

      filename5





      with filename1 and filename2 being identical and filename3, filename4 and filename5 also being identical.






      share|improve this answer



















      • 1





        Ubuntu Note: As of September 2013, it hasn't had a stable release (it is on 1.50-PR2-3), so the update doesn't appear in ubuntu yet.

        – Stuart Axon
        Aug 28 '13 at 14:19






      • 11





        I just tried installing fdupes_1.50-PR2-4 on both Ubuntu and Debian, neither has the -L flag. Luckily building from github.com/tobiasschulz/fdupes was super easy.

        – neu242
        Aug 30 '13 at 15:07






      • 3





        Try rdfind - like fdupes, but faster and available on OS X and Cygwin as well.

        – oligofren
        Jan 3 '15 at 13:43











      • Or if you just requre Linux compatibility, install rmlint which is blazingly fast, and has lots of nice options. Truly a modern alternative.

        – oligofren
        Jan 3 '15 at 14:28






      • 3





        fdupes seems to only find duplicates, not replace them with hardlinks, so not an answer to the question IMO.

        – Calimo
        Nov 8 '17 at 15:58
















22

I use hardlink from http://jak-linux.org/projects/hardlink/

answered Oct 18 '11 at 4:24 – waltinator

  • 1

    Nice hint; I have been using code.google.com/p/hardlinkpy on a regular basis, but that has not been updated for a while...

    – meduz
    Apr 11 '12 at 19:09

  • 2

    This appears to be similar to the original hardlink on Fedora/RHEL/etc.

    – Jack Douglas
    Jun 21 '12 at 8:43

  • 1

    hardlink is now a native binary in many Linux package systems (since ~2014) and extremely fast. For 1.2M files (320 GB), it took just 200 seconds (linking roughly 10% of the files).

    – Marcel Waldvogel
    Feb 5 '17 at 19:13

  • FWIW, the above hardlink was created by Julian Andres Klode, while the Fedora hardlink was created by Jakub Jelinek (source: pagure.io/hardlink - Fedora package name: hardlink).

    – maxschlepzig
    Jan 4 at 17:52
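A hedged usage sketch; the exact switches differ between the Debian/jak-linux tool and the Fedora one, so treat both flags and the /srv/music path as assumptions and check hardlink --help on your system first:

hardlink -n -v /srv/music    # dry run (assumed flag): report what would be linked
hardlink -v /srv/music       # replace duplicates under /srv/music with hardlinks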
18

This is one of the functions provided by "fslint" --
http://en.flossmanuals.net/FSlint/Introduction

Click the "Merge" button:

[Screenshot]

answered Dec 18 '10 at 22:38 – LJ Wobker

  • 4

    The -m will hardlink duplicates together, -d will delete all but one, and -t will dry run, printing what it would do.

    – Azendale
    Oct 29 '12 at 5:57

  • 1

    On Ubuntu, here is what to do:
    sudo apt-get install fslint
    /usr/share/fslint/fslint/findup -m /your/directory/tree
    (the directory /usr/share/fslint/fslint/ is not in $PATH by default)

    – Jocelyn
    Sep 8 '13 at 15:38
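Putting the two comments above together, a command-line run (no GUI) would look roughly like this; the findup path is the Debian/Ubuntu install location and the flags are the ones Azendale describes, so verify them against findup --help on your system. /srv/music is a placeholder.

sudo apt-get install fslint
/usr/share/fslint/fslint/findup -t /srv/music   # dry run: print what would be done
/usr/share/fslint/fslint/findup -m /srv/music   # hardlink ("merge") duplicates together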
14

Since your main target is to save disk space, there is another solution: de-duplication (and possibly compression) at the file-system level. Compared with the hard-link solution, it does not have the problem of inadvertently affecting other linked files.

ZFS has had dedup (block-level, not file-level) since pool version 23, and compression for much longer.
If you are using Linux, you may try zfs-fuse; if you use BSD, it is natively supported.

answered Oct 13 '10 at 5:13 – Wei-Yin

  • This is probably the way I'll go eventually; however, does BSD's ZFS implementation do dedup? I thought it did not.

    – Josh
    Dec 8 '10 at 20:14

  • In addition, the HAMMER filesystem on DragonFlyBSD has deduplication support.

    – hhaamu
    Jul 15 '12 at 17:48

  • 11

    ZFS dedup is the friend of nobody. Where ZFS recommends 1 GB RAM per 1 TB of usable disk space, you're friggin' nuts if you try to use dedup with less than 32 GB RAM per 1 TB of usable disk space. That means that for a 1 TB mirror, if you don't have 32 GB RAM, you are likely to encounter memory-bomb conditions sooner or later that will halt the machine due to lack of RAM. Been there, done that, still recovering from the PTSD.

    – killermist
    Sep 22 '14 at 18:51

  • 3

    To avoid the excessive RAM requirements of online deduplication (i.e., checking on every write), btrfs uses batch or offline deduplication (run it whenever you consider it useful/necessary): btrfs.wiki.kernel.org/index.php/Deduplication

    – Marcel Waldvogel
    Feb 5 '17 at 19:18

  • 2

    Update seven years later: I eventually did move to ZFS and tried deduplication -- I found that its RAM requirements were indeed just far too high. Crafty use of ZFS snapshots provided the solution I ended up using. (Copy one user's music, snapshot and clone, copy the second user's music into the clone using rsync --inplace so only changed blocks are stored.)

    – Josh
    Sep 13 '17 at 13:54
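For reference, turning these properties on is one command per dataset. This is a hedged sketch with placeholder pool/dataset names (tank/music); keep the RAM warnings in the comments above in mind before enabling dedup.

zfs set dedup=on tank/music          # block-level dedup for this dataset (very RAM-hungry)
zfs set compression=lz4 tank/music   # lz4 assumes a reasonably recent ZFS; older ones use compression=on
zfs get dedup,compression tank/music # confirm the properties took effect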
7

On modern Linux these days there's https://github.com/g2p/bedup, which de-duplicates on a btrfs filesystem, but 1) without as much of the scan overhead, and 2) files can easily diverge again afterwards.

answered Jan 8 '14 at 17:37 – Matthew Bloch

  • Background and more information is listed on btrfs.wiki.kernel.org/index.php/Deduplication (including a reference to cp --reflink; see also below).

    – Marcel Waldvogel
    Feb 5 '17 at 19:22
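A hedged sketch of what using it looks like; both the pip package name and the dedup subcommand are assumptions taken from the project README, and the mount point is a placeholder, so check bedup --help before relying on any of this:

pip install --user bedup        # package name assumed; needs Python and btrfs userland headers
sudo bedup dedup /mnt/btrfs     # subcommand assumed: scan the volume and share identical extents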
5

To find duplicate files you can use duff.

    Duff is a Unix command-line utility for quickly finding duplicates in a given set of files.

Simply run:

duff -r target-folder

To create hardlinks to those files automatically, you will need to parse the output of duff with bash or some other scripting language.

answered Oct 12 '10 at 20:00 – Stefan

  • Really slow though -- see rdfind.pauldreik.se/#g0.6

    – ndemou
    Oct 30 '15 at 12:52
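For example, a minimal, hedged sketch of that parsing step: it relies on duff's default output, where each cluster starts with a header line of the form "N files in cluster M (...)" followed by the file names, assumes filenames contain no newlines, and keeps the first file of each cluster as the master. Leave the echo in place for a dry run.

#!/bin/bash
# Hedged sketch: hardlink every file in a duff cluster to the cluster's first file.
duff -r "${1:-.}" | while IFS= read -r line; do
    case "$line" in
        *' files in cluster '*)                # a new cluster header
            master="" ;;
        *)  if [ -z "$master" ]; then
                master="$line"                 # first file of the cluster: keep it
            else
                echo ln -f "$master" "$line"   # drop `echo` to actually link
            fi ;;
    esac
done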
5

aptitude show hardlink

    Description: Hardlinks multiple copies of the same file
      Hardlink is a tool which detects multiple copies of the same file and replaces them with hardlinks.
      The idea has been taken from http://code.google.com/p/hardlinkpy/, but the code has been written from scratch and licensed under the MIT license.
    Homepage: http://jak-linux.org/projects/hardlink/

answered Nov 22 '13 at 15:03 – Julien Palard

  • The only program mentioned here available for Gentoo without unmasking and with hardlink support, thanks!

    – Jorrit Schippers
    Mar 9 '15 at 13:48
4

I've used many of the hardlinking tools for Linux mentioned here.
I too am stuck with an ext4 fs, on Ubuntu, and have been using its cp -l and -s for hard/soft linking. But lately I noticed the lightweight copy in the cp man page, which would imply sparing the redundant disk space until one side gets modified:

   --reflink[=WHEN]
          control clone/CoW copies. See below

   When --reflink[=always] is specified, perform a lightweight copy, where the
   data blocks are copied only when modified. If this is not possible the
   copy fails, or if --reflink=auto is specified, fall back to a standard copy.

answered Mar 14 '12 at 9:59 – Marcos

  • I think I will update my cp alias to always include the --reflink=auto parameter now.

    – Marcos
    Mar 14 '12 at 14:08

  • 1

    Does ext4 really support --reflink?

    – Jack Douglas
    Jun 21 '12 at 8:42

  • 7

    This is supported on btrfs and OCFS2. It is only possible on copy-on-write filesystems, which ext4 is not. btrfs is really shaping up. I love using it because of reflink and snapshots; it makes you less scared to do mass operations on big trees of files.

    – clacke
    Jul 3 '12 at 18:57
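Concretely, usage looks like this; as the comments above point out, the always form only succeeds on filesystems with reflink/CoW support (btrfs, OCFS2, and more recently XFS), while auto silently falls back to a normal copy elsewhere. The file names are placeholders.

cp --reflink=always original.flac copy.flac   # share data blocks (CoW); fails on ext4
cp --reflink=auto   original.flac copy.flac   # reflink where possible, ordinary copy otherwise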
3

It seems to me that checking the filename first could speed things up. If two files lack the same filename, then in many cases I would not consider them to be duplicates. It seems that the quickest method would be to compare, in order:

  • filename
  • size
  • md5 checksum
  • byte contents

Do any methods do this? Look at duff, fdupes, rmlint, fslint, etc.

The following method was top-voted on commandlinefu.com: Find Duplicate Files (based on size first, then MD5 hash)

Can filename comparison be added as a first step, size as a second step?

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |
  xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum |
  sort | uniq -w32 --all-repeated=separate

answered Jul 9 '12 at 15:02 – johny why

  • 2

    I've used duff, fdupes and rmlint, and strongly recommend readers to look at the third of these. It has an excellent option set (and documentation). With it, I was able to avoid a lot of the post-processing I needed with the other tools.

    – dubiousjim
    Sep 2 '15 at 6:32

  • 2

    In my practice, filename is the least reliable factor to look at, and I've completely removed it from any de-duping efforts I make. How many install.sh files can be found on an active system? I can't count the number of times I've saved a file and had a name clash, with some on-the-fly renaming to save it. Flip side: no idea how many times I've downloaded something from different sources, on different days, only to find they are the same file with different names. (Which also kills the timestamp reliability.) 1: size, 2: digest, 3: byte contents.

    – Gypsy Spellweaver
    Jan 28 '17 at 6:40

  • @GypsySpellweaver: (1) depends on personal use-case, wouldn't you agree? In my case, I have multiple restores from multiple backups, where files with the same name and content exist in different restore folders. (2) Your comment seems to assume comparing filename only. I was not suggesting eliminating the other checks.

    – johny why
    Mar 8 '17 at 21:50
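To sketch an answer to the question in the post (treat this as a hedged, GNU-tools-only illustration, not a vetted one-liner): group candidates by basename and size first, and only checksum the files that still collide. It assumes filenames contain no tabs or newlines.

find . -type f -printf '%f\t%s\t%p\n' |
  sort |                                      # identical basename+size sort together
  awk -F'\t' '{ key = $1 FS $2 }
              key == prev { print prevpath; print $3 }
              { prev = key; prevpath = $3 }' |
  sort -u |                                   # each colliding path exactly once
  xargs -d '\n' md5sum |
  sort | uniq -w32 --all-repeated=separate    # same final grouping as the pipeline above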
2

I made a Perl script that does something similar to what you're talking about:

http://pastebin.com/U7mFHZU7

Basically, it just traverses a directory, calculating the SHA1 sum of the files in it, hashing them and linking matches together. It's come in handy on many, many occasions.

answered Jan 31 '11 at 2:06 – amphetamachine

  • 2

    I hope to get around to trying this soon... why not upload it to CPAN... App::relink or something?

    – xenoterracide
    Feb 7 '11 at 11:12

  • 1

    @xenoterracide: because of all the similar and more mature solutions that already exist. See the other answers, especially rdfind.

    – oligofren
    Jan 3 '15 at 13:36

  • 1

    @oligofren I don't doubt better solutions exist. TMTOWTDI I guess.

    – amphetamachine
    Jan 5 '15 at 15:49
      2














      Since I'm not a fan of Perl, here's a bash version:



#!/bin/bash
DIR="/path/to/big/files"

# Checksum every file and sort, so identical sums end up on adjacent lines
find "$DIR" -type f -exec md5sum {} \; | sort > /tmp/sums-sorted.txt

OLDSUM=""
IFS=$'\n'
for i in `cat /tmp/sums-sorted.txt`; do
    NEWSUM=`echo "$i" | sed 's/ .*//'`
    NEWFILE=`echo "$i" | sed 's/^[^ ]* *//'`
    if [ "$OLDSUM" == "$NEWSUM" ]; then
        # Same checksum as the previous file: replace it with a hard link
        # (drop the echo to actually link instead of just printing the command)
        echo ln -f "$OLDFILE" "$NEWFILE"
    else
        OLDSUM="$NEWSUM"
        OLDFILE="$NEWFILE"
    fi
done


      This finds all files with the same checksum (whether they're big, small, or already hardlinks), and hardlinks them together.



This can be greatly optimized for repeated runs with additional find flags (e.g. size) and a file cache (so you don't have to redo the checksums each time). If anyone's interested in the smarter, longer version, I can post it.



      NOTE: As has been mentioned before, hardlinks work as long as the files never need modification, or to be moved across filesystems.
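
(A minimal sketch of the size pre-filter mentioned above, as an illustration rather than the author's longer cached version; it assumes GNU find/awk/xargs and file names without tabs or newlines.)

#!/bin/bash
# Sketch: only checksum files whose size occurs more than once,
# since a file with a unique size cannot have a duplicate.
DIR="/path/to/big/files"

find "$DIR" -type f -printf '%s\t%p\n' | sort -n > /tmp/sizes.txt

# sizes that appear at least twice
cut -f1 /tmp/sizes.txt | uniq -d > /tmp/dup-sizes.txt

# checksum only those candidate files, then reuse the loop above unchanged
awk -F'\t' 'NR==FNR {dup[$1]; next} $1 in dup {print $2}' \
    /tmp/dup-sizes.txt /tmp/sizes.txt |
  xargs -d '\n' md5sum | sort > /tmp/sums-sorted.txt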






      share|improve this answer


























• How can I change your script so that, instead of hardlinking, it just deletes the duplicate files and adds an entry to a CSV file recording deleted file -> linked file?

        – MR.GEWA
        Jan 12 '13 at 12:17













• Sure. The hard-link line, echo ln -f "$OLDFILE" "$NEWFILE", just replaces the duplicate file with a hard link, so you could change it to rm "$NEWFILE" instead.

        – seren
        Jan 13 '13 at 4:15













• And how, on the next line, would I write something like $OLDFILE -> $NEWFILE to a text file?

        – MR.GEWA
        Jan 13 '13 at 13:12













• Ahh, right. Yes, add a line after the rm, such as: echo "$NEWFILE" >> /var/log/deleted_duplicate_files.log (see the consolidated sketch after these comments).

        – seren
        Jan 14 '13 at 19:28








      • 1





Don't friggin reinvent the wheel. There are more mature solutions available, like rdfind, which works at native speed and just requires brew install rdfind or apt-get install rdfind to install.

        – oligofren
        Jan 3 '15 at 13:46
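
(Consolidating that exchange: a sketch of the delete-and-log variant, meant to replace the if-block inside the script above; the CSV path is only an example.)

if [ "$OLDSUM" == "$NEWSUM" ]; then
    # delete the duplicate instead of hard-linking it,
    # and record "deleted file,kept file" as a CSV line
    rm -- "$NEWFILE"
    echo "\"$NEWFILE\",\"$OLDFILE\"" >> /var/log/deleted_duplicate_files.csv
else
    OLDSUM="$NEWSUM"
    OLDFILE="$NEWFILE"
fi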
















answered Jul 3 '12 at 5:15 – seren
edited Jul 3 '12 at 11:04 – Mat











      2














If you want to replace duplicates with hard links on macOS or any UNIX-based system, you can try SmartDupe (http://sourceforge.net/projects/smartdupe/); I'm developing it.






      share|improve this answer



















      • 3





        Can you expand on how “smart” it is?

        – Stéphane Gimenez
        Nov 4 '12 at 13:25






      • 1





        How can I compare files of two different directories?

        – Burcardo
        May 31 '16 at 8:26
















answered Nov 4 '12 at 0:57 – islam











      1














The application FSLint (http://www.pixelbeat.org/fslint/) can find all equal files in any folder (by content) and create hardlinks. Give it a try!



      Jorge Sampaio
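
(FSLint is primarily a GUI, but as far as I recall it also ships command-line backends under /usr/share/fslint/fslint/; the path and flags below are from memory, so check findup --help before relying on them.)

/usr/share/fslint/fslint/findup /path/to/music       # list duplicate files (by content)
/usr/share/fslint/fslint/findup -m /path/to/music    # merge duplicates into hard links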






      share|improve this answer
























• It hangs scanning a 1 TB, almost-full ext3 hard disk and brings the entire system to a crawl. Aborted after 14 hours of "searching".

        – Angsuman Chakraborty
        Sep 12 '16 at 11:09
















answered Jan 15 '15 at 16:29 – Jorge H B Sampaio Jr











      0














If you use hard links, pay attention to the permissions on the file. Owner, group, mode, extended attributes, times and ACLs (if you use them) are stored in the inode; only the file names differ, because they live in the directory structure and point to the inode's properties. Consequently, all file names linked to the same inode have the same access rights. You should prevent modification of that file, because any user who can write to it can damage it for everyone else: it is enough for a user to write other content under the same name. The inode number is kept, and the original file content is destroyed (replaced) for all hardlinked names.

A better way is deduplication at the filesystem layer. You can use BTRFS (very popular lately), OCFS or the like. Look at the page https://en.wikipedia.org/wiki/Comparison_of_file_systems , especially at the Features table and the "data deduplication" column. You can click it and sort :)

Look especially at the ZFS filesystem. It is available via FUSE, but that way it is very slow. If you want native support, look at http://zfsonlinux.org/ . You then have to patch the kernel and install the zfs tools for management. I don't understand why Linux doesn't ship it as a driver; it works that way for many other operating systems / kernels.

Filesystems support deduplication in two ways: per file or per block. ZFS deduplicates blocks, which means the same content repeated within one file can be deduplicated as well. The other distinction is when the data gets deduplicated: online (ZFS) or offline (BTRFS).

Note that deduplication consumes RAM. This is why writing files to a ZFS volume mounted via FUSE gives dramatically slow performance; it is described in the documentation. But you can switch deduplication on and off per volume online. If you see that some data should be deduplicated, you simply switch deduplication on, rewrite some files to a temporary place, and finally replace them; afterwards you can switch deduplication off again and restore full performance. Of course, you can also add cache disks to the storage: very fast rotating disks or SSDs, which can be quite small. In real work this is a replacement for RAM :)

Under Linux you should take care with ZFS, because not everything works as it should, especially when you manage the filesystem, make snapshots, etc.; but if you set up the configuration once and don't change it, it all works properly. Otherwise you should switch from Linux to OpenSolaris, which supports ZFS natively :) What is very nice about ZFS is that it works both as a filesystem and as a volume manager similar to LVM, so you don't need LVM when you use ZFS. See the documentation if you want to know more.

Note the difference between ZFS and BTRFS. ZFS is older and more mature, but unfortunately only under Solaris and OpenSolaris (unfortunately strangled by Oracle). BTRFS is younger, but lately it has been supported very well; I recommend a fresh kernel. ZFS has online deduplication, which slows down writes because everything is calculated online. BTRFS supports offline deduplication: that preserves performance, and when the host has nothing to do you periodically run a tool that does the deduplication. And BTRFS was created natively under Linux. Maybe that is the better FS for you :)
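
(For illustration, roughly what the two approaches look like on the command line; pool/dataset names and paths are placeholders, duperemove is only one of several offline BTRFS dedupers, and cp --reflink needs a CoW filesystem such as BTRFS.)

# ZFS: online (inline) deduplication, toggled per dataset
zfs set dedup=on tank/music
zfs get dedup,compressratio tank/music

# BTRFS: offline/batch deduplication with an external tool
duperemove -dhr /mnt/music        # -d dedupe, -h human-readable, -r recursive

# BTRFS: share data explicitly at copy time with a reflink (CoW clone)
cp --reflink=always original.flac copy.flac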






      share|improve this answer



















      • 1





        I do like the offline (or batch) deduplication approach btrfs has. Excellent discussion of the options (including the cp --reflink option) here: btrfs.wiki.kernel.org/index.php/Deduplication

        – Marcel Waldvogel
        Feb 5 '17 at 19:42











      • ZFS is not Solaris or OpenSolaris only. It's natively supported in FreeBSD. Also, ZFS on Linux is device driver based; ZFS on FUSE is a different thing.

        – KJ Seefried
        Mar 29 '18 at 19:07
















answered Jun 24 '14 at 8:51 – Znik











      0














Hard links might not be the best idea; if one user changes the file, it affects everyone who links to it. However, deleting a hard link doesn't delete the other names. As for space, hard links point at the same data on disk, so they shouldn't take up extra room, even though Windows (with the Link Shell Extension) reports the full size for each name. Granted, that's Windows, not Unix...

My solution would be to create a "common" file in a hidden folder and replace the actual duplicates with symbolic links... the symbolic links could then carry metadata or alternate file streams that only record how the two "files" differ from each other, e.g. if one person wants to change the filename or add custom album art or something else like that; it might even be useful outside of database applications, like having multiple versions of the same game or software installed and testing them independently with even the smallest differences.
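
(A sketch of that common-file-plus-symlinks idea on the Unix side; every path here is a hypothetical example, and the metadata/alternate-stream part is left out.)

#!/bin/bash
# Move one real copy into a hidden store, then point both old names at it.
store="/srv/music/.common"
mkdir -p "$store"

orig="/srv/music/alice/song.flac"     # hypothetical duplicate pair
dup="/srv/music/bob/song.flac"

common="$store/$(sha1sum -- "$orig" | awk '{print $1}').flac"
mv -- "$orig" "$common"               # keep a single real copy
ln -s -- "$common" "$orig"            # both former names become symlinks
ln -sf -- "$common" "$dup"            # -f replaces the duplicate regular file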






      share|improve this answer




























answered May 3 '16 at 18:43 – Amaroq Starwind























              0














The easiest way is to use a dedicated program, dupeGuru.

[dupeGuru Preferences screenshot]

As the documentation says:




              Deletion Options



              These options affect how duplicate deletion takes place.
              Most of the time, you don’t need to enable any of them.



              Link deleted files:



              The deleted files are replaced by a link to the reference file.
              You have a choice of replacing it either with a symlink or a hardlink.
              ...
              a symlink is a shortcut to the file’s path.
              If the original file is deleted or moved, the link is broken.
              A hardlink is a link to the file itself.
              That link is as good as a “real” file.
              Only when all hardlinks to a file are deleted is the file itself deleted.



              On OSX and Linux, this feature is supported fully,
              but under Windows, it’s a bit complicated.
              Windows XP doesn’t support it, but Vista and up support it.
              However, for the feature to work,
              dupeGuru has to run with administrative privileges.
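
(A quick way to see that last point for yourself, using nothing but coreutils in a scratch directory.)

cd "$(mktemp -d)"
echo "some audio data" > a.flac
ln a.flac b.flac             # second hard link to the same inode
stat -c '%h' a.flac          # prints 2: two names, one file
rm a.flac                    # removes one name only
cat b.flac                   # the data is still there
rm b.flac                    # the last link is gone, now the data is too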







              share|improve this answer




























answered Jun 13 '17 at 14:20 – Russian Junior Ruby Developer





























