Is there an easy way to replace duplicate files with hardlinks?












129















I'm looking for an easy way (a command or series of commands, probably involving find) to find duplicate files in two directories, and replace the files in one directory with hardlinks of the files in the other directory.



Here's the situation: This is a file server which multiple people store audio files on, each user having their own folder. Sometimes multiple people have copies of the exact same audio files. Right now, these are duplicates. I'd like to make it so they're hardlinks, to save hard drive space.










files hard-link deduplication duplicate-files






edited 38 mins ago by Jeff Schaller
asked Oct 12 '10 at 19:23 by Josh








  • 20





    One problem you may run into with hardlinks is that if somebody decides to do something to one of their music files that you've hard-linked, they could inadvertently affect other people's access to their music.

    – Steven D
    Oct 13 '10 at 2:48






  • 4





    Another problem is that two different files containing "Some Really Great Tune", even if taken from the same source with the same encoder, will very likely not be bit-for-bit identical.

    – msw
    Oct 13 '10 at 2:57








  • 3





    A better solution might be to have a public music folder...

    – Stefan
    Oct 13 '10 at 7:08






  • 3





    related: superuser.com/questions/140819/ways-to-deduplicate-files

    – David Cary
    Mar 16 '11 at 23:59






  • 1





    @tante: Using symlinks solves no problem. When a user "deletes" a file, the number of links to it gets decremented; when the count reaches zero, the file gets really deleted, that's all. So deletion is no problem with hardlinked files; the only problem is a user trying to edit the file (improbable indeed) or to overwrite it (quite possible if logged in).

    – maaartinus
    Mar 14 '12 at 3:56














18 Answers
41














There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:




Traverse all directories named on the command line, compute MD5 checksums and find files with identical MD5. If they are equal, do a real comparison; if they are really equal, replace the second of two files with a hard link to the first one.
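For reference, a minimal invocation sketch; the download step and the two directory paths are illustrative assumptions, since the script simply takes the directories to scan as command-line arguments, as described above:

# Fetch the script, then pass the directories to deduplicate on the command line
# (the /srv/audio/... paths are placeholders)
wget http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl
perl trimtrees.pl /srv/audio/alice /srv/audio/bob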







answered Oct 12 '10 at 20:04 by fschmitt
























  • Sounds perfect, thanks!! I'll try it and accept if it works as described!

    – Josh
    Oct 12 '10 at 20:09






  • 3





    This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to go, since I did find that the files had slight differences so only a few could be hardlinked.

    – Josh
    Dec 8 '10 at 20:13






  • 10





    Upvoted this, but after researching some more, I kind of wish I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

    – oligofren
    Jan 3 '15 at 13:42













  • @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This script seems to be the only thing that handles that.

    – phunehehe
    Jun 26 '15 at 6:59






  • 3





    Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

    – Charles Duffy
    Feb 1 '16 at 16:56





















73














rdfind does exactly what you ask for (and in the order johny why lists). It makes it possible to delete duplicates, or to replace them with either soft or hard links. When using symlinks, you can also make the symlink either absolute or relative. You can even pick the checksum algorithm (md5 or sha1).



Since it is compiled it is faster than most scripted solutions: time on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this



9.99s user 3.61s system 66% cpu 20.543 total


(using md5).



Available in most package handlers (e.g. MacPorts for Mac OS X).
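A usage sketch based on the -dryrun true and -makehardlinks true options mentioned in the comments below; the directory path is a placeholder:

rdfind -dryrun true /srv/audio          # preview only: reports what would be done
rdfind -makehardlinks true /srv/audio   # replace duplicates with hard links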






answered Jul 5 '13 at 8:15 by d-b, edited Jul 5 '13 at 9:22 by Tobias Kienzler





















  • 10





    +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

    – Daniel Trebbien
    Dec 29 '13 at 20:49











  • oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

    – oligofren
    Jan 3 '15 at 13:38











  • Very smart and fast algorithm.

    – ndemou
    Oct 30 '15 at 12:53






  • 1





    I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

    – cdhowie
    May 31 '18 at 21:19



















49














Use the fdupes tool:



fdupes -r /path/to/folder gives you a list of duplicates in the directory (-r makes it recursive). The output looks like this:





filename1

filename2



filename3

filename4

filename5





with filename1 and filename2 being identical and filename3, filename4 and filename5 also being identical.
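As the comments below note, some fdupes builds also have a -L option that hardlinks the duplicates it finds (it is missing from certain packaged versions); a sketch, with the path a placeholder:

fdupes -r -L /path/to/folder   # recurse and hardlink duplicates, if your build supports -L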






answered Oct 12 '10 at 20:03 by tante



















  • 1





    Ubuntu Note: As of September 2013, it hasn't had a stable release (it is on 1.50-PR2-3), so the update doesn't appear in ubuntu yet.

    – Stuart Axon
    Aug 28 '13 at 14:19






  • 11





    I just tried installing fdupes_1.50-PR2-4 on both Ubuntu and Debian, neither has the -L flag. Luckily building from github.com/tobiasschulz/fdupes was super easy.

    – neu242
    Aug 30 '13 at 15:07






  • 3





    Try rdfind - like fdupes, but faster and available on OS X and Cygwin as well.

    – oligofren
    Jan 3 '15 at 13:43











  • Or if you just require Linux compatibility, install rmlint which is blazingly fast, and has lots of nice options. Truly a modern alternative.

    – oligofren
    Jan 3 '15 at 14:28






  • 3





    fdupes seems to only find duplicates, not replace them with hardlinks, so not an answer to the question IMO.

    – Calimo
    Nov 8 '17 at 15:58



















22














I use hardlink from http://jak-linux.org/projects/hardlink/
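A minimal sketch, assuming the tool is invoked with the directories to scan as arguments (the path is a placeholder; check hardlink --help for the dry-run and verbosity flags in your build):

hardlink /srv/audio   # scan the tree and replace identical files with hard links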

























  • 1





    Nice hint, I am using on a regular basis code.google.com/p/hardlinkpy but this was not updated for a while...

    – meduz
    Apr 11 '12 at 19:09






  • 2





    This appears to be similar to the original hardlink on Fedora/RHEL/etc.

    – Jack Douglas
    Jun 21 '12 at 8:43






  • 1





    hardlink is now a native binary in many Linux package systems (since ~2014) and extremely fast. For 1,2M files (320GB), it just took 200 seconds (linking roughly 10% of the files).

    – Marcel Waldvogel
    Feb 5 '17 at 19:13











  • FWIW, the above hardlink was created by Julian Andres Klode while the Fedora hardlink was created by Jakub Jelinek (source: pagure.io/hardlink - Fedora package name: hardlink)

    – maxschlepzig
    Jan 4 at 17:52



















18














This is one of the functions provided by "fslint" --
http://en.flossmanuals.net/FSlint/Introduction



Click the "Merge" button:



Screenshot
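For the command-line route, a sketch based on the findup invocation given in the comments below (Ubuntu package layout assumed; -m merges duplicates using hard links):

sudo apt-get install fslint
/usr/share/fslint/fslint/findup -m /your/directory/tree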



























  • 4





    The -m will hardlink duplicates together, -d will delete all but one, and -t will dry run, printing what it would do

    – Azendale
    Oct 29 '12 at 5:57






  • 1





    On Ubuntu here is what to do: sudo apt-get install fslint /usr/share/fslint/fslint/findup -m /your/directory/tree (directory /usr/share/fslint/fslint/ is not in $PATH by default)

    – Jocelyn
    Sep 8 '13 at 15:38





















14














Since your main target is to save disk space, there is another solution: de-duplication (and probably compression) on file system level. Compared with the hard-link solution, it does not have the problem of inadvertently affecting other linked files.



ZFS has had dedup (block-level, not file-level) since pool version 23, and it has had compression for a long time.
If you are using Linux, you may try zfs-fuse, or if you use BSD, it is natively supported.
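For illustration, deduplication and compression are per-dataset properties in ZFS; a hedged sketch, with the pool/dataset name a placeholder (note the RAM caveats discussed in the comments below):

zfs set dedup=on tank/audio         # enable block-level deduplication for the dataset
zfs set compression=on tank/audio   # enable compression as well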






























  • This is probably the way I'll go eventually, however, does BSD's ZFS implementation do dedup? I thought it did not.

    – Josh
    Dec 8 '10 at 20:14











  • In addition, the HAMMER filesystem on DragonFlyBSD has deduplication support.

    – hhaamu
    Jul 15 '12 at 17:48








  • 11





    ZFS dedup is the friend of nobody. Where ZFS recommends 1Gb ram per 1Tb usable disk space, you're friggin' nuts if you try to use dedup with less than 32Gb ram per 1Tb usable disk space. That means that for a 1Tb mirror, if you don't have 32 Gb ram, you are likely to encounter memory bomb conditions sooner or later that will halt the machine due to lack of ram. Been there, done that, still recovering from the PTSD.

    – killermist
    Sep 22 '14 at 18:51






  • 3





    To avoid the excessive RAM requirements with online deduplication (i.e., check on every write), btrfs uses batch or offline deduplication (run it whenever you consider it useful/necessary) btrfs.wiki.kernel.org/index.php/Deduplication

    – Marcel Waldvogel
    Feb 5 '17 at 19:18






  • 2





    Update seven years later: I eventually did move to ZFS and tried deduplication -- I found that its RAM requirements were indeed just far too high. Crafty use of ZFS snapshots provided the solution I ended up using. (Copy one user's music, snapshot and clone, copy the second user's music into the clone using rsync --inplace so only changed blocks are stored)

    – Josh
    Sep 13 '17 at 13:54



















7














On modern Linux these days there's https://github.com/g2p/bedup which de-duplicates on a btrfs filesystem, but 1) without as much of the scan overhead, 2) files can diverge easily again afterwards.






























  • Background and more information is listed on btrfs.wiki.kernel.org/index.php/Deduplication (including reference to cp --reflink, see also below)

    – Marcel Waldvogel
    Feb 5 '17 at 19:22



















5














To find duplicate files you can use duff.




Duff is a Unix command-line utility
for quickly finding duplicates in a
given set of files.




Simply run:



duff -r target-folder


To create hardlinks to those files automatically, you will need to parse the output of duff with bash or some other scripting language.
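As a hedged sketch of that parsing step, assuming duff's default output of a header line per cluster followed by the member file names (test with echo in place of ln first):

duff -r target-folder | while IFS= read -r line; do
    case "$line" in
        [0-9]*' files in cluster '*) first="" ;;   # cluster header: reset the reference file
        *) if [ -z "$first" ]; then
               first="$line"                       # keep the first member as the link target
           else
               ln -f "$first" "$line"              # hard-link later members to it
           fi ;;
    esac
done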






























  • Really slow though -- see rdfind.pauldreik.se/#g0.6

    – ndemou
    Oct 30 '15 at 12:52



















5














aptitude show hardlink


Description: Hardlinks multiple copies of the same file
Hardlink is a tool which detects multiple copies of the same file and replaces them with hardlinks.



The idea has been taken from http://code.google.com/p/hardlinkpy/, but the code has been written from scratch and licensed under the MIT license.
Homepage: http://jak-linux.org/projects/hardlink/
































  • The only program mentioned here available for Gentoo without unmasking and with hardlink support, thanks!

    – Jorrit Schippers
    Mar 9 '15 at 13:48



















4














I've used many of the hardlinking tools for Linux mentioned here.
I too am stuck with ext4 fs, on Ubuntu, and have been using its cp -l and -s for hard/soft linking. But lately I noticed the lightweight copy option in the cp man page, which suggests it would spare the redundant disk space until one side gets modified:



   --reflink[=WHEN]
          control clone/CoW copies. See below

   When --reflink[=always] is specified, perform a lightweight copy, where the
   data blocks are copied only when modified. If this is not possible the
   copy fails, or if --reflink=auto is specified, fall back to a standard copy.
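A usage sketch (the file names are placeholders; as the comments below point out, reflinks need a copy-on-write filesystem such as btrfs, and --reflink=auto silently falls back to a normal copy elsewhere):

cp --reflink=auto original.flac copy.flac   # CoW clone where supported, ordinary copy otherwise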





























  • I think I will update my cp alias to always include the --reflink=auto parameter now

    – Marcos
    Mar 14 '12 at 14:08






  • 1





    Does ext4 really support --reflink?

    – Jack Douglas
    Jun 21 '12 at 8:42






  • 7





    This is supported on btrfs and OCFS2. It is only possible on copy-on-write filesystems, which ext4 is not. btrfs is really shaping up. I love using it because of reflink and snapshots, makes you less scared to do mass operations on big trees of files.

    – clacke
    Jul 3 '12 at 18:57



















3














It seems to me that checking the filename first could speed things up. If two files don't have the same filename, then in many cases I would not consider them to be duplicates. It seems that the quickest method would be to compare, in order:




  • filename

  • size

  • md5 checksum

  • byte contents


Do any methods do this? Look at duff, fdupes, rmlint, fslint, etc.



The following method was top-voted on commandlinefu.com: Find Duplicate Files (based on size first, then MD5 hash)



Can filename comparison be added as a first step, size as a second step?



find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |
xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum |
sort | uniq -w32 --all-repeated=separate
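One hedged way to sketch that filename-plus-size first pass (GNU find assumed; it breaks on file names containing newlines or glob characters, so it is only an illustration):

# List size and basename, keep combinations that occur more than once,
# then checksum only those candidate files.
find . -not -empty -type f -printf '%s %f\n' | sort | uniq -d |
  while read -r size name; do
      find . -type f -size "${size}c" -name "$name" -exec md5sum {} +
  done | sort | uniq -w32 --all-repeated=separate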


























  • 2





    I've used duff, fdupes and rmlint, and strongly recommend readers to look at the third of these. It has an excellent option set (and documentation). With it, I was able to avoid a lot of the post-processing I needed to use with the other tools.

    – dubiousjim
    Sep 2 '15 at 6:32








  • 2





    In my practice, filename is the least reliable factor to look at, and I've completely removed it from any efforts I make at de-duping. How many install.sh files can be found on an active system? I can't count the number of times I've saved a file and had a name clash, with some on-the-fly renaming to save it. Flip side: no idea how many times I've downloaded something from different sources, on different days, only to find they are the same file with different names. (Which also kills the timestamp reliability.) 1: Size, 2: Digest, 3: Byte contents.

    – Gypsy Spellweaver
    Jan 28 '17 at 6:40











  • @GypsySpellweaver: (1) depends on personal use-case, wouldn't you agree? In my case, i have multiple restores from multiple backups, where files with same name and content exist in different restore-folders. (2) Your comment seems to assume comparing filename only. I was not suggesting to eliminate other checks.

    – johny why
    Mar 8 '17 at 21:50



















2














I made a Perl script that does something similar to what you're talking about:



http://pastebin.com/U7mFHZU7



Basically, it just traverses a directory, calculating the SHA1sum of the files in it, hashing it and linking matches together. It's come in handy on many, many occasions.

























  • 2





    I hope to get around to trying this soon... why not upload it on CPAN... App::relink or something

    – xenoterracide
    Feb 7 '11 at 11:12






  • 1





    @xenoterracide: because of all the similar and more mature solutions that already exist. see the other answers, especially rdfind.

    – oligofren
    Jan 3 '15 at 13:36








  • 1





    @oligofren I don't doubt better solutions exist. TMTOWTDI I guess.

    – amphetamachine
    Jan 5 '15 at 15:49



















2














Since I'm not a fan of Perl, here's a bash version:



#!/bin/bash

DIR="/path/to/big/files"

find "$DIR" -type f -exec md5sum {} \; | sort > /tmp/sums-sorted.txt

OLDSUM=""
IFS=$'\n'
for i in `cat /tmp/sums-sorted.txt`; do
    NEWSUM=`echo "$i" | sed 's/ .*//'`
    NEWFILE=`echo "$i" | sed 's/^[^ ]* *//'`
    if [ "$OLDSUM" == "$NEWSUM" ]; then
        echo ln -f "$OLDFILE" "$NEWFILE"
    else
        OLDSUM="$NEWSUM"
        OLDFILE="$NEWFILE"
    fi
done


This finds all files with the same checksum (whether they're big, small, or already hardlinks), and hardlinks them together.



This can be greatly optimized for repeated runs with additional find flags (eg. size) and a file cache (so you don't have to redo the checksums each time). If anyone's interested in the smarter, longer version, I can post it.



NOTE: As has been mentioned before, hardlinks work as long as the files never need modification, or to be moved across filesystems.
































  • How can I change your script, so that instead of hardlinking, it will just delete the duplicate files and add an entry to a CSV file mapping deleted file -> linked file???

    – MR.GEWA
    Jan 12 '13 at 12:17













  • Sure. The hard link line: echo ln -f "$OLDFILE" "$NEWFILE" just replaces the duplicate file with a hard link, so you could change it to rm the $NEWFILE instead.

    – seren
    Jan 13 '13 at 4:15













  • and how, on the next line, to write $OLDFILE -> $NEWFILE to some text file somehow???

    – MR.GEWA
    Jan 13 '13 at 13:12













  • Ahh, right. Yes, add a line after the rm such as: echo "$NEWFILE" >> /var/log/deleted_duplicate_files.log

    – seren
    Jan 14 '13 at 19:28








  • 1





    Don't friggin reinvent the wheel. There are more mature solutions available, like rdfind, that works at native speeds and just requires brew install rdfind or apt-get install rdfind to get installed.

    – oligofren
    Jan 3 '15 at 13:46



















2














If you want to replace duplicates with hard links on Mac or any UNIX-based system, you can try SmartDupe (http://sourceforge.net/projects/smartdupe/), which I am developing.

























  • 3





    Can you expand on how “smart” it is?

    – Stéphane Gimenez
    Nov 4 '12 at 13:25






  • 1





    How can I compare files of two different directories?

    – Burcardo
    May 31 '16 at 8:26



















1














The application FSlint (http://www.pixelbeat.org/fslint/) can find all equal files in any folder (by content) and create hardlinks. Give it a try!



Jorge Sampaio






























  • It hangs scanning a 1TB, almost-full ext3 hard disk and brings the entire system to a crawl. Aborted after 14 hours of "searching".

    – Angsuman Chakraborty
    Sep 12 '16 at 11:09



















0














If you are going to use hard links, pay attention to the rights on the file. Note that owner, group, mode, extended attributes, time and ACLs (if you use them) are stored in the inode. Only the file names differ, because they are stored in the directory structure and point to the inode's properties. For this reason, all file names linked to the same inode have the same access rights. You should prevent modification of that file, because any user can damage the file for everyone else: it is enough for any user to put another file under the same name. The inode number is then kept, and the original file content is destroyed (replaced) for all of the hardlinked names.

A better way is deduplication at the filesystem layer. You can use BTRFS (very popular lately), OCFS or the like. Look at the page https://en.wikipedia.org/wiki/Comparison_of_file_systems, especially the "Features" table and the "data deduplication" column. You can click it and sort :)

Look especially at the ZFS filesystem. It is available as FUSE, but that way it is very slow. If you want native support, look at the page http://zfsonlinux.org/. Then you must patch the kernel and install the zfs tools for management. I don't understand why Linux doesn't ship it as drivers; it works that way for many other operating systems/kernels.

File systems support deduplication in two ways: deduplicating files, or deduplicating blocks. ZFS supports blocks. This means that the same content that repeats within the same file can be deduplicated. The other distinction is when the data are deduplicated: this can be online (ZFS) or offline (BTRFS).

Note that deduplication consumes RAM. This is why writing files to a ZFS volume mounted with FUSE causes dramatically slow performance; it is described in the documentation. But you can turn deduplication on and off online on a volume. If you see that some data should be deduplicated, you simply switch deduplication on, rewrite the files to a temporary location and finally replace the originals; afterwards you can switch deduplication off and restore full performance. Of course, you can also add cache disks to the storage. These can be very fast rotating disks or SSDs, and they can be quite small; in real work they are a replacement for RAM :)

Under Linux you should take care with ZFS, because not everything works as it should, especially when you manage the filesystem, make snapshots, etc., but if you set up the configuration and don't change it, it all works properly. Otherwise you should change Linux to OpenSolaris, which natively supports ZFS :) What is very nice with ZFS is that it works both as a filesystem and as a volume manager similar to LVM; you do not need LVM when you use ZFS. See the documentation if you want to know more.

Note the difference between ZFS and BTRFS. ZFS is older and more mature, but unfortunately only under Solaris and OpenSolaris (unfortunately strangled by Oracle). BTRFS is younger, but lately it has been very well supported. I recommend a fresh kernel. ZFS has online deduplication, which slows down writes because everything is calculated online. BTRFS supports offline deduplication; this saves performance, but means that when the host has nothing to do, you periodically run a tool to perform the deduplication. And BTRFS was created natively under Linux. Maybe this is the better FS for you :)

























  • 1





    I do like the offline (or batch) deduplication approach btrfs has. Excellent discussion of the options (including the cp --reflink option) here: btrfs.wiki.kernel.org/index.php/Deduplication

    – Marcel Waldvogel
    Feb 5 '17 at 19:42











  • ZFS is not Solaris or OpenSolaris only. It's natively supported in FreeBSD. Also, ZFS on Linux is device driver based; ZFS on FUSE is a different thing.

    – KJ Seefried
    Mar 29 '18 at 19:07



















0














Hard links might not be the best idea; if one user changes the file, it affects both. However, deleting a hard link doesn't delete both files. Plus, I am not entirely sure if Hard Links take up the same amount of space (on the hard disk, not the OS) as multiple copies of the same file; according to Windows (with the Link Shell Extension), they do. Granted, that's Windows, not Unix...



My solution would be to create a "common" file in a hidden folder, and replace the actual duplicates with symbolic links... then, the symbolic links would be embedded with metadata or alternate file streams that only records however the two "files" are different from each other, like if one person wants to change the filename or add custom album art or something else like that; it might even be useful outside of database applications, like having multiple versions of the same game or software installed and testing them independently with even the smallest differences.





































    0














    The easiest way is to use the dedicated program
    dupeGuru



    dupeGuru Preferences Screenshot



    as the documentation says:




    Deletion Options



    These options affect how duplicate deletion takes place.
    Most of the time, you don’t need to enable any of them.



    Link deleted files:



    The deleted files are replaced by a link to the reference file.
    You have a choice of replacing it either with a symlink or a hardlink.
    ...
    a symlink is a shortcut to the file’s path.
    If the original file is deleted or moved, the link is broken.
    A hardlink is a link to the file itself.
    That link is as good as a “real” file.
    Only when all hardlinks to a file are deleted is the file itself deleted.



    On OSX and Linux, this feature is supported fully,
    but under Windows, it’s a bit complicated.
    Windows XP doesn’t support it, but Vista and up support it.
    However, for the feature to work,
    dupeGuru has to run with administrative privileges.







    share|improve this answer























      Your Answer








      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "106"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f3037%2fis-there-an-easy-way-to-replace-duplicate-files-with-hardlinks%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      18 Answers
      18






      active

      oldest

      votes








      18 Answers
      18






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      41














      There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:




      Traverse all directories named on the
      command line, compute MD5 checksums
      and find files with identical MD5. IF
      they are equal, do a real comparison
      if they are really equal, replace the
      second of two files with a hard link
      to the first one.







      share|improve this answer
























      • Sounds perfect, thanks!! I'll try it and accept if it works as described!

        – Josh
        Oct 12 '10 at 20:09






      • 3





        This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.

        – Josh
        Dec 8 '10 at 20:13






      • 10





        Upvoted this, but after researching some more, I kind of which I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

        – oligofren
        Jan 3 '15 at 13:42













      • @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This scrips seems to be the only thing that handles that.

        – phunehehe
        Jun 26 '15 at 6:59






      • 3





        Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

        – Charles Duffy
        Feb 1 '16 at 16:56


















      41














      There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:




      Traverse all directories named on the
      command line, compute MD5 checksums
      and find files with identical MD5. IF
      they are equal, do a real comparison
      if they are really equal, replace the
      second of two files with a hard link
      to the first one.







      share|improve this answer
























      • Sounds perfect, thanks!! I'll try it and accept if it works as described!

        – Josh
        Oct 12 '10 at 20:09






      • 3





        This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.

        – Josh
        Dec 8 '10 at 20:13






      • 10





        Upvoted this, but after researching some more, I kind of which I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

        – oligofren
        Jan 3 '15 at 13:42













      • @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This scrips seems to be the only thing that handles that.

        – phunehehe
        Jun 26 '15 at 6:59






      • 3





        Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

        – Charles Duffy
        Feb 1 '16 at 16:56
















      41












      41








      41







      There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:




      Traverse all directories named on the
      command line, compute MD5 checksums
      and find files with identical MD5. IF
      they are equal, do a real comparison
      if they are really equal, replace the
      second of two files with a hard link
      to the first one.







      share|improve this answer













      There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:




      Traverse all directories named on the
      command line, compute MD5 checksums
      and find files with identical MD5. IF
      they are equal, do a real comparison
      if they are really equal, replace the
      second of two files with a hard link
      to the first one.








      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered Oct 12 '10 at 20:04









      fschmittfschmitt

      7,6313043




      7,6313043













      • Sounds perfect, thanks!! I'll try it and accept if it works as described!

        – Josh
        Oct 12 '10 at 20:09






      • 3





        This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.

        – Josh
        Dec 8 '10 at 20:13






      • 10





        Upvoted this, but after researching some more, I kind of which I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

        – oligofren
        Jan 3 '15 at 13:42













      • @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This scrips seems to be the only thing that handles that.

        – phunehehe
        Jun 26 '15 at 6:59






      • 3





        Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

        – Charles Duffy
        Feb 1 '16 at 16:56





















      • Sounds perfect, thanks!! I'll try it and accept if it works as described!

        – Josh
        Oct 12 '10 at 20:09






      • 3





        This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.

        – Josh
        Dec 8 '10 at 20:13






      • 10





        Upvoted this, but after researching some more, I kind of which I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

        – oligofren
        Jan 3 '15 at 13:42













      • @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This scrips seems to be the only thing that handles that.

        – phunehehe
        Jun 26 '15 at 6:59






      • 3





        Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

        – Charles Duffy
        Feb 1 '16 at 16:56



















      Sounds perfect, thanks!! I'll try it and accept if it works as described!

      – Josh
      Oct 12 '10 at 20:09





      Sounds perfect, thanks!! I'll try it and accept if it works as described!

      – Josh
      Oct 12 '10 at 20:09




      3




      3





      This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.

      – Josh
      Dec 8 '10 at 20:13





      This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.

      – Josh
      Dec 8 '10 at 20:13




      10




      10





      Upvoted this, but after researching some more, I kind of which I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

      – oligofren
      Jan 3 '15 at 13:42







      Upvoted this, but after researching some more, I kind of which I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.

      – oligofren
      Jan 3 '15 at 13:42















      @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This scrips seems to be the only thing that handles that.

      – phunehehe
      Jun 26 '15 at 6:59





      @oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This scrips seems to be the only thing that handles that.

      – phunehehe
      Jun 26 '15 at 6:59




      3




      3





      Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

      – Charles Duffy
      Feb 1 '16 at 16:56







      Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).

      – Charles Duffy
      Feb 1 '16 at 16:56















      73














      rdfind does exactly what you ask for (and in the order johny why lists). Makes it possible to delete duplicates, replace them with either soft or hard links. Combined with symlinks you can also make the symlink either absolute or relative. You can even pick checksum algorithm (md5 or sha1).



      Since it is compiled it is faster than most scripted solutions: time on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this



      9.99s user 3.61s system 66% cpu 20.543 total


      (using md5).



      Available in most package handlers (e.g. MacPorts for Mac OS X).






      share|improve this answer





















      • 10





        +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

        – Daniel Trebbien
        Dec 29 '13 at 20:49











      • oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

        – oligofren
        Jan 3 '15 at 13:38











      • Very smart and fast algorithm.

        – ndemou
        Oct 30 '15 at 12:53






      • 1





        I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

        – cdhowie
        May 31 '18 at 21:19
















      73














      rdfind does exactly what you ask for (and in the order johny why lists). Makes it possible to delete duplicates, replace them with either soft or hard links. Combined with symlinks you can also make the symlink either absolute or relative. You can even pick checksum algorithm (md5 or sha1).



      Since it is compiled it is faster than most scripted solutions: time on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this



      9.99s user 3.61s system 66% cpu 20.543 total


      (using md5).



      Available in most package handlers (e.g. MacPorts for Mac OS X).






      share|improve this answer





















      • 10





        +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

        – Daniel Trebbien
        Dec 29 '13 at 20:49











      • oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

        – oligofren
        Jan 3 '15 at 13:38











      • Very smart and fast algorithm.

        – ndemou
        Oct 30 '15 at 12:53






      • 1





        I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

        – cdhowie
        May 31 '18 at 21:19














      73












      73








      73







      rdfind does exactly what you ask for (and in the order johny why lists). Makes it possible to delete duplicates, replace them with either soft or hard links. Combined with symlinks you can also make the symlink either absolute or relative. You can even pick checksum algorithm (md5 or sha1).



      Since it is compiled it is faster than most scripted solutions: time on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this



      9.99s user 3.61s system 66% cpu 20.543 total


      (using md5).



      Available in most package handlers (e.g. MacPorts for Mac OS X).






      share|improve this answer















      rdfind does exactly what you ask for (and in the order johny why lists). Makes it possible to delete duplicates, replace them with either soft or hard links. Combined with symlinks you can also make the symlink either absolute or relative. You can even pick checksum algorithm (md5 or sha1).



      Since it is compiled it is faster than most scripted solutions: time on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this



      9.99s user 3.61s system 66% cpu 20.543 total


      (using md5).



      Available in most package handlers (e.g. MacPorts for Mac OS X).







      share|improve this answer














      share|improve this answer



      share|improve this answer








      edited Jul 5 '13 at 9:22









      Tobias Kienzler

      4,349104589




      4,349104589










      answered Jul 5 '13 at 8:15









      d-bd-b

      94878




      94878








      • 10





        +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

        – Daniel Trebbien
        Dec 29 '13 at 20:49











      • oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

        – oligofren
        Jan 3 '15 at 13:38











      • Very smart and fast algorithm.

        – ndemou
        Oct 30 '15 at 12:53






      • 1





        I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

        – cdhowie
        May 31 '18 at 21:19














      • 10





        +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

        – Daniel Trebbien
        Dec 29 '13 at 20:49











      • oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

        – oligofren
        Jan 3 '15 at 13:38











      • Very smart and fast algorithm.

        – ndemou
        Oct 30 '15 at 12:53






      • 1





        I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

        – cdhowie
        May 31 '18 at 21:19








      10




      10





      +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

      – Daniel Trebbien
      Dec 29 '13 at 20:49





      +1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.

      – Daniel Trebbien
      Dec 29 '13 at 20:49













      oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

      – oligofren
      Jan 3 '15 at 13:38





      oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!

      – oligofren
      Jan 3 '15 at 13:38













      Very smart and fast algorithm.

      – ndemou
      Oct 30 '15 at 12:53





      Very smart and fast algorithm.

      – ndemou
      Oct 30 '15 at 12:53




      1




      1





      I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

      – cdhowie
      May 31 '18 at 21:19





      I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.

      – cdhowie
      May 31 '18 at 21:19











      49














      Use the fdupes tool:



      fdupes -r /path/to/folder gives you a list of duplicates in the directory (-r makes it recursive). The output looks like this:





      filename1

      filename2



      filename3

      filename4

      filename5





      with filename1 and filename2 being identical and filename3, filename4 and filename5 also being identical.






      share|improve this answer



















      • 1





        Ubuntu Note: As of September 2013, it hasn't had a stable release (it is on 1.50-PR2-3), so the update doesn't appear in ubuntu yet.

        – Stuart Axon
        Aug 28 '13 at 14:19






      • 11





        I just tried installing fdupes_1.50-PR2-4 on both Ubuntu and Debian, neither has the -L flag. Luckily building from github.com/tobiasschulz/fdupes was super easy.

        – neu242
        Aug 30 '13 at 15:07






      • 3





        Try rdfind - like fdupes, but faster and available on OS X and Cygwin as well.

        – oligofren
        Jan 3 '15 at 13:43











      • Or if you just requre Linux compatibility, install rmlint which is blazingly fast, and has lots of nice options. Truly a modern alternative.

        – oligofren
        Jan 3 '15 at 14:28






      • 3





        fdupes seems to only find duplicates, not replace them with hardlinks, so not an answer to the question IMO.

        – Calimo
        Nov 8 '17 at 15:58
















22

I use hardlink from http://jak-linux.org/projects/hardlink/

answered Oct 18 '11 at 4:24 – waltinator

  • 1

    Nice hint; I have been using code.google.com/p/hardlinkpy on a regular basis, but that has not been updated for a while...

    – meduz
    Apr 11 '12 at 19:09

  • 2

    This appears to be similar to the original hardlink on Fedora/RHEL/etc.

    – Jack Douglas
    Jun 21 '12 at 8:43

  • 1

    hardlink is now a native binary in many Linux package systems (since ~2014) and extremely fast. For 1.2M files (320 GB), it took just 200 seconds (linking roughly 10% of the files).

    – Marcel Waldvogel
    Feb 5 '17 at 19:13

  • FWIW, the above hardlink was created by Julian Andres Klode, while the Fedora hardlink was created by Jakub Jelinek (source: pagure.io/hardlink - Fedora package name: hardlink).

    – maxschlepzig
    Jan 4 at 17:52
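A hedged usage sketch; the exact switches differ between the Debian/jak-linux tool and the Fedora one, so treat both flags and the /srv/music path as assumptions and check hardlink --help on your system first:

hardlink -n -v /srv/music    # dry run (assumed flag): report what would be linked
hardlink -v /srv/music       # replace duplicates under /srv/music with hardlinks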
18

This is one of the functions provided by "fslint" --
http://en.flossmanuals.net/FSlint/Introduction

Click the "Merge" button:

[Screenshot]

answered Dec 18 '10 at 22:38 – LJ Wobker

  • 4

    The -m will hardlink duplicates together, -d will delete all but one, and -t will dry run, printing what it would do.

    – Azendale
    Oct 29 '12 at 5:57

  • 1

    On Ubuntu, here is what to do:
    sudo apt-get install fslint
    /usr/share/fslint/fslint/findup -m /your/directory/tree
    (the directory /usr/share/fslint/fslint/ is not in $PATH by default)

    – Jocelyn
    Sep 8 '13 at 15:38
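Putting the two comments above together, a command-line run (no GUI) would look roughly like this; the findup path is the Debian/Ubuntu install location and the flags are the ones Azendale describes, so verify them against findup --help on your system. /srv/music is a placeholder.

sudo apt-get install fslint
/usr/share/fslint/fslint/findup -t /srv/music   # dry run: print what would be done
/usr/share/fslint/fslint/findup -m /srv/music   # hardlink ("merge") duplicates together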
14

Since your main target is to save disk space, there is another solution: de-duplication (and possibly compression) at the file-system level. Compared with the hard-link solution, it does not have the problem of inadvertently affecting other linked files.

ZFS has had dedup (block-level, not file-level) since pool version 23, and compression for much longer.
If you are using Linux, you may try zfs-fuse; if you use BSD, it is natively supported.

answered Oct 13 '10 at 5:13 – Wei-Yin

  • This is probably the way I'll go eventually; however, does BSD's ZFS implementation do dedup? I thought it did not.

    – Josh
    Dec 8 '10 at 20:14

  • In addition, the HAMMER filesystem on DragonFlyBSD has deduplication support.

    – hhaamu
    Jul 15 '12 at 17:48

  • 11

    ZFS dedup is the friend of nobody. Where ZFS recommends 1 GB RAM per 1 TB of usable disk space, you're friggin' nuts if you try to use dedup with less than 32 GB RAM per 1 TB of usable disk space. That means that for a 1 TB mirror, if you don't have 32 GB RAM, you are likely to encounter memory-bomb conditions sooner or later that will halt the machine due to lack of RAM. Been there, done that, still recovering from the PTSD.

    – killermist
    Sep 22 '14 at 18:51

  • 3

    To avoid the excessive RAM requirements of online deduplication (i.e., checking on every write), btrfs uses batch or offline deduplication (run it whenever you consider it useful/necessary): btrfs.wiki.kernel.org/index.php/Deduplication

    – Marcel Waldvogel
    Feb 5 '17 at 19:18

  • 2

    Update seven years later: I eventually did move to ZFS and tried deduplication -- I found that its RAM requirements were indeed just far too high. Crafty use of ZFS snapshots provided the solution I ended up using. (Copy one user's music, snapshot and clone, copy the second user's music into the clone using rsync --inplace so only changed blocks are stored.)

    – Josh
    Sep 13 '17 at 13:54
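For reference, turning these properties on is one command per dataset. This is a hedged sketch with placeholder pool/dataset names (tank/music); keep the RAM warnings in the comments above in mind before enabling dedup.

zfs set dedup=on tank/music          # block-level dedup for this dataset (very RAM-hungry)
zfs set compression=lz4 tank/music   # lz4 assumes a reasonably recent ZFS; older ones use compression=on
zfs get dedup,compression tank/music # confirm the properties took effect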
7

On modern Linux these days there's https://github.com/g2p/bedup, which de-duplicates on a btrfs filesystem, but 1) without as much of the scan overhead, and 2) files can easily diverge again afterwards.

answered Jan 8 '14 at 17:37 – Matthew Bloch

  • Background and more information is listed on btrfs.wiki.kernel.org/index.php/Deduplication (including a reference to cp --reflink; see also below).

    – Marcel Waldvogel
    Feb 5 '17 at 19:22
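A hedged sketch of what using it looks like; both the pip package name and the dedup subcommand are assumptions taken from the project README, and the mount point is a placeholder, so check bedup --help before relying on any of this:

pip install --user bedup        # package name assumed; needs Python and btrfs userland headers
sudo bedup dedup /mnt/btrfs     # subcommand assumed: scan the volume and share identical extents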
5

To find duplicate files you can use duff.

    Duff is a Unix command-line utility for quickly finding duplicates in a given set of files.

Simply run:

duff -r target-folder

To create hardlinks to those files automatically, you will need to parse the output of duff with bash or some other scripting language.

answered Oct 12 '10 at 20:00 – Stefan

  • Really slow though -- see rdfind.pauldreik.se/#g0.6

    – ndemou
    Oct 30 '15 at 12:52
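For example, a minimal, hedged sketch of that parsing step: it relies on duff's default output, where each cluster starts with a header line of the form "N files in cluster M (...)" followed by the file names, assumes filenames contain no newlines, and keeps the first file of each cluster as the master. Leave the echo in place for a dry run.

#!/bin/bash
# Hedged sketch: hardlink every file in a duff cluster to the cluster's first file.
duff -r "${1:-.}" | while IFS= read -r line; do
    case "$line" in
        *' files in cluster '*)                # a new cluster header
            master="" ;;
        *)  if [ -z "$master" ]; then
                master="$line"                 # first file of the cluster: keep it
            else
                echo ln -f "$master" "$line"   # drop `echo` to actually link
            fi ;;
    esac
done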
5

aptitude show hardlink

    Description: Hardlinks multiple copies of the same file
      Hardlink is a tool which detects multiple copies of the same file and replaces them with hardlinks.
      The idea has been taken from http://code.google.com/p/hardlinkpy/, but the code has been written from scratch and licensed under the MIT license.
    Homepage: http://jak-linux.org/projects/hardlink/

answered Nov 22 '13 at 15:03 – Julien Palard

  • The only program mentioned here available for Gentoo without unmasking and with hardlink support, thanks!

    – Jorrit Schippers
    Mar 9 '15 at 13:48
4

I've used many of the hardlinking tools for Linux mentioned here.
I too am stuck with an ext4 fs, on Ubuntu, and have been using its cp -l and -s for hard/soft linking. But lately I noticed the lightweight copy in the cp man page, which would imply sparing the redundant disk space until one side gets modified:

   --reflink[=WHEN]
          control clone/CoW copies. See below

   When --reflink[=always] is specified, perform a lightweight copy, where the
   data blocks are copied only when modified. If this is not possible the
   copy fails, or if --reflink=auto is specified, fall back to a standard copy.

answered Mar 14 '12 at 9:59 – Marcos

  • I think I will update my cp alias to always include the --reflink=auto parameter now.

    – Marcos
    Mar 14 '12 at 14:08

  • 1

    Does ext4 really support --reflink?

    – Jack Douglas
    Jun 21 '12 at 8:42

  • 7

    This is supported on btrfs and OCFS2. It is only possible on copy-on-write filesystems, which ext4 is not. btrfs is really shaping up. I love using it because of reflink and snapshots; it makes you less scared to do mass operations on big trees of files.

    – clacke
    Jul 3 '12 at 18:57
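Concretely, usage looks like this; as the comments above point out, the always form only succeeds on filesystems with reflink/CoW support (btrfs, OCFS2, and more recently XFS), while auto silently falls back to a normal copy elsewhere. The file names are placeholders.

cp --reflink=always original.flac copy.flac   # share data blocks (CoW); fails on ext4
cp --reflink=auto   original.flac copy.flac   # reflink where possible, ordinary copy otherwise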
3

It seems to me that checking the filename first could speed things up. If two files lack the same filename, then in many cases I would not consider them to be duplicates. It seems that the quickest method would be to compare, in order:

  • filename
  • size
  • md5 checksum
  • byte contents

Do any methods do this? Look at duff, fdupes, rmlint, fslint, etc.

The following method was top-voted on commandlinefu.com: Find Duplicate Files (based on size first, then MD5 hash)

Can filename comparison be added as a first step, size as a second step?

find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |
  xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum |
  sort | uniq -w32 --all-repeated=separate

answered Jul 9 '12 at 15:02 – johny why

  • 2

    I've used duff, fdupes and rmlint, and strongly recommend readers to look at the third of these. It has an excellent option set (and documentation). With it, I was able to avoid a lot of the post-processing I needed with the other tools.

    – dubiousjim
    Sep 2 '15 at 6:32

  • 2

    In my practice, filename is the least reliable factor to look at, and I've completely removed it from any de-duping efforts I make. How many install.sh files can be found on an active system? I can't count the number of times I've saved a file and had a name clash, with some on-the-fly renaming to save it. Flip side: no idea how many times I've downloaded something from different sources, on different days, only to find they are the same file with different names. (Which also kills the timestamp reliability.) 1: size, 2: digest, 3: byte contents.

    – Gypsy Spellweaver
    Jan 28 '17 at 6:40

  • @GypsySpellweaver: (1) depends on personal use-case, wouldn't you agree? In my case, I have multiple restores from multiple backups, where files with the same name and content exist in different restore folders. (2) Your comment seems to assume comparing filename only. I was not suggesting eliminating the other checks.

    – johny why
    Mar 8 '17 at 21:50
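To sketch an answer to the question in the post (treat this as a hedged, GNU-tools-only illustration, not a vetted one-liner): group candidates by basename and size first, and only checksum the files that still collide. It assumes filenames contain no tabs or newlines.

find . -type f -printf '%f\t%s\t%p\n' |
  sort |                                      # identical basename+size sort together
  awk -F'\t' '{ key = $1 FS $2 }
              key == prev { print prevpath; print $3 }
              { prev = key; prevpath = $3 }' |
  sort -u |                                   # each colliding path exactly once
  xargs -d '\n' md5sum |
  sort | uniq -w32 --all-repeated=separate    # same final grouping as the pipeline above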
2

I made a Perl script that does something similar to what you're talking about:

http://pastebin.com/U7mFHZU7

Basically, it just traverses a directory, calculating the SHA1 sum of the files in it, hashing them and linking matches together. It's come in handy on many, many occasions.

answered Jan 31 '11 at 2:06 – amphetamachine

  • 2

    I hope to get around to trying this soon... why not upload it to CPAN... App::relink or something?

    – xenoterracide
    Feb 7 '11 at 11:12

  • 1

    @xenoterracide: because of all the similar and more mature solutions that already exist. See the other answers, especially rdfind.

    – oligofren
    Jan 3 '15 at 13:36

  • 1

    @oligofren I don't doubt better solutions exist. TMTOWTDI I guess.

    – amphetamachine
    Jan 5 '15 at 15:49
      2














      Since I'm not a fan of Perl, here's a bash version:



#!/bin/bash
DIR="/path/to/big/files"

# Checksum every file and sort, so identical sums end up on adjacent lines
find "$DIR" -type f -exec md5sum {} \; | sort > /tmp/sums-sorted.txt

OLDSUM=""
IFS=$'\n'
for i in `cat /tmp/sums-sorted.txt`; do
    NEWSUM=`echo "$i" | sed 's/ .*//'`
    NEWFILE=`echo "$i" | sed 's/^[^ ]* *//'`
    if [ "$OLDSUM" == "$NEWSUM" ]; then
        # Same checksum as the previous file: replace it with a hard link
        # (drop the echo to actually link instead of just printing the command)
        echo ln -f "$OLDFILE" "$NEWFILE"
    else
        OLDSUM="$NEWSUM"
        OLDFILE="$NEWFILE"
    fi
done


      This finds all files with the same checksum (whether they're big, small, or already hardlinks), and hardlinks them together.



This can be greatly optimized for repeated runs with additional find flags (e.g. size) and a file cache (so you don't have to redo the checksums each time). If anyone's interested in the smarter, longer version, I can post it.



      NOTE: As has been mentioned before, hardlinks work as long as the files never need modification, or to be moved across filesystems.
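
(A minimal sketch of the size pre-filter mentioned above, as an illustration rather than the author's longer cached version; it assumes GNU find/awk/xargs and file names without tabs or newlines.)

#!/bin/bash
# Sketch: only checksum files whose size occurs more than once,
# since a file with a unique size cannot have a duplicate.
DIR="/path/to/big/files"

find "$DIR" -type f -printf '%s\t%p\n' | sort -n > /tmp/sizes.txt

# sizes that appear at least twice
cut -f1 /tmp/sizes.txt | uniq -d > /tmp/dup-sizes.txt

# checksum only those candidate files, then reuse the loop above unchanged
awk -F'\t' 'NR==FNR {dup[$1]; next} $1 in dup {print $2}' \
    /tmp/dup-sizes.txt /tmp/sizes.txt |
  xargs -d '\n' md5sum | sort > /tmp/sums-sorted.txt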






      share|improve this answer


























• How can I change your script so that, instead of hardlinking, it just deletes the duplicate files and adds an entry to a CSV file recording deleted file -> linked file?

        – MR.GEWA
        Jan 12 '13 at 12:17













• Sure. The hard-link line, echo ln -f "$OLDFILE" "$NEWFILE", just replaces the duplicate file with a hard link, so you could change it to rm "$NEWFILE" instead.

        – seren
        Jan 13 '13 at 4:15













• And how, on the next line, would I write something like $OLDFILE -> $NEWFILE to a text file?

        – MR.GEWA
        Jan 13 '13 at 13:12













• Ahh, right. Yes, add a line after the rm, such as: echo "$NEWFILE" >> /var/log/deleted_duplicate_files.log (see the consolidated sketch after these comments).

        – seren
        Jan 14 '13 at 19:28








      • 1





Don't friggin reinvent the wheel. There are more mature solutions available, like rdfind, which works at native speed and just requires brew install rdfind or apt-get install rdfind to install.

        – oligofren
        Jan 3 '15 at 13:46
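
(Consolidating that exchange: a sketch of the delete-and-log variant, meant to replace the if-block inside the script above; the CSV path is only an example.)

if [ "$OLDSUM" == "$NEWSUM" ]; then
    # delete the duplicate instead of hard-linking it,
    # and record "deleted file,kept file" as a CSV line
    rm -- "$NEWFILE"
    echo "\"$NEWFILE\",\"$OLDFILE\"" >> /var/log/deleted_duplicate_files.csv
else
    OLDSUM="$NEWSUM"
    OLDFILE="$NEWFILE"
fi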
















answered Jul 3 '12 at 5:15 – seren
edited Jul 3 '12 at 11:04 – Mat











      2














If you want to replace duplicates with hard links on macOS or any UNIX-based system, you can try SmartDupe (http://sourceforge.net/projects/smartdupe/); I'm developing it.






      share|improve this answer



















      • 3





        Can you expand on how “smart” it is?

        – Stéphane Gimenez
        Nov 4 '12 at 13:25






      • 1





        How can I compare files of two different directories?

        – Burcardo
        May 31 '16 at 8:26
















answered Nov 4 '12 at 0:57 – islam











      1














The application FSLint (http://www.pixelbeat.org/fslint/) can find all equal files in any folder (by content) and create hardlinks. Give it a try!



      Jorge Sampaio
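
(FSLint is primarily a GUI, but as far as I recall it also ships command-line backends under /usr/share/fslint/fslint/; the path and flags below are from memory, so check findup --help before relying on them.)

/usr/share/fslint/fslint/findup /path/to/music       # list duplicate files (by content)
/usr/share/fslint/fslint/findup -m /path/to/music    # merge duplicates into hard links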






      share|improve this answer
























• It hangs scanning a 1 TB, almost-full ext3 hard disk and brings the entire system to a crawl. Aborted after 14 hours of "searching".

        – Angsuman Chakraborty
        Sep 12 '16 at 11:09
















answered Jan 15 '15 at 16:29 – Jorge H B Sampaio Jr











      0














If you use hard links, pay attention to the permissions on the file. Owner, group, mode, extended attributes, times and ACLs (if you use them) are stored in the inode; only the file names differ, because they live in the directory structure and point to the inode's properties. Consequently, all file names linked to the same inode have the same access rights. You should prevent modification of that file, because any user who can write to it can damage it for everyone else: it is enough for a user to write other content under the same name. The inode number is kept, and the original file content is destroyed (replaced) for all hardlinked names.

A better way is deduplication at the filesystem layer. You can use BTRFS (very popular lately), OCFS or the like. Look at the page https://en.wikipedia.org/wiki/Comparison_of_file_systems , especially at the Features table and the "data deduplication" column. You can click it and sort :)

Look especially at the ZFS filesystem. It is available via FUSE, but that way it is very slow. If you want native support, look at http://zfsonlinux.org/ . You then have to patch the kernel and install the zfs tools for management. I don't understand why Linux doesn't ship it as a driver; it works that way for many other operating systems / kernels.

Filesystems support deduplication in two ways: per file or per block. ZFS deduplicates blocks, which means the same content repeated within one file can be deduplicated as well. The other distinction is when the data gets deduplicated: online (ZFS) or offline (BTRFS).

Note that deduplication consumes RAM. This is why writing files to a ZFS volume mounted via FUSE gives dramatically slow performance; it is described in the documentation. But you can switch deduplication on and off per volume online. If you see that some data should be deduplicated, you simply switch deduplication on, rewrite some files to a temporary place, and finally replace them; afterwards you can switch deduplication off again and restore full performance. Of course, you can also add cache disks to the storage: very fast rotating disks or SSDs, which can be quite small. In real work this is a replacement for RAM :)

Under Linux you should take care with ZFS, because not everything works as it should, especially when you manage the filesystem, make snapshots, etc.; but if you set up the configuration once and don't change it, it all works properly. Otherwise you should switch from Linux to OpenSolaris, which supports ZFS natively :) What is very nice about ZFS is that it works both as a filesystem and as a volume manager similar to LVM, so you don't need LVM when you use ZFS. See the documentation if you want to know more.

Note the difference between ZFS and BTRFS. ZFS is older and more mature, but unfortunately only under Solaris and OpenSolaris (unfortunately strangled by Oracle). BTRFS is younger, but lately it has been supported very well; I recommend a fresh kernel. ZFS has online deduplication, which slows down writes because everything is calculated online. BTRFS supports offline deduplication: that preserves performance, and when the host has nothing to do you periodically run a tool that does the deduplication. And BTRFS was created natively under Linux. Maybe that is the better FS for you :)
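
(For illustration, roughly what the two approaches look like on the command line; pool/dataset names and paths are placeholders, duperemove is only one of several offline BTRFS dedupers, and cp --reflink needs a CoW filesystem such as BTRFS.)

# ZFS: online (inline) deduplication, toggled per dataset
zfs set dedup=on tank/music
zfs get dedup,compressratio tank/music

# BTRFS: offline/batch deduplication with an external tool
duperemove -dhr /mnt/music        # -d dedupe, -h human-readable, -r recursive

# BTRFS: share data explicitly at copy time with a reflink (CoW clone)
cp --reflink=always original.flac copy.flac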






      share|improve this answer



















      • 1





        I do like the offline (or batch) deduplication approach btrfs has. Excellent discussion of the options (including the cp --reflink option) here: btrfs.wiki.kernel.org/index.php/Deduplication

        – Marcel Waldvogel
        Feb 5 '17 at 19:42











      • ZFS is not Solaris or OpenSolaris only. It's natively supported in FreeBSD. Also, ZFS on Linux is device driver based; ZFS on FUSE is a different thing.

        – KJ Seefried
        Mar 29 '18 at 19:07
















answered Jun 24 '14 at 8:51 – Znik











      0














Hard links might not be the best idea; if one user changes the file, it affects everyone who links to it. However, deleting a hard link doesn't delete the other names. As for space, hard links point at the same data on disk, so they shouldn't take up extra room, even though Windows (with the Link Shell Extension) reports the full size for each name. Granted, that's Windows, not Unix...

My solution would be to create a "common" file in a hidden folder and replace the actual duplicates with symbolic links... the symbolic links could then carry metadata or alternate file streams that only record how the two "files" differ from each other, e.g. if one person wants to change the filename or add custom album art or something else like that; it might even be useful outside of database applications, like having multiple versions of the same game or software installed and testing them independently with even the smallest differences.
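
(A sketch of that common-file-plus-symlinks idea on the Unix side; every path here is a hypothetical example, and the metadata/alternate-stream part is left out.)

#!/bin/bash
# Move one real copy into a hidden store, then point both old names at it.
store="/srv/music/.common"
mkdir -p "$store"

orig="/srv/music/alice/song.flac"     # hypothetical duplicate pair
dup="/srv/music/bob/song.flac"

common="$store/$(sha1sum -- "$orig" | awk '{print $1}').flac"
mv -- "$orig" "$common"               # keep a single real copy
ln -s -- "$common" "$orig"            # both former names become symlinks
ln -sf -- "$common" "$dup"            # -f replaces the duplicate regular file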






      share|improve this answer




























answered May 3 '16 at 18:43 – Amaroq Starwind























              0














The easiest way is to use a dedicated program, dupeGuru.

[dupeGuru Preferences screenshot]

As the documentation says:




              Deletion Options



              These options affect how duplicate deletion takes place.
              Most of the time, you don’t need to enable any of them.



              Link deleted files:



              The deleted files are replaced by a link to the reference file.
              You have a choice of replacing it either with a symlink or a hardlink.
              ...
              a symlink is a shortcut to the file’s path.
              If the original file is deleted or moved, the link is broken.
              A hardlink is a link to the file itself.
              That link is as good as a “real” file.
              Only when all hardlinks to a file are deleted is the file itself deleted.



              On OSX and Linux, this feature is supported fully,
              but under Windows, it’s a bit complicated.
              Windows XP doesn’t support it, but Vista and up support it.
              However, for the feature to work,
              dupeGuru has to run with administrative privileges.
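
(A quick way to see that last point for yourself, using nothing but coreutils in a scratch directory.)

cd "$(mktemp -d)"
echo "some audio data" > a.flac
ln a.flac b.flac             # second hard link to the same inode
stat -c '%h' a.flac          # prints 2: two names, one file
rm a.flac                    # removes one name only
cat b.flac                   # the data is still there
rm b.flac                    # the last link is gone, now the data is too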







              share|improve this answer




























answered Jun 13 '17 at 14:20 – Russian Junior Ruby Developer





























