Is there an easy way to replace duplicate files with hardlinks?
I'm looking for an easy way (a command or series of commands, probably involving find) to find duplicate files in two directories, and replace the files in one directory with hardlinks of the files in the other directory.
Here's the situation: This is a file server which multiple people store audio files on, each user having their own folder. Sometimes multiple people have copies of the exact same audio files. Right now, these are duplicates. I'd like to make it so they're hardlinks, to save hard drive space.
files hard-link deduplication duplicate-files
20
One problem you may run into with hardlinks is that if somebody decides to do something to one of their music files that you've hard-linked, they could inadvertently affect other people's access to their music.
– Steven D
Oct 13 '10 at 2:48
4
Another problem is that two different files containing "Some Really Great Tune", even if taken from the same source with the same encoder, will very likely not be bit-for-bit identical.
– msw
Oct 13 '10 at 2:57
3
A better solution might be to have a public music folder...
– Stefan
Oct 13 '10 at 7:08
3
related: superuser.com/questions/140819/ways-to-deduplicate-files
– David Cary
Mar 16 '11 at 23:59
1
@tante: Using symlinks solves no problem. When a user "deletes" a file, the number of links to it gets decremented; when the count reaches zero, the file gets really deleted, that's all. So deletion is no problem with hardlinked files; the only problem is a user trying to edit the file (improbable indeed) or to overwrite it (quite possible if logged in).
– maaartinus
Mar 14 '12 at 3:56
edited 38 mins ago by Jeff Schaller
asked Oct 12 '10 at 19:23 by Josh
18 Answers
There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:
Traverse all directories named on the command line, compute MD5 checksums and find files with identical MD5. If they are equal, do a real comparison. If they are really equal, replace the second of two files with a hard link to the first one.
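For reference, a minimal invocation sketch, assuming the script has been downloaded locally; the user folders are illustrative, and the script hardlinks duplicates across all directories given on the command line:
# fetch the script, then point it at the directories to deduplicate
wget http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl
perl trimtrees.pl /srv/audio/alice /srv/audio/bob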
Sounds perfect, thanks!! I'll try it and accept if it works as described!
– Josh
Oct 12 '10 at 20:09
3
This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to go, since I did find that the files had slight differences so only a few could be hardlinked.
– Josh
Dec 8 '10 at 20:13
10
Upvoted this, but after researching some more, I kind of wish I didn't. rdfind is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.
– oligofren
Jan 3 '15 at 13:42
@oligofren I was thinking the same, but then I hit [Errno 31] Too many links. This script seems to be the only thing that handles that.
– phunehehe
Jun 26 '15 at 6:59
3
Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).
– Charles Duffy
Feb 1 '16 at 16:56
rdfind does exactly what you ask for (and in the order johny why lists). It makes it possible to delete duplicates, or to replace them with either soft or hard links. Combined with symlinks you can also make the symlink either absolute or relative. You can even pick the checksum algorithm (md5 or sha1).
Since it is compiled, it is faster than most scripted solutions: time on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this (using md5):
9.99s user 3.61s system 66% cpu 20.543 total
Available in most package handlers (e.g. MacPorts for Mac OS X).
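A minimal sketch of the hardlinking workflow, using the -dryrun true and -makehardlinks true options mentioned in the comments below; the paths are illustrative:
# preview what rdfind would do without changing anything
rdfind -dryrun true /srv/audio/alice /srv/audio/bob
# replace lower-ranked duplicates with hard links to the highest-ranked copy
rdfind -makehardlinks true /srv/audio/alice /srv/audio/bob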
10
+1 I used rdfind and loved it. It has a -dryrun true option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.
– Daniel Trebbien
Dec 29 '13 at 20:49
oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!
– oligofren
Jan 3 '15 at 13:38
Very smart and fast algorithm.
– ndemou
Oct 30 '15 at 12:53
1
I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.
– cdhowie
May 31 '18 at 21:19
Use the fdupes tool:
fdupes -r /path/to/folder
gives you a list of duplicates in the directory (-r makes it recursive). The output looks like this:
filename1
filename2
filename3
filename4
filename5
with filename1 and filename2 being identical and filename3, filename4 and filename5 also being identical.
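As the comments below note, some fdupes builds also have a -L option that hardlinks the duplicates for you; this sketch assumes your build provides it (several distro packages have dropped it):
# list duplicate files recursively
fdupes -r /srv/audio
# hardlink duplicates, where the -L option exists in your fdupes build
fdupes -r -L /srv/audio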
1
Ubuntu Note: As of September 2013, it hasn't had a stable release (it is on 1.50-PR2-3), so the update doesn't appear in ubuntu yet.
– Stuart Axon
Aug 28 '13 at 14:19
11
I just tried installing fdupes_1.50-PR2-4 on both Ubuntu and Debian, neither has the -L flag. Luckily building from github.com/tobiasschulz/fdupes was super easy.
– neu242
Aug 30 '13 at 15:07
3
Try rdfind - like fdupes, but faster and available on OS X and Cygwin as well.
– oligofren
Jan 3 '15 at 13:43
Or if you just require Linux compatibility, install rmlint, which is blazingly fast and has lots of nice options. Truly a modern alternative.
– oligofren
Jan 3 '15 at 14:28
3
fdupes
seems to only find duplicates, not replace them with hardlinks, so not an answer to the question IMO.
– Calimo
Nov 8 '17 at 15:58
I use hardlink from http://jak-linux.org/projects/hardlink/
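A minimal usage sketch; both implementations discussed in the comments take one or more directories to scan, but their options differ, so check hardlink --help on your system first:
# scan the tree and replace identical files with hard links
hardlink /srv/audio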
1
Nice hint, I am using code.google.com/p/hardlinkpy on a regular basis, but this was not updated for a while...
– meduz
Apr 11 '12 at 19:09
2
This appears to be similar to the original hardlink on Fedora/RHEL/etc.
– Jack Douglas
Jun 21 '12 at 8:43
1
hardlink is now a native binary in many Linux package systems (since ~2014) and extremely fast. For 1.2M files (320 GB), it took just 200 seconds (linking roughly 10% of the files).
– Marcel Waldvogel
Feb 5 '17 at 19:13
FWIW, the above hardlink was created by Julian Andres Klode while the Fedora hardlink was created by Jakub Jelinek (source: pagure.io/hardlink - Fedora package name: hardlink)
– maxschlepzig
Jan 4 at 17:52
This is one of the functions provided by "fslint" --
http://en.flossmanuals.net/FSlint/Introduction
Click the "Merge" button.
4
The -m will hardlink duplicates together, -d will delete all but one, and -t will dry run, printing what it would do
– Azendale
Oct 29 '12 at 5:57
1
On Ubuntu, here is what to do: sudo apt-get install fslint and then /usr/share/fslint/fslint/findup -m /your/directory/tree (the directory /usr/share/fslint/fslint/ is not in $PATH by default)
– Jocelyn
Sep 8 '13 at 15:38
Since your main target is to save disk space, there is another solution: de-duplication (and probably compression) on file system level. Compared with the hard-link solution, it does not have the problem of inadvertently affecting other linked files.
ZFS has had dedup (block-level, not file-level) since pool version 23, and compression for a long time.
If you are using Linux, you may try zfs-fuse; if you use BSD, it is natively supported.
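A brief sketch of enabling both features on an existing dataset; tank/audio is an illustrative pool/dataset name, and dedup needs pool version 23 or newer:
# enable block-level deduplication and compression for the dataset
zfs set dedup=on tank/audio
zfs set compression=on tank/audio
# note: only data written after setting the properties is deduplicated/compressed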
This is probably the way I'll go eventually, however, does BSD's ZFS implementation do dedup? I thought it did not.
– Josh
Dec 8 '10 at 20:14
In addition, the HAMMER filesystem on DragonFlyBSD has deduplication support.
– hhaamu
Jul 15 '12 at 17:48
11
ZFS dedup is the friend of nobody. Where ZFS recommends 1Gb ram per 1Tb usable disk space, you're friggin' nuts if you try to use dedup with less than 32Gb ram per 1Tb usable disk space. That means that for a 1Tb mirror, if you don't have 32 Gb ram, you are likely to encounter memory bomb conditions sooner or later that will halt the machine due to lack of ram. Been there, done that, still recovering from the PTSD.
– killermist
Sep 22 '14 at 18:51
3
To avoid the excessive RAM requirements with online deduplication (i.e., check on every write), btrfs uses batch or offline deduplication (run it whenever you consider it useful/necessary): btrfs.wiki.kernel.org/index.php/Deduplication
– Marcel Waldvogel
Feb 5 '17 at 19:18
2
Update seven years later: I eventually did move to ZFS and tried deduplication -- I found that its RAM requirements were indeed just far too high. Crafty use of ZFS snapshots provided the solution I ended up using. (Copy one user's music, snapshot and clone, copy the second user's music into the clone using rsync --inplace so only changed blocks are stored)
– Josh
Sep 13 '17 at 13:54
On modern Linux these days there's https://github.com/g2p/bedup which de-duplicates on a btrfs filesystem, but 1) without as much of the scan overhead, 2) files can diverge easily again afterwards.
Background and more information is listed on btrfs.wiki.kernel.org/index.php/Deduplication (including reference to cp --reflink, see also below)
– Marcel Waldvogel
Feb 5 '17 at 19:22
To find duplicate files you can use duff.
Duff is a Unix command-line utility for quickly finding duplicates in a given set of files.
Simply run:
duff -r target-folder
To create hardlinks to those files automatically, you will need to parse the output of duff with bash or some other scripting language.
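As a hedged illustration of that parsing step, here is a small bash sketch. It assumes duff's default output, where each cluster begins with a header line of the form "N files in cluster M (size, digest ...)" followed by the member file names, and it hardlinks every later member to the first file of its cluster:
#!/bin/bash
# link duplicates reported by duff to the first file in each cluster
first=""
duff -r /srv/audio | while IFS= read -r line; do
  case "$line" in
    *" files in cluster "*) first="" ;;   # new cluster header: reset
    *) if [ -z "$first" ]; then
         first="$line"                    # remember the cluster's first member
       else
         ln -f "$first" "$line"           # replace later members with hard links
       fi ;;
  esac
done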
Really slow though -- see rdfind.pauldreik.se/#g0.6
– ndemou
Oct 30 '15 at 12:52
aptitude show hardlink
Description: Hardlinks multiple copies of the same file
Hardlink is a tool which detects multiple copies of the same file and replaces them with hardlinks.
The idea has been taken from http://code.google.com/p/hardlinkpy/, but the code has been written from scratch and licensed under the MIT license.
Homepage: http://jak-linux.org/projects/hardlink/
The only program mentioned here available for Gentoo without unmasking and with hardlink support, thanks!
– Jorrit Schippers
Mar 9 '15 at 13:48
I've used many of the hardlinking tools for Linux mentioned here.
I too am stuck with ext4 fs, on Ubuntu, and have been using its cp -l and -s for hard/softlinking. But lately I noticed the lightweight copy option in the cp man page, which implies sparing the redundant disk space until one side gets modified:
--reflink[=WHEN]
control clone/CoW copies. See below
When --reflink[=always] is specified, perform a lightweight copy, where the
data blocks are copied only when modified. If this is not possible the
copy fails, or if --reflink=auto is specified, fall back to a standard copy.
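A quick sketch of the idea with illustrative file names; as the comments below point out, this only works on copy-on-write filesystems such as btrfs or OCFS2, not on ext4:
# the two names share data blocks until one of them is modified
cp --reflink=auto /srv/audio/alice/song.flac /srv/audio/bob/song.flac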
I think I will update my cp alias to always include the --reflink=auto parameter now
– Marcos
Mar 14 '12 at 14:08
1
Does ext4 really support --reflink?
– Jack Douglas
Jun 21 '12 at 8:42
7
This is supported on btrfs and OCFS2. It is only possible on copy-on-write filesystems, which ext4 is not. btrfs is really shaping up. I love using it because of reflink and snapshots, makes you less scared to do mass operations on big trees of files.
– clacke
Jul 3 '12 at 18:57
Seems to me that checking the filename first could speed things up. If two files don't have the same filename, then in many cases I would not consider them to be duplicates. It seems that the quickest method would be to compare, in order:
- filename
- size
- md5 checksum
- byte contents
Do any methods do this? Look at duff, fdupes, rmlint, fslint, etc.
The following method was top-voted on commandlinefu.com: Find Duplicate Files (based on size first, then MD5 hash)
Can filename comparison be added as a first step, size as a second step?
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |
xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum |
sort | uniq -w32 --all-repeated=separate
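As a hedged sketch of adding a filename pass in front of the size check, one could first collect basenames that occur more than once and only checksum files whose basename is in that list. This assumes GNU find/xargs and filenames without newlines; any false positives from the grep are harmless because the MD5 step still filters them:
# step 1: basenames that appear more than once
find . -type f -printf "%f\n" | sort | uniq -d > /tmp/dup-names.txt
# step 2: checksum only files whose basename is in that list
find . -type f -printf "%f\t%p\n" | grep -F -f /tmp/dup-names.txt | cut -f2- |
  xargs -d '\n' md5sum | sort | uniq -w32 --all-repeated=separate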
2
I've used duff, fdupes and rmlint, and strongly recommend readers to look at the third of these. It has an excellent option set (and documentation). With it, I was able to avoid a lot of the post-processing I needed to use with the other tools.
– dubiousjim
Sep 2 '15 at 6:32
2
In my practice, filename is the least reliable factor to look at, and I've completely removed it from any efforts I make at de-duping. How many install.sh files can be found on an active system? I can't count the number of times I've saved a file and had a name clash, with some on-the-fly renaming to save it. Flip side: no idea how many times I've downloaded something from different sources, on different days, only to find they are the same file with different names. (Which also kills the timestamp reliability.) 1: Size, 2: Digest, 3: Byte contents.
– Gypsy Spellweaver
Jan 28 '17 at 6:40
@GypsySpellweaver: (1) depends on personal use-case, wouldn't you agree? In my case, I have multiple restores from multiple backups, where files with the same name and content exist in different restore folders. (2) Your comment seems to assume comparing filename only. I was not suggesting eliminating the other checks.
– johny why
Mar 8 '17 at 21:50
I made a Perl script that does something similar to what you're talking about:
http://pastebin.com/U7mFHZU7
Basically, it just traverses a directory, calculating the SHA1sum of the files in it, hashing it and linking matches together. It's come in handy on many, many occasions.
2
I hope to get around to trying this soon... why not upload it on CPAN... App::relink or something
– xenoterracide
Feb 7 '11 at 11:12
1
@xenoterracide: because of all the similar and more mature solutions that already exist. see the other answers, especially rdfind.
– oligofren
Jan 3 '15 at 13:36
1
@oligofren I don't doubt better solutions exist. TMTOWTDI I guess.
– amphetamachine
Jan 5 '15 at 15:49
Since I'm not a fan of Perl, here's a bash version:
#!/bin/bash
DIR="/path/to/big/files"
find "$DIR" -type f -exec md5sum {} \; | sort > /tmp/sums-sorted.txt
OLDSUM=""
IFS=$'\n'
for i in `cat /tmp/sums-sorted.txt`; do
NEWSUM=`echo "$i" | sed 's/ .*//'`
NEWFILE=`echo "$i" | sed 's/^[^ ]* *//'`
if [ "$OLDSUM" == "$NEWSUM" ]; then
echo ln -f "$OLDFILE" "$NEWFILE"
else
OLDSUM="$NEWSUM"
OLDFILE="$NEWFILE"
fi
done
This finds all files with the same checksum (whether they're big, small, or already hardlinks), and hardlinks them together.
This can be greatly optimized for repeated runs with additional find flags (eg. size) and a file cache (so you don't have to redo the checksums each time). If anyone's interested in the smarter, longer version, I can post it.
NOTE: As has been mentioned before, hardlinks work as long as the files never need modification, or to be moved across filesystems.
How can I change your script so that, instead of hardlinking, it just deletes the duplicate files and adds an entry to a CSV file: deleted file -> linked file???
– MR.GEWA
Jan 12 '13 at 12:17
Sure. The hard link line: echo ln -f "$OLDFILE" "$NEWFILE" just replaces the duplicate file with a hard link, so you could change it to rm the $NEWFILE instead.
– seren
Jan 13 '13 at 4:15
And how, on the next line, do I write $OLDFILE -> $NEWFILE to some text file???
– MR.GEWA
Jan 13 '13 at 13:12
Ahh, right. Yes, add a line after the rm such as: echo "$NEWFILE" >> /var/log/deleted_duplicate_files.log
– seren
Jan 14 '13 at 19:28
1
Don't friggin reinvent the wheel. There are more mature solutions available, like rdfind, that work at native speeds and just require brew install rdfind or apt-get install rdfind to get installed.
– oligofren
Jan 3 '15 at 13:46
If you want to replace duplicates with hard links on Mac or any UNIX-based system, you can try SmartDupe (http://sourceforge.net/projects/smartdupe/); I am developing it.
3
Can you expand on how “smart” it is?
– Stéphane Gimenez
Nov 4 '12 at 13:25
1
How can I compare files of two different directories?
– Burcardo
May 31 '16 at 8:26
The application FSLint (http://www.pixelbeat.org/fslint/) can find all equal files in any folder (by content) and create hardlinks. Give it a try!
Jorge Sampaio
It hangs scanning a 1 TB almost-full ext3 hard disk and brings the entire system to a crawl. Aborted after 14 hours of "searching".
– Angsuman Chakraborty
Sep 12 '16 at 11:09
If you are going to use hard links, pay attention to the permissions on the files. Note that owner, group, mode, extended attributes, timestamps and ACLs (if you use them) are stored in the inode. Only the file names differ, because they are stored in the directory structure and point to the inode's properties. As a consequence, all file names linked to the same inode have the same access rights. You should prevent modification of such a file, because any user can damage it for everyone else. It is enough for a user to overwrite the file under the same name: the inode is kept, and the original content is destroyed (replaced) for all hardlinked names.
A better way is deduplication at the filesystem layer. You can use BTRFS (very popular lately), OCFS, or something similar. Look at the page https://en.wikipedia.org/wiki/Comparison_of_file_systems, especially the Features table and the "data deduplication" column. You can click it and sort :)
Look especially at the ZFS filesystem. It is available as FUSE, but that way it is very slow. If you want native support, look at http://zfsonlinux.org/. You then have to patch the kernel and install the zfs tools for management. I don't understand why Linux doesn't ship it as drivers; that is the way for many other operating systems/kernels.
File systems support deduplication in two ways: deduplicating files, or deduplicating blocks. ZFS supports blocks, which means that the same content repeated within the same file can be deduplicated. The other distinction is when the data is deduplicated: this can be online (ZFS) or offline (BTRFS).
Note that deduplication consumes RAM. This is why writing files to a ZFS volume mounted with FUSE causes dramatically slow performance. This is described in the documentation.
But you can turn deduplication on and off on a volume online. If you see data that should be deduplicated, you simply turn deduplication on, rewrite some files to a temporary location and finally replace them; afterwards you can turn deduplication off and restore full performance. Of course, you can add cache disks to the storage. These can be very fast rotating disks or SSDs, and of course they can be very small disks. In real work this is a replacement for RAM :)
Under Linux you should take care with ZFS, because not everything works as it should, especially when you manage the filesystem, make snapshots, etc.; but if you set up your configuration and don't change it, everything works properly. Otherwise you should switch from Linux to OpenSolaris, which natively supports ZFS :) What is very nice about ZFS is that it works both as a filesystem and as a volume manager similar to LVM, so you do not need LVM when you use ZFS. See the documentation if you want to know more.
Note the difference between ZFS and BTRFS. ZFS is older and more mature, but unfortunately only under Solaris and OpenSolaris (unfortunately strangled by Oracle). BTRFS is younger, but lately very well supported. I recommend a fresh kernel. ZFS has online deduplication, which slows down writes, because everything is calculated online. BTRFS supports offline deduplication, which preserves performance; when the host has nothing to do, you periodically run a tool to perform the deduplication. And BTRFS is natively developed under Linux. Maybe that is the better FS for you :)
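A rough sketch of the on/rewrite/off workflow described above, with illustrative names, assuming /tmp is on a different filesystem so that copying back really rewrites the blocks:
zfs set dedup=on tank/audio    # start deduplicating new writes
cp /tank/audio/album.flac /tmp/ && mv /tmp/album.flac /tank/audio/    # rewrite the data so it gets deduplicated
zfs set dedup=off tank/audio   # restore full write performance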
1
I do like the offline (or batch) deduplication approach btrfs has. Excellent discussion of the options (including the cp --reflink option) here: btrfs.wiki.kernel.org/index.php/Deduplication
– Marcel Waldvogel
Feb 5 '17 at 19:42
ZFS is not Solaris or OpenSolaris only. It's natively supported in FreeBSD. Also, ZFS on Linux is device driver based; ZFS on FUSE is a different thing.
– KJ Seefried
Mar 29 '18 at 19:07
Hard links might not be the best idea; if one user changes the file, it affects both. However, deleting a hard link doesn't delete both files. Plus, I am not entirely sure if Hard Links take up the same amount of space (on the hard disk, not the OS) as multiple copies of the same file; according to Windows (with the Link Shell Extension), they do. Granted, that's Windows, not Unix...
My solution would be to create a "common" file in a hidden folder, and replace the actual duplicates with symbolic links... then, the symbolic links would be embedded with metadata or alternate file streams that only record how the two "files" differ from each other, like if one person wants to change the filename or add custom album art or something else like that; it might even be useful outside of database applications, like having multiple versions of the same game or software installed and testing them independently with even the smallest differences.
The easiest way is to use the special-purpose program dupeGuru. As its documentation says:
Deletion Options
These options affect how duplicate deletion takes place.
Most of the time, you don’t need to enable any of them.
Link deleted files:
The deleted files are replaced by a link to the reference file.
You have a choice of replacing it either with a symlink or a hardlink.
...
a symlink is a shortcut to the file’s path.
If the original file is deleted or moved, the link is broken.
A hardlink is a link to the file itself.
That link is as good as a “real” file.
Only when all hardlinks to a file are deleted is the file itself deleted.
On OSX and Linux, this feature is supported fully,
but under Windows, it’s a bit complicated.
Windows XP doesn’t support it, but Vista and up support it.
However, for the feature to work,
dupeGuru has to run with administrative privileges.
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f3037%2fis-there-an-easy-way-to-replace-duplicate-files-with-hardlinks%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
18 Answers
18
active
oldest
votes
18 Answers
18
active
oldest
votes
active
oldest
votes
active
oldest
votes
There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:
Traverse all directories named on the
command line, compute MD5 checksums
and find files with identical MD5. IF
they are equal, do a real comparison
if they are really equal, replace the
second of two files with a hard link
to the first one.
Sounds perfect, thanks!! I'll try it and accept if it works as described!
– Josh
Oct 12 '10 at 20:09
3
This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.
– Josh
Dec 8 '10 at 20:13
10
Upvoted this, but after researching some more, I kind of which I didn't.rdfind
is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.
– oligofren
Jan 3 '15 at 13:42
@oligofren I was thinking the same, but then I hit[Errno 31] Too many links
. This scrips seems to be the only thing that handles that.
– phunehehe
Jun 26 '15 at 6:59
3
Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).
– Charles Duffy
Feb 1 '16 at 16:56
add a comment |
There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:
Traverse all directories named on the
command line, compute MD5 checksums
and find files with identical MD5. IF
they are equal, do a real comparison
if they are really equal, replace the
second of two files with a hard link
to the first one.
Sounds perfect, thanks!! I'll try it and accept if it works as described!
– Josh
Oct 12 '10 at 20:09
3
This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.
– Josh
Dec 8 '10 at 20:13
10
Upvoted this, but after researching some more, I kind of which I didn't.rdfind
is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.
– oligofren
Jan 3 '15 at 13:42
@oligofren I was thinking the same, but then I hit[Errno 31] Too many links
. This scrips seems to be the only thing that handles that.
– phunehehe
Jun 26 '15 at 6:59
3
Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).
– Charles Duffy
Feb 1 '16 at 16:56
add a comment |
There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:
Traverse all directories named on the
command line, compute MD5 checksums
and find files with identical MD5. IF
they are equal, do a real comparison
if they are really equal, replace the
second of two files with a hard link
to the first one.
There is a perl script at http://cpansearch.perl.org/src/ANDK/Perl-Repository-APC-2.002/eg/trimtrees.pl which does exactly what you want:
Traverse all directories named on the
command line, compute MD5 checksums
and find files with identical MD5. IF
they are equal, do a real comparison
if they are really equal, replace the
second of two files with a hard link
to the first one.
answered Oct 12 '10 at 20:04
fschmittfschmitt
7,6313043
7,6313043
Sounds perfect, thanks!! I'll try it and accept if it works as described!
– Josh
Oct 12 '10 at 20:09
3
This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.
– Josh
Dec 8 '10 at 20:13
10
Upvoted this, but after researching some more, I kind of which I didn't.rdfind
is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.
– oligofren
Jan 3 '15 at 13:42
@oligofren I was thinking the same, but then I hit[Errno 31] Too many links
. This scrips seems to be the only thing that handles that.
– phunehehe
Jun 26 '15 at 6:59
3
Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).
– Charles Duffy
Feb 1 '16 at 16:56
add a comment |
Sounds perfect, thanks!! I'll try it and accept if it works as described!
– Josh
Oct 12 '10 at 20:09
3
This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.
– Josh
Dec 8 '10 at 20:13
10
Upvoted this, but after researching some more, I kind of which I didn't.rdfind
is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.
– oligofren
Jan 3 '15 at 13:42
@oligofren I was thinking the same, but then I hit[Errno 31] Too many links
. This scrips seems to be the only thing that handles that.
– phunehehe
Jun 26 '15 at 6:59
3
Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).
– Charles Duffy
Feb 1 '16 at 16:56
Sounds perfect, thanks!! I'll try it and accept if it works as described!
– Josh
Oct 12 '10 at 20:09
Sounds perfect, thanks!! I'll try it and accept if it works as described!
– Josh
Oct 12 '10 at 20:09
3
3
This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.
– Josh
Dec 8 '10 at 20:13
This did exactly what I asked for. However I believe that ZFS with dedup will eventually be the way to do, since I did find that the files had slight differences so only a few could be hardlinked.
– Josh
Dec 8 '10 at 20:13
10
10
Upvoted this, but after researching some more, I kind of which I didn't.
rdfind
is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.– oligofren
Jan 3 '15 at 13:42
Upvoted this, but after researching some more, I kind of which I didn't.
rdfind
is available via the package managers for ALL major platforms (os x, linux, (cyg)win, solaris), and works at a blazing native speed. So do check out the answer below.– oligofren
Jan 3 '15 at 13:42
@oligofren I was thinking the same, but then I hit
[Errno 31] Too many links
. This scrips seems to be the only thing that handles that.– phunehehe
Jun 26 '15 at 6:59
@oligofren I was thinking the same, but then I hit
[Errno 31] Too many links
. This scrips seems to be the only thing that handles that.– phunehehe
Jun 26 '15 at 6:59
3
3
Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).
– Charles Duffy
Feb 1 '16 at 16:56
Checksumming every single file, rather than only files where there exists at least one other with identical size, is unnecessarily inefficient (and unnecessarily prone to hash collisions).
– Charles Duffy
Feb 1 '16 at 16:56
add a comment |
rdfind
does exactly what you ask for (and in the order johny why lists). Makes it possible to delete duplicates, replace them with either soft or hard links. Combined with symlinks
you can also make the symlink either absolute or relative. You can even pick checksum algorithm (md5 or sha1).
Since it is compiled it is faster than most scripted solutions: time
on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this
9.99s user 3.61s system 66% cpu 20.543 total
(using md5).
Available in most package handlers (e.g. MacPorts for Mac OS X).
10
+1 I usedrdfind
and loved it. It has a-dryrun true
option that will let you know what it would have done. Replacing duplicates with hard links is as simple as-makehardlinks true
. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.
– Daniel Trebbien
Dec 29 '13 at 20:49
oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!
– oligofren
Jan 3 '15 at 13:38
Very smart and fast algorithm.
– ndemou
Oct 30 '15 at 12:53
1
I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.
– cdhowie
May 31 '18 at 21:19
add a comment |
rdfind
does exactly what you ask for (and in the order johny why lists). Makes it possible to delete duplicates, replace them with either soft or hard links. Combined with symlinks
you can also make the symlink either absolute or relative. You can even pick checksum algorithm (md5 or sha1).
Since it is compiled it is faster than most scripted solutions: time
on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this
9.99s user 3.61s system 66% cpu 20.543 total
(using md5).
Available in most package handlers (e.g. MacPorts for Mac OS X).
10
+1 I usedrdfind
and loved it. It has a-dryrun true
option that will let you know what it would have done. Replacing duplicates with hard links is as simple as-makehardlinks true
. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.
– Daniel Trebbien
Dec 29 '13 at 20:49
oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!
– oligofren
Jan 3 '15 at 13:38
Very smart and fast algorithm.
– ndemou
Oct 30 '15 at 12:53
1
I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.
– cdhowie
May 31 '18 at 21:19
add a comment |
rdfind
does exactly what you ask for (and in the order johny why lists). Makes it possible to delete duplicates, replace them with either soft or hard links. Combined with symlinks
you can also make the symlink either absolute or relative. You can even pick checksum algorithm (md5 or sha1).
Since it is compiled it is faster than most scripted solutions: time
on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this
9.99s user 3.61s system 66% cpu 20.543 total
(using md5).
Available in most package handlers (e.g. MacPorts for Mac OS X).
rdfind
does exactly what you ask for (and in the order johny why lists). Makes it possible to delete duplicates, replace them with either soft or hard links. Combined with symlinks
you can also make the symlink either absolute or relative. You can even pick checksum algorithm (md5 or sha1).
Since it is compiled it is faster than most scripted solutions: time
on a 15 GiB folder with 2600 files on my Mac Mini from 2009 returns this
9.99s user 3.61s system 66% cpu 20.543 total
(using md5).
Available in most package handlers (e.g. MacPorts for Mac OS X).
edited Jul 5 '13 at 9:22
Tobias Kienzler
4,349104589
4,349104589
answered Jul 5 '13 at 8:15
d-bd-b
94878
94878
10
+1 I usedrdfind
and loved it. It has a-dryrun true
option that will let you know what it would have done. Replacing duplicates with hard links is as simple as-makehardlinks true
. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.
– Daniel Trebbien
Dec 29 '13 at 20:49
oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!
– oligofren
Jan 3 '15 at 13:38
Very smart and fast algorithm.
– ndemou
Oct 30 '15 at 12:53
1
I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.
– cdhowie
May 31 '18 at 21:19
add a comment |
10
+1 I usedrdfind
and loved it. It has a-dryrun true
option that will let you know what it would have done. Replacing duplicates with hard links is as simple as-makehardlinks true
. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.
– Daniel Trebbien
Dec 29 '13 at 20:49
oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!
– oligofren
Jan 3 '15 at 13:38
Very smart and fast algorithm.
– ndemou
Oct 30 '15 at 12:53
1
I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.
– cdhowie
May 31 '18 at 21:19
10
10
+1 I used
rdfind
and loved it. It has a -dryrun true
option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true
. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.– Daniel Trebbien
Dec 29 '13 at 20:49
+1 I used
rdfind
and loved it. It has a -dryrun true
option that will let you know what it would have done. Replacing duplicates with hard links is as simple as -makehardlinks true
. It produced a nice log and it let me know how much space was freed up. Plus, according to the author's benchmark, rdfind is faster than duff and fslint.– Daniel Trebbien
Dec 29 '13 at 20:49
oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!
– oligofren
Jan 3 '15 at 13:38
oooh, nice. I used to use fdupes, but its -L option for hardlinking dupes is missing in the latest Ubuntu 14.10. Was quite slow, and did not exist for Homebrew on OSX, so this answer is way better. Thanks!
– oligofren
Jan 3 '15 at 13:38
Very smart and fast algorithm.
– ndemou
Oct 30 '15 at 12:53
Very smart and fast algorithm.
– ndemou
Oct 30 '15 at 12:53
1
1
I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.
– cdhowie
May 31 '18 at 21:19
I suspect the performance of this tool has more to do with the algorithm itself and less to do with whether it's a compiled tool or a script. For this kind of operation, disk is going to be the bottleneck nearly all of the time. As long as scripted tools make sure that they've an async I/O operation in progress while burning the CPU on checksums, they should perform about as well as a native binary.
– cdhowie
May 31 '18 at 21:19
add a comment |
Use the fdupes
tool:
fdupes -r /path/to/folder
gives you a list of duplicates in the directory (-r makes it recursive). The output looks like this:
filename1
filename2
filename3
filename4
filename5
with filename1 and filename2 being identical and filename3, filename4 and filename5 also being identical.
1
Ubuntu Note: As of September 2013, it hasn't had a stable release (it is on 1.50-PR2-3), so the update doesn't appear in ubuntu yet.
– Stuart Axon
Aug 28 '13 at 14:19
11
I just tried installing fdupes_1.50-PR2-4 on both Ubuntu and Debian, neither has the -L flag. Luckily building from github.com/tobiasschulz/fdupes was super easy.
– neu242
Aug 30 '13 at 15:07
3
Tryrdfind
- likefdupes
, but faster and available on OS X and Cygwin as well.
– oligofren
Jan 3 '15 at 13:43
Or if you just requre Linux compatibility, installrmlint
which is blazingly fast, and has lots of nice options. Truly a modern alternative.
– oligofren
Jan 3 '15 at 14:28
3
fdupes
seems to only find duplicates, not replace them with hardlinks, so not an answer to the question IMO.
– Calimo
Nov 8 '17 at 15:58
|
show 1 more comment
Use the fdupes
tool:
fdupes -r /path/to/folder
gives you a list of duplicates in the directory (-r makes it recursive). The output looks like this:
filename1
filename2
filename3
filename4
filename5
with filename1 and filename2 being identical and filename3, filename4 and filename5 also being identical.
1
Ubuntu Note: As of September 2013, it hasn't had a stable release (it is on 1.50-PR2-3), so the update doesn't appear in ubuntu yet.
– Stuart Axon
Aug 28 '13 at 14:19
11
I just tried installing fdupes_1.50-PR2-4 on both Ubuntu and Debian, neither has the -L flag. Luckily building from github.com/tobiasschulz/fdupes was super easy.
– neu242
Aug 30 '13 at 15:07
3
Tryrdfind
- likefdupes
, but faster and available on OS X and Cygwin as well.
– oligofren
Jan 3 '15 at 13:43
Or if you just requre Linux compatibility, installrmlint
which is blazingly fast, and has lots of nice options. Truly a modern alternative.
– oligofren
Jan 3 '15 at 14:28
3
fdupes
seems to only find duplicates, not replace them with hardlinks, so not an answer to the question IMO.
– Calimo
Nov 8 '17 at 15:58
|
show 1 more comment
Use the fdupes
tool:
fdupes -r /path/to/folder
gives you a list of duplicates in the directory (-r makes it recursive). The output looks like this:
filename1
filename2
filename3
filename4
filename5
with filename1 and filename2 being identical and filename3, filename4 and filename5 also being identical.
Use the fdupes
tool:
fdupes -r /path/to/folder
gives you a list of duplicates in the directory (-r makes it recursive). The output looks like this:
filename1
filename2
filename3
filename4
filename5
with filename1 and filename2 being identical and filename3, filename4 and filename5 also being identical.
answered Oct 12 '10 at 20:03
tantetante
4,9942023
4,9942023
1
Ubuntu Note: As of September 2013, it hasn't had a stable release (it is on 1.50-PR2-3), so the update doesn't appear in ubuntu yet.
– Stuart Axon
Aug 28 '13 at 14:19
11
I just tried installing fdupes_1.50-PR2-4 on both Ubuntu and Debian, neither has the -L flag. Luckily building from github.com/tobiasschulz/fdupes was super easy.
– neu242
Aug 30 '13 at 15:07
3
Tryrdfind
- likefdupes
, but faster and available on OS X and Cygwin as well.
– oligofren
Jan 3 '15 at 13:43
Or if you just requre Linux compatibility, installrmlint
which is blazingly fast, and has lots of nice options. Truly a modern alternative.
– oligofren
Jan 3 '15 at 14:28
3
fdupes
seems to only find duplicates, not replace them with hardlinks, so not an answer to the question IMO.
– Calimo
Nov 8 '17 at 15:58
|
show 1 more comment
1
Ubuntu Note: As of September 2013, it hasn't had a stable release (it is on 1.50-PR2-3), so the update doesn't appear in ubuntu yet.
– Stuart Axon
Aug 28 '13 at 14:19
11
I just tried installing fdupes_1.50-PR2-4 on both Ubuntu and Debian, neither has the -L flag. Luckily building from github.com/tobiasschulz/fdupes was super easy.
– neu242
Aug 30 '13 at 15:07
3
Tryrdfind
- likefdupes
, but faster and available on OS X and Cygwin as well.
– oligofren
Jan 3 '15 at 13:43
Or if you just requre Linux compatibility, installrmlint
which is blazingly fast, and has lots of nice options. Truly a modern alternative.
– oligofren
Jan 3 '15 at 14:28
3
fdupes
seems to only find duplicates, not replace them with hardlinks, so not an answer to the question IMO.
– Calimo
Nov 8 '17 at 15:58
1
1
Ubuntu Note: As of September 2013, it hasn't had a stable release (it is on 1.50-PR2-3), so the update doesn't appear in ubuntu yet.
– Stuart Axon
Aug 28 '13 at 14:19
Ubuntu Note: As of September 2013, it hasn't had a stable release (it is on 1.50-PR2-3), so the update doesn't appear in ubuntu yet.
– Stuart Axon
Aug 28 '13 at 14:19
11
11
I just tried installing fdupes_1.50-PR2-4 on both Ubuntu and Debian, neither has the -L flag. Luckily building from github.com/tobiasschulz/fdupes was super easy.
– neu242
Aug 30 '13 at 15:07
I just tried installing fdupes_1.50-PR2-4 on both Ubuntu and Debian, neither has the -L flag. Luckily building from github.com/tobiasschulz/fdupes was super easy.
– neu242
Aug 30 '13 at 15:07
3
3
Try
rdfind
- like fdupes
, but faster and available on OS X and Cygwin as well.– oligofren
Jan 3 '15 at 13:43
Try
rdfind
- like fdupes
, but faster and available on OS X and Cygwin as well.– oligofren
Jan 3 '15 at 13:43
Or if you just requre Linux compatibility, install
rmlint
which is blazingly fast, and has lots of nice options. Truly a modern alternative.– oligofren
Jan 3 '15 at 14:28
Or if you just requre Linux compatibility, install
rmlint
which is blazingly fast, and has lots of nice options. Truly a modern alternative.– oligofren
Jan 3 '15 at 14:28
3
3
fdupes
seems to only find duplicates, not replace them with hardlinks, so not an answer to the question IMO.– Calimo
Nov 8 '17 at 15:58
fdupes
seems to only find duplicates, not replace them with hardlinks, so not an answer to the question IMO.– Calimo
Nov 8 '17 at 15:58
|
show 1 more comment
I use hardlink
from http://jak-linux.org/projects/hardlink/
1
Nice hint, I am using on a regular base code.google.com/p/hardlinkpy but this was not updated for a while...
– meduz
Apr 11 '12 at 19:09
2
This appears to be similar to the originalhardlink
on Fedora/RHEL/etc.
– Jack Douglas
Jun 21 '12 at 8:43
1
hardlink
is now a native binary in many Linux package systems (since ~2014) and extremely fast. For 1,2M files (320GB), it just took 200 seconds (linking roughly 10% of the files).
– Marcel Waldvogel
Feb 5 '17 at 19:13
FWIW, the abovehardlink
was created by Julian Andres Klode while the Fedorahardlink
was created by Jakub Jelinek (source: pagure.io/hardlink - Fedora package name: hardlink)
– maxschlepzig
Jan 4 at 17:52
add a comment |
I use hardlink
from http://jak-linux.org/projects/hardlink/
1
Nice hint, I am using on a regular base code.google.com/p/hardlinkpy but this was not updated for a while...
– meduz
Apr 11 '12 at 19:09
2
This appears to be similar to the originalhardlink
on Fedora/RHEL/etc.
– Jack Douglas
Jun 21 '12 at 8:43
1
hardlink
is now a native binary in many Linux package systems (since ~2014) and extremely fast. For 1,2M files (320GB), it just took 200 seconds (linking roughly 10% of the files).
– Marcel Waldvogel
Feb 5 '17 at 19:13
FWIW, the abovehardlink
was created by Julian Andres Klode while the Fedorahardlink
was created by Jakub Jelinek (source: pagure.io/hardlink - Fedora package name: hardlink)
– maxschlepzig
Jan 4 at 17:52
add a comment |
I use hardlink
from http://jak-linux.org/projects/hardlink/
I use hardlink
from http://jak-linux.org/projects/hardlink/
answered Oct 18 '11 at 4:24
waltinatorwaltinator
75048
75048
1
Nice hint, I am using on a regular base code.google.com/p/hardlinkpy but this was not updated for a while...
– meduz
Apr 11 '12 at 19:09
2
This appears to be similar to the originalhardlink
on Fedora/RHEL/etc.
– Jack Douglas
Jun 21 '12 at 8:43
1
hardlink
is now a native binary in many Linux package systems (since ~2014) and extremely fast. For 1,2M files (320GB), it just took 200 seconds (linking roughly 10% of the files).
– Marcel Waldvogel
Feb 5 '17 at 19:13
FWIW, the abovehardlink
was created by Julian Andres Klode while the Fedorahardlink
was created by Jakub Jelinek (source: pagure.io/hardlink - Fedora package name: hardlink)
– maxschlepzig
Jan 4 at 17:52
This is one of the functions provided by "fslint" --
http://en.flossmanuals.net/FSlint/Introduction
Click the "Merge" button.
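For those who prefer the command line, the underlying findup script can do the same merge directly (paths and flags as given in the comments below for the Ubuntu packaging):
sudo apt-get install fslint
/usr/share/fslint/fslint/findup -m /path/to/audio    # -m merges duplicates into hardlinks; -t first does a dry run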
edited May 10 '16 at 22:19
Flimm
1,43541928
answered Dec 18 '10 at 22:38
LJ Wobker
4
The -m option will hardlink duplicates together, -d will delete all but one, and -t will do a dry run, printing what it would do.
– Azendale
Oct 29 '12 at 5:57
1
On Ubuntu, here is what to do: sudo apt-get install fslint
/usr/share/fslint/fslint/findup -m /your/directory/tree
(the directory /usr/share/fslint/fslint/ is not in $PATH by default)
– Jocelyn
Sep 8 '13 at 15:38
Since your main target is to save disk space, there is another solution: deduplication (and possibly compression) at the file-system level. Compared with the hard-link solution, it does not have the problem of inadvertently affecting other linked files.
ZFS has had dedup (block-level, not file-level) since pool version 23, and compression for a long time.
If you are using Linux, you may try zfs-fuse; if you use BSD, it is natively supported.
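As a rough sketch of what turning this on looks like on ZFS (the dataset name tank/audio is a placeholder; dedup and compression are standard ZFS dataset properties):
zfs set compression=on tank/audio   # cheap space win, available long before dedup was added
zfs set dedup=on tank/audio         # block-level dedup; needs pool version >= 23 and plenty of RAM
zpool list tank                     # the DEDUP column reports the achieved dedup ratio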
answered Oct 13 '10 at 5:13
Wei-Yin
3921210
This is probably the way I'll go eventually, however, does BSD's ZFS implementation do dedup? I thought it did not.
– Josh
Dec 8 '10 at 20:14
In addition, the HAMMER filesystem on DragonFlyBSD has deduplication support.
– hhaamu
Jul 15 '12 at 17:48
11
ZFS dedup is the friend of nobody. Where ZFS recommends 1 GB of RAM per 1 TB of usable disk space, you're friggin' nuts if you try to use dedup with less than 32 GB of RAM per 1 TB of usable disk space. That means that for a 1 TB mirror, if you don't have 32 GB of RAM, you are likely to encounter memory-bomb conditions sooner or later that will halt the machine due to lack of RAM. Been there, done that, still recovering from the PTSD.
– killermist
Sep 22 '14 at 18:51
3
To avoid the excessive RAM requirements of online deduplication (i.e., checking on every write), btrfs uses batch or offline deduplication (run it whenever you consider it useful/necessary): btrfs.wiki.kernel.org/index.php/Deduplication
– Marcel Waldvogel
Feb 5 '17 at 19:18
2
Update seven years later: I eventually did move to ZFS and tried deduplication -- I found that its RAM requirements were indeed just far too high. Crafty use of ZFS snapshots provided the solution I ended up using. (Copy one user's music, snapshot and clone, copy the second user's music into the clone using rsync --inplace so only changed blocks are stored.)
– Josh
Sep 13 '17 at 13:54
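A hedged sketch of the snapshot/clone workflow Josh describes in the comment above (dataset and path names are placeholders):
zfs snapshot tank/audio/alice@base                 # snapshot once the first user's files are in place
zfs clone tank/audio/alice@base tank/audio/bob     # the clone initially shares all blocks with the snapshot
rsync -a --inplace /staging/bob/ /tank/audio/bob/  # only blocks that actually differ get written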
On modern Linux these days there's https://github.com/g2p/bedup, which deduplicates on a btrfs filesystem. It 1) doesn't have as much scan overhead, and 2) lets files diverge easily again afterwards.
answered Jan 8 '14 at 17:37
Matthew Bloch
17014
Background and more information is listed on btrfs.wiki.kernel.org/index.php/Deduplication (including a reference to cp --reflink; see also below).
– Marcel Waldvogel
Feb 5 '17 at 19:22
To find duplicate files you can use duff.
Duff is a Unix command-line utility for quickly finding duplicates in a given set of files.
Simply run:
duff -r target-folder
To create hardlinks to those files automatically, you will need to parse the output of duff with bash or some other scripting language.
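For example, a heavily hedged bash sketch of that parsing step, assuming duff's default output of a cluster-header line ("N files in cluster ...") followed by one file path per line; paths containing newlines will break it:
#!/bin/bash
# Hardlink the members of each duplicate cluster reported by duff together.
duff -r "$1" | while IFS= read -r line; do
  if [[ $line =~ ^[0-9]+\ files\ in\ cluster ]]; then
    keeper=""                   # a new cluster starts: forget the previous keeper
  elif [[ -z $keeper ]]; then
    keeper=$line                # first path in the cluster is kept as-is
  else
    ln -f -- "$keeper" "$line"  # replace every further duplicate with a hardlink
  fi
done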
answered Oct 12 '10 at 20:00
Stefan
11.6k3283123
Really slow though -- see rdfind.pauldreik.se/#g0.6
– ndemou
Oct 30 '15 at 12:52
aptitude show hardlink
Description: Hardlinks multiple copies of the same file
Hardlink is a tool which detects multiple copies of the same file and replaces them with hardlinks.
The idea has been taken from http://code.google.com/p/hardlinkpy/, but the code has been written from scratch and licensed under the MIT license.
Homepage: http://jak-linux.org/projects/hardlink/
edited Nov 22 '13 at 15:22
Anthon
60.9k17104166
answered Nov 22 '13 at 15:03
Julien Palard
29635
The only program mentioned here available for Gentoo without unmasking and with hardlink support, thanks!
– Jorrit Schippers
Mar 9 '15 at 13:48
add a comment |
The only program mentioned here available for Gentoo without unmasking and with hardlink support, thanks!
– Jorrit Schippers
Mar 9 '15 at 13:48
The only program mentioned here available for Gentoo without unmasking and with hardlink support, thanks!
– Jorrit Schippers
Mar 9 '15 at 13:48
The only program mentioned here available for Gentoo without unmasking and with hardlink support, thanks!
– Jorrit Schippers
Mar 9 '15 at 13:48
add a comment |
I've used many of the hardlinking tools for Linux mentioned here.
I too am stuck with an ext4 filesystem on Ubuntu, and have been using its cp -l and -s for hard/soft linking. But lately I noticed the lightweight copy in the cp man page, which implies sparing the redundant disk space until one side gets modified:
--reflink[=WHEN]
control clone/CoW copies. See below
When --reflink[=always] is specified, perform a lightweight copy, where the
data blocks are copied only when modified. If this is not possible the
copy fails, or if --reflink=auto is specified, fall back to a standard copy.
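In practice that looks like the following sketch (file names are placeholders; note the caveat in the comments below that ext4 does not support reflinks, so this only saves space on CoW filesystems such as btrfs or OCFS2):
cp --reflink=auto alice/song.flac bob/song.flac   # CoW clone where supported, normal copy otherwise
cp --reflink=always big.iso big-copy.iso          # fail instead of silently falling back to a full copy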
answered Mar 14 '12 at 9:59
Marcos
1,14211228
I think I will update my cp alias to always include the --reflink=auto parameter now.
– Marcos
Mar 14 '12 at 14:08
1
Does ext4 really support --reflink?
– Jack Douglas
Jun 21 '12 at 8:42
7
This is supported on btrfs and OCFS2. It is only possible on copy-on-write filesystems, which ext4 is not. btrfs is really shaping up. I love using it because of reflink and snapshots; it makes you less scared to do mass operations on big trees of files.
– clacke
Jul 3 '12 at 18:57
Seems to me that checking the filename first could speed things up. If two files lack the same filename then in many cases I would not consider them to be duplicates. It seems that the quickest method would be to compare, in order:
- filename
- size
- md5 checksum
- byte contents
Do any methods do this? Look at duff, fdupes, rmlint, fslint, etc.
The following method was top-voted on commandlinefu.com: Find Duplicate Files (based on size first, then MD5 hash)
Can filename comparison be added as a first step, size as a second step?
find -not -empty -type f -printf "%s\n" | sort -rn | uniq -d |
xargs -I{} -n1 find -type f -size {}c -print0 | xargs -0 md5sum |
sort | uniq -w32 --all-repeated=separate
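As a hedged sketch of adding (basename, size) as a cheap first-pass key before hashing (GNU find/awk/xargs assumed; it breaks on paths containing tabs or newlines):
# keep only files whose (basename, size) pair occurs more than once, then hash just those
find . -not -empty -type f -printf "%f\t%s\t%p\n" | sort |
  awk -F'\t' '{n[$1 FS $2]++; line[NR]=$0; key[NR]=$1 FS $2}
              END {for (i = 1; i <= NR; i++) if (n[key[i]] > 1) print line[i]}' |
  cut -f3- | xargs -r -d '\n' md5sum | sort | uniq -w32 --all-repeated=separate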
edited Jul 15 '12 at 13:10
Mat
39.6k8121127
answered Jul 9 '12 at 15:02
johny why
1334
2
I've used duff, fdupes and rmlint, and strongly recommend readers to look at the third of these. It has an excellent option set (and documentation). With it, I was able to avoid a lot of the post-processing I needed to use with the other tools.
– dubiousjim
Sep 2 '15 at 6:32
2
In my practice the filename is the least reliable factor to look at, and I've completely removed it from any de-duping efforts I make. How many install.sh files can be found on an active system? I can't count the number of times I've saved a file and had a name clash, with some on-the-fly renaming to save it. Flip side: no idea how many times I've downloaded something from different sources, on different days, only to find they are the same file with different names. (Which also kills the timestamp reliability.) 1: Size, 2: Digest, 3: Byte contents.
– Gypsy Spellweaver
Jan 28 '17 at 6:40
@GypsySpellweaver: (1) depends on the personal use case, wouldn't you agree? In my case, I have multiple restores from multiple backups, where files with the same name and content exist in different restore folders. (2) Your comment seems to assume comparing the filename only. I was not suggesting eliminating the other checks.
– johny why
Mar 8 '17 at 21:50
I made a Perl script that does something similar to what you're talking about:
http://pastebin.com/U7mFHZU7
Basically, it just traverses a directory, calculating the SHA1sum of the files in it, hashing it and linking matches together. It's come in handy on many, many occasions.
answered Jan 31 '11 at 2:06
amphetamachine
3,82522338
2
I hope to get around to trying this soon... why not upload it on CPAN... App::relink or something
– xenoterracide
Feb 7 '11 at 11:12
1
@xenoterracide: because of all the similar and more mature solutions that already exist. see the other answers, especially rdfind.
– oligofren
Jan 3 '15 at 13:36
1
@oligofren I don't doubt better solutions exist. TMTOWTDI I guess.
– amphetamachine
Jan 5 '15 at 15:49
Since I'm not a fan of Perl, here's a bash version:
#!/bin/bash
DIR="/path/to/big/files"

find "$DIR" -type f -exec md5sum {} \; | sort > /tmp/sums-sorted.txt

OLDSUM=""
IFS=$'\n'
for i in `cat /tmp/sums-sorted.txt`; do
  NEWSUM=`echo "$i" | sed 's/ .*//'`
  NEWFILE=`echo "$i" | sed 's/^[^ ]* *//'`
  if [ "$OLDSUM" == "$NEWSUM" ]; then
    echo ln -f "$OLDFILE" "$NEWFILE"   # remove the echo to actually create the link
  else
    OLDSUM="$NEWSUM"
    OLDFILE="$NEWFILE"
  fi
done
This finds all files with the same checksum (whether they're big, small, or already hardlinks), and hardlinks them together.
This can be greatly optimized for repeated runs with additional find flags (e.g. size) and a file cache (so you don't have to redo the checksums each time). If anyone's interested in the smarter, longer version, I can post it.
NOTE: As has been mentioned before, hardlinks work as long as the files never need modification, or to be moved across filesystems.
edited Jul 3 '12 at 11:04
Mat
39.6k8121127
answered Jul 3 '12 at 5:15
seren
1212
How can I change your script so that, instead of hardlinking, it will just delete the duplicate files and add an entry to a CSV file mapping the deleted file -> linked file?
– MR.GEWA
Jan 12 '13 at 12:17
Sure. The hard link line: echo ln -f "$OLDFILE" "$NEWFILE" just replaces the duplicate file with a hard link, so you could change it to rm the $NEWFILE instead.
– seren
Jan 13 '13 at 4:15
And how, on the next line, can I write $OLDFILE -> $NEWFILE to some text file?
– MR.GEWA
Jan 13 '13 at 13:12
Ahh, right. Yes, add a line after the rm such as: echo "$NEWFILE" >> /var/log/deleted_duplicate_files.log
– seren
Jan 14 '13 at 19:28
1
Don't friggin reinvent the wheel. There are more mature solutions available, like rdfind, which works at native speeds and just requires brew install rdfind or apt-get install rdfind to get installed.
– oligofren
Jan 3 '15 at 13:46
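Putting seren's two suggestions from the comments above together, the modified branch of the script would look roughly like this (a sketch; the log path is just the example used in the comment):
  if [ "$OLDSUM" == "$NEWSUM" ]; then
    rm -f "$NEWFILE"                                                   # delete the duplicate instead of hardlinking it
    echo "$NEWFILE,$OLDFILE" >> /var/log/deleted_duplicate_files.log   # record: deleted file, kept file
  else
    OLDSUM="$NEWSUM"
    OLDFILE="$NEWFILE"
  fi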
If you want to replace duplicates with hard links on macOS or any UNIX-based system, you can try SmartDupe http://sourceforge.net/projects/smartdupe/
I am developing it.
answered Nov 4 '12 at 0:57
islam
211
3
Can you expand on how “smart” it is?
– Stéphane Gimenez
Nov 4 '12 at 13:25
1
How can I compare files of two different directories?
– Burcardo
May 31 '16 at 8:26
The application FSLint (http://www.pixelbeat.org/fslint/) can find all equal files in any folder (by content) and create hardlinks. Give it a try!
Jorge Sampaio
answered Jan 15 '15 at 16:29
Jorge H B Sampaio Jr
111
It hangs scanning a 1 TB, almost-full ext3 hard disk and brings the entire system to a crawl. Aborted after 14 hours of "searching".
– Angsuman Chakraborty
Sep 12 '16 at 11:09
If you use hardlinks, pay attention to the rights on the file. Note that the owner, group, mode, extended attributes, times and ACLs (if you use them) are stored in the inode. Only the file names differ, because they are stored in the directory structure, while everything else points to the inode's properties. Consequently, all file names linked to the same inode have the same access rights. You should prevent modification of that file, because any user can damage the file for everyone else: it is enough for one user to write other content under the same name. The inode number is kept, and the original file content is destroyed (replaced) for all hardlinked names.
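A quick demonstration of the shared-inode behaviour described above (file names are arbitrary):
echo hello > a.txt
ln a.txt b.txt          # hard link: both names now point to the same inode
chmod 600 a.txt         # changing the mode through one name...
ls -li a.txt b.txt      # ...shows the same inode number and the same mode for both names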
A better way is deduplication at the filesystem layer. You can use BTRFS (very popular lately), OCFS or similar. Look at the page https://en.wikipedia.org/wiki/Comparison_of_file_systems , especially the Features table and its data deduplication column. You can click it and sort :)
Especially look at the ZFS filesystem. It is available via FUSE, but that way it is very slow. If you want native support, look at http://zfsonlinux.org/ . You then have to patch the kernel and install the zfs management tools. I don't understand why Linux doesn't ship it as a driver; it works that way for many other operating systems / kernels.
Filesystems support deduplication in two ways: deduplicating files, or blocks. ZFS deduplicates blocks, which means that content repeated within the same file can also be deduplicated. The other distinction is when the data gets deduplicated: this can be online (ZFS) or offline (btrfs).
Note that deduplication consumes RAM. This is why writing files to a ZFS volume mounted via FUSE causes dramatically slow performance; this is described in the documentation.
But you can turn deduplication on and off on a volume online. If you see data that should be deduplicated, you simply turn deduplication on, rewrite the files (via a temporary copy that finally replaces the original), and afterwards turn deduplication off again to restore full performance. Of course, you can also add cache disks to the storage; these can be very fast rotating disks or SSDs, and they can be quite small. In real work they act as a replacement for RAM :)
Under Linux you should take care with ZFS, because not everything works as it should, especially when you manage the filesystem, make snapshots, etc., but if you set up the configuration and don't change it, everything works properly. Otherwise, you could switch from Linux to OpenSolaris, which supports ZFS natively :) What is very nice about ZFS is that it works both as a filesystem and as a volume manager similar to LVM, so you don't need LVM when you use ZFS. See the documentation if you want to know more.
Note the difference between ZFS and BTRFS. ZFS is older and more mature, but unfortunately only under Solaris and OpenSolaris (unfortunately strangled by Oracle). BTRFS is younger, but lately very well supported; I recommend a fresh kernel. ZFS has online deduplication, which slows down writes because everything is calculated online. BTRFS supports offline deduplication, which preserves performance: you periodically run a deduplication tool when the host has nothing else to do. And BTRFS was created natively under Linux. Maybe this is the better FS for you :)
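As a hedged illustration of the toggle-and-rewrite workflow described above (the dataset name tank/data is a placeholder; dedup is a standard ZFS dataset property):
zfs set dedup=on tank/data                                 # new writes from now on are deduplicated
cp file.flac file.flac.tmp && mv file.flac.tmp file.flac   # rewrite existing data so it passes through dedup
zfs set dedup=off tank/data                                # turn it off again to restore full write performance
zpool list tank                                            # the DEDUP column shows the achieved ratio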
1
I do like the offline (or batch) deduplication approach btrfs has. Excellent discussion of the options (including the cp --reflink option) here: btrfs.wiki.kernel.org/index.php/Deduplication
– Marcel Waldvogel
Feb 5 '17 at 19:42
ZFS is not Solaris or OpenSolaris only. It's natively supported in FreeBSD. Also, ZFS on Linux is device-driver based; ZFS on FUSE is a different thing.
– KJ Seefried
Mar 29 '18 at 19:07
Under linux you should take care for ZFS because not all work as it should, specialy when you manage filesystem, make snapshot etc. but if you do configuration and don't change it, all works properly. Other way, you should change linux to opensolaris, it natively supports ZFS :) What is very nice with ZFS is, this works both as filesystem, and volumen manager similar to LVM. You do not need it when you use ZFS. See documentation if you want know more.
Notice difference between ZFS and BTRFS. ZFS is older and more mature, unfortunately only under Solaris and OpenSolaris (unfortunately strangled by oracle). BTRFS is younger, but last time very good supported. I recommend fresh kernel. ZFS has online deduplication, that cause slow down writes, because all is calculated online. BTRFS support off-line dedupliaction. Then this saves performance, but when host has nothing to do, you run periodically tool for make deduplication. And BTRFS is natively created under linux. Maybe this is better FS for You :)
If you go with hardlinks, pay attention to the permissions on those files. Note that the owner, group, mode, extended attributes, timestamps and ACLs (if you use them) are stored in the inode; only the file names differ, because names live in the directory structure and merely point to the inode. As a result, all names linked to the same inode have the same access rights. You should prevent anyone from modifying such a file, because any user can damage it for everyone else: it is enough for one user to overwrite the file under the same name. The inode number is kept, so the original content is destroyed (replaced) for all hardlinked names.
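For illustration, here is a quick sketch of that behaviour (the paths and file names are just placeholders):

    ln /srv/music/alice/song.mp3 /srv/music/bob/song.mp3                       # create the hardlink
    stat -c '%i %U %a %h' /srv/music/alice/song.mp3 /srv/music/bob/song.mp3    # both names show the same inode, owner, mode and link count
    cp /tmp/other.mp3 /srv/music/bob/song.mp3                                  # overwriting in place keeps the inode, so the content changes for all names
    rm /srv/music/bob/song.mp3                                                 # removing one name only drops the link count; alice's copy is untouched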
A better approach is deduplication at the filesystem layer. You can use BTRFS (very popular lately), OCFS or something similar. See https://en.wikipedia.org/wiki/Comparison_of_file_systems , especially the Features table and its "data deduplication" column, which you can click to sort.
In particular, look at ZFS. It is available via FUSE, but that way it is very slow. If you want native support, see http://zfsonlinux.org/ : you build and install the out-of-tree kernel module and then the zfs userland tools for management. (I don't understand why Linux doesn't ship it as in-tree drivers; it works that way on many other operating systems/kernels.)
File systems deduplicate in one of two ways: per file or per block. ZFS deduplicates blocks, which means identical content that repeats even within a single file can be deduplicated. The other distinction is when the data is deduplicated: online (ZFS) or offline (BTRFS).
Note that deduplication consumes RAM. This is why writing files to a ZFS volume mounted via FUSE is dramatically slow; it is described in the documentation.
However, you can turn deduplication on and off per dataset online. If you see data that should be deduplicated, you simply enable deduplication, rewrite the files (e.g. copy them to a temporary name and move them back), and then disable it again to restore full write performance. Of course, you can also add cache devices to the pool; these can be very fast spinning disks or SSDs, and they can be quite small. In practice they act as a substitute for RAM :)
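As a rough sketch of that toggle (assuming a pool named tank with a dataset tank/music; pool, dataset and device names are placeholders):

    zfs set dedup=on tank/music                                   # enable block-level deduplication for new writes on this dataset
    cp -a song.flac song.flac.tmp && mv song.flac.tmp song.flac   # rewrite a file so its blocks pass through dedup
    zfs set dedup=off tank/music                                  # turn it off again to restore full write performance
    zpool add tank cache /dev/sdX                                 # optionally add an L2ARC cache device to hold dedup-table entries that do not fit in RAM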
Under Linux you should take some care with ZFS, because not everything works as it should, especially when you manage the filesystem, take snapshots, etc.; but if you set up a configuration and then leave it alone, everything works properly. Otherwise you could switch from Linux to OpenSolaris, which supports ZFS natively :) What is very nice about ZFS is that it works both as a filesystem and as a volume manager similar to LVM, so you don't need LVM when you use ZFS. See the documentation if you want to know more.
Note the difference between ZFS and BTRFS. ZFS is older and more mature, but unfortunately native mainly to Solaris and OpenSolaris (which was unfortunately strangled by Oracle). BTRFS is younger but has lately been well supported; I recommend a fresh kernel. ZFS does online deduplication, which slows down writes because everything is calculated at write time. BTRFS supports offline deduplication, which preserves write performance: when the host has nothing else to do, you periodically run a tool to perform the deduplication. And BTRFS was created natively for Linux. Maybe that is the better filesystem for you :)
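For example, an offline BTRFS pass can be run with one of the tools listed on the BTRFS wiki (a sketch; duperemove is one such tool, and the path and file names are placeholders):

    duperemove -dr /srv/music           # hash file contents recursively and submit identical extents to the kernel for deduplication
    cp --reflink=always a.flac b.flac   # a reflink-aware copy shares extents from the start instead of duplicating them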
answered Jun 24 '14 at 8:51
Znik
1
I do like the offline (or batch) deduplication approach btrfs has. Excellent discussion of the options (including the cp --reflink option) here: btrfs.wiki.kernel.org/index.php/Deduplication
– Marcel Waldvogel
Feb 5 '17 at 19:42
ZFS is not Solaris or OpenSolaris only. It's natively supported in FreeBSD. Also, ZFS on Linux is device driver based; ZFS on FUSE is a different thing.
– KJ Seefried
Mar 29 '18 at 19:07
Hard links might not be the best idea; if one user changes the file, it affects everyone sharing it. However, deleting a hard link doesn't delete the other names. Also, I am not entirely sure whether hard links take up the same amount of space on the disk as multiple copies of the same file; Windows (with the Link Shell Extension) reports them as if they do, but that's Windows, not Unix...
My solution would be to create a "common" file in a hidden folder and replace the actual duplicates with symbolic links... the symbolic links could then carry metadata or alternate file streams that record only how the two "files" differ from each other, e.g. if one person wants to change the file name or add custom album art. It might even be useful beyond this use case, such as keeping multiple versions of the same game or software installed and testing them independently with even the smallest differences.
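A minimal sketch of that layout (the directory and file names are just placeholders):

    mkdir -p /srv/music/.common
    mv /srv/music/alice/song.mp3 /srv/music/.common/song.mp3      # keep one real copy in the hidden folder
    ln -s /srv/music/.common/song.mp3 /srv/music/alice/song.mp3   # each user's folder gets a symlink to it
    ln -s /srv/music/.common/song.mp3 /srv/music/bob/song.mp3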
answered May 3 '16 at 18:43
Amaroq Starwind
The easiest way is to use the dedicated program dupeGuru. As its documentation says:
Deletion Options
These options affect how duplicate deletion takes place.
Most of the time, you don’t need to enable any of them.
Link deleted files:
The deleted files are replaced by a link to the reference file.
You have a choice of replacing it either with a symlink or a hardlink.
...
a symlink is a shortcut to the file’s path.
If the original file is deleted or moved, the link is broken.
A hardlink is a link to the file itself.
That link is as good as a “real” file.
Only when all hardlinks to a file are deleted is the file itself deleted.
On OSX and Linux, this feature is supported fully,
but under Windows, it’s a bit complicated.
Windows XP doesn’t support it, but Vista and up support it.
However, for the feature to work,
dupeGuru has to run with administrative privileges.
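In shell terms, replacing a deleted duplicate with a hardlink (the option described above) amounts to something like this (the paths are placeholders):

    rm /srv/music/bob/song.mp3
    ln /srv/music/alice/song.mp3 /srv/music/bob/song.mp3   # bob's name now points at the same inode as alice's file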
answered Jun 13 '17 at 14:20
Russian Junior Ruby Developer