Extremely fast tool to remove duplicates and other lint from your filesystem

Overview

https://raw.githubusercontent.com/sahib/rmlint/develop/docs/_static/logo.png

rmlint finds space waste and other broken things on your filesystem and offers to remove them.

Features:

Finds…

  • …Duplicate Files and duplicate directories.
  • …Nonstripped binaries (i.e. binaries with debug symbols)
  • …Broken symbolic links.
  • …Empty files and directories.
  • …Files with broken user and/or group ID.

Differences from other duplicate finders:

  • Extremely fast (no exaggeration, we promise!)
  • Paranoia mode for those who do not trust hashsums.
  • Many output formats.
  • No interactivity.
  • Search for files only newer than a certain mtime.
  • Many ways to handle duplicates.
  • Caching and replaying.
  • btrfs support.
  • ...

It compiles and runs under most Unices, including Linux, FreeBSD and Darwin. The main target is Linux, though; some optimisations might not be available elsewhere.

https://raw.githubusercontent.com/sahib/rmlint/develop/docs/_static/screenshot.png

INSTALLATION

Chances are that rmlint is already available as a ready-made package in your favourite distribution. If not, consider compiling it from source.

DOCUMENTATION

Detailed documentation is available on:

http://rmlint.rtfd.org

Most features you'll ever need are covered in the tutorial:

http://rmlint.rtfd.org/en/latest/tutorial.html

An online version of the manpage is available at:

http://rmlint.rtfd.org/en/latest/rmlint.1.html

Sometimes we can be reached via IRC: #rmlint on irc.freenode.net.

BUGS

If you found a bug, have trouble running rmlint, or want to suggest new features, please read this.

Also read the BUGS section of the manpage to find out how to provide good debug information.

AUTHORS

Here's a list of developers to blame:

Christopher Pahl https://github.com/sahib 2010-2017
Daniel Thomas https://github.com/SeeSpotRun 2014-2017

Other people have helped us too, of course. Please see the AUTHORS file distributed along with rmlint.

LICENSE

rmlint is licensed under the conditions of the GPLv3. See the COPYING file distributed along with the source for details.

Issues
  • Bounded resource usage (feature)

    Consider this use case: I'm really stuck trying to deduplicate over 5 million files. Memory usage goes far beyond available RAM, and using a large swap won't help because processing takes eons. I think that using a database such as sqlite to hold file metadata would be a lifesaver. This is similar to #25 but with a different scope: to limit memory usage and make rmlint scale to huge numbers of files.

    Feature Request 
    opened by vvs- 41
  • New option: Consider mtime (modification time)

    I would like to have an option to not consider files with the same content but different mtimes ("modification time") as duplicates. Reason: Sometimes the mtime carries semantic/pragmatic information which makes it desirable to keep it!

    Bug Discussion 
    opened by Awerick 38
  • High CPU usage and cannot complete

    Hello,

    I am running the develop version (version 2.6.2 compiled: Aug 14 2018 at [06:54:35] "Penetrating Pineapple" (rev 888b8e2)) on a Btrfs filesystem with over 5 million files on 2TB. I used the following command line:

        rmlint --types="duplicates" --hidden --config=sh:handler=clone --no-hardlinked --algorithm=xxhash --progress /home

    After 6-8 hours of progress and with only 40GB remaining to scan, there is no more progress and "top" reports that rmlint consumes 100% of CPU. After 8 hours at 100% CPU usage, I killed rmlint.

    Unfortunately I cannot use the timestamp filtering (I cannot rely on mtime because some duplicates are created with mtime in the past) and I would prefer not to store the checksum in the xattr.

    I was wondering whether it would be possible to avoid checksumming entirely with Btrfs, because Btrfs checks whether the files are identical anyway before cloning/reflinking. So instead of calculating a checksum, in theory we could consider files with the same size to be duplicates and try to clone them. Btrfs will automatically ignore files which are not real duplicates.

    Is it possible to do that with rmlint? Is there any other way to make rmlint work with these 5 million files?
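
    The idea in this request, treating files of equal size as clone candidates and leaving the byte comparison to the filesystem, starts with a grouping step that might be sketched as follows. This is a minimal illustration in Python, not rmlint's implementation; the function name `group_by_size` is hypothetical, and issuing the actual clone requests is deliberately left out.

```python
import os
from collections import defaultdict


def group_by_size(root):
    """Map file size -> sorted list of regular-file paths of that size under root.

    Only sizes shared by two or more files are returned, since a file
    with a unique size cannot have a duplicate.
    (Hypothetical helper for illustration; not part of rmlint.)
    """
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)
    return {size: sorted(paths) for size, paths in by_size.items() if len(paths) > 1}
```

    Files of equal size are only candidates. They still need a byte-for-byte comparison, which the request proposes to delegate to the filesystem.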

    Thanks

    Bug 
    opened by saintger 36
  • 'O_LARGEFILE' error on compilation (OSX)

    I am trying to compile rmlint on OSX 10.10.3 and I have the following error message:

    lib/utilities.h:69:13: error: use of undeclared identifier 'O_LARGEFILE'

    Do you know how I can resolve it?

    Bug sahib-broke-it 
    opened by luclaurent 31
  • option --no-hardlinked does not delete all duplicates (& manpage suggestions)

    I am not sure whether I interpret the --no-hardlinked option correctly, but it seems there is a bug:

    Setup:

    $ echo "text" > file_A
    $ ln file_A file_B
    $ echo "text" > file_Z
    $ ls -l --inode
    33106  […]  2  […]  file_A
    33106  […]  2  […]  file_B
    33107  […]  1  […]  file_Z
    

    Test --no-hardlinked:

    $ rmlint  -S A  --no-hardlinked
    
    # Duplicate(s):
        ls '/tmp/testdir/file_Z'
        rm '/tmp/testdir/file_B'
    
    ==> In total 3 files, whereof 1 are duplicates in 1 groups.
    ==> This equals 0 B of duplicates which could be removed.
    

    I expected the file file_A to be removed as well – but it isn't. Is this the intended behavior?
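
    As background for the setup above: hardlinks are directory entries that share one inode, so they can be recognised by comparing device and inode numbers. The following Python sketch recreates the three files and groups them that way. It is an illustration only; the helper name `hardlink_groups` is hypothetical, and the sketch says nothing about how rmlint itself handles hardlinks.

```python
import os
import tempfile
from collections import defaultdict


def hardlink_groups(paths):
    """Group paths by (device, inode); names in one group are hardlinks of each other."""
    groups = defaultdict(list)
    for path in paths:
        st = os.lstat(path)
        groups[(st.st_dev, st.st_ino)].append(os.path.basename(path))
    return sorted(sorted(names) for names in groups.values())


def demo():
    """Recreate file_A, file_B (hardlink of file_A) and file_Z, then group them."""
    with tempfile.TemporaryDirectory() as d:
        a, b, z = (os.path.join(d, n) for n in ("file_A", "file_B", "file_Z"))
        with open(a, "w") as f:
            f.write("text\n")
        os.link(a, b)            # equivalent to: ln file_A file_B
        with open(z, "w") as f:  # same content, but a separate inode
            f.write("text\n")
        return hardlink_groups([a, b, z])


if __name__ == "__main__":
    print(demo())
```

    On a filesystem that supports hardlinks, file_A and file_B end up in one group and file_Z in another, which mirrors the inode column of the ls output above.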

    Overview & Suggestion for the man page

    Actually it took me some time to understand the intention of the --hardlinked/--no-hardlinked option. If I (now) interpret it correctly, the following tables show the intended meaning of the (default) --hardlinked option and the --no-hardlinked option (with the bug): (For both options I included alphabetical ascending vs descending ranking so that the hardlinked file is one time the original and one time the duplicate.)

    --hardlinked:

    File vs Option    | --hardlinked -S a  | --hardlinked -S A
    -----------------:|:------------------:|:------------------:
    file_A (inode 6): | :white_check_mark: | :x:
    file_B (inode 6): | :x:                | :x:
    file_Z (inode 7): | :x:                | :white_check_mark:

    --no-hardlinked (incl. bug):

    File vs Option    | --no-hardlinked -S a                       | --no-hardlinked -S A
    -----------------:|:------------------------------------------:|:--------------------:
    file_A (inode 6): | :white_check_mark:                         | ??? :x: :zap:
    file_B (inode 6): | also: :white_check_mark: :thought_balloon: | :x:
    file_Z (inode 7): | :x:                                        | :white_check_mark:

    Key:

    • :white_check_mark: : Considered as an original (will not be removed).
    • :x: : Considered as a duplicate (will be removed).
    • :zap: : Bug!? Currently not removed (like an original), but should be removed (like a duplicate), right?
    • :thought_balloon: : Just another thought: This "additional original" is currently silently ignored, i.e. it does not give any output in the summary or the generated script. Maybe this could also be shown with ls, and the generated script could also call this with original_cmd?


    Currently the manpage explains both options as:

    "Whether to report hardlinked files as duplicates. Hardlinked files will not appear as space waste in the statistics, since they do not allocate any extra space."

    As I had difficulty understanding the meaning of this, I would like to share some ideas:

    • Instead of "to report" use the term "to treat/take/consider hardlinked files as duplicates"? (Initially I was confused, which "report" this would refer to…)

    • In particular, I had difficulty (1) telling the two options apart and (2) interpreting the effect of --no-hardlinked. Maybe one could improve the wording, like:

      "Treat each hardlink of a file as a duplicate (--hardlinked) or take all hardlinks as one subgroup that collectively either counts as original or as duplicates (--no-hardlinked)." (??)

    • Also: Does the final sentence hold true for both options? If so, maybe one could start it like:

      "In any case/Independent of this option, [hardlinked files will not…]"

    Feature Request 
    opened by Awerick 31
  • rmlint with FIEMAP is slow on large dataset

    In my testing rmlint -b --without-fiemap is way faster than rmlint -b. Profiling shows that most of the time is spent in g_sequence_search. I suspect that the current implementation shows O(n^2) behavior on fragmented files. I created a test case which should demonstrate it. Create two directories and run the following inside them:

        seq 259 | while read n; do fallocate -l 4194304 $n; seq 0 8192 4194304 | xargs -I xxx fallocate -p -o xxx -l 4k $n; done

    While this test case is artificial, it exhibits the same symptoms as the real system.

    opened by vvs- 28
  • cfg.c: fix bug introduced in 2.10.0

    This reversion to the code from version 2.9.0 addresses #438. Obviously, some functionality is being implemented here that I am not familiar enough with the project to understand, but hopefully this gives someone a head start in tracking down the bug while still keeping the intended new behavior.

    opened by ChrisBaker97 27
  • -u does not limit memory consumption

    Firstly, rmlint is AWESOME, thanks, I'd been looking for a great dedup tool like this for years !!!

    But, I either don't understand the -u switch or it doesn't work properly. No matter what I specify for -u, all system memory is consumed and the oom_reaper kills rmlint after a while.

    Running on an odroid hc1, rmlint spawns several processes (seems one per cpu core) and each of them uses much more memory than the -u limit. Changing the -u value does not seem to have any effect. Memory consumption grows over time.

    Here is the htop output to demonstrate:

    [email protected]:/etc/monit# htop

    ....

    Mem[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||897M/1.95G] Tasks: 97, 121 thr; 11 running

      PID USER    PRI  NI  VIRT   RES   SHR S CPU% MEM%    TIME+ Command
    18702 andrev   20   0  595M  437M  4036 S 772. 21.9 54:02.25 rmlint -e -u 128M HomeVideo/ Camcorder/
    19870 andrev   20   0  595M  437M  4036 R 98.7 21.9  0:06.68 rmlint -e -u 128M HomeVideo/ Camcorder/
    19886 andrev   20   0  595M  437M  4036 R 98.7 21.9  0:04.11 rmlint -e -u 128M HomeVideo/ Camcorder/
    19744 andrev   20   0  595M  437M  4036 R 98.1 21.9  0:23.77 rmlint -e -u 128M HomeVideo/ Camcorder/
    19825 andrev   20   0  595M  437M  4036 R 92.4 21.9  0:11.42 rmlint -e -u 128M HomeVideo/ Camcorder/
    19847 andrev   20   0  595M  437M  4036 R 65.8 21.9  0:05.05 rmlint -e -u 128M HomeVideo/ Camcorder/
    19698 andrev   20   0  595M  437M  4036 R 65.2 21.9  0:19.76 rmlint -e -u 128M HomeVideo/ Camcorder/
    19859 andrev   20   0  595M  437M  4036 R 62.7 21.9  0:05.96 rmlint -e -u 128M HomeVideo/ Camcorder/
    19841 andrev   20   0  595M  437M  4036 S 45.6 21.9  0:07.03 rmlint -e -u 128M HomeVideo/ Camcorder/
    19891 andrev   20   0  595M  437M  4036 R 38.6 21.9  0:01.48 rmlint -e -u 128M HomeVideo/ Camcorder/
    19754 andrev   20   0  595M  437M  4036 R 24.7 21.9  0:16.99 rmlint -e -u 128M HomeVideo/ Camcorder/
    18727 andrev   20   0  595M  437M  4036 S 14.6 21.9  0:40.87 rmlint -e -u 128M HomeVideo/ Camcorder/
    18728 andrev   20   0  595M  437M  4036 S  9.5 21.9  0:43.00 rmlint -e -u 128M HomeVideo/ Camcorder/
    19849 root     20   0  9856  4872  4040 S  6.3  0.2  0:00.23 /usr/sbin/sshd -D -R
    29002 root     20   0  5080  2092  1320 R  1.9  0.1 18:10.34 htop
     2501 plex     20   0  248M  1524     0 S  0.6  0.1  1h57:28 /usr/lib/plexmediaserver/Plex DLNA Server

    ...

    Bug 
    opened by Blindfreddy 25
  • Support cp --reflink for filesystem that support it.

    Support reflinks for filesystems that support them. We had this previously in #93, but it was in a rough, unfinished state as far as I know.

    Feature Request 
    opened by sahib 24
  • --replay doesn't work with -D

    Running

        rmlint -Dkm /media/user/DiskS/ // /media/user/DiskL/

    produces long lists of: # Empty dir(s), # Empty file(s), # Duplicate Directorie(s) and # Duplicate(s).

    If immediately followed by

        rmlint --replay rmlint.json -Dkm /media/user/DiskS/ // /media/user/DiskL/

    it lists only # Duplicate(s), and that list includes exactly as many elements as the previous # Duplicate(s) list, resulting in far fewer duplicates found.

    Similarly, rmlint -km /media/user/DiskS/ // /media/user/DiskL/ followed by rmlint --replay rmlint.json -Dkm /media/user/DiskS/ // /media/user/DiskL/ doesn't produce the expected # Duplicate Directorie(s) either.

    Feature Request 
    opened by misieck 19
Releases(v2.10.1)
  • v2.10.1(Jun 13, 2020)

  • v2.10.0(May 31, 2020)

  • v2.9.0(Aug 20, 2019)

  • v2.8.0(Oct 30, 2018)

  • v2.7.0(Apr 25, 2018)

  • v2.6.1(Jun 15, 2017)

    This is a bugfix release for v2.6.0, which had some serious problems with the generated bash scripts on some platforms (#231, #232, #233, #234, #235, #236). All users are advised to update quickly.

    As always, please refer to the CHANGELOG.md for more details.

    If you still wonder why this release is called Penetrating Pineapple you might watch this (or don't, my sense of humor is embarrassing anyways).

  • v2.6.0(Jun 3, 2017)

    New release with a bunch of smaller features. The highlights are:

    • Fix inconsistent handling of duplicate directories in the shellscript.
    • Speed improvement: Do not even hash one file of a hardlink group if no -D is passed.
    • Switch default hashing algorithm to the newly implemented blake2b.
    • -o stats will print some internal statistics about a run.
    • The progressbar will display a time estimate of how long the run will take.
    • --equal is able to compare files or directories to each other.
    • -D learned to also honour the directory layout of two dirs with a special option (-j / --honour-dir-layout)
    • Many smaller bugfixes, upgrading to the latest version is advised.

    As always, refer to the CHANGELOG.md for a lot more details.

  • v2.4.6(Jan 16, 2017)

  • v2.4.5(Dec 12, 2016)

    Fixed

    • Make --replay truly merge different sets of duplicates.
    • Call exit(1) when getting a fatal signal (somehow was missing)
    • scons test now executes only the sane part of the testsuite.
    • Be more friendly when no manpage was found (and show --help)
    • Handle readonly btrfs subvolumes well. See also: https://github.com/sahib/rmlint/issues/195
    • Various build errors fixed for old/rare systems.
    • Various fixes in the gui, mostly related to old GTK versions.

    Added

    • New option --mtime-window: Only consider files as duplicates if they share an mtime within a certain time window. See also: https://github.com/sahib/rmlint/issues/197
    • New sortcriteria O (maximize outside hardlinks) and H (maximize total hardlinks). See also: https://github.com/sahib/rmlint/issues/196
    • Proper installation instructions for macOS.
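
    One plausible reading of --mtime-window is that files with identical content are split into subgroups whose modification times lie close together. The Python sketch below implements that reading by starting a new subgroup whenever the gap between consecutive mtimes exceeds the window. The clustering rule, the function name `split_by_mtime_window` and the sample values are assumptions made for illustration; the manpage remains the authority on the actual behaviour.

```python
def split_by_mtime_window(files, window):
    """Split (path, mtime) pairs into clusters of neighbouring mtimes.

    files  -- iterable of (path, mtime_in_seconds) for files with identical content
    window -- maximum allowed gap in seconds between consecutive mtimes
    Returns lists of paths; clusters with a single member are dropped,
    because a lone file has nothing to be a duplicate of.
    (Hypothetical helper; the gap-based rule is an assumption.)
    """
    clusters = []
    for path, mtime in sorted(files, key=lambda item: item[1]):
        # Compare against the mtime of the last file in the current cluster.
        if clusters and mtime - clusters[-1][-1][1] <= window:
            clusters[-1].append((path, mtime))
        else:
            clusters.append([(path, mtime)])
    return [[p for p, _ in cluster] for cluster in clusters if len(cluster) > 1]


if __name__ == "__main__":
    sample = [("a", 100), ("b", 103), ("c", 500), ("d", 501.5), ("e", 900)]
    print(split_by_mtime_window(sample, window=5))
```

    With a window of 5 seconds, the sample yields two subgroups (a with b, c with d), and e is left out because no other file has an mtime near it.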

    Changed

    • Re-Design --replay to accept // like the normal commandline does.
    • New default sortcriteria is pOMa to maximize the chance of deleting the most bytes from the storage.

    See the full CHANGELOG.md for more details about other releases.

  • v2.4.4(Apr 7, 2016)

  • v2.4.3(Mar 20, 2016)

    Includes the following bugfixes:

    • Fix symbolic link emitting in sh script (sometimes files were omitted from rmlint.sh)
    • Fix compile stop on BSD systems in utilities.c (thanks f99aq8ove)
    • Fix some compiler warnings and typos.

    Also:

    • Add basic spanish translation.
    • Add basic compile support on cygwin.
  • v2.4.2(Dec 14, 2015)

    This release contains a collection of small corrections and minor changes.

    Please refer to the CHANGELOG.md for a detailed list of changes:

    https://github.com/sahib/rmlint/blob/master/CHANGELOG.md

  • v2.4.1(Dec 14, 2015)

    This release contains a collection of small corrections and minor changes.

    Please refer to the CHANGELOG.md for a detailed list of changes:

    https://github.com/sahib/rmlint/blob/master/CHANGELOG.md

  • v2.4.0(Oct 25, 2015)

  • v2.2.0(May 9, 2015)

    We're proud to release the new rmlint version 2.2.0 "Dreary Dropbear"!

    Rmlint is a fast, featureful but still easy to use lint finder. This new release includes over 400 commits and some noticeable improvements:

    • rmlint now scales very well to several million files, both speed- and memory-wise.
    • It should be generally faster and have a lower memory footprint.
    • Heavily extended testsuite to make you feel safer (lcov output)
    • Fix some annoying bugs and crashes (especially on 32bit)
  • v2.1.0(May 8, 2015)

  • v2.0.0(Jan 24, 2015)

    This is a temporary release before 2.1.0. It does not have the experimental features currently in the develop branch and can be considered stable with minor annoyances.

  • v2.0.0-alpha(Dec 18, 2014)

    This is the first release of the 2.x series of rmlint. It is mostly feature-complete and should not have fatal bugs.

    Users are welcome to test and report back.

    If no fatal bugs are encountered, the final 2.0.0 version should be released in a few weeks.

  • v1.0.6(Dec 1, 2014)

    Last version before the rewrite in 2013. This version contains many bugs and is built on a really bad design. We recommend never using it, except for maniac giggles.

Owner
Chris Pahl
Writes lots of (often free) software.