Dedup script v0.2

Last Updated or created 2022-06-29

Update 20220510

While sorting out my fileserver, I needed a deduplication script.
I've been copying many files from backups, clouds, mobile devices and workstations, so ending up with many copies was inevitable.

The script below walks a directory and uses locate to find files with the same name elsewhere. It compares md5sums to check whether they are really the same file. As soon as it finds an identical file it stops searching, removes the copy from the directory being checked and moves on to the next file.

#!/bin/bash
# Copy this script into the directory you want to clean.
# If a copy of this script also exists elsewhere on your fileserver,
# the copy in the directory being cleaned will be removed as well.
# Don't want that? Change
# find -type f | 
# into
# find -type f | grep -v <nameofthisscript> |

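# $dont is the current directory; locate results inside it are skipped
dont=$(pwd)
# Never want to start in /mnt? Uncomment below
# (braces instead of parentheses, so that exit stops the script and not just a subshell)
# echo "$dont" | grep -q "^/mnt" && { echo "start in tank" ; exit 1 ; }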

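find . -type f | while IFS= read -r file ; do
        filemd5=$(md5sum "$file" | cut -f1 -d" ")
        basenamefile=$(basename "$file")
        echo "searching $basenamefile"
        # grep -F matches the path literally; the trailing / keeps e.g. /data/clean2
        # from being skipped when cleaning /data/clean
        locate -i "/$basenamefile" | grep -vF "$dont/" | while IFS= read -r location ; do
                # the locate DB can be stale, so check that the file still exists
                if [ -f "$location" ] ; then
                        locatedfilemd5sum=$(md5sum "$location" | cut -f1 -d" ")
                        if [ "$filemd5" = "$locatedfilemd5sum" ] ; then
                                echo "found same md5sum at $location"
                                rm "$file"
                                break
                        fi
                fi
        done
done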
# Remove empty dirs?
# find . -type d -empty -delete
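
The script trusts that a located file with the same md5sum is a separate copy. If the located path is a hard link or symlink to the file being checked, removing the file could remove the only real copy. Below is a minimal sketch of a stricter check, assuming bash and GNU coreutils; the function name is_duplicate is made up for this example and is not part of the script above.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Succeeds only when "$2" is a regular file with the same md5sum as "$1"
# and is not the very same file (hard link or symlink to it).
is_duplicate() {
        local file=$1 candidate=$2
        local a b
        # both paths must exist and be regular files
        [ -f "$file" ] && [ -f "$candidate" ] || return 1
        # -ef is true when both names point at the same device and inode
        [ "$file" -ef "$candidate" ] && return 1
        a=$(md5sum < "$file" | cut -f1 -d" ")
        b=$(md5sum < "$candidate" | cut -f1 -d" ")
        [ "$a" = "$b" ]
}
```

In the script, the two nested if statements could then probably be replaced by a single: if is_duplicate "$file" "$location" ; then ... fi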

Locate can be slow; sometimes it is better to put the locate DB in memory or on another fast storage system.

mkdir /ramdisk
mount ramfs -t ramfs /ramdisk/
cp /var/lib/mlocate/mlocate.db /ramdisk/ 

# change the locate command in the script above so it uses the copy
locate -d /ramdisk/mlocate.db -i IMG20191123.jpg
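
A ramfs is empty again after a reboot or unmount, so a hard-coded ramdisk path makes locate fail until the DB is copied back. A small sketch of a fallback, assuming the two paths used above; the function name pick_locate_db is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: print the fast copy of the locate DB when it is
# readable, otherwise fall back to the regular location.
# $1 = fast copy (default /ramdisk/mlocate.db)
# $2 = regular DB (default /var/lib/mlocate/mlocate.db)
pick_locate_db() {
        local fast=${1:-/ramdisk/mlocate.db}
        local regular=${2:-/var/lib/mlocate/mlocate.db}
        if [ -r "$fast" ] ; then
                printf '%s\n' "$fast"
        else
                printf '%s\n' "$regular"
        fi
}
```

The locate line in the script could then read: locate -d "$(pick_locate_db)" -i "/$basenamefile"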

Want to remove empty directories as well?
Add the line below at the end of the script.

find . -type d -empty -delete
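
Since -delete cannot be undone, it may be worth looking at what would go first. A minimal sketch of a preview, assuming GNU find; the function name list_empty_dirs is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: print the empty directories below $1 (default: the
# current directory) without deleting anything.
# -mindepth 1 keeps the start directory itself out of the list.
list_empty_dirs() {
        find "${1:-.}" -mindepth 1 -type d -empty -print
}
```

Note that this only lists directories that are empty right now. The -delete variant works depth-first, so it also removes parent directories that become empty once their empty subdirectories are gone.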