Last Updated or created 2022-06-29
While sorting out my fileserver, I needed a deduplication script.
I've been copying many files from backups, clouds, mobile devices and workstations, so it was inevitable that I ended up with many copies.
The script below walks a directory and uses locate to find files with the same name elsewhere. It then compares md5sums to check whether they are the same file. As soon as an identical file is found, it stops searching, removes the copy from the directory being checked, and moves on to the next file.
#!/bin/bash
# Copy this script to the directory you want to clean.
# If a copy of this script exists elsewhere on your fileserver,
# the copy in the directory being cleaned will be removed as well.
# Don't want that? Change
#   find . -type f |
# into
#   find . -type f | grep -v <nameofthisscript> |

# dont is the current directory; locate results below it are skipped
dont=$(pwd)

# Never start in /mnt? Uncomment below
# (braces instead of parentheses, so that exit really stops the script)
# echo "$dont" | grep "^/mnt" && { echo "start in tank" ; exit 1 ; }

find . -type f | while IFS= read -r file ; do
    filemd5=$(md5sum "$file" | cut -f1 -d" ")
    basenamefile=$(basename "$file")
    echo "searching $basenamefile"
    # -F matches the directory name literally instead of as a regex
    locate -i "/$basenamefile" | grep -vF "$dont/" | while IFS= read -r location ; do
        if [ -f "$location" ] ; then
            locatedfilemd5sum=$(md5sum "$location" | cut -f1 -d" ")
            if [ "$filemd5" == "$locatedfilemd5sum" ] ; then
                echo "found same md5sum at $location"
                rm "$file"
                break
            fi
        fi
    done
done

# Remove empty dirs?
# find . -type d -empty -delete
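The core check of the script, "is this the same file?", can be pulled out into a small function if you want to try it on two files before letting the script remove anything. This is only a sketch; the name same_md5 is made up for this example and is not part of the script above.

```shell
# same_md5 is a hypothetical helper, not part of the original script.
# Returns 0 when both arguments are regular files with the same md5sum,
# and 1 otherwise (including when one of them is missing).
same_md5() {
    local a="$1" b="$2"
    local sum_a sum_b
    [ -f "$a" ] && [ -f "$b" ] || return 1
    # Reading from stdin keeps the file name out of the md5sum output
    sum_a=$(md5sum < "$a" | cut -f1 -d" ")
    sum_b=$(md5sum < "$b" | cut -f1 -d" ")
    [ "$sum_a" = "$sum_b" ]
}
```

For example, same_md5 ./IMG20191123.jpg /some/other/IMG20191123.jpg && echo "duplicate" prints "duplicate" only when both files exist and have the same checksum (the second path is just a placeholder).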
Locate can be slow; sometimes it is better to put the locate DB in memory or on another fast storage system.
mkdir /ramdisk
mount -t ramfs ramfs /ramdisk/
cp /var/lib/mlocate/mlocate.db /ramdisk/
# change the locate command in the script above so it uses the copy, for example:
locate -d /ramdisk/mlocate.db -i IMG20191123.jpg
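A copy of the database goes stale as soon as updatedb runs again. As a rough sketch, a helper like the one below could pick the fast copy only while it is not older than the original. The name pick_locate_db is an assumption of mine, and both paths are passed in as arguments rather than hard-coded.

```shell
# pick_locate_db is a hypothetical helper, not part of the original script.
# Prints the fast copy when it exists and is not older than the original
# database, otherwise prints the path of the original database.
pick_locate_db() {
    local original="$1" fast_copy="$2"
    if [ -f "$fast_copy" ] && [ ! "$fast_copy" -ot "$original" ]; then
        printf '%s\n' "$fast_copy"
    else
        printf '%s\n' "$original"
    fi
}

# Possible use in the script (commented out, it needs locate and the database):
# db=$(pick_locate_db /var/lib/mlocate/mlocate.db /ramdisk/mlocate.db)
# locate -d "$db" -i "/$basenamefile"
```

Plain cp gives the copy a fresh modification time, so the copy counts as current until the original database is rebuilt.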
And remove empty directories?
Add the line below at the end of the script:
find . -type d -empty -delete
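If you would rather keep the starting directory itself, even when it ends up empty, the same find command can be wrapped as in the sketch below. The function name remove_empty_dirs is made up for this example.

```shell
# remove_empty_dirs is a hypothetical wrapper around the find command above.
# Takes the directory to clean as its first argument (default: current directory).
remove_empty_dirs() {
    local root="${1:-.}"
    # -mindepth 1 keeps the starting directory itself, even when it is empty.
    # -delete implies -depth, so nested empty directories are removed in one pass.
    find "$root" -mindepth 1 -type d -empty -delete
}
```

Because children are handled before their parents, a chain like a/b/c with nothing in it disappears completely in a single run, while directories that still contain files are left alone.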