My data backup and archive system
Every time I need to archive data, I feel a sense of dread at the prospect of having to tease apart my complex data archival system until I understand it well enough to update the archive. I aim to fix that by documenting the system here.
Data backup procedure
Data that gets backed up is data that I am working on or need frequent access to. Most of it is in my home directory.
Daily on-site data backup
I use Restic for daily incremental backups via the following cron job:
0 10 * * * sudo restic backup --repo /mnt/<drv>/RESTIC --password-file=/path/to/password --exclude-caches=true --exclude-file=/path/to/ignore /home/<user> /var/lib/forgejo /etc
Restic backs up encrypted data to a separate SSD. Each day a new snapshot is created. Because Restic features data de-duplication, only the new or changed files are copied over to the new snapshot each day, making it extremely storage-efficient.
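To spot-check that snapshots are accumulating and that de-duplication is keeping the repository compact, Restic's snapshots and stats commands can be used (same placeholders as the cron job above):
$ sudo restic snapshots --repo /mnt/<drv>/RESTIC --password-file=/path/to/password
$ sudo restic stats --mode raw-data --repo /mnt/<drv>/RESTIC --password-file=/path/to/password
The raw-data mode reports the de-duplicated size actually stored on disk, rather than the logical size a restore would produce.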
Of course, this backup would be useless in the event of a fire or other disasters, so I keep an off-site backup to protect against that scenario.
I use the following cron job to delete all but the last 30 snapshots (1 month):
30 10 * * * sudo restic forget --keep-last 30 --prune --repo /mnt/<drv>/RESTIC --password-file=/path/to/password --group-by ''
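Restic's forget command also accepts a --dry-run flag, which is a handy way to preview exactly which snapshots the policy would remove before trusting it to a cron job:
$ sudo restic forget --keep-last 30 --dry-run --repo /mnt/<drv>/RESTIC --password-file=/path/to/password --group-by ''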
Monthly off-site data backup
Once a month (or more), I retrieve my off-site HDD and update my off-site backup. First, I connect the HDD to my computer, enter the encryption passphrase for the LUKS partition, and then run the following commands to create a new Restic snapshot:
$ restic backup --repo /mnt/offsite/RESTIC/ --password-file=/path/to/password --exclude-caches=true --exclude-file=/path/to/ignore /home/<user> /var/lib/forgejo/ /etc
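For completeness, unlocking and mounting the drive beforehand looks roughly like this; the device name /dev/sdX1 and the mapper name offsite are illustrative stand-ins for whatever the drive shows up as:
$ sudo cryptsetup open /dev/sdX1 offsite
$ sudo mount /dev/mapper/offsite /mnt/offsite
And after the backup finishes, the reverse:
$ sudo umount /mnt/offsite
$ sudo cryptsetup close offsite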
TODO Second off-site backup HDD
A second off-site HDD would make it easier to update the off-site backup more often and without any gaps in protection from catastrophe. I need this ASAP.
Weekly off-line data backup
Once a week (or more), I connect my bigger on-site data archive mirror HDD and create a Restic snapshot using the same commands as above. This helps limit the damage from malicious or accidental data loss, but it wouldn't mitigate bit rot until Restic implements error-correcting codes.
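Since this mirror spends most of its time disconnected, it's also a good moment to verify the repository itself. Restic's check command does this, and --read-data makes it read and verify every pack file (slow, but thorough); the mount point below is a placeholder:
$ restic check --read-data --repo /mnt/<drv>/RESTIC --password-file=/path/to/password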
Data archiving procedure
About once per year, or whenever my SSD starts filling with photos or videos, I move files from my home directory to my data archive.
Since most of our family photos exist exclusively in digital format, I take the data integrity of my archive very seriously. In the past, I used special archival-grade DVDs and created parity files for correcting bit rot. As photo and video file sizes grew, I moved to HDDs, but I didn't continue using parity files. Instead, the strategy I've adopted is to compare file checksums and, when data corruption is found, copy a good version of the affected files from one of the data archive mirrors.
Archive media
The data archive is stored on 2 on-site HDDs and 1 off-site HDD, all with ext4 filesystems. They're stored in a Faraday bag to protect against solar flares or EMP.
I usually buy a new HDD every couple of years and rotate out the oldest (smallest) one. I think of the newest HDD as the primary copy of the archive and the other 2 as mirrors. I like to label the drives with a name and a list of dates for when they were updated.
TODO Comply with 3-2-1 rule
My data archive doesn't meet the 3-2-1 rule (not using 2 different storage media). I don't think it's feasible to store all my data on two different types of media anymore, but I should make a copy of my most critical data on immutable optical media, such as DVDs or Blu-ray discs.
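If I do this, burning a directory of critical files would be a one-liner with growisofs from dvd+rw-tools; the /dev/sr0 device and the ~/ARCHIVE/critical directory here are hypothetical:
$ growisofs -Z /dev/sr0 -R -J ~/ARCHIVE/critical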
Step 1: Scrubbing data
I have noticed some bit rot in some of my files created in the 90s. To avoid further data degradation, I have begun scrubbing the data: I compare the md5sum of each file against the pre-computed md5sum from the last time I updated the archive (stored in ~/ARCHIVE).
I do the following each time before adding any data to the archive:
$ cd /mnt/<drv>/ARCHIVE
$ md5sum -c ~/ARCHIVE/<drv>-<YYYY>-<MM>-<DD>-<drv>-ARCHIVE.md5 > ~/ARCHIVE/<drv>-ARCHIVE-<YYYY>-<MM>-<DD>-vs-<YYYY>-<MM>-<DD>-check.md5
$ grep FAILED ~/ARCHIVE/<drv>-ARCHIVE-<YYYY>-<MM>-<DD>-vs-<YYYY>-<MM>-<DD>-check.md5
If any files failed the checksum comparison, they would show up in the output of the last command, which greps the md5sum -c output for FAILED.
I actually haven't found any bit rot since I started doing this. When I do find corrupted files, I'll locate a copy in one of the other archive mirrors that passes the checksum test and copy it over the corrupted file on the primary archive drive.
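The repair itself would be something like the following, where path/to/<file> is the corrupted file and <drv2> is a mirror holding a good copy; I'd verify the mirror copy's hash against the one recorded in ~/ARCHIVE before copying:
$ md5sum /mnt/<drv2>/ARCHIVE/path/to/<file>
$ sudo cp -a /mnt/<drv2>/ARCHIVE/path/to/<file> /mnt/<drv1>/ARCHIVE/path/to/<file>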
TODO Start using btrfs filesystem
I should automate this step by using the btrfs filesystem, which checksums all data and can detect corruption with its built-in scrub command.
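On a btrfs-formatted archive drive, Step 1 would reduce to roughly the following; note that on a single device btrfs can detect corruption this way, but it can only self-repair if the data is stored redundantly (e.g. with a DUP data profile):
$ sudo btrfs scrub start /mnt/<drv>
$ sudo btrfs scrub status /mnt/<drv>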
Step 2: Adding files to archive
Whenever data is ready to be archived, I move it to a special directory under ~/ARCHIVE. When that directory becomes large, I move the data into my actual data archive with the following commands:
$ rsync -avP ~/ARCHIVE/<dir1>/ /mnt/<drv>/ARCHIVE/<dir1>/
$ rsync -avP ~/ARCHIVE/path/to/<dir2>/ /mnt/<drv>/ARCHIVE/path/to/<dir2>/
$ rsync -avP [...]
Step 3: Recomputing file hashes on updated archive
After new files have been added to the data archive, I recompute the hash for each of the files:
$ cd /mnt/<drv>/ARCHIVE
$ find . -type f -regextype posix-extended -iregex '.*\.(avi|jp[e]?g|m4v|mkv|mlt|mov|mp[34]|mp[e]?g|ogv|wav|webm|bmp|cr2|doc|ppm|psd|thm|tif|xcf|xmp|zip|txt|org|md|sql)$' -exec md5sum '{}' \; > ~/ARCHIVE/<drv>-<YYYY>-<MM>-<DD>-<drv>-ARCHIVE.md5
(I limit the files hashed by the find command because certain types of files [empty files?] were giving me problems.)
Step 4: Replicating data archive to other archive mirrors
Next, I copy the new or updated files over to the archive mirrors with:
$ sudo rsync -avP /mnt/<drv1>/ARCHIVE/ /mnt/<drv2>/ARCHIVE/
$ sudo rsync -avP /mnt/<drv1>/ARCHIVE/ /mnt/<drv3>/ARCHIVE/
If any files were moved or deleted from the archive, that needs to be propagated as well. First, see which files would be deleted:
$ sudo rsync -avP --delete --dry-run /mnt/<drv1>/ARCHIVE/ /mnt/<drv2>/ARCHIVE/
If the output looks good, run it for real:
$ sudo rsync -avP --delete /mnt/<drv1>/ARCHIVE/ /mnt/<drv2>/ARCHIVE/
Repeat for any other mirrors:
$ sudo rsync -avP --delete --dry-run /mnt/<drv1>/ARCHIVE/ /mnt/<drv3>/ARCHIVE/
$ sudo rsync -avP --delete /mnt/<drv1>/ARCHIVE/ /mnt/<drv3>/ARCHIVE/
Step 5: Removing archived files from home directory
Finally, remove the files that have been safely archived from ~/ARCHIVE.
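Before deleting anything, a checksum-based dry run can confirm that everything really made it into the archive; the -n flag makes rsync report any differences without copying:
$ rsync -avn --checksum ~/ARCHIVE/<dir1>/ /mnt/<drv>/ARCHIVE/<dir1>/
If that prints no file names, the local copies are safe to remove.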