Sebastian's personal website

Getting Data Into The Datalake

Written by Sebastian Dümcke

Here I assume you have some kind of storage set up, say a NAS, and personal data scattered across many hard drives, USB thumb drives and other data sources. I wrote the following script to ingest this data into the personal datalake. It first creates an image of the whole device and then copies all files into the datalake, preserving attributes such as timestamps. In a final pass the data is checksummed to make sure the transfer did not introduce any errors.

This way we already fulfill the 3-2-1 backup rule: 3 copies of each file (1 on the original drive, 1 in the datalake and 1 in the disk image), 2 media (NAS and drive) and 1 off-site copy (if we store the disk images encrypted in the cloud or keep the drives in a bank safe). The disk images also serve as a fail-safe in case we forgot to copy attributes that are specific to a certain file system. Each disk image file can be mounted into the directory tree with: mount -o loop dd.img mount-point.
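One caveat: mount -o loop works directly only when the image holds a single file system, that is, when it was taken from a partition. In an image of a whole disk the file system starts at some offset behind the partition table. The following is a minimal sketch of how that offset could be computed, assuming 512-byte sectors and a start sector read from the output of fdisk -l dd.img. The function name partition_offset and the sample sector 2048 are illustrative and not part of the original script.

```shell
#!/bin/bash
# Convert a partition's start sector (as listed by `fdisk -l dd.img`)
# into the byte offset expected by mount's offset= option.
# Assumption: 512-byte sectors unless a second argument says otherwise.
partition_offset() {
    local start_sector=$1
    local sector_size=${2:-512}
    echo $(( start_sector * sector_size ))
}

# Example (needs root, so it is left commented out):
# mount -o loop,ro,offset="$(partition_offset 2048)" dd.img mount-point
```

As an alternative, losetup --find --show --partscan dd.img should expose each partition of the image as its own loop device, which can then be mounted without computing an offset.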

Here is the full script:

#!/bin/bash
if [[ $EUID -ne 0 ]]; then
   echo "This script must be run as root" 
   exit 1
fi
if [[ $# -lt 1 ]]; then
    echo "Usage: ingest.sh device"
    exit 2
fi
DEVICE=$1
IMAGE_FOLDER="/srv/pool/DiskImages/"
DATALAKE="/srv/pool/DataLake/"
#use the file system UUID as name for the image and the datalake folder
DEVICE_NAME=$(blkid -s UUID -o value "${DEVICE}")
if [[ -z "${DEVICE_NAME}" ]]; then
    echo "Could not determine a UUID for ${DEVICE}"
    exit 3
fi
#make a raw copy of device
echo "making dd image of device ${DEVICE} named ${DEVICE_NAME}"
dd if="${DEVICE}" of="${IMAGE_FOLDER}/${DEVICE_NAME}.ddimg" bs=1M
#mount device
echo "mounting device..."
mkdir -p "/media/${DEVICE_NAME}"
mount "${DEVICE}" "/media/${DEVICE_NAME}"
#rsync data to datalake
su - ${SUDO_USER} -c "rsync -av /media/${DEVICE_NAME}/ ${DATALAKE}/${DEVICE_NAME}/" #TODO: check if destination folder exists. If so, append an _i counter to it
echo "checksumming rsync run..."
su - ${SUDO_USER} -c "rsync -av --checksum /media/${DEVICE_NAME}/ ${DATALAKE}/${DEVICE_NAME}/"
#unmount device and remove mount point
echo "unmounting device..."
umount "/media/${DEVICE_NAME}"
echo "removing mountpoint..."
rmdir "/media/${DEVICE_NAME}"

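The script should be run via sudo, with the device or partition to import into the datalake as its only argument, for example: sudo ./ingest.sh /dev/sdb1 (the device name here is only an illustration). Running it through sudo rather than from a root shell matters, because the rsync steps run as the invoking user taken from SUDO_USER.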
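The TODO about an already existing destination folder could be handled by a small helper. This is a sketch and not part of the original script: the function name unique_dest and the _1, _2, ... suffix scheme are assumptions based on the wording of the TODO comment.

```shell
#!/bin/bash
# Print the first path that does not exist yet: the base path itself,
# or the base path with _1, _2, ... appended.
unique_dest() {
    local base=${1%/}      # drop a trailing slash, if any
    local candidate=$base
    local i=1
    while [[ -e $candidate ]]; do
        candidate="${base}_${i}"
        i=$(( i + 1 ))
    done
    echo "$candidate"
}

# Possible use inside ingest.sh (hypothetical variable DEST):
# DEST=$(unique_dest "${DATALAKE}/${DEVICE_NAME}")
```

With such a helper, a second import of the same drive would land in a new folder next to the first one instead of being merged into it.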