Getting Data Into The Datalake
Here I assume you have some kind of storage set up, say a NAS, and that your personal data is scattered across many hard drives, USB thumb drives and other data sources. I wrote the following script to ingest this data into the personal datalake. It first creates an image of the whole disk and then copies all files into the datalake, preserving attributes such as timestamps. In a last pass the data is checksummed to make sure the transfer did not introduce any errors.
This way we already fulfill the 3-2-1 backup rule: 3 copies of each file
(1 on the drives, 1 in the datalake and 1 in the disk image), 2 media
(NAS and drive) and 1 off-site (if we store the disk images encrypted in
the cloud or the drives in a bank safe). The disk images also serve as a
fail-safe in case we forgot to copy particular attributes special to a
certain file system. Each disk image file can be mounted into the
directory tree with:
mount -o loop dd.img mount-point
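A disk image is only a useful fail-safe if it is identical to the drive it was taken from. The script below does not verify the dd image itself, so here is a minimal sketch of how that could be done, assuming sha256sum from GNU coreutils is available. The function name verify_image is hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper, not part of the original script.
# Compares the SHA-256 checksum of a source (device or file) with its image.
# Returns 0 if they match, 1 if they differ, 2 if either one is unreadable.
verify_image() {
    local source=$1 image=$2
    local source_sum image_sum

    [[ -r "$source" && -r "$image" ]] || return 2

    # Reading from stdin makes sha256sum print "<hash>  -" in both cases,
    # so the two outputs can be compared directly.
    source_sum=$(sha256sum < "$source")
    image_sum=$(sha256sum < "$image")

    [[ "$source_sum" == "$image_sum" ]]
}
```

It could be called right after the dd step, for instance as verify_image "${DEVICE}" "${IMAGE_FOLDER}/${DEVICE_NAME}.ddimg". Note that this reads the whole device a second time, which can take a while for large drives.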
Here is the full script:
#!/bin/bash

if [[ $EUID -ne 0 ]]; then
    echo "This script must be run as root"
    exit 1
fi

if [[ $# -lt 1 ]]; then
    echo "Usage: ingest.sh device"
    exit 2
fi

DEVICE=$1
IMAGE_FOLDER="/srv/pool/DiskImages/"
DATALAKE="/srv/pool/DataLake/"
DEVICE_NAME=$(blkid -sUUID | grep "${DEVICE}" | grep -o '".*"$' | tr -d '"')
#TODO: add checks if DEVICE_NAME comes up empty

#make a raw copy of device
echo "making dd image of device ${DEVICE} named ${DEVICE_NAME}"
dd if="${DEVICE}" of="${IMAGE_FOLDER}/${DEVICE_NAME}.ddimg" bs=1M

#mount device
echo "mounting device..."
mkdir "/media/${DEVICE_NAME}"
mount "${DEVICE}" "/media/${DEVICE_NAME}"

#rsync data to datalake
su - ${SUDO_USER} -c "rsync -av /media/${DEVICE_NAME}/ ${DATALAKE}/${DEVICE_NAME}/"
#TODO: check if destination folder exists. If so append _i counter to it

#second rsync pass compares checksums instead of size and timestamp
echo "checksumming rsync run..."
su - ${SUDO_USER} -c "rsync -av --checksum /media/${DEVICE_NAME}/ ${DATALAKE}/${DEVICE_NAME}/"

#unmount device and remove mount point
echo "unmounting device..."
umount "/media/${DEVICE_NAME}"
echo "removing mountpoint..."
rmdir "/media/${DEVICE_NAME}"
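The two TODO comments in the script could be addressed along the following lines. This is only a sketch under my own assumptions: the helper names require_device_name and unique_destination are hypothetical, and the counter simply tries _1, _2, ... until it finds a path that does not exist yet.

```shell
#!/bin/bash
# Hypothetical helpers sketching the two TODOs; the names are illustrative.

# Fail early when the UUID lookup returned nothing.
require_device_name() {
    local name=$1
    if [[ -z "$name" ]]; then
        echo "could not determine a UUID for the device" >&2
        return 1
    fi
}

# Print a destination path that does not exist yet,
# appending _1, _2, ... to the given path if necessary.
unique_destination() {
    local base=${1%/}        # strip one trailing slash
    local candidate=$base
    local i=1
    while [[ -e "$candidate" ]]; do
        candidate="${base}_${i}"
        i=$((i + 1))
    done
    echo "$candidate"
}
```

In the script this might look like require_device_name "${DEVICE_NAME}" || exit 3 right after the blkid line, and DEST=$(unique_destination "${DATALAKE}/${DEVICE_NAME}") before the first rsync call. Both rsync passes would then need to use the same DEST, otherwise the checksum pass would compare against a different folder.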
The script should be run via sudo with the name of the device/partition to import into the datalake.