How to Deal With Our Digital Lives

Written by Sebastian Dümcke on November 13, 2018
Tags: DAMright

I recently formatted an old hard drive to make space for another project. I had a quick look around the drive, noticed that most files would probably be on other drives as well (e.g. my pictures) and saved a few other files. Then I erased everything. It was only a couple of days later that I realized that the data I had deleted was almost 8 years old. If I had failed to save something, it was now lost forever. And this data actually represents the person I was at that time.

Because of this experience I have decided that I want to set up a system to archive, preserve, use, search, and retrieve my data. Such a project would fall into categories such as archiving, asset management, digital preservation, data lakes, and many more. Since I want to get it right, I will call the project #DAMright (digital asset management done right).

As a start, I want to put down my thoughts on the subject, without much structure. I will then elaborate on each topic in future posts, and hopefully it will all fall into place at some point.

Where to store data

Data will be stored in hot storage on a NAS with backups to cold storage (e.g. Amazon Glacier or maybe one of the new kids on the block: Storj and Sia). The NAS will run OpenMediaVault, with several drives pooled with mergerfs and parity calculated by SnapRAID to safeguard against drive failure. Should the disks be encrypted with LUKS?
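To make the parity idea a bit more concrete, here is a minimal sketch of how a single parity block lets you rebuild the contents of one failed drive. It is a simplified illustration using plain XOR, not SnapRAID's actual implementation; the three "drives" and their contents are made up for the example.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equally sized blocks together, byte by byte."""
    if not blocks:
        raise ValueError("need at least one block")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Three hypothetical data drives, each holding one 8-byte block.
data = [b"photos--", b"letters-", b"receipts"]

# The parity drive stores the XOR of all data blocks.
parity = xor_blocks(data)

# Simulate losing drive 1, then rebuild it from the survivors plus parity.
survivors = [block for index, block in enumerate(data) if index != 1]
rebuilt = xor_blocks(survivors + [parity])
```

One parity block only covers the loss of a single drive at a time, which is why parity complements the cold-storage backup rather than replacing it.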

The NAS hardware will be based on an ASRock J3455 mainboard, as described elsewhere.

How to store data

Not much structure: the whole thing will be a data lake, with everything in one place and many options to facet each search by file type, size, age, media type, content…
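As a rough idea of what such facets could look like, the sketch below walks a directory tree and counts files per media type, size range and age range. The bucket boundaries and function names are arbitrary choices for illustration, and the media type is guessed from the file extension only; a real setup would more likely get its facets from the search index.

```python
import mimetypes
import time
from collections import Counter
from pathlib import Path
from typing import Dict, Optional


def size_bucket(num_bytes: int) -> str:
    """Map a file size to a coarse, human-readable range."""
    if num_bytes < 1_000_000:
        return "< 1 MB"
    if num_bytes < 100_000_000:
        return "1-100 MB"
    return "> 100 MB"


def age_bucket(mtime: float, now: float) -> str:
    """Map a modification time to a coarse age range."""
    years = (now - mtime) / (365.25 * 24 * 3600)
    if years < 1:
        return "< 1 year"
    if years < 5:
        return "1-5 years"
    return "> 5 years"


def facets(root: Path, now: Optional[float] = None) -> Dict[str, Counter]:
    """Count all files below root per media type, size range and age range."""
    now = time.time() if now is None else now
    result = {"media type": Counter(), "size": Counter(), "age": Counter()}
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        stat = path.stat()
        # Guess from the extension only; no file content is inspected here.
        media_type, _encoding = mimetypes.guess_type(path.name)
        result["media type"][media_type or "unknown"] += 1
        result["size"][size_bucket(stat.st_size)] += 1
        result["age"][age_bucket(stat.st_mtime, now)] += 1
    return result
```

Calling facets(Path("/some/folder")) returns one counter per facet, which is enough to show something like "1,200 images, 300 PDFs" next to a search box.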

Data will be ingested and then put into git-annex, which has the added side effect of file-level deduplication. Metadata should also be managed by git-annex because this simplifies the toolchain. Other options would have been tagflow, TagSpaces or Tagsistant.
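File-level deduplication falls out of addressing files by their content rather than by their name. The sketch below illustrates the principle only: it derives a key from the size and SHA-256 checksum of each file (loosely modelled on the keys of git-annex's SHA256 backend) and groups files that share a key. The function names are made up, and this is not how git-annex is implemented internally.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def content_key(path: Path, chunk_size: int = 1 << 20) -> str:
    """Build a key from file size and SHA-256 digest, reading in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return f"SHA256-s{path.stat().st_size}--{digest.hexdigest()}"


def find_duplicates(root: Path) -> Dict[str, List[Path]]:
    """Return all groups of files below root that have identical content."""
    by_key = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_key[content_key(path)].append(path)
    # Keep only keys that are shared by more than one file.
    return {key: paths for key, paths in by_key.items() if len(paths) > 1}
```

Because identical files map to the same key, their content only needs to be stored once, no matter how many old drives the same holiday picture was copied to.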

During ingestion, metadata is automatically extracted from the files using Apache Tika.
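One way this could work is to run Tika in server mode and ask it for the metadata of each ingested file over HTTP. The sketch below assumes a Tika server is already running locally on its default port 9998 and uses its /meta endpoint with a JSON response. The helper names are hypothetical, and nothing here is specific to how the final ingestion pipeline will look.

```python
import json
import urllib.request
from pathlib import Path

# Assumption: a Tika server was started locally and listens on the default port.
TIKA_URL = "http://localhost:9998"


def build_meta_request(path: Path, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Prepare a PUT request that uploads the file to Tika's /meta endpoint."""
    return urllib.request.Request(
        url=f"{base_url}/meta",
        data=path.read_bytes(),
        method="PUT",
        headers={
            "Accept": "application/json",
            # Generic type, so that Tika detects the real format itself.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_metadata(path: Path, base_url: str = TIKA_URL) -> dict:
    """Send the file to the Tika server and return its metadata as a dict."""
    with urllib.request.urlopen(build_meta_request(path, base_url)) as response:
        return json.load(response)
```

Calling extract_metadata(Path("scan.pdf")) would then return whatever fields Tika finds for that file; which fields appear depends on the file format. Note that this reads the whole file into memory, so very large files would need a streaming upload instead.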

Full text search across all data with recoll because I have some previous experience with it. Other options are Terrier IR or DocFetcher.

Where to get data from

Put all data from old drives, USB thumb drives and all Dropbox, OneDrive and GDrive accounts into it. Then start liberating my own data from their digital silos: all social media, or any other online accounts with an export function. With the advent of the GDPR that should be feasible.

Synchronize all data from my mobile phone to this storage. Same with all my computers.

Once we have a place to store all digital assets, we might create more of them, since they don’t take up so much space. Scan all documents, certificates, insurance policies and what not.

Then maybe go paperless: start scanning shopping receipts and such.
