Personal Datalake: Concepts and Goals

In many industries there is a need to aggregate all business data and analyse it to drive business decisions. In the 1990s and early 2000s the concept of the data warehouse was popular: large databases aggregated all data in a unified schema. More recently the concept of a datalake was defined. In a datalake, many different sources feed into the lake, dumping raw data in their own schema without worrying about unification. It is the job of data scientists to extract relevant trends and insights from these data. Since only data scientists should have access to the lake, they create ‘lakeshore marts’, smaller annotated data structures, for other users to interact with. Martin Fowler has an interesting post about the datalake.

During our lifetime, we collect a lot of digital data on our hard drives and USB sticks. I want to apply the concept of a datalake to all my personal data, to store and analyse it. These data are currently spread across many different media (hard drives, USB drives, email servers, CDs, DVDs and paper) and organizing them all is a daunting task. So we will just dump everything into one big folder, make sure it is safe, redundant and backed up, and then create specialized data marts and tools to find files and extract relevant information.

If we apply the concept of a datalake to our own digital files, we create the personal datalake. The majority of our personal datalake will probably consist of files and folders, as this is the main way of storing and organizing digital data. These will be stored in a central space (most likely a NAS) and will be read-only. The raw data shall never be changed; instead it will be supplemented by further information. This additional information will allow us to answer specific queries and to find the right files with ease. For this we will use specialized tools, each best suited to the kind of information (music files, pictures, text) we are dealing with.
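To make the "never change the raw data, only supplement it" idea concrete, here is a minimal sketch in Python. It assumes the extra information lives in a separate JSON file, keyed by each file's path relative to the lake. The function names (annotate, find) and the example tags are purely hypothetical illustrations, not part of any existing tool.

```python
import json
from pathlib import Path


def annotate(store_path, relative_path, **tags):
    """Merge `tags` into the annotations kept for one raw file.

    Only the annotation store is written; the raw file is never opened.
    """
    store_path = Path(store_path)
    store = json.loads(store_path.read_text()) if store_path.exists() else {}
    store.setdefault(relative_path, {}).update(tags)
    store_path.write_text(json.dumps(store, indent=2, sort_keys=True))
    return store[relative_path]


def find(store_path, **criteria):
    """Return the paths whose annotations match every given key/value pair."""
    store = json.loads(Path(store_path).read_text())
    return sorted(
        path for path, tags in store.items()
        if all(tags.get(key) == value for key, value in criteria.items())
    )
```

For example, after annotate("annotations.json", "scans/0001.pdf", kind="invoice", year=2015), a call to find("annotations.json", kind="invoice") would list that scan. A real setup would more likely use a proper database, but the separation between raw files and annotations stays the same.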

Here are some example queries to our personal datalake that should become easy to answer:

Some of these queries would be difficult to answer while our files are spread across many drives and folders with no good structure to organize them.

In summary the personal datalake serves these goals:

  1. be a safe, central storage for all kinds of (digital) data related to our personal lives
  2. have specialised access tools to allow us to find important documents or answer complex queries

all while requiring as little manual interaction as possible. This is achieved with computer programs (e.g. by generating indices of files automatically instead of organizing them manually).
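As a rough illustration of what an automatically generated index could look like, here is a minimal Python sketch. It assumes the lake is a plain folder tree and records the relative path, size and SHA-256 hash of every file. The function names (build_index, find_duplicates) are hypothetical, and the files are only ever opened for reading.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(root):
    """Walk `root` and return one record per file, without modifying anything."""
    index = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # makes the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            index.append({
                "path": os.path.relpath(path, root),
                "size": os.path.getsize(path),
                "sha256": sha256_of(path),
            })
    return index


def find_duplicates(index):
    """Group paths by content hash and keep only hashes seen more than once."""
    by_hash = {}
    for record in index:
        by_hash.setdefault(record["sha256"], []).append(record["path"])
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Even such a simple index already answers one question that is tedious to solve by hand: which files were copied to several places over the years.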

Here is the post that made me start this project. I will continue to describe my journey with the personal datalake over several posts. You can find them all under the tag: #damright