Sebastian's personal website

Locate Files in a Datalake

Written by Sebastian Dümcke on

A personal Datalake is basically a large folder with many files without a clear structure. So how does one find anything?

There are several answers to this question, which we will explore in different articles (see here for a list). The answer depends on the type of files and on whether they have associated metadata. One thing every file has is a file name. More often than not that name is meaningful, and we remember the file we are looking for by its name (and sometimes by its path).

Both name and path can be queried efficiently using a standard UNIX tool called locate. The package for your distribution might be called mlocate or findutils; the options used here follow the mlocate variant. I will quickly show you how to set up file searching with locate and how to query for different files:

#create database
updatedb --require-visibility 0 --output /srv/pool/MetaData/locatedb --database-root /srv/pool/DataLake
#show statistics
locate --database /srv/pool/MetaData/locatedb --statistics
Database /srv/pool/MetaData/locatedb:
    50330 directories    
    505092 files
    59996573 bytes in file names
    20234698 bytes used to store database

The updatedb command creates the locate database by traversing the file tree from the root /srv/pool/DataLake. The option --require-visibility 0 lets us run the program as a regular user, but the database will then only contain the files this user can see. It also means that locate will not check permissions before reporting a match. This can be a security concern: anyone with read access to the locatedb file can see the paths of all indexed files.
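
Since the database can be read by anyone who can open the file, one option is to wrap the rebuild in a small shell function that tightens the permissions afterwards. This is only a minimal sketch, not a definitive setup: the function name refresh_locatedb and the variables DATALAKE_DB and DATALAKE_ROOT are placeholders made up for this example, with the paths from above as defaults.

```shell
# Rebuild the locate database, then restrict it to its owner.
# refresh_locatedb, DATALAKE_DB and DATALAKE_ROOT are hypothetical names;
# the defaults are the paths used in this article.
refresh_locatedb() {
    db="${DATALAKE_DB:-/srv/pool/MetaData/locatedb}"
    root="${DATALAKE_ROOT:-/srv/pool/DataLake}"

    # Same invocation as above; stop if the rebuild fails.
    updatedb --require-visibility 0 --output "$db" --database-root "$root" || return 1

    # Only the owner may read the list of indexed paths.
    chmod 600 "$db"
}
```

Calling refresh_locatedb from a cron job or a systemd timer would keep the index current; how often depends on how quickly your files change.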

In my case I currently have over half a million files, and the program took about 5 seconds to generate the database. The database is only around 20 MB, compared with about 60 MB of file names, because paths share common prefixes that can be stored efficiently.

Queries run just as fast, using the locate command with the path to the database and a query. By default the query is matched anywhere in the path. To match only file names, use the switch -b. For more control, use -r followed by a regular expression, for example one ending with the $ anchor to match the end of the path. Add -c to output the number of matching paths instead of the paths themselves. See the manpage for more details.
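
To get a feeling for how the position of the pattern and the $ anchor change what is matched, here is a small sketch that uses grep as a stand-in for locate -r on three made-up paths. Both take a basic regular expression, but this only illustrates the matching. It does not query a real database, and the helper names are invented for the example.

```shell
# Three made-up paths standing in for database entries.
sample_paths() {
    printf '%s\n' \
        /srv/pool/DataLake/ebooks \
        /srv/pool/DataLake/ebooks/dune.epub \
        /srv/pool/DataLake/old/ebooks
}

# Count the sample paths matching a basic regular expression,
# similar in spirit to locate -c -r.
count_matches() {
    sample_paths | grep -c -- "$1"
}

anywhere=$(count_matches '/ebooks')    # the folders themselves and their contents
at_end=$(count_matches '/ebooks$')     # only paths ending in /ebooks
inside=$(count_matches '/ebooks/')     # only paths below an ebooks folder

echo "anywhere: $anywhere, at end: $at_end, inside: $inside"
# prints "anywhere: 3, at end: 2, inside: 1"
```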

#find any file with taxes in its name
locate --database /srv/pool/MetaData/locatedb -b taxes
#number of paths inside folders called ebooks
locate --database /srv/pool/MetaData/locatedb -r '/ebooks/' -c
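
Typing the database path for every query gets tedious. One way around that, sketched here for a POSIX-style shell, is a small wrapper function. The name lake and the variable DATALAKE_DB are hypothetical; the default is the database path used above.

```shell
# Query the data lake database without retyping its path.
# "lake" and DATALAKE_DB are hypothetical names for this sketch.
lake() {
    # All arguments are passed through to locate unchanged.
    locate --database "${DATALAKE_DB:-/srv/pool/MetaData/locatedb}" "$@"
}
```

With this in your shell startup file, the first query above becomes lake -b taxes.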

I used to play around with raytracing years back. I wonder if I still have the files around?

locate --database /srv/pool/MetaData/locatedb -r "[.]pov$"

Turns out I do. Please enjoy a raytraced picture designed by my younger self.

avar.png