Locate Files in a Datalake
Written by Sebastian Dümcke on
Tags: DAMright
A personal Datalake is basically a large folder with many files without a clear structure. So how does one find anything?
There are several answers to this question, which we will explore in different articles (see here for a list). The answer depends on the type of files and on whether they have associated metadata. One thing every file has is a file name. More often than not that name carries meaning, and we remember the file we are looking for by its name (and sometimes by its path).
Both name and path can be queried efficiently with a standard UNIX tool called locate. The package for your distribution might be called mlocate or findutils; the updatedb options used below are those of the mlocate implementation. I will quickly show you how to set up file searching with locate and how to query for different files:
#create database
updatedb --require-visibility 0 --output /srv/pool/MetaData/locatedb --database-root /srv/pool/DataLake
#show statistics
locate --database /srv/pool/MetaData/locatedb --statistics
Database /srv/pool/MetaData/locatedb:
    50330 directories
    505092 files
    59996573 bytes in file names
    20234698 bytes used to store database
The updatedb command creates the locate database by traversing the file tree from the root /srv/pool/DataLake. The option --require-visibility 0 allows us to run the program as a regular user; the database will then only contain the files that user can see. Because locate no longer checks permissions before reporting a match, this can be a security concern if other people have access to the locatedb file: they will be able to see the names of all the indexed files.
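One way to limit that exposure is to make the database readable only by its owner. The following is a minimal sketch, assuming the mlocate version of updatedb and the paths used in this article; the function name update_lake_db and the LAKE_ROOT and LAKE_DB variables are made up for illustration.

```shell
# Hypothetical helper: rebuild the locate database and keep it private.
# The default paths are the ones used in this article; override as needed.
LAKE_ROOT="${LAKE_ROOT:-/srv/pool/DataLake}"
LAKE_DB="${LAKE_DB:-/srv/pool/MetaData/locatedb}"

update_lake_db() {
    # Build the index as the current user (mlocate options).
    updatedb --require-visibility 0 \
             --output "$LAKE_DB" \
             --database-root "$LAKE_ROOT" || return 1
    # Allow only the owner to read the list of indexed file names.
    chmod 600 "$LAKE_DB"
}
```

Calling update_lake_db rebuilds the index and then tightens the permissions in one step, so the file is never left readable after a refresh.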
In my case I currently have over half a million files, and the program took about 5 seconds to generate the database. The size of the database is only around 20 MB, because the paths share common prefixes that can be stored efficiently.
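The numbers can be checked against the statistics output with plain shell arithmetic. This is only a quick sanity check; the variable names are my own, and the figures are copied from the output shown earlier.

```shell
# Figures copied from the --statistics output.
directories=50330
files=505092
name_bytes=59996573
db_bytes=20234698

# Total number of indexed entries.
entries=$((directories + files))
# Database size as a percentage of the raw file-name bytes (integer division).
ratio=$((db_bytes * 100 / name_bytes))

echo "$entries entries, database is ${ratio}% of the file-name bytes"
# prints: 555422 entries, database is 33% of the file-name bytes
```

So the database takes roughly a third of the space the raw names would need.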
Queries run just as fast, using the command locate with the path to the database and a query. By default the query is matched anywhere in the path. To match only file names, use the switch -b. Use -r followed by a regular expression to search by pattern, for example one ending with the $ sigil to anchor the match at the end of the path. Add -c to output the number of matching paths instead of the paths themselves. See the manpage for more details.
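To make the difference between these matching modes concrete, here is a small sketch that imitates them on a hand-made list of paths. The sample paths and function names are invented, and grep and a shell loop are only stand-ins for what locate does internally; the point is to show which entries each mode would select.

```shell
# Invented sample entries standing in for the locate database (assumption).
sample_paths() {
    printf '%s\n' \
        /srv/pool/DataLake/taxes \
        /srv/pool/DataLake/taxes/2019/receipt.pdf \
        /srv/pool/DataLake/docs/taxes-2020.pdf \
        /srv/pool/DataLake/render/scene.pov \
        /srv/pool/DataLake/render/scene.pov.bak
}

# Plain query: fixed string matched anywhere in the path.
match_path() { sample_paths | grep -F -- "$1"; }

# Like -b: fixed string matched against the last path component only.
match_basename() {
    sample_paths | while IFS= read -r entry; do
        case "${entry##*/}" in
            *"$1"*) printf '%s\n' "$entry" ;;
        esac
    done
}

# Like -r: basic regular expression matched against the whole path.
match_regex() { sample_paths | grep -- "$1"; }

# Like -r with -c: count the matching paths instead of printing them.
count_regex() { sample_paths | grep -c -- "$1"; }

match_path taxes          # three paths: the folder, the receipt, the 2020 PDF
match_basename taxes      # two paths: the folder and taxes-2020.pdf
match_regex '[.]pov$'     # only scene.pov, not scene.pov.bak
count_regex '/render/'    # prints 2
```

Note that the receipt only shows up in the plain query, because "taxes" appears in its folder name and not in its file name.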
#find any file with taxes in its name
locate --database /srv/pool/MetaData/locatedb -b taxes
#number of files located in folders called ebooks
locate --database /srv/pool/MetaData/locatedb -r '/ebooks/' -c
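Typing the --database option for every query gets tedious. A small shell function can fill it in; this is just a convenience sketch, and the name lake and the LAKE_DB variable are my own invention, not part of locate.

```shell
# Hypothetical helper; the default path is the database used in this article.
LAKE_DB="${LAKE_DB:-/srv/pool/MetaData/locatedb}"

lake() {
    # Pass all arguments through to locate, always using the Datalake index.
    locate --database "$LAKE_DB" "$@"
}

# Usage:
#   lake -b taxes
#   lake -r "[.]pov$"
```

Put this in your shell startup file and the queries shrink to the part that actually changes.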
I used to play around with raytracing years back. I wonder if I still have the files around?
locate --database /srv/pool/MetaData/locatedb -r "[.]pov$"
Turns out I do. Please enjoy a raytraced picture designed by my younger self.