Locate Files in a Datalake
A personal Datalake is basically a large folder with many files without a clear structure. So how does one find anything?
There are several answers to this question, which we will explore in different articles (see here for a list). The answer depends on the type of files and on whether they have associated metadata. One thing every file has is a name. More often than not that name is meaningful, and we remember the file we are looking for by its name (and sometimes by its path).
Both name and path can be queried efficiently with a standard UNIX tool called locate. The package for your distribution might be called mlocate or findutils; the options used below are those of the mlocate version. I will quickly show you how to set up file searching with locate and how to query for different files:
#create database
updatedb --require-visibility 0 --output /srv/pool/MetaData/locatedb --database-root /srv/pool/DataLake
#show statistics
locate --database /srv/pool/MetaData/locatedb --statistics
Database /srv/pool/MetaData/locatedb:
    50330 directories
    505092 files
    59996573 bytes in file names
    20234698 bytes used to store database
The updatedb command creates the locate database by traversing the file tree from the root /srv/pool/DataLake. The option --require-visibility 0 allows us to run the program as an ordinary user, but the database will then only contain the files that user can view. This can be a security concern if other people have access to the locatedb file, as they will be able to see the paths of all the files indexed in it.
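One way to limit that exposure is to make the database readable only by its owner right after it is built. The following is a minimal sketch, not part of the original setup: `rebuild_index` is a hypothetical helper name, and it assumes the mlocate `updatedb` options used above.

```shell
# Hypothetical helper (not from the original setup): rebuild the locate
# database and then restrict it to the owning user.
#   $1  where to write the database, e.g. /srv/pool/MetaData/locatedb
#   $2  top of the tree to index,    e.g. /srv/pool/DataLake
rebuild_index() {
    # Same invocation as above, with the paths passed in as arguments.
    updatedb --require-visibility 0 --output "$1" --database-root "$2" || return 1

    # Owner may read and write; group and others get no access.
    chmod 600 "$1"
}

# Usage: rebuild_index /srv/pool/MetaData/locatedb /srv/pool/DataLake
```

There is a short window between writing the file and the chmod, so keeping the database in a directory only you can enter would be a stricter variant of the same idea.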
In my case I currently have over half a million files, and the program took about 5 seconds to generate the database. The database is only around 20 MB in size, because all paths share a common prefix that can be stored efficiently.
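The saving is easy to quantify. As a quick check that uses only the four numbers from the statistics output (the variable names are just for illustration), shell integer arithmetic gives the average name length per entry and the ratio between raw names and database size:

```shell
# Numbers copied from the --statistics output shown earlier.
dirs=50330
files=505092
name_bytes=59996573
db_bytes=20234698

entries=$((dirs + files))

# Integer division, so every result is rounded down.
avg_name=$((name_bytes / entries))          # bytes of file name per entry
avg_stored=$((db_bytes / entries))          # database bytes per entry
ratio_pct=$((name_bytes * 100 / db_bytes))  # names as a percentage of database size

echo "entries: $entries"                                # entries: 555422
echo "average name bytes per entry: $avg_name"          # 108
echo "average database bytes per entry: $avg_stored"    # 36
echo "names / database: ${ratio_pct}%"                  # 296%
```

Taking the statistics at face value, about 108 bytes of name per entry end up as about 36 bytes in the database, a factor of just under three.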
Queries run just as fast, using the command locate with a path to the database and a query. It will match the query anywhere in the path. To match only file names, use the switch -b, or use -r followed by a regular expression ending with the $ sigil. Use -c to output the number of matching paths instead of the paths themselves. See the manpage for more details.
#find any file with taxes in its name
locate --database /srv/pool/MetaData/locatedb -b taxes
#number of files located in folders called ebooks
locate --database /srv/pool/MetaData/locatedb -r '/ebooks/' -c
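Because -r patterns are unanchored, a small change such as a trailing slash changes what gets counted. The sketch below does not query a real database: it uses `grep -c` on a handful of made-up paths as a stand-in for `locate -r ... -c`, assuming both treat the pattern as a basic regular expression. The paths and the helper names `sample_paths` and `count_matches` are hypothetical.

```shell
# Made-up paths for illustration only; they do not come from a real database.
sample_paths() {
cat <<'EOF'
/srv/pool/DataLake/ebooks
/srv/pool/DataLake/ebooks/dune.epub
/srv/pool/DataLake/ebooks/scifi/foundation.epub
/srv/pool/DataLake/ebooks-old/draft.txt
/srv/pool/DataLake/docs/ebooks.txt
EOF
}

# Count the sample paths matching a basic regular expression,
# as a stand-in for: locate -r PATTERN -c
count_matches() {
    sample_paths | grep -c -- "$1"
}

count_matches '/ebooks'     # prints 5: the folder itself, ebooks-old and ebooks.txt match too
count_matches '/ebooks/'    # prints 2: only entries below a folder named exactly ebooks
count_matches '[.]epub$'    # prints 2: paths ending in .epub
```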
I used to play around with raytracing years back. I wonder if I still have the files around?
locate --database /srv/pool/MetaData/locatedb -r "[.]pov$"
Turns out I do. Please enjoy a raytraced picture designed by my younger self.
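Since every query repeats the same --database argument, a few shell functions can shorten day-to-day use. This is only a sketch: the names `lake`, `lake_name`, `lake_ext` and the `LAKE_DB` variable are made up for illustration, and it relies only on the -b and -r options shown above.

```shell
# Hypothetical convenience wrappers around the locate calls shown above.
# LAKE_DB is an assumed variable name; override it to use another database.
LAKE_DB=${LAKE_DB:-/srv/pool/MetaData/locatedb}

# Run locate against the Datalake database, passing all arguments through.
lake() {
    locate --database "$LAKE_DB" "$@"
}

# Match against file names only (locate -b).
lake_name() {
    lake -b "$@"
}

# Find files by extension: "lake_ext pov" searches for the regex [.]pov$
# The extension is inserted into the regex as-is, so keep it to plain
# letters and digits.
lake_ext() {
    lake -r "[.]$1\$"
}
```

With these in a shell startup file, the queries from this article become `lake_name taxes` and `lake_ext pov`, and any other locate option can still be passed through `lake`.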