seqfile: Generate sequential filenames safely
Version control is great. However, it is something I have to actively think about. When I am writing code, I can find points where I have finished a task and am ready to commit. But I am in a different frame of mind while doing exploratory model fitting.
Saving the models and results
When I am playing with an idea, I’d rather spend those brain cycles on improving the model or tweaking parameters. And perhaps it is just me, but training models to fit data seems like a separate realm of thought, where version control just doesn’t seem to fit. Checking in the models feels akin to adding binaries to source control: a bit icky.
Also, each of these models can take a long time to train. So I do want to at least persist them to disk; I have accidentally killed my share of Python processes and screens with the model results still stored in variables.
Pickling takes care of the serialization. The only missing bit is coming up with filenames.
Hardest problem in CS
Naming things is one of the two (or three?) hardest problems in Computer Science.
These are the approaches that sprang to my mind:
Use a static file name
This means that only the most recent model is saved while all history is lost. This can also get tricky if you train different models in parallel (one model will win in writing the file).
Use model parameters in the filename
This feels intuitive until you add or remove parameters, at which point the noise in the filenames can come back to bite you.
Use a GUID!
They are incomprehensible and noisy. Also, it is impossible to tell from the filename which files are the most recent ones.
Use the pid
The pid can be reused, and it stays the same during a single interactive session. Hence, the files get overwritten anyway. Moreover, there is no sense of sequence in the names here either.
Endgame: seqfile
Finally, I wrote seqfile to solve the problem for me. It is a Python library which generates sequential filenames from a template. It makes sure that it has actually created the file before returning the name to the caller, so that no two processes accidentally end up writing to the same file. If used with the prefix and suffix options, it is smart about jumping over gaps in the sequence. However, if the filenames are more nuanced, a function can be supplied to generate them instead.
The API was inspired by the tempfile library.
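The create-before-returning idea can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not seqfile's actual code: the function name next_sequential_file, its parameters, and the rule of continuing after the highest existing number are hypothetical choices made for illustration.

```python
import os
import pickle
import tempfile


def next_sequential_file(folder, prefix, suffix, start=0, max_attempts=1000):
    """Create and return the next free '<prefix><n><suffix>' path in folder.

    Hypothetical helper for illustration; it is not part of seqfile.
    """
    # Look for the highest number already in use, so that gaps in the
    # sequence are skipped rather than filled (an assumption of this sketch).
    highest = start - 1
    for name in os.listdir(folder):
        if name.startswith(prefix) and name.endswith(suffix):
            middle = name[len(prefix):len(name) - len(suffix)]
            if middle.isascii() and middle.isdigit():
                highest = max(highest, int(middle))

    candidate = highest + 1
    for _ in range(max_attempts):
        path = os.path.join(folder, f"{prefix}{candidate}{suffix}")
        try:
            # O_CREAT | O_EXCL makes the call fail if the file already
            # exists, so only one process can win a given name.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            # Another process claimed this name first; try the next one.
            candidate += 1
            continue
        os.close(fd)
        return path
    raise RuntimeError("could not create a sequential file")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        # Placeholder standing in for real trained models.
        models = {"alpha": 0.1}
        path = next_sequential_file(folder, "models.", ".pickle")
        with open(path, "wb") as pickle_file:
            pickle.dump(models, pickle_file)
        print(os.path.basename(path))
```

The file is created empty first and reopened for writing afterwards; the point is that the name is already claimed on disk before any other process can be handed the same one.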
At the end of my script, I use the following snippet:
Thereafter, I keep changing the parameters in the code and re-running the analysis in IPython:
In [42]: run -i my_models.py
This keeps saving my models as I tweak the parameters in the script, even if I run them in parallel across different screens.
The package is available on PyPI and the source code is available on GitHub: musically-ut/seqfile. It has 100% test coverage and is verified to run on all major OSes.
Caveat
The O_CREAT | O_EXCL trick used in this library to create files may fail on old Linux kernels while writing to NFS.