edges_analysis._workflow¶
Functions for dealing with CLI workflow and progressfiles.
The progress file is simply a YAML file consisting purely of a list of steps in the
analysis, as defined by a workflow.yaml
. Each step in the progress file has the
format:
name: <name of step>
function: <function that was run at this step>
params:
param1: value
param2: value
...
write: <<FORMAT_STRING>>.gsh5
filemap:
- inputs:
- input_file1.gsh5
outputs:
- output_file1.gsh5
- inputs:
- input_file2.gsh5
outputs:
- output_file2.gsh5
The filemap
is thus a list of dicts, where each dict has two keys: inputs
and
outputs
. Each of these is a list of files. Each element in the top-level list of
filemap
is an “object” that gets processed together through a processing function.
For example, if the filemap
has two elements, each with one input and one output,
then the processing function will be called twice, once for each element in the list,
and each time a single file will be passed in, and a single file will be passed out.
The most common other case is that the filemap
has one element, with multiple
inputs and a single output – this happens for example with LST-binning, where we
take multiple files and combine them into one.
The format of the progressfile is thus exactly the same as the user-defined workflow
YAML, except with the addition of this filemap
, which is updated as the processing
occurs, and keeps track of the status of the processing.
Some things to note about the progressfile and its filemap. First, if a step has no
"write"
parameter, then it won’t write any files to disk (the objects it creates
will simply be passed on to the next step in-memory). In this case, the filemap
will
be empty – no inputs OR outputs. Secondly, the way the CLI uses this progressfile is
as follows:
The first time the CLI is run for a given workflow, a new progressfile is created. The
inputs specified on command-line (via -i) are added as inputs to the convert step
in the progressfile. The CLI then runs through the steps in the workflow. On each step,
it checks out the files it might need to read for that step using the
ProgressFile.get_files_to_read_for_step()
method. This method checks the current
step as it exists in the progressfile for any inputs, and ALSO the outputs of the
previous step, which are assumed to be required for the current step. It will return
any of these files that don’t appear as inputs to this or subsequent steps in the
progressfile, where they are already associated with existing outputs (which would mean
they’ve already been processed). It further trims the list by removing any files that
have already been read in previous steps (since sometimes a step writes back to its
own file). As it processes each step, if the step has a "write"
parameter, it will
write the output files to disk, and add them to the filemap
for the step (and
write to the progressfile).
The next time the CLI is run for the workflow (when a progressfile exists already), it will first check the progressfile against the workflow to ensure that the workflow hasn’t changed since the last run. If it has, it will delete all the files that were generated by that modified step and all subsequent steps (both on disk and in the progressfile). Any steps that have been _appended_ to the workflow will simply be added on to the progressfile. It will then run the workflow again, starting from the latest step possible (i.e. the step after the last step that was run successfully AND output files).
The user can also add more raw input files on the second run-through, which will by
default be appended to the raw files from the original step. In this case, the CLI will
first make sure it deletes any files that were generated by combining files down
the pipeline, since these will change with the added files. However, it will not
delete files that are simply generated from a single input, as these should still be
valid. To figure out which files it needs to read at each step, it will check the
files it needs using the ProgressFile.get_files_to_read_for_step()
method, as
described above. Since files that ahve already been processed will have valid outputs
at subsequent steps, these will be ignored until they are required down the pipe.
Classes
|
An object representing the full file-map for a processing step. |
|
A single entry in a filemap (input -> output). |
|
|
|
|
|
A single step in a workflow. |
Exceptions
|
Exception raised when the workflow and progress files are discrepant. |