edges_analysis._workflow

Functions for dealing with the CLI workflow and progress files.

The progress file is simply a YAML file consisting purely of a list of steps in the analysis, as defined by a workflow.yaml. Each step in the progress file has the format:

name: <name of step>
function: <function that was run at this step>
params:
    param1: value
    param2: value
    ...
write: <<FORMAT_STRING>>.gsh5
filemap:
- inputs:
  - input_file1.gsh5
  outputs:
  - output_file1.gsh5
- inputs:
  - input_file2.gsh5
  outputs:
  - output_file2.gsh5

The filemap is thus a list of dicts, where each dict has two keys, inputs and outputs, each of which is a list of files. Each element in the top-level list of the filemap is an “object” that is processed as a unit by a processing function. For example, if the filemap has two elements, each with one input and one output, then the processing function is called twice, once per element, and each call takes a single file in and produces a single file out. The other common case is a filemap with one element that has multiple inputs and a single output. This happens, for example, with LST-binning, where multiple files are combined into one.
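The two cases described above can be sketched in a few lines of Python. This is only an illustration of the "one call per filemap entry" idea: process_filemap and fake_process are hypothetical names invented here, not functions from edges_analysis, and the file names are placeholders.

```python
# Illustrative only: `process_filemap` and `fake_process` are hypothetical
# names, not part of edges_analysis.

def process_filemap(filemap, func):
    """Call `func` once per filemap entry, passing all of that entry's inputs."""
    return [func(entry["inputs"]) for entry in filemap]


def fake_process(inputs):
    # Stand-in for a real processing function: report how many files it received.
    return len(inputs)


# Two entries with one input each: the function is called twice.
one_to_one = [
    {"inputs": ["input_file1.gsh5"], "outputs": ["output_file1.gsh5"]},
    {"inputs": ["input_file2.gsh5"], "outputs": ["output_file2.gsh5"]},
]
calls_one_to_one = process_filemap(one_to_one, fake_process)

# One entry with several inputs (as in LST-binning): the function is called once.
many_to_one = [
    {"inputs": ["a.gsh5", "b.gsh5", "c.gsh5"], "outputs": ["binned.gsh5"]},
]
calls_many_to_one = process_filemap(many_to_one, fake_process)
```

Here calls_one_to_one has two elements (one per call, each call seeing one file), while calls_many_to_one has a single element for a call that saw three files.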

The format of the progressfile is thus exactly the same as the user-defined workflow YAML, except with the addition of this filemap, which is updated as the processing occurs, and keeps track of the status of the processing.

Two things to note about the progressfile and its filemap. First, if a step has no "write" parameter, it won’t write any files to disk (the objects it creates are simply passed on to the next step in memory). In this case, the filemap will be empty, with no inputs or outputs. Second, the CLI uses the progressfile as follows:

The first time the CLI is run for a given workflow, a new progressfile is created. The inputs specified on the command line (via -i) are added as inputs to the convert step in the progressfile. The CLI then runs through the steps in the workflow. On each step, it determines which files it might need to read using the ProgressFile.get_files_to_read_for_step() method. This method collects the inputs of the current step as it exists in the progressfile, and also the outputs of the previous step, which are assumed to be required for the current step. Of these files, it returns only those that are not already listed, in this or any subsequent step, as inputs associated with existing outputs (which would mean they’ve already been processed). It further trims the list by removing any files that have already been read in previous steps (since sometimes a step writes back to its own file). As the CLI processes each step, if the step has a "write" parameter, it writes the output files to disk, adds them to the filemap for the step, and updates the progressfile.
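The file-selection rule described above can be sketched as follows. This is a minimal illustration written from the prose description, not the actual body of ProgressFile.get_files_to_read_for_step(): the function files_to_read, its already_read argument, the representation of the progressfile as a list of plain dicts, the "filter" step, and the file names are all assumptions made for the example.

```python
# Hypothetical sketch of the selection rule; not the edges_analysis implementation.

def files_to_read(progress, index, already_read=frozenset()):
    """Return the files that step `index` still needs to read.

    `progress` is assumed to be a list of step dicts, each with a "filemap"
    list of {"inputs": [...], "outputs": [...]} entries.
    """
    # Candidates: inputs of the current step plus outputs of the previous step.
    candidates = [f for e in progress[index]["filemap"] for f in e["inputs"]]
    if index > 0:
        candidates += [f for e in progress[index - 1]["filemap"] for f in e["outputs"]]

    # Files already processed: inputs of this or any later step whose entry
    # already has outputs.
    processed = {
        f
        for step in progress[index:]
        for e in step["filemap"]
        if e["outputs"]
        for f in e["inputs"]
    }

    # Skip processed files, files read in earlier steps, and duplicates.
    seen = set(already_read) | processed
    result = []
    for f in candidates:
        if f not in seen:
            seen.add(f)
            result.append(f)
    return result


progress = [
    {"name": "convert", "filemap": [
        {"inputs": ["raw1.dat"], "outputs": ["c1.gsh5"]},
        {"inputs": ["raw2.dat"], "outputs": ["c2.gsh5"]},
    ]},
    # Only c1.gsh5 has been processed by this (hypothetical) second step so far.
    {"name": "filter", "filemap": [
        {"inputs": ["c1.gsh5"], "outputs": ["f1.gsh5"]},
    ]},
]

todo_convert = files_to_read(progress, 0)
todo_filter = files_to_read(progress, 1)
```

In this example the convert step has nothing left to read, and the second step only needs c2.gsh5, because c1.gsh5 already has an output there.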

The next time the CLI is run for the workflow (when a progressfile already exists), it first checks the progressfile against the workflow to see whether the workflow has changed since the last run. If a step has been modified, the CLI deletes all files that were generated by that step and all subsequent steps (both on disk and in the progressfile). Any steps that have been _appended_ to the workflow are simply added on to the progressfile. The CLI then runs the workflow again, starting from the latest step possible (i.e. the step after the last step that both ran successfully and wrote output files).
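As a rough sketch of this comparison, the following Python finds the first modified step and resets the filemaps from that point on. The names first_changed_step and reconcile, the choice of which keys are compared, and the example steps are assumptions for illustration. The real check in edges_analysis may differ, and this sketch does not handle steps removed from the workflow.

```python
# Hypothetical sketch; not the edges_analysis implementation.

COMPARED_KEYS = ("name", "function", "params", "write")  # assumed set of keys


def first_changed_step(workflow, progress):
    """Index of the first step that differs between the two, or None."""
    for i, (wstep, pstep) in enumerate(zip(workflow, progress)):
        if any(wstep.get(k) != pstep.get(k) for k in COMPARED_KEYS):
            return i
    return None


def reconcile(workflow, progress):
    """Return (new_progress, files_to_delete) for an updated workflow."""
    changed = first_changed_step(workflow, progress)
    new_progress, to_delete = [], []
    for i, wstep in enumerate(workflow):
        if i < len(progress) and (changed is None or i < changed):
            new_progress.append(progress[i])  # unchanged step: keep its filemap
            continue
        if i < len(progress):
            # Modified step, or a step downstream of one: its outputs are stale.
            to_delete += [f for e in progress[i]["filemap"] for f in e["outputs"]]
        new_progress.append({**wstep, "filemap": []})  # reset or newly appended
    return new_progress, to_delete


workflow = [
    {"name": "convert", "function": "convert", "params": {}, "write": "{name}.gsh5"},
    {"name": "bin", "function": "lst_bin", "params": {"resolution": 0.5}},
    {"name": "extra", "function": "extra", "params": {}},  # appended step
]
progress = [
    {"name": "convert", "function": "convert", "params": {}, "write": "{name}.gsh5",
     "filemap": [{"inputs": ["raw1.dat"], "outputs": ["c1.gsh5"]}]},
    {"name": "bin", "function": "lst_bin", "params": {"resolution": 1.0},
     "filemap": [{"inputs": ["c1.gsh5"], "outputs": ["b.gsh5"]}]},
]

new_progress, to_delete = reconcile(workflow, progress)
```

Here the second step's parameters changed, so its output is marked for deletion and its filemap is reset, while the convert step keeps its record and the appended step is added with an empty filemap.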

The user can also add more raw input files on the second run-through, which will by default be appended to the raw files from the original step. In this case, the CLI first deletes any files that were generated by combining files further down the pipeline, since these will change with the added files. However, it does not delete files that are generated from a single input, as these should still be valid. To figure out which files it needs to read at each step, it uses the ProgressFile.get_files_to_read_for_step() method, as described above. Since files that have already been processed will have valid outputs at subsequent steps, they are ignored until they are required further down the pipeline.
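The invalidation rule for added inputs might look something like the sketch below. The function drop_stale_entries, the step names, and the file names are hypothetical. In addition to the stated rule (entries that combine several inputs become stale), the sketch assumes that anything derived from a stale file is also stale, which the description above does not spell out.

```python
# Hypothetical sketch; not the edges_analysis implementation.

def drop_stale_entries(progress):
    """Remove stale filemap entries in place and return their output files."""
    stale = []
    for step in progress:
        kept = []
        for entry in step["filemap"]:
            combined = len(entry["inputs"]) > 1
            # Assumption: outputs derived from a stale file are stale too.
            depends_on_stale = any(f in stale for f in entry["inputs"])
            if combined or depends_on_stale:
                stale.extend(entry["outputs"])
            else:
                kept.append(entry)  # single-input entries remain valid
        step["filemap"] = kept
    return stale


progress = [
    {"name": "convert", "filemap": [
        {"inputs": ["raw1.dat"], "outputs": ["c1.gsh5"]},
        {"inputs": ["raw2.dat"], "outputs": ["c2.gsh5"]},
    ]},
    {"name": "bin", "filemap": [
        {"inputs": ["c1.gsh5", "c2.gsh5"], "outputs": ["b.gsh5"]},
    ]},
    {"name": "post", "filemap": [
        {"inputs": ["b.gsh5"], "outputs": ["p.gsh5"]},
    ]},
]

stale_files = drop_stale_entries(progress)
```

After this call, the per-file convert entries are untouched, while the combined file and the file derived from it are reported as stale and removed from their filemaps.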

Classes

FileMap([maps])

An object representing the full file-map for a processing step.

FileMapEntry(inputs, outputs)

A single entry in a filemap (input -> output).

ProgressFile(path[, workflow])

Workflow([steps])

WorkflowStep(function[, name, params, ...])

A single step in a workflow.

Exceptions

WorkflowProgressError

Exception raised when the workflow and progress files are discrepant.