CLI Usage

The intention is that full processing pipelines should be run via the CLI interface. This interface ensures that the pipeline is reproducible: it is defined by a single YAML workflow file, and each processing step that is run is necessarily a gsregistered function, so it preserves as much history as possible.

The Workflow File

The primary command in the CLI is process. This command takes a single required argument – the workflow file – and several optional arguments related to I/O. The full option list can be found below, but here we cover the main ingredient: the workflow file.

The workflow file is a YAML file that contains a list of processing steps. It should be formatted as follows:

globals:
  param1: value1
  param2: value2

steps:
  - function: <<name of gsregistered function>>
    params:
      a_parameter: <<value>>
      another_parameter: <<value>>

  - function: <<name of gsregistered function>>
    name: the-first-invocation-of-foobar
    params:
      a_parameter: "{{ globals.param1 }}"
      another_parameter: "{{ globals.param2 }}"
    write: "{prev_stem}.foobar.gsh5"

  - ...

Note a few things. First, we have a globals section, which contains parameter values that can be used in later steps (use these if you need the same value more than once). These are interpolated into the later steps via Jinja templates, i.e. by using double curly braces with spaces (as seen for the second function’s a_parameter); templated values should be quoted so that the file remains valid YAML. Second, we have a steps section, which defines a list of processing steps that will happen in the order they are defined. Each step can have four keys:

  • function is the only required key, and specifies a gsregistered function to run. The CLI can only find the function if it has been registered. You can get a current list of available functions (and their types) by using edges-analysis avail.

  • name gives a unique name to the step. By default, it is the function name, but if you use the same function more than once, you will need to specify a unique name.

  • params is a dictionary of parameters to pass to the function, other than the GSData object itself.

  • write is optional, and if included, it tells the workflow to write out a new file at this point. The value given is the filename to write. By default, if the file already exists, the workflow will error. Notice that the value here also uses curly braces; in this case it is not a Jinja template, but rather a standard Python string format. Each step has access to the variables prev_stem (i.e. the filename, without extension, of the last written step), prev_dir (i.e. the directory of the last written step), and fncname (i.e. the name of the function for this step). A short sketch of this templating follows this list.
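For example, a pair of steps using this templating might look like the following (a minimal sketch: the functions are taken from the example workflow at the end of this page, while the filenames are invented):

steps:
  - function: dicke_calibration
    write: "cal/{fncname}.gsh5"
  - function: lst_bin
    params:
      binsize: 0.25
    write: "{prev_dir}/{prev_stem}.lstbin.gsh5"

Here the first step writes cal/dicke_calibration.gsh5, so in the second step prev_dir is cal and prev_stem is dicke_calibration, giving cal/dicke_calibration.lstbin.gsh5.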

Note

There is one “special” function that can be used that is not in the gsregister: the convert function. This function can be used to initially read files in a different format (e.g. ACQ).
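For instance, a workflow that begins with raw ACQ files would start like this (the telescope name here is copied from the example workflow at the end of this page):

steps:
  - function: convert
    params:
      telescope_name: "EDGES-low2"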

Using the process command

It is generally best to include a complete workflow in a single file – all the way from the initial convert to the final averaging down to a single spectrum (if desired). Keeping it in a single file makes it easier to reason about what was done to produce the results later on (especially in conjunction with the history in the written files). This is made possible by the --start option to the process command, which lets you start the workflow from any given step, so that if the workflow fails for some reason mid-way, you don’t have to restart the whole thing (as long as you have written checkpoints).

The typical envisaged usage of the process command, given a workflow file called workflow.yaml, is:

$ edges-analysis process workflow.yaml \
     -i "/path/to/raw/data/2015_*.acq" \
     --outdir ./my_workflow \
     --nthreads 16

Note that the input files (specified by -i) can be specified with a glob, and multiple -i options can be given, as shown below. If you start with .acq files, be sure to use convert as your first step. The --outdir option tells the workflow where to write the output data files; all filenames given in the workflow are relative to this directory, which defaults to the current directory.
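For example, to process two seasons of data in one go, give two input globs (the paths here are illustrative):

$ edges-analysis process workflow.yaml \
     -i "/path/to/raw/data/2015_*.acq" \
     -i "/path/to/raw/data/2016_*.acq" \
     --outdir ./my_workflow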

If you only want to run a portion of the workflow, you can specify --stop <NAME>, where NAME is the name (or function name) of the last step you want to run.
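For example, to run the example workflow at the end of this page only up to its 15-minute LST-binning step:

$ edges-analysis process workflow.yaml --outdir ./my_workflow --stop lst-bin-15min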

You can resume the workflow by simply pointing to the same output directory without giving any inputs:

$ edges-analysis process workflow.yaml --outdir ./my_workflow

Every time the workflow is run, a “progressfile.yaml” is written to the output directory, containing the full specification of the run, plus some extra metadata required to know what has already been run. You can add new input files to the workflow by adding new -i entries:

$ edges-analysis process workflow.yaml \
     -i "/path/to/raw/data/2016_*.acq" \
     --outdir ./my_workflow \
     --nthreads 16

This will run all the 2016 files, and then combine them with the 2015 files as necessary. The 2015 files will not be reprocessed unless required (e.g. when LST-averaging).

If you’d prefer to completely restart the process with the new files, just use the --restart option.
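For example, to throw away the existing progress and process only the 2016 files:

$ edges-analysis process workflow.yaml \
     -i "/path/to/raw/data/2016_*.acq" \
     --outdir ./my_workflow \
     --restart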

The fork command

If you want to change your workflow but keep the existing processing, you can “fork” the current working directory and start the new workflow from wherever it diverges from the original. To do this, use:

$ edges-analysis fork new-workflow.yaml ./my_workflow --output ./new_workflow

Then, run the process command as normal with --outdir ./new_workflow, as shown below.
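A complete fork-then-process session might therefore look like this (the directory names are illustrative):

$ edges-analysis fork new-workflow.yaml ./my_workflow --output ./new_workflow
$ edges-analysis process new-workflow.yaml --outdir ./new_workflow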

Commands

Here, we give a basic overview of the commands available, and their respective options. Note that --help can be used at any time on the command line for any command.

process

Process a dataset through the filtering/averaging steps defined in a WORKFLOW file.

WORKFLOW

is a YAML file containing a “steps” parameter, which should be a list of steps to execute.

process [OPTIONS] WORKFLOW

Options

-i, --path <path>

The path(s) to input files. Multiple specifications of -i can be included. Each input path may have glob-style wildcards, e.g. /path/to/file.*. If the path is a directory, all HDF5/ACQ files in the directory will be used. You may prefix the path with a colon to indicate the “standard” location (given by config['paths']), e.g. -i :big-calibration/.

-o, --outdir <outdir>

The directory into which to save the outputs. Relative paths in the workflow are deemed relative to this directory.

-v, --verbosity <verbosity>

Level of verbosity of the logging.

-j, --nthreads <nthreads>

How many threads to use.

--mem-check, --no-mem-check

Whether to perform a memory check.

-e, --exit-on-inconsistent, -E, --ignore-inconsistent

Whether to immediately exit if any complete step is inconsistent with the progressfile.

-r, --restart, -a, --append

Whether any new input paths should be appended, or whether everything should be restarted with just those files.

--stop <stop>

The name of the step at which to stop the workflow.

--start <start>

The name of the step at which to start the workflow.
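For example, to restart the example workflow at the end of this page from the step immediately following its first written checkpoint:

$ edges-analysis process workflow.yaml --outdir ./my_workflow --start aux_filter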

Arguments

WORKFLOW

Required argument

avail

List all available GSData processing commands.

avail [OPTIONS]

Options

-k, --kinds <kinds>

Kinds of data to process
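For example, to list every registered function (optionally filtered by kind with -k):

$ edges-analysis avail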

Example Workflow File

Here is a sample “real-world” workflow file:

An example workflow YAML.
globals:
  calobs: alanlike_calfile.h5
  band: low
  s11_path: /data5/edges/data/S11_antenna/low_band/20160830_a/s11


steps:
  - function: convert
    params:
      telescope_name: "EDGES-low2"
  - function: select_freqs
    params:
      range: [50, 100]
  - function: add_weather_data
  - function: add_thermlog_data
    params:
      band: low
    write: "{prev_stem}.gsh5"
  - function: aux_filter
    params:
      maxima:
        adcmax: 0.4
        ambient_hum: 100

  - function: rfi_model_filter
    name: first-rfi-model
    params:
      decrement_threshold: 1
      increase_order: true
      init_flags:
        - 90.0
        - 100.0
      max_iter: 40
      max_terms: 40
      min_terms: 8
      min_threshold: 3.5
      model: !Model
        model: polynomial
        n_terms: 5
        offset: -2.5
        parameters: null
      n_resid: -1
      threshold: 6.5
      watershed: 4
  - function: rfi_watershed_filter
    name: first-rfi-watershed
    params:
      tol: 0.7
  - function: negative_power_filter
#  - function: total_power_filter
#    params:
#      bands:
#        - [50, 100]
#        - [50, 75]
#        - [75, 100]
#      metric_model: !Model
#        model: fourier
#        n_terms: 40
#        period: 48.0
#      std_model: !Model
#        model: fourier
#        n_terms: 10
#        period: 48.0
#      threshold: 3
  - function: sun_filter
    params:
      elevation_range: [-90.0, -10.0]
  - function: moon_filter
    params:
      elevation_range: [-90.0, 90.0]

  # CALIBRATION BLOCK
  - function: dicke_calibration
  - function: freq_bin_with_models
    params:
      resolution: 8
      model: !Model
        model: polynomial
        n_terms: 10
  - function: apply_noise_wave_calibration
    params:
      calobs: "{{ globals.calobs }}"
      band: "{{ globals.band }}"
      s11_path: "{{ globals.s11_path }}"
  - function: apply_loss_correction
    params:
      antenna_correction: false
      balun_correction: true
      ground_correction: ':'
      calobs: "{{ globals.calobs }}"
      band: "{{ globals.band }}"
      s11_path: "{{ globals.s11_path }}"
  - function: apply_beam_correction
    params:
      band: low
      beam_file: /data4/nmahesh/edges/edges-field-levels/beam_factors/low_band_beam_factor_fromnive_azrotm6.h5
    write: cal/{prev_stem}.gsh5


  - function: add_model
    name: linlog
    params:
      model: !Model
        model: linlog
        beta: -2.5
        f_center: 75.0
        n_terms: 5
        with_cmb: false
  - function: lst_bin
    name: lst-bin-15min
    params:
      binsize: 0.25
    write: cal/{prev_stem}.L15min.gsh5
  - function: lst_average
    write: cal/lst-avg/lst_average.gsh5
  - function: lst_bin
    name: lst-bin-24hr
    params:
      binsize: 24.0
    write: cal/lst-avg/lstbin24hr.gsh5
  - function: freq_bin
    params:
      resolution: 8
    write: cal/lst-avg/{prev_stem}.400kHz.gsh5
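Note that this workflow writes checkpoints after the thermlog step, after the calibration block, and for each LST product, so a failed run can typically be resumed with --start from any of those points. It can be run with the same process invocation shown earlier, e.g.:

$ edges-analysis process workflow.yaml \
     -i "/path/to/raw/data/2015_*.acq" \
     --outdir ./my_workflow \
     --nthreads 16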