Using edges-analysis

You can use edges-analysis via the Python library or the command line. While the Python library has more features (eg. plotting), the command-line interface is recommended for performing the actual processing of data for publications, because it is better able to keep track of the history and intermediate steps, making the results more reproducible.

Library Usage

The library is split into subpackages for performing different kinds of data processing: eg. calibration, filters, and averaging. There are also submodules for calculating beam corrections, sky models, and other useful functions such as plotting. The primary data object that all functionality is built around is the edges_analysis.gsdata.GSData object. This object stores all relevant data from a global-signal measurement, and has useful methods for accessing parts of the data, as well as reading/writing the data to standard formats (including HDF5 and ACQ).

Note

The GSData object is a significant departure from our previous framework, in which each processing step had its own data format and required parameters. The GSData object is lightweight, and can represent the data at any stage of processing.

All processing functions in edges-analysis operate on the GSData object. The GSData object is considered to be immutable: functions that process the object return a new object with updated data. This makes reasoning about the code much simpler, and makes it easier to write reusable code. All processing functions that take in a GSData object and return a new one are decorated with the @gsregister decorator, which registers the function so that it is discoverable by the CLI (more on that later), and also updates the history of the object automatically, so no manual updating of the history is required.

Reading/Writing Data

To read in data as a GSData object, simply use the from_file method:

from edges_analysis import GSData
data = GSData.from_file('data.acq', telescope_name="EDGES-low")

Notice that we passed the telescope name, which is a piece of (optional) metadata that the ACQ file doesn’t store. Any parameter to GSData that ACQ doesn’t natively contain can be passed in this way when constructing the object. While ACQ is readable, the GSData object supports a native HDF5-based format which is both faster to read and able to contain more metadata. We can write such a file:

data.write_gsh5("data.gsh5")

This file can be read using the same method as above:

data = GSData.from_file('data.gsh5')

Notice that here we didn’t have to specify the telescope_name parameter, because the file format contains this information.

Updating the Object

As already stated, the GSData object is to be considered immutable. This means you can be confident that any function that “changes” your data object will in fact return a new object with the updated data, so you can keep a reference to the original, unchanged object if necessary. However, if any arrays are shared between the two objects, the memory is not copied; if you were to inadvertently modify one of those arrays in-place, both objects would be affected. Don’t do this.
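If you do need to manipulate an array by hand, a minimal sketch of the safe pattern (plain numpy, nothing edges-analysis-specific) is to take an explicit copy first:

import numpy as np

# data.data may share memory with other GSData objects, so take an independent
# copy before doing any in-place work on it.
spectra = np.array(data.data, copy=True)
spectra *= 2  # safe: leaves data.data untouched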

The “official” way to update the object is to use the update method:

data = data.update(data=data.data * 3, data_unit="uncalibrated")

This will return the new object. However, this doesn’t update the history automatically. The history can be updated by supplying a dictionary with at least a message:

data = data.update(
    data=data.data * 3,
    data_unit="uncalibrated",
    history={"message": "Multiplied by 3"}
)

In actual fact, the history object that is added is an edges_analysis.gsdata.Stamp object, which is a lightweight object that can be easily serialized to YAML, and which adds a default timestamp and set of code versions to the history. You can use one of these directly if you wish:

from datetime import datetime
from edges_analysis import Stamp
data = data.update(
    data=data.data * 3,
    data_unit="uncalibrated",
    history=Stamp(message="Multiplied by 3", timestamp=datetime.now())
)

If you write a function that updates a GSData object, it is better to include the function name and the parameters it uses in the history:

def multiply_by_3(data, data_unit):
    return data.update(
        data=data.data * 3,
        data_unit=data_unit,
        history={"function": "multiply_by_3", "parameters": {"data_unit": data_unit}}
    )

data = multiply_by_3(data, "uncalibrated")

However, if you are going to write functions that update the data, there is a better way to do it, as we shall see now.

Using the Register

The edges_analysis.gsdata.gsregister() decorator makes it simpler to write new functions that update GSData objects. This decorator does a few things: it registers the function into a global dictionary, GSDATA_PROCESSORS, and it adds the function call to the history of the object it returns. Using registered functions is simple: just call the function with the object as the first argument, and any other parameters as keyword arguments. Since most internally-defined functions have already been registered, you can use them out of the box. For example:

from edges_analysis.averaging import freq_bin
data = freq_bin(data, resolution=8)

The returned data object has a different data-shape (it has 1/8th of the frequency channels), and the history contains a new entry. You can print that history:

print(str(data.history))

Or just print the most recent addition to the history:

print(str(data.history[-1]))

The history attribute also has a pretty() method, which can be used with the rich library to pretty-print the history:

from rich.console import Console
console = Console()
console.print(data.history.pretty())

Adding your own registered processor is simple – just use the decorator over a function with the correct signature:

from edges_analysis.gsdata import gsregister, GSData

@gsregister("filter")
def filter_every_nth_channel(data: GSData, *, n: int = 1) -> GSData:
    return data.update(
        data=data.data[..., ::n],
        nsamples=data.nsamples[..., ::n],
        flags={k: v[..., ::n] for k, v in data.flags.items()},
    )

Note here that the first argument to the function is always a GSData instance, and the return value is always another GSData instance. All other parameters should be keyword arguments, and can in principle be anything, but it is best to make them types that can easily be understood by YAML (this helps with writing out the history, and also for defining workflows for the CLI). Note also that the gsregister decorator takes a single argument: the kind of processor. This is important, because it enables the workflow to make judgments on how to call the function in certain cases, and also makes it possible to find subsets of the available processors.
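Once registered, the new function is called just like any built-in processor: pass the GSData object as the first argument and everything else as keyword arguments, and the history entry is added automatically. Continuing the example above:

data = filter_every_nth_channel(data, n=2)
print(data.history[-1])  # the entry added by the register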

Finally, the register is most useful in conjunction with the CLI interface, which is described below.

Making Plots

The edges_analysis.plots module contains functions that can be used to make plots from a GSData object. For example, let’s say we have a GSData file:

from edges_analysis import GSData, plots
data = GSData.from_file('2015_202_00.gsh5')

# Plot a flagged waterfall of the data (whether it's residuals or spectra)
plots.plot_waterfall(data)

# Plot the same but show the nsamples instead of data
plots.plot_waterfall(data, attribute='nsamples')

# Plot the data residuals (if they exist) and don't apply any flags.
plots.plot_waterfall(data, attribute='resids', which_flags=())
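The plotting functions are built on matplotlib, so the usual matplotlib machinery applies. As a minimal sketch (assuming the waterfall is drawn onto the currently-active figure), you can save a plot to disk:

import matplotlib.pyplot as plt

plots.plot_waterfall(data)
plt.savefig("waterfall_2015_202.png")  # saves whatever is on the current figure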

CLI Usage

The intention is that full processing pipelines should be run via the CLI interface. This interface assures that the pipeline is reproducible: it comes from a single YAML workflow file, and each processing step that is run is necessarily a @gsregister function, so it preserves history as much as possible.

The Workflow File

The primary command in the CLI is process. This command takes a single required argument – the workflow file – and several optional arguments related to I/O. The full option list can be found below, but here we will cover the main thing: the workflow.

The workflow file is a YAML file that contains a list of processing steps. It should be formatted as follows:

globals:
   param1: value1
   param2: value2

steps:
    - function: <<name of gsregistered function>>
      params:
        a_parameter: <<value>>
        another_parameter: <<value>>

    - function: <<name of gsregistered function>>
      name: the-first-invocation-of-foobar
      params:
        a_parameter: {{ globals.param1 }}
        another_parameter: {{ globals.param2 }}
      write: {prev_stem}.foobar.gsh5

    - ...

Note a few things. First, we have a globals section, which contains parameter values that can be used in later steps (use these if you need to use the same value more than once). These are interpolated into the later steps via Jinja templates, i.e. by using double-curly braces with spaces (as seen for the second function’s a_parameter). Secondly, we have a steps section, which defines a list of processing steps that will happen in the order they are defined. Each step can have four keys:

  • function is the only required key, and specifies a gsregistered function to run. The CLI can only find the function if it has been registered. You can get a current list of available functions (and their types) by using edges-analysis avail.

  • name gives a unique name to the step. By default, it is the function name, but if you use the same function more than once, you will need to specify a unique name.

  • params is a dictionary of parameters to pass to the function, other than the GSData object itself.

  • write is optional, and if included, it tells the workflow to write out a new file at this point. The value given is the filename to write. By default, if the file already exists, the workflow will error. Notice that the value here also uses curly braces. In this case, it is not a Jinja template, but rather a standard string-format. Each step has access to the variables prev_stem (i.e. the filename, without extension, of the last written step), prev_dir (i.e. the directory of the last written step), and fncname (i.e. the name of the function for this step).
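As a plain-Python illustration of that string formatting (the substituted values here are hypothetical, not produced by edges-analysis), a write value expands like this:

"cal/{prev_stem}.{fncname}.gsh5".format(prev_stem="2015_202_00", fncname="lst_bin")
# -> 'cal/2015_202_00.lst_bin.gsh5'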

Note

There is one “special” function that can be used in a workflow even though it is not in the register: the convert function. This function can be used to initially read files in a different format (eg. ACQ).

Using the process command

It is generally best to include a complete workflow in a single file – all the way from initial convert to the final averaging down to a single spectrum (if desired). Keeping it in a single file means it is easier to reason about what was done to produce the results later on (especially in conjunction with the history in the written files). Doing this is made possible by an option to the process command that lets you start the workflow from any given step, which means that if the workflow fails mid-way for some reason, you don’t have to restart the whole thing (as long as you have written checkpoints).

The typical envisaged usage of the process command, given a workflow file called workflow.yaml, is:

$ edges-analysis process workflow.yaml \
     -i "/path/to/raw/data/2015_*.acq" \
     --outdir ./my_workflow \
     --nthreads 16

Note that the input files (specified by -i) can be specified with a glob, and multiple -i options can be given. If you start with .acq files, be sure to use convert as your first step. The --outdir option tells the workflow where to write the output data files. All filenames given in the workflow are relative to this. It is the current directory by default.

If you only want to run a portion of the workflow, you can specify --stop <NAME>, where the name is the name of the step (or its function name) which is the last one you want to run.

You can resume the workflow by simply pointing to the same output directory without giving any inputs:

$ edges-analysis process workflow.yaml --outdir ./my_workflow

Every time the workflow is run, a “progressfile.yaml” is written to the output directory, containing the full specification of the run, plus some extra metadata required to know what has already been run. You can add new input files to the workflow by adding new -i entries:

$ edges-analysis process workflow.yaml \
     -i "/path/to/raw/data/2016_*.acq" \
     --outdir ./my_workflow \
     --nthreads 16

This will run all the 2016 files, and then combine them with the 2015 files as necessary. The 2015 files will not be reprocessed unless required (eg. when LST-averaging).

If you’d prefer to completely restart the process with the new files, just use the --restart option.

The fork command

If you want to change your workflow but keep the existing processing, you can “fork” the current working directory and start the new workflow from wherever it diverges from the original. To do this, use:

$ edges-analysis fork new-workflow.yaml ./my_workflow --output ./new_workflow

Then, run the process command as normal with --outdir ./new_workflow.

Commands

Here, we give a basic overview of the commands available, and their respective options. Note that --help can be used at any time on the command line for any command.

process

Process a dataset through the steps defined in a WORKFLOW file.

WORKFLOW

is a YAML file containing a “steps” parameter, which should be a list of steps to execute.

process [OPTIONS] WORKFLOW

Options

-i, --path <path>

The path(s) to input files. Multiple specifications of -i can be included. Each input path may have glob-style wildcards, eg. /path/to/file.*. If the path is a directory, all HDF5/ACQ files in the directory will be used. You may prefix the path with a colon to indicate the “standard” location (given by config['paths']), e.g. -i :big-calibration/.

-o, --outdir <outdir>

The directory into which to save the outputs. Relative paths in the workflow are deemed relative to this directory.

-v, --verbosity <verbosity>

Level of verbosity of the logging.

-j, --nthreads <nthreads>

How many threads to use.

--mem-check, --no-mem-check

Whether to perform a memory check.

-e, --exit-on-inconsistent, -E, --ignore-inconsistent

Whether to immediately exit if any complete step is inconsistent with the progressfile.

-r, --restart, -a, --append

Whether any new input paths should be appended, or if everything should be restarted with just those files.

--stop <stop>

The name of the step at which to stop the workflow.

--start <start>

The name of the step at which to start the workflow.

Arguments

WORKFLOW

Required argument

avail

List all available GSData processing commands.

avail [OPTIONS]

Options

-k, --kinds <kinds>

Kinds of data to process

Example Workflow File

Here is a sample “real-world” workflow file:

An example workflow YAML.
globals:
  calobs: alanlike_calfile.h5
  band: low
  s11_path: /data5/edges/data/S11_antenna/low_band/20160830_a/s11


steps:
  - function: convert
    params:
      telescope_name: "EDGES-low2"
  - function: select_freqs
    params:
      range: [50, 100]
  - function: add_weather_data
  - function: add_thermlog_data
    params:
      band: low
    write: "{prev_stem}.gsh5"
  - function: aux_filter
    params:
      maxima:
        adcmax: 0.4
        ambient_hum: 100

  - function: rfi_model_filter
    name: first-rfi-model
    params:
      decrement_threshold: 1
      increase_order: true
      init_flags:
        - 90.0
        - 100.0
      max_iter: 40
      max_terms: 40
      min_terms: 8
      min_threshold: 3.5
      model: !Model
        model: polynomial
        n_terms: 5
        offset: -2.5
        parameters: null
      n_resid: -1
      threshold: 6.5
      watershed: 4
  - function: rfi_watershed_filter
    name: first-rfi-watershed
    params:
      tol: 0.7
  - function: negative_power_filter
#  - function: total_power_filter
#    params:
#      bands:
#        - [50, 100]
#        - [50, 75]
#        - [75, 100]
#      metric_model: !Model
#        model: fourier
#        n_terms: 40
#        period: 48.0
#      std_model: !Model
#        model: fourier
#        n_terms: 10
#        period: 48.0
#      threshold: 3
  - function: sun_filter
    params:
      elevation_range: [-90.0, -10.0]
  - function: moon_filter
    params:
      elevation_range: [-90.0, 90.0]

  # CALIBRATION BLOCK
  - function: dicke_calibration
  - function: freq_bin_with_models
    params:
      resolution: 8
      model: !Model
        model: polynomial
        n_terms: 10
  - function: apply_noise_wave_calibration
    params:
      calobs: "{{ globals.calobs }}"
      band: "{{ globals.band }}"
      s11_path: "{{ globals.s11_path }}"
  - function: apply_loss_correction
    params:
      antenna_correction: false
      balun_correction: true
      ground_correction: ':'
      calobs: "{{ globals.calobs }}"
      band: "{{ globals.band }}"
      s11_path: "{{ globals.s11_path }}"
  - function: apply_beam_correction
    params:
      band: low
      beam_file: /data4/nmahesh/edges/edges-field-levels/beam_factors/low_band_beam_factor_fromnive_azrotm6.h5
    write: cal/{prev_stem}.gsh5


  - function: add_model
    name: linlog
    params:
      model: !Model
        model: linlog
        beta: -2.5
        f_center: 75.0
        n_terms: 5
        with_cmb: false
  - function: lst_bin
    name: lst-bin-15min
    params:
      binsize: 0.25
    write: cal/{prev_stem}.L15min.gsh5
  - function: lst_average
    write: cal/lst-avg/lst_average.gsh5
  - function: lst_bin
    name: lst-bin-24hr
    params:
      binsize: 24.0
    write: cal/lst-avg/lstbin24hr.gsh5
  - function: freq_bin
    params:
      resolution: 8
    write: cal/lst-avg/{prev_stem}.400kHz.gsh5