Using edges-analysis¶
You can use edges-analysis via the Python library or the command line.
While the Python library has more features (e.g. plotting), the command line is
recommended for the actual processing of data intended for publication,
because it is better able to keep track of the history and intermediate steps,
making the results more reproducible.
Library Usage¶
The library is split into subpackages for performing different kinds of data processing,
e.g. calibration, filters, and averaging. There are also submodules for
calculating beam corrections, sky models, and other useful functions such as plotting.
The primary data object that all functionality is built around is the
edges_analysis.gsdata.GSData object. This object stores all relevant data from
a global-signal measurement, and has useful methods for accessing parts of the data,
as well as reading/writing the data to standard formats (including HDF5 and ACQ).
Note
The GSData object is a significant departure from our previous framework, in which each processing step had its own data format and required parameters. The GSData object is lightweight, and can represent any stage of the data processing.
All processing functions in edges-analysis operate on the GSData object.
The GSData object is considered immutable: functions that process the object
return a new object with updated data. This makes it much simpler to reason
about the code, and easier to write reusable code.
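For example (a minimal sketch, given some GSData instance data, loaded as shown in the next section):

from edges_analysis.averaging import freq_bin

# Processing returns a NEW object; `data` itself is left untouched.
binned = freq_bin(data, resolution=8)
assert binned is not data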
All processing functions that take in a GSData object and return a new one are
decorated with the @gsregister decorator, which registers the function to be
discoverable by the CLI (more on that later), but also enables automatic updating
of the history of the object, so no manual updating of the history is required.
Reading/Writing Data¶
To read in data as a GSData object, simply use the from_file method:
from edges_analysis import GSData
data = GSData.from_file('data.acq', telescope_name="EDGES-low")
Notice that we passed the telescope name, which is a piece of (optional) metadata that the ACQ file doesn’t store. Any parameter to GSData that ACQ doesn’t natively contain can be passed in this way when constructing the object. While ACQ is readable, the GSData object supports a native HDF5-based format which is both faster to read, and is able to contain more metadata. We can write such a file:
data.write_gsh5("data.gsh5")
This file can be read using the same method as above:
data = GSData.from_file('data.gsh5')
Notice that here we didn’t have to specify the telescope_name parameter, because
the file format contains this information.
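Once read, the object can be inspected directly. For example (a sketch; the attributes used here all appear later in this guide):

print(data.data.shape)      # the spectra themselves
print(data.data_unit)       # e.g. "uncalibrated"
print(data.nsamples.shape)  # the number of samples per datum
print(list(data.flags))     # the names of the flag arrays
print(data.history)         # the processing history so far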
Updating the Object¶
As already stated, the GSData object is to be considered immutable. This means that you can be confident that any function that “changes” your data object will in fact return a new object with the updated data. Thus, you can keep a reference to the original unchanged object if necessary. Despite this, if any arrays are the same between the objects, the memory will not be copied. Thus, if you were to inadvertently modify one of the arrays in place, both objects would be affected. Don’t do this.
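As an illustration (a sketch, using the update method described next, and assuming update accepts any subset of fields):

# Only data_unit is replaced here, so the arrays are shared, not copied:
relabelled = data.update(data_unit="uncalibrated")
assert relabelled.data is data.data  # shared memory

# DON'T modify a shared array in place (it would change both objects):
# relabelled.data *= 3
# DO build a fresh array and pass it to update():
scaled = data.update(data=data.data * 3)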
The “official” way to update the object is to use the update method:
data = data.update(data=data.data * 3, data_unit="uncalibrated")
This will return the new object. However, this doesn’t update the history automatically. The history can be updated by supplying a dictionary with at least a message:
data = data.update(
    data=data.data * 3,
    data_unit="uncalibrated",
    history={"message": "Multiplied by 3"}
)
In fact, the history entry that is added is an edges_analysis.gsdata.Stamp
object, which is a lightweight object that can be easily serialized to YAML, and adds
a default timestamp and set of code versions to the history. You can use one of these
directly if you wish:
from datetime import datetime

from edges_analysis import Stamp

data = data.update(
    data=data.data * 3,
    data_unit="uncalibrated",
    history=Stamp(message="Multiplied by 3", timestamp=datetime.now())
)
If you write a function that updates a GSData object, it is better to include the function name and the parameters it uses in the history:
def multiply_by_3(data, data_unit):
    return data.update(
        data=data.data * 3,
        data_unit=data_unit,
        history={"function": "multiply_by_3", "parameters": {"data_unit": data_unit}}
    )

data = multiply_by_3(data, "uncalibrated")
However, if you are going to write functions that update the data, there is a better way to do it, as we shall see now.
Using the Register¶
There is a decorator, edges_analysis.gsdata.gsregister(), that makes writing new
functions that update GSData objects simpler.
This decorator does a few things: it registers the function into a global dictionary,
GSDATA_PROCESSORS, and it adds the function to the history of the object.
Using registered functions is simple: just call the function with the object as the
first argument, and any other parameters as keyword arguments. Since most
internally-defined functions have already been registered, you can use them out of the
box. For example:
from edges_analysis.averaging import freq_bin
data = freq_bin(data, resolution=8)
The returned data object has a different data shape (it has 1/8th of the frequency
channels), and the history contains a new entry. You can print that history:
print(str(data.history))
Or just print the most recent addition to the history:
print(str(data.history[-1]))
The history attribute also has a pretty() method, which can be used with the
rich library to pretty-print the history:
from rich.console import Console
console = Console()
console.print(data.history.pretty())
Adding your own registered processor is simple: just use the decorator over a function with the correct signature:
from edges_analysis.gsdata import gsregister, GSData
@gsregister("filter")
def filter_every_nth_channel(data: GSData, *, n: int = 1) -> GSData:
    return data.update(
        data=data.data[..., ::n],
        nsamples=data.nsamples[..., ::n],
        flags={k: v[..., ::n] for k, v in data.flags.items()},
    )
Note here that the first argument to the function is always a GSData instance, and the
return value is always another GSData instance. All other parameters should be keyword
arguments, and can in principle be anything, but it is best to make them types that can
easily be understood by YAML (this helps with writing out the history, and also for
defining workflows for the CLI).
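Once registered, the function can be called like any other processor, and the decorator takes care of the history. For example (n=2 here is purely illustrative):

thinned = filter_every_nth_channel(data, n=2)
print(thinned.history[-1])  # the step was recorded automatically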
Note also that the gsregister decorator takes a single argument: the kind of
processor. This is important, because it enables the workflow to make judgments on how
to call the function in certain cases, and also makes it possible to find subsets of the
available processors.
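For instance, you can browse the registry directly (a sketch; beyond being the global dictionary of registered functions mentioned above, the exact structure of GSDATA_PROCESSORS is an assumption here):

from edges_analysis.gsdata import GSDATA_PROCESSORS

print(sorted(GSDATA_PROCESSORS))  # names of all registered processors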
Finally, the register is most useful in conjunction with the CLI interface, which is described below.
Making Plots¶
The edges_analysis.plots module contains functions that can be used to make plots
from a GSData object. For example, let’s say we have a GSData file:
from edges_analysis import GSData, plots
data = GSData.from_file('2015_202_00.gsh5')
# Plot a flagged waterfall of the data (whether it's residuals or spectra)
plots.plot_waterfall(data)
# Plot the same but show the nsamples instead of the data
plots.plot_waterfall(data, attribute='nsamples')
# Plot the data residuals (if they exist) and don't apply any flags.
plots.plot_waterfall(data, attribute='resids', which_flags=())
CLI Usage¶
The intention is that full processing pipelines should be run via the CLI interface.
This interface assures that the pipeline is reproducible: it comes from a single YAML
workflow file, and each processing step that is run is necessarily a @gsregister
function, so it preserves history as much as possible.
The Workflow File¶
The primary command in the CLI is process. This command takes a single required
argument (the workflow file) and several optional arguments related to I/O.
The full option list can be found below, but here we will cover the main thing:
the workflow.
The workflow file is a YAML file that contains a list of processing steps. It should be formatted as follows:
globals:
  param1: value1
  param2: value2

steps:
  - function: <<name of gsregistered function>>
    params:
      a_parameter: <<value>>
      another_parameter: <<value>>
  - function: <<name of gsregistered function>>
    name: the-first-invocation-of-foobar
    params:
      a_parameter: "{{ globals.param1 }}"
      another_parameter: "{{ globals.param2 }}"
    write: "{prev_stem}.foobar.gsh5"
  - ...
Note a few things. First, we have a globals section, which contains parameter values
that can be used in later steps (use these if you need to use the same value more than
once). These are interpolated into the later steps via Jinja templates, i.e. by
using double-curly braces with spaces (as seen for the second function’s a_parameter).
Secondly, we have a steps section, which defines a list of processing steps that will
happen in the order they are defined. Each step can have four keys:

- function is the only required key, and specifies a gsregistered function to run. The CLI can only find the function if it has been registered. You can get a current list of available functions (and their types) by using edges-analysis avail.
- name gives a unique name to the step. By default, it is the function name, but if you use the same function more than once, you will need to specify a unique name.
- params is a dictionary of parameters to pass to the function, other than the GSData object itself.
- write is optional, and if included, it tells the workflow to write out a new file at this point. The value given is the filename to write. By default, if the file already exists, the workflow will error. Notice that the value here also uses curly braces. In this case, it is not a Jinja template, but rather a standard string-format. Each step has access to the variables prev_stem (i.e. the filename, without extension, of the last written step), prev_dir (i.e. the directory of the last written step), and fncname (i.e. the name of the function for this step); see the short example after this list.
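As an example of these string-format variables, a hypothetical step could write its output alongside the previous checkpoint:

- function: freq_bin
  name: coarse-bin
  params:
    resolution: 8
  write: "{prev_dir}/{prev_stem}.{fncname}.gsh5"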
Note
There is one “special” function that can be used that is not in the gsregister:
the convert function. This function can be used to initially read files in a
different format (e.g. ACQ).
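For instance, a workflow that starts from raw ACQ files might begin (mirroring the example workflow at the end of this page):

steps:
  - function: convert
    params:
      telescope_name: "EDGES-low"
  - ...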
Using the process command¶
It is generally best to include a complete workflow in a single file, all the way
from the initial convert to the final averaging down to a single spectrum (if desired).
Keeping it in a single file means it is easier to reason about what was done to produce
the results later on (especially in conjunction with the history in the written files).
Doing this is made possible by an option to the process command that lets you
start the workflow from any given step, which means that if the workflow fails for some
reason mid-way, you don’t have to restart the whole thing (as long as you have written
checkpoints).
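For example, to resume a workflow at a particular step (the step name here is illustrative; the --start option is documented below):

$ edges-analysis process workflow.yaml \
    --outdir ./my_workflow \
    --start dicke_calibration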
The typical envisaged usage of the process command, given a workflow file called
workflow.yaml, is:
$ edges-analysis process workflow.yaml \
    -i "/path/to/raw/data/2015_*.acq" \
    --outdir ./my_workflow \
    --nthreads 16
Note that the input files (specified by -i) can be specified with a glob, and
multiple -i options can be given. If you start with .acq files, be sure to use
convert as your first step. The --outdir option tells the workflow where to
write the output data files; all filenames given in the workflow are relative to this
directory, which is the current directory by default.
If you only want to run a portion of the workflow, you can specify --stop <NAME>,
where NAME is the name of the step (or its function name) that is the last one you
want to run.
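For example (borrowing a step name from the example workflow at the end of this page):

$ edges-analysis process workflow.yaml \
    --outdir ./my_workflow \
    --stop first-rfi-model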
You can resume the workflow by simply pointing to the same output directory without giving any inputs:
$ edges-analysis process workflow.yaml --outdir ./my_workflow
Every time the workflow is run, a “progressfile.yaml” is written to the output
directory, containing the full specification of the run, plus some extra metadata
required to know what has already been run. You can add new input files to the workflow
by adding new -i entries:
$ edges-analysis process workflow.yaml \
    -i "/path/to/raw/data/2016_*.acq" \
    --outdir ./my_workflow \
    --nthreads 16
This will run all the 2016 files, and then combine them with the 2015 files as necessary. The 2015 files will not be reprocessed unless required (e.g. when LST-averaging).
If you’d prefer to completely restart the process with the new files, just use the
--restart option.
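For example:

$ edges-analysis process workflow.yaml \
    -i "/path/to/raw/data/2016_*.acq" \
    --outdir ./my_workflow \
    --restart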
The fork command¶
If you want to change your workflow but keep the existing processing, you can “fork” the current working directory and start the new workflow from wherever it diverges from the original. To do this, use:
$ edges-analysis fork new-workflow.yaml ./my_workflow --output ./new_workflow
Then, run the process command as normal with --outdir ./new_workflow.
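That is, given the fork above:

$ edges-analysis process new-workflow.yaml --outdir ./new_workflow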
Commands¶
Here, we give a basic overview of the commands available, and their respective options.
Note that --help
can be used at any time on the command line for any command.
process¶
Process a dataset through the averaging/filtering steps defined in WORKFLOW.
- WORKFLOW is a YAML file containing a “steps” parameter, which should be a list of steps to execute.
process [OPTIONS] WORKFLOW
Options
- -i, --path <path>¶
The path(s) to input files. Multiple specifications of -i can be included. Each input path may have glob-style wildcards, e.g. /path/to/file.*. If the path is a directory, all HDF5/ACQ files in the directory will be used. You may prefix the path with a colon to indicate the “standard” location (given by config['paths']), e.g. -i :big-calibration/.
- -o, --outdir <outdir>¶
The directory into which to save the outputs. Relative paths in the workflow are deemed relative to this directory.
- -v, --verbosity <verbosity>¶
Level of verbosity of the logging.
- -j, --nthreads <nthreads>¶
How many threads to use.
- --mem-check, --no-mem-check¶
Whether to perform a memory check.
- -e, --exit-on-inconsistent, -E, --ignore-inconsistent¶
Whether to immediately exit if any complete step is inconsistent with the progressfile.
- -r, --restart, -a, --append¶
Whether any new input paths should be appended, or if everything should be restarted with just those files.
- --stop <stop>¶
The name of the step at which to stop the workflow.
- --start <start>¶
The name of the step at which to start the workflow.
Arguments
- WORKFLOW¶
Required argument
avail¶
List all available GSData processing commands.
avail [OPTIONS]
Options
- -k, --kinds <kinds>¶
Kinds of data to process
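For example, to list only the registered filters (the “filter” kind appears earlier in this guide):

$ edges-analysis avail -k filter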
Example Workflow File¶
Here is a sample “real-world” workflow file:
globals:
  calobs: alanlike_calfile.h5
  band: low
  s11_path: /data5/edges/data/S11_antenna/low_band/20160830_a/s11

steps:
  - function: convert
    params:
      telescope_name: "EDGES-low2"
  - function: select_freqs
    params:
      range: [50, 100]
  - function: add_weather_data
  - function: add_thermlog_data
    params:
      band: low
    write: "{prev_stem}.gsh5"
  - function: aux_filter
    params:
      maxima:
        adcmax: 0.4
        ambient_hum: 100
  - function: rfi_model_filter
    name: first-rfi-model
    params:
      decrement_threshold: 1
      increase_order: true
      init_flags:
        - 90.0
        - 100.0
      max_iter: 40
      max_terms: 40
      min_terms: 8
      min_threshold: 3.5
      model: !Model
        model: polynomial
        n_terms: 5
        offset: -2.5
        parameters: null
      n_resid: -1
      threshold: 6.5
      watershed: 4
  - function: rfi_watershed_filter
    name: first-rfi-watershed
    params:
      tol: 0.7
  - function: negative_power_filter
  # - function: total_power_filter
  #   params:
  #     bands:
  #       - [50, 100]
  #       - [50, 75]
  #       - [75, 100]
  #     metric_model: !Model
  #       model: fourier
  #       n_terms: 40
  #       period: 48.0
  #     std_model: !Model
  #       model: fourier
  #       n_terms: 10
  #       period: 48.0
  #     threshold: 3
  - function: sun_filter
    params:
      elevation_range: [-90.0, -10.0]
  - function: moon_filter
    params:
      elevation_range: [-90.0, 90.0]

  # CALIBRATION BLOCK
  - function: dicke_calibration
  - function: freq_bin_with_models
    params:
      resolution: 8
      model: !Model
        model: polynomial
        n_terms: 10
  - function: apply_noise_wave_calibration
    params:
      calobs: "{{ globals.calobs }}"
      band: "{{ globals.band }}"
      s11_path: "{{ globals.s11_path }}"
  - function: apply_loss_correction
    params:
      antenna_correction: false
      balun_correction: true
      ground_correction: ':'
      calobs: "{{ globals.calobs }}"
      band: "{{ globals.band }}"
      s11_path: "{{ globals.s11_path }}"
  - function: apply_beam_correction
    params:
      band: low
      beam_file: /data4/nmahesh/edges/edges-field-levels/beam_factors/low_band_beam_factor_fromnive_azrotm6.h5
    write: cal/{prev_stem}.gsh5
  - function: add_model
    name: linlog
    params:
      model: !Model
        model: linlog
        beta: -2.5
        f_center: 75.0
        n_terms: 5
        with_cmb: false
  - function: lst_bin
    name: lst-bin-15min
    params:
      binsize: 0.25
    write: cal/{prev_stem}.L15min.gsh5
  - function: lst_average
    write: cal/lst-avg/lst_average.gsh5
  - function: lst_bin
    name: lst-bin-24hr
    params:
      binsize: 24.0
    write: cal/lst-avg/lstbin24hr.gsh5
  - function: freq_bin
    params:
      resolution: 8
    write: cal/lst-avg/{prev_stem}.400kHz.gsh5