Preprocessor

The Preprocessors convert the datasets downloaded by the Exporters into a unified data format, which makes developing and testing models much more straightforward.

As with the Exporters, a unique Preprocessor must be written for each dataset.

All preprocessors extend the following base Preprocessor, which has some useful helper methods:

class src.preprocess.base.BasePreProcessor(data_folder: pathlib.Path = PosixPath('data'), output_name: Optional[str] = None)

Base for all pre-processor classes. The preprocessing classes are responsible for taking the raw data exports and normalizing them so that they can be ingested by the feature engineering class.

This involves:

subsetting the ROI (default is Kenya)
regridding to a consistent spatial grid (pixel size / resolution)
resampling to a consistent time step (hourly, daily, monthly)
assigning coordinates to .nc files (latitude, longitude, time)

All pre-processors should do this in a process function.

Parameters: data_folder – The location of the data folder. Default = Path("data").

static chop_roi(ds: xarray.core.dataset.Dataset, subset_str: Optional[str] = 'kenya', inverse_lat: bool = False) → xarray.core.dataset.Dataset

Select a geographical subset of the data, based on a subset string.

Parameters

ds – The dataset to be subsetted.
subset_str – Defines a geographical subset of the downloaded data to be used. Should be one of the regions defined in src.utils.region_lookup. Default = "kenya".

inverse_lat – Whether to invert the minimum and maximum latitudes. Default = False.
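As a library-free illustration of why a flag like inverse_lat can matter, the sketch below mimics label-based slicing on a latitude coordinate. It assumes that inverse_lat swaps the order of the latitude bounds, which is needed when latitudes are stored from north to south. The Region tuple, its values, and the helpers label_slice and subset_lat are hypothetical and are not part of src.preprocess.base or src.utils.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """Hypothetical bounding box (real regions live in src.utils.region_lookup)."""
    name: str
    lat_min: float
    lat_max: float
    lon_min: float
    lon_max: float


def label_slice(coords: List[float], start: float, stop: float) -> List[float]:
    """Mimic an inclusive label-based slice on a monotonic coordinate.

    The slice follows the storage order of the coordinate, so asking for
    start..stop against the direction of the data selects nothing.
    """
    ascending = coords[0] <= coords[-1]
    if ascending:
        return [c for c in coords if start <= c <= stop]
    return [c for c in coords if stop <= c <= start]


def subset_lat(lats: List[float], region: Region, inverse_lat: bool = False) -> List[float]:
    """Select the latitudes inside the region, optionally swapping the bounds."""
    if inverse_lat:
        return label_slice(lats, region.lat_max, region.lat_min)
    return label_slice(lats, region.lat_min, region.lat_max)


# Made-up region and a latitude axis stored north to south (descending).
toy_region = Region("toy", lat_min=-2.0, lat_max=2.0, lon_min=0.0, lon_max=4.0)
descending_lats = [4.0, 3.0, 2.0, 1.0, 0.0, -1.0, -2.0, -3.0]

without_flag = subset_lat(descending_lats, toy_region)
with_flag = subset_lat(descending_lats, toy_region, inverse_lat=True)
```

With the descending axis, the default bound order selects nothing, while swapping the bounds returns the five latitudes inside the box. Only the latitude axis is shown; longitudes would be handled the same way without any swap.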

static load_reference_grid(path_to_grid: pathlib.Path) → xarray.core.dataset.Dataset

Since the regridder only needs the latitude and longitude values of the reference grid, there is no need to pass around an enormous dataset for the regridding. This method loads the reference dataset and keeps only those coordinates.

Parameters: path_to_grid – A path to the reference dataset.
Returns: The loaded reference dataset, but with only the latitudes and longitudes.

merge_files(subset_str: Optional[str] = 'kenya', resample_time: Optional[str] = 'M', upsampling: bool = False, filename: Optional[str] = None) → None

Merge multiple interim files into a single preprocessed file. The time resampling happens here, since all the data is necessary to do that.

Parameters

subset_str – The optional subset string used to get a geographical subset of the data. Only used to make a more descriptive filename.
resample_time – Defines the time length to which the data will be resampled. If None, no time-resampling happens. Default = "M" (monthly).
upsampling – If True, the time resampling is treated as upsampling, and nearest-neighbour selection is used instead of the mean. Default = False.
filename – Override the default created filename by passing a string filename here.
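The difference between the two resampling modes described above can be sketched without any third-party library. This is a simplified stand-in, not the code used by merge_files: the helper names monthly_mean and nearest_upsample and the sample values are hypothetical. Downsampling to months averages every sample that falls in the same month, while upsampling has nothing to average and so takes the value of the closest sample in time.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Dict, List, Tuple

Sample = Tuple[date, float]


def monthly_mean(samples: List[Sample]) -> Dict[Tuple[int, int], float]:
    """Downsample: average all samples that share a (year, month)."""
    buckets: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for day, value in samples:
        buckets[(day.year, day.month)].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}


def nearest_upsample(samples: List[Sample], targets: List[date]) -> List[float]:
    """Upsample: for each target date, take the value of the closest sample."""
    return [
        min(samples, key=lambda s: abs((s[0] - target).days))[1]
        for target in targets
    ]


# Made-up daily values, reduced to one value per month.
daily = [(date(2000, 1, 1), 1.0), (date(2000, 1, 2), 3.0), (date(2000, 2, 1), 10.0)]
monthly = monthly_mean(daily)

# Made-up monthly values, expanded onto a finer set of dates.
coarse = [(date(2000, 1, 1), 5.0), (date(2000, 2, 1), 9.0)]
fine = nearest_upsample(coarse, [date(2000, 1, 10), date(2000, 1, 25)])
```

Taking a mean when upsampling would leave most of the finer time steps empty, which is presumably why a nearest-value rule is preferred in that case.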

regrid(ds: xarray.core.dataset.Dataset, reference_ds: xarray.core.dataset.Dataset, method: str = 'nearest_s2d', reuse_weights: bool = False, clean: bool = True) → xarray.core.dataset.Dataset

Use the xESMF package to regrid ds to the same grid as reference_ds.

Parameters

ds – The dataset to be regridded.
reference_ds – The reference dataset, onto which ds will be regridded.
method – The regridding method. One of {"bilinear", "conservative", "nearest_s2d", "nearest_d2s", "patch"}. Default = "nearest_s2d".
reuse_weights – Whether to reuse the weights (weights must already be saved). May speed up the regridder. Default = False.
clean – Whether to delete the weight file. Default = True.

Returns

The regridded dataset.
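To give an intuition for the default "nearest_s2d" (nearest source to destination) method, the sketch below assigns to each point of the reference grid the value of the closest source point. It is a plain-Python simplification and not how xESMF works internally: it assumes a rectilinear grid, treats latitude and longitude independently, and ignores the curvature of the sphere. The function names nearest_index and regrid_nearest and the sample values are hypothetical.

```python
from typing import List, Sequence


def nearest_index(source: Sequence[float], target: float) -> int:
    """Index of the source coordinate closest to the target coordinate."""
    return min(range(len(source)), key=lambda i: abs(source[i] - target))


def regrid_nearest(
    values: List[List[float]],
    src_lats: Sequence[float],
    src_lons: Sequence[float],
    ref_lats: Sequence[float],
    ref_lons: Sequence[float],
) -> List[List[float]]:
    """Resample values[lat][lon] onto the reference grid by nearest neighbour."""
    rows = [nearest_index(src_lats, lat) for lat in ref_lats]
    cols = [nearest_index(src_lons, lon) for lon in ref_lons]
    return [[values[r][c] for c in cols] for r in rows]


# A made-up 2x2 source field regridded onto a made-up 3x3 reference grid.
source_values = [[1.0, 2.0], [3.0, 4.0]]
regridded = regrid_nearest(
    source_values,
    src_lats=[0.0, 1.0],
    src_lons=[10.0, 11.0],
    ref_lats=[0.0, 0.4, 1.0],
    ref_lons=[10.0, 10.6, 11.0],
)
```

Only the coordinates of the reference grid are used here, never its data values, which matches the reason load_reference_grid keeps nothing but the latitudes and longitudes.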
