Preprocessor

Preprocessors convert the datasets downloaded by the Exporters into a unified data format, which makes developing and testing models much more straightforward.

As with the Exporters, a unique Preprocessor must be written for each dataset.

All preprocessors extend the following base Preprocessor, which has some useful helper methods:


class src.preprocess.base.BasePreProcessor(data_folder: pathlib.Path = PosixPath('data'), output_name: Optional[str] = None)

Base for all pre-processor classes. The preprocessing classes are responsible for taking the raw data exports and normalizing them so that they can be ingested by the feature engineering class.

This involves:

  • subsetting the ROI (default is Kenya)

  • regridding to a consistent spatial grid (pixel size / resolution)

  • resampling to a consistent time step (hourly, daily, monthly)

  • assigning coordinates to .nc files (latitude, longitude, time)

All pre-processors should do this in a process function.

Parameters

data_folder – The location of the data folder. Default = Path("data").

static chop_roi(ds: xarray.core.dataset.Dataset, subset_str: Optional[str] = 'kenya', inverse_lat: bool = False) → xarray.core.dataset.Dataset

Select a geographical subset of the data, based on a subset string.

Parameters
  • ds – The dataset to be subsetted.

  • subset_str – Defines a geographical subset of the downloaded data to be used. Should be one of the regions defined in src.utils.region_lookup. Default = "kenya".

  • inverse_lat – Whether to invert the minimum and maximum latitudes. Default = False.
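
As an illustration of why a flag like inverse_lat can be needed, the sketch below mimics label-based slicing over a latitude axis in plain Python. It assumes, and this is not stated above, that the flag targets datasets whose latitudes are stored from north to south, where a (min, max) slice selects nothing. The Region class, its bounding-box values, and the helper names are hypothetical, and the actual chop_roi works on xarray datasets rather than lists.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    """Hypothetical bounding box, standing in for an entry of src.utils.region_lookup."""
    latmin: float
    latmax: float


def label_slice(axis: List[float], start: float, stop: float) -> List[float]:
    """Select values between two labels, walking the axis in its stored order.

    Like a label-based slice on a sorted index, the bounds must follow the
    direction of the axis, otherwise nothing is selected.
    """
    if not axis:
        return []
    ascending = axis[0] <= axis[-1]
    if ascending:
        return [v for v in axis if start <= v <= stop]
    return [v for v in axis if stop <= v <= start]


def chop_lat(axis: List[float], region: Region, inverse_lat: bool = False) -> List[float]:
    """Subset a latitude axis, swapping the bounds when inverse_lat is set."""
    if inverse_lat:
        return label_slice(axis, region.latmax, region.latmin)
    return label_slice(axis, region.latmin, region.latmax)


# Made-up box and a latitude axis stored north to south (descending).
box = Region(latmin=-5.0, latmax=5.0)
descending_lats = [6.0, 4.0, 2.0, 0.0, -2.0, -4.0, -6.0]

empty = chop_lat(descending_lats, box)                     # bounds run against the axis
subset = chop_lat(descending_lats, box, inverse_lat=True)  # bounds follow the axis
print(empty, subset)
```

On an axis stored south to north, the default inverse_lat=False would be the setting that returns the subset.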

static load_reference_grid(path_to_grid: pathlib.Path) → xarray.core.dataset.Dataset

The regridder only needs the latitude and longitude values of the reference grid, so there is no need to pass around an enormous grid for the regridding.

Parameters

path_to_grid – A path to the reference dataset.

Returns

The loaded reference dataset, but with only the latitudes and longitudes.

merge_files(subset_str: Optional[str] = 'kenya', resample_time: Optional[str] = 'M', upsampling: bool = False, filename: Optional[str] = None) → None

Merge multiple interim files into a single preprocessed file. The time resampling happens here, since all the data is necessary to do that.

Parameters
  • subset_str – The optional subset string used to get a geographical subset of the data. Only used to make a more descriptive filename.

  • resample_time – Defines the time length to which the data will be resampled. If None, no time-resampling happens. Default = "M" (monthly).

  • upsampling – If True, the time resampling is treated as upsampling, and nearest-neighbour selection is used instead of the mean. Default = False.

  • filename – Override the default created filename by passing a string filename here.
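
The difference between the two resampling modes can be sketched with the standard library alone. This is only an illustration of "mean when downsampling, nearest when upsampling"; it is not the call merge_files makes on the merged dataset, and the function names and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Dict, List, Tuple

Sample = Tuple[date, float]


def downsample_monthly_mean(samples: List[Sample]) -> Dict[Tuple[int, int], float]:
    """Average all samples that fall within the same calendar month."""
    buckets: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for day, value in samples:
        buckets[(day.year, day.month)].append(value)
    return {month: mean(values) for month, values in sorted(buckets.items())}


def upsample_nearest(samples: List[Sample], target_days: List[date]) -> Dict[date, float]:
    """Give each target day the value of the source sample closest in time."""
    return {
        day: min(samples, key=lambda s: abs((s[0] - day).days))[1]
        for day in target_days
    }


# Downsampling: several observations per month collapse to one mean value.
daily = [(date(2020, 1, 10), 1.0), (date(2020, 1, 20), 3.0), (date(2020, 2, 5), 10.0)]
monthly_means = downsample_monthly_mean(daily)

# Upsampling: there is nothing to average, so each new step copies its nearest neighbour.
monthly = [(date(2020, 1, 1), 5.0), (date(2020, 2, 1), 7.0)]
filled = upsample_nearest(monthly, [date(2020, 1, 10), date(2020, 1, 25)])
print(monthly_means, filled)
```

Averaging only makes sense when several source steps fall into one target step, which is why a different rule is needed once the target time step is finer than the source.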

regrid(ds: xarray.core.dataset.Dataset, reference_ds: xarray.core.dataset.Dataset, method: str = 'nearest_s2d', reuse_weights: bool = False, clean: bool = True) → xarray.core.dataset.Dataset

Use the xESMF package to regrid ds to the same grid as reference_ds.

Parameters
  • ds – The dataset to be regridded

  • reference_ds – The reference dataset, onto which ds will be regridded

  • method – The regridding method. One of {"bilinear", "conservative", "nearest_s2d", "nearest_d2s", "patch"}. Default = "nearest_s2d".

  • reuse_weights – Whether to reuse the weights (weights must already be saved). May speed up the regridder. Default = False.

  • clean – Whether to delete the weight file. Default = True.

Returns

The regridded dataset.
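
To make the idea behind the default "nearest_s2d" (source-to-destination nearest neighbour) method concrete, here is a minimal dependency-free sketch on a small rectilinear grid. It is not how xESMF computes its weights: it assumes plain coordinate differences instead of distances on the sphere, and the function names and grid values are hypothetical. It also shows that only the reference latitudes and longitudes are needed, never the reference data itself.

```python
from typing import List


def nearest_index(axis: List[float], target: float) -> int:
    """Index of the axis value closest to target."""
    return min(range(len(axis)), key=lambda i: abs(axis[i] - target))


def regrid_nearest(
    values: List[List[float]],
    src_lats: List[float],
    src_lons: List[float],
    ref_lats: List[float],
    ref_lons: List[float],
) -> List[List[float]]:
    """Fill every reference cell with the value of the nearest source cell.

    values is indexed as values[lat][lon]. On a rectilinear grid the nearest
    cell can be found independently along each axis.
    """
    lat_idx = [nearest_index(src_lats, lat) for lat in ref_lats]
    lon_idx = [nearest_index(src_lons, lon) for lon in ref_lons]
    return [[values[i][j] for j in lon_idx] for i in lat_idx]


# A 2 x 2 source grid regridded onto a 3 x 2 reference grid.
source = [[1.0, 2.0], [3.0, 4.0]]
regridded = regrid_nearest(
    source,
    src_lats=[0.0, 1.0],
    src_lons=[10.0, 11.0],
    ref_lats=[0.1, 0.4, 0.9],
    ref_lons=[10.2, 10.8],
)
print(regridded)
```

The output has the shape of the reference grid while every value comes unchanged from the source, which is the defining property of nearest-neighbour regridding.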