Datasets

The following datasets are supported (i.e. an exporter and a preprocessor have been implemented for them):

Climate Data Store

From the Climate Data Store website:

The C3S Climate Data Store (CDS) is a one-stop shop for information about the climate: past, present and future. It provides easy access to a wide range of climate datasets via a searchable catalogue. An online toolbox is available that allows users to build workflows and applications suited to their needs.

The climate data store consists of multiple datasets. The following are supported in this pipeline:

ERA5 / ERA5 Land monthly averaged data

From the ERA5 documentation:

ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 4 to 7 decades. Currently data is available from 1979. When complete, ERA5 will contain a detailed record from 1950 onwards. ERA5 replaces the ERA-Interim reanalysis.

This exporter can export a number of other datasets, notably ERA5 Land.

class src.exporters.cds.ERA5Exporter(data_folder: pathlib.Path = PosixPath('data'))

Exports ERA5 data from the Climate Data Store

Parameters

data_folder – The location of the data folder. Default: pathlib.Path("data")

export(variable: str, dataset: Optional[str] = None, granularity: str = 'hourly', show_api_request: bool = True, selection_request: Optional[Dict] = None, break_up: bool = False, n_parallel_requests: int = 1) → List[pathlib.Path]

Prepare the API request, send it to the cdsapi.client() object, and save the downloaded data.

Parameters
  • variable – The variable to be exported

  • dataset – The dataset from which to pull the variable. If None, this is inferred from the variable and its granularity. Default = None.

  • granularity – One of {"hourly", "monthly"}. The granularity of the data being pulled. Default: "hourly"

  • show_api_request – Whether to print the selection dictionary before making the API request. Default = True.

  • selection_request – Selection request arguments to be merged with the defaults. If a key is defined in both the selection_request and the defaults, the value in the selection_request takes precedence. Default = None.

  • break_up – The best way to download the data is by making many small calls to the CDS API. If true, the calls will be broken up into months. We have not found this necessary even when downloading 30 years of data. Default = False.

  • n_parallel_requests – How many parallel requests to the CDSAPI to make. Default = 1.

Returns

A list of paths to the downloaded data
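
The merge rule for selection_request and the month-wise splitting enabled by break_up can be sketched as follows. This is a minimal illustration under stated assumptions, not the exporter's actual code: merge_selection_request, split_by_month, and the keys and values in the example dictionaries are hypothetical names chosen for this sketch.

```python
from typing import Dict, List, Optional


def merge_selection_request(
    defaults: Dict, selection_request: Optional[Dict] = None
) -> Dict:
    """Merge user-supplied arguments into the defaults.

    Keys present in both take their value from selection_request.
    """
    merged = dict(defaults)  # copy so the defaults are not mutated
    if selection_request is not None:
        merged.update(selection_request)
    return merged


def split_by_month(request: Dict) -> List[Dict]:
    """Break one request into one smaller request per (year, month) pair."""
    return [
        {**request, "year": [year], "month": [month]}
        for year in request["year"]
        for month in request["month"]
    ]


# Hypothetical defaults and overrides, for illustration only
defaults = {"year": ["2017", "2018"], "month": ["01", "02"], "format": "netcdf"}
request = merge_selection_request(defaults, {"year": ["2018"]})
small_requests = split_by_month(request)
```

Here request keeps the default "month" and "format" entries but takes "year" from the override, and small_requests holds one request per month of 2018.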

class src.preprocess.era5.ERA5MonthlyMeanPreprocessor(data_folder: pathlib.Path = PosixPath('data'), output_name: Optional[str] = None)

A processor for data downloaded by src.exporters.cds.ERA5Exporter.

Parameters
  • data_folder – The location of the data folder. Default: pathlib.Path("data")

  • output_name – This processor can be used for multiple datasets. This allows the dataset to be selected.

preprocess(subset_str: Optional[str] = 'kenya', regrid: Optional[pathlib.Path] = None, resample_time: Optional[str] = 'M', upsampling: bool = False, parallel: bool = False, cleanup: bool = True) → None

Preprocess all of the exported ERA5 .nc files to produce one subset file.

Parameters
  • subset_str – Defines a geographical subset of the downloaded data to be used. Should be one of the regions defined in src.utils.region_lookup. Default = "kenya".

  • regrid – If a Path is passed, the output files will be regridded to have the same spatial grid as the dataset at that Path. If None, no regridding happens. Default = None.

  • resample_time – Defines the time length to which the data will be resampled. If None, no time-resampling happens. Default = "M" (monthly).

  • upsampling – If true, tells the class the time-sampling will be upsampling. In this case, nearest instead of mean is used for the resampling. Default = False.

  • parallel – If true, run the preprocessing in parallel. Default = False.

  • cleanup – If true, delete interim files created by the class. Default = True.
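
The difference between the two resampling modes (mean when downsampling, nearest when upsampling) can be illustrated with a small standard-library sketch. It is independent of whatever the preprocessor uses internally; downsample_to_monthly_mean, upsample_nearest, and the sample values are hypothetical and exist only for this example.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Dict, List, Tuple

Sample = Tuple[date, float]


def downsample_to_monthly_mean(series: List[Sample]) -> Dict[Tuple[int, int], float]:
    """Group samples by (year, month) and average each group."""
    groups = defaultdict(list)
    for day, value in series:
        groups[(day.year, day.month)].append(value)
    return {month: mean(values) for month, values in sorted(groups.items())}


def upsample_nearest(series: List[Sample], targets: List[date]) -> List[float]:
    """For each target date, take the value of the sample closest in time."""
    return [
        min(series, key=lambda sample: abs((sample[0] - target).days))[1]
        for target in targets
    ]


# Downsampling: several samples per month collapse to their mean
daily = [(date(2000, 1, 1), 1.0), (date(2000, 1, 2), 3.0), (date(2000, 2, 1), 10.0)]
monthly = downsample_to_monthly_mean(daily)

# Upsampling: there is nothing to average, so the nearest sample is copied
monthly_samples = [(date(2000, 1, 1), 2.0), (date(2000, 2, 1), 10.0)]
filled = upsample_nearest(monthly_samples, [date(2000, 1, 5), date(2000, 1, 28)])
```

When upsampling, each target time step has at most one source sample, so taking a mean would add nothing; copying the nearest value is the natural replacement.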

CHIRPS Rainfall Estimates

From the CHIRPS website:

Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) is a 35+ year quasi-global rainfall data set. Spanning 50°S-50°N (and all longitudes) and ranging from 1981 to near-present, CHIRPS incorporates our in-house climatology, CHPclim, 0.05° resolution satellite imagery, and in-situ station data to create gridded rainfall time series for trend analysis and seasonal drought monitoring.

class src.exporters.chirps.CHIRPSExporter(data_folder: pathlib.Path = PosixPath('data'))

Exports precipitation data from the Climate Hazards Group site.

Parameters

data_folder – The location of the data folder. Default: pathlib.Path("data")

export(years: Optional[List[int]] = None, region: str = 'global', period: str = 'monthly', n_parallel_processes: int = 1) → None

Export functionality for the CHIRPS precipitation product

Parameters
  • years – The years of data to download. If None, all data will be downloaded. Default = None.

  • region – One of {"africa", "global"}. The dataset region to download. If "global", a netCDF file is downloaded. If "africa", a tif file is downloaded. Default = "global".

  • period – One of {"monthly", "weekly", "pentad"...}. The period of the data being downloaded. Default = "monthly".

  • n_parallel_processes – The number of parallel processes to use when downloading the data. Default = 1 (none).
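
The handling of region and years described above can be sketched as follows. This is an illustrative assumption about how such arguments might be validated, not the exporter's implementation: file_type_for_region, select_years, and the list of available years are hypothetical, and the default region follows the signature shown above.

```python
from typing import List, Optional

# File type downloaded for each region, as described in the parameter list
REGION_FILE_TYPES = {"global": "netcdf", "africa": "tif"}


def file_type_for_region(region: str = "global") -> str:
    """Return the file type downloaded for a region, rejecting unknown regions."""
    if region not in REGION_FILE_TYPES:
        raise ValueError(
            f"region must be one of {sorted(REGION_FILE_TYPES)}, got {region!r}"
        )
    return REGION_FILE_TYPES[region]


def select_years(available: List[int], years: Optional[List[int]] = None) -> List[int]:
    """Return the years to download; None means every available year."""
    if years is None:
        return list(available)
    return [year for year in available if year in years]


# Hypothetical list of years offered by the data source
available_years = [1981, 1982, 1983]
all_years = select_years(available_years)
some_years = select_years(available_years, years=[1982, 1990])
```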

class src.preprocess.chirps.CHIRPSPreprocessor(data_folder: pathlib.Path = PosixPath('data'))

Preprocesses the CHIRPS data

Parameters

data_folder – The location of the data folder. Default: pathlib.Path("data")

preprocess(subset_str: Optional[str] = 'kenya', regrid: Optional[pathlib.Path] = None, resample_time: Optional[str] = 'M', upsampling: bool = False, parallel: bool = False, cleanup: bool = True) → None

Preprocess all of the CHIRPS .nc files to produce one subset file.

Parameters
  • subset_str – Defines a geographical subset of the downloaded data to be used. Should be one of the regions defined in src.utils.region_lookup. Default = "kenya".

  • regrid – A path to the reference dataset, onto which the CHIRPS data will be regridded. If None, no regridding happens. Default = None.

  • resample_time – Defines the time length to which the data will be resampled. If None, no time-resampling happens. Default = "M" (monthly).

  • upsampling – If true, tells the class the time-sampling will be upsampling. In this case, nearest instead of mean is used for the resampling. Default = False.

  • parallel – Whether to run the preprocessing in parallel. Default = False.

  • cleanup – Whether to delete interim files created during preprocessing. Default = True.

SRTM Digital Elevation Data

From the SRTM website:

The SRTM digital elevation data, produced by NASA originally, is a major breakthrough in digital mapping of the world, and provides a major advance in the accessibility of high quality elevation data for large portions of the tropics and other areas of the developing world.

class src.exporters.srtm.SRTMExporter(data_folder: pathlib.Path = PosixPath('data'))

Export SRTM elevation data. This exporter leverages the elevation package, http://elevation.bopen.eu/en/stable/, to download SRTM topography data. This exporter requires GDAL and the elevation package to work.

An additional quirk of this exporter is that the region is defined here, instead of in the preprocessor.

Parameters

data_folder – The location of the data folder. Default: pathlib.Path("data")

export(region_name: str = 'kenya', product: str = 'SRTM3', max_download_tiles: int = 15) → None

Export SRTM topography data.

Parameters
  • region_name – Defines a geographical subset of the downloaded data to be used. Should be one of the regions defined in src.utils.region_lookup. Default = "kenya".

  • product – One of {"SRTM1", "SRTM3"}, the product to download the data from. Default = "SRTM3".

  • max_download_tiles – By default, the elevation package doesn’t allow more than 9 tiles to be downloaded. Kenya spans 12 tiles, so this parameter raises the limit to allow Kenya to be downloaded. Default = 15.
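
To see why a region can exceed the 9-tile limit, the number of tiles touched by a bounding box can be estimated as below. This is a rough sketch under loud assumptions: the 5° × 5° tile grid and the approximate Kenya bounding box are assumptions made for illustration (the actual tiling depends on the product, and the real bounds live in src.utils.region_lookup), and count_tiles is a hypothetical helper, not part of the elevation package.

```python
import math


def count_tiles(
    lonmin: float,
    lonmax: float,
    latmin: float,
    latmax: float,
    tile_size: float = 5.0,  # assumed tile width and height in degrees
) -> int:
    """Count the grid tiles overlapped by a bounding box.

    A box edge lying exactly on a tile boundary counts the neighbouring
    tile too, so this is an upper estimate.
    """
    n_lon = math.floor(lonmax / tile_size) - math.floor(lonmin / tile_size) + 1
    n_lat = math.floor(latmax / tile_size) - math.floor(latmin / tile_size) + 1
    return n_lon * n_lat


# Approximate, assumed bounding box for Kenya (degrees)
kenya_tiles = count_tiles(lonmin=33.5, lonmax=42.3, latmin=-5.2, latmax=6.0)
exceeds_package_limit = kenya_tiles > 9
fits_raised_limit = kenya_tiles <= 15
```

Under these assumptions the box spans 3 tile columns and 4 tile rows, which is above the package limit of 9 but within the raised limit of 15.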

class src.preprocess.srtm.SRTMPreprocessor(data_folder: pathlib.Path = PosixPath('data'))

Preprocess SRTM data downloaded by the SRTMExporter. Note: the regridder functionality requires CDO to be installed.

Parameters

data_folder – The location of the data folder. Default: pathlib.Path("data")

preprocess(subset_str: str = 'kenya', regrid: Optional[pathlib.Path] = None, cleanup: bool = True) → None

Preprocess a downloaded topography .nc file to produce one subset file with no timestep.

Parameters
  • subset_str – Because the SRTM data can only be downloaded in tiles, the subsetting happens during the export step. This tells the preprocessor which file to preprocess. Default = "kenya".

  • regrid – If a Path is passed, the output files will be regridded to have the same spatial grid as the dataset at that Path. If None, no regridding happens. Default = None.

  • cleanup – If true, delete interim files created by the class. Default = True.

Global Land Evaporation Amsterdam Model

From the GLEAM website:

GLEAM (Global Land Evaporation Amsterdam Model) is a set of algorithms that separately estimate the different components of land evaporation (or ‘evapotranspiration’): transpiration, bare-soil evaporation, interception loss, open-water evaporation and sublimation.

class src.exporters.gleam.GLEAMExporter(data_folder: pathlib.Path = PosixPath('data'))

Download data from the Global Land Evaporation Amsterdam Model.

Parameters

data_folder – The location of the data folder. Default: pathlib.Path("data")

export(variables: Union[str, List[str]], granularity: str) → None

Run the exporter.

Parameters
  • variables – A variable or list of variables to download.

  • granularity – The granularity of data to be downloaded. Use get_granularities to get a list of acceptable granularities.

get_granularities() → List[str]

Get acceptable data granularities.

Returns

A list of granularities.
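
Because variables accepts either a single string or a list, and granularity must be one of the values returned by get_granularities, argument handling along these lines is a reasonable pattern. This is a sketch only: normalise_variables, check_granularity, the variable name "E", and the list of acceptable granularities are hypothetical placeholders, not values taken from the exporter.

```python
from typing import List, Union


def normalise_variables(variables: Union[str, List[str]]) -> List[str]:
    """Accept a single variable or a list and always return a list."""
    if isinstance(variables, str):
        return [variables]
    return list(variables)


def check_granularity(granularity: str, acceptable: List[str]) -> str:
    """Return the granularity unchanged, or raise if it is not acceptable."""
    if granularity not in acceptable:
        raise ValueError(
            f"granularity must be one of {acceptable}, got {granularity!r}"
        )
    return granularity


# Hypothetical stand-in for the list returned by get_granularities()
acceptable_granularities = ["daily", "monthly", "yearly"]

single = normalise_variables("E")
several = normalise_variables(["E", "Ep"])
granularity = check_granularity("monthly", acceptable_granularities)
```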

class src.preprocess.gleam.GLEAMPreprocessor(data_folder: pathlib.Path = PosixPath('data'))

Preprocess the GLEAM data.

Parameters

data_folder – The location of the data folder. Default: pathlib.Path("data")

preprocess(subset_str: Optional[str] = 'kenya', regrid: Optional[pathlib.Path] = None, resample_time: Optional[str] = 'M', upsampling: bool = False, cleanup: bool = True) → None

Preprocess all of the GLEAM .nc files to produce one subset file.

Parameters
  • subset_str – The optional subset string used to get a geographical subset of the data. Only used to make a more descriptive filename.

  • regrid – If a Path is passed, the output files will be regridded to have the same spatial grid as the dataset at that Path. If None, no regridding happens. Default = None.

  • resample_time – Defines the time length to which the data will be resampled. If None, no time-resampling happens. Default = "M" (monthly).

  • upsampling – If true, tells the class the time-sampling will be upsampling. In this case, nearest instead of mean is used for the resampling. Default = False.

  • cleanup – If true, delete interim files created by the class. Default = True.