Datasets¶
The following datasets are supported (i.e. an exporter and a preprocessor have been implemented for them):
Climate Data Store¶
From the Climate Data Store website:
The C3S Climate Data Store (CDS) is a one-stop shop for information about the climate: past, present and future. It provides easy access to a wide range of climate datasets via a searchable catalogue. An online toolbox is available that allows users to build workflows and applications suited to their needs.
The Climate Data Store hosts multiple datasets. The following are supported in this pipeline:
ERA5 / ERA5 Land monthly averaged data¶
From the ERA5 documentation:
ERA5 is the fifth generation ECMWF reanalysis for the global climate and weather for the past 4 to 7 decades. Currently data is available from 1979. When complete, ERA5 will contain a detailed record from 1950 onwards. ERA5 replaces the ERA-Interim reanalysis.
The ERA5 exporter can also export a number of other Climate Data Store datasets, notably ERA5 Land.
class src.exporters.cds.ERA5Exporter(data_folder: pathlib.Path = PosixPath('data'))¶
Exports ERA5 data from the Climate Data Store.
- Parameters
data_folder – The location of the data folder. Default: pathlib.Path("data").
export(variable: str, dataset: Optional[str] = None, granularity: str = 'hourly', show_api_request: bool = True, selection_request: Optional[Dict] = None, break_up: bool = False, n_parallel_requests: int = 1) → List[pathlib.Path]¶
Prepare the API request, send it to the cdsapi.Client object, and save the downloaded data.
- Parameters
variable – The variable to be exported.
dataset – The dataset from which to pull the variable. If None, this is inferred from the variable and its granularity. Default = None.
granularity – One of {"hourly", "monthly"}. The granularity of the data being pulled. Default = "hourly".
show_api_request – Whether to print the selection dictionary before making the API request. Default = True.
selection_request – Selection request arguments to be merged with the defaults. If a key is defined in both the selection_request and the defaults, the value in the selection_request takes precedence. Default = None.
break_up – The best way to download the data is by making many small calls to the CDS API. If true, the calls will be broken up into months. We have not found this necessary even when downloading 30 years of data. Default = False.
n_parallel_requests – How many parallel requests to make to the CDS API. Default = 1.
- Returns
A list of paths to the downloaded data.
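The selection_request and break_up behaviour described above can be sketched in plain Python. This is a minimal illustration, not the exporter's actual code: the default selection, the "area" value, and the helper names merge_selection and break_up_by_month are all hypothetical.

```python
from copy import deepcopy
from typing import Dict, List, Optional

# Hypothetical defaults, for illustration only; the exporter's real defaults may differ.
DEFAULT_SELECTION: Dict[str, List] = {
    "year": [2018, 2019],
    "month": list(range(1, 13)),
}


def merge_selection(defaults: Dict, selection_request: Optional[Dict] = None) -> Dict:
    """Return a copy of the defaults updated with the user's selection_request."""
    merged = deepcopy(defaults)
    if selection_request is not None:
        # Keys present in both dictionaries take the value from selection_request.
        merged.update(selection_request)
    return merged


def break_up_by_month(selection: Dict) -> List[Dict]:
    """Split one selection into one request per (year, month) pair."""
    requests = []
    for year in selection["year"]:
        for month in selection["month"]:
            request = deepcopy(selection)
            request["year"], request["month"] = [year], [month]
            requests.append(request)
    return requests


# "area" is a hypothetical extra key, included to show that new keys are kept.
merged = merge_selection(DEFAULT_SELECTION, {"year": [2000], "area": "hypothetical-box"})
monthly_requests = break_up_by_month(merged)
print(len(monthly_requests))  # one year times twelve months
```

Copying the defaults before updating keeps them unchanged between calls, which matters if the same defaults are reused for several variables.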
class src.preprocess.era5.ERA5MonthlyMeanPreprocessor(data_folder: pathlib.Path = PosixPath('data'), output_name: Optional[str] = None)¶
A processor for data downloaded by src.exporters.cds.ERA5Exporter.
- Parameters
data_folder – The location of the data folder. Default: pathlib.Path("data").
output_name – This processor can be used for multiple datasets. This allows the dataset to be selected.
preprocess(subset_str: Optional[str] = 'kenya', regrid: Optional[pathlib.Path] = None, resample_time: Optional[str] = 'M', upsampling: bool = False, parallel: bool = False, cleanup: bool = True) → None¶
Preprocess all of the exported ERA5 .nc files to produce one subset file.
- Parameters
subset_str – Defines a geographical subset of the downloaded data to be used. Should be one of the regions defined in src.utils.region_lookup. Default = "kenya".
regrid – If a Path is passed, the output files will be regridded to have the same spatial grid as the dataset at that Path. If None, no regridding happens. Default = None.
resample_time – If not None, defines the time length to which the data will be resampled. Default = "M" (monthly).
upsampling – If true, tells the class that the time resampling will be upsampling. In this case, nearest instead of mean is used for the resampling. Default = False.
parallel – If true, run the preprocessing in parallel. Default = False.
cleanup – If true, delete interim files created by the class. Default = True.
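A typical workflow chains the exporter and the preprocessor. The sketch below uses only the keyword arguments documented above. The helper run_era5_monthly, the stub class, the variable name "precip", and the output filename are hypothetical, so that the example runs without the pipeline or a CDS account.

```python
from pathlib import Path
from typing import Any, List


def run_era5_monthly(exporter: Any, preprocessor: Any, variable: str) -> List[Path]:
    """Export one monthly variable, then preprocess everything that was exported."""
    # Keyword names follow the export() signature documented above.
    paths = exporter.export(variable=variable, granularity="monthly")
    # Keyword names follow the preprocess() signature documented above.
    preprocessor.preprocess(subset_str="kenya", resample_time="M", upsampling=False)
    return paths


class RecordingStub:
    """Stand-in for the real classes; it only records the calls it receives."""

    def __init__(self) -> None:
        self.calls = []

    def export(self, **kwargs: Any) -> List[Path]:
        self.calls.append(("export", kwargs))
        return [Path("data") / "hypothetical_output.nc"]

    def preprocess(self, **kwargs: Any) -> None:
        self.calls.append(("preprocess", kwargs))


exporter_stub, preprocessor_stub = RecordingStub(), RecordingStub()
paths = run_era5_monthly(exporter_stub, preprocessor_stub, variable="precip")
```

With the pipeline installed, the stubs would presumably be replaced by ERA5Exporter(Path("data")) and ERA5MonthlyMeanPreprocessor(Path("data")).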
CHIRPS Rainfall Estimates¶
From the CHIRPS website:
Climate Hazards Group InfraRed Precipitation with Station data (CHIRPS) is a 35+ year quasi-global rainfall data set. Spanning 50°S-50°N (and all longitudes) and ranging from 1981 to near-present, CHIRPS incorporates our in-house climatology, CHPclim, 0.05° resolution satellite imagery, and in-situ station data to create gridded rainfall time series for trend analysis and seasonal drought monitoring.
class src.exporters.chirps.CHIRPSExporter(data_folder: pathlib.Path = PosixPath('data'))¶
Exports precipitation data from the Climate Hazards Group site.
- Parameters
data_folder – The location of the data folder. Default: pathlib.Path("data").
export(years: Optional[List[int]] = None, region: str = 'global', period: str = 'monthly', n_parallel_processes: int = 1) → None¶
Export functionality for the CHIRPS precipitation product.
- Parameters
years – The years of data to download. If None, all data will be downloaded. Default = None.
region – One of {"africa", "global"}. The dataset region to download. If global, a netcdf file is downloaded. If africa, a tif file is downloaded. Default = "global".
period – One of {"monthly", "weekly", "pentad"...}. The period of the data being downloaded. Default = "monthly".
n_parallel_processes – The number of parallel processes to use when downloading the data. Default = 1 (no parallelism).
class src.preprocess.chirps.CHIRPSPreprocessor(data_folder: pathlib.Path = PosixPath('data'))¶
Preprocesses the CHIRPS data.
- Parameters
data_folder – The location of the data folder. Default: pathlib.Path("data").
preprocess(subset_str: Optional[str] = 'kenya', regrid: Optional[pathlib.Path] = None, resample_time: Optional[str] = 'M', upsampling: bool = False, parallel: bool = False, cleanup: bool = True) → None¶
Preprocess all of the CHIRPS .nc files to produce one subset file.
- Parameters
subset_str – Defines a geographical subset of the downloaded data to be used. Should be one of the regions defined in src.utils.region_lookup. Default = "kenya".
regrid – A path to the reference dataset, onto which the CHIRPS data will be regridded. If None, no regridding happens. Default = None.
resample_time – Defines the time length to which the data will be resampled. If None, no time-resampling happens. Default = "M" (monthly).
upsampling – If true, tells the class that the time resampling will be upsampling. In this case, nearest instead of mean is used for the resampling. Default = False.
parallel – Whether to run the preprocessing in parallel. Default = False.
cleanup – Whether to delete interim files created during preprocessing. Default = True.
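The difference between the two resampling modes (mean when moving to a coarser time step, nearest when upsampling) can be illustrated with the standard library alone. This is a conceptual sketch under that reading of the parameters, not the preprocessor's implementation. The function names and sample values are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean
from typing import Dict, List, Tuple

Series = List[Tuple[date, float]]


def downsample_monthly_mean(series: Series) -> Dict[Tuple[int, int], float]:
    """Average all values that fall in the same (year, month)."""
    buckets = defaultdict(list)
    for day, value in series:
        buckets[(day.year, day.month)].append(value)
    return {key: mean(values) for key, values in buckets.items()}


def upsample_nearest(series: Series, targets: List[date]) -> Series:
    """Give each target date the value of the closest source date."""
    return [
        (target, min(series, key=lambda item: abs((item[0] - target).days))[1])
        for target in targets
    ]


# Downsampling: several values per month collapse into one mean per month.
sub_monthly = [(date(2000, 1, 1), 2.0), (date(2000, 1, 6), 4.0), (date(2000, 2, 1), 10.0)]
monthly = downsample_monthly_mean(sub_monthly)

# Upsampling: there is nothing to average, so each new date copies its nearest neighbour.
coarse = [(date(2000, 1, 1), 1.0), (date(2000, 2, 1), 5.0)]
fine = upsample_nearest(coarse, [date(2000, 1, 10), date(2000, 1, 25)])
```

Averaging is only meaningful when several source values fall into each target period, which is why a different rule is needed when upsampling.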
SRTM Digital Elevation Data¶
From the SRTM website:
The SRTM digital elevation data, produced by NASA originally, is a major breakthrough in digital mapping of the world, and provides a major advance in the accessibility of high quality elevation data for large portions of the tropics and other areas of the developing world.
class src.exporters.srtm.SRTMExporter(data_folder: pathlib.Path = PosixPath('data'))¶
Export SRTM elevation data. This exporter leverages the elevation package, http://elevation.bopen.eu/en/stable/, to download SRTM topography data. This exporter requires GDAL and the elevation package to work.
An additional quirk of this exporter is that the region is defined here, instead of in the preprocessor.
- Parameters
data_folder – The location of the data folder. Default: pathlib.Path("data").
export(region_name: str = 'kenya', product: str = 'SRTM3', max_download_tiles: int = 15) → None¶
Export SRTM topography data.
- Parameters
region_name – Defines a geographical subset of the downloaded data to be used. Should be one of the regions defined in src.utils.region_lookup. Default = "kenya".
product – One of {"SRTM1", "SRTM3"}. The product to download the data from. Default = "SRTM3".
max_download_tiles – By default, the elevation package doesn’t allow more than 9 tiles to be downloaded. Kenya spans 12 tiles, so this raises the limit to allow Kenya to be downloaded. Default = 15.
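To see why a region can exceed the default tile limit, the number of tiles touched by a bounding box can be estimated. The sketch below assumes that tiles lie on a regular 5 degree grid, which is an assumption about the SRTM3 product and not something stated above. The bounding box is an illustrative, roughly Kenya-sized box, not the values stored in src.utils.region_lookup.

```python
import math
from typing import Tuple

TILE_SIZE_DEGREES = 5   # Assumption: tiles on a regular 5 degree grid.
DEFAULT_TILE_LIMIT = 9  # Default limit of the elevation package, as stated above.

# (lon_min, lat_min, lon_max, lat_max)
Bounds = Tuple[float, float, float, float]


def count_tiles(bounds: Bounds, tile_size: float = TILE_SIZE_DEGREES) -> int:
    """Count the grid tiles that a bounding box overlaps."""
    lon_min, lat_min, lon_max, lat_max = bounds
    n_lon = math.ceil(lon_max / tile_size) - math.floor(lon_min / tile_size)
    n_lat = math.ceil(lat_max / tile_size) - math.floor(lat_min / tile_size)
    return n_lon * n_lat


# Hypothetical bounding box, for illustration only.
hypothetical_bounds: Bounds = (33.5, -5.2, 42.3, 6.0)
n_tiles = count_tiles(hypothetical_bounds)
needs_higher_limit = n_tiles > DEFAULT_TILE_LIMIT
```

A box that only slightly crosses a grid line still pulls in a whole extra row or column of tiles, so the count can jump sharply for small changes in the bounds.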
class src.preprocess.srtm.SRTMPreprocessor(data_folder: pathlib.Path = PosixPath('data'))¶
Preprocess SRTM data downloaded by the SRTMExporter. Note: the regridding functionality requires CDO to be installed.
- Parameters
data_folder – The location of the data folder. Default: pathlib.Path("data").
preprocess(subset_str: str = 'kenya', regrid: Optional[pathlib.Path] = None, cleanup: bool = True) → None¶
Preprocess a downloaded topography .nc file to produce one subset file with no timestep.
- Parameters
subset_str – Because the SRTM data can only be downloaded in tiles, the subsetting happens during the export step. This tells the preprocessor which file to preprocess. Default = "kenya".
regrid – If a Path is passed, the output files will be regridded to have the same spatial grid as the dataset at that Path. If None, no regridding happens. Default = None.
cleanup – If true, delete interim files created by the class. Default = True.
Global Land Evaporation Amsterdam Model¶
From the GLEAM website:
GLEAM (Global Land Evaporation Amsterdam Model) is a set of algorithms that separately estimate the different components of land evaporation (or ‘evapotranspiration’): transpiration, bare-soil evaporation, interception loss, open-water evaporation and sublimation.
class src.exporters.gleam.GLEAMExporter(data_folder: pathlib.Path = PosixPath('data'))¶
Download data from the Global Land Evaporation Amsterdam Model.
- Parameters
data_folder – The location of the data folder. Default: pathlib.Path("data").
export(variables: Union[str, List[str]], granularity: str) → None¶
Run the exporter.
- Parameters
variables – A variable or list of variables to download.
granularity – The granularity of data to be downloaded. Use get_granularities to get a list of acceptable granularities.
get_granularities() → List[str]¶
Get acceptable data granularities.
- Returns
A list of granularities.
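Since export takes a granularity that must come from get_granularities, it can be useful to check the value before starting a download. The following is a minimal sketch of that pattern. The helper export_checked, the stub class, the granularity names, and the variable name "E" are hypothetical.

```python
from typing import Any, List, Union


def export_checked(exporter: Any, variables: Union[str, List[str]], granularity: str) -> None:
    """Validate the granularity against the exporter before exporting."""
    allowed = exporter.get_granularities()
    if granularity not in allowed:
        raise ValueError(f"granularity must be one of {allowed}, got {granularity!r}")
    exporter.export(variables=variables, granularity=granularity)


class StubGLEAMExporter:
    """Stand-in with the two documented methods, so the sketch runs offline."""

    def __init__(self) -> None:
        self.exported = []

    def get_granularities(self) -> List[str]:
        return ["daily", "monthly"]  # hypothetical values

    def export(self, variables: Union[str, List[str]], granularity: str) -> None:
        self.exported.append((variables, granularity))


stub = StubGLEAMExporter()
export_checked(stub, "E", "monthly")  # accepted

try:
    export_checked(stub, "E", "hourly")  # not in the stub's list
    rejected = False
except ValueError:
    rejected = True
```

Failing early with the list of accepted values gives a clearer message than an error raised partway through a download.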
class src.preprocess.gleam.GLEAMPreprocessor(data_folder: pathlib.Path = PosixPath('data'))¶
Preprocess the GLEAM data.
- Parameters
data_folder – The location of the data folder. Default: pathlib.Path("data").
preprocess(subset_str: Optional[str] = 'kenya', regrid: Optional[pathlib.Path] = None, resample_time: Optional[str] = 'M', upsampling: bool = False, cleanup: bool = True) → None¶
Preprocess all of the GLEAM .nc files to produce one subset file.
- Parameters
subset_str – The optional subset string used to get a geographical subset of the data. Only used to make a more descriptive filename.
regrid – If a Path is passed, the output files will be regridded to have the same spatial grid as the dataset at that Path. If None, no regridding happens. Default = None.
resample_time – Defines the time length to which the data will be resampled. If None, no time-resampling happens. Default = "M" (monthly).
upsampling – If true, tells the class that the time resampling will be upsampling. In this case, nearest instead of mean is used for the resampling. Default = False.
cleanup – If true, delete interim files created by the class. Default = True.