Dataloader

The DataLoader class implements most of the experimental flexibility of the models. It reads the NetCDF files produced by the Engineer, and is very similar to a PyTorch DataLoader.

Most of the DataLoader options are exposed through the models; they are documented here to make explicit where the functionality lives.

class src.models.data.DataLoader(data_path: pathlib.Path = PosixPath('data'), batch_file_size: int = 1, mode: str = 'train', shuffle_data: bool = True, clear_nans: bool = True, normalize: bool = True, predict_delta: bool = False, experiment: str = 'one_month_forecast', mask: Optional[List[bool]] = None, pred_months: Optional[List[int]] = None, to_tensor: bool = False, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, monthly_aggs: bool = True, static: Optional[str] = 'features', device: str = 'cpu', spatial_mask: Optional[xarray.core.dataarray.DataArray] = None, normalize_y: bool = False)

Dataloader that lazily loads the training and test data.

Parameters
  • data_path – Location of the data folder. Default = pathlib.Path("data").

  • batch_file_size – The number of files to load at a time. Default = 1.

  • mode – One of {"test", "train"}. Whether to load testing or training data. This also affects how the data is returned: in train mode it is a concatenated array, while in test mode it is a dict with dates so that the NetCDF file can easily be reconstructed. Default = "train".

  • shuffle_data – Whether or not to shuffle data. Default = True.

  • clear_nans – Whether to remove NaN values from the data. Default = True.

  • experiment – The name of the experiment to run. Specifically, the name of the engineer used to generate the data. Default = "one_month_forecast" (train on only historical data and predict one month ahead)

  • normalize – Whether to normalize the data. This assumes a normalizing_dict.pkl was saved by the engineer. Default = True.

  • mask – If not None, this list will be used to mask the input files. Useful for creating a train and validation set. Default = None.

  • pred_months – The months the model should predict. If None, all months are predicted. Default = None.

  • to_tensor – Whether to turn the np.ndarrays into torch.Tensors. Default = False.

  • surrounding_pixels – How many surrounding pixels to add to the input data. For example, if surrounding_pixels is 1, then in addition to the pixel at the prediction point, the neighbouring (spatial) pixels up to a distance of one pixel away are included too. Default = None.

  • ignore_vars – A list of variables to ignore. If None, all variables in the data_path will be included. Default = None.

  • monthly_aggs – Whether to include the monthly aggregates (mean and std across all spatial values) for the input variables. These will be additional dimensions to the historical (and optionally current) arrays. Default = True.

  • static – Whether (and how) to include static data, given as an optional string. Default = "features".

  • predict_delta – Whether to predict the change in the target variable relative to the previous timestep instead of the raw target variable. Default = False.

  • normalize_y – Whether to normalize y. Default = False.
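
As a rough illustration of how the parameters above fit together, the sketch below collects keyword arguments for a training loader and a test loader. The argument names and the import path come from the signature above; the chosen values, the `build_loaders` helper, and switching off shuffling in test mode are assumptions made for this example only.

```python
from pathlib import Path

# Keyword arguments named as in the documented signature; values are illustrative.
train_kwargs = dict(
    data_path=Path("data"),
    batch_file_size=1,
    mode="train",
    shuffle_data=True,
    experiment="one_month_forecast",
    to_tensor=True,
)

# Same settings in test mode. Shuffling is turned off here as a choice made
# for this sketch, not as a documented requirement.
test_kwargs = {**train_kwargs, "mode": "test", "shuffle_data": False}


def build_loaders():
    """Hypothetical helper: needs the project package and its data folder."""
    # Imported lazily so this sketch can be read and loaded without the package.
    from src.models.data import DataLoader

    return DataLoader(**train_kwargs), DataLoader(**test_kwargs)
```

Calling `build_loaders()` only makes sense inside the project, with data already produced by the Engineer under `data_path`.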
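
The mask parameter is described above as useful for creating a train and validation set. One possible way to build such a pair of masks is sketched below. It assumes the mask has one entry per input file and that True means the file is kept; neither is confirmed here, and `train_val_masks` is a hypothetical helper rather than part of the library.

```python
import random
from typing import List, Tuple


def train_val_masks(
    n_files: int, val_ratio: float = 0.3, seed: int = 42
) -> Tuple[List[bool], List[bool]]:
    """Return complementary boolean masks over n_files input files."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    n_val = int(n_files * val_ratio)
    val_indices = set(rng.sample(range(n_files), n_val))
    val_mask = [i in val_indices for i in range(n_files)]
    # Assumption: True marks a file that the loader should keep.
    train_mask = [not is_val for is_val in val_mask]
    return train_mask, val_mask


train_mask, val_mask = train_val_masks(n_files=10, val_ratio=0.3)
```

Each mask would then be passed as `mask=` to its own DataLoader, so that the two loaders see disjoint sets of files.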
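
To make the surrounding_pixels description more concrete, the sketch below lists the pixel offsets that would be included for a given value. It assumes a square window that includes diagonal neighbours; the text above only says "up to a distance of one pixel away", so the exact neighbourhood shape is an assumption, and `neighbour_offsets` is a hypothetical helper.

```python
from typing import List, Tuple


def neighbour_offsets(surrounding_pixels: int) -> List[Tuple[int, int]]:
    """(row, col) offsets of a square window centred on the prediction pixel."""
    reach = range(-surrounding_pixels, surrounding_pixels + 1)
    # Assumption: diagonal neighbours count as being within the distance.
    return [(d_row, d_col) for d_row in reach for d_col in reach]


offsets = neighbour_offsets(1)
n_pixels = len(offsets)  # (2 * 1 + 1) ** 2 pixels, centre included
```

Under this assumption, a value of 1 gives a 3 x 3 window and a value of 2 gives a 5 x 5 window, so the number of input pixels grows quadratically.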
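
The predict_delta option changes the target from the raw value to its change relative to the previous timestep. A minimal sketch of that transformation and its inverse is shown below, using plain Python lists. The helper names and sample values are made up for illustration, and the loader's own implementation may differ.

```python
from typing import List


def to_delta(series: List[float]) -> List[float]:
    """Change relative to the previous timestep.

    The first timestep has no predecessor, so the result is one element shorter.
    """
    return [current - previous for previous, current in zip(series, series[1:])]


def from_delta(previous: float, delta: float) -> float:
    """Recover the raw target from the previous value and a predicted delta."""
    return previous + delta


monthly_values = [0.5, 0.25, 0.75, 0.5]  # made-up target values
deltas = to_delta(monthly_values)
recovered = from_delta(monthly_values[0], deltas[0])
```

A model trained on deltas needs the previous timestep's value at prediction time to turn its output back into the raw target, which is what `from_delta` illustrates.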