Models

All models extend the following base model:

class src.models.base.ModelBase(data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'embedding', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = False, normalize_y: bool = False)

Base for all machine learning models.

Parameters
  • data_folder – The location of the data folder. Default = pathlib.Path("data")

  • batch_size – The number of files to load at once. These will be chunked and shuffled, so a higher value will lead to better shuffling (but will require more memory). Default = 1.

  • experiment – The name of the experiment to run. Specifically, the name of the engineer used to generate the data. Default = "one_month_forecast" (train on only historical data and predict one month ahead)

  • pred_months – The months the model should predict. If None, all months are predicted. Default = None.

  • include_pred_month – Whether to include the prediction month in the model’s training data. Default = True.

  • include_latlons – Whether to include prediction pixel latitudes and longitudes in the model’s training data. Default = False.

  • include_monthly_aggs – Whether to include monthly aggregations. Default = True.

  • include_yearly_aggs – Whether to include yearly aggregations. Default = True.

  • surrounding_pixels – How many surrounding pixels to add to the input data. e.g. if the input is 1, then in addition to the pixels on the prediction point, the neighbouring (spatial) pixels will be included too, up to a distance of one pixel away. Default = None.

  • ignore_vars – A list of variables to ignore. If None, all variables in the data_folder will be included. Default = None.

  • static – Whether, and in which form, to include static data, given as a string (or None). Default = "embedding".

  • predict_delta – Whether to model the change in target variable rather than the raw values. Default = False.

  • spatial_mask – If an xr.DataArray is passed, it will be used to mask the training / test data. Default = None.

  • include_prev_y – Whether to include the y value from one year ago, the same month. This is useful if you are predicting a seasonal value. Default = False.

  • normalize_y – Whether to normalize the y value being predicted. Default = False. The predictions saved in evaluate will be denormalized.
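As an illustration of surrounding_pixels, the sketch below gathers the neighbourhood of one pixel from a 2D grid in plain Python. The helper name neighbourhood and the choice to skip positions outside the grid are assumptions made for this example; the library may handle grid edges differently.

```python
from typing import List, Optional


def neighbourhood(
    grid: List[List[float]], row: int, col: int, surrounding_pixels: Optional[int]
) -> List[float]:
    """Collect the value at (row, col) plus its spatial neighbours.

    Hypothetical helper: positions outside the grid are skipped, which is
    an assumption and not necessarily what the library does.
    """
    distance = surrounding_pixels or 0  # None means "no neighbours"
    values = []
    for r in range(row - distance, row + distance + 1):
        for c in range(col - distance, col + distance + 1):
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]):
                values.append(grid[r][c])
    return values


grid = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, 9.0],
]

centre_only = neighbourhood(grid, 1, 1, None)   # just the prediction pixel
with_neighbours = neighbourhood(grid, 1, 1, 1)  # 3 x 3 window around it
```

With surrounding_pixels = 1 an interior pixel contributes nine values per variable instead of one, so the input size grows quickly as the distance increases.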

evaluate(save_results: bool = True, save_preds: bool = False) → None

Evaluate the trained model on the test data

Parameters
  • save_results – Whether to save the results of the evaluation. If true, they are saved in self.model_dir / results.json. Default = True.

  • save_preds – Whether to save the model predictions. If true, they are saved in self.model_dir / {year}_{month}.nc. Default = False.

explain(x: Any) → numpy.ndarray

Explain the predictions of the trained model on the input data x

Parameters

x – Any input array / tensor

Returns

A shap value for each of the input values. The sum of the shap values is equal to the prediction of the model for x
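To make the additivity property concrete, here is a minimal sketch for a linear model, where (assuming independent features) the SHAP value of feature i is w_i * (x_i - mean_i). In the usual SHAP formulation, the values sum to the prediction minus a baseline, which is the average prediction over the background data. The weights, background data and function names are invented for this example, and the sketch does not use the shap package.

```python
from typing import List


def predict(weights: List[float], bias: float, x: List[float]) -> float:
    """Prediction of a simple linear model."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def linear_shap_values(
    weights: List[float], background: List[List[float]], x: List[float]
) -> List[float]:
    """SHAP values of a linear model, assuming independent features."""
    n = len(background)
    means = [sum(row[i] for row in background) / n for i in range(len(weights))]
    return [w * (xi - m) for w, xi, m in zip(weights, x, means)]


# Made-up model and data, for illustration only.
weights = [2.0, -1.0, 0.5]
bias = 3.0
background = [[0.0, 1.0, 2.0], [2.0, 3.0, 4.0]]
x = [1.5, 0.0, 6.0]

shap_values = linear_shap_values(weights, background, x)
baseline = sum(predict(weights, bias, row) for row in background) / len(background)
prediction = predict(weights, bias, x)

# Additivity: the baseline plus the SHAP values recovers the prediction.
reconstructed = baseline + sum(shap_values)
```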

get_dataloader(mode: str, to_tensor: bool = False, shuffle_data: bool = False, **kwargs) → src.models.data.DataLoader
Returns

The correct dataloader for this model

Loading models

Models have a save_model function, which saves the model to a pickle object. These can then be loaded using the load_model function:

src.models.load_model(model_path: pathlib.Path, data_path: Optional[pathlib.Path] = None, model_type: Optional[str] = None, device: Optional[str] = 'cpu') → Union[src.models.neural_networks.rnn.RecurrentNetwork, src.models.neural_networks.linear_network.LinearNetwork, src.models.regression.LinearRegression, src.models.neural_networks.ealstm.EARecurrentNetwork, src.models.gbdt.GBDT]

This function loads models from the output .pkl files generated when calling model.save_model()

Parameters
  • model_path – The path to the model

  • data_path – The path to the data folder. If None, the function infers this from the model_path (assuming it was saved as part of the pipeline). Default = None.

  • model_type – The type of model to load. If None, the function infers this from the model_path (assuming it was saved as part of the pipeline). Default = None.

  • device – The device to load the model onto

Returns

A model object loaded from the model_path
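The save and load round trip relies on Python's pickle module. The sketch below shows the mechanism with a plain dictionary standing in for a trained model; the file name model.pkl and the dictionary contents are placeholders, not the library's actual on-disk format.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder for a trained model's state (hypothetical contents).
model_state = {"model_type": "linear_regression", "coefficients": [0.1, 0.2]}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "model.pkl"

    # Save: serialise the object to disk.
    with model_path.open("wb") as f:
        pickle.dump(model_state, f)

    # Load: read the object back into memory.
    with model_path.open("rb") as f:
        loaded_state = pickle.load(f)
```

Because unpickling can execute arbitrary code, only load .pkl files from sources you trust.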

The following models have been implemented:

Persistence

class src.models.parsimonious.Persistence(data_folder: pathlib.Path = PosixPath('data'))

A parsimonious persistence model. This “model” predicts the previous time-value of data. For example, its prediction for VHI in March 2018 will be VHI for February 2018 (assuming monthly time-granularity).

Parameters

data_folder – Location of the data folder. Default = pathlib.Path("data").

train() → None

This “model” does not need to be trained!
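The persistence rule can be sketched in a few lines of plain Python. The function name and the VHI values are made up for illustration; the first time step has no previous value, so this sketch marks it as None.

```python
from typing import List, Optional


def persistence_forecast(series: List[float]) -> List[Optional[float]]:
    """Predict each time step as the value observed one step earlier."""
    if not series:
        return []
    # The first step has no predecessor, so there is no prediction for it.
    return [None] + series[:-1]


vhi = [41.0, 44.5, 39.0, 47.2]  # made-up monthly values
predictions = persistence_forecast(vhi)
```

Such a forecast is a common baseline: a trained model is only useful if it beats it.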

Linear Regression

class src.models.regression.LinearRegression(data_folder: pathlib.Path = PosixPath('data'), experiment: str = 'one_month_forecast', batch_size: int = 1, pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)

A linear regression model, implemented using scikit-learn.

Parameters
  • data_folder – Location of the data folder. Default = pathlib.Path("data").

  • experiment – The name of the experiment to run. Specifically, the name of the engineer used to generate the data. Default = "one_month_forecast" (train on only historical data and predict one month ahead)

  • batch_size – The number of files to load at once. These will be chunked and shuffled, so a higher value will lead to better shuffling (but will require more memory). Default = 1.

  • pred_months – The months the model should predict. If None, all months are predicted. Default = None.

  • include_pred_month – Whether to include the prediction month in the model’s training data. Default = True.

  • include_latlons – Whether to include prediction pixel latitudes and longitudes in the model’s training data. Default = False.

  • include_monthly_aggs – Whether to include monthly aggregations. Default = True.

  • include_yearly_aggs – Whether to include yearly aggregations. Default = True.

  • surrounding_pixels – How many surrounding pixels to add to the input data. e.g. if the input is 1, then in addition to the pixels on the prediction point, the neighbouring (spatial) pixels will be included too, up to a distance of one pixel away. Default = None.

  • ignore_vars – A list of variables to ignore. If None, all variables in the data_folder will be included. Default = None.

  • static – Whether, and in which form, to include static data, given as a string (or None). Default = "features".

  • predict_delta – Whether to model the change in target variable rather than the raw values. Default = False.

  • spatial_mask – If an xr.DataArray is passed, it will be used to mask the training / test data. Default = None.

  • include_prev_y – Whether to include the y value from one year ago, the same month. This is useful if you are predicting a seasonal value. Default = True.

  • normalize_y – Whether to normalize the y value being predicted. Default = True. The predictions saved in evaluate will be denormalized.
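The effect of normalize_y can be sketched as follows. The example assumes a z-score normalization using the mean and standard deviation of the training targets; the exact statistics the library uses are not stated here, and all function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_normalizer(y_train: List[float]) -> Tuple[float, float]:
    """Compute the statistics used to normalize the target (assumed z-score)."""
    return mean(y_train), pstdev(y_train)


def normalize(y: List[float], y_mean: float, y_std: float) -> List[float]:
    return [(value - y_mean) / y_std for value in y]


def denormalize(y_normalized: List[float], y_mean: float, y_std: float) -> List[float]:
    return [value * y_std + y_mean for value in y_normalized]


y_train = [10.0, 20.0, 30.0]  # made-up target values
y_mean, y_std = fit_normalizer(y_train)
y_normalized = normalize(y_train, y_mean, y_std)

# A model trained on normalized targets predicts in normalized units,
# so its outputs are mapped back to the original scale before saving.
normalized_predictions = [0.0, 1.0]
predictions = denormalize(normalized_predictions, y_mean, y_std)
```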

explain(x: Optional[src.models.data.TrainData] = None, save_shap_values: bool = True) → numpy.ndarray

Explain the predictions of the trained model on the input data x

Parameters

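  • x – The input data to explain, as a TrainData object.

  • save_shap_values – Whether to save the computed shap values. Default = True.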

Returns

A shap value for each of the input values. The sum of the shap values is equal to the prediction of the model for x

save_model() → None

Saves a pickle object of the model, which can be loaded using src.models.load_model.

train(num_epochs: int = 1, early_stopping: Optional[int] = None, batch_size: int = 256, val_split: float = 0.1, initial_learning_rate: float = 1e-15) → None

Train the linear regression model.

Parameters
  • num_epochs – The number of epochs to train the model for. If early_stopping is not None, then this is the maximum number of epochs for which the model will be trained. Default = 1.

  • early_stopping – If not None, the number of epochs to wait without improvement before stopping model training and reverting to the best model. Default = None.

  • batch_size – The batch size to use when training the model. Default = 256.

  • val_split – The ratio of data to use in the validation set. Default = 0.1.

  • initial_learning_rate – The initial learning rate to use. Default = 1e-15.
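The interaction between num_epochs and early_stopping can be sketched as a patience counter. In this illustration a list of validation losses stands in for real training, and its length plays the role of num_epochs; the function name and return values are assumptions made for the example.

```python
from typing import List, Optional, Tuple


def run_with_early_stopping(
    val_losses: List[float], early_stopping: Optional[int]
) -> Tuple[int, float, int]:
    """Return (best_epoch, best_loss, epochs_run) for a sequence of losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    epochs_run = 0

    for epoch, loss in enumerate(val_losses):
        epochs_run = epoch + 1
        if loss < best_loss:
            # Improvement: remember this epoch (a real loop would also
            # keep a copy of the model weights to revert to).
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if (
                early_stopping is not None
                and epochs_without_improvement >= early_stopping
            ):
                break  # patience exhausted

    return best_epoch, best_loss, epochs_run


losses = [0.9, 0.7, 0.8, 0.75, 0.6]  # made-up validation losses
with_patience = run_with_early_stopping(losses, early_stopping=2)
without_patience = run_with_early_stopping(losses, early_stopping=None)
```

Here a patience of 2 stops after four epochs and keeps the model from epoch 1, even though training for all five epochs would have found a lower loss. A small patience saves time at the risk of stopping too soon.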

Neural Networks

A number of neural networks are implemented. All are trained using Smooth L1 Loss, with optional early stopping.
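For reference, the Smooth L1 loss for a single prediction can be written in plain Python as below. The sketch follows the definition used by PyTorch's SmoothL1Loss with its default beta of 1.0; whether the library changes that default is not stated here, and the function name is chosen for this example.

```python
def smooth_l1(prediction: float, target: float, beta: float = 1.0) -> float:
    """Smooth L1 loss: quadratic for small errors, linear for large ones."""
    diff = abs(prediction - target)
    if diff < beta:
        return 0.5 * diff ** 2 / beta
    return diff - 0.5 * beta


small_error = smooth_l1(0.5, 0.0)  # quadratic regime
large_error = smooth_l1(3.0, 0.0)  # linear regime
```

The linear regime makes the loss less sensitive to outliers than a squared error, while the quadratic regime keeps gradients small near the optimum.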

All neural network classes extend the following base class:

class src.models.neural_networks.base.NNBase(data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', device: str = 'cuda:0', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)

The base for all neural network models, written in PyTorch. It extends the ModelBase class, and implements the train, predict and explain functions, which are shared across all neural networks.

The constructor takes the same arguments as the ModelBase class, plus device, the device to run the model on (Default = "cuda:0").

explain(x: Optional[src.models.data.TrainData] = None, var_names: Optional[List[str]] = None, save_explanations: bool = True, background_size: int = 100, start_idx: int = 0, num_inputs: int = 10, method: str = 'shap') → src.models.data.TrainData

Explain the outputs of a trained model.

Parameters
  • x – The values to explain. If None, samples are randomly drawn from the test data

  • var_names – The variable names of the historical inputs. If x is None, this will be calculated. Only necessary if the arrays are going to be saved

  • save_explanations – Whether or not to save the shap values

  • background_size – The size of the background to use

  • start_idx – The index to use to calculate the shap values. Shap values will be calculated for x[start_idx: start_idx + num_inputs]

  • num_inputs – The number of datapoints to calculate shap values for

  • method – One of {"shap", "morris"}. The method to use to calculate the explanations.

Returns

A dictionary of shap values for each of the model’s input arrays

train(num_epochs: int = 1, early_stopping: Optional[int] = None, batch_size: int = 256, learning_rate: float = 0.001, val_split: float = 0.1) → None

Trains a neural network.

Parameters
  • num_epochs – The maximum number of epochs to train the model for.

  • early_stopping – If an int is passed, early stopping will be used with this value as the patience. If None, the model will train for num_epochs.

  • batch_size – The number of instances to put in each batch.

  • learning_rate – The learning rate to use.

  • val_split – The ratio of training data to use as validation, for early stopping.

LSTM

class src.models.neural_networks.rnn.RecurrentNetwork(hidden_size: int, dense_features: Optional[List[int]] = None, rnn_dropout: float = 0.25, data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', device: str = 'cuda:0', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)

A single long short-term memory (LSTM) layer, followed by linear layers.

See this blog post for more information about recurrent networks and LSTMs.

The LSTM receives the static data appended to every time step of the dynamic data. In addition to the arguments to ModelBase, the LSTM has the following arguments passed to the constructor:

Parameters
  • hidden_size – The number of features in the hidden state

  • dense_features – A list describing the linear layers after the LSTM layer. There will be a layer per element in the list, with output size equal to the value of the element. If None, a single linear layer is used, with an output size of 1 (the prediction).

  • rnn_dropout – Dropout to use between timesteps. Note that this is different from PyTorch’s default LSTM layer, which adds dropout between layers, not timesteps.
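How dense_features maps to layers can be sketched by listing the (input size, output size) pair of each linear layer. The helper name is hypothetical, and appending a final single-output layer when the list does not already end in 1 is an assumption made so that the sketch always produces one prediction; the library may handle this differently.

```python
from typing import List, Optional, Tuple


def dense_layer_shapes(
    hidden_size: int, dense_features: Optional[List[int]]
) -> List[Tuple[int, int]]:
    """Return (in_features, out_features) for each linear layer after the LSTM."""
    sizes = list(dense_features) if dense_features else [1]
    # Assumption: make sure the last layer outputs a single prediction.
    if sizes[-1] != 1:
        sizes.append(1)

    shapes = []
    in_features = hidden_size  # the first layer consumes the LSTM hidden state
    for out_features in sizes:
        shapes.append((in_features, out_features))
        in_features = out_features
    return shapes


default_head = dense_layer_shapes(64, None)
deeper_head = dense_layer_shapes(64, [32, 16])
```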

EA-LSTM

class src.models.neural_networks.ealstm.EARecurrentNetwork(hidden_size: int, dense_features: Optional[List[int]] = None, rnn_dropout: float = 0.25, data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_latlons: bool = False, include_pred_month: bool = True, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', static_embedding_size: Optional[int] = None, device: str = 'cuda:0', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)

An Entity-Aware LSTM (EA-LSTM), described in "Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets".

In an EA-LSTM, the dynamic features are conditioned on the static features.

In addition to the arguments to ModelBase, the EA-LSTM has the following arguments passed to the constructor:

Parameters
  • hidden_size – The number of features in the hidden state

  • dense_features – A list describing the linear layers after the LSTM layer. There will be a layer per element in the list, with output size equal to the value of the element. If None, a single linear layer is used, with an output size of 1 (the prediction).

  • rnn_dropout – Dropout to use between timesteps. Note that this is different from PyTorch’s default LSTM layer, which adds dropout between layers, not timesteps.
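The conditioning of dynamic features on static features can be sketched as a single cell step in plain Python. Following the formulation in the paper cited above, the input gate is computed once from the static features, while the forget gate, candidate and output gate depend on the dynamic inputs and the previous hidden state. The weights, sizes and function names below are arbitrary illustrations, not the library's implementation.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


def affine(weights: Matrix, inputs: Vector, bias: Vector) -> Vector:
    """Compute weights @ inputs + bias with plain lists."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def static_input_gate(x_static: Vector, params: Dict[str, list]) -> Vector:
    """Input gate, computed once from the static features only."""
    return [sigmoid(v) for v in affine(params["W_i"], x_static, params["b_i"])]


def ealstm_step(
    x_dynamic: Vector,
    h_prev: Vector,
    c_prev: Vector,
    input_gate: Vector,
    params: Dict[str, list],
) -> Tuple[Vector, Vector]:
    """One time step: the other gates see only dynamic inputs and h_prev."""
    z = x_dynamic + h_prev  # list concatenation
    forget = [sigmoid(v) for v in affine(params["W_f"], z, params["b_f"])]
    candidate = [math.tanh(v) for v in affine(params["W_g"], z, params["b_g"])]
    output = [sigmoid(v) for v in affine(params["W_o"], z, params["b_o"])]
    c_new = [
        f * c + i * g
        for f, c, i, g in zip(forget, c_prev, input_gate, candidate)
    ]
    h_new = [o * math.tanh(c) for o, c in zip(output, c_new)]
    return h_new, c_new


# Arbitrary weights: 1 static feature, 1 dynamic feature, hidden size 2.
params = {
    "W_i": [[0.5], [-0.5]], "b_i": [0.0, 0.0],
    "W_f": [[0.1, 0.2, 0.0], [0.0, 0.1, 0.2]], "b_f": [1.0, 1.0],
    "W_g": [[0.3, -0.1, 0.2], [-0.2, 0.4, 0.1]], "b_g": [0.0, 0.0],
    "W_o": [[0.2, 0.0, 0.1], [0.1, 0.3, 0.0]], "b_o": [0.0, 0.0],
}
x_static = [0.3]
dynamic_sequence = [[0.1], [0.5], [-0.2]]

input_gate = static_input_gate(x_static, params)  # fixed for the whole sequence
h, c = [0.0, 0.0], [0.0, 0.0]
for x_t in dynamic_sequence:
    h, c = ealstm_step(x_t, h, c, input_gate, params)
```

Because the input gate depends only on the static features, it acts as a per-location filter on which parts of the cell state the dynamic inputs are allowed to update.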

Linear Network

class src.models.neural_networks.linear_network.LinearNetwork(layer_sizes: Union[int, List[int]], dropout: float = 0.25, data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', device: str = 'cuda:0', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)

A linear neural network.

In addition to the arguments to ModelBase, the linear network has the following arguments passed to the constructor:

Parameters
  • layer_sizes – A list describing the hidden linear layers. There will be a layer per element in the list, with output size equal to the value of the element. If an int is passed, the model will only have one hidden layer.

  • dropout – The dropout value to use between layers.
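Accepting either an int or a list for layer_sizes usually comes down to a small normalization step, sketched below. The helper name is hypothetical and the snippet only illustrates the convention described above, not the library's code.

```python
from typing import List, Union


def hidden_layer_sizes(layer_sizes: Union[int, List[int]]) -> List[int]:
    """Turn the layer_sizes argument into a list with one entry per hidden layer."""
    if isinstance(layer_sizes, int):
        return [layer_sizes]  # a single hidden layer
    return list(layer_sizes)


single_layer = hidden_layer_sizes(128)
three_layers = hidden_layer_sizes([128, 64, 32])
```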