Models¶
All models extend the following base model:
class src.models.base.ModelBase(data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'embedding', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = False, normalize_y: bool = False)¶

Base for all machine learning models.
Parameters
    data_folder – The location of the data folder. Default = pathlib.Path("data")
    batch_size – The number of files to load at once. These will be chunked and shuffled, so a higher value will lead to better shuffling (but will require more memory). Default = 1
    experiment – The name of the experiment to run. Specifically, the name of the engineer used to generate the data. Default = "one_month_forecast" (train on only historical data and predict one month ahead)
    pred_months – The months the model should predict. If None, all months are predicted. Default = None
    include_pred_month – Whether to include the prediction month in the model’s training data. Default = True
    include_latlons – Whether to include the prediction pixel’s latitude and longitude in the model’s training data. Default = False
    include_monthly_aggs – Whether to include monthly aggregations. Default = True
    include_yearly_aggs – Whether to include yearly aggregations. Default = True
    surrounding_pixels – How many surrounding pixels to add to the input data. For example, if the input is 1, then in addition to the pixel at the prediction point, the neighbouring (spatial) pixels up to a distance of one pixel away will be included too. Default = None
    ignore_vars – A list of variables to ignore. If None, all variables in the data_path will be included. Default = None
    static – Whether to include static data, and how it is passed to the model. Default = "embedding"
    predict_delta – Whether to model the change in the target variable rather than the raw values. Default = False
    spatial_mask – If an xr.DataArray is passed, it will be used to mask the training / test data. Default = None
    include_prev_y – Whether to include the y value from the same month one year ago. This is useful if you are predicting a seasonal value. Default = False
    normalize_y – Whether to normalize the y value being predicted. Default = False. The predictions saved in evaluate will be denormalized.
evaluate(save_results: bool = True, save_preds: bool = False) → None¶

Evaluate the trained model on the test data.
Parameters
    save_results – Whether to save the results of the evaluation. If True, they are saved in self.model_dir / results.json. Default = True
    save_preds – Whether to save the model predictions. If True, they are saved in self.model_dir / {year}_{month}.nc. Default = False
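For example, a minimal sketch of evaluating an already trained model (model here is a hypothetical, trained ModelBase subclass instance):

    # Evaluate on the test data, saving both the metrics and the predictions.
    model.evaluate(save_results=True, save_preds=True)
    # Metrics are written to model.model_dir / "results.json";
    # predictions are written as {year}_{month}.nc files in the same directory.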
explain(x: Any) → numpy.ndarray¶

Explain the predictions of the trained model on the input data x.
Parameters
    x – Any input array / tensor
Returns
    A shap value for each of the input values. The sum of the shap values is equal to the prediction of the model for x.
get_dataloader(mode: str, to_tensor: bool = False, shuffle_data: bool = False, **kwargs) → src.models.data.DataLoader¶

Returns
    The correct dataloader for this model.
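A minimal sketch of fetching and iterating over a dataloader (the "test" mode string and the iteration pattern are assumptions based on the train / test split used elsewhere in this documentation):

    # Build the dataloader for the test data without converting to tensors.
    test_loader = model.get_dataloader(mode="test", to_tensor=False, shuffle_data=False)
    for batch in test_loader:
        ...  # each batch is produced by src.models.data.DataLoader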
Loading models¶
Models have a save_model function, which saves the model to a pickle object. These can then be loaded using the load_model function:
src.models.load_model(model_path: pathlib.Path, data_path: Optional[pathlib.Path] = None, model_type: Optional[str] = None, device: Optional[str] = 'cpu') → Union[src.models.neural_networks.rnn.RecurrentNetwork, src.models.neural_networks.linear_network.LinearNetwork, src.models.regression.LinearRegression, src.models.neural_networks.ealstm.EARecurrentNetwork, src.models.gbdt.GBDT]¶

This function loads models from the output .pkl files generated when calling model.save_model().
Parameters
    model_path – The path to the model
    data_path – The path to the data folder. If None, the function infers this from the model_path (assuming it was saved as part of the pipeline). Default = None
    model_type – The type of model to load. If None, the function infers this from the model_path (assuming it was saved as part of the pipeline). Default = None
    device – The device to load the model onto. Default = "cpu"
Returns
    A model object loaded from the model_path
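A minimal usage sketch (the .pkl path below is hypothetical; the exact location depends on where save_model() wrote the file inside your data folder):

    from pathlib import Path

    from src.models import load_model

    # Load a previously saved model and re-run the evaluation on CPU.
    model = load_model(Path("data/models/one_month_forecast/rnn/model.pkl"), device="cpu")
    model.evaluate(save_preds=True)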
The following models have been implemented:
Persistence¶
class src.models.parsimonious.Persistence(data_folder: pathlib.Path = PosixPath('data'))¶

A parsimonious persistence model. This “model” predicts the previous time-value of the data. For example, its prediction for VHI in March 2018 will be the VHI for February 2018 (assuming monthly time-granularity).
Parameters
    data_folder – Location of the data folder. Default = pathlib.Path("data")
train() → None¶

This “model” does not need to be trained!
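A minimal sketch of using the persistence baseline (assuming the engineered data is already in the default data folder):

    from src.models.parsimonious import Persistence

    predictor = Persistence()   # uses the default data folder, pathlib.Path("data")
    predictor.train()           # a no-op, kept for API consistency
    predictor.evaluate(save_results=True, save_preds=True)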
Linear Regression¶
class src.models.regression.LinearRegression(data_folder: pathlib.Path = PosixPath('data'), experiment: str = 'one_month_forecast', batch_size: int = 1, pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)¶

A linear regression model, implemented by scikit-learn.
Parameters
    data_folder – Location of the data folder. Default = pathlib.Path("data")
    experiment – The name of the experiment to run. Specifically, the name of the engineer used to generate the data. Default = "one_month_forecast" (train on only historical data and predict one month ahead)
    batch_size – The number of files to load at once. These will be chunked and shuffled, so a higher value will lead to better shuffling (but will require more memory). Default = 1
    pred_months – The months the model should predict. If None, all months are predicted. Default = None
    include_pred_month – Whether to include the prediction month in the model’s training data. Default = True
    include_latlons – Whether to include the prediction pixel’s latitude and longitude in the model’s training data. Default = False
    include_monthly_aggs – Whether to include monthly aggregations. Default = True
    include_yearly_aggs – Whether to include yearly aggregations. Default = True
    surrounding_pixels – How many surrounding pixels to add to the input data. For example, if the input is 1, then in addition to the pixel at the prediction point, the neighbouring (spatial) pixels up to a distance of one pixel away will be included too. Default = None
    ignore_vars – A list of variables to ignore. If None, all variables in the data_path will be included. Default = None
    static – Whether to include static data, and how it is passed to the model. Default = "features"
    predict_delta – Whether to model the change in the target variable rather than the raw values. Default = False
    spatial_mask – If an xr.DataArray is passed, it will be used to mask the training / test data. Default = None
    include_prev_y – Whether to include the y value from the same month one year ago. This is useful if you are predicting a seasonal value. Default = True
    normalize_y – Whether to normalize the y value being predicted. Default = True. The predictions saved in evaluate will be denormalized.
explain(x: Optional[src.models.data.TrainData] = None, save_shap_values: bool = True) → numpy.ndarray¶

Explain the predictions of the trained model on the input data x.

Parameters
    x – Any input array / tensor
Returns
    A shap value for each of the input values. The sum of the shap values is equal to the prediction of the model for x.
save_model() → None¶

Saves a pickle object of the model, which can be loaded using src.models.load_model.
train(num_epochs: int = 1, early_stopping: Optional[int] = None, batch_size: int = 256, val_split: float = 0.1, initial_learning_rate: float = 1e-15) → None¶

Train the linear regression model.

Parameters
    num_epochs – The number of epochs to train the model for. If early_stopping is not None, then this is the maximum number of epochs for which the model will be trained. Default = 1
    early_stopping – If not None, the number of epochs to wait without improvement before stopping model training and reverting to the best model. Default = None
    batch_size – The batch size to use when training the model. Default = 256
    val_split – The ratio of data to use in the validation set. Default = 0.1
    initial_learning_rate – The initial learning rate to use. Default = 1e-15
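A minimal training sketch, assuming the one_month_forecast engineer has already generated the data (the epoch and patience values are illustrative):

    from src.models.regression import LinearRegression

    model = LinearRegression(experiment="one_month_forecast")
    # Train for at most 10 epochs, reverting to the best model if the
    # validation score stops improving for 3 consecutive epochs.
    model.train(num_epochs=10, early_stopping=3)
    model.evaluate(save_preds=True)
    model.save_model()  # writes a pickle that src.models.load_model can read back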
Neural Networks¶
A number of neural networks are implemented. All are trained using Smooth L1 Loss, with optional early stopping.
All neural network classes extend the following base class:
class src.models.neural_networks.base.NNBase(data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', device: str = 'cuda:0', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)¶

The base for all neural network models, written in PyTorch. It extends the ModelBase class, and implements the train, predict and explain functions, which are shared across all neural networks.

The arguments to the constructor are the same as for the ModelBase class.
explain(x: Optional[src.models.data.TrainData] = None, var_names: Optional[List[str]] = None, save_explanations: bool = True, background_size: int = 100, start_idx: int = 0, num_inputs: int = 10, method: str = 'shap') → src.models.data.TrainData¶

Explain the outputs of a trained model.
Parameters
    x – The values to explain. If None, samples are randomly drawn from the test data
    var_names – The variable names of the historical inputs. If x is None, this will be calculated. Only necessary if the arrays are going to be saved
    save_explanations – Whether or not to save the shap values
    background_size – The size of the background to use
    start_idx – The index from which to calculate the shap values. Shap values will be calculated for x[start_idx: start_idx + num_inputs]
    num_inputs – The number of datapoints to calculate shap values for
    method – One of {"shap", "morris"}. The method to use to calculate the explanations
Returns
    A dictionary of shap values for each of the model’s input arrays
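A minimal sketch of explaining a trained network (model is a hypothetical trained NNBase subclass; with x=None the datapoints are drawn from the test data):

    # Compute shap values for ten test datapoints and save them to disk.
    explanations = model.explain(
        save_explanations=True,
        background_size=100,
        start_idx=0,
        num_inputs=10,
        method="shap",
    )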
train(num_epochs: int = 1, early_stopping: Optional[int] = None, batch_size: int = 256, learning_rate: float = 0.001, val_split: float = 0.1) → None¶

Trains a neural network.

Parameters
    num_epochs – The maximum number of epochs to train the model for.
    early_stopping – If an int is passed, early stopping will be used with this value as the patience. If None, the model will train for num_epochs.
    batch_size – The number of instances to put in each batch.
    learning_rate – The learning rate to use.
    val_split – The ratio of training data to use as validation, for early stopping.
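For example (a sketch only; model is any of the neural networks below, and the values are illustrative):

    # Train for at most 50 epochs, stopping early if the validation loss
    # has not improved for 5 consecutive epochs.
    model.train(
        num_epochs=50,
        early_stopping=5,
        batch_size=256,
        learning_rate=1e-3,
        val_split=0.1,
    )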
LSTM¶
class src.models.neural_networks.rnn.RecurrentNetwork(hidden_size: int, dense_features: Optional[List[int]] = None, rnn_dropout: float = 0.25, data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', device: str = 'cuda:0', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)¶

A single long short-term memory (LSTM) layer, followed by linear layers. See this blog post for more information about recurrent networks and LSTMs.

The LSTM receives the static data appended to every time step of the dynamic data. In addition to the arguments to ModelBase, the LSTM has the following arguments passed to the constructor:
Parameters
    hidden_size – The number of features in the hidden state
    dense_features – A list describing the linear layers after the LSTM layer. There will be a layer per element in the list, with output size equal to the value of the element. If None, a single linear layer is used, with an output size of 1 (the prediction).
    rnn_dropout – Dropout to use between timesteps. Note that this is different from PyTorch’s default LSTM layer, which adds dropout between layers, not timesteps.
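A minimal construction-and-training sketch (the hyperparameter values are illustrative only; device="cpu" overrides the default "cuda:0" for machines without a GPU):

    from src.models.neural_networks.rnn import RecurrentNetwork

    # An LSTM with a 128-dimensional hidden state, followed by two linear
    # layers with output sizes 64 and 1 (the prediction).
    model = RecurrentNetwork(
        hidden_size=128,
        dense_features=[64, 1],
        rnn_dropout=0.25,
        device="cpu",
    )
    model.train(num_epochs=50, early_stopping=5)
    model.evaluate(save_preds=True)
    model.save_model()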
EA-LSTM¶
class src.models.neural_networks.ealstm.EARecurrentNetwork(hidden_size: int, dense_features: Optional[List[int]] = None, rnn_dropout: float = 0.25, data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_latlons: bool = False, include_pred_month: bool = True, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', static_embedding_size: Optional[int] = None, device: str = 'cuda:0', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)¶

An Entity-Aware LSTM (EA-LSTM), described in Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets. In an EA-LSTM, the dynamic features are conditioned on the static features.

In addition to the arguments to ModelBase, the EA-LSTM has the following arguments passed to the constructor:
Parameters
    hidden_size – The number of features in the hidden state
    dense_features – A list describing the linear layers after the LSTM layer. There will be a layer per element in the list, with output size equal to the value of the element. If None, a single linear layer is used, with an output size of 1 (the prediction).
    rnn_dropout – Dropout to use between timesteps. Note that this is different from PyTorch’s default LSTM layer, which adds dropout between layers, not timesteps.
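A minimal sketch (hyperparameters are illustrative; the role attributed to static_embedding_size, which appears only in the constructor signature above, is an assumption):

    from src.models.neural_networks.ealstm import EARecurrentNetwork

    model = EARecurrentNetwork(
        hidden_size=128,
        rnn_dropout=0.25,
        static_embedding_size=64,  # assumption: size of the embedding applied to the static inputs
        device="cpu",
    )
    model.train(num_epochs=50, early_stopping=5)
    model.evaluate(save_preds=True)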
Linear Network¶
class src.models.neural_networks.linear_network.LinearNetwork(layer_sizes: Union[int, List[int]], dropout: float = 0.25, data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', device: str = 'cuda:0', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)¶

A linear neural network.

In addition to the arguments to ModelBase, the linear network has the following arguments passed to the constructor:
Parameters
    layer_sizes – A list describing the network’s linear layers. There will be a layer per element in the list, with output size equal to the value of the element. If an int is passed, the model will only have one hidden layer.
    dropout – The dropout value to use between layers.
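A minimal sketch (the layer sizes are illustrative):

    from src.models.neural_networks.linear_network import LinearNetwork

    # Two hidden linear layers of 100 units each, with 25% dropout between layers.
    model = LinearNetwork(
        layer_sizes=[100, 100],
        dropout=0.25,
        device="cpu",
    )
    model.train(num_epochs=50, early_stopping=5)
    model.evaluate(save_preds=True)
    model.save_model()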