Models¶
All models extend the following base model:
class src.models.base.ModelBase(data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'embedding', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = False, normalize_y: bool = False)¶

Base for all machine learning models.
Parameters
    data_folder – The location of the data folder. Default = pathlib.Path("data")
    batch_size – The number of files to load at once. These will be chunked and shuffled, so a higher value will lead to better shuffling (but will require more memory). Default = 1
    experiment – The name of the experiment to run. Specifically, the name of the engineer used to generate the data. Default = "one_month_forecast" (train on only historical data and predict one month ahead)
    pred_months – The months the model should predict. If None, all months are predicted. Default = None
    include_pred_month – Whether to include the prediction month in the model’s training data. Default = True
    include_latlons – Whether to include the prediction pixel’s latitude and longitude in the model’s training data. Default = False
    include_monthly_aggs – Whether to include monthly aggregations. Default = True
    include_yearly_aggs – Whether to include yearly aggregations. Default = True
    surrounding_pixels – How many surrounding pixels to add to the input data. For example, if the input is 1, then in addition to the pixel at the prediction point, the neighbouring (spatial) pixels up to a distance of one pixel away will be included too. Default = None
    ignore_vars – A list of variables to ignore. If None, all variables in the data_path will be included. Default = None
    static – Whether to include static data, and how it is passed to the model. Default = "embedding"
    predict_delta – Whether to model the change in the target variable rather than the raw values. Default = False
    spatial_mask – If an xr.DataArray is passed, it will be used to mask the training / test data. Default = None
    include_prev_y – Whether to include the y value from the same month one year ago. This is useful if you are predicting a seasonal value. Default = False
    normalize_y – Whether to normalize the y value being predicted. Default = False. The predictions saved in evaluate will be denormalized.
evaluate(save_results: bool = True, save_preds: bool = False) → None¶

Evaluate the trained model on the test data.
Parameters
    save_results – Whether to save the results of the evaluation. If True, they are saved in self.model_dir / results.json. Default = True
    save_preds – Whether to save the model predictions. If True, they are saved in self.model_dir / {year}_{month}.nc. Default = False
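For example, a minimal sketch of evaluating an already trained model (model here is a hypothetical, trained ModelBase subclass instance):

    # Evaluate on the test data, saving both the metrics and the predictions.
    model.evaluate(save_results=True, save_preds=True)
    # Metrics are written to model.model_dir / "results.json";
    # predictions are written as {year}_{month}.nc files in the same directory.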
explain(x: Any) → numpy.ndarray¶

Explain the predictions of the trained model on the input data x.
Parameters
    x – Any input array / tensor
Returns
    A shap value for each of the input values. The sum of the shap values is equal to the prediction of the model for x.
get_dataloader(mode: str, to_tensor: bool = False, shuffle_data: bool = False, **kwargs) → src.models.data.DataLoader¶

Returns
    The correct dataloader for this model.
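A minimal sketch of fetching and iterating over a dataloader (the "test" mode string and the iteration pattern are assumptions based on the train / test split used elsewhere in this documentation):

    # Build the dataloader for the test data without converting to tensors.
    test_loader = model.get_dataloader(mode="test", to_tensor=False, shuffle_data=False)
    for batch in test_loader:
        ...  # each batch is produced by src.models.data.DataLoader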
Loading models¶
Models have a save_model function, which saves the model to a pickle object. These can then be loaded using the load_model function:
src.models.load_model(model_path: pathlib.Path, data_path: Optional[pathlib.Path] = None, model_type: Optional[str] = None, device: Optional[str] = 'cpu') → Union[src.models.neural_networks.rnn.RecurrentNetwork, src.models.neural_networks.linear_network.LinearNetwork, src.models.regression.LinearRegression, src.models.neural_networks.ealstm.EARecurrentNetwork, src.models.gbdt.GBDT]¶

This function loads models from the output .pkl files generated when calling model.save_model().
Parameters
    model_path – The path to the model
    data_path – The path to the data folder. If None, the function infers this from the model_path (assuming it was saved as part of the pipeline). Default = None
    model_type – The type of model to load. If None, the function infers this from the model_path (assuming it was saved as part of the pipeline). Default = None
    device – The device to load the model onto. Default = "cpu"
Returns
    A model object loaded from the model_path
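A minimal usage sketch (the .pkl path below is hypothetical; the exact location depends on where save_model() wrote the file inside your data folder):

    from pathlib import Path

    from src.models import load_model

    # Load a previously saved model and re-run the evaluation on CPU.
    model = load_model(Path("data/models/one_month_forecast/rnn/model.pkl"), device="cpu")
    model.evaluate(save_preds=True)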
The following models have been implemented:
Persistence¶
class src.models.parsimonious.Persistence(data_folder: pathlib.Path = PosixPath('data'))¶

A parsimonious persistence model. This “model” predicts the previous time-value of the data. For example, its prediction for VHI in March 2018 will be the VHI for February 2018 (assuming monthly time-granularity).
Parameters
    data_folder – Location of the data folder. Default = pathlib.Path("data")
train() → None¶

This “model” does not need to be trained!
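A minimal sketch of using the persistence baseline (assuming the engineered data is already in the default data folder):

    from src.models.parsimonious import Persistence

    predictor = Persistence()   # uses the default data folder, pathlib.Path("data")
    predictor.train()           # a no-op, kept for API consistency
    predictor.evaluate(save_results=True, save_preds=True)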
Linear Regression¶
class src.models.regression.LinearRegression(data_folder: pathlib.Path = PosixPath('data'), experiment: str = 'one_month_forecast', batch_size: int = 1, pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)¶

A linear regression model, implemented by scikit-learn.
Parameters
    data_folder – Location of the data folder. Default = pathlib.Path("data")
    experiment – The name of the experiment to run. Specifically, the name of the engineer used to generate the data. Default = "one_month_forecast" (train on only historical data and predict one month ahead)
    batch_size – The number of files to load at once. These will be chunked and shuffled, so a higher value will lead to better shuffling (but will require more memory). Default = 1
    pred_months – The months the model should predict. If None, all months are predicted. Default = None
    include_pred_month – Whether to include the prediction month in the model’s training data. Default = True
    include_latlons – Whether to include the prediction pixel’s latitude and longitude in the model’s training data. Default = False
    include_monthly_aggs – Whether to include monthly aggregations. Default = True
    include_yearly_aggs – Whether to include yearly aggregations. Default = True
    surrounding_pixels – How many surrounding pixels to add to the input data. For example, if the input is 1, then in addition to the pixel at the prediction point, the neighbouring (spatial) pixels up to a distance of one pixel away will be included too. Default = None
    ignore_vars – A list of variables to ignore. If None, all variables in the data_path will be included. Default = None
    static – Whether to include static data, and how it is passed to the model. Default = "features"
    predict_delta – Whether to model the change in the target variable rather than the raw values. Default = False
    spatial_mask – If an xr.DataArray is passed, it will be used to mask the training / test data. Default = None
    include_prev_y – Whether to include the y value from the same month one year ago. This is useful if you are predicting a seasonal value. Default = True
    normalize_y – Whether to normalize the y value being predicted. Default = True. The predictions saved in evaluate will be denormalized.
explain(x: Optional[src.models.data.TrainData] = None, save_shap_values: bool = True) → numpy.ndarray¶

Explain the predictions of the trained model on the input data x.

Parameters
    x – Any input array / tensor
Returns
    A shap value for each of the input values. The sum of the shap values is equal to the prediction of the model for x.
save_model() → None¶

Saves a pickle object of the model, which can be loaded using src.models.load_model.
train(num_epochs: int = 1, early_stopping: Optional[int] = None, batch_size: int = 256, val_split: float = 0.1, initial_learning_rate: float = 1e-15) → None¶

Train the linear regression model.

Parameters
    num_epochs – The number of epochs to train the model for. If early_stopping is not None, then this is the maximum number of epochs for which the model will be trained. Default = 1
    early_stopping – If not None, the number of epochs to wait without improvement before stopping model training and reverting to the best model. Default = None
    batch_size – The batch size to use when training the model. Default = 256
    val_split – The ratio of data to use in the validation set. Default = 0.1
    initial_learning_rate – The initial learning rate to use. Default = 1e-15
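A minimal training sketch, assuming the one_month_forecast engineer has already generated the data (the epoch and patience values are illustrative):

    from src.models.regression import LinearRegression

    model = LinearRegression(experiment="one_month_forecast")
    # Train for at most 10 epochs, reverting to the best model if the
    # validation score stops improving for 3 consecutive epochs.
    model.train(num_epochs=10, early_stopping=3)
    model.evaluate(save_preds=True)
    model.save_model()  # writes a pickle that src.models.load_model can read back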
Neural Networks¶
A number of neural networks are implemented. All are trained using Smooth L1 Loss, with optional early stopping.
All neural network classes extend the following base class:
class src.models.neural_networks.base.NNBase(data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', device: str = 'cuda:0', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)¶

The base for all neural network models, written in PyTorch. It extends the ModelBase class, and implements the train, predict and explain functions, which are shared across all neural networks.

The arguments to the constructor are the same as for the ModelBase class.
explain(x: Optional[src.models.data.TrainData] = None, var_names: Optional[List[str]] = None, save_explanations: bool = True, background_size: int = 100, start_idx: int = 0, num_inputs: int = 10, method: str = 'shap') → src.models.data.TrainData¶

Explain the outputs of a trained model.
Parameters
    x – The values to explain. If None, samples are randomly drawn from the test data
    var_names – The variable names of the historical inputs. If x is None, this will be calculated. Only necessary if the arrays are going to be saved
    save_explanations – Whether or not to save the shap values
    background_size – The size of the background to use
    start_idx – The index from which to calculate the shap values. Shap values will be calculated for x[start_idx: start_idx + num_inputs]
    num_inputs – The number of datapoints to calculate shap values for
    method – One of {"shap", "morris"}. The method to use to calculate the explanations
Returns
    A dictionary of shap values for each of the model’s input arrays
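A minimal sketch of explaining a trained network (model is a hypothetical trained NNBase subclass; with x=None the datapoints are drawn from the test data):

    # Compute shap values for ten test datapoints and save them to disk.
    explanations = model.explain(
        save_explanations=True,
        background_size=100,
        start_idx=0,
        num_inputs=10,
        method="shap",
    )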
train(num_epochs: int = 1, early_stopping: Optional[int] = None, batch_size: int = 256, learning_rate: float = 0.001, val_split: float = 0.1) → None¶

Trains a neural network.

Parameters
    num_epochs – The maximum number of epochs to train the model for.
    early_stopping – If an int is passed, early stopping will be used with this value as the patience. If None, the model will train for num_epochs.
    batch_size – The number of instances to put in each batch.
    learning_rate – The learning rate to use.
    val_split – The ratio of training data to use as validation, for early stopping.
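For example (a sketch only; model is any of the neural networks below, and the values are illustrative):

    # Train for at most 50 epochs, stopping early if the validation loss
    # has not improved for 5 consecutive epochs.
    model.train(
        num_epochs=50,
        early_stopping=5,
        batch_size=256,
        learning_rate=1e-3,
        val_split=0.1,
    )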
LSTM¶
class src.models.neural_networks.rnn.RecurrentNetwork(hidden_size: int, dense_features: Optional[List[int]] = None, rnn_dropout: float = 0.25, data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', device: str = 'cuda:0', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)¶

A single long short-term memory (LSTM) layer, followed by linear layers. See this blog post for more information about recurrent networks and LSTMs.

The LSTM receives the static data appended to every time step of the dynamic data. In addition to the arguments to ModelBase, the LSTM has the following arguments passed to the constructor:
Parameters
    hidden_size – The number of features in the hidden state
    dense_features – A list describing the linear layers after the LSTM layer. There will be a layer per element in the list, with output size equal to the value of the element. If None, a single linear layer is used, with an output size of 1 (the prediction).
    rnn_dropout – Dropout to use between timesteps. Note that this is different from PyTorch’s default LSTM layer, which adds dropout between layers, not timesteps.
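A minimal construction-and-training sketch (the hyperparameter values are illustrative only; device="cpu" overrides the default "cuda:0" for machines without a GPU):

    from src.models.neural_networks.rnn import RecurrentNetwork

    # An LSTM with a 128-dimensional hidden state, followed by two linear
    # layers with output sizes 64 and 1 (the prediction).
    model = RecurrentNetwork(
        hidden_size=128,
        dense_features=[64, 1],
        rnn_dropout=0.25,
        device="cpu",
    )
    model.train(num_epochs=50, early_stopping=5)
    model.evaluate(save_preds=True)
    model.save_model()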
EA-LSTM¶
class src.models.neural_networks.ealstm.EARecurrentNetwork(hidden_size: int, dense_features: Optional[List[int]] = None, rnn_dropout: float = 0.25, data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_latlons: bool = False, include_pred_month: bool = True, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', static_embedding_size: Optional[int] = None, device: str = 'cuda:0', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)¶

An Entity-Aware LSTM (EA-LSTM), described in Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets. In an EA-LSTM, the dynamic features are conditioned on the static features.

In addition to the arguments to ModelBase, the EA-LSTM has the following arguments passed to the constructor:
Parameters
    hidden_size – The number of features in the hidden state
    dense_features – A list describing the linear layers after the LSTM layer. There will be a layer per element in the list, with output size equal to the value of the element. If None, a single linear layer is used, with an output size of 1 (the prediction).
    rnn_dropout – Dropout to use between timesteps. Note that this is different from PyTorch’s default LSTM layer, which adds dropout between layers, not timesteps.
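A minimal sketch (hyperparameters are illustrative; the role attributed to static_embedding_size, which appears only in the constructor signature above, is an assumption):

    from src.models.neural_networks.ealstm import EARecurrentNetwork

    model = EARecurrentNetwork(
        hidden_size=128,
        rnn_dropout=0.25,
        static_embedding_size=64,  # assumption: size of the embedding applied to the static inputs
        device="cpu",
    )
    model.train(num_epochs=50, early_stopping=5)
    model.evaluate(save_preds=True)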
Linear Network¶
class src.models.neural_networks.linear_network.LinearNetwork(layer_sizes: Union[int, List[int]], dropout: float = 0.25, data_folder: pathlib.Path = PosixPath('data'), batch_size: int = 1, experiment: str = 'one_month_forecast', pred_months: Optional[List[int]] = None, include_pred_month: bool = True, include_latlons: bool = False, include_monthly_aggs: bool = True, include_yearly_aggs: bool = True, surrounding_pixels: Optional[int] = None, ignore_vars: Optional[List[str]] = None, static: Optional[str] = 'features', device: str = 'cuda:0', predict_delta: bool = False, spatial_mask: Union[xarray.core.dataarray.DataArray, pathlib.Path] = None, include_prev_y: bool = True, normalize_y: bool = True)¶

A linear neural network.

In addition to the arguments to ModelBase, the linear network has the following arguments passed to the constructor:
Parameters
    layer_sizes – A list describing the network’s linear layers. There will be a layer per element in the list, with output size equal to the value of the element. If an int is passed, the model will only have one hidden layer.
    dropout – The dropout value to use between layers.
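A minimal sketch (the layer sizes are illustrative):

    from src.models.neural_networks.linear_network import LinearNetwork

    # Two hidden linear layers of 100 units each, with 25% dropout between layers.
    model = LinearNetwork(
        layer_sizes=[100, 100],
        dropout=0.25,
        device="cpu",
    )
    model.train(num_epochs=50, early_stopping=5)
    model.evaluate(save_preds=True)
    model.save_model()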