Model

For ease of use, all of Galpro’s functionalities can be accessed using the Model class.

API documentation: Model

Training model

To train a random forest model, the training and testing datasets are required. The model must be given a unique name via model_name. Optional parameters include target_features, a list naming all the target features, and save_model, which controls whether the trained model is saved:

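import galpro as gp

# Raw strings keep the LaTeX backslashes from being read as escape sequences.
target_features = [r'$z$', r'$\log(M_{\star} / M_{\odot})$']

# x_train, y_train, x_test and y_test are assumed to be defined already.
model = gp.Model(model_name='model', x_train=x_train, y_train=y_train, x_test=x_test, y_test=y_test,
                 target_features=target_features, save_model=True)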

If the model is saved, it will be located in the directory /galpro/model_name/ as a .sav file. The Model class can also load a previously trained model when its name is passed via model_name. Once a model has been trained or loaded, it can be used for testing, validation and plotting.
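The training snippet above takes x_train, y_train, x_test and y_test as given. As a minimal sketch, assuming the datasets are NumPy arrays with one row per object (input features in the columns of x, target features in the columns of y), a labelled catalogue could be split as follows. The synthetic data, the 80/20 split and the helper name split_catalogue are illustrative assumptions and not part of Galpro.

```python
import numpy as np


def split_catalogue(x, y, test_fraction=0.2, seed=0):
    """Shuffle a labelled catalogue and split it into training and testing sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_test = int(round(test_fraction * len(x)))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return x[train_idx], y[train_idx], x[test_idx], y[test_idx]


# Synthetic stand-in for a real catalogue: 1000 objects, 5 input features
# (e.g. magnitudes) and 2 target features (e.g. redshift and stellar mass).
rng = np.random.default_rng(42)
x = rng.normal(size=(1000, 5))
y = rng.normal(size=(1000, 2))

x_train, y_train, x_test, y_test = split_catalogue(x, y)
print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
```

The four arrays can then be passed to gp.Model in place of the undefined names in the snippet above.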

Testing model

The trained model can be used to generate point predictions and posterior PDFs using:

point_estimates = model.point_estimate(save_estimate=True, make_plots=True)
posteriors = model.posterior(save_posteriors=True, make_plots=True, on_the_fly=False)

The point_estimate function returns an array of point estimates. The posterior function returns an .h5 file object containing the posteriors, which can be accessed using the test object numbers as keys. If the point estimates and PDFs are saved, they are stored as .h5 (Hierarchical Data Format) files in the subdirectories /galpro/model_name/point_estimates/ and /galpro/model_name/posteriors/ respectively. The plots are saved in the /plots/ folder.
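Once the posterior of a test object has been retrieved, it is often reduced to a few summary statistics. The sketch below assumes that a posterior is an array of samples with one column per target feature; this layout is an assumption made for illustration, and the array is synthetic rather than read from Galpro's .h5 output. The helper summarise_posterior is hypothetical.

```python
import numpy as np


def summarise_posterior(samples):
    """Return the median and the central 68% interval for each target feature."""
    samples = np.asarray(samples, dtype=float)
    lower, upper = np.percentile(samples, [16, 84], axis=0)
    return {"median": np.median(samples, axis=0), "lower": lower, "upper": upper}


# Synthetic stand-in for one object's posterior: 500 samples of 2 targets.
rng = np.random.default_rng(1)
samples = rng.normal(loc=[0.5, 10.0], scale=[0.05, 0.2], size=(500, 2))

summary = summarise_posterior(samples)
print(summary["median"], summary["lower"], summary["upper"])
```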

On-the-fly PDFs

Galpro can generate PDFs on the fly, which removes the need to store them. This makes it easy to incorporate Galpro into research codes:

posterior = model.posterior(save_posteriors=False, make_plots=False, on_the_fly=True)

for sample in range(no_samples):
    sample_posterior = next(posterior)

Here, the on_the_fly parameter is set to True, and each call to next(posterior) returns the posterior PDF of the next test object. Because nothing is stored in this mode, the other parameters are set to False, and the validation and plotting functionalities described in the following sections are not available.
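The pattern relies on Python's generator protocol: each PDF is created only when next() asks for it, so at most one is held in memory at a time. The sketch below illustrates this with a hypothetical stand-in generator, fake_posteriors, which is not part of Galpro and simply yields random draws; no_samples is assumed here to be the number of test objects.

```python
import random


def fake_posteriors(no_samples, draws=100, seed=0):
    """Hypothetical stand-in for model.posterior(..., on_the_fly=True).

    Yields one list of posterior draws per test object, creating each
    list only when it is requested.
    """
    rng = random.Random(seed)
    for _ in range(no_samples):
        yield [rng.gauss(0.5, 0.1) for _ in range(draws)]


no_samples = 3
posterior = fake_posteriors(no_samples)

means = []
for sample in range(no_samples):
    sample_posterior = next(posterior)  # only this object's PDF is in memory
    means.append(sum(sample_posterior) / len(sample_posterior))

# The generator is exhausted once every test object has been visited.
exhausted = next(posterior, None) is None
print(means, exhausted)
```

Keeping only a running summary (here the mean of each PDF) is what avoids the storage cost of the full set of posteriors.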

Validating model

The posterior PDFs generated by the trained model can be validated using:

validation = model.validate(save_validation=True, make_plots=True)

Marginal PDFs are validated using the framework developed by Gneiting et al. (2007), and multivariate PDFs are validated using the multivariate extension of that framework, developed by Ziegel and Gneiting (2014). A brief introduction to the methods can be found in our paper (Mucesh et al. 2020). The function returns an .h5 file object, and the different modes of validation can be accessed using the keys pits, coppits, marginal_calibration and kendall_calibration. The validation is stored in the subdirectory /galpro/model_name/validation/.
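To make the marginal validation more concrete: the probability integral transform (PIT) of a test object is its posterior CDF evaluated at the true value. Assuming each marginal posterior is represented by a set of samples (an assumption made here for illustration), the PIT can be estimated as the fraction of samples at or below the truth. The function pit_value is a hypothetical helper, not Galpro's implementation; for well-calibrated PDFs the resulting values should be approximately uniform on [0, 1].

```python
import numpy as np


def pit_value(samples, truth):
    """Empirical CDF of the posterior samples evaluated at the true value."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(samples <= truth))


# Toy check: the truth sits in the middle of four samples.
print(pit_value([1.0, 2.0, 3.0, 4.0], 2.5))  # 0.5

# Calibrated toy case: truths and posterior samples share one distribution,
# so the histogram of PIT values should be roughly flat.
rng = np.random.default_rng(0)
truths = rng.normal(size=2000)
pits = np.array([pit_value(rng.normal(size=200), t) for t in truths])
hist, _ = np.histogram(pits, bins=10, range=(0.0, 1.0))
print(hist)
```

A histogram that is peaked in the centre would instead indicate PDFs that are too broad, and a U-shaped one PDFs that are too narrow.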

Plotting

Galpro can generate various plots:

model.plot.scatter() # Creates scatter plots of point predictions.
model.plot.marginal() # Creates marginal PDF plots.
model.plot.joint_pdf() # Creates joint PDF plots.
model.plot.corner() # Creates a corner style plot for multivariate PDFs.
model.plot.pit() # Plots the probability integral transform (PIT) distribution.
model.plot.coppit() # Plots the copula probability integral transform (copPIT) distribution.
model.plot.marginal_calibration() # Plots the marginal calibration.
model.plot.kendall_calibration() # Plots the Kendall calibration.

Each plotting function takes two optional parameters, show and save, which default to False and True respectively. All plots are saved in the /plots/ folder in their respective subdirectories. The same plots can also be created by setting make_plots=True when running model.point_estimate, model.posterior or model.validate. These functions can also recreate the plots in a later session, provided that the model and the necessary .h5 files were saved in the previous run.

Configuration

The hyperparameters of the random forest algorithm are defined in the conf.py file. We expect the default hyperparameters to work well in most situations. However, users who wish to tune the hyperparameters can do so by modifying their values in the configuration file before loading the package.

The plotting aesthetics are defined in the same configuration file. Users can adjust them by editing the matplotlib or seaborn settings there.