To begin: click anywhere in this cell and press Run on the menu bar. This executes the current cell and then highlights the next cell. There are two types of cells: text cells and code cells. When you Run a text cell (we are in a text cell now), you advance to the next cell without executing any code. When you Run a code cell (identified by In[ ]: to the left of the cell), you advance to the next cell after executing all the Python code within that cell. Any visual results produced by the code (text/figures) are reported directly below that cell. Press Run again. Repeat this process until the end of the notebook. NOTE: All the cells in this notebook can be automatically executed sequentially by clicking Kernel → Restart & Run All. Should anything crash, restart the Jupyter kernel by clicking Kernel → Restart, and start again from the top.

Metabolomics Data Visualisation Workflow for ANN-SS




This Jupyter Notebook describes a metabolomics data analysis and visualisation workflow for a two-layer artificial neural network (ANN-SS) applied to a binary classification outcome. Layer 1 consists of multiple neurons (n = 2 to 6) with a sigmoidal activation function, and layer 2 (the output layer) consists of a single neuron with a sigmoidal activation function.

This computational workflow is described using a previously published NMR dataset by Chan et al. (2016). The study compared the urine metabolomic profiles of patients characterised as Gastric Cancer (GC; n=43), Benign Gastric Disease (BN; n=40), and Healthy Control (HE; n=40) using 149 named metabolites. For the purpose of this computational workflow, we compare only the GC vs HE samples in a binary discriminant analysis. The deconvolved and annotated data from this study is deposited on Metabolomics Workbench (Study ID: ST001047), and can be accessed directly via its Project DOI: 10.21228/M8B10B. The Excel file used in this workflow can be accessed via the following link: ST001047.xlsx.

This computational workflow requires a dataset to be in, or converted to, a previously described standardised Excel file format (Mendez et al. 2019). This format uses the Tidy Data Framework (Wickham, 2014), where each row represents an observation (e.g. sample) and each column represents a variable (e.g. age or metabolite). Each Excel file (one per study) contains two sheets: a data sheet and a peak sheet. The data sheet contains the metabolite concentrations together with the metadata associated with each observation (requiring the inclusion of the columns: Idx, SampleID, and Class). The peak sheet contains the additional metadata that pertains to the metabolites in the data sheet (requiring the inclusion of the columns: Idx, Name, and Label). The standardisation of this format allows for the efficient re-use of this computational workflow.
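As a rough illustration of the format requirement, the sketch below checks that the required columns are present in the header of each sheet. It is not the check that cimcb performs; the function name missing_columns and the example headers are hypothetical.

```python
# Required columns, as listed in the text above.
REQUIRED_DATA_COLUMNS = ["Idx", "SampleID", "Class"]
REQUIRED_PEAK_COLUMNS = ["Idx", "Name", "Label"]


def missing_columns(columns, required):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Hypothetical header rows for a data sheet and a peak sheet.
data_header = ["Idx", "SampleID", "Class", "M1", "M2"]
peak_header = ["Idx", "Name"]  # 'Label' is deliberately left out

print(missing_columns(data_header, REQUIRED_DATA_COLUMNS))  # []
print(missing_columns(peak_header, REQUIRED_PEAK_COLUMNS))  # ['Label']
```

An empty list means the sheet satisfies the column requirement.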


The steps included in this data analysis and visualisation workflow are:
1. Import Packages
2. Load Data & Peak Sheet
3. Extract X & Y
4. Hyperparameter Optimisation
4.1. Plot R² & Q²
4.2. Plot Projections: Full & CV
5. Build Model
6. Permutation Test
7. Bootstrap Resampling of the Model
8. Model Evaluation
9. Model Visualisation
9.1. Plot Projections: in-bag & out-of-bag
9.2. Plot Loadings
9.3. Plot Feature Importance
10. Export Results

1. Import Packages

Packages provide additional tools that extend beyond the basic functionality of the Python programming language. Prior to usage, packages need to be imported into the Jupyter environment. The following packages need to be imported for this computational workflow:

  • numpy: a standard package primarily used for the manipulation of arrays
  • pandas: a standard package primarily used for the manipulation of data tables
  • cimcb: a library of helpful functions and tools provided by the authors

In [1]:
import numpy as np
import pandas as pd
import cimcb as cb

print('All packages successfully loaded')
Using Theano backend.
All packages successfully loaded

Optional: Set Random Seed

To reproduce the figures in the research article, set the random seed to _. When a neural network is first compiled, the weights are initialised. By default in Keras, the glorot normal initializer (a.k.a. Xavier normal initializer) is used, where the weights are randomly drawn from a truncated normal distribution. The seed is used to set reproducible initial weights from this distribution.

  • seed: seed the generator using an integer value e.g. 42 (default = None ; no seed set)
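To show why a seed makes the initial weights reproducible, here is a minimal standard-library sketch of a Glorot normal style draw. It assumes the usual definition (zero-centred truncated normal with standard deviation sqrt(2 / (fan_in + fan_out))) and simplifies the truncation by redrawing values beyond two standard deviations. The function glorot_normal is hypothetical and is not the Keras implementation.

```python
import math
import random


def glorot_normal(fan_in, fan_out, seed=None):
    """Draw a fan_in x fan_out weight matrix from a truncated normal distribution.

    Simplification: values further than two standard deviations from zero
    are redrawn. Keras' own implementation differs in detail.
    """
    rng = random.Random(seed)
    stddev = math.sqrt(2.0 / (fan_in + fan_out))
    weights = []
    for _ in range(fan_in):
        row = []
        while len(row) < fan_out:
            value = rng.gauss(0.0, stddev)
            if abs(value) <= 2.0 * stddev:
                row.append(value)
        weights.append(row)
    return weights


w_a = glorot_normal(fan_in=5, fan_out=2, seed=42)
w_b = glorot_normal(fan_in=5, fan_out=2, seed=42)
w_c = glorot_normal(fan_in=5, fan_out=2, seed=7)

print(w_a == w_b)  # same seed, same initial weights
print(w_a == w_c)  # different seed, different initial weights
```

With seed = None the generator is seeded from the system, so each run starts from different weights.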


In [2]:
seed = 42
# seed = None

2. Load Data & Peak Sheet

This CIMCB helper function load_dataXL() loads the Data and Peak sheet from an Excel file. In addition, this helper function checks that the data is in the standardised Excel file format described above. After the initial checks, load_dataXL() outputs two individual Pandas DataFrames (i.e. tables) called DataTable and PeakTable from the Excel file ST001047.xlsx. This helper function requires values for the following parameters:

  • filename: the name of the excel file (.xlsx file)
  • DataSheet: the name of the data sheet in the file
  • PeakSheet: the name of the peak sheet in the file

In [3]:
home = 'data/'
file = 'ST001047.xlsx'

DataTable,PeakTable = cb.utils.load_dataXL(filename=home + file, DataSheet='Data', PeakSheet='Peak')
Loadings PeakFile: Peak
Loadings DataFile: Data
Data Table & Peak Table is suitable.
TOTAL SAMPLES: 140 TOTAL PEAKS: 149
Done!

3. Extract X & Y

Prior to performing any statistical or machine learning modelling, it is best practice to assess the quality of the data and remove metabolites that lack reproducible measurements (Broadhurst et al. 2018). In this dataset, ST001047.xlsx, the QC-RSD and the percentage of missing values have been previously calculated (refer to the peak sheet). In this Jupyter Notebook, we remove all metabolites that do not meet the following criteria:

  • QC-RSD less than 20%
  • Fewer than 10% of values are missing
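For readers unfamiliar with QC-RSD, the sketch below computes a relative standard deviation across pooled QC measurements and applies the two criteria above. It assumes the common definition (100 × sample standard deviation / mean); how the deposited QC_RSD values were actually derived is not stated here, and the function names and numbers are hypothetical.

```python
from statistics import mean, stdev


def qc_rsd(qc_values):
    """Relative standard deviation (%) of one metabolite across QC samples."""
    return 100.0 * stdev(qc_values) / mean(qc_values)


def keep_peak(rsd, perc_missing, rsd_max=20.0, missing_max=10.0):
    """Apply both cleaning criteria from the text above."""
    return rsd < rsd_max and perc_missing < missing_max


# Hypothetical QC measurements for a single metabolite.
rsd = qc_rsd([9.0, 10.0, 11.0])
print(round(rsd, 2))                     # 10.0
print(keep_peak(rsd, perc_missing=2.5))  # True
print(keep_peak(rsd, perc_missing=12))   # False
```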

The following steps are needed to: (a) extract the binary outcome (i.e. GC vs. HE) and (b) extract, transform, and scale the metabolite data matrix, with missing values imputed.

  • Create a subset of DataTable called DataTable2, only with samples in the Class “GC” or “HE”
  • Set Y to a list (or 1D array) of binary outcomes based on the Class column from DataTable2 (“GC”=1 and “HE”=0)
  • Create the variable peaklist to hold the names (M1...Mn) of the metabolites to be used
  • Using this peaklist, extract all corresponding columns (i.e. metabolite data) from DataTable2, and place it in matrix X
  • Log-transform the values in X
  • Using the helper function cb.utils.scale(), scale the log-transformed data to unit variance (a.k.a. auto scaling).
  • Impute the missing values by using a k-nearest neighbour approach (with three neighbours) using the helper function cb.utils.knnimpute() to give the final matrix, XTknn
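The log-transform and auto scaling steps can be sketched for a single metabolite column as follows. This is an illustration only, not the cb.utils.scale() implementation: it assumes scaling to zero mean and unit sample standard deviation (ddof = 1), represents missing values as None, and leaves them in place for a later imputation step.

```python
import math


def log_autoscale(column):
    """Log-transform one metabolite column, then scale to zero mean, unit variance.

    Missing values (None) are skipped when computing the mean and standard
    deviation and stay missing in the output.
    """
    logged = [math.log(v) if v is not None else None for v in column]
    observed = [v for v in logged if v is not None]
    centre = sum(observed) / len(observed)
    # Sample standard deviation (ddof = 1); an assumption of this sketch.
    spread = math.sqrt(sum((v - centre) ** 2 for v in observed) / (len(observed) - 1))
    return [(v - centre) / spread if v is not None else None for v in logged]


# Hypothetical concentrations whose natural logs are 0, 1 and 2.
scaled = log_autoscale([1.0, math.e, None, math.e ** 2])
print(scaled)
```

Scaling before imputation means the nearest-neighbour distances are not dominated by high-abundance metabolites.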

In [4]:
# Clean PeakTable
RSD = PeakTable['QC_RSD']
PercMiss = PeakTable['Perc_missing']
PeakTableClean = PeakTable[(RSD < 20) & (PercMiss < 10)]

# Select Subset of Data
DataTable2 = DataTable[(DataTable.Class == "GC") | (DataTable.Class == "HE")]

# Create a Binary Y Vector 
Outcomes = DataTable2['Class']
Y = [1 if outcome == 'GC' else 0 for outcome in Outcomes]
Y = np.array(Y)

# Extract and Scale Metabolite Data 
peaklist = PeakTableClean['Name']
XT = DataTable2[peaklist]
XTlog = np.log(XT)
XTscale = cb.utils.scale(XTlog, method='auto')
XTknn = cb.utils.knnimpute(XTscale, k=3)

4. Hyperparameter Optimisation

The CIMCB helper function cb.cross_val.kfold() is used to carry out k-fold cross-validation (k=5) on a set of ANN-SS models with a varying number of neurons (2 to 6) and learning rate (0.01 to 0.05) to determine the optimal hyperparameter values. In k-fold cross-validation, the original dataset is randomly split into k equally sized folds and the model is trained for k iterations; in each iteration the model is trained on k - 1 folds and tested on the remaining fold (Kohavi 1995). This helper function requires values for the following parameters:

  • model: the class of model used by the function, cb.model.NN_SigmoidSigmoid
  • X: the metabolite data matrix, XTknn
  • Y: the binary outcome vector, Y
  • param_dict: a dictionary, param_dict, that describes all key:value pairs to search, with the key name corresponding to the hyperparameter in the model class and the value as the list of possible values
  • folds: the number of folds in the k-fold cross validation
  • n_mc: the number of Monte Carlo repetitions of the k-fold CV
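The splitting step of k-fold cross-validation can be sketched with the standard library as shown below. This is a plain (non-stratified) split for illustration; whether cb.cross_val.kfold() stratifies by class is not stated here, and the function names are hypothetical.

```python
import random


def kfold_indices(n_samples, k, seed=None):
    """Shuffle the sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def train_test_splits(n_samples, k, seed=None):
    """Yield (train, test) index lists; each fold is the test set exactly once."""
    folds = kfold_indices(n_samples, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 83 samples corresponds to the GC (43) and HE (40) groups described earlier.
for train, test in train_test_splits(n_samples=83, k=5, seed=42):
    print(len(train), len(test))
```

Repeating this split with different shuffles corresponds to the Monte Carlo repetitions controlled by n_mc.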

In [5]:
# Parameter Dictionary
lr = [0.01,0.02,0.03,0.04,0.05]
neurons = [2, 3, 4, 5, 6]

param_dict = dict(learning_rate=lr,
                  n_neurons=neurons,
                  epochs=400,
                  momentum=0.5,
                  decay=0,
                  loss='binary_crossentropy',
                  seed=seed)

# Initialise
cv = cb.cross_val.kfold(model=cb.model.NN_SigmoidSigmoid,                      
                                X=XTknn,                                 
                                Y=Y,                               
                                param_dict=param_dict,                   
                                folds=5,
                                n_mc=10)                              

# Run 
cv.run()  
Number of cores set to: 8
Running ...
1/2: 100%|██████████| 25/25 [00:19<00:00,  1.27it/s]
2/2: 100%|██████████| 250/250 [01:34<00:00,  2.65it/s]
Time taken: 2.79 minutes with 8 cores
Done!

4.1. Plot R² & Q²

When cv.plot(metric='r2q2', method='ratio') is run, 6 plots of $R^2$ and $Q^2$ statistics are displayed: (a) heatmap of $R^2$, (b) heatmap of $Q^2$, (c) heatmap of | ($R^2 - Q^2$) / $R^2$ |, (d) | ($R^2 - Q^2$) / $R^2$ | vs. $Q^2$, (e) $R^2$ and $Q^2$ against the learning rate, and (f) $R^2$ and $Q^2$ against the number of neurons. Alternatively, if method='standard', | $R^2 - Q^2$ | is used instead of | ($R^2 - Q^2$) / $R^2$ | . The optimal hyperparameter values are selected based on the point of inflection in figure b, or if a clear inflection point is not present, where | ($R^2 - Q^2$) / $R^2$ | = 0.2. Note, the $R^2$ is the mean coefficient of determination for the full dataset, and the $Q^2$ is the mean coefficient of determination for the cross-validated prediction dataset over the 10 Monte Carlo repetitions. When cv.plot(metric='auc') is run, the predictability of the model is presented as area under the ROC curve (AUC), $AUC(full)$ & $AUC(cv)$, a non-parametric alternative to $R^2$ & $Q^2$. The following parameters of cv.plot() can be altered:

  • metric: the metric used for the plots (default = 'r2q2'). Alternative metrics include 'auc', 'acc', 'f1score', 'prec', 'sens', and 'spec'
  • method: the types of plots displayed (default = 'ratio'). Alternative value is 'standard'
  • ci: the confidence interval in figure b (default = 95)
  • legend: to show legend (default = True). Alternative value is False
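The two overfitting measures described above can be worked through on a small example. The sketch assumes the standard coefficient of determination, 1 - SS_res / SS_tot; the function names and the predicted values are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def overfit_measure(r2, q2, method="ratio"):
    """'ratio' gives |(R2 - Q2) / R2|; 'standard' gives |R2 - Q2|."""
    if method == "ratio":
        return abs((r2 - q2) / r2)
    return abs(r2 - q2)


y = [1, 1, 0, 0]
y_full = [0.9, 0.8, 0.2, 0.1]  # hypothetical predictions from the full model
y_cv = [0.8, 0.6, 0.4, 0.2]    # hypothetical cross-validated predictions

r2 = r_squared(y, y_full)
q2 = r_squared(y, y_cv)
print(round(r2, 3), round(q2, 3))                     # 0.9 0.6
print(round(overfit_measure(r2, q2, "ratio"), 3))     # 0.333
print(round(overfit_measure(r2, q2, "standard"), 3))  # 0.3
```

In this example the ratio (0.333) exceeds the 0.2 guideline, which would point towards a simpler model.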


In [6]:
cv.plot(metric='auc', method='ratio', ci=95, legend=True)
cv.plot(metric='r2q2', method='ratio', ci=95, legend=True)
Loading BokehJS ...
Loading BokehJS ...

4.2. Plot Projections: Full & CV

When cv.plot_projections() is run, an n x n grid of plots is displayed, where n is the number of neurons in the hidden layer to interrogate. These plots include score plots, distribution plots, and receiver operating characteristic (ROC) curves.

There are C(n,2) score plots (i.e. a score plot for every combination of 2 neurons e.g. neuron 1 scores vs. neuron 2 scores). Each score plot includes the full scores (as circles) and CV scores (as crosses) coloured by group, as well as the 95% confidence interval ellipses for the full scores (as solid lines) and CV scores (as dashed lines). Additionally, the optimal line of separation (dashed grey line) and orthogonal line (solid grey line) are shown.

There are n distribution plots (one distribution plot for the scores of each neuron). Each plot shows the distribution of the full and CV scores for each corresponding group (i.e. 4 discrete distributions overlaid for 2 groups). Each distribution is calculated using kernel density estimation, a standard non-parametric method used for estimation of a probability density function based on a set of data points (Silverman 1986).

There are C(n,2) ROC curves (a ROC curve for every combination of 2 neurons e.g. neuron 1 scores vs. neuron 2 scores). As the ROC curves are for every combination of 2 neurons, the discrimination is calculated based on optimal separation (i.e. the grey line from the corresponding score plot). For each ROC curve plot there is a ROC curve for the full model (green), and a ROC curve for the CV model with 95% confidence intervals (yellow). Additionally, the equal distribution line (dashed black line) is shown.

  • **optional_arguments: optional arguments to specify model hyperparameters if they are changed in this search e.g. learning_rate=0.02 (except number of components). By default, the max value of each hyperparameter is used (unless specified).
  • components: Neurons to plot (default = "all" ; plot all components). Alternatively, list the components to plot e.g. [1,3,4]
  • plot: Data to show (default = 'ci' ; plot only 95% confidence interval ellipses). Alternative values include 'meanci', 'full', 'cv', and 'all'
  • label: Add labels to groups in scores plot (default = None ; refers to groups as 0/1)
  • legend: Show legends for plots (default = 'all'). Alternative values are 'scatter', 'dist', 'roc', and 'none'
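The plot counts described above add up to the n x n grid: two sets of C(n,2) pairwise plots plus n distribution plots give n² panels. The short sketch below enumerates them for three neurons; the dictionary layout is purely illustrative and says nothing about where each panel is placed in the grid.

```python
from itertools import combinations
from math import comb


def grid_layout(n_neurons):
    """List the panels produced for n neurons, grouped by plot type."""
    neurons = range(1, n_neurons + 1)
    pairs = list(combinations(neurons, 2))
    return {"score": pairs, "distribution": list(neurons), "roc": pairs}


layout = grid_layout(3)
total = sum(len(panels) for panels in layout.values())

print(layout["score"])  # [(1, 2), (1, 3), (2, 3)]
print(comb(3, 2))       # 3
print(total)            # 9
```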

In [7]:
cv.plot_projections(learning_rate=0.03,
                    components=[1,2,3],  
                    plot="ci",
                    label=DataTable2.Class,
                    legend="none")
Loading BokehJS ...

5. Build Model

An ANN-SS model using cb.model.NN_SigmoidSigmoid is created and initialised using the optimal hyperparameter values determined in step 4. Following this initialisation, the ANN-SS model is trained using the .train(X, Y) method where the X matrix is XTknn and the Y vector is Y. The implementation of ANN-SS in the cb.model.NN_SigmoidSigmoid class uses Keras with a Theano backend.
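To make the ANN-SS architecture concrete, the sketch below runs a single forward pass through a sigmoid hidden layer followed by a single sigmoid output neuron. It is a standard-library illustration of the structure only, not the Keras model built by cb.model.NN_SigmoidSigmoid; the function names and weights are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def ann_ss_forward(x, w_hidden, b_hidden, w_out, b_out):
    """Forward pass for one sample.

    x: list of feature values.
    w_hidden: one list of input weights per hidden neuron.
    Returns the hidden-layer activations and the predicted score.
    """
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + b)
        for weights, b in zip(w_hidden, b_hidden)
    ]
    output = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)
    return hidden, output


# Hypothetical sample with three metabolites and a two-neuron hidden layer.
x = [0.5, -1.0, 2.0]
w_hidden = [[0.2, -0.4, 0.1], [-0.3, 0.5, 0.7]]
b_hidden = [0.0, 0.1]
w_out = [1.5, -2.0]
b_out = 0.2

hidden, score = ann_ss_forward(x, w_hidden, b_hidden, w_out, b_out)
print(hidden, score)
```

The hidden activations play the role of the neuron scores plotted later, and the output is the predicted score between 0 and 1.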


In [8]:
# Build Model
model = cb.model.NN_SigmoidSigmoid(learning_rate=0.03, 
                                  n_neurons=2,
                                  epochs=400,
                                  momentum=0.5, 
                                  decay=0, 
                                  loss='binary_crossentropy',
                                  seed=seed)

# Train Model
Ypred = model.train(XTknn, Y)
#Ypred_test = model.test(XTknn) # To test model

6. Permutation Test

After a model has been trained, the .permutation_test() method can be used to assess the reliability of the trained model (after selecting the hyperparameter values). For the permutation test, the metabolite data matrix is randomised (permuted or 'shuffled'), while the Y (i.e. outcome) is fixed, and the model is subsequently trained and tested on this randomised data (Szymańska et al. 2012). This process is repeated (in this case, n=100) to construct a distribution against which the model can be fairly assessed. For a dataset with features that have no meaningful contribution, we would expect a similar $R^2$ and $Q^2$ to a randomised dataset, while for a dataset with features with meaningful contribution, we would expect an $R^2$ and $Q^2$ significantly higher than that of the randomised dataset. When .permutation_test() is run, 2 plots are displayed: (a) $R^2$ and $Q^2$ against "correlation of permuted data against original data", and (b) probability density functions for $R^2$ and $Q^2$, with the $R^2$ and $Q^2$ values found for the model trained on original data presented as ball-and-stick. The following parameter values of .permutation_test() can be altered:

  • metric: the metric used for the plots (default = 'r2q2'). Alternative metrics include 'auc', 'acc', 'f1score', 'prec', 'sens', and 'spec'. Multiple metrics can be plotted using a list e.g. ['r2q2', 'auc']
  • nperm: the number of permutations. (default = 100)
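The logic of a permutation test can be sketched without retraining a network. In the example below a simple difference in class means stands in for the $R^2$ and $Q^2$ of a retrained model, which is an assumption made to keep the example short; the function names and data are hypothetical.

```python
import random


def mean_difference(x, y):
    """Stand-in test statistic: mean of class 1 minus mean of class 0."""
    ones = [xi for xi, yi in zip(x, y) if yi == 1]
    zeros = [xi for xi, yi in zip(x, y) if yi == 0]
    return sum(ones) / len(ones) - sum(zeros) / len(zeros)


def permutation_p_value(x, y, statistic, nperm=100, seed=None):
    """Shuffle x against a fixed y and count how often the statistic is matched."""
    rng = random.Random(seed)
    observed = statistic(x, y)
    exceed = 0
    for _ in range(nperm):
        shuffled = x[:]
        rng.shuffle(shuffled)
        if statistic(shuffled, y) >= observed:
            exceed += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (exceed + 1) / (nperm + 1)


# Hypothetical single feature that separates the two classes perfectly.
x = [float(v) for v in range(20)]
y = [0] * 10 + [1] * 10

p_value = permutation_p_value(x, y, mean_difference, nperm=100, seed=42)
print(p_value)
```

A small p-value indicates that the observed statistic is unlikely to arise from data with no real relationship to the outcome.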

In [9]:
model.permutation_test(metric=['r2q2', 'auc'],
                       nperm=100)
100%|██████████| 100/100 [01:22<00:00,  1.20it/s]
Loading BokehJS ...

7. Bootstrap Resampling of the Model

Bootstrap resampling is a resampling method based on random resampling with replacement, commonly used to provide an estimate of the sampling distribution of a test statistic (Efron, 1982). In the context of this workflow, the ANN-SS model from step 5 with its fixed hyperparameter values (i.e. number of neurons = 2) is retrained on the data resampled with replacement (in-bag) and evaluated on the unused data (out-of-bag) for 100 resamples. After the model is evaluated for each bootstrap, metrics including the predicted values (ypred), neuron scores, neuron loadings, and feature importance (Connection Weight and Garson's Algorithm) are stored and used to calculate 95% confidence intervals. To calculate the 95% confidence intervals, the common bias-corrected (BC) bootstrap method is used (as opposed to the basic percentile method), where the percentiles are adjusted to account for the bias in the bootstrap distribution from the original distribution. For details on the methodology behind the BC method, refer to Efron (1982). To create and run the bootmodel, the following parameter values need to be set:

  • model: A model with fixed hyperparameter values for bootstrap resampling
  • bootnum: The number of bootstrap resamples (default = 100)
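The two ideas above, in-bag/out-of-bag resampling and the bias-corrected interval, can be sketched with the standard library. The sample mean stands in for a model metric, and the BC adjustment follows the textbook form (shift the percentiles by twice the bias term z0). The function names are hypothetical and this is not the cb.bootstrap.BC implementation.

```python
import random
from statistics import NormalDist


def bootstrap_indices(n, rng):
    """Sample n indices with replacement (in-bag); the unused ones are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in chosen]
    return in_bag, out_of_bag


def bc_interval(boot_stats, original_stat, alpha=0.05):
    """Bias-corrected percentile interval for a bootstrap distribution."""
    nd = NormalDist()
    b = len(boot_stats)
    # Proportion of bootstrap statistics below the original estimate.
    prop = sum(s < original_stat for s in boot_stats) / b
    prop = min(max(prop, 1 / (b + 1)), b / (b + 1))  # keep inv_cdf finite
    z0 = nd.inv_cdf(prop)
    lower_p = nd.cdf(2 * z0 + nd.inv_cdf(alpha / 2))
    upper_p = nd.cdf(2 * z0 + nd.inv_cdf(1 - alpha / 2))
    ordered = sorted(boot_stats)
    return ordered[int(lower_p * (b - 1))], ordered[int(upper_p * (b - 1))]


rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(50)]  # hypothetical measurements
original_mean = sum(data) / len(data)

boot_means = []
for _ in range(100):
    in_bag, out_of_bag = bootstrap_indices(len(data), rng)
    boot_means.append(sum(data[i] for i in in_bag) / len(in_bag))

lower, upper = bc_interval(boot_means, original_mean)
print(f"mean = {original_mean:.3f}, 95% BC interval = ({lower:.3f}, {upper:.3f})")
```

When the bootstrap distribution is centred on the original estimate, z0 is zero and the interval reduces to the basic percentile interval.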

In [10]:
bootmodel = cb.bootstrap.BC(model, bootnum=100)
bootmodel.run()
Number of cores set to: 8
100%|██████████| 100/100 [00:17<00:00,  5.84it/s]
Time taken: 0.34 minutes with 8 cores

8. Model Evaluation

After the bootmodel has been run, the .evaluate() method can be used to provide an estimate of the robustness and a measure of the generalised predictability of the model. Three plots are produced when this method is run: a violin plot, a probability density function, and a ROC curve. The violin plot shows the distribution of the median predicted score for the in-bag and out-of-bag (i.e. train and test) by group. The distribution plot shows the probability density function of the median predicted score for the in-bag and out-of-bag (i.e. train and test) by group. The ROC curve shows the initial model ROC curve with the 95% CI for the in-bag (green) and the median and 95% CI for the out-of-bag (yellow). There are three options for visualising the 95% CI for the in-bag: "null", where a bias-corrected 95% CI is used; "parametric", where the upper limit mirrors the lower limit; and "nonparametric", where the upper limit mirrors the lower limit, with the exception that the upper limit remains constant or increases (i.e. does not decrease) as 1 - Specificity increases. These options can be selected by altering the following parameters:

  • bc: method used to calculate 95% CI for the ROC Curve (default = 'nonparametric'). Alternative values are 'parametric' and 'null'
  • label: Add labels to groups (default = None ; refer to groups as 0/1)
  • legend: Show legends for plots (default = 'all'). Alternative values are 'roc', 'dist', 'violin', and 'none'
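The area under a ROC curve can be computed directly from predicted scores, which may help when reading the ROC plot. The sketch below uses the pairwise (rank-based) definition: the fraction of positive/negative pairs in which the positive sample scores higher, with ties counted as half. The function name and scores are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a positive sample outranks a negative one."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


y = [1, 1, 0, 0]
print(roc_auc(y, [0.9, 0.8, 0.2, 0.1]))  # 1.0 (perfect separation)
print(roc_auc(y, [0.9, 0.4, 0.6, 0.1]))  # 0.75
print(roc_auc(y, [0.5, 0.5, 0.5, 0.5]))  # 0.5 (no discrimination)
```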

In [11]:
bootmodel.evaluate(bc='nonparametric',
                   label=DataTable2.Class,
                   legend='all') 
Loading BokehJS ...

9. Model Visualisation


9.1 Plot Projections: in-bag & out-of-bag

After the bootmodel has been run, the .plot_projections() method can be used to visualise the neuron scores. When this method is run, an n x n grid of plots is displayed, where n is the number of neurons in the hidden layer. These plots include score plots, distribution plots, and receiver operating characteristic (ROC) curves.

There are C(n,2) score plots (i.e. a score plot for every combination of 2 neurons e.g. neuron 1 scores vs. neuron 2 scores). Each score plot includes the in-bag scores (as circles) and out-of-bag scores (as crosses) coloured by group, as well as the 95% confidence interval ellipses for the in-bag scores (as solid lines) and out-of-bag scores (as dashed lines). Additionally, the optimal line of separation (dashed grey line) and orthogonal line (solid grey line) are shown.

There are n distribution plots (one distribution plot for the scores of each neuron). Each plot shows the distribution of the in-bag and out-of-bag scores for each corresponding group (i.e. 4 discrete distributions overlaid for 2 groups). Each distribution is calculated using kernel density estimation, a standard non-parametric method used for estimation of a probability density function based on a set of data points (Silverman 1986).

There are C(n,2) ROC curves (a ROC curve for every combination of 2 neurons e.g. neuron 1 scores vs. neuron 2 scores). As the ROC curves are for every combination of 2 neurons, the discrimination is calculated based on optimal separation (i.e. the grey line from the corresponding score plot). For each ROC curve plot there is a ROC curve with the neuron scores for the initial model with the 95% confidence intervals using the in-bag neuron scores (green), and a ROC curve for the out-of-bag neuron scores with 95% confidence intervals. Additionally, the equal distribution line (dashed black line) is shown.

  • plot: Data to show in plot (default = "ci" ; plot only 95% confidence interval ellipses). Alternative values include 'meanci', 'ib', 'oob', and 'all'
  • label: Add labels to groups in scores plot (default = None ; refer to groups as 0/1).
  • bc: method used to calculate 95% CI for the ROC Curve (default = 'nonparametric'). Alternative values are 'parametric' and 'null'
  • legend: Show legends for plots (default = 'all'). Alternative values are 'scatter', 'dist', 'roc', and 'none'

In [12]:
bootmodel.plot_projections(plot='ib',
                           label=DataTable2.Class,
                           bc='nonparametric',                       
                           legend='all')
Loading BokehJS ...

9.2 Plot Loadings

After the bootmodel has been run, the .plot_loadings() method can be used to visualise the neuron loadings (i.e. weights). When this method is run, n plots are displayed, where n is the number of neurons in the hidden layer. The circles in each loadings plot represent the neuron loadings for the initial model. The 95% confidence intervals are calculated using the bias-corrected (BC) bootstrap method in step 7. Any metabolite loading with a confidence interval crossing the zero line is considered non-significant to the neuron. This method requires values for the following parameters:

  • PeakTable: Cleaned PeakTable from step 3
  • peaklist: Peaks to include in plot (default = None; include all peaks).
  • ylabel: Name of column in PeakTable to use as the ylabel (default = "Label")
  • sort: Whether to sort plots in absolute descending order (default = True)

In [13]:
bootmodel.plot_loadings(PeakTable,
                        peaklist,
                        ylabel='Label',  # change ylabel to 'Name' 
                        sort=False)      # change sort to True
Loading BokehJS ...

9.3 Plot Feature Importance

After the bootmodel has been run, the .plot_featureimportance() method can be used to visualise the feature importance metrics. When this method is run, 2 plots are displayed; Connection Weight plot and Garson's Algorithm plot. These feature importance metrics are alternatives to the coefficient and variable importance in projection (VIP) in PLS.

The values in the Connection Weight plot contain information about the overall contribution of each metabolite (Olden et al. 2004). The values can be either positive or negative, and therefore contribute positively or negatively to the model. Any metabolite connection weight with a confidence interval crossing the zero line is considered non-significant to the model.

The values in the Garson's Algorithm plot contain information about the overall contribution of each metabolite (Garson 1991). These values are absolute, with the higher values representing a higher significance to the model. Unlike in a VIP plot, there is no standard cut-off used to determine whether metabolites are considered "important" in the model.

This method exports the feature importance metrics as a pandas DataFrame (table). This method also requires values for the following parameters:

  • PeakTable: Cleaned PeakTable from step 3
  • peaklist: Peaks to include in plot (default = None; include all peaks).
  • ylabel: Name of column in PeakTable to use as the ylabel (default = "Label")
  • sort: Whether to sort plots in absolute descending order (default = True)
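Both feature importance metrics can be computed from the two weight layers of a trained network. The sketch below follows the usual descriptions of the connection weight approach (sum over hidden neurons of input-to-hidden weight times hidden-to-output weight) and Garson's algorithm (normalised absolute contributions), ignoring bias terms. It is an illustration with hypothetical weights and function names, not the cimcb implementation.

```python
def connection_weights(w_in, w_out):
    """Signed importance per input: sum over hidden neurons of w_in * w_out.

    w_in[h][i] is the weight from input i to hidden neuron h;
    w_out[h] is the weight from hidden neuron h to the output neuron.
    """
    n_inputs = len(w_in[0])
    return [
        sum(w_in[h][i] * w_out[h] for h in range(len(w_out)))
        for i in range(n_inputs)
    ]


def garson(w_in, w_out):
    """Relative (unsigned) importance per input; the values sum to 1."""
    n_inputs = len(w_in[0])
    totals = [0.0] * n_inputs
    for h, out_weight in enumerate(w_out):
        contributions = [abs(w_in[h][i] * out_weight) for i in range(n_inputs)]
        share = sum(contributions)
        for i in range(n_inputs):
            totals[i] += contributions[i] / share
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


# Hypothetical weights: three metabolites, two hidden neurons.
w_in = [[1.0, -2.0, 0.5], [0.5, 1.0, -1.5]]
w_out = [2.0, -1.0]

print(connection_weights(w_in, w_out))  # [1.5, -5.0, 2.5]
print(garson(w_in, w_out))
```

Here the second metabolite has the largest absolute contribution under both metrics, while only the connection weight shows that its contribution is negative.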

In [14]:
feature_importance = bootmodel.plot_featureimportance(PeakTable,
                                         peaklist,
                                         ylabel='Label',  # change ylabel to 'Name' 
                                         sort=False)      # change sort to True
Loading BokehJS ...

10. Export Results

The feature importance table created in step 9.3 can be exported using the inbuilt .to_excel() function within a pandas DataFrame. This function requires an input with the name of the file to create, and it can include directories by using the ‘ / ’ symbol. In the cell below, the table feature_importance is exported as an Excel file called 'ANNSigSig_ST001047.xlsx' in the 'results' folder.

In [15]:
export_folder = 'results/'
export_file = 'ANNSigSig_ST001047.xlsx'

feature_importance.to_excel(export_folder + export_file)
print("Done!")
Done!