
Introduction

This page presents an application of the sPLS performance assessment. The sPLS method is unusual in that it produces several predictions, one for each number of components kept in the model. The goal is to choose the number of components in sPLS regression that gives the best possible predictions, and also to select the best number of variables. For that, we will use three datasets:

  • one is a dataset with only one response variable $Y$.

  • another is a dataset with four response variables $Y = (Y_1, Y_2, Y_3, Y_4)$.

  • the last dataset contains real data about NIR spectra.

To load the predefined functions from the sgPLSdevelop package and create these datasets, run these lines:

library(sgPLSdevelop)
library(pls)

data1 <- data.create(p = 10, list = TRUE)
data2 <- data.create(p = 10, q = 4, list = TRUE)
data(yarn)
data3 <- yarn
## [1] "First dataset dimensions : 40 x 11"
## [1] "Second dataset dimensions : 40 x 14"
## [1] "Yarn dataset dimensions : 28 x 3"

For the first two datasets, the sample size is set to $n = 40$ by default, which is close to real conditions. Note also that, on average, the response $Y$ is a linear combination of the predictors $X$. Indeed, the function computes the matrix product $Y = XB + E$, with $B$ the weight matrix and $E$ the Gaussian noise matrix. This linearity matters for the performance of the model, because the PLS method relies on linear combinations.
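As an illustration of the relation $Y = XB + E$, here is a minimal base R sketch of such a data-generating process. It is not the code of data.create: the distributions chosen for X, B and E, the noise level and the seed are all assumptions made for the example.

```r
# Sketch of a linear data-generating process Y = XB + E.
# This is NOT the implementation of data.create(); the distributions,
# the noise level and the seed below are assumptions for illustration.
set.seed(1)                      # fixed seed so the draw is reproducible

n <- 40                          # sample size, as in the text
p <- 10                          # number of predictors
q <- 1                           # number of response variables

X <- matrix(rnorm(n * p), nrow = n, ncol = p)            # predictors
B <- matrix(runif(p * q, -1, 1), nrow = p, ncol = q)     # hypothetical weight matrix
E <- matrix(rnorm(n * q, sd = 0.5), nrow = n, ncol = q)  # Gaussian noise

Y <- X %*% B + E                 # response: linear signal plus noise

dim(cbind(X, Y))                 # 40 rows, p + q = 11 columns
```

Setting q to 4 in this sketch gives a 40 x 14 table, which matches the dimensions reported for the second dataset.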

Now it is time to train an sPLS model on each dataset, whether built or imported.

ncomp.max <- 8

# First model
X <- data1$X
Y <- data1$Y
model1 <- sPLS(X,Y,mode = "regression", ncomp = ncomp.max)

# Second model
X <- data2$X
Y <- data2$Y
model2 <- sPLS(X,Y,mode = "regression", ncomp = ncomp.max)

# Third model
X <- data3$NIR
Y <- data3$density
model3 <- sPLS(X,Y,mode = "regression", ncomp = ncomp.max)

sPLS performance assessment using MSEP

A good way to assess the performance of such a model is the $MSEP$ (mean squared error of prediction) criterion. The $MSEP$ is computed as follows:

$$MSEP = \frac{1}{nq} \sum_{i=1}^{n} \sum_{j=1}^{q} (Y_{i,j} - \hat{Y}_{i,j})^2$$
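The formula can be written directly in base R. The helper below is a sketch: the name msep is hypothetical and is not a function of sgPLSdevelop. In a performance assessment, the predictions would typically come from observations held out of the fit, for example by cross-validation.

```r
# MSEP for an n x q response matrix Y and predictions Y.hat of the same shape.
# 'msep' is a hypothetical helper name, not part of sgPLSdevelop.
msep <- function(Y, Y.hat) {
  Y <- as.matrix(Y)
  Y.hat <- as.matrix(Y.hat)
  stopifnot(all(dim(Y) == dim(Y.hat)))       # shapes must agree
  sum((Y - Y.hat)^2) / (nrow(Y) * ncol(Y))   # double sum divided by n * q
}

# Toy check: every prediction is off by exactly 2, so the MSEP is 2^2 = 4.
Y <- matrix(1:6, nrow = 3, ncol = 2)
Y.hat <- Y + 2
msep(Y, Y.hat)
```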

The function named perf.sPLS assesses the performance and helps choose the best parameters. It returns the best tuning parameters and also produces two MSEP plots: one shows the MSEP as a function of the number of components, and the other shows the MSEP as a function of the number of selected variables in $X$ and $Y$. In multivariate sPLS ($q > 1$) in particular, the “plot” is a table of MSEP values.
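To make the selection step concrete, the sketch below shows how a best number of components and a best pair of variable counts can be read off MSEP values by taking the minimum. The MSEP numbers are made up purely to demonstrate the lookup, and the object names msep.by.comp and msep.table are hypothetical. This shows the general idea, not necessarily how the package implements it.

```r
# Made-up MSEP values, one per number of components (illustration only).
msep.by.comp <- c(1.20, 0.95, 0.70, 0.62, 0.66, 0.71)
h.best <- which.min(msep.by.comp)   # number of components with the lowest MSEP

# Made-up MSEP table for the multivariate case (q > 1):
# rows = number of X variables kept, columns = number of Y variables kept.
msep.table <- matrix(
  c(0.90, 0.80, 0.85,
    0.70, 0.55, 0.60,
    0.75, 0.65, 0.72),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("3", "6", "9"), c("1", "2", "3"))
)

# Position (row, column) of the smallest entry of the table.
idx <- arrayInd(which.min(msep.table), dim(msep.table))
keepX.best <- as.integer(rownames(msep.table)[idx[1, 1]])
keepY.best <- as.integer(colnames(msep.table)[idx[1, 2]])

c(h.best = h.best, keepX.best = keepX.best, keepY.best = keepY.best)
```

With these toy values the minimum over components is reached at 4 components, and the minimum of the table at 6 variables in X and 2 variables in Y.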

First model MSEP

perf.res1 <- tuning.sPLS.XY(model1)

h.best <- perf.res1$h.best
keepX.best <- perf.res1$keepX.best
keepY.best <- perf.res1$keepY.best

The result stored in perf.res1 gives an optimal number of components $H = 7$, so we suggest keeping 7 components in our first model. It also suggests selecting 8 variables for each component.

Second model MSEP

perf.res2 <- tuning.sPLS.XY(model2)

h.best <- perf.res2$h.best
keepX.best <- perf.res2$keepX.best
keepY.best <- perf.res2$keepY.best

The result stored in perf.res2 gives an optimal number of components $H = 8$, so we suggest keeping 8 components in our second model. It also suggests selecting 9 variables in $X$ and 3 variables in $Y$.