Sparse Group Partial Least Squares (sgPLS)
Function to perform sparse group Partial Least Squares (sgPLS) in the context of datasets that are divided into groups of variables. The sgPLS approach enables selection at both the group and the single-feature levels.
Usage
sgPLS(X, Y, ncomp, mode = "regression",
      max.iter = 500, tol = 1e-06, keepX,
      keepY = NULL, ind.block.x, ind.block.y = NULL, alpha.x, alpha.y = NULL,
      upper.lambda = 10 ^ 5, scale = TRUE)
Arguments
- X
Numeric matrix of predictors.
- Y
Numeric vector or matrix of responses (for multi-response models).
- ncomp
The number of components to include in the model (see Details).
- mode
Character string. The type of algorithm to use, (partially) matching one of "regression" or "canonical". See Details.
- max.iter
Integer, the maximum number of iterations.
- tol
A positive real, the tolerance used in the iterative algorithm.
- keepX
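Numeric vector of length ncomp, the number of groups of variables to keep in the \(X\)-loadings on each component (in the Examples, keepX = c(4, 4) retains four groups per component). By default all variables are kept in the model.
- keepY
Numeric vector of length ncomp, the number of groups of variables to keep in the \(Y\)-loadings on each component. By default all variables are kept in the model.
- ind.block.x
A vector of integers describing the grouping of the \(X\) variables: each value is the index of the last variable of a group, and the final group runs to the last column (see the example in the Details section).
- ind.block.y
A vector of integers describing the grouping of the \(Y\) variables, in the same format (see the example in the Details section).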
- alpha.x
The mixing parameter (value between 0 and 1) related to the sparsity within group for the \(X\) dataset.
- alpha.y
The mixing parameter (value between 0 and 1) related to the sparsity within group for the \(Y\) dataset.
- upper.lambda
A large value giving the upper bound of the interval of lambda values searched for the tuning parameter (lambda) corresponding to a non-zero group of variables. By default upper.lambda = 10 ^ 5.
- scale
A logical indicating whether the original data set needs to be scaled. By default scale = TRUE.
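To make the role of the mixing parameters alpha.x and alpha.y more concrete, the sketch below evaluates a sparse group lasso-type penalty on a loading vector. It assumes the usual form in which alpha weights the lasso (single-variable) term and 1 - alpha weights the group term. The function sg_penalty, the square-root group-size weights and the toy values are assumptions made for illustration; this is not the package's internal code.

```r
# Hypothetical helper (not part of sgPLS): sparse group lasso-type penalty
# for a loading vector u whose entries belong to the groups in `group`.
sg_penalty <- function(u, group, lambda, alpha) {
  # Group term: Euclidean norm of each group, weighted by sqrt(group size).
  group_norms <- tapply(u, group, function(v) sqrt(length(v)) * sqrt(sum(v^2)))
  # alpha = 1 leaves only the lasso term, alpha = 0 only the group term.
  (1 - alpha) * lambda * sum(group_norms) + alpha * lambda * sum(abs(u))
}

u <- c(0.5, -0.5, 0, 0, 0, 0)   # toy loading vector
group <- c(1, 1, 1, 2, 2, 2)    # two groups of three variables

sg_penalty(u, group, lambda = 1, alpha = 1)     # lasso term only
sg_penalty(u, group, lambda = 1, alpha = 0)     # group term only
sg_penalty(u, group, lambda = 1, alpha = 0.95)  # mostly within-group sparsity
```

Under this assumed form, values of alpha close to 1 favour sparsity within the selected groups, while values close to 0 tend to keep or drop whole groups.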
Details
The sgPLS function fits sgPLS models with \(1, \ldots ,\) ncomp components. Multi-response models are fully supported.
The type of algorithm to use is specified with the mode argument. Two sgPLS algorithms are available: sgPLS regression ("regression") and sgPLS canonical analysis ("canonical") (see References).
For example, ind.block.x <- c(3, 10, 15) means that \(X\) is structured into 4 groups: X1 to X3, X4 to X10, X11 to X15, and X16 to X\(p\), where \(p\) is the number of variables in the \(X\) matrix.
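As a minimal sketch of how these cut points translate into group membership, the base R snippet below expands ind.block.x into one group label per column. The helper name block_to_group and the value p = 20 are assumptions made for illustration; they are not part of the sgPLS package.

```r
# Hypothetical helper (not part of sgPLS): expand the cut points in
# ind.block into one group label per variable.
block_to_group <- function(ind.block, p) {
  # Each cut point is the index of the last variable in its group;
  # the final group runs from the last cut point + 1 up to p.
  upper <- c(ind.block, p)
  sizes <- diff(c(0, upper))
  rep(seq_along(sizes), times = sizes)
}

ind.block.x <- c(3, 10, 15)
group <- block_to_group(ind.block.x, p = 20)  # p = 20 is an assumed value
table(group)  # group sizes: 3, 7, 5 and 5 variables
```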
Value
sgPLS returns an object of class "sgPLS", a list that contains the following components:
- X
The centered and standardized original predictor matrix.
- Y
The centered and standardized original response vector or matrix.
- ncomp
The number of components included in the model.
- mode
The algorithm used to fit the model.
- keepX
Number of groups of \(X\) variables kept in the model on each component.
- keepY
Number of groups of \(Y\) variables kept in the model on each component.
- mat.c
Matrix of coefficients to be used internally by predict.
- variates
List containing the variates.
- loadings
List containing the estimated loadings for the \(X\) and \(Y\) variates.
- names
List containing the names to be used for individuals and variables.
- tol
The tolerance used in the iterative algorithm, used for subsequent S3 methods.
- max.iter
The maximum number of iterations, used for subsequent S3 methods.
- iter
Vector containing the number of iterations for convergence in each component.
- ind.block.x
A vector of integers describing the grouping of the \(X\) variables.
- ind.block.y
A vector of consecutive integers describing the grouping of the \(Y\) variables.
- alpha.x
The mixing parameter related to the sparsity within group for the \(X\) dataset.
- alpha.y
The mixing parameter related to the sparsity within group for the \(Y\) dataset.
- upper.lambda
The upper bound of the interval of lambda values searched for the tuning parameter (lambda) corresponding to a non-zero group of variables.
References
Liquet, B., Lafaye de Micheaux, P., Hejblum, B. and Thiebaut, R. (2016). A group and Sparse Group Partial Least Square approach applied in Genomics context. Bioinformatics.
Le Cao, K.-A., Martin, P.G.P., Robert-Granié, C. and Besse, P. (2009). Sparse canonical methods for biological data integration: application to a cross-platform study. BMC Bioinformatics 10:34.
Le Cao, K.-A., Rossouw, D., Robert-Granié, C. and Besse, P. (2008). A sparse PLS for variable selection when integrating Omics data. Statistical Applications in Genetics and Molecular Biology 7, article 35.
Shen, H. and Huang, J. Z. (2008). Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis 99, 1015-1034.
Tenenhaus, M. (1998). La régression PLS: théorie et pratique. Paris: Editions Technip.
Wold, H. (1966). Estimation of principal components and related models by iterative least squares. In: Krishnaiah, P. R. (editor), Multivariate Analysis. Academic Press, N.Y., 391-420.
Examples
library(sgPLS)
library(mvtnorm)  # provides rmvnorm()

## Simulation of datasets X and Y with group variables
n <- 100
sigma.gamma <- 1
sigma.e <- 1.5
p <- 400
q <- 500
theta.x1 <- c(rep(1, 15), rep(0, 5), rep(-1, 15), rep(0, 5), rep(1.5, 15),
              rep(0, 5), rep(-1.5, 15), rep(0, 325))
theta.x2 <- c(rep(0, 320), rep(1, 15), rep(0, 5), rep(-1, 15), rep(0, 5),
              rep(1.5, 15), rep(0, 5), rep(-1.5, 15), rep(0, 5))
theta.y1 <- c(rep(1, 15), rep(0, 5), rep(-1, 15), rep(0, 5), rep(1.5, 15),
              rep(0, 5), rep(-1.5, 15), rep(0, 425))
theta.y2 <- c(rep(0, 420), rep(1, 15), rep(0, 5), rep(-1, 15), rep(0, 5),
              rep(1.5, 15), rep(0, 5), rep(-1.5, 15), rep(0, 5))
Sigmax <- matrix(0, nrow = p, ncol = p)
diag(Sigmax) <- sigma.e ^ 2
Sigmay <- matrix(0, nrow = q, ncol = q)
diag(Sigmay) <- sigma.e ^ 2
set.seed(125)
gam1 <- rnorm(n)
gam2 <- rnorm(n)
## Latent variables (gam1, gam2) times loadings, plus independent Gaussian noise
X <- matrix(c(gam1, gam2), ncol = 2, byrow = FALSE) %*%
  matrix(c(theta.x1, theta.x2), nrow = 2, byrow = TRUE) +
  rmvnorm(n, mean = rep(0, p), sigma = Sigmax, method = "svd")
Y <- matrix(c(gam1, gam2), ncol = 2, byrow = FALSE) %*%
  matrix(c(theta.y1, theta.y2), nrow = 2, byrow = TRUE) +
  rmvnorm(n, mean = rep(0, q), sigma = Sigmay, method = "svd")
## Groups of 20 consecutive variables in both X and Y
ind.block.x <- seq(20, 380, 20)
ind.block.y <- seq(20, 480, 20)
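## Fit a two-component sgPLS regression model
model.sgPLS <- sgPLS(X, Y, ncomp = 2, mode = "regression",
                     keepX = c(4, 4), keepY = c(4, 4),
                     ind.block.x = ind.block.x,
                     ind.block.y = ind.block.y,
                     alpha.x = c(0.95, 0.95), alpha.y = c(0.95, 0.95))
## Number of variables selected in each group, per component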
result.sgPLS <- select.sgpls(model.sgPLS)
result.sgPLS$group.size.X
#>    size comp1 comp2
#> 1    20    15     0
#> 2    20    15     0
#> 3    20    16     0
#> 4    20    15     0
#> 5    20     0     0
#> 6    20     0     0
#> 7    20     0     0
#> 8    20     0     0
#> 9    20     0     0
#> 10   20     0     0
#> 11   20     0     0
#> 12   20     0     0
#> 13   20     0     0
#> 14   20     0     0
#> 15   20     0     0
#> 16   20     0     0
#> 17   20     0    15
#> 18   20     0    15
#> 19   20     0    16
#> 20   20     0    15
result.sgPLS$group.size.Y
#>    size comp1 comp2
#> 1    20    15     0
#> 2    20    15     0
#> 3    20    16     0
#> 4    20    16     0
#> 5    20     0     0
#> 6    20     0     0
#> 7    20     0     0
#> 8    20     0     0
#> 9    20     0     0
#> 10   20     0     0
#> 11   20     0     0
#> 12   20     0     0
#> 13   20     0     0
#> 14   20     0     0
#> 15   20     0     0
#> 16   20     0     0
#> 17   20     0     0
#> 18   20     0     0
#> 19   20     0     0
#> 20   20     0     0
#> 21   20     0     0
#> 22   20     0    15
#> 23   20     0    15
#> 24   20     0    15
#> 25   20     0    15
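
The group.size tables above report, for each group, how many variables have a non-zero loading on each component. As a rough illustration of that bookkeeping, the base R sketch below counts non-zero loadings per group on a toy loading matrix. The helper count_selected and the toy values are assumptions made for illustration and are not part of the sgPLS package.

```r
# Hypothetical helper (not part of sgPLS): count non-zero loadings per group
# and per component, given a loading matrix with one column per component.
count_selected <- function(loadings, ind.block) {
  p <- nrow(loadings)
  sizes <- diff(c(0, ind.block, p))              # number of variables per group
  group <- rep(seq_along(sizes), times = sizes)  # group label of each variable
  selected <- rowsum((loadings != 0) * 1, group) # non-zero counts per group
  cbind(size = sizes, selected)
}

# Toy loading matrix: 6 variables in 2 groups of 3, and 2 components.
toy <- cbind(comp1 = c(0.7, 0, -0.7, 0, 0, 0),
             comp2 = c(0, 0, 0, 0.6, 0.8, 0))
counts <- count_selected(toy, ind.block = 3)
counts
```

Assuming the loadings component of the fitted object stores the \(X\) loadings as a matrix named X with one column per component, count_selected(model.sgPLS$loadings$X, ind.block.x) would be expected to give a table of the same shape as result.sgPLS$group.size.X.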