Objectives:
The objective of this document is to give a brief introduction to dimensionality reduction and discretization. By the end of this tutorial you will have learned how to perform feature subset selection (with wrappers and filters), feature extraction (PCA, SVD, factor analysis, MDS and Isomap), mRMR feature selection, and discretization in R.
Dimensionality reduction is performed in two different ways: feature subset selection, and feature reduction (also called feature extraction). This tutorial briefly covers both.
Let’s load our main data to use:
data <- read.csv(url("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"),
header = T, sep=";")
Feature subset selection aims to select an optimal subset of the existing features for modelling purposes. There are two ways to perform feature subset selection: using wrappers or using filters. Both approaches are covered below.
Wrappers are methods that evaluate predictor (feature) performance by adding features to or removing them from models and measuring the resulting model performance.
For wrapper methods, we will use the caret package.
# if ("caret" %in% rownames(installed.packages()) == FALSE){
# install.packages("caret")
# }
require(caret)
When using recursive feature elimination, all variables are initially included in the model. Variables are then removed iteratively, model performance is recomputed, and the optimal set of features is determined.
In the caret package, recursive feature elimination can be used with several models, such as linear regression (lmFuncs), random forests (rfFuncs), naive Bayes (nbFuncs) and bagged trees (treebagFuncs). You can also use other functions that work with caret's train function. For further information, check caret's package documentation.
#subset the data
data.train.index <- createDataPartition(data[,12], p=.8, list = F, times = 1)
data.train <- data[data.train.index,]
#Set the control variables for feature selection
#We use a linear regression model (lmFuncs) and 10-fold cross-validation (cv) for model evaluation
control.rfe <- rfeControl(functions=lmFuncs, method="cv", number=10)
#x defines predictors, while y defines the output
results.rfe <- rfe(x = data.train[,-12], y = data.train[,12], sizes = c(1:11),
rfeControl = control.rfe)
print(results.rfe) #Print the results
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables RMSE Rsquared MAE RMSESD RsquaredSD MAESD Selected
## 1 0.8548 0.09364 0.6581 0.03536 0.03710 0.02310
## 2 0.8379 0.12799 0.6451 0.03490 0.04469 0.02172
## 3 0.8355 0.13264 0.6442 0.03471 0.04668 0.02297
## 4 0.8327 0.13761 0.6421 0.03386 0.04212 0.02218
## 5 0.8237 0.15382 0.6361 0.03721 0.05350 0.02451
## 6 0.7812 0.23802 0.6049 0.03808 0.06265 0.02412
## 7 0.7672 0.26684 0.5932 0.03092 0.04395 0.01861
## 8 0.7622 0.27411 0.5910 0.02742 0.03546 0.01886
## 9 0.7620 0.27458 0.5908 0.02775 0.03557 0.01900
## 10 0.7598 0.27887 0.5882 0.02850 0.03456 0.01970 *
## 11 0.7600 0.27858 0.5881 0.02870 0.03492 0.01980
##
## The top 5 variables (out of 10):
## density, volatile.acidity, pH, sulphates, chlorides
predictors(results.rfe) #Print the names of selected variables
## [1] "density" "volatile.acidity" "pH"
## [4] "sulphates" "chlorides" "alcohol"
## [7] "residual.sugar" "fixed.acidity" "citric.acid"
## [10] "free.sulfur.dioxide"
trellis.par.set(caretTheme())#Set the theme for the RMSE plot
plot(results.rfe, type = c("g","o"))#Plot RMSE
In the table, RMSE refers to "root mean squared error". In general, we want this to be as low as possible (while also paying attention to possible overfitting problems).
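For reference, RMSE can also be computed by hand; the helper below is a minimal sketch, and the baseline prediction in the example (the mean of the outcome) is used only for illustration.
#RMSE is the square root of the mean squared difference
#between observed and predicted values
rmse <- function(observed, predicted){
  sqrt(mean((observed - predicted)^2))
}
#Example: RMSE of a naive model that always predicts the mean quality
rmse(data.train[,12], rep(mean(data.train[,12]), nrow(data.train)))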
Filters are methods that evaluate predictor (feature) performance outside of any model. They usually score the variables based on different criteria.
For filter methods, we will use the FSelector package.
#install.packages("FSelector")
require(FSelector)
These methods score features using entropy-based criteria.
Information Gain
weights.ig <- information.gain(quality~.,
data = data.train) #Compute the weights of variables
print(weights.ig) #Print the weights
## attr_importance
## fixed.acidity 0.040864113
## volatile.acidity 0.028661446
## citric.acid 0.046100975
## residual.sugar 0.015631862
## chlorides 0.055416377
## free.sulfur.dioxide 0.025967672
## total.sulfur.dioxide 0.054063089
## density 0.094802184
## pH 0.009277158
## sulphates 0.012191765
## alcohol 0.152440118
subset.ig <- cutoff.k(weights.ig, 5) #Get the most influential 5 variables
f.ig <- as.simple.formula(subset.ig, "quality") #Express the relationship as a formula
print(f.ig) #Print formula
## quality ~ alcohol + density + chlorides + total.sulfur.dioxide +
## citric.acid
## <environment: 0x0000000041f65a50>
Gain Ratio
weights.gr <- gain.ratio(quality~.,
data = data.train) #Compute the weights of variables
print(weights.gr) #Print the weights
## attr_importance
## fixed.acidity 0.03838954
## volatile.acidity 0.03297964
## citric.acid 0.04851085
## residual.sugar 0.01452914
## chlorides 0.05090511
## free.sulfur.dioxide 0.03249979
## total.sulfur.dioxide 0.03943687
## density 0.08035167
## pH 0.01569906
## sulphates 0.01173380
## alcohol 0.09679671
subset.gr <- cutoff.k(weights.gr, 5) #Get the most influential 5 variables
f.gr <- as.simple.formula(subset.gr, "quality") #Express the relationship as a formula
print(f.gr) #Print formula
## quality ~ alcohol + density + chlorides + citric.acid + total.sulfur.dioxide
## <environment: 0x0000000041dba0d0>
Chi-Squared
Computes the chi-squared statistic between each variable and the outcome to measure their dependence.
weights.chi <- chi.squared(quality~.,
data = data.train) #Compute the weights of variables
print(weights.chi) #Print the weights
## attr_importance
## fixed.acidity 0.1639365
## volatile.acidity 0.1734001
## citric.acid 0.1745144
## residual.sugar 0.1256604
## chlorides 0.2348272
## free.sulfur.dioxide 0.1680390
## total.sulfur.dioxide 0.1615530
## density 0.2453800
## pH 0.1368993
## sulphates 0.1106995
## alcohol 0.2685477
subset.chi <- cutoff.k(weights.chi, 5) #Get the most influential 5 variables
f.chi <- as.simple.formula(subset.chi, "quality") #Express the relationship as a formula
print(f.chi) #Print formula
## quality ~ alcohol + density + chlorides + citric.acid + volatile.acidity
## <environment: 0x0000000041499370>
Linear Correlation
Computes correlations using the Pearson coefficient to check whether there is a linear relationship.
weights.lc <- linear.correlation(quality~.,
data = data.train) #Compute the weights of variables
print(weights.lc) #Print the weights
## attr_importance
## fixed.acidity 0.119125708
## volatile.acidity 0.201295979
## citric.acid 0.005699735
## residual.sugar 0.081020161
## chlorides 0.214043309
## free.sulfur.dioxide 0.025059470
## total.sulfur.dioxide 0.162478872
## density 0.293223538
## pH 0.110598017
## sulphates 0.061679814
## alcohol 0.428421927
subset.lc <- cutoff.k(weights.lc, 5) #Get the most influential 5 variables
f.lc <- as.simple.formula(subset.lc, "quality") #Express the relationship as a formula
print(f.lc) #Print formula
## quality ~ alcohol + density + chlorides + volatile.acidity +
## total.sulfur.dioxide
## <environment: 0x0000000040375ab8>
Rank Correlation
Computes correlations using the Spearman coefficient to check whether there is a monotonic relationship. This can be more useful in the case of non-linear but monotonic relationships. Let's illustrate the difference:
x <- (1:100) #Generate numbers from 1 to 100
y <- exp(x) #Exponentiate them
cor(x, y, method = "pearson") #Equals to roughly 0.25
## [1] 0.252032
cor(x, y, method = "spearman") #Equals to 1
## [1] 1
Now let’s filter using rank correlation:
weights.rc <- rank.correlation(quality~.,
data = data.train) #Compute the weights of variables
print(weights.rc) #Print the weights
## attr_importance
## fixed.acidity 0.08897530
## volatile.acidity 0.19474660
## citric.acid 0.02048422
## residual.sugar 0.06591007
## chlorides 0.31356636
## free.sulfur.dioxide 0.03195105
## total.sulfur.dioxide 0.18688722
## density 0.33710912
## pH 0.11790922
## sulphates 0.03781927
## alcohol 0.43276843
subset.rc <- cutoff.k(weights.rc, 5) #Get the most influential 5 variables
f.rc <- as.simple.formula(subset.rc, "quality") #Express the relationship as a formula
print(f.rc) #Print formula
## quality ~ alcohol + density + chlorides + volatile.acidity +
## total.sulfur.dioxide
## <environment: 0x000000004202bd58>
The filter methods return largely overlapping subsets: alcohol, density and chlorides are selected by every method, while the remaining two variables differ slightly between the entropy-based and correlation-based criteria.
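To check the overlap programmatically, you can intersect the subsets computed above; a quick sketch:
#Variables selected by every filter method
Reduce(intersect, list(subset.ig, subset.gr, subset.chi, subset.lc, subset.rc))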
Feature reduction (or feature extraction) techniques aim to generate new, more informative features from the existing set of features, condensing the information carried by the original features into a smaller number of newly generated ones.
Linear feature extraction techniques aim to generate new features by using a linear combination of existing features.
Principal component analysis (PCA) projects the original n features onto k uncorrelated components (k <= n). Instead of using the original variables, you can use the computed principal components for modelling.
fit.pca <- prcomp(data.train[,-12])
summary(fit.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 43.662 12.77092 4.62419 1.03755 0.84008 0.13619
## Proportion of Variance 0.911 0.07794 0.01022 0.00051 0.00034 0.00001
## Cumulative Proportion 0.911 0.98890 0.99912 0.99964 0.99997 0.99998
## PC7 PC8 PC9 PC10 PC11
## Standard deviation 0.11989 0.10721 0.0934 0.02016 0.0005903
## Proportion of Variance 0.00001 0.00001 0.0000 0.00000 0.0000000
## Cumulative Proportion 0.99999 1.00000 1.0000 1.00000 1.0000000
Based on the PCA summary, the first two principal components account for nearly 99% of the variance in the data, so we can use these two components instead of the whole set of variables.
new.data.pca <- fit.pca$x[,1:2]
summary(new.data.pca)
## PC1 PC2
## Min. :-221.044 Min. :-71.1260
## 1st Qu.: -30.015 1st Qu.: -7.5495
## Median : 4.425 Median : 0.1139
## Mean : 0.000 Mean : 0.0000
## 3rd Qu.: 31.343 3rd Qu.: 8.0233
## Max. : 132.712 Max. : 57.8401
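Note that prcomp centers the data but does not scale it by default. Since the wine variables are measured on very different scales, the components above are dominated by the variables with the largest variances. If that is not desired, you can standardize the variables first; a minimal sketch:
#Rerun PCA with all variables scaled to unit variance
fit.pca.scaled <- prcomp(data.train[,-12], center = TRUE, scale. = TRUE)
summary(fit.pca.scaled) #The variance is now spread across more components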
Singular value decomposition (SVD) is similar to PCA; it also projects the data onto lower dimensions. SVD is a numerical method, while PCA is an analysis approach. In R, the prcomp function uses SVD to calculate the principal components, which is numerically more stable than computing the eigendecomposition of the covariance matrix. The output of prcomp is also easier to interpret than the output of svd; for example, summary does not produce a meaningful report for the output of svd.
fit.svd <- svd(data.train[,-12])
To see how much of the total each component accounts for (measured here by the cumulative proportion of the singular values), use the following code:
cumsum(fit.svd$d)/sum(fit.svd$d)
## [1] 0.8656029 0.9394550 0.9667310 0.9909514 0.9962393 0.9978389 0.9985331
## [8] 0.9991597 0.9997037 0.9998859 1.0000000
According to the results, the first two components account for about 94% of the total, so we can use those two.
new.data.svd <- fit.svd$u[,1:2]
summary(new.data.svd)
## V1 V2
## Min. :-0.038853 Min. :-0.0899225
## 1st Qu.:-0.018458 1st Qu.:-0.0093666
## Median :-0.014816 Median : 0.0004081
## Mean :-0.015286 Mean : 0.0001455
## 3rd Qu.:-0.011957 3rd Qu.: 0.0102804
## Max. :-0.001188 Max. : 0.0698730
As mentioned previously, using prcomp is more intuitive and easier than using svd.
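The two approaches are closely related: applying svd to the centered data matrix reproduces prcomp's results, and proportions of variance correspond to the squared singular values. A small sketch illustrating the correspondence:
#SVD of the centered data matrix
data.centered <- scale(data.train[,-12], center = TRUE, scale = FALSE)
fit.svd.centered <- svd(data.centered)
#Standard deviations of the principal components, computed two ways
head(fit.svd.centered$d / sqrt(nrow(data.centered) - 1))
head(fit.pca$sdev)
#Proportion of variance explained uses the squared singular values
cumsum(fit.svd.centered$d^2) / sum(fit.svd.centered$d^2)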
Factor analysis is a general term for methods that use linear projections (PCA is one example). Instead of using a specific method, we can use general factor analysis to reduce the dimensionality.
First, we need to determine the number of factors we want to obtain.
# if ("nFactors" %in% rownames(installed.packages()) == FALSE){
# install.packages("nFactors")
# }
library(nFactors)
ev <- eigen(cor(data.train[,-12])) # get eigenvalues
ap <- parallel(subject=nrow(data.train[,-12]),var=ncol(data.train[,-12]),
rep=100,cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)
plotnScree(nS)
In the scree plot we obtained, the optimal coordinates criterion suggests three factors. This means the optimal number of factors that explain the variability in the data is three.
So we can now obtain our factors:
fit.fa <- factanal(data.train[,-12], 3, rotation="varimax")
print(fit.fa, digits=2, cutoff=.1, sort=TRUE)
##
## Call:
## factanal(x = data.train[, -12], factors = 3, rotation = "varimax")
##
## Uniquenesses:
## fixed.acidity volatile.acidity citric.acid
## 0.00 0.99 0.91
## residual.sugar chlorides free.sulfur.dioxide
## 0.00 0.87 0.89
## total.sulfur.dioxide density pH
## 0.71 0.00 0.71
## sulphates alcohol
## 0.97 0.22
##
## Loadings:
## Factor1 Factor2 Factor3
## residual.sugar 0.96 0.18 0.21
## density 0.69 0.65 0.31
## alcohol -0.29 -0.83
## fixed.acidity -0.13 0.99
## volatile.acidity
## citric.acid 0.29
## chlorides 0.36
## free.sulfur.dioxide 0.29 0.16
## total.sulfur.dioxide 0.32 0.42 0.11
## pH -0.13 0.21 -0.48
## sulphates 0.16
##
## Factor1 Factor2 Factor3
## SS loadings 1.71 1.56 1.45
## Proportion Var 0.16 0.14 0.13
## Cumulative Var 0.16 0.30 0.43
##
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 4254.76 on 25 degrees of freedom.
## The p-value is 0
As mentioned previously, factors are linear combinations of variables. fit.fa$loadings holds the information about which variables contribute to which factor and their coefficients in the linear combination.
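If you want to use the factors themselves as a reduced feature set, factanal can also return per-observation factor scores; a minimal sketch using the regression scoring method:
#Refit the model, this time requesting factor scores
fit.fa.scores <- factanal(data.train[,-12], 3, rotation = "varimax",
                          scores = "regression")
new.data.fa <- fit.fa.scores$scores #One column per factor
head(new.data.fa)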
Non-linear feature extraction techniques aim to generate new features by using a non-linear combination of existing features.
Multidimensional scaling uses similarity (distance) measures to reduce the dimension of the data. First we compute a distance matrix and then, based on that matrix, we reduce the dimension.
d <- dist(data.train[,-12])
fit.mds <- cmdscale(d,eig=TRUE, k=2) #Reduce data to two variables
new.data.mds <- fit.mds$points
You can use new.data.mds instead of all the variables in the dataset.
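Because we called cmdscale with eig=TRUE, the fit also reports a goodness-of-fit measure indicating how well two dimensions reproduce the original distances; a quick check:
fit.mds$GOF #Values close to 1 indicate a faithful two-dimensional representation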
Isomap is similar to multidimensional scaling: it extends metric MDS by incorporating the geodesic distances imposed by a weighted neighbourhood graph.
# if ("vegan" %in% rownames(installed.packages()) == FALSE){
# install.packages("vegan")
# }
require(vegan)
d <- vegdist(data.train[,-12])
fit.iso <- isomap(d, ndim=2, k = 3, fragmentedOK = T)
#Reduce data to two variables (ndim) and retain 3 (k) distances per data point.
#Data might be fragmented, we tell the function that it's ok.
new.data.iso <- fit.iso$points
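vegan also provides a plot method for isomap objects, so you can visually inspect the embedding; a quick sketch:
plot(fit.iso) #Plot the two retained dimensions and the neighbourhood network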
Feature selection means choosing the most appropriate subset of attributes with respect to the target attribute. Here, we will cover Minimum Redundancy Maximum Relevance (mRMR) feature selection. Note that there are other methods available for feature selection, such as purely mutual-information-based criteria.
mRMR tries to maximize relevance with respect to the target variable, measured by mutual information, while choosing variables whose mutual information with the already selected variables is minimal (minimizing redundancy).
In order to perform mRMR feature selection, we will use the mRMRe package.
# if ("mRMRe" %in% rownames(installed.packages()) == FALSE){
# install.packages("mRMRe")
# }
require(mRMRe)
data(cgps) #load the data
We will use the mRMR.classic function to select features. This function requires a data frame whose columns are of the following types: "numeric", "ordered_factor" and "Surv". We should also convert the data set to the mRMRe.Data type. Then we call mRMR.classic with the parameters data, target_indices (the index of the target feature (column) in the data matrix), and feature_count (the number of features to be selected by mRMR feature selection). In this example, we will select 3 features for each of 3 different target variables.
# Convert the data type
feature_data <- mRMR.data(data = data.frame(cgps.ge))
# Create an mRMR filter and obtain the indices of selected features
filter3 <- mRMR.classic(data = feature_data, target_indices = 3:5,
feature_count = 3)
solutions(filter3)
## $`3`
## [,1]
## [1,] 592
## [2,] 805
## [3,] 673
##
## $`4`
## [,1]
## [1,] 578
## [2,] 339
## [3,] 251
##
## $`5`
## [,1]
## [1,] 149
## [2,] 586
## [3,] 160
In the output, we can see the three features selected for each target index. For example, the mRMR function selected the 592nd, 805th and 673rd variables for the 3rd target variable; the 578th, 339th and 251st variables for the 4th target variable; and the 149th, 586th and 160th variables for the 5th target variable.
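The indices refer to the columns of the data used to build the mRMRe.Data object, so you can map them back to feature names; a minimal sketch for the 3rd target variable:
#Look up the names of the features selected for target index 3
selected.idx <- solutions(filter3)[["3"]]
colnames(data.frame(cgps.ge))[selected.idx]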
Discretization methods aim to discretize continuous variables. For discretization we will use two packages, namely discretization and arules.
Assume that we want to perform binning (either equal width or equal frequency); the following code allows us to do that. Initially, we have a data frame of 11 predictor variables and one outcome variable, and we want to discretize the numeric predictor variables. The discretize function only works on vectors, so instead of discretizing the predictor variables one by one inside a for loop, we can use the lapply function, which takes a function and applies it to all of the variables in a data frame. The discretize function takes three main inputs: the data vector to be discretized, the method, and the number of categories. Inside the lapply call, we also need to specify the method and the number of categories. lapply splits the data frame into vectors, passes them to the function as input, and returns a list as output.
The discretize function returns a vector of factors representing the interval that each particular data point falls into. So if you want a numeric discretization, you need to apply the as.numeric function to all variables; again we use lapply for this. Finally, we convert the list back to a data frame. For association mining, you may want to keep the factor form of the discretization.
#Install and load packages
# if ("arules" %in% rownames(installed.packages()) == FALSE){
# install.packages("arules")
# }
require(arules)
#Loop through all variables in dataset to discretize.
data.eqw <- NULL
for (i in 1:11){
d <- discretize(data.train[,i], method = "interval", categories = 3)
data.eqw <- cbind(data.eqw, d)
}
colnames(data.eqw) <- names(data.train[,-12]) #cbind produces a matrix, so use colnames
#Or equivalently we can use lapply syntax
#Apply equal width binning discretization to all variables in dataset.
data.eqw <- data.frame(lapply(data.train[,-12],
FUN = discretize, method = "interval", categories = 3))
#Take a look at first few data points in the data.eqw
head(data.eqw)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## 1 [ 3.90, 7.33) [0.08,0.42) [0.000,0.553) [ 0.6,22.3)
## 2 [ 3.90, 7.33) [0.08,0.42) [0.000,0.553) [ 0.6,22.3)
## 3 [ 3.90, 7.33) [0.08,0.42) [0.000,0.553) [ 0.6,22.3)
## 4 [ 7.33,10.77) [0.08,0.42) [0.000,0.553) [ 0.6,22.3)
## 5 [ 3.90, 7.33) [0.08,0.42) [0.000,0.553) [ 0.6,22.3)
## 6 [ 3.90, 7.33) [0.08,0.42) [0.000,0.553) [ 0.6,22.3)
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## 1 [0.009,0.121) [ 2.0, 50.2) [128,247) [0.987,1.004)
## 2 [0.009,0.121) [ 2.0, 50.2) [128,247) [0.987,1.004)
## 3 [0.009,0.121) [ 2.0, 50.2) [128,247) [0.987,1.004)
## 4 [0.009,0.121) [ 2.0, 50.2) [ 9,128) [0.987,1.004)
## 5 [0.009,0.121) [ 2.0, 50.2) [128,247) [0.987,1.004)
## 6 [0.009,0.121) [ 2.0, 50.2) [128,247) [0.987,1.004)
## pH sulphates alcohol
## 1 [2.72,3.09) [0.22,0.50) [ 8.0,10.1)
## 2 [3.09,3.45) [0.22,0.50) [ 8.0,10.1)
## 3 [3.09,3.45) [0.22,0.50) [ 8.0,10.1)
## 4 [3.09,3.45) [0.22,0.50) [10.1,12.1)
## 5 [3.09,3.45) [0.22,0.50) [ 8.0,10.1)
## 6 [2.72,3.09) [0.22,0.50) [ 8.0,10.1)
#Turn it into a numeric data frame
data.eqw <- data.frame(lapply(data.eqw, as.numeric))
#Take a second look at first few data points in the data.eqw
head(data.eqw)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1 1 1 1 1
## 2 1 1 1 1 1
## 3 1 1 1 1 1
## 4 2 1 1 1 1
## 5 1 1 1 1 1
## 6 1 1 1 1 1
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 1 2 1 1 1 1
## 2 1 2 1 2 1 1
## 3 1 2 1 2 1 1
## 4 1 1 1 2 1 2
## 5 1 2 1 2 1 1
## 6 1 2 1 1 1 1
#For equal frequency
data.eqf <- lapply(data.train[,-12],
FUN = discretize, method = "frequency", categories = 3)
data.eqf <- data.frame(lapply(data.eqf, as.numeric))
#For k-means clustering discretization
data.eqc <- lapply(data.train[,-12],
FUN = discretize, method = "cluster", categories = 3)
data.eqc <- data.frame(lapply(data.eqc, as.numeric))
#You can also use user-specified intervals
#Let's assume that we want to discretize by the quantiles of each variable
cats <- data.frame(lapply(data.train[,-12], FUN = quantile))
#We need to add -Inf and Inf as lower and upper boundaries.
#The rep function replicates the given value the given number of times.
cats <- rbind(rep(-Inf, 11), cats, rep(Inf, 11))
#In this case we need to use mapply instead of lapply because
#we have multiple different inputs to the function for each variable
data.us <- data.frame(mapply(data.train[,-12],
FUN = discretize, method = "fixed", categories = cats))
data.us <- data.frame(lapply(data.us, as.numeric))
If you want to use different boundaries for each variable, just bind them into a data.frame and pass it to mapply as we did above. Don't forget to add -Inf and Inf.
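For a single variable, the same idea looks like the sketch below; the cut points chosen for alcohol are made up purely for illustration:
#Discretize alcohol with hand-picked (hypothetical) boundaries
alcohol.fixed <- discretize(data.train$alcohol, method = "fixed",
                            categories = c(-Inf, 9.5, 11, Inf))
table(alcohol.fixed) #Counts per user-specified interval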
We can also use the Minimum Description Length Principle (mdlp) to discretize the data points; it uses an entropy criterion to determine the optimal discretization. It returns two lists: one holds the cut points (boundaries) of the discretization, and the other holds the discretized data. Keep in mind that mdlp is a supervised discretization method, so you need to provide the outcome variable along with the dataset. By default, mdlp assumes the last column in your data frame is the outcome variable, which is true in our case as our outcome variable is "quality".
#Install and load packages
# if ("discretization" %in% rownames(installed.packages()) == FALSE){
# install.packages("discretization")
# }
require(discretization)
#Loop through all variables in dataset to discretize.
data.mdlp <- mdlp(data.train)
summary(data.mdlp$Disc.data)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:4.000 1st Qu.:1.00
## Median :1.000 Median :2.000 Median :4.000 Median :1.00
## Mean :1.057 Mean :1.626 Mean :3.757 Mean :1.49
## 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:4.000 3rd Qu.:2.00
## Max. :2.000 Max. :3.000 Max. :5.000 Max. :3.00
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000
## Median :2.000 Median :2.000 Median :2.000 Median :3.000
## Mean :2.016 Mean :2.097 Mean :2.273 Mean :2.263
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :3.000 Max. :3.000 Max. :3.000 Max. :3.000
## pH sulphates alcohol quality
## Min. :1.00 Min. :1 Min. :1.000 Min. :3.000
## 1st Qu.:1.00 1st Qu.:1 1st Qu.:2.000 1st Qu.:5.000
## Median :1.00 Median :1 Median :3.000 Median :6.000
## Mean :1.32 Mean :1 Mean :2.962 Mean :5.874
## 3rd Qu.:2.00 3rd Qu.:1 3rd Qu.:4.000 3rd Qu.:6.000
## Max. :2.00 Max. :1 Max. :5.000 Max. :9.000
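The cut points that mdlp found for each variable are stored in the cutp component of the result; a quick look:
data.mdlp$cutp[1:3] #Cut points for the first three predictors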
Useful Links: