33 Comparison of six Linear Regression algorithms
- Datasets: BostonHousing
- Algorithms: LM, GLM, GLMNET, SVM, CART, KNN
- Objective: Comparison of various algorithms
33.1 Introduction
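These are the algorithms used:
- LM (linear regression)
- GLM (generalized linear model)
- GLMNET (penalized linear regression)
- SVM (support vector machine with a radial kernel)
- CART (regression tree)
- KNN (k-nearest neighbors)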
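# load packages
library(mlbench)  # provides the BostonHousing dataset
library(caret)    # createDataPartition(), train(), resamples(), preProcess()
# attach the BostonHousing dataset
data(BostonHousing)
dplyr::glimpse(BostonHousing)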
#> Rows: 506
#> Columns: 14
#> $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829…
#> $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, …
#> $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7…
#> $ chas <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524…
#> $ rm <dbl> 6.58, 6.42, 7.18, 7.00, 7.15, 6.43, 6.01, 6.17, 5.63, 6.00, 6…
#> $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, …
#> $ dis <dbl> 4.09, 4.97, 4.97, 6.06, 6.06, 6.06, 5.56, 5.95, 6.08, 6.59, 6…
#> $ rad <dbl> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4…
#> $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 3…
#> $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 1…
#> $ b <dbl> 397, 397, 393, 395, 397, 394, 396, 397, 387, 387, 393, 397, 3…
#> $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.1…
#> $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 1…
#> # A tibble: 506 x 14
#> crim zn indus chas nox rm age dis rad tax ptratio b
#> <dbl> <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 0.00632 18 2.31 0 0.538 6.58 65.2 4.09 1 296 15.3 397.
#> 2 0.0273 0 7.07 0 0.469 6.42 78.9 4.97 2 242 17.8 397.
#> 3 0.0273 0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8 393.
#> 4 0.0324 0 2.18 0 0.458 7.00 45.8 6.06 3 222 18.7 395.
#> 5 0.0690 0 2.18 0 0.458 7.15 54.2 6.06 3 222 18.7 397.
#> 6 0.0298 0 2.18 0 0.458 6.43 58.7 6.06 3 222 18.7 394.
#> # … with 500 more rows, and 2 more variables: lstat <dbl>, medv <dbl>
33.2 Workflow
- Load dataset
- Create train and test datasets, 80/20
- Inspect dataset:
- Dimension
- classes
- Analyze features
- correlation
- Visualize features
- histograms
- density plots
- pairwise
- correlogram
- Train as-is
- Set the train control to
- 10 cross-validations
- 3 repetitions
- Metric: RMSE
- Train the models
- Compare accuracy of models
- Visual comparison
- dot plot
- Train with Feature selection
- Feature selection
- generate new dataset
- Train models again
- Compare RMSE again
- Visual comparison
- dot plot
- Train with dataset transformation
- data transformation
- Center
- Scale
- BoxCox
- Train models
- Compare RMSE
- Visual comparison
- dot plot
- Tune the best model
- Set the train control to
- 10 cross-validations
- 3 repetitions
- Metric: RMSE
- Train the models
- Radial SVM
- Sigma vector
- .C
- BoxCox
- Ensembling
- Select the algorithms
- Random Forest
- Stochastic Gradient Boosting
- Cubist
- Numeric comparison
- resample
- summary
- Visual comparison
- dot plot
- Tune the best model: Cubist
- Set the train control to
- 10 cross-validations
- 3 repetitions
- Metric: RMSE
- Train the models
- Cubist
- BoxCox
- Evaluate the tuning parameters
- Numeric comparison
- print tuned model
- Visual comparison
- scatter plot
- Numeric comparison
- Finalize the model
- Back transformation
- Summary
- Apply model to validation set
- Transform the dataset
- Make prediction
- Calculate the RMSE
33.3 Preparing the data
# Split out validation dataset
# create a list of 80% of the rows in the original dataset we can use for training
validationIndex <- createDataPartition(BostonHousing$medv,
p=0.80, list=FALSE)
# select 20% of the data for validation
validation <- BostonHousing[-validationIndex,]
# use the remaining 80% of the data for training and testing the models
dataset <- BostonHousing[validationIndex,]
# list types for each attribute
sapply(dataset, class)
#> crim zn indus chas nox rm age dis
#> "numeric" "numeric" "numeric" "factor" "numeric" "numeric" "numeric" "numeric"
#> rad tax ptratio b lstat medv
#> "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
# take a peek at the first 20 rows of the data
head(dataset, n=20)
#> crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
#> 1 0.00632 18.0 2.31 0 0.538 6.58 65.2 4.09 1 296 15.3 397 4.98 24.0
#> 2 0.02731 0.0 7.07 0 0.469 6.42 78.9 4.97 2 242 17.8 397 9.14 21.6
#> 3 0.02729 0.0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8 393 4.03 34.7
#> 4 0.03237 0.0 2.18 0 0.458 7.00 45.8 6.06 3 222 18.7 395 2.94 33.4
#> 5 0.06905 0.0 2.18 0 0.458 7.15 54.2 6.06 3 222 18.7 397 5.33 36.2
#> 6 0.02985 0.0 2.18 0 0.458 6.43 58.7 6.06 3 222 18.7 394 5.21 28.7
#> 7 0.08829 12.5 7.87 0 0.524 6.01 66.6 5.56 5 311 15.2 396 12.43 22.9
#> 10 0.17004 12.5 7.87 0 0.524 6.00 85.9 6.59 5 311 15.2 387 17.10 18.9
#> 11 0.22489 12.5 7.87 0 0.524 6.38 94.3 6.35 5 311 15.2 393 20.45 15.0
#> 12 0.11747 12.5 7.87 0 0.524 6.01 82.9 6.23 5 311 15.2 397 13.27 18.9
#> 13 0.09378 12.5 7.87 0 0.524 5.89 39.0 5.45 5 311 15.2 390 15.71 21.7
#> 14 0.62976 0.0 8.14 0 0.538 5.95 61.8 4.71 4 307 21.0 397 8.26 20.4
#> 15 0.63796 0.0 8.14 0 0.538 6.10 84.5 4.46 4 307 21.0 380 10.26 18.2
#> 17 1.05393 0.0 8.14 0 0.538 5.93 29.3 4.50 4 307 21.0 387 6.58 23.1
#> 20 0.72580 0.0 8.14 0 0.538 5.73 69.5 3.80 4 307 21.0 391 11.28 18.2
#> 21 1.25179 0.0 8.14 0 0.538 5.57 98.1 3.80 4 307 21.0 377 21.02 13.6
#> 22 0.85204 0.0 8.14 0 0.538 5.96 89.2 4.01 4 307 21.0 393 13.83 19.6
#> 23 1.23247 0.0 8.14 0 0.538 6.14 91.7 3.98 4 307 21.0 397 18.72 15.2
#> 24 0.98843 0.0 8.14 0 0.538 5.81 100.0 4.10 4 307 21.0 395 19.88 14.5
#> 25 0.75026 0.0 8.14 0 0.538 5.92 94.1 4.40 4 307 21.0 394 16.30 15.6
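# convert the factor chas to numeric so that every column is numeric
dataset[,4] <- as.numeric(as.character(dataset[,4]))
# summarize each column (skim() comes from the skimr package)
skimr::skim(dataset)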
#> ── Data Summary ────────────────────────
#> Values
#> Name dataset
#> Number of rows 407
#> Number of columns 14
#> _______________________
#> Column type frequency:
#> numeric 14
#> ________________________
#> Group variables None
#> ── Variable type: numeric ──────────────────────────────────────────────────────
#> skim_variable n_missing complete_rate mean sd p0 p25
#> 1 crim 0 1 3.64 8.80 0.00632 0.0796
#> 2 zn 0 1 11.9 24.2 0 0
#> 3 indus 0 1 11.0 6.87 0.74 4.93
#> 4 chas 0 1 0.0713 0.258 0 0
#> 5 nox 0 1 0.555 0.117 0.385 0.448
#> 6 rm 0 1 6.29 0.704 3.56 5.89
#> 7 age 0 1 68.4 28.2 6.2 42.7
#> 8 dis 0 1 3.82 2.12 1.13 2.11
#> 9 rad 0 1 9.59 8.77 1 4
#> 10 tax 0 1 409. 169. 187 280.
#> 11 ptratio 0 1 18.4 2.18 12.6 17
#> 12 b 0 1 357. 89.7 0.32 374.
#> 13 lstat 0 1 12.6 7.03 1.92 7.06
#> 14 medv 0 1 22.5 8.96 5 17.0
#> p50 p75 p100 hist
#> 1 0.268 3.69 89.0 ▇▁▁▁▁
#> 2 0 15 100 ▇▁▁▁▁
#> 3 8.56 18.1 27.7 ▇▆▁▇▁
#> 4 0 0 1 ▇▁▁▁▁
#> 5 0.538 0.631 0.871 ▇▇▆▃▁
#> 6 6.21 6.62 8.78 ▁▁▇▂▁
#> 7 77.3 94.2 100 ▂▃▂▂▇
#> 8 3.15 5.21 12.1 ▇▅▂▁▁
#> 9 5 24 24 ▇▂▁▁▃
#> 10 334 666 711 ▇▇▃▁▇
#> 11 19 20.2 22 ▁▃▅▅▇
#> 12 391. 396. 397. ▁▁▁▁▇
#> 13 11.3 16.5 38.0 ▇▇▃▂▁
#> 14 21.2 25 50 ▂▇▃▁▁
After the conversion, the dataset contains no factor or character variables.
33.3.1 Variables correlation
# find correlation between variables
m <- cor(dataset[,1:13])
diag(m) <- 0
# select variables with correlation 0.7 and above
threshold <- 0.7
ok <- apply(abs(m) >= threshold, 1, any)
m[ok, ok]
#> indus nox age dis rad tax
#> indus 0.000 0.773 0.651 -0.711 0.620 0.719
#> nox 0.773 0.000 0.734 -0.769 0.628 0.676
#> age 0.651 0.734 0.000 -0.749 0.469 0.506
#> dis -0.711 -0.769 -0.749 0.000 -0.504 -0.526
#> rad 0.620 0.628 0.469 -0.504 0.000 0.920
#> tax 0.719 0.676 0.506 -0.526 0.920 0.000
# values of correlation above 0.7 in absolute value
ind <- sapply(1:13, function(x) abs(m[, x]) > 0.7)
m[ind]
#> [1] 0.773 -0.711 0.719 0.773 0.734 -0.769 0.734 -0.749 -0.711 -0.769
#> [11] -0.749 0.920 0.719 0.920
# logical index of the columns that contain at least one |correlation| > 0.7
cind <- apply(m, 2, function(x) any(abs(x) > 0.7))
cm <- m[, cind]  # columns without any such value (for example rm) are dropped
cm
#> indus nox age dis rad tax
#> crim 0.4076 0.4099 0.3524 -0.376 0.60834 0.5711
#> zn -0.5314 -0.5202 -0.5845 0.680 -0.32273 -0.3184
#> indus 0.0000 0.7733 0.6512 -0.711 0.61998 0.7185
#> chas 0.0658 0.0934 0.0735 -0.099 -0.00245 -0.0306
#> nox 0.7733 0.0000 0.7338 -0.769 0.62760 0.6758
#> rm -0.3826 -0.2961 -0.2262 0.207 -0.22126 -0.2953
#> age 0.6512 0.7338 0.0000 -0.749 0.46896 0.5058
#> dis -0.7113 -0.7693 -0.7492 0.000 -0.50372 -0.5264
#> rad 0.6200 0.6276 0.4690 -0.504 0.00000 0.9201
#> tax 0.7185 0.6758 0.5058 -0.526 0.92005 0.0000
#> ptratio 0.3782 0.1888 0.2709 -0.228 0.47971 0.4691
#> b -0.3644 -0.3684 -0.2742 0.284 -0.42314 -0.4303
#> lstat 0.6136 0.5839 0.6066 -0.501 0.50251 0.5538
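# keep only the rows that also contain at least one |correlation| > 0.7
rind <- apply(cm, 1, function(x) any(abs(x) > 0.7))
rm <- cm[rind, ]  # same result as m[ok, ok] above
rm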
#> indus nox age dis rad tax
#> indus 0.000 0.773 0.651 -0.711 0.620 0.719
#> nox 0.773 0.000 0.734 -0.769 0.628 0.676
#> age 0.651 0.734 0.000 -0.749 0.469 0.506
#> dis -0.711 -0.769 -0.749 0.000 -0.504 -0.526
#> rad 0.620 0.628 0.469 -0.504 0.000 0.920
#> tax 0.719 0.676 0.506 -0.526 0.920 0.000
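The same information can be listed pair by pair instead of as a matrix. The following is a minimal sketch rather than the chapter's own code: it assumes the built-in mtcars data as a stand-in so that it runs without extra packages, and the helper name high_cor_pairs is hypothetical.

```r
# List each variable pair whose absolute correlation meets a threshold.
# `high_cor_pairs` is a hypothetical helper, not part of caret or base R.
high_cor_pairs <- function(data, threshold = 0.7) {
  m <- cor(data)
  # blank out the lower triangle and diagonal so each pair appears once
  m[lower.tri(m, diag = TRUE)] <- NA
  idx <- which(abs(m) >= threshold, arr.ind = TRUE)
  pairs <- data.frame(
    var1 = rownames(m)[idx[, "row"]],
    var2 = colnames(m)[idx[, "col"]],
    r    = m[idx],
    stringsAsFactors = FALSE
  )
  # strongest correlations first
  pairs[order(-abs(pairs$r)), ]
}

# mtcars ships with R; it stands in for the housing data here
pairs_found <- high_cor_pairs(mtcars[, c("mpg", "cyl", "disp", "hp", "wt")])
print(pairs_found)
```

Applied to dataset[,1:13], the same helper would be expected to return the pairs visible in the matrix above, such as rad and tax.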
33.3.2 A look at the variables
# histograms for each attribute
for(i in 1:13) {
  hist(dataset[,i], main=names(dataset)[i])
}

# density plot for each attribute
for(i in 1:13) {
  plot(density(dataset[,i]), main=names(dataset)[i])
}

# boxplots for each attribute
for(i in 1:13) {
  boxplot(dataset[,i], main=names(dataset)[i])
}
# scatter plot matrix
pairs(dataset[,1:13])

33.4 Evaluation of algorithms
# Run algorithms using 10-fold cross-validation, repeated 3 times
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
# LM
fit.lm <- train(medv~., data=dataset, method="lm",
                metric=metric, preProc=c("center", "scale"),
                trControl=trainControl)
# GLM
fit.glm <- train(medv~., data=dataset, method="glm",
                 metric=metric, preProc=c("center", "scale"),
                 trControl=trainControl)
# GLMNET
fit.glmnet <- train(medv~., data=dataset, method="glmnet",
                    metric=metric, preProc=c("center", "scale"),
                    trControl=trainControl)
# SVM
fit.svm <- train(medv~., data=dataset, method="svmRadial",
                 metric=metric, preProc=c("center", "scale"),
                 trControl=trainControl)
# CART
grid <- expand.grid(.cp=c(0, 0.05, 0.1))
fit.cart <- train(medv~., data=dataset, method="rpart",
                  metric=metric, tuneGrid=grid,
                  preProc=c("center", "scale"),
                  trControl=trainControl)
# KNN
fit.knn <- train(medv~., data=dataset, method="knn",
                 metric=metric, preProc=c("center", "scale"),
                 trControl=trainControl)
# Compare algorithms
results <- resamples(list(LM = fit.lm,
                          GLM = fit.glm,
                          GLMNET = fit.glmnet,
                          SVM = fit.svm,
                          CART = fit.cart,
                          KNN = fit.knn))
summary(results)
dotplot(results)
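To make the resampling scheme concrete, here is a minimal base R sketch of repeated k-fold cross-validation for a single linear model. It is an illustration under stated assumptions, not what caret does internally: mtcars stands in for the housing data, and repeated_cv_rmse is a hypothetical helper name.

```r
# Manual repeated k-fold cross-validation for a linear model.
# Assumptions: mtcars (built into R) stands in for the housing data,
# and `repeated_cv_rmse` is a hypothetical helper name.
repeated_cv_rmse <- function(formula, data, k = 10, repeats = 3) {
  response <- all.vars(formula)[1]
  rmse <- numeric(0)
  for (r in seq_len(repeats)) {
    # assign every row to one of k folds, reshuffled on each repeat
    fold <- sample(rep(seq_len(k), length.out = nrow(data)))
    for (f in seq_len(k)) {
      fit  <- lm(formula, data = data[fold != f, ])
      pred <- predict(fit, newdata = data[fold == f, ])
      obs  <- data[fold == f, response]
      rmse <- c(rmse, sqrt(mean((pred - obs)^2)))
    }
  }
  rmse
}

set.seed(7)
scores <- repeated_cv_rmse(mpg ~ wt + hp, mtcars, k = 10, repeats = 3)
length(scores)  # one RMSE per held-out fold: 10 folds x 3 repeats
mean(scores)    # averaged RMSE across all resamples
```

Each model is scored on 30 held-out folds, and the average of those scores is the figure used to compare algorithms.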
33.5 Feature selection
# remove correlated attributes
# find attributes that are highly correlated
cutoff <- 0.70
correlations <- cor(dataset[,1:13])
highlyCorrelated <- findCorrelation(correlations, cutoff=cutoff)
for (value in highlyCorrelated) {
  print(names(dataset)[value])
}
#> [1] "indus"
#> [1] "nox"
#> [1] "tax"
#> [1] "dis"
# create a new dataset without highly correlated features
datasetFeatures <- dataset[,-highlyCorrelated]
dim(datasetFeatures)
#> [1] 407 10
# Run algorithms using 10-fold cross-validation, repeated 3 times,
# this time on the reduced dataset
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
# LM
fit.lm <- train(medv~., data=datasetFeatures, method="lm",
                metric=metric, preProc=c("center", "scale"),
                trControl=trainControl)
# GLM
fit.glm <- train(medv~., data=datasetFeatures, method="glm",
                 metric=metric, preProc=c("center", "scale"),
                 trControl=trainControl)
# GLMNET
fit.glmnet <- train(medv~., data=datasetFeatures, method="glmnet",
                    metric=metric, preProc=c("center", "scale"),
                    trControl=trainControl)
# SVM
fit.svm <- train(medv~., data=datasetFeatures, method="svmRadial",
                 metric=metric, preProc=c("center", "scale"),
                 trControl=trainControl)
# CART
grid <- expand.grid(.cp=c(0, 0.05, 0.1))
fit.cart <- train(medv~., data=datasetFeatures, method="rpart",
                  metric=metric, tuneGrid=grid,
                  preProc=c("center", "scale"),
                  trControl=trainControl)
# KNN
fit.knn <- train(medv~., data=datasetFeatures, method="knn",
                 metric=metric, preProc=c("center", "scale"),
                 trControl=trainControl)
# Compare algorithms
feature_results <- resamples(list(LM = fit.lm,
                                  GLM = fit.glm,
                                  GLMNET = fit.glmnet,
                                  SVM = fit.svm,
                                  CART = fit.cart,
                                  KNN = fit.knn))
summary(feature_results)
dotplot(feature_results)
Comparing the results, we can see that this has made the RMSE worse for the linear and the nonlinear algorithms. The correlated attributes we removed are contributing to the accuracy of the models.
33.6 Evaluate Algorithms: Box-Cox Transform
We know that some of the attributes have a skew and others perhaps have an
exponential distribution. One option would be to explore squaring and log
transforms respectively (you could try this!). Another approach would be to use a power transform and let it figure out the amount to correct each attribute. One example is the Box-Cox
power transform. Let’s try using this transform to rescale the original data and evaluate the effect on the same 6 algorithms. We will also leave in the centering and scaling for the benefit of the instance-based methods.
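As a rough illustration of what the Box-Cox transform does to a single skewed variable, the following base R sketch picks the power lambda by maximizing the profile log-likelihood. The function names, the simulated data, and the search interval of -2 to 2 are assumptions made for this example; caret's preProcess() estimates one lambda per predictor with its own routine.

```r
# Box-Cox power transform for a strictly positive vector.
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda under a normal model for the
# transformed values (constant terms dropped).
box_cox_loglik <- function(lambda, x) {
  z <- box_cox(x, lambda)
  n <- length(x)
  -n / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(x))
}

# moment-based sample skewness (approximate)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

set.seed(7)
x <- rexp(500, rate = 1) + 0.01  # right-skewed, strictly positive toy data

# search interval is an assumption chosen for this sketch
best <- optimize(box_cox_loglik, interval = c(-2, 2), x = x, maximum = TRUE)
lambda_hat <- best$maximum

skew(x)                       # skewness before the transform
skew(box_cox(x, lambda_hat))  # skewness after; should be closer to zero
```

The transform only applies to strictly positive values, which is why the tuning output later in the chapter reports a Box-Cox transformation for 11 of the 13 predictors.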
# Run algorithms using 10-fold cross-validation, repeated 3 times
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
# LM
fit.lm <- train(medv~., data=dataset, method="lm", metric=metric,
                preProc=c("center", "scale", "BoxCox"),
                trControl=trainControl)
# GLM
fit.glm <- train(medv~., data=dataset, method="glm", metric=metric,
                 preProc=c("center", "scale", "BoxCox"),
                 trControl=trainControl)
# GLMNET
fit.glmnet <- train(medv~., data=dataset, method="glmnet", metric=metric,
                    preProc=c("center", "scale", "BoxCox"),
                    trControl=trainControl)
# SVM
fit.svm <- train(medv~., data=dataset, method="svmRadial", metric=metric,
                 preProc=c("center", "scale", "BoxCox"),
                 trControl=trainControl)
# CART
grid <- expand.grid(.cp=c(0, 0.05, 0.1))
fit.cart <- train(medv~., data=dataset, method="rpart", metric=metric,
                  tuneGrid=grid,
                  preProc=c("center", "scale", "BoxCox"),
                  trControl=trainControl)
# KNN
fit.knn <- train(medv~., data=dataset, method="knn", metric=metric,
                 preProc=c("center", "scale", "BoxCox"),
                 trControl=trainControl)
# Compare algorithms
transformResults <- resamples(list(LM = fit.lm,
                                   GLM = fit.glm,
                                   GLMNET = fit.glmnet,
                                   SVM = fit.svm,
                                   CART = fit.cart,
                                   KNN = fit.knn))
summary(transformResults)
dotplot(transformResults)
33.7 Tune SVM
Let’s design a grid search around a C value of 1. We might see a small trend of decreasing RMSE with increasing C, so let’s try all integer C values between 1 and 10. Another parameter that caret lets us tune is sigma, the smoothing parameter of the radial kernel. Good sigma values often start around 0.1, so we will try values on either side of it.
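Before running the search, it helps to see how large it is. This base R sketch builds the grid with expand.grid() and counts the model fits implied by 10-fold cross-validation repeated 3 times. The four sigma values are the ones used in this chapter's tuning code; the variable names are only illustrative.

```r
# Candidate values: integer C from 1 to 10, sigma on either side of 0.1
sigma_values <- c(0.025, 0.05, 0.1, 0.15)
c_values     <- seq(1, 10, by = 1)

# expand.grid() builds every sigma/C combination, one per row
grid <- expand.grid(sigma = sigma_values, C = c_values)
n_candidates <- nrow(grid)   # 4 sigma values x 10 C values

# each candidate is fitted once per resample: 10 folds x 3 repeats
n_fits <- n_candidates * 10 * 3

n_candidates
n_fits
```

The count grows multiplicatively with every extra parameter value, so it is worth keeping the grid small until a promising region has been found.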
# tune SVM sigma and C parameters
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
grid <- expand.grid(.sigma = c(0.025, 0.05, 0.1, 0.15),
                    .C = seq(1, 10, by=1))
fit.svm <- train(medv~., data=dataset, method="svmRadial", metric=metric,
                 tuneGrid=grid,
                 preProc=c("BoxCox"), trControl=trainControl)
print(fit.svm)
#> Support Vector Machines with Radial Basis Function Kernel
#> 407 samples
#> 13 predictor
#> Pre-processing: Box-Cox transformation (11)
#> Resampling: Cross-Validated (10 fold, repeated 3 times)
#> Summary of sample sizes: 365, 366, 366, 367, 366, 366, ...
#> Resampling results across tuning parameters:
#> sigma C RMSE Rsquared MAE
#> 0.025 1 3.67 0.830 2.34
#> 0.025 2 3.49 0.840 2.21
#> 0.025 3 3.45 0.842 2.17
#> 0.025 4 3.42 0.844 2.14
#> 0.025 5 3.41 0.845 2.13
#> 0.025 6 3.40 0.846 2.12
#> 0.025 7 3.39 0.846 2.11
#> 0.025 8 3.39 0.846 2.11
#> 0.025 9 3.38 0.846 2.11
#> 0.025 10 3.37 0.847 2.10
#> 0.050 1 3.61 0.833 2.25
#> 0.050 2 3.44 0.843 2.17
#> 0.050 3 3.36 0.848 2.11
#> 0.050 4 3.30 0.852 2.08
#> 0.050 5 3.25 0.856 2.05
#> 0.050 6 3.20 0.860 2.03
#> 0.050 7 3.16 0.862 2.02
#> 0.050 8 3.13 0.865 2.02
#> 0.050 9 3.11 0.866 2.01
#> 0.050 10 3.10 0.867 2.01
#> 0.100 1 3.68 0.829 2.26
#> 0.100 2 3.37 0.848 2.12
#> 0.100 3 3.23 0.858 2.06
#> 0.100 4 3.17 0.862 2.04
#> 0.100 5 3.14 0.865 2.04
#> 0.100 6 3.11 0.866 2.04
#> 0.100 7 3.09 0.868 2.04
#> 0.100 8 3.09 0.868 2.04
#> 0.100 9 3.08 0.868 2.04
#> 0.100 10 3.08 0.868 2.05
#> 0.150 1 3.79 0.822 2.30
#> 0.150 2 3.42 0.846 2.14
#> 0.150 3 3.30 0.854 2.09
#> 0.150 4 3.26 0.857 2.09
#> 0.150 5 3.24 0.858 2.09
#> 0.150 6 3.23 0.858 2.10
#> 0.150 7 3.23 0.857 2.12
#> 0.150 8 3.24 0.856 2.13
#> 0.150 9 3.26 0.855 2.15
#> 0.150 10 3.27 0.854 2.17
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were sigma = 0.1 and C = 9.

33.8 Ensembling
We can try some ensemble methods on the problem and see if we can get a further decrease in our RMSE.
- Random Forest, bagging (RF).
- Gradient Boosting Machines (GBM).
- Cubist, boosting (CUBIST).
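The idea behind bagging can be sketched in a few lines of base R: fit the same learner on many bootstrap samples and average the predictions. This is only an illustration under stated assumptions. It uses lm() as a stand-in base learner (Random Forest uses decision trees and also samples the predictors), and mtcars stands in for the housing data.

```r
# Bagging sketch: average predictions from models fitted on bootstrap samples.
# Assumptions: lm() as the base learner and mtcars as the data.
set.seed(7)
train_rows <- sample(nrow(mtcars), 24)
train <- mtcars[train_rows, ]
test  <- mtcars[-train_rows, ]

n_models <- 50
# one column of test-set predictions per bootstrap model
preds <- sapply(seq_len(n_models), function(b) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit  <- lm(mpg ~ wt + hp, data = boot)
  predict(fit, newdata = test)
})

# the bagged prediction is the average across the bootstrap models
bagged <- rowMeans(preds)
sqrt(mean((bagged - test$mpg)^2))  # RMSE of the bagged predictions
```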
# try ensembles
seed <- 7
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
# Random Forest
set.seed(seed)
fit.rf <- train(medv~., data=dataset, method="rf", metric=metric,
                trControl=trainControl)
# Stochastic Gradient Boosting
set.seed(seed)
fit.gbm <- train(medv~., data=dataset, method="gbm", metric=metric,
                 trControl=trainControl, verbose=FALSE)
# Cubist
set.seed(seed)
fit.cubist <- train(medv~., data=dataset, method="cubist", metric=metric,
                    preProc=c("BoxCox"), trControl=trainControl)
# Compare algorithms
ensembleResults <- resamples(list(RF = fit.rf,
                                  GBM = fit.gbm,
                                  CUBIST = fit.cubist))
summary(ensembleResults)
dotplot(ensembleResults)
# look at parameters used for Cubist
print(fit.cubist)
Let’s use a grid search to tune around those values. We’ll try all committees between 15 and 25 and spot-check a neighbors value above and below 5.
# Tune the Cubist algorithm
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"
grid <- expand.grid(.committees = seq(15, 25, by=1),
                    .neighbors = c(3, 5, 7))
tune.cubist <- train(medv~., data=dataset, method = "cubist", metric=metric,
                     preProc=c("BoxCox"),
                     tuneGrid=grid, trControl=trainControl)
print(tune.cubist)
#> Cubist
#> 407 samples
#> 13 predictor
#> Pre-processing: Box-Cox transformation (11)
#> Resampling: Cross-Validated (10 fold, repeated 3 times)
#> Summary of sample sizes: 365, 366, 366, 367, 366, 366, ...
#> Resampling results across tuning parameters:
#> committees neighbors RMSE Rsquared MAE
#> 15 3 3.07 0.877 2.00
#> 15 5 3.13 0.873 2.02
#> 15 7 3.14 0.871 2.03
#> 16 3 3.05 0.878 1.99
#> 16 5 3.11 0.874 2.01
#> 16 7 3.12 0.872 2.02
#> 17 3 3.04 0.879 1.98
#> 17 5 3.09 0.875 2.00
#> 17 7 3.11 0.873 2.01
#> 18 3 3.03 0.880 1.97
#> 18 5 3.08 0.876 2.00
#> 18 7 3.10 0.874 2.01
#> 19 3 3.03 0.880 1.97
#> 19 5 3.08 0.876 1.99
#> 19 7 3.10 0.874 2.01
#> 20 3 3.03 0.879 1.98
#> 20 5 3.09 0.875 2.00
#> 20 7 3.10 0.874 2.01
#> 21 3 3.03 0.879 1.98
#> 21 5 3.09 0.876 2.00
#> 21 7 3.10 0.874 2.02
#> 22 3 3.03 0.879 1.98
#> 22 5 3.09 0.875 2.00
#> 22 7 3.10 0.874 2.02
#> 23 3 3.03 0.880 1.98
#> 23 5 3.09 0.876 2.01
#> 23 7 3.10 0.874 2.02
#> 24 3 3.03 0.879 1.98
#> 24 5 3.09 0.875 2.01
#> 24 7 3.11 0.873 2.02
#> 25 3 3.03 0.880 1.98
#> 25 5 3.09 0.876 2.01
#> 25 7 3.10 0.874 2.02
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were committees = 25 and neighbors = 3.

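We can see that we have achieved a more accurate model again, with an RMSE of about 3.03. caret selected committees = 25 and neighbors = 3, although every committees value from 18 upward gives the same RMSE to two decimal places when neighbors = 3.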
It looks like the results for the Cubist algorithm are the most accurate. Let’s finalize it by creating a new standalone Cubist model with the parameters above trained using the whole dataset. We must also use the Box-Cox power transform.
33.9 Finalize the model
# prepare the data transform using training data
x <- dataset[,1:13]
y <- dataset[,14]
# transform
preprocessParams <- preProcess(x, method=c("BoxCox"))
transX <- predict(preprocessParams, x)
# train the final model (cubist() comes from the Cubist package)
library(Cubist)
finalModel <- cubist(x = transX, y=y, committees=18)
summary(finalModel)
#> Evaluation on training data (407 cases):
We can now use this model to evaluate our held-out validation dataset. Again, we must prepare the input data using the same Box-Cox transform.
# transform the validation dataset
valX <- validation[,1:13]
trans_valX <- predict(preprocessParams, valX)
valY <- validation[,14]
# use final model to make predictions on the validation dataset
predictions <- predict(finalModel, newdata = trans_valX, neighbors=3)
# calculate RMSE and R-squared on the validation set
rmse <- RMSE(predictions, valY)
r2 <- R2(predictions, valY)
print(rmse)
#> [1] 3.24
We can see that the RMSE on this unseen data is about 3.24, somewhat higher than, but not too dissimilar from, the cross-validated estimate of about 3.03.
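For reference, both metrics are easy to compute by hand. The sketch below uses a handful of made-up observed and predicted values (hypothetical numbers, not output from the model above). It assumes the squared-correlation definition of R-squared, which is, as far as we know, the default used by caret's R2().

```r
# Toy observed and predicted values (hypothetical, not model output)
obs  <- c(24.0, 21.6, 34.7, 33.4, 36.2)
pred <- c(25.1, 22.0, 33.0, 35.0, 34.9)

# root mean squared error: typical size of a prediction error,
# expressed in the units of the response
rmse_manual <- sqrt(mean((pred - obs)^2))

# R-squared as the squared Pearson correlation between
# predictions and observations
r2_manual <- cor(pred, obs)^2

rmse_manual
r2_manual
```

Because RMSE is in the units of medv (thousands of dollars), a value near 3 means a typical prediction error of roughly 3,000 dollars.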