28 Predicting Flu outcome comparing eight classification algorithms

Datasets: fluH7N9_china_2013
Algorithms: RF, GLMNET, KNN, PDA, LDA, NSC, C5, PLS

28.1 Introduction

Among the many nice R packages containing data collections is the outbreaks package. It contains datsets on epidemics and among them is data from the 2013 outbreak of influenza A H7N9 in China as analysed by Kucharski et al. (2014):

A. Kucharski, H. Mills, A. Pinsent, C. Fraser, M. Van Kerkhove, C. A. Donnelly, and S. Riley. 2014. Distinguishing between reservoir exposure and human-to-human transmission for emerging pathogens using case onset data. PLOS Currents Outbreaks. Mar 7, edition 1. doi: 10.1371/currents.outbreaks.e1473d9bfc99d080ca242139a06c455f.

A. Kucharski, H. Mills, A. Pinsent, C. Fraser, M. Van Kerkhove, C. A. Donnelly, and S. Riley. 2014. Data from: Distinguishing between reservoir exposure and human-to-human transmission for emerging pathogens using case onset data. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.2g43n.

28.1.1 Algorithms

We compare these classification algorithms:

Random Forest
GLM net
k-Nearest Neighbors
Penalized Discriminant Analysis
Stabilized Linear Discriminant Analysis
Nearest Shrunken Centroids
Single C5.0 Tree
Partial Least Squares

28.1.2 Workflow

Load dataset
Data wrangling

Many dates as one variable: gather
Convert group to factor: factor
Change dates descriptions: dplyr::mapvalues
Set several provinces as Other: mapvalues
Convert gender to factor
Convert province to factor

Features visualization

Area density plot of Month vs Age by Province, by date group, by outcome, by gender
Number of flu cases by gender, by province
Age density plot of flu cases by outcome
Days passed from outset by province, by age, by gender, by outcome

Feature engineering

Generate new features
Gender to numeric
Provinces to binary classifiers
Outcome to binary classifier
Impute missing data: mice

Split dataset in training, validation and test sets
Feature importance

with Decision Trees
with Random Forest
Wrangling dataset
Plot Importance vs Variable

Impact on Datasets

Density plot of training, validation and test datasets
Features vs outcome by age, days onset to hospital, day onset to outcome

Train models on training validation dataset

Numeric and visual Comparison of models
- Accuracy
- Kappa
Compare predictions on training validation set
- Create dataset for results
- New variable for outcome

Predicting on unknown outcomes on training data

Numeric and visual Comparison of models
- Accuracy
- Kappa
Compare predictions on training validation set
- Create dataset for results
- New variable for outcome
Calculate predicted outcome
Save to CSV file
Calculate recovery cases
Summarize outcome
Plot month vs log2-ratio of recovery vs death, by gender, by age, by date group

28.1.3 Can we predict flu outcome with Machine Learning in R?

I will be using their data as an example to show how to use Machine Learning algorithms for predicting disease outcome.

To do so, I selected and extracted features from the raw data, including age, days between onset and outcome, gender, whether the patients were hospitalized, etc. Missing values were imputed and different model algorithms were used to predict outcome (death or recovery). The prediction accuracy, sensitivity and specificity. The thus prepared dataset was divided into training and testing subsets. The test subset contained all cases with an unknown outcome. Before I applied the models to the test data, I further split the training data into validation subsets.

The tested modeling algorithms were similarly successful at predicting the outcomes of the validation data. To decide on final classifications, I compared predictions from all models and defined the outcome “Death” or “Recovery” as a function of all models, whereas classifications with a low prediction probability were flagged as “uncertain”. Accounting for this uncertainty led to a 100% correct classification of the validation test set.

The training cases with unknown outcome were then classified based on the same algorithms. From 57 unknown cases, 14 were classified as “Recovery”, 10 as “Death” and 33 as uncertain.

In a Part 2, I am looking at how extreme gradient boosting performs on this dataset.

Disclaimer: I am not an expert in Machine Learning. Everything I know, I taught myself during the last months. So, if you see any mistakes or have tips and tricks for improvement, please don’t hesitate to let me know! Thanks. :-)

28.2 The data

The dataset contains case ID, date of onset, date of hospitalization, date of outcome, gender, age, province and of course the outcome: Death or Recovery. I can already see that there are a couple of missing values in the data, which I will deal with later.

options(width = 1000)

if (!require("outbreaks")) install.packages("outbreaks")
#> Loading required package: outbreaks
library(outbreaks)

fluH7N9_china_2013_backup <- fluH7N9_china_2013

fluH7N9_china_2013$age[which(fluH7N9_china_2013$age == "?")] <- NA
fluH7N9_china_2013$case_ID <- paste("case", 
                                    fluH7N9_china_2013$case_id, 
                                    sep = "_")
head(fluH7N9_china_2013)
#>   case_id date_of_onset date_of_hospitalisation date_of_outcome outcome gender age province case_ID
#> 1       1    2013-02-19                    <NA>      2013-03-04   Death      m  87 Shanghai  case_1
#> 2       2    2013-02-27              2013-03-03      2013-03-10   Death      m  27 Shanghai  case_2
#> 3       3    2013-03-09              2013-03-19      2013-04-09   Death      f  35    Anhui  case_3
#> 4       4    2013-03-19              2013-03-27            <NA>    <NA>      f  45  Jiangsu  case_4
#> 5       5    2013-03-19              2013-03-30      2013-05-15 Recover      f  48  Jiangsu  case_5
#> 6       6    2013-03-21              2013-03-28      2013-04-26   Death      f  32  Jiangsu  case_6

Before I start preparing the data for Machine Learning, I want to get an idea of the distribution of the data points and their different variables by plotting.

Most provinces have only a handful of cases, so I am combining them into the category “other” and keep only Jiangsu, Shanghai and Zhejian and separate provinces.

# make tidy data, clean up and simplify
library(tidyr)

# put different type of dates in one variable (Date)
fluH7N9_china_2013_gather <- fluH7N9_china_2013 %>%
  gather(Group, Date, date_of_onset:date_of_outcome)

# convert Group to factor
fluH7N9_china_2013_gather$Group <- factor(fluH7N9_china_2013_gather$Group, 
                                          levels = c("date_of_onset", 
                                                     "date_of_hospitalisation", 
                                                     "date_of_outcome"))

# change the dates description with plyr::mapvalues
library(plyr)
from <- c("date_of_onset", "date_of_hospitalisation", "date_of_outcome")
to <- c("Date of onset", "Date of hospitalisation", "Date of outcome")
fluH7N9_china_2013_gather$Group <- mapvalues(fluH7N9_china_2013_gather$Group, 
                                             from = from, to = to)

# change additional provinces to Other
from <- c("Anhui", "Beijing", "Fujian", "Guangdong", "Hebei", "Henan", 
          "Hunan", "Jiangxi", "Shandong", "Taiwan")
to <- rep("Other", 10)
fluH7N9_china_2013_gather$province <- 
  mapvalues(fluH7N9_china_2013_gather$province, 
            from = from, to = to)

# convert gender to factor
levels(fluH7N9_china_2013_gather$gender) <- 
  c(levels(fluH7N9_china_2013_gather$gender), "unknown")

# replace NA in gender by "unknown
is_na <- is.na(fluH7N9_china_2013_gather$gender)
fluH7N9_china_2013_gather$gender[is_na] <- "unknown"

# convert province to factor
fluH7N9_china_2013_gather$province <- factor(fluH7N9_china_2013_gather$province, 
                                             levels = c("Jiangsu",  "Shanghai", 
                                                        "Zhejiang", "Other"))

library(ggplot2)
my_theme <- function(base_size = 12, base_family = "sans"){
  theme_minimal(base_size = base_size, base_family = base_family) +
  theme(
    axis.text = element_text(size = 12),
    axis.text.x = element_text(angle = 45, vjust = 0.5, hjust = 0.5),
    axis.title = element_text(size = 14),
    panel.grid.major = element_line(color = "grey"),
    panel.grid.minor = element_blank(),
    panel.background = element_rect(fill = "aliceblue"),
    strip.background = element_rect(fill = "lightgrey", color = "grey", 
                                    size = 1),
    strip.text = element_text(face = "bold", size = 12, color = "black"),
    legend.position = "bottom",
    legend.justification = "top", 
    legend.box = "horizontal",
    legend.box.background = element_rect(colour = "grey50"),
    legend.background = element_blank(),
    panel.border = element_rect(color = "grey", fill = NA, size = 0.5)
  )
}

ggplot(data = fluH7N9_china_2013_gather, 
       aes(x = Date, y = as.numeric(age), fill = outcome)) +
  stat_density2d(aes(alpha = ..level..), geom = "polygon") +
  geom_jitter(aes(color = outcome, shape = gender), size = 1.5) +
  geom_rug(aes(color = outcome)) +
  labs(
    fill = "Outcome",
    color = "Outcome",
    alpha = "Level",
    shape = "Gender",
    x = "Date in 2013",
    y = "Age",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)",
    caption = ""
  ) +
  facet_grid(Group ~ province) +
  my_theme() +
  scale_shape_manual(values = c(15, 16, 17)) +
  scale_color_brewer(palette="Set1", na.value = "grey50") +
  scale_fill_brewer(palette="Set1")
#> Warning: Removed 149 rows containing non-finite values (stat_density2d).
#> Warning: Removed 149 rows containing missing values (geom_point).

This plot shows the dates of onset, hospitalization and outcome (if known) of each data point. Outcome is marked by color and age shown on the y-axis. Gender is marked by point shape.

The density distribution of date by age for the cases seems to indicate that older people died more frequently in the Jiangsu and Zhejiang province than in Shanghai and in other provinces.

When we look at the distribution of points along the time axis, it suggests that their might be a positive correlation between the likelihood of death and an early onset or early outcome.

I also want to know how many cases there are for each gender and province and compare the genders’ age distribution.

# more tidy data. first remove age
fluH7N9_china_2013_gather_2 <- fluH7N9_china_2013_gather[, -4] %>%
  gather(group_2, value, gender:province)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped

# change descriptions
from <- c("m", "f", "unknown", "Other")
to <- c("Male", "Female", "Unknown gender", "Other province")
fluH7N9_china_2013_gather_2$value <- mapvalues(fluH7N9_china_2013_gather_2$value, 
                                               from = from, to = to)
# convert to factor
fluH7N9_china_2013_gather_2$value <- factor(fluH7N9_china_2013_gather_2$value, 
                                            levels = c("Female", "Male", 
                                                       "Unknown gender", 
                                                       "Jiangsu", "Shanghai", 
                                                       "Zhejiang", "Other 
                                                       province"))

p1 <- ggplot(data = fluH7N9_china_2013_gather_2, aes(x = value, 
                                                     fill = outcome, 
                                                     color = outcome)) +
  geom_bar(position = "dodge", alpha = 0.7, size = 1) +
  my_theme() +
  scale_fill_brewer(palette="Set1", na.value = "grey50") +
  scale_color_brewer(palette="Set1", na.value = "grey50") +
  labs(
    color = "",
    fill = "",
    x = "",
    y = "Count",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Gender and Province numbers of flu cases",
    caption = ""
  )

p2 <- ggplot(data = fluH7N9_china_2013_gather, aes(x = as.numeric(age), 
                                                   fill = outcome, 
                                                   color = outcome)) +
  geom_density(alpha = 0.3, size = 1) +
  geom_rug() +
  scale_color_brewer(palette="Set1", na.value = "grey50") +
  scale_fill_brewer(palette="Set1", na.value = "grey50") +
  my_theme() +
  labs(
    color = "",
    fill = "",
    x = "Age",
    y = "Density",
    title = "",
    subtitle = "Age distribution of flu cases",
    caption = ""
  )

library(gridExtra)
library(grid)

grid.arrange(p1, p2, ncol = 2)
#> Warning: Removed 6 rows containing non-finite values (stat_density).

In the dataset, there are more male than female cases and correspondingly, we see more deaths, recoveries and unknown outcomes in men than in women. This is potentially a problem later on for modeling because the inherent likelihoods for outcome are not directly comparable between the sexes.

Most unknown outcomes were recorded in Zhejiang. Similarly to gender, we don’t have an equal distribution of data points across provinces either.

When we look at the age distribution it is obvious that people who died tended to be slightly older than those who recovered. The density curve of unknown outcomes is more similar to that of death than of recovery, suggesting that among these people there might have been more deaths than recoveries.

And lastly, I want to plot how many days passed between onset, hospitalization and outcome for each case.

ggplot(data = fluH7N9_china_2013_gather, aes(x = Date, y = as.numeric(age), 
                                             color = outcome)) +
  geom_point(aes(color = outcome, shape = gender), size = 1.5, alpha = 0.6) +
  geom_path(aes(group = case_ID)) +
  facet_wrap( ~ province, ncol = 2) +
  my_theme() +
  scale_shape_manual(values = c(15, 16, 17)) +
  scale_color_brewer(palette="Set1", na.value = "grey50") +
  scale_fill_brewer(palette="Set1") +
  labs(
    color = "Outcome",
    shape = "Gender",
    x = "Date in 2013",
    y = "Age",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Dataset from 'outbreaks' package (Kucharski et al. 2014)",
    caption = "\nTime from onset of flu to outcome."
  )
#> Warning: Removed 149 rows containing missing values (geom_point).
#> Warning: Removed 122 row(s) containing missing values (geom_path).

This plot shows that there are many missing values in the dates, so it is hard to draw a general conclusion.

28.3 Features

In Machine Learning-speak features are the variables used for model training. Using the right features dramatically influences the accuracy of the model. Because we don’t have many features, I am keeping age as it is, but I am also generating new features:

from the date information I am calculating the days between onset and outcome and between onset and hospitalization
I am converting gender into numeric values with 1 for female and 0 for male
similarly, I am converting provinces to binary classifiers (yes == 1, no == 0) for Shanghai, Zhejiang, Jiangsu and other provinces
the same binary classification is given for whether a case was hospitalised, and whether they had an early onset or early outcome (earlier than the median date)

# convert gender, provinces to discrete and numeric values
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following object is masked from 'package:gridExtra':
#> 
#>     combine
#> The following objects are masked from 'package:plyr':
#> 
#>     arrange, count, desc, failwith, id, mutate, rename, summarise, summarize
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

dataset <- fluH7N9_china_2013 %>%
  mutate(hospital = as.factor(ifelse(is.na(date_of_hospitalisation), 0, 1)),
         gender_f = as.factor(ifelse(gender == "f", 1, 0)),
         province_Jiangsu = as.factor(ifelse(province == "Jiangsu", 1, 0)),
         province_Shanghai = as.factor(ifelse(province == "Shanghai", 1, 0)),
         province_Zhejiang = as.factor(ifelse(province == "Zhejiang", 1, 0)),
         province_other = as.factor(ifelse(province == "Zhejiang" | 
                                             province == "Jiangsu" | 
                                             province == "Shanghai", 0, 1)),
         days_onset_to_outcome = as.numeric(as.character(gsub(" days", "",
                as.Date(as.character(date_of_outcome), format = "%Y-%m-%d") - 
                as.Date(as.character(date_of_onset), format = "%Y-%m-%d")))),
         
         days_onset_to_hospital = as.numeric(as.character(gsub(" days", "",
            as.Date(as.character(date_of_hospitalisation), format = "%Y-%m-%d") - 
            as.Date(as.character(date_of_onset), format = "%Y-%m-%d")))),
         age = as.numeric(as.character(age)),
         early_onset = as.factor(ifelse(date_of_onset < 
                        summary(date_of_onset)[[3]], 1, 0)),
         early_outcome = as.factor(ifelse(date_of_outcome < 
                        summary(date_of_outcome)[[3]], 1, 0))) %>%
  subset(select = -c(2:4, 6, 8))
  # print

rownames(dataset) <- dataset$case_ID
dataset <- dataset[, -c(1,4)]
head(dataset)
#>        outcome age hospital gender_f province_Jiangsu province_Shanghai province_Zhejiang province_other days_onset_to_outcome days_onset_to_hospital early_onset early_outcome
#> case_1   Death  87        0        0                0                 1                 0              0                    13                     NA           1             1
#> case_2   Death  27        1        0                0                 1                 0              0                    11                      4           1             1
#> case_3   Death  35        1        1                0                 0                 0              1                    31                     10           1             1
#> case_4    <NA>  45        1        1                1                 0                 0              0                    NA                      8           1          <NA>
#> case_5 Recover  48        1        1                1                 0                 0              0                    57                     11           1             0
#> case_6   Death  32        1        1                1                 0                 0              0                    36                      7           1             1

summary(dataset$outcome)
#>   Death Recover    NA's 
#>      32      47      57

28.3.1 Imputing missing values

https://www.r-bloggers.com/imputing-missing-data-with-r-mice-package/

When looking at the dataset I created for modeling, it is obvious that we have quite a few missing values.

The missing values from the outcome column are what I want to predict but for the rest I would either have to remove the entire row from the data or impute the missing information. I decided to try the latter with the mice package.

library(mice)
#> 
#> Attaching package: 'mice'
#> The following objects are masked from 'package:base':
#> 
#>     cbind, rbind

dataset_impute  <- mice(dataset[, -1],  print = FALSE)  # remove outcome
#> Warning: Number of logged events: 150
impute_complete <- mice::complete(dataset_impute, 1) # return 1st imputed dataset
rownames(impute_complete) <- row.names(dataset)
impute_complete
#>          age hospital gender_f province_Jiangsu province_Shanghai province_Zhejiang province_other days_onset_to_outcome days_onset_to_hospital early_onset early_outcome
#> case_1    87        0        0                0                 1                 0              0                    13                      6           1             1
#> case_2    27        1        0                0                 1                 0              0                    11                      4           1             1
#> case_3    35        1        1                0                 0                 0              1                    31                     10           1             1
#> case_4    45        1        1                1                 0                 0              0                    14                      8           1             1
#> case_5    48        1        1                1                 0                 0              0                    57                     11           1             0
#> case_6    32        1        1                1                 0                 0              0                    36                      7           1             1
#> case_7    83        1        0                1                 0                 0              0                    20                      9           1             1
#> case_8    38        1        0                0                 0                 1              0                    20                     11           1             1
#> case_9    67        1        0                0                 0                 1              0                     7                      0           1             1
#> case_10   48        1        0                0                 1                 0              0                     6                      4           1             1
#> case_11   64        1        0                0                 0                 1              0                     6                      2           1             1
#> case_12   52        0        1                0                 1                 0              0                     7                      7           1             1
#> case_13   67        1        1                0                 1                 0              0                    12                      3           1             1
#> case_14    4        0        0                0                 1                 0              0                    10                      5           1             1
#> case_15   61        0        1                1                 0                 0              0                    32                      4           1             0
#> case_16   79        0        0                1                 0                 0              0                    38                      4           1             0
#> case_17   74        1        0                0                 1                 0              0                    14                      6           1             1
#> case_18   66        1        0                0                 1                 0              0                    20                      4           1             1
#> case_19   59        1        0                0                 1                 0              0                    67                      5           1             0
#> case_20   55        1        0                0                 0                 0              1                    22                      4           1             1
#> case_21   67        1        0                0                 1                 0              0                    23                      1           1             1
#> case_22   85        1        0                1                 0                 0              0                    31                      4           1             0
#> case_23   25        1        1                1                 0                 0              0                    46                      0           1             0
#> case_24   64        0        0                0                 1                 0              0                     6                      5           1             1
#> case_25   62        1        0                0                 1                 0              0                    35                      0           1             0
#> case_26   77        1        0                0                 1                 0              0                    11                      4           1             1
#> case_27   51        1        1                0                 0                 1              0                    37                     27           1             1
#> case_28   79        0        0                0                 0                 1              0                    32                      0           1             0
#> case_29   76        1        1                0                 1                 0              0                    32                      4           1             0
#> case_30   81        0        1                0                 1                 0              0                    23                      0           1             0
#> case_31   70        0        0                1                 0                 0              0                    32                      4           1             0
#> case_32   74        0        0                1                 0                 0              0                    38                      4           1             0
#> case_33   65        0        0                0                 0                 1              0                    24                      8           1             0
#> case_34   74        1        0                0                 1                 0              0                    11                      5           1             1
#> case_35   83        1        1                0                 1                 0              0                    38                      5           1             0
#> case_36   68        0        0                0                 1                 0              0                    17                      4           1             1
#> case_37   31        0        0                1                 0                 0              0                    45                      7           1             0
#> case_38   56        0        0                1                 0                 0              0                    23                      7           1             0
#> case_39   66        1        0                0                 0                 1              0                    13                      0           0             1
#> case_40   74        1        0                0                 0                 1              0                     6                      5           1             1
#> case_41   54        1        1                0                 0                 1              0                    23                      6           1             1
#> case_42   53        1        0                0                 1                 0              0                    11                      7           1             1
#> case_43   86        1        0                0                 1                 0              0                    18                      3           1             1
#> case_44    7        1        1                0                 0                 0              1                     6                      0           0             1
#> case_45   56        1        0                0                 1                 0              0                    86                      3           1             0
#> case_46   77        0        1                1                 0                 0              0                     9                      0           1             1
#> case_47   72        0        0                1                 0                 0              0                    32                      5           1             0
#> case_48   65        1        0                0                 0                 1              0                    20                      6           1             1
#> case_49   38        1        0                0                 0                 1              0                    28                      5           1             0
#> case_50   34        1        0                0                 0                 0              1                    17                      3           1             0
#> case_51   65        1        0                0                 0                 0              1                    15                      2           0             1
#> case_52   64        0        1                0                 0                 1              0                    17                      7           1             1
#> case_53   62        0        1                0                 0                 1              0                    22                      7           1             1
#> case_54   75        0        0                0                 0                 1              0                    37                      7           1             0
#> case_55   79        0        0                0                 0                 1              0                    18                      7           0             0
#> case_56   73        1        0                0                 1                 0              0                    11                      6           1             1
#> case_57   54        1        0                0                 1                 0              0                    13                      4           0             1
#> case_58   78        1        0                0                 1                 0              0                    31                      4           1             0
#> case_59   50        0        0                1                 0                 0              0                    37                      7           1             0
#> case_60   26        0        0                1                 0                 0              0                    18                      4           0             1
#> case_61   60        0        0                1                 0                 0              0                    10                      4           1             1
#> case_62   68        0        1                0                 0                 1              0                    11                      5           1             1
#> case_63   60        0        0                0                 0                 0              1                    26                      3           0             0
#> case_64   56        0        0                1                 0                 0              0                    16                      4           0             0
#> case_65   21        0        1                1                 0                 0              0                    37                      4           1             0
#> case_66   72        0        0                1                 0                 0              0                    32                      3           0             0
#> case_67   56        0        0                0                 0                 1              0                    31                      3           0             0
#> case_68   57        0        0                0                 0                 1              0                     6                     10           0             1
#> case_69   62        0        0                0                 0                 1              0                     8                      7           0             1
#> case_70   58        0        1                0                 0                 1              0                    10                     11           0             1
#> case_71   72        0        1                0                 0                 1              0                    13                      6           0             0
#> case_72   47        1        0                0                 1                 0              0                    17                      5           0             0
#> case_73   69        0        0                0                 1                 0              0                     7                      4           0             1
#> case_74   54        0        0                0                 1                 0              0                    17                      6           1             0
#> case_75   83        0        0                0                 1                 0              0                     6                      4           1             1
#> case_76   55        0        0                0                 1                 0              0                     6                      7           1             1
#> case_77    2        0        0                0                 1                 0              0                    13                      7           1             0
#> case_78   89        1        0                0                 1                 0              0                    17                      4           0             0
#> case_79   37        0        1                0                 0                 1              0                    16                      0           0             0
#> case_80   74        0        0                0                 0                 1              0                    22                      3           0             0
#> case_81   86        0        0                0                 0                 1              0                    10                      6           0             1
#> case_82   41        0        0                0                 0                 1              0                     8                     10           0             1
#> case_83   38        0        0                0                 0                 0              1                    23                      4           0             0
#> case_84   26        0        1                1                 0                 0              0                    13                      1           1             1
#> case_85   80        1        1                0                 1                 0              0                    13                      7           0             1
#> case_86   54        0        1                0                 0                 1              0                    10                      3           0             1
#> case_87   69        0        0                0                 0                 1              0                     8                      3           0             1
#> case_88    4        0        0                0                 0                 0              1                    11                      0           0             0
#> case_89   54        1        0                1                 0                 0              0                    36                      6           0             0
#> case_90   43        1        0                0                 0                 1              0                     9                      3           0             1
#> case_91   48        1        0                0                 0                 1              0                    16                      6           0             0
#> case_92   66        1        1                0                 0                 1              0                    13                      7           0             1
#> case_93   56        0        0                0                 0                 1              0                    17                      0           0             0
#> case_94   35        0        1                0                 0                 1              0                    13                      6           0             0
#> case_95   37        0        0                0                 0                 1              0                     7                      4           1             1
#> case_96   43        0        0                1                 0                 0              0                    37                      4           1             0
#> case_97   75        1        1                0                 1                 0              0                    22                      5           0             0
#> case_98   76        0        0                0                 0                 1              0                    13                      3           0             1
#> case_99   68        0        1                0                 0                 1              0                     6                      7           0             1
#> case_100  58        0        0                0                 0                 1              0                    20                      7           0             0
#> case_101  79        0        1                0                 0                 1              0                    13                      3           0             1
#> case_102  81        0        0                0                 0                 1              0                    32                      6           0             0
#> case_103  68        1        0                1                 0                 0              0                    37                      6           0             0
#> case_104  54        0        1                0                 0                 1              0                    20                      7           0             0
#> case_105  32        0        0                0                 0                 1              0                    22                      6           0             0
#> case_106  36        1        0                0                 0                 0              1                    30                      6           0             0
#> case_107  91        0        0                0                 0                 0              1                    21                     10           0             0
#> case_108  84        0        0                0                 0                 1              0                    13                      3           0             0
#> case_109  62        0        0                0                 0                 1              0                     6                      0           0             1
#> case_110  53        1        0                0                 0                 0              1                     9                      4           0             0
#> case_111  56        0        0                0                 0                 0              1                    22                      5           0             0
#> case_112  69        0        0                0                 0                 0              1                    10                      1           0             0
#> case_113  60        0        1                0                 0                 1              0                    13                      3           0             0
#> case_114  50        0        1                0                 0                 1              0                    10                      6           0             1
#> case_115  38        0        0                0                 0                 1              0                    16                      5           0             0
#> case_116  65        1        0                0                 0                 0              1                    21                      5           0             0
#> case_117  76        0        1                0                 0                 0              1                    28                      5           0             0
#> case_118  49        0        0                1                 0                 0              0                    18                      5           0             1
#> case_119  36        0        0                1                 0                 0              0                    26                      4           0             0
#> case_120  60        0        0                1                 0                 0              0                    31                      7           1             0
#> case_121  64        1        1                0                 0                 0              1                    30                      5           0             0
#> case_122  38        0        0                0                 0                 1              0                     2                      3           0             1
#> case_123  54        1        0                0                 0                 0              1                    16                      7           0             0
#> case_124  80        0        0                0                 0                 0              1                    17                      7           0             0
#> case_125  31        0        1                0                 0                 0              1                    20                      0           0             0
#> case_126  80        1        0                0                 0                 0              1                    24                     10           0             0
#> case_127   4        0        0                0                 0                 0              1                     9                      4           0             0
#> case_128  58        1        0                0                 0                 0              1                    22                      7           0             0
#> case_129  69        1        0                0                 0                 0              1                    28                      4           0             0
#> case_130  69        1        0                0                 0                 0              1                    30                      1           0             0
#> case_131   9        1        0                0                 0                 0              1                    11                      1           0             0
#> case_132  79        1        1                0                 0                 0              1                    18                      0           0             1
#> case_133   6        1        0                0                 0                 0              1                     2                      0           0             0
#> case_134  15        1        0                1                 0                 0              0                     7                      1           0             0
#> case_135  61        1        1                0                 0                 0              1                    32                      3           0             0
#> case_136  51        1        1                0                 0                 0              1                     8                      1           0             1

dataset_complete <- merge(dataset[, 1, drop = FALSE], 
                          # mice::complete(dataset_impute, 1), 
                          impute_complete,
                          by = "row.names", all = TRUE)
rownames(dataset_complete) <- dataset_complete$Row.names
dataset_complete <- dataset_complete[, -1]
dataset_complete
#>          outcome age hospital gender_f province_Jiangsu province_Shanghai province_Zhejiang province_other days_onset_to_outcome days_onset_to_hospital early_onset early_outcome
#> case_1     Death  87        0        0                0                 1                 0              0                    13                      6           1             1
#> case_10    Death  48        1        0                0                 1                 0              0                     6                      4           1             1
#> case_100    <NA>  58        0        0                0                 0                 1              0                    20                      7           0             0
#> case_101    <NA>  79        0        1                0                 0                 1              0                    13                      3           0             1
#> case_102    <NA>  81        0        0                0                 0                 1              0                    32                      6           0             0
#> case_103    <NA>  68        1        0                1                 0                 0              0                    37                      6           0             0
#> case_104    <NA>  54        0        1                0                 0                 1              0                    20                      7           0             0
#> case_105    <NA>  32        0        0                0                 0                 1              0                    22                      6           0             0
#> case_106 Recover  36        1        0                0                 0                 0              1                    30                      6           0             0
#> case_107   Death  91        0        0                0                 0                 0              1                    21                     10           0             0
#> case_108    <NA>  84        0        0                0                 0                 1              0                    13                      3           0             0
#> case_109    <NA>  62        0        0                0                 0                 1              0                     6                      0           0             1
#> case_11    Death  64        1        0                0                 0                 1              0                     6                      2           1             1
#> case_110    <NA>  53        1        0                0                 0                 0              1                     9                      4           0             0
#> case_111   Death  56        0        0                0                 0                 0              1                    22                      5           0             0
#> case_112    <NA>  69        0        0                0                 0                 0              1                    10                      1           0             0
#> case_113    <NA>  60        0        1                0                 0                 1              0                    13                      3           0             0
#> case_114    <NA>  50        0        1                0                 0                 1              0                    10                      6           0             1
#> case_115    <NA>  38        0        0                0                 0                 1              0                    16                      5           0             0
#> case_116 Recover  65        1        0                0                 0                 0              1                    21                      5           0             0
#> case_117 Recover  76        0        1                0                 0                 0              1                    28                      5           0             0
#> case_118    <NA>  49        0        0                1                 0                 0              0                    18                      5           0             1
#> case_119 Recover  36        0        0                1                 0                 0              0                    26                      4           0             0
#> case_12    Death  52        0        1                0                 1                 0              0                     7                      7           1             1
#> case_120    <NA>  60        0        0                1                 0                 0              0                    31                      7           1             0
#> case_121   Death  64        1        1                0                 0                 0              1                    30                      5           0             0
#> case_122    <NA>  38        0        0                0                 0                 1              0                     2                      3           0             1
#> case_123   Death  54        1        0                0                 0                 0              1                    16                      7           0             0
#> case_124 Recover  80        0        0                0                 0                 0              1                    17                      7           0             0
#> case_125 Recover  31        0        1                0                 0                 0              1                    20                      0           0             0
#> case_126    <NA>  80        1        0                0                 0                 0              1                    24                     10           0             0
#> case_127 Recover   4        0        0                0                 0                 0              1                     9                      4           0             0
#> case_128 Recover  58        1        0                0                 0                 0              1                    22                      7           0             0
#> case_129 Recover  69        1        0                0                 0                 0              1                    28                      4           0             0
#> case_13    Death  67        1        1                0                 1                 0              0                    12                      3           1             1
#> case_130    <NA>  69        1        0                0                 0                 0              1                    30                      1           0             0
#> case_131 Recover   9        1        0                0                 0                 0              1                    11                      1           0             0
#> case_132    <NA>  79        1        1                0                 0                 0              1                    18                      0           0             1
#> case_133 Recover   6        1        0                0                 0                 0              1                     2                      0           0             0
#> case_134 Recover  15        1        0                1                 0                 0              0                     7                      1           0             0
#> case_135   Death  61        1        1                0                 0                 0              1                    32                      3           0             0
#> case_136    <NA>  51        1        1                0                 0                 0              1                     8                      1           0             1
#> case_14  Recover   4        0        0                0                 1                 0              0                    10                      5           1             1
#> case_15     <NA>  61        0        1                1                 0                 0              0                    32                      4           1             0
#> case_16     <NA>  79        0        0                1                 0                 0              0                    38                      4           1             0
#> case_17    Death  74        1        0                0                 1                 0              0                    14                      6           1             1
#> case_18  Recover  66        1        0                0                 1                 0              0                    20                      4           1             1
#> case_19    Death  59        1        0                0                 1                 0              0                    67                      5           1             0
#> case_2     Death  27        1        0                0                 1                 0              0                    11                      4           1             1
#> case_20  Recover  55        1        0                0                 0                 0              1                    22                      4           1             1
#> case_21  Recover  67        1        0                0                 1                 0              0                    23                      1           1             1
#> case_22     <NA>  85        1        0                1                 0                 0              0                    31                      4           1             0
#> case_23  Recover  25        1        1                1                 0                 0              0                    46                      0           1             0
#> case_24    Death  64        0        0                0                 1                 0              0                     6                      5           1             1
#> case_25  Recover  62        1        0                0                 1                 0              0                    35                      0           1             0
#> case_26    Death  77        1        0                0                 1                 0              0                    11                      4           1             1
#> case_27  Recover  51        1        1                0                 0                 1              0                    37                     27           1             1
#> case_28     <NA>  79        0        0                0                 0                 1              0                    32                      0           1             0
#> case_29  Recover  76        1        1                0                 1                 0              0                    32                      4           1             0
#> case_3     Death  35        1        1                0                 0                 0              1                    31                     10           1             1
#> case_30  Recover  81        0        1                0                 1                 0              0                    23                      0           1             0
#> case_31     <NA>  70        0        0                1                 0                 0              0                    32                      4           1             0
#> case_32     <NA>  74        0        0                1                 0                 0              0                    38                      4           1             0
#> case_33  Recover  65        0        0                0                 0                 1              0                    24                      8           1             0
#> case_34    Death  74        1        0                0                 1                 0              0                    11                      5           1             1
#> case_35    Death  83        1        1                0                 1                 0              0                    38                      5           1             0
#> case_36  Recover  68        0        0                0                 1                 0              0                    17                      4           1             1
#> case_37  Recover  31        0        0                1                 0                 0              0                    45                      7           1             0
#> case_38     <NA>  56        0        0                1                 0                 0              0                    23                      7           1             0
#> case_39     <NA>  66        1        0                0                 0                 1              0                    13                      0           0             1
#> case_4      <NA>  45        1        1                1                 0                 0              0                    14                      8           1             1
#> case_40     <NA>  74        1        0                0                 0                 1              0                     6                      5           1             1
#> case_41     <NA>  54        1        1                0                 0                 1              0                    23                      6           1             1
#> case_42     <NA>  53        1        0                0                 1                 0              0                    11                      7           1             1
#> case_43    Death  86        1        0                0                 1                 0              0                    18                      3           1             1
#> case_44  Recover   7        1        1                0                 0                 0              1                     6                      0           0             1
#> case_45    Death  56        1        0                0                 1                 0              0                    86                      3           1             0
#> case_46    Death  77        0        1                1                 0                 0              0                     9                      0           1             1
#> case_47     <NA>  72        0        0                1                 0                 0              0                    32                      5           1             0
#> case_48     <NA>  65        1        0                0                 0                 1              0                    20                      6           1             1
#> case_49  Recover  38        1        0                0                 0                 1              0                    28                      5           1             0
#> case_5   Recover  48        1        1                1                 0                 0              0                    57                     11           1             0
#> case_50  Recover  34        1        0                0                 0                 0              1                    17                      3           1             0
#> case_51  Recover  65        1        0                0                 0                 0              1                    15                      2           0             1
#> case_52     <NA>  64        0        1                0                 0                 1              0                    17                      7           1             1
#> case_53    Death  62        0        1                0                 0                 1              0                    22                      7           1             1
#> case_54     <NA>  75        0        0                0                 0                 1              0                    37                      7           1             0
#> case_55  Recover  79        0        0                0                 0                 1              0                    18                      7           0             0
#> case_56     <NA>  73        1        0                0                 1                 0              0                    11                      6           1             1
#> case_57  Recover  54        1        0                0                 1                 0              0                    13                      4           0             1
#> case_58  Recover  78        1        0                0                 1                 0              0                    31                      4           1             0
#> case_59  Recover  50        0        0                1                 0                 0              0                    37                      7           1             0
#> case_6     Death  32        1        1                1                 0                 0              0                    36                      7           1             1
#> case_60  Recover  26        0        0                1                 0                 0              0                    18                      4           0             1
#> case_61    Death  60        0        0                1                 0                 0              0                    10                      4           1             1
#> case_62     <NA>  68        0        1                0                 0                 1              0                    11                      5           1             1
#> case_63     <NA>  60        0        0                0                 0                 0              1                    26                      3           0             0
#> case_64  Recover  56        0        0                1                 0                 0              0                    16                      4           0             0
#> case_65  Recover  21        0        1                1                 0                 0              0                    37                      4           1             0
#> case_66     <NA>  72        0        0                1                 0                 0              0                    32                      3           0             0
#> case_67     <NA>  56        0        0                0                 0                 1              0                    31                      3           0             0
#> case_68     <NA>  57        0        0                0                 0                 1              0                     6                     10           0             1
#> case_69     <NA>  62        0        0                0                 0                 1              0                     8                      7           0             1
#> case_7     Death  83        1        0                1                 0                 0              0                    20                      9           1             1
#> case_70     <NA>  58        0        1                0                 0                 1              0                    10                     11           0             1
#> case_71     <NA>  72        0        1                0                 0                 1              0                    13                      6           0             0
#> case_72  Recover  47        1        0                0                 1                 0              0                    17                      5           0             0
#> case_73  Recover  69        0        0                0                 1                 0              0                     7                      4           0             1
#> case_74  Recover  54        0        0                0                 1                 0              0                    17                      6           1             0
#> case_75    Death  83        0        0                0                 1                 0              0                     6                      4           1             1
#> case_76    Death  55        0        0                0                 1                 0              0                     6                      7           1             1
#> case_77  Recover   2        0        0                0                 1                 0              0                    13                      7           1             0
#> case_78    Death  89        1        0                0                 1                 0              0                    17                      4           0             0
#> case_79  Recover  37        0        1                0                 0                 1              0                    16                      0           0             0
#> case_8     Death  38        1        0                0                 0                 1              0                    20                     11           1             1
#> case_80     <NA>  74        0        0                0                 0                 1              0                    22                      3           0             0
#> case_81    Death  86        0        0                0                 0                 1              0                    10                      6           0             1
#> case_82  Recover  41        0        0                0                 0                 1              0                     8                     10           0             1
#> case_83  Recover  38        0        0                0                 0                 0              1                    23                      4           0             0
#> case_84     <NA>  26        0        1                1                 0                 0              0                    13                      1           1             1
#> case_85     <NA>  80        1        1                0                 1                 0              0                    13                      7           0             1
#> case_86     <NA>  54        0        1                0                 0                 1              0                    10                      3           0             1
#> case_87    Death  69        0        0                0                 0                 1              0                     8                      3           0             1
#> case_88     <NA>   4        0        0                0                 0                 0              1                    11                      0           0             0
#> case_89  Recover  54        1        0                1                 0                 0              0                    36                      6           0             0
#> case_9      <NA>  67        1        0                0                 0                 1              0                     7                      0           1             1
#> case_90     <NA>  43        1        0                0                 0                 1              0                     9                      3           0             1
#> case_91  Recover  48        1        0                0                 0                 1              0                    16                      6           0             0
#> case_92     <NA>  66        1        1                0                 0                 1              0                    13                      7           0             1
#> case_93     <NA>  56        0        0                0                 0                 1              0                    17                      0           0             0
#> case_94  Recover  35        0        1                0                 0                 1              0                    13                      6           0             0
#> case_95     <NA>  37        0        0                0                 0                 1              0                     7                      4           1             1
#> case_96     <NA>  43        0        0                1                 0                 0              0                    37                      4           1             0
#> case_97  Recover  75        1        1                0                 1                 0              0                    22                      5           0             0
#> case_98    Death  76        0        0                0                 0                 1              0                    13                      3           0             1
#> case_99     <NA>  68        0        1                0                 0                 1              0                     6                      7           0             1

cat("NAs before imput: ", sum(is.na(dataset)), "\n")
#> NAs before imput:  277
summary(dataset$outcome)
#>   Death Recover    NA's 
#>      32      47      57

cat("NAs after imput: ", sum(is.na(dataset_complete)), "\n")
#> NAs after imput:  57
summary(dataset_complete$outcome)
#>   Death Recover    NA's 
#>      32      47      57

28.4 Test, train and validation data sets

For building the model, I am separating the imputed data frame into training and test data. Test data are the 57 cases with unknown outcome.

The training data will be further divided for validation of the models: 70% of the training data will be kept for model building and the remaining 30% will be used for model testing.

I am using the caret package for modeling:

train_index <- which(is.na(dataset_complete$outcome))
train_data <- dataset_complete[-train_index, ]     # 79x12
test_data  <- dataset_complete[train_index, -1]    # 57x11

library(caret)
#> Loading required package: lattice
set.seed(27)
val_index <- createDataPartition(train_data$outcome, p = 0.7, list=FALSE)
val_train_data <- train_data[val_index, ]
val_test_data  <- train_data[-val_index, ]
val_train_X <- val_train_data[,-1]
val_test_X <- val_test_data[,-1]

28.4.1 Decision trees

To get an idea about how each feature contributes to the prediction of the outcome, I first built a decision tree based on the training data using rpart and rattle.

library(rpart)
library(rattle)
#> Rattle: A free graphical interface for data science with R.
#> Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
#> Type 'rattle()' to shake, rattle, and roll your data.
library(rpart.plot)
library(RColorBrewer)

set.seed(27)
fit <- rpart(outcome ~ ., data = train_data, method = "class", 
             control = rpart.control(xval = 10, minbucket = 2, cp = 0), 
             parms = list(split = "information"))

fancyRpartPlot(fit)

This randomly generated decision tree shows that cases with an early outcome were most likely to die when they were 68 or older, when they also had an early onset and when they were sick for fewer than 13 days. If a person was not among the first cases and was younger than 52, they had a good chance of recovering, but if they were 82 or older, they were more likely to die from the flu.

28.4.2 Feature Importance

Not all of the features I created will be equally important to the model. The decision tree already gave me an idea of which features might be most important but I also want to estimate feature importance using a Random Forest approach with repeated cross validation.

# prepare training scheme
control <- trainControl(method = "repeatedcv", number = 10, repeats = 10)

# train the model
set.seed(27)
model <- train(outcome ~ ., data = train_data, method = "rf", 
               preProcess = NULL, trControl = control)

# estimate variable importance
importance <- varImp(model, scale=TRUE)

from <- c("age", "hospital1", "gender_f1", "province_Jiangsu1", 
          "province_Shanghai1", "province_Zhejiang1", "province_other1", 
          "days_onset_to_outcome", "days_onset_to_hospital", "early_onset1", 
          "early_outcome1" )

to <- c("Age", "Hospital", "Female", "Jiangsu", "Shanghai", "Zhejiang",
        "Other province", "Days onset to outcome", "Days onset to hospital", 
        "Early onset", "Early outcome" )

importance_df_1 <- importance$importance
importance_df_1$group <- rownames(importance_df_1)

importance_df_1$group <- mapvalues(importance_df_1$group, 
                                           from = from, 
                                           to = to)
f = importance_df_1[order(importance_df_1$Overall, decreasing = FALSE), "group"]

importance_df_2 <- importance_df_1
importance_df_2$Overall <- 0

importance_df <- rbind(importance_df_1, importance_df_2)

# setting factor levels
importance_df <- within(importance_df, group <- factor(group, levels = f))
importance_df_1 <- within(importance_df_1, group <- factor(group, levels = f))

ggplot() +
  geom_point(data = importance_df_1, aes(x = Overall, y = group, 
                                         color = group), size = 2) +
  geom_path(data = importance_df, aes(x = Overall, y = group, color = group, 
                                      group = group), size = 1) +
  scale_color_manual(values = rep(brewer.pal(1, "Set1")[1], 11)) +
  my_theme() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 0.5)) +
  labs(
    x = "Importance",
    y = "",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Scaled feature importance",
    caption = "\nDetermined with Random Forest and
    repeated cross validation (10 repeats, 10 times)"
  )
#> Warning in brewer.pal(1, "Set1"): minimal value for n is 3, returning requested palette with 3 different levels

This tells me that age is the most important determining factor for predicting disease outcome, followed by days between onset an outcome, early outcome and days between onset and hospitalization.

28.4.3 Feature Plot

Before I start actually building models, I want to check whether the distribution of feature values is comparable between training, validation and test datasets.

# tidy dataframe of 11 variables for plotting features for all datasets
# dataset_complete: 136x12. test + val_train + val_test
dataset_complete_gather <- dataset_complete %>%
  mutate(set = ifelse(rownames(dataset_complete) %in% rownames(test_data), 
                      "Test Data", 
               ifelse(rownames(dataset_complete) %in% rownames(val_train_data), 
                      "Validation Train Data",
               ifelse(rownames(dataset_complete) %in% rownames(val_test_data), 
                      "Validation Test Data", "NA"))),
         case_ID = rownames(.)) %>%
  gather(group, value, age:early_outcome)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped

# map values in group to more readable
from <- c("age", "hospital", "gender_f", "province_Jiangsu", "province_Shanghai", 
          "province_Zhejiang", "province_other", "days_onset_to_outcome", 
          "days_onset_to_hospital", "early_onset",  "early_outcome" )
to <- c("Age", "Hospital", "Female", "Jiangsu", "Shanghai", "Zhejiang", 
        "Other province", "Days onset to outcome", "Days onset to hospital", 
        "Early onset", "Early outcome" )
dataset_complete_gather$group <- mapvalues(dataset_complete_gather$group, 
                      from = from, 
                      to = to)

28.4.4 Plot distribution of features in each dataset

# plot features all datasets
ggplot(data = dataset_complete_gather, 
       aes(x = as.numeric(value), fill = outcome, color = outcome)) +
  geom_density(alpha = 0.2) +
  geom_rug() +
  scale_color_brewer(palette="Set1", na.value = "grey50") +
  scale_fill_brewer(palette="Set1", na.value = "grey50") +
  my_theme() +
  facet_wrap(set ~ group, ncol = 11, scales = "free") +
  labs(
    x = "Value",
    y = "Density",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Features for classifying outcome",
    caption = "\nDensity distribution of all features used for 
    classification of flu outcome."
  )

28.4.5 Plot 3 features vs outcome, all datasets

# plot three groups vs outcome
ggplot(subset(dataset_complete_gather, 
              group == "Age" | 
              group == "Days onset to hospital" | 
              group == "Days onset to outcome"), 
       aes(x=outcome, y=as.numeric(value), fill=set)) + 
  geom_boxplot() +
  my_theme() +
  scale_fill_brewer(palette="Set1", type = "div ") +
  facet_wrap( ~ group, ncol = 3, scales = "free") +
  labs(
    fill = "",
    x = "Outcome",
    y = "Value",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Features for classifying outcome",
    caption = "\nBoxplot of the features age, days from onset to 
    hospitalisation and days from onset to outcome."
  )

Luckily, the distributions looks reasonably similar between the validation and test data, except for a few outliers.

28.5 Comparing Machine Learning algorithms

Before I try to predict the outcome of the unknown cases, I am testing the models’ accuracy with the validation datasets on a couple of algorithms. I have chosen only a few more well known algorithms, but caret implements many more.

I have chosen to not do any preprocessing because I was worried that the different data distributions with continuous variables (e.g. age) and binary variables (i.e. 0, 1 classification of e.g. hospitalisation) would lead to problems.

Random Forest
GLM net
k-Nearest Neighbors
Penalized Discriminant Analysis
Stabilized Linear Discriminant Analysis
Nearest Shrunken Centroids
Single C5.0 Tree
Partial Least Squares

train_control <- trainControl(method = "repeatedcv", 
                              number = 10, 
                              repeats = 10, 
                              verboseIter = FALSE)

28.5.1 Random Forest

Random Forests predictions are based on the generation of multiple classification trees. This model classified 14 out of 23 cases correctly.

set.seed(27)
model_rf <- caret::train(outcome ~ .,
                             data = val_train_data,
                             method = "rf",
                             preProcess = NULL,
                             trControl = train_control)
model_rf
#> Random Forest 
#> 
#> 56 samples
#> 11 predictors
#>  2 classes: 'Death', 'Recover' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times) 
#> Summary of sample sizes: 51, 49, 50, 51, 49, 51, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy  Kappa
#>    2    0.687     0.340
#>    6    0.732     0.432
#>   11    0.726     0.423
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 6.

confusionMatrix(predict(model_rf, val_test_data[, -1]), val_test_data$outcome)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Death Recover
#>    Death       3       0
#>    Recover     6      14
#>                                         
#>                Accuracy : 0.739         
#>                  95% CI : (0.516, 0.898)
#>     No Information Rate : 0.609         
#>     P-Value [Acc > NIR] : 0.1421        
#>                                         
#>                   Kappa : 0.378         
#>                                         
#>  Mcnemar's Test P-Value : 0.0412        
#>                                         
#>             Sensitivity : 0.333         
#>             Specificity : 1.000         
#>          Pos Pred Value : 1.000         
#>          Neg Pred Value : 0.700         
#>              Prevalence : 0.391         
#>          Detection Rate : 0.130         
#>    Detection Prevalence : 0.130         
#>       Balanced Accuracy : 0.667         
#>                                         
#>        'Positive' Class : Death         
#>

28.5.2 GLM net

Lasso or elastic net regularization for generalized linear model regression are based on linear regression models and is useful when we have feature correlation in our model.

This model classified 13 out of 23 cases correctly.

set.seed(27)
model_glmnet <- caret::train(outcome ~ .,
                             data = val_train_data,
                             method = "glmnet",
                             preProcess = NULL,
                             trControl = train_control)
model_glmnet
#> glmnet 
#> 
#> 56 samples
#> 11 predictors
#>  2 classes: 'Death', 'Recover' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times) 
#> Summary of sample sizes: 51, 49, 50, 51, 49, 51, ... 
#> Resampling results across tuning parameters:
#> 
#>   alpha  lambda    Accuracy  Kappa
#>   0.10   0.000491  0.671     0.324
#>   0.10   0.004909  0.669     0.318
#>   0.10   0.049093  0.680     0.339
#>   0.55   0.000491  0.671     0.324
#>   0.55   0.004909  0.671     0.322
#>   0.55   0.049093  0.695     0.365
#>   1.00   0.000491  0.671     0.324
#>   1.00   0.004909  0.672     0.326
#>   1.00   0.049093  0.714     0.414
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were alpha = 1 and lambda = 0.0491.

confusionMatrix(predict(model_glmnet, val_test_data[, -1]), val_test_data$outcome)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Death Recover
#>    Death       3       2
#>    Recover     6      12
#>                                         
#>                Accuracy : 0.652         
#>                  95% CI : (0.427, 0.836)
#>     No Information Rate : 0.609         
#>     P-Value [Acc > NIR] : 0.422         
#>                                         
#>                   Kappa : 0.207         
#>                                         
#>  Mcnemar's Test P-Value : 0.289         
#>                                         
#>             Sensitivity : 0.333         
#>             Specificity : 0.857         
#>          Pos Pred Value : 0.600         
#>          Neg Pred Value : 0.667         
#>              Prevalence : 0.391         
#>          Detection Rate : 0.130         
#>    Detection Prevalence : 0.217         
#>       Balanced Accuracy : 0.595         
#>                                         
#>        'Positive' Class : Death         
#>

28.5.3 k-Nearest Neighbors

k-nearest neighbors predicts based on point distances with predefined constants. This model classified 14 out of 23 cases correctly.

set.seed(27)
model_kknn <- caret::train(outcome ~ .,
                             data = val_train_data,
                             method = "kknn",
                             preProcess = NULL,
                             trControl = train_control)
model_kknn
#> k-Nearest Neighbors 
#> 
#> 56 samples
#> 11 predictors
#>  2 classes: 'Death', 'Recover' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times) 
#> Summary of sample sizes: 51, 49, 50, 51, 49, 51, ... 
#> Resampling results across tuning parameters:
#> 
#>   kmax  Accuracy  Kappa
#>   5     0.666     0.313
#>   7     0.653     0.274
#>   9     0.648     0.263
#> 
#> Tuning parameter 'distance' was held constant at a value of 2
#> Tuning parameter 'kernel' was held constant at a value of optimal
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were kmax = 5, distance = 2 and kernel = optimal.

confusionMatrix(predict(model_kknn, val_test_data[, -1]), val_test_data$outcome)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Death Recover
#>    Death       5       3
#>    Recover     4      11
#>                                         
#>                Accuracy : 0.696         
#>                  95% CI : (0.471, 0.868)
#>     No Information Rate : 0.609         
#>     P-Value [Acc > NIR] : 0.264         
#>                                         
#>                   Kappa : 0.348         
#>                                         
#>  Mcnemar's Test P-Value : 1.000         
#>                                         
#>             Sensitivity : 0.556         
#>             Specificity : 0.786         
#>          Pos Pred Value : 0.625         
#>          Neg Pred Value : 0.733         
#>              Prevalence : 0.391         
#>          Detection Rate : 0.217         
#>    Detection Prevalence : 0.348         
#>       Balanced Accuracy : 0.671         
#>                                         
#>        'Positive' Class : Death         
#>

28.5.4 Penalized Discriminant Analysis

Penalized Discriminant Analysis is the penalized linear discriminant analysis and is also useful for when we have highly correlated features. This model classified 14 out of 23 cases correctly.

set.seed(27)
model_pda <- caret::train(outcome ~ .,
                             data = val_train_data,
                             method = "pda",
                             preProcess = NULL,
                             trControl = train_control)
model_pda
#> Penalized Discriminant Analysis 
#> 
#> 56 samples
#> 11 predictors
#>  2 classes: 'Death', 'Recover' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times) 
#> Summary of sample sizes: 51, 49, 50, 51, 49, 51, ... 
#> Resampling results across tuning parameters:
#> 
#>   lambda  Accuracy  Kappa
#>   0e+00     NaN       NaN
#>   1e-04   0.681     0.343
#>   1e-01   0.681     0.343
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was lambda = 1e-04.

confusionMatrix(predict(model_pda, val_test_data[, -1]), val_test_data$outcome)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Death Recover
#>    Death       3       2
#>    Recover     6      12
#>                                         
#>                Accuracy : 0.652         
#>                  95% CI : (0.427, 0.836)
#>     No Information Rate : 0.609         
#>     P-Value [Acc > NIR] : 0.422         
#>                                         
#>                   Kappa : 0.207         
#>                                         
#>  Mcnemar's Test P-Value : 0.289         
#>                                         
#>             Sensitivity : 0.333         
#>             Specificity : 0.857         
#>          Pos Pred Value : 0.600         
#>          Neg Pred Value : 0.667         
#>              Prevalence : 0.391         
#>          Detection Rate : 0.130         
#>    Detection Prevalence : 0.217         
#>       Balanced Accuracy : 0.595         
#>                                         
#>        'Positive' Class : Death         
#>

28.5.5 Stabilized Linear Discriminant Analysis

Stabilized Linear Discriminant Analysis is designed for high-dimensional data and correlated co-variables. This model classified 15 out of 23 cases correctly.

set.seed(27)
model_slda <- caret::train(outcome ~ .,
                             data = val_train_data,
                             method = "slda",
                             preProcess = NULL,
                             trControl = train_control)
model_slda
#> Stabilized Linear Discriminant Analysis 
#> 
#> 56 samples
#> 11 predictors
#>  2 classes: 'Death', 'Recover' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times) 
#> Summary of sample sizes: 51, 49, 50, 51, 49, 51, ... 
#> Resampling results:
#> 
#>   Accuracy  Kappa
#>   0.682     0.358

confusionMatrix(predict(model_slda, val_test_data[, -1]), val_test_data$outcome)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Death Recover
#>    Death       3       3
#>    Recover     6      11
#>                                         
#>                Accuracy : 0.609         
#>                  95% CI : (0.385, 0.803)
#>     No Information Rate : 0.609         
#>     P-Value [Acc > NIR] : 0.590         
#>                                         
#>                   Kappa : 0.127         
#>                                         
#>  Mcnemar's Test P-Value : 0.505         
#>                                         
#>             Sensitivity : 0.333         
#>             Specificity : 0.786         
#>          Pos Pred Value : 0.500         
#>          Neg Pred Value : 0.647         
#>              Prevalence : 0.391         
#>          Detection Rate : 0.130         
#>    Detection Prevalence : 0.261         
#>       Balanced Accuracy : 0.560         
#>                                         
#>        'Positive' Class : Death         
#>

28.5.6 Nearest Shrunken Centroids

Nearest Shrunken Centroids computes a standardized centroid for each class and shrinks each centroid toward the overall centroid for all classes. This model classified 15 out of 23 cases correctly.

set.seed(27)
model_pam <- caret::train(outcome ~ .,
                             data = val_train_data,
                             method = "pam",
                             preProcess = NULL,
                             trControl = train_control)
#> 12345678910111213141516171819202122232425262728293011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
model_pam
#> Nearest Shrunken Centroids 
#> 
#> 56 samples
#> 11 predictors
#>  2 classes: 'Death', 'Recover' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times) 
#> Summary of sample sizes: 51, 49, 50, 51, 49, 51, ... 
#> Resampling results across tuning parameters:
#> 
#>   threshold  Accuracy  Kappa
#>   0.142      0.709     0.416
#>   2.065      0.714     0.382
#>   3.987      0.590     0.000
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was threshold = 2.06.

confusionMatrix(predict(model_pam, val_test_data[, -1]), val_test_data$outcome)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Death Recover
#>    Death       1       3
#>    Recover     8      11
#>                                         
#>                Accuracy : 0.522         
#>                  95% CI : (0.306, 0.732)
#>     No Information Rate : 0.609         
#>     P-Value [Acc > NIR] : 0.857         
#>                                         
#>                   Kappa : -0.115        
#>                                         
#>  Mcnemar's Test P-Value : 0.228         
#>                                         
#>             Sensitivity : 0.1111        
#>             Specificity : 0.7857        
#>          Pos Pred Value : 0.2500        
#>          Neg Pred Value : 0.5789        
#>              Prevalence : 0.3913        
#>          Detection Rate : 0.0435        
#>    Detection Prevalence : 0.1739        
#>       Balanced Accuracy : 0.4484        
#>                                         
#>        'Positive' Class : Death         
#>

28.5.7 Single C5.0 Tree

C5.0 is another tree-based modeling algorithm. This model classified 15 out of 23 cases correctly.

set.seed(27)
model_C5.0Tree <- caret::train(outcome ~ .,
                             data = val_train_data,
                             method = "C5.0Tree",
                             preProcess = NULL,
                             trControl = train_control)
model_C5.0Tree
#> Single C5.0 Tree 
#> 
#> 56 samples
#> 11 predictors
#>  2 classes: 'Death', 'Recover' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times) 
#> Summary of sample sizes: 51, 49, 50, 51, 49, 51, ... 
#> Resampling results:
#> 
#>   Accuracy  Kappa
#>   0.696     0.359

confusionMatrix(predict(model_C5.0Tree, val_test_data[, -1]), val_test_data$outcome)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Death Recover
#>    Death       4       1
#>    Recover     5      13
#>                                         
#>                Accuracy : 0.739         
#>                  95% CI : (0.516, 0.898)
#>     No Information Rate : 0.609         
#>     P-Value [Acc > NIR] : 0.142         
#>                                         
#>                   Kappa : 0.405         
#>                                         
#>  Mcnemar's Test P-Value : 0.221         
#>                                         
#>             Sensitivity : 0.444         
#>             Specificity : 0.929         
#>          Pos Pred Value : 0.800         
#>          Neg Pred Value : 0.722         
#>              Prevalence : 0.391         
#>          Detection Rate : 0.174         
#>    Detection Prevalence : 0.217         
#>       Balanced Accuracy : 0.687         
#>                                         
#>        'Positive' Class : Death         
#>

28.5.8 Partial Least Squares

modeling with correlated features. This model classified 15 out of 23 cases correctly.

set.seed(27)
model_pls <- caret::train(outcome ~ .,
                             data = val_train_data,
                             method = "pls",
                             preProcess = NULL,
                             trControl = train_control)
model_pls
#> Partial Least Squares 
#> 
#> 56 samples
#> 11 predictors
#>  2 classes: 'Death', 'Recover' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (10 fold, repeated 10 times) 
#> Summary of sample sizes: 51, 49, 50, 51, 49, 51, ... 
#> Resampling results across tuning parameters:
#> 
#>   ncomp  Accuracy  Kappa
#>   1      0.663     0.315
#>   2      0.676     0.341
#>   3      0.691     0.376
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was ncomp = 3.

confusionMatrix(predict(model_pls, val_test_data[, -1]), val_test_data$outcome)
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Death Recover
#>    Death       2       3
#>    Recover     7      11
#>                                         
#>                Accuracy : 0.565         
#>                  95% CI : (0.345, 0.768)
#>     No Information Rate : 0.609         
#>     P-Value [Acc > NIR] : 0.742         
#>                                         
#>                   Kappa : 0.009         
#>                                         
#>  Mcnemar's Test P-Value : 0.343         
#>                                         
#>             Sensitivity : 0.222         
#>             Specificity : 0.786         
#>          Pos Pred Value : 0.400         
#>          Neg Pred Value : 0.611         
#>              Prevalence : 0.391         
#>          Detection Rate : 0.087         
#>    Detection Prevalence : 0.217         
#>       Balanced Accuracy : 0.504         
#>                                         
#>        'Positive' Class : Death         
#>

28.6 Comparing accuracy of models

All models were similarly accurate.

28.6.1 Summary Accuracy and Kappa

# Create a list of models
models <- list(rf       = model_rf, 
               glmnet   = model_glmnet, 
               kknn     = model_kknn, 
               pda      = model_pda, 
               slda     = model_slda,
               pam      = model_pam, 
               C5.0Tree = model_C5.0Tree, 
               pls      = model_pls)

# Resample the models
resample_results <- resamples(models)

# Generate a summary
summary(resample_results, metric = c("Kappa", "Accuracy"))
#> 
#> Call:
#> summary.resamples(object = resample_results, metric = c("Kappa", "Accuracy"))
#> 
#> Models: rf, glmnet, kknn, pda, slda, pam, C5.0Tree, pls 
#> Number of resamples: 100 
#> 
#> Kappa 
#>            Min. 1st Qu. Median  Mean 3rd Qu. Max. NA's
#> rf       -0.500   0.167  0.545 0.432   0.667    1    0
#> glmnet   -0.667   0.167  0.545 0.414   0.667    1    0
#> kknn     -0.667   0.125  0.333 0.313   0.615    1    0
#> pda      -0.667   0.142  0.333 0.343   0.615    1    0
#> slda     -0.667   0.167  0.367 0.358   0.615    1    0
#> pam      -0.667   0.167  0.545 0.382   0.571    1    0
#> C5.0Tree -0.667   0.167  0.333 0.359   0.615    1    0
#> pls      -0.667   0.167  0.333 0.376   0.667    1    0
#> 
#> Accuracy 
#>           Min. 1st Qu. Median  Mean 3rd Qu. Max. NA's
#> rf       0.333   0.600  0.800 0.732   0.833    1    0
#> glmnet   0.167   0.600  0.800 0.714   0.833    1    0
#> kknn     0.167   0.600  0.667 0.666   0.800    1    0
#> pda      0.167   0.593  0.667 0.681   0.800    1    0
#> slda     0.167   0.600  0.667 0.682   0.800    1    0
#> pam      0.200   0.600  0.800 0.714   0.800    1    0
#> C5.0Tree 0.167   0.600  0.667 0.696   0.800    1    0
#> pls      0.167   0.600  0.667 0.691   0.833    1    0

bwplot(resample_results , metric = c("Kappa","Accuracy"))

28.6.2 Combined results of predicting validation test samples

To compare the predictions from all models, I summed up the prediction probabilities for Death and Recovery from all models and calculated the log2 of the ratio between the summed probabilities for Recovery by the summed probabilities for Death. All cases with a log2 ratio bigger than 1.5 were defined as Recover, cases with a $log2$ ratio below -1.5 were defined as Death, and the remaining cases were defined as uncertain.

results <- data.frame(
  randomForest = predict(model_rf, newdata = val_test_data[, -1], type="prob"),
  glmnet = predict(model_glmnet, newdata = val_test_data[, -1], type="prob"),
  kknn = predict(model_kknn, newdata = val_test_data[, -1], type="prob"),
  pda = predict(model_pda, newdata = val_test_data[, -1], type="prob"),
  slda = predict(model_slda, newdata = val_test_data[, -1], type="prob"),
  pam = predict(model_pam, newdata = val_test_data[, -1], type="prob"),
  C5.0Tree = predict(model_C5.0Tree, newdata = val_test_data[, -1], type="prob"),
  pls = predict(model_pls, newdata = val_test_data[, -1], type="prob"))

results$sum_Death <- rowSums(results[, grep("Death", colnames(results))])
results$sum_Recover <- rowSums(results[, grep("Recover", colnames(results))])
results$log2_ratio <- log2(results$sum_Recover/results$sum_Death)
results$true_outcome <- val_test_data$outcome
results$pred_outcome <- ifelse(results$log2_ratio > 1.5, "Recover", 
              ifelse(results$log2_ratio < -1.5, "Death", "uncertain"))
results$prediction <- ifelse(results$pred_outcome == results$true_outcome, "CORRECT", 
              ifelse(results$pred_outcome == "uncertain", "uncertain", "wrong"))
results[, -c(1:16)]
#>          sum_Death sum_Recover log2_ratio true_outcome pred_outcome prediction
#> case_10      4.237        3.76    -0.1709        Death    uncertain  uncertain
#> case_11      5.181        2.82    -0.8778        Death    uncertain  uncertain
#> case_116     2.412        5.59     1.2123      Recover    uncertain  uncertain
#> case_12      5.219        2.78    -0.9085        Death    uncertain  uncertain
#> case_121     2.356        5.64     1.2606        Death    uncertain  uncertain
#> case_127     0.694        7.31     3.3972      Recover      Recover    CORRECT
#> case_131     0.685        7.31     3.4164      Recover      Recover    CORRECT
#> case_133     0.649        7.35     3.5024      Recover      Recover    CORRECT
#> case_135     2.027        5.97     1.5589        Death      Recover      wrong
#> case_2       2.161        5.84     1.4337        Death    uncertain  uncertain
#> case_20      3.144        4.86     0.6272      Recover    uncertain  uncertain
#> case_30      4.493        3.51    -0.3576      Recover    uncertain  uncertain
#> case_45      2.594        5.41     1.0590        Death    uncertain  uncertain
#> case_5       3.019        4.98     0.7227      Recover    uncertain  uncertain
#> case_55      3.925        4.08     0.0543      Recover    uncertain  uncertain
#> case_59      1.894        6.11     1.6886      Recover      Recover    CORRECT
#> case_72      2.545        5.46     1.1002      Recover    uncertain  uncertain
#> case_74      2.339        5.66     1.2748      Recover    uncertain  uncertain
#> case_77      0.845        7.15     3.0819      Recover      Recover    CORRECT
#> case_8       2.237        5.76     1.3650        Death    uncertain  uncertain
#> case_89      1.712        6.29     1.8772      Recover      Recover    CORRECT
#> case_97      2.959        5.04     0.7687      Recover    uncertain  uncertain
#> case_98      4.798        3.20    -0.5833        Death    uncertain  uncertain

All predictions based on all models were either correct or uncertain.

28.7 Predicting unknown outcomes

The above models will now be used to predict the outcome of cases with unknown fate.


train_control <- trainControl(method = "repeatedcv", 
                              number = 10, 
                              repeats = 10, 
                              verboseIter = FALSE)

set.seed(27)
model_rf <- caret::train(outcome ~ .,
                             data = train_data,
                             method = "rf",
                             preProcess = NULL,
                             trControl = train_control)
model_glmnet <- caret::train(outcome ~ .,
                             data = train_data,
                             method = "glmnet",
                             preProcess = NULL,
                             trControl = train_control)
model_kknn <- caret::train(outcome ~ .,
                             data = train_data,
                             method = "kknn",
                             preProcess = NULL,
                             trControl = train_control)
model_pda <- caret::train(outcome ~ .,
                             data = train_data,
                             method = "pda",
                             preProcess = NULL,
                             trControl = train_control)
#> Warning: predictions failed for Fold01.Rep01: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold02.Rep01: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold03.Rep01: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold04.Rep01: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold05.Rep01: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold06.Rep01: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold07.Rep01: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold08.Rep01: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold09.Rep01: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold10.Rep01: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold01.Rep02: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold02.Rep02: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold03.Rep02: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold04.Rep02: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold05.Rep02: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold06.Rep02: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold07.Rep02: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold08.Rep02: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold09.Rep02: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold10.Rep02: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold01.Rep03: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold02.Rep03: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold03.Rep03: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold04.Rep03: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold05.Rep03: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold06.Rep03: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold07.Rep03: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold08.Rep03: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold09.Rep03: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold10.Rep03: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold01.Rep04: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold02.Rep04: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold03.Rep04: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold04.Rep04: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold05.Rep04: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold06.Rep04: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold07.Rep04: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold08.Rep04: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold09.Rep04: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold10.Rep04: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold01.Rep05: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold02.Rep05: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold03.Rep05: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold04.Rep05: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold05.Rep05: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold06.Rep05: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold07.Rep05: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold08.Rep05: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold09.Rep05: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold10.Rep05: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold01.Rep06: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold02.Rep06: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold03.Rep06: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold04.Rep06: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold05.Rep06: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold06.Rep06: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold07.Rep06: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold08.Rep06: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold09.Rep06: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold10.Rep06: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold01.Rep07: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold02.Rep07: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold03.Rep07: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold04.Rep07: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold05.Rep07: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold06.Rep07: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold07.Rep07: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold08.Rep07: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold09.Rep07: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold10.Rep07: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold01.Rep08: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold02.Rep08: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold03.Rep08: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold04.Rep08: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold05.Rep08: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold06.Rep08: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold07.Rep08: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold08.Rep08: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold09.Rep08: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold10.Rep08: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold01.Rep09: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold02.Rep09: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold03.Rep09: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold04.Rep09: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold05.Rep09: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold06.Rep09: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold07.Rep09: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold08.Rep09: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold09.Rep09: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold10.Rep09: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold01.Rep10: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold02.Rep10: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold03.Rep10: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold04.Rep10: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold05.Rep10: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold06.Rep10: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold07.Rep10: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold08.Rep10: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold09.Rep10: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning: predictions failed for Fold10.Rep10: lambda=0e+00 Error in mindist[l] <- ndist[l] : 
#>   NAs are not allowed in subscripted assignments
#> Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, : There were missing values in resampled performance measures.
#> Warning in train.default(x, y, weights = w, ...): missing values found in aggregated results
model_slda <- caret::train(outcome ~ .,
                             data = train_data,
                             method = "slda",
                             preProcess = NULL,
                             trControl = train_control)
model_pam <- caret::train(outcome ~ .,
                             data = train_data,
                             method = "pam",
                             preProcess = NULL,
                             trControl = train_control)
#> 12345678910111213141516171819202122232425262728293011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111
model_C5.0Tree <- caret::train(outcome ~ .,
                             data = train_data,
                             method = "C5.0Tree",
                             preProcess = NULL,
                             trControl = train_control)
model_pls <- caret::train(outcome ~ .,
                             data = train_data,
                             method = "pls",
                             preProcess = NULL,
                             trControl = train_control)

models <- list(rf = model_rf, 
               glmnet   = model_glmnet, 
               kknn     = model_kknn, 
               pda      = model_pda, 
               slda     = model_slda,
               pam      = model_pam, 
               C5.0Tree = model_C5.0Tree, 
               pls      = model_pls)

# Resample the models
resample_results <- resamples(models)

bwplot(resample_results , metric = c("Kappa","Accuracy"))

Here again, the accuracy is similar in all models.

28.7.1 Final results

The final results are calculated as described above.

results <- data.frame(
  randomForest = predict(model_rf, newdata = test_data, type="prob"),
  glmnet = predict(model_glmnet, newdata = test_data, type="prob"),
  kknn = predict(model_kknn, newdata = test_data, type="prob"),
  pda = predict(model_pda, newdata = test_data, type="prob"),
  slda = predict(model_slda, newdata = test_data, type="prob"),
  pam = predict(model_pam, newdata = test_data, type="prob"),
  C5.0Tree = predict(model_C5.0Tree, newdata = test_data, type="prob"),
  pls = predict(model_pls, newdata = test_data, type="prob"))

results$sum_Death <- rowSums(results[, grep("Death", colnames(results))])
results$sum_Recover <- rowSums(results[, grep("Recover", colnames(results))])
results$log2_ratio <- log2(results$sum_Recover/results$sum_Death)
results$predicted_outcome <- ifelse(results$log2_ratio > 1.5, "Recover", 
                                    ifelse(results$log2_ratio < -1.5, "Death", 
                                           "uncertain"))
results[, -c(1:16)]
#>          sum_Death sum_Recover log2_ratio predicted_outcome
#> case_100     1.854        6.15     1.7287           Recover
#> case_101     5.432        2.57    -1.0807         uncertain
#> case_102     2.806        5.19     0.8887         uncertain
#> case_103     2.342        5.66     1.2729         uncertain
#> case_104     1.744        6.26     1.8432           Recover
#> case_105     0.955        7.04     2.8828           Recover
#> case_108     4.489        3.51    -0.3543         uncertain
#> case_109     4.515        3.48    -0.3736         uncertain
#> case_110     2.411        5.59     1.2132         uncertain
#> case_112     2.632        5.37     1.0281         uncertain
#> case_113     2.198        5.80     1.4004         uncertain
#> case_114     3.339        4.66     0.4810         uncertain
#> case_115     1.112        6.89     2.6307           Recover
#> case_118     2.778        5.22     0.9109         uncertain
#> case_120     2.213        5.79     1.3868         uncertain
#> case_122     3.235        4.77     0.5590         uncertain
#> case_126     3.186        4.81     0.5952         uncertain
#> case_130     2.300        5.70     1.3091         uncertain
#> case_132     4.473        3.53    -0.3427         uncertain
#> case_136     3.281        4.72     0.5243         uncertain
#> case_15      2.270        5.73     1.3355         uncertain
#> case_16      2.820        5.18     0.8772         uncertain
#> case_22      4.779        3.22    -0.5689         uncertain
#> case_28      2.862        5.14     0.8445         uncertain
#> case_31      2.412        5.59     1.2117         uncertain
#> case_32      2.591        5.41     1.0616         uncertain
#> case_38      2.060        5.94     1.5280           Recover
#> case_39      4.749        3.25    -0.5466         uncertain
#> case_4       5.342        2.66    -1.0074         uncertain
#> case_40      6.550        1.45    -2.1750             Death
#> case_41      4.611        3.39    -0.4441         uncertain
#> case_42      5.570        2.43    -1.1966         uncertain
#> case_47      2.563        5.44     1.0850         uncertain
#> case_48      4.850        3.15    -0.6224         uncertain
#> case_52      4.709        3.29    -0.5173         uncertain
#> case_54      2.718        5.28     0.9586         uncertain
#> case_56      6.394        1.61    -1.9933             Death
#> case_62      6.048        1.95    -1.6319             Death
#> case_63      2.337        5.66     1.2766         uncertain
#> case_66      2.176        5.82     1.4205         uncertain
#> case_67      1.893        6.11     1.6895           Recover
#> case_68      3.907        4.09     0.0672         uncertain
#> case_69      4.465        3.53    -0.3370         uncertain
#> case_70      3.885        4.12     0.0833         uncertain
#> case_71      2.524        5.48     1.1172         uncertain
#> case_80      2.759        5.24     0.9261         uncertain
#> case_84      3.661        4.34     0.2448         uncertain
#> case_85      4.921        3.08    -0.6762         uncertain
#> case_86      3.563        4.44     0.3164         uncertain
#> case_88      0.566        7.43     3.7160           Recover
#> case_9       5.981        2.02    -1.5665             Death
#> case_90      3.570        4.43     0.3116         uncertain
#> case_92      4.664        3.34    -0.4835         uncertain
#> case_93      2.056        5.94     1.5319           Recover
#> case_95      4.495        3.50    -0.3589         uncertain
#> case_96      1.313        6.69     2.3482           Recover
#> case_99      4.799        3.20    -0.5839         uncertain
write.table(results, "results_prediction_unknown_outcome_ML_part1.txt", 
            col.names = T, sep = "\t")

results %>% 
  filter(predicted_outcome == "Recover") %>% 
  select(-c(1:16))
#>   sum_Death sum_Recover log2_ratio predicted_outcome
#> 1     1.854        6.15       1.73           Recover
#> 2     1.744        6.26       1.84           Recover
#> 3     0.955        7.04       2.88           Recover
#> 4     1.112        6.89       2.63           Recover
#> 5     2.060        5.94       1.53           Recover
#> 6     1.893        6.11       1.69           Recover
#> 7     0.566        7.43       3.72           Recover
#> 8     2.056        5.94       1.53           Recover
#> 9     1.313        6.69       2.35           Recover

28.7.2 Predicted outcome

results %>% 
  group_by(predicted_outcome) %>% 
  summarize(n = n())
#> # A tibble: 3 x 2
#>   predicted_outcome     n
#>   <chr>             <int>
#> 1 Death                 4
#> 2 Recover               9
#> 3 uncertain            44

From 57 cases, 14 were defined as Recover, 3 as Death and 40 as uncertain.

results_combined <- merge(results[, -c(1:16)],
                          fluH7N9_china_2013[which(fluH7N9_china_2013$case_ID 
                                                   %in% rownames(results)), ], 
                          by.x = "row.names", by.y = "case_ID")
# results_combined <- results_combined[, -c(2, 3, 8, 9)]
results_combined <- results_combined[, -c(1, 2, 3, 9, 10)]
results_combined
#>    log2_ratio predicted_outcome case_id date_of_onset date_of_hospitalisation gender age  province
#> 1      1.7287           Recover     100    2013-04-16                    <NA>      m  58  Zhejiang
#> 2     -1.0807         uncertain     101    2013-04-13                    <NA>      f  79  Zhejiang
#> 3      0.8887         uncertain     102    2013-04-12                    <NA>      m  81  Zhejiang
#> 4      1.2729         uncertain     103    2013-04-13              2013-04-19      m  68   Jiangsu
#> 5      1.8432           Recover     104    2013-04-16                    <NA>      f  54  Zhejiang
#> 6      2.8828           Recover     105    2013-04-14                    <NA>      m  32  Zhejiang
#> 7     -0.3543         uncertain     108    2013-04-15                    <NA>      m  84  Zhejiang
#> 8     -0.3736         uncertain     109    2013-04-15                    <NA>      m  62  Zhejiang
#> 9      1.2132         uncertain     110    2013-04-12              2013-04-16      m  53    Taiwan
#> 10     1.0281         uncertain     112    2013-04-17                    <NA>      m  69   Jiangxi
#> 11     1.4004         uncertain     113    2013-04-15                    <NA>      f  60  Zhejiang
#> 12     0.4810         uncertain     114    2013-04-18                    <NA>      f  50  Zhejiang
#> 13     2.6307           Recover     115    2013-04-17                    <NA>      m  38  Zhejiang
#> 14     0.9109         uncertain     118    2013-04-17                    <NA>      m  49   Jiangsu
#> 15     1.3868         uncertain     120    2013-03-08                    <NA>      m  60   Jiangsu
#> 16     0.5590         uncertain     122    2013-04-18                    <NA>      m  38  Zhejiang
#> 17     0.5952         uncertain     126    2013-04-17              2013-04-27      m  80    Fujian
#> 18     1.3091         uncertain     130    2013-04-29              2013-04-30      m  69    Fujian
#> 19    -0.3427         uncertain     132    2013-05-03              2013-05-03      f  79   Jiangxi
#> 20     0.5243         uncertain     136    2013-07-27              2013-07-28      f  51 Guangdong
#> 21     1.3355         uncertain      15    2013-03-20                    <NA>      f  61   Jiangsu
#> 22     0.8772         uncertain      16    2013-03-21                    <NA>      m  79   Jiangsu
#> 23    -0.5689         uncertain      22    2013-03-28              2013-04-01      m  85   Jiangsu
#> 24     0.8445         uncertain      28    2013-03-29                    <NA>      m  79  Zhejiang
#> 25     1.2117         uncertain      31    2013-03-29                    <NA>      m  70   Jiangsu
#> 26     1.0616         uncertain      32    2013-04-02                    <NA>      m  74   Jiangsu
#> 27     1.5280           Recover      38    2013-04-03                    <NA>      m  56   Jiangsu
#> 28    -0.5466         uncertain      39    2013-04-08              2013-04-08      m  66  Zhejiang
#> 29    -1.0074         uncertain       4    2013-03-19              2013-03-27      f  45   Jiangsu
#> 30    -2.1750             Death      40    2013-04-06              2013-04-11      m  74  Zhejiang
#> 31    -0.4441         uncertain      41    2013-04-06              2013-04-12      f  54  Zhejiang
#> 32    -1.1966         uncertain      42    2013-04-03              2013-04-10      m  53  Shanghai
#> 33     1.0850         uncertain      47    2013-04-01                    <NA>      m  72   Jiangsu
#> 34    -0.6224         uncertain      48    2013-04-03              2013-04-09      m  65  Zhejiang
#> 35    -0.5173         uncertain      52    2013-04-06                    <NA>      f  64  Zhejiang
#> 36     0.9586         uncertain      54    2013-04-06                    <NA>      m  75  Zhejiang
#> 37    -1.9933             Death      56    2013-04-05              2013-04-11      m  73  Shanghai
#> 38    -1.6319             Death      62    2013-04-03                    <NA>      f  68  Zhejiang
#> 39     1.2766         uncertain      63    2013-04-10                    <NA>      m  60     Anhui
#> 40     1.4205         uncertain      66          <NA>                    <NA>      m  72   Jiangsu
#> 41     1.6895           Recover      67    2013-04-12                    <NA>      m  56  Zhejiang
#> 42     0.0672         uncertain      68    2013-04-10                    <NA>      m  57  Zhejiang
#> 43    -0.3370         uncertain      69    2013-04-10                    <NA>      m  62  Zhejiang
#> 44     0.0833         uncertain      70    2013-04-11                    <NA>      f  58  Zhejiang
#> 45     1.1172         uncertain      71    2013-04-10                    <NA>      f  72  Zhejiang
#> 46     0.9261         uncertain      80    2013-04-08                    <NA>      m  74  Zhejiang
#> 47     0.2448         uncertain      84          <NA>                    <NA>      f  26   Jiangsu
#> 48    -0.6762         uncertain      85    2013-04-09              2013-04-16      f  80  Shanghai
#> 49     0.3164         uncertain      86    2013-04-13                    <NA>      f  54  Zhejiang
#> 50     3.7160           Recover      88          <NA>                    <NA>      m   4   Beijing
#> 51    -1.5665             Death       9    2013-03-25              2013-03-25      m  67  Zhejiang
#> 52     0.3116         uncertain      90    2013-04-12              2013-04-15      m  43  Zhejiang
#> 53    -0.4835         uncertain      92    2013-04-10              2013-04-17      f  66  Zhejiang
#> 54     1.5319           Recover      93    2013-04-11                    <NA>      m  56  Zhejiang
#> 55    -0.3589         uncertain      95    2013-03-30                    <NA>      m  37  Zhejiang
#> 56     2.3482           Recover      96    2013-04-07                    <NA>      m  43   Jiangsu
#> 57    -0.5839         uncertain      99    2013-04-12                    <NA>      f  68  Zhejiang

# tidy dataframe for plotting
results_combined_gather <- results_combined %>%
  gather(group_dates, date, date_of_onset:date_of_hospitalisation)

results_combined_gather$group_dates <- factor(results_combined_gather$group_dates, 
                          levels = c("date_of_onset", "date_of_hospitalisation"))

results_combined_gather$group_dates <- mapvalues(results_combined_gather$group_dates, 
                            from = c("date_of_onset", "date_of_hospitalisation"), 
                              to = c("Date of onset", "Date of hospitalisation"))

results_combined_gather$gender <- mapvalues(results_combined_gather$gender, 
                                            from = c("f", "m"), 
                                             to = c("Female", "Male"))
levels(results_combined_gather$gender) <- c(levels(results_combined_gather$gender), 
                                            "unknown")
results_combined_gather$gender[is.na(results_combined_gather$gender)] <- "unknown"

ggplot(data = results_combined_gather, aes(x = date, y = log2_ratio, 
                                           color = predicted_outcome)) +
  geom_jitter(aes(size = as.numeric(age)), alpha = 0.3) +
  geom_rug() +
  facet_grid(gender ~ group_dates) +
  labs(
    color = "Predicted outcome",
    size = "Age",
    x = "Date in 2013",
    y = "log2 ratio of prediction Recover vs Death",
    title = "2013 Influenza A H7N9 cases in China",
    subtitle = "Predicted outcome",
    caption = ""
  ) +
  my_theme() +
  scale_shape_manual(values = c(15, 16, 17)) +
  scale_color_brewer(palette="Set1") +
  scale_fill_brewer(palette="Set1")
#> Warning: Removed 42 rows containing missing values (geom_point).

The comparison of date of onset, data of hospitalization, gender and age with predicted outcome shows that predicted deaths were associated with older age than predicted recoveries. Date of onset does not show an obvious bias in either direction.

28.8 Conclusions

This dataset posed a couple of difficulties to begin with, like unequal distribution of data points across variables and missing data. This makes the modeling inherently prone to flaws. However, real life data isn’t perfect either, so I went ahead and tested the modeling success anyway.

By accounting for uncertain classification with low predictions probability, the validation data could be classified accurately. However, for a more accurate model, these few cases don’t give enough information to reliably predict the outcome. More cases, more information (i.e. more features) and fewer missing data would improve the modeling outcome.

Also, this example is only applicable for this specific case of flu. In order to be able to draw more general conclusions about flu outcome, other cases and additional information, for example on medical parameters like preexisting medical conditions, disase parameters, demographic information, etc. would be necessary.

All in all, this dataset served as a nice example of the possibilities (and pitfalls) of machine learning applications and showcases a basic workflow for building prediction models with R.

For a comparison of feature selection methods see here.

If you see any mistakes or have tips and tricks for improvement, please don’t hesitate to let me know! Thanks. :-)

sessionInfo()
#> R version 3.6.3 (2020-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Debian GNU/Linux 10 (buster)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C              LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> attached base packages:
#> [1] grid      stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#>  [1] RColorBrewer_1.1-2 rpart.plot_3.0.8   rattle_5.3.0       rpart_4.1-15       caret_6.0-86       lattice_0.20-38    mice_3.8.0         dplyr_0.8.5        gridExtra_2.3      ggplot2_3.3.0      plyr_1.8.6         tidyr_1.0.2        outbreaks_1.5.0   
#> 
#> loaded via a namespace (and not attached):
#>  [1] nlme_3.1-144         fs_1.4.1             lubridate_1.7.8      C50_0.1.3            tools_3.6.3          backports_1.1.6      bslib_0.2.2.9000     utf8_1.1.4           R6_2.4.1             colorspace_1.4-1     nnet_7.3-12          withr_2.2.0          tidyselect_1.0.0     downlit_0.2.1.9000   compiler_3.6.3       mda_0.4-10           cli_2.0.2            glmnet_3.0-2         xml2_1.3.2           isoband_0.2.1        Cubist_0.2.3         labeling_0.3         bookdown_0.21.4      sass_0.2.0.9005      scales_1.1.0         mvtnorm_1.1-0        randomForest_4.6-14  rappdirs_0.3.1       stringr_1.4.0        digest_0.6.25        rmarkdown_2.5.3      pkgconfig_2.0.3      htmltools_0.5.0.9003 rlang_0.4.5          shape_1.4.4          jquerylib_0.1.2      farver_2.0.3         generics_0.0.2       jsonlite_1.6.1       ModelMetrics_1.2.2.2 magrittr_1.5         Formula_1.2-3        Matrix_1.2-18        fansi_0.4.1          Rcpp_1.0.4.6         munsell_0.5.0        partykit_1.2-7      
#> [48] lifecycle_0.2.0      stringi_1.4.6        pROC_1.16.2          yaml_2.2.1           inum_1.0-1           MASS_7.3-51.5        pamr_1.56.1          recipes_0.1.10       pls_2.7-2            crayon_1.3.4         splines_3.6.3        knitr_1.28           pillar_1.4.3         igraph_1.2.5         reshape2_1.4.4       codetools_0.2-16     stats4_3.6.3         glue_1.4.0           evaluate_0.14        data.table_1.12.8    vctrs_0.2.4          foreach_1.5.0        gtable_0.3.0         purrr_0.3.4          assertthat_0.2.1     xfun_0.19.4          gower_0.2.1          prodlim_2019.11.13   libcoin_1.0-5        broom_0.5.6          e1071_1.7-3          class_7.3-15         survival_3.1-8       timeDate_3043.102    tibble_3.0.1         iterators_1.0.12     kknn_1.3.1           cluster_2.1.0        lava_1.6.7           ellipsis_0.3.0       ipred_0.9-9

27 Classification on bad loans

29 A detailed study of bike sharing demand