We try to recognize breast cancer using a multi-hidden-layer artificial neural network built with the H2O package. We use the Wisconsin Breast Cancer dataset, a collection of Dr. Wolberg's real clinical cases. There are no images; instead, malignant tumors are recognized from 9 biomedical attributes. The dataset contains a total of 699 patients divided into two classes: malignant and benign. From the H2O initialization output below, we can see that it detected 4 CPU cores.
library(mlbench)
library(h2o)
h2o.init(nthreads = -1) # initializing h2o
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 2 hours 17 minutes
H2O cluster timezone: Europe/Berlin
H2O data parsing timezone: UTC
H2O cluster version: 3.20.0.8
H2O cluster version age: 4 months and 14 days !!!
H2O cluster name: H2O_started_from_R_perlatoa_vvr054
H2O cluster total nodes: 1
H2O cluster total memory: 2.49 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: AutoML, Algos, Core V3, Core V4
R Version: R version 3.5.1 (2018-07-02)
The table below shows the first ten patients and the biomedical features involved, such as cell size and shape. The last column contains the outcome (malignant vs. benign).
library(knitr)
library(kableExtra)
library(formattable)
data("BreastCancer")
dt <- as.data.frame(BreastCancer)
dt <- dt[1:10,]
kable(dt) %>%
kable_styling(bootstrap_options = "responsive", full_width = T, position = "center", font_size = 16)
Id | Cl.thickness | Cell.size | Cell.shape | Marg.adhesion | Epith.c.size | Bare.nuclei | Bl.cromatin | Normal.nucleoli | Mitoses | Class |
---|---|---|---|---|---|---|---|---|---|---|
1000025 | 5 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 1 | benign |
1002945 | 5 | 4 | 4 | 5 | 7 | 10 | 3 | 2 | 1 | benign |
1015425 | 3 | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 1 | benign |
1016277 | 6 | 8 | 8 | 1 | 3 | 4 | 3 | 7 | 1 | benign |
1017023 | 4 | 1 | 1 | 3 | 2 | 1 | 3 | 1 | 1 | benign |
1017122 | 8 | 10 | 10 | 8 | 7 | 10 | 9 | 7 | 1 | malignant |
1018099 | 1 | 1 | 1 | 1 | 2 | 10 | 3 | 1 | 1 | benign |
1018561 | 2 | 1 | 2 | 1 | 2 | 1 | 3 | 1 | 1 | benign |
1033078 | 2 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 5 | benign |
1033078 | 4 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 1 | benign |
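Before preprocessing, it is worth checking how the two classes are distributed among the 699 patients. A minimal check, assuming BreastCancer has been loaded with data("BreastCancer") as above:
# class counts and proportions of the full dataset
table(BreastCancer$Class)
prop.table(table(BreastCancer$Class))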
data <- BreastCancer[, -1] # remove ID
data[, c(1:ncol(data))] <- sapply(data[, c(1:ncol(data))], as.numeric) # treat each feature as numeric
data[, 'Class'] <- as.factor(data[, 'Class']) # interpret dependent variable as factor
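One caveat on the conversion above: the feature columns of BreastCancer are stored as factors, and as.numeric() applied to a factor returns the internal level codes rather than the printed labels. If you want to be certain that the numeric values correspond to the recorded 1-10 scores, a more defensive sketch (the names feat and data_alt are only illustrative) converts through character first:
# defensive variant: convert the 9 feature columns via character,
# so the values are the recorded 1-10 scores rather than level codes
feat <- BreastCancer[, 2:10]
feat[] <- lapply(feat, function(x) as.numeric(as.character(x)))
data_alt <- cbind(feat, Class = BreastCancer$Class) # keep the class labels as a factor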
# split the data into train/validation/test and convert each part to the H2O format
splitSample <- sample(1:3, size=nrow(data), prob=c(0.6,0.2,0.2), replace=TRUE)
train_h2o <- as.h2o(data[splitSample==1,])
val_h2o <- as.h2o(data[splitSample==2,])
test_h2o <- as.h2o(data[splitSample==3,])
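As an aside, the same three-way split can also be done on the H2O side after importing the data frame once with as.h2o(). A minimal sketch, where the frame names, the 60/20/20 ratios, and the seed are illustrative assumptions:
# alternative split performed by H2O itself
data_h2o <- as.h2o(data)
splits <- h2o.splitFrame(data_h2o, ratios = c(0.6, 0.2), seed = 1234)
train_alt <- splits[[1]] # ~60% of the rows
val_alt   <- splits[[2]] # ~20% of the rows
test_alt  <- splits[[3]] # remaining ~20%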
# print dimensions
dim(train_h2o)
[1] 425 10
dim(val_h2o)
[1] 150 10
dim(test_h2o)
[1] 124 10
As we can see from the output above, we have 425 observations (about 60%) for training, and roughly 20% each for validation (150 observations) and test (124 observations). Now we can train our model using the deep learning function offered by the H2O package.
model <-
h2o.deeplearning(x = 1:9, # column numbers of the predictors
y = 10, # column number of the dependent variable
# data in H2O format
training_frame = train_h2o,
activation = "TanhWithDropout", # Tanh activation with dropout regularization
input_dropout_ratio = 0.2, # fraction of input features randomly dropped
balance_classes = TRUE, # oversample the minority class if benign/malignant counts are unbalanced
hidden = c(10,10), # two hidden layers of 10 units each
hidden_dropout_ratios = c(0.3, 0.3), # dropout ratio for each hidden layer
epochs = 10, # maximum number of epochs
seed = 0)
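Beyond the confusion matrices shown next, H2O exposes the full set of binomial metrics for any frame. A short sketch on the validation set, using standard accessors (output omitted):
# overall binomial metrics on the validation set
perf_val <- h2o.performance(model, newdata = val_h2o)
h2o.auc(perf_val)     # area under the ROC curve
h2o.logloss(perf_val) # logarithmic loss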
Now, let's look at the confusion matrices for the training and validation sets.
# training confusion matrix
h2o.confusionMatrix(model)
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.247389388195815:
1 2 Error Rate
1 266 8 0.029197 =8/274
2 5 267 0.018382 =5/272
Totals 271 275 0.023810 =13/546
# validation confusion matrix
h2o.confusionMatrix(model, val_h2o)
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.113811786899162:
1 2 Error Rate
1 97 5 0.049020 =5/102
2 1 47 0.020833 =1/48
Totals 98 52 0.040000 =6/150
On the training set the model reaches about 98% accuracy (error = 0.024), misclassifying only 13 of the 546 rows (the training matrix contains 546 rows rather than 425 because balance_classes = TRUE oversamples the minority class). On the validation set the error is also low (error = 0.04), with only 6 misclassified samples. To estimate the accuracy on out-of-sample data, we can use the test set.
# test confusion matrix
h2o.confusionMatrix(model, test_h2o)
Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.211758576604076:
1 2 Error Rate
1 79 3 0.036585 =3/82
2 1 41 0.023810 =1/42
Totals 80 44 0.032258 =4/124
We also obtain very good accuracy on the test set (error = 0.032, only 4 misclassified samples out of 124). The model shows remarkably good generalization.
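Finally, to apply the trained network to new patients, class predictions and per-class probabilities can be computed on an H2O frame and pulled back into R. A minimal sketch on the test set (the object names are illustrative):
# predict on the held-out test set and bring the results back into R
pred <- h2o.predict(model, test_h2o)
pred_df <- as.data.frame(pred)
head(pred_df) # predicted class plus the probability of each class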