In this article we try to predict whether a patient's breast tissue diagnosis is malignant or benign. In the code below, the feature_names attribute lists the names of all the predictors in the breast cancer dataset, for example mean radius, mean perimeter and so forth. The target_names attribute gives the two levels of the response variable (malignant, benign). For a better understanding of the dataset, see the linked description.
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
list(data.feature_names) # predictors
## ['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness', 'mean compactness', 'mean concavity', 'mean concave points', 'mean symmetry', 'mean fractal dimension', 'radius error', 'texture error', 'perimeter error', 'area error', 'smoothness error', 'compactness error', 'concavity error', 'concave points error', 'symmetry error', 'fractal dimension error', 'worst radius', 'worst texture', 'worst perimeter', 'worst area', 'worst smoothness', 'worst compactness', 'worst concavity', 'worst concave points', 'worst symmetry', 'worst fractal dimension']
list(data.target_names) # response variable
## ['malignant', 'benign']
Now we can call the train_test_split function from sklearn to split the data, holding out 30% of the original dataset as a test set.
# Hold out a test set to estimate how the model will perform in the future
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, test_size=0.3)
N, D = X_train.shape  # number of observations and number of variables
# Scale the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
In the code above, we also scaled the data. Because the output is a linear combination of the inputs, we don't want some inputs with a very large range and others with a very small range. If that happens, the output is overly sensitive to the weights of the large-range inputs and barely sensitive to the weights of the small-range inputs, which makes training harder. We avoid this by converting each feature to z-scores: subtract its mean and divide by its standard deviation. In Python this is done with the StandardScaler class used above, which is fitted on the training set only and then applied unchanged to the test set.
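To make the z-score step concrete, here is a minimal NumPy sketch of the same computation on a toy array. The values and the names X_train_toy and X_test_toy are made up for illustration; the only assumption about StandardScaler is that it uses the population standard deviation (ddof=0), which is also NumPy's default.

```python
import numpy as np

# Toy data (hypothetical values): two features on very different scales
X_train_toy = np.array([[1.0, 1000.0],
                        [2.0, 2000.0],
                        [3.0, 3000.0]])
X_test_toy = np.array([[2.0, 1500.0]])

# Mean and standard deviation are estimated on the training set only
mu = X_train_toy.mean(axis=0)
sigma = X_train_toy.std(axis=0)  # population std (ddof=0)

# z-score: subtract the mean, divide by the standard deviation
Z_train = (X_train_toy - mu) / sigma
Z_test = (X_test_toy - mu) / sigma  # reuse the training statistics

print("train means:", Z_train.mean(axis=0))
print("train stds: ", Z_train.std(axis=0))
print("scaled test row:", Z_test)
```

After scaling, both training columns have mean 0 and standard deviation 1, so neither feature dominates the linear combination just because of its units.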
Now we can use TensorFlow. First we build a model object of type Sequential, which takes a list of two layers, Input and Dense. The Input layer just specifies the size of the input, D (see X_train.shape in the code above). The Dense layer is where the real work happens: it applies a linear transformation to the input to produce an output of size 1, and then passes that value through the sigmoid activation function, which is non-linear, so that the output lies between 0 and 1.
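As a rough illustration of what a single dense unit with a sigmoid activation computes, the sketch below writes the forward pass by hand in NumPy. The function names, the random weights and the toy input are hypothetical. In the real model the weights are learned by Keras, not drawn at random.

```python
import numpy as np

def sigmoid(a):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def dense_sigmoid_forward(X, w, b):
    """One dense unit: linear step X @ w + b, followed by the sigmoid."""
    a = X @ w + b       # linear transformation, one value per sample
    return sigmoid(a)   # non-linear squashing into (0, 1)

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(4, 30))     # 4 samples, 30 features (like D above)
w = rng.normal(scale=0.1, size=30)   # hypothetical weights
b = 0.0                              # hypothetical bias

p = dense_sigmoid_forward(X_toy, w, b)
print(p.shape)  # one probability-like value per sample
print(p)
```

Each output can be read as the model's estimated probability of the class labelled 1.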
# Build the model in TensorFlow
model = tf.keras.models.Sequential([
    tf.keras.layers.Input(shape=(D,)),
    tf.keras.layers.Dense(1, activation='sigmoid')  # one output unit, squashed into (0, 1) by the sigmoid
])
model.compile(optimizer='adam',  # Adam: adaptive moment estimation
              loss='binary_crossentropy',
              metrics=['accuracy'])
# Train the Model
r = model.fit(X_train, y_train, validation_data=(X_test, y_test))
# Evaluate the model
## Train on 398 samples, validate on 171 samples
##
## 32/398 [=>............................] - ETA: 5s - loss: 0.7772 - accuracy: 0.6562
## 398/398 [==============================] - 1s 1ms/sample - loss: 0.6014 - accuracy: 0.7261 - val_loss: 0.5976 - val_accuracy: 0.6667
print("Train score:", model.evaluate(X_train, y_train)) # evaluate returns loss and accuracy
## ... 30us/sample - loss: 0.5403 - accuracy: 0.7563
## Train score: [0.5699630466537859, 0.75628144]
print("Test score:", model.evaluate(X_test, y_test)) # evaluate returns loss and accuracy
## ... 36us/sample - loss: 0.6113 - accuracy: 0.6667
## Test score: [0.5975685590191891, 0.6666667]
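The two numbers returned by evaluate are the binary cross-entropy loss and the accuracy. As a hedged sketch of how such numbers are computed, the NumPy code below evaluates both on a tiny made-up example. The arrays y_toy and p_toy are hypothetical, and the clipping constant eps and the 0.5 threshold are assumptions chosen to resemble common defaults, not values taken from this article's run.

```python
import numpy as np

def binary_crossentropy(y_true, p, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def accuracy(y_true, p, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    y_pred = (p > threshold).astype(int)
    return float(np.mean(y_pred == y_true))

y_toy = np.array([1, 0, 1, 1])           # hypothetical true labels
p_toy = np.array([0.9, 0.2, 0.4, 0.7])   # hypothetical sigmoid outputs

loss_toy = binary_crossentropy(y_toy, p_toy)
acc_toy = accuracy(y_toy, p_toy)
print("loss:", loss_toy)
print("accuracy:", acc_toy)  # 3 of 4 predictions are correct
```

The loss penalises confident wrong predictions heavily, while the accuracy only counts whether each thresholded prediction is right, which is why the two numbers can move differently.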
In model.compile we chose Adam (adaptive moment estimation) as the optimizer, a variant of mini-batch gradient descent. The sigmoid is applied to the output of the dense layer on every forward pass. After a single epoch of training, the model reaches about 76% accuracy on the training set and about 67% on the test set, with losses of about 0.57 and 0.60 respectively. We can also plot the loss per epoch using the code below.
import matplotlib.pyplot as plt
# Plot training and validation loss recorded by model.fit, one value per epoch
plt.plot(r.history['loss'], label='loss')
plt.plot(r.history['val_loss'], label='val_loss')
plt.legend()
plt.show()
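The graph above shows the loss per epoch. In the history object returned by fit, the training loss is stored under the key loss, while the validation loss is stored under the key val_loss. Note that fit records one value per epoch, so the model has to be trained for several epochs (by passing the epochs argument to fit) for the curves to show a trend. When it is, the loss decreases nicely over the last epochs, as expected.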
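To show how a per-epoch history with loss and val_loss keys comes about, here is a small self-contained sketch that trains a logistic regression in NumPy. It is not the Keras model above: it uses plain full-batch gradient descent, not Adam, on synthetic data, and every name in it (X_tr, X_va, history, lr and so on) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic, already standardized data (not the breast cancer dataset)
N_toy, D_toy = 300, 5
X_all = rng.normal(size=(N_toy, D_toy))
true_w = rng.normal(size=D_toy)
y_all = (X_all @ true_w + 0.5 * rng.normal(size=N_toy) > 0).astype(float)

X_tr, y_tr = X_all[:200], y_all[:200]   # training split
X_va, y_va = X_all[200:], y_all[200:]   # validation split

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bce(y, p, eps=1e-7):
    """Mean binary cross-entropy with clipping for numerical safety."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

w = np.zeros(D_toy)
b = 0.0
lr = 0.1  # learning rate (assumed value)
history = {"loss": [], "val_loss": []}

for epoch in range(50):
    p = sigmoid(X_tr @ w + b)
    # Gradient of the mean binary cross-entropy with respect to w and b
    grad_w = X_tr.T @ (p - y_tr) / len(y_tr)
    grad_b = float(np.mean(p - y_tr))
    w -= lr * grad_w
    b -= lr * grad_b
    # Record one value per epoch, mirroring the keys discussed above
    history["loss"].append(bce(y_tr, sigmoid(X_tr @ w + b)))
    history["val_loss"].append(bce(y_va, sigmoid(X_va @ w + b)))

print("first epoch loss:", history["loss"][0])
print("last epoch loss: ", history["loss"][-1])
```

Passing history["loss"] and history["val_loss"] to plt.plot would give the same kind of curve as the plot discussed above, with one point per epoch.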