# Predicting Breast Cancer Using Machine Learning Techniques [Part 1]: Exploratory Analysis


*Python, Jupyter Notebook, sklearn, pandas, matplotlib*


last hacked on Aug 25, 2017

For this project, I implemented a few **Machine Learning** techniques on a data set containing descriptive attributes of digitized images from a procedure known as fine needle aspiration (**FNA**) of a breast mass. The data set contains 30 features computed for each cell nucleus, along with an ID number and the diagnosis (later converted to a binary representation: **Malignant** = 1, **Benign** = 0).

<img src="https://www.researchgate.net/profile/Syed_Ali39/publication/41810238/figure/fig5/AS:281736006127621@1444182506838/Figure-2-Fine-needle-aspiration-of-a-malignant-solitary-fibrous-tumor-is-shown-A-A.png">

Ex. Image of a malignant solitary fibrous tumor obtained using **FNA**

This is a popular data set for machine learning purposes, and I plan on using the following **machine learning methods**:

+ K-Nearest Neighbors
+ Random Forest (bagging)
+ Neural Networks

I employ critical data analysis modules in this project, with an emphasis on:

+ pandas
+ scikit-learn
+ matplotlib (for visuals)
+ seaborn (easier statistical plots)
# Table of Contents

+ [Setting Up Python Environment](#pyenv)
+ [Loading Data](#loaddata)
+ [Exploratory Analysis](#exploranal)
+ [Visual Exploratory Analysis](#visexploranal)

# Load Modules <a name='pyenv'></a>

We load our modules into our Python environment. In my case I am employing a **Jupyter Notebook** running inside a **virtualenv** environment.

```python
%matplotlib inline

import numpy as np
import pandas as pd               # Data frames
import matplotlib.pyplot as plt   # Visuals
import seaborn as sns             # Danker visuals
import helper_functions as hf

from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_curve   # ROC Curves
from sklearn.metrics import auc         # Calculating AUC for ROC's!
from urllib.request import urlopen

pd.set_option('display.max_columns', 500)   # Included to show all the columns
                                            # since it is a fairly large data set

plt.style.use('ggplot')   # Using ggplot2 style visuals
                          # because that's how I learned my visuals
                          # and I'm sticking to it!
```

# Loading Data <a name='loaddata'></a>

For this section, I'll load the data into a **pandas** dataframe using `urlopen` from the `urllib.request` module. Instead of downloading a **csv**, I grab the data straight from the [UCI Machine Learning Database](https://archive.ics.uci.edu/ml/datasets.html) with an HTTP request (a method inspired by [Jason's Python Tutorials](https://github.com/JasonFreeberg/PythonTutorials)). This makes it easier to run the analysis from online sources and cuts out the need to download/upload a **csv** file when publishing on *GitHub*, since most files in the UCI database are easily accessible in the desired format.

Next, I created a list with the appropriate names and set them as the column names. **NOTE**: The names were not documented too well, so I used [this analysis](https://www.kaggle.com/buddhiniw/d/uciml/breast-cancer-wisconsin-data/breast-cancer-prediction) (I will refer to it as *Buddhini W.* from now on) to grab the variable names, along with some other tricks I didn't know were available in *Python* (I will mention them in the script!). Finally, I set the column `id_number` as the index for the dataframe.

```python
UCI_data_URL = 'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data'

names = ['id_number', 'diagnosis', 'radius_mean',
         'texture_mean', 'perimeter_mean', 'area_mean',
         'smoothness_mean', 'compactness_mean',
         'concavity_mean', 'concave_points_mean',
         'symmetry_mean', 'fractal_dimension_mean',
         'radius_se', 'texture_se', 'perimeter_se',
         'area_se', 'smoothness_se', 'compactness_se',
         'concavity_se', 'concave_points_se',
         'symmetry_se', 'fractal_dimension_se',
         'radius_worst', 'texture_worst',
         'perimeter_worst', 'area_worst',
         'smoothness_worst', 'compactness_worst',
         'concavity_worst', 'concave_points_worst',
         'symmetry_worst', 'fractal_dimension_worst']

breast_cancer = pd.read_csv(urlopen(UCI_data_URL), names=names)

# Setting 'id_number' as our index
breast_cancer.set_index(['id_number'], inplace = True)

namesInd = names[2:] # FOR CART MODELS LATER
```

# Exploratory Analysis <a name='exploranal'></a>

An important process in **Machine Learning** is doing **Exploratory Analysis** to get a *feel* for your data.
Creating visuals helps people understand the data set and breaks the information into digestible pieces in a way that code and predictive analytics alone often can't. Oftentimes we are tempted to jump straight into predictive modeling, but exploratory work helps us create narratives that give context to people who are not as data-driven.

It's good practice to always output sections of your data so you can give the reader context as to what each column looks like, as well as show examples of how the data is supposed to be formatted when loaded correctly. Many people run into issues here (especially with a data set that has poor documentation w.r.t. the column names), so it's a good habit to show your data during your analysis. We use the function `head()`, which is essentially the same as the `head` function in *R*, if you come from an *R* background.

```python
breast_cancer.head()
```

Unfortunately this data set is 31 columns wide, so I will leave it up to the reader to output this in their terminal or *Jupyter Notebook* to see the outcome.

<!-- Missing output -->

### More Preliminary Analysis

Much of this section is meant to give someone context for the data set you are utilizing. Often looking at raw data will not give people the desired context, so it is important for us as data enthusiasts to fill in the gaps for people who are interested in the analysis, even if they don't plan on running the code themselves anytime soon.

#### Data Frame Dimensions

Here we use the `.shape` attribute to give us the dimensions of our data frame, where the first output is the row count and the second output is the column count.

#### Data Types

Another piece of information that is **important** is the data types of the variables in the data set. It is good practice to check the variable types against either the source documentation or your own knowledge of the data set. For this data set, we know that all variables are measurements, so they are all continuous (except the **Dx**), and no further processing is needed for this step.

A common error happens when a variable is *discrete* (or *categorical*) but has a numerical representation: someone can easily forget the pre-processing step and run the analysis on the data type as is. Since such variables are numeric, they will be interpreted as either `int` or `float`. This isn't as big a problem in *Python* as it is in *R*, since most `sklearn` classifiers require numeric inputs, but it is still important to note.

```python
print("Here's the dimensions of our data frame:\n", breast_cancer.shape)
print("Here's the data types of our columns:\n", breast_cancer.dtypes)
```

### Terminal Output

```
Here's the dimensions of our data frame:
 (569, 31)
Here's the data types of our columns:
 diagnosis                   object
radius_mean                float64
texture_mean               float64
perimeter_mean             float64
area_mean                  float64
smoothness_mean            float64
compactness_mean           float64
concavity_mean             float64
concave_points_mean        float64
symmetry_mean              float64
fractal_dimension_mean     float64
radius_se                  float64
texture_se                 float64
perimeter_se               float64
area_se                    float64
smoothness_se              float64
compactness_se             float64
concavity_se               float64
concave_points_se          float64
symmetry_se                float64
fractal_dimension_se       float64
radius_worst               float64
texture_worst              float64
perimeter_worst            float64
area_worst                 float64
smoothness_worst           float64
compactness_worst          float64
concavity_worst            float64
concave_points_worst       float64
symmetry_worst             float64
fractal_dimension_worst    float64
dtype: object
```

As you can see, we'll be dealing with 30 independent variables that make up our feature space, all `float` types!
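Since the `head()` output was omitted above for being too wide, here is a small optional sketch (my addition, using only pandas functionality already imported) that previews a handful of columns and double-checks the claim that everything besides the diagnosis is a float:

```python
# Preview only the first few columns so the output fits on screen
print(breast_cancer.iloc[:, :6].head())

# Tally the column data types: we expect one 'object' column (the string
# diagnosis labels, converted to binary below) and 30 float64 columns
print(breast_cancer.dtypes.value_counts())
```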
Our next step is converting the diagnoses into the appropriate binary representation.

## Converting Diagnoses

When doing analysis, it is important to convert variable types to the appropriate representation. A tool is only as useful as the person utilizing it, so if we enter our data incorrectly the algorithm will suffer, not as a result of its capabilities but because of the human component (more on this later).

Here I converted the Dx to a **binary** representation using the `map` functionality in `pandas` (borrowed from *Buddhini W.*). We use a dictionary to map out the conversion, `{'M':1, 'B':0}`, which converts the previous string representations of the Dx to the **binary** representation, where 1 == **Malignant** and 0 == **Benign**.

```python
# Converted to binary to help later on with models and plots
breast_cancer['diagnosis'] = breast_cancer['diagnosis']\
    .map({'M':1, 'B':0})

# Let's look at the count of the new representations of our Dx's
breast_cancer['diagnosis'].value_counts()
```

### Terminal Output

```
0    357
1    212
Name: diagnosis, dtype: int64
```

## Class Imbalance

The count of our Dx values is important because it brings up the discussion of *class imbalance* within *machine learning* and *data mining* applications. *Class imbalance* refers to when a class within a data set is heavily outnumbered by the other class (or classes). From documentation I have read online, *class imbalance* becomes a concern when a class makes up only 10-20% of the data set. For this data set it's pretty clear that we don't suffer from this, but since I'm practicing my **Python**, I decided to experiment with *functions* to get better at it!

**NOTE**: If your data set suffers from *class imbalance* I suggest reading documentation on *upsampling* and *downsampling*.

```python
def calc_diag_percent(data_frame, col):
    '''
    Purpose
    ----------
    Creates counters for each respective diagnosis and prints
    the percentage of each unique instance

    Parameters
    ----------
    * data_frame :  Name of pandas.dataframe
    * col :         Name of column within previously mentioned dataframe
    '''
    i = 0
    n = 0
    perc_mal = 0
    perc_beg = 0
    # Iterate over the diagnosis column, counting each class
    for value in data_frame[col]:
        if (value == 1):
            i += 1
        elif (value == 0):
            n += 1
    perc_mal = (i/len(data_frame)) * 100
    perc_beg = (n/len(data_frame)) * 100
    print("The percentage of Malignant Diagnoses is: {0:.3f}%"\
          .format(perc_mal))
    print("The percentage of Benign Diagnoses is: {0:.3f}%"\
          .format(perc_beg))
```

Let's check that this worked. I'm sure there are more effective ways of doing this, but this is me doing brute-force attempts at defining/creating working functions.

```python
calc_diag_percent(breast_cancer, 'diagnosis')
```

### Terminal Output

```
The percentage of Malignant Diagnoses is: 37.258%
The percentage of Benign Diagnoses is: 62.742%
```

As we can see, our data set is not suffering from *class imbalance*, so we can proceed with our analysis.

Next I used the `.describe()` function to produce some basic statistics for each variable. We can see there are 569 instances of each variable (which should make sense), but it is important to note that the variables sit on very different scales and have very different variances: looking at the **means**, some are small fractions of a unit while others are in the hundreds!

```python
breast_cancer.describe()
```

<!-- Missing Output -->

We will discuss the high variance in the distributions of the variables later, within the context of the appropriate analysis.
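As an aside, since I mentioned there are more effective ways of computing those class percentages, here is a minimal sketch (not part of the original analysis) that leans on pandas' built-in `value_counts`:

```python
# Relative frequency of each diagnosis class, expressed as a percentage;
# value_counts(normalize=True) does the counting and division for us
diag_percent = breast_cancer['diagnosis'].value_counts(normalize=True) * 100
print(diag_percent.round(3))
```

This should report roughly 62.742% for the benign class (0) and 37.258% for the malignant class (1), matching the output of the function above.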
For now we move on to visual representations of our data, still a continuation of our **Exploratory Analysis**.

# Visual Exploratory Analysis <a name='visexploranal'></a>

For this section we utilize the module `seaborn`, which contains many powerful statistical graphs that would have been hard to produce using `matplotlib` (my note: `matplotlib` is not the most intuitive visualization tool compared to `ggplot2` in *R*, but *Python* seems like it's well on its way to creating visually pleasing and intuitive plots!).

## Scatterplot Matrix

For this visual I cheated by referencing some variables that turned out to be influential to the analysis (see the **Random Forest** section, more specifically the *variable importance* subsection).

```python
# Scatterplot Matrix
# Variables chosen from Random Forest modeling.
cols = ['concave_points_worst', 'concavity_mean',
        'perimeter_worst', 'radius_worst',
        'area_worst', 'diagnosis']

sns.pairplot(breast_cancer,
             x_vars = cols,
             y_vars = cols,
             hue = 'diagnosis',
             palette = ('Red', '#875FDB'),
             markers=["o", "D"])
```

<img src='https://raw.githubusercontent.com/raviolli77/machineLearning_breastCancer_Python/master/images/breastCancerWisconsinDataSet_MachineLearning_19_0.png'>

You see a matrix of the visual representations of the relationships between 6 variables:

+ `concave_points_worst`
+ `concavity_mean`
+ `perimeter_worst`
+ `radius_worst`
+ `area_worst`
+ `diagnosis`

Within each scatterplot we were able to color the two classes of **Dx**, and we can clearly distinguish **Malignant** from **Benign** instances. Some variable interactions also have an almost linear relationship. Of course these are just 2-dimensional representations, but it's still interesting to see how the variables interact with each other in our data set.

## Pearson Correlation Matrix

The next visual gives context similar to that of the last visual, and it is called the *Pearson Correlation Matrix*. Variable correlation within a *Machine Learning* context doesn't play as important a role as it does in, say, *linear regression*, but there can still be dangers when a data set has too many correlated variables. When two (or more) features are almost perfectly correlated in a *Machine Learning* setting, their inclusion does not add additional information to your process. It does, however, have the potential to hurt your algorithm's accuracy, since we are potentially utilizing a large feature space and can run into what is known as the [Curse of Dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality). Feature extraction would help reduce the amount of *noise* in your feature space; see *principal component analysis* or *t-distributed stochastic neighbor embedding*. Many of our algorithms are also very computationally expensive, so utilizing a dimension reduction algorithm would also help performance and computational time.
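Since *PCA* only comes up in passing here, below is a minimal, optional sketch (my addition, not part of the original analysis) of what such a dimension reduction step could look like with scikit-learn; the two-component choice is purely for illustration:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Separate the features from the diagnosis label
features = breast_cancer.drop('diagnosis', axis=1)

# PCA is sensitive to feature scale, so rescale everything to [0, 1] first
scaled_features = MinMaxScaler().fit_transform(features)

# Project the 30-dimensional feature space onto 2 principal components
pca = PCA(n_components=2)
components = pca.fit_transform(scaled_features)

# Fraction of the total variance captured by each component
print(pca.explained_variance_ratio_)
```

We won't pursue dimension reduction any further in this write-up; instead, the heatmap below visualizes the full Pearson correlation matrix.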
```python
corr = breast_cancer.corr(method = 'pearson') # Correlation Matrix

f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(10, 275, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, cmap=cmap, square=True,
            xticklabels=True, yticklabels=True,
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax)
```

<img src='https://raw.githubusercontent.com/raviolli77/machineLearning_breastCancer_Python/master/images/breastCancerWisconsinDataSet_MachineLearning_22_1.png'>

We can see that our data set contains mostly positive correlations, and the heatmap re-iterates that the 5 variables we featured in the *Scatterplot Matrix* are strongly *correlated* with one another. Overall our variables don't have too much correlation, so I won't go about doing feature extraction processes like *Principal Component Analysis* (**PCA**), but you are more than welcome to do so (you will probably get better prediction estimates).

## Boxplots

Next I decided to include boxplots of the data to show the high variance in the distributions of our variables. This will help drive home the need to apply an appropriate transformation for some of the models I will be employing, especially *Neural Networks*.

```python
f, ax = plt.subplots(figsize=(11, 15))

ax.set_axis_bgcolor('#fafafa')
ax.set(xlim=(-.05, 50))
plt.ylabel('Dependent Variables')
plt.title("Box Plot of Pre-Processed Data Set")
ax = sns.boxplot(data = breast_cancer,
                 orient = 'h',
                 palette = 'Set2')
```

<img src="https://raw.githubusercontent.com/raviolli77/machineLearning_breastCancer_Python/master/images/breastCancerWisconsinDataSet_MachineLearning_25_0.png">

Not the best picture, but it is a good segue into the next step in our *machine learning* process. Here I used a function I created in my Python script; refer to `helper_functions.py` to understand the process. I'm scaling each variable to a minimum of 0 and a maximum of 1 to help with some machine learning applications later on in this report. Notice that I will use this function only for visualizing my data set. This is important to note because if I were to use this transformation during my machine learning process, I would be guilty of what is called *data leakage*; more on this later in the *neural networks* section.

```python
# From helper_functions script
def normalize_df(frame):
    '''
    Helper function to normalize the data set.
    Initializes an empty data frame, normalizes all float
    columns, and simply appends the non-float columns
    (basically the class column in our data frame)
    '''
    breast_cancerNorm = pd.DataFrame()
    for item in frame:
        if item in frame.select_dtypes(include=[np.float]):
            # Min-max scaling: (x - min) / (max - min)
            breast_cancerNorm[item] = ((frame[item] - frame[item].min()) /
                                       (frame[item].max() - frame[item].min()))
        else:
            breast_cancerNorm[item] = frame[item]
    return breast_cancerNorm
```

Next we utilize the function on our dataframe.

```python
breast_cancerNorm = normalize_df(breast_cancer)
```

Note that we won't use this dataframe until we start fitting *Neural Networks*. Let's run the `.describe()` function again, and you'll see that all variables now have a maximum of 1, which means we did our process correctly.

```python
breast_cancerNorm.describe()
```

<!-- Missing Output -->

### Box Plot of Transformed Data

Now, to further illustrate the transformation, let's create a *boxplot* of the scaled data set and see the difference from our first *boxplot*.
```python
f, ax = plt.subplots(figsize=(11, 15))

ax.set_axis_bgcolor('#fafafa')
plt.title("Box Plot of Transformed Data Set (Breast Cancer Wisconsin Data Set)")
ax.set(xlim=(-.05, 1.05))
ax = sns.boxplot(data = breast_cancerNorm[1:29],
                 orient = 'h',
                 palette = 'Set2')
```

<img src="https://raw.githubusercontent.com/raviolli77/machineLearning_breastCancer_Python/master/images/breastCancerWisconsinDataSet_MachineLearning_34_0.png">

There are different forms of transformations available for *machine learning*, and I suggest you research them to gain a better understanding of when to use each one. For this project I will only employ the transformed dataframe when fitting *Neural Networks*.

## Part 2

To read about the machine learning techniques, see part 2 of this project: [click here](https://www.inertia7.com/projects/95).

# Sources Cited

+ Street, W. Nick. UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
+ Waidyawansa, Buddhini. Kaggle [https://www.kaggle.com/]. *Using the Wisconsin breast cancer diagnostic data set for predictive analysis*. [Kernel Source](https://www.kaggle.com/buddhiniw/breast-cancer-prediction)
+ Zakka, Kevin. Kevin Zakka's Blog [https://kevinzakka.github.io/]. *A Complete Guide to K-Nearest-Neighbors with Applications in Python and R*. [Blog Source](https://kevinzakka.github.io/2016/07/13/k-nearest-neighbor/)
