Recep Arda Kaya - 435060

Principal Component Analysis to Cardiovascular Disease Dataset

Objective

* As the number of x variables grows, several problems arise: visualization becomes difficult, and training
and prediction become slower and less reliable. My main objective is to show how PCA mitigates these problems and how we can observe this improvement in the
accuracy score.

Data

For a more practical explanation, I am going to use the Cardiovascular Disease dataset.

Table of Contents

1) Review of the Data

1) a - Missing Value Check

1) b - Descriptive Statistics of Data

1) c - Univariate Variable Analysis

2) PCA

2) a - Defining "X" and "Y" variables

2) b - Scaling the data to unit variance

2) c - Principal Component Analysis

2) d - Total and Explained Variance

3) Inference

4) References

Review of the Data

Missing Value Check
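A missing-value check can be done in one line with pandas. A minimal sketch, assuming the dataset has already been loaded into a DataFrame (the tiny `df` below is a hypothetical stand-in for the real data):

```python
import pandas as pd

# Hypothetical miniature stand-in for the Cardiovascular Disease data
df = pd.DataFrame({
    "age":    [18393, 20228, 18857],
    "ap_hi":  [110, 140, 130],
    "cardio": [0, 1, 1],
})

# Number of missing values in each column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# True only if at least one cell anywhere is missing
print(df.isnull().values.any())
```

If every count is zero, no imputation or row dropping is needed before PCA.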

Descriptive Statistics of Data
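The descriptive statistics come from `DataFrame.describe()`. A sketch on the same hypothetical miniature DataFrame:

```python
import pandas as pd

# Hypothetical stand-in rows for the real dataset
df = pd.DataFrame({
    "age":    [18393, 20228, 18857, 17623],
    "ap_hi":  [110, 140, 130, 150],
    "cardio": [0, 1, 1, 0],
})

# count, mean, std, min, quartiles, and max for every numeric column
stats = df.describe()
print(stats)
```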

Univariate Variable Analysis

According to the figure above, we can observe that the cardio variable is balanced, which is very important for an accurate study. With an imbalanced target, we could expect a misleadingly high accuracy score even before any real prediction, because a model that always predicts the majority class would already score well. For a sound study, we have to verify the balance of our "y variable" before prediction.
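The balance check above can be reproduced with `value_counts`. A sketch, using a hypothetical target series in place of the real `cardio` column:

```python
import pandas as pd

# Hypothetical target values standing in for the real "cardio" variable
y = pd.Series([0, 1, 1, 0, 1, 0], name="cardio")

# Relative frequency of each class; values near 0.5 indicate balance
balance = y.value_counts(normalize=True)
print(balance)
```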

PCA

Defining "X" and "Y" variables
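Splitting the DataFrame into features and target might look like this (a sketch; `df` below is a hypothetical stand-in, and the real target column is `cardio`):

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [18393, 20228, 18857],
    "ap_hi":  [110, 140, 130],
    "cardio": [0, 1, 1],
})

# X: every column except the target; y: the "cardio" label
X = df.drop(columns="cardio")
y = df["cardio"]
print(X.shape, y.shape)
```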

Scaling the data to unit variance
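Because PCA directions are driven by variance, features on larger scales would otherwise dominate. A minimal sketch of standardization with scikit-learn's `StandardScaler`, on hypothetical values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix (e.g. blood pressure readings)
X = np.array([[110.0, 60.0],
              [140.0, 90.0],
              [130.0, 70.0]])

# Transform each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 per column
print(X_scaled.std(axis=0))   # approximately 1 per column
```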

Principal Component Analysis

Dimensionality reduction is a way to reduce the complexity of a model and avoid overfitting. There are two main categories of dimensionality reduction: feature selection and feature extraction. Via feature selection, we select a subset of the original features, whereas in feature extraction, we derive information from the feature set to construct a new feature subspace.

The Principal Component Analysis (PCA) algorithm is used to compress a dataset onto a lower-dimensional feature subspace with the goal of retaining most of the relevant information.

PCA helps us to identify patterns in data based on the correlation between features. In a nutshell, PCA aims to find the directions of maximum variance in high-dimensional data and projects it onto a new subspace with equal or fewer dimensions than the original one.
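The projection described above is a two-liner with scikit-learn. A sketch on synthetic, deliberately correlated data standing in for the scaled cardio features:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Four strongly correlated columns: most variance lies along one direction
base = rng.normal(size=(200, 1))
X_scaled = np.hstack([base + 0.1 * rng.normal(size=(200, 1)) for _ in range(4)])

# Project the data onto the 2 directions of maximum variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(X_pca.shape)
```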

Total and Explained Variance

The explained variance ratio of an eigenvalue λ_j is simply that eigenvalue divided by the total sum of the eigenvalues: λ_j / Σ_k λ_k.
`explained_variance_ratio_` returns the fraction of variance explained by each of the selected components. The resulting plot indicates that the first principal component alone accounts for approximately 85% of the variance, and that 2 principal components will be best for our model.
According to the results above, we can clearly see that the first 2 components are highly explanatory for our data, and the remaining components may be negligible.
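The ratio λ_j / Σ_k λ_k comes straight out of a fitted `PCA` object. A sketch on synthetic correlated data, verifying that `explained_variance_ratio_` is exactly the eigenvalues normalized by their sum:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Five near-duplicate columns, so the first component dominates
base = rng.normal(size=(300, 1))
X_scaled = np.hstack([base + 0.05 * rng.normal(size=(300, 1)) for _ in range(5)])

pca = PCA().fit(X_scaled)

# Fraction of total variance carried by each component: λ_j / Σ_k λ_k
ratios = pca.explained_variance_ratio_
print(ratios)
print(np.cumsum(ratios))  # cumulative share; pick the smallest k that is "enough"

# Check the definition: ratios are the eigenvalues divided by their sum
eigenvalues = pca.explained_variance_
print(np.allclose(ratios, eigenvalues / eigenvalues.sum()))
```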

Inference

1st PC: 85.0406 %, 2nd PC: 14.1914 %
Total: 99.232 %
With only 2 components, our explanatory power reached 99.232%. This means the remaining components can be dropped, and our data will be less noisy after this process. We can also use these components with classification algorithms such as SVM later.
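Feeding the 2 retained components into an SVM can be sketched as a scikit-learn pipeline. The data here is synthetic (`make_classification` as a hypothetical stand-in for the cardio dataset), so the printed score is illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary-classification data standing in for the cardio dataset
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale -> keep 2 principal components -> classify with an SVM
model = make_pipeline(StandardScaler(), PCA(n_components=2), SVC())
model.fit(X_train, y_train)
score = model.score(X_test, y_test)
print(score)
```

Putting the scaler and PCA inside the pipeline ensures both are fit only on the training split, avoiding leakage into the test score.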
When reducing the dimensions of data, it’s important not to lose more information than is necessary. The variation in a data set can be seen as representing the information that we would like to keep. Principal Component Analysis (PCA) is a well-established mathematical technique for reducing the dimensionality of data, while keeping as much variation as possible.

References

https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c