Table of Contents

Introduction

Problem Statement: Cluster Multiple Sequence Alignment (MSA) sequences for 400 Covid-19 samples.
Additional useful information:

Report

Project Description

In this project I clustered the given Multiple Sequence Alignment (MSA) of 400 Covid-19 samples. For the clustering, I used k-means with k-means++ initialization to reduce the risk of choosing poor initial points: k-means is sensitive to the initially chosen centroids, and k-means++ picks better-separated initial centroids.

K-means is a distance-based algorithm: we specify the number of clusters, and it assigns each data point to one of them. The number of clusters can be chosen using the elbow method or silhouette scores. There is also a multitude of other clustering algorithms to choose from, such as the distance-based k-medoids or density-based algorithms like DBSCAN and OPTICS. Here, I used k-means with 5 clusters and the Euclidean distance as the distance metric, as sketched below.
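As a minimal sketch of this step (the array `X` is a placeholder standing in for the encoded sequence matrix, not the original notebook's variable):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for the numerically encoded, scaled sequence matrix
X = np.random.rand(400, 100)

# k-means++ picks well-separated initial centroids; KMeans uses Euclidean distance
kmeans = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42)
labels = kmeans.fit_predict(X)  # cluster index (0..4) for every sample
```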

About the distance metric: absolute (Manhattan) or cosine distances could also be used here, but I used the Euclidean distance for simplicity. We should also note that, for L2-normalized data, the squared Euclidean distance is a monotonic function of the cosine distance, so the two metrics rank points the same way and the choice matters less in that situation.
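A quick illustrative check of that relation for unit-length vectors (not part of the original notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
u, v = rng.random(10), rng.random(10)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)  # L2-normalize

squared_euclidean = np.sum((u - v) ** 2)
cosine_distance = 1.0 - np.dot(u, v)

# For unit vectors: ||u - v||^2 = 2 * (1 - u.v)
print(np.isclose(squared_euclidean, 2 * cosine_distance))  # True
```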

The k-means algorithm comes with its own problem: we need to choose the best number of clusters. To do that, I used the elbow method. First, I computed the within-cluster sum of squares (WCSS) for each candidate number of clusters and plotted the number of clusters against the WCSS. Wherever the curve makes a sharp turn (an elbow-like shape), we choose that number as the best number of clusters; see the sketch below.
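A sketch of the elbow computation (in scikit-learn, the fitted model's `inertia_` attribute is the WCSS; `X_reduced` is a placeholder for the preprocessed data):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X_reduced = np.random.rand(400, 50)  # placeholder for the normalized, PCA-transformed data

wcss = []
ks = range(2, 21)
for k in ks:
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(X_reduced)
    wcss.append(km.inertia_)  # inertia_ is the within-cluster sum of squares (WCSS)

plt.plot(list(ks), wcss, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("WCSS")
plt.title("Elbow method")
plt.show()
```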

Apart from the WCSS, we can also use the silhouette score to choose the best number of clusters. The silhouette score for a data point is s = (b - a) / max(a, b), where a is the mean distance from that point to all other points in its own cluster, and b is the mean distance to the points of the nearest other cluster. For a single cluster, the silhouette score is the mean of the silhouette scores of all the points belonging to it.
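Scikit-learn exposes the mean silhouette over all samples as `silhouette_score`; a hedged sketch with placeholder data standing in for the preprocessed sequence matrix:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(400, 50)  # placeholder for the preprocessed sequence matrix

for k in range(2, 8):
    labels = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # higher mean silhouette = better-separated clusters
```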

Methodology

Here, we have an MSA of 400 Covid-19 samples, each about 34k characters long. DNA has a double-helix structure with four nitrogenous bases: Adenine (A), Thymine (T), Cytosine (C), and Guanine (G).

In the alignment, N means one of the four bases is present at that position but could not be determined with full accuracy, . means the character is the same across all samples, and - means no base was detected at that position (a gap).

First of all, since the data is text, we need to convert it to numeric form. We could use k-mers with a bag-of-words representation (CountVectorizer), but that would give a very high-dimensional array, and clustering techniques may then suffer from the curse of dimensionality. So I used a simple ASCII encoding of the characters; for example, the character . has the ASCII code 46. Later I normalize these numbers and apply PCA. A sketch of this encoding is given below.
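A minimal sketch of the encoding, assuming the aligned sequences are available as a list of equal-length strings called `sequences` (an assumption about the data loading, not the original code):

```python
import numpy as np

sequences = ["ATGCN.-", "ATGCC.-"]  # placeholder; the real MSA has 400 sequences of ~34k chars

# Map every character to its ASCII code, e.g. "." -> 46, "A" -> 65, "-" -> 45
X = np.array([[ord(ch) for ch in seq] for seq in sequences], dtype=float)
print(X.shape)  # (n_samples, sequence_length)
```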

To reduce the feature dimension, I used Principal Component Analysis (PCA). PCA projects the data onto a lower-dimensional space whose axes are orthogonal and chosen to retain as much of the variance in the data as possible. PCA works better when all features are on a similar scale; otherwise, features with larger numeric ranges dominate the principal components. To avoid that, we must scale the dataset before PCA. Here, I used min-max scaling of the dataset and kept 90% of the variance using the top 253 eigenvectors, ranked by their eigenvalues; a sketch of this step follows.
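A sketch of the scaling and PCA step (passing a float between 0 and 1 to `n_components` tells scikit-learn to keep enough components to explain that fraction of the variance; the figure of 253 components is the report's result, which this placeholder data will not reproduce):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

X = np.random.rand(400, 1000)  # placeholder for the ASCII-encoded matrix

X_scaled = MinMaxScaler().fit_transform(X)  # min-max scale each feature to [0, 1]
pca = PCA(n_components=0.90)                # keep enough components for 90% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape[1], "components retained")
```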

On this normalized, PCA-transformed dataset, I ran k-means with different numbers of clusters (2 to 20); then, looking at the elbow plot, I chose 5 clusters.
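With k chosen, the model is refit on the reduced data; a minimal sketch, continuing from the hypothetical `X_reduced` of the previous sketches:

```python
from sklearn.cluster import KMeans

# X_reduced: the normalized, PCA-transformed matrix from the sketch above
final_km = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=42)
cluster_labels = final_km.fit_predict(X_reduced)  # one cluster index per sample
```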

Finally, to see how well the clustering did, I made a 2-dimensional plot of the data coloured by the cluster labels predicted by k-means, as sketched below.
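A sketch of such a plot, using the first two principal components as the 2-D coordinates (an assumption about which two dimensions are plotted):

```python
import matplotlib.pyplot as plt

# X_reduced and cluster_labels come from the sketches above
plt.scatter(X_reduced[:, 0], X_reduced[:, 1], c=cluster_labels, cmap="tab10", s=15)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("K-means clusters in the first two principal components")
plt.show()
```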

Results

The number of data points assigned to each cluster index is given below:

Cluster Count
3 171
0 161
2 51
4 9
1 8
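A table like this can be produced from the predicted labels, for example (a sketch, not the original code):

```python
import numpy as np

values, counts = np.unique(cluster_labels, return_counts=True)
for cluster, count in sorted(zip(values, counts), key=lambda t: -t[1]):
    print(cluster, count)
```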

Here, I have also included a visualization of the clusters in the 2-dimensional space obtained from the normalized PCA data:

Load the modules

Look at the data file

Modelling

Convert Sequences to Numerical Data

k-mer and bag-of-words embedding

Integer ASCII Encoding (I used this)

Normalize the data

Clustering of the Data

Clustering on the full dimension (400 × 34k)

Visualization using PCA on un-scaled original data

Reduce the dimension of the data, normalize, and cluster

Find the best number of clusters

The Silhouette Coefficient is defined for each sample and is composed of two scores:

a: the mean distance between a sample and all other points in the same cluster.
b: the mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient s for a single sample is then given as:

s = (b-a) / max(a,b)

The Silhouette Coefficient for a set of samples is given as the mean of the Silhouette Coefficient for each sample.

Time Taken