Thursday, March 9, 2017

Data analysis - Clustering using Euclidean distance

Recently, our ability to gather large amounts of complex data has far outstripped our ability to analyze it.  Although our human brains can process data in complex ways, that processing does not scale to large volumes of data.  Clustering is one way to distill data into groups and understand relationships within a dataset.  Clustering is used in many scientific research fields such as natural science, genetics, and politics, and of course in sales and marketing.  I recently published a blog at my work at SEI on analysis performed using Euclidean distance and clustering for cybersecurity - link here.  Here I am going to explore the mechanics of using Euclidean distance for clustering, with some simple Python code and examples.  Don't worry, it is simple math once you walk through this sample.
Euclidean distance

Euclidean distance is a very simple distance metric that extends the Pythagorean theorem to measure distance between objects in n-dimensional space. In the diagram below, the Pythagorean theorem is the well-known left-most representation, the Cartesian coordinate representation is in the middle, and to the right is the measure of distance in n-dimensional space.
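Concretely, for two points p and q in n-dimensional space, the distance is:

d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \cdots + (p_n - q_n)^2}

With n = 2 this is exactly the Pythagorean theorem.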
How can we use this?  If we have data about many distinct elements and their properties or attributes in numeric form, we can compute the distance between these elements and see whether they cluster together.  As an example, you could gather attributes of people immigrating to the US, such as height, weight, sex, age, and income level.  You could then compute the Euclidean distance between these people and cluster them to see whether immigrants to the US tend to be of a similar nature - a sort of anthropology-of-immigration study.

Sample data for analysis

The data I chose is the simple "Breakfast Cereal Dataset" from http://idvbook.com.  This dataset includes numeric attributes of breakfast cereals, such as "Calories" and "Protein (g)".  Below are the header line and the first two rows of this dataset:
Cereal,Manufacturer,Type,Calories,Protein (g),Fat,Sodium,Dietary Fiber,Carbs,Sugars,Display Shelf,Potassium,Vitamins and Minerals,Serving SizeWeight,Cups per Serving
100%_Bran,Nabisco,C,70,4,1,130,10,5,6,3,280,25,1,0.33
100%_Natural_Bran,Quaker Oats,C,120,3,5,15,2,8,8,3,135,0,1,-1

In this data, I've chosen to ignore the columns that are not numbers, and for simplicity I treat every numeric column as if a distance between its values is meaningful.
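To make the later code concrete, here is a minimal sketch of loading the CSV into the dict-of-rows structure ("cart") that the functions below expect: cereal name mapped to a list of numeric values. The file name and the choice of numeric columns are my assumptions; the centroid output further down suggests the original analysis kept only nine of the numeric columns.

import csv

def load_cart(path):
    """Read the cereal CSV into a dict of cereal name -> list of floats,
    skipping the non-numeric Manufacturer and Type columns."""
    cart = {}
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for row in reader:
            # columns 3 onward are numeric in the layout shown above
            cart[row[0]] = [float(x) for x in row[3:]]
    return cart

cart = load_cart('cereal.csv')  # file name is an assumption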

The analysis steps 

Once you have obtained reliable data with reasonably low error, it is important to ensure that all of it can be represented as numbers and that a distance between those numbers makes reasonable sense.  This can be complicated in some cases; I will deal with it in another blog.  In general, I recommend avoiding distance measurements on attributes that do not make sense as numbers.

Once you have "n" number of attributes of say "m" number of rows of elements, your first step is to find a "centroid" that can be used as a central point in this abstract space.  This will be a single coordinate of n-dimensions (or a row of data) for the "centroid" with average value of attributes of all the elements.  The code in python is below, where it takes a m x n matrix data called "cart" and returns a centroid value.


def centroid(cart):
    """ Given a set of m points in n dimensions (an m x n matrix stored
    as a dict of name -> list of values), calculate the centroid """
    ov = list(cart.values())  # an array of just the coordinate rows
    return [float(sum(col)) / len(col) for col in zip(*ov)]  # average of each of the n dimensions

For example, calling centroid(cart) on the entire CSV dataset gives:

Centroid is  [105.54, 2.5, 0.94, 160.67, 2.09, 14.59, 6.77, 2.175, 92.67]

Now that we have the centroid, we can compute the distance between it and every element.  Below is the code to do exactly that:
def euclid(cart):
    """ Return the Euclidean distance from the centroid to each element """
    dist = {}
    center = centroid(cart)
    for k, v in cart.items():
        total = 0.0  # avoid shadowing the built-in sum()
        for i, val in enumerate(v):
            total += (center[i] - val) ** 2  # squared difference in each dimension
        dist[k] = total ** 0.5  # square root gives the Euclidean distance
    dist['center'] = center  # keep the centroid itself for reference
    return dist

This part of the program gives us a single-number distance between the "centroid" and each element in the m x n matrix.  The output has one number per breakfast cereal, listed below from nearest to farthest.
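A minimal sketch for producing this sorted listing (the print format is an assumption):

dist = euclid(cart)
dist.pop('center')  # the stored centroid is not a distance
for name, d in sorted(dist.items(), key=lambda kv: kv[1]):
    print(name, d)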

Lucky_Charms 1.1045589126
Cocoa_Puffs 1.26067497377
Total_Whole_Grain 1.60908218343
Wheaties 1.66089067547
Oatmeal_Raisin_Crisp 3.90033070652
Just_Right_Fruit_&_Nut 5.66263357814
...
Puffed_Wheat 133.55359331
Puffed_Rice 145.135569511
100%_Bran 151.602320899
All-Bran_with_Extra_Fiber 203.167326345
All-Bran 208.906582331

Plotting and interpretation

On plotting this data, we find that these cereals do not actually cluster very tightly by their attributes.  They are quite well spread out, with only All-Bran and All-Bran_with_Extra_Fiber sitting as "outliers" by themselves, farthest from the centroid.
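For reference, here is a minimal matplotlib sketch of one way to visualize these distances; this is not the original plotting code, just an illustration.

import matplotlib.pyplot as plt

dist = euclid(cart)
dist.pop('center')  # the stored centroid is not a distance
items = sorted(dist.items(), key=lambda kv: kv[1])  # nearest to farthest
names = [k for k, _ in items]
values = [v for _, v in items]

plt.figure(figsize=(6, 12))
plt.barh(range(len(names)), values)  # one bar per cereal
plt.yticks(range(len(names)), names, fontsize=6)
plt.xlabel('Euclidean distance from centroid')
plt.tight_layout()
plt.show()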


What does this mean?  A simple interpretation is that the market has NOT gravitated to a single set of attributes that people want in a breakfast cereal. However, there are some interesting clusters, such as the unhealthy and the overly healthy cereals (just a couple are pointed out).  There are other observations you can pick out, such as the fact that "100% Bran" is not as close to "All Bran" as one would expect.  Note that this is a very simple clustering and can be improved with additional techniques.

Conclusion and expanding the analysis

Clustering is a way for us humans to digest the large amounts of data that we now collect and store.  Clustering is often seen as a first step in machine learning pipelines, used to "learn" about the structure of the data. There are several other clustering techniques that can be explored, such as "weighted Euclidean distance" (used in genetics), where you add a weight to each dimension before applying the clustering logic; for example, you could consider the protein value to be more important than the potassium content of a breakfast cereal and give these attributes appropriate numeric weights, as sketched below. A newer "K*-means clustering" was recently published at an IEEE conference to analyze data that may contain multiple clusters or cluster sets of interest.  Weighted Euclidean distance clustering also has many applications, such as in neural networks, which I will explore in a simplified way in a later blog! For now, try this on a small dataset to improve your analysis skills with multi-dimensional data.
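Here is a minimal sketch of that weighted variant, reusing the centroid() function above. The weights list and its values are purely illustrative assumptions, not values from any published analysis.

def weighted_euclid(cart, weights):
    """ Like euclid(), but scales each dimension's squared difference
    by a caller-supplied weight before summing """
    dist = {}
    center = centroid(cart)
    for k, v in cart.items():
        total = 0.0
        for i, val in enumerate(v):
            total += weights[i] * (center[i] - val) ** 2  # weighted squared difference
        dist[k] = total ** 0.5
    return dist

# Illustrative only: weight protein (assumed here to be the 2nd numeric
# column) five times more than every other attribute.
# weights = [1, 5, 1, 1, 1, 1, 1, 1, 1]
# dist = weighted_euclid(cart, weights)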

Reference Websites

Euclidean Distance: http://mathworld.wolfram.com/Distance.html
K-means Clustering: https://en.wikipedia.org/wiki/K-means_clustering
Unsupervised learning and clustering: http://www.mit.edu/~9.54/fall14/slides/Class13.pdf 

Reference Papers

Ahmad, Amir, and Lipika Dey. "A k-mean clustering algorithm for mixed numeric and categorical data." Data & Knowledge Engineering 63.2 (2007): 503-527.

Qi, Jianpeng, et al. "K*-means: An effective and efficient k-means clustering algorithm." 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom). IEEE, 2016.

Yu, K., et al. "Genetic association mapping under founder heterogeneity via weighted haplotype similarity analysis in candidate genes." Genetic Epidemiology 27.3 (2004): 182-191.

Kimoto, Takashi, et al. "Stock market prediction system with modular neural networks." 1990 IJCNN International Joint Conference on Neural Networks. IEEE, 1990.


