Poltergeist Statistics: Correlation Coefficient with pandas and numpy

As a data scientist in training, I do a lot of exploratory analysis these days, examining different variables in the data and seeing how they may be related. There is a nice little trick for understanding this relationship, and it comes from the magical field of statistics.

Imagine that, as part of your education at Hogwarts, you need to measure how ghosts and messiness at Hogwarts are related. You buy some sort of ghost metronome in Diagon Alley to measure the presence of the bodiless undead. For your messiness measure, you break into Filch’s office and steal his meticulously collected records of the messes he had to clean up at Hogwarts.

Great, you have your data now! When you plot it in Python, it ends up looking something like this:

ax = data.plot.scatter(x='GhostPresence', y='FilchMessiness')  # data is a pandas DataFrame

[Figure: scatter plot of GhostPresence against FilchMessiness, showing a positive correlation]

Notice how the points trend toward the upper right-hand corner? Most likely, the two variables are positively correlated. Let’s find a way to express this quantitatively. For this tutorial, we will use Pearson’s correlation coefficient as implemented in the numpy library.

Pearson correlation shows the linear association between continuous variables. This coefficient measures the extent to which a relationship between two variables can be described by a line, and it always falls between -1 and 1. The equation for the Pearson correlation is pretty simple: you take the covariance, which describes how our two variables, GhostPresence and FilchMessiness, vary together, and divide it by the product of the standard deviations of GhostPresence and FilchMessiness. Expressed as an equation, it will look something like this:

Pearson(GhostPresence,FilchMessiness) = Covariance(GhostPresence,FilchMessiness)/(StdDev(GhostPresence)*StdDev(FilchMessiness))
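
To make the equation concrete, here is a minimal sketch that computes it by hand and checks the result against numpy. The numbers are invented for illustration (they are not Filch's real records), and the helper name pearson is my own, not a numpy function.

```python
import numpy as np

# Hypothetical measurements, made up for illustration only.
ghost_presence = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
filch_messiness = np.array([2.0, 4.0, 5.0, 4.0, 5.0])


def pearson(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Population covariance: average product of the deviations from the means.
    covariance = np.mean((x - x.mean()) * (y - y.mean()))
    # ndarray.std() defaults to the population standard deviation (ddof=0),
    # which matches the covariance above.
    return covariance / (x.std() * y.std())


r_manual = pearson(ghost_presence, filch_messiness)
r_numpy = np.corrcoef(ghost_presence, filch_messiness)[0, 1]
print(r_manual, r_numpy)
```

Both numbers should agree up to floating-point rounding. Whether you use the population or the sample version of covariance and standard deviation does not matter here, as long as you use the same one for both, because the scaling factor cancels out.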

Since explaining covariance and standard deviation is beyond the scope of this article, I will leave those for the future and for now let Python calculate the Pearson coefficient for us:

from numpy import corrcoef

correlation1 = corrcoef(GhostPresence, FilchMessiness)
print(correlation1)
[[1.   0.85]
 [0.85 1.  ]]

Notice how the code returned a matrix when we were hoping for a single number to express the relationship between the two variables? That is because corrcoef returns the correlation coefficient matrix of the variables, laid out like this:

Corr(GhostPresence, GhostPresence)     Corr(GhostPresence, FilchMessiness)
Corr(FilchMessiness, GhostPresence)    Corr(FilchMessiness, FilchMessiness)

The value we want is Corr(GhostPresence, FilchMessiness), which sits in row 0, column 1 of the matrix, so you can access it like so:

correlation1 = corrcoef(GhostPresence, FilchMessiness)[0, 1]
0.85
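
If your data already lives in a pandas DataFrame, you can skip the matrix altogether. The sketch below assumes a DataFrame called data with the two columns used in the plot above; the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical readings, invented for illustration only.
data = pd.DataFrame({
    "GhostPresence": [1.0, 2.0, 3.0, 4.0, 5.0],
    "FilchMessiness": [2.0, 4.0, 5.0, 4.0, 5.0],
})

# Series.corr uses Pearson's method by default and returns a single number.
r_pandas = data["GhostPresence"].corr(data["FilchMessiness"])

# The same value sits at row 0, column 1 of numpy's 2x2 matrix.
r_numpy = np.corrcoef(data["GhostPresence"], data["FilchMessiness"])[0, 1]
print(r_pandas, r_numpy)
```

The two results should match up to floating-point rounding, so which one you use is mostly a matter of where your data lives.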

Returning a matrix seems counterintuitive at first, but it lets numpy, the library that implements corrcoef, express the pairwise correlations of many different variables at once. For example, if we had a third variable, PotterWasHere, corrcoef would give us a 3x3 matrix with one entry for every pair of variables.
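
Here is a rough sketch of what that looks like with three variables. The data is randomly generated for illustration: I am assuming, purely for the sake of the example, that messiness follows ghost presence plus some noise and that Potter's whereabouts are unrelated to both.

```python
import numpy as np

# Synthetic data, generated for illustration only.
rng = np.random.default_rng(seed=0)
ghost_presence = rng.normal(size=200)
filch_messiness = ghost_presence + rng.normal(scale=0.5, size=200)
potter_was_here = rng.normal(size=200)

# Each row passed to corrcoef is treated as one variable.
matrix = np.corrcoef([ghost_presence, filch_messiness, potter_was_here])

print(matrix.shape)   # (3, 3)
print(matrix[0, 1])   # Corr(GhostPresence, FilchMessiness)
print(matrix[0, 2])   # Corr(GhostPresence, PotterWasHere)
```

The diagonal is always 1 (every variable is perfectly correlated with itself) and the matrix is symmetric, so matrix[0, 1] and matrix[1, 0] hold the same value.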

Note the difference between correlation and causation. Just because Harry Potter and ghosts seem to be correlated, it does not mean that Potter creates ghosts everywhere he goes (sorry, Cedric!). You would need to set up your experiment completely differently to infer whether a change in one variable causes a change in another, but that is a story for a different blog post.

For now, enjoy playing with numpy powers and be CodeBrave!
