Esa Vilkama, President and Founder at Process Data Insights, LLC

How to Measure and Calculate Correlation 1

Correlation and dependence

As mentioned in Part 1 of this blog series, correlation or dependence is any statistical relationship, whether causal or not, between two random variables. Correlations are useful because they can indicate a predictive relationship that can be exploited in practice, or they can be used for explanation or better understanding of the process. However, in general, the presence of a correlation is not sufficient to infer the presence of a causal relationship.

Two random variables are uncorrelated if their covariance is zero.

Cross-correlation

In signal processing, cross-correlation is a measure of similarity of two series as a function of the temporal or spatial displacement of one relative to the other. In probability and statistics, the term cross-correlation refers to some type of correlation between two random variables. There are several correlation coefficients measuring the degree of correlation.

Pearson correlation coefficient

The Pearson product-moment correlation coefficient (also known as r, Pearson’s r, PPMCC, or the bivariate correlation) is a measure of the strength and direction of the linear relationship between two continuous variables x and y. It is defined as the covariance of the variables divided by the product of their standard deviations. It has a value between −1 and +1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. Outliers can skew the estimate. Pearson’s r is the best known and most commonly used type of correlation coefficient.
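As a minimal sketch of the definition above, the snippet below computes Pearson's r directly from the covariance and standard deviations and compares it with `scipy.stats.pearsonr` and `numpy.corrcoef`. The helper name `pearson_r` and the synthetic data are illustrative assumptions, not part of the original article.

```python
import numpy as np
from scipy import stats

def pearson_r(x, y):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    # The 1/n factors of the covariance and standard deviations cancel out.
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

# Synthetic example data (assumption): y is a noisy linear function of x.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=0.5, size=200)

r_manual = pearson_r(x, y)
r_scipy, p_value = stats.pearsonr(x, y)   # SciPy also returns a p-value
r_numpy = np.corrcoef(x, y)[0, 1]         # off-diagonal entry of the 2x2 matrix

print(r_manual, r_scipy, r_numpy)
```

All three values should agree to within floating-point rounding, since they implement the same formula.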

Rank correlation methods

Standard measures such as Pearson’s correlation are easy to calculate and interpret when both variables have a well-understood Gaussian distribution. When we do not know the distribution of the variables, we can use nonparametric, distribution-free rank correlation methods. Rank correlation refers to methods that quantify the association between variables using the ordinal relationship between the values (for example, integer ranks assigned after sorting) rather than the values themselves.

Spearman’s rank correlation coefficient

The Spearman correlation is a nonparametric version of the Pearson correlation coefficient that measures the degree of association between two variables based on their ranks instead of their data values. It assesses how well the relationship between two variables can be described using a monotonic function. The Spearman correlation is less sensitive to outliers because it limits each outlier to the value of its rank. In addition to being used with non-normal data, the Spearman correlation can also be used with ordinal data. If there are no tied ranks (the same rank assigned to two or more observations), this formula can be used:

$$\rho = 1 - \frac{6 \sum_i d_i^2}{n (n^2 - 1)}$$

where d_i is the difference between the two ranks of observation i and n is the number of observations.
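The sketch below assumes the formula referred to above is the usual no-ties expression, rho = 1 − 6·Σd² / (n·(n² − 1)), where d is the difference between the two ranks of each observation. It compares that formula with `scipy.stats.spearmanr` and shows that an extreme outlier does not change the result as long as the ranks stay the same. The helper name `spearman_no_ties` and the data are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def spearman_no_ties(x, y):
    """Spearman's rho from rank differences; only valid when there are no tied ranks."""
    rank_x = stats.rankdata(x)
    rank_y = stats.rankdata(y)
    d = rank_x - rank_y
    n = len(d)
    return float(1.0 - 6.0 * np.sum(d**2) / (n * (n**2 - 1)))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.exp(x)              # monotonic, but strongly non-linear
y_outlier = y.copy()
y_outlier[-1] = 1e9        # extreme outlier; the rank order is unchanged

rho_manual = spearman_no_ties(x, y)
rho_scipy, _ = stats.spearmanr(x, y)
rho_outlier, _ = stats.spearmanr(x, y_outlier)
r_pearson, _ = stats.pearsonr(x, y)   # lower than rho because the relation is not linear

print(rho_manual, rho_scipy, rho_outlier, r_pearson)
```

When ties are present, `scipy.stats.spearmanr` is the safer choice, because it computes Pearson's r on the (averaged) ranks instead of relying on the shortcut formula.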

Kendall rank correlation coefficient

Kendall’s tau is a nonparametric measure of the relationship between columns of ranked data. It is a measure of rank correlation: the similarity of the orderings of the data when ranked by each of the quantities. It takes values from −1 (perfectly reversed ordering) through 0 (no relationship) to +1 (identical ordering). In most situations, the interpretation of Kendall’s tau is similar to that of Spearman’s rank correlation coefficient, which is more widely used. Kendall’s tau is nevertheless often recommended for small samples, because its sampling distribution has better statistical properties.
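To make the idea of "similarity of the orderings" concrete, here is a minimal sketch that counts concordant and discordant pairs directly and compares the result with `scipy.stats.kendalltau`. The helper `kendall_tau_a` is an illustrative assumption and ignores ties; SciPy computes the tau-b variant by default, which gives the same value when there are no ties.

```python
from itertools import combinations

import numpy as np
from scipy import stats

def kendall_tau_a(x, y):
    """Tau-a: (concordant - discordant) / number of pairs. Assumes no ties."""
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(x)), 2):
        # A pair is concordant when both variables order it the same way.
        s = np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

x = [1, 2, 3, 4, 5]
y_same = [10, 20, 30, 40, 50]      # identical ordering
y_reversed = [50, 40, 30, 20, 10]  # fully reversed ordering
y_mixed = [2, 1, 4, 3, 5]          # two of the ten pairs are swapped

tau_same = kendall_tau_a(x, y_same)
tau_reversed = kendall_tau_a(x, y_reversed)
tau_mixed = kendall_tau_a(x, y_mixed)
tau_scipy, _ = stats.kendalltau(x, y_mixed)

print(tau_same, tau_reversed, tau_mixed, tau_scipy)
```

The pairwise loop is O(n²) and is only meant to show the definition; for real data the SciPy function is the better choice.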

Coefficient of determination

The coefficient of determination, denoted R^2 (R-squared), is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). For a least-squares fit evaluated on the data it was fitted to, the value of R^2 varies from 0 (the model explains none of the variation) to 1 (perfect fit). It is a statistic used in the context of statistical models that can be used for the prediction of future outcomes. It provides a goodness-of-fit measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model. R-squared values can also be used for non-linear correlations and for multiple inputs. The definitions below show how to calculate the R-squared value for the linear correlation (line fitting).

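In the definitions below, y_i are the observed values, \bar{y} is their mean, and f_i are the values predicted by the fitted line.

The total sum of squares (proportional to the variance of the data):

$$SS_{tot} = \sum_i (y_i - \bar{y})^2$$

The regression sum of squares, also called the explained sum of squares:

$$SS_{reg} = \sum_i (f_i - \bar{y})^2$$

The sum of squares of residuals, also called the residual sum of squares:

$$SS_{res} = \sum_i (y_i - f_i)^2$$

The definition of the coefficient of determination:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$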
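The sums of squares listed above can be sketched in a few lines of NumPy. This example assumes the standard definitions: with observed values y, their mean, and fitted values f, the total sum of squares is Σ(y − mean)², the regression sum of squares is Σ(f − mean)², the residual sum of squares is Σ(y − f)², and R-squared is 1 − SS_res / SS_tot. The function name and the sample data are illustrative assumptions.

```python
import numpy as np

def r_squared_linear_fit(x, y):
    """Fit a least-squares line and return (R^2, SS_tot, SS_reg, SS_res)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line
    f = slope * x + intercept                   # fitted (predicted) values
    y_mean = y.mean()
    ss_tot = np.sum((y - y_mean) ** 2)          # total sum of squares
    ss_reg = np.sum((f - y_mean) ** 2)          # explained sum of squares
    ss_res = np.sum((y - f) ** 2)               # residual sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return r2, ss_tot, ss_reg, ss_res

# Made-up measurements that roughly follow y = 2x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 1.9, 4.2, 5.8, 8.1, 9.9])

r2, ss_tot, ss_reg, ss_res = r_squared_linear_fit(x, y)
print(r2, ss_tot, ss_reg, ss_res)
```

For a least-squares line with an intercept, SS_tot equals SS_reg + SS_res, so R-squared can equivalently be written as SS_reg / SS_tot.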

How are r and R-squared related?

The Pearson correlation coefficient r is a statistic that measures the linear correlation between two variables. The coefficient of determination (R-squared) is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). In simple linear regression (a least-squares line with an intercept, fitted to one input variable), R-squared is equal to r^2. More generally, r can be calculated between y and the predicted value of y, which can be a function of many input variables. See examples below.
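The relationship can be checked numerically. The sketch below uses synthetic data (an assumption for illustration) and ordinary least squares with an intercept. It first shows that r² between x and y equals R-squared of the fitted line, and then that, with two inputs, the squared correlation between y and its predicted value equals R-squared of the model.

```python
import numpy as np

def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot for observed y and predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 10.0, size=100)
z = rng.uniform(0.0, 10.0, size=100)

# One input: simple linear regression.
y = 3.0 * x + 5.0 + rng.normal(scale=4.0, size=100)
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = slope * x + intercept
r = np.corrcoef(x, y)[0, 1]
r2_single = r_squared(y, y_hat)

# Two inputs: least squares on [x, z, 1]; correlate y with its prediction.
y2 = 3.0 * x - 2.0 * z + 5.0 + rng.normal(scale=4.0, size=100)
A = np.column_stack([x, z, np.ones_like(x)])
coef, *_ = np.linalg.lstsq(A, y2, rcond=None)
y2_hat = A @ coef
r_multi = np.corrcoef(y2, y2_hat)[0, 1]
r2_multi = r_squared(y2, y2_hat)

print(r**2, r2_single)
print(r_multi**2, r2_multi)
```

The equality relies on the least-squares fit including an intercept and on R-squared being evaluated on the same data used for fitting; it does not hold in general for other models or for held-out data.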

Cross-correlation function

In time series, cross-correlation is a measure of similarity (correlation) of two series as a function of the temporal lag (k) of one relative to the other. It is commonly used for looking for causes for variation in continuous manufacturing processes, determining the delay between two series, and searching a long signal for a shorter, known feature.
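As a minimal sketch of delay estimation, the function below computes the Pearson correlation between x[t] and y[t + k] over a range of lags k and picks the lag with the highest correlation. The function name, the lag convention (a positive k means y lags behind x), and the synthetic signals are assumptions made for this example.

```python
import numpy as np

def cross_correlation(x, y, max_lag):
    """Correlation between x[t] and y[t + k] for k = -max_lag .. max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    lags = np.arange(-max_lag, max_lag + 1)
    ccf = np.empty(len(lags))
    for i, k in enumerate(lags):
        if k >= 0:
            a, b = x[:n - k], y[k:]      # pair x[t] with y[t + k]
        else:
            a, b = x[-k:], y[:n + k]     # negative lag: y leads x
        ccf[i] = np.corrcoef(a, b)[0, 1]
    return lags, ccf

# Synthetic process data: y follows x with a delay of 7 samples plus noise.
rng = np.random.default_rng(2)
x = rng.normal(size=500)
delay = 7
y = np.roll(x, delay) + 0.2 * rng.normal(size=500)

lags, ccf = cross_correlation(x, y, max_lag=20)
best_lag = int(lags[np.argmax(ccf)])
print(best_lag, ccf.max())
```

Computing a full Pearson correlation on the overlapping segments at every lag is one of several estimator variants. It is easy to read but slow for long series, which is where an FFT-based calculation helps.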

Auto-correlation function

The autocorrelation function measures how the current value of a time series correlates with its past values. It is the similarity between observations as a function of the time lag k between them. It can be used to find repeating patterns, such as the presence of a periodic signal obscured by noise.
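The following sketch estimates the autocorrelation function of a noisy sine wave and looks for the first peak away from lag 0, which should fall near the period of the hidden signal. The function name, the signal parameters, and the search window are illustrative assumptions.

```python
import numpy as np

def autocorrelation(x, max_lag):
    """Sample autocorrelation for lags 0 .. max_lag (normalised so acf[0] == 1)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.sum(x ** 2)
    n = len(x)
    return np.array([np.sum(x[:n - k] * x[k:]) / denom for k in range(max_lag + 1)])

# A periodic signal (period 50 samples) obscured by noise of larger variance.
rng = np.random.default_rng(3)
t = np.arange(5000)
period = 50
signal = np.sin(2.0 * np.pi * t / period) + rng.normal(scale=1.0, size=t.size)

acf = autocorrelation(signal, max_lag=80)

# Search between half a period and one and a half periods (window is an assumption).
search_start = 30
peak_lag = search_start + int(np.argmax(acf[search_start:71]))
print(peak_lag, acf[peak_lag])
```

The white noise contributes to the autocorrelation only at lag 0, so the periodic component stands out at larger lags even though it is hard to see in the raw series.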

Using FFT for fast calculation of cross-correlation function

Using the FFT and the correlation theorem, we can accelerate the correlation computation for long time series. The correlation theorem says that multiplying the Fourier transform of one function by the complex conjugate of the Fourier transform of the other gives the Fourier transform of their correlation. To calculate the correlation function between x and y data series, take both signals into the frequency domain, form the complex conjugate of one of the signals, multiply, then take the inverse Fourier transform.
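The steps above can be sketched with NumPy's FFT routines. This example assumes real-valued signals, zero-pads both to length n + m − 1 so the circular correlation does not wrap around, and uses the convention c[k] = Σ x[t]·y[t + k]. The function name `xcorr_fft` is hypothetical. The result is checked against the direct calculation with `numpy.correlate`.

```python
import numpy as np

def xcorr_fft(x, y):
    """Linear cross-correlation c[k] = sum_t x[t] * y[t + k] computed via the FFT."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    size = n + m - 1                       # zero-padding avoids circular wrap-around
    X = np.fft.rfft(x, n=size)             # both signals into the frequency domain
    Y = np.fft.rfft(y, n=size)
    c = np.fft.irfft(np.conj(X) * Y, n=size)  # conjugate one, multiply, invert
    # Negative lags are stored at the end of the circular result; reorder them.
    corr = np.concatenate([c[size - (n - 1):], c[:m]])
    lags = np.arange(-(n - 1), m)
    return lags, corr

rng = np.random.default_rng(4)
x = rng.normal(size=256)
delay = 5
y = np.concatenate([np.zeros(delay), x[:-delay]])  # y is x delayed by 5 samples

lags, corr = xcorr_fft(x, y)
direct = np.correlate(y, x, mode="full")   # direct O(n*m) reference calculation
best_lag = int(lags[np.argmax(corr)])
print(best_lag, np.allclose(corr, direct))
```

The FFT version scales as O(N log N) rather than O(N²). Note that this returns the raw (unnormalised) correlation; to obtain correlation coefficients, the series would first be mean-centred and the result scaled by the standard deviations.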

Software

Python libraries NumPy, SciPy, Pandas, and Phi_K Correlation Analyzer have many correlation-related functions.

Other correlations

In the next and last part of this correlation series, I will cover how to calculate correlations by using coherence, mutual information (MI), principal component analysis (PCA), and distance correlation.