Esa Vilkama, President and Founder at Process Data Insights, LLC


In manufacturing troubleshooting, it is important to find causal correlations between process and product variables. For example, variation in line speed can cause excessive variation in product quality. Temperature fluctuation in a drying oven can cause product defects. A malfunctioning control system may result in abnormal process behavior. Failing equipment may even cause a process shutdown. Being able to quickly determine and eliminate the source of variation in critical process and product variables can greatly increase process uptime and yield. Information about causal correlations is also needed when determining optimal process windows (the best combinations of setup values) for developing and manufacturing products. In order to control processes, we need to know the effect of changes in control handles on important variables.


In statistics, correlation (dependence or association) is any statistical relationship, whether causal or not, between two random variables or bivariate data. If there is no relationship, the variables are said to be statistically independent of each other, i.e. their joint probability density is the product of their marginal densities. This means that knowing the value of the variable x does not give any information about the value of y. This happens when the data for two variables originate from completely separate, unrelated sources in the process, e.g. there is no mechanical, electrical or other connection between them. Statistical independence is a stronger property than uncorrelatedness. Two random variables are uncorrelated if their covariance is zero. For zero-mean variables, the cross-correlation is then also zero. Even when two variables are uncorrelated, they might still be statistically dependent. If the random variables are jointly Gaussian, independence and uncorrelatedness become the same thing.
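To make the difference between uncorrelated and independent concrete, here is a minimal Python sketch using NumPy. It is an illustration on synthetic data, not a method from a real process: the function name, the sample size, the seed and the choice of y = x^2 - 1 are all assumptions made for the example. The variable y is completely determined by x, yet the Pearson correlation between them comes out close to zero because the relationship is symmetric rather than linear.

```python
import numpy as np


def uncorrelated_but_dependent(n=100_000, seed=0):
    """Build y as a deterministic function of x and return two correlations.

    Hypothetical illustration only: x is standard normal and y = x**2 - 1,
    so y depends fully on x even though cov(x, y) = E[x**3] = 0.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = x**2 - 1.0  # zero-mean, fully determined by x

    # Linear (Pearson) correlation between x and y: expected to be near zero.
    r_xy = np.corrcoef(x, y)[0, 1]

    # Correlation between x**2 and y: exposes the hidden dependence.
    r_x2y = np.corrcoef(x**2, y)[0, 1]
    return r_xy, r_x2y


r_xy, r_x2y = uncorrelated_but_dependent()
print(f"corr(x, y)    = {r_xy:.3f}")
print(f"corr(x**2, y) = {r_x2y:.3f}")
```

The first value should be close to zero and the second close to one. In process data, a near-zero r-value therefore does not rule out a relationship; it only rules out a linear one.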


There are many different methods for finding correlations in manufacturing data sets. It is always important to remember that, in general, the presence of a correlation is not sufficient to infer a causal relationship between two variables, i.e. a high correlation value does not necessarily imply a real cause-and-effect relationship. Process knowledge from outside the process data set should be used to verify any suspected causality. Another important issue is time alignment of the signals when there is a time delay, either a real physical delay or a measurement delay, between the cause and the effect. For example, in continuous manufacturing processes, variation in an upstream variable may cause an effect that is measured a few minutes later downstream. We also need, of course, to collect all the relevant variables at sampling intervals that are suitable for effective data analysis and modeling. In upcoming posts I’ll cover many correlation-related topics: design of experiments (DOE), evolutionary operations (EVOP), data visualization, Pearson correlation coefficient (r-value), coefficient of determination (R-squared), regression, machine learning, frequency domain correlation (coherence), mutual information and principal component analysis (PCA).
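The time-alignment issue can be sketched in a few lines of Python with NumPy. This is a simplified illustration on synthetic data, not a production procedure: the function name estimate_delay, the 30-sample delay, the gain and the noise level are all assumptions chosen for the example. The idea is to shift the downstream signal against the upstream signal, compute the Pearson correlation at each candidate lag, and keep the lag where the correlation is strongest.

```python
import numpy as np


def estimate_delay(upstream, downstream, max_lag):
    """Return (best_lag, r_at_best_lag), searching lags 0..max_lag in samples.

    Hypothetical helper: it assumes both signals share the same, uniform
    sampling interval and that the effect never precedes the cause.
    """
    upstream = np.asarray(upstream, dtype=float)
    downstream = np.asarray(downstream, dtype=float)
    best_lag, best_r = 0, 0.0
    for lag in range(max_lag + 1):
        # Pair the upstream sample at time t with the downstream sample at t + lag.
        u = upstream[: len(upstream) - lag]
        d = downstream[lag:]
        r = np.corrcoef(u, d)[0, 1]
        if abs(r) > abs(best_r):
            best_lag, best_r = lag, r
    return best_lag, best_r


# Synthetic data (assumed values): the downstream signal is a scaled copy
# of the upstream signal, delayed by 30 samples, plus measurement noise.
rng = np.random.default_rng(1)
n, true_delay = 2000, 30
upstream = rng.normal(size=n)
downstream = np.zeros(n)
downstream[true_delay:] = 0.8 * upstream[:-true_delay]
downstream += 0.3 * rng.normal(size=n)

r_unaligned = np.corrcoef(upstream, downstream)[0, 1]
lag, r_aligned = estimate_delay(upstream, downstream, max_lag=100)
print(f"r without alignment: {r_unaligned:.3f}")
print(f"estimated delay: {lag} samples, r after alignment: {r_aligned:.3f}")
```

Without alignment the two signals look almost unrelated, while at the correct lag the correlation is strong. Real process signals are usually autocorrelated, so the correlation peak tends to be broader than in this white-noise example, and process knowledge (for instance line speed and the distance between sensors) is a useful check on the estimated delay.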

Part 2 coming soon!

Please contact us at Process Data Insights, LLC, if you are interested in a comprehensive correlation analysis of your manufacturing process.