*Esa Vilkama, President and Founder at Process Data Insights, LLC*

**Data Visualization**

**Benefits**

Data visualization is a graphic representation of data and information. It is a quick and effective way to detect visible correlations that could otherwise be hard to find in large, complex manufacturing datasets. Our eyes are drawn to colors and patterns. We can quickly tell red from blue and a square from a circle. By using visual elements like charts, graphs, maps, colors, shapes, and sizes, data visualization tools provide a way to quickly see trends, outliers, clusters, correlations, patterns, and other non-random information in data. They give insight into the structure of the data. Effective visualization helps us to analyze, reason, and make actionable decisions about data and evidence.

Visual methods are often used in exploratory data analysis (EDA) and in descriptive statistics. They can tell stories by curating data into a form easier to understand, highlighting useful information, and removing the “noise” from data. Visualization is often used to explain the results of data analysis. It can provide a convincing means of communicating the underlying message that is present in the data to others.

There are many data visualization methods that can be used in manufacturing process applications to show correlations.

**Multi-line Plots**

When the values of two variables are drawn in the same line plot, we can see whether and how much the variables “vary together” over time. The strength of this relationship is measured by the Pearson correlation coefficient (Pearson’s r), a statistic that measures the linear correlation between two variables. It has a value between −1 and +1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. In the plot above, r = 0.769 indicates a high correlation between x1 and x2, which we can also see visually. In continuous manufacturing processes, it is important to accurately time-synchronize the variables, if they are measured at different times, to see any potential correlation in the multi-line plot.
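As a minimal sketch of how Pearson's r is computed, the snippet below uses only the Python standard library. The sample values for `x1` and `x2` are invented for illustration and are not the data behind the plot discussed above; in practice a library routine such as `numpy.corrcoef` would normally be used instead.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))   # co-variation of x and y
    ss_x = sum(a * a for a in dx)              # variation of x
    ss_y = sum(b * b for b in dy)              # variation of y
    # Note: a constant series has zero variation and raises ZeroDivisionError.
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical readings of two process variables sampled at the same times
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.1, 3.9, 6.2, 7.8, 10.1]

r = pearson_r(x1, x2)
print(f"r = {r:.3f}")
```

A value close to +1 or −1 means the two lines move together (or in opposite directions) almost linearly; a value near 0 means no linear relationship is visible.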

**2D and 3D Scatter Plots**

A 2D scatter plot displays the values of two variables as points (x, y) on the horizontal and vertical axes. Any correlation between x and y is easy to see. The coefficient of determination (“R squared”) is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It is common to provide more information using colors, sizes, or shapes to show groups or additional variables. The 3D scatter plot above can be rotated, making it easier to see correlations among three variables for each species class.
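The definition of R squared can be made concrete with a small worked example. The sketch below fits a straight line by ordinary least squares and computes R squared as one minus the ratio of residual to total variation. It assumes a single independent variable, and the data points are hypothetical.

```python
def linear_fit(x, y):
    """Least-squares slope and intercept for y ~ slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(x, y):
    """Share of the variance in y explained by the fitted line."""
    slope, intercept = linear_fit(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


# Hypothetical scatter data (values invented for illustration)
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 2.0, 4.0]

r2 = r_squared(x, y)
print(f"R squared = {r2:.2f}")
```

For a straight-line fit with one independent variable, R squared equals the square of Pearson's r, so the two numbers describe the same scatter from different angles.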

**Pair Plots**

A pair plot allows us to see both the distribution of each variable and the correlation between every combination of two variables at the same time. The pair plot shown above uses the well-known iris flower data set. It contains two length and two width measurements (sepal and petal) for three different iris species. Color has been used to indicate the species. This dataset is often used in machine learning because the measurements provide an excellent way to distinguish the classes (species).

**Bubble Charts**

Bubble charts are useful as they can display correlations among many variables of a data set simultaneously in the same plot. In the plot above, four variables are shown: x-axis (x1), y-axis (x2), the bubble color (x3) and the bubble size (x4). The bubble shape could be the 5th variable, and we could use an interactive slider as the 6th variable. The bubbles could also be labelled with a categorical variable. Because the bubble plots can show so many variables simultaneously, they can be especially useful in optimizing manufacturing product and process designs.

**3D Surface and Contour Plots**

A 3D surface plot shows a dependent variable (y) as a function of two independent variables (x1 and x2). A contour plot represents the same 3D surface by plotting constant values of y as contour lines in a 2D format. These plots are useful in regression analysis for viewing the correlations among a dependent variable and two independent variables.

**Radar Chart**

A radar chart displays multivariate data in the form of a 2D chart of three or more quantitative variables represented on axes starting from the same point. The relative position and angle of the axes are typically uninformative, but various heuristics, such as algorithms that plot data as the maximal total area, can be applied to sort the variables (axes) into relative positions that reveal distinct correlations, trade-offs, and a multitude of other comparative measures. As an example, a radar chart could be used to visually compare different product designs in terms of the best combination of the values of the desired product features.
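One possible reading of the maximal-area heuristic, for a single data series on equally spaced axes, can be sketched as follows. The polygon area is the sum of the triangles between neighboring axes, and the axis order is chosen by brute force. The function names, the feature names, and the scores are all hypothetical, and this is an illustration of the idea rather than a specific published algorithm.

```python
import math
from itertools import permutations


def radar_area(values):
    """Area of the polygon traced by non-negative values on equally spaced axes."""
    n = len(values)
    if n < 3:
        raise ValueError("a radar chart needs at least three axes")
    # Each pair of neighboring axes spans a triangle: 0.5 * r1 * r2 * sin(angle)
    wedge = 0.5 * math.sin(2 * math.pi / n)
    return wedge * sum(values[i] * values[(i + 1) % n] for i in range(n))


def max_area_order(scores):
    """Brute-force the axis order with the largest area (small charts only)."""
    names = list(scores)
    first, rest = names[0], names[1:]   # fix one axis; rotations are equivalent
    best = max(
        permutations(rest),
        key=lambda p: radar_area([scores[k] for k in (first, *p)]),
    )
    return [first, *best]


# Hypothetical feature scores for one product design (0 to 10 scale)
design = {"strength": 9, "cost": 2, "weight": 8, "durability": 3, "finish": 7}

order = max_area_order(design)
print(order, round(radar_area([design[k] for k in order]), 2))
```

The search grows factorially with the number of axes, so this approach is only practical for a handful of variables.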

**Heat Map**

A correlation heat map shows the magnitudes of a linear correlation (r-value) between all the combinations of two variables as color in the cells of a 2D half-matrix (correlation matrix). In some cases, clustering of similar colors in the heat map may also convey important information. As there are typically a large number of variables in manufacturing process lines, by using a correlation heat map we can quickly find the variables with the highest linear correlations.
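The numbers behind a correlation heat map can be sketched without any plotting: compute r for every pair of variables and rank the pairs by absolute value. The variable names and readings below are invented for illustration; with real data, `pandas.DataFrame.corr` would produce the full matrix directly.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def strongest_pairs(columns, top=3):
    """Rank all variable pairs by |r|, strongest first."""
    pairs = [
        (a, b, pearson_r(columns[a], columns[b]))
        for a, b in combinations(columns, 2)   # one entry per half-matrix cell
    ]
    pairs.sort(key=lambda p: abs(p[2]), reverse=True)
    return pairs[:top]


# Hypothetical process variables (names and values invented for illustration)
data = {
    "temperature": [70, 72, 75, 79, 84, 90],
    "pressure": [30, 31, 33, 35, 38, 41],
    "speed": [5, 3, 6, 2, 7, 4],
    "moisture": [9.0, 8.7, 8.1, 7.6, 6.8, 6.0],
}

for a, b, r in strongest_pairs(data):
    print(f"{a:12s} {b:12s} r = {r:+.3f}")
```

Sorting by the absolute value keeps strong negative correlations near the top of the list, which matches how a diverging color scale highlights both ends in a heat map.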

**Cross-Correlation and Auto-Correlation Plots**

A cross-correlation plot (cross-correlogram) shows the linear correlation between two variables as a function of delay. This is useful when looking for causes of variation in process and product variables. In continuous manufacturing processes, it is important to accurately measure the delay to maximize the r-value if the variables are truly correlated. An auto-correlation plot (correlogram) is designed to show whether the current value of a time series correlates with its past values. It can be used to find repeating patterns, such as the presence of a periodic signal obscured by noise.
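The idea of scanning delays to maximize r can be sketched as follows. The snippet shifts one series against the other, computes Pearson's r at each lag, and reports the lag with the largest absolute r. The signals are synthetic (a downstream series built as a copy of the upstream one delayed by three samples), and the simple per-lag Pearson estimate used here is an assumption; dedicated time-series libraries use slightly different normalizations.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def lagged_r(x, y, lag):
    """r between x[t] and y[t + lag] for a non-negative lag in samples."""
    if lag == 0:
        return pearson_r(x, y)
    return pearson_r(x[:-lag], y[lag:])


def best_lag(x, y, max_lag):
    """Return (lag, r) with the largest |r| over lags 0..max_lag."""
    scores = [(lag, lagged_r(x, y, lag)) for lag in range(max_lag + 1)]
    return max(scores, key=lambda s: abs(s[1]))


# Synthetic upstream signal and a downstream copy delayed by 3 samples
x = [math.sin(0.4 * t) + 0.3 * math.sin(1.3 * t) for t in range(60)]
delay = 3
y = [0.0] * delay + x[:-delay]

lag, r = best_lag(x, y, max_lag=10)
print(f"best lag = {lag} samples, r = {r:.3f}")
```

Calling `lagged_r(x, x, k)` with the same series in both positions gives a rough auto-correlation value at lag `k`, which is what a correlogram plots against `k`.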

**Interactive Data Visualization**

Other variables, not used in the plot itself, can be used interactively to dynamically change the plot. They can be used as sliders (like above), pull-down menus, buttons, etc.

**Python Libraries for Data Visualization**

Matplotlib, Plotly, SymPy, Seaborn, ggplot, Altair, Bokeh, pygal

This website displays hundreds of charts, along with reproducible Python code: https://python-graph-gallery.com/

**Next Blog in This Series:**

In the next blog I will cover how to measure and calculate correlations.

Please contact us at Process Data Insights, LLC, if you are interested in a comprehensive correlation analysis of your manufacturing process.