Industrial Machine Learning using XGBoost

Filip Mulier, Chief Science Officer, Process Data Insights, LLC ♦ Data Science

Machine learning makes it possible to develop accurate predictive models of industrial processes. In this example we show how these methods can be used to build a predictive model for the energy output of a combined-cycle power plant. We also show how to use the model for what-if analysis and visualization of process performance, and discuss some key considerations for online prediction.

The goal of this article is to give a basic introduction to building a predictive model for an industrial process, along with a basic understanding of the machine learning modeling process and related technology.

We will be using cloud computing to solve this problem: Amazon’s S3 and SageMaker services perform the data analysis and ML model building. In addition, we leverage the secure infrastructure provided by AWS Virtual Private Cloud (VPC).

In this article we use some basic features of AWS SageMaker. AWS SageMaker also provides advanced features for extensive automated tuning of models (Hyper-Parameter Optimization (HPO) and Experiments) and an advanced environment for model building (SageMaker Studio). We will cover those in future articles.

The Industrial Process and Data 

A combined-cycle power plant uses both a gas and a steam turbine together to produce up to 50 percent more electricity from the same fuel than a traditional simple-cycle plant. The waste heat from the gas turbine is routed to the nearby steam turbine, which generates extra power. In this example we will create a machine learning model that can predict the hourly full-load electrical power output of the power plant under varying conditions. The plant’s energy output is influenced by four measured variables: ambient temperature, ambient pressure, relative humidity, and exhaust vacuum (the exhaust steam pressure). The dataset for this example is available from the UC Irvine Machine Learning Repository as the Combined Cycle Power Plant data set (combined+cycle+power+plant).

Secure Data Management 

In general, data security is critical when handling industrial data, for both safety and business reasons, especially as it relates to plant equipment, product quality or energy production. Although in this example we are using public data, we want to go through the steps taken to keep data secure. Data is encrypted in transit to S3 (for example, over HTTPS) and encrypted at rest in S3, with well-defined access controls that require multi-factor authentication (MFA).
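As an illustration of the controls described above, the sketch below builds an S3 bucket policy that denies requests not made over HTTPS, denies uploads that do not request server-side encryption, and denies object reads without MFA. The condition keys are standard AWS policy keys, but the function name, the statement IDs, and the choice of which actions to restrict are assumptions made for this example. A real policy would need to be adapted, for instance to exempt the role used by SageMaker, which does not sign in with MFA.

```python
import json


def build_bucket_policy(bucket_name: str) -> dict:
    """Return a bucket policy (as a dict) enforcing HTTPS, encryption and MFA.

    Illustrative only: statement IDs and the restricted actions are assumptions.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    object_arn = f"{bucket_arn}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # Encryption in transit: reject any request not sent over TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [bucket_arn, object_arn],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {   # Encryption at rest: reject uploads without an encryption header.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": object_arn,
                "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
            },
            {   # MFA: reject object reads from sessions without MFA.
                "Sid": "DenyReadWithoutMFA",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": object_arn,
                "Condition": {"BoolIfExists": {"aws:MultiFactorAuthPresent": "false"}},
            },
        ],
    }


def policy_as_json(bucket_name: str) -> str:
    """Serialize the policy so it can be attached to the bucket."""
    return json.dumps(build_bucket_policy(bucket_name), indent=2)
```

The JSON string returned by policy_as_json can be pasted into the bucket's permissions in the AWS console or applied with the AWS tooling of your choice.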

The dataset contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011) during full load operation. The features are the hourly averages of ambient Temperature (AT) in deg C, Ambient Pressure (AP) in millibar, Relative Humidity (RH) in % and Exhaust Vacuum (V) in cm Hg. They are used to predict the net hourly electrical energy output (PE) of the plant in MW.

Secure Machine Learning 

AWS SageMaker, a computational analytics environment, allows coding of scripts to preprocess, analyze, and model data sets. One of the design principles of AWS is infrastructure as code. This makes it easy to spin up secure compute instances for training and inference using scripts right in SageMaker.

Review the data quality 

Our first step: Pull data into SageMaker and begin analysis.  We can do some simple charting to see if there are any visible correlations in the data. 
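Alongside the charts, a quick numeric check is the correlation of each input with the output. The sketch below assumes the data has been loaded into a pandas DataFrame with the columns AT, V, AP, RH and PE; the function name and the file name in the usage comment are hypothetical.

```python
import pandas as pd

INPUTS = ["AT", "V", "AP", "RH"]
TARGET = "PE"


def correlation_with_target(df: pd.DataFrame) -> pd.Series:
    """Pearson correlation of each input with PE, strongest (by magnitude) first."""
    corr = df[INPUTS + [TARGET]].corr()[TARGET].drop(TARGET)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Example usage (the file name is a placeholder, not the real data location):
# df = pd.read_csv("ccpp.csv")
# print(correlation_with_target(df))
```

Correlation only captures linear relationships, so it complements the scatter plots rather than replacing them.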

The graphs show that AT and V have a strong relationship with PE, while AP and RH have a weaker relationship. This suggests that all the input variables may be informative in predicting the output energy, so we will use all of them to build the model.

Preparing the data for Machine Learning 

Our second step: We will use our data to do two things: 

  • Build a good ML model that uses the input variables AT, V, AP, and RH to predict PE. The model will be based on fitting the data using an ML method called XGBoost.
  • Show how well the model can predict new data, data that was not used to build the model. 

To get a realistic measure of how well the model predicts new data, we set some data aside that is not used for building the model. We split the data set into two portions, called the training and validation sets.
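The split described above can be sketched as a random shuffle followed by a cut. The article does not state the proportions used, so the 80/20 split, the seed, and the function name below are assumptions.

```python
import numpy as np
import pandas as pd


def split_train_validation(df: pd.DataFrame, validation_fraction: float = 0.2,
                           seed: int = 42):
    """Randomly split rows into (training, validation) DataFrames."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(df))          # shuffled row positions
    n_val = int(round(len(df) * validation_fraction))
    val_rows, train_rows = order[:n_val], order[n_val:]
    train = df.iloc[train_rows].reset_index(drop=True)
    validation = df.iloc[val_rows].reset_index(drop=True)
    return train, validation
```

Fixing the seed makes the split reproducible, which matters when comparing models trained with different parameter settings.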

What is XGBoost? 

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.) artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision tree-based algorithms are considered best-in-class right now. 

AWS SageMaker provides an industrial-strength implementation of XGBoost that can be used to train the model on large data sets using multiple compute instances. Once built, the model can be deployed on a dedicated virtual server to provide predictions in volume with world-class uptime.

Predictive Model with XGBoost 

Generating a model requires setting some training parameters and training the model. The parameters control the flexibility of the model to fit the data. The XGBoost method has many parameters that can be tuned, but fortunately only a few have the most impact. The chart below shows the training process progression for max_depth. It shows an error of 3.62 MW for this model on the validation set.  

To improve the model, we can search over the model parameter values to find the settings with the lowest error. SageMaker provides some sophisticated tools to do this search very effectively across multiple model parameters. For this example, we will do it manually to show the general approach. We ran repeated trials to find the value of the key parameter (max_depth) that gives the lowest error on the validation set. The chart below shows that with max_depth=15 the error drops to 3.19 MW, a modest improvement.
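The manual search amounts to a loop: train one model per candidate value and keep the value with the lowest validation error. The sketch below is written against any model with scikit-learn style fit and predict methods; the function names and the make_model factory are assumptions for illustration, and RMSE is assumed as the error measure.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between measured and predicted values."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))


def search_max_depth(make_model, depths, X_train, y_train, X_val, y_val):
    """Train one model per depth; return (best_depth, {depth: validation RMSE})."""
    results = {}
    for depth in depths:
        model = make_model(depth)          # fresh, untrained model for this depth
        model.fit(X_train, y_train)
        results[depth] = rmse(y_val, model.predict(X_val))
    best_depth = min(results, key=results.get)
    return best_depth, results


# Example usage with the open-source xgboost package (candidate depths are arbitrary):
# import xgboost
# best, scores = search_max_depth(
#     lambda d: xgboost.XGBRegressor(max_depth=d),
#     [3, 6, 9, 12, 15, 18], X_train, y_train, X_val, y_val)
```

Because the validation set is used to pick the parameter, the resulting error is slightly optimistic; a third held-out test set would give an unbiased estimate.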

The next chart gives you a visual indication for how well the model predicts the power output. 

So how much better is this model compared to a standard statistical approach? We did a quick comparison using linear regression (polynomial model), which gives an RMSE of 4.08 MW.
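A polynomial regression baseline of this kind can be sketched with NumPy alone: expand the inputs into polynomial terms, then fit the coefficients by least squares. The article does not state the polynomial degree, so degree 2 is an assumption here, as are the function names.

```python
import itertools

import numpy as np


def polynomial_features(X, degree: int = 2) -> np.ndarray:
    """Expand columns of X into all monomials up to `degree`, plus an intercept."""
    X = np.asarray(X, dtype=float)
    columns = [np.ones(len(X))]
    for d in range(1, degree + 1):
        for idx in itertools.combinations_with_replacement(range(X.shape[1]), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)


def fit_polynomial_baseline(X_train, y_train, degree: int = 2):
    """Fit by least squares and return a predict(X) function."""
    X_train = np.asarray(X_train, dtype=float)
    # Standardize inputs so the least-squares problem is well conditioned.
    # Assumes no input column is constant (std would be zero).
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)
    design = polynomial_features((X_train - mean) / std, degree)
    coef, *_ = np.linalg.lstsq(design, np.asarray(y_train, dtype=float), rcond=None)

    def predict(X):
        Z = (np.asarray(X, dtype=float) - mean) / std
        return polynomial_features(Z, degree) @ coef

    return predict
```

The returned predict function can be scored on the validation set with the same RMSE calculation used for the XGBoost model, so the two numbers are directly comparable.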

Additional Insights From XGBoost Machine Learning 

The XGBoost model provides the relative variable importance as computed by the model (see chart below). We can confirm that the result is in line with our original scatterplots. If we were to see surprising findings, we would want to dig into them, because they may indicate either incorrect modeling or a hidden or complex relationship that can be exploited. In either case we would review and analyze the results and take appropriate steps.
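Reading the importances out of a fitted model is a short step. The sketch below assumes a model object exposing a feature_importances_ attribute, as scikit-learn style estimators such as xgboost.XGBRegressor do after fitting; the helper name is hypothetical.

```python
import numpy as np


def ranked_importance(model, feature_names):
    """Return [(feature, share)] sorted by importance, shares summing to 1."""
    importances = np.asarray(model.feature_importances_, dtype=float)
    if len(importances) != len(feature_names):
        raise ValueError("one feature name is required per importance value")
    total = importances.sum()
    shares = importances / total if total > 0 else importances
    order = np.argsort(shares)[::-1]       # indices from most to least important
    return [(feature_names[i], float(shares[i])) for i in order]


# Example usage, assuming `model` was trained on columns in this order:
# for name, share in ranked_importance(model, ["AT", "V", "AP", "RH"]):
#     print(f"{name}: {share:.1%}")
```

Note that XGBoost offers several importance definitions (for example by gain or by split count), and the ranking can differ between them, so it is worth checking which one a chart is based on.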

Offline Model Usage for Process Understanding 

Initially we recommend using the model offline, to provide process understanding to the operators and engineers in control of the process, before using the model and automation to directly control the process. For example, the model can be used for what-if analysis, making predictions for various conditions for the engineers. The chart below shows an example of what happens when changing the Ambient Temperature (AT) from 10 to 30 deg C, while holding the other input variables fixed.
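A what-if sweep like this one can be sketched as follows: copy a baseline operating point, vary a single input over a range, and ask the model for a prediction at each value. The function takes any predict callable that accepts a 2-D array with columns in the order AT, V, AP, RH; the function name and the choice of baseline are assumptions.

```python
import numpy as np

FEATURES = ["AT", "V", "AP", "RH"]  # assumed column order of the model inputs


def what_if_sweep(predict, baseline: dict, feature: str, values) -> np.ndarray:
    """Predict output while varying one feature and holding the others at baseline."""
    values = np.asarray(values, dtype=float)
    base_row = [baseline[name] for name in FEATURES]
    rows = np.tile(base_row, (len(values), 1)).astype(float)
    rows[:, FEATURES.index(feature)] = values   # overwrite only the swept column
    return np.asarray(predict(rows), dtype=float)


# Example usage, with the other inputs held at their median values:
# baseline = df[FEATURES].median().to_dict()
# at_values = np.linspace(10, 30, 21)
# predicted_pe = what_if_sweep(model.predict, baseline, "AT", at_values)
```

Keep in mind that inputs such as AT and V are correlated in real operation, so some combinations produced by holding the others fixed may lie outside the conditions the model was trained on.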

The chart below shows the modelled relationship between the energy produced and two input variables, Ambient Temperature (AT) and Exhaust Vacuum (V). 
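The same idea extends to two inputs by evaluating the model on a grid. This sketch assumes a predict callable that accepts a 2-D array with columns in the order AT, V, AP, RH, with AP and RH held at baseline values; the function name is hypothetical.

```python
import numpy as np

FEATURES = ["AT", "V", "AP", "RH"]  # assumed column order of the model inputs


def response_surface(predict, baseline: dict, at_values, v_values) -> np.ndarray:
    """Predicted output on an AT x V grid; result has shape (len(v), len(at))."""
    at_grid, v_grid = np.meshgrid(at_values, v_values)
    base_row = [baseline[name] for name in FEATURES]
    rows = np.tile(base_row, (at_grid.size, 1)).astype(float)
    rows[:, FEATURES.index("AT")] = at_grid.ravel()
    rows[:, FEATURES.index("V")] = v_grid.ravel()
    return np.asarray(predict(rows), dtype=float).reshape(at_grid.shape)
```

The returned array has one row per V value and one column per AT value, which is the layout expected by common plotting calls such as matplotlib's contourf(at_values, v_values, surface).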

A more sophisticated offline use for the model is as a real-time indicator. The model output can be calculated and displayed in the operator console for monitoring. Including it in the standard statistical process control (SPC) practices on the plant floor can improve efficiency.
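One way such an indicator could fit into SPC practice is a control chart on the residual, the difference between measured and predicted output. The sketch below derives limits from a reference period of normal operation and flags points outside them. The three-sigma rule and the function names are assumptions; the article does not specify which SPC method is used.

```python
import numpy as np


def control_limits(reference_residuals, n_sigma: float = 3.0):
    """Lower and upper control limits from residuals during normal operation."""
    residuals = np.asarray(reference_residuals, dtype=float)
    center = residuals.mean()
    spread = residuals.std(ddof=1)          # sample standard deviation
    return center - n_sigma * spread, center + n_sigma * spread


def out_of_control(measured, predicted, limits) -> np.ndarray:
    """Boolean flags for points whose residual falls outside the limits."""
    residuals = np.asarray(measured, dtype=float) - np.asarray(predicted, dtype=float)
    lower, upper = limits
    return (residuals < lower) | (residuals > upper)
```

A flagged point means the plant is producing noticeably more or less than the model expects for the current conditions, which is a prompt for investigation rather than a diagnosis.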

Online Model Usage for Improved Control 

A process model that has proven its worth in an offline setting can be used in the online control of the process. Making this leap requires consideration of the following process control topics: 

  • Process safety – is there monitoring, redundancy and automatic fallback in case the model produces predictions that are inconsistent? 
  • Process changes/drift – is there a process in place to review model accuracy when process changes are made or if drift in the process impacts the model?  
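Both considerations above can be illustrated with small checks placed around the model. The first function rejects a prediction that is not finite or falls outside an allowed range and returns a fallback value instead; the second compares the recent error with the error measured at validation time to flag possible drift. The function names, the range bounds and the 1.5x tolerance are assumptions for illustration, and real limits would come from the plant's operating envelope and safety review.

```python
import numpy as np


def guarded_prediction(prediction: float, lower: float, upper: float,
                       fallback: float):
    """Return (value, used_model); falls back if the prediction is implausible."""
    if not np.isfinite(prediction) or not (lower <= prediction <= upper):
        return fallback, False
    return float(prediction), True


def drift_detected(measured, predicted, baseline_rmse: float,
                   tolerance: float = 1.5):
    """Return (flag, recent_rmse); flag is True if recent RMSE exceeds the tolerance."""
    errors = np.asarray(measured, dtype=float) - np.asarray(predicted, dtype=float)
    recent_rmse = float(np.sqrt(np.mean(errors ** 2)))
    return recent_rmse > tolerance * baseline_rmse, recent_rmse
```

In practice the drift check would run on a rolling window of recent data, and a raised flag would trigger a model review or retraining rather than an automatic change to the control strategy.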
