4 Steps to Machine Learning with Pentaho

The power of Pentaho Data Integration (PDI) for data access, blending and governance has been demonstrated and documented numerous times. However, perhaps less well known is how PDI as a platform, with all its data munging[1] power, is ideally suited to orchestrate and automate up to three stages of the CRISP-DM[2] life-cycle for the data science practitioner: generic data preparation/feature engineering, predictive modeling, and model deployment.

The CRISP-DM process diagram

By "generic data preparation" we are referring to the process of connecting to (potentially) multiple heterogeneous data sources and then joining, blending, cleaning, filtering, deriving and denormalizing data so that it ready for consumption by machine learning (ML) algorithms. Further ML-specific data transformations, such as supervised discretization, one-hot encoding etc. can then be applied as needed in an ML tool. For the data scientist, PDI can be used to remove the repetitive drudgery involved with manually performing similar data preparation processes repetitively, from one dataset to the next. Furthermore, Pentaho's Streamlined Data Refinery can be used to deliver modeling-ready datasets to the data scientist at the click of a button, removing the need to burden the IT department with requests for such data.

When it comes to deploying a predictive solution, PDI accelerates the process of operationalizing machine learning by working seamlessly with popular libraries and languages, such as R, Python, WEKA and Spark MLlib. This allows output from team members developing in different environments to be integrated within same framework, without dictating the use of a single predictive tool.

In this blog, we present a common predictive use case, and step through the typical workflow involved in developing a predictive application using Pentaho Data Integration and Pentaho Data Mining.

Imagine that a direct retailer wants to reduce losses due to orders involving fraudulent use of credit cards. They accept orders via phone and their web site, and ship goods directly to the customer. Basic customer details, such as customer name, date of birth, billing address and preferred shipping address, are stored in a relational database. Orders, as they come in, are stored in a MongoDB database. There is also a report of historical instances of fraud contained in a CSV spreadsheet.

Generic data preparation/feature engineering

With the goal of preparing a dataset for ML, we can use PDI to combine these disparate data sources and engineer some features for learning from it. The following figure shows a transformation demonstrating an example of just that, and includes some steps for deriving new fields. To begin with customer data is joined from several relational database tables, and then blended with transactional data from MongoDB and historical fraud occurrences contained in a CSV file. Following this, there are steps for deriving additional fields that might be useful for predictive modeling. These include computing the customer's age, extracting the hour of the day the order was placed, and setting a flag to indicate whether the shipping and billing addresses have the same zip code.

Diagram of blending data and engineering features

This process culminates with output of flattened (a Data Scientist’s preferred data shape) data in both CSV and ARFF (Attribute Relational File Format) data, the latter being the native file format used by PDM (Pentaho Data Mining, AKA WEKA). We end up with 100,000 examples (rows) containing the following fields:


From this list, for the purposes of predictive modeling, we can drop the customer name, ID fields, email addresses, phone numbers and physical addresses. These fields are unlikely to be useful for learning purposes and, in fact, can be detrimental due to the large number of distinct values they contain.

Train, Tune, Test Machine Learning Models to Identify the Most Accurate Model

So, what does the data scientist do at this point? Typically, they will want to get a feel for the data by examining simple summary statistics and visualizations, followed by applying quick techniques for assessing the relationship between individual attributes (fields) and the target of interest which, in this example, is the "reported_as_fraud_historic" field. Following that, if there are attributes that look promising, quick tests with common supervised classification algorithms will be next on the list. This comprises the initial stages of experimental data mining - i.e. the process of determining which predictive techniques are going to give the best result for a given problem.

 The following figure shows an ML process, for initial exploration, designed in WEKA's Knowledge Flow environment. It demonstrates three main exploratory activities:

  1. Assessment of variable importance. In this example, the top five variables most correlated with "reported_as_fraud_historic" are found, and can be visualized as stacked bar charts/histograms.
  2. Knowledge discovery via decision tree learning to find key variable interactions.
  3. Initial predictive evaluation. Four ML classifiers—two from WEKA, and one each from Python Scikit-learn and R respectively—are evaluated via 10-fold cross validation.

diagram of exploratory data mining

Visualization of the top five variables (ordered from left-to-right, top-to-bottom) correlated with fraud show some clear patterns. In the figure below, blue indicates fraud, and red the non-fraudulent orders. There are more instances of fraud when the billing and shipping zip codes are different. Fraudulent orders also tend to have a higher total dollar value attached to them, involve more individual items and be perpetrated by younger customers.

Top drivers of fraud

The next figure shows visualizing attribute interactions in a WEKA decision tree viewer. The tree has been limited to a depth of five in order to focus on the strongest (most likely to be statistically stable) interactions – i.e., those closest to the root of the tree. As expected, given the correlation analysis, the attribute "billing_shipping_zip_equal" forms the decision at the root of the tree. Inner (decision) nodes are shown in green, and predictions (leaves) are white. The first number in parenthesis at a leaf shows how many training examples reach that leaf; the second how many were misclassified. The numbers in brackets are similar, but apply to the examples that were held out by the algorithm to use when pruning the tree. Variable interactions can be seen by tracing a path from the root of the tree to a leaf. For example, in the top half of the tree, where billing and shipping zip codes are different, we can see that young, first-time customers, who spend a lot on a given order (of which there are 5,530 in the combined training and pruning sets), have a high likelihood of committing credit card fraud.

Variable interactions

The last part of the exploratory process involves an initial evaluation of four different supervised classification algorithms. Given that our visualization shows that decision trees appear to be capturing some strong relationships between the input variables, it is worthwhile including them in the analysis. Furthermore, Because WEKA has no-coding integration with ML algorithms in the R [4] statistical software and the Python Scikit-learn[5] package, we can get a quick comparison of decision tree implementations from all three tools. Also included is the ever-popular logistic regression learner. This will give us a feel for how well a linear method does in comparison to the non-linear decision trees. There are many other learning schemes that could be considered, however, trees and linear functions are popular starting points.

four different supervised classification algorithms

The WEKA Knowledge Flow process captures metrics relating to the predictive performance of the classifiers in a Text Viewer step, and ROC curves - a type of graphical performance evaluation - are captured in the Image Viewer step. The figure below shows WEKA's standard evaluation output for the J48 decision tree learner.

Evaluation output for the J48 decision tree

It is beyond the scope of this article to discuss all the evaluation metrics shown in the figure but, suffice to say, decision trees appear to perform quite well on this problem. J48 only misclassifies 2.7% of the instances. The Scikit-learn decision tree's performance is similar to that of WEKA's J48 (2.63% incorrect), but the R "rpart" decision tree fares worse, with 14.9% incorrectly classified. The logistic regression method performs the worst with 17.3% incorrectly classified. It is worth noting that default settings were used with all four algorithms.

For a problem like this — where a fielded solution would produce a top-n report, listing those orders received recently that have the greatest likelihood of being fraudulent according to the model — we are particularly interested in the ranking performance of the different classifiers. That is, how well each does at ranking actual historic fraud cases above non-fraud ones when the examples are sorted in descending order of predicted likelihood of fraud. This is important because we'll want to manually investigate the cases that the algorithm is most confident about, and not waste time on potential red herrings. Receiver Operating Curves (ROC) graphically depict ranking performance, and the area under such a curve is a statistic that conveniently summarizes the curve[6]. The figure below shows the ROC curves for the four classifiers, with the number of true positives shown on the y axis and false positives shown on the x axis. Each point on the curve, increasing from left to right, shows the number of true and false positives in the n rows taken from the top of our top-n report. In a nutshell, the more a curve bulges towards the upper left-hand corner, the better the ranking performance of the associated classifier is.

Comparing performance with ROC curves

At this stage, the practitioner might be satisfied with the analysis and be ready to build a final production-ready model. Clearly decision trees are performing best, but is there a (statistically) significant difference between the different implementations? Is it possible to improve performance further? There might be more than one dataset (from different stores/sites) that needs to be considered. In such situations, it is a good idea to perform a more principled experiment to answer these questions. WEKA has a dedicated graphical environment, aptly called the Experimenter, for just this purpose. The Experimenter allows multiple algorithm and parameter setting combinations to be applied to multiple datasets, using repeated cross-validation or hold-out set testing. All of WEKA's evaluation metrics are computed and results are presented in tabular fashion, along with tests for statistically significant differences in performance. The figure below shows the WEKA Experimenter configured to run a 10 x 10-fold cross-validation[3] experiment involving seven learning algorithms on the fraud dataset. We've used decision tree and random forest implementations from WEKA and Scikit-learn, and gradient tree boosting from WEKA, Scikit-learn and R. Random forests and boosting are two ensemble learning methods that can improve the performance of decision trees. Parameter settings for implementations of these in WEKA, R and Python have been kept as similar as possible to make a fair comparison.

The next figure shows analyzing the results once the experiment has completed. Average area under the ROC is compared, with the J48 decision tree classifier set as the base comparison on the left. Asterisks and "v" symbols indicate where a scheme performs significantly worse or better than J48 according to a paired correctedt-test. Although Scikit-learn's decision trees are less accurate than J48, when boosted they slightly (but significantly) outperform boosted versions in R and WEKA. However, when analyzing elapsed time, they are significantly slower to train and test than the R and WEKA versions.

Configuring an experiment

Analyzing results

Deploy Predictive Models in Pentaho

Now that the best predictive scheme for the problem has been identified, we can return to PDI to see how the model can be deployed and then periodically re-built on up-to-date historic data. Rebuilding the model from time-to-time will ensure that it remains accurate with respect to underlying patterns in the data. If a trained model is exported from WEKA, then it can be imported directly into a PDI step called Weka Scoring. This step handles passing each incoming row of data to the model for prediction, and then outputting the row with predictions appended. The step can import any WEKA classification or clustering model, including those that invoke a different environment (such as R or Python). The following figure shows a PDI transformation for scoring orders using the Scikit-learn gradient boosting model trained in WEKA. Note that we don't need the historic fraud spreadsheet in this case as that is what we want the model to predict for the new orders!

Deploying a predictive model in PDI

PDI also supports the data scientist who prefers to work directly in R or Python when developing predictive models and engineering features. Scripting steps for R and Python allow existing code to be executed on PDI data that has been converted into data frames. With respect to machine learning, care needs be taken when dealing with separate training and test sets in R and Python, especially with respect to categorical variables. Factor levels in R need to be consistent between datasets (same values and order); the same is true for Scikit-learn and, furthermore, because only numeric inputs are allowed, all categorical variables need to be converted to binary indicators via the one-hot-encoding (or similar). WEKA's wrappers around MLR and Scikit-learn take care of these details automatically, and ensure consistency between training and test sets.

Dynamically Updating Predictive Models

The following figure shows automating the creation of a predictive model using the PDI WEKA Knowledge Flow step. This step takes incoming rows and injects them into a WEKA Knowledg Flow process. The user can select either an existing flow to be executed, or design one on-the-fly in the step's embedded graphical Knowledge Flow editor. Using this step to rebuild a predictive model is simply an exercise in adding this it to the end of our original data prep transformation.

Building a WEKA model in PDI

To build a model directly in Python (rather than via WEKA's wrapper classifiers), we can simply add a CPython Script Executor step to the transformation. PDI materializes incoming batches of rows as a pandas data frame in the Python environment. The following figure shows using this step to execute code that builds and saves a Scikit-learn gradient boosted trees classifier. 

Scripting to build a Python Scikit-learn model in PDI

A similar script, as shown in the figure below,  can be used to leverage the saved model for predicting the likelihood of new orders being fraudulent.

Scripting to make predictions with a Python Scikit-learn model

This predictive use-case walkthrough demonstrates the power and flexibility of Pentaho afforded to the data engineer and data scientist. From data preparation through to model deployment, Pentaho provides machine learning orchestration capabilities that streamline the entire workflow.

[1] Also known as data wrangling, is the process of manually converting or mapping data from one "raw" form into another format that allows for more convenient consumption of the data by semi-automated tools.

[2] The Cross Industry Standard Process for Data Mining.

[3] 10 separate runs of 10-fold cross validation, where the data is randomly shuffled between each run. This results in 100 models being learned and 100 sets of performance statistics for each learning scheme on each dataset. 

[4] https://cran.r-project.org/web/packages/mlr/index.html. 132 classification and regression algorithms in the MLR package.

[5] http://scikit-learn.org/stable/index.html. 55 classification and regression algorithms.

[6] The under the ROC has a nice interpretation as the probability (on average) that a classifier will rank a randomly selected positive case higher than a randomly selected negative case.

Back To Top


Mark Hall

Senior Data Mining Consultant / Engineer