I recently attended the conference Effective Applications of the R language in London. One of the many excellent speakers described how one can use Spark to apply some simple Machine Learning to larger data sets and then extend the range of potential models by simply adding water.

Spark is a general-purpose cluster computing system. In this blog we explore some of its main features and how to get started.

## Installation

Follow the guidance on GitHub.
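As a rough sketch of what that guidance boils down to (assuming the CRAN release of `sparklyr` and its default Spark version, neither of which is pinned in this post):

```r
# Install sparklyr from CRAN and load it.
install.packages("sparklyr", repos = "https://cloud.r-project.org")
library(sparklyr)

# Download and install a local copy of Spark. With no arguments this uses
# sparklyr's default version; pass a version string to pin a specific release.
spark_install()
```

This only needs doing once per machine.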

## Connecting to Spark

Now we form a local Spark connection.
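A minimal sketch of opening (and later closing) a local connection, assuming Spark was installed with `spark_install()`. The object name `sc` is just a convention.

```r
library(sparklyr)

# master = "local" runs Spark on this machine rather than on a remote cluster.
sc <- spark_connect(master = "local")

# Report the Spark version behind the connection.
spark_version(sc)

# Disconnect once the work is done.
spark_disconnect(sc)
```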

## Hadoop

As I’m running on Windows I get an error: I need to get an embedded copy of Hadoop winutils.exe from here.

## Java

I get another error: I need Java as well! With Java installed, success.

## Reading data

Typically one reads data within the Spark cluster using the `spark_read` family of functions. For convenience and reproducibility we use a small local data set, also available online at the UCI Machine Learning Repository. In practice we might instead want to read from a remote SQL data table on a server.
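For completeness, here is a hedged sketch of the `spark_read` route using `spark_read_csv()`. The temporary file and its two made-up rows are placeholders so the snippet is self-contained; they are not the concrete data.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Write a tiny placeholder CSV (made-up values) to a temporary file.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(cement = c(150, 250), strength = c(12, 25)),
  csv_path,
  row.names = FALSE
)

# Read the file into Spark. The result is a reference to a Spark table,
# not an in-memory R data frame.
toy_tbl <- spark_read_csv(sc, name = "toy_concrete", path = csv_path)

# Count the rows on the Spark side.
n_rows <- sdf_nrow(toy_tbl)
n_rows

spark_disconnect(sc)
```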

We are interested in predicting the strength of concrete, a critical component of civil infrastructure, based on the non-linear relationship between its ingredients and age. We read in the data and normalise all the quantitative variables.
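One common choice of normalisation is min-max scaling to the range 0 to 1, sketched below on the assumption that something like it was used here. The `normalise()` helper and the toy data frame (with made-up values) are purely illustrative.

```r
# Rescale a numeric vector to the range [0, 1] (min-max normalisation).
normalise <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Toy stand-in for the concrete data; the values are made up.
concrete <- data.frame(
  cement   = c(150, 250, 350, 500),
  water    = c(140, 160, 180, 200),
  age      = c(3, 7, 28, 90),
  strength = c(12, 25, 40, 55)
)

# Apply the helper to every column and rebuild the data frame.
concrete_norm <- as.data.frame(lapply(concrete, normalise))
summary(concrete_norm)
```

Every column now has a minimum of 0 and a maximum of 1, which stops variables measured on large scales from dominating the fit.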

## Machine Learning

You can orchestrate machine learning algorithms in a Spark cluster via the machine learning functions within ‘sparklyr’. These functions connect to a set of high-level APIs built on top of DataFrames that help you create and tune machine learning workflows. We demonstrate a few of these here.

We start by:

- Partitioning the data into separate training and test data sets,
- Fitting a model to our training data set,
- Evaluating our predictive performance on our test data set.
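The three steps above might look roughly like this in sparklyr. This is a sketch rather than the exact code behind this post: the toy data is randomly generated, the 70/30 split and the seeds are assumptions, and the function names follow recent sparklyr releases (older ones used `sdf_partition()` for the split).

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Synthetic stand-in for the normalised concrete data (generated values,
# not real measurements).
set.seed(1)
toy <- data.frame(cement = runif(100), water = runif(100), age = runif(100))
toy$strength <- 0.6 * toy$cement - 0.3 * toy$water + 0.2 * sqrt(toy$age) +
  rnorm(100, sd = 0.05)

concrete_tbl <- sdf_copy_to(sc, toy, name = "concrete", overwrite = TRUE)

# 1. Partition into training and test sets.
partitions <- sdf_random_split(concrete_tbl, training = 0.7, test = 0.3, seed = 1)

# 2. Fit a linear regression on the training partition.
fit <- ml_linear_regression(partitions$training, formula = strength ~ .)
summary(fit)

# 3. Score the held-out test partition and bring the result back into R.
#    Predictions arrive in a column named `prediction`.
pred <- dplyr::collect(ml_predict(fit, partitions$test))
rmse <- sqrt(mean((pred$strength - pred$prediction)^2))
rmse

spark_disconnect(sc)
```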

For linear regression models produced by Spark, we can use `summary()` to learn a bit more about the quality of our fit and the statistical significance of each of our predictors.

The summary suggests our model is a poor fit. We need to account for the non-linear relationships in the data, something which the linear model fails at! Let’s test our model against data we haven’t seen to get an indication of its error.

Not bad, but then again not so good. More importantly our diagnostic plots reveal heteroscedasticity and other problems which suggest a linear model is inappropriate for this data.
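For anyone unfamiliar with what heteroscedasticity looks like in a diagnostic plot, here is a small base R sketch. The data is synthetic and deliberately built so the noise grows with `x`; it is not the concrete data.

```r
# Synthetic data whose noise grows with x, so the residual spread widens.
set.seed(1)
x <- runif(200)
y <- 2 * x + rnorm(200, sd = 0.1 + 0.5 * x)

fit <- lm(y ~ x)

# Residuals against fitted values: a funnel shape that widens from left to
# right suggests the error variance is not constant.
plot(fitted(fit), resid(fit), xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)
```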

Concrete is a critical building ingredient, so we have a duty of care to do better. We opt for an ML method that can handle non-linear relationships: a neural network.

## Neural Network

We follow the same workflow using a Multilayer Perceptron. We fit the model.

Let’s compare our predictions with the actual values. Predict doesn’t recognise the `fit_nn` object, and gives us predictions of zero. As this is relatively new I failed to find any supporting documentation to fix this. Instead I used the `nnet` package to fit a neural network and then `compute` the predicted strength, sadly not using Spark.
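A minimal sketch of that fallback outside Spark, using `nnet` on synthetic non-linear data (generated here, not the concrete measurements) alongside a linear baseline. It predicts with `predict()`, which is the method `nnet` provides; as far as I know `compute()` is the equivalent helper in the separate `neuralnet` package. The hidden layer size, iteration limit and the `rmse()` helper are arbitrary choices for illustration.

```r
library(nnet)

# Synthetic non-linear data: a noisy sine curve.
set.seed(42)
toy <- data.frame(x = runif(200))
toy$y <- sin(2 * pi * toy$x) + rnorm(200, sd = 0.1)

train <- toy[1:150, ]
test  <- toy[151:200, ]

# Linear baseline.
fit_lm <- lm(y ~ x, data = train)

# Single hidden layer with 5 units. linout = TRUE gives a linear output
# unit, which is what a continuous (regression) target needs.
fit_net <- nnet(y ~ x, data = train, size = 5, linout = TRUE,
                maxit = 500, trace = FALSE)

# Root mean squared error between actual and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

pred_lm  <- predict(fit_lm, newdata = test)
pred_net <- as.numeric(predict(fit_net, newdata = test))

# Test-set error of each model, side by side.
c(linear = rmse(test$y, pred_lm), network = rmse(test$y, pred_net))
```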

Let’s quantify the error of the model and compare to the linear model earlier.

The error has been reduced! It seems a non-linear approach was superior for this type of problem. Let me know in the comments how I can predict using the `ml_multilayer_perceptron()` function in Spark.

## Principal Component Analysis

There’s lots of standard ML stuff you can apply to your data.

Use Spark’s Principal Components Analysis (PCA) to perform dimensionality reduction. PCA is a statistical method to find a rotation such that the first coordinate has the largest variance possible, and each succeeding coordinate in turn has the largest variance possible while staying orthogonal to the ones before it. Not particularly useful here but might be useful for those Kaggle competitions.
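To make the "largest variance first" idea concrete, here is a small sketch using base R's `prcomp()` on the built-in `mtcars` data rather than Spark (sparklyr's counterpart is `ml_pca()`). The choice of columns is arbitrary.

```r
# scale. = TRUE standardises each column first, so no variable dominates
# purely because of its units.
pca <- prcomp(mtcars[, c("mpg", "disp", "hp", "wt")], scale. = TRUE)

# Standard deviation of each component, largest first.
pca$sdev

# Proportion of the total variance explained by each component.
pca$sdev^2 / sum(pca$sdev^2)

# The data expressed in the rotated coordinates.
head(pca$x)
```

Dropping the trailing components is what gives the dimensionality reduction: you keep the coordinates that carry most of the variance.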

## Conclusion

This blog described how to get Spark on your machine and use it to conduct some basic ML. It should be useful when dealing with large data sets or interacting with remote data tables on SQL servers. The sustained improvements in all things R continue to inspire and amaze.