### Overview

I’ve been doing a lot of programming in Python recently, and have taken my eye off the #RStats ball of late. With a bit of time to play over the Easter weekend, I’ve been reading Hadley’s new R for Data Science book.

One thing I particularly like so far is the purrr package which he describes in the lists chapter.
I’ve always thought that the `sapply`, `lapply`, `vapply` (etc.) commands are rather complicated.
The **purrr** package threatens to simplify this using the same left-to-right chaining framework that we have become used to in **ggplot2**, and more recently **dplyr**.

Something I find myself doing more and more is subsetting a dataframe by a factor, and applying the same or a similar model to each subset of the data.
There are some new ways to do this in **purrr**.

## do()

In this post I’ll briefly explore some of the functions of **purrr**, and use them together with **dplyr** and **broom** (as much for my own memory as anything else).

In the past I have used `dplyr::do()` to apply a model to each group of a data frame.
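A minimal sketch of that pattern, assuming the built-in `mtcars` data grouped by `cyl` and a simple `mpg ~ wt` regression (the formula is an assumption for illustration). Note that `do()` has since been superseded in **dplyr**, although it still runs:

```r
library(dplyr)

# Fit one linear model per cylinder group.
# With a named argument, do() stores each fit in a list column called `model`.
models <- mtcars %>%
  group_by(cyl) %>%
  do(model = lm(mpg ~ wt, data = .))

models
```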

This results in three models, one each for 4, 6, and 8 cylinders.

We can now use a second call to `do()`, `dplyr::summarise()` or `dplyr::mutate()` to extract elements from these models: for example, extract the coefficients…
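As a sketch of the second `do()` call, again assuming `mtcars` and an `mpg ~ wt` model (illustrative choices), the coefficients can be pulled into one row per term. The column names `term` and `estimate` are just labels picked for this example:

```r
library(dplyr)

models <- mtcars %>%
  group_by(cyl) %>%
  do(model = lm(mpg ~ wt, data = .))

# A second do() runs once per row; `.$model` is that row's fitted lm object
coefs <- models %>%
  do(data.frame(
    cyl      = .$cyl,
    term     = names(coef(.$model)),
    estimate = unname(coef(.$model))
  ))

coefs
```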

We can also use `mutate()` to extract one or more elements.
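A small sketch of the `mutate()` route under the same assumptions (`mtcars`, `mpg ~ wt`). Because the result of a named `do()` is a row-wise data frame, `model` refers to a single fitted object on each row:

```r
library(dplyr)

models <- mtcars %>%
  group_by(cyl) %>%
  do(model = lm(mpg ~ wt, data = .))

# Each expression is evaluated once per row, so summary() sees one lm at a time
fits <- models %>%
  mutate(
    rsq   = summary(model)$r.squared,
    n_obs = nobs(model)
  )

fits
```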

## The broom package

If we want to get a tidier output, we can use the `broom` package, which provides three levels of aggregation.

`glance()` gives a single line for each model, similar to the `do()` and `summarise()` calls above.
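A hedged sketch of `glance()` on the per-cylinder models (assuming `mtcars` and `mpg ~ wt`). Older **broom** releases also accepted `glance(models, model)` directly on the row-wise data frame; as far as I know that shortcut has since been removed, so this version calls `glance()` inside `do()` instead:

```r
library(dplyr)
library(broom)

models <- mtcars %>%
  group_by(cyl) %>%
  do(model = lm(mpg ~ wt, data = .))

# One row of model-level statistics (r.squared, sigma, AIC, ...) per cylinder group
model_summaries <- models %>%
  do(data.frame(cyl = .$cyl, glance(.$model)))

model_summaries
```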

`tidy()` gives details of the model coefficients.
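The same pattern works as a sketch for `tidy()`, which should return one row per coefficient and group (assumptions as before: `mtcars`, `mpg ~ wt`):

```r
library(dplyr)
library(broom)

models <- mtcars %>%
  group_by(cyl) %>%
  do(model = lm(mpg ~ wt, data = .))

# One row per model term, with estimate, standard error, statistic and p-value
coef_table <- models %>%
  do(data.frame(cyl = .$cyl, tidy(.$model)))

coef_table
```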

`augment()` returns a row for each data point in the original data, with relevant model outputs.
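And a sketch for `augment()`, which attaches fitted values and residuals to the observations used in each fit (still assuming `mtcars` and `mpg ~ wt`):

```r
library(dplyr)
library(broom)

models <- mtcars %>%
  group_by(cyl) %>%
  do(model = lm(mpg ~ wt, data = .))

# One row per original observation, with columns such as .fitted and .resid
fitted_rows <- models %>%
  do(data.frame(cyl = .$cyl, augment(.$model)))

head(fitted_rows)
```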

One nice use case of `augment()` is for plotting fitted models against the data.

In this simple example, we could achieve the same just with `geom_smooth(aes(group=cyl), method="lm")`; however, this would not be so easy with a more complicated model.
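A rough sketch of such a plot, assuming the `mpg ~ wt` models on `mtcars` used for illustration throughout. The aesthetics are arbitrary choices; the point is only that `.fitted` from `augment()` can be drawn over the raw points:

```r
library(dplyr)
library(broom)
library(ggplot2)

models <- mtcars %>%
  group_by(cyl) %>%
  do(model = lm(mpg ~ wt, data = .))

fitted_rows <- models %>%
  do(data.frame(cyl = .$cyl, augment(.$model)))

# Raw data as points, per-group fitted values as lines
p <- ggplot(fitted_rows, aes(wt, mpg, colour = factor(cyl))) +
  geom_point() +
  geom_line(aes(y = .fitted))

p
```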

## purrr

So what is new about **purrr**?
Well, first off we can do similar things to `do()` using `map()`.
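A minimal sketch of the `map()` equivalent, assuming the same illustrative `mpg ~ wt` model on `mtcars`. Here `split()` produces a named list of data frames and `map()` fits one model to each:

```r
library(purrr)

# split() gives a list with elements "4", "6" and "8";
# the formula is shorthand for an anonymous function with argument .x
models <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x))

names(models)
```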

And we can keep adding `map()` functions to get the output we want.
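For instance, a sketch that chains further `map()` calls to go from models to summaries to R-squared values (same assumed model as before). The result here is still a list:

```r
library(purrr)

rsq_list <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%  # formula input
  map(summary) %>%                    # function input
  map(~ .x$r.squared)                 # formula input again

rsq_list
```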

Note the three types of input to `map()`: a function, a formula (converted to an anonymous function), or a string (used to extract named components).^{1}

So to use a string this time, returning a double vector…
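A sketch of the string form, assuming the same `mpg ~ wt` models: `map_dbl("r.squared")` extracts the named component from each summary and returns a named double vector rather than a list.

```r
library(purrr)

rsq <- mtcars %>%
  split(.$cyl) %>%
  map(~ lm(mpg ~ wt, data = .x)) %>%
  map(summary) %>%
  map_dbl("r.squared")  # string input: pull out the element named "r.squared"

rsq
```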

### Creating training and test splits

A more complicated example, and a purrrfect use case, is creating splits in a dataset on which a model can be trained and then validated.

Here I shamelessly copy Hadley’s example^{1}. Note that you will need the latest dev version of **dplyr** to run this correctly due to this issue (fixed in the next **dplyr** release > 0.4.3).

First define a cost function on which to evaluate the models (in this case the mean squared difference, but this could be anything).
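A minimal sketch of such a cost function, written to match the description above (mean squared difference). Wrapping the result in `sqrt()` would give the root mean squared difference instead:

```r
# Mean squared difference between predictions x and observed values y
msd <- function(x, y) mean((x - y)^2)

msd(c(1, 2, 3), c(1, 2, 5))  # (0 + 0 + 4) / 3
```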

And a function to generate $n$ random group labels with given probabilities.
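One way this could look, as a sketch adapted from the approach in the R for Data Science draft. The function name `random_group` and the group labels are assumptions for illustration. It fixes the group sizes in proportion to `probs` and then shuffles which positions get which label:

```r
random_group <- function(n, probs) {
  probs <- probs / sum(probs)
  # Assign evenly spaced points on [0, 1] to intervals sized by probs ...
  g <- findInterval(seq(0, 1, length.out = n), c(0, cumsum(probs)),
                    rightmost.closed = TRUE)
  # ... then shuffle, so group sizes are fixed but membership is random
  names(probs)[sample(g)]
}

labels <- random_group(10, c(training = 0.8, test = 0.2))
table(labels)
```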

And wrap this up in a function to replicate it…

Note that this makes use of the new `purrr::transpose()` function, which applies something like a matrix transpose to a list and, when coerced, returns a `data_frame` containing $n$ random splits of the data.
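A sketch of such a wrapper, with assumed names `partition` and `random_group`. It uses `tibble::as_tibble()` for the coercion step, since `as_data_frame()` is deprecated in current releases, and `transpose()` itself is marked as superseded in recent **purrr** versions but still works:

```r
library(purrr)

random_group <- function(n, probs) {
  probs <- probs / sum(probs)
  g <- findInterval(seq(0, 1, length.out = n), c(0, cumsum(probs)),
                    rightmost.closed = TRUE)
  names(probs)[sample(g)]
}

partition <- function(df, n, probs) {
  # n independent random splits, each a list with one data frame per group
  replicate(n, split(df, random_group(nrow(df), probs)), simplify = FALSE) %>%
    transpose() %>%        # list of splits -> one list per group
    tibble::as_tibble()    # one list column per group, one row per split
}

boot <- partition(mtcars, 5, c(training = 0.8, test = 0.2))
boot
```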

Finally use `map()` to:

- Fit simple linear models to the data as before.
- Make predictions based on those models on the test dataset.
- Evaluate model performance using the cost function (`msd`).
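The three steps above can be sketched as follows. The helper names, the 80/20 split, the number of splits, and the seed are all assumptions for illustration, and the models are fitted to each training set:

```r
library(dplyr)
library(purrr)

set.seed(1)  # arbitrary seed so the random splits are reproducible

msd <- function(x, y) mean((x - y)^2)

random_group <- function(n, probs) {
  probs <- probs / sum(probs)
  g <- findInterval(seq(0, 1, length.out = n), c(0, cumsum(probs)),
                    rightmost.closed = TRUE)
  names(probs)[sample(g)]
}

partition <- function(df, n, probs) {
  replicate(n, split(df, random_group(nrow(df), probs)), simplify = FALSE) %>%
    transpose() %>%
    tibble::as_tibble()
}

boot <- partition(mtcars, 20, c(training = 0.8, test = 0.2)) %>%
  mutate(
    models = map(training, ~ lm(mpg ~ wt, data = .x)),  # fit on training data
    preds  = map2(models, test, predict),               # predict on test data
    diffs  = map2(preds, map(test, "mpg"), msd)         # score each split
  )

boot
```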

This still results in a data frame, but with three new list columns. We need to subset out the columns of interest:
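A sketch of that last step. To keep the example short, two hand-made splits of `mtcars` stand in for the random partitions (purely illustrative), and `test_msd` is an assumed column name:

```r
library(dplyr)
library(purrr)

msd <- function(x, y) mean((x - y)^2)

# Two fixed splits standing in for the random ones
boot <- tibble(
  training = list(mtcars[1:24, ], mtcars[9:32, ]),
  test     = list(mtcars[25:32, ], mtcars[1:8, ])
) %>%
  mutate(
    models = map(training, ~ lm(mpg ~ wt, data = .x)),
    preds  = map2(models, test, predict),
    diffs  = map2(preds, map(test, "mpg"), msd)
  )

# Flatten the length-one list column to a double column and keep only that
results <- boot %>%
  mutate(test_msd = unlist(diffs)) %>%
  select(test_msd)

results
```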

## Rounding up

I’ve been playing with some things in this post that I am just getting to grips with, but which look to be really powerful additions to the hadleyverse, and the R landscape in general.
Keeping an eye on the development of **purrr** would be a good move I think.

## References

- https://github.com/hadley/purrr
- http://r4ds.had.co.nz/lists.html
- http://r4ds.had.co.nz/model-assessment.html
- https://cran.r-project.org/web/packages/broom/vignettes/kmeans.html
- https://cran.r-project.org/web/packages/broom/vignettes/broom.html