
# Phat Maps

There be dragons here! I’ve never drawn a map in R, so let’s change that by using R-bloggers as a guide. This post introduces some basic web scraping to get relevant data that can then be visualised on a map of the United Kingdom. Obesity in the UK is a significant health concern, and we map an aspect of it here.

First we start by loading the relevant packages. See the session info at the end for full details. Perhaps you can guess the packages, given that we are Hadley fans on this blog?

## The Data (Get and Clean)

We read the data in and discover the exact selector we need by using the excellent Selector Gadget (see the tutorial here). This lets us extract the precise piece of the HTML document we are interested in: the table of data on the Wikipedia page. We then tidy up the variable names, as the square brackets in them can cause issues.

Remember to read the chaining operator %>% as “then”; it helps make the code more readable.
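As a rough sketch of the read-and-tidy step, the example below parses a small made-up HTML snippet rather than the live Wikipedia page, so it runs offline. The table contents, the `table.wikitable` selector and the column names are all assumptions for illustration; in practice `read_html()` would be given the page URL and the selector would come from Selector Gadget.

```r
library(rvest)  # also re-exports the %>% pipe

# Made-up HTML standing in for the Wikipedia page (assumption, not real data).
html <- '<html><body>
<table class="wikitable">
  <tr><th>Local authority[1]</th><th>BMI 25 or over[2]</th></tr>
  <tr><td>Exampleton</td><td>75.9%</td></tr>
  <tr><td>Sampleford</td><td>68.2%</td></tr>
</table>
</body></html>'

# Read the pipe as: parse the page, then pick the table node, then convert it.
obesity <- read_html(html) %>%
  html_node("table.wikitable") %>%  # assumed selector
  html_table()

# Strip footnote markers such as "[1]" from the names, then replace spaces.
names(obesity) <- names(obesity) %>%
  gsub("\\[[^]]*\\]", "", .) %>%
  trimws() %>%
  gsub(" +", "_", .)

print(names(obesity))
```

The bracket pattern removes anything between a `[` and the next `]`, which is enough for simple footnote markers.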

These names are still pretty horrendous, so let’s make them easier to work with.
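One simple way to do the renaming is to assign a new character vector to `names()`. The long starting names and the short name `LA` below are hypothetical; only `BMI_obese` is a name the post itself uses.

```r
# Hypothetical tidied-but-clumsy names standing in for the scraped ones.
obesity <- data.frame(
  Local_authority = c("Exampleton", "Sampleford"),
  Percentage_of_population_with_BMI_of_25_or_over = c("75.9%", "68.2%"),
  stringsAsFactors = FALSE
)

# Assign short names by position.
names(obesity) <- c("LA", "BMI_obese")
str(obesity)
```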

We need to sort out BMI_obese: remove the percentage sign, then convert the column to numeric class.
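A minimal base R sketch of that conversion, using made-up values in place of the scraped column:

```r
# Made-up rows; the real values come from the scraped table.
obesity <- data.frame(
  LA = c("Exampleton", "Sampleford"),
  BMI_obese = c("75.9%", "68.2%"),
  stringsAsFactors = FALSE
)

# Drop the literal "%" first, then convert the remaining text to numbers.
obesity$BMI_obese <- as.numeric(sub("%", "", obesity$BMI_obese, fixed = TRUE))

class(obesity$BMI_obese)
```

Calling `as.numeric()` on the strings with the `%` still attached would return `NA` with a warning, which is why the sign has to go first.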

Much better! Now we need some map polygons to plot this data on.

## Maps

We need a few tools:

• Some way to handle the polygons which are used to draw lines of a map (the sp package)
• Data describing the position of these polygons
• A hierarchy of these polygon shapes which describes interesting geographical boundaries; in our case we are interested in the County level.

According to the GADM website, we are interested in Level 2 UK map data. Looking at names(gadm), it seems likely that NAME_2 contains our level 2 data, the county-level identifiers.
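Loading and inspecting the map data might look something like the sketch below. The file name `GBR_adm2.rds` is an assumption about how the level 2 download was saved locally, so the code checks that the file exists before reading it.

```r
library(sp)  # provides the classes and methods (names, plot) for the map object

# Assumed local file name for the level 2 GADM download.
gadm_path <- "GBR_adm2.rds"

if (file.exists(gadm_path)) {
  gadm <- readRDS(gadm_path)
  print(names(gadm))        # column names of the attribute table
  print(head(gadm$NAME_2))  # first few level 2 names
  plot(gadm)                # draw the outlines with no colouring
} else {
  message("Download the level 2 .rds file from GADM first: ", gadm_path)
}
```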

So we can draw a map, but we want to identify the counties in the UK that contain the Local Authorities with the highest percentage of their population with a BMI greater than or equal to 25.
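Picking out the Local Authorities with the highest percentages can be sketched in base R by ordering on `BMI_obese`. The rows and the cut-off `n_top` below are made up for illustration; the post does not say how many LAs were kept.

```r
# Made-up rows standing in for the cleaned table.
obesity <- data.frame(
  LA = c("A-town", "B-ville", "C-ford", "D-ham"),
  BMI_obese = c(75.9, 68.2, 72.4, 61.0),
  stringsAsFactors = FALSE
)

n_top <- 2  # hypothetical cut-off

# Sort from highest to lowest percentage, then keep the first n_top rows.
top_las <- head(obesity[order(obesity$BMI_obese, decreasing = TRUE), ], n_top)
top_las$LA
```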

Now we need to find the OBJECTID, or row, of the Counties that harbour these LAs with high obesity rates. I refer back to an earlier post and remind myself how to use lookup-table-like strategies. The gadm object looks to be of S4 class, so we can extract slots (fields) using the special @ operator.
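To illustrate the lookup idea without needing the real file, the sketch below defines a tiny stand-in S4 class with a `data` slot, mimicking how the map object stores its attribute table. The class `ToyMap`, the county names and the vector `obese_counties` are all hypothetical.

```r
# Stand-in S4 class with a `data` slot (hypothetical, for illustration only).
setClass("ToyMap", slots = c(data = "data.frame"))

gadm <- new("ToyMap", data = data.frame(
  OBJECTID = 1:4,
  NAME_2 = c("Northshire", "Eastshire", "Southshire", "Westshire"),
  stringsAsFactors = FALSE
))

# Hypothetical counties containing the high-obesity Local Authorities.
obese_counties <- c("Southshire", "Northshire")

# match() acts as a lookup: for each county name, return its row in the map data.
rows <- match(obese_counties, gadm@data$NAME_2)
rows                      # 3 1
gadm@data$OBJECTID[rows]  # the corresponding OBJECTID values
```

A name that is missing from `NAME_2` comes back as `NA`, so it is worth checking the result before using it as an index.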

Now that we know which rows correspond to Counties with high obesity levels in at least one LA, we can incorporate this information into our colour vector.
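Building the colour vector amounts to starting with one default colour per County and overwriting the flagged rows. The count, the row numbers and the default colour below are assumptions.

```r
n_counties <- 4  # in practice, the number of rows in the map data
rows <- c(3, 1)  # hypothetical rows flagged by the lookup

# One colour per County: a neutral default, with the flagged rows in red.
county_cols <- rep("grey90", n_counties)
county_cols[rows] <- "red"
county_cols
```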

## The Plot

Now when we plot the UK map, the appropriate Counties will be coloured red. We’ve made use of the County data, but how do we make use of the more obscure Local Authority data? That’s something I intend to look into.
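The County-level plotting step can be sketched on a toy map of three unit squares built with sp, standing in for the GADM polygons. The helper `make_square()` and the county names are hypothetical; with the real data, the same `plot()` call would take the gadm object and its colour vector.

```r
library(sp)

# Hypothetical helper: one unit square as a single-polygon "county".
make_square <- function(x0, id) {
  coords <- cbind(c(x0, x0, x0 + 1, x0 + 1, x0), c(0, 1, 1, 0, 0))
  Polygons(list(Polygon(coords, hole = FALSE)), ID = id)
}

shapes <- SpatialPolygons(list(
  make_square(0, "1"), make_square(1, "2"), make_square(2, "3")
))

# Attach an attribute table; row names must match the polygon IDs.
toy_map <- SpatialPolygonsDataFrame(
  shapes,
  data.frame(
    NAME_2 = c("Northshire", "Eastshire", "Southshire"),
    row.names = c("1", "2", "3"),
    stringsAsFactors = FALSE
  )
)

# Colour vector: grey by default, red for the flagged county.
county_cols <- rep("grey90", nrow(toy_map@data))
county_cols[match("Eastshire", toy_map@data$NAME_2)] <- "red"

plot(toy_map, col = county_cols, border = "white")
```

The colours are applied in polygon order, so the colour vector has to be in the same order as the rows of the attribute table.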

## Conclusion

I demonstrated basic web scraping, getting and cleaning data, and then handling a large .rds file for final map production with highlights at the County level. The plotting itself takes almost no time; getting the data and preparing it is the time-consuming part. That sounds like Data Science to me!