Matt Upson bio photo

Matt Upson

Yo no soy marinero

Twitter LinkedIn Github Stackoverflow

In the past I have worked with a number of timeseries of sensor data that I collected using raspberry pis, arduinos, and esp8266 modules. It’s not something I do regularly enough to remember the best way to do it, so I’m writing this post as a reminder to myself, and perhaps someone will benefit from my aide-memoire.

In previous posts I have combined data from two sensors I built, both based on raspberry pis (e.g. Measuring obsession). The first sensor sampled internal and external temperature, internal humidity, and internal light levels at a frequency of once every three minutes. Another sensor I built recorded my electricity usage every minute by essentially counting the pulses on my electricity meter.

The data are all in the machinegurning github repo, so I’ll access it here. In the cleaned state that they are available in, the data consist of some 750,000 observations.

library(tidyverse)
library(RCurl)
library(govstyle)
library(scales)
library(lubridate)

# To make things run a little faster, and to exacerbate the problem of
# non-matching timescales, I'll take a smaller sample of these data comprising
# just 100,000 data points

sensor_data <- read_rds('data/2015-12-24-sensorpi_join.Rds') %>%
sample_n(100000) %>%
mutate(
key1 = ifelse(grepl('temp', key), 'temp', as.character(key)),
key1 = factor(key1)
)

# Produce some very simple summary stats on these data

sensor_data %>%
str
## Classes 'tbl_df', 'tbl' and 'data.frame':	100000 obs. of  6 variables:
## $ yday : num 32 115 204 57 295 98 258 252 261 77 ...
## $ timestamp: POSIXct, format: "2015-02-01 18:06:00" "2015-04-25 23:15:00" ...
## $ week : num 5 17 NA NA 43 15 NA NA 38 NA ...
## $ key : Factor w/ 5 levels "elec","ext_temp1",..: 2 2 1 1 5 3 1 1 2 1 ...
## $ value : num 2.75 8.75 0.004 0.001 17.8 ...
## $ key1 : Factor w/ 4 levels "elec","int_humidity",..: 4 4 1 1 4 2 1 1 4 1 ...

So what are the simple ways that we can visualise the data, first off?

p <- sensor_data %>%
ggplot +
aes(
x = timestamp,
y = value,
colour = key
) +
geom_line() +
facet_wrap(
~key1,
scale = 'free_y',
ncol = 1
) +
theme_gov(
base_colour = 'black'
) +
scale_colour_manual(
values = unname(gov_cols[c('turquoise','light_blue','purple','pink','green')])
) +
scale_y_continuous(labels = scales::comma) +
theme(
legend.position = 'bottom',
legend.key = element_blank()
)


p

plot of chunk 2017-01-21_whole_timeseries

Great, so ggplot is smart enough to detect that we need time on the x-axis, and it gives us an appropriate scale - good job Hadley!

We can also set the breaks we want…

p %+%
(sensor_data %>% dplyr::filter(
timestamp < '2015-07-01',
timestamp > '2015-06-14'
)) +
scale_x_datetime(date_breaks = '3 days')

plot of chunk 2017-01-21_subset

And these can be times, not just dates - smart.

p %+%
(sensor_data %>% dplyr::filter(
timestamp > '2015-07-01',
timestamp < '2015-07-02'
)) +
scale_x_datetime(
date_breaks = '2 hours',
date_labels = '%H:%M'
)

plot of chunk 2017-01-21_single_day

Date aggregation

OK, so far so good, all very simple.

The fun begins when we start to aggregate this data. In this case I use tidyr::spread to move this data from long format to wide format.

sensor_data_wide <- sensor_data %>%
select(-key1) %>%
spread(
key, value
)

Because we started by randomly sampling 100,000 values from a dataset of 750,000, and this dataset was in long format, we are likely to have a lot of NA values across the various values of the timeseries:

sensor_data_wide %>%
slice(1:10) %>%
knitr::kable()
yday timestamp week elec ext_temp1 int_humidity int_light int_temp
2 2015-01-02 15:18:00 1 NA NA NA 2069.0 NA
2 2015-01-02 15:30:00 1 NA 7.187 NA NA 16.8
2 2015-01-02 15:33:00 1 NA NA NA 3719.0 NA
2 2015-01-02 15:39:00 1 NA NA NA 5384.0 NA
2 2015-01-02 15:45:00 1 NA 7.000 NA NA NA
2 2015-01-02 16:00:00 1 NA NA NA 1230.5 NA
2 2015-01-02 16:12:00 1 NA 6.687 NA NA NA
2 2015-01-02 16:15:00 1 NA NA 48.75 NA NA
2 2015-01-02 16:18:00 1 NA NA 49.05 NA NA
2 2015-01-02 16:27:00 1 NA NA NA NA 17.3

Just looking at these rows, we can see that there are often multiple observations per minute.

Two problems I often have are:

  • how to aggregate to the nearest unit of time, and
  • how to aggregate across a unit of time

This is the difference between aggregating to every five minutes of every day, and aggregating to every five minutes across all days. The former is easy, and can be achieved with lubridate::ceiling_date and lubridate::floor_date. Ceiling rounds up, whilst floor rounds down, and we can choose any time period of interest:

test_dates <- sensor_data_wide$timestamp[1:10] 

test_dates
##  [1] "2015-01-02 15:18:00 UTC" "2015-01-02 15:30:00 UTC"
## [3] "2015-01-02 15:33:00 UTC" "2015-01-02 15:39:00 UTC"
## [5] "2015-01-02 15:45:00 UTC" "2015-01-02 16:00:00 UTC"
## [7] "2015-01-02 16:12:00 UTC" "2015-01-02 16:15:00 UTC"
## [9] "2015-01-02 16:18:00 UTC" "2015-01-02 16:27:00 UTC"
ceiling_date(test_dates, unit = '5 minutes') %>% unique
## [1] "2015-01-02 15:20:00 UTC" "2015-01-02 15:30:00 UTC"
## [3] "2015-01-02 15:35:00 UTC" "2015-01-02 15:40:00 UTC"
## [5] "2015-01-02 15:45:00 UTC" "2015-01-02 16:00:00 UTC"
## [7] "2015-01-02 16:15:00 UTC" "2015-01-02 16:20:00 UTC"
## [9] "2015-01-02 16:30:00 UTC"
floor_date(test_dates, unit = '5 minutes') %>% unique
## [1] "2015-01-02 15:15:00 UTC" "2015-01-02 15:30:00 UTC"
## [3] "2015-01-02 15:35:00 UTC" "2015-01-02 15:45:00 UTC"
## [5] "2015-01-02 16:00:00 UTC" "2015-01-02 16:10:00 UTC"
## [7] "2015-01-02 16:15:00 UTC" "2015-01-02 16:25:00 UTC"
ceiling_date(test_dates, unit = '30 minutes') %>% unique
## [1] "2015-01-02 15:30:00 UTC" "2015-01-02 16:00:00 UTC"
## [3] "2015-01-02 16:30:00 UTC"
floor_date(test_dates, unit = '30 minutes') %>% unique
## [1] "2015-01-02 15:00:00 UTC" "2015-01-02 15:30:00 UTC"
## [3] "2015-01-02 16:00:00 UTC"
ceiling_date(test_dates, unit = '1 hour') %>% unique
## [1] "2015-01-02 16:00:00 UTC" "2015-01-02 17:00:00 UTC"
floor_date(test_dates, unit = '1 hour') %>% unique
## [1] "2015-01-02 15:00:00 UTC" "2015-01-02 16:00:00 UTC"
ceiling_date(test_dates, unit = '3 hours') %>% unique
## [1] "2015-01-02 18:00:00 UTC"
floor_date(test_dates, unit = '3 hours') %>% unique
## [1] "2015-01-02 15:00:00 UTC"

…you get the idea.

But if I wanted to plot the average temperature at five minute intervals for each month, I will not be able to do this:

p <- sensor_data %>% 
dplyr::filter(
key == 'int_temp',
timestamp < '2015-12-01'
) %>%
mutate(
month = month(timestamp),
timestamp = ceiling_date(timestamp, '5 minutes')
) %>%
group_by(month, timestamp) %>%
summarise(
value = median(value)
) %>%
ggplot +
aes(
x = timestamp,
y = value
) +
geom_line(
colour = gov_cols[['purple']]
) +
facet_wrap(
~month,
ncol = 2
) +
scale_x_datetime(
date_labels = '%H:%M'
) +
geom_smooth(
col = 'red',
size = 0.5) +
theme_gov()

p

plot of chunk 2017-01-21_daily_average_1

This doesn’t give us what we want because there is still date information wrapped up within the timestamp, so we only get a timeseries of each value from each month. To get what we want is a little more tricky, and there may well be a better way that I have not yet discovered, but this is what I have been doing so far.

First we need to extract the time from the timestamp without date information.

format(test_dates,"%H:%M:%S")
##  [1] "15:18:00" "15:30:00" "15:33:00" "15:39:00" "15:45:00" "16:00:00"
## [7] "16:12:00" "16:15:00" "16:18:00" "16:27:00"

The downside here is that while format will return time as a character vector, so we will not be able to rely on ggplot2 to cleverly adjust axes.

To fix this, we can turn these times back into timestamps, but this time with all the same date.

get_time <- function(x) {

time_ <- strftime(x, format = "%H:%M:%S")
datetime_ <- as.POSIXct(time_, format = "%H:%M:%S")
return(datetime_)

}

get_time(test_dates)
##  [1] "2017-01-28 15:18:00 GMT" "2017-01-28 15:30:00 GMT"
## [3] "2017-01-28 15:33:00 GMT" "2017-01-28 15:39:00 GMT"
## [5] "2017-01-28 15:45:00 GMT" "2017-01-28 16:00:00 GMT"
## [7] "2017-01-28 16:12:00 GMT" "2017-01-28 16:15:00 GMT"
## [9] "2017-01-28 16:18:00 GMT" "2017-01-28 16:27:00 GMT"

Now we can get the plot we are after:

p <- sensor_data %>% 
dplyr::filter(
key == 'int_temp',
timestamp < '2015-12-01'
) %>%
mutate(
month = month(timestamp),
timestamp = get_time(timestamp),
timestamp = ceiling_date(timestamp, '5 minutes')
) %>%
group_by(month, timestamp) %>%
summarise(
value = median(value)
) %>%
ggplot +
aes(
x = timestamp,
y = value
) +
geom_line(
colour = gov_cols[['purple']]
) +
facet_wrap(
~month,
ncol = 2
) +
scale_x_datetime(
date_labels = '%H:%M'
) +
geom_smooth(
col = 'red',
size = 0.5) +
theme_gov()

p

plot of chunk 2017-01-21_daily_average_2

If anyone knows a better way of doing this, I would love to know, but this works for now.

devtools::session_info()
##  setting  value                       
## version R version 3.3.2 (2016-10-31)
## system x86_64, linux-gnu
## ui RStudio (1.0.136)
## language en_GB:en
## collate en_GB.UTF-8
## tz GB
## date 2017-01-28
##
## package * version date
## assertthat 0.1 2013-12-06
## backports 1.0.4 2016-10-24
## bitops * 1.0-6 2013-08-17
## colorspace 1.3-1 2016-11-18
## DBI 0.5-1 2016-09-10
## devtools 1.12.0 2016-06-24
## digest 0.6.11 2017-01-03
## dplyr * 0.5.0 2016-06-24
## evaluate 0.10 2016-10-11
## ggplot2 * 2.2.1 2016-12-30
## govstyle * 0.1.2 2017-01-22
## gtable 0.2.0 2016-02-26
## highr 0.6 2016-05-09
## htmltools 0.3.5 2016-03-21
## knitr 1.15.1 2016-11-22
## labeling 0.3 2014-08-23
## lattice 0.20-34 2016-09-06
## lazyeval 0.2.0 2016-06-12
## lubridate * 1.6.0 2016-09-13
## magrittr 1.5 2014-11-22
## Matrix 1.2-7.1 2016-09-01
## memoise 1.0.0 2016-01-29
## mgcv 1.8-16 2016-11-07
## munsell 0.4.3 2016-02-13
## nlme 3.1-129 2017-01-19
## plyr 1.8.4 2016-06-08
## purrr * 0.2.2 2016-06-18
## R6 2.2.0 2016-10-05
## Rcpp 0.12.8 2016-11-17
## RCurl * 1.95-4.8 2016-03-01
## readr * 1.0.0 2016-08-03
## rmarkdown 1.2 2016-11-21
## rmd2md 0.1.1 2017-01-28
## rprojroot 1.1 2016-10-29
## rstudioapi 0.6 2016-06-27
## scales * 0.4.1 2016-11-09
## stringi 1.1.2 2016-10-01
## stringr 1.1.0 2016-08-19
## tibble * 1.2 2016-08-26
## tidyr * 0.6.0 2016-08-12
## tidyverse * 1.0.0 2016-09-09
## withr 1.0.2 2016-06-20
## source
## CRAN (R 3.2.3)
## CRAN (R 3.3.2)
## CRAN (R 3.2.3)
## CRAN (R 3.3.2)
## CRAN (R 3.2.3)
## CRAN (R 3.3.2)
## cran (@0.6.11)
## CRAN (R 3.2.3)
## CRAN (R 3.3.2)
## cran (@2.2.1)
## Github (ukgovdatascience/govstyle@8cd6098)
## CRAN (R 3.2.3)
## CRAN (R 3.2.3)
## CRAN (R 3.2.3)
## CRAN (R 3.3.2)
## CRAN (R 3.2.3)
## CRAN (R 3.3.1)
## CRAN (R 3.3.2)
## CRAN (R 3.2.3)
## CRAN (R 3.2.3)
## CRAN (R 3.3.1)
## CRAN (R 3.2.3)
## CRAN (R 3.3.2)
## CRAN (R 3.2.3)
## CRAN (R 3.3.2)
## CRAN (R 3.2.3)
## CRAN (R 3.2.3)
## CRAN (R 3.2.3)
## CRAN (R 3.3.2)
## CRAN (R 3.3.2)
## CRAN (R 3.2.3)
## CRAN (R 3.3.2)
## Github (ivyleavedtoadflax/rmd2md@3434815)
## CRAN (R 3.3.2)
## CRAN (R 3.2.3)
## CRAN (R 3.3.2)
## CRAN (R 3.2.3)
## CRAN (R 3.2.3)
## CRAN (R 3.2.3)
## CRAN (R 3.2.3)
## CRAN (R 3.2.3)
## CRAN (R 3.2.3)