Matt Gregory bio photo

Matt Gregory

Data Scientist

Twitter Github

Markov chains

Markov chains, named after Andrey Markov, are mathematical systems that hop from one “state” (a situation or set of values) to another. See here for an excellent introduction.

The second time I used a Markov chain method resulted in a publication (the first was when I simulated Brownian motion with a coin for GCSE coursework). At the time I was completing it as part of the excellent and highly recommended Sysmic course so that it could contribute to a chapter in my thesis and as a journal article.

As I was new to programming, I naively wrote the function myself, to simulate the transition between states of injected insect embryos to model the development of transgenic insects, wherby each injected egg had a change of either living or dying and then, if alive, a chance of incorporating the injected DNA into its genome (simplified explanation).

When using R it is always wiser to find a package that has already been developed and documented, as it is likely to have experienced more testing than anything you would produce as an individual (it’s also quicker). We investigate the markovchain package.

Discrete Time Markov Chain

We define a discrete time markov chain that matches our statistics derived from my review published this year, we consider the diamond back moth summary statistics, Plutella xylostella (survival and transformation efficiency).

library(markovchain)
##  Get the diamond back moth summary statistics of interest
library(dplyr)
mydata <- read.csv("https://raw.githubusercontent.com/mammykins/piggyBac-data/master/master2015jan.csv",
header = TRUE) %>% select(g0.lambda, surv)
 
s <-  median(mydata$surv, na.rm = TRUE) %>% round(digits = 5)
trans_efficiency <- median(mydata$g0.lambda, na.rm = TRUE) %>% round(digits = 5)

First we create our three different states of our Discrete Time Markov Chain (DTMC). Embryo, injection survivor and transgenic adult (egg, g0, dead). We keep things simple to learn how the process works, if injection survivors are non-transgenic then we discard them (their state changes to dead).

Transition matrix

We use a “transition matrix” to tally the transition probabilities. Every state in the state space is included once as a row and again as a column, and each cell in the matrix tells you the probability of transitioning from its row’s state to its column’s state. The rows of the transition matrix must total to 1. There also has to be the same number of rows as columns.

Based on our summary statistics in my paper we describe our transition probability between states for Plutella xylostella as:

Compare this matrix to the matrix dtmcA assigned below to elucidate the meaning of the proportions in the matrix.

tmA <- matrix(c(0, s, 1 - s, 0,     #  egg to...
                0, 0, 1 - trans_efficiency, trans_efficiency,   #  g0 to...
                0, 0, 1, 0,         #  dead to...
                0, 0, 0, 1),
              nrow = 4, byrow = TRUE) #define the transition matrix
dtmcA <- new("markovchain", transitionMatrix = tmA,
states = c("egg", "g0", "dead", "transgenic"),
name = "diamondback") #create the DTMC
dtmcA
## diamondback 
##  A  4 - dimensional discrete Markov Chain with following states: 
##  egg, g0, dead, transgenic 
##  The transition matrix   (by rows)  is defined as follows: 
##            egg      g0    dead transgenic
## egg          0 0.12942 0.87058    0.00000
## g0           0 0.00000 0.98134    0.01866
## dead         0 0.00000 1.00000    0.00000
## transgenic   0 0.00000 0.00000    1.00000

Eggs operate as the input to the model, we put egg in, we then do injections, 0.87058 of the egg die, 0.12942 survive to produce injection survivors (g0), of these g0, only 1% give rise to transgenics, the non-transgenics are considered dead, or waste insects.

We can use the igraph package to plot the Markov Chain object.

plot(dtmcA, main = "Transition probability matrix for DBM")

plot of chunk 2016-06-02-trans1 Or there are diagram ways to plot.

library(diagram)
plot(dtmcA, main = "Transition probability matrix for DBM", package = "diagram",
        lwd = 1, box.lwd = 2, cex.txt = 0.8,
        box.size = 0.1, box.type = "square",
        box.prop = 0.5)

plot of chunk 2016-06-02-trans2

Probabilistic analysis

  • It is possible to access transition probabilities and to perform basic operations.
  • Similarly, it is possible to access the conditional distribution of states.
dtmcA[2, 4] #using [ method
## [1] 0.01866
transitionProbability(dtmcA,
"g0","transgenic") #using specific S4 method
## [1] 0.01866
conditionalDistribution(dtmcA,"g0")
##        egg         g0       dead transgenic 
##    0.00000    0.00000    0.98134    0.01866

We can use a variety of methods to subset S4 objects. Further examples can be found in Hadley Wickham’s Advanced R.

Excitingly it is possible to simulate states’ distributions after n steps, where n is the number of eggs to be injected, our input to the Markov Chain. If we run one step or one discrete time period then our egg are injected and it is determined using matrix multiplication (tmA by initialState)

set.seed(1337)
n <- 1000  #  number of embryos to be injected
steps <- 1
 
########
initialState <- c(n, 0, 0, 0)  #  we start off with only eggs injected, zero for all others
finalState <- initialState*dtmcA ^ steps #using power operator
finalState
##      egg     g0   dead transgenic
## [1,]   0 129.42 870.58          0

For the full process of insect transgenesis (injection and setting up wild-type crosses to assess if DNA has been integrated) we need 2 or more steps. What would happen if we specified 10 steps?

set.seed(1337)
n <- 1000  #  number of embryos to be injected
steps <- 2
 
########
initialState <- c(n, 0, 0, 0)  #  we start off with only eggs injected, zero for all others
finalState <- initialState*dtmcA ^ steps #using power operator
finalState
##      egg g0    dead transgenic
## [1,]   0  0 997.585   2.414977

If we specified 2 or greater steps the finalState would be identical, as transgenic feeds into itself with 100% chance. This is emphasised using the steadyStates() function and revealed by the summary().

steadyStates(dtmcA) #S4 method
##      egg g0 dead transgenic
## [1,]   0  0    0          1
## [2,]   0  0    1          0
summary(dtmcA)
## diamondback  Markov chain that is composed by: 
## Closed classes: 
## dead 
## transgenic 
## Recurrent classes: 
## {dead},{transgenic}
## Transient classes: 
## {egg},{g0}
## The Markov chain is not irreducible 
## The absorbing states are: dead transgenic

Estimating a transition matrix from data

The package permits to fit a DTMC estimating the transition matrix from a sequence of data. createSequenceMatrix() returns a function showing previous vs actual states from the pairs in a given sequence.
The markovchainFit() function allows us to obtain the estimated transition matrix and the confidence levels (using elliptic Maximum Likelihood Estimate hyphothesis).

#using Alofi rainfall dataset
data(rain)
str(rain)
## 'data.frame':	1096 obs. of  2 variables:
##  $ V1  : int  3 2 2 2 2 2 2 3 3 3 ...
##  $ rain: chr  "6+" "1-5" "1-5" "1-5" ...
mysequence <- rain$rain
createSequenceMatrix(mysequence)
##       0 1-5  6+
## 0   362 126  60
## 1-5 136  90  68
## 6+   50  79 124
myFit <- markovchainFit(data = mysequence, confidencelevel = 0.9)
myFit
## $estimate
## MLE Fit 
##  A  3 - dimensional discrete Markov Chain with following states: 
##  0, 1-5, 6+ 
##  The transition matrix   (by rows)  is defined as follows: 
##             0       1-5        6+
## 0   0.6605839 0.2299270 0.1094891
## 1-5 0.4625850 0.3061224 0.2312925
## 6+  0.1976285 0.3122530 0.4901186
## 
## 
## $standardError
##              0        1-5         6+
## 0   0.03471952 0.02048353 0.01413498
## 1-5 0.03966634 0.03226814 0.02804834
## 6+  0.02794888 0.03513120 0.04401395
## 
## $confidenceInterval
## $confidenceInterval$confidenceLevel
## [1] 0.9
## 
## $confidenceInterval$lowerEndpointMatrix
##             0       1-5         6+
## 0   0.6160891 0.2036763 0.09137435
## 1-5 0.4117506 0.2647692 0.19534713
## 6+  0.1618105 0.2672305 0.43371243
## 
## $confidenceInterval$upperEndpointMatrix
##             0       1-5        6+
## 0   0.7050788 0.2561777 0.1276038
## 1-5 0.5134195 0.3474757 0.2672379
## 6+  0.2334464 0.3572754 0.5465247
## 
## 
## $logLikelihood
## [1] -1040.419

So we can call our estimated transition probability matrix using myFit$estimate. We could then use it to make predicitons for hypothetical set-ups.

#  Assign our fitted estimate for the transition matrix to tmB
dtmcB <- myFit$estimate
dtmcB
## MLE Fit 
##  A  3 - dimensional discrete Markov Chain with following states: 
##  0, 1-5, 6+ 
##  The transition matrix   (by rows)  is defined as follows: 
##             0       1-5        6+
## 0   0.6605839 0.2299270 0.1094891
## 1-5 0.4625850 0.3061224 0.2312925
## 6+  0.1976285 0.3122530 0.4901186
set.seed(1337)
n <- 999  
steps <- 1
 
########
initialState <- c(n/3, n/3, n/3)
finalState <- initialState*dtmcB ^ steps #using power operator
finalState
##             0      1-5       6+
## [1,] 439.8255 282.4847 276.6897

If we took the time to understand what the rain data was about then this might provide additional insight. The key point is we could develop a discrete time Markov Method based on estimates of a transition matrix from the data.

Waste water pipe deterioration

Here we propose a hypothetical situation where we are responsible for the condition of our sewer waste water pipes in a region. We use a survey to describe the current condition with a four-point scale, inspired by Baik et al., (2006). The ranking goes from Brand New to Bad, A to D.

tmC <- matrix(c(0.9, 0.1, 0, 0,    #  a
                0, 0.8, 0.2, 0,   #  b
                0, 0, 0.6, 0.4,         #  c
                0, 0, 0, 1),        #  d
              nrow = 4, byrow = TRUE) #define the transition matrix
dtmcC <- new("markovchain", transitionMatrix = tmC,
states = c("a", "b", "c", "d"),
name = "element") #create the DTMC
dtmcC
## element 
##  A  4 - dimensional discrete Markov Chain with following states: 
##  a, b, c, d 
##  The transition matrix   (by rows)  is defined as follows: 
##     a   b   c   d
## a 0.9 0.1 0.0 0.0
## b 0.0 0.8 0.2 0.0
## c 0.0 0.0 0.6 0.4
## d 0.0 0.0 0.0 1.0

Correct estimation of transition probabilities in a Markov chain based deterioration model is a key ingredient for successful and cost-effective proactive management of wastewater systems. Imagine our company’s estate (all of its pipes) were allowed to deteriorate for one year with an element transition matrix of:

provided by expert domain knowledge from a consultancy or preferably from the data.

set.seed(1337)
show(dtmcC)
## element 
##  A  4 - dimensional discrete Markov Chain with following states: 
##  a, b, c, d 
##  The transition matrix   (by rows)  is defined as follows: 
##     a   b   c   d
## a 0.9 0.1 0.0 0.0
## b 0.0 0.8 0.2 0.0
## c 0.0 0.0 0.6 0.4
## d 0.0 0.0 0.0 1.0
#n <-   #  
steps <- 1
 
########
initialState <- c(12, 35, 18, 5)   
finalState <- initialState*dtmcC ^ steps #using power operator
finalState
##         a    b    c    d
## [1,] 10.8 29.2 17.8 12.2

We could then use the transition matrix to model what would happen to our pipes through the next 50 years without any investment. We can set up a data frame to contain labels for each timestep, and a count of how many pipes are in each state at each timestep. Then, we fill that data frame with the results after each timestep i, calculated by initialState * dtmcC ^ i:

#INPUT
set.seed(1337)
initialState <- c(12, 35, 18, 5)
timesteps <- 50
 
#SIMULATION
pipe_df <- data.frame( "timestep" = numeric(),
 "a" = numeric(), "b" = numeric(),
 "c" = numeric(), "d" = numeric(),
 stringsAsFactors = FALSE)
 for (i in 0:timesteps) {
newrow <- as.list(c(i, round(as.numeric(initialState * dtmcC ^ i), 0)))
 pipe_df[nrow(pipe_df) + 1, ] <- newrow
 }
 
#OUTPUT
head(pipe_df, 5)
##   timestep  a  b  c  d
## 1        0 12 35 18  5
## 2        1 11 29 18 12
## 3        2 10 24 17 19
## 4        3  9 21 15 26
## 5        4  8 17 13 32
tail(pipe_df, 5)
##    timestep a b c  d
## 47       46 0 0 0 70
## 48       47 0 0 0 70
## 49       48 0 0 0 70
## 50       49 0 0 0 70
## 51       50 0 0 0 70

A plot may be preferred to visualise the condition of our pipes without investment.

library(RColorBrewer)
colours <- brewer.pal(4, "Set1") 
plot(pipe_df$timestep, pipe_df$b, ylim = c(0, 70), col = colours[1], type = "l",
     xlab = "Horizon in years", ylab = "Frequency of pipes in state")
lines(pipe_df$timestep,pipe_df$a, col = colours[2])
lines(pipe_df$timestep,pipe_df$c, col = colours[3])
lines(pipe_df$timestep,pipe_df$d, col = colours[4])
legend("right", legend = c("a", "b", "c", "d"), fill = colours)

plot of chunk 2016-06-02-pipes-norepair

We can calculate the timestep when all pipes break, or a steady state is reached at the absorbing state d by

absorbingStates(dtmcC)
## [1] "d"
transientStates(dtmcC)
## [1] "a" "b" "c"
steadyStates(dtmcC)
##      a b c d
## [1,] 0 0 0 1

As all but state d are transitional, we are interested in when all are equal to zero and “d” is equal to the sum of the frequency of initial states, 70.

head(filter(pipe_df, a == 0, b == 0, c  == 0), n = 1)
##   timestep a b c  d
## 1       31 0 0 0 69
#  note the rounding error
head(filter(pipe_df, a == 0, b == 0, c  == 0, d == sum(initialState)), n = 1)
##   timestep a b c  d
## 1       40 0 0 0 70

Extension

We can use this code to change the various transition probabilities to see what the effects are on the outputs (sensitivity analysis) through visual inspection and consider what transition probabilities would strike the right balance (repair and maintenance allows state reversion, from states c and d back to a). Also, there are methods we could use to perform uncertainty analysis, e.g. putting confidence intervals around the transition probabilities. We won’t do either of these here. This type of decision modelling scenario is perfect for Shiny app implementation!

References

  • Gregory, M., Alphey, L., Morrison, N. I., & Shimeld, S. M. (2016). Insect transformation with piggyBac : getting the number of injections just right. Insect Molecular Biology, http://doi.org/10.1111/imb.12220
  • Baik, H., Seok, H., Jeong, D., & Abraham, D. M. (2006). Deterioration Models for Management of Wastewater Systems. Journal of Water Resources Planning and Management, 132(February), 15-24. http://doi.org/10.1061/(ASCE)0733-9496(2006)132:1(15)
  • http://www.r-bloggers.com/a-discrete-time-markov-chain-dtmc-sir-model-in-r/
sessionInfo()
## R version 3.3.0 (2016-05-03)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.4 LTS
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] RColorBrewer_1.1-2 diagram_1.6.3      shape_1.4.2       
## [4] dplyr_0.4.3        markovchain_0.4.4  testthat_0.8.1    
## [7] knitr_1.13        
## 
## loaded via a namespace (and not attached):
##  [1] igraph_1.0.1        Rcpp_0.12.5         magrittr_1.5       
##  [4] lattice_0.20-33     R6_2.1.2            matlab_1.0.2       
##  [7] stringr_1.0.0       tools_3.3.0         parallel_3.3.0     
## [10] grid_3.3.0          DBI_0.4-1           lazyeval_0.1.10    
## [13] RcppParallel_4.3.19 digest_0.6.4        assertthat_0.1     
## [16] Matrix_1.2-6        formatR_1.4         evaluate_0.9       
## [19] stringi_1.1.1       stats4_3.3.0        expm_0.999-0