Imagine you are responsible for the training budget of an organisation, let’s say a multi academy trust (MAT). Schools or academies run a range of interventions to benefit pupils; from peer tutoring programmes to curriculum boosters to farm-visits. These often-external-charity-led interventions can have positive, negative or negligible effects at a non-trivial cost to the school. But schools often find it extremely challenging to meaningfully assess their impact. This evaluation deficit is common across organisations, the public-and-private sector, and charities, resulting in ineffective spending. This blog post explores the problem of quantifying impact, updating knowledge with new evidence and deciding when to try something new. We take inspiration from various sources, including the medical literature.
In some ways this could be considered analogous to the classic multi-armed bandit problem which we introduced in a previous Bayesian themed post. It’s like organising a trip with friends to a casino and divvying out the money to your group. They disperse and spend the money how they see fit (without the benefit of a global view which together you could construct). They find a one-armed-bandit slot machine and put in a coin (or many coins), they either win or lose. In our scenario, the school’s students either get a good return on the intervention (a win or success - the students show some improvement in a measurable thing) or they do not (a loss or failure - the money is wasted as the students show some decline or no change in a measurable thing). (Often this is easier said than done!) Based on their experience they might change machine (explore) or if they feel like it’s a “good bandit”, they might stick around and exploit it further. The critical requirement here is that the school knows when the intervention has been successful. Typically this would require some before and after testing. A control group will also be required as often students will improve at a thing through time regardless of any intervention ( also not easy!); it’s a non-stationary problem.
There are plenty of great analytics being conducted by schools and charities trying to tackle these problems (see the Education Endowment Fund). The academic research community are also actively involved providing excellent resources on how to measure difficult to measure characteristics of human behaviour (e.g. SPECTRUM measures). Education is an interesting and important domain, we are using this blog post to explore and enthuse data scientists (sign up as a governor for a local school). We opted for this problem domain to make it more relevant and everyday than the classic “Click Through Rate” problem for web pages, but the same principles apply ( David Robinson used baseball in his comprehensive and more detailed series of blog posts and book on this topic).
Randomised Control Trials (RCT) as a Gold Standard
Often organisations that find themselves in a situation as described above, will be-cry the fact that they cannot possibly implement an RCT to ascertain causality. That might be the case, due to resource and ethical constraints, but that should not restrict our creativity and thoroughness of thought when considering non-randomised experimental design of the intervention, as there are other sources of bias than non-random allocation of subjects (however, whenever you can use double-blind RCT, do it!). Looking for inspiration in the medical education field is a natural place to start, given their familiarity with using RCTs, what can we learn from them when this methodology is difficult to implement?
Barriers to RCT
Let us start by considering education implementation within any large organisation (continuing with our example for school interventions; our comments are generaliseable). Imagine we want to implement various training programmes or interventions achieve a measurable outcome in our students, what problems might we experience in a large multi-site organisation like a MAT.
Paraphrased from Sullivan (2011) and adapted to the multi-school context:
With students moving through educational processes in real time, the difficulties of randomization become immediately clear. Students usually experience charity interventions at different times: those experiencing the intervention later may learn from intervening experiences and no longer be comparable to those experiencing the intervention at an earlier time. Also, students (and / or parents) have the option to refuse randomization and may have to contribute financially a small fee for their child to participate on an intervention (biasing those students “treated” by the intervention). Use of a “placebo” is often contraindicated due to parents’ sensitivities to schools acting as a place for “experimentation”. Charitable interventions that operate on multiple-sites, while providing more subjects (with greater power to detect differences) and more generalisability, present challenges due to differences in students. Also, the cost, coordination and data sharing of interventions across multiple sites can be high and public support for education trials may be weak to non-existent (Walser, 2014).
In addition to feasibility considerations, experts question the applicability of research models, derived from clinical research, for education studies. The highly complex system of education may be a poor fit for the RCT model, which requires clear inclusion / exclusion criteria and interventions administered identically via multiple teachers / facilitators. In education studies, variables can rarely be controlled tightly and blinding of subjects and study personnel may be unethical or impossible or unsupported by the public. Finally, defining the therapeutic intervention in education research is much more difficult than in clinical trials. One cannot “apply curriculum daily” in the same way that one can prescribe a medication.
: Table 1 Typical sources of error and potential work-arounds
|Source of Error||Solutions|
|Different sites of intervention||Explicitly describe sources of variability and potential effects on outcomes|
|Students at different stages in education and physical development||Concurrent comparison / control group|
|Different implementers of intervention||Intervention standard operating procedure; enumerate and describe variations explicitly|
|High dropout rate||Describe characteristics of dropouts in as much detail as possible|
|No concurrent comparison group||Nonconcurrent control group; describe variables that may be different between the groups and potential effects upon outcomes|
|Use of “historical” controls||Compare characteristics of groups as well as detailed description of previous education experiences. This design may preclude firm conclusions about the intervention|
|No control group||Results can suggest hypotheses for further testing with a stronger design. Multiple sources of error preclude firm conclusions|
|Control group in same school; contamination||Nonconcurrent with intervention period|
|Difficult to “blind”||Students are aware they are being studied or know their treatment group; use objective assessment|
Note how some solutions create new problems that puts you in a Catch-22. In this situation you want to design the experiment to be the least bad as possible; compromise is necessary throughout any scientific endeavour. The important point is to record what you did and why, so that those using your findings will be aware of these limitations.
Alternatives to Randomisation
Randomisation is awesome but not always practical. How can we move from opinion-based to qualitative to a more quantitative design of our education research in ascertaining whether the intervention is any good and worth the expense?
After whetting your appetite on this topic we point you towards Cronbach (2011) for further development of these ideas; they suggest moving towards “pragmatic trials”.
Our key message is, if you can’t do a RCT don’t turn off your brain! Thinking about your design carefully will reduce your scientific debt and build your confidence in what is really happening.
This also highlights a short-coming in training and education more generally across many industries, particularly assessing impact of educational / training interventions in organisations and schools. These areas are probably ripe for start-ups to tackle some complex esoteric problems well. We explore this theme further with some code, simulated data and statistics.
Made-up charity “Kale-and-Yoga” (these are both hip, so appropriate to lampoon; this post does not use any real data) offers interventions to schools at a fee. Students are taken away from doing other things to attend this intervention. Their intervention sounds plausible as it targets teachers’ putative predisposition to Kale and Yoga, how can we protect our schools from this kind of bias? If tested before and after using an appropriate test that captures what it is Kale-and-Yoga’s intervention is trying to improve, we can build up an evidence base to estimate the intervention efficiency (the proportion of students who attend the intervention that improve). Let’s pretend that the “Kale-and-Yoga” charity says that their intervention makes students more resilient, thus we might be inclined to assess students before and after using Duckworth et al.’s grit-test.
Assuming history can be used to predict the future (ignoring the problem of induction) the expected value of the intervention efficiency of a random charity can be described. For the moment we ignore any effects of differences between the experimental conditions from which the intervention efficiencies were derived and assume that intervention efficiency of Kale-and-Yoga is described by a mean efficiency and a random error term.
The first intervention at a school might look like this:
We can see the impact of chance on our random samples. This is why using statistics is important, otherwise we might consider the intervention to have been effective as it had shown an improvement in students’ performance. The scores after the intervention are higher. This does not necessarily mean the intervention was effective.
Intervention efficiency is a derived variable (
success). It is assumed to be emergent from a series of independent and identically distributed random variables (X1, X2, X3… Xn; range: success (Student improved) =1, failure (No improvement) = 0), with constant probability of success p. Thus for the experiment we can consider who improved and who didn’t or use our
after scores. For this post we explore the success or failure, binary nature of data as this tends to be more common for these types of interventions (are our students really i.i.d.?). Note how this contrasts to Poisson count data, where we know how many times an event occurred, but not how many times it did not occur.
We can test the null hypothesis that the underlying probability of success is p (here we use the default of p set to 0.5). However, as with estimating any standard error, the sample must be random to ensure independence of observations. If one is dealing with a convenience sample, then estimation of the standard error is meaningless. (Unfortunately this would rule out a great deal of research! We will instead continue on the basis that we must restrict our inference to the group of students ‘at hand’). This is likely the case here, but for convenience we pretend that we sampled our school students at random (or limit our inference), the reality is it is often a convenience sample where a class of students is chosen and then assigned the intervention, or students are selected to participate. (Note: we also lost information by transforming our response variable, which could have been the difference in before and after score, rather than a less informative binary variable; we did that for the purposes of this tutorial, assuming binary data is the more typical response variable for intervention outcomes).
Here we reject the null hypothesis (as the p-value < 0.05; note, distinct from p mentioned earlier, which was the probability of success in a series of Bernoulli trials), giving us the belief that the intervention has improved students’ performance in the grit test. However, we know this to be wrong and a false positive, as we have generated the data ourselves from a known distribution. This demonstrates the importance of continued testing of an intervention because our conclusions from the first experiment could be a false positive or negative. Furthermore, an intervention might benefit from the novelty effect and students awareness that they are being tested. Gradually through time we should build up an evidence base. This would be greatly sped up if schools across a MAT or a region could share their data and experiences with specific interventions. However, the frequentist framework is not as conducive to this passing on of knowledge (although meta-analyses of the Cochrane reviews and Education Endowment Fund show the gold standard approaches to this problem).
For a comprehensive list of methods, and the functions to carry them out in R, see the proportion package.
p-values versus confidence interval
An approach, closely related to p-values, is to focus on the precision of an observed result or treatment effect - rather than worrying about whether or not to reject a (frequently arbitrary) null hypothesis. This is done by estimating its confidence interval. Although it may not be immediately obvious, confidence intervals and null hypothesis significance tests are actually facets of exactly the same underlying model - which is why confidence intervals are often used as surrogate tests.
The confidence interval is most commonly defined as that range which, when attached to a sample statistic, would enclose the true parametric value on a given proportion (1 − α) of occasions. Since α is usually taken as 0.05 (as we do here), we therefore end up with 95% confidence intervals. To think about it another way, a claim to 95% confidence simply means that the researcher has seen one possible interval from a large number of possible ones, from which nineteen out of twenty intervals (on average) contain the true value of the parameter.
Some statisticians have recommended doing away with p-values altogether, and only quoting the confidence interval for a treatment effect. Others use the confidence interval as a surrogate statistical test, by assessing whether the interval overlaps 0 (if the treatment effect is expressed as a difference) or 1 (if the treatment effect is expressed as a ratio). Typically you will need to adhere to traditional standards of citing the p-value if you seek publication or approval from your peers.
We care about what information can be extracted from the data. We want to know how wide an interval must be, given your specific data, in order to contain a certain parameter with some pre-selected probability.
Our workflow involves building a prior distribution that matches our knowledge, we then update this knowledge with data to create a posterior distribution of our new understanding of the world.
Constructing a beta prior
Suppose we are interested in the proportion, p, of students that improve their grit score after an intervention. The function
bayes.select is a convenient tool for specifying a beta prior based on knowledge of two prior quantiles. Suppose my prior median for the proportion of improving students is
0.5 and my 75th percentile is
.6 based on my historical data of students’ grit scores the previous year.
We need to convert this knowledge into a beta distribution. The beta disitribution has two shape parameters (called alpha and beta). We use the convenient
beta.select to work this out for us. There are alternative methods to estimating the hyper-parameters of your distribution of interest described in David Robinson’s excellent book on Empirical Bayes.
A beta(5.82, 5.82) prior matches this prior information.
Updating with data
Given our earlier 30 observations (there were 21 successes and 9 failures). The posterior distribution is therefore beta with shape parameters 5.82 + 21 and 5.82 + 9. See how this is simply us updating the prior with the data (likelihood) to create the posterior.
triplot function shows the prior, likelihood, and posterior on the same display; the inputs are the vector of prior parameters and the data vector.
This visual representation is very useful and intuitive. The posterior refelects our updated belief for the interventions efficacy given our informative prior (typically you would use an uninformative or flat prior; shown in the code chunk below).
Simulating from Posterior to Perform Inference
One can perform inference about the proportion p by simulating a large number of draws from the posterior and summarizing the simulated sample. Here the
rbeta function is used to simulate from the beta posterior and the quantile function is used to construct a 90 percent probability interval for p.
Similar to the frequentist approach our credible interval does not contain p = 0.5. What you’ll notice is that confidence intervals and credible intervals will tend to converge as n increases and the prior’s importance is relatively reduced (by making up a lower proportion of the total data).
Collecting more data
Impressed by this data (partly due to experimental design constraints), the school might commission another intervention with this charity. However, this shouldn’t be the end of the story, the school and charity should continue to collect data for impact assessment. The charity could adjust its programme (guided by A/B testing) if it consistently does not impact the thing they want it to (or change their claim i.e. “Kale-and-Yoga” reduces the body mass index of students) and the school can ensure it is spending its money on effective evidence-based-interventions only.
With the Bayesian approach it’s easy to update our knowledge. Our new prior will be our previous posterior. The likelihood will be the data collected during our next experiment. From these two we will generate our new posterior. Imagine we repeated the experiment with more students by randomly sampling students across our school network, exposing them to the “Kale-and-Yoga” then assessing whether they got grittier or not.
Thus we can accumulate knowledge and build on it in a more straightforward way compared to the Frequentist approach. In the Bayesian world, the prior “pulls back” on any point estimates, and the updated posterior distribution applies to inference at any time and requires no complex sample space considerations. This allows us to spot that the intervention did not seem to do so well this time. We also benefit from being able to apply some expert knowledge in the form of a prior.
Building a business case by predicting a future sample
To assess impact we need to be able to compare the cost of doing an intervention with the predicted gain in benefits. Often these might be somewhat intangible and it can be hard to assign economic value to a cost or benefit. Economists have lots of strategies to tackle this but even a back-of-the-beer-mat can be informative and should be part of the business case for how an organisation like a school decides on what interventions to pay for.
Suppose we want to predict the number of students who improve their grit-score following the intervention. To do the programme costs £200 per student. We estimate that improving the grit-score of a student (if this can be shown to be a permanent change) will improve exam success and lifetime performance to the value of £300 (working not shown; made-up thought experiment). Thus if we run the intervention on 50 students from our school what’s our expected cost-benefit ratio?
We know the cost, we know the number of students and we know the benefit. We can work out the minimum number of students that need to improve to work out our break even point.
We need more than 34 students to “succeed” to break even.
We have the probabilty of getting each number of successes between 0 and 50. We can add up some of these probabilities to answer our question, what is the probability of getting more than 34 successes?
Thus there is a NA% chance the school will break even in paying for this Charity’s intervention. Seems a bit risky!
Importance of a control group
The above example was somewhat contrived and missed a non-trivial feature of science; comparison to a control group. Any intervention costs money, if the intervention costs more than the benefit then it is more cost effective to do nothing. In even worse scenarios organisations might pay for interventions that are detrimental to a students performance. There is the opportunity cost of doing the thing.
Imagine your students would of gotten better anyway, as the before and after is seperated by some time where anything could happen (that might have increased their score anyway, without the Charity’s intervention). This time our students are randomly sampled from school and assigned their treatment using an intra-school lottery, making our statistical assumptions more in-line with reality. To make the control commensurate and useful, students are given a green liquid to taste and do some basic stretches by a non-yogi (there should be an additional control of do nothing (business as usual) but we leave that out here for simplicity).
Without the control the
treatment = kale-and-yoga looks promising. However, when considering that students on the control programme showed similar improvement at much reduced cost, we question its value.
We all have bias and pre-held beliefs, we might be willing the novel new approach to beat a “boring” and traditional approach. The age of a practice does not predict its efficacy, it’s fallacious to reason so (although their may be some correlation in efficacy if the scientific method is used for selection; good ideas might stick around, although bad ideas can persist also).
We need statistics to help us here. Which is better A or B? Or the intervention, versus the control or the current best practice.
Given the extra information in the control treatment, we can control for the students having improved anyway. We accept the null hypothesis for now, the evidence suggests the intervention does not improve the performance of our students in the grit-test relative to the control.
In the above example, we lumped together our successes and failures and then analysed the total counts of successful students (those that improved grit scores). The question you may need to ask yourself when considering to treat the data as proportion data or binary response is: do I have unique values of one or more explanatory variable for each and every individual case? In our example above, we didn’t as students only varied in their intervention type, all else was assumed to be constant. As the answer is no, then there is nothing to be gained and we should reduce our data by aggregating the counts to the resolution at which each count does have a unique set of explanatory variables.
Comparing a few interventions across schools
Now imagine you need to make this type of comparison of intervention efficacy across a range of charitable interventions (we pick an intervention that has some support on EEF website) as well as our control group (do nothing; go about your lives) and across schools. You risk further false positives and false negatives due to multiple testing. This time we will hide the code for creating the data by setting the Rmarkdown chunk
echo option to
FALSE. Now you face the uncertainty of the real world!
The old fashioned way of modelling these sorts of success and failure data was to use percentage as the response variable. There are four problems with this approach:
- the errors are not homoschedastic (homogeneity of variance)
- the variance is not constant (it’s npq)
- the response is bounded between 0 and 1 (as probability)
- or intuitively by calculating the percentage we lose information of n, the sample size from which the proportion was estimated
The trick is to do a weighted regression using the individual sample sizes as weights and the logit link funciton to ensure linearity.
For comparison of two samples of proportions, like in A/B testing, use
prop.test. Alternatively use a Bayesian approach, like
bayes.prop.test which you might prefer for explaining the results to regular humans.
For frequentist approach we use a
glm (pronounced glim) with logit link function, with the family argument set to binomial. This is described in detail in the R book by Mick Crawley so we summarise it in brief here. A generalised linear model has three important properties:
- the error structure
- the linear predictor
- the link function
These are lkely to be unfamiliar concepts, that’s OK, but the ideas behind them are straighforward, explained clearly in Crawley’s R book (this book formed part of my first R course at post-graduate level).
First we must condense the data down, at the moment we represent it at the level of each student, we can collapse this down into number of successes and failures for each treatment and school combination. These could be neighbouring schools in a MAT that are keen to share knowledge with one-another as to the efficacy of these interventions. Then we create a two-column object for the response, using
cbind to join together the two vectors of successes and failures.
We want variables of success and failure, so that we can bind them together to create our response variable.
We do that within the model specification. We start with an additive model ignoring the interaction term between the treatment used at each school and the proportion of students “succeeding” (improving their grit-score) (this is done to speed up proceedings; typically you should start with the maximal model and work backwards through model simplification).
Hold your horses prior to interpretation! First we check for overdispersion, residual deviance > residual df. If overdispersed set family to quasibinomial.
The linear predictor is in logits and you can back transform. Proceed with model simplificaiton from maximal to minimal. As our intervention type is categorical variable we use F-test to compare the original and simplified models.
school is not informative in our model and is not signficant. There is no compelling evidence that the two different schools respond differently to the
Again we update by removing
treatment as there are no significant terms, this will leave the null model of just the grand mean.
treatment has non significant effect. However, this doesn’t mean there is not a difference between treatments. Thus, it’s important for schools to revisit different interventions to be sure that they got a fair run, perhaps they were just unlucky the first few times. While its important to exploit good interventions, we must also consider how we continue to explore new and old interventions while carefully tracking performance against some agreed metric to ascertain effiacy.
Visit the machine gurning repo to find out what the data generation process looked like for the above (in
./rmd_/2018-05-07-bayes_thinking.Rmd of our Github repo).
In some situations we might not always have a control and might simply be comparing two competing charity interventions. In this instance we can treat the situation as an A/B test if the assumptions have been met. The Frequentist and Bayesian approaches to A/B testing is described in a Hacker-News-worthy earlier blog post. This is also the “pragmatic trial” that was described in our mini-literature review at the start of the blog post.
In this way we could think of each intervention as being a one-armed-bandit in the Casino, with our paying for a student to experience the intervention, our putting a coin in the slot machine and awaiting the payout (or not). The payout could be a positive real number (success or win) or zero (failure or lose). Every time we play the machine we build up information of p, the probability of success, and can compare it to other machines (or Charity interventions; all learning is carried out via the means of trial-and-error and value estimation). Thus we are solving a multi-armed bandit problem with our schools deciding which slot-machines to gamble on.
Thus, how can we design a systematic strategy that adapts to these stochastic rewards?
What strategy can we use to select the “best” bandit (student intervention) out of a group of bandits (student interventions). What is a unique bandit in this example? Is it every Charity Intervention? Or is it every charity-intervention-school combination? For simplicity we assume that each bandit has the same p regardless of the school it’s in. We need to exploit good bandits but at the same time explore bandits that are poorly sampled, especially if bandits payout probability are non-stationary.
We start off by using the
bayesAB package which we used in a previous post focused on A/B testing in R. This helps us solve a two-armed bandit problem. Imagine we have two interventions to compare.
bayesTest is the main driver function of the
bayesAB package. The input takes two vectors of data, corresponding to recipe A and recipe B of an A/B test (the data we have accumulated through this blog post for the two treatments; A - “kale-and-yoga”, B - “desert-island”).
We use all available data combined with our beta distribution priors (assuming data is gernerated by Bernoulli i.i.d) to fit a Bayesian model. For our priors we assume an uninformative, Jeffrey’s (alpha and beta set to 1/2).
When asking the probability kale-and-yoga is better, we’re asking “if I picked a random draw from kale-and-yoga’s distribution and a random draw from desert-island’s, what’s the probability kale-and-yoga is higher”?
We could use the
plot method on this object
m1 which is quite intuitive and provides pretty defaults. Instead we read off the answer to our above question from the
summary, where the probability that A is better than B (A > B) is 0.6%. We are also reasonably confident that by choosing B over A next time for our students, they will benefit and we only have a 26% chance of them being worse off.
It’s important to continue to collect more data: we fit a multi-armed bandit object based on a
bayesTest object which can serve recommendations and adapt to new data. Imagine that given our previous results the
kale-and-yoga had fallen out of fashion, it’s still important for us to continue to collect the data if the intervention is still being used, we also update with the latest students’ results to go through the intervention. We update the object with the
The documentation tells us
banditize is an ‘object-oriented’ implementation of multi-armed bandits in
bayesAB. It is useful in conjunction with a Shiny app or Plumber deployment. The object itself is mutable and can adapt / learn from new data without having to re-assign the variable.
We can see how many times it’s been updated by using one of the five methods it comes with. This would allow us to keep track.
serveRecipe() to show the best intervention based on samples from both updated posteriors.
Returns most up-to-date object that we can again plot or use summary on.
This demonstrates the ease at which you can continuously update your understanding of the world without falling foul of problems associated with the Frequentist approach, such as repeated testing. It also is much easier to explain the findings to non-statisticians. Would you rather say “the probability that A is greater than B is 10%”, or “Assuming the null hypothesis that A and B are equal is true, the probability that we would see a result this extreme in A vs B is equal to 3%”? Furthermore, since we get a probability distribution over the parameters of the distributions of A and B, we can say something such as “There is a 74.2% chance that A’s λ is between 0.4 and 0.55.” directly from the methods themselves (paraphrased from the
Building a model for the priors
Can we use the Bayesian approach which has some advantages above to model all interventions an organisation might encounter. For example, a school MAT might have up to fifty charities offering interventions who they have paid for before. If data were collected on the effect of those interventions on the students, could we build a model to capture this and help the organisation make evidence-based decisions (combined with cost-benefit analysis) for the school budget and its students’ future. A MAT may also have the scale and purchasing power to insist upon those offering intervention-provision; data-collection, assessment before and after, use of a comparison group and use of standard intervention procedure guidance for implementers. This is non-trivial and I’ve probably only scratched the surface.
We probably want a unique model for each organisation or MAT to deploy, that has a unique prior for each intervention type (so that we control for the relationship between the number of students exposed to an intervention and the intervention’s true impact average). A beta-binomial regression could be used to control for this and could also incorporate information (specifically the number of students a charity intervention had “treated”) into a per-intervention prior distribution. This method could be extended: in particular, the model could include other factors that might change our prior expectations of an intervention. These are each particular applications of Bayesian hierarchical modeling, where the priors for each intervention are not fixed, but rather depend on other latent variables. In an empirical Bayesian approach to hierarchical modeling, one could estimate the prior using beta binomial regression, and then apply it to each intervention (Chapter 8, Robinson).
For example based on meta-analysis and literature review provided by other organisations, like the Education Endowment Fund (EEF), we could adjust an intervention’s priors based on such evidence bases. Part of the philosophy of the Bayesian approach is to bring our prior expectations in mathematically. Let’s try doing that with some factors that influence intervention success on student character development through intervention.
For example, an intervention that is based on Metacognition and self-regulation might have a more optimistic prior than an Aspiration based intervention. We could capture this in a beta-binomial regression model (in the Generalised Additive Model family for location, site and scale (GAMLSS)), the structure of which looks very similar to the
glm earlier in this blog post (GAMLSS are univariate distributional regression models, where all the parameters of the assumed distribution for the response can be modelled as additive functions of the explanatory variables). Thus if we had these characteristics of the interventions being offered by charities we might be able to build up a model that better captures our prior distributions.
Broadly, this approach can be summarised by this quandary:
Suppose you were a head-teacher inspecting potential training-interventions, and were choosing between two that have “treated” 100 students each:
- A metacognition-based intervention that has had 30 successes / 100 trials
- An aspiration-based intervention that has had 30 successes / 100 trials
Which would you guess was the better intervention? This seems like a silly question: they both have the same exact success record. But what if I told you that historically, metacognition-based interventions are slightly better aspiration-based interventions? How could you incorporate that evidence?
We simulate some data of interventions that look promising. They are new on the scene, how do we decide what their prior should be?
We don’t run the code here as it was causing our rmarkdown to markdown function to fail for a reason we couldn’t debug quickly. Thus instead we show the working for you to implement yourself and print the model output.
This fits a model that allows the mean successfully improving grit-score average µ to depend on the number of students that had been “treated” by an intervention as well as controlling for the “flavour” (is it metacognition and self-regulation-based or aspiration-based) of the intervention. This would help an organisation or group of data-sharing organisations to determine if there is a statistically significant advantage to prioritising the use of different flavours of interventions.
We can then look at the coefficients using the
tidy() function from broom.
We print the output here for your convenience.
According to our beta-binomial regression, there is indeed a statistically significant advantage to being a “metacognition” flavoured intervention, with these intervention types getting a succesful improvement in grit-score about 16% more often. It has standard interpretation rules for R summary tables of a model. To get the mean probability of success for “metacognition” flavoured interventions we add the mu intercept ecoefficient of 0.23 to 0.16, to get 0.39, the mean.
If we were to use empirical Bayes estimation, it would mean that every combination of flavour and number of trials would have its own prior. The prior can still be overcome with enough evidence. Other things to consider might be time; when the experiments were conducted. Controlling for time by including it in the model could identify novelty effects, for example, after controlling for time since an intervention was deployed, it might turn out its not very effective. This could be due to the novelty wearing off and a more general observation that students change through time and may be habituated to some appraoches.
When to try something new
Thus far we’ve thought about quantifying impact and updating knowledge with new evidence, and how one might model it. Now we consider how one might determine what interventions to try out, as after all, typically you might just recommend picking the one with the highest mean success rate. However, this leaves you closed off to any new innovation. This is the classic explore / exploit dilemma: the tradeoff between pursuing a known reward and sampling unknown, potentially better opportunities (Wilson et al., 2013). This dilemma is famously embodied by the N-armed bandit problem, where a gambler must decide whether to keep playing at a slot machine, or to try playing at a new one.
In our scenario, not all bandits have the same entry fee and same payout chance. There is also some autonomy demanded by schools, where teachers and students may have a preference for certain interventions for various extraneous reasons. With this caveat, there are several standard approaches to this dilemma:
- The Naive Round-Robin algorithm basically chooses every action once to complete a round and repeats the rounds. Obviously it’s far from optimal since it explores too much and exploits little.
- Pick the bandit with the highest upper 95% credible interval. This probably explores too much, and may suffer when you have many bandits and few coins to play.
- The greedy algorithm tries to choose the arm that has maximum average reward, with the drawback that it may lock-on to a sub-optimal action forever.
- The epsilon greedy and optimistic greedy algorithms are variants of the greedy algorithm that try to recover from the drawback of the greedy algorithm. Epsilon-greedy chooses an action uniformly at random with probability epsilon, whereas the optimistic greedy algorithm initialized the estimated reward for each action to a high value, in order to prevent locking to a sub-optimal action.
Deploying the bandit as a JSON API
Remember our bandit from earlier? We can turn the
bayesBandit object called
m2 into an API and serve/update requests through HTTP.
deployBandit turns a Bayesian bandit into a JSON API that accepts curl requests. Two of the five methods of
bayesBandit classes are exposed:
setResults. Assuming the API is deployed on localhost this is an example of how it would be hit:
Thus, we could extend this to create a Shiny app.
Charities and training-providers are increasingly required to demonstrate that they are achieving impact, based on robust evaluation of their work. While it is important to ensure resources are being used effectively, undertaking evaluation often poses challenges for charities, which may well have limited resources or experience with which to do this.
Charities need to know about the impact of their activities so that they can ensure their efforts are having their intended effect. That’s not to say that impact evaluation will necessarily be able to capture all of the valuable outcomes produced as a result of a charity’s work - some outcomes may be extremely difficult or impossible to measure, and it doesn’t mean those outcomes are not important. However, just because it may be difficult to measure the impact of something, it doesn’t mean we shouldn’t try!
Given the complexities described above and the esoteric language used, it strikes me that it would be helpful for larger organisations or programmes to help design and demonstrate best practice or for some tech start-ups to build some apps to facilitate this process. For example, in the school domain, MATs presumably have some purchasing power and could insist upon some standards for Charity intervention impact evaluation frameworks, an app could help ensure the consistency of any intervention delivery and assessment, the outcome of which could be served to the school leaders across the network. This would ensure quality experimental design which would output better data which could then be analysed and interpreted in a better way.
In this blog post we introduced some Frequentist and Bayesian approaches to answering some of the typical questions that an organisation’s leader might face; how good is an intervention, which is better between these two and how can we learn from previous experience in the literature using informative priors and considerations for the explore/exploit dilemma.
- Hardert et al. (1999). Best Evidence Medical Education
- Crawley, M. (2007). The R Book. John Wiley & Sons.
- LearnBayes package pdf in R, http://fportman.com/blog/bayesab-a-new-r-package-for-bayesian-ab-testing/
- proportion package pdf in R
- Robinson, David (200?) Introduction to Empirical Bayes and blog
- Sullivan, G. M. (2011). Getting Off the “Gold Standard”: Randomized Controlled Trials and Education Research. Journal of Graduate Medical Education, 3(3), 285–289. http://doi.org/10.4300/JGME-D-11-00147.1
- Walser, Tamara M. (2014). Quasi-Experiments in Schools: The Case for Historical Cohort Control Groups. Practical Assessment, Research & Evaluation, 19(6). Available online: http://pareonline.net/getvn.asp?v=19&n=6
- Wilson, R. C., Geana, A., White, J. M., Ludvig, E. A. & Cohen, J. D. (2013). Exploration strategies in human decision making. Reinforcement Learning & Decision Making