class: middle, title background-size: contain <br><br> # Should we retire null hypothesis significance testing in (some) social policy research? <br><br> **Dr. Calum Webb**<br> Sheffield Methods Institute, the University of Sheffield<br> [c.j.webb@sheffield.ac.uk](mailto:c.j.webb@sheffield.ac.uk) Working Paper Available on SPA2023 Conference Platform | Code published on [Github](https://github.com/cjrwebb/retire-nhst)
--- .middle-left[ ## What is the effect of family support services spending on rates of children in the care system? ] -- .middle-right[ <table class="table" style="font-size: 24px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;font-weight: bold;"> </th> <th style="text-align:right;font-weight: bold;"> B </th> <th style="text-align:right;font-weight: bold;"> S.E. </th> <th style="text-align:right;font-weight: bold;"> t </th> <th style="text-align:right;font-weight: bold;"> p </th> <th style="text-align:left;font-weight: bold;"> </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> (Intercept) </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:right;"> 1.00 </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> Child poverty rate z </td> <td style="text-align:right;"> 0.40 </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> 3.15 </td> <td style="text-align:right;"> 0.00 </td> <td style="text-align:left;"> * </td> </tr> <tr> <td style="text-align:left;"> Inequality (Gini) z </td> <td style="text-align:right;"> 0.30 </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> 2.36 </td> <td style="text-align:right;"> 0.02 </td> <td style="text-align:left;"> * </td> </tr> <tr> <td style="text-align:left;"> Spending z </td> <td style="text-align:right;"> -0.15 </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> -1.18 </td> <td style="text-align:right;"> 0.24 </td> <td style="text-align:left;"> </td> </tr> <tr> <td style="text-align:left;"> Staffing z </td> <td style="text-align:right;"> 0.05 </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> 0.39 </td> <td style="text-align:right;"> 0.70 </td> <td style="text-align:left;"> </td> </tr> </tbody> </table> .center[N = 50] ] --- .pull-left[ <img 
src="retire-nhst_files/figure-html/unnamed-chunk-4-1.png" width="550" height="550" /> ] .middle-right[ * Only effects with a correlation around **±0.4 or greater** will be **detectable** ("p<0.05") **at least 80% of the time** with a sample size of N = 50. * Only effects from a sample with a correlation **.tuos_mint[above about ±0.275]** will be **.tuos_mint[detectable]** ("p<0.05") **.tuos_mint[at all]** with a sample size of N = 50 (Lakens, 2017). * If we could randomly re-sample the data 1,000 times, the spending effect would **only be statistically significant in 23% of the samples.** ] --- class: middle # Our study is **underpowered**. What can we do? .pull-left[ ### The textbook approach: * Do a **power analysis** beforehand and repeat the study with a larger sample and adequate power to detect the small/modest effect. * Only one problem... ] -- .pull-right[ ### We can't collect a bigger 'random' sample: we have data from all 50 administrative units (states, local authorities, countries, etc.). This is an **apparent population** (Berk, Western & Weiss, 1995a) ] --- class: middle, center # Some suggestions... -- .big[**Bin statistical inference** (e.g. our data are the population, so no inference is needed [Berk, Western & Weiss, 1995a])] -- .big[**Change the ecological unit** (e.g. individual-level survey, neighbourhoods rather than local authorities or states, etc.)] -- .big[**Pool data over several years** (e.g. data for states/LAs between 2015-2021) [Bollen, 1995]] -- .big[**Use meta-analysis** (several non-significant results can illuminate a significant one [Bollen, 1995])] -- .big[**Bin null hypothesis significance testing**] --- class: middle # **Bin statistical inference**<br>A theoretical/philosophical argument .pull-left[ A brief recap of frequentist NHST: * There is a true population parameter. * Our estimates of this parameter vary because our samples of data vary.
* The aim of statistical inference is to generalise from our sample to a population (e.g. from a random sample of 500 people to the entire population of the UK). * We test how probable an estimate at least as large as ours would be in a random sample of size N if the true population parameter were equal to 0. <br> <br> <br> <br> ] -- .pull-right[ **Is this approach to statistical inference meaningful when our data are the complete population?** > "If the data (the apparent population) are a census of a population and interest lies in a population value, then descriptive statistics are all that is needed. No inference is needed since nothing is unknown." (Berk, Western & Weiss, 1995b: 483) But... * The 'population' could refer to a 'superpopulation' — the underlying data generating process. * It is probably undesirable to have no measure of statistical uncertainty, especially if estimates are 'noisy'. ] --- class: middle .middle-left[ # **Change the ecological unit**<br>At risk of fallacy * Sometimes appropriate, e.g. unemployment rates might be substituted with individual probability of being unemployed. * Often not: e.g. it is unclear how to meaningfully disaggregate state or LA spending to the individual or neighbourhood level. * Uncomfortable implications for social policy/public administration research: .tuos_purple[macro- (national) or meso-level policy and socioeconomic conditions don't matter because they're not convenient for our statistical practice]? ] .pull-right[ <img src="retire-nhst_files/figure-html/unnamed-chunk-5-1.png" width="500" height="490" /> ] --- class: middle .middle-left[ # **Change the ecological unit**<br>At risk of fallacy * Sometimes appropriate, e.g. unemployment rates might be substituted with individual probability of being unemployed. * Often not: e.g. it is unclear how to meaningfully disaggregate state or LA spending to the individual or neighbourhood level.
* Uncomfortable implications for social policy/public administration research: .tuos_purple[macro- (national) or meso-level policy and socioeconomic conditions don't matter because they're not convenient for our statistical practice]? ] .pull-right[ <img src="retire-nhst_files/figure-html/unnamed-chunk-6-1.png" width="500" height="490" /> ] --- class: middle .middle-left[ # **Pool data over several years**<br>The problem with pooling * Pool panel data for the same countries/LAs/states, etc. over multiple years to increase the sample size. * Ignores the multilevel structure of the data... ] --- class: middle background-color: white <img src="images/pool-same.png" width="100%" /> ??? The top panel shows the results from 200 simulated datasets that represent 50 units (states, LAs, countries) measured over 20 time points (e.g. 2002-2022). The coefficients are far more likely to be significant, and are generally unbiased; however, this ignores the fact that differences between units are distinct from differences within units. Once an appropriate multilevel CWC(M) model is used, the standard errors for the between-unit effects — the effects answering the question of whether, for example, states with higher levels of spending have higher rates of children in care — end up the same as the single-year standard errors. --- class: middle background-color: white <img src="images/pool-zero.png" width="100%" /> ??? Pooling can result in considerable bias whenever the effects are anything but exactly the same at the between- and within-unit levels, and identical effects at both levels are highly unlikely. For example, in children's services spending the between-unit relationship is typically positive (places with more problems get more money), while the within-unit relationship is typically negative (if a place gets more money than usual in a given year, it tends to have fewer problems). --- class: middle background-color: white <img src="images/pool-opposite.png" width="100%" /> ???
Pooling can result in considerable bias whenever the effects are anything but exactly the same at the between- and within-unit levels, and identical effects at both levels are highly unlikely. For example, in children's services spending the between-unit relationship is typically positive (places with more problems get more money), while the within-unit relationship is typically negative (if a place gets more money than usual in a given year, it tends to have fewer problems). --- class: middle background-color: white .middle-left[ # **Meta-analysis**<br>In theory and practice... * It is true that we could repeat a study every year for 20 years and, even if all of the coefficients from the individual studies were non-significant, a meta-analysis of all of these studies would be likely to be significant. * It is also true that doing this would be incredibly boring, unlikely to attract funding, and unlikely to get written up and published... * This would lead to considerable publication bias, which would affect the meta-analysis results ] -- .pull-right[ <br> <img src="images/publication-bias.png" width="100%" /> ] --- class: middle, inverse # At this point, you may be wondering... --- class: middle, inverse # Is this one of those depressing methods papers where someone points out everything that is wrong with what we are doing and offers no practical alternatives? --- class: middle, inverse # Fair point. --- class: middle, inverse # But no, it is not! --- class: middle .pull-left-small[ # **An alternative**: Describing, rather than deciding, uncertainty From a pragmatic standpoint, the root of this problem is the use of a .tuos_purple[decision threshold] (i.e. the Null Hypothesis Significance Test). But there are alternatives!
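The earlier resampling claim (the spending effect would be "significant" in only around a quarter of re-samples) can be made concrete with a short simulation. This is an illustrative sketch using a simple bivariate correlation with made-up values, not the paper's own code (which is on Github):

```python
import math
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def share_significant(true_r=-0.15, n=50, reps=1000, t_crit=2.011, seed=1):
    """Fraction of simulated samples of size n in which a true correlation
    of `true_r` is flagged 'significant' (|t| > ~2.011, the two-sided
    5% critical value for df = 48)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        # Build y so that corr(x, y) equals true_r in expectation
        ys = [true_r * x + math.sqrt(1 - true_r ** 2) * rng.gauss(0, 1)
              for x in xs]
        r = pearson_r(xs, ys)
        t = r * math.sqrt((n - 2) / (1 - r ** 2))
        hits += abs(t) > t_crit
    return hits / reps
```

With a true correlation of -0.15 and N = 50, `share_significant()` comes out at roughly one in five, in the same ballpark as the 23% reported for the full model; raising `true_r` to 0.4 pushes the share above 0.8, consistent with the detectability figures quoted earlier.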
] .pull-right-big[ <img src="retire-nhst_files/figure-html/unnamed-chunk-11-1.png" width="800" height="400" /> ] --- class: middle .pull-left-big[ <img src="retire-nhst_files/figure-html/unnamed-chunk-12-1.png" width="800" height="400" /> ] .middle-right-small[ **Frequentist NHST Interpretation** * There was no statistically significant association between spending on family support services and rates of children in care (p = 0.244) ] --- class: middle .pull-left-big[ <img src="retire-nhst_files/figure-html/unnamed-chunk-13-1.png" width="800" height="400" /> ] .middle-right-small[ **Bayesian descriptive interpretation** * .tuos_purple[87% of the posterior distribution] was consistent with spending having a .tuos_purple[negative effect on rates of children in care]; greater spending on preventative services is around 6.7 times as likely to reduce rates of care entry as to increase them (*Probability of direction*). ] --- class: middle .pull-left-big[ <img src="retire-nhst_files/figure-html/unnamed-chunk-14-1.png" width="800" height="400" /> ] .middle-right-small[ **Bayesian descriptive interpretation** * .tuos_purple[65% of the posterior distribution] was associated with a .tuos_purple[negative effect of spending beyond what would be considered practically equivalent to zero]. There is only a 3% probability that increasing spending would increase rates of care (*Region of Practical Equivalence*). ] --- class: middle .pull-left-big[ <img src="retire-nhst_files/figure-html/unnamed-chunk-15-1.png" width="800" height="400" /> ] .middle-right-small[ **Bayesian descriptive interpretation** * There is a .tuos_purple[76.5% probability that increasing spending would decrease rates of children in care to such an extent as to be cost-neutral] (< -0.05sd) (*Policy Relevant Value*).
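Each of these summaries is simply a proportion of posterior draws. A minimal sketch in Python, using a hypothetical normal posterior (mean -0.15, SD 0.13, matching the frequentist point estimate and standard error) as a stand-in for MCMC draws and assuming a ROPE of ±0.1; the working paper's models are fitted with `brms`:

```python
import random

rng = random.Random(2023)
# Hypothetical stand-in for MCMC draws of the standardised spending
# coefficient: a normal posterior centred on -0.15 with SD 0.13.
draws = [rng.gauss(-0.15, 0.13) for _ in range(100_000)]

def share(condition):
    """Proportion of posterior draws satisfying a condition."""
    return sum(condition(d) for d in draws) / len(draws)

p_direction    = share(lambda d: d < 0)      # prob. the effect is negative
p_below_rope   = share(lambda d: d < -0.1)   # negative beyond ROPE [-0.1, 0.1]
p_above_rope   = share(lambda d: d > 0.1)    # harmful beyond the ROPE
p_cost_neutral = share(lambda d: d < -0.05)  # policy-relevant value: < -0.05sd
```

With these stand-in draws the proportions land close to the figures quoted: `p_direction` ≈ 0.87, `p_below_rope` ≈ 0.65, `p_above_rope` ≈ 0.03, and `p_cost_neutral` ≈ 0.78.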
] --- class: middle .pull-left-big[ <img src="retire-nhst_files/figure-html/unnamed-chunk-16-1.png" width="800" height="400" /> ] .pull-right-small[ > The .tuos_purple[straightforwardness] of these statements is possible because of the .tuos_purple[Bayesian] interpretation of probability: the data are fixed and the parameter is treated as uncertain, so probability describes our knowledge of the parameter's value rather than the variability of repeated samples.<br><br>However, that doesn't mean that .tuos_purple[frequentist] descriptions of uncertainty are impossible to imagine. The challenge is the dominance of NHST and the difficulty of interpreting them. ] --- class: center, middle # Conclusions & Reflections .big[These kinds of .tuos_purple[apparent populations are common in social policy research]. Their units are also meaningful for policy development.] -- .big[A reliance on NHST, inappropriate at worst and underpowered at best, indirectly .tuos_purple[generates more evidence for individualised/atomised policies and framing of social problems] (where there is more statistical power). How many effects stood no chance of being detected?] -- .big[There are .tuos_purple[no shortcuts for making up for the shortfall of statistical power] in apparent populations, but alternative descriptive accounts of uncertainty are possible and desirable.] --- class: middle .middle-left-small[ .center[ <img src="header/staff-photo.png" width="90%" /> ] ] .pull-right-big[ ## Dr. Calum Webb .tuos_purple[Sheffield Methods Institute]<br>The University of Sheffield<br>The Wave, 2 Whitham Road<br>Sheffield<br>S10 2AH [c.j.webb@sheffield.ac.uk](mailto:c.j.webb@sheffield.ac.uk) If this presentation has piqued your interest even slightly, I would recommend learning Bayesian data analysis via [Richard McElreath's Statistical Rethinking](https://xcelab.net/rm/statistical-rethinking/) textbook and lecture series, followed by checking out the `brms` `R` package.
Remember, the full working paper is available from the SPA2023 conference website and the code for all simulations is available on [github](https://github.com/cjrwebb/retire-nhst) ] --- class: middle # References Berk, R. A., Western, B., & Weiss, R. E. (1995a). Statistical inference for apparent populations. *Sociological Methodology*, **25**, 421-458. Berk, R. A., Western, B., & Weiss, R. E. (1995b). Reply to Bollen, Firebaugh, and Rubin. *Sociological Methodology*, **25**, 481-485. Bollen, K. A. (1995). Apparent and nonapparent significance tests. *Sociological Methodology*, **25**, 459-468.