Still a place for P-values?

Been reading some rather polarized views on the business of carrying out statistical analyses. A healthy dose of skepticism usually arises when I am confronted with “you need to stop using X and only ever use Y” statements. So, something that has been working for over a hundred years is suddenly totally bunk? Well, maybe, maybe not.

Guess I should be a bit more specific — some folks are now taking the stance that traditional frequentist statistics, where you report the “significance” of an experimental treatment or the strength of correlation between two variables, should be dropped and never used again, all in favour of a multi-model approach.

Now, before you go screaming “you’re totally old school, man!”, two thoughts: 1) yeah, I am a little, and 2) I’m totally down with the whole model selection and multi-model inference framework. When I teach Quantitative Biology II (ECOL 425) at @UCalgary, I make sure I introduce the concept to my students, and I have used the approach in some recent work (Vamosi & Vamosi 2010, Ecology Letters; Vamosi & Vamosi 2011, Am J Bot; Kremer, Vamosi, Rogers, in prep.).

It’s also the case that I’ve long been bothered by the, for lack of a better word, abruptness of the alpha = 0.05 threshold. For example, let’s say that for part of your PhD dissertation, you conduct two related experiments on plant growth along an elevational gradient. You analyze the first and your analysis returns P = 0.043 for the main effect, so you conclude that your fertilizer treatment was significant (yay, time to raise a glass!). However, the analysis of the second returns P = 0.067, so you conclude that herbivore exclusion did not have a significant effect (hmm, might be time for a drink anyway). Sure, you can reach out for the “marginally significant” lifeline, but there are issues with that too, the biggest one being that folks tend to be biased when they do so (i.e., they will [even subconsciously] be happy to grab it to “rescue” a main effect but never reach for it to do the same for a nuisance interaction term).

Keeping all of that in mind, I’m still not convinced that a strict “do it this way, do not ever do it that way” rule is productive. At the end of the day, I envision a long future for analytically simple comparisons, such as: “are the wing lengths of these two populations of hummingbirds significantly different from one another?” Yes, we should report statements of magnitude, aka effect sizes (“hummingbirds from site A had, on average, 8% longer wings than those from site B”), but 19 of 20 folks will want some statement of confidence in that conclusion. The multi-model inference framework was simply not set up for these types of problems.

Maybe you’re thinking “yeah, but… that’s a very simple scenario, surely model selection then *must* be applied to everything else”. Actually, I think the vast majority of lab and field experiments still lend themselves best to the old-fashioned analysis framework. Chamberlin’s method of multiple working hypotheses is nice and all, but the last 125 years have shown that it’s not something that is easily applied to one system all at once. The vast majority of experiments I encounter in ecology and evolutionary biology vary three factors or fewer (and the vast majority of that majority vary two factors or fewer), which is perfectly suited for, say, a three-way ANOVA.

Just be sure to run a single analysis, ignoring the temptation to remove “non-significant” terms in subsequent analyses. Others more qualified than me have written about the perils of stepwise model simplification; in a nutshell: there’s no theory to justify it, it leads to biased parameter estimates, it may give different answers if you use forward vs. backward methods, it can inflate error rates, and there’s no theory to justify it (yes, I know I repeated myself there, but it has to be said: a shout-out to Occam’s Razor may sound compelling but isn’t theory).
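For the hummingbird comparison above, here is a rough sketch of what reporting an effect size alongside a statement of confidence might look like. Everything in it is an assumption for illustration: the wing lengths are invented, the function names are mine, and I have swapped in a permutation test and a bootstrap interval (rather than a classic t-test) only so that the example runs on plain Python with no extra libraries.

```python
import random
import statistics

# Invented wing lengths in mm (illustrative only, not real data).
site_a = [54.2, 55.1, 53.8, 56.0, 54.7, 55.4, 53.9, 54.9]
site_b = [50.3, 51.0, 49.8, 50.9, 51.4, 50.1, 49.6, 50.7]


def percent_difference(a, b):
    """Effect size: how much larger mean(a) is than mean(b), in percent."""
    mean_a = statistics.fmean(a)
    mean_b = statistics.fmean(b)
    return 100.0 * (mean_a - mean_b) / mean_b


def permutation_p_value(a, b, n_perm=10_000, seed=1):
    """Two-sided Monte Carlo permutation P-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_perm):
        # Shuffle the site labels and recompute the difference in means.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def bootstrap_ci(a, b, n_boot=10_000, level=0.95, seed=1):
    """Percentile bootstrap interval for mean(a) - mean(b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each site with replacement, keeping sample sizes fixed.
        resampled_a = rng.choices(a, k=len(a))
        resampled_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resampled_a) - statistics.fmean(resampled_b))
    diffs.sort()
    lower = diffs[int((1 - level) / 2 * n_boot)]
    upper = diffs[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper


if __name__ == "__main__":
    low, high = bootstrap_ci(site_a, site_b)
    print(f"Site A wings are {percent_difference(site_a, site_b):.1f}% longer on average")
    print(f"95% bootstrap interval for the difference: {low:.2f} to {high:.2f} mm")
    print(f"Permutation P-value: {permutation_p_value(site_a, site_b):.4f}")
```

The point is only that magnitude, interval, and P-value sit comfortably side by side in one short report; with real data you would pick whichever test suits your design.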

If there is a single hardline statement to make: do not repeat the mistake that made @cbahlai sad!

[Screenshot, 2014-09-30]

That is: pick one framework and stick with it / respect another researcher’s choice of framework and let them stick with it. Mixing and matching AIC scores and P-values is probably* hazardous to kittens.

*lack of statement of effect size or P-value intentional

Thanks to those who got me thinking about the topic, and thanks to you for reading.