Statistical Inference
=====================
By: Dr Grahame Blackwell
BTech(Hons) PGCE PhD MBCS CITP CEng EurIng FCybS
First degree: Maths, Stats & Computing; PhD in Artificial Intelligence.

The purpose of this web page is to give some insights into how statistics (i.e. statistical analysis) may and may not be used, and into what statistics can and cannot prove.
First and foremost, statistics cannot absolutely prove anything; that is not the purpose of statistics.
Secondly, nothing can be proved absolutely, by anyone or by any scientific technique.
Any proposal can only be proved to a certain level of confidence. In a legal setting this may be 'beyond reasonable doubt'; in most practical situations 3-sigma confidence (three standard deviations from the norm*), equating to around 99.7% certainty, is regarded as pretty definite. If someone's aiming for a Nobel Prize in Physics, or claiming discovery of a new sub-atomic particle at the Large Hadron Collider, then a 5-sigma level of confidence (equating to less than a 1-in-1.74-million chance of being wrong) is expected.
[* Standard deviations (sigmas), σ, indicate increasingly small likelihoods (of a conclusion being wrong): 1σ ≈ 32%, 2σ ≈ 4.5%, 3σ ≈ 0.3%, 4σ ≈ 0.0063%, 5σ ≈ 0.0000573%. In situations where a result could only be larger, or only smaller, than the norm, the probability of any one of these deviations is halved.]
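These figures are easy to verify. Below is a minimal Python sketch (assuming the SciPy library is available; the code is illustrative, not part of the original page) that derives the two-sided and one-sided chance probabilities for each sigma level from the standard Normal distribution:

    # Chance probability of a result at least k standard deviations
    # from the mean, under the standard Normal distribution.
    from scipy.stats import norm

    for k in range(1, 6):
        two_sided = 2 * norm.sf(k)   # sf = survival function, 1 - CDF
        one_sided = norm.sf(k)       # halved when only one direction is possible
        print(f"{k} sigma: two-sided {two_sided:.3g}, one-sided {one_sided:.3g}")

    # two-sided: ~0.317, 0.0455, 0.0027, 0.000063, 0.00000057
    # i.e. ~32%, 4.5%, 0.3%, 0.0063%, 0.0000573% as quoted above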

One of the primary rules of statistical analysis is that one starts by formulating an hypothesis about the data to be tested - what's normally termed the 'Null Hypothesis' (H0 for short). It's totally out of order to pick and choose from sets of data, selecting only those sets that give a pleasing result. By doing this, one is pre-loading the result of the analysis, biasing it in a certain direction; an absolute requirement of a statistical assessment is that it's totally unbiased - otherwise it cannot be objective, and objectivity is what statistical analysis is all about.

As a simple analogy: A school head is facing an assembly hall in which there are 1,500 students, of whom just two are redheads; having spotted the two, she walks down the hall, takes the hand of one of them, then walks across with that one and also takes the hand of the other one. Walking up to the front with the two students, she says to the assembly: "I've just picked out two students from 1,500, and both of them happen to be redheads. What an amazing coincidence!" If the head had actually chosen two of the students at random, the probability that she'd have chosen the two redheads is less than one in a million; as it is, the way she did it, the likelihood of her walking to the front with two redheads is precisely one - a certainty!
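For anyone who'd like to check the arithmetic behind 'less than one in a million', a one-off Python calculation (illustrative only) confirms it:

    # Number of equally likely ways to choose 2 students from 1,500;
    # exactly one of those pairs is the pair of redheads.
    from math import comb

    pairs = comb(1500, 2)     # 1500 * 1499 / 2 = 1,124,250
    print(1 / pairs)          # ~8.9e-07: under one in a million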

[This is quite different, of course, from choosing the data one wishes to analyse and then deciding which form of analysis should be applied to that data; different statistical tools are better for highlighting any significance in different types of data. It's also standard statistical practice to check a dataset for outliers before analysis: individual elements shown by mathematical procedures to deviate markedly from the rest of the dataset, and so most likely to be subject to exceptional circumstances. Such elements may be excluded, with justification, where they would otherwise distort the outcome of the overall analysis.]

The Null Hypothesis H0 is normally stated in the form that "observed results are not significant: they can be explained by random variations in everyday circumstances" - or, in the case of possible correlation between two sets of measurements, that "there is in fact no actual link between those sets of data, and any seeming correlation is simply coincidental". An Alternative Hypothesis, H1, represents the proposition that the observed results cannot be explained satisfactorily by random variation, and that H0 must therefore be rejected (i.e. the observed results are significant). The term 'satisfactorily' is defined in terms of a probability threshold (a 'significance level'): for example, a significance level of 0.001 means that if the probability of the observed situation occurring by chance turns out to be less than 1 in a thousand then H0 is rejected - the situation is too improbable (at that level) to have occurred by chance, so the data is in fact showing a significant effect.
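As a minimal illustration of this decision procedure - using a made-up coin-tossing example, nothing to do with the study discussed below - the following Python sketch (SciPy assumed) tests whether 60 heads in 80 tosses is consistent with a fair coin:

    from scipy.stats import binom

    n, heads = 80, 60
    # H0: the coin is fair (p = 0.5); H1: it is biased.
    # Two-sided p-value: the chance, under H0, of a result at least this
    # far from the expected 40 heads in either direction. The fair-coin
    # distribution is symmetric, so we can simply double one tail.
    p_value = 2 * binom.sf(heads - 1, n, 0.5)   # P(X >= 60), doubled
    print(p_value)                              # well below 0.001

    alpha = 0.001           # chosen significance level
    if p_value < alpha:
        print("Reject H0: too improbable to have occurred by chance")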

Different statistical distributions model different types of data - Normal, Binomial, Chi-Square, Poisson etc. - and the appropriate distribution should be used to analyse any specific type of data and derive a probability rating for that set of data. Another form of analysis, Regression Analysis, can be used to identify whether there is a relationship between two sets of data. If a relationship is established, to a reasonable level of confidence, then a regression line (graph) can be defined and used to predict likely outcomes in cases where no measurements are available for the situation being analysed/modelled. The simplest form of regression is linear regression, in which there's a straight-line relationship between the pairs of data, i.e. those pairs of data, plotted as X-Y points on a graph, run roughly in a straight line.
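A minimal linear-regression sketch in Python (SciPy assumed; the numbers are invented purely for illustration and are not data from the study discussed below):

    from scipy.stats import linregress

    x = [1, 2, 3, 4, 5, 6, 7, 8]                       # explanatory values
    y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9, 14.2, 15.8]   # noisy responses

    fit = linregress(x, y)
    print(fit.slope, fit.intercept)     # the regression line
    print(fit.rvalue, fit.pvalue)       # correlation and its probability rating

    # The line y = slope*x + intercept can then be used for prediction
    # where no measurement is available:
    x_new = 10
    print(fit.slope * x_new + fit.intercept)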

One important point to note is that pretty well all real-life data is 'fuzzy' - that is, there are generally other major or minor effects involved, so the points won't lie neatly along that regression line: they'll be scattered around it, closer to or further from the line depending on the degree of correlation (linked variation) between those two sets of data. If there is no correlation, then there won't be any tendency for those points to lie in a line. If there is a degree of correlation, though, even if it's only partial (i.e. there are other factors involved, so the points are quite spread out), linear regression analysis will extract that relationship from the fuzzy data and show it up at whatever level of significance applies. That significance also depends on the number of data pairs used in the analysis: more data pairs give more confidence, and so a higher significance rating.
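The effect of fuzziness, and of the number of data pairs, can be seen in a short simulation (Python with NumPy/SciPy assumed; invented data, for illustration only):

    import numpy as np
    from scipy.stats import linregress

    # A true straight-line relationship, y = 3x + 5, buried in heavy noise.
    rng = np.random.default_rng(0)
    for n in (10, 50, 200):                           # increasing data pairs
        x = rng.uniform(0, 10, n)
        y = 3.0 * x + 5.0 + rng.normal(0, 8.0, n)     # scatter round the line
        fit = linregress(x, y)
        print(n, round(fit.rvalue, 3), fit.pvalue)

    # The correlation stays well below 1 (the data is fuzzy), but the
    # p-value shrinks rapidly as n grows: more pairs, more confidence.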

On plotting those points on a scattergram, a standard initial step, it may be apparent that they follow a rough curve - upward or downward - rather than a straight line. In such a case, in addition to checking for a linear relationship, a check for an exponential relationship forms a further part of the analysis. This can be readily achieved by applying the same linear regression analysis to the X-values paired with the logarithms of the Y-values: if the logs of the Y-values are linearly related to the X-values, then the actual Y-values themselves are related exponentially to the X-values.
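In code, the log-transform check is one extra step (Python, NumPy/SciPy assumed; the numbers here are constructed to be exactly exponential for clarity):

    import numpy as np
    from scipy.stats import linregress

    x = np.arange(1, 9)
    y = 2.5 * 1.8 ** x                  # an exactly exponential relationship
    fit = linregress(x, np.log(y))      # regress log(Y) on X
    print(fit.rvalue)                   # ~1.0: perfectly linear in log space

    # Back-transforming recovers the exponential form Y = A * B**X:
    print(np.exp(fit.intercept), np.exp(fit.slope))   # ~2.5 and ~1.8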

[A few words on exponential relationships: if Y-values increase by the same amount for each increase of 1 in the X-values - i.e. we add the same amount to Y for each 1 increase in X - then the relationship is linear; if the Y-values are multiplied by the same amount for each increase of 1 in the X-values, the relationship is exponential.

This form of relationship is most often, but not only, found in time-sequences: the number of rabbits in a colony and the number of bacteria in a culture in a Petri dish both multiply by a (roughly) constant amount each month or each minute, respectively. So the number of rabbits or bacteria correlates exponentially with the time in months or minutes since the start of observations. But consider, for example, the number of colours that you could have in a computer picture, depending on the number of bits per pixel (dot): 2 bits only gives us 4 possibilities (so 4 colours): 00, 01, 10, 11; 3 bits gives us 8: 000, 001, 010, 011, 100, 101, 110, 111; 4 bits gives us 16 possible colours ... 8 bits gives us 256 possible colours; 24-bit colour gives us over 16 million colours. Every time we add one more bit, we multiply the number of colours by 2 - there's an exponential relationship between the number of bits and the number of colours. This particular relationship is also deterministic - it doesn't vary statistically - but it's a useful example of a non-time-based exponential link.]
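The add-versus-multiply distinction, and the bits-to-colours example, in a few lines of Python (illustrative only):

    linear = [10 + 3 * n for n in range(6)]         # add 3 each step
    exponential = [10 * 3 ** n for n in range(6)]   # multiply by 3 each step
    print(linear)          # [10, 13, 16, 19, 22, 25]
    print(exponential)     # [10, 30, 90, 270, 810, 2430]

    for bits in (2, 3, 4, 8, 24):
        print(bits, 2 ** bits)    # 4, 8, 16, 256, 16777216 colours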

The following comments in blue refer to the initial analysis as proposed in BMJ.
Corresponding comments, with identical conclusions, apply to the follow-up analysis.


Ok, so in the analysis undertaken here, it's been proposed that we investigate the possibility of a correlation between the percentage of over-65s given an influenza jab in 2018 (or as recently as figures are available for) and deaths per million from Covid-19, in 20 different European countries. This is a very reasonable starting point, as: (a) the figures are all available; (b) the level of development - sanitation, hygiene, availability of pure water etc. - is not too dissimilar across those countries; (c) at least one previous study (conducted before Covid-19) raises the possibility that influenza vaccination may increase susceptibility to some other forms of illness, including coronavirus. This can't therefore be in any way regarded as a 'contrived' study - it has all the necessary ingredients for an objective, unbiased, relevant study.

Conducting this study in accordance with standard Linear Regression techniques, as detailed on the previous web page, gives a correlation of 0.7299 between death rates per million and percentage of over-65s receiving influenza vaccines; for 20 pairs of points this gives a probability rating of 0.000259: there's less than a 1-in-3,800 likelihood of this degree of correlation occurring by chance. If we consider logs of death rates against over-65 percentage vaccinations, the correlation rises to 0.8249, giving a probability rating of less than 0.00001: the likelihood of this result - indicating an exponential relationship of deaths to vaccinations - occurring by chance is less than 1 in 100,000.
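The probability ratings quoted here can be checked directly from the reported correlations and the number of data pairs, using the standard t-test for a correlation coefficient (a Python sketch with SciPy; only the figures stated above are used):

    # t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom.
    from math import sqrt
    from scipy.stats import t as t_dist

    n = 20
    for r in (0.7299, 0.8249):
        t = r * sqrt((n - 2) / (1 - r ** 2))
        p = 2 * t_dist.sf(t, df=n - 2)     # two-sided probability rating
        print(r, round(t, 2), p)

    # r = 0.7299 gives p ~ 0.00026; r = 0.8249 gives p below 0.00001,
    # matching the figures quoted above.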

Again it must be stressed that this analysis - any analysis - can only show correlation, not causation: we have not proved that increasing vaccination figures cause increasing death rates, only that the figures are linked to a high degree, to a very high level of confidence. Also, the scattering of the points, and the less-than-100% correlation, indicate that other effects will also be involved - maybe standard of living, quality of nutrition, or whatever, in each country. But a correlation of 0.8249, and a 99.999% likelihood that the link is not down to chance, are very high figures: this must be regarded as a strong link that's not happening by chance.


So what might an appropriate response be? What should a researcher do with such results?

Well, the whole point of a process like this is to decide whether any sort of action is in order - and the conclusion from such significant results can only be a resounding yes! It's up to the analyst - or the one they're doing the analysis for - to decide what that action should be. But in a case like this, where the dependent variable (Y) represents a safety-critical (life-critical) measure - death rate - not to take such a result very seriously would be grievously irresponsible. At the very least, an urgent detailed investigation should be undertaken to see whether a third factor might be responsible for linking these two, with any candidate factor subjected to the same rigorous analysis in relation to each of the two variables; if no such link can be unmistakably identified, then it must be assumed that the link is causal, and appropriate action taken to minimise the death toll. It's up to the analyst to draw inferences from the figures provided by statistical methods, and to act on those inferences as seems appropriate.
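One concrete form such an investigation could take is a partial-correlation check: does the link between vaccination percentage and death rate survive once a candidate third factor is controlled for? A sketch (Python, NumPy/SciPy assumed; the variable names are placeholders, not data from the study):

    import numpy as np
    from scipy.stats import linregress

    def partial_corr(x, y, z):
        # Correlation between x and y with the influence of z removed:
        # correlate the residuals of the x-on-z and y-on-z regressions.
        x, y, z = map(np.asarray, (x, y, z))
        fx, fy = linregress(z, x), linregress(z, y)
        res_x = x - (fx.slope * z + fx.intercept)
        res_y = y - (fy.slope * z + fy.intercept)
        return np.corrcoef(res_x, res_y)[0, 1]

    # If partial_corr(vacc_pct, death_rate, third_factor) falls to near
    # zero, that factor could account for the observed link; if it stays
    # high, that factor alone does not explain it.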

========

It's been observed that analysing death rates against percentage of Over-65s un-vaccinated (i.e. Y against 1-X) gives a negative correlation numerically equal to the original positive correlation, with the same probability. This is bound to be the case, since if Y increases with X at a certain rate, then it will decrease with 1-X at precisely the same rate. This leads to the corresponding result that death rate decreases exponentially with an increasing percentage of un-vaccinated Over-65s. That's effectively repeating the original result, but just presenting it in a different form.
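A quick numerical confirmation of this point (Python, NumPy/SciPy assumed; invented numbers, for illustration only):

    import numpy as np
    from scipy.stats import linregress

    rng = np.random.default_rng(1)
    x = rng.uniform(0, 1, 20)                # e.g. proportion vaccinated
    y = 4.0 * x + rng.normal(0, 0.5, 20)

    f1 = linregress(x, y)
    f2 = linregress(1 - x, y)                # proportion un-vaccinated
    print(f1.rvalue, f2.rvalue)              # same size, opposite sign
    print(f1.pvalue, f2.pvalue)              # identical significance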

Again, this is correlation: it doesn't necessarily imply causation. However it does indicate that something is going on; what that 'something' might be is for the researcher - or those receiving these results from the researcher - to identify and act upon.

===================================================