Here Thar Be Monsters!

From the other side of the argument to the other side of the planet, read in over 149 countries and 17 languages. We bring you news and opinion with an IndoTex® flavor. Be sure to check out the Home Site. Send thoughts and comments to bernard, and tell all your friends. Note comments on this site are moderated to remove spam. Sampai jumpa, y'all.


White Lies, Damn Lies And Statistics

Mark Twain's famous quote about the three kinds of lies is one of those undeniable folksy truisms that stands the test of time.

If you read this entire article, I promise by the end you will be able to find at least five major errors in almost every article and newscast that quotes statistics, and be able to invalidate all the misinformation you are being fed, to the amazement and delight of your friends and family.

You are being manipulated, and what follows is precisely how it is being done.

It is safe to say most people in the world have not been trained to reason statistically.  It is an arcane branch of mathematics that underpins such fields as demography, public health, political science, and media ratings.  It is for the latter reason that I was forced into a Statistical Reasoning class at university, as part of my Communications degree.

It came in handy for a significant part of my career, where I spent several years writing, producing and editing medical videos for major hospitals and pharmaceutical companies.  In the course of that work, I had to digest medical statistics and express them in terms that lay people could grasp, and as they say, the best way to learn something is to teach it.

When analyzing statistical results, one must have a baseline or control group, a sample that is representative in both size and demographics, and a great many controls on the methods of data collection and analysis.

It should be noted that in some cases, it is impossible to establish a baseline or control group, such as climatology, where the sample is the entire lifespan of Earth.  In these cases, the rules for gathering and analyzing data are even more stringent and the process much more complex.

In short, how you ask the question, and how you interpret the answers, are vital to receiving a usable result.  Statistics can be an invaluable tool to discover truth, but they can also be a horrific weapon to hide it.  How one randomizes a sample across dozens of variables to ensure a truly representative group is a very complex and tricky process.  It is wide open to accidental and intentional bias.

A good study will tell the reader sample size and methodology by which the sample was determined, as well as the exact question(s) and methods used to collect the data.

A valid study will produce a value called α, or alpha, and one called p, or p-value.  These are probability measures used to judge whether the results support the original question, or hypothesis.  Both these numbers will be between 0 and 1, with 1 being absolute certainty and 0 being absolute impossibility.  You should never see a value of 1 in any statistics, since absolute certainty is impossible in REAL science.

Alpha is the significance level of a test: the false-positive rate the researchers are willing to accept, chosen before any data are collected.  In that sense, this number measures the objectivity of a test design.  An alpha of 0.05 or less is the standard that good studies shoot for, which corresponds to better than 95% confidence that an observed effect is not a fluke.

The p-value measures the odds of getting the observed result by pure chance.  Formally, it is the probability of seeing data at least as extreme as what was collected if the phenomenon under study did not actually exist.  A low p-value tells us the result is very unlikely to be an accident of sampling or of the person collecting the data.

In the end, if p is less than or equal to alpha, then we can be fairly confident that a phenomenon is real and that our guess about what causes it is accurate.  Remember that we can never achieve absolute certainty.  At best, we can only be "pretty damn sure".
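
To see alpha and the p-value in action, here is a minimal sketch in Python using only the standard library.  The scenario and numbers are made up for illustration: 60 "yes" answers out of 100 responses, tested against the hypothesis that answers are a 50/50 coin flip.

```python
from math import comb

# Hypothetical scenario: a poll question is suspected of bias toward "yes".
# Null hypothesis: respondents answer "yes" half the time (a fair 50/50).
# Observed: 60 "yes" answers out of 100 responses.
n, observed = 100, 60
alpha = 0.05  # significance level, chosen BEFORE looking at the data

# One-sided p-value: the probability of seeing 60 or more "yes" answers
# out of 100 by pure chance if the null hypothesis were true.
p_value = sum(comb(n, k) for k in range(observed, n + 1)) / 2 ** n

print(f"p-value = {p_value:.4f}")          # roughly 0.028
print("reject null" if p_value <= alpha else "fail to reject null")
```

Since p comes out below alpha, we can be "pretty damn sure" (but never certain) that the 60/40 split is not an accident of sampling.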

In any case, the margin of error is vital to analyzing statistics.  Through a series of calculations involving sample size, test design and other variables, we arrive at the error rate.  This will always be a plus/minus value, telling us that for any data point, the actual number could be anywhere within a range, such as +/-3%.  On a graph, this creates the famous "candlestick" (error bar) result, with lines extending up and down from each point showing the full range of error.
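
For a simple random sample of a proportion, that plus/minus figure can be sketched in a few lines of Python.  The 1.96 multiplier (95% confidence) and the worst-case p = 0.5 are standard textbook assumptions here, not figures from any specific poll:

```python
from math import sqrt

# 95% margin of error for a polled proportion, assuming simple random
# sampling.  z = 1.96 covers 95% of a normal distribution; p = 0.5 gives
# the worst-case (widest) margin.
def margin_of_error(sample_size, p=0.5, z=1.96):
    return z * sqrt(p * (1 - p) / sample_size)

for n in (800, 1000, 4000):
    print(f"n = {n:4d}: +/- {margin_of_error(n) * 100:.1f}%")
```

Note the diminishing returns: quadrupling the sample from 1,000 to 4,000 only cuts the margin roughly in half.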

The next three tools we need to analyze statistics are the mean, median and mode.  This is where statistics get fun, and often where the bullshit factor can be found.  This is precisely why these three values are also almost never reported in any news report citing statistics.

A statistical mean is simply the arithmetic average of the data: add up every value and divide by the number of data points.  The primary problem with this number is that a few extreme values, or a sampling error, can drag the mean up or down.  Thus, the median can be more useful.

The median is the dead center of the data, the value with an equal number of points above and below it.  In other words, mean income is the average of all incomes everywhere, and a handful of billionaires can drag it upward, while median income tells us what the household in the exact middle earns, a far better picture of the "middle class".  A related tool, the interquartile range, covers the middle 50% of all results, with the 25% extreme groups on either side.

The mode is most easily understood as the most common single answer to any question in data collection, the value that appears most often.  Comparing the mode against the mean and median is a quick sanity check on a sample: if the three are wildly out of line with each other, the distribution is badly skewed, or there is a problem with the test design or the way the question was asked.
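
Python's standard library computes all three values directly, which makes it easy to see why the mean alone can mislead.  The income figures below are invented for illustration:

```python
import statistics

# Hypothetical incomes in $1000s.  One outlier drags the mean upward
# while the median barely moves -- the classic way an "average" misleads.
incomes = [30, 32, 35, 35, 38, 40, 45, 400]

print("mean:  ", statistics.mean(incomes))    # arithmetic average: 81.875
print("median:", statistics.median(incomes))  # middle value: 36.5
print("mode:  ", statistics.mode(incomes))    # most frequent value: 35
```

A report quoting only the mean would claim a "typical" income of nearly $82,000, while half the group actually earns under $37,000.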

When you are told about polling results, climate models or health information, and these values/tools are not revealed, you can rest assured that everything you are being told is complete and utter bullshit.  Without knowing the sample size, margin of error, mean, median and mode for any result or graph, there is no way to analyze the objectivity and significance of the results.  It is meaningless noise without the tools.

The GeezerMedia feed on the ignorance of the masses by showing lots of pretty graphs and spiffy animations, and sometimes they might even include the sample size and margin of error, just to hook those with a bit more sophistication.  However, if we are not told the alpha, p-value and mode, there is no way for us to gauge the objectivity of the questions asked, nor the probability of the result being correct and/or in any way real.

The final Big Lie with statistics is the way in which the results are graphed.  How the data are visually laid out has a lot to do with how they relate to reality and with our perception of the results.

The first thing we should always look at when confronted with a graph is whether the axes start at zero and contain the full range of possibilities.  In other words, a graph of a political poll must show all possible choices/candidates on the X-axis, and the total number of potential voters on the Y-axis.  If not, the graph is meaningless and likely hiding something from you, or manipulating your perception of the results.

Any graph that does not have a zero or null point is complete garbage and should be ignored with prejudice.

If one zooms in on part of the data for greater clarity and detail, there must be break marks on the axes showing where the scale was cut, and the zoomed view must accompany a complete graph so that the perspective is not lost.  In every case where zero is not shown, someone is trying to pass off a giant load of bullshit by exaggerating the relevance of the information.

This is where nearly all graphs of climate data fail - usually on purpose.  For instance, if temperature data is not graphed with the X-axis showing all of Earth's existence, and the Y-axis showing ALL possible temperatures on Earth, then the perceived relevance of any rise or fall is completely warped.

So, how does all of this help us right now?  Glad you asked.

Statistics have been pawned off on the general public as definitive measures of political moods, climate change and health information.  In nearly every case I have ever seen, much less this year, nearly every aspect of statistical analysis is violated.  All of the above-listed tools must be present with every single study, or it is just bullshit - eye candy, fear mongering or deliberate misinformation.

One particularly good example of bullshit statistics is the Real Clear Politics graph of the spread between Joe Biden and Donald Trump.  If you look at the bottom left corner of the graph, the first value is 40, not zero.  Then we notice that all of the possible candidates for president and their results are not listed on the X-axis, and the total possible votes are not given on the Y-axis.

What we are given is a percentage, but no table of real numbers from which the percentages are calculated, and in any case, the Y-axis does not go to 100%.

Therefore, the visual impression is that the spread is enormous, whereas the real numbers are essentially the same on a full scale.  Also, without the error range (candlesticks), we cannot see the full range of probabilities and the overlap (if any) of the results.

The graph makes it look like Biden has a significant lead, but if we assume a margin of error of 3%, then Biden is at 46.6, and Trump at 45.6, which is about as close to a dead heat as one can get.

We also have a total result of 92.2% out of a possible 100%.  Where's the other 7.8%?  Is it rounding error, other candidates, 'none of the above' answers?  That's a significant amount of missing data.

Furthermore, we must assume that the Y axis has a total possible value of 100, without voting fraud.  If the graph began at zero and ended at 100, the two lines would be visually the same, even without the margin of error.
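
Using the figures implied above (Biden 49.6 and Trump 42.6 before the margin of error is applied), a few lines of Python show just how much a truncated axis inflates the visual gap:

```python
# Bar height on a chart is the plotted value minus wherever the axis
# starts, so moving the axis floor from 0 up to 40 inflates the ratio.
biden, trump = 49.6, 42.6  # percentages taken from the discussion above

for axis_start in (0, 40):
    ratio = (biden - axis_start) / (trump - axis_start)
    print(f"axis starts at {axis_start:2d}: Biden's bar looks {ratio:.1f}x taller")
```

On a full 0-100 axis the bars differ by less than 20%; starting the axis at 40 makes Biden's bar look nearly four times taller.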

We also note that the graph is based on the Poll of Polls below it.  Here we see that sample sizes are between 800 and 4,000, out of a potential base of 140 million (we are not even given that base figure), and the margins of error range from 2% to 4%.  Furthermore, we know nothing about how the samples were taken, or whether the polls all used the same criteria.

For all we know, this graph is a puree of apples and oranges.  In fact, we know this is true because some of the polls used Likely Voters (LV), some used Registered Voters (RV), some used both, and some used general opinion.

Nowhere on the page are we told the alpha, p-value, median, mean, or mode values.  We don't know what the exact questions were, how they match up across the polls, how well the a priori assumptions matched the actual results, or whether the probabilities have any relation to reality.

Even more basic than the analysis tools is the method by which the data were obtained.  What is the baseline?  How were the questions worded (bias: "are you voting for Joe Biden or one of the other guys?")?  How was the sample randomized, with bias in geography (all in New York, urban/rural), demography (all Democrats, all atheists), etc.?  How were the results obtained - in-person, telephone, internet, etc.?  Do these criteria match across all polls used to obtain the final result?  Did they all use the same criteria for randomizing the samples?  Each variable introduces a bias that must be addressed.

In other words, this entire page is bullshit and completely worthless for analysis.

Let's look at another headline stealer that truly needs some objectivity - SARS-COV-2.

To begin with, the US Centers for Disease Control (CDC) has admitted that the process for collecting data concerning deaths from the virus has been flawed.  Their criteria have not been consistent, and thus the results are at best questionable and the probabilities are highly skewed.

More than that, let's examine some of the underlying assumptions of the whole issue:

  • First, what are the criteria and methods for determining that the virus exists at all?
  • Second, what are the criteria and methods for determining the symptomology and associating it with this particular virus?
  • Third, what are the design and analyses that underlie the validity of any of the tests for this virus?
  • Fourth, what are the criteria and methods for determining the effectiveness of social distancing, masks, vaccines, treatments, etc.?
  • Fifth, what is the baseline or control group value used to determine the validity of any tests (p-value)?

These questions are very important.  Since testing has never been done on the scale now being attempted, how do we know the results are not perfectly average in any given year?  What is the average result for mass testing for coronaviruses?  If we don't have that data, then there is no way to know whether the current results are in any way unusual, and thus we have no way to tell if there is a real pandemic, whether the SARS-COV-2 virus is causing it, or whether the virus exists in the first place.

In addition, do the tests discriminate between SARS-COV-2 and the other known human coronaviruses we consider "normal" (common cold, etc.)?  I have found no data showing that the tests in common use make that distinction.  In fact, the PCR test commonly used amplifies whatever genetic traces its primers happen to match, so its ability to separate one virus from another depends entirely on how specific those primers are.

There are literally thousands of variables and assumptions nested one inside the other to claim that the world is under a pandemic situation.  Everything from electron microscopy and DNA/RNA analysis, to confidence in test results, protective measures and treatments is wide open for debate.  Recall that statistics can never produce absolute certainty, and every assumption or variable has its own level of uncertainty, compounded on all the others.
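
That compounding is easy to quantify.  Assuming (purely for illustration) a chain of independent assumptions, each held with 95% confidence, the confidence in the whole chain is the product of the parts:

```python
# Toy model of compounded uncertainty: confidence in a conclusion that
# rests on N independent assumptions, each 95% certain, is 0.95 ** N.
confidence = 0.95

for steps in (1, 5, 10, 20):
    print(f"{steps:2d} assumptions -> {confidence ** steps:.0%} overall confidence")
```

Ten nested assumptions at 95% each already drop the overall confidence to about 60%.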

Telling us something is true is not equivalent to showing us, using all the above criteria to analyze the results.  Saying the issue is vulnerable to bias and corruption is a vast understatement.

Real science, be it political polling, virology and medicine, or climatology, is open for debate at all times.  At the very best, statistics provide us with a level of confidence that we are seeing a real cause-and-effect phenomenon, but we can never reach 100%.  It's impossible.  Even an alpha of 0.01 still leaves doubt, no matter how much we want to believe it is absolute truth.

In research as complicated as political polling and virology, we need only produce one case that does not follow the results to disprove the whole argument.  If the results do not match every single instance of reality and predict future outcomes in every case, then the underlying assumptions are wrong.

In the case of political polling, the results of the Trump v. Clinton election tell us there is an inherent flaw in the methodology, whether it is bias or structural.  With SARS-COV-2, we are told that its RNA is 75% similar to SARS-COV-1, but how do we know that (sample, methods, etc.)?  There are enough holes in both arguments to make the results look like Swiss cheese.

Always keep this in mind: to be in any way valid and true, every use of statistics must tell you the sample size and origin, the alpha and p-value, the median, mean and mode, and any accompanying graphs must start at zero and show the total possible causes (X-axis) and total possible effects (Y-axis) in the set being studied (total voters, total cases, etc.).  If any one of these elements is missing, you are being intentionally lied to or accidentally misled, pure and simple.  There are no exceptions.

To paraphrase the American philosopher William James, if you want to disprove the rule that all crows are black, you need only to produce one crow that is not.

Now return to the graph at top, which was actually published in an article about climate change, and see how many errors you can find.


Feel free to leave your own view of The Far Side.