Statistical Society of Australia - A very odd article "Statistical Literacy" example

A very odd article "Statistical Literacy" example

29 Oct 2020 4:41 PM

Reply # 9332591 on 9331886

Berwin Turlach

John Maindonald wrote:
Berwin, thanks for your response.

It is not clear what are the grains of rice and what are the tiles. I'd assumed that the grains of rice are 100 staff, and that the probability of falling on one particular tile is the probability of 1 in 400 that any individual staff member will be sick. If as you suggest the idea was that 100 staff with the disease are spread across 400 labs, I'd think that the disease is not all that rare. [...]

I agree that it is not clear how the tiles and rice grains relate to the labs and researchers, but I like my correspondence :-). Initially, I also thought that with 100 cases one would not talk about a rare disease, but then I realised that we would have to look at the total number of people working in these labs. If, say, 1,000 people work in each lab, then we are talking about a total of 400,000 people over 400 labs, and if 100 of these have the disease, then the probability of an individual having the disease is 0.00025, or 0.025%. So this may qualify as a rare disease.
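The arithmetic can be checked in a couple of lines of R. This is only a sketch: the lab size of 1,000 is the hypothetical figure used in the paragraph above, and the variable names are made up for illustration.

```r
# Hypothetical figures from the paragraph above
n_labs        <- 400
staff_per_lab <- 1000   # assumed lab size, purely illustrative
n_cases       <- 100

total_staff <- n_labs * staff_per_lab   # 400,000 people in total
prevalence  <- n_cases / total_staff    # proportion with the disease
prevalence                              # 0.00025
100 * prevalence                        # 0.025 (per cent)
```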

Finally, note that the statement was actually:

<<<

[...] We tried numerous times, actually throwing the rice, and there was always a tile with two, three, four, even five or more grains on it. [...]

>>>

As your calculation shows, with 100 rice grains and 400 tiles, and so an expected value of 0.25 rice grains per tile, if we concentrate on a particular tile the probabilities of observing two or three rice grains on that tile (i.e. markedly more than the average) are 2.4% and 0.2%, respectively (incidentally, these probabilities are very close to those given by a Poisson approximation).
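To see how close the Poisson approximation is, one can put the two sets of single-tile probabilities side by side. A minimal sketch, assuming independent grains that are equally likely to land on any tile; the object names are illustrative only.

```r
n_grains <- 100
p_tile   <- 1 / 400          # chance that a given grain lands on one particular tile
x        <- 0:5              # number of grains on that tile

exact  <- dbinom(x, size = n_grains, prob = p_tile)   # binomial probabilities
approx <- dpois(x, lambda = n_grains * p_tile)        # Poisson with mean 0.25

comparison <- rbind(binomial = exact, poisson = approx)
colnames(comparison) <- as.character(x)
round(comparison, 5)
```

The two rows should agree to about three decimal places, which is what one would expect with a small per-grain probability and a mean of 0.25.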

On the other hand, if you ask about the probability that at least one of the 400 tiles has 2 or 3 rice grains on it, then the first is practically certain (99.99%) and the second is close to 50:50 (55.67%). O.k., the chance of finding a tile with 5 rice grains on it among the 400 is still very small (well, the article actually says "even five or more grains on it"), but then it is not clear whether the 400 tiles and 100 rice grains were really chosen to mimic the lab/researchers situation. So one should probably not concentrate too much on the comparison between 5 rice grains on one tile and 5 researchers in one lab.
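Since the article speaks of "five or more" grains, a variant of the simulation that asks whether any tile holds at least x grains (rather than exactly x) may be closer to the wording. The following is only a sketch under the same independence assumption; the function name throw_at_least and the seed are arbitrary choices, not something from the thread.

```r
set.seed(1)   # arbitrary seed, only so that the run is reproducible

# TRUE if at least one tile ends up with x or more grains
throw_at_least <- function(x, n_tiles = 400, n_grains = 100) {
  tiles <- sample.int(n_tiles, n_grains, replace = TRUE)   # tile hit by each grain
  max(tabulate(tiles, nbins = n_tiles)) >= x               # largest count on any tile
}

# Estimated probabilities for "2 or more", ..., "5 or more" grains on some tile
est <- sapply(2:5, function(x) mean(replicate(10000, throw_at_least(x))))
setNames(est, paste0(2:5, "+"))
```

Because "x or more" includes "exactly x", these estimates should come out at least as large as the corresponding "exactly x" figures, up to simulation noise.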

Cheers,

Berwin

29 Oct 2020 7:20 AM

Reply # 9331886 on 9331350

John Maindonald

Berwin, thanks for your response.

It is not clear what are the grains of rice and what are the tiles. I'd assumed that the grains of rice are 100 staff, and that the probability of falling on one particular tile is the probability of 1 in 400 that any individual staff member will be sick. If as you suggest the idea was that 100 staff with the disease are spread across 400 labs, I'd think that the disease is not all that rare. And on your calculation, the probability of 0.0025 for five in one of the 400 labs still strongly suggests that the occurrences are not independent, and makes nonsense of his claim that:

"We, know-all professors, had fallen into a gross statistical error. We had become convinced that the “above average” number of sick people required an explanation."

Such an occurrence, in one of 400 labs, does require explanation. There is an issue -- does the individual lab look at the probability that it will have five cases, or at the probability that one of 400 somewhere in the country will have 5 cases (and in the latter case, one should be checking across all 400 labs)?

An additional point is that the 100 occurrences across 400 labs (if that is what Rovelli had in mind) should be compared with the disease incidence in the general population. Is that number the expected number, given the incidence in the population?

As a side comment, if experimenters treated their p < 0.05 as just what one might expect when 20 labs are doing a broadly comparable experiment, there'd be many fewer papers published! As you suggest, it is the old chestnut, "What is the relevant reference population?"
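To put a number on the side comment: if 20 labs each run an independent test at the 5% level and there is no real effect anywhere, the chance that at least one of them sees p < 0.05 can be computed as below. This is a sketch that assumes independent labs and a true null hypothesis in every lab.

```r
alpha  <- 0.05   # significance level used by each lab
n_labs <- 20     # number of labs running a comparable experiment

# P(at least one lab gets p < alpha) when every null hypothesis is true
p_at_least_one <- 1 - (1 - alpha)^n_labs
round(p_at_least_one, 3)   # 0.642
```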

Cheers,
John.

Last modified: 29 Oct 2020 7:33 AM | John Maindonald

29 Oct 2020 3:49 AM

Reply # 9331350 on 9330556

Berwin Turlach

Hi John,

Thank you for the link to an interesting article.

Well, yes, if one is in a dining room with 400 tiles and throws 100 rice grains into the air, the rice grains will probably not fall uniformly onto the tiles. I would probably choose a bivariate normal distribution (or some other elliptic distribution) as a model (but how to choose all the parameters for that distribution?).

But even if we assume that every rice grain is equally likely to fall on any of the tiles and that the rice grains fall independently of each other, your calculation addresses the probability that a particular tile will have x rice grains on it, whereas the point seems to be to look at the probability that there is at least one tile with x rice grains on it. This is pretty much as in the Birthday problem: the difference is whether we ask for a match on one particular day or a match on any day.
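The Birthday problem contrast can be made concrete with a small calculation. This is only an illustration: the group size of 23 is a hypothetical choice, and birthdays are assumed to be uniform over 365 days. pbirthday() is in the stats package that ships with R.

```r
n_people <- 23   # hypothetical group size

# Match on one particular day: at least one person born on a given date
p_specific_day <- 1 - (364 / 365)^n_people

# Match on any day: at least two people share some birthday
p_any_day <- pbirthday(n_people, classes = 365, coincident = 2)

round(c(specific = p_specific_day, any = p_any_day), 3)
```

The first probability is roughly 6%, the second a little over 50%, which is the same kind of gap as between "this tile" and "some tile".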

So I think a relevant simulation would be:

> throw <- function(x) {
+   tiles <- sample(1:400, 100, replace = TRUE)  # tile hit by each of the 100 grains
+   any(table(tiles) == x)                       # TRUE if some tile has exactly x grains
+ }
> mean(replicate(10000, throw(2)))
[1] 0.9999
> mean(replicate(10000, throw(3)))
[1] 0.5567
> mean(replicate(10000, throw(4)))
[1] 0.0472
> mean(replicate(10000, throw(5)))
[1] 0.0025

So the chance of finding at least one tile with 3 rice grains on it is ~0.56. The chances of finding a tile with 4 or 5 rice grains on it are small, but not as small as in your calculation.
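As a rough cross-check on the simulated values, one can start from the single-tile binomial probability and pretend that the 400 tiles behave independently. They do not quite (the counts must add up to 100), so this is an approximation and not an exact result; the object names are illustrative.

```r
n_tiles  <- 400
n_grains <- 100
x        <- 2:5

# Probability that one particular tile has exactly x grains
p_one <- dbinom(x, size = n_grains, prob = 1 / n_tiles)

# Approximate probability that at least one of the tiles has exactly x grains,
# treating the tiles as if they were independent (an assumption)
p_any <- 1 - (1 - p_one)^n_tiles
round(setNames(p_any, x), 4)
```

The results should land close to the simulated 0.9999, 0.5567, 0.0472 and 0.0025, with any gap reflecting simulation noise and the independence shortcut.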

I guess the mathematician thought of the rice grains as "persons who have a disease" and the tiles as "possible labs". And he wanted to illustrate that the probability of having a lab with a concentration of cases is not as small as people might think it is (if there are many other similar labs around the world). So is there something going on in this particular lab where one has observed the concentration of cases? Or is it just the unlucky one that we would expect to see among all the labs that exist? And how should one set up the probability calculation post hoc, after having observed the cluster of cases?

Other examples that come to mind are discussions of the effect on one's health when living close to nuclear power plants, close to overland power lines, wind turbines, WiFi modems, and so forth. Some of these discussions are, of course, more scientific and serious than others. :-)

Cheers,

Berwin

28 Oct 2020 6:55 PM

Message # 9330556

John Maindonald

A recent Guardian article has the title "Statistical illiteracy isn't a niche problem. During a pandemic, it can be fatal"
https://www.theguardian.com/commentisfree/2020/oct/26/statistical-illiteracy-pandemic-numbers-interpret
It starts thus:

<<<
In the institute where I used to work a few years ago, a rare non-infectious illness hit five colleagues in quick succession. There was a sense of alarm, and a hunt for the cause of the problem. In the past the building had been used as a biology lab, so we thought that there might be some sort of chemical contamination, but nothing was found. The level of apprehension grew. Some looked for work elsewhere.

One evening, at a dinner party, I mentioned these events to a friend who is a mathematician, and he burst out laughing. “There are 400 tiles on the floor of this room; if I throw 100 grains of rice into the air, will I find,” he asked us, “five grains on any one tile?” We replied in the negative: there was only one grain for every four tiles: not enough to have five on a single tile.

We were wrong. We tried numerous times, actually throwing the rice, and there was always a tile with two, three, four, even five or more grains on it.

. . . We, know-all professors, had fallen into a gross statistical error. We had become convinced that the “above average” number of sick people required an explanation. Some had even gone elsewhere, changing jobs for no good reason.
>>>

Assuming rice grains fall independently on tiles, the probabilities of 0, 1, ..., 5 grains on one particular tile are as follows, with the R code used:

setNames(round(dbinom(0:5,100,.0025), 5), 0:5)
      0       1       2       3       4       5
0.77856 0.19513 0.02421 0.00198 0.00012 0.00001

Thus, the probability of 5 grains on one given tile is vanishingly small. (The author does not give this calculation.)

Then, what is the author's point with this example? Is it that rice grains thrown randomly by a bunch of mathematicians are unlikely to fall independently? If so, what is the relevance to the story of the five colleagues who had a rare non-infectious illness (probability for one such illness, maybe 1 in 400, or 0.0025)? Surely the point is then that the five illnesses are very unlikely to be independent events, and that it makes sense to look for a common cause. It appears to me that the author is himself guilty of a "gross statistical(?) error". Have I missed something?

For anyone who wants to simulate the probabilities, the following code may be used:

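n <- numeric(7)   # running counts of tiles holding 0, 1, ..., 5 and 6+ grains
for (i in 1:100000) {
  sam <- sample(1:400, 100, replace = TRUE)         # tile hit by each of the 100 grains
  tab <- table(sam)                                 # grains per occupied tile
  for (j in 2:6) n[j] <- n[j] + sum(tab == j - 1)   # tiles with 1, ..., 5 grains
  n[1] <- n[1] + 400 - length(tab)                  # tiles with no grains
  n[7] <- n[7] + sum(tab > 5)                       # tiles with 6 or more grains
}
setNames(round(n/sum(n), 5), c(0:5, "6+"))
      0       1       2       3       4       5      6+
0.77854 0.19516 0.02421 0.00197 0.00012 0.00001 0.00000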
