I'm a bit suspicious that any kind of "crisis" exists. While we can always improve on methods and statistics, the basic premise here that "lots of data => improbable events happening by chance" is not exactly obscure or difficult to guess. It's obvious as soon as you learn about Gaussian statistics or even earlier.
Suppose there are 100 ladies who cannot tell the difference between the tea, but take a guess after tasting all eight cups. There’s actually a 75.6 percent chance that at least one lady would luckily guess all of the orders correctly.
Now, if a scientist saw some lady with a surprising outcome of all correct cups and ran a statistical analysis for her with the same hypergeometric distribution above, then he might conclude that this lady had the ability to tell the difference between each cup. But this result isn’t reproducible. If the same lady did the experiment again she would very likely sort the cups wrongly – not getting as lucky as her first time – since she couldn’t really tell the difference between them.
This small example illustrates how scientists can “luckily” see interesting but spurious signals in a dataset. They may formulate hypotheses after seeing these signals, then use the same dataset to draw conclusions, claiming these signals are real. It may be a while before they discover that their conclusions are not reproducible. This problem is particularly common in big data analysis: because of the sheer size of the data, some spurious signals may “luckily” occur just by chance.
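Just to check the arithmetic for myself (this little calculation is mine, not the article's), a few lines of Python reproduce the effect: each individual lady has only a tiny chance of a perfect guess, but with a hundred independent tries a fluke "success" becomes more likely than not.

```python
from math import comb

# One lady guessing blindly: she must pick the 4 milk-first cups out of 8,
# and only 1 of the C(8, 4) = 70 possible selections is fully correct.
p_single = 1 / comb(8, 4)

# Chance that at least one of 100 independently guessing ladies gets
# every cup right purely by luck.
n_ladies = 100
p_at_least_one = 1 - (1 - p_single) ** n_ladies

print(f"P(one lady all correct)        = {p_single:.4f}")        # ~0.014
print(f"P(at least one of 100 correct) = {p_at_least_one:.3f}")  # roughly three in four
```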
We see this in radio astronomy all the time. With >100 million data points per cube, the chance of getting at least one interesting-but-spurious detection is close to 1.0, especially since the noise isn't perfectly Gaussian. We get around this by the simple process of doing repeat observations; I find it hard to believe that anyone is seriously unaware that correlation <> causation at this point. Charitably, the article may be over-simplifying. While there are certainly plenty of weird, non-intuitive statistical effects at work, I don't believe "sheer size of data set" is causing anyone to panic.
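To put rough numbers on the radio astronomy case (a back-of-the-envelope sketch of my own, assuming perfectly Gaussian noise, independent samples and a 5-sigma detection threshold, none of which is strictly true of real data), the same arithmetic shows both the problem and why repeat observations deal with it:

```python
from math import erf, sqrt

def tail_probability(sigma: float) -> float:
    """One-sided probability of pure Gaussian noise exceeding a sigma threshold."""
    return 0.5 * (1 - erf(sigma / sqrt(2)))

n_samples = 100_000_000            # ~10^8 independent samples in one data cube
p_single = tail_probability(5.0)   # ~3e-7 per sample for a 5-sigma cut

# Probability of at least one spurious "5-sigma" spike somewhere in the cube.
p_any = 1 - (1 - p_single) ** n_samples

# A repeat observation requires the same sample to exceed the threshold twice;
# with independent noise that squares the per-sample false-alarm probability.
p_survives_repeat = 1 - (1 - p_single ** 2) ** n_samples

print(f"P(false 5-sigma peak somewhere in the cube)    ~ {p_any:.3f}")              # close to 1
print(f"P(a false peak surviving a repeat observation) ~ {p_survives_repeat:.1e}")  # ~1e-5
```

In other words, a single 5-sigma peak in a survey of that size means very little on its own, but the same peak turning up in two independent observations is a different story.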
https://theconversation.com/how-big-data-has-created-a-big-crisis-in-science-102835
"I don't believe "sheer size of data set" is causing anyone to panic."
Maybe not about spurious signals, but certainly about anyone actually having time (or processing power) to find all the interesting objects in the massive data sets. I'm not personally involved, but I know people working on the ASKAP and SKA pipelines who are rightly worried about missing novel events because they have to throw so much data away in the early processing stages. We're lucky FRBs were discovered and proven to be real signals when they were, because I'm told that earlier versions of the ASKAP pipeline would have discarded that data as spurious. It's the same reason that actual scientists looking at the data flagged FRBs as weird artefacts until two telescopes with different gear observed one at the same time and we had actual replication.
Which is to say that big data has its pitfalls, but as you said, no one working on/with it is unaware of the problems. We all just have to be clever about it.