Sister blog of Physicists of the Caribbean. Shorter, more focused posts specialising in astronomy and data visualisation.

Friday 29 March 2019

Tidal tales of galaxies without dark matter

A year ago almost to the day, a team of astronomers led by Pieter van Dokkum reported the discovery of a galaxy apparently lacking dark matter, known as NGC 1052-DF2. Maybe not quite entirely lacking, but having so little compared to other galaxies that it's clearly very strange.

The implications of this object have yet to be fully realised. There was, predictably, criticism - some of it very silly, some of it much more valid. The dark matter mass of a galaxy can't be measured directly but must be inferred from how fast its stars and gas are moving. In this case there are only stars, which appear to be swirling about randomly rather than rotating in a disc, but this is quite normal and doesn't make much difference to the measurement procedure. The point is that there are two main ways the mass could have been underestimated : either the measured velocity dispersion was too low, or the estimated distance to the galaxy was too high.
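To make those two failure modes concrete, here's a minimal Python sketch of how the inferred masses scale. The coefficient and all the input values are illustrative placeholders rather than the published measurements - the point is just that the dynamical mass goes as the dispersion squared times the physical size (which scales linearly with distance), while the stellar mass inferred from the apparent brightness scales as the distance squared :

```python
# Illustrative numbers only - not the published measurements.
import numpy as np

G = 4.301e-3  # gravitational constant in pc (km/s)^2 / Msun

def dynamical_mass(sigma_kms, r_eff_pc, coeff=4.0):
    """Crude dispersion-based estimate, M_dyn ~ coeff * sigma^2 * R_eff / G.
    The coefficient is of order a few and depends on the estimator used."""
    return coeff * sigma_kms**2 * r_eff_pc / G

# We measure an angular size, so the physical radius (and hence the dynamical
# mass) scales linearly with the assumed distance, while the stellar mass
# inferred from the apparent brightness scales as distance squared.
for d_mpc in (10.0, 20.0):                  # a "closer" and a "further" scenario
    r_eff = 2000.0 * (d_mpc / 20.0)         # placeholder : ~2 kpc if at 20 Mpc
    m_star = 2e8 * (d_mpc / 20.0)**2        # placeholder stellar mass
    m_dyn = dynamical_mass(8.0, r_eff)      # placeholder dispersion ~ 8 km/s
    print(f"D = {d_mpc:4.0f} Mpc : M_dyn ~ {m_dyn:.1e} Msun, "
          f"M_dyn / M_star ~ {m_dyn / m_star:.1f}")
# Halving the distance roughly doubles M_dyn / M_star, which is why a smaller
# distance would make the galaxy look much more ordinary; a higher velocity
# dispersion would do the same by raising M_dyn directly.
```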

Since the galaxy is extremely faint, van Dokkum et al. originally used the galaxy's globular clusters to measure the velocity dispersion as these are brighter and easier to measure. This was only possible for a handful of objects, and one of them hinted at a much higher velocity dispersion. This led to a lot of criticism about their statistical inference of the true dispersion, though van Dokkum refuted this (personally I thought their low dispersion measurement seemed much more plausible from the data than the claims of a higher one). Subsequently they improved their observations, which eliminated their errant data point : as with most weird outliers, it was a spurious measurement. And later they improved their measurements still further through deeper observations that allowed them to get the dispersion directly from the stars, showing pretty definitively that they are indeed moving very, very slowly.

If the velocity dispersion is correct, what about the distance ? That came in for stronger criticism, which showed that the galaxy would become a pretty typical object if it were rather closer than reported. The original team hit back, saying that their original measurement was correct. Then a completely independent team did their own measurements, which agreed with van Dokkum's original claim.

So it looks as if this is a genuinely weird object. Does it represent a distinct class of objects or just an example of exotica ? It's too early to tell, but van Dokkum found a second, very similar galaxy close to the first one, while another team found a much smaller, much closer galaxy that also appears to be deficient in dark matter. The latter group also investigated how such an object could form, suggesting it might be in part due to tidal encounters with more massive objects that could rip out much of the dark matter. That echoes an earlier study claiming a similar origin was possible for NGC 1052-DF2.

The paper below reports deeper observations of the NGC 1052 group and finds several tidal structures : streams and arcs of stars pulled out of galaxies by their gravitational interactions. Here's their main figure :

The second galaxy claimed without dark matter is NGC 1052-DF4, upper right.
It's abundantly clear from this that tidal interactions have played a role in the evolution of this group. NGC 1052-DF2 is, intriguingly, very close indeed to a narrow stream originating from NGC 1052 itself. So could this enigmatic object just be some form of transient debris ? Almost certainly not. The structure and density of DF2 are very different from the stream, and it's not obvious why a thin, faint stream should be broken at the end but related to a smooth, much higher density object just a bit further away (if it was embedded within the stream, though, that would be a different matter). The authors conclude the alignment is most likely just pure chance.

Could it be something more stable but produced in a similar way - a tidal dwarf galaxy ? If enough material is pulled from a galaxy, it can sometimes become self-gravitating and collapse to form a stable object without any significant dark matter needed to bind it together. These are controversial objects : while there are some clear examples known which are still embedded in their parental streams, it's much harder to decide if an isolated object was formed by this mechanism. While the authors here favour an old tidal origin for these objects, tidal dwarfs should be much more vulnerable to disruption than normal galaxies because they're much less massive (and it should be noted that the authors are known fans of tidal dwarf galaxies).

Personally I'd say that the presence of tidal streams makes the case for a tidal origin more likely, but it's nowhere near definitive. One of the main points of controversy is whether all the galaxies here are at the same distance, so the presence of streams is pretty strong confirmation of that. That we know the galaxies are interacting with each other does make it more plausible that there's tidal debris floating around, but we suspected that anyway. So we're no closer to knowing if the interactions within this group are capable of producing objects similar to the weird galaxies that have been detected there, or if they could survive in such an environment for several billion years. Crucially, a point which remains unaddressed is that the velocity dispersions are so low that it's difficult to see how the objects could have become this smooth and spherical given the available timescale - the age of the Universe means there's only been time for their stars to complete a couple of orbits !

Finally, what do these galaxies mean for physics ? That too remains highly controversial. If they're formed by tidal encounters, how many other galaxies might have been produced in the same way ? If they're not, then does that mean we have to revise our theory of galaxy formation and/or gravity ? Alas it seems that this object cannot be used to rule out theories of modified gravity - as van Dokkum admitted, earlier claims about this were simply wrong; the objects appear to be equally consistent with both the standard model and its alternatives. A year on, these objects remain as controversial and enigmatic as ever.

A tidal tale: detection of multiple stellar streams in the environment of NGC1052

The possible existence of two dark matter free galaxies (NGC1052-DF2 and NGC1052-DF4) in the field of the early-type galaxy NGC1052 presents a challenge to theories of dwarf galaxy formation according to the current cosmological paradigm.

Friday 22 March 2019

Lessons on abusing statistical significance

I'm not a professional statistician, so I don't claim to understand the nuances of every kind of statistical test. Nevertheless using certain tests is an unavoidable part of my work, and I do try and think about what they really mean. I tend to agree with the article below.
Let’s be clear about what must stop: we should never conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero. Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions.
We are not calling for a ban on P values. Nor are we saying they cannot be used as a decision criterion in certain specialized applications (such as determining whether a manufacturing process meets some quality-control standard). And we are also not advocating for an anything-goes situation, in which weak evidence suddenly becomes credible. Rather, and in line with many others over the decades, we are calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis. 
Basically they are saying that the term "statistical significance" should not be used in the binary way it currently is. Significance is a matter of degree, and there simply isn't a hard-and-fast threshold to it. There is no single number you can use to decide if variables are or are not connected to each other. I emphasised the words "just because" above because it's important to remember that there are ways to irrefutably demonstrate a causal connection - just not by P-values alone.
We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence... singling out one particular value (such as the null value) in the interval as ‘shown’ makes no sense.
When talking about compatibility intervals, bear in mind four things. First, just because the interval gives the values most compatible with the data, given the assumptions, it doesn’t mean values outside it are incompatible; they are just less compatible. Second, not all values inside are equally compatible with the data, given the assumptions. Third, like the 0.05 threshold from which it came, the default 95% used to compute intervals is itself an arbitrary convention. Last, and most important of all, be humble: compatibility assessments hinge on the correctness of the statistical assumptions used to compute the interval. In practice, these assumptions are at best subject to considerable uncertainty.
What will retiring statistical significance look like? We hope that methods sections and data tabulation will be more detailed and nuanced. Authors will emphasize their estimates and the uncertainty in them — for example, by explicitly discussing the lower and upper limits of their intervals. They will not rely on significance tests.  Decisions to interpret or to publish results will not be based on statistical thresholds. People will spend less time with statistical software, and more time thinking.
This is a sentiment I fully endorse. Whether it will really work in practice or if people do ultimately have to make binary decisions, however, is something I reserve judgement on.

I've previously shared some thoughts on understanding the very basics of statistical methodologies here, albeit from a moral rather than mathematical perspective. In light of the article, let me very briefly offer a few more quantitative experiences of my own here. Some of them can seem obvious, but when you're working on data in anger, they can be anything but. Which is exactly how science works, after all.


-1) Correlation doesn't equal causation
I'm assuming you know this already so I don't have to go over it.


0) Beware tail-end effects
The more events you have, the greater the number of weird events you'll have from chance alone. If you have a million events, it shouldn't surprise you to see a million-to-one event actually happening.
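A quick sanity check of that intuition in Python (the numbers are just the textbook case) :

```python
# The chance that at least one "million-to-one" event shows up somewhere
# among a million independent trials.
p_single = 1e-6          # probability of the weird event in any one trial
n_trials = 10**6

p_at_least_one = 1 - (1 - p_single)**n_trials
print(f"P(at least one million-to-one event in a million trials) "
      f"= {p_at_least_one:.2f}")   # ~0.63, i.e. more likely than not
```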


1) Being statistically significant is not necessarily evidence for your hypothesis
This is, as far as I can tell, the principal sin of those advocating that satellite galaxies are commonly arranged in planar structures. What their tests are actually measuring is the probability that such an arrangement could occur by random chance, not whether it's compatible with being in a plane. That's a rather specific case of using wholly the wrong sort of test though. More generally, even if you do use the "correct" test, that doesn't necessarily rule out other interpretations. Showing that the data is compatible with your hypothesis does not mean that other explanations won't do equally well or better even at the level of fine details - my favourite example can be found here.

In essence, if a number is quoted demonstrating significance, the two main questions to ask are :
1) Does this number relate to how unlikely the event is to have occurred randomly, or does it directly measure compatibility with a proposed hypothesis ?
2) Have other hypotheses been rendered less likely as a result of this ?
You can call your result evidence in favour of your hypothesis if and only if it shows that your hypothesis is in better agreement with the data than alternatives. Otherwise you've probably only demonstrated consistency or compatibility, which is necessary but not sufficient.
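Here's a toy sketch of the distinction in Python - entirely made-up data, but it shows how measurements can be wildly inconsistent with the null hypothesis while doing essentially nothing to separate two competing explanations from each other :

```python
# Toy data drawn from a normal distribution with mean 1 : clearly "significant"
# against a null hypothesis of mean 0, but that alone says nothing about
# whether hypothesis A or hypothesis B is the better explanation.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
data = rng.normal(loc=1.0, scale=1.0, size=50)

for label, mu in (("null (mean = 0.0)", 0.0),
                  ("A    (mean = 1.0)", 1.0),
                  ("B    (mean = 1.2)", 1.2)):
    loglike = norm.logpdf(data, loc=mu, scale=1.0).sum()
    print(f"{label} : total log-likelihood = {loglike:.1f}")
# The null is ruled out by a huge margin, but A and B fit almost equally well :
# rejecting "random chance" is not the same as supporting one specific hypothesis.
```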


2) You're all unique, just like everyone else
There are an essentially infinite number of ways of arranging the hairs on your head, so the chance that any two people will have identical configurations is basically nil. But often such precision is insane, because most of those configurations will look extremely similar to each other. So if you want to work out the chance of people having a similar overall distribution of hairs, you don't need to look at individual follicle placement. What you need to think about is what salient parameter is relevant to what you're studying, e.g. median follicle density, which will be very similar from person to person. You may or may not be interested in how the density varies from region to region on individual heads, and if so you won't be able to just measure global properties like full head size and total follicle count (see point 5).


3) Everyone is weird
Everyone is not only unique, but they are also statistically unusual. This is known as the flaw of averages : the counter-intuitive observation that few people indeed are close to the average even on a small number of parameters. The more parameters you insist on for your average person, the fewer such people you'll find. There is, quite literally, no such thing as an ordinary man (as Doctor Who once put it).
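A quick simulation makes the point (the half-standard-deviation cut for "average" is an arbitrary choice of mine) :

```python
# How quickly "average" people vanish : the fraction of a simulated population
# within half a standard deviation of the mean on k independent traits.
import numpy as np

rng = np.random.default_rng(0)
n_people, n_traits = 100_000, 10
traits = rng.standard_normal((n_people, n_traits))

for k in (1, 3, 5, 10):
    average_on_all = (np.abs(traits[:, :k]) < 0.5).all(axis=1)
    print(f"{k:2d} traits : {average_on_all.mean():7.3%} are 'average' on every one")
```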

Another similar factor cited against planes of galaxies studies is the look elsewhere effect, which is in some sense a variation of the flaw of averages. Suppose you flick paint at a wall. You could compare different configurations by measuring how much paint splatter there is, but this would miss subtleties like the pattern the paint makes. You might find, for instance, that because your flicking motion has systematic characteristics, you could get linear features more often than you expect by chance.

What the "look elsewhere" effect says is that you should compare how frequently different types of unlikely features crop up. To take a silly example, it's fantastically unlikely that by random chance you'd get a paint splatter that looks like a smiley face. And it's equally unlikely that you'd get some other coherent structure instead, like a star or a triangle or whatever. But collectively, the chance that you get any kind of recognisable pattern is much, much greater than the chance you'd get any one specific pattern. Of course, how specific the parameters have to be depends on what you're interested in : if you keep getting smiley faces then something genuinely strange is going on.


4) Probabilities are not necessarily independent
This is a lesson from high school mathematics but even some professional astronomers don't seem to get it so it's worth including here. If you roll a fair dice, the chance you get a six is always, always, always one in six. So the chance of rolling two sixes in a row is simply 1/6 * 1/6, i.e. 1 in 36. There are 36 different outcomes and only one way of rolling two sixes. You can multiply probabilities like this because the events are completely independent of each other - the dice has no memory of what it did previously.

If you have more complex systems, then it's true that if you expect the events to be independent, you can multiply the probabilities in the same way. But remember point 1 : your hypothesis is not necessarily the only hypothesis. The events may be connected in a completely different way to how you expect (e.g. a common cause you hadn't considered) so you can't, for example, infer that there is a global problem from a very small sample of weird systems if there's any connection between them.
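A toy demonstration of how badly the naive multiplication can fail when there's a hidden common cause (the probabilities are arbitrary) :

```python
# If two individually rare events share a common cause, the chance of seeing
# both is far higher than the naive product of their probabilities suggests.
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000
common_cause = rng.random(n) < 0.01          # e.g. an unrecognised shared systematic

event_a = common_cause | (rng.random(n) < 0.01)
event_b = common_cause | (rng.random(n) < 0.01)

p_a, p_b = event_a.mean(), event_b.mean()
print(f"Naive product P(A)P(B)  ~ {p_a * p_b:.4f}")
print(f"Actual joint P(A and B) ~ {(event_a & event_b).mean():.4f}")
```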


5) You can't quantify everything (at least not easily)
Measuring the global properties of systems like hair and paint is easy. But if you want to look for patterns, that can be much harder. Global properties like standard deviation, for example, tell you nothing whatsoever about arrangements.

Imagine you had a set of one hundred hairs and arranged them all in height order. To a casual visual inspection, a pattern would be obvious. But the standard deviation of heights doesn't depend on the ordering or position in the slightest, so it can often be far easier to spot such trends by a simple visual inspection. At the same time, visual inspection doesn't provide you with the quantitative estimates that are often very valuable (for a visual example see this recent post). The take-home message is to look at and measure your data in as many different ways as you can, but take heed of the next point...
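For what it's worth, here's the hair example in a few lines of Python - identical standard deviations, completely different appearance :

```python
# Sorting the "hairs" by height creates an obvious pattern to the eye, but the
# standard deviation is completely blind to the ordering.
import numpy as np

rng = np.random.default_rng(3)
heights = rng.normal(50.0, 5.0, size=100)     # one hundred hair lengths, say

shuffled = rng.permutation(heights)
ordered = np.sort(heights)

print(f"std (random order) = {shuffled.std():.4f}")
print(f"std (sorted)       = {ordered.std():.4f}")   # identical : position is ignored
```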


6) Beware of P-hacking
A classic approach to data analysis is to throw everything against a wall and see what sticks. That is, you can plot everything against everything and look for "significant" trends. This is an entirely legitimate approach, but point 0 applies to trends as well as individual events. Hence, for example, the infamous spurious correlations website we all know and love.

There's nothing wrong in principle with trying to find trends, as long as you remember they aren't necessarily physically meaningful. Indeed, a correlation can even appear from pure randomness if the data range is limited (see comments). So you should probably start by searching for trends where a physical connection is obvious, and if you do find one which is more unexpected, compare different data sets to see if it's still present. If it's not present in all, or only emerges from the whole, it may well be spurious.
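A quick illustration of why such trends turn up so easily :

```python
# Generate a pile of completely unrelated variables, correlate everything
# against everything, and the "best" pair will usually look impressive.
import numpy as np

rng = np.random.default_rng(4)
n_vars, n_points = 50, 20
data = rng.standard_normal((n_vars, n_points))   # 50 purely random variables

corr = np.corrcoef(data)
np.fill_diagonal(corr, 0.0)                      # ignore self-correlations
i, j = np.unravel_index(np.abs(corr).argmax(), corr.shape)

print(f"Strongest 'trend' : variables {i} and {j}, r = {corr[i, j]:+.2f}")
# With 50 x 49 / 2 pairs to pick from, |r| > 0.6 from pure noise is unremarkable.
```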

While visual inspection has its limitations, it's damn good at spotting patterns : so good it's much more prone to false positives than missing real trends. So as a rule of thumb, if you can see a pattern by eye, go do some statistical measurements to try and see how significant it is. If you can't see a pattern by eye at all, don't bother trying to extract any through statistical jiggery-pokery, or at the very least not by reducing complex data to a single number.


7) Randomness is not always multi-dimensional
If you have a random distribution of points at one moment, and then move them all a small amount in random directions, you'll end up with another random distribution of points. But... their positions will not be random with respect to their previous positions ! I learned this the hard way while trying to approximate the effects of a random-walk simulation with a parametric approach.
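In code form (a minimal sketch, not my actual simulation) :

```python
# Each snapshot is random, but the snapshots are not random with respect to
# each other : a small random nudge leaves the positions almost perfectly
# correlated with where they were before.
import numpy as np

rng = np.random.default_rng(5)
x_old = rng.uniform(0.0, 100.0, size=10_000)        # random positions
x_new = x_old + rng.normal(0.0, 1.0, size=10_000)   # small random displacements

print(f"Correlation between old and new positions : "
      f"{np.corrcoef(x_old, x_new)[0, 1]:.3f}")      # ~ 1, not ~ 0
```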


8) Causation doesn't always equal correlation
Wait, shouldn't that be the other way around ? Nope. We all know (hopefully) that two variables can be correlated for reasons other than a causal connection, e.g. a common root cause. Somewhat less commonly appreciated is that a causal connection does not always lead to a correlation, thus making any statistical parameter useless. This can happen if the dependency is weak and/or there are other, possibly stronger influences as well. Disentangling the trend from the highly scattered data can be difficult, but that doesn't mean it isn't there.

EDIT : I recently learned that this is known as Simpson's Paradox, and there's a fantastic illustration of it here.
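Here's a toy version in Python - made-up numbers, but it shows the within-group trend surviving while the pooled correlation flips :

```python
# Made-up numbers : within each group y really does increase with x, but a
# confounder (the group offset) flattens - here, reverses - the pooled trend.
import numpy as np

rng = np.random.default_rng(6)
xs, ys = [], []
for offset in (0.0, 5.0, 10.0):
    x = rng.uniform(0.0, 5.0, 200) + offset
    y = 0.5 * x - 2.0 * offset + rng.normal(0.0, 0.5, 200)
    print(f"group offset {offset:4.1f} : r = {np.corrcoef(x, y)[0, 1]:+.2f}")
    xs.append(x)
    ys.append(y)

x_all, y_all = np.concatenate(xs), np.concatenate(ys)
print(f"all groups pooled  : r = {np.corrcoef(x_all, y_all)[0, 1]:+.2f}")
```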


9) Quantity has a quality all of its own
Suppose you have fifty archers shooting at fifty targets. One of them is a professional and the rest have never held a bow before in their lives. Who's more likely to hit the bullseye first ? This is closely related to the last point (see the same link for details) but this point is distinctly different enough that it's worth including separately.

The answer is that the professional is much more likely to hit the bullseye first when compared to any other individual archer. But if you compare them to the whole of the rest of the group, the tables could turn (depending on exactly how good the professional is and how incompetent everyone else is). For every shot the professional makes, the rest have forty-nine chances to succeed. So the professional may appear to be outclassed by an amateur, even though this isn't really the case.

Investigating the role of a particular variable can therefore be more complicated than you might expect - using realistic distributions of abilities or properties can sometimes mask what's really going on. That the most extreme outcome is found in an apparently mediocre group can simply result from the size of the group as long as there's an element of chance at work. Having more archers may be better than having no archers at all, but if two sides are equal in number but differ only in skill, the skilled ones will win every time. But exactly how many crappy archers you need to beat one good one... well, that depends on the precise skill and numbers involved.
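If you want to convince yourself, here's a crude simulation - the per-shot probabilities are invented, and ties are (unfairly) awarded to the amateurs, but the qualitative result is the point :

```python
# Crude archery simulation with invented per-shot probabilities.
import numpy as np

rng = np.random.default_rng(7)
p_pro, p_amateur, n_amateurs = 0.2, 0.01, 49
n_sims = 20_000

pro_first = 0
for _ in range(n_sims):
    while True:                                  # everyone shoots once per round
        pro_hit = rng.random() < p_pro
        amateur_hit = (rng.random(n_amateurs) < p_amateur).any()
        if pro_hit and not amateur_hit:
            pro_first += 1
            break
        if amateur_hit:                          # an amateur got there first (or tied)
            break

print(f"The professional hits the bullseye first in {pro_first / n_sims:.0%} of contests,")
print(f"despite being {p_pro / p_amateur:.0f}x more likely to hit it on any given shot.")
```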



Some of these seem obvious but only after the fact. Beforehand they can be bloody perplexing. Statistics are easy to mess up because there are so many competing factors at work and this way of thinking does not come naturally. Everyone makes mistakes in this. This does not mean all statistics are useless or that they are always abused to show whatever the authors want. Rather, a better message is that we should, as always, strive to find something that would overturn our findings - not simply pronounce inflexible proof or disproof based on a single number.

Scientists rise up against statistical significance

When was the last time you heard a seminar speaker claim there was 'no difference' between two groups because the difference was 'statistically non-significant'? If your experience matches ours, there's a good chance that this happened at the last talk you attended.

Thursday 21 March 2019

Women in astronomy have equal career prospects to men

Well, maybe. From various anecdotal discussions with colleagues over the years this appears to be variable. In my experience, the gender balance at the start of a degree was extremely male-biased. By the end of the Masters it was close to equal. It remained that way throughout the PhD process. But beyond that, because everyone disperses to far-flung pastures, it became basically impossible to judge. Certainly there do seem to be fewer female than male astronomers, in general.

This study basically concludes that there is a bias in getting women to do a PhD, but once they get one, their career prospects are exactly equal to those of men. Their chances of being hired into a permanent academic career, or of leaving the field, are the same as men's, year on year after completing their PhD. They are no more or less likely to leave astronomy than men. However, PhD programs produce far fewer female than male astronomers (30:70 split), which explains why there are fewer women in astronomy overall (it could also be that there were biases in the past which have now been overcome). A woman who completes her PhD has exactly equal career prospects to her more numerous male counterparts.

What this doesn't address, however, is the PhD program itself. At what point does this 30:70 split emerge ? Do women simply not enrol for a PhD at all or do they drop out more frequently ? This is not addressed here, but would be a logical follow-up.

The sample is drawn from public PhD alumni and dissertation lists posted on the webpages of major PhD-granting graduate programs across the United States. We attempted to find all such listings by searching the webpages of 34 medium-to-large US PhD programs in astrophysics as listed in the American Institute of Physics (AIP) roster of astronomy programs.

My major query about the study would be whether these lists are complete. As far as I know this is not the case in the U.K.

Out of the initial sample of 1154, we removed 91 individuals for the various reasons described above, leading to a final sample of 1063 for the outcome analysis (a further 37 individuals were excluded from hiring-time based analyses only). Of these, 748 are male (70.4%) and 315 (29.6%) are female, consistent with statistics on the gender ratio of astronomy PhDs compiled elsewhere. Within this sample, 672 progressed to long-term careers in astronomy; 273 left and went into careers outside astronomy; 118 were still postdocs or in short-term contract-based positions at the time the analysis was conducted (late 2018).

Our study focuses on the transition in and out of the postdoctoral phase, and so we record only the first long-term position (and not later career moves or promotions.) However, we did also note any cases in which an individual left the field after securing a long-term astrophysics position. These were quite rare (12 men and 2 women, out of 672 total hires), suggesting that “long term” employment (as we have defined it) does indeed represent the start of a lifetime career in the discipline.

Well I mean that's certainly very good news, but the hiring rate of postdocs to permanent seems awfully high. Perhaps there's a European bias, or perhaps my anecdotal experience is simply wrong, but the general consensus seems to be that there are far, far fewer permanent jobs available than postdoc positions. This is often reported as one of the major reasons people don't pursue a career in astronomy. Still, here's their money plot, showing the hiring/leaving fractions after PhD (men in black, women in red) :


In summary, there is no evidence for any significant difference in career outcomes between male and female astronomy PhDs in the United States. They do not directly explain the reasons for the lower fraction of women (∼15%) in more advanced career roles. However, we do note that the PhD numbers by gender show a large increase in the fraction of women over the period of the study (from 15% in 2000–2001 to 34% in 2011–2012), suggesting that a primary cause is a lower fraction of women in earlier PhD generations relative to more recent years. In any case, we firmly rule out the claim that women postdocs leave the field at three times the rate of men.

I certainly hope this is true, though the results seem almost too good. Of course, being employed is not the only mark of success - for example proposals from women have been traditionally treated unfairly. Citation and publication rates would be another factor to consider, as well as time to reach senior positions. The results of this study are encouraging, but the other known biases make me just a little cautious.

Gender and the Career Outcomes of PhD Astronomers in the United States

I analyze the postdoctoral career tracks of a nearly-complete sample of astronomers from 28 United States graduate astronomy and astrophysics programs spanning 13 graduating years (N=1063). A majority of both men and women (65% and 66%, respectively) find long-term employment in astronomy or closely-related academic disciplines.

Wednesday 20 March 2019

The Datasaurus strikes again

Here's an interesting example of something cool I learned on the internet having direct applications in my day job.

The current paper I'm working on involves a rather obscure procedure for cleaning radio data. As usual I'm dealing with an HI data cube that consists of multiple velocity channels. Each channel maps the same part of the sky at 21 cm wavelength at very slightly different frequencies. The exact frequency gives us the precise velocity of the galaxies, letting us estimate their distance and measure their rotation speed.

How you go about displaying all these channels in a paper depends on what you want to show. For what I'm working on, the best approach is to sum them all up, "stacking" the data to create a single 2D image.
Each big white blob is the HI content of a galaxy. You can see most of these quite clearly, but there are a lot of ugly structures in the noise caused by systematic effects in the data collection (the details don't matter for the point I'm going to make). If the noise was perfectly Gaussian and distributed around a mean of zero, this shouldn't happen : summing the noise should add equal numbers of positive and negative values and tend to cancel out. So we should get a nice flat, uniform background against which even the faintest galaxies stand out clearly.

That's obviously not what we see here - there are variations from point to point, so some regions become much brighter through this stacking than others. Hence it looks ugly as sin.
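For anyone who hasn't worked with these data, the stacking step is literally just a sum along the velocity axis. Here's a minimal sketch with an idealised, noise-only cube standing in for the real data :

```python
# An idealised noise-only cube : stacking is just a sum along the velocity axis.
import numpy as np

rng = np.random.default_rng(8)
n_chan, ny, nx = 200, 128, 128
cube = rng.normal(0.0, 1.0, size=(n_chan, ny, nx))   # perfect Gaussian noise

stacked = cube.sum(axis=0)                           # the 2D "stacked" image
print(f"per-channel rms ~ 1.0, stacked rms ~ {stacked.std():.1f} "
      f"(expected sqrt({n_chan}) = {np.sqrt(n_chan):.1f})")
# With ideal noise the stack stays flat and the rms just grows as sqrt(N_channels).
# Real data aren't this kind : systematics correlate the channels, so structure
# that ought to cancel builds up instead.
```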

Some time ago I found a simple way to clean this up, by fitting polynomials to each spectrum - essentially subtracting the average values at each point before they're summed up. Then when you stack this cleaned data, you get a very much nicer result (same colour scale in both plots) :
Huzzah ! Looks ten times better, doesn't it ? But when I measured the standard deviation of the image, I was surprised to see that although there was a change, it wasn't a big one. The distribution of the flux in the raw cube is shown in red and the cleaned cube in blue :


So yeah, the flux distribution has changed, but not by much - the cleaning decreases the standard deviation only by about 30%. So how come the images look so strikingly different, again bearing in mind that the same colour scale was used ? This was a bit of a puzzle until I remembered one of my favourite statistical lessons from the internet : the ferocious datasaurus.


The authors of that wonderful little study realised that the mean and standard deviation of the position of a set of points don't tell you anything about the individual positions. If you move one point, you can move another so as to compensate, leaving the mean and standard deviation unchanged. They found a clever way to move the points so as to create specific images. Each frame of the animation shows data of identical statistical parameters but with individual points at very different locations. Or in other words, the mean position of a set of points gives you absolutely no indication as to whether it looks like a line, a star, or a dinosaur. You can't use simple global statistics to determine that : you have to visualise the data.

Something very similar is going on in the data cleaning procedure. To convince myself of this, I quickly wrote a code that takes pairs of random pixels and swaps their positions. It doesn't change the flux values, so the noise distribution is identical. I realised that it's not about the data range at all : coherent structures appear in the noise because there's ordered, correlated structure present; lines appear because pixels of similar flux appear next to each other. The overall distribution of flux is absolutely irrelevant for the visibility of the structures.

To show this, I first removed the galaxies by crudely eliminating all the pixels above some flux threshold. If I didn't do that then those much brighter pixels would be randomised as well, making it much more difficult to see the effects on the noise. So here's the uncleaned data with the galaxies removed :
And here's the same, uncleaned data after just randomly swapping pairs of pixels :


Flat as a pancake ! The structures in the original data set have nothing to do with sensitivity at all, but with systematic effects. The noise appears worse only because pixel intensity is spatially correlated, not because there's any greater overall variation in the flux values.
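My actual code swapped randomly-chosen pairs of pixels; a full shuffle does the same job and is easier to write down. Here's a self-contained sketch with fake "structured" noise standing in for the real cube (in the real data I also blanked the bright galaxy pixels first, which I've skipped here) :

```python
# Fake "noise" with coherent ripples standing in for the real systematics.
import numpy as np

rng = np.random.default_rng(9)
ny, nx = 256, 256
structured = (rng.normal(0.0, 1.0, size=(ny, nx))
              + 0.5 * np.sin(np.arange(nx) / 5.0))   # adds vertical stripes

# Shuffle the pixel positions : the flux values (and so the histogram and the
# standard deviation) are untouched, but the spatial correlations are destroyed.
shuffled = rng.permutation(structured.ravel()).reshape(ny, nx)

print(f"std before shuffling : {structured.std():.4f}")
print(f"std after shuffling  : {shuffled.std():.4f}")   # identical
# Displayed with the same colour scale, only the first image shows the stripes.
```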

What I like about this is that it shows the important lesson of data visualisation from the datasaurus used in anger, but also that visualising the data can itself be deceptive. Visualisation lets you see at a glance structures that would be ferociously difficult to quantify. At the same time, when it comes to estimating statistical properties, your eye can be misleading : you can't see the standard deviation of an image just by looking at it. Some information you just can't get except through quantitative measurement. Trust the numbers, but remember that they don't tell you everything.

Tuesday 19 March 2019

FAST begins to CRAFT its HI surveys

You may remember that I wrote a short piece on the sensitivity of China's giant FAST telescope back in 2016 as its construction ended. The 500 m size of FAST is a bit misleading because they only use 300 m to collect radio waves, while at Arecibo they use about 225-250 m. So FAST won't be massively more sensitive than Arecibo, nor have much of an improvement in resolution. Its main improvement in terms of telescope design is that it will be able to cover about twice as much sky as Arecibo, so it can detect more galaxies overall but not many more per unit area.

Since then FAST has been undergoing commissioning to prepare for actual science, and today on astro-ph we get some substantial reports as to how well this is going. The first paper is a detailed report on the telescope characteristics and performance, with a basic overview of the instruments, while this second one is specifically about the prospects of extragalactic HI (atomic hydrogen) surveys : i.e. what I'm interested in.


Size matters

There's more to a telescope than just the size of its dish. The dish acts as the ultimate constraint on sensitivity, a theoretical limit you can't exceed. But receivers matter too. The more pixels you have (we call them "beams" in radio astronomy), the wider the field of view and the faster you can reach any given sensitivity level.

While optical instruments can have millions or even billions of pixels, at HI wavelengths it's difficult to get more than a handful simply because the wavelength of the radiation (21 cm) is so large. So for example the Australian Parkes telescope has a 13-beam receiver, Britain's Jodrell Bank has 4, while America's Arecibo has 7 (with plans to upgrade to the equivalent of 40). FAST will have an impressive 19.
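To put rough numbers on what that buys you (assuming, unrealistically, identical per-beam sensitivity everywhere) :

```python
# Back-of-the-envelope only : for a fixed target sensitivity and identical
# per-beam performance (a big assumption - dish area, system temperature and
# bandwidth all differ in reality), survey speed simply scales with beam count.
beams = {"Jodrell Bank": 4, "Arecibo (ALFA)": 7, "Parkes": 13, "FAST": 19}

reference = beams["Arecibo (ALFA)"]
for telescope, n_beams in beams.items():
    print(f"{telescope:15s}: ~{n_beams / reference:.1f}x the survey speed of ALFA")
```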

FAST has already begun basic HI observations (single pointings, i.e. measuring the HI at individual points on the sky). Here they reveal their plans for the Commensal Radio Astronomy FasT Survey, a project to survey most or all of the sky visible to FAST. So just how much better will this mighty instrument be ?

The authors of this paper are bullish. When the ALFA receiver was first installed on Arecibo, three surveys were designed to take advantage of its capabilities. The largest and most important of these was ALFALFA, which covered about 7,000 square degrees - half the sky visible to the telescope. It took more than a decade to go from first observations to its final catalogue of over 31,000 galaxies. The other two surveys were smaller but deeper : AGES, which is more sensitive, will probably have 2-3,000 detections when all is said and done; AUDS is very much smaller again with only around 100 detections.

CRAFTS will follow a similar strategy to ALFALFA and they're expecting a somewhat improved sensitivity level (as much as a factor two). In this paper the authors set out the main characteristics of the survey (resolution, sensitivity, etc.) and predict how many detections they expect. Their expectation is around 600,000 detections : almost twenty times more than ALFALFA !


Will this make everything else obsolete ?

Or in other words : is this figure credible ? I'm skeptical. Their coverage area will be a bit over 20,000 square degrees : almost three times that of ALFALFA, so we should expect around 90,000 galaxies at the same sensitivity level. But of course their sensitivity level will be about twice as good as ALFALFA (or probably a bit less). If detection numbers scaled linearly with sensitivity, we might therefore expect around 180,000. That's a huge number, but far short of the 600,000 claimed.

The question is whether the number of detections does scale linearly with sensitivity. In my experience, it's not as good as that. AGES is about four times as sensitive as ALFALFA but our detection rate per square degree is about three times the ALFALFA level. That's based on real-world examination of actual data, though it's limited because AGES is small - and our detection rates do vary just because galaxy numbers vary across the sky.
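Putting the back-of-the-envelope numbers in one place, with the area ratio from the coverage figures above and the sensitivity scaling exponent left as an explicit assumption (1.0 for linear scaling; ~0.8 is roughly what the AGES comparison suggests) :

```python
# Rough detection-count estimates under different assumed scalings.
alfalfa_detections = 31_000
area_ratio = 20_000 / 7_000      # CRAFTS sky coverage relative to ALFALFA
sensitivity_gain = 2.0           # hoped-for improvement in depth (optimistic end)

for exponent, label in ((0.0, "same depth as ALFALFA"),
                        (1.0, "linear scaling with depth"),
                        (0.8, "sub-linear, AGES-like scaling")):
    n = alfalfa_detections * area_ratio * sensitivity_gain**exponent
    print(f"{label:30s}: ~{n:>9,.0f} detections")
# However you slice it, this comes out well short of 600,000.
```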

The CRAFTS estimate is based on ALFALFA's determination of the HI mass function : basically a measure of how many galaxies there are of different HI masses. In essence there are relatively few really massive galaxies but quite a lot more small ones. Knowing how many galaxies of which mass there are per unit volume, you can make a reasonable extrapolation as to how many you'll find in a new volume with a different sensitivity level.

This isn't bad, but it's a bit crude. Sensitivity is a highly complex parameter and it can't be defined in terms of HI mass alone. Galaxies of the same mass have different detectabilities depending on their distance, rotation speed, and orientation to the observer; what's detectable in principle may not actually be detected in practice. CRAFTS assume that all galaxies have the same rotation speed, which is just not the case at all. So I'm definitely suspicious about this calculation : a more realistic estimate would account for the rotation velocity distribution of galaxies as well as their masses.
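To show the shape of that kind of calculation (and why I'm suspicious), here's a deliberately crude sketch : a Schechter HI mass function with roughly ALFALFA-like parameters, a single integrated-flux limit applied to every source regardless of line width (i.e. exactly the over-simplification in question), and a simple Euclidean volume. The flux limits are placeholders and the absolute numbers that come out mean very little - the point is how violently the answer swings with the assumed "sensitivity" :

```python
import numpy as np

# Rough, ALFALFA-like Schechter parameters (illustrative, not the paper's values)
phi_star, log_m_star, alpha = 4.5e-3, 9.94, -1.25   # Mpc^-3 dex^-1, log Msun, slope

omega_sr = 20_000 * (np.pi / 180.0)**2   # ~20,000 deg^2 of sky in steradians
d_cap = 1400.0                           # Mpc, roughly the survey's redshift limit

dlogm = 0.01
log_m = np.arange(6.0, 11.0, dlogm)
m = 10**log_m
phi = (np.log(10) * phi_star * (m / 10**log_m_star)**(alpha + 1)
       * np.exp(-m / 10**log_m_star))    # galaxies per Mpc^3 per dex

# Treat sensitivity as a single integrated-flux limit for every source,
# ignoring line widths entirely - which is precisely the over-simplification.
for s_lim in (0.2, 0.4):                 # placeholder limits in Jy km/s
    d_max = np.minimum(np.sqrt(m / (2.36e5 * s_lim)), d_cap)   # M_HI = 2.36e5 D^2 S_int
    n = np.sum(phi * (omega_sr / 3.0) * d_max**3) * dlogm
    print(f"S_lim = {s_lim:.1f} Jy km/s  ->  ~{n:,.0f} detections")
# Halving the assumed depth cuts the prediction by roughly a factor of three,
# before you even start worrying about line widths or confusion.
```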

On the other hand, CRAFTS also significantly increases the distance range of the survey. Whereas ALFALFA is limited to a redshift of 0.06 (a distance of roughly 250 Mpc, or 800 million light years), CRAFTS will go up to 0.35 (1.4 Gpc or 4.4 billion light years). So a much, much larger volume than ALFALFA. In fact, they say the median distance of their detections will be slightly above the maximum distance ALFALFA can even observe at all !

But this too is problematic. Since sensitivity decreases with distance, and those frequencies are unfortunately more contaminated with interference, it's hard to know exactly how much this will really increase the number of detections. The Malmquist bias means that more sensitive surveys tend to detect most of their sources at greater distances. Yet whereas the median distance of ALFALFA sources is about 110 Mpc, the median distance of AGES (which is four times more sensitive) is only about 141 Mpc - not a huge increase.

Mass of the detections from ALFALFA (red) and AGES (blue) as a function of their velocity (i.e. distance).
Now it's true that AGES has mostly been limited to the same distance range as ALFALFA, so we can't be sure whether our median distance would increase further if we probed greater depths. We do have one data set which includes data out to much greater distances though - not as far as for CRAFTS, but far enough that the comparison should be valid (since there is a cutoff due to interference and sensitivity anyway). In that case our median distance increased from 145 Mpc (470 million light years) only to 160 Mpc (520 million light years). This is nowhere near ALFALFA's maximum distance. In short, it looks unlikely that CRAFTS is going to find a huge number of sources at higher distances : there just aren't enough massive HI sources out there that it can detect.

I say "HI sources" rather than "galaxies" very deliberately. The authors comment on the prospect of confusion, i.e. not being able to determine which galaxy the HI source comes from. This is particularly tricky at high distances, where the resolution becomes so low that there can be multiple contributing galaxies within the telescope beam. They state very briefly that they do not expect confusion to be a problem... yet at the highest distances, it most certainly will. The detections at the greatest distances in AGES are likely confused sources, where the combination of multiple galaxies within the beam gives an HI mass much greater than for a single galaxy, making it possible to detect whole groups of galaxies that would otherwise not be out of reach.

All in all, while FAST will certainly detect tens of thousands of galaxies at the very least, I'm not convinced that it's likely to discover as many as they're hoping for, and I doubt very much that confusion won't be a problem at the highest distances.


Sooo... it's useless, then ?

Whut ? No ! It'll be a fantastic data set, but in my opinion it just won't be as awesome as these predictions make out. Perhaps I'm wrong though. Fortunately we shouldn't have to wait too long to test it. One huge advantage of having 19 beams is that the survey speed is going to be a lot faster than ALFALFA's : they predict it should be done in less than two years (once commissioning ends). That's assuming it's the only survey running and the telescope doesn't do anything else, but I get the distinct impression that's the plan. Which makes sense, because this survey has the killer advantage of simultaneously searching for gas in our own Galaxy, pulsars, and the still-mysterious Fast Radio Bursts. Make no mistake : this will be a very important project and well worth doing.

Of course, what they do with the data is just as important as how good it is. So far we've only seen spectra, not images, so the data quality remains to be seen. But assuming it works, they should learn from ALFALFA and not keep it to themselves. While ALFALFA was initially slated to make its data public, this never happened - we got catalogues and spectra but not the full data products. True, the data volume will be extremely large, but in this day and age that ought not to be a serious issue. FAST's data policy is as yet unclear, so we'll have to wait and see what they choose to do with it. If they're sensible, this will be a huge legacy to the community - even if things don't work out as well as they're predicting.

Status and Perspectives of the CRAFTS Extra-Galactic HI Survey

The Five-hundred-meter Aperture Spherical radio Telescope (FAST) is expected to complete its commissioning in 2019. FAST will soon begin the Commensal Radio Astronomy FasT Survey (CRAFTS), a novel and unprecedented commensal drift scan survey of the entire sky visible from FAST. The goal of CRAFTS is to cover more than 20000 $deg^{2}$ and reach redshift up to about 0.35.

Wednesday 13 March 2019

No dark galaxies here, alas

This paper describes an HI survey of the M81 group. It's a spectacular region, probably best known for M82 (upper right) and its symmetrical, ionised winds:

M81 is the big spiral in the lower left.

But there's also a potential "dark galaxy" here : a cloud of hydrogen that can't be seen at all at optical wavelengths but has its own dark matter halo, just like an optically bright galaxy. This was found back in 2001 with the HIJASS survey, which mapped (at least part of) the northern sky using the Jodrell Bank 76 m telescope. HIJASS and its more famous southern counterpart HIPASS had relatively low sensitivity and resolution by today's standards. This made it difficult for them to both find galaxies with low gas masses, where star formation is likely to be low, and identify the optical counterpart of any gas clouds they did find. With a resolution as low as HIJASS, you can usually find some sort of optical smudge that could be the optical counterpart, because you don't know exactly where the gas really is.

Despite this, they did find a handful of interesting dark galaxy candidates, one of which was near M81. Here the authors describe new observations with resolution that's more than 10x better and of comparable mass sensitivity to HIJASS. They also reprocess other observations from the VLA, GBT and Effelsberg telescopes. The main problem is that this group is at a very low redshift, so the HI emission from the galaxies occurs at similar frequencies to the Milky Way despite being ~3-4 Mpc away. They've developed a way to try and remove contamination from the Milky Way, though as far as I can tell this is basically just a fancy way of identifying which channels have contamination and removing them. I would have liked more figures to illustrate this though, as I get the impression it's more sophisticated than that and I don't fully understand what they've done.

Anyway, here's their main figure. The optical emission is in greyscale with HI contours in blue.

I'll take this opportunity to advise everyone to set the scale of their axes labels to something sensible - this ends up taking up quite a chunk of space that would have been better used for the figure. Also, it would have really helped if they'd labelled the galaxies.
M81 is the big guy on the right, which is interacting with a couple of dwarf galaxies so it has this extremely disturbed gas. Over on the left there's a smaller galaxy IC 2574, and in between them is the purported dark galaxy, HIJASS J1021+68. With these new observations it doesn't much resemble a galaxy - it looks more like a few random patches of gas that just happen to be lumped together. Here's a close-up :

These new observations use an interferometer, and interferometers often miss emission on larger scales. So could these few clouds actually be embedded in a larger HI emission region ? Probably not : in this case, the total mass of the clouds is almost bang-on identical to that from the earlier data, so they've recovered all the flux.

Their interpretation is that this little patch of fluff is much more likely to be tidal debris than a dark galaxy, which I have to agree with. I suppose if you cut off the cloud on the left you could just about imagine that this is an edge-on disc, but it would be highly suspect, and in any case there's no evidence of rotation - the velocities of all parts of the cloud are similar. So although the line width is about 50 km/s (from the earlier observations), which would give a dynamical (i.e. dark matter) mass about 100x the HI mass given the size of the object, this is probably very misleading. It's much more likely to be tidal debris, which you'd expect to be able to produce a few somewhat scattered clouds at similar velocities without much difficulty. The overall line width is low enough that this doesn't pose any great difficulties. In addition, the cloud - or clouds - are situated directly between IC 2574 and M81, which is exactly where you'd expect any tidal debris to be if they'd interacted. So we have, in effect, a means, motive and opportunity.
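For reference, here's the sort of arithmetic behind that factor of ~100, with illustrative inputs rather than the paper's actual values :

```python
# Illustrative inputs, not the measured values from the paper.
line_width = 50.0        # km/s, the line width from the earlier observations
radius_pc = 5_000.0      # assumed characteristic radius of ~5 kpc (placeholder)
G = 4.301e-3             # gravitational constant in pc (km/s)^2 / Msun

v_half = line_width / 2.0             # crudely treat half the width as a rotation speed
m_dyn = v_half**2 * radius_pc / G     # simple M ~ v^2 R / G estimate

print(f"M_dyn ~ {m_dyn:.1e} Msun for these assumptions")
# Comparing a number like this to the measured HI mass is what gives the ~100x
# figure - but if the line width is really just a few clouds at slightly
# different velocities rather than rotation, the estimate means nothing.
```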

Unusually, the paper also includes a cool interactive figure. Here the connection with IC 2574 is a lot more obvious for some reason. To get the interactive version, open the paper with Acrobat Reader.


There are some caveats, but for once they're almost negligible. There's a remote chance that their Milky Way removal procedure has produced some sort of artifacts, but this seems unlikely. Slightly more problematically, it's not clear where the rest of the HI has gone (new readers should be aware that I've spent a great deal of time simulating this), but determining this would require detailed simulations - and the basic scenario of an isolated, denser patch of gas is consistent with previous modelling attempts of similar systems. What would be even more convincing would be more observations : either more sensitive HI data that could detect the expected stream of gas linking IC 2574 and M81, or other wavelengths such as Hα (which has been shown to be another pretty good tracer of interactions).

All in all, this is pretty persuasive that this dark galaxy candidate must be pronounced dead. Still, the authors are going to present a more detailed analysis in a future paper, which will probably be worth reading as the origin of the cloud could still be very interesting. In particular, at some point we're going to be reducing higher resolution data of our own dark galaxy candidates, so knowing what to expect if they're actually tidal debris is extremely useful.

A 5deg x 5deg deep HI survey of the M81 group

A 25 $\rm deg^2$ region, including the M81 complex (M81, M82, NGC 3077), NGC 2976 and IC2574, was mapped during ~3000 hours with the DRAO synthesis telescope.

Wednesday 6 March 2019

If I had my own journal

When I submit a paper to a journal, usually the reviews are basically helpful. Sometimes the reviewer says something stupid, but generally this is only out of ignorance and easily fixed. But at other times it's clear the referee is a solid, chronic moron. There was one memorable incident I've documented elsewhere in which the reviewer decided to simply ignore our responses, which was particularly frustrating : what's the point in me wasting time writing a response if the next review has all the critical analysis one finds from a Russian spambot ? At least the spambots are offering me good old fashioned pornography, whereas poor reviewers offer only insults.

I've just had a second review which is pretty much comparable, if not even worse. It's extremely annoying when the referee asks for something as simple and obvious as a number, you provide that number and highlight it in bold, and the referee refuses to acknowledge this. Being a referee shouldn't entitle you to behave like an obnoxious arsehole.

So how do we prevent this ? I've been thinking for a while that journals should provide a fairly standard set of instructions for reviewers. They wouldn't have to be identical in detail, but they ought to be broadly similar. So if I had my own journal, here's how I would run things. I won't try and give the whole set of conditions, just the ones which are different from how it is in astronomy currently. And I'm only focusing on the review process, not every aspect of submitted papers (e.g. writing style) because that'd take too long.



All parties shall behave with respect, recognising that :
- The authors have invested substantial time and effort into generating a manuscript which they sincerely hold to be correct
- The reviewer has dedicated significant amounts of their own time without compensation in order to provide a valuable scientific service.
The key phrase for a reviewer shall be instruct and justify. They must at all times state what it is they want the authors to do, so that the authors can proceed clearly and their changes can be judged as to whether they are appropriate, and explain the need for the changes, so that the authors can respond accordingly.


Instructions to reviewers

The underlying reason for peer review is to ensure that the results presented are scientifically accurate, and to provide assistance to the authors where necessary. Reviews may be skeptical but should always be helpful and respectful. Constructive criticism is the order of the day. Criticism for the sake of it shall be disregarded : courtesy costs nothing.

Your main purpose is to critique the scientific content of the paper. The methods of investigation used should be sound and the conclusions well-supported by the evidence. If and only if you believe that either of these are seriously flawed, you should request major revisions to the paper. If the paper relies on some fundamental error and is likely to require a complete rewrite, you may recommend the paper is rejected.

Both the editor and the authors shall have the right to respond to any request for a rejection before it is finalised. It is good practice to define which of your suggestions are crucial and which are optional, especially on the first or subsequent revisions; authors should receive fair warning if the referee views their change requests as mandatory.

Every effort should be made to scrutinise the conclusions and methods as much as possible on the first draft. If the authors address the criticisms sufficiently at the first revision, then requesting additional scientific changes is strongly discouraged - except when new issues have arisen due to the changes in the manuscript that were not present previously.

You should distinguish between what the authors are claiming as fact and what they are claiming as interpretation. Statements claimed as fact deserve the most rigorous examination. If the authors clearly state something as a matter of opinion, then this should be viewed with more liberal tolerance unless it is in blatant conflict with the facts. Authors are allowed to present interpretations that others disagree with - the verdict here should primarily rest with the community, not the referee. You are encouraged to suggest that subjective statements be clearly labelled as such, but not to remove them entirely unless there is a strong reason to do so (e.g. excessive unsubstantiated speculation). If this is necessary, reviewers must justify their reasons carefully. Your role is to help the authors write a better paper, not to write it for them.

In terms of style, you should mainly examine the structure of the paper. The paper should follow a basic narrative layout where each section follows logically from the previous. The paper should be concise but clarity is preferred above all : it should be written with the aim that it can and should be read from start to finish, providing sufficient detail that another researcher could (in principle) replicate the results. To this end, all appendices need to be reviewed, but do not count towards the length of the paper - appendices are a useful way to provide technical details that might detract from the narrative flow for the main audience. The decision to remove content entirely or instead move it to an appendix rests with the author and editor.

If the paper is unclear due to a poor layout, but could be improved with re-arrangements (e.g. moving or omitting certain sections or numerous paragraphs), then this will count as a request for moderate revisions. This applies also if the paper appears scientifically correct but lacking in detail.

Reviewers may request the document be shortened but only if they provide explicit directions and an overall purpose, e.g. to focus on the new results, or to avoid repetition. There is no point in complaining that the text is too long unless you state at the very least which sections are too long ! Text should be omitted if it does not provide any useful scientific or explanatory content, if it is excessively speculative, or if it can be replaced with a citation to existing works. Reviewers must justify for each section why they think it should be shortened.

Spotting typographical or language errors is not strictly necessary (this is the role of the editor and typesetter) but helpful if done respectfully. Do not, for instance, question the author's English skills, especially if they may not be a native speaker, but instead simply provide a full list of suggested changes. Vague complaints about the number of typographical errors are insulting to the authors and completely unhelpful. If the paper requires only changes of wording or other similar small modifications, such as labels in figures, then this counts as a request for minor revisions unless they are very extensive, in which case they shall be deemed to be moderate revisions.

In all cases, any requested changes must be as specific as possible and justified. Explaining the reasons for the changes is strongly encouraged; however, providing a commentary with no clear instructions is useless and potentially confusing to the authors. If you dispute any of the authors' claims, you should state if they should simply be removed or replaced with something else - and if so, state also the replacement and whether and how this also requires altering the conclusions.

In general, it is not necessary that the referee is convinced the conclusions are correct, so long as (a) the authors clearly differentiate between opinion and fact; (b) they provide sufficient information that the community can reproduce their results and judge for themselves; and (c) the referee can provide no clear reason the authors are incorrect. It is not sufficient for a referee to simply declare that they are unconvinced, and if they reject the arguments of the authors, they must explain why and do so as directly as possible. If you cannot explain what's wrong with the arguments, the presumption of the editor will be that the referee is biased; if they provide otherwise beneficial critiques which the authors address, then the editor may choose to accept the paper regardless of the referee's recommendation.

The editor has the absolute right to determine whether the reviewer has infringed these guidelines. If so, they may require changes to the report before disseminating it to the authors. If the reviewer fails to modify the report appropriately, the report may be rejected and the editor will seek a new referee. Reviewers who repeatedly fail to meet the journal's standards may be removed from further consultation and in extreme cases may lose their right to anonymity.


The role of editors

The scientific editor's main role is to act as the reviewer's referee, not as a second referee for the paper itself. Editors will decide if the initial referee report meets the journal's standards and provide instructions to the reviewer when necessary. Similarly, they will decide if responses to the authors' revised manuscripts and accompanying comments are suitable and fair.

They will also adjudicate the exchange between authors and referee. While authors should make every effort to accommodate the requested changes, there are often good reasons why this cannot be done. Authors have the right to respond to all change requests and are not automatically bound to comply with all such requests. However, they should justify their reasons for this, either directly in the manuscript or in their response letter. The editor's role in this is to examine the reviewer's response and ensure that, in the event the reviewer disputes the authors' responses, the reviewer remains fair and abides by the standards outlined earlier : they should address the authors' points clearly and directly; they must explain any continued disputes with the authors where appropriate.

In the event of a dispute, both the authors and referee shall have the right of appeal to the editor. In this case the editor will either attempt to adjudicate the dispute themselves or consult an additional referee, who will be given limited access to the paper sufficient for the adjudication. Each party must then agree to the editor's decision, or, if still unsatisfied : the referee may refuse to continue; the authors may request a new referee for the entire manuscript. If a new reviewer is chosen, the full content of the review process will be made available to them. A change of reviewer is permitted a maximum of twice per manuscript (i.e. at most three reviewers in total), beyond which the paper shall be rejected.

In all cases of dispute, the final decision rests with the editor and the editor's decision is final.


That wasn't so hard, was it ?

No, it wasn't. Yet currently there are no stated criteria for major/moderate/minor revisions, which makes the status more or less random. Reviewers appear to be able to act on a whim, criticising whatever they like for any reason - sometimes debasing reports to the status of typesetting. We've already got typesetters for that, so there's no need to waste researchers' time on this. On other occasions referees try and twist a paper to say what they want it to say rather than what the authors want. I think this gives a misleading impression of what the authors actually believe, and that's something we should stop. Sometimes it's hard to know if a piece of tortured logic is the fault of the authors or the referee, and I for one would like to know who to blame. And still other times the referee will simply reject an author's argument without any explanation, making it impossible for the authors to properly respond.

In short, some referees presume they're another helpful co-author trying to improve the paper, while others think of themselves more as God, able to arbitrarily decide which papers are worthy and which aren't. This isn't how it should be.

You might think that stating that everyone should be respectful of each other shouldn't be necessary at this level of professional research. Sadly it is, though insults are rarely as direct as calling each other Mr Stinky Poopy Pants or whatever. What's much more common is selective ignorance. The best referee's report I've ever seen was about 10 pages long and immaculately clear, constructive and detailed. In contrast one of the worst was a mere paragraph which essentially ignored everything we wrote at the first revision - and that's an insulting, unprofessional waste of everyone's time. Hence also the emphasis on providing justifications and explanations.

There's also not really any sort of adjudication process. Sometimes editors do step in and say, "you don't need to do this", but I would like them to be more pro-active about it. Having formal procedures defined as to the purpose of review - what referees are allowed to criticise and where they should step back - as well as making the editor play more of an active role (and less of a mere postman) would help a lot. That way, I hope, we could make the review process what it often is but ought to always be : useful and helpful, critical and skeptical - part of a collective effort of inquiry, not an attempt to disparage others for the sake of it or to make unfounded arguments from authority.

Back from the grave ?

I'd thought that the controversy over NGC 1052-DF2 and DF4 was at least partly settled by now, but this paper would have you believe otherwise...