Sister blog of Physicists of the Caribbean. Shorter, more focused posts specialising in astronomy and data visualisation.

Friday 21 June 2024

The ultimate in flattening the curve

It just refuses to go down...

Well, I'd play the innuendo card with this paper, at any rate. 

Galaxy rotation curves are typically described as flat, meaning that as you go further away from their centres, the orbital speeds of the gas and stars don't change. This is the traditional evidence for dark matter. You need something more than the visible matter to accelerate material up to these speeds, especially the further away you go : if you use the observed matter, the prediction is that orbital speeds should steadily decrease a la Kepler. Unseen dark matter is a neat way to resolve this dilemma, along with a host of other observational oddities.
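To make the expected behaviour concrete, here's a minimal sketch of the Keplerian prediction : once you're beyond essentially all the visible mass, the circular speed should fall off as v = sqrt(GM/r), i.e. as r^-1/2, rather than staying flat. The 10^11 solar mass figure is just an illustrative round number, not a value from the paper.

```python
import numpy as np

G = 4.30091e-6  # Newton's constant in kpc (km/s)^2 / Msun

def v_keplerian(r_kpc, m_visible=1e11):
    """Circular speed at radius r [kpc] if essentially all the visible mass
    m_visible [Msun] lies well inside r : v = sqrt(G M / r), falling as r^-1/2."""
    return np.sqrt(G * m_visible / r_kpc)

for r in (10, 100, 1000):                      # kpc, i.e. out to ~1 Mpc
    print(r, round(v_keplerian(r)), "km/s")    # ~207, ~66, ~21 km/s, versus a flat ~200
```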

This paper claims to have extended rotation curves considerably further than traditional measurements, and finds that the damn curves remain flat no matter how far they go. They reach about 1 Mpc, about the size of our whole Local Group, and still don't show any sign of a drop. This is not at all expected, because eventually the curve should drop as you go beyond the bulk of the dark mass. It is, they say, much more in keeping with the predictions of modified gravity theories that do away with dark matter, such as MOND.

I won't pretend I'm in any way an expert in their methodology, however. A standard rotation curve directly measures the line of sight speed of gas and/or stars, which is relatively simple to convert into an orbital speed – and for qualitatively determining the shape of the curve, the corrections used hardly matter at all. But here the authors don't use such direct kinematic measurements, instead relying on weak gravitational lensing. By looking at small distortions of background galaxies, the amount of gravity associated with a target foreground source can be determined. Unlike strong lensing, where distortions are easily and directly visible in individual sources, this is inferred through the statistics of many small sources rather than from singular measurements.

Here they go even more statistical. Much more statistical, in fact. Rather than looking at individual lens galaxies they consider many thousands, dividing their sample into four mass bins and also by morphology (late-type spiral discs and early-type ellipticals). The lensing measurements don't give you orbital speed directly, but acceleration, which they then convert into velocity. 
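As I understand it, that conversion is presumably just the circular-orbit relation g = v²/r, so a lensing-derived acceleration at some radius maps straight onto an equivalent orbital speed. A minimal sketch (the numbers are illustrative, not taken from the paper) :

```python
import numpy as np

KPC_IN_M = 3.0857e19   # metres per kiloparsec

def v_circ_from_accel(g_si, r_kpc):
    """Turn a (lensing-derived) radial acceleration g [m/s^2] at radius r [kpc]
    into an equivalent circular speed [km/s], assuming circular orbits :
    g = v^2 / r  =>  v = sqrt(g * r)."""
    return np.sqrt(g_si * r_kpc * KPC_IN_M) / 1e3   # m/s -> km/s

# An acceleration of ~1.3e-11 m/s^2 at 100 kpc corresponds to ~200 km/s,
# i.e. a typical "flat" rotation speed
print(v_circ_from_accel(1.3e-11, 100))
```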

1 Mpc is really quite a long way in galactic terms, and it wouldn't be at all uncommon to find another similar-sized galaxy within such a distance : in our Local Group, which is not atypical, there are three large spiral galaxies. Measuring the rotation curve out to such distances then becomes intrinsically complicated (even if you had a direct observational tracer like the gas) because it's hard to know which source is contributing to it. 

They say their sample is of isolated galaxies with any neighbours being of stellar mass less than 10% of their targets out to 4 Mpc away, but their isolation criterion uses photometric redshifts*. Here I feel on very much firmer footing in claiming that these are notoriously unreliable. Especially as the "typical" redshift of their lens galaxies is just 0.2, far too low for photometric measurements to be able to tell you very much. Their large sample means they understandably don't show any images, but it would have been nice if they'd said something about a cursory visual inspection, something to give at least some confidence in the isolation.

* These are measurements of the redshift based on the colour of the galaxy, which is extremely inexact. The gold standard is spectroscopic measurements, which can give precisions of a few km/s or even less.
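For what it's worth, the kind of cut they describe is easy enough to sketch; the problem is the redshift tolerance. Something like the toy version below (the column names and the dz value are my own illustrative assumptions, not the authors') only works if the photometric redshift errors are smaller than the tolerance, which at z ~ 0.2 they generally aren't.

```python
import numpy as np

def is_isolated(target, neighbours, max_sep_mpc=4.0, mass_ratio=0.1, dz=0.05):
    """Toy isolation cut : reject the target if any neighbour within
    max_sep_mpc (projected) and within dz in photometric redshift has more
    than mass_ratio of the target's stellar mass. If the photo-z scatter is
    comparable to dz, genuine neighbours leak straight through this cut."""
    close = (neighbours["sep_mpc"] < max_sep_mpc) & \
            (np.abs(neighbours["z_phot"] - target["z_phot"]) < dz)
    massive = neighbours["m_star"] > mass_ratio * target["m_star"]
    return not np.any(close & massive)
```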

If we take their results as given, they find that the rotation curves of all galaxies in all mass bins remain flat out to 1 Mpc, the limit of their measurement (although in one particular subset this doesn't look so convincing). They also show that in individual cases where they apparently can get good results from weak lensing, the results compare favourably with the direct kinematics they get from gas data.

As often with results questioning the dark matter paradigm, I'd have to describe these as "intriguing but overstated". I don't know anywhere near enough about the core method of weak lensing to comment on the main point of the paper. But given that this is normally in itself a result of statistical inference, that here they use a very large sample of galaxies and convert the result from the native acceleration measurement to velocity, and that their isolation criterion seems suspect... I remain unconvinced. I'd need a lot more persuading that the weak lensing data is really giving meaningful results at such large distances from the lens galaxies.

What would have been nice to see is the results from simulations. If they could show that their photometric redshifts were accurate enough in simulated cases to give reliable results, and that the weak lensing should give something similar to this (or not, if the dark haloes in the simulations have a finite extent), then I'd find it all a lot more convincing. As it stands, I don't believe it. Especially given that so many galaxies are now known with significantly lower dark matter contents than expected : these "indefinitely flat" rotation curves seem at odds with galaxies with such low rotation speeds even in their innermost regions. Something very fishy's going on.

Wednesday 19 June 2024

The shoe's on the other foot

My, how the tables have turned. The hunter has become the hunted. And various other clichés indicating that the normal state of affairs has become reversed.

That is, as well as having to write an observing proposal, I find myself for the first time having to review them. Oh, I've reviewed papers before, but never observing proposals. This came about because ALMA has a distributed proposal review system : everyone who submits their own proposal has to review ten others. And since this year I finally submitted one, I get to experience this process first hand.


The ALMA DPR procedure

When you submit a proposal, you indicate your areas of expertise and any conflicts of interest – collaborators and direct competitors who shouldn't be reviewing your proposal, either because they'd stand to benefit from it being accepted or would love to take you down a peg. It's a double-blind procedure : your proposal can't contain any identifying information and you don't know who the reviewers are. Some automatic checks are also carried out to prevent Co-Is on recent ALMA proposals being assigned as reviewers, and suchlike.

Then your proposal is sent off for initial checks and distributed to ten other would-be observers who also submitted observing proposals in the current cycle. You, in turn, get ten proposals to review yourself. Each document is four pages of science justification (of which normally one or even two pages are taken up with figures, references, and tables) plus an unlimited-length technical section containing the observing parameters for each source, along with some brief justification of the specifics (in practice, in most proposals each of these so-called "science goals" is very similar, using the same observing setup on multiple targets). You then write a short review of each one (a maximum of 4,000 characters, but typically more like ~1,000 or even less), describing both its strengths and weaknesses. You also rank them all relative to each other, from 1 (the strongest) to 10 (the weakest).

That's stage one. A few weeks later, in stage two you get to see everyone else's reviews for the same proposals, and can then change your own reviews and/or rankings accordingly, if you want to. So far as I know, each reviewer gets a unique group of ten proposals to review, so no two reviewers review the same set of proposals, meaning you can't see the others' rankings. Exactly how the rankings are then all compared and combined, and ultimately translated into awarded telescope time, remains a mystery to me. Those details I leave for some other time, and I won't go into anonymity* here either : I seem to recall hearing that this gives a better balance of both experience and gender, but I don't have anything to hand to back this up.

* I will of course continue to respect the anonymity requirements here, and not give any information that could possibly identify me as anyone's reviewer.

Instead I want to give some more general reflections on the process. To be honest I went into this feeling rather biased, having received too many referee comments which were just objectively bollocks. I was quite prepared to believe the whole thing would be essentially little better than random, which is not a position without merit.


First thoughts

And my initial impressions justified this. It seemed clear to me that everyone had chosen interesting targets and would definitely be able to get something interesting out of their data, making this review process a complete waste of time.

But after I let things sink in a bit more, after I read the proposals a bit more carefully and made some notes, I realised this wasn't really the case. I still stand by my view that (with one exception) all proposals would result in good science, but the more I thought about it, the more I came to the conclusion that I could make a meaningful judgement on each one. I tried not to judge too much whether one would do better science than another, because who am I to say what's better science ? Why should I determine if studies of extrasolar planets are more important than active galactic nuclei ?

These aren't real examples, but you get the idea. Actually the proposals were all aligned very much more closely with my area of expertise. The length of four pages I would say is "about right" : it gave enough background for me to set each proposal in its proper context as well as going into the specific objectives.

Instead, what I tried to assess was whether each project would actually be able to accomplish the science it was trying to do. I looked at how impactful this would be only as a secondary consideration. There isn't really any right or wrong answer as to whether it's better to look at a single unique target versus a statistical study of more typical objects, but I tried to judge how much impact the observations would likely have on the field, how much legacy value they would have for the community. But first and foremost, I considered whether I was persuaded the stated science objectives could actually be carried out if the observations themselves reached their design spec.


Judgement Day

And this I found was something I could definitely judge. Two proposals to me stood out as exemplary, perfectly stating exactly what they wanted to do and why, exactly what they'd be able to achieve with this. It was very clear that they understood the scientific background as well as anyone did. I initially ranked these essentially as a coin-toss as to who got first and who got second place; I couldn't meaningfully choose between them.

At the opposite extreme were two or three which didn't convince me at all. One of the principal objectives of one of them was just not feasible with the data they were trying to obtain, and they themselves presented better data they already had in the proposal itself, which would have been much more suitable for this. Lacking self-consistency is a black mark in just about any school of thought. Another looked like it would observe a perfectly good set of objects, but contained so many rudimentary scientific errors that there was no way I could believe they'd do what they said they would do.

Again, deciding which one to rank lowest was essentially random, though I confess that one of them just rubbed me up the wrong way more than the other.

In the middle were a very mixed bunch indeed. Some had outstanding ideas for scientific discovery but were very badly expressed, saying the same thing over and over again to the nth degree (I would say to these people, there's no obligation to use the full four pages, and we should stipulate this in the guidance to observers and reviewers alike. I tried to ignore the poor writing style of these and rank them highly because of the science). Some oversold the importance of what they'd do, making unwarranted extrapolations from their observations to much more general conclusions. Some had a basically good sample but claimed it was something which it clearly wasn't; others clearly stated what their sample was, but the objects themselves were not properly representative of what they were trying to achieve.

This middle group... honestly here, a random lottery would work well. On the other hand, there doesn't seem to be any obvious reason not to use human judgement here either, because for me at least this felt like a random decision anyway. And if other people's judgements are similar then clearly there are non-random effects which probably should actually be accounted for, whereas if they are truly random then the effects will average out. So there's potentially a benefit in one case and no harm in the other, and in any case there almost certainly is a large degree of randomness at work anyway.
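A toy illustration of that averaging intuition (not the actual DPR assignment scheme, where each reviewer sees a different set of proposals) : give many reviewers the same ten proposals, add a lot of random noise to their judgements, and the combined ranking still tracks the true quality far better than any individual does.

```python
import numpy as np

rng = np.random.default_rng(1)
true_quality = np.arange(10, dtype=float)                 # proposal 0 worst ... 9 best
scores = true_quality + rng.normal(0, 3, size=(50, 10))   # 50 very noisy reviewers

one_reviewer = np.corrcoef(true_quality, scores[0])[0, 1]
combined     = np.corrcoef(true_quality, scores.mean(axis=0))[0, 1]
print(one_reviewer, combined)   # the averaged ranking correlates much better with the truth
```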


Reviewing the reviews

I went through a similar process of revising my expectations in stage 2, though to a lesser degree. At first glance I didn't think I'd need to change my reviews or rankings, but on carefully checking one of the other reviews, I realised this was not the case. One reviewer out of the ten had managed to spot a deeply problematic technical issue in one of the proposals that I otherwise would have ranked very highly. And on checking I was forced to conclude that they were correct and had to downgrade my ranking significantly. This alone makes the process worth doing : 1 out of 10 is not high, but with ~1,600 proposals in total, this is potentially a significant number overall.

Reading the other reviews turned out to be more interesting than I expected. While some did raise exactly the same issues with some of the proposals that I had mentioned, many didn't. Some said "no weaknesses" of proposals I thought were full of holes. One even said words to the effect that "no-one should doubt the observers will do good science with this", a statement I felt was presumptuous, biased, and bordering on an argument-from-authority : it's for us the reviewers to decide this independently; being told what we should think is surely missing the point.

The reverse of this is that some proposals I thought were strong, others thought were weak – very weak, in some cases. Everyone picks up on different things they think are important. There was one strange tendency for reviewers to point out that the ALMA data wouldn't be of the same quality as comparison data. This is fine, except that the ALMA data would usually have been of better quality, and downgrading it to the same standard is trivial ! I sort of wished I'd edited my reviews to point this out. Some also made comments on statistics and uncertainties that I thought were so generic as to be unfair : yes, of course things might be different from expectations, but that's why we need to do the observations !

What the DPR doesn't really do is give any chance for discussion. You can read the other reviews but you can't interact with the other reviewers. It might have been nice to have somewhere where we could enter a "comment to other reviewers", directed to the group, or at least have some form of alert system when reviews were altered. Being able to ask the observers questions might have been nice, but I do understand the need to keep things timely as well. On that front, reviews varied considerably in length; mine were on the longer and to be honest perhaps overly-long side (I think my longest was nearly 2,000 characters), while one was consistently and ludicrously short.

All this has given me very mixed feelings about my own proposal. On the one hand, I don't think it's anywhere near the worst, and I stand behind the scientific objectives. On the other, I think I concentrated overmuch on the science and not enough on the observational details. Ranking it myself with hindsight I'd probably have to put it in the lower third. It was always a long shot though, so I'll be neither surprised nor disappointed by the presumed rejection. One can but try with these things.

One thing I will applaud very strongly is the instruction to write both strengths and weaknesses of each proposal. All of them, bar none, had some really good points, but it was helpful to remind myself of this and not get carried away when reviewing the ones I didn't much like. Weaknesses were more of a mixed bag; one can always find something to criticise, although in some cases they aren't significant. Still, I found it very helpful to remember that this wasn't an exercise in pure fault-finding.




How in the world one judges which projects to actually undertake, though... that seems to me like the ultimate test of philosophy of science. Groups of experts of various levels have pronounced disagreements about factual statements; some notice entirely different things from others. There's the issue not only of whether the science will be significant, but also of whether the data can be used in different ways from those suggested. That to me remains the fundamental problem with the whole system : one can nearly always expect some interesting results, but predicting what they could be is a fool's game.

Overall, I've found this a positive experience. Reading the full gamut of excellent to poor proposals really gives a clearer idea of what reviewers are looking for, something it's just not possible to get without direct experience. Not for the first time, I wonder a lot about Aumann's Agreement Theorem. If we the reviewers are rational, we ought to be persuaded by each other's arguments. But are we ? This at least could be assessed objectively, with detailed statistics possible on how many reviewers change their ranking when reading other reviews. 

And at the back of my mind is a constant paradoxical tension : a strong feeling that I'm right and others are wrong, coupled with the knowledge that other people are thinking the same thing about me. How do we reconcile this ? For my part, I simply can't. I formulate my judgement and let everyone else do the same, and hope to goodness the whole thing averages out to something that's approximately correct. The paradox is that this in no way makes me feel any the less convinced of my own judgements, even knowing that some fraction of them simply must be wrong.

Other aspects are much more tricky. This is a convergence of different efforts : trying to assess what is true (which scientific claims are factually correct, and why experienced experts still disagree on some points), what will likely benefit the community the most, and how we try to account for the inevitably uncertain and unpredictable findings. As I've said before many times, real, coal-face research is extremely messy. If it isn't already, then I would hope that telescope proposals ought to be an incredibly active field of research for philosophers of science.

Thursday 6 June 2024

The data won't learn from itself

Today I want to briefly mention a couple of papers about AI in astronomy research. These tackle very different questions from the usual sort, which might examine how good LLMs can be at summarising documents or reading figures and the like. These, especially the second, are much more philosophical than that.

The first uses an LLM to construct a knowledge graph for astronomy, attempting to link different concepts together. The idea is to show how, at a very high level, astronomical thinking has shifted over time : what concepts were typically connected and how this has changed. Using distributional semantics, where the meanings of words in relation to other words are encoded as numerical vectors, they construct a very pretty diagram showing how different astronomical topics relate to each other. And it certainly does look very nice – you can even play with it online.
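The underlying machinery is simple enough to sketch : each concept gets a vector, and the strength of the link between two concepts is something like the cosine similarity between their vectors. The toy vectors below are made up by hand purely for illustration; in the paper they'd come from the LLM.

```python
import numpy as np

def cosine_similarity(u, v):
    """How alike two concept embeddings are : ~1 for concepts used in very
    similar contexts, ~0 for unrelated ones (with non-negative vectors)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

concepts = {   # hand-made toy embeddings, not real LLM output
    "dark matter":    np.array([0.9, 0.1, 0.3]),
    "rotation curve": np.array([0.8, 0.2, 0.4]),
    "exoplanet":      np.array([0.1, 0.9, 0.2]),
}
print(cosine_similarity(concepts["dark matter"], concepts["rotation curve"]))  # ~0.98 : strong link
print(cosine_similarity(concepts["dark matter"], concepts["exoplanet"]))       # ~0.27 : weak link
```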

It's quite fun to see how different concepts like galaxy and stellar physics relate to each other, how connected they are and how closely (or at least it would be if the damn thing would load faster). It's also interesting to see how different techniques have become more widely-used over time, with machine learning having soared in popularity in the last ten years. But exactly what the point of this is I'm not sure. It's nice to be able to visualise these things for the sake of aesthetics, but does this offer anything truly new ? I get the feeling it's like Hubble's Tuning Fork : nice to show, but nobody actually does anything with it because the graphical version doesn't offer anything that couldn't be conveyed with text.

Perhaps I'm wrong. I'd be more interested to see if such an approach could indicate which fields have benefited from methods that other fields aren't currently using, or more generally, to highlight possible multi-disciplinary approaches that have been thus far overlooked.


The second paper is far more provocative and interesting. It asks, quite bluntly, whether machine learning is a good thing for the natural sciences : this is very general, though astronomy seems to be the main focus. 

They begin by noting that machine learning is good for performance, not understanding. I agree, but once we do understand, then surely performance improvements are what we're after. Machine learning is good for quantification, not qualitative understanding and certainly not for proposing new concepts (LLMs might, and I stress might, be able to help with this). But it's a rather strange thing to examine, and possibly a bit of a straw man, since I've never heard of anyone thinking that ML could do this. And they admit that ML can be obviously beneficial in certain kinds of numerical problems, but this is still a bit strange : what, if any, qualitative problems is ML supposed to ever help with ?

Not that quantitative and qualitative are entirely separable. Sometimes once you obtain a number you can robustly exclude or confirm a particular model, so in that sense the qualitative requires the quantitative. But, as they rightly point out, as I have myself many times, interpretation is a human thing : machines know numbers but nothing else. More interestingly they note :  

The things we care about are almost never directly observable... In physics, for example, not only do the data exist, but so do forces, energies, momenta, charges, spacetime, wave functions, virtual particles, and much more. These entities are judged to exist in part because they are involved in the latent structure of the successful theories; almost none of them are direct observables. 

Well, this is something I've explored a lot on Decoherency (just go there and search for "triangles"). But I have to ask, what is the difference between an observation and a measurement ? For example we can see the effects of electrical charge by measuring, say, the deflection of a hair in the static field of a balloon, but we don't observe charge directly. But we also don't observe radio waves directly, yet we don't think they're less real than optical photons, which we do. Likewise some animals do appear to be able to sense charge and magnetic fields directly. In what sense, then, are these "real", and in what sense are they just convenient labels we apply ?

I don't know. The extreme answer is that all we have are perceptions, i.e. labels, and no access to anything "real" at all, but this remains (in some ways) deeply unsatisfactory; again, see innumerable Decoherency posts on this, search for "neutral monism". Perhaps here it doesn't matter so much though. The point is that ML cannot extract any sort of qualitative parameters at all, whereas to humans these matter very much – regardless of their "realness" or otherwise. If you only quantify and never qualify, you aren't doing science, you're just constructing a mathematical model of the world : ultimately you might be able to interpolate perfectly but you'd have no extrapolatory power at all.

Tying in with this and perhaps less controversially are their statements regarding why some models are preferred over others :

When the expansion of the Universe was discovered, the discovery was important, but not because it permitted us to predict the values of the redshifts of new galaxies (though it did indeed permit that). The discovery was important because it told us previously unknown things about the age and evolution of the Universe, and it confirmed a prediction of general relativity, which is a theory of the latent structure of space and time. The discovery would not have been seen as important if Hubble and Humason had instead announced that they had trained a deep multilayer perceptron that could predict the Doppler shifts of held-out extragalactic nebulae.

Yes ! Hubble needed the numbers to formulate an interpretation, but the numbers themselves don't interpret anything. A device or mathematical model capable of predicting the redshifts from other data, without saying why the redshifts take the values that they do, without relating it to any other physical quantities at all, would be mathematical magic, and not really science.

For another example, consider the discovery that the paths of the planets are ellipses, with the Sun at one focus. This discovery led to extremely precise predictions for data. It was critical to this discovery that the data be well explained by the theory. But that was not the primary consideration that made the nascent scientific community prefer the Keplerian model. After all, the Ptolemaic model preceding Kepler made equally accurate predictions of held-out data. Kepler’s model was preferred because it fit in with other ideas being developed at the same time, most notably heliocentrism.

A theory or explanation has to do much more than just explain the data in order to be widely accepted as true. In physics for example, a model — which, as we note, is almost always a model of latent structure — is judged to be good or strongly confirmed not only if it explains observed data. It ought to explain data in multiple domains, and it must connect in natural ways to other theories or principles (such as conservation laws and invariances) that are strongly confirmed themselves.  

General relativity was widely accepted by the community not primarily because it explained anomalous data (although it did explain some); it was adopted because, in addition to explaining (a tiny bit of new) data, it also had good structure, it resolved conceptual paradoxes in the pre-existing theory of gravity, and it was consistent with emerging ideas of field theory and geometry.

Which is a nice summary. Some time ago I'd almost finished a draft of a much longer post based on this far more detailed paper, which considers the same issues, but then Blogger lost it all and I haven't gotten around to re-writing the bloody thing. I may yet try. Anyway, the need for self-consistency is important, and doesn't throttle new theories in their infancy as you might expect : there are ways to overturn established findings independent of the models.

The rest of the paper is more-or-less in line with my initial expectations. ML is great, they say, when only quantification is needed : when a correlation is interesting regardless of causation, or when you want to find outliers. So long as the causative factors are well-understood (and sometimes they are !) it can be a powerful tool for rapidly finding trends in the data and points which don't match the rest. 
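Outlier-hunting in particular needs no causal model at all. Something like the sketch below (a fake two-column catalogue and scikit-learn's IsolationForest, chosen here just as one possible tool) simply flags the points that don't resemble the rest, which you then go and inspect by hand.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Fake catalogue : two observables (say, a colour and a magnitude)
rng = np.random.default_rng(0)
normal = rng.normal(loc=[1.0, 18.0], scale=[0.1, 0.5], size=(1000, 2))
weird  = np.array([[2.5, 14.0], [0.1, 22.0]])      # two objects that don't follow the trend
catalogue = np.vstack([normal, weird])

# No physics needed : the forest just flags points unlike the rest for inspection
labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(catalogue)
print(np.where(labels == -1)[0])   # candidate outliers (the planted ones should be among them)
```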

If the trends are not well-understood ahead of time, it can reinforce biases, in particular confirmation bias, by matching what was expected in advance. Similarly, if there are rival explanations possible, ML doesn't help you choose between them if they don't predict anything significantly different. But often, no understanding is necessary. To remove the background variations in a telescope's image it isn't necessary even to know where all the variations come from : it's usually obvious that they are artifacts, and all you need is a mathematical description of them. Or more colourfully, "You do not have to understand your customers to make plenty of revenue off of them."
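The background example works the same way : you don't need to know whether the large-scale variation is scattered light, flat-fielding, or something else, only that it's smooth. A crude sketch (real pipelines would mask the sources first, but the principle is the same) :

```python
import numpy as np
from scipy.ndimage import median_filter

def subtract_background(image, box=64):
    """Estimate the smooth, large-scale background with a broad median filter
    and subtract it -- a purely mathematical description of the artifact,
    with no model of where it actually comes from."""
    background = median_filter(image, size=box)
    return image - background
```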

Wise words. Less wise, perhaps only intended as a joke, are the comments about "the unreasonable effectiveness of ML" : that it's remarkable that these industrial-grade mathematical processes are any good in situations for which they were never designed. But I never even got around to blogging Wigner's famous "unreasonable effectiveness" essay because it seemed worryingly silly.

Finally, they note that it might be better if natural sciences were to shift their focus away from theories and more towards the data, and that the degeneracies in the sciences undermine the "realism" of the models. Well, you do you : it's provocative, but on this occasion, I shall allow myself not to be provoked. Shut up and calculate ? Nah. Shut up and contemplate.
