Sister blog of Physicists of the Caribbean. Shorter, more focused posts specialising in astronomy and data visualisation.

Friday 21 June 2024

The ultimate in flattening the curve

It just refuses to go down...

Well, I'd play the innuendo card with this paper, at any rate. 

Galaxy rotation curves are typically described as flat, meaning that as you go further away from their centres, the orbital speeds of the gas and stars don't change. This is the traditional evidence for dark matter. You need something more than the visible matter to accelerate material up to these speeds, especially the further away you go : if you use the observed matter, the prediction is that orbital speeds should steadily decrease a la Kepler. Unseen dark matter is a neat way to resolve this dilemma, along with a host of other observational oddities.
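To make the expected behaviour concrete, here's a minimal sketch of the Keplerian prediction : once you're beyond essentially all the visible mass, the circular speed should fall off as v = sqrt(GM/r), i.e. as r^-1/2, rather than staying flat. The 10^11 solar mass figure is just an illustrative round number, not a value from the paper.

```python
import numpy as np

G = 4.30091e-6  # Newton's constant in kpc (km/s)^2 / Msun

def v_keplerian(r_kpc, m_visible=1e11):
    """Circular speed at radius r [kpc] if essentially all the visible mass
    m_visible [Msun] lies well inside r : v = sqrt(G M / r), falling as r^-1/2."""
    return np.sqrt(G * m_visible / r_kpc)

for r in (10, 100, 1000):                      # kpc, i.e. out to ~1 Mpc
    print(r, round(v_keplerian(r)), "km/s")    # ~207, ~66, ~21 km/s, versus a flat ~200
```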

This paper claims to have extended rotation curves considerably further than traditional measurements, and finds that the damn curves remain flat no matter how far they go. They reach about 1 Mpc, about the size of our whole Local Group, and still don't show any sign of a drop. This is not at all expected, because eventually the curve should drop as you go beyond the bulk of the dark mass. It is, they say, much more in keeping with the predictions of modified gravity theories that do away with dark matter, such as MOND.

I won't pretend I'm in any way an expert in their methodology, however. A standard rotation curve directly measures the line of sight speed of gas and/or stars, which is relatively simple to convert into an orbital speed – and for qualitatively determining the shape of the curve, the corrections used hardly matter at all. But here the authors don't use such direct kinematic measurements, instead relying on weak gravitational lensing. By looking at small distortions of background galaxies, the amount of gravity associated with a target foreground source can be determined. Unlike strong lensing, where distortions are easily and directly visible in individual sources, this is inferred through the statistics of many small sources rather than from singular measurements.

Here they go even more statistical. Much more statistical, in fact. Rather than looking at individual lens galaxies they consider many thousands, dividing their sample into four mass bins and also by morphology (late-type spiral discs and early-type ellipticals). The lensing measurements don't give you orbital speed directly, but acceleration, which they then convert into velocity. 
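As I understand it, that conversion is presumably just the circular-orbit relation g = v²/r, so a lensing-derived acceleration at some radius maps straight onto an equivalent orbital speed. A minimal sketch (the numbers are illustrative, not taken from the paper) :

```python
import numpy as np

KPC_IN_M = 3.0857e19   # metres per kiloparsec

def v_circ_from_accel(g_si, r_kpc):
    """Turn a (lensing-derived) radial acceleration g [m/s^2] at radius r [kpc]
    into an equivalent circular speed [km/s], assuming circular orbits :
    g = v^2 / r  =>  v = sqrt(g * r)."""
    return np.sqrt(g_si * r_kpc * KPC_IN_M) / 1e3   # m/s -> km/s

# An acceleration of ~1.3e-11 m/s^2 at 100 kpc corresponds to ~200 km/s,
# i.e. a typical "flat" rotation speed
print(v_circ_from_accel(1.3e-11, 100))
```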

1 Mpc is really quite a long way in galactic terms, and it wouldn't be at all uncommon to find another similar-sized galaxy within such a distance : in our Local Group, which is not atypical, there are three large spiral galaxies. Measuring the rotation curve out to such distances then becomes intrinsically complicated (even if you had a direct observational tracer like the gas) because it's hard to know which source is contributing to it. 

They say their sample is of isolated galaxies with any neighbours being of stellar mass less than 10% of their targets out to 4 Mpc away, but their isolation criterion uses photometric redshifts*. Here I feel on very much firmer footing in claiming that these are notoriously unreliable. Especially as the "typical" redshift of their lens galaxies is just 0.2, far too low for photometric measurements to be able to tell you very much. Their large sample means they understandably don't show any images, but it would have been nice if they'd said something about a cursory visual inspection, something to give at least some confidence in the isolation.

* These are measurements of the redshift based on the colour of the galaxy, which is extremely inexact. The gold standard is spectroscopic measurements, which can give precisions of a few km/s or even less.
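For what it's worth, the kind of cut they describe is easy enough to sketch; the problem is the redshift tolerance. Something like the toy version below (the column names and the dz value are my own illustrative assumptions, not the authors') only works if the photometric redshift errors are smaller than the tolerance, which at z ~ 0.2 they generally aren't.

```python
import numpy as np

def is_isolated(target, neighbours, max_sep_mpc=4.0, mass_ratio=0.1, dz=0.05):
    """Toy isolation cut : reject the target if any neighbour within
    max_sep_mpc (projected) and within dz in photometric redshift has more
    than mass_ratio of the target's stellar mass. If the photo-z scatter is
    comparable to dz, genuine neighbours leak straight through this cut."""
    close = (neighbours["sep_mpc"] < max_sep_mpc) & \
            (np.abs(neighbours["z_phot"] - target["z_phot"]) < dz)
    massive = neighbours["m_star"] > mass_ratio * target["m_star"]
    return not np.any(close & massive)
```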

If we take their results as given, they find that the rotation curves of all galaxies in all mass bins remain flat out to 1 Mpc, the limit of their measurement (although in one particular subset this doesn't look so convincing). They also show that in individual cases where they apparently can get good results from weak lensing, the results compare favourably with the direct kinematics they get from gas data.

As often with results questioning the dark matter paradigm, I'd have to describe these as "intriguing but overstated". I don't know anywhere near enough about the core method of weak lensing to comment on the main point of the paper. But given that this is normally in itself a result of statistical inference, that here they use a very large sample of galaxies and convert the result from the native acceleration measurement to velocity, and that their isolation criterion seems suspect... I remain unconvinced. I'd need a lot more persuading that the weak lensing data is really giving meaningful results at such large distances from the lens galaxies.

What would have been nice to see is the results from simulations. If they could show that their photometric redshifts were accurate enough in simulated cases to give reliable results, and that the weak lensing should give something similar to this (or not, if the dark haloes in the simulations have a finite extent), then I'd find it all a lot more convincing. As it stands, I don't believe it. Especially given that so many galaxies are now known with significantly lower dark matter contents than expected : these "indefinitely flat" rotation curves seem at odds with galaxies with such low rotation speeds even in their innermost regions. Something very fishy's going on.

Wednesday 19 June 2024

The shoe's on the other foot

My, how the tables have turned. The hunter has become the hunted. And various other clichés indicating that the normal state of affairs has become reversed.

That is, as well as having to write an observing proposal, I find myself for the first time having to review them. Oh, I've reviewed papers before, but never observing proposals. This came about because ALMA has a distributed proposal review system : everyone who submits their own proposal has to review ten others. And since this year I finally submitted one, I get to experience this process first hand.


The ALMA DPR procedure

When you submit a proposal, you indicate your areas of expertise and any conflicts of interest – collaborators and direct competitors who shouldn't be reviewing your proposal, either because they'd stand to benefit from it being accepted or would love to take you down a peg. It's a double-blind procedure : your proposal can't contain any identifying information and you don't know who the reviewers are. Some automatic checks are also carried out to prevent Co-Is on recent ALMA proposals being assigned as reviewers, and suchlike.

Then your proposal is sent off for initial checks and distributed to ten other would-be observers who also submitted observing proposals in the current cycle. You, in turn, get ten proposals to review yourself. Each document is four pages of science justification (of which normally one or even two pages are taken up with figures, references, and tables) plus an unlimited-length technical section containing the observing parameters for each source, along with some brief justification of the specifics (in practice, in most proposals each of these so-called "science goals" is very similar, using the same observing setup on multiple targets). You then write a short review of each one (a maximum of 4,000 characters, but typically more like ~1,000 or even less), describing both its strengths and weaknesses. You also rank them all relative to each other, from 1 (the strongest) to 10 (the weakest).

That's stage one. A few weeks later, in stage two you get to see everyone else's reviews for the same proposals, and can then change your own reviews and/or rankings accordingly, if you want to. So far as I know, each reviewer gets a unique group of ten proposals to review, so no two reviewers review the same set of proposals, meaning you can't see the others' rankings. Exactly how the rankings are then all compared and combined, and ultimately translated into awarded telescope time, remains a mystery to me. Those details I leave for some other time, and I won't go into anonymity* here either : I seem to recall hearing that this gives a better balance of both experience and gender, but I don't have anything to hand to back this up.

* I will of course continue to respect the anonymity requirements here, and not give any information that could possibly identify me as anyone's reviewer.

Instead I want to give some more general reflections on the process. To be honest I went into this feeling rather biased, having received too many referee comments which were just objectively bollocks. I was quite prepared to believe the whole thing would be essentially little better than random, which is not a position without merit.


First thoughts

And my initial impressions justified this. It seemed clear to me that everyone had chosen interesting targets and would definitely be able to get something interesting out of their data, making this review process a complete waste of time.

But after I let things sink in a bit more, after I read the proposals a bit more carefully and made some notes, I realised this wasn't really the case. I still stand by my view that (with one exception) all proposals would result in good science, but the more I thought about it, the more I came to the conclusion that I could make a meaningful judgement on each one. I tried not to judge too much whether one would do better science than another, because who am I to say what's better science ? Why should I determine if studies of extrasolar planets are more important than active galactic nuclei ?

These aren't real examples, but you get the idea. Actually the proposals were all aligned very much more closely with my area of expertise. The length of four pages I would say is "about right" : it gave enough background for me to set each proposal in its proper context as well as going into the specific objectives.

Instead, what I tried to assess was whether each project would actually be able to accomplish the science it was trying to do. I looked at how impactful this would be only as a secondary consideration. There isn't really any right or wrong answer as to whether it's better to look at a single unique target versus a statistical study of more typical objects, but I tried to judge how much impact the observations would likely have on the field, how much legacy value they would have for the community. But first and foremost, I considered whether I was persuaded the stated science objectives could actually be carried out if the observations themselves reached their design spec.


Judgement Day

And this I found was something I could definitely judge. Two proposals to me stood out as exemplary, perfectly stating exactly what they wanted to do and why, exactly what they'd be able to achieve with this. It was very clear that they understood the scientific background as well as anyone did. I initially ranked these essentially as a coin-toss as to who got first and who got second place; I couldn't meaningfully choose between them.

At the opposite extreme were two or three which didn't convince me at all. One of the principal objectives of one of them was just not feasible with the data they were trying to obtain, and they themselves presented better data they already had in the proposal itself, which would have been much more suitable for this. Lacking self-consistency is a black mark in just about any school of thought. Another looked like it would observe a perfectly good set of objects, but contained so many rudimentary scientific errors that there was no way I could believe they'd do what they said they would do.

Again, deciding which one to rank lowest was essentially random, though I confess that one of them just rubbed me up the wrong way more than the other.

In the middle were a very mixed bunch indeed. Some had outstanding ideas for scientific discovery but were very badly expressed, saying the same thing over and over again to the nth degree (I would say to these people, there's no obligation to use the full four pages, and we should stipulate this in the guidance to observers and reviewers alike. I tried to ignore the poor writing style of these and rank them highly because of the science). Some oversold the importance of what they'd do, making unwarranted extrapolations from their observations to much more general conclusions. Some had a basically good sample but claimed it was something which it clearly wasn't; others clearly stated what their sample was, but the objects themselves were not properly representative of what they were trying to achieve.

This middle group... honestly here, a random lottery would work well. On the other hand, there doesn't seem to be any obvious reason not to use human judgement here either, because for me at least this felt like a random decision anyway. And if other people's judgements are similar then clearly there are non-random effects which probably should actually be accounted for, whereas if they are truly random then the effects will average out. So there's potentially a benefit in one case and no harm in the other, and in any case there almost certainly is a large degree of randomness at work anyway.
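A toy illustration of that averaging intuition (not the actual DPR assignment scheme, where each reviewer sees a different set of proposals) : give many reviewers the same ten proposals, add a lot of random noise to their judgements, and the combined ranking still tracks the true quality far better than any individual does.

```python
import numpy as np

rng = np.random.default_rng(1)
true_quality = np.arange(10, dtype=float)                 # proposal 0 worst ... 9 best
scores = true_quality + rng.normal(0, 3, size=(50, 10))   # 50 very noisy reviewers

one_reviewer = np.corrcoef(true_quality, scores[0])[0, 1]
combined     = np.corrcoef(true_quality, scores.mean(axis=0))[0, 1]
print(one_reviewer, combined)   # the averaged ranking correlates much better with the truth
```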


Reviewing the reviews

I went through a similar process of revising my expectations in stage 2, though to a lesser degree. At first glance I didn't think I'd need to change my reviews or rankings, but on carefully checking one of the other reviews, I realised this was not the case. One reviewer out of the ten had managed to spot a deeply problematic technical issue in one of the proposals that I otherwise would have ranked very highly. And on checking I was forced to conclude that they were correct and had to downgrade my ranking significantly. This alone makes the process worth doing : 1 out of 10 is not high, but with ~1,600 proposals in total, this is potentially a significant number overall.

Reading the other reviews turned out to be more interesting than I expected. While some did raise exactly the same issues with some of the proposals that I had mentioned, many didn't. Some said "no weaknesses" of proposals I thought were full of holes. One even said words to the effect that "no-one should doubt the observers will do good science with this", a statement I felt was presumptuous, biased, and bordering on an argument-from-authority : it's for us the reviewers to decide this independently; being told what we should think is surely missing the point.

The reverse of this is that some proposals I thought were strong, others thought were weak – very weak, in some cases. Everyone picks up on different things they think are important. There was one strange tendency for reviewers to point out that the ALMA data wouldn't be of the same quality as comparison data. This is fine, except that the ALMA data would usually have been of better quality, and downgrading it to the same standard is trivial ! I sort of wished I'd edited my reviews to point this out. Some also made comments on statistics and uncertainties that I thought were so generic as to be unfair : yes, of course things might be different from expectations, but that's why we need to do the observations !

What the DPR doesn't really do is give any chance for discussion. You can read the other reviews but you can't interact with the other reviewers. It might have been nice to have somewhere where we could enter a "comment to other reviewers", directed to the group, or at least have some form of alert system when reviews were altered. Being able to ask the observers questions might have been nice, but I do understand the need to keep things timely as well. On that front, reviews varied considerably in length; mine were on the longer and to be honest perhaps overly-long side (I think my longest was nearly 2,000 characters), while one was consistently and ludicrously short.

All this has given me very mixed feelings about my own proposal. On the one hand, I don't think it's anywhere near the worst, and I stand behind the scientific objectives. On the other, I think I concentrated overmuch on the science and not enough on the observational details. Ranking it myself with hindsight I'd probably have to put it in the lower third. It was always a long shot though, so I'll be neither surprised nor disappointed by the presumed rejection. One can but try with these things.

One thing I will applaud very strongly is the instruction to write both strengths and weaknesses of each proposal. All of them, bar none, had some really good points, but it was helpful to remind myself of this and not get carried away when reviewing the ones I didn't much like. Weaknesses were more of a mixed bag; one can always find something to criticise, although in some cases they aren't significant. Still, I found it very helpful to remember that this wasn't an exercise in pure fault-finding.




How in the world one judges which projects to actually undertake, though... that seems to me like the ultimate test of philosophy of science. Groups of experts of various levels have pronounced disagreements about factual statements; some notice entirely different things from others. There's the issue not only of whether the science will be significant, but also of whether the data can be used in different ways from those suggested. That to me remains the fundamental problem with the whole system : one can nearly always expect some interesting results, but predicting what they could be is a fool's game.

Overall, I've found this a positive experience. Reading the full gamut of excellent to poor proposals really gives a clearer idea of what reviewers are looking for, something it's just not possible to get without direct experience. Not for the first time, I wonder a lot about Aumann's Agreement Theorem. If we the reviewers are rational, we ought to be persuaded by each other's arguments. But are we ? This at least could be assessed objectively, with detailed statistics possible on how many reviewers change their ranking when reading other reviews. 

And at the back of my mind is a constant paradoxical tension : a strong feeling that I'm right and others are wrong, coupled with the knowledge that other people are thinking the same thing about me. How do we reconcile this ? For my part, I simply can't. I formulate my judgement and let everyone else do the same, and hope to goodness the whole thing averages out to something that's approximately correct. The paradox is that this in no way makes me feel any the less convinced of my own judgements, even knowing that some fraction of them simply must be wrong.

Other aspects are much more tricky. This is a convergence of different efforts : trying to assess what is true (which scientific claims are factually correct, and why experienced experts still disagree on some points), what will likely benefit the community the most, and how we try to account for the inevitably uncertain and unpredictable findings. As I've said before many times, real, coal-face research is extremely messy. If it isn't already, then I would hope that telescope proposals ought to be an incredibly active field of research for philosophers of science.

Thursday 6 June 2024

The data won't learn from itself

Today I want to briefly mention a couple of papers about AI in astronomy research. These tackle very different questions from the usual sort, which might examine how good LLMs can be at summarising documents or reading figures and the like. These, especially the second, are much more philosophical than that.

The first uses an LLM to construct a knowledge graph for astronomy, attempting to link different concepts together. The idea is to show how, at a very high level, astronomical thinking has shifted over time : what concepts were typically connected and how this has changed. Using distributional semantics, where the meanings of words in relation to other words are encoded as numerical vectors, they construct a very pretty diagram showing how different astronomical topics relate to each other. And it certainly does look very nice – you can even play with it online.
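The underlying machinery is simple enough to sketch : each concept gets a vector, and the strength of the link between two concepts is something like the cosine similarity between their vectors. The toy vectors below are made up by hand purely for illustration; in the paper they'd come from the LLM.

```python
import numpy as np

def cosine_similarity(u, v):
    """How alike two concept embeddings are : ~1 for concepts used in very
    similar contexts, ~0 for unrelated ones (with non-negative vectors)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

concepts = {   # hand-made toy embeddings, not real LLM output
    "dark matter":    np.array([0.9, 0.1, 0.3]),
    "rotation curve": np.array([0.8, 0.2, 0.4]),
    "exoplanet":      np.array([0.1, 0.9, 0.2]),
}
print(cosine_similarity(concepts["dark matter"], concepts["rotation curve"]))  # ~0.98 : strong link
print(cosine_similarity(concepts["dark matter"], concepts["exoplanet"]))       # ~0.27 : weak link
```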

It's quite fun to see how different concepts like galaxy and stellar physics relate to each other, how connected they are and how closely (or at least it would be if the damn thing would load faster). It's also interesting to see how different techniques have become more widely-used over time, with machine learning having soared in popularity in the last ten years. But exactly what the point of this is I'm not sure. It's nice to be able to visualise these things for the sake of aesthetics, but does this offer anything truly new ? I get the feeling it's like Hubble's Tuning Fork : nice to show, but nobody actually does anything with it because the graphical version doesn't offer anything that couldn't be conveyed with text.

Perhaps I'm wrong. I'd be more interested to see if such an approach could indicate which fields have benefited from methods that other fields aren't currently using, or more generally, to highlight possible multi-disciplinary approaches that have been thus far overlooked.


The second paper is far more provocative and interesting. It asks, quite bluntly, whether machine learning is a good thing for the natural sciences : this is very general, though astronomy seems to be the main focus. 

They begin by noting that machine learning is good for performance, not understanding. I agree, but once we do understand, then surely performance improvements are what we're after. Machine learning is good for quantification, not qualitative understanding and certainly not for proposing new concepts (LLMs might, and I stress might, be able to help with this). But it's a rather strange thing to examine, and possibly a bit of a straw man, since I've never heard of anyone thinking that ML could do this. And they admit that ML can be obviously beneficial in certain kinds of numerical problems, but this is still a bit strange : what, if any, qualitative problems is ML supposed to ever help with ?

Not that quantitative and qualitative are entirely separable. Sometimes once you obtain a number you can robustly exclude or confirm a particular model, so in that sense the qualitative requires the quantitative. But, as they rightly point out, as I have myself many times, interpretation is a human thing : machines know numbers but nothing else. More interestingly they note :  

The things we care about are almost never directly observable... In physics, for example, not only do the data exist, but so do forces, energies, momenta, charges, spacetime, wave functions, virtual particles, and much more. These entities are judged to exist in part because they are involved in the latent structure of the successful theories; almost none of them are direct observables. 

Well, this is something I've explored a lot on Decoherency (just go there and search for "triangles"). But I have to ask, what is the difference between an observation and a measurement ? For example we can see the effects of electrical charge by measuring, say, the deflection of a hair in the static field of a balloon, but we don't observe charge directly. But we also don't observe radio waves directly, yet we don't think they're less real than optical photons, which we do. Likewise some animals do appear to be able to sense charge and magnetic fields directly. In what sense, then, are these "real", and in what sense are they just convenient labels we apply ?

I don't know. The extreme answer is that all we have are perceptions, i.e. labels, and no access to anything "real" at all, but this remains (in some ways) deeply unsatisfactory; again, see innumerable Decoherency posts on this, search for "neutral monism". Perhaps here it doesn't matter so much though. The point is that ML cannot extract any sort of qualitative parameters at all, whereas to humans these matter very much – regardless of their "realness" or otherwise. If you only quantify and never qualify, you aren't doing science, you're just constructing a mathematical model of the world : ultimately you might be able to interpolate perfectly but you'd have no extrapolatory power at all.

Tying in with this and perhaps less controversially are their statements regarding why some models are preferred over others :

When the expansion of the Universe was discovered, the discovery was important, but not because it permitted us to predict the values of the redshifts of new galaxies (though it did indeed permit that). The discovery was important because it told us previously unknown things about the age and evolution of the Universe, and it confirmed a prediction of general relativity, which is a theory of the latent structure of space and time. The discovery would not have been seen as important if Hubble and Humason had instead announced that they had trained a deep multilayer perceptron that could predict the Doppler shifts of held-out extragalactic nebulae.

Yes ! Hubble needed the numbers to formulate an interpretation, but the numbers themselves don't interpret anything. A device or mathematical model capable of predicting the redshifts from other data, without saying why the redshifts take the values that they do, without relating it to any other physical quantities at all, would be mathematical magic, and not really science.

For another example, consider the discovery that the paths of the planets are ellipses, with the Sun at one focus. This discovery led to extremely precise predictions for data. It was critical to this discovery that the data be well explained by the theory. But that was not the primary consideration that made the nascent scientific community prefer the Keplerian model. After all, the Ptolemaic model preceding Kepler made equally accurate predictions of held-out data. Kepler’s model was preferred because it fit in with other ideas being developed at the same time, most notably heliocentrism.

A theory or explanation has to do much more than just explain the data in order to be widely accepted as true. In physics for example, a model — which, as we note, is almost always a model of latent structure — is judged to be good or strongly confirmed not only if it explains observed data. It ought to explain data in multiple domains, and it must connect in natural ways to other theories or principles (such as conservation laws and invariances) that are strongly confirmed themselves.  

General relativity was widely accepted by the community not primarily because it explained anomalous data (although it did explain some); it was adopted because, in addition to explaining (a tiny bit of new) data, it also had good structure, it resolved conceptual paradoxes in the pre-existing theory of gravity, and it was consistent with emerging ideas of field theory and geometry.

Which is a nice summary. Some time ago I'd almost finished a draft of a much longer post based on this far more detailed paper, which considers the same issues, but then Blogger lost it all and I haven't gotten around to re-writing the bloody thing. I may yet try. Anyway, the need for self-consistency is important, and doesn't throttle new theories in their infancy as you might expect : there are ways to overturn established findings independent of the models.

The rest of the paper is more-or-less in line with my initial expectations. ML is great, they say, when only quantification is needed : when a correlation is interesting regardless of causation, or when you want to find outliers. So long as the causative factors are well-understood (and sometimes they are !) it can be a powerful tool for rapidly finding trends in the data and points which don't match the rest. 
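Outlier-hunting in particular needs no causal model at all. Something like the sketch below (a fake two-column catalogue and scikit-learn's IsolationForest, chosen here just as one possible tool) simply flags the points that don't resemble the rest, which you then go and inspect by hand.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Fake catalogue : two observables (say, a colour and a magnitude)
rng = np.random.default_rng(0)
normal = rng.normal(loc=[1.0, 18.0], scale=[0.1, 0.5], size=(1000, 2))
weird  = np.array([[2.5, 14.0], [0.1, 22.0]])      # two objects that don't follow the trend
catalogue = np.vstack([normal, weird])

# No physics needed : the forest just flags points unlike the rest for inspection
labels = IsolationForest(contamination=0.01, random_state=0).fit_predict(catalogue)
print(np.where(labels == -1)[0])   # candidate outliers (the planted ones should be among them)
```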

If the trends are not well-understood ahead of time, it can reinforce biases, in particular confirmation bias, by matching what was expected in advance. Similarly, if there are rival explanations possible, ML doesn't help you choose between them if they don't predict anything significantly different. But often, no understanding is necessary. To remove the background variations in a telescope's image it isn't necessary even to know where all the variations come from : it's usually obvious that they are artifacts, and all you need is a mathematical description of them. Or more colourfully, "You do not have to understand your customers to make plenty of revenue off of them."
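The background example works the same way : you don't need to know whether the large-scale variation is scattered light, flat-fielding, or something else, only that it's smooth. A crude sketch (real pipelines would mask the sources first, but the principle is the same) :

```python
import numpy as np
from scipy.ndimage import median_filter

def subtract_background(image, box=64):
    """Estimate the smooth, large-scale background with a broad median filter
    and subtract it -- a purely mathematical description of the artifact,
    with no model of where it actually comes from."""
    background = median_filter(image, size=box)
    return image - background
```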

Wise words. Less wise, perhaps only intended as a joke, are the comments about "the unreasonable effectiveness of ML" : that it's remarkable that these industrial-grade mathematical processes are any good in situations for which they were never designed. But I never even got around to blogging Wigner's famous "unreasonable effectiveness" essay because it seemed worryingly silly.

Finally, they note that it might be better if natural sciences were to shift their focus away from theories and more towards the data, and that the degeneracies in the sciences undermine the "realism" of the models. Well, you do you : it's provocative, but on this occasion, I shall allow myself not to be provoked. Shut up and calculate ? Nah. Shut up and contemplate.
