It's time for another round of evaluating whether ChatGPT is actually helpful for astronomical research.
My previous experiments can be found here, here, and here. The first two links looked at how well ChatGPT and Bing performed when analysing papers I myself know very well, with the upshot being an extreme case of hit-and-miss : "occasional flashes of genuine brilliance wrapped in large doses of mediocrity and sprinkled with total rubbish", to quote myself. All conversations had at least one serious flaw (though in one case this is arguable : that one was factually and scientifically perfect, but had crippling format errors).
The third link tested ChatGPT's vision analysis by trying to get it to do source extraction, which was a flat-out failure. Fortunately there have been other tests on this which show it does pretty badly in more typical situations as well, so I'm not going to bother redoing this.
With the release of ChatGPT-5, however, I do want to redo the analysis of papers. If I can have ChatGPT give me reliable scientific assessments of papers, that's potentially a big help in a number of ways, at the very least in determining if something is going to be worth my time to read in full. For this round I picked a new selection of papers, since my last tests were a couple of years ago and I can't claim I still remember the details of the originals as well as I did back then.
Because all the papers cover different topics, there isn't really a good way to standardise the queries. So these tests are designed to mimic how I'd use it in anger, beginning with a standardised query but then allowing more free-ranging, exploratory queries. There's no need for any great numerical precision here, but if I can establish even roughly how often GPT-5 produces a result which is catastrophically wrong or useless, that's useful information.
I began each discussion with a fairly broad request :
I'd like a short summary of the paper's major findings, an evaluation of its scientific importance and implications, what you think the major weaknesses (if any) might be and how they could be addressed. I might then ask you more detailed, specific questions. Accuracy is paramount here, so please draw your information directly from the paper whenever possible – specify your sources if you need to use another reference.
Later I modified this to stress I was interested in the strengths and weaknesses of the scientific interpretation as well as methodology, as GPT-5 seemed to get a little hung up on generic issues – number of sources, sensitivity, that sort of thing. I followed up the general summaries with specific questions tailored to each individual paper as to what they contained and where, this being a severe problem for earlier versions. At no point did I try to deliberately break it – I only tried to use it.
Below, you can find my summaries of the results of discussions about five papers together with links to all of the conversations.
0) To Mine Own Research Be True ?
But first, a couple of examples where I can't share the conversations because they involve current, potentially publishable research from our institute* (I gave some initial comments already here). I decided to really start at the deep end with a query I've tried many times with ChatGPT previously and got very little out of it : to have it help with a current paper I'm writing, asking it to assess the merits and problems alike – essentially acting as a mock reviewer.
* Of which the management, like the rest of us, is generally sensible about such things. We all recognise the dangers of hallucinations, the usefulness and limitations of AI-generated code, etc. Nobody here is a fanboy nor of the anti-AI evangelical sort.
Previously I'd found it to be very disappointing at this kind of task. It tended to get hung up on minutiae, not really addressing wider scientific points at all. For example, if you asked it which bits should be cut, it might pick out the odd word or sentence or two, but it wouldn't say if a whole section was a digression from the main topic. It didn't think at scale, so to speak. It's hard to describe precisely, but it felt like it had no understanding of the wider context at all; it discussed details, not science. It wasn't that using it for evaluations was of no value whatsoever, but it was certainly questionable whether it was a productive use of one's time.
With the current paper I have in draft, ChatGPT-5's response was worlds apart from its previous meagre offerings. It described itself as playing the role of a "constructively horrible" reviewer (its own choice of phrase) and it did that, I have to say, genuinely very well. Its tone was supportive but not sycophantic. It suggested highly pertinent scientific critiques, such as the discussion on the distance of a galaxy – which is crucial for the interpretation in this case – being too limited, with alternatives being fully compatible with the data. It told me when I was being over-confident in phrasing, gave accurate indications of where I was overly repetitive, and came up with perfectly sensible, plausible interpretations of the same data.
Even its numbers were, remarkably, actually accurate* (unlike others, I haven't seen it make the classic errors in basic facts and figures, such as miscounting specified letters even in fictional words; I tried reproducing some of these errors multiple times but couldn't). At least, that is, the ones I checked – but all of those were on the money, a far cry indeed from older versions ! Similarly, citations were all correct and relevant to its claims : none were total hallucinations. That is a big upgrade.
* ChatGPT itself claims that it does actual proper calculations whenever the result isn't obvious (like 2+2, for which training data is enough) or accuracy is especially important.
When I continued the discussion... it kept giving excellent, insightful analysis; previous versions tended to degenerate into incoherency and stupidity in long conversations. It wasn't always right – it made one major misunderstanding in response to one inquiry that I thought it should have avoided* – but it was right more than, say, 95% of the time, and its single significant misunderstanding was very easily corrected**. If it was good for bouncing ideas off before, now it's downright excellent.
* This wasn't a hallucination as it didn't fabricate anything, it just misunderstood the question.
** And how many conversations with real people feature at least one such difficulty ? Practically all of them in my experience.
The second unshareable test was to feed it my rejected ALMA proposal and (subsequently) the reviewer responses. Here too the tone of GPT-5 shines. It phrased things very carefully but without walking on eggshells, explaining what the reviewer's thought processes might have been and how to address them in the future without making me feel like I'd made some buggeringly stupid mistake. I asked it initially to guess how well the proposal would have been ranked and it said second quartile, borderline possibility for acceptance... praiseworthy and supportive, but not toadying, and not raising false hopes.
When I told it the actual results (lowest quartile, i.e. useless), it agreed that some of the comments were objectionable, but gave me clear, precise instructions as to how they could be countered. Those are things I would find extremely difficult to do on my own : I read some of the stupider claims ("the proposal flow feels a bit narrative"... FFS, it damn well should be narrative and I will die on this hill) and just want to punch the screen*, but GPT-5 gave me ways to address those concerns. It said things like, "you and I know that, but...".
* No, not really ! I just need to bitch about it to people. Misery loves company, and in a perverse bit of luck, nobody in our institute got any ALMA proposals accepted this year either.
It made me feel like these were solvable problems after all. For example, it suggested the rather subtle reframing of the proposal from detection experiment (which ALMA disfavours) to hypothesis testing (which is standard scientific practice that nobody can object to). This is really, really good stuff, and the insight into what the reviewers might have been thinking, or not understanding, made me look at the comments in a much more upbeat light. Again, it had one misunderstanding about a question, but again this was easily clarified and it responded perfectly on the second attempt.
On to the papers !
1) The Blob(s)
This paper is one of the most interesting I've read in recent years, concerning the discovery of strange stellar structures in Virgo which the authors attribute to ram pressure dwarfs. Initially I tried to feed it the paper by providing a URL link, but this didn't work. As I found out with the second paper, trying to do it this way is simply a mistake : in this and this alone does GPT-5 consistently hallucinate. That is, it claims it's done things which it hasn't done, reporting wrong information and randomly giving failure messages.
Not a great start, but it gets better. When a document is uploaded, hallucinations aren't quite eliminated, but good lord they're massively reduced compared to previous versions. It's weird that its more general web search capabilities appear rather impressive, but give it a direct link and it falls over like a crippled donkey. You can't have everything, I guess.
Anyway, you can read my full discussion with ChatGPT here. In brief :
- Summary : Factually flawless. All quoted figures and statements are correct. It chose these in a sensible way to give a concise summary of the most important points. Both scientific strengths and weaknesses are entirely sensible, though the latter are a little bland and generic (improve sensitivity and sample size, rather than suggesting alternative interpretations).
- Discussion : When pressed more directly for alternative interpretations, it gave sensible suggestions, pointing out pertinent problems with the methodology and data that allow for this.
- Specific inquiries : I asked it about the AGES clouds that I know are mentioned in this paper (I discovered them) and here I encountered the only real hallucination in all the tests. It named three different AGES clouds that are indeed noteworthy because they're optically dim and dark ! These are not mentioned in this paper at all. When I asked it to check again more carefully, it reported the correct clouds which the authors refer to. When I asked it about things I knew the paper didn't discuss, it correctly reported that the paper didn't discuss this.
- Overall : Excellent, once you accept the need to upload the document. Possibly the hallucination might have been a holdover from that previous attempt to provide the URL, and in my subsequent discussions I emphasised more strongly the need for accuracy and to distinguish what the paper contained from GPT-5's own inferences. This seems to have done the trick. Even with these initial hiccups, however, the quality of the scientific discussions was very high. It felt like talking with someone who genuinely knew what the hell they were talking about.
2) The Smudge
This one is about finding a galaxy so faint the authors detected it by looking for its globular clusters. They also find some very diffuse emission in between them, which is pretty strong confirmation that it's indeed a galaxy of sorts.
At this point I hadn't learned my lesson. Giving ChatGPT a link caused it to hallucinate in a sporadic, unpredictable way. It managed to get some things spot on but randomly claimed it couldn't access the paper at all, and invented content that wasn't present in the paper. Worse, it basically lied about its own failures.
You can read my initial discussion here, but frustrated by these problems, I began in a second thread with an uploaded document here. That one, I'm pleased to say, had no such issues.
- Summary : Again, flawless. A little bland, perhaps, but that's what I wanted (I haven't tried asking it for something more sarcastic). The content was researcher level rather than general public but again I didn't ask for outreach content. It correctly highlighted possible flaws like the inferred high dark matter content being highly uncertain due to an extremely large extrapolation from a relatively novel method.
- Discussion : In the hallucinatory case, it actually came up with some very sensible ideas even though these weren't in the paper. For example, I asked it about the environment of the galaxy and it gave some plausible suggestions on how this could have contributed to the object's formation – the problem was that none of this was in the paper as it claimed. Still, the discussion on this – even when I pushed it to ideas that are very new in the literature – was absolutely up to scratch. When I suggested one of its ideas might be incorrect, it clarified what it meant without changing the fundamental basis of its scenario in a way that convinced me it was at least plausible : this was indeed a true clarification, not a goalpost-shifting modification. It gave a detailed, sensible discussion of how tidal stripping can preferentially affect different components of a galaxy, something which is hardly a trivial topic.
- Specific inquiries : When using the uploaded document, this was perfect. Numbers were correct. It reported correctly both when things were and weren't present in the article, with no hallucinations of any kind. It expanded on my inquiries into more general territory very clearly and concisely.
- Overall : Great stuff. Once again, it felt like a discussion with a knowledgeable colleague who could both explain specific details but also the general techniques used. Qualitatively and quantitatively accurate, with an excellent discussion about the wider implications.
3) The Dark Galaxy Candidate

- Summary : Once again, flawless. Maybe a little bland and generic with regard to other interpretations, but it picked out the major alternative hypothesis correctly. And in this case, nobody else has come up with any better ideas, so I wouldn't expect it to suggest anything radical without explicitly prompting it to.
- Discussion : It correctly understood my concern about whether the dynamical mass estimates are correct and gave a perfect description of the issue. This wasn't a simple case of "did they use the equation correctly", but a contextual "was this the correct equation to be using, and were the assumptions correct ?", relating not just to individual objects but also their environment (the generic estimators at stake are sketched after this list). Productive and insightful.
- Specific inquiries : Again flawless, not claiming the authors said anything they didn't or claiming they didn't say anything they did. Numbers and equations used were reported correctly.
- Overall/other : Superb. I decided to finish by asking a more social question – how come ALFALFA have been so cagey about the "dark galaxy" term in the past (they use the god-awful "almost darks", which I loathe) but here at least one team member is on board with it ? It came back with answers which were both sociologically (a conservative culture in the past, a change of team here) and scientifically (deeper optical data with more robust constraints) sensible. It also ended with the memorable phrase, "[the authors are] happy to take the “dark galaxy” plunge — but with the word “candidate” as a fig leaf of scientific prudence."
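For context, here's the kind of estimator usually at stake in this sort of argument. These are the generic textbook forms – my own addition, not necessarily the exact equations used in the paper :

```latex
% Generic dynamical mass estimators (textbook forms, not necessarily
% those used in the paper).
% Rotation-supported system :
M_{\mathrm{dyn}} \simeq \frac{V_{\mathrm{circ}}^{2}\, R}{G}
% Dispersion-supported system (virial-style estimate, k ~ 5) :
M_{\mathrm{dyn}} \simeq \frac{k\, \sigma^{2}\, R}{G}
```

Both assume the system is in equilibrium, that the tracers sample the potential out to radius R, and that the measured velocities reflect gravity rather than, say, tides or projection effects – exactly the sort of contextual question at issue here.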
4) The Cloud Of Unknown Origin

- Summary : Factually perfect, though it didn't directly state that the origin of the object was unknown. Arguably "challenges simple ram-pressure stripping scenarios and suggests either an intergalactic or pre-cluster origin" implies this, but I'd have preferred it to state it more directly. Nevertheless, the most crucial point, that previous suggestions don't really hold up, came through very clearly.
- Discussion : Very good, but not perfect. While it didn't get anything wrong, it missed out the claims in the paper against the idea of ram pressure dwarfs more generally (about the main target object of the study it was perfect). With some more direct prompting it did eventually find this, and the ensuing discussion was productive, pointing out some aspects of this I hadn't considered. I'm not entirely convinced this was correct, but no more than I doubt some of the claims made in the paper itself – PhD level hardly means above suspicion, after all. And the discussion on the dynamics of the object was extremely useful, with ChatGPT again raising some points from the paper I'd completely missed when I first read it; the discussion on the survival of such objects in relation to the intracluster medium was similarly helpful.
- Specific inquiries : Aside from the above miss, this was perfect. When I asked it to locate particular numbers and discuss their implications it did so, and likewise it correctly reported when the paper didn't comment on a topic I asked about.
- Overall : Not flawless, but damn good, and certainly useful. One other discussion point caused a minor trip-up. When I brought in a second paper (via upload) for comparison and mentioned my own work for context, it initially misinterpreted this and appeared to ignore the paper. This was easily caught and fixed with a second prompt, and the results were again helpful. By no means was this a hallucination – it felt more like it was getting carried away with itself.
5) An Ultra Diffuse Galaxy That Spins Too Slowly
This was a paper that I'd honestly forgotten all about until I re-read my own summary. It concerns a UDG which initial observations indicated lacked dark matter entirely; then another team came along and found that this would be unsustainable, and that it was probably just an inclination angle measurement error. Then the original team came back with new observations and simulations, and they found it does have some dark matter after all – at a freakishly low concentration, but enough to stabilise it. The ChatGPT discussion is here.
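To illustrate why the inclination matters so much (my own back-of-the-envelope, not taken from the paper) : the line-of-sight velocity only gives you the rotation speed after correcting for inclination, and the dynamical mass scales with the square of that speed :

```latex
% Inclination correction and its effect on the inferred mass :
V_{\mathrm{rot}} = \frac{V_{\mathrm{los}}}{\sin i},
\qquad
M_{\mathrm{dyn}} \propto V_{\mathrm{rot}}^{2} \propto \frac{1}{\sin^{2} i}
% Example : assuming i = 40 degrees when the true value is 20 degrees
% underestimates M_dyn by sin^2(40)/sin^2(20), a factor of about 3.5.
```

So for a nearly face-on galaxy, a modest error in a hard-to-measure angle is easily enough to make the dark matter appear or disappear.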
- Summary : As usual this was on the money, bringing in all the key points of the paper and giving a solid scientific assessment and critique. Rather than dealing with trivialities like sample size or simulation resolution, it noted that maybe they'd need to account more for the effects of environment, or use different physics for the effects of feedback on star formation.
- Discussion : As with the fourth paper, this was again excellent but not quite complete. It missed out one of my favourite* bits of speculation in the paper : that this object could tell us something directly about the physical nature of dark matter. It did get this with direct prompting, but I had to be really explicit about it. To be fair, this is just one paragraph in the whole article, but reading between the lines I felt it was a point the authors really wanted to make. On the other hand, that's just my opinion and it certainly isn't the main point of the work.
- Specific inquiries : Yep, once again it delivered the goods. No inaccuracies. It reported the crucial points correctly and described the comparisons with previous works perfectly. Again, it didn't report any claims the authors didn't make.
- Overall : Excellent. I allowed myself to branch out to a wider discussion of the cold dark matter paradigm and it came back with some great papers I should check out regarding stability problems in MOND. It back-pedalled a little on discussions about the radial acceleration relation, but this was more a nuanced clarification than a revision of its claims : in CDM the RAR emerges from tuning the baryonic physics, but since that tuning targets other parameters rather than the RAR directly, the relation comes out essentially for free; in MOND the RAR is a built-in feature (the relation itself is sketched below). If that's not a PhD level discussion then I don't know what is.
* More generally, it seems pretty good at picking up on the same stuff that I do, but it would be silly to expect 100% alignment.
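For anyone who wants the actual relation being argued over, this is the empirical fitting function from McGaugh, Lelli & Schombert (2016) – background I'm adding here, not something from the conversation :

```latex
% Radial acceleration relation (McGaugh, Lelli & Schombert 2016) :
g_{\mathrm{obs}} = \frac{g_{\mathrm{bar}}}{1 - e^{-\sqrt{g_{\mathrm{bar}}/g_{\dagger}}}},
\qquad g_{\dagger} \approx 1.2 \times 10^{-10}\ \mathrm{m\,s^{-2}}
% g_obs : observed centripetal acceleration; g_bar : the acceleration
% predicted from the baryons alone.
```

In MOND this shape is essentially the theory's interpolation function; in CDM it has to emerge from the galaxy formation physics, which is the crux of the tuning argument above.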
Summary and Conclusions
On my other blogs I've gone on about the importance of thresholds. Well, we've crossed one. Even the more positive assessments of GPT-5 tend to label it as an incremental upgrade, but I violently disagree. I went back and checked my earlier discussion with GPT-4o about my ALMA proposal and confirmed that it was mainly spouting generic, useless crap... GPT-5 is a massive improvement. It discusses nuanced and niche scientific issues with a robust understanding of their broader context. In other threads I've found it fully capable of giving practical suggestions and calculations which I've found just work. Its citations are pertinent and exist.
This really does feel like a breakthrough moment. At first it was a cool tech demo, then it was a cool toy. Now it's an actually useful tool for everyday use – potentially an incredibly important one. Where people are coming from when they say it gets basic facts wrong I've honestly no idea. The review linked above says it gave a garbage response when fed a 160+ page document and was anything but PhD-level, but in my tests with typical length papers (generally 12-30 pages) I would absolutely and unequivocally call it PhD level. No question of it.
This is not to say it's perfect. For one thing, even though there's a GUI setting for this, it's very hard to get it to stop offering annoying follow-up suggestions of things it could do. This is why you'll see my chats with it sometimes end with "and they all live happily ever after", because I had to put that in my custom instructions to give it an alternative ending (in one memorable case it came up with "one contour to rule them all and in the darkness bind them"*). Even then it doesn't always work. And it always delivers everything in bullet-point form : no doubt this can be altered, but I haven't tried... generally I don't hate this though.
* I really like the personality of GPT-5. It's generally clear and to the point, straightforward and easy to read, but with the occasional unexpected witticism that keeps things just a little more engaging.
Of course, it does still make mistakes. Misinterpretations of the questions appear to be the most common, but these are very easily spotted and fixed. Incompleteness seems to be less common but more serious; still, I'd stress that expecting perfection from anything is extremely foolish. And actual hallucinations of the kind that still plagued GPT-4 are now nearly non-existent, provided you give it rigorous instructions.
So that's my first week with GPT-5, a glowing success and vastly better than I was expecting. Okay, people on reddit, I get that you missed the sycophantic ego-stroking personality of GPT-4, so whine about how your virtual friend has died all you want. But all these claims that it's got dumber, and has an IQ barely above that of a Republican voter... what the holy hell are you talking about ? That makes NO sense to me whatsoever.
Anyway, I've put my money where my mouth is and subscribed to Plus. Watch this space : in a month I'll report back on whether it's worth it.