Sister blog of Physicists of the Caribbean. Shorter, more focused posts specialising in astronomy and data visualisation.

Thursday, 28 August 2025

Signal boost

One man's trash is another man's treasure, or so the old saying goes. And in astronomy, one man's signal is another man's noise. 

The classic example of this is dust in the Milky Way. If you're interested in dust, then you're a deeply weird person... or just interested in star formation. Dust "grains", which are actually about the size of smoke particles, are thought to be critical sites for star formation because they allow atomic gas to lose energy, cool, and collide with each other to form molecular gas. This is much denser than atomic gas, so eventually this can lead to the cloud collapsing to form a star.

But if you're less of a weirdo, dust just gets in the way. It ruins our majestic sky by blocking our view of all the stars, especially along the plane of the disc of the Galaxy. Some regions are much worse than others, but it's present at some level pretty much everywhere across the sky.

In radio astronomy we have a much more subtle and interesting problem. Of course we always want our observations to be as deep and sensitive as possible. But sometimes, it turns out, the noise in our data can actually be to our advantage – though it comes at a price.

Consider a typical spectrum of a galaxy as detected in the HI line. If you aren't familiar with this, take a look at my webpage if you want details. Basically, it shows us how brightly the gas in a galaxy is emitting (that is, how much of it there is) at any particular velocity. Even without knowing this rudimentary bit of information though, you can probably immediately identify the feature of interest in the signal :

All the spectra shown in this post are artificial, generated with a simple online code you can use yourself here.

We don't need to worry here about why the signal from the galaxy has the particular structure that it does. No, what I want to talk about today is the noise. That's those random variations outside the big bright bit in the middle.

This example shows a pretty nice detection. It's easy to see exactly where the profile of the galaxy ends and the noise begins. But even within the galaxy, you can see those variations are still present : they're just lifted up to higher values by the flux from the galaxy. Basically the galaxy's signal is simply added to the noise.

Now if you still have an analogue radio, you'll know that if you don't get the tuning just right, you'll hear the sounds from your station but only against a loud and annoying background hiss. The worse the tuning, the worse the noise. So you might well think that the following claim is more than a little dubious :

Fainter signals can be easier to detect in noisier data.

So counterintuitive is this that one referee said it "makes no sense whatsoever", doubling down to label it "bizarre" and "not just counterintuitive, but nonsensical".

This is wrong. I'll point out that the claim comes not from me but from Virginia Kilborn's PhD thesis (she's now a senior professor). So how does it work ?

The answer is actually very simple : signal is added to noise. That is, regardless of how noisy the data is, the signal from the galaxy is still there. Let me try and do this one illustratively. Suppose we have a pure signal, completely devoid of noise, and for argument's sake we'll give it a top-hat profile (about a quarter of galaxies have this shape, so this isn't anything unusual) :

The "S/N" axis shows the signal-to-noise ratio : a measure of how bright things look given the sensitivity of the data. The numbers in this case are garbage because I set the noise to zero.

Now let's add it to two different sets of noise, purely random (Gaussian), of exactly the same statistical strength but just different in their exact channel-to-channel values :


Note that even with this purely random noise, you can still see apparent ripples and variations in the baseline outside the source : even random noise, to the human eye, looks structured.


Oh ! What happened there ? Why is the second signal so much clearer than the first ? You can still see the first one, to be sure, but it's marginal, and could easily be mistaken for some weird structure in the baseline. The second isn't great either, but it looks a lot better than the first.

Noise is typically random. That means some parts of it will have a bit more flux while other parts will have a bit less. If we add our signal to the higher-flux bits, the total apparent flux in our source gets higher. That is, the real flux in our source obviously doesn't change, but what we would measure would be greater than if the noise wasn't there. And of course the opposite can happen too : noise dimming, where a chance alignment with the lower-flux bits makes the source harder to detect rather than easier.

Noise boosting (shown in the carefully-chosen example above), on the other hand, is no less important but far less expected. Every once in a while, a faint signal will happen to align with some bright parts of the noise, turning a marginal signal into a clearer one. This doesn't really work for the sorts of audio signals you get on a household radio set, as those are much too complex, but the signal from a galaxy is a good deal simpler. And all we need to detect it (for this basic example at least) is pure flux, which the noise can readily provide.
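If you'd like to play with this yourself, here's a minimal sketch of the idea in Python (my own throwaway code, not the online generator linked above, with numbers picked purely for illustration) : the same faint top-hat signal added to two noise realisations of identical statistical strength.

```python
import numpy as np

rng = np.random.default_rng(0)

n_chan = 200                  # number of velocity channels
rms = 1.0                     # noise level (arbitrary flux units)
signal = np.zeros(n_chan)
signal[95:105] = 1.5          # a faint top-hat profile, 1.5 sigma per channel

# Two noise realisations : same statistics, different channel-to-channel values
noise_a = rng.normal(0.0, rms, n_chan)
noise_b = rng.normal(0.0, rms, n_chan)

for name, noise in (("A", noise_a), ("B", noise_b)):
    spec = signal + noise     # the signal is simply added to the noise
    peak = spec[95:105].max() / rms
    mean = spec[95:105].mean() / rms
    print(f"Realisation {name} : peak S/N = {peak:.2f}, "
          f"mean S/N inside the profile = {mean:.2f}")
```

Run it a few times with different seeds and you'll see the same signal come out boosted in some realisations and dimmed in others, exactly as in the plots above.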

(As an aside, you might notice that the actual peak levels in these cases aren't much different, though the average level inside the source profile is higher in the second case. While peak levels most certainly can be affected by noise boosting and dimming, what's absolutely crucial here is what detection method we're using to find the signals. I'll return to this below.)

Of course, there are limits to this : it will only work for signals which are comparable in strength to the noise. As the noise level gets higher, the random variations will increasingly tend to "wash out" our signal. Now the flux levels of the signals we can receive vary hugely depending on the nature of our data set, but a signal-to-noise ratio (S/N, or sometimes SNR) means the same thing in any of them. That is, a signal which is ten times the typical noise value (the rms) has the same statistical significance in any data set, even though the actual flux value it corresponds to can be totally different. Expressing signal strength in terms of the noise level therefore makes things very convenient : a five sigma (5σ) source just means something that's five times brighter than the typical noise level.

So suppose we have a 2σ source which happens to align with a 3σ peak in the noise : bam, we've got ourselves a quite respectable 5σ detection*. But if we keep the flux level of the signal we're adding the same and increase the noise level, then the S/N will go down. Instead of adding 2σ to 3σ, we'll be adding ever lower and lower values : the "sigma level" of the noise won't change, but that of the signal certainly will. Pretty quickly we won't be shifting that 3σ peak from the noise by any appreciable degree. We'll be adding the same flux value but to an ever-greater starting level.

* Sometimes five sigma is quoted as a sort of scientifically universal gold-standard discovery threshold. This is simply not true at all, because if you have enough data, you'll get that level of signal just by chance alone. Far more importantly, the noise in real data is often far from being purely random, so choosing a robust discovery threshold requires a good knowledge of the characteristics of the data set.

As a possibly pointless analogy, consider lions. If you go from having no lions to one lion, you've just put yourself in infinitely more danger. If you add a second lion you're in even more trouble. But if you've got ten lions and add one more, you won't really notice the difference.
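To put some crude numbers on that (lions included), here's a trivial sketch : the same source flux added on top of a 3σ noise peak, for ever-higher noise levels. The values are completely arbitrary.

```python
source_flux = 2.0        # the source's flux, fixed in absolute terms
noise_peak_sigma = 3.0   # a noise peak that happens to be 3 sigma high

for rms in (1.0, 2.0, 4.0, 8.0):
    boost = source_flux / rms            # the source's own "sigma level"
    combined = noise_peak_sigma + boost  # what the aligned peak looks like
    print(f"rms = {rms} : source adds {boost:.2f} sigma, "
          f"combined peak = {combined:.2f} sigma")
```

At low noise the 2σ source turns that peak into a respectable 5σ feature; at high noise it barely nudges it.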

"But hang on," you might say, "surely that means that your earlier claim that fainter signals are more detectable in noisier data can't possibly be correct ?". A perfectly valid question ! The answer is that it depends on how we go about detecting the signals. The details are endless, but two basic techniques are to search using either a S/N threshold or a simple flux threshold. These can give very different results.

Now if you use S/N, which is generally a good idea, then indeed signals of lower flux levels generally don't do well in increasingly noisier data, because of the ever-smaller relative increase in the signal. And of course, the tendency of the noise not merely to obscure but to actively suppress the signal will get ever greater, since there's just as much chance of aligning with a low-value region of the noise as with a high-value region.

But S/N is not the only way of detecting signals. You might opt instead to use a simple flux threshold : it's computationally cheaper, easier to program, and most importantly of all it gives you more physically meaningful results. If you do it that way, then it's a different story. When you add a signal to noise the flux level always increases, making it much easier to push the flux above your detection threshold. Which makes noise boosting very much easier to explain.

Note here the change of axes values compared to the previous examples. All I did was increase the noise level by ~25% and here both peak flux and S/N levels have increased. It might not look easier to detect visually, but statistically, by some measures this one is more significant than the previous cases !
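To see how this plays out statistically, here's a rough Monte Carlo sketch with toy numbers of my own choosing (not anything from real data) : a faint top-hat source searched for with a fixed peak-flux threshold at different noise levels. The detection rate climbs with the noise... but so does the rate of purely spurious peaks, which is the price mentioned above.

```python
import numpy as np

rng = np.random.default_rng(42)

n_chan = 200                 # channels per synthetic spectrum
sig_chans = slice(95, 105)   # a 10-channel top-hat signal
sig_flux = 1.0               # flux per channel of the signal (arbitrary units)
flux_threshold = 2.0         # fixed peak-flux detection threshold
n_trials = 5000

def rates(rms):
    """Return (detection rate, false-positive rate) at a given noise level."""
    detected = spurious = 0
    for _ in range(n_trials):
        noise = rng.normal(0.0, rms, n_chan)
        spec = noise.copy()
        spec[sig_chans] += sig_flux
        if spec[sig_chans].max() > flux_threshold:
            detected += 1
        if noise.max() > flux_threshold:   # a peak in a source-free spectrum
            spurious += 1
    return detected / n_trials, spurious / n_trials

for rms in (0.5, 1.0, 1.5):
    det, spur = rates(rms)
    print(f"rms = {rms} : source found {100 * det:.0f}% of the time, "
          f"spurious peak somewhere {100 * spur:.0f}% of the time")
```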

Even more interesting is the so-called Eddington bias. What this means is that any survey will tend to overestimate the flux of its weakest signals : those signals which are so faint that they can only be detected at all thanks to chance alignments with the noise. The result is that when someone comes along and does a deeper survey, they'll often find that those sources have less flux than reported in the earlier, less-sensitive data.
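Here's an equally crude toy demonstration of the Eddington bias, again with made-up numbers : a population of identical faint sources, of which only the ones that got lucky with the noise make it above the detection threshold – and those, on average, look brighter than they really are.

```python
import numpy as np

rng = np.random.default_rng(1)

true_flux = 3.0      # every source genuinely has a 3 sigma peak flux
threshold = 5.0      # but the survey only accepts peaks above 5 sigma
n_sources = 100_000

# Measured flux = true flux plus whatever the noise adds (or subtracts)
measured = true_flux + rng.normal(0.0, 1.0, n_sources)
detected = measured[measured > threshold]

print(f"Fraction detected : {100 * len(detected) / n_sources:.1f}%")
print(f"Mean measured flux of the detections : {detected.mean():.2f} sigma "
      f"(true value : {true_flux} sigma)")
```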

There are plenty of other subtleties. The signal might not need to be perfectly superimposed on a noise peak : if it's merely adjacent to one, that can create the appearance of a wider, brighter signal which can be easier for some algorithms (and people !) to detect. And of course, while we'd like noise to be perfectly uniform and random, this isn't always the case. Importantly, the rms value doesn't tell us anything at all about the coherency of structures in the data, as so powerfully shown by the ferocious Datasaurus.

For the eye the effects of this can be extremely complex, and are poorly understood in astronomy. If you have very few coherent noise structures, for example, you might think this nice clean background would make fainter structures easier to spot. But actually, I have some tentative evidence that this isn't always the case : the eye can be lulled into a false sense of emptiness, whereas if there are a few obvious structures to attract attention, you start to believe that things are present so you're more likely to identify structures. No doubt if your data was dominated by structures then the eye would, in effect, perceive them as background noise again and the effect would diminish, but this is something that needs more investigation. My guess is there's a zone in which you have enough to encourage a search but few enough that they don't obscure the view. 

Ultimately, getting deeper data is always the better option. For every source noise-boosted to detectability, there'll be another which is suppressed and hidden. But this explains very neatly that supposedly "nonsensical" result*, especially if your source-finding routine is based on peak flux : of course if you add signal to noise and keep the same flux threshold in your search, you're more likely to find the fainter signals in the noisier data... up to a point. Set your threshold too low and you'll just find spurious detections galore, but hit the sweet spot and you'll find signals you otherwise couldn't.

* Kudos to the referee for accepting the explanation; I never heard of any of this until a couple of years ago either. Extragalactic astronomy is full of stuff which isn't that difficult but seems feckin' confusing when you first encounter it because it isn't formally taught in any lectures !

There's nothing weird about noise boosting then. Mathematically it makes complete sense. But when you first hear about it it sounds perplexing, which just goes to show how deceptively simple radio astronomy can be. Noise boosting, at least at a basic level, is quite simple, but simple isn't the same as intuitive.

Friday, 22 August 2025

ChatGPT-5 versus... DeepSeek

My excitement for ChatGPT-5 continues to defy the Will Of The Internet. Sod y'all, this is feckin' awesome ! This is the upgrade I've been waiting for.

It seems only fair to continue testing it, however. It performed extremely well against my own knowledge, but how does it compare to other chatbots ? After all, there's not much point in me getting all excited for this one in particular if others are actually doing comparably well.

I was intrigued by this article which pits it against DeepSeek, mainly in more real-life situations than what I've been testing. To save you a click, the "clear winner" was DeepSeek, with the unexpected proviso that GPT-5 is better at creativity and explanations – which again is in defiance of the judgement of reddit, who think that GPT-5 is absolutely dreadful at this sort of thing.

Before I report my own science-based testing, some of the author's preferences in the AI responses are... questionable. In brief :

  1. Logic puzzle : I find GPT-5's response much clearer here. DS's longer, "different interpretations" of a really very simple "puzzle" only add unnecessary confusion.
  2. Mathematics : I tend to agree with the author that DS gives more of a tutorial in its response, though GPT-5's walkthrough is easier to follow. Neither was asked to teach the general case, however, only a specific example. I'd call this one a dead heat.
  3. Project planning : Seems like a hopelessly vague prompt to me. The output doesn't seem meaningful.
  4. Budgeting : I tend to agree that DS is slightly better here, but by a whisker : it's more instructive. Purely a personality difference though, not a matter of actual content.
  5. Parenting : They seem exactly equal to me. No idea why the author prefers DS or thinks GPT-5 is "less organised".
  6. Lunches : Not sure what the author plans to do with half a banana, nor do they show the whole display so it's hard to check this one. As it is, I'd agree DS is better here for giving better instructions, but I can't check if it really stayed in the budget or not.
  7. Stories : A toss-up; very subjective.
  8. Culture : I agree with the author that GPT-5 is clearly better; DS's list isn't a good summary in this case.
  9. Social media : Both are crap. DS might be better but it's a case of shit sandwich or shit hotdog.

So while the author prefers DS in seven of the nine tests, I'd say that only seven responses have meaningful prompts with enough output shown for a fair evaluation. I prefer DS only once (maybe twice in the last case), GPT-5 twice, and find they're not differentiable in the other cases. Certainly not a clear winner for anyone.


I should add that I rather went off DeepSeek some time ago. It can give good, insightful results (for a while I strongly preferred it to ChatGPT), but it's unreliable – both in terms of its server and its output. It can also over-think the problem, and though being able to see its reasoning can be helpful (sometimes this can be better than the final output), it also means there's a lot of content to wade through. This becomes tiresome and counterproductive.

So for this test what I thought I would do is try to ignore the reasoning and just compare the final outputs. I'm not going to go to the same lengths as I did when testing GPT-5 against my own reading of papers. Instead I'll just ask DS some of the same questions I already recently asked GPT-5 (in-anger, real use cases) and compare the initial responses.


1) Effects of ram pressure stripping on the line width of a galaxy

Suppose a galaxy losing gas through ram-pressure stripping, with the angle to the wind being close to edge-on. What, if anything, is the expected effect on its HI line width ? Is it likely to reduce, increase, or not change the measured line width ? Consider the cases of moderate to strong stripping, corresponding to deficiencies of 0.5 and 0.7 respectively. Please also provide relevant citations, if possible.

Both agree that the width will decrease, but they disagree somewhat about how much. DeepSeek raised the important issue of turbulence, but that's its only point in its favour : most of its numerous citations weren't especially relevant and certainly weren't pertinent to the specific claims it was making. Far worse was that it got "edge-on" completely wrong, taking this to mean exactly the opposite of how the term is used. GPT-5 gave a tighter response which was much more focused on the problem and with much better citations.

Winner : GPT-5, easily.


2) How do SMUDGES get their measurements ?

Hi ! I have a question about the SMUDGES catalogue of Ultra Diffuse Galaxies. I'd like to know how they compute their reported magnitude values. Does this involve fitting a surface brightness profile or do they just do aperture photometry ? If the later, how to they do they set the aperture size ? If the former, are they reporting total integrated magnitude within some radius or are they extrapolating ?

GPT-5 couldn't figure this out from web-based results so I had to feed it the paper. Once I did that it gave a concise, clear explanation. DS got the result correct using a web search with no uploads needed, and its explanation was longer but easier to follow. On the other hand, GPT-5 also came with the significant caveat that SMUDGES actually report both values, which is not correct.

Winner : DS for accuracy. GPT-5 was clearer, and it got the main point right, but it hallucinated a caveat (though it corrected this when I asked where in the paper the second value was given) and made the answer more complicated than it needed to be.


3) Integrating the light in a galaxy

About the Sersic profile... suppose I have a galaxy with a Sersic index of 0.51. How much larger than the effective radius would I have to go to enclose at least 98% of the light ?

Both models gave the same answer, but GPT-5 took 38 seconds whereas DS took more than ten minutes (!). The explanations each model provided were equally unclear to me as a non-mathematician.

Winner : GPT-5 for sheer speed.
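For anyone who wants to check the answer themselves, here's a minimal sketch of the calculation (my own, not either chatbot's working). It uses the standard result that the fraction of a Sersic profile's light enclosed within a radius x·Re is the regularised incomplete gamma function γ(2n, bₙ x^(1/n)) / Γ(2n), with bₙ defined so that Re encloses half the total light.

```python
from scipy.special import gammainc      # regularised lower incomplete gamma
from scipy.optimize import brentq

n = 0.51   # Sersic index from the prompt

# b_n : chosen so that the effective radius encloses half the total light
b_n = brentq(lambda b: gammainc(2 * n, b) - 0.5, 1e-6, 50.0)

# Fraction of light enclosed within x effective radii
enclosed = lambda x: gammainc(2 * n, b_n * x ** (1.0 / n))

# Radius (in units of R_e) enclosing 98% of the light
x_98 = brentq(lambda x: enclosed(x) - 0.98, 1.0, 50.0)
print(f"b_n = {b_n:.3f}; 98% of the light lies within {x_98:.2f} R_e")
```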


4) The safety of HIP2FITS

I'm curious about the HIPS2FITS service of astroquery. I have code that seems to work very well at extracting data from different surveys using a specified pixel scale, field of view, coordinates etc. What I'd like to know about is how the pixel values are rescaled when the pixel resolution is different from the standard survey data. In particular, I want to know if aperture photometry in DS9 will still be accurate, i.e. if the sum of the flux values within the same aperture will still be approximately the same. Presumably it can't be identical since now region will, by definition, enclose a different area if the pixel scale has changed, but will it be close if the scale is not too different ? When I requested a substantially different pixel scale for SDSS data (10" per pixel rather than the nominal 0.4") I got a very different result using the same aperture. Is this something that can be accounted for e.g. by changing the units in the FITS header, or by specifying some other parameter in astroquery (or by post-processing the result in some other way) ? Or should I always ensure I only use the standard pixel scale if I want accurate photometry ?

Here the answers are very different. GPT-5 says a straightforward no, you can't do photometry on the data and it's not a simple matter to correct it due to the regridding (though if the scale change is small, this is possible). DS also says no, but that it is a relatively simple matter of correcting for the adjusted area – but recommends keeping the original scale whenever possible. Its suggested correction was, however, likely nonsense. It also said weird things like "even for the same source" and gave various other caveats that felt incoherent.

Winner : GPT-5 for clarity. 


5) How are stellar masses calculated ?

I'm curious about how galaxy stellar masses are estimated from photometric measurements. For example, I've seen several different recipes using, say, g-i or g-r colours, and I notice that they can give quite different result for the same object. How are these recipes derived, and how accurate are they ?

Both bots gave extremely similar answers here with no obvious major discrepancies. I couldn't honestly say I preferred either answer.

Winner : dead heat.


6) Star formation from GALEX

How can I quantitatively estimate the star formation rate of a galaxy using SDSS and GALEX data ?

This was an extremely simple prompt and both models gave similar answers. GPT-5 gave slightly more useable answers with more explanations; DS was a little less clear as to what was going on.

Winner : GPT-5, but it was close (and with the caveat that I haven't checked in detail as the responses were both quite long).


7) Analysing the large scale environment

I've got a student investigating an HI data cube of the Leo Group. We've already studied the Leo Ring volume and he's extending our study into the background volume, ~2.000 - 20,000 km/s. We'd like to be able to characterise the environment of our detections, i.e. to say if they're in any major known groups, clusters, or large-scale structures such as filaments. What's the best approach to do this ? I guess we can simply search for groups and clusters in NED, but maybe there's a better way (also this wouldn't tell us about the larger stuff). And we'd like to know any vital information about such structures, e.g. if something is particularly well-known for any reason.

For the only time in these tests, DS complained its server was busy. On a second attempt it gave a response which was pretty basic and not very helpful : "you can look up this data set" or "use this software", or, worse, suggesting a comparison to simulations. GPT-5 gave much more helpful citations and guided instructions on what to do with our own data and how we could compare it to other catalogues.

Winner : GPT-5, easily.


The verdict : ChatGPT-5 won five of the seven tests. DS only won once, with the other being a dead heat. Even leaving aside that GPT-5 had a narrow victory in one case, GPT-5 is the clickbaity "clear winner" here.

To be fair, I'd switched to DeepSeek because it was giving me better discussions than the contemporary model of ChatGPT, even when they first added reasoning. It's by no means a bad model, but it's unreliable compared to GPT-5. Its citations are frequently of dubious relevance, it seems to hallucinate more (the example of GPT-5 hallucinating here is the only example I've found of this so far*, and that was minor), and if you want its best output you'll have to wade through an awful lot of its reasoning processes. It's also slower than GPT-5, sometimes by an order of magnitude. And I've also found that DeepSeek rejects my uploads as violating policies even when there's nothing offensive in them; it won't even discuss them at all but just shuts down the discussion immediately.

* Excluding when you ask for it to analyse a paper by giving it a link rather than uploading. In that case it becomes worse than useless, so never do this.

Still, I have to admit that this does tend to curb my enthusiasm for GPT-5, but only a little. DeepSeek's answers were generally better in comparison to GPT-5 than I was expecting; DeepSeek gets most of the way there when it works. If GPT-5 is a revolution compared to previous OpenAI offerings, then it's only an incremental upgrade compared to DeepSeek. An important one, to be sure, but nevertheless incremental.

On the other hand, it's all about thresholds : how you cross them doesn't matter nearly as much as the fact you've crossed them at all. An increment which gets you across the line is just as useful as if you got there from a standing start. And if DeepSeek gets most of the way there most of the time, GPT-5 gets even further almost all of the time. With only one minor failure here compared to DeepSeek's two majors, the effect is non-linear. Even if GPT-5 doesn't drastically improve upon DeepSeek, it does so by more than enough that I now see no point in using DeepSeek at all.


That about wraps up the LLM-testing for the foreseeable future. Now back to the usual posts in which I explain science to the masses... whether I do this better than the chatbots I leave to readers to decide.

Wednesday, 13 August 2025

ChatGPT-5 Versus Me

It's time for another round of evaluating whether ChatGPT is actually helpful for astronomical research.

My previous experiments can be found here, here, and here. The first two links looked at how well ChatGPT and Bing performed when analysing papers I myself know very well, with the upshot being an extreme case of hit-and-miss : Occasional flashes of genuine brilliance wrapped in large doses of mediocrity and sprinkled with total rubbish, to quote myself. All conversations had at least one serious flaw (though in one case the flaw was arguably only cosmetic : it was factually and scientifically perfect but had crippling format errors).

The third link tested ChatGPT's vision analysis by trying to get it to do source extraction, which was a flat-out failure. Fortunately there have been other tests on this which show it does pretty badly in more typical situations as well, so I'm not going to bother redoing this.

With the release of ChatGPT-5, however, I do want to redo the analysis of papers. If I can have ChatGPT give me reliable scientific assessments of papers, that's potentially a big help in a number of ways, at the very least in determining if something is going to be worth my time to read in full. For this one I picked a new selection of papers as my last tests were a couple of years ago, and I can't claim I remember all their details as well as I did. 

Because all the papers cover different topics, there isn't really a good way to standardise the queries. So these tests are designed to mimic how I'd use it in anger, beginning with a standardised query but then allowing more free-ranging, exploratory queries. There's no need for any great numerical precision here, but if I can establish even roughly how often GPT-5 produces a result which is catastrophically wrong or useless, that's useful information.

I began each discussion with a fairly broad request :

I'd like a short summary of the paper's major findings, an evaluation of its scientific importance and implications, what you think the major weaknesses (if any) might be and how they could be addressed. I might then ask you more detailed, specific questions. Accuracy is paramount here, so please draw your information directly from the paper whenever possible – specify your sources if you need to use another reference.

Later I modified this to stress I was interested in the strengths and weaknesses of the scientific interpretation as well as methodology, as GPT-5 seemed to get a little hung up on generic issues – number of sources, sensitivity, that sort of thing. I followed up the general summaries with specific questions tailored to each individual paper as to what they contained and where, this being a severe problem for earlier versions. At no point did I try to deliberately break it – I only tried to use it.

Below, you can find my summaries of the results of discussions about five papers together with links to all of the conversations.


0) To Mine Own Research Be True ?

But first, a couple of examples where I can't share the conversations because they involve current, potentially publishable research (I gave some initial comments already here). I decided to really start at the deep end with a query I've tried many times with ChatGPT previously and got very little out of it : to have it help with a current paper I'm writing, asking it to assess the merits and problems alike – essentially acting as a mock reviewer. 

* Of which the management, like the rest of us, is generally sensible about such things. We all recognise the dangers of hallucinations, the usefulness and limitations of AI-generated code, etc. Nobody here is a fanboy nor of the anti-AI evangelical sort.

Previously I'd found it to be very disappointing at this kind of task. It tends to get hung up on minutiae, not really addressing wider scientific points at all. For example, if you asked it which bits should be cut, it might pick out the odd word or sentence or two, but it wouldn't say if a whole section was a digression from the main topic. It didn't think at scale, so to speak. It's hard to describe precisely, but it felt like it had no understanding of the wider context at all; it discussed details, not science. It wasn't that using it for evaluations was of no value whatsoever, but it was certainly questionable whether it was a productive use of one's time.

With the current paper I have in draft, ChatGPT-5's response was worlds apart from its previous meagre offerings. It described itself as playing the role of a "constructively horrible" reviewer (its own choice of phrase) and it did that, I have to say, genuinely very well. Its tone was supportive but not sycophantic. It suggested highly pertinent scientific critiques, such as the discussion on the distance of a galaxy – which is crucial for the interpretation in this case – being too limited, with alternatives being fully compatible with the data. It told me when I was being over-confident in phrasing, gave accurate indications of where I was overly repetitive, and came up with perfectly sensible, plausible interpretations of the same data.

Even its numbers were, remarkably, actually accurate* (unlike others, I haven't seen it make the classic errors in basic facts and numbers, such as miscounting specified letters even in fictional words; I tried reproducing some of these errors multiple times but couldn't). At least, that is, the ones I checked – but all of those were on the money, a far cry indeed from older versions ! Similarly, citations were all correct and relevant to its claims : none were total hallucinations. That is a big upgrade.

* ChatGPT itself claims that it does actual proper calculations whenever the result isn't obvious (like 2+2, for which training data is enough) or accuracy is especially important.

When I continued the discussion... it kept giving excellent, insightful analysis; previous versions tended to degenerate into incoherency and stupidity in long conversations. It wasn't always right – it made one major misunderstanding of an inquiry that I thought it should have avoided* – but it was right more than, say, 95% of the time, and its single significant misunderstanding was very easily corrected**. If it was good for bouncing ideas off before, now it's downright excellent.

* This wasn't a hallucination as it didn't fabricate anything, it just misunderstood the question.
** And how many conversations with real people feature at least one such difficulty ? Practically all of them in my experience.


The second unshareable test was to feed it my rejected ALMA proposal and (subsequently) the reviewer responses. Here too the tone of GPT-5 shines. It phrased things very carefully but without walking on eggshells, explaining what the reviewer's thought processes might have been and how to address them in the future without making me feel like I'd made some buggeringly stupid mistake. I asked it initially to guess how well the proposal would have been ranked and it said second quartile, borderline possibility for acceptance... praiseworthy and supportive, but not toadying, and not raising false hopes.

When I told it the actual results (lowest quartile, i.e. useless), it agreed that some of the comments were objectionable, but gave me clear, precise instructions as to how they could be countered. Those are things I would find extremely difficult to do on my own : I read some of the stupider claims ("the proposal flow feels a bit narrative"... FFS, it damn well should be narrative and I will die on this hill) and just want to punch the screen*, but GPT-5 gave me ways to address those concerns. It said things like, "you and I know that, but...". 

* No, not really ! I just need to bitch about it to people. Misery loves company, and in a perverse bit of luck, nobody in our institute got any ALMA proposals accepted this year either.

It made me feel like these were solvable problems after all. For example, it suggested the rather subtle reframing of the proposal from a detection experiment (which ALMA disfavours) to hypothesis testing (which is standard scientific practice that nobody can object to). This is really, really good stuff, and the insight into what the reviewers might have been thinking, or not understanding, made me look at the comments in a much more upbeat light. Again, it had one misunderstanding about a question, but again this was easily clarified and it responded perfectly on the second attempt.

On to the papers !


1) The Blob(s)

This paper is one of the most interesting I've read in recent years, concerning the discovery of strange stellar structures in Virgo which the authors interpret as ram pressure dwarfs. Initially I tried to feed it the paper by providing a URL link, but this didn't work. As I found out with the second paper, trying to do it this way is simply a mistake : in this and this alone does GPT-5 consistently hallucinate. That is, it claims it's done things which it hasn't done, reporting wrong information and randomly giving failure messages.

Not a great start, but it gets better. When a document is uploaded, hallucinations aren't quite eliminated, but good lord they're massively reduced compared to previous versions. It's weird that its more general web search capabilities appear rather impressive, but give it a direct link and it falls over like a crippled donkey. You can't have everything, I guess.

Anyway, you can read my full discussion with ChatGPT here. In brief :

  • Summary : Factually flawless. All quoted figures and statements are correct. It chose these in a sensible way to give a concise summary of the most important points. Both scientific strengths and weaknesses are entirely sensible, though the latter are a little bland and generic (improve sensitivity and sample size, rather than suggesting alternative interpretations).
  • Discussion : When pressed more directly for alternative interpretations, it gave sensible suggestions, pointing out pertinent problems with the methodology and data that allow for this.
  • Specific inquiries : I asked it about the AGES clouds that I know are mentioned in this paper (I discovered them) and here I encountered the only real hallucination in all the tests. It named three different AGES clouds that are indeed noteworthy because they're optically dim and dark ! These are not mentioned in this paper at all. When I asked it to check again more carefully, it reported the correct clouds which the authors refer to. When I asked it about things I knew the paper didn't discuss, it correctly reported that the paper didn't discuss this.
  • Overall : Excellent, once you accept the need to upload the document. Possibly the hallucination might have been a holdover from that previous attempt to provide the URL, and in my subsequent discussions I emphasised more strongly the need for accuracy and to distinguish what the paper contained from GPT-5's own inferences. This seems to have done the trick. Even with these initial hiccups, however, the quality of the scientific discussions was very high. It felt like talking with someone who genuinely knew what the hell they were talking about.

2) The Smudge

This one is about finding a galaxy so faint the authors detected it by looking for its globular clusters. They also find some very diffuse emission in between them, which is pretty strong confirmation that it's indeed a galaxy of sorts.

At this point I hadn't learned my lesson. Giving ChatGPT a link caused it to hallucinate in a sporadic, unpredictable way. It managed to get some things spot on but randomly claimed it couldn't access the paper at all, and invented content that wasn't present in the paper. Worse, it basically lied about its own failures.

You can read my initial discussion here, but frustrated by these problems, I began in a second thread with an uploaded document here. That one, I'm pleased to say, had no such issues.

  • Summary : Again, flawless. A little bland, perhaps, but that's what I wanted (I haven't tried asking it for something more sarcastic). The content was researcher level rather than general public but again I didn't ask for outreach content. It correctly highlighted possible flaws like the inferred high dark matter content being highly uncertain due to an extremely large extrapolation from a relatively novel method.
  • Discussion : In the hallucinatory case, it actually came up with some very sensible ideas even though these weren't in the paper. For example, I asked it about the environment of the galaxy and it gave some plausible suggestions on how this could have contributed to the object's formation – the problem was that none of this was in the paper as it claimed. Still, the discussion on this – even when I pushed it to ideas that are very new in the literature – was absolutely up to scratch. When I suggested one of its ideas might be incorrect, it clarified what it meant without changing the fundamental basis of its scenario in a way that convinced me it was at least plausible : this was indeed a true clarification, not a goalpost-shifting modification. It gave a detailed, sensible discussion of how tidal stripping can preferentially affect different components of a galaxy, something which is hardly a trivial topic.
  • Specific inquiries : When using the uploaded document, this was perfect. Numbers were correct. It reported correctly both when things were and weren't present in the article, with no hallucinations of any kind. It expanded on my inquiries into more general territory very clearly and concisely.
  • Overall : Great stuff. Once again, it felt like a discussion with a knowledgeable colleague who could both explain specific details but also the general techniques used. Qualitatively and quantitatively accurate, with an excellent discussion about the wider implications.


3) ALFALFA Dark Galaxies

My rather brief summary is here. This is the discovery of 140-odd dark galaxy candidates in archival ALFALFA HI data. The ChatGPT discussion is here. This time I went straight to file upload and had no issues with hallucinations whatsoever.

  • Summary : Once again, flawless. Maybe a little bland and generic with regard to other interpretations, but it picked out the major alternative hypothesis correctly. And in this case, nobody else has come up with any other better ideas, so I wouldn't expect it to suggest anything radical without explicitly prompting it to.
  • Discussion : It correctly understood my concern about whether the dynamical mass estimates are correct and gave a perfect description of the issue. This wasn't a simple case of "did they use the equation correctly" but a contextual "was this the correct equation to be using and were the assumptions correct" case, relating not just to individual objects but also their environment. Productive and insightful.
  • Specific inquiries : Again flawless, not claiming the authors said anything they didn't or claiming they didn't say anything they did. Numbers and equations used were reported correctly.
  • Overall/other: Superb. I decided to finish by asking a more social question – how come ALFALFA have been so cagey about the "dark galaxy" term in the past (they use the god-awful "almost darks", which I loathe) but here at least one team member is on board with it ? It came back with answers which were both sociologically (a conservative culture in the past, a change of team here) and scientifically (deeper optical data with more robust constraints) sensible ideas. It also ended with the memorable phrase, "[the authors are] happy to take the “dark galaxy” plunge — but with the word “candidate” as a fig leaf of scientific prudence."


4) The VCC 2034 System

This is a case of a small fuzzy patch of stars near some larger galaxies, possibly with a giant HI stream, which has proven remarkably hard to explain. The latest paper, which I summarise here, discounts the possibility that it formed from the long stream as it apparently doesn't exist, but (unusually) doesn't figure out an alternative scenario either. The ChatGPT discussion is here.
  • Summary : Factually perfect, though it didn't directly state that the origin of the object is unknown. Arguably "challenges simple ram-pressure stripping scenarios and suggests either an intergalactic or pre-cluster origin" implies this, but I'd have preferred it to say so more directly. Nevertheless, the most crucial point – that previous suggestions don't really hold up – came through very clearly.
  • Discussion : Very good, but not perfect. While it didn't get anything wrong, it missed out the claims in the paper against the idea of ram pressure dwarfs more generally (about the main target object of the study it was perfect). With some more direct prompting it did eventually find this, and the ensuing discussion was productive, pointing out some aspects of this I hadn't considered. I'm not entirely convinced this was correct, but no more than I doubt some of the claims made in the paper itself – PhD level hardly means above suspicion, after all. And the discussion on the dynamics of the object was extremely useful, with ChatGPT again raising some points from the paper I'd completely missed when I first read it; the discussion on the survival of such objects in relation to the intracluster medium was similarly helpful.
  • Specific inquiries : Aside from the above miss, this was perfect. When I asked it to locate particular numbers and discuss their implications it did so, and likewise it correctly reported when the paper didn't comment on a topic I asked about. 
  • Overall : Not flawless, but damn good, and certainly useful. One other discussion point caused a minor trip-up. When I brought in a second paper (via upload) for comparison and mentioned my own work for context, it initially misinterpreted this and appeared to ignore the paper. This was easily caught and fixed with a second prompt, and the results were again helpful. By no means was this a hallucination – it felt more like it was getting carried away with itself.


5) An Ultra Diffuse Galaxy That Spins Too Slowly

This was a paper that I'd honestly forgotten all about until I re-read my own summary. It concerns a UDG which initial observations indicated lacked dark matter entirely, but then another team came along and found that this would be unsustainable and that it was probably just an inclination angle measurement error. Then the original team came back with new observations and simulations, and they found it does have some dark matter after all – at a freakishly low concentration, but enough to stabilise it. The ChatGPT discussion is here.

  • Summary : As usual this was on the money, bringing in all the key points of the paper and giving a solid scientific assessment and critique. Rather than dealing with trivialities like sample size or simulation resolution, it noted that maybe they'd need to account more for the effects of environment or use different physics for the effects of feedback on star formation.
  • Discussion : As with the fourth paper, this was again excellent but not quite complete. It missed out one of my favourite* bits of speculation in the paper : that this object could tell us something directly about the physical nature of dark matter. It did get this with direct prompting, but I had to be really explicit about it. To be fair, this is just one paragraph in the whole article, but reading between the lines I felt it was a point the authors really wanted to make. On the other hand, that's just my opinion and it certainly isn't the main point of the work.
  • Specific inquiries : Yep, once again it delivered the goods. No inaccuracies. It reported the crucial points correctly and described the comparisons with previous works perfectly. Again, it didn't report any claims the authors didn't make.
  • Overall : Excellent. I allowed myself to branch out to a wider discussion of the cold dark matter paradigm and it came back with some great papers I should check out regarding stability problems in MOND. It sort of back-pedalled a little bit on discussions about the radial acceleration relation (RAR), but this was more a nuanced clarification than a revision of its claims : CDM gets the RAR as a result of tuning its baryonic physics, but it gets this for free by tuning for other parameters rather than for the RAR directly, whereas MOND gets the RAR as a core feature. If that's not a PhD level discussion then I don't know what is.

* More generally, it seems pretty good at picking up on the same stuff that I do, but it would be silly to expect 100% alignment.


Summary and Conclusions

On my other blogs I've gone on about the importance of thresholds. Well, we've crossed one. Even the more positive assessments of GPT-5 tend to label it as an incremental upgrade, but I violently disagree. I went back and checked my earlier discussion with GPT-4o about my ALMA proposal and confirmed that it was mainly spouting generic, useless crap... GPT-5 is a massive improvement. It discusses nuanced and niche scientific issues with a robust understanding of their broader context. In other threads I've found it fully capable of giving practical suggestions and calculations which I've found just work. Its citations are pertinent and exist. 

This really does feel like a breakthrough moment. At first it was a cool tech demo, then it was a cool toy. Now it's an actually useful tool for everyday use – potentially an incredibly important one. Where people are coming from when they say it gets basic facts wrong I've honestly no idea. The review linked above says it gave a garbage response when fed a 160+ page document and was anything but PhD-level, but in my tests with typical length papers (generally 12-30 pages) I would absolutely and unequivocally call it PhD level. No question of it.

This is not to say it's perfect. For one thing, even though there's a GUI setting for this, it's very hard to get it to stop offering annoying follow-up suggestions of things it could do. This is why you'll see my chats with it sometimes end with "and they all live happily ever after", because I had to put that in my custom instructions to give it an alternative ending (in one memorable case it came up with "one contour to rule them all and in the darkness bind them"*). Even then it doesn't always work. And it always delivers everything in bullet-point form : no doubt this can be altered, but I haven't tried... generally I don't hate this though.

* I really like the personality of GPT-5. It's generally clear and to the point, straightforward and easy to read, but with the occasional unexpected witticism that keeps things just a little more engaging.

Of course, it does still make mistakes. Misinterpretations of the question appear to be the most common, but these are very easily spotted and fixed. Incompleteness seems to be less common but more serious, though I'd stress that expecting perfection from anything is extremely foolish. And actual hallucinations of the kind that still plagued GPT-4 are now nearly non-existent, provided you give it rigorous instructions.

So that's my first week with GPT-5, a glowing success and vastly better than I was expecting. Okay, people on reddit, I get that you missed the sycophantic ego-stroking personality of GPT-4, so whine about how your virtual friend has died all you want. But all these claims that it's got dumber, and has an IQ barely above that of a Republican voter... what the holy hell are you talking about ? That makes NO sense to me whatsoever.

Anyway, I've put my money where my mouth is and subscribed to Plus. Watch this space : in a month I'll report back on whether it's worth it.
