Sister blog of Physicists of the Caribbean. Shorter, more focused posts specialising in astronomy and data visualisation.

Friday, 22 August 2025

ChatGPT-5 versus... DeepSeek

My excitement for ChatGPT-5 continues to defy the Will Of The Internet. Sod y'all, this is feckin' awesome ! This is the upgrade I've been waiting for.

It seems only fair to continue testing it, however. It performed extremely well against my own knowledge, but how does it compare to other chatbots ? After all, there's not much point in me getting all excited for this one in particular if others are actually doing comparably well.

I was intrigued by this article which pits it against DeepSeek, mainly in more real-life situations than those I've been testing. To save you a click, the "clear winner" was DeepSeek, with the unexpected proviso that GPT-5 is better at creativity and explanations – which again is in defiance of the judgement of reddit, who think that GPT-5 is absolutely dreadful at this sort of thing.

Before I report my own science-based testing, some of the author's preferences in the AI responses are... questionable. In brief :

  1. Logic puzzle : I find GPT-5's response much clearer here. DS's longer, "different interpretations" of a really very simple "puzzle" only add unnecessary confusion.
  2. Mathematics : I tend to agree with the author that DS gives more of a tutorial in its response, though GPT-5's walkthrough is easier to follow. Neither was asked to teach the general case, however, only a specific example. I'd call this one a dead heat.
  3. Project planning : Seems like a hopelessly vague prompt to me. The output doesn't seem meaningful.
  4. Budgeting : I tend to agree that DS is slightly better here, but by a whisker : it's more instructive. Purely a personality difference though, not a matter of actual content.
  5. Parenting : They seem exactly equal to me. No idea why the author prefers DS or thinks GPT-5 is "less organised".
  6. Lunches : Not sure what the author plans to do with half a banana, nor do they show the whole display so it's hard to check this one. As it is, I'd agree DS is better here for giving better instructions, but I can't check if it really stayed in the budget or not.
  7. Stories : A toss-up; very subjective.
  8. Culture : I agree with the author that GPT-5 is clearly better; DS's list isn't a good summary in this case.
  9. Social media : Both are crap. DS might be better but it's a case of shit sandwich or shit hotdog.

So while the author prefers DS in seven of the nine tests, I'd say that only seven responses have meaningful prompts with enough output shown for a fair evaluation. Of those, I prefer DS only once (maybe twice if we count the last case), GPT-5 twice, and find them indistinguishable in the rest. Certainly not a clear winner for anyone.


I should add that I rather went off DeepSeek some time ago. It can give good, insightful results (for a while I strongly preferred it to ChatGPT), but it's unreliable – both in terms of its server and its output. It can also over-think the problem, and though being able to see its reasoning can be helpful (sometimes this can be better than the final output), it also means there's a lot of content to wade through. This becomes tiresome and counterproductive.

So for this test, I decided to ignore the reasoning and just compare the final outputs. I'm not going to go to the same lengths as I did when testing GPT-5 against my own reading of papers. Instead I'll just ask DS some of the same questions I recently asked GPT-5 (real, in-anger use cases) and compare the initial responses.


1) Effects of ram pressure stripping on the line width of a galaxy

Suppose a galaxy is losing gas through ram-pressure stripping, with the angle to the wind being close to edge-on. What, if anything, is the expected effect on its HI line width ? Is it likely to reduce, increase, or not change the measured line width ? Consider the cases of moderate to strong stripping, corresponding to deficiencies of 0.5 and 0.7 respectively. Please also provide relevant citations, if possible.

Both agree that the width will decrease, but they disagree somewhat on how much. DeepSeek raised the important issue of turbulence, but that's its only point in its favour : most of its numerous citations weren't especially relevant and certainly weren't pertinent to the specific claims it was making. Far worse was that it got "edge-on" completely wrong, taking this to mean exactly the opposite of how the term is used. GPT-5 gave a tighter response which was much more focused on the problem and with much better citations.

Winner : GPT-5, easily.


2) How do SMUDGES get their measurements ?

Hi ! I have a question about the SMUDGES catalogue of Ultra Diffuse Galaxies. I'd like to know how they compute their reported magnitude values. Does this involve fitting a surface brightness profile or do they just do aperture photometry ? If the latter, how do they set the aperture size ? If the former, are they reporting total integrated magnitude within some radius or are they extrapolating ?

GPT-5 couldn't figure this out from web-based results so I had to feed it the paper. Once I did that, it gave a concise, clear explanation. DS got the result correct using a web search with no uploads needed, and its explanation was longer but easier to follow. On the other hand, GPT-5 added the significant caveat that SMUDGES actually report both values, which is not correct.

Winner : DS for accuracy. GPT-5 was clearer, and it got the main point right, but it hallucinated a caveat (though it corrected this when asked where in the paper the second value was given) and made the answer more complicated than it needed to be.


3) Integrating the light in a galaxy

About the Sersic profile... suppose I have a galaxy with a Sersic index of 0.51. How much larger than the effective radius would I have to go to enclose at least 98% of the light ?

Both models gave the same answer, but GPT-5 took 38 seconds whereas DS took more than ten minutes (!). The explanations each model provided were equally unclear to me as a non-mathematician.
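For anyone who'd rather check the arithmetic than trust either bot, the calculation is short enough to script. Here's a minimal sketch (my own, not either model's output), using the standard result that the enclosed light fraction of a Sersic profile is a regularised incomplete gamma function :

```python
# Radius enclosing a given fraction of the light for a Sersic profile.
# Enclosed fraction : f(R) = gammainc(2n, b_n * (R/R_e)**(1/n)), with b_n
# defined so that the effective radius R_e encloses half the light.
from scipy.special import gammaincinv

def radius_for_fraction(n, frac):
    """Radius, in units of R_e, enclosing the given fraction of the light."""
    b_n = gammaincinv(2 * n, 0.5)                 # R_e encloses half the light
    return (gammaincinv(2 * n, frac) / b_n) ** n

print(radius_for_fraction(0.51, 0.98))            # ~2.4 R_e
```

This runs in milliseconds, which makes DS's ten minutes all the more baffling.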

Winner : GPT-5 for sheer speed.


4) The safety of HIPS2FITS

I'm curious about the HIPS2FITS service of astroquery. I have code that seems to work very well at extracting data from different surveys using a specified pixel scale, field of view, coordinates etc. What I'd like to know about is how the pixel values are rescaled when the pixel resolution is different from the standard survey data. In particular, I want to know if aperture photometry in DS9 will still be accurate, i.e. if the sum of the flux values within the same aperture will still be approximately the same. Presumably it can't be identical since the region will, by definition, enclose a different area if the pixel scale has changed, but will it be close if the scale is not too different ? When I requested a substantially different pixel scale for SDSS data (10" per pixel rather than the nominal 0.4") I got a very different result using the same aperture. Is this something that can be accounted for e.g. by changing the units in the FITS header, or by specifying some other parameter in astroquery (or by post-processing the result in some other way) ? Or should I always ensure I only use the standard pixel scale if I want accurate photometry ?

Here the answers are very different. GPT-5 says a straightforward no, you can't do photometry on the data and it's not a simple matter to correct it due to the regridding (though if the scale change is small, this is possible). DS also says no, but that it is a relatively simple matter of correcting for the adjusted area – but recommends keeping the original scale whenever possible. Its suggested correction was, however, likely nonsense. It also said weird things like "even for the same source" and gave various other caveats that felt incoherent.
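The nice thing is that the disagreement is easy to test empirically. Here's a rough sketch of the kind of check I mean : fetch the same field at two pixel scales and compare the aperture sums, with and without DS's pixel-area rescaling. The survey, coordinates and aperture size below are placeholder assumptions, not the values from my actual test :

```python
# Empirical check : does a fixed on-sky aperture give (roughly) the same
# sum at different hips2fits pixel scales ? Position, survey and aperture
# are placeholders for illustration only.
import astropy.units as u
from astropy.coordinates import Angle, SkyCoord
from astropy.wcs import WCS
from astroquery.hips2fits import hips2fits
from photutils.aperture import SkyCircularAperture, aperture_photometry

target = SkyCoord(ra=185.0 * u.deg, dec=15.0 * u.deg)    # hypothetical position
fov = Angle(0.2 * u.deg)                                 # 12 arcmin field

def cutout(npix):
    """Fetch the same field with npix x npix pixels, i.e. a chosen pixel scale."""
    return hips2fits.query(hips="CDS/P/SDSS9/g", width=npix, height=npix,
                           ra=target.ra, dec=target.dec, fov=fov,
                           projection="TAN", format="fits")

def aperture_sum(hdul, radius=30 * u.arcsec):
    """Sum of the pixel values inside a fixed aperture on the sky."""
    data, wcs = hdul[0].data, WCS(hdul[0].header)
    aper = SkyCircularAperture(target, r=radius)
    return aperture_photometry(data, aper, wcs=wcs)["aperture_sum"][0]

s_hi = aperture_sum(cutout(1800))    # ~0.4"/pixel, near the native SDSS scale
s_lo = aperture_sum(cutout(72))      # ~10"/pixel, heavily regridded
# DS's claim, in spirit : a simple pixel-area rescaling reconciles the two.
print(s_hi, s_lo * (1800 / 72) ** 2)
```

If GPT-5 is right, the rescaled low-resolution sum should only come close when the change in scale is small.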

Winner : GPT-5 for clarity. 


5) How are stellar masses calculated ?

I'm curious about how galaxy stellar masses are estimated from photometric measurements. For example, I've seen several different recipes using, say, g-i or g-r colours, and I notice that they can give quite different results for the same object. How are these recipes derived, and how accurate are they ?

Both bots gave extremely similar answers here with no obvious major discrepancies. I couldn't honestly say I preferred either answer.
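For concreteness, here's what one such recipe looks like in practice : a sketch of the g-i relation of Taylor et al. (2011), with invented input values. Other recipes have different coefficients (and different assumed stellar populations), which is exactly why they disagree for the same object :

```python
import math

def logmass_taylor11(g, i, dist_mpc):
    """log10 of the stellar mass in Msun, from the Taylor et al. (2011)
    recipe : log(M*/Msun) = 1.15 + 0.70*(g - i) - 0.4*M_i, where M_i is
    the absolute i-band (AB) magnitude."""
    dist_mod = 5 * math.log10(dist_mpc * 1e6) - 5    # distance modulus
    abs_i = i - dist_mod
    return 1.15 + 0.70 * (g - i) - 0.4 * abs_i

# A hypothetical dwarf at 16 Mpc :
print(logmass_taylor11(g=16.0, i=15.4, dist_mpc=16.0))   # ~7.8
```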

Winner : dead heat.


6) Star formation from GALEX

How can I quantitatively estimate the star formation rate of a galaxy using SDSS and GALEX data ?

This was an extremely simple prompt and both models gave similar answers. GPT-5 gave slightly more usable answers with more explanation; DS was a little less clear as to what was going on.
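For reference, the backbone of most answers here is short enough to write down. This is a sketch of the classic route for the GALEX side : FUV magnitude to luminosity to the Kennicutt (1998) calibration, with a made-up magnitude and distance, and ignoring the dust correction that makes this hard in practice :

```python
# GALEX FUV magnitude -> SFR via Kennicutt (1998) : SFR = 1.4e-28 * L_nu,
# with L_nu in erg/s/Hz. No dust correction applied; inputs are invented.
import math

def sfr_from_fuv(fuv_ab_mag, dist_mpc):
    """Star formation rate in Msun/yr from a GALEX FUV AB magnitude."""
    f_nu = 10 ** (-(fuv_ab_mag + 48.6) / 2.5)    # erg/s/cm^2/Hz (AB zero point)
    d_cm = dist_mpc * 3.086e24                   # Mpc -> cm
    l_nu = 4 * math.pi * d_cm ** 2 * f_nu        # erg/s/Hz
    return 1.4e-28 * l_nu

# Hypothetical galaxy : FUV = 17.5 mag at 16 Mpc
print(sfr_from_fuv(17.5, 16.0))                  # ~0.016 Msun/yr
```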

Winner : GPT-5, but it was close (and with the caveat that I haven't checked in detail as the responses were both quite long).


7) Analysing the large scale environment

I've got a student investigating an HI data cube of the Leo Group. We've already studied the Leo Ring volume and he's extending our study into the background volume, ~2,000 - 20,000 km/s. We'd like to be able to characterise the environment of our detections, i.e. to say if they're in any major known groups, clusters, or large-scale structures such as filaments. What's the best approach to do this ? I guess we can simply search for groups and clusters in NED, but maybe there's a better way (also this wouldn't tell us about the larger stuff). And we'd like to know any vital information about such structures, e.g. if something is particularly well-known for any reason.

For the only time in these tests, DS complained its server was busy. On a second attempt it gave a response which was pretty basic and not very helpful : "you can look up this data set" or "use this software", or worse, suggesting a comparison to simulations. GPT-5 gave much more helpful citations and guided instructions on what to do with our own data and how we could compare it to other catalogues.
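For what it's worth, the NED part at least is straightforward to script (the larger-scale structures are the genuinely hard bit). A minimal sketch with astroquery, where the coordinates and search radius are placeholders :

```python
# Search NED for catalogued groups and clusters near a detection.
import astropy.units as u
from astropy.coordinates import SkyCoord
from astroquery.ipac.ned import Ned

detection = SkyCoord("10h47m00s", "+12d30m00s")    # hypothetical HI source
table = Ned.query_region(detection, radius=30 * u.arcmin)

# Keep only the group- and cluster-type objects (NED type codes)
mask = [t in ("GGroup", "GClstr") for t in table["Type"].astype(str)]
for row in table[mask]:
    print(row["Object Name"], row["Type"], row["Velocity"])
```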

Winner : GPT-5, easily.


The verdict : ChatGPT-5 won five of the seven tests. DS won only once, with the other being a dead heat. Even discounting the case where GPT-5's victory was narrow, GPT-5 is the clickbaity "clear winner" here.

To be fair, I'd switched to DeepSeek because it was giving me better discussions than the contemporary model of ChatGPT, even when they first added reasoning. It's by no means a bad model, but it's unreliable compared to GPT-5. Its citations are frequently of dubious relevance, it seems to hallucinate more (the example of GPT-5 hallucinating here is the only example I've found of this so far*, and that was minor), and if you want its best output you'll have to wade through an awful lot of its reasoning processes. It's also slower than GPT-5, sometimes by an order of magnitude. And I've also found that DeepSeek rejects my uploads as violating policies even when there's nothing offensive in them; it won't even discuss them at all but just shuts down the discussion immediately.

* Excluding when you ask for it to analyse a paper by giving it a link rather than uploading. In that case it becomes worse than useless, so never do this.

Still, I have to admit that this does tend to curb my enthusiasm for GPT-5, but only a little. DeepSeek's answers held up better against GPT-5 than I was expecting; DeepSeek gets most of the way there when it works. If GPT-5 is a revolution compared to previous OpenAI offerings, then it's only an incremental upgrade compared to DeepSeek. An important one, to be sure, but nevertheless incremental.

On the other hand, it's all about thresholds : how you cross them doesn't matter nearly as much as the fact you've crossed them at all. An increment which gets you across the line is just as useful as if you got there from a standing start. And if DeepSeek gets most of the way there most of the time, GPT-5 gets even further almost all of the time. With only one minor failure here compared to DeepSeek's two majors, the effect is non-linear. Even if GPT-5 doesn't drastically improve upon DeepSeek, it does so by more than enough that I now see no point in using DeepSeek at all.


That about wraps up the LLM-testing for the foreseeable future. Now back to the usual posts in which I explain science to the masses... whether I do this better than the chatbots I leave to readers to decide.
