Sister blog of Physicists of the Caribbean. Shorter, more focused posts specialising in astronomy and data visualisation.

Thursday, 11 July 2024

ChatGPT Is Not A Source Extractor

When ChatGPT-4o came along I was pretty keen to try out its shiny new features, especially since some of the shine has rubbed off the chatbots of late. Oh, ignore the hype trains completely : those who are saying it's going to cause the apocalypse or usher in the Utopian end of history are equally deluded. I'm talking about actual use cases for LLMs. This situation remains pretty much as it has been since they were first unleashed. That is...
  • Decent enough if you want free-form discussions (especially if you need new ideas and don't care too much about factual accuracy)
  • Genuinely actually very useful indeed for coding (brilliant at doing boiler-plate work, a serious time-saver !)
  • Largely crap if you need facts, and even worse if you need those facts to be reliably accurate
Pretending that those first two are unimportant is in my view quite silly, legitimate concerns about energy expenditure notwithstanding. But that third one... nothing much seems to have shifted on that at all. They're still plagued with frequent hallucinations, since they're not grounded in anything and so have no internal distinction between a verifiable, observable truth and the CPU-equivalent of a random brain-fart firing of the neurons.

Unfortunately GPT-4o just seems to extend this into a multi-modal world, giving results basically consistent with my earlier tests of chatbots. But I was intrigued by its apparent accuracy when supplying image files. It seemed to be, albeit from limited testing, noticeably more accurate when asked questions about image files than, say, PDFs. So I had a passing thought : could I use ChatGPT-4o to find sources in my data ?

Spoiler : no. It doesn't work.

It's not possible to share the chat itself because it contains images, but basically what I did was this. I uploaded an image of a typical data set I would customarily trawl through looking for galaxies. The very short version is that the HI detections of galaxies typically look like elongated blobs, sometimes appearing saturated and sometimes as mere enhancements in the noise. You can find a much more thorough explanation on my website, but that's the absolute basics. For example, in the image below, there are seven very obvious detections and one which is a bit fainter.

I began by giving ChatGPT a detailed description of the image and the task at hand. This is the kind of thing that takes a few minutes to explain to a new observer; the actual training of data inspection can take a few days, but the explanations need only be very short indeed. And finding the bright sources is trivial : almost anyone can do that almost immediately. The bright galaxies are inherently obvious in the data when presented like this. Even if you have no idea what the axis labels refer to, it's clear that some parts of the image are very different to the others.

I asked ChatGPT to mark the location of the sources or otherwise describe their position. It didn't mark them but instead gave descriptions. Its world coordinates weren't precise enough to verify what it had identified, however : it restricted itself to values directly readable from the image and didn't do any interpolation. I also gave it a broad alpha-numeric grid (A-J along the x-axis and 1-6 along the y-axis), but this was too coarse to properly confirm what it thought it had found.
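To make the alpha-numeric grid idea concrete, here is a minimal sketch of how a pixel position maps onto a cell label. The image size, the function name grid_cell and the choice of putting row 1 at the top are all assumptions made for illustration; none of them come from the actual test.

```python
import string


def grid_cell(x, y, width, height, n_cols=10, n_rows=6):
    """Return a cell label such as 'C4' for the pixel (x, y).

    Assumptions (hypothetical, not from the original test): the origin
    is the top-left corner, columns are lettered left to right and rows
    are numbered top to bottom starting at 1.
    """
    if n_cols > len(string.ascii_uppercase):
        raise ValueError("not enough letters to label the columns")
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("pixel lies outside the image")
    col = int(x * n_cols / width)   # 0-based column index
    row = int(y * n_rows / height)  # 0-based row index
    return f"{string.ascii_uppercase[col]}{row + 1}"
```

With a made-up 1000 x 600 pixel image, the A-J by 1-6 grid gives cells 100 pixels on a side. A label like "C4" therefore only pins a source down to a fairly large patch of the image, which is why a grid this coarse makes it hard to confirm an identification.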

Its results were ambiguous at best. Even with this coarse grid it was clear some of its results were simply wrong. So I did what I'd do with new observers. I marked the sources with red outlines and numbers, uploaded the new image and described what I'd done, so it would have some kind of reference image. I also described the sources in more detail, e.g. which ones were bright and which were faint, and whether they extended into adjacent cells.

Next I gave it a new image with a finer grid (A-O and 1-14). This time, two sources (out of the ten or so visible) were reported correctly while the rest were wrong. By mistake, I missed out the "D" cell in the coordinate labels, but ChatGPT reported a source at D4 ! Its revised claims were still wrong though, with only two correct once again.

This wasn't going well. I decided to dial it back and try something simpler. Maybe ChatGPT was able to "see" the features but was not accurately reading the coordinates, or perhaps it was hallucinating its answers and so mangling its results. So now I uploaded an image devoid of any coordinates and asked it for a simple count of the number of bright blobs. It got the answer right ! Okay, better... I asked it if it could mark the locations directly on the image, but it said it couldn't edit images. Instead it suggested giving the coordinates of the sources as a percentage of the axis length from the top left. Fair enough, but when I checked its reported coordinates it again had two near-misses and got all the rest simply wrong.
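Checking coordinates given as a percentage of the axis length from the top left is simple enough to automate. The sketch below is one way it could be done, not what I actually did (I compared by eye) : the function names, the pixel tolerance and the greedy nearest-match rule are all assumptions for illustration.

```python
import math


def percent_to_pixel(x_pct, y_pct, width, height):
    """Convert percentages measured from the top-left corner to pixels."""
    return (x_pct / 100.0 * width, y_pct / 100.0 * height)


def count_matches(reported_pct, true_pixels, width, height, tol):
    """Count the true sources that have a reported position within tol pixels.

    reported_pct : list of (x_pct, y_pct) positions as given by the chatbot
    true_pixels  : list of (x, y) pixel positions marked by a human
    Each reported position can be matched to at most one true source
    (a simple greedy nearest-match rule, chosen here for brevity).
    """
    unused = [percent_to_pixel(x, y, width, height) for x, y in reported_pct]
    matched = 0
    for tx, ty in true_pixels:
        best = None  # (distance, index) of the closest unused report
        for i, (rx, ry) in enumerate(unused):
            d = math.hypot(rx - tx, ry - ty)
            if d <= tol and (best is None or d < best[0]):
                best = (d, i)
        if best is not None:
            unused.pop(best[1])
            matched += 1
    return matched
```

Calling count_matches with the chatbot's list, a hand-marked list and a tolerance of roughly one source width gives a single number to track from one attempt to the next. A "near-miss" in this scheme is just a report that lands slightly outside the tolerance.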

Finally I decided to check if at least the reported number count wasn't just a fluke. I uploaded three images in one file (thus circumventing OpenAI's painfully tight restrictions on the free plan), each labelled with a number, and asked for the number of sources in each. It got one right and the rest wrong. It also gave descriptions of where it thought the sources were (i.e. upper left, middle, that sort of thing) and these were all wrong. Then, rather surprisingly and quite unprompted, it decided that it actually could edit images to mark the positions after all. The result came back :


Well... it's less than stellar. 

The upshot is that nothing much has changed about chatbot use cases at all. Good for discussions, useless for facts. Whether it is "seeing" the images in some sense I don't know : possibly at some level it does recognise the sources but hallucinates both when trying to mark them and describe their positions, or possibly it's just making stuff up and nothing else. The latter seems rather unlikely though. Too often in other tests it was capable of giving results from figures in PDFs and image files which could not have been obtained from reading any of the text, and which required actually "looking" at the images.

Regardless of what it's actually doing, in terms of using ChatGPT as a source extractor, it's a non-starter. It doesn't matter why it gets things wrong : for practical application it only matters that it does. Maybe there's something capable under there, maybe there isn't. For now it's just an energy-intensive way of getting the wrong answers. Well, I could have done that anyway !
