Little Physicists: The shoe's on the other foot

My, how the tables have turned. The hunter has become the hunted. And various other clichés indicating that the normal state of affairs has been reversed.

That is, as well as having to write an observing proposal, I find myself for the first time having to review them. Oh, I've reviewed papers before, but never observing proposals. This came about because ALMA has a distributed proposal review system : everyone who submits their own proposal has to review ten others. And since this year I finally submitted one, I get to experience this process first hand.

The ALMA DPR procedure

When you submit a proposal, you indicate your areas of expertise and any conflicts of interest – collaborators and direct competitors who shouldn't be reviewing your proposal, either because they'd stand to benefit from it being accepted or would love to take you down a peg. It's a double-blind procedure : your proposal can't contain any identifying information and you don't know who the reviewers are. Some automatic checks are also carried out to prevent Co-Is on recent ALMA proposals being assigned as reviewers, and suchlike.

Then your proposal is sent off for initial checks and distributed to ten other would-be observers who also submitted observing proposals in the current cycle. You, in turn, get ten proposals to review yourself. Each document is four pages of science justification (of which normally one or even two pages are taken up with figures, references, and tables) plus an unlimited-length technical section containing the observing parameters for each source plus some brief justification on the specifics (in practice, in most proposals these so-called "science goals" are all very similar, using the same observing setup on multiple targets). You then write a short review of each one, of a maximum of 4,000 characters but typically more like ~1,000 (or even fewer), describing both the strengths and weaknesses of each. You also rank them all relative to each other, from 1 (the strongest) to 10 (the weakest).

That's stage one. A few weeks later, in stage two, you get to see everyone else's reviews for the same proposals, and can then change your own reviews and/or rankings accordingly, if you want to. So far as I know, each reviewer gets a unique group of ten proposals to review, so no two reviewers review the same set of proposals, meaning you can't see the others' rankings. Exactly how their rankings are then all compared and combined, and ultimately translated into awarded telescope time, remains a mystery to me. Those details I leave for some other time, and I won't go into the details of anonymity* here either : I seem to recall hearing that this gives a better balance of both experience and gender, but I don't have anything to hand to back this up.

* I will of course continue to respect the anonymity requirements here, and not give any information that could possibly identify me as anyone's reviewer.

Instead I want to give some more general reflections on the process. To be honest I went into this feeling rather biased, having received too many referee comments which were just objectively bollocks. I was quite prepared to believe the whole thing would be essentially little better than random, which is not a position without merit.

First thoughts

And my initial impressions justified this. It seemed clear to me that everyone had chosen interesting targets and would definitely be able to get something interesting out of their data, making this review process a complete waste of time.

But after I let things sink in a bit more, after I read the proposals a bit more carefully and made some notes, I realised this wasn't really the case. I still stand by (with one exception) that all proposals would result in good science, but the more I thought about it, the more I came to the conclusion that I could make a meaningful judgement on each one. I tried not to judge too much whether one would do better science than another, because who am I to say what's better science ? Why should I determine if studies of extrasolar planets are more important than active galactic nuclei ?

These aren't real examples, but you get the idea. Actually the proposals were all aligned very much more closely with my area of expertise. The length of four pages I would say is "about right" : it gave enough background for me to set each proposal in its proper context as well as going into the specific objectives.

Instead, what I tried to assess was whether each project would actually be able to accomplish the science it was trying to do. I looked at how impactful this would be only as a secondary consideration. There isn't really any right or wrong answer as to whether it's better to look at a single unique target versus a statistical study of more typical objects, but I tried to judge how much impact the observations would likely have on the field, how much legacy value they would have for the community. But first and foremost, I considered whether I was persuaded the stated science objectives could actually be carried out if the observations themselves reached their design spec.

Judgement Day

And this I found was something I could definitely judge. Two proposals to me stood out as exemplary, perfectly stating exactly what they wanted to do and why, exactly what they'd be able to achieve with this. It was very clear that they understood the scientific background as well as anyone did. I initially ranked these essentially as a coin-toss as to who got first and who got second place; I couldn't meaningfully choose between them.

At the opposite extreme were two or three which didn't convince me at all. One of the principal objectives of one of them was just not feasible with the data they were trying to obtain, and in their own proposal they presented better data, which they already had, that would have been much more suitable for this. Lacking self-consistency is a black mark in just about any school of thought. Another looked like it would observe a perfectly good set of objects, but contained so many rudimentary scientific errors that there was no way I could believe they'd do what they said they would do.

Again, deciding which one to rank lowest was essentially random, though I confess that one of them just wound me up the wrong way more than the other.

In the middle were a very mixed bunch indeed. Some had outstanding ideas for scientific discovery but were very badly expressed, saying the same thing over and over again to the nth degree. I would say to these people that there's no obligation to use the full four pages, and we should stipulate this in the guidance to observers and reviewers alike; I tried to ignore the poor writing style of these and rank them highly because of the science. Some oversold the importance of what they'd do, making unwarranted extrapolations from their observations to much more general conclusions. Some had a basically good sample but claimed it was something which it clearly wasn't; others clearly stated what their sample was but the objects themselves were not properly representative of what they were trying to achieve.

This middle group... honestly here, a random lottery would work well. On the other hand, there doesn't seem any obvious reason not to use human judgement here either, because for me at least this felt like a random decision anyway. And if other people's judgements are similar then clearly there are non-random effects which probably should actually be accounted for, whereas if they are truly random then the effects will average out. So there's potentially a benefit in one case and no harm in the other, and in any case there almost certainly is a large degree of randomness at work anyway.
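To make that averaging argument a bit more concrete, here's a minimal sketch. To be clear, this is a toy model of my own and not how ALMA actually combines rankings (which, as I said, I don't know) : I assume each reviewer's score for a proposal is a shared effect that everyone picks up on plus independent Gaussian noise, and the function names and numbers are purely illustrative.

```python
import math
import random
import statistics


def averaged_scores(shared_effect, n_proposals, n_reviewers, noise_sd, rng):
    """Return the per-proposal mean of noisy reviewer scores.

    Assumed toy model: each score is shared_effect (something every
    reviewer tends to notice) plus independent Gaussian noise (one
    reviewer's idiosyncratic judgement).
    """
    means = []
    for _ in range(n_proposals):
        scores = [shared_effect + rng.gauss(0.0, noise_sd)
                  for _ in range(n_reviewers)]
        means.append(statistics.fmean(scores))
    return means


def rms(values):
    """Root-mean-square of a sequence of numbers."""
    return math.sqrt(statistics.fmean([v * v for v in values]))


rng = random.Random(42)  # fixed seed so the run is repeatable

# Purely random judgements: no shared effect, only noise.
single = averaged_scores(0.0, 2000, 1, 1.0, rng)   # one reviewer each
panel = averaged_scores(0.0, 2000, 10, 1.0, rng)   # ten reviewers each

# A genuine flaw that every reviewer tends to mark down by one point.
flawed = averaged_scores(-1.0, 2000, 10, 1.0, rng)

print("scatter with 1 reviewer  :", rms(single))
print("scatter with 10 reviewers:", rms(panel))
print("mean score of flawed set :", statistics.fmean(flawed))
```

Under these assumptions the scatter of the ten-reviewer average shrinks by roughly the square root of ten compared with a single reviewer, while the shared penalty survives the averaging more or less intact. That's the "no harm in one case, potential benefit in the other" point, though real reviewers are of course not independent Gaussian noise sources.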

Reviewing the reviews

I went through a similar process of revising my expectations in stage 2, though to a lesser degree. At first glance I didn't think I'd need to change my reviews or rankings, but on carefully checking one of the other reviews, I realised this was not the case. One reviewer out of the ten had managed to spot a deeply problematic technical issue in one of the proposals that I otherwise would have ranked very highly. And on checking I was forced to conclude that they were correct and had to downgrade my ranking significantly. This alone makes the process worth doing : 1 out of 10 is not high, but with ~1,600 proposals in total, this is potentially a significant number overall.

Reading the other reviews turned out to be more interesting than I expected. While some did raise exactly the same issues with some of the proposals that I had mentioned, many didn't. Some said "no weaknesses" to proposals I thought were full of holes. One even said words to the effect that "no-one should doubt the observers will do good science with this", a statement I felt presumptuous, biased, and bordering on an argument-from-authority : it's for us the reviewers to decide this independently; being told what we should think is surely missing the point.

The reverse of this is that some proposals I thought were strong others thought were weak – very weak, in some cases. Everyone picks up on different things they think are important. There was one strange tendency for reviewers to point out that the ALMA data wouldn't be of the same quality as comparison data. This is fine, except that the ALMA data would usually have been of better quality, and downgrading it to the same standard is trivial ! I sort of wished I'd edited my reviews to point this out. Some also made comments on statistics and uncertainties that I thought were so generic as to be unfair : yes of course things might be different from expectations, but that's why we need to do observations !

What the DPR doesn't really do is give any chance for discussion. You can read the other reviews but you can't interact with the other reviewers. It might have been nice to have somewhere where we could enter a "comment to other reviewers", directed to the group, or at least have some form of alert system when reviews were altered. Being able to ask the observers questions might have been nice, but I do understand the need to keep things timely as well. On that front, reviews varied considerably in length; mine were on the longer and to be honest perhaps overly-long side (I think my longest was nearly 2,000 characters), while one was consistently and ludicrously short.

All this has given me very mixed feelings about my own proposal. On the one hand, I don't think it's anywhere near the worst, and I stand behind the scientific objectives. On the other, I think I concentrated overmuch on the science and not enough on the observational details. Ranking it myself with hindsight I'd probably have to put it in the lower third. It was always a long shot though, so I'll be neither surprised nor disappointed by the presumed rejection. One can but try with these things.

One thing I will applaud very strongly is the instruction to write both strengths and weaknesses of each proposal. All of them, bar none, had some really good points, but it was helpful to remind myself of this and not get carried away when reviewing the ones I didn't much like. Weaknesses were more of a mixed bag; one can always find something to criticise, although in some cases they aren't significant. Still, I found it very helpful to remember that this wasn't an exercise in pure fault-finding.

How in the world one judges which projects to actually undertake, though... that seems to me like the ultimate test of philosophy of science. Groups of experts of various levels have pronounced disagreements about factual statements; some notice entirely different things from others. There's the issue not only of whether the science will be significant, but also of whether the data can be used in different ways from what's suggested. That to me remains the fundamental problem with the whole system : one can nearly always expect some interesting results, but predicting what they could be is a fool's game.

Overall, I've found this a positive experience. Reading the full gamut of excellent to poor proposals really gives a clearer idea of what reviewers are looking for, something it's just not possible to get without direct experience. Not for the first time, I wonder a lot about Aumann's Agreement Theorem. If we the reviewers are rational, we ought to be persuaded by each other's arguments. But are we ? This at least could be assessed objectively, with detailed statistics possible on how many reviewers change their ranking when reading other reviews.

And at the back of my mind is a constant paradoxical tension : a strong feeling that I'm right and others are wrong, coupled with the knowledge that other people are thinking the same thing about me. How do we reconcile this ? For my part, I simply can't. I formulate my judgement and let everyone else do the same, and hope to goodness the whole thing averages out to something that's approximately correct. The paradox is that this in no way makes me feel any the less convinced of my own judgements, even knowing that some fraction of them simply must be wrong.

Other aspects are much more tricky. This is a convergence of different efforts : trying to assess what-is-true (which of the science claimed is factually correct, and why experienced experts still disagree on some points), what will likely benefit the community the most, and how we try and account for the inevitably uncertain and unpredictable findings. As I've said before many times, real, coal-face research is extremely messy. If it isn't already, then I would hope that telescope proposals ought to be an incredibly active field of research for philosophers of science.


Wednesday, 19 June 2024
