Some parts of the discussion of Oh dear, oh dear, oh dear: chaos, weather and climate confuses denialists have turned into discussions of (bit) reproducibility of GCM code. mt has a post on this at P3 which he linked to, and I commented there, but most of the comments continued here. So its worth splitting out into its own thread I think. The comments on this issue on that thread are mostly mt against the world; I’m part of the world, but nonetheless I think its worth discussing.
What is the issue?
The issue (for those not familiar with it, which I think is many. I briefly googled this and the top hit for “bit reproducibility gcm” is my old post so I suspect there isn’t much out there. Do put any useful links into comments. Because the internet is well known to be write-only, and no-one follows links, I’ll repeat and amplify what I said there) is “can large-scale computer runs be (exactly) reproduced”. Without any great loss of generality we can restrict ourselves to climate model runs. Since we know these are based effectively on NWP-type code, and since we know from Lorenz’s work or before that weather is chaotic, we know that means that on every time step, for every important variable, everything needs to be identical down to the very last bit of precision. Which is to say its all-or-nothing: if its not reproducible at every timestep down to the least significant bit, then it completely diverges weatherwise.
I think this can be divided down into a heirarchy of cases:
The same code, on the same (single-processor) machine
Nowadays this is trivial: if you run the same code, you’ll get the same answer (with trivial caveats: if you’ve deliberately included “true” random numbers then it won’t reproduce; if you’ve added pseudo-random numbers from a known seed, then it will). Once upon a time this wasn’t true: it was possible for OSs to dump your code to disk at reduced precision and restore it without telling you; I don’t think that’s true any more.
The (scientifically) same code, on different configurations of multiple processors
This is the “bit reproducibility” I’m familiar with (or was, 5+ years ago). And ter be ‘onest, I’m only familiar with HadXM3 under MPP decomposition. Do let me know if I’m out of date. In this version your run is decomposed, essentially geographically, into N x M blocks and each processor gets a block (how big you can efficiently make N or M depends on the speed of your processor versus the speed of your interconnect; in the cases I recall on our little Beowulf cluster, N=1 and M=2 was best; at the Hadley Center I think N = M = 4 was considered a fair trade-off between speed of completion of the run and efficiency).
Note that the decomposition is (always) on the same physical machine. Its possible to conceive of a physically distributed system; indeed Mechoso et al. 1993 does just that. But AFAIK its a stupid idea and no-one does it; the network latency means your processors would block and the whole thing would be inefficient.
In this version, you need to start worrying about how your code behaves. Suppose you need a global variable, like surface temperature (this isn’t a great example, since in practice nothing depends on global surface temperature, but never mind). Then some processor, say P0, needs to call out to P0..Pn for their average surface temperatures on their own blocks, and (area-)average the result. Of course you see immeadiately that, due to rounding error, this process isn’t bit-reproducible across different decompositions. Indeed, it isn’t necessarily even bit-reproducible across the same decomposition, but with random delays meaning that different processors put in their answers at different times. That would depend on exactly how you wrote your code. But note that all possible answers are scientifically equivalent. They differ only by rounding errors. It makes a difference to the future path of your computation which answer you take, but (as long as you don’t have actual bugs in your code or compiler) it makes no scientific difference.
Having this kind of bit-reproducibility is useful for a number of purposes. If you make a non-scientific change to the code, one which you are sure (in theory) doesn’t affect the computation – say, to the IO efficiency or something – then you can re-run and check this is really true. Or, if you have a bug that causes the model to crash, or behave unphysically, then you can run the code with extra debugging and isolate the problem; this is tricky if the code is non-reproducible and refuses to run down the same path a second time.
Obviously, if you make scientific changes to the code, it can’t be reproducible with code before the change. Indeed, this is practically the defn of a scientific change: something designed to change the output.
The same code, with a different compiler, on the same machine. Or, what amounts to much the same, the same code with “the same” compiler, on a different machine
Not all machines follow the IEEE model (VAXes didn’t, and I’m pretty sure DEC Alpha’s didn’t either). Fairly obviously (without massive effort and slowdown from the compiler) you can’t expect the bitwise same results if you change the hardware fundamentally. Nor would you expect identical results if you run the same code at 32 bit and 64 bit. But two different machines with the same processor, or with different processors nominally implementing IEEE specs, ought to be able to produce the same answers. However, compiler optimisations inevitably sacrifice strict accuracy for speed, and two different compiler vendors will make different choices, so there’s no way you’ll get bit repro between different compilers at anything close to their full optimisation level. Which level you want to run at is a different matter; my recollection is that the Hadley folk did sacrifice a little speed for reproducibility, but on the same hardware.
Does it matter, scientifically?
In my view, no. Indeed, its perhaps best turned round: anything that does depend on exact bit-repro isn’t a scientific question.
Why bit-repro doesn’t really matter scientifically
When we’re running a GCM for climate purposes, we’re interested in the climate. Which is the statistics of weather. And a stable climate – which is a scientifically reliable result – means that you’ve averaged out the bit-repro problems. If you did the same run again, in a non-bit-repro manner, you’d get the same (e.g.) average surface temperature, plus or minus a small amount to be determined by the statistics of how long you’ve done the run for. Which may require a small amount of trickery if you’re doing a time-dependent run and are interested in the results in 2100, but never mind.
Similarly, if you’re doing an NWP run where you do really care about the actual trajectory and are trying to model the real weather, you still don’t care about bit-repro, because if errors down at the least-significant-bit level have expanded far enough to be showing measureable differences, then the inevitable errors in your initial conditions, which in any imaginable world are far far larger, have expanded too.
Related to this is the issue people sometimes bring up about being able to (bit?) reproduce the code by independent people starting from just the scientific description in the papers. But this is a joke. You couldn’t get close. Certainly not to bit-repro. In the case of a very very well documented GCM you might manage to get close to climate-reproducibility, but I rather doubt any current model comes up to this kind of documentation spec.
[Update: Jules, correctly, chides me for failing to mention GMD (the famous journal, Geoscientific Model Development) the goal is what we call “scientific reproducibility”.]
Let’s look at some issues mt has raised
mt wrote There are good scientific reasons for bit-for-bit reproducibility but didn’t, in my view, provide convincing arguments. He provided a number of practical arguments, but that’s a different matter.
1. A computation made only a decade ago on the top performing machines is in practice impossible to repeat bit-for-bit on any machines being maintained today. I don’t think this is a scientific issue, its a practical one. But if we wanted to re-run, say, the Hansen ’88 runs that people talk about a lot then we could run them today, on different hardware and with, say, HadXM3 instead. And we’d get different answers, in detail, and probably on the large-scale too. But that difference would be a matter for studying differences between the models – an interesting subject in itself, but more a matter of computational science than atmospheric science. Though in the process you might discover what key differences in the coding choices lead to divergences, which might well teach you something about important processes in atmospheric physics.
2. What’s more, since climate models in particular have a very interesting sensitivity to initial conditions, it is very difficult to determine if a recomputation is actually a realization of the same system, or whether a bug has been introduced. Since this is talking about bugs its computational, not scientific. Note that most computer code can be expected to have bugs somewhere; it would be astonishing of the GCM codes are entirely bug-free. Correcting those bugs would introduce non-bit-repro, but (unless the bugs are important) that wouldn’t much matter. So, to directly address one issue raised by The Recomputation Manifesto that mt points to: The result is inevitable: experimental results enter the literature which are just wrong. I don’t mean that the results don’t generalise. I mean that an algorithm which was claimed to do something just does not do that thing: for example, if the original implementation was bugged and was in fact a different algorithm. I don’t think that’s true; or rather, that it fails to distinguish between trivial and important bugs. Important bugs are bugs, regardless of the bit-repro issue. Trivial bugs (ones that lead, like non-bit-repro, to models with the same climate) don’t really matter. TRM is very much a computational scientist’s viewpoint, not an atmospheric scientist’s.
3. refactoring. Perhaps you want to rework some ugly code into elegant and maintainable form. Its a lot easier to test that you’ve done this right if the new and old are bit-repro. But again, its coding not science.
4. If you seek to extend an ensemble but the platform changes out from under you, you want to ensure that you are running the same dynamics. It is quite conceivable that you aren’t. There’s a notorious example of a version of the Intel Fortran compiler that makes a version of CCM produce an ice age, perhaps apocryphal, but the issue is serious enough to worry about. This comes closest to being a real issue, but my answer is the section “Why bit-repro doesn’t really matter scientifically”. If you port your model to a new platform, then you need to perform long control runs and check that its (climatologically) identical. It would certainly be naive to swap platform (platform here can be hardware, or compiler, or both) and just assume all was going to be well. If there is an Intel fcc that makes CCM produce an ice age, then that is a bug: either in the model, or the compiler, or some associated libraries. Its not a bit-repro issue (obviously; because it produces a real and obvious climatological difference).
Some issues that aren’t issues
A few things have come up, either here or in the original lamentable WUWT post, that are irrelevant. So we may as well mark them as such:
1. Moving to 32 / 64 / 128 bit precision. This makes no fundamental difference, it just shifts the size of the initial bit differences, but since this is weather / climate, any bit differences inevitably grow to macroid dimensions.
2. Involving numerical analysis folk. I’ve seen it suggested that the fundamental problem is one with the algorithms; or with the way those are turned into code. Just as in point 1, this is fundamentally irrelevant to this point. But, FWIW, the Hadley Centre (and, I assume, any other GCM builder worth their salt) have plenty of people who understand NA in depth.
3. These issues are new and exciting. No, these issues are old and well known. If not to you :-).
4. Climate is chaotic. No, weather is chaotic. Climate isn’t (probably).
Some very very stupid or ignorant comments from WUWT
Presented (almost) without further analysis. If you think any of these are useful, you’re lost. But if you think any of these are sane and you’re actually interested in having it explained why they are hopelessly wrong, do please ask in the comments.
1. Ingvar Engelbrecht says: July 27, 2013 at 11:59 am I have been a programmer since 1968 and I am still working. I have been programming in many different areas including forecasting. If I have undestood this correctly this type of forecasting is architected so that forecastin day N is built on results obtained for day N – 1. If that is the case I would say that its meaningless.
2. Frank K. says: July 27, 2013 at 12:16 pm … “They follow patterns of synthetic weather”?? REALLY? Could you expand on that?? I have NEVER heard that one before…
3. DirkH says: July 27, 2013 at 12:21 pm … mathematical definition of chaos as used by chaos theory is that a system is chaotic IFF its simulation on a finite resolution iterative model…
4. ikh says: July 27, 2013 at 1:57 pm I am absolutely flabbergasted !!! This is a novice programming error. Not only that, but they did not even test their software for this very well known problem. Software Engineers avoid floating point numbers like the plague…
5. Pointman says: July 27, 2013 at 2:19 pm Non-linear complex systems such as climate are by their very nature chaotic… (to be fair, this is merely wrong, not stupid)
6. Jimmy Haigh says: July 27, 2013 at 3:25 pm… Are the rounding errors always made to the high side?
7. RoyFOMR says: July 27, 2013 at 3:25 pm… Thank you Anthony and all those who contribute (for better or for worse) to demonstrate the future of learning and enquiry.
8. ROM says: July 27, 2013 at 8:38 pm… And I may be wrong but through this whole post and particularly the very illuminating comments section nary a climate scientist or climate modeler was to be seen or heard from. (He’s missed Nick Stokes’ valuable comments; and of course AW has banned most people who know what they’re talking about)
9. PaulM says: July 28, 2013 at 2:57 am This error wouldn’t be possible outside of academia. In the real world it is important that the results are correct so we write lots of unit tests. (Speaking as a professional software engineer, I can assure you that this is drivel).
10. Mark says: July 28, 2013 at 4:48 am Dennis Ray Wingo says: Why in the bloody hell are they just figuring this out? (They aren’t. Its been known for ages. The only people new to this are the Watties).
11. Mark Negovan says: July 28, 2013 at 6:03 am… THIS IS THE ACHILLES HEAL OF GCMs. (Sorry, was going to stop at 10, but couldn’t resist).
Refs
* Consistency of Floating-Point Results using the Intel® Compiler or Why doesn’t my application always give the same answer? Dr. Martyn J. Corden and David Kreitzer, Software Services Group, Intel Corporation