Why is climate modelling stuck?

Why is climate modelling stuck? asked mt, and Bryan weighs in too. So I don’t see why I shouldn’t too. This is no kind of comprehensive list or manifesto, the way mt’s is. Just some random thoughts.

First of all, there are too many GCMs, and some of them are cr*p, so much so that they should simply be thrown away. I suspect that certain countries simply built them because they wanted “their” model to appear in the IPCC reports. There are 20+ coupled GCMs in IPCC, and we don’t need that many. I don’t know how many we need – certainly more than one. In fact, I rather suspect that of the total of model development person-hours, more than 75% goes into the top 5 models (plucking random numbers out of the air here) so this may not be a huge problem.

Like mt, I think it's time to throw away the Fortran. Apart from anything else, it's bad for your job prospects :-). Ideally, there should be a model description language, which would then compile into whatever language you wanted. If you're discretising a PDE, it's a bit mad to have to write the code for that by hand. Machines can do that for you. My pet idea here would be a SourceForge open-source model, probably in C++.
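
To make that concrete, here is a minimal sketch (Python plus sympy, with a one-dimensional diffusion equation as a stand-in; nothing here comes from a real GCM): the update rule is written once, symbolically, and the Fortran or C for it is generated rather than typed in by hand.

    import sympy as sp

    # Field value at grid point j and its neighbours, plus the usual constants.
    u, u_m, u_p = sp.symbols('u u_minus u_plus')
    kappa, dt, dx = sp.symbols('kappa dt dx', positive=True)

    # Forward-time, centred-space discretisation of du/dt = kappa * d2u/dx2.
    u_next = u + dt * kappa * (u_p - 2*u + u_m) / dx**2

    # Emit source for whichever language the build wants.
    print(sp.fcode(u_next, standard=95))   # Fortran 95
    print(sp.ccode(u_next))                # C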

Probably, we should throw away the idea of scientists writing the code. Scientists should do science. Software engineers should write code, based on the documentation that the scientists provide. The reason we don't have SEs writing code is because we can't afford them, not because it's a bad idea (what do you mean "we", White Man?).

Down at the trivia level, we don’t need to worry about memory. We have memory to excess.

38 thoughts on “Why is climate modelling stuck?”

  1. The entire history of 'scientists (and engineers) should not write code' demonstrates that this is not a viable plan. It is far easier (and safer) to teach the scientists to program (poorly but accurately) than to teach the software engineers enough physics/chemistry/biology/geology/oceanography/etc. so that they can comprehend what needs to be coded.

    The only exception appears to be in bioinformatics. Even there the researcher needs a good appreciation of information retrieval concepts and a substantial dose of molecular biology.

    [Disclosure: I am a retired professor of computer science.]

  2. Throw away Fortran? Ha. Who wants the job of translating from Fortran to something else? At least based on my experience with a very large legacy code, it isn’t going to happen for a long time, because who’s going to pay for it? Besides, I remain unconvinced of the terribleness of Fortran, despite what some see as its inelegance.

    I tend to agree with David about scientists and engineers coding vs software “engineers”. I have somewhat mixed feelings about it, but I tend toward not.

  3. Nobody (I opine) is suggesting throwing away the existing Fortran code. The question is whether there is a better language for writing the newer parts. Obtaining access to legacy code via some form of foreign function interface is a solved problem — a good example where a software engineer can contribute to the implementation of the 'better' language.

    A problem with F77 (at least) is the lack of modern data structures. This inexpressiveness leads to inefficient use of the scientist's time and is often a source of subtle errors.
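
    As a minimal sketch of the foreign-function-interface route (the file step.f and its SUBROUTINE STEP are made-up stand-ins for a real legacy routine), numpy's f2py will build a wrapper that newer code can call directly:

        # Build the wrapper once from the shell:
        #     python -m numpy.f2py -c step.f -m legacy
        # then call the old Fortran routine from new (here Python) code:
        import numpy as np
        import legacy                  # the extension module f2py just built

        u = np.zeros(100, dtype=np.float64)
        legacy.step(u, 0.5)            # exact arguments depend on the real interface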

  4. I've worked on a few projects in which users write requirements, analysts convert the requirements into technical specifications, and software developers write code based on them.

    The next step is invariably that the users fall off their chairs when they are shown the prototype product, aghast at how impossibly wrong it all is. The Unified Model sucks, really badly, and is just too slow for the rapid model-develop-test-fix-develop-check cycle of scientific software development.

    I think there's a niche here, for computational scientists – who are first and foremost scientists, but have enough exposure to software development to be able to apply good practices to their colleagues' code. Full disclosure, I'm kind of talking about myself here. Not everyone needs to know what inheritance is (for example), but any research project with an element of modelling should have someone on staff who knows when and why it might be a good idea to use it. Or (believe it or not, this happened to me last week) explain to their colleagues why it is considered very bad practice to use goto statements for flow control. And, probably most importantly, someone who knows what modules and libraries are already out there that can be incorporated.

    A lot of people have tried to nail the problem of wheel reinvention in scientific software, with generic DE solvers and extensible, OO-based frameworks (e.g. St Germain). What I think works is well designed, self contained, supported and documented libraries. The reason matlab, IDL and their ilk are so good for rapid development is not the languages themselves, which tend to be annoyingly idiosyncratic; it's the fact that you've got ODE solvers, plotting routines, array reshapers, statistical analysis and more right at your fingertips.
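
    To illustrate the point with a toy example (a damped oscillator; the names and numbers are invented), the whole solve-and-plot job is a handful of library calls:

        import numpy as np
        import matplotlib.pyplot as plt
        from scipy.integrate import odeint

        def rhs(y, t):                    # y = [position, velocity]
            return [y[1], -0.1 * y[1] - y[0]]

        t = np.linspace(0.0, 50.0, 500)
        sol = odeint(rhs, [1.0, 0.0], t)  # the ODE solver is just there, ready to use

        plt.plot(t, sol[:, 0])
        plt.xlabel('t'); plt.ylabel('x')
        plt.show()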

  5. I'd be happy to have someone write code for me. I just need someone else to offer to pay that person's salary.

    [Of course. This is the problem. It appears more "efficient" to those in charge to have scientists do their own coding. And of course, it's historical -W]

  6. http://www.realclimate.org/index.php/archives/2008/01/new-rule-for-high-profile-papers/#comment-79100

    Just quoting the comment here for later readers to find; I'm no programmer; he is, and his pointers are often productive, so here it is in full:

    # Mike Boucher Says:
    12 January 2008 at 4:30 PM

    Attempting to come up with a better language for future highly parallel scientific computing by choosing among existing languages designed in the past for serial general computing may work, but one can imagine at least one other probable outcome.

    Another approach underway by my former employer is to create a new language specifically designed for the purpose:
    http://www.gcn.com/print/26_03/43058-1.html

    Click to access 1.02_steele.pdf

    http://research.sun.com/minds/2005-0302/

    There are undoubtedly other language development efforts underway, but this is the one of which I know. Ordinarily, I would be skeptical in the extreme about an effort to develop a language that needs to be supported by such an extensive and sophisticated infrastructure (IMSL, MPI, graphics libs, compilers, users’ groups, docs, etc.). However, whatever you think of Java, I think it’s indisputable that it came with an enormous infrastructure of at least reasonable quality and the organization that did that could do the same here.
    ——-

    Other references:

    InfoQ: Sun Open Sources Fortress programming language for the JVM…. designed specifically for high-performance computing (HPC), has been released on SunSource under the BSD license.
    http://www.infoq.com/news/2007/01/fortress-open-source

    Sun Labs Programming Language Research Group
    Fortress Community Website to download Fortress, learn all about it, and participate in its development …
    http://research.sun.com/projects/plrg/

    Although a preliminary interpreted implementation of the language was produced, the DARPA contract was not renewed in November 2006… “In January 2007 it became an open-source project …. People outside Sun are now writing Fortress code and testing it using the open-source Fortress interpreter.” http://en.wikipedia.org/wiki/Fortress_programming_language

    That’s the last you’ll hear about it from me, I know nothing more. Just interesting to see it happening; I did see some climate modelers’ email addresses in their online exchanges.

    So, dunno if anyone will write code for you or not.

  7. To have an outside entity (programmer, organization) do your development, you need to know exactly what it is you want. Which you don't, if what you are doing could be labeled "research" rather than, say, "product development". What you will get delivered will be exactly what you asked for and completely useless. As a minimum you do need a working prototype of the system you want – but for a lot of modeling, the "working prototype" _is_ the model. You have little use for a cleaned up, polished version of the same thing (certainly not enough to burn a large portion of your budget creating it). This is true not just for modeling software, but for a lot of other science-related development as well (robotics comes to mind for me).

  8. Where I work (as an industrial scientist) we do some outsourcing of software engineering. Typically we write core 'scientific' routines in the language of our choice, and the company we employ either translates them or embeds them into a more beautifully structured, properly engineered package. They're not set up to do implementation from a paper or an idea.

    This route only really applies to code that is going to be used by other, non-expert users.

  9. I agree with other comments above that outsourcing wouldn't work, but I think having good programmers working with the scientists on improving their code and efficiently implementing their algorithms could work very well, and would as a side effect be a good way to educate the scientists in writing better code. I also think there could be more support for formally training the scientists in better programming and software design – as a scientist who has written code (just for local use) I'm very aware of the rather ad-hoc way I've learnt to program and the shortcomings of my code as a result.

    I’d be interested to know what kind of things a gun-programmer would think the most useful things for scientists-turned-programmers to learn.

  10. “Ideally, there should be a model description language, which would then compile into whatever language you wanted.”

    Something akin to Z?
    http://en.wikipedia.org/wiki/Z_notation
    http://spivey.oriel.ox.ac.uk/mike/zrm/index.html

    [No, Z is far too formal. More like something that would take the LaTeX equations from your paper (to define the PDE) and something that knows what discretisation you want. That's only some of the code, of course. But it would build your core -W]

    The idea of Z is to be able to specify an algorithm that can then be unambiguously converted to code. As you say, this may be best done at the module level initially, and then may be used to specify how these pre-built modules can be re-used?

    The idea being that climate scientists learn Z (for sake of argument) and the actual code implementation can be separate and independent (some might say irrelevant to the high level requirements). The same model could be knocked out in different languages for different platforms. Using different versions of the modules you could have different scales and implementations of various aspects (nb I have no experience of modelling so this last bit may be badly worded or a bit pointless).

    “probably in C++.”

    I would keep the languages open. You might want to, say, code the low-level modules in something like C or C++ but use other languages (like Python) to tie them together, as this may mean it's easier to change how a model works – or just leave the higher level stuff completely open. (A small sketch of the glue-code idea follows at the end of this comment.)

    I think some of these aspects were mentioned on MT’s discussion, and any that were I’ve probably ingested from there.

    [Could be. I’m not too bothered. The real problem is keeping the thing maintainable -W]
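
    A small sketch of the glue-code split mentioned above (the library name libcore.so and the routine advance are made up; it assumes a C routine with signature void advance(double *u, int n, double dt), compiled separately):

        import ctypes
        import numpy as np

        core = ctypes.CDLL('./libcore.so')           # the compiled low-level module
        core.advance.argtypes = [np.ctypeslib.ndpointer(dtype=np.float64),
                                 ctypes.c_int, ctypes.c_double]
        core.advance.restype = None

        u = np.zeros(1000, dtype=np.float64)
        for _ in range(100):                         # the time loop stays in Python,
            core.advance(u, len(u), 0.5)             # the expensive step stays in C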

  11. anyway, surely in a few years you’ll only need to talk to your computer and it will convert it to code right away 🙂

    about GCMs: I used to think having several GCMs was interesting as long as they are independently built – which they're not, of course. So maybe you'd also need a genealogy tree of GCMs, so you can choose which ones to keep.

  12. First of all, there are too many GCMs, and some of them are cr*p, so much so that they should simply be thrown away. I suspect that certain countries simply built them because they wanted “their” model to appear in the IPCC reports

    Just as a "here's something to hope for" category, maybe as time goes by there will be more standardization of the models and all the countries will also find some way to get their names included too!
    Dave Briggs :~)

  13. If you have a degree and doctorate in mathematics, are you really a scientist? I do FE/CFD programming using maths and physics too, but I never considered it science. Are you proud enough of your own model to make it open-source? Maybe that way we can perhaps force the more obviously bad models out of the IPCC process. Getting rid of Fortran gets rid of the GISS model, no? What language does Hadley use?

    [Am I now? Definitely not. Was I then? Probably.

    It's not my model, it's the Hadley Centre's, and it's in Fortran. They are paranoid about copyright so it won't be OS'd; when I last looked, it (HadCM3, say) wasn't even publicly viewable -W]

  14. I suggest that computational scientists, scientists who write code to support their science that is, need to learn modern data structures and a bit of the analysis of algorithms. This might be managed, in a university setting, by inviting computer science faculty, one hour per week, to give a less-than-one-hour talk on some aspect of data structures and algorithms.

    This way, computational scientists can continue to write their own code (as they will have to anyway), but be able to produce better solutions to the more difficult problems.

    [Disclosure: I am a retired professor of computer science.]

  15. SCM – the three most useful things for scientists-turned programmers to learn to do (to make the life of aspiring gun-programmers who debug their code easier):

    1. Write re-usable functions and modules. Actually reuse them! Tell the rest of the group where they are. Do this for enough functions and you have a library. (A tiny example follows after this list.)

    2. Use a version control system, even just for your personal stuff. It makes sharing easier and encourages you to group things into coherent 'projects'.

    3. Maintainability is more of a social problem than a technical problem. So write some documentation. A few lines at the top of a script, or a sentence on the group wiki, will quite possibly save someone hours. Probably yourself, in six months' time, when you wonder how exactly the package you wrote back then works.
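
    A tiny example of points 1 and 3 together (the module name groutils.py and the function are invented): a documented function that lives in one shared place instead of being re-typed in every script.

        # groutils.py (a shared group module: import it, don't copy-paste it)
        import numpy as np

        def anomaly(series, baseline=slice(0, 30)):
            """Return series minus its mean over the baseline period.

            series   : 1-D numpy array, e.g. annual mean temperatures
            baseline : slice selecting the reference period (default: first 30 points)
            """
            return series - series[baseline].mean()

        # elsewhere:  from groutils import anomaly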

  16. If I may add one point to ac’s list of useful things:
    – please use meaningful names for functions and symbols. The era of six-letter function or variable names is long over.
    Others will thank you :)

  17. This may be obvious, but I still run across it in commercial IT projects. If you use comments, make sure you update them when you update the code. Also things like:

    x = x + 1 // increment x

    are a bit pointless, whereas:

    x = x + 1 // allow for border

    are useful (example nicked from a style guide I recently read, I forget where).

  18. I am a physicist who has also been employed as professional computer programmer (doing signal processing R&D and modeling our custom mixed signal integrated circuit).

    Agree with others, it is much easier to teach a physicist good programming. For complex programs it takes good management and a commitment to good programming. That commitment is usually missing in scientists' programs. I'm sure this is partially because most scientists have never seen good programs and good project development being done.

    Things like code reviews, good commenting and documentation, using version control, and using developed libraries (and developing your own as necessary) are all basic requirements for writing usable, maintainable and extensible code. Programs need to be built from the ground up using sound software engineering principles. It looks to me that this was not done for the GCMs (if it had been, they wouldn't be using FORTRAN!).

    Having programmers and scientists work together would be one way to get better programs. When I was employed as a programmer, we often programmed in pairs. It may seem like having two people write the same code at the same time is a waste of resources, but it meant much less time debugging and much more clearly written and better organized code. Overall it saved time and money. Having the scientist work with a programmer would be one way to teach the scientists.

  19. I favor pair programming. But it is not clear how much this might help when both are scientists who know relatively little computer science and less software engineering.

  20. I suggested pairing a computer scientist and a climate modeler. This would be very important in the beginning of the model development. Once the structure of the program is developed and good structures are in place, you can do without the computer scientist, and by that time the modeler will have learned enough to work with other scientists.

  21. In my own experience, it's generally better to have scientists/engineers write their own code. The code may not be elegant, and may be full of hacks. But in general, the representation of the science can be more easily encapsulated by a scientist with knowledge of the field than by attempting to explain your thought process to a software engineer.

    [I rather disagree with this. GCMs are so big and complex, and have such long lifetimes, that they can't survive in a maintainable and upgradeable state if they're full of hacks. Attempting to explain your thought processes is a positive good, not a negative -W]

    I've also noted, in my experience, that IT/computer support for scientists/engineers, particularly in the testing phase, can be a bit of a mixed bag. For some applications, help is available easily. For others, it's difficult to obtain.

    My 2 cents on FORTRAN. It's an archaic language, developed in the 1950s, where it should have remained. It is difficult to interface to existing software written in other languages (particularly as its I/O is very strange). I feel it should be phased out. My guess is that it has only remained in use for so long due to a mixture of laziness and first-language syndrome on the part of the scientific community.

    [I agree there -W]

  22. David B. Benson. My pleasure. I hope you’ve managed to follow the first link all the way up and down. It wasn’t the best of links in.

    Frankly, having looked at it, I’m glad I’ve not got to maintain it, modify it or use it!

    “Have you ever heard anything about God, Topsy?” The child looked bewildered, but grinned as usual. “Do you know who made you?” “Nobody, as I knows on,” said the child, with a short laugh. The idea appeared to amuse her considerably; for her eyes twinkled, and she added, “I spect I grow’d. Don’t think nobody never made me.”

    Uncle Tom’s Cabin

  23. OK, this discussion has been going on for a while here and there. The assumption is that better coding is somehow related to making models more useful.

    Allow a heartfelt bulls__t. The history of these models is that they advanced by brute force, faster processors, more memory, etc. This allowed modelers to go from 1-D to 3-D and to shrink the size of the cells. It allowed them to insert more of the kitchen sink. Allow Eli to point out that the modelers pretty much always knew where the plumbing parts were, but needed machine capacity to insert them. Does anyone seriously think that putting the full HITRAN database into a GCM would make a significant difference, for example?

    Shrinking the size of the cells means that the cost grows as roughly n^3, where n is the number of cells along each dimension. Better coding might allow you to make n a bit larger, and put a few more bells and whistles in, but not much.

    So that calls for fundamental changes, and the need for a fundamental change is why models are currently stuck. They have gone about as far as they can along the current track.

  24. Eli Rabett — Obviously bigger, faster machines help. I am not sure what you mean by fundamental change, but I'll opine that what seems to be the current way of doing an ocean-air interface could use some serious study of modern (computer science) data structures. Now these are really difficult to even think about, much less code, if your programming ideas are fundamentally just F77, C, Korn shell and (ugh!) Perl.

    Python is fairly good for expanding one’s horizons. However, a statically typed (mostly) functional programming language is even better, IMO. Various companies have chosen O’Caml (which meets my criteria) as their primary programming language. I know of hedge funds and bioinformatics companies. Both use lots of maths.

    So maybe this unsticks some aspects of climate modeling. However, the biggest payoff will certainly come from better ways of organizing (indeed thinking about) the computational form of the equations utilized. Go for it!

  25. OK, speaking as a scientist-turned-software-engineer (a practising software engineer)…

    First – yes, it’s nice to select a boutique language with some nice features, but you have instantly reduced the pool of people who even have a chance of understanding the code by 99%. C, C++, Java, and other mainstream languages give you a much bigger base of people.

    Second – Eli, the issue here is that without things like modularity and reusability, every model has to be built from the ground up, and the idea of being able to mix and match different ways of modelling the same thing is a pipe dream. In a perfect world, you could assemble a GCM from parts written by completely separate research groups, so instead of having 20-odd GCMs, different research groups could specialise in different aspects and plug them all together (a toy sketch of the idea follows at the end of this comment). That's how you get a big jump in productivity.

    Third – There is a large base of people out there with both strong scientific degrees and significant software engineering experience; but there is a question of cost. 2 kids and a mortgage tend to focus the mind when it comes to pay packets.
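
    To illustrate the mix-and-match idea (entirely schematic; the components and numbers are invented), the key ingredient is that every component honours the same small interface, so parts from different groups can be swapped in and out:

        class Component:
            def step(self, state, dt):
                raise NotImplementedError

        class ToyAtmosphere(Component):
            def step(self, state, dt):
                state['T_air'] += dt * 0.01 * (state['T_sea'] - state['T_air'])
                return state

        class ToyOcean(Component):
            def step(self, state, dt):
                state['T_sea'] += dt * 0.001 * (state['T_air'] - state['T_sea'])
                return state

        def run(components, state, dt, nsteps):
            for _ in range(nsteps):
                for c in components:
                    state = c.step(state, dt)
            return state

        # Couple any atmosphere to any ocean that honours the interface.
        print(run([ToyAtmosphere(), ToyOcean()], {'T_air': 250.0, 'T_sea': 280.0}, 1.0, 1000))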

  26. Andrew, I think you are mistaken about what I was saying. Yes, especially with the generational change, more science types understand C, C++ and Java than FORTRAN (this was not the case ten years ago), and yes, there are features in these languages that will make code more portable and better executing.

    BUT THAT IS NOT THE PROBLEM WITH CLIMATE MODELING!!

    The problem is how to get to regional size using something besides brute force, because we are rapidly reaching the point where brute force will no longer be possible.

    In a real sense, outside of the denialist sites, everyone knows that the models have solved the problem on the scale at which they operate. Straightforward improvements will only be marginally useful (GS sense), although we can look for better field data. The situation is like polio in 1954: do we put our effort into building better iron lungs, or do high-risk stuff looking for a vaccine? Do we even know in which direction to look for a vaccine? Climate modeling needs a fundamental breakthrough to allow useful predictions on the regional scale. Better coding would be nice, and economically useful, but it is not necessary.

    BTW, if everyone is using the same modules, you better be damned sure that the modules are perfect.

  27. I think Eli and I are taking opposite sides here.

    I think the physics of the situation is already known and no breakthrough is possible there. If any breakthrough is possible it must lie in methods and not in theory.

    I do not think the physics admits of fundamental improvements. I do not believe that some fundamental Einsteinian insight is lurking out there comparable to the dynamical insights of the first half of the previous century that will suddenly bring the models into focus. The system is just too damned messy to allow any such novel insight.

    Rather, I assert that there are engineering solutions which become feasible under two conditions: an increase in high-end computing power of several additional orders of magnitude (almost assured) and an increase in the formal expressive power of the notation in which the models are expressed to the machine (culturally unlikely, I suspect), such that formal deductions about the models can be made with the assistance of metamodeling tools. These methods can explore the model space and identify the models which actually have predictive power.

    There is a lot of resistance to this class of idea among the climate people who have thought about it. Such resistance is understandable but hopefully it is wrong.

    If the resistance is well founded we have about all the information we can expect to get from GCMs, and the orders of magnitude of additional computing power we can envision will be best spent on other matters besides climate.

    mt

  28. If MT is right, brute force gets to the same end, just a bit slower. OTOH plot the change in resolution of the models as a function of time and tell me why you think brute force will work?

    I don't think new physics is needed. I agree that the physics we have is more than enough for the job (OK, clouds…), but rather we need a fundamental change in the way in which the models are organized, independent of programming language. I realize this is vague, but I don't think the problem is the programming language or style, having experienced multiple changes of language in my lifetime.

    If I had to take a guess, I would think something like the way the old analog computers worked might be interesting.

  29. "Orders of magnitude more computing power" — yes, more, but the prospects for additional speed do not appear to be good just now.

    Hence new ideas are indeed needed, assuming all the physics is known.

    Here is a quote from Roger Hindley, regarding static type inference and checking algorithms:

    Curry used it informally in the 1950's, perhaps even 1930's, before he wrote it up formally as an equation-solving procedure in 1967 (published 1969). Curry's algorithm includes a unification algorithm.
    The algorithm of Hindley, dating from 1967, depends on Robinson's unification algorithm.
    The Milner algorithm depends on Robinson too.
    J.H. Morris gave an equation-solving algorithm in his thesis at MIT (1968, but presumably devised some time before then); it includes a unification algorithm in the same way Curry's does.
    Carew Meredith, working in propositional logics, used a Hindley-like algorithm in the 1950's; by the formulas-as-types correspondence, this is a principal-type-scheme algorithm, in today's language.
    Tarski had used, it is rumoured, a p.t.s. or unification algorithm in early work in the 1920's.

    I’ll return to this in a subsequent post, but haven’t the time to write it just now.

  30. Following Bryan's link in the main post, one finds a link to a marvellous essay by Phil Graham. The whole is to be savoured, beginning with a quote from Guy Lewis Steele, Jr., stating that Java gets about half-way to Lisp. Lisp? Let's review the history of important ideas in programming languages:

    1958 LISP

    1960 Algol 60 introduces static scoping.

    1976 Steele/Sussman invent function closures, making functions first-class citizens in a statically scoped language, Scheme.

    1982 Robin Milner (in Damas/Milner) re-discovers the Hindley algorithm and brings it to the attention of computer scientists.

    1991 Standard ML defined in Milner/Tofte/Harper/MacQueen (revised in 1997).

    But the Hindley-Milner type inference algorithm can be used for more than just assuring that well-typed programs can have no run-time type errors. Here is the abstract of Cordelia Hall's 1994 paper:

    Lists are a pervasive data structure in functional programs. The generality and simplicity of their structure makes them expensive. Hindley-Milner type inference and partial evaluation are all that is needed to optimise this structure, yielding considerable improvements in space and time consumption for some interesting programs. This framework is applicable to many data types and their optimised representations, such as lists and parallel implementations of bags, or arrays and quadtrees.

    Reread the last sentence. Sounds to me that this might be useful for climate modelers.

    It is true that much poorer languages, such as the ones Andrew Dodds mentions, are more popular. (But re-read the Phil Graham essay.) Indeed, many more people program in an ML variant, O'Caml, than in Standard ML. There is even a book, OCaml for Scientists, albeit an expensive one. The point is that O'Caml offers perhaps better expressiveness than Standard ML (but with some disadvantages I am not willing to accept for my own programming).

    So possibly, just possibly, learning about these concepts will help climate modelers to find means of expressing more efficient means of carrying out the algorithms needed to model climate.

  31. I think David meant Paul Graham; he is crossing some wires with Senator Phil Gramm, R-TX, of whom the only thing I remember is what I call Gramm's Principle: "The only way anyone gets anything done in this country is at a profit". Unfortunately there is no apparent business model in climate modeling. It probably does add value, but there is no way for it to extract that value in the for-profit sector. I suspect if there were I wouldn't have nearly so much to gripe about.

    Regarding the various subtleties that David is going on about, I for one have to join my colleagues in a heartfelt "huh?" I have no idea what any of this means, and I think it is illustrative of CS people pushing their obsessions without sufficient regard to whether and how they apply.

    I’m a big fan of Paul Graham, but here is an essay somewhat critical of him which explains my attachment to Python:

    http://www.prescod.net/python/IsPythonLisp.html

    This is the best explanation of the role of Python in science that I have seen, even though the article isn't science-specific. Here is the core of the argument:

    “I get paid to share my code with “dufuses” known as domain experts. Using Python, I typically do not have to wrap my code up as a COM object for their use in VB. I do not have to code in a crappy language designed only for non-experts. They do not have to learn a hard language designed only for experts. We can talk the same language. They can read and sometimes maintain my code if they need to.”

  32. Michael Tobis — Thanks for the correction and the link! "Paul Graham" it is, and of course I don't think he is perfect. I've already stated that Python is likely to be mostly OK (for you, not for me). Being such a kitchen-sink language, it may have essentially everything from O'Caml that you might want anyway. Nonetheless, the aforementioned book might well be worth your while reading (especially if you can convince the library to buy it instead of you).

    My main point is that there are data structuring principles which might well enable climate modelers to 'unstick'. Python might suggest some, but almost surely does not have partial-evaluation ideas, especially those based on the constraint-solution technique called Hindley-Milner type inference. This, as well as even simpler applications of quadtrees (maybe octrees for three dimensions), may well be of importance. I merely suggest it, being far from a domain expert.

    For a perspective on programming language principles, I recommend

    John C. Mitchell
    Concepts in Programming Languages
    Cambridge University Press, 2003

    which a local library might already have, but in any case comes with a modest price.
