Sunday, April 17, 2011

Acquisition Death Spirals

This isn't about the normal death spiral of increasing unit costs driving production cuts, which increases unit costs, which drives production cuts, which.... It's about another sort of price spiral caused by the US government's infatuation with sole-sourcing critical capabilities. I think this is largely due to technocrats trusting simple, static industrial-age cost models which support decisions dominated by returns from economies of scale. The basic logic of the decisions these models support (an equilibrium solution) is: "things will be cheaper with one supplier because the overhead will be amortized over bigger quantities."

The basic mistake these models make is neglecting the dynamics. Price is a dynamic thing. If the capability is very critical, and there is only one supplier, then there is almost no ceiling on how high the price can rise. The price level reached under those dynamics sits just below the point where you'd stop paying for the capability in favor of a more important one (I'll call this the buyer's "level of pain"). The alternative dynamic occurs when there are multiple competing suppliers. The price reached in that case asymptotes towards the economic cost (which accounts for barriers to entry and opportunity costs). Here's a simple graph illustrating price behavior under these two situations.

Figure: Price Dynamics
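If you want to play with the shape of those curves yourself, here's a minimal sketch of the kind of figure I have in mind; the pain level, economic cost, and relaxation rate are all made-up numbers chosen only to show which asymptote each dynamic approaches.

```python
# Hypothetical illustration of the two price dynamics described above:
# under a sole-source buy the price creeps up toward the buyer's "level
# of pain", under competition it relaxes toward the economic cost.
# All levels and rate constants are invented numbers for plotting only.
import numpy as np
import matplotlib.pyplot as plt

t = np.linspace(0.0, 10.0, 200)   # time in arbitrary units (e.g. procurement lots)
pain_level = 10.0                 # assumed buyer's level of pain
economic_cost = 2.0               # assumed competitive economic cost
p0 = 4.0                          # assumed starting price for both cases
k = 0.5                           # assumed relaxation rate

# first-order relaxation toward the relevant asymptote in each case
p_sole = pain_level + (p0 - pain_level) * np.exp(-k * t)
p_comp = economic_cost + (p0 - economic_cost) * np.exp(-k * t)

plt.plot(t, p_sole, label="sole source (asymptotes to level of pain)")
plt.plot(t, p_comp, label="competition (asymptotes to economic cost)")
plt.xlabel("time")
plt.ylabel("price")
plt.legend(loc="center right")
plt.title("Price Dynamics")
plt.show()
```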
This price increase doesn't happen because the contractors are evil. It happens because the contractors have a duty to their shareholders to maximize profit. This is, in fact, the ethical thing for them to do. If their customer is foolish and decides to create a little monopoly for them with every procurement, well, too bad for the customer's shareholders (taxpayers in this case)...

We see these dynamics play out in a variety of defense acquisitions. The F-35 engine program and the Evolved Expendable Launch Vehicle program are two exemplars that are currently making the news.

The F-35 engine procurement was initially structured to support two suppliers during development, much like the engine programs for the F-15 and F-16. There are operational advantages to having two engines. If a problem is found in one model, only half of the Air Force's tactical aircraft would have to be grounded while a fix is developed. The other advantage comes from the suppliers competing with each other on price for various lots of engines. A disadvantage is the overhead and development cost of carrying two suppliers, and possibly an increased logistics footprint to support two different engine models.

The recent news is that the budget does not include funds for the second engine. Not a week after this budget passed, making the engine buy a sole-source deal, we have the Undersecretary of Defense for Acquisition complaining about the price from the remaining supplier.

"I'm not happy, as I am with so many parts of all our programs, with (the P&W engine's) cost performance so far," Carter told the House of Representatives Appropriations subcommittee on defense on Wednesday. "We need to drive the costs down." [...] "Our analysis does not show the payback," Carter told the subcommittee. He added that "people of good will come to different conclusions on this issue." U.S. "not happy" with F-35 engine cost overruns
Did the analysis include the second supplier offering to assume the risks and go fixed-price on the development? Is complaining about the price, being really unhappy about paying it, but paying it anyway because there is no alternative, anything but empty political theater? It makes for great content in the trade rags: Acquisition official gives contractor a stern talking-to! Contractor hangs head in a suitably chastened way: "yes, our prices are very high for these unique capabilities, we are working hard to contain the costs for our customer." Much harrumphing is heard from various congresscritters, while the price continues to spiral higher...

In the case of the EELV, despite the fact that the Air Force paid for two parallel rocket development programs, we now have just a single supplier. The two launch service providers were so expensive that they could not compete in the commercial market. The business case for the two rockets hinged on their being able to make money in the commercial market and get their launch rates up. When no customers but the US government could afford their high prices, they had to combine into a single consortium: ULA. So it's sole-source with two vehicles. Recall the advantages and disadvantages of developing two products discussed above in the case of the F-35 engine. Now EELV has the worst of both worlds: high overhead and logistics costs to support two vehicles, and no competition or customer diversification to get flight rates up and bring prices down.

Launch-service providers agree that their viability, as well as their ability to keep costs down, is based on launch rhythm. The more often a vehicle launches, the more reliable it becomes. Scale economies are introduced as well in a virtuous cycle. One U.S. government official agreed that if SpaceX is now allowed to break ULA’s monopoly on U.S. government satellite launches as indicated by the memorandum of agreement, it could force ULA’s already high prices even higher as it eats into ULA’s current market. “In the longer term we may be faced with questions about whether one of them [ULA or SpaceX] can remain viable without direct subsidies — the same questions we faced with ULA,” this official said. “Then what do we do? We have a policy of assured access to space, which means at least two vehicles. The demand for launches has not increased since ULA was formed, so we could be heading toward a nearly identical situation in a few years. But we are spending taxpayers’ money and if we can find reliable launches that are less expensive, we are not going to ignore that.” SpaceX Receives Boost in Bid To Loft National Security Satellites
It is interesting to note the thinking of the unnamed government official. He doesn't recognize the dynamics of the situation. He's living in a sole-source mindset; it just happens that he's going to change to this new, lower-cost source.

NASA has a pricing model that shows savings from "outsourcing development", but not because of any interesting dynamics that they've included. The justification is the same as the one underlying the broken decisions about aircraft engines: "returns from economies of scale". The dynamics of competition remain ignored.

NASA Deputy Administrator Lori Garver, in a separate presentation here April 12, said the agency’s policy of pushing rocket-development work onto the private sector will only reach maximum benefit if other customers also purchase the vehicles developed initially with NASA funding. Referring specifically to SpaceX, Garver said a conventional NASA procurement of a Falcon 9-class rocket would cost nearly $4.5 billion according to a NASA-U.S. Air Force cost model that includes the vehicle’s first flight. Outsourcing development to SpaceX, she said, would cut that figure by 60 percent, but only if other customers purchase the vehicle, thus permitting scale economies to reach maximum effect. After Servicing Space Station SpaceXs Priority is Taking on EELV

The way out of the death spiral is program dependent. In the EELV case it took a new entrant who has signed commercial contracts in addition to chasing the government launches (from a couple of different agencies). SpaceX has built in significant customer diversification that ULA never developed (though this was hoped for in the early justifications of the program structure). Why did Lockheed Martin and Boeing, and subsequently ULA, never develop this customer diversification? Because they didn't have to. The government guaranteed their continued existence. SpaceX, on the other hand, has no such guarantee. The only option for its continued existence is to make a profit from more than one customer. In the F-35 engine case, DoD has decided that even the assumption of development risk by the second supplier under a fixed-price contract is not enough to close the case. This puzzles me. Maybe the second engine has become a symbol of "duplication and waste" rather than "competition and efficiency". If so, then DoD has given the primary engine supplier a way to "frame" their competition out of existence with a political argument (removing the need to earn market share honestly).

The outlook for lower cost space access looks good. However, as long as the DoD is stuck in its current frame, there is great opportunity for political theater that will serve mainly as a content generator for defense trade publications and a distraction from the root cause of steadily rising tactical aircraft engine costs into the future.

Sunday, February 20, 2011

Red Hawks Host Black Knights


I went to watch some collegiate and amateur bouts down in Oxford on Saturday. Miami University was hosting the cadets from West Point (last year's collegiate champs, nice write-up in NY Times). If you're used to the glammed-up barroom brawling of UFC or even the raw knock-out power of professional boxing, then the style and speed of amateur boxing might come as quite a surprise. I really like the amateur fights because they tend to pivot on conditioning, thinking and skillful execution rather than landing lucky or brutal head-shots.

The 14-bout evening started out with a couple of tough young ladies, one from Cincinnati, one from Oxford, going three rounds. It's hard to do match-ups for women because there are just fewer boxers in an already small pool of athletes (since boxing is no longer an NCAA sport). The winner of this bout threw very disciplined, quick, straight punches which her clearly less experienced opponent was ill-equipped to catch or counter. This fight was followed by a few match-ups with local fighters out of Cincinnati, OSU and Miami University. The early fights consisted of lots of off-balance brawling, the result of "first fight" jitters and inexperience for many of these young athletes.

The cadets from West Point fought out of the blue corner for the remainder of the evening against a line-up consisting mainly of Miami University fighters, with the occasional fighter from OSU or Xavier thrown into the mix. From the first cadet to fight on up to the "main event", it was clear why these gentlemen have won three championships in a row. A string of cadets won judges' decisions handily over their opponents, demonstrating solid fundamentals and coolness under the frequent early, but generally dissipative, aggressiveness of their foes.

Then, in the first bout at 132 lbs (the second being the "main event" of the night), Lang Clarke of Army landed a solid combination to the head, followed up with a deliberate two-two that sent the man from Xavier to the mat (the referee was in the midst of "stop" as the second right landed). It was the first and only knock-out of the evening. The Xavier athlete was back on his feet (to the relieved cheers of the crowd) after a quick nap and a check from the ring-side doc.

The only heavy-weight bout of the night was stopped by the referee near the end of the first round. The fighter from West Point landed repeated hooks to the head which the young man from Oxford was not defending.

The fight at 195 lbs was relatively surprising, not in outcome (the cadet won), but in tactics. Previously the cadets had employed a shorter and simpler version of the rope-a-dope tactic when their less disciplined and less well-conditioned opponents came out swinging. Rather than stand and brawl toe-to-toe, they defended and let the other fighter tire, then exploited that self-inflicted weakness with steady "work" for the rest of the round. This cadet stood up inside his opponent's early windmill and landed straight punches and uppercuts to the head. One or two furious windmillings punctuated by deliberately thrown and well-landed opposing hits were all it took for the windmill's blades to drop and the hub to wobble on its axis. Referee stops contest.

The crowd was ready for their main event. A nice cheer went up for the wiry 132-pounder from Oxford as he stepped in the ring. Yours truly was the only one to cheer when the young man from West Point entered the ring (much to my wife's embarrassment). After she heard the raucous cheer when they actually introduced the Oxford fighter, she said, "OK, you can go ahead and yell for that West Point guy," so I did.

Both fighters were about equally conditioned, which made for a much more exciting fight. They were both able to work (with varying levels of effectiveness) for the majority of each round. The fighter from Oxford threw a great volume of widely-arcing punches, most of which seemed ineffective to me due to the cadet's competent defense. After the repeated, straight head-shots landed by the cadet in the third round, I thought the decision would go his way (if the fight was not stopped sooner, which had been the outcome of this sort of pounding previously). The judges had a different perspective on the bout, so the decision went to the man from Oxford, who, much to his credit, was able to recover repeatedly from the wobbliness induced by these direct blows and swing away until the bell relieved him.

The coaches, medical staff and officiating crew at Miami University should be congratulated for putting on such a professional event, one that took good care of these young athletes and allowed them to further develop their skills.

Thought this was funny; why you don't want guys from the same gym on the card:

Sparring partners are endowed with habitual consideration and forbearance, and they find it hard to change character. A kind of guild fellowship holds them together, and they pepper each other's elbows with merry abandon, grunting with pleasure like hippopotamuses in a beer vat.
The Sweet Science

Saturday, February 19, 2011

Historical Hydraulics

Venturi's drawings of eddies look really modern too, kind of neat:
[h/t Dan Hughes]

Tuesday, February 15, 2011

Comments on Spatio-Temporal Chaos

Some comments from a guest post on Dr Curry's site. I think she has a couple of dueling chat bots who've taken up residence in her comments (see if you can guess who they are). This provides a bit more motivation for getting to the forced system results we started talking about earlier. The paper and discussion that Arthur Smith links are well worth a read (even though it isn't actually responsive ; - ).

Tomas – you claimed to focus on my comment, but *completely ignored* the central element, which you even quoted:
“small random variations in solar input (not to mention butterflies)” [as what makes weather random over the long term]
Chaos as you have discussed it requires fixed control parameters (absolutely constant solar input) and no external sources of variation not accounted for in the equations (no butterflies). You gave zero attention in your supposed response to my comment to this central issue. Others here have been accused of being non-responsive, but I have to say that is pretty non-responsive on your part.
The fact is as soon as there is any external perturbation of a chaotic system not accounted for in the dynamical equations, you have bumped the system from one path in phase space to another. Earth’s climate is continually getting bumped by external perturbations small and large. The effect of these is to move the actual observed trajectory of the system randomly – yes randomly – among the different possible states available for given energy/control parameters etc.
The randomness comes not from the chaos, but from external perturbation. Chaos amplifies the randomness so that at a time sufficiently far in the future after even the smallest perturbation, the actual state of the system is randomly sampled from those available. That random sampling means it has real statistics. The “states available” are constrained by boundaries – solar input, surface topography, etc. which makes the climate problem – the problem of the statistics of weather – a boundary value problem (BVP). There are many techniques for studying BVP’s – one of which is simply to randomly sample the states using as physical a model as possible to get the right statistics. That’s what most climate models do. That doesn’t mean it’s not a BVP.

This isn’t anything new – almost every physical dynamical system, if it’s not trivially simple, displays chaos under most conditions. Statistical mechanics, one of the most successful of all physical theories, relies fundamentally on the reliability of a statistical description of what is actually deterministic (and chaotic – way-more-than-3-body) dynamics of immense numbers of atoms and molecules. This goes back to Gibbs over a century ago, and Poincare’s work was directly related.
Tomas’ comments about the 3-body system being not even “predictable statistically (e.g you can not put a probability on the event “Mars will be ejected from the solar system in N years”” is true in the strict sense of the exact mathematics assuming no external perturbations. That’s simply because for a deterministic system something will either happen or it won’t, there’s no issue of probability about it at all. But as soon as you add any sort of noise, your perfect chaotic system becomes a mere stochastic one over long time periods, and probabilities really do apply.
A nice review of the relationships between chaos, probability and statistics is this article from 1992:
“Statistics, Probability and Chaos” by L. Mark Berliner, Statist. Sci. Volume 7, Number 1 (1992), 69-90.
http://projecteuclid.org/DPubS?service=UI&version=1.0&verb=Display&handle=euclid.ss/1177011444
and see some of the discussion that followed in that journal (comments linked on that Project Euclid page).

jstults
Arthur Smith, while that is a very good paper that you linked (thank you for finding one that everyone can access), it only had a very short section on ergodic theory, and you’re back to the same hand-waving analogy about statistical mechanics and turbulent flows. The [lack of] success for simple models (based on analogy to kinetic theory btw) for turbulent flows of any significant complexity indicates to me that I can’t take your analogy very seriously.
Where’s the meat? Where’s the results for the problems we care about? I can calculate results for logistic maps and Lorenz ’63 on my laptop (and the attractor for that particular toy exists).
A more well-phrased attempt to explain why hand-waving about statistical mechanics is a diversion from the questions of significance for this problem (with apologies to Ruelle): what are the measures describing climate?
If one is optimistic, one may hope that the asymptotic measures will play for dissipative systems the sort of role which the Gibbs ensembles have played for statistical mechanics. Even if that is the case, the difficulties encountered in statistical mechanics in going from Gibbs ensembles to a theory of phase transitions may serve as a warning that we are, for dissipative systems, not yet close to a real theory of turbulence.
What Are the Measures Describing Turbulence?
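As an aside, the laptop-scale calculation mentioned in my comment above really is trivial; here's a minimal sketch that sweeps the logistic map's control parameter to produce the familiar bifurcation diagram (the parameter range, transient length, and number of retained iterates are arbitrary choices).

```python
# Iterate the logistic map x <- r*x*(1-x) for a sweep of r values, discard
# transients, then plot the retained iterates to approximate the attractor.
import numpy as np
import matplotlib.pyplot as plt

r_values = np.linspace(2.5, 4.0, 1000)
x = 0.5 * np.ones_like(r_values)      # same initial condition for every r

# burn off transients so only the long-run behavior is plotted
for _ in range(500):
    x = r_values * x * (1.0 - x)

rs, xs = [], []
for _ in range(200):
    x = r_values * x * (1.0 - x)
    rs.append(r_values.copy())
    xs.append(x.copy())

plt.plot(np.concatenate(rs), np.concatenate(xs), ",k", alpha=0.25)
plt.xlabel("r")
plt.ylabel("x")
plt.title("Logistic map bifurcation diagram")
plt.show()
```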

Friday, February 4, 2011

Validation and Calibration: more flowcharts

In a previous post we developed a flow-chart for model verification and validation (V&V) activities. One thing I noted in the update on that post was that calibration activities were absent. My Google alerts just turned up a new paper (it references the Thacker et al. paper the previous post was based on; I think you’ll notice the resemblance of the flow-charts) which adds the calibration activity in much the way we discussed.


Figure 1: Model Calibration Flow Chart of Youn et al. [1]

The distinction between calibration and validation is clearly highlighted, “In many engineering problems, especially if unknown model variables exist in a computational model, model improvement is a necessary step during the validation process to bring the model into better agreement with experimental data. We can improve the model using two strategies: Strategy 1 updates the model through calibration and Strategy 2 refines the model to change the model form.”


Figure 2: Flow chart from previous post

The well-founded criticism of calibration-based arguments for simulation credibility is that calibration provides no indication of the predictive capability of a model so tuned. The statistician might use the term generalization risk to talk about the same idea. There is no magic here. Applying techniques such as cross-validation merely adds a (hyper)parameter to the model (this becomes readily apparent in a Bayesian framework). Such techniques, while certainly useful, are no silver bullet against over-confidence. This is a fundamental truth that will not change with improving technique or technology, because all probability statements are conditional on (among other things) the choice of model space (particular choices of which must by necessity be finite, though the space of all possible models is countably infinite).
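To make that cross-validation point concrete, here's a minimal sketch on assumed data (a noisy sine curve): k-fold cross-validation picks a polynomial degree, but that degree is just one more discrete parameter selected from a pre-specified model space, and the procedure says nothing about models outside that space or about data unlike the training sample.

```python
# Cross-validation as hyperparameter selection: the "best" polynomial degree
# is itself a model-space choice. Data-generating function, noise level, and
# fold count are arbitrary choices for illustration.
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2.0 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

def cv_error(degree, k=5):
    """Mean squared prediction error of a degree-`degree` polynomial under k-fold CV."""
    idx = np.arange(x.size)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coeffs = np.polyfit(x[train], y[train], degree)
        resid = y[fold] - np.polyval(coeffs, x[fold])
        errs.append(np.mean(resid**2))
    return np.mean(errs)

scores = {d: cv_error(d) for d in range(1, 10)}
best = min(scores, key=scores.get)
print("CV-selected degree (an extra discrete hyperparameter):", best)
```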
One of the other interesting things in that paper is their argument for a hierarchical framework for model calibration / validation. A long time ago, in a previous life, I made a similar argument [2]. Looking back on that article is a little embarrassing. I wrote that before I had read Jaynes (or much else of the Bayesian analysis and design of experiments literature), so it seems very technically naive to me now. The basic heuristics for product development discussed in it are sound though. They’re based mostly on GAO reports [3, 4, 5, 6], a report by NAS [7], lessons learned from Live Fire Test and Evaluation [8] and personal experience in flight test. Now I understand better why some of those heuristics have sound theoretical underpinnings.
There are really two hierarchies though. There is the physical hierarchy of system, sub-system and component that Youn et al. emphasize, but there is also a modeling hierarchy. This modeling hierarchy is delineated by the level of aggregation, or the amount of reductiveness, in the model. All models are reductive (that’s the whole point of modeling: massage the inordinately complex and ill-posed into tractability); some are just more reductive than others.


Figure 3: Modeling Hierarchy (from [2])

Figure 3 illustrates why I care about Bayesian inference. It’s really the only way to coherently combine information from the bottom of the pyramid (computational physics simulations) with information higher up the pyramid, which relies on component and subsystem testing.
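A minimal sketch of what I mean by combining information up the pyramid, using an assumed conjugate normal model (all of the numbers are invented for illustration): component-level results set the prior on a performance parameter, and a handful of expensive system-level tests update it.

```python
# Conjugate normal update: component-level work provides the prior, sparse
# system-level measurements provide the likelihood. Numbers are made up.
import numpy as np

# prior from component-level simulations/tests: mean and variance of the parameter
mu_0, var_0 = 10.0, 4.0

# a handful of system-level measurements with (assumed known) measurement variance
y = np.array([12.1, 11.4, 12.8])
var_y = 1.0

# standard conjugate normal update: precisions add, means are precision-weighted
n = y.size
var_post = 1.0 / (1.0 / var_0 + n / var_y)
mu_post = var_post * (mu_0 / var_0 + y.sum() / var_y)

print(f"posterior mean = {mu_post:.2f}, posterior std = {np.sqrt(var_post):.2f}")
```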
A few things I don’t like about the approach in [1]:
  • The partitioning of parameters into “known” and “unknown” based on what level of the hierarchy (component, subsystem, system) you are at in the “bottom-up” calibration process. Our (properly formulated) models should tell us how much information different types of test data give us about the different parameters. Parameters should always be described by a distribution rather than discrete switches like known or unknown.
  • The approach is based entirely on the likelihood (but they do mention something that sounds like expert priors in passing).
  • They claim that the proposed calibration method enhances “predictive capability” (section 3); however, this is a misleading abuse of terminology. Certainly the in-sample performance is improved by calibration, but the whole point of making a distinction between calibration and validation is recognizing that this says little about the out-of-sample performance (in fairness, they do equivocate a bit on this point, “The authors acknowledge that it is difficult to assure the predictive capability of an improved model without the assumption that the randomness in the true response primarily comes from the randomness in random model variables.”).
Otherwise, I find this a valuable paper that strikes a pragmatic chord, and that’s why I wanted to share my thoughts on it.
[Update: This thesis that I linked at Climate Etc. has a flow-chart too.]

References

[1]   Youn, B. D., Jung, B. C., Xi, Z., Kim, S. B., and Lee, W., “A hierarchical framework for statistical model calibration in engineering product development,” Computer Methods in Applied Mechanics and Engineering, Vol. 200, No. 13-16, 2011, pp. 1421 – 1431.
[2]   Stults, J. A., “Best Practices for Developmental Testing of Modern, Complex Munitions,” ITEA Journal, Vol. 29, No. 1, March 2008, pp. 67–74.
[3]   “Defense Acquisitions: Assessment of Major Weapon Programs,” Tech. Rep. GAO-03-476, U.S. General Accounting Office, May 2003.
[4]   “Best Practices: Better Support of Weapon System Program Managers Needed to Improve Outcomes,” Tech. Rep. GAO-06-110, U.S. General Accounting Office, 2006.
[5]   “Precision-Guided Munitions: Acquisition Plans for the Joint Air-to-Surface Standoff Missile,” Tech. Rep. GAO/NSIAD-96-144, U.S. General Accounting Office, 1996.
[6]   “Best Practices: A More Constructive Test Approach is Key to Better Weapon System Outcomes,” Tech. Rep. GAO/NSIAD-00-199, U.S. General Accounting Office, July 2000.
[7]   Cohen, M. L., Rolph, J. E., and Steffey, D. L., editors, Statistics, Testing and Defense Acquisition: New Approaches and Methodological Improvements, National Academy Press, Washington D.C., 1998.
[8]   O’Bryon, J. F., editor, Lessons Learned from Live Fire Testing: Insights Into Designing, Testing, and Operating U.S. Air, Land, and Sea Combat Systems for Improved Survivability and Lethality, Office of the Director, Operational Test and Evaluation, Live Fire Test and Evaluation, Office of the Secretary of Defense, January 2007.

Sunday, January 23, 2011

Better Recurrence Plots

In the previous post in the Lorenz63 series we used recurrence plots to get a qualitative feel for the type of behavior exhibited by a time series (stochastic, periodic, chaotic). Those were made using the default colormap in matplotlib, and they seem to highlight the “holes” more than the “near returns” (at least to my eye). Here are some improved ones that use the bone colormap and a threshold on the distance to better highlight the near returns.
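Here's a minimal sketch of how such a plot can be generated; it is not the exact script used for the figures below, and the integration time, threshold fraction, and use of only the x-component are my own arbitrary choices.

```python
# Thresholded recurrence plot of a Lorenz 1963 trajectory: distances below a
# threshold are shown as-is, everything else is saturated at the threshold so
# the near returns stand out with the bone colormap.
import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint

def lorenz63(state, t, sigma=10.0, rho=28.0, beta=8.0/3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t = np.linspace(0.0, 25.0, 2500)
x = odeint(lorenz63, [1.0, 1.0, 1.0], t)[:, 0]   # use the x-component as the series

# pairwise distances between points in the series, saturated above the threshold
D = np.abs(x[:, None] - x[None, :])
threshold = 0.1 * D.max()
R = np.where(D < threshold, D, threshold)

plt.imshow(R, cmap="bone", origin="lower", extent=[t[0], t[-1], t[0], t[-1]])
plt.xlabel("time")
plt.ylabel("time")
plt.title("Thresholded recurrence plot")
plt.show()
```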


Figure 1: Response of Lorenz 1963 model. (a) Single Trajectory; (b) Ensemble Mean.



Figure 2: Non-chaotic Series. (a) Periodic Series; (b) Stochastic Series.



Figure 3: Smoothing of a Lorenz 1963 trajectory. (a) A little smoothing; (b) More smoothing; (c) Even more smoothing.


Saturday, January 22, 2011

Recurrence, Averaging and Predictability

Motivation and Background

Yet another installment in the Lorenz63 series, this time motivated by a commenter on Climate Etc. Tomas Milanovic claims that time averages are chaotic too, in response to the oft-repeated claim that the predictability limitations of nonlinear dynamical systems are not a problem in the case of climate prediction. Lorenz would seem to agree: “most climatic elements, and certainly climatic means, are not predictable in the first sense at infinite range, since a non-periodic series cannot be made periodic through averaging [1].” We’re not going to just take his word on it. We’ll see if we can demonstrate this with our toy model.

That’s the motivation, but before we get to toy model results a little background discussion is in order. In this previous entry I illustrated the different types of functionals that you might be interested in depending on whether you are doing weather prediction or climate prediction. I also made the remark, “A climate prediction is trying to provide a predictive distribution of a time-averaged atmospheric state which is (hopefully) independent of time far enough into the future.” It was pointed out to me that this is a testable hypothesis [2], and that the empirical evidence doesn’t seem to support the existence of time-averages (or other functionals) describing the Earth’s climate system that are independent of time [3]. In fact, the above assumption was critiqued by none other than Lorenz in 1968 [4]. In that paper he states,

Questions concerning the existence and uniqueness of long-term statistics fall into the realm of ergodic theory. [...] In the case of nonlinear equations, the uniqueness of long-term statistics is not assured. From the way in which the problem is formulated, the system of equations, expressed in deterministic form, together with a specified set of initial conditions, determines a time-dependent solution extending indefinitely into the future, and therefore determines a set of long-term statistics. The question remains as to whether such statistics are independent of the choice of initial conditions.

He goes on to define a system as transitive if the long-term statistics are independent of initial condition, and intransitive if there are “two or more sets of long-term statistics, each of which has a greater-than-zero probability of resulting from randomly chosen initial conditions.” Since the concept of climate change has no meaning for statistics over infinitely long intervals, he then defines a system as almost intransitive if the statistics at infinity are unique, but the statistics over finite intervals depend (perhaps even sensitively) on initial conditions. In the context of policy relevance we are generally interested in behavior over finite time-intervals.

In fact, from what I’ve been able to find, different large-scale spatial averages (or coherent structures, which you could track by suitable projections or filtering) of state for the climate system face similar limits to predictability as un-averaged states. The predictability just decays at a slower rate. So instead of predictive limitations for weather-like functionals on the order of a few weeks, the more climate-like functionals become unpredictable on slower time-scales. There’s no magic here, things don’t suddenly become predictable a couple decades or a century hence because you take an average. It’s just that averaging or filtering may change the rate that errors for that functional grow (because in spatio-temporal chaos different structures, or state vectors, will have different error growth rates and reach saturation at different times). Again Lorenz puts it well, “the theory which assures us of ultimate decay of atmospheric predictability says nothing about the rate of decay” [1]. Recent work shows that initialization matters for decadal prediction, and that the predictability of various functionals decay at different rates [5]. For instance, sea surface temperature anomalies are predictable at longer forecast horizons than surface temperatures over land. Hind-casts of large spatial averages on decadal time-scales have shown skill in the last two decades of the past century (though they had trouble beating a persistence forecast for much of the rest of the century) [6].

I’ve noticed in on-line discussions about climate science that some people think that the problem of establishing long term statistics for nonlinear systems is a solved one. That is not the case for the complex, nonlinear systems we are generally most interested in (there are results for our toy though [7, 8]). I think this snippet sums things up well,

Atmospheric and oceanic forcings are strongest at global equilibrium scales of 10⁷ m and seasons to millennia. Fluid mixing and dissipation occur at microscales of 10⁻³ m and 10⁻³ s, and cloud particulate transformations happen at 10⁻⁶ m or smaller. Observed intrinsic variability is spectrally broad band across all intermediate scales. A full representation for all dynamical degrees of freedom in different quantities and scales is uncomputable even with optimistically foreseeable computer technology. No fundamentally reliable reduction of the size of the AOS [atmospheric oceanic simulation] dynamical system (i.e., a statistical mechanics analogous to the transition between molecular kinetics and fluid dynamics) is yet envisioned. [9]

Here McWilliams is making a point similar to that made by Lorenz in [4] about establishing a statistical mechanics for climate. This would be great if it happened, because that would mean that the problem of turbulence would be solved for us engineers too. Right now the best we have (engineers interested in turbulent flows and climate scientists too) is empirically adequate models that are calibrated to work well in specific corners of reality.

Lorenz was responsible for another useful concept concerning predictability: predictability of the first and second kind [1]. If you care about the time-accurate evolution of the order of states then you are interested in predictability of the first kind. If, however, you do not care about the order, but only the statistics, then you are concerned with predictability of the second kind. Unfortunately, Lorenz’s concepts of first and second kind predictability have been morphed into a claim that first kind predictability is about solving an initial value problem (IVP) and second kind predictability is about solving a boundary value problem (BVP). For example, “Predictability of the second kind focuses on the boundary value problem: how predictable changes in the boundary conditions that affect climate can provide predictive power [5].” This is unsound. If you read Lorenz closely, you’ll see that the important open question he was exploring, whether the climate is transitive, intransitive or almost intransitive, has been assumed away by the spurious association of kinds of predictability with kinds of problems [1]. Lorenz never made this mistake; he was always clear that the difference in kinds of predictability depends on the functionals you are interested in, not on whether it is appropriate to solve an IVP or a BVP (what reason could you have for expecting meaningful frequency statistics from a solution to a BVP?). Those considerations depend on the sort of system you have. In an intransitive or almost intransitive system even climate-like functionals depend on the initial conditions.

A good early paper on applying information theory concepts to climate predictability is by Leung and North [10], and there is a more recent review article that covers the basic concepts by DelSole and Tippett [11].

Recurrence Plots

Recurrence plots are useful for getting a quick qualitative feel for the type of response exhibited by a time-series [12, 13]. First we run a little initial condition (IC) ensemble with our toy model. The computer experiment we’ll run to explore this question consists of perturbations to the initial conditions (I chose the size of the perturbation so the ensemble would blow up around t = 12). Rather than sampling from a distribution for the members of the ensemble, I chose them according to a stochastic collocation rule (this helps in getting the same results every time too).
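For the curious, here's a minimal sketch of this sort of collocation-based initial condition ensemble (the perturbation size, number of nodes, and choice to perturb only the x-component are assumptions for illustration, not necessarily what produced the figures below).

```python
# Perturb one component of the Lorenz 1963 initial state at Gauss-Hermite
# collocation points instead of random samples, so the run is repeatable.
import numpy as np
from scipy.integrate import odeint

def lorenz63(state, t, sigma=10.0, rho=28.0, beta=8.0/3.0):
    x, y, z = state
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

nodes, weights = np.polynomial.hermite_e.hermegauss(9)  # probabilists' Hermite nodes
eps = 1.0e-4                                            # assumed perturbation scale
t = np.linspace(0.0, 25.0, 2500)

# integrate each ensemble member and keep the x-component
ensemble = np.array([odeint(lorenz63, [1.0 + eps * n, 1.0, 1.0], t)[:, 0]
                     for n in nodes])

# weighted ensemble mean (normalize since the hermegauss weights don't sum to one)
mean = (weights[:, None] * ensemble).sum(axis=0) / weights.sum()
print(ensemble.shape, mean.shape)   # (9, 2500) (2500,)
```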


Figure 1: Initial Condition Ensemble. (a) Ensemble Trajectories; (b) Ensemble Mean.


One thing that these two plots make clear is that it doesn’t make much sense to compare individual trajectories with the ensemble mean. The mean is a parameter of a distribution describing a population of which the trajectories are members. While the trajectories are all orbits on the attractor, the mean is not.


Figure 2: Chaotic Recurrence Plots. (a) Single Trajectory; (b) Ensemble Mean.


Comparing the chaotic recurrence plots with the plots below of a periodic series and a stochastic series illustrates the qualitative differences in appearance.


Figure 3: Non-chaotic Recurrence Plots. (a) Periodic Series; (b) Stochastic Series.


Clearly, both the ensemble mean and the individual trajectory are chaotic series, sort of “between” periodic and stochastic in their appearance. Ensemble averaging doesn’t make our chaotic series non-chaotic; what about time averaging?

Predictability Decay

How does averaging affect the decay of predictability for the state of the Lorenz63 system, and can we measure this effect? We can track how the predictability of the future state decays given knowledge of the initial state by using the relative entropy. There are other choices of measure, such as mutual information [10]. Since we’ve already got our ensemble though, we can just use entropy like we did before. Rather than just a simple moving average, I’ll be calculating an exponentially weighted one using an FFT-based approach, of course (there are some edge effects we’d need to worry about if this were a serious analysis, but we’ll ignore that for now). The entropy for the ensemble is shown for three different smoothing levels in Figure 4 (the high entropy prior to t = 5 for the smoothed series is spurious because I didn’t pad the series and it’s calculated with the FFT).
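Here's a minimal sketch of the two ingredients: an exponentially weighted moving average applied by FFT-based convolution (unpadded, so it shows the same sort of edge effect mentioned above), and a histogram-based entropy estimate across the ensemble at each time. The toy random-walk ensemble, bin count, and decay rates are stand-in assumptions, not the actual Lorenz63 ensemble behind Figure 4.

```python
# Exponentially weighted smoothing via FFT convolution, plus an entropy
# estimate of the ensemble spread at each time.
import numpy as np
from scipy.signal import fftconvolve
from scipy.stats import entropy

def exp_smooth(series, lam, dt):
    """Exponentially weighted moving average via FFT convolution (no padding,
    so the early-time edge effect mentioned in the text shows up here too)."""
    n = series.size
    kernel = np.exp(-lam * dt * np.arange(n))
    kernel /= kernel.sum()
    return fftconvolve(series, kernel, mode="full")[:n]

def ensemble_entropy(ensemble, bins=16):
    """Histogram estimate of the ensemble's Shannon entropy at each time."""
    ent = np.empty(ensemble.shape[1])
    for i in range(ensemble.shape[1]):
        counts, _ = np.histogram(ensemble[:, i], bins=bins)
        ent[i] = entropy(counts)
    return ent

# toy stand-in for the Lorenz63 IC ensemble (members x time); in practice the
# ensemble array from the earlier sketch would be used here
rng = np.random.default_rng(0)
dt = 0.01
toy = np.cumsum(rng.standard_normal((9, 2500)), axis=1) * np.sqrt(dt)

for lam in (0.5, 2.0, 8.0):   # smaller lam means a longer memory, i.e. more smoothing
    smoothed = np.array([exp_smooth(m, lam, dt) for m in toy])
    print(lam, ensemble_entropy(smoothed).mean())
```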


Figure 4: Entropy of Exponentially Weighted Smoothed Series


While smoothing does lower the entropy of the ensemble (lower entropy for more smoothing / smaller λ), it still experiences the same sort of “blow-up” as the unsmoothed trajectory. This indicates problems for predictability even for our time-averaged functionals. Guess what? The recurrence plot indicates that our smoothed trajectory is still chaotic!


Figure 5: Smoothed Trajectory Recurrence Plot


This result shouldn't be too surprising: moving averages or smoothing (of whatever type you fancy) are linear operations. It would probably take a pretty clever nonlinear transformation to turn a chaotic series into a non-chaotic one (think about how the series in this case is generated in the first place). I wouldn't expect any combination of linear transformations to accomplish that.

Conclusions

I’ll begin the end with another great point from McWilliams (though I’ve not heard of sub-grid fluctuations referred to as “computational noise”; that term makes me think of round-off error) that should serve to temper our demands of predictive capability from climate models [9]:

Among their other roles, parametrizations regularize the solutions on the grid scale by limiting fine-scale variance (also known as computational noise). This practice makes the choices of discrete algorithms quite influential on the results, and it removes the simulation from the mathematically preferable realm of asymptotic convergence with resolution, in which the results are independent of resolution and all well conceived algorithms yield the same answer.

If I had read this earlier, I wouldn’t have spent so much time searching for something that doesn’t exist.

Regardless of my tortured learning process, what do the toy models tell us? Our ability to predict the future is fundamentally limited. Not really an earth-shattering discovery; it seems a whole lot like common sense. Does this have any implication for how we make decisions? I think it does. Our choices should be robust with respect to these inescapable limitations. In engineering we look for broad optimums that are insensitive to design or requirements uncertainties. The same sort of design thinking applies to strategic decision making or policy design. The fundamental truism for us to remember in trying to make good decisions under the uncertainty caused by practical and theoretical constraints is that limits on predictability do not imply impotence.

References

[1]   Lorenz, E. N., The Physical Basis of Climate and Climate Modeling, Vol. 16 of GARP publication series, chap. Climatic Predictability, World Meteorological Organization, 1975, pp. 132–136.

[2]   Pielke Sr, R. A., “your query,” September 2010, electronic mail to the author.

[3]   Rial, J. A., Pielke Sr., R. A., Beniston, M., Claussen, M., Canadell, J., Cox, P., Held, H., de Noblet-Ducoudré, N., Prinn, R., Reynolds, J. F., and Salas, J. D., “Nonlinearities, Feedbacks And Critical Thresholds Within The Earth's Climate System,” Climatic Change, Vol. 65, No. 1-2, 2004, pp. 11–38.

[4]   Lorenz, E. N., “Climatic Determinism,” Meteorological Monographs, Vol. 8, No. 30, 1968.

[5]   Collins, M. and Allen, M. R., “Assessing The Relative Roles Of Initial And Boundary Conditions In Interannual To Decadal Climate Predictability,” Journal of Climate, Vol. 15, No. 21, 2002, pp. 3104–3109.

[6]   Lee, T. C., Zwiers, F. W., Zhang, X., and Tsao, M., “Evidence of Decadal Climate Prediction Skill Resulting from Changes in Anthropogenic Forcing,” Journal of Climate, Vol. 19, 2006.

[7]   Tucker, W., The Lorenz Attractor Exists, Ph.D. thesis, Uppsala University, 1998.

[8]   Kehlet, B. and Logg, A., “Long-Time Computability of the Lorenz System,” http://lorenzsystem.net/.

[9]   McWilliams, J. C., “Irreducible Imprecision In Atmospheric And Oceanic Simulations,” Proceedings of the National Academy of Sciences, Vol. 104, No. 21, 2007, pp. 8709–8713.

[10]   Leung, L.-Y. and North, G. R., “Information Theory and Climate Prediction,” Journal of Climate, Vol. 3, 1990, pp. 5–14.

[11]   DelSole, T. and Tippett, M. K., “Predictability: Recent insights from information theory,” Reviews of Geophysics, Vol. 45, 2007.

[12]   Eckmann, J.-P., Kamphorst, S. O., and Ruelle, D., “Recurrence Plots of Dynamical Systems,” EPL (Europhysics Letters), Vol. 4, No. 9, 1987, pp. 973.

[13]   Marwan, N., “A historical review of recurrence plots,” The European Physical Journal – Special Topics, Vol. 164, 2008, pp. 3–12, doi:10.1140/epjst/e2008-00829-1.

Wednesday, January 19, 2011

Empiricism and Simulation

There are two orthogonal ideas that seem to get conflated in discussions about climate modeling. One is the idea that you’re not doing science if you can’t do a controlled experiment, but of course we have observational sciences like astronomy. The other is that all this new-fangled computer-based simulation is untrustworthy, usually because “it ain’t the way my grandaddy did science.” Both are rather silly ideas. We can still weigh the evidence for competing models based on observation, and we can still find protection from fooling ourselves even when those models are complex.

What does it mean to be an experimental as opposed to an observational science? Do sensitivity studies and observational diagnostics using sophisticated simulations count as experiments? Easterbrook claims that because climate scientists do these two things with their models, climate science is an experimental science [1]. It seems like there is a motivation to claim the mantle of experimental, because it may carry more rhetorical credibility than the merely observational (the critic Easterbrook is addressing certainly thinks so). This is probably because the statements we can make about causality and the strength of the inferences we can draw are usually greater when we can run controlled experiments than when we are stuck with whatever natural experiments fortune provisions for us (and there are sound mathematical reasons for this, having to do with optimality in experimental design rather than any label we may place on the source of the data). This seeming motivation demonstrated by Easterbrook to embrace the label of empirical is in sharp contrast to the denigration of the empirical by Tobis in his three-part series [2, 3, 4]. As I noted on his site, the narrative Tobis is trying to create with those posts has already been pre-messed with by Easterbrook; his readers just pointed out the obvious weaknesses too. One good thing about blogging is the critical and timely feedback.

The confusions of these two climate warriors are an interesting point of departure. I think they are both saying more than blah blah blah, so it’s worth trying to clarify this issue. The figure below is based on a technical report from Los Alamos [5], which is a good overview and description of the concepts and definitions for model verification and validation as it has developed in the computational physics community over the past decade or so. I think this emerging body of work on model V&V places the relative parts, experiment and simulation, in a sound framework for decision making and reasoning about what models mean.


Figure 1: Verification and Validation Process (based largely on [5])


The process starts at the top of the flowchart with a “Reality of Interest”, from which a conceptual model is developed. At this point the path splits into two main branches: one based on “Physical Modeling” and the other based on “Mathematical Modeling”. Something I don’t think many people realize is that there is a significant tradition of modeling in science that isn’t based on equations. It is no coincidence that an aeronautical engineer might talk of testing ideas with a wind-tunnel model or a CFD model. Both models are simplifications of the reality of interest, which, for that engineer, is usually a full-scale vehicle in free flight.

Figure 2 is just a look at the V&V process through my Design of Experiments (DoE) colored glasses.


Figure 2: Distorted Verification and Validation Process


My distorted view of the V&V process is shown to emphasize that there’s plenty of room for experimentalists to have fun (maybe even a job [3]) in this, admittedly model-centric, sandbox. However, the transferability of the basic experimental design skills between “Validation Experiments” and “Computational Experiments” says nothing about what category of science one is practicing. The method of developing models may very well be empirical (and I think Professor Easterbrook and I would agree it is, and maybe even should be), but that changes nothing about the source of the data which is used for “Model Validation.”

The computational experiments highlighted in Figure 2 are for correctness checking, but those aren’t the sorts of computational experiments Easterbrook claimed made climate science an experimental science. Where do sensitivity studies and model-based diagnostics fit on the flowchart? I think sensitivity studies fit well in the activity called “Pre-test Calculations”, which, one would hope, inform the design of experimental campaigns. Diagnostics are more complicated.

Heald and Wharton have a good explanation for the use of the term “diagnostic” in their book on microwave-based plasma diagnostics: “The term ‘diagnostics,’ of course, comes from the medical profession. The word was first borrowed by scientists engaged in testing nuclear explosions about 15 years ago [c. 1950] to describe measurements in which they deduced the progress of various physical processes from the observable external symptoms” [6]. With a diagnostic we are using the model to help us generate our “Experimental Data”, so that would happen within the activity of “Experimentation” on this flowchart. This use of models as diagnostic tools is applied to data obtained from either experiment (e.g. laboratory plasma diagnostics) or observations (e.g. astronomy, climate science), so it says nothing about whether a particular science is observational or experimental. Classifying scientific activities as experimental or observational is of passing interest, but I think far too much emphasis is placed on this question for the purpose of winning rhetorical “points.”

The more interesting issue from a V&V perspective is introducing a new connection in the flowchart that shows how a dependency between model and experimental data could exist (Figure 3). Most of the time the diagnostic model, and the model being validated are different. However, this case where they are the same is an interesting and practically relevant one that is not addressed in the current V&V literature that I know of (please share links if you “know of”).


Figure 3: V&V process including model-based diagnostic


It should be noted that even though the same model may be used to make predictions and perform diagnostics, it will usually be run in a different way for those two uses. The significant changes between Figure 1 and Figure 3 are the addition of an “Experimental Diagnostic” box and the change to the mathematical cartoon in the “Validation Experiment” box. The change to the cartoon is to indicate that we can’t measure what we want directly (u), so we have to use a diagnostic model to estimate it based on the things we can measure (b). An example of when the model-based diagnostic is relatively independent of the model being validated might be using a laser-based diagnostic for fluid flow: the equations describing propagation of the laser through the fluid are not the same as those describing the flow. An example of when the two codes might be connected would be if you were trying to use ultrasound to diagnose a flow; the diagnostic model and the predictive model could both be Navier-Stokes with turbulence closures, and establishing the validity of that model is the aim of the investigation. I’d be interested in criticisms of how I explained this / charted this out.

Afterward

Attempt at Answering Model Questions

I’m not in the target population that professor Easterbrook is studying, but here’s my attempt at answering his questions about model validation [7].

  1. “If I understand correctly–a model is ’valid’ (is that a formal term?) if the code is written to correctly represent the best theoretical science at the time...”

    I think you are using an STS-flavored definition of “valid.” The IEEE/AIAA/ASME/US-DoE/US-DoD definition differs. “Valid” means the observables you get out of your simulations are “close enough” to observables in the wild (experimental results). The folks from DoE tend to argue for a broader definition of valid than the DoD folks. They’d like to include as “validation” the activities of a scientist comparing simulation results and experimental results without reference to an intended use.

  2. “– so then what do the results tell you? What are you modeling for–or what are the possible results or output of the model?”

    Doing a simulation (running the implementation of a model) makes explicit the knowledge implicit in your modeling choices. The model is just the governing equations, you have to run a simulation to find solutions to those governing equations.

  3. “If the model tells you something you weren’t expecting, does that mean it’s invalid? When would you get a result or output that conflicts with theory and then assess whether the theory needs to be reconsidered?”

    This question doesn’t make sense to me. How could you get a model output that conflicted with theory? The model is based on theory. Maybe this question is about how simplifying assumptions could lead to spurious results? For example, if a simulation result shows failure to conserve mass/momentum/energy in a specific calculation possibly due to a modeling assumption (more likely due to a more mundane error), I don’t think anyone but a perpetual-motion machine nutter would seriously reconsider the conservation laws.

  4. “Then is it the theory and not the model that is the best tool for understanding what will happen in the future? Is the best we can say about what will happen that we have a theory that adheres to what we know about the field and that makes sense based on that knowledge?”

    This one doesn’t make sense to me either. You have a “theory,” but you can’t formulate a “model” of it and run a simulation, or just a pencil and paper calculation? I don’t think I’m understanding how you are using those words.

  5. “What then is the protection or assurance that the theory is accurate? How can one ‘check’ predictions without simply waiting to see if they come true or not come true?”

    There’s no magic; the protection from fooling ourselves is the same as it has always been, only the names of the problems change.

Attempt at Understanding Blah Blah Blah

  • “The trouble comes when empiricism is combined with a hypothesis that the climate is stationary, which is implicit in how many of their analyses work.” [8]

    The irony of this statement is extraordinary in light of all the criticisms by the auditors and others of statistical methods in climate science. It would be a valid criticism, if it were supported.

  • “The empiricist view has never entirely faded from climatology, as, I think, we see from Curry. But it’s essentially useless in examining climate change. Under its precepts, the only thing that is predictable is stasis. Once things start changing, empirical science closes the books and goes home. At that point you need to bring some physics into your reasoning.” [2]

    So we’ve gone from what could be reasonable criticism of unfounded assumptions of stationarity to empiricism being unable to explain or understand dynamics. I guess the guys working on embedding dimension stuff, or analogy based predictions would be interested to know that.

  • “See, empiricism lacks consilience. When the science moves in a particular direction, they have nothing to offer. They can only read their tea leaves. Empiricists live in a world which is all correlation, and no causation.” [3]

    Let's try some definitions.

    empiricism: knowledge through observation
    consilience: unity of knowledge, non-contradiction

    How can the observations contradict each other? Maybe a particular explanation for a set of observations is not consilient with another explanation for a different set of observations. This seems to be something that would get straightened out in short order though: it’s on this frontier that scientific work proceeds. I’m not sure how empiricism is “all correlation.” This is just a bald assertion with no support.

  • “While empiricism is an insufficient model for science, while not everything reduces to statistics, empiricism offers cover for a certain kind of pseudo-scientific denialism. [...] This is Watts Up technique asea; the measurements are uncertain; therefore they might as well not exist; therefore there is no cause for concern!” [4]

    Tobis: Empiricism is an insufficient model for science. Feynman: The test of all knowledge is experiment. Tobis: Not everything reduces to statistics. Jaynes: Probability theory is the logic of science. To be fair, Feynman does go on to say that you need imagination to think up things to test in your experiments, but I’m not sure that isn’t included in empiricism. Maybe it isn’t included in the empiricism Tobis is talking about.

    So that’s what all this is about? You’re upset at Watts making a fallacious argument about uncertainty? What does empiricism have to do with this? It would be simple enough to just point out that uncertainty doesn’t mean ignorance.

Not quite blah blah blah, but the argument is still hardly thought out and poorly supported.

References

[1]   Easterbrook, S., “Climate Science is an Experimental Science,” http://www.easterbrook.ca/steve/?p=1322, February 2010.

[2]   Tobis, M., “The Empiricist Fallacy,” http://initforthegold.blogspot.com/2010/11/empiricist-fallacy.html, November 2010.

[3]   Tobis, M., “Empiricism as a Job,” http://initforthegold.blogspot.com/2010/11/empiricism-as-job.html, November 2010.

[4]   Tobis, M., “Pseudo-Empiricism and Denialism,” http://initforthegold.blogspot.com/2010/11/pseudo-empiricism-and-denialism.html, November 2010.

[5]   Thacker, B. H., Doebling, S. W., Hemez, F. M., Anderson, M. C., Pepin, J. E., and Rodriguez, E. A., “Concepts of Model Verification and Validation,” Tech. Rep. LA-14167-MS, Los Alamos National Laboratory, Oct 2004.

[6]   Heald, M. and Wharton, C., Plasma Diagnostics with Microwaves, Wiley series in plasma physics, Wiley, New York, 1965.

[7]   Easterbrook, S., “Validating Climate Models,” http://www.easterbrook.ca/steve/?p=2032, November 2010.

[8]   Tobis, M., “Empiricism,” http://initforthegold.blogspot.com/2010/11/empiricism.html, November 2010.

Thanks to George Crews and Dan Hughes for their critical feedback on portions of this.

[Update: George left a comment with suggestions on changing the flowchart. Here's my take on his suggested changes.

A slightly modified version of George's chart. I think it makes more sense to have the "No" branch of the validation decision point back at "Abstraction", which parallels the "No" branch of the verification decision pointing at "Implementation". Also switched around "Experimental Data" and "Experimental Diagnostic." Notably absent is any loop for "Calibration"; this would properly be a separate loop with output feeding into "Computer Model."
]