Friday, February 4, 2011

Validation and Calibration: more flowcharts

In a previous post we developed a flow-chart for model verification and validation (V&V) activities. One thing I noted in the update on that post was that calibration activities were absent. My Google Alerts just turned up a new paper (it references the Thacker et al. paper the previous post was based on; I think you’ll notice the resemblance of the flow-charts) which adds the calibration activity in much the way we discussed.


Figure 1: Model Calibration Flow Chart of Youn et al. [1]

The distinction between calibration and validation is clearly highlighted, “In many engineering problems, especially if unknown model variables exist in a computational model, model improvement is a necessary step during the validation process to bring the model into better agreement with experimental data. We can improve the model using two strategies: Strategy 1 updates the model through calibration and Strategy 2 refines the model to change the model form.”


Figure 2: Flow chart from previous post

The well-founded criticism of calibration-based arguments for simulation credibility is that calibration provides no indication of the predictive capability of the model so tuned. A statistician might use the term generalization risk for the same idea. There is no magic here. Applying techniques such as cross-validation merely adds a (hyper)parameter to the model (this becomes readily apparent in a Bayesian framework). Such techniques, while certainly useful, are no silver bullet against over-confidence. This is a fundamental truth that will not change with improving technique or technology, because all probability statements are conditional on (among other things) the choice of model space (particular choices of which must by necessity be finite, though the space of all possible models is countably infinite).
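A toy illustration of that point (my own sketch, not from the paper): choosing a polynomial degree by k-fold cross-validation is itself an estimate made from the same data, so the selected degree is just one more (hyper)parameter that has been fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: quadratic truth plus noise
x = np.linspace(-1, 1, 40)
y = 1.0 - 2.0 * x + 0.5 * x**2 + rng.normal(0.0, 0.3, x.size)

def cv_error(x, y, degree, k=5):
    """Mean squared error of a polynomial fit under k-fold cross-validation."""
    idx = rng.permutation(x.size)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        coef = np.polyfit(x[train], y[train], degree)
        errs.append(np.mean((np.polyval(coef, x[fold]) - y[fold]) ** 2))
    return np.mean(errs)

# Selecting the degree by CV is itself a fit to the data: the degree
# becomes one more (hyper)parameter estimated from the same sample.
scores = {d: cv_error(x, y, d) for d in range(1, 9)}
best = min(scores, key=scores.get)
```

The selection step sits outside the per-fold fits, so the reported cross-validation score for the winning degree is still an optimistic, in-sample quantity with respect to that choice.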
One of the other interesting things in the paper is its argument for a hierarchical framework for model calibration / validation. A long time ago, in a previous life, I made a similar argument [2]. Looking back on that article is a little embarrassing. I wrote it before I had read Jaynes (or much else of the Bayesian analysis and design of experiments literature), so it seems very technically naive to me now. The basic heuristics for product development discussed in it are sound, though. They’re based mostly on GAO reports [3, 4, 5, 6], a report by the NAS [7], lessons learned from Live Fire Test and Evaluation [8], and personal experience in flight test. Now I understand better why some of those heuristics have sound theoretical underpinnings.
There are really two hierarchies, though. There is the physical hierarchy of system, sub-system, and component that Youn et al. emphasize, but there is also a modeling hierarchy. The modeling hierarchy is delineated by the level of aggregation, or the amount of reductiveness, in the model. All models are reductive (that’s the whole point of modeling: massage the inordinately complex and ill-posed into tractability); some are just more reductive than others.


Figure 3: Modeling Hierarchy (from [2])

Figure 3 illustrates why I care about Bayesian inference. It’s really the only way to coherently combine information from the bottom of the pyramid (computational physics simulations) with information from higher up the pyramid, which relies on component and subsystem testing.
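A minimal sketch of what that combination looks like, with made-up numbers and a conjugate normal model standing in for the real physics: each level’s posterior becomes the prior for the next level up the pyramid.

```python
import numpy as np

def normal_update(mu0, var0, data, noise_var):
    """Conjugate normal update: combine a Gaussian prior with
    independent Gaussian measurements of known noise variance."""
    n = len(data)
    var_post = 1.0 / (1.0 / var0 + n / noise_var)
    mu_post = var_post * (mu0 / var0 + np.sum(data) / noise_var)
    return mu_post, var_post

# Bottom of the pyramid: a physics simulation suggests theta is near 10
# (a broad prior; all numbers here are hypothetical).
mu, var = 10.0, 4.0
# Component-level tests sharpen that belief...
mu, var = normal_update(mu, var, [9.2, 9.8, 9.5], noise_var=1.0)
# ...and a subsystem-level test updates the same state of knowledge again.
mu, var = normal_update(mu, var, [9.6], noise_var=0.5)
```

The posterior variance shrinks at each level, which is the coherent-accumulation-of-evidence behavior that ad hoc calibration at a single level of the hierarchy does not give you.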
A few things I don’t like about the approach in [1]:
  • The partitioning of parameters into “known” and “unknown” based on what level of the hierarchy (component, subsystem, system) you are at in the “bottom-up” calibration process. Our (properly formulated) models should tell us how much information different types of test data give us about the different parameters. Parameters should always be described by a distribution rather than a discrete switch like “known” or “unknown.”
  • The approach is based entirely on the likelihood (but they do mention something that sounds like expert priors in passing).
  • They claim that the proposed calibration method enhances “predictive capability” (section 3); however, this is a misleading abuse of terminology. Certainly the in-sample performance is improved by calibration, but the whole point of making a distinction between calibration and validation is the recognition that this says little about out-of-sample performance (in fairness, they do equivocate a bit on this point: “The authors acknowledge that it is difficult to assure the predictive capability of an improved model without the assumption that the randomness in the true response primarily comes from the randomness in random model variables.”).
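To make the in-sample / out-of-sample distinction concrete (a toy example of my own, not the paper’s method): calibrating a model with the wrong form can drive the residuals on the calibration data quite small while doing nothing for predictions outside the calibrated range.

```python
import numpy as np

rng = np.random.default_rng(1)

def truth(x):
    return 1.0 + 0.5 * x**2   # the "real" physics (quadratic)

def model(x, a, b):
    return a + b * x          # a reduced model with the wrong form

# Calibration data drawn from a narrow operating range
xc = np.linspace(0.0, 1.0, 20)
yc = truth(xc) + rng.normal(0.0, 0.02, xc.size)

# Calibrate: least-squares fit of the parameters (a, b)
A = np.column_stack([np.ones_like(xc), xc])
a, b = np.linalg.lstsq(A, yc, rcond=None)[0]

# In-sample fit looks great; prediction outside the range does not.
in_sample = np.sqrt(np.mean((model(xc, a, b) - yc) ** 2))
out_sample = abs(model(3.0, a, b) - truth(3.0))
```

The calibrated residuals are at the noise level, yet the extrapolated prediction is off by a large margin, because tuning the parameters cannot repair the model form (the paper’s Strategy 2, model refinement, is aimed at exactly this failure mode).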
Otherwise, I find this a valuable paper that strikes a pragmatic chord, and that’s why I wanted to share my thoughts on it.
[Update: This thesis that I linked at Climate Etc. has a flow-chart too.]


[1]   Youn, B. D., Jung, B. C., Xi, Z., Kim, S. B., and Lee, W., “A hierarchical framework for statistical model calibration in engineering product development,” Computer Methods in Applied Mechanics and Engineering, Vol. 200, No. 13-16, 2011, pp. 1421 – 1431.
[2]   Stults, J. A., “Best Practices for Developmental Testing of Modern, Complex Munitions,” ITEA Journal, Vol. 29, No. 1, March 2008, pp. 67–74.
[3]   “Defense Acquisitions: Assessment of Major Weapon Programs,” Tech. Rep. GAO-03-476, U.S. General Accounting Office, May 2003.
[4]   “Best Practices: Better Support of Weapon System Program Managers Needed to Improve Outcomes,” Tech. Rep. GAO-06-110, U.S. General Accounting Office, 2006.
[5]   “Precision-Guided Munitions: Acquisition Plans for the Joint Air-to-Surface Standoff Missile,” Tech. Rep. GAO/NSIAD-96-144, U.S. General Accounting Office, 1996.
[6]   “Best Practices: A More Constructive Test Approach is Key to Better Weapon System Outcomes,” Tech. Rep. GAO/NSIAD-00-199, U.S. General Accounting Office, July 2000.
[7]   Cohen, M. L., Rolph, J. E., and Steffey, D. L., editors, Statistics, Testing and Defense Acquisition: New Approaches and Methodological Improvements, National Academy Press, Washington, D.C., 1998.
[8]   O’Bryon, J. F., editor, Lessons Learned from Live Fire Testing: Insights Into Designing, Testing, and Operating U.S. Air, Land, and Sea Combat Systems for Improved Survivability and Lethality, Office of the Director, Operational Test and Evaluation, Live Fire Test and Evaluation, Office of the Secretary of Defense, January 2007.


  1. The paper I linked on the Stick Does Not Pogo post references another interesting report from Sandia. Abstract: It is critically important, for the sake of credible computational predictions, that model-validation experiments be designed, conducted, and analyzed in ways that provide for measuring predictive capability. I first develop a conceptual framework for designing and conducting a suite of physical experiments and calculations (ranging from phenomenological to integral levels), then analyzing the results first to (statistically) measure predictive capability in the experimental situations, then to provide a basis for inferring the uncertainty of a computational-model prediction of system or component performance in an application environment or configuration that cannot or will not be tested. Several attendant issues are discussed in general, then illustrated via a simple linear model and a shock physics example.
    Measuring the Predictive Capability of Computational Models: Principles and Methods, Issues and Illustrations