Saturday, August 11, 2012

Validating the Prediction of Unobserved Quantities

I was going to just post a bit of the abstract and a link to this recent tech report as a comment on my V&V workshop notes, but Validating the Prediction of Unobserved Quantities was interesting enough to deserve its own post. The basic approach is well-founded in probability theory, and there are some novel concepts as well.

Here's the abstract:
In predictive science, computational models are used to make predictions regarding the response of complex systems. Generally, there is no observational data for the predicted quantities (the quantities of interest or QoIs) prior to the computation, since otherwise predictions would not be necessary. Further, to maximize the utility of the predictions it is necessary to assess their reliability, i.e., to provide a quantitative characterization of the discrepancies between the prediction and the real world. Two aspects of this reliability assessment are judging the credibility of the prediction process and characterizing the uncertainty in the predicted quantities. These processes are commonly referred to as validation and uncertainty quantification (VUQ), and they are intimately linked. In typical VUQ approaches, model outputs for observed quantities are compared to experimental observations to test for consistency. While this consistency is necessary, it is not sufficient for extrapolative predictions because, by itself, it only ensures that the model can predict the observed quantities in the observed scenarios. Indeed, the fundamental challenge of predictive science is to make credible predictions with quantified uncertainties, despite the fact that the predictions are extrapolative. At the PECOS Center, a broadly applicable approach to VUQ for prediction of unobserved quantities has evolved. The approach incorporates stochastic modeling, calibration, validation, and predictive assessment phases where uncertainty representations are built, informed, and tested. This process is the subject of the current report, as well as several research issues that need to be addressed to make it applicable in practical problems.

So I think this cartoon illustrates what Moser et al. are after.
When you predict responses that are "far" from the data you have, the credible intervals you give reflect that fact by being wider. This seems pretty mundane. That sort of behavior is a natural result of fitting models (as the cartoon example illustrates). The trouble comes because some folks with lots of great scientific computing and engineering experience, but with limited training in probability and statistics, think that you shouldn't "fit" (or calibrate) a model for a particular problem. Their approach seems to be that you just set constants for a model once and for all, or for a broad class of problems, and then you're off to the races. This vague concept of "a broad class of problems" will resurface.
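To make that concrete, here's a minimal sketch (mine, not the report's) of a conjugate Bayesian linear regression where the posterior predictive credible interval widens as the prediction point moves away from the calibration data. The linear model, noise level, and data are all made up for illustration.

import numpy as np

# Minimal sketch: Bayesian linear regression with a known noise level,
# showing how the predictive credible interval widens as you move away
# from the calibration data.  All of the numbers are made up.
rng = np.random.default_rng(0)
x_obs = np.linspace(0.0, 1.0, 20)                   # observed scenarios
y_obs = 2.0 + 3.0 * x_obs + rng.normal(0, 0.1, 20)  # synthetic data

X = np.column_stack([np.ones_like(x_obs), x_obs])   # design matrix [1, x]
sigma2 = 0.1**2                                     # assumed noise variance
prior_cov = np.eye(2) * 10.0                        # vague Gaussian prior on coefficients

# Conjugate Gaussian update for the regression coefficients
post_cov = np.linalg.inv(np.linalg.inv(prior_cov) + X.T @ X / sigma2)
post_mean = post_cov @ (X.T @ y_obs / sigma2)

def predict(x_new):
    """Posterior predictive mean and 95% credible half-width at x_new."""
    phi = np.array([1.0, x_new])
    mean = phi @ post_mean
    var = phi @ post_cov @ phi + sigma2   # parameter uncertainty + noise
    return mean, 1.96 * np.sqrt(var)

for x in (0.5, 1.5, 3.0):   # interpolation, then increasing extrapolation
    m, hw = predict(x)
    print(f"x = {x:3.1f}: prediction {m:5.2f} +/- {hw:.2f}")

Running that shows the half-width growing as x moves past the range of the data; nothing exotic, just what fitting a model with a probabilistic interpretation buys you.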

Here's an interesting part addressing validity,
Of course, the failure of the model to fit available data even after calibration is a symptom of model inadequacy, and it is tempting to conclude that such a model is invalid. However, this conclusion is not justified because failure to perfectly fit available data does not imply that the model is unable to predict the QoI with sufficient accuracy. The challenge is to determine when predictions of the QoIs are justified or not justified, in light of the observed discrepancies with available data.
I would have worded that differently. Rather than predictions of quantities of interest (QoIs) being "justified", I would focus on figuring out if/when predictions are "useful". This highlights the subjective nature of the validation question. I got dismissive chuckles out of the old men at the V&V workshop when I said a model is valid if the decision maker uses it to make a decision, but this is the pragmatic reality. As L.F. Richardson said, "An error of form which would be negligible in a haystack would be disastrous in a lens." The decision maker's risk is his own, and the judgment call of how close is close enough depends on his purposes.

This concept of "sufficient for purpose" is addressed under the activity that Moser et al. call "predictive assessment", which is step four of their process (a rough sketch of steps 2 through 4 follows the list):
  1. Uncertainty Modeling (they choose probability)
  2. Calibration and Model Selection (Bayesian updates and model comparison)
  3. Validation
  4. Predictive Assessment
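Here's a rough sketch of what steps 2 through 4 might look like for a single embedded-model parameter. The toy exponential-decay model, the grid posterior, and every number in it are my own illustration, not the PECOS machinery.

import numpy as np

# Toy sketch of steps 2-4 for a single embedded-model parameter theta:
# a grid posterior (Bayesian update), a quick consistency check against
# the calibration data, and a predictive interval for an unobserved QoI.
rng = np.random.default_rng(1)

def model(theta, scenario):
    """Toy observable: exponential decay with embedded rate parameter theta."""
    return np.exp(-theta * scenario)

scenarios = np.array([0.2, 0.5, 1.0, 1.5])              # observed scenarios
data = model(0.8, scenarios) + rng.normal(0, 0.02, 4)   # synthetic observations
sigma = 0.02                                            # assumed noise std. dev.

# Step 2, calibration: grid posterior over theta with a flat prior
theta_grid = np.linspace(0.1, 2.0, 400)
log_like = np.array([
    -0.5 * np.sum((data - model(t, scenarios))**2) / sigma**2 for t in theta_grid
])
post = np.exp(log_like - log_like.max())
post /= np.trapz(post, theta_grid)

# Step 3, validation check: does the calibrated model reproduce the observations?
theta_map = theta_grid[np.argmax(post)]
resid = data - model(theta_map, scenarios)
print("max standardized residual:", np.max(np.abs(resid)) / sigma)

# Step 4, predictive assessment: push posterior samples through to an
# unobserved QoI at an extrapolated scenario
samples = rng.choice(theta_grid, size=5000, p=post / post.sum())
qoi = model(samples, 3.0)
print("QoI 95% interval:", np.percentile(qoi, [2.5, 97.5]))

The real content of the report is in what happens between steps 3 and 4: deciding whether passing the consistency check entitles you to believe the interval on the extrapolated QoI.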
However, when describing "validation", they talk about choosing tests that are "sufficient for the predictions at hand", which seems to overlap with the concept of "predictive assessment." I think the most important idea presented in the paper is the "composite model" concept.
To enable credible, extrapolative predictions, we will take advantage of the fact that the predictions are to be made for physical systems. Such systems are commonly described by models based on highly reliable theory (e.g., conservation laws), whose validity is not in question in the context of the predictions to be made. This fact is key in the development of the proposed methodology and is often overlooked, leading to the pessimistic view that any extrapolation is suspect. Of course, if the entire model were known a priori to be highly reliable, there would be no validation question. The difficulties arise because these highly reliable theories are generally augmented with one or more embedded models, which are less reliable. The less reliable embedded models may embody various modeling approximations, empirical correlations, or even direct interpolation of data. For example, in continuum mechanics, the embedded models might include constitutive models and boundary conditions, while in molecular dynamics they would include models for inter-atomic potentials. We will refer to such models--i.e., high-fidelity models with lower-fidelity embedded components--as composite models.
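Here's a toy sketch of the composite-model idea, with names and physics of my own invention: the trusted outer piece is a simple energy balance (a conservation law), and the less reliable embedded piece is a convective heat-transfer closure that can be swapped out or recalibrated without touching the outer piece.

import numpy as np

# Composite model sketch: a highly reliable outer model (energy conservation
# for a cooling body) wrapping a less reliable embedded model (the
# heat-transfer closure).  Toy illustration, not an example from the report.

def embedded_heat_flux(T, T_ambient, h=10.0):
    """Embedded, lower-fidelity piece: an empirical convective closure.
    h is a coefficient that would be calibrated and validation tested."""
    return h * (T - T_ambient)

def composite_model(T0, T_ambient, mass_cp, t_end, dt=0.01,
                    flux_model=embedded_heat_flux):
    """Reliable piece: energy balance m*cp*dT/dt = -q, integrated explicitly.
    The conservation law is trusted; the closure passed in is not."""
    T, t = T0, 0.0
    while t < t_end:
        T -= dt * flux_model(T, T_ambient) / mass_cp
        t += dt
    return T

# Swapping the embedded closure changes the prediction without touching the
# trusted conservation law; that separation is what the report exploits.
print(composite_model(T0=400.0, T_ambient=300.0, mass_cp=50.0, t_end=10.0))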
The first objection that popped into my mind was, "sure, that will work for something like an equation of state where you can calibrate over a range of temperatures and pressures, but what about turbulence!?". For a turbulence model you calibrate over a "broad class of problems". What is this broad class? How is it defined? What quantities are predicted well for which problem types? I didn't have long to stew on that because Moser et al. soon say,
For example, Arrhenius reaction rate models [25] have a single model scenario parameter: the temperature T. Given scenario parameters for an embedded model, the domain of applicability of the model can be defined as the range of these parameters over which the model has been calibrated and validation tested. Unfortunately, for other embedded models, such as RANS turbulence models [38, 16], the definition of appropriate model-specific scenario parameters is less clear. Further, in high dimensional scenario parameter spaces, discerning the range of applicability of a model by direct sampling will generally be impractical.
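The Arrhenius case is simple enough to sketch. The rate constants and the calibrated temperature range below are made-up numbers, but they show what a domain-of-applicability check on a single scenario parameter looks like.

import numpy as np

# Domain-of-applicability check for an embedded model with a single
# scenario parameter (temperature), as in the Arrhenius example.
# The rate constants and the calibrated range are made-up numbers.
R = 8.314                               # J/(mol K), universal gas constant
CALIBRATED_T_RANGE = (300.0, 1200.0)    # K, range covered by calibration data

def arrhenius_rate(T, A=1.0e7, Ea=8.0e4):
    """k(T) = A * exp(-Ea / (R T)); A and Ea would come from calibration."""
    return A * np.exp(-Ea / (R * T))

def rate_with_applicability_check(T):
    lo, hi = CALIBRATED_T_RANGE
    if not (lo <= T <= hi):
        # Outside the calibrated scenario range the prediction is
        # extrapolative; flag it instead of silently trusting the model.
        print(f"warning: T = {T} K is outside the calibrated range [{lo}, {hi}] K")
    return arrhenius_rate(T)

print(rate_with_applicability_check(600.0))    # inside the calibrated range
print(rate_with_applicability_check(2000.0))   # extrapolative use

For a single scenario parameter like T the check is trivial; the report's point is that for something like a RANS closure it's not even clear what the analogous scenario parameters are, let alone how to sample a high-dimensional range of them.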
The important distinction here (one that Dan Hughes often makes) is that things like equations of state and reaction rate models describe properties of the material, not properties of the flow. This raises very similar issues to a discussion I had a long time ago about calibration of climate models. There are certainly lots of opportunities for interesting work highlighted by this little tech report. After reading it, I went and browsed some more of the reports on the ICES site. Lots of interesting stuff; even A Gentle Tutorial on Statistical Inversion using the Bayesian Paradigm.
