Validation (comparing model predictions to experimental observations) is generally what lets you find out whether you've made good choices about model structure, which physics to include and which to neglect. In most applications of computational physics this is a straightforward (if sometimes expensive) process. The problem is harder with climate models: we can't do designed experiments on the Earth.
Here are a couple of choice quotes from Reichler and Kim 2007 about the difficulties of climate model validation.
Several important issues complicate the model validation process. First, identifying model errors is difficult because of the complex and sometimes poorly understood nature of climate itself, making it difficult to decide which of the many aspects of climate are important for a good simulation. Second, climate models must be compared against present (e.g., 1979-1999) or past climate, since verifying observations for future climate are unavailable. Present climate, however, is not an independent data set since it has already been used for the model development (Williamson 1995). On the other hand, information about past climate carries large inherent uncertainties, complicating the validation process of past climate simulations (e.g., Schmidt et al. 2004). Third, there is a lack of reliable and consistent observations for present climate, and some climate processes occur at temporal or spatial scales that are either unobservable or unresolvable. Finally, good model performance evaluated from the present climate does not necessarily guarantee reliable predictions of future climate (Murphy et al. 2004).
The paper quoted above compares three generations of IPCC-family models. The study shows improvement in simulating the modern climate as the models develop from 1990 to 2001 to 2007. It also shows that the ensemble mean is more skillful than any individual model (more on this later). The reasons given for the improvement make intuitive sense:
Two developments, more realistic parameterizations and finer resolutions, are likely to be most responsible for the good performance seen in the latest model generation. For example, there has been a constant refinement over the years in how sub-grid scale processes are parameterized in models. Current models also tend to have higher vertical and horizontal resolution than their predecessors. Higher resolution reduces the dependency of models on parameterizations, eliminating problems since parameterizations are not always entirely physical. That increased resolution improves model performance has been shown in various previous studies (e.g., Mullen and Buizza 2002, Mo et al. 2005, Roeckner et al. 2006).
A problem faced by climate modelers is that grid-resolved solutions of the climate are unlikely within the lifetime of anyone now living (we CFD guys have the same problem with grid-resolution scaling for high-Reynolds-number flows). There will always be a need for 'sub-grid' parameterizations; the hope is that they will eventually become "entirely physical" and well calibrated (if you think they already are, then you have been taken in by someone's propaganda).
Bayesian model averaging (BMA) is one way to account for our uncertainty in model structure and physics choices. Instead of choosing a single 'right' model, we obtain predictive distributions for the quantities we care about by marginalizing over the uncertain model structures (and the uncertain parameters too). This paper shows that BMA is a useful procedure for short-term forecasting. The benefit with short-term forecasts is that we can evaluate accuracy by closing the loop between predictions and observations. Min and Hense apply this idea to the IPCC AR4 coupled climate models. Here's a short snippet from that paper providing some motivation for the use of BMA:
However, more than 50% of the models with anthropogenic-only forcing cannot reproduce the observed warming reasonably. This indicates the important role of natural forcing although other factors like different climate sensitivity, forcing uncertainty, and a climate drift might be responsible for the discrepancy in anthropogenic-only models. Besides, Bayesian and conventional skill comparisons demonstrate that a skill-weighted average with the Bayes factors (Bayesian model averaging, BMA) overwhelms the arithmetic ensemble mean and three other weighted averages based on conventional statistics, illuminating future applicability of BMA to climate predictions.
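The core BMA calculation is simple to sketch. Below is a toy Python illustration (entirely synthetic data, not from either paper, and a hypothetical Gaussian error model): each candidate model gets a posterior weight proportional to how well it fits the observations, and the predictive mean marginalizes over models rather than picking one winner.

```python
import numpy as np

# Hypothetical illustration: three "models" predict a quantity over a
# calibration period; observations let us weight them by likelihood.
rng = np.random.default_rng(0)
obs = rng.normal(0.5, 0.1, size=30)             # synthetic "observations"
preds = np.stack([obs + rng.normal(b, 0.1, size=30)
                  for b in (0.0, 0.2, -0.3)])   # three biased "models"

sigma = 0.1  # assumed observation-error standard deviation
# Log-likelihood of the observations under each model (Gaussian errors).
loglik = -0.5 * np.sum((preds - obs) ** 2, axis=1) / sigma**2

# Posterior model probabilities (equal priors); subtracting the max
# before exponentiating avoids numerical underflow.
w = np.exp(loglik - loglik.max())
w /= w.sum()

# BMA predictive mean: marginalize over models instead of choosing one.
bma_mean = w @ preds
print(w)  # weights concentrate on the least-biased model
```

The ratios of these likelihood-based weights play the role of the Bayes factors mentioned in the quote; a real application would also integrate over parameter uncertainty within each model.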
Ensemble means or Bayesian averages tend to outperform individual models, but why? Here's what Reichler and Kim (2007) have to say:
Our results indicate that multi-model ensembles are a legitimate and effective means to improve the outcome of climate simulations. As yet, it is not exactly clear why the multi-model mean is better than any individual model. One possible explanation is that the model solutions scatter more or less evenly about the truth (unless the errors are systematic), and the errors behave like random noise that can be efficiently removed by averaging. Such noise arises from internal climate variability (Barnett et al. 1994), and probably to a much larger extent from uncertainties in the formulation of models (Murphy et al. 2004; Stainforth et al. 2005).
Another interesting paper exploring this question finds that models which score well on the calibration data do not tend to outperform other models over a subsequent validation period.
Error in the ensemble mean decreases systematically with ensemble size, N, and for a random selection as approximately 1/N^a, where a lies between 0.6 and 1. This is larger than the exponent of a random sample (a = 0.5) and appears to be an indicator of systematic bias in the model simulations.
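The "errors behave like random noise" argument is easy to demonstrate numerically. Here's a minimal sketch (my own toy setup, not from the paper): each "model" equals truth plus independent noise, so the RMS error of the N-member mean falls like N^(-0.5), while a systematic bias shared by all models puts a floor under the achievable error.

```python
import numpy as np

# Sketch of the averaging argument: each "model" is truth plus
# independent noise; optionally all models share a common bias.
rng = np.random.default_rng(1)

def ensemble_rmse(n_models, bias=0.0, n_trials=2000):
    """RMS error of the n_models-member ensemble mean over random trials."""
    noise = rng.normal(0.0, 1.0, size=(n_trials, n_models))
    err = noise.mean(axis=1) + bias  # ensemble-mean error in each trial
    return np.sqrt(np.mean(err ** 2))

for n in (1, 4, 16, 64):
    print(n, ensemble_rmse(n), ensemble_rmse(n, bias=0.3))
```

With bias = 0 the error tracks 1/sqrt(N); with a shared bias it flattens out near the bias, which is why a fitted exponent different from 0.5 can signal systematic error in the ensemble.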
This should not be surprising; it is very difficult to get all of the physics right (and remove the systematic bias) when you aren't able to do no-kidding validation experiments. They begin their conclusion with:
In our analysis there is no evidence of future prediction skill delivered by past performance-based model selection. There seems to be little persistence in relative model skill, as illustrated by the percentage turnover in Figure 3. We speculate that the cause of this behavior is the non-stationarity of climate feedback strengths. Models that respond accurately in one period are likely to have the correct feedback strength at that time. However, the feedback strength and forcing is not stationary, favoring no particular model or groups of models consistently.
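The non-stationarity speculation can be made concrete with a deliberately trivial toy (my own construction, nothing to do with the paper's actual analysis): give each model a fixed feedback strength and let the "true" feedback drift between the calibration and validation periods, and the best-ranked model changes.

```python
import numpy as np

# Toy illustration: each model has a fixed feedback strength, but the
# "true" feedback drifts between periods, so past rank doesn't persist.
model_feedback = np.array([0.8, 1.0, 1.2, 1.4])   # hypothetical strengths
true_feedback = {"calibration": 1.0, "validation": 1.35}

def best_model(period):
    """Index of the model whose feedback is closest to the truth."""
    return int(np.argmin(np.abs(model_feedback - true_feedback[period])))

print(best_model("calibration"))  # the model with strength 1.0 wins
print(best_model("validation"))   # a different model (strength 1.4) wins
```

Selecting the calibration-period winner buys you nothing for the validation period here, which is the turnover behavior the authors describe.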
This means it is very difficult to protect ourselves from 'over-fitting' the models to the available historical record, and it certainly indicates that we should be cautious about basing policy decisions on climate model forecasts. The 'science is settled' crowd, busy banging the consensus drum and clamouring for urgent action (NOW!), never seems to offer this sort of nuanced approach to policy, though.
If you have read any good, recent climate model validation papers, please post them in the comments. Please don't post polemics about polar bears and Arctic sea ice; my skepticism is honest, and your activism should be too.
For some further reading on Bayesian model averaging / model selection, check out:
(isn't Google Books cool?)