
Friday, February 4, 2011

Validation and Calibration: more flowcharts

In a previous post we developed a flow-chart for model verification and validation (V&V) activities. One thing I noted in the update on that post was that calibration activities were absent. My Google Alerts just turned up a new paper (it references the Thacker et al. paper the previous post was based on, and you’ll notice the resemblance of the flow-charts), which adds the calibration activity in much the way we discussed.



Figure 1: Model Calibration Flow Chart of Youn et al. [1]

The distinction between calibration and validation is clearly highlighted: “In many engineering problems, especially if unknown model variables exist in a computational model, model improvement is a necessary step during the validation process to bring the model into better agreement with experimental data. We can improve the model using two strategies: Strategy 1 updates the model through calibration and Strategy 2 refines the model to change the model form.”



Figure 2: Flow chart from previous post

The well-founded criticism of calibration-based arguments for simulation credibility is that calibration provides no indication of the predictive capability of a model so tuned. The statistician might use the term generalization risk to talk about the same idea. There is no magic here. Applying techniques such as cross-validation merely adds a (hyper)parameter to the model (this becomes readily apparent in a Bayesian framework). Such techniques, while certainly useful, are no silver bullet against over-confidence. This is a fundamental truth that will not change with improving technique or technology, because all probability statements are conditional on (among other things) the choice of model space (particular choices of which must by necessity be finite, though the space of all possible models is countably infinite).
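
To make the hyperparameter point concrete, here is a toy sketch (my own illustration, not from the paper): k-fold cross-validation used to pick a ridge penalty. The penalty, the candidate grid, and the fold count are all additional modeling choices, so the procedure reshapes the over-confidence problem rather than eliminating it.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 5))
    y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(scale=0.5, size=60)

    def ridge_fit(X, y, lam):
        # closed-form ridge solution: (X'X + lam*I)^-1 X'y
        return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

    def cv_error(X, y, lam, k=5):
        # average held-out mean-squared error over k folds
        folds = np.array_split(np.arange(len(y)), k)
        errs = []
        for hold in folds:
            train = np.setdiff1d(np.arange(len(y)), hold)
            beta = ridge_fit(X[train], y[train], lam)
            errs.append(np.mean((y[hold] - X[hold] @ beta) ** 2))
        return np.mean(errs)

    lams = [0.01, 0.1, 1.0, 10.0]  # the candidate grid is itself a modeling choice
    best_lam = min(lams, key=lambda lam: cv_error(X, y, lam))
    print(best_lam)
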
One of the other interesting things in that paper is their argument for a hierarchical framework for model calibration / validation. A long time ago, in a previous life, I made a similar argument [2]. Looking back on that article is a little embarrassing. I wrote that before I had read Jaynes (or much else of the Bayesian analysis and design of experiments literature), so it seems very technically naive to me now. The basic heuristics for product development discussed in it are sound though. They’re based mostly on GAO reports [3, 4, 5, 6], a report by NAS [7], lessons learned from Live Fire Test and Evaluation [8], and personal experience in flight test. Now I understand better why some of those heuristics have sound theoretical underpinnings.
There are really two hierarchies though. There is the physical hierarchy of system, sub-system and component that Youn et al. emphasize, but there is also a modeling hierarchy. This modeling hierarchy is delineated by the level of aggregation, or the amount of reductiveness, in the model. All models are reductive (that’s the whole point of modeling: massage the inordinately complex and ill-posed into tractability); some are just more reductive than others.



Figure 3: Modeling Hierarchy (from [2])

Figure 3 illustrates why I care about Bayesian inference. It’s really the only way to coherently combine information from the bottom of the pyramid (computational physics simulations) with information from higher up the pyramid, which relies on component and subsystem testing.
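
A minimal sketch of what “coherently combine” means in practice (toy numbers of my own, not from [1] or [2]): a normal prior on some performance quantity informed by simulation at the bottom of the pyramid, updated with a noisy result from a test higher up.

    # Conjugate normal-normal update: precision-weighted combination of the
    # simulation-based prior and the test measurement.
    mu_prior, var_prior = 10.0, 4.0   # from simulation at the bottom of the pyramid
    y_test, var_test = 12.0, 1.0      # from a subsystem/system test higher up

    var_post = 1.0 / (1.0 / var_prior + 1.0 / var_test)
    mu_post = var_post * (mu_prior / var_prior + y_test / var_test)
    print(mu_post, var_post)          # 11.6, 0.8: the more precise source dominates
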
A few things I don’t like about the approach in [1]:
  • The partitioning of parameters into “known” and “unknown” based on what level of the hierarchy (component, subsystem, system) you are at in the “bottom-up” calibration process. Our (properly formulated) models should tell us how much information different types of test data give us about the different parameters. Parameters should always be described by a distribution rather than discrete switches like known or unknown.
  • The approach is based entirely on the likelihood (but they do mention something that sounds like expert priors in passing).
  • They claim that the proposed calibration method enhances “predictive capability” (section 3); however, this is a misleading abuse of terminology. Certainly the in-sample performance is improved by calibration, but the whole point of making a distinction between calibration and validation is based on recognizing that this says little about the out-of-sample performance (in fairness, they do equivocate a bit on this point: “The authors acknowledge that it is difficult to assure the predictive capability of an improved model without the assumption that the randomness in the true response primarily comes from the randomness in random model variables.”).
Otherwise, I find this a valuable paper that strikes a pragmatic chord, and that’s why I wanted to share my thoughts on it.
[Update: This thesis that I linked at Climate Etc. has a flow-chart too.]

References

[1]   Youn, B. D., Jung, B. C., Xi, Z., Kim, S. B., and Lee, W., “A hierarchical framework for statistical model calibration in engineering product development,” Computer Methods in Applied Mechanics and Engineering, Vol. 200, No. 13–16, 2011, pp. 1421–1431.
[2]   Stults, J. A., “Best Practices for Developmental Testing of Modern, Complex Munitions,” ITEA Journal, Vol. 29, No. 1, March 2008, pp. 67–74.
[3]   “Defense Acquisitions: Assessment of Major Weapon Programs,” Tech. Rep. GAO-03-476, U.S. General Accounting Office, May 2003.
[4]   “Best Practices: Better Support of Weapon System Program Managers Needed to Improve Outcomes,” Tech. Rep. GAO-06-110, U.S. General Accounting Office, 2006.
[5]   “Precision-Guided Munitions: Acquisition Plans for the Joint Air-to-Surface Standoff Missile,” Tech. Rep. GAO/NSIAD-96-144, U.S. General Accounting Office, 1996.
[6]   “Best Practices: A More Constructive Test Approach is Key to Better Weapon System Outcomes,” Tech. Rep. GAO/NSIAD-00-199, U.S. General Accounting Office, July 2000.
[7]   Cohen, M. L., Rolph, J. E., and Steffey, D. L., editors, Statistics, Testing and Defense Acquisition: New Approaches and Methodological Improvements, National Academy Press, Washington, D.C., 1998.
[8]   O’Bryon, J. F., editor, Lessons Learned from Live Fire Testing: Insights Into Designing, Testing, and Operating U.S. Air, Land, and Sea Combat Systems for Improved Survivability and Lethality, Office of the Director, Operational Test and Evaluation, Live Fire Test and Evaluation, Office of the Secretary of Defense, January 2007.

Monday, January 11, 2010

A Computational Physics Quality Control Checklist?

gmcrews has a nice write-up about software quality and process control, with some commentary touching on climate modeling software in particular. One thing I thought after reading it was that a lot of the process control stuff works really well for manufacturing physical artifacts, but software tends to be pretty different each time you approach a new problem. Statistical process control may not translate to the coding world very directly.

I just read a book review of the Checklist Manifesto (as a former flight test engineer, I hold checklists near and dear) and found an interesting passage that touches on this concern:

...three different kinds of problems in the world: the simple, the complicated, and the complex. Simple problems, they [Zimmerman and Glouberman] note, are ones like baking a cake from a mix. There is a recipe. Sometimes there are a few basic techniques to learn. But once these are mastered, following the recipe brings a high likelihood of success.

Complicated problems are ones like sending a rocket to the moon. They can sometimes be broken down into a series of simple problems. But there is no straightforward recipe. Success frequently requires multiple people, often multiple teams, and specialized expertise. Unanticipated difficulties are frequent. Timing and coordination become serious concerns.

Complex problems are ones like raising a child. Once you learn how to send a rocket to the moon, you can repeat the process with other rockets and perfect it. One rocket is like another rocket. But not so with raising a child, the professors point out. Every child is unique. Although raising one child may provide experience, it does not guarantee success with the next child. Expertise is valuable but most certainly not sufficient. Indeed, the next child may require an entirely different approach from the previous one. And this brings up another feature of complex problems: their outcomes remain highly uncertain. Yet we all know that it is possible to raise a child well. It’s complex, that’s all.

So is software development more like producing / operating a rocket or raising a child? Taken broadly, I think it is more like child rearing, but it’s certainly something that could benefit from checklists in narrow domains. If you are developing computational physics codes, there are published best practices and methodologies for verification, validation and uncertainty quantification. Good software carpentry stuff like unit test suites and solid version control should also be part of the ’process’ of ensuring software quality. The end product of a quality assurance effort in computational physics should be (approaching a checklist here?) a report that documents the version control methods used, the coverage (and successful completion) of the unit test suite, the results of formal verification studies, and the inferences drawn from validation testing about the range of input parameters over which useful predictive accuracy can be expected.
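
As a rough sketch of how such a report could begin to assemble itself (the test module name test_solver is hypothetical, and this assumes the code lives in a git repository):

    import json
    import subprocess
    import unittest

    def qa_report(test_module="test_solver"):
        report = {}
        # 1. Version control: record the exact revision the results came from.
        report["revision"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"]).decode().strip()
        # 2. Unit test suite: run it and record how much passed.
        suite = unittest.defaultTestLoader.loadTestsFromName(test_module)
        result = unittest.TextTestRunner(verbosity=0).run(suite)
        report["tests_run"] = result.testsRun
        report["tests_failed"] = len(result.failures) + len(result.errors)
        # 3. and 4. Verification and validation results would be attached here:
        # observed orders of accuracy, and the input ranges over which the
        # validation comparisons support useful predictive accuracy.
        return report

    if __name__ == "__main__":
        print(json.dumps(qa_report(), indent=2))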

Got your own 'checklist' ideas? Put them in the comments!

Tuesday, November 17, 2009

Cut F-35 Flight Testing?

I just read an interesting article about the F-35. It repeats the standard press-release-based storyline that every major defense contractor / program office offers: "we've figured out what's wrong, we've spent gazillions on simulation (and the great-for-marketing computer graphics that result) so now we know our system so well that we can't afford not to cut flight testing."

I would have just dismissed this article as more of the "same-ole-same-ole", but at the bottom they have a quote I just love:
But scrimping on flight testing isn’t a good idea, said Bill Lawrence of Aledo, a former Marine test pilot turned aviation legal consultant.

"They build the aircraft . . . and go fly it," he said. "Then you come back and fix the things you don’t like."

No amount of computer processing power, Lawrence said, can tell Lockheed and the military what they don’t know about the F-35.

Thank God for testers. This goes to the heart of the problems I have with the model validation (or lack thereof) in climate change science.

From reading some of these posts, you might think I'm some sort of Luddite who doesn't like CFD or other high-fidelity simulations, but you'd be wrong. I love CFD, I've spent a big chunk of my (admittedly short) professional career developing or running or consuming CFD (and similar) analyses. I just didn't fall in love with the colourful fluid dynamics. I understand that bending metal (or buying expensive stuff) based on the results of modeling unconstrained by reality is the height of folly.

That folly is exacerbated by an 'old problem'. In another article on the F-35, I found this gem:
No one should be shocked by the continuing delays and cost increases, said Hans Weber, a prominent aerospace engineering executive and consultant. "The airplane is so ambitious, it was bound to have problems."

The real problem, Weber said, is an old one. Neither defense contractors nor military or civilian leaders in government will admit how difficult a program will be and, when problems do arise, how challenging and costly they will be to fix. "It's really hard," Weber said, "for anyone to be really honest."

Defense acquisition in a nutshell, Pentagon Wars anyone?

Defense procurement veterans said they fear that the Pentagon will be tempted to cut the flight-testing plan yet again to save money.

"You need to do the testing the engineers originally said needed to be done," Weber said. By cutting tests now, "you kick the can down the road," and someone else has to deal with problems that will inevitably arise later.

Classic.

Friday, October 23, 2009

Numerical Throwaway Code

Blogs are a great place for throwaway code. You write throwaway code to learn about solving a particular problem. For me, it is usually implementing different sorts of numerical methods. Posting the results to the web just provides a bit of a goad to pick something interesting and work it to completion. Who knows, it may turn out to be useful to someone else too.

Doing throwaway code is also a good opportunity to practice the complete computational physics implementation cycle, which is (according to me at least):


  • Define the governing equations in your favorite computer algebra system (CAS).

  • Do the symbol manipulation required to apply the chosen discretization or approximation to the governing equations.

  • Export appropriate computation-ready expressions to your favorite language (I use Maxima’s f90() function).

  • Find an analytical solution or generate forcing functions for a manufactured solution using your CAS.

  • Write the auxiliary code to turn your expressions into useful routines and functions (loops and logic statements).

  • Compile and verify that the implementation meets the design order of accuracy with a grid convergence study (a minimal sketch of this check follows the list).
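
Here is that last verification step in miniature (my own toy example, a second-order central difference, rather than anything from the posts linked here):

    import numpy as np

    def central_diff(f, x, h):
        # second-order central difference approximation of df/dx
        return (f(x + h) - f(x - h)) / (2.0 * h)

    x0 = 0.7
    exact = np.cos(x0)  # d/dx sin(x) = cos(x)
    hs = np.array([0.1, 0.05, 0.025, 0.0125])
    errors = np.array([abs(central_diff(np.sin, x0, h) - exact) for h in hs])

    # observed order from successive refinements: p = log(e1/e2) / log(h1/h2)
    orders = np.log(errors[:-1] / errors[1:]) / np.log(hs[:-1] / hs[1:])
    print(orders)  # should approach 2 if the implementation meets its design order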

Becoming fast at going from governing equations to a formally verified implementation is important. According to Philip Greenspun, fast means proficient, and formally verified means you aren’t fooling yourself.

Research has shown that the only way we really learn and improve our skills is by deliberate practice. Deliberate practice has a few defining characteristics:


  • activity explicitly intended to improve performance

  • reaches for objectives just beyond your level of competence

  • provides critical feedback

  • involves high levels of repetition

Does the little numerical methods throwaway code process described above satisfy these criteria? Learning and implementing a new method on an old equation set, or an old method on a new equation set, seems to be intended to improve performance. Since it is a new method, it is just beyond your level of competence (you haven’t implemented it successfully yet). The critical feedback comes from passing the verification test to confirm that the implementation meets the design order of accuracy. Which brings us finally to the repetition level. Getting high levels of repetition is a bit harder. For simple governing equations and simple methods (like this example) it will take a very short time to go from governing equations to verified implementation. But part of the progression of competence is extension to more complicated equations (non-linear instead of linear, multi-dimensional instead of one-dimensional, vector instead of scalar), which can take quite a while to get through the entire cycle, because the complexity tends to explode.

Just getting practice at using the available software tools is important too. They are increasingly powerful, which means they come with a significant learning curve. If you want to solve complicated PDEs quickly and correctly, being proficient with a CAS is the only way to get there.

The (open source) toolset I like to use is Maxima + Fortran + F2Py + Python (scipy and matplotlib especially). The symbol manipulation is quick and correct in Maxima, which produces correct low-level Fortran expressions (through f90()). Once these are generated it is straightforward to wrap them in loops and conditionals and compile them into Python modules with F2Py. This gives you custom compiled routines for your problem which were mostly generated rather than being written by hand. They are callable from a high level language so you can use them alongside all of the great high level stuff that comes in Scipy and the rest of Python. Especially convenient is Python’s unittest module, which provides a nice way to organize and automate a test suite for these routines.
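
For instance, a verification check can be wrapped in a unittest test case so it runs with the rest of the suite. The routine below is a pure-Python stand-in for an F2Py-wrapped routine; the names are made up for illustration.

    import unittest
    import numpy as np

    def spatial_derivative(u, h):
        # stand-in for a generated, F2Py-compiled routine (2nd-order, periodic)
        return (np.roll(u, -1) - np.roll(u, 1)) / (2.0 * h)

    class TestSpatialDerivative(unittest.TestCase):
        def observed_order(self, n=64):
            errs = []
            for points in (n, 2 * n):
                x = np.linspace(0.0, 2.0 * np.pi, points, endpoint=False)
                h = x[1] - x[0]
                errs.append(np.max(np.abs(spatial_derivative(np.sin(x), h) - np.cos(x))))
            return np.log(errs[0] / errs[1]) / np.log(2.0)

        def test_meets_design_order(self):
            # design order of the scheme is 2; allow some slack for the coarse grid
            self.assertGreater(self.observed_order(), 1.9)

    if __name__ == "__main__":
        unittest.main()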

The Radau-type Runge-Kutta example I mentioned earlier is a good example for another reason. It is clearly throwaway code. Why? It has no modularity whatsoever. All of the different parts of the calculation, spatial derivatives, forcing functions, time-step update, are jammed together in one big ugly expression. It is not broadly useful. I can’t take it and use it to integrate a different problem, it only solves the simple harmonic oscillator. I can get away with this monolithic mess because the governing equation is a simple one-dimensional scalar PDE and Maxima doesn’t make mistakes in combining and simplifying all of those operations. In real production code those things need to be separated out into their own modular routines. That way different spatial derivative approximations, different forcing functions, different time-integration methods can be mixed and matched (and independently tested) to efficiently solve a wide variety of problems.
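
A structural sketch of that modular split (names invented; the point is the interfaces, not the particular schemes): the spatial operator, the forcing, and the time integrator live in separate routines so they can be mixed, matched, and tested independently.

    import numpy as np

    def upwind_derivative(u, h):
        # one interchangeable spatial discretization (1st-order upwind, periodic)
        return (u - np.roll(u, 1)) / h

    def no_forcing(x, t):
        # one interchangeable forcing function
        return np.zeros_like(x)

    def euler_step(u, t, dt, x, h, deriv, force):
        # one interchangeable time integrator, written against the pieces above
        return u + dt * (-deriv(u, h) + force(x, t))

    # Swapping schemes is just passing different functions:
    # u_new = euler_step(u, t, dt, x, h, deriv=upwind_derivative, force=no_forcing)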

Making sure that your throwaway code really is just that means you won’t make the biggest mistake related to throwaway code: not throwing it away.