Tuesday, July 26, 2011

Software development in science

There are fundamental differences in the way scientists and software engineers create software. Here are two posts on two separate blogs, each arguing its own case about the difference between the software created by scientists and the software created by software engineers. The first argues that the differences are due to culture: scientists view software as a tool that just needs to work, so they don't mind writing it quickly and in a less-than-maintainable manner. Software engineers see software as a product, and so spend the time and effort to make software that is maintainable. The second, on the other hand, argues that it is not a cultural difference, but an issue of reproducibility. Being able to reproduce results is extremely important in science - for example, a lack of reproducibility is in part how the fraudulent results of Jan Hendrik Schön were uncovered. Thus, software needs to be reproducible and, therefore, to produce trustworthy results.

As both a software engineer and a working scientist, I tend to agree more with the second argument, but I think that the major problem is that some scientists who code are going too far outside of their area of expertise.

It takes education and a lot of experience to be able to write good code. I've been writing software for more than sixteen years now, and I think I am finally getting to the point that my coding skills are adequate. But that's after earning an honours degree in the field, spending a couple of years working closely with a truly gifted programmer, and spending many more years writing software for a wide variety of applications. When I first started writing scientific software, the code I produced wasn't very good: it ran OK, and produced reasonable results, but it was pretty clunky and very difficult to adapt to other projects. I learned very quickly after that to design code for modularity and replicability. Reusable code, of course, is superior to code that is purpose-built each time. Apart from making it easier and quicker to produce new software, reused code is far more reliable: bugs are more likely to have been noticed and fixed in the earlier software.

I often tell my co-workers (who are all very good ecologists) that it is very easy to write bad software and that writing good software is hard. So, even though I spend my days writing software to process the output of some fairly painful software (obviously written by non-engineers), and even though it takes me more time than people think it should, I still take the time to build it according to the principles I learned as a software engineer. And every time I do that, the effort pays off later on, because I am always able to adapt my code to a new application with minimal effort, even though that application had not even been thought of when I first wrote the code.

I know that this sounds terribly snobbish, even elitist, but I look at it this way: If you want to design a reliable bridge, you need a civil engineer. If you want to design a reliable car, you need a mechanical engineer. If you want to write reliable software, you need a software engineer.

I think this problem of scientists over-reaching into code writing occurs because writing code is so easy to do, and because software can fail in subtle ways. Building a bridge takes a lot of material and manpower, and if it is not designed properly, it falls down. Building a car takes a lot of time and components, and if it is not designed properly, it crashes (or doesn't run at all). With software, however, anyone can download and install a scripting language like Python or a package like R and knock out a script that seems to do what they want. That same ease means that anyone can knock out numbers that look reasonable but are in fact completely wrong.
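To make the point about plausible-but-wrong numbers concrete, here is a small sketch in Python. The data and function names are invented purely for illustration, and it assumes the quantity actually wanted is the overall survival rate across all seedlings. Both functions run without error and both return something that looks like a sensible survival rate, but only one of them answers that question.

```python
# Hypothetical survey data: (survivors, seedlings planted) for each plot.
plots = [(9, 10), (100, 1000)]


def mean_of_plot_rates(plots):
    """Average the per-plot survival rates, ignoring how big each plot is."""
    rates = [survived / planted for survived, planted in plots]
    return sum(rates) / len(rates)


def pooled_rate(plots):
    """Overall survival rate: total survivors divided by total planted."""
    total_survived = sum(survived for survived, _ in plots)
    total_planted = sum(planted for _, planted in plots)
    return total_survived / total_planted


naive = mean_of_plot_rates(plots)
pooled = pooled_rate(plots)

print(f"mean of plot rates: {naive:.3f}")
print(f"pooled rate:        {pooled:.3f}")
```

The first figure gives the tiny plot the same weight as the large one, so it comes out several times higher than the pooled figure. Nothing crashes and nothing looks odd in the output; the mistake only shows up if someone checks the calculation against what it was supposed to measure.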

If you want good software, you need a software engineer. It's an investment that pays off in the long run.