Wednesday, February 22, 2012

Using MLP to model the distribution of bacterial crop diseases

A new paper I co-authored with Sue Worner at Lincoln University is now available. It describes how we used multilayer perceptrons (MLP) to model the global distribution of bacterial crop diseases.

We had data on the presence or absence of certain species of bacterial crop disease (that is, bacteria that infect and cause diseases in plants we use as crops) in 459 geo-political regions throughout the world. We also had data on the climate in these regions and the presence of host plant species. We created MLP that predicted the presence or absence of the bacteria species from climate (abiotic factors). We also created MLP that predicted the presence or absence of the bacteria species from the host plant species assemblages (biotic factors). While both of these approaches worked, we got much better accuracy by combining the outputs into ensembles, and by using a cascaded or tandem artificial neural network (ANN) approach.

Ensembles are a way of combining the outputs of several ANN. An input vector is propagated through each of the ANN, and the output values are combined either statistically (the final output value is the max, mean or median of the uncombined outputs) or algorithmically (the output is determined by a majority vote of the uncombined outputs - that is, if the majority of the values are above a threshold, the output of the ensemble is a presence, otherwise an absence). We looked at three different kinds of ensemble: firstly, ensembles of the best ten MLP trained on abiotic inputs; secondly, ensembles of the best ten MLP trained on biotic inputs; and finally, ensembles that combined the best ten MLP trained on abiotic inputs with the best ten MLP trained on biotic inputs. These last ensembles were particularly interesting, as they allowed us to make predictions of species distributions using both biotic and abiotic factors simultaneously. The rationale behind ensembles is that different MLP learn different parts of the problem space: by combining the outputs of several MLP, it is possible to cover a larger part of the problem space, and therefore to boost prediction accuracy. Combining abiotic and biotic factors follows the same idea. We know that an organism is affected by both kinds of factor, so combining them allows us to make more accurate predictions.
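To make the two combination schemes concrete, here is a minimal Python sketch. It is not the code used in the paper: the function names, the 0.5 threshold and the example output values are all hypothetical, and treating a tied vote as an absence is an assumption.

```python
from statistics import mean, median


def combine_statistical(outputs, method="mean"):
    """Statistical combination: the max, mean or median of the member outputs."""
    combiners = {"max": max, "mean": mean, "median": median}
    if method not in combiners:
        raise ValueError(f"unknown method: {method!r}")
    if not outputs:
        raise ValueError("need at least one network output")
    return combiners[method](outputs)


def majority_vote(outputs, threshold=0.5):
    """Algorithmic combination: presence if most members exceed the threshold.

    Assumption: a tie (exactly half the members above threshold) is an absence.
    """
    if not outputs:
        raise ValueError("need at least one network output")
    votes_for_presence = sum(1 for value in outputs if value > threshold)
    return votes_for_presence > len(outputs) / 2


# Made-up outputs of three climate (abiotic) MLP and three host (biotic) MLP
# for a single region. A real ensemble here would have ten members of each kind.
abiotic_outputs = [0.9, 0.2, 0.6]
biotic_outputs = [0.7, 0.4, 0.8]

# The third kind of ensemble simply pools the members of both kinds.
combined = abiotic_outputs + biotic_outputs

print("mean:", combine_statistical(combined, "mean"))
print("median:", combine_statistical(combined, "median"))
print("max:", combine_statistical(combined, "max"))
print("majority vote says present:", majority_vote(combined))
```

Pooling the two lists before combining is what lets a single ensemble draw on abiotic and biotic information at once.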

While the ensemble approach boosted the prediction accuracies, we thought we could do better, so we created MLP that took as inputs the outputs of the very best MLP trained on abiotic and biotic factors. In other words, the outputs of the climate and host networks were used as the input values for a second level of MLP, which were then trained on the presence and absence of the bacteria species. The idea behind tandem ANN is that, if a first-level network makes a mistake - that is, if a climate or host MLP makes an incorrect prediction - then the tandem network can learn to correct it. Again, we were combining abiotic and biotic factors to make predictions.
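The data flow of the tandem approach can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation from the paper: TinyMLP, tandem_inputs, the hidden layer size, the learning rate and all of the numbers are hypothetical, and the first-level outputs are simply made up rather than produced by trained climate and host networks.

```python
import math
import random


def sigmoid(x):
    # Clamp the argument so math.exp cannot overflow.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """Hypothetical one-hidden-layer perceptron trained by plain SGD."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each hidden unit: one weight per input, plus a trailing bias.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
                         for _ in range(n_hidden)]
        # Output unit: one weight per hidden unit, plus a trailing bias.
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def _forward(self, x):
        hidden = [sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1])
                  for ws in self.w_hidden]
        out = sigmoid(sum(w * h for w, h in zip(self.w_out[:-1], hidden))
                      + self.w_out[-1])
        return hidden, out

    def predict(self, x):
        return self._forward(x)[1]

    def train(self, samples, targets, epochs=500, lr=0.5):
        for _ in range(epochs):
            for x, t in zip(samples, targets):
                hidden, out = self._forward(x)
                # Gradient of cross-entropy loss at a sigmoid output unit.
                delta_out = out - t
                # Hidden deltas use the output weights before they are updated.
                delta_hidden = [delta_out * self.w_out[j] * h * (1.0 - h)
                                for j, h in enumerate(hidden)]
                for j, h in enumerate(hidden):
                    self.w_out[j] -= lr * delta_out * h
                self.w_out[-1] -= lr * delta_out
                for j, ws in enumerate(self.w_hidden):
                    for i, v in enumerate(x):
                        ws[i] -= lr * delta_hidden[j] * v
                    ws[-1] -= lr * delta_hidden[j]


def tandem_inputs(climate_outputs, host_outputs):
    """Pair each region's first-level outputs into a second-level input vector."""
    return [[c, h] for c, h in zip(climate_outputs, host_outputs)]


# Made-up first-level outputs for six regions, with the observed
# presence (1) or absence (0) of one bacterial species in each region.
climate_out = [0.9, 0.8, 0.3, 0.2, 0.7, 0.4]
host_out = [0.8, 0.3, 0.9, 0.1, 0.6, 0.2]
observed = [1, 1, 1, 0, 1, 0]

second_level_x = tandem_inputs(climate_out, host_out)
tandem = TinyMLP(n_inputs=2, n_hidden=3)
tandem.train(second_level_x, observed)

for x, t in zip(second_level_x, observed):
    print(x, "observed:", t, "tandem output:", round(tandem.predict(x), 3))
```

The second-level network never sees the raw climate or host data, only what the first-level networks said about each region, so the only thing it can learn is how to weigh and correct those first-level predictions.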

Overall, while the single-level MLP were able to predict the distributions of the crop diseases fairly well, combining abiotic and biotic factors gave much better accuracy, whether the combination was achieved by a simple ensemble approach or by a tandem MLP approach.

This paper is published in Computational Ecology and Software, an open-access journal. Given my previous posts extolling the virtues of open-access journals, I'm putting my academic money where my mouth is and submitting to open-access journals.