Tuesday, October 29, 2013

Prioritizing insect pests with Kohonen SOM

My research interests and activities are split between two fields: computational intelligence (obviously) and ecological modelling. Although I got into ecological modelling via computational intelligence, many of my recent publications in ecological modelling have had nothing to do with it. An exception is a paper I coauthored that was recently published in the journal Neobiota: "Prioritizing the risk of plant pests by clustering methods: self-organising maps, k-means and hierarchical clustering".

The problem is this: given the species that are known to exist in various geo-political regions of the world, what is the likelihood of one of those species establishing in a region where it is not already present? Species presences and absences are represented by binary vectors. Each region has one vector (its species assemblage) with one element per species: a one represents the presence of that species in the region, and a zero represents its absence. By clustering these assemblage vectors using a SOM, it is possible to infer which species pose the greatest threat to any particular region.
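To make the representation concrete, here is a minimal sketch of how presence records could be turned into binary assemblage vectors. The region and species names are invented for illustration, and the helper name build_assemblage_vectors is hypothetical; it is not taken from the paper.

```python
# Hypothetical toy data: region and species names are invented for illustration.
presences = {
    "region_a": {"sp1", "sp2", "sp3"},
    "region_b": {"sp1", "sp2"},
    "region_c": {"sp4"},
}


def build_assemblage_vectors(presences):
    """Return (species_list, {region: binary presence/absence vector})."""
    # Fix one global species ordering so every vector has the same layout.
    species = sorted(set().union(*presences.values()))
    vectors = {
        region: [1 if sp in present else 0 for sp in species]
        for region, present in presences.items()
    }
    return species, vectors


species, vectors = build_assemblage_vectors(presences)
for region, vector in vectors.items():
    print(region, vector)
```

Each region ends up with a vector of the same length, so any clustering algorithm that works on fixed-length numeric vectors can be applied to them directly.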

The rationale behind this approach is that regions that have similar species assemblages are likely to have similar environments. So if several assemblages end up in the same cluster, and a species is present in many of those regions but absent in others, then that species is likely to become established in the regions from which it is absent.
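The reasoning above can be sketched in a few lines. Assume the clustering has already been done and each region has a cluster label. For a target region, every species it lacks is scored by the fraction of the other regions in its cluster where that species is present. This scoring rule is my simplification for illustration and is not necessarily the risk index used in the paper; the function name and the toy data are likewise hypothetical.

```python
def rank_absent_species(target, clusters, vectors, species):
    """Rank species absent from `target` by prevalence among its cluster mates."""
    cluster_id = clusters[target]
    neighbours = [r for r, c in clusters.items() if c == cluster_id and r != target]
    if not neighbours:
        return []  # a region alone in its cluster gives nothing to infer from
    scores = []
    for i, sp in enumerate(species):
        if vectors[target][i] == 1:
            continue  # already present, so not an establishment risk
        prevalence = sum(vectors[r][i] for r in neighbours) / len(neighbours)
        scores.append((sp, prevalence))
    # Highest prevalence first; ties are broken by species name.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: four regions, four species, cluster labels assumed given.
species = ["sp1", "sp2", "sp3", "sp4"]
vectors = {
    "A": [1, 0, 0, 0],
    "B": [1, 1, 0, 0],
    "C": [1, 1, 1, 0],
    "D": [0, 0, 0, 1],
}
clusters = {"A": 0, "B": 0, "C": 0, "D": 1}

ranking = rank_absent_species("A", clusters, vectors, species)
print(ranking)  # [('sp2', 1.0), ('sp3', 0.5), ('sp4', 0.0)]
```

In this toy case sp2 is present in both of region A's cluster mates, so it ranks as the most likely species to establish in A.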

In this work the SOM were used as data clustering algorithms, with the vector quantisation abilities of the SOM being largely underutilised. My own contribution to the work was the realisation that the SOM were being used purely to cluster data, and hence to test the approach against the k-means clustering algorithm. I found that k-means is just as effective at producing good clusters as the SOM, and is much faster.
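For readers who have not met k-means, here is a minimal sketch of the standard (Lloyd's) algorithm applied to binary assemblage vectors, in plain Python. It is not the code used for the paper; the function names, the random initialisation from the data points, and the toy vectors are all my own assumptions for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: initialise the centroids from k randomly chosen data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical assemblage vectors: two regions share species 1-3, two share 4-6.
points = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The centroids are real-valued even though the inputs are binary: each centroid element is the fraction of the cluster's regions in which that species is present. Like any k-means, the result can depend on the initial centroids, so in practice it is worth running it from several seeds.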

There are some problems with this work as well: it is virtually impossible to determine which approach is better without testing data. This means that if you are clustering a set of species assemblages, you also need more up-to-date data against which to validate the predictions. I do have some thoughts on getting around this, which I am currently investigating.