Showing posts with label applications. Show all posts

Tuesday, October 29, 2013

Prioritizing insect pests with Kohonen SOM

My research interests and activities are split between two fields: computational intelligence (obviously) and ecological modelling. Although I got into ecological modelling via computational intelligence, many of my recent publications in ecological modelling haven't had anything to do with computational intelligence. An exception to this is a recently published paper in the journal Neobiota that I am a coauthor of: "Prioritizing the risk of plant pests by clustering methods: self-organising maps, k-means and hierarchical clustering".

The problem is this: given the species that are known to exist in various geo-political regions of the world, what is the likelihood of one of those species establishing in a region where it is not already present? Species presences and absences are represented by binary vectors, where each region has a vector, a one represents a presence of a particular species in that region, and a zero represents an absence in that region. By clustering the assemblage vectors using a SOM, it is possible to infer which species pose the greatest threat to any particular region.

The rationale behind this approach is that regions that have similar species assemblages are likely to have similar environments. So if several assemblages end up in the same cluster, and a species is present in many of those regions but absent in others, then that species is likely to become established in the regions from which it is absent.
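To make the idea concrete, here is a minimal Python sketch that uses k-means rather than a SOM. The toy regions, the fixed starting centroids, and the choice to read each centroid component as a risk index are all assumptions made for illustration, not the procedure from the paper.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, init_indices, iterations=20):
    """Plain k-means. init_indices picks the starting centroids (an
    assumption made here so that the toy example is deterministic)."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each region to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its member assemblages.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def establishment_risk(vector, centroid):
    """Rank the species absent from a region by the fraction of regions
    in its cluster where they are present (the centroid component)."""
    absent = [(s, centroid[s]) for s, present in enumerate(vector) if not present]
    return sorted(absent, key=lambda pair: pair[1], reverse=True)


# Hypothetical presence/absence data: one row per region, one column per species.
assemblages = [
    [1, 1, 1, 0, 0],  # region A
    [1, 1, 0, 0, 0],  # region B
    [1, 1, 1, 0, 0],  # region C
    [0, 0, 0, 1, 1],  # region D
    [0, 0, 1, 1, 1],  # region E
]
centroids, labels = kmeans(assemblages, init_indices=[0, 3])
risk_b = establishment_risk(assemblages[1], centroids[labels[1]])
```

In this toy data, region B clusters with A and C, and the species at index 2 is present in two of the three regions in that cluster, so it comes out as the highest-ranked threat to B.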

In this work the SOM were used as data clustering algorithms, with the vector quantisation abilities of the SOM being largely underutilised. My own contribution to the work was the realisation that the SOM were being used to cluster data, and hence to test the approach against the much faster k-means clustering algorithm. I found that k-means is just as effective at producing good clusters as the SOM, and is much faster.

There are some problems with this work as well. It is virtually impossible to determine which approach is better without testing data, which means that if you are clustering a set of species assemblages, you also need some more up-to-date data to validate the predictions. I do have some thoughts on getting around this, which I am currently investigating.


Tuesday, August 14, 2012

Identifying bats with ANN

An interesting paper has just come out in the Journal of Applied Ecology, which describes how ANN were used to identify thirty-four European species of bats based on their echolocation calls. This is a challenging problem, because the calls within any bat species can vary quite a lot, depending on what the bat is doing. For example, the calls that a bat uses while hunting are different to the calls that a bat uses while commuting to a hunting ground. The work described in this paper has several good features.

Firstly, they used a hierarchy of MLP ensembles to identify the species. First a level of MLP identified the geographic region (out of six) that the bat came from. Then a second level was used to identify the genus (out of seven) of the bat. Finally, an ensemble of species-specific MLP identified the species itself.
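The routing through such a hierarchy might look something like the Python sketch below. The paper leaves out the details of how the levels interact, so everything here is an assumption: the stand-in member functions, the placeholder labels, the `peak_khz` feature, and the averaging of class scores within each ensemble are all hypothetical.

```python
def ensemble_predict(members, features):
    """Average the class scores of all members and return the top class."""
    totals = {}
    for member in members:
        for label, score in member(features).items():
            totals[label] = totals.get(label, 0.0) + score
    return max(totals, key=totals.get)


def classify_call(features, region_ensemble, genus_ensembles, species_ensembles):
    """Route one call through region -> genus -> species ensembles."""
    region = ensemble_predict(region_ensemble, features)
    genus = ensemble_predict(genus_ensembles[region], features)
    species = ensemble_predict(species_ensembles[genus], features)
    return region, genus, species


# Hypothetical stand-ins for trained MLP: each maps features to class scores.
region_ensemble = [
    lambda f: {"region_1": 0.7, "region_2": 0.3},
    lambda f: {"region_1": 0.6, "region_2": 0.4},
]
genus_ensembles = {
    "region_1": [lambda f: {"genus_a": 0.2, "genus_b": 0.8}],
    "region_2": [lambda f: {"genus_a": 0.9, "genus_b": 0.1}],
}
species_ensembles = {
    "genus_a": [lambda f: {"species_w": 1.0}],
    "genus_b": [
        lambda f: ({"species_x": 0.1, "species_y": 0.9} if f["peak_khz"] > 40
                   else {"species_x": 0.9, "species_y": 0.1}),
    ],
}

result = classify_call({"peak_khz": 45.0},
                       region_ensemble, genus_ensembles, species_ensembles)
```

Here the earlier levels only select which ensemble runs next, which is how I read the diagram in the paper.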

Secondly, they used a large data set to train the MLP, and performed a thorough data analysis to identify the significant features. Rather than just cramming every acoustic feature through the MLP and hoping for the best, they only used the most significant twenty-four.

Finally, they incorporated the classifiers into software called iBatsID that is freely available for anyone to use.

The authors reported a range of classification accuracies across the species, from a high of 100% to a low of 56.5%. They say that "This is almost certainly the results of our eANN [ensemble ANN] dealing with many more species". I think they're wrong when they say that, because the point of using ensembles is that the individual members of the ensemble can be highly specialised for a particular class. I suspect that the problem may be that the features they selected were not as useful for classifying the poorly-recognised species: rather than using the same twenty-four parameters for all thirty-four species, they might have gotten better results by selecting acoustic parameters for each species. Also, from the diagram (Figure 3 in the paper) it looks like they used the outputs of the regional and genus networks only to decide which groups of species MLP to use. An alternative would have been to use the output of the regional and genus MLP as input features for the following levels (similar to my approach in this paper), which would have added some more information into the classification process and probably boosted accuracy.

A final problem with this paper is that they have excluded a lot of the technical details about constructing and training the ANN, and about exactly how the different levels in the hierarchy interacted. This is probably because it is an ecology paper, not an ANN paper.

Overall, it's an interesting application, and I'm looking forward to seeing more work done on this problem in the future.

Thursday, March 29, 2012

Squirrel detection with SVM

Below is an entertaining video explaining how to automatically squirt squirrels with a water gun. What's interesting about this is that the presenter used a Support Vector Machine (SVM) to classify the images from the camera as either a squirrel or not a squirrel. I haven't talked about SVM on this blog much, but they are very powerful learning algorithms that often outperform neural networks in classification applications.

He starts talking about the details of squirrel detection about the 7:30 mark - before that he describes the image processing toolkit he used to segment the images from the camera into blobs, where each blob needed to be classified as either a squirrel or not a squirrel. I was particularly interested in how he used three different kinds of features as inputs to the SVM: size of the blob segmented from the image; the colour histogram of the blob; and the entropy of the blob, where entropy is used as a measure of the "fuzziness" of the blob - squirrels have fuzzy, furry tails, while birds do not. This shows that careful thought is always required when selecting the inputs to a classifier or a learning algorithm. You can't just throw everything in and hope to get something useful out!
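As a rough illustration of that kind of feature vector, here is a small Python sketch. It assumes a blob is given as a flat list of 8-bit greyscale pixel values and uses an intensity histogram in place of the colour histogram from the video; the function names and the Shannon entropy definition are my assumptions, not details taken from the presentation.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy (in bits) of the pixel value distribution."""
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(pixels).values())


def blob_features(pixels, bins=4, levels=256):
    """Feature vector: blob size, normalised histogram, entropy."""
    histogram = [0.0] * bins
    for p in pixels:
        histogram[p * bins // levels] += 1.0 / len(pixels)
    return [len(pixels)] + histogram + [shannon_entropy(pixels)]


# A flat, uniform blob has zero entropy; a varied, "fuzzy" one scores higher.
smooth = shannon_entropy([5, 5, 5, 5])
fuzzy = shannon_entropy([0, 1, 2, 3])
features = blob_features([0, 0, 255, 255], bins=2)
```

The resulting list is the kind of input vector that would then be handed to the classifier.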


Wednesday, February 22, 2012

Using MLP to model the distribution of bacterial crop diseases

A new paper I co-authored with Sue Worner at Lincoln University is now available and describes how we used MLP to model the global distribution of bacterial crop diseases.

We had data on the presence or absence of certain species of bacterial crop disease (that is, bacteria that infect and cause diseases in plants we use as crops) in 459 geo-political regions throughout the world. We also had data on the climate in these regions and the presence of host plant species. We created MLP that predicted the presence or absence of the bacteria species from climate (abiotic factors). We also created MLP that predicted the presence or absence of the bacteria species from the host plant species assemblages (biotic factors). While both of these approaches worked, we got much better accuracies by combining the outputs into ensembles, and by using a cascaded or tandem ANN approach.

Ensembles are a way of combining the outputs of several ANN. An input vector is propagated through each of the ANN, and the output values combined either statistically (the final output value is the max, mean or median of the uncombined outputs) or algorithmically (output is determined as a majority vote of the uncombined outputs - that is, if the majority of the values is above a threshold, the output of the ensemble is a presence, otherwise, absence). We looked at three different kinds of ensemble: firstly, ensembles of the best ten MLP trained on abiotic inputs; secondly, ensembles of the best ten MLP trained on biotic inputs; and finally, ensembles that combined the best ten MLP trained on abiotic input as well as the best ten MLP trained on biotic inputs. These last ensembles were particularly interesting, as they allowed us to make predictions of species distributions using both biotic and abiotic factors simultaneously. The rationale behind ensembles is that different MLP learn different parts of the problem space: by combining the outputs of several MLP, it is possible to cover a larger part of the problem space, and therefore to boost prediction accuracy. Combining abiotic and biotic factors is the same idea. We know that an organism is affected by both of these factors, so combining both of them allows us to make more accurate predictions.
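The combination rules themselves are simple enough to sketch in a few lines of Python. The output values below are made up, there are three per group rather than ten, and the 0.5 threshold and the handling of tied votes (a tie counts as an absence) are assumptions for illustration.

```python
from statistics import mean, median


def combine_statistical(outputs, method="mean"):
    """Combine member outputs with max, mean or median."""
    combiners = {"max": max, "mean": mean, "median": median}
    return combiners[method](outputs)


def combine_majority(outputs, threshold=0.5):
    """Return 1 (presence) if most members exceed the threshold, else 0."""
    votes = sum(1 for o in outputs if o > threshold)
    return 1 if votes > len(outputs) / 2 else 0


# Hypothetical outputs of individual MLP for one region and one species.
abiotic_outputs = [0.9, 0.8, 0.2]
biotic_outputs = [0.6, 0.1, 0.7]

# A combined ensemble simply pools the members of both groups.
combined = abiotic_outputs + biotic_outputs
combined_mean = combine_statistical(combined, "mean")
combined_vote = combine_majority(combined)
```

In this toy case the biotic members alone are split, but pooling them with the abiotic members gives a clear vote for presence.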

While the ensemble approach boosted the prediction accuracies, we thought we could do better, so we created MLP that took as inputs the outputs of the very best MLP trained on abiotic and biotic factors. In other words, the outputs of the climate and host networks were used as the input values for a second-level of MLP, which were then trained on the presence and absence of the bacteria species. The idea behind tandem ANN is that, if a first-level network makes a mistake - that is, if a climate or host MLP makes an incorrect prediction - then the tandem network can learn to correct it. Again, we were combining abiotic and biotic factors to make predictions.
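A minimal Python sketch of the tandem idea follows. To keep it short, the second level is a single logistic unit trained by batch gradient descent rather than an MLP, and the first-level outputs and targets are invented; it only shows how first-level outputs become second-level inputs.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_second_level(first_level_outputs, targets, epochs=5000, lr=0.5):
    """Fit a logistic unit on the outputs of the first-level networks.
    (A stand-in for the second-level MLP, assumed here for brevity.)"""
    n_inputs = len(first_level_outputs[0])
    weights, bias = [0.0] * n_inputs, 0.0
    n = len(targets)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_inputs, 0.0
        for x, y in zip(first_level_outputs, targets):
            err = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return 1 (presence) or 0 (absence) for one pair of first-level outputs."""
    return 1 if sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) > 0.5 else 0


# Hypothetical (climate MLP output, host MLP output) pairs and true labels.
# In rows two and three the climate network alone would be wrong.
first_level_outputs = [(0.9, 0.8), (0.8, 0.2), (0.2, 0.9), (0.1, 0.1)]
targets = [1, 0, 1, 0]

weights, bias = train_second_level(first_level_outputs, targets)
predictions = [predict(weights, bias, x) for x in first_level_outputs]
```

On this toy data the second level learns to lean on the host output, which corrects the two cases the climate output gets wrong.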

The results of all these techniques were that while the single-level MLP were able to predict the distributions of the crop diseases fairly well, combining abiotic and biotic factors gave much better accuracy, whether the combination was achieved by a simple ensemble approach, or by using a tandem MLP approach.

This paper is published in Computational Ecology and Software, an open-access journal. Given my previous posts extolling the virtues of open access journals (see here, here, here and here) I'm putting my academic money where my mouth is, and submitting to open-access journals.

Tuesday, June 14, 2011

Detecting reefs with ANN

I have just published a paper, along with several of my colleagues at the University of Adelaide, on detecting reefs using MLP.

The problem was that while there is coarse-scale bathymetric data from sonar surveys, and surveys of small areas that list the presence and absence of reefs in a relatively small number of points, there have not been large-scale surveys of where, exactly, reefs are. This is because the fine-scale sonar surveys needed to detect them remotely are very expensive and time consuming, and surveying manually (divers going into the water and looking) can be dangerous in places (either dangerous sea conditions, or big bitey beasties in the water). Not knowing where reefs are is a problem, especially if you want to construct ecological models of reef-dwelling creatures like abalone. In short, abalone like to live on reefs, so to build an accurate model, you must know where the reefs are.

We addressed this problem by first processing the bathymetric data into slope and curvature measures of the sea bed, then training MLP over sliding 2D windows of these variables, where a known reef presence or absence was in the centre of the window. A window in this case was an n * n matrix of values, where we used n=5. So, the third element of the third row was the target cell, which the MLP was learning to classify as either a reef or non-reef point.

We found that combinations of the bathymetric value of the target cell, and a 5*5 window of seabed slope, gave us the best results. The overall experimental method we used was as I described in this post. While we weren't able to classify every reef exactly, the overall accuracy of 85% was enough to construct a useful map of reefs for ecological models of abalone.
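Building one such input vector can be sketched in Python as follows. The grids are made-up nested lists, the function names are hypothetical, and skipping points that sit too close to the edge of the grid is an assumption rather than something taken from the paper.

```python
def extract_window(grid, row, col, n=5):
    """Return the n * n window centred on (row, col) as a flat list,
    or None if the window would run off the edge of the grid."""
    half = n // 2
    rows, cols = len(grid), len(grid[0])
    if row - half < 0 or col - half < 0 or row + half >= rows or col + half >= cols:
        return None
    return [grid[r][c]
            for r in range(row - half, row + half + 1)
            for c in range(col - half, col + half + 1)]


def build_input(depth, slope, row, col, n=5):
    """Depth of the target cell followed by the n * n slope window."""
    window = extract_window(slope, row, col, n)
    if window is None:
        return None
    return [depth[row][col]] + window


# Hypothetical 7 x 7 grids of depth and seabed slope values.
depth = [[-(r + c) for c in range(7)] for r in range(7)]
slope = [[r * 10 + c for c in range(7)] for r in range(7)]

x = build_input(depth, slope, 3, 3)       # 1 depth value + 25 slope values
edge = build_input(depth, slope, 0, 0)    # too close to the edge
```

With n=5 this gives a 26-element input vector for each surveyed point, with the target cell's slope in the middle of the window part.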

We're looking at boosting the accuracy of our models by various means - this first paper is just a proof-of-concept, to show that we can find reefs with ANN.

The full citation for this paper is:

Watts, M.J., et al., A novel method for mapping reefs and subtidal rocky habitats using artificial neural networks. Ecological Modelling (2011), doi:10.1016/j.ecolmodel.2011.04.024