When my daughter was a toddler, we bought her a batch of Looney Tunes DVDs. She loved watching Daffy Duck, Porky Pig, and Bugs Bunny. On her third birthday, we gave her a Road Runner DVD. She had her birthday at her grandparents' house, and so that's where she first watched Road Runner, after lunch.
The next time we visited, a couple of weeks later, she got up straight after lunch and walked towards the living room, saying she was going to watch "meep meep". I had to explain to her that the DVD was at home; it didn't stay at Grandma and Grandad's house.
Why did my daughter do that? It's because she had only ever seen Road Runner at Grandma and Grandad's house. In other words, she only had one example of where she could see Road Runner, so that was where she thought she would see it.
A couple of years later I was attending the viva of a PhD student whose thesis I had examined. The thesis topic was predicting a medical condition from gene expression data, where each gene was assigned a value according to how active it was. During the viva, the student remarked that "expression was measured for those patients who had the condition, and was set to zero for those who did not". I was gobsmacked, as this meant that there were really no gene expression values at all for patients who did not have the condition.
A few years after that I was examining a PhD thesis that used neural networks for earthquake prediction. The central idea was that data from seismographs could be used to predict if an earthquake greater than a certain threshold would occur in the next few days. The data was taken from Canterbury in New Zealand, and started from September 2010, and went for just over a year.
The problem with this, of course, is that on the 4th of September 2010 the Darfield Earthquake struck Canterbury. The aftershocks continued for more than two years, longer than the entire data set.
These are all examples of making predictions based on biased data. My daughter had biased data because she had only ever seen Road Runner at her grandparents' house.
The first student had biased data because there was really only data for one class of patient. They got good results, but that was because they trained a model on biased data then tested it on biased data. Their model would have failed utterly if it had been tested on a different data set.
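A toy sketch of that failure mode, with entirely made-up numbers: if expression is only measured for patients with the condition and set to zero for everyone else, then any learner converges on the trivial rule "nonzero expression means sick". That rule looks perfect on the biased data and collapses as soon as controls have real measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Biased data set: expression was only measured for patients with the
# condition; controls were simply assigned zero.
expr_cases = rng.normal(5.0, 1.0, n)   # measured, essentially all positive
expr_controls = np.zeros(n)            # never measured

X = np.concatenate([expr_cases, expr_controls])
y = np.concatenate([np.ones(n), np.zeros(n)])  # 1 = has condition

# The degenerate "model" any learner would find on this data:
predict = lambda x: (x > 0).astype(float)

accuracy = (predict(X) == y).mean()
print(accuracy)  # essentially perfect -- but only because of the bias

# Realistic data: controls also have measured, nonzero expression.
expr_controls_real = rng.normal(4.0, 1.0, n)
X_real = np.concatenate([expr_cases, expr_controls_real])
acc_real = (predict(X_real) == y).mean()
print(acc_real)  # ~0.5: the model has learned nothing about the condition
```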
The second student had biased data because the entire data set was constructed over a time when it was known that there were going to be earthquakes. So again, their model worked well when trained and tested on the data set they had, but it would have failed utterly if tested on a different data set. Hilariously, when I challenged the student on this in their viva, they replied "It's the right kind of bias"! A much better data set would have been one that extended over several years before and after the Darfield Earthquake, and had been taken from different regions.
While biased models in academic settings might not cause a lot of harm - other than to examiners' calm - such models are also being used in production systems, where the problem appears to be widespread. It can even have deadly consequences.
Biased data leads to biased models. This is such a simple concept, yet so many people who build AI models don't seem to grasp it. Identifying biased data can be tricky, as it requires a solid understanding of what the data represents, how it was gathered, and what it is to be used for.
More insidiously, models built with biased data can show very good results. They will train well, and they will test well, as long as the test data comes from the original, biased data set. That is, as long as the test data is as biased as the training data, the model will show good test results.
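A minimal sketch of this effect, using invented numbers in the spirit of the earthquake example: if the whole data set comes from an aftershock sequence, a quake follows in almost every window, so a model that always predicts "quake" tests brilliantly on a held-out slice of that same data and is useless on data from a quiet period.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical labels: "a large quake occurs in the next few days" for
# each window. The biased set is drawn entirely from an aftershock
# sequence, so the label is almost always positive.
y_biased = rng.random(1000) < 0.9

# The degenerate model any learner converges to on such data:
# always predict "quake coming".
predict = lambda y: np.ones_like(y, dtype=bool)

acc_biased = (predict(y_biased) == y_biased).mean()
print(f"test on a slice of the biased data: {acc_biased:.2f}")  # ~0.90

# An independent set from a quiet period and different regions.
y_independent = rng.random(1000) < 0.05
acc_independent = (predict(y_independent) == y_independent).mean()
print(f"test on independent data: {acc_independent:.2f}")  # ~0.05
```

The held-out slice is just as biased as the training data, so it cannot expose the problem; only the independent set does.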
This makes the sourcing and use of an independent test data set essential. That is probably the number one thing anyone can do to avoid bias in AI. It's not foolproof - there might still be systemic biases in the process that generates the data - but it is an essential first step.