Big Data are, almost by definition, data too large to be inspected by humans. Their size has consequences: they are so large that the typical applications used to store, process and analyse them become inadequate. Processing them is often too much for a single computer, so a cluster of computers has to be used in parallel. Alternatively, the amount of data has to be reduced by mapping an unstructured dataset onto a dataset of key-value pairs, on a reduced selection of which mathematical analyses can then be performed (“MapReduce”). Even though Big Data are not collected in response to a specific research question, their sheer size (millions of observations of many variables) promises answers relevant for a large part of a society’s population. From a statistical point of view, large sample sizes make almost any difference statistically significant; effect size therefore becomes the more important measure. On the other hand, large does not mean all: one has to be aware of the universe actually covered by the data. Statistical inference – conclusions drawn from data about the population as a whole – cannot easily be applied, because the datasets are not established in a way that ensures they are representative. Bias in Big Data may therefore, ironically, come from missing data, e.g. on those parts of the population which are not captured in the data collection process.
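To make the MapReduce idea concrete, here is a minimal sketch in plain Python, a word count over a toy “corpus” without any cluster framework; the function names and data are purely illustrative:

```python
from collections import defaultdict

def map_phase(records):
    """Map: turn each unstructured record into (key, value) pairs."""
    for record in records:
        for word in record.split():
            yield (word.lower(), 1)

def reduce_phase(pairs):
    """Reduce: aggregate all values that share the same key."""
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

# Toy corpus standing in for a large, unstructured dataset
records = ["Big Data is big", "big data needs reduction"]
counts = reduce_phase(map_phase(records))
print(counts)  # {'big': 3, 'data': 2, 'is': 1, 'needs': 1, 'reduction': 1}
```

In a real cluster the map and reduce steps would be distributed across many machines; the point here is only the shape of the transformation from raw records to key-value pairs to aggregated results.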
But biases may also arise in the process of analysing Big Data. This, too, has to do with the substantial size of the datasets: standard software may be unable to handle them. Beyond parallel computing and MapReduce, machine learning seems to provide solutions. Machine learning designates algorithms that can learn from and make predictions on data by building a model from sample inputs. It is a type of artificial intelligence in which the system learns from a large number of examples; results – such as patterns or clusters – become stronger with more evidence. This is why Big Data and machine learning seem to go hand in hand. Machine learning can roughly be divided into A) analytic techniques which use stochastic data models, most often classification and regression in supervised learning; and B) predictive approaches, where the data mechanism is unknown, as is the case with neural nets and deep learning. In both cases biases may result from the processing of Big Data.
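A minimal sketch of this “learn from sample inputs, then predict” loop might look as follows; the features and labels are invented toy values, and scikit-learn is used only as one possible library:

```python
from sklearn.linear_model import LogisticRegression

# Sample inputs (features) and known labels: the "examples" the system learns from
X_train = [[25, 0], [47, 1], [35, 0], [52, 1], [23, 0], [56, 1]]
y_train = [0, 1, 0, 1, 0, 1]

# Build a model from the sample inputs (supervised learning, case A)
model = LogisticRegression()
model.fit(X_train, y_train)

# Make predictions on data the model has not seen before
print(model.predict([[30, 0], [50, 1]]))
```

The more examples such a system sees, the more stable the learned patterns become, which is precisely why it pairs so naturally with Big Data.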
A) The goal of statistical modelling is to find a model that allows quantitative conclusions to be drawn from data. It has the advantage that the data model is transparent and comprehensible to the analyst. However, what sounds objective (because it is ‘based on statistics’) need neither be correct (if the model is a poor emulation of reality, the conclusions may be wrong) nor fair: the algorithms may simply not be written in a way that makes fairness, or an even distribution of errors, a goal of the problem-solving procedure. Machine learning then commits what is called disparate mistreatment: the algorithm optimizes its decisions for the population as a whole, but does not aim for a fair distribution of its errors across groups. ‘Objective decisions’ in machine learning can therefore be objectively unfair. This is why Cathy O’Neil has called an algorithm “an opinion formalized in code”[1] – it does not simply provide objectivity, but works towards the (possibly unfair) goals for which it was written. But there is a remedy: it is possible to develop mechanisms for fair algorithmic decision-making. See for example the publications of Krishna P. Gummadi from the Max Planck Institute for Software Systems.
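What disparate mistreatment can look like in practice may be illustrated with a small, hypothetical check. The metric chosen here – comparing false positive rates across two groups – is one common choice in the fairness literature, not necessarily the one used by Gummadi and colleagues, and all numbers are invented:

```python
import numpy as np

def false_positive_rate(y_true, y_pred):
    """Share of actual negatives that the model wrongly labels positive."""
    negatives = (y_true == 0)
    return np.mean(y_pred[negatives] == 1) if negatives.any() else 0.0

# Toy true labels and model predictions for two demographic groups
y_true_a = np.array([0, 0, 0, 1, 1, 0, 0, 1])
y_pred_a = np.array([0, 1, 0, 1, 1, 0, 0, 1])
y_true_b = np.array([0, 0, 0, 1, 1, 0, 0, 1])
y_pred_b = np.array([1, 1, 0, 1, 0, 1, 0, 1])

# An algorithm tuned only for overall performance is not required to equalize these rates
print("FPR group A:", false_positive_rate(y_true_a, y_pred_a))
print("FPR group B:", false_positive_rate(y_true_b, y_pred_b))
```

Fair algorithmic decision-making, in this framing, means adding such group-wise error rates as explicit constraints or goals of the optimization rather than leaving them to chance.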
Figure: Example of an algorithm, taken from Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Boston 2006, p. 164.
B) In recent years, powerful new tools for Big Data analysis have been developed: neural nets and deep learning algorithms. The goal of these tools is predictive accuracy; they are hardware-hungry and data-hungry, but their strength lies in complex prediction problems where stochastic data models are obviously not applicable. The approach is therefore designed differently: what is observed is a set of x’s that go in and a subsequent set of y’s that come out. The challenge is to find a function f(x) such that for future x in a test set, f(x) will be a good predictor of y. The goal is an algorithm that produces results with strong predictive accuracy. The focus does not lie on the model by which the input x is transformed into the output y; it does not have to be a stochastic data model. Rather, the model is unknown, complex and mysterious; and irrelevant. This is why accurate prediction methods are described as complex “black boxes”; at least with neural nets, ‘algorithm’ has become a synecdoche for “black box”. Unlike with stochastic models, the goal is not interpretability but accurate prediction. And it is here, on the basis of an opaque data model, that neural nets and deep learning extract features from Big Data and identify patterns or clusters which have been invisible to the human analyst. It is striking that humans do not decide what those features are. The predictive analysis of Big Data can thus identify and magnify patterns hidden in the data. This is the case with many recent studies, for example the facial recognition system recognizing ethnicity developed by the company Kairos, or the Stanford study inferring sexual orientation by analysing people’s faces. What comes out here is that automatic feature extraction amplifies human bias. Much of the talk about “biased algorithms” stems from these findings. But are the algorithms really to blame for the bias, especially in the case of machine learning systems with a non-transparent data model?
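A minimal sketch of this “black box” workflow, on a small synthetic dataset: the only thing that is evaluated is whether f(x) predicts y well on held-out data, not whether the fitted model is interpretable. The architecture and parameters below are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Invented data: x goes in, y comes out; the true mapping stays hidden from the analyst
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X[:, 0] * X[:, 1] + X[:, 2] > 0).astype(int)  # a non-linear rule the net must discover

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# f(x): a small neural net judged purely by predictive accuracy on unseen x
f = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
f.fit(X_train, y_train)
print("accuracy on the test set:", f.score(X_test, y_test))
```

Nothing in this workflow asks which features the net has extracted or why; whatever regularities – or biases – are in the data will shape the predictions unseen.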
This question leads us back to Big Data. There are at least two ways in which the data used predetermine the outcomes. The first is Big Data with built-in bias which is then amplified by the algorithm: simply go to Google image search and search for the words “CEO” or “cleaner”. The second is a mismatch between the datasets used as training data for the algorithm and the data analysed subsequently. If, for example, there are no African American faces in the training set of a facial recognition system, you simply don’t know how the algorithm will behave when applied to images of African American faces. The appropriateness and the coverage of the dataset are therefore crucial.
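The coverage problem can be sketched with synthetic data: a classifier is trained on examples from one group only and then evaluated separately on both groups. The distributions and numbers below are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_group(n, shift):
    """Synthetic feature vectors; each group follows a different distribution."""
    X = rng.normal(loc=shift, size=(n, 5))
    y = (X[:, 0] + X[:, 1] > 2 * shift).astype(int)
    return X, y

X_a, y_a = make_group(400, shift=0.0)   # group present in the training data
X_b, y_b = make_group(400, shift=3.0)   # group absent from the training data

# Train only on group A, then test on both groups
model = LogisticRegression().fit(X_a[:300], y_a[:300])
print("accuracy on the covered group:  ", model.score(X_a[300:], y_a[300:]))
print("accuracy on the uncovered group:", model.score(X_b, y_b))
```

In a setup like this the accuracy on the uncovered group typically collapses towards chance – exactly the “you simply don’t know how it will behave” situation described above.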
The other point lies with data models and the modelling process. Models are always contextual, be they stochastic models with built-in assumptions about how the world works, or be they charged with context during the modelling process. This is why we should reflect on the social and historical contexts in which Big Data sets have been established, and on the way our models and modelling processes are being shaped. And maybe it is also time to reflect on the term “bias” itself, and to recall that it implies an impossible unbiased ideal …
[1] Cathy O’Neil, Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy, New York: Crown 2016, p. 53.