“The Trouble with Big Data”. New Book published by the KPLEX Project

One of the major terminological forces driving ICT development today is that of ‘big data.’ While the phrase may sound inclusive and integrative, in fact, big data approaches are highly selective, excluding any input that cannot be effectively structured, represented, or, indeed, digitised. The Trouble with Big Data explores the challenges society faces with big data, through the lens of culture rather than social, political or economic trends as demonstrated in the words we use, the values that underpin our interactions and the biases and assumptions that drive us.

Evolving from research undertaken in the Knowledge Complexity (KPLEX) project, in which Trinity College Dublin, the Data Archiving and Networked Services (DANS) of the Koninklijke Nederlandse Akademie van Wetenschappen, and Freie Universität Berlin were partners, this book focuses on areas such as data and language, data and sensemaking, data and power, data and invisibility, and big data aggregation. How cultural practices are displaced by, and yet simultaneously resist, mass datafication can be instructive for the critical observation of big data research and innovation.

This book is available as open access through the Bloomsbury Open programme at www.bloomsburycollections.com. It is funded by Trinity College Dublin, DARIAH-EU and the European Commission.

Big Data and AI for the Public Good?

European citizens still regard the provision of data on personal health with a great deal of scepticism. The growing interest of large companies in medical and patient data is at odds with citizens’ reservations about privacy and about the anonymization and pseudonymization of data that are intimately bound to their bodies. But in the past weeks, the intense debate around a possible contact tracing app in times of COVID-19 has again shown that people are willing to cooperate and to contribute even sensitive personal data if this endeavour supports common welfare. Furthermore, the debate has revealed that there need not be a trade-off between privacy and public welfare if certain conditions for the development of an app like Pepp-PT (Pan-European Privacy-Preserving Proximity Tracing) are met. First: encryption methods that enable compliance with the GDPR; the impossibility of de-anonymizing the data collected; and the renunciation of central servers (i.e. servers under state control). Second: transparency around the production process of the app, which enables non-governmental organizations like the German Chaos Computer Club (CCC) or Reporters Without Borders to inspect, test and evaluate the app’s source code deposited on GitHub. Both these points create the trust needed to involve a large part of Europe’s population. Third: the involvement of citizens at several points in the data collection and exchange process. Users not only need to download and activate the app, they also have to agree that their smartphone’s Bluetooth is switched on and can be used by the app. If they are diagnosed as Corona-positive, they need to grant the app the right to use this information; only then do all the other app users who have been in contact with the infected person receive a message about the diagnosis. On this basis, they can decide whether or not to turn to a medical authority for further testing. The third point essentially means that users are asked to act as responsible citizens, and that they are empowered by their involvement in the decisions within the process.
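The consent-based flow described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the actual Pepp-PT protocol: the class `Phone`, its methods, and the use of random hexadecimal tokens in place of Bluetooth identifiers are all invented for illustration.

```python
import secrets


class Phone:
    """Toy model of one app user; every name here is hypothetical."""

    def __init__(self, owner):
        self.owner = owner
        self.sent_ids = []      # random IDs this phone has broadcast
        self.heard_ids = set()  # IDs received from nearby phones, stored locally

    def broadcast(self):
        # A fresh random ID carries no information about the owner.
        ephemeral_id = secrets.token_hex(16)
        self.sent_ids.append(ephemeral_id)
        return ephemeral_id

    def hear(self, ephemeral_id):
        self.heard_ids.add(ephemeral_id)

    def publish_after_consent(self, consent):
        # Only an explicit decision by the diagnosed user releases the IDs.
        return list(self.sent_ids) if consent else []

    def was_exposed(self, published_ids):
        # Matching happens on the phone; the contact list never leaves it.
        return not self.heard_ids.isdisjoint(published_ids)


def meet(a, b):
    """Two phones in Bluetooth range exchange one ephemeral ID each."""
    b.hear(a.broadcast())
    a.hear(b.broadcast())


alice, bob, carol = Phone("alice"), Phone("bob"), Phone("carol")
meet(alice, bob)  # Alice and Bob were close to each other; Carol met nobody
published = alice.publish_after_consent(consent=True)
exposure = {p.owner: p.was_exposed(published) for p in (bob, carol)}
```

In this sketch Bob learns that he was near an infected person, Carol learns that she was not, and neither learns who the infected person is. Without Alice’s consent, nothing is published at all.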

This latter point – empowerment of the users or data donors – is most often neglected by policy makers as well as by the law. Data protection laws clearly identify data controllers, but ownership is ill-defined. This is a consequence of the fact that data can be copied without loss; customary conceptions of ownership (as in the case of a bicycle) therefore do not apply. Furthermore, in a world where data are quite often a by-product of human activities, the massive data collection by a few monopolists has obscured the sense of data ownership, for example among users of online media, and has thus introduced a feeling of being exploited by such data aggregators, accompanied by mistrust of data collection in general. This is also why the case of the Pepp-PT app seems to open new doors for users: they do not only donate personal data, they also get something back from the app. What they get back is no more than a tiny piece of information, but one they regard as precious – the information whether or not they have been in the company of an infected person.

In the scenario described above, there is only data involved; with many people using the app, this can even be Big Data. AI comes in if there is more data available: Data about the health of the people using this app, provided by healthcare services, or fitness data collected by apps and devices pertaining to the self-quantification domain. If these data could be aggregated, together with the data collected by the Pepp-PT app, enough information would be available to enable machine learning to answer such questions as: How do personal fitness and the course of the disease relate to each other? Which factors support a quicker cure? Which variables predict complications during the development of the disease? So this is where the full power of Big Data and Artificial Intelligence would unfold for the public good.

The successful use of AI needs computational power, big data, and the work of capable developers of algorithms. All three ingredients can usually be found in big tech companies, but most often not beyond them. This reflection reveals why a broader societal debate on “AI for the Public Good” has not yet been conducted: we are far from having the necessary infrastructure and financial means in place. But what could this look like? And how could algorithmic innovation for the common welfare be furthered?


Smart Citizen. Image © Jörg Lehmann 2019

Beyond the case of a COVID tracing app, the idea of “Smart Cities” provides a scenario in which some of the prerequisites for using Big Data and AI for the public welfare become visible. Smart Cities need data on many important topics: energy, mobility, climate, environment, garbage, and so forth. In the concept of Smart Cities, every urban dweller can easily imagine the need for solutions beyond the individual household, and her or his contribution to them. What would be the best time to turn on the washing machine because power is abundant at that moment? Where can I find a parking space? Which alternative means of transport are available to bring goods from A to B? If rainfall becomes rarer but heavier, what is the best way for a house to manage flooding and mitigate disasters? Should polders and cisterns be installed to counterbalance phases of drought? How can systems for the reuse of recyclable material and goods replace or complement the current garbage collection service?

Questions like these are negotiated in the concept of Smart Cities, and it immediately becomes obvious that a lot of data are already available (energy, mobility, climate), while others are missing (data on individual energy consumption, mobility, or patterns of daily use of resources). Furthermore, facilities providing the computational power needed as well as relevant algorithms are nowhere to be seen. Ouch. These seem to be the pain points where we as a society have to move forward in order not to relinquish algorithmic innovation to private companies. Data seem not to be the problem; all of us produce them ceaselessly. They could be collected, aggregated and managed within data cooperatives (the German language has the word “Genossenschaft” for it). An example from health research is the Swiss cooperative MIDATA, jointly created in 2015 by ETH Zurich and the Bern University of Applied Sciences. In such data cooperatives, personal data coming from the members of the cooperative are aggregated according to transparent governance principles and state-of-the-art encryption to ensure privacy. Furthermore, citizens (and the communities forming a city) are empowered to steer data use according to their motivations and preferences. These cooperatives can organize access to aggregated data that did not exist in this linked format before, since they consolidate data which have been stored in disparate silos before.

While the issue of missing individual data can be solved by data cooperatives, the infrastructural questions remain. There is a need for computational power as well as for data analysis and interpretation platforms or interfaces that enable individual or collective users to obtain insights derived from the available Big Data. These form the basis for decisions, for example through predictions of transport and energy use in the upcoming months and years; of the watering of plants in public streets and private gardens, or the usage of parks; and of the communities or facilities in need of recyclable material. Industry can contribute to such an implementation of Smart Cities by providing applications and interfaces on a pro bono basis; unions and associations like Data Science for Social Good Berlin might also help with data analysis. But such endeavours do not provide sustainable solutions for the lack of infrastructure. Policy makers should therefore promote the model of Smart Cities by funding distributed data infrastructures that pilot new data aggregation models; and private foundations should provide the necessary investments in high-performance computers and expensive algorithm developers until a proof of concept has convinced governments or the European Commission to provide long-term funding for such independent institutions. Only if all these conditions are met can civil society move forward and find ways to use AI for the public good.

The Future of History

If you go to an archive today and look for a personal heritage, what would you expect? Notebooks, letters, photographs, calendars, the documentation of printed publications, drafts of articles or books with hand-written comments, and the like. But what about born-digital contents? In the best case, you will find a backup of the hard disk drive of the person you are researching. The letters of earlier times have turned into emails, SMS and WhatsApp messages, Facebook entries, and Twitter posts; publications may have turned into online articles, PDF files, and blog contributions distributed all over the net. And that’s the point: what may once have been part of somebody’s personal heritage in paper format may nowadays have become part of Big Data. Yes, Big Data: they do not consist only of incredibly large tables, with variables and columns filled with numbers; a good part of Big Data consists simply of text files with social media contents (Facebook, Twitter, blogs, and so on).

At first glance, this sounds astonishing. The ‘private’ character of personal heritage seems to have vanished, the proportion of content available in the public sphere seems to have grown. It seems. But this is not surprising; we are reminded of one of the most influential studies on this topic, Jürgen Habermas’ “Structural Transformation of the Public Sphere”. What Habermas analyses here are the constant changes and shifts of the border between private and public. His examination starts in the late 18th and early 19th century, with the formation of an ideal type of bourgeois discourse marked by what Habermas calls “Räsonnement”. This reasoning aims at arguing, but also, in its pejorative form, at grumbling. The study begins with bourgeois members of the public meeting in salons, coffee houses, and literary round tables, pursuing reasoned exchange by contributing to journals, and practising subjectivity, individualism, and sentimentalism by writing letters and diaries destined either to be published (think of Gellert, Gleim, and Goethe) or to become part of personal heritage. Habermas draws long lines into the 20th century, where his book ends with the opposition between public and private characteristic of that time: employment is part of the public space, while leisure time is dedicated to private activities; letters and reading have become much less important, only the high bourgeoisie keeps its own libraries; mass media enhance the passivity of consumerism. This can also be read from personal heritages: the functional differentiation of a modern society created presumed experts for Räsonnement, like journalists, politicians, and publicists, who deliver opinion formation as a service, while editors and scientists professionalise the critique of politics. Habermas is overtly critical of the mass media and their potential for manipulation, since they reduce citizens to recipients without agency.

The last edition of Habermas’ book was printed in 1990. Since that time, a lot has changed, especially with the emergence of the internet. The border between public and private has moved, and the societal and political commitment of citizens has changed. Social media grant incredible agency and empower citizens. Hyperdigital hipsters are working in cafés, co-working spaces or start-ups, without having the private leisure time characteristic of the 20th century. Digital media network people across large spaces and form new transnational collectives. Anthropologist Arjun Appadurai has spoken of “diasporic public spheres” in this respect – small groups of people discussing face to face in pubs have been transformed into “communities of sentiment” grumbling at politics. Formerly silent recipients have mutated into counter publics, the sentimental bourgeois has become an enraged citizen. Habermas wouldn’t have liked this development, since his ideal type of Räsonnement doesn’t fit with current realities, and what he overlooked – the existence of large parts of society consisting of people who don’t participate in mass media discourses because they don’t want to – nowadays informs, for example, right-wing populism.

The Facebook network as a new public sphere

This latest transformation of the public sphere has consequences for archivists as well as historians. Archivists should regard social media contents as part of personal heritage and thus have to struggle with data management and storage problems. Historians (at least historians of the future) have to become familiar with quantitative analysis in order, for example, to examine Twitter networks and determine the impact of the Alt-Right movement on the presidential election in the U.S. Born-digital contents can therefore be seen as valuable parts of personal heritage. And from this point of view, there is certainly a lot that historians can contribute to discussions on Big Data.


Jürgen Habermas, The Structural Transformation of the Public Sphere: An Inquiry into a Category of Bourgeois Society. Cambridge: Polity 1989.

Arjun Appadurai, Modernity At Large: Cultural Dimensions of Globalization. Minneapolis: University of Minnesota Press 1996.


What the stories around data tell us

The strange thing about data is that they don’t speak for themselves. They need to be embedded in an interpretation to become palatable and understandable. This interpretation may be an analytic account, like the narrative synthesis told after performing a regression. It may also be a story in a more conventional sense, something like a success story of conquest, mastery, submission, or revelation of the kind that results from the storytelling commonly used in marketing.

The funny thing about data and stories is that it is easier to create a story out of data than to extract data from a story. It is easy to conceive of narratives built on top of data. Companies like Narrative Science automatically create market reports or sports reports out of the data they receive on a daily basis. On the other hand, it is difficult to imagine extracting statistical data about a soccer game from the up-to-the-minute scores of the same game presented on a website.

But data form a peculiar basis for stories. Think of the data which are collected when you visit a website – a typical basis of Big Data. These websites collect data on where you go, where you click, how long you stay there and so on; typically, they are behavioural data. What data scientists can get out of them are correlations; these data do not allow them to grasp the causal mechanisms behind the observed behaviour. They cannot see the whole person behind the behaviour; thus they cannot tell, for example, what customers feel during their visit to the website, why they reacted as they did, or where they see value in the offers they are presented with. Stories based on data thus come with a reduction of perspective, a reduction which perhaps cannot be avoided, since data are themselves a product of a process of estrangement typical of capitalism. The narrative might attenuate or conceal the limitations of the data, but it will not be able to reach far beyond the restrictions imposed.


But there is more to data presented in narratives than a mere reduction of perspective: unlike data collected in scientific disciplines like psychology and anthropology, which might enable representative statements about population groups, the results of Big Data analyses bring about a shift in perspective. By classifying people, grouping them according to their preferences, assessing the creditworthiness of customers, etc., Big Data allow human beings to be viewed from the perspective of a market. And in their ability to shape the offers presented on a website in real time and to adapt pricing mechanisms according to the IP address from which the website is accessed, and in their ability to build up systems of gratification that reward user actions seen as opportune by the infrastructure, data grant a point of view on customers which further strengthens commodification and economic governance. The fact that this point of view is equivalent to the perspective of the market becomes especially visible in the narratives accompanying these data.

Identities and Big Data

One of the amenities of Big Data (and the neural nets used to mine them) is their potential to reveal patterns of which we were not aware before. “Identity” in Big Data might be a variable contained in the dataset. This implies classification, and a feature which cannot easily be determined (like some cis-, trans-, inter-, and alia-identities) might have been put into the ‘garbage’ category of the “other”. Or identity might arise from a combination of features that was unknown beforehand. This was the case in a study which claimed that neural networks are able to detect sexual orientation from facial images. The claim did not go unanswered; a recent examination of this study by Google researchers showed that differences in culture, rather than in facial structure, were the features responsible for the result. In other words, the authors of the study had underestimated features that can easily be changed, like makeup, eyeshadow, facial hair, or glasses.

The debate between these data analysts exposes insights well known to humanists and social scientists. Identities differ with context; depending on the situation in which she or he is, a person may say “I am a mother of three children”, “I am a vegan”, or “I am a Muslim”. In fields marked by strong tensions and polarizations, identity statements can come close to confessions, and it might be wise to deliberate carefully about whether it is opportune to give the one or the other answer: “I am British”, “I am a European”.

It is not without irony that it is easy to list several identities that have been important throughout the past few centuries. Clan, tribe, nation, heritage, sect, kinship, blood, race – this is the typical stuff ethnographers and historians work on. Beyond the ones just named, identities like family, religion, culture, and gender are currently intensely debated in our postmodern, globalised world. Think of the discussions about cultural identities in a world characterized by migration; and think of gender identity as one of the examples which only recently has split itself into new forms and created a new vocabulary that tries to grasp the new changeableness: genderfluid, cisgender, transgender, agender, genderqueer, non-binary, two-spirit, etc.

Androgynous portrayal of Akhenaten, the first ruler to introduce monotheism in Ancient Egypt

It is obvious that identity is not a stable category. This is the irony of identity – the promise of an identical self dissolves in time and space, and any attempt to isolate, fix and homogenise an identity is doomed to failure. In postmodernity, identities are rather constructed out of interactions with the social environment – through constant negotiations, distinction and recognition, exclusion and belonging. Mutations and transformations are the expression of the tensions between vital elements that characterize identities. The path from Descartes’ “Cogito, ergo sum” to Rimbaud’s “I is another” is one of the best-known examples of such a transformative process.

Humanists and social scientists are experts in providing thick descriptions and large contexts in which identities can be located. They are used to relating to each other the different resources which feed into identities, and they are capable of building the contexts in which the relevant features and the relationships between them are embedded. In view of these potentials, it is astonishing that citizens with such an academic background do not speak with confidence of the power of their methods and get involved in Big Data debates about such elusive concepts as “identities”.

Data. A really short history

“One man’s noise is another man’s signal.” Edward Ng


Data are “givens”. They are artefacts. They don’t pre-exist in the world, but come to the world because we, the human beings, want them to do so. If you would imagine an intimate relationship between humans and data, you would say that data do not exist independently from humans. Hunters and gatherers collect the discrete objects of their desire, put them in order, administer them, store them, use them, throw them away, forget about where they were stored and at times uncover those escaped places again, for example during a blitheful snowball fight. At that time data were clearly divided from non-data – edible things were separated from inedible ones, helpful items from uninteresting ones. This selectivity may have benefited from the dependency on shifting availability of resources in a wider space.

Later on, when humankind left the forests, things became much more complicated. With the advent of agriculture and especially of trade, data became more meaningful. Furthermore, rulers became interested in knowledge about their sheep and therefore asked some of their accountants to keep records of their subordinates – the oldest census dates back to 700 B.C. As mathematics became more sophisticated from the 17th (probability) and 19th (statistics) centuries onwards, data about people, agriculture, and trade began to pile up, and it became more and more difficult to distinguish between relevant data (“signal”) and irrelevant data (“noise”). The distinction between the two simply became a matter of the question with which the data at hand were consulted.

With the advent of the industrial age, the concept of mechanical objectivity was introduced, and the task of data creation was delegated to machines which were constructed to collect the items in which humans were interested. Now data were available in huge amounts, and the need for organizing and ordering them became even more pressing. It is here that powerful schemes came into force: selection processes, categorizations, classifications, standards; variables prioritized as signal over others reduced to noise, thus creating systems of measurement and normativity intended to govern societies. They have been insightfully investigated in the book “Standards and their stories. How quantifying, classifying, and formalizing practices shape everyday life.”*

It was only later, at the beginning of the post-industrial age, that an alternative to this scheme- and signal-oriented approach was developed by simply piling up everything that may be of any interest, a task also often delegated to machines because of their patient effortlessness. The agglomeration of masses presupposes that storing is not a problem, neither in spatial nor in temporal terms. The result of such an approach is nowadays called “Big Data” – the accumulation of masses of (mostly observational) data for no specific purpose. Collecting noise in the hope of signal, without defining what is noise and what is signal. Fabricating incredibly large haystacks and assuming there are needles in them. Data as the output of a completely economised process with its classic capitalistic division of labour, including the alienation from their sources.

What is termed “culture” often resembles these haystacks. Archives are haystacks. The German Historical Museum in Berlin is located in the “Zeughaus”, literally the “house of stuff”,** stuffed with the remnants of history, a hayloft of history. Libraries are haystacks as well; if you are not bound to the indexes and registers called catalogues, if you are free to wander around the shelves of books and pick them out at random, you might get lost in an ocean of thoughts, in countless imaginary worlds, in intellectual universes. This is the fun of the explorers, conquerors and researchers: Never get bored through routine, always discover something new which feeds your curiosity. And it is here, within this flurry of noise and signal, within the richness of multisensory, kinetic and synthetic modes of access, where it becomes tangible that in culture noise and signal cannot be thought without the environment out of which they were taken, that both are the product of human endeavours, and that data are artefacts that cannot be understood without the context in which they were created.


*Martha Lampland, Susan Leigh Star (Eds.), Standards and their stories. How quantifying, classifying, and formalizing practices shape everyday life. Ithaca: Cornell Univ. Press 2009.
** And yes, it should be translated correctly as “armoury”.

Big Data and Biases

Big Data are, by definition, data too big to be inspected by humans. Their bigness has consequences: they are so large that the typical applications used to store, compute and analyse data are inappropriate. Processing them is often too much for a single computer; thus, a cluster of computers has to be used in parallel. Or the amount of data has to be reduced by mapping an unstructured dataset onto a dataset whose individual elements are key-value pairs; mathematical analyses can then be performed on a reduced selection of these key-value pairs (“MapReduce”). Even though Big Data are not collected in response to a specific research question, their sheer size (millions of observations x of y variables) promises to provide answers relevant for a large part of a society’s population. From a statistical point of view, large sample sizes boost significance, so that even tiny effects pass significance tests; the effect size is therefore the more important measure. On the other hand, large does not mean all; one has to be aware of the universe covered by the data. Statistical inference – conclusions drawn from data about the population as a whole – cannot easily be applied, because the datasets are not established in a way that ensures that they are representative. Ironically, bias in Big Data may therefore come from missing data, e.g. on those parts of the population which are not captured in the data collection process.
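To make the idea of key-value pairs more tangible, here is a minimal sketch of the MapReduce pattern in plain Python, using the customary word-count example. It runs on one machine and only imitates the division of labour; the function names `map_phase`, `shuffle` and `reduce_phase` and the two sample sentences are assumptions made for illustration, not part of any particular framework.

```python
from collections import defaultdict


def map_phase(document):
    """Turn unstructured text into (key, value) pairs: one (word, 1) per word."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each group of values into a single value, here a sum."""
    return {key: sum(values) for key, values in grouped.items()}


documents = ["big data big promises", "data without context"]
# On a real cluster, each document (or chunk) would be mapped on a different machine.
pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = reduce_phase(shuffle(pairs))
```

The point of the pattern is that the mapping step needs no knowledge of the other documents, which is what allows it to be spread over many computers.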

But biases may also arise in the process of analysing Big Data. This, too, has to do with the substantial size of the datasets; standard software may be unable to handle them. Beyond parallel computing and MapReduce, the use of machine learning seems to provide solutions. Machine learning designates algorithms that can learn from and make predictions on data by building a model from sample inputs. It is a type of artificial intelligence in which the system learns from lots of examples; results – such as patterns or clusters – become stronger with more evidence. This is why Big Data and machine learning seem to go hand in hand. Machine learning can roughly be divided into A) analytic techniques which use stochastic data models, most often classification and regression in supervised learning; and B) predictive approaches, where the data mechanism is unknown, as is the case with neural nets and deep learning. In both cases, biases may result from the processing of Big Data.
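What “building a model from sample inputs” means can be shown with a deliberately tiny example. The sketch below uses a one-nearest-neighbour rule, which merely stands in for the far more complex methods discussed here; the function `fit_nearest_neighbour` and all numbers and labels are invented for illustration.

```python
def fit_nearest_neighbour(train_x, train_y):
    """Return a predictor f that copies the label of the closest known example."""
    def f(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return f


# Sample inputs the system learns from (invented values).
train_x = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
train_y = ["low", "low", "low", "high", "high", "high"]
f = fit_nearest_neighbour(train_x, train_y)

# New data the predictor has never seen, with the answers held back for checking.
test_x = [1.5, 2.5, 8.5, 9.5]
test_y = ["low", "low", "high", "high"]
predictions = [f(x) for x in test_x]
accuracy = sum(p == y for p, y in zip(predictions, test_y)) / len(test_y)
```

Note that the predictor contains no explicit theory of why a value is “low” or “high”; it is judged only by how often it is right on the held-back data.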

A) The goal of statistical modelling is to find a model which allows quantitative conclusions to be drawn from data. Its advantage is that the data model is transparent and comprehensible to the analyst. However, what sounds objective (since it is ‘based on statistics’) need not be correct (if the model is a poor emulation of reality, the conclusions may be wrong), nor need it be fair: the algorithms may simply not be written in a manner which defines fairness or an even distribution as a goal of the problem-solving procedure. Machine learning then commits disparate mistreatment: the algorithm optimizes the discrimination for the whole population, but it does not look for a fair distribution of this discrimination. ‘Objective decisions’ in machine learning can therefore be objectively unfair. This is why Cathy O’Neil has called an algorithm “an opinion formalized in code”[1] – it does not simply provide objectivity, but works towards the (unfair) goals for which it was written. But there is a remedy; it is possible to develop mechanisms for fair algorithmic decision making. See for example the publications of Krishna P. Gummadi from the Max Planck Institute for Software Systems.
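One common way to make disparate mistreatment visible is to compare error rates between groups instead of looking only at the overall error rate. The following sketch does this for the false positive rate; the groups “A” and “B”, the records and the function name are invented toy material, not results from any real system.

```python
def false_positive_rate(records):
    """Share of truly negative cases that the classifier flagged as positive."""
    negatives = [r for r in records if r["actual"] == 0]
    flagged = [r for r in negatives if r["predicted"] == 1]
    return len(flagged) / len(negatives)


# Invented toy decisions: 'actual' is the true outcome, 'predicted' the algorithm's call.
decisions = (
    [{"group": "A", "actual": 0, "predicted": 0}] * 9
    + [{"group": "A", "actual": 0, "predicted": 1}] * 1
    + [{"group": "B", "actual": 0, "predicted": 0}] * 6
    + [{"group": "B", "actual": 0, "predicted": 1}] * 4
)

fpr_by_group = {
    g: false_positive_rate([r for r in decisions if r["group"] == g])
    for g in ("A", "B")
}
overall_fpr = false_positive_rate(decisions)
```

In this toy data the overall false positive rate looks moderate, while members of group B are wrongly flagged four times as often as members of group A. An algorithm that only minimises the overall figure has no reason to notice this.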


Example of an algorithm, taken from: Pang-Ning Tan, Michael Steinbach, Vipin Kumar, Introduction to Data Mining, Boston 2006, p. 164.

B) In recent years, powerful new tools for Big Data analysis have been developed: neural nets and deep learning algorithms. The goal of these tools is predictive accuracy; they are hardware-hungry and data-hungry, but have their strength in complex prediction problems where stochastic data models are clearly not applicable. The approach is therefore designed differently here: what is observed is a set of x’s that go in and a subsequent set of y’s that come out. The challenge is to find an algorithm f(x) such that, for future x in a test set, f(x) will be a good predictor of y. The focus does not lie on the model by which the input x is transformed into the output y; it does not have to be a stochastic data model. Rather, the model is unknown, complex and mysterious, and irrelevant. This is why accurate prediction methods are addressed as complex “black boxes”; at least with neural nets, ‘algorithm’ is used as a synecdoche for “black box”. Unlike with stochastic models, the goal is not interpretability, but accurate information. And it is here, on the basis of an opaque data model, that neural nets and deep learning extract features from Big Data and identify patterns or clusters which have been invisible to the human analyst. It is fascinating to see that humans don’t decide what those features are. The predictive analysis of Big Data can identify and magnify patterns hidden in the data. This is the case with many recent studies, for example the facial recognition system recognizing ethnicity which has been developed by the company Kairos, or the Stanford study inferring sexual orientation by analysing people’s faces. What emerges here is that the automatic feature extraction amplifies human bias. A lot of the talk about “biased algorithms” results from these findings. But are the algorithms really to blame for the bias, especially in the case of machine learning systems with a non-transparent data model?

This question leads us back again to Big Data. There are at least two possible ways in which the data used predetermine the outcomes: The first is Big Data with built-in bias which is then amplified by the algorithm. Simply go to the Google image search and search for either “CEO” or “cleaner”. The second is the difference between the data sets used as training data for the algorithm and the data analysed subsequently. If you don’t have, for example, African American faces in a training set on facial recognition, you simply don’t know how the algorithm will behave when applied to images with African American faces. Therefore, the appropriateness and the coverage of the data set are crucial.
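The second mechanism, a gap between training data and the data analysed later, can be illustrated with a deliberately tiny sketch. Everything in it is an assumption made for illustration: the two groups, their cutoffs and the one-feature threshold "model" are hypothetical and stand in for no real facial recognition system. The point is only that a model fitted on one group can look perfect there and still fail on a group it never saw.

```python
from typing import List, Tuple

Sample = Tuple[int, int]  # (feature value, true label)


def make_group(values: range, cutoff: int) -> List[Sample]:
    """Label a sample 1 when its feature exceeds the group's own cutoff."""
    return [(x, int(x > cutoff)) for x in values]


def fit_threshold(train: List[Sample]) -> int:
    """Pick the threshold that makes the fewest errors on the training data."""
    def errors(t: int) -> int:
        return sum(int(x > t) != y for x, y in train)

    candidates = sorted({x for x, _ in train})
    return min(candidates, key=errors)


def accuracy(threshold: int, data: List[Sample]) -> float:
    """Share of samples for which the thresholded prediction is correct."""
    return sum(int(x > threshold) == y for x, y in data) / len(data)


# Hypothetical groups: same task, but the feature is distributed differently.
group_a = make_group(range(0, 11), cutoff=5)    # present in the training set
group_b = make_group(range(10, 21), cutoff=15)  # absent from the training set

threshold = fit_threshold(group_a)  # trained on group A only
acc_a = accuracy(threshold, group_a)
acc_b = accuracy(threshold, group_b)

print(f"learned threshold: {threshold}")
print(f"accuracy on group A (seen in training): {acc_a:.2f}")
print(f"accuracy on group B (never seen):       {acc_b:.2f}")
```

Nothing in the fitting step signals a problem; the failure on the second group only becomes visible once that group is evaluated separately.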

The other point lies with data models and the modelling process. Models are always contextual, be they stochastic models with built-in assumptions about how the world works; or be they charged with context during the modelling process. This is why we should reflect on the social and historical contexts in which Big Data sets have been established; and the way our models and modelling processes are being shaped. And maybe it is also timely to reflect on the term “bias”, and to recollect that it implies an impossible unbiased ideal …


[1] Cathy O’Neil, Weapons of Math Destruction. How Big Data increases inequality and threatens Democracy, New York: Crown 2016, p.53.

The K-PLEX project at the European Big Data Value Forum 2017

Mike Priddy (DANS, 2nd from right in the image) represented the K-PLEX project at the European Big Data Value Forum 2017 Conference in a panel on privacy-preserving technologies. Read here the statements he made in answer to the three questions posed.


Question 1: There is an apparent trade-off between big data exploitation and privacy – do you agree or not?

  • Privacy is only one part of Identity. There needs to be respect for the individual’s right to build their identity upon a rich and informed basis.
  • The right not to know should also be considered. People have a right to their own levels of knowledge.
  • Privacy is broader than the individual. Confidential data exists in and can affect: family, community, & company/organisations. The self is relational, it is not individual, it produces social facts and consequences.
  • Trust in data use & third party use – where should the accountability be?
  • There is the challenge of transparency versus accountability; just making all data available may obfuscate the accountability.
  • Accountability versus responsibility? Where does the ethical responsibility lie with human & non-human actors?
  • Anonymisation is still an evolving ‘science’ – the effectiveness of anonymising processes is not always well and broadly understood. Aggregation may not give the results that users want or can use; it may protect the individual, but not necessarily a community or family.
  • Anonymity may be an illusion; we don’t understand how minimal the data may need to be in order to expose identity. DoB, Gender & Region are enough to be disclosive for the majority of a population.
  • Individuals, in particular young or vulnerable individuals, may not be in a position to defend themselves.
  • This means that big data may need to exclude communities & people with niche problems.
  • Black boxes of ML & NNets don’t allow people to understand or track use or misuse or misinformation – wrong assertions being made: you cannot give informed consent under these conditions.
  • IoT and other technologies (facial recognition) mean that there is possibly no point at which informed consent can be given.
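The point above that date of birth, gender and region together can be disclosive is easy to check on any table of records. The following is a minimal sketch, assuming a handful of invented records; the field names and the helper `unique_share` are hypothetical, and the numbers say nothing about a real population. It counts how many records are the only one with their combination of values.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Invented toy records; no real persons or statistics are represented.
records: List[Dict[str, str]] = [
    {"dob": "1980-03-14", "gender": "F", "region": "North"},
    {"dob": "1980-03-14", "gender": "F", "region": "North"},
    {"dob": "1975-11-02", "gender": "M", "region": "South"},
    {"dob": "1991-07-30", "gender": "F", "region": "East"},
    {"dob": "1991-07-30", "gender": "M", "region": "East"},
    {"dob": "1962-01-09", "gender": "M", "region": "West"},
]


def unique_share(rows: List[Dict[str, str]], keys: Sequence[str]) -> float:
    """Share of rows whose combination of `keys` occurs exactly once."""
    combos = Counter(tuple(row[k] for k in keys) for row in rows)
    unique = sum(1 for row in rows if combos[tuple(row[k] for k in keys)] == 1)
    return unique / len(rows)


# Each attribute alone singles out few people; combined they single out most.
for keys in (["gender"], ["region"], ["dob", "gender", "region"]):
    print(f"{' + '.join(keys):<24} unique share: {unique_share(records, keys):.2f}")
```

A record that is unique on such a combination can be re-identified by anyone who knows those attributes from another source, even though no name is stored in the table.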

Strategies for meeting these issues:

  • There are well established strategies to deal with disclosure of confidential data in the Social Sciences and Official Statistics: such as output checking, off the grid access, remote execution (with testable data), secure rooms etc. Checks and balances are needed (a pause) before it goes out – this is a part of oversight and governance.
  • Individuals should be able to see when these processes are triggered, and decide if it is disclosive and whether that is appropriate.
  • More information about how data is used, shared, processed must be made available to the data creator (in a way they can use it).
  • Meeting the ISO 27001 standard in your data handling and procedures within your organisation is a good start.

Question 2: Regarding the level of development, privacy preserving big data technologies still have a long way to go – do you agree or not?

  • Biases are baked in. There isn’t enough differentiation between kinds of data: mine, yours, raw, cleaned, input, output – data is seen as just data and processed without narrative or context. We do not need privacy by design; we need humanity at the centre of design and respect for human agency.
  • Too often we are only concerned about privacy when it becomes a problem: privacy/confidentiality is NOT an obsolete concept.

Question 3: Individual application areas differ considerably with regard to the difficulty of meeting the privacy requirements – do you agree or not?

  • The problem is the way the question is formulated. By looking at application areas we are basically saying the problem is superficial. It is not. It is fundamental.
  • It has become very hard to opt out of everything. We cannot cut all of our social ties because of network effects.
  • Technology is moving faster than society can cope with and understand how data is being used. This is not a new phenomenon; we can see similar challenges in the historical record.
  • Privacy needs to be understood as a public good; there must be the right to be forgotten, but also the right not to be recorded.
  • Data citizenship is needed: Citizens need to be involved enough to be able to make better decisions about providing confidential/personal data and about what happens to their data: what it means and what happens when you fill in that form.


On the aura of Big Data

People who know very little about technology seem to attribute an aura of “objectivity” and “impartiality” to Big Data and analyses based on them. Statistics and predictive analytics give the impression, to the outside observer, of being able to reach objective conclusions based on massive samples. But why exactly is that so? How has it come about that a societal discourse has ascribed this aura to Big Data analyses?

Since most people conceive of Big Data as tables filled with numbers which have been collected by machines observing human behavior, there are at least two points intermingled in this peculiar aura of Big Data: The belief that numbers are impartial and preinterpretive, and the conviction that there exists something like mechanical objectivity. Both concepts have a long history, and it is therefore wise to consult cultural historians and historians of science.

With respect to the claim that numbers are theory-free and value-free, one can consult the book “A History of the Modern Fact”[1] by Mary Poovey. Poovey traces the history of the modern epistemological assumption that numbers are free of an interpretive dimension, and she tells the story of how description came to seem separate from interpretation. In analyzing historical debates about induction and by studying authors such as Adam Smith, Thomas Malthus, and William Petty, Poovey points out that “Separating numbers from interpretive narrative, that is, reinforced the assumption that numbers were different in kind from the analytic accounts that accompanied them.” (XV) If nowadays many members of our societies imagine that observation can be separated from analysis and that numbers guarantee value-free description, this is the result of the long historical process examined by Poovey. But seen from an epistemological point of view this is not correct, because numbers are interpretive – they embody theoretical assumptions about what should be counted, they depend on categories, entities and units of measurement established before counting has begun, and they contain assumptions on how one should understand material reality.

The second point, mechanical objectivity, has been treated by Lorraine Daston and Peter Galison in their book on “Objectivity”; it contains a chapter of the same name.[2] Daston and Galison focus on photography as a primary metaphor for the objectivity ascribed to a machine. Alongside this example, they describe mechanical objectivity as “the insistent drive to repress the willful intervention of the artist-author, and to put in its stead a set of procedures that would, as it were, move nature to the page through a strict protocol, if not automatically.” (121) Both authors see two intertwined processes at work: On the one hand, the separation of the development and activities of machines from the human beings who conceived them, with the result that machines were credited with freedom from the willful interventions that had come to be seen as the most dangerous aspects of subjectivity. And on the other hand, the development of an ethics of objectivity, which called for a morality of self-restraint in order to keep researchers from interventions and interferences like interpretation, aestheticization, and theoretical overreaching. Thus machines – be they cameras, sensors or electronic devices – have become emblematic of the elimination of human agency.

If the aura of Big Data is based on these conceptions of an “impartiality” of numbers and of data collected by “objectively” working machines, there remains little space for human agency. But this aura is the expression of a false consciousness, the consequences of which can easily be seen: If analyses based on Big Data are taken as ground truth, it is no wonder that no space is opened up for public discussion, for decisions made independently by citizens, and for a democratically organized politics in which the processes where Big Data play an important role are actively shaped.

[1] Mary Poovey, A History of the Modern Fact. Problems of Knowledge in the Sciences of Wealth and Society, Chicago / London: The University of Chicago Press 1998.

[2] Lorraine Daston, Peter Galison, Objectivity, New York: Zone Books 2007.

Statistical Modeling: The Two Cultures

In this article Leo Breiman describes two approaches in statistics: One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.

Statisticians in applied research consider data modeling as the template for statistical analysis and, within their range of multivariate analysis tools, focus on discriminant analysis and logistic regression for classification and on multiple linear regression for regression. This approach has the advantage that it produces a simple and understandable picture of the relationship between the input variables and the response. But the assumption that the data model is an emulation of nature is not necessarily right and can lead to wrong conclusions.

The algorithmic approach uses neural nets and decision trees, with predictive accuracy as the criterion for judging the quality of the results of analysis. This approach does not apply data models to explain the relationship between input variable x and output variable y, but treats this relationship as a black box. Hence the focus is on finding an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y. While this approach has seen major advances in machine learning, it lacks interpretability of the relationship between predictor and response variables.
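The contrast between the two cultures can be made concrete with a small sketch. It is not taken from Breiman’s article: the toy data (y = x²) are invented, and a k-nearest-neighbour predictor is swapped in for neural nets or decision trees purely because it fits in a few lines. The data-model side fits a straight line, which yields two readable parameters; the algorithmic side only returns a function f(x) that is judged by its error on a held-out test set.

```python
from typing import Callable, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Data-model culture: least-squares line, returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def knn_predictor(xs: List[float], ys: List[float], k: int = 2) -> Callable[[float], float]:
    """Algorithmic culture: average the k nearest training responses."""
    def predict(x: float) -> float:
        nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
        return sum(y for _, y in nearest) / k

    return predict


def mse(predict: Callable[[float], float], xs: List[float], ys: List[float]) -> float:
    """Mean squared prediction error on the given data."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Invented toy data: the true mechanism (y = x^2) is unknown to both methods.
train_x = [float(x) for x in range(11)]
train_y = [x ** 2 for x in train_x]
test_x = [0.5, 2.5, 4.5, 6.5, 8.5]
test_y = [x ** 2 for x in test_x]

slope, intercept = fit_line(train_x, train_y)
linear_mse = mse(lambda x: slope * x + intercept, test_x, test_y)
knn_mse = mse(knn_predictor(train_x, train_y), test_x, test_y)

print(f"linear model: y = {slope:.1f} * x + {intercept:.1f}, test MSE = {linear_mse:.4f}")
print(f"k-NN black box: no parameters to read, test MSE = {knn_mse:.4f}")
```

On this toy example the straight line is easy to interpret but rests on a wrong assumption about the data mechanism, while the nearest-neighbour predictor is more accurate on the test set and offers no comparable summary of how x relates to y.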

This article was published in 2001, when the term “Big Data” was not yet on everybody’s lips. But by delineating two different cultures of data analysis and weighing the pros and cons of each approach, it makes the differences between big data analysis and stochastic data models understandable even to laypeople.

Leo Breiman, Statistical Modeling: The Two Cultures. In: Statistical Science, Vol. 16 (2001), No. 3, 199-231. Freely available online here.