
Big Data and AI for the Public Good?

The provision of data containing information on personal health is still regarded with a lot of scepticism by European citizens. The growing interest of large companies in medical and patient data stands in opposition to the reservations of citizens about the privacy, anonymization and pseudonymization of data which are intimately bound to their bodies. But in the past weeks, the intense debate around a possible contact tracing app in times of COVID-19 has again shown that people are willing to cooperate and contribute even sensitive personal data if this endeavour supports the common welfare. Furthermore, the debate has revealed that there need not be a trade-off between privacy and public welfare if certain conditions for the development of an app like Pepp-PT (Pan-European Privacy-Preserving Proximity Tracing) are met. First: encryption methods that enable compliance with the GDPR; the impossibility of de-anonymizing the data collected; and the renunciation of central servers (i.e. servers under state control). Second: transparency around the production process of the app, which enables non-governmental organizations like the German Chaos Computer Club (CCC) or Reporters without Borders to inspect, test and evaluate the source code of the app published on GitHub. Both these points create the trust needed to involve a large part of Europe’s population. Third: the involvement of citizens at several points in the data collection and exchange process. Users not only need to download and activate the app, but also have to agree that Bluetooth is activated on their smartphone and can be used by the app. If they are diagnosed as Corona-positive, they need to grant the right to use this information in the app; only afterwards do all the other persons who use the app and have been in contact with the infected person receive a message about the diagnosis. On this basis, they can decide whether or not to contact a medical authority for further testing. The third point essentially means that users are asked to act as responsible citizens who take up responsibility – and are empowered by their involvement in the decisions within the process.
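The consent steps described above can be made concrete with a small toy model. The following Python sketch is an illustration only and not the actual Pepp-PT protocol: the class `Phone`, the helper `meet`, the use of random identifiers and the three user names are all assumptions made for this example, and the channel through which released identifiers reach other phones is deliberately left out.

```python
import secrets


class Phone:
    """Toy model of one app user (hypothetical; not the real Pepp-PT design)."""

    def __init__(self, owner):
        self.owner = owner
        self.bluetooth_enabled = False  # consent step 1: the user must switch this on
        self.sent_ids = []              # random IDs this phone has broadcast
        self.heard_ids = set()          # random IDs received from nearby phones

    def enable_bluetooth(self):
        self.bluetooth_enabled = True

    def broadcast(self):
        """Return a fresh random ID, or None if Bluetooth use was not granted."""
        if not self.bluetooth_enabled:
            return None
        ephemeral_id = secrets.token_hex(16)
        self.sent_ids.append(ephemeral_id)
        return ephemeral_id

    def hear(self, ephemeral_id):
        """Store an ID received from a nearby phone, locally on this device."""
        if self.bluetooth_enabled and ephemeral_id is not None:
            self.heard_ids.add(ephemeral_id)

    def release_ids(self, tested_positive, user_consents):
        """Consent step 2: IDs leave the phone only after a positive test AND consent."""
        if tested_positive and user_consents:
            return list(self.sent_ids)
        return []

    def was_exposed(self, published_ids):
        """Compare released IDs with locally stored ones; no names are involved."""
        return bool(self.heard_ids.intersection(published_ids))


def meet(a, b):
    """Two phones in proximity exchange one random ID each."""
    b.hear(a.broadcast())
    a.hear(b.broadcast())


alice, bob, carol = Phone("Alice"), Phone("Bob"), Phone("Carol")
for phone in (alice, bob, carol):
    phone.enable_bluetooth()

meet(alice, bob)  # Carol never comes near Alice
published = alice.release_ids(tested_positive=True, user_consents=True)

print(bob.was_exposed(published))    # True
print(carol.was_exposed(published))  # False
```

In this sketch nothing is released without the user's explicit agreement: a phone without Bluetooth permission neither sends nor stores identifiers, and a positive diagnosis alone does not trigger any notification.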

This latter point – the empowerment of the users or data donors – is most often neglected by policy makers as well as by the law. Data protection laws clearly identify data controllers, but ownership is ill defined. This is a consequence of the fact that data can be copied without loss; customary conceptions of ownership (as in the case of a bicycle) therefore do not apply. Furthermore, in a world where data are quite often a by-product of human activities, the massive data collection by a few monopolists has obscured the sense of data ownership, e.g. among users of online media, and has thus introduced a feeling of being exploited by such data aggregators as well as an accompanying mistrust of data collection in general. This is also why the case of the Pepp-PT app seems to open new doors for users: they not only donate personal data, they also get something back from the app. What they get back is no more than a tiny piece of information, but one which they regard as precious – the information whether or not they have been in the company of an infected person.

In the scenario described above, there is only data involved; with many people using the app, this can even be Big Data. AI comes in if there is more data available: Data about the health of the people using this app, provided by healthcare services, or fitness data collected by apps and devices pertaining to the self-quantification domain. If these data could be aggregated, together with the data collected by the Pepp-PT app, enough information would be available to enable machine learning to answer such questions as: How do personal fitness and the course of the disease relate to each other? Which factors support a quicker cure? Which variables predict complications during the development of the disease? So this is where the full power of Big Data and Artificial Intelligence would unfold for the public good.

The successful use of AI needs computational power, big data, and the work of capable developers of algorithms. All three ingredients can usually be found in big tech companies, but most often not beyond them. This reflection reveals why a broader societal debate on “AI for the Public Good” has not yet been conducted: We are far from having the necessary infrastructure and financial means in place. But what could this look like? And how could algorithmic innovation for the common welfare be furthered?


Smart Citizen. Image © Jörg Lehmann 2019

Beyond the case of a COVID tracing app, the idea of “Smart Cities” provides a scenario in which some of the prerequisites for the use of Big Data and AI serving the public welfare become visible. For Smart Cities, data on quite a lot of very important topics are needed: energy, mobility, climate, environment, garbage, and so forth. In the concept of Smart Cities, every urban dweller can easily imagine the need for solutions beyond the individual household and her or his contribution to them. What would be the best time to turn on the washing machine because power is available in abundance at that time? Where can I find a parking space? Which alternative means of transport are available to bring goods from A to B? If rainfall becomes rarer but heavier, what is the best way for a house to manage flooding and mitigate disasters? Should polders and cisterns be installed to counterbalance phases of drought? How can systems for the reuse of recyclable material and goods replace or complement the current garbage collection service?

Questions like these are negotiated in the concept of Smart Cities, and it immediately becomes obvious that a lot of data are already available (energy, mobility, climate), while others are missing (data on individual energy consumption, mobility, or patterns of daily use of resources). Furthermore, facilities providing the computational power needed as well as relevant algorithms are nowhere to be seen. Ouch. These seem to be the pain points where we as a society have to move forward in order not to relinquish algorithmic innovation to private companies. Data seem not to be the problem; all of us produce them ceaselessly. They could be collected, aggregated and managed within data cooperatives (the German language has the word “Genossenschaft” for it). An example from health research is the Swiss cooperative MIDATA, jointly created in 2015 by ETH Zurich and the Bern University of Applied Sciences. In such data cooperatives, personal data coming from the members of the cooperative are aggregated according to transparent governance principles and state-of-the-art encryption to ensure privacy. Furthermore, citizens (and the communities forming a city) are empowered to steer data use according to their motivations and preferences. These cooperatives can organize access to aggregated data that did not exist in this linked format before, since they consolidate data which have been stored in disparate silos before.

While the issue of missing individual data can be solved by data cooperatives, the infrastructural questions remain. There is a need for computational power as well as for data analysis and interpretation platforms or interfaces that enable individual or collective users to obtain insights derived from the available Big Data. Such insights form the basis for decisions, for example through predictions on transport and energy use in the upcoming months and years; on the watering of plants in public streets and private gardens, or the usage of parks; and on the communities or facilities in need of recyclable material. Industry can contribute to such an implementation of Smart Cities by providing applications and interfaces on a pro bono basis; unions and associations like Data Science for Social Good Berlin might also be helpful in data analysis. But such endeavours do not provide sustainable solutions for the lack of infrastructure. Policy makers should therefore promote the model of Smart Cities by funding distributed data infrastructures that pilot new data aggregation models; and private foundations should provide the necessary investments in high-performance computers and expensive algorithm developers until a proof of concept has convinced governments or the European Commission to provide long-term funding for such independent institutions. Only if all these conditions are met can civil society move forward and find ways to use AI for the public good.

KPLEX Presented at DH 2019 Conference

Jörg Lehmann and Jennifer Edmond were very pleased to have been given a chance to present some learnings from the KPLEX project to an engaged audience at the DH 2019 conference on 12th July 2019.  The paper was entitled “Digital Humanities, Knowledge Complexity and the Six ‘Aporias’ of Digital Research,” and explored a number of the cultural clashes we found between the perspectives in our interviews.  While DH was never a planned audience for our results, the response today convinced us that there is still much to mine from our interviews and insights!

The slides from the presentation can be viewed here.

ACDH Lecture 4.1: What can Big Data Research Learn from the Humanities?


Jennifer Edmond
Director of the Trinity College Dublin Centre for Digital Humanities and Principal Investigator on the KPLEX Project

One of the major terminological forces driving ICT development today is that of ‘big data.’ While the phrase may sound inclusive and integrative, in fact, ‘big data’ approaches are highly selective, excluding, as they do, any input that cannot be effectively structured, represented, or, indeed, digitised. Data of this messy, dirty sort is precisely the kind that humanities and cultural researchers deal with best, however.  In particular, knowledge creation and information management approaches from the humanities shed light on gaps such as: the manner in which data that are not digitised or shared become ‘hidden’ from aggregation systems; the fact that data are human created, and lack the objectivity often ascribed to the term; and the subtle ways in which data that are complex almost always become simplified before they can be aggregated. Humanities insight also exposes the problematic discursive strategies that big data research deploys, strategies that can be seen reflected not only in the research outputs of the field, but also in many of the urgent challenges our digitised society faces.

The lecture is available to view here: https://www.youtube.com/watch?v=E2vdFBo9wB4

The Future of History

If you go to an archive today and look for a personal heritage, what would you expect? Notebooks, letters, photographs, calendars, the documentation of printed publications, drafts of articles or books with hand-written comments, and the like. But how about born-digital contents? In the best case, you will find a backup of the hard disk drive of the person you are researching. The letters of earlier times have changed to emails, SMS and WhatsApp messages, Facebook entries, Twitter posts; publications may have changed to online articles, PDF files, and blog contributions distributed all over the net. And that’s the point: What may have been part of somebody’s personal heritage in paper format may nowadays have become part of Big Data. Yes, Big Data: They do not consist only of incredibly large tables, with rows and columns filled with numbers; a good part of Big Data consists simply of text files with social media contents (Facebook, Twitter, blogs, and so on).

At first glance, this sounds astonishing. The ‘private’ character of personal heritage seems to have vanished, and the proportion of content available in the public sphere seems to have grown. It seems. But this is not surprising; we are reminded of one of the most influential studies on this topic, Jürgen Habermas’ “Structural Transformation of the Public Sphere”. What Habermas analyses here are the constant changes and shifts of the border between private and public. His examination starts in the late 18th and early 19th century, with the formation of an ideal type of bourgeois discourse marked by what Habermas calls “Räsonnement”. This reasoning aims at arguing, but also, in its pejorative form, at grumbling. The study begins with bourgeois members of the public meeting in salons, coffee houses, and literary round tables, pursuing reasoned exchange by contributing to journals, and practising subjectivity, individualism, and sentimentalism by writing letters and diaries destined either to be published (think of Gellert, Gleim, and Goethe) or to become part of personal heritage. Habermas draws long lines into the 20th century, where his book ends with the opposition between public and private characteristic of that time: Employment is part of the public space, while leisure time is dedicated to private activities; letters and reading have become much less important, and only the high bourgeoisie keep their own libraries; mass media enhance the passivity of consumerism. This can also be read from personal heritages: The functional differentiation of a modern society created presumed experts for Räsonnement, like journalists, politicians, and publicists, who deliver opinion formation as a service, while editors and scientists professionalise the critique of politics. Habermas is overtly critical of the mass media and their potential for manipulation, since they reduce citizens to recipients without agency.

The last edition of Habermas’ book was printed in 1990. Since that time, a lot has changed, especially with the emergence of the internet. The border between public and private has been moved, and the societal-political commitment of citizens has changed. Social media grant citizens incredible agency and empower them. Hyperdigital hipsters are working in cafés, co-working spaces or start-ups, without having the private leisure time characteristic of the 20th century. Digital media network people across large spaces and form new transnational collectives. Anthropologist Arjun Appadurai has spoken of “diasporic public spheres” in this respect – small groups of people discussing face to face in pubs have been transformed into “communities of sentiment” grumbling at politics. Formerly silent recipients have mutated into counter publics, and the sentimental bourgeois has become an enraged citizen. Habermas wouldn’t have liked this development, since his ideal type of Räsonnement doesn’t fit with current realities, and what he overlooked – the existence of large parts of society consisting of people who don’t participate in mass media discourses because they don’t want to – nowadays informs e.g. right-wing populism.

The Facebook network as a new public sphere

This latest transformation of the public sphere has consequences for archivists as well as historians. Archivists should regard social media contents as part of personal heritages and thus have to struggle with data management and storage problems. Historians (at least historians of the future) have to become familiar with quantitative analysis in order to examine, for example, Twitter networks and determine the impact of the Alt-Right movement on the presidential election in the U.S. Born-digital contents can therefore be seen as valuable parts of personal heritage. And from this point of view, there is certainly a lot that historians can contribute to discussions on Big Data.

 

Jürgen Habermas, The Structural Transformation of the Public Sphere: An Inquiry into a Category of Bourgeois Society. Cambridge: Polity 1989.

Arjun Appadurai, Modernity At Large: Cultural Dimensions of Globalization. Minneapolis: University of Minnesota Press 1996.

 

What the stories around data tell us

The strange thing about data is that they don’t speak for themselves. They need to be embedded into an interpretation to become palatable and understandable. This interpretation may be an analytic account like the narrative synthesis told after performing a regression. It may also be a story in a more conventional sense, something like a success story of conquest, mastery, submission, or revelation which results from the usual storytelling used in marketing.

The funny thing about data and stories is that it is easier to create a story out of data than to extract data out of a story told. It is easy to conceive of narratives built on top of data. Companies like Narrative Science create market reports or sporting reports automatically out of the data they receive on a daily basis. On the other hand, it is difficult to imagine extracting statistical data about a soccer game out of the up-to-the-minute scores of the same game presented on a website.
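How little is needed to build a narrative on top of data can be shown with a few lines of Python. This is a minimal sketch under stated assumptions, not a description of how Narrative Science or any other company works: the function `match_report`, the field names and the two team names are invented for this example.

```python
def match_report(stats):
    """Turn a small dictionary of match data into a two-sentence report."""
    home, away = stats["home"], stats["away"]
    home_goals, away_goals = stats["home_goals"], stats["away_goals"]

    # Sentence 1: the result, chosen from three fixed templates.
    if home_goals > away_goals:
        result = f"{home} beat {away} {home_goals}-{away_goals}."
    elif home_goals < away_goals:
        result = f"{away} won {away_goals}-{home_goals} away at {home}."
    else:
        result = f"{home} and {away} drew {home_goals}-{away_goals}."

    # Sentence 2: one supporting detail taken from the shot counts.
    home_shots, away_shots = stats["home_shots"], stats["away_shots"]
    if home_shots == away_shots:
        detail = f"Both sides had {home_shots} shots."
    else:
        busier = home if home_shots > away_shots else away
        high, low = max(home_shots, away_shots), min(home_shots, away_shots)
        detail = f"{busier} had more shots ({high} to {low})."

    return f"{result} {detail}"


# Hypothetical match data, invented for illustration.
example = {
    "home": "FC North", "away": "SC South",
    "home_goals": 2, "away_goals": 1,
    "home_shots": 14, "away_shots": 6,
}
print(match_report(example))
# FC North beat SC South 2-1. FC North had more shots (14 to 6).
```

The reverse direction has no comparably short solution: recovering the six numbers from free text would require anticipating every way in which a writer might phrase the same result.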

But data form a peculiar basis for stories. Think of the data which are collected when you visit a website – a typical basis of Big Data. These websites collect data on where you go, where you click, how long you stay there and so on; typically, they are behavioural data. What data scientists can get out of them are correlations; these data do not allow them to grasp the causal mechanisms behind the observed behaviour. Data scientists are not able to see the whole person behind the behaviour; thus they are not able to tell, for example, what the customers feel during their visit to the website, why they reacted as they did – and where the visitors see value in the offers they are presented with. It is thus a reduction of perspective which comes along with stories based on data, a reduction which perhaps can’t be avoided, since data are in themselves a product of a process of estrangement typical of capitalism. The narrative might attenuate or conceal the limitations of the data, but it will not be able to reach far beyond the restrictions imposed.


But there is more to data presented in narratives than a mere reduction of perspective: Unlike data collected in scientific disciplines like psychology and anthropology, which might enable representative statements about population groups, the results of Big Data analyses bring about a shift in perspective. By performing classifications, grouping people according to their preferences, assessing the creditworthiness of customers, etc., Big Data allow us to view human beings from the perspective of a market. And in their ability to shape the offers presented on a website in real time and adapt the pricing mechanisms according to the IP address from which the website is being accessed, and in their ability to build up systems of gratification that reward user actions which are seen as opportune by the infrastructure, data grant a point of view on customers which further strengthens commodification and economic governance. The fact that this point of view is equivalent to the perspective of the market becomes especially visible in the narratives accompanying these data.

Identities and Big Data

One of the amenities of Big Data (and the neural nets used to mine them) is their potential to reveal patterns of which we were not aware before. “Identity” in Big Data might be a variable contained in the dataset. This implies classification, and a feature which cannot easily be determined (like some cis-, trans-, inter-, and alia-identities) might have been put into the ‘garbage’ category of the “other”. Or identity might arise from a combination of features that was unknown beforehand. This was the case in a study which claimed that neural networks are able to detect sexual orientation from facial images. The claim did not go unanswered; a recent examination of this study by Google researchers showed that differences in culture, rather than facial structures, were the features responsible for the result. Features that can easily be changed – like makeup, eyeshadow, facial hair, or glasses – had thus been underestimated by the authors of the study.

The debate between these data analysts exposes insights well known to humanists and social scientists. Identities differ with context; depending on the situation she or he is in, a person may say “I am a mother of three children”, “I am a vegan”, or “I am a Muslim”. In fields marked by strong tensions and polarizations, identity statements can come close to confessions, and it might be wise to deliberate carefully about whether it is opportune to give one answer or the other: “I am British”, “I am a European”.

It is not without irony that it is easy to list several identities that have been important throughout the past few centuries. Clan, tribe, nation, heritage, sect, kinship, blood, race – this is the typical stuff ethnographers and historians work on. Beyond the ones just named, identities like family, religion, culture, and gender are currently intensely debated in our postmodern, globalised world. Think of the discussions about cultural identities in a world characterized by migration; and think of gender identity as one of the examples which only recently has split itself into new forms and created a new vocabulary that tries to grasp the new changeableness: genderfluid, cisgender, transgender, agender, genderqueer, non-binary, two-spirit, etc.

Androgynous portrayal of Akhenaten, the first ruler to introduce monotheism in Ancient Egypt

It is obvious that identity is not a stable category. This is the irony of identity – the promise of an identical self dissolves in time and space, and any attempt to isolate, fix and homogenise an identity is doomed to failure. In postmodernity, identities are rather constructed out of interactions with the social environment – through constant negotiations, distinction and recognition, exclusion and belonging. Mutations and transformations are the expression of the tensions between vital elements that characterize identities. The path from Descartes’ “Cogito, ergo sum” to Rimbaud’s “I is another” is one of the best-known examples of such a transformative process.

Humanists and social scientists are experts in providing thick descriptions and large contexts in which identities can be located. They are used to relating to each other the different resources which feed into identities, and they are capable of building the contexts in which the relevant features and the relationships between them are embedded. In view of these potentials it is astonishing that citizens with such an academic background do not speak with confidence of the power of their methods and join Big Data debates about such elusive concepts as “identities”.

Data. A really short history

“One man’s noise is another man’s signal.” Edward Ng

 

Data are “givens”. They are artefacts. They don’t pre-exist in the world, but come into the world because we, the human beings, want them to do so. If you imagine an intimate relationship between humans and data, you would say that data do not exist independently of humans. Hunters and gatherers collect the discrete objects of their desire, put them in order, administer them, store them, use them, throw them away, forget where they were stored and at times uncover those forgotten places again, for example during a blithe snowball fight. At that time data were clearly divided from non-data – edible things were separated from inedible ones, helpful items from uninteresting ones. This selectivity may have benefited from the dependency on the shifting availability of resources in a wider space.

Later on, when mankind left the forests, things became much more complicated. With the advent of agriculture and especially of trade, data became more meaningful. Furthermore, rulers became interested in knowledge about their sheep and therefore asked some of their accountants to keep records of their subordinates – the oldest census dates back to 700 B.C. When mathematics became a bit more complicated, from the 17th (probability) and 19th (statistics) centuries onwards, data about people, agriculture, and trade began to heap up, and it became more and more difficult to distinguish between relevant data (“signal”) and irrelevant ones (“noise”). The distinction between these two simply became a matter of the question with which the data at hand were consulted.

With the advent of the industrial age, the concept of mechanical objectivity was introduced, and the task of data creation was delegated to machines which were constructed to collect the items in which humans were interested. Now data were available in huge amounts, and the need for organizing and ordering them became even more pressing. It is here that powerful schemes came into force: selection processes, categorizations, classifications, standards; variables prioritized as signal over others reduced to noise, thus creating systems of measurement and normativity intended to govern societies. They have been insightfully investigated in the book “Standards and their stories. How quantifying, classifying, and formalizing practices shape everyday life.”*

It was only later, at the beginning of the post-industrial age, that an alternative to this scheme- and signal-oriented approach was developed by simply piling up everything that may be of any interest, a task also often delegated to machines because of their patient effortlessness. The agglomeration of masses presupposes that storage is not a problem, neither in spatial nor in temporal terms. The result of such an approach is nowadays called “Big Data” – the accumulation of masses of (mostly observational) data for no specific purpose. Collecting noise in the hope of signal, without defining what is noise and what is signal. Fabricating incredibly large haystacks and assuming there are needles in them. Data as the output of a completely economised process with its classic capitalistic division of labour, including the alienation from their sources.

What is termed “culture” often resembles these haystacks. Archives are haystacks. The German Historical Museum in Berlin is located in the “Zeughaus”, literally the “house of stuff”,** stuffed with the remnants of history, a hayloft of history. Libraries are haystacks as well; if you are not bound to the indexes and registers called catalogues, if you are free to wander around the shelves of books and pick them out at random, you might get lost in an ocean of thoughts, in countless imaginary worlds, in intellectual universes. This is the fun of the explorers, conquerors and researchers: Never get bored through routine, always discover something new which feeds your curiosity. And it is here, within this flurry of noise and signal, within the richness of multisensory, kinetic and synthetic modes of access, where it becomes tangible that in culture noise and signal cannot be thought without the environment out of which they were taken, that both are the product of human endeavours, and that data are artefacts that cannot be understood without the context in which they were created.

 

*Martha Lampland, Susan Leigh Star (Eds.), Standards and their stories. How quantifying, classifying, and formalizing practices shape everyday life. Ithaca: Cornell Univ. Press 2009.
** And yes, it should be translated correctly as “armoury”.