Blog

“The Trouble with Big Data”: New Book Published by the KPLEX Project

One of the major terminological forces driving ICT development today is that of ‘big data.’ While the phrase may sound inclusive and integrative, in fact, big data approaches are highly selective, excluding any input that cannot be effectively structured, represented, or, indeed, digitised. The Trouble with Big Data explores the challenges society faces with big data through the lens of culture rather than social, political or economic trends, as demonstrated in the words we use, the values that underpin our interactions, and the biases and assumptions that drive us.

Evolving from research undertaken in the Knowledge Complexity (KPLEX) project, in which Trinity College Dublin, the Data Archiving and Networked Services (DANS) of the Koninklijke Nederlandse Akademie van Wetenschappen, and Freie Universität Berlin were partners, this book focuses on areas such as data and language, data and sensemaking, data and power, data and invisibility, and big data aggregation. How cultural practices are displaced by, and yet simultaneously resist, mass datafication can be instructive for the critical observation of big data research and innovation.

This book is available as open access through the Bloomsbury Open programme at www.bloomsburycollections.com. It is funded by Trinity College Dublin, DARIAH-EU and the European Commission.

TCD is Hiring in Computational Literary Studies

The Trinity College Centre for Digital Humanities is recruiting a 2-year post-doctoral fellow with expertise in computational literary studies to join its growing and diverse team. The purpose of this post is to support the development of the LI4AI project, through the delivery of collaborative research in the field of computational literary studies (70%), and teaching backfill for the project PI (30%).

LI4AI is a proposal-stage research project that will use literary evidence to develop an evidence-based model upon which key competencies related to positive liberty – agency, autonomy, competency, transgression, creativity, trust and empathy – can be specified and made actionable within a software environment. As such, it will bring an urgently needed applied humanities approach to the challenge of AI development and regulation, in light of the necessary pivot toward humane technology development and deployment.

LI4AI is led by Trinity College Dublin’s Professor Jennifer Edmond, Co-Director of the Trinity Centre for Digital Humanities. The appointee will be based in the Trinity Long Room Hub Arts and Humanities Research Institute and will work closely with Professor Edmond and the DH@TCD Team, also based in the Trinity Long Room Hub.

For more information, see the document below.

Big Data and AI for the Public Good?

The provision of data containing information on personal health is still regarded with considerable scepticism by European citizens. The growing interest of large companies in medical and patient data stands in opposition to the reservations of citizens regarding the privacy, anonymization and pseudonymization of data which are intimately bound to their bodies. But in recent weeks, the intense debate around a possible contact tracing app in times of COVID-19 has again shown that people are willing to cooperate and contribute even sensitive data related to their own person if this endeavour supports the common welfare. Furthermore, the debate has revealed that there need not be a trade-off between privacy and public welfare, if certain conditions with regard to the development of an app like Pepp-PT (Pan European Privacy Protecting Proximity Tracing) are met.

First of all: encryption methods that enable compliance with the GDPR, the impossibility of de-anonymization of the data collected, and the renunciation of central servers (i.e. servers under state control). Second: transparency around the production process of the app, which enables non-governmental organizations like the German Chaos Computer Club (CCC) or Reporters without Borders to inspect, test and evaluate the source code deposited on GitHub. Both these points create the trust needed to involve a large part of Europe’s population. Third: the involvement of citizens at several points in the data collection and exchange process. Users must not only download and activate the app, but also agree that their smartphone’s Bluetooth is switched on and can be used by it. If they are presented with the diagnosis of being Corona-positive, they need to grant the right to use this information in the app; only then do all the other users who have been in contact with the infected person receive a message about the diagnosis. On this basis, they can decide whether or not to approach a medical authority for further testing. The third point essentially means that users are asked to act as responsible citizens – and are empowered by their involvement in the decisions within the process.
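To make the data flow sketched above a little more concrete, here is a minimal Python sketch of a decentralised proximity-tracing scheme of the kind discussed in the Pepp-PT debate. The class, the method names, the 16-byte identifiers and the matching logic are illustrative assumptions, not the actual Pepp-PT protocol.

```python
# Minimal sketch of a decentralised proximity-tracing flow (illustrative only;
# names, identifier lengths and matching logic are assumptions, not Pepp-PT).
import secrets


class Phone:
    def __init__(self, owner):
        self.owner = owner
        self.broadcast_ids = []    # ephemeral IDs this phone has broadcast
        self.observed_ids = set()  # ephemeral IDs picked up via Bluetooth

    def new_broadcast_id(self):
        """Generate a fresh random identifier; rotated regularly so that the
        phone cannot be tracked over time."""
        eph_id = secrets.token_bytes(16)
        self.broadcast_ids.append(eph_id)
        return eph_id

    def record_contact(self, eph_id):
        """Store an identifier received from a nearby phone."""
        self.observed_ids.add(eph_id)

    def consent_to_upload(self):
        """Called only after a positive diagnosis *and* explicit user consent:
        the phone shares the random IDs it broadcast, and nothing else."""
        return list(self.broadcast_ids)

    def check_exposure(self, published_ids):
        """Matching happens locally on each device; no central register of
        who met whom is ever built."""
        return any(eph_id in self.observed_ids for eph_id in published_ids)


# Alice and Bob meet; Alice later tests positive and consents to the upload.
alice, bob, carol = Phone("alice"), Phone("bob"), Phone("carol")
bob.record_contact(alice.new_broadcast_id())   # Bob's phone hears Alice's beacon
published = alice.consent_to_upload()          # shared only with Alice's consent
print(bob.check_exposure(published))    # True  -> Bob is notified
print(carol.check_exposure(published))  # False -> Carol learns nothing
```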

This latter point – the empowerment of users/data donors – is most often neglected by policy makers as well as by lawmakers. Data protection laws clearly identify data controllers, but ownership is ill-defined. This is a consequence of the fact that data can be copied without loss; customary conceptions of ownership (as in the case of a bicycle) therefore do not apply. Furthermore, in a world where data are quite often a by-product of human activities, the massive data collection by a few monopolists has obscured the sense of data ownership, for example among users of online media, and thus introduced a feeling of being exploited by such data aggregators, as well as an accompanying mistrust of data collection in general. This is also why the case of the Pepp-PT app seems to open new doors for users: they do not only donate personal data, they also get something back from the app. What they get back is no more than a tiny piece of information, yet one they regard as precious – the information whether or not they have been in the company of an infected person.

In the scenario described above, only data are involved; with many people using the app, these can even become Big Data. AI comes in if more data are available: data about the health of the people using the app, provided by healthcare services, or fitness data collected by apps and devices from the self-quantification domain. If these data could be aggregated together with the data collected by the Pepp-PT app, enough information would be available for machine learning to answer questions such as: How do personal fitness and the course of the disease relate to each other? Which factors support a quicker cure? Which variables predict complications during the course of the disease? This is where the full power of Big Data and Artificial Intelligence would unfold for the public good.
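As a purely hypothetical illustration of the kind of question such aggregated data might let machine learning address, the following sketch fits a simple classifier to synthetic fitness and contact data in order to “predict” complications. Every feature name, the synthetic data and the choice of logistic regression are assumptions made for the example, not part of any existing system.

```python
# Hypothetical sketch: machine learning over aggregated tracing, health and
# fitness data. Features, data and model choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 1000

# Synthetic stand-ins for aggregated data: daily steps, resting heart rate,
# age, and number of risky contacts flagged by a tracing app.
X = np.column_stack([
    rng.normal(8000, 2500, n),   # daily steps
    rng.normal(65, 10, n),       # resting heart rate
    rng.integers(18, 90, n),     # age
    rng.poisson(2, n),           # flagged contacts
])
# Synthetic outcome: 1 = complications during the course of the disease.
logits = 0.04 * (X[:, 2] - 50) + 0.05 * (X[:, 1] - 65) - 0.0002 * (X[:, 0] - 8000)
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, y_train)

print("held-out accuracy:", round(model.score(X_test, y_test), 3))
print("coefficients (steps, heart rate, age, contacts):", model[-1].coef_.round(3))
```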

The successful use of AI needs computational power, big data, and the work of capable algorithm developers. All three ingredients can usually be found in big tech companies, but most often not beyond them. This observation reveals why a broader societal debate on “AI for the Public Good” has not yet been conducted: we are far from having the necessary infrastructure and financial means in place. But what could this look like? And how could algorithmic innovation for the common welfare be furthered?


Smart Citizen. Image © Jörg Lehmann 2019

Beyond the case of a COVID tracing app, the idea of “Smart Cities” provides a scenario in which some of the prerequisites for the use of Big Data and AI in the service of public welfare become visible. For Smart Cities, data on a whole range of important topics are needed: energy, mobility, climate, environment, garbage, and so forth. In the concept of Smart Cities, every urban dweller can easily imagine the need for solutions beyond the individual household and her or his contribution to them. When would be the best time to turn on the washing machine, because power is available in abundance at that moment? Where can I find a parking space? Which alternative modes of transport are available to bring goods from A to B? If rainfall becomes rarer but heavier, what is the best way for a house to manage flooding and mitigate disasters? Should polders and cisterns be installed to counterbalance phases of drought? How can systems for the reuse of recyclable material and goods replace or complement the current garbage collection service?

Questions like these are negotiated in the concept of Smart Cities, and it immediately becomes obvious that a lot of data are already available (energy, mobility, climate), while others are missing (data on individual energy consumption, mobility, or patterns of daily resource use). Furthermore, facilities providing the necessary computational power, as well as the relevant algorithms, are nowhere to be seen. Ouch. These seem to be the pain points where we as a society have to move forward in order not to relinquish algorithmic innovation to private companies. Data do not seem to be the problem; all of us produce them ceaselessly. They could be collected, aggregated and managed within data cooperatives (the German language has the word “Genossenschaft” for this). An example from health research is the Swiss cooperative MIDATA, jointly created in 2015 by ETH Zurich and the Bern University of Applied Sciences. In such data cooperatives, personal data from the members of the cooperative are aggregated according to transparent governance principles and protected with state-of-the-art encryption to ensure privacy. Furthermore, citizens (and the communities forming a city) are empowered to steer data use according to their motivations and preferences. These cooperatives can organize access to aggregated data that did not previously exist in this linked form, since they consolidate data that had been stored in disparate silos.

While the issue of missing individual data can be solved by data cooperatives, the infrastructural questions remain. There is a need for computational power, as well as for data analysis and interpretation platforms or interfaces that enable individual or collective users to obtain insights from the available Big Data. These form the basis for decisions, for example through predictions of transport and energy use in the coming months and years; of the watering of plants in public streets and private gardens, or the usage of parks; and of the communities or facilities in need of recyclable material. Industry can contribute to such an implementation of Smart Cities by providing applications and interfaces on a pro bono basis; unions and associations like Data Science for Social Good Berlin might also be helpful in data analysis. But such endeavours do not provide sustainable solutions for the lack of infrastructure. Policy makers should therefore promote the model of Smart Cities by funding distributed data infrastructures that pilot new data aggregation models; and private foundations should provide the necessary investments in high-performance computers and expensive algorithm developers until a proof of concept has convinced governments or the European Commission to provide long-term funding for such independent institutions. Only if all these conditions are met can civil society move forward and find ways to use AI for the public good.

KPLEX Presented at DH 2019 Conference

Jörg Lehmann and Jennifer Edmond were very pleased to have been given a chance to present some of the findings from the KPLEX project to an engaged audience at the DH 2019 conference on 12th July 2019. The paper was entitled “Digital Humanities, Knowledge Complexity and the Six ‘Aporias’ of Digital Research,” and explored a number of the cultural clashes we found between the perspectives represented in our interviews. While DH was never a planned audience for our results, the response today convinced us that there is still much to mine from our interviews and insights!

The slides from the presentation can be viewed here.

ACDH Lecture 4.1: What Can Big Data Research Learn from the Humanities?


Jennifer Edmond
Director of the Trinity College Dublin Centre for Digital Humanities and Principal Investigator on the KPLEX Project

One of the major terminological forces driving ICT development today is that of ‘big data.’ While the phrase may sound inclusive and integrative, in fact, ‘big data’ approaches are highly selective, excluding, as they do, any input that cannot be effectively structured, represented, or, indeed, digitised. Data of this messy, dirty sort are precisely the kind that humanities and cultural researchers deal with best, however. In particular, knowledge creation and information management approaches from the humanities shed light on gaps such as: the manner in which data that are not digitised or shared become ‘hidden’ from aggregation systems; the fact that data are human-created, and lack the objectivity often ascribed to the term; and the subtle ways in which data that are complex almost always become simplified before they can be aggregated. Humanities insight also exposes the problematic discursive strategies that big data research deploys, strategies that can be seen reflected not only in the research outputs of the field, but also in many of the urgent challenges our digitised society faces.

The lecture is available to view here: https://www.youtube.com/watch?v=E2vdFBo9wB4

The Future of History

If you go to an archive today and look for a personal heritage, what would you expect? Notebooks, letters, photographs, calendars, the documentation of printed publications, drafts of articles or books with hand-written comments, and the like. But what about born-digital content? In the best case, you will find a backup of the hard disk drive of the person you are researching. The letters of earlier times have turned into emails, SMS and WhatsApp messages, Facebook entries and Twitter posts; publications may have become online articles, PDF files, and blog contributions distributed all over the net. And that is the point: what may once have been part of somebody’s personal heritage in paper format may nowadays have become part of Big Data. Yes, Big Data: they do not consist only of incredibly large tables, with variables and columns filled with numbers; a good part of Big Data consists simply of text files with social media content (Facebook, Twitter, blogs, and so on).

At first glance, this sounds astonishing. The ‘private’ character of personal heritage seems to have vanished, while the proportion of content available in the public sphere seems to have grown. It seems. But this is not surprising; we are reminded of one of the most influential studies on this topic, Jürgen Habermas’ “The Structural Transformation of the Public Sphere”. What Habermas analyses there are the constant changes and shifts of the border between private and public. His examination starts in the late 18th and early 19th century, with the formation of an ideal type of bourgeois discourse marked by what Habermas calls “Räsonnement”. This reasoning aims at arguing, but also, in its pejorative form, at grumbling. The study begins with bourgeois members of the public meeting in salons, coffee houses, and literary round tables, pursuing reasoned exchange by contributing to journals, and practicing subjectivity, individualism, and sentimentalism by writing letters and diaries destined either to be published (think of Gellert, Gleim, and Goethe) or to become part of a personal heritage. Habermas draws long lines into the 20th century, where his book ends with the opposition between public and private characteristic of that time: employment is part of the public space, while leisure time is dedicated to private activities; letters and reading have become much less important, and only the high bourgeoisie keep their own libraries; mass media enhance the passivity of consumerism. This can also be read from personal heritages: the functional differentiation of modern society created presumed experts in Räsonnement, like journalists, politicians, and publicists, who deliver opinion formation as a service, while editors and scientists professionalise the critique of politics. Habermas is overtly critical of the mass media and their potential for manipulation, since they reduce citizens to recipients without agency.

The last edition of Habermas’ book was printed in 1990. Since that time, a lot has changed, especially with the emergence of the internet. The border between public and private has moved, and the societal-political commitment of citizens has changed. Social media grant an incredible degree of agency and empower citizens. Hyperdigital hipsters work in cafés, co-working spaces or start-ups, without having the private leisure time characteristic of the 20th century. Digital media network people across large spaces and form new transnational collectives. The anthropologist Arjun Appadurai has spoken of “diasporic public spheres” in this respect – small groups of people discussing face to face in pubs have been transformed into “communities of sentiment” grumbling at politics. Formerly silent recipients have mutated into counter-publics; the sentimental bourgeois has become an enraged citizen. Habermas wouldn’t have liked this development, since his ideal type of Räsonnement doesn’t fit current realities, and what he overlooked – the existence of large parts of society made up of people who don’t participate in mass media discourses because they don’t want to – nowadays feeds, for example, right-wing populism.

The Facebook network as a new public sphere

This latest transformation of the public sphere has consequences for archivists as well as historians. Archivists should regard social media content as part of personal heritages and thus have to struggle with data management and storage problems. Historians (at least historians of the future) have to become familiar with quantitative analysis in order, for example, to examine Twitter networks and determine the impact of the Alt-Right movement on the presidential election in the U.S. Born-digital content can therefore be seen as a valuable part of personal heritage. And coming from this point of view, there is certainly a lot that historians can contribute to discussions on Big Data.

 

Jürgen Habermas, The Structural Transformation of the Public Sphere: An Inquiry into a Category of Bourgeois Society. Cambridge: Polity, 1989.

Arjun Appadurai, Modernity at Large: Cultural Dimensions of Globalization. Minneapolis: University of Minnesota Press, 1996.

 

What the stories around data tell us

The strange thing about data is that they don’t speak for themselves. They need to be embedded in an interpretation to become palatable and understandable. This interpretation may be an analytic account, like the narrative synthesis given after performing a regression. It may also be a story in a more conventional sense, something like a success story of conquest, mastery, submission, or revelation of the kind that results from the usual storytelling used in marketing.

The funny thing about data and stories is that it is easier to create a story out of data than to extract data out of a story already told. It is easy to conceive of narratives built on top of data. Companies like Narrative Science create market reports or sports reports automatically out of the data they receive on a daily basis. On the other hand, it is difficult to imagine extracting statistical data about a soccer game from the up-to-the-minute scores of the same game presented on a website.
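A tiny sketch of this asymmetry, under invented assumptions: a few lines of code suffice to turn a structured match record into readable prose, while the reverse direction would demand non-trivial information extraction. The record fields, team names and wording below are made up for illustration; commercial systems such as Narrative Science’s are far more sophisticated.

```python
# Toy data-to-text generation: turning a structured match record into a short
# report. Record fields, team names and wording are invented for illustration.
match = {
    "home": "FC Oldtown", "away": "SC Rivertown",
    "home_goals": 3, "away_goals": 2,
    "top_scorer": "N. Example", "attendance": 21000,
}


def report(m):
    """Render one sentence of prose from the structured record."""
    if m["home_goals"] > m["away_goals"]:
        outcome = f'{m["home"]} beat {m["away"]}'
    elif m["home_goals"] < m["away_goals"]:
        outcome = f'{m["away"]} beat {m["home"]}'
    else:
        outcome = f'{m["home"]} and {m["away"]} drew'
    return (f'{outcome} {m["home_goals"]}:{m["away_goals"]} in front of '
            f'{m["attendance"]:,} spectators; {m["top_scorer"]} was the '
            f'standout performer.')


print(report(match))
# Going the other way -- recovering the structured record from free prose --
# would require named-entity recognition, number parsing and domain knowledge.
```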

But data form a peculiar basis for stories. Think of the data which are collected when you visit a website – a typical source of Big Data. These websites collect data on where you go, where you click, how long you stay, and so on; typically, these are behavioural data. What data scientists can get out of them are correlations; such data do not allow one to grasp the causal mechanisms behind the observed behaviour. They make it impossible to see the whole person behind the behaviour; thus they cannot tell, for example, what customers feel during their visit to the website, why they reacted the way they did – or where the visitors see value in the offers they are presented with. Stories based on data therefore come with a reduction of perspective, a reduction which perhaps cannot be avoided, since data are in themselves a product of a process of estrangement typical of capitalism. The narrative might attenuate or conceal the limitations of the data, but it will not be able to reach far beyond the restrictions imposed.
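To make this limitation concrete, here is a small sketch using invented clickstream data: a correlation between time on page and purchasing behaviour is easy to compute, but nothing in the numbers reveals why visitors behaved as they did. The variables and the synthetic data are assumptions made for the example.

```python
# Illustrative sketch: behavioural web data yield correlations, not causes.
# The variables and the synthetic clickstream data are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n = 5000

# Seconds spent on a product page, and whether a purchase followed.
time_on_page = rng.exponential(scale=60, size=n)
purchase = (rng.random(n) < 1 / (1 + np.exp(-(time_on_page - 90) / 30))).astype(float)

r = np.corrcoef(time_on_page, purchase)[0, 1]
print(f"correlation(time on page, purchase) = {r:.2f}")
# A positive correlation says only that the two move together. Whether longer
# visits cause purchases, intended purchases cause longer visits, or some third
# factor (interest, price, mood) drives both is invisible in the clickstream --
# as is the whole person behind the click.
```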


But there is more to data presented in narratives than a mere reduction of perspective. Unlike data collected in scientific disciplines like psychology and anthropology, which might enable representative statements about population groups, the results of Big Data analyses afford a shift in perspective. By performing classifications, grouping people according to their preferences, assessing the creditworthiness of customers, and so on, Big Data allow human beings to be viewed from the perspective of a market. And in their ability to shape the offers presented on a website in real time and to adapt pricing mechanisms according to the IP address from which the website is accessed, and in their ability to build up systems of gratification, rewarding those actions of users which the infrastructure deems opportune, data grant a point of view onto customers which further strengthens commodification and economic governance. The fact that this point of view is equivalent to the perspective of the market becomes especially visible in the narratives accompanying these data.