Using humanities knowledge to explore bias in big data approaches to knowledge creation
Author: Jennifer Edmond
Dr Jennifer Edmond is the Director of Strategic Projects in the Faculty of Arts, Humanities and Social Sciences at Trinity College Dublin. Trained as a scholar of German literature, Jennifer is mostly engaged professionally with the investigation of knowledge exchange and collaboration in Humanities research, and in particular with the impact of technology on these processes.
Beyond the ‘long tail’ metaphor, the distribution of data within the scientific field has been described in terms of ‘data wealth’ and ‘data poverty’. Steve Sawyer has sketched a political economy of data in a short essay (a slightly modified version of this paper is freely accessible here). According to him, in data-poor fields data are a prized possession; access to data drives methods; and there are many theoretical camps. By contrast, data-rich fields can be identified by three characteristics: pooling and sharing of data is expected; method choices are limited, since forms drive methods; and only a small number of theoretical camps have been validated. This opposition leads to an unequal distribution of grants received, since data wealth lends legitimacy to claims of insight as well as access to additional resources.
While Sawyer describes a polarity within the scientific field with respect to funding and cyberinfrastructures, which he sees as a means to overcome obstacles in data-poor fields, the KPLEX Project will look into how the contents and characteristics of data relate to methodologies and epistemologies, to integration structures and aggregation systems.
As discussed in the earlier post this week, “How can data simultaneously mean what is fabricated and what is not fabricated?”, the term ‘raw data’ points toward an illusion, to something that is naturally or even divinely given, untouched by human hand or interpretive mind. The very organicism of the metaphor belies the work it has to do to hide the fact that the progression from data to knowledge is not a clean one: knowledge also produces data, or, perhaps better said, knowledge also produces other knowledge. We need to get away from the idea of data as somehow ‘given’ – Johanna Drucker’s preference for the term ‘capta’ over ‘data’ is hugely useful in this respect.
Raw data is, however, an indicator for where we begin. Human beings have limited lifespans and limited capacities for experimentation and research, sensitive as we are to what John Guillory referred to as the ‘clock time’ of scholarship. Therefore, to make scientific progress, we start on the basis of the work of others. Sometimes this is recognised overtly, as in the citation of secondary sources. Sometimes it is only indirectly visible, as in the use of certain instruments, datasets, infrastructures or libraries, which themselves betray (to the most informed and observant) the conditions in which particular knowledge was born. In this way, ‘raw data’ merely refers to the place where our organising work started, where we removed what we found from the black box and began to create order from the relative chaos we found there.
Do big data environments make these edges apparent? Usually not. In the attempt to gain an advantage over ‘clock time,’ systems hide the nature, perhaps not of the chaos they began with, but of how that chaos came to be, whose process of creating ‘capta’ gave origin to it in the first place. Like the children’s rhyme about the ‘House that Jack Built’, knowledge viewed backwards toward its starting point can perhaps be seen recursing into data; but going further, that data itself emerges from a process or instrument that someone else created through a process that began elsewhere, with something else, at yet another, far earlier point. At their best, humanists cling to the provenance of their ideas and knowledge. Post-modernism has been decried by many as the founding philosophy of the age of ‘alternative facts’, as an abandonment of any possibility for one position to be right or wrong, true or untrue. In the post-truth world, any position can offend, and any attempt at defence results in being labelled a ‘snowflake.’ But Lyotard’s work on The Post-Modern Condition wasn’t about this: it was about recognising the bias that creeps into our assumptions once we start to generalise at the scale of a grand narrative, a truth. In this sense, we perhaps need to move our thinking from ‘raw data’ to ‘minimally biased data,’ and from the grand narratives of ‘big data’ to the nuances of ‘big enough data.’ If we can begin speaking with clarity about that place where we begin, we will have a much better chance of not losing track of our knowledge along the way.
Common wisdom holds that data are recorded facts and figures; as an example, see Kroenke and Auer, Database Processing, slide 8. This is not astonishing if one considers the meaning of the term “raw data”. Most often it refers to unprocessed instrument data at full resolution: the output of a machine which has differentiated between signal and noise – as if the machine had not been conceived by human beings and designed according to needs defined by them. This entanglement of data and facts rests on a mechanical understanding of objectivity, where instruments record signals which can easily be identified by humans as something that has been counted, measured, sensed. “Raw data” is seen as the output of that operation, as facts gained by experiment, experience or collection. Data products derived from these machine outputs seem to inherit their ‘factual’ status, even if processed to a high degree in order to enable modelling and analysis. Turning against this distinction between “raw” and “cooked” data and underlining data as a result of human intervention, Geoffrey Bowker coined the phrase “‘Raw data’ is an oxymoron”, which provided the title of a book edited by Lisa Gitelman (find the introduction here). But common sense is not this precise in its differentiations; it retains the aura of data as something that exists prior to any processing.
Data and facts are not the same. In a contribution to the anthology edited by Gitelman, Daniel Rosenberg explains the etymology of both terms as well as their differences. Facts are ontological: they depend on the logical operation true/false. When they are proven false, they cease to be termed “facts”. Data can have something in common with facts, namely their “out there-ness”, a reference apparently beyond the arbitrary relation between signifier and signified. “False data is data nonetheless”, as Rosenberg puts it, and he points out that “the term ‘data’ serves a different rhetorical and conceptual function than do sister terms such as ‘facts’ and ‘evidence’.” But what exactly is the rhetorical function of the term “data”? Rosenberg’s answer is that “data” designates something which is given prior to argument. Again, this brings the term “data” close to the term “fact”: in argumentation, both terms assume the task of a proof, of something that substantiates. In which settings is this rhetoric particularly relevant?
As Steve Woolgar and Bruno Latour pointed out a generation ago, facts are social constructions, purified of the remainders of the process in which they were created: “the process of construction involves the use of certain devices whereby all traces of production are made extremely difficult to detect”, they wrote in Laboratory Life. There is a process at work here which can be compared to ‘datafication’: in a laboratory, the term “fact” can simultaneously assume two meanings. At the frontier of science, the scientists themselves know about the constructed nature of statements and are aware of their relation to subjectivity and artificiality; at the same time, these statements are referred to as things “out there”, i.e. as objectivity and facts. And it is in the very interest of science, aiming at “truth effects”, to have the artificiality of factual statements forgotten and, as a consequence, to have facts taken for granted. The analogy between the dual nature of these statements and the distinction between “raw” and “cooked” data – between “facts” and “data” – is obvious. With respect to the latter, the process of dividing and classifying phenomena into numbers obscures ambiguity, conflict, and contradiction; what is left over from this process, the “data”, is completely cleansed of irritating factors; and no trace remains of the purifying process itself. “Data” therefore deny their fabricated-ness; and in struggles over the legitimacy of insight or in applications for resources, they serve their purpose in the very same way as “facts” do.
In Europe, copyright laws continue to vary from country to country, in spite of many years’ pressure towards harmonisation.
One of the biggest questions this raises is whether having the right to read a work gives you the right to data mine it. In terms of literature, you might think the right to mine could be the less restrictive one, if anything, as the use and market for mining are so different from a linear reading mode. I don’t know of anyone who takes a nice big set of unstructured data (rather than a novel) with them on a beach holiday for fun, though I am sure such people exist (albeit with sand in their laptops). But no one is sure whether the rights to these modes of use will be harmonised, largely because the original rules and customs date to the era of print, and the tension between the unknown benefit of relaxing them and an unknowable future profit model that could emerge from data mining is unresolved.
This is a problem for literary scholars, but also for historians and others working with cultural data. Which leads me to the observation that the very beauty and utility of humanities data creates a two-tier system of science, in which legal hurdles hinder research in some disciplines (those that co-own their data) or for some methodologies, but not others. There are surely some medical or other data sets that are viewed as having a market value equivalent to a best-seller’s, but even there, the mining model would be the only reading paradigm.
So how would data-driven research in other disciplines be different if the raw materials of their research were sold in airports, hung on the walls of galleries, or revered as the founding documents of a nation?
The ‘long tail’ is a metaphor for a statistical distribution in which about 20 percent of the distribution is in the head of the curve and the remaining 80 percent is spread out along the tail. This well-known distribution – a power law – was popularized by Chris Anderson, former editor-in-chief of Wired magazine. In an article published in 2004 [http://archive.wired.com/wired/archive/12.10/tail.html], Anderson used this distribution to describe the market for goods in physical stores versus online stores: in the head of the curve are the hits – blockbusters, bestsellers, top-ten chart tracks – all sold by retailers with physical stores. But in the long tail are the much-loved film classics, the back catalogue, older music albums. Combined, the non-hits make for a market bigger than the hits, which opens the door for online stores. As Anderson explains, successful businesses in the digital economy are about aggregating the long tail, focusing on a million niches rather than on the mass market.
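To make the arithmetic behind the metaphor concrete, here is a minimal Python sketch of a Zipf-style power law. The catalogue size and exponent are illustrative assumptions of mine, not Anderson’s figures; the point is simply that the many niches in the tail can jointly outweigh the hits in the head.

```python
# A minimal sketch of the long-tail arithmetic: under a Zipf-like power
# law, demand for the many niche items combined can exceed demand for
# the few hits. Catalogue size and exponent are illustrative assumptions.

def zipf_weights(n_items, exponent=1.0):
    """Relative demand for items ranked 1..n_items under a power law."""
    raw = [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]
    total = sum(raw)
    return [w / total for w in raw]

weights = zipf_weights(n_items=1_000_000)  # "a million niches"
head = sum(weights[:100])    # the hits: the 100 most popular items
tail = sum(weights[100:])    # the long tail: everything else

print(f"share of demand in the head: {head:.1%}")  # roughly a third
print(f"share of demand in the tail: {tail:.1%}")  # roughly two thirds
```

For these parameters the tail accounts for roughly two thirds of total demand – the economic opening Anderson describes for online aggregators.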
But the metaphor of the ‘long tail’ can also be applied to scholarly research: only a fraction of research teams – say, about 20% – work on large volumes of data; a much bigger number of researchers work with ‘little data’. These small datasets are harder to handle and to exchange, as Christine Borgman et al. have recently described in their research paper “Data Management in the Long Tail: Science, Software and Service” (1) [http://www.ijdc.net/index.php/ijdc/article/viewFile/11.1.128/432].
Public attention is directed towards ‘big data’; the smaller datasets used by the humanities, as well as cultural data, are by contrast ‘hidden’, not standardized, and lacking in metadata. More than that, they are unstructured, polyvalent, messy – and this is what the KPLEX Project focuses on: long tail data.
(1) Christine L. Borgman, Milena S. Golshan, Ashley E. Sands, Jillian C. Wallis, Rebekah L. Cummings, Peter T. Darch: “Data Management in the Long Tail: Science, Software and Service”. International Journal of Digital Curation 11(1) (2016), 128–149. DOI: 10.2218/ijdc.v11i1.428
Do you know what makes a dataset ‘tidy’? If not, don’t worry – you are in good company. The majority of people working with data have not been trained in this respect: amongst those who establish and collect datasets, only rarely will you find a statistician or data scientist. We know that datasets consist of variables and observations; but it is surprisingly difficult to define variables and observations precisely in general. Tidy datasets provide a standardized way to link the structure of a dataset with its meaning; they form the basis for effective computation, modelling and visualization. In his contribution “Tidy Data”, Hadley Wickham, Chief Scientist at RStudio and Adjunct Professor of Statistics, explains what accounts for the quality of datasets, and how they can effectively be cleaned and prepared for analysis. The article, published in the Journal of Statistical Software, is freely accessible.
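To make the idea concrete, here is a minimal sketch of the core tidying move, echoing the small treatment table in Wickham’s paper – written in Python with pandas rather than Wickham’s own R, and with invented values:

```python
import pandas as pd

# A "messy" table: one row per person, one column per treatment.
# The structure echoes the example in Wickham's paper; values are invented.
messy = pd.DataFrame({
    "person":      ["John Smith", "Jane Doe", "Mary Johnson"],
    "treatment_a": [None, 16, 3],
    "treatment_b": [2, 11, 1],
})

# Tidy form: each variable is a column (person, treatment, result)
# and each row is exactly one observation.
tidy = messy.melt(id_vars="person", var_name="treatment", value_name="result")
print(tidy)
```

In the tidy version a single column holds the treatment variable, so grouping, modelling and plotting operations can address it directly.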
Data are contained in millions of spreadsheets, and everybody working with Excel files has experienced some of their peculiarities. This is not only a problem for the exchange of data, where often the formulas used for calculating the values in fields are exported rather than the values themselves. Snafus and bugs are also an obstacle for sciences like economics or genomics, because complex values are at times misinterpreted: “SEPT2”, which corresponds to the gene Septin 2, is read by Excel as “September 2nd”. A recent study in the journal Genome Biology investigated papers published between 2005 and 2015 and identified an astonishing proportion of spreadsheet-related errors in them. This poses a real challenge for scientists who subsequently try to build on previously published research results.
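The Genome Biology study worked by screening published supplementary files for gene names that Excel had silently turned into dates or numbers. A minimal sketch of that kind of check – my own illustration, not the study’s actual code – might look like this:

```python
import re

# Patterns Excel typically produces when it auto-converts gene symbols:
# date-like strings such as "2-Sep" (from "SEPT2") or "1-Mar" (from
# "MARCH1"), and scientific notation such as "2.31E+13" (from RIKEN
# identifiers like "2310009E13"). Illustrative, not the study's exact code.
DATE_LIKE = re.compile(r"^\d{1,2}-(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)$")
FLOAT_LIKE = re.compile(r"^\d+(\.\d+)?E\+\d+$")

def suspicious_gene_values(values):
    """Return entries that look like Excel auto-conversions, not gene symbols."""
    return [v for v in values if DATE_LIKE.match(v) or FLOAT_LIKE.match(v)]

column = ["TP53", "2-Sep", "BRCA1", "2.31E+13", "MARCH1"]
print(suspicious_gene_values(column))   # ['2-Sep', '2.31E+13']
```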
Results of statistical operations are far from intuitive. One needs a lot of experience to reliably judge computed results, and most people would simply follow the advice given by experts in the field. This makes a story presented by the BBC all the more remarkable; it is about a student who took on the task of checking the calculations of two Harvard professors. Not only was he unable to replicate the results that Carmen Reinhart and Ken Rogoff had published; much more than that, he spotted a basic error in their spreadsheet and was thus able to correct their research paper, “Growth in a Time of Debt”. That may sound like a modern myth; but sometimes politicians base their decisions on papers like these.
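At its core, the error was a spreadsheet formula that averaged the wrong range of rows, silently dropping several countries. Here is a hypothetical sketch of how such an off-by-range slip distorts an average; the figures are invented and do not reproduce the actual dataset:

```python
# Hypothetical illustration of the kind of error at issue: a summary
# statistic computed over the wrong slice of rows. The growth figures
# below are invented and do not reproduce the actual dataset.
growth_by_country = {
    "A": 2.1, "B": -0.3, "C": 1.4, "D": 3.0, "E": -1.1,
    "F": 2.5, "G": 0.8,  "H": 1.9, "I": 2.2, "J": -0.5,
}

values = list(growth_by_country.values())

# Intended calculation: average over all ten countries.
correct_mean = sum(values) / len(values)

# Spreadsheet-style slip: the formula's range stops five rows short,
# silently excluding countries F..J from the average.
truncated = values[:5]
buggy_mean = sum(truncated) / len(truncated)

print(f"correct mean growth: {correct_mean:.2f}%")   # 1.20%
print(f"buggy mean growth:   {buggy_mean:.2f}%")     # 1.02%
```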
The KPLEX team was pleased to present the project at the Big Data PPP Info Day in Luxembourg, 17–18 January 2017. Although categorised by our panel chair as ‘different,’ the project concept was well received and prompted a number of good questions and leads for future work. We look forward to contributing more in the future to the widening of perspectives and the development of RRI in big data research.