A KPLEX Primer for the Digital Humanities

Definitions of terms relevant for the study of Big Data

The big data research field is full of terminology, some of which will be familiar to humanists and some of which may not be. Even when it is familiar, it may mean something quite different from what we might expect. What follows is an interpretive glossary, intended not so much to add one more claim to authority as to translate some of these terms into a humanist’s range of experience and frame of reference, reflecting the experiences of the KPLEX project. It is based on our experience of working across disciplines: not only the humanities and computer science, but also equally distinct fields such as Science and Technology Studies (STS). We share it at the encouragement of our project reviewers.

Please note also that in spite of our respect for the Latin roots of the word and its linguistically plural status, we use ‘data’ grammatically as a singular collective noun.

Algorithm – this is a set of rules according to which computational systems operate. Given the statistical underpinning of many big data algorithms, they can be sources of objective processing or introduce bias, depending on the model of the world on which they are based.
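
To make this concrete, here is a minimal, hypothetical sketch in Python (the records and the ranking rule are invented, not drawn from the KPLEX project or any real catalogue). The rule itself is tiny, but it embeds a model of the world in which popularity stands in for relevance, quietly disadvantaging rarely consulted material.

# A hypothetical algorithm: a fixed rule for ranking records in an
# imaginary catalogue. The rule assumes that frequently viewed records
# are the most relevant ones, a modelling choice, not a neutral fact.

records = [
    {"title": "Parish register, 1743", "views": 12},
    {"title": "Popular broadside ballad", "views": 950},
    {"title": "Letters of an unknown weaver", "views": 3},
]

def rank_by_popularity(items):
    """Rule: more views means more relevant."""
    return sorted(items, key=lambda r: r["views"], reverse=True)

for record in rank_by_popularity(records):
    print(record["title"])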

Artificial Intelligence (AI) – where computation-based systems are able to move beyond simple logical or rule-based results, or are programmed using a training methodology, the system is said to demonstrate artificial intelligence.

Big Data – there is no single definition or measure of size that determines what data is big. Indeed, the most common definitions rely on a number of measures (often four, though many lists give more), not all of which actually measure size, but many of which are described by terms beginning with the letter ‘V’: e.g. volume, variety, velocity and veracity.

Capta – this term (generally credited to Johanna Drucker) exists as an alternative to ‘data,’ stemming from the Latin word for ‘to take’ rather than ‘to give.’ Use of it reflects the recognition that data is never ‘given’ but in fact always ‘taken’ by human agency.

Complexity – representing a phenomenon in a digital environment requires it to be expressed somehow in binary code of 1s and 0s. But we do not know how to capture and express every aspect of an object or phenomenon in this way. Humans receive information in multimodal ways – through sight, smell, touch and hearing, for example. Much of the information we receive is understood only tacitly. All of these layers, tacit and explicit, mean that objects and phenomena in the real world generally possess a greater complexity than their digital surrogates (though digital systems can possess a complexity of their own). In digitisation, we generally capture only part of a phenomenon, thereby reducing its complexity for the next user of that surrogate. How this choice of what to capture is made is also referred to as tuning the signal-to-noise ratio (with signal being what you think is important, and noise the unimportant or distracting elements). Needless to say, your tuning will generally depend on your needs or values regarding a dataset.
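
As a purely illustrative sketch (the letter and its properties are invented, and the Python here only makes the point tangible), the fragment below shows two different ‘tunings’ of the same object: each surrogate keeps what one kind of user counts as signal and discards what that user counts as noise.

# Two surrogates of the same imaginary letter, each capturing only part
# of the original's complexity.

original_letter = {
    "text": "Dear Margaret, the harvest has failed again...",
    "ink_colour": "iron gall brown",
    "paper_damage": "water stain, lower left",
    "marginal_note": "reply sent 3 May",
    "smell": "damp cellar",  # tacit information, hard to capture at all
}

# Tuning 1: a linguist's surrogate keeps only the words.
linguistic_surrogate = {"text": original_letter["text"]}

# Tuning 2: a conservator's surrogate treats the words as noise.
material_surrogate = {
    "ink_colour": original_letter["ink_colour"],
    "paper_damage": original_letter["paper_damage"],
}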

Computational – a term associated with quantitative (or mixed) research approaches, referring not to research processes at the macro level but to the specific mathematical processing of numerical information as a component of such an approach, often through the application of an algorithm.

Context – the process of datafication is by necessity a simplification. One aspect of this simplification is often the removal or reduction of context, that is, the rich set of interconnections and relationships that any given object or phenomenon possesses.

Curation – a process by which a selection or organisation of items is made from a larger possible collection. Can also refer to the presentation, integration or annotation of such a dataset.

Data – you need only read a little of this report to realise that different communities define data in different ways: even within a single community (or indeed within a single research paper), consistent definition and use of this key term is often not the norm. It is therefore important in any interdisciplinary collaboration involving talk of data to ensure that the parties involved make explicit what they mean by data, and what the implications of this working definition are.

Data Cleaning, also known as data scrubbing – a process by which elements are removed from a dataset or stream, generally because they interfere with the desired processing. This process is viewed by some communities as a central part of good research practice; others, however, view data cleaning as a form of data manipulation that erodes the credibility of research based upon it.
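
A minimal sketch of what this can look like in practice, using invented records: entries whose date is missing or cannot be read as a plain year are dropped so that later processing does not fail. The example also hints at why some see cleaning as a loss.

# Data cleaning on a made-up set of records: keep only entries with a
# usable year.

raw_records = [
    {"title": "Minute book", "year": "1887"},
    {"title": "Loose photograph", "year": None},   # no date recorded
    {"title": "Estate map", "year": "c. 1750?"},   # uncertain date
    {"title": "Account ledger", "year": "1901"},
]

cleaned = [r for r in raw_records if r["year"] and r["year"].isdigit()]
# cleaned now holds only the minute book and the ledger; the undated and
# uncertainly dated items, often the most interesting ones, are gone.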

Datafication – a process generally understood as the rendering of original state objects in digital, quantified or otherwise more structured streams of information.

Data Manipulation – see Data Cleaning.

Data Scrubbing – see Data Cleaning.

Digital – this may seem obvious, but something is digital when it has been converted into a representation consisting of 1s and 0s (binary code) that can be read by a computer. Digital surrogates of analogue objects do not necessarily carry all of the properties of the original, however: for example, a page of text can be digitised as a photograph, but that does not mean that the computer can read what is printed on the page, only that it has captured a sequence of lighter and darker pixels.
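
A small illustrative sketch (toy values, not a real digitisation workflow) of the difference between a digital image of a page and digital text: to the computer, the image is only a grid of numbers, while transcribed text can be searched directly.

# A scanned page, to the computer, is just a grid of pixel brightnesses:
page_image = [
    [255, 255, 240, 12, 15, 250],
    [255, 248, 10, 8, 244, 255],
]

# The same page after transcription (or OCR) is text the machine can search:
page_text = "To be, or not to be, that is the question"

print("question" in page_text)  # True: the text is machine-readable
# No comparable query is possible on page_image without further processing.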

Documentation – this term has a rich history, but its formal use is generally attributed to Suzanne Briet. It is a process similar to datafication, but coined in an age before computers, referring to a number of technological and human processes that create a durable record of an object (for example, a geological formation, which cannot itself be held in a museum), an event (such as a performance), or other human activity or natural phenomenon.

DIKW – a commonly used acronym to capture the relationships between different kinds or states of the building blocks of understanding. It stands for ‘Data, Information, Knowledge, Wisdom.’ KPLEX findings have shown that it is flawed in its lack of recognition of the positionality of knowledge states.

Information architecture – data must be organised according to some kind of framework, and the major and minor elements in this framework can be referred to as its architecture (just as a house will have a certain collection of rooms and hallways into which furniture – data – can be placed).

Machine learning – a process by which computational intelligence is developed by presenting a system with a large body of information, from which it extracts rules and patterns without direct human oversight of exactly what patterns it extracts. Also referred to as training. A common component of building AI.
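
The following deliberately tiny sketch (invented data, and far simpler than any real machine learning method) shows the basic idea: rather than a programmer writing the rule, the system derives one, here a word-count threshold for telling letters from reports, from labelled examples.

# Toy "training data": (number of words in a document, label).
training = [(120, "letter"), (90, "letter"), (4000, "report"), (5200, "report")]

# "Training": place a threshold midway between the longest letter and the
# shortest report seen in the examples.
longest_letter = max(n for n, label in training if label == "letter")
shortest_report = min(n for n, label in training if label == "report")
threshold = (longest_letter + shortest_report) / 2

def classify(word_count):
    return "report" if word_count > threshold else "letter"

print(classify(300))  # "letter": a pattern extracted from data, not coded by hand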

Metadata – literally, data about data. Just as catalogue information describes a collection of books, or a jukebox listing a collection of songs, metadata is a shorthand, often standardised, description of the objects in a collection.
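
For example, a metadata record can be as simple as the sketch below (the field names loosely echo common descriptive practice but are purely illustrative, and the shelfmark is hypothetical):

# A shorthand description of a book; the catalogue stores this record,
# while the book itself stays on the shelf.
book_metadata = {
    "title": "Middlemarch",
    "creator": "George Eliot",
    "date": "1871-1872",
    "format": "print",
    "identifier": "shelfmark XYZ 123",  # hypothetical shelfmark
}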

Narrative – this term seems to have an interesting relationship to data, reflecting the common (erroneous) claim that data is somehow closer to objectivity than other forms by which information is communicated. Narrative is seen by many as the other side of that coin, the ‘story’ that someone either wants to or finds a way to tell using data (with the common implication that this ‘story’ does not reflect an objective position).

Native Data – this term can have a couple of meanings. In the context of commercial or other software development, it may refer to data formats that are most compatible with a particular system, or indeed created by it. For example, a Word document is most easily read in that software environment: viewing it through another word processor may cause errors or artefacts to appear. Native data is also, however, used more generally to refer to data that remains close to its original context of datafication, as it exists or existed in its indigenous environment, that is, the source context from which it was extracted: in other words, the original data plus its environmental or source context.

Neural Networks – a form of computational system commonly associated with artificial intelligence.

Paradata – sometimes called ‘data byproducts,’ paradata are some of the most interesting forms of data (to the humanist, at least) because of their dependence on, and interrelatedness with, other data streams. Notes in the margin of a book are a good example of paradata, as they document the connection between the thoughts of the reader and the text being read.

Processing – taking data and changing it, by filtering, reorganising, interrogating or otherwise applying some set of transformations (permanent or temporary) to it.

Provenance – a particular perspective on context, central in the world of museums, libraries and archives. Provenance refers to the record of where an object, collection or dataset has come from, and the places and ‘experiences’ (additions, transformations, losses, audiences, etc.) it has had since its original documentation.

Raw Data – what this phrase refers to is, as Lisa Gitelman puts it, an oxymoron. And yet the term persists. In theory, it refers to data that has not undergone any processing, or more specifically any further processing, as datafication is itself a transformative process. It can also refer to data that has yet to be interpreted or marked in any way by human perspectives.

Source data – indicates the source of the data for any given research project or use. For some researchers, their source data may not be native or raw; it may already be data proper and have undergone extensive processing, and whether or not they recognise this is part of how they situate their use of the data and their results.

Social Costs – since the time of Plato, it has been recognised that knowledge technologies bring both costs and benefits to their users. Writing may have led to the preservation and dissemination of many texts, but it also certainly weakened the habits and capabilities associated with the oral tradition. Digital technologies, such as data-driven research and big data collection, have similar ‘pharmakon’-like properties, meaning they can both help and hurt. In the contemporary context, however, weighing the ‘costs’ can be a challenging process, in particular because companies can generate profit by exploiting, or at least skirting the edge of, practices and products that may have negative effects on individuals’ health or overall social cohesion. It is therefore more important than ever that impacts on issues like identity, culture, privacy and cohesion are considered and (where necessary) defended against.

Standards – many communities agree to use the same sets of descriptors for certain features of a collection of objects or data: for example, library standards such as MARC 21, or the standard sizes of sheets of paper (A5, A4, A3, etc.). Community bodies can sometimes manage and validate standards (as the Society of American Archivists does for the EAD, or Encoded Archival Description); many others are described and promoted by the International Organization for Standardization (ISO). There is overlap between standards in this sense and other tools for restricting and aligning how data is described, such as controlled vocabularies (one well-known example is the Getty Art & Architecture Thesaurus, which provides standardised terms related to art and architecture).

Structured Data – in a database, small information elements will be organised and grouped at a granular level, like with like, such as a list of titles or names. Unstructured data will be more like narrative text, with no external order imposed to enhance processability.
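
A brief illustrative contrast, with invented content: the structured version can be sorted or filtered field by field, while the unstructured version holds the same information as narrative text that a machine cannot process without further interpretation or parsing.

# Structured: like grouped with like, each element individually addressable.
structured = [
    {"title": "On the Origin of Species", "author": "Darwin", "year": 1859},
    {"title": "Silent Spring", "author": "Carson", "year": 1962},
]
print(sorted(r["year"] for r in structured))  # easy to process: [1859, 1962]

# Unstructured: the same information as narrative text, with no imposed order.
unstructured = (
    "Darwin published On the Origin of Species in 1859, and just over a "
    "century later, in 1962, Carson's Silent Spring appeared."
)
# Extracting the years from this string requires interpretation or parsing.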

Trust – this is a fundamental component of data-driven research, big or otherwise, and means very different things to different people. Trust is a component of how we interact with digital environments and sources: it is the first thing any digital environment needs to inspire in us (which usually occurs via proxies and heuristics that may or may not actually speak to the basis on which a source should be trusted). A trusted environment or dataset may be seen to have authority, that is, a claim to greater knowledge or experience than other possible sources.