When we think of big data, we don’t think of unstructured text data. Mostly we don’t even think of (weakly) structured data like XML – we conceive of big data as being organized in tables, with columns (variables) and rows (observations); lists of numbers with labels attached. But where do the variables come from? Variables are classifiers; and a classification, as Geoffrey Bowker and Susan Leigh Star have put it in their inspiring book “Sorting Things Out”, is “a spatial, temporal, or spatio-temporal segmentation of the world. A ‘classification system’ is a set of boxes (metaphorical or literal) into which things can be put to then do some kind of work – bureaucratic or knowledge production. In an abstract, ideal sense, a classification system exhibits the following properties: 1. There are consistent unique classificatory principles in operation. […] 2. The categories are mutually exclusive. […] 3. The system is complete.”[1] Bowker and Star describe classification systems as invisible, erased by their naturalization into the routines of life; in them, conflict, contradiction, and multiplicity are often buried beneath layers of obscure representations.
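To make the second and third of these ideal properties tangible – purely as an illustration, not as anything Bowker and Star themselves propose – they can be restated as simple checks on a toy classification. The following minimal Python sketch uses invented categories and items:

```python
# Toy illustration of two of Bowker and Star's ideal properties.
# All categories and items are invented for the example.
universe = {"letter", "parcel", "postcard", "telegram"}

classification = {
    "flat_mail": {"letter", "postcard"},
    "bulky_mail": {"parcel"},
}

def mutually_exclusive(classes):
    """Property 2: no item may fall into two boxes at once."""
    seen = set()
    for members in classes.values():
        if seen & members:          # an item already placed in another box
            return False
        seen |= members
    return True

def complete(classes, universe):
    """Property 3: every item in the universe must land in some box."""
    classified = set().union(*classes.values())
    return universe <= classified

print(mutually_exclusive(classification))   # True
print(complete(classification, universe))   # False – "telegram" has no box
```

The first property, consistent and unique classificatory principles, resists this kind of mechanical checking; it lives in the judgment of the people who build and use the system.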
Humanists are well accustomed to this kind of approach in general and to classifications and their consequences in particular; classification is, so to speak, the administrative part of philosophy and also of disciplines like history or anthropology. Any humanist who has occupied himself with Saussurean structuralism or Derrida’s deconstruction knows that each term (and potential classifier) is inseparably linked to other terms: “the proper name was never possible except through its functioning within a classification and therefore within a system of differences,” Derrida writes in “Of Grammatology”.[2] Proper names, as indivisible units, thus form residual categories. In the language of data-as-tables, proper names would correspond to variables, and where they don’t form a variable, they would be sorted into a ‘garbage category’ – the infamous and ubiquitous “other” (“if it is not that and that and that, then it is – other”). Garbage categories are those columns where things get put when you do not know what to do with them. But in principle, garbage categories are okay; they can signal uncertainty at the level of data collection. As Derrida’s work reminds us, we must be aware of exclusions, even where they are explicit.
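What such a ‘garbage category’ looks like at the level of data collection can be sketched in a few lines of hypothetical Python; the classifier values and records are invented for the example:

```python
# Hypothetical sketch: sorting observations into known categories, with a
# "garbage category" ("other") for whatever the schema cannot place.
KNOWN_CATEGORIES = {"physician", "nurse", "midwife"}   # invented classifier values

def classify(value):
    """Return the category for a recorded value, falling back to 'other'."""
    normalized = value.strip().lower()
    if normalized in KNOWN_CATEGORIES:
        return normalized
    # The 'other' box does not explain the value; it only records that the
    # schema could not place it – uncertainty made visible, not resolved.
    return "other"

records = ["Physician", "barber-surgeon", "Nurse", "bone-setter"]
for record in records:
    print(record, "->", classify(record))
# "barber-surgeon" and "bone-setter" end up in "other": the residual category
# signals an exclusion instead of silently dropping it.
```

The point of the fallback is not to resolve the ambiguity but to keep it visible in the data.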
The most famous scholar who has occupied himself with implicit exclusions from a historical perspective is certainly Michel Foucault. In examining the formative rules of powerful discourses as exclusion mechanisms, he analyzed how power constitutes the “other”, and how standards and classifications are suffused with traces of political and social work. In the chapter “The Birth of the Asylum” of his book “Madness and Civilization”,[3] for example, he describes how the categories of ‘normal’ and ‘deviant’, and classifications of forms of ‘deviance’, go hand in hand with forms of treatment and control. Something termed ‘normal’ is linked to modes of conduct, standards of action and behavior, and to judgments on what is acceptable and unacceptable. In his conception, the ‘other’ or ‘deviant’ is to be found outside the discourse of power and therefore does not take part in the communication between the powerful and the ‘others’. Categories and classifications created by the powerful justify diverse forms of treatment by individuals in professional capacities, such as physicians, and in political offices, such as gaolers or legislators. A present-day equivalent of the historical setting analyzed by Foucault can be found in classification systems like the International Classification of Diseases (ICD) or the Diagnostic and Statistical Manual (DSM). Doctors, epidemiologists, statisticians, and medical insurance companies work with these classification systems, and there are certainly equivalent ‘big data’ containing these classifications as variables. Not only are the ill excluded from taking part in the establishment of classifications, but so are other medical cultures and their systems of classification of diseases. Traditional Chinese medicine is a prominent example here, and the historical (and later psychoanalytic) conception of hysteria had to undergo several major revisions before it eventually entered the ICD-10 as “dissociative disorders” (F44) and “histrionic personality disorder” (F60.4). Here Foucault’s work reminds us that power structures are implicit and thus invisible. We are admonished to render classifications retrievable, and to include the public in policy participation.
A third example that may come to the humanist’s mind when thinking of classifications is the anthropologist Mary Douglas. In her influential work “Purity and Danger”, she outlines the inseparability of seemingly distinct categories. One of her examples is the relation of sacredness and pollution: “It is their nature [of religious entities, J.L.] always to be in danger of losing their distinctive and necessary character. The sacred needs to be continually hedged in with prohibitions. The sacred must always be treated as contagious because relations with it are bound to be expressed by rituals of separation and demarcation and by beliefs in the danger of crossing forbidden boundaries.”[4] Sacredness and pollution are thus in permanent tension with each other and create their distinctiveness out of a permanent process of exchange. This process undermines classification – for Douglas, classifiers have to run over into each other (as is the case with the 5-point Likert scale), or classification systems have to be conceived of as heterogeneous lists or as parallel, differing lists. Douglas’s work thus reminds us of the need to incorporate ambiguity and to leave certain terms open for multiple definitions.
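If one wanted to model Douglas’s point in data terms – purely as a hypothetical sketch, not something Douglas herself proposes – one could let items carry several, possibly overlapping labels instead of forcing them into exactly one box:

```python
# Hypothetical sketch of a classification that tolerates ambiguity: every item
# carries a set of (possibly overlapping) labels, and parallel lists coexist.
# All labels and items are invented for illustration.
assignments = {
    "spring water": {"pure", "sacred"},
    "flood water": {"polluted", "dangerous"},
    "holy water": {"sacred", "contagious"},   # categories run over into each other
}

def items_with(label, assignments):
    """Return every item carrying a given label, overlaps included."""
    return {item for item, labels in assignments.items() if label in labels}

print(items_with("sacred", assignments))
# yields "spring water" and "holy water" – "holy water" is both sacred and
# contagious, something a mutually exclusive scheme could not express.
```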
Seen from a humanist’s point of view, big data classifications pretend to a false objectivity, insofar as they help us forget their political, cultural, moral, or social origins, as well as their constructedness. It lies beyond the tasks of the KPLEX project to analyze big data classifications for their implications; but if anybody is aware of relevant studies, references are most welcome. What we can all do, however, is constantly remind ourselves that there is no such thing as an unambiguous, uniform classification system implemented in big data.
[1] Geoffrey C. Bowker and Susan Leigh Star, Sorting Things Out: Classification and Its Consequences, Cambridge, MA / London: The MIT Press, 1999, pp. 10–11.
[2] Jacques Derrida, Of Grammatology, translated by Gayatri Chakravorty Spivak, Baltimore and London: The Johns Hopkins University Press, 1998, chapter “The Battle of Proper Names”, p. 120.
[3] Michel Foucault, Madness and Civilization: A History of Insanity in the Age of Reason, translated by Richard Howard, New York: Vintage Books, 1988 (first published 1965), pp. 244ff.
[4] Mary Douglas, Purity and Danger: An Analysis of the Concepts of Pollution and Taboo, London and New York: Routledge, 2001 (first published 1966), p. 22.