Veracity and Value: Two more “Vs” of Big Data

So far we have learnt about the three most popular criteria of big data: volume, velocity and variety. Jennifer Edmond suggested adding voluptuousness as a fourth criterion of (cultural) big data.

I will now discuss two more “Vs” of big data that are often mentioned: veracity and value. Veracity refers to source reliability, information credibility and content validity. In the book chapter “Data before the Fact”, Daniel Rosenberg (2013: 37) argued: “Data has no truth. Even today, when we speak of data, we make no assumptions at all about veracity”. Many other scholars agree with this; see: Data before the (alternative) facts.

What has been questioned for “ordinary” data seems to hold true for big data. Is this because big data is thought to comprise statistical population data, not just data of a sample? Does the assumed totality of data reveal the previously hidden truth? Instead of relying on a model or on probability distributions, we could now assess and analyse data of the entire population. But apart from the implications for statistical analysis (higher chances of getting false positives, the need for stricter statistical significance levels, etc.) there are even more fundamental problems with the veracity of big data.
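
The remark about false positives can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes independent tests in which no real effect exists at all, and the function names and the example figures (1,000 tests at a 5% level) are my own choices, not taken from any of the sources cited here.

```python
def expected_false_positives(num_tests: int, alpha: float) -> float:
    """Expected number of spurious 'significant' results when every null hypothesis is true."""
    return num_tests * alpha


def familywise_error_rate(num_tests: int, alpha: float) -> float:
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float) -> float:
    """Per-test significance level that keeps the family-wise rate at or below alpha."""
    return alpha / num_tests


if __name__ == "__main__":
    tests, alpha = 1000, 0.05  # assumed example figures
    print(expected_false_positives(tests, alpha))
    print(familywise_error_rate(tests, alpha))
    print(bonferroni_threshold(tests, alpha))
```

Under these assumptions, a thousand tests at the 5% level produce about fifty "findings" by chance alone, which is why a much stricter per-test level (here the Bonferroni correction, one common choice among several) is needed.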

 


Take the case of Facebook emoji reactions. They were introduced in February 2016 to give users the opportunity to react to a post by tapping either Like, Love, Haha, Wow, Sad or Angry. Not only is the choice of affective states very limited and the expression of mixed emotions impossible, but the ambiguity in the use of these expressions is itself problematic. Facebook reminds its users: “It’s important to use Reactions in the way it was originally intended for Facebook — as a quick and easy way to express how you feel. […] Don’t associate a Reaction with something that doesn’t match its emotional intent (ex: ‘choose angry if you like the cute kitten’)”. Yet we know that human perceptions, motives and ways of acting and reacting are manifold. Emojis can be used to add emotional or situational meaning, to adjust tone, to make a message more engaging to the recipient, to manage conversations or to maintain relationships. The social and linguistic functions of emojis are complex and varied. Big data in the case of Facebook emoji reactions then seems to be as pre-factual and rhetorical as “ordinary” data.

Value, in turn, refers to the social and economic value that big data might create. When reading documents like the European Big Data Value Strategic Research Innovation Agenda, one gets the impression that economic value dominates. The focus is directed to “fuelling innovation, driving new business models, and supporting increased productivity and competitiveness”, “increase business opportunities through business intelligence and analytics” as well as to the “creation of value from Big Data for increased productivity, optimised production, more efficient logistics”. Big data value is not speculative anymore: “Data-driven paradigms will emerge where information is proactively extracted through data discovery techniques and systems are anticipating the user’s information needs. […] Content and information will find organisations and consumers, rather than vice versa, with a seamless content experience”.

Facebook emoji reactions are just one example of this trend. Analysing users’ reactions not only makes it possible to “better filter the News Feed to show more things that Wow us” but probably also to change consumer behavior and sell individualized products and services.


Featured image was taken from Flickr.

From analogue to proto-digital databases

Databases as collections of data are not a new phenomenon. Collections began to emerge all over the world several centuries ago, as the manuscript collections of Timbuktu (in medieval times a centre for Islamic scholars) demonstrate. The number of these manuscripts is estimated at about 300,000, covering domains such as Qur’anic exegesis, Arabic language and rhetoric, law and politics, astronomy and medicine, trade reports, etc.

Most people’s memory does not go back that far. They are more likely to relate today’s databases to the efforts to establish universalizing classification systems, which began in the nineteenth century.

The transition to digital databases took place only very recently, which explains why many databases are still in the process of being digitized.

I will present the database eHRAF World Cultures to illustrate this point. This online database originated as the “Collection of Ethnography” of the research programme “Human Relations Area Files”, which started back in the 1940s at Yale University. The original aim of the anthropologist George Peter Murdock was to allow for global comparisons in terms of human behaviour, social life, customs, material culture, and human-ecological environments. To implement this research endeavour it was thought necessary “to have a complete list of the world’s cultures – the Outline of World Cultures, which has about 2,000 described cultures – and include in the HRAF Collection of Ethnography (then available on paper) about ¼ of the world’s cultures. The available literature was much smaller then, so the million or so pages collected could have been about ¼ of the existing literature at that time”.

From the 1960s onwards, the contents of this collection of monographs, journal articles, dissertations, manuscripts, etc. were converted to microfiche, before the digitization of the database was launched in 1994. The first online version of the database “eHRAF World Cultures” became available in 1997. This digitization process is far from complete. To this day, an additional 15,000 pages from the microfiche collection are converted and integrated into the online database every year. Currently the database contains data about more than 300 cultures worldwide.


So what makes this database proto-digital?

First of all, there is the search function. When the subject indexing was done – at the paragraph level (!) – it was done manually. The standard that provided the guidelines for what to index in the texts, and how, is called the Outline of Cultural Materials and was very elaborate for its time. It comprises more than 700 topic identifiers, clustered into more than 90 subject groups.

The three-digit numbers, e.g. 850 for the subject group “Infancy and Childhood” or 855 for the subject “Child Care”, are meant to facilitate the search for concepts and the retrieval of data in languages other than English. And although Boolean searches allow combinations of subject categories and key words, cultures, countries or regions, one has to adopt the logic of this ethnographic classification system in order to carry out purposeful search operations. The organisation of the database was obviously conceptualised in a hierarchical way. If you want to get a particular piece of information, you look up the superordinate concept and decide which subjects of this group you need to apply to your research to get the expected results.
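
To make this search logic more tangible, here is a minimal sketch of a Boolean search over paragraphs indexed with subject codes. It is not eHRAF’s actual interface or data model: the `Paragraph` class, the `search` function, the sample paragraphs and the culture names are all invented for illustration. The rule that a subject belongs to the group obtained by rounding its code down to the tens is likewise an assumption, inferred only from the two examples above (855 and 850).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Paragraph:
    culture: str
    text: str
    ocm_codes: frozenset = field(default_factory=frozenset)


def subject_group(code: int) -> int:
    """Assumed rule: a subject's group is its code rounded down to the tens (855 -> 850)."""
    return code - code % 10


def search(paragraphs, all_codes=(), group=None, keyword=None, culture=None):
    """Boolean AND of every filter that is given: subject codes, subject group, keyword, culture."""
    results = []
    for p in paragraphs:
        if not set(all_codes) <= p.ocm_codes:
            continue  # paragraph lacks at least one required subject code
        if group is not None and not any(subject_group(c) == group for c in p.ocm_codes):
            continue  # no code of this paragraph falls into the requested group
        if keyword and keyword.lower() not in p.text.lower():
            continue
        if culture and p.culture != culture:
            continue
        results.append(p)
    return results


# Invented sample data; only the codes 850 and 855 come from the text above.
SAMPLE = [
    Paragraph("Culture A", "Older siblings look after toddlers during harvest.", frozenset({855})),
    Paragraph("Culture B", "Grandmothers share child care with the mother.", frozenset({855})),
    Paragraph("Culture B", "A general note on infancy.", frozenset({850})),
]

if __name__ == "__main__":
    for hit in search(SAMPLE, all_codes=[855], keyword="grandmothers"):
        print(hit.culture, "|", hit.text)
```

Even in this toy version, a keyword search for “childcare” would find nothing, whereas knowing that the concept is filed under 855, within group 850, retrieves all relevant paragraphs. That is the sense in which the user has to think within the classification system.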

Secondly, although the “Outline of Cultural Materials” thesaurus is continually being extended, there is no system for providing updates. A new list of subjects and subject groups is published only once a year (online, in PDF and in print).

Thirdly, data that would help to localise cultural groups more precisely, such as GIS data (latitude and longitude coordinates), are not available in eHRAF.

Finally, users can print or email search results and selected paragraphs or pages from documents, but there is no feature to export data from eHRAF into (qualitative) data analysis software. The “eHRAF World Cultures” database is also not compatible with OpenURL.

The way from analogue to digital databases is apparently a long and difficult one. The curatorial agency of the database structure and the still discernible influence of the people who assigned the subjects to the database materials should now be a bit clearer.


Featured image was taken from http://hraf.yale.edu/wp-content/uploads/2013/11/HRAF-historic.jpg

Whose context?

When Christine Borgman (2015) mentions the term “native data” she is referring to data in its rawest form, with context information like communication artefacts included. In terms of NASA’s EOSDIS Data Processing Levels, “native data” even precede level 0, meaning that no cleaning has been performed at all. Scientists who begin their analysis at this stage do not face any uncertainty about what this context information is. It is simply the recordings and the output of instruments, predetermined by the configuration of the instruments. NASA researchers may therefore count themselves lucky to obtain this kind of reliable context information.

Humanists’ and social scientists’ point of departure is quite different. Anthropologists, for example, would probably use the term “emic” for their field research data. “Emic” here stands in contrast to “etic” and has been derived from the distinction in linguistics between phonemics and phonetics: “The etic viewpoint studies behavior as from outside of a particular system, and as an essential initial approach to an alien system. The emic viewpoint results from studying behavior as from inside the system” (Pike 1967: 37). An example of the emic viewpoint might be the correspondences between pulses and organs in Chinese medical theory (see picture below) or the relation of masculinity to maleness in a particular cultural setting (MacInnes 1998).

[Image: Chinese woodcut: Correspondences between pulses and organs (L0038821)]

The emic context for anthropologists, then, depends on the particular cultural background of their research participants. Dissociated from this cultural background and transferred into an etic context, data may become incomprehensible. Take, for example, Kosovo: a sovereign state from an emic point of view, but recognized by only 111 UN member states. In this transition from emic to etic context, the etic context obviously becomes an imposed context.

Applied to libraries, archives, museums and galleries, it might equally be important to know the provenance and original use, so to speak the emic context, of the resources. What functions did the materials have for the author or creator? Knowing the “experience-near” and not only the “experience-distant” meanings of materials would increase their information content and transparency. One could also say that providing this additional “emic” metadata enables traceability to the source context and guarantees the credibility of the data. From an operational viewpoint, however, it would recreate the problem of standards and of making data findable.

If we move up to the next level, the metadata of each GLAM institution could be said to be emic, reflecting how the curators in that institution understand the data structure. Currently there are over a hundred different metadata standards in use. Again, the aggregation of several metadata standards into a unified metadata standard creates the same problem: the transfer from an emic standard (an institution’s inherent metadata standard) into an etic one.
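
What such a transfer does to a record can be sketched as a simple “crosswalk” from an institution’s own field names to those of a shared schema. Everything in the sketch below is hypothetical: the field names, the mapping and the sample record are invented for illustration and do not reproduce any real metadata standard.

```python
# Hypothetical mapping from one museum's local ("emic") field names
# to the field names of a shared ("etic") schema.
MUSEUM_TO_SHARED = {
    "object_name": "title",
    "maker": "creator",
    "acquired_from": "provenance",
}


def crosswalk(record: dict, mapping: dict) -> tuple:
    """Translate a local record into shared field names.

    Returns (shared_record, unmapped_fields); the second dict holds
    everything the shared schema has no place for.
    """
    shared, unmapped = {}, {}
    for field_name, value in record.items():
        target = mapping.get(field_name)
        if target is None:
            unmapped[field_name] = value  # emic detail without an etic counterpart
        else:
            shared[target] = value
    return shared, unmapped


# Invented sample record.
MUSEUM_RECORD = {
    "object_name": "Woodcut",
    "maker": "unknown",
    "ritual_use": "described by the donor",
}

if __name__ == "__main__":
    shared, unmapped = crosswalk(MUSEUM_RECORD, MUSEUM_TO_SHARED)
    print(shared)
    print(unmapped)
```

The second return value is the point of the exercise: whatever lands in `unmapped` is institution-specific knowledge that an imposed, unified standard would silently drop unless it is extended or negotiated.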

So what is the solution? Unless GLAM-institutions are willing to accept an imposed standard there remains only the possibility of a mutual convergence and ultimately an inter-institutional consensus.

——
Borgman, Christine L. (2015) Big Data, Little Data, No Data. Scholarship in the Networked World. Cambridge: MIT Press.
MacInnes, John (1998) The end of masculinity. The confusion of sexual genesis and sexual difference in modern society. Buckingham: Open University Press.
Pike, Kenneth L. (1967) Language in Relation to a Unified Theory of the Structure of Human Behavior. The Hague: Mouton.

Featured image was taken from http://www.europeana.eu/portal/de/record/9200105/wellcome_historical_images_L0038821.html

Will Big Data render theory dispensable?

Scientific theories have been crucial for the development of the humanities and social sciences. Metatheories such as classical social evolution, cultural diffusion, functionalism or structuralism for example guided early anthropologists in their research process. Postmodern theorists rightly criticized their predecessors among other things for their deterministic theoretical models. Their criticism however was still based on theoretical reflections, although many tried to reduce their theoretical bias by combining several perspectives and theories (cf. theory triangulation).

Whereas it was common in the humanities to keep track of “disproven or refuted theories” there could be a trend among proponents of a new scientific realism to put the blinkers on and solely focus on progress towards a universal, objective and true account of the physical world. Even worse, theory could be discarded altogether. Big data might revolutionise the scientific landscape. From the point of view of the physicist Chris Anderson the “end of theory” is near: “Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves”.


This approach towards science might gain ground. Digital humanists are said to be “nice”, due to their concern with method rather than theory. Methodological debates in the digital humanities seem to circumnavigate more fundamental epistemological debates on principles. But big data is not self-explanatory. Explicitly or implicitly, theory plays a role when it comes to collecting, organizing and interpreting information: “So it is one thing to establish significant correlations, and still another to make the leap from correlations to causal attributes” (Bollier 2010). Theory matters for making the semantic shift from information to meaning.

In order to understand the process of knowledge production we must keep an eye on the mutually constitutive spheres of theory and practice. In the era of big data Bruno Latour’s conclusion: “Change the instruments, and you will change the entire social theory that goes with them” is hence more important than ever.


Featured image was taken from https://vimeo.com/