To err is human – but computers can make mistakes too

Imagine an automated rating of CVs in order to decide who should be granted a scholarship or which job candidate should be hired. This is not science fiction. Big private companies increasingly rely on such mechanisms to make hiring decisions. They analyse the data of their employees to find the patterns underlying success. The characteristics of job candidates are then matched with those of successful employees, and the system recommends the candidates with the most similar characteristics. Much less time and effort is needed to choose the “right” candidates from a pool of promising applicants. Certainly, the human resources department has to reflect on which characteristics to choose and how to combine and weight them, but the recommendations based on the analysis of big data seem to be very efficient.
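
To make the mechanism concrete, here is a minimal sketch of the matching logic described above – not any company’s actual system; the traits, target values and weights are invented for illustration:

```python
# A toy candidate-matching sketch: candidates are scored by how closely their
# weighted characteristics resemble a profile derived from "successful" employees.
# Traits, target values and weights are invented for illustration.
SUCCESS_PROFILE = {"years_experience": 5, "certifications": 2, "job_changes": 1}
WEIGHTS = {"years_experience": 0.5, "certifications": 0.3, "job_changes": 0.2}

def similarity(candidate: dict) -> float:
    """Weighted closeness of a candidate to the average successful employee."""
    score = 0.0
    for trait, target in SUCCESS_PROFILE.items():
        # The closer a candidate's value is to the target, the larger the contribution.
        score += WEIGHTS[trait] * (1 / (1 + abs(candidate.get(trait, 0) - target)))
    return score

candidates = {
    "A": {"years_experience": 6, "certifications": 2, "job_changes": 1},
    "B": {"years_experience": 1, "certifications": 0, "job_changes": 4},
}
ranking = sorted(candidates, key=lambda name: similarity(candidates[name]), reverse=True)
print(ranking)  # ['A', 'B'] -- the candidate most similar to past hires comes first
```

The sketch also makes the limitations discussed below tangible: whatever is not encoded in the profile, or deviates from it, simply scores low.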

Making automated, algorithm-produced predictions about an individual by analysing information from many people is nevertheless problematic in several ways. First of all, it requires, among other things, large datasets to avoid bias. Second, the demand for standardized CVs implies that non-normative ways of presenting oneself are excluded a priori. Third, assume that the characteristics of high achievers change over time. The system will continue (at least for some time) to formulate recommendations based on past experience. Such a static model of prediction will be unable to detect potentially innovative candidates with divergent experiences and skills. It thus discriminates against individuals with non-standard backgrounds and motivations. A final problem is that all the data available to the model come from the people who were accepted in the first place and who proved successful or unsuccessful thereafter. Nothing is known about the career paths of the applicants who were rejected.

“In this context, data – long-term, comparable, and interoperable – become a sort of actor, shaping and reshaping the social worlds around them” (Ribes and Jackson 2013: 148). Although it stems from ecological research on stream chemistry data, this statement applies equally to the problem of automatic recommendation systems.

Even worse, not only the CV but also the footprint one leaves in cyberspace might serve as the basis for decision-making. The company Xerox used data mined from its (former) employees to define the criteria for hiring new staff for its 55,000 call-centre positions. The applicants’ data gained from the screening test were compared with the significant, but sometimes unexpected, criteria detected so far. In the case of Xerox, for example, “employees who are members of one or two social networks were found to stay in their job for longer than those who belonged to four or more social networks”.


Whether the social consequences of these new developments can be attributed to humans alone or also to computers is highly controversial. Luciano Floridi (2013) makes the point that we should differentiate between the accountability of (artificial) agents and the responsibility of (human) agents. Does the algorithm discussed above qualify as an agent? Floridi would argue yes, because “artificial agents could have acted differently had they chosen differently, and they could have chosen differently because they are interactive, informed, autonomous and adaptive” (ibid. 149). So even if “it would be ridiculous to praise or blame an artificial agent for its behavior, or charge it with a moral accusation” (ibid. 150), we must acknowledge that artificial agents, as transition systems, interact with their environment, that they can change their state independently, and that they are even able to adopt new transition rules by which they change their state.

The difference between accountability and responsibility should be kept in mind, so that attempts to delegate responsibility to artificial agents can be uncovered. If artificial agents malfunction, the engineers who designed them are required to re-engineer them to make sure they no longer cause evil. And in the case of recruitment decisions, companies should be very careful about how to proceed. There is no single recipe for success.


Floridi, Luciano (2013) The Ethics of Information. Oxford: Oxford University Press.

Ribes, David/ Jackson, Steven J. (2013) Data Bite Man: The Work of Sustaining a Long-Term Study. In: Lisa Gitelman (ed.), “Raw Data” Is an Oxymoron. Cambridge: The MIT Press. 147-166.

Featured image was taken from: https://cdn.static-economist.com/sites/default/files/images/print-edition/20170610_FND000_0.jpg

New Wine into Old Wineskins

The communication of scientific outputs – in other words, the way narratives relate to data – has received much attention in previous KPLEX posts. Questions such as “Do the narratives elaborate on the data, create narrative from the data, or do the narratives reveal the latent richness inherent in the data?” have been raised. These fundamental questions touch upon the very heart of the scientific enterprise. How do we try to grasp the complexity of a phenomenon, and how do we translate our insights and findings into clear language?

Debates in anthropology might be revealing in this regard. The “reflexive turn” that began in the 1970s led anthropologists to ask themselves whether it was possible to create an objective study of a culture when their own biases and epistemologies were inherently involved. Until then they had produced a kind of “realist tale”, focused on the regular and on linking observations to standard categories. Only data that supported the analysis had been admitted; the underanalysed or problematic had been left out. Anthropologists’ worries and efforts had revolved around the possible criticism that their work was “’opinion, not fact’, ‘taste, not logic’, or ‘art, not science’” (Van Maanen 1988: 68). Then a plenitude of new tales emerged that openly discussed the accuracy, breadth, typicality, or generality of their authors’ own cultural representations and interpretations. Grouped under headings like “confessional tale”, “impressionist tale”, “critical tale”, “literary tale” or “jointly told tale”, all these kinds of narratives open up new ways of description and explanation. Issues such as serendipity, errors, misgivings and limiting research roles, for example, are taken up by confessional writers (Karlan and Appel 2016).

How far has this discussion progressed in the sciences? A differentiation analogous to the one between “real events (Geschichte) and the narrative strategies (Historie) used to represent, capture, communicate, and render these events meaningful” has to some extent taken place. Still, the process of constructing scientific facts often happens in a black box and is not revealed to the reader, as the famous study by Bruno Latour and Steve Woolgar has shown. As in the humanities, in the sciences different kinds of statements – ranging from taken-for-granted facts through tentative suggestions and claims to conjectures – contribute to the establishment of “truth”. The combination of researchers, machines, “inscription devices”, skills, routines, research programs, etc. leads to the “stabilisation” of statements: “At the point of stabilisation, there appears to be both objects and statements about these objects. Before long, more and more reality is attributed to the object and less and less to the statement about the object. Consequently, an inversion takes place: the object becomes the reason why the statement was formulated in the first place. At the onset of stabilisation, the object was the virtual image of the statement; subsequently, the statement becomes the mirror image of the reality ‘out there’” (Latour and Woolgar 1979: 176 f.).


So, with regard to the sciences, the questions raised before – “Is it possible for a narrative to become data-heavy or data-saturated? Does this impede the narrative from being narrative?” – have to be answered in the negative. Discursive representations are always implicated when some version of truth is conveyed. In terms of reflexivity there is still room for improvement, e.g. putting the focus not only on the communication of startling facts but also on non-significant results. This would certainly help the practitioners of science to better understand the scope and explanatory power of their disciplinary methods and theories. The hope remains that, unlike in the parable of new wine in old wineskins, the sciences will withstand these changes and not burst.


Karlan, Dean S./ Appel, Jacob (2016) Failing in the Field: What We Can Learn When Field Research Goes Wrong. Princeton: Princeton University Press.

Latour, Bruno/ Woolgar, Steve (1979) Laboratory Life: The Construction of Scientific Facts. Princeton: Princeton University Press.

Van Maanen, John (1988) Tales of the Field: On Writing Ethnography. Chicago: The University of Chicago Press.

Featured image was taken from: https://i.pinimg.com/originals/22/bc/0a/22bc0a8c4573701d9df70f51a971388a.jpg

What’s behind a name?

Could a complete worldwide list of all the names of streets, squares, parks, bridges, etc. be considered big data? Would the analysis of the frequencies and the spatial distribution of these names tell us anything about ourselves?

Such a comparative analysis would miss important information, especially the historical changes of names and the cultural significance embedded therein.

The Ebertstraße in Berlin has changed its name several times: in the 19th century it became Königgrätzer Straße after the Prussian victory over Austria at the Battle of Königgrätz; during the First World War it was renamed Budapester Straße; in 1925 it was given the name Friedrich-Ebert-Straße in memory of the first President of the Weimar Republic. Shortly after the Nazis took power in Germany, the street was renamed Hermann-Göring-Straße after the newly elected President of the Reichstag. Only in 1947 was the street finally renamed Ebertstraße again.

The nearby Mohrenstraße, on the other hand, has borne its name since the beginning of the 18th century. One of the myths about the origin of the street name refers to African musicians who played in the Prussian army. Debates on changing the street name continue, and university departments located in that street have chosen to use Møhrenstraße in the meantime.


So even if street names are not as rich a form of cultural data as the painting of the Mona Lisa, they convey meaning that has been formed, changed and negotiated over a long period of time.

One advantage of dealing with street names rather than maps is that street name data are more reliable; maps have often been manipulated and distorted for military or other reasons.

But in order to reveal the history of street names, one should not restrict oneself to the evidence on, about and of street names, but dig into the events, processes, narratives and politics related to their context of origin. The HyperCities project has set up a digital map that allows for such “thick mapping”.

Certainly such research will itself lead to the creation of narratives – which might in turn be biased – but in the face of historical events, is any objective account possible at all?

The Dream of Knowing the Future

Predicting the future has always been a concern of positivist scientists. The theories and models they constructed claimed not only to represent general laws but also to forecast the prospective outcomes of long-term processes. Take, for example, the forerunner of sociology, Auguste Comte, who tried to explain the past development of humanity and to predict its future course. Even Karl Marx, who criticized the limited conception of cause underlying positivist natural laws, developed a theory of history for Western Europe that saw socialism and communism as the final stages after epochs of slavery, feudalism and capitalism. In addition to description and explanation, predictive power was and is crucial to the scientific worth of theories.

Predictions about the future often rely on a combination of historical data and interpretations informed by current theories. A review of 65 estimates of how many people the earth can support, for instance, shows how widely these differ: “The estimates have varied from <1 billion to >1000 billion”. The review also shows the different methods used for estimating human carrying capacity. The first estimate, made in the 17th century and arriving at 13.4 billion people, extrapolated from the number of inhabitants of Holland to the earth’s inhabited land area. Estimates from the 20th century are based on food and water supply and on individual requirements thereof.
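
The logic of that 17th-century estimate can be reconstructed in a few lines. The figures below are the commonly cited ones (roughly one million inhabitants of Holland and an inhabited land area roughly 13,400 times the size of Holland); they are meant only to illustrate the extrapolation, not to reproduce the original calculation exactly:

```python
# Rough reconstruction of the extrapolation: assume the whole inhabited earth
# could be as densely populated as 17th-century Holland.
population_of_holland = 1_000_000       # assumed population of Holland at the time
earth_to_holland_area_ratio = 13_400    # assumed ratio of inhabited land area to Holland's area

carrying_capacity = population_of_holland * earth_to_holland_area_ratio
print(f"Estimated maximum population: {carrying_capacity:,}")  # about 13.4 billion
```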

Recent estimates rely on computer models that integrate data and theories related to growth. Different scenarios are developed in which the global population peaks at about 9 billion people in the 21st century and then either collapses or adapts smoothly to the carrying capacity of the earth.


Estimates of this kind should be viewed with caution, because the information they rest on is incomplete. We might have some idea about the desirable level of material well-being and the physical environments we want to live in, but we cannot foresee the technologies, economic arrangements or political institutions that will be in place in fifty or eighty years. These mechanisms do not operate independently but interact and produce feedback loops. Awareness of dangers and risks alone won’t necessarily change predominant policies. Human behavior and the underlying fashions, tastes and values (on family size, equality, stability and sustainability) are too complex to be predicted accurately.

Let’s try a more modest example, then. What about predicting the potential outbreak of a disease? Google Flu Trends was a program that aimed to forecast influenza better than the U.S. Centers for Disease Control and Prevention. From 2008 onwards, internet searches for information on symptoms, stages and remedies were analyzed in order to predict where and how severely the flu would strike next. The program failed. Big data inconsistencies and human errors in interpreting the data are held responsible for its failure to predict the flu outbreak in the United States in 2013, the worst outbreak of influenza in ten years. Another recent example is the Ebola epidemic in West Africa in 2014. The U.S. Centers for Disease Control and Prevention published a worst-case prediction of 1.4 million people infected. The World Health Organization predicted a 90% death rate from the disease; in retrospect, the rate was about 70%. The data and the model based on initial outbreak conditions turned out to be inadequate for projections. Disease conditions and human behavior changed too quickly for humans and algorithms to keep up.

OK, then how about sales forecasting, a comparatively easy task? Mass-scale historical data has served eBay and other companies in measuring the benefit of search advertising. In a simple predictive model, clicks were counted to predict sales: “Although a click on an eBay ad was a strong predictor of a sale – consumers typically purchased right after clicking – the experiment revealed that a click did not have nearly as large a causal effect, because the consumers who clicked were likely to purchase, anyway”.
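
The gap between prediction and causation can be illustrated with a small back-of-the-envelope model. It is not eBay’s actual analysis; all the probabilities are invented, and serve only to show how a click can “predict” a sale while causing almost none of the sales:

```python
# Illustrative only: consumers who already intend to buy both click the ad and
# purchase anyway, so clicks predict sales without causing many of them.
n = 1_000_000                 # hypothetical consumers
p_intent = 0.05               # share who already intend to buy
p_click_if_intent = 0.60      # intenders often click the ad
p_click_if_no_intent = 0.01   # others rarely click
p_buy_if_intent = 0.90        # intenders buy with or without the ad
causal_lift_per_click = 0.01  # tiny extra purchase probability caused by the click

clicks = n * (p_intent * p_click_if_intent + (1 - p_intent) * p_click_if_no_intent)
baseline_sales = n * p_intent * p_buy_if_intent          # sales that happen anyway
extra_sales_from_ads = clicks * causal_lift_per_click    # sales actually caused by ads

sales_among_clickers = n * p_intent * p_click_if_intent * p_buy_if_intent + extra_sales_from_ads
print("P(sale | click):", round(sales_among_clickers / clicks, 2))   # high: a click predicts a sale
print("Share of sales caused by ads:",
      round(extra_sales_from_ads / (baseline_sales + extra_sales_from_ads), 3))  # low: about 1%
```

Switching the ads off would therefore cost far fewer sales than the predictive correlation suggests, which is exactly what the eBay experiment found.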

This shows us that data alone are not enough for prediction; one also needs to know about causal effects and context. Additionally, purely data-driven approaches tend to produce models and algorithms that are overfit to the idiosyncrasies of particular circumstances. What theories and models can deliver is not knowledge of the future but, at best, the ability to rule out a range of futures as unrealistic.


Featured image was taken from: http://www.bigpicexplorer.com/idealworld/population.htm

Veracity and Value: Two more “V”s of Big Data

So far we have learnt about the three most popular criteria of big data: volume, velocity and variety. Jennifer Edmond suggested adding voluptuousness as a fourth criterion of (cultural) big data.

I will now discuss two more “V”s of big data that are often mentioned: veracity and value. Veracity refers to source reliability, information credibility and content validity. In the book chapter “Data before the Fact”, Daniel Rosenberg (2013: 37) argued: “Data has no truth. Even today, when we speak of data, we make no assumptions at all about veracity”. Many other scholars agree with this; see the earlier post Data before the (alternative) facts.

What has been questioned for “ordinary” data seems to hold true for big data as well. Is this because big data is thought to comprise data on the entire statistical population, not just data from a sample? Does the assumed totality of data reveal a previously hidden truth? Instead of relying on a model or on probability distributions, we could now assess and analyse data of the entire population. But apart from the implications for statistical analysis (higher chances of false positives, the need for tighter statistical significance levels, etc.) there are even more fundamental problems with the veracity of big data.
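
The false-positive problem mentioned in passing above is easy to demonstrate. The following sketch (sample sizes and threshold are arbitrary choices, not taken from the post) tests many variables that are pure noise and still “finds” some significant effects:

```python
# When many variables are tested, some will look "significant" purely by chance.
import random

random.seed(42)
n_observations, n_variables = 1_000, 200
threshold = 2.0  # |mean| / standard error beyond ~2 is conventionally "significant"

false_positives = 0
for _ in range(n_variables):
    sample = [random.gauss(0, 1) for _ in range(n_observations)]   # pure noise
    mean = sum(sample) / n_observations
    std_err = (sum((x - mean) ** 2 for x in sample) / (n_observations - 1)) ** 0.5 / n_observations ** 0.5
    if abs(mean) / std_err > threshold:
        false_positives += 1   # a spurious "discovery"

print(f"{false_positives} of {n_variables} pure-noise variables look significant")  # roughly 5%
```

The more variables a dataset contains, the more such spurious findings turn up – hence the call for tighter significance levels.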

 


Take the case of Facebook emoji reactions. They were introduced in February 2016 to give users the opportunity to react to a post by tapping either Like, Love, Haha, Wow, Sad or Angry. Not only is the choice of affective states very limited and the expression of mixed emotions impossible, but the ambiguity in the use of these expressions is itself problematic. Although Facebook reminds its users: “It’s important to use Reactions in the way it was originally intended for Facebook — as a quick and easy way to express how you feel. […] Don’t associate a Reaction with something that doesn’t match its emotional intent (ex: ‘choose angry if you like the cute kitten’)”, we do know that human perceptions, motives and ways of acting and reacting are manifold. Emojis can be used to add emotional or situational meaning, to adjust tone, to make a message more engaging to the recipient, to manage conversations or to maintain relationships. The social and linguistic functions of emojis are complex and varied. Big data in the case of Facebook emoji reactions thus seems to be as pre-factual and rhetorical as “ordinary” data.

Value, in turn, refers to the social and economic value that big data might create. When reading documents like the European Big Data Value Strategic Research and Innovation Agenda, one gets the impression that economic value dominates. The focus is directed to “fuelling innovation, driving new business models, and supporting increased productivity and competitiveness”, to “increase business opportunities through business intelligence and analytics”, as well as to the “creation of value from Big Data for increased productivity, optimised production, more efficient logistics”. Big data value is no longer speculative: “Data-driven paradigms will emerge where information is proactively extracted through data discovery techniques and systems are anticipating the user’s information needs. […] Content and information will find organisations and consumers, rather than vice versa, with a seamless content experience”.

Facebook emoji reactions are just one example of this trend. Analysing users’ reactions serves not only to “better filter the News Feed to show more things that Wow us” but probably also to change consumer behavior and to sell individualized products and services.


Featured image was taken from Flickr.

From analogue to proto-digital databases

Databases as collections of data are not a new phenomenon. Collections began to emerge all over the world centuries ago, as the manuscript collections of Timbuktu (in medieval times a centre of Islamic scholarship) demonstrate. The number of these manuscripts is estimated at about 300,000, spanning domains such as Qur’anic exegesis, Arabic language and rhetoric, law and politics, astronomy and medicine, trade reports, etc.

Usually, however, people’s memory does not reach back that far. They are more likely to associate today’s databases with the efforts to establish universalizing classification systems that began in the nineteenth century.

The transition to digital databases took place only very recently, which explains why many databases are still on the way to full digitization.

I will present the database eHRAF World Cultures to illustrate this point. This online database originated as the “Collection of Ethnography” of the research programme “Human Relations Area Files” (HRAF), which started back in the 1940s at Yale University. The original aim of the anthropologist George Peter Murdock was to allow for global comparisons in terms of human behaviour, social life, customs, material culture, and human-ecological environments. To implement this research endeavour it was thought necessary “to have a complete list of the world’s cultures – the Outline of World Cultures, which has about 2,000 described cultures – and include in the HRAF Collection of Ethnography (then available on paper) about ¼ of the world’s cultures. The available literature was much smaller then, so the million or so pages collected could have been about ¼ of the existing literature at that time”.

From the 1960s onwards, the contents of this collection of monographs, journal articles, dissertations, manuscripts, etc. were converted to microfiche, before the digitization of the database was launched in 1994. The first online version of the database, “eHRAF World Cultures”, became available in 1997. This digitization process is far from complete: an additional 15,000 pages are converted from the microfiche collection and integrated into the online database every year. Currently the database contains data on more than 300 cultures worldwide.


So what makes this database proto-digital, then?

First of all, there is the research function. The subject indexing – at the paragraph level (!) – was done manually. The standard that provided the guidelines for what to index and how to index the content of the texts is called the Outline of Cultural Materials, and it was already very elaborate at the time. It assembles more than 700 topic identifiers, clustered into more than 90 subject groups.

The three-digit numbers – e.g. 850 for the subject group “Infancy and Childhood” or 855 for the subject “Child Care” – are meant to facilitate the search for concepts and to retrieve data in languages other than English. And although Boolean searches allow combinations of subject categories and keywords, cultures, countries or regions, one has to adapt to the logic of this ethnographic classification system in order to carry out purposeful search operations. The organisation of the database was obviously conceptualised hierarchically: to find a particular piece of information, you look up the superordinate concept and then decide which subjects of that group you need to apply to your research to get the expected results.
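
The hierarchical lookup can be pictured with a toy example. This is not eHRAF’s actual interface or data model; only the two codes named above come from the post, while everything else (the function, the query syntax, the culture name) is invented for illustration:

```python
# Toy model of the hierarchical OCM-style lookup: first the superordinate subject
# group, then the specific three-digit subjects combined into a Boolean query.
OCM_EXCERPT = {
    "850": {"label": "Infancy and Childhood",      # subject group (named in the post)
            "subjects": {"855": "Child Care"}},    # specific subject (named in the post)
}

def build_query(group_code: str, subject_codes: list[str], culture: str) -> str:
    """Combine chosen subject codes within one group and a culture into a search string."""
    group = OCM_EXCERPT[group_code]
    chosen = [code for code in subject_codes if code in group["subjects"]]
    return f"({' OR '.join(chosen)}) AND culture:{culture}"

print(build_query("850", ["855"], "ExampleCulture"))   # (855) AND culture:ExampleCulture
```

The point is that the researcher has to know (or look up) the classification first; the codes, not free-text concepts, carry the search.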

Secondly, although the “Outline of Cultural Materials” thesaurus is continually being extended, there is no system for providing updates. Only once a year is a new list of subjects and subject groups published (online, as a PDF and in print).

Thirdly, data that would help to localise cultural groups more precisely, such as GIS data (latitude and longitude coordinates), are not available in eHRAF.

Finally, users can print or email search results and selected paragraphs or pages from documents, but there is no feature for exporting data from eHRAF into (qualitative) data analysis software. The “eHRAF World Cultures” database is also not compatible with OpenURL.

The path from analogue to digital databases is apparently a long and difficult one. The curatorial agency of the database structure and the still discernible influence of the people who assigned subjects to the database materials should now be a bit clearer.


Featured image was taken from http://hraf.yale.edu/wp-content/uploads/2013/11/HRAF-historic.jpg

Whose context?

When Christine Borgman (2015) mentions the term “native data”, she is referring to data in its rawest form, with context information like communication artefacts included. In terms of NASA’s EOSDIS Data Processing Levels, “native data” even precede level 0, meaning that no cleaning has been performed at all. Scientists who begin their analysis at this stage do not face any uncertainty about what this context information is. It is simply the recordings and output of instruments, predetermined by the configuration of those instruments. NASA researchers may therefore count themselves lucky to obtain this kind of reliable context information.

Humanists’ and social scientists’ point of departure is quite different. Anthropologists, for example, would probably use the term “emic” for their field research data. “Emic” here stands in contrast to “etic”; the pair has been derived from the distinction in linguistics between phonemics and phonetics: “The etic viewpoint studies behavior as from outside of a particular system, and as an essential initial approach to an alien system. The emic viewpoint results from studying behavior as from inside the system” (Pike 1967: 37). An example of the emic viewpoint might be the correspondences between pulses and organs in Chinese medical theory (see picture below) or the relation of masculinity to maleness in a particular cultural setting (MacInnes 1998).

Chinese woodcut: Correspondences between pulses and organs

The emic context for anthropologists thus depends on the particular cultural background of their research participants. Dissociated from this cultural background and transferred into an etic context, data may become incomprehensible. Take, for example, Kosovo: a sovereign state from an emic point of view, but recognized by only 111 UN member states. In this transition from emic to etic, the etic context obviously becomes an imposed context.

Applied to libraries, archives, museums and galleries, it might be equally important to know the provenance and original use – the emic context, so to speak – of the resources. What functions did the materials have for the author or creator? Knowing the “experience-near” and not only the “experience-distant” meanings of materials would increase their information content and transparency. One could also say that providing this additional “emic” metadata enables traceability back to the source context and guarantees the credibility of the data. From an operational viewpoint, however, it would recreate the problem of standards and of making data findable.

If we move up to the next level, the metadata of each GLAM institution could be said to be emic, reflecting how the curators in that institution understand their data structure. Currently, over a hundred different metadata standards are in use. Again, the aggregation of several metadata standards into a unified one creates the same problem – the transfer from an emic metadata standard (an institution’s own) into an etic one.
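
A minimal sketch of that aggregation step may make the problem concrete. All field names below are invented for illustration; they do not correspond to any real institution’s schema or to an actual metadata standard:

```python
# Mapping two institutions' own ("emic") field names onto one unified ("etic") schema.
CROSSWALKS = {
    "museum_a": {"Titel": "title", "Urheber": "creator", "Datierung": "date"},
    "archive_b": {"item_name": "title", "author": "creator", "year": "date"},
}

def to_unified(institution: str, record: dict) -> dict:
    """Translate an institution-specific record into the unified schema."""
    mapping = CROSSWALKS[institution]
    # Fields without a mapping are silently dropped -- exactly the kind of loss that
    # makes an imposed etic standard problematic from the institutions' point of view.
    return {mapping[k]: v for k, v in record.items() if k in mapping}

print(to_unified("museum_a", {"Titel": "Stillleben", "Urheber": "unbekannt", "Inventarnr": "X42"}))
# {'title': 'Stillleben', 'creator': 'unbekannt'} -- the inventory number is lost
```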

So what is the solution? Unless GLAM institutions are willing to accept an imposed standard, the only possibility that remains is mutual convergence and, ultimately, an inter-institutional consensus.

——
Borgman, Christine L. (2015) Big Data, Little Data, No Data. Scholarship in the Networked World. Cambridge: MIT Press.
MacInnes, John (1998) The end of masculinity. The confusion of sexual genesis and sexual difference in modern society. Buckingham: Open University Press.
Pike, Kenneth L. (1967) Language in Relation to a Unified Theory of the Structure of Human Behavior. The Hague: Mouton.

Featured image was taken from http://www.europeana.eu/portal/de/record/9200105/wellcome_historical_images_L0038821.html

Will Big Data render theory dispensable?

Scientific theories have been crucial for the development of the humanities and social sciences. Metatheories such as classical social evolutionism, cultural diffusionism, functionalism or structuralism, for example, guided early anthropologists in their research. Postmodern theorists rightly criticized their predecessors, among other things, for their deterministic theoretical models. Their criticism, however, was still based on theoretical reflection, although many tried to reduce their theoretical bias by combining several perspectives and theories (cf. theory triangulation).

Whereas it has been common in the humanities to keep track of “disproven or refuted theories”, there could be a trend among proponents of a new scientific realism to put on the blinkers and focus solely on progress towards a universal, objective and true account of the physical world. Even worse, theory could be discarded altogether. Big data might revolutionise the scientific landscape. From the point of view of Chris Anderson, then editor-in-chief of Wired, the “end of theory” is near: “Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do? The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves”.


This approach to science might gain ground. Digital humanists are said to be “nice”, due to their concern with method rather than theory. Methodological debates in the digital humanities seem to circumvent more fundamental epistemological debates on principles. But big data is not self-explanatory. Explicitly or implicitly, theory plays a role when it comes to collecting, organizing and interpreting information: “So it is one thing to establish significant correlations, and still another to make the leap from correlations to causal attributes” (Bollier 2010). Theory matters for making the semantic shift from information to meaning.

In order to understand the process of knowledge production, we must keep an eye on the mutually constitutive spheres of theory and practice. In the era of big data, Bruno Latour’s conclusion – “Change the instruments, and you will change the entire social theory that goes with them” – is hence more important than ever.


Featured image was taken from https://vimeo.com/