Ethics and epistemics in biomedical big data research

This recent article explores issues in big data approaches with regard to

a) ethics – such as obtaining consent from large numbers of research participants across a large number of institutions; protecting confidentiality; privacy concerns; optimal methods for de-identification; and the limited capacity of the public, and even experts, to interpret and question research findings

and b) epistemics – such as personalized (or precision) treatments that rely on extending concepts that have largely failed or have very high error rates; deficiencies of observational studies that do not get eliminated with big data; challenges of big data approaches due to their overpowered analysis settings; minor noise due to errors or low-quality information being easily translated into false signals; and problems with the view that big data is somehow “objective,” including that this obscures the fact that all research questions, methods, and interpretations are value-laden.
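
The point about overpowered analysis settings can be made concrete with a little arithmetic. The following is a minimal sketch, not taken from the article: it assumes a plain two-sample z-test with equal group sizes and unit variance, and the function name and the effect size of 0.01 standard deviations are chosen here purely for illustration. It shows how a practically negligible difference turns "significant" once the sample is large enough.

```python
import math


def two_sample_z_pvalue(effect_in_sd: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test.

    Assumes equal group sizes and unit variance in both groups, so the
    standard error of the mean difference is sqrt(2 / n).
    """
    standard_error = math.sqrt(2.0 / n_per_group)
    z = effect_in_sd / standard_error
    # Two-sided tail probability of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical, practically irrelevant difference of 0.01 standard deviations.
tiny_effect = 0.01
for n in (100, 10_000, 1_000_000):
    p = two_sample_z_pvalue(tiny_effect, n)
    print(f"n per group = {n:>9,}: p = {p:.3g}")
```

With 100 observations per group the difference is nowhere near significant; with a million per group the p-value falls far below any conventional threshold, although the effect itself has not changed.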

The article closes with a list of recommendations which consider the tight links between epistemology and ethics in relation to big data in biomedical research.

Wendy Lipworth, Paul H. Mason, Ian Kerridge, John P. A. Ioannidis, Ethics and Epistemology in Big Data Research, in: Journal of Bioethical Inquiry 2017, DOI 10.1007/s11673-017-9771-3 [Epub ahead of print].

Beyond a binary conception of data

An awful lot of data exist in tables, the columns (variables) and rows (observations) filled with numerical values. Numerical values are always binary: Either they have a certain value or not (in the latter case they are NAs or NULLs). Computability is based on information stored in a binary fashion coded as either 0 or 1. Statistics is based on this binary logic, even where nominal data are in use. Nominal variables such as gender, race, color, and profession can be measured only in terms of whether the individual items belong to some distinctively different categories. More precisely: They belong to these categories or not.
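
How a nominal variable ends up in this binary logic can be sketched with so-called one-hot (or dummy) coding, where every category becomes its own 0/1 column. This is only an illustration; the function name and the example values are made up here.

```python
def one_hot(values):
    """Turn a list of nominal values into 0/1 indicator rows.

    Returns the sorted list of categories and, for every value, a row
    that contains 1 for the category it belongs to and 0 elsewhere.
    """
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Hypothetical observations of the nominal variable "profession".
professions = ["baker", "teacher", "baker", "nurse"]
categories, rows = one_hot(professions)

print(categories)
for profession, row in zip(professions, rows):
    print(profession, row)
```

Each observation either belongs to a category or it does not; nothing in between can be expressed in this coding.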

A large amount of data, especially on the Internet, consists of unstructured text data (eMails, WhatsApp messages, WordPress posts, tweets, etc.). Can texts – or other cultural data like images, works of art, or music – be adapted to a binary logic? In principle yes, as the examples of nominal variables show; either a word (or an image …) belongs to a certain category or not. Quite a good part of Western thinking follows a binary logic; classical structuralism, for example, is fond of structuring oppositions: good – bad, pure – dirty, raw – cooked, orthodox – heretic, avantgarde – arrièregarde, up-to-date – old-fashioned etc. The point here is that one has to be careful about the domain to which these binary oppositions belong: Good – bad is not the same as good – evil.

But the habit of thinking in binary terms narrows the perspective; the use of data according to a binary logic means a reduction. This is particularly evident with respect to texts: The meaning of individual words changes with their context. Another example: A smile can be the expression of sympathy and of uncertainty in the Western world, while in other cultures it may be referring to aggression, confusion, sadness or a social distancing from the other. What can be seen as a ‘smile’ in monkeys is most often the expression of fear – a showing of the canines.

Indian logic provides an example of how to go beyond a binary logic: the so-called “tetralemma”. While binary systems are based on calculations of 0 and 1 and therefore formulate a dilemma, the tetralemma provides for four possible answers to any logical proposition: Beyond the logics of 0, 1, there is both 0 and 1, and neither 0 nor 1. One can even conceive of a fifth answer: none of these at all. Put as a graph and expressed mathematically, the tetralemma would look like this:


The word “dawn” would be an example of what is at stake in the tetralemma: Depending on how you define its meaning, it can be a category of its own, not fitting into a category (because of its ambivalent character), it is both day and night, and it is neither sunshine nor darkness.
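
The four positions of the tetralemma can also be written down as a small data type. The sketch below is one possible coding, not a standard one: it assumes that a proposition is described by two independent yes/no judgments (is it affirmed? is it denied?), and all names are chosen here for illustration.

```python
from enum import Enum


class Tetralemma(Enum):
    """Four positions, coded as the pair (affirmed, denied)."""
    AFFIRMED = (True, False)   # the "1" of binary logic
    DENIED = (False, True)     # the "0" of binary logic
    BOTH = (True, True)        # both 0 and 1
    NEITHER = (False, False)   # neither 0 nor 1


def classify(is_affirmed: bool, is_denied: bool) -> Tetralemma:
    """Map two independent judgments onto one of the four positions."""
    return Tetralemma((is_affirmed, is_denied))


# "Dawn" with respect to the proposition "it is day":
# read as both day and night, or as neither sunshine nor darkness.
print(classify(True, True))     # Tetralemma.BOTH
print(classify(False, False))   # Tetralemma.NEITHER
```

A binary coding would have to force “dawn” into either AFFIRMED or DENIED; the two additional positions are exactly what gets lost.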

One of the few philosophers to point to the narrowing of logics to binary oppositions (TRUE / FALSE) and to underline the many possibilities in language games was Jean-François Lyotard, in his main work “Le Différend” (Paris: Minuit 1983). In information science, it was only more recently that complex approaches have been developed beyond binary systems, which allow for an adequate coding of culture, emotions or human communications. The best examples are ontologies; they can be understood as networks of entities, while the relations between these entities can be defined in multiple ways. A person can be at the same time a colleague in a team, a partner in a company, and the father of another person working in the same company (a visualization of the “friend of a friend” ontology can be found here). Datafication of human signs, be they linguistic, artistic, or part of the historical record, therefore exposes the challenges of data production in particularly evident ways.
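
The idea of an ontology as a network of entities with many kinds of relations can be sketched with simple subject – relation – object triples. The names of the persons, the company and the relations below are invented for illustration and are not taken from the “friend of a friend” vocabulary.

```python
from collections import defaultdict

# Hypothetical statements, each a (subject, relation, object) triple.
triples = [
    ("alice", "colleague_of", "bob"),
    ("alice", "partner_in", "acme_ltd"),
    ("alice", "father_of", "carol"),
    ("carol", "works_for", "acme_ltd"),
]


def relations_of(subject, statements):
    """Collect all relations of one entity, grouped by relation type."""
    grouped = defaultdict(list)
    for s, relation, obj in statements:
        if s == subject:
            grouped[relation].append(obj)
    return dict(grouped)


print(relations_of("alice", triples))
```

The same entity appears in three different roles at once, and further relation types can be added without changing the structure of the data.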

Is there an identity crisis of statistics?

It is not without irony that statistics currently seems to live through an identity crisis, since this discipline is often named the “science of uncertainty”. If there is an identity crisis at all, in what way can it be conceived of? And why did this crisis come now – is there a nexus between the rise of ‘big data’, algorithms and the development of the discipline? Three developments can be identified that make the hypothesis of an identity crisis of statistics more tangible.

  • A crisis of legitimacy: Statistics’ findings result from value-free and transparent methods; but this is not something that can (or perhaps ever could) easily be communicated to the broad public. Hence, in the age of ‘big data’, a loss of legitimacy: the more complex the collection of data, statistical methods and presented results are, the more disorienting an average person may find them, especially if there is a steep increase in the sheer quantity of statistical findings. Even for citizens who are willing to occupy themselves with data, statistics, and probability algorithms, Nobel laureate Daniel Kahneman has underlined the counterintuitive and intellectually challenging character of statistics (see his book “Thinking, Fast and Slow” or the classical article “Judgment under Uncertainty”). These peculiarities of the discipline lower the general public’s trust in it: “Whom should I believe?”
  • A crisis of expertise: Statistics has become part of a broad range of scientific disciplines, far beyond mathematics. But the acquisition of competences in statistics quite obviously has its limits. As Gerd Gigerenzer pointed out 13 years ago, “mindless statistics” has become a custom and ritual in sciences such as psychology. In recent years, this crisis of expertise has been termed the crisis of reproducibility (for data from a previous publication) or replicability (for data from an experiment); the renowned journal “Nature” devoted an article series to this problem in 2014, with a focus on, for example, the use of p-values in scientific arguments. The report of the 2013 London Workshop on the Future of the Statistical Sciences is outspoken on this problem, and there is even a Wikipedia article on the crisis of replicability. Statisticians defend themselves by pointing to these scientists’ lack of training in statistics and computation [see Jeff Leek’s recent article here], but quite obviously this crisis of expertise undermines the credibility of scientists as experts.
  • A crisis of the societal function of the discipline: Statistics as a scientific discipline established itself alongside the rise of nation-states; hence its close connection to national economies and the data collected across large populations. As has been explained in a “Guardian” article posted earlier in this blog, statistics served as the basis of “evidence-based policy”, and statisticians were seen by politicians as the caste of scientific pundits and consultants. But this has changed completely: Nowadays big data are assets of globalised companies which act across the borders of nation-states. This points to a shift in the core societal function of statistics, no longer serving politics and hence the nation, but global companies and their interests: Statistics leaves representative democracy, and it has become unclear how the benefits of digital analytics might ever be offered to the public. Even if the case is still obscure, the possible role of “Cambridge Analytica” in the U.S. presidential election campaign shows that the privatisation of expertise can be turned against the interests of a nation’s citizens.

Data scales in applied statistics – are nominal data poor in information?

An earlier blog post by Jennifer Edmond on “Voluptuousness: The Fourth ‘V’ of Big Data?” focused on cultural data steeped in meaning; proverbs, haikus or poetic language are amongst the best examples of this kind of data.

But computers are not (yet) good at understanding human languages. Nor is statistics – it simply conceives of words as nominal variables. Quite in contrast to an understanding of cultural data as described by Jennifer Edmond, applied statistics regards nominal variables as the ones with the LEAST density in information. This becomes obvious when these data are classified amongst other variables in scales of measurement. The scale (or level) of measurement refers to assigning numbers to a characteristic according to a defined rule. The particular scale of measurement a researcher uses will determine what type of statistical procedures (or algorithms) are appropriate. It is important to understand the nominal, ordinal, interval, and ratio scales; for a short introduction to the terminology, follow this link. Seen in this context, nominal variables belong to qualitative data and are classified into distinctively different categories; in contrast to ordinal variables, these categories cannot be quantified or even ranked. The functions of the different scales are shown in the following graph:



Here it becomes visible that words are classified as nominal variables; they belong to some distinctively different categories, but those categories cannot be quantified or even be ranked; there is no meaningful order in choice.
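
What the scale of measurement allows a researcher to compute can be sketched in a few lines. The example uses Python's standard `statistics` module; the data are invented, and the pairing of scales with statistics follows the usual textbook convention (mode for nominal, median for ordinal, mean for interval, ratios for ratio data).

```python
import statistics

# Hypothetical data, one variable per scale of measurement.
words = ["good", "bad", "good", "dawn"]   # nominal: categories only
school_grades = [1, 2, 2, 3, 5]           # ordinal: ranked, distances unknown
temperatures_celsius = [18.0, 20.0, 25.0] # interval: distances, no true zero
weights_kg = [50.0, 100.0]                # ratio: true zero, ratios meaningful

# Nominal: counting is all we can do, so the mode is the only "average".
print("most frequent word:", statistics.mode(words))

# Ordinal: ranking makes the median meaningful.
print("median grade:", statistics.median(school_grades))

# Interval: differences are meaningful, so the mean can be used.
print("mean temperature:", statistics.mean(temperatures_celsius))

# Ratio: only here does "twice as much" make sense.
print("weight ratio:", weights_kg[1] / weights_kg[0])
```

For the list of words, nothing beyond counting is available: there is no median word and no mean word.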

This has the consequence that in order to be able to compute with words, numeric values are attributed to them. In sentiment analysis, for example, the word “good” can receive the value +1, while the word “bad” will receive a -1. Now they have become computable; words are thus transformed into values, and it is exactly this process which reduces their “voluptuousness” and robs them of their polysemy; just binary values remain.
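
A deliberately naive sketch shows what this reduction looks like in practice. The two-word lexicon with the values +1 and -1 follows the example above; the function name and the test sentences are made up, and real sentiment analysis tools are of course more elaborate.

```python
# Toy lexicon: every word is reduced to a single value.
LEXICON = {"good": 1, "bad": -1}


def sentiment_score(text: str) -> int:
    """Sum the lexicon values of all words; unknown words count as 0."""
    tokens = text.lower().split()
    return sum(LEXICON.get(token.strip(".,!?"), 0) for token in tokens)


print(sentiment_score("A good question."))   # 1
print(sentiment_score("Not bad at all!"))    # -1, although the phrase is praise
print(sentiment_score("Good grief, no."))    # 1, although the phrase is a sigh
```

The context that gives “bad” or “good” its actual meaning in these sentences is invisible to the computation.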

To my knowledge, the most significant endeavor to provide for a more complex measurement of linguistic data has been undertaken in the field of psychology: In their book “The Measurement of Meaning”, Charles E. Osgood, George J. Suci, and Percy H. Tannenbaum developed the “semantic differential” as a scale used for measuring the meaning of things and concepts in a multitude of dimensions (for a contemporary approach, see for example …). But imagine each word of a single language measured – with all its connotations and denotations, each in its different functions, in the context of the other words around it … not to speak of figurative language such as metaphors, irony, and sarcasm.
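
How a semantic differential turns a concept into a profile of several dimensions can be sketched as follows. The three bipolar scales stand for the dimensions usually named evaluation, potency and activity; the ratings are invented, the 7-point range from -3 to +3 is an assumption made here, and the Euclidean distance is only one simple way of comparing two profiles.

```python
import math
from statistics import mean

# One bipolar scale per dimension; positive values lean to the first pole.
SCALES = ("good-bad", "strong-weak", "active-passive")

# Hypothetical ratings by three raters on a 7-point scale from -3 to +3.
ratings = {
    "dawn": [(2, -1, 1), (3, 0, 1), (1, -1, 0)],
    "storm": [(-2, 3, 3), (-1, 3, 2), (-3, 2, 3)],
}


def profile(concept_ratings):
    """Average the raters' judgments separately for each scale."""
    return tuple(mean(scale_values) for scale_values in zip(*concept_ratings))


def distance(profile_a, profile_b):
    """Euclidean distance between two concept profiles."""
    return math.dist(profile_a, profile_b)


dawn = profile(ratings["dawn"])
storm = profile(ratings["storm"])
for scale, a, b in zip(SCALES, dawn, storm):
    print(f"{scale:>15}: dawn {a:+.2f}, storm {b:+.2f}")
print(f"distance between the two concepts: {distance(dawn, storm):.2f}")
```

Even this small profile carries more than a single +1 or -1, but it still covers just three dimensions of one concept as judged by a handful of raters.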

This is why words – and cultural, voluptuous data in a broader sense – are so difficult to compute; and why they are the next big computational challenge.