
The Natural Side of Big Data

“Big Data” is getting big coverage. For example, this recent column in the New York Times captures the emergence of Big Data as a cultural meme. Usually, people take a primarily technophilic view of Big Data. The Times article, for example, describes Big Data as “applying the tools of artificial intelligence, like machine learning, to vast new troves of data beyond that captured in standard databases. The new data sources include Web-browsing data trails, social network communications, sensor data and surveillance data.”

But Big Data is also being shaped by the natural world, and it is affecting how we understand and interact with that world. It is shaped by the natural world because biology offers such an enormous database on so many levels: from the nearly infinite bytes of data emerging from genetic studies that are already taxing our biggest digital depositories (a dilemma well captured in this article by science philosopher Mark Sagoff on data deluge), to the scores of specimens held in natural history museums and photographic archives that provide a world-wide view of what used to live where, to the masses of data now emerging from “citizen science” databases, such as the National Phenology Network (a great overview article was recently published in a special issue of Frontiers in Ecology and the Environment on citizen science), to large-scale aggregates of biogeochemical interactions, such as coastal “dead zones” that essentially compile the interactions of industrial nitrogen conversion, human agricultural practices, primary productivity, and biological respiration (an interactive map of these zones has been created by WRI).

Big Data is affecting how we understand the world because it erodes what we have been told over the past 50 years or so is the bedrock of scientific understanding: falsifiable hypotheses tested in controlled experiments under a “strong inference” framework.
The notion that science must be falsifiable comes from Karl Popper, and his ideas got a big boost from John Platt’s propaganda-like rant for a standardized method of conducting biological science in his well-cited 1964 “Strong Inference” paper. (If you loved this paper the first time you read it, as I did, I urge you to read it again with a more critical ear. It’s a little like getting enthralled with Ayn Rand in high school, and then trying to square her ideas with reality as an adult.) These philosophies have led to institutionalized rules that “correlation does not imply causation,” that “patterns cannot reveal mechanisms,” and that science that proceeds without falsifying pre-determined hypotheses is just a “fishing expedition.”

Big Data makes these stalwart arguments seem a bit quaint. They are all still valuable, sometimes, but the reflexive way in which they are used by scientists and non-scientists alike (see my comments on it here and in our book, Observation and Ecology) needs to be reevaluated. Big Data approaches allow life scientists to find very robust patterns with much larger chaotic cycles, and if they don’t allow us to ascribe mechanistic causes with 100% certitude (no approach will), they at times get us about as close as possible.

At the same time, a caution about Big Data: it will never fully substitute for Little People. That is, it cannot replace individuals who take the time to observe nature (“with the brain in gear,” as Geerat Vermeij says in an excellent contribution to Observation and Ecology) and to understand the Little pieces of it that make up the Big whole of it.