This is one of the big ideas I have run into this year…the notion that with enough data and brute-force computing power, we can just crunch the numbers to get answers that until now have come from incredibly elegant thinking and use of the scientific method. This implies that the incisive rapier of theories and ideas will give way to the sledgehammer of crunching numbers in a database…
There are already some examples which I have discussed in this blog…such as the latest results of paleolinguistics (based upon a comparative anatomy of most of the world’s languages…with regard to how many, and which phonemes they use). The results have tipped over the table of Chomsky’s elegant theories.
There are a variety of perspectives from which to look at this possibility for the future of the sciences. The first one (a practical, pragmatic one) is that if it works…great. But then there are some other important things to consider, such as:
· This can’t replace all of scientific research, in that we don’t have huge data sets for all possible questions, in every possible area of inquiry.
· Does this change, or have any impact upon what it is to be a theorist?
· How well does this method answer the fundamental notions of what science is really all about? By this, I mean: whatever the pragmatic value of this method (brute force number crunching), if all scientific inquiry is predicated upon the use of statistics, are we losing something else in the tradeoff?
In any case, this is a fascinating idea to consider…
The End of Theory: The Data Deluge Makes the Scientific Method Obsolete
http://www.wired.com/science/discoveries/magazine/16-07/pb_theory
"All models are wrong, but some are useful."
So proclaimed statistician George Box 30 years ago, and he was right. But what choice did we have? Only models, from cosmological equations to theories of human behavior, seemed to be able to consistently, if imperfectly, explain the world around us…until now. Today companies like Google, which have grown up in an era of massively abundant data, don’t have to settle for wrong models. Indeed, they don’t have to settle for models at all.
Sixty years ago, digital computers made information readable. Twenty years ago, the Internet made it reachable. Ten years ago, the first search engine crawlers made it a single database. Now Google and like-minded companies are sifting through the most measured age in history, treating this massive corpus as a laboratory of the human condition. They are the children of the Petabyte Age.
The Petabyte Age is different because more is different. Kilobytes were stored on floppy disks. Megabytes were stored on hard disks. Terabytes were stored in disk arrays. Petabytes are stored in the cloud. As we moved along that progression, we went from the folder analogy to the file cabinet analogy to the library analogy to — well, at petabytes we ran out of organizational analogies.
Speaking at the O’Reilly Emerging Technology Conference this past March, Peter Norvig, Google’s research director, offered an update to George Box’s maxim: "All models are wrong, and increasingly you can succeed without them."
This is a world where massive amounts of data and applied mathematics replace every other tool that might be brought to bear. Out with every theory of human behavior, from linguistics to sociology. Forget taxonomy, ontology, and psychology. Who knows why people do what they do?
The point is they do it, and we can track and measure it with unprecedented fidelity. With enough data, the numbers speak for themselves.
The big target here isn’t advertising, though. It’s science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.
In short, the more we learn about biology, the further we find ourselves from a model that can explain it.
There is now a better way. Petabytes allow us to say: "Correlation is enough." We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.
The best practical example of this is the shotgun gene sequencing by J. Craig Venter. Enabled by high-speed sequencers and supercomputers that statistically analyze the data they produce, Venter went from sequencing individual organisms to sequencing entire ecosystems. In 2003, he started sequencing much of the ocean, retracing the voyage of Captain Cook. And in 2005 he started sequencing the air. In the process, he discovered thousands of previously unknown species of bacteria and other life-forms.
This kind of thinking is poised to go mainstream. In February, the National Science Foundation announced the Cluster Exploratory, a program that funds research designed to run on a large-scale distributed computing platform developed by Google and IBM in conjunction with six pilot universities. The cluster will consist of 1,600 processors, several terabytes of memory, and hundreds of terabytes of storage, along with the software, including IBM’s Tivoli and open source versions of Google File System and MapReduce. Early CluE projects will include simulations of the brain and the nervous system and other biological research that lies somewhere between wetware and software.


