Foreword

Foreword by Wes McKinney

Data science using open-source libraries in languages like Python and R has evolved from fringe use by academic researchers and enthusiasts in the late 2000s and early 2010s to being the dominant, mainstream approach for business and research alike in the 2020s. There has been a similar shift in enterprise software development and server-side data processing, where, for a variety of reasons, work that used to take place mainly in compiled systems languages like C++ and Java is now being done in dynamic, interpreted languages that were previously relegated to more of a “scripting” role: automating tasks and “gluing” together parts of larger systems.

The modern open-source data science landscape has developed organically out of many federated silos that did not collaborate very actively with each other. Some of this history predates what we now think of as the “big data” or even “AI” ecosystems that are taken for granted today. I couldn’t do justice to all of the nuances of how these different programming communities developed. Still, it is useful to briefly consider how we got where we are today and how you came to be holding this book in your hand (or reading it virtually).

R was created during the early open-source movement in the 1990s as a free and open-source alternative to the S programming language, which had become popular for statistical programming through the commercial S-PLUS product, alongside other commercial statistical computing products such as Stata and SAS. R became widely used in academia for research work (especially in statistics departments), and after a time it was also adopted for data analysis in industries such as life sciences, insurance, finance, and many others. In the 2010s, this business adoption was accelerated by a combination of a fast-growing collection of high-quality open-source add-on libraries and new integrated development environments (IDEs), which improved productivity and collaboration on large projects.

Python as a data science language took a rather different path. The first version of the Python language was released in 1991, only two years before R’s first release in 1993. For many years, it occupied a Perl-like role as a dynamic scripting language for the Linux ecosystem. In 1995, Jim Hugunin created Python’s first numerical computing library, Numeric, and over the next decade, a small scientific computing community worked to create open-source libraries that would enable scientists to do their work in Python rather than in a commercial environment like MATLAB. Python’s usability for statistics and what we now think of as “data science” was limited until the late 2000s, when open-source developers started incorporating ideas from R and other languages into new Python projects providing statistical capabilities, such as pandas, scikit-learn, and statsmodels.

By the early 2010s, businesses worldwide were making a massive push to incorporate data science and machine learning into their operations, collecting and processing more data than ever to create new data-powered products and features and to become more competitive and efficient. This created a simultaneous need for tools and systems to capture, store, and process massive quantities of data, and for professionals with the skills to develop and maintain data applications. At the same time, cloud computing services became generally available, making it easier than ever for small teams to set up and manage extensive “virtual” application infrastructure, an environment in which open-source software rapidly won out over commercial products with more restrictive licensing.

It’s perhaps no great surprise that, given the urgency of this need for data systems and skilled data scientists, these new data teams chose to use primarily open-source software and, in particular, high-level dynamic programming languages that are streamlined for individual productivity and interactive computing. This investment in data science has led to massive growth of the Python and R communities. In 2008, R was ranked by TIOBE as the 25th most popular programming language, but by 2020 it had risen to 8th place. Universities saw the value of sending new graduates out into the world with practical skills that they could immediately apply on the job and responded by incorporating Python and R into their academic programs.

While the open-source data science ecosystem has been coming of age over the last two decades, there are similar stories to be told about growth and progress in the field of database systems and large-scale enterprise data management. Early database systems were the backbone of the computing revolution and the early internet in the 1980s and 1990s, and they similarly played an essential role in the nascent “big data” ecosystem that came about in the 2000s with the growth of the now-massive internet companies that are omnipresent in our daily lives.

Internet companies like Google and Facebook (now known as Meta) started collecting so much data that they needed to create physical data centers and new types of software to cope with the massive computational needs of their products and services. Google began to share some of the details of the systems it had built (such as MapReduce), which led Yahoo! to create Apache Hadoop, an open-source implementation of some of those systems. Hadoop helped popularize the idea and practice of “big data.” One of the core ideas from Google’s early papers, reflected in the Hadoop ecosystem’s success, was the decoupling of data storage from computing engines. Historically, databases were vertically integrated systems that bundled data storage, computation, and a query language (SQL) all into one, sometimes even in a physical appliance located in a company’s server room. In Hadoop, data files are stored in HDFS, a distributed file system that runs on a cluster of computers, and different compute engines execute against those files, writing their results back to the same file system.

Within a few years, the Hadoop system spawned a fast-growing collection of open-source big data projects that worked with HDFS. A number of companies were founded to help ordinary businesses adopt Hadoop and related big data technologies so that they, too, could collect, store, and make use of massive datasets. Open-source technology like Hadoop would be used alongside a company’s existing database systems, which might have included commercial databases like Oracle, Teradata, or SAP and open-source databases like MySQL and PostgreSQL.

The open-source data science and big data ecosystems developed in parallel over a significant period, without much collaboration or coordination. Many big data systems were written in Java or in languages that run on the Java Virtual Machine (JVM), such as Scala, while data science was increasingly happening in Python and R. By the mid-2010s, data scientists found themselves needing to build their statistics and machine learning applications on top of their company’s big data platforms, and, to make a long story short, that wasn’t easy to do.

At the same time that the big data and data science ecosystems were starting to converge and overlap in the mid-2010s, we also saw a significant acceleration in the performance of the computing hardware powering companies’ data infrastructure. Fast solid-state drives replaced slow hard disk drives, while the performance of the networks connecting computers increased by ten to one hundred times (or more in some cases). Similarly, computer processors (CPUs) became faster and more efficient, and they could support higher levels of parallel processing as the number of physical CPU cores increased significantly. The initial generation of open-source Hadoop ecosystem projects was designed for scalability, making it feasible to process large datasets, even if slowly; in most cases, these projects were not designed to use cutting-edge computing hardware efficiently.

By 2015, the need for improved interoperability of big data technologies with the data science world and more efficient use of modern computing hardware led to a simultaneous reckoning in the broader open-source community. How could we reconcile these language interface incompatibilities while making our software a great deal more computationally efficient? At the same time, we also recognized the federated nature of how different communities had developed (e.g., the Python and R developers had had little collaboration or code sharing over the years). We wondered if we could create an environment where developers could collaborate and share computing infrastructure across programming language boundaries.

These questions led us to create the Apache Arrow open-source project, a new kind of project with the initial goal of providing a fast, language-independent data interchange protocol for tabular datasets. This protocol is now commonly known as “the Arrow format.” Once we established an interchange layer that would enable systems written in Java, C++, Python, R, Go, Rust, or other languages to send datasets to each other efficiently, we could create fast, reusable computing libraries for Arrow to power our analysis workflows. We have begun to think of Arrow as a “development platform” for making more interoperable and efficient data processing applications.

This book teaches you how to use the Arrow R library, a product of all of this work that started almost a decade ago. In addition to providing fast, scalable computing capabilities, it is a foundation for using other Arrow-powered data processing tools like DuckDB or DataFusion, as well as any others that may be developed in the future. The Arrow R library builds on a common C++ foundation that is also used in Python and Ruby, and in many other open-source and proprietary systems that incorporate the Arrow C++ library. The authors of this book are some of the leading developers of the Arrow R library: they have spent several years creating a thoughtful and intuitive experience to help R developers level up their data processing capabilities.

As a co-creator of Arrow, I am excited to see books like this written, showcasing the benefits of the work done by the Arrow open-source community. The tools within will serve you well and will expand your use of R to work efficiently with large-scale datasets.