9  Sharing Data and Interoperability

One of the benefits of Arrow as a standard is that data can be easily shared between different applications or libraries that understand the format. By not having to convert to an intermediate format, and by avoiding serialization and deserialization costs, moving data from one tool to another is fast and efficient. As a library developer, it also reduces maintenance burden and the surface area for bugs because you don’t have to write and manage adapters for many formats: you just implement the Arrow connector.

As a result, many projects have adopted the Arrow format as the way to connect with other projects in the ecosystem. In this chapter, we’ll show some examples where Arrow is used as the means of exchanging data, and how that results in major speedups for users. In some cases, Arrow is being used behind the scenes, and you benefit without needing to write any additional code.

Along the way, we’ll highlight the different ways these tools communicate—within the same process, across processes on the same system, or over the network between systems—and provide some context for how Arrow works in each of those modes. Some of those details will be most relevant if you’re trying to write a new library using Arrow, such as a new R package wrapping an Arrow-native project. But even if you’re not writing packages, understanding how arrow communicates with other tools gives you an intuition for how it will perform in different contexts.

9.1 Sharing within a process

We showed in Section 8.2 how you can pass data from the arrow query engine to duckdb and back. Let’s explore how that works in a bit more depth to understand what’s happening and how the Arrow standard makes it essentially free in terms of data-copying costs.

Processes work on data by allocating memory and loading data into it. When a process allocates memory, it asks the operating system to grant it ownership of a block of memory. Code running within the process can reference memory the process has allocated, but other processes are not allowed to access it.

The integration between arrow and duckdb takes advantage of the fact that both are running in the same process. In your R session, arrow and duckdb are R packages that wrap C++ libraries. When you load the R packages, the C++ libraries are also loaded into the running R process. That means a block of memory created by a function in the arrow library can be read by a function in the duckdb library right where it is: there’s no need to copy or move anything, only to pass a pointer to it.

However, we can’t just pass C++ objects around. C++ objects are complex and include methods—methods that require code in the C++ library to execute. In order for DuckDB to work with an Arrow C++ object, it would have to depend on the Arrow C++ library, and vice versa. Neither project wants that dependency, and the lack of a stable binary interface for C++ objects makes sharing them across libraries particularly challenging. Instead, we just want to pass the data buffers inside the C++ object: all we need is a basic way to communicate where the data is and what shape it has. This is where the C data interface comes in.

9.1.1 The C data interface

The Arrow C data interface defines a simple means for referencing Arrow data in memory, allowing different libraries to consume the same data without depending on each other. It defines just two structures: one describing an array of data (ArrowArray) and another describing a schema (ArrowSchema). It is a small amount of C code—29 lines in total—so any programming language that is compatible with C can use it by copying this code into its codebase.

This can be a huge advantage for projects that use Arrow’s format and data structures. For one, it’s very small, so there’s little cost to adding it. It also avoids pulling in all of the dependencies of the Arrow C++ library, which you may not need when you just want to exchange data in the Arrow format. Finally, the C data interface is stable: while the Arrow C++ library is under active development, the C data interface is guaranteed to remain unchanged.
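
To make this a little more concrete, here is a small sketch of a handoff across the C data interface from R. It uses the nanoarrow package, which we’ll introduce properly in Section 9.3, alongside arrow: both packages understand the same two C structures, so converting between them passes pointers rather than copying data. Treat this as an illustrative sketch that assumes both packages are installed.

library(arrow)
library(nanoarrow)

# An Arrow array created by the arrow C++ library
arr <- as_arrow_array(c(1.5, 2.5, NA, 4))

# Hand it to nanoarrow via the C data interface: only the ArrowArray
# and ArrowSchema structs describing the data are exchanged; the
# buffers themselves stay where they are
na_arr <- as_nanoarrow_array(arr)

# Inspect the structure nanoarrow sees (length, null count, buffers)
str(na_arr)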

9.1.2 Between R and Python

Before we explain how arrow and duckdb share data between the query engines, let’s start with a simpler case that also uses the C data interface: sharing Arrow memory between R and Python with the reticulate package.

If you’re on a polyglot team with people working in both R and Python, or on a project with both R and Python components, different stages of your analysis pipeline might be written in different languages. Passing data back and forth between the two—serializing and deserializing it—takes time and resources, so it’s desirable to avoid that cost if possible.

The reticulate R package already provides a way of passing data between R and Python in the same process, but it is not as efficient as using arrow. Let’s look at two examples of passing data from R to Python and then returning it to R. In the first example, we’ll use reticulate’s standard method of passing data between the two languages, and in the second, we’ll see how using Arrow speeds things up.

library(reticulate)

# Create a Python virtual environment with pyarrow and pandas installed,
# then tell reticulate to use it for this session
virtualenv_create("pyarrow-env")
install_pyarrow("pyarrow-env")
py_install("pandas", "pyarrow-env")
use_virtualenv("pyarrow-env")

First, let’s load the data for Washington state into an in-memory data frame.

washington <- open_dataset("./data/person") |>
  filter(location == "wa") |>
  collect()

Because we called collect() on our data pipeline, the washington object is in memory as an R data frame. We’ll send it to Python using reticulate::r_to_py(), then back to R with py_to_r(). In a real-world example, you wouldn’t just go back and forth: you would switch to Python because there’s some work you need to do there. For this example, though, we do only the round trip, to show the cost of that part alone.

We need to have the pandas Python library loaded so that the data can be passed to Python as a pandas DataFrame; otherwise, it’ll be passed as a Python dictionary.

pd <- import("pandas")

returned_data <- washington |>
  r_to_py() |>
  py_to_r()

This took about 26 seconds when we ran it, because reticulate has to take the R data frame, convert it into the equivalent Python structure (a pandas DataFrame), and then convert it back into an R data frame.
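
If you want to measure this on your own machine, one simple approach is to wrap the round trip in system.time(); the timings you see will vary with your hardware and the size of the data.

# Time the full round trip from an R data frame to pandas and back
system.time({
  returned_data <- washington |>
    r_to_py() |>
    py_to_r()
})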

Let’s see how that looks using Arrow. By calling compute() at the end of our pipeline instead of collect(), we keep the result of the query as an Arrow Table rather than pulling it into an R data frame.

pa <- import("pyarrow")

washington_table <- open_dataset("./data/person") |>
  filter(location == "wa") |>
  compute()

returned_arrow_data <- washington_table |>
  r_to_py() |>
  py_to_r()

This time it only took 0.2 seconds, which was a huge speed-up! How does this work?

Unlike in the first case, where the data in R had to be copied and translated into a pandas DataFrame, here we used the Arrow C data interface to pass Python a pointer to where the Arrow Table is stored in memory, and Python can use it to work with the table directly. The result in Python is a pyarrow Table. Naturally, the same happens when the data is passed back from Python to R. So there is no converting or copying of data: there is only a handoff of ownership.
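
If you’re curious, you can stop halfway through the round trip and inspect what Python receives. The snippet below is a quick check we might add; the exact class names printed will depend on the pyarrow version.

# The object handed to Python is a pyarrow Table wrapping the same
# memory, not a pandas copy of the data
py_table <- r_to_py(washington_table)
class(py_table)
py_table$num_rows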

9.1.3 DuckDB

The integration between arrow and duckdb builds on this machinery to share data between the two query engines. Both DuckDB and Acero, the Arrow C++ query engine, operate on chunks of data at a time. This allows them to parallelize across multiple threads and efficiently stream results from one stage of query evaluation to the next. Each stage, or node (including the data source node), behaves like an iterator: a node requests a batch of data from the previous node, processes it, and then requests the next batch, until there are no batches left.

In the Arrow C++ library, and thus also in R, this iterator is represented as a RecordBatchReader. The integration with DuckDB deals in RecordBatchReaders: to_duckdb() hands a reader off to duckdb, and to_arrow() receives one back from duckdb. This works using an extension of the Arrow C data interface called the C stream interface. The C struct contains a reference to a callback function, which the consumer calls to request the next batch of data. As a result, batches of data can flow from one query engine to the other, almost as if they were a single engine.
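
You can see this iterator behavior from R directly. The sketch below assumes the washington_table from the previous section is still in memory: as_record_batch_reader() wraps it in a RecordBatchReader, and repeated calls to read_next_batch() yield one RecordBatch at a time until the stream is exhausted.

# Wrap an in-memory Table in a RecordBatchReader and consume it the
# way a downstream engine would: one batch at a time
reader <- as_record_batch_reader(washington_table)

n_batches <- 0
batch <- reader$read_next_batch()
while (!is.null(batch)) {
  # each batch is a RecordBatch; read_next_batch() returns NULL
  # once there are no batches left
  n_batches <- n_batches + 1
  batch <- reader$read_next_batch()
}
n_batches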

Let’s take a look at how this works in practice. The example below is a variation on the one we started the book with: we are finding the mean commute time by year, though this time we want it in hours instead of minutes and we aren’t breaking it down by mode of transport.

pums_person <- open_dataset("./data/person")

commute_by_year <- pums_person |>
  select(JWMNP, PWGTP, year) |>
  mutate(JWMNP_hours = JWMNP / 60) |>
  group_by(year) |>
  summarize(
    mean_commute_time = sum(JWMNP_hours * PWGTP, na.rm = TRUE) / sum(PWGTP, na.rm = TRUE),
    n_commuters = sum(PWGTP, na.rm = TRUE),
    .groups = "drop"
  ) |>
  collect()

With the query running entirely in arrow, it takes 3.6 seconds to complete.

Now, to demonstrate the efficiency of swapping between arrow and duckdb, we will send the data to duckdb only for the minutes-to-hours mutation. Swapping back and forth for a mutation that arrow can handle itself isn’t something you would do in the real world, but it shows how efficiently the same calculation can be done while passing the data back and forth between engines.

pums_person <- open_dataset("./data/person")

commute_by_year <- pums_person |>
  select(JWMNP, PWGTP, year) |>
  to_duckdb() |>   # hand the data off to duckdb as a stream of batches
  mutate(JWMNP_hours = JWMNP / 60) |>
  to_arrow() |>    # stream the result back to arrow, batch by batch
  group_by(year) |>
  summarize(
    mean_commute_time = sum(JWMNP_hours * PWGTP, na.rm = TRUE) / sum(PWGTP, na.rm = TRUE),
    n_commuters = sum(PWGTP, na.rm = TRUE),
    .groups = "drop"
  ) |>
  collect()

Going to duckdb for the mutation and then back to arrow for the rest, this takes 4.5 seconds to complete—only slightly longer. There is a small amount of overhead, but nowhere near as much as you would see if we had to serialize the data to a CSV, or even a Parquet file, to pass it back and forth.

This example is contrived specifically to show the low overhead of passing data back and forth: in the real world, there would be no reason to send data to duckdb for a computation that arrow can do itself, and vice versa. But the integration is extremely helpful when duckdb—or another library that can speak Arrow—has a function that arrow doesn’t.
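
For example, at the time of writing arrow’s query engine does not evaluate window functions such as min_rank(), while duckdb does. The sketch below is illustrative and assumes duckdb’s dbplyr translation handles min_rank(); the ranking happens in duckdb and the rest of the query stays in arrow.

# Rank commute times in duckdb, then hand the stream back to arrow
ranked_commutes <- pums_person |>
  select(JWMNP, PWGTP, year) |>
  to_duckdb() |>
  mutate(commute_rank = min_rank(JWMNP)) |>   # window function: duckdb only
  to_arrow() |>
  filter(commute_rank <= 100) |>
  collect()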

9.2 Apache Spark integration: sharing across processes

In the previous section, we talked about sharing data within an individual system process, but what if we want to share data between separate processes? Different processes can’t share each other’s memory: each has to allocate its own, and they share data by sending messages. This introduces overhead, both in allocating memory for the copy of the data and in encoding and decoding the message: our friends serialization and deserialization again.

As you may expect by now, Arrow provides a means to minimize that overhead. A great example of the benefits of using Arrow to communicate across systems is Apache Spark. Speeding up data access with Spark was one of the initial use cases that demonstrated the value of Arrow. The first blog post illustrating the benefits, focused on PySpark, is from 2017; an R version with sparklyr came out in 2019. Both examples show speedups on the order of 20 to 40 times, depending on the workflow.

Without Arrow, Spark had to send data one row at a time, serialized to a less efficient format, and then on the receiving side, pandas or R would have to reconstruct the data frame from the records. Sending data back to Spark did the same thing in reverse. As the benchmarks demonstrate, switching between row and column layouts is costly. With Arrow, data can stay in a columnar format, with less copying and transformation.

However, because we aren’t running Spark in the same process, and likely not even on the same machine, we can’t just point to a block of memory and start working with it, as we did with the C data interface. This is where Arrow’s interprocess communication (IPC) format comes in. We’ve already seen this, in fact: it’s the “Arrow file format”. But it doesn’t have to be written to a file; the important aspect is that it is fully encapsulated and can be sent from one process or system to another.

Similar to the C data interface for intraprocess communication, the IPC format mirrors the layout of the data in memory almost exactly, so the serialization cost is near zero. But unlike communicating within the same process, there is some cost to sending or receiving: you need to read the data from disk or send it over the network, and allocate memory to hold it.
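
To make this concrete, the arrow package can serialize a table to the IPC stream format entirely in memory: the resulting raw vector contains exactly the bytes that would be written to a file or sent over a socket. A small sketch, assuming read_ipc_stream() can read from a raw vector as well as from a file:

tbl <- arrow_table(x = 1:5, y = c("a", "b", "c", "d", "e"))

# Serialize to the Arrow IPC stream format in memory
ipc_bytes <- write_to_raw(tbl, format = "stream")

# Deserializing is cheap because the bytes mirror the in-memory layout
read_ipc_stream(ipc_bytes)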

A nice feature of the sparklyr integration is that you, as the user, don’t need to change your analysis code to take advantage of it. All you have to do is call library(arrow): if the arrow R package is loaded, sparklyr will use arrow for data transfer automatically.
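
A sketch of what that looks like in practice, assuming a local Spark installation set up for sparklyr: the only Arrow-specific line is library(arrow), and the rest is an ordinary sparklyr workflow.

library(arrow)      # loading arrow is all that is needed to enable it
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Copying data to Spark and collecting results back are both faster
# when the transfer uses the Arrow format
washington_spark <- copy_to(sc, washington, "washington")

commutes_per_year <- washington_spark |>
  group_by(year) |>
  summarize(n = n()) |>
  collect()

spark_disconnect(sc)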

9.3 nanoarrow

Since Arrow has become the standard for columnar data, it is easier to integrate databases and data products with R. Rather than having to implement an adapter for the product’s custom format, you can just use Arrow to connect with it. However, as this book shows, the arrow R package does a lot of things. If you just need to bring Arrow data into an R data frame, you don’t need all of arrow’s cloud file systems, its Parquet reader, the query engine, and other features.

This is where nanoarrow comes in. nanoarrow is a minimal library that implements just the C data interface and the IPC format, with versions in C, C++, R, and Python. The R package supports translating Arrow data to and from R data frames, and not much more. While it lacks many features of the full arrow R package, that is exactly the point: it provides a minimal interface that lets users work with Arrow data structures from a small library.
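
A minimal sketch of what that looks like, using only nanoarrow (no arrow package required):

library(nanoarrow)

# Convert an R data frame into an Arrow (struct) array using just the
# C data interface bindings
arr <- as_nanoarrow_array(data.frame(x = 1:3, y = c("a", "b", "c")))

# ...and convert Arrow data back into an R data frame
convert_array(arr)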

While nanoarrow is relatively new, some R packages already use it. The polars package, for example, uses nanoarrow to bring Arrow-format data into R from the Polars data frame library, which is built on an implementation of Arrow in Rust. Similarly, tiledb uses nanoarrow to bring Arrow data into R without requiring a dependency on the full arrow R package.

The nanoarrow project is a great example of how Arrow improves the experience of working with data. We have focused in this book on the arrow R package and the many ways it can be useful, but the Apache Arrow project and its mission of improving the foundations of data analysis are much bigger than any one package. Even when you aren’t using the arrow R package itself, Arrow may be there behind the scenes, making your life easier.

9.4 Looking ahead

The list of examples of using Arrow to speed up data interchange is long, and it’s still growing. Particularly as Arrow becomes more central to the internal workings of databases and query engines, we expect to see more projects using Arrow in R in more ways.

One promising direction is getting data out of databases. For decades, the predominant standard for communicating with databases has been ODBC, which specifies how database drivers should receive SQL queries and return data. Both ODBC and JDBC, a similar standard implemented in Java, are row-oriented APIs, which, as we have seen in previous chapters, means a conversion step is required to get into R’s column-oriented data structures. The cost is even greater when the database you are querying is itself columnar: data is converted from columns to rows and then back to columns.

The Arrow project has defined a new standard called ADBC. It is an API for communicating with databases that sends and receives Arrow-formatted data. By writing database drivers that conform to the ADBC interface, getting data in and out of databases can be made more efficient and easier to work with on the client side. At the time of writing this book, ADBC has only begun to see adoption, but it has the potential to greatly improve the performance of querying databases.
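
To give a flavor of what this looks like from R, the adbcdrivermanager package provides the client-side API, with individual drivers distributed as their own packages. The sketch below uses the SQLite driver as a stand-in for a real database and assumes the adbcdrivermanager and adbcsqlite packages are installed; treat it as illustrative, since the interface may evolve as ADBC matures.

library(adbcdrivermanager)

# Connect to an in-memory SQLite database through its ADBC driver
db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = ":memory:")
con <- adbc_connection_init(db)

# Data goes in and comes out in Arrow format, with no row-by-row
# conversion step on the client side
mtcars |> write_adbc(con, "mtcars")
con |>
  read_adbc("SELECT cyl, COUNT(*) AS n FROM mtcars GROUP BY cyl") |>
  as.data.frame()

# Clean up
adbc_connection_release(con)
adbc_database_release(db)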

A related area of interest is Arrow Flight RPC, a framework for high-performance data transfer across networks. It is an alternative to sending messages over regular HTTP and is designed to maximize network throughput. ADBC database drivers could be implemented using Flight, or it could be used in custom data services. In principle, the Spark integration could be further accelerated by switching from sending Arrow IPC files to using Flight. As with ADBC, Flight is not yet widely adopted, but it holds promise for the future.