3  Getting Started

In this chapter we introduce the arrow package and the data that we’ll be using in the majority of examples in this book. We’ll also introduce key concepts that we’ll go into more detail about in later chapters.

3.1 Installing Arrow

The arrow R package provides bindings to the Arrow C++ library, and so both must be installed together. Normally you don’t have to do anything unusual for this: as with other R packages, arrow can be installed using install.packages().

install.packages("arrow")

If you want to customize your arrow installation, you can find more information in the installation guide, though in most circumstances this isn’t necessary and the default installation contains all the features you need to work productively with arrow.
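Once arrow is installed, one quick way to check which optional capabilities your build includes (compression codecs, cloud storage support, and so on) is arrow_info():

# Prints the arrow package version along with the optional features
# this build was compiled with.
arrow::arrow_info()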

3.2 PUMS dataset

Many of the examples in this book use data from the Public Use Microdata Sample (PUMS) from the United States Census. Although the full-country census takes place every 10 years, the American Community Survey is conducted every year, and its PUMS data is what we use here. Our dataset covers the years 2005–2022. The raw data was retrieved from the Census’s FTP site, with many values recoded and cleaned, so we can focus here on demonstrating arrow’s capabilities.

The data comes from a detailed survey that is sent out to a subset of US residents every year. The Census Bureau releases the dataset for public use in a raw CSV form; we have cleaned it up and converted it to a Parquet-based dataset for demonstration purposes in this book.

One thing we have to pay attention to is that this dataset is weighted, so we can’t simply count the number of rows to get an accurate count of population — instead we sum or multiply by the weighting variables. This is why the example in the Introduction did sum(PWGTP) rather than just n() to count the population. We will discuss this weighting in our analysis below. If you want to know more details about the dataset, including how you can get hold of it, you can read more about it in Section A.2.1.
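To see what this means in practice, here is a small sketch using the pums_person Dataset object that we create in Section 3.3 below (dplyr provides the verbs we use with arrow):

library(dplyr)

# Counting rows counts survey responses, not people; summing the person
# weight (PWGTP) estimates the population those responses represent.
pums_person |>
  summarize(
    n_rows = n(),
    population = sum(PWGTP)
  ) |>
  collect()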

3.3 Opening the dataset

Let’s take a look at the data in R. The data is stored in a directory called ./data/pums/person. This is further split into multiple directories, one for each year, and then within those, one for each location. Finally, within each location directory, there is a single Parquet file containing the data for that year and location.1

./data/pums/person/
├── year=2005
│   ├── location=ak
│   │   └── part-0.parquet
...
│   └── location=wy
│       └── part-0.parquet
├── year=2006
│   ├── location=ak
│   │   └── part-0.parquet
...
│   └── location=wy
│       └── part-0.parquet

If we want to take a quick look at one of the files in the dataset, we can use read_parquet() to read it into R.

library(arrow)
read_parquet("./data/pums/person/year=2021/location=ca/part-0.parquet")

Our full dataset is stored across multiple files. We don’t have to worry about that though, as arrow can work with them as a single object, called an Arrow Dataset. To open the dataset in arrow, we use the open_dataset() function and provide the path to the data. Arrow can work with data in multiple formats, including delimited text formats like CSV, as well as JSON and Parquet, an efficient binary file format with support in many languages, which is what we are working with here. We’ll discuss the different formats in detail in Chapter 5.

pums_person <- open_dataset("./data/pums/person")
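By default, open_dataset() treats the files as Parquet, which is what we have here. If your data were stored in a different format, such as CSV, you could specify that with the format argument; a quick sketch with a hypothetical path:

# Hypothetical: open a CSV-based copy of the same data instead.
pums_person_csv <- open_dataset("./data/pums_csv/person", format = "csv")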

Now Arrow knows where our data is stored, so let’s learn more about the dataset we just created. Let’s begin by looking at its size. How big is our dataset in terms of rows and columns?

dim(pums_person)
[1] 53528050      311

It’s over 53 million rows and 311 columns, which sounds pretty big. It’s certainly more than can be loaded into the memory of most machines people use in their day-to-day work.

So how can we work with this data? When we called open_dataset() earlier, we created an Arrow Dataset object, which doesn’t read all of the data into our R session. Instead, it captures information about where the data is on disk and some additional metadata. We can see some of this metadata if we print the dataset.

pums_person
FileSystemDataset with 884 Parquet files
311 columns
SPORDER: int32
RT: dictionary<values=string, indices=int32>
SERIALNO: string
PUMA: string
ST: string
ADJUST: int32
PWGTP: int32
AGEP: int32
CIT: dictionary<values=string, indices=int32>
COW: dictionary<values=string, indices=int32>
DDRS: bool
DEYE: bool
DOUT: bool
DPHY: bool
DREM: bool
DWRK: bool
ENG: dictionary<values=string, indices=int32>
FER: bool
GCL: bool
GCM: dictionary<values=string, indices=int32>
...
291 more columns
Use `schema()` to see entire schema

The output above shows that the data is divided into 884 Parquet files: 17 years’ worth of data for each of the 50 states plus Puerto Rico and the District of Columbia. After the number of columns in the dataset, we can also see the schema: the mapping of column names to data types. Parquet files carry metadata about what each column contains, which is richer than what text-based formats provide. We’ll discuss types, metadata, type inference, and how you can control it, in Chapter 5.
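As the printed output suggests, we can pull up the full schema if we want to see all 311 columns and their Arrow types:

# Show the complete mapping of column names to Arrow data types.
schema(pums_person)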

Note that the data types are not R’s data types—numeric, character, factor, etc.—they come from the Arrow format. As we mentioned in Chapter 2, Arrow has many more types than R, and this richer type system enables more precise control over how your data is represented. Importantly, it also allows for interoperability across systems without loss of information. Some Arrow data types map directly onto R data types, while others have no R equivalent. This interoperability is core to the broader Arrow project. We’ll get more into this in Chapter 9.

All of R’s types can be represented in Arrow without loss of precision, and while there are many types in Arrow that don’t have a direct mapping to types in R, nearly all have a translation that preserves the data with high fidelity. For more details, see Section A.2.2.
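If you’re curious how individual R vectors translate, one way to see it is to create Arrow arrays from them and inspect the inferred types; a small sketch:

# R integers become Arrow int32, doubles become float64 ("double"),
# character vectors become strings, and factors become dictionaries.
Array$create(1:3)$type
Array$create(c(1.5, 2.5))$type
Array$create(c("a", "b"))$type
Array$create(factor(c("a", "b")))$type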

3.4 Querying the dataset

We have an Arrow Dataset that is pointing to multiple files on disk. Even though we can’t fit all of the data into memory, we can query it using dplyr-like syntax with arrow. Let’s take a look at some of the questions we might want to ask about the data. For example, how did the mode of transit and average commute time change over time?

The query below is pretty complicated — but we’ve done that on purpose. We will talk in Chapter 4 about how we have cleaned and curated the PUMS dataset. Even with the cleaned data, we have to do a bit more processing to get it ready for analysis and plotting. First, the variables JWTR (“Transportation to work”) and JWTRNS (“Means of transportation to work”) cover effectively the same question, but the variable was renamed starting in 2019, and there were some subtle changes to the values as well. So we start off by coalescing those two variables together and then using a case_when() statement to align the values. We then add a variable that classifies each mode as “public”, “private”, or “other” for ease of plotting later. Finally, we calculate the total number of commuters and the mean commute duration for each mode of transport. One thing to note here is that we need to calculate weighted means because of the survey methodology used.2

Again, don’t worry if this looks like a lot! We picked an example that needed some work on purpose, to show just how easily we can take the tools we know already, like dplyr, and use them with arrow on larger-than-memory datasets.

commute_by_mode <- pums_person |>
  mutate(mode_of_transit = coalesce(JWTR, JWTRNS)) |>
  filter(!is.na(mode_of_transit)) |>
  mutate(
    mode_of_transit = case_when(
      grepl("(car.*truck.*van|motorcycle)", mode_of_transit, ignore.case = TRUE) ~
        "Car, motorcycle",
      grepl("bicycle", mode_of_transit, ignore.case = TRUE) ~
        "Bicycle",
      grepl("walked", mode_of_transit, ignore.case = TRUE) ~
        "Walked",
      grepl("bus", mode_of_transit, ignore.case = TRUE) ~
        "Bus",
      grepl("(subway|streetcar)", mode_of_transit, ignore.case = TRUE) ~
        "Subway, light rail",
      grepl("(railroad|commuter)", mode_of_transit, ignore.case = TRUE) ~
        "Commuter rail, train",
      grepl("worked .* home", mode_of_transit, ignore.case = TRUE) ~
        "Worked from home",
      grepl("(ferry ?boat|other.*)", mode_of_transit, ignore.case = TRUE) ~
        "Other method",
      grepl("taxi", mode_of_transit, ignore.case = TRUE) ~
        "Taxicab"
    ),
    public_private = case_when(
      mode_of_transit %in% c("Car, motorcycle", "Bicycle", "Walked") ~
        "Private",
      mode_of_transit %in% c("Bus", "Subway, light rail", "Commuter rail, train") ~
        "Public",
      mode_of_transit %in% c("Taxicab", "Other method", "Worked from home") ~
        "Other"
    )
  ) |>
  group_by(year, mode_of_transit, public_private) |>
  summarize(
    mean_commute_time = sum(JWMNP * PWGTP, na.rm = TRUE) / sum(PWGTP),
    n_commuters = sum(PWGTP)
  )

commute_by_mode |>
  collect()

Now that we have our dataset containing commuters by mode of transport, we can establish a 2005 baseline and then calculate the percentage change for each group to make our plots relative. We do this by making a subset, baseline, that contains just the data from the year 2005, and then we use a standard join and calculate percentage changes from 2005 for each year. Note that until we call collect() we aren’t pulling any data into R — it’s all an arrow query that we are building up.

baseline <- commute_by_mode |>
  filter(year == 2005) |>
  select(
    mode_of_transit,
    mean_commute_baseline = mean_commute_time,
    n_commuters_baseline = n_commuters
  )

to_plot <- commute_by_mode |>
  left_join(baseline) |>
  mutate(
    mean_commute_time_relative =
      (mean_commute_time - mean_commute_baseline) / mean_commute_baseline,
    n_commuters_relative =
      (n_commuters - n_commuters_baseline) / n_commuters_baseline
  ) |>
  collect()

Now we can use the data in to_plot that we just pulled into R to plot the number of commuters for each mode over time, as a percentage change from the 2005 baseline:
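Here is a minimal sketch of how a plot like this could be drawn with ggplot2 (assuming the ggplot2 and scales packages are installed); the styling of the figure shown in the book may differ:

library(ggplot2)

# Percentage change in the number of commuters per mode, relative to 2005.
ggplot(
  to_plot,
  aes(x = year, y = n_commuters_relative, colour = mode_of_transit)
) +
  geom_line() +
  facet_wrap(~public_private) +
  scale_y_continuous(labels = scales::percent) +
  labs(
    x = "Year",
    y = "Change in commuters since 2005",
    colour = "Mode of transit"
  )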

One striking thing is that after the onset of the COVID-19 pandemic in 2020, the number of people working from home jumped to over 500% compared to 2005. There was already growth in working from home before then, but there was — understandably — a giant bump there. We also see public transit modes drop in 2021 and then come back up in 2022.

And we can see how commute durations also change, shown as mean duration in minutes. Below we can see that, besides “Other method”, commutes by every mode of transit seem to be getting longer over time, with commuter rail and bicycles seeing some of the largest increases.
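The same approach works for the durations; a sketch plotting the weighted mean commute time in minutes:

# Weighted mean commute time (minutes) per mode of transit over time.
ggplot(
  to_plot,
  aes(x = year, y = mean_commute_time, colour = mode_of_transit)
) +
  geom_line() +
  labs(
    x = "Year",
    y = "Mean commute time (minutes)",
    colour = "Mode of transit"
  )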

As we’ll see in Chapter 4, arrow has support for hundreds of functions you can call and compose this way, which allow you to do all sorts of transformations and aggregations on data that is too big to read into memory. These generally work just as they do when you use dplyr on an in-memory R data.frame, with one difference: you have to call collect() at the end to evaluate the query. This follows the model of dbplyr, in which each step in the pipeline builds up a query that you then send to the database to evaluate and return the result. But with arrow, there is no remote database: everything is running in your R process.

3.5 How is it so easy to query such a large dataset?

Running queries like the one above so quickly on a large dataset is made possible by three things:

  • arrow’s ability to quickly process things in parallel

  • the use of the Parquet format

  • partitioning our data

One of the things that arrow provides is a way to read and write Parquet files in R. As we will explore in greater detail in Chapter 5, Parquet files are nice because they are fast, compressed, and store data with high fidelity. In the same way that “high-fidelity” audio systems play music without distortion or loss of quality, Parquet stores data without distortion or loss of information, thanks to qualities of the format such as the metadata stored alongside the data itself.

Parquet is a binary file format optimized for storing analytics data. Parquet files are much smaller than the equivalent CSVs, faster to work with, and also contain metadata about the schema of the data. We created a Parquet version of the PUMS dataset that we’ll be using in most of the examples in this book. The Parquet version of this data is around 5 times smaller than a CSV version of the same data: the CSV version is around 45 GB, while the Parquet version is around 8.5 GB on disk. It’s also faster to query: at least 11 times faster for even simple queries, and sometimes much, much faster.
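If you want to check this kind of difference yourself, here is a rough sketch that writes one slice of the data out in both formats and compares the sizes (the output filenames are illustrative):

# Read one year/location slice and write it out as Parquet and CSV.
slice <- read_parquet("./data/pums/person/year=2021/location=ca/part-0.parquet")
write_parquet(slice, "slice.parquet")
write_csv_arrow(slice, "slice.csv")

# Ratio of the CSV size to the Parquet size on disk.
file.size("slice.csv") / file.size("slice.parquet")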

Earlier we mentioned that the dataset is stored in directories based on the year and location. This is called partitioning, and Arrow is designed to take advantage of it to make dataset queries faster. It does this by only reading the data from the partitions that are relevant to the query. For example, if we filter the data to only include years after 2014, Arrow only reads the data from the directories for 2015 onwards and ignores the files from the earlier years entirely. If you partition your data based on the variables you query most often, this can make your queries faster. We talk more about the art and the science of partitioning datasets in Chapter 6.
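For instance, because year is one of our partition variables, a query like this sketch only touches the directories for 2015 onwards:

# Arrow can skip the year=2005 through year=2014 directories entirely,
# since the filter is on a partition variable.
pums_person |>
  filter(year > 2014) |>
  summarize(population = sum(PWGTP)) |>
  collect()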

3.6 How does this connect with the rest of the modern data ecosystem?

Now that we’ve introduced some of the core concepts and features of the arrow R package, we turn our attention to the ways Arrow is designed to help you participate in the broader modern data ecosystem. We’ve already alluded a few times to how the Arrow standard was designed to be used across different technologies in the modern data stack. But let’s make that concrete: what does that actually mean, and why is it important?

Before the advent of standards like the Arrow format, much—in some cases most!—of the time in data processing pipelines was spent converting data from one format to another. This process is called serialization, because the data is changed into a format that is easily transported from one system to another and well known enough that each of the two systems knows how to read or write it. These conversions were needed because each system generally had a specialized format it used internally, and other systems either didn’t know how to read it or were blocked from doing so even if they could have operated on it. Arrow changes this by being a modern data format that many different projects have either adopted as their internal representation, or whose internal representation is close enough that they can quickly and cheaply speak Arrow with other systems.

This interconnection and standardization means that it’s easy to use the arrow R package alongside other packages and systems that speak Arrow, like duckdb and pyarrow (Chapter 9), or alongside geospatial packages (Section 8.3). And, because we are in the modern data world, all of this needs to be readable from cloud storage. The Arrow C++ library includes many features that let you read from cloud storage easily (Chapter 7).
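As a small taste of what this looks like in practice, here is a sketch that hands the same Dataset to DuckDB with to_duckdb() (assuming the duckdb and dbplyr packages are installed); the cloud storage example uses a hypothetical bucket name:

# Query the same files with DuckDB's engine, without copying the data.
pums_person |>
  to_duckdb() |>
  count(year) |>
  collect()

# Datasets can also be opened directly from cloud storage, for example:
# open_dataset("s3://example-bucket/pums/person/")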


  1. Unfamiliar with the Parquet file format? Don’t worry, we’ll cover that in Chapter 5↩︎

  2. The PUMS dataset comes from surveying around 1% of the US population. It also asks a number of sensitive questions, so the Census Bureau is careful to avoid accidentally identifying specific people in the dataset. For these two reasons, the dataset is actually not the raw responses—where each row is one respondent—but rather each row has a specific weight applied to it. This weight could be thought of as something along the lines of “this number of respondents responded with this set of answers” though it is more complicated than that. Because of this, in order to make estimates about populations, we need to use the weighting columns from the dataset which tell us how many people are represented in each row to get an accurate measure in our final calculations.↩︎