2  Getting Started


In this chapter we will introduce the package and the data that we’ll be using in the majority of examples in this book. We’ll also introduce key concepts that later chapters explore in more detail.

2.1 Installing Arrow

The arrow R package provides bindings to the Arrow C++ library, so the two must be installed together. Normally you don’t have to do anything unusual for this: as with other R packages, arrow can be installed using install.packages().

install.packages("arrow")

If you want to customize your arrow installation, you can find more information in the installation guide, though in most circumstances this isn’t necessary: the default installation contains all the features needed to work productively with arrow.
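
Once arrow is installed, you can check that it loads and see which optional features your build includes by calling arrow_info(), which prints version and capability details (the exact output varies by platform and version).

library(arrow)

# Print build details for this installation, including which optional
# capabilities (e.g., compression codecs, S3 support) are enabled.
arrow_info()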

2.2 PUMS dataset

Many of the examples in this book use data from the Public Use Microdata Sample (PUMS) from the United States Census. Although the full-country census takes place every 10 years, the American Community Survey is conducted every year, and its PUMS data is what we use here. The dataset covers the years 2005–2022. The raw data was retrieved from the Census Bureau’s FTP site, with many values recoded and cleaned, so we can focus here on demonstrating arrow’s capabilities.

The data comes from a detailed survey sent to a subset of US residents every year, and is released for public use by the Census Bureau in raw CSV form. We have cleaned it up and converted it to a Parquet-based dataset for use with Arrow in the examples throughout this book.

One thing we have to pay attention to is that this dataset is weighted, so we can’t simply count the number of rows to get an accurate count of population. Instead, we sum the weighting variables (or multiply by them when aggregating other values). This is why the example in the Introduction used sum(PWGTP) rather than n() to count the population. We will discuss this weighting in our analysis below. If you want to know more details about the dataset, including how you can get hold of it, you can read more about it in Section A.2.1.
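
To make the weighting concrete, here is a small illustration with invented numbers (the column names follow the PUMS conventions, but the values are made up): each row stands in for PWGTP people, so a population estimate sums the weights instead of counting rows.

library(dplyr)

# Three survey responses; each represents PWGTP people.
toy <- tibble::tibble(
  AGEP  = c(34, 41, 29),   # age of respondent
  PWGTP = c(120, 85, 97)   # person weight
)

# n() counts rows (3), while sum(PWGTP) estimates the population (302).
toy |> summarize(respondents = n(), population = sum(PWGTP))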

2.3 Opening the dataset

Let’s take a look at the data in R. The data is stored in a directory called ./data/pums/person. This is further split into multiple directories, one for each year, and then, within each year directory, one for each location. Finally, within each location directory there is a single Parquet file containing the data for that year and location.

./data/pums/person/
├── year=2005
│   ├── location=ak
│   │   └── part-0.parquet
...
│   └── location=wy
│       └── part-0.parquet
├── year=2006
│   ├── location=ak
│   │   └── part-0.parquet
...
│   └── location=wy
│       └── part-0.parquet
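
Because the directory names follow the key=value convention (often called Hive-style partitioning), arrow can treat this whole tree as a single dataset and infer year and location as columns from the paths. As a minimal sketch, assuming the layout above:

library(arrow)

# Open the partitioned dataset without reading it into memory;
# year and location are inferred from the directory names.
pums <- open_dataset("./data/pums/person")
pums

Printing pums shows the dataset’s schema rather than its rows, because open_dataset() does not read the file contents into memory.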

If we want to take a quick look at one of the files in the dataset, we can use read_parquet() to read it into R.

library(arrow)
path <- "./data/person/year=2021/location=ca/part-0.parquet"
read_parquet(path)
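
read_parquet() returns the file’s contents as a tibble. Because Parquet is a columnar format, we can also read just the columns we need via the col_select argument. As an example, here we select the PUMS age variable (AGEP) along with the person weight (PWGTP):

# Read only the age and person-weight columns from the file.
read_parquet(path, col_select = c(AGEP, PWGTP))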