4  Files and Formats

Bigger Data, Easier Workflows

Authors

Nic Crane, Jonathan Keane, and Neal Richardson

In the previous chapter, we started exploring a dataset made up of CSV files, and then skipped ahead to working with a version we’d converted to Parquet files. While CSVs are ubiquitous, Parquet files have a lot of nice features that make them ideal for analytic workloads.

In this chapter, we’ll explore the different file formats that Arrow can read and write, focusing particularly on the tools you need to ingest data from CSVs and store them in Parquet files with high fidelity.

We’ll start off by creating a directory to work with temporary files created in this chapter.

tmp_dir <- "./data/transient_data"
dir.create(tmp_dir)

4.1 Overview of supported formats

Arrow supports reading and writing files in multiple formats:

  • CSV and other text-delimited formats
  • newline-delimited JSON
  • Parquet
  • the Arrow format itself

The functions for reading and writing individual files in arrow are shown in Table 4.1.

Table 4.1: Functions for reading and writing individual files

Format                               Reading               Writing
CSV or other delimited file          read_delim_arrow()    write_csv_arrow()
                                     read_csv_arrow()
                                     read_csv2_arrow()
                                     read_tsv_arrow()
Newline-delimited JSON               read_json_arrow()     (Not supported)
Parquet                              read_parquet()        write_parquet()
Arrow IPC Format (formerly Feather)  read_ipc_file()       write_ipc_file()
                                     read_feather()        write_feather()

Let’s briefly review the file formats.

4.1.1 CSV

A CSV is one of the simplest and most widely-known formats for data storage and sharing. It is a plain-text file in which each line is a row of data, and values are separated into columns by commas (hence “comma-separated values”). CSV is the most common of a family of text-delimited formats; any character can be used as a delimiter, such as tab (as in TSV) or semicolon.

The CSV’s main strength is that everyone and everything can read it. However, CSVs have some serious limitations. Most importantly, CSVs do not contain all of the information you need to read in the data faithfully. Because the data is encoded as text, and because there is no standard way to indicate what the data types are for each column, you either have to specify types yourself, or allow the library to attempt to guess the types. Essentially, CSVs are a lossy file format: when you write out data to a CSV and read it back in, features of the data are lost because they are not contained in the file.

The encoding of data as text has other costs as well. Because you have to parse strings to read numbers, and have to convert numbers to strings when writing, there’s a serialization cost relative to binary formats that can represent these types natively. Text is also inefficient for representing numbers: for example, the number -2147483647 takes up 88 bits as text (11 characters, 8 bits each), but if stored in its numeric form it can be represented as a 32-bit integer.
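To make this concrete, here is a small sketch of that arithmetic in R; the value and the bit counts are just for illustration.

# Rough size of the same value stored as text versus as a 32-bit integer
n_text_bits <- nchar("-2147483647") * 8  # 11 characters x 8 bits = 88 bits
n_int_bits <- 32                         # the same value as a 32-bit integer
c(text = n_text_bits, integer = n_int_bits)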

4.1.2 Newline-delimited JSON

Another format supported by Arrow is newline-delimited JSON (NDJSON), also known as JSON lines (JSONL). This format is a subset of JSON, and is different from other JSON formats in that each line is a JSON object with a shared schema. It is commonly used for streaming data.

NDJSON files look something like this:

{"species": "cat", "sounds": ["meow", "purr", "hiss"]}
{"species": "dog", "sounds": ["woof", "bark", "growl"]}
{"species": "bird", "sounds": ["tweet", "squawk"]}

One advantage of NDJSON is that it is human-readable and supports nested data structures. JSON also has a slightly richer type system than CSV: for example, true is parsed as a boolean value and not a string. However, it only supports JSON’s basic types (strings, numbers, booleans, and nested objects and arrays), and it is still a text-based format, so it requires parsing and is large on disk, typically larger than a comparable CSV because the column names are repeated in each row.

writeLines(
  c(
    '{"species": "cat", "sounds": ["meow", "purr", "hiss"]}',
    '{"species": "dog", "sounds": ["woof", "bark", "growl"]}',
    '{"species": "bird", "sounds": ["tweet", "squawk"]}'
  ),
  file.path(tmp_dir, "animals.jsonl")
)

read_json_arrow(file.path(tmp_dir, "animals.jsonl"))

Arrow supports reading NDJSON files, but at the time of writing does not support writing them.
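If you do need to produce NDJSON from R, one workaround outside of arrow is jsonlite::stream_out(), which writes one JSON object per line; the sketch below operates on a plain data frame rather than an Arrow Table.

# Sketch: writing NDJSON with jsonlite rather than arrow
library(jsonlite)

animals <- data.frame(species = c("cat", "dog", "bird"))
stream_out(animals, file(file.path(tmp_dir, "animals_out.jsonl")))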

4.1.3 Parquet

Parquet is a standard, open format widely used in the big data ecosystem. It is a binary file format: Parquet files are designed to be read by computers rather than humans. This means that unlike CSVs, you can’t just open them in a text editor and inspect the contents. It is also a columnar format, meaning that data in each column is saved together, as opposed to CSVs and many other text-based formats, which are row based.

Parquet files have a number of features that make them small on disk, fast to read in, and optimal for analytic workloads. They support many compression algorithms and data encodings, and they can include statistics about the data in them that systems can take advantage of to speed up reading and querying data even further. We generally recommend using Parquet files unless there is a compelling reason not to.

4.1.4 Arrow and Feather

Feather was originally developed as a prototype by Hadley Wickham and Wes McKinney to implement the Arrow format, and demonstrate rapid sharing of data between R and Python. It was not exactly “Arrow” as it did not support all data types, but it did demonstrate the value of having a common data format that could be read across languages efficiently and with high fidelity.

The Arrow inter-process-communication (IPC) format—that is, the bits in memory—can also be written to disk. (At one point, this bore the “Feather V2” label, but that has since been dropped.) There are some circumstances where this is useful. Because the uncompressed[1] data is exactly the same as what is in memory, there is zero serialization cost to accessing it, and you can memory-map the file, allowing the software to read in chunks of it lazily.
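For example, here is a minimal sketch of writing a data frame to an IPC file and reading it back as an Arrow Table; we use the built-in mtcars data purely for illustration.

# Write a data frame to the Arrow IPC file format and read it back
# as an Arrow Table; reading a local file is memory-mapped by default
# (see the mmap argument of ?read_ipc_file)
write_ipc_file(mtcars, file.path(tmp_dir, "mtcars.arrow"))

read_ipc_file(
  file.path(tmp_dir, "mtcars.arrow"),
  as_data_frame = FALSE
)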

That said, the Arrow developer community generally does not recommend storing data in Arrow IPC format on disk.[2] For most applications, Parquet is a better choice for storage. Parquet files tend to be smaller on disk, which makes them faster to transfer over the network, and despite the need to decompress and decode them, they can still be faster to read into memory, because less data needs to be moved around. They’re also widely supported across many data systems.

Arrow? Feather? IPC? Parquet?

We frequently get questions and see confusion about these different formats. Given the history described above, it’s not surprising that the distinctions between Arrow, Feather, and IPC aren’t totally clear. So here are short summaries of each:

Arrow IPC: the Arrow in-memory format and layout, written to disk.

Feather: Version 1 was a precursor to Arrow and is now deprecated. It is very close to the Arrow format, though it doesn’t have all of the types. Version 2 is the same thing as Arrow IPC, though that name has fallen out of favor.

Parquet: A standard, open, modern, column-oriented data storage format.

“Feather” as a name will see less and less use since Version 1 is deprecated and the Arrow community has started using the name “Arrow IPC” instead.

As we recommend throughout this chapter: if you’re looking for a format to store data in today, pick Parquet unless you have good and specific technical reasons for needing Arrow IPC.

4.1.5 Summary

Table 4.2 summarizes the features and tradeoffs of the file formats that arrow works with.

Table 4.2: Comparison of formats supported by Arrow
(The table compares CSV, NDJSON, Parquet, and the Arrow IPC format on the following dimensions: whether Arrow can read the format, whether Arrow can write it, human readability, read/write speed, whether the schema is included in the file, support for additional metadata, nested structures, compression, high-fidelity type preservation, and how widely the format is adopted.)

Now that we’ve surveyed the file formats, we’re going to discuss how to read in CSVs effectively since they are both widespread and potentially challenging. We’ll also demonstrate why exactly Parquet files are beneficial in terms of file size, speed, and fidelity, and show you how to convert your CSV data into Parquet files.

4.2 Reading CSVs

Let’s explore the tools the arrow R package provides to help us work with CSVs. Because they don’t contain all of the information you need to understand the data, and they can vary greatly in structure, we’ll need to get familiar with the range of options you can provide to the reader functions. We’re going to be looking at a subset of the PUMS dataset, which contains fewer columns than the full dataset.

4.2.1 The basics

The functions in arrow for reading delimited file formats are:

  • read_csv_arrow()
  • read_csv2_arrow() (for semicolon-delimited data)
  • read_tsv_arrow() (tab-delimited)
  • read_delim_arrow(), the general purpose function

If you’ve used the readr package, you should find these familiar to use, as many of the argument names match those in readr::read_delim(). For example, if you want to specify that the string “none” represents an NA value, and to read in only particular columns of the data, you can use the na and col_select arguments.

read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv",
  na = "none",
  col_select = c("AGEP", "SCHL")
)

See the docs for ?read_csv_arrow() for the full list of options supported.

When you use one of these functions to read a file, it will be read in as an Arrow Table, the Arrow equivalent of a data frame, and then converted to a tibble automatically. If instead you want the data to remain as an Arrow Table, you should set the as_data_frame argument to FALSE.

# Read in the CSV as an Arrow Table
read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv",
  as_data_frame = FALSE
)

You can do all of the same dplyr operations on Tables as you can on Datasets. The difference is that with Tables, the data is already in memory.
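For example, assuming dplyr is loaded, a query on a Table read in with as_data_frame = FALSE (assigned here to a name of our choosing, pums_table) looks just like one on a data frame; collect() returns the result as a tibble.

# Read the CSV as an Arrow Table and run a dplyr query on it in memory
pums_table <- read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv",
  as_data_frame = FALSE
)

pums_table |>
  filter(AGEP >= 18) |>
  count(SCHL) |>
  collect()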

In this book, we’re focusing on examples using Datasets because if your data can already fit into memory, you have many options for how to analyze your data. Arrow is great, but if you are just trying to analyze data that fits into memory, it’s probably unnecessary.

That said, there are some cases where it is useful to read a file in as a Table. We’ll highlight a few of them below.

4.2.2 Type inference and schemas

Because CSVs do not contain information about the types of the columns of data in them, when reading one in, you either specify the types explicitly, as in the col_types argument to the readr functions, or you let the CSV reader try to infer the types by inspecting the first chunk of data. Often, the type inference is just fine.

Sometimes, however, it can be wildly wrong. When arrow is working out the schema of a column, it guesses the simplest possible type that can describe all the values. The problem with relying on schema inference like this is that if you have a dataset where the first few rows of data are not representative of the rest of the dataset, you may end up with an incorrect schema.

For example, if arrow sees the values 1, 2, and 3, it might guess that these are integer values, but if arrow sees the values 1, 2, and 3.1, then it will see that the column cannot be integers and must be made of doubles. You can see how this could cause an inference problem: if the reader infers that a column is integers, but then decimal values appear later, they won’t be read correctly.

Another problem is that even if the data is representative, the guess might be wrong. Classic examples of this are things like phone numbers or US zip codes. They are made up entirely of digits, but schema inference may treat them as integers and strip any leading 0, instead of the correct approach of treating them as strings, and so when you work with the dataset later, you may get errors or incorrect results.

In our dataset, the PUMA variable (short for Public Use Microdata Area) represents micro-level geographic areas. PUMA codes are similar to US zip codes: they are 5-digit numbers that may include a leading zero.

You can see in the first few rows below that the PUMA values are "00101", "00101", "00102", and "00200".

readLines("./data/raw_csvs/person/2021/ak/psam_p02.csv", n = 5)
[1] "RT,SERIALNO,DIVISION,SPORDER,PUMA,REGION,ST,ADJINC,PWGTP,AGEP,CIT,C...
[2] "P,2021GQ0000239,9,01,00101,4,02,1029928,38,21,1,,5,2,2,2,2,2,,2,2,,...
[3] "P,2021GQ0000288,9,01,00101,4,02,1029928,151,21,1,,5,2,2,2,2,1,,2,2,...
[4] "P,2021GQ0000315,9,01,00102,4,02,1029928,69,91,1,,,1,2,1,1,1,,2,1,1,...
[5] "P,2021GQ0000499,9,01,00200,4,02,1029928,16,30,1,,5,2,2,2,2,2,,,2,,,...

However, if we read in our CSV subset without providing a schema, despite the leading zeroes being present in the data, the data type of this column is inferred to be an integer.

pums_csv_one_state <- read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv"
)

head(schema(pums_csv_one_state))
Schema
RT: string
SERIALNO: string
DIVISION: int32
SPORDER: int32
PUMA: int32
REGION: int32

This is incorrect, and could cause us problems later, for instance, if we wanted to use this column to join to another dataset which correctly records PUMA as a string. If we were to convert this to a string, we’d have to make sure that the values were correctly padded with leading zeroes—that is, we would have to convert the integer 101 to the string "00101". This is also unnecessary extra processing: the original data is already a string, and we’d be converting from string to integer and then back to string again.

Instead of doing this though, we can specify the data types before we read in the data, using the schema argument.

# Read CSV file and specify a schema
read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv",
  schema = schema(
    location = string(),
    year = int32(),
    AGEP = int32(),
    SCHL = string(),
    PUMA = string(),
    JWMNP = int32(),
    JWTR = string(),
    WKWN = int32(),
    HICOV = bool(),
    WRK = bool()
  ),
  skip = 1
)

Note that we also added skip = 1. As with other CSV readers, if the file contains a header row with column names, but you provide them yourself, you need to tell the reader to skip the header row rather than try to read it in as data. By providing the schema, we’re defining both the names and types, so we likewise need to tell the reader to skip the first row.

Another thing to note is that we have specified the int32 type for the integers. If you want control over the precision of the integers or floating-point numbers, you can’t rely on the type inference, because it only chooses the largest size: int64 and float64. If you know the range of integer values in the data is small, for example, and you wanted to save on memory, you would want to be able to specify a smaller type, like int8. Or, as we have done here, if you expected to pull a large amount of data into R, you might want int32 because integers in R are 32-bit. This is not required: unlike with the PUMA variable, the integers will be parsed correctly either way; they will just take up more memory than they need and require a small amount of translation to step down from int64 to int32. Sometimes, that savings will be meaningful.

But what if you’re working with a file with a lot of columns—do you really have to type this all out by hand? Absolutely not!

One option is to supply column types for a subset of columns, by providing a partial schema using the col_types argument.

# Read CSV file and specify a partial schema
read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv",
  col_types = schema(PUMA = string())
)

Because we only supplied a partial schema, Arrow still needs to read the headers of the other columns and infer their types, and so in this case we don’t need to pass in skip = 1 like we did when we passed in the full schema.

Another option is to take the schema which is inferred by arrow and manually update it yourself before reading in the file again. You can extract the schema using the schema() function.

# Read in the file as an Arrow Table and get its schema
pums_csv_one_state <- read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv"
)

one_state_schema <- schema(pums_csv_one_state)

Schemas can be interacted with like lists in R, and so if we want to look more closely at one of these column name to data type mappings, we can extract it using the [[ or $ operator.

one_state_schema[["PUMA"]]
Field
PUMA: int32

We can also assign in different types, and then provide the modified schema when re-reading the file to use those types.

# Update the data types of the PUMA field
one_state_schema[["PUMA"]] <- string()

# Read in the file, using the updated schema
read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv",
  schema = one_state_schema,
  skip = 1
)

What if you want to save the schema in a script, instead of loading it and updating it all the time? You can use the code() method to extract the code from the schema, which you can then copy and paste into your script. We subset the schema with [c("RT", "SERIALNO", "DIVISION", "SPORDER", "PUMA", "REGION", "ST")] only for clarity and space savings in the book. In real use, call one_state_schema$code() to use the whole schema.

# Print the code needed to create this schema
subset_cols <- c("RT", "SERIALNO", "DIVISION", "SPORDER", 
                 "PUMA", "REGION", "ST")

one_state_schema[subset_cols]$code()
schema(RT = utf8(), SERIALNO = utf8(), DIVISION = int32(), SPORDER = int32(), 
    PUMA = utf8(), REGION = int32(), ST = int32())

The last bit of schema manipulation which is useful to know about is how to add or remove items from a schema.

You can add a new item by assigning a data type to a new item in the schema.

one_state_schema[["new_var"]] <- float64()

subset_cols <- c("RT", "SERIALNO", "DIVISION", "SPORDER",
                 "PUMA", "REGION", "ST", "new_var")

one_state_schema[subset_cols]
Schema
RT: string
SERIALNO: string
DIVISION: int32
SPORDER: int32
PUMA: string
REGION: int32
ST: int32
new_var: double

To remove items from a schema or select a subset, you can subset the schema with a vector of the names of the fields you want to keep, which returns a new schema containing just those fields.

to_keep <- c("RT", "SERIALNO", "DIVISION", "SPORDER", "PUMA",
             "REGION", "ST", "AGEP", "new_var")
one_state_schema_mini <- one_state_schema[to_keep]
one_state_schema_mini
Schema
RT: string
SERIALNO: string
DIVISION: int32
SPORDER: int32
PUMA: string
REGION: int32
ST: int32
AGEP: int32
new_var: double

Alternatively, to just remove items, you can assign the value NULL to them, again, just as you could with a list.

one_state_schema_mini[["new_var"]] <- NULL
one_state_schema_mini
Schema
RT: string
SERIALNO: string
DIVISION: int32
SPORDER: int32
PUMA: string
REGION: int32
ST: int32
AGEP: int32

4.2.3 Writing CSVs

You can write CSV files to disk using write_csv_arrow().

# Save `pums_subset` to CSV file
write_csv_arrow(pums_subset, file.path(tmp_dir, "new_pums.csv"))

If you’re working with large CSV files, you may want to compress them to reduce the amount of space they take up on disk. Arrow automatically detects compression from the file name you give to either the CSV reader or writer, so you can alter the file path to end in .gz to save a compressed CSV file.

# Write to a gz compressed file
write_csv_arrow(
  pums_subset,
  file.path(tmp_dir, "pums_subset_compressed.csv.gz")
)

# Read from a gz compressed file
read_csv_arrow(
  file.path(tmp_dir, "pums_subset_compressed.csv.gz")
)

Compressed CSVs take up significantly less space on disk, but are slower to read and write. Let’s look at an example from the PUMS dataset, using the data for Washington in 2021. We can load in the original data, and then save it to disk, first as an uncompressed CSV and then as a compressed CSV.

# Read in the data
washington_2021 <- read_parquet(
  "./data/person/year=2021/location=wa/part-0.parquet"
)

write_csv_arrow(washington_2021, file.path(tmp_dir, "washington_2021.csv"))
write_csv_arrow(washington_2021, file.path(tmp_dir, "washington_2021.csv.gz"))

The uncompressed CSV is pretty big: 137MB. However, after the CSV has been compressed, it shrinks down to 19.6MB.

It’s still much larger than a Parquet file with the same data, but not quite as dramatically so.

Although compressing CSVs can lead to much smaller file sizes, they still have issues around data type consistency and read/write speed. On top of that, a compressed CSV loses the one property of CSVs that binary formats don’t have: it isn’t human-readable without doing some extra work first. In the next section, we’ll take a look at the Parquet format and discuss how it can bring serious improvements to analytics workflows.

4.3 Parquet

As discussed above, Parquet is an open-source standard columnar format supported by arrow and many other big data products. It has many features that make it superior to CSV and other formats.

4.3.1 Schema metadata

One of the features of this format is that the schema metadata is saved alongside the data. This means that when you are sharing data between people or programs, there is less chance of the data being misinterpreted when it’s read in.

Say we have data like below:

numeric_data <- tibble::tibble(
  integers = 1:10,
  doubles = as.numeric(1:10),
  strings = sprintf("%02d", 1:10)
)
numeric_data

In the example below, we create an Arrow Table from this data, with a schema specifying a column of integers, a column of doubles, and a column of strings. We then write it to disk in both CSV and Parquet format.

numeric_data_schema <- schema(
  integers = int32(),
  doubles = float64(),
  strings = string()
)

numeric_table <- arrow_table(
  numeric_data,
  schema = numeric_data_schema
)

write_csv_arrow(
  numeric_table,
  file.path(tmp_dir, "numeric.csv")
)

write_parquet(
  numeric_table,
  file.path(tmp_dir, "numeric.parquet")
)

If we read in the CSV file without providing any information about the schema, arrow has to guess the schema, and it incorrectly infers that all of the columns are integers.

read_csv_arrow(file.path(tmp_dir, "numeric.csv"))

However, if we read in the Parquet file, the “doubles” column is correctly identified as containing doubles, and the “strings” column is correctly identified as containing strings, with the leading zeroes preserved.

read_parquet(file.path(tmp_dir, "numeric.parquet"))

The inclusion of the schema metadata in the Parquet files means that these files can easily be shared with less chance of errors from having to guess data types.
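If you want to confirm what types a Parquet file will come back with, one simple approach is to read it back as an Arrow Table and inspect its schema, as sketched below.

# Read the Parquet file back as an Arrow Table and inspect its schema
numeric_parquet <- read_parquet(
  file.path(tmp_dir, "numeric.parquet"),
  as_data_frame = FALSE
)
schema(numeric_parquet)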

4.3.2 Preservation of other metadata

Another advantage of Parquet is that it is capable of saving other format-specific metadata. Say you have an SPSS file—in the example below, we’ll use one which is included with the haven package.

library(haven)

path <- system.file("examples", "iris.sav", package = "haven")
iris_spss <- read_sav(path)
iris_spss

The Species column in this version of the iris dataset is a haven labelled column, a format from SPSS which includes both the value and the label. If we were to write this data to a CSV and then read it back into R, we would lose the data type, and it would get converted to the closest equivalent.

library(readr)
tf <- tempfile()
write_csv(iris_spss, tf)
read_csv(tf)
Rows: 150 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (5): Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

However, writing to Parquet and then reading back into R preserves the original data type.

tf <- tempfile()
write_parquet(iris_spss, tf)
read_parquet(tf)

This also works for custom classes in R.

Let’s take a look at an example of a custom R class for storing fractions. We can define the class and a print method for our fraction class.

library(vctrs)

Attaching package: 'vctrs'
The following object is masked from 'package:dplyr':

    data_frame
The following objects are masked from 'package:arrow':

    field, list_of
# define function to create fraction object
fraction <- function(n = integer(), d = integer()) {

  # ensure inputs are correct types and lengths
  vcc <- vec_cast_common(n, d, .to = integer())
  vrc <- vec_recycle_common(vcc[[1]], vcc[[2]])

  new_rcrd(
    fields = list(numerator = vrc[[1]], denominator = vrc[[2]]),
    class = "fraction"
  )

}

# define a format method for the object
# so we can control how it's printed
format.fraction <- function(x, ...) {
  paste0(field(x, "numerator"), "/", field(x, "denominator"))
}
registerS3method("format", "fraction", format.fraction)

# create and view a fraction object
x <- fraction(1, 5)
x
<fraction[1]>
[1] 1/5

This fraction class is a custom class, so it doesn’t exist in Arrow or Parquet. However, we can still write this data to Parquet and then back into R, and the type information will be preserved. Let’s first set up a tibble containing the values we want to save.

frac_df <- tibble::tibble(
  total = c(2, 2, 7, 9)
)

frac_df$fraction <- fraction(
  frac_df$total,
  sum(frac_df$total)
)

frac_df

If we try to save this data to a CSV, we get an error.

tf <- tempfile()
write_csv(frac_df, tf)
Error in `as.character()`:
! Can't convert `x` <fraction> to <character>.

We can, however, save this to Parquet and then read it back into the R session, and the custom class type is retained.

tf <- tempfile()
write_parquet(frac_df, tf)
read_parquet(tf)

This preservation of custom attributes and classes is helpful because it means that we can use custom R packages and save to Parquet files, without having to implement every single custom class as a specific Parquet type in the code that reads and writes Parquet. One thing to note is that this preservation of custom attributes works automatically when you write a Parquet file from R and then read it back into R, but it won’t work automatically when writing from R and then reading in another language like Python. The custom metadata is stored in a separate location for each language implementation, so when reading data into Python that has a custom class created in R, Arrow won’t automatically create that same class. However, you still have access to that metadata, under the metadata and then its r entry, if there are details you need to access from Python.
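As a sketch of where that metadata lives, we can read the file back as an Arrow Table and look at the names of its metadata entries; the r entry is where arrow stores the serialized R attributes.

# The R-specific attributes are stored in the Table metadata under "r"
frac_table <- read_parquet(tf, as_data_frame = FALSE)
names(frac_table$metadata)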

4.3.3 File structure and parallelisation

Another feature of Parquet is that its files are divided into smaller pieces.

Parquet file components

A Parquet file is made up of groups of rows, known as row groups. Each of these is divided up into subsections of each column: column chunks. These column chunks are then further divided into pages. Earlier, we mentioned that the schema is stored with the data, but there’s a lot more metadata in a Parquet file than this. There’s also metadata on all of the subcomponents of a Parquet file. This metadata contains information such as how many rows are in that particular subcomponent, and summary statistics about the data contained inside. You can read more about this in the Parquet docs.

This micro-level metadata means that software designed to work with Parquet, like Arrow, can use it to work out which data to read in and which data to skip; combined with a structure that makes parallel reads easier, this can dramatically speed up read time.
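As a sketch, you can peek at this structure from R with the lower-level ParquetFileReader class; the counts you see will depend on how the file was written.

# Inspect a Parquet file's structure without reading the data itself
pq <- ParquetFileReader$create(
  "./data/person/year=2021/location=wa/part-0.parquet"
)
pq$num_rows        # total number of rows in the file
pq$num_row_groups  # how many row groups the file contains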

4.3.4 Columnar format and selective reading

Parquet is a columnar format, which means that data in the same column is stored in adjacent areas of memory. Although row-based format readers typically allow you to specify which columns to read in, this has computational overhead, as the entire row must be read in and the unwanted columns discarded. Parquet’s columnar format allows you to read in only the columns you need, and so is much faster when you only require a subset of the data. We’ll show some examples that illustrate this efficiency gain in Chapter 6.
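For example, reading just two columns from the Washington 2021 Parquet file only touches those columns' chunks on disk.

# Read only the columns we need from a Parquet file
read_parquet(
  "./data/person/year=2021/location=wa/part-0.parquet",
  col_select = c("AGEP", "SCHL")
)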

4.3.5 Dictionary Encoding

Parquet’s columnar format allows it to take advantage of encoding strategies that can help reduce size on disk. One of these strategies is dictionary encoding: instead of representing every single value in a column, Parquet will generally save the values as key-value pairs, similar to factors. This encoding usually makes the space needed to store string columns much smaller, but there are times when dictionary encoding does not help, and in that case Parquet falls back to plain encoding. This fallback happens if the size of the dictionary grows too large, which could happen if there are a large number of distinct values or if each string is long. Note that this is different from Arrow’s DictionaryType data type: Parquet’s dictionary encoding can be applied to any column type. With arrow, Parquet’s dictionary encoding happens entirely under the hood, so you don’t have to decide what is the ideal encoding to store this as—it just happens, unless you actively disable it.

We can see its effect by comparing the size of the same Parquet file with and without dictionary encoding. First, let’s look at the file size of the Parquet file for Washington in 2021, which was saved with the default setting of using dictionary encoding.

wa_21_path <- "./data/person/year=2021/location=wa/part-0.parquet"
fs::file_size(wa_21_path)
12.2M

And now let’s check out the size of that file if it’s saved without dictionary encoding.

read_parquet(wa_21_path, as_data_frame = FALSE) |>
  write_parquet(
    file.path(tmp_dir, "pums_subset_no_dict.parquet"),
    use_dictionary = FALSE
  )

fs::file_size(file.path(tmp_dir, "pums_subset_no_dict.parquet"))
22.7M

The difference is significant; the file is 1.86 times larger! This isn’t the only type of encoding that Parquet uses though; let’s take a look next at run-length encoding.

4.3.6 Run-length encoding

Writing data to disk with dictionary encoding can help shrink the size of some columns, but there are other aspects of the data that can be taken advantage of. It’s not uncommon to find repeated values in the columns of our datasets, and another option for reducing storage space is to use run-length encoding to record the value and how many times it occurs. Let’s take a look at the MAR column in the PUMS person dataset subset, representing marital status of respondents.

pums_subset_mar <- read_parquet(wa_21_path, col_select = MAR)

count(pums_subset_mar, MAR)

There are 5 different possible values in this column.

In the pseudocode below, we show two different ways we could represent the value “Married” occurring 5 times in a row.

["Married ", "Married", "Married", "Married", "Married"]
["Married", 5]

In the first version, we store the same value five times, whereas in the second version we store it just once, along with the number of times it occurs. The second option takes fewer characters to represent: this is how run-length encoding works. Run-length encoding is built into Parquet, and we can’t turn it off to compare, but what we can do is reorder our data by the MAR column, so that Parquet can take the most advantage of the column ordering, and see what impact that has on the file size.

Let’s first take a look at the size of the file containing just the MAR column in its original order.

unordered_path <- file.path(
  tmp_dir,
  "pums_subset_mar_unordered.parquet"
)

# Write unordered data to a file
write_parquet(pums_subset_mar, unordered_path)

# Get the size of this file
fs::file_size(unordered_path)
29.4K

And now, after rearranging the data by the MAR column.

ordered_path <- file.path(
  tmp_dir,
  "pums_subset_mar_ordered.parquet"
)

# Order data by the `MAR` column
pums_subset_mar_ordered <- pums_subset_mar |>
  arrange(MAR)

# Save this ordered data to a file
write_parquet(pums_subset_mar_ordered, ordered_path)

# Get the size of this file
fs::file_size(ordered_path)
589

The file is now approximately 51.12 times smaller! The effects of run-length encoding are especially noticeable when character columns contain fewer possible values, or when the data naturally contains repetitions.

4.3.7 Compression

Another feature of Parquet is built-in support for compressing data using one of several compression codecs. Generally speaking, all of these codecs reduce the amount of space a Parquet file takes up on disk; the main differences between them are how much smaller the resulting file is and how fast the file is to compress and decompress. Historically, these codecs have traded off size against speed: codecs with a higher compression ratio, and thus smaller files, tend to produce files that are slower to access than those with a lower compression ratio. More recently, however, benchmarks have shown that the “zstd” codec manages to balance both aspects of performance.

The codecs available depend on your Arrow C++ build. You can use arrow::arrow_info() to see a full list of codecs and other capabilities available in your installed version.
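To pick a codec yourself, pass the compression argument (and optionally compression_level) to write_parquet(); the sketch below assumes “zstd” is available in your build.

# Re-write the Washington 2021 data using the zstd codec
read_parquet(wa_21_path, as_data_frame = FALSE) |>
  write_parquet(
    file.path(tmp_dir, "washington_2021_zstd.parquet"),
    compression = "zstd"
  )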

Let’s take a look at the degree of difference between an uncompressed CSV file, an uncompressed Parquet file, a Parquet file compressed with “snappy”, the default on Linux, and one compressed with “zstd”, one of the current top-performing codecs.

We wrote the files to disk, and then read in just the “AGEP”, “COW”, and “year” columns. The results can be seen in Table 4.3.

Table 4.3: Comparison of file size, read and write times for CSV and Parquet
Format                  Size (MB)  Write time (s)  Read time (s)
CSV                     1800       24.052          10.02
Parquet - uncompressed  37.1       14.89           9.215
Parquet - snappy        35.8       15.378          10.051
Parquet - zstd          30.7       15.203          9.942

Ultimately, the different Parquet codecs result in similar read and write performance for the data here, though this can vary massively based on the data being written and read. The key points to take away here are:

  • you can save a huge amount of space switching from CSV to Parquet
  • the default Parquet compression codec is more than adequate in the vast majority of cases
  • if you need an extra degree of control, it’s worth experimenting with the alternative codecs to see what works best with your data

4.3.8 Nested structures

Parquet also has the ability to work with nested structures; the R version of this is list-columns. Using Parquet, we can write list-columns to Parquet files and then read them back into R without having to do any additional manual processing.

nested_data <- tibble::tibble(
  species = c("cat", "dog", "bird"),
  sounds = list(
    c("meow", "purr", "hiss"),
    c("woof", "bark", "growl"),
    c("tweet", "squawk")
  )
)
write_parquet(nested_data, "./data/transient_data/nested_data.parquet")

read_parquet("./data/transient_data/nested_data.parquet")

If we were using CSV format, we would have to flatten the list-columns before saving, and convert them back into a hierarchical structure after loading the file.

4.4 Example workflow: cleaning the PUMS dataset

Now that we’ve reviewed the different file formats and options, let’s explore a real-world scenario where you would want to use arrow to manipulate individual files.

As is widely stated, the majority of data science work is actually data cleaning. In order to get to the point where we have a nice dataset for analysis, we have to do some work on the raw data. And when data is already spread across multiple files, it is commonplace to have to do transformations to align the files.

The general workflow we’ll examine is:

  1. tab <- read_csv_arrow(file, as_data_frame = FALSE)

  2. Do dplyr transformations on the Arrow Table

  3. write_parquet(tab, new_file)

That is, we’ll keep the data in Arrow memory and never pull it into an R data frame. This saves processing time, and it allows us to avoid type conversions between R and Arrow that might lose fidelity.

Let’s look at the PUMS dataset. It is a survey, and it has some features commonly found in surveys that need to be cleaned up. For one, the actual row-level data in the survey is encoded based on the type of data it represents.[3] For columns that are numeric, the data is what is in the data file, but characters, factors, and missing values are recorded as integers. These integers must be mapped to values in the codebook, which is distributed with the data and acts as a lookup for the actual value represented by each integer in the dataset.

For another, the survey questions have changed over time, so data from one year may not line up perfectly with data from other years without some adjustment. Many columns in this dataset require transformation to create a nice consistent dataset to work with.

Let’s start with the COW variable, which represents class of worker, categorizing people based on their employment status. Using the metadata distributed for PUMS recoding—in the survey world, this is frequently referred to as the codebook—we can see that COW can take the values 0–9 with the corresponding meaning for the year 2005:

codebook |>
  filter(name == "COW")

Now let’s take a look at the data for the first few rows for the year 2005 and Washington state. First, we read the data in from a single CSV as an Arrow Table, and then we show the first few values for COW.

wa_2005 <- read_csv_arrow(
  "data/raw_csvs/person/2005/wa/ss05pwa.csv",
  as_data_frame = FALSE
)
wa_2005 |>
  select(COW) |>
  head() |>
  collect()

We see values like 1, 6, 1, 3, etc. We ultimately want these to match up with the labels “Private for profit”, “Self-employed not incorporated”, “Private for profit”, “Local government”, etc. One way we can do this is by using a join, matching on the value codes in the codebook to pull in the value labels:

wa_2005 |>
  left_join(
    filter(codebook, name == "COW"),
    by = c("COW" = "values_code")
  ) |>
  mutate(COW_label = values_label, .keep = "unused") |>
  select(sporder, SERIALNO, PUMA, ST, PWGTP, COW, COW_label) |>
  head() |>
  collect()

Above, we kept both COW and COW_label to illustrate the mapping, but we could instead replace COW directly, like so:

wa_2005 |>
  left_join(
    filter(codebook, name == "COW"),
    by = c("COW" = "values_code")
  ) |>
  mutate(COW = values_label, .keep = "unused") |>
  select(sporder, SERIALNO, PUMA, ST, PWGTP, COW, AGEP) |>
  head() |>
  collect()

We can do this with multiple variables at once, and when we’re happy with the result, we can skip the collect() step and just pipe the transformed data to write_parquet():

wa_2005 |>
  left_join(
    filter(codebook, name == "COW"),
    by = c("COW" = "values_code")
  ) |>
  mutate(COW = values_label, .keep = "unused") |>
  left_join(
    filter(codebook, name == "JWTR"),
    by = c("JWTR" = "values_code")
  ) |>
  mutate(JWTR = values_label, .keep = "unused") |>
  write_parquet("wa_2005.parquet")

This will take the Table, which was read from the CSVs, and write out a Parquet file without ever holding the data in an R data frame.

From here, we could recode each of the columns using the codebook and write out cleaned Parquet files: we read the data into Arrow Tables and then use Arrow’s data processing abilities to transform it from the raw, survey-specific format into one that is more natural to work with.
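As a sketch of how this generalizes, the hypothetical helper below wraps the join-based recode so it can be applied to one variable after another; it assumes the codebook has the name, values_code, and values_label columns used above.

# Hypothetical helper: recode one variable using the codebook
recode_from_codebook <- function(data, variable, codebook) {
  # Keep only the codebook rows and columns needed for this variable
  lookup <- codebook |>
    filter(name == variable) |>
    select(values_code, values_label)

  # Join on the coded values and swap in the labels
  data |>
    left_join(lookup, by = setNames("values_code", variable)) |>
    mutate(!!variable := values_label, .keep = "unused")
}

wa_2005 |>
  recode_from_codebook("COW", codebook) |>
  recode_from_codebook("JWTR", codebook) |>
  write_parquet(file.path(tmp_dir, "wa_2005_recoded.parquet"))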

For this book, we’ve gone ahead and processed all of the data for PUMS from 2005 to 2022, both the person-level and the household-level data. This is, of course, slightly more involved than what we show here, since there are hundreds of variables to join and there are subtle changes to the schema of the raw data from one year to the next. But the concept and approach are the same. If you’re curious to learn more, see Section A.2.1.

4.5 Summary

In this chapter, we examined the different file formats which are supported by Arrow and compared their advantages and disadvantages, as well as discussing how they work. If you’re looking for a TL;DR about what format to store your data in, as of the writing of this book we recommend Parquet. There are very specific other circumstances where Arrow IPC or CSV might be required, but the benefits of Parquet make it the right choice in the vast majority of circumstances.

We took a look at functions for reading and writing individual files into memory as Arrow Tables. In the next chapter, we’ll take a look at Arrow Datasets: a powerful concept for working with datasets split across multiple files.


  [1] The Arrow IPC file format also supports compression, though if you use it, it’s no longer the case that there’s no serialization cost, and you can’t memory-map the file.

  [2] https://arrow.apache.org/faq/

  [3] Exploring the reasons for this convention is outside the scope of this book, but it is partly because, historically, folks working with surveys used tools and frameworks that made working with large amounts of non-numeric data difficult. However, with modern data tools like arrow, we can effectively and efficiently work with data of all types!