4  Files and Formats

Bigger Data, Easier Workflows

Authors: Nic Crane, Jonathan Keane, and Neal Richardson

In the previous chapter, we started exploring a dataset made up of CSV files, and then skipped ahead to working with a version we’d converted to Parquet files. While CSVs are ubiquitous, Parquet files have a lot of nice features that make them ideal for analytic workloads.

In this chapter, we’ll explore the different file formats that Arrow can read and write, focusing particularly on the tools you need to ingest data from CSVs and store them in Parquet files with high fidelity.

We’ll start off by creating a directory to work with temporary files created in this chapter.

tmp_dir <- "./data/transient_data"
dir.create(tmp_dir)

4.1 Overview of supported formats

Arrow supports reading and writing files in multiple formats:

  • CSV and other text-delimited formats
  • newline-delimited JSON
  • Parquet
  • the Arrow format itself

The functions for reading and writing individual files in arrow are shown in Table 4.1.

Table 4.1: Functions for reading and writing individual files

Format                                Reading                Writing
CSV or other delimited file           read_delim_arrow()     write_csv_arrow()
                                      read_csv_arrow()
                                      read_csv2_arrow()
                                      read_tsv_arrow()
Newline-delimited JSON                read_json_arrow()      (Not supported)
Parquet                               read_parquet()         write_parquet()
Arrow IPC Format (formerly Feather)   read_ipc_file()        write_ipc_file()
                                      read_feather()         write_feather()

Let’s briefly review the file formats.

4.1.1 CSV

A CSV is one of the simplest and most widely-known formats for data storage and sharing. It is a plain-text file in which each line is a row of data, and values are separated into columns by commas (hence “comma-separated values”). CSV is the most common of a family of text-delimited formats; any character can be used as a delimiter, such as tab (as in TSV) or semicolon.

The CSV’s main strength is that everyone and everything can read it. However, CSVs have some serious limitations. Most importantly, CSVs do not contain all of the information you need to read in the data faithfully. Because the data is encoded as text, and because there is no standard way to indicate what the data types are for each column, you either have to specify types yourself, or allow the library to attempt to guess the types. Essentially, CSVs are a lossy file format: when you write out data to a CSV and read it back in, features of the data are lost because they are not contained in the file.
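
As a small illustration of this lossiness, here is a sketch of a round trip through CSV: a factor column comes back as a plain string column, because the CSV has nowhere to record that type. (The data frame and file name here are just for illustration.)

# Write a data frame with a factor column to CSV...
df <- data.frame(maturity = factor(c("low", "medium", "high")))
write_csv_arrow(df, file.path(tmp_dir, "roundtrip.csv"))

# ...and read it back: the factor comes back as an ordinary string column
read_csv_arrow(file.path(tmp_dir, "roundtrip.csv"))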

The encoding of data as text has other costs as well. Because you have to parse strings to read numbers, and have to convert numbers to strings when writing, there’s a serialization cost relative to binary formats that can represent these types natively. Text is also inefficient for representing numbers: for example, the number -2147483647 takes up 88 bits as text (11 characters, 8 bits each), but stored in its numeric form it can be represented as a 32-bit integer.
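
A quick back-of-the-envelope check of that arithmetic in R:

# The text "-2147483647" is 11 one-byte characters, or 88 bits...
nchar("-2147483647") * 8
# ...while the same value stored as an int32 needs only 32 bits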

4.1.2 Newline-delimited JSON

Another format supported by Arrow is newline-delimited JSON (NDJSON), also known as JSON Lines (JSONL). It differs from other JSON formats in that each line is a complete JSON object, with all lines sharing a common schema. It is commonly used for streaming data.

NDJSON files look something like this:

{"species": "cat", "sounds": ["meow", "purr", "hiss"]}
{"species": "dog", "sounds": ["woof", "bark", "growl"]}
{"species": "bird", "sounds": ["tweet", "squawk"]}

One advantage of NDJSON is that it is both human-readable and supports nested data structures. JSON also has a slightly richer type system than CSV: for example, true is parsed as a boolean value and not a string. However, it supports only a limited set of types (strings, numbers, booleans, arrays, and objects), and it is still a text-based format, so it requires parsing and is large on disk, typically larger than a comparable CSV because the column names are repeated in each row.

# Write a small example NDJSON file to disk
writeLines(
  c(
    '{"species": "cat", "sounds": ["meow", "purr", "hiss"]}',
    '{"species": "dog", "sounds": ["woof", "bark", "growl"]}',
    '{"species": "bird", "sounds": ["tweet", "squawk"]}'
  ),
  file.path(tmp_dir, "animals.jsonl")
)

read_json_arrow(file.path(tmp_dir, "animals.jsonl"))

Arrow supports reading NDJSON files, but at the time of writing does not support writing them.
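
If you do need to write NDJSON from R, one option outside of arrow is jsonlite::stream_out(), which writes one JSON object per line. A minimal sketch, assuming the jsonlite package is installed:

library(jsonlite)

animals <- data.frame(species = c("cat", "dog", "bird"))

# stream_out() writes one JSON object per row, which is exactly NDJSON
stream_out(animals, file(file.path(tmp_dir, "animals_out.jsonl")))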

4.1.3 Parquet

Parquet is a standard, open format widely used in the big data ecosystem. It is a binary file format: Parquet files are designed to be read by computers rather than humans. This means that unlike CSVs, you can’t just open them in a text editor and inspect the contents. It is also a columnar format, meaning that data in each column is saved together, as opposed to CSVs and many other text-based formats, which are row based.

Parquet files have a number of features that make them small on disk, fast to read in, and optimal for analytic workloads. They support many compression algorithms and data encodings, and they can include statistics about the data in them that systems can take advantage of to speed up reading and querying data even further. We generally recommend using Parquet files unless there is a compelling reason not to.
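
For example, write_parquet() has a compression argument for choosing a codec. A minimal sketch ("zstd" here is just for illustration; which codecs are available depends on how your Arrow library was built):

# Write a small table as Parquet with an explicit compression codec
write_parquet(
  data.frame(x = 1:5, y = letters[1:5]),
  file.path(tmp_dir, "example.parquet"),
  compression = "zstd"
)

# Read it back; decompression happens transparently
read_parquet(file.path(tmp_dir, "example.parquet"))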

4.1.4 Arrow and Feather

Feather was originally developed as a prototype by Hadley Wickham and Wes McKinney to implement the Arrow format, and demonstrate rapid sharing of data between R and Python. It was not exactly “Arrow” as it did not support all data types, but it did demonstrate the value of having a common data format that could be read across languages efficiently and with high fidelity.

The Arrow inter-process-communication (IPC) format—that is, the bits in memory—can also be written to disk. (At one point, this bore the “Feather V2” label, but that has since been dropped.) There are some circumstances where this is useful. Because the uncompressed data is exactly the same as what is in memory, there is zero serialization cost to accessing it, and you can memory-map the file, allowing the software to read in chunks of it lazily.
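
As a quick sketch of writing and reading the IPC format with arrow (the file name is just for illustration):

# Write a data frame to disk in the Arrow IPC format...
write_ipc_file(
  data.frame(x = 1:5, y = letters[1:5]),
  file.path(tmp_dir, "example.arrow")
)

# ...and read it back, keeping it as an Arrow Table rather than a tibble
read_ipc_file(file.path(tmp_dir, "example.arrow"), as_data_frame = FALSE)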

That said, the Arrow developer community generally does not recommend storing data in Arrow IPC format on disk. For most applications, Parquet is a better choice for storage. Parquet files tend to be smaller on disk, which means they are faster to transfer over the network, and despite having to be decompressed and decoded, they can still be faster to read into memory, because less data needs to be moved around. They’re also widely supported across many data systems.

Arrow? Feather? IPC? Parquet?

We frequently get questions and see confusion about these different formats. Given the history we talk about above, it’s not surprising that the distinctions between Arrow, Feather, and IPC aren’t totally clear. So here are short summaries of each:

Arrow IPC: the Arrow in-memory format and layout, written to disk.

Feather: Version 1 was a precursor to Arrow and is now deprecated. It is very close to the Arrow format, though it doesn’t support all of the types. Version 2 is the same thing as Arrow IPC, though that name has fallen out of favor.

Parquet: A standard, open, modern, column-oriented data storage format.

“Feather” as a name will see less and less use since Version 1 is deprecated and the Arrow community has started using the name “Arrow IPC” instead.

As we recommend throughout this chapter: if you’re looking for a format to store data in today, pick Parquet unless you have good and specific technical reasons for needing Arrow IPC.

4.1.5 Summary

Table 4.2 summarizes the features and tradeoffs of the file formats that arrow works with.

Table 4.2: Comparison of formats supported by Arrow

                                  CSV   ND-JSON   Parquet   Arrow
Read using Arrow                  Yes   Yes       Yes       Yes
Write using Arrow                 Yes   No        Yes       Yes
Human-readable                    Yes   Yes       No        No
Fast read/write                   No    No        Yes       Yes
Includes schema                   No    No        Yes       Yes
Metadata                          No    No        Yes       Yes
Nested structures                 No    Yes       Yes       Yes
Compression                       No    No        Yes       Yes
High-fidelity type preservation   No    No        Yes       Yes
Widely adopted                    Yes   Yes       Yes       No

Now that we’ve surveyed the file formats, we’re going to discuss how to read in CSVs effectively since they are both widespread and potentially challenging. We’ll also demonstrate why exactly Parquet files are beneficial in terms of file size, speed, and fidelity, and show you how to convert your CSV data into Parquet files.

4.2 Reading CSVs

Let’s explore the tools the arrow R package provides to help us work with CSVs. Because they don’t contain all of the information you need to understand the data, and they can vary greatly in structure, we’ll need to get familiar with the range of options you can provide to the reader functions. We’re going to be looking at a subset of the PUMS dataset, which contains fewer columns than the full dataset.

4.2.1 The basics

The functions in arrow for reading delimited file formats are:

  • read_csv_arrow()
  • read_csv2_arrow() (for semicolon-delimited data)
  • read_tsv_arrow() (tab-delimited)
  • read_delim_arrow(), the general purpose function

If you’ve used the readr package, you should find these familiar to use, as many of the argument names match those in readr::read_delim(). For example, if you want to specify that the string “none” represents an NA value, and to read in only particular columns of the data, you can use the na and col_select arguments.

read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv",
  na = "none",
  col_select = c("AGEP", "SCHL")
)

See the docs for ?read_csv_arrow() for the full list of options supported.

When you use one of these functions to read a file, it will be read in as an Arrow Table, the Arrow equivalent of a data frame, and then converted to a tibble automatically. If instead you want the data to remain as an Arrow Table, you should set the as_data_frame argument to FALSE.

# Read in the CSV as an Arrow Table
read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv",
  as_data_frame = FALSE
)

You can do all of the same dplyr operations on Tables as you can on Datasets. The difference is that with Tables, the data is already in memory.
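
For example, here is a quick sketch of a dplyr pipeline run against the Table we just read in (AGEP and SCHL are columns in the PUMS data, as above):

library(dplyr)

pums_table <- read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv",
  as_data_frame = FALSE
)

# The same verbs we use on Datasets also work on an in-memory Table
pums_table |>
  filter(AGEP >= 18) |>
  count(SCHL) |>
  collect()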

In this book, we’re focusing on examples using Datasets because if your data already fits into memory, you have many options for how to analyze it. Arrow is great, but for data that fits into memory, it’s probably unnecessary.

That said, there are some cases where it is useful to read a file in as a Table. We’ll highlight a few of them below.

4.2.2 Type inference and schemas

Because CSVs do not contain information about the types of the columns of data in them, when reading one in, you either specify the types explicitly, as in the col_types argument to the readr functions, or you let the CSV reader try to infer the types by inspecting the first chunk of data. Often, the type inference is just fine.

Sometimes, however, it can be wildly wrong. When arrow is working out the schema of a column, it guesses the simplest possible type that can describe all the values. The problem with relying on schema inference like this is that if you have a dataset where the first few rows of data are not representative of the rest of the dataset, you may end up with an incorrect schema.

For example, if arrow sees the values 1, 2, and 3, it might guess that these are integer values, but if arrow sees the values 1, 2, and 3.1, then arrow will see that the column cannot be integers and must be made up of doubles. You can see how this could cause an inference problem: if the reader infers that a column contains integers, but decimal values appear later in the file, they won’t be read correctly.
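
We can see this inference in action with a couple of throwaway files (the file names here are just for illustration):

# A column containing only whole numbers is inferred as integers (int64)...
writeLines(c("x", "1", "2", "3"), file.path(tmp_dir, "whole_numbers.csv"))
whole_numbers <- read_csv_arrow(
  file.path(tmp_dir, "whole_numbers.csv"),
  as_data_frame = FALSE
)
schema(whole_numbers)

# ...but a single decimal value makes the whole column doubles (float64)
writeLines(c("x", "1", "2", "3.1"), file.path(tmp_dir, "decimals.csv"))
decimals <- read_csv_arrow(
  file.path(tmp_dir, "decimals.csv"),
  as_data_frame = FALSE
)
schema(decimals)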

Another problem is that even if the data is representative, the guess might be wrong. Classic examples of this are things like phone numbers or US zip codes. They are made up entirely of digits, but schema inference may treat them as integers and strip any leading zeros, when the correct approach is to treat them as strings. If that happens, you may get errors or incorrect results when you work with the data later.

In our dataset, the PUMA variable (short for Public Use Microdata Area) represents micro-level geographic areas. PUMA values are similar to US zip codes: they are 5-digit codes that may include a leading zero.

You can see that in the first few rows, the PUMA values are "00101", "00101", "00102", and "00200".

readLines("./data/raw_csvs/person/2021/ak/psam_p02.csv", n = 5)
[1] "RT,SERIALNO,DIVISION,SPORDER,PUMA,REGION,ST,ADJINC,PWGTP,AGEP,CIT,C...
[2] "P,2021GQ0000239,9,01,00101,4,02,1029928,38,21,1,,5,2,2,2,2,2,,2,2,,...
[3] "P,2021GQ0000288,9,01,00101,4,02,1029928,151,21,1,,5,2,2,2,2,1,,2,2,...
[4] "P,2021GQ0000315,9,01,00102,4,02,1029928,69,91,1,,,1,2,1,1,1,,2,1,1,...
[5] "P,2021GQ0000499,9,01,00200,4,02,1029928,16,30,1,,5,2,2,2,2,2,,,2,,,...

However, if we read in our CSV subset without providing a schema, despite the leading zeroes being present in the data, the data type of this column is inferred to be an integer.

pums_csv_one_state <- read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv"
)

head(schema(pums_csv_one_state))
Schema
RT: string
SERIALNO: string
DIVISION: int32
SPORDER: int32
PUMA: int32
REGION: int32

This is incorrect, and could cause us problems later, for instance, if we wanted to use this column to join to another dataset which correctly records PUMA as a string. If we were to convert this column to a string ourselves, we’d have to make sure that the values were correctly padded with leading zeroes: that is, we would have to convert the integer 101 to the string "00101". This is also extra, unnecessary processing: the original data is already a string, and we’d be converting from string to integer and then back to string again.
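
Just to show the workaround we would rather avoid, here is a sketch of re-padding the inferred integer using base R’s formatC():

# Pad the integer 101 back out to a 5-character, zero-padded string
formatC(101, width = 5, format = "d", flag = "0")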

Instead of doing this though, we can specify the data types before we read in the data, using the schema argument.

# Read CSV file and specify a schema
read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv",
  schema = schema(
    location = string(),
    year = int32(),
    AGEP = int32(),
    SCHL = string(),
    PUMA = string(),
    JWMNP = int32(),
    JWTR = string(),
    WKWN = int32(),
    HICOV = bool(),
    WRK = bool()
  ),
  skip = 1
)

Note that we also added skip = 1. As with other CSV readers, if the file contains a header row with column names, but you provide them yourself, you need to tell the reader to skip the header row rather than try to read it in as data. By providing the schema, we’re defining both the names and types, so we likewise need to tell the reader to skip the first row.

Another thing to note is that we have specified the int32 type for the integers. If you want control over the precision of the integers or floating-point numbers, you can’t rely on type inference, because it only chooses the largest sizes: int64 and float64. If you know the range of integer values in the data is small, for example, and you want to save memory, you can specify a smaller type, like int8. Or, as we have done here, if you expect to pull a large amount of data into R, you might want int32 because integers in R are 32-bit. This is not required: unlike with the PUMA variable, the integers will be parsed correctly either way, but they will take up more memory than they need and will require a small amount of translation to step down from int64 to int32. Sometimes, those savings will be meaningful.

But what if you’re working with a file with a lot of columns—do you really have to type this all out by hand? Absolutely not!

One option is to supply column types for a subset of columns, by providing a partial schema using the col_types argument.

# Read CSV file and specify a partial schema
read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv",
  col_types = schema(PUMA = string())
)

Because we only supplied a partial schema, Arrow still needs to read the headers of the other columns and infer their types, and so in this case we don’t need to pass in skip = 1 like we did when we passed in the full schema.

Another option is to take the schema which is inferred by arrow and manually update it yourself before reading in the file again. You can extract the schema using the schema() function.

# Read in the file as an Arrow Table and get its schema
pums_csv_one_state <- read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv"
)

one_state_schema <- schema(pums_csv_one_state)

Schemas can be interacted with like lists in R, and so if we want to look more closely at the mapping from column name to data type for a single column, we can extract it using the [[ or $ operator.

one_state_schema[["PUMA"]]
Field
PUMA: int32

We can also assign different types, and then provide the modified schema when re-reading the file to use those types.

# Update the data type of the PUMA field
one_state_schema[["PUMA"]] <- string()

# Read in the file, using the updated schema
read_csv_arrow(
  "./data/raw_csvs/person/2021/ak/psam_p02.csv",
  schema = one_state_schema,
  skip = 1
)