CSV vs. JSON Lines Files

The dataset builder creates two files:

  • A CSV file containing only metadata
  • A JSON Lines file containing metadata and the textual data

The textual data includes:

  • Unigrams
  • Bigrams
  • Trigrams
  • Full Text (where available)

The metadata includes:

  Column Name         Description
  id                  a unique item ID (in JSTOR, this is a stable URL)
  title               the title for the item
  isPartOf            the larger work that holds this title (for example, a journal title)
  publicationYear     the year of publication
  doi                 the digital object identifier for an item
  docType             the type of document (for example, article or book)
  provider            the source or provider of the dataset
  datePublished       the publication date in yyyy-mm-dd format
  issueNumber         the issue number for a journal publication
  volumeNumber        the volume number for a journal publication
  url                 a URL for the item and/or the item's metadata
  creator             the author or authors of the item
  publisher           the publisher for the item
  language            the language or languages of the item (eng is the ISO 639 code for English)
  pageStart           the first page number of the print version
  pageEnd             the last page number of the print version
  placeOfPublication  the city of the publisher
  wordCount           the number of words in the item
  pageCount           the number of print pages in the item
  outputFormat        what data is available (unigrams, bigrams, trigrams, and/or full-text)

All of the textual data and metadata are available inside the JSON Lines files, but we have chosen to offer the metadata CSV for two primary reasons:

  1. The JSON Lines data is more complex to parse because it is nested, so it cannot be easily represented in tabular form in a tool like Pandas. The CSV makes it easy to view all of the metadata in a Pandas dataframe (see the sketch after this list).
  2. The JSON Lines data can be very large. Each file contains all of the metadata plus unigram counts, bigram counts, trigram counts, and full text (when available). Manipulating all of that data takes significant compute time and cost. Even a modest dataset (~5,000 files) can be over 1 GB in size uncompressed.
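
As promised above, here is a minimal sketch of loading the metadata CSV into a Pandas dataframe. The filename sample-dataset.csv is a placeholder for whatever your downloaded file is called, and the columns shown come from the metadata table above:

    import pandas as pd

    # Load the metadata CSV into a dataframe (the filename is a placeholder)
    df = pd.read_csv("sample-dataset.csv")

    # Each metadata column from the table above becomes a dataframe column
    print(df[["id", "title", "publicationYear", "wordCount"]].head())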

We are still refining the structure of the dataset file. We anticipate adding additional “features” (such as named entity recognition) in the future. Please reach out to Ted Lawless (Ted.Lawless@ithaka.org) if you have comments or suggestions.

Data Structure

The CSV file is a comma-delimited, tabular file that can easily be viewed in Excel or loaded into Pandas.

The JSON Lines file (file extension ".jsonl") is served in a compressed gzip format (.gz). The data for each document in the corpus is written on a single line. (If there are 1,245 documents in the corpus, the JSON Lines file will be 1,245 lines long.) Each line is a JSON object made up of key/value pairs that map a key concept to a matching value.
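
Because each document sits on its own line, the file can be processed one document at a time without loading the whole dataset into memory. A minimal sketch in Python, assuming a placeholder filename of sample-dataset.jsonl.gz:

    import gzip
    import json

    # Read the gzip-compressed JSON Lines file one document per line
    with gzip.open("sample-dataset.jsonl.gz", "rt", encoding="utf-8") as f:
        for line in f:
            document = json.loads(line)  # each line is one JSON object
            print(document["title"])     # e.g. print each document's title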

The basic structure looks like:

"Key": Value

Instead of attempting to decode the structure of a single large line by eye, we can plug it into a JSON editor. The screenshot below was created using JSON Editor Online. The JSON editor reveals the file structure by breaking it down into a set of nested hierarchies, similar to XML. Each level can also be collapsed using arrows in a separate viewer pane within JSON Editor Online.
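
If you prefer to stay in Python rather than a browser tool, the standard library's json module can pretty-print a line in much the same way. A minimal sketch, again assuming a placeholder filename:

    import gzip
    import json

    # Parse the first line of the dataset and print it with indentation
    with gzip.open("sample-dataset.jsonl.gz", "rt", encoding="utf-8") as f:
        first_document = json.loads(f.readline())
    print(json.dumps(first_document, indent=2))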

[Screenshot: a single line from a JSON Lines dataset expressed as a nested hierarchy using JSON Editor Online]

The editor makes it easier for human readers to discern a portion of the metadata for the text. In the data above, we can see:

  • The title is "Shakespeare and the Middling Sort" ("title": "Shakespeare and the Middling Sort")
  • The author is "Theodore B. Leinwand" ("creators": ["Theodore B. Leinwand"])
  • The text is a journal article ("docType": "article")
  • The journal is Shakespeare Quarterly ("isPartOf": "Shakespeare Quarterly")
  • Identifiers such as the ISSN, OCLC number, and DOI
  • The page count and word count ("pageCount" and "wordCount")

If you examine the rest of the file, you'll discover additional metadata such as the publication date, volume and issue numbers, and page ranges.

The most significant data for text analysis is found within the "unigramCount" section, where the frequency of each word is recorded. In this context, the word "unigram" describes a single-word construction like the word "chicken." There are also bigrams (e.g. "chicken stock"), trigrams ("homemade chicken stock"), and n-grams of any length. Depending on the licensing for the content, there may also be full text available.
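
To see the "unigramCount" section in action, the sketch below tallies unigram counts across every document in a dataset and reports the most frequent words. A minimal sketch, assuming the same placeholder filename as above:

    import gzip
    import json
    from collections import Counter

    # Sum unigram counts across every document in the dataset
    totals = Counter()
    with gzip.open("sample-dataset.jsonl.gz", "rt", encoding="utf-8") as f:
        for line in f:
            document = json.loads(line)
            # "unigramCount" maps each word (a string) to its frequency
            totals.update(document.get("unigramCount", {}))

    print(totals.most_common(10))  # the ten most frequent unigrams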

[Screenshot: the start of the "unigramCount" section of the JSONL file, listing the unigrams for the text]

Notice that the beginning of the "unigramCount" section mostly contains numbers (represented as strings). The texts have not been pre-processed in any fashion, so the numbers we are seeing suggest that each page number has been accurately captured. We do not filter any words or numbers out of the JSON.
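
If numeric tokens such as page numbers are noise for your analysis, you can filter them out yourself. A minimal sketch, using a small invented sample in place of a real "unigramCount" dictionary:

    # A small invented sample of one document's "unigramCount" section
    unigram_counts = {"the": 412, "chicken": 3, "17": 2, "18": 2}

    # Drop tokens that are purely numeric, such as captured page numbers
    word_counts = {token: count for token, count in unigram_counts.items()
                   if not token.isdigit()}

    print(word_counts)  # {'the': 412, 'chicken': 3}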

[Screenshot: the unigram counts, where on each line a key on the left is matched to a value representing its frequency on the right]

Each word here is treated as a string. Since JavaScript and Python strings are case-sensitive, "Tiger" is considered a different word than "tiger". Counting all the occurrences of the word "tiger" would therefore require combining the counts of both strings. These methods are covered in the notebooks.
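
As a simple illustration of that combining step (the notebooks cover it in more depth), here is a minimal sketch that lowercases every token and merges the counts of case variants, using a small invented sample:

    from collections import Counter

    # A small invented sample with two case variants of the same word
    unigram_counts = {"Tiger": 2, "tiger": 5, "stripes": 1}

    # Lowercase every token and merge the counts of its case variants
    folded = Counter()
    for token, count in unigram_counts.items():
        folded[token.lower()] += count

    print(folded["tiger"])  # 7: "Tiger" and "tiger" combined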

How does the dataset format compare with HathiTrust?

They are very similar, and that's not an accident. We worked closely with HathiTrust to develop our data format. Ultimately, we decided to expand on the HathiTrust Extracted Features format to support key features our users sought (for example, the ability to analyze texts at the level of individual journal articles instead of at the issue level). We are excited by the possibility of including content from the HathiTrust Digital Library in the future.