CSV vs. JSON Lines Files
The dataset builder creates two files:
- A CSV file containing only metadata
- A JSON Lines file containing metadata and the textual data
The textual data includes:
- Full Text (where available)
The metadata includes:
| Field | Description |
| --- | --- |
| id | a unique item ID (in JSTOR, this is a stable URL) |
| title | the title for the item |
| isPartOf | the larger work that holds this title (for example, a journal title) |
| publicationYear | the year of publication |
| doi | the digital object identifier for an item |
| docType | the type of document (for example, article or book) |
| provider | the source or provider of the dataset |
| datePublished | the publication date in yyyy-mm-dd format |
| issueNumber | the issue number for a journal publication |
| volumeNumber | the volume number for a journal publication |
| url | a URL for the item and/or the item's metadata |
| creator | the author or authors of the item |
| publisher | the publisher for the item |
| language | the language or languages of the item (eng is the ISO 639 code for English) |
| pageStart | the first page number of the print version |
| pageEnd | the last page number of the print version |
| placeOfPublication | the city of the publisher |
| wordCount | the number of words in the item |
| pageCount | the number of print pages in the item |
| outputFormat | what data is available (unigrams, bigrams, trigrams, and/or full-text) |
All of the textual data and metadata is available inside the JSON Lines file, but we have chosen to offer the metadata CSV for two primary reasons:
- The JSON Lines data is more complex to parse since it is nested, so it cannot be easily represented in tabular form. The metadata CSV, by contrast, can be loaded directly into a Pandas dataframe, making it easy to view all the metadata at once.
- The JSON Lines data can be very large. Each file contains all of the metadata plus unigram counts, bigram counts, trigram counts, and full-text (when available). Manipulating all that data takes significant computer time and costs. Even a modest dataset (~5000 files) can be over 1 GB in size uncompressed.
We are still refining the structure of the dataset file. We anticipate adding additional “features” (such as named entity recognition) in the future. Please reach out to Ted Lawless Ted.Lawless@ithaka.org if you have comments or suggestions.
The CSV file is a comma-delimited, tabular structure that can easily be viewed in Excel or Pandas.
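To illustrate, the metadata CSV loads straight into a Pandas dataframe. The sketch below uses a tiny inline sample standing in for a real metadata CSV download (the values are invented for illustration):

```python
import io
import pandas as pd

# A tiny inline sample standing in for a real metadata CSV download
sample_csv = io.StringIO(
    "id,title,publicationYear,docType,wordCount\n"
    "http://www.jstor.org/stable/1234,Example Article,1993,article,5000\n"
    "http://www.jstor.org/stable/5678,Another Article,2001,article,7500\n"
)

df = pd.read_csv(sample_csv)

# Every metadata field becomes a dataframe column
print(df.columns.tolist())

# Tabular metadata is easy to filter and summarize
recent = df[df["publicationYear"] > 2000]
print(len(recent))  # 1
```

With a downloaded dataset, you would pass the CSV's filename to `pd.read_csv` instead of the inline sample.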
The JSON Lines file (file extension ".jsonl") is served in a compressed gzip format (.gz). The data for each document in the corpus is written on a single line. (If there are 1,245 documents in the corpus, the JSON Lines file will be 1,245 lines long.) Each line contains a set of key/value pairs that map a key concept to a matching value.
The basic structure looks like:
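A single line, pretty-printed, might look something like the following. The values here are invented for illustration, and the exact set of keys depends on the dataset; the field names follow the metadata table above:

```json
{
  "id": "http://www.jstor.org/stable/1234567",
  "title": "An Example Article",
  "isPartOf": "An Example Journal",
  "docType": "article",
  "publicationYear": 1993,
  "creator": ["A. N. Author"],
  "wordCount": 5000,
  "pageCount": 12,
  "outputFormat": ["unigram", "bigram", "trigram"],
  "unigramCount": {"the": 310, "of": 208, "chicken": 2},
  "bigramCount": {"of the": 96},
  "trigramCount": {"in the middle": 3}
}
```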
Instead of attempting to decode the structure of a single large line, we can plug a single line into a JSON editor. The screenshot below was created using JSON Editor Online. The JSON editor reveals the file structure by breaking it down into a set of nested hierarchies, similar to XML. These can also be collapsed using arrows in a separate viewer pane within JSON Editor Online.
The editor makes it easier for human readers to discern a portion of the metadata for the text. In the data above, we can see:
- The title is "Shakespeare and the Middling Sort" ("title": "Shakespeare and the Middling Sort")
- The author is "Theodore B. Leinwand" ("creators": ["Theodore B. Leinwand"])
- The text is a journal article ("docType": "article")
- The journal is Shakespeare Quarterly ("isPartOf": "Shakespeare Quarterly")
- Identifiers such as ISSN, OCLC, and DOI
- PageCount and WordCount
If you examine the rest of the file, you'll discover additional metadata such as the publication date, DOI, page numbers, ISSN, and more.
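The same inspection can also be done programmatically rather than in a JSON editor. A minimal sketch using only the standard library, with a tiny generated file standing in for a real download (the filename and field values are invented):

```python
import gzip
import json

# Build a tiny two-line JSON Lines file standing in for a real download
docs = [
    {"title": "First Article", "isPartOf": "Some Journal", "publicationYear": 1993},
    {"title": "Second Article", "isPartOf": "Some Journal", "publicationYear": 2001},
]
with gzip.open("sample.jsonl.gz", "wt", encoding="utf-8") as f:
    for d in docs:
        f.write(json.dumps(d) + "\n")

# Read it back one document at a time; each line is one complete JSON object
titles = []
with gzip.open("sample.jsonl.gz", "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        titles.append(doc["title"])

print(titles)  # ['First Article', 'Second Article']
```

Because each document sits on its own line, a file far too large to open in an editor can still be processed one document at a time.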
The most significant data for text analysis is found within the "unigramCount" section where the frequency of each word is recorded. In this context, the word "unigram" describes a single word construction like the word "chicken." There are also bigrams (e.g. "chicken stock"), trigrams ("homemade chicken stock"), and n-grams of any length. Depending on the licensing for the content, there may also be full-text available.
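The relationship between unigrams, bigrams, and trigrams can be shown with a short sketch. This uses a naive whitespace split; the tokenization used to build the dataset may differ:

```python
from collections import Counter

def ngram_counts(text, n):
    """Count n-grams in a text using a naive whitespace split."""
    tokens = text.split()
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams)

text = "homemade chicken stock beats canned chicken stock"

print(ngram_counts(text, 1)["chicken"])                 # 2 (unigram)
print(ngram_counts(text, 2)["chicken stock"])           # 2 (bigram)
print(ngram_counts(text, 3)["homemade chicken stock"])  # 1 (trigram)
```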
The start of the section of the JSONL file that lists the unigrams for the text
Notice that the beginning of the "unigramCount" section mostly contains numbers (represented as strings). The texts have not been pre-processed in any fashion, so the numbers we are seeing suggest that each page number has been accurately captured. We do not pre-filter any words or numbers out of the JSON.
On each line, a key on the left is matched to a value on the right representing its frequency
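Because numbers are not filtered out, a common first step is to drop purely numeric keys from "unigramCount" before analysis. A minimal sketch, with a small invented sample standing in for the real section:

```python
# A small sample standing in for the start of a real "unigramCount" section,
# where stray page numbers appear alongside ordinary words
unigram_count = {"1": 1, "2": 1, "115": 1, "the": 310, "Shakespeare": 14}

# Keep only keys that are not purely numeric
words_only = {w: n for w, n in unigram_count.items() if not w.isdigit()}

print(words_only)  # {'the': 310, 'Shakespeare': 14}
```

Whether to drop numbers, stopwords, or both is an analysis choice; the dataset deliberately leaves that decision to you.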
How does the dataset format compare with HathiTrust?
They are very similar, and that's not an accident. We worked closely with HathiTrust to develop our data format. Ultimately, we decided to expand on the HathiTrust Extracted Features format to support key features our users sought (for example, the ability to analyze texts at the level of individual journal articles instead of at the issue-level). We are excited by the possibility of including content from the HathiTrust Digital Library in the future.