Introduction

Users may download datasets of up to 25,000 items (50,000 items if you are at a participating institution) at a time. Our datasets are delivered as a JSON Lines (JSON-L) file that has been gzipped and contains one JSON record per line. Each JSON record represents one document in your dataset and contains:

  • Bibliographic metadata (such as author, title, and publication date)
  • Unigrams: Every word in the document and the number of times it occurs
  • Bigrams: Every two-word phrase in the document and the number of times it occurs
  • Trigrams: Every three-word phrase in the document and the number of times it occurs

In order to work with it, you will need to unzip the file. Your Windows or Mac machine should know how to do this automatically, and if you are on a Unix machine, you probably have a good handle on it already. If you run into trouble, please reach out to us at tdm@ithaka.org.
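If you prefer to do the decompression in code, Python's standard library can handle it. Here is a minimal sketch; the filename is a placeholder for whatever your download is actually called:

    import gzip
    import shutil

    # Decompress the downloaded dataset; the filename is hypothetical.
    with gzip.open("my-dataset.jsonl.gz", "rb") as compressed:
        with open("my-dataset.jsonl", "wb") as uncompressed:
            shutil.copyfileobj(compressed, uncompressed)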

While you are reading through this help, we recommend that you have a file open. If you have not built a dataset of your own, feel free to use one of our pre-built datasets from your dashboard.

JSON-L

JSON-L is a way for us to deliver multiple JSON records in a single file. These files tend to be very large. Their content is easy to access programmatically, but the files are a bit difficult to read by eye in the form we deliver them.

Here is the first little bit of a sample opened in Notepad:

[Screenshot of a JSON-L file opened in a text editor]
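For illustration, here is a small mock-up of what those first lines look like (the values are invented; the element names match the table below). Note that each record occupies exactly one line:

    {"id": "http://www.jstor.org/stable/12345", "title": "An Example Article", "publicationYear": 1999, "unigramCount": {"the": 41, "of": 23}}
    {"id": "http://www.jstor.org/stable/67890", "title": "Another Example", "publicationYear": 2004, "unigramCount": {"research": 12, "data": 9}}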

JSON

Once you are working with the files in Python or R, you can sequentially read each record and load the bits you need into an array.  If we take just one of the JSON records (a single line from a JSON-L file) and pretty print it, you can easily see the structure of these records:

[Screenshot of a 'pretty printed' JSON record]

You may download a single JSON record as a file and open it in an editor such as JSON Editor Online to get a sense of the structure.
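Here is a minimal sketch in Python (the filename is a placeholder): it reads the gzipped file one line at a time, parses each line as JSON, and pretty prints the first record:

    import gzip
    import json

    records = []
    # Each line of the JSON-L file is one complete JSON record.
    with gzip.open("my-dataset.jsonl.gz", "rt", encoding="utf-8") as f:
        for line in f:
            records.append(json.loads(line))

    # Pretty print the first record to see its structure.
    print(json.dumps(records[0], indent=2))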

The first section is devoted to the bibliographic information that describes and identifies the document. It is followed by a section for each of the unigrams, bigrams, and trigrams of the document (and, for content without rights restrictions, the full text).

The outputFormat element tells you what content to expect for this document. It will always include unigrams, bigrams, and trigrams, and sometimes full-text.
Our full JSON schema is available in our Git repository. The data includes:

Element             Description
id                  a unique item ID (in JSTOR, this is a stable URL)
title               the title for the item
isPartOf            the larger work that holds this title (for example, a journal title)
publicationYear     the year of publication
doi                 the digital object identifier for an item
docType             the type of document (for example, article or book)
provider            the organization that provided the content
datePublished       the publication date in yyyy-mm-dd format
issueNumber         the issue number for a journal publication
volumeNumber        the volume number for a journal publication
url                 a URL for the item and/or the item's metadata
creator             the author or authors of the item
publisher           the publisher for the item
language            the language or languages of the item (eng is the ISO 639 code for English)
pageStart           the first page number of the print version
pageEnd             the last page number of the print version
placeOfPublication  the city of the publisher
wordCount           the number of words in the item
pageCount           the number of print pages in the item
outputFormat        what data is available (unigrams, bigrams, trigrams, and/or full-text)
identifier          an array of identifiers for the published document
sourceCategory      the subject categories our automated processing has assigned to this document
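As a quick illustration, here is a sketch that pulls a few of these elements out of each record; it assumes a records list like the one built in the reading sketch above:

    # Collect a handful of bibliographic fields from each record.
    citations = []
    for record in records:
        citations.append({
            "id": record.get("id"),
            "title": record.get("title"),
            "creator": record.get("creator"),
            "publicationYear": record.get("publicationYear"),
        })

    print(citations[0])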

Unigrams

The unigramCount element contains every “word” in the document and the number of times it occurs.  For example, here is an excerpt:

[Screenshot of a unigram representation in JSON]

Note that we do not do any cleanup of the data. The capitalization is as it was in the document, and sometimes the “words” are just abbreviations or phrases; a “word” here is any text surrounded by white space.

If we take the first verse of a common nursery rhyme …

Jack and Jill went up the hill
To fetch a pail of water;
Jack fell down and broke his crown,
And Jill came tumbling after.

and consider its unigrams and their frequency, we can see this:

Unigram     Frequency
Jack        2
and         2
Jill        2
went        1
up          1
the         1
hill        1
To          1
fetch       1
a           1
pail        1
of          1
water;      1
fell        1
down        1
broke       1
his         1
crown,      1
And         1
came        1
tumbling    1
after.      1
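You can reproduce this table yourself. The sketch below splits the verse on white space, exactly as described above, and counts case-sensitively:

    from collections import Counter

    verse = """Jack and Jill went up the hill
    To fetch a pail of water;
    Jack fell down and broke his crown,
    And Jill came tumbling after."""

    # split() with no arguments treats any run of white space as a separator,
    # which matches the tokenization described above: no lowercasing, punctuation kept.
    unigram_count = Counter(verse.split())
    print(unigram_count["Jack"])  # 2
    print(unigram_count["and"])   # 2 -- "And" is counted separately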

Bigrams

The bigramCount element is similar to unigramCount, only it contains every two-word phrase. So, if we take our nursery rhyme verse again, these are the bigrams:

Bigram            Frequency
Jack and          1
and Jill          1
Jill went         1
went up           1
up the            1
the hill          1
hill To           1
To fetch          1
fetch a           1
a pail            1
pail of           1
of water;         1
water; Jack       1
Jack fell         1
fell down         1
down and          1
and broke         1
broke his         1
his crown,        1
crown, And        1
And Jill          1
Jill came         1
came tumbling     1
tumbling after.   1

Trigrams

As you might expect by now, trigrams are every three-word phrase and their frequency in the document.
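If you want to check the tables above yourself, a tiny helper can produce both bigram and trigram counts. This is a sketch that reuses the verse string from the unigram example:

    from collections import Counter

    def ngram_counts(text, n):
        # Zip the word list against shifted copies of itself to form n-grams.
        words = text.split()
        return Counter(" ".join(gram) for gram in zip(*(words[i:] for i in range(n))))

    bigram_count = ngram_counts(verse, 2)
    trigram_count = ngram_counts(verse, 3)
    print(bigram_count["and Jill"])         # 1 -- "And Jill" is counted separately
    print(trigram_count["Jack and Jill"])   # 1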

Full-Text

Some of the content in our dataset builder has no rights restrictions, and for that content we include the full text in the JSON-L files as an additional element named fullText. To see what full-text content is available for download, consult our Collections to Analyze.
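In code, you might guard on the outputFormat element before reaching for the full text. A minimal sketch, assuming record is one parsed record and that the flag appears in the list as the string "full-text":

    # 'record' is one parsed JSON record (see the reading sketch above).
    if "full-text" in record.get("outputFormat", []):
        full_text = record["fullText"]
    else:
        full_text = None  # only the n-gram counts are available for this document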

CSV

If you are more comfortable working with CSV for the bibliographic metadata of the documents, a CSV file is available in our Analytics Lab when you are working with a dataset created in our application.  Note that the unigrams, bigrams, trigrams, and full-text are only available in our JSON files.
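If you go the CSV route, the file loads directly into a data frame. A minimal sketch (the filename is a placeholder for your download):

    import pandas as pd

    # Load the bibliographic metadata; the filename is hypothetical.
    metadata = pd.read_csv("my-dataset.csv")
    print(metadata.head())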

Differences from JSTOR’s Data for Research (DfR) Format

While the content delivered to users is very similar, the format of the dataset differs.

The biggest impact on users already comfortable with DfR datasets is the change in dataset delivery. DfR delivers datasets to end users in a ZIP file with the following structure:

[Screenshot of the DfR ZIP file structure]

Each of those directories contains one file per article or book chapter in the dataset. For example, the metadata directory contains one XML file for each document, and the ngram1 directory contains one CSV file for each document, where each row holds a word from the document in the first column and the number of times it occurs in the second.
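For comparison with the JSON approach above, reading one of those per-document DfR unigram files might look like this (a sketch; the path is hypothetical, and it assumes a headerless two-column CSV):

    import csv

    # One ngram1 file holds the unigrams for a single document:
    # column one is the word, column two is how many times it occurred.
    counts = {}
    with open("ngram1/some-document.csv", newline="") as f:
        for word, count in csv.reader(f):
            counts[word] = int(count)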

In addition to the new JSON format described above, this platform does not do any preemptive cleanup on the data it delivers, whereas DfR removes stopwords and lowercases all the words in the dataset. Our philosophy is that researchers know best how to clean up their own data. In addition, our focus is on learning and teaching, and most data in the world is wild and woolly and dirty, so it is best if users learn to do their own cleanup.
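If you do want DfR-style normalization, it is straightforward to apply yourself. Here is a minimal sketch that lowercases the unigrams in one parsed record and drops a small, purely illustrative stopword list:

    from collections import Counter

    # A tiny illustrative stopword list -- a real analysis would use a fuller one.
    stopwords = {"the", "and", "of", "to", "a"}

    cleaned = Counter()
    for word, count in record["unigramCount"].items():
        word = word.lower().strip(".,;:!?")  # lowercase and trim edge punctuation
        if word and word not in stopwords:
            cleaned[word] += count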

Differences from HathiTrust Extracted Features Format

The two formats are very similar, and that's not an accident. We worked closely with HathiTrust to develop our data format. Ultimately, we decided to expand on the HathiTrust Extracted Features format to support key features our users sought (for example, the ability to analyze texts at the level of individual journal articles instead of at the issue level).