Users may download datasets of up to 25,000 items at a time (50,000 items if you are at a participating institution). Our datasets are delivered as a gzipped JSON Lines (JSON-L) file that contains one JSON record per line. Each JSON record represents one document in your dataset and contains:
- Bibliographic metadata (such as author, title, and publication date)
- Unigrams: Every word in the document and the number of times it occurs
- Bigrams: Every two-word phrase in the document and the number of times it occurs
- Trigrams: Every three-word phrase in the document and the number of times it occurs
To work with the file, you will first need to decompress it (Windows and macOS should handle this automatically, and if you are on a Unix machine, you probably have a good handle on it already; if you run into trouble, please reach out to us at firstname.lastname@example.org).
While you are reading through this help, we recommend that you have a file open. If you have not built a dataset of your own, feel free to use one of our pre-built datasets from your dashboard.
JSON-L is a way for us to deliver multiple JSON records in a single file. These files tend to be very large. They are easy to access programmatically, but harder to read by eye in the form we deliver them.
Here is the first little bit of a sample opened in Notepad:
Once you are working with the files in Python or R, you can sequentially read each record and load the bits you need into an array. If we take just one of the JSON records (a single line from a JSON-L file) and pretty print it, you can easily see the structure of these records:
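For example, here is a minimal Python sketch of that workflow. The file name `sample.jsonl.gz` and the record fields are made up for illustration; a real record will have the full set of elements described below:

```python
import gzip
import json

def read_records(path):
    """Read a gzipped JSON-L file, yielding one parsed record per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Hypothetical example: write a tiny two-record file, then read it back.
sample = [{"id": "doc-1", "title": "A"}, {"id": "doc-2", "title": "B"}]
with gzip.open("sample.jsonl.gz", "wt", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

records = list(read_records("sample.jsonl.gz"))

# Pretty-print the first record to inspect its structure
print(json.dumps(records[0], indent=2))
```

Reading line by line (rather than loading the whole file at once) keeps memory use low even for very large datasets.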
The first section is devoted to the bibliographic information that describes and identifies the document represented. It is followed by a section for each of the unigrams, bigrams, and trigrams of the object (and, for non-rights-restricted content, the full text).
The outputFormat element tells you what content to expect for this document. It will always include unigrams, bigrams, and trigrams and sometimes full-text.
Our full JSON schema is available in our Git repository. The data includes:

| Element | Description |
| --- | --- |
| id | a unique item ID (in JSTOR, this is a stable URL) |
| title | the title for the item |
| isPartOf | the larger work that holds this title (for example, a journal title) |
| publicationYear | the year of publication |
| doi | the digital object identifier for an item |
| docType | the type of document (for example, article or book) |
| provider | the source or provider of the dataset (the organization that provided the content) |
| datePublished | the publication date in yyyy-mm-dd format |
| issueNumber | the issue number for a journal publication |
| volumeNumber | the volume number for a journal publication |
| url | a URL for the item and/or the item's metadata |
| creator | the author or authors of the item |
| publisher | the publisher for the item |
| language | the language or languages of the item (eng is the ISO 639 code for English) |
| pageStart | the first page number of the print version |
| pageEnd | the last page number of the print version |
| placeOfPublication | the city of the publisher |
| wordCount | the number of words in the item |
| pageCount | the number of print pages in the item |
| outputFormat | what data is available (unigrams, bigrams, trigrams, and/or full-text) |
| identifier | an array of identifiers for the published document |
| sourceCategory | the subject categories our automated processing has assigned to this document |
The unigramCount element contains every “word” in the document and the number of times it occurs. For example, here is an excerpt:
Note that we are not doing any cleanup of the data. The capitalization is as it was in the document, and sometimes the “words” are just abbreviations or phrases (any text surrounded by whitespace counts as a word).
If we take the first verse of a common nursery rhyme …
Jack and Jill went up the hill
To fetch a pail of water;
Jack fell down and broke his crown,
And Jill came tumbling after.
and consider its unigrams and their frequency, we can see this:
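You can reproduce counts like these with a few lines of Python using `collections.Counter` and the same whitespace-only tokenization the platform uses (no lowercasing, no punctuation stripping):

```python
from collections import Counter

verse = (
    "Jack and Jill went up the hill "
    "To fetch a pail of water; "
    "Jack fell down and broke his crown, "
    "And Jill came tumbling after."
)

# Split on whitespace only -- capitalization and punctuation are left
# exactly as they appear in the text, matching the no-cleanup approach
unigrams = Counter(verse.split())

print(unigrams["Jack"])    # Jack appears twice
print(unigrams["water;"])  # the trailing semicolon stays attached
```

Notice that `and` and `And` are counted separately, and `water;` keeps its semicolon; that is exactly how the delivered unigramCount data behaves.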
The bigramCount is similar to the unigramCount, except it counts every two-word phrase. So, if we take our nursery rhyme verse again, these are the bigrams:
As you might expect by now, trigrams are every three-word phrase and its frequency in the document.
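Bigrams and trigrams can be generated the same way by joining runs of adjacent tokens. A quick sketch, again using the nursery rhyme verse:

```python
from collections import Counter

verse = (
    "Jack and Jill went up the hill "
    "To fetch a pail of water; "
    "Jack fell down and broke his crown, "
    "And Jill came tumbling after."
)
tokens = verse.split()

def ngram_counts(tokens, n):
    """Count every run of n consecutive whitespace-delimited tokens."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(bigrams["Jack and"])       # 1
print(trigrams["Jack and Jill"]) # 1
```

A document with N tokens yields N-1 bigrams and N-2 trigrams, which is why the n-gram sections of a record are so much larger than the metadata.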
Some of the content in our dataset builder has no rights restrictions, and for it we include the full text in the JSON-L files as an additional element named fullText. To see what full-text is available for download, consult our Collections to Analyze.
If you are more comfortable working with CSV for the bibliographic metadata of the documents, a CSV file is available in our Analytics Lab when you are working with a dataset created in our application. Note that the unigrams, bigrams, trigrams, and full-text are only available in our JSON files.
Differences from JSTOR’s Data for Research (DfR) Format
While the contents delivered to users are very similar, the format of the dataset differs.
The biggest impact on users already comfortable with DfR datasets is the change in dataset delivery. DfR delivers datasets to end users in a ZIP file with the following structure:
Each of those directories contains one file per article or book chapter in the dataset. For example, the metadata directory contains one XML file for each document in the dataset, and the ngram1 directory contains one CSV file for each document, where each row lists a word from the document in the first column and the number of times it occurred in the second column.
In addition to using the new JSON format described above, this platform does not do any preemptive cleanup of the data it delivers, whereas DfR removes stopwords and lowercases all the words in the dataset. Our philosophy is that researchers know best how to clean up their own data. Moreover, our focus is on learning and teaching, and most data in the world is wild, woolly, and dirty, so it is best if users learn to do their own cleanup.
Differences from HathiTrust Extracted Features Format
They are very similar, and that's not an accident. We worked closely with HathiTrust to develop our data format. Ultimately, we decided to expand on the HathiTrust Extracted Features format to support key features our users sought (for example, the ability to analyze texts at the level of individual journal articles instead of at the issue-level).