The new platform offers a modern revitalization of the existing JSTOR DfR service. The key improvements include:
- Considerably more content, including an additional 15 million articles from over 3,000 journals and 42 publishers.
- A number of built-in visualizations, including an n-gram or term frequency viewer.
- An Analytics Lab that lets users program in the cloud through their web browser.
- Text analytics tutorials and code templates in Jupyter Notebooks.
- Downloads of datasets in JSON format.
The biggest impact on users already comfortable with DfR datasets is the change in dataset delivery. DfR delivered datasets to end users in a ZIP file with the following structure:
Each of those directories contained one file per article or book chapter in the dataset. For example, the metadata directory contained one XML file for each document, and the ngram2 directory contained one CSV file for each document, where each row holds one of the document's bigrams (two-word phrases) in the first column and the number of times it occurred in the second:
DfR also did some preemptive cleanup on the data it delivered: stop words were removed from the text and all the text was lowercased.
In contrast, the new platform delivers each dataset as a single gzip-compressed JSON Lines (JSON-L) file. A single file is considerably easier to work with programmatically.
A JSON-L file is a series of individual JSON documents, each written on a single line and concatenated together. Each document in your dataset is represented by one such JSON document, on its own line, that includes all the bibliographic metadata plus the unigrams, bigrams, and trigrams (and, for open content, the full text).
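As a rough illustration of why the one-document-per-line layout is convenient, here is a minimal Python sketch that streams a gzip-compressed JSON-L file without unpacking it to disk first. The function name `iter_documents` and the file name in the usage note are our own placeholders, not part of the platform.

```python
import gzip
import json
from typing import Iterator


def iter_documents(path: str) -> Iterator[dict]:
    """Yield one dict per document from a gzip-compressed JSON-L file.

    `path` is a hypothetical local file name; use the name of the
    dataset file you actually downloaded.
    """
    # Mode "rt" decompresses on the fly and decodes the bytes as text.
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                # Tolerate blank lines, e.g. a trailing newline at the end.
                continue
            # Each non-empty line is one complete JSON document.
            yield json.loads(line)
```

Because the function is a generator, only one document is held in memory at a time. A call such as `for doc in iter_documents("my-dataset.jsonl.gz"): print(sorted(doc))` would list the keys present in each document, which is a reasonable way to discover the field names before consulting the documentation.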
The new platform also does no preemptive cleanup on the data it delivers. Our philosophy is that researchers know best how to clean up their own data. Our focus is also on learning and teaching, and most data in the world is wild and woolly and dirty, so it is best if users learn to do their own cleanup.
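For readers who want to reproduce the kind of cleanup DfR used to apply, the following Python sketch lowercases tokens and drops stop words from a table of unigram counts. It assumes the counts are available as a mapping from token to frequency; the function name `clean_counts` and the tiny stop word list are illustrative assumptions, not something the platform supplies.

```python
from collections import Counter
from typing import Iterable, Mapping

# A deliberately tiny, illustrative stop word list (an assumption);
# substitute whatever list suits your research question.
STOP_WORDS = frozenset({"a", "and", "in", "of", "the", "to"})


def clean_counts(
    raw_counts: Mapping[str, int],
    stop_words: Iterable[str] = STOP_WORDS,
) -> Counter:
    """Lowercase tokens, drop stop words, and merge the resulting counts."""
    stop_set = set(stop_words)
    cleaned: Counter = Counter()
    for token, count in raw_counts.items():
        normalized = token.lower()
        if normalized in stop_set:
            continue
        # "Whale" and "whale" collapse into a single entry here.
        cleaned[normalized] += count
    return cleaned
```

Doing this step yourself means the choices stay visible and adjustable: you can keep case where it matters (for example, proper nouns) or swap in a stop word list suited to your discipline.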
For further information on the format delivered by the new platform, please see our detailed documentation.