The demand for analytics skills across all domains is growing rapidly. Text and data analysis is one of those skills, yet it remains difficult to learn. Researchers and students are often teased by black-box, point-and-click tools that produce a few quick visualizations and whet the appetite; however, the next step in learning text analytics is a high one and requires students to learn statistics and programming. Our primary goal is to make it easier for anyone to learn these skills by creating a learning platform that empowers faculty, librarians, and other instructors to educate a generation in text and data analysis. The platform lets users build datasets for analysis from a variety of sources and provides a gathering space for the growing community of practitioners. Our solution is centered on student and researcher success: it provides text and data analysis capabilities and access to content from some of the world’s most respected databases, in an open environment with a variety of teaching materials that can be used, modified, and shared.

Summary of the Service

The platform provides value to users in three core areas -- they can teach and learn text analytics, build datasets from across multiple content sources, and visualize and analyze their datasets:

Learn & Teach

  • Template and Tutorial Code: Work with template Jupyter Notebooks to analyze your dataset and learn about text analytics (with additional environments forthcoming, such as RStudio).
  • Lessons and Documentation: Lessons and educational materials created by a community of experts including those from the NEH-funded Text Analysis Pedagogy Institutes.
  • Collaborative Teaching Materials Creation: Users may create, edit, reuse, and collaborate on tutorials, code, documentation, and other educational resources for text analysis (our tutorial notebooks are all available on GitHub, in addition to being accessible in our analytics lab).

Build

  • Multiple Collections: Anchor collections from JSTOR and Portico, with additional content sources continually added (such as the Library of Congress’s Chronicling America). Further details about the collections are available.
  • Data Download in JSON
    • All content - bibliographic metadata and unigrams
    • JSTOR content - bibliographic metadata, unigrams, bigrams, trigrams
    • Open content - bibliographic metadata, full-text, unigrams, bigrams, trigrams
  • Dataset Dashboard: Easily view datasets you have built or accessed.

Analyze

  • Computational Environment: Integrated computational environment powered by BinderHub that will allow users to seamlessly analyze text content using provided template Jupyter Notebooks and tutorials.
  • Visualize: Built-in visualizations for your datasets.
  • Work with Rights-Restricted Full-Text: Access to substantial compute cycles to work with the full text of rights-restricted content (forthcoming in late 2021 -- until then, it is possible to request JSTOR content through a personal agreement).

Launch

In January 2021, we will launch the subscription service by offering a six-month free trial to institutions that participate in JSTOR or Portico. It is important to us that the platform be as widely available as possible while still covering our costs. To that end, there will always be a free tier of service available to individuals, and that tier will offer at least the functionality of JSTOR’s self-service Data for Research (DfR) (see below for differences between this new platform and DfR).

Institutional participants in the free trial will be able to provide their users with additional computational power in the analytics lab and participate in training sessions:

Feature | Non-Trial Users | Trial Participants

Build
Build & visualize datasets up to a specified number of items | 25K | 50K
Download datasets up to a specified number of items | 25K | 50K

Analyze
View and download built-in visualizations for datasets
Access to computational environment resources sufficient for | Learning | Teaching & research
Computational environment with learn to text mine notebooks
Compute environment - CPUs | <Core Tier | 4 cores
Compute environment - maximum memory | 2 GB | 8 GB
Unlimited simultaneous users in computational environment

Learn
Adopt, adapt, and contribute tutorials and documentation
Run institutional users’ (instructors, students, etc.) repositories of code in our computational environment
Attend our Train-the-Trainer workshops

This free trial period will help institutions gauge the demand on their campus for this tool and help us gauge the amount of usage the platform may see (and thereby more accurately estimate costs and determine appropriate fees).

If you are interested in signing up for the free trial, please contact us at tdm@ithaka.org.

Subscription Service

In July 2021, we expect to offer institutional subscriptions to a paid tier of service sized for teaching and learning. We have not yet set pricing for these subscriptions; we want to cover our costs while keeping the subscriptions reasonably priced. The free trial period associated with our 2021 launch will help us and our institutions evaluate both cost and value. By the end of 2021, we plan to offer a second, additional tier of service aimed at meeting the more substantial demands of advanced researchers who require computing power and access to the full text of rights-restricted content. (If you are an advanced researcher interested in exploring with us what might meet your needs, please let us know at tdm@ithaka.org.)

Differences from JSTOR’s Data for Research (DfR)

The new platform offers a modern revitalization of the existing JSTOR DfR service.  The key differences are that the new platform provides:

  • considerably more content, including an additional 15 million articles from over 3,000 journals and 42 publishers
  • a number of built-in visualizations, including an n-gram or term frequency viewer and a word cloud
  • an analytics lab (which lets users program in the cloud through their web browser)
  • text analytics tutorials and code templates in Jupyter Notebooks
  • downloads of datasets in JSON format

The biggest impact on users already comfortable with DfR datasets is the change in dataset delivery.  DfR delivers datasets to end users in a ZIP file with the following structure:

DfR ZIP File Structure

Each of those directories contains one file per article or book chapter in the dataset. For example, the metadata directory contains one XML file for each document in the dataset, and the ngram2 directory contains one CSV file for each document in the dataset. Each row of that file holds one of the document's bigrams (two-word phrases) in the first column and the number of times it occurred in the second:

DfR Tab Delimited Bigram File
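As a rough illustration of working with one of these per-document files, the sketch below reads two-column rows (n-gram, count) into a Python `Counter`. The text above describes the files as CSV while the caption describes them as tab delimited, so the delimiter is left as a parameter. The function name `read_ngram_counts` and the sample rows are hypothetical and are not taken from a real DfR export.

```python
import csv
import io
from collections import Counter


def read_ngram_counts(handle, delimiter="\t"):
    """Read rows of (n-gram, count) from an open text file into a Counter."""
    counts = Counter()
    for row in csv.reader(handle, delimiter=delimiter):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        counts[row[0]] += int(row[1])
    return counts


# Hypothetical file contents standing in for one document's bigram file.
sample = io.StringIO("text analysis\t4\ndata mining\t2\n")
bigrams = read_ngram_counts(sample)
print(bigrams.most_common(1))
```

With one file per document, a whole dataset means opening every file in the directory and merging the resulting counters.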

DfR also does some preemptive cleanup on the data it delivers: stop words are removed from the text, and all the text is lowercased.

In contrast, the new platform delivers each dataset as a single gzipped JSON-L file. A single-file format is considerably easier to work with programmatically.

New Platform JSON-L GZIP

A JSON-L file is a series of individual JSON documents, each written on a single line and concatenated together. Each document in your dataset is represented by a single JSON document (one line of the file) that includes all the bibliographic metadata plus the unigrams, bigrams, and trigrams (and for open content, the full text).
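A minimal sketch of reading such a file with the Python standard library follows. To stay self-contained it first writes a tiny stand-in file. The field names (`id`, `title`, `unigrams`) and the helper `iter_documents` are assumptions made for illustration only; the detailed documentation has the actual schema.

```python
import gzip
import json
import os
import tempfile


def iter_documents(path):
    """Yield one dict per non-empty line of a gzipped JSON-L file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Build a tiny stand-in dataset; these field names are hypothetical.
records = [
    {"id": "doc-1", "title": "First", "unigrams": {"The": 2, "whale": 1}},
    {"id": "doc-2", "title": "Second", "unigrams": {"sea": 3}},
]
path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as handle:
    for record in records:
        handle.write(json.dumps(record) + "\n")

# Stream the file one document at a time instead of loading it whole.
titles = [doc["title"] for doc in iter_documents(path)]
print(titles)
```

Because each line is a complete JSON document, the file can be streamed one document at a time, which keeps memory use low even for large datasets.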

Example JSON

In addition, the new platform does not do preemptive cleanup on the data it delivers. Our philosophy is that researchers know best how to clean up their own data. Our focus is also on learning and teaching, and most data in the world is wild and woolly and dirty, so it is best if users learn to do their own cleanup.
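One possible cleanup pass, sketched under stated assumptions, is shown below: it lowercases tokens, drops non-alphabetic tokens, and removes stop words, loosely mirroring the cleanup DfR applied. The stop-word list, the function name `clean_unigrams`, and the sample counts are all illustrative, and a real project would choose its own rules.

```python
from collections import Counter

# A deliberately tiny, illustrative stop-word list (an assumption, not a standard).
STOP_WORDS = {"the", "of", "and", "a", "in", "to"}


def clean_unigrams(counts, stop_words=STOP_WORDS):
    """Lowercase tokens, drop non-alphabetic tokens and stop words, merge counts."""
    cleaned = Counter()
    for token, n in counts.items():
        token = token.lower()
        if token.isalpha() and token not in stop_words:
            cleaned[token] += n
    return cleaned


# Hypothetical raw unigram counts for one document.
raw = {"The": 3, "the": 5, "Whale": 2, "whale": 4, "1851": 1, "of": 7}
cleaned = clean_unigrams(raw)
print(cleaned)
```

Keeping the rules in one small function makes it easy to change them (for example, keeping numbers or using a different stop-word list) and to see how each choice affects the analysis.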

For further information on the format delivered by the new platform, please see our detailed documentation.