Text mining, or the process of deriving new information from pattern and trend analysis of the written word, has the potential to revolutionize research across subjects. Sadly, there is a massive hurdle facing those eager to unleash its power. The coding skills and statistical knowledge that text mining requires can take years to develop. All too often, researchers learn about the promise of text mining, only to have it revealed that the promise can be realized solely by the select few with the necessary technical skills. Ted Underwood, Professor of English at the University of Illinois, likens this scenario to researchers being presented with a “deceptively gentle welcome mat, followed by a trapdoor”.

Enter The Digital Scholar Workbench

JSTOR and Portico are addressing this problem by building a text and data mining platform aimed at teaching and enabling a generation of researchers to text mine (the name 'Digital Scholar Workbench' is a placeholder; if you have any naming suggestions for us, add them to this Google Sheet). The platform includes a user interface that allows researchers, students, and instructors to curate, visualize, and save custom datasets. Researchers may download the extracted features of their curated datasets. Extracted features are a non-consumptive “bag-of-words” representation in which each article or book chapter in the custom dataset is represented by its bibliographic metadata, the unique set of words on each page, and the number of times each word occurs on that page.
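To make the “bag-of-words” idea concrete, here is a minimal Python sketch of how per-page word counts might be built for a single document. It is an illustration only, not the platform's actual format: the function name `extract_features`, the field names (`id`, `title`, `pages`, `word_counts`), and the simple tokenizer are all assumptions made for this example.

```python
import re
from collections import Counter


def extract_features(doc_id, title, pages):
    """Build a hypothetical bag-of-words record for one document.

    The record holds bibliographic metadata plus, for each page, the
    unique words on that page and how often each one occurs.
    Field names are illustrative, not the platform's real schema.
    """
    page_features = []
    for number, text in enumerate(pages, start=1):
        # Naive tokenizer (an assumption): lowercase runs of letters/apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        page_features.append(
            {"page": number, "word_counts": dict(Counter(tokens))}
        )
    return {"id": doc_id, "title": title, "pages": page_features}


# Made-up example document with two short "pages".
record = extract_features(
    "example-001",
    "A Hypothetical Article",
    ["The cat sat on the mat.", "The mat was red."],
)
print(record["pages"][0]["word_counts"])
```

Because only the words and their counts are kept, word order is lost and the original page text cannot be read back from the record. That is the sense in which such features can be analyzed without exposing the full text.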

The platform includes a teaching and development environment (Binder), which will be populated with easy-to-use code tutorials and templates (starting with Jupyter Notebooks and eventually featuring RStudio). There, new text miners can analyze their custom datasets and learn to modify the Python or R code to better suit their own research purposes. Researchers may download and hold locally the extracted features of any content and the full text of open content. The full text of rights-restricted content will be available for analysis in a secure computing environment coming in 2020.

The content in the text mining platform will include, at a minimum, all of JSTOR and the content from those Portico publishers who choose to participate. In addition, we are in discussions with third-party content providers about contributing their content, and the service will allow researchers to upload their own content for analysis.

The JSTOR & Portico text mining service will provide both free tools and tools available exclusively to institutional participants. As a not-for-profit, our sole aim is to reach self-sustainability.

We are working with a set of ten reference institutions from late 2019 through 2020 to identify and build all of the necessary features, with the aim of releasing the service in 2021.

Reference Institutions

Ten institutions of higher education are working closely with us to develop the text mining platform from late 2019 through at least Spring 2020. These institutions include:

  1. Carnegie Mellon University
  2. James Madison University
  3. New York University
  4. Northeastern University
  5. University of Cambridge
  6. University of Pittsburgh
  7. University of Sydney
  8. University of Virginia
  9. Wake Forest University
  10. Yale University

The reference institutions will be using the tool in workshops, developing Jupyter notebook tutorials, designing curated datasets, and helping us think through a self-sustaining business model.

Any questions, discussion items, or requests for a demonstration may be sent to us at tdm@ithaka.org.