The dataset builder lets you create personal datasets tailored to your research interests. It works with the content we have loaded into our application. You filter your dataset on the left-hand side and see the results of that filtering on the right-hand side:

Screen shot of dataset builder

Filtering

You may create a dataset by filtering on a number of the values you will also find in your dataset (read more about our dataset structure).  In addition to filtering by these elements in the UI, you may work with them in the Analytics Lab and find them in your dataset JSON-L, should you download it.
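
If you do download your dataset, JSON-L (JSON Lines) stores one JSON object per line, so it can be read one document at a time. The sketch below is only an illustration of that format: the field names (id, docType, language) and the sample values are assumptions made for this example, so check the dataset structure documentation for the actual fields.

```python
import io
import json
from collections import Counter

# Stand-in for a downloaded JSON-L file: one JSON object per line.
# The field names and values below are hypothetical.
sample_jsonl = io.StringIO(
    '{"id": "doc-1", "docType": "article", "language": "eng"}\n'
    '{"id": "doc-2", "docType": "newspaper", "language": "eng"}\n'
    '{"id": "doc-3", "docType": "article", "language": "fre"}\n'
)


def read_jsonl(handle):
    """Yield one parsed document per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


documents = list(read_jsonl(sample_jsonl))

# Count documents by the (assumed) docType field.
doc_type_counts = Counter(doc["docType"] for doc in documents)
```

With a real download, you would pass an opened file to read_jsonl in place of the in-memory sample.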

As you filter, your other filter options will become limited.  For example, if you choose the Newspaper document type, only content from Chronicling America will be presented to you (because none of the other providers have included newspapers).

The filter options include:

Keyword filter

You may filter by words found in the full text of the documents. To search for a two- or three-word phrase, enclose it in quotes. This filter supports traditional boolean operators.

For example, the search “foster care” AND (“adoption” OR “reunification”) will find all documents that contain both “foster care” and “adoption”, as well as all documents that contain both “foster care” and “reunification”.
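
To make the boolean logic concrete, here is a minimal Python sketch that applies the same query to a few made-up strings. It only illustrates AND/OR semantics using case-insensitive substring matching. It is not how the keyword filter is implemented, and the function name and sample texts are hypothetical.

```python
def matches_query(text: str) -> bool:
    """Check: "foster care" AND ("adoption" OR "reunification")."""
    lowered = text.lower()
    has_foster_care = "foster care" in lowered
    has_adoption = "adoption" in lowered
    has_reunification = "reunification" in lowered
    return has_foster_care and (has_adoption or has_reunification)


# Hypothetical sample texts, not real documents from the platform.
samples = [
    "Foster care placements often end in adoption.",
    "Reunification is a common goal of foster care.",
    "Adoption rates rose sharply last year.",
    "Foster care funding was debated.",
]

# Only the first two samples contain "foster care" plus one of the other terms.
matching = [s for s in samples if matches_query(s)]
```

The third sample mentions adoption but not foster care, and the fourth mentions foster care alone, so neither satisfies the query.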

Publication Title(s) filter

We have newspapers, journals, and books. If you know the title of the item in which you are interested (for example, Shakespeare Quarterly), you may select it here. This filter is a multi-select, so you may enter more than one title.

If the title list is overwhelming (as we have tens of thousands of publication titles available), try limiting by document type, language, or provider first to reduce the number of publication title choices.

Publication Date filter

You may limit your dataset to a specific time frame.  

Note that we have content as far back as the 1700s, but often content that early requires considerable clean-up by the researcher before it is ready for analysis.

Also, be aware that the JSTOR content tends to stop within five years of the current year, so if you see a precipitous drop in publication rates in recent years, that is probably why (the Portico content tends to be available for the current year).

Language filter

Much of our content is in English, but it is not all in English.  See what you can find in other languages.

Document type filter

If you are interested in a specific type of content, feel free to filter for it.

Note that for some of the books in our application the entire book is a single document, while for others each individual chapter is its own document.

Provider filter

The content in our application comes from a variety of providers and you may filter by them.  Please read about each of these on our collections page.

Category filter

We have used some training data and the JSTOR thesaurus to assign subject categories to many of the JSTOR and Portico documents loaded into our platform.

These documents have been assigned one or more sets of two-level subject categories to aid in the creation of datasets.

Review Results of Dataset Filtering

Our service is not meant to enable users to read content in the traditional manner, with a highlighter in hand and note cards on the side. Rather, you build a dataset and analyze the content en masse. When building a dataset, it is quite useful to review what is in it to make sure you are getting what you expect. For example, you might filter on “adopt” in an attempt to find content about adopting children and find your dataset interspersed with articles about “resolutions adopted” at conference meetings! Reviewing the list of citations at the bottom of the page may help you appropriately narrow down your search. In addition, you may select “More visualizations” for additional ways to review and “view” your results.
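
If you have text available outside the builder (for example, in the Analytics Lab), one quick way to spot unintended matches like “resolutions adopted” is to look at the words surrounding each hit. The sketch below is a simple keyword-in-context helper written for illustration. The function name, the prefix-matching rule, and the sample sentence are all assumptions made here, not part of the platform.

```python
import re


def keyword_in_context(text, keyword, width=3):
    """Return each word starting with `keyword`, with up to `width` words on each side."""
    words = re.findall(r"[A-Za-z']+", text)
    prefix = keyword.lower()
    snippets = []
    for i, word in enumerate(words):
        # Assumption: treat any word beginning with the keyword as a hit,
        # so "adopt" also catches "adopted" and "adopting".
        if word.lower().startswith(prefix):
            start = max(0, i - width)
            snippets.append(" ".join(words[start:i + width + 1]))
    return snippets


# Hypothetical sample text, not a real document from the platform.
sample = (
    "The committee adopted three resolutions. "
    "Later, the family chose to adopt a child."
)
snippets = keyword_in_context(sample, "adopt")
```

Scanning the snippets makes it easy to see which hits concern committee resolutions and which concern children, which can suggest a more specific phrase to filter on.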

You may select “More visualizations” at any point in your dataset creation and dive deeper into your content.

Screen shot of where to find "More visualizations"

The information available in More visualizations is a subset of what is available on any given dataset page and you may review those details in the dataset page documentation.

Dataset Size Limitations

Anonymous users who are not associated with a participating institution are limited to datasets of 25,000 items at a time. Users at an institution participating in our beta test period in early 2021 may work with datasets of up to 50,000 items at a time. We have also limited the number of datasets you can create on any given day to prevent overloading our systems with constant downloads. However, if you need considerably more content than you can currently retrieve, please reach out to us at tdm@ithaka.org, as we want to understand your needs and consider the best way to meet them.