This notebook computes the word frequencies for a dataset. Optionally, it can take the following inputs:

  • A pre-processed ID list, to restrict the analysis to selected documents
  • A stop words list, to exclude common words from the counts

Use Case: For Researchers (Mostly code without explanation, not ideal for learners)

Difficulty: Intermediate

Completion time: 5-10 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: JSON Lines (.jsonl)
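
For readers unfamiliar with the format, a JSON Lines file stores one JSON object per line. The following is a minimal sketch of reading such a file using only the standard library. The helper name `read_jsonl` and the field names `id` and `unigramCount` are assumptions made for illustration; the actual dataset fields may differ.

```python
import io
import json


def read_jsonl(handle):
    """Yield one parsed document (a dict) per non-empty line."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Hypothetical two-document sample; field names are illustrative only.
sample = io.StringIO(
    '{"id": "doc-1", "unigramCount": {"the": 3, "whale": 2}}\n'
    '{"id": "doc-2", "unigramCount": {"the": 1, "sea": 4}}\n'
)

documents = list(read_jsonl(sample))
print(len(documents), documents[0]["id"])
```

With a real file, the same helper can be given an open file handle, for example `open("dataset.jsonl", encoding="utf-8")`, where the filename is a placeholder.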

Libraries Used:

  • tdm_client to collect, unzip, and read our dataset
  • NLTK to help clean up our dataset
  • Counter from collections to help sum up our word frequencies
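
As a rough illustration of how these pieces fit together, the sketch below sums per-document word counts with `Counter`, skipping stop words and any document not in an optional ID list. It is not the notebook's actual code: the function `word_frequencies`, the document fields `id` and `unigramCount`, and the tiny hard-coded stop word set are all assumptions. In practice the stop words might come from NLTK or a custom list.

```python
from collections import Counter


def word_frequencies(documents, stop_words=frozenset(), keep_ids=None):
    """Sum word counts across documents.

    documents:  iterable of dicts with hypothetical "id" and
                "unigramCount" (token -> count) fields.
    stop_words: lowercase words to leave out of the totals.
    keep_ids:   optional set of document IDs to include; None keeps all.
    """
    totals = Counter()
    for doc in documents:
        if keep_ids is not None and doc["id"] not in keep_ids:
            continue  # document was filtered out by the ID list
        for token, count in doc["unigramCount"].items():
            word = token.lower()
            # Keep alphabetic tokens that are not stop words.
            if word.isalpha() and word not in stop_words:
                totals[word] += count
    return totals


# Illustrative data only.
docs = [
    {"id": "doc-1", "unigramCount": {"The": 3, "whale": 2, "1851": 1}},
    {"id": "doc-2", "unigramCount": {"the": 1, "sea": 4, "whale": 1}},
    {"id": "doc-3", "unigramCount": {"sea": 10}},
]

freqs = word_frequencies(docs, stop_words={"the"}, keep_ids={"doc-1", "doc-2"})
print(freqs.most_common(2))  # [('sea', 4), ('whale', 3)]
```

Here `doc-3` is excluded by the ID list, "the" is dropped as a stop word, and the numeric token "1851" is discarded by the alphabetic check.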

Research Pipeline:

  1. Build a dataset
  2. Create a "Pre-Processing CSV" with the Exploring Metadata notebook (Optional)
  3. Create a "Custom Stopwords List" with the Creating a Stopwords List notebook (Optional)
  4. Run the word frequency analysis with this notebook