Discover the significant words in a corpus using Gensim TF-IDF. The following code is included:

  • Filtering based on a pre-processed ID list
  • Filtering based on a stop words list
  • Token cleaning
  • Computing TF-IDF using Gensim

Use Case: For Researchers (Mostly code without explanation, not ideal for learners)

Difficulty: Intermediate

Completion time: 5-10 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: JSON Lines (.jsonl)

Libraries Used:

  • pandas to load a preprocessing list
  • csv to load a custom stopwords list
  • gensim to compute the TF-IDF calculations
  • NLTK to create a stopwords list (if no list is supplied)
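
The library roles above can be sketched as follows; the file names `pre_processed_ids.csv` and `custom_stopwords.csv` and the `id` column are assumptions for illustration, not names taken from the notebook.

```python
import csv
import pandas as pd

# Write small example inputs; in the research pipeline these files would
# come from the "Exploring Metadata" and "Creating a Stopwords List" steps.
pd.DataFrame({"id": ["doc1", "doc3"]}).to_csv("pre_processed_ids.csv", index=False)
with open("custom_stopwords.csv", "w", newline="") as f:
    csv.writer(f).writerows([["the"], ["and"], ["of"]])

# pandas: load the pre-processing list of document ids to keep
keep_ids = set(pd.read_csv("pre_processed_ids.csv")["id"])

# csv: load the custom stopwords list
with open("custom_stopwords.csv", newline="") as f:
    stop_words = {row[0] for row in csv.reader(f) if row}

# NLTK fallback when no custom list is supplied (left commented so this
# sketch runs without downloading NLTK corpus data):
# from nltk.corpus import stopwords
# stop_words = set(stopwords.words("english"))
```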

Research Pipeline:

  1. Build a dataset
  2. Create a "Pre-Processing CSV" with Exploring Metadata (Optional)
  3. Create a "Custom Stopwords List" with Creating a Stopwords List (Optional)
  4. Complete the TF-IDF analysis with this notebook