Description:
Discover the significant words in a corpus using Gensim TF-IDF. The following code is included:
- Filtering based on a pre-processed ID list
- Filtering based on a stop words list
- Token cleaning
- Computing TF-IDF using Gensim (a short sketch follows this list)
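As a rough illustration of these steps, here is a minimal sketch of token cleaning, stopword filtering, and the core Gensim TF-IDF computation. The sample token lists and stopwords set below are made up for illustration; the notebook itself works on documents drawn from the dataset's JSON Lines file.

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel

# Made-up token lists standing in for documents from the dataset's .jsonl file
raw_docs = [
    ["The", "quick", "brown", "Fox", "jumps", "over", "the", "lazy", "dog", "42"],
    ["A", "slow", "green", "turtle", "crawls", "under", "the", "lazy", "dog"],
]
stop_words = {"the", "a", "over", "under"}  # stand-in for a custom stopwords list

# Token cleaning: lowercase, keep purely alphabetic tokens, drop stop words
docs = [
    [t.lower() for t in doc if t.isalpha() and t.lower() not in stop_words]
    for doc in raw_docs
]

# Map tokens to IDs, build a bag-of-words corpus, and fit the TF-IDF model
dictionary = Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
tfidf_model = TfidfModel(bow_corpus)

# Show each document's most significant terms, highest TF-IDF weight first
for doc_id, weights in enumerate(tfidf_model[bow_corpus]):
    for term_id, weight in sorted(weights, key=lambda w: w[1], reverse=True):
        print(doc_id, dictionary[term_id], round(weight, 3))
```

Terms that appear in every document receive a TF-IDF weight of zero and drop out of the output, which is why only the document-distinctive words are printed.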
Use Case: For Researchers (Mostly code without explanation, not ideal for learners)
Difficulty: Intermediate
Completion time: 5-10 minutes
Knowledge Required:
- Python Basics Series (Start Python Basics I)
Knowledge Recommended:
- Exploring Metadata
- Working with Dataset Files
- Pandas I
- Creating a Stopwords List
- A familiarity with gensim is helpful but not required.
Data Format: JSON Lines (.jsonl)
Libraries Used:
- pandas to load a preprocessing list
- csv to load a custom stopwords list (see the loading sketch after this list)
- gensim to help compute the TF-IDF calculations
- NLTK to create a stopwords list (if no list is supplied)
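The sketch below shows how the pandas, csv, and NLTK pieces listed above might fit together when loading the optional inputs. The file paths and the "id" column name are assumptions for illustration, not the notebook's actual file layout.

```python
import csv
import pandas as pd

# Hypothetical paths; the notebook supplies its own files.
PRE_PROCESSED_CSV = "data/pre-processed.csv"  # assumed: document IDs in an "id" column
STOPWORDS_CSV = "data/stop_words.csv"         # assumed: one stop word per row

# Load the pre-processed ID list with pandas; keep IDs in a set for fast membership tests
filtered_ids = set(pd.read_csv(PRE_PROCESSED_CSV)["id"])

# Load the custom stopwords list with the csv module,
# or fall back to NLTK's English stopwords if no custom list is supplied
try:
    with open(STOPWORDS_CSV, encoding="utf-8") as f:
        stop_words = {row[0] for row in csv.reader(f) if row}
except FileNotFoundError:
    import nltk
    nltk.download("stopwords", quiet=True)
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words("english"))
```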
Research Pipeline:
- Build a dataset
- Create a "Pre-Processing CSV" with Exploring Metadata (Optional)
- Create a "Custom Stopwords List" with Creating a Stopwords List (Optional)
- Complete the TF-IDF analysis with this notebook