Description: This notebook tokenizes a folder of plain text files, together with their metadata, into ngrams with Python.
This notebook takes as input:
Plain text files (.txt) in a folder
A metadata CSV file called 'metadata.csv'
and outputs a single JSON Lines (.jsonl) file containing the unigrams, bigrams, trigrams, full text, and metadata for each document.
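The record-building step above can be sketched as follows. This is a minimal, dependency-free illustration: the field names (`fullText`, `unigramCount`, `bigramCount`, `trigramCount`) and the `build_record` helper are assumptions, not the notebook's actual code, and a plain lowercase whitespace split stands in for the NLTK tokenization the notebook uses.

```python
from collections import Counter

def ngrams(tokens, n):
    # Join each run of n consecutive tokens into one n-gram string
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_record(doc_id, text, metadata):
    # Stand-in tokenizer: lowercase + whitespace split
    # (the notebook itself uses NLTK for this step)
    tokens = text.lower().split()
    return {
        "id": doc_id,
        "fullText": text,
        "unigramCount": dict(Counter(tokens)),
        "bigramCount": dict(Counter(ngrams(tokens, 2))),
        "trigramCount": dict(Counter(ngrams(tokens, 3))),
        **metadata,  # merge in the row from metadata.csv
    }
```

Each record is one JSON object; writing one object per line produces the JSON Lines output.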
Use Case: For Researchers (mostly code without explanation; not ideal for learners)
Difficulty: Intermediate
Completion time: 10-15 minutes
Knowledge Required:
- Python Basics (Start Python Basics I)
Knowledge Recommended:
Data Format: .txt, .csv, .jsonl
Libraries Used:
- os
- json
- gzip
- nltk (including nltk.corpus)
- collections
- pandas
Research Pipeline:
- Scan documents
- OCR files
- Clean up texts
- Tokenize text files (this notebook)
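The tokenization step of the pipeline can be sketched end to end: read every `.txt` file in a folder, look up its row in `metadata.csv`, and write one JSON object per document. This is a sketch under stated assumptions, not the notebook's implementation: the `file` column name, the `tokenize_folder` helper, and the gzip-compressed output path are illustrative (gzip appears in the libraries list, so compressed output is a plausible but assumed choice), and the stdlib `csv` module stands in for pandas to keep the example self-contained.

```python
import csv
import gzip
import json
import os
from collections import Counter

def tokenize_folder(txt_dir, metadata_csv, out_path):
    # Load metadata keyed by filename; the "file" column name is an assumption
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        meta = {row["file"]: row for row in csv.DictReader(f)}

    # Write one JSON object per document (JSON Lines), gzip-compressed
    with gzip.open(out_path, "wt", encoding="utf-8") as out:
        for name in sorted(os.listdir(txt_dir)):
            if not name.endswith(".txt"):
                continue
            with open(os.path.join(txt_dir, name), encoding="utf-8") as f:
                text = f.read()
            # Stand-in tokenizer; the notebook uses NLTK here
            tokens = text.lower().split()
            record = {
                "fullText": text,
                "unigramCount": dict(Counter(tokens)),
                **meta.get(name, {}),
            }
            out.write(json.dumps(record) + "\n")
```

Bigram and trigram counts would be added to `record` the same way as the unigram counts.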