Description:
This notebook takes as input:
- Plain text files (.txt) in a zipped folder called 'texts' in the data folder
- Metadata CSV file called 'metadata.csv' in the data folder (optional)
and outputs a single JSON Lines (.jsonl) file containing the unigrams, bigrams, trigrams, full text, and metadata for each document.
It allows researchers to create a dataset compatible with other notebooks on this platform.
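The input-to-output flow described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's own code: it uses only the standard library and a plain whitespace tokenizer instead of NLTK and pandas, and the field names ('id', 'fullText', 'unigramCount', 'bigramCount', 'trigramCount'), the 'id' column in metadata.csv, and the helper names are all assumptions made for the example.

```python
import csv
import json
import zipfile
from collections import Counter
from pathlib import Path


def ngram_counts(tokens, n):
    """Count n-grams, keyed by the space-joined n-gram string."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return dict(Counter(" ".join(gram) for gram in grams))


def build_records(zip_path, metadata_path=None):
    """Yield one dictionary per .txt file found in the zip archive."""
    metadata = {}
    if metadata_path is not None and Path(metadata_path).exists():
        with open(metadata_path, newline="", encoding="utf-8") as f:
            # Assumption: an 'id' column holds each text's file name.
            metadata = {row["id"]: row for row in csv.DictReader(f)}

    with zipfile.ZipFile(zip_path) as archive:
        for name in sorted(archive.namelist()):
            if not name.endswith(".txt"):
                continue  # skip folders and non-text files
            text = archive.read(name).decode("utf-8")
            tokens = text.split()  # simple whitespace tokenizer (assumption)
            doc_id = Path(name).name
            yield {
                **metadata.get(doc_id, {}),  # optional metadata columns
                "id": doc_id,
                "fullText": text,
                "unigramCount": ngram_counts(tokens, 1),
                "bigramCount": ngram_counts(tokens, 2),
                "trigramCount": ngram_counts(tokens, 3),
            }


def write_jsonl(records, out_path):
    """Write one JSON object per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
```

With a 'texts.zip' archive and an optional 'metadata.csv' in a data folder, a call such as write_jsonl(build_records("data/texts.zip", "data/metadata.csv"), "data/my_data.jsonl") would produce one line per document; the paths and output name here are placeholders.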
Use Case: For researchers (mostly code without explanation; not ideal for learners)
Difficulty: Advanced
Completion time: 10-15 minutes
Knowledge Required:
- Python Basics (see Python Basics I)
Knowledge Recommended:
Data Format: .txt, .csv, .jsonl
Libraries Used:
- os
- json
- nltk
- gzip
- nltk.corpus
- collections
- pandas
Research Pipeline:
- Scan documents
- OCR files
- Clean up texts
- Tokenize text files (this notebook)