This notebook takes two inputs:
- Plain text files (.txt) in a zipped folder called 'texts' in the data folder
- A metadata CSV file called 'metadata.csv' in the data folder (optional)

It outputs a single JSONL (.jsonl) file containing the unigrams, bigrams, trigrams, full text, and metadata for each document.
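The two inputs above can be loaded with the standard library alone. A minimal sketch, assuming the zipped folder is readable as a zip archive and that the metadata CSV has an 'id' column keyed to the text filenames (that column name is an assumption, not confirmed by this notebook):

```python
import csv
import zipfile

def load_texts(zip_path):
    """Read every .txt file inside the zip into a {filename: text} dict."""
    texts = {}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".txt"):
                texts[name] = zf.read(name).decode("utf-8")
    return texts

def load_metadata(csv_file):
    """Key each metadata row by its 'id' column (assumed to match filenames)."""
    return {row["id"]: row for row in csv.DictReader(csv_file)}
```

Since the metadata file is optional, a real run would call `load_metadata` only if 'metadata.csv' exists, and fall back to an empty dict otherwise.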
It allows researchers to create a dataset compatible with other notebooks on this platform. Note that the NLTK tokenization method in this notebook is slightly different from how documents are tokenized in the Constellate dataset builder. If you want to combine your output with an existing dataset from the builder, use the Tokenizing Text Files notebook instead.
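The output format described above can be sketched as follows. This is an illustrative example only: it uses a naive whitespace split in place of NLTK's `word_tokenize` so it is self-contained, and the record field names (`fullText`, `unigramCount`, etc.) are assumptions about the schema, not taken from this notebook:

```python
import json
from collections import Counter

def build_ngrams(tokens, n):
    """Count n-token sequences, stored as space-joined strings."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

text = "the quick brown fox jumps over the lazy dog the quick fox"
tokens = text.split()  # naive stand-in for nltk.tokenize.word_tokenize

record = {
    "id": "example.txt",  # hypothetical identifier taken from the filename
    "fullText": text,
    "unigramCount": dict(build_ngrams(tokens, 1)),
    "bigramCount": dict(build_ngrams(tokens, 2)),
    "trigramCount": dict(build_ngrams(tokens, 3)),
}

# JSONL = one JSON object per line; each document becomes one line in the file
line = json.dumps(record)
```

In the full notebook, one such record per input text file is appended to the single output .jsonl file, with any matching metadata row merged into the record.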
Use Case: For Researchers (Mostly code without explanation, not ideal for learners)
Completion time: 10-15 minutes
Knowledge Required: Python Basics (start with Python Basics I)
Data Format: .txt, .csv, .jsonl
This notebook is the final step in a typical text-preparation workflow:
- Scan documents
- OCR files
- Clean up texts
- Tokenize text files (this notebook)