This notebook takes two inputs:
- Plain text files (.txt) in a zipped folder called 'texts' in the data folder
- A metadata CSV file called 'metadata.csv' in the data folder (optional)

It outputs a single JSONL (.jsonl) file containing the unigrams, bigrams, trigrams, full text, and metadata for each document.
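The two inputs above can be loaded with the standard library alone. A minimal sketch, assuming the zipped folder is readable as a zip archive and that the metadata CSV has an 'id' column keyed to the text filenames (that column name is an assumption, not confirmed by this notebook):

```python
import csv
import zipfile

def load_texts(zip_path):
    """Read every .txt file inside the zip into a {filename: text} dict."""
    texts = {}
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            if name.endswith(".txt"):
                texts[name] = zf.read(name).decode("utf-8")
    return texts

def load_metadata(csv_file):
    """Key each metadata row by its 'id' column (assumed to match filenames)."""
    return {row["id"]: row for row in csv.DictReader(csv_file)}
```

Since the metadata file is optional, a real run would call `load_metadata` only if 'metadata.csv' exists, and fall back to an empty dict otherwise.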
It allows researchers to create a dataset compatible with other notebooks on this platform. Note that the NLTK tokenization method in this notebook is slightly different from how documents are tokenized in the Constellate dataset builder. If you want to combine your output with an existing dataset from the builder, use the Tokenizing Text Files notebook instead.
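The output format described above can be sketched as follows. This is an illustrative example only: it uses a naive whitespace split in place of NLTK's `word_tokenize` so it is self-contained, and the record field names (`fullText`, `unigramCount`, etc.) are assumptions about the schema, not taken from this notebook:

```python
import json
from collections import Counter

def build_ngrams(tokens, n):
    """Count n-token sequences, stored as space-joined strings."""
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

text = "the quick brown fox jumps over the lazy dog the quick fox"
tokens = text.split()  # naive stand-in for nltk.tokenize.word_tokenize

record = {
    "id": "example.txt",  # hypothetical identifier taken from the filename
    "fullText": text,
    "unigramCount": dict(build_ngrams(tokens, 1)),
    "bigramCount": dict(build_ngrams(tokens, 2)),
    "trigramCount": dict(build_ngrams(tokens, 3)),
}

# JSONL = one JSON object per line; each document becomes one line in the file
line = json.dumps(record)
```

In the full notebook, one such record per input text file is appended to the single output .jsonl file, with any matching metadata row merged into the record.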
Use Case: For Researchers (Mostly code without explanation, not ideal for learners)
Completion time: 10-15 minutes
Knowledge Required: Python Basics (start with Python Basics I)
Data Format: .txt, .csv, .jsonl
This notebook is the final step in a typical text-preparation workflow:
- Scan documents
- OCR files
- Clean up texts
- Tokenize text files (this notebook)