This notebook takes as input:
- Plain text files (.txt) in a zipped folder called 'texts' in the data folder
- A metadata CSV file called 'metadata.csv' in the data folder (optional)
It outputs a single JSON Lines (.jsonl) file containing the unigrams, bigrams, trigrams, full text, and metadata.
It allows researchers to create a dataset compatible with other notebooks on this platform.
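As a rough illustration of the output described above, the following minimal sketch builds one record per document and serializes the records as JSON Lines, one JSON object per line. The field names (`id`, `full_text`, `unigrams`, `bigrams`, `trigrams`), the helper functions, and the whitespace tokenization are all assumptions made for this example; they are not necessarily the schema or the tokenizer this notebook uses.

```python
import json
from collections import Counter


def ngram_counts(tokens, n):
    """Count n-grams, joining the tokens of each n-gram with a single space."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return dict(Counter(" ".join(gram) for gram in grams))


def build_record(doc_id, text, metadata=None):
    """Build one JSON-serializable record for a document.

    Field names here are hypothetical, chosen only for illustration.
    """
    tokens = text.split()  # naive whitespace tokenization (assumption)
    record = dict(metadata or {})  # optional metadata, e.g. a row from metadata.csv
    record.update(
        {
            "id": doc_id,
            "full_text": text,
            "unigrams": ngram_counts(tokens, 1),
            "bigrams": ngram_counts(tokens, 2),
            "trigrams": ngram_counts(tokens, 3),
        }
    )
    return record


def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)
```

For example, `to_jsonl([build_record("doc1", "the cat sat on the mat", {"title": "Example"})])` returns a single line of JSON whose n-gram fields map each n-gram to its count. Storing counts rather than raw token lists keeps each record compact, although word order is then recoverable only from the full text.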
Use Case: For Researchers (Mostly code without explanation, not ideal for learners)
Completion time: 10-15 minutes
Knowledge Required:
- Python Basics (Start Python Basics I)
Data Format: .txt, .csv, .jsonl
Research Pipeline:
- Scan documents
- OCR files
- Clean up texts
- Tokenize text files (this notebook)