Description: You may have text files and metadata that you want to tokenize into ngrams with Python.
This notebook takes as input:
- Plain text files (.txt) in a folder
- A metadata CSV file called 'metadata.csv'
and outputs a single JSON Lines (.jsonl) file containing the unigrams, bigrams, trigrams, full text, and metadata.
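
The input-to-output flow described above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not necessarily how this notebook does it. Several things here are assumptions: the regex tokenizer stands in for whatever tokenizer you prefer, the output field names (`unigramCount`, `bigramCount`, `trigramCount`, `fullText`) are hypothetical, and the metadata CSV is assumed to have an `id` column that matches each text file's name without the `.txt` extension.

```python
import csv
import json
import re
from collections import Counter
from pathlib import Path


def tokenize(text):
    # Assumption: lowercase tokens of letters, digits, and apostrophes.
    # Swap in a fuller tokenizer if your texts need one.
    return re.findall(r"[a-z0-9']+", text.lower())


def ngram_counts(tokens, n):
    # Slide a window of size n over the tokens and count each n-gram.
    # N-grams are space-joined strings so they can serve as JSON keys.
    grams = zip(*(tokens[i:] for i in range(n)))
    return dict(Counter(" ".join(gram) for gram in grams))


def build_record(text, metadata):
    # Combine one document's metadata, n-gram counts, and full text.
    # The field names below are hypothetical.
    tokens = tokenize(text)
    record = dict(metadata)
    record["unigramCount"] = ngram_counts(tokens, 1)
    record["bigramCount"] = ngram_counts(tokens, 2)
    record["trigramCount"] = ngram_counts(tokens, 3)
    record["fullText"] = text
    return record


def write_jsonl(text_dir, metadata_csv, out_path):
    # Assumption: metadata.csv has an 'id' column matching each file stem.
    with open(metadata_csv, newline="", encoding="utf-8") as f:
        rows = {row["id"]: row for row in csv.DictReader(f)}
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in sorted(Path(text_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            # Fall back to a bare id when a file has no metadata row.
            metadata = rows.get(path.stem, {"id": path.stem})
            out.write(json.dumps(build_record(text, metadata)) + "\n")
            count += 1
    return count
```

Calling `write_jsonl("data", "metadata.csv", "output.jsonl")` (the folder and output names are placeholders) would write one JSON object per line, one line per text file, and return the number of documents written.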
Use Case: For Researchers (Mostly code without explanation, not ideal for learners)
Completion time: 10-15 minutes
- Python Basics (Start Python Basics I)
Data Format: .txt, .csv, .jsonl
- Scan documents
- OCR files
- Clean up texts
- Tokenize text files (this notebook)