
Description: You may have text files and metadata that you want to tokenize into ngrams with Python. This notebook tokenizes each text file into unigrams, bigrams, and trigrams and merges the results with the corresponding metadata.

This notebook takes as input:

  • Plain text files (.txt) in a folder
  • A metadata CSV file called 'metadata.csv'

and outputs a single JSON-L (.jsonl) file containing the unigrams, bigrams, trigrams, full text, and metadata.

Use Case: For Researchers (Mostly code without explanation, not ideal for learners)

Difficulty: Intermediate

Completion time: 10-15 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: .txt, .csv, .jsonl

Libraries Used:

  • os
  • json
  • gzip
  • collections
  • nltk (including nltk.corpus)
  • pandas

Research Pipeline:

  1. Scan documents
  2. OCR files
  3. Clean up texts
  4. Tokenize text files (this notebook)