
Description:
This notebook takes as input:

  • Plain text files (.txt) in a zipped folder called 'texts' in the data folder
  • An optional metadata CSV file called 'metadata.csv' in the data folder

and outputs a single JSON Lines (.jsonl) file containing the unigrams, bigrams, trigrams, full text, and metadata.

It allows researchers to create a dataset compatible with other notebooks on this platform.
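
As a rough illustration of the transformation described above, the sketch below reads .txt files from a zip archive, counts unigrams, bigrams, and trigrams, and writes one JSON object per line. It is not the notebook's own code: the field names (`id`, `fullText`, `unigramCount`, `bigramCount`, `trigramCount`), the function names, and the `metadata_by_id` dictionary are assumptions made for illustration, and a simple regular-expression tokenizer stands in for the NLTK tokenizer the notebook lists among its libraries.

```python
import json
import re
import zipfile
from collections import Counter


def tokenize(text):
    # Stand-in tokenizer (assumption): lowercase runs of word characters.
    return re.findall(r"\w+", text.lower())


def ngram_counts(tokens, n):
    # Slide a window of size n over the tokens and count space-joined n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return dict(Counter(" ".join(gram) for gram in grams))


def build_record(doc_id, text, metadata=None):
    # Start from the optional metadata, then add the computed fields.
    # Field names here are hypothetical, not confirmed by the notebook.
    record = dict(metadata or {})
    tokens = tokenize(text)
    record.update({
        "id": doc_id,
        "fullText": text,
        "unigramCount": ngram_counts(tokens, 1),
        "bigramCount": ngram_counts(tokens, 2),
        "trigramCount": ngram_counts(tokens, 3),
    })
    return record


def zip_to_jsonl(zip_path, out_path, metadata_by_id=None):
    # metadata_by_id (hypothetical): maps a file stem such as "doc1" to a
    # dict of metadata columns, e.g. built from metadata.csv.
    metadata_by_id = metadata_by_id or {}
    with zipfile.ZipFile(zip_path) as archive, \
            open(out_path, "w", encoding="utf-8") as out:
        for name in sorted(archive.namelist()):
            if not name.endswith(".txt"):
                continue  # skip directories and non-text entries
            text = archive.read(name).decode("utf-8")
            doc_id = name.rsplit("/", 1)[-1][:-len(".txt")]
            record = build_record(doc_id, text, metadata_by_id.get(doc_id))
            out.write(json.dumps(record) + "\n")
```

A call such as `zip_to_jsonl("data/texts.zip", "data/my_dataset.jsonl")` would then produce one line per text file; both paths are placeholders. Storing counts rather than token lists keeps each record compact, at the cost of losing word order outside the full text.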

Use Case: For Researchers (Mostly code without explanation, not ideal for learners)

Difficulty: Advanced

Completion time: 10-15 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: .txt, .csv, .jsonl

Libraries Used:

  • os
  • json
  • nltk
  • gzip
  • nltk.corpus
  • collections
  • pandas

Research Pipeline:

  1. Scan documents
  2. OCR files
  3. Clean up texts
  4. Tokenize text files (this notebook)