squirrel.lu

Start with NLTK (Natural Language Toolkit):

$ pip install --user -U nltk

Start python3 in your terminal:

$ python3
>>> import nltk
>>> nltk.download('popular') 

See https://www.nltk.org/nltk_data/ for the list of available data packages.

Try downloading a common corpus, for example the Brown Corpus:

>>> nltk.download('brown')
[nltk_data] Downloading package brown to /Users/poknyen/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
True
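
Once a corpus is downloaded, you can read it through nltk.corpus. A quick sketch, assuming the Brown download above succeeded:

>>> from nltk.corpus import brown
>>> brown.words()[:10]        # first ten tokens of the corpus
>>> brown.categories()        # the genres the Brown Corpus is split into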

If you want to see the status of your corpora and download specific ones interactively:

>>> nltk.download()
NLTK Downloader
---------------------------------------------------------------------------
    d) Download   l) List    u) Update   c) Config   h) Help   q) Quit
---------------------------------------------------------------------------
Downloader> l

Packages:
  [ ] abc................. Australian Broadcasting Commission 2006
  [ ] alpino.............. Alpino Dutch Treebank
  [*] averaged_perceptron_tagger Averaged Perceptron Tagger
  [ ] averaged_perceptron_tagger_ru Averaged Perceptron Tagger (Russian)
  [ ] basque_grammars..... Grammars for Basque
  [ ] biocreative_ppi..... BioCreAtIvE (Critical Assessment of Information
                           Extraction Systems in Biology)
  [ ] bllip_wsj_no_aux.... BLLIP Parser: WSJ Model
  [ ] book_grammars....... Grammars from NLTK Book
  [ ] brown............... Brown Corpus
  [ ] brown_tei........... Brown Corpus (TEI XML Version)
  [ ] cess_cat............ CESS-CAT Treebank
  [ ] cess_esp............ CESS-ESP Treebank
  [ ] chat80.............. Chat-80 Data Files
  [ ] city_database....... City Database
  [*] cmudict............. The Carnegie Mellon Pronouncing Dictionary (0.6)
  [ ] comparative_sentences Comparative Sentence Dataset
  [ ] comtrans............ ComTrans Corpus Sample
  [ ] conll2000........... CONLL 2000 Chunking Corpus
  [ ] conll2002........... CONLL 2002 Named Entity Recognition Corpus
Hit Enter to continue: 
  [ ] conll2007........... Dependency Treebanks from CoNLL 2007 (Catalan
                           and Basque Subset)
  [ ] crubadan............ Crubadan Corpus
  [ ] dependency_treebank. Dependency Parsed Treebank
  [ ] dolch............... Dolch Word List
  [ ] europarl_raw........ Sample European Parliament Proceedings Parallel
                           Corpus

Installed packages are the ones marked with a * between the brackets. To install a specific corpus from Python, pass its name as a string:

>>> nltk.download('city_database')
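
nltk.download() should also accept a list of package ids (worth checking against your NLTK version), and nltk.data.path shows where NLTK looks for its data. A small sketch, the package names being just examples:

>>> nltk.download(['city_database', 'gutenberg'])   # several packages at once
>>> nltk.data.path                                  # directories searched for nltk_data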

To open your own corpus:

>>> f = open('your-corpus.txt', 'r')
>>> text = f.read()
>>> text1 = text.split()

text.split() splits the file's contents into individual words on whitespace. Note that text1 is a Python list, not a text.

Check the types of your variables:

>>> type(text)
<class 'str'>
>>> type(text1)
<class 'list'>
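
split() keeps punctuation glued to the words, as you can see in the startswith() example further down. NLTK's own tokenizer separates punctuation into its own tokens; this sketch assumes the 'punkt' models are installed (they come with the 'popular' collection, and recent NLTK versions may ask for 'punkt_tab' instead):

>>> tokens = nltk.word_tokenize(text)   # 'home,' becomes 'home' and ','
>>> type(tokens)
<class 'list'>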

Clean the corpus by keeping only purely alphabetic tokens:

>>> sorted([w for w in set(text1) if w.isalpha()])
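
From that cleaned list you can also build a frequency distribution. A minimal sketch with nltk.FreqDist, lower-casing first so that 'Home' and 'home' are counted together:

>>> words = [w.lower() for w in text1 if w.isalpha()]
>>> fdist = nltk.FreqDist(words)
>>> fdist.most_common(10)     # the ten most frequent words with their counts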

Searching for something specific? List the words in the corpus that start with a given letter or string of characters:

>>> sorted([w for w in set(text1) if w.startswith('home')])
['home', 'home!', 'home!!!', 'home"', 'home)', 'home,', 'home,"', 'home,”', 'home-buyer', 'home-free.', 'home-made', 'home-office', 'home.', 'home.[9]', 'home/end', 'home?', 'homebrew', 'homebrewing', 'homebuyer', "homebuyer's", 'homebuyers.', 'homebuyer’s', 'homebuying', 'homegrrl)', 'homeland', 'homeowner', 'homeowners', 'homeownership', 'homepage', 'homepage,', 'homepage.', 'homes', 'homes,', 'homes.', 'homes."', 'homes.”', 'homes;', 'homes”', 'homework', 'homework.', 'home’', 'home’s']
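
To see those hits in context rather than as a bare list, wrap the token list in nltk.Text and ask for a concordance. A quick sketch:

>>> t = nltk.Text(text1)
>>> t.concordance('home')     # prints each occurrence of 'home' with its surrounding words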

Have fun!