last hacked on Jul 22, 2017

Notes on NLP using the book written by Steven Bird, Ewan Klein, and Edward Loper. Chapter 3

# Processing Raw Text
## Part 2

For this section, I will be going over the process of cleaning up text when doing **NLP**. This is an important pre-processing step.

For this section we will be utilizing the [Gutenberg](http://www.gutenberg.org/) website, which hosts a crap-load of free online books. I decided to use *Great Expectations* by *Charles Dickens*, which is file *1400* in the website repo.

So we begin by using the package `urllib` to download the book from the website.

    >>> from urllib import request
    >>> import nltk  # needed for the nltk.word_tokenize() calls below
    >>> from nltk import *
    >>> url = 'http://www.gutenberg.org/files/1400/1400.txt'
    >>> response = request.urlopen(url)
    >>> raw = response.read().decode('utf8')
    >>> type(raw)
    <class 'str'>
    >>> len(raw)
    1033801
    >>> raw[:73]
    'The Project Gutenberg EBook of Great Expectations, by Charles Dickens\r\n\r\n'

Notice how this string also contains `\r` and `\n` characters; the `\r\n` pair is the line ending used on *Windows* machines. For our analysis we do not need these characters, so the next step is **tokenization** (see part 1 for a refresher). We use the function `word_tokenize()` from `nltk` to tokenize the string, then look at some of the basic characteristics that we looked at in *part 1*.

    >>> tokens = nltk.word_tokenize(raw)
    >>> type(tokens)
    <class 'list'>
    >>> len(tokens)
    228880
    >>> tokens[:11]
    ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Great', 'Expectations', ',', 'by', 'Charles', 'Dickens']

The tutorial doesn't really go into much detail after this, but I decided to do some exercises to keep *part 1* fresh in my mind. So I created a new list of only the upper-case tokens using the syntax from *part 1*:

    >>> upperText = [word for word in tokens if word.isupper()]

Unsurprisingly, there was a large number of *I*'s in this list, so I removed them in my next conditional selection.
Notice I use two separate syntaxes for practice, but they essentially do the same thing:

    >>> upperText = [word for word in upperText if word != 'I']
    >>> for word in upperText:
    ...     if word != 'I':
    ...         print(word, end=' ')

I won't give the result since it outputs a very large set of upper-case words. But something I notice off the bat is that many of the chapter numbers are output as well, since they are written as *Roman numerals*. So removing them would be an important pre-processing step for this body of text.

On further inspection after removing the *I*'s, there is a section of the text where we can see the letter from *Pip* (the protagonist) to Joe (the husband of his sister), showcasing his rudimentary style of writing long before his eventual journey to becoming a man of higher status (won't spoil the book for anyone).

    >>> upperText[36:73]
    ['MI', 'DEER', 'JO', 'OPE', 'U', 'R', 'KRWITE', 'WELL', 'OPE', 'SHAL', 'SON', 'B', 'HABELL', 'TEEDGE', 'U', 'JO', 'AN', 'THEN', 'WE', 'SHORL', 'B', 'SO', 'GLODD', 'AN', 'WEN', 'M', 'PRENGTD', 'U', 'JO', 'WOT', 'LARX', 'AN', 'BLEVE', 'ME', 'INF', 'XN', 'PIP']

This can be found by indexing as shown above, although it is important to note that some of the *I*'s in the letter got removed by our earlier transformation.

## HTML Formatting

For this example we will be using a story called *Blondes to die out in 200 years*, which was published as a legit news article on **BBC**.

    >>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
    >>> html = request.urlopen(url).read().decode('utf8')
    >>> html[:60]
    '<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

We can `print(html)` to see the entire HTML source, including all the tags and whatnot. But say we only wanted the text: we would use the package *Beautiful Soup* to extract the text only.
So we parse and tokenize this article as follows:

    >>> from bs4 import BeautifulSoup
    >>> raw = BeautifulSoup(html).get_text()

**Note**: When you execute this you will be given the warning:

    The code that caused this warning is on line 1 of the file <stdin>. To get rid of this warning, change code that looks like this:

     BeautifulSoup([your markup])

    to this:

     BeautifulSoup([your markup], "lxml")

      markup_type=markup_type))

So we must explicitly name the parser. The warning suggests `"lxml"`; Python's built-in `'html.parser'` also works and needs no extra install:

    >>> raw = BeautifulSoup(html, 'html.parser').get_text()
    >>> tokens = nltk.word_tokenize(raw)

Now let's look at the newly tokenized text:

    >>> tokens[:12]
    ['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'to", 'die', 'out', 'in', '200', "years'"]

Once we have this we can start doing simple analysis as before.

## Local Files

To check what's inside the current directory you can use the `os` package from within **Python**:

    >>> import os
    >>> os.listdir('.')
    ['project2.md', 'project.md', 'abstract.md', 'document.txt']

Now let's read *document.txt* into our **Python** session using the built-in function `open()`.

    >>> f = open('document.txt')
    >>> raw = f.read()
    >>> raw  # a second f.read() would return '' because the file is already consumed
    'Yellin\', "One, two, three, four, five\nI am the greatest rapper alive!"\nSo damn great, motherfucker, I\'ve died\nWhat you hearin\' now is a paranormal vibe'

+ (Lyrics by Kendrick Lamar, track: The Heart pt. 4)

Notice how we have `\n` characters in the text we read, so we can process the file one line at a time using the following for loop (I've seen this a lot when doing online parsing, so super **important**):

    >>> # plain 'r' is enough; the old 'rU' mode was removed in Python 3.11
    >>> f = open('document.txt', 'r')
    >>> for line in f:
    ...     print(line.strip())
    ...
    Yellin', "One, two, three, four, five
    I am the greatest rapper alive!"
    So damn great, motherfucker, I've died
    What you hearin' now is a paranormal vibe

# Pipeline to NLP

### Read-in Raw File

Here we outline the overall process we've learned so far!


    >>> from nltk import *
    >>> raw = open('document.txt').read()
    >>> type(raw)
    <class 'str'>

### Tokenize words from Raw File to List

Next we tokenize the string, which gives us a list. Finally, we normalize by making all words lowercase.

    >>> tokens = word_tokenize(raw)
    >>> tokens
    ['Yellin', "'", ',', '``', 'One', ',', 'two', ',', 'three', ',', 'four', ',', 'five', 'I', 'am', 'the', 'greatest', 'rapper', 'alive', '!', "''", 'So', 'damn', 'great', ',', 'motherfucker', ',', 'I', "'ve", 'died', 'What', 'you', 'hearin', "'", 'now', 'is', 'a', 'paranormal', 'vibe']
    >>> type(tokens)
    <class 'list'>

### Lowercase all words in list

    >>> words = [w.lower() for w in tokens]
    >>> words
    ['yellin', "'", ',', '``', 'one', ',', 'two', ',', 'three', ',', 'four', ',', 'five', 'i', 'am', 'the', 'greatest', 'rapper', 'alive', '!', "''", 'so', 'damn', 'great', ',', 'motherfucker', ',', 'i', "'ve", 'died', 'what', 'you', 'hearin', "'", 'now', 'is', 'a', 'paranormal', 'vibe']

### Sort and Identify Unique Words in List

    >>> vocab = sorted(set(words))
    >>> vocab
    ['!', "'", "''", "'ve", ',', '``', 'a', 'alive', 'am', 'damn', 'died', 'five', 'four', 'great', 'greatest', 'hearin', 'i', 'is', 'motherfucker', 'now', 'one', 'paranormal', 'rapper', 'so', 'the', 'three', 'two', 'vibe', 'what', 'yellin', 'you']
    >>> type(vocab)
    <class 'list'>

# Regular Expressions

For working with *regular expressions* we will use the module `re` extensively within this tutorial. First we load the appropriate modules and the *Words Corpus* to start playing with *regular expressions*.

    >>> import re
    >>> from nltk import *
    >>> wordlist = [w for w in corpus.words.words('en') if w.islower()]

Let's start by finding all words that end in *ed* using the regular expression `ed$`. We will use the `re.search(p, s)` function to check whether the pattern `p` is found anywhere in the string `s`. The `$` character at the end of the pattern anchors the match to the end of the word, so `ed$` only matches words that end with `ed`.

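To see what `re.search` and the `$` anchor do before running them over the whole corpus, here is a minimal sketch on a tiny hand-made list. The list `sample_words` and the helper `ends_with_ed` are hypothetical names I made up for illustration; they are not part of `nltk` or the book.

```python
import re

# Hypothetical sample words, made up for illustration (not from the Words Corpus).
sample_words = ['walked', 'editor', 'red', 'bedtime', 'jumped', 'need']


def ends_with_ed(word):
    """Return True when 'ed' sits at the very end of the word."""
    # re.search scans the whole string; '$' pins the match to the end.
    return re.search('ed$', word) is not None


matches = [w for w in sample_words if ends_with_ed(w)]
print(matches)  # ['walked', 'red', 'jumped', 'need']
```

Words such as *editor* and *bedtime* contain `ed`, but not at the end, so they are skipped. The same test over the full `wordlist`:
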

    >>> [w for w in wordlist if re.search('ed$', w)]

This returns a long list of words ending with *ed*; take my word for it.

The `.` is called a **wildcard**; it matches any single character. To show how it works, say we're looking for an 8-letter word with *j* as its third letter and *t* as its sixth letter. We would use the following pattern to search for the words fitting these criteria:

    >>> [w for w in wordlist if re.search('^..j..t..$', w)]
    ['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']

## Ranges and Closures

The **T9** system is used for entering text on mobile phone keypads. Two or more words that are entered with the same sequence of keystrokes are called **textonyms**. Examples include *soon* and *room*, and *pint* and *riot*. Let's find some textonyms with *regular expressions* in **Python**.

## Textonyms

Each bracketed set below lists the letters on one key, so this pattern finds the words typed with the key sequence 4653:

    >>> [w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
    ['gold', 'golf', 'hold', 'hole']

More examples to get familiar with the syntax:

    >>> [w for w in wordlist if re.search('^[ghijklmno]+$', w)]

Or written more concisely:

    >>> [w for w in wordlist if re.search('^[g-o]+$', w)]

These searches only match words typed entirely with the middle-row keys (4, 5, and 6).

    >>> [w for w in wordlist if re.search('^[a-fj-o]+$', w)]

This search matches words typed entirely with the top-right corner keys (2, 3, 5, and 6) of the **T9** keypad.

## Closures in Regular Expressions

The `+` symbol means one or more repetitions of the preceding item.

    >>> chat_words = sorted(set(w for w in corpus.nps_chat.words()))
    >>> [w for w in chat_words if re.search('^m+i+n+e+$', w)]
    ['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine', 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

Now let's try the same expression using `*` instead of `+`.


    >>> [w for w in chat_words if re.search('^m*i*n*e+$', w)]
    ['e', 'me', 'meeeeeeeeeeeee', 'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine', 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee', 'ne']

Notice how I kept the `+` at the end. Since `*` means zero or more repetitions, the trailing `+` ensures the match contains at least one letter; if the last symbol were also `*`, the pattern would match the empty string `''` as well.

Important to note: when `^` is the first character inside square brackets, it negates the set, so the set matches any character *not* listed. Best shown through an example:

    >>> [w for w in chat_words if re.search('^[^aeiouAEIOU]+$', w)]
    [..., ':', ':(', ':)', ':):):)', ':-(', ':-)', ':-@', ':.', ':/', ':@', ':D', ':P', ':]', ':p', ':|', ';', '; ..', ';)', ';-(', ';-)', ';0', ';]', ';p', ...]

## More useful applications for Regular Expressions

Here we will be using the following new symbols: `\`, `{}`, `()`, and `|`

    >>> wsj = sorted(set(corpus.treebank.words()))

## \

A backslash escapes the character that follows it, so that character is matched literally instead of acting as a special symbol. This lets us look for a specific character such as `.` or `$`. The patterns are written as raw strings (`r'...'`) so that Python leaves the backslash alone.

    >>> [w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]
    ['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5', '0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99', '1.01', '1.1', '1.125', '1.14', ...]
    >>> [w for w in wsj if re.search(r'^[A-Z]+\$$', w)]
    ['C$', 'US$']

## {}

Braces set how many times the preceding item must repeat. In the example below we look for tokens made up of exactly four digits, i.e. numbers in the thousands.

    >>> [w for w in wsj if re.search('^[0-9]{4}$', w)]
    ['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', '1934', '1948', '1953', '1955', '1956',...]

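The escape and exact-count operators above can be checked on a small hand-made list, which makes it easier to see what each piece contributes. This is only a sketch: `sample_tokens` is a hypothetical list I made up for illustration, not data from the treebank corpus.

```python
import re

# Hypothetical tokens, made up for illustration (not from the treebank corpus).
sample_tokens = ['0.25', '1929', '12345', '3x14', 'US$', 'USD', '42']

# r'\.' matches a literal dot only.
decimals = [t for t in sample_tokens if re.search(r'^[0-9]+\.[0-9]+$', t)]

# Without the backslash, '.' is a wildcard, so '3x14' and the longer
# integers slip through as well.
loose = [t for t in sample_tokens if re.search(r'^[0-9]+.[0-9]+$', t)]

# r'\$' matches a literal dollar sign; the final '$' is still the end anchor.
dollars = [t for t in sample_tokens if re.search(r'^[A-Z]+\$$', t)]

# {4} demands exactly four digits between the '^' and '$' anchors.
four_digits = [t for t in sample_tokens if re.search(r'^[0-9]{4}$', t)]

print(decimals)     # ['0.25']
print(loose)        # ['0.25', '1929', '12345', '3x14']
print(dollars)      # ['US$']
print(four_digits)  # ['1929']
```

Comparing `decimals` with `loose` shows why the backslash matters. The next corpus query swaps the exact count for a range, `{3,5}`:
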

    >>> [w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]
    ['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', '14-hour', '15-day', '150-point', '190-point', '20-point', '20-stock', '21-month', '237-seat', '240-page', '27-year', '30-day', '30-point', '30-share', '30-year', '300-day', '36-day', '36-store', '42-year', '50-state', '500-stock', '52-week', '69-point', '84-month', '87-store', '90-day']

**NOTE**: Spacing inside `{3,5}` matters. If it is written incorrectly, say as `{3, 5}`, you will receive an empty list.

    >>> [w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]
    ['black-and-white', 'bread-and-butter', 'father-in-law', 'machine-gun-toting', 'savings-and-loan']

The above pattern states that the first word has five or more letters, followed by a `-`, then a word of two or three letters, followed by another `-`, and finally a word of six or fewer letters.

Lastly, `|` separates alternatives and the parentheses limit its scope, so the next pattern matches words ending in either *ed* or *ing*:

    >>> [w for w in wsj if re.search('(ed|ing)$', w)]
    ['62%-owned', 'Absorbed', 'According', 'Adopting', 'Advanced', 'Advancing', 'Alfred', 'Allied', 'Annualized', 'Anything', 'Arbitrage-related', 'Arbitraging', 'Asked', 'Assuming', ...]
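To double-check the range, spacing, and alternation behaviour described above without loading a corpus, here is a minimal sketch on a toy list. `sample_tokens` is a hypothetical list I made up; the comment about the spaced version reflects how CPython's `re` module behaves as far as I know, so treat it as an assumption rather than a guarantee.

```python
import re

# Hypothetical tokens, made up for illustration (not from the treebank corpus).
sample_tokens = ['10-day', '30-share', '5-kilometre', 'father-in-law',
                 'walked', 'walking', 'walks']

# {3,5}: between three and five lowercase letters after the hyphen.
ranged = [t for t in sample_tokens if re.search(r'^[0-9]+-[a-z]{3,5}$', t)]

# With a space, '{3, 5}' is no longer read as a quantifier. As far as I know
# CPython treats it as literal text, so none of these tokens match.
spaced = [t for t in sample_tokens if re.search(r'^[0-9]+-[a-z]{3, 5}$', t)]

# {5,} has no upper bound; {,6} has no lower bound (zero to six letters).
compound = [t for t in sample_tokens
            if re.search(r'^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', t)]

# (ed|ing)$: either alternative, anchored at the end of the token.
suffixed = [t for t in sample_tokens if re.search(r'(ed|ing)$', t)]

print(ranged)    # ['10-day', '30-share']
print(spaced)    # []
print(compound)  # ['father-in-law']
print(suffixed)  # ['walked', 'walking']
```

The empty `spaced` list is the same symptom as in the note about `{3, 5}`: no error is raised, the pattern just quietly stops meaning what you intended.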

