NATURAL LANGUAGE PROCESSING INTRO TUTORIAL (PART 2): 'NATURAL LANGUAGE PROCESSING WITH PYTHON' INTRODUCTION

last hacked on Jul 22, 2017

Notes on NLP using the book written by Steven Bird, Ewan Klein, and Edward Loper, Chapter 3.
# Processing Raw Text

## Part 2

For this section, I will be going over the process of cleaning up text when doing **NLP**, an important pre-processing step. We will be utilizing the [Gutenberg](http://www.gutenberg.org/) website, which hosts a crap-load of free online books. I decided to use *Great Expectations* by *Charles Dickens*, which is file *1400* in the website repo.

We begin by using the package `urllib` to extract the book from the website.

    >>> from urllib import request
    >>> import nltk
    >>> url = 'http://www.gutenberg.org/files/1400/1400.txt'
    >>> response = request.urlopen(url)
    >>> raw = response.read().decode('utf8')
    >>> type(raw)
    <class 'str'>
    >>> len(raw)
    1033801
    >>> raw[:73]
    'The Project Gutenberg EBook of Great Expectations, by Charles Dickens\r\n\r\n'

Notice how this string also contains `\r` and `\n` characters; these are the line endings usually produced on *Windows* machines. For our analysis we do not need these characters, so the next step is **tokenization** (see *part 1* for a refresher). We use the function `word_tokenize()` from `nltk` to tokenize the string, then look at some of the basic characteristics we examined in *part 1*.

    >>> tokens = nltk.word_tokenize(raw)
    >>> type(tokens)
    <class 'list'>
    >>> len(tokens)
    228880
    >>> tokens[:11]
    ['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Great', 'Expectations', ',', 'by', 'Charles', 'Dickens']

The tutorial doesn't really go into much detail after this, but I decided to do some exercises to keep *part 1* fresh in my mind. So I created a new list of only upper-case tokens using the syntax from *part 1*:

    >>> upperText = [word for word in tokens if word.isupper()]

I noticed that, unsurprisingly, there was a large number of *I*'s in this list, so I removed them in my next conditional selection. Notice I use two separate syntaxes for practice, but they essentially do the same thing:

    >>> upperText = [word for word in upperText if word != 'I']
    >>> for word in upperText:
    ...     if word != 'I':
    ...         print(word, end=' ')
    ...

I won't show the result since it outputs a very large set of upper-case words. But something I noticed right off the bat is that many of the chapter headings show up, since they are written as *Roman numerals*. Removing them would be an important pre-processing step for this body of text (a sketch of such a filter appears at the end of this section).

Some further inspection after removing the *I*'s turned up a section of text where we can see *Pip*'s (the protagonist's) letter to Joe (his sister's husband), showcasing his rudimentary style of writing long before his eventual journey to becoming a man of higher status (I won't spoil the book for anyone).

    >>> upperText[36:73]
    ['MI', 'DEER', 'JO', 'OPE', 'U', 'R', 'KRWITE', 'WELL', 'OPE', 'SHAL', 'SON', 'B', 'HABELL', 'TEEDGE', 'U', 'JO', 'AN', 'THEN', 'WE', 'SHORL', 'B', 'SO', 'GLODD', 'AN', 'WEN', 'M', 'PRENGTD', 'U', 'JO', 'WOT', 'LARX', 'AN', 'BLEVE', 'ME', 'INF', 'XN', 'PIP']

This can be found by indexing as shown above, although it is important to note that some of the *I*'s were removed by our earlier transformation.
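As a quick aside, below is a minimal sketch of the chapter-heading clean-up mentioned above. The pattern and the way it is applied are my own choices for illustration, not something from the book, and the filter is rough: it would also drop ordinary words spelled only with Roman-numeral letters (such as *I* itself).

    >>> import re
    >>> # rough pattern for Roman numerals such as 'XI' or 'XXXIX'
    >>> roman_numeral = re.compile(r'^[IVXLC]+$')
    >>> cleaned = [w for w in upperText if not roman_numeral.match(w)]
    >>> # small demonstration: numeral-only tokens are dropped, other tokens kept
    >>> [w for w in ['XXXIX', 'MI', 'DEER', 'JO', 'II'] if not roman_numeral.match(w)]
    ['MI', 'DEER', 'JO']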
## HTML Formatting

For this example we will be using a story called *Blondes to die out in 200 years*, which was published as a legitimate news article on **BBC**.

    >>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
    >>> html = request.urlopen(url).read().decode('utf8')
    >>> html[:60]
    '<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'

We can `print(html)` to see the entire HTML source, including all tags and whatnot. But say we only wanted the text; then we would use the package *Beautiful Soup* to extract just the text. So we parse and tokenize this article as follows:

    >>> from bs4 import BeautifulSoup
    >>> raw = BeautifulSoup(html).get_text()

**Note**: When you execute this you will be given a warning along these lines:

    The code that caused this warning is on line 1 of the file <stdin>.
    To get rid of this warning, change code that looks like this:
     BeautifulSoup([your markup])
    to this:
     BeautifulSoup([your markup], "lxml")

So we should explicitly state which parser to use:

    >>> raw = BeautifulSoup(html, 'html.parser').get_text()
    >>> tokens = nltk.word_tokenize(raw)

Now let's look at the newly tokenized text:

    >>> tokens[:12]
    ['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'to", 'die', 'out', 'in', '200', "years'"]

Once we have this we can start doing simple analysis as before.

## Local Files

To check what's inside the current local directory you can use the `os` package from within **Python**, as such:

    >>> import os
    >>> os.listdir('.')
    ['project2.md', 'project.md', 'abstract.md', 'document.txt']

Now let's read *document.txt* into our **Python** IDLE using the built-in function `open()`.

    >>> f = open('document.txt')
    >>> raw = f.read()
    >>> raw
    'Yellin\', "One, two, three, four, five\nI am the greatest rapper alive!"\nSo damn great, motherfucker, I\'ve died\nWhat you hearin\' now is a paranormal vibe'

(Lyrics by Kendrick Lamar, track: *The Heart Part 4*)

Notice how we have `\n` characters in the file we read in, so we can read one line at a time using the following for loop (I've seen this a lot when doing online parsing, so it's super **important**):

    >>> f = open('document.txt', 'rU')
    >>> for line in f:
    ...     print(line.strip())
    ...
    Yellin', "One, two, three, four, five
    I am the greatest rapper alive!"
    So damn great, motherfucker, I've died
    What you hearin' now is a paranormal vibe

# Pipeline to NLP

### Read-in Raw File

Here we outline the overall process we've learned so far!

    >>> from nltk import *
    >>> raw = open('document.txt').read()
    >>> type(raw)
    <class 'str'>

### Tokenize Words from Raw File to List

Next we tokenize the string, which gives us a list.

    >>> tokens = word_tokenize(raw)
    >>> tokens
    ['Yellin', "'", ',', '``', 'One', ',', 'two', ',', 'three', ',', 'four', ',', 'five', 'I', 'am', 'the', 'greatest', 'rapper', 'alive', '!', "''", 'So', 'damn', 'great', ',', 'motherfucker', ',', 'I', "'ve", 'died', 'What', 'you', 'hearin', "'", 'now', 'is', 'a', 'paranormal', 'vibe']
    >>> type(tokens)
    <class 'list'>

### Lowercase All Words in List

Then we normalize by making all words lowercase.

    >>> words = [w.lower() for w in tokens]
    >>> words
    ['yellin', "'", ',', '``', 'one', ',', 'two', ',', 'three', ',', 'four', ',', 'five', 'i', 'am', 'the', 'greatest', 'rapper', 'alive', '!', "''", 'so', 'damn', 'great', ',', 'motherfucker', ',', 'i', "'ve", 'died', 'what', 'you', 'hearin', "'", 'now', 'is', 'a', 'paranormal', 'vibe']

### Sort and Identify Unique Words in List

    >>> vocab = sorted(set(words))
    >>> vocab
    ['!', "'", "''", "'ve", ',', '``', 'a', 'alive', 'am', 'damn', 'died', 'five', 'four', 'great', 'greatest', 'hearin', 'i', 'is', 'motherfucker', 'now', 'one', 'paranormal', 'rapper', 'so', 'the', 'three', 'two', 'vibe', 'what', 'yellin', 'you']
    >>> type(vocab)
    <class 'list'>
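To tie these steps together, here is a minimal sketch of the same pipeline wrapped into one function. The function name `build_vocab` and the choice to return both the normalized words and the vocabulary are my own, purely for illustration.

    >>> import nltk
    >>> def build_vocab(path):
    ...     """Read a raw text file, tokenize it, lowercase the tokens,
    ...     and return them along with the sorted unique vocabulary."""
    ...     raw = open(path).read()              # read in raw file
    ...     tokens = nltk.word_tokenize(raw)     # tokenize to a list
    ...     words = [w.lower() for w in tokens]  # lowercase all words
    ...     return words, sorted(set(words))     # words + unique vocabulary
    ...
    >>> words, vocab = build_vocab('document.txt')
    >>> len(vocab)
    31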
# Regular Expressions

For working with *regular expressions* we will use the module `re` extensively within this tutorial. First we load the appropriate modules and the *Words Corpus* to start playing with *regular expressions*.

    >>> import re
    >>> from nltk import *
    >>> wordlist = [w for w in corpus.words.words('en') if w.islower()]

Let's start by finding all words that match the regular expression `ed$`. We will be utilizing the `re.search(p, s)` function to see if the pattern `p` is found in string `s`. The `$` character at the end of the pattern indicates that we are searching for `ed` at the end of the word.

    >>> [w for w in wordlist if re.search('ed$', w)]

This returns a long list of words ending in *ed* — take my word for it.

The `.` is called the **wildcard**; it matches any single character. To show how it works, say we're looking for 8-letter words with *j* as the third letter and *t* as the sixth letter. We would use the following syntax to search for words fitting this criteria:

    >>> [w for w in wordlist if re.search('^..j..t..$', w)]
    ['abjectly', 'adjuster', 'dejected', 'dejectly', 'injector', 'majestic', 'objectee', 'objector', 'rejecter', 'rejector', 'unjilted', 'unjolted', 'unjustly']

## Ranges and Closures

The **T9** system is used for entering text on mobile phones; two or more words entered with the same sequence of keystrokes are called **textonyms**. Examples include *soon* and *room*, or *pint* and *riot*. Let's see this in the context of *regular expressions* in **Python**.

## Textonyms

    >>> [w for w in wordlist if re.search('^[ghi][mno][jlk][def]$', w)]
    ['gold', 'golf', 'hold', 'hole']

More examples to get familiar with the syntax:

    >>> [w for w in wordlist if re.search('^[ghijklmno]+$', w)]

Or written more concisely:

    >>> [w for w in wordlist if re.search('^[g-o]+$', w)]

These searches only match words typed with the middle row of keys (4, 5, and 6).

    >>> [w for w in wordlist if re.search('^[a-fj-o]+$', w)]

This search matches words typed with the top-right corner keys (2, 3, 5, and 6) when looking at the **T9** keypad.

## Closures in Regular Expressions

    >>> chat_words = sorted(set(w for w in corpus.nps_chat.words()))
    >>> [w for w in chat_words if re.search('^m+i+n+e+$', w)]
    ['miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine', 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee']

Now let's try the same expression using `*` instead of `+`.

    >>> [w for w in chat_words if re.search('^m*i*n*e+$', w)]
    ['e', 'me', 'meeeeeeeeeeeee', 'miiiiiiiiiiiiinnnnnnnnnnneeeeeeeeee', 'miiiiiinnnnnnnnnneeeeeeee', 'mine', 'mmmmmmmmiiiiiiiiinnnnnnnnneeeeeeee', 'ne']

Notice how I kept the `+` on the final `e`; this ensures the match contains at least one letter. If it were `*` as well, the empty string `''` would also match.

Important to note: when `^` appears as the first character inside a set `[...]`, it negates all the values in the set. This is best shown through an example:

    >>> [w for w in chat_words if re.search('^[^aeiouAEIOU]+$', w)]
    [..., ':', ':(', ':)', ':):):)', ':-(', ':-)', ':-@', ':.', ':/', ':@', ':D', ':P', ':]', ':p', ':|', ';', '; ..', ';)', ';-(', ';-)', ';0', ';]', ';p', ...]

## More Useful Applications for Regular Expressions

Here we will be using the following new symbols: `\`, `{}`, `()`, and `|`.

    >>> wsj = sorted(set(corpus.treebank.words()))

## \

The backslash escapes the character that follows it, so that character is matched literally. For example, `\.` matches an actual period instead of acting as the wildcard, and `\$` matches a literal dollar sign.

    >>> [w for w in wsj if re.search(r'^[0-9]+\.[0-9]+$', w)]
    ['0.0085', '0.05', '0.1', '0.16', '0.2', '0.25', '0.28', '0.3', '0.4', '0.5', '0.50', '0.54', '0.56', '0.60', '0.7', '0.82', '0.84', '0.9', '0.95', '0.99', '1.01', '1.1', '1.125', '1.14', ...]

    >>> [w for w in wsj if re.search(r'^[A-Z]+\$$', w)]
    ['C$', 'US$']
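As a quick illustration of the difference between the wildcard `.` and the escaped `\.` (a toy example of my own, not from the book):

    >>> # '.' is the wildcard, so '1a5' happens to match the pattern '1.5'
    >>> re.search('^1.5$', '1a5') is not None
    True
    >>> # r'\.' only matches a literal period, so '1a5' no longer matches
    >>> re.search(r'^1\.5$', '1a5') is not None
    False
    >>> re.search(r'^1\.5$', '1.5') is not None
    True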
## {}

Curly braces let us specify how many repeats of the preceding expression we want. In the example below, `{4}` asks for exactly four digits, so we pick out the four-digit numbers (mostly years).

    >>> [w for w in wsj if re.search('^[0-9]{4}$', w)]
    ['1614', '1637', '1787', '1901', '1903', '1917', '1925', '1929', '1933', '1934', '1948', '1953', '1955', '1956',...]

    >>> [w for w in wsj if re.search('^[0-9]+-[a-z]{3,5}$', w)]
    ['10-day', '10-lap', '10-year', '100-share', '12-point', '12-year', '14-hour', '15-day', '150-point', '190-point', '20-point', '20-stock', '21-month', '237-seat', '240-page', '27-year', '30-day', '30-point', '30-share', '30-year', '300-day', '36-day', '36-store', '42-year', '50-state', '500-stock', '52-week', '69-point', '84-month', '87-store', '90-day']

**NOTE**: The spacing inside `{3,5}` matters; if you write it with a space, as in `{3, 5}`, it is no longer treated as a repetition count and you will receive an empty list.

    >>> [w for w in wsj if re.search('^[a-z]{5,}-[a-z]{2,3}-[a-z]{,6}$', w)]
    ['black-and-white', 'bread-and-butter', 'father-in-law', 'machine-gun-toting', 'savings-and-loan']

The syntax above matches hyphenated words whose first part has 5 or more letters, whose middle part has 2 or 3 letters, and whose final part has at most 6 letters.

## () and |

Parentheses group a sub-expression, and `|` matches either the alternative on its left or the one on its right. Here we find words ending in either *ed* or *ing*:

    >>> [w for w in wsj if re.search('(ed|ing)$', w)]
    ['62%-owned', 'Absorbed', 'According', 'Adopting', 'Advanced', 'Advancing', 'Alfred', 'Allied', 'Annualized', 'Anything', 'Arbitrage-related', 'Arbitraging', 'Asked', 'Assuming', ...]
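One small caveat of mine (not from the book): with `re.findall`, parentheses also *capture*, so only the grouped part is returned unless you use the non-capturing form `(?:...)`.

    >>> # the capturing group returns only the matched suffix
    >>> re.findall(r'(ed|ing)$', 'processing')
    ['ing']
    >>> # a non-capturing group returns the whole match instead
    >>> re.findall(r'\w+(?:ed|ing)$', 'processing')
    ['processing']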
