17 NLP
Natural Language Processing
17.1 Regular Expression
- Regular expressions (called REs, or regexes) are a mandatory skill for NLP. The re module is a **built-in** library
- It is essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module
- Regular expression patterns are compiled into a series of bytecodes which are then executed by a matching engine written in C
17.1.1 Syntax
There are two methods to employ re. The first method, below, compiles a regex once, then applies it multiple times in subsequent code.
import re
pattern = re.compile(r'put pattern here')
pattern.match('put text here')
The second method, below, compiles and matches in a single line. The pattern cannot be reused, so it is only good for one-time usage.
import re
pattern = (r'put pattern here')
re.match(pattern, r'put text here')   # compile and match in single line
17.1.2 Finding
17.1.2.1 Find The First Match
There are two ways to find the first match:
- re.search: finds the first match anywhere in the text, including multiline
- re.match: finds the first match at the BEGINNING of the text, similar to re.search with ^
- Both return the first match as a MatchObject
- Both return None if no match is found
pattern1 = re.compile('123')
pattern2 = re.compile('123')
pattern3 = re.compile('^123')   # equivalent to above
text     = 'abc123xyz'

## Single Line Text Example
print( 're.search found a match somewhere:\n',
       pattern1.search(text), '\n',                             ## found
       '\nre.match did not find anything at the beginning:\n',
       pattern2.match(text), '\n',                              ## None
       '\nre.search did not find anything at beginning too:\n',
       pattern3.search(text))                                   ## None
#:> re.search found a match somewhere:
#:> <re.Match object; span=(3, 6), match='123'>
#:>
#:> re.match did not find anything at the beginning:
#:> None
#:>
#:> re.search did not find anything at beginning too:
#:> None
The returned MatchObject provides useful information about the matched string.
age_pattern = re.compile(r'\d+')
age_text    = 'Ali is my teacher. He is 109 years old. his kid is 40 years old.'
first_found = age_pattern.search(age_text)
print('Found Object: ', first_found,
'\nInput Text: ', first_found.string,
'\nInput Pattern: ', first_found.re,
'\nFirst Found string: ', first_found.group(),
'\nFound Start Position: ', first_found.start(),
'\nFound End Position: ', first_found.end(),
'\nFound Span: ', first_found.span(),)
#:> Found Object: <re.Match object; span=(25, 28), match='109'>
#:> Input Text: Ali is my teacher. He is 109 years old. his kid is 40 years old.
#:> Input Pattern: re.compile('\\d+')
#:> First Found string: 109
#:> Found Start Position: 25
#:> Found End Position: 28
#:> Found Span: (25, 28)
17.1.2.2 Find All Matches
findall() returns all matching strings as a list. If no matches are found, it returns an empty list.
print(
  'Finding Two Digits:', re.findall(r'\d\d', 'abc123xyz456'), '\n',
  '\nFound Nothing:',    re.findall(r'\d\d', 'abcxyz'))
#:> Finding Two Digits: ['12', '45']
#:>
#:> Found Nothing: []
17.1.3 Matching Condition
17.1.3.1 Meta Characters
[]: match any single character within the bracket
[1234] is the same as [1-4]
[0-39] is the same as [01239]
[a-e] is the same as [abcde]
[^abc] means any character except a, b, c
[^0-9] means any character except 0-9
a|b: a or b
{n,m}: at least n repetitions, at most m repetitions
(): grouping
pattern = re.compile(r'[a-z]+')
text1 = "tempo"
text2 = "tempo1"
text3 = "123 tempo1"
text4 = " tempo"

print(
'Matching Text1:', pattern.match(text1),
'\nMatching Text2:', pattern.match(text2),
'\nMatching Text3:', pattern.match(text3),
'\nMatching Text4:', pattern.match(text4))
#:> Matching Text1: <re.Match object; span=(0, 5), match='tempo'>
#:> Matching Text2: <re.Match object; span=(0, 5), match='tempo'>
#:> Matching Text3: None
#:> Matching Text4: None
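The remaining meta characters above, {n,m} repetition, alternation with |, and negated classes, can be illustrated with a short sketch (the patterns and sample strings here are my own, for illustration):
import re

print( re.findall(r'\d{2,3}', 'a1 b22 c333 d4444') )   ## {2,3}: runs of 2 to 3 digits -> ['22', '333', '444']
print( re.findall(r'cat|dog', 'cat dog bird') )        ## alternation -> ['cat', 'dog']
print( re.findall(r'[^0-9]+', 'ab12cd34') )            ## negated class: runs of non-digits -> ['ab', 'cd']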
17.1.3.2 Special Sequence
. : [^\n]
\d: [0-9] \D: [^0-9]
\s: [ \t\n\r\f\v] \S: [^ \t\n\r\f\v]
\w: [a-zA-Z0-9_] \W: [^a-zA-Z0-9_]
\t: tab
\n: newline
\b: word boundary (delimited by space, \t, \n)
Word Boundary Using \b:

- \bABC matches if the specified characters are at the beginning of a word (delimited by space, \t, \n), or at the beginning of a line
- ABC\b matches if the specified characters are at the end of a word (delimited by space, \t, \n), or at the end of a line
= "ABCD ABC XYZABC"
text = re.compile(r'\bABC')
pattern1 = re.compile(r'ABC\b')
pattern2 = re.compile(r'\bABC\b')
pattern3
print('Match word that begins ABC:',
      pattern1.findall(text), '\n',
      'Match word that ends with ABC:',
      pattern2.findall(text), '\n',
      'Match isolated word with ABC:',
      pattern3.findall(text))
#:> Match word that begins ABC: ['ABC', 'ABC']
#:> Match word that ends with ABC: ['ABC', 'ABC']
#:> Match isolated word with ABC: ['ABC']
17.1.3.3 Repetition
When repetition is used, re will be greedy; it tries to repeat as many times as possible. If later portions of the pattern don't match, the matching engine will then back up and try again with fewer repetitions.
?: zero or one occurrence
*: zero or more occurrences
+: one or more occurrences
? Zero or One Occurrence
text = 'abcbcdd'
pattern = re.compile(r'a[bcd]?b')
pattern.findall(text)
#:> ['ab']
+ At Least One Occurrence
text = 'abcbcdd'
pattern = re.compile(r'a[bcd]+b')
pattern.findall(text)
#:> ['abcb']
* Zero or More Occurrences
text = 'abcbcdd'
pattern = re.compile(r'a[bcd]*b')
pattern.findall(text)
#:> ['abcb']
17.1.3.4 Greedy vs Non-Greedy
- The *, +, and ? qualifiers are all greedy; they match as much text as possible
- If <.*> is matched against <a> b <c>, it will match the entire string, and not just <a>
- Adding ? after the qualifier makes it perform the match non-greedily; as few characters as possible will be matched. Using the RE <.*?> will match only <a>
text = '<a> ali baba <c>'
greedy_pattern = re.compile(r'<.*>')
non_greedy_pattern = re.compile(r'<.*?>')

print( 'Greedy: ' , greedy_pattern.findall(text), '\n',
       'Non Greedy: ', non_greedy_pattern.findall(text) )
#:> Greedy: ['<a> ali baba <c>']
#:> Non Greedy: ['<a>', '<c>']
17.1.4 Grouping
When () is used in the pattern, retrieve the grouping components from the MatchObject with .groups(). The result is a tuple. The example below extracts the hours, minutes and am/pm into a tuple.
17.1.4.1 Capturing Group
text = 'Today at Wednesday, 10:50pm, we go for a walk'
pattern = re.compile(r'(\d\d):(\d\d)(am|pm)')
m = pattern.search(text)

print(
'All Groups: ', m.groups(), '\n',
'Group 1: ', m.group(1), '\n',
'Group 2: ', m.group(2), '\n',
'Group 3: ', m.group(3) )
#:> All Groups: ('10', '50', 'pm')
#:> Group 1: 10
#:> Group 2: 50
#:> Group 3: pm
17.1.4.2 Non-Capturing Group
Having (?: ) at the start of a group means don't capture this group
text = 'Today at Wednesday, 10:50pm, we go for a walk'
pattern = re.compile(r'(\d\d):(?:\d\d)(am|pm)')
m = pattern.search(text)

print(
'All Groups: ', m.groups(), '\n',
'Group 1: ', m.group(1), '\n',
'Group 2: ', m.group(2) )
#:> All Groups: ('10', 'pm')
#:> Group 1: 10
#:> Group 2: pm
17.1.5 Splitting
The pattern is used to match the delimiters; the text is split at each match.
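A minimal sketch of splitting with a delimiter pattern (the sentence and pattern are my own, for illustration):
import re

## split on any run of commas, semicolons or whitespace
delimiter = re.compile(r'[,;\s]+')
print( delimiter.split('apple, banana;  cherry durian') )   ## ['apple', 'banana', 'cherry', 'durian']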
17.1.6 Substitution re.sub()
17.1.6.1 Found Match
The example below replaces anything within {{ }} with 'Durian'.
re.sub(r'({{.*}})', 'Durian', 'I like to eat {{Food}}.', flags=re.IGNORECASE)
#:> 'I like to eat Durian.'
Replace AND with &. This does not require () grouping.
re.sub(r'\sAND\s', ' & ', 'Baked Beans And Spam', flags=re.IGNORECASE)
#:> 'Baked Beans & Spam'
17.1.7 Practical Examples
17.1.7.1 Extracting Float
re_float = re.compile(r'\d+(\.\d+)?')

def extract_float(x):
    money  = x.replace(',', '')
    result = re_float.search(money)
    return float(result.group()) if result else float(0)

print( extract_float('123,456.78'), '\n',
       extract_float('rm 123.78 (30%)'), '\n',
       extract_float('rm 123,456.78 (30%)') )
#:> 123456.78
#:> 123.78
#:> 123456.78
17.2 Word Tokenizer
17.2.1 Custom Tokenizer
17.2.1.1 Split By Regex Pattern
Use regex to split words using specific punctuation as delimiters.
The rule: split the input text at any run of one or more of the specified characters.
import re
pattern = re.compile(r"[-\s.,;!?]+")
pattern.split("hi @ali--baba, you are aweeeeeesome! isn't it. Believe it.:)")
#:> ['hi', '@ali', 'baba', 'you', 'are', 'aweeeeeesome', "isn't", 'it', 'Believe', 'it', ':)']
17.2.1.2 Pick By Regex Pattern nltk.tokenize.RegexpTokenizer
Any sequence of characters that falls within the brackets is considered a token. Any characters not within the brackets are removed.
from nltk.tokenize import RegexpTokenizer
my_tokenizer = RegexpTokenizer(r'[a-zA-Z0-9\']+')
my_tokenizer.tokenize("hi @ali--baba, you are aweeeeeesome! isn't it. Believe it.:")
#:> ['hi', 'ali', 'baba', 'you', 'are', 'aweeeeeesome', "isn't", 'it', 'Believe', 'it']
17.2.2 nltk.tokenize.word_tokenize()
Words and punctuation marks are considered tokens!
import nltk
nltk.download('punkt')
#:> True
#:>
#:> [nltk_data] Downloading package punkt to /home/msfz751/nltk_data...
#:> [nltk_data] Package punkt is already up-to-date!
from nltk.tokenize import word_tokenize
print( word_tokenize("hi @ali-baba, you are aweeeeeesome! isn't it. Believe it.:)") )
#:> ['hi', '@', 'ali-baba', ',', 'you', 'are', 'aweeeeeesome', '!', 'is', "n't", 'it', '.', 'Believe', 'it', '.', ':', ')']
17.2.3 nltk.tokenize.casual.casual_tokenize()
- Support emoji
- Support reduction of repetition chars
- Support removing userid (@someone)
- Good for social media text
- Punctuations are tokens!
from nltk.tokenize.casual import casual_tokenize
print( casual_tokenize("hi @ali-baba, you are aweeeeeesome! isn't it. Believe it. :)") )
#:> ['hi', '@ali', '-', 'baba', ',', 'you', 'are', 'aweeeeeesome', '!', "isn't", 'it', '.', 'Believe', 'it', '.', ':)']
The example below shortens repeated chars; notice aweeeeeesome becomes aweeesome
## shorten repeated chars
print( casual_tokenize("hi @ali-baba, you are aweeeeeesome! isn't it. Believe it.:)",
       reduce_len=True))
#:> ['hi', '@ali', '-', 'baba', ',', 'you', 'are', 'aweeesome', '!', "isn't", 'it', '.', 'Believe', 'it', '.', ':)']
Stripping off User ID
## shorten repeated chars, strip usernames
print( casual_tokenize("hi @ali-baba, you are aweeeeeesome! isn't it. Believe it.:)",
       reduce_len=True,
       strip_handles=True))
#:> ['hi', '-', 'baba', ',', 'you', 'are', 'aweeesome', '!', "isn't", 'it', '.', 'Believe', 'it', '.', ':)']
17.2.4 nltk.tokenize.treebank.TreebankWordTokenizer().tokenize()
Treebank assumes the input text is a single sentence, hence any period attached to a word is treated as part of that token.
from nltk.tokenize.treebank import TreebankWordTokenizer
TreebankWordTokenizer().tokenize("hi @ali-baba, you are aweeeeeesome! isn't it. Believe it.:)")
#:> ['hi', '@', 'ali-baba', ',', 'you', 'are', 'aweeeeeesome', '!', 'is', "n't", 'it.', 'Believe', 'it.', ':', ')']
17.2.5 Corpus Token Extractor
A corpus is a collection of documents (list of documents). A document is a text string containing one or many sentences.
from nltk.tokenize import word_tokenize
from nlpia.data.loaders import harry_docs as corpus

## Tokenize each doc to a list, then add to a bigger list
doc_tokens = []
for doc in corpus:
    doc_tokens += [word_tokenize(doc.lower())]
print('Corpus (Contain 3 Documents):\n',corpus,'\n',
'\nTokenized result for each document:','\n',doc_tokens)
#:> Corpus (Contain 3 Documents):
#:> ['The faster Harry got to the store, the faster and faster Harry would get home.', 'Harry is hairy and faster than Jill.', 'Jill is not as hairy as Harry.']
#:>
#:> Tokenized result for each document:
#:> [['the', 'faster', 'harry', 'got', 'to', 'the', 'store', ',', 'the', 'faster', 'and', 'faster', 'harry', 'would', 'get', 'home', '.'], ['harry', 'is', 'hairy', 'and', 'faster', 'than', 'jill', '.'], ['jill', 'is', 'not', 'as', 'hairy', 'as', 'harry', '.']]
Unpack the list of token lists above using sum. To get the vocabulary (unique tokens), convert the list to a set.
## unpack list of lists into a flat list
vocab = sum(doc_tokens, [])
print('\nCorpus Vocabulary (Unique Tokens):\n',
      sorted(set(vocab)))
#:>
#:> Corpus Vocabulary (Unique Tokens):
#:> [',', '.', 'and', 'as', 'faster', 'get', 'got', 'hairy', 'harry', 'home', 'is', 'jill', 'not', 'store', 'than', 'the', 'to', 'would']
17.3 Sentence Tokenizer
This is about detecting sentence boundaries and splitting text into a list of sentences.
17.3.1 Sample Text
text = '''
Hello Mr. Smith, how are you doing today?
The weather is great, and city is awesome.
The sky is pinkish-blue, Dr. Alba would agree.
You shouldn't eat hard things i.e. cardboard, stones and bushes
'''
17.3.2 nltk.tokenize.punkt.PunktSentenceTokenizer
- The PunktSentenceTokenizer is a sentence boundary detection algorithm. It is an unsupervised trainable model, meaning it can be trained on unlabeled data, i.e. text that is not split into sentences (a minimal training sketch follows this list)
- PunktSentenceTokenizer is based on the work published in this paper: Unsupervised Multilingual Sentence Boundary Detection
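As a minimal sketch of that unsupervised training (the tiny training text here is made up; a real corpus with many abbreviation examples is needed for Punkt to learn useful parameters):
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

train_text = "Dr. Alba teaches at the clinic. Mr. Smith visits every week. They talk a lot."

trainer = PunktTrainer()
trainer.train(train_text)                                          ## unsupervised: learns abbreviations, collocations, sentence starters
print('Learned abbreviations:', trainer.get_params().abbrev_types)

trained_tokenizer = PunktSentenceTokenizer(trainer.get_params())   ## build a tokenizer from the learned parameters
print(trained_tokenizer.tokenize("Dr. Alba is here. Mr. Smith is late."))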
17.3.2.1 Default Behavior
The vanilla tokenizer splits sentences on every period (.), which is not desirable
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer
#nltk.download('punkt')

tokenizer = PunktSentenceTokenizer()
tokenized_text = tokenizer.tokenize(text)
for x in tokenized_text:
    print(x)
#:>
#:> Hello Mr.
#:> Smith, how are you doing today?
#:> The weather is great, and city is awesome.
#:> The sky is pinkish-blue, Dr.
#:> Alba would agree.
#:> You shouldn't eat hard things i.e.
#:> cardboard, stones and bushes
17.3.2.2 Pretrained Model - English Pickle
NLTK already includes a pre-trained version of the PunktSentenceTokenizer for English. As you can see, it is quite good.
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
tokenized_text = tokenizer.tokenize(text)
for x in tokenized_text:
    print(x)
#:>
#:> Hello Mr. Smith, how are you doing today?
#:> The weather is great, and city is awesome.
#:> The sky is pinkish-blue, Dr. Alba would agree.
#:> You shouldn't eat hard things i.e.
#:> cardboard, stones and bushes
17.3.2.3 Adding Abbreviations
- The pretrained tokenizer is not perfect; it wrongly detected 'i.e.' as a sentence boundary
- Let's teach Punkt by adding the abbreviation to its parameters
Adding Single Abbreviation
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')

## Add abbreviations to Tokenizer
tokenizer._params.abbrev_types.add('i.e')

tokenized_text = tokenizer.tokenize(text)
for x in tokenized_text:
    print(x)
#:>
#:> Hello Mr. Smith, how are you doing today?
#:> The weather is great, and city is awesome.
#:> The sky is pinkish-blue, Dr. Alba would agree.
#:> You shouldn't eat hard things i.e. cardboard, stones and bushes
Add List of Abbreviations
If you have more than one abbreviation, use update() with a list of abbreviations.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters

## Add Abbreviations to Tokenizer
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
tokenizer._params.abbrev_types.update(['dr', 'vs', 'mr', 'mrs', 'prof', 'inc', 'i.e'])

sentences = tokenizer.tokenize(text)
for x in sentences:
    print(x)
#:>
#:> Hello Mr. Smith, how are you doing today?
#:> The weather is great, and city is awesome.
#:> The sky is pinkish-blue, Dr. Alba would agree.
#:> You shouldn't eat hard things i.e. cardboard, stones and bushes
17.3.3 nltk.tokenize.sent_tokenize()
The sent_tokenize function uses an instance of PunktSentenceTokenizer that has already been trained and thus knows very well at which characters and punctuation to mark the end and beginning of a sentence.
from nltk.tokenize import sent_tokenize

sentences = sent_tokenize(text)
for x in sentences:
    print(x)
#:>
#:> Hello Mr. Smith, how are you doing today?
#:> The weather is great, and city is awesome.
#:> The sky is pinkish-blue, Dr. Alba would agree.
#:> You shouldn't eat hard things i.e. cardboard, stones and bushes
17.4 N-Gram
To create n-grams, first create 1-gram tokens
from nltk.util import ngrams
import re

sentence = "Thomas Jefferson began building the city, at the age of 25"
pattern  = re.compile(r"[-\s.,;!?]+")
tokens   = pattern.split(sentence)
print(tokens)
#:> ['Thomas', 'Jefferson', 'began', 'building', 'the', 'city', 'at', 'the', 'age', 'of', '25']
ngrams() returns a generator, therefore use list() to convert it into a full list
ngrams(tokens, 2)
#:> <generator object ngrams at 0x7fcdc9eae850>
Convert 1-gram to 2-gram, wrapped into a list
grammy = list( ngrams(tokens, 2) )
print(grammy)
#:> [('Thomas', 'Jefferson'), ('Jefferson', 'began'), ('began', 'building'), ('building', 'the'), ('the', 'city'), ('city', 'at'), ('at', 'the'), ('the', 'age'), ('age', 'of'), ('of', '25')]
Combine each 2-gram into a string object
" ".join(x) for x in grammy] [
#:> ['Thomas Jefferson', 'Jefferson began', 'began building', 'building the', 'the city', 'city at', 'at the', 'the age', 'age of', 'of 25']
17.5 Stopwords
17.5.1 Custom Stop Words
Build the custom stop words dictionary.
stop_words = ['a','an','the','on','of','off','this','is','at']
Tokenize text and remove stop words
= "The house is on fire"
sentence = word_tokenize(sentence)
tokens = [ x for x in tokens if x not in stop_words ]
tokens_without_stopwords
print(' Original Tokens : ', tokens, '\n',
'Removed Stopwords: ',tokens_without_stopwords)
#:> Original Tokens : ['The', 'house', 'is', 'on', 'fire']
#:> Removed Stopwords: ['The', 'house', 'fire']
17.5.2 NLTK Stop Words
Contains 179 words, in list form
import nltk
nltk.download('stopwords')
#:> True
#:>
#:> [nltk_data] Downloading package stopwords to
#:> [nltk_data] /home/msfz751/nltk_data...
#:> [nltk_data] Package stopwords is already up-to-date!
import nltk
#nltk.download('stopwords')

nltk_stop_words = nltk.corpus.stopwords.words('english')
print('Total NLTK Stopwords: ', len(nltk_stop_words), '\n',
      nltk_stop_words)
#:> Total NLTK Stopwords: 179
#:> ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
17.5.3 SKLearn Stop Words
Contains 318 stop words, in frozenset form
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS as sklearn_stop_words
print(' Total Sklearn Stopwords: ', len(sklearn_stop_words),'\n\n',
sklearn_stop_words)
#:> Total Sklearn Stopwords: 318
#:>
#:> frozenset({'before', 'becomes', 'down', 'see', 'already', 'cry', 'behind', 'first', 'such', 'con', 'neither', 'however', 'was', 'something', 'against', 'either', 'thence', 'well', 'to', 'when', 'those', 'whose', 'throughout', 'never', 'third', 'been', 'empty', 'anyway', 'if', 'their', 'amount', 'etc', 'others', 'nothing', 'around', 'every', 'how', 'into', 'then', 'along', 'amongst', 'should', 'upon', 'us', 'take', 'himself', 'namely', 'not', 'de', 'under', 'all', 'out', 'twelve', 'sincere', 'through', 'thru', 'get', 'across', 'mill', 'indeed', 'each', 'whereas', 'now', 'so', 'very', 'hereafter', 'them', 'few', 'most', 'call', 'hereby', 'sometime', 'many', 'ever', 'interest', 'twenty', 'any', 'might', 'thick', 'rather', 'until', 'besides', 're', 'back', 'hers', 'cant', 'enough', 'by', 'hence', 'would', 'via', 'for', 'yours', 'couldnt', 'will', 'thus', 'fire', 'wherever', 'but', 'none', 'name', 'somehow', 'almost', 'full', 'myself', 'fill', 'hasnt', 'sixty', 'who', 'what', 'keep', 'front', 'nevertheless', 'these', 'there', 'system', 'done', 'eleven', 'three', 'seem', 'why', 'anything', 'made', 'ourselves', 'had', 'nowhere', 'with', 'un', 'give', 'another', 'perhaps', 'everywhere', 'between', 'show', 'amoungst', 'further', 'whom', 'within', 'here', 'thereby', 'your', 'during', 'too', 'yourself', 'towards', 'and', 'at', 'move', 'eg', 'eight', 'be', 'above', 'among', 'do', 'a', 'inc', 'six', 'together', 'yet', 'up', 'toward', 'whatever', 'an', 'on', 'you', 'still', 'five', 'therein', 'noone', 'although', 'whereafter', 'alone', 'as', 'we', 'always', 'often', 'has', 'this', 'next', 'else', 'whoever', 'again', 'due', 'even', 'no', 'become', 'several', 'please', 'beforehand', 'least', 'nobody', 'could', 'both', 'mostly', 'may', 'latterly', 'own', 'because', 'that', 'thereupon', 'anyhow', 'hereupon', 'off', 'less', 'only', 'it', 'whole', 'than', 'seemed', 'more', 'whence', 'hundred', 'must', 'my', 'since', 'i', 'can', 'except', 'other', 'side', 'bill', 'after', 'our', 'were', 'meanwhile', 'herself', 'in', 'two', 'ltd', 'whereby', 'put', 'bottom', 'afterwards', 'his', 'its', 'everyone', 'one', 'everything', 'find', 'anyone', 'beside', 'though', 'he', 'from', 'she', 'are', 'former', 'or', 'otherwise', 'someone', 'top', 'me', 'of', 'therefore', 'whereupon', 'am', 'the', 'nine', 'ours', 'found', 'co', 'some', 'yourselves', 'have', 'once', 'describe', 'over', 'themselves', 'mine', 'detail', 'itself', 'per', 'they', 'latter', 'whither', 'forty', 'somewhere', 'fifteen', 'beyond', 'being', 'whether', 'cannot', 'is', 'nor', 'thin', 'became', 'onto', 'go', 'whenever', 'thereafter', 'seeming', 'without', 'formerly', 'sometimes', 'seems', 'last', 'much', 'fifty', 'which', 'serious', 'while', 'herein', 'ie', 'moreover', 'part', 'where', 'him', 'ten', 'four', 'elsewhere', 'below', 'same', 'about', 'anywhere', 'her', 'also', 'becoming', 'wherein'})
17.5.4 Combined NLTK and SKLearn Stop Words
combined_stop_words = list( set(nltk_stop_words) | set(sklearn_stop_words) )

print('Total combined NLTK and SKLearn Stopwords:', len( combined_stop_words ), '\n'
      'Stopwords shared among NLTK and SKlearn :', len( list( set(nltk_stop_words) & set(sklearn_stop_words)) ))
#:> Total combined NLTK and SKLearn Stopwords: 378
#:> Stopwords shared among NLTK and SKlearn : 119
17.6 Normalizing
Similar tokens are combined into a single normalized form. This reduces the vocabulary size.
17.6.1 Case Folding
If tokens aren't case normalized, you will end up with a large word list. However, some information is often communicated by the capitalization of a word, such as the names of places. If names are important, consider treating proper nouns separately.
tokens = ['House','Visitor','Center']
[ x.lower() for x in tokens ]
#:> ['house', 'visitor', 'center']
17.6.2 Stemming
- The output of a stemmer is not necessarily a proper word
- Automatically converts words to lower case
- The Porter stemmer is a lifetime of refinement, in 300 lines of Python code
- Stemming is faster than lemmatization
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
tokens = ('house','Housing','hOuses', 'Malicious','goodness')
[ stemmer.stem(x) for x in tokens ]
#:> ['hous', 'hous', 'hous', 'malici', 'good']
17.6.3 Lemmatization
NLTK uses connections within the Princeton WordNet graph for word meanings.
nltk.download('wordnet')
#:> True
#:>
#:> [nltk_data] Downloading package wordnet to /home/msfz751/nltk_data...
#:> [nltk_data] Package wordnet is already up-to-date!
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print( lemmatizer.lemmatize("better", pos='a'), '\n',
       lemmatizer.lemmatize("better", pos='n') )
#:> good
#:> better
print( lemmatizer.lemmatize("good", pos ='a'), '\n',
"good", pos ='n') ) lemmatizer.lemmatize(
#:> good
#:> good
17.6.4 Comparing Stemming and Lemmatization
- Lemmatization is slower than stemming
- Lemmatization is better at retaining meaning
- Lemmatization produces valid English words
- Stemming does not necessarily produce valid English words
- Both reduce vocabulary size, but increase ambiguity
- For search engine applications, stemming and lemmatization improve recall as they associate more documents with the same query words, at the cost of reduced precision and accuracy.
For a search-based chatbot where accuracy is more important, it should first search with unnormalized words. A side-by-side comparison of stemming and lemmatization follows below.
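A quick side-by-side comparison (the word list here is my own, for illustration):
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['geese', 'meeting', 'studies', 'better']
for w in words:
    print(w, '--> stem:', stemmer.stem(w),
             '| lemma (as noun):', lemmatizer.lemmatize(w, pos='n'))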
17.7 Wordnet
WordNet® is a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual-semantic and lexical relations.
WordNet superficially resembles a thesaurus, in that it groups words together based on their meanings. However, there are some important distinctions:
- WordNet interlinks not just word forms—strings of letters—but specific senses of words. As a result, words that are found in close proximity to one another in the network are semantically disambiguated
- WordNet labels the semantic relations among words, whereas the groupings of words in a thesaurus do not follow any explicit pattern other than meaning similarity
17.7.1 NLTK and Wordnet
NLTK (version 3.7.6) includes the English WordNet (147,307 words and 117,659 synonym sets)
from nltk.corpus import wordnet as wn
s = set( wn.all_synsets() )
w = set( wn.words() )
print('Total words in wordnet : ' , len(w),
      '\nTotal synsets in wordnet: ' , len(s) )
#:> Total words in wordnet : 147306
#:> Total synsets in wordnet: 117659
17.7.2 Synset
17.7.2.1 Notation
A synset is the basic construct for a word in WordNet. It contains the word itself, with its POS tag and usage: word.pos.nn
wn.synset('breakdown.n.03')
#:> Synset('breakdown.n.03')
Breaking down the construct:
'breakdown' = Word
'n' = Part of Speech
'03' = Usage (01 for most common usage and a higher number would indicate lesser common usages)
17.7.3 Synsets
- Synsets is a collection of synsets, which are synonyms that share a common meaning
- A synset (member of Synsets) is identified with a 3-part name of the form word.pos.nn
- A synset can contain one or more lemmas, which represent a specific sense of a specific word
- A synset can contain one or more Hyponyms and Hypernyms. These are specific and generalized concepts respectively. For example, 'beach house' and 'guest house' are hyponyms of 'house': they are more specific concepts of 'house'. And 'house' is a hypernym of 'guest house' because it is the more general concept
- Hyponyms and Hypernyms are also called lexical relations
dogs = wn.synsets('dog')   # get all synsets for word 'dog'

for d in dogs:             ## iterate through each Synset
    print(d, ':\nDefinition:', d.definition(),
          '\nExample:', d.examples(),
          '\nLemmas:', d.lemma_names(),
          '\nHyponyms:', d.hyponyms(),
          '\nHypernyms:', d.hypernyms(), '\n\n')
#:> Synset('dog.n.01') :
#:> Definition: a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
#:> Example: ['the dog barked all night']
#:> Lemmas: ['dog', 'domestic_dog', 'Canis_familiaris']
#:> Hyponyms: [Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('working_dog.n.01')]
#:> Hypernyms: [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
#:>
#:>
#:> Synset('frump.n.01') :
#:> Definition: a dull unattractive unpleasant girl or woman
#:> Example: ['she got a reputation as a frump', "she's a real dog"]
#:> Lemmas: ['frump', 'dog']
#:> Hyponyms: []
#:> Hypernyms: [Synset('unpleasant_woman.n.01')]
#:>
#:>
#:> Synset('dog.n.03') :
#:> Definition: informal term for a man
#:> Example: ['you lucky dog']
#:> Lemmas: ['dog']
#:> Hyponyms: []
#:> Hypernyms: [Synset('chap.n.01')]
#:>
#:>
#:> Synset('cad.n.01') :
#:> Definition: someone who is morally reprehensible
#:> Example: ['you dirty dog']
#:> Lemmas: ['cad', 'bounder', 'blackguard', 'dog', 'hound', 'heel']
#:> Hyponyms: [Synset('perisher.n.01')]
#:> Hypernyms: [Synset('villain.n.01')]
#:>
#:>
#:> Synset('frank.n.02') :
#:> Definition: a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll
#:> Example: []
#:> Lemmas: ['frank', 'frankfurter', 'hotdog', 'hot_dog', 'dog', 'wiener', 'wienerwurst', 'weenie']
#:> Hyponyms: [Synset('vienna_sausage.n.01')]
#:> Hypernyms: [Synset('sausage.n.01')]
#:>
#:>
#:> Synset('pawl.n.01') :
#:> Definition: a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward
#:> Example: []
#:> Lemmas: ['pawl', 'detent', 'click', 'dog']
#:> Hyponyms: []
#:> Hypernyms: [Synset('catch.n.06')]
#:>
#:>
#:> Synset('andiron.n.01') :
#:> Definition: metal supports for logs in a fireplace
#:> Example: ['the andirons were too hot to touch']
#:> Lemmas: ['andiron', 'firedog', 'dog', 'dog-iron']
#:> Hyponyms: []
#:> Hypernyms: [Synset('support.n.10')]
#:>
#:>
#:> Synset('chase.v.01') :
#:> Definition: go after with the intent to catch
#:> Example: ['The policeman chased the mugger down the alley', 'the dog chased the rabbit']
#:> Lemmas: ['chase', 'chase_after', 'trail', 'tail', 'tag', 'give_chase', 'dog', 'go_after', 'track']
#:> Hyponyms: [Synset('hound.v.01'), Synset('quest.v.02'), Synset('run_down.v.07'), Synset('tree.v.03')]
#:> Hypernyms: [Synset('pursue.v.02')]
17.8 Part Of Speech (POS)
- In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context—i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph
- This is useful for Information Retrieval, Text to Speech, Word Sense Disambiguation
- The primary target of Part-of-Speech(POS) tagging is to identify the grammatical group of a given word. Whether it is a NOUN, PRONOUN, ADJECTIVE, VERB, ADVERBS, etc. based on the context
- A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc
17.8.1 Tag Sets
- Schools commonly teach that there are 9 parts of speech in English: noun, verb, article, adjective, preposition, pronoun, adverb, conjunction, and interjection
- However, there are clearly many more categories and sub-categories
nltk.download('universal_tagset')
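With the universal tagset downloaded, nltk.pos_tag can map the fine-grained Penn Treebank tags onto these coarse categories (the sentence below is my own example; it assumes the averaged_perceptron_tagger data, downloaded in the next subsection, is available):
import nltk

tokens = nltk.word_tokenize("The quick brown fox jumps over the lazy dog")
print( nltk.pos_tag(tokens, tagset='universal') )   ## coarse tags such as DET, ADJ, NOUN, VERB, ADP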
17.8.2 Tagging Techniques
There are a few types of tagging techniques:
- Lexical-based
- Rule-based (Brill)
- Probabilistic/Stochastic-based (Conditional Random Fields-CRFs, Hidden Markov Models-HMM)
- Neural network-based
NLTK supports the below taggers:
from nltk.tag.brill import BrillTagger
from nltk.tag.hunpos import HunposTagger
from nltk.tag.stanford import StanfordTagger, StanfordPOSTagger, StanfordNERTagger
from nltk.tag.hmm import HiddenMarkovModelTagger, HiddenMarkovModelTrainer
from nltk.tag.senna import SennaTagger, SennaChunkTagger, SennaNERTagger
from nltk.tag.crf import CRFTagger
from nltk.tag.perceptron import PerceptronTagger
17.8.2.1 nltk PerceptronTagger
PerceptronTagger produces tags using the Penn Treebank tagset
from nltk.tag import PerceptronTagger
nltk.download('averaged_perceptron_tagger')
#:> True
#:>
#:> [nltk_data] Downloading package averaged_perceptron_tagger to
#:> [nltk_data] /home/msfz751/nltk_data...
#:> [nltk_data] Package averaged_perceptron_tagger is already up-to-
#:> [nltk_data] date!
tagger = PerceptronTagger()
print('Tagger Classes:', tagger.classes,
      '\n\n# Classes:', len(tagger.classes))
#:> Tagger Classes: {'PRP', '(', 'PRP$', 'JJR', 'JJ', 'FW', ':', 'RBS', 'VB', 'RBR', '#', 'CD', ',', 'RB', '``', 'NNPS', ')', 'NN', 'WP', '.', 'TO', 'NNP', 'CC', 'PDT', "''", 'WP$', 'SYM', 'WDT', 'LS', 'IN', 'VBD', 'POS', 'JJS', 'VBP', 'UH', '$', 'DT', 'MD', 'WRB', 'VBG', 'RP', 'VBZ', 'NNS', 'VBN', 'EX'}
#:>
#:> # Classes: 45
17.8.3 Performing Tagging nltk.pos_tag()
Tagging works sentence by sentence:
- The document first must be split into sentences
- Each sentence needs to be tokenized into words
- By default, NLTK uses PerceptronTagger
#nltk.download('averaged_perceptron_tagger')
#import nltk
#from nltk.tokenize import word_tokenize, sent_tokenize

doc = '''Sukanya, Rajib and Naba are my good friends. Sukanya is getting married next year. Marriage is a big step in one's life. It is both exciting and frightening. But friendship is a sacred bond between people. It is a special kind of love between us. Many of you must have tried searching for a friend but never found the right one.'''

sentences = nltk.sent_tokenize(doc)
for sentence in sentences:
    tokens = nltk.word_tokenize(sentence)
    tagged = nltk.pos_tag(tokens)
    print(tagged)
#:> [('Sukanya', 'NNP'), (',', ','), ('Rajib', 'NNP'), ('and', 'CC'), ('Naba', 'NNP'), ('are', 'VBP'), ('my', 'PRP$'), ('good', 'JJ'), ('friends', 'NNS'), ('.', '.')]
#:> [('Sukanya', 'NNP'), ('is', 'VBZ'), ('getting', 'VBG'), ('married', 'VBN'), ('next', 'JJ'), ('year', 'NN'), ('.', '.')]
#:> [('Marriage', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('big', 'JJ'), ('step', 'NN'), ('in', 'IN'), ('one', 'CD'), ("'s", 'POS'), ('life', 'NN'), ('.', '.')]
#:> [('It', 'PRP'), ('is', 'VBZ'), ('both', 'DT'), ('exciting', 'VBG'), ('and', 'CC'), ('frightening', 'NN'), ('.', '.')]
#:> [('But', 'CC'), ('friendship', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('sacred', 'JJ'), ('bond', 'NN'), ('between', 'IN'), ('people', 'NNS'), ('.', '.')]
#:> [('It', 'PRP'), ('is', 'VBZ'), ('a', 'DT'), ('special', 'JJ'), ('kind', 'NN'), ('of', 'IN'), ('love', 'NN'), ('between', 'IN'), ('us', 'PRP'), ('.', '.')]
#:> [('Many', 'JJ'), ('of', 'IN'), ('you', 'PRP'), ('must', 'MD'), ('have', 'VB'), ('tried', 'VBN'), ('searching', 'VBG'), ('for', 'IN'), ('a', 'DT'), ('friend', 'NN'), ('but', 'CC'), ('never', 'RB'), ('found', 'VBD'), ('the', 'DT'), ('right', 'JJ'), ('one', 'NN'), ('.', '.')]
17.9 Sentiment
17.9.1 NLTK and Senti-Wordnet
- SentiWordNet extends Wordnet Synsets with positive and negative sentiment scores
- The extension was achieved via a complex mix of propagation methods and classifiers. It is thus not a gold standard resource like WordNet (which was compiled by humans), but it has proven useful in a wide range of tasks
- It contains a similar number of synsets as WordNet
from nltk.corpus import sentiwordnet as swn
nltk.download('sentiwordnet')
#:> True
#:>
#:> [nltk_data] Downloading package sentiwordnet to
#:> [nltk_data] /home/msfz751/nltk_data...
#:> [nltk_data] Package sentiwordnet is already up-to-date!
s = set( swn.all_senti_synsets() )
print('Total synsets in senti-wordnet : ' , len(s))
#:> Total synsets in senti-wordnet : 117659
17.9.1.1 Senti-Synset
- Senti-Wordnet extends WordNet with three (3) sentiment scores: positive, negative, objective
- All three scores add up to 1.0
breakdown = swn.senti_synset('breakdown.n.03')

print(
    breakdown, '\n',
    'Positive:', breakdown.pos_score(), '\n',
    'Negative:', breakdown.neg_score(), '\n',
    'Objective:', breakdown.obj_score()
)
#:> <breakdown.n.03: PosScore=0.0 NegScore=0.25>
#:> Positive: 0.0
#:> Negative: 0.25
#:> Objective: 0.75
17.9.1.2 Senti-Synsets
Get all the synonyms, with and without the POS information
print( list(swn.senti_synsets('slow')), '\n\n', ## without POS tag
list(swn.senti_synsets('slow', 'a')) ) ## with POS tag
#:> [SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('dense.s.04'), SentiSynset('slow.a.04'), SentiSynset('boring.s.01'), SentiSynset('dull.s.08'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]
#:>
#:> [SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), SentiSynset('dense.s.04'), SentiSynset('slow.a.04'), SentiSynset('boring.s.01'), SentiSynset('dull.s.08')]
Get the score for first synset
first_synset = list(swn.senti_synsets('slow','a'))[0]

print(
    first_synset, '\n',
    'Positive:', first_synset.pos_score(), '\n',
    'Negative:', first_synset.neg_score(), '\n',
    'Objective:', first_synset.obj_score()
)
#:> <slow.a.01: PosScore=0.0 NegScore=0.0>
#:> Positive: 0.0
#:> Negative: 0.0
#:> Objective: 1.0
17.9.1.3 Converting POS-tag into Wordnet POS-tag
Using a Function
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn
def penn_to_wn(tag):
"""
Convert between the PennTreebank tags to simple Wordnet tags
"""
if tag.startswith('J'):
return wn.ADJ
elif tag.startswith('N'):
return wn.NOUN
elif tag.startswith('R'):
return wn.ADV
elif tag.startswith('V'):
return wn.VERB
return None
= word_tokenize("Star Wars is a wonderful movie")
wt = nltk.pos_tag(wt)
penn_tags = [ (x, penn_to_wn(y)) for (x,y) in penn_tags ]
wordnet_tags
print(
'Penn Tags :', penn_tags,
'\nWordnet Tags :', wordnet_tags)
#:> Penn Tags : [('Star', 'NNP'), ('Wars', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('wonderful', 'JJ'), ('movie', 'NN')]
#:> Wordnet Tags : [('Star', 'n'), ('Wars', 'n'), ('is', 'v'), ('a', None), ('wonderful', 'a'), ('movie', 'n')]
Using defaultdict
import nltk
from nltk.corpus import wordnet as wn
from nltk import word_tokenize, pos_tag
from collections import defaultdict
tag_map = defaultdict(lambda : None)
tag_map['J'] = wn.ADJ
tag_map['R'] = wn.ADV
tag_map['V'] = wn.VERB
tag_map['N'] = wn.NOUN

wt = word_tokenize("Star Wars is a wonderful movie")
penn_tags = nltk.pos_tag(wt)
wordnet_tags = [ (x, tag_map[y[0]]) for (x,y) in penn_tags ]
print(
'Penn Tags :', penn_tags,
'\nWordnet Tags :', wordnet_tags)
#:> Penn Tags : [('Star', 'NNP'), ('Wars', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('wonderful', 'JJ'), ('movie', 'NN')]
#:> Wordnet Tags : [('Star', 'n'), ('Wars', 'n'), ('is', 'v'), ('a', None), ('wonderful', 'a'), ('movie', 'n')]
17.9.2 Vader
- It is a rule-based sentiment analyzer containing 7,503 lexicon entries
- It is good for social media because the lexicon contains emojis and short-form text
- Contains only a few n-grams
- Supported by NLTK, or install vader separately (pip install vaderSentiment)
17.9.2.1 Vader Lexicon
The lexicon is a dictionary. To make it easy to inspect, convert it into a list:
- Step 1: Convert the dict to dict_items, a view of (key, value) pairs
- Step 2: Unpack dict_items into a list
#from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer   ## separate pip-installed library
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
#:> True
#:>
#:> [nltk_data] Downloading package vader_lexicon to
#:> [nltk_data] /home/msfz751/nltk_data...
#:> [nltk_data] Package vader_lexicon is already up-to-date!
vader_lex  = SentimentIntensityAnalyzer().lexicon   # get the lexicon dictionary
vader_list = list(vader_lex.items())                # convert to items then list
print( 'Total Vader Lexicon:', len(vader_lex), '\n',
       vader_list[1:10], vader_list[220:240] )
#:> Total Vader Lexicon: 7502
#:> [('%)', -0.4), ('%-)', -1.5), ('&-:', -0.4), ('&:', -0.7), ("( '}{' )", 1.6), ('(%', -0.9), ("('-:", 2.2), ("(':", 2.3), ('((-:', 2.1)] [('b^d', 2.6), ('cwot', -2.3), ("d-':", -2.5), ('d8', -3.2), ('d:', 1.2), ('d:<', -3.2), ('d;', -2.9), ('d=', 1.5), ('doa', -2.3), ('dx', -3.0), ('ez', 1.5), ('fav', 2.0), ('fcol', -1.8), ('ff', 1.8), ('ffs', -2.8), ('fkm', -2.4), ('foaf', 1.8), ('ftw', 2.0), ('fu', -3.7), ('fubar', -3.0)]
There are only four n-grams in the lexicon
print('List of N-grams: ')
#:> List of N-grams:
[ (tok, score) for tok, score in vader_list if " " in tok ]
#:> [("( '}{' )", 1.6), ("can't stand", -2.0), ('fed up', -1.8), ('screwed up', -1.5)]
If stemming or lemmatization is used, stem/lemmatize the vader lexicon too (a minimal sketch follows after the output below)
[ (tok, score) for tok, score in vader_list if "lov" in tok ]
#:> [('beloved', 2.3), ('lovable', 3.0), ('love', 3.2), ('loved', 2.9), ('lovelies', 2.2), ('lovely', 2.8), ('lover', 2.8), ('loverly', 2.8), ('lovers', 2.4), ('loves', 2.7), ('loving', 2.9), ('lovingly', 3.2), ('lovingness', 2.7), ('unlovable', -2.7), ('unloved', -1.9), ('unlovelier', -1.9), ('unloveliest', -1.9), ('unloveliness', -2.0), ('unlovely', -2.1), ('unloving', -2.3)]
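A minimal sketch of what that could look like, stemming the lexicon keys and averaging the scores of entries that collapse to the same stem (the averaging choice is my own, not from the original text):
from collections import defaultdict
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

## group lexicon scores by the stemmed form of each token
stemmed_scores = defaultdict(list)
for tok, score in vader_lex.items():
    stemmed_scores[stemmer.stem(tok)].append(score)

## average the scores per stem
stemmed_lex = { stem: sum(scores) / len(scores) for stem, scores in stemmed_scores.items() }
print('Lexicon size:', len(vader_lex), '--> after stemming:', len(stemmed_lex))
print('Score for stem "love":', round(stemmed_lex['love'], 2))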
17.9.2.2 Polarity Scoring
The scoring result is a dictionary of:
- neg
- neu
- pos
- compound
neg, neu, and pos add up to 1.0. The example below shows the polarity for two sentences:
= ["Python is a very useful but hell difficult to learn",
corpus ":) :) :("]
for doc in corpus:
print(doc, '-->', "\n:", SentimentIntensityAnalyzer().polarity_scores(doc) )
#:> Python is a very useful but hell difficult to learn -->
#:> : {'neg': 0.554, 'neu': 0.331, 'pos': 0.116, 'compound': -0.8735}
#:> :) :) :( -->
#:> : {'neg': 0.326, 'neu': 0.0, 'pos': 0.674, 'compound': 0.4767}
17.10 Feature Representation
17.10.1 The Data
A corpus is a collection of multiple documents. In the example below, each document is represented by a sentence.
corpus = [
    'This is the first document, :)',
    'This document is the second document.',
    'And this is a third one',
    'Is this the first document?',
]
17.10.2 Frequency Count
Using pure frequency counts as features will obviously be biased towards long documents (which contain many words, hence words within the document will have very high frequencies).
17.10.2.1 + Tokenizer
Default Tokenizer
By default, the vectorizer applies a tokenizer that selects alphanumeric words of at least 2 characters. Below, the vectorizer is trained using fit_transform().
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()         # initialize the vectorizer
X = vec.fit_transform(corpus)   # FIT the vectorizer, return fitted data
print(pd.DataFrame(X.toarray(), columns=vec.get_feature_names()), '\n\n',
      'Vocabulary: ', vec.vocabulary_)
#:> and document first is one second the third this
#:> 0 0 1 1 1 0 0 1 0 1
#:> 1 0 2 0 1 0 1 1 0 1
#:> 2 1 0 0 1 1 0 0 1 1
#:> 3 0 1 1 1 0 0 1 0 1
#:>
#:> Vocabulary: {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
Custom Tokenizer
You can use a custom tokenizer, which is a function that returns a list of words. The example below uses nltk's RegexpTokenizer, which retains runs of one or more alphanumeric characters.
my_tokenizer = RegexpTokenizer(r'[a-zA-Z0-9\']+')         ## Custom Tokenizer
vec2 = CountVectorizer(tokenizer=my_tokenizer.tokenize)   ## custom tokenizer's function
X2 = vec2.fit_transform(corpus)                           # FIT the vectorizer, return fitted data
print(pd.DataFrame(X2.toarray(), columns=vec2.get_feature_names()), '\n\n',
      'Vocabulary: ', vec.vocabulary_)
#:> a and document first is one second the third this
#:> 0 0 0 1 1 1 0 0 1 0 1
#:> 1 0 0 2 0 1 0 1 1 0 1
#:> 2 1 1 0 0 1 1 0 0 1 1
#:> 3 0 0 1 1 1 0 0 1 0 1
#:>
#:> Vocabulary: {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
1 and 2-Word-Gram Tokenizer
Use the ngram_range parameter to specify the range of grams needed.
vec3 = CountVectorizer(ngram_range=(1,2))   # initialize the vectorizer
X3 = vec3.fit_transform(corpus)             # FIT the vectorizer, return fitted data
print(pd.DataFrame(X3.toarray(), columns=vec3.get_feature_names()), '\n\n',
      'Vocabulary: ', vec.vocabulary_)
#:> and and this document document is first ... third one this this document \
#:> 0 0 0 1 0 1 ... 0 1 0
#:> 1 0 0 2 1 0 ... 0 1 1
#:> 2 1 1 0 0 0 ... 1 1 0
#:> 3 0 0 1 0 1 ... 0 1 0
#:>
#:> this is this the
#:> 0 1 0
#:> 1 0 0
#:> 2 1 0
#:> 3 0 1
#:>
#:> [4 rows x 22 columns]
#:>
#:> Vocabulary: {'this': 8, 'is': 3, 'the': 6, 'first': 2, 'document': 1, 'second': 5, 'and': 0, 'third': 7, 'one': 4}
Apply Trained Vectorizer
Once the vectorizer has been trained, you can apply it to a new corpus. Tokens not in the vectorizer's vocabulary are ignored.
= ["My Name is Charlie Angel", "I love to watch Star Wars"]
new_corpus = vec.transform(new_corpus)
XX =vec.get_feature_names()) pd.DataFrame(XX.toarray(), columns
#:> and document first is one second the third this
#:> 0 0 0 0 1 0 0 0 0 0
#:> 1 0 0 0 0 0 0 0 0 0
17.10.2.2 + Stop Words
The vectorizer can optionally be used with a stop words list. Use stop_words='english' to apply filtering with sklearn's built-in stop words. You can replace 'english' with another word list object.
vec4 = CountVectorizer(stop_words='english')   ## sklearn stopwords list
X4 = vec4.fit_transform(corpus)
pd.DataFrame(X4.toarray(), columns=vec4.get_feature_names())
#:> document second
#:> 0 1 0
#:> 1 2 1
#:> 2 0 0
#:> 3 1 0
17.10.3 TFIDF
17.10.3.1 Equation
\[tf(t,d) = \text{occurrences of term } t \text{ in document } d \\ n = \text{number of documents} \\ df(t) = \text{number of documents containing term } t \\ idf(t) = \log \frac{n}{df(t)} + 1 \\ idf(t) = \log \frac{1+n}{1+df(t)} + 1 \text{ .... smoothing, prevents zero division} \\ tfidf(t,d) = tf(t,d) \cdot idf(t) \text{ .... raw, no normalization on } tf(t,d) \\ tfidf(t,d) = \frac{tf(t,d)}{\lVert V \rVert_2} \cdot idf(t) \text{ .... tf normalized with the Euclidean norm}\]
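To connect the formulas to the numbers produced in the next subsection, here is a small manual check using the same toy counts as the apple/banana/durian corpus below (no smoothing, no normalization; sklearn uses the natural log):
import numpy as np

tf = {'apple': 5, 'banana': 1, 'durian': 0}   ## term counts of document 0: "apple apple apple apple apple banana"
n  = 4                                        ## number of documents in the toy corpus
df = {'apple': 3, 'banana': 2, 'durian': 1}   ## document frequency of each term

## raw tfidf: tf * (ln(n/df) + 1)
tfidf = { t: tf[t] * (np.log(n / df[t]) + 1) for t in tf }
print(tfidf)   ## apple ≈ 6.438, banana ≈ 1.693, durian = 0.0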
17.10.3.2 TfidfTransformer
To generate TFIDF vectors, first run CountVectorizer to get the frequency vector matrix, then feed the output into this transformer.
from sklearn.feature_extraction.text import TfidfTransformer

corpus = [
    "apple apple apple apple apple banana",
    "apple apple",
    "apple apple apple banana",
    "durian durian durian"]

count_vec = CountVectorizer()
X = count_vec.fit_transform(corpus)

transformer1 = TfidfTransformer(smooth_idf=False, norm=None)
transformer2 = TfidfTransformer(smooth_idf=False, norm='l2')
transformer3 = TfidfTransformer(smooth_idf=True,  norm='l2')

tfidf1 = transformer1.fit_transform(X)
tfidf2 = transformer2.fit_transform(X)
tfidf3 = transformer3.fit_transform(X)
print(
'Frequency Count: \n', pd.DataFrame(X.toarray(), columns=count_vec.get_feature_names()),
'\n\nVocabulary: ', count_vec.vocabulary_,
'\n\nTFIDF Without Norm:\n',tfidf1.toarray(),
'\n\nTFIDF with L2 Norm:\n',tfidf2.toarray(),
'\n\nTFIDF with L2 Norm (smooth):\n',tfidf3.toarray())
#:> Frequency Count:
#:> apple banana durian
#:> 0 5 1 0
#:> 1 2 0 0
#:> 2 3 1 0
#:> 3 0 0 3
#:>
#:> Vocabulary: {'apple': 0, 'banana': 1, 'durian': 2}
#:>
#:> TFIDF Without Norm:
#:> [[6.43841036 1.69314718 0. ]
#:> [2.57536414 0. 0. ]
#:> [3.86304622 1.69314718 0. ]
#:> [0. 0. 7.15888308]]
#:>
#:> TFIDF with L2 Norm:
#:> [[0.96711783 0.25432874 0. ]
#:> [1. 0. 0. ]
#:> [0.91589033 0.40142857 0. ]
#:> [0. 0. 1. ]]
#:>
#:> TFIDF with L2 Norm (smooth):
#:> [[0.97081492 0.23982991 0. ]
#:> [1. 0. 0. ]
#:> [0.92468843 0.38072472 0. ]
#:> [0. 0. 1. ]]
17.10.3.3 TfidfVectorizer
This vectorizer gives end-to-end processing from corpus to TFIDF vector matrix, including tokenization and stop word removal.
from sklearn.feature_extraction.text import TfidfVectorizer

my_tokenizer = RegexpTokenizer(r'[a-zA-Z0-9\']+')   ## Custom Tokenizer

vec1 = TfidfVectorizer(tokenizer=my_tokenizer.tokenize, stop_words='english')                    # default smooth_idf=True, norm='l2'
vec2 = TfidfVectorizer(tokenizer=my_tokenizer.tokenize, stop_words='english', smooth_idf=False)
vec3 = TfidfVectorizer(tokenizer=my_tokenizer.tokenize, stop_words='english', norm=None)

X1 = vec1.fit_transform(corpus)   # FIT the vectorizer, return fitted data
X2 = vec2.fit_transform(corpus)
X3 = vec3.fit_transform(corpus)

print(
    'TFIDF Features (Default with Smooth and L2 Norm):\n',
    pd.DataFrame(X1.toarray().round(3), columns=vec1.get_feature_names()),
    '\n\nTFIDF Features (without Smoothing):\n',
    pd.DataFrame(X2.toarray().round(3), columns=vec2.get_feature_names()),
    '\n\nTFIDF Features (without L2 Norm):\n',
    pd.DataFrame(X3.toarray().round(3), columns=vec3.get_feature_names())
)
#:> TFIDF Features (Default with Smooth and L2 Norm):
#:> apple banana durian
#:> 0 0.971 0.240 0.0
#:> 1 1.000 0.000 0.0
#:> 2 0.925 0.381 0.0
#:> 3 0.000 0.000 1.0
#:>
#:> TFIDF Features (without Smoothing):
#:> apple banana durian
#:> 0 0.967 0.254 0.0
#:> 1 1.000 0.000 0.0
#:> 2 0.916 0.401 0.0
#:> 3 0.000 0.000 1.0
#:>
#:> TFIDF Features (without L2 Norm):
#:> apple banana durian
#:> 0 6.116 1.511 0.000
#:> 1 2.446 0.000 0.000
#:> 2 3.669 1.511 0.000
#:> 3 0.000 0.000 5.749
17.11 Application
17.11.1 Document Similarity
Documents 1 and 2 are term-count multiples of Document 0, therefore their pairwise cosine similarities are all the same (1.0).
documents = (
    "apple apple banana",
    "apple apple banana apple apple banana",
    "apple apple banana apple apple banana apple apple banana")

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(documents)
from sklearn.metrics.pairwise import cosine_similarity
print('Cosine Similarity between doc0 and doc1:\n', cosine_similarity(tfidf_matrix[0], tfidf_matrix[1]))
#:> Cosine Similarity between doc0 and doc1:
#:> [[1.]]
print('Cosine Similarity between doc1 and doc2:\n', cosine_similarity(tfidf_matrix[1], tfidf_matrix[2]))
#:> Cosine Similarity between doc1 and doc2:
#:> [[1.]]
print('Cosine Similarity between doc0 and doc2:\n', cosine_similarity(tfidf_matrix[0], tfidf_matrix[2]))
#:> Cosine Similarity between doc0 and doc2:
#:> [[1.]]
17.12 Naive Bayes
17.12.1 Libraries
from nlpia.data.loaders import get_data
from nltk.tokenize.casual import casual_tokenize
from collections import Counter
17.12.2 The Data
movies = get_data('hutto_movies')   # download data
print(movies.head(), '\n\n',
      movies.describe())
#:> sentiment text
#:> id
#:> 1 2.266667 The Rock is destined to be the 21st Century's ...
#:> 2 3.533333 The gorgeously elaborate continuation of ''The...
#:> 3 -0.600000 Effective but too tepid biopic
#:> 4 1.466667 If you sometimes like to go to the movies to h...
#:> 5 1.733333 Emerges as something rare, an issue movie that...
#:>
#:> sentiment
#:> count 10605.000000
#:> mean 0.004831
#:> std 1.922050
#:> min -3.875000
#:> 25% -1.769231
#:> 50% -0.080000
#:> 75% 1.833333
#:> max 3.941176
17.12.3 Bag of Words
- Tokenize each record, remove single-character tokens, then convert into a list of counters (word-frequency pairs).
- Each item in the list is a Counter, which represents the word frequencies within that record
bag_of_words = []
for text in movies.text:
    tokens = casual_tokenize(text, reduce_len=True, strip_handles=True)   # tokenize
    tokens = [x for x in tokens if len(x) > 1]                            ## remove single char token
    ## add to our BoW (note: the stray strip_handles=True keyword passed to Counter
    ## is counted as an extra token, as visible in the output below)
    bag_of_words.append( Counter(tokens, strip_handles=True) )

unique_words = list( set([ y for x in bag_of_words for y in x.keys() ]) )
print("Total Rows: ", len(bag_of_words),'\n\n',
'Row 1 BoW: ',bag_of_words[:1],'\n\n', # see the first two records
'Row 2 BoW: ', bag_of_words[:2], '\n\n',
'Total Unique Words: ', len(unique_words))
#:> Total Rows: 10605
#:>
#:> Row 1 BoW: [Counter({'to': 2, 'The': 1, 'Rock': 1, 'is': 1, 'destined': 1, 'be': 1, 'the': 1, '21st': 1, "Century's": 1, 'new': 1, 'Conan': 1, 'and': 1, 'that': 1, "he's": 1, 'going': 1, 'make': 1, 'splash': 1, 'even': 1, 'greater': 1, 'than': 1, 'Arnold': 1, 'Schwarzenegger': 1, 'Jean': 1, 'Claud': 1, 'Van': 1, 'Damme': 1, 'or': 1, 'Steven': 1, 'Segal': 1, 'strip_handles': 1})]
#:>
#:> Row 2 BoW: [Counter({'to': 2, 'The': 1, 'Rock': 1, 'is': 1, 'destined': 1, 'be': 1, 'the': 1, '21st': 1, "Century's": 1, 'new': 1, 'Conan': 1, 'and': 1, 'that': 1, "he's": 1, 'going': 1, 'make': 1, 'splash': 1, 'even': 1, 'greater': 1, 'than': 1, 'Arnold': 1, 'Schwarzenegger': 1, 'Jean': 1, 'Claud': 1, 'Van': 1, 'Damme': 1, 'or': 1, 'Steven': 1, 'Segal': 1, 'strip_handles': 1}), Counter({'of': 4, 'The': 2, 'gorgeously': 1, 'elaborate': 1, 'continuation': 1, 'Lord': 1, 'the': 1, 'Rings': 1, 'trilogy': 1, 'is': 1, 'so': 1, 'huge': 1, 'that': 1, 'column': 1, 'words': 1, 'cannot': 1, 'adequately': 1, 'describe': 1, 'co': 1, 'writer': 1, 'director': 1, 'Peter': 1, "Jackson's": 1, 'expanded': 1, 'vision': 1, "Tolkien's": 1, 'Middle': 1, 'earth': 1, 'strip_handles': 1})]
#:>
#:> Total Unique Words: 20686
Convert NaN into 0, then convert all features into integers
bows_df = pd.DataFrame.from_records(bag_of_words)
bows_df = bows_df.fillna(0).astype(int)   # replace NaN with 0, change to integer
bows_df.head()
#:> The Rock is destined to ... Bearable Staggeringly ve muttering dissing
#:> 0 1 1 1 1 2 ... 0 0 0 0 0
#:> 1 2 0 1 0 0 ... 0 0 0 0 0
#:> 2 0 0 0 0 0 ... 0 0 0 0 0
#:> 3 0 0 1 0 4 ... 0 0 0 0 0
#:> 4 0 0 0 0 0 ... 0 0 0 0 0
#:>
#:> [5 rows x 20686 columns]
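The chapter stops at the bag-of-words matrix; as a minimal sketch of where this is headed, the counts can be fed to sklearn's MultinomialNB to predict whether a review's sentiment score is positive (the 0 threshold and the train/test split choices are my own, not from the original text):
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split

X = bows_df.values                        ## bag-of-words counts
y = (movies.sentiment > 0).astype(int)    ## 1 = positive review, 0 = negative

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

nb = MultinomialNB()
nb.fit(X_train, y_train)
print('Test accuracy:', round(nb.score(X_test, y_test), 3))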