Natural Language Processing in Python. Implementation code is also provided.

Natural Language Processing (or NLP) is the application of Machine Learning models to text and language. Teaching machines to understand what is said in the spoken and written word is the focus of Natural Language Processing. Whenever you dictate something into your iPhone / Android device and it is then converted to text, that's an NLP algorithm in action.

You can also use NLP on a text review to predict whether the review is a good one or a bad one, on an article to predict the categories you are trying to segment it into, or on a book to predict its genre. It can go further: you can use NLP to build a machine translator or a speech recognition system. Speaking of classification algorithms, most NLP algorithms are classification models, including Logistic Regression, Naive Bayes, CART (a model based on decision trees), Maximum Entropy (again related to decision trees), and Hidden Markov Models (models based on Markov processes).

A very well-known model in NLP is the Bag of Words model. It is used to preprocess the texts to classify before fitting the classification algorithms on the observations containing those texts: each text becomes a vector of word counts.
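As a quick illustration of the idea (not part of this article's pipeline), here is a minimal sketch using scikit-learn's CountVectorizer, the same class used later in this article, on two toy sentences; on scikit-learn versions older than 1.0, use get_feature_names() instead of get_feature_names_out():

from sklearn.feature_extraction.text import CountVectorizer

# two toy reviews, just to show what the word-count vectors look like
toy_corpus = ['the food was good', 'the food was not good']
toy_cv = CountVectorizer()
toy_X = toy_cv.fit_transform(toy_corpus).toarray()
print(toy_cv.get_feature_names_out())  # ['food' 'good' 'not' 'the' 'was']
print(toy_X)  # [[1 1 0 1 1], [1 1 1 1 1]] -- word counts per sentence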

In this part, you will understand and learn how to:

  • Clean texts to prepare them for the Machine Learning models,
  • Create a Bag of Words model,
  • Apply Machine Learning models onto this Bag of Words model.

 

The idea of the presented code is to classify a given review as positive or negative. We have collected a dataset with two columns and 1000 rows.

Here’s the step-by-step implementation and illustration of coding:

  1. Importing libraries
  2. Importing datasets
  3. Cleaning text
  4. Creating the Bag of Words model
  5. Splitting the data into Training and Test sets
  6. Training the Naive Bayes model on the Training set
  7. Predicting the Test set results
  8. Making the confusion matrix

 

Explaining NLP code written in Python:

Importing the libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

 

Importing the dataset

dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)

Here we are importing .tsv data. TSV stands for Tab-Separated Values; such files are essentially text files whose columns are separated by tab characters, which is why the delimiter parameter is set to '\t'. The quoting option is not a Boolean: quoting = 3 corresponds to csv.QUOTE_NONE, which tells the parser to treat quotation marks inside the reviews as ordinary characters rather than as field quoting, so it does affect how the file is parsed.
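As a side note, quoting = 3 is just the numeric value of csv.QUOTE_NONE, so a more explicit but equivalent version of the import looks like the sketch below; the (1000, 2) shape and the review/label column layout assume the standard Restaurant_Reviews.tsv file.

import csv
import pandas as pd

# quoting = csv.QUOTE_NONE (numeric value 3): quotation marks inside the
# reviews are kept as ordinary characters instead of being parsed as quoting
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = csv.QUOTE_NONE)
print(dataset.shape)   # expected: (1000, 2) -- one review column and one 0/1 label column
print(dataset.head())  # quick sanity check of the first few rows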

Cleaning the texts (Data Preprocessing)

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []
for i in range(0, 1000):
  # keep letters only, replacing every other character with a space
  review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
  review = review.lower()
  review = review.split()
  ps = PorterStemmer()
  # remove English stop words, but keep 'not' because it carries sentiment
  all_stopwords = stopwords.words('english')
  all_stopwords.remove('not')
  review = [ps.stem(word) for word in review if word not in set(all_stopwords)]
  review = ' '.join(review)
  corpus.append(review)
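To check what the cleaning step actually produces, you can compare one raw review with its cleaned counterpart; the exact output depends on your copy of the dataset.

print(dataset['Review'][0])  # raw review text
print(corpus[0])             # lowercased, stemmed, stop words (except 'not') removed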

 

Creating the Bag of Words model

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features = 1500)

X = cv.fit_transform(corpus).toarray()

y = dataset.iloc[:, -1].values
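A quick inspection of the resulting matrix helps confirm the Bag of Words step did what was expected; this sketch reuses the cv and X objects defined above and assumes scikit-learn 1.0 or later for get_feature_names_out().

print(X.shape)  # (number of reviews, up to 1500 word features)
print(cv.get_feature_names_out()[:10])  # first few words of the learned vocabulary
print(y[:10])   # the matching labels taken from the last column of the dataset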

 

Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
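With 1000 reviews and test_size = 0.20, this split should leave 800 reviews for training and 200 for testing, which you can verify:

print(X_train.shape, X_test.shape)  # expected: (800, n_features) (200, n_features)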

 

Training the Naive Bayes model on the Training set

from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

classifier.fit(X_train, y_train)
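A note on this design choice: GaussianNB assumes continuous, roughly Gaussian features, while the Bag of Words matrix contains word counts, for which MultinomialNB is often the more natural Naive Bayes variant. If you want to compare the two, a sketch of that alternative (not what this article uses) would look like this:

from sklearn.naive_bayes import MultinomialNB

# alternative Naive Bayes variant for count features; not part of the original pipeline
alt_classifier = MultinomialNB()
alt_classifier.fit(X_train, y_train)
print(alt_classifier.score(X_test, y_test))  # compare with the GaussianNB accuracy computed below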

 

Predicting the Test set results

y_pred = classifier.predict(X_test)

# print the predicted label next to the true label for each test review
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)), 1))

 

Making the Confusion Matrix

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)

print(cm)

accuracy_score(y_test, y_pred)
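Once the model is trained, you can also score a brand-new review by pushing it through the same cleaning and vectorization steps. The sketch below uses a made-up example sentence and reuses the re, PorterStemmer, all_stopwords, cv and classifier objects defined earlier; note the call to transform (not fit_transform), so the existing 1500-word vocabulary is reused.

new_review = 'The food was not good at all'  # hypothetical example review
new_review = re.sub('[^a-zA-Z]', ' ', new_review).lower().split()
ps = PorterStemmer()
new_review = [ps.stem(word) for word in new_review if word not in set(all_stopwords)]
new_review = ' '.join(new_review)
new_X = cv.transform([new_review]).toarray()
print(classifier.predict(new_X))  # 1 = positive, 0 = negative, assuming the dataset's 0/1 labels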

 

 

Get the complete Code:

I ran this code in a Jupyter Notebook with the Anaconda distribution. To run it on your machine, download Anaconda and execute the code in a notebook.

## Importing the libraries

import numpy as np

import matplotlib.pyplot as plt

import pandas as pd


## Importing the dataset

dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)


## Cleaning the texts

import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

corpus = []
for i in range(0, 1000):
  review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])
  review = review.lower()
  review = review.split()
  ps = PorterStemmer()
  all_stopwords = stopwords.words('english')
  all_stopwords.remove('not')
  review = [ps.stem(word) for word in review if word not in set(all_stopwords)]
  review = ' '.join(review)
  corpus.append(review)


## Creating the Bag of Words model

from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(max_features = 1500)

X = cv.fit_transform(corpus).toarray()

y = dataset.iloc[:, -1].values


## Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)


## Training the Naive Bayes model on the Training set

from sklearn.naive_bayes import GaussianNB

classifier = GaussianNB()

classifier.fit(X_train, y_train)

# (Jupyter output) GaussianNB(priors=None, var_smoothing=1e-09)


## Predicting the Test set results

y_pred = classifier.predict(X_test)

print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))


## Making the Confusion Matrix

from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)

print(cm)

accuracy_score(y_test, y_pred)
