Natural Language Processing Using Python

Prasad Kharkar is a Java enthusiast who is always keen to explore and learn Java technologies. He is SCJP, OCPWCD and OCEJPAD certified and aspires to be a Java architect.

Welcome to another interesting concept in machine learning. In this article, we will learn how to perform natural language processing using Python.

Natural Language Processing Using Python:

Natural language processing (NLP) is a field of machine learning concerned with interactions between computers and natural language, particularly with teaching computers to understand and process human language. In this article we will write a simple program for natural language processing using Python.

Problem Statement:

We have a dataset of 1000 reviews by customers of a restaurant. Each record consists of a customer's comment and whether they liked the restaurant or not. Our goal is to train a model on these reviews so that it can predict whether new reviews are positive or negative.

Preparing Dataset:

Download the tab-separated restaurant reviews file from SuperDataScience. Keep this file in the same folder as the Python script you are writing.
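The loading step can be sketched roughly as follows, assuming pandas is used to read the file. The helper name load_reviews and the file name Restaurant_Reviews.tsv are assumptions made for this illustration; the two in-memory rows only imitate the layout of the real file so that the snippet runs without the download.

```python
import io

import pandas as pd


def load_reviews(source):
    """Read a tab-separated reviews file (a path or a file-like object)."""
    # quoting=3 is csv.QUOTE_NONE: quote characters inside reviews stay plain text
    return pd.read_csv(source, delimiter="\t", quoting=3)


# Two stand-in rows in the same layout as the real file (Review, Liked).
# The second row is made up for illustration.
sample = "Review\tLiked\nWow... Loved this place.\t1\nCrust is not good.\t0\n"
dataset = load_reviews(io.StringIO(sample))
print(dataset)

# With the downloaded file (the file name is an assumption):
# dataset = load_reviews("Restaurant_Reviews.tsv")
```

Disabling quote handling is a cautious choice here, because reviews may contain quotation marks that would otherwise confuse the parser.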

Load the file into a variable named dataset and execute the code. In the variable explorer, you should see the dataset variable as follows:

sample dataset

  • The Review column contains the actual text of the customer's review
  • The Liked column contains 1 if the customer liked the experience, otherwise 0

Textual reviews contain full stops, spaces, capital letters, etc. We are only concerned with the meaningful words that help make good predictions, so we need to clean up these reviews first.

Data Cleanup:

For each review comment we will:

  • Keep only the letters in the comment.
  • Convert all letters to lower case.
  • Split each sentence into words.
  • Remove all stop words like a, an, the, this.
  • Treat all variations of the same word as one. For example, loves, loved and love will all be treated as love.
  • Combine the processed words to form a textual review again.

The steps below apply this processing to the first review:

  • Download the nltk stop words and import the necessary libraries.
  • Take only the capital and small letters from the zeroth row in the dataset, i.e. “Wow… Loved this place.”, and store the result in reviewWithOnlyLetters.
  • Convert reviewWithOnlyLetters to lowercase, giving reviewInLowerCase.
  • Split reviewInLowerCase into words, giving splittedReview.
  • Remove the stop words from splittedReview and stem the remaining words, giving stemmedReview.
  • Join the words in stemmedReview to form a cleaned review comment again.
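A minimal sketch of these steps, assuming a regular expression for the letter filter and nltk's PorterStemmer for stemming (the article does not name the stemmer, so that choice is an assumption). The names firstReview, stopWords and cleanedReview are introduced here for illustration, and the small fallback stop-word set is only there so the snippet still runs when the nltk corpus cannot be downloaded.

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Load the NLTK English stop words; fall back to a tiny hand-written set
# (an assumption of this sketch) if the corpus cannot be downloaded.
try:
    nltk.download("stopwords", quiet=True)
    stopWords = set(stopwords.words("english"))
except Exception:
    stopWords = {"a", "an", "the", "this", "is", "not"}

# Zeroth row of the dataset, written out literally to keep the sketch self-contained
firstReview = "Wow… Loved this place."

# Keep only letters; every other character becomes a space
reviewWithOnlyLetters = re.sub("[^a-zA-Z]", " ", firstReview)
reviewInLowerCase = reviewWithOnlyLetters.lower()
splittedReview = reviewInLowerCase.split()

# Drop stop words, then reduce each remaining word to its stem
stemmer = PorterStemmer()
stemmedReview = [stemmer.stem(word) for word in splittedReview if word not in stopWords]

# Join the stemmed words back into a single string
cleanedReview = " ".join(stemmedReview)
print(cleanedReview)  # wow love place
```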

That's it. We just need to perform the same cleanup on all 1000 records in the dataset and collect the cleaned reviews in a list called corpus.
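One way to apply the cleanup to every record is to wrap the steps in a function and loop over the Review column. The sketch below assumes the same regular expression and PorterStemmer as before; cleanReview and the two-item reviews list are hypothetical stand-ins, and with the real data the loop would run over dataset["Review"].

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Same stop-word loading as for the single review, with an offline fallback
# (the fallback set is an assumption of this sketch).
try:
    nltk.download("stopwords", quiet=True)
    stopWords = set(stopwords.words("english"))
except Exception:
    stopWords = {"a", "an", "the", "this", "is", "not"}

stemmer = PorterStemmer()


def cleanReview(review):
    """Keep letters only, lowercase, drop stop words, stem, and re-join."""
    words = re.sub("[^a-zA-Z]", " ", review).lower().split()
    return " ".join(stemmer.stem(word) for word in words if word not in stopWords)


# Stand-in for dataset["Review"]; the second review is made up for illustration
reviews = ["Wow… Loved this place.", "Crust is not good."]

corpus = [cleanReview(review) for review in reviews]
print(corpus)
```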

We are done with data cleanup. Now we will create the bag of words model.

Bag of Words Model:

We will convert our textual reviews into matrix form so that a model can be trained on them as usual. CountVectorizer from sklearn comes to the rescue. Its documentation describes it as follows:

Convert a collection of text documents to a matrix of token counts

This implementation produces a sparse representation of the counts using scipy.sparse.csr_matrix.

corpus is a list containing all reviews in their cleaned text form. With cv being a CountVectorizer instance, cv.fit_transform(corpus) converts this text into a large matrix where each column represents a word and each cell in a row holds the number of times that word occurs in the review (0 if it is absent).

  • x is the matrix of word counts built from the textual reviews (mostly zeros, hence sparse); it is the independent variable
  • y denotes whether the review is positive or negative; it is the dependent variable
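The bag-of-words step might look like the sketch below. The three-review corpus and the liked list are toy stand-ins for the cleaned reviews and the Liked column. Converting the result with toarray() is a choice made here because sklearn's GaussianNB expects a dense array rather than a sparse matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for the cleaned reviews and the Liked column
corpus = ["wow love place", "crust good", "love food"]
liked = [1, 0, 1]

cv = CountVectorizer()

# fit_transform learns the vocabulary and returns a sparse count matrix;
# toarray() turns it into a dense NumPy array
x = cv.fit_transform(corpus).toarray()
y = liked

# Columns follow the alphabetically sorted vocabulary
print(sorted(cv.vocabulary_))
print(x)
```

Each row of x corresponds to one review and each column to one word of the vocabulary, so the first row has non-zero counts only in the columns for love, place and wow.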

Now, remember that our objective is to determine whether a review is positive or negative based on its text. We simply need to pick a classification model and train it with the existing data.

Creating Classification Model:

  • Split our data into a training set and a test set (note that we use 10% of the data as the test set and 90% as the training set).
  • Create a Gaussian naive Bayes classification model.
  • Train the model with the training set.
  • Predict results for the test set.

We are done with classification. Now simply evaluate the results with a confusion matrix.
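The four steps above can be sketched with sklearn's train_test_split and GaussianNB. The small x and y arrays are made-up stand-ins for the bag-of-words matrix and the Liked column, and random_state=0 is an assumption added only to make the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Toy stand-in for the bag-of-words data: 20 "reviews", 2 word-count columns
x = np.array([[2, 0], [0, 2], [1, 0], [0, 1]] * 5)
y = np.array([1, 0, 1, 0] * 5)

# Hold out 10% of the rows as the test set, keep 90% for training
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=0
)

# Create and train the Gaussian naive Bayes classifier
classifier = GaussianNB()
classifier.fit(x_train, y_train)

# Predict labels for the held-out reviews
y_pred = classifier.predict(x_test)
print(y_pred, y_test)
```

With the 1000 real reviews, the same test_size of 0.1 yields the 100 test reviews discussed in the confusion matrix section.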

Confusion Matrix:

Create the confusion matrix by comparing the predicted labels for the test set with the actual ones.
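A short sketch of this comparison using sklearn's confusion_matrix. The five labels below are invented for illustration; in the article's setting y_test and y_pred would come from the classification step.

```python
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels, standing in for the real test set
y_test = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[true negatives, false positives],
#  [false negatives, true positives]]
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
```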

confusion matrix

From the confusion matrix we can see that there are:

  • 27 true negatives
  • 3 false negatives
  • 24 false positives
  • 46 true positives

So, out of the 100 reviews in the test set, 27 + 46 = 73 predictions are correct, which means the accuracy of our model is 73%.
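The accuracy arithmetic can be checked with a few lines of plain Python, using the four counts reported above. The variable names are chosen for this illustration.

```python
# Counts read off the confusion matrix reported in the article
tn, fn, fp, tp = 27, 3, 24, 46

total = tn + fn + fp + tp

# Accuracy is the share of correct predictions (true negatives + true positives)
accuracy = (tn + tp) / total
print(total, accuracy)  # 100 0.73
```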
