NLP (Part 1): Preprocessing text data

Alvaro Matsuda
6 min read · Apr 6, 2023


Photo by Clarissa Watson on Unsplash

Introduction

Text is everywhere, and it can give us many insights about a company, product, or service. However, extracting such insights is not an easy task. Text data is unstructured and, as the name suggests, we have to process it into a more “structured” form to be able to analyze it and extract meaningful information. This is especially fundamental when we want to feed this data to a machine learning model.

I will write about text preprocessing in two parts. In this first part, I will cover some of the techniques necessary to process text data, mainly normalizing text and removing unwanted words and characters.

Techniques covered:

  • Tokenization;
  • Removing punctuation;
  • Lower casing;
  • Removing stopwords;
  • Stemming;
  • Lemmatization.

You can check the notebook with the code on my GitHub.
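Before running the NLTK snippets below, the resources they rely on must be downloaded once. A minimal setup sketch (resource names may vary slightly between NLTK versions):

# Importing
import nltk

# One-time downloads of the NLTK resources used in this post
nltk.download('punkt')      # tokenizer models for word_tokenize
nltk.download('stopwords')  # stop word lists
nltk.download('wordnet')    # WordNet data for the lemmatizer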

The Dataset

First, let’s take a look at the dataset.

# Importing
import pandas as pd

# Loading Dataset
df = pd.read_csv(r'https://github.com/AlvaroMatsuda/Sentiment_Analysis/blob/main/data/reviews.csv?raw=true')
df
Print of the dataframe.

We can observe that there are five columns:

  • Time_submitted: the date and time the review was submitted;
  • Review: the text of the review;
  • Rating: the rating given by the customer, ranging from 1 to 5;
  • Total_thumbsup: the number of thumbs-up reactions the review received;
  • Reply: the text containing the reply from Spotify.

We are interested only in the Review column.

Tokenization

Tokenization is a technique that most often decomposes a sentence into words, though we can also decompose it into single characters if we want. The results of this decomposition are called tokens.
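For instance, a quick illustrative comparison of word-level and character-level tokens:

from nltk.tokenize import word_tokenize

# Word-level tokens
word_tokenize('Great app!')  # ['Great', 'app', '!']

# Character-level tokens
list('Great')                # ['G', 'r', 'e', 'a', 't']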

# Importing
from nltk.tokenize import word_tokenize

# Taking the first review as an example sentence
sentence = df['Review'][0]

# Tokenizing a single review
tokens = word_tokenize(sentence)
tokens
Output of the above snippet.
# Tokenizing all rows from dataframe
review_tokens = df['Review'].apply(word_tokenize)
review_tokens
All reviews tokenized.

We can see that there are some tokens, such as commas, periods, and the `` that word_tokenize produces from quotation marks, that depending on the use case are not ideal to keep. For example, for sentiment analysis these special characters carry no information about whether the review is positive or negative. In other cases, like text generation models, these special characters can be important.

Next we are going to remove these special characters to show how it is done.

Removing Punctuation

Removing punctuation is quite simple. The string module of Python provides the constant string.punctuation, which has all ASCII punctuation characters mapped.
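We can inspect the constant directly:

# Importing
import string

# All ASCII punctuation characters
print(string.punctuation)  # !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~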

We are going to use it to build a translation table with str.maketrans and pass that table to the translate method of a string.

# Imports
import string

# Creating translator object
translator = str.maketrans('', '', string.punctuation)

# Removing punctuation from sentence
punct_removed = sentence.translate(translator)

# Printing sentence with and without punctuation
punct_removed, sentence
Output of the code above.
# Removing punctuation of all rows
reviews_punct_removed = df['Review'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))
reviews_punct_removed

Lower Casing

Another important step is to lowercase our text. Otherwise, “Great” and “great” would be considered two different words.
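A quick check illustrates the problem:

# Without lowercasing, the same word in different casing is two tokens
'Great' == 'great'          # False
'Great'.lower() == 'great'  # True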

# Lower Casing all rows
reviews_lower = df['Review'].str.lower()
reviews_lower

Removing Stop Words

Stop words are common words in a language. They appear so frequently that they add little to the meaning of a text. The most common ones are:

  • prepositions: in, at, for, by, etc.;
  • determiners: the, a, this, that, my, etc.

These stop words are already mapped in what is called a stop list. Every language has its own stop list, and the words can vary from list to list. We can even customize a stop list by adding or removing words, as sketched below.
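As an illustrative sketch of such customization (the added word is a hypothetical choice for app reviews):

# Importing
from nltk.corpus import stopwords

# Start from NLTK's English stop list
custom_stopwords = stopwords.words('english')

# Keep negations, which carry signal for sentiment analysis
custom_stopwords.remove('not')

# Add a domain-specific word (hypothetical example for app reviews)
custom_stopwords.append('app')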

It is worth mentioning that in some cases stop words are important, for instance in text generation models. Note also that NLTK's stop list is lowercase, so it is usually best to lowercase the text before removing stop words, as done in the full code at the end.

# Importing
from nltk.corpus import stopwords

# Getting list of english stopwords
ENG_STOPWORDS = stopwords.words('english')

# Removing stopwords from all rows
review_wo_stopwords = df['Review'].apply(lambda review: ' '.join([word for word in review.split() if word not in ENG_STOPWORDS]))
review_wo_stopwords
Reviews without stopwords.
# Comparing sentence with and without stopwords
df['Review'][0], review_wo_stopwords[0]
Comparing sentence with and without stopwords.

Stemming

Stemming is the process of reducing a word to its root (stem). It does not take into account the part of speech the word belongs to, and the result can be a word that does not exist.

# Importing
from nltk.stem.porter import PorterStemmer

# Stemmer object
stemmer = PorterStemmer()

# List of words to be stemmed
list_words = ['study', 'studied', 'studying', 'studies']

# Result of words stemmed
[stemmer.stem(word) for word in list_words]
Output of the above snippet.

As we can see, stemming generated “studi”, which is not a real word but is the root shared by all forms of “study”.

# Stemming all rows
review_stemmed = df['Review'].apply(lambda x: ' '.join(stemmer.stem(word) for word in x.split()))
review_stemmed
Output of the above snippet.

Lemmatization

Lemmatization is the process of transforming a word into its lemma, its dictionary form. Unlike stemming, lemmatization takes the part of speech of each word into account and returns an existing word.

# Importing
from nltk.stem import WordNetLemmatizer

# Lemmatizer object
lemmatizer = WordNetLemmatizer()

# List of words to be lemmatized
list_words = ['study', 'studied', 'studying', 'studies']

# Result of words lemmatized
[lemmatizer.lemmatize(word) for word in list_words]
Output of the above snippet.
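It is worth noting that NLTK's WordNetLemmatizer treats every word as a noun unless told otherwise through its pos parameter. A small sketch passing pos='v' (verb):

# Importing
from nltk.stem import WordNetLemmatizer

# Lemmatizer object
lemmatizer = WordNetLemmatizer()

# Telling the lemmatizer these are verbs lets it reduce all the forms
[lemmatizer.lemmatize(word, pos='v') for word in ['study', 'studied', 'studying', 'studies']]
# ['study', 'study', 'study', 'study']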
# Lemmatization on all rows
review_lemmed = df['Review'].apply(lambda x: ' '.join(lemmatizer.lemmatize(word) for word in x.split()))
review_lemmed
Output of the above snippet.

Lemmatization vs. Stemming

How do we decide which one to use? In general, lemmatization is used when we need to generate understandable output, as in chatbots and text generation models. On the other hand, we use stemming when we do not need understandable output and the general meaning of the word is enough for the model. Stemming is most used for tasks like sentiment analysis and spam classification.
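A quick side-by-side comparison on a single word makes the difference concrete:

# Importing
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Stemming can produce a non-word; lemmatization returns a dictionary form
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmer.stem('studies'), lemmatizer.lemmatize('studies')
# ('studi', 'study')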

Full code

# Importing 
import pandas as pd
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

# Loading Dataset
df = pd.read_csv(r'https://github.com/AlvaroMatsuda/Sentiment_Analysis/blob/main/data/reviews.csv?raw=true')

# Removing punctuation of all rows
df['review_pp'] = df['Review'].apply(lambda x: x.translate(str.maketrans('', '', string.punctuation)))

# Lower Casing all rows
df['review_pp'] = df['review_pp'].str.lower()

# Getting list of english stopwords
ENG_STOPWORDS = stopwords.words('english')

# Removing stopwords from all rows
df['review_pp'] = df['review_pp'].apply(lambda review: ' '.join([word for word in review.split() if word not in ENG_STOPWORDS]))

# Stemmer object
stemmer = PorterStemmer()

# Stemming all rows
df['review_pp_stem'] = df['review_pp'].apply(lambda x: ' '.join(stemmer.stem(word) for word in x.split()))

# Lemmatizer object
lemmatizer = WordNetLemmatizer()

# Lemmatization on all rows
df['review_pp_lem'] = df['review_pp'].apply(lambda x: ' '.join(lemmatizer.lemmatize(word) for word in x.split()))

# Tokenizing reviews
tokens_lem = df['review_pp_lem'].apply(word_tokenize)
tokens_stem = df['review_pp_stem'].apply(word_tokenize)

Conclusion

In this first post, I showed some techniques to prepare text data: removing special characters, punctuation, and stopwords, and normalizing the text (lowercasing, stemming, and lemmatizing).

In the next post, I will continue preprocessing the same dataset, showing techniques for representing these tokens as numeric data. This transformation is needed because machine learning models can't interpret raw text.

About me

I am a geographer currently working as a data scientist. For that reason, I am interested in data science, especially spatial data science.

Follow me for more content. I will write a post every month about data science concepts and techniques, spatial data science, and GIS (Geographic Information Systems).
