NLP (Part 2): Feature Extraction

Alvaro Matsuda
7 min read · May 10, 2023


Photo by Piotr Łaskawski on Unsplash

Introduction

Continuing from my previous post, in this second part I will focus on feature extraction techniques for text data. Feature extraction techniques encode text data as numbers so that machine learning models can interpret it.

Techniques covered:

  • Bag of words (BoW);
  • Term Frequency Inverse Document Frequency (TF-IDF);
  • n-grams.

If you haven’t read the first part yet, I highly recommend reading it first, as I am going to reuse the code written there.

Link to the notebook with the code for this post:

Bag of Words (BoW)

Bag of Words is one of the most popular feature extraction techniques. BoW represents a document by the occurrence of the words it contains. It is similar to one-hot encoding in that each word in the vocabulary becomes a feature whose value indicates whether that word occurs in the document or not (1 or 0), or, in the counted variant we will build below, how many times it occurs.

After this brief explanation of BoW, I will show how to build it with code examples. There are three main steps to build a BoW:

  1. select our text data from where we are going to build the vocabulary;
  2. generate vocabulary;
  3. vectorize the document according to the vocabulary.
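To make these steps concrete, here is a minimal hand-rolled sketch on two made-up sentences (the sentences and the helper code are purely illustrative, not from the original notebook):

# Step 1: the text data (two toy documents standing in for a corpus)
docs = ["the app is great", "the audio is bad"]

# Step 2: build the vocabulary from all documents
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Step 3: vectorize each document as word counts over the vocabulary
bow_toy = [[doc.split().count(word) for word in vocabulary] for doc in docs]

print(vocabulary)   # ['app', 'audio', 'bad', 'great', 'is', 'the']
print(bow_toy)      # [[1, 0, 0, 1, 1, 1], [0, 1, 1, 0, 1, 1]]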

Note: I will continue to use the code from the first part, since our text data is already preprocessed there.

To build our Bag of Words we are going to use CountVectorizer from Sklearn, which handles all three steps mentioned above.

# Importing CountVectorizer from Sklearn
from sklearn.feature_extraction.text import CountVectorizer

# Getting only the reviews preprocessed with lemmatization
corpus_lem = df['review_pp_lem']

# Instantiate CountVectorizer object
vectorizer = CountVectorizer()

# Generate Bag of Words
bow = vectorizer.fit_transform(corpus_lem)
Shape comparison of bow object and the corpus_lem object.
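The comparison shown in the image can be reproduced with a quick check like this (not part of the original snippet):

# Sparse matrix of shape (n_reviews, vocabulary_size) vs. the original single column
print(bow.shape)         # (61594, 28034) in this dataset
print(corpus_lem.shape)  # (61594,)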

We can see that it generated a sparse matrix with 61,594 rows (the number of reviews in our dataset) and 28,034 columns, each one representing a word in the vocabulary that was created. The values indicate whether, and how many times, that word occurs in the review.

To make it easier to understand, let’s look at how we can interpret it.

The first code snippet in the image above shows how the bow object represents the reviews: each column corresponds to a word in the vocabulary, and a word that appears in a review gets a non-zero value, otherwise 0. In the second code snippet I selected only the words that appear in either of the two sampled reviews. In the third code snippet I printed the text of both reviews to illustrate how Bag of Words encodes text as numbers.
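A rough sketch of that inspection, with the first two reviews standing in for the two sampled in the image (the indices are arbitrary):

import pandas as pd

# Decode two reviews back into a readable table, one column per vocabulary word
sample = pd.DataFrame(bow[[0, 1]].toarray(), columns=vectorizer.get_feature_names_out())

# Keep only the words that appear in at least one of the two reviews
present = sample.loc[:, (sample != 0).any(axis=0)]
print(present)

# The original review texts, for comparison
print(corpus_lem.iloc[0])
print(corpus_lem.iloc[1])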

One thing to keep in mind about Bag of Words is that it does not care about the order of words or their context; it simply tells whether a word is in the review or not. Another thing is that, as we can see from the shape of the bow object, it creates sparsity: we went from a single column containing the review text to 28,034 columns.

TF-IDF

Term Frequency-Inverse Document Frequency is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. We can think of it as a weight for the importance of a word within a document compared to the entire corpus.

The formula behind it is:

Source: https://padhaitime.com/Natural-Language-Processing/TF-IDF
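In its most common textbook form, the score of a term t in a document d is the product of two parts:

tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = log(N / df(t))

Here tf(t, d) is how often t appears in d, N is the total number of documents and df(t) is the number of documents containing t. Note that scikit-learn’s TfidfVectorizer uses a smoothed and normalized variant of this formula, so its exact numbers differ slightly.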

To have a better understanding of what it does, let’s say that we have three news articles with the following titles:

  1. What critics are saying about Guardians of the Galaxy Vol 3;
  2. Six records that could be broken at the 2023 Oscar;
  3. Marvel boss reveals how Fantastic Four will differ from old versions.

The words “movie”, “actor/actress” and “performance” will probably be very common across all three articles, so they say little about any individual article, whereas “Torch” and “Invisible” will probably be common only in the last article, about the Fantastic Four. In this example, the words “movie”, “actor/actress” and “performance” would have small TF-IDF values, while “Torch” and “Invisible” would have high values.
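A minimal sketch of that intuition on three invented one-line stand-ins for the articles (the texts below are made up for illustration only):

from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-ins: "movie" appears in every document, "torch" in only one
toy_articles = [
    "critics praise the movie guardians of the galaxy",
    "records could be broken at the oscar movie ceremony",
    "marvel movie boss reveals fantastic four torch and invisible woman",
]

toy_vec = TfidfVectorizer()
toy_vec.fit(toy_articles)

# idf_ holds the inverse document frequency learned for each word:
# words present in every document get the lowest weight
vocab = toy_vec.vocabulary_
print("movie idf:", toy_vec.idf_[vocab["movie"]])  # lowest value, appears everywhere
print("torch idf:", toy_vec.idf_[vocab["torch"]])  # higher, since it is rarer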

Let’s see how to implement TF-IDF.

# Importing TfidfVectorizer from Sklearn and pandas
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Instantiating TfidfVectorizer object
tfidfvec = TfidfVectorizer()

# Generating TF-IDF with the first 100 reviews
tfidf = tfidfvec.fit_transform(corpus_lem[:100])

# Getting the first ten reviews encoded by the TF-IDF
example_tfidf = pd.DataFrame(tfidf[0:10].toarray(), columns=tfidfvec.get_feature_names_out())

example_tfidf
Output of the code snippet.

We can see that TF-IDF gives us a float value instead of only the 0s and 1s generated by Bag of Words. Higher TF-IDF values indicate that a word is frequent within a single document and rare in the rest of the documents. In our example, the word “app” appears in more documents than the word “audio”.
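That last claim can be checked directly by counting, within this 100-review sample, how many documents give each word a non-zero weight (this assumes both words survived preprocessing and made it into the vocabulary):

# Number of documents (out of the 100) in which each term has a non-zero weight
doc_freq = (tfidf > 0).sum(axis=0)

vocab = tfidfvec.vocabulary_
for word in ["app", "audio"]:
    if word in vocab:  # only works if the word is actually in this vocabulary
        print(word, "appears in", int(doc_freq[0, vocab[word]]), "of the 100 reviews")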

TF-IDF shares the same problems as Bag of Words: the context of words is not considered, and it creates sparsity proportional to the size of the vocabulary. However, TF-IDF gives a machine learning model a bit more information, since it expresses the importance of a word as a weight instead of just indicating whether the word occurs in a document or not.

N-grams

We can think of N-grams as sequences of “n” words that occur in a document. For instance, a 1-gram (unigram) is a sequence of only one word (Bag of Words, as built above, uses unigrams). A 2-gram (bigram) is a sequence of two words, and so on.

Let’s apply n-grams to a simple sentence:

“I really love chocolate ice cream.”

  • Unigram: [“I”, “really”, “love”, “chocolate”, “ice”, “cream”]
  • Bigram: [“I really”, “really love”, “love chocolate”, “chocolate ice”, “ice cream”]
  • Trigram: [“I really love”, “really love chocolate”, “love chocolate ice”, “chocolate ice cream”]
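A minimal sketch of how these sequences could be generated by hand (the helper below is illustrative, not how CountVectorizer does it internally):

def ngrams(tokens, n):
    # Slide a window of size n over the token list and join each window into a string
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I really love chocolate ice cream".split()
print(ngrams(tokens, 1))  # unigrams
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams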

One advantage of N-grams over Bag of Words and TF-IDF is that some of the context of words is captured. However, they do not consider the part of speech of words. Since N-grams do not encode the words themselves, only how they are grouped, we can use them in conjunction with either Bag of Words or TF-IDF.

In the classes we used before (CountVectorizer and TfidfVectorizer) we can set a parameter called ngram_range to specify how we want to use n-grams and create a Bag of N-grams.

# Importing CountVectorizer from Sklearn
from sklearn.feature_extraction.text import CountVectorizer

# Getting only the reviews preprocessed with lemmatization
corpus_lem = df['review_pp_lem']

# Instantiate CountVectorizer object
vectorizer = CountVectorizer(ngram_range=(2, 2)) # Bigram

# Generate Bag of Words
bongram = vectorizer.fit_transform(corpus_lem)
Shape comparison of bongram object and the corpus_lem object.

Notice how n-grams explode in dimension: this created 328,380 pairs of words. There are ways to address this problem through the max_df, min_df and max_features parameters of CountVectorizer. We can also notice that the representation is the same as in Bag of Words; the only difference is that in a Bag of N-grams the features are pairs of words instead of single words.
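A sketch of how those parameters could be used to rein in the dimensionality (the thresholds below are illustrative, not tuned values):

# Keep unigrams and bigrams, but prune very rare and very common n-grams
pruned_vectorizer = CountVectorizer(
    ngram_range=(1, 2),    # unigrams and bigrams together
    min_df=5,              # drop n-grams that appear in fewer than 5 reviews
    max_df=0.9,            # drop n-grams that appear in more than 90% of reviews
    max_features=20000     # keep at most the 20,000 most frequent n-grams
)
bongram_small = pruned_vectorizer.fit_transform(corpus_lem)
print(bongram_small.shape)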

Conclusion

We saw three techniques to extract features from text; after applying one of them, we are set to feed the data into a machine learning model. For our review example, we would still need to work on the data, because we created a sparse matrix. One way to address that is with the CountVectorizer parameters mentioned above.

The purpose of this post is to show how to apply these feature extraction techniques and, hopefully, give a bit more understanding of what each technique does.

About me

I am a geographer currently working as a data scientist. For that reason, I am interested in data science and especially in spatial data science.

Follow me for more content. I will write a post every month about data science concepts and techniques, spatial data science and GIS (Geographic Information System).

