Natural Language Processing (NLP) is a subfield of Artificial Intelligence, and much of modern NLP is built on machine learning.
Ever since people created computers, they have wanted machines to understand them. Robots came next, and now we have smartphones that use speech-to-text software to convert spoken language into text. Whether it is speech-to-text or the reverse, text-to-speech, it all comes under Natural Language Processing (NLP).
So, how can we implement NLP in our system? Simple. We first need to choose a programming language such as Python or Java. Since I have worked with both of these languages, I would recommend Python for its simpler syntax and the many libraries available on GitHub.
There is already a tool called the Natural Language Toolkit, or NLTK, developed specifically for those who like to work in Python. So let’s get started.
Before we write any Python, let’s go over the logic behind NLTK that lets the system understand human language and do what we need.
NLTK works on models, otherwise known as trained data. In simple terms, you can think of trained data as a language-specific resource, a bit like a dictionary, that has been learned from large amounts of text. When we code in NLTK, we first load the trained data we need. Let’s say we need our system to understand English sentences. So we download a resource such as “punkt”, the pre-trained sentence tokenizer models, with nltk.download(). Luckily for us, trained data for the common tasks is already available, so all we need to do is download it. If we need to train on new data of our own, there are tools for that, such as nltk-trainer on GitHub.
So let’s get started. We first need to install Python. On most Linux distributions Python comes pre-installed, but make sure you have a recent version of Python 3. You can type “python3 --version” to check the version. The latest version of Python is 3.7 (at the time of writing this article), but anything from 3 upward is just fine. Windows users can download and install Python from here -> Python 3.7: http://www.python.org/downloads/
Next, we need to install NLTK to get started.
For Linux users, you can follow the below commands.
sudo pip3 install nltk
sudo pip3 install numpy
For Windows users, you can follow the links below.
If everything’s OK, you are ready to use NLTK. You can paste the code below directly into your Python terminal or run it inside an IDE such as PyCharm.
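import nltk
from nltk.tokenize import sent_tokenize
# One-time download of the pre-trained sentence tokenizer models
# (newer NLTK releases may ask for "punkt_tab" instead)
nltk.download("punkt")
mytext = "My Name is Mr. Sheldon. I work at ThinkPalm Technologies."
print(sent_tokenize(mytext))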
The output is as follows:
['My Name is Mr. Sheldon.', 'I work at ThinkPalm Technologies.']
If you look at the output, you can see that NLTK understood the English text and split it into two sentences. You may think that we could use a regular expression in Python to split on ‘.’ or other punctuation marks, but if you look carefully, ‘Mr.’ stayed in the same sentence. A simple regular expression that splits on every period would have broken the sentence after ‘Mr.’ as well.
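To make the regular expression point concrete, here is a small sketch that uses only Python's standard library. The helper name naive_split is made up for this illustration and is not part of NLTK; it simply splits after every period that is followed by whitespace.

```python
import re

text = "My Name is Mr. Sheldon. I work at ThinkPalm Technologies."

def naive_split(text):
    """Split after every period that is followed by whitespace."""
    parts = re.split(r"(?<=\.)\s+", text)
    return [part for part in parts if part]

sentences = naive_split(text)
print(sentences)
# The title 'Mr.' is treated as the end of a sentence, so the text
# comes back in three pieces instead of two.
```

The naive rule cannot tell an abbreviation from the end of a sentence, which is exactly the knowledge the trained sentence tokenizer brings.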
We can even split sentences into words by using “nltk.word_tokenize”.
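Here is a minimal sketch of word tokenization on a single sentence. It passes preserve_line=True so that word_tokenize skips the sentence-splitting step and does not need the downloaded “punkt” models; that choice is made only to keep the sketch self-contained. In normal use, once the models are downloaded, you would simply call word_tokenize(sentence).

```python
from nltk.tokenize import word_tokenize

sentence = "I work at ThinkPalm Technologies."

# preserve_line=True treats the input as one sentence, so only the
# rule-based word tokenizer runs and no trained data is required.
tokens = word_tokenize(sentence, preserve_line=True)
print(tokens)
```

Notice that the final period becomes a token of its own, which is useful later when each token gets a part-of-speech tag.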
So now we have sentences and words, but how does the system know which word is which part of speech, such as a verb or a noun, and which words name real-world things? Part-of-speech (POS) tagging handles the first question, and Named Entity Recognition, or NER, in NLTK handles the second. So what is NER?
NER is basically identifying real-world entities, such as a Person or an Organization, in a given text. We then need to map these against a knowledge base so that the system can understand what the sentence is about, and extract relationships between the named entities, such as who works where, how old a person is, or when an event occurs.
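The example below shows how to use NLTK’s NER. It relies on NLTK’s pre-trained tagger and chunker models, which have to be downloaded once with nltk.download().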
from nltk import word_tokenize, pos_tag, ne_chunk
sentence = "My name is Sheldon, I work here at ThinkPalm Technologies."
# ne_chunk expects POS-tagged tokens, so tag the words before chunking
print(ne_chunk(pos_tag(word_tokenize(sentence))))
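Output (simplified; NLTK actually prints a tree, and the exact tags depend on the model version): each word is paired with a part-of-speech tag, Sheldon is labeled PERSON, and ThinkPalm Technologies is labeled ORGANIZATION.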
Let’s analyze the above code and its output.
“ne_chunk” is NLTK’s built-in tool for named entity recognition. It needs part-of-speech annotations on each word, which is what “pos_tag” provides, and it uses the built-in models, or trained data, to assign the labels.
So from these examples, we can see how the system identifies each entity and thus gets at the meaning of a sentence. Using these labels as tags, we can write our programs to perform logical operations on real-world text, and this is how machines come to understand human language.
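As a rough sketch of what using the labels as tags could look like, the snippet below walks a chunk tree and collects the labeled entities. To keep it runnable without any downloaded models, the tree is built by hand as a stand-in for what ne_chunk returns; its shape and tags are assumptions for illustration, and the helper name extract_entities is made up for this example.

```python
from nltk import Tree

# Hand-built stand-in for an ne_chunk result (illustrative tags only).
# With the models installed you would instead use:
#   ne_chunk(pos_tag(word_tokenize(sentence)))
chunked = Tree("S", [
    ("My", "PRP$"), ("name", "NN"), ("is", "VBZ"),
    Tree("PERSON", [("Sheldon", "NNP")]),
    (",", ","), ("I", "PRP"), ("work", "VBP"), ("here", "RB"), ("at", "IN"),
    Tree("ORGANIZATION", [("ThinkPalm", "NNP"), ("Technologies", "NNPS")]),
    (".", "."),
])

def extract_entities(tree):
    """Return (label, text) pairs for every labeled chunk in the tree."""
    entities = []
    for node in tree:
        # Plain tokens are (word, tag) tuples; entities are nested Trees.
        if isinstance(node, Tree):
            words = " ".join(word for word, _tag in node.leaves())
            entities.append((node.label(), words))
    return entities

entities = extract_entities(chunked)
print(entities)
```

From a list like this, a program can start answering simple questions, for example by pairing the PERSON with the ORGANIZATION found in the same sentence.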
There are many other NLP tools that you can use instead of Python’s NLTK. The following are some of the most popular.
CoreNLP from Stanford University is, in my view, among the best options for NLP implementation, as it recognizes many named entity types and is more precise than NLTK. Still, I prefer NLTK because in my experience it runs much faster than CoreNLP. CoreNLP’s performance drops for large and heavy applications: loading its libraries alone requires a minimum of 5 GB of memory, and execution is much slower.
Another reason for choosing NLTK is that it’s built for Python, a general-purpose language that has become a favorite for Artificial Intelligence work because it handles complex mathematical and statistical tasks with a simple syntax that anyone can pick up in a short time.
I hope everyone finds this simple enough to understand the basic concepts of NLP and its use in developing Artificial Intelligence-based applications. Natural Language Processing has huge scope for the future and will play a major role in building the communication bridge between man and machine.