In this series, we explore, at a high level, some of the ideas, techniques, and algorithms at the foundation of AI. AI covers a wide range of techniques, and across this series I cover several broad categories of problems.
In this post, we will explore Language.
So far, the AI problem areas we have covered required us to formulate the problem in a way the AI can understand. Now we will look at some ideas that enable AI to interpret our language.
Natural Language Processing (NLP) is about designing algorithms that enable AI to process and understand natural language. It involves understanding the syntax and semantics of human language in order to generate sentences, make predictions about written or spoken text, and much more.
Some of the tasks under NLP include parsing sentence structure, predicting words, classifying text (for example, spam detection and sentiment analysis), and topic modeling.
Understanding the structure of a language and understanding its semantics are the two main challenges in the world of NLP.
We need a way to tell the AI which English structures are valid and which are not. A context-free grammar is a way to describe the syntactic structure of natural languages. Using this method, we can split a sentence into its constituent parts and construct syntax trees.
Using a context-free grammar to build a syntax tree (https://www.pling.org.uk/cs/com6791.html)
Context-free grammar rules enable AI to learn the structure of a language and generate syntactically correct sentences.
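To make this concrete, here is a minimal sketch of parsing a sentence with a context-free grammar using NLTK. The toy grammar and the example sentence are my own illustrative assumptions, not something from the original lecture.

```python
# A minimal sketch using NLTK's CFG tools (assumes nltk is installed:
# pip install nltk). The tiny grammar below is illustrative only, not a
# full grammar of English.
import nltk

grammar = nltk.CFG.fromstring("""
    S  -> NP VP
    NP -> D N | N
    VP -> V | V NP
    D  -> "the" | "a"
    N  -> "she" | "city" | "car"
    V  -> "saw" | "walked"
""")

parser = nltk.ChartParser(grammar)
sentence = "she saw the city".split()

# Print every syntax tree the grammar allows for this sentence.
for tree in parser.parse(sentence):
    tree.pretty_print()
```

If the grammar cannot derive the sentence, the parser simply yields no trees, which is how the AI can tell a structurally invalid sentence from a valid one.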
Take the example of two sentences: “I ate a banana” and “I ate a building”. Both are syntactically correct. We need a way to tell the AI that the second sentence is unlikely to be true.
An n-gram is a contiguous sequence of n words from a text. By feeding the AI a huge corpus of text, it can learn which sequences of words are most likely: sequences of two words (bigrams), three words (trigrams), and so on.
Splitting a sentence into uni-, bi-, and trigrams (https://web.archive.org/web/20180427050745/http://recognize-speech.com/language-model/n-gram-model/comparison)
Word tokenization enables the AI to split a sequence of characters into words. By applying n-grams, word tokenization, and a Markov model, AI can predict the most likely next word, for example, the third word given a sequence of two words.
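Here is a minimal sketch of that idea in plain Python: tokenize a tiny corpus, count trigrams, and predict the most likely third word after a given pair. The corpus is made up purely for illustration.

```python
# Tokenization + trigram counts used as a simple Markov-style predictor.
# No external libraries; the toy corpus is an assumption for this sketch.
from collections import Counter, defaultdict

corpus = (
    "i ate a banana . i ate a sandwich . "
    "i ate a banana and a sandwich ."
)

# Word tokenization: split the character sequence into words.
tokens = corpus.split()

# Count trigrams: for each pair of words, count which word follows it.
following = defaultdict(Counter)
for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
    following[(w1, w2)][w3] += 1

# Markov-style prediction: the most likely word after "i ate".
print(following[("i", "ate")].most_common(1))  # [('a', 3)]
```

With a large enough corpus, the same counts would also tell the AI that “ate a banana” is far more likely than “ate a building”.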
To identify an email as spam or to categorize a product review as positive or negative, AI analyzes the words in the text.
The bag-of-words model represents text as an unordered collection of words. The idea is that the words in the text are what matter; the syntax or structure of the sentence is treated as irrelevant.
Naive Bayes is a popular method for categorizing text. Naive Bayes, combined with the bag-of-words model, can predict the probability that a sentiment is positive or negative.
Representing a movie review as a bag of words: the order of words is ignored; only the frequency is considered (https://web.stanford.edu/~jurafsky/slp3/4.pdf)
Given enough training data, we can train the AI to look at natural language, figure out which words are likely to show up in positive as opposed to negative messages, and categorize them accordingly.
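As a minimal sketch, here is bag-of-words plus Naive Bayes with scikit-learn. The tiny training reviews and labels are invented for illustration; a real classifier would need far more data.

```python
# Bag-of-words + Naive Bayes sentiment classification
# (pip install scikit-learn). The toy dataset is an assumption.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

reviews = [
    "loved this movie great acting",
    "what a wonderful fun film",
    "terrible plot and boring acting",
    "awful movie waste of time",
]
labels = ["positive", "positive", "negative", "negative"]

# Bag-of-words: each review becomes a vector of word counts, order ignored.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews)

# Naive Bayes learns which words are likely to appear in each class.
classifier = MultinomialNB()
classifier.fit(X, labels)

test = vectorizer.transform(["boring waste of a movie"])
print(classifier.predict(test))        # likely ['negative']
print(classifier.predict_proba(test))  # probability of each class
```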
AI identifies the distinctive words in a document by looking for words that appear most often, since frequent words are likely to be the important ones. This is what makes topic modeling, mentioned above, possible.
Term frequency (TF) is the number of times a term appears in a document. AI considers only the content words and ignores the function words. Function words are words that have little meaning on their own but are used to connect other words grammatically, for example the, a, is, and of. Content words, in contrast, carry meaning on their own.
Inverse document frequency (IDF) is a measure of how common or rare a word is across documents. AI ranks the important words in a document by multiplying TF by IDF, a score known as tf-idf. This helps AI identify the keywords that set a document apart from others.
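Here is a minimal tf-idf sketch in plain Python. The three toy “documents” are made up; in practice you would use a real corpus, or a library such as scikit-learn's TfidfVectorizer.

```python
# tf-idf from scratch: score = term frequency * inverse document frequency.
import math

documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell sharply on monday",
]
tokenized = [doc.split() for doc in documents]

def tf(term, doc_tokens):
    # Term frequency: how many times the term appears in this document.
    return doc_tokens.count(term)

def idf(term, all_docs):
    # Inverse document frequency: terms that are rare across documents
    # score higher.
    containing = sum(1 for doc in all_docs if term in doc)
    return math.log(len(all_docs) / containing)

def tf_idf(term, doc_tokens, all_docs):
    return tf(term, doc_tokens) * idf(term, all_docs)

# "cat" appears in two documents, "stocks" in only one, so "stocks" gets
# a higher tf-idf score in the document that contains it.
print(tf_idf("cat", tokenized[0], tokenized))
print(tf_idf("stocks", tokenized[2], tokenized))
```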
AI understands the meaning of words by finding how words relate to one another. WordNet is a dataset that contains definitions of words and information on how they relate to each other.
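A minimal sketch of querying WordNet through NLTK is below; the word “boat” is just an example choice, and the first run needs the WordNet data downloaded.

```python
# Looking up word senses and relations in WordNet via NLTK
# (pip install nltk; downloads the WordNet data on first run).
import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)

for synset in wordnet.synsets("boat"):
    # Each synset is one sense of the word, with a definition and
    # relations to other words (e.g., hypernyms: more general terms).
    print(synset.name(), "-", synset.definition())
    print("  more general:", [h.name() for h in synset.hypernyms()])
```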
In the world of NLP, words are represented as vectors (word representations). Words with similar meanings have similar vector representations. word2vec is a model for generating word vectors; it uses the skip-gram architecture, a neural network architecture for predicting context words given a target word.
The distance between two words is measured by comparing their vectors (for example, with cosine distance). The distance is close to 0 if the meanings are similar and closer to 1 otherwise.
The power of word vectors (https://searchengineland.com/word-vectors-implication-seo-258599)
Using vectors, AI can generate words by extracting the relationship between two words and applying that relationship to a third word. A famous example is king - man + woman = queen. Here, the AI calculates what takes “man” to “king” and applies the same shift to “woman”, arriving at “queen”.
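As a minimal sketch, the analogy can be reproduced with gensim and a pretrained set of GloVe word vectors. The model name and download step are my assumptions about the setup; the first call fetches the vectors over the network.

```python
# king - man + woman ~= queen, using pretrained GloVe vectors via gensim
# (pip install gensim; the first call downloads the "glove-wiki-gigaword-50"
# model, which is a few tens of megabytes).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")

# Vector arithmetic: take the relationship that carries "man" to "king"
# and apply it to "woman", then look up the nearest word to the result.
result = vectors.most_similar(positive=["king", "woman"],
                              negative=["man"], topn=1)
print(result)  # typically something like [('queen', 0.85...)]
```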
Reference: CS50’s Introduction to Artificial Intelligence with Python