Coding a search engine using TF-IDF and cosine similarity in Python to
find relevant TED talks
Objective - Pre-processing - TF-IDF - Match ranking - Cosine Similarity - Conclusion
The aim of this project is to process multi-word search queries and produce relevant results from a given database. The program will not use any third-party libraries that 'do it all for you', so we will be coding all the calculations ourselves.
Pre-processing is very important. The data must be prepared for the algorithm to work well.
import pandas as pd
import math
import re
Note that the file path will be different for you.
The dataset can be downloaded at https://data.world/owentemple/ted-talks-complete-list. I removed a number of the additional features that come with the transcription in the dataset, since I won't be using them with TF-IDF. See the column headings below to see which ones I have kept.
def remove_null_vals(df):
    df = df.dropna()
    return df

def load_data(path):
    # Will infer the headings of each column
    dataframe = pd.read_csv(path)
    return dataframe
PATH = 'C:\\Users\\maxel\\OneDrive\\Search_Engine\\Version_1\\TED_Talks_dataset.csv'
df = load_data(PATH)
df = remove_null_vals(df)
df.head()
The columns that we will consider are the description, tags, headline and transcript. However, we must account for the headline, description and tags carrying more weight per word when processing the search query, since they give a better description of what the talk will be about.
To reduce the complexity of the search, we shall make a new feature which combines the description, tags and headline into a more highly weighted description of the talk compared to the transcript (in terms of how much description the individual words carry). We shall call this feature exposition.
def combine_tags_description_heading(df):
    s = ' '
    df['exposition'] = [row['headline'] + s + row['description'] + s + row['tags'] for index, row in df.iterrows()]
    return df
df = combine_tags_description_heading(df)
df['exposition'][0]
We have added a space when combining the headline, description and tags to ensure the words are split correctly when tokenising.
The data must be pre-processed in order to ensure we achieve maximum accuracy with the search. We shall conduct stemming and lemmatisation on the transcripts, description and headline of each talk.
from nltk.corpus import stopwords
print(stopwords.words('english'))
def remove_stop_words(text):
    text = re.split(r'[;,.\s]\s*', text)
    stop_words = stopwords.words('english')
    index_to_remove = []
    for i in range(len(text)):
        # Make the word lower case
        text[i] = text[i].lower()
        # Remove the word if the word is a stop word
        if text[i] in stop_words:
            index_to_remove.append(i)
            continue
    for index in index_to_remove[::-1]:
        text.pop(index)
    return text
def remove_symbols(text):
    symbols = "!\"#$%&()*+-./:;<=>?@[\\]^_`{|}~\n\r0123456789"
    for symbol in symbols:
        text = text.replace(symbol, ' ')
    return text

def remove_apostrophes(text):
    # Strip apostrophes (and stray backticks) from each token, e.g. "won't" -> "wont"
    return [word.replace("'", '').replace('`', '') for word in text]
We must be careful about certain words, such as the conversion of "U.S." to "us", and about apostrophes. "Won't" will be converted to "wont", so it won't be removed with the stop words. We must take out the apostrophes separately afterwards.
We are also going to remove numbers, since the transcript files contain timings which are not relevant to the search.
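To see these helpers in action, here is a minimal sketch on a made-up sentence (the sentence is illustrative only):
sample = "Won't the U.S. economy keep growing in 2020?"   # hypothetical input

cleaned = remove_symbols(sample)       # punctuation and digits become spaces
tokens = remove_stop_words(cleaned)    # lower-case, tokenise and drop stop words
tokens = remove_apostrophes(tokens)    # strip any remaining apostrophes
print(tokens)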
Stemming converts a word to its stem. For example, "running" is converted to "run" based on some set of rules. This is what we want, since it doesn't really make a difference which tense a word is in for our search query. We are going to use a library for this, because coding a stemmer seems like it would be boring and not much would be gained from it. The library will be NLTK's PorterStemmer, which identifies and removes the suffix or affix of a word (the attachments on the word).
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
print(stemmer.stem('running'), stemmer.stem('played'), stemmer.stem('undecided'))
Lemmatisation reduces a word to its lemma (its dictionary root form). Unlike stemming, lemmatisation will always produce a word that is in a set dictionary. We shall use stemming for simplicity.
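For comparison, here is a small sketch of lemmatisation with NLTK's WordNetLemmatizer (this assumes the WordNet corpus has been downloaded with nltk.download('wordnet'); it is not used in the rest of the project):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('running', pos='v'))   # 'run'
# Stemming can produce a non-word, whereas the lemma is always a dictionary word
print(stemmer.stem('undecided'), lemmatizer.lemmatize('undecided', pos='a'))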
def stemming(text):
    stemmer = PorterStemmer()
    for i in range(len(text)):
        text[i] = stemmer.stem(text[i])
    return text
Putting this all together in a function, we get:
def preprocess(df):
    for target_text in ['exposition', 'transcript']:
        for index, row in df.iterrows():
            text = row[target_text]
            text = remove_symbols(text)
            text = remove_stop_words(text)
            text = remove_apostrophes(text)
            text = stemming(text)
            # Write the token list back into the dataframe cell
            df.at[index, target_text] = text
    return df
This may take a while to run, depending on the dataset size.
df = preprocess(df)
df.head()
We don't want to run this every time we want to calculate TF-IDF, so we will save it to a CSV file after joining the exposition and transcript token lists into comma-separated strings.
df['exposition'] = [','.join(row['exposition']) for i, row in df.iterrows()]
df['transcript'] = [','.join(row['transcript']) for i, row in df.iterrows()]
df.to_csv('processed_TED_Talks.csv', index = False)
from nltk.corpus import stopwords
from collections import Counter
import numpy as np
import pandas as pd
import math
Using numpy arrays for faster access and pandas to make use of the dataframe class.
def load_data(path):
    dataframe = pd.read_csv(path)
    return dataframe
PATH = 'C:\\Users\\maxel\\OneDrive\\Search_Engine\\Version_1\\processed_TED_Talks.csv'
df_info = load_data(PATH)
N = len(df_info)
# dtype=object since the documents have different lengths
processed_exposition = np.array([np.array(row['exposition'][1:].split(',')) for i, row in df_info.iterrows()], dtype=object)
processed_transcript = np.array([np.array(row['transcript'][1:].split(',')) for i, row in df_info.iterrows()], dtype=object)
Convert the processed exposition and transcript into numpy arrays for faster access.
DF - document frequency - how many documents the word appears in.
DF = {}
for i in range(N):
    tokens = processed_exposition[i]
    for w in tokens:
        try:
            DF[w].add(i)
        except KeyError:
            DF[w] = {i}  # Set type
    tokens = processed_transcript[i]
    for w in tokens:
        try:
            DF[w].add(i)
        except KeyError:
            DF[w] = {i}  # Set type

for i in DF:
    DF[i] = len(DF[i])  # Get the number of documents the token appeared in
total_vocab_size = len(DF)
total_vocab_size
total_vocab = list(DF.keys())
We need a function that returns the document frequency of a word.
def get_DF(word):
    try:
        return DF[word]
    except KeyError:
        return 0
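Because every token was stemmed during pre-processing, a query word should be stemmed in the same way before it is looked up. For example (the exact count depends on your dataset):
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Stem the query word so it matches the stemmed vocabulary
print(get_DF(stemmer.stem('quantum')))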
$$\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}} \qquad\text{or}\qquad \mathrm{tf}(t,d) = \log\left(1 + f_{t,d}\right)$$
where $f_{t,d}$ is the number of times the term $t$ appears in document $d$.
The term frequency is given by one of the equations above; the choice can be made as to which is used. The first is simply the fraction of the document's words that are the token; the second describes the word frequency logarithmically. The code below uses the first, fractional form.
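As a small illustration with a made-up list of (already stemmed) tokens:
from collections import Counter
import numpy as np

toy_tokens = ['quantum', 'biolog', 'quantum', 'cell']   # hypothetical document
counts = Counter(toy_tokens)

print(counts['quantum'] / len(toy_tokens))   # fractional tf: 2/4 = 0.5
print(np.log(1 + counts['quantum']))         # logarithmic tf: log(3)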
$$\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)} \qquad\text{or (smoothed)}\qquad \mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t) + 1}$$
where $N$ is the total number of documents and $\mathrm{df}(t)$ is the number of documents the term $t$ appears in.
There is also a choice in the way the inverse document frequency is calculated, using the equations above. Justifications for its use have included attempts to connect it to Zipf's law https://en.wikipedia.org/wiki/Zipf%27s_law, to information theory, or to a probabilistic argument. See more here: https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Justification_of_idf.
We shall be using the smooth IDF equation. The plus one in the denominator ensures that a division-by-zero error does not occur. In the code, 1 will also be added to N so that the log never becomes negative (a word appearing in every document then gives log(1) = 0), although this is not written in the equation.
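As a quick numerical sketch (toy numbers chosen for illustration only), rare words receive a much larger idf than common ones:
import numpy as np

N_docs = 2550      # hypothetical number of documents
df_common = 2400   # a word appearing in almost every document
df_rare = 3        # a rare word

# Smooth idf, matching the log term used in the code below
print(np.log((N_docs + 1) / (df_common + 1)))   # close to 0
print(np.log((N_docs + 1) / (df_rare + 1)))     # much larger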
We can now write the code which will calculate the TF-IDF weights for each of the documents.
We will calculate the TF-IDF weights separately for the exposition and the transcript, then merge them according to a given weight. The weight (called alpha) aims to describe how much more useful the exposition is, per word, at describing the talk compared to the transcript.
tf_idf = {}
for i in range(N):
    tokens = processed_transcript[i]
    counter = Counter(list(tokens) + list(processed_exposition[i]))
    words_count = len(tokens) + len(processed_exposition[i])
    for token in np.unique(tokens):
        tf = counter[token] / words_count
        df = get_DF(token)
        idf = 1 + np.log((N + 1) / (df + 1))
        tf_idf[i, token] = tf * idf
tf_idf_expo = {}
for i in range(N):
    tokens = processed_exposition[i]
    counter = Counter(list(tokens) + list(processed_transcript[i]))
    words_count = len(tokens) + len(processed_transcript[i])
    for token in np.unique(tokens):
        tf = counter[token] / words_count
        df = get_DF(token)
        idf = np.log((N + 1) / (df + 1))
        tf_idf_expo[i, token] = tf * idf
tf_idf[(1948,"quantum")]
tf_idf_expo[(1948,"quantum")]
len(tf_idf)
Merge them into one set of weights according to the alpha weighting.
alpha = 0.7
for i in tf_idf:
    tf_idf[i] = (1 - alpha) * tf_idf[i]
    try:
        tf_idf[i] += alpha * tf_idf_expo[i]
    except KeyError:
        pass
Here, the preprocess import refers to a Python script which simply contains the preprocessing functions written earlier.
from preprocess import preprocess_query
The matching score is simply how close the search query is to a document in terms of its TF-IDF weights. Think of it as how close two points are in 2D or 3D space; you may struggle to imagine a 40-thousand-dimensional space. The program below uses Manhattan distance (the 'blocky' distance), where rather than drawing a straight line between the points, lines are drawn parallel to the axes and the sum of these distances is used. You can think of this as going along the edges of a cube rather than cutting through the middle to get between opposite corners. In practice, the matching_score function below sums the TF-IDF weights of the query terms for each document and ranks the talks by that total; a short distance sketch follows for intuition.
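Here is a minimal sketch of the Manhattan distance compared with the straight-line (Euclidean) distance between two made-up points:
import numpy as np

a = np.array([1.0, 2.0, 0.0])   # hypothetical points
b = np.array([4.0, 0.0, 3.0])

print(np.sum(np.abs(a - b)))    # Manhattan distance: 3 + 2 + 3 = 8
print(np.linalg.norm(a - b))    # straight-line distance: sqrt(22) ≈ 4.69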
def matching_score(k, query):
    preprocessed_query = preprocess_query(query)
    tokens = preprocessed_query
    print("Matching Score")
    print("Search Query:", query, '\n')
    query_weights = {}
    # Sum the TF-IDF weight of every query token for each document
    for key in tf_idf:
        if key[1] in tokens:
            try:
                query_weights[key[0]] += tf_idf[key]
            except KeyError:
                query_weights[key[0]] = tf_idf[key]
    # Rank documents by their total weight and print the top k headlines
    query_weights = sorted(query_weights.items(), key=lambda x: x[1], reverse=True)
    for i in query_weights[:k]:
        print(i[0], df_info['headline'][i[0]])
matching_score(10, "Quantum biology")
These results seem to be giving the right answers. But we can improve this even more.
$$\cos\theta = \frac{a \cdot b}{\|a\|\,\|b\|}$$
The equation above can be used to find the angle between two vectors. This can be used to find the angle between the search query vector and the document vectors. From this we can deduce that the smaller the angle, the better the query matches the document. Cosine similarity is used when the magnitude of the vectors doesn't matter, which suits us since we don't mind how long the talks are.
def cosine_sim(a, b):
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return cos_sim
D = np.zeros((N, total_vocab_size))
for i in tf_idf:
    try:
        ind = total_vocab.index(i[1])
        D[i[0]][ind] = tf_idf[i]
    except ValueError:
        pass
def gen_vector(tokens):
    Q = np.zeros((len(total_vocab)))
    counter = Counter(tokens)
    words_count = len(tokens)
    for token in np.unique(tokens):
        tf = counter[token] / words_count
        df = get_DF(token)
        idf = math.log((N + 1) / (df + 1))
        try:
            ind = total_vocab.index(token)
            Q[ind] = tf * idf
        except ValueError:
            pass
    return Q
def cosine_similarity(k, query):
    print("Cosine Similarity")
    preprocessed_query = preprocess_query(query)
    tokens = preprocessed_query
    print("Search Query:", query, '\n')
    d_cosines = []
    query_vector = gen_vector(tokens)
    # Cosine similarity between the query vector and every document vector
    for d in D:
        d_cosines.append(cosine_sim(query_vector, d))
    # Indices of the k most similar documents, best first
    out = np.array(d_cosines).argsort()[-k:][::-1]
    for i in out[:k]:
        print(i, df_info['headline'][i])
    return out
Q = cosine_similarity(10, "Groundwater")
We don't want to compute the TF-IDF weights every time we want to search something, so we will save them to a file (although loading them back in does itself take some time). We will use the .npy file extension, which has faster read times than a plain text or CSV file.
# Save to a file
np.save('tf_idf_dict.npy', tf_idf)
read_tf_idf = np.load('tf_idf_dict.npy', allow_pickle = True).item()
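As a quick sketch, the reloaded dictionary can simply replace the in-memory weights before running a search (using the functions defined above):
# Use the reloaded weights for subsequent searches
tf_idf = read_tf_idf
matching_score(10, "Quantum biology")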
The algorithm performs well, giving relevant search results, very similar to those given by the search bar on the TED Talks website. Storing and loading the TF-IDF weights is not very efficient or fast, so a real implementation may need to alter this part of the method.
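One possible improvement, sketched below and not part of the project above, is to store the document-term matrix D in a compressed sparse format with scipy, since most of its entries are zero; this typically saves a lot of space and load time compared with a dense array:
from scipy import sparse

# Convert the dense document-term matrix built earlier into sparse form
D_sparse = sparse.csr_matrix(D)

sparse.save_npz('tf_idf_matrix.npz', D_sparse)   # much smaller on disk for mostly-zero data
D_loaded = sparse.load_npz('tf_idf_matrix.npz')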
If you would like an alternative description of this process check out this article on Medium https://towardsdatascience.com/tf-idf-for-document-ranking-from-scratch-in-python-on-real-world-dataset-796d339a4089
If you have questions, email me at leo@terminalconvergence.com