Search Engine

Coding a search engine using TF-IDF and cosine similarity in Python to
find relevant TED talks

Contents

Objective    -    Pre-procesing    -    TF-IDF    -    Match ranking
Cosine Similarity    -    Conclusion

Objective

The aim of this project is to process search queries of multiple words and produce relevant result from a given database. The program will not include any third-party libraries that 'do it all for you' so we will be calling all the calculations.

Preprocessing

Pre-processing is very important. The data must be prepared for the algorithm to work well.

Preprocessing

Import the required libraries

In [1]:
import pandas as pd
import math
import re

Note that the file path will be different for you.

The dataset can be downloaded at https://data.world/owentemple/ted-talks-complete-list. I removed a the number of additional features in the dataset with the transcription since I won't be using them with TF-IDF. See the column headings below to see which ones I have kept.

Load the data and remove null values

In [2]:
def remove_null_vals(df):
    df = df.dropna()
    return df

def load_data(path):
    # Will infer the headings of each column
    dataframe = pd.read_csv(path)
    return dataframe
In [3]:
PATH = 'C:\\Users\\maxel\\OneDrive\\Search_Engine\\Version_1\\TED_Talks_dataset.csv'
df = load_data(PATH)
df = remove_null_vals(df)

df.head()
Out[3]:
id speaker headline URL description year_filmed duration views_as_of_06162017 tags transcript
0 1 Al Gore Averting the climate crisis http://www.ted.com/talks/view/id/1 With the same humor and humanity he exuded in ... 2006 00:16:17 3177001.0 cars,alternative energy,culture,politics,scien... 0:14\r\r\rThank you so much, Chris.\rAnd it's ...
1 2 Amy Smith Simple designs to save a life http://www.ted.com/talks/view/id/2 Fumes from indoor cooking fires kill more than... 2006 00:15:06 1379328.0 MacArthur grant,simplicity,industrial design,a... 0:11\r\r\rIn terms of invention,\rI'd like to ...
2 3 Ashraf Ghani How to rebuild a broken state http://www.ted.com/talks/view/id/3 Ashraf Ghani's passionate and powerful 10-minu... 2005 00:18:45 790536.0 corruption,poverty,economics,investment,milita... 0:12\r\r\rA public, Dewey long ago observed,\r...
3 4 Burt Rutan The real future of space exploration http://www.ted.com/talks/view/id/4 In this passionate talk, legendary spacecraft ... 2006 00:19:37 1985119.0 aircraft,flight,industrial design,NASA,rocket ... 0:11\r\r\rI want to start off by saying, Houst...
4 5 Chris Bangle Great cars are great art http://www.ted.com/talks/view/id/5 American designer Chris Bangle explains his ph... 2002 00:20:04 859487.0 cars,industrial design,transportation,inventio... 0:12\r\r\rWhat I want to talk about is, as bac...

The columns that we will consider is the decription, tags, heading and transcription. However we must account for the heading, description and tags to carry more weight per word when processing the search query since they give a better description of what the talk will be about.

To reduce the complexity of the search, we shall make a new feature which combines the description, tags and the title to make a more highly weighted description of the talk compared to the transcripted (in terms of description the individual words carry). We shall call this feature exposition.

In [4]:
def combine_tags_description_heading(df):
    s = ' '
    df['exposition'] = [row['headline'] +s+ row['description'] +s+ row['tags'] for index, row in df.iterrows()]
    
    return df
In [5]:
df = combine_tags_description_heading(df)
df['exposition'][0]
Out[5]:
'Averting the climate crisis With the same humor and humanity he exuded in "An Inconvenient Truth," Al Gore spells out 15 ways that individuals can address climate change immediately, from buying a hybrid to inventing a new, hotter brand name for global warming. cars,alternative energy,culture,politics,science,climate change,environment,sustainability,global issues,technology'

We have added a space when combining the headline, description and tags to ensure the words are spilt when tokenizing

Pre-processing the data

The data must be pre-processed in order to ensure we achieve maximum accuracy with the search. We shall conduct stemming and lemmetisation on the the transcripts, decription and heading of the talk.

In [6]:
from nltk.corpus import stopwords

print(stopwords.words('english'))
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
In [7]:
def remove_stop_words(text):
    text = re.split(r'[;,.\s]\s*', text)
    stop_words = stopwords.words('english')
    index_to_remove = []
    
    for i in range(len(text)):
        
        # Make the word lower case
        text[i] = text[i].lower()
        
        # Remove the word if the word is a stop word
        if text[i] in stop_words:
            index_to_remove.append(i)
            continue

    for index in index_to_remove[::-1]:
        text.pop(index)
    
    return text

def remove_symbols(text):
    symbols = "!\"#$%&()*+-./:;<=>?@[\]^_`{|}~\n\r0123456789"
    for symbol in symbols:
        text = text.replace(symbol, ' ')
        
    return text

def remove_apostrophes(text):
    return [word.replace('`', '') for word in text]

We must be careful about certain words such as the conversion of U.S to us or apostrophes. "Won't" will be converted to "wont" so won't be removed with the stop words. We must take out the apstrophes separately afterwards.

We are also going to remove numbers since the transcript files contain timings which are not relavent to the search.

Stemming and Lemmatization

Stemming converts words to its stem. For example running and ran is converted to run based on some set of rules. This is what we want since it does really make a difference which tense the word is in for our search query. We are going to use a library for this because coding a stemmer seems like it would be boring and not much would be gained from it. The library will be Porter-Stemmer which identifies and removes the suffix or affix of a word (the attachments on the words).

In [8]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('running'), stemmer.stem('played'), stemmer.stem('undecided')) 
run play undecid

Lemmetisation is reducing a word to its root synonym. Unlike stemming, lemmatisation will produce a word that is in a set dictionary. We shall use stemming for simplicity.

In [9]:
def stemming(text):
    stemmer = PorterStemmer()
    
    for i in range(len(text)):
        text[i] = stemmer.stem(text[i])
    
    return text

Putting this all together in a function we get

In [10]:
def preprocess(df):
    for target_text in ['exposition', 'transcript']:
        for index, row in df.iterrows():
            text = row[target_text]
            
            text = remove_symbols(text)
            text = remove_stop_words(text)
            text = remove_apostrophes(text)
            text = stemming(text)
            
            df[target_text][index] = text
    
    return df        

May take a while to run depending on the dataset size

In [11]:
df = preprocess(df)
df.head()
C:\Users\maxel\Anaconda3\envs\Machine_Learning\lib\site-packages\ipykernel_launcher.py:11: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  # This is added back by InteractiveShellApp.init_path()
Out[11]:
id speaker headline URL description year_filmed duration views_as_of_06162017 tags transcript exposition
0 1 Al Gore Averting the climate crisis http://www.ted.com/talks/view/id/1 With the same humor and humanity he exuded in ... 2006 00:16:17 3177001.0 cars,alternative energy,culture,politics,scien... [, thank, much, chri, truli, great, honor, opp... [avert, climat, crisi, humor, human, exud, inc...
1 2 Amy Smith Simple designs to save a life http://www.ted.com/talks/view/id/2 Fumes from indoor cooking fires kill more than... 2006 00:15:06 1379328.0 MacArthur grant,simplicity,industrial design,a... [, term, invent, i'd, like, tell, tale, one, f... [simpl, design, save, life, fume, indoor, cook...
2 3 Ashraf Ghani How to rebuild a broken state http://www.ted.com/talks/view/id/3 Ashraf Ghani's passionate and powerful 10-minu... 2005 00:18:45 790536.0 corruption,poverty,economics,investment,milita... [, public, dewey, long, ago, observ, constitut... [rebuild, broken, state, ashraf, ghani', passi...
3 4 Burt Rutan The real future of space exploration http://www.ted.com/talks/view/id/4 In this passionate talk, legendary spacecraft ... 2006 00:19:37 1985119.0 aircraft,flight,industrial design,NASA,rocket ... [, want, start, say, houston, problem, we'r, e... [real, futur, space, explor, passion, talk, le...
4 5 Chris Bangle Great cars are great art http://www.ted.com/talks/view/id/5 American designer Chris Bangle explains his ph... 2002 00:20:04 859487.0 cars,industrial design,transportation,inventio... [, want, talk, background, idea, car, art, act... [great, car, great, art, american, design, chr...

We don't want to run this every time we want to calculate TF-IDF so we will save if to a csv file after converting the exposition and transcription into lists.

In [12]:
df['exposition'] = [','.join(row['exposition']) for i, row in df.iterrows()]

df['transcript'] = [','.join(row['transcript']) for i, row in df.iterrows()]
In [14]:
df.to_csv('processed_TED_Talks.csv', index = False)
TF_IDF_cosine_similarity

TF-IDF

Importing libraries

In [1]:
from nltk.corpus import stopwords
from collections import Counter

import numpy as np
import pandas as pd
import math

Using numpy arrays for faster access and pandas to make use of the dataframe class.

Loading the data

In [2]:
def load_data(path):
    dataframe = pd.read_csv(path)
    return dataframe
In [3]:
PATH = 'C:\\Users\\maxel\\OneDrive\\Search_Engine\\Version_1\\processed_TED_Talks.csv'
df_info = load_data(PATH)
N = len(df_info)
In [4]:
processed_exposition = np.array([np.array(row['exposition'][1:].split(',')) for i, row in df_info.iterrows()])
processed_transcript = np.array([np.array(row['transcript'][1:].split(',')) for i, row in df_info.iterrows()])

Convert the processed expostition and transcript into a numpy array for a faster speed.

Calculating DF for both the exposition and and the transcript

DF - document frequency - how many documents the word appears in.

In [5]:
DF = {}

for i in range(N):
    tokens = processed_exposition[i]
    for w in tokens:
        try:
            DF[w].add(i)
        except:
            DF[w] = {i} # Set type

    tokens = processed_transcript[i]
    for w in tokens:
        try:
            DF[w].add(i)
        except:
            DF[w] = {i} # Set type
            
for i in DF:
    DF[i] = len(DF[i]) # Get the number of documents the token appeared in 
In [6]:
total_vocab_size = len(DF)
total_vocab_size
Out[6]:
40737
In [7]:
total_vocab = list(DF.keys())

We need a function that will return the document frequency of a word

In [8]:
def get_DF(word):
    try:
        return DF[word]
    except:
        return 0

Calculating TF-IDF

Where:

  • tf is the term frequency
  • f is the frequency of a given word
  • d is the given document

The term frequency is given by the equation above. The choice can be made as to which equation is used. The first equation is simply a fraction of the words that is the token. The second gives a logarithmic way of describing the frequency of the words. We shall be using the logarithmic equation.

Where:

  • idf is the inverse document frequency
  • N is the number of documents
  • n is the document frequency of the given token

There is also choice in the way that inverse document frequency is calcuated using the equations above. Jusitification for the use of this have included attempts to connect it to Zipf's law https://en.wikipedia.org/wiki/Zipf%27s_law, information theory or a probability. See more here https://en.wikipedia.org/wiki/Tf%E2%80%93idf#Justification_of_idf.

We shall be using the smooth idf equation. The use of the plus one to the denominator is so that an undefined error does not occur. 1 will also be added to N to ensure that we don't have to compute the log of 0 (although this is not written in the equation).

We can now make a function which will calculate the tf-idf weights of each of the documens.

We will calculate the TF-IDF weights separately for the exposition and the transcript then merge then according to some given weight. The weight (called alpha) will try best describe how much more useful the exposition is at descibing the talk compared to the transcription per word.

Calculating TF-IDF for transcript

In [9]:
tf_idf = {}

for i in range(N):
    
    tokens = processed_transcript[i]
    
    counter = Counter(list(tokens) + list(processed_exposition[i]))
    words_count = len(tokens) + len(processed_exposition[i])
    
    for token in np.unique(tokens):
        
        tf = counter[token]/words_count
        df = get_DF(token)
        idf = 1 + np.log((N+1)/(df+1))
        
        tf_idf[i, token] = tf*idf

Calculate the TF-IDF for the exposition

In [10]:
tf_idf_expo = {}

for i in range(N):
    tokens = processed_exposition[i]
    
    counter = Counter(list(tokens) + list(processed_transcript[i]))
    words_count = len(tokens) + len(processed_transcript[i])

    for token in np.unique(tokens):
        
        tf = counter[token]/words_count
        df = get_DF(token)
        idf = np.log((N+1)/(df+1))
        
        tf_idf_expo[i, token] = tf*idf
In [11]:
tf_idf[(1948,"quantum")]
Out[11]:
0.2378347668491331
In [12]:
tf_idf_expo[(1948,"quantum")]
Out[12]:
0.1876029275910187
In [13]:
len(tf_idf)
Out[13]:
1086962

Merging the TF-IDF according to the weights

Merge them into one weight matrix according to the alpha weighting

In [14]:
alpha = 0.7
In [15]:
for i in tf_idf:
    tf_idf[i] = (1 - alpha) * tf_idf[i]
    
    try:
        tf_idf[i] += alpha * tf_idf_expo[i]
    except:
        pass

Matching score simularity

Here, the preprocess import is for a python script which simply contains the preprocessing functions written earlier.

In [16]:
from preprocess import preprocess_query

Matching score is simply how close the search query is in terms of its tfidf weights. Think of it as how close two points are in 2d or 3d space, you may struggle imagining a 40 thousand dimension space. The program below uses Manhatten distance (the blocky distance) where rather than drawing a straight line between the points, lines are drawn parallel to the axis and the sum of these distances are used. You can think of this as going along the edges of a cube rather than cutting throught the middle to get between opposite corners.

In [17]:
def matching_score(k, query):
    preprocessed_query = preprocess_query(query)
    tokens = preprocessed_query

    print("Matching Score")
    print("Search Query:", query, '\n')
    
    query_weights = {}

    for key in tf_idf:
        
        if key[1] in tokens:
            try:
                query_weights[key[0]] += tf_idf[key]
            except:
                query_weights[key[0]] = tf_idf[key]
    
    query_weights = sorted(query_weights.items(), key=lambda x: x[1], reverse=True)
    
    for i in query_weights[:k]:
        print(i[0], df_info['headline'][i[0]])
In [18]:
matching_score(10, "Quantum biology")
Matching Score
Search Query: Quantum biology 

1948 How quantum biology might explain life’s biggest questions
1198 The levitating superconductor
912 Making sense of a visible quantum object
970 Making matter come alive
706 The bio-future of joint replacement
1347 Print your own medicine
1159 What's left to explore?
2155 Clues to prehistoric times, found in blind cavefish
1124 How can technology transform the human body?
166 Using biology to rethink the energy challenge

These results seem to be giving the right answers. But we can improve this even more.

TF-IDF Cosine Similarity Ranking

The equation above can be used to find the angle between two vectors. This can be used to find the angle between the search query and the document vectors. From this we can deduce that the smaller the angle the better the query matches the document. Cosine similarity is used when the magnitude of the vector doesn't matter. This is so since we are don't mind how long the talks are.

In [19]:
def cosine_sim(a, b):
    cos_sim = np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))
    return cos_sim
In [20]:
D = np.zeros((N, total_vocab_size))
for i in tf_idf:
    try:
        ind = total_vocab.index(i[1])
        D[i[0]][ind] = tf_idf[i]
    except:
        pass
In [21]:
def gen_vector(tokens):
    
    Q = np.zeros((len(total_vocab)))
    
    counter = Counter(tokens)
    words_count = len(tokens)

    query_weights = {}
    
    for token in np.unique(tokens):
        
        tf = counter[token]/words_count
        df = get_DF(token)
        idf = math.log((N+1)/(df+1))

        try:
            ind = total_vocab.index(token)
            Q[ind] = tf*idf
        except:
            pass
    return Q
In [22]:
def cosine_similarity(k, query):
    print("Cosine Similarity")
    preprocessed_query = preprocess_query(query)
    tokens = preprocessed_query
    
    print("Search Query:", query, '\n')
    
    d_cosines = []
    query_vector = gen_vector(tokens)
    
    for d in D:
        d_cosines.append(cosine_sim(query_vector, d))
        
    out = np.array(d_cosines).argsort()[-k:][::-1]
    
    for i in out[:k]:
        print(i, df_info['headline'][i])
In [23]:
Q = cosine_similarity(10, "Groundwater")
Cosine Similarity
Search Query: Groundwater 

2031 4 ways we can avoid a catastrophic drought
2035 The mysterious world of underwater caves
173 Why aren't we more compassionate?
1734 An engineer's vision for tiny forests, everywhere
940 Let's take back the Internet!
2178 The taboo secret to better health
2155 Clues to prehistoric times, found in blind cavefish
554 The ancient ingenuity of water harvesting
2020 What happens when a city runs out of room for its dead
2105 Hunting for dinosaurs showed me our place in the universe

We don't want to be computing the TF-IDF matrix every time we want to search something so we will save it to a file (although this process itself does take some time to load up again). We will use the npy file extension which has faster read times than a plain text or csv file.

In [24]:
# Save to a file
np.save('tf_idf_dict.npy', tf_idf)
In [25]:
read_tf_idf = np.load('tf_idf_dict.npy', allow_pickle = True).item()

Code for the project

Find the code on my Github below.

Github