Movie Recommender

1 Document retrieval

This movie recommender is based on textbook implementations in python of various ranking functions, specifically cosine similarity, tf-idf, and BM25 ranking functions.

Our general setup is as follows. We are given a corpus $D$ consisting of documents $d \in D$. For a given query $Q$ consisting of keywords $q_1,\ldots, q_n$ we use a ranking function to assign a score to each document. The documents will be ranked according to this score, with higher score corresponding to a higher document relevance.

2 Dataset

We used Wikipedia Movies Plots dataset from Kaggle. We keep our implementation as simple as possible to illustrate the concepts of document retrieval, and so we will only use the columns Title and Plot.

3 BoW

Given a text, the bag of words representation is a count of the frequency of each word in the text. For example:

Text	Bag of Words
“John likes to watch movies. Mary likes movies too. Mary also likes to watch football games.”	{“John”:1, “likes”:3, “to”:2, “watch”:2, “movies”:2, “Mary”:2, “too”:1, “also”:1, “football”:1, “games”:1}

Note that this representation does not capture any information involving word order; the bag of words representations of the texts:

Yesterday, investors were rallying, but today, they are retreating.
Yesterday, investors were retreating, but today, they are rallying.

are identical.

4 Cosine Similarity

Given two texts, consider their representations as bag of words, call them $A$ and $B$. We can consider these bag of words as sparse vectors in a vector space with a basis consisting of all words, and coefficients then given by the word frequencies.

The cosine similarity between $A$ and $B$ is the cosine of the angle $\theta$ between these two vectors, and is given by:

\[ \text{cosine} (\theta) = \frac{A \cdot B}{\|A\| \|B\|}. \]

In this case, the cosine similarity takes values in $[0,1]$; this is because the coefficients of a bag of words are a frequency count, and cannot be negative. Values closer to $1$ indicate a closer match, and values closer to $0$ indicate a worse match.

5 tf-idf

Given a term $t$ and a document $d$, tf-idf is the product of two statistics: term frequency (tf) and inverse document frequency (idf).

The term frequency is defined as:

\[ \text{tf}(t,d) = \frac{f_{t,d}}{ \sum_{t' \in d} f_{t', d} } \]

where

$f_(t,d)$ is the number of times term $t$ appears in document $d$,
the denominator is the total number of terms in the document $d$.

Observe $\text{tf}(t,d) \in [0,1]$; it is $0$ if the term does not appear, and $1$ if the document consists of the term and nothing else.

The inverse document frequency is the “logarithmically scaled inverse fraction of the documents which contain the term”. It is defined as:

\[ \text{idf}(t) = \log (\frac{N}{n_t}) \]

where

$N$ is the total number of documents in the corpus
$n_t$ is the count of documents $d in D$ where the term appears, i.e., $|{d in D | t in d}|$.

Hence $\tfrac{n_t}{N}$ is the fraction of documents where the term appears; $\tfrac{N}{n_t}$ is the inverse of this fraction. Observe $\tfrac{n_t}{N} \in [0,1]$; hence $\text{idf}(t) \in [0,\infty]$ To avoid division by zero, it is common to adjust the numerator / denominator as $1+N$ / $1+n_t$, respectively.

Given a query $Q$ consisting of terms $q_1,...,q_n$ and a document $d$ belonging to a corpus $D$, the tf-idf score is calculated as:

\[ \text{tf-idf} (Q, d) = \sum_{q_i \in Q} \text{tf-idf} (q_i, d) =\sum_{q_i \in Q} \text{tf} (q_i, d) \cdot \text{idf}(q_i) \]

We can summarize its behavior as follows:

As a term $t$ increases in frequency in the document $d$ we have: $\text{tf}(t,d) \to 1$.
Likewise as the term $t$ decreases in frequency in the document $d$ we have: $\text{tf}(t,d) \to 0$.
As a term $t$ increases in frequency in the corpus $D$, we have: $\tfrac{n_t}{N} \to 1^-$ and hence $\text{idf}(t) = \log(\tfrac{N}{n_t}) \to 0^+$.
As a term $t$ decreases in frequency in the corpus $D$, we have: $\tfrac{n_t}{N} \to 0^+$ and hence $\text{idf}(t) = \log(\tfrac{N}{n_t}) \to \infty$.

6 BM25

The BM25 score can be viewed as a modified version of tf-idf with different asymptotic behaviors. Given a query $Q$ consisting of terms $q_1,...,q_n$ and a document $d$ belonging to a corpus $D$, the BM25 score is calculated as:

\[ \text{BM}25(Q,d) = \sum_{q_i \in Q} \text{idf}(q_i) \frac{f_{q_i,d} \cdot (k + 1) }{ f_{q_i,d} + k (1 - b + b \tfrac{|d|}{\text{avgdl}}) } \]

where

$k$ and $b$ are free parameters, often set to $k=1.2$ or $k=2.0$ and $b=0.75$,
$|d|$ is the length of the document $d$,
$\text{avgdl}$ is the average length of a document in the corpus,
$f_{q_i,d}$ is the number of times the term $q_i$ appears in the document $d$,
$\text{idf}(q_i) = \log ( \frac{N - n_{q_i} + 0.5}{n_{q_i} + 0.5} + 1)$

We can summarize its behavior as follows:

As a term $t$ increases in frequency in the document $d$ we have: $\tfrac{f_{q_i,d} \cdot (k + 1) }{ f_{q_i,d} + k (1 - b + b \tfrac{|d|}{\text{avgdl}}) } \to k+1$
From the above, we see how the term $k$ controls the asymptotic behavior as a term increases in frequency in a document.
Likewise as the term $t$ decreases in frequency in the document $d$ we have: $\tfrac{f_{q_i,d} \cdot (k + 1) }{ f_{q_i,d} + k (1 - b + b \tfrac{|d|}{\text{avgdl}}) } \to 0$
As the length of $d$ increases we have: $ $
As the length of $d$ decreases we have: $\tfrac{f_{q_i,d} \cdot (k + 1) }{ f_{q_i,d} + k (1 - b + b \tfrac{|d|}{\text{avgdl}}) } \to \tfrac{f_{q_i,d} \cdot (k + 1) }{ f_{q_i,d} + k (1 - b ) }$
As a term $q_i$ increases in frequency in the corpus $D$, we have: $\tfrac{n_{q_i} + 0.5}{N - n_{q_i} + 0.5} \to \infty$ and hence $\text{idf}(q_i) \to 0^+$.
As a term $q_i$ decreases in frequency in the corpus $D$, we have: $\tfrac{n_{q_i} + 0.5}{N - n_{q_i} + 0.5} \to 0$ and hence $\text{idf}(t) \to \infty$.

7 Python implementation

Below find a straightforward implementation of the described ranking functions.

Ranking function implementations

import heapq
import math
import re
from collections import Counter, defaultdict


class Document_Data:
    def __init__(self, title, text):
        self.title = title

        # Represent the text as a Bag of Words
        self.words = self._get_words(text)
        self.unique_words = set(self.words)
        self.bow = Counter(self.words)

        self.doc_length = len(self.words)
        self.magnitude = math.sqrt(sum(v**2 for _, v in self.bow.items()))

    def _get_words(self, text):
        """
        Given a string 'text':
        1. standardize by converts to lowercase and removing punctuation
        2. split on whitespace
        """
        text = text.lower()
        text = re.sub(r"[^a-z0-9\s]", "", text)
        return text.split()


class Ranking_Functions:
    def __init__(self, titles, plots):
        """
        Given movie titles and plot summaries:
        1. construct the vocabulary
        2. represent each document in the corpus as a BoW
        3. for each term calculate the inverse document frequency
        """
        self.corpus = [
            Document_Data(title, plot) for title, plot in zip(titles, plots)
        ]
        self.avg_len = 0
        self.vocabulary = set()
        self.word_to_doc = defaultdict(set)
        self.inverse_document_frequency = {}
        self.inverse_document_frequency_bm25 = {}

        # 1. construct vocabulary
        for doc in self.corpus:
            self.vocabulary.update(doc.unique_words)

        # 2. construct word to doc lookup
        for doc in self.corpus:
            for word in doc.unique_words:
                self.word_to_doc[word].add(doc)

        # 4. calculate average document length
        self.avg_len = sum(doc.doc_length for doc in self.corpus) / len(
            self.corpus
        )

        # 5. calculate inverse document frequency
        for term in self.vocabulary:
            self.inverse_document_frequency[term] = (
                self._get_inverse_document_frequency(term)
            )

        # 6. calculate inverse document frequency for BM25
        for term in self.vocabulary:
            self.inverse_document_frequency_bm25[term] = (
                self._get_inverse_document_frequency_bm25(term)
            )

    def _get_cosine_similarity(self, doc_a, doc_b):
        """
        Calculates the cosine similarity between two BoWs.
        Formula: (A . B) / (||A|| * ||B||)
        """
        dot_product = 0
        for word in doc_a.unique_words:
            if word in doc_b.unique_words:
                dot_product += doc_a.bow[word] * doc_b.bow[word]

        magnitude_a = doc_a.magnitude
        magnitude_b = doc_b.magnitude

        if magnitude_a == 0 or magnitude_b == 0:
            return 0.0

        cosine_similarity = dot_product / (magnitude_a * magnitude_b)

        return cosine_similarity

    def _get_term_frequency(self, term, doc):
        """
        Given a term and a document represented as a BoW, the term frequency is computed as:
            f_{t,d} / sum_{t' in d} f_{t',d}
        where:
        - f_{t,d} is the number of times the term t appears in document d
        - sum_{t' in d} f_{t',d} is the total number of terms in document d

        """
        if term not in doc.bow:
            return 0
        f_td = doc.bow[term]
        return f_td / doc.doc_length

    def _get_inverse_document_frequency(self, term):
        """
        Given a term the inverse document frequency of a corpus is computed as:
            log(1+N / 1+n_term)
        where:
        - N = size of the corpus
        - n_term = number of documents in the corpus where the term appears
        """
        n_term = len(self.word_to_doc[term])
        N = len(self.corpus)
        return math.log((1 + N) / (1 + n_term))

    def _get_inverse_document_frequency_bm25(self, term):
        """
        Given a term the inverse document frequency of a corpus is computed as:
            log ( (N - n_term + 0.5) / (n_term + 0.5) + 1)
        where:
        - N = size of the corpus
        - n_term = number of documents in the corpus where the term appears
        """
        n_term = len(self.word_to_doc[term])
        N = len(self.corpus)
        return math.log((N - n_term + 0.5) / (n_term + 0.5) + 1)

    def _get_tfidf(self, query, doc):
        """
        Given a query and a document both represented as a BoW, returns the tf-idf score, computed as:
            sum_{t in query} tf-idf(t,d)
        where for a single term t:
            tf-idf(t,d) = tf(t,d) * idf(t)
        """
        score = 0
        for term in query.bow:
            if term in self.vocabulary:
                score += (
                    self._get_term_frequency(term, doc)
                    * self.inverse_document_frequency[term]
                    * query.bow[term]
                )
        return score

    def _get_BM25(self, query, doc, b, k):
        """
        Given a query and a document both represented as a BoW, returns the BM25 score, computed as:
            BM25()
        """
        score = 0
        for term in query.bow:
            if term in self.vocabulary and term in doc.bow:
                numerator = doc.bow[term] * (k + 1)
                denominator = doc.bow[term] + k * (
                    1 - b + b * (doc.doc_length / self.avg_len)
                )
                score += (
                    (numerator / denominator)
                    * self.inverse_document_frequency_bm25[term]
                    * query.bow[term]
                )
        return score

    def search_cosine_similarity(self, query, n=10):
        """
        Given a query, return the titles of the top n documents, ranked via cosine similarity.
        """
        query_data = Document_Data("Query", query)

        # Scores is a min heap
        scores = []

        for doc in self.corpus:
            score = self._get_cosine_similarity(query_data, doc)
            if len(scores) < n:
                heapq.heappush(scores, (score, doc.title))
            else:
                heapq.heappushpop(scores, (score, doc.title))

        return heapq.nlargest(n, scores)

    def search_tfidf(self, query, n=10):
        """
        Given a query, return the titles of the top n documents, ranked via tf-idf.
        """
        query_data = Document_Data("Query", query)

        # Scores is a min heap
        scores = []

        for doc in self.corpus:
            score = self._get_tfidf(query_data, doc)
            if len(scores) < n:
                heapq.heappush(scores, (score, doc.title))
            else:
                heapq.heappushpop(scores, (score, doc.title))

        return heapq.nlargest(n, scores)

    def search_bm25(self, query, b=0.75, k=1.2, n=10):
        """
        Given a query, return the titles of the top n documents, ranked via BM25.
        """
        query_data = Document_Data("Query", query)

        # Scores is a min heap
        scores = []

        for doc in self.corpus:
            score = self._get_BM25(query_data, doc, b, k)
            if len(scores) < n:
                heapq.heappush(scores, (score, doc.title))
            else:
                heapq.heappushpop(scores, (score, doc.title))

        return heapq.nlargest(n, scores)

Example usage

titles = ['Over Her Dead Body', 'Walk on the Wild Side', 'Gangster Story', 'Atlantic', 'The Arena']
plots = ['Kate and Henry are a happy couple. Henry proposed to Kate and they are about to be married, but on the day of their wedding, Kate is accidentally killed by an ice sculpture angel, because of the actions of an ice sculptor (Stephen Root). Unaware that she has died and her soul left her body, Kate awakens in Purgatory, and wastes precious time arguing with an angel who finally leaves before she can explain to Kate what she must do to move on.\r\nA year later, Henry\'s sister Chloe (Lindsay Sloane) hopes that he will find closure by consulting Ashley (Lake Bell), a psychic who also runs a catering business with her gay best friend Dan (Jason Biggs). After an unsuccessful first meeting, Chloe gives Kate\'s diary to Ashley so that she can pretend to communicate with Kate and convince Henry to move on with his life. In the process, Henry and Ashley fall for each other... much to the consternation of Kate, who has been watching over Henry. When Kate voices her displeasure, Ashley hears her, unaware of what it means.\r\nAngry over Ashley\'s deception and uncertain of what she is supposed to do, Kate later encounters the ice sculptor, and discovers that he is also a ghost (a result of a drunk driving accident). He explains to her that they must deal with their unfinished business. Believing that her job is to protect Henry, Kate proceeds to harass Ashley (who is the only one who can see or hear Kate). Using her ghostly abilities of intangibility, levitation, and auditory hallucination, Kate hopes to force Ashley to break up with Henry. Ashley persists, but then Henry discovers the fraud with the diary and breaks off the relationship. Despondent over the break-up, Ashley turns to Dan for solace, but is further distraught when Dan reveals that he is not gay and has secretly been in love with her for years. Over time, Ashley and Dan eventually reconcile.\r\nAfter several months of watching Henry fall back into a depressed funk, Kate encounters the sculptor once more, who points out that if she had resolved her unfinished business, she would have moved on to Heaven by now. When the sculptor asks her what she really wants, Kate reluctantly admits that she only wants Henry to be happy... and realizes that he could be happy with Ashley. Then the sculptor reveals that Kate was his unfinished business and he had to get her to do the right thing before moving on, which he does. Kate first attempts to convince Ashley to get back together with Henry but Ashley doesn\'t believe her change of heart, and is preparing to fly to Las Vegas with Dan. In desperation, Kate finds she is able to talk to Henry through his pet parrot and gets him to meet Ashley at the airport. Realizing that Henry has forgiven her and that she has Kate\'s blessing, Ashley joyfully embraces with Henry. At their wedding, Ashley delays her walk down the aisle to sit briefly in the back pew, to promise Kate that she will strive to make Henry happy. Also at the wedding, Dan makes a new connection with Chloe. Now ready to move on, Kate arrives once more in Purgatory, congratulated for her efforts by the angel and requests the "orb of true light" collected from Kate\'s loved ones. The angel leaves once again, leaving Kate in Purgatory.', "During the Great Depression, Dove Linkhorn and Kitty Tristram meet on the road in Texas as each travels separately to New Orleans. They decide to travel together, hitchhiking and hopping freight trains. Dove is hoping to find his lost love Hallie Gerard, and is not interested when Kitty comes on to him sexually.\r\nAfter Kitty steals from the New Orleans-area café where she and Dove stop for a meal, he leaves her and makes things right with the owner, Teresina Vidaverri. She gives Dove a job at the café and a place to stay while he searches for Hallie. He finds her working at the Doll House, an upscale French Quarter bordello, where Jo Courtney is the madam.\r\nLater it is revealed that, after Jo's husband lost his legs in an accident, she lost interest in him. A lesbian relationship is suggested between Jo and Hallie, who is supported by the owner in pursuing her interest in sculpting on the side. But Hallie still works for Jo as a prostitute like the other women. Hallie is unhappy with her life at Jo's, but does not want to give up her comforts to risk married life with Dove.\r\nMeanwhile, Kitty starts working at the bordello after Jo bails her out of jail, where she had been confined for vagrancy. Seeing that Kitty and Dove appear to know each other, Jo questions Kitty about her past, and learns that she traveled with Dove from Texas to Louisiana. Jo threatens Dove with arrest for transporting the underage Kitty across state lines for immoral purposes and for statutory rape, unless he leaves New Orleans without Hallie. As Dove leaves the bordello, the bouncer, another employee, and Jo's husband beat him viciously. Kitty watches from upstairs.\r\nKitty helps Dove return to the café, where Teresina cares for him. The younger woman goes back to the bordello to get Hallie, helping her reach the café. When Hallie can't be found at the bordello, Kitty is suspected and put under pressure; frightened, she brings Jo and her three henchmen to the café. During the ensuing struggle among the men, Hallie is shot and killed by a stray bullet. On the front-page of a newspaper is a story reporting that Kitty's testimony sent Jo and several others from the bordello to prison.", "A mobster is hiding from the law in a small town and he's running out of money, so he robs a bank and rakes in some big bucks. However, now, not only are the cops after him, but so is the local mob boss who is jealous that an outsider pulled such a job in his territory, and especially without giving him a piece of the pie.", 'Atlantic is a drama film based on the RMS\xa0Titanic and set aboard a fictional ship, called the Atlantic. The main plotline revolves around a man who has a shipboard affair with a fellow passenger, which is eventually discovered by his wife. The ship also has aboard an elderly couple, the Rools, who are on their anniversary cruise. Midway across the Atlantic Ocean, the Atlantic strikes an iceberg and is damaged to the point where it is sinking into the Atlantic. A shortage of lifeboats causes the crew to only allow women and children in (though the captain allows a few men to take to the last remaining boats as the disaster reaches its zenith) and many couples are separated. Mrs. Rool refuses to leave her husband and after the boats are gone all the passengers gather on the deck and sing "Nearer, My God, to Thee" as the Atlantic sinks into the ocean. The final scenes depict a group of passengers saying the Lord\'s Prayer in a flooding lounge.', 'The Arena, set in the ancient Roman town of Brundusium, is about a group of slave girls who are sold to a man named Timarchus (Daniele Vargas), the organizer of the events that take place in the town’s colosseum. After a fight breaks out amongst the girls, Timarchus gets the idea of putting the women in the ring to fight to the death. The recently captured Mamawi (Pam Grier) and Bodicia (Margaret Markov) realize they must stick together if they are to survive.']

movie_ranker = Ranking_Functions(titles, plots)

query = "travel adventure ocean"

recommendations = movie_ranker.search_bm25(query, n=3)

for score, title in recommendations:
        print(f"Score: {score}, Title: {title}")

Score: 2.1030016428592933, Title: Atlantic
Score: 1.14813126746257, Title: Walk on the Wild Side
Score: 0, Title: The Arena