Picture this: you’re shopping online for a high-performance laptop.
You click on one, and several similar suggestions pop up. They’re close to what you’re looking for but not the same. How does the website know which ones are relevant to your search?
This is where cosine similarity comes in – a mathematical tool that measures the similarity between two non-zero vectors in a high-dimensional space.
Vector databases in search engines and recommendation systems use cosine similarity to understand how closely products or search queries in a database match based on their vector representations. Understanding these relationships and patterns among items allows vector database solutions to retrieve personalized, spot-on suggestions that keep you browsing – and maybe even buying.
Cosine similarity is a mathematical metric that measures the similarity between two vectors in an inner product space, such as a multi-dimensional feature space. It uses the cosine of the angle between two vectors to determine whether they point in the same direction, irrespective of their magnitudes.
In other words, cosine similarity is the dot or scalar product of two vectors divided by the product of their magnitudes. It is also known as Orchini similarity and Tucker coefficient of congruence.
Data scientists, machine learning engineers, and software developers use cosine similarity to compare thousands of data points and understand their relationships without getting lost in the details. It is widely used for measuring similarity in text mining, information retrieval, and text analysis applications.
Other popular similarity measures include Euclidean distance, Manhattan distance, Jaccard similarity, and Minkowski distance.
Cosine similarity provides a robust way to evaluate semantic similarity among high-dimensional, sparse documents, datasets, and images. It’s effective because it focuses on the orientation of two vectors in a space, measuring their similarity regardless of their magnitude.
Text analysis applications that use term frequency-inverse document frequency (TF-IDF), Word2Vec, or bidirectional encoder representations from transformers (BERT) produce word and document vectors with many dimensions but little overlap. Distance metrics like Euclidean distance are sensitive to vector magnitude and become less informative for such vectors, whereas cosine similarity focuses on how the vectors are oriented and captures the correlation in the data.
Cosine similarity also helps information retrieval applications rank documents based on how well they match a query, even when documents vary in length or complexity.
Cosine similarity’s scalability in high-dimensional vector space makes it invaluable for vector databases, where finding nearest neighbors quickly and accurately is necessary for image retrieval, recommendation systems, and anomaly detection.
Natural language processing (NLP) relies on cosine similarity to efficiently compare vector embeddings. This embedding comparison aids NLP algorithms in classifying, clustering, or recommending content based on the semantic similarity of documents.
Cosine similarity quantifies the similarity between two vectors by calculating the cosine of the angle between them. Below is the breakdown of how cosine similarity works in high-dimensional, sparse data environments.
Dividing the dot product by the product of the vectors' magnitudes normalizes the similarity result to a range between -1 and 1. This normalization ensures that the similarity score reflects only the orientation angle of vectors, not their magnitude. It consistently measures the similarity of vectors, regardless of the scale of the data.
To calculate cosine similarity, first find the dot product of the two vectors. Then multiply the magnitudes of those two vectors together. Finally, divide the dot product by the product of the magnitudes to get the cosine similarity score.
The cosine similarity score between two vectors, A and B, is calculated using the formula below:
Cosine Similarity (A, B) = (A·B) / (||A|| * ||B||)
Where,
A · B is the dot product of the vectors A and B
||A|| and ||B|| represent the length or magnitude of the two vectors A and B
||A|| * ||B|| denotes the product of magnitudes of vectors A and B
Cosine similarity ranges between -1 and 1.
A score of 1 means the vectors are perfectly aligned or proportional, indicating maximum similarity.
A score of 0 implies the vectors are orthogonal, meaning they have no similarity.
A score of -1 shows the vectors are perfectly opposite, meaning they point in exactly opposite directions.
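As a quick illustration of these three cases, here is a small NumPy sketch (using the same np.dot and np.linalg.norm calls as the library examples later in this article) with simple, hypothetical two-dimensional vectors:
# import required libraries
import numpy as np
# cosine similarity: dot product divided by the product of the magnitudes
def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
a = np.array([2.0, 2.0])
print(cos_sim(a, np.array([4.0, 4.0])))    # 1.0  -> same direction
print(cos_sim(a, np.array([-2.0, 2.0])))   # 0.0  -> orthogonal
print(cos_sim(a, np.array([-1.0, -1.0])))  # -1.0 -> opposite direction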
Let’s calculate the cosine similarity between vectors A and B.
The vector A has values A = {1, 9, 3, 6}, and the vector B has values B = {1, 7, 0, 1}.
Dot product: A·B = 1×1 + 9×7 + 3×0 + 6×1 = 70
Magnitude of A: ||A|| = √(1² + 9² + 3² + 6²) = √(1 + 81 + 9 + 36) = √127 ≈ 11.27
Magnitude of B: ||B|| = √(1² + 7² + 0² + 1²) = √(1 + 49 + 0 + 1) = √51 ≈ 7.14
Cosine Similarity = (A · B) / (||A|| * ||B||) = 70 / (11.27 × 7.14) ≈ 0.87
The cosine similarity between vectors A and B is approximately 0.87, which shows a substantial similarity between them.
Calculating cosine similarity by hand is straightforward, but it quickly becomes impractical for large datasets. In such situations, you can use a programming language like Python along with libraries such as NumPy, scikit-learn, SciPy, and TensorFlow, or tools like MATLAB.
NumPy is a powerful Python library for numerical computations. It supports multi-dimensional array operations, matrices, and mathematical functions, making it ideal for cosine similarity calculations.
# import required libraries
import numpy as np
from numpy.linalg import norm
# define the two vectors as NumPy arrays
A = np.array([1, 9, 3, 6])
B = np.array([1, 7, 0, 1])
print("A:", A)
print("B:", B)
# compute cosine similarity: dot product divided by the product of the norms
cosine = np.dot(A, B) / (norm(A) * norm(B))
print("Cosine Similarity:", cosine)
scikit-learn is a Python-based machine learning library with built-in functions for data analysis tasks, including cosine similarity. Its sklearn.metrics.pairwise module offers a cosine_similarity function that handles both dense and sparse matrices.
# import required libraries
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# define the vectors
A = np.array([[1, 9, 3, 6]])
B = np.array([[1, 7, 0, 1]])
# calculate cosine similarity
cosine_sim = cosine_similarity(A, B)
print("Cosine Similarity:", cosine_sim[0][0])
SciPy is another popular Python-based library for scientific computing. Built on NumPy, SciPy features optimized functions that calculate cosine similarity for large datasets. Its scipy.spatial.distance module includes a cosine distance function; subtracting the cosine distance from 1 gives the cosine similarity.
# import required libraries
import numpy as np
from scipy.spatial.distance import cosine
# define the vectors
A = np.array([1, 9, 3, 6])
B = np.array([1, 7, 0, 1])
# calculate cosine distance
cosine_distance = cosine(A, B)
# calculate cosine similarity
cosine_similarity = 1 - cosine_distance
print("Cosine Similarity:", cosine_similarity)
Gensim is a Python library widely used for topic modeling and natural language processing. It features built-in functions for calculating cosine similarity between a large volume of text documents and word vectors.
# import required libraries
from gensim import corpora
from gensim.models import TfidfModel
from gensim.similarities import MatrixSimilarity
from gensim.utils import simple_preprocess
# sample documents
documents = [
    "I love playing football.",
    "Football is a great sport.",
    "I enjoy watching movies.",
    "Movies are entertaining.",
]
# tokenize and preprocess (lowercase and strip punctuation)
texts = [simple_preprocess(doc) for doc in documents]
# create a dictionary and bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# create a TF-IDF model and transform the corpus
tfidf_model = TfidfModel(corpus)
tfidf_corpus = tfidf_model[corpus]
# create a cosine similarity index over the TF-IDF vectors
index = MatrixSimilarity(tfidf_corpus, num_features=len(dictionary))
# calculate the similarity of the first document with all others
similarities = index[tfidf_corpus[0]]
# print the similarities
print("Cosine Similarities for Document 0:", similarities)
TensorFlow and PyTorch are popular deep learning libraries that help measure similarity between high-dimensional feature vectors.
# import required libraries
import tensorflow as tf
# define the vectors
A = tf.constant([1.0, 9.0, 3.0, 6.0])
B = tf.constant([1.0, 7.0, 0.0, 1.0])
# calculate cosine similarity
def cosine_similarity(A, B):
    # calculate the dot product
    dot_product = tf.reduce_sum(A * B)
    # calculate the norm (magnitude) of the vectors
    norm_A = tf.sqrt(tf.reduce_sum(tf.square(A)))
    norm_B = tf.sqrt(tf.reduce_sum(tf.square(B)))
    # calculate cosine similarity
    cosine_sim = dot_product / (norm_A * norm_B)
    return cosine_sim
similarity = cosine_similarity(A, B)
print("Cosine Similarity:", similarity.numpy())
When using PyTorch, developers can use torch.tensor() to create tensor objects and use the torch.norm() function to calculate the Euclidean norm of the vectors.
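Here is a minimal PyTorch sketch of that approach, reusing the earlier example vectors; it also shows the built-in torch.nn.functional.cosine_similarity helper, which performs the same calculation:
# import required libraries
import torch
import torch.nn.functional as F
# define the vectors as tensors
A = torch.tensor([1.0, 9.0, 3.0, 6.0])
B = torch.tensor([1.0, 7.0, 0.0, 1.0])
# manual calculation: dot product divided by the product of the norms
manual = torch.dot(A, B) / (torch.norm(A) * torch.norm(B))
# built-in helper; dim=0 because A and B are one-dimensional tensors
built_in = F.cosine_similarity(A, B, dim=0)
print("Cosine Similarity (manual):", manual.item())
print("Cosine Similarity (built-in):", built_in.item())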
Consider using the tips below to optimize cosine calculations in Python:
Cosine similarity isn’t the only method for measuring similarity between objects in a data set. Other popular similarity calculation methods are:
Euclidean distance measures the straight-line distance between two points in Euclidean space. It is always zero or positive. In high-dimensional spaces, Euclidean distance becomes less meaningful because pairwise distances tend to concentrate around similar values, making points hard to tell apart.
Cosine similarity is ideal for high-dimensional data or text analysis where vector magnitude isn't essential. Euclidean distance works best for lower-dimensional spaces where vector magnitude is vital.
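To make the contrast concrete, the short sketch below (reusing SciPy and the example vectors from earlier) scales one vector by 10: the Euclidean distance changes drastically, while the cosine similarity stays the same.
# import required libraries
import numpy as np
from scipy.spatial.distance import euclidean, cosine
A = np.array([1, 9, 3, 6])
B = np.array([1, 7, 0, 1])
# scale B by 10: same direction, much larger magnitude
B_scaled = 10 * B
# Euclidean distance is sensitive to the change in magnitude
print("Euclidean distance (A, B):   ", euclidean(A, B))
print("Euclidean distance (A, 10*B):", euclidean(A, B_scaled))
# cosine similarity (1 - cosine distance) is unaffected by the scaling
print("Cosine similarity (A, B):    ", 1 - cosine(A, B))
print("Cosine similarity (A, 10*B): ", 1 - cosine(A, B_scaled))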
Manhattan distance measures the distance between two points along a grid-like path by summing the absolute differences between their coordinates. Unlike Euclidean distance, Manhattan distance is less sensitive to outliers, which is why it's suitable for clustering tasks.
Use Manhattan distance when the absolute differences between coordinates matter, and cosine similarity when the direction of the vectors is more important than their magnitude.
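As a quick sketch, SciPy exposes Manhattan distance as cityblock, shown here with the same example vectors:
# import required libraries
import numpy as np
from scipy.spatial.distance import cityblock
A = np.array([1, 9, 3, 6])
B = np.array([1, 7, 0, 1])
# Manhattan distance: |1-1| + |9-7| + |3-0| + |6-1| = 0 + 2 + 3 + 5 = 10
print("Manhattan distance:", cityblock(A, B))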
Hamming distance compares two binary data strings of equal length by counting the bit positions at which they differ. In other words, it measures the number of positions where the corresponding symbols of two strings do not match.
Hamming distance is always a non-negative integer since the distance is the total count of these mismatches. Classification tasks in machine learning and error detection algorithms use Hamming distance to compare binary vectors.
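A small sketch with SciPy illustrates the calculation; note that SciPy's hamming function returns the fraction of differing positions, so multiplying by the vector length gives the classic count:
# import required libraries
import numpy as np
from scipy.spatial.distance import hamming
# two binary vectors of equal length that differ in two positions
x = np.array([1, 0, 1, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0, 1])
# hamming() returns the proportion of mismatches; multiply by the length for the count
print("Hamming distance:", hamming(x, y) * len(x))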
Jaccard similarity, also known as Jaccard coefficient or Jaccard Index, is another proximity measurement that computes the similarity between two asymmetric binary vectors or objects. You can calculate it by dividing the size of the intersection of sets by the size of the union of sets.
Jaccard similarity is best for comparing the presence or absence of terms, while cosine similarity excels at measuring the angle between vectors in dense data with overlapping terms.
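Because the definition above is simply intersection over union, a plain-Python sketch with two small, made-up sets of terms shows the calculation directly:
# terms that appear in each of two short documents (illustrative only)
doc_a = {"football", "great", "sport"}
doc_b = {"football", "great", "match"}
# Jaccard similarity: size of the intersection divided by the size of the union
jaccard = len(doc_a & doc_b) / len(doc_a | doc_b)
print("Jaccard similarity:", jaccard)  # 2 shared terms / 4 distinct terms = 0.5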
The main advantage is that cosine similarity captures the directional aspect of data without being affected by changes in vector magnitude. Text analysis applications, recommendation systems, and NLP solutions favor it because it is inexpensive to compute and works well with high-dimensional, sparse vectors.
Despite its many benefits, cosine similarity suffers from disadvantages, including:
Cosine similarity is used in information retrieval, text mining, recommendation systems, image processing, document classification, and clustering.
“We use cosine similarity to measure the similarity between the original text and the AI-generated text. It helps us improve the originality of AI-generated text and personalize it for user satisfaction and engagement.”
Robert Brown
Co-founder of AI Humanize
Information retrieval systems like search engines use cosine similarity to find the database documents most relevant to a search query, so users get valuable results. In these cases, text embeddings come from neural network models like Word2Vec and GloVe or large language models (LLMs) like GPT, BERT, and LLaMA.
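As a minimal sketch of this ranking step, the snippet below assumes the query and documents have already been converted to embedding vectors; the three-dimensional vectors and document names are purely illustrative:
# import required libraries
import numpy as np
# illustrative, hard-coded embeddings; a real system would generate these with
# a model such as Word2Vec, GloVe, or a transformer encoder
query = np.array([0.9, 0.1, 0.3])
documents = {
    "doc_laptops": np.array([0.8, 0.2, 0.4]),
    "doc_recipes": np.array([0.1, 0.9, 0.2]),
    "doc_monitors": np.array([0.7, 0.1, 0.5]),
}
def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# rank documents by cosine similarity to the query, most similar first
ranked = sorted(documents, key=lambda name: cos_sim(query, documents[name]), reverse=True)
for name in ranked:
    print(name, round(cos_sim(query, documents[name]), 3))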
Movie streaming platforms like Netflix rely on cosine similarity to share recommendations based on users' watch history. These systems represent each movie and user as a vector. After generating vector embeddings using matrix factorization or autoencoders, they use cosine similarity to recommend movies based on user preferences and past viewing patterns.
Facial recognition systems, medical imaging applications, and self-driving vehicles rely on cosine similarity scores to gauge similarity between images. They use convolutional neural networks to generate embeddings for images and capture visual patterns among them.
Cosine similarity's focus on angles instead of magnitude makes it ideal for content recommendation, text analysis, document clustering, and data mining. It's often the preferred choice for comparing transformer embeddings because of its scale-invariant nature and ability to handle high-dimensional data. Consider the problem and data requirements to find the most appropriate similarity calculation method.
Looking for software with event-driven architecture for real-time data processing? Check out the top real-time analytic database solutions.