Picture this: you’re shopping online for a high-performance laptop.
You click on one, and several similar suggestions pop up. They’re close to what you’re looking for but not the same. How does the website know which ones are relevant to your search?
This is where cosine similarity comes in – a mathematical tool that measures the similarity between two non-zero vectors in a high-dimensional space.
Vector databases in search engines and recommendation systems use cosine similarity to understand how closely products or search queries in a database match based on their vector representations. Understanding these relationships and patterns among items allows vector database solutions to retrieve personalized, spot-on suggestions that keep you browsing – and maybe even buying.
Cosine similarity is a mathematical metric that measures the similarity between two vectors in an inner product space, such as a multi-dimensional feature space. It uses the cosine of the angle between two vectors to determine whether they point in the same direction, irrespective of their magnitudes.
In other words, cosine similarity is the dot or scalar product of two vectors divided by the product of their magnitudes. It is also known as Orchini similarity and Tucker coefficient of congruence.
Data scientists, machine learning engineers, and software developers use cosine similarity to compare thousands of data points and understand their relationships without getting lost in the details. It is widely used for measuring similarity in text mining, information retrieval, and text analysis applications.
Other popular similarity measures include Euclidean distance, Manhattan distance, Jaccard similarity, and Minkowski distance.
Cosine similarity provides a robust way to evaluate semantic similarity among high-dimensional, sparse documents, datasets, and images. It’s effective because it focuses on the orientation of two vectors in a space, measuring their similarity regardless of their magnitude.
Text analysis applications that use term frequency-inverse document frequency (TF-IDF), Word2Vec, or bidirectional encoder representations from transformers (BERT) produce word and document vectors with many dimensions but little overlap. Distance metrics like Euclidean distance are sensitive to vector magnitude and become less informative for such vectors, whereas cosine similarity focuses on how the vectors are oriented and captures the correlation in the data.
Cosine similarity also helps information retrieval applications rank documents based on how well they match a query, even when documents vary in length or complexity.
Cosine similarity’s scalability in high-dimensional vector space makes it invaluable for vector databases, where finding nearest neighbors quickly and accurately is necessary for image retrieval, recommendation systems, and anomaly detection.
Natural language processing (NLP) relies on cosine similarity to efficiently compare vector embeddings. This embedding comparison aids NLP algorithms in classifying, clustering, or recommending content based on the semantic similarity of documents.
Cosine similarity quantifies the similarity between two vectors by calculating the cosine of the angle between them. Below is the breakdown of how cosine similarity works in high-dimensional, sparse data environments.
Dividing the dot product by the product of the vectors' magnitudes normalizes the similarity result to a range between -1 and 1. This normalization ensures that the similarity score reflects only the orientation angle of vectors, not their magnitude. It consistently measures the similarity of vectors, regardless of the scale of the data.
To calculate cosine similarity, first find the dot product of the two vectors. Then multiply the magnitudes of those two vectors together. Finally, divide the dot product by the product of the magnitudes to get the cosine similarity score.
The cosine similarity score between two vectors, A and B, is calculated using the formula below:
Cosine Similarity (A, B) = (A·B) / (||A|| * ||B||)
Where,
A · B is the dot product of the vectors A and B
||A|| and ||B|| represent the length or magnitude of the two vectors A and B
||A|| * ||B|| denotes the product of magnitudes of vectors A and B
Cosine similarity ranges between -1 and 1.
A score of 1 means the vectors are perfectly aligned or proportional, indicating maximum similarity.
A score of 0 implies the vectors are orthogonal, meaning they have no similarity.
A score of -1 shows the vectors are perfectly opposite, meaning they point in exactly opposite directions.
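As a quick illustration of these three cases, here is a small NumPy sketch (using the same np.dot and np.linalg.norm calls as the library examples later in this article) with simple, hypothetical two-dimensional vectors:
# import required libraries
import numpy as np
# cosine similarity: dot product divided by the product of the magnitudes
def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
a = np.array([2.0, 2.0])
print(cos_sim(a, np.array([4.0, 4.0])))    # 1.0  -> same direction
print(cos_sim(a, np.array([-2.0, 2.0])))   # 0.0  -> orthogonal
print(cos_sim(a, np.array([-1.0, -1.0])))  # -1.0 -> opposite direction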
Let’s calculate the cosine similarity between vectors A and B.
The vector A has values A = {1, 9, 3, 6}, and the vector B has values B = {1, 7, 0, 1}.
Dot product: A·B = 1×1 + 9×7 + 3×0 + 6×1 = 70
Magnitude of A: ||A|| = √(1² + 9² + 3² + 6²) = √(1 + 81 + 9 + 36) = √127 ≈ 11.27
Magnitude of B: ||B|| = √(1² + 7² + 0² + 1²) = √(1 + 49 + 0 + 1) = √51 ≈ 7.14
Cosine Similarity = (A · B) / (||A|| * ||B||) = 70 / (11.27 × 7.14) ≈ 0.87
The cosine similarity between vectors A and B is approximately 0.87, which shows a substantial similarity between them.
Calculating cosine similarity by hand is straightforward, but it quickly becomes impractical for large datasets. In such situations, you can use a programming language like Python along with libraries such as NumPy, scikit-learn, SciPy, and TensorFlow, or tools like MATLAB.
NumPy is a powerful Python library for numerical computations. It supports multi-dimensional array operations, matrices, and mathematical functions, making it ideal for cosine similarity calculations.
# import required libraries
import numpy as np
from numpy.linalg import norm
# define the two vectors as NumPy arrays
A = np.array([1, 9, 3, 6])
B = np.array([1, 7, 0, 1])
print("A:", A)
print("B:", B)
# compute cosine similarity: dot product divided by the product of the norms
cosine = np.dot(A, B) / (norm(A) * norm(B))
print("Cosine Similarity:", cosine)
scikit-learn is a Python-based machine learning library with built-in functions for data analysis tasks, including cosine similarity. Its sklearn.metrics.pairwise module offers a cosine_similarity function that handles both dense and sparse matrices.
# import required libraries
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
# define the vectors
A = np.array([[1, 9, 3, 6]])
B = np.array([[1, 7, 0, 1]])
# calculate cosine similarity
cosine_sim = cosine_similarity(A, B)
print("Cosine Similarity:", cosine_sim[0][0])
SciPy is another popular Python-based library for scientific computing. Built on NumPy, SciPy features optimized functions that calculate cosine similarity for large datasets. Its scipy.spatial.distance module includes a cosine distance function; subtracting the cosine distance from 1 gives the cosine similarity.
# import required libraries
import numpy as np
from scipy.spatial.distance import cosine
# define the vectors
A = np.array([1, 9, 3, 6])
B = np.array([1, 7, 0, 1])
# calculate cosine distance
cosine_distance = cosine(A, B)
# calculate cosine similarity
cosine_similarity = 1 - cosine_distance
print("Cosine Similarity:", cosine_similarity)
Gensim is a Python library widely used for topic modeling and natural language processing. It features built-in functions for calculating cosine similarity between a large volume of text documents and word vectors.
# import required libraries
from gensim import corpora
from gensim.models import TfidfModel
from gensim.similarities import MatrixSimilarity
from gensim.utils import simple_preprocess
# sample documents
documents = [
    "I love playing football.",
    "Football is a great sport.",
    "I enjoy watching movies.",
    "Movies are entertaining.",
]
# tokenize and preprocess (lowercase and strip punctuation)
texts = [simple_preprocess(doc) for doc in documents]
# create a dictionary and bag-of-words corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
# create a TF-IDF model and transform the corpus
tfidf_model = TfidfModel(corpus)
tfidf_corpus = tfidf_model[corpus]
# create a cosine similarity index over the TF-IDF vectors
index = MatrixSimilarity(tfidf_corpus, num_features=len(dictionary))
# calculate the similarity of the first document with all others
similarities = index[tfidf_corpus[0]]
# print the similarities
print("Cosine Similarities for Document 0:", similarities)
TensorFlow and PyTorch are popular deep learning libraries that help measure similarity between high-dimensional feature vectors.
# import required libraries
import tensorflow as tf
# define the vectors
A = tf.constant([1.0, 9.0, 3.0, 6.0])
B = tf.constant([1.0, 7.0, 0.0, 1.0])
# calculate cosine similarity
def cosine_similarity(A, B):
    # calculate the dot product
    dot_product = tf.reduce_sum(A * B)
    # calculate the norm (magnitude) of the vectors
    norm_A = tf.sqrt(tf.reduce_sum(tf.square(A)))
    norm_B = tf.sqrt(tf.reduce_sum(tf.square(B)))
    # calculate cosine similarity
    cosine_sim = dot_product / (norm_A * norm_B)
    return cosine_sim
similarity = cosine_similarity(A, B)
print("Cosine Similarity:", similarity.numpy())
When using PyTorch, developers can use torch.tensor() to create tensor objects and use the torch.norm() function to calculate the Euclidean norm of the vectors.
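Here is a minimal PyTorch sketch of that approach, reusing the earlier example vectors; it also shows the built-in torch.nn.functional.cosine_similarity helper, which performs the same calculation:
# import required libraries
import torch
import torch.nn.functional as F
# define the vectors as tensors
A = torch.tensor([1.0, 9.0, 3.0, 6.0])
B = torch.tensor([1.0, 7.0, 0.0, 1.0])
# manual calculation: dot product divided by the product of the norms
manual = torch.dot(A, B) / (torch.norm(A) * torch.norm(B))
# built-in helper; dim=0 because A and B are one-dimensional tensors
built_in = F.cosine_similarity(A, B, dim=0)
print("Cosine Similarity (manual):", manual.item())
print("Cosine Similarity (built-in):", built_in.item())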
Consider using the tips below to optimize cosine calculations in Python:
Cosine similarity isn’t the only method for measuring similarity between objects in a data set. Other popular similarity calculation methods are:
Euclidean distance measures the straight-line distance between two points in Euclidean space. It is always zero or positive. In high-dimensional spaces, Euclidean distance becomes less meaningful because pairwise distances tend to concentrate around similar values, making points hard to tell apart.
Cosine similarity is ideal for high-dimensional data or text analysis where vector magnitude isn't essential. Euclidean distance works best for lower-dimensional spaces where vector magnitude is vital.
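To make the contrast concrete, the short sketch below (reusing SciPy and the example vectors from earlier) scales one vector by 10: the Euclidean distance changes drastically, while the cosine similarity stays the same.
# import required libraries
import numpy as np
from scipy.spatial.distance import euclidean, cosine
A = np.array([1, 9, 3, 6])
B = np.array([1, 7, 0, 1])
# scale B by 10: same direction, much larger magnitude
B_scaled = 10 * B
# Euclidean distance is sensitive to the change in magnitude
print("Euclidean distance (A, B):   ", euclidean(A, B))
print("Euclidean distance (A, 10*B):", euclidean(A, B_scaled))
# cosine similarity (1 - cosine distance) is unaffected by the scaling
print("Cosine similarity (A, B):    ", 1 - cosine(A, B))
print("Cosine similarity (A, 10*B): ", 1 - cosine(A, B_scaled))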
Manhattan distance measures the distance between two points along a grid-like path by summing the absolute differences between their coordinates. Unlike Euclidean distance, Manhattan distance is less sensitive to outliers, which is why it's suitable for clustering tasks.
Use Manhattan distance when the absolute differences between coordinates matter, and cosine similarity when the direction of the vectors is more important than their magnitude.
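As a quick sketch, SciPy exposes Manhattan distance as cityblock, shown here with the same example vectors:
# import required libraries
import numpy as np
from scipy.spatial.distance import cityblock
A = np.array([1, 9, 3, 6])
B = np.array([1, 7, 0, 1])
# Manhattan distance: |1-1| + |9-7| + |3-0| + |6-1| = 0 + 2 + 3 + 5 = 10
print("Manhattan distance:", cityblock(A, B))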
Hamming distance compares two binary data strings of equal length by counting the bit positions at which they differ. In other words, it measures the number of positions where the corresponding symbols of two strings do not match.
Hamming distance is always a non-negative integer since the distance is the total count of these mismatches. Classification tasks in machine learning and error detection algorithms use Hamming distance to compare binary vectors.
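A small sketch with SciPy illustrates the calculation; note that SciPy's hamming function returns the fraction of differing positions, so multiplying by the vector length gives the classic count:
# import required libraries
import numpy as np
from scipy.spatial.distance import hamming
# two binary vectors of equal length that differ in two positions
x = np.array([1, 0, 1, 1, 0, 1])
y = np.array([1, 1, 1, 0, 0, 1])
# hamming() returns the proportion of mismatches; multiply by the length for the count
print("Hamming distance:", hamming(x, y) * len(x))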
Jaccard similarity, also known as Jaccard coefficient or Jaccard Index, is another proximity measurement that computes the similarity between two asymmetric binary vectors or objects. You can calculate it by dividing the size of the intersection of sets by the size of the union of sets.
Jaccard similarity is best for comparing the presence or absence of terms, while cosine similarity excels at measuring the angle between vectors in dense data with overlapping terms.
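Because the definition above is simply intersection over union, a plain-Python sketch with two small, made-up sets of terms shows the calculation directly:
# terms that appear in each of two short documents (illustrative only)
doc_a = {"football", "great", "sport"}
doc_b = {"football", "great", "match"}
# Jaccard similarity: size of the intersection divided by the size of the union
jaccard = len(doc_a & doc_b) / len(doc_a | doc_b)
print("Jaccard similarity:", jaccard)  # 2 shared terms / 4 distinct terms = 0.5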
The main advantage is that cosine similarity captures the directional aspect of data without being affected by changes in vector magnitude. Text analysis applications, recommendation systems, and NLP solutions favor it because it is inexpensive to compute and works well with high-dimensional, sparse vectors.
Despite its many benefits, cosine similarity suffers from disadvantages, including:
Cosine similarity is used in information retrieval, text mining, recommendation systems, image processing, document classification, and clustering.
“We use cosine similarity to measure the similarity between the original text and the AI-generated text. It helps us improve the originality of AI-generated text and personalize it for user satisfaction and engagement.”
Robert Brown
Co-founder of AI Humanize
Information retrieval systems like search engines use cosine similarity to find the database documents most relevant to a search query, so users get valuable results. In these cases, text embeddings come from neural network models like Word2Vec and GloVe or large language models (LLMs) like GPT, BERT, and LLaMA.
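As a minimal sketch of this ranking step, the snippet below assumes the query and documents have already been converted to embedding vectors; the three-dimensional vectors and document names are purely illustrative:
# import required libraries
import numpy as np
# illustrative, hard-coded embeddings; a real system would generate these with
# a model such as Word2Vec, GloVe, or a transformer encoder
query = np.array([0.9, 0.1, 0.3])
documents = {
    "doc_laptops": np.array([0.8, 0.2, 0.4]),
    "doc_recipes": np.array([0.1, 0.9, 0.2]),
    "doc_monitors": np.array([0.7, 0.1, 0.5]),
}
def cos_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
# rank documents by cosine similarity to the query, most similar first
ranked = sorted(documents, key=lambda name: cos_sim(query, documents[name]), reverse=True)
for name in ranked:
    print(name, round(cos_sim(query, documents[name]), 3))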
Movie streaming platforms like Netflix rely on cosine similarity to share recommendations based on users' watch history. These systems represent each movie and user as a vector. After generating vector embeddings using matrix factorization or autoencoders, they use cosine similarity to recommend movies based on user preferences and past viewing patterns.
Facial recognition systems, medical imaging applications, and self-driving vehicles rely on cosine similarity scores to gauge similarity between images. They use convolutional neural networks to generate embeddings for images and capture visual patterns among them.
Cosine similarity's focus on angles instead of magnitude makes it ideal for content recommendation, text analysis, document clustering, and data mining. It's often the preferred choice for comparing transformer embeddings because of its scale-invariant nature and ability to handle high-dimensional data. Consider the problem and data requirements to find the most appropriate similarity calculation method.
Looking for software with event-driven architecture for real-time data processing? Check out the top real-time analytic database solutions.