Unveiling the Power of Attention Mechanism Using Numpy: A Dive into Selective Focus in Neural Networks
In the realm of neural networks, the attention mechanism stands as a formidable tool, enabling models to selectively focus on specific elements of input data when making predictions. Drawing inspiration from human attention, which instinctively fixates on certain aspects when presented with multiple stimuli, attention mechanisms have found profound applications in natural language processing tasks such as machine translation, text classification, and question answering.
Understanding Attention Mechanism: A Spotlight on Neural Networks
At its core, an attention model uses a function that maps a query and a set of key-value pairs to an output. The query, the keys, the values, and the final output are all represented as vectors. The model computes the output as a weighted sum of the values, where the weight assigned to each value is determined by a compatibility function that measures the similarity between the query and the corresponding key.
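In the scaled dot-product form implemented later in this post, the whole computation can be summarized by the standard formula Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V, where Q, K, and V stack the query, key, and value vectors as rows and d_k is the dimensionality of the key vectors.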
Analogous to Human Perception: “High-Resolution” and “Low-Resolution” Understanding
To grasp the essence of attention mechanisms, envision a person viewing a complex scene. Their eyes fixate intensely on a specific point, akin to a “high-resolution” understanding of that focal area, while the surrounding regions are perceived with less detail — a form of “low-resolution” understanding. As the neural network gains a more profound comprehension of the input data, it dynamically adjusts its focal point, mirroring the adaptability of human attention.
Coding the Attention Mechanism with Numpy: A Practical Example
Let’s dive into a hands-on example using Python, the powerful NumPy library, and SciPy’s softmax helper to implement an attention mechanism. In this scenario, we’ll use a simple sentence, “I miss the old Kanye,” and represent each word with randomly generated word embeddings.
import numpy as np
from scipy.special import softmax  # used below to turn the attention scores into weights
# Define the sentence
sentence = "i miss the old kanye"
# Define the vocabulary
vocabulary = set(sentence.split())
# Define the word embeddings
word_embeddings = {}
for word in vocabulary:
    word_embeddings[word] = np.random.rand(3)  # Generate a random 3-dimensional embedding for each word
# Encode the sentence as a sequence of word embeddings
encoded_sentence = [word_embeddings[word] for word in sentence.split()]
# Assign the encoded sentence to variables
i = encoded_sentence[0]
miss = encoded_sentence[1]
the = encoded_sentence[2]
old = encoded_sentence[3]
kanye = encoded_sentence[4]
# Print the variables
print('Creating word embedding')
print(f"i: {i}")
print(f"miss: {miss}")
print(f"the: {the}")
print(f"old: {old}")
print(f"kanye: {kanye}")
# Stack the word embeddings into a single matrix (one row per word).
words = np.array([i, miss, the, old, kanye])
# Next, we generate the weight matrices that will be multiplied with the word embeddings
# to obtain the queries, keys, and values. For this example we generate them randomly,
# but in real scenarios they would be learned during training.
np.random.seed(42)
W_Query = np.random.randint(3, size=(3, 3))
W_Key = np.random.randint(3, size=(3, 3))
W_Value = np.random.randint(3, size=(3, 3))
# Generating the queries, keys, and values.
Q = np.dot(words, W_Query)
K = np.dot(words, W_Key)
V = np.dot(words, W_Value)
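# At this point Q, K, and V are each 5x3 matrices: one query, key, and value vector per word.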
# Score each query against every key via dot products (the compatibility function); scores is a 5x5 matrix.
scores = np.dot(Q, K.T)
# Computing the weights by scaling the scores by the square root of the key dimension and applying a softmax to each row.
weights = softmax(scores / np.sqrt(K.shape[1]), axis=1)
# Computing the attention by calculating the weighted sum of the value vectors.
attention = np.dot(weights, V)
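# attention is a 5x3 matrix: row i is the attention-weighted combination of the value vectors for word i.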
print("Attention")
print(attention)
# Output
Creating word embedding
i: [0.46676289 0.85994041 0.68030754]
miss: [0.52477466 0.39986097 0.04666566]
the: [0.45049925 0.01326496 0.94220176]
old: [0.97375552 0.23277134 0.09060643]
kanye: [0.61838601 0.38246199 0.98323089]
Attention
[[0.81676354 1.14873475 0.33197121]
[0.71861207 1.07919157 0.3605795 ]
[0.74203343 1.13656778 0.39453435]
[0.75607715 1.11278575 0.3567086 ]
[0.79709991 1.15043165 0.35333174]]
In this code snippet, we define the sentence, create word embeddings, and generate weight matrices for the queries, keys, and values. We then score each query against every key, scale the scores by the square root of the key dimension, apply a softmax to obtain the attention weights, and compute the attention output as the weighted sum of the value vectors.
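If you also want to see which words each position attends to, a small optional addition (not part of the original snippet) is to print the intermediate weights matrix; each of its rows sums to 1:
print("Attention weights (each row sums to 1):")
print(weights)
print(weights.sum(axis=1))  # sanity check: should print an array of ones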
Results and Insights: Unveiling the Attention
The output of our attention mechanism is a matrix of attention-weighted value vectors: each row corresponds to one word of the input sentence and shows how that word’s new representation is built by selectively focusing on the other words. This hands-on implementation provides a tangible understanding of the inner workings of attention mechanisms in neural networks.
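As a takeaway, the steps above can be wrapped into a small reusable helper. The sketch below (with a hypothetical name, scaled_dot_product_attention, not taken from the original snippet) repeats the same NumPy computation end to end:
import numpy as np
from scipy.special import softmax

def scaled_dot_product_attention(words, W_Query, W_Key, W_Value):
    # Project the word embeddings into queries, keys, and values.
    Q = np.dot(words, W_Query)
    K = np.dot(words, W_Key)
    V = np.dot(words, W_Value)
    # Dot-product compatibility scores, scaled by the square root of the key dimension.
    scores = np.dot(Q, K.T) / np.sqrt(K.shape[1])
    # Softmax over each row turns the scores into attention weights.
    weights = softmax(scores, axis=1)
    # The output is the weighted sum of the value vectors.
    return np.dot(weights, V)

# Example usage (reproduces the attention matrix printed above):
# attention = scaled_dot_product_attention(words, W_Query, W_Key, W_Value)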