Converting Words to Features in NLP
1. One-Hot Encoding
What it is:
A way to represent words as vectors with the same length as the vocabulary size. Each word is assigned a unique index. The one-hot vector has a 1 at the index of that word and 0s elsewhere.
Example:
Vocabulary = {"I", "like", "cats"}
- "I" → [1, 0, 0]
- "like" → [0, 1, 0]
- "cats" → [0, 0, 1]
Why it's used:
It's easy to understand and implement. It ensures each word has a unique representation.
Limitations:
- Doesn't capture meaning or similarity (e.g., "cat" and "kitten" are just as unrelated as "cat" and "banana").
- High memory usage with large vocabularies (sparse vectors).
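A minimal sketch of how these vectors look in code, assuming a toy three-word vocabulary and using PyTorch's `one_hot` helper:

```python
import torch
import torch.nn.functional as F

# Toy vocabulary mapping each word to a unique index
vocab = {"I": 0, "like": 1, "cats": 2}

# Indices of the words we want to encode
indices = torch.tensor([vocab["I"], vocab["like"], vocab["cats"]])

# One-hot vectors: each row has a single 1 at the word's index
one_hot = F.one_hot(indices, num_classes=len(vocab))
print(one_hot)
# tensor([[1, 0, 0],
#         [0, 1, 0],
#         [0, 0, 1]])
```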
2. Bag of Words (BoW)
What it is:
A way to represent a whole sentence or document as a single vector. You sum or average the one-hot vectors of all the words in the sentence.
Example:
Sentence: "I like cats"
- BoW vector = [1, 1, 1] (the sum of the individual one-hot vectors)
Why it's used:
Gives a simple summary of word presence/frequency in a document.
Limitations:
- Word order is lost.
- Still uses sparse vectors.
- No semantic information.
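A minimal sketch of the same idea in code, reusing the toy vocabulary above; the BoW vector is just the sum of the one-hot rows:

```python
import torch
import torch.nn.functional as F

vocab = {"I": 0, "like": 1, "cats": 2}
sentence = ["I", "like", "cats"]

indices = torch.tensor([vocab[w] for w in sentence])

# Bag of Words = sum of the one-hot vectors of all words in the sentence
bow = F.one_hot(indices, num_classes=len(vocab)).sum(dim=0)
print(bow)  # tensor([1, 1, 1])
```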
3. Embeddings
What it is:
Instead of sparse one-hot vectors, embeddings map words to dense vectors in a lower-dimensional space. Each word gets a vector of, say, 50 or 100 dimensions, learned during training.
Example:
- "cat" β 0.23,β1.5,0.88,...0.23, -1.5, 0.88, ...0.23,β1.5,0.88,...
- "kitten" might have a similar vector, capturing their semantic closeness.
Implemented via:
An embedding matrix, where:
- Rows = words
- Columns = embedding dimensions
(If the vocab has 10,000 words and embeddings are 100-dimensional, the matrix is 10,000 × 100.)
Advantages:
- Captures similarity between words.
- Efficient: low-dimensional and dense.
- Learnable: vectors are trained during model training.
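A minimal sketch of the lookup, assuming a 10,000-word vocabulary and 100-dimensional vectors; the indices for "cat" and "kitten" are made up for illustration, and the similarity is only meaningful after training:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Embedding matrix: 10,000 rows (words) x 100 columns (dimensions)
embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=100)
print(embedding.weight.shape)  # torch.Size([10000, 100])

# Hypothetical indices for "cat" and "kitten"
cat, kitten = torch.tensor([42]), torch.tensor([43])

# After training, semantically close words should score high here;
# with freshly initialized weights the value is arbitrary.
sim = F.cosine_similarity(embedding(cat), embedding(kitten))
print(sim)
```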
4. Embedding Bag
What it is:
An efficient way to combine multiple word embeddings (from a sentence or document) into one vector by summing or averaging them, without manually handling one-hot vectors.
Why it's better than BoW:
- Works directly with token indices.
- Avoids overhead of creating/summing one-hot vectors.
- Built-in support for batching multiple documents with offsets.
Offsets:
In a batched setting (e.g., 3 documents):
- Index tensor = all token indices across all docs.
- Offset tensor = starting index of each doc in the index tensor.
PyTorch class:
nn.EmbeddingBag(num_embeddings, embedding_dim, mode='mean')
The mode can be 'sum', 'mean', or 'max'.
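A minimal sketch of the batched layout described above, assuming three short documents with made-up token indices:

```python
import torch
import torch.nn as nn

embedding_bag = nn.EmbeddingBag(num_embeddings=10_000, embedding_dim=100, mode='mean')

# Three documents of lengths 3, 2, and 4, flattened into one index tensor
indices = torch.tensor([1, 5, 8,   2, 3,   6, 9, 4, 7])

# Offsets mark where each document starts in the flat index tensor
offsets = torch.tensor([0, 3, 5])

output = embedding_bag(indices, offsets)
print(output.shape)  # torch.Size([3, 100]) -- one pooled vector per document
```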
5. Using Embeddings in PyTorch
Steps:
- Tokenization: Break text into words.
- Vocabulary creation: Assign an index to each word.
- Embedding layer: nn.Embedding(vocab_size, embed_dim) maps token indices to vectors.
- EmbeddingBag layer (optional): nn.EmbeddingBag(...) aggregates word embeddings for whole sentences or documents efficiently.
- Feeding to a model: These dense vectors go into your neural network as features.
Example Code Snippet:
```python
import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10_000, embedding_dim=100)
indices = torch.tensor([1, 5, 8])  # e.g., "I like cats"
output = embedding(indices)        # shape: (3, 100), one vector per token
```
With EmbeddingBag:
```python
embedding_bag = nn.EmbeddingBag(num_embeddings=10_000, embedding_dim=100, mode='mean')
indices = torch.tensor([1, 5, 8, 2, 3, 6])  # tokens from multiple docs
offsets = torch.tensor([0, 3])              # starting position of each doc
output = embedding_bag(indices, offsets)    # shape: (2, 100), one vector per doc
```
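A minimal sketch of the final step, feeding the pooled vectors into a network; the classifier below is a hypothetical two-class example, not code from the source:

```python
import torch
import torch.nn as nn

# EmbeddingBag pools word embeddings per document, then a linear layer
# maps the pooled vector to class scores.
class TextClassifier(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=100, num_classes=2):
        super().__init__()
        self.embedding_bag = nn.EmbeddingBag(vocab_size, embed_dim, mode='mean')
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, indices, offsets):
        pooled = self.embedding_bag(indices, offsets)  # (num_docs, embed_dim)
        return self.fc(pooled)                         # (num_docs, num_classes)

model = TextClassifier()
indices = torch.tensor([1, 5, 8, 2, 3, 6])
offsets = torch.tensor([0, 3])   # two documents
logits = model(indices, offsets)
print(logits.shape)              # torch.Size([2, 2])
```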
Summary Recap

| Concept | Purpose | Pros | Cons |
|---|---|---|---|
| One-Hot Encoding | Represent each word as a unique ID | Simple, fast for small vocabularies | Sparse, doesn't capture meaning |
| Bag of Words | Represent a document as word counts | Easy doc-level summary | Loses order, sparse, no context |
| Embeddings | Dense word representation | Captures similarity, efficient, trainable | Needs training or pre-trained vectors |
| Embedding Bag | Efficient aggregated embeddings | Great for sentence/doc-level input, fast | Aggregation discards word order |