What is Tokenization?
The process of breaking text into smaller parts (tokens) so a model can understand and process it.
Example:
"IBM taught me tokenization"
→ ["IBM", "taught", "me", "tokenization"]
Types of Tokenization
Type | Description | Pros | Cons |
---|---|---|---|
Word-based | Splits text into individual words | Preserves meaning | Big vocabulary → memory-heavy |
Character-based | Splits into single characters | Tiny vocab | Context loss, inefficient |
Subword-based | Breaks rare words into chunks, leaves common words whole | Balance between size & context | More complex implementation |
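For a concrete feel of the first two styles, here is a tiny sketch in plain Python (real word tokenizers also handle punctuation, which this skips):

```python
text = "IBM taught me tokenization"

# Word-based: split on whitespace
word_tokens = text.split()
print(word_tokens)      # ['IBM', 'taught', 'me', 'tokenization']

# Character-based: every character (including spaces) becomes a token
char_tokens = list(text)
print(char_tokens[:5])  # ['I', 'B', 'M', ' ', 't']
```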
Subword Tokenization Methods
Method | Description |
---|---|
WordPiece | Merges or splits symbols based on usefulness; used in BERT |
Unigram | Starts with many possible tokens, narrows down based on frequency |
SentencePiece | Breaks raw text without needing pre-tokenization; assigns unique IDs |
Examples:
"token", "##ization" → WordPiece
"_token", "ization" → Unigram/SentencePiece (underscore = new word after a space)
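To see WordPiece output without training anything yourself, one option (an assumption here, not something this note requires) is the Hugging Face `transformers` library; the `bert-base-uncased` checkpoint ships a WordPiece tokenizer:

```python
from transformers import AutoTokenizer

# BERT's tokenizer is WordPiece-based; the "uncased" model lowercases input
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

print(tokenizer.tokenize("IBM taught me tokenization"))
# e.g. ['ibm', 'taught', 'me', 'token', '##ization']
```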
Tokenization concepts
Padding
What it is:
Padding is the process of making all sequences the same length by adding a special token (e.g., `0`) to the end (or beginning) of shorter sequences.
Why it matters:
Neural networks like LSTMs, Transformers, or CNNs expect fixed-size tensor inputs. Without padding, variable-length inputs cause shape mismatch errors.
You use padding to fix unequal lengths between sequences in a batch.
Shuffling
What it is:
Shuffling randomly rearranges the order of data samples in your dataset during training.
Why it matters:
If the data is fed in the same order every epoch (e.g., all positive reviews, then all negative), the model may learn spurious patterns from the order rather than the content.
Use shuffling to improve generalization and avoid learning order-based biases.
Iteration
What it is:
Iteration refers to the process of looping through your dataset using an iterator (like a Python `for` loop). In PyTorch, `DataLoader` objects are iterables you can loop over.
Why it matters:
While it's necessary for data loading, iteration itself doesn't solve issues like length mismatches or overfitting; it's just the mechanism for going through the data.
Iteration is how you read data, not how you preprocess or adjust it.
Batching
What it is:
Batching groups multiple samples into a single batch before feeding it into the model, instead of one sample at a time.
Why it matters:
Batching speeds up training (parallelism) and helps with convergence, but it doesn't solve sequence length differences; it assumes the input sequences are already aligned in shape.
You still need padding before batching works correctly.
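A minimal sketch tying the four concepts together, assuming the data is already token IDs stored as plain Python lists (the `collate_fn` and the toy sequences are illustrative):

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy dataset: token-ID sequences of different lengths
sequences = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

def collate_fn(batch):
    # Padding: align sequence lengths inside the batch with 0s
    tensors = [torch.tensor(seq) for seq in batch]
    return pad_sequence(tensors, batch_first=True, padding_value=0)

# Batching + shuffling happen here
loader = DataLoader(sequences, batch_size=2, shuffle=True, collate_fn=collate_fn)

# Iteration: loop over padded batches
for batch in loader:
    print(batch.shape)  # e.g. torch.Size([2, 4])
```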
Data loader creation
1. Batching
What it is:
Grouping multiple samples into a single batch so they can be processed in parallel.
Why it's used:
- Improves training speed by leveraging vectorized operations
- Reduces memory usage per update (compared to processing all samples at once)
- Stabilizes gradient updates
Example: If your dataset has 1000 samples and your batch size is 32, the model trains on 32 samples at a time, updating weights after each batch.
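In PyTorch the batch size is just an argument to `DataLoader`; a quick sketch with a made-up 1000-sample tensor dataset:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset: 1000 samples, each a 10-dimensional feature vector
dataset = TensorDataset(torch.randn(1000, 10))

# batch_size=32 → 31 full batches plus a final batch of 8
loader = DataLoader(dataset, batch_size=32)
print(len(loader))  # 32 batches per epoch
```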
2. Padding
What it is:
Making sequences equal length by adding special padding tokens (e.g., zeros) at the end of shorter sequences.
Why it's used:
Neural networks require fixed-size tensors. If your sentences are different lengths (which they usually are), you need to pad them so they can be batched together.
Example:

```python
[
  [1, 2, 3],
  [4, 5]
]
```

→ Padded:

```python
[
  [1, 2, 3],
  [4, 5, 0]
]
```
In PyTorch:

```python
from torch.nn.utils.rnn import pad_sequence
```
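A small usage sketch of `pad_sequence`, turning the ragged lists above into tensors first (the variable names are illustrative):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

sequences = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]

# batch_first=True gives shape (batch, max_len); shorter rows are filled with 0
padded = pad_sequence(sequences, batch_first=True, padding_value=0)
print(padded)
# tensor([[1, 2, 3],
#         [4, 5, 0]])
```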
3. Shuffling
What it is:
Randomly changing the order of your dataset for each epoch.
Why it's used:
- Prevents the model from memorizing order-based patterns
- Improves generalization
- Helps avoid bias introduced by grouped data (e.g., positive reviews followed by negative ones)
Code example:

```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset, shuffle=True)
```
4. Iteration
What it is:
Looping through your dataset one sample or batch at a time, typically with a `for` loop or iterator.
Why it's important:
It's how the model consumes the data during training.
Example:

```python
for batch in dataloader:
    model(batch)
```
Summary Table
Concept | Purpose | Solves |
---|---|---|
Batching | Efficient parallel processing | Speed, memory |
Padding | Makes input tensors the same length | Varying sequence lengths |
Shuffling | Prevents learning dataset order | Bias/generalization |
Iteration | Loads data sample-by-sample or batch-by-batch | Accessing data |
So in summary:
Concept | Purpose | Solves Varying Length? |
---|---|---|
Padding | Makes sequences equal length with filler tokens | ✅ Yes |
Shuffling | Prevents overfitting to input order | ❌ No |
Iteration | Loops through dataset | ❌ No |
Batching | Feeds multiple samples at once | ❌ No (needs padding) |
Tokenization in PyTorch (torchtext)
- `get_tokenizer()` → applies a tokenizer (e.g., word or subword)
- `build_vocab_from_iterator()` → builds the vocab and maps tokens to indices
- `vocab[token]` → returns the token's index
- Special tokens: `BOS`, `EOS`, `PAD`, `UNK` → added for sentence marking and padding
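A minimal sketch of those calls, assuming a recent torchtext release (roughly 0.12+) where `build_vocab_from_iterator` takes a `specials` list; the sample sentences are made up:

```python
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")  # simple word-level tokenizer

sentences = ["IBM taught me tokenization", "Tokenization breaks text into tokens"]

def yield_tokens(texts):
    for text in texts:
        yield tokenizer(text)

# Special tokens go in first so they get the lowest indices
vocab = build_vocab_from_iterator(
    yield_tokens(sentences),
    specials=["<unk>", "<pad>", "<bos>", "<eos>"],
)
vocab.set_default_index(vocab["<unk>"])  # unseen tokens map to <unk>

print(vocab["tokenization"])  # index of a known token
print([vocab[t] for t in tokenizer("IBM taught me tokenization")])
```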
Why It Matters to You
- If you're using Hugging Face, these methods are what sit underneath your tokenizer magic
- In MythosQuest or Spellweaver, if you do custom embeddings or want to fine-tune, understanding tokenization helps you prepare your data properly
- Especially important when designing prompt structure or pre/post-processing layers
Bitty Bonus Recap Spell
"Tokenization is the spell that breaks language into its component runes. Choose your rune-style wisely: words for clarity, characters for precision, subwords for balance."