

🔍 What is Tokenization?

Tokenization is the process of breaking text into smaller pieces (tokens) so that a model can understand and process it.

Example:

"IBM taught me tokenization" β†’ ["IBM", "taught", "me", "tokenization"]


🛠️ Types of Tokenization

| Type | Description | Pros | Cons |
| --- | --- | --- | --- |
| Word-based | Splits text into individual words | Preserves meaning | Big vocabulary → memory-heavy |
| Character-based | Splits into single characters | Tiny vocab | Context loss, inefficient |
| Subword-based | Breaks rare words into chunks, leaves common words whole | Balance between size & context | More complex implementation |
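A character-based split, by contrast, is just the list of the string's characters (again a toy sketch):

```python
text = "IBM"
print(list(text))  # ['I', 'B', 'M'] -- tiny vocab, but long sequences and little meaning per token
```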

🧪 Subword Tokenization Methods

| Method | Description |
| --- | --- |
| WordPiece | Merges or splits symbols based on usefulness; used in BERT |
| Unigram | Starts with many possible tokens, narrows down based on frequency |
| SentencePiece | Breaks raw text without needing pre-tokenization; assigns unique IDs |

Examples:

  • "token##ization" β†’ WordPiece
  • "_token" "ization" β†’ Unigram/SentencePiece (Underscore = new word after space)

Tokenization concepts

🟩 Padding ✅

What it is:

Padding is the process of making all sequences the same length by adding a special token (e.g., 0) to the end (or beginning) of shorter sequences.

Why it matters:

Neural networks such as LSTMs, Transformers, and CNNs expect fixed-size input tensors. Without padding, variable-length inputs cause shape-mismatch errors when batched.

📌 You use padding to fix unequal lengths between sequences in a batch.
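A minimal sketch in plain Python (the token IDs and the PAD id 0 are made up for illustration):

```python
# Pad every sequence in the batch to the length of the longest one
batch = [[1, 2, 3], [4, 5]]
max_len = max(len(seq) for seq in batch)
padded = [seq + [0] * (max_len - len(seq)) for seq in batch]
print(padded)  # [[1, 2, 3], [4, 5, 0]]
```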


🔄 Shuffling

What it is:

Shuffling randomly rearranges the order of data samples in your dataset during training.

Why it matters:

If the data is fed in the same order every epoch (e.g., all positive reviews, then all negative), the model may learn spurious patterns from the order rather than the content.

📌 Use shuffling to improve generalization and avoid learning order-based biases.


🔁 Iteration

What it is:

Iteration refers to the process of looping through your dataset using an iterator (like a Python for loop). In PyTorch, DataLoader objects are iterables you can loop over.

Why it matters:

While it's necessary for data loading, iteration itself doesn't solve issues like length mismatches or overfitting; it's just the mechanism for going through the data.

📌 Iteration is how you read data, not how you preprocess or adjust it.


🧩 Batching

What it is:

Batching groups multiple samples into a single batch before feeding it into the model, instead of one sample at a time.

Why it matters:

Batching speeds up training (parallelism) and helps with convergence, but it doesn't solve sequence-length differences; it assumes the input sequences are already aligned in shape.

📌 You still need padding before batching works correctly.

Data loader creation


🧱 1. Batching

What it is:

Grouping multiple samples into a single batch so they can be processed in parallel.

Why it's used:

  • Improves training speed by leveraging vectorized operations
  • Reduces memory usage per update (compared to processing all samples at once)
  • Stabilizes gradient updates

Example: If your dataset has 1000 samples and your batch size is 32, the model trains on 32 samples at a time, updating weights after each batch.
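A minimal sketch of that setup with PyTorch's DataLoader (the dataset here is synthetic, purely for illustration):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 8))  # 1000 samples, 8 features each
loader = DataLoader(dataset, batch_size=32)

print(len(loader))  # 32 -> 31 full batches of 32, plus one final batch of 8
```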


🧩 2. Padding

What it is:

Making sequences equal length by adding special padding tokens (e.g., zeros) at the end of shorter sequences.

Why it's used:

Neural networks require fixed-size tensors. If your sentences are different lengths (which they usually are), you need to pad them so they can be batched together.

Example:

```python
[
    [1, 2, 3],
    [4, 5]
]
```

→ Padded:

```python
[
    [1, 2, 3],
    [4, 5, 0]
]
```

In PyTorch, `pad_sequence` does this for you:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

seqs = [torch.tensor([1, 2, 3]), torch.tensor([4, 5])]
padded = pad_sequence(seqs, batch_first=True, padding_value=0)
# tensor([[1, 2, 3],
#         [4, 5, 0]])
```


🔀 3. Shuffling

What it is:

Randomly changing the order of your dataset for each epoch.

Why it's used:

  • Prevents the model from memorizing order-based patterns
  • Improves generalization
  • Helps avoid bias introduced by grouped data (e.g., positive reviews followed by negative ones)

Code example:

```python
from torch.utils.data import DataLoader

loader = DataLoader(dataset, shuffle=True)  # reshuffles the data every epoch
```


🔁 4. Iteration

What it is:

Looping through your dataset one sample or batch at a time, typically with a for loop or iterator.

Why it's important:

It's how the model consumes the data during training.

Example:

```python
for batch in dataloader:
    outputs = model(batch)  # forward pass on one batch at a time
```


Summary Table

| Concept | Purpose | Solves |
| --- | --- | --- |
| Batching | Efficient parallel processing | Speed, memory |
| Padding | Makes input tensors the same length | Varying sequence lengths |
| Shuffling | Prevents learning dataset order | Bias/generalization |
| Iteration | Loads data sample-by-sample or batch-by-batch | Accessing data |

✅ So in summary:

| Concept | Purpose | Solves Varying Lengths? |
| --- | --- | --- |
| Padding | Makes sequences equal length with filler tokens | ✅ Yes |
| Shuffling | Prevents overfitting to input order | ❌ No |
| Iteration | Loops through dataset | ❌ No |
| Batching | Feeds multiple samples at once | ❌ No (needs padding) |

🧰 Tokenization in PyTorch (torchtext)

  • get_tokenizer() – applies tokenizer (e.g., word or subword)
  • build_vocab_from_iterator() – builds vocab and maps tokens to indices
  • vocab[token] – returns token’s index
  • Special tokens: BOS, EOS, PAD, UNK → added for sentence marking and padding
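A minimal sketch of those calls (assuming a recent torchtext; the sample text is made up):

```python
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("basic_english")  # word-level tokenizer
texts = ["IBM taught me tokenization"]

vocab = build_vocab_from_iterator(
    (tokenizer(t) for t in texts),
    specials=["<unk>", "<pad>", "<bos>", "<eos>"],  # UNK, PAD, BOS, EOS
)
vocab.set_default_index(vocab["<unk>"])  # unknown tokens map to <unk>

print(vocab["tokenization"])  # the token's index in the vocabulary
```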

🎯 Why It Matters to You

  • If you’re using Hugging Face, these methods are what sit underneath your tokenizer magic
  • In MythosQuest or Spellweaver, if you do custom embeddings or want to fine-tune, understanding tokenization helps you prepare your data properly
  • Especially important when designing prompt structure or pre/post-processing layers

✨ Bitty Bonus Recap Spell

"Tokenization is the spell that breaks language into its component runes. Choose your rune-style wisely: words for clarity, characters for precision, subwords for balance."