Tokenization
Discussion about Tokenization.
Introduction
Tokenization (分詞): splitting a string of text into smaller units called tokens.
Tokenizer (分詞器): the component that performs this splitting.
- Character Tokenization
- Word Tokenization
Character Tokenization
1. Character Tokenization
string = "How are you?"
tokenized_str = list(string)
print(tokenized_str)
Explanation:
- The string "How are you?" is converted into a list of characters; list(string) splits the string into individual characters.
Output:
['H', 'o', 'w', ' ', 'a', 'r', 'e', ' ', 'y', 'o', 'u', '?']
2. Numericalization
Step 1: Remove Duplicates and Sort Characters
unique_chars = sorted(set(tokenized_str))
Explanation:
- set(tokenized_str) removes duplicate characters, keeping only unique ones.
- sorted() sorts these unique characters in ascending order (based on ASCII values).
Example Output:
[' ', '?', 'H', 'a', 'e', 'o', 'r', 'u', 'w', 'y']
Step 2: Assign a Unique Index to Each Character
token2idx = {}
for idx, ch in enumerate(unique_chars):
token2idx[ch] = idx
Explanation:
- enumerate(unique_chars) assigns an index to each character.
- A dictionary token2idx is created where each character is a key and its index is the corresponding value.
Example Output:
{' ': 0, '?': 1, 'H': 2, 'a': 3, 'e': 4, 'o': 5, 'r': 6, 'u': 7, 'w': 8, 'y': 9}
3. Mapping Characters to Indices
input_ids = [token2idx[token] for token in tokenized_str]
print(input_ids)
Explanation:
- This step maps each character in the original string to its corresponding index from the token2idx dictionary.
- input_ids will be a list of integers representing the original string.
Example Output:
[2, 5, 8, 0, 3, 6, 4, 0, 9, 5, 7, 1]
Summary:
- Character Tokenization: Converts the string into a list of individual characters.
- Numericalization:
- Creates a sorted list of unique characters.
- Assigns a unique index to each character.
- Mapping: Converts the original string into a list of indices based on the tokenization and numericalization process.
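The three steps above can be combined into one short end-to-end sketch. The inverse mapping idx2token is an addition not shown in the steps above; it illustrates that the process is reversible:

```python
# End-to-end character tokenization and numericalization
string = "How are you?"

# 1. Character tokenization
tokenized_str = list(string)

# 2. Numericalization: sorted unique characters -> indices
unique_chars = sorted(set(tokenized_str))
token2idx = {ch: idx for idx, ch in enumerate(unique_chars)}

# 3. Mapping characters to indices
input_ids = [token2idx[token] for token in tokenized_str]
print(input_ids)  # [2, 5, 8, 0, 3, 6, 4, 0, 9, 5, 7, 1]

# Decoding (hypothetical extra step): invert the mapping
# to recover the original string from the indices
idx2token = {idx: ch for ch, idx in token2idx.items()}
decoded = "".join(idx2token[i] for i in input_ids)
print(decoded)  # How are you?
```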
Word Tokenization
1. Word Tokenization
# Word tokenization
string = "How are you?"
tokenized_str = string.split()
print(tokenized_str)
2. Numericalization
The same procedure as in character tokenization: build a sorted list of unique words and assign each word a unique index.
3. Mapping Words to Indices
The same procedure as in character tokenization: replace each word in the tokenized string with its index.
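Since the text only says "the same", here is a minimal sketch of numericalization and mapping applied to the word tokens (note that split() keeps "you?" as a single token, because punctuation is not handled separately):

```python
# Word tokenization and numericalization
string = "How are you?"

# 1. Word tokenization on whitespace
tokenized_str = string.split()
print(tokenized_str)  # ['How', 'are', 'you?']

# 2. Numericalization: sorted unique words -> indices
# (uppercase letters sort before lowercase in ASCII, so 'How' comes first)
unique_words = sorted(set(tokenized_str))
word2idx = {w: i for i, w in enumerate(unique_words)}
print(word2idx)  # {'How': 0, 'are': 1, 'you?': 2}

# 3. Mapping words to indices
input_ids = [word2idx[w] for w in tokenized_str]
print(input_ids)  # [0, 1, 2]
```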