Text embeddings are the building blocks of natural language processing (NLP) algorithms. They are used to transform text data into numerical vectors that can be fed into machine learning models. Text embeddings can be created at various levels of granularity, including characters, words, and sentences. In this tutorial, we’ll explore the different types of text embeddings, including character embedding, word embedding, sentence embedding, and byte pair encoding.

What are Text Embeddings?

Text embeddings are numerical representations of text data that capture its semantic meaning. They are typically learned from large corpora using unsupervised techniques that analyze the context and relationships between words, although they can also start out as randomly initialized layers that are trained jointly with a downstream model, as in the examples below. The resulting embeddings can be used as input to machine learning models for a variety of NLP tasks, such as text classification, sentiment analysis, and machine translation.

Character Embedding

Character embedding is a type of text embedding that represents each character in a text sequence as a numerical vector. This approach is often used when working with text data that contains rare or out-of-vocabulary words. By breaking the text down into its constituent characters, character embedding can handle any word regardless of the vocabulary size and can capture morphological patterns such as prefixes, suffixes, and spelling variations.

Here’s an example of how to create a character embedding using Python and the keras library:

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential

words = [
    "apple",
    "orange",
    "banana",
    "grape",
]

# Create a character-level tokenizer
tokenizer = Tokenizer(char_level=True, oov_token="<OOV>")
tokenizer.fit_on_texts(words)
char_index = tokenizer.word_index

# Encode the words
sequences = tokenizer.texts_to_sequences(words)

# Pad the sequences
max_length = max([len(seq) for seq in sequences])
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding="post")

# Define the embedding model
embedding_dim = 16
vocab_size = len(char_index) + 1

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),
])

# Generate character embeddings
char_embeddings = model.predict(padded_sequences)

print("Character embeddings:")
for i, word in enumerate(words):
    print(f"Word: {word}")
    for j, char_emb in enumerate(char_embeddings[i]):
        print(f"  Char {j + 1}: {char_emb}")


# Get the index of the character "a"
char_idx = char_index.get("a", 0)

# Get the embedding matrix from the Embedding layer
embedding_matrix = model.layers[0].get_weights()[0]

# Get the embedding for the character "a"
char_a_embedding = embedding_matrix[char_idx]
print("Embedding for character 'a':", char_a_embedding)

In this example, we create a character-level tokenizer using the Tokenizer class from Keras, use it to convert each word into a sequence of character indices, and pad the sequences so they are all the same length. Unlike the word2vec example in the Vectorization blog, we do not train the embedding for a specific task; we simply define a model with a single, randomly initialized Embedding layer to illustrate what the embeddings look like, which is why we call model.predict rather than model.fit on the padded data. Knowing the dimensions involved makes the code easier to follow: padded_sequences has shape (4, 6) (four words, padded to the length of the longest words, "orange" and "banana"), the embedding matrix has shape (11, 16) (nine distinct characters plus the OOV token plus the reserved padding index 0, each mapped to a 16-dimensional vector), and char_embeddings has shape (4, 6, 16) (one 16-dimensional vector per character position per word).
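As a quick sanity check, the following lines (a small addition to the code above) print the shapes described in the previous paragraph:

# Inspect the array shapes discussed above
print("padded_sequences shape:", padded_sequences.shape)   # (4, 6)
print("embedding_matrix shape:", embedding_matrix.shape)   # (11, 16)
print("char_embeddings shape:", char_embeddings.shape)     # (4, 6, 16)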

Word Embedding

Word embedding is a type of text embedding that represents each word in a text sequence as a numerical vector. This approach is often used when working with text data that contains a relatively large vocabulary size. By breaking down the text into its constituent words, word embedding can capture the semantic meaning and relationships between words in the text.

For a numerical example, see the word2vec walkthrough in the Vectorization blog, and compare it with the character embedding example above in terms of how it is implemented and what task it is trained on. A minimal standalone sketch of a word-level Embedding layer is also shown below.
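Here is a minimal sketch of word-level embeddings using the same Keras Tokenizer and Embedding layer as in the character example. The sentences and the embedding dimension are illustrative choices, and the layer is randomly initialized rather than trained on a task:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential

sentences = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]

# Create a word-level tokenizer (char_level defaults to False)
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)

# Encode and pad the sentences
sequences = tokenizer.texts_to_sequences(sentences)
max_length = max(len(seq) for seq in sequences)
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding="post")

# A single randomly initialized Embedding layer: one vector per word
embedding_dim = 8
vocab_size = len(tokenizer.word_index) + 1

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),
])

# Shape: (num_sentences, max_length, embedding_dim)
word_embeddings = model.predict(padded_sequences)
print(word_embeddings.shape)

Each position in the output is an 8-dimensional vector for one word of the padded sentence; in practice these vectors are either trained jointly with a downstream task or replaced with pretrained embeddings such as word2vec or GloVe.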

Sentence Embedding

Sentence embedding is a type of text embedding that represents each sentence in a text corpus as a numerical vector. This approach is often used when working with text data that contains longer sequences of text, such as paragraphs or documents. By summarizing the semantic meaning of each sentence in the corpus, sentence embedding can capture the overall meaning and context of the text.

Here’s an example of how to create a sentence embedding using TensorFlow and Keras:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D
from tensorflow.keras.models import Sequential

sentences = [
    "I love machine learning.",
    "Deep learning is fascinating.",
    "Natural language processing is a subfield of artificial intelligence.",
]

# Create a word-level tokenizer
tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

# Encode the sentences
sequences = tokenizer.texts_to_sequences(sentences)

# Pad the sequences
max_length = max([len(seq) for seq in sequences])
padded_sequences = pad_sequences(sequences, maxlen=max_length, padding="post")

# Define the embedding model
embedding_dim = 16
vocab_size = len(word_index) + 1

model = Sequential([
    Embedding(input_dim=vocab_size, output_dim=embedding_dim, input_length=max_length),
    GlobalAveragePooling1D(),
])

# Generate sentence embeddings
sentence_embeddings = model.predict(padded_sequences)

print("Sentence embeddings:")
for i, emb in enumerate(sentence_embeddings):
    print(f"Sentence {i + 1}: {emb}")

Output:

Sentence embeddings:
Sentence 1: [-0.00010269  0.00750268  0.01172061 -0.02027113 -0.00212045 -0.01125387
 -0.0130327  -0.01406111 -0.01669746  0.03088781  0.0110696   0.00854814
  0.0155106   0.01087279 -0.01596794 -0.01844954]
Sentence 2: [-0.00298612  0.01048899  0.00338774 -0.03776908 -0.01171835 -0.00342189
 -0.03417958 -0.01213398 -0.0055492   0.03442742  0.00192641  0.00010407
  0.01639477  0.00306632 -0.026868   -0.01939041]
Sentence 3: [ 0.01340608 -0.00078986 -0.0013294   0.01496422 -0.01185604  0.00092309
  0.00330739  0.00237165 -0.01248339  0.00533317  0.01867088 -0.01010449
 -0.01195448  0.00528855  0.00105715 -0.00977207]

In this example, we create a simple model with an Embedding layer and a GlobalAveragePooling1D layer to generate sentence embeddings. The Embedding layer generates word embeddings, and the GlobalAveragePooling1D layer averages them to obtain a single vector representing the whole sentence. You can experiment with different embedding dimensions and model architectures to obtain better sentence embeddings; a common way to use the resulting vectors is to compare sentences by cosine similarity, as sketched below.
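As a usage sketch (assuming the sentence_embeddings array produced by the code above, and using plain NumPy rather than any particular similarity library), two sentences can be compared like this:

import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product divided by the product of the vector norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Compare the first two sentence embeddings generated above
sim = cosine_similarity(sentence_embeddings[0], sentence_embeddings[1])
print(f"Similarity between sentence 1 and sentence 2: {sim:.4f}")

Keep in mind that with an untrained, randomly initialized Embedding layer these similarity values are not meaningful; they become useful once the embeddings are trained on a task or replaced with a pretrained sentence encoder.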

Byte Pair Encoding

Byte pair encoding (BPE) is a subword tokenization technique that represents each word in a text sequence as a sequence of subword units, built by repeatedly merging the most frequent pairs of symbols (originally bytes, in practice usually characters) found in a corpus. This approach is often used when working with text data that has a large vocabulary and many rare or out-of-vocabulary words. By breaking words down into smaller subword units, BPE can capture the morphology of the text while keeping the overall vocabulary size manageable.

Here’s an example of how to create a BPE encoding using Python and the Hugging Face tokenizers library:

!pip install tokenizers
from tokenizers import Tokenizer, trainers, pre_tokenizers, decoders
from tokenizers.models import BPE

# Sample text data
data = [
    "This is an example.",
    "Byte Pair Encoding is popular.",
    "BPE is used for tokenization.",
]

# Initialize BPE tokenizer
tokenizer = Tokenizer(BPE())

# Pre-tokenize the text data
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train BPE tokenizer on the text data
trainer = trainers.BpeTrainer(vocab_size=1000, min_frequency=1)
tokenizer.train_from_iterator(data, trainer=trainer)

# Tokenize a new sentence using the trained BPE tokenizer
sentence = "This is an example of using BPE."
output = tokenizer.encode(sentence)
print(f"Tokenized IDs: {output.ids}")
print(f"Tokenized text: {output.tokens}")

# Decode the tokenized IDs back to the original text
decoded_text = tokenizer.decode(output.ids)
print(f"Decoded text: {decoded_text}")

Output:

Tokenized IDs: [62, 26, 33, 66, 17, 9, 22, 20, 12, 48, 58, 0]
Tokenized text: ['This', 'is', 'an', 'example', 'o', 'f', 'u', 's', 'i', 'ng', 'BPE', '.']
Decoded text: This is an example o f u s i ng BPE .

In this example, we first import the necessary modules from the tokenizers library and create a BPE tokenizer. We then train the tokenizer on our sample text data and tokenize a new sentence using the trained BPE tokenizer. Finally, we decode the tokenized IDs back to the original text.

Please note that this example doesn’t involve TensorFlow or Keras directly, but the tokenizers library works alongside both, and you can easily feed the tokenized IDs into TensorFlow or Keras models for various NLP tasks, as sketched below.
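For instance, here is a minimal sketch (assuming the tokenizer and data variables defined above) of how the BPE token IDs could be padded and passed to a Keras Embedding layer; the embedding dimension of 16 is an arbitrary illustrative choice:

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential

# Encode the training sentences into BPE token IDs
id_sequences = [tokenizer.encode(text).ids for text in data]

# Pad the ID sequences to a common length
# (in practice you would reserve a dedicated padding ID; 0 is used here for simplicity)
max_length = max(len(ids) for ids in id_sequences)
padded_ids = pad_sequences(id_sequences, maxlen=max_length, padding="post")

# Embed the subword IDs; the vocabulary size comes from the trained BPE tokenizer
model = Sequential([
    Embedding(input_dim=tokenizer.get_vocab_size(), output_dim=16),
])

subword_embeddings = model.predict(padded_ids)
print(subword_embeddings.shape)  # (num_sentences, max_length, 16)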

Conclusion

Text embeddings are a powerful tool for transforming text data into numerical vectors that can be fed into machine learning models. Depending on the type of text data you are working with, different types of embeddings may be more appropriate. Character embedding, word embedding, sentence embedding, and byte pair encoding each have their own strengths and weaknesses, and can be used to extract different levels of semantic meaning from text data. By understanding the differences between these types of embeddings, you can choose the right approach for your NLP project and create more accurate and effective models.