PyTorch 词嵌入 (Embedding)

词嵌入是自然语言处理中最基础也是最重要的技术之一。

词嵌入将离散的词符号映射为连续的稠密向量，使机器能够理解和处理文本数据。

PyTorch 提供了 nn.Embedding 模块来实现这一功能，是构建各种 NLP 模型的基础。

1. 词嵌入基础概念

在计算机中，文本本质上是一串整数序列。每个词被分配一个唯一的索引 ID，但这种离散表示存在一个问题：相似的词在语义上可能很接近，但它们的 ID 却毫无关联。

词嵌入通过学习一个嵌入矩阵来解决这个问题：

\[ E \in \mathbb{R}^{V \times D} \]

其中 \(V\) 是词表大小，\(D\) 是嵌入维度。每个词 ID 对应嵌入矩阵中的一行，通过查表操作获取其向量表示：

\[ \text{embedding} = E[\text{word\_id}] \]

词嵌入的优势包括：

将高维稀疏的 one-hot 向量转换为低维稠密向量，大幅降低计算开销
语义相似的词在向量空间中距离更近，可通过余弦相似度计算词相似度
嵌入向量是可学习的参数，可以通过反向传播自动调整

2. nn.Embedding 详解

nn.Embedding 是 PyTorch 提供的词嵌入层，封装了嵌入矩阵的创建和查表操作。

2.1 基本用法

实例

import torch
import torch.nn as nn

# 创建词嵌入层
# num_embeddings: 词表大小（vocab_size）
# embedding_dim: 嵌入维度（embedding_dim）
vocab_size = 10000
embedding_dim = 256

embedding = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

# 查看嵌入矩阵的形状
print(embedding.weight.shape) # torch.Size([10000, 256])

# 输入词索引（LongTensor），获取嵌入向量
word_ids = torch.tensor([0, 1, 2, 9999]) # 任意词索引
embedded = embedding(word_ids)

print(embedded.shape) # torch.Size([4, 256])
# 每个词ID对应一个256维的向量

2.2 nn.Embedding 参数详解

实例

import torch.nn as nn

embedding = nn.Embedding(
num_embeddings=10000, # 词表大小，必须大于等于输入的最大索引值
embedding_dim=256, # 嵌入向量维度，通常为 50, 100, 200, 300 等
padding_idx=None, # 填充词的索引，填充词的嵌入向量全为零
max_norm=None, # 嵌入向量的最大范数，用于归一化
norm_type=2.0, # 归一化类型，通常为 L2 范数
scale_grad_by_freq=False,# 按词频缩放梯度
sparse=False, # 是否使用稀疏梯度（节省显存，但训练较慢）
_weight=None, # 预定义权重，用于加载预训练嵌入
)

# 查看参数量
total_params = embedding.num_embeddings * embedding.embedding_dim
print(f"嵌入层参数量: {total_params:,}")
# 10000 * 256 = 2,560,000

嵌入层的参数量 = 词表大小 × 嵌入维度，这是一个非常大的矩阵。通常 NLP 模型的嵌入层占模型总参数量的很大比例。

2.3 填充索引 padding_idx

在处理变长序列时，需要对短序列进行填充。使用 padding_idx 可以将填充词的嵌入向量固定为零向量，避免填充内容对模型产生影响：

实例

import torch
import torch.nn as nn

# 设置 padding_idx=0，表示索引0为填充词
embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128, padding_idx=0)

# 初始化权重
nn.init.uniform_(embedding.weight, -0.1, 0.1)

# 索引0的嵌入向量全为0
word_0 = embedding(torch.tensor([0]))
print(f"填充词的嵌入: {word_0}") # 全为0

# 其他索引正常
word_5 = embedding(torch.tensor([5]))
print(f"词5的嵌入: {word_5}") # 非零值

2.4 最大范数归一化

使用 max_norm 可以限制嵌入向量的范数，防止训练过程中嵌入向量过大：

实例

import torch
import torch.nn as nn

# 设置最大范数为1.0
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=64, max_norm=1.0)

# 输入任意词索引
ids = torch.tensor([1, 2, 3])
embedded = embedding(ids)

# 检查每个向量的L2范数
norms = torch.norm(embedded, p=2, dim=1)
print(f"各向量范数: {norms}") # 所有值都接近1.0

3. 加载预训练词嵌入

使用预训练词嵌入可以大幅提升模型性能，尤其是当训练数据较少时。常见的预训练词嵌入包括 Word2Vec、GloVe、FastText 等。

3.1 从头训练与加载预训练的区别

方式	优点	缺点	适用场景
随机初始化训练	完全可定制，适应特定任务	需要大量训练数据	领域特定词汇多、训练数据充足
加载预训练嵌入	利用大规模语料知识，训练快、效果好	词汇覆盖受限，无法处理未登录词	通用任务、训练数据有限
冻结预训练嵌入	训练速度快，显存占用小	无法微调嵌入	训练资源有限、只关注上层模型
微调预训练嵌入	可适应特定任务	训练较慢，显存占用大	数据量中等、领域有一定差异

3.2 加载 GloVe 预训练词向量

GloVe 是 Stanford 大学发布的预训练词向量，下面展示如何加载：

实例

import torch
import torch.nn as nn
import numpy as np

# 模拟加载 GloVe 词向量（实际需要下载 GloVe 文件）
# 假设已有一个词向量文件，格式为：每行一个词 followed by its vectors

def load_glove_embeddings(path, word2idx, embedding_dim=300):
"""
加载 GloVe 预训练词向量
path: 词向量文件路径
word2idx: 词到索引的映射字典
embedding_dim: 词向量维度
"""
embeddings = np.random.randn(len(word2idx), embedding_dim).astype(np.float32)
word_count = 0

with open(path, 'r', encoding='utf-8') as f:
for line in f:
values = line.strip().split()
word = values[0]
if word in word2idx:
word_idx = word2idx[word]
embeddings[word_idx] = np.asarray(values[1:], dtype=np.float32)
word_count += 1

print(f"已加载 {word_count}/{len(word2idx)} 个词向量")
return torch.from_numpy(embeddings)

# 假设已有词表
word2idx = {'hello': 0, 'world': 1, 'runoob': 2, 'python': 3}
EMBED_DIM = 300

# 加载预训练嵌入并创建嵌入层
# pretrained_embeddings = load_glove_embeddings('glove.6B.300d.txt', word2idx, EMBED_DIM)
# embedding = nn.Embedding.from_pretrained(pretrained_embeddings, padding_idx=0)

# 简化示例：使用随机初始化的预训练矩阵
pretrained_embeddings = torch.randn(len(word2idx), EMBED_DIM)
embedding = nn.Embedding.from_pretrained(pretrained_embeddings, padding_idx=0)

print(f"嵌入层形状: {embedding.weight.shape}")

3.3 冻结与微调嵌入层

根据任务需求，可以选择冻结或微调嵌入层：

实例

import torch
import torch.nn as nn

embedding = nn.Embedding(num_embeddings=10000, embedding_dim=256)

# 方式一：冻结嵌入层（不参与训练）
embedding.weight.requires_grad = False
# 训练时只有 embedding.weight.requires_grad = False 的参数不会被更新

# 方式二：微调嵌入层（参与训练）
embedding.weight.requires_grad = True

# 方式三：冻结部分词向量（冻结"the", "is"等高频词）
# 假设高频词的索引在 0-99
embedding.weight.requires_grad = True
with torch.no_grad():
embedding.weight[0:100] *= 0 # 或赋值为固定值

# 在优化器中过滤掉冻结的参数
optimizer = torch.optim.Adam(
filter(lambda p: p.requires_grad, embedding.parameters()),
lr=1e-3
)

4. 嵌入层与 RNN/LSTM 结合

词嵌入是 NLP 模型的第一层，将文本 ID 转换为密集向量后，传入 RNN、LSTM 等序列模型进行处理。

4.1 嵌入层 + LSTM 文本分类

实例

import torch
import torch.nn as nn

class TextClassifier(nn.Module):
"""
嵌入层 + LSTM 文本分类模型
"""
def __init__(self, vocab_size, embed_dim, hidden_size, num_classes, padding_idx=0):
super().__init__()

# 词嵌入层
self.embedding = nn.Embedding(
num_embeddings=vocab_size,
embedding_dim=embed_dim,
padding_idx=padding_idx
)

# LSTM 层
self.lstm = nn.LSTM(
input_size=embed_dim,
hidden_size=hidden_size,
num_layers=2,
batch_first=True,
bidirectional=True,
dropout=0.3
)

# 分类器
self.fc = nn.Linear(hidden_size * 2, num_classes)

def forward(self, x):
# x: (batch_size, seq_len) - 词索引
embedded = self.embedding(x) # (batch_size, seq_len, embed_dim)

# LSTM 输出
output, (h_n, c_n) = self.lstm(embedded)

# 取最后一个时间步的隐藏状态（双向拼接）
# 正向最后隐藏状态: h_n[-2]
# 反向最后隐藏状态: h_n[-1]
h_combined = torch.cat([h_n[-2], h_n[-1]], dim=-1)

# 分类
logits = self.fc(h_combined)
return logits

# 模型实例化
VOCAB_SIZE = 10000
EMBED_DIM = 128
HIDDEN_SIZE = 128
NUM_CLASSES = 2

model = TextClassifier(VOCAB_SIZE, EMBED_DIM, HIDDEN_SIZE, NUM_CLASSES)

# 模拟输入：batch_size=4, 序列长度=10
batch_input = torch.randint(1, VOCAB_SIZE, (4, 10))
output = model(batch_input)

print(f"输入形状: {batch_input.shape}") # torch.Size([4, 10])
print(f"输出形状: {output.shape}") # torch.Size([4, 2])

4.2 使用预训练词向量

实例

import torch
import torch.nn as nn

# 假设已加载预训练词向量
pretrained_vectors = torch.randn(10000, 300) # 模拟预训练向量

# 创建嵌入层并加载预训练权重
embedding = nn.Embedding.from_pretrained(
pretrained_vectors,
padding_idx=0,
freeze=False # True: 冻结不训练, False: 微调
)

# 使用 glove 或其他预训练词向量时，通常先冻结训练几轮，再解冻微调
# 前几轮只训练上层模型
embedding.weight.requires_grad = False

# 训练若干轮后，解冻嵌入层进行微调
# embedding.weight.requires_grad = True

5. 位置编码 (Positional Encoding)

与 RNN/LSTM 不同，Transformer 模型不包含位置信息，需要额外添加位置编码来让模型感知序列中词的顺序。

5.1 位置编码原理

位置编码使用正弦和余弦函数生成位置向量：

\begin{aligned} PE_{(pos, 2i)} &= \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \\ PE_{(pos, 2i+1)} &= \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \end{aligned}

这种编码方式的特点是：不同位置的编码可以通过线性变换相互转换，便于模型学习位置关系。

5.2 实现位置编码

实例

import torch
import torch.nn as nn
import math

class PositionalEncoding(nn.Module):
"""
位置编码层
"""
def __init__(self, d_model, max_len=5000):
super().__init__()

# 创建位置编码矩阵
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)

# 计算除数项
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))

# 偶数索引使用 sin，奇数索引使用 cos
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

# 添加 batch 维度，并注册为不参与梯度计算的缓冲区
pe = pe.unsqueeze(0) # (1, max_len, d_model)
self.register_buffer('pe', pe)

def forward(self, x):
"""
x: (batch_size, seq_len, d_model)
"""
seq_len = x.size(1)
# 截取对应长度的位置编码并相加
x = x + self.pe[:, :seq_len, :]
return x

# 使用示例
d_model = 256
max_len = 100

pos_encoding = PositionalEncoding(d_model, max_len)

# 模拟输入：batch_size=4, seq_len=20, d_model=256
x = torch.randn(4, 20, d_model)
x = pos_encoding(x)

print(f"输入形状: {x.shape}") # torch.Size([4, 20, 256])

位置编码是 Transformer 架构中的关键组件，它使模型能够区分不同位置的词，即使它们的嵌入向量相同。

5.3 可学习的位置编码

除了固定的位置编码，也可以使用可学习的位置编码：

实例

import torch
import torch.nn as nn

class LearnablePositionalEncoding(nn.Module):
"""
可学习的位置编码
"""
def __init__(self, d_model, max_len=5000):
super().__init__()
self.pos_embedding = nn.Embedding(max_len, d_model)

def forward(self, x):
batch_size, seq_len, d_model = x.size(1), x.size(1), x.size(2)
# 创建位置索引 [0, 1, 2, ..., seq_len-1]
positions = torch.arange(seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
pos_encoded = self.pos_embedding(positions)
return x + pos_encoded

# 使用示例
pos_encoding = LearnablePositionalEncoding(d_model=256, max_len=100)
x = torch.randn(4, 20, 256)
x = pos_encoding(x)
print(f"输出形状: {x.shape}") # torch.Size([4, 20, 256])

6. 嵌入层的进阶技巧

6.1 降低显存占用

当词表非常大时，嵌入层会占用大量显存。可以使用以下技巧优化：

实例

import torch
import torch.nn as nn

# 技巧一：使用稀疏梯度（sparse=True）
embedding = nn.Embedding(
num_embeddings=100000,
embedding_dim=256,
sparse=True # 梯度以稀疏格式存储，节省显存
)

# 技巧二：使用量化（ quantization）
# 将 float32 转换为 int8 或 float16
embedding_int8 = embedding.to(torch.int8)

# 技巧三：冻结不常用的词向量
# 在大规模词表中，只训练高频词，低频词保持冻结
embedding = nn.Embedding(num_embeddings=100000, embedding_dim=256)
embedding.weight.requires_grad = True

# 冻结索引大于 50000 的词向量
with torch.no_grad():
embedding.weight[50000:] *= 0
embedding.weight.requires_grad = False

# 只训练高频词
embedding.weight[1:50000].requires_grad = True

6.2 处理未登录词 (OOV)

测试集中可能出现词表中没有的词（未登录词 / OOV），需要特殊处理：

实例

import torch
import torch.nn as nn

class EmbeddingWithOOV(nn.Module):
"""
支持处理未登录词的嵌入层
"""
def __init__(self, vocab_size, embed_dim, oov_idx=None):
super().__init__()
self.embedding = nn.Embedding(vocab_size + 1, embed_dim, padding_idx=0)
self.oov_idx = oov_idx if oov_idx is not None else vocab_size # 最后一个索引作为 OOV

def forward(self, x):
# 将 OOV 词的索引替换为 oov_idx
oov_mask = x >= self.vocab_size
x = x.clone()
x[oov_mask] = self.oov_idx
return self.embedding(x)

# 使用哈希技术处理更大规模的词表
class HashEmbedding(nn.Module):
"""
使用哈希技术处理任意规模词表的嵌入层
"""
def __init__(self, num_buckets, embed_dim):
super().__init__()
self.num_buckets = num_buckets
self.embedding = nn.Embedding(num_buckets, embed_dim)

def forward(self, x):
# 将词索引哈希到桶中
# 使用 Python 字典的哈希方式
hashed = torch.remainder(x, self.num_buckets)
return self.embedding(hashed)

6.3 子词嵌入 (Subword Embedding)

对于形态丰富的语言（如德语、俄语），子词嵌入可以有效处理未登录词问题：

实例

import torch
import torch.nn as nn

class SubwordEmbedding(nn.Module):
"""
简化的子词嵌入示例
实际应使用 BPE、WordPiece 等算法进行分词
"""
def __init__(self, vocab_size, embed_dim, char_dim=50):
super().__init__()
# 字符级嵌入
self.char_embedding = nn.Embedding(vocab_size, char_dim)
# 词级嵌入
self.word_embedding = nn.Embedding(vocab_size, embed_dim - char_dim)
# 字符级 LSTM，用于组合字符向量
self.char_lstm = nn.LSTM(
char_dim, char_dim,
batch_first=True, bidirectional=True
)

def forward(self, word_ids, char_ids):
"""
word_ids: 词索引 (batch_size, seq_len)
char_ids: 字符索引 (batch_size, seq_len, max_word_len)
"""
# 词嵌入
word_emb = self.word_embedding(word_ids)

# 字符嵌入 + LSTM
batch_size, seq_len, max_word_len = char_ids.shape
char_ids_flat = char_ids.view(-1, max_word_len) # (batch_size * seq_len, max_word_len)
char_emb = self.char_embedding(char_ids_flat) # (batch_size * seq_len, max_word_len, char_dim)

char_output, (h_n, _) = self.char_lstm(char_emb)
# 取双向最后隐藏状态拼接
char_rep = torch.cat([h_n[-2], h_n[-1]], dim=-1) # (batch_size * seq_len, char_dim * 2)

char_rep = char_rep.view(batch_size, seq_len, -1) # (batch_size, seq_len, char_dim * 2)

# 拼接词嵌入和字符表示
combined = torch.cat([word_emb, char_rep], dim=-1)
return combined

7. 完整实战：文本分类模型

下面是一个完整的文本分类模型示例，包含嵌入层、LSTM 和分类器：

实例

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# ── 配置参数 ──────────────────────────────────────
VOCAB_SIZE = 10000
EMBED_DIM = 128
HIDDEN_SIZE = 128
NUM_LAYERS = 2
NUM_CLASSES = 5
DROPOUT = 0.3
MAX_LEN = 200

# ── 模型定义 ──────────────────────────────────────
class TextClassificationModel(nn.Module):
def __init__(self, vocab_size, embed_dim, hidden_size,
num_layers, num_classes, dropout=0.3, padding_idx=0):
super().__init__()

# 嵌入层
self.embedding = nn.Embedding(
vocab_size,
embed_dim,
padding_idx=padding_idx
)

# BiLSTM
self.lstm = nn.LSTM(
embed_dim,
hidden_size,
num_layers=num_layers,
batch_first=True,
bidirectional=True,
dropout=dropout if num_layers > 1 else 0
)

# 分类器
self.dropout = nn.Dropout(dropout)
self.fc = nn.Linear(hidden_size * 2, num_classes)

def forward(self, x):
# x: (batch_size, seq_len)
embedded = self.embedding(x) # (batch_size, seq_len, embed_dim)

# LSTM
output, (h_n, c_n) = self.lstm(embedded)

# 双向最后隐藏状态拼接
h_forward = h_n[-2] # (batch_size, hidden_size)
h_backward = h_n[-1] # (batch_size, hidden_size)
h_combined = torch.cat([h_forward, h_backward], dim=-1)

# 分类
dropped = self.dropout(h_combined)
logits = self.fc(droped)
return logits

# ── Dataset ──────────────────────────────────────
class TextDataset(Dataset):
def __init__(self, texts, labels, vocab, max_len=200):
self.texts = texts
self.labels = labels
self.vocab = vocab
self.max_len = max_len

def __len__(self):
return len(self.texts)

def __getitem__(self, idx):
text = self.texts[idx][:self.max_len]
# 将词转换为索引，未知词用 1（<UNK>）表示
ids = [self.vocab.get(word, 1) for word in text]
# 补零到固定长度
if len(ids) < self.max_len:
ids += [0] * (self.max_len - len(ids))
return torch.tensor(ids), torch.tensor(self.labels[idx])

# ── 训练函数 ──────────────────────────────────────
def train_epoch(model, loader, optimizer, criterion, device):
model.train()
total_loss = 0
correct = 0
total = 0

for inputs, labels in loader:
inputs, labels = inputs.to(device), labels.to(device)

optimizer.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, labels)
loss.backward()
optimizer.step()

total_loss += loss.item() * inputs.size(0)
_, predicted = outputs.max(1)
correct += predicted.eq(labels).sum().item()
total += labels.size(0)

return total_loss / total, correct / total

# ── 初始化与训练 ──────────────────────────────────
# 假设已有词表
word2idx = {'<PAD>': 0, '<UNK>': 1} # 词表需根据实际语料构建

model = TextClassificationModel(
VOCAB_SIZE, EMBED_DIM, HIDDEN_SIZE,
NUM_LAYERS, NUM_CLASSES, DROPOUT
).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# 模拟训练
print("开始训练...")
for epoch in range(10):
train_loss, train_acc = train_epoch(model, None, optimizer, criterion, device)
print(f"Epoch {epoch+1}: Loss={train_loss:.4f}, Acc={train_acc:.4f}")

8. API 快速参考

8.1 nn.Embedding 常用操作

操作	代码
创建嵌入层	`nn.Embedding(num_embeddings, embedding_dim)`
加载预训练嵌入	`nn.Embedding.from_pretrained(weights)`
查表获取嵌入	`embedding(word_ids)`
冻结嵌入	`embedding.weight.requires_grad = False`
获取嵌入向量	`embedding.weight[idx]`

8.2 预训练词向量资源

资源	维度	特点
GloVe	50, 100, 200, 300	词共现统计，训练快
Word2Vec	50-500	Google 训练，覆盖广
FastText	300	支持子词，处理 OOV 好
BERT	768+	上下文相关，效果最好

8.3 嵌入层选型建议

数据量小（< 10K）
    -> 使用预训练词向量（GloVe/FastText）+ 冻结或微调

数据量中等（10K ~ 100K）
    -> 使用预训练词向量 + 微调

数据量大（> 100K）
    -> 可考虑从头训练，或使用大规模预训练模型

领域差异大
    -> 使用领域相关预训练模型或增量训练

资源受限
    -> 冻结嵌入层，使用较小的嵌入维度

返回顶部

菜鸟教程