PyTorch Transfer Learning
Transfer learning is the technique of taking a model pretrained on a large dataset and adapting it to a new task that has far less data.
It is one of the most widely used techniques in deep learning practice today: in most cases it trains faster, needs less data, and performs better than training from scratch.
1. Core Idea of Transfer Learning
The features a deep network learns on ImageNet are broadly reusable:
- Shallow layers: generic low-level features (edges, textures, color gradients)
- Middle layers: generic mid-level features (shapes, parts, texture combinations)
- Deep layers: task-specific high-level features (faces, wheels, text)
The low- and mid-level features transfer to most vision tasks and do not need to be relearned.
When should you use transfer learning?
| Data size | Similarity to source task | Recommended strategy |
|---|---|---|
| Small (< 1000) | High | Replace only the classifier head; freeze the entire backbone |
| Small (< 1000) | Low | Fine-tune the shallower layers; freeze the deep layers |
| Medium (1000~10000) | High | Fine-tune the whole network with a small learning rate |
| Medium (1000~10000) | Low | Fine-tune the deeper layers; freeze the shallow layers |
| Large (> 10000) | Any | Fine-tune everything, or consider training from scratch |
Comparing the three core strategies
The layer hierarchy of a pretrained model (e.g. ResNet50):
- Conv layers 1~3 (low-level features: edges/textures): usually frozen
- Conv layers 4~6 (mid-level features: shapes/parts): optionally frozen
- Conv layers 7~N (high-level features: semantics): fine-tuned
- Classifier head: replaced and trained
2. Loading a Pretrained Model
PyTorch ships many official pretrained models through torchvision.models, and loading one is straightforward.
Example
import torchvision.models as models
# Load a pretrained model (weights are downloaded automatically)
# torchvision >= 0.13 recommends the new API: the weights argument
from torchvision.models import ResNet50_Weights
model = models.resnet50(weights=ResNet50_Weights.IMAGENET1K_V2)
# Legacy API (still works, but emits a deprecation warning)
model = models.resnet50(pretrained=True)
# No pretrained weights (architecture only)
model = models.resnet50(weights=None)
Inspecting the model structure
Example
print(model)
# Inspect only the last layers (the classifier head)
print(model.fc)
# Linear(in_features=2048, out_features=1000, bias=True)
# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
Classifier head attribute per model family
Different model families expose the classifier head under different attribute names; replace the matching layer when transferring:
| Model | Classifier head attribute |
|---|---|
| ResNet / RegNet | model.fc |
| VGG / AlexNet | model.classifier[-1] |
| DenseNet | model.classifier |
| EfficientNet | model.classifier[-1] |
| MobileNetV2/V3 | model.classifier[-1] |
| ViT (Vision Transformer) | model.heads.head |
| ConvNeXt | model.classifier[-1] |
| Inception V3 | model.fc |
| Swin Transformer | model.head |
3. The Three Transfer Strategies
3.1 Strategy 1: Feature Extraction (freeze everything)
Freeze all parameters of the pretrained model and train only the newly attached classifier head.
Best when data is very scarce (a few hundred images) or the task closely matches the source task.
Example
import torch
import torch.nn as nn
import torchvision.models as models
from torchvision.models import ResNet18_Weights
NUM_CLASSES = 5  # number of classes in the target task
# Step 1: load the pretrained model
model = models.resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
# Step 2: freeze all parameters
for param in model.parameters():
    param.requires_grad = False
# Step 3: replace the classifier head (its parameters default to requires_grad=True)
in_features = model.fc.in_features  # 512
model.fc = nn.Linear(in_features, NUM_CLASSES)
# Sanity check: only the head is trainable
trainable = [(n, p.shape) for n, p in model.named_parameters() if p.requires_grad]
print(f"Trainable tensors: {len(trainable)}")
for name, shape in trainable:
    print(f"  {name}: {shape}")
# Output:
#   fc.weight: torch.Size([5, 512])
#   fc.bias: torch.Size([5])
# Step 4: pass only the trainable parameters to the optimizer (more efficient)
optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()),
    lr=1e-3
)
# Equivalent, clearer form:
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
Feature extraction is the simplest and most common transfer strategy, and it works especially well when data is scarce.
3.2 Strategy 2: Fine-tuning
Unfreeze all or some of the pretrained layers and train the whole model with a small learning rate.
Best for medium-sized datasets, or when the task differs somewhat from the source task.
Example
import torch
import torch.nn as nn
import torchvision.models as models
NUM_CLASSES = 10
model = models.resnet50(weights='IMAGENET1K_V2')
# Option A: full fine-tuning (unfreeze everything)
# Freeze first...
for param in model.parameters():
    param.requires_grad = False
# ...then unfreeze (equivalent to full fine-tuning; this pattern is common with gradual unfreezing)
for param in model.parameters():
    param.requires_grad = True
# Replace the classifier head
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
# Full fine-tuning: small LR for the backbone, larger LR for the head (see Strategy 3)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
# Option B: unfreeze the last N blocks (partial fine-tuning)
model = models.resnet50(weights='IMAGENET1K_V2')
# Freeze everything first
for param in model.parameters():
    param.requires_grad = False
# Unfreeze only layer4 and fc (ResNet's last block and the classifier head)
for param in model.layer4.parameters():
    param.requires_grad = True
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # fc is trainable by default
print("Trainable parameters:")
for name, param in model.named_parameters():
    if param.requires_grad:
        print(f"  {name}")
3.3 Strategy 3: Layer-wise (Discriminative) Learning Rates
Give the backbone a small learning rate (to preserve pretrained knowledge) and the classifier head a large one (to adapt quickly to the new task).
This is the most widely used fine-tuning strategy in industry and usually gives the best overall results.
Example
import torch
import torch.nn as nn
import torchvision.models as models
NUM_CLASSES = 8
model = models.resnet50(weights='IMAGENET1K_V2')
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
# Scheme 1: two LR groups (backbone vs head)
optimizer = torch.optim.Adam([
    {'params': model.fc.parameters(), 'lr': 1e-3},      # head: large LR
    {'params': [p for n, p in model.named_parameters()  # backbone: small LR
                if not n.startswith('fc')],
     'lr': 1e-5},
])
# Scheme 2: layer-wise decaying LRs (finest-grained)
# The closer a layer is to the output, the larger its LR
layer_groups = [
    (model.layer1, 1e-5),  # shallowest block, smallest LR
    (model.layer2, 3e-5),
    (model.layer3, 1e-4),
    (model.layer4, 3e-4),  # deepest backbone block
    (model.fc, 1e-3),      # classifier head, largest LR
]
param_groups = [
    {'params': layer.parameters(), 'lr': lr}
    for layer, lr in layer_groups
]
optimizer = torch.optim.Adam(param_groups)
# Scheme 3: gradual unfreezing
# Train only the head at first, then unfreeze more layers over time (recommended by fastai)
model = models.resnet50(weights='IMAGENET1K_V2')
for param in model.parameters():
    param.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)
def unfreeze_layers(model, num_layers):
    """Unfreeze the last num_layers blocks of a ResNet."""
    layers = [model.layer4, model.layer3, model.layer2, model.layer1]
    for i in range(min(num_layers, len(layers))):
        for param in layers[i].parameters():
            param.requires_grad = True
# Epochs 1-5: train only the classifier head
# Epochs 6-10: unfreeze layer4
unfreeze_layers(model, num_layers=1)
# Epoch 11+: unfreeze more blocks
unfreeze_layers(model, num_layers=3)
4. Popular Pretrained Models
4.1 Image classification
Example
import torchvision.models as models
# ResNet family (the classic choice, good for most tasks)
resnet18 = models.resnet18(weights='IMAGENET1K_V1')    # lightweight, good for edge devices
resnet50 = models.resnet50(weights='IMAGENET1K_V2')    # the balanced default
resnet101 = models.resnet101(weights='IMAGENET1K_V2')  # stronger, slower
# EfficientNet family (excellent accuracy/efficiency trade-off)
effnet_b0 = models.efficientnet_b0(weights='IMAGENET1K_V1')  # lightest
effnet_b4 = models.efficientnet_b4(weights='IMAGENET1K_V1')  # balanced
effnet_b7 = models.efficientnet_b7(weights='IMAGENET1K_V1')  # strongest
# Vision Transformer (first choice for large datasets)
vit_b16 = models.vit_b_16(weights='IMAGENET1K_V1')  # ViT-Base/16
vit_l16 = models.vit_l_16(weights='IMAGENET1K_V1')  # ViT-Large/16
# MobileNet (mobile/embedded deployment)
mobilenet_v3 = models.mobilenet_v3_small(weights='IMAGENET1K_V1')
# ConvNeXt (modernized CNN, performance close to ViT)
convnext_t = models.convnext_tiny(weights='IMAGENET1K_V1')
convnext_b = models.convnext_base(weights='IMAGENET1K_V1')
Performance comparison of mainstream models (ImageNet Top-1 accuracy):
| Model | Top-1 Acc | Params | Inference speed | Best for |
|---|---|---|---|---|
| ResNet-18 | 69.8% | 11.7M | Very fast | Constrained hardware, quick prototypes |
| ResNet-50 | 80.9% | 25.6M | Fast | General-purpose default |
| EfficientNet-B4 | 83.4% | 19.3M | Medium | Accuracy/efficiency balance |
| ConvNeXt-Base | 84.1% | 88.6M | Medium | High-accuracy CNN |
| ViT-B/16 | 81.1% | 86.6M | Medium | Large datasets |
| ViT-L/16 | 85.1% | 307M | Slow | Highest accuracy |
4.2 Object detection
Example
from torchvision.models import detection
# Faster R-CNN (classic two-stage detector)
faster_rcnn = detection.fasterrcnn_resnet50_fpn(weights='DEFAULT')
# SSD (one-stage detector, fast)
ssd = detection.ssd300_vgg16(weights='DEFAULT')
# RetinaNet
retinanet = detection.retinanet_resnet50_fpn(weights='DEFAULT')
# FCOS
fcos = detection.fcos_resnet50_fpn(weights='DEFAULT')
# Replace the Faster R-CNN box head to match the new number of classes
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
NUM_CLASSES = 5 + 1  # 5 object classes + 1 background class
faster_rcnn = detection.fasterrcnn_resnet50_fpn(weights='DEFAULT')
in_features = faster_rcnn.roi_heads.box_predictor.cls_score.in_features
faster_rcnn.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
4.3 Text models (HuggingFace)
Transfer learning for NLP tasks usually goes through the HuggingFace transformers library:
Example
from transformers import (
    BertForSequenceClassification,
    RobertaForSequenceClassification,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
NUM_CLASSES = 3
# BERT (the default choice for text classification)
model = BertForSequenceClassification.from_pretrained(
    'bert-base-chinese',  # Chinese BERT
    num_labels=NUM_CLASSES
)
tokenizer = AutoTokenizer.from_pretrained('bert-base-chinese')
# RoBERTa (improved BERT, stronger)
model = RobertaForSequenceClassification.from_pretrained(
    'roberta-base',
    num_labels=NUM_CLASSES
)
# Generic loader (detects the model type automatically)
model = AutoModelForSequenceClassification.from_pretrained(
    'hfl/chinese-roberta-wwm-ext',  # Chinese RoBERTa
    num_labels=NUM_CLASSES
)
5. Data Preprocessing and Augmentation
When using ImageNet-pretrained weights, you must normalize with the same statistics used during pretraining; otherwise the feature distributions will not match and accuracy drops sharply.
Example
from torchvision import transforms
# Standard ImageNet normalization (shared by all torchvision pretrained models)
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]
# Training set: with augmentation
train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224),        # random crop + resize
    transforms.RandomHorizontalFlip(p=0.5),   # random horizontal flip
    transforms.ColorJitter(                   # color jitter
        brightness=0.2, contrast=0.2,
        saturation=0.2, hue=0.1
    ),
    transforms.RandomRotation(degrees=15),    # random rotation
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
# Validation/test set: no random augmentation
val_transforms = transforms.Compose([
    transforms.Resize(256),      # resize to 256 first
    transforms.CenterCrop(224),  # then center-crop to 224
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])
# Dataset loading (directory layout: root/class_name/img.jpg)
from torchvision.datasets import ImageFolder
from torch.utils.data import DataLoader
train_dataset = ImageFolder(root='data/train', transform=train_transforms)
val_dataset = ImageFolder(root='data/val', transform=val_transforms)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          num_workers=4, pin_memory=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False,
                        num_workers=4, pin_memory=True)
print(f"Training samples: {len(train_dataset)}")
print(f"Classes: {train_dataset.classes}")
print(f"Class-to-index map: {train_dataset.class_to_idx}")
Using torchvision's official recommended preprocessing
Example
import torchvision.models as models
from torchvision.models import ResNet50_Weights
weights = ResNet50_Weights.IMAGENET1K_V2
model = models.resnet50(weights=weights)
preprocess = weights.transforms()  # returns the matching preprocessing pipeline
# preprocess already includes Resize(232), CenterCrop(224), Normalize, etc.
# For training, just add augmentation on top of it
Always use the ImageNet normalization statistics [0.485, 0.456, 0.406] and [0.229, 0.224, 0.225]; with anything else, the pretrained features will not line up.
6. End-to-End Example: Binary Image Classification
Using cats vs. dogs as the example, here is the full transfer-learning pipeline from data preparation to training and evaluation:
Example
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR
from torchvision import models, transforms, datasets
from torch.utils.data import DataLoader, random_split
from torchvision.models import EfficientNet_B0_Weights
# Configuration
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
NUM_CLASSES = 2
BATCH_SIZE = 32
EPOCHS = 20
BASE_LR = 1e-3
DATA_DIR = 'data/cats_and_dogs'
# Data preparation
train_tfm = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])
val_tfm = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225]),
])
full_dataset = datasets.ImageFolder(DATA_DIR, transform=train_tfm)
n_val = int(len(full_dataset) * 0.2)
n_train = len(full_dataset) - n_val
train_set, val_set = random_split(full_dataset, [n_train, n_val])
# Point the val subset at a second ImageFolder so it uses val_tfm instead of train_tfm
val_set.dataset = datasets.ImageFolder(DATA_DIR, transform=val_tfm)
train_loader = DataLoader(train_set, BATCH_SIZE, shuffle=True,
                          num_workers=4, pin_memory=True)
val_loader = DataLoader(val_set, BATCH_SIZE, shuffle=False,
                        num_workers=4, pin_memory=True)
# Build the transfer model
weights = EfficientNet_B0_Weights.IMAGENET1K_V1
model = models.efficientnet_b0(weights=weights)
# Freeze the backbone
for param in model.parameters():
    param.requires_grad = False
# Replace the classifier head (matching EfficientNet-B0's head structure)
in_features = model.classifier[1].in_features  # 1280
model.classifier = nn.Sequential(
    nn.Dropout(p=0.2, inplace=True),
    nn.Linear(in_features, NUM_CLASSES),
)
model = model.to(DEVICE)
# Optimizer: layer-wise learning rates
optimizer = optim.Adam([
    {'params': model.classifier.parameters(), 'lr': BASE_LR},
    {'params': model.features.parameters(), 'lr': BASE_LR * 0.1},
])
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS, eta_min=1e-6)
# Training and evaluation functions
def train_epoch(model, loader, optimizer, criterion):
    model.train()
    total_loss, correct = 0.0, 0
    for imgs, labels in loader:
        imgs, labels = imgs.to(DEVICE), labels.to(DEVICE)
        optimizer.zero_grad()
        outputs = model(imgs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * imgs.size(0)
        correct += (outputs.argmax(1) == labels).sum().item()
    n = len(loader.dataset)
    return total_loss / n, correct / n
def eval_epoch(model, loader, criterion):
    model.eval()
    total_loss, correct = 0.0, 0
    with torch.no_grad():
        for imgs, labels in loader:
            imgs, labels = imgs.to(DEVICE), labels.to(DEVICE)
            outputs = model(imgs)
            loss = criterion(outputs, labels)
            total_loss += loss.item() * imgs.size(0)
            correct += (outputs.argmax(1) == labels).sum().item()
    n = len(loader.dataset)
    return total_loss / n, correct / n
# Two-phase training
print("=== Phase 1: train only the classifier head (epochs 1-5) ===")
best_acc = 0.0
for epoch in range(1, EPOCHS + 1):
    # From epoch 6 on, unfreeze the backbone and fine-tune everything
    if epoch == 6:
        print("\n=== Phase 2: unfreeze the backbone, full fine-tuning (epochs 6-20) ===")
        for param in model.features.parameters():
            param.requires_grad = True
    train_loss, train_acc = train_epoch(model, train_loader, optimizer, criterion)
    val_loss, val_acc = eval_epoch(model, val_loader, criterion)
    scheduler.step()
    print(f"Epoch {epoch:2d}/{EPOCHS} | "
          f"Train Loss: {train_loss:.4f}, Acc: {train_acc:.4f} | "
          f"Val Loss: {val_loss:.4f}, Acc: {val_acc:.4f} | "
          f"LR: {scheduler.get_last_lr()[0]:.2e}")
    if val_acc > best_acc:
        best_acc = val_acc
        torch.save(model.state_dict(), 'best_model.pth')
        print(f"  ✓ Saved best model, Acc={best_acc:.4f}")
print(f"\nTraining done. Best validation accuracy: {best_acc:.4f}")
Inference and prediction
Example
from PIL import Image
# Load the best checkpoint
model.load_state_dict(torch.load('best_model.pth', map_location=DEVICE))
model.eval()
# Predict a single image
def predict(image_path, model, class_names):
    img = Image.open(image_path).convert('RGB')
    tensor = val_tfm(img).unsqueeze(0).to(DEVICE)  # (1, 3, 224, 224)
    with torch.inference_mode():
        logits = model(tensor)
        probs = torch.softmax(logits, dim=1)[0]
    pred = probs.argmax().item()
    for cls, prob in zip(class_names, probs.tolist()):
        print(f"  {cls}: {prob:.4f}")
    print(f"Prediction: {class_names[pred]}")
    return class_names[pred]
class_names = train_set.dataset.classes  # ['cat', 'dog']
predict('test_cat.jpg', model, class_names)
7. End-to-End Example: Text Classification (BERT)
Example
import torch
import torch.nn as nn
from torch.optim import AdamW  # transformers.AdamW is deprecated; use the torch.optim version
from torch.utils.data import Dataset, DataLoader
from transformers import (
    BertTokenizer,
    BertForSequenceClassification,
    get_linear_schedule_with_warmup,
)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
MODEL_NAME = 'bert-base-chinese'
NUM_CLASSES = 3  # sentiment: positive / neutral / negative
MAX_LEN = 128
BATCH_SIZE = 16
EPOCHS = 5
LR = 2e-5  # recommended BERT fine-tuning range: 1e-5 ~ 5e-5
# Custom Dataset
class SentimentDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len
    def __len__(self):
        return len(self.texts)
    def __getitem__(self, idx):
        encoding = self.tokenizer(
            self.texts[idx],
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].squeeze(0),
            'attention_mask': encoding['attention_mask'].squeeze(0),
            'label': torch.tensor(self.labels[idx], dtype=torch.long),
        }
# Load the pretrained BERT model
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=NUM_CLASSES,
    hidden_dropout_prob=0.1,
)
model = model.to(DEVICE)
# Optionally freeze the bottom N layers
# BERT-base has 12 Transformer layers; freezing the first few saves compute
FREEZE_LAYERS = 6  # freeze the first 6 layers
for i, layer in enumerate(model.bert.encoder.layer):
    if i < FREEZE_LAYERS:
        for param in layer.parameters():
            param.requires_grad = False
print(f"Froze the first {FREEZE_LAYERS} layers, skipping ~{FREEZE_LAYERS/12*100:.0f}% of parameter updates")
# Data loading
# Toy Chinese samples (replace with a real dataset): "great movie!", "bad service, disappointed", "so-so"
train_texts = ["这部电影太棒了!", "服务很差,失望", "一般般,没什么特别"]
train_labels = [2, 0, 1]  # 0: negative, 1: neutral, 2: positive
train_dataset = SentimentDataset(train_texts, train_labels, tokenizer, MAX_LEN)
train_loader = DataLoader(train_dataset, BATCH_SIZE, shuffle=True)
# Optimizer and scheduler
# The standard BERT recipe: AdamW + linear warmup
no_decay = ['bias', 'LayerNorm.weight']
optimizer_grouped = [
    {'params': [p for n, p in model.named_parameters()
                if not any(nd in n for nd in no_decay)], 'weight_decay': 0.01},
    {'params': [p for n, p in model.named_parameters()
                if any(nd in n for nd in no_decay)], 'weight_decay': 0.0},
]
optimizer = AdamW(optimizer_grouped, lr=LR)
total_steps = len(train_loader) * EPOCHS
warmup_steps = int(total_steps * 0.1)  # 10% of the steps used for warmup
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)
# Training loop
for epoch in range(1, EPOCHS + 1):
    model.train()
    total_loss, correct = 0.0, 0
    for batch in train_loader:
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['label'].to(DEVICE)
        optimizer.zero_grad()
        outputs = model(input_ids=input_ids,
                        attention_mask=attention_mask,
                        labels=labels)
        loss = outputs.loss  # BERT computes the CE loss internally when labels are passed
        logits = outputs.logits
        loss.backward()
        # Standard for BERT fine-tuning: gradient clipping to prevent exploding gradients
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        total_loss += loss.item()
        correct += (logits.argmax(1) == labels).sum().item()
    avg_loss = total_loss / len(train_loader)
    acc = correct / len(train_dataset)
    print(f"Epoch {epoch}/{EPOCHS} | Loss: {avg_loss:.4f} | Acc: {acc:.4f}")
# Inference
def predict_sentiment(text, model, tokenizer, id2label):
    model.eval()
    encoding = tokenizer(text, max_length=MAX_LEN, padding='max_length',
                         truncation=True, return_tensors='pt')
    with torch.inference_mode():
        outputs = model(
            input_ids=encoding['input_ids'].to(DEVICE),
            attention_mask=encoding['attention_mask'].to(DEVICE),
        )
    pred = outputs.logits.argmax(1).item()
    return id2label[pred]
id2label = {0: 'negative', 1: 'neutral', 2: 'positive'}
print(predict_sentiment("这款产品质量真的很好!", model, tokenizer, id2label))  # Chinese input: "This product's quality is really good!"
8. Modifying Model Structure
A generic head-replacement helper
Example
import torch.nn as nn
import torchvision.models as models
def build_transfer_model(arch, num_classes, pretrained=True, dropout=0.5):
    """
    Build a transfer model generically: detect the classifier head and replace it.
    Supports: resnet, efficientnet, densenet, mobilenet, vit, swin, convnext
    """
    weights = 'IMAGENET1K_V1' if pretrained else None
    model = getattr(models, arch)(weights=weights)
    name = arch.lower()
    if 'resnet' in name or 'resnext' in name or 'inception' in name:
        in_f = model.fc.in_features
        model.fc = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(in_f, num_classes)
        )
    elif 'efficientnet' in name or 'mobilenet' in name or 'convnext' in name:
        in_f = model.classifier[-1].in_features
        model.classifier[-1] = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(in_f, num_classes)
        )
    elif 'densenet' in name:
        in_f = model.classifier.in_features
        model.classifier = nn.Linear(in_f, num_classes)
    elif 'vit' in name:
        in_f = model.heads.head.in_features
        model.heads.head = nn.Linear(in_f, num_classes)
    elif 'swin' in name:
        # Swin exposes its head as model.head, not model.heads.head
        in_f = model.head.in_features
        model.head = nn.Linear(in_f, num_classes)
    else:
        raise ValueError(f"Unsupported architecture: {arch}")
    return model
# Usage
model_resnet = build_transfer_model('resnet50', num_classes=10)
model_effnet = build_transfer_model('efficientnet_b3', num_classes=10)
model_vit = build_transfer_model('vit_b_16', num_classes=10)
Adding custom layers on top of backbone features
Example
import torch.nn as nn
import torchvision.models as models
class TransferWithAttention(nn.Module):
    """Attach a custom attention gate and classifier head to a pretrained backbone."""
    def __init__(self, num_classes, dropout=0.5):
        super().__init__()
        backbone = models.resnet50(weights='IMAGENET1K_V2')
        # Drop the original classifier head, keep the feature extractor
        self.backbone = nn.Sequential(*list(backbone.children())[:-1])
        self.feat_dim = 2048  # ResNet50 feature dimension
        # Custom attention gate
        self.attention = nn.Sequential(
            nn.Linear(self.feat_dim, 512),
            nn.Tanh(),
            nn.Linear(512, 1),
            nn.Sigmoid()
        )
        # Classifier head
        self.classifier = nn.Sequential(
            nn.Dropout(dropout),
            nn.Linear(self.feat_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )
    def forward(self, x):
        feat = self.backbone(x).flatten(1)  # (N, 2048)
        weight = self.attention(feat)       # (N, 1)
        feat = feat * weight                # gated features
        return self.classifier(feat)
9. Transfer Learning Best Practices
Choosing learning rates
Example
# Backbone LR is usually 1/10 to 1/100 of the head LR
backbone_lr = 1e-5  # conservative; recommended when data is scarce
head_lr = 1e-3      # new layers start from random init and need a larger LR
# Learning rates for BERT / ViT and other Transformers
# These large models are extremely LR-sensitive; going outside the range can destroy pretrained knowledge
bert_lr = 2e-5  # recommended range: 1e-5 ~ 5e-5
Common problems and fixes
Example
# Problem 1: GPU out of memory (OOM)
# Fix 1: reduce the batch size
# Fix 2: gradient checkpointing (trade compute for memory)
from torch.utils.checkpoint import checkpoint_sequential
features = model.features  # keep a reference; reassigning first would make the lambda recurse into itself
model.features = lambda x: checkpoint_sequential(features, 4, x)
# Fix 3: freeze more layers (less backward computation)
# Problem 2: high training accuracy, low validation accuracy (overfitting)
# Fix 1: stronger data augmentation
# Fix 2: larger Dropout
model.classifier = nn.Sequential(nn.Dropout(0.5), nn.Linear(in_f, num_classes))
# Fix 3: label smoothing
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
# Fix 4: freeze more backbone layers to cut trainable parameters
# Problem 3: unstable training, oscillating loss
# Fix 1: lower the learning rate (halve it and retry)
# Fix 2: add LR warmup
# Fix 3: gradient clipping
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Fix 4: switch optimizer (Adam → AdamW)
# Problem 4: normalization mismatch
# Wrong: custom normalization that doesn't match pretraining
transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])
# Right: use the ImageNet mean and std
transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])
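The warmup fix under Problem 3 can be written without the transformers helper used in Section 7; a minimal sketch with LambdaLR, using illustrative warmup_steps/total_steps values and a stand-in model:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import LambdaLR

model = nn.Linear(10, 2)  # stand-in for the transfer model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

warmup_steps, total_steps = 100, 1000  # illustrative values

def lr_lambda(step):
    # Linear warmup from 0 to the base LR, then linear decay back to 0
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

# At step 0 the LR is 0; it reaches the base LR after warmup_steps
scheduler = LambdaLR(optimizer, lr_lambda)
# Call scheduler.step() once per optimizer step (per batch, not per epoch)
```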
Strategy cheat sheet
| Situation | Recommendation |
|---|---|
| < 500 samples | Freeze all layers; train only the classifier head |
| 500~2000 samples | Freeze the first 2/3; fine-tune the last 1/3 |
| 2000~10000 samples | Full fine-tuning; backbone LR 1e-5, head LR 1e-3 |
| > 10000 samples | Full fine-tuning; consider larger LRs or training from scratch |
| Training too slow | Train frozen for a few epochs before unfreezing; use AMP mixed precision |
| Accuracy plateau | Switch to a bigger/newer backbone; try ConvNeXt / ViT |
| Deploying to production | Prefer lightweight models such as EfficientNet-B0 / MobileNetV3 |
| Chinese NLP tasks | BERT-base-chinese or chinese-roberta-wwm-ext |
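The AMP mixed precision mentioned in the "training too slow" row takes only a few extra lines around the usual forward/backward; a device-agnostic sketch with a stand-in model (on CPU both autocast and the scaler are simply disabled and the step runs in full precision):

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = nn.Linear(128, 10).to(device)  # stand-in for the transfer model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# GradScaler rescales the loss so fp16 gradients don't underflow
scaler = torch.cuda.amp.GradScaler(enabled=(device.type == "cuda"))

x = torch.randn(32, 128, device=device)
y = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
# autocast runs the forward pass in half precision where it is safe
with torch.autocast(device_type=device.type, enabled=(device.type == "cuda")):
    loss = criterion(model(x), y)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
```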
Recommended full training recipe
Phase 1 (epochs 1 to N/4):
- Freeze the backbone; train only the classifier head
- Use a fairly large learning rate (1e-3) so the head converges quickly
Phase 2 (epochs N/4 to N):
- Unfreeze the backbone; full fine-tuning
- Small LR for the backbone (1e-5); keep the head at 1e-3
- Combine with cosine annealing or ReduceLROnPlateau
Checkpointing:
- Save only the checkpoint with the best validation metric
- Save model / optimizer / scheduler state together
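The checkpointing advice above can be sketched as a single dict saved with torch.save (the function names and the best_acc field are illustrative, and a stand-in model replaces the real one):

```python
import os
import tempfile
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(10, 2)  # stand-in for the transfer model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scheduler = CosineAnnealingLR(optimizer, T_max=20)

def save_checkpoint(path, epoch, best_acc):
    # One dict holding everything needed to resume training exactly
    torch.save({
        "epoch": epoch,
        "best_acc": best_acc,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "scheduler": scheduler.state_dict(),
    }, path)

def load_checkpoint(path):
    ckpt = torch.load(path, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    scheduler.load_state_dict(ckpt["scheduler"])
    return ckpt["epoch"], ckpt["best_acc"]

path = os.path.join(tempfile.gettempdir(), "ckpt.pth")
save_checkpoint(path, epoch=7, best_acc=0.93)
epoch, best_acc = load_checkpoint(path)
```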
