# pytorchlightning/pytorch_lightning

PyTorch Lightning is a lightweight wrapper around PyTorch designed to simplify high-performance AI research. This Docker image bundles PyTorch Lightning together with its dependencies, providing a ready-to-use environment for deep learning research.

The core idea of PyTorch Lightning is to decouple PyTorch code, separating research logic from engineering details, so that researchers can focus on the model itself rather than boilerplate such as the training loop.
Pull the image:

```bash
docker pull pytorchlightning/pytorch_lightning
```

Start an interactive container:

```bash
docker run -it --rm pytorchlightning/pytorch_lightning /bin/bash
```
Example `docker run` command with GPU access and the current directory mounted as the workspace:

```bash
docker run -it --rm \
  --gpus all \
  -v $(pwd):/workspace \
  -w /workspace \
  pytorchlightning/pytorch_lightning \
  python train.py
```
Create a `docker-compose.yml` file:

```yaml
version: '3'
services:
  lightning-training:
    image: pytorchlightning/pytorch_lightning
    volumes:
      - ./:/workspace
    working_dir: /workspace
    command: python train.py
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1  # GPUs to use
      - PYTHONUNBUFFERED=1        # stream logs in real time
```
Start the service:

```bash
docker-compose up
```
PyTorch Lightning is configured mainly through the `Trainer` class. Common parameters include:

- `max_epochs`: maximum number of training epochs
- `gpus`: number of GPUs to use; specific GPU indices can also be given
- `tpu_cores`: number of TPU cores to use
- `num_nodes`: number of nodes for distributed training
- `precision`: training precision, 16-bit or 32-bit
- `logger`: logger configuration
- `callbacks`: list of callbacks

Relevant environment variables:

- `CUDA_VISIBLE_DEVICES`: GPU devices visible inside the container
- `PYTHONPATH`: Python module search path
- `PL_TORCH_DISTRIBUTED_BACKEND`: PyTorch distributed backend, either `"nccl"` or `"gloo"`
- `PL_VERBOSITY`: log verbosity, 0–4 (0 = silent, 4 = debug)
- `PL_ENABLE_WANDB`: whether to enable the Weights & Biases logging integration

A minimal training script (`train.py`):

```python
import os

import torch
from torch import nn
import torch.nn.functional as F
from torchvision.datasets import MNIST
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
import pytorch_lightning as pl


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 28 * 28))

    def forward(self, x):
        # Forward pass used for inference/prediction
        embedding = self.encoder(x)
        return embedding

    def training_step(self, batch, batch_idx):
        # Defines the training loop
        x, y = batch
        x = x.view(x.size(0), -1)
        z = self.encoder(x)
        x_hat = self.decoder(z)
        loss = F.mse_loss(x_hat, x)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer


def main():
    dataset = MNIST(os.getcwd(), download=True, transform=transforms.ToTensor())
    train, val = random_split(dataset, [55000, 5000])

    autoencoder = LitAutoEncoder()
    trainer = pl.Trainer(max_epochs=10)
    trainer.fit(autoencoder, DataLoader(train), DataLoader(val))


if __name__ == "__main__":
    main()
```
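To illustrate how a variable such as `CUDA_VISIBLE_DEVICES` is interpreted, here is a minimal, hypothetical helper (not part of Lightning; CUDA and Lightning read this variable internally, you normally just set it) that parses it into a list of GPU indices:

```python
import os


def visible_gpus():
    """Parse CUDA_VISIBLE_DEVICES into a list of GPU indices.

    Hypothetical helper for illustration only: an unset or empty
    variable means no GPUs are restricted/visible to this logic.
    """
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [int(part) for part in raw.split(",") if part.strip()]


os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
print(visible_gpus())  # [0, 1]
```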
Run the script inside the container:

```bash
docker run -it --rm \
  --gpus all \
  -v $(pwd):/workspace \
  -w /workspace \
  pytorchlightning/pytorch_lightning \
  python train.py
```
No changes to the model code are needed; scaling to multiple GPUs or TPUs only requires adjusting the `Trainer` arguments:

```python
# Train on 8 GPUs
trainer = pl.Trainer(max_epochs=10, gpus=8)

# Multi-node training (256 GPUs)
trainer = pl.Trainer(max_epochs=10, gpus=8, num_nodes=32)

# TPU training (8 TPU cores)
trainer = pl.Trainer(max_epochs=10, tpu_cores=8)

# Train on a single specific TPU core
trainer = pl.Trainer(max_epochs=10, tpu_cores=[1])
```
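Note that in more recent PyTorch Lightning releases (1.7 and later), the `gpus` and `tpu_cores` arguments were superseded by the unified `accelerator`/`devices` interface. A sketch of the equivalent calls, assuming a recent Lightning version and matching hardware:

```python
import pytorch_lightning as pl

# Equivalent configurations using the newer accelerator/devices API
trainer = pl.Trainer(max_epochs=10, accelerator="gpu", devices=8)
trainer = pl.Trainer(max_epochs=10, accelerator="gpu", devices=8, num_nodes=32)
trainer = pl.Trainer(max_epochs=10, accelerator="tpu", devices=8)
```

If the image ships an older Lightning version, the `gpus`/`tpu_cores` form shown above still applies.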
This image is released under the Apache 2.0 license. The Lightning framework is patent pending.