Python + PyTorch ML AI Rules

Rules for PyTorch model-training projects: dataset/loader patterns, mixed-precision, distributed training defaults, Hugging Face transformers integration, and W&B logging conventions.

Python · PyTorch · #python #pytorch #huggingface #training · Last updated 2026-05-05

Want to customize this rules file? Open the generator with this stack pre-loaded.

Open in generator

Save at .cursor/rules/main.mdc

PyTorch ML

Project context

This is a PyTorch-based ML project — model training, fine-tuning, and inference. We use Hugging Face for transformers and tokenizers, PyTorch Lightning for training orchestration when scale demands it, and Weights & Biases for experiment tracking.

Stack

  • Python 3.12+
  • PyTorch 2.4+ (CUDA / MPS / CPU)
  • Hugging Face transformers, datasets, accelerate, peft
  • PyTorch Lightning (optional, for multi-GPU)
  • wandb for experiment tracking
  • bitsandbytes for 4-bit / 8-bit quantization
  • uv for env management
  • safetensors for model serialization (never .pt / .bin)

Folder structure

src/
  data/
    dataset.py         — torch.utils.data.Dataset implementations
    collate.py         — collation functions
    tokenize.py
  model/
    model.py           — model classes (or HF AutoModel wrappers)
    config.py
  training/
    train.py           — main training entrypoint
    optimizer.py       — optimizer + scheduler factories
    callbacks.py
  inference/
    predict.py
    serve.py           — FastAPI / Modal / Replicate wrapper
configs/
  base.yaml
  experiments/<name>.yaml
checkpoints/           — saved as safetensors, .gitignored

Reproducibility

import torch, random, numpy as np

def set_seeds(seed: int) -> None:
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # disable for full determinism
  • Set seeds at the top of every training run
  • Pin all dependencies (uv lock committed)
  • Log the git SHA, dataset hash, and full config to W&B
  • Save checkpoints as safetensors, not pickle-based formats
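
A small sketch of capturing the git SHA and dataset hash for logging; the data path is a placeholder, and the helper names are illustrative:

import hashlib
import subprocess

def git_sha() -> str:
    # Short SHA of the current commit; assumes the run starts inside the repo
    return subprocess.check_output(["git", "rev-parse", "--short", "HEAD"], text=True).strip()

def dataset_sha256(path: str) -> str:
    # Content hash of the raw data file so the exact data version is recorded
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Both values go into the W&B config at run start, alongside the full experiment config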

Datasets

  • Use Hugging Face datasets for tabular and text — its memory-mapping is much faster than rolling your own
  • For custom data, subclass torch.utils.data.Dataset with __len__ and __getitem__
  • Move heavy preprocessing into dataset.map(...) so it caches automatically
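
A sketch of both patterns; the tokenizer checkpoint, data path, and column names are placeholders:

import torch
from datasets import load_dataset
from torch.utils.data import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# HF datasets: map() results are cached on disk, so re-runs skip preprocessing
ds = load_dataset("json", data_files="data/train.jsonl", split="train")
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=512), batched=True)

# Custom data: a plain torch Dataset with __len__ and __getitem__
class PairDataset(Dataset):
    def __init__(self, encodings: list[dict], labels: list[int]) -> None:
        self.encodings = encodings
        self.labels = labels

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx: int) -> dict:
        item = {k: torch.tensor(v) for k, v in self.encodings[idx].items()}
        item["label"] = torch.tensor(self.labels[idx])
        return item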

DataLoader

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,             # ~CPU count, but tune
    pin_memory=True,           # if using CUDA
    persistent_workers=True,   # avoid worker startup overhead each epoch
    collate_fn=my_collate,
)
  • pin_memory=True for CUDA, False for MPS / CPU
  • Use persistent_workers=True unless you re-create the dataloader between epochs
  • Set num_workers=0 when debugging (so tracebacks point at the right place)

Training loop

  • Use accelerate or lightning instead of writing your own DDP boilerplate
  • Mixed precision: torch.amp.autocast("cuda") on CUDA or torch.amp.autocast("mps") on Apple Silicon (the older torch.cuda.amp.autocast spelling is deprecated)
  • Gradient accumulation when batch doesn't fit; accelerator.accumulate(model) handles it cleanly
  • Always optimizer.zero_grad(set_to_none=True): setting grads to None frees memory instead of zero-filling (it has been the default since PyTorch 2.0, but being explicit documents the intent)
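
A minimal accelerate-based loop illustrating these points; the tiny model, data, and hyperparameters are stand-ins so the sketch runs end to end:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# Stand-ins; swap in the real model, optimizer, scheduler, and dataloader
model = nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
dataset = TensorDataset(torch.randn(64, 16), torch.randint(0, 4, (64,)))
loader = DataLoader(dataset, batch_size=8, shuffle=True)

# "bf16" needs Ampere+ GPUs or CPU bf16 support; use "fp16" or "no" otherwise
accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=4)
model, optimizer, loader, scheduler = accelerator.prepare(model, optimizer, loader, scheduler)

model.train()
for step, (inputs, targets) in enumerate(loader):
    with accelerator.accumulate(model):   # optimizer.step() is skipped until the accumulation boundary
        logits = model(inputs)
        loss = nn.functional.cross_entropy(logits, targets)
        accelerator.backward(loss)        # handles loss scaling under mixed precision
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)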

Checkpointing

  • Save model + optimizer + scheduler + step + epoch + RNG state
  • Use safetensors.torch.save_file for the model weights — not torch.save(model.state_dict())
  • Keep the last N checkpoints + the best by validation metric; rotate the rest
  • Save config alongside the checkpoint so it's reproducible without the codebase
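
One way to structure this (paths and argument names are illustrative): model weights go to safetensors, while the resume-only training state is nested non-tensor dicts, so torch.save is used for that companion file under the assumption that the "no pickle" rule targets distributed model weights:

from pathlib import Path
import shutil
import torch
from safetensors.torch import save_file

def save_checkpoint(model, optimizer, scheduler, step, epoch, config_path, out_dir) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Model weights: safetensors, loadable without executing pickled code
    save_file(model.state_dict(), str(out / "model.safetensors"))
    # Resume-only training state (optimizer/scheduler state dicts are nested dicts)
    torch.save(
        {
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "step": step,
            "epoch": epoch,
            "rng_torch": torch.get_rng_state(),
            "rng_cuda": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
        },
        out / "training_state.pt",
    )
    # Config next to the weights so the run is reproducible without the codebase
    shutil.copy(config_path, out / "config.yaml")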

LoRA / fine-tuning

  • Use peft for parameter-efficient fine-tuning
  • Set rank deliberately (r=8 is a common sweet spot for 7B-class models; higher for smaller models)
  • Save only the adapter weights — much smaller than full checkpoints
  • For QLoRA: load base model in 4-bit via bitsandbytes, then attach LoRA adapters
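
A sketch of the QLoRA setup described above; the base checkpoint, target modules, and hyperparameters are placeholders and depend on the model architecture:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",   # placeholder base model
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # architecture-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# After training: saves only the adapter weights, not the base model
model.save_pretrained("checkpoints/adapter")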

Inference

  • Use model.eval() and with torch.no_grad(): (or torch.inference_mode() — slightly faster)
  • For batch inference, batch up requests; for streaming, use transformers' TextIteratorStreamer
  • Quantize for inference if memory is tight: bitsandbytes, GGUF (llama.cpp), or PyTorch-native int8 quantization via torchao
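
A minimal streaming-generation sketch with TextIteratorStreamer; the gpt2 checkpoint and prompt are placeholders so the snippet runs as-is:

from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Small placeholder checkpoint; swap in the real model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt").to(model.device)

# generate() blocks, so run it in a background thread and consume tokens as they stream out
# (generate() already runs without gradient tracking internally)
thread = Thread(target=model.generate, kwargs={**inputs, "max_new_tokens": 64, "streamer": streamer})
thread.start()
for text_chunk in streamer:
    print(text_chunk, end="", flush=True)
thread.join()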

Patterns to avoid

  • pickle / torch.save(..., 'model.pt') — use safetensors
  • Hand-rolling DDP — use accelerate or lightning
  • Hard-coded paths in scripts — use a config file (Hydra / YAML)
  • Forgetting .eval() at inference — dropout and batchnorm will give wrong results
  • .cuda() everywhere — use model.to(device) and pass device from config
  • Silent device mismatches (RuntimeError: Expected all tensors to be on the same device) — set up a to_device helper like the sketch below
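
A to_device helper along the lines of the last point; a sketch that moves tensors nested inside common container types:

from typing import Any
import torch

def to_device(obj: Any, device: torch.device) -> Any:
    # Recursively move tensors found in dicts, lists, and tuples to the target device
    if isinstance(obj, torch.Tensor):
        return obj.to(device, non_blocking=True)
    if isinstance(obj, dict):
        return {k: to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_device(v, device) for v in obj)
    return obj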

Logging & experiment tracking

  • W&B: log loss every step, metrics every epoch, hyperparameters at start
  • Save the config as a W&B artifact
  • Use W&B's media logging for sample inputs/outputs
  • Tag runs (baseline, ablation-x) for filtering
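
A sketch of these conventions; the project name, metric keys, and example values are placeholders:

import wandb

run = wandb.init(
    project="my-project",                     # placeholder project name
    config={"lr": 3e-4, "batch_size": 32},    # full hyperparameter dict at start
    tags=["baseline"],                        # tags make run filtering easy
)

# Inside the training loop: loss every step, metrics every epoch
wandb.log({"train/loss": 0.42}, step=100)
wandb.log({"val/accuracy": 0.91, "epoch": 1})

# Save the experiment config as an artifact
artifact = wandb.Artifact("experiment-config", type="config")
artifact.add_file("configs/experiments/foo.yaml")
run.log_artifact(artifact)

# Media logging for sample inputs/outputs
table = wandb.Table(columns=["prompt", "completion"])
table.add_data("Explain LoRA in one sentence.", "LoRA trains small low-rank adapters...")
wandb.log({"samples": table})

run.finish()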

Testing

  • Smoke tests with a 2-batch tiny dataset to catch shape errors fast
  • Unit-test datasets and collation functions
  • For models, test forward pass shapes; for losses, test gradient flow
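
A shape-and-gradient smoke test along these lines; the tiny stand-in model and dimensions are illustrative:

import torch
import torch.nn as nn

def test_forward_shapes_and_gradients() -> None:
    # Tiny stand-in model; in practice import the real one from src.model
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
    batch = torch.randn(2, 16)                  # 2-sample "tiny dataset"
    targets = torch.randint(0, 4, (2,))

    logits = model(batch)
    assert logits.shape == (2, 4), f"unexpected output shape {logits.shape}"

    loss = nn.functional.cross_entropy(logits, targets)
    loss.backward()
    # Gradient flow: every parameter should receive a gradient
    assert all(p.grad is not None for p in model.parameters())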

Tooling

  • uv venv && uv sync
  • python -m src.training.train --config configs/experiments/foo.yaml
  • pytest
  • ruff check && ruff format
  • nvidia-smi / nvtop to watch GPU

AI behavioral rules

  • Always set seeds at the top of any training script
  • Never use pickle / raw torch.save for model weights — use safetensors
  • Always wrap inference in torch.inference_mode() and model.eval()
  • Prefer accelerate over hand-rolled distributed code
  • Log experiments to W&B by default; never silent training runs
  • Verify shapes via tiny-batch smoke test before launching long training
  • Don't add new dependencies without surfacing the GPU/memory implications
  • Run pytest (smoke tests) and ruff check before declaring a task done

Frequently asked

How do I use this PyTorch rules file with Cursor?

Pick "Cursor (.cursor/rules/*.mdc)" from the format dropdown above and click Copy. Save it at .cursor/rules/main.mdc in your project root and restart Cursor. The legacy .cursorrules format still works if you're on an older Cursor version — pick that option instead.

Can I use this with Claude Code (CLAUDE.md)?

Yes — pick "Claude Code (CLAUDE.md)" from the format dropdown above and copy. Save the file as CLAUDE.md at your repo root. Claude Code reads it automatically on every session. For monorepos, you can also drop nested CLAUDE.md files in subdirectories — Claude merges them when working in those paths.

Where exactly do I put this file?

It depends on the AI tool. Cursor reads .cursorrules or .cursor/rules/*.mdc at the project root. Claude reads CLAUDE.md at the project root. Copilot reads .github/copilot-instructions.md. The "Save at" path under each format in the dropdown shows the exact location for the format you picked.

Can I customize these PyTorch rules for my project?

Yes — that's what the generator is for. Click "Open in generator" above and the wizard loads with this stack's defaults pre-selected. Toggle on or off the conventions you want, then re-export in your AI tool's format.

Will using this rules file slow down my AI tool?

No. Rules files count toward the model's context window but not toward latency in any noticeable way. The file is loaded once per session, not per token. The library files target 250–400 lines, well within every tool's recommended budget.

Should I commit this file to git?

Yes. The rules file is project documentation that benefits every developer using the AI tool. Commit it. The exception is personal-global settings (e.g. ~/.claude/CLAUDE.md) which are user-scoped and stay out of the repo.
