# PyTorch ML
## Project context
This is a PyTorch-based ML project — model training, fine-tuning, and inference. We use Hugging Face for transformers and tokenizers, PyTorch Lightning for training orchestration when scale demands it, and Weights & Biases for experiment tracking.
## Stack

- Python 3.12+
- PyTorch 2.4+ (CUDA / MPS / CPU)
- Hugging Face `transformers`, `datasets`, `accelerate`, `peft`
- PyTorch Lightning (optional, for multi-GPU)
- `wandb` for experiment tracking
- `bitsandbytes` for 4-bit / 8-bit quantization
- `uv` for env management
- `safetensors` for model serialization (never `.pt` / `.bin`)
## Folder structure

```
src/
  data/
    dataset.py    — torch.utils.data.Dataset implementations
    collate.py    — collation functions
    tokenize.py
  model/
    model.py      — model classes (or HF AutoModel wrappers)
    config.py
  training/
    train.py      — main training entrypoint
    optimizer.py  — optimizer + scheduler factories
    callbacks.py
  inference/
    predict.py
    serve.py      — FastAPI / Modal / Replicate wrapper
configs/
  base.yaml
  experiments/<name>.yaml
checkpoints/      — saved as safetensors, .gitignored
```
## Reproducibility

```python
import random

import numpy as np
import torch


def set_seeds(seed: int) -> None:
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False  # disable for full determinism
```
- Set seeds at the top of every training run
- Pin all dependencies (`uv lock` committed)
- Log the git SHA, dataset hash, and full config to W&B (see the sketch below)
- Save checkpoints as `safetensors`, not pickle-based formats
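A minimal sketch of that provenance logging, assuming a `cfg` dict already loaded from the YAML config and a single-file dataset; the project name and paths are placeholders:

```python
import hashlib
import subprocess
from pathlib import Path

import wandb


def file_sha256(path: str) -> str:
    """Hash the dataset file so the exact data version is recorded with the run."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

run = wandb.init(
    project="my-project",  # placeholder project name
    config={
        **cfg,  # assumed: full config dict loaded from configs/*.yaml
        "git_sha": git_sha,
        "dataset_sha256": file_sha256("data/train.jsonl"),  # placeholder path
    },
)
```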
## Datasets

- Use Hugging Face `datasets` for tabular and text — its memory-mapping is much faster than rolling your own
- For custom data, subclass `torch.utils.data.Dataset` with `__len__` and `__getitem__`
- Move heavy preprocessing into `dataset.map(...)` so it caches automatically
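A minimal sketch of both approaches, assuming a text-classification-style dataset with `text` and `label` fields and placeholder model/file names:

```python
import torch
from datasets import load_dataset
from torch.utils.data import Dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # placeholder model


class TextClassificationDataset(Dataset):
    """Custom Dataset: only __len__ and __getitem__ are required."""

    def __init__(self, records: list[dict], max_length: int = 512):
        self.records = records
        self.max_length = max_length

    def __len__(self) -> int:
        return len(self.records)

    def __getitem__(self, idx: int) -> dict:
        row = self.records[idx]
        enc = tokenizer(row["text"], truncation=True, max_length=self.max_length)
        return {**enc, "labels": row["label"]}


# HF datasets: heavy preprocessing inside .map() is cached to disk automatically,
# so re-running the script skips the tokenization step.
ds = load_dataset("json", data_files="data/train.jsonl", split="train")  # placeholder path
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)
```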
## DataLoader

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=32,
    shuffle=True,
    num_workers=4,            # ~CPU count, but tune
    pin_memory=True,          # if using CUDA
    persistent_workers=True,  # avoid worker startup overhead each epoch
    collate_fn=my_collate,
)
```
- `pin_memory=True` for CUDA, `False` for MPS / CPU
- Use `persistent_workers=True` unless you re-create the dataloader between epochs
- Set `num_workers=0` when debugging (so tracebacks point at the right place)
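The `collate_fn` passed above lives in `src/data/collate.py`. A minimal padding collate, assuming batches of dicts with `input_ids`, integer `labels`, and pad token id 0:

```python
import torch


def my_collate(batch: list[dict]) -> dict[str, torch.Tensor]:
    """Pad variable-length sequences to the longest example in the batch."""
    max_len = max(len(item["input_ids"]) for item in batch)
    input_ids, attention_mask, labels = [], [], []
    for item in batch:
        pad = max_len - len(item["input_ids"])
        input_ids.append(list(item["input_ids"]) + [0] * pad)  # 0 = assumed pad token id
        attention_mask.append([1] * len(item["input_ids"]) + [0] * pad)
        labels.append(item["labels"])
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }
```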
## Training loop

- Use `accelerate` or `lightning` instead of writing your own DDP boilerplate (see the sketch after this list)
- Mixed precision: `torch.amp.autocast("cuda")` (CUDA) or `torch.amp.autocast("mps")` (Apple Silicon); `torch.cuda.amp.autocast` is deprecated in PyTorch 2.4+
- Gradient accumulation when the batch doesn't fit; `accelerator.accumulate(model)` handles it cleanly
- Always `optimizer.zero_grad(set_to_none=True)` — already the default in PyTorch 2.x, but being explicit documents the intent
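A sketch of the inner loop with `accelerate`, assuming `model`, `optimizer`, `scheduler`, and `loader` are already built; the precision and accumulation settings are illustrative:

```python
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=4)
model, optimizer, loader, scheduler = accelerator.prepare(model, optimizer, loader, scheduler)

model.train()
for batch in loader:
    with accelerator.accumulate(model):      # gradient sync + real step only at accumulation boundaries
        outputs = model(**batch)
        accelerator.backward(outputs.loss)   # replaces loss.backward(); handles scaling
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)
```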
## Checkpointing

- Save model + optimizer + scheduler + step + epoch + RNG state
- Use `safetensors.torch.save_file` for the model weights — not `torch.save(model.state_dict())`
- Keep the last N checkpoints + the best by validation metric; rotate the rest
- Save the config alongside the checkpoint so it's reproducible without the codebase
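A sketch of a checkpoint writer following those rules; the weights go to `safetensors`, everything else to a separate training-state file, and the names and paths are placeholders:

```python
import random
from pathlib import Path

import numpy as np
import torch
from safetensors.torch import save_file


def save_checkpoint(model, optimizer, scheduler, step, epoch, cfg, out_dir: str) -> None:
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    save_file(model.state_dict(), str(out / "model.safetensors"))  # weights only, no pickle
    torch.save(
        {
            "optimizer": optimizer.state_dict(),
            "scheduler": scheduler.state_dict(),
            "step": step,
            "epoch": epoch,
            "rng": {
                "python": random.getstate(),
                "numpy": np.random.get_state(),
                "torch": torch.get_rng_state(),
                "cuda": torch.cuda.get_rng_state_all() if torch.cuda.is_available() else None,
            },
            "config": cfg,  # config travels with the checkpoint
        },
        out / "trainer_state.pt",
    )
```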
## LoRA / fine-tuning

- Use `peft` for parameter-efficient fine-tuning
- Set rank deliberately (`r=8` is a common sweet spot for 7B-class models; higher for smaller models)
- Save only the adapter weights — much smaller than full checkpoints
- For QLoRA: load the base model in 4-bit via `bitsandbytes`, then attach LoRA adapters
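A sketch of the QLoRA path; the base model name and `target_modules` are placeholders that depend on the architecture:

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "some-org/some-7b-model",  # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

lora_config = LoraConfig(
    r=8,                                  # common sweet spot for 7B-class models
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # architecture-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# After training: save only the adapter weights, not the full base model.
model.save_pretrained("checkpoints/adapter")
```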
## Inference

- Use `model.eval()` and `with torch.no_grad():` (or `torch.inference_mode()` — slightly faster)
- For batch inference, batch up requests; for streaming, use `transformers`' `TextIteratorStreamer`
- Quantize for inference if memory is tight: `bitsandbytes`, `gguf`, or `torch.compile`'s 8-bit path
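A sketch of streaming generation, assuming `model` and `tokenizer` are already loaded; the prompt and generation settings are illustrative:

```python
from threading import Thread

import torch
from transformers import TextIteratorStreamer

model.eval()
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
inputs = tokenizer("Explain LoRA in one sentence.", return_tensors="pt").to(model.device)


def _generate() -> None:
    # inference_mode is thread-local, so enter it inside the worker thread
    with torch.inference_mode():
        model.generate(**inputs, streamer=streamer, max_new_tokens=128)


thread = Thread(target=_generate)
thread.start()
for text_chunk in streamer:  # yields decoded text pieces as they are produced
    print(text_chunk, end="", flush=True)
thread.join()
```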
## Patterns to avoid

- `pickle` / `torch.save(..., 'model.pt')` — use `safetensors`
- Hand-rolling DDP — use `accelerate` or `lightning`
- Hard-coded paths in scripts — use a config file (Hydra / YAML)
- Forgetting `.eval()` at inference — dropout and batchnorm will give wrong results
- `.cuda()` everywhere — use `model.to(device)` and pass `device` from config
- Silent device mismatches (`RuntimeError: Expected all tensors to be on the same device`) — set up a `to_device` helper
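A hypothetical `to_device` helper for the last two points; it moves nested batch structures in one call instead of scattering `.cuda()`:

```python
import torch


def to_device(obj, device: torch.device):
    """Recursively move tensors inside dicts / lists / tuples to `device`."""
    if torch.is_tensor(obj):
        return obj.to(device, non_blocking=True)
    if isinstance(obj, dict):
        return {k: to_device(v, device) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(to_device(v, device) for v in obj)
    return obj


# device comes from config in practice; this fallback is just for the sketch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch = to_device(batch, device)  # `batch` = one item from the DataLoader
```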
## Logging & experiment tracking

- W&B: log loss every step, metrics every epoch, hyperparameters at start
- Save the config as a W&B artifact
- Use W&B's media logging for sample inputs/outputs
- Tag runs (`baseline`, `ablation-x`) for filtering
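A sketch of that cadence; `cfg`, `loader`, `training_step`, and the metric names are placeholders:

```python
import wandb

run = wandb.init(project="my-project", tags=["baseline"], config=cfg)  # placeholders

for step, batch in enumerate(loader):
    loss = training_step(batch)                 # hypothetical helper returning a float
    wandb.log({"train/loss": loss}, step=step)  # every step

# once per epoch
wandb.log({"val/accuracy": val_accuracy, "epoch": epoch})

# sample inputs/outputs via media logging
table = wandb.Table(columns=["prompt", "completion"], data=[["<prompt>", "<model output>"]])
wandb.log({"samples": table})
```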
## Testing
- Smoke tests with a 2-batch tiny dataset to catch shape errors fast
- Unit-test datasets and collation functions
- For models, test forward pass shapes; for losses, test gradient flow
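A pytest-style sketch of the tiny-batch smoke test; `build_tiny_model`, `build_tiny_dataset`, and `my_collate` are hypothetical factories from this repo's `src/` layout:

```python
import torch
from torch.utils.data import DataLoader


def test_tiny_batch_smoke():
    model = build_tiny_model()                   # hypothetical: small config, runs on CPU
    dataset = build_tiny_dataset(num_samples=8)  # hypothetical: 2 batches of 4
    loader = DataLoader(dataset, batch_size=4, collate_fn=my_collate)

    batch = next(iter(loader))
    outputs = model(**batch)

    # shape check: one row of logits per example
    assert outputs.logits.shape[0] == 4

    # gradient-flow check: the loss actually reaches the trainable parameters
    outputs.loss.backward()
    assert any(p.grad is not None for p in model.parameters() if p.requires_grad)
```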
## Tooling

- `uv venv && uv sync`
- `python -m src.training.train --config configs/experiments/foo.yaml`
- `pytest`
- `ruff check && ruff format`
- `nvidia-smi` / `nvtop` to watch GPU
## AI behavioral rules

- Always set seeds at the top of any training script
- Never use `pickle` / raw `torch.save` for model weights — use `safetensors`
- Always wrap inference in `torch.inference_mode()` and `model.eval()`
- Prefer `accelerate` over hand-rolled distributed code
- Log experiments to W&B by default; never silent training runs
- Verify shapes via a tiny-batch smoke test before launching long training
- Don't add new dependencies without surfacing the GPU/memory implications
- Run `pytest` (smoke tests) and `ruff check` before declaring a task done