Python Data Science (pandas / numpy / scikit-learn)
Project context
This is a data-science codebase combining notebooks, scripts, and reusable library code. We optimize for reproducibility, clarity, and reasonable performance — not for milliseconds. Notebooks for exploration; modules for production paths.
Stack
- Python 3.12+
- pandas 2.2+
- numpy 1.26+
- scikit-learn 1.4+
- matplotlib + seaborn for plotting (or plotly for interactive)
- Jupyter / JupyterLab for notebooks
- uv for environment + package management
- pytest for tests
- Ruff for lint + format
Folder structure
```
src/
  data/         — data loaders (raw → cleaned)
  features/     — feature engineering
  models/       — model definitions, training, inference
  evaluation/   — metrics, plotting helpers
  utils/
notebooks/
  01-explore.ipynb
  02-baseline.ipynb
  03-improvements.ipynb
data/
  raw/          — never overwritten; usually .gitignored
  interim/
  processed/
tests/
```
Notebooks are for exploration and reporting. Reusable code lives in `src/` and gets imported into notebooks (`%load_ext autoreload`).
Reproducibility rules
- Pin the Python version (`.python-version`)
- Pin every dependency (`uv lock` → `uv.lock` committed)
- Set seeds at the top of every notebook and training script:

  ```python
  np.random.seed(42)
  random.seed(42)
  torch.manual_seed(42)  # if using PyTorch
  ```

- Save model artifacts with metadata: training data hash, hyperparameters, metrics, git SHA (see the sketch after this list)
- Never commit data files over a few MB — use git-lfs or external storage
- Use environment-specific config (`.env`) for paths and secrets, never hard-code
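A minimal sketch of the artifact rule, assuming `joblib` for serialization; `save_artifact` and the exact metadata keys are illustrative, not an existing helper in this repo:

```python
import hashlib
import json
import subprocess
from pathlib import Path

import joblib


def save_artifact(pipe, train_df, params: dict, metrics: dict, out_dir: Path) -> None:
    """Hypothetical helper: persist a fitted pipeline plus the metadata
    needed to reproduce it (data hash, hyperparameters, metrics, git SHA)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(pipe, out_dir / "model.joblib")
    meta = {
        "train_data_sha256": hashlib.sha256(
            train_df.to_csv(index=False).encode()
        ).hexdigest(),
        "params": params,
        "metrics": metrics,
        "git_sha": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
    }
    (out_dir / "metadata.json").write_text(json.dumps(meta, indent=2))
```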
pandas idioms
- Use `.loc[...]` for assignment — not chained indexing (`df[mask]['col'] = ...` will warn or silently fail)
- `pd.NA` over `None` for missing values in pandas 2+
- `pd.read_parquet` over `pd.read_csv` for intermediate storage — faster, type-stable
- `assign` for chaining — easier to reorder than nested `.copy()` + assignment
- `pipe()` for custom functions — keeps method chaining readable
- Avoid `.iterrows()` — vectorize, or use `df.apply(axis=1)` only as a last resort
- `.merge(...)` over `pd.merge(...)` for clarity
- Set `dtype=` explicitly on `read_csv` when you can — pandas guesses are fragile
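A short sketch tying these idioms together; the file path, column names, and `add_amount_eur` are invented for illustration:

```python
import pandas as pd

# Explicit dtypes on read, instead of letting pandas guess.
df = pd.read_csv(
    "data/raw/orders.csv",
    dtype={"order_id": "string", "amount": "Float64", "region": "category"},
)

# .loc for assignment; the mask is made NA-safe with fillna(False).
bad = (df["amount"] < 0).fillna(False)
df.loc[bad, "amount"] = pd.NA


def add_amount_eur(frame: pd.DataFrame, rate: float) -> pd.DataFrame:
    return frame.assign(amount_eur=frame["amount"] * rate)


# assign + pipe keep the chain flat and easy to reorder.
clean = df.dropna(subset=["amount"]).pipe(add_amount_eur, rate=0.92)

# Parquet for intermediate storage: faster and type-stable.
clean.to_parquet("data/interim/orders.parquet")
```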
numpy idioms
- Vectorize. If you're writing a Python `for` loop over an array, stop and find the numpy primitive.
- Use `np.float32` for ML pipelines unless you need `float64` precision — half the memory
- `np.einsum` for clear reductions; broadcasting for clear elementwise math
- Don't call `np.array(...)` repeatedly inside a hot loop — preallocate
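The same rules in numpy terms; shapes and values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal((10_000, 64)).astype(np.float32)  # float32: half the memory
w = rng.standard_normal(64).astype(np.float32)

# Vectorized: one matmul instead of a Python loop over rows.
scores = x @ w

# einsum spells out the same reduction explicitly: row-wise dot products.
scores_einsum = np.einsum("ij,j->i", x, w)

# Preallocate once instead of building arrays inside a hot loop.
out = np.empty(len(x), dtype=np.float32)
np.maximum(scores, 0.0, out=out)  # in-place write, no per-call allocation
```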
scikit-learn idioms
- Use `Pipeline` — not raw fit/transform chains. Pipelines integrate with `GridSearchCV` and prevent data leakage.
- `ColumnTransformer` for mixed types — different scaling for numeric, encoding for categorical
- Set `random_state` on every estimator and splitter
- Use `cross_val_score` / `cross_validate` instead of a single train/test split
- Save trained pipelines with `joblib.dump` — not `pickle` directly
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(random_state=42, max_iter=1000)),
])
pipe.fit(X_train, y_train)
```
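For mixed dtypes the same pattern extends with `ColumnTransformer`; the toy frame and column names below are invented:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with numeric and categorical features.
X = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [30_000.0, 52_000.0, 88_000.0, 61_000.0],
    "region": ["north", "south", "south", "west"],
})
y = [0, 0, 1, 1]

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

pipe = Pipeline([
    ("pre", pre),
    ("clf", LogisticRegression(random_state=42, max_iter=1000)),
])
pipe.fit(X, y)  # preprocessing is fit inside the pipeline, so CV refits it per fold
```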
Notebook discipline
- One concern per notebook — don't pile up exploratory cells indefinitely
- Number notebooks in execution order (`01-`, `02-`, `03-`)
- Restart-and-run-all before sharing — verify the notebook runs top to bottom
- Move repeated code to `src/` immediately — notebooks are ephemeral, modules are durable
- Use `%load_ext autoreload` + `%autoreload 2` so module edits hot-reload in the notebook (see the setup cell below)
- Don't commit large notebook outputs (clear before commit, or use `nbstripout`)
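A typical first cell that implements the autoreload and seeding rules; the `src.data` import is a placeholder for whatever module the notebook actually uses:

```python
# First notebook cell: hot-reload src/ modules and fix seeds.
%load_ext autoreload
%autoreload 2

import random

import numpy as np

random.seed(42)
np.random.seed(42)

from src.data import loaders  # placeholder: edits to src/ now reload automatically
```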
Plotting
- Matplotlib for static plots; seaborn for statistical defaults
- Always label axes and title
- Use `figsize=(width, height)` deliberately — defaults are too small
- Save figures with `dpi=150` minimum for sharing; SVG / PDF for publication
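A minimal example of these rules on toy data (the save path assumes a `figures/` directory exists):

```python
import matplotlib.pyplot as plt
import seaborn as sns

amounts = [12.0, 30.5, 18.2, 44.9, 27.3, 33.1, 9.8]  # toy values

fig, ax = plt.subplots(figsize=(8, 5))  # deliberate size, not the tiny default
sns.histplot(amounts, ax=ax)
ax.set_xlabel("Order amount (EUR)")
ax.set_ylabel("Count")
ax.set_title("Order amount distribution")
fig.savefig("figures/amounts.png", dpi=150, bbox_inches="tight")  # >=150 dpi for sharing
fig.savefig("figures/amounts.svg")  # vector format for publication
```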
Patterns to avoid
- Chained indexing for assignment: `df[mask]['col'] = x`
- `.iterrows()` in production paths
- Silent dtype changes — `pd.concat` of mixed dtypes will upcast
- Hard-coded paths — use `pathlib.Path` and a config
- Modifying `df` in place across cells — name new variables, easier to debug
- Mixing train and test in `fit_transform` — fit on train, transform on test, always (sketch below)
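The last point is the most common leak, so here is the correct split-then-fit order on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(42).standard_normal((100, 3))  # toy features

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics

# Wrong: scaler.fit_transform(X_test) lets test statistics leak into preprocessing.
```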
Testing
- pytest for any function in `src/`
- Use small fixture DataFrames (10 rows) for tests — fast, deterministic
- Test the interface of pipelines, not the underlying scikit-learn behavior
- For training scripts, write at least one smoke test that runs end-to-end on tiny data
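A sketch of the fixture pattern; `clean_orders` and `src.data.cleaning` are hypothetical names standing in for real `src/` code:

```python
import pandas as pd
import pytest


@pytest.fixture
def tiny_orders() -> pd.DataFrame:
    # 10 rows: fast and deterministic
    return pd.DataFrame({
        "order_id": list(range(10)),
        "amount": [10.0, 20.0, None, 40.0, 5.0, 12.5, 8.0, 3.3, 99.0, 1.0],
    })


def test_clean_orders_drops_missing_amounts(tiny_orders):
    from src.data.cleaning import clean_orders  # hypothetical module under test

    result = clean_orders(tiny_orders)
    assert result["amount"].notna().all()
    assert len(result) == 9
```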
Tooling
- `uv venv && uv sync` — reproducible env
- `jupyter lab` — notebooks
- `pytest` — tests
- `ruff check && ruff format` — lint + format
- `mypy src/` — type check (notebooks excluded)
AI behavioral rules
- Vectorize before suggesting any pandas / numpy `for` loop
- Always set `random_state` on splitters, estimators, and anything stochastic
- Use `Pipeline` and `ColumnTransformer` — don't write fit/transform chains by hand
- Never modify raw data files — always write derived data to `data/processed/`
- Don't suggest `pickle` for model serialization — use `joblib`
- Move reusable code from notebooks to `src/` modules; notebooks should call modules
- Run `pytest` and `ruff check` before declaring a task done