Python Data Science AI Rules

Rules for data-science notebooks and scripts: pandas idioms, numpy vectorization, scikit-learn pipelines, environment management with uv, and reproducible-result conventions.

Want to customize this rules file? Open the generator with this stack pre-loaded.

Open in generator

Save at .cursor/rules/main.mdc

Python Data Science (pandas / numpy / scikit-learn)

Project context

This is a data-science codebase combining notebooks, scripts, and reusable library code. We optimize for reproducibility, clarity, and reasonable performance — not for milliseconds. Notebooks for exploration; modules for production paths.

Stack

  • Python 3.12+
  • pandas 2.2+
  • numpy 1.26+
  • scikit-learn 1.4+
  • matplotlib + seaborn for plotting (or plotly for interactive)
  • Jupyter / JupyterLab for notebooks
  • uv for environment + package management
  • pytest for tests
  • Ruff for lint + format

Folder structure

src/
  data/          — data loaders (raw → cleaned)
  features/      — feature engineering
  models/        — model definitions, training, inference
  evaluation/    — metrics, plotting helpers
  utils/
notebooks/
  01-explore.ipynb
  02-baseline.ipynb
  03-improvements.ipynb
data/
  raw/           — never overwritten; usually .gitignored
  interim/
  processed/
tests/

Notebooks are for exploration and reporting. Reusable code lives in src/ and gets imported into notebooks (%load_ext autoreload).

Reproducibility rules

  • Pin the Python version (.python-version)
  • Pin every dependency (run uv lock, with uv.lock committed)
  • Set seeds at the top of every notebook and training script:
    import random
    import numpy as np
    random.seed(42)
    np.random.seed(42)
    torch.manual_seed(42)  # if using PyTorch
  • Save model artifacts with metadata: training data hash, hyperparameters, metrics, git SHA (see the sketch after this list)
  • Never commit data files over a few MB — use git-lfs or external storage
  • Use environment-specific config (.env) for paths and secrets, never hard-code
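
One way to satisfy the artifact-metadata rule, sketched as a hypothetical save_model helper (the function name and file layout are illustrative, not a project convention):

import hashlib
import json
import subprocess
from pathlib import Path

import joblib
import pandas as pd

def save_model(pipe, train_df: pd.DataFrame, params: dict, metrics: dict, out_dir: Path) -> None:
    # Hypothetical helper: persist a fitted pipeline next to its provenance.
    out_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(pipe, out_dir / "model.joblib")
    meta = {
        "data_hash": hashlib.sha256(
            pd.util.hash_pandas_object(train_df, index=True).values.tobytes()
        ).hexdigest(),
        "params": params,
        "metrics": metrics,
        "git_sha": subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip(),
    }
    (out_dir / "metadata.json").write_text(json.dumps(meta, indent=2))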

pandas idioms

  • Use .loc[...] for assignment — not chained indexing (df[mask]['col'] = ... will warn or silently fail)
  • pd.NA over None for missing values in pandas 2+
  • pd.read_parquet over pd.read_csv for intermediate storage — faster, type-stable
  • assign for chaining — easier to reorder than nested .copy() + assignment
  • pipe() for custom functions — keeps method chaining readable
  • Avoid .iterrows() — vectorize, or use df.apply(axis=1) only as a last resort
  • df.merge(...) over pd.merge(...) for clarity
  • Set dtype= explicitly on read_csv when you can — pandas guesses are fragile
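
A minimal sketch of these idioms working together; the file path and column names (region, units, price) are made up for illustration:

import pandas as pd

df = pd.read_csv(
    "data/raw/sales.csv",
    dtype={"region": "category", "units": "Int64", "price": "float64"},
)

# .loc for assignment, never chained indexing
df.loc[df["units"].isna(), "units"] = 0

def add_revenue(frame: pd.DataFrame) -> pd.DataFrame:
    return frame.assign(revenue=frame["units"] * frame["price"])

clean = (
    df.pipe(add_revenue)  # custom step stays inside the chain
      .assign(high_value=lambda d: d["revenue"] > 1_000)
)
clean.to_parquet("data/interim/sales.parquet")  # parquet for intermediates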

numpy idioms

  • Vectorize. If you're writing a Python for loop over an array, stop and find the numpy primitive.
  • Use np.float32 for ML pipelines unless you need float64 precision — half the memory
  • np.einsum for clear reductions; broadcasting for clear elementwise math
  • Don't call np.array(...) repeatedly inside a hot loop — preallocate
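
For example, a sum of squares written the vectorized way; the shapes here are arbitrary:

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=1_000_000).astype(np.float32)

# Avoid: a Python-level loop over the array
# total = 0.0
# for v in x:
#     total += v * v

total = np.dot(x, x)  # one vectorized call does the reduction in C

M = rng.normal(size=(10_000, 64)).astype(np.float32)
row_sq_norms = np.einsum("ij,ij->i", M, M)  # per-row sum of squares, intent is explicit

out = np.empty_like(x)  # preallocate once, outside any hot loop
np.square(x, out=out)   # fill in place, no repeated allocation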

scikit-learn idioms

  • Use Pipeline — not raw fit/transform chains. Pipelines integrate with GridSearchCV and prevent data leakage.
  • ColumnTransformer for mixed types — different scaling for numeric, encoding for categorical (second example below)
  • Set random_state on every estimator and splitter
  • Use cross_val_score / cross_validate instead of a single train/test split
  • Save trained pipelines with joblib.dump — not pickle directly
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(random_state=42, max_iter=1000)),
])
pipe.fit(X_train, y_train)
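
Extending that into the mixed-dtype case from the ColumnTransformer bullet; the column names (age, income, region) are placeholders:

from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),                 # placeholder numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),  # placeholder categorical column
])
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(random_state=42, max_iter=1000)),
])
scores = cross_val_score(pipe, X_train, y_train, cv=5)  # CV over the whole pipeline, no leakage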

Notebook discipline

  • One concern per notebook — don't pile up exploratory cells indefinitely
  • Number notebooks in execution order (01-, 02-, 03-)
  • Restart-and-run-all before sharing — verify the notebook runs top to bottom
  • Move repeated code to src/ immediately — notebooks are ephemeral, modules are durable
  • Use %load_ext autoreload + %autoreload 2 so module edits hot-reload in the notebook
  • Don't commit large notebook outputs (clear before commit, or use nbstripout)
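
A first cell that follows these rules might look like this (the src.data import is a stand-in for your own module):

%load_ext autoreload
%autoreload 2

import random

import numpy as np
import pandas as pd

from src.data.loaders import load_clean  # hypothetical project module

random.seed(42)
np.random.seed(42)

df = load_clean("data/processed/train.parquet")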

Plotting

  • Matplotlib for static plots; seaborn for statistical defaults
  • Always label axes and title
  • Use figsize=(width, height) deliberately — defaults are too small
  • Save figures with dpi=150 minimum for sharing; SVG / PDF for publication
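
A sketch with made-up data, applying the sizing and labeling rules (the output path is illustrative):

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 200)
fig, ax = plt.subplots(figsize=(8, 5))  # deliberate size, not the cramped default
ax.plot(x, np.sin(x))
ax.set_xlabel("time (s)")
ax.set_ylabel("amplitude")
ax.set_title("Example signal")
fig.savefig("reports/signal.png", dpi=150, bbox_inches="tight")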

Patterns to avoid

  • Chained indexing for assignment: df[mask]['col'] = x
  • .iterrows() in production paths
  • Silent dtype changes — pd.concat of mixed dtypes will upcast
  • Hard-coded paths — use pathlib.Path and a config
  • Modifying df in place across cells — name new variables, easier to debug
  • Mixing train and test in fit_transform — fit on train, transform on test, always
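
The first and last bullets, shown side by side as a sketch (df, scaler, and the train/test arrays are assumed from context):

# Avoid: chained indexing may write to a temporary copy
df[df["units"] > 0]["flag"] = True

# Prefer: one .loc call
df.loc[df["units"] > 0, "flag"] = True

# Avoid: fitting on train + test leaks test statistics into the scaler
# scaler.fit_transform(np.vstack([X_train, X_test]))

# Prefer: fit on train only, transform both
scaler.fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)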

Testing

  • pytest for any function in src/
  • Use small fixture DataFrames (10 rows) for tests — fast, deterministic
  • Test the interface of pipelines, not the underlying scikit-learn behavior
  • For training scripts, write at least one smoke test that runs end-to-end on tiny data
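
A sketch of a small-fixture test; the module path and function under test are hypothetical:

import pandas as pd
import pytest

from src.features.build import add_revenue  # hypothetical function under test

@pytest.fixture
def tiny_df() -> pd.DataFrame:
    return pd.DataFrame({"units": [1, 2, 3], "price": [10.0, 0.0, 5.0]})

def test_add_revenue_adds_column(tiny_df):
    out = add_revenue(tiny_df)
    assert "revenue" in out.columns
    assert out["revenue"].tolist() == [10.0, 0.0, 15.0]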

Tooling

  • uv venv && uv sync — reproducible env
  • jupyter lab — notebooks
  • pytest — tests
  • ruff check && ruff format
  • mypy src/ — type check (notebooks excluded)

AI behavioral rules

  • Vectorize before suggesting any pandas / numpy for loop
  • Always set random_state on splitters, estimators, and anything stochastic
  • Use Pipeline and ColumnTransformer — don't write fit/transform chains by hand
  • Never modify raw data files — always write derived data to data/processed/
  • Don't suggest pickle for model serialization — use joblib
  • Move reusable code from notebooks to src/ modules; notebooks should call modules
  • Run pytest and ruff check before declaring a task done

Frequently asked

How do I use this pandas / scikit-learn rules file with Cursor?

Pick "Cursor (.cursor/rules/*.mdc)" from the format dropdown above and click Copy. Save it at .cursor/rules/main.mdc in your project root and restart Cursor. The legacy .cursorrules format still works if you're on an older Cursor version — pick that option instead.

Can I use this with Claude Code (CLAUDE.md)?

Yes — pick "Claude Code (CLAUDE.md)" from the format dropdown above and copy. Save the file as CLAUDE.md at your repo root. Claude Code reads it automatically on every session. For monorepos, you can also drop nested CLAUDE.md files in subdirectories — Claude merges them when working in those paths.

Where exactly do I put this file?

It depends on the AI tool. Cursor reads .cursorrules or .cursor/rules/*.mdc at the project root. Claude reads CLAUDE.md at the project root. Copilot reads .github/copilot-instructions.md. The "Save at" path under each format in the dropdown shows the exact location for the format you picked.

Can I customize these pandas / scikit-learn rules for my project?

Yes — that's what the generator is for. Click "Open in generator" above and the wizard loads with this stack's defaults pre-selected. Toggle on or off the conventions you want, then re-export in your AI tool's format.

Will using this rules file slow down my AI tool?

No. Rules files count toward the model's context window but not toward latency in any noticeable way. The file is loaded once per session, not per token. The library files target 250–400 lines, well within every tool's recommended budget.

Should I commit this file to git?

Yes. The rules file is project documentation that benefits every developer using the AI tool. Commit it. The exception is personal-global settings (e.g. ~/.claude/CLAUDE.md) which are user-scoped and stay out of the repo.
