Python Data Science (pandas + numpy)

Notebook discipline, vectorization, and reproducible analysis with pandas.

DevZone Tools · 1,980 copies · Updated Feb 19, 2026 · Python

# CLAUDE.md — Python Data Science (pandas + numpy)

## Notebook discipline

- Notebooks are for exploration. Anything reusable lives in `.py` files importable from the notebook.
- Cells run top-to-bottom. Restart and run all before claiming a result.
- Clear outputs before committing notebooks (`jupyter nbconvert --clear-output --inplace <notebook>.ipynb`), or automate it with `nbstripout`.
- One question per notebook. When a notebook gets to "and also...", split it.

## pandas

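- Always read with explicit dtypes (`dtype={...}`) when you know the schema. Type inference is slower and guesses wrong on edge rows: IDs with leading zeros become integers, and an integer column with a single missing value becomes float.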
- Prefer column operations over `.iterrows()`. If you reach for a row loop, you're losing the vectorization win.
- Use `.loc[]` for label-based indexing and `.iloc[]` for positional. Never chain `[col][row]`.
- `df.assign(new_col=...)` instead of `df['new_col'] = ...` when chaining. It returns a new frame and reads top-to-bottom.
- `df.merge(...)` over `pd.merge(df, other)` — it's the same call but reads better.
- Set `pd.options.mode.copy_on_write = True` in new pandas 2.x code to avoid the SettingWithCopyWarning class of bugs. Copy-on-Write is the default behavior from pandas 3.0 onward.
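
The pandas rules above can be combined in one short pipeline. This is a minimal sketch, not a prescribed layout: the column names, the sample rows, and the `orders`/`customers` frames are hypothetical, and the CSV is read from an in-memory buffer so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical sample data; in real code this would be a file path.
ORDERS_CSV = io.StringIO(
    "order_id,customer_id,quantity,unit_price\n"
    "0001,c1,2,9.50\n"
    "0002,c2,1,20.00\n"
    "0003,c1,5,3.00\n"
)

# Explicit dtypes: order_id stays a string, so the leading zeros survive.
orders = pd.read_csv(
    ORDERS_CSV,
    dtype={
        "order_id": "string",
        "customer_id": "string",
        "quantity": "int64",
        "unit_price": "float64",
    },
)

customers = pd.DataFrame(
    {"customer_id": ["c1", "c2"], "region": ["EU", "US"]}
).astype({"customer_id": "string"})

# Column arithmetic inside .assign() replaces a row loop, and the chain
# reads top to bottom. validate= makes the merge fail loudly if customers
# ever contains a duplicated key.
enriched = (
    orders
    .assign(total=lambda d: d["quantity"] * d["unit_price"])
    .merge(customers, on="customer_id", how="left", validate="many_to_one")
)

# Label-based selection with .loc: a boolean row mask plus column labels.
eu_totals = enriched.loc[enriched["region"] == "EU", ["order_id", "total"]]
```

The `validate="many_to_one"` argument is an optional extra, not something the rules above require; it is included because a silently duplicated join key is a common source of inflated row counts.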

## Schemas

- Validate input frames with `pandera` (or assertions) at the boundary. A 5-line schema catches an hour of debugging.
- Document expected columns in a docstring or a constant — don't make readers grep.
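
When `pandera` is not available, the assertion route might look like the sketch below. The `EXPECTED_COLUMNS` constant, the `validate_users` function, and the column names are all hypothetical; the constant doubles as the documentation of expected columns.

```python
import pandas as pd
from pandas.api import types as pdt

# Hypothetical schema: column name -> dtype check from pandas.api.types.
EXPECTED_COLUMNS = {
    "user_id": pdt.is_integer_dtype,
    "signup_date": pdt.is_datetime64_any_dtype,
    "plan": pdt.is_string_dtype,
}


def validate_users(df: pd.DataFrame) -> pd.DataFrame:
    """Check columns and dtypes at the boundary; return the frame unchanged."""
    missing = set(EXPECTED_COLUMNS) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for col, check in EXPECTED_COLUMNS.items():
        if not check(df[col]):
            raise TypeError(f"{col}: unexpected dtype {df[col].dtype}")
    if df["user_id"].duplicated().any():
        raise ValueError("user_id must be unique")
    return df


users = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "signup_date": pd.to_datetime(["2024-01-05", "2024-02-11", "2024-03-20"]),
        "plan": ["free", "pro", "free"],
    }
)
validated = validate_users(users)
```

Returning the frame lets the check sit inside a method chain, for example via `df.pipe(validate_users)`.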

## numpy

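- Vectorize. A Python loop over array elements is almost always slower and harder to read than the equivalent array expression.
- Be explicit about dtype on creation: `np.zeros(n, dtype=np.float32)`. The float default is `float64` everywhere, but the default integer type has differed across platforms (32-bit on Windows before NumPy 2.0).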
- `np.einsum` for unusual reductions — once you learn it, it replaces dozens of one-off helpers.
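
A small illustration of the three numpy rules, using made-up arrays. The names `scores`, `weights`, `a`, and `b` are hypothetical.

```python
import numpy as np

# Explicit dtypes on creation.
weights = np.array([0.2, 0.3, 0.5], dtype=np.float64)
scores = np.array(
    [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]],
    dtype=np.float64,
)

# Vectorized weighted sum per row: no Python loop over rows.
weighted = scores @ weights

# The same reduction with einsum: multiply along j, then sum over j.
weighted_einsum = np.einsum("ij,j->i", scores, weights)

# A less common reduction: the trace of a @ b, computed without
# materializing the full matrix product.
a = np.arange(6, dtype=np.float64).reshape(2, 3)
b = np.arange(6, dtype=np.float64).reshape(3, 2)
trace_ab = np.einsum("ij,ji->", a, b)
```

For everyday products like the first one, `@` is the clearer choice; `einsum` earns its place on reductions such as the trace, where the index string states exactly which axes are paired and summed.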

## Reproducibility

- Pin random seeds in any non-deterministic step. Use `np.random.default_rng(seed)` — never the legacy `np.random.seed`.
- Save intermediate frames to `parquet`, not CSV — preserves dtypes and is much faster.
- Track versions: a small `pyproject.toml` or `requirements.txt` next to the notebook is enough.
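
One way to apply the seeding rule is to build a `Generator` once and pass it into every function that draws random numbers. The sketch below assumes a hypothetical `bootstrap_mean` helper and an arbitrary `SEED` value.

```python
import numpy as np

SEED = 20240219  # hypothetical; any fixed integer works


def bootstrap_mean(
    values: np.ndarray, n_resamples: int, rng: np.random.Generator
) -> np.ndarray:
    """Return the mean of each bootstrap resample of `values`."""
    # Draw resample indices with replacement, one row per resample.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    return values[idx].mean(axis=1)


values = np.array([2.0, 4.0, 4.0, 5.0, 7.0], dtype=np.float64)

# Two generators built from the same seed produce identical draws.
first = bootstrap_mean(values, 1000, np.random.default_rng(SEED))
second = bootstrap_mean(values, 1000, np.random.default_rng(SEED))
```

Passing `rng` as an argument keeps the randomness local to the call, whereas `np.random.seed` mutates global state that any imported library can also consume.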

## Plotting

- Matplotlib for everything reproducible. Plotly for interactive dashboards.
- Always set `figsize`, `xlabel`, `ylabel`, `title`. A plot without labels is debug output, not a result.
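
A labeled Matplotlib figure along these lines might look as follows. The data is a plain sine curve chosen for illustration, and the figure is written to an in-memory buffer so the sketch leaves no files behind; in a real script a file path would go in its place.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 200)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.set_xlabel("x (radians)")
ax.set_ylabel("sin(x)")
ax.set_title("Sine over one period")
ax.legend()
fig.tight_layout()

# Save as PNG; replace the buffer with a path such as "sine.png" on disk.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
plt.close(fig)
```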

## Don't

- Don't store secrets, API keys, or DB credentials in notebooks. Use a `.env` file with `python-dotenv`, and keep the `.env` out of version control.
- Don't pickle dataframes for long-term storage. Use parquet.
- Don't run analysis on `head()` and ship without re-running on the full data. The shape of small samples lies.
- Don't trust auto-inferred datetime parsing. Pass `format=` to `pd.to_datetime`, and add `utc=True` so mixed or missing time zone offsets end up in one well-defined zone.
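
The datetime rule is easiest to see on ambiguous day/month strings. In this sketch the `raw` values are invented and assumed to be day-first; the format string is what encodes that assumption.

```python
import pandas as pd

raw = pd.Series(["03/04/2024 09:30", "11/12/2024 18:05"])

# Day-first data, stated explicitly. Without format=, pandas would read
# "03/04/2024" month-first, as March 4.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M", utc=True)

# utc=True localizes these naive strings to UTC, so the result is
# timezone-aware and comparable across sources.
first_timestamp = parsed.iloc[0]
```

Here `first_timestamp` is 3 April 2024, 09:30 UTC. A value that does not match the format raises an error instead of being silently reinterpreted.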
