Python Data Science (pandas + numpy)
Notebook discipline, vectorization, and reproducible analysis with pandas.
# CLAUDE.md — Python Data Science (pandas + numpy)
## Notebook discipline
- Notebooks are for exploration. Anything reusable lives in `.py` files importable from the notebook.
- Cells run top-to-bottom. Restart and run all before claiming a result.
- Clear outputs before committing notebooks (`jupyter nbconvert --clear-output --inplace <notebook>.ipynb`), or install `nbstripout` as a git filter so it happens automatically.
- One question per notebook. When a notebook gets to "and also...", split it.
## pandas
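- Always read with explicit dtypes (`dtype={...}`) when you know the schema. Type inference is slower and can guess wrong: IDs with leading zeros get parsed as integers, and a single stray value turns a numeric column into `object`.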
- Prefer column operations over `.iterrows()`. If you reach for a row loop, you're losing the vectorization win.
- Use `.loc[]` for label-based indexing and `.iloc[]` for positional. Never chain `df[col][row]`, especially for assignment: the write may land on a temporary copy instead of the original frame.
- `df.assign(new_col=...)` instead of `df['new_col'] = ...` when chaining. It returns a new frame and reads top-to-bottom.
- `df.merge(...)` over `pd.merge(df, other)` — it's the same call but reads better.
- Set `pd.options.mode.copy_on_write = True` on new code (pandas 2.x; it is the default behavior from pandas 3.0) to avoid the SettingWithCopyWarning class of bugs.
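
The pandas rules above can be sketched in one short pipeline. This is a minimal illustration, not a prescribed layout: the inline CSV, the column names, and `TAX_RATE` are all hypothetical stand-ins.

```python
import io

import pandas as pd

TAX_RATE = 0.2  # hypothetical constant, for illustration only

# Hypothetical inline CSV standing in for a real file on disk.
RAW_CSV = io.StringIO(
    "order_id,customer_id,amount,region\n"
    "1,007,19.99,north\n"
    "2,012,5.00,south\n"
    "3,007,42.50,north\n"
)

# Explicit dtypes: customer_id is read as a string, so leading zeros survive.
orders = pd.read_csv(
    RAW_CSV,
    dtype={
        "order_id": "int64",
        "customer_id": "string",
        "amount": "float64",
        "region": "category",
    },
)

customers = pd.DataFrame(
    {
        "customer_id": pd.array(["007", "012"], dtype="string"),
        "tier": ["gold", "silver"],
    }
)

# Method chain: merge, then derive a column with assign (returns a new frame).
enriched = (
    orders
    .merge(customers, on="customer_id", how="left", validate="many_to_one")
    .assign(amount_with_tax=lambda d: d["amount"] * (1 + TAX_RATE))
)

# One .loc call with a boolean row mask and a column list, no chained indexing.
north_amounts = enriched.loc[
    enriched["region"] == "north", ["order_id", "amount_with_tax"]
]
```

Passing `validate="many_to_one"` makes the merge raise if `customers` ever contains duplicate keys, which would otherwise silently multiply rows.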
## Schemas
- Validate input frames with `pandera` (or assertions) at the boundary. A 5-line schema catches an hour of debugging.
- Document expected columns in a docstring or a constant — don't make readers grep.
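
As a sketch of boundary validation, the example below uses the plain-assertion route rather than `pandera`, so it has no extra dependency. `EXPECTED_DTYPES`, `validate_orders`, and the column names are hypothetical; a `pandera` schema would express the same checks declaratively.

```python
import pandas as pd

# Hypothetical schema constant: documents the expected columns in one place.
EXPECTED_DTYPES = {
    "order_id": "int64",
    "amount": "float64",
}


def validate_orders(df: pd.DataFrame) -> pd.DataFrame:
    """Check an incoming frame against EXPECTED_DTYPES and return it unchanged."""
    missing = set(EXPECTED_DTYPES) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for column, expected in EXPECTED_DTYPES.items():
        actual = str(df[column].dtype)
        if actual != expected:
            raise TypeError(f"{column}: expected {expected}, got {actual}")
    if (df["amount"] < 0).any():
        raise ValueError("amount must be non-negative")
    return df


# Validate once where data enters the analysis; downstream code can trust it.
orders = validate_orders(pd.DataFrame({"order_id": [1, 2], "amount": [9.5, 0.0]}))
```

Returning the frame lets the check sit inside a method chain via `df.pipe(validate_orders)`.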
## numpy
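- Vectorize. A Python loop over a numpy array is almost always slower and harder to read than the equivalent array expression.
- Be explicit about dtype on creation: `np.zeros(n, dtype=np.float32)`. The float default is `float64` everywhere, but the default integer type has differed across platforms (32-bit on Windows before NumPy 2.0).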
- `np.einsum` for unusual reductions — once you learn it, it replaces dozens of one-off helpers.
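
A small sketch of the three numpy rules above, with made-up array contents: an array expression in place of a loop, explicit dtypes at creation, and `np.einsum` for a row-wise dot product.

```python
import numpy as np

# Vectorized clamp: replace negatives with 0.0 without a Python loop.
values = np.array([1.5, -2.0, 3.25, -0.5])
clipped = np.where(values < 0, 0.0, values)

# Explicit dtypes at creation, so the result does not depend on defaults.
counts = np.zeros(5, dtype=np.int64)
weights = np.zeros(5, dtype=np.float32)

# einsum: for each row i, multiply a and b elementwise and sum over axis j.
a = np.arange(12, dtype=np.float64).reshape(4, 3)
b = np.ones((4, 3), dtype=np.float64)
row_dots = np.einsum("ij,ij->i", a, b)

# The same reduction written with broadcasting, as a cross-check.
assert np.allclose(row_dots, (a * b).sum(axis=1))
```

The subscript string reads as a specification: indices that appear on the input side but not after `->` are summed over.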
## Reproducibility
- Pin random seeds in any non-deterministic step. Use `np.random.default_rng(seed)` and pass the generator to the code that needs it, never the legacy `np.random.seed`, which sets hidden global state.
- Save intermediate frames to `parquet`, not CSV: it preserves dtypes and reads and writes much faster (requires `pyarrow` or `fastparquet`).
- Track versions: a small `pyproject.toml` or `requirements.txt` next to the notebook is enough.
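
One way to apply the seeding and storage rules above is sketched here. `SEED`, `simulate_scores`, and `roundtrip_parquet` are hypothetical names; the parquet helper is defined but not called, since it assumes `pyarrow` or `fastparquet` is installed.

```python
from pathlib import Path

import numpy as np
import pandas as pd

SEED = 42  # hypothetical project-wide seed


def simulate_scores(n: int, rng: np.random.Generator) -> pd.DataFrame:
    """Draw n synthetic rows from the generator the caller passes in."""
    return pd.DataFrame(
        {
            "group": rng.choice(["a", "b"], size=n),
            "score": rng.normal(loc=0.0, scale=1.0, size=n),
        }
    )


def roundtrip_parquet(df: pd.DataFrame, path: Path) -> pd.DataFrame:
    """Write df to parquet and read it back (needs pyarrow or fastparquet)."""
    df.to_parquet(path)
    return pd.read_parquet(path)


# Two generators built from the same seed produce identical frames.
first = simulate_scores(100, np.random.default_rng(SEED))
second = simulate_scores(100, np.random.default_rng(SEED))
assert first.equals(second)
```

Taking the generator as an argument keeps the randomness visible at the call site, so a reader can see which steps depend on the seed.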
## Plotting
- Matplotlib for everything reproducible. Plotly for interactive dashboards.
- Always set `figsize`, `xlabel`, `ylabel`, `title`. A plot without labels is debug output, not a result.
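
A minimal labeled figure following the checklist above might look like this. The data and label text are placeholders, and the `Agg` backend is an assumption made so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so this runs headless
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.sin(x)

# figsize is set up front so the saved image has a known size.
fig, ax = plt.subplots(figsize=(8, 4.5))
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x (radians)")
ax.set_ylabel("sin(x)")
ax.set_title("Sine over one period")
ax.legend()
fig.tight_layout()
# fig.savefig("sine.png", dpi=150)  # hypothetical output path
```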
## Don't
- Don't store secrets, API keys, or DB credentials in notebooks. Use a `.env` and `python-dotenv`.
- Don't pickle dataframes for long-term storage. Use parquet.
- Don't run analysis on `head()` and ship without re-running on the full data. The shape of small samples lies.
- Don't trust auto-inferred datetime parsing. Specify `format=` so day/month order is never guessed, and pass `utc=True` to `pd.to_datetime(...)` when inputs carry timezone offsets.
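
To illustrate the datetime rule, the sketch below parses two made-up inputs: ambiguous day-first dates with an explicit `format=`, and ISO timestamps with mixed offsets normalized through `utc=True`.

```python
import pandas as pd

# "03/04/2024" is ambiguous: 3 April (day first) or 4 March (month first).
raw_dates = pd.Series(["03/04/2024", "12/04/2024"])

# Stating the format removes the guess: both values are parsed day first.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

# Timestamps with different UTC offsets, converted to one UTC-aware column.
raw_stamps = pd.Series(
    ["2024-04-03T10:00:00+02:00", "2024-04-03T09:30:00-05:00"]
)
utc_stamps = pd.to_datetime(raw_stamps, utc=True)
```

With an explicit `format=`, a value that does not match raises an error instead of being silently reinterpreted.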