Python Data Science (pandas / numpy / scikit-learn)
Project context
This is a data-science codebase combining notebooks, scripts, and reusable library code. We optimize for reproducibility, clarity, and reasonable performance — not for milliseconds. Notebooks for exploration; modules for production paths.
Stack
- Python 3.12+
- pandas 2.2+
- numpy 1.26+
- scikit-learn 1.4+
- matplotlib + seaborn for plotting (or plotly for interactive)
- Jupyter / JupyterLab for notebooks
- uv for environment + package management
- pytest for tests
- Ruff for lint + format
Folder structure
```
src/
  data/         — data loaders (raw → cleaned)
  features/     — feature engineering
  models/       — model definitions, training, inference
  evaluation/   — metrics, plotting helpers
  utils/
notebooks/
  01-explore.ipynb
  02-baseline.ipynb
  03-improvements.ipynb
data/
  raw/          — never overwritten; usually .gitignored
  interim/
  processed/
tests/
```
Notebooks are for exploration and reporting. Reusable code lives in `src/` and gets imported into notebooks (`%load_ext autoreload`).
Reproducibility rules
- Pin the Python version (`.python-version`)
- Pin every dependency (`uv lock` → `uv.lock` committed)
- Set seeds at the top of every notebook and training script:

  ```python
  np.random.seed(42)
  random.seed(42)
  torch.manual_seed(42)  # if using PyTorch
  ```

- Save model artifacts with metadata: training data hash, hyperparameters, metrics, git SHA (see the sketch after this list)
- Never commit data files over a few MB — use git-lfs or external storage
- Use environment-specific config (`.env`) for paths and secrets, never hard-code
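A minimal sketch of the artifact rule, assuming `joblib` for serialization; `save_artifact` and the exact metadata keys are illustrative, not an existing helper in this repo:

```python
import hashlib
import json
import subprocess
from pathlib import Path

import joblib


def save_artifact(pipe, train_df, params: dict, metrics: dict, out_dir: Path) -> None:
    """Hypothetical helper: persist a fitted pipeline plus the metadata
    needed to reproduce it (data hash, hyperparameters, metrics, git SHA)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    joblib.dump(pipe, out_dir / "model.joblib")
    meta = {
        "train_data_sha256": hashlib.sha256(
            train_df.to_csv(index=False).encode()
        ).hexdigest(),
        "params": params,
        "metrics": metrics,
        "git_sha": subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip(),
    }
    (out_dir / "metadata.json").write_text(json.dumps(meta, indent=2))
```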
pandas idioms
- Use `.loc[...]` for assignment — not chained indexing (`df[mask]['col'] = ...` will warn or silently fail)
- `pd.NA` over `None` for missing values in pandas 2+
- `pd.read_parquet` over `pd.read_csv` for intermediate storage — faster, type-stable
- `assign` for chaining — easier to reorder than nested `.copy()` + assignment
- `pipe()` for custom functions — keeps method chaining readable
- Avoid `.iterrows()` — vectorize, or use `df.apply(axis=1)` only as a last resort
- `.merge(...)` over `pd.merge(...)` for clarity
- Set `dtype=` explicitly on `read_csv` when you can — pandas guesses are fragile
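A short sketch tying these idioms together; the file path, column names, and `add_amount_eur` are invented for illustration:

```python
import pandas as pd

# Explicit dtypes on read, instead of letting pandas guess.
df = pd.read_csv(
    "data/raw/orders.csv",
    dtype={"order_id": "string", "amount": "Float64", "region": "category"},
)

# .loc for assignment; the mask is made NA-safe with fillna(False).
bad = (df["amount"] < 0).fillna(False)
df.loc[bad, "amount"] = pd.NA


def add_amount_eur(frame: pd.DataFrame, rate: float) -> pd.DataFrame:
    return frame.assign(amount_eur=frame["amount"] * rate)


# assign + pipe keep the chain flat and easy to reorder.
clean = df.dropna(subset=["amount"]).pipe(add_amount_eur, rate=0.92)

# Parquet for intermediate storage: faster and type-stable.
clean.to_parquet("data/interim/orders.parquet")
```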
numpy idioms
- Vectorize. If you're writing a Python `for` loop over an array, stop and find the numpy primitive.
- Use `np.float32` for ML pipelines unless you need `float64` precision — half the memory
- `np.einsum` for clear reductions; broadcasting for clear elementwise math
- Don't call `np.array(...)` repeatedly inside a hot loop — preallocate
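The same rules in numpy terms; shapes and values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal((10_000, 64)).astype(np.float32)  # float32: half the memory
w = rng.standard_normal(64).astype(np.float32)

# Vectorized: one matmul instead of a Python loop over rows.
scores = x @ w

# einsum spells out the same reduction explicitly: row-wise dot products.
scores_einsum = np.einsum("ij,j->i", x, w)

# Preallocate once instead of building arrays inside a hot loop.
out = np.empty(len(x), dtype=np.float32)
np.maximum(scores, 0.0, out=out)  # in-place write, no per-call allocation
```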
scikit-learn idioms
- Use `Pipeline` — not raw fit/transform chains. Pipelines integrate with `GridSearchCV` and prevent data leakage.
- `ColumnTransformer` for mixed types — different scaling for numeric, encoding for categorical
- Set `random_state` on every estimator and splitter
- Use `cross_val_score` / `cross_validate` instead of a single train/test split
- Save trained pipelines with `joblib.dump` — not `pickle` directly
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(random_state=42, max_iter=1000)),
])
pipe.fit(X_train, y_train)
```
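For mixed dtypes the same pattern extends with `ColumnTransformer`; the toy frame and column names below are invented:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with numeric and categorical features.
X = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [30_000.0, 52_000.0, 88_000.0, 61_000.0],
    "region": ["north", "south", "south", "west"],
})
y = [0, 0, 1, 1]

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

pipe = Pipeline([
    ("pre", pre),
    ("clf", LogisticRegression(random_state=42, max_iter=1000)),
])
pipe.fit(X, y)  # preprocessing is fit inside the pipeline, so CV refits it per fold
```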
Notebook discipline
- One concern per notebook — don't pile up exploratory cells indefinitely
- Number notebooks in execution order (`01-`, `02-`, `03-`)
- Restart-and-run-all before sharing — verify the notebook runs top to bottom
- Move repeated code to `src/` immediately — notebooks are ephemeral, modules are durable
- Use `%load_ext autoreload` + `%autoreload 2` so module edits hot-reload in the notebook (see the setup cell below)
- Don't commit large notebook outputs (clear before commit, or use `nbstripout`)
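A typical first cell that implements the autoreload and seeding rules; the `src.data` import is a placeholder for whatever module the notebook actually uses:

```python
# First notebook cell: hot-reload src/ modules and fix seeds.
%load_ext autoreload
%autoreload 2

import random

import numpy as np

random.seed(42)
np.random.seed(42)

from src.data import loaders  # placeholder: edits to src/ now reload automatically
```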
Plotting
- Matplotlib for static plots; seaborn for statistical defaults
- Always label axes and title
- Use `figsize=(width, height)` deliberately — defaults are too small
- Save figures with `dpi=150` minimum for sharing; SVG / PDF for publication
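A minimal example of these rules on toy data (the save path assumes a `figures/` directory exists):

```python
import matplotlib.pyplot as plt
import seaborn as sns

amounts = [12.0, 30.5, 18.2, 44.9, 27.3, 33.1, 9.8]  # toy values

fig, ax = plt.subplots(figsize=(8, 5))  # deliberate size, not the tiny default
sns.histplot(amounts, ax=ax)
ax.set_xlabel("Order amount (EUR)")
ax.set_ylabel("Count")
ax.set_title("Order amount distribution")
fig.savefig("figures/amounts.png", dpi=150, bbox_inches="tight")  # >=150 dpi for sharing
fig.savefig("figures/amounts.svg")  # vector format for publication
```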
Patterns to avoid
- Chained indexing for assignment: `df[mask]['col'] = x`
- `.iterrows()` in production paths
- Silent dtype changes — `pd.concat` of mixed dtypes will upcast
- Hard-coded paths — use `pathlib.Path` and a config
- Modifying `df` in place across cells — name new variables, easier to debug
- Mixing train and test in `fit_transform` — fit on train, transform on test, always (sketch below)
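The last point is the most common leak, so here is the correct split-then-fit order on toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(42).standard_normal((100, 3))  # toy features

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics

# Wrong: scaler.fit_transform(X_test) lets test statistics leak into preprocessing.
```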
Testing
- pytest for any function in `src/`
- Use small fixture DataFrames (10 rows) for tests — fast, deterministic
- Test the interface of pipelines, not the underlying scikit-learn behavior
- For training scripts, write at least one smoke test that runs end-to-end on tiny data
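A sketch of the fixture pattern; `clean_orders` and `src.data.cleaning` are hypothetical names standing in for real `src/` code:

```python
import pandas as pd
import pytest


@pytest.fixture
def tiny_orders() -> pd.DataFrame:
    # 10 rows: fast and deterministic
    return pd.DataFrame({
        "order_id": list(range(10)),
        "amount": [10.0, 20.0, None, 40.0, 5.0, 12.5, 8.0, 3.3, 99.0, 1.0],
    })


def test_clean_orders_drops_missing_amounts(tiny_orders):
    from src.data.cleaning import clean_orders  # hypothetical module under test

    result = clean_orders(tiny_orders)
    assert result["amount"].notna().all()
    assert len(result) == 9
```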
Tooling
- `uv venv && uv sync` — reproducible env
- `jupyter lab` — notebooks
- `pytest` — tests
- `ruff check && ruff format` — lint + format
- `mypy src/` — type check (notebooks excluded)
AI behavioral rules
- Vectorize before suggesting any pandas / numpy `for` loop
- Always set `random_state` on splitters, estimators, and anything stochastic
- Use `Pipeline` and `ColumnTransformer` — don't write fit/transform chains by hand
- Never modify raw data files — always write derived data to `data/processed/`
- Don't suggest `pickle` for model serialization — use `joblib`
- Move reusable code from notebooks to `src/` modules; notebooks should call modules
- Run `pytest` and `ruff check` before declaring a task done