The Python data science stack has grown dramatically over the past decade, and in 2026 it has largely stabilized around a core set of tools. This guide covers what you actually need, what the newer alternatives offer, and how the stack differs between a startup and an enterprise environment.
The Core Stack
NumPy is the numerical foundation of the Python data science ecosystem. It provides the n-dimensional array (ndarray) that most other libraries either build on or interoperate with. You will use NumPy directly for linear algebra operations, array manipulation, and mathematical functions, but mostly you encounter it as the layer underneath pandas and scikit-learn, and as the array format that PyTorch tensors convert to and from.
import numpy as np
# Array creation
arr = np.array([1, 2, 3, 4, 5])
matrix = np.zeros((100, 50))
# Vectorized operations (fast)
result = np.sqrt(arr ** 2 + 1)
# Linear algebra
A = np.random.randn(100, 10)
eigenvalues = np.linalg.eigvals(A.T @ A)
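As a further illustration of the vectorized style, the sketch below standardizes each column of a matrix using broadcasting. The array shape, seed, and variable names are made up for the example.

```python
import numpy as np

# Hypothetical data: 100 samples, 3 features on different scales
rng = np.random.default_rng(seed=0)
X = rng.normal(loc=[10.0, 0.0, -5.0], scale=[2.0, 1.0, 0.5], size=(100, 3))

# Column-wise statistics have shape (3,); broadcasting stretches them
# across all 100 rows without an explicit Python loop
col_means = X.mean(axis=0)
col_stds = X.std(axis=0)
X_standardized = (X - col_means) / col_stds

# Each column now has mean close to 0 and standard deviation close to 1
print(X_standardized.mean(axis=0).round(6))
print(X_standardized.std(axis=0).round(6))
```

This is the same arithmetic that scikit-learn's StandardScaler performs on training data, which is one reason NumPy habits carry over to the rest of the stack.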
Pandas provides DataFrame-based tabular data manipulation. It is the workhorse for data cleaning, transformation, and analysis. See the dedicated pandas guide for full coverage.
Scikit-learn is the standard library for traditional machine learning algorithms. Classification, regression, clustering, dimensionality reduction, model evaluation, and preprocessing pipelines are all here.
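from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Example dataset so the snippet runs as-is; substitute your own X and y
X, y = load_iris(return_X_y=True)
# A minimal ML pipeline: scale the features, then fit a random forest
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])
# Hold out 20% of the rows for evaluation; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
pipeline.fit(X_train, y_train)
print(classification_report(y_test, pipeline.predict(X_test)))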
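The pipeline above is scored on a single train/test split, which can give a noisy estimate on small datasets. One common alternative is k-fold cross-validation over the whole pipeline, so the scaler is refit inside each fold. The sketch below assumes the built-in iris dataset as a stand-in for your own features and labels; the fold count of 5 is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption): replace with your own features and labels
X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])

# cross_val_score clones and refits the full pipeline on each fold,
# so preprocessing never sees the held-out rows
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Passing the pipeline rather than a bare classifier is the design choice that matters here: it keeps preprocessing inside the evaluation loop.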
Visualization: Matplotlib, Seaborn, and Plotly
The visualization stack is covered in detail in the visualization guide. The short version: seaborn for statistical plots during EDA, plotly for any chart you will show to non-technical stakeholders.
Notebooks: When Each Tool Is Appropriate
Jupyter Notebook and JupyterLab are the standard for exploratory data analysis and communicating findings. They support inline visualizations, markdown text, and interactive widgets. The main weaknesses are that cells can be run in any order, so the notebook's state may not match what is on screen, and that notebook files are awkward to diff and merge under version control.
Google Colab is Jupyter in the cloud with free GPU access. Use it when you need a GPU for model training but do not have local GPU hardware. The free tier has limited runtime and memory; Colab Pro is worth the cost for regular use.
Marimo is a newer reactive notebook that eliminates the cell execution order problem. Cells automatically re-run when their dependencies change. It exports to Python scripts cleanly and has better version control support. Worth trying if you work in teams where notebook state confusion causes bugs.
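To make the idea of dependency-driven re-execution concrete, here is a toy sketch of how a reactive notebook could decide which cells are stale after an edit. It is not Marimo's implementation; the function names and the four-cell notebook are hypothetical, and the analysis is deliberately simplistic (it treats every variable name in a cell as global).

```python
import ast


def names_defined_and_used(source: str) -> tuple[set[str], set[str]]:
    """Return (names assigned, names read) anywhere in a cell's source."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                used.add(node.id)
    return defined, used


def cells_to_rerun(cells: dict[str, str], changed: str) -> list[str]:
    """Return the cells that depend, directly or indirectly, on `changed`."""
    info = {name: names_defined_and_used(src) for name, src in cells.items()}
    stale_names = set(info[changed][0])
    stale_cells: set[str] = set()
    # Repeat until no new cell becomes stale (handles chains of dependencies)
    progress = True
    while progress:
        progress = False
        for name, (defined, used) in info.items():
            if name == changed or name in stale_cells:
                continue
            if used & stale_names:
                stale_cells.add(name)
                stale_names |= defined
                progress = True
    # Report stale cells in notebook order
    return [name for name in cells if name in stale_cells]


# Hypothetical notebook: cell name -> cell source
notebook = {
    "load": "raw = [3, 1, 2]",
    "clean": "data = sorted(raw)",
    "stats": "total = sum(data)",
    "title": "heading = 'Report'",
}
print(cells_to_rerun(notebook, changed="load"))  # → ['clean', 'stats']
```

Editing the "load" cell marks "clean" and "stats" as stale while leaving "title" alone, which is the behavior that removes the hidden-state problem of manually ordered execution.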
VS Code with the Jupyter extension works well if you prefer an IDE environment. It provides better refactoring tools than classic Jupyter while keeping notebook interactivity.
Environment Management
The environment management story for data science is messy. Here is the pragmatic guidance:
Conda is still common in data science because it handles non-Python dependencies (CUDA, BLAS, HDF5) better than pip. The Miniforge (community) or Miniconda distribution is preferred over full Anaconda.
venv + pip works fine for projects without complex binary dependencies. Use pyproject.toml or requirements.txt to specify dependencies.
Poetry provides deterministic dependency resolution with a lockfile, similar to npm. Good for production data science applications and packages. Heavier than venv for quick experimentation.
The practical approach: conda environments for experimentation (handles GPU dependencies easily), Poetry or pip-tools for production deployments.
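The difference between a loose dependency list and a locked one can be checked mechanically. The sketch below is a hypothetical helper (not part of pip, Poetry, or pip-tools) that flags requirement lines lacking an `==` pin. It handles only simple requirement lines, and the package versions in the sample text are illustrative.

```python
def unpinned_requirements(requirements_text: str) -> list[str]:
    """Return requirement lines that are not pinned with '=='."""
    loose = []
    for raw_line in requirements_text.splitlines():
        # Drop inline comments and surrounding whitespace
        line = raw_line.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r base.txt"
        if not line or line.startswith("-"):
            continue
        # Ignore environment markers like '; python_version >= "3.9"'
        requirement = line.split(";", 1)[0].strip()
        if "==" not in requirement:
            loose.append(requirement)
    return loose


# Illustrative requirements file contents
example = """
numpy==2.1.0
pandas>=2.0        # range, not a pin
scikit-learn
duckdb==1.1.0 ; python_version >= "3.9"
"""
print(unpinned_requirements(example))  # → ['pandas>=2.0', 'scikit-learn']
```

A check like this could run in CI for production deployments, while experimentation environments stay loosely specified.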
Polars: The Faster Pandas Alternative
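Polars is a DataFrame library written in Rust that is often several times faster than pandas on large datasets (5-20x is a commonly quoted range, though the gain depends heavily on the workload). Alongside its eager API, it offers a lazy execution mode (like Spark) that optimizes the full query before executing it.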
import polars as pl
# Polars syntax is similar to pandas but not identical
df = pl.read_csv("large_file.csv")
result = (
    df
    .filter(pl.col("age") > 30)
    .group_by("department")
    .agg(
        pl.col("salary").mean().alias("avg_salary"),
        pl.len().alias("headcount"),
    )
    .sort("avg_salary", descending=True)
)
Polars is worth learning if you regularly work with datasets over 500MB. The API is slightly different from pandas but the concepts transfer. It does not have the same breadth of ecosystem integration yet (some ML libraries expect NumPy arrays, not Polars Series), but this is improving.
DuckDB: In-Process SQL Analytics
DuckDB is an in-process analytical database (like SQLite, but for analytics instead of OLTP). It runs SQL directly on CSV, Parquet, or JSON files with no server required, and it often outperforms pandas on aggregation and join operations.
import duckdb
import pandas as pd
# Query a CSV file directly with SQL
result = duckdb.query("""
SELECT department, AVG(salary) as avg_salary, COUNT(*) as headcount
FROM 'employees.csv'
GROUP BY department
ORDER BY avg_salary DESC
""").df() # Returns a pandas DataFrame
# Query a pandas DataFrame with SQL
df = pd.read_csv("employees.csv")
result = duckdb.query("SELECT * FROM df WHERE salary > 100000").df()
DuckDB is increasingly used as a replacement for complex pandas pipelines. If you can express your transformation in SQL and your data fits in memory, DuckDB is likely faster and more readable than pandas.
GPU Acceleration: RAPIDS cuDF
For users with NVIDIA GPUs, RAPIDS cuDF provides a GPU-accelerated pandas-compatible DataFrame. The API is largely identical to pandas but operations run on GPU memory.
import cudf # NVIDIA GPU required
# Almost identical to pandas
df = cudf.read_csv("large_file.csv")
result = df.groupby("category")["value"].mean()
This is relevant for very large datasets (10GB+) where CPU-based processing is the bottleneck. At smaller scales, the overhead of GPU memory transfer is not worth it.
The Startup Stack vs Enterprise Stack
At a startup, the typical data scientist uses: pandas, scikit-learn, and plotly for day-to-day work; Jupyter or Marimo for exploration; DuckDB or Polars when pandas is too slow; Google Colab for GPU training; GitHub for version control; dbt for analytics transformations.
At an enterprise, the stack adds: Spark for distributed processing of very large datasets; Snowflake, BigQuery, or Redshift as the data warehouse; MLflow or SageMaker for experiment tracking and model registry; Airflow or Prefect for pipeline orchestration; feature stores for shared feature management.
The difference is not capability but scale. Do not introduce Spark or an enterprise feature store until you have a problem that the startup stack cannot solve.
Keep Reading
- Pandas Guide for Developers — deep dive into the most essential tool
- Data Pipeline Guide — orchestrating the tools into production workflows
- Machine Learning Complete Guide for Software Developers — applying this stack to ML