LLMs are genuinely useful for data analysis, but not in the way many people initially expect. They cannot reliably perform arithmetic on large datasets stored in their context. What they can do is write accurate SQL queries, pandas code, and visualization scripts, which real computing infrastructure then executes. The correct mental model: LLMs write the code, computers run it.
What LLMs Can Do With Data
Write SQL Queries From Natural Language
This is one of the highest-value LLM data applications. Given an accurate schema description and a natural-language question, models such as GPT-4o and Claude 3.5 Sonnet generally produce correct SQL for moderately complex queries.
Schema: orders(id, customer_id, amount, created_at, status)
        customers(id, name, email, country)
Question: What is the average order value by country for orders in the last 90 days,
          only for countries with more than 100 orders?
Generated SQL:
-- PostgreSQL syntax
SELECT c.country,
       AVG(o.amount) AS avg_order_value,
       COUNT(*) AS order_count
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at >= NOW() - INTERVAL '90 days'
  AND o.status != 'cancelled'  -- filter the model assumed; the question did not ask for it
GROUP BY c.country
HAVING COUNT(*) > 100
ORDER BY avg_order_value DESC;
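The LLM writes this. Your actual database executes it against real data, so the arithmetic is done by the database engine and the LLM never touches the raw data. If the query is right, the results are right, which is why the query is the thing to review.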
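That division of labor can be sketched with Python's built-in sqlite3 module standing in for the real database. Everything here is an assumption made for illustration: the in-memory database, the invented rows, the date filter rewritten in SQLite syntax (the query above uses PostgreSQL interval arithmetic), and the order-count threshold passed as a parameter and lowered to 1 so the toy data returns a row.

```python
import sqlite3

# Hypothetical stand-in for a production database: an in-memory SQLite instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT, country TEXT);
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL,
        created_at TEXT, status TEXT
    );
    INSERT INTO customers VALUES
        (1, 'Ana', 'ana@example.com', 'BR'),
        (2, 'Ben', 'ben@example.com', 'US');
    INSERT INTO orders VALUES
        (1, 1, 100.0, datetime('now', '-5 days'), 'paid'),
        (2, 1, 300.0, datetime('now', '-10 days'), 'paid'),
        (3, 2, 50.0, datetime('now', '-3 days'), 'paid'),
        (4, 2, 70.0, datetime('now', '-200 days'), 'paid');
""")

# The model-written query, adapted to SQLite date syntax.
# The "?" placeholder receives the minimum order count at execution time.
generated_sql = """
    SELECT c.country, AVG(o.amount) AS avg_order_value, COUNT(*) AS order_count
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    WHERE o.created_at >= datetime('now', '-90 days')
      AND o.status != 'cancelled'
    GROUP BY c.country
    HAVING COUNT(*) > ?
    ORDER BY avg_order_value DESC
"""

# The database engine, not the model, does the filtering and averaging.
rows = conn.execute(generated_sql, (1,)).fetchall()
print(rows)
```

Only the query text came from the model; the numbers in `rows` were computed by SQLite from the stored data.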
Generate pandas Code
For Python-based data work, LLMs can write pandas transformation pipelines, cleaning scripts, and analysis code that runs in your actual environment:
# LLM-generated pandas code for calculating customer lifetime value cohorts
import pandas as pd

# Assumes df already holds one row per order with the columns
# customer_id, amount, and first_purchase_date.
df["first_purchase_month"] = pd.to_datetime(df["first_purchase_date"]).dt.to_period("M")
ltv_by_cohort = (
    df.groupby("first_purchase_month")
    .agg(
        customer_count=("customer_id", "nunique"),
        total_revenue=("amount", "sum"),
    )
    # Average LTV = cohort revenue divided by distinct customers in the cohort
    .assign(avg_ltv=lambda d: d["total_revenue"] / d["customer_count"])
    .reset_index()
)
The model generates this code. You execute it in your environment. The calculation runs on your actual data.
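One way to build trust in generated code like this is to run it first on a tiny hand-made frame where the answer is known. The rows below are invented. The sketch computes avg_ltv as cohort revenue divided by distinct customers, which is the quantity the code above is aiming for.

```python
import pandas as pd

# Invented sample: one row per order; two customers in January, one in February.
df = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "amount": [100.0, 50.0, 30.0, 80.0],
    "first_purchase_date": ["2024-01-05", "2024-01-05", "2024-01-20", "2024-02-02"],
})

df["first_purchase_month"] = pd.to_datetime(df["first_purchase_date"]).dt.to_period("M")
ltv_by_cohort = (
    df.groupby("first_purchase_month")
    .agg(
        customer_count=("customer_id", "nunique"),
        total_revenue=("amount", "sum"),
    )
    .assign(avg_ltv=lambda d: d["total_revenue"] / d["customer_count"])
    .reset_index()
)

# Hand-checked expectations: January = (100 + 50 + 30) / 2 customers, February = 80 / 1.
expected_avg_ltv = [90.0, 80.0]
assert ltv_by_cohort["avg_ltv"].tolist() == expected_avg_ltv
```

If the assertion holds on data you can verify by hand, you have some evidence the same code is doing what you intended on the full dataset.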
Explain What a Dataset Means
Paste a sample of your data (headers + first 20 rows) and ask the model to explain what you are looking at, spot potential data quality issues, suggest analysis directions, or explain what a particular column likely represents. This is high-value for onboarding to unfamiliar datasets.
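Preparing that sample can itself be scripted. The following is a minimal sketch using only the standard library; the helper name `build_sample_prompt`, the prompt wording, and the example data are all hypothetical.

```python
import csv
import io
from itertools import islice

def build_sample_prompt(csv_text: str, n_rows: int = 20) -> str:
    """Return a prompt containing the header and the first n_rows data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sample = list(islice(reader, n_rows))

    # Re-serialize only the header and the sampled rows.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(sample)

    return (
        f"Here are the headers and the first {len(sample)} rows of a dataset:\n\n"
        f"{out.getvalue()}\n"
        "Explain what each column likely represents and list any "
        "data quality issues you notice."
    )

# Invented example data: 50 rows, of which only the first 20 reach the prompt.
raw = "order_id,amount,country\n" + "\n".join(f"{i},{i * 1.5},US" for i in range(50))
prompt = build_sample_prompt(raw)
print(prompt)
```

Sending a fixed-size sample keeps the prompt small no matter how large the underlying file is.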
Create Visualization Code
LLMs write matplotlib, plotly, and seaborn code from a description of what you want to see. "Show me a heatmap of monthly revenue by product category" becomes working Python code in seconds.
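For the heatmap request, the generated script might look roughly like the sketch below. It assumes pandas and matplotlib are installed, uses invented sample data, and the column names (`month`, `category`, `revenue`) and output filename are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample data; a real script would load this from a file or database.
sales = pd.DataFrame({
    "month": ["2024-01", "2024-01", "2024-02", "2024-02", "2024-02"],
    "category": ["Books", "Toys", "Books", "Toys", "Toys"],
    "revenue": [120.0, 80.0, 150.0, 60.0, 40.0],
})

# Rows = product category, columns = month, cells = summed revenue.
pivot = sales.pivot_table(index="category", columns="month",
                          values="revenue", aggfunc="sum", fill_value=0)

fig, ax = plt.subplots(figsize=(6, 3))
image = ax.imshow(pivot.to_numpy(), cmap="viridis", aspect="auto")
ax.set_xticks(range(len(pivot.columns)))
ax.set_xticklabels(pivot.columns)
ax.set_yticks(range(len(pivot.index)))
ax.set_yticklabels(pivot.index)
ax.set_title("Monthly revenue by product category")
fig.colorbar(image, ax=ax, label="Revenue")
fig.savefig("revenue_heatmap.png", bbox_inches="tight")
```

The aggregation happens in `pivot_table`, so the plotted numbers come from pandas and not from the model.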
Spot Anomalies in Small Tables
For tables small enough to fit in context (under ~100 rows), LLMs can identify patterns, flag outliers, and spot data quality issues. For example, you might paste a week of daily metrics and ask, "Does anything look unusual?"
What LLMs Cannot Do Reliably
Perform Calculations on Large Datasets
If you paste 10,000 rows of data into a context window and ask "what is the total revenue?", the model will attempt to sum the values but will likely be wrong. LLMs are not calculators. They predict a plausible-looking number token by token; they do not actually iterate through the values and add them up.
This is not an edge case. Arithmetic over large in-context datasets is a known, fundamental weakness of current LLMs. Always use actual computational tools for this.
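For the total-revenue question, a few lines of ordinary code give an exact answer. This is a minimal sketch using only the standard library. It assumes a CSV with a hypothetical `revenue` column, and the generated rows exist only to make the example self-contained.

```python
import csv
import io
from decimal import Decimal

def total_revenue(csv_file, column: str = "revenue") -> Decimal:
    """Sum one numeric column exactly, streaming row by row."""
    reader = csv.DictReader(csv_file)
    return sum((Decimal(row[column]) for row in reader), Decimal("0"))

# Invented data: 10,000 rows, each worth 19.99.
lines = ["order_id,revenue"] + [f"{i},19.99" for i in range(10_000)]
total = total_revenue(io.StringIO("\n".join(lines)))
print(total)  # 199900.00
```

Decimal is used here so that currency values add up without binary floating-point rounding. With a real file, pass an open file object in place of the StringIO.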
Remember Data Accurately Across a Long Conversation
If you share a dataset at the start of a conversation and then ask about specific rows 20 turns later, the model may confuse values, hallucinate rows that were not there, or fail to correctly reference data from early in the conversation. Treat each data reference as a fresh lookup, not a persistent in-memory store.
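In practice, a fresh lookup means re-reading the relevant rows from the source of truth and placing them in the current message, instead of relying on what the model saw earlier. The sketch below illustrates the idea. The `orders` list, the helper names, and the prompt wording are all hypothetical.

```python
import json

# Hypothetical source of truth; in a real system this would be a database or file.
orders = [
    {"id": 101, "amount": 40.0, "status": "paid"},
    {"id": 102, "amount": 15.5, "status": "refunded"},
    {"id": 103, "amount": 99.0, "status": "paid"},
]

def rows_as_context(rows, wanted_ids):
    """Fetch the requested rows again and serialize them for the next prompt."""
    wanted = set(wanted_ids)
    selected = [row for row in rows if row["id"] in wanted]
    return json.dumps(selected, indent=2)

def build_followup(question, rows, wanted_ids):
    """Attach freshly fetched rows to the question for this turn."""
    context = rows_as_context(rows, wanted_ids)
    return f"Data for this question:\n{context}\n\nQuestion: {question}"

message = build_followup("Why might order 102 need review?", orders, [102])
print(message)
```

Each follow-up carries exactly the rows it needs, so the answer does not depend on the model recalling values from many turns ago.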
Handle Truly Large Tables in Context
Current context windows (even at 128k or 1M tokens) fill quickly with tabular data. A CSV with 50,000 rows and 20 columns has a million cells, and at even a couple of tokens per cell (value plus delimiter) that is well over a million tokens. Loading this directly into context is impractical and still does not give you reliable arithmetic.
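A quick size check before pasting a table can make this concrete. The sketch below uses a coarse four-characters-per-token rule of thumb. That ratio is an assumption, not a real tokenizer, and numeric data often tokenizes less efficiently than prose. The helper names and the generated table are hypothetical.

```python
import csv
import io

CHARS_PER_TOKEN = 4  # assumption: coarse rule of thumb, not a real tokenizer

def estimate_csv_tokens(csv_text: str) -> dict:
    """Return row count, column count, and a rough token estimate for CSV text."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    return {
        "rows": max(len(rows) - 1, 0),  # exclude the header row
        "columns": len(rows[0]) if rows else 0,
        "estimated_tokens": len(csv_text) // CHARS_PER_TOKEN,
    }

def fits_in_context(csv_text: str, context_window: int, reserve: int = 4_000) -> bool:
    """True if the estimate leaves `reserve` tokens for instructions and the reply."""
    estimate = estimate_csv_tokens(csv_text)["estimated_tokens"]
    return estimate + reserve <= context_window

# Invented table: 50,000 rows x 20 columns of 6-character values.
header = ",".join(f"col{i}" for i in range(20))
row = ",".join(["123.45"] * 20)
table = "\n".join([header] + [row] * 50_000)

stats = estimate_csv_tokens(table)
print(stats)
print(fits_in_context(table, context_window=128_000))
```

When the check fails, the practical alternative is to send the schema and a small sample and let generated code run against the full table.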