Using LLMs for Data Analysis: What Works and What Doesn't

LLMs excel at writing SQL and pandas code for data analysis, but they cannot reliably calculate over large datasets. The correct model: LLMs write code, computers run it.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

7 min read

// tags

#data-analysis #sql-generation #pandas #llm-applications #code-generation

LLMs are genuinely useful for data analysis, but not in the way many people initially expect. They cannot reliably perform arithmetic on large datasets stored in their context. What they can do is write accurate SQL queries, pandas code, and visualization scripts, which real computing infrastructure then executes. The correct mental model: LLMs write the code, computers run it.

What LLMs Can Do With Data

Write SQL Queries From Natural Language

This is one of the highest-value LLM data applications. Given a schema description and a natural language question, GPT-4o and Claude 3.5 Sonnet reliably produce correct SQL for moderately complex queries.

Schema: orders(id, customer_id, amount, created_at, status)
       customers(id, name, email, country)

Question: What is the average order value by country for orders in the last 90 days,
          only for countries with more than 100 orders?

Generated SQL:
SELECT c.country,
       AVG(o.amount) as avg_order_value,
       COUNT(*) as order_count
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at >= NOW() - INTERVAL '90 days'
  AND o.status != 'cancelled'
GROUP BY c.country
HAVING COUNT(*) > 100
ORDER BY avg_order_value DESC;

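The LLM writes this; your database executes it against real data, so the arithmetic is exact and the model never touches the raw rows. Review the query before trusting it, though: this one uses PostgreSQL interval syntax and excludes cancelled orders, a filter the question never asked for.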
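A minimal sketch of that division of labour, assuming Python's built-in sqlite3 module as a stand-in for your database. The `generated_sql` string is a placeholder for model output (no model is called here), the 90-day filter is rewritten in SQLite's date syntax, and the order threshold is a parameter so the toy data can use a small value. The helper name `run_generated_query` is hypothetical.

```python
import sqlite3

# Placeholder for text returned by a model; no LLM is called in this sketch.
# SQLite dialect: date('now', '-90 days') replaces NOW() - INTERVAL '90 days'.
generated_sql = """
SELECT c.country,
       AVG(o.amount) AS avg_order_value,
       COUNT(*) AS order_count
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at >= date('now', '-90 days')
  AND o.status != 'cancelled'
GROUP BY c.country
HAVING COUNT(*) > :min_orders
ORDER BY avg_order_value DESC;
"""

def run_generated_query(conn, sql, min_orders):
    """Execute model-written SQL; the database, not the model, does the math."""
    return conn.execute(sql, {"min_orders": min_orders}).fetchall()

# Toy in-memory database matching the schema described above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT, email TEXT, country TEXT);
CREATE TABLE orders(id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL,
                    created_at TEXT, status TEXT);
INSERT INTO customers VALUES
  (1, 'A', 'a@example.com', 'DE'),
  (2, 'B', 'b@example.com', 'FR');
INSERT INTO orders VALUES
  (1, 1, 100.0, date('now', '-1 day'),    'paid'),
  (2, 1, 300.0, date('now', '-2 days'),   'paid'),
  (3, 2, 50.0,  date('now', '-3 days'),   'paid'),
  (4, 2, 70.0,  date('now', '-5 days'),   'cancelled'),
  (5, 2, 90.0,  date('now', '-200 days'), 'paid');
""")

# Threshold lowered to 1 so the toy data returns a row.
rows = run_generated_query(conn, generated_sql, min_orders=1)
print(rows)
```

Only one country survives the filters in the toy data: the cancelled order and the 200-day-old order are excluded before counting, which is the kind of behaviour worth confirming on a small fixture before pointing generated SQL at production.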

Generate pandas Code

For Python-based data work, LLMs can write pandas transformation pipelines, cleaning scripts, and analysis code that runs in your actual environment:

# LLM-generated pandas code for calculating customer lifetime value cohorts
# Assumes df has the columns customer_id, amount, and first_purchase_date
import pandas as pd

df["first_purchase_month"] = pd.to_datetime(df["first_purchase_date"]).dt.to_period("M")
ltv_by_cohort = (
    df.groupby("first_purchase_month")
    .agg(
        customer_count=("customer_id", "nunique"),
        total_revenue=("amount", "sum"),
    )
    # Average LTV = cohort revenue / distinct customers in the cohort
    .assign(avg_ltv=lambda g: g["total_revenue"] / g["customer_count"])
    .reset_index()
)

The model generates this code. You execute it in your environment. The calculation runs on your actual data.

Explain What a Dataset Means

Paste a sample of your data (headers + first 20 rows) and ask the model to explain what you are looking at, spot potential data quality issues, suggest analysis directions, or explain what a particular column likely represents. This is high-value for onboarding to unfamiliar datasets.
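A minimal sketch of preparing such a sample, assuming the data is available as CSV text. The function name `build_sample_prompt` and the prompt wording are illustrative assumptions, not a prescribed format.

```python
import csv
import io
from itertools import islice

def build_sample_prompt(csv_text, n_rows=20):
    """Return a prompt containing the header and the first n_rows data rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    sample = list(islice(reader, n_rows))

    # Re-serialize the sample so quoting stays valid CSV.
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(sample)

    return (
        "Here is a sample of a dataset (header plus first "
        f"{len(sample)} rows). Explain what each column likely represents "
        "and flag possible data quality issues.\n\n" + out.getvalue()
    )

# Toy data: 50 rows, so only the first 20 should reach the prompt.
csv_text = "id,amount\n" + "\n".join(f"{i},{i * 10}" for i in range(1, 51))
prompt = build_sample_prompt(csv_text)
print(prompt)
```

Sending a fixed-size sample keeps the prompt small regardless of how large the underlying file is.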

Create Visualization Code

LLMs write matplotlib, plotly, and seaborn code from a description of what you want to see. "Show me a heatmap of monthly revenue by product category" becomes working Python code in seconds.

Spot Anomalies in Small Tables

For tables small enough to fit in context (under ~100 rows), LLMs can identify patterns, flag outliers, and spot data quality issues. For example, paste a week of daily metrics and ask, "Does anything look unusual?"

What LLMs Cannot Do Reliably

Perform Calculations on Large Datasets

If you paste 10,000 rows of data into a context window and ask "what is the total revenue?", the model will attempt to sum the values, but the answer will likely be wrong. LLMs are not calculators. They predict plausible-looking numbers rather than actually iterating through the values.

This is not an edge case. Arithmetic over large in-context datasets is a known, fundamental weakness of current LLMs. Always use actual computational tools for this.
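For contrast, a minimal sketch of the computational route, assuming the data is CSV text with a `revenue` column (the column name and the helper `total_revenue` are illustrative). It uses only the standard library, and `Decimal` so that currency values sum exactly.

```python
import csv
import io
from decimal import Decimal

def total_revenue(csv_text, column="revenue"):
    """Sum one column exactly by iterating over every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum((Decimal(row[column]) for row in reader), Decimal("0"))

# Toy dataset of 10,000 rows, generated so the correct total can be checked.
rows = [f"{i},{i}.25" for i in range(1, 10_001)]
csv_text = "order_id,revenue\n" + "\n".join(rows)

total = total_revenue(csv_text)
print(total)
```

This is the kind of code an LLM can write for you in seconds; the point is that the interpreter, not the model, produces the number.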

Remember Data Accurately Across a Long Conversation

If you share a dataset at the start of a conversation and then ask about specific rows 20 turns later, the model may confuse values, hallucinate rows that were not there, or fail to correctly reference data from early in the conversation. Treat each data reference as a fresh lookup, not a persistent in-memory store.
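One way to apply the fresh-lookup idea is to re-fetch the rows a question refers to and attach them to that turn's prompt. The sketch below assumes records are keyed by an ID in a plain dictionary; `build_turn_prompt` and the prompt layout are hypothetical.

```python
def build_turn_prompt(question, records_by_id, row_ids):
    """Re-attach the rows a question refers to, fetched from the source data."""
    lines = []
    for row_id in row_ids:
        record = records_by_id.get(row_id)
        if record is None:
            # Say so explicitly rather than letting the model guess.
            lines.append(f"row {row_id}: NOT FOUND in source data")
        else:
            lines.append(f"row {row_id}: {record}")
    return (
        "Relevant rows (fetched just now):\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )

# Toy source of truth; in practice this would be a database or DataFrame lookup.
records_by_id = {
    17: {"customer": "acme", "amount": 120.5},
    42: {"customer": "globex", "amount": 87.0},
}
prompt = build_turn_prompt(
    "Why is row 42 lower than row 17?", records_by_id, [17, 42, 99]
)
print(prompt)
```

Because every turn carries the exact values it needs, the model never has to recall them from earlier in the conversation.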

Handle Truly Large Tables in Context

Current context windows (even at 128k or 1M tokens) fill quickly with tabular data. A CSV with 50,000 rows and 20 columns contains a million cells and can easily run past a million tokens. Loading this directly into context is impractical and still does not give you reliable arithmetic.
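A back-of-the-envelope estimate makes the scale concrete. The sketch below is a rough heuristic, not a tokenizer: the average cell width (6 characters) and the ratio of 4 characters per token are both assumptions, and real counts vary by tokenizer and by how numeric the data is.

```python
def estimate_csv_tokens(n_rows, n_cols, avg_cell_chars=6, chars_per_token=4):
    """Rough token estimate for a CSV; both averages are assumptions."""
    # Each cell contributes its characters plus one delimiter (comma or newline).
    total_chars = n_rows * n_cols * (avg_cell_chars + 1)
    return total_chars // chars_per_token

estimate = estimate_csv_tokens(50_000, 20)
print(f"~{estimate:,} tokens")
for window in (128_000, 1_000_000):
    print(f"fits in {window:,}-token window: {estimate <= window}")
```

Shorter cells lower the figure and longer text fields raise it, so treat the result as an order-of-magnitude check before deciding whether a table belongs in context at all.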
