Using LLMs for Data Analysis: What Works and What Doesn't

LLMs excel at writing SQL and pandas code for data analysis, but they cannot reliably calculate over large datasets. The correct model: LLMs write code, computers run it.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 17, 2026

7 min read

// tags

#data-analysis#sql-generation#pandas#llm-applications#code-generation

FIG. ART-21

7 min read

“

Using LLMs for Data Analysis: What Works and What Doesn't

// reading plan

sections

1,029

words

min read

// LLM & Language Models

LLMs for Code Generation: A Deep Dive Into Benchmarks, Best Practices, and Limits

Which LLMs write the best code in 2026, what the benchmarks actually measure, how to get better output, and where generated code will still burn you.

9 min read

// Data Science

Pandas for Software Developers: The Complete Guide to Data Manipulation in Python

LLMs are genuinely useful for data analysis, but not in the way many people initially expect. They cannot reliably perform arithmetic on large datasets stored in their context. What they can do is write accurate SQL queries, pandas code, and visualization scripts, which real computing infrastructure then executes. The correct mental model: LLMs write the code, computers run it.

What LLMs Can Do With Data

Write SQL Queries From Natural Language

This is one of the highest-value LLM data applications. Given a schema description and a natural language question, GPT-4o and Claude 3.5 Sonnet reliably produce correct SQL for moderately complex queries.

Schema: orders(id, customer_id, amount, created_at, status)
       customers(id, name, email, country)

Question: What is the average order value by country for orders in the last 90 days,
          only for countries with more than 100 orders?

Generated SQL:
SELECT c.country,
       AVG(o.amount) as avg_order_value,
       COUNT(*) as order_count
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at >= NOW() - INTERVAL '90 days'
  AND o.status != 'cancelled'
GROUP BY c.country
HAVING COUNT(*) > 100
ORDER BY avg_order_value DESC;

The LLM writes this. Your actual database executes it against real data. You get correct results without the LLM ever touching the raw data directly.

Generate pandas Code

For Python-based data work, LLMs can write pandas transformation pipelines, cleaning scripts, and analysis code that runs in your actual environment:

# LLM-generated pandas code for calculating customer lifetime value cohorts
import pandas as pd

df["first_purchase_month"] = pd.to_datetime(df["first_purchase_date"]).dt.to_period("M")
ltv_by_cohort = (
    df.groupby("first_purchase_month")
    .agg(
        customer_count=("customer_id", "nunique"),
        total_revenue=("amount", "sum"),
        avg_ltv=("amount", lambda x: x.sum() / df.loc[x.index, "customer_id"].nunique()),
    )
    .reset_index()
)

The model generates this code. You execute it in your environment. The calculation runs on your actual data.

Explain What a Dataset Means

Paste a sample of your data (headers + first 20 rows) and ask the model to explain what you are looking at, spot potential data quality issues, suggest analysis directions, or explain what a particular column likely represents. This is high-value for onboarding to unfamiliar datasets.

Create Visualization Code

LLMs write matplotlib, plotly, and seaborn code from a description of what you want to see. "Show me a heatmap of monthly revenue by product category" becomes working Python code in seconds.

Spot Anomalies in Small Tables

For tables small enough to fit in context (under ~100 rows), LLMs can identify patterns, flag outliers, and spot data quality issues. For example: pasting a week of daily metrics and asking "does anything look unusual?"

What LLMs Cannot Do Reliably

Perform Calculations on Large Datasets

If you paste 10,000 rows of data into a context window and ask "what is the total revenue?", the model will attempt to sum the values but will likely be wrong. LLMs are not calculators. They generate plausible-looking numbers based on training, not by actually iterating through values.

This is not an edge case. Arithmetic over large in-context datasets is a known, fundamental weakness of current LLMs. Always use actual computational tools for this.

Remember Data Accurately Across a Long Conversation

If you share a dataset at the start of a conversation and then ask about specific rows 20 turns later, the model may confuse values, hallucinate rows that were not there, or fail to correctly reference data from early in the conversation. Treat each data reference as a fresh lookup, not a persistent in-memory store.

Handle Truly Large Tables in Context

Current context windows (even at 128k or 1M tokens) fill quickly with tabular data. A CSV with 50,000 rows and 20 columns might take hundreds of thousands of tokens. Loading this directly into context is impractical and still does not give you reliable arithmetic.

The Correct Workflow

The right approach is a loop:

LLM writes the code (SQL, Python, R)
You run the code in your actual environment
You share the results back with the LLM
LLM interprets the results and generates follow-up code if needed

This loop is how professional data analysts already use LLMs effectively. The LLM is a pair programmer for data work, not a data processing engine.

You: "Write pandas code to find customers who have not ordered in 90 days
      but had at least 3 orders before that"

LLM: [generates pandas code]

You: [runs code, gets 847 customers]

You: "Now write code to segment these by their total historical spend"

LLM: [generates segmentation code using the same dataframe]

Tools That Do This Well

ChatGPT Advanced Data Analysis (formerly Code Interpreter): uploads a CSV, runs Python in a sandboxed environment, shows results directly. The most accessible implementation of the correct workflow. Good for ad-hoc analysis where you do not need to export the code.

Claude with uploaded CSVs: Claude can analyze uploaded CSV files, though its in-context analysis has the arithmetic limitations described above. Works better when you ask it to generate code you run externally.

LlamaIndex for structured data: the Query Engine in LlamaIndex can convert natural language queries to SQL or pandas against your actual database or dataframe. This is the infrastructure layer for building production LLM data pipelines.

LangChain SQL Agent: connects an LLM to a SQL database, lets it query, observe results, and iterate. Production-viable for building natural language database interfaces.

Practical Data Analysis Applications

Analyst copilot: data analysts write queries faster with LLM assistance. Research shows 20-40% productivity gains for common query writing tasks.

Natural language database interface: non-technical users ask questions in English, the system generates SQL, runs it, and returns results. Reduces dependency on data team for ad-hoc requests.

Data quality auditing: paste schema and sample data, ask the model to suggest validation checks, identify potential integrity issues, and generate cleaning code.

Report generation: run analysis code, share results with LLM, ask it to write a natural language summary of the findings.

Keep Reading

Function Calling in LLMs — How to give LLMs access to real computation tools
LLM Embeddings Explained — The right way to search over large datasets semantically
How Large Language Models Work: Complete Guide — Why LLMs cannot reliably do arithmetic

Pristren builds AI-powered software for teams. Zlyqor is our all-in-one workspace — chat, projects, time tracking, AI meeting summaries, and invoicing — in one tool. Try it free.

Using LLMs for Data Analysis: What Works and What Doesn't

Related Articles

LLMs for Code Generation: A Deep Dive Into Benchmarks, Best Practices, and Limits

What LLMs Can Do With Data

Write SQL Queries From Natural Language

Generate pandas Code

Explain What a Dataset Means

Create Visualization Code

Spot Anomalies in Small Tables

What LLMs Cannot Do Reliably

Perform Calculations on Large Datasets

Remember Data Accurately Across a Long Conversation

Handle Truly Large Tables in Context

The Correct Workflow

Tools That Do This Well

Practical Data Analysis Applications

Keep Reading

The workspace your team
actually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

Pandas for Software Developers: The Complete Guide to Data Manipulation in Python

Claude 3.5 Sonnet Review: What It Does Better Than GPT-4o (and Where It Falls Short)

Using LLMs for Data Analysis: What Works and What Doesn't

Related Articles

LLMs for Code Generation: A Deep Dive Into Benchmarks, Best Practices, and Limits

What LLMs Can Do With Data

Write SQL Queries From Natural Language

Generate pandas Code

Explain What a Dataset Means

Create Visualization Code

Spot Anomalies in Small Tables

What LLMs Cannot Do Reliably

Perform Calculations on Large Datasets

Remember Data Accurately Across a Long Conversation

Handle Truly Large Tables in Context

The Correct Workflow

Tools That Do This Well

Practical Data Analysis Applications

Keep Reading

The workspace your teamactually needs

AI & ML insights, weekly

Mahmudul Haque Qudrati

Pandas for Software Developers: The Complete Guide to Data Manipulation in Python

Claude 3.5 Sonnet Review: What It Does Better Than GPT-4o (and Where It Falls Short)

The workspace your team
actually needs