Comparing LLMs for a simple SQL task
For my weekend project, I decided to explore how various language models would perform on a straightforward SQL task that I recently encountered at work. The task involved writing a query to determine if a snapshot table contained any gaps, something that would typically take me just a couple of minutes to complete. Curious about the current state of LLM performance, I initially ran the task through GPT-3.5, but to my surprise, the generated query was incorrect. Even GPT-4 failed to produce an accurate solution. This piqued my interest, and I set out to investigate how different LLMs would handle this seemingly simple SQL challenge.
The Prompt
Create an SQL statement in PostgreSQL to identify any gaps (non-consecutive days) in the `dates` table, which has a `date` column. The statement should:
- Return the date of any gaps.
- Order the results in ascending order of date.
Setup
I whipped up this Python script to pass the prompt through several different models, including Llama, Opus, Gemma, Gemini, and GPT.
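The script itself boils down to a small loop over model names. Here is a minimal sketch of that kind of harness; `query_model` and `run_comparison` are hypothetical names standing in for whatever client call each provider requires, not the author's actual script.

```python
# Sketch of a prompt-comparison harness. `query_model` is a stand-in
# (hypothetical) for the per-provider API call; the real script would
# dispatch to each vendor's client library.
PROMPT = (
    "Create an SQL statement in PostgreSQL to identify any gaps "
    "(non-consecutive days) in the dates table, which has a date column."
)

def run_comparison(models, query_model):
    """Send the same prompt to every model and collect the raw SQL replies."""
    return {name: query_model(name, PROMPT) for name in models}

# Usage with a stub in place of real API calls:
fake = lambda name, prompt: f"-- SQL from {name}"
print(run_comparison(["gemma-7b-it", "gpt-4"], fake))
```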
I used this PostgreSQL script, loaded into DB Fiddle, to create the `dates` table with 10 days:
CREATE TABLE dates (
date DATE PRIMARY KEY
);
INSERT INTO dates (date)
SELECT GENERATE_SERIES(
'2024-01-01'::DATE,
'2024-01-10'::DATE,
'1 day'::INTERVAL
) AS date;
DELETE FROM dates WHERE date IN ('2024-01-03', '2024-01-06', '2024-01-07');
The `DELETE` statement creates gaps in the data and serves as our test case for the LLM-generated queries.
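Before looking at the model output, it helps to pin down the expected answer. A small Python sketch mirroring the setup script above confirms that a correct query should return exactly the three deleted days:

```python
from datetime import date, timedelta

# Generate 2024-01-01 .. 2024-01-10, mirroring GENERATE_SERIES above.
all_days = [date(2024, 1, 1) + timedelta(days=i) for i in range(10)]

# Remove the same three days the DELETE statement removes.
deleted = {date(2024, 1, 3), date(2024, 1, 6), date(2024, 1, 7)}
dates_table = [d for d in all_days if d not in deleted]

# The gaps are exactly the deleted days, in ascending order -- this is
# the output a correct query should produce.
expected_gaps = sorted(set(all_days) - set(dates_table))
print([d.isoformat() for d in expected_gaps])
# ['2024-01-03', '2024-01-06', '2024-01-07']
```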
Results
gemma-7b-it
SELECT d.date
FROM generate_series(
(SELECT MIN(date) FROM dates),
(SELECT MAX(date) FROM dates) -- ❗ whoops, missing a parameter
) AS d
LEFT JOIN dates ON d.date = dates.date
WHERE dates.date IS NULL
ORDER BY d.date;
⚠ This query has a bug: `generate_series` needs a third parameter, `'1 day'::INTERVAL`, to be valid. With that fix, however, the query works!
gemini-1.5-pro-latest
WITH DateSeries AS (
SELECT date, LAG(date) OVER (ORDER BY date) AS previous_date
FROM dates
),
Gaps AS (
SELECT date
FROM DateSeries
WHERE date <> previous_date + INTERVAL '1 day'
AND previous_date IS NOT NULL
)
SELECT date
FROM Gaps
ORDER BY date;
❌ This query runs successfully without any modification; however, it returns the wrong days: '2024-01-04' and '2024-01-08'. Can you tell why?
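The answer becomes clear when you replay the query's LAG-based comparison in plain Python over the seven surviving rows: it flags the first day *after* each gap, not the missing days themselves.

```python
from datetime import date, timedelta

# The rows remaining in the table after the DELETE (01-03, 01-06, 01-07 removed).
rows = [date(2024, 1, d) for d in (1, 2, 4, 5, 8, 9, 10)]

# Mirror Gemini's LAG logic: keep each row whose predecessor is not
# exactly one day earlier.
flagged = [cur for prev, cur in zip(rows, rows[1:])
           if cur != prev + timedelta(days=1)]
print([d.isoformat() for d in flagged])
# ['2024-01-04', '2024-01-08'] -- the surviving days after each gap,
# not the missing days, and the two-day gap collapses into one row
```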
claude-3-opus-20240229
SELECT date::date + 1 AS missing_date
FROM (
SELECT date::date, lead(date::date) OVER (ORDER BY date) AS next_date
FROM dates
) t
WHERE next_date - date::date > 1
ORDER BY missing_date;
❌ Again, this query runs, but it returns only 2 of the 3 correct dates: '2024-01-03' and '2024-01-06'. It emits just the first missing day of each gap, so '2024-01-07' is skipped.
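Replaying Claude's LEAD-based logic in Python shows the failure mode: `date + 1` yields only the first missing day of each gap, so the two-day gap on the 6th and 7th produces a single row.

```python
from datetime import date, timedelta

# The rows remaining in the table after the DELETE.
rows = [date(2024, 1, d) for d in (1, 2, 4, 5, 8, 9, 10)]

# Mirror Claude's LEAD logic: emit date + 1 day wherever the next row
# is more than one day away.
missing = [cur + timedelta(days=1) for cur, nxt in zip(rows, rows[1:])
           if (nxt - cur).days > 1]
print([d.isoformat() for d in missing])
# ['2024-01-03', '2024-01-06'] -- '2024-01-07' is never generated
```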
llama3-70b-8192
WITH consecutive_days AS (
SELECT date, LAG(date) OVER (ORDER BY date) AS prev_date
FROM dates
)
SELECT date
FROM consecutive_days
WHERE date - prev_date > INTERVAL '1 day'; -- ❗ can't compare integer with interval
❌ This query returns the error `Query Error: error: operator does not exist: integer > interval`, and after replacing `INTERVAL '1 day'` with `1`, the returned values are off: '2024-01-04', '2024-01-08' (the same day-after-the-gap mistake Gemini made).
gpt-4
SELECT
current_date + INTERVAL '1 day' as start_gap,
next_date - INTERVAL '1 day' as end_gap
FROM
(
SELECT
date as current_date,
LEAD(date) OVER (ORDER BY date) as next_date
FROM
dates
) subquery
WHERE
next_date - current_date > INTERVAL '1 day' -- ❗ can't compare integer with interval
ORDER BY
current_date ASC;
❌ Similarly, this query returns the error `Query Error: error: operator does not exist: integer > interval`. Even after fixing this error, the query doesn't follow the prompt: instead of a single column of dates, it returns two columns with the gap start/end dates. Interesting that Llama-3 and GPT-4 both generate the same mistake.
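A Python replay of GPT-4's logic (with the integer comparison fixed) shows what it actually produces: start/end ranges rather than the single column of dates the prompt asked for.

```python
from datetime import date, timedelta

# The rows remaining in the table after the DELETE.
rows = [date(2024, 1, d) for d in (1, 2, 4, 5, 8, 9, 10)]

# Mirror GPT-4's fixed logic: for each gap, emit the range
# (current + 1 day, next - 1 day) as a start/end pair.
one = timedelta(days=1)
ranges = [(cur + one, nxt - one) for cur, nxt in zip(rows, rows[1:])
          if (nxt - cur).days > 1]
print([(a.isoformat(), b.isoformat()) for a, b in ranges])
# [('2024-01-03', '2024-01-03'), ('2024-01-06', '2024-01-07')]
```

The ranges do cover all three missing days, so this is arguably a prompt-compliance failure rather than a logic failure.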
mixtral-8x7b-32768
WITH date_series AS (
SELECT
generate_series(
(SELECT MIN(date) FROM dates),
(SELECT MAX(date) FROM dates),
'1 day'
)::date as series_date
)
SELECT
ds.series_date
FROM date_series ds
LEFT JOIN dates d ON ds.series_date = d.date
WHERE d.date IS NULL
ORDER BY ds.series_date;
✅ This model works perfectly: no syntax errors, and it returns the 3 correct days!
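Mixtral's winning approach is a classic calendar anti-join: generate the full series between MIN and MAX, then keep only the days with no matching row. The same idea in Python:

```python
from datetime import date, timedelta

# The rows remaining in the table after the DELETE.
rows = [date(2024, 1, d) for d in (1, 2, 4, 5, 8, 9, 10)]

# Mirror Mixtral's anti-join: build the full MIN..MAX series, then keep
# only the days absent from the table.
lo, hi = min(rows), max(rows)
series = [lo + timedelta(days=i) for i in range((hi - lo).days + 1)]
existing = set(rows)
gaps = [d for d in series if d not in existing]
print([d.isoformat() for d in gaps])
# ['2024-01-03', '2024-01-06', '2024-01-07']
```

Because the anti-join enumerates every candidate day, it handles multi-day gaps naturally, which is exactly where the LAG/LEAD approaches above fell short.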
Conclusion
Surprisingly, the smaller, faster models outperformed the large, expensive ones. Mixtral-8x7b was the clear winner, with Gemma-7b a close second. The other four models were simply inadequate. Next time I need SQL help from AI, I’ll consider all model sizes – bigger doesn’t always mean better!
| Model | Syntax | Accuracy |
|---|---|---|
| Gemma-7b | ❌ | ✅ |
| Gemini-1.5 | ✅ | ❌ |
| Claude3-Opus | ✅ | ⚠️ |
| Llama3-70b | ❌ | ❌ |
| GPT-4 | ❌ | ❌ |
| 🏆 Mixtral-8x7b | ✅ | ✅ |