AST validation vs regex for SQL — why does AST win?
Regex blocklists are the natural first instinct ("just reject if the string contains DROP") and they fail in five well-documented ways:
1. Comments. DR/**/OP TABLE x, DROP-- inline\nTABLE x. The regex sees the comment characters; the DB parser doesn't.
2. Case + whitespace. dRoP\tTaBlE\nx. Trivially defeated.
3. Unicode + encoding. Some parsers normalize Unicode lookalikes; some don't. Regex generally doesn't.
4. CTEs and subqueries. WITH x AS (DROP TABLE users) SELECT 1 — regex sees DROP; but actually parsing reveals this isn't valid SQL at all and would already be rejected. The flip side: WITH x AS (SELECT 1) DELETE FROM users WHERE id IN (SELECT * FROM x) — regex sees DELETE, fair, but in more elaborate forms the regex misses.
5. False positives. SELECT 'I will not DROP this' AS warning — regex blocks the legitimate query.
AST validation parses the SQL with the database's own grammar (or a faithful surrogate like libpg_query for Postgres, sqlglot cross-dialect). It then inspects semantic structure:
- Statement kind:
SelectStmtvsDropStmtvsDeleteStmt. - Tables referenced (
RangeVarnodes). - Columns projected (target list).
- Predicates (
WhereClause). - Subqueries and CTEs (recursed).
The cost is single-digit milliseconds per query. The benefit: to bypass the AST validator, an attacker must produce a query that, parsed by the DB itself, doesn't include the forbidden construct — at which point the query is by definition safe by your policy. Regex bypasses are a game of strings; AST bypasses are a game of grammar, and grammar has the high ground.