AST validation vs regex for SQL — why does AST win?

Question

Accepted Answer

Regex blocklists are the natural first instinct ("just reject if the string contains DROP") and they fail in five well-documented ways: 1. **Comments.** `DR/**/OP TABLE x`, `DROP-- inline
TABLE x`. The regex sees the comment characters; the DB parser doesn't. 2. **Case + whitespace.** `dRoP	TaBlE
x`. Trivially defeated. 3. **Unicode + encoding.** Some parsers normalize Unicode lookalikes; some don't. Regex generally doesn't. 4. **CTEs and subqueries.** `WITH x AS (DROP TABLE users) SELECT 1` — regex sees `DROP`; but actually parsing reveals this isn't valid SQL at all and would already be rejected. The flip side: `WITH x AS (SELECT 1) DELETE FROM users WHERE id IN (SELECT * FROM x)` — regex sees `DELETE`, fair, but in more elaborate forms the regex misses. 5. **False positives.** `SELECT 'I will not DROP this' AS warning` — regex blocks the legitimate query. AST validation parses the SQL with the database's own grammar (or a faithful surrogate like libpg_query for Postgres, sqlglot cross-dialect). It then inspects semantic structure: - Statement kind: `SelectStmt` vs `DropStmt` vs `DeleteStmt`. - Tables referenced (`RangeVar` nodes). - Columns projected (target list). - Predicates (`WhereClause`). - Subqueries and CTEs (recursed). The cost is single-digit milliseconds per query. The benefit: to bypass the AST validator, an attacker must produce a query that, parsed by the DB itself, doesn't include the forbidden construct — at which point the query is by definition safe by your policy. Regex bypasses are a game of strings; AST bypasses are a game of grammar, and grammar has the high ground.