How does AST-based SQL validation differ from regex blocklists?

Question

Accepted Answer

A regex blocklist scans the query string for forbidden tokens (`DROP`, `DELETE`, `UNION`, `;`). It is trivially bypassable: - Comments: `DR/**/OP TABLE x` - Whitespace: `DROP	TABLE
x` - Case: `DrOp TaBlE x` - Unicode: `DROP TABLE x` (some parsers) - Encoded payloads in CTEs or string literals that the LLM later rewrites. An AST validator parses the SQL into a tree using the database's own grammar (or a faithful parser like libpg_query for Postgres, sqlglot for cross-dialect). It then inspects the tree: - Statement kind (`SelectStmt` vs `DropStmt`) - Tables referenced (`RangeVar` nodes) - Columns projected - Predicates present (`WhereClause`) - Subqueries and CTEs (recursed) A blocklist sees a string. An AST validator sees the *semantics*. The cost is a few milliseconds per query; the benefit is that bypasses are no longer a regex-golf game — to evade the AST validator you have to produce a query that, parsed by the DB itself, doesn't include the forbidden construct. At which point the query is, by definition, safe by your policy.