How does AST-based SQL validation differ from regex blocklists?
A regex blocklist scans the query string for forbidden tokens (DROP, DELETE, UNION, ;). It is trivially bypassable:
- Comments:
DR/**/OP TABLE x - Whitespace:
DROP\tTABLE\nx - Case:
DrOp TaBlE x - Unicode:
DROP TABLE x(some parsers) - Encoded payloads in CTEs or string literals that the LLM later rewrites.
An AST validator parses the SQL into a tree using the database's own grammar (or a faithful parser like libpg_query for Postgres, sqlglot for cross-dialect). It then inspects the tree:
- Statement kind (
SelectStmtvsDropStmt) - Tables referenced (
RangeVarnodes) - Columns projected
- Predicates present (
WhereClause) - Subqueries and CTEs (recursed)
A blocklist sees a string. An AST validator sees the *semantics*. The cost is a few milliseconds per query; the benefit is that bypasses are no longer a regex-golf game — to evade the AST validator you have to produce a query that, parsed by the DB itself, doesn't include the forbidden construct. At which point the query is, by definition, safe by your policy.