Graph Query
Summary
This RFC proposes how graph instances conforming to the data model defined in [DRAFT] RFC-0025 "Graph Data Model" will be queried in PartiQL. The core of the proposal is to largely follow GPML, the Graph Pattern Matching Language, which is under standardization as part of ISO 9075-16: SQL/PGQ. Consequently, this RFC focuses on how GPML is adapted for and incorporated into PartiQL, while, for shared details, it defers to publicly available GPML information (currently, [1]).
Familiarity with [1] is required for understanding the full impact of this RFC on PartiQL, however this RFC is meant to be self-contained (in conjunction with RFC-0025).
Overview
The construct for querying a graph that is proposed in this RFC is the graph matching expression of the form
(
graphMATCH
pattern)
where
-
graph is an expression that evaluates to a graph, in the data model proposed in RFC-0025;
-
pattern is a graph matching pattern, mostly following the GPML syntax and semantics;
-
MATCH
is a keyword, while the surrounding parentheses(
…)
are required, in most cases.
A graph pattern pattern describes a graph fragment of interest to be found in the input
graph graph. Variables within the pattern designate elements of the fragment
that are needed in the result.
The result of this expression is a bag of structs (a.k.a. a table),
with one struct (a.k.a. a row) per each fragment found in graph that matched pattern,
where attributes
in the struct correspond to the variables in pattern.
With the result of a graph matching expression being a bag of structs, this expression
can occur anywhere in a PartiQL query, as any other expression. In particular,
it can occur as a data source in a FROM
clause of a query, which is the primary
intended usage.
Computation of a pattern match involves navigation around the graph, but graph navigation capabilities are supported only within the scope of this computation. The result table that gets out does not carry any implicit information about the graph’s structure; it only carries the "payload" information from the matched graph elements (see RFC-0025). In a sense, what is inside a MATCH expression’s pattern is a separate query language, distinct from PartiQL, and this language also has its own distinct evaluation mechanisms.
Relation to GPML and SQL/PGQ
GPML is the name given to the graph pattern matching language described in a publicly available paper [1] that previews such a language developed for SQL/PGQ and GQL, the upcoming standards from ISO and IEC. SQL/PGQ is ISO 9075-16 — a new section in the SQL standard ISO 9075 that adds to SQL ability to query graphs and, presumably, those will be graphs defined as "views" over relational tables, according to conventions to be described in ISO 9075-16 as well. GQL will be a language for "native" graph databases, with full CRUD capabilities for graphs. Graph queries by means of graph patterns will be a capability shared between SQL/PGQ and GQL, and that is what is referred to as GPML.
This RFC intends to describe for PartiQL parts of what SQL/PGQ is meant
to describe for SQL: a graph pattern language and how it is integrated into
the host language (PartiQL or SQL).
In general, PartiQL strives for SQL compatibility.
While [1] provides good amount of detail about the pattern language,
the details of its integration into SQL are less clear.
Consequently, the proposal in this RFC may need adjustments (more likely in syntax,
but possibly in semantics as well) after SQL/PGQ becomes available.
Graph pattern language for PartiQL described in this RFC largely follows GPML as descibed in [1], with two differences:
-
Since the graph data model in RFC-0025 is more general than in GPML (it allows any PartiQL value as a payload at a graph element, not just something that mimics GPML’s key/value properties), this has to be addressed for the relevant aspects of the pattern language.
-
In GPML, at least as described in [1], graph queries have form
MATCH
pattern, where the graph being queried is implicit, while here graph match expression has form(
graphMATCH
pattern)
, where the graph is explicit.
Grammar
Adding MATCH expression
Graph pattern matching is added to the PartiQL grammar as a new production
for the non-terminal <expr_query>
(which defines constructs like scalar, struct,
and collection literals; function calls; parenthesized SELECT-FROM-WHERE, etc.):
<expr_query> ::=
// ...
| '(' <expr_query> 'MATCH' <graph_pattern> ')'
<graph_pattern> ::=
<selector>? <path_pattern> [ ',' <path_pattern> ]*
where <graph_pattern>
, as well as the non-terminals on which it depends,
are defined as in GPML (see the following section).
After this, the match expression becomes available, as an <expr_query>
,
to be used as a source in the FROM
clause of SELECT-FROM-WHERE queries,
which is defined in PartiQL as
<from_clause> ::=
'FROM' <from_item> [ ',' <from_item> ]*
<from_item> ::=
<expr_query> [ 'AS' <id> [ AT <id> ]? ]?
// ... joins, etc
Note that there are mandatory parentheses (
… )
around the MATCH
expression.
These are necessary for parsing FROM
clauses with one-token lookahead.
Otherwise, there could be uses of ,
in a FROM
clause for which it would be
impossible to decide whether they separate instances of <path_pattern>
or instances of <from_item>
.
However, the parenthesation restriction can be relaxed in a FROM
clause
when its pattern consists of a single path pattern.
For this, we add an extra production to <from_item>
:
<from_clause> ::=
'FROM' <from_item> [ ',' <from_item> ]*
<from_item> ::=
// ...
| <expr_query> 'MATCH' <selector>? <path_pattern> [ 'AS' <id> [ AT <id> ]? ]?
Graph Patterns
PartiQL GPML is considered experimental — see RFC-0033. |
The grammar in this section is a reverse-engineered reconstruction based on the examples
in [1]. It is followed in the current experimental implementation of
graph query parsing in partiql-lang-kotlin, starting in 0.8.0.
This grammar is subject to change, taking into account any details that might appear
in SQL/PGQ but were not covered in [1].
It also does not cover several features mentioned in the paper in passing only
(label patterns, predicates for node/edge relationships).
To clarify how graph pattern and the "host" PartiQL grammars are combined,
PartiQL non-terminals are indicated with double angle brackets,
specifically [integer]
, [identifier]
, and [where_clause]
.
<graph_pattern> ::=
<selector>? <path_pattern> [ ',' <path_pattern> ]*
<path_pattern> ::=
<restrictor>? [ <path_variable> '=' ]? <path_part>+
<path_variable> ::= <<identifier>>
<path_part> ::=
<node_pattern>
| <edge_pattern>
| <group_pattern>
<selector> ::=
[ 'ANY' | 'ALL' ] 'SHORTEST'
| 'ANY' <<integer>>?
| 'SHORTEST' <<integer>> 'GROUP'?
<restrictor> ::=
'TRAIL' | 'ACYCLIC' | 'SIMPLE'
<node_pattern> ::= '(' <node_labeled> <<where_clause>>? ')'
<node_labeled> ::=
<node_variable> ':' <label_descr>
| <node_variable>
| ':' <label_descr>
<node_variable> ::= <<identifier>>
<label_descr> ::= //Simplified. GPML provides %, !, |, &, grouping.
<label_name>
<label_name> ::= <<identifier>>
<edge_pattern> ::= <edge_spec> <quantifier>?
<group_pattern> ::=
'(' <path_pattern> <<where_clause>>? ')' <quantifier>?
| '[' <path_pattern> <<where_clause>>? ']' <quantifier>?
<quantifier> ::=
'+' | '*'
| '{' <<integer>> ',' <<integer>>? '}'
<edge_spec> ::=
'<-' | '<-' '[' <edge_filtered> ']' '-'
| '~' | '~' '[' <edge_filtered> ']' '~'
| '->' | '-' '[' <edge_filtered> ']' '->'
| '<~' | '<~' '[' <edge_filtered> ']' '~'
| '~>' | '~' '[' <edge_filtered> ']' '~>'
| '<->' | '<-' '[' <edge_filtered> ']' '->'
| '-' | '-' '[' <edge_filtered> ']' '-'
<edge_filtered> ::= <edge_labeled> <<where_clause>>?
<edge_labeled> ::=
<edge_variable> ':' <label_descr>
| <edge_variable>
| ':' <label_descr>
<edge_variable> ::= <<identifier>>
It is worthy of a note that [where_clause]
, referenced in the above grammar
in node, edge, and group patterns, is, syntactically, the WHERE
clause of host PartiQL,
where it is defined as
<where_clause> ::= 'WHERE' <expr_query>
where the expression <expr_query>
is meant to evaluate to a boolean.
This expession, however, can contain, as references, variables introduced as
<node_variable>
, <edge_variable>
, or <path_variable>
elsewhere in the pattern, as well as graph-specific predicates and functions
applied to these references.
Evaluation
Background: Evaluation in GPML
The details of graph pattern matching semantics are available in [1] (Section 6). For the purposes of this RFC, it is sufficient to describe the structure of a pattern-matching result.
Suppose we have a GPML graph pattern of the form
P1 , … , Pn
where Pi are path patterns.
(We abstract away from the possibility that a <graph_pattern>
can start with a selector,
since that does not affect the structure of the result.)
The path patterns can contain variables marking node, edge, or path subpatterns.
The variables that occur in the scope of a quantifier
(such as +
, *
, {2,7}
, or {2,}
) are group variables,
while the remaining ones are singleton variables.
The intent of pattern matching is to find all (possibly overlapping) fragments
of the graph, each of which is described by the pattern
P1 , … , Pn, and associate
the pattern’s variables with elements of the graph from the fragment.
In a given match result (fragment), a singleton variable gets associated to
one graph element, while a group variable can get associated to none or several
elements — one for each repetition of the containing quantified subpattern.
A singleton variable x
for a node or an edge can occur in more than one
path pattern Pi, in which case it is meant to be bound,
in a given result, to the same node or edge across all path patterns.
For the sanity of this semantics, it follows that this is not
allowed for group variables.
Inferring from [1] (Section 6), the result of evaluating a GPML pattern on a graph can be characterized as a bag of matching graph fragments, each being an ordered tuple of the form
p1 , …, pn
with each pi being an annotated path from the graph corresponding to the path pattern Pi. In turn, an annotated path pi is a path from the graph (a sequence of alternating nodes and edges) that is annotated with variables from the corresponding path pattern Pi. Among these annotations,
-
A singleton variable can be used multiple times, possibly in different paths pi and pj, but always annotating the same node or edge, in a given matching fragment.
-
A group variable annotates zero or more elements within one of the pi paths in the match.
-
A path variable annotates a contiguous subpath of one of the pi paths (which could be the entirety of pi).
It is important to note that the variables are annotated to entities in the graph (nodes, edges, paths) that are "aware" of their identity and location within the graph. This capability is essential within the pattern matching process, to determine sameness of nodes and edges when they are probed for being annotated with the same (singleton) pattern variable.
Example. In its Section 6, the GPML paper[1] works though pattern match computation of an example query pattern
MATCH TRAIL (a WHERE a.owner='Jay')
[−[b:Transfer WHERE b.amount>5M]−>]+
(a) [−[:isLocatedIn]−>(c:City) |
−[:isLocatedIn]−>(c:Country)]
that is evaluated against the graph from Section 2 of the paper.
This pattern has singleton node variables a
and c
and
a group edge variable b
.
This is also a valid PartiQL graph pattern as proposed here,
provided its input graph uses structs as payloads at node and edges,
to model property graphs. (Making subexpressions a.owner
and 'b.amount` valid.)
The final result of this pattern matching, given in Subsection 6.5 of the paper, is a bag with two graph fragments, each consisting of one annotated path (because the pattern consists of only one path pattern):
a b b b b a c a4 t4 a6 t5 a3 t2 a2 t3 a4 li4 c2 a b b b b b b b a c a4 t4 a6 t5 a3 t7 a5 t8 a1 t1 a3 t2 a2 t3 a4 li4 c2
where, for each of the two fragments, the second row is a sequence of internal identifiers for the nodes and edges in the graph and the first row are the pattern-variable annotations.
Evaluation of Graph Pattern Matching in PartiQL
In the context of this RFC, the notion of a PartiQL value has been extended,
according to RFC-0025, to include graph values, while the notion of an expression
is being extended in this RFC by adding graph matching expressions.
Furthermore, a graph matching expression can contain PartiQL expressions,
notably within the filtering WHERE
clauses.
A PartiQL expression is evaluated in an environment that maps variables to PartiQL values,
with the result of an evaluation being a PartiQL value.
The above-outlined evaluation semantics of graph pattern matching in GPML is
structured differently (for one, it produces a bag of tuples of annotated paths)
and operates over values of a kind that are not part
of PartiQL data model, even after the RFC-0025 extension (such as nodes and edges
that have awareness of their identity and location within a graph).
In a sense, graph matching is a distinct query language of its own, embedded
within PartiQL as a MATCH
expression form.
Consequently, our goal here is to describe how the two semantics are combined, specifically addressing:
-
Role of PartiQL environments while evaluating pattern matching.
-
Evaluation of PartiQL expressions within graph pattern
WHERE
clauses. -
And, crucially, how the result of pattern match computation is represented as a value within the PartiQL data model.
For evaluating a graph matching expression
(
graphMATCH
pattern)
this RFC proposes the following.
-
As any PartiQL expression, the graph matching expression is evaluated within a regular PartiQL environment (𝝆0, 𝝆) that maps variables (defined outside this expression) to values.
-
The graph subexpression is evaluated within (𝝆0, 𝝆) and must result in a graph value (in the sense of RFC-0025).
-
Given the graph value, graph matching computation for pattern proceeds according to the GPML semantics, with the following modifications:
-
Variables defined in pattern shadow variables bound in (𝝆0, 𝝆).
-
Whenever a
WHERE e
clause within the pattern needs to be applied, the clause’s expressione
is evaluated within (𝝆0, 𝝆) though regular PartiQL evaluation, except that any pattern-defined variablex
is not looked up in (𝝆0, 𝝆). Instead, its value, as a graph element v, is provided by the GPML pattern matching procedure.
To make the evaluation proceed further, according to regular PartiQL evaluation rules, v is replaced with its payload:-
If v is a node or an edge, payload(*v)* is used.
-
If v is a path, it is an error, according to this RFC. <!-- To support graphical predicates, it would be something like this, instead of the above paragraph: The subsequent processing of the graph-element value v depends on the context of
x
withine
: -
If
x
is used within a graph-specific construct (such as a predicate likeIS DIRECTED
or a path-aggregating function), the graph-element value v is passed to this construct directly. -
Otherwise,
x
occurs in a regular PartiQL context, in which case it is replaced with its payload:-
If v is a node or an edge, payload(*v)* is used.
-
If v is a path, it is an error, according to this RFC. -→
-
-
-
-
Upon completion of the GPML pattern matching computation, the resulting bag of matching graph fragments is converted into a bag of structs by converting each fragment into a struct as follows.
Given a fragment of the formp1 , …, pn
where pi are variable-annotated paths, the resulting struct gets key/value attribute pairs where keys are variable names while the overall attribute for variable
x
is determined is follows.-
If the
x
is a singleton variable that annotates a node or an edge v, the value of itsx
's attribute is payload(*v)*. -
If
x
is a group variable, the value of its attribute is the list of payloads at the elements (nodes or edges) annotated withx
, taken in the order of their appearance in the fragment’s path. -
If
x
is a path variable, it does not produce an attribute in the struct, according to this RFC.
-
This semantics naturally supports the important property lookup use case from GPML[1] on property graphs. That is, evaluating an expression of the form v*.a* where v is a node or an edge and a is a property name:
-
In GPML[1], .a is a property lookup operation specifically defined for graph nodes and edges.
-
In PartiQL, .a remains the operation of a key lookup in a struct, which would succeed under the above semantics as long as v has a struct as its payload.
Example. Continuing the example from the previous section, the result of the PartiQL match expression is a bag of two structs containing attributes named as pattern variables and carrying values drawn from payloads of the matched nodes and edges:
<< { 'a' : { 'owner' : 'Jay', 'isBlocked' : 'yes' }, 'b' : [ { 'date' : '4/1/2020', 'amount' : 10M }, { 'date' : '6/1/2020', 'amount' : 10M }, { 'date' : '2/1/2020', 'amount' : 10M }, { 'date' : '3/1/2020', 'amount' : 10M } ], 'c' : { 'name' : 'Ankh-Morpork' } }, { 'a' : { 'owner' : 'Jay', 'isBlocked' : 'yes' }, 'b' : [ { 'date' : '4/1/2020', 'amount' : 10M }, { 'date' : '6/1/2020', 'amount' : 10M }, { 'date' : '8/1/2020', 'amount' : 6M }, { 'date' : '9/1/2020', 'amount' : 9M }, { 'date' : '1/1/2020', 'amount' : 8M }, { 'date' : '2/1/2020', 'amount' : 10M }, { 'date' : '3/1/2020', 'amount' : 10M } ], 'c' : { 'name' : 'Ankh-Morpork' } } >>
Implementation-Dependent Aspects
Certain aspects of graph pattern matching behavior can be implementation-dependent. This section identifies such aspects, as well as possibilities that an implementation may choose to adopt for a given aspect. The chosen possibility must be clearly documented. An implementation may also support several of the designated possibilities. In this case, it must provide a configuration mechanism for a user to choose the desired one.
Global Restrictors for Graph Patterns
A path subpattern in a graph pattern can be marked with a TRAIL
or ACYCLIC
restrictor (also SIMPLE
) that restricts the result bag of otherwise-matching paths
to those that additionally do not repeat an edge (for TRAIL
)
or do not repeat a node (for ACYCLIC
).
It is conceivable to have similar restrictors that apply to the whole graph pattern,
but it appears this is not the design in GPML and SQL/PGQ.
Instead, PartiQL defines this as an implementation-dependent behavior.
That is, dependent on an application’s choice, the bag of annotated graph fragments
computed as matches for any graph pattern is allowed to contain only certain fragments:
-
repeats ok: any matching fragment is allowed;
-
no repeat nodes a fragment is allowed only if no node occurs twice;
-
no repeat edges a fragment is allowed only if no edge occurs twice;
-
no repeat elements a fragment is allowed only neither an edge nor a node occurs twice.
Note that these modes, when enabled, can render relevant path restrictors
(TRAIL
, ACYCLIC
, SIMPLE
) unnecessary.
Unresolved Questions
MATCH-Associated WHERE
The pattern grammar proposed in this RFC supports a variant of WHERE
clause within
node, edge, and path (grouping) patterns.
In a few places, [1] also mentions WHERE
clause that is not a part of any
node, edge, or path fragment of a pattern,
but rather applies to the entirety of the pattern in a graph MATCH
expression.
This WHERE
is sometimes referred to as a "postfilter" of a pattern.
From the limited discussion in the paper and the lack of examples on the matter,
it is not fully clear whether such WHERE
is meant to be the WHERE
of the "host"
select-from-where query (in SQL or PartiQL) or it is a new kind of WHERE
that is coupled with MATCH
.
In the case of the latter, the syntax of graph match expression would have to accommodate this possibility with a grammar production like
<expr_query> ::=
// ...
| '(' <expr_query> 'MATCH' <graph_pattern> [ 'WHERE' <expr_query> ]? ')'
Currently, this is not being proposed for the reasons of economy of design — until and unless it becomes certain that the "host" WHERE
cannot support necessary use cases.
Graphical Predicates and First-Class Nodes/Edges
GPML[1] supports several graphical predicates, such as
e IS DIRECTED
, n IS SOURCE OF
e, n IS DESTINATION OF
e,
SAME(`v1
,` v2`,` …)
, ALL_DIFFERENT(`v1
,` v2`,` …)
,
which can be used in WHERE
filters associated with nodes, edges, and paths (groupings),
as well as, presumably, in the post-filter WHERE
associated with the overall pattern
(see the previous section).
Evaluation of these predicates, naturally, requires that nodes n and edges e
to which they are applied are entities that are "aware" of their location within
a graph G to which they belong.
In particular, the values n and e cannot be merely the payloads
taken from nodes and edges of G.
Since the semantics proposed in this RFC equates graph elements with their payloads for all purposes except the internal mechanics of pattern matching computation, this RFC does not specify support for the graphical predicates.
Properly supporting them would require modeling values of nodes and edges as something like in-graph references of the form (id, G) or similar and, if we are to support property lookups v*.a* as naturally as possible so far, introducing rules on when an in-graph reference is treated as itself vs. when it is reduced to its payload.
Then, an important question would be whether these in-graph references were to become
first-class values in PartiQL or be limited to the scope of MATCH
expressions.
-
If they were to become first-class, then the post-filter
WHERE
could likely coincide with the "host" PartiQLWHERE
. However, this would significantly complicate the overall data model: in addition to graph values, it would be expanded with in-graph reference values, of variants specific to nodes, edges, and, possibly, paths. Anticipating these consequences makes it worthwhile to investigate whether such reference could be an instance of some more general construct, perhaps relevant to update operations in non-graph parts of the data model as well. -
If they were to be limited in scope, this would give further justification to adding the post-filter
WHERE
as an option toMATCH
expressions. The non-graph PartiQL data model would remain simple, while the overall language should still be able to support all interesting use cases. It seems likely this approach is being chosen for SQL/PGQ.
Expressing Path Bindings in MATCH Result
While GPML patterns support path variables, which bind to graph paths within
graph fragments that constitute the result of a pattern match,
this RFC does not specify how to represent these path bindings in the
final result of a PartiQL MATCH
expression.
This is obvious from the explicit exclusions in two places in the evaluation semantics
where paths are encountered.
While representing a singleton variable bound to a node or an edge as the latter’s payload fairly naturally translates to representing a group variable as a sequence of payloads for the nodes or edges that the variable annotates, settling on a natural way to extend this approach to path variables is not straightforward.
In the first approximation, the MATCH-result residual of a graph path bound to a path variable should probably be a list composed of payloads of certain nodes and edges on the path. However, clarifying details of this idea leads to questions such as
-
Should this residual only contain payloads of elements of the path named by node/edge variables or should it include anonymous elements as well?
-
Should we address the double representation of node/edge variables (once as the top-level attributes in the result struct and then again within the path’s residual)?
-
Is there a way to express the connection between the two representations?
-
Or should those variables that occur within a path be only present in the path’s residual and not on the top level? But then placement of node/edge variables in the result struct would become sensitive to the presence or absence of path variables, which is a disadvantage relative to the always-on-top-level placement.
-
-
Do nested named path patterns lead to a hierarchical structure of the result struct, or should these paths be "flattened out"?
While it could be possible to cobble together a "reasonable" solution based on some choices from the above, the choices would be essentially arbitrary. At the same time, this design challenge is likely to be addressed in SQL/PGQ, so it would be fitting for the PartiQL solution to be consistent.
References
Graph Pattern Matching in GQL and SQL/PGQ. SIGMOD'22: Proceedings of the 2022 International Conference on Management of Data, June 2022, pp. 2246—2258. PDF