GROUP BY Clause
The PartiQL GROUP BY
clause expands SQL’s grouping. Unlike SQL, the
PartiQL GROUP BY
can be thought of as a standalone operator that
inputs a collection of binding tuples and outputs a collection of
binding tuples.
As is typical in many clauses, the semantics proceed in two steps:
-
Core GROUP BY explains the core PartiQL
GROUP BY
structure. -
SQL GROUP BY shows that SQL’s can be explained over the core
GROUP BY
.
Core GROUP BY
The GROUP BY
clause
GROUP BY e1 AS x1, ..., em AS xm GROUP AS g
creates a group. Each \$e_i\$ is a grouping expression, each \$x_i\$ is a grouping variable [1] and \$g\$ is the group variable.
As in SQL, the bag of input binding tuples \$B_"GROUP"^"in"\$ is
partitioned into the minimal number of equivalence groups
\$B_1,...,B_n\$, such that any two binding tuples \$b, b' in
B_"GROUP"^"in"\$ are in the same equivalence group if and only if every
grouping expression \$e_i\$ evaluates to equivalent values
\$v_i\$ (when evaluated on \$b\$) and \$ v_i' \$ (when
evaluated on \$b'\$). More precisely, as in SQL, there is an
equivalence function \$"eqg"\$, used by the GROUP BY
to determine
if two values \$v_i\$ and \$ v_i' \$ are equivalent for grouping
purposes. The equivalence function \$"eqg"(v_i, v_i' )\$ returns
only true or false; true meaning that the values are equivalent for
grouping purposes. See Equivalence in Grouping for specifics of eqg
. If a
grouping expression evaluates to MISSING
, it is first coerced into
NULL
, thus bringing MISSING
and NULL
in the same group.
Unlike SQL, for each group \$B_j (1 <= j <= n)\$, the GROUP BY
clause outputs a binding tuple \$b_j = (: x_1 : v_1,...,x_m : v_m,
g : B_j:)\$ that has the full group \$B_j\$. Notice:
-
the binding tuples that appear in the \$g\$ collection have one attribute for each of the variables defined in the
FROM
clause, since these binding tuples come as-is from \$B_"GROUP"^"in"\$. -
even if the bag \$B_"GROUP"^"in"\$ is flat binding tuples, the output bag \$B_"GROUP"^"out"\$ is not just flat binding tuples, since \$g\$ has nested binding tuples. Note, we have been explicitly denoting binding attributes with
MISSING
values in the binding tuples. However, once these binding tuples become the tuples of the PartiQL data model, any binding attribute withMISSING
value will not appear.
Consider the data logs
and assume that we want to group the co
readings by sensor. The following query solves the problem using only core features.
<
sensors: [
{'sensor':1},
{'sensor':2}
],
logs: [
{'sensor':1, 'co':0.4},
{'sensor':1, 'co':0.2},
{'sensor':2, 'co':0.3}
]
>
SELECT VALUE {
'sensor': sensor,
'readings': (SELECT VALUE v.l.co FROM g AS v)
}
FROM logs AS l
GROUP BY l.sensor AS sensor GROUP AS g
The GROUP BY
outputs the collection of binding tuples
\$ B_"GROUP"^"out" = B_"SELECT"^"in" = \$ \$<<<<\$
\$ (: "sensor":1, g: <<<<\ (: l:{"'sensor'":1, "'co'":0.4} :), (: l:{"'sensor'":1, "'co'":0.2} :) >>>> :) \$
\$ (: "sensor":2, g: <<<<\ (: l:{"'sensor'":2, "'co'":0.3} :)\ >>>> :) \$
\$>>>>\$
Notice that the collection \$g\$ has tuples with a single attribute \$l\$, since
this is the single variable of the FROM
clause in this example.
Consequently the SELECT
clause outputs
\$<<<<\$
\$" " {"'sensor'":1, "'readings'": <<<< 0.4, 0.2 >>>> }, \$
\$" " {"'sensor'":2, "'readings'": <<<< 0.3 >>>> } \$
\$>>>>\$
Notice that the query of the nested SELECT example and the query of the present example do not always produce the same result. For
example, if there were no readings for a sensor, the query of the nested SELECT example would still have this sensor in the result (and its readings
would be empty). In contrast, the query of the present example will not have this sensor in the result.
Here is a shorter equivalent query that uses PartiQL collection paths and SQL’s aliases.
SELECT VALUE {
'sensor': sensor,
'readings': g[*].l.co
}
FROM logs AS l
GROUP BY l.sensor AS sensor GROUP AS g
Notice, the output binding tuple provides the partitioned input
binding tuples in the group variable \$g\$, which can be explicitly
utilized in subsequent HAVING
, ORDER BY
, and SELECT
clauses. Thus, an PartiQL query can perform complex computations on
the groups, leading to results of any type (e.g. collections nested
within collections). The explicit presence of groups in PartiQL, while
more general than SQL, also leads to simpler semantics than those of
SQL, since the GROUP BY
clause semantics are independent of the presence of
subsequent functions in HAVING
, ORDER BY
, and SELECT
.
The following PartiQL query counts and averages the readings of each
sensor. It also refers to the logs
value from the previous example. The
COLL_COUNT
function is simply given the group variable and counts
how many elements are in that collection.
SELECT VALUE {
'sensor': sensor,
'avg': COLL_AVG(SELECT VALUE v.l.co FROM g AS v),
'count': COLL_COUNT(g)
}
GROUP BY l.sensor AS sensor GROUP AS g
Notice, the aggregate functions COLL_AVG
and COLL_COUNT
(and for
that matter, by convention, any function starting with COLL
) can be
thought of as general-purpose functions. Generally, they do not have
to be fed by the result of a grouping operation - unlike SQL’s COUNT
and AVG
that are being fed exclusively from the results of grouping
operations. (Furthermore, the SQL COUNT
and AVG
make use of SQL’s
syntactic sugar, where there is no explicit use of group variable, as
explained in SQL’s Implicit Group Variable.)
This is a legitimate PartiQL expression:
COLL_COUNT([5, {a:2, b:3}])
The result is 2
, since the input to COLL_COUNT
is an array with
two elements.
Similarly, it is fine to include in any clause an aggregate function fed by the result of a (sub)query.
In the following expression COLL_COUNT
inputs the result of a query
COLL_COUNT(SELECT VALUE x FROM logs x WHERE x.sensor=1)
The PartiQL approach provides two benefits: First, it leads to
shorter, modular semantics. Second, it enables GROUP BY
to address
use cases that would otherwise need knowledge and non-trivial SQL
programming of window functions. See Windows by Grouping.
Equivalence in Grouping
The equivalence function \$eqg\$ extends SQL’s respective function. In particular, it behaves as follows:
-
\$eqg("NULL", "NULL")\$ is true, despite \$"NULL"="NULL"\$ not being true.
-
for any two non-null values \$x\$ and \$y\$, \$\gl{eqg}(x,y)\$ returns the same with \$x=y\$. As is the case generally for \$=\$, while SQL’s \$=\$ will error when given incompatible types, while the PartiQL \$=\$ will return
false
.
Notice that PartiQL will group together the NULL
and the MISSING
grouping expressions, since any grouping expression resulting to
MISSING
has been coerced into NULL
before eqg
does comparisons
for grouping. Example 5 shows the repercussions
of coercing NULL
into MISSING
and also shows how to discriminate
between NULL
and MISSING
, if so desired.
The query of will group together any log
readings where the sensor
attribute is either NULL
or is
altogether MISSING
. For example, if logs
is
logs: [
{'sensor': 1, 'co':0.4},
{'sensor': 2, 'co':0.3},
{'sensor': null, 'co':0.1},
{'sensor': 1, 'co':0.2},
{'co':0.5}
]
then the GROUP BY
will output the collection of binding tuples
\$ B_"GROUP"^"out" = B_"SELECT"^"in" = \$ \$<<<<\$
\$" " (: "sensor":1, g: <<<< (: l:{"'sensor'":1, "'co'":0.4} :) (: l:{"'sensor'":1, "'co'":0.2} :) >>>> :) \$
\$" " (: "sensor":2, g: <<<< (: l:{"'sensor'":2, "'co'":0.3} :) >>>> :) \$
\$" " (: "sensor":"null", g: <<<< (: l:{"'sensor'":"null", "'co'":0.1} :) (: l:{"'co'":0.5} :) > > :) \$
\$>>>>\$
Notice that both the 3rd and 5th tuples logs
of were grouped under
the \$"sensor":"null"\$ group, despite the sensor
of the 3rd
being NULL
while the sensor
of the 5th being MISSING
. The query
result is
<<
{'sensor':1, 'readings':<0.4, 0.2>},
{'sensor':2, 'readings':<0.3>},
{'sensor':null, 'readings':<0.1, 0.5>}
>>
If we wanted to discriminate the NULL
from the MISSING
we could
write the following query
SELECT VALUE {
'sensor': CASE WHEN missingFlag THEN MISSING ELSE sensor END,
'readings': (SELECT VALUE v.l.co FROM g AS v)
}
FROM logs AS l
GROUP BY l.sensor IS MISSING AS missingFlag, l.sensor AS sensor GROUP AS g
In this case the would output the collection of binding tuples
\$ B_"GROUP"^"out" = B_"SELECT"^"in" = \$ \$<<<<\$
\$ (: "missingFlag":"false", "sensor":1, g: <<<< (: l:{"'sensor'":1, "'co'":0.4} :) (: l:{"'sensor'":1, "'co'":0.2} :) >>>> :) \$
\$ (: "missingFlag":"false", "sensor":2, g: <<<< (: l:{"'sensor'":2, "'co'":0.3} :) >>>> :) \$
\$ (: "missingFlag":"false", "sensor":"null", g: <<<< (: l:{"'sensor'":"null", "'co'":0.1} :) >>>> :) \$
\$ (: "missingFlag":"true", "sensor":"null", g: <<<< (: l:{"'co'":0.5} :) >>>> :) \$
\$>>>>\$
and the query result would be
<<
{'sensor':1, 'readings':<0.4, 0.2>},
{'sensor':2, 'readings':<0.3>},
{'sensor':null, 'readings':<0.1>},
{'readings':<0.5>}
>>
SQL GROUP BY
The group-by and aggregation of PartiQL is backwards compatible to SQL.
Grouping Attributes and Direct Use of Grouping Expressions
For SQL compatibility PartiQL allows GROUP BY …,e,…
i.e., a
grouping expression \$e\$ that is not associated with a grouping
variable \$x\$. (In core PartiQL, one would write GROUP BY …,e
AS x,…
.)
For SQL compatibility, PartiQL supports using the grouping expression
e
in HAVING
, ORDER BY
, and SELECT
clauses.
The SQL form:
FROM ...
GROUP BY e, ...
HAVING f(e, ...)
ORDER BY f2(e, ...)
SELECT f3(e, ...)
is syntactic sugar for the core PartiQL:
FROM ...
GROUP BY e AS x, ...
HAVING f(x, ...)
ORDER BY f2(x, ...)
SELECT f3(x, ...)
The SQL-compatible query
SELECT v.a+1 AS bar
FROM foo AS v
GROUP BY v.a+1
is written in core PartiQL as
SELECT VALUE {'bar': x}
FROM foo AS v
GROUP BY v.a+1 AS x GROUP AS dontcare
SQL’s Implicit Group Variable
SQL does not have explicit group variables. For SQL compatibility, PartiQL allows the SQL aggregation functions to be fed by expressions that do not explicitly say that there is iteration over the group variable. Suppose that a query
-
is a
SELECT
query, -
lacks a
GROUP AS
clause, and -
any of the
SELECT
,HAVING
, and/orORDER BY
clauses contains a function call \$f(e)\$, where \$f\$ is a SQL aggregation function such asSUM
andAVG
. (See Designation of SQL aggregate functions)
Then, the query is rewritten as follows:
-
if the query has a
GROUP BY
clause, add to itGROUP AS g
where
g
is a fresh variable, i.e., a variable that is not a database name nor a variable of the query or a variable of the queries within which it is nested. -
if the query has no
GROUP BY
clause, add to itGROUP ALL GROUP AS g
where
g
is a fresh variable. -
if the aggregation function call is
COUNT(*)
, then rewrite intoCOUNT(g)
-
otherwise, rewrite \$f(e)\$ into
f(SELECT VALUE e1 FROM g AS p)
where \$e1\$ is produced from \$e\$ as follows: Consider the variables \$v_1, ..., v_n\$ that appear in \$B_"GROUP"^"in"\$ (i.e., the variables defined by the query’s
FROM
andLET
clauses) and are not grouping attributes. Substitute each identifier \$v_i\$ (that does not stand for attribute name) in \$e\$ with \$p.v_i\$.
Consider again the query of GROUP BY with Collection Aggregates. It can be written in an SQL compatible way as
SELECT l.sensor AS sensor,
AVG(l.co) AS avg,
COUNT(*) AS count
FROM logs AS l
GROUP BY l.sensor
The query of GROUP ALL Core can be written in standard SQL syntax as
SELECT COUNT(g) AS largeco
FROM logs AS l
WHERE l.co > 1.5
Notice that SQL does not allow nested aggregate
functions. Respectively, PartiQL does not allow one to write queries
that lack a GROUP AS
or GROUP ALL
clause and have nested aggregate
SQL functions.
Designation of SQL aggregate functions
Each implementation will have a list of SQL aggregate functions, which
are not necessarily just the ones prescribed by the standard (COUNT
,
SUM
, AVG
, etc). (Recall from SQL’s Implicit Group Variable that
SQL aggregate functions do not use an explicit group variable.)
Furthermore, it is required that for each SQL aggregate function f
,
if an implementation offers a corresponding core PartiQL aggregate
function, the PartiQL function is named COLL_f
. For example, the
core PartiQL aggregate COLL_AVG
corresponds to the SQL AVG
aggregate . Nevertheless, it is possible that an implementation
offers only COLL_AVG
or offers only AVG
. The semantic relationship
between the SQL aggregate function and the corresponding core PartiQL
aggregate function is the one explained in
SQL’s Implicit Group Variable: The SQL aggregate functions do not
input explicit group variables and, thus, their semantics are
explained by the reduction to the corresponding core PartiQL
aggregate.
Aliases from the SELECT
Clause
In SQL, a grouping expression may be an alias that is defined by the
SELECT
clause. For compatibility purposes, PartiQL adopts the same
behavior.
The query, which uses the SELECT
-defined alias feature:
SELECT ...,e AS a,...
FROM ...
GROUP BY ...,a,...
is syntactic sugar for the query:
SELECT ...e AS a,...
FROM ...
GROUP BY ...,e,...
Notice that the grouping expression a
is simply a shorthand for e
.
In the case that the grouping expression is a constant positive
integer literal \$n\$, then it stands for the \$n\$th attribute
of the SELECT
clause. However, this requires that the tuples
produced by the SELECT
have schema and they are ordered tuples. The
relevant examples will be provided in the schema section.
Consider the database
people: <<
{'name': 'zoe', 'age': 10, 'tag': 'child'},
{'name': 'zoe', 'age': 20, 'tag': 'adult'},
{'name': 'bill', 'age': 30, 'tag': 'adult'}
>>
The query
SELECT p.tag || ':' || p.name AS tagname, AVG(p.age) AS average
FROM people AS p
GROUP BY tagname
is equivalent to the query
SELECT p.tag || ':' || p.name AS tagname, AVG(p.age) AS average
FROM people AS p
GROUP BY p.tag || ':' || p.name
Either query results into
people: <<
{'tagname': 'child:zoe', 'average': 10},
{'tagname': 'adult:zoe', 'average': 20},
{'tagname': 'adult:bill', 'average': 30}
>>
Consider again a collection of sensor readings, this time with a timestamp.
logs: [
{'sensor':1, 'co':0.4, 'timestamp':04:05:06},
{'sensor':1, 'co':0.2, 'timestamp':04:05:07},
{'sensor':1, 'co':0.5, 'timestamp':04:05:10},
{'sensor':2, 'co':0.3}
]
We look for the “jump" readings that are more than 2x the previous
reading at the same sensor. The following query solves the problem
using GROUP BY
.
SELECT sensor AS sensor,
(WITH orderedReadings
AS (SELECT v FROM oneSensorsReadings v ORDER BY v.timestamp)
SELECT r.co, r.timestamp
FROM orderedReadings r AT p
WHERE r.co > 2*orderedReadings[p-1].co
ORDER BY p
) AS jumpReadings
FROM logs l
GROUP BY l.sensor AS sensor GROUP AS oneSensorsReadings
The result is
<<
{'sensor':1, 'jumpReadings':[{'co':0.4, 'timestamp':04:05:06}]},
{'sensor':2, 'jumpReadings':[]}
>>
GROUP ALL
The GROUP ALL
variant of GROUP BY
outputs a single binding tuple,
regardless of whether the FROM
/WHERE
produced any tuples, i.e.,
regardless of whether its input \$B_"GROUP"^"in"\$ is empty
or not.
The GROUP ALL
is not increasing the expressiveness of PartiQL.
GROUP ALL Core shows how to achieve without GROUP ALL
,
what the GROUP ALL
can do. However, we include GROUP ALL
for
facilitating the reduction of SQL’s aggregation into the core PartiQL
(see SQL’s Implicit Group Variable).
Consider again the logs
data and assume
that we want to count the total number of readings that are above
1.5
with a core PartiQL query. (SQL GROUP BY Nothing does
the same with SQL.)
SELECT VALUE {'largeco': COLL_COUNT(g)}
FROM logs AS l
WHERE l.co > 1.5
GROUP ALL AS g
Notice, there are no readings above 1.5
in the example data. Since there is
no tuple that satisfies the WHERE
clause
\$B_"WHERE"^"out" = B_"GROUP"^"in" = <<<<\ >>>>\$
\$B_"GROUP"^"out" = B_"SELECT"^"in" = <<<<\ (: g: <<<<\ >>>>\ :)\ >>>>\$
Since COLL_COUNT(<<>>)
is 0
, the query result is the collection
<< {'largeco': 0} >>
Therefore the PartiQL query is equivalent to the plain SQL query
SELECT COUNT(*) AS largeco
FROM logs AS l
WHERE l.co > 1.5
The following core PartiQL also accomplishes the same computation,
without using GROUP ALL
.
{ 'largeco': COLL_COUNT(SELECT VALUE l FROM logs AS l WHERE l.co > 1.5) }
GROUP BY
to be seen as a standalone function.