GROUP BY Clause
The PartiQL GROUP BY clause expands SQL’s grouping. Unlike SQL, the
PartiQL GROUP BY can be thought of as a standalone operator that
inputs a collection of binding tuples and outputs a collection of
binding tuples.
As is typical in many clauses, the semantics proceed in two steps:
-
Core GROUP BY explains the core PartiQL
GROUP BYstructure. -
SQL GROUP BY shows that SQL’s can be explained over the core
GROUP BY.
Core GROUP BY
The GROUP BY clause
GROUP BY e1 AS x1, ..., em AS xm GROUP AS g
creates a group. Each \$e_i\$ is a grouping expression, each \$x_i\$ is a grouping variable [1] and \$g\$ is the group variable.
As in SQL, the bag of input binding tuples \$B_"GROUP"^"in"\$ is
partitioned into the minimal number of equivalence groups
\$B_1,...,B_n\$, such that any two binding tuples \$b, b' in
B_"GROUP"^"in"\$ are in the same equivalence group if and only if every
grouping expression \$e_i\$ evaluates to equivalent values
\$v_i\$ (when evaluated on \$b\$) and \$ v_i' \$ (when
evaluated on \$b'\$). More precisely, as in SQL, there is an
equivalence function \$"eqg"\$, used by the GROUP BY to determine
if two values \$v_i\$ and \$ v_i' \$ are equivalent for grouping
purposes. The equivalence function \$"eqg"(v_i, v_i' )\$ returns
only true or false; true meaning that the values are equivalent for
grouping purposes. See Equivalence in Grouping for specifics of eqg. If a
grouping expression evaluates to MISSING, it is first coerced into
NULL, thus bringing MISSING and NULL in the same group.
Unlike SQL, for each group \$B_j (1 <= j <= n)\$, the GROUP BY
clause outputs a binding tuple \$b_j = (: x_1 : v_1,...,x_m : v_m,
g : B_j:)\$ that has the full group \$B_j\$. Notice:
-
the binding tuples that appear in the \$g\$ collection have one attribute for each of the variables defined in the
FROMclause, since these binding tuples come as-is from \$B_"GROUP"^"in"\$. -
even if the bag \$B_"GROUP"^"in"\$ is flat binding tuples, the output bag \$B_"GROUP"^"out"\$ is not just flat binding tuples, since \$g\$ has nested binding tuples. Note, we have been explicitly denoting binding attributes with
MISSINGvalues in the binding tuples. However, once these binding tuples become the tuples of the PartiQL data model, any binding attribute withMISSINGvalue will not appear.
Consider the data logs and assume that we want to group the co readings by sensor. The following query solves the problem using only core features.
<
sensors: [
{'sensor':1},
{'sensor':2}
],
logs: [
{'sensor':1, 'co':0.4},
{'sensor':1, 'co':0.2},
{'sensor':2, 'co':0.3}
]
>
SELECT VALUE {
'sensor': sensor,
'readings': (SELECT VALUE v.l.co FROM g AS v)
}
FROM logs AS l
GROUP BY l.sensor AS sensor GROUP AS g
The GROUP BY outputs the collection of binding tuples
\$ B_"GROUP"^"out" = B_"SELECT"^"in" = \$ \$<<<<\$
\$ (: "sensor":1, g: <<<<\ (: l:{"'sensor'":1, "'co'":0.4} :), (: l:{"'sensor'":1, "'co'":0.2} :) >>>> :) \$
\$ (: "sensor":2, g: <<<<\ (: l:{"'sensor'":2, "'co'":0.3} :)\ >>>> :) \$
\$>>>>\$
Notice that the collection \$g\$ has tuples with a single attribute \$l\$, since
this is the single variable of the FROM clause in this example.
Consequently the SELECT clause outputs
\$<<<<\$
\$" " {"'sensor'":1, "'readings'": <<<< 0.4, 0.2 >>>> }, \$
\$" " {"'sensor'":2, "'readings'": <<<< 0.3 >>>> } \$
\$>>>>\$
Notice that the query of the nested SELECT example and the query of the present example do not always produce the same result. For
example, if there were no readings for a sensor, the query of the nested SELECT example would still have this sensor in the result (and its readings would be empty). In contrast, the query of the present example will not have this sensor in the result.
Here is a shorter equivalent query that uses PartiQL collection paths and SQL’s aliases.
SELECT VALUE {
'sensor': sensor,
'readings': g[*].l.co
}
FROM logs AS l
GROUP BY l.sensor AS sensor GROUP AS g
Notice, the output binding tuple provides the partitioned input
binding tuples in the group variable \$g\$, which can be explicitly
utilized in subsequent HAVING, ORDER BY, and SELECT
clauses. Thus, an PartiQL query can perform complex computations on
the groups, leading to results of any type (e.g. collections nested
within collections). The explicit presence of groups in PartiQL, while
more general than SQL, also leads to simpler semantics than those of
SQL, since the GROUP BY clause semantics are independent of the presence of
subsequent functions in HAVING, ORDER BY, and SELECT.
The following PartiQL query counts and averages the readings of each
sensor. It also refers to the logs value from the previous example. The
COLL_COUNT function is simply given the group variable and counts
how many elements are in that collection.
SELECT VALUE {
'sensor': sensor,
'avg': COLL_AVG(SELECT VALUE v.l.co FROM g AS v),
'count': COLL_COUNT(g)
}
GROUP BY l.sensor AS sensor GROUP AS g
Notice, the aggregate functions COLL_AVG and COLL_COUNT (and for
that matter, by convention, any function starting with COLL) can be
thought of as general-purpose functions. Generally, they do not have
to be fed by the result of a grouping operation - unlike SQL’s COUNT
and AVG that are being fed exclusively from the results of grouping
operations. (Furthermore, the SQL COUNT and AVG make use of SQL’s
syntactic sugar, where there is no explicit use of group variable, as
explained in SQL’s Implicit Group Variable.)
This is a legitimate PartiQL expression:
COLL_COUNT([5, {a:2, b:3}])
The result is 2, since the input to COLL_COUNT is an array with
two elements.
Similarly, it is fine to include in any clause an aggregate function fed by the result of a (sub)query.
In the following expression COLL_COUNT inputs the result of a query
COLL_COUNT(SELECT VALUE x FROM logs x WHERE x.sensor=1)
The PartiQL approach provides two benefits: First, it leads to
shorter, modular semantics. Second, it enables GROUP BY to address
use cases that would otherwise need knowledge and non-trivial SQL
programming of window functions. See Windows by Grouping.
Equivalence in Grouping
The equivalence function \$eqg\$ extends SQL’s respective function. In particular, it behaves as follows:
-
\$eqg("NULL", "NULL")\$ is true, despite \$"NULL"="NULL"\$ not being true.
-
for any two non-null values \$x\$ and \$y\$, \$\gl{eqg}(x,y)\$ returns the same with \$x=y\$. As is the case generally for \$=\$, while SQL’s \$=\$ will error when given incompatible types, while the PartiQL \$=\$ will return
false.
Notice that PartiQL will group together the NULL and the MISSING
grouping expressions, since any grouping expression resulting to
MISSING has been coerced into NULL before eqg does comparisons
for grouping. Example 5 shows the repercussions
of coercing NULL into MISSING and also shows how to discriminate
between NULL and MISSING, if so desired.
The query of will group together any log
readings where the sensor attribute is either NULL or is
altogether MISSING. For example, if logs is
logs: [
{'sensor': 1, 'co':0.4},
{'sensor': 2, 'co':0.3},
{'sensor': null, 'co':0.1},
{'sensor': 1, 'co':0.2},
{'co':0.5}
]
then the GROUP BY will output the collection of binding tuples
\$ B_"GROUP"^"out" = B_"SELECT"^"in" = \$ \$<<<<\$
\$" " (: "sensor":1, g: <<<< (: l:{"'sensor'":1, "'co'":0.4} :) (: l:{"'sensor'":1, "'co'":0.2} :) >>>> :) \$
\$" " (: "sensor":2, g: <<<< (: l:{"'sensor'":2, "'co'":0.3} :) >>>> :) \$
\$" " (: "sensor":"null", g: <<<< (: l:{"'sensor'":"null", "'co'":0.1} :) (: l:{"'co'":0.5} :) > > :) \$
\$>>>>\$
Notice that both the 3rd and 5th tuples logs of were grouped under
the \$"sensor":"null"\$ group, despite the sensor of the 3rd
being NULL while the sensor of the 5th being MISSING. The query
result is
<<
{'sensor':1, 'readings':<0.4, 0.2>},
{'sensor':2, 'readings':<0.3>},
{'sensor':null, 'readings':<0.1, 0.5>}
>>
If we wanted to discriminate the NULL from the MISSING we could
write the following query
SELECT VALUE {
'sensor': CASE WHEN missingFlag THEN MISSING ELSE sensor END,
'readings': (SELECT VALUE v.l.co FROM g AS v)
}
FROM logs AS l
GROUP BY l.sensor IS MISSING AS missingFlag, l.sensor AS sensor GROUP AS g
In this case the would output the collection of binding tuples
\$ B_"GROUP"^"out" = B_"SELECT"^"in" = \$ \$<<<<\$
\$ (: "missingFlag":"false", "sensor":1, g: <<<< (: l:{"'sensor'":1, "'co'":0.4} :) (: l:{"'sensor'":1, "'co'":0.2} :) >>>> :) \$
\$ (: "missingFlag":"false", "sensor":2, g: <<<< (: l:{"'sensor'":2, "'co'":0.3} :) >>>> :) \$
\$ (: "missingFlag":"false", "sensor":"null", g: <<<< (: l:{"'sensor'":"null", "'co'":0.1} :) >>>> :) \$
\$ (: "missingFlag":"true", "sensor":"null", g: <<<< (: l:{"'co'":0.5} :) >>>> :) \$
\$>>>>\$
and the query result would be
<<
{'sensor':1, 'readings':<0.4, 0.2>},
{'sensor':2, 'readings':<0.3>},
{'sensor':null, 'readings':<0.1>},
{'readings':<0.5>}
>>
SQL GROUP BY
The group-by and aggregation of PartiQL is backwards compatible to SQL.
Grouping Attributes and Direct Use of Grouping Expressions
For SQL compatibility PartiQL allows GROUP BY …,e,… i.e., a
grouping expression \$e\$ that is not associated with a grouping
variable \$x\$. (In core PartiQL, one would write GROUP BY …,e
AS x,….)
For SQL compatibility, PartiQL supports using the grouping expression
e in HAVING, ORDER BY, and SELECT clauses.
The SQL form:
FROM ...
GROUP BY e, ...
HAVING f(e, ...)
ORDER BY f2(e, ...)
SELECT f3(e, ...)
is syntactic sugar for the core PartiQL:
FROM ...
GROUP BY e AS x, ...
HAVING f(x, ...)
ORDER BY f2(x, ...)
SELECT f3(x, ...)
The SQL-compatible query
SELECT v.a+1 AS bar
FROM foo AS v
GROUP BY v.a+1
is written in core PartiQL as
SELECT VALUE {'bar': x}
FROM foo AS v
GROUP BY v.a+1 AS x GROUP AS dontcare
SQL’s Implicit Group Variable
SQL does not have explicit group variables. For SQL compatibility, PartiQL allows the SQL aggregation functions to be fed by expressions that do not explicitly say that there is iteration over the group variable. Suppose that a query
-
is a
SELECTquery, -
lacks a
GROUP ASclause, and -
any of the
SELECT,HAVING, and/orORDER BYclauses contains a function call \$f(e)\$, where \$f\$ is a SQL aggregation function such asSUMandAVG. (See Designation of SQL aggregate functions)
Then, the query is rewritten as follows:
-
if the query has a
GROUP BYclause, add to itGROUP AS gwhere
gis a fresh variable, i.e., a variable that is not a database name nor a variable of the query or a variable of the queries within which it is nested. -
if the query has no
GROUP BYclause, add to itGROUP ALL GROUP AS gwhere
gis a fresh variable. -
if the aggregation function call is
COUNT(*), then rewrite intoCOUNT(g) -
otherwise, rewrite \$f(e)\$ into
f(SELECT VALUE e1 FROM g AS p)where \$e1\$ is produced from \$e\$ as follows: Consider the variables \$v_1, ..., v_n\$ that appear in \$B_"GROUP"^"in"\$ (i.e., the variables defined by the query’s
FROMandLETclauses) and are not grouping attributes. Substitute each identifier \$v_i\$ (that does not stand for attribute name) in \$e\$ with \$p.v_i\$.
Consider again the query of GROUP BY with Collection Aggregates. It can be written in an SQL compatible way as
SELECT l.sensor AS sensor,
AVG(l.co) AS avg,
COUNT(*) AS count
FROM logs AS l
GROUP BY l.sensor
The query of GROUP ALL Core can be written in standard SQL syntax as
SELECT COUNT(g) AS largeco
FROM logs AS l
WHERE l.co > 1.5
Notice that SQL does not allow nested aggregate
functions. Respectively, PartiQL does not allow one to write queries
that lack a GROUP AS or GROUP ALL clause and have nested aggregate
SQL functions.
Designation of SQL aggregate functions
Each implementation will have a list of SQL aggregate functions, which
are not necessarily just the ones prescribed by the standard (COUNT,
SUM, AVG, etc). (Recall from SQL’s Implicit Group Variable that
SQL aggregate functions do not use an explicit group variable.)
Furthermore, it is required that for each SQL aggregate function f,
if an implementation offers a corresponding core PartiQL aggregate
function, the PartiQL function is named COLL_f. For example, the
core PartiQL aggregate COLL_AVG corresponds to the SQL AVG
aggregate . Nevertheless, it is possible that an implementation
offers only COLL_AVG or offers only AVG. The semantic relationship
between the SQL aggregate function and the corresponding core PartiQL
aggregate function is the one explained in
SQL’s Implicit Group Variable: The SQL aggregate functions do not
input explicit group variables and, thus, their semantics are
explained by the reduction to the corresponding core PartiQL
aggregate.
Aliases from the SELECT Clause
In SQL, a grouping expression may be an alias that is defined by the
SELECT clause. For compatibility purposes, PartiQL adopts the same
behavior.
The query, which uses the SELECT-defined alias feature:
SELECT ...,e AS a,...
FROM ...
GROUP BY ...,a,...
is syntactic sugar for the query:
SELECT ...e AS a,...
FROM ...
GROUP BY ...,e,...
Notice that the grouping expression a is simply a shorthand for e.
In the case that the grouping expression is a constant positive
integer literal \$n\$, then it stands for the \$n\$th attribute
of the SELECT clause. However, this requires that the tuples
produced by the SELECT have schema and they are ordered tuples. The
relevant examples will be provided in the schema section.
Consider the database
people: <<
{'name': 'zoe', 'age': 10, 'tag': 'child'},
{'name': 'zoe', 'age': 20, 'tag': 'adult'},
{'name': 'bill', 'age': 30, 'tag': 'adult'}
>>
The query
SELECT p.tag || ':' || p.name AS tagname, AVG(p.age) AS average
FROM people AS p
GROUP BY tagname
is equivalent to the query
SELECT p.tag || ':' || p.name AS tagname, AVG(p.age) AS average
FROM people AS p
GROUP BY p.tag || ':' || p.name
Either query results into
people: <<
{'tagname': 'child:zoe', 'average': 10},
{'tagname': 'adult:zoe', 'average': 20},
{'tagname': 'adult:bill', 'average': 30}
>>
Consider again a collection of sensor readings, this time with a timestamp.
logs: [
{'sensor':1, 'co':0.4, 'timestamp':04:05:06},
{'sensor':1, 'co':0.2, 'timestamp':04:05:07},
{'sensor':1, 'co':0.5, 'timestamp':04:05:10},
{'sensor':2, 'co':0.3}
]
We look for the “jump" readings that are more than 2x the previous
reading at the same sensor. The following query solves the problem
using GROUP BY.
SELECT sensor AS sensor,
(WITH orderedReadings
AS (SELECT v FROM oneSensorsReadings v ORDER BY v.timestamp)
SELECT r.co, r.timestamp
FROM orderedReadings r AT p
WHERE r.co > 2*orderedReadings[p-1].co
ORDER BY p
) AS jumpReadings
FROM logs l
GROUP BY l.sensor AS sensor GROUP AS oneSensorsReadings
The result is
<<
{'sensor':1, 'jumpReadings':[{'co':0.4, 'timestamp':04:05:06}]},
{'sensor':2, 'jumpReadings':[]}
>>
GROUP ALL
The GROUP ALL variant of GROUP BY outputs a single binding tuple,
regardless of whether the FROM/WHERE produced any tuples, i.e.,
regardless of whether its input \$B_"GROUP"^"in"\$ is empty
or not.
The GROUP ALL is not increasing the expressiveness of PartiQL.
GROUP ALL Core shows how to achieve without GROUP ALL,
what the GROUP ALL can do. However, we include GROUP ALL for
facilitating the reduction of SQL’s aggregation into the core PartiQL
(see SQL’s Implicit Group Variable).
Consider again the logs data and assume
that we want to count the total number of readings that are above
1.5 with a core PartiQL query. (SQL GROUP BY Nothing does
the same with SQL.)
SELECT VALUE {'largeco': COLL_COUNT(g)}
FROM logs AS l
WHERE l.co > 1.5
GROUP ALL AS g
Notice, there are no readings above 1.5 in the example data. Since there is
no tuple that satisfies the WHERE clause
\$B_"WHERE"^"out" = B_"GROUP"^"in" = <<<<\ >>>>\$
\$B_"GROUP"^"out" = B_"SELECT"^"in" = <<<<\ (: g: <<<<\ >>>>\ :)\ >>>>\$
Since COLL_COUNT(<<>>) is 0, the query result is the collection
<< {'largeco': 0} >>
Therefore the PartiQL query is equivalent to the plain SQL query
SELECT COUNT(*) AS largeco
FROM logs AS l
WHERE l.co > 1.5
The following core PartiQL also accomplishes the same computation,
without using GROUP ALL.
{ 'largeco': COLL_COUNT(SELECT VALUE l FROM logs AS l WHERE l.co > 1.5) }
GROUP BY to be seen as a standalone function.