Tutorial
PartiQL provides SQL-compatible, unified query access across multiple data stores containing structured, semi-structured and nested data. PartiQL separates the syntax and semantics of a query from the underlying data source and data format. It enables users to interact with data with or without a regular schema.
This introduction aims to teach SQL users the PartiQL extensions to SQL. The tutorial is primarily driven by "how to" examples.
For the reader interested in the full details and formal specification of PartiQL, we recommend the 2-tiered PartiQL formal specification. The formal specification first describes the PartiQL core, which is a short and concise functional programming language. The specification then layers SQL compatibility on top, as syntactic sugar that shows how each SQL feature translates to a semantically equivalent core PartiQL expression.
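To get a flavor of this layering, here is a sketch (using the hr.employees table introduced below): per the specification, an SQL-style SELECT clause is syntactic sugar for a core SELECT VALUE over a tuple constructor.

SELECT e.name AS employeeName
FROM hr.employees AS e

-- is, in the core, equivalent to

SELECT VALUE { 'employeeName': e.name }
FROM hr.employees AS e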
PartiQL is backwards compatible with SQL-92. We will see what this compatibility means when PartiQL is used to query data found in diverse data formats and data stores.
For starters, given the table hr.employees
id   name          title
---- ------------- ----------------
3    Bob Smith     null
4    Susan Smith   Dev Mgr
6    Jane Smith    Software Eng 2
the following SQL query
SELECT e.id, e.name AS employeeName, e.title AS title
FROM hr.employees e
WHERE e.title = 'Dev Mgr'
is also a valid PartiQL query. As we know from SQL, when this query
operates on the table hr.employees
it will return the result
id   employeeName  title
---- ------------- ---------
4    Susan Smith   Dev Mgr
REPL
PartiQL provides an interactive shell, or Read Evaluate Print Loop (REPL), that allows users to write and evaluate PartiQL queries.
Installation
The REPL can be obtained from a GitHub release.
partiql-cli
├── bin
│   ├── partiql       // Mac/Unix
│   └── partiql.bat   // Windows
├── lib
│   └── ...
├── README.md
└── Tutorial
    ├── code
    │   └── ...
    ├── tutorial.html
    └── tutorial.pdf
The root folder partiql-cli contains a README.md file and 3 subfolders:
- The folder bin contains the startup scripts partiql for macOS and Unix systems and partiql.bat for Windows systems. Execute these files to start the REPL.
- The folder lib contains all the Java libraries needed to run PartiQL.
- The folder Tutorial contains the tutorial in pdf and html form. The subfolder code contains 3 types of files:
  - Data files with the extension .env. These files contain PartiQL data that we can query.
  - PartiQL query files with the extension .sql. These files contain the PartiQL queries used in the tutorial.
  - Sample query output files with the extension .output. These files contain sample output from running the tutorial queries on the appropriate data.
To exit the PartiQL REPL:

- press Control+D on macOS or Unix
- press Control+C on Windows

or close the terminal/command prompt window.
Loading Data from Disk
An easy way to load the necessary data into the REPL is to use the -e switch when starting the REPL, providing the name of a file that contains your data.
./bin/partiql -e Tutorial/code/q1.env
You can then see what is loaded in the REPL's global environment using the REPL command !global_env.
For convenience, we have provided the file tutorial-all-data.env in the folder Tutorial/code/. You will also find separate .env files in the same folder for each query in the tutorial.
For example, running
./bin/partiql -e Tutorial/code/tutorial-all-data.env
will load all the data used in the tutorial into the REPL. This will allow you to copy-paste queries from the tutorial into the REPL and try them out.
Data Model
PartiQL operates not just on SQL tables but also on data that may have nesting, union types, different attributes across different tuples, and many other features that we often find in today’s nested and/or semi-structured formats, like JSON, Ion, Parquet, etc.
To capture this generality, PartiQL is based on a logical type system: the PartiQL Data Model. Each PartiQL implementation maps data formats, like JSON, Parquet etc., into a PartiQL data set that follows the PartiQL data model. PartiQL queries work on the PartiQL data set abstraction.
For example, the table hr.employees
is denoted in the PartiQL data
model as this dataset
{
'hr': {
'employees': <<
-- a tuple is denoted by { ... } in the PartiQL data model
{ 'id': 3, 'name': 'Bob Smith', 'title': null },
{ 'id': 4, 'name': 'Susan Smith', 'title': 'Dev Mgr' },
{ 'id': 6, 'name': 'Jane Smith', 'title': 'Software Eng 2'}
>>
}
}
Notice that the employees
is nested within hr
.
The delimiters << … >> denote that the data is an unordered collection (also known as a bag), as is the case with SQL tables. That is, there is no order among the three tuples. Single-line comments start with -- and end at the end of the line.
A very different kind of data source may lead to the same PartiQL dataset. For example, a set of JSON files that contain the following JSON objects
{
"hr" : {
"employees": [
{ "id": 3, "name": "Bob Smith", "title": null },
{ "id": 4, "name": "Susan Smith", "title": "Dev Mgr" },
{ "id": 6, "name": "Jane Smith", "title": "Software Eng 2"}
]
}
}
will likely be abstracted by a PartiQL-supporting implementation into a PartiQL abstraction identical to that of the hr.employees table.
You will keep noticing the similarity of the PartiQL notation to the JSON notation. Notice also the subtle differences: in the interest of SQL compatibility, PartiQL string literals are single-quoted, while JSON literals are double-quoted.
You may conceptually think of a deserializer that inputs JSON and outputs the PartiQL data set. But do not assume that the query processing of a PartiQL implementation will actually have to parse and abstract into PartiQL each and every bit of the underlying data storage.
Back to our query
SELECT e.id,
e.name AS employeeName,
e.title AS title
FROM hr.employees e
WHERE e.title = 'Dev Mgr'
Instead of a SQL result set, evaluating the query in PartiQL produces:
<<
{
'id': 4,
'employeeName': 'Susan Smith',
'title': 'Dev Mgr'
}
>>
the result remains the same, no matter whether hr.employees
is a SQL
table or a JSON file. All that is needed is an association between the
name hr.employees
and the PartiQL abstraction of the data.
In the same spirit, the same PartiQL abstraction may come from a CSV file or a Parquet file, a format that has gained significant traction thanks to the efficient way in which it stores data. Again, the same query makes perfect sense, regardless of the storage format behind hr.employees.
LEARN MORE

PartiQL data sets look very much like JSON. What are the differences?
Indeed, PartiQL adopts the tuple/object and array notation of JSON. However, PartiQL string literals are denoted by single quotes. Importantly, the scalar types of PartiQL are those of SQL, not just the strings, numbers and booleans of JSON.

Do implementations need to have a catalog?
If queries refer to names, a catalog logically validates whether each name exists. However, we will also see PartiQL queries that refer to no names, as sketched below.
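For example, here is a minimal sketch of a query that needs no catalog at all, because it ranges over a bag literal rather than a cataloged name:

SELECT v.a + 1 AS result
FROM << { 'a': 1 }, { 'a': 2 } >> AS v

returns << { 'result': 2 }, { 'result': 3 } >>.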
Literals
Literals of the PartiQL query language correspond to the types in the PartiQL data model:
- scalars, including null, which follow the SQL syntax when applicable. For example: 5, 'foo'
- tuples (also known as structs and/or objects in many formats and other data models), denoted by {...} with tuple elements separated by commas. For example: { 'id': 3, 'arr': [1, 2] }
- arrays, denoted by [...] with array elements separated by commas. For example: [ 1, 'foo' ]
- bags, denoted by << ... >> with bag elements separated by commas. For example: << 1, 'foo' >>

Notice that, in the spirit of the PartiQL data model, literals compose freely and any kind of literal may appear within any tuple, array and bag literal, e.g.,
{ 'id': 3, 'matrix': [ [2, 4, 6], 'NA' ] }
Querying Nested Data
In SQL-92, tables contain tuples whose attribute values are scalars. A key feature of many modern formats is nested data. That is, attributes whose values may themselves be tables (i.e., collections of tuples), or may be arrays of scalars, or arrays of arrays, and many other combinations. Let's take a closer look at PartiQL's features (SQL extensions) that allow us to work with nested data.
We also include sections titled "Use Case". Such "Use Case" sections do not introduce additional features. They merely show how to combine the few novel PartiQL features with standard SQL features in order to solve a large number of problems.
Nested Collections
Let’s now add the nested attribute projects
into the data set.
{
'hr': {
'employeesNest': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
{ 'name': 'AWS Redshift Spectrum querying' },
{ 'name': 'AWS Redshift security' },
{ 'name': 'AWS Aurora security' }
]
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'projects': []
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'projects': [ { 'name': 'AWS Redshift security' } ]
}
>>
}
}
Notice that the value of 'projects'
is an array. Arrays are denoted by
[ ... ]
with array elements separated by commas. In our example the
array happens to be an array of tuples. We will see that arrays may be
arrays of anything, not just arrays of tuples.
Unnesting a Nested Collection
The following query finds the names of employees who work on projects
that contain the string 'security'
and outputs them along with the
name of the 'security'
project. Notice that the query has just one
extension over standard SQL — the e.projects AS p
part.
SELECT e.name AS employeeName, p.name AS projectName
FROM hr.employeesNest AS e, e.projects AS p
WHERE p.name LIKE '%security%'
The output of our query is
<<
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Redshift security'
},
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Aurora security'
},
{
'employeeName': 'Jane Smith',
'projectName': 'AWS Redshift security'
}
>>
The extension over SQL is the FROM clause item e.projects AS p. Standard SQL would attempt to find a schema named e with a table projects and, since in our example there is no e.projects table, the query would fail. In contrast, PartiQL recognizes e.projects as referring to the projects attribute of e.
Once we allow this extension, the semantics are just like SQL's. The alias (also called a variable in PartiQL) e gets bound to each employee, in turn. For each employee, the variable p gets bound to each project of the employee, in turn. Thus the query's meaning, like SQL's, is
foreach employee tuple e from hr.employeesNest
foreach project tuple p from e.projects
if p.name LIKE '%security%'
output e.name AS employeeName, p.name AS projectName
Notice that our query involved variables ranging over nested collections (p in the example), along with variables ranging over tables (e in the example), as standard SQL aliases do. All variables, no matter what they range over, can be used anywhere in the FROM, WHERE and SELECT clauses, as we will see in the examples that follow.
LEARN MORE

Can I only unnest arrays of tuples?
No, anything can be unnested: for example, arrays of scalars.

Does e.projects AS p have to appear in the same FROM clause that defines e?
No. For example, see below the use cases that involve subqueries. There, e and p are defined in separate FROM clauses.

How could I force e.projects to refer to the nested attribute projects even if there were a schema named e with a table projects?
Use the syntax @e.projects; a sketch follows this list. Recall that, in the absence of the @, in the interest of SQL compatibility, PartiQL will first attempt to dereference e.projects against the catalog.

SQL allows me to avoid writing an explicit alias e when I write, say, e.name. Can I avoid writing the e in PartiQL as well?
SQL allows us to avoid writing aliases (variables) when the schema of the tables allows correct dereferencing. PartiQL does the same. However, recall that a schema is not necessary for a PartiQL data set; indeed, our example has not assumed a schema. In the absence of a schema, you cannot omit the aliases (variables). For example, if you write just name and there is no schema, PartiQL cannot tell whether you mean the employee name or the project name. Thus you need to write the alias (variable) explicitly.

There is one exception to this rule: if your query has a single item in its FROM clause, you can omit the alias (variable). E.g., you can write

SELECT name FROM hr.employeesNest

In this case it is apparent that name can only be an employee name and thus PartiQL allows you to omit the alias (variable). Nevertheless, for clarity, we recommend that you always use aliases (variables), and this is what this tutorial does.

If there is a schema, can I avoid writing the alias p?
No. The p has to be written in order to denote the iteration over the projects.
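For instance, here is a sketch of the @ syntax described above. On our sample data, where no catalog name e exists, it behaves identically to the earlier unnesting query:

SELECT e.name AS employeeName, p.name AS projectName
FROM hr.employeesNest AS e, @e.projects AS p
WHERE p.name LIKE '%security%'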
Unnesting Nested Collections Using JOIN
In this section, we simply present an alternate way to express and think about unnesting collections.
One may think that the FROM clause of the example executes, in a sense, a JOIN between employees and projects. If it helps you to think in terms of JOIN, you may replace the comma with CROSS JOIN. That is, the following two queries are equivalent.
SELECT e.name AS employeeName, p.name AS projectName
FROM hr.employeesNest AS e, e.projects AS p
WHERE p.name LIKE '%security%'

is equivalent to

SELECT e.name AS employeeName, p.name AS projectName
FROM hr.employeesNest AS e CROSS JOIN e.projects AS p
WHERE p.name LIKE '%security%'
Unnesting with LEFT JOIN
Assume that we want to write a query that returns as a bag of tuples the
entire employee and project information from hr.employeesNest
. The
query result we want is this bag of tuples with attributes id
,
employeeName
, title
and projectName
:
<<
{
'id': 3,
'employeeName': 'Bob Smith',
'title': NULL,
'projectName': 'AWS Redshift Spectrum querying'
},
{
'id': 3,
'employeeName': 'Bob Smith',
'title': NULL,
'projectName': 'AWS Redshift security'
},
{
'id': 3,
'employeeName': 'Bob Smith',
'title': NULL,
'projectName': 'AWS Aurora security'
},
{
'id': 4,
'employeeName': 'Susan Smith',
'title': 'Dev Mgr'
},
{
'id': 6,
'employeeName': 'Jane Smith',
'title': 'Software Eng 2',
'projectName': 'AWS Redshift security'
}
>>
Notice that there is a 'Susan Smith' tuple in the result, despite the fact that Susan has no project. Since the LEFT JOIN finds no project for Susan, her output tuple simply has no projectName attribute. We can obtain this result by combining employees and projects using the LEFT JOIN operator, as follows:
SELECT e.id AS id,
e.name AS employeeName,
e.title AS title,
p.name AS projectName
FROM hr.employeesNest AS e LEFT JOIN e.projects AS p ON true
The semantics of this query can be thought of as
foreach employee tuple e from hr.employeesNest
    if e.projects is an empty collection then   // this part is special about LEFT JOINs
        output e.id AS id, e.name AS employeeName, e.title AS title
        (and no projectName: the join binds p to null, so p.name evaluates to MISSING)
    else                                        // identical to plain (inner) JOINs
        foreach project tuple p from e.projects
            output e.id AS id, e.name AS employeeName, e.title AS title,
                   p.name AS projectName
Use Case: Checking whether a nested collection satisfies a condition
The following use cases employ the unnesting features, which we have already discussed, in new use cases. A lesson that emerges is that we can use variables (SQL aliases) that range over nested data as if they were standard SQL aliases. This realization gives us the power to solve a great number of use cases just by combining the unnesting features with features we already know from standard SQL.
In our first use case we want a query that returns the names of the
employees that are involved in a project that contains the word
'security'
. The solution employs SQL’s "EXISTS
(subquery)" feature,
along with unnesting:
SELECT e.name AS employeeName
FROM hr.employeesNest AS e
WHERE EXISTS ( SELECT *
FROM e.projects AS p
WHERE p.name LIKE '%security%')
returns
<<
{
'employeeName': 'Bob Smith'
},
{
'employeeName': 'Jane Smith'
}
>>
In the second use case we want a query that outputs the names of the employees that have more than one security project, and we are aware of a key for employees (e.g., an attribute that is guaranteed to have a unique value for each employee). We can find the requested employees by utilizing a combination of GROUP BY and HAVING. In our example, let's assume that the id attribute is a primary key for the employees. Then we can find the employees with more than one security project with this query:
SELECT e.name AS employeeName
FROM hr.employeesNest AS e, e.projects AS p
WHERE p.name LIKE '%security%'
GROUP BY e.id, e.name
HAVING COUNT(*) > 1
which returns
<<
{
'employeeName': 'Bob Smith'
}
>>
Use Case: Subqueries that aggregate over nested collections
Next, let's find how many querying projects (that is, projects whose name contains the word 'querying') each employee has.
Making the same assumption as before, that id is a key for employees, we can solve the problem with the query
SELECT e.name AS employeeName, COUNT(p.name) AS queryProjectsNum
FROM hr.employeesNest e LEFT JOIN e.projects AS p ON p.name LIKE '%querying%'
GROUP BY e.id, e.name
that returns
<<
  { 'employeeName': 'Bob Smith', 'queryProjectsNum': 1 },
  { 'employeeName': 'Susan Smith', 'queryProjectsNum': 0 },
  { 'employeeName': 'Jane Smith', 'queryProjectsNum': 0 }
>>
Notice this query’s result includes Susan Smith and Jane Smith, who have no querying projects.
Nested Structs
A value may also be a tuple, also called an object or a struct in many models and formats. For example, the project value in the following tuples is always a tuple with a project name and a project org.
{
'hr': {
'employeesWithTuples': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'project': {
'name': 'AWS Redshift Spectrum querying',
'org': 'AWS'
}
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'project': {
'name': 'AWS Redshift security',
'org': 'AWS'
}
}
>>
}
}
PartiQL’s multistep paths enable navigating within tuples. For example, the following query finds AWS projects and outputs the project name and employee name.
SELECT e.name AS employeeName, e.project.name AS projectName
FROM hr.employeesWithTuples e
WHERE e.project.org = 'AWS'
The result is
<<
  { 'employeeName': 'Bob Smith', 'projectName': 'AWS Redshift Spectrum querying' },
  { 'employeeName': 'Jane Smith', 'projectName': 'AWS Redshift security' }
>>
Other Nested Values
The previous examples have shown nested attributes that were arrays of tuples. It need not be the case that the nested attributes are collections of tuples. They may just as well be arrays of scalars, arrays of arrays, or any combination of data that one can create by composing scalars, tuples and arrays. You need not learn a different set of query language features for each case. The unnesting features, which we have already seen, are sufficient.
Use Case: Unnesting Arrays of Scalars
The list of projects associated with each employee in hr.employeesNest
could have been simply a list of project name strings. Replacing the
nested tuples with plain strings gives us
{
'hr': {
'employeesNestScalars': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
'AWS Redshift Spectrum querying',
'AWS Redshift security',
'AWS Aurora security'
]
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'projects': []
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'projects': [ 'AWS Redshift security' ]
}
>>
}
}
Let us repeat the previous use cases on the revised employee data.
The following query finds the names of employees who work on projects whose name contains the string 'security' and outputs them along with the name of the 'security' project.
SELECT e.name AS employeeName,
p AS projectName
FROM hr.employeesNestScalars AS e,
e.projects AS p
WHERE p LIKE '%security%'
The preceding query returns
<<
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Redshift security'
},
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Aurora security'
},
{
'employeeName': 'Jane Smith',
'projectName': 'AWS Redshift security'
}
>>
The variable p ranges (again) over the content of e.projects. In this case, since e.projects contains strings (as opposed to tuples), the variable p binds each time to a project name string. Thus, this query can be thought of as executing the following snippet.
foreach employee tuple e from hr.employeesNestScalars
    foreach project p from e.projects
        if the string p matches '%security%'
            output e.name AS employeeName and the string p AS projectName
Use Case: Unnesting Arrays of Arrays
Arrays may also contain arrays, directly, without intervening tuples, as
in the matrices
data set.
{
'matrices': <<
{
'id': 3,
'matrix': [
[2, 4, 6],
[1, 3, 5, 7],
[9, 0]
]
},
{
'id': 4,
'matrix': [
[5, 8],
[ ]
]
}
>>
}
The following query finds every even number and outputs the even number
and the id
of the tuple where it was found.
SELECT t.id AS id, x AS even
FROM matrices AS t, t.matrix AS y, y AS x
WHERE x % 2 = 0
The preceding query returns
<<
{
'id': 3,
'even': 2
},
{
'id': 3,
'even': 4
},
{
'id': 3,
'even': 6
},
{
'id': 3,
'even': 0
},
{
'id': 4,
'even': 8
}
>>
Informally the query’s evaluation can be thought of as
foreach tuple t from matrices
foreach array y from t.matrix
foreach number x from y
if x is even then
output t.id AS id and x AS even
Querying Schemaless Data
Many formats do not require a schema that describes the data; that is, the data is schemaless. In such cases it is possible to have various "heterogeneities" in the data:

- One tuple may have an attribute x while another tuple may not have this attribute.
- In one tuple of a collection an attribute x may be of one type (e.g., string), while in another tuple of the same collection the same attribute x may be of a different type (e.g., array).
- The elements of a collection (be it a bag or an array) can be heterogeneous, that is, not have the same type. For example, the first element may be a string, the second one an integer and the third one an array.
- Generally, any composition is possible, as we can bundle heterogeneous elements in arrays and bags.
Heterogeneities are not particular to schemaless data. Schemas may also allow for heterogeneity in the types of the data. For example, one of the Hive data types is the union type, which allows a value to belong to any one of a list of types. Consider the following schema, whose projects attribute may be either a string or an array of strings:
CREATE TABLE employeesMixed (
    id INT,
    name STRING,
    title STRING,
    projects UNIONTYPE<STRING, ARRAY<STRING>>
);
A collection of PartiQL tuples that follows this schema could be
{
'hr': {
'employeesMixed1': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
'AWS Redshift Spectrum querying',
'AWS Redshift security',
'AWS Aurora security'
]
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'projects': []
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'projects': 'AWS Redshift security'
}
>>
}
}
Thus we see that data may have heterogeneities — regardless of whether they are described by a schema or not. PartiQL tackles heterogeneous data in ways that we will see in the next use cases and feature presentations.
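As a first taste of how, here is a sketch over hr.employeesMixed1 (assuming the IS STRING type predicate and the array constructor behave as in the PartiQL specification) that normalizes the heterogeneous projects attribute so that it is an array in every output tuple:

SELECT e.name AS name,
       CASE WHEN e.projects IS STRING
            THEN [ e.projects ]  -- wrap a lone project string in a one-element array
            ELSE e.projects
       END AS projects
FROM hr.employeesMixed1 AS e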
Missing Attributes
Let’s go back to the hr.employees
table (that is, bag of tuples). Bob
Smith has no title and, as is typical in SQL, the lack of title is
modeled with the null
value.
{
'hr': {
'employees': <<
      { 'id': 3, 'name': 'Bob Smith', 'title': null },
      { 'id': 4, 'name': 'Susan Smith', 'title': 'Dev Mgr' },
      { 'id': 6, 'name': 'Jane Smith', 'title': 'Software Eng 2'}
>>
}
}
Nowadays, many semi-structured formats allow users to represent "missing" information in two ways:

- The first way is the use of null.
- The second way is the plain absence of the attribute from the tuple.

That is, we can represent the fact that Bob Smith has no title by simply having no title attribute in the 'Bob Smith' tuple:
{
'hr': {
'employeesWithMissing': <<
{ 'id': 3, 'name': 'Bob Smith' }, -- no title in this tuple
{ 'id': 4, 'name': 'Susan Smith', 'title': 'Dev Mgr' },
{ 'id': 6, 'name': 'Jane Smith', 'title': 'Software Eng 2'}
>>
}
}
PartiQL does not take a position on when to use nulls and when to use "missing"; myriad datasets already use one of the two, or both. However, PartiQL enables queries to distinguish between null and missing values, and also enables query results that contain nulls and missing values. Indeed, PartiQL makes it very easy to propagate source data nulls as query result nulls and source data missing attributes as result missing attributes.
The MISSING Value
Consider again this PartiQL query, which happens to also be a valid SQL query.
SELECT e.id,
e.name AS employeeName,
e.title AS title
FROM hr.employeesWithMissing AS e
WHERE e.title = 'Dev Mgr'
What will happen when the query goes over the Bob Smith tuple, which has
no title
?
The first step to answering this question is understanding the result of
the path e.title
when the alias (variable) e
binds to the tuple
{ 'id': 3, 'name': 'Bob Smith' }
. In more basic terms, what is the
result of the expression { 'id': 3, 'name': 'Bob Smith' }.title
?
PartiQL says that it is the special value MISSING. MISSING behaves very similarly to null.
Evaluating Functions and Conditions with MISSING
If a function (including an infix operator like =) inputs MISSING, the function's result is NULL. In the case of the example, this means that the WHERE clause condition e.title = 'Dev Mgr' will evaluate to NULL when e binds to { 'id': 3, 'name': 'Bob Smith' } and, as usual in SQL, the WHERE clause fails when it does not evaluate to true. Thus the output will be
<<
{
'id': 4,
'employeeName': 'Susan Smith',
'title': 'Dev Mgr'
}
>>
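Conversely, if we want exactly the employees whose title is absent, PartiQL's IS MISSING predicate lets us test for it directly. A minimal sketch:

SELECT e.name AS employeeName
FROM hr.employeesWithMissing AS e
WHERE e.title IS MISSING

which returns << { 'employeeName': 'Bob Smith' } >>.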
Propagating MISSING in Result Tuples
What would happen if a missing attribute or, more generally, an
expression returning MISSING
appears in the SELECT
?
SELECT e.id,
e.name AS employeeName,
e.title AS outputTitle
FROM hr.employeesWithMissing AS e
The query will output one tuple for each employee. When it outputs the Bob Smith tuple, the e.title will evaluate to MISSING and the output tuple will not have an outputTitle attribute.
<<
  { 'id': 3, 'employeeName': 'Bob Smith' },
  { 'id': 4, 'employeeName': 'Susan Smith', 'outputTitle': 'Dev Mgr' },
  { 'id': 6, 'employeeName': 'Jane Smith', 'outputTitle': 'Software Eng 2' }
>>
The same treatment of MISSING
would happen if, say, we had this query
that converts titles to capital letters:
SELECT e.id,
e.name AS employeeName,
UPPER(e.title) AS outputTitle
FROM hr.employeesWithMissing AS e
Again, the e.title will evaluate to MISSING for 'Bob Smith'; the UPPER(e.title) is then UPPER(MISSING), which evaluates to NULL. Thus the result will be:
<<
{
'id': 3,
'employeeName': 'Bob Smith',
'outputTitle': NULL
},
{
'id': 4,
'employeeName': 'Susan Smith',
'outputTitle': 'DEV MGR'
},
{
'id': 6,
'employeeName': 'Jane Smith',
'outputTitle': 'SOFTWARE ENG 2'
}
>>
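If we prefer a concrete default over a null, COALESCE can supply one; it returns its first argument that is neither NULL nor (per the treatment of unknowns in PartiQL's built-ins) MISSING. A sketch:

SELECT e.id,
       e.name AS employeeName,
       COALESCE(e.title, 'No title') AS outputTitle
FROM hr.employeesWithMissing AS e

Here Bob Smith's outputTitle becomes 'No title' rather than NULL or an absent attribute.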
Ranging over Heterogeneous Values
A PartiQL variable (called an alias in SQL) can bind to values of different types during a query's evaluation. This is unlike SQL, where variables always bind to tuples. It is even different from what happened in Use Case: Unnesting Arrays of Scalars and in Use Case: Unnesting Arrays of Arrays.
In the first use case, the PartiQL variable p
happened to always bind
to a string (given the particular sample data of the example). In the
second use case, the PartiQL variable y
was always bound to an array
(again, given the particular sample data of the example).
To make the case for variables that bind to different types, consider
the following twist in the employeesNest
data set. Some of the
elements of the projects
array are plain strings and some are tuples.
Even the employee tuples do not always have the same attributes.
{
'hr': {
'employeesMixed2': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
{ 'name': 'AWS Redshift Spectrum querying' },
'AWS Redshift security',
{ 'name': 'AWS Aurora security' }
]
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'projects': []
},
{
'id': 6,
'name': 'Jane Smith',
'projects': [ 'AWS Redshift security']
}
>>
}
}
This query on hr.employeesMixed2 produces pairs of employee name and employee project.
SELECT e.name AS employeeName,
CASE WHEN (p IS TUPLE) THEN p.name ELSE p END AS projectName
FROM hr.employeesMixed2 AS e,
e.projects AS p
Notice the sub-expression (p IS TUPLE). The IS operator can be used to check the type of a value at evaluation time. Notice also that the variable p binds to values of different types. (A sketch that uses IS in the WHERE clause follows the table below.)
In general, the FROM
clause of a query binds its variables (aliases)
to data. The variables need not bind to data that have the same types.
Each binding is fed to the SELECT
clause, which evaluates its
expressions.
This table shows the variable bindings produced by the FROM clause and the corresponding tuple output by the SELECT clause.
Variable e | Variable p | Result tuple |
---|---|---|
{ 'id': 3, 'name': 'Bob Smith', 'title': null, 'projects': [ { 'name': 'AWS Redshift Spectrum querying' }, 'AWS Redshift security', { 'name': 'AWS Aurora security' } ] } |
{ 'name': 'AWS Redshift Spectrum querying' } |
{ 'employeeName': 'Bob Smith', 'projectName': 'AWS Redshift Spectrum querying' } |
{ 'id': 3, 'name': 'Bob Smith', 'title': null, 'projects': [ { 'name': 'AWS Redshift Spectrum querying' }, 'AWS Redshift security', { 'name': 'AWS Aurora security' } ] } |
'AWS Redshift security' |
{ 'employeeName': 'Bob Smith', 'projectName': 'AWS Redshift security' } |
{ 'id': 3, 'name': 'Bob Smith', 'title': null, 'projects': [ { 'name': 'AWS Redshift Spectrum querying' }, 'AWS Redshift security', { 'name': 'AWS Aurora security' } ] } |
{ 'name': 'AWS Aurora security' } |
{ 'employeeName': 'Bob Smith', 'projectName': 'AWS Aurora security' } |
{ 'id': 6, 'name': 'Jane Smith', 'projects': [ 'AWS Redshift security' ] } |
'AWS Redshift security' |
{ 'employeeName': 'Jane Smith', 'projectName': 'AWS Redshift security' } |
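The IS operator also works in the WHERE clause. For example, here is a sketch that keeps only the projects represented as tuples, skipping the plain-string ones:

SELECT e.name AS employeeName, p.name AS projectName
FROM hr.employeesMixed2 AS e, e.projects AS p
WHERE p IS TUPLE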
Array Access
SQL allows us to order the output of a query using the ORDER BY clause. However, the SQL data model does not recognize order in the input data. In contrast, many of the new data formats feature arrays, whose elements have an order. We may want to find an array element by its index, or we may want to find the positions of certain elements in their arrays.
Let’s consider again the dataset hr.employeesNest
.
{
'hr': {
'employeesNest': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
{ 'name': 'AWS Redshift Spectrum querying' },
{ 'name': 'AWS Redshift security' },
{ 'name': 'AWS Aurora security' }
]
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'projects': []
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'projects': [ { 'name': 'AWS Redshift security' } ]
}
>>
}
}
The projects
attribute is an array of tuples; that is, each tuple has
an ordinal associated with it. The following query returns each employee
name, along with the first project of the employee.
SELECT e.name AS employeeName,
e.projects[0].name AS firstProjectName
FROM hr.employeesNest AS e
The query returns
<<
{
'employeeName': 'Bob Smith',
'firstProjectName': 'AWS Redshift Spectrum querying'
},
{
'employeeName': 'Susan Smith'
},
{
'employeeName': 'Jane Smith',
'firstProjectName': 'AWS Redshift security'
}
>>
Multistep Paths
Technically, the structure [<number>]
is a kind of path step. For
example, notice the 4-step path e.projects[0].name
. When e
is bound
to the first tuple of hr.employeesNest
, then the path e.projects
results into the array
[
{ 'name': 'AWS Redshift Spectrum querying' },
{ 'name': 'AWS Redshift security' },
{ 'name': 'AWS Aurora security' }
]
Consequently applying the [0]
step on e.projects
(that is,
evaluating e.projects[0]
) leads to
{'name': 'AWS Redshift Spectrum querying'}
. Finally, evaluating the
.name
step on e.projects[0]
(that is, evaluating
e.projects[0].name
) leads to 'AWS Redshift Spectrum querying'
.
Finding the Index of Each Element in an Array
Let's assume that each employee's projects are sorted in priority order. The following query finds the name of each employee involved in a security project, the security project, and its index in the projects array.
SELECT e.name AS employeeName,
p.name AS projectName,
o AS projectPriority
FROM hr.employeesNest AS e,
e.projects AS p AT o
WHERE p.name LIKE '%security%'
Notice the new feature: AT o. While p ranges over the elements of the array e.projects, the variable o is bound to the index of each element in the array. The query returns:
<<
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Redshift security',
'projectPriority': 1
},
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Aurora security',
'projectPriority': 2
},
{
'employeeName': 'Jane Smith',
'projectName': 'AWS Redshift security',
'projectPriority': 0
}
>>
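Because o is an ordinary variable, it can also appear in conditions. For example, here is a sketch that returns each employee's highest-priority (first-listed) project; employees with no projects, like Susan, contribute no tuple:

SELECT e.name AS employeeName, p.name AS topProject
FROM hr.employeesNest AS e, e.projects AS p AT o
WHERE o = 0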
Pivoting & Unpivoting
Many queries need to range over and collect the attribute name/value pairs of tuples or the key/value pairs of maps.
Unpivoting Tuples
Consider this dataset that provides the closing prices of multiple ticker symbols.
{
'closingPrices': <<
{ 'date': '4/1/2019', 'amzn': 1900, 'goog': 1120, 'fb': 180 },
{ 'date': '4/2/2019', 'amzn': 1902, 'goog': 1119, 'fb': 183 }
>>
}
The following query unpivots the stock ticker/price pairs.
SELECT c."date" AS "date",
sym AS "symbol",
price AS price
FROM closingPrices AS c,
UNPIVOT c AS price AT sym
WHERE NOT sym = 'date'
Notice the use of double quotes in this query. The double quotes allow us to disambiguate between date the keyword and "date" the identifier. Double quotes can also specify case sensitivity for attribute lookups.
The query returns
<<
{
'date': '4/1/2019',
'symbol': 'amzn',
'price': 1900
},
{
'date': '4/1/2019',
'symbol': 'goog',
'price': 1120
},
{
'date': '4/1/2019',
'symbol': 'fb',
'price': 180
},
{
'date': '4/2/2019',
'symbol': 'amzn',
'price': 1902
},
{
'date': '4/2/2019',
'symbol': 'goog',
'price': 1119
},
{
'date': '4/2/2019',
'symbol': 'fb',
'price': 183
}
>>
Unpivoting tuples enables the use of attribute names as if they were data. For example, it becomes easy to compute the average price for each symbol as
SELECT sym AS "symbol",
AVG(price) AS avgPrice
FROM closingPrices c,
UNPIVOT c AS price AT sym
WHERE NOT sym = 'date'
GROUP BY sym
which returns
<<
{
'symbol': 'amzn',
'avgPrice': 1901
},
{
'symbol': 'fb',
'avgPrice': 181.5
},
{
'symbol': 'goog',
'avgPrice': 1119.5
}
>>
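Once unpivoted, the attribute names can participate in any SQL construct. For instance, a sketch that counts how many ticker symbols each date carries:

SELECT c."date" AS "date",
       COUNT(sym) AS symbolCount
FROM closingPrices AS c,
     UNPIVOT c AS price AT sym
WHERE NOT sym = 'date'
GROUP BY c."date"

which returns a symbolCount of 3 for each of the two dates.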
Pivoting into Tuples
Pivoting turns a collection into a tuple. For example, consider the collection
{
'todaysStockPrices': <<
{ 'symbol': 'amzn', 'price': 1900},
{ 'symbol': 'goog', 'price': 1120},
{ 'symbol': 'fb', 'price': 180 }
>>
}
Then the following PIVOT
query
PIVOT sp.price AT sp."symbol"
FROM todaysStockPrices sp
produces the tuple
{
'amzn': 1900,
'goog': 1120,
'fb': 180
}
Notice that the PIVOT query looks like a SELECT-FROM-WHERE-... query, except that instead of a SELECT clause it has a PIVOT <value expression> AT <attribute expression> clause. Note also that the PIVOT query does not return a collection of tuples; rather, it literally returns a single tuple value.
Pivoting Subqueries
This example also uses the grouping features of PartiQL.
Let us generalize the previous case of pivoting. We have a table of stock prices
{
'stockPrices':<<
{ 'date': '4/1/2019', 'symbol': 'amzn', 'price': 1900},
{ 'date': '4/1/2019', 'symbol': 'goog', 'price': 1120},
{ 'date': '4/1/2019', 'symbol': 'fb', 'price': 180 },
{ 'date': '4/2/2019', 'symbol': 'amzn', 'price': 1902},
{ 'date': '4/2/2019', 'symbol': 'goog', 'price': 1119},
{ 'date': '4/2/2019', 'symbol': 'fb', 'price': 183 }
>>
}
and we want to pivot it into a collection of tuples, where each tuple
has all the symbol:price
pairs for a date, as follows
<<
  {
    'date': '4/1/2019',
    'prices': {'amzn': 1900, 'goog': 1120, 'fb': 180}
  },
  {
    'date': '4/2/2019',
    'prices': {'amzn': 1902, 'goog': 1119, 'fb': 183}
  }
>>
The following query first creates one group, datesPrices, for each date. Then the PIVOT subquery pivots each group into the tuple prices.
SELECT sp."date" AS "date",
       (PIVOT dp.sp.price AT dp.sp."symbol"
        FROM datesPrices AS dp) AS prices
FROM stockPrices AS sp GROUP BY sp."date" GROUP AS datesPrices
For example, the datesPrices collection returned by GROUP AS for sp."date" = '4/1/2019' is
'datesPrices': <<
{
'sp': {
'date': '4/1/2019',
'symbol': 'amzn',
'price': 1900
}
},
{
'sp': {
'date': '4/1/2019',
'symbol': 'goog',
'price': 1120
}
},
{
'sp': {
'date': '4/1/2019',
'symbol': 'fb',
'price': 180
}
}
>>
Nested Results
PartiQL allows queries that create nested results as well as queries that create heterogeneous results.
The SELECT VALUE Clause
Let’s consider again the dataset hr.employeesNestScalars
:
{
'hr': {
'employeesNestScalars': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
'AWS Redshift Spectrum querying',
'AWS Redshift security',
'AWS Aurora security'
]
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'projects': []
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'projects': [ 'AWS Redshift security' ]
}
>>
}
}
The following query outputs each tuple of hr.employeesNestScalars
,
except that instead of all projects each tuple has only the security
projects of the employee. The important new feature here is the
SELECT VALUE <expression>
.
SELECT e.id AS id,
e.name AS name,
e.title AS title,
( SELECT VALUE p
FROM e.projects AS p
WHERE p LIKE '%security%'
) AS securityProjects
FROM hr.employeesNestScalars AS e
The result is
<<
{
'id': 3,
'name': 'Bob Smith',
'title': NULL,
'securityProjects': <<
'AWS Redshift security',
'AWS Aurora security'
>>
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'securityProjects': <<>>
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'securityProjects': <<
'AWS Redshift security'
>>
}
>>
A SELECT VALUE <expression>
query (or subquery, as in this example)
returns a collection of whatever the <expression>
evaluates to.
Notice the difference from SQL’s SELECT
, which always produces tuples.
If a SQL SELECT
appears as a subquery, then the context of the
subquery designates whether the subquery’s result should be coerced into
a scalar (e.g., when 5 = <subquery>
), coerced into a collection of
scalars (e.g., when 5 IN <subquery>
), etc. None of this applies to
SELECT VALUE
, which produces a collection and this collection is not
coerced.
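For example, at the top level a SELECT VALUE can return a bag of scalars directly, which a plain SQL SELECT cannot. A minimal sketch over hr.employees:

SELECT VALUE e.name
FROM hr.employees AS e

returns << 'Bob Smith', 'Susan Smith', 'Jane Smith' >>.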
GROUP BY and GROUP AS
Another pattern for creating nested results in PartiQL is the GROUP AS extension to SQL's GROUP BY. This pattern is more efficient and more intuitive than nested SELECT VALUE queries when the required nesting does not follow the nesting of the input. (The example in the previous section is one where the nesting of the output follows the nesting of the input and thus an intuitive solution does not involve GROUP BY.)
The following query outputs each security project found in
hr.employeesNestScalars
along with the list of employee names that
work on the project.
SELECT p AS projectName,
( SELECT VALUE v.e.name
FROM perProjectGroup AS v ) AS employees
FROM hr.employeesNestScalars AS e JOIN e.projects AS p ON p LIKE '%security%'
GROUP BY p GROUP AS perProjectGroup
The result is
<<
{
'projectName': 'AWS Aurora security',
'employees': <<
'Bob Smith'
>>
},
{
'projectName': 'AWS Redshift security',
'employees': <<
'Bob Smith',
'Jane Smith'
>>
}
>>
The GROUP AS
generalizes SQL’s GROUP BY
by making the formulated
groups available in their entirety to the query’s SELECT
and HAVING
clauses. Contrast with SQL’s GROUP BY
, where the SELECT
and HAVING
clauses can have aggregate functions over grouped columns but they
cannot get access to the individual values of the grouped columns.
To better understand the workings of GROUP BY ... GROUP AS
it is best
to think of PartiQL queries as a pipeline of clauses, starting with the
FROM
, continuing with the GROUP BY
and finishing with the SELECT
.
Each clause is a function that inputs data and outputs data. In that
sense, the GROUP BY ... GROUP AS
is a function that inputs the result
of the FROM
and outputs its result to the SELECT
.
The following query (conceptually) produces the output of the FROM
clause.
SELECT e AS e, p AS p
FROM hr.employeesNestScalars AS e JOIN e.projects AS p ON p LIKE '%security%'
We see that the FROM clause delivers the collection of tuples, each consisting of an employee e and a project p, that were output by the JOIN. This is like SQL's FROM semantics.
Variable e | Variable p |
---|---|
{ 'id': 3, 'name': 'Bob Smith', 'title': null, 'projects': [ 'AWS Redshift Spectrum querying', 'AWS Redshift security', 'AWS Aurora security' ] } |
'AWS Redshift security' |
{ 'id': 3, 'name': 'Bob Smith', 'title': null, 'projects': [ 'AWS Redshift Spectrum querying', 'AWS Redshift security', 'AWS Aurora security' ] } |
'AWS Aurora security' |
{ 'id': 6, 'name': 'Jane Smith', 'title': 'Software Eng 2', 'projects': [ 'AWS Redshift security' ] } |
'AWS Redshift security' |
Then the GROUP BY ... GROUP AS ... can be thought of as outputting a table that has one column for each group-by expression (i.e., each security project p) and a last column perProjectGroup whose value (conceptually) is the collection of employee/project (e/p) tuples that correspond to the group-by expression p. Thus the GROUP BY ... GROUP AS ... output is the table
p | perProjectGroup |
---|---|
'AWS Redshift security' |
<< { e: { 'id': 3, 'name': 'Bob Smith', … }, p: 'AWS Redshift security' }, { e: { 'id': 6, 'name': 'Jane Smith', … }, p: 'AWS Redshift security' } >> |
'AWS Aurora security' |
<< { e: { 'id': 3, 'name': 'Bob Smith', …}, p: 'AWS Aurora security' }, >> |
Finally the SELECT
clause inputs the above and outputs the query
result.
Note: PartiQL implementations may provide their own mappings from popular data formats, e.g., CSV, TSV, JSON, Ion, etc., to the PartiQL data model, and/or allow clients to implement their own mappings.
is an ordered list. PartiQL implementations may provide their own mappings from popular data formats, e.g., CSV, TSV, JSON, Ion etc., to the PartiQL data model and/or allow clients to implements their own mappings.