Tutorial
PartiQL provides SQL-compatible unified query access across multiple data stores containing structured, semi-structured and nested data. PartiQL separates the syntax and semantics of a query from the underlying data source and data format. It enables users to interact with data with[1] or without regular schema.
This introduction aims to teach SQL users the PartiQL extensions to SQL. The tutorial is primarily driven by "how to" examples.
For the reader who is interested in the full details and formal specification of PartiQL, we recommend the 2-tiered PartiQL formal specification. The formal specification first describes the PartiQL Core, which is a short and concise functional programming language. Then the specification layers SQL compatibility through syntactic sugar that shows how SQL features can be translated to semantically equivalent core PartiQL expressions. These translations presented as syntactic sugar enable SQL compatibility.
PartiQL is backwards compatible with SQL-92[2]. We will see what compatibility means when it is used to query data found in data formats and data stores.
For starters, given the table hr.employees
Id name title ---- ------------- ---------------- 3 Bob Smith null 4 Susan Smith Dev Mgr 6 Jane Smith Software Eng 2
the following SQL query
SELECT e.id, e.name AS employeeName, e.title AS title
FROM hr.employees e
WHERE e.title = 'Dev Mgr'
is also a valid PartiQL query. As we know from SQL, when this query
operates on the table hr.employees it will return the result
Id employeeName title ---- -------------- --------- 4 Susan Smith Dev Mgr
REPL
PartiQL provides an interactive shell, or Read Evaluate Print Loop (REPL), that allows users to write and evaluate PartiQL queries.
Installation
The REPL can be obtained from a Github release.
├── partiql-cli
├── bin
│ ├── partiql // Mac/Unix
│ └── partiql.bat // Windows
├── lib
│ └── ...
├── README.md
└── Tutorial
├── code
│ └── ...
├── tutorial.html
└── tutorial.pdf
The root folder partiql-cli contains a README.md file and 3
subfolders
-
The folder
bincontains startup scriptspartiqlfor macOS and Unix systems andpartiql.batfor Windows systems. Execute these files to start the REPL. -
The folder
libcontains all the necessary Java libraries needed to run PartiQL. -
The folder
Tutorialcontains the tutorial inpdfandhtmlform. The subfoldercodecontains 3 types of files:-
Data files with the extension
.env. These files contains PartiQL data that we can query. -
PartiQL query files with the extension
.sql. These files contain the PartiQL queries used in the tutorial. -
Sample query output files with the extension
.output. These files contain sample output from running the tutorial queries on the appropriate data.
-
To exit the PartiQL REPL, press:
-
Control+Din macOS or Unix -
Control+Con Windows
or close the terminal/command prompt window.
Loading Data from Disk
An easy way to load the necessary data into the REPL is use the -e
switch when starting the REPL and provide the name of a file that
contains your data.
./bin/partiql -e Tutorial/code/q1.env
You can then see what is loaded in the REPL’s global environment using
the REPL command !global_env, i.e.,
For convenience we have provided the file tutorial-all-data.env in
the folder Tutorial/code/. You will also find separate .env files
in the same folder for each query in the tutorial.
For example, running
./bin/partiql -e Tutorial/code/tutorial-all-data.env
will load all the data used in the tutorial in the REPL. This will allow you to copy-paste queries from the tutorial into the REPL and try them out.
Data Model
PartiQL operates not just on SQL tables but also on data that may have nesting, union types, different attributes across different tuples, and many other features that we often find in today’s nested and/or semi-structured formats, like JSON, Ion, Parquet, etc.
To capture this generality, PartiQL is based on a logical type system: the PartiQL Data Model. Each PartiQL implementation maps data formats, like JSON, Parquet etc., into a PartiQL data set that follows the PartiQL data model. PartiQL queries work on the PartiQL data set abstraction.
For example, the table hr.employees is denoted in the PartiQL data
model as this dataset
{
'hr': {
'employees': <<
-- a tuple is denoted by { ... } in the PartiQL data model
{ 'id': 3, 'name': 'Bob Smith', 'title': null },
{ 'id': 4, 'name': 'Susan Smith', 'title': 'Dev Mgr' },
{ 'id': 6, 'name': 'Jane Smith', 'title': 'Software Eng 2'}
>>
}
}
Notice that the employees is nested within hr.
The delimiters << … >> denote that the data is an unordered
collection (also known as bag), as is the case with SQL tables. That
is, there is no order between the three tuples. Single-line comments
start with -- and end at the end of the line.
A very different kind of data source may lead to the same PartiQL dataset. For example, a set of JSON files that contain the following JSON objects
{
"hr" : {
"employees": [
{ "id": 3, "name": "Bob Smith", "title": null },
{ "id": 4, "name": "Susan Smith", "title": "Dev Mgr" },
{ "id": 6, "name": "Jane Smith", "title": "Software Eng 2"}
]
}
}
will likely[4] be abstracted by a PartiQL-supporting implementation
into the identical PartiQL abstraction with the hr.employees table.
| You will keep noticing the similarity of the PartiQL notation with the JSON notation. Notice also the subtle differences: In the interest of SQL compatibility, a PartiQL literal is single-quoted, while JSON literals are double-quoted. |
| You may conceptually think that a deserializer inputs JSON and outputs the PartiQL data set. But do not assume that the query processing of a PartiQL implementation will have to actually parse and abstract into PartiQL each and every bit of the underlying data storage. |
Back to our query
SELECT e.id,
e.name AS employeeName,
e.title AS title
FROM hr.employees e
WHERE e.title = 'Dev Mgr'
Instead of a SQL result set, evaluating the query in PartiQL produces:
<<
{
'id': 4,
'employeeName': 'Susan Smith',
'title': 'Dev Mgr'
}
>>
the result remains the same, no matter whether hr.employees is a SQL
table or a JSON file. All that is needed is an association between the
name hr.employees and the PartiQL abstraction of the data.
In the same spirit, the same PartiQL abstraction may come from a CSV
file or a Parquet file, a format that has gained big traction, thanks to
the efficient way in which it stores data. Again, the same query makes
perfect sense, regardless of what exactly was the storage format behind
hr.employees.
LEARN MORE
PartiQL data sets look very much like JSON.
What are the differences? Indeed, PartiQL adopts the tuple/object and array notation of JSON. However, the PartiQL string literals are denoted by single quotes. Importantly, the scalar types of PartiQL are those of SQL, not just strings, numbers and booleans, as in JSON.
Do implementations need to have a catalog?
If queries refer to names, a catalog logically validates whether the name exists or not. However, we will also see PartiQL queries that refer to no names.
Literals
Literals of the PartiQL query language correspond to the types in the PartiQL data model:
-
scalars, including
nullwhich follow the SQL syntax when applicable. For example:-
5 -
'foo'
-
-
tuples, denoted by
{...}with tuple elements separated by,(also known as structs and/or objects in many formats and other data models)-
{ 'id' : 3, 'arr': [1, 2] }
-
-
arrays, denoted by
[...]with array elements separated by,-
[ 1, 'foo' ]
-
-
bags, denoted by
<< ... >>with bag elements separated by a,-
<< 1, 'foo'>>
-
Notice that in the spirit of the PartiQL data model, literals compose freely and any kind of literal may appear within any tuple, array and bag literal, eg.,
{
'id': 3,
'matrix': [
[2, 4, 6],
'NA'
]
}
Querying Nested Data
SQL-92 only has tables that have tuples that contain scalar values. A key feature of many modern formats is nested data. That is, attributes whose values may themselves be tables (i.e., collections of tuples), or may be arrays of scalars, or arrays of arrays and many other combinations. Let’s take a closer look at PartiQL’s features (SQL extensions) that allow us to work with nested data.
We also include sections titled "Use Case". Such "Use Case" sections do not introduce additional features. They merely show how to combine the few novel PartiQL features with standard SQL features in order to solve a large number of problems.
Nested Collections
Let’s now add the nested attribute projects into the data set.
{
'hr': {
'employeesNest': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
{ 'name': 'AWS Redshift Spectrum querying' },
{ 'name': 'AWS Redshift security' },
{ 'name': 'AWS Aurora security' }
]
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'projects': []
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'projects': [ { 'name': 'AWS Redshift security' } ]
}
>>
}
}
Notice that the value of 'projects' is an array. Arrays are denoted by
[ ... ] with array elements separated by commas. In our example the
array happens to be an array of tuples. We will see that arrays may be
arrays of anything, not just arrays of tuples.
Unnesting a Nested Collection
The following query finds the names of employees who work on projects
that contain the string 'security' and outputs them along with the
name of the 'security' project. Notice that the query has just one
extension over standard SQL — the e.projects AS p part.
SELECT e.name AS employeeName, p.name AS projectName
FROM hr.employeesNest AS e, e.projects AS p
WHERE p.name LIKE '%security%'
The output of our query is
<<
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Redshift security'
},
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Aurora security'
},
{
'employeeName': 'Jane Smith',
'projectName': 'AWS Redshift security'
}
>>
The extension over SQL is the FROM clause item e.projects AS p.
Standard SQL would attempt to find a schema named e with a table
projects and since in our example there isn’t an e.projects table,
the query would fail. In contrast, PartiQL recognizes e.projects to
refer to the projects attribute of e.
Once we allow this extension, the semantics are alike SQL’s. The alias
(also called variable in PartiQL) e gets bound to each employee, in
turn. For each employee, the variable p gets bound to each project of
the employee, in turn. Thus the query’s meaning, like SQL, is
foreach employee tuple e from hr.employeesNest
foreach project tuple p from e.projects
if p.name LIKE '%security%'
output e.name AS employeeName, p.name AS projectName
Notice that our query involved variables that were ranging over nested
collections (p in the example), along with variables that were ranging
over tables (e in the example), as standard SQL aliases do. All
variables, no matter what they range over, can be used wherever in the
FROM, WHERE, SELECT clauses as we will see in the examples that
follow.
LEARN MORE
Can I only unnest arrays of tuples?
No, anything can be unnested. For example, arrays of scalars, etc.
Does
e.projects AS phave to appear in the sameFROMclause that definese?No. For example, see below the use cases that involve subqueries. There, the
eandpare defined in separateFROMclauses.How could I force
e.projectsto refer to the nested attributeprojectseven if there were a schema namedewith a tableprojects?Use the syntax
@e.projects. Recall, in the absence of the@, in the interest of SQL compatibility, PartiQL will first attempt to dereference thee.projectsagainst the catalog.SQL allows me to avoid writing an explicit alias
ewhen I write, say,e.name. Can I avoid writing theein PartiQL as well?SQL allows us to avoid writing aliases (variables) when the schema of the tables allows correct dereferencing. PartiQL does the same. However, recall, a schema is not necessary for a PartiQL data set. Indeed, our example has not assumed a schema. In the absence of a schema, you cannot omit the aliases (variables). For example, if you write just
nameand there is no schema, PartiQL cannot tell whether you mean employee name or project name. Thus you need to explicitly write the alias (variable).There is one exception to this rule: If your query has a single item in its
FROMclause, you can omit the alias (variable). Eg, you can writeSELECT name FROM hr.employeesNestIn this case it is apparent that
namemay only be an employee name and thus PartiQL allows you to not provide an alias (variable).Nevertheless, for clarity we recommend that you always use aliases (variables) and this is what this tutorial does.
If there is a schema, can I avoid writing the alias
p?No. The
phas to be written in order to denote the iteration over the projects.
Unnesting Nested Collections Using JOIN
In this section, we simply present an alternate way to express and think about unnesting collections.
One may think that the FROM clause of the example executes, in a
sense, a JOIN between employees and projects. If it helps you to think
in terms of JOIN, you may replace the comma with JOIN. That is, the
following two queries are equivalent.
SELECT e[.p].[.n]name AS employeeName[.p], p[.p].[.n]name AS projectName
FROM hr[.p].[.n]employeesNest AS e[.p], e[.p].[.n]projects AS p
WHERE p[.p].[.n]name LIKE '%security%'
\$<=>\$
SELECT e[.p].[.n]name AS employeeName[.p], p[.p].[.n]name AS projectName
FROM hr[.p].[.n]employeesNest AS e CROSS JOIN e[.p].[.n]projects AS p
WHERE p[.p].[.n]name LIKE '%security%'
Unnesting with LEFT JOIN
Assume that we want to write a query that returns as a bag of tuples the
entire employee and project information from hr.employeesNest. The
query result we want is this bag of tuples with attributes id,
employeeName, title and projectName:
<<
{
'id': 3,
'employeeName': 'Bob Smith',
'title': NULL,
'projectName': 'AWS Redshift Spectrum querying'
},
{
'id': 3,
'employeeName': 'Bob Smith',
'title': NULL,
'projectName': 'AWS Redshift security'
},
{
'id': 3,
'employeeName': 'Bob Smith',
'title': NULL,
'projectName': 'AWS Aurora security'
},
{
'id': 4,
'employeeName': 'Susan Smith',
'title': 'Dev Mgr'
},
{
'id': 6,
'employeeName': 'Jane Smith',
'title': 'Software Eng 2',
'projectName': 'AWS Redshift security'
}
>>
Notice that there is a 'Susan Smith' tuple in the result, despite the
fact that Susan has no project. Susan’s projectName is null. We can
obtain this result by combining employees and projects using the
LEFT JOIN operator, as follows:
SELECT e.id AS id,
e.name AS employeeName,
e.title AS title,
p.name AS projectName
FROM hr.employeesNest AS e LEFT JOIN e.projects AS p ON true
The semantics of this query can be thought of as
foreach employee tuple e from hr.employeesNest
if the e.projects is an empty collection then // this part is special about LEFT JOINs
output e.id AS id, e.name AS employeeName, e.title AS title
and output a null AS projectName
else // the following part is identical to plain (inner) JOINs
foreach project tuple p from e.projects
output e.id AS id, e.name AS employeeName, e.title AS title
Use Case: Checking whether a nested collection satisfies a condition
The following use cases employ the unnesting features, which we have already discussed, in new use cases. A lesson that emerges is that we can use variables (SQL aliases) that range over nested data as if they were standard SQL aliases. This realization gives us the power to solve a great number of use cases just by combining the unnesting features with features we already know from standard SQL.
In our first use case we want a query that returns the names of the
employees that are involved in a project that contains the word
'security'. The solution employs SQL’s "EXISTS (subquery)" feature,
along with unnesting:
SELECT e.name AS employeeName
FROM hr.employeesNest AS e
WHERE EXISTS ( SELECT *
FROM e.projects AS p
WHERE p.name LIKE '%security%')
returns
<<
{
'employeeName': 'Bob Smith'
},
{
'employeeName': 'Jane Smith'
}
>>
In the second use case we want a query that outputs the names of the
employees that have more than one security project and we are aware of a
key for employees (e.g., an attribute that is guaranteed to have a
unique value for each employee). We can find the requested employees by
utilizing a combination of GROUP BY and HAVING.[5]
with the implementation currently prevents us from doing so. In our example,
let’s assume that the id attribute is a primary key for the employees.
Then we could find the employees with more than one security project
with this query:
SELECT e.name AS employeeName
FROM hr.employeesNest AS e, e.projects AS p
WHERE p.name LIKE '%security%'
GROUP BY e.id, e.name
HAVING COUNT(*) > 1
which returns
<<
{
'employeeName': 'Bob Smith'
}
>>
Use Case: Subqueries that aggregate over nested collections
Next, let’s find how many querying projects (that is, projects whose name contains the word 'querying') each employee has.[6] with the implementation currently prevents us from doing so.
Making the same asssumption as before, that id is a key for employees,
we can solve the problem with the query
SELECT e.name AS employeeName, COUNT(p.name) AS queryProjectsNum
FROM hr.employeesNest e LEFT JOIN e.projects AS p ON p.name LIKE '%querying%'
GROUP BY e.id, e.name
that returns
<<
{
'employeeName': 'Bob Smith',
'queryProjectsNum': 1
},
{
'employeeName': 'Susan Smith',
'queryProjectsNum': 0
},
{
'employeeName': 'Jane Smith',
'queryProjectsNum': 0
}
>>
---
OK!
Notice this query’s result includes Susan Smith and Jane Smith, who have no querying projects.
Nested Structs
A value may also be a tuple — also called object and struct in many models and formats. For example, the project value in the following tuples is always a tuple with project name and project org.
{
'hr': {
'employeesWithTuples': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'project': {
'name': 'AWS Redshift Spectrum querying',
'org': 'AWS'
}
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'project': {
'name': 'AWS Redshift security',
'org': 'AWS'
}
}
>>
}
}
PartiQL’s multistep paths enable navigating within tuples. For example, the following query finds AWS projects and outputs the project name and employee name.
SELECT e.name AS employeeName, e.project.name AS projectName
FROM hr.employeesWithTuples e
WHERE e.project.org = 'AWS'
The result is
<<
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Redshift Spectrum querying'
},
{
'employeeName': 'Jane Smith',
'projectName': 'AWS Redshift security'
}
>>
---
OK!
Other Nested Values
The previous examples have shown nested attributes that were arrays of tuples. It need not be the case that the nested attributes are collections of tuples. They may just as well be arrays of scalars, arrays of arrays, or any combination of data that one can create by composing scalars, tuples and arrays. You need not learn a different set of query language features for each case. The unnesting features, which we have already seen, are sufficient.
Use Case: Unnesting Arrays of Scalars
The list of projects associated with each employee in hr.employeesNest
could have been simply a list of project name strings. Replacing the
nested tuples with plain strings gives us
{
'hr': {
'employeesNestScalars': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
'AWS Redshift Spectrum querying',
'AWS Redshift security',
'AWS Aurora security'
]
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'projects': []
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'projects': [ 'AWS Redshift security' ]
}
>>
}
}
Let us repeat the previous use cases on the revised employee data.
The following query finds the names of employees who work on projects
that contain the string 'security' and outputs them along with the
name of the 'security' project.
SELECT e.name AS employeeName,
p AS projectName
FROM hr.employeesNestScalars AS e,
e.projects AS p
WHERE p LIKE '%security%'
The preceding query returns
<<
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Redshift security'
},
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Aurora security'
},
{
'employeeName': 'Jane Smith',
'projectName': 'AWS Redshift security'
}
>>
The variable p ranges (again) over the content of e.projects. In
this case, since e.projects has strings (as opposed to tuples), the
variable p binds each time to a project name string. Thus, this query
can be thought of as executing the following snippet.
foreach employee tuple e from hr.employeesNestScalars
foreach project p from e.projects
if the string p matches '%security%'
output e.name AS employeeName and the string p AS projectName
Use Case: Unnesting Arrays of Arrays
Arrays may also contain arrays, directly, without intervening tuples, as
in the matrices data set.
{
'matrices': <<
{
'id': 3,
'matrix': [
[2, 4, 6],
[1, 3, 5, 7],
[9, 0]
]
},
{
'id': 4,
'matrix': [
[5, 8],
[ ]
]
}
>>
}
The following query finds every even number and outputs the even number
and the id of the tuple where it was found.
SELECT t.id AS id, x AS even
FROM matrices AS t, t.matrix AS y, y AS x
WHERE x % 2 = 0
The preceding query returns
<<
{
'id': 3,
'even': 2
},
{
'id': 3,
'even': 4
},
{
'id': 3,
'even': 6
},
{
'id': 3,
'even': 0
},
{
'id': 4,
'even': 8
}
>>
Informally the query’s evaluation can be thought of as
foreach tuple t from matrices
foreach array y from t.matrix
foreach number x from y
if x is even then
output t.id AS id and x AS even
Querying Schemaless Data
Many formats do not require a schema that describes the data — that is schemaless data. In such cases it is possible to have various "heterogeneities" in the data:
-
One tuple may have an attribute
xwhile another tuple may not have this attribute -
In one tuple of the collection an attribute
xmay be of one type, e.g., string, while in another tuple of the same collection the same attributexmay be of a different type — e.g, array. -
The elements of a collection (be it a bag or array) can be heterogeneous (not have the same type). For example, the first element may be a string, the second one may be an integer and the third one an array.
-
Generally, any composition is possible as we can bundle heterogeneous elements in arrays and bags.
Heterogeneities are not particular to schemaless. Schemas may allow for
heterogeneity in the types of the data. For example, one of the Hive
data types is the union type,[7] which allows a value to belong to any one of a list of types. Consider the following schema whose projects
attribute may be either a string or an array of strings
CREATE TABLE employeesMixed(
id: INT,
name: STRING,
title: STRING,
projects: UNIONTYPE<STRING, ARRAY<STRING>>
);
A collection of PartiQL tuples that follows this schema could be
{
'hr': {
'employeesMixed1': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
'AWS Redshift Spectrum querying',
'AWS Redshift security',
'AWS Aurora security'
]
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'projects': []
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'projects': 'AWS Redshift security'
}
>>
}
}
Thus we see that data may have heterogeneities — regardless of whether they are described by a schema or not. PartiQL tackles heterogeneous data in ways that we will see in the next use cases and feature presentations.
Missing Attributes
Let’s go back to the hr.employees table (that is, bag of tuples). Bob
Smith has no title and, as is typical in SQL, the lack of title is
modeled with the null value.
{
'hr': {
'employees': <<
{ 'id': 3, 'name': 'Bob Smith', 'title': null }
{ 'id': 4, 'name': 'Susan Smith', 'title': 'Dev Mgr' }
{ 'id': 6, 'name': 'Jane Smith', 'title': 'Software Eng 2'}
>>
}
}
Nowadays, many semi-structured formats allow users to represent "missing" information in two ways.
-
The first way is by use of
null. -
The second kind is the plain absence of the attribute from the tuple.
That is, we can represent the fact that Bob Smith has no title by simply
having no title attribute in the 'Bob Smith' tuple:
{
'hr': {
'employeesWithMissing': <<
{ 'id': 3, 'name': 'Bob Smith' }, -- no title in this tuple
{ 'id': 4, 'name': 'Susan Smith', 'title': 'Dev Mgr' },
{ 'id': 6, 'name': 'Jane Smith', 'title': 'Software Eng 2'}
>>
}
}
PartiQL does not argue about when to use nulls and when to use
"missing". Myriads of datasets already use one of the two or both.
However, PartiQL enables queries to distinguish between null and missing
values, and also enables query results that have nulls and missing
values. Indeed, PartiQL makes it very easy to propagate source data
nulls as query result nulls and source data missing attributes into
result missing attributes.
The MISSING Value
Consider again this PartiQL query, which happens to also be an SQL query.
SELECT e.id,
e.name AS employeeName,
e.title AS title
FROM hr.employeesWithMissing AS e
WHERE e.title = 'Dev Mgr'
What will happen when the query goes over the Bob Smith tuple, which has
no title?
The first step to answering this question is understanding the result of
the path e.title when the alias (variable) e binds to the tuple
{ 'id': 3, 'name': 'Bob Smith' }. In more basic terms, what is the
result of the expression { 'id': 3, 'name': 'Bob Smith' }.title ?
PartiQL says that it is the special value MISSING. MISSING behaves
very similar to null.
Evaluating Functions and Conditions with MISSING
If a function (including infix functions like =) inputs a MISSING
the function’s result is NULL. In the case of the example, this means
that the WHERE clause e.title='Dev Mgr' will evaluate to NULL when
e binds to { 'id': 3, 'name': 'Bob Smith' } and, as usual in SQL,
the WHERE clause fails when it does not evaluate to true. Thus the
output will be
<<
{
'id': 4,
'employeeName': 'Susan Smith',
'title': 'Dev Mgr'
}
>>
Propagating MISSING in Result Tuples
What would happen if a missing attribute or, more generally, an
expression returning MISSING appears in the SELECT?
SELECT e.id,
e.name AS employeeName,
e.title AS outputTitle
FROM hr.employeesWithMissing AS e
The query will output one tuple for each employee. When it outputs the
Bob Smith tuple, the e.title will evaluate to NULL and then the
output tuple will not have an outputTitle attribute.
<<
{
'id': 3,
'employeeName': 'Bob Smith'
},
{
'id': 4,
'employeeName': 'Susan Smith',
'outputTitle': 'Dev Mgr'
},
{
'id': 6,
'employeeName': 'Jane Smith',
'outputTitle': 'Software Eng 2'
}
>>
---
OK!
The same treatment of MISSING would happen if, say, we had this query
that converts titles to capital letters:
SELECT e.id,
e.name AS employeeName,
UPPER(e.title) AS outputTitle
FROM hr.employeesWithMissing AS e
Again, the e.title will evaluate to MISSING for 'Bob Smith', the
UPPER(e.title) is then UPPER(MISSING) and also evaluates to NULL.
Thus the result will be:
<<
{
'id': 3,
'employeeName': 'Bob Smith',
'outputTitle': NULL
},
{
'id': 4,
'employeeName': 'Susan Smith',
'outputTitle': 'DEV MGR'
},
{
'id': 6,
'employeeName': 'Jane Smith',
'outputTitle': 'SOFTWARE ENG 2'
}
>>
Ranging over Heterogenous Values
A PartiQL variable (called alias in SQL) can bind to data of different types during a query’s evaluation. This is unlike SQL where the variables always bind to tuples. It is even different from what happened in Use Case: Unnesting Arrays of Scalars and what happened in Use Case: Unnesting Arrays of Arrays.
In the first use case, the PartiQL variable p happened to always bind
to a string (given the particular sample data of the example). In the
second use case, the PartiQL variable y was always bound to an array
(again, given the particular sample data of the example).
To make the case for variables that bind to different types, consider
the following twist in the employeesNest data set. Some of the
elements of the projects array are plain strings and some are tuples.
Even the employee tuples do not always have the same attributes.
{
'hr': {
'employeesMixed2': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
{ 'name': 'AWS Redshift Spectrum querying' },
'AWS Redshift security',
{ 'name': 'AWS Aurora security' }
]
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'projects': []
},
{
'id': 6,
'name': 'Jane Smith',
'projects': [ 'AWS Redshift security']
}
>>
}
}
This query on hr.employeesMixed2 produces employee name — employee
project pairs.
SELECT e.name AS employeeName,
CASE WHEN (p IS TUPLE) THEN p.name ELSE p END AS projectName
FROM hr.employeesMixed2 AS e,
e.projects AS p
Notice the sub-expression (p IS TUPLE). The IS operator can be used
to check a value against its type at evaluation time. Notice also that
the variable p binds to different types.
In general, the FROM clause of a query binds its variables (aliases)
to data. The variables need not bind to data that have the same types.
Each binding is fed to the SELECT clause, which evaluates its
expressions.
This table shows each variables' binding produced by the FROM clause
and the corresponding tuple output by the SELECT clause.
| Variable e | Variable p | Result tuple |
|---|---|---|
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
{ 'name': 'AWS Redshift Spectrum querying' },
'AWS Redshift security',
{ 'name': 'AWS Aurora security' }
]
} |
{ 'name': 'AWS Redshift Spectrum querying' } |
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Redshift Spectrum querying'
} |
{ 'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
{ 'name': 'AWS Redshift Spectrum querying' },
'AWS Redshift security',
{ 'name': 'AWS Aurora security' }
]
} |
'AWS Redshift security' |
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Redshift security'
} |
{ 'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
{ 'name': 'AWS Redshift Spectrum querying' },
'AWS Redshift security',
{ 'name': 'AWS Aurora security' }
]
} |
{ 'name': 'AWS Aurora security' } |
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Aurora security'
} |
{ 'id': 6,
'name': 'Jane Smith',
'projects': [ 'AWS Redshift security' ]
} |
'AWS Redshift security' |
{
'employeeName': 'Jane Smith',
'projectName': 'AWS Redshift security'
} |
Array Access
SQL allows us to order the output of a query using the ORDER BY
clause. However, the SQL data model does not recognize order in the
input data. In contrast, many of the new data formats feature arrays;
the arrays' elements have an order. We may want to find an array element
by its index or, we may want to find the positions of certain elements
in their arrays.
Let’s consider again the dataset hr.employeesNest.
{
'hr': {
'employeesNest': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
{ 'name': 'AWS Redshift Spectrum querying' },
{ 'name': 'AWS Redshift security' },
{ 'name': 'AWS Aurora security' }
]
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'projects': []
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'projects': [ { 'name': 'AWS Redshift security' } ]
}
>>
}
}
The projects attribute is an array of tuples; that is, each tuple has
an ordinal associated with it. The following query returns each employee
name, along with the first project of the employee.
SELECT e.name AS employeeName,
e.projects[0].name AS firstProjectName
FROM hr.employeesNest AS e
The query returns
<<
{
'employeeName': 'Bob Smith',
'firstProjectName': 'AWS Redshift Spectrum querying'
},
{
'employeeName': 'Susan Smith'
},
{
'employeeName': 'Jane Smith',
'firstProjectName': 'AWS Redshift security'
}
>>
Multistep Paths
Technically, the structure [<number>] is a kind of path step. For
example, notice the 4-step path e.projects[0].name. When e is bound
to the first tuple of hr.employeesNest, then the path e.projects
results into the array
[
{ 'name': 'AWS Redshift Spectrum querying' },
{ 'name': 'AWS Redshift security' },
{ 'name': 'AWS Aurora security' }
]
Consequently applying the [0] step on e.projects (that is,
evaluating e.projects[0]) leads to
{'name': 'AWS Redshift Spectrum querying'}. Finally, evaluating the
.name step on e.projects[0] (that is, evaluating
e.projects[0].name) leads to 'AWS Redshift Spectrum querying'.
Finding the Index of Each Element in an Array
Let’s assume that each employees' projects are sorted in priority order.
The following query finds the names of each employee involved in a
security project, the security project, and, its index in the projects
array.
SELECT e.name AS employeeName,
p.name AS projectName,
o AS projectPriority
FROM hr.employeesNest AS e,
e.projects AS p AT o
WHERE p.name LIKE '%security%'
Notice the new feature: AT o. While p ranges over the elements of
the array e.projects, the variable o is assigned to the index of the
element in the array. The query returns:
<<
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Redshift security',
'projectPriority': 1
},
{
'employeeName': 'Bob Smith',
'projectName': 'AWS Aurora security',
'projectPriority': 2
},
{
'employeeName': 'Jane Smith',
'projectName': 'AWS Redshift security',
'projectPriority': 0
}
>>
Pivoting & Unpivoting
Many queries need to range over and collect the attribute name/value pairs of tuples or the key/value pairs of maps.
Unpivoting Tuples
Consider this dataset that provides the closing prices of multiple ticker symbols.
{
'closingPrices': <<
{ 'date': '4/1/2019', 'amzn': 1900, 'goog': 1120, 'fb': 180 },
{ 'date': '4/2/2019', 'amzn': 1902, 'goog': 1119, 'fb': 183 }
>>
}
The following query unpivots the stock ticker/price pairs.
SELECT c."date" AS "date",
sym AS "symbol",
price AS price
FROM closingPrices AS c,
UNPIVOT c AS price AT sym
WHERE NOT sym = 'date'
Notice the use of " in this query. The double quotes allow us to
disambiguate from date the keyword and "date" the identifier. Double
quotes can also specify case sensitivity for attribute lookups.
The query returns
<<
{
'date': '4/1/2019',
'symbol': 'amzn',
'price': 1900
},
{
'date': '4/1/2019',
'symbol': 'goog',
'price': 1120
},
{
'date': '4/1/2019',
'symbol': 'fb',
'price': 180
},
{
'date': '4/2/2019',
'symbol': 'amzn',
'price': 1902
},
{
'date': '4/2/2019',
'symbol': 'goog',
'price': 1119
},
{
'date': '4/2/2019',
'symbol': 'fb',
'price': 183
}
>>
Unpivoting tuples enables the use of attribute names as if they were data. For example, it becomes easy to compute the average price for each symbol as
SELECT sym AS "symbol",
AVG(price) AS avgPrice
FROM closingPrices c,
UNPIVOT c AS price AT sym
WHERE NOT sym = 'date'
GROUP BY sym
which returns
<<
{
'symbol': 'amzn',
'avgPrice': 1901
},
{
'symbol': 'fb',
'avgPrice': 181.5
},
{
'symbol': 'goog',
'avgPrice': 1119.5
}
>>
Pivoting into Tuples
Pivoting turns a collection into a tuple. For example, consider the collection
{
'todaysStockPrices': <<
{ 'symbol': 'amzn', 'price': 1900},
{ 'symbol': 'goog', 'price': 1120},
{ 'symbol': 'fb', 'price': 180 }
>>
}
Then the following PIVOT query
PIVOT sp.price AT sp."symbol"
FROM todaysStockPrices sp
produces the tuple
{
'amzn': 1900,
'goog': 1120,
'fb': 180
}
Notice that the PIVOT query looks like a SELECT-FROM-WHERE-... query
except that instead of a SELECT clause it has a
PIVOT <value expression> AT <attribute expression>. Note also that the
PIVOT query does not return a collection of tuples: rather it
literally returns a tuple value.
Pivoting Subqueries
This example also uses the grouping features of PartiQL.
Let us generalize the previous case of pivoting. We have a table of stock prices
{
'stockPrices':<<
{ 'date': '4/1/2019', 'symbol': 'amzn', 'price': 1900},
{ 'date': '4/1/2019', 'symbol': 'goog', 'price': 1120},
{ 'date': '4/1/2019', 'symbol': 'fb', 'price': 180 },
{ 'date': '4/2/2019', 'symbol': 'amzn', 'price': 1902},
{ 'date': '4/2/2019', 'symbol': 'goog', 'price': 1119},
{ 'date': '4/2/2019', 'symbol': 'fb', 'price': 183 }
>>
}
and we want to pivot it into a collection of tuples, where each tuple
has all the symbol:price pairs for a date, as follows
<<
{
'date': date(4/1/2019),
'prices': {'amzn': 1900, 'goog': 1120, 'fb': 180}
},
{
'date': date(4/2/2019),
'prices': {'amzn': 1902, 'goog': 1119, 'fb': 183}
}
>>
The following query first creates one group datesPrices for each date.
Then the PIVOT subquery pivots the group into the tuple prices.
SELECT sp."date" AS "date",
(PIVOT dp.sp.price AT dp.sp."symbol"
FROM datesPrices as dp ) AS prices
FROM StockPrices AS sp GROUP BY sp."date" GROUP AS datesPrices
For example, the datesPrices collection, returned from GROUP AS for
sp.date = date(4/1/2019) is
'datesPrices': <<
{
'sp': {
'date': '4/1/2019',
'symbol': 'amzn',
'price': 1900
}
},
{
'sp': {
'date': '4/1/2019',
'symbol': 'goog',
'price': 1120
}
},
{
'sp': {
'date': '4/1/2019',
'symbol': 'fb',
'price': 180
}
}
>>
Nested Results
PartiQL allows queries that create nested results as well as queries that create heterogeneous results.
The SELECT VALUE Clause
Let’s consider again the dataset hr.employeesNestScalars:
{
'hr': {
'employeesNestScalars': <<
{
'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [
'AWS Redshift Spectrum querying',
'AWS Redshift security',
'AWS Aurora security'
]
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'projects': []
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'projects': [ 'AWS Redshift security' ]
}
>>
}
}
The following query outputs each tuple of hr.employeesNestScalars,
except that instead of all projects each tuple has only the security
projects of the employee. The important new feature here is the
SELECT VALUE <expression>.
SELECT e.id AS id,
e.name AS name,
e.title AS title,
( SELECT VALUE p
FROM e.projects AS p
WHERE p LIKE '%security%'
) AS securityProjects
FROM hr.employeesNestScalars AS e
The result is
<<
{
'id': 3,
'name': 'Bob Smith',
'title': NULL,
'securityProjects': <<
'AWS Redshift security',
'AWS Aurora security'
>>
},
{
'id': 4,
'name': 'Susan Smith',
'title': 'Dev Mgr',
'securityProjects': <<>>
},
{
'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'securityProjects': <<
'AWS Redshift security'
>>
}
>>
A SELECT VALUE <expression> query (or subquery, as in this example)
returns a collection of whatever the <expression> evaluates to.
Notice the difference from SQL’s SELECT, which always produces tuples.
If a SQL SELECT appears as a subquery, then the context of the
subquery designates whether the subquery’s result should be coerced into
a scalar (e.g., when 5 = <subquery>), coerced into a collection of
scalars (e.g., when 5 IN <subquery>), etc. None of this applies to
SELECT VALUE, which produces a collection and this collection is not
coerced.
GROUP BY and GROUP AS
Another pattern of creating nested results in PartiQL is via the
GROUP AS extension to SQL’s GROUP BY. This pattern is more efficient
and more intuitive than the use of nested SELECT VALUE queries when
the required nesting is not following the nesting of the input. (The
example in Creating Nested Results with SELECT VALUE
Queries is one
where the nesting in the output follows the nesting of the input and
thus, an intuitive solution does not involve GROUP BY.)
The following query outputs each security project found in
hr.employeesNestScalars along with the list of employee names that
work on the project.
SELECT p AS projectName,
( SELECT VALUE v.e.name
FROM perProjectGroup AS v ) AS employees
FROM hr.employeesNestScalars AS e JOIN e.projects AS p ON p LIKE '%security%'
GROUP BY p GROUP AS perProjectGroup
The result is
<<
{
'projectName': 'AWS Aurora security',
'employees': <<
'Bob Smith'
>>
},
{
'projectName': 'AWS Redshift security',
'employees': <<
'Bob Smith',
'Jane Smith'
>>
}
>>
The GROUP AS generalizes SQL’s GROUP BY by making the formulated
groups available in their entirety to the query’s SELECT and HAVING
clauses. Contrast with SQL’s GROUP BY, where the SELECT and HAVING
clauses can have aggregate functions over grouped columns but they
cannot get access to the individual values of the grouped columns.
To better understand the workings of GROUP BY ... GROUP AS it is best
to think of PartiQL queries as a pipeline of clauses, starting with the
FROM, continuing with the GROUP BY and finishing with the SELECT.
Each clause is a function that inputs data and outputs data. In that
sense, the GROUP BY ... GROUP AS is a function that inputs the result
of the FROM and outputs its result to the SELECT.
The following query (conceptually) produces the output of the FROM
clause.
SELECT e AS e, p AS p
FROM hr.employeesNestScalars AS e JOIN e.projects AS p ON p LIKE '%security%'
We see that the FROM delivers the collection of tuples consisting of
an employee e and a project p that were output by the FROM clause,
i.e., the LEFT JOIN. This is like SQL’s FROM semantics.
| Variable e | Variable p |
|---|---|
{ 'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [ 'AWS Redshift Spectrum querying',
'AWS Redshift security',
'AWS Aurora security'
]
} |
'AWS Redshift security' |
{ 'id': 3,
'name': 'Bob Smith',
'title': null,
'projects': [ 'AWS Redshift Spectrum querying',
'AWS Redshift security',
'AWS Aurora security'
]
} |
'AWS Aurora security' |
{ 'id': 6,
'name': 'Jane Smith',
'title': 'Software Eng 2',
'projects': [ 'AWS Redshift security' ]
} |
'AWS Redshift security' |
Then the GROUP BY ... GROUP AS ... can be thought of as outputting a
table that has one column for each group-by expression (i.e., each
security project p) and a last column perProjectGroup whose value
(conceptually) is the collection of employee/project e/p tuples that
correspond to the group-by expression p. Thus the
GROUP BY ... GROUP AS ... output is the table
| p | perProjectGroup |
|---|---|
'AWS Redshift security' |
<<
{ e: { 'id': 3, 'name': 'Bob Smith', … },
p: 'AWS Redshift security' },
{ e: { 'id': 6, 'name': 'Jane Smith', … },
p: 'AWS Redshift security' }
>> |
'AWS Aurora security' |
<<
{ e: { 'id': 3, 'name': 'Bob Smith', …},
p: 'AWS Aurora security' },
>> |
Finally the SELECT clause inputs the above and outputs the query
result.
employee is an ordered list. PartiQL implementations may provide their own mappings from popular data formats, e.g., CSV, TSV, JSON, Ion etc., to the PartiQL data model and/or allow clients to implements their own mappings.