Introduction
Welcome to the Beamline Guide — your comprehensive resource for mastering synthetic data and query generation for your AI/ML, testing, and simulation use-cases.
What You’ll Learn
This guide will take you on a journey from understanding the basics of Beamline to becoming proficient in generating sophisticated synthetic datasets and queries. Whether you are an AI/ML researcher, a developer looking to test your implementations, or a data scientist needing realistic test data, this guide has you covered.
How This Guide is Organized
The guide is structured to gradually build your understanding and skills:
- Getting Started — Learn what Beamline is and get your first data generation running
- Understanding the Basics — Grasp core concepts like random processes and reproducible generation
- Data Generation — Master the art of creating synthetic data with various types and patterns
- Query Generation — Learn to generate PartiQL queries that match your data shapes
- Schema and Shape Inference — Understand how to work with data schemas and type inference
- Database Generation — Create local BeamlineLite database with both data and schemas
- Command Line Interface — Become proficient with all CLI commands and options
- Examples and Tutorials — Hands-on tutorials with real-world scenarios
What is Beamline?
Beamline is a tool for fast data generation. It generates reproducible pseudo-random data using a stochastic approach and probability distributions, meaning you can create realistic datasets that follow specific mathematical patterns. The result is data that is random enough to be useful for AI/ML model training, simulation, and testing, while remaining deterministic enough to be reproducible for debugging and validation.
The tool’s ability to generate data based on statistical distributions makes it particularly valuable for AI model training scenarios where you need synthetic data that resembles specific population distributions or statistical characteristics.
Key Features
- Reproducible Data Generation: Use seeds to generate the same data every time
- Stochastic Processes: Model real-world data patterns using mathematical distributions
- Query Generation: Automatically generate PartiQL/SQL-like queries that match your data shapes
- Schema Inference: Automatically infer and export data schemas in various formats
- Multiple Output Formats: Support for Amazon Ion, JSON, and SQL DDL
- Database Generation: Create complete local copies of the generated data with both data and schemas
- Flexible Configuration: Highly configurable through scripts
Prerequisites
This guide assumes basic familiarity with:
- Command-line interfaces
- Ion/JSON data formats
- Basic understanding of databases and queries
- Rust programming language (for building from source)
Don’t worry if you’re new to some of these concepts — we’ll explain everything you need to know as we go!
Getting Help
If you encounter issues or have questions while following this guide, please open an issue on the Beamline GitHub repository or start a discussion in the repository's Discussions section.
Let’s begin your journey with Beamline!
What is Beamline?
Beamline is a tool designed for fast data generation. At its core, it generates reproducible pseudo-random data using a stochastic approach that models real-world data patterns.
The Problem It Solves
Developers, data scientists, and researchers commonly face these software engineering problems:
- Lack of Test Data: Creating realistic test datasets manually is time-consuming and error-prone
- Inconsistent Testing: Different test runs with different data make it hard to reproduce bugs
- Query Testing: Writing queries that match your data structures requires understanding both the data shape and query patterns
- Performance Benchmarking: Consistent, scalable datasets are needed for meaningful performance comparisons and AI inference evaluations
- Schema Evolution: As data structures change, maintaining test data becomes increasingly complex
Beamline addresses all these challenges with a unified approach to synthetic data generation.
Core Components
Beamline consists of three main components that work together:
1. Data Generator
The Data Generator creates reproducible pseudo-random data based on mathematical distributions and stochastic processes. It can generate:
- Simple scalar values (numbers, strings, booleans, dates)
- Complex nested structures (structs, arrays, mixed types)
- Time-series data with realistic temporal patterns
- Shared data across multiple datasets
Key Features:
- Reproducible: Same seed always produces the same data, no matter how deeply nested that data is
- Configurable: Highly customizable through scripts
- Realistic: Uses statistical distributions to model real-world patterns
- Scalable: Can generate datasets from small samples to millions of records
2. Query Generator
The Query Generator creates SQL-like queries (starting from PartiQL support) that match the shapes and types of your generated data. It can produce:
- SELECT * FROM ... WHERE ... queries with various predicates
- SELECT ... FROM ... WHERE ... queries with custom projections
- SELECT ... EXCLUDE ... FROM ... WHERE ... queries with exclusions
- Complex nested queries with deep path expressions
Key Features:
- Datatype-Aware: Generates queries that match your data types
- Parameterizable: Control query complexity, depth, and patterns
- Reproducible: Same seed produces the same query patterns
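Beamline's query generator is internal to the tool, but the core idea, picking predicates whose literals match each field's datatype and deriving them from a seed, can be sketched in plain Python. Everything below (the shape format, function names) is illustrative, not Beamline's actual API:

```python
import random

# Hypothetical shape: field name -> datatype (not Beamline's real schema format)
SHAPE = {"tick": "int", "f": "float", "a": "string"}

def random_predicate(rng, field, dtype):
    """Build a WHERE predicate whose literal matches the field's datatype."""
    if dtype == "int":
        return f"{field} > {rng.randint(0, 100)}"
    if dtype == "float":
        return f"{field} < {rng.uniform(0.0, 100.0):.2f}"
    return f"{field} = '{rng.choice(['x', 'y', 'z'])}'"

def generate_query(seed, table, shape):
    rng = random.Random(seed)          # same seed -> same query pattern
    field = rng.choice(sorted(shape))  # sorted for deterministic iteration
    return f"SELECT * FROM {table} WHERE {random_predicate(rng, field, shape[field])}"

print(generate_query(42, "sensors", SHAPE))
print(generate_query(42, "sensors", SHAPE))  # identical: reproducible
```

The same seeding discipline is what makes the generated queries repeatable across runs.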
3. CLI Interface
The Command Line Interface provides easy access to all functionality with comprehensive options for:
- Data generation with various output formats
- Query generation with extensive parameterization
- Schema inference and export
- Local data file creation with both data and schemas
How It Works
Stochastic Processes
Beamline models data generation as stochastic processes — mathematical models that describe systems that appear to vary randomly over time. This approach allows it to:
- Generate data that follows realistic patterns
- Model temporal relationships (like arrival times)
- Simulate real-world variability while maintaining reproducibility
Scripts and Configuration
Data generation is controlled through scripts that define:
- Random Processes: How data arrives and is generated over time
- Data Generators: What types of data to create and their distributions
- Relationships: How different data elements relate to each other
- Constraints: Rules and patterns the data should follow
Reproducibility
One of Beamline’s key strengths is reproducibility:
- Seeds: Control the random data generation for consistent results
- Timestamps: Control the starting time for temporal data
- Deterministic: Same inputs always produce the same outputs
- Debuggable: Reproduce exact datasets for debugging and validation
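This seed-driven determinism is the standard seeded-PRNG guarantee. A minimal Python analogy (not Beamline code) of the contract described above:

```python
import random

def generate(seed, n):
    """Seeded generation: the seed fully determines the output sequence."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, 1.0) for _ in range(n)]

run1 = generate(12345, 5)
run2 = generate(12345, 5)   # same seed -> identical data, value for value
run3 = generate(67890, 5)   # different seed -> different data

assert run1 == run2
assert run1 != run3
```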
Use Cases
AI Model Training and Inference
- Training Data Generation: Generate datasets that follow specific statistical distributions for machine learning model training
- Distribution-Based Modeling: Create training data that matches target population distributions for more representative models
- Synthetic Data Augmentation: Expand training datasets while preserving underlying statistical distributions
- Edge Case Generation: Generate rare statistical scenarios for robust model validation
Testing and Development
- Unit Testing: Generate consistent test data for implementations
- Integration Testing: Create realistic datasets for end-to-end testing
- Regression Testing: Ensure changes don’t break existing functionality
- Edge Case Testing: Generate data that exercises boundary conditions
Performance and Benchmarking
- Load Testing: Generate large datasets for performance evaluation
- Scalability Testing: Test how systems perform with growing data sizes
- Query Optimization: Generate queries to test optimization strategies
Research and Education
- Algorithm Research: Generate datasets for testing new features
- Query Pattern Analysis: Study how different query patterns perform
- Educational Examples: Create realistic examples for learning PartiQL
- Prototyping: Quickly generate data for proof-of-concept implementations
What Makes It Special
Mathematical Foundation
Unlike simple random data generators, Beamline is built on solid mathematical foundations:
- Probability Distributions: Uses proper statistical distributions for realistic data
- Stochastic Modeling: Models real-world processes mathematically
- Temporal Modeling: Handles time-based data generation correctly
- Correlation Modeling: Can generate related data across multiple dimensions
Next Steps
Now that you understand what Beamline is and why it’s useful, let’s get it installed and running on your system. In the next section, we’ll walk through the installation process and verify that everything is working correctly.
Installation and Setup
This chapter will guide you through installing Beamline and setting up your development environment. Beamline is written in Rust, so we will cover both building from source and using pre-built binaries when available.
Prerequisites
Before installing Beamline, ensure you have the following prerequisites:
Required
- Rust Toolchain: Beamline requires Rust 1.70 or later
- Git: For cloning the repository
- Command Line Access: Terminal or command prompt
Optional but Recommended
- Text Editor: For editing Ion scripts (VS Code, vim, emacs, etc.)
- JSON/Ion Viewer: Use jq and/or ion-cli tools for examining generated data
Installing Rust
For full details on installing Rust on your machine, see: https://rust-lang.org/tools/install/
If you don’t have Rust installed, follow these steps:
On macOS, Linux, or WSL
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env
On Windows
- Download and run rustup-init.exe
- Follow the installation prompts
- Restart your command prompt
Verify Rust Installation
rustc --version
cargo --version
You should see version information for both rustc and cargo.
Installing Beamline
Method 1: Building from Source (Recommended)
This is currently the primary method for installing Beamline:
1. Clone the Repository

git clone https://github.com/partiql/partiql-beamline.git
cd partiql-beamline

2. Build the Project

cargo build --release

This will compile Beamline in release mode, which provides better performance for data generation.

3. Verify the Installation

./target/release/beamline --version

You should see version information for Beamline.

4. Optional: Add to PATH

For easier access, you can add the binary to your PATH or create a symlink:

On macOS/Linux:

# Option 1: Copy to a directory in your PATH
sudo cp target/release/beamline /usr/local/bin/

# Option 2: Create a symlink
ln -s $(pwd)/target/release/beamline ~/.local/bin/beamline

# Option 3: Add to your shell profile
echo 'export PATH="'$(pwd)'/target/release:$PATH"' >> ~/.bashrc
source ~/.bashrc

On Windows:

# Add the target/release directory to your PATH environment variable
# Or copy the .exe file to a directory already in your PATH
Method 2: Using Cargo Install (Not available yet)
Once Beamline is published to crates.io, you’ll be able to install it directly:
# This will be available in the future
cargo install beamline
Verifying Your Installation
Let’s verify that Beamline is installed correctly by running a few basic commands:
1. Check Version
beamline --version
2. View Help
beamline --help
You should see output similar to:
Beamline CLI
Usage: beamline <COMMAND>
Commands:
gen Run the generator
infer-shape Run the script shape inference
help Print this message or the help of the given subcommand(s)
Options:
-h, --help Print help
-V, --version Print version
3. Test Data Generation
Let’s run a simple test to ensure data generation works:
beamline gen data --help
This should display the help for the data generation command, confirming that the core functionality is available.
Development Environment Setup
Setting Up Your Workspace
Create a directory for your Beamline projects:
mkdir ~/partiql-beamline-workspace
cd ~/partiql-beamline-workspace
Editor Configuration
VS Code
If you’re using VS Code, consider installing these extensions for better Ion support:
- Rust Analyzer: For Rust syntax highlighting if you plan to contribute
- ion-vscode-plugin: For Syntax Highlighting, Error Reporting, and Formatting of Beamline scripts
- JSON: For viewing generated JSON output
- Better TOML: For configuration files
Shell Aliases (Optional)
For convenience, you might want to create shell aliases:
# Add to your ~/.bashrc, ~/.zshrc, or equivalent
alias pql-gen='beamline gen data'
alias pql-query='beamline query'
alias pql-shape='beamline infer-shape'
Troubleshooting Installation
Common Issues
Rust Version Too Old
Error: error: package requires rustc 1.70 or newer
Solution: Update Rust:
rustup update
Build Failures
Error: Compilation errors during cargo build
Solutions:
- Ensure you have the latest Rust version
- Clean and rebuild:
cargo clean
cargo build --release

- Check for system-specific dependencies
Permission Issues
Error: Permission denied when copying to /usr/local/bin
Solution: Use sudo or choose a different installation location:
# Install to user directory instead
mkdir -p ~/.local/bin
cp target/release/beamline ~/.local/bin/
PATH Issues
Error: command not found: beamline
Solution: Verify the binary is in your PATH:
which beamline
echo $PATH
Getting Help
If you encounter issues not covered here:
- Check the Troubleshooting section
- Review the GitHub Issues
- Create a new issue with:
  - Your operating system
  - Rust version (rustc --version)
  - Complete error messages
  - Steps to reproduce
Performance Considerations
Release vs Debug Builds
Always use release builds for actual data generation:
# Debug build (slower, for development)
cargo build
# Release build (faster, for production use)
cargo build --release
Release builds can be 10-100x faster than debug builds for data generation tasks.
System Resources
Beamline is designed to be memory-efficient, but consider your system resources:
- RAM: 4GB minimum, 8GB+ recommended for large datasets
- Storage: Ensure adequate disk space for generated data
- CPU: Multi-core processors will benefit from parallel processing features
Next Steps
Now that you have Beamline installed and verified, you’re ready to generate your first dataset! In the next section, we’ll walk through creating your first data generation script and producing some sample data.
Your First Data Generation
Now that you have Beamline installed, let’s generate your first dataset! This hands-on tutorial will walk you through creating a simple sensor data generator and understanding the basic concepts.
Quick Start: Using an Example Script
Beamline comes with several example scripts. Let’s start with the sensors example to see data generation in action.
Step 1: Generate Your First Dataset
Run the following command to generate 2 sensor readings:
beamline gen data \
--seed-auto \
--start-auto \
--sample-count 2 \
--script-path partiql-beamline-sim/tests/scripts/sensors.ion
You should see output similar to:
Seed: 5372343081885320050
Start: 2022-01-08T18:38:38.000000000Z
[2022-01-08 18:38:57.155 +00:00:00] : DataSetName("sensors") { 'tick': 19155, 'i8': 57, 'f': 30.103028021670184, 'w': 3.2669, 'd': 2, 'a': 'ed6b2d0c-dd09-4d7e-b1d3-fc16e3547eb5', 'ar1': [1.2, 1.4, 0.8], 'ar2': ['8fe9ee2c-a9e0-462a-8a44-a9abc51e759b', '0411eace-53be-4647-b351-3fa2de9b8e5f'], 'ar3': [3.2669, NULL, 3.0777], 'ar4': [10, 4, 8, 2], 'ar5': ['ed6b2d0c-dd09-4d7e-b1d3-fc16e3547eb5'] }
Congratulations! You’ve just generated your first synthetic dataset with Beamline.
Understanding the Output
Let’s break down what happened:
- Seed: 5372343081885320050 is the random seed that ensures reproducibility
- Start: 2022-01-08T18:38:38.000000000Z is the simulation start time
- Data Records: sensor readings with timestamps, each containing fields such as:
  - tick: a simulation tick counter
  - i8: an 8-bit integer value
  - f: a floating-point sensor value
Step 2: Reproduce the Same Data
Let’s generate the exact same data using the seed from the previous run:
beamline gen data \
--seed 5372343081885320050 \
--start-auto \
--sample-count 2 \
--script-path partiql-beamline-sim/tests/scripts/sensors.ion
Notice that the data values are identical, but the timestamps might be different because we used --start-auto. To get exactly the same output, use the same start time:
beamline gen data \
--seed 5372343081885320050 \
--start-iso "2022-01-08T18:38:38.000000000Z" \
--sample-count 2 \
--script-path partiql-beamline-sim/tests/scripts/sensors.ion
Now you’ll get exactly the same output as the first run!
Understanding the Script
Let’s examine the script that generated this data. Look at the contents of partiql-beamline-sim/tests/scripts/sensors.ion:
rand_processes::{
$n:UniformU8::{
low:2,
high:10
},
sensors:$n::[
rand_process::{
$r:Uniform::{
choices:[
5,
10
]
},
$arrival:HomogeneousPoisson::{
interarrival:minutes::$r
},
$weight:UniformDecimal::{
nullable:0.75,
low:1.995,
high:4.9999,
optional:true
},
$anyof:UniformAnyOf::{
types:[
Tick,
UniformF64,
UUID,
UniformDecimal::{
low:1.995,
high:4.9999,
nullable:false
}
]
},
$array:UniformArray::{
min_size:3,
max_size:3,
element_type:UniformDecimal::{
low:0.5,
high:1.5
}
},
$data:{
tick:Tick,
i8:UniformI8,
f:UniformF64,
w:$weight,
d:UniformDecimal::{
low:0.,
high:42.,
nullable:false
},
a:$anyof,
ar1:$array,
ar2:UniformArray::{
min_size:2,
max_size:4,
element_type:UUID
},
ar3:UniformArray::{
min_size:2,
max_size:4,
element_type:$weight
},
ar4:UniformArray::{
min_size:2,
max_size:4,
element_type:UniformI8::{
low:2,
high:10
}
},
ar5:UniformArray::{
min_size:1,
max_size:1,
element_type:$anyof
}
}
}
]
}
Script Breakdown
- rand_processes::: This annotation tells Beamline that the structure defines random processes
- $n: UniformU8::{ low: 2, high: 10 }: Creates a variable n that generates a random number between 2 and 10
- sensors: $n::[...]: Creates a dataset called "sensors" with n random processes (2-10 processes)
- rand_process::: Defines a single random process within the sensors dataset
- $r: Uniform::{ choices: [5, 10] }: Creates a variable r that randomly selects either 5 or 10
- $arrival: HomogeneousPoisson::{ interarrival: minutes::$r }: Defines how often data arrives (every r minutes, using a Poisson process)
- $data:: Defines the structure of each generated data record, for example:
  - tick: Tick: the current simulation tick
  - i8: UniformI8: a random 8-bit integer
  - f: UniformF64: a random 64-bit float
  - w: $weight: an optional, nullable decimal drawn from the $weight variable
Exploring Different Output Formats
Beamline supports multiple output formats. Let’s try generating the same data in different formats:
Ion Pretty Format
beamline gen data \
--seed 5372343081885320050 \
--start-auto \
--sample-count 3 \
--script-path partiql-beamline-sim/tests/scripts/sensors.ion \
--output-format ion-pretty
This produces nicely formatted Ion output similar to:
{
seed: 12328924104731257599,
start: "2024-01-20T20:05:41.000000000Z",
data: {
sensors: [
{
i8: -21,
tick: 9421,
f: 2.803799956162891e0,
id: 1
},
{
i8: -70,
tick: 12294,
f: 1.7229362418585936e1,
id: 1
},
{
i8: 84,
tick: 32697,
f: -2.4809825455060093e1,
id: 0
}
]
}
}
Text Format (Default)
The default text format is human-readable and great for quick inspection:
beamline gen data \
--seed 5372343081885320050 \
--start-auto \
--sample-count 3 \
--script-path partiql-beamline-sim/tests/scripts/sensors.ion \
--output-format text
Creating Your Own Simple Script
Now let’s create your own script from scratch. Create a new file called my-first-script.ion:
rand_processes::{
simple_data: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: {
timestamp: Instant,
temperature: UniformF64::{ low: 20.0, high: 35.0 },
humidity: UniformF64::{ low: 30.0, high: 80.0 },
sensor_id: UUID,
active: Bool::{ p: 0.9 }
}
}
}
This script creates a simple weather sensor that generates:
- timestamp: the current simulation time
- temperature: a random temperature between 20-35°C
- humidity: a random humidity between 30-80%
- sensor_id: a unique UUID for each reading
- active: a boolean with a 90% chance of being true
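If it helps to see the same record shape outside Ion, here is a rough Python equivalent of that generator. The field names are copied from the script; the implementation details are illustrative, not how Beamline works internally:

```python
import random
import uuid
from datetime import datetime, timezone

def weather_reading(rng):
    """One record with the same shape as my-first-script.ion."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "temperature": rng.uniform(20.0, 35.0),   # UniformF64 low/high
        "humidity": rng.uniform(30.0, 80.0),
        "sensor_id": str(uuid.UUID(int=rng.getrandbits(128))),  # seeded UUID
        "active": rng.random() < 0.9,             # Bool with p = 0.9
    }

rng = random.Random(42)
sample = weather_reading(rng)
assert 20.0 <= sample["temperature"] <= 35.0
```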
Test Your Script
beamline gen data \
--seed 42 \
--start-auto \
--sample-count 5 \
--script-path my-first-script.ion \
--output-format ion-pretty
Understanding Key Concepts
Seeds and Reproducibility
The --seed parameter controls randomness:
- --seed-auto: Generate a random seed (different data each time)
- --seed 42: Use a specific seed (same data each time)
Start Times
The --start parameter controls simulation time:
- --start-auto: Use the current time
- --start-iso "2024-01-01T00:00:00Z": Use a specific time
- --start-epoch-ms 1704067200000: Use epoch milliseconds
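As a sanity check on those two timestamp forms, epoch milliseconds and ISO-8601 are interchangeable. This is plain Python date arithmetic, nothing Beamline-specific:

```python
from datetime import datetime, timezone

epoch_ms = 1704067200000
dt = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)

# 1704067200000 ms since the Unix epoch is exactly 2024-01-01T00:00:00Z
assert dt.isoformat() == "2024-01-01T00:00:00+00:00"

# And back again
assert int(dt.timestamp() * 1000) == epoch_ms
```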
Sample Count
The --sample-count parameter controls how many data points to generate. This is particularly useful for:
- Testing with small datasets
- Generating large datasets for performance testing
- Controlling output size
Common Patterns
Multiple Datasets
You can generate data for specific datasets using the --dataset flag:
beamline gen data \
--seed 42 \
--start-auto \
--sample-count 10 \
--script-path partiql-beamline-sim/tests/scripts/client-service.ion \
--dataset service --dataset client_1 \
--output-format ion-pretty
Controlling Nullability
You can control how often NULL values appear:
beamline gen data \
--seed 42 \
--start-auto \
--sample-count 5 \
--script-path my-first-script.ion \
--default-nullable true \
--pct-null 0.1 # 10% chance of NULL values
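Conceptually, --pct-null injects NULLs into generated values with a fixed probability. A hedged Python sketch of that behavior (the flag's exact semantics in Beamline may differ):

```python
import random

def with_nulls(rng, values, pct_null):
    """Replace each value with None with probability pct_null."""
    return [None if rng.random() < pct_null else v for v in values]

rng = random.Random(42)
data = with_nulls(rng, list(range(10_000)), 0.1)
null_rate = data.count(None) / len(data)
assert 0.08 < null_rate < 0.12   # roughly 10% NULLs
```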
Next Steps
Now that you’ve successfully generated your first datasets, you are ready to dive deeper into Beamline’s capabilities. In the next section, we’ll explore the core concepts that power Beamline’s data generation, including:
- Random processes and stochastic modeling
- Data generators and their configurations
- Temporal modeling and arrival patterns
- Relationships between data elements
Quick Reference
Here are the commands you’ve learned in this chapter:
# Basic data generation
beamline gen data --seed-auto --start-auto --sample-count N --script-path SCRIPT
# Reproducible generation
beamline gen data --seed SEED --start-iso "TIMESTAMP" --sample-count N --script-path SCRIPT
# Different output formats
beamline gen data ... --output-format [text|ion|ion-pretty]
# Specific datasets
beamline gen data ... --dataset DATASET_NAME
# Control nullability
beamline gen data ... --default-nullable true --pct-null 0.1
Congratulations on completing your first data generation with Beamline! You’re now ready to explore more advanced features and create more sophisticated synthetic datasets.
Core Concepts
Before diving deeper into Beamline’s advanced features, it’s essential to understand the fundamental concepts that power its data generation capabilities. This chapter will introduce you to the mathematical and computational foundations that make Beamline both powerful and reliable.
Stochastic Processes
At the heart of Beamline lies the concept of stochastic processes — mathematical models that describe systems appearing to vary randomly over time.
What is a Stochastic Process?
A stochastic process is a collection of random variables indexed by time or space. In simpler terms, it is a way to model how things change randomly over time while still following certain patterns or rules.
Real-world examples:
- Stock prices over time
- Sensor readings from IoT devices
- User activity on a website
- Network traffic patterns
- Temperature measurements
Why Stochastic Processes Matter
Traditional random data generators often produce data that looks random but lacks the realistic patterns found in real-world data. Stochastic processes allow Beamline to:
- Model Temporal Relationships: Data points aren’t just random — they follow realistic time-based patterns
- Create Correlations: Different data elements can be related in meaningful ways
- Simulate Real Patterns: Generate data that behaves like real-world systems
- Maintain Consistency: Ensure generated data follows logical rules and constraints
Example: Sensor Data
Consider a temperature sensor:
- Simple Random: Each reading is completely independent
- Stochastic Process: Readings follow realistic patterns (gradual changes, daily cycles, seasonal trends)
// Simple random (unrealistic)
temperature: UniformF64::{ low: -10.0, high: 40.0 }
// Stochastic process (realistic)
temperature: NormalF64::{ mean: 22.0, std_dev: 5.0 }
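The difference between those two generators is easy to see numerically. This Python sketch (not Beamline code) samples both and compares their spread:

```python
import random
import statistics

rng = random.Random(7)
uniform = [rng.uniform(-10.0, 40.0) for _ in range(10_000)]
normal = [rng.gauss(22.0, 5.0) for _ in range(10_000)]

# Normal readings cluster tightly around the mean, like a real sensor;
# uniform readings are spread evenly across the whole range.
assert statistics.stdev(normal) < statistics.stdev(uniform)
assert abs(statistics.mean(normal) - 22.0) < 0.5
```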
Random Processes in Beamline
Beamline implements stochastic processes through random processes defined in scripts written in the Amazon Ion format.
Anatomy of a Random Process
rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: minutes::5 },
$data: {
// Data structure definition
}
}
Every random process has two key components:
- Arrival Process ($arrival): Defines the statistical pattern of new data arrivals, i.e., when the data arrives
- Data Structure ($data): Defines what data is generated
Arrival Processes
Arrival processes control the timing of data generation. At the moment, Beamline supports only the homogeneous Poisson process:
Homogeneous Poisson Process
The most common arrival process, modeling events that occur at a constant average rate:
$arrival: HomogeneousPoisson:: { interarrival: minutes::5 }
Characteristics:
- Events occur independently
- Average rate is constant over time
- Time between events follows an exponential distribution
- Models many real-world phenomena (customer arrivals, system events, etc.)
Use cases:
- Web server requests
- Sensor readings
- User logins
- System alerts
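For intuition, a homogeneous Poisson process with mean interarrival time r is equivalent to drawing successive gaps from an exponential distribution with rate 1/r. A small Python sketch of that equivalence (illustrative only, not Beamline's implementation):

```python
import random

def poisson_arrivals(seed, mean_interarrival_min, horizon_min):
    """Arrival times (in minutes) of a homogeneous Poisson process."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        # Exponential gap with rate 1 / mean_interarrival
        t += rng.expovariate(1.0 / mean_interarrival_min)
        if t > horizon_min:
            return arrivals
        arrivals.append(t)

# With a 5-minute mean interarrival, about 12 arrivals are expected per hour
times = poisson_arrivals(42, 5.0, 60.0)
assert all(a < b for a, b in zip(times, times[1:]))  # strictly increasing
```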
Time Units
Beamline supports various time units for arrival processes:
// Different time units
seconds::30 // 30 seconds
minutes::5 // 5 minutes
hours::2 // 2 hours
days::1 // 1 day
milliseconds::100 // 100 milliseconds
Data Generators
Data generators define the structure and content of generated data. They use probability distributions to create realistic values.
Probability Distributions
Beamline supports many probability distributions, each suited for different types of data:
Uniform Distributions
Generate values where each value in a range is equally likely:
// Discrete uniform (integers)
age: UniformU8::{ low: 18, high: 65 }
// Continuous uniform (floats)
temperature: UniformF64::{ low: 20.0, high: 30.0 }
// Uniform choice from literals
status: Uniform::{ choices: ["active", "inactive", "pending"] }
Use cases:
- IDs, categories, discrete choices
- Baseline random values
- Testing edge cases
Normal (Gaussian) Distributions
Generate values that cluster around a mean with a bell-curve distribution:
height: NormalF64::{ mean: 170.0, std_dev: 10.0 }
Characteristics:
- Most values near the mean
- Symmetric distribution
- Models many natural phenomena
Use cases:
- Physical measurements (height, weight)
- Performance metrics
- Error values
Other Distributions
// Exponential (for modeling wait times)
response_time: ExpF64::{ rate: 0.1 }
// Log-normal (for modeling sizes, prices)
file_size: LogNormalF64::{ location: 10.0, scale: 1.0 }
// Weibull (for modeling lifetimes, reliability)
device_lifetime: WeibullF64::{ shape: 2.0, scale: 1000.0 }
Data Types
Beamline supports the following data types:
Scalar Types
// Numbers
integer_val: UniformI32::{ low: 1, high: 1000 }
float_val: UniformF64::{ low: 0.0, high: 1.0 }
decimal_val: UniformDecimal::{ low: 1.99, high: 999.99 }
// Text
name: LoremIpsumTitle
description: LoremIpsum::{ min_words: 10, max_words: 50 }
pattern_text: Regex::{ pattern: "[A-Z]{2}[0-9]{4}" }
// Boolean
active: Bool::{ p: 0.8 } // 80% chance of true
// Temporal
created_at: Instant
birth_date: Date
// Identifiers
user_id: UUID
Complex Types
// Structures
user: {
id: UUID,
name: LoremIpsumTitle,
age: UniformU8::{ low: 18, high: 65 },
preferences: {
theme: Uniform::{ choices: ["light", "dark"] },
notifications: Bool::{ p: 0.7 }
}
}
// Arrays
tags: UniformArray::{
min_size: 1,
max_size: 5,
element_type: LoremIpsumTitle
}
// Union types
value: UniformAnyOf::{ types: [
UniformI32::{ low: 1, high: 100 },
LoremIpsumTitle,
Bool
]}
Variables and References
Beamline supports variables for creating relationships and reusing values:
Variable Definition
rand_processes::{
$n: UniformU8::{ low: 2, high: 10 },
sensors: $n::[
rand_process::{
$r: Uniform::{ choices: [5,10] },
$arrival: HomogeneousPoisson:: { interarrival: minutes::$r },
$weight: UniformDecimal::{ nullable: 0.75, low: 1.995, high: 4.9999, optional: true },
$anyof: UniformAnyOf::{ types: [Tick, UniformF64, UUID, UniformDecimal::{ low: 1.995, high: 4.9999, nullable: false }] },
$array: UniformArray::{
min_size: 3,
max_size: 3,
element_type: UniformDecimal::{ low: 0.5, high: 1.5 }
},
$data: {
tick: Tick,
i8: UniformI8,
f: UniformF64,
w: $weight,
d: UniformDecimal::{ low: 0d0, high: 4.2d1, nullable: false },
a: $anyof,
ar1: $array,
ar2: UniformArray::{ min_size: 2, max_size: 4, element_type: UUID },
ar3: UniformArray::{ min_size: 2, max_size: 4, element_type: $weight },
ar4: UniformArray::{ min_size: 2, max_size: 4, element_type: UniformI8::{ low: 2, high: 10 } },
ar5: UniformArray::{ min_size: 1, max_size: 1, element_type: $anyof }
}
}
],
}
Variable Types
Generator Variables
Store data generators for reuse:
$temperature_sensor: NormalF64::{ mean: 22.0, std_dev: 3.0 }
$id_gen: UUID
Value Variables
Store computed values:
$success_rate: UniformF64::{ low: 0.95, high: 1.0 },
$is_successful: Bool::{ p: $success_rate }
Evaluation Control
Control when variables are evaluated:
// Evaluate once at script read time
$user_id: $id_gen::()
// Evaluate each time it's used
$request_id: $id_gen
Datasets and Collections
Beamline organizes generated data into datasets, which represent collections of related data.
Single Dataset
rand_processes::{
sensors: rand_process::{
$data: { /* sensor data */ }
}
}
Multiple Datasets
rand_processes::{
users: rand_process::{
$data: { /* user data */ }
},
orders: rand_process::{
$data: { /* order data */ }
}
}
Dynamic Datasets
Create multiple related datasets:
rand_processes::{
$n: UniformU8::{ low: 3, high: 8 },
// Creates client_1, client_2, ..., client_n datasets
clients: $n::[
'client_{ $@n }': rand_process::{
$data: {
client_id: '$@n',
// ... other fields
}
}
]
}
Reproducibility and Determinism
One of Beamline’s key strengths is its ability to generate reproducible data.
Seeds
Seeds control the random number generation:
# Same seed = same data
beamline gen data --seed 42 --start-auto --script-path my-script.ion
beamline gen data --seed 42 --start-auto --script-path my-script.ion # Identical output
Timestamps
Control the simulation start time:
# Same timestamp = same temporal patterns
beamline gen data --seed 42 --start-iso "2024-01-01T00:00:00Z" --script-path my-script.ion
Deterministic Behavior
Beamline ensures that:
- Same inputs always produce same outputs
- Random sequences are predictable and reproducible
- Debugging is possible with consistent data
- Tests can be reliable and repeatable
Static vs. Dynamic Data
Beamline supports both static and dynamic data generation:
Dynamic Data (Default)
Generated during simulation with temporal patterns:
rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: minutes::5 },
$data: {
timestamp: Instant,
value: UniformF64
}
}
Static Data
Generated once at the beginning of simulation:
static_data::{
$data: {
id: UUID,
created_at: Instant, // Will be simulation start time
config: LoremIpsum
}
}
Use cases for static data:
- Reference tables
- Configuration data
- Lookup tables
- Master data
Summary
Understanding these core concepts is crucial for effectively using Beamline:
- Stochastic Processes: Mathematical foundation for realistic data patterns
- Random Processes: Implementation of stochastic processes in Beamline
- Arrival Processes: Control timing of data generation
- Data Generators: Create realistic values using probability distributions
- Variables: Enable relationships and reuse in data generation
- Datasets: Organize generated data into meaningful collections
- Reproducibility: Ensure consistent, debuggable data generation
- Static vs. Dynamic: Choose appropriate data generation patterns
In the next chapter, we’ll dive deeper into scripts and random processes, exploring how to create more sophisticated data generation patterns and relationships.
Reproducible Data Generation
One of Beamline’s core strengths is its ability to generate reproducible data — the same input parameters will always produce exactly the same output data, no matter when or where you run the generation process.
What is Reproducibility?
Reproducible data generation means that given the same:
- Seed value (random number generator seed)
- Configuration parameters (Ion script, generators, etc.)
- Timestamp (starting time for temporal data)
- Environment (same version of Beamline)
You will get exactly the same data every single time, down to the last byte.
Why Reproducibility Matters
Debugging and Testing
# First run - discovers a bug with specific data
beamline gen data --seed 12345 --start-auto --script-path my_script.ion
# Later run - reproduce exact same data to debug
beamline gen data --seed 12345 --start-auto --script-path my_script.ion
When you find a bug or unexpected behavior in your tests, reproducibility lets you generate the exact same problematic data to investigate and fix the issue.
Consistent Benchmarking
# Performance test run 1
beamline gen data --seed 42 --start-auto --sample-count 1000000 --script-path perf_test.ion
# Performance test run 2 (weeks later)
beamline gen data --seed 42 --start-auto --sample-count 1000000 --script-path perf_test.ion
For meaningful performance comparisons, you need identical datasets. Reproducibility ensures your benchmarks are comparing like with like.
AI Model Training
# Training dataset generation
beamline gen data --seed 789 --start-auto --script-path training_data.ion --sample-count 50000
# Later: regenerate exact same training data for model comparison
beamline gen data --seed 789 --start-auto --script-path training_data.ion --sample-count 50000
When training machine learning models, being able to regenerate identical training data is crucial for comparing model performance and reproducing results.
Regression Testing
# Original test data
beamline gen data --seed 2024 --start-auto --script-path integration_test.ion
# After code changes - same test data to verify no regressions
beamline gen data --seed 2024 --start-auto --script-path integration_test.ion
Regression testing requires the same test data to verify that code changes don’t break existing functionality.
How Seeds Work
Pseudorandom Number Generation
Beamline uses cryptographically secure pseudorandom number generators (PRNGs) that are initialized with a seed value:
# Different seeds = different data
beamline gen data --seed 1 --start-auto --script-path test.ion # Generates dataset A
beamline gen data --seed 2 --start-auto --script-path test.ion # Generates dataset B
# Same seed = identical data
beamline gen data --seed 1 --start-auto --script-path test.ion # Generates dataset A (identical)
beamline gen data --seed 1 --start-auto --script-path test.ion # Generates dataset A (identical)
Seed Propagation
Seeds propagate through the entire generation process:
- Data generators use the seed for all random decisions
- Stochastic processes use the seed for temporal modeling
- Nested structures maintain seed consistency across all levels
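One common way to achieve this kind of propagation (a hedged sketch of the general technique, not Beamline's actual scheme) is to derive an independent child seed for each nested generator from the parent seed plus a stable field path, so sibling generators never share a random stream:

```python
import hashlib

def child_seed(parent_seed: int, path: str) -> int:
    # Hash the parent seed together with a stable field path so every
    # nested generator gets its own deterministic, independent seed.
    digest = hashlib.sha256(f"{parent_seed}/{path}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

root = 42
# Deterministic: the same path always maps to the same child seed...
assert child_seed(root, "orders.price") == child_seed(root, "orders.price")
# ...while sibling fields get distinct, independent streams.
assert child_seed(root, "orders.price") != child_seed(root, "orders.id")
```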
Default Seeds
If you don’t specify a seed, Beamline uses a default seed derived from the configuration:
# These generate identical data (same default seed)
beamline gen data --seed-auto --start-auto --script-path my_script.ion
beamline gen data --seed-auto --start-auto --script-path my_script.ion
# This generates different data (explicit different seed)
beamline gen data --seed 999 --start-auto --script-path my_script.ion
Reproducibility Scope
What IS Reproduced
- ✅ Data Values: All generated numbers, strings, booleans, etc.
- ✅ Data Structure: Object nesting, array lengths, field presence
- ✅ Temporal Patterns: Event timestamps and intervals
- ✅ Statistical Distributions: Same distribution samples
- ✅ Relationships: Cross-field correlations and dependencies
What Might VARY
❌ Beamline Version: Different versions may produce different output
❌ System Architecture: 32-bit vs 64-bit might have subtle differences
❌ Floating Point: Different CPUs might have tiny precision differences
❌ Ion Formatting: Whitespace and formatting might vary slightly
Best Practices
1. Always Specify Seeds for Important Use Cases
# Good - explicit seed for reproducible testing
beamline gen data --seed 12345 --start-auto --script-path test_suite.ion
# Avoid - relying on default seed might change
beamline gen data --seed-auto --start-auto --script-path test_suite.ion
2. Document Your Seeds
# Document seeds in your scripts or README
# Training data: seed 2024
# Test data: seed 2025
# Performance benchmark: seed 3000
3. Use Meaningful Seed Values
# Use dates, version numbers, or meaningful identifiers
beamline gen data --seed 20241212 --start-auto --script-path data.ion # Today's date
beamline gen data --seed 100 --start-auto --script-path v1.0.0.ion # Version-based
4. Pin Beamline Version for Critical Use Cases
# In your Cargo.toml or requirements
partiql-beamline = "=1.2.3" # Exact version for reproducibility
5. Store Configuration Alongside Data
# Save configuration for later reproduction
beamline gen data \
--seed 42 \
--start-auto \
--script-path production_test.ion \
--sample-count 1000 \
--output-format ion-pretty > data.ion
# Document generation parameters separately
echo "Seed: 42, Script: production_test.ion, Count: 1000" > config.txt
Examples
Basic Reproducibility
# Generate same data multiple times
$ beamline gen data --seed 100 --start-auto --sample-count 3 --script-path simple.ion
[1, 2, 5]
$ beamline gen data --seed 100 --start-auto --sample-count 3 --script-path simple.ion
[1, 2, 5] # Identical output
$ beamline gen data --seed 101 --start-auto --sample-count 3 --script-path simple.ion
[7, 1, 9] # Different seed = different data
Complex Structure Reproducibility
# Complex nested structures are also reproducible
$ beamline gen data --seed 200 --start-auto --sample-count 1 --script-path complex.ion
{
id: 42,
name: "Alice Johnson",
scores: [85, 92, 78],
metadata: {
timestamp: 2024-01-15T10:30:00Z,
active: true
}
}
# Run again with same seed
$ beamline gen data --seed 200 --start-auto --sample-count 1 --script-path complex.ion
{
id: 42,
name: "Alice Johnson", # Identical name
scores: [85, 92, 78], # Identical scores
metadata: {
timestamp: 2024-01-15T10:30:00Z, # Identical timestamp
active: true # Identical boolean
}
}
Time-based Reproducibility
# Even temporal data is reproducible
$ beamline gen data --seed 300 --start-iso "2024-01-01T00:00:00Z" --script-path events.ion --sample-count 3
[
{ event: "login", time: "2024-01-01T00:12:34Z" },
{ event: "action", time: "2024-01-01T00:15:47Z" },
{ event: "logout", time: "2024-01-01T00:23:12Z" }
]
# Same seed + same start time = identical temporal patterns
$ beamline gen data --seed 300 --start-iso "2024-01-01T00:00:00Z" --script-path events.ion --sample-count 3
[
{ event: "login", time: "2024-01-01T00:12:34Z" }, # Same intervals
{ event: "action", time: "2024-01-01T00:15:47Z" }, # Same timestamps
{ event: "logout", time: "2024-01-01T00:23:12Z" } # Exact reproduction
]
Troubleshooting Reproducibility
Issue: Getting Different Data with Same Seed
Possible Causes:
- Different Beamline versions
- Different script files
- Different command-line parameters
- Different system architectures
Solution:
# Check version
beamline --version
# Use exact same command-line parameters
beamline gen data --seed 123 --start-auto --sample-count 100 --script-path exact_same_script.ion
# Verify script file hasn't changed (use checksums)
sha256sum my_script.ion
Issue: Need to Break Reproducibility
Sometimes you want different data each run:
# Use current timestamp as seed
beamline gen data --seed $(date +%s) --start-auto --script-path varied_data.ion
# Use random seed
beamline gen data --seed $RANDOM --start-auto --script-path varied_data.ion
# Let Beamline generate a random seed
beamline gen data --seed-auto --start-auto --script-path varied_data.ion
Next Steps
Now that you understand reproducible data generation, you’re ready to learn about Scripts and Processes, which will show you how to configure and control the data generation process through Ion-based scripts.
Scripts and Random Processes
Beamline uses Ion-based scripts to define data generation configurations and stochastic processes to model how data arrives and evolves over time. This combination provides powerful, flexible control over synthetic data generation.
Ion Scripts Overview
What are Ion Scripts?
Ion scripts are configuration files written in Amazon Ion format that define:
- What data to generate (data types, structures, values)
- How data arrives (temporal patterns, frequencies)
- How data relates (cross-field dependencies, correlations)
- How much data (counts, durations, stopping conditions)
Basic Script Structure
Every Beamline script follows this structure:
rand_processes::{
// Variable definitions (optional)
$variable_name: GeneratorType::{ configuration },
// Dataset definitions (required)
dataset_name: rand_process::{
$arrival: ArrivalProcess::{ configuration },
$data: {
field_name: GeneratorType::{ configuration },
// ... more fields
}
}
}
Real Example from Test Suite
From sensors.ion test script:
rand_processes::{
$n: UniformU8::{ low: 2, high: 10 },
sensors: $n::[
rand_process::{
$r: Uniform::{ choices: [5,10] },
$arrival: HomogeneousPoisson:: { interarrival: minutes::$r },
$weight: UniformDecimal::{ nullable: 0.75, low: 1.995, high: 4.9999, optional: true },
$data: {
tick: Tick,
i8: UniformI8,
f: UniformF64,
w: $weight,
d: UniformDecimal::{ low: 0d0, high: 4.2d1, nullable: false }
}
}
]
}
Ion Format Benefits
Ion provides several advantages for configuration:
- Type Safety: Native support for numbers, strings, booleans, timestamps
- Comments: Document your configuration inline with //
- Annotations: Add type annotations like minutes::$r
- Nested Structures: Define complex object hierarchies naturally
- Variable References: Use $variable for reusable components
Stochastic Processes
What are Stochastic Processes?
Stochastic processes are mathematical models that describe how events occur over time in a seemingly random but statistically predictable way. In Beamline, they’re defined using the $arrival field in rand_process blocks.
Arrival Process Types
1. Homogeneous Poisson Process
Models events that occur at a constant average rate with random intervals:
rand_processes::{
sensor_readings: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: minutes::5 },
$data: {
sensor_id: UUID,
reading: UniformF64::{ low: 0.0, high: 100.0 },
timestamp: Instant
}
}
}
Time Units:
- milliseconds::N - N milliseconds between events
- seconds::N - N seconds between events
- minutes::N - N minutes between events
- hours::N - N hours between events
- days::N - N days between events
Use Cases:
- User logins to a website
- Network packet arrivals
- Customer service calls
- Sensor readings
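Under the hood, a homogeneous Poisson process with mean interarrival time m produces gaps drawn from an exponential distribution with rate 1/m. A small Python sketch of that sampling (an illustration of the math, not Beamline's code):

```python
import random

def poisson_event_times(mean_interarrival: float, n: int, seed: int = 0) -> list[float]:
    """Simulate event times of a homogeneous Poisson process.

    Interarrival gaps are i.i.d. Exponential(rate = 1 / mean_interarrival).
    """
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        t += rng.expovariate(1.0 / mean_interarrival)  # one exponential gap
        times.append(t)
    return times

events = poisson_event_times(mean_interarrival=5.0, n=1000, seed=1)
gaps = [b - a for a, b in zip([0.0] + events, events)]
# The average gap converges to the configured mean interarrival time (~5.0).
assert 4.0 < sum(gaps) / len(gaps) < 6.0
```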
2. Variable Arrival Rates
Use variables to create dynamic arrival patterns:
rand_processes::{
user_events: rand_process::{
$r: Uniform::{ choices: [2, 5, 10] }, // Variable rate
$arrival: HomogeneousPoisson:: { interarrival: minutes::$r },
$data: {
event_type: Uniform::{ choices: ["login", "logout", "action"] },
user_id: UUID
}
}
}
Data Generators
Basic Generator Types
From the actual implementation:
Numeric Generators
rand_processes::{
numeric_data: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: {
// Integer generators
small_int: UniformI8, // -128 to 127
medium_int: UniformI16::{ low: 100, high: 1000 }, // Custom range
large_int: UniformU32::{ low: 1, high: 1000000 }, // Unsigned
// Float generators
decimal_value: UniformDecimal::{ low: 1.99, high: 99.99 }, // Exact decimal
float_value: UniformF64::{ low: 0.0, high: 1.0 }, // Float
// Statistical distributions
normal_score: NormalF64::{ mean: 100.0, std_dev: 15.0 },
exponential_wait: ExpF64::{ rate: 0.1 },
weibull_lifetime: WeibullF64::{ shape: 2.0, scale: 1000.0 }
}
}
}
String Generators
rand_processes::{
text_data: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::2 },
$data: {
// UUID generator
id: UUID,
// Lorem Ipsum text
description: LoremIpsum::{ min_words: 5, max_words: 20 },
title: LoremIpsumTitle, // 3-8 title-cased words
// Regular expressions
country_code: Regex::{ pattern: "[A-Z]{2}" },
phone: Regex::{ pattern: "[0-9]{3}-[0-9]{3}-[0-9]{4}" },
// Format strings with variables
formatted_name: Format::{ pattern: "User #{UUID}" }
}
}
}
System Generators
rand_processes::{
system_data: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: {
// System state generators
current_time: Instant, // Current simulation time
current_date: Date, // Current simulation date
event_tick: Tick, // Current tick counter
// Boolean generator
active: Bool, // 50% true by default
premium: Bool::{ p: 0.1 } // 10% true
}
}
}
Complex Type Generators
rand_processes::{
complex_data: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::5 },
$data: {
// Array generator
measurements: UniformArray::{
min_size: 3,
max_size: 8,
element_type: UniformF64::{ low: 0.0, high: 100.0 }
},
// Union type generator (any of several types)
mixed_value: UniformAnyOf::{
types: [
UUID,
UniformI32::{ low: 1, high: 1000 },
LoremIpsumTitle
]
},
// Choice from literals
status: Uniform::{ choices: ["active", "inactive", "pending"] }
}
}
}
Advanced Script Features
Variable Definitions and References
From the real client-service.ion script:
rand_processes::{
// Define reusable generators
$n: UniformU8::{ low: 5, high: 20 },
$id_gen: UUID,
$rid_gen: UUID,
requests: $n::[
{
// Force evaluation at script read time
$id: $id_gen::(),
$rate: UniformF64::{ low: 0.995e0, high: 1.0e0 },
$success: Bool::{ p: $rate },
service: rand_process::{
$r: UniformU8::{ low: 20, high: 150 },
$arrival: HomogeneousPoisson:: { interarrival: milliseconds::$r },
$data: {
Request: $rid_gen,
Account: $id,
client: Format::{ pattern: "customer #{ $@n }" },
success: $success
}
}
}
]
}
Key concepts:
- Variables: $variable_name for reusable generators
- Forced evaluation: $id_gen::() evaluates once at script read time
- Loop arrays: $n::[...] creates N instances
- Loop index: $@n accesses the current iteration index
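The difference between a generator reference and forced evaluation can be mimicked in Python: a generator behaves like a zero-argument function sampled anew per record, while ::() corresponds to calling it once and binding the result (a conceptual sketch, not Beamline's implementation):

```python
import random
import uuid

rng = random.Random(0)

# A generator is like a zero-argument function: sampled anew per record.
id_gen = lambda: str(uuid.UUID(int=rng.getrandbits(128), version=4))

# Forced evaluation ($id_gen::()) binds one sampled value up front...
fixed_id = id_gen()
records = [{"customer": fixed_id, "request": id_gen()} for _ in range(3)]

# ...so every record shares the same customer ID,
assert len({r["customer"] for r in records}) == 1
# while the per-record generator yields a fresh request ID each time.
assert len({r["request"] for r in records}) == 3
```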
Static Data
From the orders.ion test script:
rand_processes::{
$n: UniformU8::{ low: 5, high: 20 },
$id_gen: UUID,
customers: $n::[
{
$id: $id_gen::(),
// Static data - generated once at simulation start
customer_table: static_data::{
$data: {
id: $id,
address: Format::{ pattern: "{ $@n } Foo Bar Ave" }
}
},
// Dynamic data - generated over time
orders: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: days::UniformU8::{ low: 1, high: 150 } },
$data: {
Order: UUID,
Customer: $id,
Time: Instant
}
}
}
]
}
Nullability and Optionality
Real syntax from test scripts:
rand_processes::{
nullable_data: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: {
// 75% chance of NULL, can also be MISSING
weight: UniformDecimal::{
nullable: 0.75,
optional: true,
low: 1.995,
high: 4.9999
},
// Never NULL
id: UUID::{ nullable: false },
// 10% chance of MISSING (field won't appear)
optional_field: UniformI32::{ optional: 0.1, low: 1, high: 100 }
}
}
}
Real Script Examples
Simple Sensor Script
Based on the actual sensors.ion test:
rand_processes::{
$n: UniformU8::{ low: 2, high: 10 },
sensors: $n::[
rand_process::{
$r: Uniform::{ choices: [5,10] },
$arrival: HomogeneousPoisson:: { interarrival: minutes::$r },
$weight: UniformDecimal::{ nullable: 0.75, low: 1.995, high: 4.9999, optional: true },
$data: {
tick: Tick,
i8: UniformI8,
f: UniformF64,
w: $weight,
d: UniformDecimal::{ low: 0d0, high: 4.2d1, nullable: false }
}
}
]
}
Test this script:
target/release/beamline gen data \
--seed 42 \
--start-auto \
--script-path partiql-beamline-sim/tests/scripts/sensors.ion \
--sample-count 10 \
--output-format ion-pretty
Client-Service System
Based on client-service.ion test:
rand_processes::{
// Generate between 5 & 20 customers
$n: UniformU8::{ low: 5, high: 20 },
// Shared generators
$id_gen: UUID,
$rid_gen: UUID,
requests: $n::[
{
// Each customer gets unique ID
$id: $id_gen::(),
$rate: UniformF64::{ low: 0.995e0, high: 1.0e0 },
$success: Bool::{ p: $rate },
// Service dataset
service: rand_process::{
$r: UniformU8::{ low: 20, high: 150 },
$arrival: HomogeneousPoisson:: { interarrival: milliseconds::$r },
$data: {
Request: $rid_gen,
StartTime: Instant,
Program: "FancyService",
Operation: "GetMyData",
Account: $id,
client: Format::{ pattern: "customer #{ $@n }" },
success: $success
}
},
// Individual client datasets
'client_{ $@n }': rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: milliseconds::$r },
$data: {
id: $id,
request_time: Instant,
request_id: $rid_gen,
success: $success
}
}
}
]
}
Transaction Data Script
Based on simple_transactions.ion test:
rand_processes::{
test_data: rand_process::{
$r: Uniform::{ choices: [5,10] },
$arrival: HomogeneousPoisson:: { interarrival: milliseconds::$r },
$data: {
transaction_id: UUID::{ nullable: false },
marketplace_id: UniformU8::{ nullable: false },
country_code: Regex::{ pattern: "[A-Z]{2}" },
created_at: Instant,
completed: Bool,
description: LoremIpsum::{ min_words:10, max_words:200 },
price: UniformDecimal::{ low: 2.99, high: 99999.99, optional: true }
}
}
}
Advanced Script Patterns
Complex Statistical Distributions
From numbers.ion test script:
rand_processes::{
test_data: rand_process::{
$r: Uniform::{ choices: [5,10] },
$arrival: HomogeneousPoisson:: { interarrival: milliseconds::$r },
$data: {
uniform: {
// Uniform distributions
uniform_u8: UniformU8::{ low: 13, high: 42 },
uniform_f64: UniformF64::{ low: -13.0, high: 42.0 },
uniform_decimal: UniformDecimal::{ low: 0.995, high: 499.9999 }
},
statistical: {
// Statistical distributions
normal: NormalF64::{ mean: 14.3, std_dev: 3.0 },
lognormal: LogNormalF64::{ location: 14.3, scale: 3.0 },
weibull: WeibullF64::{ shape: 14.3, scale: 3.0 },
exponential: ExpF64::{ rate: 3.0 }
},
// With nullability and optionality
nullable_field: UniformI32::{
nullable: 0.2, // 20% NULL
optional: 0.1, // 10% MISSING
low: 1,
high: 100
}
}
}
}
Multiple Datasets with Relationships
Real pattern from client-service.ion:
rand_processes::{
$n: UniformU8::{ low: 5, high: 20 },
$id_gen: UUID,
requests: $n::[
{
$id: $id_gen::(), // One ID per customer
$rid_gen: UUID, // Separate request ID generator per customer
// Shared service dataset
service: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: milliseconds::50 },
$data: {
Request: $rid_gen,
StartTime: Instant,
Account: $id,
client: Format::{ pattern: "customer #{ $@n }" }
}
},
// Individual client dataset for this customer
'client_{ $@n }': rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: milliseconds::50 },
$data: {
id: $id,
request_time: Instant,
request_id: $rid_gen
}
}
}
]
}
Static Data with Dynamic References
From orders.ion test script:
rand_processes::{
$n: UniformU8::{ low: 5, high: 20 },
$id_gen: UUID,
$oid_gen: UUID,
customers: $n::[
{
$id: $id_gen::(),
// Static customer data (generated once)
customer_table: static_data::{
$data: {
id: $id,
address: Format::{ pattern: "{ $@n } Foo Bar Ave" }
}
},
// Dynamic orders (generated over time)
orders: rand_process::{
$r: UniformU8::{ low: 1, high: 150 },
$arrival: HomogeneousPoisson:: { interarrival: days::$r },
$data: {
Order: $oid_gen,
Time: Instant,
Customer: $id // Links to customer_table
}
}
}
]
}
Script Testing and Validation
Testing Script Syntax
# Test script with minimal data generation
target/release/beamline gen data \
--seed 1 \
--start-auto \
--script-path new_script.ion \
--sample-count 1
# Check inferred schema
target/release/beamline infer-shape \
--seed 1 \
--start-auto \
--script-path new_script.ion \
--output-format basic-ddl
Testing with Small Samples
# Test each dataset individually
target/release/beamline gen data \
--seed 42 \
--start-auto \
--script-path complex_script.ion \
--sample-count 5 \
--dataset specific_dataset
# Test all datasets with small sample
target/release/beamline gen data \
--seed 42 \
--start-auto \
--script-path complex_script.ion \
--sample-count 5 \
--output-format text
Best Practices
1. Use Real Test Script Patterns
// Good - follows actual Beamline syntax
rand_processes::{
$arrival_rate: Uniform::{ choices: [5, 10] },
events: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: minutes::$arrival_rate },
$data: {
event_id: UUID,
timestamp: Instant,
value: UniformF64::{ low: 0.0, high: 100.0 }
}
}
}
2. Test Scripts Incrementally
# Start with basic structure
echo 'rand_processes::{ test: rand_process::{ $arrival: HomogeneousPoisson:: { interarrival: seconds::1 }, $data: { id: UUID } } }' > minimal.ion
# Test basic structure
target/release/beamline gen data --seed 1 --start-auto --script-path minimal.ion --sample-count 3
3. Use Meaningful Variable Names
rand_processes::{
// Clear variable names
$customer_count: UniformU8::{ low: 10, high: 50 },
$order_frequency: Uniform::{ choices: [1, 3, 7] }, // Days
$customer_id_generator: UUID,
orders: $customer_count::[
rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: days::$order_frequency },
$data: {
customer_id: $customer_id_generator,
order_time: Instant
}
}
]
}
4. Document Complex Patterns
rand_processes::{
// === Customer Simulation Configuration ===
// Generate 10-50 customers, each placing orders every 1-30 days
$customer_count: UniformU8::{ low: 10, high: 50 },
$shared_customer_id: UUID,
customer_orders: $customer_count::[
{
// Each customer gets unique ID for all their orders
$id: $shared_customer_id::(),
// Customer places orders with variable frequency
orders: rand_process::{
$days_between_orders: UniformU8::{ low: 1, high: 30 },
$arrival: HomogeneousPoisson:: { interarrival: days::$days_between_orders },
$data: {
customer_id: $id,
order_id: UUID,
order_time: Instant,
amount: UniformDecimal::{ low: 10.00, high: 500.00 }
}
}
}
]
}
Common Script Errors and Solutions
Error: Invalid Ion Syntax
// Wrong - missing closing brace
rand_processes::{
test: rand_process::{
$data: { id: UUID }
// Missing closing brace for rand_processes
Error: Missing Required Fields
// Wrong - missing $arrival
rand_processes::{
test: rand_process::{
$data: { id: UUID } // Missing $arrival definition
}
}
// Correct
rand_processes::{
test: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: { id: UUID }
}
}
Error: Invalid Generator Configuration
// Wrong - low > high
rand_processes::{
test: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: {
bad_range: UniformI32::{ low: 100, high: 50 } // Invalid
}
}
}
// Correct
rand_processes::{
test: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: {
good_range: UniformI32::{ low: 50, high: 100 }
}
}
}
Performance Optimization
Efficient Generator Usage
rand_processes::{
// Efficient - reuse expensive generators
$expensive_distribution: NormalF64::{ mean: 100.0, std_dev: 15.0 },
$simple_uuid: UUID,
efficient_data: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: {
// Reuse expensive distribution
score1: $expensive_distribution,
score2: $expensive_distribution,
score3: $expensive_distribution,
// Simple generators are fast
id: $simple_uuid,
active: Bool,
count: UniformI32::{ low: 1, high: 1000 }
}
}
}
Testing Commands
# Test with small samples first
target/release/beamline gen data \
--seed 1 \
--start-auto \
--script-path new_script.ion \
--sample-count 5 \
--output-format text
# Scale up after validation
target/release/beamline gen data \
--seed 1 \
--start-auto \
--script-path new_script.ion \
--sample-count 10000 \
--output-format ion-binary
Next Steps
Now that you understand real Ion scripts and stochastic processes, you’re ready to dive deeper into the Data Generation section, where you’ll learn about specific generator types, output formats, and advanced data modeling techniques using the actual Beamline syntax.
Data Generation Overview
Beamline’s data generation system creates synthetic data using stochastic processes and probability distributions. The system is built around three core concepts: random processes, value generators, and temporal modeling.
Architecture Overview
Data generation in Beamline follows a layered architecture:
- Random Processes — Mathematical models that describe how events occur over time
- Value Generators — Components that create specific data types and values
- Arrival Times — Models for when events occur in the simulation
- Simulation Context — Manages state, timing, and reproducibility
Core Concepts
Random Processes
A Random Process (also called Stochastic Process) is a mathematical model of systems that appear to vary randomly over time. In Beamline, these processes control:
- When data arrives (temporal patterns)
- What data is generated (value types and structures)
- How data relates (cross-field dependencies)
Value Generators
Value Generators are the building blocks that create actual data values. They can generate:
- Scalar values: numbers, strings, booleans, timestamps
- Complex structures: objects, arrays, nested data
- Statistical distributions: normal, exponential, Weibull, etc.
- Specialized types: UUIDs, formatted text, regex patterns
Each generator can be configured for:
- Nullability: Probability of generating NULL values
- Optionality: Probability of generating MISSING values
- Value ranges: Minimum and maximum bounds
- Distribution parameters: Mean, standard deviation, shape, scale
Temporal Modeling
Beamline models data generation as events occurring over time using:
- Arrival processes: When events occur (e.g. Poisson Point Process)
- Simulation time: Virtual time that advances as events are generated
- Tick counters: Global state that increments with each event
- Instant generators: Current simulation time when values are created
Ion Script Structure
All data generation is controlled through Amazon Ion scripts with this basic structure:
rand_processes::{
// Variable definitions
$variable_name: GeneratorType::{ configuration },
// Dataset definitions
dataset_name: dataset_configuration
}
Variable Definitions
Variables allow you to define generators once and reuse them:
rand_processes::{
// Define reusable generators
$id_generator: UUID,
$weight_generator: UniformDecimal::{ low: 1.0, high: 10.0 },
$count_range: UniformU8::{ low: 5, high: 20 },
// Use variables in dataset definitions
products: $count_range::[
// ... uses $id_generator and $weight_generator
]
}
Dataset Configurations
Datasets can be configured in several ways:
1. Single Random Process
dataset_name: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
$data: {
id: UUID,
value: UniformF64
}
}
2. Static Data (Generated Once)
dataset_name: static_data::{
$data: {
id: UUID,
name: LoremIpsumTitle
}
}
3. Multiple Instances with Loops
$n: UniformU8::{ low: 2, high: 5 },
dataset_name: $n::[
rand_process::{
$data: {
instance_id: '$@n', // Current loop index
value: UniformF64
}
}
]
Data Generator Types
Basic Generators
| Generator | Description | PartiQL Type | Configuration |
|---|---|---|---|
| Bool | Boolean values | BOOL | p: f64 (probability of true, default: 0.5) |
| UUID | UUID v4 identifiers | STRING | No configuration |
| Tick | Current simulation tick | INT64 | No configuration |
| Instant | Current simulation time | DATETIME | No configuration |
| Date | Current simulation date | DATETIME | No configuration |
Numeric Generators
Uniform Integer Generators
// Unsigned integers
UniformU8::{ low: 0, high: 255 } // 8-bit unsigned
UniformU16::{ low: 0, high: 65535 } // 16-bit unsigned
UniformU32::{ low: 0, high: 4294967295 } // 32-bit unsigned
UniformU64::{ low: 0, high: 18446744073709551615 } // 64-bit unsigned
// Signed integers
UniformI8::{ low: -128, high: 127 } // 8-bit signed
UniformI16::{ low: -32768, high: 32767 } // 16-bit signed
UniformI32::{ low: -2147483648, high: 2147483647 } // 32-bit signed
UniformI64::{ low: -9223372036854775808, high: 9223372036854775807 } // 64-bit signed
Floating Point Generators
// Uniform float
UniformF64::{ low: -127.0, high: 127.0 }
// Uniform decimal (exact arithmetic)
UniformDecimal::{ low: 0.995, high: 499.9999 }
Statistical Distribution Generators
// Normal distribution (bell curve)
NormalF64::{ mean: 100.0, std_dev: 15.0 }
// Log-normal distribution
LogNormalF64::{ location: 0.0, scale: 1.0 }
// Weibull distribution
WeibullF64::{ shape: 2.0, scale: 1.0 }
// Exponential distribution
ExpF64::{ rate: 1.0 }
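All four of these distributions are also available in Python's standard library, which is handy for sanity-checking parameter choices before putting them in a script. This sketch is illustrative only; the standard-library parameterizations may differ from Beamline's:

```python
import random

rng = random.Random(123)
n = 10_000

normal = [rng.normalvariate(100.0, 15.0) for _ in range(n)]   # mean 100, sd 15
lognormal = [rng.lognormvariate(0.0, 1.0) for _ in range(n)]  # location 0, scale 1
weibull = [rng.weibullvariate(1.0, 2.0) for _ in range(n)]    # scale (alpha), shape (beta)
exponential = [rng.expovariate(1.0) for _ in range(n)]        # rate 1

# Sample means land near the theoretical values.
assert 95 < sum(normal) / n < 105        # Normal mean = 100
assert 0.9 < sum(exponential) / n < 1.1  # Exponential mean = 1/rate = 1
```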
String Generators
// Lorem Ipsum text
LoremIpsum::{ min_words: 10, max_words: 200 }
// Lorem Ipsum titles (3-8 words, title case)
LoremIpsumTitle
// Regular expression patterns
Regex::{ pattern: "[A-Z]{2}[0-9]{3}" }
// Format strings with variable substitution
Format::{ pattern: "User #{$@n}" }
Complex Type Generators
Arrays
UniformArray::{
min_size: 1,
max_size: 5,
element_type: UniformI32::{ low: 1, high: 100 }
}
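Conceptually, an array generator first draws a length uniformly from [min_size, max_size] and then samples each element independently. In Python terms (a sketch of the semantics, not Beamline's code):

```python
import random

def uniform_array(rng, min_size, max_size, element):
    # Draw the length first, then sample each element independently.
    size = rng.randint(min_size, max_size)
    return [element(rng) for _ in range(size)]

rng = random.Random(7)
arr = uniform_array(rng, 1, 5, lambda r: r.randint(1, 100))
assert 1 <= len(arr) <= 5
assert all(1 <= x <= 100 for x in arr)
```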
Union Types (Any Of)
UniformAnyOf::{
types: [
UUID,
UniformI32::{ low: 1, high: 1000 },
LoremIpsumTitle
]
}
Choice from Literals
Uniform::{ choices: [1, 2, 5, 10, 20] }
Nullability and Optionality
Every generator supports NULL and MISSING value generation:
Nullability (NULL values)
// 20% chance of NULL values
generator::{ nullable: 0.2 }
// Never NULL
generator::{ nullable: false }
// Always NULL (not useful, but possible)
generator::{ nullable: 1.0 }
Optionality (MISSING values)
// 10% chance of MISSING values
generator::{ optional: 0.1 }
// Never MISSING
generator::{ optional: false }
// Always MISSING (field won't appear)
generator::{ optional: 1.0 }
Combined Configuration
// 20% NULL, 10% MISSING, 70% present values
price: UniformDecimal::{
nullable: 0.2,
optional: 0.1,
low: 9.99,
high: 999.99
}
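One way to read this configuration, consistent with the "20% NULL, 10% MISSING, 70% present" comment, is as a single three-way draw per record. How Beamline actually combines the two probabilities is an assumption here:

```python
import random

def sample_price(rng: random.Random):
    # Interpreting the probabilities as one categorical draw (an assumption
    # about Beamline's semantics): 10% the field is omitted, 20% it is NULL,
    # otherwise a value in [9.99, 999.99].
    u = rng.random()
    if u < 0.1:
        return "MISSING"  # stands in for the field not appearing at all
    if u < 0.1 + 0.2:
        return None       # NULL
    return round(rng.uniform(9.99, 999.99), 2)

rng = random.Random(5)
samples = [sample_price(rng) for _ in range(10_000)]
present = sum(isinstance(s, float) for s in samples)
assert 0.65 < present / len(samples) < 0.75  # roughly 70% present
```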
Arrival Processes
Control when events occur in simulation time. Beamline currently supports only the Homogeneous Poisson Process:
Homogeneous Poisson Process
Statistically independent events occur at a constant average rate with random intervals:
$arrival: HomogeneousPoisson::{ interarrival: minutes::5 }
Time units supported:
- milliseconds::N - N milliseconds between events
- seconds::N - N seconds between events
- minutes::N - N minutes between events
- hours::N - N hours between events
- days::N - N days between events
Variable References and Scope
Variable Definition and Usage
rand_processes::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
// Define variables at top level
$customer_id: UUID,
$price_range: UniformDecimal::{ low: 9.99, high: 199.99 },
orders: rand_process::{
$data: {
customer: $customer_id, // Reference variable
price: $price_range, // Reference variable
order_id: UUID // Direct generator
}
}
}
Forced Evaluation with ::()
Force generator evaluation at script read time (not generation time):
rand_processes::{
$id_gen: UUID,
customers: 3::[
{
// Each customer gets the same ID across all their records
$id: $id_gen::(), // Evaluated once per customer
customer_profile: static_data::{
$data: {
id: $id, // Same ID for this customer
name: LoremIpsumTitle
}
},
transactions: rand_process::{
$data: {
customer_id: $id, // Same ID for this customer
transaction_id: UUID, // New UUID per transaction
amount: UniformDecimal::{ low: 10.0, high: 500.0 }
}
}
}
]
}
Loop Index Variable $@n
Access the current loop index in array definitions:
$n: UniformU8::{ low: 3, high: 7 },
clients: $n::[
{
'client_$@n': rand_process::{ // Dynamic dataset name
$data: {
client_number: '$@n', // Current index as value
name: Format::{ pattern: "Client #{$@n}" }
}
}
}
]
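The loop expansion above is analogous to building N named datasets in a comprehension, with $@n playing the role of the loop index. Whether $@n is 0- or 1-based is not shown here, so this Python analogy assumes 0-based:

```python
# As if $n had sampled 4; the loop index stands in for $@n (assumed 0-based).
n = 4
datasets = {
    f"client_{i}": {"client_number": i, "name": f"Client #{i}"}
    for i in range(n)
}

assert set(datasets) == {"client_0", "client_1", "client_2", "client_3"}
assert datasets["client_3"]["client_number"] == 3
```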
Some Real Examples
Simple Sensor Data
rand_processes::{
$n: UniformU8::{ low: 2, high: 10 },
sensors: $n::[
rand_process::{
$r: Uniform::{ choices: [5,10] },
$arrival: HomogeneousPoisson::{ interarrival: minutes::$r },
$data: {
tick: Tick,
i8: UniformI8,
f: UniformF64
}
}
]
}
Complex Statistical Data
rand_processes::{
test_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: milliseconds::100 },
$data: {
// Statistical distributions
normal_score: NormalF64::{ mean: 100.0, std_dev: 15.0 },
exponential_wait: ExpF64::{ rate: 0.1 },
weibull_lifetime: WeibullF64::{ shape: 2.0, scale: 1000.0 },
// Arrays with statistical elements
measurements: UniformArray::{
min_size: 5,
max_size: 10,
element_type: NormalF64::{ mean: 50.0, std_dev: 5.0 }
},
// Union types
mixed_value: UniformAnyOf::{
types: [
NormalF64::{ mean: 0.0, std_dev: 1.0 },
UniformI32::{ low: 1, high: 100 },
UUID
]
}
}
}
}
Static and Dynamic Data Combination
rand_processes::{
$n: UniformU8::{ low: 5, high: 20 },
$id_gen: UUID,
customers: $n::[
{
$id: $id_gen::(), // One ID per customer
// Static customer data (generated once)
customer_table: static_data::{
$data: {
id: $id,
address: Format::{ pattern: "{$@n} Main Street" }
}
},
// Dynamic order data (generated over time)
orders: rand_process::{
$r: UniformU8::{ low: 1, high: 30 },
$arrival: HomogeneousPoisson::{ interarrival: days::$r },
$data: {
customer_id: $id,
order_id: UUID,
timestamp: Instant
}
}
}
]
}
Probability Distribution Support
Beamline provides support for data generation based on probability distributions, making it particularly valuable for AI model training and statistical simulation:
Available Distributions
- Normal Distribution: NormalF64::{ mean: μ, std_dev: σ }
- Log-Normal Distribution: LogNormalF64::{ location: μ, scale: σ }
- Exponential Distribution: ExpF64::{ rate: λ }
- Weibull Distribution: WeibullF64::{ shape: k, scale: λ }
- Uniform Distribution: all Uniform* generators use the uniform distribution
AI Model Training Applications
rand_processes::{
training_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: milliseconds::1 },
$data: {
// Features following realistic distributions
age: NormalF64::{ mean: 35.0, std_dev: 12.0 },
income: LogNormalF64::{ location: 10.5, scale: 0.5 },
response_time: ExpF64::{ rate: 0.1 },
// Categorical features
category: Uniform::{ choices: ["A", "B", "C", "D"] },
// Correlated features using shared variables
experience_years: NormalF64::{ mean: 8.0, std_dev: 5.0 },
// Target variable (could be based on features)
target: Bool::{ p: 0.3 }
}
}
}
Next Steps
Now that you understand the data generation overview, explore specific aspects:
- Generator Types - Detailed guide to all available generators
- Static Data - Using static_data for reference tables
- Output Formats - Understanding different output formats
- Nullability - Controlling NULL and MISSING values
- Scripts - Advanced Ion script techniques
- Datasets - Working with multiple datasets and relationships
Data Generator Types
Beamline provides a comprehensive set of data generators that can create values following various statistical distributions and patterns. Each generator is designed to produce realistic data for specific use cases and data types.
Generator Categories
Basic System Generators
These generators provide fundamental values based on simulation state:
| Generator | PartiQL Type | Description | Configuration |
|---|---|---|---|
| Bool | BOOL | Boolean values using Bernoulli distribution | p: f64 (probability of true, default: 0.5) |
| Date | DATETIME | Current simulation date | No configuration |
| Instant | DATETIME | Current simulation timestamp with timezone | No configuration |
| Tick | Int64 | Current simulation tick counter | No configuration |
| UUID | STRING | Version 4 UUID identifiers | No configuration |
Examples
rand_processes::{
basic_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
$data: {
// System generators
created_at: Instant, // Current simulation time
event_tick: Tick, // Current tick counter
user_id: UUID, // Random UUID
active: Bool, // 50% true by default
premium: Bool::{ p: 0.1 }, // 10% true, 90% false
event_date: Date // Current simulation date
}
}
}
Uniform Integer Generators
Generate integers using discrete uniform distribution:
Unsigned Integers
| Generator | Range | Default Range | Configuration |
|---|---|---|---|
| UniformU8 | 0 to 255 | low: 0, high: 255 | low: u8, high: u8 |
| UniformU16 | 0 to 65,535 | low: 0, high: 65535 | low: u16, high: u16 |
| UniformU32 | 0 to 4,294,967,295 | low: 0, high: 4294967295 | low: u32, high: u32 |
| UniformU64 | 0 to 9,223,372,036,854,775,807 | low: 0, high: 9223372036854775807 | low: u64, high: u64 |
Signed Integers
| Generator | Range | Default Range | Configuration |
|---|---|---|---|
| UniformI8 | -128 to 127 | low: -127, high: 127 | low: i8, high: i8 |
| UniformI16 | -32,768 to 32,767 | low: -32767, high: 32767 | low: i16, high: i16 |
| UniformI32 | -2,147,483,648 to 2,147,483,647 | low: -2147483647, high: 2147483647 | low: i32, high: i32 |
| UniformI64 | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | low: -9223372036854775807, high: 9223372036854775807 | low: i64, high: i64 |
Examples
rand_processes::{
numeric_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
$data: {
// Default ranges
age_category: UniformU8, // 0-255
small_count: UniformI8, // -127 to 127
// Custom ranges
human_age: UniformU8::{ low: 0, high: 120 },
temperature_c: UniformI8::{ low: -40, high: 50 },
user_score: UniformU16::{ low: 0, high: 1000 },
large_id: UniformU64::{ low: 1000000, high: 9999999 }
}
}
}
Floating Point Generators
Uniform Float
UniformF64::{ low: -127.0, high: 127.0 } // Default range
UniformF64::{ low: 0.0, high: 1.0 } // Unit interval
Uniform Decimal (Exact Arithmetic)
UniformDecimal::{ low: 0.995, high: 499.9999 } // Default range
UniformDecimal::{ low: 9.99, high: 99.99 } // Price range
Examples
rand_processes::{
measurements: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::30 },
$data: {
// Floating point measurements
temperature: UniformF64::{ low: -10.0, high: 40.0 },
pressure: UniformF64::{ low: 980.0, high: 1050.0 },
// Exact decimal values for money
price: UniformDecimal::{ low: 9.99, high: 999.99 },
tax_rate: UniformDecimal::{ low: 0.05, high: 0.12 }
}
}
}
Statistical Distribution Generators
Beamline supports several important probability distributions:
Normal Distribution
Models natural phenomena that cluster around a mean value:
NormalF64::{ mean: 100.0, std_dev: 15.0 }
Use Cases:
- Human measurements (height, weight, IQ scores)
- Measurement errors
- Natural phenomena
- AI model features
Example:
// Human height in centimeters (approximately normal)
height: NormalF64::{ mean: 170.0, std_dev: 10.0 }
// Test scores
test_score: NormalF64::{ mean: 75.0, std_dev: 12.0 }
Log-Normal Distribution
Models positive values that are log-normally distributed (multiplicative effects):
LogNormalF64::{ location: 0.0, scale: 1.0 }
Use Cases:
- Income distributions
- Stock prices
- File sizes
- Response times
Example:
// Income distribution (log-normal is realistic)
annual_income: LogNormalF64::{ location: 10.5, scale: 0.5 }
// File sizes
file_size_bytes: LogNormalF64::{ location: 10.0, scale: 2.0 }
Exponential Distribution
Models time between events or lifetimes:
ExpF64::{ rate: 1.0 }
Use Cases:
- Time between events
- Equipment lifetimes
- Queue waiting times
- Radioactive decay
Example:
// Time between customer arrivals (exponential inter-arrival times)
wait_time_minutes: ExpF64::{ rate: 0.1 } // Average 10 minutes
// Equipment lifetime
lifetime_hours: ExpF64::{ rate: 0.001 } // Average 1000 hours
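The comments above rely on the fact that an exponential distribution with rate λ has mean 1/λ. A quick Python check of that relationship (illustrating the math, not Beamline's internals):

```python
import random

rng = random.Random(0)
rate = 0.1  # events per minute, so the mean wait is 1/0.1 = 10 minutes
samples = [rng.expovariate(rate) for _ in range(10_000)]
mean_wait = sum(samples) / len(samples)
print(round(mean_wait, 1))  # close to 1/rate = 10
```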
Weibull Distribution
Models reliability, survival analysis, and extreme values:
WeibullF64::{ shape: 2.0, scale: 1000.0 }
Use Cases:
- Equipment failure times
- Material strength
- Wind speeds
- Survival analysis
Example:
// Equipment failure time
failure_time_hours: WeibullF64::{ shape: 2.0, scale: 8760.0 } // ~1 year scale
// Material strength
breaking_force: WeibullF64::{ shape: 3.0, scale: 500.0 }
String Generators
Lorem Ipsum Text
Generate placeholder text:
LoremIpsum::{ min_words: 10, max_words: 200 }
LoremIpsumTitle // 3-8 words, title case
Examples:
description: LoremIpsum::{ min_words: 5, max_words: 20 }
title: LoremIpsumTitle
Sample Output:
description: "Lorem ipsum dolor sit amet consectetur adipiscing elit sed do eiusmod tempor"
title: "Importari Putant Quae Autem Tanta"
Regular Expression Generator
Generate strings matching regex patterns:
Regex::{ pattern: "[A-Z]{2}[0-9]{4}" }
Examples:
rand_processes::{
test_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
$data: {
// Country codes
country: Regex::{ pattern: "[A-Z]{2}" }, // "US", "GB", "FR"
// License plates
license: Regex::{ pattern: "[A-Z]{3}[0-9]{3}" }, // "ABC123"
// Phone numbers
phone: Regex::{ pattern: "[0-9]{3}-[0-9]{3}-[0-9]{4}" }, // "555-123-4567"
// IPv4 addresses
ip: Regex::{ pattern: "([0-9]{1,3}\\.){3}[0-9]{1,3}" }, // "192.168.1.1"
}
}
}
Important Notes:
- Use double backslashes for escape sequences: \\d, not \d
- Character classes are Unicode-aware: \\d matches all Unicode digits
- Complex patterns are supported: quantifiers, alternatives, character classes
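The idea of a regex-driven generator can be sketched by hand for one fixed pattern; Python's standard library cannot generate strings from an arbitrary regex, so this hypothetical helper hard-codes the [A-Z]{3}[0-9]{3} license-plate shape and validates its output against re:

```python
import random
import re
import string

rng = random.Random(7)

def license_plate():
    """Hand-rolled generator for the pattern [A-Z]{3}[0-9]{3};
    a regex-driven generator does this for arbitrary patterns."""
    letters = "".join(rng.choices(string.ascii_uppercase, k=3))
    digits = "".join(rng.choices(string.digits, k=3))
    return letters + digits

plate = license_plate()
print(plate, bool(re.fullmatch(r"[A-Z]{3}[0-9]{3}", plate)))
```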
Format String Generator
Generate formatted strings with variable substitution:
Format::{ pattern: "User #{$@n}" }
Format::{ pattern: "Order {$order_id} for customer {$customer_id}" }
Complex Type Generators
Array Generator
Generate arrays with variable length and typed elements:
UniformArray::{
min_size: 1,
max_size: 10,
element_type: UniformI32::{ low: 1, high: 100 }
}
Configuration:
- min_size: Minimum array length
- max_size: Maximum array length
- element_type: Generator for array elements
Examples:
rand_processes::{
array_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
$data: {
// Array of integers
scores: UniformArray::{
min_size: 3,
max_size: 10,
element_type: UniformU8::{ low: 0, high: 100 }
},
// Array of UUIDs
related_ids: UniformArray::{
min_size: 1,
max_size: 5,
element_type: UUID
},
// Array using variable generator
weights: UniformArray::{
min_size: 2,
max_size: 4,
element_type: $weight_generator
}
}
}
}
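The array generator's contract (uniform length between min_size and max_size, one element draw per slot) can be modeled in a few lines of Python. This is a sketch of the behavior described above, with names of our choosing:

```python
import random

rng = random.Random(1)

def uniform_array(min_size, max_size, element_gen):
    """Draw a length uniformly in [min_size, max_size], then fill
    the array by calling the element generator once per slot."""
    size = rng.randint(min_size, max_size)
    return [element_gen() for _ in range(size)]

scores = uniform_array(3, 10, lambda: rng.randint(0, 100))
print(len(scores), scores)
```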
Union Type Generator (Any Of)
Generate values that can be one of several types:
UniformAnyOf::{
types: [
UUID,
UniformI32::{ low: 1, high: 1000 },
LoremIpsumTitle,
Bool
]
}
Use Cases:
- Heterogeneous data
- Schema evolution simulation
- Polymorphic fields
- Variant types
Example:
rand_processes::{
flexible_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
$data: {
// Field that can be different types
metadata_value: UniformAnyOf::{
types: [
UUID, // Could be an ID
UniformI32::{ low: 1, high: 10000 }, // Could be a count
LoremIpsumTitle, // Could be a title
UniformDecimal::{ low: 0.0, high: 100.0 } // Could be a percentage
]
}
}
}
}
Choice from Literals
Select from a predefined list of values:
Uniform::{ choices: [1, 2, 5, 10, 20] }
Uniform::{ choices: ["pending", "processing", "shipped", "delivered"] }
Examples:
rand_processes::{
categorical_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
$data: {
// Status choices
status: Uniform::{ choices: ["active", "inactive", "pending"] },
// Priority levels
priority: Uniform::{ choices: [1, 2, 3, 4, 5] },
// Mixed type choices
config_value: Uniform::{ choices: [true, false, "auto", 0] }
}
}
}
Timestamp Generators
Timestamp with Configuration
Generate timestamps with precision and timezone control:
Timestamp::{
timezone: true, // Include timezone (default: implementation dependent)
precision: "microsecond" // Precision level
}
Precision Options:
"microsecond"- Microsecond precision"millisecond"- Millisecond precision"second"- Second precision"minute"- Minute precision"hour"- Hour precision"day"- Day precision
Example:
rand_processes::{
temporal_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::1 },
$data: {
// Different timestamp precisions
precise_time: Timestamp::{ timezone: true, precision: "microsecond" },
log_time: Timestamp::{ timezone: false, precision: "second" },
daily_snapshot: Timestamp::{ timezone: true, precision: "day" }
}
}
}
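Precision control amounts to zeroing out every component finer than the chosen level. A small Python model of that truncation (millisecond is omitted because Python's datetime has no separate field for it; how Beamline truncates internally is not specified here):

```python
from datetime import datetime, timezone

# Components that must be zeroed for each precision level.
DROP = {
    "second": ["microsecond"],
    "minute": ["microsecond", "second"],
    "hour": ["microsecond", "second", "minute"],
    "day": ["microsecond", "second", "minute", "hour"],
}

def truncate(ts, precision):
    return ts.replace(**{field: 0 for field in DROP[precision]})

ts = datetime(2024, 1, 1, 12, 34, 56, 789123, tzinfo=timezone.utc)
print(truncate(ts, "minute"))  # 2024-01-01 12:34:00+00:00
```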
Generator Configuration Options
Nullability and Optionality
All generators support NULL and MISSING value configuration:
// 20% NULL values
generator::{ nullable: 0.2 }
// 10% MISSING values (field won't appear)
generator::{ optional: 0.1 }
// Combined: 15% NULL, 5% MISSING, 80% present
generator::{ nullable: 0.15, optional: 0.05 }
// Disable NULL/MISSING
generator::{ nullable: false, optional: false }
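The combined case can be simulated to see how the probabilities partition the output. The sketch below assumes a single roll split into MISSING, NULL, and present regions; the exact ordering Beamline uses internally is not documented here:

```python
import random

rng = random.Random(3)

MISSING = object()  # sentinel: the field is absent entirely

def apply_nullability(value_gen, nullable=0.0, optional=0.0):
    """Partition one uniform roll: first the MISSING region, then
    the NULL region, otherwise draw a real value."""
    roll = rng.random()
    if roll < optional:
        return MISSING
    if roll < optional + nullable:
        return None
    return value_gen()

samples = [apply_nullability(lambda: rng.randint(1, 100),
                             nullable=0.15, optional=0.05)
           for _ in range(10_000)]
n_missing = sum(1 for s in samples if s is MISSING)
n_null = sum(1 for s in samples if s is None)
print(n_missing / len(samples), n_null / len(samples))  # near 0.05 and 0.15
```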
Range-Based Generators
Most numeric generators support range configuration:
// Integer ranges
UniformI32::{ low: 1, high: 1000 }
UniformU8::{ low: 18, high: 65 } // Age range
// Float ranges
UniformF64::{ low: -10.0, high: 50.0 } // Temperature range
// Decimal ranges (exact arithmetic)
UniformDecimal::{ low: 9.99, high: 999.99 } // Price range
Statistical Distribution Parameters
Normal Distribution
NormalF64::{
mean: 100.0, // Mean (μ)
std_dev: 15.0 // Standard deviation (σ)
}
Example Applications:
// Human height (cm) - approximately normal
height: NormalF64::{ mean: 170.0, std_dev: 10.0 }
// IQ scores - designed to be normal
iq_score: NormalF64::{ mean: 100.0, std_dev: 15.0 }
// Measurement errors
measurement_error: NormalF64::{ mean: 0.0, std_dev: 0.1 }
Log-Normal Distribution
LogNormalF64::{
location: 0.0, // Location parameter (μ)
scale: 1.0 // Scale parameter (σ)
}
Example Applications:
// Income - typically log-normal
income: LogNormalF64::{ location: 10.5, scale: 0.5 } // ~$36K median
// File sizes
file_size: LogNormalF64::{ location: 8.0, scale: 2.0 } // Bytes
// Response times
response_ms: LogNormalF64::{ location: 3.0, scale: 0.5 } // Milliseconds
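The "~$36K median" comment follows from the log-normal's median being exp(location); the scale parameter only affects spread. A quick numeric check of that fact:

```python
import math
import random

rng = random.Random(5)
location, scale = 10.5, 0.5
# Median of LogNormal(location, scale) = exp(location), independent of scale.
samples = sorted(rng.lognormvariate(location, scale) for _ in range(10_001))
median = samples[len(samples) // 2]
print(round(math.exp(location)), round(median))  # theoretical vs sampled median
```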
Exponential Distribution
ExpF64::{
rate: 1.0 // Rate parameter (λ)
}
Example Applications:
// Time between events
inter_arrival_time: ExpF64::{ rate: 0.1 } // Average 10 time units
// Equipment lifetime
lifetime_hours: ExpF64::{ rate: 0.0001 } // Average 10,000 hours
// Queue waiting time
wait_time_sec: ExpF64::{ rate: 0.05 } // Average 20 seconds
Weibull Distribution
WeibullF64::{
shape: 2.0, // Shape parameter (k)
scale: 100.0 // Scale parameter (λ)
}
Example Applications:
// Equipment reliability
failure_time: WeibullF64::{ shape: 2.0, scale: 1000.0 }
// Wind speed modeling
wind_speed: WeibullF64::{ shape: 2.0, scale: 15.0 }
// Material strength
breaking_stress: WeibullF64::{ shape: 3.0, scale: 500.0 }
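For sizing Weibull parameters it helps to know the mean: a Weibull with shape k and scale λ has mean λ·Γ(1 + 1/k). A short check of that formula against sampled data (illustrating the math, not Beamline's sampler):

```python
import math
import random

rng = random.Random(9)
shape, scale = 2.0, 1000.0
# Mean of Weibull(shape=k, scale=lambda) = lambda * Gamma(1 + 1/k)
theoretical_mean = scale * math.gamma(1.0 + 1.0 / shape)
# random.weibullvariate takes (alpha=scale, beta=shape)
samples = [rng.weibullvariate(scale, shape) for _ in range(20_000)]
sample_mean = sum(samples) / len(samples)
print(round(theoretical_mean, 1), round(sample_mean, 1))
```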
Advanced Generator Usage
Nested Structures
Create complex nested objects:
rand_processes::{
complex_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::5 },
$data: {
user: {
id: UUID,
profile: {
name: LoremIpsumTitle,
age: UniformU8::{ low: 18, high: 80 },
preferences: {
notifications: Bool::{ p: 0.8 },
theme: Uniform::{ choices: ["light", "dark", "auto"] }
}
},
stats: {
login_count: UniformU32::{ low: 0, high: 10000 },
last_login: Instant,
score: NormalF64::{ mean: 85.0, std_dev: 12.0 }
}
}
}
}
}
Arrays of Complex Objects
rand_processes::{
order_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::2 },
$data: {
order_id: UUID,
items: UniformArray::{
min_size: 1,
max_size: 10,
element_type: {
product_id: UUID,
quantity: UniformU8::{ low: 1, high: 5 },
unit_price: UniformDecimal::{ low: 5.00, high: 200.00 }
}
}
}
}
}
Variable References in Complex Generators
rand_processes::{
// Define reusable components
$id_gen: UUID,
$weight_dist: NormalF64::{ mean: 70.0, std_dev: 15.0 },
$status_options: Uniform::{ choices: ["new", "active", "suspended", "closed"] },
users: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::1 },
$data: {
user_id: $id_gen,
weight_kg: $weight_dist,
account_status: $status_options,
// Arrays using variables
measurement_history: UniformArray::{
min_size: 5,
max_size: 20,
element_type: $weight_dist // Same distribution for all measurements
},
// Union types with variables
contact_method: UniformAnyOf::{
types: [
$id_gen, // UUID for anonymous contact
Regex::{ pattern: "[a-z]+@[a-z]+\\.[a-z]{2,3}" }, // Email
Regex::{ pattern: "[0-9]{3}-[0-9]{3}-[0-9]{4}" } // Phone
]
}
}
}
}
AI Model Training Examples
Classification Dataset
rand_processes::{
classification_training: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: milliseconds::1 },
$data: {
// Features with realistic distributions
feature_1: NormalF64::{ mean: 0.0, std_dev: 1.0 },
feature_2: NormalF64::{ mean: 0.0, std_dev: 1.0 },
feature_3: LogNormalF64::{ location: 0.0, scale: 0.5 },
feature_4: ExpF64::{ rate: 1.0 },
// Categorical features
category: Uniform::{ choices: ["A", "B", "C"] },
region: Uniform::{ choices: ["North", "South", "East", "West"] },
// Binary classification target
label: Bool::{ p: 0.3 } // 30% positive class
}
}
}
Regression Dataset
rand_processes::{
regression_training: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: milliseconds::1 },
$data: {
// Independent variables
x1: NormalF64::{ mean: 10.0, std_dev: 2.0 },
x2: UniformF64::{ low: 0.0, high: 20.0 },
x3: ExpF64::{ rate: 0.1 },
// Dependent variable (could be computed based on x1, x2, x3)
y: NormalF64::{ mean: 50.0, std_dev: 10.0 },
// Noise term
noise: NormalF64::{ mean: 0.0, std_dev: 1.0 }
}
}
}
Time Series Dataset
rand_processes::{
time_series: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::60 }, // Every minute
$data: {
timestamp: Instant,
// Trending value with noise
base_value: NormalF64::{ mean: 100.0, std_dev: 5.0 },
seasonal_component: NormalF64::{ mean: 0.0, std_dev: 10.0 },
noise: NormalF64::{ mean: 0.0, std_dev: 2.0 },
// External factors
temperature: NormalF64::{ mean: 22.0, std_dev: 5.0 },
humidity: UniformF64::{ low: 30.0, high: 80.0 }
}
}
}
Performance Considerations
Generator Efficiency
- Simple generators (UUID, Bool, UniformI32) are fastest
- Statistical distributions (NormalF64, ExpF64) require more computation
- String generators (LoremIpsum, Regex) can be slower for complex patterns
- Array generators scale with array size and element complexity
Memory Usage
- Streaming generation: Constant memory usage regardless of dataset size
- Variable caching: Variables are computed once and reused
- Complex nesting: Memory usage scales with structure depth
Optimization Tips
// Efficient - simple generators
id: UUID,
count: UniformU32::{ low: 1, high: 1000 }
// Less efficient - complex regex
complex_pattern: Regex::{ pattern: "(very|extremely|quite)\\s+complex\\s+pattern\\s+with\\s+many\\s+alternatives" }
// Efficient - reuse variables
$common_decimal: UniformDecimal::{ low: 1.0, high: 100.0 },
field1: $common_decimal,
field2: $common_decimal,
field3: $common_decimal
Next Steps
- Static Data - Learn about static_data generation
- Output Formats - Understand different output formats
- Nullability - Deep dive into NULL and MISSING values
- Scripts - Advanced Ion scripting techniques
Datasets and Collections
Datasets in Beamline represent collections of related data records that share the same structure. Understanding how to design, organize, and work with multiple datasets is essential for creating realistic data generation scenarios.
What are Datasets?
A dataset is a named collection of records that share a common schema. In Ion scripts, datasets are defined as top-level keys within the rand_processes structure:
rand_processes::{
users: rand_process::{ /* ... */ }, // "users" dataset
orders: rand_process::{ /* ... */ }, // "orders" dataset
products: static_data::{ /* ... */ } // "products" dataset
}
Each dataset becomes a separate data collection in the output, whether in text format, Ion format, or database generation.
Single Dataset Scripts
Basic Single Dataset
rand_processes::{
sensors: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
$data: {
sensor_id: UUID,
temperature: NormalF64::{ mean: 22.0, std_dev: 3.0 },
humidity: UniformF64::{ low: 30.0, high: 80.0 },
timestamp: Instant
}
}
}
Output characteristics:
- Single dataset named “sensors”
- All records have the same structure
- Records generated according to arrival process
Multiple Dataset Scripts
Independent Datasets
Create multiple unrelated datasets in the same script:
rand_processes::{
// User activity dataset
user_events: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::30 },
$data: {
user_id: UUID,
event_type: Uniform::{ choices: ["login", "logout", "click", "purchase"] },
timestamp: Instant
}
},
// System metrics dataset
system_metrics: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::60 },
$data: {
metric_name: Uniform::{ choices: ["cpu", "memory", "disk", "network"] },
value: UniformF64::{ low: 0.0, high: 100.0 },
timestamp: Instant
}
},
// Configuration dataset (static)
app_config: static_data::{
$data: {
config_key: Uniform::{ choices: ["max_users", "timeout", "retry_count"] },
config_value: UniformAnyOf::{ types: [UniformI32::{ low: 1, high: 1000 }, Bool] }
}
}
}
Related Datasets with Shared Variables
Create datasets that share common identifiers or generators:
rand_processes::{
// Shared generators
$user_id: UUID,
$session_id: UUID,
// User profiles (static)
users: static_data::{
$data: {
user_id: $user_id,
username: Format::{ pattern: "user_{UUID}" },
created_at: Date
}
},
// User sessions (dynamic)
sessions: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: hours::2 },
$data: {
session_id: $session_id,
user_id: $user_id, // Links to users dataset
start_time: Instant,
duration_minutes: UniformU16::{ low: 5, high: 180 }
}
},
// Session events (dynamic)
session_events: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::3 },
$data: {
event_id: UUID,
session_id: $session_id, // Links to sessions dataset
event_type: Uniform::{ choices: ["page_view", "click", "scroll", "exit"] },
timestamp: Instant
}
}
}
Complex Dataset Relationships
Dynamic Dataset Creation with Loops
From the real client-service.ion test script:
rand_processes::{
// Generate between 5 & 20 customers
$n: UniformU8::{ low: 5, high: 20 },
// Shared ID generators
$id_gen: UUID,
$rid_gen: UUID,
requests: $n::[
// Each iteration creates datasets for customer $@n
{
// Unique ID per customer
$id: $id_gen::(),
$rate: UniformF64::{ low: 0.995, high: 1.0 },
$success: Bool::{ p: $rate },
// Service dataset - shared by all customers
service: rand_process::{
$r: UniformU8::{ low: 20, high: 150 },
$arrival: HomogeneousPoisson::{ interarrival: milliseconds::$r },
$data: {
Request: $rid_gen,
StartTime: Instant,
Program: "FancyService",
Operation: "GetMyData",
Account: $id,
client: Format::{ pattern: "customer #{$@n}" },
success: $success
}
},
// Individual client dataset - one per customer
'client_{$@n}': rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: milliseconds::$r },
$data: {
id: $id,
request_time: Instant,
request_id: $rid_gen,
success: $success
}
}
}
]
}
This creates:
- 1 service dataset: Shared across all customers
- N client datasets: client_0, client_1, client_2, etc.
- Shared variables: same request IDs, customer IDs, and success rates
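The structure this script produces can be modeled as a dictionary of named datasets: one shared "service" list plus one list per customer, with the forced-once customer ID and a shared request ID landing in both. A small Python model of that shape (not actual Beamline output):

```python
import random
import uuid

rng = random.Random(100)
n = rng.randint(5, 20)  # mirrors $n: UniformU8::{ low: 5, high: 20 }

datasets = {"service": []}  # one shared dataset plus one per client
for i in range(n):
    customer_id = str(uuid.uuid4())  # forced evaluation: one ID per customer
    client_name = f"client_{i}"
    datasets[client_name] = []
    request_id = str(uuid.uuid4())
    # The same request ID and customer ID land in both datasets,
    # mirroring the shared $rid_gen / $id variables in the script.
    datasets["service"].append({"Request": request_id, "Account": customer_id,
                                "client": f"customer #{i}"})
    datasets[client_name].append({"id": customer_id, "request_id": request_id})

print(len(datasets))  # 1 service dataset + n client datasets
```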
Output Example
$ beamline gen data \
--seed 100 \
--start-auto \
--script-path client-service.ion \
--sample-count 20 \
--output-format text
Seed: 100
Start: 2024-01-01T00:00:00Z
[2024-01-01 00:00:10.123] : "service" { 'Request': 'req-001', 'Account': 'customer-abc', 'client': 'customer #0' }
[2024-01-01 00:00:10.124] : "client_0" { 'id': 'customer-abc', 'request_id': 'req-001' }
[2024-01-01 00:00:15.456] : "service" { 'Request': 'req-002', 'Account': 'customer-def', 'client': 'customer #1' }
[2024-01-01 00:00:15.457] : "client_1" { 'id': 'customer-def', 'request_id': 'req-002' }
Dataset Filtering
CLI Dataset Selection
Generate data for specific datasets only:
# Generate all datasets
beamline gen data \
--seed 42 \
--start-auto \
--script-path multi_dataset.ion \
--sample-count 100
# Generate only specific datasets
beamline gen data \
--seed 42 \
--start-auto \
--script-path multi_dataset.ion \
--sample-count 100 \
--dataset users \
--dataset orders
# Generate only one dataset
beamline gen data \
--seed 42 \
--start-auto \
--script-path multi_dataset.ion \
--sample-count 100 \
--dataset system_metrics
Use Cases for Dataset Filtering
- Focused testing: Test specific components in isolation
- Performance optimization: Generate only needed data
- Development: Work with subset of complex systems
- Incremental development: Build datasets one at a time
Dataset Design Patterns
Master-Detail Pattern
rand_processes::{
$n_customers: UniformU8::{ low: 10, high: 50 },
$customer_id: UUID,
$order_id: UUID,
customers: $n_customers::[
{
$id: $customer_id::(),
// Master dataset - customer information
customer_master: static_data::{
$data: {
customer_id: $id,
name: LoremIpsumTitle,
email: Format::{ pattern: "customer{$@n}@example.com" },
registration_date: Date
}
},
// Detail dataset - customer orders
'customer_{$@n}_orders': rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: days::UniformU8::{ low: 1, high: 30 } },
$data: {
order_id: $order_id,
customer_id: $id, // Foreign key relationship
order_date: Instant,
total_amount: UniformDecimal::{ low: 10.00, high: 500.00 }
}
}
}
]
}
Event Sourcing Pattern
rand_processes::{
$entity_id: UUID,
// Entity snapshots (static)
entity_snapshots: static_data::{
$data: {
entity_id: $entity_id,
entity_type: Uniform::{ choices: ["user", "order", "product"] },
created_at: Date,
initial_state: LoremIpsumTitle
}
},
// Entity events (dynamic)
entity_events: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::UniformU8::{ low: 5, high: 60 } },
$data: {
event_id: UUID,
entity_id: $entity_id, // Links to snapshots
event_type: Uniform::{ choices: ["created", "updated", "deleted", "restored"] },
timestamp: Instant,
event_data: LoremIpsum::{ min_words: 5, max_words: 20 }
}
}
}
Multi-Tenant Pattern
rand_processes::{
$n_tenants: UniformU8::{ low: 3, high: 10 },
$tenant_id: UUID,
tenants: $n_tenants::[
{
$id: $tenant_id::(),
// Tenant configuration (static)
'tenant_{$@n}_config': static_data::{
$data: {
tenant_id: $id,
tenant_name: Format::{ pattern: "Tenant {$@n}" },
plan: Uniform::{ choices: ["basic", "premium", "enterprise"] },
max_users: UniformU16::{ low: 10, high: 1000 }
}
},
// Tenant activity (dynamic)
'tenant_{$@n}_activity': rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::UniformU8::{ low: 1, high: 30 } },
$data: {
activity_id: UUID,
tenant_id: $id,
activity_type: Uniform::{ choices: ["login", "api_call", "data_export", "config_change"] },
timestamp: Instant,
user_count: UniformU16::{ low: 1, high: 100 }
}
}
}
]
}
Dataset Analysis and Inspection
Examining Generated Datasets
# Generate multi-dataset output
beamline gen data \
--seed 123 \
--start-auto \
--script-path complex_system.ion \
--sample-count 1000 \
--output-format ion-pretty > output.ion
# Extract dataset names and record counts
jq -r '.data | keys[]' output.ion # List all dataset names
jq '.data.users | length' output.ion # Count records in users dataset
jq '.data | to_entries[] | "\(.key): \(.value | length) records"' output.ion # All counts
Database Catalog Analysis
# Generate database
beamline gen db beamline-lite \
--seed 456 \
--start-auto \
--script-path multi_dataset.ion \
--sample-count 5000
# Analyze generated datasets
ls -la beamline-catalog/*.ion | grep -v shape # List data files
for f in beamline-catalog/*.ion; do
if [[ "$f" != *".shape.ion" ]]; then
echo "$(basename "$f" .ion): $(wc -l < "$f") records"
fi
done
Schema Comparison Across Datasets
# Compare schemas of related datasets
diff beamline-catalog/client_0.shape.sql beamline-catalog/client_1.shape.sql
# Should be identical for datasets created from same template
# Compare different dataset schemas
diff beamline-catalog/users.shape.sql beamline-catalog/orders.shape.sql
# Should be different - different structures
Advanced Dataset Patterns
Hierarchical Data Modeling
rand_processes::{
$n_orgs: UniformU8::{ low: 2, high: 5 },
$n_depts_per_org: UniformU8::{ low: 3, high: 8 },
$n_users_per_dept: UniformU8::{ low: 5, high: 20 },
organizations: $n_orgs::[
{
$org_id: UUID::(),
// Organization master data
'org_{$@n}': static_data::{
$data: {
org_id: $org_id,
org_name: Format::{ pattern: "Organization {$@n}" },
industry: Uniform::{ choices: ["Tech", "Finance", "Healthcare", "Retail"] }
}
},
// Departments within organization
departments: $n_depts_per_org::[
{
$dept_id: UUID::(),
'org_{$@n}_dept_{$@n}': static_data::{
$data: {
dept_id: $dept_id,
org_id: $org_id,
dept_name: Uniform::{ choices: ["Engineering", "Sales", "Marketing", "HR"] }
}
},
// Users within department
'org_{$@n}_dept_{$@n}_users': $n_users_per_dept::[
rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: hours::UniformU8::{ low: 8, high: 24 } },
$data: {
user_id: UUID,
dept_id: $dept_id,
org_id: $org_id,
activity_type: Uniform::{ choices: ["work", "meeting", "break", "training"] },
timestamp: Instant
}
}
]
}
]
}
]
}
Time-Series Dataset Families
rand_processes::{
$n_sensors: UniformU8::{ low: 5, high: 15 },
$sensor_id: UUID,
sensors: $n_sensors::[
{
$id: $sensor_id::(),
$location: Format::{ pattern: "Location-{$@n}" },
// Sensor metadata (static)
'sensor_{$@n}_metadata': static_data::{
$data: {
sensor_id: $id,
location: $location,
sensor_type: Uniform::{ choices: ["temperature", "humidity", "pressure"] },
calibration_date: Date
}
},
// Regular sensor readings (dynamic)
'sensor_{$@n}_readings': rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::30 },
$data: {
sensor_id: $id,
reading_time: Instant,
value: NormalF64::{ mean: 22.0, std_dev: 5.0 },
quality: Uniform::{ choices: ["good", "fair", "poor"] }
}
},
// Sensor alerts (dynamic, infrequent)
'sensor_{$@n}_alerts': rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: hours::UniformU8::{ low: 6, high: 48 } },
$data: {
alert_id: UUID,
sensor_id: $id,
alert_type: Uniform::{ choices: ["high_value", "low_value", "malfunction", "maintenance"] },
timestamp: Instant,
severity: Uniform::{ choices: [1, 2, 3, 4, 5] }
}
}
}
]
}
Dataset Output in Different Formats
Text Format Multi-Dataset Output
$ beamline gen data \
--seed 999 \
--start-auto \
--script-path multi_dataset.ion \
--sample-count 20 \
--output-format text
# Datasets are interleaved by timestamp
[2024-01-01 00:00:00.000] : "config" { 'key': 'timeout', 'value': 30 }
[2024-01-01 00:00:00.000] : "config" { 'key': 'max_users', 'value': 1000 }
[2024-01-01 00:02:15.123] : "users" { 'user_id': 'abc-123', 'action': 'login' }
[2024-01-01 00:03:45.456] : "metrics" { 'metric': 'cpu', 'value': 45.6 }
[2024-01-01 00:04:30.789] : "users" { 'user_id': 'def-456', 'action': 'click' }
Ion Pretty Multi-Dataset Output
{
seed: 999,
start: "2024-01-01T00:00:00Z",
data: {
config: [
{ key: "timeout", value: 30 },
{ key: "max_users", value: 1000 }
],
users: [
{ user_id: "abc-123", action: "login", timestamp: 2024-01-01T00:02:15.123Z },
{ user_id: "def-456", action: "click", timestamp: 2024-01-01T00:04:30.789Z }
],
metrics: [
{ metric: "cpu", value: 45.6, timestamp: 2024-01-01T00:03:45.456Z }
]
}
}
Database Generation Multi-Dataset Files
$ beamline gen db beamline-lite \
--seed 42 \
--start-auto \
--script-path client_service.ion \
--sample-count 1000
$ ls beamline-catalog/
.beamline-manifest
.beamline-script
service.ion # Service dataset data
service.shape.ion # Service dataset schema
service.shape.sql # Service dataset SQL
client_0.ion # Client 0 dataset data
client_0.shape.ion # Client 0 dataset schema
client_0.shape.sql # Client 0 dataset SQL
client_1.ion # Client 1 dataset data
client_1.shape.ion # Client 1 dataset schema
client_1.shape.sql # Client 1 dataset SQL
... # More client datasets
Dataset Naming Best Practices
1. Use Descriptive Names
// Good - descriptive dataset names
user_profiles: static_data::{ /* ... */ },
user_activity_events: rand_process::{ /* ... */ },
system_performance_metrics: rand_process::{ /* ... */ }
// Avoid - generic names
data1: static_data::{ /* ... */ },
stuff: rand_process::{ /* ... */ }
2. Follow Consistent Naming Conventions
// Consistent naming pattern
user_profiles: static_data::{ /* ... */ },
user_sessions: rand_process::{ /* ... */ },
user_events: rand_process::{ /* ... */ },
order_master: static_data::{ /* ... */ },
order_items: rand_process::{ /* ... */ },
order_payments: rand_process::{ /* ... */ }
3. Use Meaningful Prefixes for Related Datasets
// Group related datasets with prefixes
$n: UniformU8::{ low: 5, high: 10 },
services: $n::[
{
'service_{$@n}_config': static_data::{ /* ... */ },
'service_{$@n}_requests': rand_process::{ /* ... */ },
'service_{$@n}_responses': rand_process::{ /* ... */ },
'service_{$@n}_errors': rand_process::{ /* ... */ }
}
]
Performance Considerations
Dataset Count Impact
- Few datasets (1-5): Minimal overhead
- Many datasets (10-50): Slight memory overhead for tracking
- Dynamic datasets (100+): Significant memory for metadata
Dataset Size Balance
// Balanced approach - mix of small and large datasets
rand_processes::{
// Small reference dataset
config: static_data::{ $data: { /* small config */ } },
// Medium operational dataset
users: rand_process::{ /* moderate activity */ },
// Large transaction dataset
transactions: rand_process::{ /* high frequency */ }
}
Memory Usage with Multiple Datasets
# Monitor memory usage with many datasets
time beamline gen data \
--seed 1 \
--start-auto \
--script-path many_datasets.ion \
--sample-count 10000
# Use dataset filtering to reduce memory
beamline gen data \
--seed 1 \
--start-auto \
--script-path many_datasets.ion \
--sample-count 10000 \
--dataset important_dataset_only
Integration Workflows
Dataset-Specific Processing
#!/bin/bash
# process-datasets.sh
SCRIPT="multi_system.ion"
SEED=12345
# Generate full dataset
beamline gen data \
--seed $SEED \
--start-auto \
--script-path $SCRIPT \
--sample-count 10000 \
--output-format ion-pretty > full_data.ion
# Extract individual datasets for processing
# (jq expects JSON; if your Ion output is not JSON-compatible, convert it first)
jq '.data.users' full_data.ion > users_only.json
jq '.data.orders' full_data.ion > orders_only.json
jq '.data.metrics' full_data.ion > metrics_only.json
echo "Datasets extracted for individual processing"
Cross-Dataset Validation
# Generate related datasets
beamline gen data \
--seed 999 \
--start-auto \
--script-path related_data.ion \
--sample-count 5000 \
--output-format ion-pretty > related_data.ion
# Validate relationships
jq '.data.orders[].customer_id' related_data.ion | sort -u > order_customers.txt
jq '.data.users[].user_id' related_data.ion | sort -u > all_customers.txt
# Check referential integrity
comm -23 order_customers.txt all_customers.txt # Orders with invalid customer IDs (should be empty)
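The same referential-integrity check can be done programmatically once the datasets are loaded as lists of records. This is a minimal Python sketch of the set-difference logic the `comm -23` pipeline performs; the field names (`user_id`, `customer_id`) follow the examples above, and the loading step is assumed to have happened already.

```python
def invalid_refs(orders, users):
    """Return customer IDs referenced by orders that have no matching user record."""
    known = {u["user_id"] for u in users}
    return sorted({o["customer_id"] for o in orders} - known)

# Hypothetical records standing in for parsed dataset output
users = [{"user_id": "abc-123"}, {"user_id": "def-456"}]
orders = [{"customer_id": "abc-123"}, {"customer_id": "zzz-999"}]
print(invalid_refs(orders, users))  # → ['zzz-999']
```

An empty result means every order references a known customer, mirroring the "should be empty" expectation above.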
Troubleshooting Multi-Dataset Scripts
Issue: Missing Datasets in Output
Cause: Dataset filtering or script errors
Solution:
# Check all available datasets
beamline infer-shape --seed 1 --start-auto --script-path script.ion --output-format text
# Generate without filtering
beamline gen data --seed 1 --start-auto --script-path script.ion --sample-count 5
Issue: Uneven Dataset Sizes
Cause: Different arrival rates or loop counts
Solution:
// Check arrival rates in your script and adjust interarrival
// times to balance dataset sizes
$arrival1: HomogeneousPoisson::{ interarrival: seconds::1 },  // Frequent
$arrival2: HomogeneousPoisson::{ interarrival: minutes::1 },  // Less frequent
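The size imbalance follows directly from the interarrival times: a homogeneous Poisson process with mean interarrival τ has rate 1/τ, so the expected record count over a simulated horizon T is T/τ. A quick back-of-the-envelope calculation (plain Python, independent of Beamline) shows why a `seconds::1` dataset dwarfs a `minutes::1` dataset:

```python
def expected_events(horizon_s, interarrival_s):
    # Homogeneous Poisson process: rate = 1 / mean interarrival,
    # so expected count over the horizon = horizon / interarrival.
    return horizon_s / interarrival_s

DAY = 24 * 60 * 60
print(expected_events(DAY, 1))   # seconds::1  → 86400.0 events per day
print(expected_events(DAY, 60))  # minutes::1  → 1440.0 events per day
```

To balance dataset sizes, scale the interarrival times so the ratios T/τ come out roughly equal.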
Issue: Memory Issues with Many Datasets
Solution:
# Use dataset filtering
beamline gen data --script-path many.ion --dataset important_one --dataset important_two
# Or generate datasets separately
beamline gen data --script-path script.ion --dataset batch_1 --sample-count 10000
beamline gen data --script-path script.ion --dataset batch_2 --sample-count 10000
Next Steps
- Scripts - Advanced Ion scripting techniques for complex datasets
- Output Formats - How datasets appear in different output formats
- Examples - See complete multi-dataset examples in action
- Database Guide - Working with dataset catalogs and databases
Working with Scripts
Ion scripts are the core of Beamline’s data generation system. This section covers advanced scripting techniques, best practices, and patterns for creating sophisticated data generation scenarios.
Ion Script Fundamentals
Basic Script Structure
Every Beamline script follows this structure:
rand_processes::{
// 1. Variable definitions (optional)
$variable_name: GeneratorType::{ configuration },
// 2. Dataset definitions (required)
dataset_name: dataset_type::{
// Configuration specific to dataset type
}
}
Script Validation
Before generating large datasets, validate your script:
# Quick validation with minimal generation
beamline gen data \
--seed 1 \
--start-auto \
--script-path new_script.ion \
--sample-count 1
# Check inferred schema
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path new_script.ion \
--output-format basic-ddl
Variable Management
Variable Definition Best Practices
rand_processes::{
// Group related variables together with comments
// === ID Generators ===
$user_id: UUID,
$session_id: UUID,
$transaction_id: UUID,
// === Shared Distributions ===
$age_distribution: NormalF64::{ mean: 35.0, std_dev: 12.0 },
$price_range: UniformDecimal::{ low: 9.99, high: 999.99 },
// === Configuration Values ===
$max_users: UniformU8::{ low: 10, high: 50 },
$success_rate: UniformF64::{ low: 0.95, high: 0.99 },
// === Categorical Choices ===
$status_options: Uniform::{ choices: ["active", "inactive", "pending", "suspended"] },
$priority_levels: Uniform::{ choices: [1, 2, 3, 4, 5] },
// Dataset definitions follow...
}
Variable Scoping Rules
Variables have different scoping behaviors:
rand_processes::{
// Global variable - accessible everywhere
$global_id: UUID,
dataset: $n::[
{
// Loop-scoped variable - unique per iteration
$local_id: UUID::(), // Forces evaluation per loop iteration
'data_{$@n}': rand_process::{
$data: {
global: $global_id, // Same value across all loops
local: $local_id, // Different per loop iteration
index: '$@n' // Current loop index
}
}
}
]
}
Advanced Variable Techniques
Computed Variables
rand_processes::{
// Base measurements
$base_temp: NormalF64::{ mean: 20.0, std_dev: 3.0 },
$temp_variance: UniformF64::{ low: 0.5, high: 2.0 },
// Computed distributions based on other variables
$adjusted_temp: NormalF64::{
mean: 22.0, // Slightly higher than base
std_dev: 4.0 // More variation
},
sensors: rand_process::{
$data: {
base_temperature: $base_temp,
adjusted_temperature: $adjusted_temp,
temperature_diff: UniformF64::{ low: -5.0, high: 5.0 }
}
}
}
Conditional Variable Usage
rand_processes::{
// Define multiple generators for different scenarios
$high_value_price: UniformDecimal::{ low: 100.00, high: 1000.00 },
$low_value_price: UniformDecimal::{ low: 1.00, high: 50.00 },
$medium_value_price: UniformDecimal::{ low: 25.00, high: 200.00 },
products: rand_process::{
$data: {
product_id: UUID,
category: Uniform::{ choices: ["electronics", "books", "clothing"] },
// Use different price generators for different scenarios
price: UniformAnyOf::{
types: [
$high_value_price, // Electronics
$low_value_price, // Books
$medium_value_price // Clothing
]
}
}
}
}
Advanced Script Patterns
Multi-Level Hierarchies
rand_processes::{
$n_regions: UniformU8::{ low: 2, high: 4 },
$n_stores_per_region: UniformU8::{ low: 3, high: 8 },
$n_employees_per_store: UniformU8::{ low: 5, high: 20 },
retail_hierarchy: $n_regions::[
{
$region_id: UUID::(),
// Region data
'region_{$@n}': static_data::{
$data: {
region_id: $region_id,
region_name: Format::{ pattern: "Region {$@n}" },
timezone: Uniform::{ choices: ["PST", "MST", "CST", "EST"] }
}
},
// Stores in region
stores: $n_stores_per_region::[
{
$store_id: UUID::(),
'region_{$@n}_store_{$@n}': static_data::{
$data: {
store_id: $store_id,
region_id: $region_id,
store_name: Format::{ pattern: "Store {$@n}-{$@n}" },
address: Format::{ pattern: "{$@n} Commerce St" }
}
},
// Employees in store
'region_{$@n}_store_{$@n}_employees': $n_employees_per_store::[
rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: hours::8 },
$data: {
employee_id: UUID,
store_id: $store_id,
region_id: $region_id,
clock_in_time: Instant,
activity: Uniform::{ choices: ["sales", "inventory", "cleaning", "break"] }
}
}
]
}
]
}
]
}
Time-Based Dataset Coordination
rand_processes::{
// Shared timing variables
$peak_hours_rate: HomogeneousPoisson::{ interarrival: minutes::2 },
$off_hours_rate: HomogeneousPoisson::{ interarrival: minutes::15 },
$maintenance_rate: HomogeneousPoisson::{ interarrival: hours::6 },
// High-frequency events during peak hours
peak_user_activity: rand_process::{
$arrival: $peak_hours_rate,
$data: {
event_id: UUID,
event_type: Uniform::{ choices: ["login", "search", "purchase"] },
timestamp: Instant,
load_factor: UniformF64::{ low: 0.7, high: 1.0 } // High load
}
},
// Lower frequency during off hours
off_hours_activity: rand_process::{
$arrival: $off_hours_rate,
$data: {
event_id: UUID,
event_type: Uniform::{ choices: ["backup", "cleanup", "monitoring"] },
timestamp: Instant,
load_factor: UniformF64::{ low: 0.1, high: 0.3 } // Low load
}
},
// Maintenance events
maintenance_events: rand_process::{
$arrival: $maintenance_rate,
$data: {
maintenance_id: UUID,
maintenance_type: Uniform::{ choices: ["scheduled", "emergency", "upgrade"] },
timestamp: Instant,
duration_minutes: UniformU16::{ low: 30, high: 240 }
}
}
}
Cross-Dataset Correlation
rand_processes::{
// Shared correlation factors
$system_load: UniformF64::{ low: 0.1, high: 0.9 },
$error_probability: Bool::{ p: 0.05 }, // 5% base error rate
// System metrics affected by load
system_metrics: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::30 },
$data: {
metric_id: UUID,
timestamp: Instant,
cpu_usage: $system_load,
memory_usage: UniformF64::{ low: 0.2, high: 0.8 },
response_time_ms: LogNormalF64::{ location: 2.0, scale: 0.5 }
}
},
// Application events affected by same factors
application_events: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::10 },
$data: {
event_id: UUID,
timestamp: Instant,
event_type: Uniform::{ choices: ["request", "response", "error", "timeout"] },
has_error: $error_probability, // Correlated error rate
load_factor: $system_load // Same load factor
}
}
}
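The correlation mechanism above boils down to drawing a value once and letting multiple record streams reuse it. This Python analogy (not Beamline's implementation) illustrates the effect of sharing `$system_load` across datasets:

```python
import random

random.seed(42)

# Draw the shared factor once; both record streams reuse the same value,
# so their load-related fields move together.
system_load = random.uniform(0.1, 0.9)

metric = {"cpu_usage": system_load, "memory_usage": random.uniform(0.2, 0.8)}
event = {"event_type": "request", "load_factor": system_load}

assert metric["cpu_usage"] == event["load_factor"]
```

Fields generated independently (like `memory_usage` here) remain uncorrelated; only the shared draw links the datasets.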
Script Organization Strategies
Modular Script Design
rand_processes::{
// === CONFIGURATION SECTION ===
// System-wide settings
$system_version: "2.1.0",
$max_concurrent_users: UniformU16::{ low: 100, high: 1000 },
// === SHARED GENERATORS ===
// Reusable ID generators
$user_id: UUID,
$session_id: UUID,
$request_id: UUID,
// Reusable distributions
$user_age_dist: NormalF64::{ mean: 34.5, std_dev: 12.8 },
$response_time_dist: LogNormalF64::{ location: 3.0, scale: 0.4 },
// === REFERENCE DATA ===
// Static lookup tables
user_types: static_data::{
$data: {
type_id: UniformU8::{ low: 1, high: 5 },
type_name: Uniform::{ choices: ["free", "premium", "enterprise", "admin", "guest"] },
max_sessions: Uniform::{ choices: [1, 5, 10, 100, 1] }
}
},
// === OPERATIONAL DATA ===
// Dynamic user activity
user_sessions: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::UniformU8::{ low: 2, high: 30 } },
$data: {
user_id: $user_id,
session_id: $session_id,
start_time: Instant,
user_age: $user_age_dist
}
},
// === PERFORMANCE DATA ===
// System performance metrics
performance_metrics: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::15 },
$data: {
metric_timestamp: Instant,
response_time: $response_time_dist,
concurrent_users: UniformU16::{ low: 0, high: 1000 }
}
}
}
Environment-Specific Scripts
Create scripts that can be configured for different environments:
rand_processes::{
// === ENVIRONMENT CONFIGURATION ===
// Development environment settings
$dev_user_count: UniformU8::{ low: 5, high: 20 },
$dev_load_factor: UniformF64::{ low: 0.1, high: 0.3 },
$dev_error_rate: 0.1, // 10% errors in dev
// Production-like environment settings
$prod_user_count: UniformU16::{ low: 100, high: 1000 },
$prod_load_factor: UniformF64::{ low: 0.6, high: 0.95 },
$prod_error_rate: 0.01, // 1% errors in prod
// Use dev settings (change as needed)
$current_user_count: $dev_user_count,
$current_load_factor: $dev_load_factor,
$current_error_rate: $dev_error_rate,
// === DATASETS ===
users: $current_user_count::[
rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
$data: {
user_id: UUID,
load_impact: $current_load_factor,
has_error: Bool::{ p: $current_error_rate },
timestamp: Instant
}
}
]
}
Complex Data Relationships
Foreign Key Relationships
rand_processes::{
$n_customers: UniformU8::{ low: 10, high: 50 },
$n_products: UniformU8::{ low: 20, high: 100 },
// Generate customer IDs we can reference
$customer_ids: $n_customers::[UUID::()], // Array of customer UUIDs
$product_ids: $n_products::[UUID::()], // Array of product UUIDs
customers: static_data::{
$data: {
customer_id: Uniform::{ choices: $customer_ids }, // Reference predefined IDs
name: LoremIpsumTitle,
email: Format::{ pattern: "customer{UUID}@example.com" }
}
},
products: static_data::{
$data: {
product_id: Uniform::{ choices: $product_ids }, // Reference predefined IDs
name: LoremIpsumTitle,
price: UniformDecimal::{ low: 5.00, high: 200.00 }
}
},
orders: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::UniformU8::{ low: 5, high: 30 } },
$data: {
order_id: UUID,
customer_id: Uniform::{ choices: $customer_ids }, // Valid customer reference
product_id: Uniform::{ choices: $product_ids }, // Valid product reference
quantity: UniformU8::{ low: 1, high: 5 },
timestamp: Instant
}
}
}
Temporal Coordination
rand_processes::{
// Shared timing patterns
$business_hours: HomogeneousPoisson::{ interarrival: minutes::UniformU8::{ low: 2, high: 10 } },
$after_hours: HomogeneousPoisson::{ interarrival: hours::UniformU8::{ low: 1, high: 4 } },
// Customer activity during business hours
customer_activity: rand_process::{
$arrival: $business_hours,
$data: {
activity_id: UUID,
activity_type: Uniform::{ choices: ["browse", "search", "purchase", "support"] },
timestamp: Instant,
response_time: LogNormalF64::{ location: 2.5, scale: 0.3 } // Faster during business hours
}
},
// System maintenance after hours
system_maintenance: rand_process::{
$arrival: $after_hours,
$data: {
maintenance_id: UUID,
maintenance_type: Uniform::{ choices: ["backup", "update", "cleanup", "monitoring"] },
timestamp: Instant,
duration_minutes: UniformU16::{ low: 15, high: 120 }
}
}
}
Script Testing and Development
Iterative Development Process
# 1. Start with minimal script
echo 'rand_processes::{ test: rand_process::{ $arrival: HomogeneousPoisson::{ interarrival: seconds::1 }, $data: { id: UUID } } }' > minimal.ion
# 2. Validate basic structure
beamline gen data --seed 1 --start-auto --script-path minimal.ion --sample-count 3
# 3. Add complexity incrementally
# ... edit script to add fields, variables, etc.
# 4. Test each addition
beamline gen data --seed 1 --start-auto --script-path enhanced.ion --sample-count 5
# 5. Validate schema
beamline infer-shape --seed 1 --start-auto --script-path enhanced.ion --output-format basic-ddl
Script Debugging Techniques
Add Debug Fields
rand_processes::{
test_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::5 },
$data: {
// Production fields
user_id: UUID,
action: Uniform::{ choices: ["login", "logout"] },
// Debug fields (remove in production)
debug_tick: Tick,
debug_timestamp: Instant,
debug_seed_info: Format::{ pattern: "Generated at tick {Tick}" }
}
}
}
Validate Variable Evaluation
rand_processes::{
// Test variable evaluation
$test_var: UniformI32::{ low: 1, high: 10 },
$forced_eval: UniformI32::{ low: 100, high: 200 }::(),
debug_variables: rand_process::{
$data: {
normal_var: $test_var, // New value each time
forced_var: $forced_eval, // Same value each time
comparison: Format::{ pattern: "normal: {$test_var}, forced: {$forced_eval}" }
}
}
}
Test Script Fragments
# Test individual components
echo 'rand_processes::{ test_generators: rand_process::{ $arrival: HomogeneousPoisson::{ interarrival: seconds::1 }, $data: { test_field: NormalF64::{ mean: 0.0, std_dev: 1.0 } } } }' | \
beamline gen data --seed 1 --start-auto --script - --sample-count 5
Performance Optimization in Scripts
Generator Efficiency
rand_processes::{
// Efficient - simple generators
efficient_data: rand_process::{
$data: {
id: UUID, // Very fast
count: UniformI32::{ low: 1, high: 1000 }, // Fast
flag: Bool // Very fast
}
},
// Less efficient - complex generators
complex_data: rand_process::{
$data: {
// Slower - statistical distributions
normal_value: NormalF64::{ mean: 0.0, std_dev: 1.0 },
// Slower - complex regex patterns
complex_pattern: Regex::{ pattern: "([A-Z][a-z]{2,8}\\s){3}[A-Z][a-z]{2,8}" },
// Slower - large arrays
large_array: UniformArray::{
min_size: 50,
max_size: 100,
element_type: NormalF64::{ mean: 0.0, std_dev: 1.0 }
}
}
}
}
Variable Reuse for Performance
rand_processes::{
// Efficient - reuse expensive generators
$expensive_distribution: WeibullF64::{ shape: 2.0, scale: 100.0 },
$simple_choices: Uniform::{ choices: ["A", "B", "C", "D"] },
optimized_data: rand_process::{
$data: {
// Reuse the same expensive distribution
measurement1: $expensive_distribution,
measurement2: $expensive_distribution,
measurement3: $expensive_distribution,
// Reuse simple categorical generator
category1: $simple_choices,
category2: $simple_choices
}
}
}
Memory-Conscious Patterns
rand_processes::{
// Memory-efficient approach
streaming_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: milliseconds::10 },
$data: {
// Simple fields - low memory
id: UUID,
timestamp: Instant,
value: UniformF64::{ low: 0.0, high: 100.0 },
// Avoid large embedded structures in high-frequency data
// metadata: { /* avoid large nested objects */ }
}
},
// Separate detailed data as less frequent dataset
detailed_metadata: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::5 }, // Much less frequent
$data: {
detail_id: UUID,
large_description: LoremIpsum::{ min_words: 50, max_words: 200 },
complex_structure: {
nested_data: LoremIpsumTitle,
more_nested: {
deep_field: UniformF64::{ low: 0.0, high: 1.0 }
}
}
}
}
}
Error Handling in Scripts
Common Script Errors
Invalid Ion Syntax
// Wrong - missing closing brace
rand_processes::{
test: rand_process::{
$data: {
id: UUID
}
// Missing closing brace here
Error:
Error: Failed to parse Ion script: Expected closing brace '}' at line 8
Invalid Generator Configuration
// Wrong - min > max
rand_processes::{
test: rand_process::{
$data: {
bad_range: UniformI32::{ low: 100, high: 50 } // Invalid range
}
}
}
Missing Required Fields
// Wrong - missing arrival for rand_process
rand_processes::{
test: rand_process::{
$data: { id: UUID } // Missing $arrival
}
}
Script Validation Patterns
rand_processes::{
// Good - comprehensive configuration
validated_data: rand_process::{
// Required: arrival process
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
// Required: data definition
$data: {
// Validate ranges
valid_range: UniformI32::{ low: 1, high: 100 }, // min <= max
// Validate probabilities
valid_probability: Bool::{ p: 0.5 }, // 0.0 <= p <= 1.0
// Validate nullable/optional
valid_nullable: UniformF64::{
low: 0.0,
high: 1.0,
nullable: 0.1, // 0.0 <= nullable <= 1.0
optional: 0.05 // 0.0 <= optional <= 1.0
}
}
}
}
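If you generate scripts programmatically, it can help to run the same sanity checks before handing them to Beamline. This is a sketch of a pre-flight validator of your own, not part of Beamline's API; it mirrors the range and probability constraints noted in the comments above:

```python
def check_range(low, high):
    # Mirrors the "min <= max" constraint on Uniform* generators
    if low > high:
        raise ValueError(f"invalid range: low {low} > high {high}")

def check_probability(p, name):
    # Mirrors the "0.0 <= p <= 1.0" constraint on p / nullable / optional
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"{name} must be within [0.0, 1.0], got {p}")

check_range(1, 100)                 # ok: min <= max
check_probability(0.5, "p")         # ok
check_probability(0.1, "nullable")  # ok
try:
    check_range(100, 50)            # the invalid range from the earlier example
except ValueError as e:
    print(e)  # → invalid range: low 100 > high 50
```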
Script Documentation
Inline Documentation Best Practices
rand_processes::{
// =============================================================================
// E-Commerce Simulation Script v2.1
//
// Purpose: Generate realistic e-commerce data for performance testing
// Author: Data Team
// Created: 2024-01-01
// Last Modified: 2024-01-15
//
// Datasets Generated:
// - customers: Static customer profiles (10-50 customers)
// - products: Static product catalog (50-200 products)
// - orders: Dynamic order events (variable frequency)
// - reviews: Dynamic product reviews (low frequency)
// =============================================================================
// === CONFIGURATION VARIABLES ===
// Customer population size
$n_customers: UniformU8::{ low: 10, high: 50 }, // 10-50 customers for testing
// Product catalog size
$n_products: UniformU8::{ low: 50, high: 200 }, // 50-200 products
// Business parameters
$avg_order_value: UniformDecimal::{ low: 25.00, high: 500.00 }, // Realistic order sizes
$customer_satisfaction: UniformF64::{ low: 0.7, high: 0.95 }, // High satisfaction rate
// === SHARED GENERATORS ===
$customer_id: UUID, // Unique customer identifiers
$product_id: UUID, // Unique product identifiers
$order_id: UUID, // Unique order identifiers
// === STATIC REFERENCE DATA ===
// Customer master data - generated once at simulation start
customers: static_data::{
$data: {
customer_id: $customer_id,
name: LoremIpsumTitle, // Realistic names
email: Format::{ pattern: "customer{UUID}@example.com" },
registration_date: Date, // All register at simulation start
loyalty_tier: Uniform::{ choices: ["bronze", "silver", "gold", "platinum"] }
}
},
// Product catalog - static reference data
products: static_data::{
$data: {
product_id: $product_id,
name: LoremIpsumTitle,
category: Uniform::{ choices: ["Electronics", "Clothing", "Books", "Home"] },
base_price: $avg_order_value,
in_stock: Bool::{ p: 0.9 } // 90% of products in stock
}
},
// === DYNAMIC TRANSACTIONAL DATA ===
// Order events - customers place orders over time
orders: rand_process::{
// Variable order frequency - some customers more active
$r: UniformU8::{ low: 30, high: 180 }, // 30-180 minutes between orders
$arrival: HomogeneousPoisson::{ interarrival: minutes::$r },
$data: {
order_id: $order_id,
customer_id: $customer_id, // Links to customers dataset
product_id: $product_id, // Links to products dataset
quantity: UniformU8::{ low: 1, high: 5 },
order_total: $avg_order_value,
timestamp: Instant,
// Order status progression
status: Uniform::{
choices: ["pending", "processing", "shipped", "delivered"],
// Weight towards later statuses for realistic distribution
}
}
},
// Product reviews - less frequent than orders
reviews: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: hours::UniformU8::{ low: 2, high: 48 } },
$data: {
review_id: UUID,
product_id: $product_id, // Links to products dataset
customer_id: $customer_id, // Links to customers dataset
rating: UniformU8::{ low: 1, high: 5 },
review_text: LoremIpsum::{
min_words: 10,
max_words: 100,
optional: 0.3 // 30% don't write review text
},
timestamp: Instant,
verified_purchase: Bool::{ p: 0.8 } // 80% are verified purchases
}
}
}
Script Maintenance and Version Control
Script Versioning
rand_processes::{
// === SCRIPT METADATA ===
script_info: static_data::{
$data: {
script_version: "3.2.1",
created_date: "2024-01-01",
last_modified: "2024-01-15",
author: "data-engineering-team",
description: "Multi-tenant SaaS simulation with realistic usage patterns"
}
},
// Script content follows...
}
Migration Between Script Versions
# Test new script version against old version
beamline gen data --seed 1000 --start-auto --script-path data_v3.ion --sample-count 100 > new_output.ion
beamline gen data --seed 1000 --start-auto --script-path data_v2.ion --sample-count 100 > old_output.ion
# Compare schemas
beamline infer-shape --seed 1 --start-auto --script-path data_v3.ion --output-format basic-ddl > new_schema.sql
beamline infer-shape --seed 1 --start-auto --script-path data_v2.ion --output-format basic-ddl > old_schema.sql
diff old_schema.sql new_schema.sql
Real-World Script Examples
IoT Sensor Network
rand_processes::{
// Network topology
$n_locations: UniformU8::{ low: 3, high: 12 },
$n_sensors_per_location: UniformU8::{ low: 5, high: 15 },
// Environmental factors
$base_temperature: NormalF64::{ mean: 22.0, std_dev: 3.0 },
$seasonal_variation: UniformF64::{ low: -5.0, high: 5.0 },
iot_network: $n_locations::[
{
$location_id: UUID::(),
$location_temp_offset: UniformF64::{ low: -2.0, high: 2.0 }::(), // Per-location offset
// Location metadata
'location_{$@n}': static_data::{
$data: {
location_id: $location_id,
location_name: Format::{ pattern: "Site-{$@n}" },
coordinates: {
latitude: UniformF64::{ low: 40.0, high: 45.0 },
longitude: UniformF64::{ low: -75.0, high: -70.0 }
},
installation_date: Date
}
},
// Sensors at location
sensors: $n_sensors_per_location::[
{
$sensor_id: UUID::(),
'location_{$@n}_sensor_{$@n}': rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::UniformU8::{ low: 30,
Static Data Generation
Static data in Beamline refers to data that is generated once at the beginning of the simulation, before any temporal events occur. This is useful for creating reference tables, lookup data, or any information that doesn’t change over the course of your simulation.
What is Static Data?
Static data is generated using static_data blocks instead of rand_process blocks. Key differences:
- Generated once: All static data is created at simulation time 0
- No arrival process: No $arrival configuration needed
- Reference data: Often used for lookup tables, master data, and configuration
- Shared across processes: Can be referenced by multiple dynamic processes
Basic Syntax
dataset_name: static_data::{
$data: {
// Generator configuration (same as rand_process)
field1: GeneratorType,
field2: GeneratorType::{ configuration }
}
}
Static vs Dynamic Data
Dynamic Data (rand_process)
orders: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: days::5 },
$data: {
order_id: UUID,
timestamp: Instant,
amount: UniformDecimal::{ low: 10.00, high: 500.00 }
}
}
Characteristics:
- Generated over simulation time
- Each record has different timestamps
- Follows arrival process (Poisson, uniform, etc.)
Static Data (static_data)
product_catalog: static_data::{
$data: {
product_id: UUID,
name: LoremIpsumTitle,
base_price: UniformDecimal::{ low: 5.00, high: 200.00 }
}
}
Characteristics:
- Generated all at once at time 0
- All records have the same timestamp (simulation start time)
- No arrival process needed
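The two timestamp patterns can be pictured with a short Python analogy (assumptions: an exponential waiting time stands in for the Poisson arrival process, and a 30-second mean interarrival is arbitrary):

```python
import random
from datetime import datetime, timedelta

random.seed(7)
start = datetime(2024, 1, 1)  # simulation start

# Static data: every record carries the simulation start time.
static_rows = [{"id": i, "ts": start} for i in range(3)]

# Dynamic data: each record arrives after an exponential waiting time
# (the interarrival distribution of a homogeneous Poisson process).
t, dynamic_rows = start, []
for i in range(3):
    t += timedelta(seconds=random.expovariate(1 / 30.0))  # mean 30 s
    dynamic_rows.append({"id": i, "ts": t})

assert all(r["ts"] == start for r in static_rows)
assert dynamic_rows[0]["ts"] < dynamic_rows[1]["ts"] < dynamic_rows[2]["ts"]
```

The static rows all share one timestamp, while the dynamic rows spread out over the simulated timeline, exactly the pattern visible in the example output below.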
Real Example: Customer and Orders
From the orders.ion test script, here’s how static and dynamic data work together:
rand_processes::{
// Generate between 5 & 20 customers
$n: UniformU8::{ low: 5, high: 20 },
// Shared generators
$id_gen: UUID,
$oid_gen: UUID,
customers: $n::[
{
// Each customer gets a unique ID
$id: $id_gen::(),
// Static customer data - generated once per customer
customer_table: static_data::{
$data: {
id: $id,
address: Format::{ pattern: "{$@n} Foo Bar Ave" }
}
},
// Dynamic order data - generated over time
orders: rand_process::{
$r: UniformU8::{ low: 1, high: 150 },
$arrival: HomogeneousPoisson::{ interarrival: days::$r },
$data: {
Order: $oid_gen,
Time: Instant,
Customer: $id // References the same ID
}
}
}
]
}
When executed, this generates:
Static Data (all at simulation start):
[2019-08-01 0:00:01.0 -07:00:00] : "customer_table" { 'id': 'd858b1e7-7327-7c40-1698-0e0e4fe89ecc', 'address': '0 Foo Bar Ave' }
[2019-08-01 0:00:01.0 -07:00:00] : "customer_table" { 'id': '179e600a-c1c5-8ac2-05b6-15b20f8fe740', 'address': '1 Foo Bar Ave' }
[2019-08-01 0:00:01.0 -07:00:00] : "customer_table" { 'address': '2 Foo Bar Ave', 'id': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0' }
Dynamic Data (spread over time):
[2019-08-01 7:26:21.964 -07:00:00] : "orders" { 'Customer': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0', 'Order': '4c579e42-8c70-93f4-b99b-cc45c50197ed' }
[2019-08-10 5:46:15.24 -07:00:00] : "orders" { 'Customer': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0', 'Order': '38900593-e9cc-994a-98d9-0becf77d9144' }
[2019-08-11 7:27:49.565 -07:00:00] : "orders" { 'Customer': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0', 'Order': 'b2aa0efc-dac3-b391-f4c2-3c298e0c99f4' }
Notice how:
- All customer_table records have the same timestamp (simulation start)
- The orders records are distributed over time with different timestamps
- Both datasets share the same customer IDs, creating referential relationships
Use Cases for Static Data
Reference Tables
Create lookup tables that don’t change during simulation:
rand_processes::{
// Static product catalog
products: static_data::{
$data: {
product_id: UUID,
name: LoremIpsumTitle,
category: Uniform::{ choices: ["Electronics", "Clothing", "Books", "Home"] },
base_price: UniformDecimal::{ low: 5.00, high: 500.00 }
}
},
// Dynamic orders referencing products
orders: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::30 },
$data: {
order_id: UUID,
// Note: In real usage, you'd want to reference actual product IDs
product_category: Uniform::{ choices: ["Electronics", "Clothing", "Books", "Home"] },
timestamp: Instant
}
}
}
Configuration Data
Generate system configuration that remains constant:
rand_processes::{
// System configuration - static
config: static_data::{
$data: {
system_id: UUID,
version: Uniform::{ choices: ["1.0", "1.1", "2.0"] },
max_connections: UniformU16::{ low: 100, high: 1000 },
timeout_seconds: UniformU8::{ low: 30, high: 300 }
}
},
// Application events - dynamic
events: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::10 },
$data: {
event_id: UUID,
event_type: Uniform::{ choices: ["login", "logout", "action", "error"] },
timestamp: Instant
}
}
}
User Profiles and Activity
Create user profiles once, then generate their activities over time:
rand_processes::{
$n: UniformU8::{ low: 10, high: 50 }, // 10-50 users
$id_gen: UUID,
users: $n::[
{
$user_id: $id_gen::(), // One ID per user
// Static user profile
user_profiles: static_data::{
$data: {
user_id: $user_id,
username: Format::{ pattern: "user_{$@n}" },
email: Format::{ pattern: "user{$@n}@example.com" },
registration_date: Date,
plan_type: Uniform::{ choices: ["free", "premium", "enterprise"] }
}
},
// Dynamic user activity
user_activity: rand_process::{
$r: UniformU8::{ low: 30, high: 180 }, // 30-180 minutes between actions
$arrival: HomogeneousPoisson::{ interarrival: minutes::$r },
$data: {
user_id: $user_id,
action_type: Uniform::{ choices: ["view", "click", "purchase", "search"] },
timestamp: Instant,
session_id: UUID
}
}
}
]
}
Time-Related Generators in Static Data
When using time-related generators in static data, they use the simulation start time:
Instant and Date in Static Data
rand_processes::{
// System startup data
system_info: static_data::{
$data: {
system_id: UUID,
startup_time: Instant, // Will be simulation start time
startup_date: Date, // Will be simulation start date
boot_tick: Tick, // Will be 0 (initial tick)
version: "1.0.0"
}
}
}
Output Example:
[2024-01-01 00:00:00.000 +00:00] : "system_info" {
'system_id': '123e4567-e89b-12d3-a456-426614174000',
'startup_time': 2024-01-01T00:00:00.000000000+00:00,
'startup_date': 2024-01-01T00:00:00.000000000+00:00,
'boot_tick': 0,
'version': '1.0.0'
}
Multiple Static Datasets
You can create multiple static datasets in the same script:
rand_processes::{
// Company information
companies: static_data::{
$data: {
company_id: UUID,
name: Format::{ pattern: "Company {UUID}" },
industry: Uniform::{ choices: ["Tech", "Finance", "Retail", "Healthcare"] }
}
},
// Department information
departments: static_data::{
$data: {
dept_id: UUID,
name: Uniform::{ choices: ["Engineering", "Sales", "Marketing", "HR"] },
budget: UniformDecimal::{ low: 50000.00, high: 2000000.00 }
}
},
// Employee events - references both static datasets
employee_events: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: hours::8 },
$data: {
employee_id: UUID,
event_type: Uniform::{ choices: ["hire", "promotion", "transfer", "resignation"] },
timestamp: Instant,
// Note: In real usage, you'd reference actual company/dept IDs
company_type: Uniform::{ choices: ["Tech", "Finance", "Retail", "Healthcare"] },
department: Uniform::{ choices: ["Engineering", "Sales", "Marketing", "HR"] }
}
}
}
Static Data with Variables and Loops
Create multiple static datasets using loops:
rand_processes::{
$n: UniformU8::{ low: 3, high: 8 }, // 3-8 regions
$region_id: UUID,
regions: $n::[
{
$id: $region_id::(), // Unique ID per region
// Static region data
'region_{$@n}': static_data::{
$data: {
region_id: $id,
region_name: Format::{ pattern: "Region {$@n}" },
timezone: Uniform::{ choices: ["UTC-8", "UTC-5", "UTC", "UTC+1"] },
population: UniformU32::{ low: 100000, high: 10000000 }
}
}
}
]
}
This creates multiple static datasets like region_0, region_1, region_2, etc.
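The `'{$@n}'` interpolation in dataset names behaves like ordinary index-based string templating. A plain Python sketch of the resulting names, assuming `$n` drew the value 3:

```python
n = 3  # one concrete draw of $n
names = [f"region_{i}" for i in range(n)]
print(names)  # → ['region_0', 'region_1', 'region_2']
```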
Complex Static Data Structures
Static data supports all the same generators as dynamic data:
rand_processes::{
// Complex static configuration
system_config: static_data::{
$data: {
config_id: UUID,
created_at: Instant,
// Nested configuration
database: {
host: Regex::{ pattern: "db[0-9]{2}\\.example\\.com" },
port: UniformU16::{ low: 5432, high: 5439 },
ssl_enabled: Bool::{ p: 0.9 }
},
// Array of server configurations
servers: UniformArray::{
min_size: 3,
max_size: 10,
element_type: {
server_id: UUID,
hostname: Regex::{ pattern: "server[0-9]{3}\\.example\\.com" },
cpu_cores: Uniform::{ choices: [4, 8, 16, 32] },
memory_gb: Uniform::{ choices: [16, 32, 64, 128] }
}
},
// Mixed type configuration
features: UniformAnyOf::{
types: [
Bool,
UniformI32::{ low: 1, high: 100 },
LoremIpsumTitle
]
}
}
}
}
Static Data Best Practices
1. Use for Reference Data
// Good - static reference data
product_categories: static_data::{
$data: {
category_id: UUID,
name: Uniform::{ choices: ["Electronics", "Books", "Clothing"] },
tax_rate: UniformDecimal::{ low: 0.05, high: 0.10 }
}
}
// Avoid - frequently changing data should be dynamic
2. Share IDs Between Static and Dynamic
rand_processes::{
$customer_id: UUID,
customers: 5::[
{
$id: $customer_id::(), // Generate once per customer
// Static profile
customer_profiles: static_data::{
$data: {
customer_id: $id,
name: LoremIpsumTitle,
email: Format::{ pattern: "customer{$@n}@example.com" }
}
},
// Dynamic transactions
transactions: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: days::10 },
$data: {
customer_id: $id, // Same ID
transaction_id: UUID,
amount: UniformDecimal::{ low: 10.00, high: 1000.00 }
}
}
}
]
}
3. Use Meaningful Static Data
// Good - realistic static data
countries: static_data::{
$data: {
country_code: Regex::{ pattern: "[A-Z]{2}" },
country_name: LoremIpsumTitle,
population: LogNormalF64::{ location: 15.0, scale: 2.0 }, // Realistic population distribution
gdp_per_capita: LogNormalF64::{ location: 8.5, scale: 1.5 }
}
}
// Avoid - unrealistic or meaningless static data
4. Consider Static Data Size
rand_processes::{
// Small static dataset - appropriate
currencies: static_data::{
$data: {
currency_code: Regex::{ pattern: "[A-Z]{3}" },
exchange_rate: UniformF64::{ low: 0.1, high: 10.0 }
}
}
}
For large reference datasets, consider whether the data truly needs to be static or could instead be modeled as a slow-changing dynamic process.
Output Characteristics
CLI Output Format
When you run data generation, static data appears first with identical timestamps:
$ beamline gen data \
--seed 1234 \
--start-iso "2019-08-01T00:00:01-07:00" \
--script-path partiql-beamline-sim/tests/scripts/orders.ion \
--sample-count 10 \
--output-format text
Seed: 1234
Start: 2019-08-01T00:00:01.000000000-07:00
# Static data first (all at start time)
[2019-08-01 0:00:01.0 -07:00:00] : "customer_table" { 'id': 'd858b1e7-7327-7c40-1698-0e0e4fe89ecc', 'address': '0 Foo Bar Ave' }
[2019-08-01 0:00:01.0 -07:00:00] : "customer_table" { 'id': '179e600a-c1c5-8ac2-05b6-15b20f8fe740', 'address': '1 Foo Bar Ave' }
[2019-08-01 0:00:01.0 -07:00:00] : "customer_table" { 'address': '2 Foo Bar Ave', 'id': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0' }
# Dynamic data follows (spread over time)
[2019-08-01 7:26:21.964 -07:00:00] : "orders" { 'Customer': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0', 'Order': '4c579e42-8c70-93f4-b99b-cc45c50197ed' }
[2019-08-10 5:46:15.24 -07:00:00] : "orders" { 'Customer': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0', 'Order': '38900593-e9cc-994a-98d9-0becf77d9144' }
Ion Pretty Format
$ beamline gen data \
--seed 1234 \
--start-auto \
--script-path with_static.ion \
--sample-count 5 \
--output-format ion-pretty
{
seed: 1234,
start: "2024-01-01T00:00:00Z",
data: {
// Static data grouped together
config: [
{
system_id: "123e4567-e89b-12d3-a456-426614174000",
version: "1.0",
created_at: 2024-01-01T00:00:00Z
}
],
// Dynamic data grouped together
events: [
{
event_id: "987fcdeb-51a2-43d1-9f4e-123456789abc",
timestamp: 2024-01-01T00:05:23Z,
type: "user_login"
},
{
event_id: "456789ab-cdef-1234-5678-9abcdef01234",
timestamp: 2024-01-01T00:08:45Z,
type: "user_action"
}
]
}
}
Common Patterns
Master Data Pattern
rand_processes::{
// Static master data
locations: static_data::{
$data: {
location_id: UUID,
city: LoremIpsumTitle,
country_code: Regex::{ pattern: "[A-Z]{2}" },
latitude: UniformF64::{ low: -90.0, high: 90.0 },
longitude: UniformF64::{ low: -180.0, high: 180.0 }
}
},
// Events at locations
weather_events: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: hours::6 },
$data: {
event_id: UUID,
// In real usage, would reference actual location_id
temperature: NormalF64::{ mean: 20.0, std_dev: 10.0 },
humidity: UniformF64::{ low: 20.0, high: 90.0 },
timestamp: Instant
}
}
}
Hierarchical Data Pattern
rand_processes::{
$n_orgs: UniformU8::{ low: 2, high: 5 },
$org_id: UUID,
organizations: $n_orgs::[
{
$id: $org_id::(),
// Static organization info
'org_{$@n}': static_data::{
$data: {
org_id: $id,
org_name: Format::{ pattern: "Organization {$@n}" },
industry: Uniform::{ choices: ["Tech", "Finance", "Healthcare"] },
founded_year: UniformU16::{ low: 1950, high: 2020 }
}
},
// Dynamic organizational events
'org_events_{$@n}': rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: days::30 },
$data: {
org_id: $id,
event_type: Uniform::{ choices: ["hire", "fire", "restructure", "acquisition"] },
timestamp: Instant,
impact_score: NormalF64::{ mean: 5.0, std_dev: 2.0 }
}
}
}
]
}
Database Generation with Static Data
When creating databases with gen db beamline-lite, static data creates separate dataset files:
$ beamline gen db beamline-lite \
--seed 1000 \
--start-auto \
--script-path partiql-beamline-sim/tests/scripts/orders.ion \
--sample-count 1000
$ tree beamline-catalog/
beamline-catalog/
├── .beamline-manifest
├── .beamline-script
├── customer_table.ion # Static data
├── customer_table.shape.ion # Static data schema
├── customer_table.shape.sql # Static data SQL schema
├── orders.ion # Dynamic data
├── orders.shape.ion # Dynamic data schema
└── orders.shape.sql # Dynamic data SQL schema
Static data file (customer_table.ion):
{id: "abc-123", address: "0 Main St"}
{id: "def-456", address: "1 Main St"}
{id: "ghi-789", address: "2 Main St"}
Dynamic data file (orders.ion):
{Customer: "abc-123", Order: "order-001", Time: 2024-01-01T00:15:30Z}
{Customer: "def-456", Order: "order-002", Time: 2024-01-01T01:22:15Z}
{Customer: "abc-123", Order: "order-003", Time: 2024-01-01T02:08:45Z}
Performance Implications
Memory Usage
- Static data is generated once and stored in memory during generation
- Large static datasets may increase memory usage
- Consider data size when designing static datasets
Generation Speed
- Static generation happens once at startup
- No temporal computation needed for static data
- Overall faster than equivalent dynamic data
Best Practices for Large Static Data
// If you need large reference data, consider dynamic with very slow arrival
// Instead of large static data:
large_reference: static_data::{ /* ... thousands of records ... */ }
// Consider slow dynamic process:
reference_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: days::365 }, // Very infrequent
$data: { /* ... */ }
}
Troubleshooting Static Data
Issue: Static Data Not Appearing
Cause: The --sample-count option does not affect static data; static datasets are always generated once from the script configuration, so a missing dataset usually indicates a script problem.
Solution: Check your script syntax and variable definitions.
Issue: Unexpected Timestamps
Cause: All static data uses simulation start time.
Solution: This is expected behavior. Use dynamic processes for time-distributed data.
Issue: Large Memory Usage
Cause: Large static datasets are loaded into memory.
Solution: Reduce static dataset size or convert to slow dynamic processes.
Examples from Test Scripts
Simple Static Configuration
// From a test script pattern
config: static_data::{
$data: {
app_version: "2.1.0",
max_users: UniformU32::{ low: 1000, high: 10000 },
feature_flags: UniformAnyOf::{
types: [Bool, UniformI32::{ low: 0, high: 100 }]
}
}
}
Multi-Dataset Static Pattern
rand_processes::{
$n: UniformU8::{ low: 5, high: 15 },
servers: $n::[
{
'server_config_{$@n}': static_data::{
$data: {
server_id: Format::{ pattern: "server-{$@n}" },
hostname: Format::{ pattern: "srv{$@n}.example.com" },
ip_address: Regex::{ pattern: "192\\.168\\.[0-9]{1,3}\\.[0-9]{1,3}" },
capacity: Uniform::{ choices: [100, 200, 500, 1000] }
}
}
}
]
}
Next Steps
- Datasets - Learn about working with multiple datasets and relationships
- Output Formats - Understand how static data appears in different formats
- Scripts - Advanced Ion scripting techniques with static and dynamic data
- Examples - See static data in complete examples
Output Formats
Beamline supports multiple output formats for generated data, each optimized for different use cases. Understanding these formats helps you choose the right one for your workflow.
Available Formats
The CLI supports four main output formats via --output-format:
| Format | Description | Use Case | Performance |
|---|---|---|---|
| text | Human-readable timestamped format | Debugging, inspection | Moderate |
| ion | Compact Ion text format | Data processing | Fast |
| ion-pretty | Pretty-printed Ion with metadata | Configuration, documentation | Slower |
| ion-binary | Binary Ion format | High-performance storage | Fastest |
Text Format (Default)
Characteristics
- Human-readable: Easy to read and debug
- Timestamped: Each record includes generation timestamp and dataset name
- Streaming: Records appear as they’re generated
- Metadata: Shows seed and start time
Output Structure
Seed: <seed_value>
Start: <start_timestamp>
[<timestamp>] : "<dataset_name>" { <ion_data> }
[<timestamp>] : "<dataset_name>" { <ion_data> }
...
Example Output
$ beamline gen data \
--seed 1234 \
--start-auto \
--script-path sensors.ion \
--sample-count 5 \
--output-format text
Seed: 1234
Start: 2024-05-10T04:04:53.000000000Z
[2024-05-10 4:06:07.274 +00:00:00] : DataSetName("sensors") { 'tick': 74274, 'i8': -86, 'f': 48.07286740416876, 'w': NULL, 'd': 23, 'a': 3.1640, 'ar1': [0.8, 1.1, 1.1], 'ar2': ['e8b12a6c-7cf1-45b6-a8a4-89cd6a418660', 'ba408184-3b94-41e7-860f-6042708bb4be'], 'ar3': [NULL, NULL], 'ar4': [6, 4], 'ar5': [3.1640] }
[2024-05-10 4:08:15.65 +00:00:00] : DataSetName("sensors") { 'tick': 202650, 'i8': 6, 'f': 45.56429323253781, 'w': NULL, 'd': 26, 'a': '613de2a3-195c-410f-8dac-56237f53aa99', 'ar1': [1.1, 0.9, 0.7], 'ar2': ['e0c6700e-f429-429a-a461-c018820fbafe', '9fce83a7-45ef-4210-affe-b87b45e3ac73'], 'ar3': [NULL, 2.4409], 'ar4': [4, 8], 'ar5': ['613de2a3-195c-410f-8dac-56237f53aa99'] }
Use Cases
- Development and debugging: Easy to read individual records
- Log file analysis: Timestamped records for event correlation
- Quick inspection: Rapid visual validation of generated data
- Educational: Learning how data generation works
Ion Format
Characteristics
- Compact: No pretty-printing or extra whitespace
- Fast: Minimal formatting overhead
- Ion text: Preserves all Ion type information
- Processable: Easy to parse with Ion libraries
Output Structure
{seed:<seed>,start:"<timestamp>",data:{<dataset_name>:[{<record>},{<record>}...]}}
Example Output
$ beamline gen data \
--seed 42 \
--start-auto \
--script-path simple.ion \
--sample-count 3 \
--output-format ion
{seed:42,start:"2024-01-01T00:00:00Z",data:{sensors:[{f:-2.543639,i8:4,tick:125532},{f:-63.493088,i8:4,tick:218756},{f:12.345679,i8:-12,tick:253123}]}}
Use Cases
- Data processing pipelines: Efficient parsing and processing
- API responses: Compact data transmission
- Intermediate storage: Balance between readability and efficiency
- Configuration files: Structured data that’s still readable
Ion Pretty Format
Characteristics
- Human-readable: Well-formatted with indentation
- Complete metadata: Includes seed, start time, and full data structure
- Ion text format: Preserves all type information
- Structured: Clear hierarchical organization
Output Structure
{
seed: <seed>,
start: "<timestamp>",
data: {
<dataset_name>: [
{
<field>: <value>,
<field>: <value>
},
{
<field>: <value>,
<field>: <value>
}
]
}
}
Example Output
$ beamline gen data \
--seed 123 \
--start-auto \
--script-path sensors.ion \
--sample-count 2 \
--output-format ion-pretty
{
seed: 123,
start: "2024-01-20T10:30:00.000000000Z",
data: {
sensors: [
{
f: -2.5436390152455175e0,
i8: 4,
tick: 125532
},
{
f: -63.49308817145054e0,
i8: 4,
tick: 218756
}
]
}
}
Use Cases
- Configuration files: Readable but structured data
- Documentation: Examples and samples in documentation
- Data inspection: Understanding complex nested structures
- Archive storage: Long-term storage with metadata
Ion Binary Format
Characteristics
- Most compact: Smallest file size
- Fastest: Highest performance for generation and parsing
- Type preservation: All Ion types preserved exactly
- Not human-readable: Requires Ion tools to read
Example Usage
$ beamline gen data \
--seed 999 \
--start-auto \
--script-path large_dataset.ion \
--sample-count 1000000 \
--output-format ion-binary > data.bin
Use Cases
- Large datasets: Maximum efficiency for big data generation
- High-performance applications: Minimal parsing overhead
- Storage optimization: Smallest possible file sizes
- Data transmission: Efficient network transfer
Format Comparison
Size Comparison
For the same dataset with 1000 records:
# Generate in all formats for comparison
beamline gen data --seed 1 --start-auto --script-path data.ion --sample-count 1000 --output-format text > data.txt
beamline gen data --seed 1 --start-auto --script-path data.ion --sample-count 1000 --output-format ion > data.ion
beamline gen data --seed 1 --start-auto --script-path data.ion --sample-count 1000 --output-format ion-pretty > data_pretty.ion
beamline gen data --seed 1 --start-auto --script-path data.ion --sample-count 1000 --output-format ion-binary > data.bin
# Compare sizes
ls -lh data.*
# Example results:
# -rw-r--r-- 1 user user 245K data.txt (text - largest)
# -rw-r--r-- 1 user user 156K data.ion (ion - medium)
# -rw-r--r-- 1 user user 189K data_pretty.ion (pretty - larger due to formatting)
# -rw-r--r-- 1 user user 98K data.bin (binary - smallest)
Performance Comparison
For generation of 100,000 records:
- ion-binary: Fastest (baseline)
- ion: ~10% slower than binary
- text: ~25% slower than binary
- ion-pretty: ~40% slower than binary (due to formatting)
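To turn those relative numbers into rough wall-clock estimates, scale a single measured baseline. A small Python sketch (the baseline time here is a hypothetical placeholder, not a benchmark result):

```python
# Scale a measured ion-binary baseline by the relative overheads above.
# baseline_seconds is a hypothetical placeholder measurement.
baseline_seconds = 4.0
overhead = {"ion-binary": 0.00, "ion": 0.10, "text": 0.25, "ion-pretty": 0.40}

estimates = {fmt: baseline_seconds * (1 + pct) for fmt, pct in overhead.items()}
for fmt, secs in estimates.items():
    print(f"{fmt}: {secs:.1f}s")  # e.g. text: 5.0s
```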
Format-Specific Features
Text Format Features
Timestamp visibility: See exactly when each event occurred in simulation time
[2024-01-01 08:15:23.456 +00:00] : "orders" { 'order_id': '123e4567', 'amount': 99.99 }
[2024-01-01 08:20:45.789 +00:00] : "orders" { 'order_id': '987fcdeb', 'amount': 149.50 }
Dataset identification: Clear dataset labels for multi-dataset scripts
Ion Formats Features
Type preservation: All Ion types are preserved exactly
{
decimal_field: 123.45, // Exact decimal
float_field: 123.45e0, // Float with exponent notation
timestamp: 2024-01-01T00:00:00Z, // Full timestamp precision
uuid: "123e4567-e89b-12d3-a456-426614174000"
}
Structured data: Complex nested structures preserved
{
user: {
profile: {
preferences: ["dark_mode", "notifications"]
}
}
}
NULL and MISSING Representation
Different formats handle absent values differently:
Text Format
[timestamp] : "dataset" { 'present': 42, 'null_field': null } // MISSING fields omitted
Ion Formats
{
present: 42,
null_field: null
// missing_field is omitted entirely
}
Multiple Dataset Output
Text Format with Multiple Datasets
$ beamline gen data \
--seed 100 \
--start-auto \
--script-path client_service.ion \
--sample-count 10 \
--output-format text
Seed: 100
Start: 2024-01-01T00:00:00Z
[2024-01-01 00:00:00.000 +00:00] : "customer_table" { 'id': 'abc-123', 'address': '0 Main St' }
[2024-01-01 00:00:00.000 +00:00] : "customer_table" { 'id': 'def-456', 'address': '1 Main St' }
[2024-01-01 00:05:30.123 +00:00] : "service" { 'Request': 'req-001', 'Account': 'abc-123' }
[2024-01-01 00:05:30.124 +00:00] : "client_1" { 'id': 'abc-123', 'request_id': 'req-001' }
Ion Pretty Format with Multiple Datasets
{
seed: 100,
start: "2024-01-01T00:00:00Z",
data: {
customer_table: [
{
id: "abc-123",
address: "0 Main St"
},
{
id: "def-456",
address: "1 Main St"
}
],
service: [
{
Request: "req-001",
Account: "abc-123",
StartTime: 2024-01-01T00:05:30.123Z
}
],
client_1: [
{
id: "abc-123",
request_id: "req-001",
request_time: 2024-01-01T00:05:30.124Z
}
]
}
}
Choosing the Right Format
Development and Testing
# Use text for quick debugging
beamline gen data --script-path debug.ion --sample-count 5 --output-format text
# Use ion-pretty for understanding structure
beamline gen data --script-path complex.ion --sample-count 10 --output-format ion-pretty
Production and Performance
# Use ion-binary for large datasets
beamline gen data --script-path production.ion --sample-count 1000000 --output-format ion-binary
# Use ion for balance of efficiency and readability
beamline gen data --script-path data.ion --sample-count 100000 --output-format ion
Integration Workflows
# Generate for different consumers
beamline gen data --seed 42 --start-auto --script-path data.ion --sample-count 10000 --output-format ion-binary > high_perf.ion
beamline gen data --seed 42 --start-auto --script-path data.ion --sample-count 100 --output-format ion-pretty > documentation.ion
beamline gen data --seed 42 --start-auto --script-path data.ion --sample-count 1000 --output-format text > debug.txt
Format-Specific Processing
Processing Text Format
# Extract specific datasets
beamline gen data --script-path multi.ion --output-format text | \
grep '"sensors"' | \
head -10
# Analyze timestamps
beamline gen data --script-path temporal.ion --output-format text | \
awk -F'\\[|\\]' '{print $2}' | \
head -20
Processing Ion Formats
# Use Ion tools for processing
beamline gen data --script-path data.ion --output-format ion-binary | \
ion-cli query "SELECT * FROM data.sensors WHERE f > 0"
# Convert between formats
beamline gen data --script-path data.ion --output-format ion | \
ion-cli pretty > formatted.ion
Pipeline Integration
# Generate and immediately process
beamline gen data \
--seed 123 \
--start-auto \
--script-path metrics.ion \
--sample-count 10000 \
--output-format ion-pretty | \
jq '.data.metrics[] | select(.temperature > 25)' | \
head -10
Database Generation Formats
Database generation creates multiple file formats automatically:
$ beamline gen db beamline-lite \
--seed 42 \
--start-auto \
--script-path data.ion \
--sample-count 1000
$ ls -la beamline-catalog/
-rw-r--r-- 1 user user 145 .beamline-manifest # JSON metadata
-rw-r--r-- 1 user user 2.1K .beamline-script # Ion script
-rw-r--r-- 1 user user 89K sensors.ion # Data in Ion format
-rw-r--r-- 1 user user 412 sensors.shape.ion # Schema in Ion format
-rw-r--r-- 1 user user 298 sensors.shape.sql # Schema in SQL DDL format
Data Files (Ion Format)
$ head -3 beamline-catalog/sensors.ion
{f: -2.5436390152455175e0, i8: 4, tick: 125532}
{f: -63.49308817145054e0, i8: 4, tick: 218756}
{f: 12.34567890123456e0, i8: -12, tick: 253123}
Schema Files (Ion Format)
$ cat beamline-catalog/sensors.shape.ion
{
type: "bag",
items: {
type: "struct",
constraints: [ordered, closed],
fields: [
{ name: "f", type: "double" },
{ name: "i8", type: "int8" },
{ name: "tick", type: "int8" }
]
}
}
Schema Files (SQL DDL Format)
$ cat beamline-catalog/sensors.shape.sql
"f" DOUBLE,
"i8" INT8,
"tick" INT8
Format Selection Guidelines
By Use Case
| Use Case | Recommended Format | Rationale |
|---|---|---|
| Quick debugging | text | Timestamps and human readability |
| Data inspection | ion-pretty | Structure visibility with metadata |
| Large dataset generation | ion-binary | Maximum performance and compression |
| Data processing | ion | Good balance of efficiency and readability |
| Documentation | ion-pretty | Clear structure for examples |
| Long-term storage | ion-binary | Most compact and preserves all types |
By Dataset Size
| Dataset Size | Recommended Format | Alternative |
|---|---|---|
| < 100 records | text or ion-pretty | For inspection |
| 100 - 10K records | ion or ion-pretty | Based on use case |
| 10K - 100K records | ion or ion-binary | For efficiency |
| > 100K records | ion-binary | Maximum performance |
By Integration Target
| Target System | Recommended Format | Notes |
|---|---|---|
| Ion-aware tools | ion-binary | Native format |
| JSON processors | ion + conversion | Ion can be converted to JSON |
| SQL databases | Use gen db | Creates SQL schemas automatically |
| Log analysis | text | Timestamped format |
| Documentation | ion-pretty | Human-readable structure |
Format Conversion Patterns
Manual Conversion
# Generate in efficient format, convert for specific use
beamline gen data --script-path data.ion --sample-count 10000 --output-format ion-binary > efficient.bin
# Convert to pretty format for inspection
ion-cli pretty < efficient.bin > readable.ion
# Extract specific fields
ion-cli query "SELECT data.sensors[*].temperature FROM `efficient.bin`" > temperatures.ion
Multi-Format Generation
#!/bin/bash
# generate-multi-format.sh
SCRIPT="$1"
SEED="$2"
COUNT="$3"
# Generate in multiple formats
beamline gen data --seed "$SEED" --start-auto --script-path "$SCRIPT" --sample-count "$COUNT" --output-format ion-binary > data.bin
beamline gen data --seed "$SEED" --start-auto --script-path "$SCRIPT" --sample-count 100 --output-format ion-pretty > sample.ion
beamline gen data --seed "$SEED" --start-auto --script-path "$SCRIPT" --sample-count 10 --output-format text > debug.txt
echo "Generated:"
echo "- data.bin (binary, $COUNT records)"
echo "- sample.ion (pretty, 100 records)"
echo "- debug.txt (text, 10 records)"
Integration Examples
Web API Integration
# Generate data for API testing
beamline gen data \
--seed 42 \
--start-auto \
--script-path api_test_data.ion \
--sample-count 1000 \
--output-format ion-pretty | \
jq '.data' > api_test_payload.json
Database Loading
# Generate data and schema for database
beamline gen db beamline-lite \
--seed 100 \
--start-auto \
--script-path warehouse_data.ion \
--sample-count 50000
# Use generated SQL schema (the .shape.sql file contains column definitions
# only; wrap them in a CREATE TABLE statement before executing)
psql -d warehouse -f beamline-catalog/orders.shape.sql
# Convert data for loading (would need custom conversion)
# partiql-to-csv beamline-catalog/orders.ion > orders.csv
# COPY orders FROM 'orders.csv' WITH CSV HEADER;
Analytics Pipeline
#!/bin/bash
# analytics-pipeline.sh
# Generate raw data efficiently
beamline gen data \
--seed 202401 \
--start-iso "2024-01-01T00:00:00Z" \
--script-path analytics.ion \
--sample-count 1000000 \
--output-format ion-binary > raw_data.ion
# Generate sample for validation
beamline gen data \
--seed 202401 \
--start-iso "2024-01-01T00:00:00Z" \
--script-path analytics.ion \
--sample-count 100 \
--output-format ion-pretty > sample_validation.ion
echo "Analytics data generated:"
echo "- Raw data: 1000000 records in binary format"
echo "- Validation sample: 100 records in pretty format"
Best Practices
1. Match Format to Purpose
# Debugging - use text
beamline gen data --script-path new_script.ion --sample-count 5 --output-format text
# Production - use binary
beamline gen data --script-path prod_data.ion --sample-count 1000000 --output-format ion-binary
# Documentation - use pretty
beamline gen data --script-path examples.ion --sample-count 10 --output-format ion-pretty
2. Consider File Size for Large Datasets
# Check estimated size first
beamline gen data --script-path large.ion --sample-count 1000 --output-format ion-binary | wc -c
# If 1000 records = 50KB, then 1M records ≈ 50MB
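The same back-of-the-envelope extrapolation as code (the 50 KB sample size is the hypothetical measurement from the comment above; the linear-growth assumption holds for flat record streams):

```python
# Linear size extrapolation: measure a small sample, scale to the target.
sample_records = 1_000
sample_bytes = 50 * 1024        # hypothetical: measured via `... | wc -c`
target_records = 1_000_000

bytes_per_record = sample_bytes / sample_records
estimated_bytes = bytes_per_record * target_records
print(f"{estimated_bytes / (1024 * 1024):.0f} MB")  # ~49 MB, i.e. roughly 50 MB
```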
3. Use Appropriate Format for Storage
# Long-term storage
beamline gen data --script-path archive.ion --sample-count 100000 --output-format ion-binary
# Working files
beamline gen data --script-path working.ion --sample-count 1000 --output-format ion-pretty
# Quick inspection
beamline gen data --script-path inspect.ion --sample-count 20 --output-format text
4. Document Format Choices
# Document why you chose specific formats
echo "# Data Formats Used
- raw_data.bin: ion-binary for maximum efficiency (1M+ records)
- sample.ion: ion-pretty for human inspection (100 records)
- debug.txt: text format for timestamp analysis (50 records)
" > FORMAT_NOTES.md
Next Steps
- Scripts - Advanced Ion scripting techniques
- Datasets - Working with multiple datasets and relationships
- CLI Data Commands - Complete CLI format options reference
Nullability and Optionality
Beamline provides fine-grained control over NULL and MISSING values in generated data. Understanding the distinction between these concepts and how to configure them is crucial for creating realistic datasets that match real-world data patterns.
NULL vs MISSING Values
PartiQL distinguishes between two types of absent values:
NULL Values
- Meaning: The field exists but has no value
- JSON equivalent: "field": null
- SQL equivalent: NULL
- Ion format: field: null
MISSING Values
- Meaning: The field doesn’t exist at all
- JSON equivalent: Field is not present in the object
- SQL equivalent: Column not included in row
- Ion format: Field is omitted entirely
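The distinction matters for downstream consumers. As an illustration outside Beamline itself, here is how a Python consumer of parsed records (NULL mapped to None, MISSING as an absent key) can tell the three cases apart:

```python
# A record parsed into a dict: NULL becomes None, MISSING is an absent key.
record = {"present": 42, "null_field": None}   # no 'missing_field' key at all

def field_state(rec: dict, name: str) -> str:
    """Classify a field as 'present', 'null', or 'missing'."""
    if name not in rec:
        return "missing"                        # field doesn't exist at all
    return "null" if rec[name] is None else "present"

print(field_state(record, "present"))        # present
print(field_state(record, "null_field"))     # null
print(field_state(record, "missing_field"))  # missing
```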
Configuration Syntax
Every generator supports both nullability and optionality configuration:
Basic Syntax
generator_name::{ nullable: <config>, optional: <config> }
Configuration Values
Boolean Configuration:
- nullable: true - Field can be NULL (but 0% chance by default)
- nullable: false - Field cannot be NULL (default)
- optional: true - Field can be MISSING (but 0% chance by default)
- optional: false - Field cannot be MISSING (default)
Probability Configuration:
- nullable: 0.0 - 0% chance of NULL (same as false)
- nullable: 0.25 - 25% chance of NULL
- nullable: 1.0 - 100% chance of NULL (always NULL)
- optional: 0.1 - 10% chance of MISSING
- optional: 0.5 - 50% chance of MISSING
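To build intuition for these probabilities, here is a plain-Python simulation (not Beamline code) that flips an optional coin first and a nullable coin second; whether Beamline combines the two checks in exactly this order is an assumption made for illustration:

```python
import random

MISSING = object()  # sentinel for a field that is absent entirely

def draw_field(rng, p_null=0.2, p_optional=0.1):
    value = rng.randint(1, 100)       # value is drawn either way
    if rng.random() < p_optional:
        return MISSING                # field omitted from the record
    if rng.random() < p_null:
        return None                   # field present, with a NULL value
    return value

rng = random.Random(42)
draws = [draw_field(rng) for _ in range(100_000)]
print(sum(d is MISSING for d in draws) / len(draws))  # ~0.10
print(sum(d is None for d in draws) / len(draws))     # ~0.18 (0.9 * 0.2)
```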
Examples from Real Test Scripts
Basic Nullability
From the sensors.ion test script:
rand_processes::{
sensors: rand_process::{
$weight: UniformDecimal::{
nullable: 0.75, // 75% chance of NULL
low: 1.995,
high: 4.9999,
optional: true // Can be MISSING (0% chance by default)
},
$data: {
weight: $weight,
price: UniformDecimal::{
low: 2.99,
high: 99999.99,
optional: true // Field might not appear at all
}
}
}
}
Advanced Configuration
From the numbers.ion test script showing all combinations:
rand_processes::{
test_data: rand_process::{
$data: {
// Default behavior - not nullable, not optional
basic_int: UniformI32::{ low: 1, high: 100 },
// Only nullable
nullable_only: UniformI32::{
nullable: 0.2, // 20% NULL
low: 1,
high: 100
},
// Only optional
optional_only: UniformI32::{
optional: 0.1, // 10% MISSING
low: 1,
high: 100
},
// Both nullable and optional
both_configured: UniformI32::{
nullable: 0.2, // 20% NULL
optional: 0.1, // 10% MISSING
low: 1, // 70% present values
high: 100
}
}
}
}
Output Examples
Text Format Output
$ beamline gen data \
--seed 1000 \
--start-auto \
--script-path nullability_test.ion \
--sample-count 10
# Sample outputs showing NULL and MISSING behavior
[2024-01-01 00:00:01.123] : "test_data" { 'basic_int': 42, 'nullable_only': null, 'both_configured': 15 }
[2024-01-01 00:00:02.456] : "test_data" { 'basic_int': 78, 'nullable_only': 23, 'both_configured': null }
[2024-01-01 00:00:03.789] : "test_data" { 'basic_int': 91, 'nullable_only': 67 } // optional_only is MISSING
[2024-01-01 00:00:04.012] : "test_data" { 'basic_int': 33, 'nullable_only': null, 'optional_only': 88, 'both_configured': 54 }
Notice how:
- basic_int always appears (not nullable, not optional)
- nullable_only can be null but is always present
- optional_only might not appear at all (MISSING)
- both_configured can be null or MISSING
Ion Pretty Format Output
{
seed: 1000,
start: "2024-01-01T00:00:00Z",
data: {
test_data: [
{
basic_int: 42,
nullable_only: null, // NULL value present
both_configured: 15
// optional_only is MISSING (field not present)
},
{
basic_int: 78,
nullable_only: 23,
optional_only: 67,
both_configured: null // NULL value present
},
{
basic_int: 91,
nullable_only: 67,
optional_only: 45
// both_configured is MISSING (field not present)
}
]
}
}
Global Defaults via CLI
You can set global nullability and optionality defaults via CLI options:
CLI Default Configuration
# Make all types nullable with 10% NULL values
beamline gen data \
--seed 42 \
--start-auto \
--script-path data.ion \
--pct-null 0.1
# Make all types optional with 5% MISSING values
beamline gen data \
--seed 42 \
--start-auto \
--script-path data.ion \
--pct-optional 0.05
# Combine both
beamline gen data \
--seed 42 \
--start-auto \
--script-path data.ion \
--pct-null 0.1 \
--pct-optional 0.05
# Disable both globally
beamline gen data \
--seed 42 \
--start-auto \
--script-path data.ion \
--default-nullable false \
--default-optional false
Script vs CLI Override Behavior
- Script configuration takes precedence over CLI defaults
- CLI defaults apply to generators without explicit nullable/optional configuration
- An explicit false in the script overrides CLI defaults
Example:
# CLI sets 20% NULL globally
beamline gen data \
--pct-null 0.2 \
--script-path mixed_config.ion
rand_processes::{
test_data: rand_process::{
$data: {
// Uses CLI default: 20% NULL
field1: UniformI32::{ low: 1, high: 100 },
// Overrides CLI default: never NULL
field2: UniformI32::{ nullable: false, low: 1, high: 100 },
// Overrides CLI default: 50% NULL
field3: UniformI32::{ nullable: 0.5, low: 1, high: 100 }
}
}
}
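The precedence rules above can be summarized as a tiny resolver. This is illustrative logic matching the description, not Beamline source code:

```python
def effective_p_null(script_setting, cli_default=0.0):
    """Resolve the NULL probability for one field.

    script_setting: None (not configured in the script), False (explicitly
    disabled), or an explicit probability from the script.
    """
    if script_setting is None:
        return cli_default        # CLI default applies
    if script_setting is False:
        return 0.0                # explicit false overrides the CLI default
    return float(script_setting)  # explicit script probability wins

# Mirrors field1/field2/field3 above with --pct-null 0.2:
print(effective_p_null(None, cli_default=0.2))   # 0.2
print(effective_p_null(False, cli_default=0.2))  # 0.0
print(effective_p_null(0.5, cli_default=0.2))    # 0.5
```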
Realistic Data Patterns
Database-like Nullability
rand_processes::{
users: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::1 },
$data: {
// Required fields - never NULL or MISSING
user_id: UUID::{ nullable: false, optional: false },
created_at: Instant::{ nullable: false, optional: false },
// Often present, sometimes NULL
email: Regex::{
pattern: "[a-z]+@[a-z]+\\.[a-z]{2,3}",
nullable: 0.05 // 5% NULL emails
},
// Optional profile fields
full_name: LoremIpsumTitle::{ optional: 0.3 }, // 30% don't provide name
phone: Regex::{
pattern: "[0-9]{3}-[0-9]{3}-[0-9]{4}",
optional: 0.4, // 40% don't provide phone
nullable: 0.1 // 10% provide NULL phone
},
// Rarely provided optional fields
bio: LoremIpsum::{
min_words: 10,
max_words: 50,
optional: 0.8 // 80% don't provide bio
}
}
}
}
Sensor Data with Missing Readings
rand_processes::{
sensor_readings: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::30 },
$data: {
sensor_id: UUID::{ nullable: false },
timestamp: Instant::{ nullable: false },
// Primary measurement - rarely fails
temperature: UniformF64::{
low: -10.0,
high: 50.0,
nullable: 0.02 // 2% sensor failures (NULL)
},
// Secondary measurement - more failures
humidity: UniformF64::{
low: 0.0,
high: 100.0,
nullable: 0.05 // 5% sensor failures
},
// Optional calibration data
calibration_offset: UniformF64::{
low: -1.0,
high: 1.0,
optional: 0.7 // 70% don't have calibration data
},
// Rarely available GPS coordinates
latitude: UniformF64::{
low: -90.0,
high: 90.0,
optional: 0.9 // 90% don't have GPS
},
longitude: UniformF64::{
low: -180.0,
high: 180.0,
optional: 0.9 // 90% don't have GPS
}
}
}
}
Statistical Distribution Implications
Stable Value Generation
An important feature of Beamline’s nullability/optionality system is stable value generation:
The value that would have been generated is still generated even when the field ends up NULL or MISSING; it is simply discarded. This ensures that value generation is stable across runs with different densities of NULL and/or MISSING data.
Example:
test_field: UniformI32::{ low: 1, high: 100, nullable: 0.5 }
With seed 42:
- 50% NULL case: Generates 17, discards it, outputs null
- 0% NULL case: Generates 17, outputs 17
- Same underlying sequence: Both cases generate the same random number sequence
This stability is crucial for:
- A/B testing: Compare models with different missing data rates
- Robustness testing: Test algorithms with varying data completeness
- Reproducible experiments: Same seed produces same value patterns
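The stable-generation idea can be sketched in plain Python (this models the behavior described above, not Beamline's internals): values are always drawn from the value stream, and the NULL decision comes from a separate stream, so raising the NULL rate never shifts the values that do appear.

```python
import random

def generate(seed, p_null, n=5):
    value_rng = random.Random(seed)      # value stream depends only on the seed
    null_rng = random.Random(seed + 1)   # separate stream for NULL decisions
    out = []
    for _ in range(n):
        value = value_rng.randint(1, 100)  # always generated...
        out.append(None if null_rng.random() < p_null else value)  # ...maybe discarded
    return out

no_nulls = generate(42, p_null=0.0)
half_nulls = generate(42, p_null=0.5)
# Every non-NULL value in the 50% run matches the 0% run exactly.
print(all(b is None or a == b for a, b in zip(no_nulls, half_nulls)))  # True
```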
AI Model Training Applications
rand_processes::{
training_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: milliseconds::1 },
$data: {
// Always present features
sample_id: UUID::{ nullable: false, optional: false },
// Core features with realistic missingness
age: NormalF64::{
mean: 35.0,
std_dev: 12.0,
nullable: 0.02 // 2% missing age data
},
income: LogNormalF64::{
location: 10.5,
scale: 0.5,
nullable: 0.15, // 15% NULL income (sensitive data)
optional: 0.05 // 5% refuse to provide income
},
// Optional survey responses
satisfaction_score: UniformF64::{
low: 1.0,
high: 10.0,
optional: 0.4 // 40% don't respond to survey
},
// Rarely collected features
location: Regex::{
pattern: "[A-Z]{2}",
optional: 0.8 // 80% don't provide location
},
// Target variable - usually complete
target: Bool::{
p: 0.3,
nullable: 0.01 // 1% labeling errors
}
}
}
}
Complex Nullability Patterns
Conditional Nullability
Use variables to create related nullability patterns:
rand_processes::{
$has_premium: Bool::{ p: 0.3 }, // 30% premium users
users: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::2 },
$data: {
user_id: UUID,
plan_type: Uniform::{ choices: ["free", "premium"] },
// Premium features - NULL for non-premium users
premium_start_date: Instant::{
nullable: 0.7 // 70% NULL (approx. non-premium rate)
},
premium_features: UniformArray::{
min_size: 1,
max_size: 5,
element_type: LoremIpsumTitle,
nullable: 0.7, // 70% NULL for non-premium
optional: 0.1 // 10% don't specify even if premium
}
}
}
}
Correlated Missing Data
rand_processes::{
user_profiles: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
$data: {
user_id: UUID::{ nullable: false },
// Contact information - often missing together
email: Regex::{
pattern: "[a-z]+@[a-z]+\\.[a-z]{2,3}",
optional: 0.2 // 20% don't provide email
},
phone: Regex::{
pattern: "[0-9]{3}-[0-9]{3}-[0-9]{4}",
optional: 0.25, // 25% don't provide phone
nullable: 0.05 // 5% provide NULL phone
},
// Address components - missing together or not at all
street_address: LoremIpsumTitle::{ optional: 0.3 },
city: LoremIpsumTitle::{ optional: 0.3 },
postal_code: Regex::{
pattern: "[0-9]{5}",
optional: 0.3
},
// Optional demographic info
age: UniformU8::{
low: 18,
high: 80,
optional: 0.4, // 40% don't provide age
nullable: 0.02 // 2% provide invalid age (NULL)
}
}
}
}
Schema Impact
Nullability and optionality affect inferred schemas:
Schema Inference Output
$ beamline infer-shape \
--seed 1 \
--start-auto \
--script-path nullability_test.ion \
--output-format basic-ddl
-- Dataset: test_data
"basic_field" INT NOT NULL, -- nullable: false
"nullable_field" INT, -- nullable: 0.2
"optional_field" OPTIONAL INT NOT NULL, -- optional: 0.1
"both_field" OPTIONAL INT -- nullable: 0.2, optional: 0.1
CLI Default Impact on Schema
$ beamline infer-shape \
--seed 1 \
--start-auto \
--script-path simple_data.ion \
--default-nullable true \
--default-optional true \
--output-format basic-ddl
-- All fields become nullable and optional by default
"field1" OPTIONAL INT,
"field2" OPTIONAL VARCHAR,
"field3" OPTIONAL BOOL
Performance Considerations
Value Generation Stability
NULL and MISSING generation maintains the same computational cost:
// Same performance regardless of nullability rate
fast_field: UniformI32::{ low: 1, high: 1000, nullable: 0.0 } // 0% NULL
slow_field: UniformI32::{ low: 1, high: 1000, nullable: 0.9 } // 90% NULL
Why: The underlying value is always generated, then conditionally discarded.
Memory Usage
- NULL values: Stored in output (takes memory)
- MISSING values: Not stored (saves memory)
- High optionality: Can reduce output size significantly
// Large optional fields save memory when MISSING
large_description: LoremIpsum::{
min_words: 100,
max_words: 1000,
optional: 0.8 // 80% MISSING saves significant memory
}
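The size difference is easy to see with a tiny sketch (JSON is used here for brevity; Beamline emits Ion, where the same NULL-present / MISSING-absent distinction applies):

```python
import json

# NULL is serialized into the output; MISSING means the field is simply
# absent from the record.
with_null = {"user_id": "u1", "bio": None}   # nullable -> "bio": null in output
with_missing = {"user_id": "u1"}             # optional -> field omitted entirely

# The record with the MISSING field serializes to fewer bytes.
assert len(json.dumps(with_missing)) < len(json.dumps(with_null))
```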
Testing Data Quality
Missing Data Robustness Testing
Create datasets with increasing levels of missingness:
// Test script for robustness testing
rand_processes::{
// Dataset with low missingness
clean_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
$data: {
feature1: NormalF64::{ mean: 0.0, std_dev: 1.0, nullable: 0.01 },
feature2: NormalF64::{ mean: 0.0, std_dev: 1.0, nullable: 0.01 },
target: Bool::{ p: 0.5, nullable: false }
}
},
// Dataset with moderate missingness
noisy_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
$data: {
feature1: NormalF64::{ mean: 0.0, std_dev: 1.0, nullable: 0.1 },
feature2: NormalF64::{ mean: 0.0, std_dev: 1.0, nullable: 0.15 },
target: Bool::{ p: 0.5, nullable: 0.02 }
}
},
// Dataset with high missingness
sparse_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
$data: {
feature1: NormalF64::{ mean: 0.0, std_dev: 1.0, nullable: 0.3 },
feature2: NormalF64::{ mean: 0.0, std_dev: 1.0, nullable: 0.4 },
target: Bool::{ p: 0.5, nullable: 0.05 }
}
}
}
Real-World Data Simulation
rand_processes::{
customer_survey: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: hours::1 },
$data: {
// Always collected
survey_id: UUID::{ nullable: false, optional: false },
timestamp: Instant::{ nullable: false, optional: false },
// Demographics - some prefer not to answer
age: UniformU8::{ low: 18, high: 80, optional: 0.15 },
gender: Uniform::{
choices: ["M", "F", "Other"],
optional: 0.2 // 20% prefer not to say
},
// Income - sensitive, often skipped or invalid
income_range: Uniform::{
choices: ["<30K", "30-60K", "60-100K", "100K+"],
optional: 0.3, // 30% skip question
nullable: 0.1 // 10% provide invalid answer
},
// Rating questions - sometimes skipped
overall_rating: UniformU8::{
low: 1,
high: 10,
optional: 0.1 // 10% skip rating
},
// Open-ended responses - frequently skipped
comments: LoremIpsum::{
min_words: 5,
max_words: 100,
optional: 0.6 // 60% don't provide comments
}
}
}
}
Best Practices
1. Realistic Nullability Rates
// Good - realistic rates based on domain
email: Regex::{ pattern: "...", nullable: 0.05 } // 5% invalid emails
age: UniformU8::{ low: 18, high: 80, optional: 0.1 } // 10% don't provide age
// Avoid - extreme rates without justification
field: UniformI32::{ low: 1, high: 100, nullable: 0.99 } // 99% NULL - rarely useful
2. Use Appropriate Absence Types
// NULL for invalid/unknown values
sensor_reading: UniformF64::{ low: 0.0, high: 100.0, nullable: 0.02 } // Sensor malfunction
// MISSING for optional fields
optional_comment: LoremIpsum::{ min_words: 5, max_words: 50, optional: 0.4 } // User choice
3. Document Nullability Decisions
rand_processes::{
// Document nullability reasoning
user_data: rand_process::{
$data: {
// Required business key - never absent
customer_id: UUID::{ nullable: false, optional: false },
// Email required for notifications - rare NULLs for bad data
email: Regex::{ pattern: "...", nullable: 0.01 },
// Phone optional - users may not provide
phone: Regex::{ pattern: "...", optional: 0.3 },
// Marketing consent - some users skip this question
marketing_consent: Bool::{ optional: 0.15, nullable: 0.05 }
}
}
}
4. Test Multiple Missingness Levels
# Generate datasets with different missingness for testing
beamline gen data --seed 1 --start-auto --script-path data.ion --pct-null 0.0 --sample-count 1000 > clean.ion
beamline gen data --seed 1 --start-auto --script-path data.ion --pct-null 0.1 --sample-count 1000 > noisy.ion
beamline gen data --seed 1 --start-auto --script-path data.ion --pct-null 0.3 --sample-count 1000 > sparse.ion
Common Patterns
Required vs Optional Fields
rand_processes::{
e_commerce: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::2 },
$data: {
// Required for business logic
order_id: UUID::{ nullable: false, optional: false },
customer_id: UUID::{ nullable: false, optional: false },
created_at: Instant::{ nullable: false, optional: false },
// Required but can have data quality issues
total_amount: UniformDecimal::{
low: 5.00,
high: 500.00,
nullable: 0.005 // 0.5% data corruption
},
// Optional customer-provided data
shipping_instructions: LoremIpsum::{
min_words: 3,
max_words: 20,
optional: 0.7 // 70% don't provide instructions
},
// Optional promotional data
promo_code: Regex::{
pattern: "[A-Z]{4}[0-9]{2}",
optional: 0.8 // 80% don't use promo codes
}
}
}
}
Legacy Data Migration Patterns
rand_processes::{
migrated_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::1 },
$data: {
// Legacy ID - sometimes missing from old records
legacy_id: UniformI32::{
low: 1,
high: 999999,
optional: 0.1 // 10% of old records missing legacy ID
},
// New ID - always present for new system
new_id: UUID::{ nullable: false, optional: false },
// Data quality issues from migration
migrated_date: Instant::{
nullable: 0.05 // 5% migration errors (NULL dates)
},
// Fields added after migration - missing from old records
new_feature: LoremIpsumTitle::{
optional: 0.6 // 60% old records don't have this field
}
}
}
}
Troubleshooting
Issue: Unexpected NULL/MISSING Behavior
Check configuration precedence:
- Script-level configuration overrides CLI defaults
- Variable-level configuration applies to all uses
- CLI defaults apply to unconfigured generators
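The precedence above can be sketched as a small resolver. This is an illustrative helper, not part of the Beamline API; the 0.0 fallback is an assumption, and the variable-level interaction is simplified to the two levels shown.

```python
def effective_nullable(script_value=None, cli_default=None):
    # A script-level setting wins over the CLI default; generators with
    # no configuration at all fall back to 0.0 (assumed: non-nullable).
    if script_value is not None:
        return script_value
    if cli_default is not None:
        return cli_default
    return 0.0

assert effective_nullable(script_value=0.2, cli_default=0.5) == 0.2  # script wins
assert effective_nullable(cli_default=0.5) == 0.5                    # CLI default applies
assert effective_nullable() == 0.0                                   # assumed fallback
```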
Issue: Too Many/Few NULL Values
Verify probability values:
- Values must be between 0.0 and 1.0
- 0.1 = 10%, 0.25 = 25%, 0.5 = 50%, etc.
Issue: Schema Doesn’t Match Expected Nullability
Check CLI defaults:
# Check if CLI is setting global defaults
beamline infer-shape --seed 1 --start-auto --script-path data.ion --default-nullable false
Integration with Query Generation
NULL and MISSING values affect query generation:
# Generate data with nullability
beamline gen data \
--seed 1 \
--start-auto \
--script-path nullable_data.ion \
--sample-count 1000 \
--output-format ion-pretty > test_data.ion
# Generate queries that handle NULL/MISSING
beamline query basic \
--seed 2 \
--start-auto \
--script-path nullable_data.ion \
--sample-count 10 \
rand-select-all-fw \
--pred-absent # Include IS NULL, IS NOT NULL, IS MISSING predicates
This creates queries like:
SELECT * FROM test_data WHERE (test_data.email IS NOT NULL)
SELECT * FROM test_data WHERE (test_data.optional_field IS MISSING)
SELECT * FROM test_data WHERE (test_data.phone IS NULL OR test_data.phone LIKE '%555%')
Next Steps
- Output Formats - See how NULL/MISSING values appear in different formats
- Scripts - Advanced techniques for managing nullability in complex scripts
- Query Generation - Generate queries that handle absent values
Query Generator Overview
Beamline’s query generator creates reproducible PartiQL queries that match the shapes and types of data defined in Ion scripts. This allows you to generate realistic test queries for PartiQL implementations, ensuring your queries are both syntactically valid and semantically meaningful for your data structures.
What is Query Generation?
The query generator analyzes the data shapes from your Ion scripts and creates PartiQL queries that:
- Match your data structure: Queries reference actual fields and types from your data
- Are syntactically correct: All generated queries parse and execute properly
- Have realistic complexity: Configurable query patterns from simple to complex
- Are reproducible: Same seed produces identical query sequences
- Test diverse patterns: Cover different PartiQL constructs and edge cases
How Query Generation Works
Process Flow
- Script Analysis: Parse Ion script to understand data shapes
- Shape Inference: Determine field types, structures, and relationships
- Query Strategy: Apply configured query generation strategy
- Query Construction: Build queries matching data structure
- Output Generation: Produce formatted PartiQL queries
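As a rough illustration of this flow, the sketch below (plain Python, not Beamline code) takes an already-inferred shape, picks scalar fields from it, and assembles SELECT * ... WHERE queries. The shape and field names are hypothetical stand-ins for what shape inference would produce.

```python
import random

def rand_select_all_fw(shape, seed, count):
    # Pick a scalar field from the inferred shape, then build a
    # SELECT * ... WHERE query around it (one predicate per query).
    rng = random.Random(seed)
    scalars = [f for f, t in sorted(shape.items()) if t in ("int", "decimal", "bool")]
    queries = []
    for _ in range(count):
        field = rng.choice(scalars)
        if shape[field] == "bool":
            pred = f"test_data.{field} = {str(rng.random() < 0.5).lower()}"
        else:
            pred = f"test_data.{field} < {rng.randint(-10, 100)}"
        queries.append(f"SELECT * FROM test_data AS test_data WHERE ({pred})")
    return queries

shape = {"marketplace_id": "int", "price": "decimal",
         "completed": "bool", "country_code": "string"}
for q in rand_select_all_fw(shape, seed=1234, count=3):
    print(q)
```

Because everything is driven by a seeded generator, the same seed always yields the same query list, which is the property Beamline relies on for reproducibility.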
Shape-Aware Generation
The query generator understands your data structure:
rand_processes::{
test_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: milliseconds::100 },
$data: {
transaction_id: UUID::{ nullable: false },
marketplace_id: UniformU8::{ nullable: false },
country_code: Regex::{ pattern: "[A-Z]{2}" },
created_at: Instant,
completed: Bool,
price: UniformDecimal::{ low: 2.99, high: 99999.99, optional: true }
}
}
}
Generated queries will reference actual fields like transaction_id, marketplace_id, country_code, etc.
Query Generation Strategies
Beamline supports four main query generation strategies:
1. rand-select-all-fw - SELECT * FROM WHERE
Generates SELECT * queries with WHERE clauses:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-lt
Example Output:
SELECT * FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 18.418581624952935)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 15.495327785402296)
2. rand-sfw - SELECT fields FROM WHERE
Generates queries with specific field projections and WHERE clauses:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-sfw \
--project-rand-min 2 \
--project-rand-max 5 \
--project-path-depth-min 1 \
--project-path-depth-max 1 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 5 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-all
Example Output:
SELECT test_data.completed, test_data.completed FROM test_data AS test_data
WHERE (NOT (test_data.completed) OR NOT ((test_data.created_at IS MISSING)))
SELECT test_data.completed, test_data.marketplace_id, test_data.created_at
FROM test_data AS test_data WHERE (NOT ((test_data.transaction_id IS NULL)) OR
(((test_data.transaction_id IN ['Iam in.', 'Se.']) OR
NOT ((test_data.description IS NULL))) OR
(test_data.marketplace_id >= 28)))
3. rand-select-all-efw - SELECT * EXCLUDE FROM WHERE
Generates SELECT * EXCLUDE queries:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-select-all-efw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-lt \
--exclude-rand-min 1 \
--exclude-rand-max 3 \
--exclude-path-depth-min 1 \
--exclude-path-depth-max 1 \
--exclude-pathstep-internal-all \
--exclude-pathstep-final-all \
--exclude-type-final-all
Example Output:
SELECT * EXCLUDE test_data.marketplace_id, test_data.*, test_data.completed
FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)
SELECT * EXCLUDE test_data.completed FROM test_data AS test_data
WHERE (test_data.price < 18.418581624952935)
4. rand-sefw - SELECT EXCLUDE FROM WHERE
Generates queries with projections, exclusions, and WHERE clauses:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-sefw \
--project-rand-min 2 \
--project-rand-max 5 \
--project-path-depth-min 1 \
--project-path-depth-max 1 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 5 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-all \
--exclude-rand-min 1 \
--exclude-rand-max 3 \
--exclude-path-depth-min 1 \
--exclude-path-depth-max 1 \
--exclude-pathstep-internal-all \
--exclude-pathstep-final-all \
--exclude-type-final-all
Path Generation
Query generation creates paths that navigate your data structure:
Path Components
- Projection paths: field_name, object.field, nested.object.field
- Index paths: array[0], array[5]
- Wildcard paths: array[*], object.*
- Deep paths: nested.object.array[*].field
Path Depth Control
# Simple paths (depth 1)
--tbl-flt-path-depth-max 1
# Results: test_data.price, test_data.completed
# Complex paths (depth 3+)
--tbl-flt-path-depth-max 5
# Results: test_data.nested.object.array[*].field
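The depth bound can be pictured as a random walk over the nested schema that stops at a scalar leaf or at the configured depth. This is an illustrative Python sketch; real Beamline paths also include index ([i]), wildcard ([*]), and unpivot (.*) steps.

```python
import random

def random_path(root, schema, max_depth, rng):
    # Walk nested fields at random, stopping at a scalar leaf or once
    # max_depth path steps have been taken.
    path, node, depth = [root], schema, 0
    while isinstance(node, dict) and depth < max_depth:
        field = rng.choice(sorted(node))
        path.append(field)
        node = node[field]
        depth += 1
    return ".".join(path)

# Hypothetical nested shape for illustration.
schema = {"price": "decimal", "nested": {"object": {"test_int": "int"}}}
rng = random.Random(7)
print(random_path("test_data", schema, max_depth=1, rng=rng))  # depth-1 path
print(random_path("test_data", schema, max_depth=5, rng=rng))  # deeper path allowed
```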
Real Examples from Complex Data
From the README’s transactions.ion example with nested structures:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path transactions.ion \
--sample-count 3 \
rand-sefw \
--project-rand-min 2 \
--project-rand-max 5 \
--project-path-depth-min 1 \
--project-path-depth-max 10 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--pred-all
Generated Deep Paths:
SELECT test_data.*.nested_struct.nested_struct.nested_struct.nested_struct.nested_struct.*,
test_data.test_nest_struct.*.*.nested_struct.nested_struct
EXCLUDE test_data.*.*.*.*, test_data.price.*
FROM test_data AS test_data
WHERE ((test_data.test_nest_struct.*.*.*.nested_struct.*.test_int <> 19) OR
(test_data.test_nest_struct.*.*.nested_struct.*.*.test_int > 35))
Predicate Generation
Available Predicate Types
Based on the README’s comprehensive predicate options:
- Comparison: <, <=, >, >=, =, <>
- Range: BETWEEN value1 AND value2
- String: LIKE pattern, NOT LIKE pattern
- Set membership: IN (value1, value2), NOT IN (value1, value2)
- Null testing: IS NULL, IS NOT NULL
- Missing testing: IS MISSING, IS NOT MISSING
- Logical: AND, OR, NOT
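As a sketch of how such predicates might be rendered into PartiQL text (illustrative Python only; quoting and value formatting are simplified compared to the real generator):

```python
def render_predicate(path, kind, value=None):
    # Render one predicate kind as a parenthesised PartiQL fragment.
    if kind == "lt":
        return f"({path} < {value})"
    if kind == "between":
        lo, hi = value
        return f"({path} BETWEEN {lo} AND {hi})"
    if kind == "like":
        return f"({path} LIKE '{value}')"
    if kind == "in":
        rendered = ", ".join(repr(v) for v in value)
        return f"({path} IN [{rendered}])"
    if kind == "is_null":
        return f"({path} IS NULL)"
    if kind == "is_missing":
        return f"({path} IS MISSING)"
    raise ValueError(f"unknown predicate kind: {kind}")

assert render_predicate("t.price", "lt", 18.4) == "(t.price < 18.4)"
assert render_predicate("t.email", "is_null") == "(t.email IS NULL)"
```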
Predicate Configuration
# Only less-than predicates
--pred-lt
# All comparison predicates
--pred-comparison
# All predicates including logical operators
--pred-all
# Only null/missing testing
--pred-absent
Real Predicate Examples
From the README examples:
-- Simple predicates
WHERE (test_data.marketplace_id < -5)
WHERE (test_data.price BETWEEN 10.0 AND 100.0)
-- Complex logical combinations
WHERE (((test_data.country_code <> 'Qua maxime ceterorum.') AND
(NOT (test_data.completed IN [false, true]) OR
(test_data.description = 'Non faciant.'))) AND
(NOT ((test_data.price IS MISSING)) AND (test_data.price IS MISSING)))
-- Null and missing testing
WHERE (test_data.email IS NOT NULL AND test_data.optional_field IS MISSING)
Configuration Parameters
Table Filter Parameters
Control WHERE clause generation:
| Parameter | Description | Values |
|---|---|---|
| --tbl-flt-rand-min | Minimum predicates | 1-255 |
| --tbl-flt-rand-max | Maximum predicates | 1-255 |
| --tbl-flt-path-depth-max | Maximum path depth | 1-255 |
Path Step Configuration
Control how paths navigate through data:
| Parameter | Description |
|---|---|
| --tbl-flt-pathstep-internal-all | Enable all internal path steps |
| --tbl-flt-pathstep-internal-project | Enable projection steps (.field) |
| --tbl-flt-pathstep-internal-index | Enable index steps ([1]) |
| --tbl-flt-pathstep-internal-foreach | Enable wildcard steps ([*]) |
| --tbl-flt-pathstep-internal-unpivot | Enable unpivot steps (.*) |
Type Constraints
Control what types can appear in query paths:
| Parameter | Description |
|---|---|
| --tbl-flt-type-final-all | Allow all final types |
| --tbl-flt-type-final-scalar | Only scalar types (9, 'text', true) |
| --tbl-flt-type-final-sequence | Only sequence types ([1,2,3]) |
| --tbl-flt-type-final-struct | Only struct types ({'a': 1}) |
Real Query Examples
Simple Transaction Queries
Based on simple_transactions.ion test script:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--pred-lt
Results:
SELECT * FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 18.418581624952935)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 15.495327785402296)
Complex Nested Structure Queries
With more complex path generation:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path transactions.ion \
--sample-count 2 \
rand-sefw \
--project-rand-min 2 \
--project-rand-max 3 \
--project-path-depth-max 6 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 3 \
--tbl-flt-path-depth-max 6 \
--pred-all \
--exclude-rand-min 1 \
--exclude-rand-max 2 \
--exclude-path-depth-max 4
Results with Deep Paths:
SELECT test_data.*.nested_struct.nested_struct.nested_struct.nested_struct.*,
test_data.test_nest_struct.*.*.nested_struct.nested_struct.*.*
EXCLUDE test_data.test_nest_struct.*.*, test_data.price.*
FROM test_data AS test_data
WHERE ((test_data.test_nest_struct.*.*.*.nested_struct.*.test_int <> 19) OR
(test_data.*.*.nested_struct.nested_struct.*.nested_struct.test_int >= -9))
Reproducible Query Generation
Consistent Query Generation
Use specific seeds for reproducible query sets:
# Generate same queries each time
beamline query basic \
--seed 12345 \
--start-auto \
--script-path data.ion \
--sample-count 10 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--pred-all
Query Complexity Control
Simple Queries
# Generate simple queries for basic testing
beamline query basic \
--seed 100 \
--start-auto \
--script-path data.ion \
--sample-count 5 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--tbl-flt-path-depth-max 1 \
--pred-eq
Complex Queries
# Generate complex queries for comprehensive testing
beamline query basic \
--seed 200 \
--start-auto \
--script-path nested_data.ion \
--sample-count 5 \
rand-sefw \
--project-rand-min 3 \
--project-rand-max 8 \
--project-path-depth-max 5 \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 6 \
--tbl-flt-path-depth-max 5 \
--exclude-rand-min 1 \
--exclude-rand-max 3 \
--pred-all
Integration Patterns
Testing Workflow Integration
#!/bin/bash
# Generate test data and matching queries
SCRIPT="test_data.ion"
SEED=12345
# Generate test dataset
beamline gen data \
--seed $SEED \
--start-auto \
--script-path $SCRIPT \
--sample-count 1000 \
--output-format ion-pretty > test_data.ion
# Generate queries for the dataset
beamline query basic \
--seed $((SEED + 1)) \
--start-auto \
--script-path $SCRIPT \
--sample-count 20 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 4 \
--pred-all > test_queries.sql
echo "Generated test data and matching queries"
Query Validation Testing
# Generate queries to test PartiQL implementation
beamline query basic \
--seed 300 \
--start-auto \
--script-path complex_schema.ion \
--sample-count 50 \
rand-sfw \
--project-rand-min 1 \
--project-rand-max 5 \
--project-path-depth-max 3 \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--pred-all > validation_queries.sql
# Test each query against your PartiQL implementation
while IFS= read -r query; do
echo "Testing query: $query"
# Run query against your PartiQL engine
# your-partiql-engine --query "$query" --data test_data.ion
done < validation_queries.sql
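Before wiring generated queries into a full engine, a cheap structural sanity check can catch obviously malformed output. This illustrative Python helper is not a substitute for a real PartiQL parser; it only verifies the query starts with SELECT and that parentheses and brackets balance.

```python
def sanity_check(query: str) -> bool:
    # Cheap structural checks on one generated query line.
    if not query.lstrip().upper().startswith("SELECT"):
        return False
    parens = brackets = 0
    for ch in query:
        if ch == "(":
            parens += 1
        elif ch == ")":
            parens -= 1
        elif ch == "[":
            brackets += 1
        elif ch == "]":
            brackets -= 1
        if parens < 0 or brackets < 0:  # closed before opened
            return False
    return parens == 0 and brackets == 0

assert sanity_check("SELECT * FROM t AS t WHERE (t.x < 5)")
assert not sanity_check("SELECT * FROM t WHERE (t.x < 5")
```

Note that this checks one query per string; multi-line queries would need to be joined before checking, and string literals containing brackets could produce false negatives.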
Advanced Query Patterns
Null and Missing Value Testing
Generate queries that test NULL and MISSING value handling:
beamline query basic \
--seed 400 \
--start-auto \
--script-path nullable_data.ion \
--sample-count 10 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 2 \
--pred-absent # Focus on IS NULL, IS NOT NULL, IS MISSING, IS NOT MISSING
Example Queries:
SELECT * FROM test_data WHERE (test_data.optional_field IS MISSING)
SELECT * FROM test_data WHERE (test_data.nullable_field IS NOT NULL)
SELECT * FROM test_data WHERE (test_data.price IS NULL OR test_data.completed = true)
Performance Testing Queries
Generate queries for performance benchmarking:
# Generate queries with different complexity levels
for complexity in 1 2 5 10; do
beamline query basic \
--seed 500 \
--start-auto \
--script-path large_dataset.ion \
--sample-count 10 \
rand-select-all-fw \
--tbl-flt-rand-min $complexity \
--tbl-flt-rand-max $complexity \
--pred-all > "queries_complexity_$complexity.sql"
done
Query Generation Best Practices
1. Match Query Complexity to Data Structure
# Simple flat data - use simple paths
beamline query basic --script-path flat_data.ion --project-path-depth-max 2
# Complex nested data - use deeper paths
beamline query basic --script-path nested_data.ion --project-path-depth-max 8
2. Use Appropriate Predicate Sets
# For numeric data testing
--pred-comparison --pred-between
# For string data testing
--pred-like --pred-in
# For comprehensive testing
--pred-all
3. Generate Query Suites
# Generate different query types for comprehensive testing
beamline query basic --script-path data.ion --sample-count 10 rand-select-all-fw --pred-all > select_star.sql
beamline query basic --script-path data.ion --sample-count 10 rand-sfw --pred-all > projections.sql
beamline query basic --script-path data.ion --sample-count 10 rand-select-all-efw --pred-all > excludes.sql
4. Validate Generated Queries
Test generated queries against your data:
# Generate data and queries with same script
beamline gen data --seed 1 --start-auto --script-path test.ion --sample-count 100 > data.ion
beamline query basic --seed 2 --start-auto --script-path test.ion --sample-count 5 rand-select-all-fw --pred-all > queries.sql
# Validate queries parse correctly
# your-partiql-parser --validate queries.sql
Use Cases
PartiQL Implementation Testing
Generate comprehensive query test suites:
# Generate queries covering all PartiQL features
beamline query basic \
--seed 600 \
--start-auto \
--script-path comprehensive_schema.ion \
--sample-count 100 \
rand-sefw \
--project-rand-min 1 \
--project-rand-max 10 \
--project-path-depth-max 5 \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 5 \
--exclude-rand-min 1 \
--exclude-rand-max 3 \
--pred-all
Performance Benchmarking
Create query workloads for performance testing:
# Generate queries with increasing complexity
beamline query basic \
--seed 700 \
--start-auto \
--script-path performance_schema.ion \
--sample-count 50 \
rand-sfw \
--project-rand-min 1 \
--project-rand-max 20 \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 10 \
--pred-all > performance_queries.sql
Edge Case Testing
Generate queries that test edge cases:
# Focus on complex path expressions
beamline query basic \
--seed 800 \
--start-auto \
--script-path edge_case_data.ion \
--sample-count 25 \
rand-sefw \
--project-path-depth-min 3 \
--project-path-depth-max 8 \
--project-pathstep-internal-foreach \
--project-pathstep-final-unpivot \
--tbl-flt-path-depth-max 6 \
--exclude-path-depth-min 2 \
--exclude-path-depth-max 4 \
--pred-all
Next Steps
Now that you understand query generation fundamentals, explore specific aspects:
- Basic Queries - Simple query patterns and common use cases
- Advanced Patterns - Complex nested queries and deep paths
- Parameterization - Complete guide to all configuration options
- CLI Query Commands - Detailed CLI reference
Basic Query Generation
This section covers fundamental query generation patterns using Beamline’s rand-select-all-fw strategy, which generates simple SELECT * queries with WHERE clauses. This is the best starting point for understanding how query generation works.
Getting Started with Basic Queries
Simple Transaction Data
Let’s use the simple_transactions.ion script from the test suite as our data source:
rand_processes::{
test_data: rand_process::{
$r: Uniform::{ choices: [5,10] },
$arrival: HomogeneousPoisson::{ interarrival: milliseconds::$r },
$data: {
transaction_id: UUID::{ nullable: false },
marketplace_id: UniformU8::{ nullable: false },
country_code: Regex::{ pattern: "[A-Z]{2}" },
created_at: Instant,
completed: Bool,
description: LoremIpsum::{ min_words:10, max_words:200 },
price: UniformDecimal::{ low: 2.99, high: 99999.99, optional: true }
}
}
}
This script creates transaction data with various field types that the query generator can reference.
Basic Query Generation Command
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-lt
Generated Output:
SELECT * FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 18.418581624952935)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 15.495327785402296)
Understanding the Parameters
- --tbl-flt-rand-min 1 --tbl-flt-rand-max 1: Generate exactly 1 predicate per query
- --tbl-flt-path-depth-max 1: Use only top-level fields (no nested paths)
- --tbl-flt-pathstep-final-project: Final path step is field projection (.field)
- --tbl-flt-type-final-scalar: Only reference scalar values (numbers, strings, booleans)
- --pred-lt: Use only less-than (<) predicates
Different Predicate Types
Comparison Predicates
# Less than predicates
beamline query basic \
--seed 100 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--pred-lt
Output:
SELECT * FROM test_data AS test_data WHERE (test_data.price < 123.45)
SELECT * FROM test_data AS test_data WHERE (test_data.marketplace_id < 42)
Equality Predicates
# Equality predicates
beamline query basic \
--seed 200 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--pred-eq
Output:
SELECT * FROM test_data AS test_data WHERE (test_data.completed = true)
SELECT * FROM test_data AS test_data WHERE (test_data.country_code = 'US')
All Predicate Types
# Use all available predicates for comprehensive testing
beamline query basic \
--seed 300 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 5 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--pred-all
Output:
SELECT * FROM test_data AS test_data WHERE (test_data.country_code IN [
'Graecos quidem legendos.',
'Possit et sine.'
] OR (NOT ((test_data.description IS MISSING)) OR
(test_data.description IS MISSING)))
SELECT * FROM test_data AS test_data WHERE (((test_data.transaction_id IS NULL)
AND (test_data.created_at IS NULL)) OR (((test_data.completed IN [
false,
false
] OR NOT ((test_data.completed IS NULL))) AND
((NOT ((test_data.price IS NULL)) OR
(test_data.transaction_id LIKE 'Vidisse.' AND
(test_data.country_code IS NULL))) AND
NOT ((test_data.description IS MISSING)))) OR
(test_data.description <> 'Nec vero.')))
Multiple Predicates
Combining Predicates
Increase predicate count to create more complex WHERE clauses:
# Generate queries with 2-5 predicates
beamline query basic \
--seed 400 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-select-all-fw \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 5 \
--pred-all
Example Output:
SELECT * FROM test_data AS test_data
WHERE (((((test_data.country_code <> 'Qua maxime ceterorum.') AND
(NOT (test_data.completed IN [ false, true ]) OR
(test_data.description = 'Non faciant.'))) AND
(NOT ((test_data.price IS MISSING)) AND (test_data.price IS MISSING))) OR
test_data.price IN [
-47.936734585045905,
-0.8509689800217544,
24.263479438050297
]) OR ((test_data.created_at = UTCNOW()) OR
(NOT ((test_data.country_code IS MISSING)) AND
(test_data.description IS MISSING))))
Field Types and Query Generation
Numeric Fields
The query generator creates appropriate predicates for numeric types:
rand_processes::{
numeric_test: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
$data: {
count: UniformI32::{ low: 1, high: 1000 },
price: UniformDecimal::{ low: 9.99, high: 999.99 },
score: NormalF64::{ mean: 75.0, std_dev: 15.0 }
}
}
}
Generated Queries:
SELECT * FROM numeric_test WHERE (numeric_test.count > 500)
SELECT * FROM numeric_test WHERE (numeric_test.price BETWEEN 50.0 AND 200.0)
SELECT * FROM numeric_test WHERE (numeric_test.score <= 85.5)
String Fields
String generators produce string-appropriate predicates:
rand_processes::{
string_test: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
$data: {
name: LoremIpsumTitle,
email: Format::{ pattern: "user{UUID}@example.com" },
country: Regex::{ pattern: "[A-Z]{2}" },
description: LoremIpsum::{ min_words: 5, max_words: 50 }
}
}
}
Generated Queries:
SELECT * FROM string_test WHERE (string_test.country = 'US')
SELECT * FROM string_test WHERE (string_test.name LIKE '%Test%')
SELECT * FROM string_test WHERE (string_test.email IN ['user1@example.com', 'user2@example.com'])
Boolean Fields
Boolean fields generate boolean predicates:
rand_processes::{
boolean_test: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
$data: {
active: Bool,
verified: Bool::{ p: 0.8 },
premium: Bool::{ p: 0.1 }
}
}
}
Generated Queries:
SELECT * FROM boolean_test WHERE (boolean_test.active = true)
SELECT * FROM boolean_test WHERE (boolean_test.verified AND boolean_test.premium)
SELECT * FROM boolean_test WHERE (NOT boolean_test.active)
Null and Missing Value Queries
Testing Null Handling
When your data includes nullable fields, queries will test null handling:
rand_processes::{
nullable_test: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
$data: {
required_field: UUID::{ nullable: false },
nullable_field: UniformI32::{ nullable: 0.3, low: 1, high: 100 },
optional_field: UniformDecimal::{ optional: 0.2, low: 0.0, high: 1000.0 }
}
}
}
Generated Queries:
SELECT * FROM nullable_test WHERE (nullable_test.nullable_field IS NOT NULL)
SELECT * FROM nullable_test WHERE (nullable_test.optional_field IS MISSING)
SELECT * FROM nullable_test WHERE (nullable_test.required_field IS NOT NULL AND nullable_test.nullable_field > 50)
Focusing on Null/Missing Tests
# Generate queries focused on null/missing testing
beamline query basic \
--seed 500 \
--start-auto \
--script-path nullable_data.ion \
--sample-count 5 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 2 \
--pred-absent # Only IS NULL, IS NOT NULL, IS MISSING, IS NOT MISSING
Progressive Query Complexity
Start Simple
# Begin with single predicates
beamline query basic \
--seed 1 \
--start-auto \
--script-path data.ion \
--sample-count 5 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--pred-eq
Add More Predicates
# Increase to 2-3 predicates
beamline query basic \
--seed 1 \
--start-auto \
--script-path data.ion \
--sample-count 5 \
rand-select-all-fw \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 3 \
--pred-comparison
Enable All Predicates
# Use all available predicates for full complexity
beamline query basic \
--seed 1 \
--start-auto \
--script-path data.ion \
--sample-count 5 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 5 \
--pred-all
Examples
Single Predicate Queries
From the README example:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-lt
Actual Output:
SELECT * FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 18.418581624952935)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 15.495327785402296)
Multiple Predicate Queries
From the README example with more complex predicates:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-select-all-fw \
--tbl-flt-rand-min 3 \
--tbl-flt-rand-max 10 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-all \
--pred-all
Actual Output:
SELECT * FROM test_data AS test_data WHERE (test_data.country_code IN [
'Graecos quidem legendos.',
'Possit et sine.'
] OR (NOT ((test_data.description IS MISSING)) OR
(test_data.description IS MISSING)))
SELECT * FROM test_data AS test_data WHERE (((test_data.transaction_id IS NULL)
AND (test_data.created_at IS NULL)) OR (((test_data.completed IN [
false,
false
] OR NOT ((test_data.completed IS NULL))) AND
((NOT ((test_data.price IS NULL)) OR
(test_data.transaction_id LIKE 'Vidisse.' AND
(test_data.country_code IS NULL))) AND
NOT ((test_data.description IS MISSING)))) OR
(test_data.description <> 'Nec vero.')))
SELECT * FROM test_data AS test_data
WHERE (((((test_data.country_code <> 'Qua maxime ceterorum.') AND
(NOT (test_data.completed IN [ false, true, true ]) OR
(test_data.description = 'Non faciant.'))) AND
(NOT ((test_data.price IS MISSING)) AND (test_data.price IS MISSING))) OR
test_data.price IN [
-47.936734585045905,
-0.8509689800217544,
24.263479438050297,
-48.953369038690255
]) OR ((test_data.created_at = UTCNOW()) OR
(NOT ((test_data.country_code IS MISSING)) AND
(test_data.description IS MISSING))))
Specific Predicate Types
Comparison Predicates
# Only numeric comparisons
beamline query basic \
--seed 100 \
--start-auto \
--script-path numeric_data.ion \
--sample-count 5 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--pred-comparison # <, <=, >, >=, =, <>
Example Results:
SELECT * FROM test_data WHERE (test_data.price >= 50.0)
SELECT * FROM test_data WHERE (test_data.count <= 500)
SELECT * FROM test_data WHERE (test_data.score <> 75.5)
String Pattern Matching
# Focus on LIKE predicates
beamline query basic \
--seed 200 \
--start-auto \
--script-path text_data.ion \
--sample-count 5 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--pred-like
Example Results:
SELECT * FROM test_data WHERE (test_data.description LIKE '%lorem%')
SELECT * FROM test_data WHERE (test_data.country_code LIKE 'U_')
Set Membership
# Use IN and NOT IN predicates
beamline query basic \
--seed 300 \
--start-auto \
--script-path categorical_data.ion \
--sample-count 5 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--pred-in
Example Results:
SELECT * FROM test_data WHERE (test_data.status IN ['active', 'pending'])
SELECT * FROM test_data WHERE (test_data.category IN ['electronics', 'books', 'clothing'])
Reproducible Query Testing
Test Suite Generation
Create consistent test suites:
#!/bin/bash
# Generate reproducible query test suite
SCRIPT="test_schema.ion"
BASE_SEED=12345
# Simple queries for basic functionality
beamline query basic \
--seed $BASE_SEED \
--start-auto \
--script-path $SCRIPT \
--sample-count 10 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--pred-eq > basic_equality.sql
# Comparison queries for numeric testing
beamline query basic \
--seed $((BASE_SEED + 1)) \
--start-auto \
--script-path $SCRIPT \
--sample-count 10 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--pred-comparison > numeric_comparisons.sql
# Complex queries for comprehensive testing
beamline query basic \
--seed $((BASE_SEED + 2)) \
--start-auto \
--script-path $SCRIPT \
--sample-count 15 \
rand-select-all-fw \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 5 \
--pred-all > complex_queries.sql
echo "Generated test suite:"
echo "- basic_equality.sql: $(wc -l < basic_equality.sql) queries"
echo "- numeric_comparisons.sql: $(wc -l < numeric_comparisons.sql) queries"
echo "- complex_queries.sql: $(wc -l < complex_queries.sql) queries"
Regression Testing
# Generate baseline queries
beamline query basic \
--seed 999 \
--start-auto \
--script-path stable_schema.ion \
--sample-count 20 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--pred-all > baseline_queries.sql
# Later: regenerate with same seed to verify no regressions
beamline query basic \
--seed 999 \
--start-auto \
--script-path stable_schema.ion \
--sample-count 20 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--pred-all > regression_test_queries.sql
# Verify identical output
diff baseline_queries.sql regression_test_queries.sql
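The diff step can be wrapped in a small pass/fail check that is convenient in CI. The sketch below uses stand-in files in place of real output; in practice, the two .sql files come from the commands above.

```shell
# Stand-ins for two generation runs with the same seed; real runs
# with identical seeds should produce byte-identical files.
printf 'SELECT * FROM t WHERE (t.x = 1)\n' > baseline_queries.sql
printf 'SELECT * FROM t WHERE (t.x = 1)\n' > regression_test_queries.sql

# Fail loudly (non-zero exit) if the regenerated queries drift.
if diff -u baseline_queries.sql regression_test_queries.sql > query_drift.patch; then
    echo "OK: query generation is reproducible"
    rm -f query_drift.patch
else
    echo "FAIL: generated queries changed; see query_drift.patch" >&2
    exit 1
fi
```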
Common Query Patterns
Data Validation Queries
Generate queries that validate data constraints:
# Focus on range and constraint validation
beamline query basic \
--seed 600 \
--start-auto \
--script-path validation_schema.ion \
--sample-count 10 \
rand-select-all-fw \
--pred-comparison --pred-between
Example Validation Queries:
SELECT * FROM test_data WHERE (test_data.age BETWEEN 0 AND 120)
SELECT * FROM test_data WHERE (test_data.price > 0)
SELECT * FROM test_data WHERE (test_data.email IS NOT NULL)
Performance Baseline Queries
Create simple queries to establish a performance baseline:
# Simple queries for baseline performance measurement
beamline query basic \
--seed 700 \
--start-auto \
--script-path performance_data.ion \
--sample-count 25 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--pred-eq --pred-lt --pred-gt
Integration with Data Generation
Complete Workflow
#!/bin/bash
# Complete data + query generation workflow
SCRIPT="customer_transactions.ion"
SEED=12345
echo "Generating test data..."
beamline gen data \
--seed $SEED \
--start-auto \
--script-path $SCRIPT \
--sample-count 1000 \
--output-format ion-pretty > test_data.ion
echo "Generating basic queries..."
beamline query basic \
--seed $((SEED + 1)) \
--start-auto \
--script-path $SCRIPT \
--sample-count 15 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 2 \
--pred-all > basic_test_queries.sql
echo "Generated $(wc -l < test_data.ion) data records and $(wc -l < basic_test_queries.sql) test queries"
# Test first few queries (example)
head -5 basic_test_queries.sql | while IFS= read -r query; do
echo "Query: $query"
# your-partiql-engine --query "$query" --data test_data.ion
done
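Note that `wc -l` counts lines, not queries — pretty-printed queries span several lines. Counting the lines that begin a SELECT is a closer estimate. The sketch below uses a stand-in file; in practice, point it at the generated .sql file.

```shell
# Stand-in for generated output: one multi-line and one single-line query.
cat > basic_test_queries.sql <<'EOF'
SELECT * FROM test_data AS test_data
WHERE (test_data.price < 10)
SELECT * FROM test_data AS test_data WHERE (test_data.completed = true)
EOF

# Each generated query starts on a fresh line with SELECT.
query_count=$(grep -c '^SELECT' basic_test_queries.sql)
echo "Generated $query_count test queries"
```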
Best Practices
1. Start with Simple Configurations
# Begin testing with minimal complexity
beamline query basic \
--seed 1 \
--start-auto \
--script-path new_schema.ion \
--sample-count 3 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--pred-eq
2. Match Predicates to Data Types
# For numeric-heavy data
--pred-comparison --pred-between
# For string-heavy data
--pred-like --pred-eq --pred-in
# For boolean data
--pred-eq --pred-logical-not
3. Test Query Coverage
# Generate enough queries to cover different data patterns
beamline query basic \
--seed 100 \
--start-auto \
--script-path comprehensive_data.ion \
--sample-count 50 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 4 \
--pred-all
4. Validate Against Real Data
# Always test generated queries work with generated data
beamline gen data --seed 1 --start-auto --script-path schema.ion --sample-count 100 > data.ion
beamline query basic --seed 2 --start-auto --script-path schema.ion --sample-count 5 rand-select-all-fw --pred-all > queries.sql
# Validate each query
# while IFS= read -r query; do
#   your-partiql-engine --validate "$query"
# done < queries.sql
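When no PartiQL engine is at hand, a cheap structural sanity check — balanced parentheses across the whole file — still catches gross generation or truncation problems. This is a sketch with a stand-in file, not a substitute for a real parser.

```shell
# Stand-in for generated queries; real runs write this file via beamline.
cat > queries.sql <<'EOF'
SELECT * FROM t WHERE (t.a = 1)
SELECT * FROM t WHERE ((t.b > 2) AND
(t.c IS NULL))
EOF

# gsub() returns the number of substitutions, i.e. a character count here.
awk '{ open += gsub(/\(/, "("); shut += gsub(/\)/, ")") }
     END { if (open == shut) { print "balanced"; exit 0 }
           else              { print "unbalanced"; exit 1 } }' queries.sql
```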
Next Steps
Now that you understand basic query generation:
- Advanced Patterns - Complex queries with projections and exclusions
- Parameterization - Complete guide to all query generation options
- CLI Query Commands - Detailed CLI reference with all parameters
Advanced Query Patterns
This section covers advanced query generation using the three remaining query strategies: rand-sfw (projections), rand-select-all-efw (exclusions), and rand-sefw (projections + exclusions). These strategies generate more sophisticated PartiQL queries that test complex language features.
Query Strategy Overview
Available Strategies
| Strategy | Query Pattern | Features |
|---|---|---|
rand-select-all-fw | SELECT * FROM WHERE | Basic queries with WHERE clauses |
rand-sfw | SELECT fields FROM WHERE | Custom projections + WHERE |
rand-select-all-efw | SELECT * EXCLUDE FROM WHERE | Exclusions + WHERE |
rand-sefw | SELECT fields EXCLUDE FROM WHERE | Projections + Exclusions + WHERE
SELECT with Projections (rand-sfw)
Basic Projection Queries
Generate queries with specific field selections:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-sfw \
--project-rand-min 2 \
--project-rand-max 5 \
--project-path-depth-min 1 \
--project-path-depth-max 1 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 5 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-all
Example Output:
SELECT test_data.completed, test_data.completed FROM test_data AS test_data
WHERE (NOT (test_data.completed) OR NOT ((test_data.created_at IS MISSING)))
SELECT test_data.completed, test_data.marketplace_id, test_data.created_at
FROM test_data AS test_data WHERE (NOT ((test_data.transaction_id IS NULL)) OR
(((test_data.transaction_id IN [
'Iam in.',
'Se.',
'Sine amicitia firmam.',
'Notae sunt.'
] OR (test_data.transaction_id IS NULL)) OR
NOT ((test_data.description IS NULL))) OR
(test_data.marketplace_id >= 28)))
SELECT test_data, test_data.description FROM test_data AS test_data
WHERE (test_data.completed IN [ false, false ] AND
(((test_data.price <= 5.761136291521325) AND
NOT ((test_data.transaction_id IS MISSING))) AND
(NOT ((test_data.created_at IS MISSING)) AND
(test_data.created_at IS NULL))))
Understanding Projection Parameters
- --project-rand-min 2 --project-rand-max 5: Select 2-5 fields in the SELECT clause
- --project-path-depth-max 1: Use simple field names (no deep nesting)
- --project-pathstep-final-all: Allow all path types (.field, [*], .*)
- --project-type-final-all: Project all types (scalars, structs, sequences)
SELECT * EXCLUDE (rand-select-all-efw)
Basic Exclusion Queries
Generate SELECT * queries that exclude specific fields:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-select-all-efw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-lt \
--exclude-rand-min 1 \
--exclude-rand-max 3 \
--exclude-path-depth-min 1 \
--exclude-path-depth-max 1 \
--exclude-pathstep-internal-all \
--exclude-pathstep-final-all \
--exclude-type-final-all
Example Output:
SELECT * EXCLUDE test_data.marketplace_id, test_data.*, test_data.completed
FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)
SELECT * EXCLUDE test_data.completed FROM test_data AS test_data
WHERE (test_data.price < 18.418581624952935)
SELECT * EXCLUDE test_data.marketplace_id, test_data.completed,
test_data.marketplace_id
FROM test_data AS test_data WHERE (test_data.price < 15.495327785402296)
Understanding Exclusion Parameters
- --exclude-rand-min 1 --exclude-rand-max 3: Exclude 1-3 fields
- --exclude-path-depth-max 1: Use simple field exclusions
- --exclude-pathstep-final-all: Allow all exclusion path types
- --exclude-type-final-all: Exclude all types (scalars, structs, arrays)
SELECT EXCLUDE FROM WHERE (rand-sefw)
Complete Query Generation
The most sophisticated strategy combines projections, exclusions, and WHERE clauses:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-sefw \
--project-rand-min 2 \
--project-rand-max 5 \
--project-path-depth-min 1 \
--project-path-depth-max 1 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 5 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-all \
--exclude-rand-min 1 \
--exclude-rand-max 3 \
--exclude-path-depth-min 1 \
--exclude-path-depth-max 1 \
--exclude-pathstep-internal-all \
--exclude-pathstep-final-all \
--exclude-type-final-all
Example Output:
SELECT test_data.completed, test_data.completed
EXCLUDE test_data.marketplace_id, test_data.*, test_data.completed
FROM test_data AS test_data
WHERE (NOT (test_data.completed) OR
NOT ((test_data.created_at IS MISSING)))
SELECT test_data.completed, test_data.marketplace_id, test_data.created_at
EXCLUDE test_data.completed
FROM test_data AS test_data
WHERE (NOT ((test_data.transaction_id IS NULL)) OR
(((test_data.transaction_id IN [
'Iam in.',
'Se.',
'Sine amicitia firmam.',
'Notae sunt.'
] OR (test_data.transaction_id IS NULL)) OR
NOT ((test_data.description IS NULL))) OR
(test_data.marketplace_id >= 28)))
SELECT test_data, test_data.description
EXCLUDE test_data.marketplace_id, test_data.completed, test_data.marketplace_id
FROM test_data AS test_data
WHERE (test_data.completed IN [ false, false ] AND
(((test_data.price <= 5.761136291521325) AND
NOT ((test_data.transaction_id IS MISSING))) AND
(NOT ((test_data.created_at IS MISSING)) AND
(test_data.created_at IS NULL))))
Deep Path Generation
Complex Nested Structures
For deeply nested data structures, use higher path depth limits:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path transactions.ion \
--sample-count 3 \
rand-sefw \
--project-rand-min 2 \
--project-rand-max 5 \
--project-path-depth-min 1 \
--project-path-depth-max 10 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 5 \
--tbl-flt-path-depth-max 10 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-all \
--exclude-rand-min 1 \
--exclude-rand-max 2 \
--exclude-path-depth-min 3 \
--exclude-path-depth-max 4 \
--exclude-pathstep-internal-all \
--exclude-pathstep-final-unpivot \
--exclude-type-final-all
Generated Deep Nested Queries:
SELECT test_data.*.nested_struct.nested_struct.nested_struct.nested_struct.nested_struct.*,
test_data.test_nest_struct.*.*.nested_struct.nested_struct
EXCLUDE test_data.*.*.*.*, test_data.price.* FROM test_data AS test_data
WHERE ((test_data.test_nest_struct.*.*.*.nested_struct.*.test_int <> 19) OR
(test_data.test_nest_struct.*.*.nested_struct.*.*.test_int > 35))
SELECT test_data.test_nest_struct.*.nested_struct.*.*.nested_struct.*,
test_data.test_nest_struct.*.*.nested_struct.nested_struct.*.*,
test_data.test_nest_struct.nested_struct.*.nested_struct.*,
test_data.test_nest_struct.*.nested_struct.nested_struct.nested_struct.*
EXCLUDE test_data.test_nest_struct.*.*, test_data.test_nest_struct.*.*.*
FROM test_data AS test_data
WHERE ((test_data.*.*.nested_struct.*.*.*.test_int < 40) OR
(test_data.*.*.nested_struct.nested_struct.*.nested_struct.test_int >= -9))
SELECT test_data.*.nested_struct.nested_struct.nested_struct.nested_struct.*,
test_data.*.nested_struct.nested_struct.nested_struct.*.*.test_int
EXCLUDE test_data.*.nested_struct.*.*,
test_data.test_nest_struct.nested_struct.*.*
FROM test_data AS test_data
WHERE ((((test_data.price.value <= 6.206304713037888) OR
(test_data.*.nested_struct.nested_struct.*.nested_struct.*.test_int <> -29))
AND
(test_data.test_nest_struct.*.nested_struct.*.nested_struct.nested_struct.test_int < 6))
AND ((test_data.price > -44.666855950508584) OR
(test_data.*.*.*.nested_struct.*.*.test_int > -42)))
Controlling Path Depth
# Moderate depth for readability
beamline query basic \
--seed 1234 \
--start-auto \
--script-path transactions.ion \
--sample-count 3 \
rand-sefw \
--project-rand-min 2 \
--project-rand-max 5 \
--project-path-depth-min 1 \
--project-path-depth-max 3 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 5 \
--tbl-flt-path-depth-max 10 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-all \
--exclude-rand-min 1 \
--exclude-rand-max 2 \
--exclude-path-depth-min 3 \
--exclude-path-depth-max 4 \
--exclude-pathstep-internal-all \
--exclude-pathstep-final-unpivot \
--exclude-type-final-all
More Manageable Output:
SELECT test_data.price, test_data.*.*.nested_struct EXCLUDE test_data.*.*.*.*,
test_data.price.*
FROM test_data AS test_data
WHERE ((test_data.test_nest_struct.*.*.*.nested_struct.*.test_int <> 19) OR
(test_data.test_nest_struct.*.*.nested_struct.*.*.test_int > 35))
SELECT test_data.price, test_data.*.*.nested_struct, test_data.test_struct,
test_data.*.*.*
EXCLUDE test_data.test_nest_struct.*.*, test_data.test_nest_struct.*.*.*
FROM test_data AS test_data
WHERE ((test_data.*.*.nested_struct.*.*.*.test_int < 40) OR
(test_data.*.*.nested_struct.nested_struct.*.nested_struct.test_int >= -9))
SELECT test_data.transaction_id, test_data.*.nested_struct
EXCLUDE test_data.*.nested_struct.*.*,
test_data.test_nest_struct.nested_struct.*.*
FROM test_data AS test_data
WHERE ((((test_data.price.value <= 6.206304713037888) OR
(test_data.*.nested_struct.nested_struct.*.nested_struct.*.test_int <> -29))
AND
(test_data.test_nest_struct.*.nested_struct.*.nested_struct.nested_struct.test_int < 6))
AND ((test_data.price > -44.666855950508584) OR
(test_data.*.*.*.nested_struct.*.*.test_int > -42)))
Path Expression Types
Path Step Types
Beamline can generate different path step types:
Projection Steps (.field)
SELECT test_data.transaction_id, test_data.customer.name
FROM test_data AS test_data
Index Steps ([N])
SELECT test_data.items[0], test_data.scores[5]
FROM test_data AS test_data
Wildcard Steps ([*])
SELECT test_data.items[*].price, test_data.users[*].name
FROM test_data AS test_data
Unpivot Steps (.*)
SELECT test_data.metadata.*, test_data.settings.*
FROM test_data AS test_data
Path Configuration Examples
Simple Projections Only
# Only use projection steps
beamline query basic \
--seed 100 \
--start-auto \
--script-path nested_data.ion \
--sample-count 5 \
rand-sfw \
--project-rand-min 3 \
--project-rand-max 5 \
--project-path-depth-max 3 \
--project-pathstep-internal-project \
--project-pathstep-final-project \
--pred-all
Include Wildcards
# Add wildcard and unpivot paths
beamline query basic \
--seed 200 \
--start-auto \
--script-path array_data.ion \
--sample-count 5 \
rand-sfw \
--project-rand-min 2 \
--project-rand-max 4 \
--project-path-depth-max 2 \
--project-pathstep-internal-all \
--project-pathstep-final-foreach \
--project-pathstep-final-unpivot \
--pred-all
Complex Query Combinations
Full Feature Queries
Use all query features together for comprehensive PartiQL testing:
beamline query basic \
--seed 2000 \
--start-auto \
--script-path comprehensive_schema.ion \
--sample-count 5 \
rand-sefw \
--project-rand-min 3 \
--project-rand-max 8 \
--project-path-depth-min 1 \
--project-path-depth-max 5 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 6 \
--tbl-flt-path-depth-max 4 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-all \
--tbl-flt-type-final-all \
--exclude-rand-min 1 \
--exclude-rand-max 4 \
--exclude-path-depth-min 1 \
--exclude-path-depth-max 3 \
--exclude-pathstep-internal-all \
--exclude-pathstep-final-all \
--exclude-type-final-all \
--pred-all
This generates very complex queries testing the full range of PartiQL features.
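One rough way to gauge how complex the output actually got is to count comparison operators per file. This sketch uses a stand-in file; the operator list matches the predicates shown in this guide.

```shell
# Stand-in for a generated full-feature query.
cat > full_feature.sql <<'EOF'
SELECT t.a EXCLUDE t.b FROM t AS t WHERE ((t.x > 1) AND (t.y <= 2))
EOF

# POSIX ERE matching is leftmost-longest, so <= counts once
# (not as < plus =).
grep -o -E '<=|>=|<>|<|>|=' full_feature.sql | wc -l
```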
Type-Specific Query Generation
Scalar Type Focus
Generate queries that only work with scalar values:
beamline query basic \
--seed 300 \
--start-auto \
--script-path mixed_types.ion \
--sample-count 8 \
rand-sfw \
--project-rand-min 2 \
--project-rand-max 4 \
--project-type-final-scalar \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--tbl-flt-type-final-scalar \
--pred-comparison
Results focus on scalar fields:
SELECT test_data.price, test_data.completed, test_data.marketplace_id
FROM test_data AS test_data
WHERE (test_data.transaction_id = 'some-uuid' AND test_data.price > 100.0)
Structure Type Queries
Generate queries that work with complex structures:
beamline query basic \
--seed 400 \
--start-auto \
--script-path nested_objects.ion \
--sample-count 5 \
rand-sfw \
--project-rand-min 2 \
--project-rand-max 3 \
--project-type-final-struct \
--tbl-flt-type-final-struct \
--pred-absent --pred-eq
Advanced Testing Patterns
Edge Case Query Generation
Test PartiQL edge cases and complex scenarios:
# Generate edge case queries
beamline query basic \
--seed 500 \
--start-auto \
--script-path edge_case_schema.ion \
--sample-count 10 \
rand-sefw \
--project-path-depth-min 4 \
--project-path-depth-max 8 \
--project-pathstep-internal-foreach \
--project-pathstep-final-unpivot \
--tbl-flt-path-depth-max 6 \
--tbl-flt-pathstep-internal-unpivot \
--tbl-flt-pathstep-final-foreach \
--exclude-path-depth-min 2 \
--exclude-path-depth-max 5 \
--exclude-pathstep-final-unpivot \
--pred-all
Performance Stress Testing
Generate computationally expensive queries:
# Create performance stress test queries
beamline query basic \
--seed 600 \
--start-auto \
--script-path large_schema.ion \
--sample-count 20 \
rand-sefw \
--project-rand-min 5 \
--project-rand-max 15 \
--project-path-depth-max 6 \
--tbl-flt-rand-min 3 \
--tbl-flt-rand-max 10 \
--tbl-flt-path-depth-max 5 \
--exclude-rand-min 2 \
--exclude-rand-max 8 \
--pred-all > stress_test_queries.sql
Integration Workflows
Multi-Strategy Testing
Generate different query types for comprehensive testing:
#!/bin/bash
# Generate complete query test suite
SCRIPT="test_schema.ion"
SEED=12345
SAMPLE_COUNT=10
echo "Generating comprehensive query test suite..."
# Basic SELECT * queries
beamline query basic \
--seed $SEED \
--start-auto \
--script-path $SCRIPT \
--sample-count $SAMPLE_COUNT \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--pred-all > select_star.sql
# Projection queries
beamline query basic \
--seed $((SEED + 1)) \
--start-auto \
--script-path $SCRIPT \
--sample-count $SAMPLE_COUNT \
rand-sfw \
--project-rand-min 2 \
--project-rand-max 5 \
--project-path-depth-max 2 \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--pred-all > projections.sql
# Exclusion queries
beamline query basic \
--seed $((SEED + 2)) \
--start-auto \
--script-path $SCRIPT \
--sample-count $SAMPLE_COUNT \
rand-select-all-efw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 2 \
--exclude-rand-min 1 \
--exclude-rand-max 3 \
--pred-all > exclusions.sql
# Complex combined queries
beamline query basic \
--seed $((SEED + 3)) \
--start-auto \
--script-path $SCRIPT \
--sample-count $SAMPLE_COUNT \
rand-sefw \
--project-rand-min 2 \
--project-rand-max 4 \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--exclude-rand-min 1 \
--exclude-rand-max 2 \
--pred-all > combined.sql
echo "Query test suite generated:"
echo "- select_star.sql: $(wc -l < select_star.sql) queries"
echo "- projections.sql: $(wc -l < projections.sql) queries"
echo "- exclusions.sql: $(wc -l < exclusions.sql) queries"
echo "- combined.sql: $(wc -l < combined.sql) queries"
echo "Total: $(($(wc -l < select_star.sql) + $(wc -l < projections.sql) + $(wc -l < exclusions.sql) + $(wc -l < combined.sql))) queries"
Query Complexity Progression
#!/bin/bash
# Generate queries with increasing complexity
SCRIPT="complex_data.ion"
BASE_SEED=1000
for complexity_level in 1 2 3 5; do
echo "Generating complexity level $complexity_level queries..."
beamline query basic \
--seed $((BASE_SEED + complexity_level)) \
--start-auto \
--script-path $SCRIPT \
--sample-count 10 \
rand-sefw \
--project-rand-min $complexity_level \
--project-rand-max $((complexity_level * 2)) \
--project-path-depth-max $complexity_level \
--tbl-flt-rand-min $complexity_level \
--tbl-flt-rand-max $((complexity_level * 2)) \
--tbl-flt-path-depth-max $complexity_level \
--exclude-rand-min 1 \
--exclude-rand-max $complexity_level \
--pred-all > "complexity_${complexity_level}.sql"
echo " Generated: $(wc -l < complexity_${complexity_level}.sql) queries"
done
echo "Query complexity suite completed"
Best Practices
1. Match Complexity to Use Case
# Simple testing - basic patterns
--project-path-depth-max 2 --tbl-flt-path-depth-max 2
# Comprehensive testing - complex patterns
--project-path-depth-max 5 --tbl-flt-path-depth-max 5
# Edge case testing - maximum complexity
--project-path-depth-max 10 --tbl-flt-path-depth-max 10
2. Balance Query Features
# Don't overload with too many features at once
--project-rand-min 2 --project-rand-max 4 # Moderate projections
--exclude-rand-min 1 --exclude-rand-max 2 # Few exclusions
--tbl-flt-rand-min 1 --tbl-flt-rand-max 3 # Simple WHERE clauses
3. Test Incrementally
# Start simple
beamline query basic --script-path data.ion --sample-count 3 rand-select-all-fw --pred-eq
# Add projections
beamline query basic --script-path data.ion --sample-count 3 rand-sfw --project-rand-min 2 --project-rand-max 3 --pred-eq
# Add exclusions
beamline query basic --script-path data.ion --sample-count 3 rand-sefw --project-rand-min 2 --exclude-rand-min 1 --pred-eq
# Full complexity
beamline query basic --script-path data.ion --sample-count 5 rand-sefw --project-rand-min 3 --exclude-rand-min 2 --tbl-flt-rand-min 2 --pred-all
4. Validate Complex Queries
# Generate and validate complex queries
beamline query basic \
--seed 700 \
--start-auto \
--script-path validation_schema.ion \
--sample-count 15 \
rand-sefw \
--project-rand-min 2 \
--project-rand-max 6 \
--exclude-rand-min 1 \
--exclude-rand-max 3 \
--pred-all > complex_validation.sql
# Check each query
# your-partiql-parser --check-syntax complex_validation.sql
Next Steps
Now that you understand advanced query patterns:
- Parameterization - Complete reference for all query generation parameters
- CLI Query Commands - Detailed CLI usage with all options
- Examples - See query generation in complete workflows
Query Generation Parameterization
This section provides a complete reference for all query generation parameters. Beamline’s query generator is highly configurable, allowing you to control every aspect of query generation from simple predicates to complex nested path expressions.
Parameter Categories
Query generation parameters are organized into several categories:
- Table Filter Parameters - Control WHERE clause generation
- Projection Parameters - Control SELECT field selection
- Exclusion Parameters - Control EXCLUDE clause generation
- Path Parameters - Control how paths navigate data structures
- Predicate Parameters - Control predicate types and operators
- Type Parameters - Control what data types can be referenced
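Putting one knob from each category together, a single invocation might look like the sketch below. All flags are documented in the tables that follow; the schema file name is a placeholder.

```shell
# One parameter from each category in a single invocation:
# filter count, path depth, path steps, type constraint, predicate set.
beamline query basic \
  --seed 42 \
  --start-auto \
  --script-path schema.ion \
  --sample-count 5 \
  rand-select-all-fw \
  --tbl-flt-rand-min 1 \
  --tbl-flt-rand-max 3 \
  --tbl-flt-path-depth-max 2 \
  --tbl-flt-pathstep-internal-project \
  --tbl-flt-pathstep-final-project \
  --tbl-flt-type-final-scalar \
  --pred-comparison
```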
Table Filter Parameters
Control WHERE clause generation across all query strategies:
Filter Count
| Parameter | Description | Valid Values |
|---|---|---|
--tbl-flt-rand-min | Minimum number of predicates | 1-255 |
--tbl-flt-rand-max | Maximum number of predicates | 1-255 |
Example:
# Generate 1-3 predicates per WHERE clause
--tbl-flt-rand-min 1 --tbl-flt-rand-max 3
Filter Path Configuration
| Parameter | Description |
|---|---|
--tbl-flt-path-depth-min | Minimum path depth (default: unbounded) |
--tbl-flt-path-depth-max | Maximum path depth (default: unbounded) |
Example:
# Allow paths up to 3 levels deep
--tbl-flt-path-depth-max 3
# Results: field, object.field, object.nested.field
Table Filter Path Steps
Control what types of path steps can appear in WHERE clauses:
Internal Path Steps
| Parameter | Description | Example Path |
|---|---|---|
--tbl-flt-pathstep-internal-all | Enable all internal path step types | All below |
--tbl-flt-pathstep-internal-project | Enable projection internal steps | field.subfield |
--tbl-flt-pathstep-internal-index | Enable index internal steps | array[1].field |
--tbl-flt-pathstep-internal-foreach | Enable wildcard internal steps | array[*].field |
--tbl-flt-pathstep-internal-unpivot | Enable unpivot internal steps | object.*.field |
Final Path Steps
| Parameter | Description | Example Path |
|---|---|---|
--tbl-flt-pathstep-final-all | Enable all final path step types | All below |
--tbl-flt-pathstep-final-project | Enable projection final steps | object.field |
--tbl-flt-pathstep-final-index | Enable index final steps | array[1] |
--tbl-flt-pathstep-final-foreach | Enable wildcard final steps | array[*] |
--tbl-flt-pathstep-final-unpivot | Enable unpivot final steps | object.* |
Table Filter Type Constraints
Control what types of values can appear in WHERE clauses:
| Parameter | Description | Example |
|---|---|---|
--tbl-flt-type-final-all | Allow all final types | Any type |
--tbl-flt-type-final-scalar | Allow only scalar types | 9, 'text', true |
--tbl-flt-type-final-sequence | Allow only sequence types | [1,2,3], <<1, 'a'>>
--tbl-flt-type-final-struct | Allow only struct types | {'a': 1, 'b': 2} |
Projection Parameters
Control SELECT clause generation (applies to rand-sfw and rand-sefw strategies):
Projection Count
| Parameter | Description | Valid Values |
|---|---|---|
| --project-rand-min | Minimum number of projections | 1-255 |
| --project-rand-max | Maximum number of projections | 1-255 |
Example:
# Generate 2-5 fields in SELECT clause
--project-rand-min 2 --project-rand-max 5
Projection Path Configuration
| Parameter | Description |
|---|---|
| --project-path-depth-min | Minimum projection path depth |
| --project-path-depth-max | Maximum projection path depth |
Projection Path Steps
Same options as the table filter path steps, but applied to the SELECT clause:
Internal Path Steps
| Parameter | Description |
|---|---|
| --project-pathstep-internal-all | Enable all internal path step types |
| --project-pathstep-internal-project | Enable projection internal steps |
| --project-pathstep-internal-index | Enable index internal steps |
| --project-pathstep-internal-foreach | Enable wildcard internal steps |
| --project-pathstep-internal-unpivot | Enable unpivot internal steps |
Final Path Steps
| Parameter | Description |
|---|---|
| --project-pathstep-final-all | Enable all final path step types |
| --project-pathstep-final-project | Enable projection final steps |
| --project-pathstep-final-index | Enable index final steps |
| --project-pathstep-final-foreach | Enable wildcard final steps |
| --project-pathstep-final-unpivot | Enable unpivot final steps |
Projection Type Constraints
| Parameter | Description |
|---|---|
| --project-type-final-all | Allow all final types |
| --project-type-final-scalar | Allow only scalar types |
| --project-type-final-sequence | Allow only sequence types |
| --project-type-final-struct | Allow only struct types |
Exclusion Parameters
Control EXCLUDE clause generation (applies to rand-select-all-efw and rand-sefw strategies):
Exclusion Count
| Parameter | Description | Valid Values |
|---|---|---|
| --exclude-rand-min | Minimum number of exclusions | 1-255 |
| --exclude-rand-max | Maximum number of exclusions | 1-255 |
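Example (mirrors the filter and projection count examples above):

```shell
# Exclude 1-2 fields per query
--exclude-rand-min 1 --exclude-rand-max 2
```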
Exclusion Path Configuration
| Parameter | Description |
|---|---|
| --exclude-path-depth-min | Minimum exclusion path depth |
| --exclude-path-depth-max | Maximum exclusion path depth |
Exclusion Path Steps
Same structure as projection parameters:
| Parameter | Description |
|---|---|
| --exclude-pathstep-internal-all | Enable all internal path steps |
| --exclude-pathstep-internal-project | Enable projection internal steps |
| --exclude-pathstep-internal-index | Enable index internal steps |
| --exclude-pathstep-internal-foreach | Enable wildcard internal steps |
| --exclude-pathstep-internal-unpivot | Enable unpivot internal steps |
| --exclude-pathstep-final-all | Enable all final path steps |
| --exclude-pathstep-final-project | Enable projection final steps |
| --exclude-pathstep-final-index | Enable index final steps |
| --exclude-pathstep-final-foreach | Enable wildcard final steps |
| --exclude-pathstep-final-unpivot | Enable unpivot final steps |
Exclusion Type Constraints
| Parameter | Description |
|---|---|
| --exclude-type-final-all | Allow all final types |
| --exclude-type-final-scalar | Allow only scalar types |
| --exclude-type-final-sequence | Allow only sequence types |
| --exclude-type-final-struct | Allow only struct types |
Predicate Parameters
Control what types of predicates can be generated in WHERE clauses:
All Predicate Types
| Parameter | Description | SQL Operators |
|---|---|---|
| --pred-all | Enable all predicate types | All below |
| --pred-none | Disable all predicates | None |
Null and Missing Predicates
| Parameter | Description | SQL Operators |
|---|---|---|
| --pred-absent | Enable null/missing predicates | IS NULL, IS NOT NULL, IS MISSING, IS NOT MISSING |
| --pred-nullable | Enable null predicates | IS NULL, IS NOT NULL |
| --pred-is-null | Enable IS NULL | IS NULL |
| --pred-is-not-null | Enable IS NOT NULL | IS NOT NULL |
| --pred-optional | Enable missing predicates | IS MISSING, IS NOT MISSING |
| --pred-is-missing | Enable IS MISSING | IS MISSING |
| --pred-is-not-missing | Enable IS NOT MISSING | IS NOT MISSING |
Equality Predicates
| Parameter | Description | SQL Operators |
|---|---|---|
| --pred-equality | Enable equality predicates | =, <> |
| --pred-eq | Enable equals | = |
| --pred-neq | Enable not equals | <> |
Comparison Predicates
| Parameter | Description | SQL Operators |
|---|---|---|
| --pred-comparison | Enable all comparison predicates | <, <=, >, >=, BETWEEN |
| --pred-lt | Enable less than | < |
| --pred-lte | Enable less than or equal | <= |
| --pred-gt | Enable greater than | > |
| --pred-gte | Enable greater than or equal | >= |
| --pred-between | Enable between | BETWEEN |
Numeric Predicates
| Parameter | Description | SQL Operators |
|---|---|---|
| --pred-numeric | Enable all numeric predicates | =, <>, <, <=, >, >=, BETWEEN |
String Predicates
| Parameter | Description | SQL Operators |
|---|---|---|
| --pred-like-all | Enable all LIKE predicates | LIKE, NOT LIKE |
| --pred-like | Enable LIKE | LIKE |
| --pred-not-like | Enable NOT LIKE | NOT LIKE |
Set Membership Predicates
| Parameter | Description | SQL Operators |
|---|---|---|
| --pred-in-all | Enable all IN predicates | IN, NOT IN |
| --pred-in | Enable IN | IN |
| --pred-not-in | Enable NOT IN | NOT IN |
Logical Predicates
| Parameter | Description | SQL Operators |
|---|---|---|
| --pred-logical-all | Enable all logical operators | AND, OR, NOT |
| --pred-logical-and | Enable AND | AND |
| --pred-logical-or | Enable OR | OR |
| --pred-logical-not | Enable NOT | NOT |
Strategy Compatibility Matrix
Different parameters apply to different query strategies:
| Parameter Category | rand-select-all-fw | rand-sfw | rand-select-all-efw | rand-sefw |
|---|---|---|---|---|
| Table Filter | ✅ | ✅ | ✅ | ✅ |
| Projection | ❌ | ✅ | ❌ | ✅ |
| Exclusion | ❌ | ❌ | ✅ | ✅ |
| Predicates | ✅ | ✅ | ✅ | ✅ |
Configuration Examples
Simple Configuration
For basic testing with readable queries:
beamline query basic \
--seed 100 \
--start-auto \
--script-path data.ion \
--sample-count 10 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 2 \
--tbl-flt-path-depth-max 2 \
--tbl-flt-pathstep-internal-project \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-comparison
Moderate Configuration
For comprehensive testing with controlled complexity:
beamline query basic \
--seed 200 \
--start-auto \
--script-path data.ion \
--sample-count 15 \
rand-sfw \
--project-rand-min 2 \
--project-rand-max 4 \
--project-path-depth-max 3 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--tbl-flt-path-depth-max 2 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-all
Complex Configuration
For edge case testing with maximum complexity:
beamline query basic \
--seed 300 \
--start-auto \
--script-path nested_data.ion \
--sample-count 20 \
rand-sefw \
--project-rand-min 3 \
--project-rand-max 8 \
--project-path-depth-min 2 \
--project-path-depth-max 6 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 5 \
--tbl-flt-path-depth-max 6 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-all \
--tbl-flt-type-final-all \
--exclude-rand-min 1 \
--exclude-rand-max 4 \
--exclude-path-depth-min 2 \
--exclude-path-depth-max 4 \
--exclude-pathstep-internal-all \
--exclude-pathstep-final-all \
--exclude-type-final-all \
--pred-all
Parameter Combination Patterns
Testing Specific PartiQL Features
Test Wildcard Paths
# Focus on [*] and .* path expressions
beamline query basic \
--seed 400 \
--start-auto \
--script-path array_data.ion \
--sample-count 10 \
rand-sfw \
--project-rand-min 2 \
--project-rand-max 4 \
--project-pathstep-internal-foreach \
--project-pathstep-internal-unpivot \
--project-pathstep-final-foreach \
--project-pathstep-final-unpivot \
--pred-all
Test Deep Nesting
# Generate very deep path expressions
beamline query basic \
--seed 500 \
--start-auto \
--script-path deeply_nested.ion \
--sample-count 8 \
rand-sfw \
--project-path-depth-min 4 \
--project-path-depth-max 8 \
--tbl-flt-path-depth-min 3 \
--tbl-flt-path-depth-max 6 \
--exclude-path-depth-min 2 \
--exclude-path-depth-max 5 \
--pred-all
Test Null Handling
# Focus on null and missing value predicates
beamline query basic \
--seed 600 \
--start-auto \
--script-path nullable_schema.ion \
--sample-count 12 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--pred-absent --pred-logical-and --pred-logical-or
Performance Testing Configurations
Lightweight Queries
# Generate simple, fast-executing queries
beamline query basic \
--seed 700 \
--start-auto \
--script-path performance_data.ion \
--sample-count 25 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-type-final-scalar \
--pred-eq --pred-comparison
Heavy Queries
# Generate complex, resource-intensive queries
beamline query basic \
--seed 800 \
--start-auto \
--script-path large_data.ion \
--sample-count 15 \
rand-sefw \
--project-rand-min 8 \
--project-rand-max 15 \
--project-path-depth-max 6 \
--tbl-flt-rand-min 5 \
--tbl-flt-rand-max 10 \
--tbl-flt-path-depth-max 5 \
--exclude-rand-min 3 \
--exclude-rand-max 8 \
--pred-all
Common Parameter Patterns
Development and Debugging
# Simple, readable queries for development
--project-rand-min 1 --project-rand-max 3
--project-path-depth-max 2
--project-pathstep-final-project
--tbl-flt-rand-min 1 --tbl-flt-rand-max 2
--tbl-flt-path-depth-max 2
--pred-eq --pred-comparison
Integration Testing
# Moderate complexity for integration tests
--project-rand-min 2 --project-rand-max 5
--project-path-depth-max 3
--project-pathstep-internal-all --project-pathstep-final-all
--tbl-flt-rand-min 1 --tbl-flt-rand-max 4
--tbl-flt-path-depth-max 3
--exclude-rand-min 1 --exclude-rand-max 2
--pred-all
Stress Testing
# Maximum complexity for stress testing
--project-rand-min 5 --project-rand-max 12
--project-path-depth-min 3 --project-path-depth-max 8
--project-pathstep-internal-all --project-pathstep-final-all
--project-type-final-all
--tbl-flt-rand-min 3 --tbl-flt-rand-max 8
--tbl-flt-path-depth-max 6
--exclude-rand-min 2 --exclude-rand-max 6
--exclude-path-depth-min 2 --exclude-path-depth-max 5
--pred-all
Parameter Validation
Check Parameter Combinations
Some parameter combinations don’t make sense:
# Invalid - min > max
--tbl-flt-rand-min 5 --tbl-flt-rand-max 3 # Error
# Questionable - mixing final path step types may conflict
--project-pathstep-final-project --project-pathstep-final-foreach # May conflict
# Valid - consistent configuration
--tbl-flt-rand-min 1 --tbl-flt-rand-max 5
--project-rand-min 2 --project-rand-max 4
Default Behaviors
When parameters are not specified:
- Path depth: Unbounded (can generate very deep paths)
- Path steps: All types enabled by default
- Type constraints: All types allowed
- Predicate count: Implementation-dependent defaults
Recommendation: Always specify explicit bounds for predictable results.
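Following that recommendation, a fully bounded invocation might look like this (a sketch; `data.ion` stands in for your own script):

```shell
# Every count and depth is bounded explicitly for predictable output
beamline query basic \
  --seed 1 \
  --start-auto \
  --script-path data.ion \
  --sample-count 10 \
  rand-sfw \
  --project-rand-min 1 --project-rand-max 3 \
  --project-path-depth-max 3 \
  --tbl-flt-rand-min 1 --tbl-flt-rand-max 2 \
  --tbl-flt-path-depth-max 2 \
  --pred-comparison
```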
Advanced Configuration Techniques
Targeted Testing
Test Specific Path Types
# Test only wildcard expressions
beamline query basic \
--script-path array_data.ion \
--sample-count 10 \
rand-sfw \
--project-pathstep-final-foreach \
--project-pathstep-final-unpivot \
--tbl-flt-pathstep-final-foreach \
--pred-all
Test Specific Predicates
# Test only string operations
beamline query basic \
--script-path text_data.ion \
--sample-count 10 \
rand-select-all-fw \
--pred-like --pred-in --pred-eq
Graduated Complexity Testing
#!/bin/bash
# Generate test suites with graduated complexity
SCRIPT="test_schema.ion"
BASE_SEED=1000
# Level 1: Simple queries
beamline query basic \
--seed $BASE_SEED \
--start-auto \
--script-path $SCRIPT \
--sample-count 10 \
rand-select-all-fw \
--tbl-flt-rand-min 1 --tbl-flt-rand-max 1 \
--pred-eq > level1.sql
# Level 2: Add comparisons
beamline query basic \
--seed $((BASE_SEED + 1)) \
--start-auto \
--script-path $SCRIPT \
--sample-count 10 \
rand-select-all-fw \
--tbl-flt-rand-min 1 --tbl-flt-rand-max 2 \
--pred-comparison > level2.sql
# Level 3: Add projections
beamline query basic \
--seed $((BASE_SEED + 2)) \
--start-auto \
--script-path $SCRIPT \
--sample-count 10 \
rand-sfw \
--project-rand-min 2 --project-rand-max 4 \
--tbl-flt-rand-min 1 --tbl-flt-rand-max 2 \
--pred-all > level3.sql
# Level 4: Add exclusions
beamline query basic \
--seed $((BASE_SEED + 3)) \
--start-auto \
--script-path $SCRIPT \
--sample-count 10 \
rand-sefw \
--project-rand-min 2 --project-rand-max 3 \
--tbl-flt-rand-min 1 --tbl-flt-rand-max 3 \
--exclude-rand-min 1 --exclude-rand-max 2 \
--pred-all > level4.sql
echo "Generated graduated complexity test suite"
Performance Impact of Parameters
High Performance Impact
These parameters significantly affect query generation performance:
- High path depth (--path-depth-max 10+): exponential complexity growth
- Many projections (--project-rand-max 20+): large SELECT clauses
- Many predicates (--tbl-flt-rand-max 10+): complex WHERE clauses
- All path steps enabled: more path generation options to evaluate
Low Performance Impact
These parameters have minimal impact:
- Predicate type selection: Doesn’t affect generation complexity
- Type constraints: Reduces rather than increases complexity
- Path step restrictions: Reduces generation options
Performance Optimization
# Optimized for speed
beamline query basic \
--script-path data.ion \
--sample-count 100 \
rand-select-all-fw \
--tbl-flt-rand-min 1 --tbl-flt-rand-max 3 \
--tbl-flt-path-depth-max 2 \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-comparison
# Comprehensive but slower
beamline query basic \
--script-path data.ion \
--sample-count 25 \
rand-sefw \
--project-rand-min 5 --project-rand-max 10 \
--project-path-depth-max 5 \
--tbl-flt-rand-min 3 --tbl-flt-rand-max 6 \
--exclude-rand-min 2 --exclude-rand-max 4 \
--pred-all
Troubleshooting Parameter Issues
Common Parameter Errors
Invalid Range Parameters
# Error: min > max
--tbl-flt-rand-min 5 --tbl-flt-rand-max 3
# Fix: min <= max
--tbl-flt-rand-min 3 --tbl-flt-rand-max 5
Conflicting Type Constraints
# May produce unexpected results
--project-type-final-scalar --project-pathstep-final-unpivot # Scalar constraint conflicts with unpivot
# Better: consistent constraints
--project-type-final-all --project-pathstep-final-unpivot
Parameter Testing
# Test parameter combinations before large generation
beamline query basic \
--seed 1 \
--start-auto \
--script-path test.ion \
--sample-count 3 \
rand-sefw \
--project-rand-min 2 --project-rand-max 2 \
--exclude-rand-min 1 --exclude-rand-max 1 \
--pred-eq
# If results look good, scale up
# ... run with --sample-count 50
Best Practices
1. Start Conservative
# Begin with simple parameter values
--tbl-flt-rand-min 1 --tbl-flt-rand-max 2
--project-rand-min 1 --project-rand-max 3
--exclude-rand-min 1 --exclude-rand-max 1
2. Match Parameters to Data Structure
# Simple flat data
--project-path-depth-max 1 --tbl-flt-path-depth-max 1
# Nested object data
--project-path-depth-max 3 --tbl-flt-path-depth-max 3
# Deeply nested data
--project-path-depth-max 6 --tbl-flt-path-depth-max 6
3. Use Consistent Parameter Ranges
# Good - balanced complexity
--project-rand-min 2 --project-rand-max 4
--tbl-flt-rand-min 1 --tbl-flt-rand-max 3
--exclude-rand-min 1 --exclude-rand-max 2
# Avoid - unbalanced (too many exclusions vs projections)
--project-rand-min 1 --project-rand-max 2
--exclude-rand-min 5 --exclude-rand-max 10
4. Document Your Configurations
# Create reusable parameter sets
SIMPLE_CONFIG="--tbl-flt-rand-min 1 --tbl-flt-rand-max 2 --pred-comparison"
MODERATE_CONFIG="--project-rand-min 2 --project-rand-max 4 --tbl-flt-rand-min 1 --tbl-flt-rand-max 3 --pred-all"
beamline query basic --script-path data.ion --sample-count 10 rand-select-all-fw $SIMPLE_CONFIG
beamline query basic --script-path data.ion --sample-count 10 rand-sfw $MODERATE_CONFIG
Reference Quick Guide
Most Common Configurations
Basic Testing:
rand-select-all-fw --tbl-flt-rand-min 1 --tbl-flt-rand-max 2 --pred-comparison
Projection Testing:
rand-sfw --project-rand-min 2 --project-rand-max 4 --tbl-flt-rand-min 1 --tbl-flt-rand-max 2 --pred-all
Exclusion Testing:
rand-select-all-efw --tbl-flt-rand-min 1 --tbl-flt-rand-max 2 --exclude-rand-min 1 --exclude-rand-max 3 --pred-all
Comprehensive Testing:
rand-sefw --project-rand-min 2 --project-rand-max 4 --exclude-rand-min 1 --exclude-rand-max 2 --tbl-flt-rand-min 1 --tbl-flt-rand-max 3 --pred-all
Next Steps
Now that you understand all parameterization options:
- CLI Query Commands - Complete CLI reference with parameter examples
- Basic Queries - Apply parameters to simple query generation
- Advanced Patterns - Use parameters for complex query patterns
- Examples - See parameterization in real-world scenarios
Understanding Shapes
In Beamline, shapes (also called schemas) describe the structure and types of your generated data. Shape inference analyzes Ion scripts to determine what types of data will be generated, without actually generating the full dataset. This is essential for database schema creation, query validation, and understanding your data structure.
What are Shapes?
Shapes are PartiQL’s way of describing data structure and type information:
- Type information: What types each field can contain (INT, VARCHAR, BOOL, etc.)
- Structure information: How data is organized (bags, structs, arrays)
- Constraints: Whether fields are nullable, optional, or have other constraints
- Nested relationships: How complex data structures are organized
Shape Inference Process
How Shape Inference Works
- Script Analysis: Parse the Ion script to understand generators
- Type Resolution: Determine PartiQL types for each generator
- Structure Mapping: Build hierarchical type structure
- Constraint Analysis: Determine nullability and optionality
- Format Output: Generate shapes in requested format
Running Shape Inference
As shown in the README examples, shape inference is run with:
beamline infer-shape \
--seed-auto \
--start-auto \
--script-path sensors.ion
The seed and start time are needed even though no data is generated, as they may affect type inference for certain generators.
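Because the seed can influence inferred types, pinning it (as the later examples do) keeps inference results stable across runs:

```shell
# A fixed seed makes repeated inference runs report the same shapes
beamline infer-shape \
  --seed 1 \
  --start-auto \
  --script-path sensors.ion
```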
Shape Output Formats
Text Format (Default)
Provides detailed type information in Rust debug format:
beamline infer-shape \
--seed-auto \
--start-auto \
--script-path sensors.ion
Example Output:
Seed: 17685918364143248531
Start: 2022-12-12T19:52:29.000000000Z
{
"sensors": PartiqlType(
Bag(
BagType {
element_type: PartiqlType(
Struct(
StructType {
constraints: {
Fields(
{
StructField {
name: "d",
ty: PartiqlType(
DecimalP(
2,
0,
),
),
},
StructField {
name: "f",
ty: PartiqlType(
Float64,
),
},
StructField {
name: "i8",
ty: PartiqlType(
Int64,
),
},
StructField {
name: "tick",
ty: PartiqlType(
Int64,
),
},
StructField {
name: "w",
ty: PartiqlType(
DecimalP(
5,
4,
),
),
},
},
),
},
},
),
),
},
),
),
}
Use Cases:
- Development and debugging
- Understanding complex nested structures
- Detailed type analysis
Basic DDL Format
Generates SQL DDL statements ready for database creation:
beamline infer-shape \
--seed 7844265201457918498 \
--start-auto \
--script-path sensors-nested.ion \
--output-format basic-ddl
Example Output:
-- Seed: 7844265201457918498
-- Start: 2024-01-01T06:53:06.000000000Z
-- Syntax: partiql_datatype_syntax.0.1
-- Dataset: sensors
"f" DOUBLE,
"i8" INT8,
"id" INT,
"sub" STRUCT<"f": DOUBLE,"o": INT8>,
"tick" INT8
Use Cases:
- Creating database tables
- Schema documentation
- Database migration scripts
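For database work, a common pattern is to redirect the DDL output to a file for later table creation (a sketch reusing the command above):

```shell
# Capture the inferred DDL in a schema file
beamline infer-shape \
  --seed 7844265201457918498 \
  --start-auto \
  --script-path sensors-nested.ion \
  --output-format basic-ddl > sensors_nested_schema.sql
```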
Beamline JSON Format
Structured JSON format used by PartiQL testing tools:
beamline infer-shape \
--seed-auto \
--start-auto \
--script-path sensors.ion \
--output-format beamline-json
Example Output:
{
seed: -3711181901898679775,
start: "2022-05-22T13:49:57.000000000+00:00",
shapes: {
sensors: partiql::shape::v0::{
type: "bag",
items: {
type: "struct",
constraints: [
ordered,
closed
],
fields: [
{
name: "d",
type: "decimal(2, 0)"
},
{
name: "f",
type: "double"
},
{
name: "i8",
type: "int8"
},
{
name: "tick",
type: "int8"
},
{
name: "w",
type: "decimal(5, 4)"
}
]
}
}
}
}
Use Cases:
- PartiQL conformance testing
- Tool integration
- Automated testing pipelines
PartiQL Type System
Basic Types
From the examples and implementation:
| PartiQL Type | Description | Ion Script Generator |
|---|---|---|
| INT8 | 8-bit signed integer | UniformI8 |
| INT64 | 64-bit signed integer | UniformI64, Tick |
| DOUBLE | 64-bit floating point | UniformF64, NormalF64 |
| DECIMAL(p,s) | Fixed-precision decimal | UniformDecimal |
| VARCHAR | Variable-length string | UUID, LoremIpsumTitle, Regex |
| BOOL | Boolean value | Bool |
| TIMESTAMP | Date and time | Instant, Date |
Complex Types
| PartiQL Type | Description | Ion Script Generator |
|---|---|---|
| STRUCT<...> | Object with named fields | Nested $data objects |
| ARRAY<T> | Array of type T | UniformArray |
| UNION<T1,T2> | Value can be one of multiple types | UniformAnyOf |
Real Shape Examples
Simple Sensor Shape
From the sensors.ion script:
rand_processes::{
$n: UniformU8::{ low: 2, high: 10 },
sensors: $n::[
rand_process::{
$data: {
tick: Tick,
i8: UniformI8,
f: UniformF64,
d: UniformDecimal::{ low: 0d0, high: 4.2d1, nullable: false }
}
}
]
}
Inferred Shape (DDL):
-- Dataset: sensors
"f" DOUBLE,
"i8" INT8,
"tick" INT8,
"d" DECIMAL(2, 0) NOT NULL
Complex Nested Shape
From the sensors-nested.ion script:
rand_processes::{
sensors: rand_process::{
$data: {
tick: Tick,
i8: UniformI8,
f: UniformF64,
sub: {
o: UniformI8,
f: UniformF64
}
}
}
}
Inferred Shape (DDL):
-- Dataset: sensors
"f" DOUBLE,
"i8" INT8,
"sub" STRUCT<"f": DOUBLE,"o": INT8>,
"tick" INT8
Multi-Dataset Shape
From the client-service.ion script with multiple datasets:
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path client-service.ion \
--output-format basic-ddl
Generated Output:
-- Dataset: service
"Account" VARCHAR,
"Operation" VARCHAR,
"Program" VARCHAR,
"Request" VARCHAR,
"StartTime" TIMESTAMP,
"client" VARCHAR,
"success" BOOL
-- Dataset: client_0
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL
-- Dataset: client_1
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL
Notice how each dataset gets its own schema section.
Nullability in Shapes
Nullable vs Non-Nullable Fields
Shape inference detects nullability configuration from scripts:
rand_processes::{
test_data: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: {
required_field: UUID::{ nullable: false },
nullable_field: UniformI32::{ nullable: 0.2, low: 1, high: 100 },
optional_field: UniformDecimal::{ optional: 0.1, low: 0.0, high: 100.0 }
}
}
}
Inferred Shape:
-- Dataset: test_data
"required_field" VARCHAR NOT NULL, -- nullable: false
"nullable_field" INT, -- nullable: 0.2 (can be NULL)
"optional_field" OPTIONAL DECIMAL(3, 1) -- optional: 0.1 (can be MISSING)
CLI Nullability Defaults
Global CLI defaults affect inferred shapes:
# With default nullability
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path simple_data.ion \
--default-nullable true \
--default-optional true \
--output-format basic-ddl
Result:
-- All fields become nullable and optional by default
"field1" OPTIONAL INT,
"field2" OPTIONAL VARCHAR,
"field3" OPTIONAL BOOL
Shape Inference Workflow
Development Workflow
#!/bin/bash
# Shape-driven development workflow
SCRIPT="new_data_model.ion"
echo "1. Creating initial Ion script..."
cat > $SCRIPT << 'EOF'
rand_processes::{
user_data: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: {
user_id: UUID,
age: UniformU8::{ low: 18, high: 80 },
email: Format::{ pattern: "user{UUID}@example.com" },
active: Bool::{ p: 0.8 }
}
}
}
EOF
echo "2. Inferring shape..."
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path $SCRIPT \
--output-format basic-ddl > schema.sql
echo "3. Generated schema:"
cat schema.sql
echo "4. Testing with small sample..."
beamline gen data \
--seed 1 \
--start-auto \
--script-path $SCRIPT \
--sample-count 5 \
--output-format text
echo "Shape-driven development complete!"
Schema Validation
# Validate schema matches expectations
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path production_schema.ion \
--output-format basic-ddl > expected_schema.sql
# Compare with previous version
diff previous_schema.sql expected_schema.sql
# Generate sample data to verify
beamline gen data \
--seed 1 \
--start-auto \
--script-path production_schema.ion \
--sample-count 10
Complex Shape Examples
Arrays and Union Types
rand_processes::{
complex_data: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: {
measurements: UniformArray::{
min_size: 2,
max_size: 5,
element_type: UniformF64::{ low: 0.0, high: 100.0 }
},
mixed_value: UniformAnyOf::{
types: [
UUID,
UniformI32::{ low: 1, high: 1000 },
Bool
]
}
}
}
}
Inferred Shape:
-- Dataset: complex_data
"measurements" ARRAY<DOUBLE>,
"mixed_value" UNION<VARCHAR,INT,BOOL>
Deeply Nested Structures
rand_processes::{
nested_data: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: {
user: {
profile: {
personal: {
name: LoremIpsumTitle,
age: UniformU8::{ low: 18, high: 80 }
},
preferences: {
theme: Uniform::{ choices: ["light", "dark"] },
notifications: Bool
}
},
stats: {
login_count: UniformU32,
last_seen: Instant
}
}
}
}
}
Inferred Shape:
-- Dataset: nested_data
"user" STRUCT<
"profile": STRUCT<
"personal": STRUCT<"age": TINYINT,"name": VARCHAR>,
"preferences": STRUCT<"notifications": BOOL,"theme": VARCHAR>
>,
"stats": STRUCT<"last_seen": TIMESTAMP,"login_count": INT>
>
Shape Analysis and Validation
Schema Consistency Checking
# Infer shapes from multiple related scripts
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path user_v1.ion \
--output-format basic-ddl > user_v1_schema.sql
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path user_v2.ion \
--output-format basic-ddl > user_v2_schema.sql
# Compare schemas for compatibility
echo "Schema changes between versions:"
diff user_v1_schema.sql user_v2_schema.sql
Multi-Dataset Schema Analysis
# Analyze all datasets in a complex script
beamline infer-shape \
--seed 42 \
--start-auto \
--script-path client-service.ion \
--output-format basic-ddl > all_schemas.sql
# Extract individual dataset schemas
grep -A 20 '^-- Dataset: service' all_schemas.sql > service_schema.sql
grep -A 20 '^-- Dataset: client_0' all_schemas.sql > client_schema.sql
Shape-Based Development
Database Schema Generation
#!/bin/bash
# Generate database schemas from Ion scripts
SCRIPT="$1"
OUTPUT_DIR="./schemas"
if [ -z "$SCRIPT" ]; then
echo "Usage: $0 <script.ion>"
exit 1
fi
mkdir -p "$OUTPUT_DIR"
BASENAME=$(basename "$SCRIPT" .ion)
echo "Generating schemas for $SCRIPT..."
# Generate SQL DDL schema
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format basic-ddl > "$OUTPUT_DIR/${BASENAME}_schema.sql"
# Generate Beamline JSON for testing tools
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format beamline-json > "$OUTPUT_DIR/${BASENAME}_schema.json"
echo "Schemas generated:"
echo " SQL DDL: $OUTPUT_DIR/${BASENAME}_schema.sql"
echo " JSON: $OUTPUT_DIR/${BASENAME}_schema.json"
# Show summary
echo ""
echo "Schema summary:"
grep '^-- Dataset:' "$OUTPUT_DIR/${BASENAME}_schema.sql" | while read -r line; do
  dataset=$(echo "$line" | cut -d: -f2 | xargs)
  # -e keeps the "-- Dataset:" pattern from being parsed as a grep option
  field_count=$(grep -A 100 -e "$line" "$OUTPUT_DIR/${BASENAME}_schema.sql" | grep '^"' | head -20 | wc -l)
  echo "  $dataset: $field_count fields"
done
Schema Documentation
# Generate schema documentation for all scripts
for script in scripts/*.ion; do
echo "## $(basename "$script" .ion)" >> SCHEMAS.md
echo "" >> SCHEMAS.md
echo "Generated from: \`$script\`" >> SCHEMAS.md
echo "" >> SCHEMAS.md
echo '```sql' >> SCHEMAS.md
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$script" \
--output-format basic-ddl >> SCHEMAS.md
echo '```' >> SCHEMAS.md
echo "" >> SCHEMAS.md
done
Understanding Type Mappings
Ion Generator to PartiQL Type Mapping
Based on the actual implementation and README:
| Ion Generator | PartiQL Type | DDL Representation |
|---|---|---|
| Bool | BOOL | BOOL |
| UniformI8 | INT64 | TINYINT or INT8 |
| UniformI16 | INT64 | SMALLINT or INT16 |
| UniformI32 | INT64 | INT |
| UniformI64 | INT64 | BIGINT |
| UniformU8 | INT64 | TINYINT |
| UniformU16 | INT64 | SMALLINT |
| UniformU32 | INT64 | INT |
| UniformU64 | INT64 | BIGINT |
| UniformF64 | DOUBLE | DOUBLE |
| UniformDecimal | DECIMAL(p,s) | DECIMAL(p,s) |
| UUID | STRING | VARCHAR |
| LoremIpsumTitle | STRING | VARCHAR |
| Regex | STRING | VARCHAR |
| Format | STRING | VARCHAR |
| Instant | DATETIME | TIMESTAMP |
| Date | DATETIME | DATE or TIMESTAMP |
| Tick | INT64 | INT8 or INT64 |
Precision and Scale Inference
For decimal types, Beamline infers precision and scale:
rand_processes::{
decimal_test: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: {
price: UniformDecimal::{ low: 9.99, high: 999.99 }, // DECIMAL(5,2)
weight: UniformDecimal::{ low: 0.5, high: 10.9999 }, // DECIMAL(6,4)
percentage: UniformDecimal::{ low: 0d0, high: 1d2 } // DECIMAL(3,0)
}
}
}
Inferred Shape:
-- Dataset: decimal_test
"price" DECIMAL(5, 2),
"weight" DECIMAL(6, 4),
"percentage" DECIMAL(3, 0)
Schema Evolution and Migration
Schema Version Comparison
#!/bin/bash
# Compare schema versions for migration planning
OLD_SCRIPT="data_model_v1.ion"
NEW_SCRIPT="data_model_v2.ion"
# Generate schemas for both versions
beamline infer-shape --seed 1 --start-auto --script-path $OLD_SCRIPT --output-format basic-ddl > v1_schema.sql
beamline infer-shape --seed 1 --start-auto --script-path $NEW_SCRIPT --output-format basic-ddl > v2_schema.sql
echo "Schema Migration Analysis"
echo "========================="
# Show differences
echo "Changes between v1 and v2:"
diff -u v1_schema.sql v2_schema.sql
echo ""
echo "Migration considerations:"
# Check for removed fields (breaking changes)
if grep -v "^--" v1_schema.sql | grep -v "^$" > v1_fields.txt &&
grep -v "^--" v2_schema.sql | grep -v "^$" > v2_fields.txt; then
removed_fields=$(comm -23 v1_fields.txt v2_fields.txt)
if [ -n "$removed_fields" ]; then
echo "⚠️ Breaking changes - removed fields:"
echo "$removed_fields"
fi
added_fields=$(comm -13 v1_fields.txt v2_fields.txt)
if [ -n "$added_fields" ]; then
echo "✅ Added fields (non-breaking):"
echo "$added_fields"
fi
fi
rm -f v1_fields.txt v2_fields.txt
Database Migration Script Generation
#!/bin/bash
# Generate database migration scripts
OLD_SCHEMA="$1"
NEW_SCHEMA="$2"
echo "-- Database Migration Script"
echo "-- Generated: $(date)"
echo "-- From: $OLD_SCHEMA"
echo "-- To: $NEW_SCHEMA"
echo ""
# This is a simplified example - real migration would be more complex
echo "-- Review changes manually:"
echo "-- $(diff --brief $OLD_SCHEMA $NEW_SCHEMA)"
echo ""
echo "-- Add new columns (example):"
comm -13 <(grep '^"' $OLD_SCHEMA | sort) <(grep '^"' $NEW_SCHEMA | sort) | while read -r field; do
echo "ALTER TABLE dataset_name ADD COLUMN $field;"
done
Integration Patterns
CI/CD Schema Validation
#!/bin/bash
# CI/CD pipeline schema validation
set -e
echo "Validating Ion script schemas..."
for script in scripts/*.ion; do
echo "Checking $(basename "$script")..."
# Validate script produces valid schema
if ! beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$script" \
--output-format text > /dev/null 2>&1; then
echo "❌ Error: Invalid script $script"
exit 1
fi
echo "✅ $(basename "$script") - valid schema"
done
echo "All schemas validated successfully!"
Documentation Generation
# Generate schema documentation
generate_schema_docs() {
local script_dir="$1"
local output_file="$2"
echo "# Data Model Documentation" > "$output_file"
echo "" >> "$output_file"
echo "Generated: $(date)" >> "$output_file"
echo "" >> "$output_file"
for script in "$script_dir"/*.ion; do
local name=$(basename "$script" .ion)
echo "## $name" >> "$output_file"
echo "" >> "$output_file"
echo "Script: \`$script\`" >> "$output_file"
echo "" >> "$output_file"
echo '```sql' >> "$output_file"
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$script" \
--output-format basic-ddl >> "$output_file"
echo '```' >> "$output_file"
echo "" >> "$output_file"
done
}
generate_schema_docs "data_models" "DATA_MODEL_SCHEMAS.md"
Best Practices
1. Always Validate Shapes
# Before generating large datasets, check the shape
beamline infer-shape --seed 1 --start-auto --script-path new_model.ion
2. Use Appropriate Output Formats
# DDL for database work
beamline infer-shape --seed 1 --start-auto --script-path data.ion --output-format basic-ddl
# Text for debugging
beamline infer-shape --seed 1 --start-auto --script-path data.ion --output-format text
# JSON for automation
beamline infer-shape --seed 1 --start-auto --script-path data.ion --output-format beamline-json
3. Document Schema Changes
# Track schema evolution
git add schemas/
git commit -m "Update user data model schema
Added:
- user.preferences.theme field
- user.stats.last_login timestamp
Modified:
- user.profile.age now optional (nullable: 0.1)"
4. Validate Schema Compatibility
# Ensure query compatibility with schema changes
beamline infer-shape --seed 1 --start-auto --script-path new_schema.ion --output-format basic-ddl > new_schema.sql
# Generate test queries against new schema
beamline query basic \
--seed 2 \
--start-auto \
--script-path new_schema.ion \
--sample-count 10 \
rand-select-all-fw \
--pred-all > validation_queries.sql
echo "Schema and queries generated for validation testing"
Next Steps
Now that you understand shapes and schema inference:
- Shape Inference - Advanced shape inference techniques and analysis
- Output Formats - Deep dive into all schema output formats
- CLI Shape Commands - Complete CLI reference for shape operations
- Database Integration - Using shapes for database creation
Shape Inference
Shape inference is the process of analyzing Ion scripts to determine the data types and structures that will be generated, without actually generating data. This is extremely fast and useful for schema validation, database preparation, and understanding data models.
Shape Inference Command
Basic Usage
The infer-shape command requires the same core parameters as data generation:
beamline infer-shape \
--seed-auto \
--start-auto \
--script-path sensors.ion
Even though no data is generated, seed and start time may affect type inference for certain dynamic generators.
With Specific Parameters
# Use specific seed for reproducible shape inference
beamline infer-shape \
--seed 12345 \
--start-iso "2024-01-01T00:00:00Z" \
--script-path complex_schema.ion \
--output-format basic-ddl
Output Format Analysis
Text Format (Detailed Debug)
From the README example:
$ beamline infer-shape \
--seed-auto \
--start-auto \
--script-path sensors.ion
Seed: 17685918364143248531
Start: 2022-12-12T19:52:29.000000000Z
{
"sensors": PartiqlType(
Bag(
BagType {
element_type: PartiqlType(
Struct(
StructType {
constraints: {
Fields(
{
StructField {
name: "d",
ty: PartiqlType(
DecimalP(2, 0),
),
},
StructField {
name: "f",
ty: PartiqlType(
Float64,
),
},
// ... more fields
},
),
},
},
),
),
},
),
),
}
Understanding the structure:
- Bag: Collection of records (dataset)
- BagType: Type information for the bag
- Struct: Each record is a structured object
- StructField: Individual field definitions with names and types
- PartiqlType: Specific type information (DecimalP, Float64, etc.)
Basic DDL Format (SQL Ready)
From the README example:
$ beamline infer-shape \
--seed 7844265201457918498 \
--start-auto \
--script-path sensors-nested.ion \
--output-format basic-ddl
-- Seed: 7844265201457918498
-- Start: 2024-01-01T06:53:06.000000000Z
-- Syntax: partiql_datatype_syntax.0.1
-- Dataset: sensors
"f" DOUBLE,
"i8" INT8,
"id" INT,
"sub" STRUCT<"f": DOUBLE,"o": INT8>,
"tick" INT8
Format characteristics:
- Comments: Metadata about generation parameters
- Dataset headers: Clear separation between datasets
- SQL-ready: Can be used directly in CREATE TABLE statements
- Type precision: Specific SQL types with precision for decimals
Beamline JSON Format (Tool Integration)
From the README example:
$ beamline infer-shape \
--seed-auto \
--start-auto \
--script-path sensors.ion \
--output-format beamline-json
{
seed: -3711181901898679775,
start: "2022-05-22T13:49:57.000000000+00:00",
shapes: {
sensors: partiql::shape::v0::{
type: "bag",
items: {
type: "struct",
constraints: [ordered, closed],
fields: [
{
name: "d",
type: "decimal(2, 0)"
},
{
name: "f",
type: "double"
},
{
name: "i8",
type: "int8"
},
{
name: "tick",
type: "int8"
},
{
name: "w",
type: "decimal(5, 4)"
}
]
}
}
}
}
Format characteristics:
- Structured JSON: Machine-readable format
- Versioned: partiql::shape::v0:: indicates the format version
- Complete metadata: Seeds, timestamps, and full type information
- Tool integration: Designed for PartiQL testing tools
Advanced Shape Inference
CLI Global Defaults Impact
CLI defaults affect shape inference results:
# Infer with default nullable/optional settings
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path data.ion \
--default-nullable false \
--default-optional true \
--output-format basic-ddl
From the README example showing CLI impact:
$ beamline infer-shape \
--seed 7844265201457918498 \
--start-auto \
--script-path sensors.ion \
--output-format basic-ddl \
--default-nullable false \
--default-optional true
-- Seed: 7844265201457918498
-- Start: 2024-01-18T11:40:34.000000000Z
-- Syntax: partiql_datatype_syntax-0.1
-- Dataset: sensors
"a" OPTIONAL UNION<INT8 NOT NULL,DECIMAL(5, 4) NOT NULL,DOUBLE NOT NULL,VARCHAR NOT NULL>,
"ar1" OPTIONAL ARRAY<DECIMAL(2, 1) NOT NULL> NOT NULL,
"ar2" OPTIONAL ARRAY<VARCHAR NOT NULL> NOT NULL,
"ar3" OPTIONAL ARRAY<DECIMAL(5, 4)> NOT NULL,
"ar4" OPTIONAL ARRAY<TINYINT NOT NULL> NOT NULL,
"ar5" OPTIONAL ARRAY<UNION<INT8 NOT NULL,DECIMAL(5, 4) NOT NULL,DOUBLE NOT NULL,VARCHAR NOT NULL>> NOT NULL,
"d" OPTIONAL DECIMAL(2, 0) NOT NULL,
"f" OPTIONAL DOUBLE NOT NULL,
"i8" OPTIONAL TINYINT NOT NULL,
"tick" OPTIONAL INT8 NOT NULL,
"w" OPTIONAL DECIMAL(5, 4)
Notice how the CLI defaults made fields OPTIONAL (from --default-optional true) and NOT NULL (from --default-nullable false); fields that configure nullability explicitly in the script, such as "w", keep their own setting.
Multi-Dataset Shape Analysis
# Analyze complex multi-dataset script
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path client-service.ion \
--output-format basic-ddl
Example output structure:
-- Dataset: service
"Account" VARCHAR,
"Distance" DECIMAL(2, 0),
"Operation" VARCHAR,
"Program" VARCHAR,
"Request" VARCHAR,
"StartTime" TIMESTAMP,
"Weight" DECIMAL(5, 4),
"anyof" UNION<INT8,DECIMAL(5, 4)>,
"array" ARRAY<INT8>,
"client" VARCHAR,
"success" BOOL
-- Dataset: client_0
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL
-- Dataset: client_1
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL
Each dataset from the Ion script gets its own schema section.
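A quick way to summarize multi-dataset output like this is to count fields per -- Dataset: section. The sketch below runs awk over an inline sample rather than a live beamline call:

```shell
#!/bin/bash
# Sketch: per-dataset field counts from basic-ddl text (inline sample data).
ddl='-- Dataset: service
"Account" VARCHAR,
"Request" VARCHAR,
"success" BOOL
-- Dataset: client_0
"id" VARCHAR,
"request_id" VARCHAR'

summary=$(echo "$ddl" | awk '
  /^-- Dataset:/ { if (name) print name ": " n " fields"; name = $3; n = 0; next }
  /^"/           { n++ }
  END            { if (name) print name ": " n " fields" }')
echo "$summary"
```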
Shape Inference Patterns
Script Validation Workflow
#!/bin/bash
# Validate Ion script before data generation
SCRIPT="$1"
if [ ! -f "$SCRIPT" ]; then
echo "Script not found: $SCRIPT"
exit 1
fi
echo "Validating Ion script: $SCRIPT"
# Test shape inference (fast validation)
if ! beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format text > /dev/null; then
echo "❌ Script validation failed - check Ion syntax"
exit 1
fi
echo "✅ Script syntax valid"
# Show inferred schema
echo ""
echo "Inferred schema:"
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format basic-ddl
echo ""
echo "✅ Script ready for data generation"
Schema Documentation Generation
#!/bin/bash
# Auto-generate schema documentation
SCRIPTS_DIR="$1"
OUTPUT_FILE="$2"
echo "# Data Schema Documentation" > "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
echo "Auto-generated: $(date)" >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
for script in "$SCRIPTS_DIR"/*.ion; do
name=$(basename "$script" .ion)
echo "Processing $name..."
echo "## $name Data Schema" >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
echo "**Source Script**: \`$(basename "$script")\`" >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
# Add schema in SQL format
echo '```sql' >> "$OUTPUT_FILE"
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$script" \
--output-format basic-ddl | grep -v "^-- Seed:" | grep -v "^-- Start:" >> "$OUTPUT_FILE"
echo '```' >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
# Count datasets and fields
schema_output=$(beamline infer-shape --seed 1 --start-auto --script-path "$script" --output-format basic-ddl)
dataset_count=$(echo "$schema_output" | grep -c "^-- Dataset:")
field_count=$(echo "$schema_output" | grep -c '^"')
echo "**Summary**: $dataset_count dataset(s), $field_count total fields" >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
done
echo "Schema documentation generated: $OUTPUT_FILE"
Real-World Examples
E-commerce Schema Analysis
rand_processes::{
$n_customers: UniformU8::{ low: 10, high: 100 },
$customer_ids: $n_customers::[UUID::()],
customers: static_data::{
$data: {
customer_id: Uniform::{ choices: $customer_ids },
name: LoremIpsumTitle,
email: Format::{ pattern: "customer{UUID}@email.com" },
age: UniformU8::{ low: 18, high: 80, optional: 0.1 }
}
},
orders: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: minutes::30 },
$data: {
order_id: UUID,
customer_id: Uniform::{ choices: $customer_ids },
total: UniformDecimal::{ low: 10.00, high: 500.00 },
items: UniformArray::{
min_size: 1,
max_size: 5,
element_type: {
product_name: LoremIpsumTitle,
price: UniformDecimal::{ low: 5.00, high: 100.00 },
quantity: UniformU8::{ low: 1, high: 3 }
}
}
}
}
}
Inferred Schema:
$ beamline infer-shape --seed 1 --start-auto --script-path ecommerce.ion --output-format basic-ddl
-- Dataset: customers
"age" OPTIONAL TINYINT,
"customer_id" VARCHAR,
"email" VARCHAR,
"name" VARCHAR
-- Dataset: orders
"customer_id" VARCHAR,
"items" ARRAY<STRUCT<"price": DECIMAL(5, 2),"product_name": VARCHAR,"quantity": TINYINT>>,
"order_id" VARCHAR,
"total" DECIMAL(5, 2)
Financial Data Schema
rand_processes::{
transactions: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: minutes::5 },
$data: {
transaction_id: UUID::{ nullable: false },
account_id: UUID,
amount: LogNormalF64::{ location: 4.0, scale: 1.0 },
transaction_type: Uniform::{ choices: ["deposit", "withdrawal", "transfer"] },
risk_score: UniformF64::{ low: 0.0, high: 1.0 },
metadata: {
merchant: LoremIpsumTitle,
location: Regex::{ pattern: "[A-Z]{2}" },
processing_time: UniformF64::{ low: 0.1, high: 5.0 }
},
compliance: {
aml_flagged: Bool::{ p: 0.01 },
requires_review: Bool::{ p: 0.05 },
risk_category: Uniform::{ choices: ["low", "medium", "high"] }
}
}
}
}
Inferred Schema:
-- Dataset: transactions
"account_id" VARCHAR,
"amount" DOUBLE,
"compliance" STRUCT<"aml_flagged": BOOL,"requires_review": BOOL,"risk_category": VARCHAR>,
"metadata" STRUCT<"location": VARCHAR,"merchant": VARCHAR,"processing_time": DOUBLE>,
"risk_score" DOUBLE,
"transaction_id" VARCHAR NOT NULL,
"transaction_type" VARCHAR
Shape Inference Analysis
Schema Complexity Assessment
#!/bin/bash
# Analyze schema complexity
SCRIPT="$1"
echo "Schema Complexity Analysis for: $SCRIPT"
echo "======================================"
# Get detailed shape information
schema_output=$(beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format basic-ddl)
# Count datasets
dataset_count=$(echo "$schema_output" | grep -c "^-- Dataset:")
echo "Datasets: $dataset_count"
# Count total fields
field_count=$(echo "$schema_output" | grep -c '^"')
echo "Total fields: $field_count"
# Count complex types
struct_count=$(echo "$schema_output" | grep -c "STRUCT<")
array_count=$(echo "$schema_output" | grep -c "ARRAY<")
union_count=$(echo "$schema_output" | grep -c "UNION<")
echo "Complex types:"
echo " Structs: $struct_count"
echo " Arrays: $array_count"
echo " Unions: $union_count"
# Count nullable/optional fields
nullable_count=$(echo "$schema_output" | grep -v "NOT NULL" | grep -c '^"')
optional_count=$(echo "$schema_output" | grep -c "OPTIONAL")
echo "Nullability:"
echo " Nullable fields: $nullable_count"
echo " Optional fields: $optional_count"
echo ""
echo "Complexity Score: $((field_count + struct_count * 2 + array_count * 2 + union_count * 3))"
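To see the scoring arithmetic in isolation, here is the same formula applied to an inline sample schema (fabricated output) instead of a live infer-shape call:

```shell
#!/bin/bash
# Sketch: complexity score = fields + 2*structs + 2*arrays + 3*unions,
# computed over a fabricated schema snippet instead of real beamline output.
schema_output='-- Dataset: orders
"id" VARCHAR,
"items" ARRAY<STRUCT<"price": DECIMAL(5, 2),"qty": TINYINT>>,
"kind" UNION<INT8,VARCHAR>'

field_count=$(echo "$schema_output" | grep -c '^"')       # 3 field lines
struct_count=$(echo "$schema_output" | grep -c "STRUCT<") # 1
array_count=$(echo "$schema_output" | grep -c "ARRAY<")   # 1
union_count=$(echo "$schema_output" | grep -c "UNION<")   # 1
score=$((field_count + struct_count * 2 + array_count * 2 + union_count * 3))
echo "Complexity Score: $score"
```

Note that grep -c counts matching lines, so a field whose type nests several structs on one line still counts once.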
Multi-Format Schema Comparison
#!/bin/bash
# Compare schema formats for analysis
SCRIPT="$1"
BASE_NAME=$(basename "$SCRIPT" .ion)
echo "Generating schema in all formats for: $SCRIPT"
# Generate all three formats
beamline infer-shape --seed 1 --start-auto --script-path "$SCRIPT" --output-format text > "${BASE_NAME}_debug.txt"
beamline infer-shape --seed 1 --start-auto --script-path "$SCRIPT" --output-format basic-ddl > "${BASE_NAME}_schema.sql"
beamline infer-shape --seed 1 --start-auto --script-path "$SCRIPT" --output-format beamline-json > "${BASE_NAME}_schema.json"
echo "Generated schema files:"
echo " Debug format: ${BASE_NAME}_debug.txt ($(wc -l < "${BASE_NAME}_debug.txt") lines)"
echo " SQL DDL format: ${BASE_NAME}_schema.sql ($(wc -l < "${BASE_NAME}_schema.sql") lines)"
echo " JSON format: ${BASE_NAME}_schema.json ($(wc -l < "${BASE_NAME}_schema.json") lines)"
# Show summary from SQL format
echo ""
echo "Schema summary:"
grep "^-- Dataset:" "${BASE_NAME}_schema.sql" | while read -r line; do
dataset=$(echo "$line" | cut -d: -f2 | xargs)
echo " Dataset: $dataset"
done
Shape Inference Optimization
Fast Schema Validation
Shape inference is much faster than data generation:
# Quick validation of multiple scripts
for script in models/*.ion; do
echo -n "$(basename "$script"): "
start_time=$(date +%s.%N)
if beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$script" \
--output-format text > /dev/null; then
end_time=$(date +%s.%N)
duration=$(echo "$end_time - $start_time" | bc -l)
echo "✅ Valid (${duration}s)"
else
echo "❌ Invalid"
fi
done
Batch Schema Generation
#!/bin/bash
# Generate schemas for all scripts in parallel
SCRIPTS_DIR="$1"
OUTPUT_DIR="$2"
mkdir -p "$OUTPUT_DIR"
echo "Generating schemas for all scripts in $SCRIPTS_DIR..."
# Process scripts in parallel
for script in "$SCRIPTS_DIR"/*.ion; do
{
name=$(basename "$script" .ion)
echo "Processing $name..."
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$script" \
--output-format basic-ddl > "$OUTPUT_DIR/${name}_schema.sql"
echo "✅ $name completed"
} &
done
wait # Wait for all background jobs
echo "All schema generation completed"
# Summary
echo ""
echo "Generated schemas:"
for file in "$OUTPUT_DIR"/*.sql; do
lines=$(wc -l < "$file")
echo " $(basename "$file"): $lines lines"
done
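The fan-out pattern above ({ ... } & followed by wait) is independent of Beamline; this minimal stub replaces the beamline call with a placeholder function so the control flow itself can be exercised anywhere:

```shell
#!/bin/bash
# Sketch: parallel per-item processing with background jobs and wait.
# process_item is a placeholder standing in for a real beamline invocation.
outdir=$(mktemp -d)
process_item() { echo "schema for $1" > "$outdir/$1.sql"; }

for name in alpha beta gamma; do
  { process_item "$name"; } &   # each item runs in the background
done
wait                            # block until every background job finishes

generated=$(ls "$outdir" | wc -l | tr -d '[:space:]')
echo "Generated $generated schema files"
```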
Troubleshooting Shape Inference
Common Issues
Script Syntax Errors
$ beamline infer-shape --seed 1 --start-auto --script-path bad_syntax.ion
Error: Failed to parse Ion script: Invalid Ion syntax at line 5, column 10
Solution: Check Ion script syntax, ensure balanced braces and proper structure.
Missing Required Parameters
$ beamline infer-shape --script-path data.ion
Error: One of --seed-auto or --seed is required
Error: One of --start-auto, --start-epoch-ms, or --start-iso is required
Solution: Always provide seed and start time parameters.
Invalid Generator Configuration
# This will fail during shape inference
rand_processes::{
bad_data: rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
$data: {
invalid_range: UniformI32::{ low: 100, high: 50 } // min > max
}
}
}
Solution: Check generator configurations for valid parameter ranges.
Performance Troubleshooting
Shape inference should be very fast (milliseconds). If it’s slow:
# Check for complex nested structures
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path suspected_slow.ion \
--output-format text | grep -c "nested_struct"
Very deep nesting (10+ levels) might slow shape inference slightly.
Integration Examples
Database Schema Creation Pipeline
#!/bin/bash
# Complete database schema creation pipeline
SCRIPT="$1"
DATABASE_NAME="$2"
echo "Creating database from Ion script: $SCRIPT"
# 1. Validate script and infer schema
echo "Step 1: Validating script and inferring schema..."
if ! beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format basic-ddl > schema.sql; then
echo "❌ Schema inference failed"
exit 1
fi
# 2. Create database
echo "Step 2: Creating database $DATABASE_NAME..."
createdb "$DATABASE_NAME"
# 3. Generate CREATE TABLE statements
echo "Step 3: Generating CREATE TABLE statements..."
grep "^-- Dataset:" schema.sql | while read -r line; do
dataset=$(echo "$line" | cut -d: -f2 | xargs)
echo "CREATE TABLE $dataset (" > "table_${dataset}.sql"
# Extract fields for this dataset (simplified)
grep -A 100 -e "$line" schema.sql | grep '^"' | head -20 >> "table_${dataset}.sql"
echo ");" >> "table_${dataset}.sql"
echo "Creating table: $dataset"
psql -d "$DATABASE_NAME" -f "table_${dataset}.sql"
done
echo "✅ Database $DATABASE_NAME created with schema from $SCRIPT"
Schema Testing Integration
#!/bin/bash
# Test schema consistency across development workflow
SCRIPT="user_model.ion"
SEED=12345
echo "Testing schema consistency workflow..."
# 1. Infer baseline schema
beamline infer-shape \
--seed $SEED \
--start-auto \
--script-path "$SCRIPT" \
--output-format basic-ddl > baseline_schema.sql
# 2. Generate test data using same script
beamline gen data \
--seed $SEED \
--start-auto \
--script-path "$SCRIPT" \
--sample-count 100 \
--output-format ion-pretty > test_data.ion
# 3. Generate test queries using same script
beamline query basic \
--seed $((SEED + 1)) \
--start-auto \
--script-path "$SCRIPT" \
--sample-count 10 \
rand-select-all-fw \
--pred-all > test_queries.sql
echo "Consistency test completed:"
echo " Schema: baseline_schema.sql"
echo " Data: test_data.ion ($(jq '.data | to_entries[0].value | length' test_data.ion 2>/dev/null || echo 'N/A') records)"
echo " Queries: test_queries.sql ($(wc -l < test_queries.sql) queries)"
# 4. Validate all components reference same structure
echo ""
echo "✅ Schema, data, and queries all generated from same Ion script"
echo "✅ Consistency guaranteed by same script source"
Best Practices
1. Use Shape Inference Early
# Always validate scripts before large data generation
beamline infer-shape --seed 1 --start-auto --script-path new_script.ion
# Then proceed with data generation
beamline gen data --seed 1 --start-auto --script-path new_script.ion --sample-count 100000
2. Choose Format for Purpose
# Development and debugging
beamline infer-shape --seed 1 --start-auto --script-path script.ion --output-format text
# Database integration
beamline infer-shape --seed 1 --start-auto --script-path script.ion --output-format basic-ddl
# Tool integration and automation
beamline infer-shape --seed 1 --start-auto --script-path script.ion --output-format beamline-json
3. Version Control Schemas
# Track schema evolution alongside scripts
git add scripts/user_model.ion schemas/user_model_schema.sql
git commit -m "Add user model v2 with preferences and stats
Schema changes:
- Added user.preferences nested object
- Added user.stats.login_count field
- Made user.profile.age optional"
4. Validate Schema Changes
# Before deploying schema changes
beamline infer-shape --seed 1 --start-auto --script-path new_version.ion --output-format basic-ddl > new_schema.sql
diff old_schema.sql new_schema.sql
# Test compatibility with existing queries
# your-query-validator --schema new_schema.sql --queries existing_queries.sql
Next Steps
Now that you understand shape inference:
- Schema Output Formats - Deep dive into text, DDL, and JSON formats
- CLI Shape Commands - Complete CLI reference
- Database Integration - Using inferred schemas for database creation
- Query Generation - How shapes enable query generation
Schema Output Formats
Beamline supports three distinct output formats for schema information, each optimized for different use cases. Understanding these formats helps you choose the right one for your workflow and integrate schemas effectively with your tools and processes.
Available Schema Formats
The infer-shape command supports three output formats via --output-format:
| Format | Description | Use Case | Performance |
|---|---|---|---|
| text | Rust debug format with detailed type information | Development, debugging | Fast |
| basic-ddl | SQL DDL statements ready for database creation | Database integration | Fast |
| beamline-json | Structured JSON for PartiQL testing tools | Tool integration, automation | Fast |
Text Format (Default)
Characteristics
- Detailed type information: Complete PartiQL type system representation
- Rust debug format: Shows internal type structures
- Development focused: Ideal for understanding complex type relationships
- Human readable: With some practice, easy to understand
Example Output
From the README example with sensors.ion:
$ beamline infer-shape \
--seed-auto \
--start-auto \
--script-path sensors.ion
Seed: 17685918364143248531
Start: 2022-12-12T19:52:29.000000000Z
{
"sensors": PartiqlType(
Bag(
BagType {
element_type: PartiqlType(
Struct(
StructType {
constraints: {
Fields(
{
StructField {
name: "d",
ty: PartiqlType(
DecimalP(
2,
0,
),
),
},
StructField {
name: "f",
ty: PartiqlType(
Float64,
),
},
StructField {
name: "i8",
ty: PartiqlType(
Int64,
),
},
StructField {
name: "tick",
ty: PartiqlType(
Int64,
),
},
StructField {
name: "w",
ty: PartiqlType(
DecimalP(
5,
4,
),
),
},
},
),
},
},
),
),
},
),
),
}
Understanding Text Format Structure
- PartiqlType: Root type wrapper
- Bag: Collection type (dataset)
- BagType: Container for element type information
- Struct: Record structure
- StructType: Detailed structure information
- StructField: Individual field with name and type
- DecimalP(5, 4): Decimal with precision 5, scale 4
- Float64: 64-bit floating point
- Int64: 64-bit integer
Use Cases
- Development: Understanding complex type relationships
- Debugging: Detailed analysis of type inference
- Learning: Understanding PartiQL type system
- Tool development: Building PartiQL-aware tools
Basic DDL Format
Characteristics
- SQL-ready: Can be used directly in CREATE TABLE statements
- Human readable: Easy to understand for database developers
- Production focused: Ready for database integration
- Compact: Concise representation
Example Output
From the README example with sensors-nested.ion:
$ beamline infer-shape \
--seed 7844265201457918498 \
--start-auto \
--script-path sensors-nested.ion \
--output-format basic-ddl
-- Seed: 7844265201457918498
-- Start: 2024-01-01T06:53:06.000000000Z
-- Syntax: partiql_datatype_syntax.0.1
-- Dataset: sensors
"f" DOUBLE,
"i8" INT8,
"id" INT,
"sub" STRUCT<"f": DOUBLE,"o": INT8>,
"tick" INT8
Format Structure
- Header comments: Generation metadata for reproducibility
- Syntax version: DDL syntax version identifier
- Dataset sections: -- Dataset: name comments separate different datasets
- Field definitions: SQL column definitions ready for CREATE TABLE
- Type specifications: Precise SQL types with dimensions
DDL Type Mapping
| PartiQL Internal | DDL Output | Description |
|---|---|---|
| Int64 | INT8, TINYINT, INT, BIGINT | Size depends on generator range |
| Float64 | DOUBLE | 64-bit floating point |
| DecimalP(p,s) | DECIMAL(p,s) | Fixed precision decimal |
| String | VARCHAR | Variable length string |
| Bool | BOOL | Boolean type |
| DateTime | TIMESTAMP | Date and time |
| Struct | STRUCT<...> | Nested object structure |
| Array | ARRAY<T> | Array of type T |
| Union | UNION<T1,T2,...> | One of several types |
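Scripts can mirror this mapping when post-processing DDL output. The classifier below is a sketch; the category names are invented for the example, not defined by Beamline:

```shell
#!/bin/bash
# Sketch: coarse classification of basic-ddl type tokens.
# The category names are invented for this example.
classify_type() {
  case "$1" in
    "STRUCT<"*)              echo "nested" ;;
    "ARRAY<"*)               echo "collection" ;;
    "UNION<"*)               echo "union" ;;
    DECIMAL*)                echo "decimal" ;;
    TINYINT|INT8|INT|BIGINT) echo "integer" ;;
    *)                       echo "scalar" ;;
  esac
}

classify_type 'ARRAY<INT8>'    # prints: collection
classify_type 'DECIMAL(5, 4)'  # prints: decimal
classify_type 'VARCHAR'        # prints: scalar
```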
Complete Database Creation
#!/bin/bash
# Create complete database from DDL output
SCRIPT="ecommerce.ion"
DB_NAME="ecommerce_test"
# Generate complete DDL
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format basic-ddl > schema.ddl
# Create database
createdb "$DB_NAME"
# Extract and create each table
grep "^-- Dataset:" schema.ddl | while read -r line; do
dataset=$(echo "$line" | cut -d: -f2 | xargs)
{
echo "CREATE TABLE $dataset ("
# Extract fields until next dataset or end of file
sed -n "/-- Dataset: $dataset/,/-- Dataset:/p" schema.ddl | \
grep '^"' | \
sed '$ s/,$//' # Remove trailing comma from last line
echo ");"
} > "${dataset}_table.sql"
echo "Creating table: $dataset"
psql -d "$DB_NAME" -f "${dataset}_table.sql"
rm "${dataset}_table.sql"
done
echo "Database $DB_NAME created successfully"
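The extraction step in the pipeline above is easy to get wrong, so it helps to exercise the sed range and trailing-comma cleanup against a small inline sample first (fabricated DDL; no beamline or psql required):

```shell
#!/bin/bash
# Sketch: extract one dataset's fields from basic-ddl text and assemble a
# CREATE TABLE statement. The schema file content is fabricated sample data.
cat > schema.ddl <<'EOF'
-- Dataset: users
"id" VARCHAR,
"name" VARCHAR,
-- Dataset: orders
"order_id" VARCHAR,
"total" DECIMAL(5, 2)
EOF

table_sql=$(
  echo "CREATE TABLE users ("
  sed -n "/^-- Dataset: users/,/^-- Dataset:/p" schema.ddl |
    grep '^"' |
    sed '$ s/,$//'   # drop the trailing comma after the last field
  echo ");"
)
rm -f schema.ddl
echo "$table_sql"
```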
Use Cases
- Database creation: Direct CREATE TABLE generation
- Schema documentation: Human-readable reference
- Migration scripts: Database schema evolution
- SQL integration: Compatible with SQL databases
Beamline JSON Format
Characteristics
- Machine readable: Structured JSON for programmatic processing
- Tool integration: Designed for PartiQL testing frameworks
- Versioned: Includes format version information
- Complete metadata: Full type and constraint information
Example Output
From the README example:
$ beamline infer-shape \
--seed-auto \
--start-auto \
--script-path sensors.ion \
--output-format beamline-json
{
seed: -3711181901898679775,
start: "2022-05-22T13:49:57.000000000+00:00",
shapes: {
sensors: partiql::shape::v0::{
type: "bag",
items: {
type: "struct",
constraints: [
ordered,
closed
],
fields: [
{
name: "d",
type: "decimal(2, 0)"
},
{
name: "f",
type: "double"
},
{
name: "i8",
type: "int8"
},
{
name: "tick",
type: "int8"
},
{
name: "w",
type: "decimal(5, 4)"
}
]
}
}
}
}
JSON Format Structure
- seed: Random seed used for inference
- start: Start timestamp used for inference
- shapes: Dictionary of dataset name to shape definition
- partiql::shape::v0::: Format version identifier
- type: "bag": Collection type (dataset)
- items: Element type definition for bag contents
- constraints: Structural constraints (ordered, closed)
- fields: Array of field definitions
- Field objects: name and type for each field
Processing JSON Format
# Extract dataset names
beamline infer-shape --seed 1 --start-auto --script-path multi.ion --output-format beamline-json | \
jq -r '.shapes | keys[]'
# Count fields in each dataset
beamline infer-shape --seed 1 --start-auto --script-path multi.ion --output-format beamline-json | \
jq -r '.shapes | to_entries[] | "\(.key): \(.value.items.fields | length) fields"'
# Extract field types for specific dataset
beamline infer-shape --seed 1 --start-auto --script-path data.ion --output-format beamline-json | \
jq -r '.shapes.users.items.fields[] | "\(.name): \(.type)"'
Use Cases
- Automated testing: PartiQL conformance test suites
- Tool integration: Schema-aware development tools
- CI/CD pipelines: Automated schema validation
- Documentation generation: Programmatic documentation creation
Format Comparison
Size and Performance
For the same schema with multiple datasets:
# Generate all formats for comparison
beamline infer-shape --seed 1 --start-auto --script-path complex.ion --output-format text > schema.txt
beamline infer-shape --seed 1 --start-auto --script-path complex.ion --output-format basic-ddl > schema.sql
beamline infer-shape --seed 1 --start-auto --script-path complex.ion --output-format beamline-json > schema.json
# Compare sizes
ls -lh schema.*
# Example results:
# -rw-r--r-- 1 user user 8.2K schema.txt (most detailed)
# -rw-r--r-- 1 user user 1.5K schema.sql (most compact)
# -rw-r--r-- 1 user user 3.1K schema.json (structured)
Information Density
| Format | Type Detail | Structure Info | Metadata | Processability |
|---|---|---|---|---|
| text | Very High | Very High | High | Low |
| basic-ddl | Medium | Medium | Medium | High (SQL) |
| beamline-json | High | High | High | High (JSON) |
Format-Specific Integration
Text Format Analysis
# Analyze complex type structures
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path nested_structures.ion \
--output-format text | \
grep -A 20 "StructField" # Extract field information
# Count nesting levels
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path deep_nesting.ion \
--output-format text | \
grep -c "Struct(" # Count nested structures
DDL Format Database Integration
#!/bin/bash
# Complete database integration workflow
SCRIPT="$1"
DATABASE="$2"
# Generate DDL schema
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format basic-ddl > full_schema.sql
# Create database
createdb "$DATABASE"
# Process each dataset into a table
current_dataset=""
while IFS= read -r line; do
if [[ $line == *"-- Dataset:"* ]]; then
# Start new dataset
if [[ -n "$current_dataset" ]]; then
echo ");" >> "${current_dataset}.sql"
psql -d "$DATABASE" -f "${current_dataset}.sql"
rm "${current_dataset}.sql"
fi
current_dataset=$(echo "$line" | cut -d: -f2 | xargs)
echo "CREATE TABLE $current_dataset (" > "${current_dataset}.sql"
elif [[ $line == \"* ]]; then
# Add field to current table
echo " $line" >> "${current_dataset}.sql"
fi
done < full_schema.sql
# Handle last dataset
if [[ -n "$current_dataset" ]]; then
echo ");" >> "${current_dataset}.sql"
psql -d "$DATABASE" -f "${current_dataset}.sql"
rm "${current_dataset}.sql"
fi
echo "Database $DATABASE created with all tables"
JSON Format Automation
#!/bin/bash
# Automated schema processing with JSON format
SCRIPT="$1"
echo "Analyzing schema from $SCRIPT..."
# Generate JSON schema
schema_json=$(beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format beamline-json)
# Extract metadata
seed=$(echo "$schema_json" | jq -r '.seed')
start=$(echo "$schema_json" | jq -r '.start')
echo "Schema generated with seed: $seed, start: $start"
# Analyze each dataset
echo "$schema_json" | jq -r '.shapes | keys[]' | while read -r dataset; do
echo ""
echo "Dataset: $dataset"
# Get field count
field_count=$(echo "$schema_json" | jq --arg ds "$dataset" '.shapes[$ds].items.fields | length')
echo " Fields: $field_count"
# List all fields with types
echo " Field details:"
echo "$schema_json" | jq -r --arg ds "$dataset" '.shapes[$ds].items.fields[] | "    \(.name): \(.type)"'
# Check for complex types
complex_count=$(echo "$schema_json" | jq -r --arg ds "$dataset" '.shapes[$ds].items.fields[] | select(.type | contains("STRUCT") or contains("ARRAY") or contains("UNION")) | .name' | wc -l)
if [ "$complex_count" -gt 0 ]; then
echo " Complex types: $complex_count fields"
fi
done
Multi-Dataset Schema Handling
Separating Dataset Schemas
Different formats handle multiple datasets differently:
Text Format
All datasets in single output, nested under their names:
{
"service": PartiqlType(Bag(...)),
"client_0": PartiqlType(Bag(...)),
"client_1": PartiqlType(Bag(...))
}
Basic DDL Format
Datasets separated by comments:
-- Dataset: service
"Account" VARCHAR,
"Request" VARCHAR,
-- Dataset: client_0
"id" VARCHAR,
"request_id" VARCHAR,
-- Dataset: client_1
"id" VARCHAR,
"request_id" VARCHAR,
JSON Format
Datasets as separate objects in shapes dictionary:
{
"shapes": {
"service": { "type": "bag", "items": {...} },
"client_0": { "type": "bag", "items": {...} },
"client_1": { "type": "bag", "items": {...} }
}
}
Dataset-Specific Schema Extraction
#!/bin/bash
# Extract schema for specific dataset
SCRIPT="$1"
DATASET="$2"
FORMAT="$3"
case $FORMAT in
"ddl")
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format basic-ddl | \
sed -n "/-- Dataset: $DATASET/,/-- Dataset:/p" | \
head -n -1 # Remove next dataset header
;;
"json")
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format beamline-json | \
jq --arg ds "$DATASET" '.shapes[$ds]'
;;
*)
echo "Usage: $0 <script.ion> <dataset_name> <ddl|json>"
exit 1
;;
esac
Nullability and Optionality in Formats
How Each Format Represents Absent Values
Text Format
StructField {
name: "nullable_field",
ty: PartiqlType(Int64), // Type doesn't show nullability directly
}
DDL Format
"required_field" VARCHAR NOT NULL, -- nullable: false
"nullable_field" INT, -- nullable: true (default)
"optional_field" OPTIONAL VARCHAR -- optional: true
JSON Format
{
"name": "nullable_field",
"type": "int64" // Nullability not directly visible
}
Note: DDL format provides the clearest nullability information.
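Because DDL carries the clearest signal, nullability can be read back out of it mechanically. A small sketch (the field lines and category labels are illustrative, not a Beamline feature):

```shell
#!/bin/bash
# Sketch: classify basic-ddl field lines by nullability and optionality.
# Category labels are invented for this example.
nullability_of() {
  case "$1" in
    *OPTIONAL*"NOT NULL"*) echo "optional, non-null when present" ;;
    *OPTIONAL*)            echo "optional, nullable" ;;
    *"NOT NULL"*)          echo "required" ;;
    *)                     echo "nullable" ;;
  esac
}

nullability_of '"required_field" VARCHAR NOT NULL,'  # prints: required
nullability_of '"nullable_field" INT,'               # prints: nullable
nullability_of '"optional_field" OPTIONAL VARCHAR'   # prints: optional, nullable
```

Ordering matters in the case statement: the combined OPTIONAL plus NOT NULL pattern must be tested before the two single-keyword patterns.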
CLI Defaults Impact on Formats
# With CLI nullability defaults
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path data.ion \
--default-nullable true \
--default-optional true \
--output-format basic-ddl
Result shows CLI impact:
-- All fields affected by CLI defaults
"field1" OPTIONAL VARCHAR, -- Made optional by CLI
"field2" OPTIONAL INT NOT NULL, -- Made optional, but explicit nullable: false overrides
"field3" OPTIONAL BOOL -- Both CLI defaults applied
Advanced Format Usage
Schema Evolution Tracking
#!/bin/bash
# Track schema changes across versions
OLD_SCRIPT="model_v1.ion"
NEW_SCRIPT="model_v2.ion"
echo "Schema Evolution Analysis"
echo "========================"
# Generate schemas in DDL format for comparison
beamline infer-shape --seed 1 --start-auto --script-path "$OLD_SCRIPT" --output-format basic-ddl > v1_schema.sql
beamline infer-shape --seed 1 --start-auto --script-path "$NEW_SCRIPT" --output-format basic-ddl > v2_schema.sql
# Show changes
echo "Schema changes:"
diff -u v1_schema.sql v2_schema.sql
# Also generate JSON for programmatic analysis
beamline infer-shape --seed 1 --start-auto --script-path "$OLD_SCRIPT" --output-format beamline-json > v1_schema.json
beamline infer-shape --seed 1 --start-auto --script-path "$NEW_SCRIPT" --output-format beamline-json > v2_schema.json
# Count field changes
v1_fields=$(jq -r '.shapes | to_entries[] | .value.items.fields[].name' v1_schema.json | sort)
v2_fields=$(jq -r '.shapes | to_entries[] | .value.items.fields[].name' v2_schema.json | sort)
added_fields=$(comm -13 <(echo "$v1_fields") <(echo "$v2_fields"))
removed_fields=$(comm -23 <(echo "$v1_fields") <(echo "$v2_fields"))
echo ""
echo "Field changes:"
if [ -n "$added_fields" ]; then
echo "Added: $(echo "$added_fields" | tr '\n' ' ')"
fi
if [ -n "$removed_fields" ]; then
echo "Removed: $(echo "$removed_fields" | tr '\n' ' ')"
fi
Multi-Format Documentation
#!/bin/bash
# Generate comprehensive schema documentation
SCRIPT="$1"
BASE_NAME=$(basename "$SCRIPT" .ion)
echo "# Schema Documentation: $BASE_NAME"
echo "Generated: $(date)"
echo ""
# Generate metadata
metadata=$(beamline infer-shape --seed 1 --start-auto --script-path "$SCRIPT" --output-format basic-ddl | head -3)
echo "## Generation Metadata"
echo '```'
echo "$metadata"
echo '```'
echo ""
# SQL DDL for database developers
echo "## SQL DDL Schema"
echo '```sql'
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format basic-ddl | tail -n +4 # Skip metadata lines
echo '```'
echo ""
# JSON for tool developers
echo "## JSON Schema (for tools)"
echo '```json'
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format beamline-json | jq '.shapes'
echo '```'
echo ""
# Analysis summary
echo "## Schema Analysis"
schema_json=$(beamline infer-shape --seed 1 --start-auto --script-path "$SCRIPT" --output-format beamline-json)
dataset_count=$(echo "$schema_json" | jq '.shapes | length')
total_fields=$(echo "$schema_json" | jq '[.shapes[].items.fields | length] | add')
echo "- **Datasets**: $dataset_count"
echo "- **Total fields**: $total_fields"
echo "- **Source script**: \`$SCRIPT\`"
Best Practices
1. Choose Format for Purpose
# Understanding complex types
beamline infer-shape --seed-auto --start-auto --script-path complex.ion --output-format text
# Database creation
beamline infer-shape --seed-auto --start-auto --script-path db_model.ion --output-format basic-ddl
# Automated processing
beamline infer-shape --seed-auto --start-auto --script-path data.ion --output-format beamline-json
2. Use Consistent Parameters
# Always use same seed for reproducible schema generation
beamline infer-shape --seed 1 --start-auto --script-path script.ion --output-format basic-ddl
3. Version Schema Output
# Include format in version control
git add schemas/model_v2.sql schemas/model_v2.json
git commit -m "Add schema v2 in SQL and JSON formats
- SQL DDL for database creation
- JSON for automated tooling"
4. Validate Format Consistency
# Ensure all formats represent same schema
beamline infer-shape --seed 42 --start-auto --script-path test.ion --output-format basic-ddl > test.sql
beamline infer-shape --seed 42 --start-auto --script-path test.ion --output-format beamline-json > test.json
# Extract field count from both formats
sql_fields=$(grep '^"' test.sql | wc -l)
json_fields=$(jq '[.shapes[].items.fields | length] | add' test.json)
if [ "$sql_fields" -eq "$json_fields" ]; then
echo "✅ Schema formats consistent: $sql_fields fields"
else
echo "❌ Format mismatch: SQL=$sql_fields, JSON=$json_fields"
fi
Format Selection Guidelines
By Use Case
| Use Case | Recommended Format | Alternative |
|---|---|---|
| Database creation | basic-ddl | N/A |
| Development debugging | text | basic-ddl |
| Tool integration | beamline-json | N/A |
| Documentation | basic-ddl | beamline-json |
| Schema comparison | basic-ddl | beamline-json |
| CI/CD automation | beamline-json | basic-ddl |
By Consumer
| Consumer | Recommended Format | Rationale |
|---|---|---|
| SQL Database | basic-ddl | Direct CREATE TABLE usage |
| PartiQL Tools | beamline-json | Native PartiQL format |
| Human Review | basic-ddl | Most readable |
| Development Tools | beamline-json | Machine processable |
| Documentation | basic-ddl | Clear and concise |
Next Steps
Now that you understand all schema output formats:
- CLI Shape Commands - Complete CLI reference for shape operations
- Database Integration - Using schemas for database creation
- Understanding Shapes - Fundamental concepts and type mappings
CLI Overview
The Beamline Command Line Interface (CLI) provides access to all of Beamline’s core functionality: data generation, query generation, schema inference, and database creation. The CLI is built using Rust and follows a simple, consistent command structure.
Installation and Setup
Building from Source
The CLI is built as part of the Beamline project:
# Clone the repository
git clone https://github.com/partiql/partiql-beamline.git
cd partiql-beamline
# Build the project (includes CLI)
cargo build --release
# The CLI binary will be available at:
./target/release/beamline
Verification
After building, verify the CLI is working:
# Check version
./target/release/beamline --version
# View help
./target/release/beamline --help
Command Structure
All Beamline CLI commands follow this structure:
beamline <COMMAND> [SUBCOMMAND] [OPTIONS]
Available Commands
The CLI provides four main commands:
1. gen - Data and Database Generation
Generate synthetic data and create databases.
Subcommands:
- data - Generate synthetic data from Ion scripts
- db beamline-lite - Create BeamlineLite database with data and schemas
Example:
beamline gen data --seed-auto --start-auto --sample-count 100 --script-path my_script.ion
2. infer-shape - Schema Inference
Infer data schemas from Ion scripts without generating full datasets.
Example:
beamline infer-shape --seed-auto --start-auto --script-path my_script.ion --output-format basic-ddl
3. query - Query Generation
Generate PartiQL queries that match your data structures.
Subcommands:
- basic - Basic query generation with configurable strategies
Example:
beamline query basic --seed 1234 --start-auto --script-path data_script.ion --sample-count 5 rand-select-all-fw --tbl-flt-rand-min 1 --tbl-flt-rand-max 1 --pred-lt
4. help - Help Information
Display help for commands and subcommands.
Common Options
Several options are shared across multiple commands:
Seed Configuration (Required)
Control reproducibility through seeding:
--seed-auto # Generate random seed automatically
--seed <SEED> # Use specific seed (e.g., --seed 12345)
Start Time Configuration (Required)
Set the simulation start time:
--start-auto # Generate random start time
--start-epoch-ms <EPOCH_MS> # Use Unix timestamp in milliseconds
--start-iso <ISO_8601> # Use ISO 8601 format (e.g., 2024-01-01T00:00:00Z)
Script Configuration (Required)
Provide the Ion script defining data generation:
--script-path <PATH> # Path to Ion script file
--script <SCRIPT_DATA> # Inline Ion script content
Sample Count
Control how much data to generate:
--sample-count <COUNT> # Number of samples (default: 10)
Nullability and Optionality
Configure NULL and MISSING value generation:
--default-nullable <true|false> # Make types nullable by default
--pct-null <PERCENTAGE> # Percentage of NULL values (0.0-1.0)
--default-optional <true|false> # Make types optional by default
--pct-optional <PERCENTAGE> # Percentage of MISSING values (0.0-1.0)
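A wrapper script can validate these percentages before invoking the CLI, failing fast instead of waiting for the command's own error. A small sketch assuming POSIX awk; the pct_ok helper is ours, not part of Beamline:

```shell
# Reject out-of-range percentages before they ever reach the CLI.
# pct_ok succeeds only for numeric values in [0, 1].
pct_ok() {
  awk -v p="$1" 'BEGIN { exit !(p == p + 0 && p >= 0 && p <= 1) }'
}

pct_ok 0.1 && echo "0.1 accepted"   # → 0.1 accepted
pct_ok 1.5 || echo "1.5 rejected"   # → 1.5 rejected
```

Guarding values like `--pct-null "$p"` this way gives your script a clear error message at the point of misconfiguration.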
Output Formats
Data Generation Formats
For gen data, specify output format with --output-format:
| Format | Description | Use Case |
|---|---|---|
| text | Human-readable text (default) | Debugging, inspection |
| ion | Compact Amazon Ion text | Efficient storage |
| ion-pretty | Pretty-printed Ion text | Human-readable Ion |
| ion-binary | Binary Ion format | Most compact |
Example:
beamline gen data --seed-auto --start-auto --script-path data.ion --output-format ion-pretty
Shape Inference Formats
For infer-shape, specify format with --output-format:
| Format | Description | Use Case |
|---|---|---|
| text | Debug format (default) | Development |
| basic-ddl | SQL DDL format | Database schema |
| beamline-json | Beamline JSON format | Testing |
Basic Usage Examples
Generate Data
# Simple data generation
beamline gen data \
--seed-auto \
--start-auto \
--sample-count 1000 \
--script-path sensors.ion
# Reproducible generation with specific seed
beamline gen data \
--seed 12345 \
--start-iso "2024-01-01T00:00:00Z" \
--sample-count 500 \
--script-path user_data.ion \
--output-format ion-pretty
Filter Datasets
Generate data for specific datasets only:
beamline gen data \
--seed 42 \
--start-auto \
--script-path client_service.ion \
--dataset service \
--dataset client_1 \
--sample-count 100
Infer Schema
# Get SQL DDL schema
beamline infer-shape \
--seed-auto \
--start-auto \
--script-path my_script.ion \
--output-format basic-ddl
# Get detailed shape information
beamline infer-shape \
--seed 1234 \
--start-auto \
--script-path complex_data.ion \
--output-format text
Create Database
# Create BeamlineLite database
beamline gen db beamline-lite \
--seed-auto \
--start-auto \
--script-path database_script.ion \
--sample-count 10000
# Custom catalog location
beamline gen db beamline-lite \
--seed 2024 \
--start-auto \
--script-path data.ion \
--catalog_name my-catalog \
--catalog_path ./databases/ \
--sample-count 5000
Generate Queries
# Simple query generation
beamline query basic \
--seed 100 \
--start-auto \
--script-path transactions.ion \
--sample-count 10 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--tbl-flt-path-depth-max 2 \
--pred-all
Configuration with Nullability/Optionality
Control NULL and MISSING value generation:
# Make all types nullable with 10% NULL values
beamline gen data \
--seed 42 \
--start-auto \
--script-path data.ion \
--pct-null 0.1 \
--sample-count 1000
# Make types optional with 5% MISSING values
beamline gen data \
--seed 42 \
--start-auto \
--script-path data.ion \
--pct-optional 0.05 \
--sample-count 1000
# Disable nullability and optionality
beamline gen data \
--seed 42 \
--start-auto \
--script-path data.ion \
--default-nullable false \
--default-optional false \
--sample-count 1000
Error Handling
Common Error Types
Script Not Found
$ beamline gen data --seed-auto --start-auto --script-path missing.ion
Error: Unable to read script file 'missing.ion': No such file or directory
Invalid Ion Script
$ beamline gen data --seed-auto --start-auto --script-path invalid.ion
Error: Failed to parse Ion script: Invalid syntax at line 5
Invalid Seed Value
$ beamline gen data --seed invalid --start-auto --script-path data.ion
Error: Invalid value 'invalid' for '--seed <SEED>': invalid digit found in string
Debug Output
For troubleshooting, examine the generated seed and start time:
$ beamline gen data --seed-auto --start-auto --script-path sensors.ion --sample-count 2
Seed: 12328924104731257599
Start: 2024-01-20T20:05:41.000000000Z
[2024-01-20 20:07:46.532 +00:00:00] : "sensors" { 'f': -2.5436390152455175, 'i8': 4, 'tick': 125532 }
[2024-01-20 20:09:19.756 +00:00:00] : "sensors" { 'f': -63.49308817145054, 'i8': 4, 'tick': 218756 }
The output shows the seed and start time used, allowing you to reproduce the exact same output later.
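If you capture a run's output to a file, the header is easy to parse back out for replay. A sketch over a saved log (the values mirror the sample above):

```shell
# Pull the Seed/Start header out of a captured run so it can be replayed
# with --seed and --start-iso. $log mirrors the sample output above.
log='Seed: 12328924104731257599
Start: 2024-01-20T20:05:41.000000000Z'

seed=$(printf '%s\n' "$log" | awk '/^Seed: / { print $2 }')
start=$(printf '%s\n' "$log" | awk '/^Start: / { print $2 }')

echo "replay with: --seed $seed --start-iso \"$start\""
```

Saving a real run with `tee run.log` makes the same extraction work on live output.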
Integration Patterns
Shell Scripting
#!/bin/bash
# Generate test data for different scenarios
SEED=12345
START_TIME="2024-01-01T00:00:00Z"
# Generate user data
beamline gen data \
--seed $SEED \
--start-iso "$START_TIME" \
--script-path users.ion \
--sample-count 1000 \
--output-format ion-pretty > users_data.ion
# Generate transaction data
beamline gen data \
--seed $((SEED + 1)) \
--start-iso "$START_TIME" \
--script-path transactions.ion \
--sample-count 5000 \
--output-format ion-pretty > transactions_data.ion
echo "Data generation completed!"
Pipeline Integration
# Generate data and pipe to other tools
beamline gen data \
--seed 42 \
--start-auto \
--script-path events.ion \
--sample-count 1000 \
--output-format ion-pretty | \
head -20
# Combine with analysis tools
beamline gen data \
--seed 100 \
--start-auto \
--script-path metrics.ion \
--sample-count 10000 \
--output-format text | \
grep "temperature" | \
wc -l
Best Practices
1. Always Use Seeds for Reproducible Testing
# Good - explicit seed for test scenarios
beamline gen data --seed 12345 --start-iso "2024-01-01T00:00:00Z" --script-path test.ion
# Avoid - auto seed makes reproduction difficult
beamline gen data --seed-auto --start-auto --script-path test.ion
2. Start Small, Scale Up
# Test with small sample first
beamline gen data --seed 1 --start-auto --script-path new_script.ion --sample-count 10
# Scale up after validation
beamline gen data --seed 1 --start-auto --script-path new_script.ion --sample-count 100000
3. Use Appropriate Output Formats
# Ion formats for data processing
beamline gen data --seed-auto --start-auto --script-path data.ion --output-format ion-binary
# Text format for debugging
beamline gen data --seed-auto --start-auto --script-path data.ion --output-format text --sample-count 5
4. Validate Schemas Before Large Generation
# Check schema first
beamline infer-shape --seed-auto --start-auto --script-path data.ion --output-format basic-ddl
# Then generate data
beamline gen data --seed 42 --start-auto --script-path data.ion --sample-count 10000
Next Steps
Now that you understand the CLI overview, explore specific commands:
- Data Commands - Detailed guide to the gen data command
- Query Commands - Comprehensive query command reference
- Shape Commands - Using infer-shape for schema work
- Database Commands - Creating databases with gen db
Data Generation Commands
The beamline gen data command generates synthetic data from Ion scripts using stochastic processes. This is the primary command for creating reproducible pseudo-random data in Beamline.
Command Syntax
beamline gen data [OPTIONS]
Required Options
All data generation requires these three configuration groups (exactly one option from each group):
Seed Configuration (Required - choose one)
--seed-auto # Generate random seed automatically
--seed <SEED> # Use specific numeric seed for reproducibility
Start Time Configuration (Required - choose one)
--start-auto # Generate random start time
--start-epoch-ms <EPOCH_MS> # Use Unix timestamp in milliseconds
--start-iso <ISO_8601> # Use ISO 8601 format (e.g., 2024-01-01T00:00:00Z)
Script Configuration (Required - choose one)
--script-path <PATH> # Path to Ion script file
--script <SCRIPT_DATA> # Inline Ion script content
Optional Parameters
Sample Count
--sample-count <COUNT> # Number of samples to generate (default: 10)
Output Format
--output-format <FORMAT> # Output format (default: text)
Available formats:
- text - Human-readable text format (default)
- ion - Compact Amazon Ion format
- ion-pretty - Pretty-printed Ion text format
- ion-binary - Binary Ion format (most compact)
Dataset Filtering
--dataset <DATASET_NAME> # Include only specific dataset(s)
# Can be used multiple times for multiple datasets
Nullability Configuration (Optional - choose one)
--default-nullable <true|false> # Set default nullability behavior
--pct-null <PERCENTAGE> # Percentage of NULL values (0.0-1.0)
Optionality Configuration (Optional - choose one)
--default-optional <true|false> # Set default optionality behavior
--pct-optional <PERCENTAGE> # Percentage of MISSING values (0.0-1.0)
Basic Examples
Simple Data Generation
# Generate 100 samples with automatic seed and start time
beamline gen data \
--seed-auto \
--start-auto \
--script-path sensors.ion \
--sample-count 100
# Reproducible generation with specific seed
beamline gen data \
--seed 12345 \
--start-iso "2024-01-01T00:00:00Z" \
--script-path user_data.ion \
--sample-count 1000
Different Output Formats
# Text output (human-readable, default)
beamline gen data \
--seed 42 \
--start-auto \
--script-path data.ion \
--output-format text
# Pretty Ion format
beamline gen data \
--seed 42 \
--start-auto \
--script-path data.ion \
--output-format ion-pretty
# Compact binary Ion
beamline gen data \
--seed 42 \
--start-auto \
--script-path data.ion \
--output-format ion-binary
Dataset Filtering
Generate data for specific datasets only:
# Generate data for specific datasets
beamline gen data \
--seed 45121008347100595 \
--start-iso "2020-06-16T14:41:51.000000000Z" \
--script-path client-service.ion \
--sample-count 10 \
--dataset service \
--dataset client_1 \
--output-format ion-pretty
Advanced Configuration
Controlling NULL Values
# Make all types nullable by default with 10% NULL values
beamline gen data \
--seed 100 \
--start-auto \
--script-path data.ion \
--pct-null 0.1 \
--sample-count 500
# Disable nullability entirely
beamline gen data \
--seed 100 \
--start-auto \
--script-path data.ion \
--default-nullable false \
--sample-count 500
Controlling MISSING Values
# Make all types optional with 5% MISSING values
beamline gen data \
--seed 200 \
--start-auto \
--script-path data.ion \
--pct-optional 0.05 \
--sample-count 500
# Disable optionality entirely
beamline gen data \
--seed 200 \
--start-auto \
--script-path data.ion \
--default-optional false \
--sample-count 500
Inline Scripts
For small scripts, you can provide the Ion script content directly:
beamline gen data \
--seed 300 \
--start-auto \
--script 'rand_processes::{ test: rand_process::{ $arrival: HomogeneousPoisson:: { interarrival: seconds::1 }, $data: { id: UniformU8, value: UniformF64 } } }' \
--sample-count 5 \
--output-format text
Reproducibility Examples
Exact Reproduction
# First run - note the seed and start time
beamline gen data \
--seed-auto \
--start-auto \
--script-path sensors.ion \
--sample-count 2
# Output shows:
# Seed: 12328924104731257599
# Start: 2024-01-20T20:05:41.000000000Z
# [data follows...]
# Reproduce exactly the same data
beamline gen data \
--seed 12328924104731257599 \
--start-iso "2024-01-20T20:05:41.000000000Z" \
--script-path sensors.ion \
--sample-count 2
Reproducible with Different Start Times
# Same seed, different start time gives same data pattern at different times
beamline gen data \
--seed 12345 \
--start-iso "2023-01-01T00:00:00Z" \
--script-path events.ion \
--sample-count 5
beamline gen data \
--seed 12345 \
--start-iso "2024-01-01T00:00:00Z" \
--script-path events.ion \
--sample-count 5
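This behavior is the standard seeded-PRNG property: one seed always replays one pseudo-random sequence. A generic illustration with awk's rand (not Beamline's actual generator):

```shell
# Two runs with the same seed replay the same sequence; a different
# seed yields a different one.
run() { awk -v s="$1" 'BEGIN { srand(s); for (i = 0; i < 3; i++) print rand() }'; }

a=$(run 12345)
b=$(run 12345)
c=$(run 54321)

[ "$a" = "$b" ] && echo "same seed: identical sequence"
[ "$a" = "$c" ] || echo "different seed: different sequence"
```

Beamline applies the same idea but anchors the sequence to the start time as well, which is why the seed/start pair fully determines the output.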
Output Format Details
Text Format (Default)
Human-readable format with timestamps and dataset names:
$ beamline gen data --seed 1234 --start-auto --script-path sensors.ion --sample-count 2
Seed: 1234
Start: 2019-08-01T00:00:01.000000000-07:00
[2019-08-01 7:26:21.964 -07:00:00] : "sensors" { 'f': -2.5436390152455175, 'i8': 4, 'tick': 125532 }
[2019-08-10 5:46:15.24 -07:00:00] : "sensors" { 'f': -63.49308817145054, 'i8': 4, 'tick': 218756 }
Ion Pretty Format
Pretty-printed Ion with metadata:
$ beamline gen data --seed 1234 --start-auto --script-path sensors.ion --sample-count 2 --output-format ion-pretty
{
seed: 1234,
start: "2019-08-01T00:00:01.000000000-07:00",
data: {
sensors: [
{
f: -2.5436390152455175e0,
i8: 4,
tick: 125532
},
{
f: -63.49308817145054e0,
i8: 4,
tick: 218756
}
]
}
}
Ion and Ion Binary Formats
- ion - Compact text Ion without pretty printing
- ion-binary - Binary Ion format (most space-efficient)
Both formats preserve all Ion type information and are suitable for programmatic processing.
Static Data Generation
Beamline supports static data generation (data generated before simulation starts):
# Generate data with static customer table and dynamic orders
beamline gen data \
--seed 1234 \
--start-iso "2019-08-01T00:00:01-07:00" \
--script-path orders.ion \
--sample-count 30 \
--output-format text
Static data appears first with the same timestamp, followed by temporally-distributed dynamic data.
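To verify the split in text output, you can count the leading run of records that share the start timestamp. A sketch over hypothetical text-format lines (record contents are illustrative):

```shell
# Static records carry the simulation start timestamp; dynamic records
# follow with later timestamps. Count the leading run of static rows.
out='[2019-08-01 0:00:01.000 -07:00:00] : "customers" { id: 1 }
[2019-08-01 0:00:01.000 -07:00:00] : "customers" { id: 2 }
[2019-08-01 7:26:21.964 -07:00:00] : "orders" { tick: 125532 }'

static=$(printf '%s\n' "$out" | awk -F'] ' '
  NR == 1     { first = $1 }
  $1 == first { n++; next }
  { exit }
  END { print n }
')
echo "static records: $static"   # → static records: 2
```

Running the same pipeline on a real `--output-format text` run gives a quick sanity check that your static tables were emitted.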
Error Handling
Common Error Scenarios
Missing Script File
$ beamline gen data --seed-auto --start-auto --script-path nonexistent.ion
Error: Failed to read script file 'nonexistent.ion': No such file or directory (os error 2)
Invalid Ion Syntax
$ beamline gen data --seed-auto --start-auto --script-path invalid.ion
Error: Failed to parse Ion script: Invalid Ion syntax at line 5, column 10
Missing Required Arguments
$ beamline gen data --script-path data.ion
Error: One of --seed-auto or --seed is required
Error: One of --start-auto, --start-epoch-ms, or --start-iso is required
Invalid Percentage Values
$ beamline gen data --seed-auto --start-auto --script-path data.ion --pct-null 1.5
Error: Percents must be between 0 and 1: `1.5`
Debugging Tips
- Start Small: Use --sample-count 5 to quickly test scripts
- Use Text Format: Default text format is easiest to read for debugging
- Check Seeds: Note auto-generated seeds for reproduction
- Validate Scripts: Use infer-shape to check script syntax first
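These tips combine into a pre-flight wrapper: validate the script cheaply with infer-shape, then generate at scale only on success. A sketch assuming the beamline binary is on PATH; the preflight_gen name and flow are ours:

```shell
# Validate a script cheaply, then generate at full scale only on success.
preflight_gen() {
  script="$1"; count="$2"; seed="$3"
  # infer-shape parses the script without generating a full dataset;
  # a non-zero exit means the Ion script itself is broken.
  if ! beamline infer-shape --seed 1 --start-auto \
       --script-path "$script" --output-format basic-ddl > /dev/null; then
    echo "script failed validation: $script" >&2
    return 1
  fi
  beamline gen data --seed "$seed" --start-auto \
    --script-path "$script" --sample-count "$count"
}
```

Invoked as e.g. `preflight_gen sensors.ion 100000 42`, this avoids discovering a syntax error minutes into a large run.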
Integration Patterns
Shell Scripting
#!/bin/bash
set -e
SCRIPT_PATH="simulation.ion"
OUTPUT_DIR="./generated_data"
SEED=12345
mkdir -p "$OUTPUT_DIR"
# Generate different datasets
echo "Generating user data..."
beamline gen data \
--seed $SEED \
--start-iso "2024-01-01T00:00:00Z" \
--script-path "$SCRIPT_PATH" \
--dataset users \
--sample-count 1000 \
--output-format ion-pretty > "$OUTPUT_DIR/users.ion"
echo "Generating transaction data..."
beamline gen data \
--seed $((SEED + 1)) \
--start-iso "2024-01-01T00:00:00Z" \
--script-path "$SCRIPT_PATH" \
--dataset transactions \
--sample-count 5000 \
--output-format ion-pretty > "$OUTPUT_DIR/transactions.ion"
echo "Data generation completed!"
Pipeline Processing
# Generate and process data in pipeline
beamline gen data \
--seed 42 \
--start-auto \
--script-path metrics.ion \
--sample-count 1000 \
--output-format text | \
grep "temperature" | \
awk '{ print $NF }' | \
head -10
# Generate multiple formats simultaneously
beamline gen data \
--seed 100 \
--start-auto \
--script-path data.ion \
--sample-count 1000 \
--output-format ion-pretty | \
tee data_out.ion | \
head -20
Testing Workflows
# Generate test data with specific characteristics
generate_test_data() {
local seed=$1
local sample_count=$2
local script=$3
beamline gen data \
--seed "$seed" \
--start-iso "2024-01-01T00:00:00Z" \
--script-path "$script" \
--sample-count "$sample_count" \
--default-nullable false \
--default-optional false \
--output-format ion-pretty
}
# Use in tests
generate_test_data 12345 100 "test_users.ion" > test_users_data.ion
generate_test_data 12346 50 "test_orders.ion" > test_orders_data.ion
Performance Considerations
Sample Count Impact
- Small counts (--sample-count 10-100): Near-instantaneous
- Medium counts (--sample-count 1000-10000): Seconds
- Large counts (--sample-count 100000+): Minutes, depending on script complexity
Output Format Performance
- text - Moderate performance, human-readable
- ion-binary - Fastest and most compact
- ion - Fast, compact text format
- ion-pretty - Slowest due to formatting overhead
Memory Usage
Beamline streams data generation, so memory usage stays constant regardless of sample count. Large datasets are processed incrementally.
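A practical consequence of streaming is that a downstream consumer closing the pipe stops generation early, so piping a huge run through head costs only the lines you keep. A generic illustration (not Beamline-specific):

```shell
# head closes the pipe after three lines; the producer receives SIGPIPE
# and stops, so an unbounded stream still finishes instantly.
yes sample | head -3
```

If Beamline's generation streams the same way, piping `--sample-count 100000` output through `head -20` should pay only for the lines actually kept.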
Best Practices
1. Use Specific Seeds for Testing
# Good - reproducible
beamline gen data --seed 12345 --start-iso "2024-01-01T00:00:00Z" --script-path test.ion
# Avoid - non-reproducible
beamline gen data --seed-auto --start-auto --script-path test.ion
2. Start with Small Sample Counts
# Validate script first
beamline gen data --seed 1 --start-auto --script-path new_script.ion --sample-count 5
# Scale up after validation
beamline gen data --seed 1 --start-auto --script-path new_script.ion --sample-count 10000
3. Use Appropriate Output Formats
# Human inspection
beamline gen data --seed-auto --start-auto --script-path data.ion --output-format text --sample-count 10
# Data processing
beamline gen data --seed-auto --start-auto --script-path data.ion --output-format ion-binary --sample-count 100000
# Configuration files
beamline gen data --seed-auto --start-auto --script-path data.ion --output-format ion-pretty --sample-count 1000
4. Document Your Seeds
# Good practice - document seeds used
# User test data: seed 2024001
# Integration test data: seed 2024002
# Performance test data: seed 2024003
beamline gen data --seed 2024001 --start-auto --script-path users.ion
Next Steps
Now that you understand data generation commands, explore:
- Query Commands - Generate PartiQL queries for your data
- Shape Commands - Infer and work with data schemas
- Database Commands - Create complete databases with data and schemas
- Data Generation Guide - Learn about Ion scripts and generators
Query Commands
The beamline query command generates PartiQL queries that match the shapes and types of data defined in Ion scripts. This allows you to create realistic queries for testing PartiQL implementations.
Command Syntax
beamline query basic [OPTIONS] <STRATEGY>
Required Options
Query generation requires the same core configuration as data generation:
Seed Configuration (Required - choose one)
--seed-auto # Generate random seed automatically
--seed <SEED> # Use specific numeric seed for reproducibility
Start Time Configuration (Required - choose one)
--start-auto # Generate random start time
--start-epoch-ms <EPOCH_MS> # Use Unix timestamp in milliseconds
--start-iso <ISO_8601> # Use ISO 8601 format (e.g., 2024-01-01T00:00:00Z)
Script Configuration (Required - choose one)
--script-path <PATH> # Path to Ion script file
--script <SCRIPT_DATA> # Inline Ion script content
Sample Count
--sample-count <COUNT> # Number of queries to generate (default: 10)
Query Strategies
Beamline supports four different query generation strategies:
1. rand-select-all-fw - SELECT * with WHERE
Generates SELECT * queries with randomly generated WHERE clauses.
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-lt
Example Output:
SELECT * FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 18.418581624952935)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 15.495327785402296)
2. rand-sfw - SELECT fields FROM WHERE
Generates queries with random projections and WHERE clauses.
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-sfw \
--project-rand-min 2 \
--project-rand-max 5 \
--project-path-depth-min 1 \
--project-path-depth-max 1 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 5 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-all
Example Output:
SELECT test_data.completed, test_data.completed FROM test_data AS test_data
WHERE (NOT (test_data.completed) OR NOT ((test_data.created_at IS MISSING)))
SELECT test_data.completed, test_data.marketplace_id, test_data.created_at
FROM test_data AS test_data WHERE (NOT ((test_data.transaction_id IS NULL)) OR
(((test_data.transaction_id IN ['Iam in.', 'Se.']) OR
NOT ((test_data.description IS NULL))) OR
(test_data.marketplace_id >= 28)))
3. rand-select-all-efw - SELECT * EXCLUDE WHERE
Generates SELECT * EXCLUDE queries with WHERE clauses.
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-select-all-efw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-lt \
--exclude-rand-min 1 \
--exclude-rand-max 3 \
--exclude-path-depth-min 1 \
--exclude-path-depth-max 1 \
--exclude-pathstep-internal-all \
--exclude-pathstep-final-all \
--exclude-type-final-all
4. rand-sefw - SELECT EXCLUDE FROM WHERE
Generates queries with projections, exclusions, and WHERE clauses.
beamline query basic \
--seed 1234 \
--start-auto \
--script-path simple_transactions.ion \
--sample-count 3 \
rand-sefw \
--project-rand-min 2 \
--project-rand-max 5 \
--project-path-depth-min 1 \
--project-path-depth-max 1 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 5 \
--tbl-flt-path-depth-max 1 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-all \
--exclude-rand-min 1 \
--exclude-rand-max 3 \
--exclude-path-depth-min 1 \
--exclude-path-depth-max 1 \
--exclude-pathstep-internal-all \
--exclude-pathstep-final-all \
--exclude-type-final-all
Parameter Reference
Table Filter Parameters
Control WHERE clause generation:
--tbl-flt-rand-min <N> # Minimum number of predicates (1-255)
--tbl-flt-rand-max <N> # Maximum number of predicates (1-255)
--tbl-flt-path-depth-max <N> # Maximum path depth (1-255)
# Path step types (internal positions)
--tbl-flt-pathstep-internal-all # Enable all internal path step types
--tbl-flt-pathstep-internal-project # Enable projection steps (.field)
--tbl-flt-pathstep-internal-index # Enable index steps ([1])
--tbl-flt-pathstep-internal-foreach # Enable for-each steps ([*])
--tbl-flt-pathstep-internal-unpivot # Enable unpivot steps (.*)
# Path step types (final positions)
--tbl-flt-pathstep-final-all # Enable all final path step types
--tbl-flt-pathstep-final-project # Enable projection steps (.field)
--tbl-flt-pathstep-final-index # Enable index steps ([1])
--tbl-flt-pathstep-final-foreach # Enable for-each steps ([*])
--tbl-flt-pathstep-final-unpivot # Enable unpivot steps (.*)
# Type constraints
--tbl-flt-type-final-all # Allow all final types
--tbl-flt-type-final-scalar # Allow scalar final types only
--tbl-flt-type-final-sequence # Allow sequence final types
--tbl-flt-type-final-struct # Allow struct final types
Predicate Types
Control which predicates can be generated:
--pred-all # Enable all predicates
--pred-lt # Less than (<)
--pred-lte # Less than or equal (<=)
--pred-gt # Greater than (>)
--pred-gte # Greater than or equal (>=)
--pred-eq # Equal (=)
--pred-neq # Not equal (<>)
--pred-between # BETWEEN predicate
--pred-like # LIKE predicate
--pred-not-like # NOT LIKE predicate
--pred-in # IN predicate
--pred-not-in # NOT IN predicate
--pred-is-null # IS NULL
--pred-is-not-null # IS NOT NULL
--pred-is-missing # IS MISSING
--pred-is-not-missing # IS NOT MISSING
--pred-logical-and # AND operator
--pred-logical-or # OR operator
--pred-logical-not # NOT operator
Projection Parameters (for rand-sfw and rand-sefw)
Control SELECT clause generation:
--project-rand-min <N> # Minimum projections (1-255)
--project-rand-max <N> # Maximum projections (1-255)
--project-path-depth-min <N> # Minimum path depth
--project-path-depth-max <N> # Maximum path depth
# Same path step and type options as table filters
--project-pathstep-internal-all # Enable all internal path steps
--project-pathstep-final-all # Enable all final path steps
--project-type-final-all # Allow all final types
Exclusion Parameters (for rand-select-all-efw and rand-sefw)
Control EXCLUDE clause generation:
--exclude-rand-min <N> # Minimum exclusions (1-255)
--exclude-rand-max <N> # Maximum exclusions (1-255)
--exclude-path-depth-min <N> # Minimum path depth
--exclude-path-depth-max <N> # Maximum path depth
# Same path step and type options as table filters
--exclude-pathstep-internal-all # Enable all internal path steps
--exclude-pathstep-final-all # Enable all final path steps
--exclude-type-final-all # Allow all final types
Complex Examples
Deep Path Generation
For nested data structures, control path depth:
beamline query basic \
--seed 1234 \
--start-auto \
--script-path transactions.ion \
--sample-count 3 \
rand-sefw \
--project-rand-min 2 \
--project-rand-max 5 \
--project-path-depth-min 1 \
--project-path-depth-max 10 \
--project-pathstep-internal-all \
--project-pathstep-final-all \
--project-type-final-all \
--tbl-flt-rand-min 2 \
--tbl-flt-rand-max 5 \
--tbl-flt-path-depth-max 10 \
--tbl-flt-pathstep-internal-all \
--tbl-flt-pathstep-final-project \
--tbl-flt-type-final-scalar \
--pred-all \
--exclude-rand-min 1 \
--exclude-rand-max 2 \
--exclude-path-depth-min 3 \
--exclude-path-depth-max 4 \
--exclude-pathstep-internal-all \
--exclude-pathstep-final-unpivot \
--exclude-type-final-all
This generates queries with deeply nested paths like:
SELECT test_data.*.nested_struct.nested_struct.nested_struct.nested_struct.nested_struct.*,
test_data.test_nest_struct.*.*.nested_struct.nested_struct
EXCLUDE test_data.*.*.*.*, test_data.price.*
FROM test_data AS test_data
WHERE ((test_data.test_nest_struct.*.*.*.nested_struct.*.test_int <> 19) OR
(test_data.test_nest_struct.*.*.nested_struct.*.*.test_int > 35))
Reproducible Query Generation
Use specific seeds for consistent query generation:
# Generate same queries each time
beamline query basic \
--seed 12345 \
--start-iso "2024-01-01T00:00:00Z" \
--script-path data.ion \
--sample-count 5 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--pred-all
Best Practices
1. Start with Simple Queries
# Begin with basic queries
beamline query basic \
--seed 1 \
--start-auto \
--script-path data.ion \
--sample-count 5 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 1 \
--pred-eq
2. Match Query Complexity to Data Structure
# Simple data = simple paths
--project-path-depth-max 2
# Complex nested data = deeper paths
--project-path-depth-max 6
3. Use Appropriate Predicates for Testing
# For numeric testing
--pred-lt --pred-gt --pred-between
# For comprehensive testing
--pred-all
4. Validate Generated Queries
Test generated queries against your data to ensure they’re valid and meaningful.
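A minimal sanity check can be scripted with standard tools. The sketch below assumes the queries were saved to test_queries.sql, that each generated query begins with SELECT at the start of a line, and that the script defines a dataset named test_data; all three are placeholders taken from the examples above.

```shell
# Sanity-check a batch of generated queries before using them (sketch).
QUERIES=test_queries.sql
EXPECTED=5   # should match the --sample-count used during generation

if [ -f "$QUERIES" ]; then
  # Count queries: one SELECT per generated query, starting at column 1.
  actual=$(grep -c '^SELECT' "$QUERIES")
  [ "$actual" -eq "$EXPECTED" ] || echo "expected $EXPECTED queries, found $actual" >&2

  # Every batch should reference the dataset it was generated against.
  grep -q 'FROM test_data' "$QUERIES" || echo "no query references test_data" >&2
fi
```

For deeper validation, run each query against the generated data with your PartiQL engine of choice and check that it parses and returns results.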
Integration with Data Generation
Combine query and data generation for complete testing:
# Generate test data
beamline gen data \
--seed 100 \
--start-auto \
--script-path test_data.ion \
--sample-count 1000 \
--output-format ion-pretty > test_data.ion
# Generate matching queries
beamline query basic \
--seed 101 \
--start-auto \
--script-path test_data.ion \
--sample-count 20 \
rand-select-all-fw \
--tbl-flt-rand-min 1 \
--tbl-flt-rand-max 3 \
--pred-all > test_queries.sql
Next Steps
- Shape Commands - Infer schemas from your data
- Database Commands - Create complete test databases
- Query Generation Guide - Learn about query generation theory
Shape Commands
The beamline infer-shape command analyzes Ion scripts to infer data schemas without generating full datasets. This is useful for understanding data structures, creating database schemas, and validating script configurations.
Command Syntax
beamline infer-shape [OPTIONS]
Required Options
Shape inference uses the same core configuration as data generation:
Seed Configuration (Required - choose one)
--seed-auto # Generate random seed automatically
--seed <SEED> # Use specific numeric seed for reproducibility
Start Time Configuration (Required - choose one)
--start-auto # Generate random start time
--start-epoch-ms <EPOCH_MS> # Use Unix timestamp in milliseconds
--start-iso <ISO_8601> # Use ISO 8601 format (e.g., 2024-01-01T00:00:00Z)
Script Configuration (Required - choose one)
--script-path <PATH> # Path to Ion script file
--script <SCRIPT_DATA> # Inline Ion script content
Optional Parameters
Output Format
--output-format <FORMAT> # Shape output format (default: text)
Available formats:
- text - Human-readable debug format (default)
- basic-ddl - SQL DDL format for database schemas
- beamline-json - Beamline JSON format for testing
Nullability and Optionality
--default-nullable <true|false> # Set default nullability behavior
--pct-null <PERCENTAGE> # Percentage of NULL values (0.0-1.0)
--default-optional <true|false> # Set default optionality behavior
--pct-optional <PERCENTAGE> # Percentage of MISSING values (0.0-1.0)
Output Formats
Text Format (Default)
Provides detailed type information in Rust debug format:
$ beamline infer-shape --seed-auto --start-auto --script-path sensors.ion
Seed: 17685918364143248531
Start: 2022-12-12T19:52:29.000000000Z
{
"sensors": PartiqlType(
Bag(
BagType {
element_type: PartiqlType(
Struct(
StructType {
constraints: {
Fields(
{
StructField {
name: "d",
ty: PartiqlType(
DecimalP(2, 0),
),
},
StructField {
name: "f",
ty: PartiqlType(
Float64,
),
},
StructField {
name: "i8",
ty: PartiqlType(
Int64,
),
},
},
),
},
},
),
),
},
),
),
}
Use Cases:
- Development and debugging
- Understanding complex data structures
- Validating script configurations
Basic DDL Format
Generates SQL DDL statements for database schema creation:
$ beamline infer-shape \
--seed 7844265201457918498 \
--start-auto \
--script-path sensors-nested.ion \
--output-format basic-ddl
-- Seed: 7844265201457918498
-- Start: 2024-01-01T06:53:06.000000000Z
-- Syntax: partiql_datatype_syntax.0.1
-- Dataset: sensors
"f" DOUBLE,
"i8" INT8,
"id" INT,
"sub" STRUCT<"f": DOUBLE,"o": INT8>,
"tick" INT8
Use Cases:
- Creating database tables
- Database schema documentation
- SQL migration scripts
- Data warehouse setup
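The basic-ddl output is a field list rather than a complete CREATE TABLE statement, so a small wrapper is needed before feeding it to a SQL database. A sketch; the function name, table name, and file name are illustrative, and the SQL `--` comment lines in the output pass through harmlessly as SQL comments.

```shell
# wrap_ddl: turn a basic-ddl field list into a full CREATE TABLE statement.
# $1 = table name, $2 = path to the field-list file
wrap_ddl() {
  printf 'CREATE TABLE "%s" (\n' "$1"
  cat "$2"
  printf ');\n'
}
```

Usage: `wrap_ddl sensors schema.sql > create_sensors.sql`.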
Beamline JSON Format
Structured JSON format used by PartiQL testing tools:
$ beamline infer-shape \
--seed-auto \
--start-auto \
--script-path sensors.ion \
--output-format beamline-json
{
seed: -3711181901898679775,
start: 2022-05-22T13:49:57.000000000+00:00,
shapes: {
sensors: partiql::shape::v0::{
type: "bag",
items: {
type: "struct",
constraints: [
ordered,
closed
],
fields: [
{
name: "d",
type: "decimal(2, 0)"
},
{
name: "f",
type: "double"
},
{
name: "i8",
type: "int8"
},
{
name: "tick",
type: "int8"
},
{
name: "w",
type: "decimal(5, 4)"
}
]
}
}
}
}
Use Cases:
- PartiQL conformance testing
- Tool integration
- Automated schema validation
Examples
Basic Shape Inference
# Get basic shape information
beamline infer-shape \
--seed-auto \
--start-auto \
--script-path my_data.ion
# Get reproducible shape with specific seed
beamline infer-shape \
--seed 12345 \
--start-auto \
--script-path my_data.ion \
--output-format text
Database Schema Generation
# Generate SQL DDL for database creation
beamline infer-shape \
--seed 100 \
--start-auto \
--script-path ecommerce.ion \
--output-format basic-ddl > schema.sql
# Use in database creation
psql -d mydb -f schema.sql
Multiple Dataset Schemas
# Infer shapes for complex multi-dataset scripts
beamline infer-shape \
--seed 42 \
--start-auto \
--script-path client-service.ion \
--output-format basic-ddl
This outputs schemas for all datasets defined in the script:
-- Dataset: service
"Account" VARCHAR,
"Operation" VARCHAR,
"Program" VARCHAR,
"Request" VARCHAR,
"StartTime" TIMESTAMP,
"client" VARCHAR,
"success" BOOL
-- Dataset: client_0
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL
-- Dataset: client_1
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL
Schema with Nullability and Optionality
Configure NULL and MISSING value behavior in schema:
# Schema with all types nullable and optional
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path data.ion \
--default-nullable true \
--default-optional true \
--output-format basic-ddl
# Output includes nullable/optional markers
"age" OPTIONAL TINYINT,
"name" OPTIONAL VARCHAR NULL,
"active" OPTIONAL BOOL
Schema Validation Workflow
Use shape inference to validate scripts before large data generation:
# 1. Validate script syntax and structure
beamline infer-shape \
--seed-auto \
--start-auto \
--script-path new_script.ion
# 2. Generate SQL schema
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path new_script.ion \
--output-format basic-ddl > schema.sql
# 3. Generate small sample to verify
beamline gen data \
--seed 1 \
--start-auto \
--script-path new_script.ion \
--sample-count 5
# 4. Generate full dataset
beamline gen data \
--seed 1 \
--start-auto \
--script-path new_script.ion \
--sample-count 100000
Integration Patterns
Database Schema Creation
#!/bin/bash
# generate-database-schema.sh
SCRIPT="$1"
OUTPUT_DIR="./schemas"
if [ -z "$SCRIPT" ]; then
echo "Usage: $0 <script.ion>"
exit 1
fi
mkdir -p "$OUTPUT_DIR"
# Generate DDL schema
echo "Generating database schema for $SCRIPT..."
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format basic-ddl > "$OUTPUT_DIR/$(basename "$SCRIPT" .ion).sql"
# Generate Beamline JSON for testing
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$SCRIPT" \
--output-format beamline-json > "$OUTPUT_DIR/$(basename "$SCRIPT" .ion).json"
echo "Schemas generated in $OUTPUT_DIR/"
CI/CD Schema Validation
#!/bin/bash
# validate-schemas.sh - CI pipeline script
set -e
echo "Validating Ion scripts..."
for script in scripts/*.ion; do
echo "Checking $script..."
# Validate script can generate valid schema
if ! beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$script" \
--output-format text > /dev/null; then
echo "ERROR: Invalid script $script"
exit 1
fi
echo "✓ $script is valid"
done
echo "All scripts validated successfully!"
Documentation Generation
# Generate documentation for all data scripts
for script in data_scripts/*.ion; do
name=$(basename "$script" .ion)
echo "## $name Dataset" >> SCHEMAS.md
echo '```sql' >> SCHEMAS.md
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path "$script" \
--output-format basic-ddl >> SCHEMAS.md
echo '```' >> SCHEMAS.md
echo "" >> SCHEMAS.md
done
Error Handling
Common Errors
Script Syntax Errors
$ beamline infer-shape --seed-auto --start-auto --script-path invalid.ion
Error: Failed to parse Ion script: Invalid Ion syntax at line 5, column 10
Missing Required Options
$ beamline infer-shape --script-path data.ion
Error: One of --seed-auto or --seed is required
Error: One of --start-auto, --start-epoch-ms, or --start-iso is required
Invalid Output Format
$ beamline infer-shape --seed-auto --start-auto --script-path data.ion --output-format invalid
Error: 'invalid' isn't a valid value for '--output-format <OUTPUT_FORMAT>'
Performance Considerations
Shape inference is very fast since it doesn’t generate actual data:
- Script Parsing: Milliseconds for typical scripts
- Type Inference: Nearly instantaneous
- Output Generation: Minimal overhead
This makes shape inference ideal for:
- Quick script validation
- CI/CD pipeline checks
- Interactive development workflows
- Documentation generation
Best Practices
1. Validate Scripts Early
# Always infer shape before generating large datasets
beamline infer-shape --seed 1 --start-auto --script-path new_script.ion
2. Use Appropriate Output Formats
# DDL for database work
beamline infer-shape --seed 1 --start-auto --script-path data.ion --output-format basic-ddl
# Text for debugging
beamline infer-shape --seed 1 --start-auto --script-path data.ion --output-format text
# JSON for automation
beamline infer-shape --seed 1 --start-auto --script-path data.ion --output-format beamline-json
3. Document Your Schemas
Save schema outputs for reference and version control:
beamline infer-shape \
--seed 1 \
--start-auto \
--script-path production_data.ion \
--output-format basic-ddl > docs/production_schema.sql
4. Use Consistent Seeds
For reproducible schema documentation:
# Always use seed 1 for schema documentation
beamline infer-shape --seed 1 --start-auto --script-path data.ion --output-format basic-ddl
Next Steps
- Database Commands - Create complete databases with schemas
- Schema Guide - Learn about PartiQL type system
- Data Generation - Generate data matching your schemas
Database Commands
The beamline gen db beamline-lite command creates complete BeamlineLite databases containing both synthetic data and inferred schemas. This provides a complete local database for testing and development.
Command Syntax
beamline gen db beamline-lite [OPTIONS]
Required Options
Database generation uses the same core configuration as data generation:
Seed Configuration (Required - choose one)
--seed-auto # Generate random seed automatically
--seed <SEED> # Use specific numeric seed for reproducibility
Start Time Configuration (Required - choose one)
--start-auto # Generate random start time
--start-epoch-ms <EPOCH_MS> # Use Unix timestamp in milliseconds
--start-iso <ISO_8601> # Use ISO 8601 format (e.g., 2024-01-01T00:00:00Z)
Script Configuration (Required - choose one)
--script-path <PATH> # Path to Ion script file
--script <SCRIPT_DATA> # Inline Ion script content
Optional Parameters
Sample Count
--sample-count <COUNT> # Number of samples to generate (default: 10)
Catalog Configuration
--catalog_name <NAME> # Name of the catalog directory (default: "beamline-catalog")
--catalog_path <PATH> # Path where catalog will be created (default: ".")
--force # Overwrite existing catalog (creates backup first)
Target
--target filesystem # Create filesystem-based database (default and only option)
Nullability and Optionality
--default-nullable <true|false> # Set default nullability behavior
--pct-null <PERCENTAGE> # Percentage of NULL values (0.0-1.0)
--default-optional <true|false> # Set default optionality behavior
--pct-optional <PERCENTAGE> # Percentage of MISSING values (0.0-1.0)
What Gets Created
A BeamlineLite database consists of multiple files in a catalog directory:
Catalog Structure
beamline-catalog/
├── .beamline-manifest # Metadata (seed, start time, DDL syntax version)
├── .beamline-script # Original Ion script used for generation
├── <dataset_name>.ion # Data files (one per dataset)
├── <dataset_name>.shape.ion # Schema files in Ion format
└── <dataset_name>.shape.sql # Schema files in SQL DDL format
Example Catalog Contents
After running:
beamline gen db beamline-lite \
--seed-auto \
--start-auto \
--script-path client-service.ion \
--sample-count 1000
Generated files:
beamline-catalog/
├── .beamline-manifest
├── .beamline-script
├── service.ion
├── service.shape.ion
├── service.shape.sql
├── client_0.ion
├── client_0.shape.ion
├── client_0.shape.sql
├── client_1.ion
├── client_1.shape.ion
├── client_1.shape.sql
└── ... (more client datasets)
File Contents
Manifest File
Contains generation metadata:
$ cat beamline-catalog/.beamline-manifest
{"seed": "949665520117506306", "start": "2023-02-06T12:52:29.000000000Z", "ddl_syntax.version": "partiql_datatype_syntax.0.1"}
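Because the manifest is a single JSON object on one line, its generation parameters can be recovered with standard tools. A sketch using sed, assuming the single-line format shown above (jq would be more robust if it is available):

```shell
# Extract the seed and start time from a catalog manifest (sketch).
MANIFEST=beamline-catalog/.beamline-manifest
if [ -f "$MANIFEST" ]; then
  seed=$(sed -n 's/.*"seed": "\([^"]*\)".*/\1/p' "$MANIFEST")
  start=$(sed -n 's/.*"start": "\([^"]*\)".*/\1/p' "$MANIFEST")
  echo "seed=$seed start=$start"
fi
```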
Script File
Original Ion script used for generation:
$ cat beamline-catalog/.beamline-script
rand_processes::{
// generate between 5 & 20 customers
$n: UniformU8::{ low: 5, high: 20 },
// A generator for client ids
$id_gen: UUID,
// ... rest of script
}
Data Files
Generated synthetic data in Ion format:
$ cat beamline-catalog/client_0.ion
{success: true, id: "7dbd12cf-b506-22ad-2d81-b0a1cd259697", request_id: "0de35d1e-a87c-e540-734d-6f2a4fa410c3", request_time: 2021-01-05T03:55:01.035000000+00:00}
{success: true, id: "7dbd12cf-b506-22ad-2d81-b0a1cd259697", request_id: "3539cdf0-6f7e-6bdc-c25a-4e0b7d8f8bac", request_time: 2021-01-05T03:55:01.182000000+00:00}
Schema Files
Ion format schema:
$ cat beamline-catalog/client_0.shape.ion
{
type: "bag",
items: {
type: "struct",
constraints: [ordered, closed],
fields: [
{ name: "id", type: "string" },
{ name: "request_id", type: "string" },
{ name: "request_time", type: "datetime" },
{ name: "success", type: "bool" }
]
}
}
SQL DDL format schema:
$ cat beamline-catalog/service.shape.sql
"Account" VARCHAR,
"Distance" DECIMAL(2, 0),
"Operation" VARCHAR,
"Program" VARCHAR,
"Request" VARCHAR,
"StartTime" TIMESTAMP,
"Weight" DECIMAL(5, 4),
"client" VARCHAR,
"success" BOOL
Examples
Basic Database Creation
# Create database with default settings
beamline gen db beamline-lite \
--seed-auto \
--start-auto \
--script-path my_data.ion \
--sample-count 1000
# Creates ./beamline-catalog/ with all files
Custom Catalog Configuration
# Create database in custom location
beamline gen db beamline-lite \
--seed 12345 \
--start-iso "2024-01-01T00:00:00Z" \
--script-path production_sim.ion \
--sample-count 50000 \
--catalog_name production-data \
--catalog_path ./databases/
# Creates ./databases/production-data/ with all files
Reproducible Database Creation
# Create reproducible test database
beamline gen db beamline-lite \
--seed 2024 \
--start-iso "2024-01-01T00:00:00Z" \
--script-path test_suite.ion \
--sample-count 10000 \
--catalog_name test-db-2024 \
--default-nullable false \
--default-optional false
Overwriting and Backup
Safe Overwrite with Backup
The CLI protects existing catalogs by default:
$ beamline gen db beamline-lite --seed-auto --start-auto --script-path data.ion
creating directory ./beamline-catalog/ failed with the following error:
File exists (os error 17)
Use --force to overwrite with automatic backup:
$ beamline gen db beamline-lite \
--seed-auto \
--start-auto \
--script-path data.ion \
--force
command is using --force ...
Beamline catalog ./beamline-catalog/ exists, backing it up to "beamline-catalog.2024-05-10T22:15:54.019316000Z.bkp"...
back up completed
writing manifest file ./beamline-catalog/.beamline-manifest ...[COMPLETED]
writing script file ./beamline-catalog/.beamline-script ...[COMPLETED]
writing shape file(s)...[COMPLETED]
writing data file(s)...[COMPLETED]
done!
Database Structure Analysis
Examine Generated Database
# View catalog structure
tree beamline-catalog/
# Examine manifest
cat beamline-catalog/.beamline-manifest
# Check a data file
head -5 beamline-catalog/service.ion
# Check schema
cat beamline-catalog/service.shape.sql
Validate Database Consistency
# Count records in each dataset
for data_file in beamline-catalog/*.ion; do
if [[ "$data_file" != *".shape.ion"* ]]; then
echo "$(basename "$data_file"): $(wc -l < "$data_file") records"
fi
done
Integration Patterns
Testing Database Setup
#!/bin/bash
# setup-test-database.sh
TEST_SEED=12345
TEST_START="2024-01-01T00:00:00Z"
TEST_SAMPLES=10000
echo "Creating test database..."
# Clean up any existing test database
rm -rf test-database/
# Generate test database
beamline gen db beamline-lite \
--seed $TEST_SEED \
--start-iso $TEST_START \
--script-path test_data_spec.ion \
--sample-count $TEST_SAMPLES \
--catalog_name test-database \
--catalog_path . \
--default-nullable false
echo "Test database created in ./test-database/"
echo "Records generated: $TEST_SAMPLES"
echo "Seed used: $TEST_SEED"
echo "Start time: $TEST_START"
Multi-Environment Database Generation
#!/bin/bash
# generate-env-databases.sh
SCRIPT="simulation.ion"
BASE_SEED=2024
# Development environment
beamline gen db beamline-lite \
--seed $BASE_SEED \
--start-iso "2024-01-01T00:00:00Z" \
--script-path $SCRIPT \
--sample-count 1000 \
--catalog_name dev-db \
--catalog_path ./environments/
# Staging environment
beamline gen db beamline-lite \
--seed $((BASE_SEED + 1)) \
--start-iso "2024-01-01T00:00:00Z" \
--script-path $SCRIPT \
--sample-count 10000 \
--catalog_name staging-db \
--catalog_path ./environments/
# Production-like environment
beamline gen db beamline-lite \
--seed $((BASE_SEED + 2)) \
--start-iso "2024-01-01T00:00:00Z" \
--script-path $SCRIPT \
--sample-count 100000 \
--catalog_name prod-like-db \
--catalog_path ./environments/
Database Migration Testing
#!/bin/bash
# test-schema-migration.sh
OLD_SCRIPT="data_v1.ion"
NEW_SCRIPT="data_v2.ion"
# Generate database with old schema
beamline gen db beamline-lite \
--seed 100 \
--start-auto \
--script-path $OLD_SCRIPT \
--catalog_name old-schema \
--sample-count 1000
# Generate database with new schema
beamline gen db beamline-lite \
--seed 100 \
--start-auto \
--script-path $NEW_SCRIPT \
--catalog_name new-schema \
--sample-count 1000
# Compare schemas (diff takes exactly two files, so compare per dataset)
for f in old-schema/*.shape.sql; do
diff "$f" "new-schema/$(basename "$f")"
done
Performance Considerations
Database creation involves:
- Script parsing (milliseconds)
- Data generation (scales with --sample-count)
- Schema inference (nearly instantaneous)
- File I/O (depends on dataset size and disk speed)
Performance Tips
# For large databases, monitor progress
time beamline gen db beamline-lite \
--seed 1 \
--start-auto \
--script-path large_sim.ion \
--sample-count 1000000
# Use faster storage for temporary operations
beamline gen db beamline-lite \
--seed 1 \
--start-auto \
--script-path data.ion \
--catalog_path /tmp/fast-storage/
Best Practices
1. Use Meaningful Catalog Names
# Good - descriptive names
beamline gen db beamline-lite \
--seed-auto \
--start-auto \
--script-path user_analytics.ion \
--catalog_name user-analytics-2024 \
--catalog_path ./databases/
# Avoid - generic names
beamline gen db beamline-lite \
--seed-auto \
--start-auto \
--script-path data.ion \
--catalog_name db
2. Document Generation Parameters
# Create documentation alongside database
beamline gen db beamline-lite \
--seed 12345 \
--start-iso "2024-01-01T00:00:00Z" \
--script-path simulation.ion \
--sample-count 50000 \
--catalog_name analytics-db-v1
# Document the generation
echo "Analytics Database v1
Generated: $(date)
Seed: 12345
Start: 2024-01-01T00:00:00Z
Sample Count: 50000
Script: simulation.ion" > analytics-db-v1/README.txt
3. Use Version Control for Catalog Manifests
Track database generation metadata:
# Add manifest files to version control
git add beamline-catalog/.beamline-manifest
git add beamline-catalog/.beamline-script
git commit -m "Add database generation manifest for test-db v2.1"
4. Backup Before --force Operations
# The CLI creates backups automatically with --force, but verify
ls -la beamline-catalog*.bkp
# Manual backup before --force if desired
cp -r beamline-catalog manual-backup-$(date +%Y%m%d)
beamline gen db beamline-lite --seed-auto --start-auto --script-path updated.ion --force
Use Cases
Local Development Database
# Create local database for development
beamline gen db beamline-lite \
--seed 1000 \
--start-auto \
--script-path dev_data.ion \
--sample-count 5000 \
--catalog_name dev-local
Test Suite Database
# Create comprehensive test database
beamline gen db beamline-lite \
--seed 2024001 \
--start-iso "2024-01-01T00:00:00Z" \
--script-path comprehensive_test.ion \
--sample-count 50000 \
--catalog_name integration-test-db \
--default-nullable false \
--default-optional false
Performance Benchmark Database
# Create large database for performance testing
beamline gen db beamline-lite \
--seed 999999 \
--start-auto \
--script-path performance_test.ion \
--sample-count 1000000 \
--catalog_name perf-benchmark \
--catalog_path ./benchmarks/
Database Analysis
Examine Database Contents
# Check database size
du -sh beamline-catalog/
# Count records per dataset
for f in beamline-catalog/*.ion; do
if [[ "$f" != *".shape.ion"* ]]; then
echo "$(basename "$f" .ion): $(wc -l < "$f") records"
fi
done
# View sample data
head -3 beamline-catalog/service.ion
# View schema
cat beamline-catalog/service.shape.sql
Validate Database Integrity
# Verify manifest matches generation
cat beamline-catalog/.beamline-manifest
# Verify script is preserved
diff original_script.ion beamline-catalog/.beamline-script
# Check all datasets have corresponding schemas
for data in beamline-catalog/*.ion; do
if [[ "$data" != *".shape.ion"* ]]; then
dataset=$(basename "$data" .ion)
if [[ ! -f "beamline-catalog/${dataset}.shape.ion" ]]; then
echo "Missing schema for $dataset"
fi
fi
done
Error Handling
Common Errors
Catalog Directory Exists
$ beamline gen db beamline-lite --seed-auto --start-auto --script-path data.ion
creating directory ./beamline-catalog/ failed with the following error:
File exists (os error 17)
# Solution: Use --force or different catalog name
beamline gen db beamline-lite --seed-auto --start-auto --script-path data.ion --force
Script Parse Errors
$ beamline gen db beamline-lite --seed-auto --start-auto --script-path invalid.ion
Error: Failed to parse Ion script: Invalid Ion syntax at line 8
Insufficient Disk Space
# Check available space before large database creation
df -h .
beamline gen db beamline-lite --seed-auto --start-auto --script-path huge_data.ion --sample-count 10000000
Best Practices
1. Plan Storage Requirements
# Estimate database size with small sample first
beamline gen db beamline-lite \
--seed 1 \
--start-auto \
--script-path data.ion \
--sample-count 100 \
--catalog_name size-test
# Check size and extrapolate
du -sh size-test/
# If 100 samples = 1MB, then 100,000 samples ≈ 1GB
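The extrapolation above can be scripted with shell arithmetic; the measured numbers in this sketch are made-up examples to be replaced with your own du output.

```shell
# Linear size extrapolation from a small sample run (sketch).
SAMPLE_COUNT=100
SAMPLE_KB=1024          # example measurement: 100 samples took ~1 MB on disk
TARGET_COUNT=100000

EST_KB=$(( SAMPLE_KB * TARGET_COUNT / SAMPLE_COUNT ))
echo "estimated size: $(( EST_KB / 1024 )) MB for $TARGET_COUNT samples"
# → estimated size: 1000 MB for 100000 samples
```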
2. Use Consistent Naming Conventions
# Good naming convention
beamline gen db beamline-lite \
--seed-auto \
--start-auto \
--script-path ecommerce_v2.ion \
--catalog_name ecommerce-v2-20241201 \
--catalog_path ./databases/
# Include date, version, purpose in catalog name
3. Document Database Generation
# Create database with documentation
beamline gen db beamline-lite \
--seed 12345 \
--start-iso "2024-01-01T00:00:00Z" \
--script-path analytics.ion \
--sample-count 25000 \
--catalog_name analytics-q4-2024
# Add README
echo "Analytics Database Q4 2024
Purpose: Customer behavior analysis
Generated: $(date)
Script: analytics.ion
Seed: 12345
Records: 25000
Contact: analytics-team@company.com" > analytics-q4-2024/README.txt
4. Validate Generated Databases
# Verify database creation was successful
ls -la beamline-catalog/
cat beamline-catalog/.beamline-manifest
wc -l beamline-catalog/*.ion
Next Steps
Now that you understand all CLI commands:
- CLI Overview - Review complete CLI capabilities
- Data Generation Guide - Learn about Ion scripts and generators
- Database Guide - Learn about database concepts and usage
- Examples - See complete workflows in action
Database Overview
Beamline provides a local data and schema generation capability, allowing you to create a local copy of the generated data in a catalog directory.
What is BeamlineLite?
BeamlineLite is Beamline’s local database generation capability that creates filesystem-based databases containing:
- Generated data in Ion format
- Inferred schemas in both Ion and SQL DDL formats
- Metadata about generation parameters
- Original scripts for reproducibility
Database vs Data Generation
Data Generation (gen data)
beamline gen data \
--seed 42 \
--start-auto \
--script-path sensors.ion \
--sample-count 1000 \
--output-format ion-pretty
Output: Stream of data records to stdout or file
Use cases: Data processing pipelines, API testing, analysis
Database Generation (gen db)
beamline gen db beamline-lite \
--seed 42 \
--start-auto \
--script-path sensors.ion \
--sample-count 1000
Output: Complete database directory with data + schemas
Use cases: Local development databases, testing environments, demos
BeamlineLite Database Structure
Catalog Directory Layout
A BeamlineLite database creates a catalog directory with this structure:
beamline-catalog/
├── .beamline-manifest # Generation metadata (JSON)
├── .beamline-script # Original Ion script
├── <dataset>.ion # Data files (one per dataset)
├── <dataset>.shape.ion # Ion format schemas
└── <dataset>.shape.sql # SQL DDL schemas
Real Example from client-service.ion
$ beamline gen db beamline-lite \
--seed-auto \
--start-auto \
--script-path client-service.ion \
--sample-count 1000
writing manifest file ./beamline-catalog/.beamline-manifest ...[COMPLETED]
writing script file ./beamline-catalog/.beamline-script ...[COMPLETED]
writing shape file(s)...[COMPLETED]
writing data file(s)...[COMPLETED]
done!
$ tree beamline-catalog/
beamline-catalog/
├── .beamline-manifest
├── .beamline-script
├── service.ion
├── service.shape.ion
├── service.shape.sql
├── client_0.ion
├── client_0.shape.ion
├── client_0.shape.sql
├── client_1.ion
├── client_1.shape.ion
├── client_1.shape.sql
└── ... (more client datasets)
Database Files Deep Dive
Manifest File (.beamline-manifest)
Contains generation metadata in JSON format:
$ cat beamline-catalog/.beamline-manifest
{"seed": "949665520117506306", "start": "2023-02-06T12:52:29.000000000Z", "ddl_syntax.version": "partiql_datatype_syntax.0.1"}
Contents:
- seed: Random seed used for generation (for reproducibility)
- start: Simulation start timestamp
- ddl_syntax.version: SQL DDL syntax version used in .shape.sql files
Script File (.beamline-script)
Preserved copy of the original Ion script:
$ cat beamline-catalog/.beamline-script
rand_processes::{
// generate between 5 & 20 customers
$n: UniformU8::{ low: 5, high: 20 },
// A generator for client ids
$id_gen: UUID,
// ... rest of original script
}
Purpose:
- Reproducibility: Regenerate identical database later
- Documentation: What script created this database
- Version control: Track script changes over time
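Together, the manifest and the preserved script are enough to regenerate the catalog. A sketch: the sed extraction assumes the manifest format shown above, --start-iso is assumed to accept the manifest's timestamp format, and the beamline call is guarded so the script degrades gracefully when the CLI is not on PATH.

```shell
# Regenerate an identical catalog from its preserved metadata (sketch).
CATALOG=beamline-catalog
if [ -f "$CATALOG/.beamline-manifest" ] && command -v beamline >/dev/null 2>&1; then
  seed=$(sed -n 's/.*"seed": "\([^"]*\)".*/\1/p' "$CATALOG/.beamline-manifest")
  start=$(sed -n 's/.*"start": "\([^"]*\)".*/\1/p' "$CATALOG/.beamline-manifest")
  beamline gen db beamline-lite \
    --seed "$seed" \
    --start-iso "$start" \
    --script-path "$CATALOG/.beamline-script" \
    --catalog_name "${CATALOG}-regen"
fi
```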
Data Files (dataset.ion)
Contains generated data in compact Ion format:
$ cat beamline-catalog/client_0.ion
{success: true, id: "7dbd12cf-b506-22ad-2d81-b0a1cd259697", request_id: "0de35d1e-a87c-e540-734d-6f2a4fa410c3", request_time: 2021-01-05T03:55:01.035000000+00:00}
{success: true, id: "7dbd12cf-b506-22ad-2d81-b0a1cd259697", request_id: "3539cdf0-6f7e-6bdc-c25a-4e0b7d8f8bac", request_time: 2021-01-05T03:55:01.182000000+00:00}
Characteristics:
- One record per line: Newline-delimited Ion format
- Complete type information: All Ion types preserved
- Temporal ordering: Records ordered by generation time
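Because each record sits on its own line, ordinary line-oriented tools work per record. A sketch; the file and field names follow the client_0 example above.

```shell
# Per-record operations on newline-delimited Ion data (sketch).
DATA=beamline-catalog/client_0.ion
if [ -f "$DATA" ]; then
  wc -l < "$DATA"                      # total record count
  grep -c 'success: true' "$DATA"      # records where success is true
  head -1 "$DATA"                      # first (earliest) record
fi
```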
Schema Files (dataset.shape.ion)
Ion format schema definitions:
$ cat beamline-catalog/client_0.shape.ion
{
type: "bag",
items: {
type: "struct",
constraints: [ordered, closed],
fields: [
{ name: "id", type: "string" },
{ name: "request_id", type: "string" },
{ name: "request_time", type: "datetime" },
{ name: "success", type: "bool" }
]
}
}
Use cases:
- PartiQL validation: Validate queries against schema
- Type checking: Ensure data types match expectations
- Tool integration: Tools can use schema information
Schema Files (dataset.shape.sql)
SQL DDL format schemas:
$ cat beamline-catalog/service.shape.sql
"Account" VARCHAR,
"Distance" DECIMAL(2, 0),
"Operation" VARCHAR,
"Program" VARCHAR,
"Request" VARCHAR,
"StartTime" TIMESTAMP,
"Weight" DECIMAL(5, 4),
"client" VARCHAR,
"success" BOOL
Use cases:
- Database creation: Create tables in SQL databases
- Schema documentation: Human-readable schema reference
- Migration scripts: Database schema evolution
Managing Catalogs
BeamlineLite catalogs are filesystem-based directories that contain complete databases with data, schemas, and metadata. Understanding how to manage, organize, and work with catalogs is essential for effective database operations.
Catalog Structure Deep Dive
Standard Catalog Layout
Every BeamlineLite catalog follows a consistent structure:
catalog-name/
├── .beamline-manifest # JSON metadata file
├── .beamline-script # Original Ion script
├── dataset_1.ion # Dataset 1 data (Ion format)
├── dataset_1.shape.ion # Dataset 1 schema (Ion format)
├── dataset_1.shape.sql # Dataset 1 schema (SQL DDL)
├── dataset_2.ion # Dataset 2 data
├── dataset_2.shape.ion # Dataset 2 schema (Ion)
├── dataset_2.shape.sql # Dataset 2 schema (SQL)
└── ... (additional datasets)
File Naming Conventions
Data Files: <dataset_name>.ion
- Contains generated records in newline-delimited Ion format
- One file per dataset defined in the Ion script
- Records ordered chronologically by generation time
Ion Schema Files: <dataset_name>.shape.ion
- PartiQL type definitions in Ion format
- Used by Ion-aware tools for validation and processing
- Contains complete type constraint information
SQL Schema Files: <dataset_name>.shape.sql
- SQL DDL field definitions (not complete CREATE TABLE)
- Ready for integration with SQL databases
- Human-readable schema documentation
Metadata Files:
- .beamline-manifest - Generation parameters in JSON
- .beamline-script - Original Ion script for reproducibility
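Given these naming conventions, the dataset names in a catalog can be listed by filtering out the schema files; a sketch:

```shell
# List the dataset names in a catalog (sketch): every *.ion file that is
# not a *.shape.ion schema file corresponds to one dataset.
CATALOG=beamline-catalog
for f in "$CATALOG"/*.ion; do
  [ -e "$f" ] || continue               # skip the literal glob when no matches
  case "$f" in *.shape.ion) continue ;; esac
  basename "$f" .ion
done
```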
Catalog Creation Options
Basic Catalog Creation
# Default catalog in current directory
beamline gen db beamline-lite \
--seed 42 \
--start-auto \
--script-path data.ion
# Creates: ./beamline-catalog/
Custom Catalog Configuration
# Custom name and location
beamline gen db beamline-lite \
--seed 12345 \
--start-iso "2024-01-01T00:00:00Z" \
--script-path ecommerce.ion \
--sample-count 50000 \
--catalog_name ecommerce-prod-simulation \
--catalog_path ./production-databases/
# Creates: ./production-databases/ecommerce-prod-simulation/
Catalog Naming Best Practices
# Good - descriptive, versioned names
--catalog_name user-analytics-v2-20241201
--catalog_name integration-test-db-sprint-45
--catalog_name demo-ecommerce-q4-2024
# Avoid - generic names
--catalog_name db
--catalog_name test
--catalog_name data
Catalog Lifecycle Management
Safe Overwrite with Backup
BeamlineLite protects existing catalogs by default:
$ beamline gen db beamline-lite --seed 1 --start-auto --script-path data.ion
creating directory ./beamline-catalog/ failed with the following error:
File exists (os error 17)
The --force option creates automatic backups:
$ beamline gen db beamline-lite \
--seed 1 \
--start-auto \
--script-path updated_data.ion \
--force
command is using --force ...
Beamline catalog ./beamline-catalog/ exists, backing it up to "beamline-catalog.2024-05-10T22:15:54.019316000Z.bkp"...
back up completed
writing manifest file ./beamline-catalog/.beamline-manifest ...[COMPLETED]
writing script file ./beamline-catalog/.beamline-script ...[COMPLETED]
writing shape file(s)...[COMPLETED]
writing data file(s)...[COMPLETED]
done!
Backup naming pattern:
<catalog-name>.<ISO-8601-timestamp>.bkp
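Because the timestamp is ISO 8601, lexicographic order matches chronological order, so the most recent backup can be found and restored with standard tools. A sketch, assuming the backup naming pattern above:

```shell
# Find and restore the most recent catalog backup (sketch).
latest=$(ls -d beamline-catalog.*.bkp 2>/dev/null | sort | tail -1)
if [ -n "$latest" ]; then
  echo "restoring $latest"
  rm -rf beamline-catalog
  cp -r "$latest" beamline-catalog
fi
```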