
Introduction

Welcome to the Beamline Guide — your comprehensive resource for mastering synthetic data and query generation for your AI/ML, testing, and simulation use cases.

What You’ll Learn

This guide will take you on a journey from understanding the basics of Beamline to becoming proficient in generating sophisticated synthetic datasets and queries. Whether you are an AI/ML researcher, a developer looking to test your implementations, or a data scientist needing realistic test data, this guide has you covered.

How This Guide is Organized

The guide is structured to gradually build your understanding and skills:

  1. Getting Started — Learn what Beamline is and get your first data generation running
  2. Understanding the Basics — Grasp core concepts like random processes and reproducible generation
  3. Data Generation — Master the art of creating synthetic data with various types and patterns
  4. Query Generation — Learn to generate PartiQL queries that match your data shapes
  5. Schema and Shape Inference — Understand how to work with data schemas and type inference
  6. Database Generation — Create a local BeamlineLite database with both data and schemas
  7. Command Line Interface — Become proficient with all CLI commands and options
  8. Examples and Tutorials — Hands-on tutorials with real-world scenarios

What is Beamline?

Beamline is a tool for fast data generation. It generates reproducible pseudo-random data using a stochastic approach and probability distributions, meaning you can create realistic datasets that follow specific mathematical patterns. The data is random enough to be useful for AI/ML model training, simulation, and testing, yet deterministic enough to be reproduced exactly for debugging and validation.

The tool’s ability to generate data based on statistical distributions makes it particularly valuable for AI model training scenarios where you need synthetic data that resembles specific population distributions or statistical characteristics.

Key Features

  • Reproducible Data Generation: Use seeds to generate the same data every time
  • Stochastic Processes: Model real-world data patterns using mathematical distributions
  • Query Generation: Automatically generate PartiQL/SQL-like queries that match your data shapes
  • Schema Inference: Automatically infer and export data schemas in various formats
  • Multiple Output Formats: Support for Amazon Ion, JSON, and SQL DDL
  • Database Generation: Create a complete local database containing both the generated data and its schemas
  • Flexible Configuration: Highly configurable through scripts

Prerequisites

This guide assumes basic familiarity with:

  • Command-line interfaces
  • Ion/JSON data formats
  • Databases and queries
  • Rust programming language (for building from source)

Don’t worry if you’re new to some of these concepts — we’ll explain everything you need to know as we go!

Getting Help

If you encounter issues or have questions while following this guide, please open an issue on the Beamline GitHub repository or start a discussion in the repository's Discussions section.

Let’s begin your journey with Beamline!

What is Beamline?

Beamline is a tool designed for fast data generation. At its core, it generates reproducible pseudo-random data using a stochastic approach that models real-world data patterns.

The Problem It Solves

Developers, data scientists, and researchers commonly face these software engineering problems:

  • Lack of Test Data: Creating realistic test datasets manually is time-consuming and error-prone
  • Inconsistent Testing: Different test runs with different data make it hard to reproduce bugs
  • Query Testing: Writing queries that match your data structures requires understanding both the data shape and query patterns
  • Performance Benchmarking: Consistent, scalable datasets are needed for meaningful performance comparisons and AI inference evaluations
  • Schema Evolution: As data structures change, maintaining test data becomes increasingly complex

Beamline addresses all these challenges with a unified approach to synthetic data generation.

Core Components

Beamline consists of three main components that work together:

1. Data Generator

The Data Generator creates reproducible pseudo-random data based on mathematical distributions and stochastic processes. It can generate:

  • Simple scalar values (numbers, strings, booleans, dates)
  • Complex nested structures (structs, arrays, mixed types)
  • Time-series data with realistic temporal patterns
  • Shared data across multiple datasets

Key Features:

  • Reproducible: Same seed always produces the same data, no matter how deeply nested it is
  • Configurable: Highly customizable through scripts
  • Realistic: Uses statistical distributions to model real-world patterns
  • Scalable: Can generate datasets from small samples to millions of records

2. Query Generator

The Query Generator creates SQL-like queries (starting with PartiQL support) that match the shapes and types of your generated data. It can produce:

  • SELECT * FROM ... WHERE ... queries with various predicates
  • SELECT ... FROM ... WHERE ... queries with custom projections
  • SELECT ... EXCLUDE ... FROM ... WHERE ... queries with exclusions
  • Complex nested queries with deep path expressions

Key Features:

  • Datatype-Aware: Generates queries that match your data types
  • Parameterizable: Control query complexity, depth, and patterns
  • Reproducible: Same seed produces the same query patterns

3. CLI Interface

The Command Line Interface provides easy access to all functionality with comprehensive options for:

  • Data generation with various output formats
  • Query generation with extensive parameterization
  • Schema inference and export
  • Local data file creation with both data and schemas

How It Works

Stochastic Processes

Beamline models data generation as stochastic processes — mathematical models that describe systems that appear to vary randomly over time. This approach allows it to:

  • Generate data that follows realistic patterns
  • Model temporal relationships (like arrival times)
  • Simulate real-world variability while maintaining reproducibility

Scripts and Configuration

Data generation is controlled through scripts that define:

  • Random Processes: How data arrives and is generated over time
  • Data Generators: What types of data to create and their distributions
  • Relationships: How different data elements relate to each other
  • Constraints: Rules and patterns the data should follow

Reproducibility

One of Beamline’s key strengths is reproducibility:

  • Seeds: Control the random data generation for consistent results
  • Timestamps: Control the starting time for temporal data
  • Deterministic: Same inputs always produce the same outputs
  • Debuggable: Reproduce exact datasets for debugging and validation
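
The effect of seeding can be sketched in Python (a conceptual illustration, not Beamline's implementation): the same seed drives the same pseudo-random sequence, so the generated dataset is a pure function of its inputs.

```python
import random

def generate_readings(seed: int, count: int) -> list[float]:
    """Toy generator: the same seed always yields the same sequence."""
    rng = random.Random(seed)  # isolated, seeded PRNG
    return [round(rng.uniform(20.0, 35.0), 3) for _ in range(count)]

run_a = generate_readings(42, 3)
run_b = generate_readings(42, 3)
assert run_a == run_b  # identical inputs, identical outputs
```

Because each call builds its own seeded generator, runs are independent of global state and fully repeatable, which is exactly what makes debugging against generated data practical.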

Use Cases

AI Model Training and Inference

  • Training Data Generation: Generate datasets that follow specific statistical distributions for machine learning model training
  • Distribution-Based Modeling: Create training data that matches target population distributions for more representative models
  • Synthetic Data Augmentation: Expand training datasets while preserving underlying statistical distributions
  • Edge Case Generation: Generate rare statistical scenarios for robust model validation

Testing and Development

  • Unit Testing: Generate consistent test data for implementations
  • Integration Testing: Create realistic datasets for end-to-end testing
  • Regression Testing: Ensure changes don’t break existing functionality
  • Edge Case Testing: Generate data that exercises boundary conditions

Performance and Benchmarking

  • Load Testing: Generate large datasets for performance evaluation
  • Scalability Testing: Test how systems perform with growing data sizes
  • Query Optimization: Generate queries to test optimization strategies

Research and Education

  • Algorithm Research: Generate datasets for testing new features
  • Query Pattern Analysis: Study how different query patterns perform
  • Educational Examples: Create realistic examples for learning PartiQL
  • Prototyping: Quickly generate data for proof-of-concept implementations

What Makes It Special

Mathematical Foundation

Unlike simple random data generators, Beamline is built on solid mathematical foundations:

  • Probability Distributions: Uses proper statistical distributions for realistic data
  • Stochastic Modeling: Models real-world processes mathematically
  • Temporal Modeling: Handles time-based data generation correctly
  • Correlation Modeling: Can generate related data across multiple dimensions

Next Steps

Now that you understand what Beamline is and why it’s useful, let’s get it installed and running on your system. In the next section, we’ll walk through the installation process and verify that everything is working correctly.

Installation and Setup

This chapter will guide you through installing Beamline and setting up your development environment. Beamline is written in Rust, so we will cover both building from source and using pre-built binaries when available.

Prerequisites

Before installing Beamline, ensure you have the following prerequisites:

Required

  • Rust Toolchain: Beamline requires Rust 1.70 or later
  • Git: For cloning the repository
  • Command Line Access: Terminal or command prompt
  • Text Editor: For editing Ion scripts (VS Code, vim, emacs, etc.)
  • JSON/Ion Viewer: Use jq and/or ion-cli tools for examining generated data

Installing Rust

Read the following for more details on installing Rust on your machine: https://rust-lang.org/tools/install/

If you don’t have Rust installed, follow these steps:

On macOS, Linux, or WSL

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

On Windows

  1. Download and run rustup-init.exe
  2. Follow the installation prompts
  3. Restart your command prompt

Verify Rust Installation

rustc --version
cargo --version

You should see version information for both rustc and cargo.

Installing Beamline

Method 1: Building from Source

Building from source is currently the primary method for installing Beamline:

  1. Clone the Repository

    git clone https://github.com/partiql/partiql-beamline.git
    cd partiql-beamline
    
  2. Build the Project

    cargo build --release
    

    This will compile Beamline in release mode, which provides better performance for data generation.

  3. Verify the Installation

    ./target/release/beamline --version
    

    You should see version information for Beamline.

  4. Optional: Add to PATH

    For easier access, you can add the binary to your PATH or create a symlink:

    On macOS/Linux:

    # Option 1: Copy to a directory in your PATH
    sudo cp target/release/beamline /usr/local/bin/
    
    # Option 2: Create a symlink
    ln -s $(pwd)/target/release/beamline ~/.local/bin/beamline
    
    # Option 3: Add to your shell profile
    echo 'export PATH="'$(pwd)'/target/release:$PATH"' >> ~/.bashrc
    source ~/.bashrc
    

    On Windows:

    # Add the target/release directory to your PATH environment variable
    # Or copy the .exe file to a directory already in your PATH
    

Method 2: Using Cargo Install (Not available yet)

Once Beamline is published to crates.io, you’ll be able to install it directly:

# This will be available in the future
cargo install beamline

Verifying Your Installation

Let’s verify that Beamline is installed correctly by running a few basic commands:

1. Check Version

beamline --version

2. View Help

beamline --help

You should see output similar to:

Beamline CLI

Usage: beamline <COMMAND>

Commands:
  gen          Run the generator
  infer-shape  Run the script shape inference
  help         Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

3. Test Data Generation

Let’s run a simple test to ensure data generation works:

beamline gen data --help

This should display the help for the data generation command, confirming that the core functionality is available.

Development Environment Setup

Setting Up Your Workspace

Create a directory for your Beamline projects:

mkdir ~/partiql-beamline-workspace
cd ~/partiql-beamline-workspace

Editor Configuration

VS Code

If you’re using VS Code, consider installing these extensions for better Ion support:

  1. Rust Analyzer: For Rust syntax highlighting if you plan to contribute
  2. ion-vscode-plugin: For Syntax Highlighting, Error Reporting, and Formatting of Beamline scripts
  3. JSON: For viewing generated JSON output
  4. Better TOML: For configuration files

Shell Aliases (Optional)

For convenience, you might want to create shell aliases:

# Add to your ~/.bashrc, ~/.zshrc, or equivalent
alias pql-gen='beamline gen data'
alias pql-query='beamline query'
alias pql-shape='beamline infer-shape'

Troubleshooting Installation

Common Issues

Rust Version Too Old

Error: error: package requires rustc 1.70 or newer

Solution: Update Rust:

rustup update

Build Failures

Error: Compilation errors during cargo build

Solutions:

  1. Ensure you have the latest Rust version
  2. Clean and rebuild:
    cargo clean
    cargo build --release
    
  3. Check for system-specific dependencies

Permission Issues

Error: Permission denied when copying to /usr/local/bin

Solution: Use sudo or choose a different installation location:

# Install to user directory instead
mkdir -p ~/.local/bin
cp target/release/beamline ~/.local/bin/

PATH Issues

Error: command not found: beamline

Solution: Verify the binary is in your PATH:

which beamline
echo $PATH

Getting Help

If you encounter issues not covered here:

  1. Check the Troubleshooting section
  2. Review the GitHub Issues
  3. Create a new issue with:
    • Your operating system
    • Rust version (rustc --version)
    • Complete error messages
    • Steps to reproduce

Performance Considerations

Release vs Debug Builds

Always use release builds for actual data generation:

# Debug build (slower, for development)
cargo build

# Release build (faster, for production use)
cargo build --release

Release builds can be 10-100x faster than debug builds for data generation tasks.

System Resources

Beamline is designed to be memory-efficient, but consider your system resources:

  • RAM: 4GB minimum, 8GB+ recommended for large datasets
  • Storage: Ensure adequate disk space for generated data
  • CPU: Multi-core processors will benefit from parallel processing features

Next Steps

Now that you have Beamline installed and verified, you’re ready to generate your first dataset! In the next section, we’ll walk through creating your first data generation script and producing some sample data.

Your First Data Generation

Now that you have Beamline installed, let’s generate your first dataset! This hands-on tutorial will walk you through creating a simple sensor data generator and understanding the basic concepts.

Quick Start: Using an Example Script

Beamline comes with several example scripts. Let’s start with the sensors example to see data generation in action.

Step 1: Generate Your First Dataset

Run the following command to generate 2 sensor readings:

beamline gen data \
    --seed-auto \
    --start-auto \
    --sample-count 2 \
    --script-path partiql-beamline-sim/tests/scripts/sensors.ion

You should see output similar to:

Seed: 5372343081885320050
Start: 2022-01-08T18:38:38.000000000Z
[2022-01-08 18:38:57.155 +00:00:00] : DataSetName("sensors") { 'tick': 19155, 'i8': 57, 'f': 30.103028021670184, 'w': 3.2669, 'd': 2, 'a': 'ed6b2d0c-dd09-4d7e-b1d3-fc16e3547eb5', 'ar1': [1.2, 1.4, 0.8], 'ar2': ['8fe9ee2c-a9e0-462a-8a44-a9abc51e759b', '0411eace-53be-4647-b351-3fa2de9b8e5f'], 'ar3': [3.2669, NULL, 3.0777], 'ar4': [10, 4, 8, 2], 'ar5': ['ed6b2d0c-dd09-4d7e-b1d3-fc16e3547eb5'] }

Congratulations! You’ve just generated your first synthetic dataset with Beamline.

Understanding the Output

Let’s break down what happened:

  • Seed: 5372343081885320050 — This random seed ensures reproducibility
  • Start: 2022-01-08T18:38:38.000000000Z — The simulation start time
  • Data Records: Sensor readings with timestamps, each containing (among other fields):
    • tick: A simulation tick counter
    • i8: An 8-bit integer value
    • f: A floating-point sensor value

Step 2: Reproduce the Same Data

Let’s generate the exact same data using the seed from the previous run:

beamline gen data \
    --seed 5372343081885320050 \
    --start-auto \
    --sample-count 2 \
    --script-path partiql-beamline-sim/tests/scripts/sensors.ion

Notice that the data values are identical, but the timestamps might be different because we used --start-auto. To get exactly the same output, use the same start time:

beamline gen data \
    --seed 5372343081885320050 \
    --start-iso "2022-01-08T18:38:38.000000000Z" \
    --sample-count 2 \
    --script-path partiql-beamline-sim/tests/scripts/sensors.ion

Now you’ll get exactly the same output as the first run!

Understanding the Script

Let’s examine the script that generated this data. Look at the contents of partiql-beamline-sim/tests/scripts/sensors.ion:

rand_processes::{
  $n:UniformU8::{
    low:2,
    high:10
  },
  sensors:$n::[
    rand_process::{
      $r:Uniform::{
        choices:[
          5,
          10
        ]
      },
      $arrival:HomogeneousPoisson::{
        interarrival:minutes::$r
      },
      $weight:UniformDecimal::{
        nullable:0.75,
        low:1.995,
        high:4.9999,
        optional:true
      },
      $anyof:UniformAnyOf::{
        types:[
          Tick,
          UniformF64,
          UUID,
          UniformDecimal::{
            low:1.995,
            high:4.9999,
            nullable:false
          }
        ]
      },
      $array:UniformArray::{
        min_size:3,
        max_size:3,
        element_type:UniformDecimal::{
          low:0.5,
          high:1.5
        }
      },
      $data:{
        tick:Tick,
        i8:UniformI8,
        f:UniformF64,
        w:$weight,
        d:UniformDecimal::{
          low:0.,
          high:42.,
          nullable:false
        },
        a:$anyof,
        ar1:$array,
        ar2:UniformArray::{
          min_size:2,
          max_size:4,
          element_type:UUID
        },
        ar3:UniformArray::{
          min_size:2,
          max_size:4,
          element_type:$weight
        },
        ar4:UniformArray::{
          min_size:2,
          max_size:4,
          element_type:UniformI8::{
            low:2,
            high:10
          }
        },
        ar5:UniformArray::{
          min_size:1,
          max_size:1,
          element_type:$anyof
        }
      }
    }
  ]
}

Script Breakdown

  1. rand_processes::: This annotation tells Beamline that this structure defines random processes

  2. $n: UniformU8::{ low: 2, high: 10 }: Creates a variable n that generates a random number between 2 and 10

  3. sensors: $n::[...]: Creates a dataset called “sensors” with n random processes (2-10 processes)

  4. rand_process::: Defines a single random process within the sensors dataset

  5. $r: Uniform::{ choices: [5, 10] }: Creates a variable r that randomly selects between 5 and 10

  6. $arrival: HomogeneousPoisson:: { interarrival: minutes::$r }: Defines how often data arrives (every r minutes using a Poisson process)

  7. $data:: Defines the structure of each generated data record:

    • tick: Tick - Current simulation tick
    • i8: UniformI8 - Random 8-bit integer
    • f: UniformF64 - Random 64-bit float
    • w: $weight - Optional, nullable decimal weight
    • a: $anyof - A value drawn from one of several types

Exploring Different Output Formats

Beamline supports multiple output formats. Let’s try generating the same data in different formats:

Ion Pretty Format

beamline gen data \
    --seed 5372343081885320050 \
    --start-auto \
    --sample-count 3 \
    --script-path partiql-beamline-sim/tests/scripts/sensors.ion \
    --output-format ion-pretty

This produces nicely formatted Ion output, similar to:

{
  seed: 12328924104731257599,
  start: "2024-01-20T20:05:41.000000000Z",
  data: {
    sensors: [
      {
        i8: -21,
        tick: 9421,
        f: 2.803799956162891e0,
        id: 1
      },
      {
        i8: -70,
        tick: 12294,
        f: 1.7229362418585936e1,
        id: 1
      },
      {
        i8: 84,
        tick: 32697,
        f: -2.4809825455060093e1,
        id: 0
      }
    ]
  }
}

Text Format (Default)

The default text format is human-readable and great for quick inspection:

beamline gen data \
    --seed 5372343081885320050 \
    --start-auto \
    --sample-count 3 \
    --script-path partiql-beamline-sim/tests/scripts/sensors.ion \
    --output-format text

Creating Your Own Simple Script

Now let’s create your own script from scratch. Create a new file called my-first-script.ion:

rand_processes::{
    simple_data: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            timestamp: Instant,
            temperature: UniformF64::{ low: 20.0, high: 35.0 },
            humidity: UniformF64::{ low: 30.0, high: 80.0 },
            sensor_id: UUID,
            active: Bool::{ p: 0.9 }
        }
    }
}

This script creates a simple weather sensor that generates:

  • timestamp: Current simulation time
  • temperature: Random temperature between 20-35°C
  • humidity: Random humidity between 30-80%
  • sensor_id: A unique UUID for each reading
  • active: Boolean with 90% chance of being true

Test Your Script

beamline gen data \
    --seed 42 \
    --start-auto \
    --sample-count 5 \
    --script-path my-first-script.ion \
    --output-format ion-pretty

Understanding Key Concepts

Seeds and Reproducibility

The --seed parameter controls randomness:

  • --seed-auto: Generate a random seed (different data each time)
  • --seed 42: Use a specific seed (same data each time)

Start Times

The --start parameter controls simulation time:

  • --start-auto: Use current time
  • --start-iso "2024-01-01T00:00:00Z": Use specific time
  • --start-epoch-ms 1704067200000: Use epoch milliseconds
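
The two explicit forms express the same instant in different units. A small Python helper (hypothetical, for illustration only) shows the conversion you might use to pick a --start-epoch-ms value:

```python
from datetime import datetime

def iso_to_epoch_ms(iso: str) -> int:
    """Convert an ISO-8601 UTC timestamp to epoch milliseconds."""
    # fromisoformat on older Pythons does not accept a trailing "Z"
    dt = datetime.fromisoformat(iso.replace("Z", "+00:00"))
    return int(dt.timestamp() * 1000)

# 2024-01-01T00:00:00Z and 1704067200000 name the same instant,
# so --start-iso and --start-epoch-ms can express identical starts.
print(iso_to_epoch_ms("2024-01-01T00:00:00Z"))  # 1704067200000
```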

Sample Count

The --sample-count parameter controls how many data points to generate. This is particularly useful for:

  • Testing with small datasets
  • Generating large datasets for performance testing
  • Controlling output size

Common Patterns

Multiple Datasets

You can generate data for specific datasets using the --dataset flag:

beamline gen data \
    --seed 42 \
    --start-auto \
    --sample-count 10 \
    --script-path partiql-beamline-sim/tests/scripts/client-service.ion \
    --dataset service --dataset client_1 \
    --output-format ion-pretty

Controlling Nullability

You can control how often NULL values appear:

beamline gen data \
    --seed 42 \
    --start-auto \
    --sample-count 5 \
    --script-path my-first-script.ion \
    --default-nullable true \
    --pct-null 0.1  # 10% chance of NULL values

Next Steps

Now that you’ve successfully generated your first datasets, you are ready to dive deeper into Beamline’s capabilities. In the next section, we’ll explore the core concepts that power Beamline’s data generation, including:

  • Random processes and stochastic modeling
  • Data generators and their configurations
  • Temporal modeling and arrival patterns
  • Relationships between data elements

Quick Reference

Here are the commands you’ve learned in this chapter:

# Basic data generation
beamline gen data --seed-auto --start-auto --sample-count N --script-path SCRIPT

# Reproducible generation
beamline gen data --seed SEED --start-iso "TIMESTAMP" --sample-count N --script-path SCRIPT

# Different output formats
beamline gen data ... --output-format [text|ion|ion-pretty]

# Specific datasets
beamline gen data ... --dataset DATASET_NAME

# Control nullability
beamline gen data ... --default-nullable true --pct-null 0.1

Congratulations on completing your first data generation with Beamline! You’re now ready to explore more advanced features and create more sophisticated synthetic datasets.

Core Concepts

Before diving deeper into Beamline’s advanced features, it’s essential to understand the fundamental concepts that power its data generation capabilities. This chapter will introduce you to the mathematical and computational foundations that make Beamline both powerful and reliable.

Stochastic Processes

At the heart of Beamline lies the concept of stochastic processes — mathematical models that describe systems appearing to vary randomly over time.

What is a Stochastic Process?

A stochastic process is a collection of random variables indexed by time or space. In simpler terms, it is a way to model how things change randomly over time while still following certain patterns or rules.

Real-world examples:

  • Stock prices over time
  • Sensor readings from IoT devices
  • User activity on a website
  • Network traffic patterns
  • Temperature measurements

Why Stochastic Processes Matter

Traditional random data generators often produce data that looks random but lacks the realistic patterns found in real-world data. Stochastic processes allow Beamline to:

  1. Model Temporal Relationships: Data points aren’t just random — they follow realistic time-based patterns
  2. Create Correlations: Different data elements can be related in meaningful ways
  3. Simulate Real Patterns: Generate data that behaves like real-world systems
  4. Maintain Consistency: Ensure generated data follows logical rules and constraints

Example: Sensor Data

Consider a temperature sensor:

  • Simple Random: Each reading is completely independent
  • Stochastic Process: Readings follow realistic patterns (gradual changes, daily cycles, seasonal trends)
// Simple random (unrealistic)
temperature: UniformF64::{ low: -10.0, high: 40.0 }

// Stochastic process (realistic)
temperature: NormalF64::{ mean: 22.0, std_dev: 5.0 }
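
The difference is easy to see numerically. A minimal Python sketch (illustrative only, not Beamline internals) samples both distributions and checks how the normal values cluster around the mean:

```python
import random
import statistics

rng = random.Random(0)

# Uniform: every value in [-10, 40] is equally likely.
uniform_temps = [rng.uniform(-10.0, 40.0) for _ in range(10_000)]

# Normal: values cluster around the mean (22 degrees, std dev 5).
normal_temps = [rng.gauss(22.0, 5.0) for _ in range(10_000)]

# Roughly 68% of normal samples fall within one std dev of the mean;
# uniform samples show no such clustering.
within_1sd = sum(17.0 <= t <= 27.0 for t in normal_temps) / len(normal_temps)
print(round(statistics.mean(normal_temps), 1), round(within_1sd, 2))
```

For sensor-like data, the normal samples look like plausible readings drifting around a set point, while the uniform samples jump anywhere in the range with equal probability.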

Random Processes in Beamline

Beamline implements stochastic processes through random processes defined in scripts written in the Amazon Ion format.

Anatomy of a Random Process

rand_process::{
    $arrival: HomogeneousPoisson:: { interarrival: minutes::5 },
    $data: {
        // Data structure definition
    }
}

Every random process has two key components:

  1. Arrival Process ($arrival): Defines the statistical pattern of new data arrivals, i.e., when the data arrives
  2. Data Structure ($data): Defines what data is generated

Arrival Processes

Arrival processes control the timing of data generation. At the moment, Beamline supports only the homogeneous Poisson process:

Homogeneous Poisson Process

The most common arrival process, modeling events that occur at a constant average rate:

$arrival: HomogeneousPoisson:: { interarrival: minutes::5 }

Characteristics:

  • Events occur independently
  • Average rate is constant over time
  • Time between events follows an exponential distribution
  • Models many real-world phenomena (customer arrivals, system events, etc.)

Use cases:

  • Web server requests
  • Sensor readings
  • User logins
  • System alerts
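
The exponential-interarrival property can be sketched in a few lines of Python (a conceptual illustration of the process, not Beamline's implementation); here mean_interarrival_min plays the role of interarrival: minutes::5:

```python
import random

rng = random.Random(123)
mean_interarrival_min = 5.0  # analogous to interarrival: minutes::5

# In a homogeneous Poisson process, the gaps between events are
# exponentially distributed; expovariate takes the rate = 1 / mean.
gaps = [rng.expovariate(1.0 / mean_interarrival_min) for _ in range(10_000)]

# Accumulate gaps into event timestamps (minutes since simulation start).
t, arrivals = 0.0, []
for gap in gaps:
    t += gap
    arrivals.append(t)

print(round(sum(gaps) / len(gaps), 2))  # sample mean, approximately 5 minutes
```

Each event timestamp is the running sum of independent exponential gaps, which is why events arrive irregularly yet at a stable long-run rate.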

Time Units

Beamline supports various time units for arrival processes:

// Different time units
seconds::30      // 30 seconds
minutes::5       // 5 minutes  
hours::2         // 2 hours
days::1          // 1 day
milliseconds::100 // 100 milliseconds

Data Generators

Data generators define the structure and content of generated data. They use probability distributions to create realistic values.

Probability Distributions

Beamline supports many probability distributions, each suited for different types of data:

Uniform Distributions

Generate values where each value in a range is equally likely:

// Discrete uniform (integers)
age: UniformU8::{ low: 18, high: 65 }

// Continuous uniform (floats)
temperature: UniformF64::{ low: 20.0, high: 30.0 }

// Uniform choice from literals
status: Uniform::{ choices: ["active", "inactive", "pending"] }

Use cases:

  • IDs, categories, discrete choices
  • Baseline random values
  • Testing edge cases

Normal (Gaussian) Distributions

Generate values that cluster around a mean with a bell-curve distribution:

height: NormalF64::{ mean: 170.0, std_dev: 10.0 }

Characteristics:

  • Most values near the mean
  • Symmetric distribution
  • Models many natural phenomena

Use cases:

  • Physical measurements (height, weight)
  • Performance metrics
  • Error values

Other Distributions

// Exponential (for modeling wait times)
response_time: ExpF64::{ rate: 0.1 }

// Log-normal (for modeling sizes, prices)
file_size: LogNormalF64::{ location: 10.0, scale: 1.0 }

// Weibull (for modeling lifetimes, reliability)
device_lifetime: WeibullF64::{ shape: 2.0, scale: 1000.0 }

Data Types

Beamline supports the following data types:

Scalar Types

// Numbers
integer_val: UniformI32::{ low: 1, high: 1000 }
float_val: UniformF64::{ low: 0.0, high: 1.0 }
decimal_val: UniformDecimal::{ low: 1.99, high: 999.99 }

// Text
name: LoremIpsumTitle
description: LoremIpsum::{ min_words: 10, max_words: 50 }
pattern_text: Regex::{ pattern: "[A-Z]{2}[0-9]{4}" }

// Boolean
active: Bool::{ p: 0.8 }  // 80% chance of true

// Temporal
created_at: Instant
birth_date: Date

// Identifiers
user_id: UUID

Complex Types

// Structures
user: {
    id: UUID,
    name: LoremIpsumTitle,
    age: UniformU8::{ low: 18, high: 65 },
    preferences: {
        theme: Uniform::{ choices: ["light", "dark"] },
        notifications: Bool::{ p: 0.7 }
    }
}

// Arrays
tags: UniformArray::{ 
    min_size: 1, 
    max_size: 5, 
    element_type: LoremIpsumTitle 
}

// Union types
value: UniformAnyOf::{ types: [
    UniformI32::{ low: 1, high: 100 },
    LoremIpsumTitle,
    Bool
]}
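
To make the structure concrete, here is a toy Python generator (illustrative only; the field names follow the user struct above, while the value choices are made up) that emits one record of that shape:

```python
import random
import uuid

def gen_user(rng: random.Random) -> dict:
    """Toy sketch: produce one record shaped like the `user` struct above."""
    return {
        "id": str(uuid.UUID(int=rng.getrandbits(128))),  # deterministic UUID
        "name": rng.choice(["Lorem Ipsum", "Dolor Sit", "Amet Consectetur"]),
        "age": rng.randint(18, 65),
        "preferences": {
            "theme": rng.choice(["light", "dark"]),
            "notifications": rng.random() < 0.7,  # like Bool::{ p: 0.7 }
        },
    }

rng = random.Random(42)
user = gen_user(rng)
assert 18 <= user["age"] <= 65 and user["preferences"]["theme"] in ("light", "dark")
```

Note that the nested preferences struct is generated with the same seeded RNG as the top-level fields, so the whole record is reproducible from the seed.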

Variables and References

Beamline supports variables for creating relationships and reusing values:

Variable Definition

rand_processes::{
    $n: UniformU8::{ low: 2, high: 10 },

    sensors: $n::[
        rand_process::{
            $r: Uniform::{ choices: [5,10] },
            $arrival: HomogeneousPoisson:: { interarrival: minutes::$r },
            $weight: UniformDecimal::{ nullable: 0.75, low: 1.995, high: 4.9999, optional: true },
            $anyof: UniformAnyOf::{ types: [Tick, UniformF64, UUID, UniformDecimal::{ low: 1.995, high: 4.9999, nullable: false }] },
            $array: UniformArray::{
                min_size: 3,
                max_size: 3,
                element_type: UniformDecimal::{ low: 0.5, high: 1.5 }
            },
            $data: {
                tick: Tick,
                i8: UniformI8,
                f: UniformF64,
                w: $weight,
                d: UniformDecimal::{ low: 0d0, high: 4.2d1, nullable: false },
                a: $anyof,
                ar1: $array,
                ar2: UniformArray::{ min_size: 2, max_size: 4, element_type: UUID },
                ar3: UniformArray::{ min_size: 2, max_size: 4, element_type: $weight },
                ar4: UniformArray::{ min_size: 2, max_size: 4, element_type: UniformI8::{ low: 2, high: 10 } },
                ar5: UniformArray::{ min_size: 1, max_size: 1, element_type: $anyof }
            }
        }
    ],
}

Variable Types

Generator Variables

Store data generators for reuse:

$temperature_sensor: NormalF64::{ mean: 22.0, std_dev: 3.0 }
$id_gen: UUID

Value Variables

Store computed values:

$success_rate: UniformF64::{ low: 0.95, high: 1.0 },
$is_successful: Bool::{ p: $success_rate }

Evaluation Control

Control when variables are evaluated:

// Evaluate once at script read time
$user_id: $id_gen::()

// Evaluate each time it's used
$request_id: $id_gen
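
The distinction is the same as capturing a value once versus calling a generator on every use. A Python analogy (make_id here is a made-up stand-in for $id_gen, not Beamline code):

```python
import random
import uuid

rng = random.Random(7)

def make_id() -> str:
    """Stand-in generator for $id_gen: a fresh random UUID per call."""
    return str(uuid.UUID(int=rng.getrandbits(128)))

# "Evaluate once": call the generator immediately and reuse the value,
# like $user_id: $id_gen::()
user_id = make_id()

# "Evaluate each use": keep the generator and call it per record,
# like $request_id: $id_gen
records = [{"user_id": user_id, "request_id": make_id()} for _ in range(3)]

assert len({r["user_id"] for r in records}) == 1     # shared, fixed value
assert len({r["request_id"] for r in records}) == 3  # fresh per record
```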

Datasets and Collections

Beamline organizes generated data into datasets, which represent collections of related data.

Single Dataset

rand_processes::{
    sensors: rand_process::{
        $data: { /* sensor data */ }
    }
}

Multiple Datasets

rand_processes::{
    users: rand_process::{
        $data: { /* user data */ }
    },
    
    orders: rand_process::{
        $data: { /* order data */ }
    }
}

Dynamic Datasets

Create multiple related datasets:

rand_processes::{
    $n: UniformU8::{ low: 3, high: 8 },
    
    // Creates client_1, client_2, ..., client_n datasets
    clients: $n::[
        'client_{ $@n }': rand_process::{
            $data: {
                client_id: '$@n',
                // ... other fields
            }
        }
    ]
}

Reproducibility and Determinism

One of Beamline’s key strengths is its ability to generate reproducible data.

Seeds

Seeds control the random number generation:

# Same seed = same data
beamline gen data --seed 42 --start-auto --script-path my-script.ion
beamline gen data --seed 42 --start-auto --script-path my-script.ion  # Identical output

Timestamps

Control the simulation start time:

# Same timestamp = same temporal patterns
beamline gen data --seed 42 --start-iso "2024-01-01T00:00:00Z" --script-path my-script.ion

Deterministic Behavior

Beamline ensures that:

  • Same inputs always produce same outputs
  • Random sequences are predictable and reproducible
  • Debugging is possible with consistent data
  • Tests can be reliable and repeatable

Static vs. Dynamic Data

Beamline supports both static and dynamic data generation:

Dynamic Data (Default)

Generated during simulation with temporal patterns:

rand_process::{
    $arrival: HomogeneousPoisson:: { interarrival: minutes::5 },
    $data: {
        timestamp: Instant,
        value: UniformF64
    }
}

Static Data

Generated once at the beginning of simulation:

static_data::{
    $data: {
        id: UUID,
        created_at: Instant,  // Will be simulation start time
        config: LoremIpsum
    }
}

Use cases for static data:

  • Reference tables
  • Configuration data
  • Lookup tables
  • Master data
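The split between the two modes can be pictured with a toy loop (plain Python, illustrative names only): static data is stamped once with the simulation start time, while dynamic rows are emitted as the virtual clock advances.

```python
import random
from datetime import datetime, timedelta

rng = random.Random(7)
sim_time = datetime(2024, 1, 1)  # simulation start

# Static data: generated exactly once, stamped with the start time.
static_row = {"id": rng.randrange(10**6), "created_at": sim_time}

# Dynamic data: one row per arrival, with the clock advancing each time.
dynamic_rows = []
for _ in range(3):
    sim_time += timedelta(minutes=rng.uniform(1, 10))  # next arrival
    dynamic_rows.append({"timestamp": sim_time, "value": rng.random()})

assert static_row["created_at"] == datetime(2024, 1, 1)
assert all(r["timestamp"] > datetime(2024, 1, 1) for r in dynamic_rows)
```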

Summary

Understanding these core concepts is crucial for effectively using Beamline:

  1. Stochastic Processes: Mathematical foundation for realistic data patterns
  2. Random Processes: Implementation of stochastic processes in Beamline
  3. Arrival Processes: Control timing of data generation
  4. Data Generators: Create realistic values using probability distributions
  5. Variables: Enable relationships and reuse in data generation
  6. Datasets: Organize generated data into meaningful collections
  7. Reproducibility: Ensure consistent, debuggable data generation
  8. Static vs. Dynamic: Choose appropriate data generation patterns

In the next chapter, we’ll dive deeper into scripts and random processes, exploring how to create more sophisticated data generation patterns and relationships.

Reproducible Data Generation

One of Beamline’s core strengths is its ability to generate reproducible data — the same input parameters will always produce exactly the same output data, no matter when or where you run the generation process.

What is Reproducibility?

Reproducible data generation means that given the same:

  • Seed value (random number generator seed)
  • Configuration parameters (Ion script, generators, etc.)
  • Timestamp (starting time for temporal data)
  • Environment (same version of Beamline)

You will get exactly the same data, value for value, every single time.

Why Reproducibility Matters

Debugging and Testing

# First run - discovers a bug with specific data
beamline gen data --seed 12345 --start-auto --script-path my_script.ion

# Later run - reproduce exact same data to debug
beamline gen data --seed 12345 --start-auto --script-path my_script.ion

When you find a bug or unexpected behavior in your tests, reproducibility lets you generate the exact same problematic data to investigate and fix the issue.

Consistent Benchmarking

# Performance test run 1
beamline gen data --seed 42 --start-auto --sample-count 1000000 --script-path perf_test.ion

# Performance test run 2 (weeks later)  
beamline gen data --seed 42 --start-auto --sample-count 1000000 --script-path perf_test.ion

For meaningful performance comparisons, you need identical datasets. Reproducibility ensures your benchmarks are comparing like with like.

AI Model Training

# Training dataset generation
beamline gen data --seed 789 --start-auto --script-path training_data.ion --sample-count 50000

# Later: regenerate exact same training data for model comparison
beamline gen data --seed 789 --start-auto --script-path training_data.ion --sample-count 50000

When training machine learning models, being able to regenerate identical training data is crucial for comparing model performance and reproducing results.

Regression Testing

# Original test data
beamline gen data --seed 2024 --start-auto --script-path integration_test.ion

# After code changes - same test data to verify no regressions
beamline gen data --seed 2024 --start-auto --script-path integration_test.ion

Regression testing requires the same test data to verify that code changes don’t break existing functionality.

How Seeds Work

Pseudorandom Number Generation

Beamline uses cryptographically secure pseudorandom number generators (PRNGs) that are initialized with a seed value:

# Different seeds = different data
beamline gen data --seed 1 --start-auto --script-path test.ion    # Generates dataset A
beamline gen data --seed 2 --start-auto --script-path test.ion    # Generates dataset B

# Same seed = identical data  
beamline gen data --seed 1 --start-auto --script-path test.ion    # Generates dataset A (identical)
beamline gen data --seed 1 --start-auto --script-path test.ion    # Generates dataset A (identical)
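The same-seed, same-stream behavior is a general property of deterministic PRNGs. Python's `random` module illustrates the idea (an analogy only; Beamline uses its own generator):

```python
import random

def sample(seed, n=10):
    rng = random.Random(seed)          # the seed fully determines the stream
    return [rng.randint(0, 99) for _ in range(n)]

assert sample(1) == sample(1)   # same seed: identical "dataset"
assert sample(1) != sample(2)   # different seed: different dataset
```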

Seed Propagation

Seeds propagate through the entire generation process:

  • Data generators use the seed for all random decisions
  • Stochastic processes use the seed for temporal modeling
  • Nested structures maintain seed consistency across all levels

Automatic Seeds

If you don't want to choose a seed yourself, --seed-auto lets Beamline pick one for you. A seed picked this way is not guaranteed to repeat between runs, so treat auto-seeded output as non-reproducible:

# These runs may generate different data (auto-selected seeds)
beamline gen data --seed-auto --start-auto --script-path my_script.ion
beamline gen data --seed-auto --start-auto --script-path my_script.ion

# This generates reproducible data (explicit seed)
beamline gen data --seed 999 --start-auto --script-path my_script.ion

Reproducibility Scope

What IS Reproduced

  • ✅ Data Values: All generated numbers, strings, booleans, etc.
  • ✅ Data Structure: Object nesting, array lengths, field presence
  • ✅ Temporal Patterns: Event timestamps and intervals
  • ✅ Statistical Distributions: Same distribution samples
  • ✅ Relationships: Cross-field correlations and dependencies

What Might VARY

  • ❌ Beamline Version: Different versions may produce different output
  • ❌ System Architecture: 32-bit vs. 64-bit builds might have subtle differences
  • ❌ Floating Point: Different CPUs might have tiny precision differences
  • ❌ Ion Formatting: Whitespace and formatting might vary slightly

Best Practices

1. Always Specify Seeds for Important Use Cases

# Good - explicit seed for reproducible testing
beamline gen data --seed 12345 --start-auto --script-path test_suite.ion

# Avoid - an auto-selected seed may differ between runs
beamline gen data --seed-auto --start-auto --script-path test_suite.ion

2. Document Your Seeds

# Document seeds in your scripts or README
# Training data: seed 2024
# Test data: seed 2025  
# Performance benchmark: seed 3000

3. Use Meaningful Seed Values

# Use dates, version numbers, or meaningful identifiers
beamline gen data --seed 20241212 --start-auto --script-path data.ion  # Today's date
beamline gen data --seed 100 --start-auto --script-path v1.0.0.ion    # Version-based

4. Pin Beamline Version for Critical Use Cases

# In your Cargo.toml or requirements
partiql-beamline = "=1.2.3"  # Exact version for reproducibility

5. Store Configuration Alongside Data

# Save configuration for later reproduction
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path production_test.ion \
  --sample-count 1000 \
  --output-format ion-pretty > data.ion
  
# Document generation parameters separately
echo "Seed: 42, Script: production_test.ion, Count: 1000" > config.txt

Examples

Basic Reproducibility

# Generate same data multiple times
$ beamline gen data --seed 100 --start-auto --sample-count 3 --script-path simple.ion
[1, 2, 5]

$ beamline gen data --seed 100 --start-auto --sample-count 3 --script-path simple.ion  
[1, 2, 5]  # Identical output

$ beamline gen data --seed 101 --start-auto --sample-count 3 --script-path simple.ion
[7, 1, 9]  # Different seed = different data

Complex Structure Reproducibility

# Complex nested structures are also reproducible
$ beamline gen data --seed 200 --start-auto --sample-count 1 --script-path complex.ion
{
  id: 42,
  name: "Alice Johnson", 
  scores: [85, 92, 78],
  metadata: {
    timestamp: 2024-01-15T10:30:00Z,
    active: true
  }
}

# Run again with same seed
$ beamline gen data --seed 200 --start-auto --sample-count 1 --script-path complex.ion
{
  id: 42,
  name: "Alice Johnson",     # Identical name  
  scores: [85, 92, 78],      # Identical scores
  metadata: {
    timestamp: 2024-01-15T10:30:00Z,  # Identical timestamp
    active: true                      # Identical boolean
  }
}

Time-based Reproducibility

# Even temporal data is reproducible
$ beamline gen data --seed 300 --start-iso "2024-01-01T00:00:00Z" --script-path events.ion --sample-count 3
[
  { event: "login", time: "2024-01-01T00:12:34Z" },
  { event: "action", time: "2024-01-01T00:15:47Z" }, 
  { event: "logout", time: "2024-01-01T00:23:12Z" }
]

# Same seed + same start time = identical temporal patterns
$ beamline gen data --seed 300 --start-iso "2024-01-01T00:00:00Z" --script-path events.ion --sample-count 3
[
  { event: "login", time: "2024-01-01T00:12:34Z" },   # Same intervals
  { event: "action", time: "2024-01-01T00:15:47Z" },  # Same timestamps
  { event: "logout", time: "2024-01-01T00:23:12Z" }   # Exact reproduction
]

Troubleshooting Reproducibility

Issue: Getting Different Data with Same Seed

Possible Causes:

  1. Different Beamline versions
  2. Different script files
  3. Different command-line parameters
  4. Different system architectures

Solution:

# Check version
beamline --version

# Use exact same command-line parameters
beamline gen data --seed 123 --start-auto --sample-count 100 --script-path exact_same_script.ion

# Verify script file hasn't changed (use checksums)
sha256sum my_script.ion

Issue: Need to Break Reproducibility

Sometimes you want different data each run:

# Use current timestamp as seed
beamline gen data --seed $(date +%s) --start-auto --script-path varied_data.ion

# Use random seed
beamline gen data --seed $RANDOM --start-auto --script-path varied_data.ion

# Let Beamline generate a random seed
beamline gen data --seed-auto --start-auto --script-path varied_data.ion

Next Steps

Now that you understand reproducible data generation, you’re ready to learn about Scripts and Processes, which will show you how to configure and control the data generation process through Ion-based scripts.

Scripts and Random Processes

Beamline uses Ion-based scripts to define data generation configurations and stochastic processes to model how data arrives and evolves over time. This combination provides powerful, flexible control over synthetic data generation.

Ion Scripts Overview

What are Ion Scripts?

Ion scripts are configuration files written in Amazon Ion format that define:

  • What data to generate (data types, structures, values)
  • How data arrives (temporal patterns, frequencies)
  • How data relates (cross-field dependencies, correlations)
  • How much data (counts, durations, stopping conditions)

Basic Script Structure

Every Beamline script follows this structure:

rand_processes::{
    // Variable definitions (optional)
    $variable_name: GeneratorType::{ configuration },
    
    // Dataset definitions (required)
    dataset_name: rand_process::{
        $arrival: ArrivalProcess::{ configuration },
        $data: {
            field_name: GeneratorType::{ configuration },
            // ... more fields
        }
    }
}

Real Example from Test Suite

From sensors.ion test script:

rand_processes::{
    $n: UniformU8::{ low: 2, high: 10 },

    sensors: $n::[
        rand_process::{
            $r: Uniform::{ choices: [5,10] },
            $arrival: HomogeneousPoisson:: { interarrival: minutes::$r },
            $weight: UniformDecimal::{ nullable: 0.75, low: 1.995, high: 4.9999, optional: true },
            $data: {
                tick: Tick,
                i8: UniformI8,
                f: UniformF64,
                w: $weight,
                d: UniformDecimal::{ low: 0d0, high: 4.2d1, nullable: false }
            }
        }
    ]
}

Ion Format Benefits

Ion provides several advantages for configuration:

  • Type Safety: Native support for numbers, strings, booleans, timestamps
  • Comments: Document your configuration inline with //
  • Annotations: Add type annotations like minutes::$r
  • Nested Structures: Define complex object hierarchies naturally
  • Variable References: Use $variable for reusable components

Stochastic Processes

What are Stochastic Processes?

Stochastic processes are mathematical models that describe how events occur over time in a seemingly random but statistically predictable way. In Beamline, they’re defined using the $arrival field in rand_process blocks.

Arrival Process Types

1. Homogeneous Poisson Process

Models events that occur at a constant average rate with random intervals:

rand_processes::{
    sensor_readings: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: minutes::5 },
        $data: {
            sensor_id: UUID,
            reading: UniformF64::{ low: 0.0, high: 100.0 },
            timestamp: Instant
        }
    }
}

Time Units:

  • milliseconds::N - an average of N milliseconds between events
  • seconds::N - an average of N seconds between events
  • minutes::N - an average of N minutes between events
  • hours::N - an average of N hours between events
  • days::N - an average of N days between events

Use Cases:

  • User logins to a website
  • Network packet arrivals
  • Customer service calls
  • Sensor readings

2. Variable Arrival Rates

Use variables to create dynamic arrival patterns:

rand_processes::{
    user_events: rand_process::{
        $r: Uniform::{ choices: [2, 5, 10] },  // Variable rate
        $arrival: HomogeneousPoisson:: { interarrival: minutes::$r },
        $data: {
            event_type: Uniform::{ choices: ["login", "logout", "action"] },
            user_id: UUID
        }
    }
}

Data Generators

Basic Generator Types

From the actual implementation:

Numeric Generators

rand_processes::{
    numeric_data: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            // Integer generators
            small_int: UniformI8,                                    // full i8 range: -128 to 127
            medium_int: UniformI16::{ low: 100, high: 1000 },       // Custom range
            large_int: UniformU32::{ low: 1, high: 1000000 },       // Unsigned
            
            // Float generators  
            decimal_value: UniformDecimal::{ low: 1.99, high: 99.99 },  // Exact decimal
            float_value: UniformF64::{ low: 0.0, high: 1.0 },           // Float
            
            // Statistical distributions
            normal_score: NormalF64::{ mean: 100.0, std_dev: 15.0 },
            exponential_wait: ExpF64::{ rate: 0.1 },
            weibull_lifetime: WeibullF64::{ shape: 2.0, scale: 1000.0 }
        }
    }
}

String Generators

rand_processes::{
    text_data: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::2 },
        $data: {
            // UUID generator
            id: UUID,
            
            // Lorem Ipsum text
            description: LoremIpsum::{ min_words: 5, max_words: 20 },
            title: LoremIpsumTitle,  // 3-8 title-cased words
            
            // Regular expressions
            country_code: Regex::{ pattern: "[A-Z]{2}" },
            phone: Regex::{ pattern: "[0-9]{3}-[0-9]{3}-[0-9]{4}" },
            
            // Format strings with variables
            formatted_name: Format::{ pattern: "User #{UUID}" }
        }
    }
}

System Generators

rand_processes::{
    system_data: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            // System state generators
            current_time: Instant,      // Current simulation time
            current_date: Date,         // Current simulation date
            event_tick: Tick,           // Current tick counter
            
            // Boolean generator
            active: Bool,               // 50% true by default
            premium: Bool::{ p: 0.1 }   // 10% true
        }
    }
}

Complex Type Generators

rand_processes::{
    complex_data: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::5 },
        $data: {
            // Array generator
            measurements: UniformArray::{
                min_size: 3,
                max_size: 8,
                element_type: UniformF64::{ low: 0.0, high: 100.0 }
            },
            
            // Union type generator (any of several types)
            mixed_value: UniformAnyOf::{
                types: [
                    UUID,
                    UniformI32::{ low: 1, high: 1000 },
                    LoremIpsumTitle
                ]
            },
            
            // Choice from literals
            status: Uniform::{ choices: ["active", "inactive", "pending"] }
        }
    }
}
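One way to picture the composite generators: UniformArray first draws a length uniformly, then one element per slot; UniformAnyOf first draws a branch uniformly, then samples it. A Python sketch under those assumptions (illustrative names, not Beamline API):

```python
import random

rng = random.Random(3)

def uniform_array(element_gen, min_size, max_size):
    """UniformArray analogy: uniform length, then one element per slot."""
    return [element_gen() for _ in range(rng.randint(min_size, max_size))]

def uniform_any_of(generators):
    """UniformAnyOf analogy: choose one branch generator uniformly."""
    return rng.choice(generators)()

measurements = uniform_array(lambda: rng.uniform(0.0, 100.0), 3, 8)
mixed = uniform_any_of([lambda: rng.randint(1, 1000), lambda: "title"])

assert 3 <= len(measurements) <= 8
assert all(0.0 <= m <= 100.0 for m in measurements)
```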

Advanced Script Features

Variable Definitions and References

From the real client-service.ion script:

rand_processes::{
    // Define reusable generators
    $n: UniformU8::{ low: 5, high: 20 },
    $id_gen: UUID,
    $rid_gen: UUID,
    
    requests: $n::[
        {
            // Force evaluation at script read time
            $id: $id_gen::(),
            $rate: UniformF64::{ low: 0.995e0, high: 1.0e0 },
            $success: Bool::{ p: $rate },
            
            service: rand_process::{
                $r: UniformU8::{ low: 20, high: 150 },
                $arrival: HomogeneousPoisson:: { interarrival: milliseconds::$r },
                $data: {
                    Request: $rid_gen,
                    Account: $id,
                    client: Format::{ pattern: "customer #{ $@n }" },
                    success: $success
                }
            }
        }
    ]
}

Key concepts:

  • Variables: $variable_name for reusable generators
  • Forced evaluation: $id_gen::() evaluates once at script read time
  • Loop arrays: $n::[...] creates N instances
  • Loop index: $@n accesses current iteration index
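Loosely, `$n::[...]` can be read as: draw N once, then instantiate the template N times with `$@n` bound to the loop index. A rough Python analogy (whether the index is 0-based or 1-based is an assumption here, as are all names):

```python
import random

rng = random.Random(11)

n = rng.randint(5, 20)  # analogy for $n: UniformU8::{ low: 5, high: 20 }

# Analogy for `requests: $n::[ ... ]` with `$@n` as the loop index:
# the dataset name and Format pattern both interpolate the index.
datasets = {
    f"client_{i}": {"client": f"customer #{i}"}
    for i in range(1, n + 1)
}

assert len(datasets) == n
assert datasets["client_1"]["client"] == "customer #1"
```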

Static Data

From the orders.ion test script:

rand_processes::{
    $n: UniformU8::{ low: 5, high: 20 },
    $id_gen: UUID,
    
    customers: $n::[
        {
            $id: $id_gen::(),
            
            // Static data - generated once at simulation start
            customer_table: static_data::{
                $data: {
                    id: $id,
                    address: Format::{ pattern: "{ $@n } Foo Bar Ave" }
                }
            },
            
            // Dynamic data - generated over time
            orders: rand_process::{
                $arrival: HomogeneousPoisson:: { interarrival: days::UniformU8::{ low: 1, high: 150 } },
                $data: {
                    Order: UUID,
                    Customer: $id,
                    Time: Instant
                }
            }
        }
    ]
}

Nullability and Optionality

Real syntax from test scripts:

rand_processes::{
    nullable_data: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            // 75% chance of NULL, can also be MISSING
            weight: UniformDecimal::{ 
                nullable: 0.75, 
                optional: true,
                low: 1.995, 
                high: 4.9999 
            },
            
            // Never NULL
            id: UUID::{ nullable: false },
            
            // 10% chance of MISSING (field won't appear)
            optional_field: UniformI32::{ optional: 0.1, low: 1, high: 100 }
        }
    }
}
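One way to model the two knobs is as two coin flips per field: first decide whether the field is MISSING (omitted entirely), then whether it is NULL. The ordering is an assumption, not confirmed Beamline semantics; the sketch below only illustrates the idea:

```python
import random

rng = random.Random(5)

def gen_field(value_gen, nullable=0.0, optional=0.0):
    """Return (present, value): maybe omit the field, maybe null it."""
    if rng.random() < optional:
        return (False, None)          # MISSING: field won't appear at all
    if rng.random() < nullable:
        return (True, None)           # field present, but NULL
    return (True, value_gen())

rows = []
for _ in range(10_000):
    present, value = gen_field(lambda: rng.uniform(1.995, 4.9999),
                               nullable=0.75, optional=0.1)
    row = {}
    if present:
        row["weight"] = value
    rows.append(row)

missing = sum("weight" not in r for r in rows) / len(rows)
null = sum("weight" in r and r["weight"] is None for r in rows) / len(rows)
assert 0.07 < missing < 0.13   # roughly 10% MISSING
assert 0.60 < null < 0.75      # roughly 75% NULL of the remaining ~90%
```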

Real Script Examples

Simple Sensor Script

Based on the actual sensors.ion test:

rand_processes::{
    $n: UniformU8::{ low: 2, high: 10 },

    sensors: $n::[
        rand_process::{
            $r: Uniform::{ choices: [5,10] },
            $arrival: HomogeneousPoisson:: { interarrival: minutes::$r },
            $weight: UniformDecimal::{ nullable: 0.75, low: 1.995, high: 4.9999, optional: true },
            $data: {
                tick: Tick,
                i8: UniformI8,
                f: UniformF64,
                w: $weight,
                d: UniformDecimal::{ low: 0d0, high: 4.2d1, nullable: false }
            }
        }
    ]
}

Test this script:

target/release/beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path partiql-beamline-sim/tests/scripts/sensors.ion \
  --sample-count 10 \
  --output-format ion-pretty

Client-Service System

Based on client-service.ion test:

rand_processes::{
    // Generate between 5 & 20 customers
    $n: UniformU8::{ low: 5, high: 20 },

    // Shared generators
    $id_gen: UUID,
    $rid_gen: UUID,

    requests: $n::[
        {
            // Each customer gets unique ID
            $id: $id_gen::(),
            $rate: UniformF64::{ low: 0.995e0, high: 1.0e0 },
            $success: Bool::{ p: $rate },

            // Service dataset
            service: rand_process::{
                $r: UniformU8::{ low: 20, high: 150 },
                $arrival: HomogeneousPoisson:: { interarrival: milliseconds::$r },
                $data: {
                    Request: $rid_gen,
                    StartTime: Instant,
                    Program: "FancyService",
                    Operation: "GetMyData",
                    Account: $id,
                    client: Format::{ pattern: "customer #{ $@n }" },
                    success: $success
                }
            },

            // Individual client datasets
            'client_{ $@n }': rand_process::{
                $arrival: HomogeneousPoisson:: { interarrival: milliseconds::$r },
                $data: {
                    id: $id,
                    request_time: Instant,
                    request_id: $rid_gen,
                    success: $success
                }
            }
        }
    ]
}

Transaction Data Script

Based on simple_transactions.ion test:

rand_processes::{
    test_data: rand_process::{
        $r: Uniform::{ choices: [5,10] },
        $arrival: HomogeneousPoisson:: { interarrival: milliseconds::$r },
        
        $data: {
            transaction_id: UUID::{ nullable: false },
            marketplace_id: UniformU8::{ nullable: false },
            country_code: Regex::{ pattern: "[A-Z]{2}" },
            created_at: Instant,
            completed: Bool,
            description: LoremIpsum::{ min_words:10, max_words:200 },
            price: UniformDecimal::{ low: 2.99, high: 99999.99, optional: true }
        }
    }
}

Advanced Script Patterns

Complex Statistical Distributions

From numbers.ion test script:

rand_processes::{
    test_data: rand_process::{
        $r: Uniform::{ choices: [5,10] },
        $arrival: HomogeneousPoisson:: { interarrival: milliseconds::$r },

        $data: {
            uniform: {
                // Uniform distributions
                uniform_u8: UniformU8::{ low: 13, high: 42 },
                uniform_f64: UniformF64::{ low: -13.0, high: 42.0 },
                uniform_decimal: UniformDecimal::{ low: 0.995, high: 499.9999 }
            },

            statistical: {
                // Statistical distributions
                normal: NormalF64::{ mean: 14.3, std_dev: 3.0 },
                lognormal: LogNormalF64::{ location: 14.3, scale: 3.0 },
                weibull: WeibullF64::{ shape: 14.3, scale: 3.0 },
                exponential: ExpF64::{ rate: 3.0 }
            },
            
            // With nullability and optionality
            nullable_field: UniformI32::{ 
                nullable: 0.2,    // 20% NULL
                optional: 0.1,    // 10% MISSING
                low: 1, 
                high: 100 
            }
        }
    }
}
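For intuition about the parameters, Python's `random` module offers analogous samplers. The parameter mapping below is an assumption for illustration; Beamline's parameter names need not correspond exactly to Python's:

```python
import random

rng = random.Random(9)

samples = {
    # NormalF64::{ mean, std_dev }        ~ gauss(mu, sigma)
    "normal": rng.gauss(14.3, 3.0),
    # LogNormalF64::{ location, scale }   ~ lognormvariate(mu, sigma)
    "lognormal": rng.lognormvariate(14.3, 3.0),
    # WeibullF64::{ shape, scale }        ~ weibullvariate(alpha=scale, beta=shape)
    "weibull": rng.weibullvariate(3.0, 14.3),
    # ExpF64::{ rate }                    ~ expovariate(lambd=rate)
    "exponential": rng.expovariate(3.0),
}

# Log-normal, Weibull, and exponential samples are always positive.
assert samples["lognormal"] > 0 and samples["weibull"] > 0
assert samples["exponential"] > 0
```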

Multiple Datasets with Relationships

Real pattern from client-service.ion:

rand_processes::{
    $n: UniformU8::{ low: 5, high: 20 },
    $id_gen: UUID,

    requests: $n::[
        {
            $id: $id_gen::(),  // One ID per customer
            $rid_gen: UUID,    // Separate request ID generator per customer
            
            // Shared service dataset
            service: rand_process::{
                $arrival: HomogeneousPoisson:: { interarrival: milliseconds::50 },
                $data: {
                    Request: $rid_gen,
                    StartTime: Instant,
                    Account: $id,
                    client: Format::{ pattern: "customer #{ $@n }" }
                }
            },
            
            // Individual client dataset for this customer
            'client_{ $@n }': rand_process::{
                $arrival: HomogeneousPoisson:: { interarrival: milliseconds::50 },
                $data: {
                    id: $id,
                    request_time: Instant,
                    request_id: $rid_gen
                }
            }
        }
    ]
}

Static Data with Dynamic References

From orders.ion test script:

rand_processes::{
    $n: UniformU8::{ low: 5, high: 20 },
    $id_gen: UUID,
    $oid_gen: UUID,

    customers: $n::[
        {
            $id: $id_gen::(),

            // Static customer data (generated once)
            customer_table: static_data::{
                $data: {
                    id: $id,
                    address: Format::{ pattern: "{ $@n } Foo Bar Ave" }
                }
            },

            // Dynamic orders (generated over time)  
            orders: rand_process::{
                $r: UniformU8::{ low: 1, high: 150 },
                $arrival: HomogeneousPoisson:: { interarrival: days::$r },
                $data: {
                    Order: $oid_gen,
                    Time: Instant,
                    Customer: $id  // Links to customer_table
                }
            }
        }
    ]
}

Script Testing and Validation

Testing Script Syntax

# Test script with minimal data generation
target/release/beamline gen data \
  --seed 1 \
  --start-auto \
  --script-path new_script.ion \
  --sample-count 1

# Check inferred schema
target/release/beamline infer-shape \
  --seed 1 \
  --start-auto \
  --script-path new_script.ion \
  --output-format basic-ddl

Testing with Small Samples

# Test each dataset individually
target/release/beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path complex_script.ion \
  --sample-count 5 \
  --dataset specific_dataset

# Test all datasets with small sample
target/release/beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path complex_script.ion \
  --sample-count 5 \
  --output-format text

Best Practices

1. Use Real Test Script Patterns

// Good - follows actual Beamline syntax
rand_processes::{
    $arrival_rate: Uniform::{ choices: [5, 10] },
    
    events: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: minutes::$arrival_rate },
        $data: {
            event_id: UUID,
            timestamp: Instant,
            value: UniformF64::{ low: 0.0, high: 100.0 }
        }
    }
}

2. Test Scripts Incrementally

# Start with basic structure
echo 'rand_processes::{ test: rand_process::{ $arrival: HomogeneousPoisson:: { interarrival: seconds::1 }, $data: { id: UUID } } }' > minimal.ion

# Test basic structure
target/release/beamline gen data --seed 1 --start-auto --script-path minimal.ion --sample-count 3

3. Use Meaningful Variable Names

rand_processes::{
    // Clear variable names
    $customer_count: UniformU8::{ low: 10, high: 50 },
    $order_frequency: Uniform::{ choices: [1, 3, 7] },  // Days
    $customer_id_generator: UUID,
    
    orders: $customer_count::[
        rand_process::{
            $arrival: HomogeneousPoisson:: { interarrival: days::$order_frequency },
            $data: {
                customer_id: $customer_id_generator,
                order_time: Instant
            }
        }
    ]
}

4. Document Complex Patterns

rand_processes::{
    // === Customer Simulation Configuration ===
    // Generate 10-50 customers, each placing orders every 1-30 days
    
    $customer_count: UniformU8::{ low: 10, high: 50 },
    $shared_customer_id: UUID,
    
    customer_orders: $customer_count::[
        {
            // Each customer gets unique ID for all their orders
            $id: $shared_customer_id::(),
            
            // Customer places orders with variable frequency
            orders: rand_process::{
                $days_between_orders: UniformU8::{ low: 1, high: 30 },
                $arrival: HomogeneousPoisson:: { interarrival: days::$days_between_orders },
                $data: {
                    customer_id: $id,
                    order_id: UUID,
                    order_time: Instant,
                    amount: UniformDecimal::{ low: 10.00, high: 500.00 }
                }
            }
        }
    ]
}

Common Script Errors and Solutions

Error: Invalid Ion Syntax

// Wrong - missing closing brace
rand_processes::{
    test: rand_process::{
        $data: { id: UUID }
    // Missing closing braces for rand_process and rand_processes

Error: Missing Required Fields

// Wrong - missing $arrival
rand_processes::{
    test: rand_process::{
        $data: { id: UUID }  // Missing $arrival definition
    }
}

// Correct
rand_processes::{
    test: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: { id: UUID }
    }
}

Error: Invalid Generator Configuration

// Wrong - low > high
rand_processes::{
    test: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            bad_range: UniformI32::{ low: 100, high: 50 }  // Invalid
        }
    }
}

// Correct
rand_processes::{
    test: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            good_range: UniformI32::{ low: 50, high: 100 }
        }
    }
}

Performance Optimization

Efficient Generator Usage

rand_processes::{
    // Efficient - reuse expensive generators
    $expensive_distribution: NormalF64::{ mean: 100.0, std_dev: 15.0 },
    $simple_uuid: UUID,
    
    efficient_data: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            // Reuse expensive distribution
            score1: $expensive_distribution,
            score2: $expensive_distribution,
            score3: $expensive_distribution,
            
            // Simple generators are fast
            id: $simple_uuid,
            active: Bool,
            count: UniformI32::{ low: 1, high: 1000 }
        }
    }
}

Testing Commands

# Test with small samples first
target/release/beamline gen data \
  --seed 1 \
  --start-auto \
  --script-path new_script.ion \
  --sample-count 5 \
  --output-format text

# Scale up after validation
target/release/beamline gen data \
  --seed 1 \
  --start-auto \
  --script-path new_script.ion \
  --sample-count 10000 \
  --output-format ion-binary

Next Steps

Now that you understand real Ion scripts and stochastic processes, you’re ready to dive deeper into the Data Generation section, where you’ll learn about specific generator types, output formats, and advanced data modeling techniques using the actual Beamline syntax.

Data Generation Overview

Beamline’s data generation system creates synthetic data using stochastic processes and probability distributions. The system is built around three core concepts: random processes, value generators, and temporal modeling.

Architecture Overview

Data generation in Beamline follows a layered architecture:

  1. Random Processes — Mathematical models that describe how events occur over time
  2. Value Generators — Components that create specific data types and values
  3. Arrival Times — Models for when events occur in the simulation
  4. Simulation Context — Manages state, timing, and reproducibility

Core Concepts

Random Processes

A Random Process (also called a Stochastic Process) is a mathematical model of a system that appears to vary randomly over time. In Beamline, these processes control:

  • When data arrives (temporal patterns)
  • What data is generated (value types and structures)
  • How data relates (cross-field dependencies)

Value Generators

Value Generators are the building blocks that create actual data values. They can generate:

  • Scalar values: numbers, strings, booleans, timestamps
  • Complex structures: objects, arrays, nested data
  • Statistical distributions: normal, exponential, Weibull, etc.
  • Specialized types: UUIDs, formatted text, regex patterns

Each generator can be configured for:

  • Nullability: Probability of generating NULL values
  • Optionality: Probability of generating MISSING values
  • Value ranges: Minimum and maximum bounds
  • Distribution parameters: Mean, standard deviation, shape, scale

Temporal Modeling

Beamline models data generation as events occurring over time using:

  • Arrival processes: When events occur (e.g. Poisson Point Process)
  • Simulation time: Virtual time that advances as events are generated
  • Tick counters: Global state that increments with each event
  • Instant generators: Current simulation time when values are created
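
These pieces can be sketched in miniature. The class and field names below are illustrative only, not Beamline internals:

```python
import random

class SimContext:
    """Toy simulation context: a virtual clock plus a global tick counter."""

    def __init__(self, seed):
        self.rng = random.Random(seed)  # reproducibility comes from the seed
        self.time = 0.0                 # virtual simulation time, in seconds
        self.tick = 0                   # global event counter

    def next_event(self, mean_interarrival):
        # Advance virtual time by an exponentially distributed gap,
        # then stamp the event with the current tick and instant.
        self.time += self.rng.expovariate(1.0 / mean_interarrival)
        self.tick += 1
        return {"tick": self.tick, "instant": self.time}

ctx = SimContext(seed=1)
events = [ctx.next_event(60.0) for _ in range(3)]  # one event per ~minute
print(events)
```

Re-running with the same seed reproduces the same event stream, which is the property Beamline relies on for debugging and validation.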

Ion Script Structure

All data generation is controlled through Amazon Ion scripts with this basic structure:

rand_processes::{
    // Variable definitions
    $variable_name: GeneratorType::{ configuration },
    
    // Dataset definitions
    dataset_name: dataset_configuration
}

Variable Definitions

Variables allow you to define generators once and reuse them:

rand_processes::{
    // Define reusable generators
    $id_generator: UUID,
    $weight_generator: UniformDecimal::{ low: 1.0, high: 10.0 },
    $count_range: UniformU8::{ low: 5, high: 20 },
    
    // Use variables in dataset definitions
    products: $count_range::[
        // ... uses $id_generator and $weight_generator
    ]
}

Dataset Configurations

Datasets can be configured in several ways:

1. Single Random Process

dataset_name: rand_process::{
    $arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
    $data: {
        id: UUID,
        value: UniformF64
    }
}

2. Static Data (Generated Once)

dataset_name: static_data::{
    $data: {
        id: UUID,
        name: LoremIpsumTitle
    }
}

3. Multiple Instances with Loops

$n: UniformU8::{ low: 2, high: 5 },

dataset_name: $n::[
    rand_process::{
        $data: {
            instance_id: '$@n',  // Current loop index
            value: UniformF64
        }
    }
]

Data Generator Types

Basic Generators

| Generator | Description             | PartiQL Type | Configuration                              |
|-----------|-------------------------|--------------|--------------------------------------------|
| Bool      | Boolean values          | BOOL         | p: f64 (probability of true, default: 0.5) |
| UUID      | UUID v4 identifiers     | STRING       | No configuration                           |
| Tick      | Current simulation tick | Int64        | No configuration                           |
| Instant   | Current simulation time | DATETIME     | No configuration                           |
| Date      | Current simulation date | DATETIME     | No configuration                           |

Numeric Generators

Uniform Integer Generators

// Unsigned integers
UniformU8::{ low: 0, high: 255 }           // 8-bit unsigned
UniformU16::{ low: 0, high: 65535 }        // 16-bit unsigned  
UniformU32::{ low: 0, high: 4294967295 }  // 32-bit unsigned
UniformU64::{ low: 0, high: 18446744073709551615 }  // 64-bit unsigned

// Signed integers
UniformI8::{ low: -128, high: 127 }        // 8-bit signed
UniformI16::{ low: -32768, high: 32767 }   // 16-bit signed
UniformI32::{ low: -2147483648, high: 2147483647 }  // 32-bit signed
UniformI64::{ low: -9223372036854775808, high: 9223372036854775807 }  // 64-bit signed

Floating Point Generators

// Uniform float
UniformF64::{ low: -127.0, high: 127.0 }

// Uniform decimal (exact arithmetic)
UniformDecimal::{ low: 0.995, high: 499.9999 }

Statistical Distribution Generators

// Normal distribution (bell curve)
NormalF64::{ mean: 100.0, std_dev: 15.0 }

// Log-normal distribution
LogNormalF64::{ location: 0.0, scale: 1.0 }

// Weibull distribution
WeibullF64::{ shape: 2.0, scale: 1.0 }

// Exponential distribution
ExpF64::{ rate: 1.0 }

String Generators

// Lorem Ipsum text
LoremIpsum::{ min_words: 10, max_words: 200 }

// Lorem Ipsum titles (3-8 words, title case)
LoremIpsumTitle

// Regular expression patterns
Regex::{ pattern: "[A-Z]{2}[0-9]{3}" }

// Format strings with variable substitution
Format::{ pattern: "User #{$@n}" }

Complex Type Generators

Arrays

UniformArray::{
    min_size: 1,
    max_size: 5,
    element_type: UniformI32::{ low: 1, high: 100 }
}

Union Types (Any Of)

UniformAnyOf::{
    types: [
        UUID,
        UniformI32::{ low: 1, high: 1000 },
        LoremIpsumTitle
    ]
}

Choice from Literals

Uniform::{ choices: [1, 2, 5, 10, 20] }

Nullability and Optionality

Every generator supports NULL and MISSING value generation:

Nullability (NULL values)

// 20% chance of NULL values
generator::{ nullable: 0.2 }

// Never NULL
generator::{ nullable: false }

// Always NULL (not useful, but possible)
generator::{ nullable: 1.0 }

Optionality (MISSING values)

// 10% chance of MISSING values  
generator::{ optional: 0.1 }

// Never MISSING
generator::{ optional: false }

// Always MISSING (field won't appear)
generator::{ optional: 1.0 }

Combined Configuration

// 20% NULL, 10% MISSING, 70% present values
price: UniformDecimal::{
    nullable: 0.2,
    optional: 0.1, 
    low: 9.99,
    high: 999.99
}
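
The script does not specify how these probabilities interact internally; one plausible reading, matching the comment above, treats MISSING and NULL as disjoint outcomes. A Python sketch of that interpretation (the sentinel and helper names are hypothetical):

```python
import random

MISSING = object()  # hypothetical sentinel for a field omitted entirely

def sample_field(rng, nullable=0.2, optional=0.1):
    # Disjoint outcomes: first MISSING, then NULL, otherwise a real value.
    r = rng.random()
    if r < optional:
        return MISSING
    if r < optional + nullable:
        return None
    return rng.uniform(9.99, 999.99)

rng = random.Random(0)
samples = [sample_field(rng) for _ in range(100_000)]
missing = sum(s is MISSING for s in samples) / len(samples)
null = sum(s is None for s in samples) / len(samples)
print(missing, null)  # close to 0.1 and 0.2
```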

Arrival Processes

Arrival processes control when events occur in simulation time. Beamline currently supports only the Homogeneous Poisson process:

Homogeneous Poisson Process

Statistically independent events occur at a constant average rate, with random intervals between them:

$arrival: HomogeneousPoisson::{ interarrival: minutes::5 }

Time units supported:

  • milliseconds::N - N milliseconds between events
  • seconds::N - N seconds between events
  • minutes::N - N minutes between events
  • hours::N - N hours between events
  • days::N - N days between events
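
A homogeneous Poisson process has exponentially distributed gaps between events, with mean equal to the configured interarrival. A rough stand-alone Python illustration (not Beamline's implementation):

```python
import random

def poisson_arrivals(mean_interarrival_s, n, seed=1):
    """Sample n arrival times of a homogeneous Poisson process."""
    rng = random.Random(seed)
    t, times = 0.0, []
    for _ in range(n):
        # Gaps are independent and exponential, with the configured mean.
        t += rng.expovariate(1.0 / mean_interarrival_s)
        times.append(t)
    return times

times = poisson_arrivals(300.0, 10_000)  # minutes::5 -> 300 seconds
print(times[:3])
```

Individual gaps vary widely, but the average gap converges to the configured interarrival as the sample grows.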

Variable References and Scope

Variable Definition and Usage

rand_processes::{
    $arrival: HomogeneousPoisson::{ interarrival: minutes::5 },

    // Define variables at top level
    $customer_id: UUID,
    $price_range: UniformDecimal::{ low: 9.99, high: 199.99 },
    
    orders: rand_process::{
        $data: {
            customer: $customer_id,     // Reference variable
            price: $price_range,        // Reference variable
            order_id: UUID              // Direct generator
        }
    }
}

Forced Evaluation with ::()

Force generator evaluation at script read time (not generation time):

rand_processes::{
    $id_gen: UUID,
    
    customers: 3::[
        {
            // Each customer gets the same ID across all their records
            $id: $id_gen::(),  // Evaluated once per customer
            
            customer_profile: static_data::{
                $data: {
                    id: $id,           // Same ID for this customer
                    name: LoremIpsumTitle
                }
            },
            
            transactions: rand_process::{
                $data: {
                    customer_id: $id,  // Same ID for this customer
                    transaction_id: UUID,  // New UUID per transaction
                    amount: UniformDecimal::{ low: 10.0, high: 500.0 }
                }
            }
        }
    ]
}
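
The difference between referencing a generator and forcing it with ::() has a direct analogue in Python: passing a function around versus calling it once and reusing the result. Purely illustrative:

```python
import uuid

id_gen = uuid.uuid4    # reference to the generator itself
forced = uuid.uuid4()  # evaluated once up front, like $id_gen::()

record_a = {"transaction_id": id_gen(), "customer_id": forced}
record_b = {"transaction_id": id_gen(), "customer_id": forced}

assert record_a["customer_id"] == record_b["customer_id"]   # shared, evaluated once
assert record_a["transaction_id"] != record_b["transaction_id"]  # fresh per record
print("ok")
```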

Loop Index Variable $@n

Access the current loop index in array definitions:

$n: UniformU8::{ low: 3, high: 7 },

clients: $n::[
    {
        'client_$@n': rand_process::{  // Dynamic dataset name
            $data: {
                client_number: '$@n',   // Current index as value
                name: Format::{ pattern: "Client #{$@n}" }
            }
        }
    }
]

Some Real Examples

Simple Sensor Data

rand_processes::{
    $n: UniformU8::{ low: 2, high: 10 },

    sensors: $n::[
        rand_process::{
            $r: Uniform::{ choices: [5,10] },
            $arrival: HomogeneousPoisson::{ interarrival: minutes::$r },
            $data: {
                tick: Tick,
                i8: UniformI8,
                f: UniformF64
            }
        }
    ]
}

Complex Statistical Data

rand_processes::{
    test_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: milliseconds::100 },
        $data: {
            // Statistical distributions
            normal_score: NormalF64::{ mean: 100.0, std_dev: 15.0 },
            exponential_wait: ExpF64::{ rate: 0.1 },
            weibull_lifetime: WeibullF64::{ shape: 2.0, scale: 1000.0 },
            
            // Arrays with statistical elements
            measurements: UniformArray::{
                min_size: 5,
                max_size: 10,
                element_type: NormalF64::{ mean: 50.0, std_dev: 5.0 }
            },
            
            // Union types
            mixed_value: UniformAnyOf::{
                types: [
                    NormalF64::{ mean: 0.0, std_dev: 1.0 },
                    UniformI32::{ low: 1, high: 100 },
                    UUID
                ]
            }
        }
    }
}

Static and Dynamic Data Combination

rand_processes::{
    $n: UniformU8::{ low: 5, high: 20 },
    $id_gen: UUID,

    customers: $n::[
        {
            $id: $id_gen::(),  // One ID per customer
            
            // Static customer data (generated once)
            customer_table: static_data::{
                $data: {
                    id: $id,
                    address: Format::{ pattern: "{$@n} Main Street" }
                }
            },
            
            // Dynamic order data (generated over time)
            orders: rand_process::{
                $r: UniformU8::{ low: 1, high: 30 },
                $arrival: HomogeneousPoisson::{ interarrival: days::$r },
                $data: {
                    customer_id: $id,
                    order_id: UUID,
                    timestamp: Instant
                }
            }
        }
    ]
}

Probability Distribution Support

Beamline provides support for data generation based on probability distributions, making it particularly valuable for AI model training and statistical simulation:

Available Distributions

  • Normal Distribution: NormalF64::{ mean: μ, std_dev: σ }
  • Log-Normal Distribution: LogNormalF64::{ location: μ, scale: σ }
  • Exponential Distribution: ExpF64::{ rate: λ }
  • Weibull Distribution: WeibullF64::{ shape: k, scale: λ }
  • Uniform Distribution: All Uniform* generators use uniform distribution
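
For intuition, all five can be sampled with Python's standard library. The mapping below is a rough stand-in for the parameterizations above, not Beamline's implementation (note that weibullvariate takes scale first, then shape):

```python
import random
from statistics import fmean

rng = random.Random(42)
N = 50_000

normal  = [rng.normalvariate(100.0, 15.0) for _ in range(N)]  # NormalF64
lognorm = [rng.lognormvariate(0.0, 1.0) for _ in range(N)]    # LogNormalF64
expo    = [rng.expovariate(1.0) for _ in range(N)]            # ExpF64 (mean = 1/rate)
weibull = [rng.weibullvariate(1.0, 2.0) for _ in range(N)]    # WeibullF64 (scale, shape)
uniform = [rng.uniform(0.0, 1.0) for _ in range(N)]           # UniformF64

print(fmean(normal), fmean(expo))
```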

AI Model Training Applications

rand_processes::{
    training_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: milliseconds::1 },
        $data: {
            // Features following realistic distributions
            age: NormalF64::{ mean: 35.0, std_dev: 12.0 },
            income: LogNormalF64::{ location: 10.5, scale: 0.5 },
            response_time: ExpF64::{ rate: 0.1 },
            
            // Categorical features
            category: Uniform::{ choices: ["A", "B", "C", "D"] },
            
            // Correlated features using shared variables
            experience_years: NormalF64::{ mean: 8.0, std_dev: 5.0 },
            
            // Target variable (could be based on features)
            target: Bool::{ p: 0.3 }
        }
    }
}

Next Steps

Now that you understand the data generation overview, explore the specific aspects covered in the following chapters.

Data Generator Types

Beamline provides a comprehensive set of data generators that can create values following various statistical distributions and patterns. Each generator is designed to produce realistic data for specific use cases and data types.

Generator Categories

Basic System Generators

These generators provide fundamental values based on simulation state:

| Generator | PartiQL Type | Description                                 | Configuration                              |
|-----------|--------------|---------------------------------------------|--------------------------------------------|
| Bool      | BOOL         | Boolean values using Bernoulli distribution | p: f64 (probability of true, default: 0.5) |
| Date      | DATETIME     | Current simulation date                     | No configuration                           |
| Instant   | DATETIME     | Current simulation timestamp with timezone  | No configuration                           |
| Tick      | Int64        | Current simulation tick counter             | No configuration                           |
| UUID      | STRING       | Version 4 UUID identifiers                  | No configuration                           |

Examples

rand_processes::{
    basic_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
        $data: {
            // System generators
            created_at: Instant,           // Current simulation time
            event_tick: Tick,              // Current tick counter  
            user_id: UUID,                 // Random UUID
            active: Bool,                  // 50% true by default
            premium: Bool::{ p: 0.1 },     // 10% true, 90% false
            event_date: Date               // Current simulation date
        }
    }
}

Uniform Integer Generators

Generate integers using discrete uniform distribution:

Unsigned Integers

| Generator  | Range                          | Default Range                      | Configuration      |
|------------|--------------------------------|------------------------------------|--------------------|
| UniformU8  | 0 to 255                       | low: 0, high: 255                  | low: u8, high: u8  |
| UniformU16 | 0 to 65,535                    | low: 0, high: 65535                | low: u16, high: u16 |
| UniformU32 | 0 to 4,294,967,295             | low: 0, high: 4294967295           | low: u32, high: u32 |
| UniformU64 | 0 to 9,223,372,036,854,775,807 | low: 0, high: 9223372036854775807  | low: u64, high: u64 |

Signed Integers

| Generator  | Range                                                   | Default Range                                          | Configuration      |
|------------|---------------------------------------------------------|--------------------------------------------------------|--------------------|
| UniformI8  | -128 to 127                                             | low: -127, high: 127                                   | low: i8, high: i8  |
| UniformI16 | -32,768 to 32,767                                       | low: -32767, high: 32767                               | low: i16, high: i16 |
| UniformI32 | -2,147,483,648 to 2,147,483,647                         | low: -2147483647, high: 2147483647                     | low: i32, high: i32 |
| UniformI64 | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | low: -9223372036854775807, high: 9223372036854775807   | low: i64, high: i64 |

Examples

rand_processes::{
    numeric_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
        $data: {
            // Default ranges
            age_category: UniformU8,                    // 0-255
            small_count: UniformI8,                     // -127 to 127
            
            // Custom ranges
            human_age: UniformU8::{ low: 0, high: 120 },
            temperature_c: UniformI8::{ low: -40, high: 50 },
            user_score: UniformU16::{ low: 0, high: 1000 },
            large_id: UniformU64::{ low: 1000000, high: 9999999 }
        }
    }
}

Floating Point Generators

Uniform Float

UniformF64::{ low: -127.0, high: 127.0 }  // Default range
UniformF64::{ low: 0.0, high: 1.0 }       // Unit interval

Uniform Decimal (Exact Arithmetic)

UniformDecimal::{ low: 0.995, high: 499.9999 }  // Default range
UniformDecimal::{ low: 9.99, high: 99.99 }      // Price range

Examples

rand_processes::{
    measurements: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::30 },
        $data: {
            // Floating point measurements
            temperature: UniformF64::{ low: -10.0, high: 40.0 },
            pressure: UniformF64::{ low: 980.0, high: 1050.0 },
            
            // Exact decimal values for money
            price: UniformDecimal::{ low: 9.99, high: 999.99 },
            tax_rate: UniformDecimal::{ low: 0.05, high: 0.12 }
        }
    }
}

Statistical Distribution Generators

Beamline supports several important probability distributions:

Normal Distribution

Models natural phenomena that cluster around a mean value:

NormalF64::{ mean: 100.0, std_dev: 15.0 }

Use Cases:

  • Human measurements (height, weight, IQ scores)
  • Measurement errors
  • Natural phenomena
  • AI model features

Example:

// Human height in centimeters (approximately normal)
height: NormalF64::{ mean: 170.0, std_dev: 10.0 }

// Test scores
test_score: NormalF64::{ mean: 75.0, std_dev: 12.0 }

Log-Normal Distribution

Models positive values whose logarithm is normally distributed, typically the result of multiplicative effects:

LogNormalF64::{ location: 0.0, scale: 1.0 }

Use Cases:

  • Income distributions
  • Stock prices
  • File sizes
  • Response times

Example:

// Income distribution (log-normal is realistic)
annual_income: LogNormalF64::{ location: 10.5, scale: 0.5 }

// File sizes
file_size_bytes: LogNormalF64::{ location: 10.0, scale: 2.0 }

Exponential Distribution

Models time between events or lifetimes:

ExpF64::{ rate: 1.0 }

Use Cases:

  • Time between events
  • Equipment lifetimes
  • Queue waiting times
  • Radioactive decay

Example:

// Time between customer arrivals (exponential inter-arrival times)
wait_time_minutes: ExpF64::{ rate: 0.1 }  // Average 10 minutes

// Equipment lifetime
lifetime_hours: ExpF64::{ rate: 0.001 }   // Average 1000 hours

Weibull Distribution

Models reliability, survival analysis, and extreme values:

WeibullF64::{ shape: 2.0, scale: 1000.0 }

Use Cases:

  • Equipment failure times
  • Material strength
  • Wind speeds
  • Survival analysis

Example:

// Equipment failure time
failure_time_hours: WeibullF64::{ shape: 2.0, scale: 8760.0 }  // ~1 year scale

// Material strength
breaking_force: WeibullF64::{ shape: 3.0, scale: 500.0 }

String Generators

Lorem Ipsum Text

Generate placeholder text:

LoremIpsum::{ min_words: 10, max_words: 200 }
LoremIpsumTitle  // 3-8 words, title case

Examples:

description: LoremIpsum::{ min_words: 5, max_words: 20 }
title: LoremIpsumTitle

Sample Output:

description: "Lorem ipsum dolor sit amet consectetur adipiscing elit sed do eiusmod tempor"
title: "Importari Putant Quae Autem Tanta"

Regular Expression Generator

Generate strings matching regex patterns:

Regex::{ pattern: "[A-Z]{2}[0-9]{4}" }

Examples:

rand_processes::{
    test_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
        $data: {
            // Country codes
            country: Regex::{ pattern: "[A-Z]{2}" },           // "US", "GB", "FR"
            
            // License plates  
            license: Regex::{ pattern: "[A-Z]{3}[0-9]{3}" },   // "ABC123"
            
            // Phone numbers
            phone: Regex::{ pattern: "[0-9]{3}-[0-9]{3}-[0-9]{4}" }, // "555-123-4567"
            
            // IPv4 addresses
            ip: Regex::{ pattern: "([0-9]{1,3}\\.){3}[0-9]{1,3}" }, // "192.168.1.1"
        }
    }
}

Important Notes:

  • Use double backslashes for escape sequences: \\d not \d
  • Character classes are Unicode-aware: \\d matches all Unicode digits
  • Complex patterns supported: quantifiers, alternatives, character classes
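
Generated values can always be validated back against the source pattern. The tiny sampler below is a hypothetical illustration covering only fixed-width character classes; a real regex generator handles full regex syntax:

```python
import random
import re

def sample_fixed_classes(parts, rng):
    """Sample a string from a list of (character set, repeat count) parts."""
    return "".join(rng.choice(chars) for chars, n in parts for _ in range(n))

rng = random.Random(7)
letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
digits = "0123456789"
plate = sample_fixed_classes([(letters, 3), (digits, 3)], rng)

assert re.fullmatch(r"[A-Z]{3}[0-9]{3}", plate)  # round-trips against the pattern
print(plate)
```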

Format String Generator

Generate formatted strings with variable substitution:

Format::{ pattern: "User #{$@n}" }
Format::{ pattern: "Order {$order_id} for customer {$customer_id}" }

Complex Type Generators

Array Generator

Generate arrays with variable length and typed elements:

UniformArray::{
    min_size: 1,
    max_size: 10,
    element_type: UniformI32::{ low: 1, high: 100 }
}

Configuration:

  • min_size: Minimum array length
  • max_size: Maximum array length
  • element_type: Generator for array elements

Examples:

rand_processes::{
    array_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
        $data: {
            // Array of integers
            scores: UniformArray::{
                min_size: 3,
                max_size: 10,
                element_type: UniformU8::{ low: 0, high: 100 }
            },
            
            // Array of UUIDs
            related_ids: UniformArray::{
                min_size: 1,
                max_size: 5,
                element_type: UUID
            },
            
            // Array using variable generator
            weights: UniformArray::{
                min_size: 2,
                max_size: 4,
                element_type: $weight_generator
            }
        }
    }
}

Union Type Generator (Any Of)

Generate values that can be one of several types:

UniformAnyOf::{
    types: [
        UUID,
        UniformI32::{ low: 1, high: 1000 },
        LoremIpsumTitle,
        Bool
    ]
}

Use Cases:

  • Heterogeneous data
  • Schema evolution simulation
  • Polymorphic fields
  • Variant types

Example:

rand_processes::{
    flexible_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
        $data: {
            // Field that can be different types
            metadata_value: UniformAnyOf::{
                types: [
                    UUID,                                    // Could be an ID
                    UniformI32::{ low: 1, high: 10000 },    // Could be a count
                    LoremIpsumTitle,                         // Could be a title
                    UniformDecimal::{ low: 0.0, high: 100.0 } // Could be a percentage
                ]
            }
        }
    }
}

Choice from Literals

Select from a predefined list of values:

Uniform::{ choices: [1, 2, 5, 10, 20] }
Uniform::{ choices: ["pending", "processing", "shipped", "delivered"] }

Examples:

rand_processes::{
    categorical_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
        $data: {
            // Status choices
            status: Uniform::{ choices: ["active", "inactive", "pending"] },
            
            // Priority levels
            priority: Uniform::{ choices: [1, 2, 3, 4, 5] },
            
            // Mixed type choices
            config_value: Uniform::{ choices: [true, false, "auto", 0] }
        }
    }
}

Timestamp Generators

Timestamp with Configuration

Generate timestamps with precision and timezone control:

Timestamp::{
    timezone: true,        // Include timezone (default: implementation dependent)
    precision: "microsecond" // Precision level
}

Precision Options:

  • "microsecond" - Microsecond precision
  • "millisecond" - Millisecond precision
  • "second" - Second precision
  • "minute" - Minute precision
  • "hour" - Hour precision
  • "day" - Day precision

Example:

rand_processes::{
    temporal_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::1 },
        $data: {
            // Different timestamp precisions
            precise_time: Timestamp::{ timezone: true, precision: "microsecond" },
            log_time: Timestamp::{ timezone: false, precision: "second" },
            daily_snapshot: Timestamp::{ timezone: true, precision: "day" }
        }
    }
}

Generator Configuration Options

Nullability and Optionality

All generators support NULL and MISSING value configuration:

// 20% NULL values
generator::{ nullable: 0.2 }

// 10% MISSING values (field won't appear)
generator::{ optional: 0.1 }

// Combined: 15% NULL, 5% MISSING, 80% present
generator::{ nullable: 0.15, optional: 0.05 }

// Disable NULL/MISSING
generator::{ nullable: false, optional: false }

Range-Based Generators

Most numeric generators support range configuration:

// Integer ranges
UniformI32::{ low: 1, high: 1000 }
UniformU8::{ low: 18, high: 65 }  // Age range

// Float ranges  
UniformF64::{ low: -10.0, high: 50.0 }  // Temperature range

// Decimal ranges (exact arithmetic)
UniformDecimal::{ low: 9.99, high: 999.99 }  // Price range

Statistical Distribution Parameters

Normal Distribution

NormalF64::{
    mean: 100.0,      // Mean (μ)
    std_dev: 15.0     // Standard deviation (σ)
}

Example Applications:

// Human height (cm) - approximately normal
height: NormalF64::{ mean: 170.0, std_dev: 10.0 }

// IQ scores - designed to be normal
iq_score: NormalF64::{ mean: 100.0, std_dev: 15.0 }

// Measurement errors
measurement_error: NormalF64::{ mean: 0.0, std_dev: 0.1 }

Log-Normal Distribution

LogNormalF64::{
    location: 0.0,    // Location parameter (μ)
    scale: 1.0        // Scale parameter (σ)
}

Example Applications:

// Income - typically log-normal
income: LogNormalF64::{ location: 10.5, scale: 0.5 }  // ~$36K median

// File sizes
file_size: LogNormalF64::{ location: 8.0, scale: 2.0 }  // Bytes

// Response times
response_ms: LogNormalF64::{ location: 3.0, scale: 0.5 }  // Milliseconds

Exponential Distribution

ExpF64::{
    rate: 1.0         // Rate parameter (λ)
}

Example Applications:

// Time between events
inter_arrival_time: ExpF64::{ rate: 0.1 }  // Average 10 time units

// Equipment lifetime  
lifetime_hours: ExpF64::{ rate: 0.0001 }  // Average 10,000 hours

// Queue waiting time
wait_time_sec: ExpF64::{ rate: 0.05 }  // Average 20 seconds
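
The averages in the comments follow from the identity mean = 1 / rate, which a quick empirical check confirms:

```python
import math
import random

rate = 0.0001                  # ExpF64::{ rate: 0.0001 }
analytic_mean = 1.0 / rate     # 10,000 hours
assert math.isclose(analytic_mean, 10_000.0)

# Sample mean converges to the analytic mean for large samples.
rng = random.Random(0)
sample_mean = sum(rng.expovariate(rate) for _ in range(100_000)) / 100_000
print(sample_mean)  # close to 10,000
```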

Weibull Distribution

WeibullF64::{
    shape: 2.0,       // Shape parameter (k)
    scale: 100.0      // Scale parameter (λ)
}

Example Applications:

// Equipment reliability
failure_time: WeibullF64::{ shape: 2.0, scale: 1000.0 }

// Wind speed modeling
wind_speed: WeibullF64::{ shape: 2.0, scale: 15.0 }

// Material strength
breaking_stress: WeibullF64::{ shape: 3.0, scale: 500.0 }
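
The mean of a Weibull distribution is scale × Γ(1 + 1/shape), which is handy for sanity-checking parameters. For the equipment reliability example above:

```python
import math

def weibull_mean(shape, scale):
    # E[X] = scale * Gamma(1 + 1/shape) for a Weibull(shape, scale) variable
    return scale * math.gamma(1.0 + 1.0 / shape)

# WeibullF64::{ shape: 2.0, scale: 1000.0 } -> mean failure time of roughly 886 hours
print(weibull_mean(2.0, 1000.0))
```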

Advanced Generator Usage

Nested Structures

Create complex nested objects:

rand_processes::{
    complex_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::5 },
        $data: {
            user: {
                id: UUID,
                profile: {
                    name: LoremIpsumTitle,
                    age: UniformU8::{ low: 18, high: 80 },
                    preferences: {
                        notifications: Bool::{ p: 0.8 },
                        theme: Uniform::{ choices: ["light", "dark", "auto"] }
                    }
                },
                stats: {
                    login_count: UniformU32::{ low: 0, high: 10000 },
                    last_login: Instant,
                    score: NormalF64::{ mean: 85.0, std_dev: 12.0 }
                }
            }
        }
    }
}

Arrays of Complex Objects

rand_processes::{
    order_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::2 },
        $data: {
            order_id: UUID,
            items: UniformArray::{
                min_size: 1,
                max_size: 10,
                element_type: {
                    product_id: UUID,
                    quantity: UniformU8::{ low: 1, high: 5 },
                    unit_price: UniformDecimal::{ low: 5.00, high: 200.00 }
                }
            }
        }
    }
}

Variable References in Complex Generators

rand_processes::{
    // Define reusable components
    $id_gen: UUID,
    $weight_dist: NormalF64::{ mean: 70.0, std_dev: 15.0 },
    $status_options: Uniform::{ choices: ["new", "active", "suspended", "closed"] },
    
    users: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::1 },
        $data: {
            user_id: $id_gen,
            weight_kg: $weight_dist,
            account_status: $status_options,
            
            // Arrays using variables
            measurement_history: UniformArray::{
                min_size: 5,
                max_size: 20,
                element_type: $weight_dist  // Same distribution for all measurements
            },
            
            // Union types with variables
            contact_method: UniformAnyOf::{
                types: [
                    $id_gen,  // UUID for anonymous contact
                    Regex::{ pattern: "[a-z]+@[a-z]+\\.[a-z]{2,3}" },  // Email
                    Regex::{ pattern: "[0-9]{3}-[0-9]{3}-[0-9]{4}" }    // Phone
                ]
            }
        }
    }
}

AI Model Training Examples

Classification Dataset

rand_processes::{
    classification_training: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: milliseconds::1 },
        $data: {
            // Features with realistic distributions
            feature_1: NormalF64::{ mean: 0.0, std_dev: 1.0 },
            feature_2: NormalF64::{ mean: 0.0, std_dev: 1.0 },
            feature_3: LogNormalF64::{ location: 0.0, scale: 0.5 },
            feature_4: ExpF64::{ rate: 1.0 },
            
            // Categorical features
            category: Uniform::{ choices: ["A", "B", "C"] },
            region: Uniform::{ choices: ["North", "South", "East", "West"] },
            
            // Binary classification target
            label: Bool::{ p: 0.3 }  // 30% positive class
        }
    }
}

Regression Dataset

rand_processes::{
    regression_training: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: milliseconds::1 },
        $data: {
            // Independent variables
            x1: NormalF64::{ mean: 10.0, std_dev: 2.0 },
            x2: UniformF64::{ low: 0.0, high: 20.0 },
            x3: ExpF64::{ rate: 0.1 },
            
            // Dependent variable (could be computed based on x1, x2, x3)
            y: NormalF64::{ mean: 50.0, std_dev: 10.0 },
            
            // Noise term
            noise: NormalF64::{ mean: 0.0, std_dev: 1.0 }
        }
    }
}

Time Series Dataset

rand_processes::{
    time_series: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::60 },  // Every minute
        $data: {
            timestamp: Instant,
            
            // Trending value with noise
            base_value: NormalF64::{ mean: 100.0, std_dev: 5.0 },
            seasonal_component: NormalF64::{ mean: 0.0, std_dev: 10.0 },
            noise: NormalF64::{ mean: 0.0, std_dev: 2.0 },
            
            // External factors
            temperature: NormalF64::{ mean: 22.0, std_dev: 5.0 },
            humidity: UniformF64::{ low: 30.0, high: 80.0 }
        }
    }
}
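
The script emits each component independently; a downstream consumer might assemble them into an actual series. An illustrative Python sketch, with a sine term standing in for the seasonal component:

```python
import math
import random

rng = random.Random(3)
series = []
for minute in range(180):  # three hours of minute-level data
    base = rng.normalvariate(100.0, 5.0)                     # level
    seasonal = 10.0 * math.sin(2 * math.pi * minute / 60.0)  # hourly cycle
    noise = rng.normalvariate(0.0, 2.0)                      # measurement noise
    series.append(base + seasonal + noise)
print(series[:3])
```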

Performance Considerations

Generator Efficiency

  1. Simple generators (UUID, Bool, UniformI32) are fastest
  2. Statistical distributions (NormalF64, ExpF64) require more computation
  3. String generators (LoremIpsum, Regex) can be slower for complex patterns
  4. Array generators scale with array size and element complexity
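
To get a feel for these relative costs on your own machine, time comparable stdlib samplers (rough stand-ins for the generator categories, not Beamline itself; absolute numbers vary by hardware):

```python
import random
import timeit
import uuid

rng = random.Random(1)
t_uniform = timeit.timeit(lambda: rng.randint(1, 1000), number=50_000)
t_normal = timeit.timeit(lambda: rng.normalvariate(100.0, 15.0), number=50_000)
t_uuid = timeit.timeit(uuid.uuid4, number=50_000)
print(t_uniform, t_normal, t_uuid)  # seconds per 50,000 samples
```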

Memory Usage

  • Streaming generation: Constant memory usage regardless of dataset size
  • Variable caching: Variables are computed once and reused
  • Complex nesting: Memory usage scales with structure depth

Optimization Tips

// Efficient - simple generators
id: UUID,
count: UniformU32::{ low: 1, high: 1000 }

// Less efficient - complex regex
complex_pattern: Regex::{ pattern: "(very|extremely|quite)\\s+complex\\s+pattern\\s+with\\s+many\\s+alternatives" }

// Efficient - reuse variables
$common_decimal: UniformDecimal::{ low: 1.0, high: 100.0 },
field1: $common_decimal,
field2: $common_decimal,
field3: $common_decimal

Next Steps

Datasets and Collections

Datasets in Beamline represent collections of related data records that share the same structure. Understanding how to design, organize, and work with multiple datasets is essential for creating realistic data generation scenarios.

What are Datasets?

A dataset is a named collection of records that share a common schema. In Ion scripts, datasets are defined as top-level keys within the rand_processes structure:

rand_processes::{
    users: rand_process::{ /* ... */ },        // "users" dataset
    orders: rand_process::{ /* ... */ },       // "orders" dataset  
    products: static_data::{ /* ... */ }       // "products" dataset
}

Each dataset becomes a separate data collection in the output, whether in text format, Ion format, or database generation.

Single Dataset Scripts

Basic Single Dataset

rand_processes::{
    sensors: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
        $data: {
            sensor_id: UUID,
            temperature: NormalF64::{ mean: 22.0, std_dev: 3.0 },
            humidity: UniformF64::{ low: 30.0, high: 80.0 },
            timestamp: Instant
        }
    }
}

Output characteristics:

  • Single dataset named “sensors”
  • All records have the same structure
  • Records generated according to arrival process
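
A homogeneous Poisson process draws exponentially distributed interarrival times. The following Python sketch (an illustration of the underlying math, not Beamline internals) shows what `interarrival: minutes::5` implies: over a 24-hour window you can expect roughly 288 arrivals, with random variation around that mean.

```python
import random

def poisson_arrivals(mean_interarrival_s, horizon_s, seed):
    """Homogeneous Poisson process: interarrival times are exponentially
    distributed with the given mean; returns arrival timestamps in seconds."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(1.0 / mean_interarrival_s)
        if t > horizon_s:
            return arrivals
        arrivals.append(t)

# interarrival of 5 minutes (300 s) over 24 hours: ~86,400 / 300 = 288 arrivals
arrivals = poisson_arrivals(mean_interarrival_s=300, horizon_s=86_400, seed=7)
print(len(arrivals))
```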

Multiple Dataset Scripts

Independent Datasets

Create multiple unrelated datasets in the same script:

rand_processes::{
    // User activity dataset
    user_events: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::30 },
        $data: {
            user_id: UUID,
            event_type: Uniform::{ choices: ["login", "logout", "click", "purchase"] },
            timestamp: Instant
        }
    },

    // System metrics dataset  
    system_metrics: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::60 },
        $data: {
            metric_name: Uniform::{ choices: ["cpu", "memory", "disk", "network"] },
            value: UniformF64::{ low: 0.0, high: 100.0 },
            timestamp: Instant
        }
    },

    // Configuration dataset (static)
    app_config: static_data::{
        $data: {
            config_key: Uniform::{ choices: ["max_users", "timeout", "retry_count"] },
            config_value: UniformAnyOf::{ types: [UniformI32::{ low: 1, high: 1000 }, Bool] }
        }
    }
}

Related Datasets

Create datasets that share common identifiers or generators:

rand_processes::{
    // Shared generators
    $user_id: UUID,
    $session_id: UUID,

    // User profiles (static)
    users: static_data::{
        $data: {
            user_id: $user_id,
            username: Format::{ pattern: "user_{UUID}" },
            created_at: Date
        }
    },

    // User sessions (dynamic)
    sessions: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: hours::2 },
        $data: {
            session_id: $session_id,
            user_id: $user_id,  // Links to users dataset
            start_time: Instant,
            duration_minutes: UniformU16::{ low: 5, high: 180 }
        }
    },

    // Session events (dynamic) 
    session_events: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::3 },
        $data: {
            event_id: UUID,
            session_id: $session_id,  // Links to sessions dataset
            event_type: Uniform::{ choices: ["page_view", "click", "scroll", "exit"] },
            timestamp: Instant
        }
    }
}

Complex Dataset Relationships

Dynamic Dataset Creation with Loops

From the real client-service.ion test script:

rand_processes::{
    // Generate between 5 & 20 customers
    $n: UniformU8::{ low: 5, high: 20 },

    // Shared ID generators
    $id_gen: UUID,
    $rid_gen: UUID,

    requests: $n::[
        // Each iteration creates datasets for customer $@n
        {
            // Unique ID per customer
            $id: $id_gen::(),
            $rate: UniformF64::{ low: 0.995, high: 1.0 },
            $success: Bool::{ p: $rate },

            // Service dataset - shared by all customers
            service: rand_process::{
                $r: UniformU8::{ low: 20, high: 150 },
                $arrival: HomogeneousPoisson::{ interarrival: milliseconds::$r },
                $data: {
                    Request: $rid_gen,
                    StartTime: Instant,
                    Program: "FancyService", 
                    Operation: "GetMyData",
                    Account: $id,
                    client: Format::{ pattern: "customer #{$@n}" },
                    success: $success
                }
            },

            // Individual client dataset - one per customer
            'client_{$@n}': rand_process::{
                $arrival: HomogeneousPoisson::{ interarrival: milliseconds::$r },
                $data: {
                    id: $id,
                    request_time: Instant,
                    request_id: $rid_gen,
                    success: $success
                }
            }
        }
    ]
}

This creates:

  • 1 service dataset: Shared across all customers
  • N client datasets: client_0, client_1, client_2, etc.
  • Shared variables: Same request IDs, customer IDs, success rates
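
The way the loop expands dataset names can be sketched in a few lines of Python (names are illustrative, mirroring the script above rather than Beamline's internals): the shared `service` dataset is emitted once, while `'client_{$@n}'` is instantiated once per loop iteration.

```python
# Stand-in for a draw from UniformU8::{ low: 5, high: 20 }
n = 5

# One shared dataset plus one templated dataset per iteration.
datasets = {"service"} | {f"client_{i}" for i in range(n)}
print(sorted(datasets))
```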

Output Example

$ beamline gen data \
    --seed 100 \
    --start-auto \
    --script-path client-service.ion \
    --sample-count 20 \
    --output-format text

Seed: 100
Start: 2024-01-01T00:00:00Z
[2024-01-01 00:00:10.123] : "service" { 'Request': 'req-001', 'Account': 'customer-abc', 'client': 'customer #0' }
[2024-01-01 00:00:10.124] : "client_0" { 'id': 'customer-abc', 'request_id': 'req-001' }
[2024-01-01 00:00:15.456] : "service" { 'Request': 'req-002', 'Account': 'customer-def', 'client': 'customer #1' }
[2024-01-01 00:00:15.457] : "client_1" { 'id': 'customer-def', 'request_id': 'req-002' }

Dataset Filtering

CLI Dataset Selection

Generate data for specific datasets only:

# Generate all datasets
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path multi_dataset.ion \
  --sample-count 100

# Generate only specific datasets
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path multi_dataset.ion \
  --sample-count 100 \
  --dataset users \
  --dataset orders

# Generate only one dataset
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path multi_dataset.ion \
  --sample-count 100 \
  --dataset system_metrics

Use Cases for Dataset Filtering

  • Focused testing: Test specific components in isolation
  • Performance optimization: Generate only needed data
  • Development: Work with subset of complex systems
  • Incremental development: Build datasets one at a time

Dataset Design Patterns

Master-Detail Pattern

rand_processes::{
    $n_customers: UniformU8::{ low: 10, high: 50 },
    $customer_id: UUID,
    $order_id: UUID,

    customers: $n_customers::[
        {
            $id: $customer_id::(),

            // Master dataset - customer information
            customer_master: static_data::{
                $data: {
                    customer_id: $id,
                    name: LoremIpsumTitle,
                    email: Format::{ pattern: "customer{$@n}@example.com" },
                    registration_date: Date
                }
            },

            // Detail dataset - customer orders
            'customer_{$@n}_orders': rand_process::{
                $arrival: HomogeneousPoisson::{ interarrival: days::UniformU8::{ low: 1, high: 30 } },
                $data: {
                    order_id: $order_id,
                    customer_id: $id,  // Foreign key relationship
                    order_date: Instant,
                    total_amount: UniformDecimal::{ low: 10.00, high: 500.00 }
                }
            }
        }
    ]
}
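
The essence of the master-detail pattern is that every detail row carries the identifier of exactly one master row. A minimal Python sketch of that invariant (hypothetical field names, not generated by Beamline):

```python
import random
import uuid

def master_detail(seed, n_customers):
    """Generate master (customer) rows and detail (order) rows where each
    order references one existing customer_id, so the FK always resolves."""
    rng = random.Random(seed)
    customers = [{"customer_id": str(uuid.UUID(int=rng.getrandbits(128)))}
                 for _ in range(n_customers)]
    orders = [{"order_id": str(uuid.UUID(int=rng.getrandbits(128))),
               "customer_id": rng.choice(customers)["customer_id"]}
              for _ in range(n_customers * 3)]
    return customers, orders

customers, orders = master_detail(seed=1, n_customers=10)
known = {c["customer_id"] for c in customers}
assert all(o["customer_id"] in known for o in orders)  # FK integrity holds
```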

Event Sourcing Pattern

rand_processes::{
    $entity_id: UUID,

    // Entity snapshots (static)
    entity_snapshots: static_data::{
        $data: {
            entity_id: $entity_id,
            entity_type: Uniform::{ choices: ["user", "order", "product"] },
            created_at: Date,
            initial_state: LoremIpsumTitle
        }
    },

    // Entity events (dynamic)
    entity_events: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::UniformU8::{ low: 5, high: 60 } },
        $data: {
            event_id: UUID,
            entity_id: $entity_id,  // Links to snapshots
            event_type: Uniform::{ choices: ["created", "updated", "deleted", "restored"] },
            timestamp: Instant,
            event_data: LoremIpsum::{ min_words: 5, max_words: 20 }
        }
    }
}

Multi-Tenant Pattern

rand_processes::{
    $n_tenants: UniformU8::{ low: 3, high: 10 },
    $tenant_id: UUID,

    tenants: $n_tenants::[
        {
            $id: $tenant_id::(),

            // Tenant configuration (static)
            'tenant_{$@n}_config': static_data::{
                $data: {
                    tenant_id: $id,
                    tenant_name: Format::{ pattern: "Tenant {$@n}" },
                    plan: Uniform::{ choices: ["basic", "premium", "enterprise"] },
                    max_users: UniformU16::{ low: 10, high: 1000 }
                }
            },

            // Tenant activity (dynamic)
            'tenant_{$@n}_activity': rand_process::{
                $arrival: HomogeneousPoisson::{ interarrival: minutes::UniformU8::{ low: 1, high: 30 } },
                $data: {
                    activity_id: UUID,
                    tenant_id: $id,
                    activity_type: Uniform::{ choices: ["login", "api_call", "data_export", "config_change"] },
                    timestamp: Instant,
                    user_count: UniformU16::{ low: 1, high: 100 }
                }
            }
        }
    ]
}

Dataset Analysis and Inspection

Examining Generated Datasets

# Generate multi-dataset output
beamline gen data \
  --seed 123 \
  --start-auto \
  --script-path complex_system.ion \
  --sample-count 1000 \
  --output-format ion-pretty > output.ion

# Extract dataset names and record counts
jq -r '.data | keys[]' output.ion  # List all dataset names
jq '.data.users | length' output.ion  # Count records in users dataset
jq '.data | to_entries[] | "\(.key): \(.value | length) records"' output.ion  # All counts
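
Note that jq only parses JSON, so these commands assume the output is JSON-compatible (or has been converted first). The same per-dataset record counting can be done in Python; the document layout below mirrors the `{seed, start, data: {dataset: [records...]}}` structure shown in this guide, with inlined sample data standing in for a real output file.

```python
import json

# Inlined stand-in for a converted output file.
doc = json.loads("""{
  "seed": 123,
  "data": {
    "users":  [{"user_id": "a"}, {"user_id": "b"}],
    "orders": [{"order_id": "x"}]
  }
}""")

counts = {name: len(records) for name, records in doc["data"].items()}
for name, n in counts.items():
    print(f"{name}: {n} records")
```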

Database Catalog Analysis

# Generate database
beamline gen db beamline-lite \
  --seed 456 \
  --start-auto \
  --script-path multi_dataset.ion \
  --sample-count 5000

# Analyze generated datasets
ls -la beamline-catalog/*.ion | grep -v shape  # List data files
for f in beamline-catalog/*.ion; do
  if [[ "$f" != *".shape.ion" ]]; then
    echo "$(basename "$f" .ion): $(wc -l < "$f") records"
  fi
done

Schema Comparison Across Datasets

# Compare schemas of related datasets
diff beamline-catalog/client_0.shape.sql beamline-catalog/client_1.shape.sql
# Should be identical for datasets created from the same template

# Compare different dataset schemas
diff beamline-catalog/users.shape.sql beamline-catalog/orders.shape.sql
# Should be different - different structures

Advanced Dataset Patterns

Hierarchical Data Modeling

rand_processes::{
    $n_orgs: UniformU8::{ low: 2, high: 5 },
    $n_depts_per_org: UniformU8::{ low: 3, high: 8 },
    $n_users_per_dept: UniformU8::{ low: 5, high: 20 },

    organizations: $n_orgs::[
        {
            $org_id: UUID::(),

            // Organization master data
            'org_{$@n}': static_data::{
                $data: {
                    org_id: $org_id,
                    org_name: Format::{ pattern: "Organization {$@n}" },
                    industry: Uniform::{ choices: ["Tech", "Finance", "Healthcare", "Retail"] }
                }
            },

            // Departments within organization
            departments: $n_depts_per_org::[
                {
                    $dept_id: UUID::(),

                    'org_{$@n}_dept_{$@n}': static_data::{
                        $data: {
                            dept_id: $dept_id,
                            org_id: $org_id,
                            dept_name: Uniform::{ choices: ["Engineering", "Sales", "Marketing", "HR"] }
                        }
                    },

                    // Users within department
                    'org_{$@n}_dept_{$@n}_users': $n_users_per_dept::[
                        rand_process::{
                            $arrival: HomogeneousPoisson::{ interarrival: hours::UniformU8::{ low: 8, high: 24 } },
                            $data: {
                                user_id: UUID,
                                dept_id: $dept_id,
                                org_id: $org_id,
                                activity_type: Uniform::{ choices: ["work", "meeting", "break", "training"] },
                                timestamp: Instant
                            }
                        }
                    ]
                }
            ]
        }
    ]
}

Time-Series Dataset Families

rand_processes::{
    $n_sensors: UniformU8::{ low: 5, high: 15 },
    $sensor_id: UUID,

    sensors: $n_sensors::[
        {
            $id: $sensor_id::(),
            $location: Format::{ pattern: "Location-{$@n}" },

            // Sensor metadata (static)
            'sensor_{$@n}_metadata': static_data::{
                $data: {
                    sensor_id: $id,
                    location: $location,
                    sensor_type: Uniform::{ choices: ["temperature", "humidity", "pressure"] },
                    calibration_date: Date
                }
            },

            // Regular sensor readings (dynamic)
            'sensor_{$@n}_readings': rand_process::{
                $arrival: HomogeneousPoisson::{ interarrival: seconds::30 },
                $data: {
                    sensor_id: $id,
                    reading_time: Instant,
                    value: NormalF64::{ mean: 22.0, std_dev: 5.0 },
                    quality: Uniform::{ choices: ["good", "fair", "poor"] }
                }
            },

            // Sensor alerts (dynamic, infrequent)
            'sensor_{$@n}_alerts': rand_process::{
                $arrival: HomogeneousPoisson::{ interarrival: hours::UniformU8::{ low: 6, high: 48 } },
                $data: {
                    alert_id: UUID,
                    sensor_id: $id,
                    alert_type: Uniform::{ choices: ["high_value", "low_value", "malfunction", "maintenance"] },
                    timestamp: Instant,
                    severity: Uniform::{ choices: [1, 2, 3, 4, 5] }
                }
            }
        }
    ]
}

Dataset Output in Different Formats

Text Format Multi-Dataset Output

$ beamline gen data \
    --seed 999 \
    --start-auto \
    --script-path multi_dataset.ion \
    --sample-count 20 \
    --output-format text

# Datasets are interleaved by timestamp
[2024-01-01 00:00:00.000] : "config" { 'key': 'timeout', 'value': 30 }
[2024-01-01 00:00:00.000] : "config" { 'key': 'max_users', 'value': 1000 }
[2024-01-01 00:02:15.123] : "users" { 'user_id': 'abc-123', 'action': 'login' }
[2024-01-01 00:03:45.456] : "metrics" { 'metric': 'cpu', 'value': 45.6 }
[2024-01-01 00:04:30.789] : "users" { 'user_id': 'def-456', 'action': 'click' }
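
The interleaving works because each dataset produces a time-ordered stream that the writer merges into one global order. A Python sketch of the merge step (illustrative timestamps and payloads, not Beamline output):

```python
import heapq

# Each dataset is an already time-sorted stream of (timestamp_s, dataset, payload).
config  = [(0.0,   "config",  "max_users"), (0.0, "config", "timeout")]
users   = [(135.1, "users",   "login"), (270.8, "users", "click")]
metrics = [(225.5, "metrics", "cpu")]

# heapq.merge lazily combines the sorted streams into one sorted stream.
merged = list(heapq.merge(config, users, metrics))
for ts, dataset, payload in merged:
    print(f"[{ts:8.1f}] : {dataset!r} {payload}")
```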

Ion Pretty Multi-Dataset Output

{
  seed: 999,
  start: "2024-01-01T00:00:00Z",
  data: {
    config: [
      { key: "timeout", value: 30 },
      { key: "max_users", value: 1000 }
    ],
    users: [
      { user_id: "abc-123", action: "login", timestamp: 2024-01-01T00:02:15.123Z },
      { user_id: "def-456", action: "click", timestamp: 2024-01-01T00:04:30.789Z }
    ],
    metrics: [
      { metric: "cpu", value: 45.6, timestamp: 2024-01-01T00:03:45.456Z }
    ]
  }
}

Database Generation Multi-Dataset Files

$ beamline gen db beamline-lite \
    --seed 42 \
    --start-auto \
    --script-path client_service.ion \
    --sample-count 1000

$ ls beamline-catalog/
.beamline-manifest
.beamline-script
service.ion              # Service dataset data
service.shape.ion        # Service dataset schema  
service.shape.sql        # Service dataset SQL
client_0.ion            # Client 0 dataset data
client_0.shape.ion      # Client 0 dataset schema
client_0.shape.sql      # Client 0 dataset SQL
client_1.ion            # Client 1 dataset data
client_1.shape.ion      # Client 1 dataset schema
client_1.shape.sql      # Client 1 dataset SQL
...                     # More client datasets

Dataset Naming Best Practices

1. Use Descriptive Names

// Good - descriptive dataset names
user_profiles: static_data::{ /* ... */ },
user_activity_events: rand_process::{ /* ... */ },
system_performance_metrics: rand_process::{ /* ... */ }

// Avoid - generic names
data1: static_data::{ /* ... */ },
stuff: rand_process::{ /* ... */ }

2. Follow Consistent Naming Conventions

// Consistent naming pattern
user_profiles: static_data::{ /* ... */ },
user_sessions: rand_process::{ /* ... */ },
user_events: rand_process::{ /* ... */ },

order_master: static_data::{ /* ... */ },
order_items: rand_process::{ /* ... */ },
order_payments: rand_process::{ /* ... */ }

3. Group Related Datasets with Prefixes

// Group related datasets with prefixes
$n: UniformU8::{ low: 5, high: 10 },

services: $n::[
    {
        'service_{$@n}_config': static_data::{ /* ... */ },
        'service_{$@n}_requests': rand_process::{ /* ... */ },
        'service_{$@n}_responses': rand_process::{ /* ... */ },
        'service_{$@n}_errors': rand_process::{ /* ... */ }
    }
]

Performance Considerations

Dataset Count Impact

  • Few datasets (1-5): Minimal overhead
  • Many datasets (10-50): Slight memory overhead for tracking
  • Dynamic datasets (100+): Significant memory for metadata

Dataset Size Balance

// Balanced approach - mix of small and large datasets
rand_processes::{
    // Small reference dataset
    config: static_data::{ $data: { /* small config */ } },

    // Medium operational dataset  
    users: rand_process::{ /* moderate activity */ },

    // Large transaction dataset
    transactions: rand_process::{ /* high frequency */ }
}

Memory Usage with Multiple Datasets

# Monitor memory usage with many datasets
time beamline gen data \
  --seed 1 \
  --start-auto \
  --script-path many_datasets.ion \
  --sample-count 10000

# Use dataset filtering to reduce memory
beamline gen data \
  --seed 1 \
  --start-auto \
  --script-path many_datasets.ion \
  --sample-count 10000 \
  --dataset important_dataset_only

Integration Workflows

Dataset-Specific Processing

#!/bin/bash
# process-datasets.sh

SCRIPT="multi_system.ion"
SEED=12345

# Generate full dataset
beamline gen data \
  --seed $SEED \
  --start-auto \
  --script-path $SCRIPT \
  --sample-count 10000 \
  --output-format ion-pretty > full_data.ion

# Extract individual datasets for processing
jq '.data.users' full_data.ion > users_only.json
jq '.data.orders' full_data.ion > orders_only.json  
jq '.data.metrics' full_data.ion > metrics_only.json

echo "Datasets extracted for individual processing"

Cross-Dataset Validation

# Generate related datasets
beamline gen data \
  --seed 999 \
  --start-auto \
  --script-path related_data.ion \
  --sample-count 5000 \
  --output-format ion-pretty > related_data.ion

# Validate relationships
jq '.data.orders[].customer_id' related_data.ion | sort -u > order_customers.txt
jq '.data.users[].user_id' related_data.ion | sort -u > all_customers.txt

# Check referential integrity
comm -23 order_customers.txt all_customers.txt  # Orders with invalid customer IDs (should be empty)
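
The same check can be expressed as a set difference: orphans are customer ids referenced by orders but absent from the users dataset. A Python sketch with inlined sample ids (equivalent in spirit to the comm(1) pipeline above):

```python
# Ids referenced by orders vs. ids that actually exist in the users dataset.
order_customers = {"abc-123", "def-456", "zzz-999"}
all_customers   = {"abc-123", "def-456"}

# Referential integrity: this set should be empty.
orphans = order_customers - all_customers
print(sorted(orphans))
```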

Troubleshooting Multi-Dataset Scripts

Issue: Missing Datasets in Output

Cause: Dataset filtering or script errors

Solution:

# Check all available datasets
beamline infer-shape --seed 1 --start-auto --script-path script.ion --output-format text

# Generate without filtering
beamline gen data --seed 1 --start-auto --script-path script.ion --sample-count 5

Issue: Uneven Dataset Sizes

Cause: Different arrival rates or loop counts

Solution:

// Check arrival rates in your script
// Adjust interarrival times to balance dataset sizes
$arrival1: HomogeneousPoisson::{ interarrival: seconds::1 },   // Frequent
$arrival2: HomogeneousPoisson::{ interarrival: minutes::1 },   // Less frequent
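
The expected imbalance is easy to estimate: over the same time window, the expected record count is the window length divided by the mean interarrival time, so a 1-second process produces about 60x the records of a 1-minute process.

```python
# Back-of-the-envelope sizing over a one-hour window.
window_s = 3600
expected = {
    "arrival1 (seconds::1)": window_s / 1,    # ~3600 records
    "arrival2 (minutes::1)": window_s / 60,   # ~60 records
}
print(expected)
```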

Issue: Memory Issues with Many Datasets

Solution:

# Use dataset filtering
beamline gen data --script-path many.ion --dataset important_one --dataset important_two

# Or generate datasets separately
beamline gen data --script-path script.ion --dataset batch_1 --sample-count 10000
beamline gen data --script-path script.ion --dataset batch_2 --sample-count 10000

Next Steps

  • Scripts - Advanced Ion scripting techniques for complex datasets
  • Output Formats - How datasets appear in different output formats
  • Examples - See complete multi-dataset examples in action
  • Database Guide - Working with dataset catalogs and databases

Working with Scripts

Ion scripts are the core of Beamline’s data generation system. This section covers advanced scripting techniques, best practices, and patterns for creating sophisticated data generation scenarios.

Ion Script Fundamentals

Basic Script Structure

Every Beamline script follows this structure:

rand_processes::{
    // 1. Variable definitions (optional)
    $variable_name: GeneratorType::{ configuration },
    
    // 2. Dataset definitions (required)
    dataset_name: dataset_type::{
        // Configuration specific to dataset type
    }
}

Script Validation

Before generating large datasets, validate your script:

# Quick validation with minimal generation
beamline gen data \
  --seed 1 \
  --start-auto \
  --script-path new_script.ion \
  --sample-count 1

# Check inferred schema
beamline infer-shape \
  --seed 1 \
  --start-auto \
  --script-path new_script.ion \
  --output-format basic-ddl

Variable Management

Variable Definition Best Practices

rand_processes::{
    // Group related variables together with comments
    // === ID Generators ===
    $user_id: UUID,
    $session_id: UUID, 
    $transaction_id: UUID,
    
    // === Shared Distributions ===
    $age_distribution: NormalF64::{ mean: 35.0, std_dev: 12.0 },
    $price_range: UniformDecimal::{ low: 9.99, high: 999.99 },
    
    // === Configuration Values ===
    $max_users: UniformU8::{ low: 10, high: 50 },
    $success_rate: UniformF64::{ low: 0.95, high: 0.99 },
    
    // === Categorical Choices ===
    $status_options: Uniform::{ choices: ["active", "inactive", "pending", "suspended"] },
    $priority_levels: Uniform::{ choices: [1, 2, 3, 4, 5] },
    
    // Dataset definitions follow...
}

Variable Scoping Rules

Variables have different scoping behaviors:

rand_processes::{
    // Global variable - accessible everywhere
    $global_id: UUID,
    
    dataset: $n::[
        {
            // Loop-scoped variable - unique per iteration
            $local_id: UUID::(),  // Forces evaluation per loop iteration
            
            'data_{$@n}': rand_process::{
                $data: {
                    global: $global_id,      // Same value across all loops
                    local: $local_id,        // Different per loop iteration  
                    index: '$@n'             // Current loop index
                }
            }
        }
    ]
}
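
The scoping rules above can be modeled with a Python analogy (not Beamline internals): a plain `$variable` binds one drawn value that is reused everywhere, while the `::()` form re-draws on every loop iteration.

```python
import random
import uuid

rng = random.Random(0)

def fresh_uuid():
    """Draw a new deterministic UUID from the shared RNG."""
    return str(uuid.UUID(int=rng.getrandbits(128)))

global_id = fresh_uuid()        # like $global_id: UUID -- drawn once
records = []
for n in range(3):
    local_id = fresh_uuid()     # like $local_id: UUID::() -- drawn per iteration
    records.append({"global": global_id, "local": local_id, "index": n})

assert len({r["global"] for r in records}) == 1   # same value across all loops
assert len({r["local"] for r in records}) == 3    # unique per loop iteration
```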

Advanced Variable Techniques

Computed Variables

rand_processes::{
    // Base measurements
    $base_temp: NormalF64::{ mean: 20.0, std_dev: 3.0 },
    $temp_variance: UniformF64::{ low: 0.5, high: 2.0 },
    
    // Derived distribution: parameters chosen relative to the base values above
    $adjusted_temp: NormalF64::{ 
        mean: 22.0,  // Slightly higher than base
        std_dev: 4.0 // More variation
    },
    
    sensors: rand_process::{
        $data: {
            base_temperature: $base_temp,
            adjusted_temperature: $adjusted_temp,
            temperature_diff: UniformF64::{ low: -5.0, high: 5.0 }
        }
    }
}

Conditional Variable Usage

rand_processes::{
    // Define multiple generators for different scenarios
    $high_value_price: UniformDecimal::{ low: 100.00, high: 1000.00 },
    $low_value_price: UniformDecimal::{ low: 1.00, high: 50.00 },
    $medium_value_price: UniformDecimal::{ low: 25.00, high: 200.00 },
    
    products: rand_process::{
        $data: {
            product_id: UUID,
            category: Uniform::{ choices: ["electronics", "books", "clothing"] },
            
            // Use different price generators for different scenarios
            price: UniformAnyOf::{
                types: [
                    $high_value_price,    // Electronics
                    $low_value_price,     // Books  
                    $medium_value_price   // Clothing
                ]
            }
        }
    }
}

Advanced Script Patterns

Multi-Level Hierarchies

rand_processes::{
    $n_regions: UniformU8::{ low: 2, high: 4 },
    $n_stores_per_region: UniformU8::{ low: 3, high: 8 },
    $n_employees_per_store: UniformU8::{ low: 5, high: 20 },

    retail_hierarchy: $n_regions::[
        {
            $region_id: UUID::(),
            
            // Region data
            'region_{$@n}': static_data::{
                $data: {
                    region_id: $region_id,
                    region_name: Format::{ pattern: "Region {$@n}" },
                    timezone: Uniform::{ choices: ["PST", "MST", "CST", "EST"] }
                }
            },

            // Stores in region
            stores: $n_stores_per_region::[
                {
                    $store_id: UUID::(),
                    
                    'region_{$@n}_store_{$@n}': static_data::{
                        $data: {
                            store_id: $store_id,
                            region_id: $region_id,
                            store_name: Format::{ pattern: "Store {$@n}-{$@n}" },
                            address: Format::{ pattern: "{$@n} Commerce St" }
                        }
                    },

                    // Employees in store
                    'region_{$@n}_store_{$@n}_employees': $n_employees_per_store::[
                        rand_process::{
                            $arrival: HomogeneousPoisson::{ interarrival: hours::8 },
                            $data: {
                                employee_id: UUID,
                                store_id: $store_id,
                                region_id: $region_id,
                                clock_in_time: Instant,
                                activity: Uniform::{ choices: ["sales", "inventory", "cleaning", "break"] }
                            }
                        }
                    ]
                }
            ]
        }
    ]
}

Time-Based Dataset Coordination

rand_processes::{
    // Shared timing variables
    $peak_hours_rate: HomogeneousPoisson::{ interarrival: minutes::2 },
    $off_hours_rate: HomogeneousPoisson::{ interarrival: minutes::15 },
    $maintenance_rate: HomogeneousPoisson::{ interarrival: hours::6 },
    
    // High-frequency events during peak hours
    peak_user_activity: rand_process::{
        $arrival: $peak_hours_rate,
        $data: {
            event_id: UUID,
            event_type: Uniform::{ choices: ["login", "search", "purchase"] },
            timestamp: Instant,
            load_factor: UniformF64::{ low: 0.7, high: 1.0 }  // High load
        }
    },
    
    // Lower frequency during off hours  
    off_hours_activity: rand_process::{
        $arrival: $off_hours_rate,
        $data: {
            event_id: UUID,
            event_type: Uniform::{ choices: ["backup", "cleanup", "monitoring"] },
            timestamp: Instant,
            load_factor: UniformF64::{ low: 0.1, high: 0.3 }  // Low load
        }
    },
    
    // Maintenance events
    maintenance_events: rand_process::{
        $arrival: $maintenance_rate,
        $data: {
            maintenance_id: UUID,
            maintenance_type: Uniform::{ choices: ["scheduled", "emergency", "upgrade"] },
            timestamp: Instant,
            duration_minutes: UniformU16::{ low: 30, high: 240 }
        }
    }
}

Cross-Dataset Correlation

rand_processes::{
    // Shared correlation factors
    $system_load: UniformF64::{ low: 0.1, high: 0.9 },
    $error_probability: Bool::{ p: 0.05 },  // 5% base error rate
    
    // System metrics affected by load
    system_metrics: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::30 },
        $data: {
            metric_id: UUID,
            timestamp: Instant,
            cpu_usage: $system_load,
            memory_usage: UniformF64::{ low: 0.2, high: 0.8 },
            response_time_ms: LogNormalF64::{ location: 2.0, scale: 0.5 }
        }
    },
    
    // Application events affected by same factors
    application_events: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::10 },
        $data: {
            event_id: UUID,
            timestamp: Instant,
            event_type: Uniform::{ choices: ["request", "response", "error", "timeout"] },
            has_error: $error_probability,  // Correlated error rate
            load_factor: $system_load       // Same load factor
        }
    }
}
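
The correlation mechanism follows from "variables are computed once and reused": a single draw of the shared factor feeds both record streams, so their load fields agree. A Python sketch of that one-draw, many-uses behavior (an analogy, not Beamline internals):

```python
import random

rng = random.Random(42)

# One draw of the shared factor, like $system_load above.
system_load = rng.uniform(0.1, 0.9)

# Both "datasets" reuse the same drawn value.
metric = {"cpu_usage": system_load}
event  = {"load_factor": system_load}

assert metric["cpu_usage"] == event["load_factor"]  # perfectly correlated
```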

Script Organization Strategies

Modular Script Design

rand_processes::{
    // === CONFIGURATION SECTION ===
    // System-wide settings
    $system_version: "2.1.0",
    $max_concurrent_users: UniformU16::{ low: 100, high: 1000 },
    
    // === SHARED GENERATORS ===  
    // Reusable ID generators
    $user_id: UUID,
    $session_id: UUID,
    $request_id: UUID,
    
    // Reusable distributions
    $user_age_dist: NormalF64::{ mean: 34.5, std_dev: 12.8 },
    $response_time_dist: LogNormalF64::{ location: 3.0, scale: 0.4 },
    
    // === REFERENCE DATA ===
    // Static lookup tables
    user_types: static_data::{
        $data: {
            type_id: UniformU8::{ low: 1, high: 5 },
            type_name: Uniform::{ choices: ["free", "premium", "enterprise", "admin", "guest"] },
            max_sessions: Uniform::{ choices: [1, 5, 10, 100, 1] }
        }
    },
    
    // === OPERATIONAL DATA ===
    // Dynamic user activity
    user_sessions: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::UniformU8::{ low: 2, high: 30 } },
        $data: {
            user_id: $user_id,
            session_id: $session_id,
            start_time: Instant,
            user_age: $user_age_dist
        }
    },
    
    // === PERFORMANCE DATA ===
    // System performance metrics
    performance_metrics: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::15 },
        $data: {
            metric_timestamp: Instant,
            response_time: $response_time_dist,
            concurrent_users: UniformU16::{ low: 0, high: 1000 }
        }
    }
}

Environment-Specific Scripts

Create scripts that can be configured for different environments:

rand_processes::{
    // === ENVIRONMENT CONFIGURATION ===
    // Development environment settings
    $dev_user_count: UniformU8::{ low: 5, high: 20 },
    $dev_load_factor: UniformF64::{ low: 0.1, high: 0.3 },
    $dev_error_rate: 0.1,  // 10% errors in dev
    
    // Production-like environment settings  
    $prod_user_count: UniformU16::{ low: 100, high: 1000 },
    $prod_load_factor: UniformF64::{ low: 0.6, high: 0.95 },
    $prod_error_rate: 0.01,  // 1% errors in prod
    
    // Use dev settings (change as needed)
    $current_user_count: $dev_user_count,
    $current_load_factor: $dev_load_factor,
    $current_error_rate: $dev_error_rate,
    
    // === DATASETS ===
    users: $current_user_count::[
        rand_process::{
            $arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
            $data: {
                user_id: UUID,
                load_impact: $current_load_factor,
                has_error: Bool::{ p: $current_error_rate },
                timestamp: Instant
            }
        }
    ]
}

Complex Data Relationships

Foreign Key Relationships

rand_processes::{
    $n_customers: UniformU8::{ low: 10, high: 50 },
    $n_products: UniformU8::{ low: 20, high: 100 },
    
    // Generate customer IDs we can reference
    $customer_ids: $n_customers::[UUID::()],  // Array of customer UUIDs
    $product_ids: $n_products::[UUID::()],    // Array of product UUIDs
    
    customers: static_data::{
        $data: {
            customer_id: Uniform::{ choices: $customer_ids },  // Reference predefined IDs
            name: LoremIpsumTitle,
            email: Format::{ pattern: "customer{UUID}@example.com" }
        }
    },
    
    products: static_data::{
        $data: {
            product_id: Uniform::{ choices: $product_ids },   // Reference predefined IDs
            name: LoremIpsumTitle,
            price: UniformDecimal::{ low: 5.00, high: 200.00 }
        }
    },
    
    orders: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::UniformU8::{ low: 5, high: 30 } },
        $data: {
            order_id: UUID,
            customer_id: Uniform::{ choices: $customer_ids },  // Valid customer reference
            product_id: Uniform::{ choices: $product_ids },    // Valid product reference  
            quantity: UniformU8::{ low: 1, high: 5 },
            timestamp: Instant
        }
    }
}

Temporal Coordination

rand_processes::{
    // Shared timing patterns
    $business_hours: HomogeneousPoisson::{ interarrival: minutes::UniformU8::{ low: 2, high: 10 } },
    $after_hours: HomogeneousPoisson::{ interarrival: hours::UniformU8::{ low: 1, high: 4 } },
    
    // Customer activity during business hours
    customer_activity: rand_process::{
        $arrival: $business_hours,
        $data: {
            activity_id: UUID,
            activity_type: Uniform::{ choices: ["browse", "search", "purchase", "support"] },
            timestamp: Instant,
            response_time: LogNormalF64::{ location: 2.5, scale: 0.3 }  // Faster during business hours
        }
    },
    
    // System maintenance after hours
    system_maintenance: rand_process::{
        $arrival: $after_hours,
        $data: {
            maintenance_id: UUID,
            maintenance_type: Uniform::{ choices: ["backup", "update", "cleanup", "monitoring"] },
            timestamp: Instant,
            duration_minutes: UniformU16::{ low: 15, high: 120 }
        }
    }
}

Script Testing and Development

Iterative Development Process

# 1. Start with minimal script
echo 'rand_processes::{ test: rand_process::{ $arrival: HomogeneousPoisson::{ interarrival: seconds::1 }, $data: { id: UUID } } }' > minimal.ion

# 2. Validate basic structure
beamline gen data --seed 1 --start-auto --script-path minimal.ion --sample-count 3

# 3. Add complexity incrementally
# ... edit script to add fields, variables, etc.

# 4. Test each addition
beamline gen data --seed 1 --start-auto --script-path enhanced.ion --sample-count 5

# 5. Validate schema
beamline infer-shape --seed 1 --start-auto --script-path enhanced.ion --output-format basic-ddl

Script Debugging Techniques

Add Debug Fields

rand_processes::{
    test_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::5 },
        $data: {
            // Production fields
            user_id: UUID,
            action: Uniform::{ choices: ["login", "logout"] },
            
            // Debug fields (remove in production)
            debug_tick: Tick,
            debug_timestamp: Instant,
            debug_seed_info: Format::{ pattern: "Generated at tick {Tick}" }
        }
    }
}

Validate Variable Evaluation

rand_processes::{
    // Test variable evaluation
    $test_var: UniformI32::{ low: 1, high: 10 },
    $forced_eval: UniformI32::{ low: 100, high: 200 }::(),
    
    debug_variables: rand_process::{
        $data: {
            normal_var: $test_var,      // New value each time
            forced_var: $forced_eval,   // Same value each time
            comparison: Format::{ pattern: "normal: {$test_var}, forced: {$forced_eval}" }
        }
    }
}

Test Script Fragments

# Test individual components
echo 'rand_processes::{ test_generators: rand_process::{ $arrival: HomogeneousPoisson::{ interarrival: seconds::1 }, $data: { test_field: NormalF64::{ mean: 0.0, std_dev: 1.0 } } } }' | \
beamline gen data --seed 1 --start-auto --script - --sample-count 5

Performance Optimization in Scripts

Generator Efficiency

rand_processes::{
    // Efficient - simple generators
    efficient_data: rand_process::{
        $data: {
            id: UUID,                                    // Very fast
            count: UniformI32::{ low: 1, high: 1000 },  // Fast
            flag: Bool                                   // Very fast
        }
    },
    
    // Less efficient - complex generators
    complex_data: rand_process::{
        $data: {
            // Slower - statistical distributions
            normal_value: NormalF64::{ mean: 0.0, std_dev: 1.0 },
            
            // Slower - complex regex patterns
            complex_pattern: Regex::{ pattern: "([A-Z][a-z]{2,8}\\s){3}[A-Z][a-z]{2,8}" },
            
            // Slower - large arrays
            large_array: UniformArray::{
                min_size: 50,
                max_size: 100,
                element_type: NormalF64::{ mean: 0.0, std_dev: 1.0 }
            }
        }
    }
}

Variable Reuse for Performance

rand_processes::{
    // Efficient - reuse expensive generators
    $expensive_distribution: WeibullF64::{ shape: 2.0, scale: 100.0 },
    $simple_choices: Uniform::{ choices: ["A", "B", "C", "D"] },
    
    optimized_data: rand_process::{
        $data: {
            // Reuse the same expensive distribution
            measurement1: $expensive_distribution,
            measurement2: $expensive_distribution, 
            measurement3: $expensive_distribution,
            
            // Reuse simple categorical generator
            category1: $simple_choices,
            category2: $simple_choices
        }
    }
}

Memory-Conscious Patterns

rand_processes::{
    // Memory-efficient approach
    streaming_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: milliseconds::10 },
        $data: {
            // Simple fields - low memory
            id: UUID,
            timestamp: Instant,
            value: UniformF64::{ low: 0.0, high: 100.0 },
            
            // Avoid large embedded structures in high-frequency data
            // metadata: { /* avoid large nested objects */ }
        }
    },
    
    // Separate detailed data as less frequent dataset
    detailed_metadata: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::5 },  // Much less frequent
        $data: {
            detail_id: UUID,
            large_description: LoremIpsum::{ min_words: 50, max_words: 200 },
            complex_structure: {
                nested_data: LoremIpsumTitle,
                more_nested: {
                    deep_field: UniformF64::{ low: 0.0, high: 1.0 }
                }
            }
        }
    }
}

Error Handling in Scripts

Common Script Errors

Invalid Ion Syntax

// Wrong - missing closing brace
rand_processes::{
    test: rand_process::{
        $data: {
            id: UUID
        }
    // Missing closing brace here

Error:

Error: Failed to parse Ion script: Expected closing brace '}' at line 8
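A fixed version closes both opened braces:

```ion
rand_processes::{
    test: rand_process::{
        $data: {
            id: UUID
        }
    }
}
```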

Invalid Generator Configuration

// Wrong - min > max
rand_processes::{
    test: rand_process::{
        $data: {
            bad_range: UniformI32::{ low: 100, high: 50 }  // Invalid range
        }
    }
}
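A corrected configuration keeps low less than or equal to high (and, like any rand_process shown earlier, supplies an arrival process):

```ion
rand_processes::{
    test: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
        $data: {
            good_range: UniformI32::{ low: 50, high: 100 }  // low <= high
        }
    }
}
```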

Missing Required Fields

// Wrong - missing arrival for rand_process
rand_processes::{
    test: rand_process::{
        $data: { id: UUID }  // Missing $arrival
    }
}
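Adding the required $arrival fixes the script:

```ion
rand_processes::{
    test: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::1 },  // required for rand_process
        $data: { id: UUID }
    }
}
```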

Script Validation Patterns

rand_processes::{
    // Good - comprehensive configuration
    validated_data: rand_process::{
        // Required: arrival process
        $arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
        
        // Required: data definition
        $data: {
            // Validate ranges
            valid_range: UniformI32::{ low: 1, high: 100 },  // min <= max
            
            // Validate probabilities  
            valid_probability: Bool::{ p: 0.5 },  // 0.0 <= p <= 1.0
            
            // Validate nullable/optional
            valid_nullable: UniformF64::{ 
                low: 0.0, 
                high: 1.0,
                nullable: 0.1,    // 0.0 <= nullable <= 1.0
                optional: 0.05    // 0.0 <= optional <= 1.0
            }
        }
    }
}

Script Documentation

Inline Documentation Best Practices

rand_processes::{
    // =============================================================================
    // E-Commerce Simulation Script v2.1
    // 
    // Purpose: Generate realistic e-commerce data for performance testing
    // Author: Data Team
    // Created: 2024-01-01
    // Last Modified: 2024-01-15
    //
    // Datasets Generated:
    // - customers: Static customer profiles (10-50 customers)
    // - products: Static product catalog (50-200 products) 
    // - orders: Dynamic order events (variable frequency)
    // - reviews: Dynamic product reviews (low frequency)
    // =============================================================================
    
    // === CONFIGURATION VARIABLES ===
    
    // Customer population size
    $n_customers: UniformU8::{ low: 10, high: 50 },  // 10-50 customers for testing
    
    // Product catalog size  
    $n_products: UniformU8::{ low: 50, high: 200 },  // 50-200 products
    
    // Business parameters
    $avg_order_value: UniformDecimal::{ low: 25.00, high: 500.00 },  // Realistic order sizes
    $customer_satisfaction: UniformF64::{ low: 0.7, high: 0.95 },    // High satisfaction rate
    
    // === SHARED GENERATORS ===
    
    $customer_id: UUID,     // Unique customer identifiers
    $product_id: UUID,      // Unique product identifiers  
    $order_id: UUID,        // Unique order identifiers
    
    // === STATIC REFERENCE DATA ===
    
    // Customer master data - generated once at simulation start
    customers: static_data::{
        $data: {
            customer_id: $customer_id,
            name: LoremIpsumTitle,  // Realistic names
            email: Format::{ pattern: "customer{UUID}@example.com" },
            registration_date: Date,  // All register at simulation start
            loyalty_tier: Uniform::{ choices: ["bronze", "silver", "gold", "platinum"] }
        }
    },
    
    // Product catalog - static reference data
    products: static_data::{
        $data: {
            product_id: $product_id,
            name: LoremIpsumTitle,
            category: Uniform::{ choices: ["Electronics", "Clothing", "Books", "Home"] },
            base_price: $avg_order_value,
            in_stock: Bool::{ p: 0.9 }  // 90% of products in stock
        }
    },
    
    // === DYNAMIC TRANSACTIONAL DATA ===
    
    // Order events - customers place orders over time
    orders: rand_process::{
        // Variable order frequency - some customers more active
        $r: UniformU8::{ low: 30, high: 180 },  // 30-180 minutes between orders
        $arrival: HomogeneousPoisson::{ interarrival: minutes::$r },
        
        $data: {
            order_id: $order_id,
            customer_id: $customer_id,  // Links to customers dataset
            product_id: $product_id,    // Links to products dataset
            quantity: UniformU8::{ low: 1, high: 5 },
            order_total: $avg_order_value,
            timestamp: Instant,
            
            // Order status progression  
            status: Uniform::{
                choices: ["pending", "processing", "shipped", "delivered"]
                // Note: Uniform draws each choice with equal probability;
                // it does not weight toward later statuses
            }
        }
    },
    
    // Product reviews - less frequent than orders
    reviews: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: hours::UniformU8::{ low: 2, high: 48 } },
        $data: {
            review_id: UUID,
            product_id: $product_id,    // Links to products dataset
            customer_id: $customer_id,  // Links to customers dataset
            rating: UniformU8::{ low: 1, high: 5 },
            review_text: LoremIpsum::{ 
                min_words: 10, 
                max_words: 100,
                optional: 0.3  // 30% don't write review text
            },
            timestamp: Instant,
            verified_purchase: Bool::{ p: 0.8 }  // 80% are verified purchases
        }
    }
}

Script Maintenance and Version Control

Script Versioning

rand_processes::{
    // === SCRIPT METADATA ===
    script_info: static_data::{
        $data: {
            script_version: "3.2.1",
            created_date: "2024-01-01",
            last_modified: "2024-01-15", 
            author: "data-engineering-team",
            description: "Multi-tenant SaaS simulation with realistic usage patterns"
        }
    },
    
    // Script content follows...
}

Migration Between Script Versions

# Test new script version against old version
beamline gen data --seed 1000 --start-auto --script-path data_v3.ion --sample-count 100 > new_output.ion
beamline gen data --seed 1000 --start-auto --script-path data_v2.ion --sample-count 100 > old_output.ion

# Compare schemas
beamline infer-shape --seed 1 --start-auto --script-path data_v3.ion --output-format basic-ddl > new_schema.sql
beamline infer-shape --seed 1 --start-auto --script-path data_v2.ion --output-format basic-ddl > old_schema.sql
diff old_schema.sql new_schema.sql

Real-World Script Examples

IoT Sensor Network

rand_processes::{
    // Network topology
    $n_locations: UniformU8::{ low: 3, high: 12 },
    $n_sensors_per_location: UniformU8::{ low: 5, high: 15 },
    
    // Environmental factors
    $base_temperature: NormalF64::{ mean: 22.0, std_dev: 3.0 },
    $seasonal_variation: UniformF64::{ low: -5.0, high: 5.0 },
    
    iot_network: $n_locations::[
        {
            $location_id: UUID::(),
            $location_temp_offset: UniformF64::{ low: -2.0, high: 2.0 }::(), // Per-location offset
            
            // Location metadata
            'location_{$@n}': static_data::{
                $data: {
                    location_id: $location_id,
                    location_name: Format::{ pattern: "Site-{$@n}" },
                    coordinates: {
                        latitude: UniformF64::{ low: 40.0, high: 45.0 },
                        longitude: UniformF64::{ low: -75.0, high: -70.0 }
                    },
                    installation_date: Date
                }
            },
            
            // Sensors at location
            sensors: $n_sensors_per_location::[
                {
                    $sensor_id: UUID::(),
                    
                    'location_{$@n}_sensor_{$@n}': rand_process::{
                        // Illustrative completion: the original example was truncated here,
                        // so the upper bound and data fields below are representative only
                        $arrival: HomogeneousPoisson::{ interarrival: seconds::UniformU8::{ low: 30, high: 120 } },
                        $data: {
                            sensor_id: $sensor_id,
                            temperature: $base_temperature,  // per-location offsets elided for brevity
                            timestamp: Instant
                        }
                    }
                }
            ]
        }
    ]
}

Static Data Generation

Static data in Beamline refers to data that is generated once at the beginning of the simulation, before any temporal events occur. This is useful for creating reference tables, lookup data, or any information that doesn’t change over the course of your simulation.

What is Static Data?

Static data is generated using static_data blocks instead of rand_process blocks. Key differences:

  • Generated once: All static data is created at simulation time 0
  • No arrival process: No $arrival configuration needed
  • Reference data: Often used for lookup tables, master data, configuration
  • Shared across processes: Can be referenced by multiple dynamic processes

Basic Syntax

dataset_name: static_data::{
    $data: {
        // Generator configuration (same as rand_process)
        field1: GeneratorType,
        field2: GeneratorType::{ configuration }
    }
}
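Filling the skeleton in with generators covered elsewhere in this guide gives a minimal working block (the field names here are illustrative):

```ion
status_codes: static_data::{
    $data: {
        code: UniformU16::{ low: 100, high: 599 },
        description: LoremIpsumTitle
    }
}
```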

Static vs Dynamic Data

Dynamic Data (rand_process)

orders: rand_process::{
    $arrival: HomogeneousPoisson::{ interarrival: days::5 },
    $data: {
        order_id: UUID,
        timestamp: Instant,
        amount: UniformDecimal::{ low: 10.00, high: 500.00 }
    }
}

Characteristics:

  • Generated over simulation time
  • Each record has different timestamps
  • Follows arrival process (Poisson, uniform, etc.)

Static Data (static_data)

product_catalog: static_data::{
    $data: {
        product_id: UUID,
        name: LoremIpsumTitle,
        base_price: UniformDecimal::{ low: 5.00, high: 200.00 }
    }
}

Characteristics:

  • Generated all at once at time 0
  • All records have the same timestamp (simulation start time)
  • No arrival process needed

Real Example: Customer and Orders

From the orders.ion test script, here’s how static and dynamic data work together:

rand_processes::{
    // Generate between 5 & 20 customers
    $n: UniformU8::{ low: 5, high: 20 },

    // Shared generators
    $id_gen: UUID,
    $oid_gen: UUID,

    customers: $n::[
        {
            // Each customer gets a unique ID 
            $id: $id_gen::(),

            // Static customer data - generated once per customer
            customer_table: static_data::{
                $data: {
                    id: $id,
                    address: Format::{ pattern: "{$@n} Foo Bar Ave" }
                }
            },

            // Dynamic order data - generated over time
            orders: rand_process::{
                $r: UniformU8::{ low: 1, high: 150 },
                $arrival: HomogeneousPoisson::{ interarrival: days::$r },
                $data: {
                    Order: $oid_gen,
                    Time: Instant,
                    Customer: $id  // References the same ID
                }
            }
        }
    ]
}

When executed, this generates:

Static Data (all at simulation start):

[2019-08-01 0:00:01.0 -07:00:00] : "customer_table" { 'id': 'd858b1e7-7327-7c40-1698-0e0e4fe89ecc', 'address': '0 Foo Bar Ave' }
[2019-08-01 0:00:01.0 -07:00:00] : "customer_table" { 'id': '179e600a-c1c5-8ac2-05b6-15b20f8fe740', 'address': '1 Foo Bar Ave' }
[2019-08-01 0:00:01.0 -07:00:00] : "customer_table" { 'address': '2 Foo Bar Ave', 'id': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0' }

Dynamic Data (spread over time):

[2019-08-01 7:26:21.964 -07:00:00] : "orders" { 'Customer': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0', 'Order': '4c579e42-8c70-93f4-b99b-cc45c50197ed' }
[2019-08-10 5:46:15.24 -07:00:00] : "orders" { 'Customer': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0', 'Order': '38900593-e9cc-994a-98d9-0becf77d9144' }
[2019-08-11 7:27:49.565 -07:00:00] : "orders" { 'Customer': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0', 'Order': 'b2aa0efc-dac3-b391-f4c2-3c298e0c99f4' }

Notice how:

  • All customer_table records have the same timestamp (simulation start)
  • The orders records are distributed over time with different timestamps
  • Both datasets share the same customer IDs, creating referential relationships

Use Cases for Static Data

Reference Tables

Create lookup tables that don’t change during simulation:

rand_processes::{
    // Static product catalog
    products: static_data::{
        $data: {
            product_id: UUID,
            name: LoremIpsumTitle,
            category: Uniform::{ choices: ["Electronics", "Clothing", "Books", "Home"] },
            base_price: UniformDecimal::{ low: 5.00, high: 500.00 }
        }
    },

    // Dynamic orders referencing products
    orders: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::30 },
        $data: {
            order_id: UUID,
            // Note: In real usage, you'd want to reference actual product IDs
            product_category: Uniform::{ choices: ["Electronics", "Clothing", "Books", "Home"] },
            timestamp: Instant
        }
    }
}

Configuration Data

Generate system configuration that remains constant:

rand_processes::{
    // System configuration - static
    config: static_data::{
        $data: {
            system_id: UUID,
            version: Uniform::{ choices: ["1.0", "1.1", "2.0"] },
            max_connections: UniformU16::{ low: 100, high: 1000 },
            timeout_seconds: UniformU8::{ low: 30, high: 300 }
        }
    },

    // Application events - dynamic
    events: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::10 },
        $data: {
            event_id: UUID,
            event_type: Uniform::{ choices: ["login", "logout", "action", "error"] },
            timestamp: Instant
        }
    }
}

User Profiles and Activity

Create user profiles once, then generate their activities over time:

rand_processes::{
    $n: UniformU8::{ low: 10, high: 50 },  // 10-50 users
    $id_gen: UUID,

    users: $n::[
        {
            $user_id: $id_gen::(),  // One ID per user

            // Static user profile
            user_profiles: static_data::{
                $data: {
                    user_id: $user_id,
                    username: Format::{ pattern: "user_{$@n}" },
                    email: Format::{ pattern: "user{$@n}@example.com" },
                    registration_date: Date,
                    plan_type: Uniform::{ choices: ["free", "premium", "enterprise"] }
                }
            },

            // Dynamic user activity
            user_activity: rand_process::{
                $r: UniformU8::{ low: 30, high: 180 },  // 30-180 minutes between actions
                $arrival: HomogeneousPoisson::{ interarrival: minutes::$r },
                $data: {
                    user_id: $user_id,
                    action_type: Uniform::{ choices: ["view", "click", "purchase", "search"] },
                    timestamp: Instant,
                    session_id: UUID
                }
            }
        }
    ]
}

Instant and Date in Static Data

Time-related generators in static data evaluate to the simulation start time:

rand_processes::{
    // System startup data
    system_info: static_data::{
        $data: {
            system_id: UUID,
            startup_time: Instant,      // Will be simulation start time
            startup_date: Date,         // Will be simulation start date
            boot_tick: Tick,            // Will be 0 (initial tick)
            version: "1.0.0"
        }
    }
}

Output Example:

[2024-01-01 00:00:00.000 +00:00] : "system_info" { 
    'system_id': '123e4567-e89b-12d3-a456-426614174000',
    'startup_time': 2024-01-01T00:00:00.000000000+00:00,
    'startup_date': 2024-01-01T00:00:00.000000000+00:00,
    'boot_tick': 0,
    'version': '1.0.0'
}

Multiple Static Datasets

You can create multiple static datasets in the same script:

rand_processes::{
    // Company information
    companies: static_data::{
        $data: {
            company_id: UUID,
            name: Format::{ pattern: "Company {UUID}" },
            industry: Uniform::{ choices: ["Tech", "Finance", "Retail", "Healthcare"] }
        }
    },

    // Department information  
    departments: static_data::{
        $data: {
            dept_id: UUID,
            name: Uniform::{ choices: ["Engineering", "Sales", "Marketing", "HR"] },
            budget: UniformDecimal::{ low: 50000.00, high: 2000000.00 }
        }
    },

    // Employee events - references both static datasets
    employee_events: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: hours::8 },
        $data: {
            employee_id: UUID,
            event_type: Uniform::{ choices: ["hire", "promotion", "transfer", "resignation"] },
            timestamp: Instant,
            // Note: In real usage, you'd reference actual company/dept IDs
            company_type: Uniform::{ choices: ["Tech", "Finance", "Retail", "Healthcare"] },
            department: Uniform::{ choices: ["Engineering", "Sales", "Marketing", "HR"] }
        }
    }
}

Static Data with Variables and Loops

Create multiple static datasets using loops:

rand_processes::{
    $n: UniformU8::{ low: 3, high: 8 },  // 3-8 regions
    $region_id: UUID,

    regions: $n::[
        {
            $id: $region_id::(),  // Unique ID per region

            // Static region data
            'region_{$@n}': static_data::{
                $data: {
                    region_id: $id,
                    region_name: Format::{ pattern: "Region {$@n}" },
                    timezone: Uniform::{ choices: ["UTC-8", "UTC-5", "UTC", "UTC+1"] },
                    population: UniformU32::{ low: 100000, high: 10000000 }
                }
            }
        }
    ]
}

This creates multiple static datasets like region_0, region_1, region_2, etc.

Complex Static Data Structures

Static data supports all the same generators as dynamic data:

rand_processes::{
    // Complex static configuration
    system_config: static_data::{
        $data: {
            config_id: UUID,
            created_at: Instant,
            
            // Nested configuration
            database: {
                host: Regex::{ pattern: "db[0-9]{2}\\.example\\.com" },
                port: UniformU16::{ low: 5432, high: 5439 },
                ssl_enabled: Bool::{ p: 0.9 }
            },
            
            // Array of server configurations
            servers: UniformArray::{
                min_size: 3,
                max_size: 10,
                element_type: {
                    server_id: UUID,
                    hostname: Regex::{ pattern: "server[0-9]{3}\\.example\\.com" },
                    cpu_cores: Uniform::{ choices: [4, 8, 16, 32] },
                    memory_gb: Uniform::{ choices: [16, 32, 64, 128] }
                }
            },
            
            // Mixed type configuration
            features: UniformAnyOf::{
                types: [
                    Bool,
                    UniformI32::{ low: 1, high: 100 },
                    LoremIpsumTitle
                ]
            }
        }
    }
}

Static Data Best Practices

1. Use for Reference Data

// Good - static reference data
product_categories: static_data::{
    $data: {
        category_id: UUID,
        name: Uniform::{ choices: ["Electronics", "Books", "Clothing"] },
        tax_rate: UniformDecimal::{ low: 0.05, high: 0.10 }
    }
}

// Avoid - frequently changing data should be dynamic

2. Share IDs Between Static and Dynamic

rand_processes::{
    $customer_id: UUID,
    
    customers: 5::[
        {
            $id: $customer_id::(),  // Generate once per customer
            
            // Static profile
            customer_profiles: static_data::{
                $data: {
                    customer_id: $id,
                    name: LoremIpsumTitle,
                    email: Format::{ pattern: "customer{$@n}@example.com" }
                }
            },
            
            // Dynamic transactions
            transactions: rand_process::{
                $arrival: HomogeneousPoisson::{ interarrival: days::10 },
                $data: {
                    customer_id: $id,  // Same ID
                    transaction_id: UUID,
                    amount: UniformDecimal::{ low: 10.00, high: 1000.00 }
                }
            }
        }
    ]
}

3. Use Meaningful Static Data

// Good - realistic static data
countries: static_data::{
    $data: {
        country_code: Regex::{ pattern: "[A-Z]{2}" },
        country_name: LoremIpsumTitle,
        population: LogNormalF64::{ location: 15.0, scale: 2.0 },  // Realistic population distribution
        gdp_per_capita: LogNormalF64::{ location: 8.5, scale: 1.5 }
    }
}

// Avoid - unrealistic or meaningless static data

4. Consider Static Data Size

rand_processes::{
    // Small static dataset - appropriate
    currencies: static_data::{
        $data: {
            currency_code: Regex::{ pattern: "[A-Z]{3}" },
            exchange_rate: UniformF64::{ low: 0.1, high: 10.0 }
        }
    }
}

For large reference datasets, consider whether the data really needs to be static or could instead be modeled as a slow-changing dynamic process.

Output Characteristics

CLI Output Format

When you run data generation, static data appears first with identical timestamps:

$ beamline gen data \
    --seed 1234 \
    --start-iso "2019-08-01T00:00:01-07:00" \
    --script-path partiql-beamline-sim/tests/scripts/orders.ion \
    --sample-count 10 \
    --output-format text

Seed: 1234
Start: 2019-08-01T00:00:01.000000000-07:00

# Static data first (all at start time)
[2019-08-01 0:00:01.0 -07:00:00] : "customer_table" { 'id': 'd858b1e7-7327-7c40-1698-0e0e4fe89ecc', 'address': '0 Foo Bar Ave' }
[2019-08-01 0:00:01.0 -07:00:00] : "customer_table" { 'id': '179e600a-c1c5-8ac2-05b6-15b20f8fe740', 'address': '1 Foo Bar Ave' }
[2019-08-01 0:00:01.0 -07:00:00] : "customer_table" { 'address': '2 Foo Bar Ave', 'id': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0' }

# Dynamic data follows (spread over time)
[2019-08-01 7:26:21.964 -07:00:00] : "orders" { 'Customer': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0', 'Order': '4c579e42-8c70-93f4-b99b-cc45c50197ed' }
[2019-08-10 5:46:15.24 -07:00:00] : "orders" { 'Customer': '5e39c6eb-0bc1-7040-cf52-6e69cdf386e0', 'Order': '38900593-e9cc-994a-98d9-0becf77d9144' }

Ion Pretty Format

$ beamline gen data \
    --seed 1234 \
    --start-auto \
    --script-path with_static.ion \
    --sample-count 5 \
    --output-format ion-pretty

{
  seed: 1234,
  start: "2024-01-01T00:00:00Z",
  data: {
    // Static data grouped together
    config: [
      {
        system_id: "123e4567-e89b-12d3-a456-426614174000",
        version: "1.0",
        created_at: 2024-01-01T00:00:00Z
      }
    ],
    
    // Dynamic data grouped together
    events: [
      {
        event_id: "987fcdeb-51a2-43d1-9f4e-123456789abc",
        timestamp: 2024-01-01T00:05:23Z,
        type: "user_login"
      },
      {
        event_id: "456789ab-cdef-1234-5678-9abcdef01234", 
        timestamp: 2024-01-01T00:08:45Z,
        type: "user_action"
      }
    ]
  }
}

Common Patterns

Master Data Pattern

rand_processes::{
    // Static master data
    locations: static_data::{
        $data: {
            location_id: UUID,
            city: LoremIpsumTitle,
            country_code: Regex::{ pattern: "[A-Z]{2}" },
            latitude: UniformF64::{ low: -90.0, high: 90.0 },
            longitude: UniformF64::{ low: -180.0, high: 180.0 }
        }
    },

    // Events at locations
    weather_events: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: hours::6 },
        $data: {
            event_id: UUID,
            // In real usage, would reference actual location_id
            temperature: NormalF64::{ mean: 20.0, std_dev: 10.0 },
            humidity: UniformF64::{ low: 20.0, high: 90.0 },
            timestamp: Instant
        }
    }
}

Hierarchical Data Pattern

rand_processes::{
    $n_orgs: UniformU8::{ low: 2, high: 5 },
    $org_id: UUID,

    organizations: $n_orgs::[
        {
            $id: $org_id::(),

            // Static organization info
            'org_{$@n}': static_data::{
                $data: {
                    org_id: $id,
                    org_name: Format::{ pattern: "Organization {$@n}" },
                    industry: Uniform::{ choices: ["Tech", "Finance", "Healthcare"] },
                    founded_year: UniformU16::{ low: 1950, high: 2020 }
                }
            },

            // Dynamic organizational events
            'org_events_{$@n}': rand_process::{
                $arrival: HomogeneousPoisson::{ interarrival: days::30 },
                $data: {
                    org_id: $id,
                    event_type: Uniform::{ choices: ["hire", "fire", "restructure", "acquisition"] },
                    timestamp: Instant,
                    impact_score: NormalF64::{ mean: 5.0, std_dev: 2.0 }
                }
            }
        }
    ]
}

Database Generation with Static Data

When creating databases with gen db beamline-lite, static data creates separate dataset files:

$ beamline gen db beamline-lite \
    --seed 1000 \
    --start-auto \
    --script-path partiql-beamline-sim/tests/scripts/orders.ion \
    --sample-count 1000

$ tree beamline-catalog/
beamline-catalog/
├── .beamline-manifest
├── .beamline-script
├── customer_table.ion        # Static data
├── customer_table.shape.ion  # Static data schema
├── customer_table.shape.sql  # Static data SQL schema
├── orders.ion               # Dynamic data
├── orders.shape.ion         # Dynamic data schema
└── orders.shape.sql         # Dynamic data SQL schema

Static data file (customer_table.ion):

{id: "abc-123", address: "0 Main St"}
{id: "def-456", address: "1 Main St"}
{id: "ghi-789", address: "2 Main St"}

Dynamic data file (orders.ion):

{Customer: "abc-123", Order: "order-001", Time: 2024-01-01T00:15:30Z}
{Customer: "def-456", Order: "order-002", Time: 2024-01-01T01:22:15Z}
{Customer: "abc-123", Order: "order-003", Time: 2024-01-01T02:08:45Z}

Performance Implications

Memory Usage

  • Static data is generated once and stored in memory during generation
  • Large static datasets may increase memory usage
  • Consider data size when designing static datasets

Generation Speed

  • Static generation happens once at startup
  • No temporal computation needed for static data
  • Overall faster than equivalent dynamic data

Best Practices for Large Static Data

// If you need large reference data, consider dynamic with very slow arrival
// Instead of large static data:
large_reference: static_data::{ /* ... thousands of records ... */ }

// Consider slow dynamic process:
reference_data: rand_process::{
    $arrival: HomogeneousPoisson::{ interarrival: days::365 },  // Very infrequent
    $data: { /* ... */ }
}

Troubleshooting Static Data

Issue: Static Data Not Appearing

Cause: Sample count does not affect static data; it is always generated based on the script configuration.

Solution: Check your script syntax and variable definitions.

Issue: Unexpected Timestamps

Cause: All static data uses simulation start time.

Solution: This is expected behavior. Use dynamic processes for time-distributed data.

Issue: Large Memory Usage

Cause: Large static datasets are loaded into memory.

Solution: Reduce static dataset size or convert to slow dynamic processes.

Examples from Test Scripts

Simple Static Configuration

// From a test script pattern
config: static_data::{
    $data: {
        app_version: "2.1.0",
        max_users: UniformU32::{ low: 1000, high: 10000 },
        feature_flags: UniformAnyOf::{
            types: [Bool, UniformI32::{ low: 0, high: 100 }]
        }
    }
}

Multi-Dataset Static Pattern

rand_processes::{
    $n: UniformU8::{ low: 5, high: 15 },
    
    servers: $n::[
        {
            'server_config_{$@n}': static_data::{
                $data: {
                    server_id: Format::{ pattern: "server-{$@n}" },
                    hostname: Format::{ pattern: "srv{$@n}.example.com" },
                    ip_address: Regex::{ pattern: "192\\.168\\.[0-9]{1,3}\\.[0-9]{1,3}" },
                    capacity: Uniform::{ choices: [100, 200, 500, 1000] }
                }
            }
        }
    ]
}

Next Steps

  • Datasets - Learn about working with multiple datasets and relationships
  • Output Formats - Understand how static data appears in different formats
  • Scripts - Advanced Ion scripting techniques with static and dynamic data
  • Examples - See static data in complete examples

Output Formats

Beamline supports multiple output formats for generated data, each optimized for different use cases. Understanding these formats helps you choose the right one for your workflow.

Available Formats

The CLI supports four main output formats via --output-format:

Format       Description                          Use Case                       Performance
text         Human-readable timestamped format    Debugging, inspection          Moderate
ion          Compact Ion text format              Data processing                Fast
ion-pretty   Pretty-printed Ion with metadata     Configuration, documentation   Slower
ion-binary   Binary Ion format                    High-performance storage       Fastest

Text Format (Default)

Characteristics

  • Human-readable: Easy to read and debug
  • Timestamped: Each record includes generation timestamp and dataset name
  • Streaming: Records appear as they’re generated
  • Metadata: Shows seed and start time

Output Structure

Seed: <seed_value>
Start: <start_timestamp>
[<timestamp>] : "<dataset_name>" { <ion_data> }
[<timestamp>] : "<dataset_name>" { <ion_data> }
...

Example Output

$ beamline gen data \
    --seed 1234 \
    --start-auto \
    --script-path sensors.ion \
    --sample-count 5 \
    --output-format text

Seed: 1234
Start: 2024-05-10T04:04:53.000000000Z
[2024-05-10 4:06:07.274 +00:00:00] : DataSetName("sensors") { 'tick': 74274, 'i8': -86, 'f': 48.07286740416876, 'w': NULL, 'd': 23, 'a': 3.1640, 'ar1': [0.8, 1.1, 1.1], 'ar2': ['e8b12a6c-7cf1-45b6-a8a4-89cd6a418660', 'ba408184-3b94-41e7-860f-6042708bb4be'], 'ar3': [NULL, NULL], 'ar4': [6, 4], 'ar5': [3.1640] }
[2024-05-10 4:08:15.65 +00:00:00] : DataSetName("sensors") { 'tick': 202650, 'i8': 6, 'f': 45.56429323253781, 'w': NULL, 'd': 26, 'a': '613de2a3-195c-410f-8dac-56237f53aa99', 'ar1': [1.1, 0.9, 0.7], 'ar2': ['e0c6700e-f429-429a-a461-c018820fbafe', '9fce83a7-45ef-4210-affe-b87b45e3ac73'], 'ar3': [NULL, 2.4409], 'ar4': [4, 8], 'ar5': ['613de2a3-195c-410f-8dac-56237f53aa99'] }

Use Cases

  • Development and debugging: Easy to read individual records
  • Log file analysis: Timestamped records for event correlation
  • Quick inspection: Rapid visual validation of generated data
  • Educational: Learning how data generation works

Ion Format

Characteristics

  • Compact: No pretty-printing or extra whitespace
  • Fast: Minimal formatting overhead
  • Ion text: Preserves all Ion type information
  • Processable: Easy to parse with Ion libraries

Output Structure

{seed:<seed>,start:"<timestamp>",data:{<dataset_name>:[{<record>},{<record>}...]}}

Example Output

$ beamline gen data \
    --seed 42 \
    --start-auto \
    --script-path simple.ion \
    --sample-count 3 \
    --output-format ion

{seed:42,start:"2024-01-01T00:00:00Z",data:{sensors:[{f:-2.543639,i8:4,tick:125532},{f:-63.493088,i8:4,tick:218756},{f:12.345679,i8:-12,tick:253123}]}}

Use Cases

  • Data processing pipelines: Efficient parsing and processing
  • API responses: Compact data transmission
  • Intermediate storage: Balance between readability and efficiency
  • Configuration files: Structured data that’s still readable

Ion Pretty Format

Characteristics

  • Human-readable: Well-formatted with indentation
  • Complete metadata: Includes seed, start time, and full data structure
  • Ion text format: Preserves all type information
  • Structured: Clear hierarchical organization

Output Structure

{
  seed: <seed>,
  start: "<timestamp>", 
  data: {
    <dataset_name>: [
      {
        <field>: <value>,
        <field>: <value>
      },
      {
        <field>: <value>,
        <field>: <value>
      }
    ]
  }
}

Example Output

$ beamline gen data \
    --seed 123 \
    --start-auto \
    --script-path sensors.ion \
    --sample-count 2 \
    --output-format ion-pretty

{
  seed: 123,
  start: "2024-01-20T10:30:00.000000000Z",
  data: {
    sensors: [
      {
        f: -2.5436390152455175e0,
        i8: 4,
        tick: 125532
      },
      {
        f: -63.49308817145054e0,
        i8: 4,
        tick: 218756
      }
    ]
  }
}

Use Cases

  • Configuration files: Readable but structured data
  • Documentation: Examples and samples in documentation
  • Data inspection: Understanding complex nested structures
  • Archive storage: Long-term storage with metadata

Ion Binary Format

Characteristics

  • Most compact: Smallest file size
  • Fastest: Highest performance for generation and parsing
  • Type preservation: All Ion types preserved exactly
  • Not human-readable: Requires Ion tools to read

Example Usage

$ beamline gen data \
    --seed 999 \
    --start-auto \
    --script-path large_dataset.ion \
    --sample-count 1000000 \
    --output-format ion-binary > data.bin

Use Cases

  • Large datasets: Maximum efficiency for big data generation
  • High-performance applications: Minimal parsing overhead
  • Storage optimization: Smallest possible file sizes
  • Data transmission: Efficient network transfer

Format Comparison

Size Comparison

For the same dataset with 1000 records:

# Generate in all formats for comparison
beamline gen data --seed 1 --start-auto --script-path data.ion --sample-count 1000 --output-format text > data.txt
beamline gen data --seed 1 --start-auto --script-path data.ion --sample-count 1000 --output-format ion > data.ion  
beamline gen data --seed 1 --start-auto --script-path data.ion --sample-count 1000 --output-format ion-pretty > data_pretty.ion
beamline gen data --seed 1 --start-auto --script-path data.ion --sample-count 1000 --output-format ion-binary > data.bin

# Compare sizes
ls -lh data.*
# Example results:
# -rw-r--r-- 1 user user 245K data.txt        (text - largest)
# -rw-r--r-- 1 user user 156K data.ion        (ion - medium)  
# -rw-r--r-- 1 user user 189K data_pretty.ion (pretty - larger due to formatting)
# -rw-r--r-- 1 user user 98K  data.bin        (binary - smallest)

Performance Comparison

For generation of 100,000 records:

  1. ion-binary: Fastest (baseline)
  2. ion: ~10% slower than binary
  3. text: ~25% slower than binary
  4. ion-pretty: ~40% slower than binary (due to formatting)

Format-Specific Features

Text Format Features

Timestamp visibility: See exactly when each event occurred in simulation time

[2024-01-01 08:15:23.456 +00:00] : "orders" { 'order_id': '123e4567', 'amount': 99.99 }
[2024-01-01 08:20:45.789 +00:00] : "orders" { 'order_id': '987fcdeb', 'amount': 149.50 }

Dataset identification: Clear dataset labels for multi-dataset scripts
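The record-line shape above lends itself to simple line-oriented parsing. The sketch below is a hypothetical parser (not a Beamline utility) for the quoted-dataset form shown here; note that some outputs label the dataset as `DataSetName("…")` instead, which this pattern would not match.

```python
import re

# Hypothetical parser for text-format record lines of the form:
#   [<timestamp>] : "<dataset_name>" { <ion_data> }
LINE_RE = re.compile(r'^\[(?P<ts>[^\]]+)\]\s*:\s*"(?P<dataset>[^"]+)"\s*\{(?P<body>.*)\}\s*$')

def parse_record_line(line: str):
    """Split one text-format line into (timestamp, dataset, raw body),
    or return None for non-record lines such as the Seed/Start header."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return m.group("ts"), m.group("dataset"), m.group("body").strip()

line = "[2024-01-01 08:15:23.456 +00:00] : \"orders\" { 'order_id': '123e4567', 'amount': 99.99 }"
ts, dataset, body = parse_record_line(line)
```

Header lines like `Seed: 1234` simply return `None`, so the same function can be mapped over a whole output file.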

Ion Formats Features

Type preservation: All Ion types are preserved exactly

{
  decimal_field: 123.45,           // Exact decimal
  float_field: 123.45e0,           // Float with exponent notation
  timestamp: 2024-01-01T00:00:00Z, // Full timestamp precision
  uuid: "123e4567-e89b-12d3-a456-426614174000"
}

Structured data: Complex nested structures preserved

{
  user: {
    profile: {
      preferences: ["dark_mode", "notifications"]
    }
  }
}

NULL and MISSING Representation

Different formats handle absent values differently:

Text Format

[timestamp] : "dataset" { 'present': 42, 'null_field': null }  // MISSING fields omitted

Ion Formats

{
  present: 42,
  null_field: null
  // missing_field is omitted entirely
}
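The same distinction maps directly onto records in most host languages. As an illustration (plain Python, independent of Beamline), NULL corresponds to a key bound to `None`, while MISSING corresponds to the key being absent, and the difference survives JSON serialization:

```python
import json

# Illustrative mapping of the two kinds of absence onto Python dicts:
#   NULL    -> key present with value None
#   MISSING -> key absent from the record entirely
with_null = {"present": 42, "null_field": None}
with_missing = {"present": 42}  # "null_field" is MISSING

assert with_null["null_field"] is None   # NULL: field exists, has no value
assert "null_field" not in with_missing  # MISSING: field doesn't exist

# The distinction survives JSON serialization:
print(json.dumps(with_null))     # {"present": 42, "null_field": null}
print(json.dumps(with_missing))  # {"present": 42}
```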

Multiple Dataset Output

Text Format with Multiple Datasets

$ beamline gen data \
    --seed 100 \
    --start-auto \
    --script-path client_service.ion \
    --sample-count 10 \
    --output-format text

Seed: 100
Start: 2024-01-01T00:00:00Z
[2024-01-01 00:00:00.000 +00:00] : "customer_table" { 'id': 'abc-123', 'address': '0 Main St' }
[2024-01-01 00:00:00.000 +00:00] : "customer_table" { 'id': 'def-456', 'address': '1 Main St' }  
[2024-01-01 00:05:30.123 +00:00] : "service" { 'Request': 'req-001', 'Account': 'abc-123' }
[2024-01-01 00:05:30.124 +00:00] : "client_1" { 'id': 'abc-123', 'request_id': 'req-001' }

Ion Pretty Format with Multiple Datasets

{
  seed: 100,
  start: "2024-01-01T00:00:00Z",
  data: {
    customer_table: [
      {
        id: "abc-123",
        address: "0 Main St"
      },
      {
        id: "def-456", 
        address: "1 Main St"
      }
    ],
    service: [
      {
        Request: "req-001",
        Account: "abc-123",
        StartTime: 2024-01-01T00:05:30.123Z
      }
    ],
    client_1: [
      {
        id: "abc-123",
        request_id: "req-001",
        request_time: 2024-01-01T00:05:30.124Z
      }
    ]
  }
}

Choosing the Right Format

Development and Testing

# Use text for quick debugging
beamline gen data --script-path debug.ion --sample-count 5 --output-format text

# Use ion-pretty for understanding structure  
beamline gen data --script-path complex.ion --sample-count 10 --output-format ion-pretty

Production and Performance

# Use ion-binary for large datasets
beamline gen data --script-path production.ion --sample-count 1000000 --output-format ion-binary

# Use ion for balance of efficiency and readability
beamline gen data --script-path data.ion --sample-count 100000 --output-format ion

Integration Workflows

# Generate for different consumers
beamline gen data --seed 42 --start-auto --script-path data.ion --sample-count 10000 --output-format ion-binary > high_perf.ion
beamline gen data --seed 42 --start-auto --script-path data.ion --sample-count 100 --output-format ion-pretty > documentation.ion
beamline gen data --seed 42 --start-auto --script-path data.ion --sample-count 1000 --output-format text > debug.txt

Format-Specific Processing

Processing Text Format

# Extract specific datasets
beamline gen data --script-path multi.ion --output-format text | \
  grep '"sensors"' | \
  head -10

# Analyze timestamps
beamline gen data --script-path temporal.ion --output-format text | \
  awk -F'\\[|\\]' '{print $2}' | \
  head -20

Processing Ion Formats

# Use Ion tools for processing
beamline gen data --script-path data.ion --output-format ion-binary | \
  ion-cli query "SELECT * FROM data.sensors WHERE f > 0"

# Convert between formats
beamline gen data --script-path data.ion --output-format ion | \
  ion-cli pretty > formatted.ion

Pipeline Integration

# Generate and immediately process
beamline gen data \
  --seed 123 \
  --start-auto \
  --script-path metrics.ion \
  --sample-count 10000 \
  --output-format ion-pretty | \
  jq '.data.metrics[] | select(.temperature > 25)' | \
  head -10

Database Generation Formats

Database generation creates multiple file formats automatically:

$ beamline gen db beamline-lite \
    --seed 42 \
    --start-auto \
    --script-path data.ion \
    --sample-count 1000

$ ls -la beamline-catalog/
-rw-r--r-- 1 user user   145 .beamline-manifest    # JSON metadata
-rw-r--r-- 1 user user  2.1K .beamline-script      # Ion script
-rw-r--r-- 1 user user   89K sensors.ion           # Data in Ion format
-rw-r--r-- 1 user user   412 sensors.shape.ion     # Schema in Ion format  
-rw-r--r-- 1 user user   298 sensors.shape.sql     # Schema in SQL DDL format

Data Files (Ion Format)

$ head -3 beamline-catalog/sensors.ion
{f: -2.5436390152455175e0, i8: 4, tick: 125532}
{f: -63.49308817145054e0, i8: 4, tick: 218756}
{f: 12.34567890123456e0, i8: -12, tick: 253123}

Schema Files (Ion Format)

$ cat beamline-catalog/sensors.shape.ion
{
  type: "bag",
  items: {
    type: "struct",
    constraints: [ordered, closed],
    fields: [
      { name: "f", type: "double" },
      { name: "i8", type: "int8" },
      { name: "tick", type: "int32" }
    ]
  }
}

Schema Files (SQL DDL Format)

$ cat beamline-catalog/sensors.shape.sql
"f" DOUBLE,
"i8" INT8,
"tick" INT32

Format Selection Guidelines

By Use Case

Use Case                  Recommended Format   Rationale
Quick debugging           text                 Timestamps and human readability
Data inspection           ion-pretty           Structure visibility with metadata
Large dataset generation  ion-binary           Maximum performance and compression
Data processing           ion                  Good balance of efficiency and readability
Documentation             ion-pretty           Clear structure for examples
Long-term storage         ion-binary           Most compact and preserves all types

By Dataset Size

Dataset Size         Recommended Format    Alternative
< 100 records        text or ion-pretty    For inspection
100 - 10K records    ion or ion-pretty     Based on use case
10K - 100K records   ion or ion-binary     For efficiency
> 100K records       ion-binary            Maximum performance
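The size guidelines above can be sketched as a small helper. This function and its name are hypothetical (not part of Beamline); it simply encodes the table's thresholds:

```python
def recommend_format(record_count: int, for_inspection: bool = False) -> str:
    """Pick an --output-format value following the dataset-size guidelines
    above (illustrative encoding of the table, not a Beamline API)."""
    if record_count < 100:
        return "ion-pretty" if for_inspection else "text"
    if record_count <= 10_000:
        return "ion-pretty" if for_inspection else "ion"
    if record_count <= 100_000:
        return "ion"
    return "ion-binary"

print(recommend_format(50, for_inspection=True))  # ion-pretty
print(recommend_format(500_000))                  # ion-binary
```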

By Integration Target

Target System    Recommended Format   Notes
Ion-aware tools  ion-binary           Native format
JSON processors  ion + conversion     Ion can be converted to JSON
SQL databases    Use gen db           Creates SQL schemas automatically
Log analysis     text                 Timestamped format
Documentation    ion-pretty           Human-readable structure

Format Conversion Patterns

Manual Conversion

# Generate in efficient format, convert for specific use
beamline gen data --script-path data.ion --sample-count 10000 --output-format ion-binary > efficient.ion

# Convert to pretty format for inspection
ion-cli pretty < efficient.ion > readable.ion

# Extract specific fields
ion-cli query 'SELECT data.sensors[*].temperature FROM `efficient.ion`' > temperatures.ion

Multi-Format Generation

#!/bin/bash
# generate-multi-format.sh

SCRIPT="$1"
SEED="$2"
COUNT="$3"

# Generate in multiple formats
beamline gen data --seed "$SEED" --start-auto --script-path "$SCRIPT" --sample-count "$COUNT" --output-format ion-binary > data.bin
beamline gen data --seed "$SEED" --start-auto --script-path "$SCRIPT" --sample-count 100 --output-format ion-pretty > sample.ion
beamline gen data --seed "$SEED" --start-auto --script-path "$SCRIPT" --sample-count 10 --output-format text > debug.txt

echo "Generated:"
echo "- data.bin (binary, $COUNT records)"  
echo "- sample.ion (pretty, 100 records)"
echo "- debug.txt (text, 10 records)"

Integration Examples

Web API Integration

# Generate data for API testing
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path api_test_data.ion \
  --sample-count 1000 \
  --output-format ion-pretty | \
  jq '.data' > api_test_payload.json

Database Loading

# Generate data and schema for database
beamline gen db beamline-lite \
  --seed 100 \
  --start-auto \
  --script-path warehouse_data.ion \
  --sample-count 50000

# Use generated SQL schema
psql -d warehouse -f beamline-catalog/orders.shape.sql

# Convert data for loading (would need custom conversion)
# partiql-to-csv beamline-catalog/orders.ion > orders.csv
# COPY orders FROM 'orders.csv' WITH CSV HEADER;

Analytics Pipeline

#!/bin/bash
# analytics-pipeline.sh

# Generate raw data efficiently  
beamline gen data \
  --seed 202401 \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path analytics.ion \
  --sample-count 1000000 \
  --output-format ion-binary > raw_data.bin

# Generate sample for validation
beamline gen data \
  --seed 202401 \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path analytics.ion \
  --sample-count 100 \
  --output-format ion-pretty > sample_validation.ion

echo "Analytics data generated:"
echo "- Raw data: 1000000 records in binary format (raw_data.bin)"
echo "- Validation sample: 100 records in pretty format"

Best Practices

1. Match Format to Purpose

# Debugging - use text
beamline gen data --script-path new_script.ion --sample-count 5 --output-format text

# Production - use binary
beamline gen data --script-path prod_data.ion --sample-count 1000000 --output-format ion-binary

# Documentation - use pretty
beamline gen data --script-path examples.ion --sample-count 10 --output-format ion-pretty

2. Consider File Size for Large Datasets

# Check estimated size first
beamline gen data --script-path large.ion --sample-count 1000 --output-format ion-binary | wc -c
# If 1000 records = 50KB, then 1M records ≈ 50MB
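The back-of-envelope estimate in the comment above is just linear extrapolation; written out (illustrative helper, not a Beamline feature):

```python
def estimate_bytes(sample_bytes: int, sample_records: int, target_records: int) -> int:
    """Linearly extrapolate output size from a small measured sample."""
    per_record = sample_bytes / sample_records
    return round(per_record * target_records)

# 1000 records measured at ~50 KB -> roughly 50 MB for 1M records
print(estimate_bytes(50_000, 1_000, 1_000_000))  # 50000000
```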

3. Use Appropriate Format for Storage

# Long-term storage
beamline gen data --script-path archive.ion --sample-count 100000 --output-format ion-binary

# Working files  
beamline gen data --script-path working.ion --sample-count 1000 --output-format ion-pretty

# Quick inspection
beamline gen data --script-path inspect.ion --sample-count 20 --output-format text

4. Document Format Choices

# Document why you chose specific formats
echo "# Data Formats Used
- raw_data.bin: ion-binary for maximum efficiency (1M+ records)
- sample.ion: ion-pretty for human inspection (100 records)  
- debug.txt: text format for timestamp analysis (50 records)
" > FORMAT_NOTES.md

Next Steps

  • Scripts - Advanced Ion scripting techniques
  • Datasets - Working with multiple datasets and relationships
  • CLI Data Commands - Complete CLI format options reference

Nullability and Optionality

Beamline provides fine-grained control over NULL and MISSING values in generated data. Understanding the distinction between these concepts and how to configure them is crucial for creating realistic datasets that match real-world data patterns.

NULL vs MISSING Values

PartiQL distinguishes between two types of absent values:

NULL Values

  • Meaning: The field exists but has no value
  • JSON equivalent: "field": null
  • SQL equivalent: NULL
  • Ion format: field: null

MISSING Values

  • Meaning: The field doesn’t exist at all
  • JSON equivalent: Field is not present in the object
  • SQL equivalent: Column not included in row
  • Ion format: Field is omitted entirely

Configuration Syntax

Every generator supports both nullability and optionality configuration:

Basic Syntax

generator_name::{ nullable: <config>, optional: <config> }

Configuration Values

Boolean Configuration:

  • nullable: true - Field can be NULL (but 0% chance by default)
  • nullable: false - Field cannot be NULL (default)
  • optional: true - Field can be MISSING (but 0% chance by default)
  • optional: false - Field cannot be MISSING (default)

Probability Configuration:

  • nullable: 0.0 - 0% chance of NULL (same as false)
  • nullable: 0.25 - 25% chance of NULL
  • nullable: 1.0 - 100% chance of NULL (always NULL)
  • optional: 0.1 - 10% chance of MISSING
  • optional: 0.5 - 50% chance of MISSING
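The probability form can be sanity-checked with a quick simulation. This uses plain Python `random`, not Beamline's RNG, so it only illustrates the configured rate, not Beamline's actual sequence:

```python
import random

def sample_nullable(rng: random.Random, p_null: float, low: int, high: int):
    """Mimic a nullable generator: draw the value, then decide whether
    to replace it with None (the value is drawn either way)."""
    value = rng.randint(low, high)
    return None if rng.random() < p_null else value

rng = random.Random(42)
draws = [sample_nullable(rng, 0.25, 1, 100) for _ in range(10_000)]
null_rate = sum(v is None for v in draws) / len(draws)
# null_rate lands close to the configured 0.25
```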

Examples from Real Test Scripts

Basic Nullability

From the sensors.ion test script:

rand_processes::{
    sensors: rand_process::{
        $weight: UniformDecimal::{ 
            nullable: 0.75,     // 75% chance of NULL
            low: 1.995, 
            high: 4.9999, 
            optional: true      // Can be MISSING (0% chance by default)
        },
        $data: {
            weight: $weight,
            price: UniformDecimal::{ 
                low: 2.99, 
                high: 99999.99, 
                optional: true  // Field might not appear at all
            }
        }
    }
}

Advanced Configuration

From the numbers.ion test script showing all combinations:

rand_processes::{
    test_data: rand_process::{
        $data: {
            // Default behavior - not nullable, not optional
            basic_int: UniformI32::{ low: 1, high: 100 },
            
            // Only nullable
            nullable_only: UniformI32::{ 
                nullable: 0.2,    // 20% NULL
                low: 1, 
                high: 100 
            },
            
            // Only optional  
            optional_only: UniformI32::{ 
                optional: 0.1,    // 10% MISSING
                low: 1, 
                high: 100 
            },
            
            // Both nullable and optional
            both_configured: UniformI32::{ 
                nullable: 0.2,    // 20% NULL
                optional: 0.1,    // 10% MISSING
                low: 1,           // 70% present values
                high: 100 
            }
        }
    }
}
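For `both_configured`, the expected mix of outcomes can be worked out. Assuming the MISSING and NULL decisions are independent (the exact precedence inside Beamline is an assumption here), the fractions come out close to the "70% present" noted in the comment above:

```python
p_null, p_missing = 0.2, 0.1

# Assuming independent decisions (precedence is an assumption, not documented):
p_present = (1 - p_missing) * (1 - p_null)  # field appears with a value
p_is_null = (1 - p_missing) * p_null        # field appears as null
print(round(p_present, 2), round(p_is_null, 2), p_missing)  # 0.72 0.18 0.1
```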

Output Examples

Text Format Output

$ beamline gen data \
    --seed 1000 \
    --start-auto \
    --script-path nullability_test.ion \
    --sample-count 10

# Sample outputs showing NULL and MISSING behavior
[2024-01-01 00:00:01.123] : "test_data" { 'basic_int': 42, 'nullable_only': null, 'both_configured': 15 }
[2024-01-01 00:00:02.456] : "test_data" { 'basic_int': 78, 'nullable_only': 23, 'both_configured': null }
[2024-01-01 00:00:03.789] : "test_data" { 'basic_int': 91, 'nullable_only': 67 }  // optional_only is MISSING
[2024-01-01 00:00:04.012] : "test_data" { 'basic_int': 33, 'nullable_only': null, 'optional_only': 88, 'both_configured': 54 }

Notice how:

  • basic_int always appears (not nullable, not optional)
  • nullable_only can be null but always present
  • optional_only might not appear at all (MISSING)
  • both_configured can be null or MISSING

Ion Pretty Format Output

{
  seed: 1000,
  start: "2024-01-01T00:00:00Z",
  data: {
    test_data: [
      {
        basic_int: 42,
        nullable_only: null,        // NULL value present
        both_configured: 15
        // optional_only is MISSING (field not present)
      },
      {
        basic_int: 78,
        nullable_only: 23,
        optional_only: 67,
        both_configured: null       // NULL value present
      },
      {
        basic_int: 91,
        nullable_only: 67,
        optional_only: 45
        // both_configured is MISSING (field not present)
      }
    ]
  }
}

Global Defaults via CLI

You can set global nullability and optionality defaults via CLI options:

CLI Default Configuration

# Make all types nullable with 10% NULL values
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path data.ion \
  --pct-null 0.1

# Make all types optional with 5% MISSING values  
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path data.ion \
  --pct-optional 0.05

# Combine both
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path data.ion \
  --pct-null 0.1 \
  --pct-optional 0.05

# Disable both globally
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path data.ion \
  --default-nullable false \
  --default-optional false

Script vs CLI Override Behavior

  • Script configuration takes precedence over CLI defaults
  • CLI defaults apply to generators without explicit nullable/optional configuration
  • Explicit false in script overrides CLI defaults

Example:

# CLI sets 20% NULL globally
beamline gen data \
  --pct-null 0.2 \
  --script-path mixed_config.ion

// mixed_config.ion
rand_processes::{
    test_data: rand_process::{
        $data: {
            // Uses CLI default: 20% NULL
            field1: UniformI32::{ low: 1, high: 100 },
            
            // Overrides CLI default: never NULL
            field2: UniformI32::{ nullable: false, low: 1, high: 100 },
            
            // Overrides CLI default: 50% NULL
            field3: UniformI32::{ nullable: 0.5, low: 1, high: 100 }
        }
    }
}
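The override rules can be sketched as a tiny resolution function. This is an illustrative model of the behavior described above (the function and its handling of `true` per the boolean-configuration rules are assumptions, not Beamline source):

```python
def effective_rate(script_value, cli_default: float) -> float:
    """Resolve a field's NULL rate: explicit script configuration wins;
    an unconfigured generator falls back to the CLI default."""
    if script_value is None:       # generator left unconfigured in the script
        return cli_default
    if script_value is False:      # explicit `nullable: false` beats the CLI
        return 0.0
    if script_value is True:       # `true` allows NULL but defaults to 0% chance
        return 0.0
    return float(script_value)     # explicit probability

# With --pct-null 0.2 as in the example above:
assert effective_rate(None, 0.2) == 0.2   # field1: uses CLI default
assert effective_rate(False, 0.2) == 0.0  # field2: never NULL
assert effective_rate(0.5, 0.2) == 0.5    # field3: 50% NULL
```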

Realistic Data Patterns

Database-like Nullability

rand_processes::{
    users: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::1 },
        $data: {
            // Required fields - never NULL or MISSING
            user_id: UUID::{ nullable: false, optional: false },
            created_at: Instant::{ nullable: false, optional: false },
            
            // Often present, sometimes NULL
            email: Regex::{ 
                pattern: "[a-z]+@[a-z]+\\.[a-z]{2,3}",
                nullable: 0.05  // 5% NULL emails
            },
            
            // Optional profile fields
            full_name: LoremIpsumTitle::{ optional: 0.3 },  // 30% don't provide name
            phone: Regex::{ 
                pattern: "[0-9]{3}-[0-9]{3}-[0-9]{4}",
                optional: 0.4,   // 40% don't provide phone
                nullable: 0.1    // 10% provide NULL phone
            },
            
            // Rarely provided optional fields
            bio: LoremIpsum::{ 
                min_words: 10, 
                max_words: 50,
                optional: 0.8    // 80% don't provide bio
            }
        }
    }
}

Sensor Data with Missing Readings

rand_processes::{
    sensor_readings: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::30 },
        $data: {
            sensor_id: UUID::{ nullable: false },
            timestamp: Instant::{ nullable: false },
            
            // Primary measurement - rarely fails
            temperature: UniformF64::{ 
                low: -10.0, 
                high: 50.0,
                nullable: 0.02   // 2% sensor failures (NULL)
            },
            
            // Secondary measurement - more failures
            humidity: UniformF64::{ 
                low: 0.0, 
                high: 100.0,
                nullable: 0.05   // 5% sensor failures
            },
            
            // Optional calibration data
            calibration_offset: UniformF64::{ 
                low: -1.0, 
                high: 1.0,
                optional: 0.7    // 70% don't have calibration data
            },
            
            // Rarely available GPS coordinates
            latitude: UniformF64::{ 
                low: -90.0, 
                high: 90.0,
                optional: 0.9    // 90% don't have GPS
            },
            longitude: UniformF64::{ 
                low: -180.0, 
                high: 180.0,
                optional: 0.9    // 90% don't have GPS  
            }
        }
    }
}

Statistical Distribution Implications

Stable Value Generation

An important feature of Beamline’s nullability/optionality system is stable value generation:

The value that would have been generated is still generated even when the field ends up absent; it is simply discarded. This ensures that value generation is stable across runs with different densities of NULL and/or MISSING data.

Example:

test_field: UniformI32::{ low: 1, high: 100, nullable: 0.5 }

With seed 42:

  • 50% NULL case: Generates 17, discards it, outputs null
  • 0% NULL case: Generates 17, outputs 17
  • Same underlying sequence: Both cases generate the same random number sequence

This stability is crucial for:

  • A/B testing: Compare models with different missing data rates
  • Robustness testing: Test algorithms with varying data completeness
  • Reproducible experiments: Same seed produces same value patterns
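One way to picture the discard mechanism is with two independent random streams, one for values and one for null decisions. The split into two seeded streams is purely illustrative (not Beamline's actual design), but it shows why the value sequence stays identical at any null rate:

```python
import random

def generate(seed: int, p_null: float, n: int):
    """Draw values and null decisions from separate streams so the value
    sequence is the same regardless of the null rate (illustrative sketch)."""
    value_rng = random.Random(seed)
    null_rng = random.Random(seed + 1)  # independent stream (assumption)
    out = []
    for _ in range(n):
        value = value_rng.randint(1, 100)     # always generated...
        is_null = null_rng.random() < p_null  # ...then possibly discarded
        out.append((value, None if is_null else value))
    return out

no_nulls = generate(7, 0.0, 5)
half_null = generate(7, 0.5, 5)
# Underlying values match pairwise even though the outputs differ:
assert [v for v, _ in no_nulls] == [v for v, _ in half_null]
```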

AI Model Training Applications

rand_processes::{
    training_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: milliseconds::1 },
        $data: {
            // Always present features
            sample_id: UUID::{ nullable: false, optional: false },
            
            // Core features with realistic missingness
            age: NormalF64::{ 
                mean: 35.0, 
                std_dev: 12.0,
                nullable: 0.02    // 2% missing age data
            },
            
            income: LogNormalF64::{ 
                location: 10.5, 
                scale: 0.5,
                nullable: 0.15,   // 15% NULL income (sensitive data)
                optional: 0.05    // 5% refuse to provide income
            },
            
            // Optional survey responses
            satisfaction_score: UniformF64::{ 
                low: 1.0, 
                high: 10.0,
                optional: 0.4     // 40% don't respond to survey
            },
            
            // Rarely collected features
            location: Regex::{ 
                pattern: "[A-Z]{2}",
                optional: 0.8     // 80% don't provide location
            },
            
            // Target variable - usually complete
            target: Bool::{ 
                p: 0.3,
                nullable: 0.01    // 1% labeling errors
            }
        }
    }
}

Complex Nullability Patterns

Conditional Nullability

Use variables to create related nullability patterns:

rand_processes::{
    $has_premium: Bool::{ p: 0.3 },  // 30% premium users
    
    users: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::2 },
        $data: {
            user_id: UUID,
            plan_type: Uniform::{ choices: ["free", "premium"] },
            
            // Premium features - NULL for non-premium users
            premium_start_date: Instant::{ 
                nullable: 0.7  // 70% NULL (approx. non-premium rate)
            },
            premium_features: UniformArray::{
                min_size: 1,
                max_size: 5,
                element_type: LoremIpsumTitle,
                nullable: 0.7,  // 70% NULL for non-premium
                optional: 0.1   // 10% don't specify even if premium
            }
        }
    }
}

Correlated Missing Data

rand_processes::{
    user_profiles: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
        $data: {
            user_id: UUID::{ nullable: false },
            
            // Contact information - often missing together
            email: Regex::{ 
                pattern: "[a-z]+@[a-z]+\\.[a-z]{2,3}",
                optional: 0.2  // 20% don't provide email
            },
            phone: Regex::{ 
                pattern: "[0-9]{3}-[0-9]{3}-[0-9]{4}",
                optional: 0.25,  // 25% don't provide phone
                nullable: 0.05   // 5% provide NULL phone
            },
            
            // Address components - missing together or not at all
            street_address: LoremIpsumTitle::{ optional: 0.3 },
            city: LoremIpsumTitle::{ optional: 0.3 },
            postal_code: Regex::{ 
                pattern: "[0-9]{5}",
                optional: 0.3 
            },
            
            // Optional demographic info
            age: UniformU8::{ 
                low: 18, 
                high: 80,
                optional: 0.4,   // 40% don't provide age
                nullable: 0.02   // 2% provide invalid age (NULL)
            }
        }
    }
}

Schema Impact

Nullability and optionality affect inferred schemas:

Schema Inference Output

$ beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path nullability_test.ion \
    --output-format basic-ddl

-- Dataset: test_data  
"basic_field" INT NOT NULL,                    -- nullable: false
"nullable_field" INT,                          -- nullable: 0.2
"optional_field" OPTIONAL INT NOT NULL,        -- optional: 0.1
"both_field" OPTIONAL INT                      -- nullable: 0.2, optional: 0.1

CLI Default Impact on Schema

$ beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path simple_data.ion \
    --default-nullable true \
    --default-optional true \
    --output-format basic-ddl

-- All fields become nullable and optional by default
"field1" OPTIONAL INT,
"field2" OPTIONAL VARCHAR,
"field3" OPTIONAL BOOL

Performance Considerations

Value Generation Stability

NULL and MISSING generation carries the same computational cost as regular value generation:

// Same performance regardless of nullability rate
fast_field: UniformI32::{ low: 1, high: 1000, nullable: 0.0 }   // 0% NULL
slow_field: UniformI32::{ low: 1, high: 1000, nullable: 0.9 }   // 90% NULL

Why: The underlying value is always generated, then conditionally discarded.
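
The generate-then-discard pattern can be modeled in Python. This is a hypothetical sketch, not Beamline's code: the base value is always produced, and the nullable/optional rates only decide what happens to it afterwards.

```python
import random

def generate_field(rng, base_gen, nullable=0.0, optional=0.0):
    """Sketch of generate-then-discard (hypothetical, not Beamline's code).

    Returns (present, value): (False, None) models MISSING,
    (True, None) models NULL."""
    value = base_gen(rng)          # the base value is always generated...
    if rng.random() < optional:    # ...then possibly omitted (MISSING)
        return (False, None)
    if rng.random() < nullable:    # ...or kept but replaced with NULL
        return (True, None)
    return (True, value)

rng = random.Random(1)
uniform_i32 = lambda r: r.randint(1, 1000)

# The same work happens whether 0% or 90% of values end up NULL.
no_nulls = [generate_field(rng, uniform_i32) for _ in range(1000)]
mostly_null = [generate_field(rng, uniform_i32, nullable=0.9) for _ in range(1000)]
```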

Memory Usage

  • NULL values: Stored in output (takes memory)
  • MISSING values: Not stored (saves memory)
  • High optionality: Can reduce output size significantly

// Large optional fields save memory when MISSING
large_description: LoremIpsum::{ 
    min_words: 100, 
    max_words: 1000,
    optional: 0.8  // 80% MISSING saves significant memory
}

Testing Data Quality

Missing Data Robustness Testing

Create datasets with increasing levels of missingness:

// Test script for robustness testing
rand_processes::{
    // Dataset with low missingness
    clean_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
        $data: {
            feature1: NormalF64::{ mean: 0.0, std_dev: 1.0, nullable: 0.01 },
            feature2: NormalF64::{ mean: 0.0, std_dev: 1.0, nullable: 0.01 },
            target: Bool::{ p: 0.5, nullable: false }
        }
    },
    
    // Dataset with moderate missingness
    noisy_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
        $data: {
            feature1: NormalF64::{ mean: 0.0, std_dev: 1.0, nullable: 0.1 },
            feature2: NormalF64::{ mean: 0.0, std_dev: 1.0, nullable: 0.15 },
            target: Bool::{ p: 0.5, nullable: 0.02 }
        }
    },
    
    // Dataset with high missingness
    sparse_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
        $data: {
            feature1: NormalF64::{ mean: 0.0, std_dev: 1.0, nullable: 0.3 },
            feature2: NormalF64::{ mean: 0.0, std_dev: 1.0, nullable: 0.4 },
            target: Bool::{ p: 0.5, nullable: 0.05 }
        }
    }
}
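
If you already have a clean dataset, the same tiered-missingness idea can be applied as a post-processing step. The helper below is a hypothetical sketch (not a Beamline feature), mirroring the three nullability tiers in the script above by reproducibly blanking cells at a chosen rate.

```python
import random

def inject_missingness(rows, rate, seed):
    """Reproducibly replace cell values with None at the given rate.

    Hypothetical post-processing helper for robustness experiments,
    not part of Beamline itself."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in rows]

rows = [[float(i), float(-i)] for i in range(1000)]
clean = inject_missingness(rows, rate=0.01, seed=1)   # low missingness
sparse = inject_missingness(rows, rate=0.30, seed=1)  # high missingness
```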

Real-World Data Simulation

rand_processes::{
    customer_survey: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: hours::1 },
        $data: {
            // Always collected
            survey_id: UUID::{ nullable: false, optional: false },
            timestamp: Instant::{ nullable: false, optional: false },
            
            // Demographics - some prefer not to answer
            age: UniformU8::{ low: 18, high: 80, optional: 0.15 },
            gender: Uniform::{ 
                choices: ["M", "F", "Other"], 
                optional: 0.2  // 20% prefer not to say
            },
            
            // Income - sensitive, often skipped or invalid
            income_range: Uniform::{ 
                choices: ["<30K", "30-60K", "60-100K", "100K+"],
                optional: 0.3,   // 30% skip question
                nullable: 0.1    // 10% provide invalid answer
            },
            
            // Rating questions - sometimes skipped
            overall_rating: UniformU8::{ 
                low: 1, 
                high: 10,
                optional: 0.1    // 10% skip rating
            },
            
            // Open-ended responses - frequently skipped  
            comments: LoremIpsum::{ 
                min_words: 5, 
                max_words: 100,
                optional: 0.6    // 60% don't provide comments
            }
        }
    }
}

Best Practices

1. Realistic Nullability Rates

// Good - realistic rates based on domain
email: Regex::{ pattern: "...", nullable: 0.05 }    // 5% invalid emails
age: UniformU8::{ low: 18, high: 80, optional: 0.1 }  // 10% don't provide age

// Avoid - extreme rates without justification  
field: UniformI32::{ low: 1, high: 100, nullable: 0.99 }  // 99% NULL - rarely useful

2. Use Appropriate Absence Types

// NULL for invalid/unknown values
sensor_reading: UniformF64::{ low: 0.0, high: 100.0, nullable: 0.02 }  // Sensor malfunction

// MISSING for optional fields
optional_comment: LoremIpsum::{ min_words: 5, max_words: 50, optional: 0.4 }  // User choice

3. Document Nullability Decisions

rand_processes::{
    // Document nullability reasoning
    user_data: rand_process::{
        $data: {
            // Required business key - never absent
            customer_id: UUID::{ nullable: false, optional: false },
            
            // Email required for notifications - rare NULLs for bad data
            email: Regex::{ pattern: "...", nullable: 0.01 },
            
            // Phone optional - users may not provide
            phone: Regex::{ pattern: "...", optional: 0.3 },
            
            // Marketing consent - some users skip this question
            marketing_consent: Bool::{ optional: 0.15, nullable: 0.05 }
        }
    }
}

4. Test Multiple Missingness Levels

# Generate datasets with different missingness for testing
beamline gen data --seed 1 --start-auto --script-path data.ion --pct-null 0.0 --sample-count 1000 > clean.ion
beamline gen data --seed 1 --start-auto --script-path data.ion --pct-null 0.1 --sample-count 1000 > noisy.ion
beamline gen data --seed 1 --start-auto --script-path data.ion --pct-null 0.3 --sample-count 1000 > sparse.ion

Common Patterns

Required vs Optional Fields

rand_processes::{
    e_commerce: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::2 },
        $data: {
            // Required for business logic
            order_id: UUID::{ nullable: false, optional: false },
            customer_id: UUID::{ nullable: false, optional: false },
            created_at: Instant::{ nullable: false, optional: false },
            
            // Required but can have data quality issues
            total_amount: UniformDecimal::{ 
                low: 5.00, 
                high: 500.00,
                nullable: 0.005  // 0.5% data corruption
            },
            
            // Optional customer-provided data
            shipping_instructions: LoremIpsum::{ 
                min_words: 3, 
                max_words: 20,
                optional: 0.7  // 70% don't provide instructions
            },
            
            // Optional promotional data
            promo_code: Regex::{ 
                pattern: "[A-Z]{4}[0-9]{2}",
                optional: 0.8  // 80% don't use promo codes
            }
        }
    }
}

Legacy Data Migration Patterns

rand_processes::{
    migrated_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::1 },
        $data: {
            // Legacy ID - sometimes missing from old records
            legacy_id: UniformI32::{ 
                low: 1, 
                high: 999999,
                optional: 0.1  // 10% of old records missing legacy ID
            },
            
            // New ID - always present for new system
            new_id: UUID::{ nullable: false, optional: false },
            
            // Data quality issues from migration
            migrated_date: Instant::{ 
                nullable: 0.05  // 5% migration errors (NULL dates)
            },
            
            // Fields added after migration - missing from old records
            new_feature: LoremIpsumTitle::{ 
                optional: 0.6  // 60% old records don't have this field
            }
        }
    }
}

Troubleshooting

Issue: Unexpected NULL/MISSING Behavior

Check configuration precedence:

  1. Script-level configuration overrides CLI defaults
  2. Variable-level configuration applies to all uses
  3. CLI defaults apply to unconfigured generators
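
The precedence rule above reduces to a simple fallback. The function below is a hypothetical sketch of that rule, not Beamline's resolver:

```python
def effective_rate(script_value=None, cli_default=0.0):
    """Hypothetical sketch of precedence: an explicit script-level
    setting wins; otherwise the CLI default applies."""
    return script_value if script_value is not None else cli_default

# A generator configured with nullable: 0.2 ignores --default-nullable
assert effective_rate(script_value=0.2, cli_default=0.5) == 0.2
# An unconfigured generator falls back to the CLI default
assert effective_rate(script_value=None, cli_default=0.5) == 0.5
```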

Issue: Too Many/Few NULL Values

Verify probability values:

  • Values must be between 0.0 and 1.0
  • 0.1 = 10%, 0.25 = 25%, 0.5 = 50%, etc.

Issue: Schema Doesn’t Match Expected Nullability

Check CLI defaults:

# Check if CLI is setting global defaults
beamline infer-shape --seed 1 --start-auto --script-path data.ion --default-nullable false

Integration with Query Generation

NULL and MISSING values affect query generation:

# Generate data with nullability
beamline gen data \
  --seed 1 \
  --start-auto \
  --script-path nullable_data.ion \
  --sample-count 1000 \
  --output-format ion-pretty > test_data.ion

# Generate queries that handle NULL/MISSING
beamline query basic \
  --seed 2 \
  --start-auto \
  --script-path nullable_data.ion \
  --sample-count 10 \
  rand-select-all-fw \
    --pred-absent  # Include IS NULL, IS NOT NULL, IS MISSING predicates

This creates queries like:

SELECT * FROM test_data WHERE (test_data.email IS NOT NULL)
SELECT * FROM test_data WHERE (test_data.optional_field IS MISSING)  
SELECT * FROM test_data WHERE (test_data.phone IS NULL OR test_data.phone LIKE '%555%')

Next Steps

  • Output Formats - See how NULL/MISSING values appear in different formats
  • Scripts - Advanced techniques for managing nullability in complex scripts
  • Query Generation - Generate queries that handle absent values

Query Generator Overview

Beamline’s query generator creates reproducible PartiQL queries that match the shapes and types of data defined in Ion scripts. This allows you to generate realistic test queries for PartiQL implementations, ensuring your queries are both syntactically valid and semantically meaningful for your data structures.

What is Query Generation?

The query generator analyzes the data shapes from your Ion scripts and creates PartiQL queries that:

  • Match your data structure: Queries reference actual fields and types from your data
  • Are syntactically correct: All generated queries parse and execute properly
  • Have realistic complexity: Configurable query patterns from simple to complex
  • Are reproducible: Same seed produces identical query sequences
  • Test diverse patterns: Cover different PartiQL constructs and edge cases

How Query Generation Works

Process Flow

  1. Script Analysis: Parse Ion script to understand data shapes
  2. Shape Inference: Determine field types, structures, and relationships
  3. Query Strategy: Apply configured query generation strategy
  4. Query Construction: Build queries matching data structure
  5. Output Generation: Produce formatted PartiQL queries
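
The five steps above can be sketched as a toy pipeline. Every helper here is an illustrative stand-in, not Beamline's real parser or strategy code:

```python
import random

def parse_script(text):                       # 1. Script Analysis (toy format)
    return dict(line.split(":") for line in text.splitlines())

def infer_shapes(fields):                     # 2. Shape Inference
    return {name: "number" if gen.startswith("Uniform") else "text"
            for name, gen in fields.items()}

def build_query(rng, table, shapes):          # 3-4. Strategy + Construction
    field, kind = rng.choice(sorted(shapes.items()))
    value = rng.randint(1, 100) if kind == "number" else "'US'"
    return f"SELECT * FROM {table} WHERE ({table}.{field} < {value})"

script = "price:UniformDecimal\ncountry_code:Regex"
rng = random.Random(1234)
queries = [build_query(rng, "test_data", infer_shapes(parse_script(script)))
           for _ in range(3)]                 # 5. Output Generation
```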

Shape-Aware Generation

The query generator understands your data structure:

rand_processes::{
    test_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: milliseconds::100 },
        $data: {
            transaction_id: UUID::{ nullable: false },
            marketplace_id: UniformU8::{ nullable: false },
            country_code: Regex::{ pattern: "[A-Z]{2}" },
            created_at: Instant,
            completed: Bool,
            price: UniformDecimal::{ low: 2.99, high: 99999.99, optional: true }
        }
    }
}

Generated queries will reference actual fields like transaction_id, marketplace_id, country_code, etc.

Query Generation Strategies

Beamline supports four main query generation strategies:

1. rand-select-all-fw - SELECT * FROM WHERE

Generates SELECT * queries with WHERE clauses:

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --tbl-flt-path-depth-max 1 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-lt

Example Output:

SELECT * FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 18.418581624952935)  
SELECT * FROM test_data AS test_data WHERE (test_data.price < 15.495327785402296)

2. rand-sfw - SELECT fields FROM WHERE

Generates queries with specific field projections and WHERE clauses:

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-sfw \
        --project-rand-min 2 \
        --project-rand-max 5 \
        --project-path-depth-min 1 \
        --project-path-depth-max 1 \
        --project-pathstep-internal-all \
        --project-pathstep-final-all \
        --project-type-final-all \
        --tbl-flt-rand-min 2 \
        --tbl-flt-rand-max 5 \
        --tbl-flt-path-depth-max 1 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-all

Example Output:

SELECT test_data.completed, test_data.completed FROM test_data AS test_data
WHERE (NOT (test_data.completed) OR NOT ((test_data.created_at IS MISSING)))

SELECT test_data.completed, test_data.marketplace_id, test_data.created_at
FROM test_data AS test_data WHERE (NOT ((test_data.transaction_id IS NULL)) OR
  (((test_data.transaction_id IN ['Iam in.', 'Se.']) OR 
      NOT ((test_data.description IS NULL))) OR
    (test_data.marketplace_id >= 28)))

3. rand-select-all-efw - SELECT * EXCLUDE FROM WHERE

Generates SELECT * EXCLUDE queries:

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-select-all-efw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --tbl-flt-path-depth-max 1 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-lt \
        --exclude-rand-min 1 \
        --exclude-rand-max 3 \
        --exclude-path-depth-min 1 \
        --exclude-path-depth-max 1 \
        --exclude-pathstep-internal-all \
        --exclude-pathstep-final-all \
        --exclude-type-final-all

Example Output:

SELECT * EXCLUDE test_data.marketplace_id, test_data.*, test_data.completed
FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)

SELECT * EXCLUDE test_data.completed FROM test_data AS test_data
WHERE (test_data.price < 18.418581624952935)

4. rand-sefw - SELECT fields EXCLUDE FROM WHERE

Generates queries with projections, exclusions, and WHERE clauses:

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-sefw \
        --project-rand-min 2 \
        --project-rand-max 5 \
        --project-path-depth-min 1 \
        --project-path-depth-max 1 \
        --project-pathstep-internal-all \
        --project-pathstep-final-all \
        --project-type-final-all \
        --tbl-flt-rand-min 2 \
        --tbl-flt-rand-max 5 \
        --tbl-flt-path-depth-max 1 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-all \
        --exclude-rand-min 1 \
        --exclude-rand-max 3 \
        --exclude-path-depth-min 1 \
        --exclude-path-depth-max 1 \
        --exclude-pathstep-internal-all \
        --exclude-pathstep-final-all \
        --exclude-type-final-all

Path Generation

Query generation creates paths that navigate your data structure:

Path Components

  • Projection paths: field_name, object.field, nested.object.field
  • Index paths: array[0], array[5]
  • Wildcard paths: array[*], object.*
  • Deep paths: nested.object.array[*].field

Path Depth Control

# Simple paths (depth 1)
--tbl-flt-path-depth-max 1
# Results: test_data.price, test_data.completed

# Complex paths (depth 3+) 
--tbl-flt-path-depth-max 5
# Results: test_data.nested.object.array[*].field
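
Bounded path generation can be sketched in Python. This is a hypothetical stand-in for the generator, combining the four step kinds with the depth limit from above:

```python
import random

def random_path(rng, root, depth_max, fields=("nested_struct", "price", "completed")):
    """Hypothetical sketch of bounded path generation: append between
    1 and depth_max steps, each a projection, index, wildcard, or
    unpivot step."""
    path = root
    for _ in range(rng.randint(1, depth_max)):
        kind = rng.choice(["project", "index", "foreach", "unpivot"])
        if kind == "project":
            path += "." + rng.choice(list(fields))   # .field
        elif kind == "index":
            path += f"[{rng.randint(0, 5)}]"         # [n]
        elif kind == "foreach":
            path += "[*]"                            # wildcard
        else:
            path += ".*"                             # unpivot
    return path

shallow = random_path(random.Random(1), "test_data", depth_max=1)
deep = random_path(random.Random(1), "test_data", depth_max=5)
```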

Real Examples from Complex Data

From the README’s transactions.ion example with nested structures:

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path transactions.ion \
    --sample-count 3 \
    rand-sefw \
        --project-rand-min 2 \
        --project-rand-max 5 \
        --project-path-depth-min 1 \
        --project-path-depth-max 10 \
        --project-pathstep-internal-all \
        --project-pathstep-final-all \
        --project-type-final-all \
        --pred-all

Generated Deep Paths:

SELECT test_data.*.nested_struct.nested_struct.nested_struct.nested_struct.nested_struct.*,
  test_data.test_nest_struct.*.*.nested_struct.nested_struct
EXCLUDE test_data.*.*.*.*, test_data.price.*
FROM test_data AS test_data
WHERE ((test_data.test_nest_struct.*.*.*.nested_struct.*.test_int <> 19) OR
  (test_data.test_nest_struct.*.*.nested_struct.*.*.test_int > 35))

Predicate Generation

Available Predicate Types

Based on the README’s comprehensive predicate options:

  • Comparison: <, <=, >, >=, =, <>
  • Range: BETWEEN value1 AND value2
  • String: LIKE pattern, NOT LIKE pattern
  • Set membership: IN (value1, value2), NOT IN (value1, value2)
  • Null testing: IS NULL, IS NOT NULL
  • Missing testing: IS MISSING, IS NOT MISSING
  • Logical: AND, OR, NOT
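
Predicate selection can be sketched as a choice over the enabled kinds. The function below is a hypothetical illustration, roughly mirroring the --pred-* flag families rather than Beamline's actual renderer:

```python
import random

def render_predicate(rng, path, value, kinds):
    """Hypothetical sketch: pick one enabled predicate kind and
    render it against a path."""
    kind = rng.choice(kinds)
    if kind == "lt":
        return f"({path} < {value})"
    if kind == "between":
        return f"({path} BETWEEN {value} AND {value + 10})"
    if kind == "is_null":
        return f"({path} IS NULL)"
    if kind == "is_missing":
        return f"({path} IS MISSING)"
    return f"({path} = {value})"

rng = random.Random(1)
# --pred-lt style: a single kind enabled
only_lt = render_predicate(rng, "test_data.price", 10, ["lt"])
# --pred-all style: any kind may be chosen
mixed = [render_predicate(rng, "test_data.price", 10,
                          ["lt", "between", "is_null", "is_missing", "eq"])
         for _ in range(3)]
```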

Predicate Configuration

# Only less-than predicates
--pred-lt

# All comparison predicates
--pred-comparison

# All predicates including logical operators
--pred-all

# Only null/missing testing
--pred-absent

Real Predicate Examples

From the README examples:

-- Simple predicates
WHERE (test_data.marketplace_id < -5)
WHERE (test_data.price BETWEEN 10.0 AND 100.0)

-- Complex logical combinations
WHERE (((test_data.country_code <> 'Qua maxime ceterorum.') AND
    (NOT (test_data.completed IN [false, true]) OR
      (test_data.description = 'Non faciant.'))) AND
  (NOT ((test_data.price IS MISSING)) AND (test_data.price IS MISSING)))

-- Null and missing testing
WHERE (test_data.email IS NOT NULL AND test_data.optional_field IS MISSING)

Configuration Parameters

Table Filter Parameters

Control WHERE clause generation:

Parameter                | Description        | Values
-------------------------|--------------------|-------
--tbl-flt-rand-min       | Minimum predicates | 1-255
--tbl-flt-rand-max       | Maximum predicates | 1-255
--tbl-flt-path-depth-max | Maximum path depth | 1-255

Path Step Configuration

Control how paths navigate through data:

Parameter                           | Description
------------------------------------|---------------------------------
--tbl-flt-pathstep-internal-all     | Enable all internal path steps
--tbl-flt-pathstep-internal-project | Enable projection steps (.field)
--tbl-flt-pathstep-internal-index   | Enable index steps ([1])
--tbl-flt-pathstep-internal-foreach | Enable wildcard steps ([*])
--tbl-flt-pathstep-internal-unpivot | Enable unpivot steps (.*)

Type Constraints

Control what types can appear in query paths:

Parameter                     | Description
------------------------------|-------------------------------------
--tbl-flt-type-final-all      | Allow all final types
--tbl-flt-type-final-scalar   | Only scalar types (9, 'text', true)
--tbl-flt-type-final-sequence | Only sequence types ([1,2,3])
--tbl-flt-type-final-struct   | Only struct types ({'a': 1})

Real Query Examples

Simple Transaction Queries

Based on simple_transactions.ion test script:

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --pred-lt

Results:

SELECT * FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 18.418581624952935)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 15.495327785402296)

Complex Nested Structure Queries

With more complex path generation:

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path transactions.ion \
    --sample-count 2 \
    rand-sefw \
        --project-rand-min 2 \
        --project-rand-max 3 \
        --project-path-depth-max 6 \
        --project-pathstep-internal-all \
        --project-pathstep-final-all \
        --project-type-final-all \
        --tbl-flt-rand-min 2 \
        --tbl-flt-rand-max 3 \
        --tbl-flt-path-depth-max 6 \
        --pred-all \
        --exclude-rand-min 1 \
        --exclude-rand-max 2 \
        --exclude-path-depth-max 4

Results with Deep Paths:

SELECT test_data.*.nested_struct.nested_struct.nested_struct.nested_struct.*,
  test_data.test_nest_struct.*.*.nested_struct.nested_struct.*.*
EXCLUDE test_data.test_nest_struct.*.*, test_data.price.*
FROM test_data AS test_data
WHERE ((test_data.test_nest_struct.*.*.*.nested_struct.*.test_int <> 19) OR
  (test_data.*.*.nested_struct.nested_struct.*.nested_struct.test_int >= -9))

Reproducible Query Generation

Consistent Query Generation

Use specific seeds for reproducible query sets:

# Generate same queries each time
beamline query basic \
    --seed 12345 \
    --start-auto \
    --script-path data.ion \
    --sample-count 10 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 3 \
        --pred-all
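
Why this works can be shown with a small sketch: a single seeded RNG drives every random choice, so the whole query sequence is a pure function of the seed. The generator below is a hypothetical stand-in, not Beamline's code:

```python
import random

def query_sequence(seed, n, fields=("transaction_id", "marketplace_id", "price")):
    """One seeded RNG drives every choice, so output depends only on
    the seed (hypothetical stand-in for Beamline's generator)."""
    rng = random.Random(seed)
    return [f"SELECT * FROM test_data WHERE (test_data.{rng.choice(fields)}"
            f" < {rng.randint(-50, 50)})" for _ in range(n)]

# Same seed, same queries -- every time
assert query_sequence(12345, 10) == query_sequence(12345, 10)
```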

Query Complexity Control

Simple Queries

# Generate simple queries for basic testing
beamline query basic \
    --seed 100 \
    --start-auto \
    --script-path data.ion \
    --sample-count 5 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --tbl-flt-path-depth-max 1 \
        --pred-eq

Complex Queries

# Generate complex queries for comprehensive testing
beamline query basic \
    --seed 200 \
    --start-auto \
    --script-path nested_data.ion \
    --sample-count 5 \
    rand-sefw \
        --project-rand-min 3 \
        --project-rand-max 8 \
        --project-path-depth-max 5 \
        --tbl-flt-rand-min 2 \
        --tbl-flt-rand-max 6 \
        --tbl-flt-path-depth-max 5 \
        --exclude-rand-min 1 \
        --exclude-rand-max 3 \
        --pred-all

Integration Patterns

Testing Workflow Integration

#!/bin/bash
# Generate test data and matching queries

SCRIPT="test_data.ion"
SEED=12345

# Generate test dataset
beamline gen data \
    --seed $SEED \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count 1000 \
    --output-format ion-pretty > test_data.ion

# Generate queries for the dataset
beamline query basic \
    --seed $((SEED + 1)) \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count 20 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 4 \
        --pred-all > test_queries.sql

echo "Generated test data and matching queries"

Query Validation Testing

# Generate queries to test PartiQL implementation
beamline query basic \
    --seed 300 \
    --start-auto \
    --script-path complex_schema.ion \
    --sample-count 50 \
    rand-sfw \
        --project-rand-min 1 \
        --project-rand-max 5 \
        --project-path-depth-max 3 \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 3 \
        --pred-all > validation_queries.sql

# Test each query against your PartiQL implementation
while IFS= read -r query; do
    echo "Testing query: $query"
    # Run query against your PartiQL engine
    # your-partiql-engine --query "$query" --data test_data.ion
done < validation_queries.sql

Advanced Query Patterns

Null and Missing Value Testing

Generate queries that test NULL and MISSING value handling:

beamline query basic \
    --seed 400 \
    --start-auto \
    --script-path nullable_data.ion \
    --sample-count 10 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 2 \
        --pred-absent  # Focus on IS NULL, IS NOT NULL, IS MISSING, IS NOT MISSING

Example Queries:

SELECT * FROM test_data WHERE (test_data.optional_field IS MISSING)
SELECT * FROM test_data WHERE (test_data.nullable_field IS NOT NULL)  
SELECT * FROM test_data WHERE (test_data.price IS NULL OR test_data.completed = true)

Performance Testing Queries

Generate queries for performance benchmarking:

# Generate queries with different complexity levels
for complexity in 1 2 5 10; do
    beamline query basic \
        --seed 500 \
        --start-auto \
        --script-path large_dataset.ion \
        --sample-count 10 \
        rand-select-all-fw \
            --tbl-flt-rand-min $complexity \
            --tbl-flt-rand-max $complexity \
            --pred-all > "queries_complexity_$complexity.sql"
done

Query Generation Best Practices

1. Match Query Complexity to Data Structure

# Simple flat data - use simple paths
beamline query basic --script-path flat_data.ion --project-path-depth-max 2

# Complex nested data - use deeper paths  
beamline query basic --script-path nested_data.ion --project-path-depth-max 8

2. Use Appropriate Predicate Sets

# For numeric data testing
--pred-comparison --pred-between

# For string data testing  
--pred-like --pred-in

# For comprehensive testing
--pred-all

3. Generate Query Suites

# Generate different query types for comprehensive testing
beamline query basic --script-path data.ion --sample-count 10 rand-select-all-fw --pred-all > select_star.sql
beamline query basic --script-path data.ion --sample-count 10 rand-sfw --pred-all > projections.sql  
beamline query basic --script-path data.ion --sample-count 10 rand-select-all-efw --pred-all > excludes.sql

4. Validate Generated Queries

Test generated queries against your data:

# Generate data and queries with same script
beamline gen data --seed 1 --start-auto --script-path test.ion --sample-count 100 > data.ion
beamline query basic --seed 2 --start-auto --script-path test.ion --sample-count 5 rand-select-all-fw --pred-all > queries.sql

# Validate queries parse correctly
# your-partiql-parser --validate queries.sql

Use Cases

PartiQL Implementation Testing

Generate comprehensive query test suites:

# Generate queries covering all PartiQL features
beamline query basic \
    --seed 600 \
    --start-auto \
    --script-path comprehensive_schema.ion \
    --sample-count 100 \
    rand-sefw \
        --project-rand-min 1 \
        --project-rand-max 10 \
        --project-path-depth-max 5 \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 5 \
        --exclude-rand-min 1 \
        --exclude-rand-max 3 \
        --pred-all

Performance Benchmarking

Create query workloads for performance testing:

# Generate queries with increasing complexity
beamline query basic \
    --seed 700 \
    --start-auto \
    --script-path performance_schema.ion \
    --sample-count 50 \
    rand-sfw \
        --project-rand-min 1 \
        --project-rand-max 20 \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 10 \
        --pred-all > performance_queries.sql

Edge Case Testing

Generate queries that test edge cases:

# Focus on complex path expressions
beamline query basic \
    --seed 800 \
    --start-auto \
    --script-path edge_case_data.ion \
    --sample-count 25 \
    rand-sefw \
        --project-path-depth-min 3 \
        --project-path-depth-max 8 \
        --project-pathstep-internal-foreach \
        --project-pathstep-final-unpivot \
        --tbl-flt-path-depth-max 6 \
        --exclude-path-depth-min 2 \
        --exclude-path-depth-max 4 \
        --pred-all

Next Steps

Now that you understand query generation fundamentals, explore specific aspects:

Basic Query Generation

This section covers fundamental query generation patterns using Beamline’s rand-select-all-fw strategy, which generates simple SELECT * queries with WHERE clauses. This is the best starting point for understanding how query generation works.

Getting Started with Basic Queries

Simple Transaction Data

Let’s use the simple_transactions.ion script from the test suite as our data source:

rand_processes::{
    test_data: rand_process::{
        $r: Uniform::{ choices: [5,10] },
        $arrival: HomogeneousPoisson::{ interarrival: milliseconds::$r },

        $data: {
            transaction_id: UUID::{ nullable: false },
            marketplace_id: UniformU8::{ nullable: false },
            country_code: Regex::{ pattern: "[A-Z]{2}" },
            created_at: Instant,
            completed: Bool,
            description: LoremIpsum::{ min_words:10, max_words:200 },
            price: UniformDecimal::{ low: 2.99, high: 99999.99, optional: true }
        }
    }
}

This script creates transaction data with various field types that the query generator can reference.

Basic Query Generation Command

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --tbl-flt-path-depth-max 1 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-lt

Generated Output:

SELECT * FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)

SELECT * FROM test_data AS test_data WHERE (test_data.price < 18.418581624952935)

SELECT * FROM test_data AS test_data WHERE (test_data.price < 15.495327785402296)

Understanding the Parameters

  • --tbl-flt-rand-min 1 --tbl-flt-rand-max 1: Generate exactly 1 predicate per query
  • --tbl-flt-path-depth-max 1: Use only top-level fields (no nested paths)
  • --tbl-flt-pathstep-final-project: Final path step is field projection (.field)
  • --tbl-flt-type-final-scalar: Only reference scalar values (numbers, strings, booleans)
  • --pred-lt: Use only less-than (<) predicates

Different Predicate Types

Comparison Predicates

# Less than predicates
beamline query basic \
    --seed 100 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --pred-lt

Output:

SELECT * FROM test_data AS test_data WHERE (test_data.price < 123.45)
SELECT * FROM test_data AS test_data WHERE (test_data.marketplace_id < 42)

Equality Predicates

# Equality predicates
beamline query basic \
    --seed 200 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --pred-eq

Output:

SELECT * FROM test_data AS test_data WHERE (test_data.completed = true)
SELECT * FROM test_data AS test_data WHERE (test_data.country_code = 'US')

All Predicate Types

# Use all available predicates for comprehensive testing
beamline query basic \
    --seed 300 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 5 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 3 \
        --pred-all

Output:

SELECT * FROM test_data AS test_data WHERE (test_data.country_code IN [
      'Graecos quidem legendos.',
      'Possit et sine.'
    ] OR (NOT ((test_data.description IS MISSING)) OR
    (test_data.description IS MISSING)))

SELECT * FROM test_data AS test_data WHERE (((test_data.transaction_id IS NULL)
    AND (test_data.created_at IS NULL)) OR (((test_data.completed IN [
            false,
            false
          ] OR NOT ((test_data.completed IS NULL))) AND
      ((NOT ((test_data.price IS NULL)) OR
          (test_data.transaction_id LIKE 'Vidisse.' AND
            (test_data.country_code IS NULL))) AND
        NOT ((test_data.description IS MISSING)))) OR
    (test_data.description <> 'Nec vero.')))

Multiple Predicates

Combining Predicates

Increase predicate count to create more complex WHERE clauses:

# Generate queries with 2-5 predicates
beamline query basic \
    --seed 400 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-select-all-fw \
        --tbl-flt-rand-min 2 \
        --tbl-flt-rand-max 5 \
        --pred-all

Example Output:

SELECT * FROM test_data AS test_data
WHERE (((((test_data.country_code <> 'Qua maxime ceterorum.') AND
        (NOT (test_data.completed IN [ false, true ]) OR
          (test_data.description = 'Non faciant.'))) AND
      (NOT ((test_data.price IS MISSING)) AND (test_data.price IS MISSING))) OR
    test_data.price IN [
        -47.936734585045905,
        -0.8509689800217544,
        24.263479438050297
      ]) OR ((test_data.created_at = UTCNOW()) OR
    (NOT ((test_data.country_code IS MISSING)) AND
      (test_data.description IS MISSING))))

Field Types and Query Generation

Numeric Fields

The query generator creates appropriate predicates for numeric types:

rand_processes::{
    numeric_test: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            count: UniformI32::{ low: 1, high: 1000 },
            price: UniformDecimal::{ low: 9.99, high: 999.99 },
            score: NormalF64::{ mean: 75.0, std_dev: 15.0 }
        }
    }
}

Generated Queries:

SELECT * FROM numeric_test WHERE (numeric_test.count > 500)
SELECT * FROM numeric_test WHERE (numeric_test.price BETWEEN 50.0 AND 200.0)
SELECT * FROM numeric_test WHERE (numeric_test.score <= 85.5)

String Fields

String generators produce string-appropriate predicates:

rand_processes::{
    string_test: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            name: LoremIpsumTitle,
            email: Format::{ pattern: "user{UUID}@example.com" },
            country: Regex::{ pattern: "[A-Z]{2}" },
            description: LoremIpsum::{ min_words: 5, max_words: 50 }
        }
    }
}

Generated Queries:

SELECT * FROM string_test WHERE (string_test.country = 'US')
SELECT * FROM string_test WHERE (string_test.name LIKE '%Test%')
SELECT * FROM string_test WHERE (string_test.email IN ['user1@example.com', 'user2@example.com'])

Boolean Fields

Boolean fields generate boolean predicates:

rand_processes::{
    boolean_test: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            active: Bool,
            verified: Bool::{ p: 0.8 },
            premium: Bool::{ p: 0.1 }
        }
    }
}

Generated Queries:

SELECT * FROM boolean_test WHERE (boolean_test.active = true)
SELECT * FROM boolean_test WHERE (boolean_test.verified AND boolean_test.premium)
SELECT * FROM boolean_test WHERE (NOT boolean_test.active)

Null and Missing Value Queries

Testing Null Handling

When your data includes nullable fields, queries will test null handling:

rand_processes::{
    nullable_test: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            required_field: UUID::{ nullable: false },
            nullable_field: UniformI32::{ nullable: 0.3, low: 1, high: 100 },
            optional_field: UniformDecimal::{ optional: 0.2, low: 0.0, high: 1000.0 }
        }
    }
}

Generated Queries:

SELECT * FROM nullable_test WHERE (nullable_test.nullable_field IS NOT NULL)
SELECT * FROM nullable_test WHERE (nullable_test.optional_field IS MISSING)
SELECT * FROM nullable_test WHERE (nullable_test.required_field IS NOT NULL AND nullable_test.nullable_field > 50)

Focusing on Null/Missing Tests

# Generate queries focused on null/missing testing
beamline query basic \
    --seed 500 \
    --start-auto \
    --script-path nullable_data.ion \
    --sample-count 5 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 2 \
        --pred-absent  # Only IS NULL, IS NOT NULL, IS MISSING, IS NOT MISSING

Progressive Query Complexity

Start Simple

# Begin with single predicates
beamline query basic \
    --seed 1 \
    --start-auto \
    --script-path data.ion \
    --sample-count 5 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --pred-eq

Add More Predicates

# Increase to 2-3 predicates
beamline query basic \
    --seed 1 \
    --start-auto \
    --script-path data.ion \
    --sample-count 5 \
    rand-select-all-fw \
        --tbl-flt-rand-min 2 \
        --tbl-flt-rand-max 3 \
        --pred-comparison

Enable All Predicates

# Use all available predicates for full complexity
beamline query basic \
    --seed 1 \
    --start-auto \
    --script-path data.ion \
    --sample-count 5 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 5 \
        --pred-all

Examples

Single Predicate Queries

From the README example:

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --tbl-flt-path-depth-max 1 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-lt

Actual Output:

SELECT * FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 18.418581624952935)
SELECT * FROM test_data AS test_data WHERE (test_data.price < 15.495327785402296)

Multiple Predicate Queries

From the README example with more complex predicates:

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-select-all-fw \
        --tbl-flt-rand-min 3 \
        --tbl-flt-rand-max 10 \
        --tbl-flt-path-depth-max 1 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-all \
        --pred-all

Actual Output:

SELECT * FROM test_data AS test_data WHERE (test_data.country_code IN [
      'Graecos quidem legendos.',
      'Possit et sine.'
    ] OR (NOT ((test_data.description IS MISSING)) OR
    (test_data.description IS MISSING)))

SELECT * FROM test_data AS test_data WHERE (((test_data.transaction_id IS NULL)
    AND (test_data.created_at IS NULL)) OR (((test_data.completed IN [
            false,
            false
          ] OR NOT ((test_data.completed IS NULL))) AND
      ((NOT ((test_data.price IS NULL)) OR
          (test_data.transaction_id LIKE 'Vidisse.' AND
            (test_data.country_code IS NULL))) AND
        NOT ((test_data.description IS MISSING)))) OR
    (test_data.description <> 'Nec vero.')))

SELECT * FROM test_data AS test_data
WHERE (((((test_data.country_code <> 'Qua maxime ceterorum.') AND
        (NOT (test_data.completed IN [ false, true, true ]) OR
          (test_data.description = 'Non faciant.'))) AND
      (NOT ((test_data.price IS MISSING)) AND (test_data.price IS MISSING))) OR
    test_data.price IN [
        -47.936734585045905,
        -0.8509689800217544,
        24.263479438050297,
        -48.953369038690255
      ]) OR ((test_data.created_at = UTCNOW()) OR
    (NOT ((test_data.country_code IS MISSING)) AND
      (test_data.description IS MISSING))))

Specific Predicate Types

Comparison Predicates

# Only numeric comparisons
beamline query basic \
    --seed 100 \
    --start-auto \
    --script-path numeric_data.ion \
    --sample-count 5 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --pred-comparison  # <, <=, >, >=, =, <>

Example Results:

SELECT * FROM test_data WHERE (test_data.price >= 50.0)
SELECT * FROM test_data WHERE (test_data.count <= 500)
SELECT * FROM test_data WHERE (test_data.score <> 75.5)

String Pattern Matching

# Focus on LIKE predicates
beamline query basic \
    --seed 200 \
    --start-auto \
    --script-path text_data.ion \
    --sample-count 5 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --pred-like

Example Results:

SELECT * FROM test_data WHERE (test_data.description LIKE '%lorem%')
SELECT * FROM test_data WHERE (test_data.country_code LIKE 'U_')

Set Membership

# Use IN and NOT IN predicates
beamline query basic \
    --seed 300 \
    --start-auto \
    --script-path categorical_data.ion \
    --sample-count 5 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --pred-in

Example Results:

SELECT * FROM test_data WHERE (test_data.status IN ['active', 'pending'])
SELECT * FROM test_data WHERE (test_data.category IN ['electronics', 'books', 'clothing'])

Reproducible Query Testing

Test Suite Generation

Create consistent test suites:

#!/bin/bash
# Generate reproducible query test suite

SCRIPT="test_schema.ion"
BASE_SEED=12345

# Simple queries for basic functionality
beamline query basic \
    --seed $BASE_SEED \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count 10 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --pred-eq > basic_equality.sql

# Comparison queries for numeric testing  
beamline query basic \
    --seed $((BASE_SEED + 1)) \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count 10 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --pred-comparison > numeric_comparisons.sql

# Complex queries for comprehensive testing
beamline query basic \
    --seed $((BASE_SEED + 2)) \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count 15 \
    rand-select-all-fw \
        --tbl-flt-rand-min 2 \
        --tbl-flt-rand-max 5 \
        --pred-all > complex_queries.sql

# Note: wc -l counts lines; queries generated with --pred-all may span
# several lines, so treat these as rough counts.
echo "Generated test suite:"
echo "- basic_equality.sql: $(wc -l < basic_equality.sql) queries"
echo "- numeric_comparisons.sql: $(wc -l < numeric_comparisons.sql) queries"
echo "- complex_queries.sql: $(wc -l < complex_queries.sql) queries"

Regression Testing

# Generate baseline queries
beamline query basic \
    --seed 999 \
    --start-auto \
    --script-path stable_schema.ion \
    --sample-count 20 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 3 \
        --pred-all > baseline_queries.sql

# Later: regenerate with same seed to verify no regressions
beamline query basic \
    --seed 999 \
    --start-auto \
    --script-path stable_schema.ion \
    --sample-count 20 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 3 \
        --pred-all > regression_test_queries.sql

# Verify identical output
diff baseline_queries.sql regression_test_queries.sql
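In CI, the same diff can gate merges. A minimal sketch: in practice the two files would come from the beamline runs above; here they are created inline so the check runs standalone.

```shell
# Minimal CI-style reproducibility gate. The demo files stand in for the
# beamline-generated baseline and regression outputs.
printf 'SELECT * FROM t WHERE (t.a = 1)\n' > baseline_queries.sql
printf 'SELECT * FROM t WHERE (t.a = 1)\n' > regression_test_queries.sql

if diff -q baseline_queries.sql regression_test_queries.sql > /dev/null; then
    echo "OK: query generation is reproducible"
else
    echo "FAIL: generated queries changed" >&2
    exit 1
fi
```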

Common Query Patterns

Data Validation Queries

Generate queries that validate data constraints:

# Focus on range and constraint validation
beamline query basic \
    --seed 600 \
    --start-auto \
    --script-path validation_schema.ion \
    --sample-count 10 \
    rand-select-all-fw \
        --pred-comparison --pred-between

Example Validation Queries:

SELECT * FROM test_data WHERE (test_data.age BETWEEN 0 AND 120)
SELECT * FROM test_data WHERE (test_data.price > 0)
SELECT * FROM test_data WHERE (test_data.email IS NOT NULL)

Performance Baseline Queries

Create simple queries to establish a performance baseline:

# Simple queries for baseline performance measurement
beamline query basic \
    --seed 700 \
    --start-auto \
    --script-path performance_data.ion \
    --sample-count 25 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --pred-eq --pred-lt --pred-gt

Integration with Data Generation

Complete Workflow

#!/bin/bash
# Complete data + query generation workflow

SCRIPT="customer_transactions.ion"
SEED=12345

echo "Generating test data..."
beamline gen data \
    --seed $SEED \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count 1000 \
    --output-format ion-pretty > test_data.ion

echo "Generating basic queries..."
beamline query basic \
    --seed $((SEED + 1)) \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count 15 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 2 \
        --pred-all > basic_test_queries.sql

echo "Generated $(wc -l < test_data.ion) lines of test data and $(wc -l < basic_test_queries.sql) lines of queries"

# Test first few queries (example)
head -5 basic_test_queries.sql | while IFS= read -r query; do
    echo "Query: $query"
    # your-partiql-engine --query "$query" --data test_data.ion
done

Best Practices

1. Start with Simple Configurations

# Begin testing with minimal complexity
beamline query basic \
    --seed 1 \
    --start-auto \
    --script-path new_schema.ion \
    --sample-count 3 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --pred-eq

2. Match Predicates to Data Types

# For numeric-heavy data
--pred-comparison --pred-between

# For string-heavy data  
--pred-like --pred-eq --pred-in

# For boolean data
--pred-eq --pred-logical-not

3. Test Query Coverage

# Generate enough queries to cover different data patterns
beamline query basic \
    --seed 100 \
    --start-auto \
    --script-path comprehensive_data.ion \
    --sample-count 50 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 4 \
        --pred-all

4. Validate Against Real Data

# Always test generated queries work with generated data
beamline gen data --seed 1 --start-auto --script-path schema.ion --sample-count 100 > data.ion
beamline query basic --seed 2 --start-auto --script-path schema.ion --sample-count 5 rand-select-all-fw --pred-all > queries.sql

# Validate each query (read line by line; `for query in $(cat ...)` would
# split each query on whitespace)
# while IFS= read -r query; do
#     your-partiql-engine --validate "$query"
# done < queries.sql

Next Steps

Now that you understand basic query generation, continue to Advanced Query Patterns.

Advanced Query Patterns

This section covers advanced query generation using the three remaining query strategies: rand-sfw (custom projections), rand-select-all-efw (exclusions), and rand-sefw (projections + exclusions); the basic rand-select-all-fw strategy was covered in the previous section. These strategies generate more sophisticated PartiQL queries that exercise complex language features.

Query Strategy Overview

Available Strategies

Strategy            | Query Pattern                              | Features
--------------------|--------------------------------------------|----------------------------------
rand-select-all-fw  | SELECT * FROM ... WHERE ...                | Basic queries with WHERE clauses
rand-sfw            | SELECT <fields> FROM ... WHERE ...         | Custom projections + WHERE
rand-select-all-efw | SELECT * EXCLUDE ... FROM ... WHERE ...    | Exclusions + WHERE
rand-sefw           | SELECT <fields> EXCLUDE ... FROM ... WHERE | Projections + exclusions + WHERE

SELECT with Projections (rand-sfw)

Basic Projection Queries

Generate queries with specific field selections:

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-sfw \
        --project-rand-min 2 \
        --project-rand-max 5 \
        --project-path-depth-min 1 \
        --project-path-depth-max 1 \
        --project-pathstep-internal-all \
        --project-pathstep-final-all \
        --project-type-final-all \
        --tbl-flt-rand-min 2 \
        --tbl-flt-rand-max 5 \
        --tbl-flt-path-depth-max 1 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-all

Example Output:

SELECT test_data.completed, test_data.completed FROM test_data AS test_data
WHERE (NOT (test_data.completed) OR NOT ((test_data.created_at IS MISSING)))

SELECT test_data.completed, test_data.marketplace_id, test_data.created_at
FROM test_data AS test_data WHERE (NOT ((test_data.transaction_id IS NULL)) OR
  (((test_data.transaction_id IN [
            'Iam in.',
            'Se.',
            'Sine amicitia firmam.',
            'Notae sunt.'
          ] OR (test_data.transaction_id IS NULL)) OR
      NOT ((test_data.description IS NULL))) OR
    (test_data.marketplace_id >= 28)))

SELECT test_data, test_data.description FROM test_data AS test_data
WHERE (test_data.completed IN [ false, false ] AND
  (((test_data.price <= 5.761136291521325) AND
      NOT ((test_data.transaction_id IS MISSING))) AND
    (NOT ((test_data.created_at IS MISSING)) AND
      (test_data.created_at IS NULL))))

Understanding Projection Parameters

  • --project-rand-min 2 --project-rand-max 5: Select 2-5 fields in SELECT clause
  • --project-path-depth-max 1: Use simple field names (no deep nesting)
  • --project-pathstep-final-all: Allow all path types (.field, [*], .*)
  • --project-type-final-all: Project all types (scalars, structs, sequences)

SELECT * EXCLUDE (rand-select-all-efw)

Basic Exclusion Queries

Generate SELECT * queries that exclude specific fields:

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-select-all-efw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --tbl-flt-path-depth-max 1 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-lt \
        --exclude-rand-min 1 \
        --exclude-rand-max 3 \
        --exclude-path-depth-min 1 \
        --exclude-path-depth-max 1 \
        --exclude-pathstep-internal-all \
        --exclude-pathstep-final-all \
        --exclude-type-final-all

Example Output:

SELECT * EXCLUDE test_data.marketplace_id, test_data.*, test_data.completed
FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)

SELECT * EXCLUDE test_data.completed FROM test_data AS test_data
WHERE (test_data.price < 18.418581624952935)

SELECT * EXCLUDE test_data.marketplace_id, test_data.completed,
  test_data.marketplace_id
FROM test_data AS test_data WHERE (test_data.price < 15.495327785402296)

Understanding Exclusion Parameters

  • --exclude-rand-min 1 --exclude-rand-max 3: Exclude 1-3 fields
  • --exclude-path-depth-max 1: Use simple field exclusions
  • --exclude-pathstep-final-all: Allow all exclusion path types
  • --exclude-type-final-all: Exclude all types (scalars, structs, arrays)

SELECT EXCLUDE FROM WHERE (rand-sefw)

Complete Query Generation

The most sophisticated strategy combines projections, exclusions, and WHERE clauses:

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path simple_transactions.ion \
    --sample-count 3 \
    rand-sefw \
        --project-rand-min 2 \
        --project-rand-max 5 \
        --project-path-depth-min 1 \
        --project-path-depth-max 1 \
        --project-pathstep-internal-all \
        --project-pathstep-final-all \
        --project-type-final-all \
        --tbl-flt-rand-min 2 \
        --tbl-flt-rand-max 5 \
        --tbl-flt-path-depth-max 1 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-all \
        --exclude-rand-min 1 \
        --exclude-rand-max 3 \
        --exclude-path-depth-min 1 \
        --exclude-path-depth-max 1 \
        --exclude-pathstep-internal-all \
        --exclude-pathstep-final-all \
        --exclude-type-final-all

Example Output:

SELECT test_data.completed, test_data.completed
EXCLUDE test_data.marketplace_id, test_data.*, test_data.completed
FROM test_data AS test_data 
WHERE (NOT (test_data.completed) OR
  NOT ((test_data.created_at IS MISSING)))

SELECT test_data.completed, test_data.marketplace_id, test_data.created_at
EXCLUDE test_data.completed 
FROM test_data AS test_data
WHERE (NOT ((test_data.transaction_id IS NULL)) OR
  (((test_data.transaction_id IN [
            'Iam in.',
            'Se.',
            'Sine amicitia firmam.',
            'Notae sunt.'
          ] OR (test_data.transaction_id IS NULL)) OR
      NOT ((test_data.description IS NULL))) OR
    (test_data.marketplace_id >= 28)))

SELECT test_data, test_data.description 
EXCLUDE test_data.marketplace_id, test_data.completed, test_data.marketplace_id
FROM test_data AS test_data 
WHERE (test_data.completed IN [ false, false ] AND
  (((test_data.price <= 5.761136291521325) AND
      NOT ((test_data.transaction_id IS MISSING))) AND
    (NOT ((test_data.created_at IS MISSING)) AND
      (test_data.created_at IS NULL))))

Deep Path Generation

Complex Nested Structures

For deeply nested data structures, use higher path depth limits:

beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path transactions.ion \
    --sample-count 3 \
    rand-sefw \
        --project-rand-min 2 \
        --project-rand-max 5 \
        --project-path-depth-min 1 \
        --project-path-depth-max 10 \
        --project-pathstep-internal-all \
        --project-pathstep-final-all \
        --project-type-final-all \
        --tbl-flt-rand-min 2 \
        --tbl-flt-rand-max 5 \
        --tbl-flt-path-depth-max 10 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-all \
        --exclude-rand-min 1 \
        --exclude-rand-max 2 \
        --exclude-path-depth-min 3 \
        --exclude-path-depth-max 4 \
        --exclude-pathstep-internal-all \
        --exclude-pathstep-final-unpivot \
        --exclude-type-final-all

Generated Deep Nested Queries:

SELECT test_data.*.nested_struct.nested_struct.nested_struct.nested_struct.nested_struct.*,
  test_data.test_nest_struct.*.*.nested_struct.nested_struct
EXCLUDE test_data.*.*.*.*, test_data.price.* FROM test_data AS test_data
WHERE ((test_data.test_nest_struct.*.*.*.nested_struct.*.test_int <> 19) OR
  (test_data.test_nest_struct.*.*.nested_struct.*.*.test_int > 35))

SELECT test_data.test_nest_struct.*.nested_struct.*.*.nested_struct.*,
  test_data.test_nest_struct.*.*.nested_struct.nested_struct.*.*,
  test_data.test_nest_struct.nested_struct.*.nested_struct.*,
  test_data.test_nest_struct.*.nested_struct.nested_struct.nested_struct.*
EXCLUDE test_data.test_nest_struct.*.*, test_data.test_nest_struct.*.*.*
FROM test_data AS test_data
WHERE ((test_data.*.*.nested_struct.*.*.*.test_int < 40) OR
  (test_data.*.*.nested_struct.nested_struct.*.nested_struct.test_int >= -9))

SELECT test_data.*.nested_struct.nested_struct.nested_struct.nested_struct.*,
  test_data.*.nested_struct.nested_struct.nested_struct.*.*.test_int
EXCLUDE test_data.*.nested_struct.*.*,
  test_data.test_nest_struct.nested_struct.*.*
FROM test_data AS test_data
WHERE ((((test_data.price.value <= 6.206304713037888) OR
      (test_data.*.nested_struct.nested_struct.*.nested_struct.*.test_int <> -29))
    AND
    (test_data.test_nest_struct.*.nested_struct.*.nested_struct.nested_struct.test_int < 6))
  AND ((test_data.price > -44.666855950508584) OR
    (test_data.*.*.*.nested_struct.*.*.test_int > -42)))

Controlling Path Depth

# Moderate depth for readability
beamline query basic \
    --seed 1234 \
    --start-auto \
    --script-path transactions.ion \
    --sample-count 3 \
    rand-sefw \
        --project-rand-min 2 \
        --project-rand-max 5 \
        --project-path-depth-min 1 \
        --project-path-depth-max 3 \
        --project-pathstep-internal-all \
        --project-pathstep-final-all \
        --project-type-final-all \
        --tbl-flt-rand-min 2 \
        --tbl-flt-rand-max 5 \
        --tbl-flt-path-depth-max 10 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-all \
        --exclude-rand-min 1 \
        --exclude-rand-max 2 \
        --exclude-path-depth-min 3 \
        --exclude-path-depth-max 4 \
        --exclude-pathstep-internal-all \
        --exclude-pathstep-final-unpivot \
        --exclude-type-final-all

More Manageable Output:

SELECT test_data.price, test_data.*.*.nested_struct EXCLUDE test_data.*.*.*.*,
  test_data.price.*
FROM test_data AS test_data
WHERE ((test_data.test_nest_struct.*.*.*.nested_struct.*.test_int <> 19) OR
  (test_data.test_nest_struct.*.*.nested_struct.*.*.test_int > 35))

SELECT test_data.price, test_data.*.*.nested_struct, test_data.test_struct,
  test_data.*.*.*
EXCLUDE test_data.test_nest_struct.*.*, test_data.test_nest_struct.*.*.*
FROM test_data AS test_data
WHERE ((test_data.*.*.nested_struct.*.*.*.test_int < 40) OR
  (test_data.*.*.nested_struct.nested_struct.*.nested_struct.test_int >= -9))

SELECT test_data.transaction_id, test_data.*.nested_struct
EXCLUDE test_data.*.nested_struct.*.*,
  test_data.test_nest_struct.nested_struct.*.*
FROM test_data AS test_data
WHERE ((((test_data.price.value <= 6.206304713037888) OR
      (test_data.*.nested_struct.nested_struct.*.nested_struct.*.test_int <> -29))
    AND
    (test_data.test_nest_struct.*.nested_struct.*.nested_struct.nested_struct.test_int < 6))
  AND ((test_data.price > -44.666855950508584) OR
    (test_data.*.*.*.nested_struct.*.*.test_int > -42)))

Path Expression Types

Path Step Types

Beamline can generate different path step types:

Projection Steps (.field)

SELECT test_data.transaction_id, test_data.customer.name
FROM test_data AS test_data

Index Steps ([N])

SELECT test_data.items[0], test_data.scores[5]
FROM test_data AS test_data

Wildcard Steps ([*])

SELECT test_data.items[*].price, test_data.users[*].name
FROM test_data AS test_data

Unpivot Steps (.*)

SELECT test_data.metadata.*, test_data.settings.*
FROM test_data AS test_data
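The step types can also be mixed within a single path expression. A hypothetical combined example (the field names here are illustrative, not taken from the schemas above):

SELECT test_data.orders[0].items[*].attributes.*
FROM test_data AS test_data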

Path Configuration Examples

Simple Projections Only

# Only use projection steps
beamline query basic \
    --seed 100 \
    --start-auto \
    --script-path nested_data.ion \
    --sample-count 5 \
    rand-sfw \
        --project-rand-min 3 \
        --project-rand-max 5 \
        --project-path-depth-max 3 \
        --project-pathstep-internal-project \
        --project-pathstep-final-project \
        --pred-all

Include Wildcards

# Add wildcard and unpivot paths
beamline query basic \
    --seed 200 \
    --start-auto \
    --script-path array_data.ion \
    --sample-count 5 \
    rand-sfw \
        --project-rand-min 2 \
        --project-rand-max 4 \
        --project-path-depth-max 2 \
        --project-pathstep-internal-all \
        --project-pathstep-final-foreach \
        --project-pathstep-final-unpivot \
        --pred-all

Complex Query Combinations

Full Feature Queries

Use all query features together for comprehensive PartiQL testing:

beamline query basic \
    --seed 2000 \
    --start-auto \
    --script-path comprehensive_schema.ion \
    --sample-count 5 \
    rand-sefw \
        --project-rand-min 3 \
        --project-rand-max 8 \
        --project-path-depth-min 1 \
        --project-path-depth-max 5 \
        --project-pathstep-internal-all \
        --project-pathstep-final-all \
        --project-type-final-all \
        --tbl-flt-rand-min 2 \
        --tbl-flt-rand-max 6 \
        --tbl-flt-path-depth-max 4 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-all \
        --tbl-flt-type-final-all \
        --exclude-rand-min 1 \
        --exclude-rand-max 4 \
        --exclude-path-depth-min 1 \
        --exclude-path-depth-max 3 \
        --exclude-pathstep-internal-all \
        --exclude-pathstep-final-all \
        --exclude-type-final-all \
        --pred-all

This generates very complex queries testing the full range of PartiQL features.

Type-Specific Query Generation

Scalar Type Focus

Generate queries that only work with scalar values:

beamline query basic \
    --seed 300 \
    --start-auto \
    --script-path mixed_types.ion \
    --sample-count 8 \
    rand-sfw \
        --project-rand-min 2 \
        --project-rand-max 4 \
        --project-type-final-scalar \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 3 \
        --tbl-flt-type-final-scalar \
        --pred-comparison

Results focus on scalar fields:

SELECT test_data.price, test_data.completed, test_data.marketplace_id
FROM test_data AS test_data
WHERE (test_data.transaction_id = 'some-uuid' AND test_data.price > 100.0)

Structure Type Queries

Generate queries that work with complex structures:

beamline query basic \
    --seed 400 \
    --start-auto \
    --script-path nested_objects.ion \
    --sample-count 5 \
    rand-sfw \
        --project-rand-min 2 \
        --project-rand-max 3 \
        --project-type-final-struct \
        --tbl-flt-type-final-struct \
        --pred-absent --pred-eq

Advanced Testing Patterns

Edge Case Query Generation

Test PartiQL edge cases and complex scenarios:

# Generate edge case queries
beamline query basic \
    --seed 500 \
    --start-auto \
    --script-path edge_case_schema.ion \
    --sample-count 10 \
    rand-sefw \
        --project-path-depth-min 4 \
        --project-path-depth-max 8 \
        --project-pathstep-internal-foreach \
        --project-pathstep-final-unpivot \
        --tbl-flt-path-depth-max 6 \
        --tbl-flt-pathstep-internal-unpivot \
        --tbl-flt-pathstep-final-foreach \
        --exclude-path-depth-min 2 \
        --exclude-path-depth-max 5 \
        --exclude-pathstep-final-unpivot \
        --pred-all

Performance Stress Testing

Generate computationally expensive queries:

# Create performance stress test queries
beamline query basic \
    --seed 600 \
    --start-auto \
    --script-path large_schema.ion \
    --sample-count 20 \
    rand-sefw \
        --project-rand-min 5 \
        --project-rand-max 15 \
        --project-path-depth-max 6 \
        --tbl-flt-rand-min 3 \
        --tbl-flt-rand-max 10 \
        --tbl-flt-path-depth-max 5 \
        --exclude-rand-min 2 \
        --exclude-rand-max 8 \
        --pred-all > stress_test_queries.sql

Integration Workflows

Multi-Strategy Testing

Generate different query types for comprehensive testing:

#!/bin/bash
# Generate complete query test suite

SCRIPT="test_schema.ion"
SEED=12345
SAMPLE_COUNT=10

echo "Generating comprehensive query test suite..."

# Basic SELECT * queries
beamline query basic \
    --seed $SEED \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count $SAMPLE_COUNT \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 3 \
        --pred-all > select_star.sql

# Projection queries
beamline query basic \
    --seed $((SEED + 1)) \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count $SAMPLE_COUNT \
    rand-sfw \
        --project-rand-min 2 \
        --project-rand-max 5 \
        --project-path-depth-max 2 \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 3 \
        --pred-all > projections.sql

# Exclusion queries  
beamline query basic \
    --seed $((SEED + 2)) \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count $SAMPLE_COUNT \
    rand-select-all-efw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 2 \
        --exclude-rand-min 1 \
        --exclude-rand-max 3 \
        --pred-all > exclusions.sql

# Complex combined queries
beamline query basic \
    --seed $((SEED + 3)) \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count $SAMPLE_COUNT \
    rand-sefw \
        --project-rand-min 2 \
        --project-rand-max 4 \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 3 \
        --exclude-rand-min 1 \
        --exclude-rand-max 2 \
        --pred-all > combined.sql

echo "Query test suite generated:"
echo "- select_star.sql: $(wc -l < select_star.sql) queries"
echo "- projections.sql: $(wc -l < projections.sql) queries"
echo "- exclusions.sql: $(wc -l < exclusions.sql) queries"  
echo "- combined.sql: $(wc -l < combined.sql) queries"
echo "Total: $(($(wc -l < select_star.sql) + $(wc -l < projections.sql) + $(wc -l < exclusions.sql) + $(wc -l < combined.sql))) queries"
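One caveat on the counts above: wc -l reports lines, not queries, and with --pred-all a single pretty-printed query spans several lines. Counting blank-line-separated records gives the actual query count. A sketch, with a demo file created inline so it runs standalone:

```shell
# Two queries, four lines: wc -l over-counts, awk paragraph mode does not.
printf 'SELECT * FROM t\nWHERE (t.a = 1)\n\nSELECT * FROM t WHERE (t.b = 2)\n' > combined.sql

awk -v RS='' 'END { print NR }' combined.sql
```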

Query Complexity Progression

#!/bin/bash
# Generate queries with increasing complexity

SCRIPT="complex_data.ion"
BASE_SEED=1000

for complexity_level in 1 2 3 5; do
    echo "Generating complexity level $complexity_level queries..."
    
    beamline query basic \
        --seed $((BASE_SEED + complexity_level)) \
        --start-auto \
        --script-path $SCRIPT \
        --sample-count 10 \
        rand-sefw \
            --project-rand-min $complexity_level \
            --project-rand-max $((complexity_level * 2)) \
            --project-path-depth-max $complexity_level \
            --tbl-flt-rand-min $complexity_level \
            --tbl-flt-rand-max $((complexity_level * 2)) \
            --tbl-flt-path-depth-max $complexity_level \
            --exclude-rand-min 1 \
            --exclude-rand-max $complexity_level \
            --pred-all > "complexity_${complexity_level}.sql"
            
    echo "  Generated: $(wc -l < complexity_${complexity_level}.sql) queries"
done

echo "Query complexity suite completed"

Best Practices

1. Match Complexity to Use Case

# Simple testing - basic patterns
--project-path-depth-max 2 --tbl-flt-path-depth-max 2

# Comprehensive testing - complex patterns
--project-path-depth-max 5 --tbl-flt-path-depth-max 5

# Edge case testing - maximum complexity
--project-path-depth-max 10 --tbl-flt-path-depth-max 10

2. Balance Query Features

# Don't overload with too many features at once
--project-rand-min 2 --project-rand-max 4    # Moderate projections
--exclude-rand-min 1 --exclude-rand-max 2    # Few exclusions
--tbl-flt-rand-min 1 --tbl-flt-rand-max 3   # Simple WHERE clauses

3. Test Incrementally

# Start simple
beamline query basic --script-path data.ion --sample-count 3 rand-select-all-fw --pred-eq

# Add projections
beamline query basic --script-path data.ion --sample-count 3 rand-sfw --project-rand-min 2 --project-rand-max 3 --pred-eq

# Add exclusions
beamline query basic --script-path data.ion --sample-count 3 rand-sefw --project-rand-min 2 --exclude-rand-min 1 --pred-eq

# Full complexity
beamline query basic --script-path data.ion --sample-count 5 rand-sefw --project-rand-min 3 --exclude-rand-min 2 --tbl-flt-rand-min 2 --pred-all

4. Validate Complex Queries

# Generate and validate complex queries
beamline query basic \
    --seed 700 \
    --start-auto \
    --script-path validation_schema.ion \
    --sample-count 15 \
    rand-sefw \
        --project-rand-min 2 \
        --project-rand-max 6 \
        --exclude-rand-min 1 \
        --exclude-rand-max 3 \
        --pred-all > complex_validation.sql

# Check each query
# your-partiql-parser --check-syntax complex_validation.sql

Next Steps

Now that you understand advanced query patterns, continue with the next chapter to learn how to fine-tune every query generation parameter.

Query Generation Parameterization

This section provides a complete reference for all query generation parameters. Beamline’s query generator is highly configurable, allowing you to control every aspect of query generation from simple predicates to complex nested path expressions.

Parameter Categories

Query generation parameters are organized into several categories:

  1. Table Filter Parameters - Control WHERE clause generation
  2. Projection Parameters - Control SELECT field selection
  3. Exclusion Parameters - Control EXCLUDE clause generation
  4. Path Parameters - Control how paths navigate data structures
  5. Predicate Parameters - Control predicate types and operators
  6. Type Parameters - Control what data types can be referenced
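
One flag from each category can be combined in a single invocation. The sketch below only assembles and prints the command line; data.ion is a placeholder script path and the specific flag values are illustrative:

```shell
#!/bin/sh
# Assemble one option from each parameter category into a single command
# line (a sketch; data.ion is a placeholder, flag values are illustrative).
FILTERS="--tbl-flt-rand-min 1 --tbl-flt-rand-max 3"    # 1. table filters
PROJECTS="--project-rand-min 2 --project-rand-max 4"   # 2. projections
EXCLUDES="--exclude-rand-min 1 --exclude-rand-max 2"   # 3. exclusions
PATHS="--project-path-depth-max 3"                     # 4. path depth
PREDS="--pred-all"                                     # 5. predicates
TYPES="--tbl-flt-type-final-scalar"                    # 6. type constraints

CMD="beamline query basic --script-path data.ion --sample-count 10"
CMD="$CMD rand-sefw $PROJECTS $EXCLUDES $FILTERS $PATHS $TYPES $PREDS"
echo "$CMD"
```

Keeping each category in its own variable makes it easy to swap one category's settings without retyping the whole command.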

Table Filter Parameters

Control WHERE clause generation across all query strategies:

Filter Count

Parameter            Description                    Valid Values
--tbl-flt-rand-min   Minimum number of predicates   1-255
--tbl-flt-rand-max   Maximum number of predicates   1-255

Example:

# Generate 1-3 predicates per WHERE clause
--tbl-flt-rand-min 1 --tbl-flt-rand-max 3

Filter Path Configuration

Parameter                  Description
--tbl-flt-path-depth-min   Minimum path depth (default: unbounded)
--tbl-flt-path-depth-max   Maximum path depth (default: unbounded)

Example:

# Allow paths up to 3 levels deep
--tbl-flt-path-depth-max 3
# Results: field, object.field, object.nested.field

Table Filter Path Steps

Control what types of path steps can appear in WHERE clauses:

Internal Path Steps

Parameter                             Description                           Example Path
--tbl-flt-pathstep-internal-all       Enable all internal path step types   All below
--tbl-flt-pathstep-internal-project   Enable projection internal steps      field.subfield
--tbl-flt-pathstep-internal-index     Enable index internal steps           array[1].field
--tbl-flt-pathstep-internal-foreach   Enable wildcard internal steps        array[*].field
--tbl-flt-pathstep-internal-unpivot   Enable unpivot internal steps         object.*.field

Final Path Steps

Parameter                          Description                        Example Path
--tbl-flt-pathstep-final-all       Enable all final path step types   All below
--tbl-flt-pathstep-final-project   Enable projection final steps      object.field
--tbl-flt-pathstep-final-index     Enable index final steps           array[1]
--tbl-flt-pathstep-final-foreach   Enable wildcard final steps        array[*]
--tbl-flt-pathstep-final-unpivot   Enable unpivot final steps         object.*

Table Filter Type Constraints

Control what types of values can appear in WHERE clauses:

Parameter                       Description                 Example
--tbl-flt-type-final-all        Allow all final types       Any type
--tbl-flt-type-final-scalar     Allow only scalar types     9, 'text', true
--tbl-flt-type-final-sequence   Allow only sequence types   [1,2,3], <1, 'a'>
--tbl-flt-type-final-struct     Allow only struct types     {'a': 1, 'b': 2}

Projection Parameters

Control SELECT clause generation (applies to rand-sfw and rand-sefw strategies):

Projection Count

Parameter            Description                     Valid Values
--project-rand-min   Minimum number of projections   1-255
--project-rand-max   Maximum number of projections   1-255

Example:

# Generate 2-5 fields in SELECT clause
--project-rand-min 2 --project-rand-max 5

Projection Path Configuration

Parameter                  Description
--project-path-depth-min   Minimum projection path depth
--project-path-depth-max   Maximum projection path depth

Projection Path Steps

Same options as table filter path steps, but for SELECT clause:

Internal Path Steps

Parameter                             Description
--project-pathstep-internal-all       Enable all internal path step types
--project-pathstep-internal-project   Enable projection internal steps
--project-pathstep-internal-index     Enable index internal steps
--project-pathstep-internal-foreach   Enable wildcard internal steps
--project-pathstep-internal-unpivot   Enable unpivot internal steps

Final Path Steps

Parameter                          Description
--project-pathstep-final-all       Enable all final path step types
--project-pathstep-final-project   Enable projection final steps
--project-pathstep-final-index     Enable index final steps
--project-pathstep-final-foreach   Enable wildcard final steps
--project-pathstep-final-unpivot   Enable unpivot final steps

Projection Type Constraints

Parameter                       Description
--project-type-final-all        Allow all final types
--project-type-final-scalar     Allow only scalar types
--project-type-final-sequence   Allow only sequence types
--project-type-final-struct     Allow only struct types

Exclusion Parameters

Control EXCLUDE clause generation (applies to rand-select-all-efw and rand-sefw strategies):

Exclusion Count

Parameter            Description                    Valid Values
--exclude-rand-min   Minimum number of exclusions   1-255
--exclude-rand-max   Maximum number of exclusions   1-255

Exclusion Path Configuration

Parameter                  Description
--exclude-path-depth-min   Minimum exclusion path depth
--exclude-path-depth-max   Maximum exclusion path depth

Exclusion Path Steps

Same structure as projection parameters:

Parameter                             Description
--exclude-pathstep-internal-all       Enable all internal path steps
--exclude-pathstep-internal-project   Enable projection internal steps
--exclude-pathstep-internal-index     Enable index internal steps
--exclude-pathstep-internal-foreach   Enable wildcard internal steps
--exclude-pathstep-internal-unpivot   Enable unpivot internal steps
--exclude-pathstep-final-all          Enable all final path steps
--exclude-pathstep-final-project      Enable projection final steps
--exclude-pathstep-final-index        Enable index final steps
--exclude-pathstep-final-foreach      Enable wildcard final steps
--exclude-pathstep-final-unpivot      Enable unpivot final steps

Exclusion Type Constraints

Parameter                       Description
--exclude-type-final-all        Allow all final types
--exclude-type-final-scalar     Allow only scalar types
--exclude-type-final-sequence   Allow only sequence types
--exclude-type-final-struct     Allow only struct types

Predicate Parameters

Control what types of predicates can be generated in WHERE clauses:

All Predicate Types

Parameter     Description                  SQL Operators
--pred-all    Enable all predicate types   All below
--pred-none   Disable all predicates       None

Null and Missing Predicates

Parameter               Description                      SQL Operators
--pred-absent           Enable null/missing predicates   IS NULL, IS NOT NULL, IS MISSING, IS NOT MISSING
--pred-nullable         Enable null predicates           IS NULL, IS NOT NULL
--pred-is-null          Enable IS NULL                   IS NULL
--pred-is-not-null      Enable IS NOT NULL               IS NOT NULL
--pred-optional         Enable missing predicates        IS MISSING, IS NOT MISSING
--pred-is-missing       Enable IS MISSING                IS MISSING
--pred-is-not-missing   Enable IS NOT MISSING            IS NOT MISSING

Equality Predicates

Parameter         Description                  SQL Operators
--pred-equality   Enable equality predicates   =, <>
--pred-eq         Enable equals                =
--pred-neq        Enable not equals            <>

Comparison Predicates

Parameter           Description                        SQL Operators
--pred-comparison   Enable all comparison predicates   <, <=, >, >=, BETWEEN
--pred-lt           Enable less than                   <
--pred-lte          Enable less than or equal          <=
--pred-gt           Enable greater than                >
--pred-gte          Enable greater than or equal       >=
--pred-between      Enable between                     BETWEEN

Numeric Predicates

Parameter        Description                     SQL Operators
--pred-numeric   Enable all numeric predicates   =, <>, <, <=, >, >=, BETWEEN

String Predicates

Parameter         Description                  SQL Operators
--pred-like-all   Enable all LIKE predicates   LIKE, NOT LIKE
--pred-like       Enable LIKE                  LIKE
--pred-not-like   Enable NOT LIKE              NOT LIKE

Set Membership Predicates

Parameter       Description                SQL Operators
--pred-in-all   Enable all IN predicates   IN, NOT IN
--pred-in       Enable IN                  IN
--pred-not-in   Enable NOT IN              NOT IN

Logical Predicates

Parameter            Description                    SQL Operators
--pred-logical-all   Enable all logical operators   AND, OR, NOT
--pred-logical-and   Enable AND                     AND
--pred-logical-or    Enable OR                      OR
--pred-logical-not   Enable NOT                     NOT

Strategy Compatibility Matrix

Different parameters apply to different query strategies:

Parameter Category   rand-select-all-fw   rand-sfw   rand-select-all-efw   rand-sefw
Table Filter         yes                  yes        yes                   yes
Projection           no                   yes        no                    yes
Exclusion            no                   no         yes                   yes
Predicates           yes                  yes        yes                   yes
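
The matrix follows from the strategy definitions: projection parameters apply only to rand-sfw and rand-sefw, exclusion parameters only to rand-select-all-efw and rand-sefw, while table filters and predicates apply to all four. It can be encoded as a small lookup (a sketch; the category names are informal labels, not CLI flags):

```shell
#!/bin/sh
# Map each strategy to the parameter categories it accepts, following the
# compatibility matrix (category names are informal labels, not flags).
categories_for() {
    case "$1" in
        rand-select-all-fw)  echo "table-filter predicates" ;;
        rand-sfw)            echo "table-filter projection predicates" ;;
        rand-select-all-efw) echo "table-filter exclusion predicates" ;;
        rand-sefw)           echo "table-filter projection exclusion predicates" ;;
        *)                   echo "unknown strategy: $1" >&2; return 1 ;;
    esac
}

categories_for rand-sefw
```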

Configuration Examples

Simple Configuration

For basic testing with readable queries:

beamline query basic \
    --seed 100 \
    --start-auto \
    --script-path data.ion \
    --sample-count 10 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 2 \
        --tbl-flt-path-depth-max 2 \
        --tbl-flt-pathstep-internal-project \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-comparison

Moderate Configuration

For comprehensive testing with controlled complexity:

beamline query basic \
    --seed 200 \
    --start-auto \
    --script-path data.ion \
    --sample-count 15 \
    rand-sfw \
        --project-rand-min 2 \
        --project-rand-max 4 \
        --project-path-depth-max 3 \
        --project-pathstep-internal-all \
        --project-pathstep-final-all \
        --project-type-final-all \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 3 \
        --tbl-flt-path-depth-max 2 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-all

Complex Configuration

For edge case testing with maximum complexity:

beamline query basic \
    --seed 300 \
    --start-auto \
    --script-path nested_data.ion \
    --sample-count 20 \
    rand-sefw \
        --project-rand-min 3 \
        --project-rand-max 8 \
        --project-path-depth-min 2 \
        --project-path-depth-max 6 \
        --project-pathstep-internal-all \
        --project-pathstep-final-all \
        --project-type-final-all \
        --tbl-flt-rand-min 2 \
        --tbl-flt-rand-max 5 \
        --tbl-flt-path-depth-max 6 \
        --tbl-flt-pathstep-internal-all \
        --tbl-flt-pathstep-final-all \
        --tbl-flt-type-final-all \
        --exclude-rand-min 1 \
        --exclude-rand-max 4 \
        --exclude-path-depth-min 2 \
        --exclude-path-depth-max 4 \
        --exclude-pathstep-internal-all \
        --exclude-pathstep-final-all \
        --exclude-type-final-all \
        --pred-all

Parameter Combination Patterns

Testing Specific PartiQL Features

Test Wildcard Paths

# Focus on [*] and .* path expressions
beamline query basic \
    --seed 400 \
    --start-auto \
    --script-path array_data.ion \
    --sample-count 10 \
    rand-sfw \
        --project-rand-min 2 \
        --project-rand-max 4 \
        --project-pathstep-internal-foreach \
        --project-pathstep-internal-unpivot \
        --project-pathstep-final-foreach \
        --project-pathstep-final-unpivot \
        --pred-all

Test Deep Nesting

# Generate very deep path expressions
beamline query basic \
    --seed 500 \
    --start-auto \
    --script-path deeply_nested.ion \
    --sample-count 8 \
    rand-sfw \
        --project-path-depth-min 4 \
        --project-path-depth-max 8 \
        --tbl-flt-path-depth-min 3 \
        --tbl-flt-path-depth-max 6 \
        --exclude-path-depth-min 2 \
        --exclude-path-depth-max 5 \
        --pred-all

Test Null Handling

# Focus on null and missing value predicates
beamline query basic \
    --seed 600 \
    --start-auto \
    --script-path nullable_schema.ion \
    --sample-count 12 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 3 \
        --pred-absent --pred-logical-and --pred-logical-or

Performance Testing Configurations

Lightweight Queries

# Generate simple, fast-executing queries
beamline query basic \
    --seed 700 \
    --start-auto \
    --script-path performance_data.ion \
    --sample-count 25 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 \
        --tbl-flt-rand-max 1 \
        --tbl-flt-path-depth-max 1 \
        --tbl-flt-type-final-scalar \
        --pred-eq --pred-comparison

Heavy Queries

# Generate complex, resource-intensive queries
beamline query basic \
    --seed 800 \
    --start-auto \
    --script-path large_data.ion \
    --sample-count 15 \
    rand-sefw \
        --project-rand-min 8 \
        --project-rand-max 15 \
        --project-path-depth-max 6 \
        --tbl-flt-rand-min 5 \
        --tbl-flt-rand-max 10 \
        --tbl-flt-path-depth-max 5 \
        --exclude-rand-min 3 \
        --exclude-rand-max 8 \
        --pred-all

Common Parameter Patterns

Development and Debugging

# Simple, readable queries for development
--project-rand-min 1 --project-rand-max 3
--project-path-depth-max 2
--project-pathstep-final-project
--tbl-flt-rand-min 1 --tbl-flt-rand-max 2
--tbl-flt-path-depth-max 2
--pred-eq --pred-comparison

Integration Testing

# Moderate complexity for integration tests
--project-rand-min 2 --project-rand-max 5
--project-path-depth-max 3
--project-pathstep-internal-all --project-pathstep-final-all
--tbl-flt-rand-min 1 --tbl-flt-rand-max 4
--tbl-flt-path-depth-max 3
--exclude-rand-min 1 --exclude-rand-max 2
--pred-all

Stress Testing

# Maximum complexity for stress testing
--project-rand-min 5 --project-rand-max 12
--project-path-depth-min 3 --project-path-depth-max 8
--project-pathstep-internal-all --project-pathstep-final-all
--project-type-final-all
--tbl-flt-rand-min 3 --tbl-flt-rand-max 8
--tbl-flt-path-depth-max 6
--exclude-rand-min 2 --exclude-rand-max 6
--exclude-path-depth-min 2 --exclude-path-depth-max 5
--pred-all

Parameter Validation

Check Parameter Combinations

Some parameter combinations don’t make sense:

# Invalid - min > max
--tbl-flt-rand-min 5 --tbl-flt-rand-max 3  # Error

# Invalid - conflicting path steps  
--project-pathstep-final-project --project-pathstep-final-foreach  # May conflict

# Valid - consistent configuration
--tbl-flt-rand-min 1 --tbl-flt-rand-max 5
--project-rand-min 2 --project-rand-max 4
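
A pre-flight check in your generation scripts can catch the min > max mistake before a long run. This is a minimal sketch; check_range is a hypothetical helper, not part of Beamline:

```shell
#!/bin/sh
# Validate min/max pairs before launching a large generation run.
# check_range is a hypothetical helper (assumption, not a Beamline feature).
check_range() {
    # $1 = flag family name, $2 = min, $3 = max
    if [ "$2" -gt "$3" ]; then
        echo "error: $1 min ($2) > max ($3)" >&2
        return 1
    fi
}

check_range "--tbl-flt-rand" 1 5 && echo "tbl-flt range ok"
check_range "--project-rand" 5 3 || echo "project range rejected"
```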

Default Behaviors

When parameters are not specified:

  • Path depth: Unbounded (can generate very deep paths)
  • Path steps: All types enabled by default
  • Type constraints: All types allowed
  • Predicate count: Implementation-dependent defaults

Recommendation: Always specify explicit bounds for predictable results.

Advanced Configuration Techniques

Targeted Testing

Test Specific Path Types

# Test only wildcard expressions
beamline query basic \
    --script-path array_data.ion \
    --sample-count 10 \
    rand-sfw \
        --project-pathstep-final-foreach \
        --project-pathstep-final-unpivot \
        --tbl-flt-pathstep-final-foreach \
        --pred-all

Test Specific Predicates

# Test only string operations
beamline query basic \
    --script-path text_data.ion \
    --sample-count 10 \
    rand-select-all-fw \
        --pred-like --pred-in --pred-eq

Graduated Complexity Testing

#!/bin/bash
# Generate test suites with graduated complexity

SCRIPT="test_schema.ion"
BASE_SEED=1000

# Level 1: Simple queries
beamline query basic \
    --seed $BASE_SEED \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count 10 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 --tbl-flt-rand-max 1 \
        --pred-eq > level1.sql

# Level 2: Add comparisons  
beamline query basic \
    --seed $((BASE_SEED + 1)) \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count 10 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 --tbl-flt-rand-max 2 \
        --pred-comparison > level2.sql

# Level 3: Add projections
beamline query basic \
    --seed $((BASE_SEED + 2)) \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count 10 \
    rand-sfw \
        --project-rand-min 2 --project-rand-max 4 \
        --tbl-flt-rand-min 1 --tbl-flt-rand-max 2 \
        --pred-all > level3.sql

# Level 4: Add exclusions
beamline query basic \
    --seed $((BASE_SEED + 3)) \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count 10 \
    rand-sefw \
        --project-rand-min 2 --project-rand-max 3 \
        --tbl-flt-rand-min 1 --tbl-flt-rand-max 3 \
        --exclude-rand-min 1 --exclude-rand-max 2 \
        --pred-all > level4.sql

echo "Generated graduated complexity test suite"

Performance Impact of Parameters

High Performance Impact

These parameters significantly affect query generation performance:

  • High path depth (--path-depth-max 10+): Exponential complexity growth
  • Many projections (--project-rand-max 20+): Large SELECT clauses
  • Many predicates (--tbl-flt-rand-max 10+): Complex WHERE clauses
  • All path steps enabled: More path generation options to evaluate

Low Performance Impact

These parameters have minimal impact:

  • Predicate type selection: Doesn’t affect generation complexity
  • Type constraints: Reduces rather than increases complexity
  • Path step restrictions: Reduces generation options

Performance Optimization

# Optimized for speed
beamline query basic \
    --script-path data.ion \
    --sample-count 100 \
    rand-select-all-fw \
        --tbl-flt-rand-min 1 --tbl-flt-rand-max 3 \
        --tbl-flt-path-depth-max 2 \
        --tbl-flt-pathstep-final-project \
        --tbl-flt-type-final-scalar \
        --pred-comparison

# Comprehensive but slower
beamline query basic \
    --script-path data.ion \
    --sample-count 25 \
    rand-sefw \
        --project-rand-min 5 --project-rand-max 10 \
        --project-path-depth-max 5 \
        --tbl-flt-rand-min 3 --tbl-flt-rand-max 6 \
        --exclude-rand-min 2 --exclude-rand-max 4 \
        --pred-all

Troubleshooting Parameter Issues

Common Parameter Errors

Invalid Range Parameters

# Error: min > max
--tbl-flt-rand-min 5 --tbl-flt-rand-max 3

# Fix: min <= max
--tbl-flt-rand-min 3 --tbl-flt-rand-max 5

Conflicting Type Constraints

# May produce unexpected results
--project-type-final-scalar --project-pathstep-final-unpivot  # Scalar constraint conflicts with unpivot

# Better: consistent constraints
--project-type-final-all --project-pathstep-final-unpivot

Parameter Testing

# Test parameter combinations before large generation
beamline query basic \
    --seed 1 \
    --start-auto \
    --script-path test.ion \
    --sample-count 3 \
    rand-sefw \
        --project-rand-min 2 --project-rand-max 2 \
        --exclude-rand-min 1 --exclude-rand-max 1 \
        --pred-eq

# If results look good, scale up
# ... run with --sample-count 50

Best Practices

1. Start Conservative

# Begin with simple parameter values
--tbl-flt-rand-min 1 --tbl-flt-rand-max 2
--project-rand-min 1 --project-rand-max 3
--exclude-rand-min 1 --exclude-rand-max 1

2. Match Parameters to Data Structure

# Simple flat data
--project-path-depth-max 1 --tbl-flt-path-depth-max 1

# Nested object data  
--project-path-depth-max 3 --tbl-flt-path-depth-max 3

# Deeply nested data
--project-path-depth-max 6 --tbl-flt-path-depth-max 6

3. Use Consistent Parameter Ranges

# Good - balanced complexity
--project-rand-min 2 --project-rand-max 4
--tbl-flt-rand-min 1 --tbl-flt-rand-max 3
--exclude-rand-min 1 --exclude-rand-max 2

# Avoid - unbalanced (too many exclusions vs projections)
--project-rand-min 1 --project-rand-max 2
--exclude-rand-min 5 --exclude-rand-max 10

4. Document Your Configurations

# Create reusable parameter sets
SIMPLE_CONFIG="--tbl-flt-rand-min 1 --tbl-flt-rand-max 2 --pred-comparison"
MODERATE_CONFIG="--project-rand-min 2 --project-rand-max 4 --tbl-flt-rand-min 1 --tbl-flt-rand-max 3 --pred-all"

beamline query basic --script-path data.ion --sample-count 10 rand-select-all-fw $SIMPLE_CONFIG
beamline query basic --script-path data.ion --sample-count 10 rand-sfw $MODERATE_CONFIG

Reference Quick Guide

Most Common Configurations

Basic Testing:

rand-select-all-fw --tbl-flt-rand-min 1 --tbl-flt-rand-max 2 --pred-comparison

Projection Testing:

rand-sfw --project-rand-min 2 --project-rand-max 4 --tbl-flt-rand-min 1 --tbl-flt-rand-max 2 --pred-all

Exclusion Testing:

rand-select-all-efw --tbl-flt-rand-min 1 --tbl-flt-rand-max 2 --exclude-rand-min 1 --exclude-rand-max 3 --pred-all

Comprehensive Testing:

rand-sefw --project-rand-min 2 --project-rand-max 4 --exclude-rand-min 1 --exclude-rand-max 2 --tbl-flt-rand-min 1 --tbl-flt-rand-max 3 --pred-all

Next Steps

Now that you understand all parameterization options, continue with the next chapter on shapes and shape inference.

Understanding Shapes

In Beamline, shapes (also called schemas) describe the structure and types of your generated data. Shape inference analyzes Ion scripts to determine what types of data will be generated, without actually generating the full dataset. This is essential for database schema creation, query validation, and understanding your data structure.

What are Shapes?

Shapes are PartiQL’s way of describing data structure and type information:

  • Type information: What types each field can contain (INT, VARCHAR, BOOL, etc.)
  • Structure information: How data is organized (bags, structs, arrays)
  • Constraints: Whether fields are nullable, optional, or have other constraints
  • Nested relationships: How complex data structures are organized

Shape Inference Process

How Shape Inference Works

  1. Script Analysis: Parse the Ion script to understand generators
  2. Type Resolution: Determine PartiQL types for each generator
  3. Structure Mapping: Build hierarchical type structure
  4. Constraint Analysis: Determine nullability and optionality
  5. Format Output: Generate shapes in requested format

Running Shape Inference

From the README examples, shape inference is done using:

beamline infer-shape \
    --seed-auto \
    --start-auto \
    --script-path sensors.ion

The seed and start time are needed even though no data is generated, as they may affect type inference for certain generators.

Shape Output Formats

Text Format (Default)

Provides detailed type information in Rust debug format:

beamline infer-shape \
    --seed-auto \
    --start-auto \
    --script-path sensors.ion

Example Output:

Seed: 17685918364143248531
Start: 2022-12-12T19:52:29.000000000Z
{
    "sensors": PartiqlType(
        Bag(
            BagType {
                element_type: PartiqlType(
                    Struct(
                        StructType {
                            constraints: {
                                Fields(
                                    {
                                        StructField {
                                            name: "d",
                                            ty: PartiqlType(
                                                DecimalP(
                                                    2,
                                                    0,
                                                ),
                                            ),
                                        },
                                        StructField {
                                            name: "f",
                                            ty: PartiqlType(
                                                Float64,
                                            ),
                                        },
                                        StructField {
                                            name: "i8",
                                            ty: PartiqlType(
                                                Int64,
                                            ),
                                        },
                                        StructField {
                                            name: "tick",
                                            ty: PartiqlType(
                                                Int64,
                                            ),
                                        },
                                        StructField {
                                            name: "w",
                                            ty: PartiqlType(
                                                DecimalP(
                                                    5,
                                                    4,
                                                ),
                                            ),
                                        },
                                    },
                                ),
                            },
                        },
                    ),
                ),
            },
        ),
    ),
}

Use Cases:

  • Development and debugging
  • Understanding complex nested structures
  • Detailed type analysis

Basic DDL Format

Generates SQL DDL statements ready for database creation:

beamline infer-shape \
    --seed 7844265201457918498 \
    --start-auto \
    --script-path sensors-nested.ion \
    --output-format basic-ddl

Example Output:

-- Seed: 7844265201457918498
-- Start: 2024-01-01T06:53:06.000000000Z
-- Syntax: partiql_datatype_syntax.0.1
-- Dataset: sensors
"f" DOUBLE,
"i8" INT8,
"id" INT,
"sub" STRUCT<"f": DOUBLE,"o": INT8>,
"tick" INT8

Use Cases:

  • Creating database tables
  • Schema documentation
  • Database migration scripts

Beamline JSON Format

Structured JSON format used by PartiQL testing tools:

beamline infer-shape \
    --seed-auto \
    --start-auto \
    --script-path sensors.ion \
    --output-format beamline-json

Example Output:

{
  seed: -3711181901898679775,
  start: "2022-05-22T13:49:57.000000000+00:00",
  shapes: {
    sensors: partiql::shape::v0::{
      type: "bag",
      items: {
        type: "struct",
        constraints: [
          ordered,
          closed
        ],
        fields: [
          {
            name: "d",
            type: "decimal(2, 0)"
          },
          {
            name: "f",
            type: "double"
          },
          {
            name: "i8",
            type: "int8"
          },
          {
            name: "tick",
            type: "int8"
          },
          {
            name: "w",
            type: "decimal(5, 4)"
          }
        ]
      }
    }
  }
}

Use Cases:

  • PartiQL conformance testing
  • Tool integration
  • Automated testing pipelines

PartiQL Type System

Basic Types

From the examples and implementation:

PartiQL Type   Description               Ion Script Generator
INT8           8-bit signed integer      UniformI8
INT64          64-bit signed integer     UniformI64, Tick
DOUBLE         64-bit floating point     UniformF64, NormalF64
DECIMAL(p,s)   Fixed-precision decimal   UniformDecimal
VARCHAR        Variable-length string    UUID, LoremIpsumTitle, Regex
BOOL           Boolean value             Bool
TIMESTAMP      Date and time             Instant, Date

Complex Types

PartiQL Type   Description                          Ion Script Generator
STRUCT<...>    Object with named fields             Nested $data objects
ARRAY<T>       Array of type T                      UniformArray
UNION<T1,T2>   Value can be one of multiple types   UniformAnyOf

Real Shape Examples

Simple Sensor Shape

From the sensors.ion script:

rand_processes::{
    $n: UniformU8::{ low: 2, high: 10 },
    sensors: $n::[
        rand_process::{
            $data: {
                tick: Tick,
                i8: UniformI8,
                f: UniformF64,
                d: UniformDecimal::{ low: 0d0, high: 4.2d1, nullable: false }
            }
        }
    ]
}

Inferred Shape (DDL):

-- Dataset: sensors
"f" DOUBLE,
"i8" INT8,
"tick" INT8,
"d" DECIMAL(2, 0) NOT NULL

Complex Nested Shape

From the sensors-nested.ion script:

rand_processes::{
    sensors: rand_process::{
        $data: {
            tick: Tick,
            i8: UniformI8,
            f: UniformF64,
            sub: {
                o: UniformI8,
                f: UniformF64
            }
        }
    }
}

Inferred Shape (DDL):

-- Dataset: sensors  
"f" DOUBLE,
"i8" INT8,
"sub" STRUCT<"f": DOUBLE,"o": INT8>,
"tick" INT8

Multi-Dataset Shape

From the client-service.ion script with multiple datasets:

beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path client-service.ion \
    --output-format basic-ddl

Generated Output:

-- Dataset: service
"Account" VARCHAR,
"Operation" VARCHAR,
"Program" VARCHAR,
"Request" VARCHAR,  
"StartTime" TIMESTAMP,
"client" VARCHAR,
"success" BOOL

-- Dataset: client_0
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL

-- Dataset: client_1  
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL

Notice how each dataset gets its own schema section.
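
When the combined DDL is saved to a file, the per-dataset sections can be split apart on the "-- Dataset:" marker lines. The sketch below uses a small sample file in place of real beamline output:

```shell
#!/bin/sh
# Split a multi-dataset DDL file into one schema file per dataset.
# Sample input stands in for real "beamline infer-shape" output; assumes
# each section begins with a "-- Dataset: <name>" comment as shown above.
cat > all_schemas.sql << 'EOF'
-- Dataset: service
"client" VARCHAR,
"success" BOOL
-- Dataset: client_0
"id" VARCHAR
EOF

# Start a new output file at each "-- Dataset:" marker; $3 is the name.
awk '/^-- Dataset: /{ close(out); out = $3 ".sql" } out { print > out }' all_schemas.sql

ls service.sql client_0.sql
```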

Nullability in Shapes

Nullable vs Non-Nullable Fields

Shape inference detects nullability configuration from scripts:

rand_processes::{
    test_data: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            required_field: UUID::{ nullable: false },
            nullable_field: UniformI32::{ nullable: 0.2, low: 1, high: 100 },
            optional_field: UniformDecimal::{ optional: 0.1, low: 0.0, high: 100.0 }
        }
    }
}

Inferred Shape:

-- Dataset: test_data
"required_field" VARCHAR NOT NULL,        -- nullable: false
"nullable_field" INT,                     -- nullable: 0.2 (can be NULL)
"optional_field" OPTIONAL DECIMAL(3, 1)   -- optional: 0.1 (can be MISSING)

CLI Nullability Defaults

Global CLI defaults affect inferred shapes:

# With default nullability
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path simple_data.ion \
    --default-nullable true \
    --default-optional true \
    --output-format basic-ddl

Result:

-- All fields become nullable and optional by default
"field1" OPTIONAL INT,
"field2" OPTIONAL VARCHAR,
"field3" OPTIONAL BOOL

Shape Inference Workflow

Development Workflow

#!/bin/bash
# Shape-driven development workflow

SCRIPT="new_data_model.ion"

echo "1. Creating initial Ion script..."
cat > $SCRIPT << 'EOF'
rand_processes::{
    user_data: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            user_id: UUID,
            age: UniformU8::{ low: 18, high: 80 },
            email: Format::{ pattern: "user{UUID}@example.com" },
            active: Bool::{ p: 0.8 }
        }
    }
}
EOF

echo "2. Inferring shape..."
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path $SCRIPT \
    --output-format basic-ddl > schema.sql

echo "3. Generated schema:"
cat schema.sql

echo "4. Testing with small sample..."
beamline gen data \
    --seed 1 \
    --start-auto \
    --script-path $SCRIPT \
    --sample-count 5 \
    --output-format text

echo "Shape-driven development complete!"

Schema Validation

# Validate schema matches expectations
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path production_schema.ion \
    --output-format basic-ddl > expected_schema.sql

# Compare with previous version
diff previous_schema.sql expected_schema.sql

# Generate sample data to verify
beamline gen data \
    --seed 1 \
    --start-auto \
    --script-path production_schema.ion \
    --sample-count 10

Complex Shape Examples

Arrays and Union Types

rand_processes::{
    complex_data: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            measurements: UniformArray::{
                min_size: 2,
                max_size: 5,
                element_type: UniformF64::{ low: 0.0, high: 100.0 }
            },
            mixed_value: UniformAnyOf::{
                types: [
                    UUID,
                    UniformI32::{ low: 1, high: 1000 },
                    Bool
                ]
            }
        }
    }
}

Inferred Shape:

-- Dataset: complex_data
"measurements" ARRAY<DOUBLE>,
"mixed_value" UNION<VARCHAR,INT,BOOL>

Deeply Nested Structures

rand_processes::{
    nested_data: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            user: {
                profile: {
                    personal: {
                        name: LoremIpsumTitle,
                        age: UniformU8::{ low: 18, high: 80 }
                    },
                    preferences: {
                        theme: Uniform::{ choices: ["light", "dark"] },
                        notifications: Bool
                    }
                },
                stats: {
                    login_count: UniformU32,
                    last_seen: Instant
                }
            }
        }
    }
}

Inferred Shape:

-- Dataset: nested_data  
"user" STRUCT<
  "profile": STRUCT<
    "personal": STRUCT<"age": TINYINT,"name": VARCHAR>,
    "preferences": STRUCT<"notifications": BOOL,"theme": VARCHAR>
  >,
  "stats": STRUCT<"last_seen": TIMESTAMP,"login_count": INT>
>

Shape Analysis and Validation

Schema Consistency Checking

# Infer shapes from multiple related scripts
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path user_v1.ion \
    --output-format basic-ddl > user_v1_schema.sql

beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path user_v2.ion \
    --output-format basic-ddl > user_v2_schema.sql

# Compare schemas for compatibility
echo "Schema changes between versions:"
diff user_v1_schema.sql user_v2_schema.sql

Multi-Dataset Schema Analysis

# Analyze all datasets in a complex script
beamline infer-shape \
    --seed 42 \
    --start-auto \
    --script-path client-service.ion \
    --output-format basic-ddl > all_schemas.sql

# Extract individual dataset schemas
grep -A 20 '^-- Dataset: service' all_schemas.sql > service_schema.sql
grep -A 20 '^-- Dataset: client_0' all_schemas.sql > client_schema.sql
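Fixed-size extraction with grep -A 20 truncates datasets that have more than 20 fields and bleeds into the next section when a dataset is shorter. An awk sketch that splits on the -- Dataset: markers avoids both problems (the sample file below is illustrative, not real Beamline output):

```shell
# Split a combined DDL file on its "-- Dataset:" markers
# (sample contents are illustrative).
cat > all_schemas.sql << 'EOF'
-- Dataset: service
"Account" VARCHAR,
"success" BOOL

-- Dataset: client_0
"id" VARCHAR,
"success" BOOL
EOF

extract_dataset() {
    # $1 = dataset name, $2 = combined schema file
    awk -v ds="$1" '
        $0 == "-- Dataset: " ds { found = 1; print; next }
        /^-- Dataset:/          { found = 0 }
        found                   { print }
    ' "$2"
}

extract_dataset service  all_schemas.sql > service_schema.sql
extract_dataset client_0 all_schemas.sql > client_schema.sql
```

Matching the marker line exactly (rather than as a regex) also sidesteps grep treating a pattern that starts with -- as a command-line option.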

Shape-Based Development

Database Schema Generation

#!/bin/bash
# Generate database schemas from Ion scripts

SCRIPT="$1"
OUTPUT_DIR="./schemas"

if [ -z "$SCRIPT" ]; then
    echo "Usage: $0 <script.ion>"
    exit 1
fi

mkdir -p "$OUTPUT_DIR"
BASENAME=$(basename "$SCRIPT" .ion)

echo "Generating schemas for $SCRIPT..."

# Generate SQL DDL schema
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format basic-ddl > "$OUTPUT_DIR/${BASENAME}_schema.sql"

# Generate Beamline JSON for testing tools
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format beamline-json > "$OUTPUT_DIR/${BASENAME}_schema.json"

echo "Schemas generated:"
echo "  SQL DDL: $OUTPUT_DIR/${BASENAME}_schema.sql"
echo "  JSON: $OUTPUT_DIR/${BASENAME}_schema.json"

# Show summary
echo ""
echo "Schema summary:"
# Count fields per dataset section
awk '
    /^-- Dataset:/ { if (ds != "") print "  " ds ": " n " fields"; ds = $3; n = 0 }
    /^"/           { n++ }
    END            { if (ds != "") print "  " ds ": " n " fields" }
' "$OUTPUT_DIR/${BASENAME}_schema.sql"

Schema Documentation

# Generate schema documentation for all scripts
for script in scripts/*.ion; do
    echo "## $(basename "$script" .ion)" >> SCHEMAS.md
    echo "" >> SCHEMAS.md
    echo "Generated from: \`$script\`" >> SCHEMAS.md
    echo "" >> SCHEMAS.md
    echo '```sql' >> SCHEMAS.md
    
    beamline infer-shape \
        --seed 1 \
        --start-auto \
        --script-path "$script" \
        --output-format basic-ddl >> SCHEMAS.md
        
    echo '```' >> SCHEMAS.md
    echo "" >> SCHEMAS.md
done

Understanding Type Mappings

Ion Generator to PartiQL Type Mapping

Based on the actual implementation and README:

Ion Generator     PartiQL Type    DDL Representation
Bool              BOOL            BOOL
UniformI8         INT64           TINYINT or INT8
UniformI16        INT64           SMALLINT or INT16
UniformI32        INT64           INT
UniformI64        INT64           BIGINT
UniformU8         INT64           TINYINT
UniformU16        INT64           SMALLINT
UniformU32        INT64           INT
UniformU64        INT64           BIGINT
UniformF64        DOUBLE          DOUBLE
UniformDecimal    DECIMAL(p,s)    DECIMAL(p,s)
UUID              STRING          VARCHAR
LoremIpsumTitle   STRING          VARCHAR
Regex             STRING          VARCHAR
Format            STRING          VARCHAR
Instant           DATETIME        TIMESTAMP
Date              DATETIME        DATE or TIMESTAMP
Tick              INT64           INT8 or INT64
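For quick use in scripts, the mapping can be encoded as a lookup. This is an illustrative helper, not part of the Beamline CLI, and it picks one DDL spelling where the table lists alternatives:

```shell
# Illustrative helper: map an Ion generator name to a DDL type per the
# table above (chooses TINYINT/SMALLINT where alternatives exist).
ddl_type_for() {
    case "$1" in
        Bool)                              echo "BOOL" ;;
        UniformI8|UniformU8)               echo "TINYINT" ;;
        UniformI16|UniformU16)             echo "SMALLINT" ;;
        UniformI32|UniformU32)             echo "INT" ;;
        UniformI64|UniformU64)             echo "BIGINT" ;;
        UniformF64)                        echo "DOUBLE" ;;
        UUID|LoremIpsumTitle|Regex|Format) echo "VARCHAR" ;;
        Instant)                           echo "TIMESTAMP" ;;
        *)                                 echo "UNKNOWN" ;;
    esac
}

ddl_type_for UniformU32   # INT
ddl_type_for Format       # VARCHAR
```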

Precision and Scale Inference

For decimal types, Beamline infers precision and scale:

rand_processes::{
    decimal_test: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            price: UniformDecimal::{ low: 9.99, high: 999.99 },    // DECIMAL(5,2)
            weight: UniformDecimal::{ low: 0.5, high: 10.9999 },  // DECIMAL(6,4)
            percentage: UniformDecimal::{ low: 0d0, high: 1d2 }   // DECIMAL(3,0)
        }
    }
}

Inferred Shape:

-- Dataset: decimal_test
"price" DECIMAL(5, 2),
"weight" DECIMAL(6, 4),  
"percentage" DECIMAL(3, 0)
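The rule can be sketched for a single plain decimal literal: precision is the total digit count, scale is the digit count after the point. This is an illustrative approximation only; Beamline combines both bounds and also handles exponent forms like 1d2, which this sketch does not:

```shell
# Sketch: derive DECIMAL(p, s) from one plain decimal literal.
# p = digits before + after the point, s = digits after the point.
decimal_shape() {
    local n="${1#-}"                       # drop any sign
    local int="${n%%.*}" frac=""
    case "$n" in *.*) frac="${n#*.}" ;; esac
    echo "DECIMAL($(( ${#int} + ${#frac} )), ${#frac})"
}

decimal_shape 999.99    # DECIMAL(5, 2)
decimal_shape 10.9999   # DECIMAL(6, 4)
```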

Schema Evolution and Migration

Schema Version Comparison

#!/bin/bash
# Compare schema versions for migration planning

OLD_SCRIPT="data_model_v1.ion"
NEW_SCRIPT="data_model_v2.ion"

# Generate schemas for both versions
beamline infer-shape --seed 1 --start-auto --script-path $OLD_SCRIPT --output-format basic-ddl > v1_schema.sql
beamline infer-shape --seed 1 --start-auto --script-path $NEW_SCRIPT --output-format basic-ddl > v2_schema.sql

echo "Schema Migration Analysis"
echo "========================="

# Show differences
echo "Changes between v1 and v2:"
diff -u v1_schema.sql v2_schema.sql

echo ""
echo "Migration considerations:"

# Check for removed fields (breaking changes)
# comm needs sorted input
if grep -v "^--" v1_schema.sql | grep -v "^$" | sort > v1_fields.txt &&
   grep -v "^--" v2_schema.sql | grep -v "^$" | sort > v2_fields.txt; then

    removed_fields=$(comm -23 v1_fields.txt v2_fields.txt)
    if [ -n "$removed_fields" ]; then
        echo "⚠️  Breaking changes - removed fields:"
        echo "$removed_fields"
    fi
    
    added_fields=$(comm -13 v1_fields.txt v2_fields.txt)
    if [ -n "$added_fields" ]; then
        echo "✅ Added fields (non-breaking):"
        echo "$added_fields"
    fi
fi

rm -f v1_fields.txt v2_fields.txt
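The comm logic above hinges on sorted input: comm -23 prints lines unique to the first file (removed fields) and comm -13 prints lines unique to the second (added fields). A standalone sketch with illustrative field lists:

```shell
# comm compares two sorted files line by line
# (sample field lists are illustrative, not Beamline output).
printf '%s\n' '"age" TINYINT,' '"email" VARCHAR,' '"name" VARCHAR' > old_fields.txt
printf '%s\n' '"email" VARCHAR,' '"name" VARCHAR' '"phone" VARCHAR,' > new_fields.txt

# -23 suppresses columns 2 and 3: lines only in old_fields.txt (removed)
comm -23 old_fields.txt new_fields.txt

# -13 suppresses columns 1 and 3: lines only in new_fields.txt (added)
comm -13 old_fields.txt new_fields.txt
```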

Database Migration Script Generation

#!/bin/bash
# Generate database migration scripts

OLD_SCHEMA="$1"
NEW_SCHEMA="$2"

echo "-- Database Migration Script"
echo "-- Generated: $(date)"
echo "-- From: $OLD_SCHEMA"
echo "-- To: $NEW_SCHEMA"
echo ""

# This is a simplified example - real migration would be more complex
echo "-- Review changes manually:"
echo "-- $(diff --brief "$OLD_SCHEMA" "$NEW_SCHEMA")"

echo ""
echo "-- Add new columns (example):"
comm -13 <(grep '^"' "$OLD_SCHEMA" | sort) <(grep '^"' "$NEW_SCHEMA" | sort) | sed 's/,$//' | while read -r field; do
    echo "ALTER TABLE dataset_name ADD COLUMN $field;"
done

Integration Patterns

CI/CD Schema Validation

#!/bin/bash
# CI/CD pipeline schema validation

set -e

echo "Validating Ion script schemas..."

for script in scripts/*.ion; do
    echo "Checking $(basename "$script")..."
    
    # Validate script produces valid schema
    if ! beamline infer-shape \
        --seed 1 \
        --start-auto \
        --script-path "$script" \
        --output-format text > /dev/null 2>&1; then
        echo "❌ Error: Invalid script $script"
        exit 1
    fi
    
    echo "✅ $(basename "$script") - valid schema"
done

echo "All schemas validated successfully!"

Documentation Generation

# Generate schema documentation
generate_schema_docs() {
    local script_dir="$1"
    local output_file="$2"
    
    echo "# Data Model Documentation" > "$output_file"
    echo "" >> "$output_file"
    echo "Generated: $(date)" >> "$output_file"
    echo "" >> "$output_file"
    
    for script in "$script_dir"/*.ion; do
        local name=$(basename "$script" .ion)
        echo "## $name" >> "$output_file"
        echo "" >> "$output_file"
        echo "Script: \`$script\`" >> "$output_file"
        echo "" >> "$output_file"
        echo '```sql' >> "$output_file"
        
        beamline infer-shape \
            --seed 1 \
            --start-auto \
            --script-path "$script" \
            --output-format basic-ddl >> "$output_file"
            
        echo '```' >> "$output_file"
        echo "" >> "$output_file"
    done
}

generate_schema_docs "data_models" "DATA_MODEL_SCHEMAS.md"

Best Practices

1. Always Validate Shapes

# Before generating large datasets, check the shape
beamline infer-shape --seed 1 --start-auto --script-path new_model.ion

2. Use Appropriate Output Formats

# DDL for database work
beamline infer-shape --script-path data.ion --output-format basic-ddl

# Text for debugging  
beamline infer-shape --script-path data.ion --output-format text

# JSON for automation
beamline infer-shape --script-path data.ion --output-format beamline-json

3. Document Schema Changes

# Track schema evolution
git add schemas/
git commit -m "Update user data model schema

Added:
- user.preferences.theme field
- user.stats.last_login timestamp

Modified:  
- user.profile.age now optional (nullable: 0.1)"

4. Validate Schema Compatibility

# Ensure query compatibility with schema changes
beamline infer-shape --seed 1 --start-auto --script-path new_schema.ion --output-format basic-ddl > new_schema.sql

# Generate test queries against new schema
beamline query basic \
    --seed 2 \
    --start-auto \
    --script-path new_schema.ion \
    --sample-count 10 \
    rand-select-all-fw \
    --pred-all > validation_queries.sql

echo "Schema and queries generated for validation testing"

Next Steps

Now that you understand shapes and schema inference:

Shape Inference

Shape inference is the process of analyzing Ion scripts to determine the data types and structures that will be generated, without actually generating data. This is extremely fast and useful for schema validation, database preparation, and understanding data models.

Shape Inference Command

Basic Usage

The infer-shape command requires the same core parameters as data generation:

beamline infer-shape \
    --seed-auto \
    --start-auto \
    --script-path sensors.ion

Even though no data is generated, seed and start time may affect type inference for certain dynamic generators.

With Specific Parameters

# Use specific seed for reproducible shape inference
beamline infer-shape \
    --seed 12345 \
    --start-iso "2024-01-01T00:00:00Z" \
    --script-path complex_schema.ion \
    --output-format basic-ddl

Output Format Analysis

Text Format (Detailed Debug)

From the README example:

$ beamline infer-shape \
    --seed-auto \
    --start-auto \
    --script-path sensors.ion

Seed: 17685918364143248531
Start: 2022-12-12T19:52:29.000000000Z
{
    "sensors": PartiqlType(
        Bag(
            BagType {
                element_type: PartiqlType(
                    Struct(
                        StructType {
                            constraints: {
                                Fields(
                                    {
                                        StructField {
                                            name: "d",
                                            ty: PartiqlType(
                                                DecimalP(2, 0),
                                            ),
                                        },
                                        StructField {
                                            name: "f",
                                            ty: PartiqlType(
                                                Float64,
                                            ),
                                        },
                                        // ... more fields
                                    },
                                ),
                            },
                        },
                    ),
                ),
            },
        ),
    ),
}

Understanding the structure:

  • Bag: Collection of records (dataset)
  • BagType: Type information for the bag
  • Struct: Each record is a structured object
  • StructField: Individual field definitions with names and types
  • PartiqlType: Specific type information (DecimalP, Float64, etc.)

Basic DDL Format (SQL Ready)

From the README example:

$ beamline infer-shape \
    --seed 7844265201457918498 \
    --start-auto \
    --script-path sensors-nested.ion \
    --output-format basic-ddl

-- Seed: 7844265201457918498
-- Start: 2024-01-01T06:53:06.000000000Z
-- Syntax: partiql_datatype_syntax.0.1
-- Dataset: sensors
"f" DOUBLE,
"i8" INT8,
"id" INT,
"sub" STRUCT<"f": DOUBLE,"o": INT8>,
"tick" INT8

Format characteristics:

  • Comments: Metadata about generation parameters
  • Dataset headers: Clear separation between datasets
  • SQL-ready: Can be used directly in CREATE TABLE statements
  • Type precision: Specific SQL types with precision for decimals

Beamline JSON Format (Tool Integration)

From the README example:

$ beamline infer-shape \
    --seed-auto \
    --start-auto \
    --script-path sensors.ion \
    --output-format beamline-json

{
  seed: -3711181901898679775,
  start: "2022-05-22T13:49:57.000000000+00:00",
  shapes: {
    sensors: partiql::shape::v0::{
      type: "bag",
      items: {
        type: "struct",
        constraints: [ordered, closed],
        fields: [
          {
            name: "d",
            type: "decimal(2, 0)"
          },
          {
            name: "f", 
            type: "double"
          },
          {
            name: "i8",
            type: "int8"
          },
          {
            name: "tick",
            type: "int8"
          },
          {
            name: "w",
            type: "decimal(5, 4)"
          }
        ]
      }
    }
  }
}

Format characteristics:

  • Structured JSON: Machine-readable format
  • Versioned: partiql::shape::v0:: indicates version
  • Complete metadata: Seeds, timestamps, and full type information
  • Tool integration: Designed for PartiQL testing tools

Advanced Shape Inference

CLI Global Defaults Impact

CLI defaults affect shape inference results:

# Infer with default nullable/optional settings
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path data.ion \
    --default-nullable false \
    --default-optional true \
    --output-format basic-ddl

From the README example showing CLI impact:

$ beamline infer-shape \
    --seed 7844265201457918498 \
    --start-auto \
    --script-path sensors.ion \
    --output-format basic-ddl \
    --default-nullable false \
    --default-optional true

-- Seed: 7844265201457918498
-- Start: 2024-01-18T11:40:34.000000000Z
-- Syntax: partiql_datatype_syntax-0.1
-- Dataset: sensors
"a" OPTIONAL UNION<INT8 NOT NULL,DECIMAL(5, 4) NOT NULL,DOUBLE NOT NULL,VARCHAR NOT NULL>,
"ar1" OPTIONAL ARRAY<DECIMAL(2, 1) NOT NULL> NOT NULL,
"ar2" OPTIONAL ARRAY<VARCHAR NOT NULL> NOT NULL,
"ar3" OPTIONAL ARRAY<DECIMAL(5, 4)> NOT NULL,
"ar4" OPTIONAL ARRAY<TINYINT NOT NULL> NOT NULL,
"ar5" OPTIONAL ARRAY<UNION<INT8 NOT NULL,DECIMAL(5, 4) NOT NULL,DOUBLE NOT NULL,VARCHAR NOT NULL>> NOT NULL,
"d" OPTIONAL DECIMAL(2, 0) NOT NULL,
"f" OPTIONAL DOUBLE NOT NULL,
"i8" OPTIONAL TINYINT NOT NULL,
"tick" OPTIONAL INT8 NOT NULL,
"w" OPTIONAL DECIMAL(5, 4)

Notice how the CLI defaults made every field OPTIONAL (--default-optional true) and marked fields NOT NULL (--default-nullable false), except where the script overrides nullability (here, "a" and "w").
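A quick way to verify how many fields carry each annotation in a generated DDL file is to count them with grep; the sample lines below are illustrative:

```shell
# Count nullability annotations in a DDL file (sample lines are illustrative).
cat > sensors_schema.sql << 'EOF'
"d" OPTIONAL DECIMAL(2, 0) NOT NULL,
"f" OPTIONAL DOUBLE NOT NULL,
"w" OPTIONAL DECIMAL(5, 4)
EOF

grep -c 'OPTIONAL' sensors_schema.sql    # fields that can be MISSING: 3
grep -c 'NOT NULL' sensors_schema.sql    # fields that can never be NULL: 2
```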

Multi-Dataset Shape Analysis

# Analyze complex multi-dataset script
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path client-service.ion \
    --output-format basic-ddl

Example output structure:

-- Dataset: service
"Account" VARCHAR,
"Distance" DECIMAL(2, 0),
"Operation" VARCHAR,
"Program" VARCHAR,
"Request" VARCHAR,
"StartTime" TIMESTAMP,
"Weight" DECIMAL(5, 4),
"anyof" UNION<INT8,DECIMAL(5, 4)>,
"array" ARRAY<INT8>,
"client" VARCHAR,
"success" BOOL

-- Dataset: client_0
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL

-- Dataset: client_1
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL

Each dataset from the Ion script gets its own schema section.

Shape Inference Patterns

Script Validation Workflow

#!/bin/bash
# Validate Ion script before data generation

SCRIPT="$1"

if [ ! -f "$SCRIPT" ]; then
    echo "Script not found: $SCRIPT"
    exit 1
fi

echo "Validating Ion script: $SCRIPT"

# Test shape inference (fast validation)
if ! beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format text > /dev/null; then
    echo "❌ Script validation failed - check Ion syntax"
    exit 1
fi

echo "✅ Script syntax valid"

# Show inferred schema
echo ""
echo "Inferred schema:"
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format basic-ddl

echo ""
echo "✅ Script ready for data generation"

Schema Documentation Generation

#!/bin/bash
# Auto-generate schema documentation

SCRIPTS_DIR="$1"
OUTPUT_FILE="$2"

echo "# Data Schema Documentation" > "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
echo "Auto-generated: $(date)" >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"

for script in "$SCRIPTS_DIR"/*.ion; do
    name=$(basename "$script" .ion)
    echo "Processing $name..."
    
    echo "## $name Data Schema" >> "$OUTPUT_FILE"
    echo "" >> "$OUTPUT_FILE"
    echo "**Source Script**: \`$(basename "$script")\`" >> "$OUTPUT_FILE"
    echo "" >> "$OUTPUT_FILE"
    
    # Add schema in SQL format
    echo '```sql' >> "$OUTPUT_FILE"
    beamline infer-shape \
        --seed 1 \
        --start-auto \
        --script-path "$script" \
        --output-format basic-ddl | grep -v "^-- Seed:" | grep -v "^-- Start:" >> "$OUTPUT_FILE"
    echo '```' >> "$OUTPUT_FILE"
    echo "" >> "$OUTPUT_FILE"
    
    # Count datasets and fields
    schema_output=$(beamline infer-shape --seed 1 --start-auto --script-path "$script" --output-format basic-ddl)
    dataset_count=$(echo "$schema_output" | grep -c "^-- Dataset:")
    field_count=$(echo "$schema_output" | grep -c '^"')
    
    echo "**Summary**: $dataset_count dataset(s), $field_count total fields" >> "$OUTPUT_FILE"
    echo "" >> "$OUTPUT_FILE"
done

echo "Schema documentation generated: $OUTPUT_FILE"

Real-World Examples

E-commerce Schema Analysis

rand_processes::{
    $n_customers: UniformU8::{ low: 10, high: 100 },
    $customer_ids: $n_customers::[UUID::()],
    
    customers: static_data::{
        $data: {
            customer_id: Uniform::{ choices: $customer_ids },
            name: LoremIpsumTitle,
            email: Format::{ pattern: "customer{UUID}@email.com" },
            age: UniformU8::{ low: 18, high: 80, optional: 0.1 }
        }
    },
    
    orders: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: minutes::30 },
        $data: {
            order_id: UUID,
            customer_id: Uniform::{ choices: $customer_ids },
            total: UniformDecimal::{ low: 10.00, high: 500.00 },
            items: UniformArray::{
                min_size: 1,
                max_size: 5,
                element_type: {
                    product_name: LoremIpsumTitle,
                    price: UniformDecimal::{ low: 5.00, high: 100.00 },
                    quantity: UniformU8::{ low: 1, high: 3 }
                }
            }
        }
    }
}

Inferred Schema:

$ beamline infer-shape --seed 1 --start-auto --script-path ecommerce.ion --output-format basic-ddl

-- Dataset: customers
"age" OPTIONAL TINYINT,
"customer_id" VARCHAR,
"email" VARCHAR,
"name" VARCHAR

-- Dataset: orders
"customer_id" VARCHAR,
"items" ARRAY<STRUCT<"price": DECIMAL(5, 2),"product_name": VARCHAR,"quantity": TINYINT>>,
"order_id" VARCHAR,
"total" DECIMAL(5, 2)

Financial Data Schema

rand_processes::{
    transactions: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: minutes::5 },
        $data: {
            transaction_id: UUID::{ nullable: false },
            account_id: UUID,
            amount: LogNormalF64::{ location: 4.0, scale: 1.0 },
            transaction_type: Uniform::{ choices: ["deposit", "withdrawal", "transfer"] },
            risk_score: UniformF64::{ low: 0.0, high: 1.0 },
            metadata: {
                merchant: LoremIpsumTitle,
                location: Regex::{ pattern: "[A-Z]{2}" },
                processing_time: UniformF64::{ low: 0.1, high: 5.0 }
            },
            compliance: {
                aml_flagged: Bool::{ p: 0.01 },
                requires_review: Bool::{ p: 0.05 },
                risk_category: Uniform::{ choices: ["low", "medium", "high"] }
            }
        }
    }
}

Inferred Schema:

-- Dataset: transactions
"account_id" VARCHAR,
"amount" DOUBLE,
"compliance" STRUCT<"aml_flagged": BOOL,"requires_review": BOOL,"risk_category": VARCHAR>,
"metadata" STRUCT<"location": VARCHAR,"merchant": VARCHAR,"processing_time": DOUBLE>,
"risk_score" DOUBLE,
"transaction_id" VARCHAR NOT NULL,
"transaction_type" VARCHAR

Shape Inference Analysis

Schema Complexity Assessment

#!/bin/bash
# Analyze schema complexity

SCRIPT="$1"

echo "Schema Complexity Analysis for: $SCRIPT"
echo "======================================"

# Get detailed shape information
schema_output=$(beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format basic-ddl)

# Count datasets
dataset_count=$(echo "$schema_output" | grep -c "^-- Dataset:")
echo "Datasets: $dataset_count"

# Count total fields
field_count=$(echo "$schema_output" | grep -c '^"')
echo "Total fields: $field_count"

# Count complex types
struct_count=$(echo "$schema_output" | grep -c "STRUCT<")
array_count=$(echo "$schema_output" | grep -c "ARRAY<")
union_count=$(echo "$schema_output" | grep -c "UNION<")

echo "Complex types:"
echo "  Structs: $struct_count"
echo "  Arrays: $array_count"  
echo "  Unions: $union_count"

# Count nullable/optional fields
nullable_count=$(echo "$schema_output" | grep -v "NOT NULL" | grep -c '^"')
optional_count=$(echo "$schema_output" | grep -c "OPTIONAL")

echo "Nullability:"
echo "  Nullable fields: $nullable_count"
echo "  Optional fields: $optional_count"

echo ""
echo "Complexity Score: $((field_count + struct_count * 2 + array_count * 2 + union_count * 3))"

Multi-Format Schema Comparison

#!/bin/bash
# Compare schema formats for analysis

SCRIPT="$1"
BASE_NAME=$(basename "$SCRIPT" .ion)

echo "Generating schema in all formats for: $SCRIPT"

# Generate all three formats
beamline infer-shape --seed 1 --start-auto --script-path "$SCRIPT" --output-format text > "${BASE_NAME}_debug.txt"
beamline infer-shape --seed 1 --start-auto --script-path "$SCRIPT" --output-format basic-ddl > "${BASE_NAME}_schema.sql"  
beamline infer-shape --seed 1 --start-auto --script-path "$SCRIPT" --output-format beamline-json > "${BASE_NAME}_schema.json"

echo "Generated schema files:"
echo "  Debug format: ${BASE_NAME}_debug.txt ($(wc -l < "${BASE_NAME}_debug.txt") lines)"
echo "  SQL DDL format: ${BASE_NAME}_schema.sql ($(wc -l < "${BASE_NAME}_schema.sql") lines)"
echo "  JSON format: ${BASE_NAME}_schema.json ($(wc -l < "${BASE_NAME}_schema.json") lines)"

# Show summary from SQL format
echo ""
echo "Schema summary:"
grep '^-- Dataset:' "${BASE_NAME}_schema.sql" | while read -r line; do
    dataset=$(echo "$line" | cut -d: -f2 | xargs)
    echo "  Dataset: $dataset"
done

Shape Inference Optimization

Fast Schema Validation

Shape inference is much faster than data generation:

# Quick validation of multiple scripts
for script in models/*.ion; do
    echo -n "$(basename "$script"): "
    
    start_time=$(date +%s.%N)
    if beamline infer-shape \
        --seed 1 \
        --start-auto \
        --script-path "$script" \
        --output-format text > /dev/null; then
        end_time=$(date +%s.%N)
        duration=$(echo "$end_time - $start_time" | bc -l)
        echo "✅ Valid (${duration}s)"
    else
        echo "❌ Invalid"
    fi
done

Batch Schema Generation

#!/bin/bash
# Generate schemas for all scripts in parallel

SCRIPTS_DIR="$1"
OUTPUT_DIR="$2"

mkdir -p "$OUTPUT_DIR"

echo "Generating schemas for all scripts in $SCRIPTS_DIR..."

# Process scripts in parallel
for script in "$SCRIPTS_DIR"/*.ion; do
    {
        name=$(basename "$script" .ion)
        echo "Processing $name..."
        
        beamline infer-shape \
            --seed 1 \
            --start-auto \
            --script-path "$script" \
            --output-format basic-ddl > "$OUTPUT_DIR/${name}_schema.sql"
            
        echo "✅ $name completed"
    } &
done

wait  # Wait for all background jobs
echo "All schema generation completed"

# Summary
echo ""
echo "Generated schemas:"
for file in "$OUTPUT_DIR"/*.sql; do
    lines=$(wc -l < "$file")
    echo "  $(basename "$file"): $lines lines"
done

Troubleshooting Shape Inference

Common Issues

Script Syntax Errors

$ beamline infer-shape --seed 1 --start-auto --script-path bad_syntax.ion
Error: Failed to parse Ion script: Invalid Ion syntax at line 5, column 10

Solution: Check Ion script syntax, ensure balanced braces and proper structure.

Missing Required Parameters

$ beamline infer-shape --script-path data.ion
Error: One of --seed-auto or --seed is required
Error: One of --start-auto, --start-epoch-ms, or --start-iso is required

Solution: Always provide seed and start time parameters.

Invalid Generator Configuration

# This will fail during shape inference
rand_processes::{
    bad_data: rand_process::{
        $arrival: HomogeneousPoisson:: { interarrival: seconds::1 },
        $data: {
            invalid_range: UniformI32::{ low: 100, high: 50 }  // min > max
        }
    }
}

Solution: Check generator configurations for valid parameter ranges.

Performance Troubleshooting

Shape inference should be very fast (milliseconds). If it’s slow:

# Check for complex nested structures
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path suspected_slow.ion \
    --output-format text | grep -c "nested_struct"

Very deep nesting (10+ levels) might slow shape inference slightly.

Integration Examples

Database Schema Creation Pipeline

#!/bin/bash
# Complete database schema creation pipeline

SCRIPT="$1"
DATABASE_NAME="$2"

echo "Creating database from Ion script: $SCRIPT"

# 1. Validate script and infer schema
echo "Step 1: Validating script and inferring schema..."
if ! beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format basic-ddl > schema.sql; then
    echo "❌ Schema inference failed"
    exit 1
fi

# 2. Create database
echo "Step 2: Creating database $DATABASE_NAME..."
createdb "$DATABASE_NAME"

# 3. Generate CREATE TABLE statements
echo "Step 3: Generating CREATE TABLE statements..."
grep '^-- Dataset:' schema.sql | while read -r line; do
    dataset=$(echo "$line" | cut -d: -f2 | xargs)

    echo "CREATE TABLE $dataset (" > "table_${dataset}.sql"
    # Extract this dataset's fields, stopping at the next dataset marker
    awk -v marker="$line" '$0 == marker { f = 1; next } /^-- Dataset:/ { f = 0 } f && /^"/' schema.sql >> "table_${dataset}.sql"
    echo ");" >> "table_${dataset}.sql"
    
    echo "Creating table: $dataset"
    psql -d "$DATABASE_NAME" -f "table_${dataset}.sql"
done

echo "✅ Database $DATABASE_NAME created with schema from $SCRIPT"

Schema Testing Integration

#!/bin/bash
# Test schema consistency across development workflow

SCRIPT="user_model.ion"
SEED=12345

echo "Testing schema consistency workflow..."

# 1. Infer baseline schema
beamline infer-shape \
    --seed $SEED \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format basic-ddl > baseline_schema.sql

# 2. Generate test data using same script
beamline gen data \
    --seed $SEED \
    --start-auto \
    --script-path "$SCRIPT" \
    --sample-count 100 \
    --output-format ion-pretty > test_data.ion

# 3. Generate test queries using same script
beamline query basic \
    --seed $((SEED + 1)) \
    --start-auto \
    --script-path "$SCRIPT" \
    --sample-count 10 \
    rand-select-all-fw \
    --pred-all > test_queries.sql

echo "Consistency test completed:"
echo "  Schema: baseline_schema.sql"
echo "  Data: test_data.ion ($(jq '.data | to_entries[0].value | length' test_data.ion 2>/dev/null || echo 'N/A') records)"
echo "  Queries: test_queries.sql ($(wc -l < test_queries.sql) queries)"

# 4. Validate all components reference same structure
echo ""
echo "✅ Schema, data, and queries all generated from same Ion script"
echo "✅ Consistency guaranteed by same script source"

Best Practices

1. Use Shape Inference Early

# Always validate scripts before large data generation
beamline infer-shape --seed 1 --start-auto --script-path new_script.ion

# Then proceed with data generation
beamline gen data --seed 1 --start-auto --script-path new_script.ion --sample-count 100000

2. Choose Format for Purpose

# Development and debugging
beamline infer-shape --script-path script.ion --output-format text

# Database integration
beamline infer-shape --script-path script.ion --output-format basic-ddl

# Tool integration and automation
beamline infer-shape --script-path script.ion --output-format beamline-json

3. Version Control Schemas

# Track schema evolution alongside scripts
git add scripts/user_model.ion schemas/user_model_schema.sql
git commit -m "Add user model v2 with preferences and stats

Schema changes:
- Added user.preferences nested object
- Added user.stats.login_count field  
- Made user.profile.age optional"

4. Validate Schema Changes

# Before deploying schema changes
beamline infer-shape --seed 1 --start-auto --script-path new_version.ion --output-format basic-ddl > new_schema.sql
diff old_schema.sql new_schema.sql

# Test compatibility with existing queries
# your-query-validator --schema new_schema.sql --queries existing_queries.sql

Next Steps

Now that you understand shape inference:

Schema Output Formats

Beamline supports three distinct output formats for schema information, each optimized for different use cases. Understanding these formats helps you choose the right one for your workflow and integrate schemas effectively with your tools and processes.

Available Schema Formats

The infer-shape command supports three output formats via --output-format:

Format          Description                                        Use Case                       Performance
text            Rust debug format with detailed type information   Development, debugging         Fast
basic-ddl       SQL DDL statements ready for database creation     Database integration           Fast
beamline-json   Structured JSON for PartiQL testing tools          Tool integration, automation   Fast

Text Format (Default)

Characteristics

  • Detailed type information: Complete PartiQL type system representation
  • Rust debug format: Shows internal type structures
  • Development focused: Ideal for understanding complex type relationships
  • Human readable: easy to follow once you recognize the structure

Example Output

From the README example with sensors.ion:

$ beamline infer-shape \
    --seed-auto \
    --start-auto \
    --script-path sensors.ion

Seed: 17685918364143248531
Start: 2022-12-12T19:52:29.000000000Z
{
    "sensors": PartiqlType(
        Bag(
            BagType {
                element_type: PartiqlType(
                    Struct(
                        StructType {
                            constraints: {
                                Fields(
                                    {
                                        StructField {
                                            name: "d",
                                            ty: PartiqlType(
                                                DecimalP(
                                                    2,
                                                    0,
                                                ),
                                            ),
                                        },
                                        StructField {
                                            name: "f",
                                            ty: PartiqlType(
                                                Float64,
                                            ),
                                        },
                                        StructField {
                                            name: "i8",
                                            ty: PartiqlType(
                                                Int64,
                                            ),
                                        },
                                        StructField {
                                            name: "tick",
                                            ty: PartiqlType(
                                                Int64,
                                            ),
                                        },
                                        StructField {
                                            name: "w",
                                            ty: PartiqlType(
                                                DecimalP(
                                                    5,
                                                    4,
                                                ),
                                            ),
                                        },
                                    },
                                ),
                            },
                        },
                    ),
                ),
            },
        ),
    ),
}

Understanding Text Format Structure

  • PartiqlType: Root type wrapper
  • Bag: Collection type (dataset)
  • BagType: Container for element type information
  • Struct: Record structure
  • StructType: Detailed structure information
  • StructField: Individual field with name and type
  • DecimalP(5,4): Decimal with precision 5, scale 4
  • Float64: 64-bit floating point
  • Int64: 64-bit integer
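Because the text format is line-oriented, field names can be pulled out with standard tools. A minimal sketch, run here against a hand-copied fragment of the output above rather than a live beamline run:

```shell
# Sample fragment mirroring the text-format output shown above
cat > shape.txt <<'EOF'
StructField {
    name: "d",
    ty: PartiqlType(
        DecimalP(
            2,
            0,
        ),
    ),
},
StructField {
    name: "f",
    ty: PartiqlType(
        Float64,
    ),
},
EOF

# Extract just the field names from StructField entries
sed -n 's/.*name: "\([^"]*\)",/\1/p' shape.txt
# prints:
# d
# f
```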

Use Cases

  • Development: Understanding complex type relationships
  • Debugging: Detailed analysis of type inference
  • Learning: Understanding PartiQL type system
  • Tool development: Building PartiQL-aware tools

Basic DDL Format

Characteristics

  • SQL-ready: Can be used directly in CREATE TABLE statements
  • Human readable: Easy to understand for database developers
  • Production focused: Ready for database integration
  • Compact: Concise representation

Example Output

From the README example with sensors-nested.ion:

$ beamline infer-shape \
    --seed 7844265201457918498 \
    --start-auto \
    --script-path sensors-nested.ion \
    --output-format basic-ddl

-- Seed: 7844265201457918498
-- Start: 2024-01-01T06:53:06.000000000Z
-- Syntax: partiql_datatype_syntax.0.1
-- Dataset: sensors
"f" DOUBLE,
"i8" INT8,
"id" INT,
"sub" STRUCT<"f": DOUBLE,"o": INT8>,
"tick" INT8

Format Structure

  • Header comments: Generation metadata for reproducibility
  • Syntax version: DDL syntax version identifier
  • Dataset sections: -- Dataset: name separates different datasets
  • Field definitions: SQL column definitions ready for CREATE TABLE
  • Type specifications: Precise SQL types with dimensions
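The dataset markers make the DDL output straightforward to split mechanically. A sketch of turning the sections into CREATE TABLE statements with awk, run against a small sample shaped like the output above (in practice you would pipe in real infer-shape output):

```shell
# Sample shaped like basic-ddl output (stand-in for real infer-shape output)
cat > sample_schema.ddl <<'EOF'
-- Seed: 1
-- Start: 2024-01-01T00:00:00.000000000Z
-- Syntax: partiql_datatype_syntax.0.1
-- Dataset: service
"Account" VARCHAR,
"Request" VARCHAR,
-- Dataset: client_0
"id" VARCHAR,
"request_id" VARCHAR,
EOF

awk '
  # Print the buffered fields of the previous dataset as a closed table body
  function emit() {
    if (n == 0) return
    for (i = 1; i < n; i++) print "  " f[i] ","
    print "  " f[n]           # last field: no trailing comma
    print ");"
    n = 0
  }
  /^-- Dataset:/ { emit(); print "CREATE TABLE " $3 " ("; next }
  /^"/           { line = $0; sub(/,$/, "", line); f[++n] = line }
  END            { emit() }
' sample_schema.ddl > tables.sql

cat tables.sql
```

This produces one CREATE TABLE statement per dataset, with the trailing comma removed from each table's last field.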

DDL Type Mapping

PartiQL Internal | DDL Output                 | Description
-----------------|----------------------------|--------------------------------
Int64            | INT8, TINYINT, INT, BIGINT | Size depends on generator range
Float64          | DOUBLE                     | 64-bit floating point
DecimalP(p,s)    | DECIMAL(p,s)               | Fixed precision decimal
String           | VARCHAR                    | Variable length string
Bool             | BOOL                       | Boolean type
DateTime         | TIMESTAMP                  | Date and time
Struct           | STRUCT<...>                | Nested object structure
Array            | ARRAY<T>                   | Array of type T
Union            | UNION<T1,T2,...>           | One of several types
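The mapping above can be expressed as a small lookup helper. This is a hypothetical illustration, not part of Beamline itself; the internal type names come from the text-format output shown earlier, and Int64 is mapped to BIGINT as a conservative default since the actual width depends on the generator range:

```shell
# Hypothetical helper following the DDL type mapping table above
ddl_type() {
  case "$1" in
    Int64)     echo "BIGINT" ;;                 # real width: INT8/TINYINT/INT/BIGINT per generator range
    Float64)   echo "DOUBLE" ;;
    DecimalP*) echo "DECIMAL${1#DecimalP}" ;;   # DecimalP(5,4) -> DECIMAL(5,4)
    String)    echo "VARCHAR" ;;
    Bool)      echo "BOOL" ;;
    DateTime)  echo "TIMESTAMP" ;;
    *)         echo "UNKNOWN" ;;
  esac
}

ddl_type "Float64"        # prints DOUBLE
ddl_type "DecimalP(5,4)"  # prints DECIMAL(5,4)
```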

Complete Database Creation

#!/bin/bash
# Create complete database from DDL output

SCRIPT="ecommerce.ion"
DB_NAME="ecommerce_test"

# Generate complete DDL
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format basic-ddl > schema.ddl

# Create database
createdb "$DB_NAME"

# Extract and create each table
grep "-- Dataset:" schema.ddl | while read -r line; do
    dataset=$(echo "$line" | cut -d: -f2 | xargs)
    
    {
        echo "CREATE TABLE $dataset ("
        # Extract fields until next dataset or end of file
        sed -n "/-- Dataset: $dataset/,/-- Dataset:/p" schema.ddl | \
        grep '^"' | \
        sed '$ s/,$//' # Remove trailing comma from last line
        echo ");"
    } > "${dataset}_table.sql"
    
    echo "Creating table: $dataset"
    psql -d "$DB_NAME" -f "${dataset}_table.sql"
    
    rm "${dataset}_table.sql"
done

echo "Database $DB_NAME created successfully"

Use Cases

  • Database creation: Direct CREATE TABLE generation
  • Schema documentation: Human-readable reference
  • Migration scripts: Database schema evolution
  • SQL integration: Compatible with SQL databases

Beamline JSON Format

Characteristics

  • Machine readable: Structured JSON for programmatic processing
  • Tool integration: Designed for PartiQL testing frameworks
  • Versioned: Includes format version information
  • Complete metadata: Full type and constraint information

Example Output

From the README example:

$ beamline infer-shape \
    --seed-auto \
    --start-auto \
    --script-path sensors.ion \
    --output-format beamline-json

{
  seed: -3711181901898679775,
  start: "2022-05-22T13:49:57.000000000+00:00",
  shapes: {
    sensors: partiql::shape::v0::{
      type: "bag",
      items: {
        type: "struct",
        constraints: [
          ordered,
          closed
        ],
        fields: [
          {
            name: "d",
            type: "decimal(2, 0)"
          },
          {
            name: "f",
            type: "double"
          },
          {
            name: "i8",
            type: "int8"
          },
          {
            name: "tick",
            type: "int8"
          },
          {
            name: "w",
            type: "decimal(5, 4)"
          }
        ]
      }
    }
  }
}

JSON Format Structure

  • seed: Random seed used for inference
  • start: Start timestamp used for inference
  • shapes: Dictionary of dataset name to shape definition
  • partiql::shape::v0::: Format version identifier
  • type: "bag": Collection type (dataset)
  • items: Element type definition for bag contents
  • constraints: Structural constraints (ordered, closed)
  • fields: Array of field definitions
  • Field objects: name and type for each field

Processing JSON Format

# Extract dataset names
beamline infer-shape --seed 1 --start-auto --script-path multi.ion --output-format beamline-json | \
jq -r '.shapes | keys[]'

# Count fields in each dataset
beamline infer-shape --seed 1 --start-auto --script-path multi.ion --output-format beamline-json | \
jq -r '.shapes | to_entries[] | "\(.key): \(.value.items.fields | length) fields"'

# Extract field types for specific dataset
beamline infer-shape --seed 1 --start-auto --script-path data.ion --output-format beamline-json | \
jq -r '.shapes.users.items.fields[] | "\(.name): \(.type)"'
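If beamline isn't at hand, the same jq filters can be exercised against a small hand-written sample in the quoted-key JSON form shown later in this chapter (assuming jq is installed; the field names here are illustrative):

```shell
# Hand-written sample with the same shape as beamline-json output
cat > sample_shapes.json <<'EOF'
{
  "seed": 1,
  "shapes": {
    "sensors": {
      "type": "bag",
      "items": {
        "type": "struct",
        "fields": [
          { "name": "f",    "type": "double" },
          { "name": "tick", "type": "int8" }
        ]
      }
    }
  }
}
EOF

jq -r '.shapes | keys[]' sample_shapes.json
# prints: sensors

jq -r '.shapes.sensors.items.fields[] | "\(.name): \(.type)"' sample_shapes.json
# prints:
# f: double
# tick: int8
```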

Use Cases

  • Automated testing: PartiQL conformance test suites
  • Tool integration: Schema-aware development tools
  • CI/CD pipelines: Automated schema validation
  • Documentation generation: Programmatic documentation creation

Format Comparison

Size and Performance

For the same schema with multiple datasets:

# Generate all formats for comparison
beamline infer-shape --seed 1 --start-auto --script-path complex.ion --output-format text > schema.txt
beamline infer-shape --seed 1 --start-auto --script-path complex.ion --output-format basic-ddl > schema.sql
beamline infer-shape --seed 1 --start-auto --script-path complex.ion --output-format beamline-json > schema.json

# Compare sizes
ls -lh schema.*
# Example results:
# -rw-r--r-- 1 user user 8.2K schema.txt  (most detailed)
# -rw-r--r-- 1 user user 1.5K schema.sql  (most compact)  
# -rw-r--r-- 1 user user 3.1K schema.json (structured)

Information Density

Format        | Type Detail | Structure Info | Metadata | Processability
--------------|-------------|----------------|----------|---------------
text          | Very High   | Very High      | High     | Low
basic-ddl     | Medium      | Medium         | Medium   | High (SQL)
beamline-json | High        | High           | High     | High (JSON)

Format-Specific Integration

Text Format Analysis

# Analyze complex type structures
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path nested_structures.ion \
    --output-format text | \
grep -A 20 "StructField"  # Extract field information

# Count nesting levels
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path deep_nesting.ion \
    --output-format text | \
grep -c "Struct("  # Count nested structures

DDL Format Database Integration

#!/bin/bash
# Complete database integration workflow

SCRIPT="$1"
DATABASE="$2"

# Generate DDL schema
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format basic-ddl > full_schema.sql

# Create database
createdb "$DATABASE"

# Process each dataset into a table
current_dataset=""
while IFS= read -r line; do
    if [[ $line == *"-- Dataset:"* ]]; then
        # Start new dataset
        if [[ -n "$current_dataset" ]]; then
            echo ");" >> "${current_dataset}.sql"
            psql -d "$DATABASE" -f "${current_dataset}.sql"
            rm "${current_dataset}.sql"
        fi
        
        current_dataset=$(echo "$line" | cut -d: -f2 | xargs)
        echo "CREATE TABLE $current_dataset (" > "${current_dataset}.sql"
        
    elif [[ $line == \"* ]]; then
        # Add field to current table
        echo "  $line" >> "${current_dataset}.sql"
    fi
done < full_schema.sql

# Handle last dataset
if [[ -n "$current_dataset" ]]; then
    echo ");" >> "${current_dataset}.sql"
    psql -d "$DATABASE" -f "${current_dataset}.sql"
    rm "${current_dataset}.sql"
fi

echo "Database $DATABASE created with all tables"

JSON Format Automation

#!/bin/bash
# Automated schema processing with JSON format

SCRIPT="$1"

echo "Analyzing schema from $SCRIPT..."

# Generate JSON schema
schema_json=$(beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format beamline-json)

# Extract metadata
seed=$(echo "$schema_json" | jq -r '.seed')
start=$(echo "$schema_json" | jq -r '.start')
echo "Schema generated with seed: $seed, start: $start"

# Analyze each dataset
echo "$schema_json" | jq -r '.shapes | keys[]' | while read -r dataset; do
    echo ""
    echo "Dataset: $dataset"
    
    # Get field count
    field_count=$(echo "$schema_json" | jq ".shapes.${dataset}.items.fields | length")
    echo "  Fields: $field_count"
    
    # List all fields with types
    echo "  Field details:"
    echo "$schema_json" | jq -r ".shapes.${dataset}.items.fields[] | \"    \(.name): \(.type)\""
    
    # Check for complex types
    complex_count=$(echo "$schema_json" | jq -r ".shapes.${dataset}.items.fields[] | select(.type | contains(\"struct\") or contains(\"array\") or contains(\"union\")) | .name" | wc -l)
    if [ "$complex_count" -gt 0 ]; then
        echo "  Complex types: $complex_count fields"
    fi
done

Multi-Dataset Schema Handling

Separating Dataset Schemas

Different formats handle multiple datasets differently:

Text Format

All datasets in single output, nested under their names:

{
    "service": PartiqlType(Bag(...)),
    "client_0": PartiqlType(Bag(...)),
    "client_1": PartiqlType(Bag(...))
}

Basic DDL Format

Datasets separated by comments:

-- Dataset: service
"Account" VARCHAR,
"Request" VARCHAR,
-- Dataset: client_0  
"id" VARCHAR,
"request_id" VARCHAR,
-- Dataset: client_1
"id" VARCHAR,
"request_id" VARCHAR,

JSON Format

Datasets as separate objects in shapes dictionary:

{
  "shapes": {
    "service": { "type": "bag", "items": {...} },
    "client_0": { "type": "bag", "items": {...} },
    "client_1": { "type": "bag", "items": {...} }
  }
}

Dataset-Specific Schema Extraction

#!/bin/bash
# Extract schema for specific dataset

SCRIPT="$1"
DATASET="$2"
FORMAT="$3"

case $FORMAT in
    "ddl")
        beamline infer-shape \
            --seed 1 \
            --start-auto \
            --script-path "$SCRIPT" \
            --output-format basic-ddl | \
        sed -n "/-- Dataset: $DATASET/,/-- Dataset:/p" | \
        head -n -1  # Remove next dataset header
        ;;
        
    "json")
        beamline infer-shape \
            --seed 1 \
            --start-auto \
            --script-path "$SCRIPT" \
            --output-format beamline-json | \
        jq ".shapes.${DATASET}"
        ;;
        
    *)
        echo "Usage: $0 <script.ion> <dataset_name> <ddl|json>"
        exit 1
        ;;
esac

Nullability and Optionality in Formats

How Each Format Represents Absent Values

Text Format

StructField {
    name: "nullable_field",
    ty: PartiqlType(Int64),  // Type doesn't show nullability directly
}

DDL Format

"required_field" VARCHAR NOT NULL,     -- nullable: false
"nullable_field" INT,                  -- nullable: true (default)
"optional_field" OPTIONAL VARCHAR      -- optional: true

JSON Format

{
  "name": "nullable_field", 
  "type": "int64"  // Nullability not directly visible
}

Note: DDL format provides the clearest nullability information.
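Since the DDL output is the clearest source of nullability information, a quick classification can be done with grep. A sketch, run against sample lines mirroring the DDL snippet above:

```shell
# Sample field definitions mirroring the DDL snippet above
cat > fields.ddl <<'EOF'
"required_field" VARCHAR NOT NULL,
"nullable_field" INT,
"optional_field" OPTIONAL VARCHAR
EOF

total=$(grep -c '^"' fields.ddl)           # every field definition starts with a quoted name
not_null=$(grep -c 'NOT NULL' fields.ddl)  # explicitly non-nullable fields
optional=$(grep -c 'OPTIONAL ' fields.ddl) # explicitly optional fields
echo "fields: $total, NOT NULL: $not_null, OPTIONAL: $optional"
# prints: fields: 3, NOT NULL: 1, OPTIONAL: 1
```

Fields without a NOT NULL marker are nullable by default.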

CLI Defaults Impact on Formats

# With CLI nullability defaults
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path data.ion \
    --default-nullable true \
    --default-optional true \
    --output-format basic-ddl

Result shows CLI impact:

-- All fields affected by CLI defaults
"field1" OPTIONAL VARCHAR,      -- Made optional by CLI
"field2" OPTIONAL INT NOT NULL, -- Made optional, but explicit nullable: false overrides
"field3" OPTIONAL BOOL          -- Both CLI defaults applied

Advanced Format Usage

Schema Evolution Tracking

#!/bin/bash
# Track schema changes across versions

OLD_SCRIPT="model_v1.ion"
NEW_SCRIPT="model_v2.ion"

echo "Schema Evolution Analysis"
echo "========================"

# Generate schemas in DDL format for comparison
beamline infer-shape --seed 1 --start-auto --script-path "$OLD_SCRIPT" --output-format basic-ddl > v1_schema.sql
beamline infer-shape --seed 1 --start-auto --script-path "$NEW_SCRIPT" --output-format basic-ddl > v2_schema.sql

# Show changes
echo "Schema changes:"
diff -u v1_schema.sql v2_schema.sql

# Also generate JSON for programmatic analysis
beamline infer-shape --seed 1 --start-auto --script-path "$OLD_SCRIPT" --output-format beamline-json > v1_schema.json
beamline infer-shape --seed 1 --start-auto --script-path "$NEW_SCRIPT" --output-format beamline-json > v2_schema.json

# Count field changes
v1_fields=$(jq -r '.shapes | to_entries[] | .value.items.fields[].name' v1_schema.json | sort)
v2_fields=$(jq -r '.shapes | to_entries[] | .value.items.fields[].name' v2_schema.json | sort)

added_fields=$(comm -13 <(echo "$v1_fields") <(echo "$v2_fields"))
removed_fields=$(comm -23 <(echo "$v1_fields") <(echo "$v2_fields"))

echo ""
echo "Field changes:"
if [ -n "$added_fields" ]; then
    echo "Added: $(echo "$added_fields" | tr '\n' ' ')"
fi
if [ -n "$removed_fields" ]; then
    echo "Removed: $(echo "$removed_fields" | tr '\n' ' ')"
fi

Multi-Format Documentation

#!/bin/bash
# Generate comprehensive schema documentation

SCRIPT="$1"
BASE_NAME=$(basename "$SCRIPT" .ion)

echo "# Schema Documentation: $BASE_NAME"
echo "Generated: $(date)"
echo ""

# Generate metadata
metadata=$(beamline infer-shape --seed 1 --start-auto --script-path "$SCRIPT" --output-format basic-ddl | head -3)
echo "## Generation Metadata"
echo '```'
echo "$metadata"
echo '```'
echo ""

# SQL DDL for database developers
echo "## SQL DDL Schema"
echo '```sql'
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format basic-ddl | tail -n +4  # Skip metadata lines
echo '```'
echo ""

# JSON for tool developers
echo "## JSON Schema (for tools)"
echo '```json'
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format beamline-json | jq '.shapes'
echo '```'
echo ""

# Analysis summary
echo "## Schema Analysis"
schema_json=$(beamline infer-shape --seed 1 --start-auto --script-path "$SCRIPT" --output-format beamline-json)
dataset_count=$(echo "$schema_json" | jq '.shapes | length')
total_fields=$(echo "$schema_json" | jq '[.shapes[].items.fields | length] | add')

echo "- **Datasets**: $dataset_count"
echo "- **Total fields**: $total_fields"
echo "- **Source script**: \`$SCRIPT\`"

Best Practices

1. Choose Format for Purpose

# Understanding complex types
beamline infer-shape --script-path complex.ion --output-format text

# Database creation
beamline infer-shape --script-path db_model.ion --output-format basic-ddl

# Automated processing
beamline infer-shape --script-path data.ion --output-format beamline-json

2. Use Consistent Parameters

# Always use same seed for reproducible schema generation
beamline infer-shape --seed 1 --start-auto --script-path script.ion --output-format basic-ddl

3. Version Schema Output

# Include format in version control
git add schemas/model_v2.sql schemas/model_v2.json
git commit -m "Add schema v2 in SQL and JSON formats

- SQL DDL for database creation
- JSON for automated tooling"

4. Validate Format Consistency

# Ensure all formats represent same schema
beamline infer-shape --seed 42 --start-auto --script-path test.ion --output-format basic-ddl > test.sql
beamline infer-shape --seed 42 --start-auto --script-path test.ion --output-format beamline-json > test.json

# Extract field count from both formats  
sql_fields=$(grep '^"' test.sql | wc -l)
json_fields=$(jq '[.shapes[].items.fields | length] | add' test.json)

if [ "$sql_fields" -eq "$json_fields" ]; then
    echo "✅ Schema formats consistent: $sql_fields fields"
else
    echo "❌ Format mismatch: SQL=$sql_fields, JSON=$json_fields"
fi

Format Selection Guidelines

By Use Case

Use Case              | Recommended Format | Alternative
----------------------|--------------------|--------------
Database creation     | basic-ddl          | N/A
Development debugging | text               | basic-ddl
Tool integration      | beamline-json      | N/A
Documentation         | basic-ddl          | beamline-json
Schema comparison     | basic-ddl          | beamline-json
CI/CD automation      | beamline-json      | basic-ddl

By Consumer

Consumer          | Recommended Format | Rationale
------------------|--------------------|---------------------------
SQL Database      | basic-ddl          | Direct CREATE TABLE usage
PartiQL Tools     | beamline-json      | Native PartiQL format
Human Review      | basic-ddl          | Most readable
Development Tools | beamline-json      | Machine processable
Documentation     | basic-ddl          | Clear and concise

Next Steps

Now that you understand all schema output formats, move on to the command line interface.

CLI Overview

The Beamline Command Line Interface (CLI) provides access to all of Beamline’s core functionality: data generation, query generation, schema inference, and database creation. The CLI is built using Rust and follows a simple, consistent command structure.

Installation and Setup

Building from Source

The CLI is built as part of the Beamline project:

# Clone the repository
git clone https://github.com/partiql/partiql-beamline.git
cd partiql-beamline

# Build the project (includes CLI)
cargo build --release

# The CLI binary will be available at:
./target/release/beamline

Verification

After building, verify the CLI is working:

# Check version
./target/release/beamline --version

# View help
./target/release/beamline --help

Command Structure

All Beamline CLI commands follow this structure:

beamline <COMMAND> [SUBCOMMAND] [OPTIONS]

Available Commands

The CLI provides four main commands:

1. gen - Data and Database Generation

Generate synthetic data and create databases.

Subcommands:

  • data - Generate synthetic data from Ion scripts
  • db beamline-lite - Create BeamlineLite database with data and schemas

Example:

beamline gen data --seed-auto --start-auto --sample-count 100 --script-path my_script.ion

2. infer-shape - Schema Inference

Infer data schemas from Ion scripts without generating full datasets.

Example:

beamline infer-shape --seed-auto --start-auto --script-path my_script.ion --output-format basic-ddl

3. query - Query Generation

Generate PartiQL queries that match your data structures.

Subcommands:

  • basic - Basic query generation with configurable strategies

Example:

beamline query basic --seed 1234 --start-auto --script-path data_script.ion --sample-count 5 rand-select-all-fw --tbl-flt-rand-min 1 --tbl-flt-rand-max 1 --pred-lt

4. help - Help Information

Display help for commands and subcommands.

Common Options

Several options are shared across multiple commands:

Seed Configuration (Required)

Control reproducibility through seeding:

--seed-auto                    # Generate random seed automatically
--seed <SEED>                  # Use specific seed (e.g., --seed 12345)

Start Time Configuration (Required)

Set the simulation start time:

--start-auto                   # Generate random start time
--start-epoch-ms <EPOCH_MS>    # Use Unix timestamp in milliseconds
--start-iso <ISO_8601>         # Use ISO 8601 format (e.g., 2024-01-01T00:00:00Z)

Script Configuration (Required)

Provide the Ion script defining data generation:

--script-path <PATH>           # Path to Ion script file
--script <SCRIPT_DATA>         # Inline Ion script content

Sample Count

Control how much data to generate:

--sample-count <COUNT>         # Number of samples (default: 10)

Nullability and Optionality

Configure NULL and MISSING value generation:

--default-nullable <true|false>    # Make types nullable by default
--pct-null <PERCENTAGE>            # Percentage of NULL values (0.0-1.0)
--default-optional <true|false>    # Make types optional by default  
--pct-optional <PERCENTAGE>        # Percentage of MISSING values (0.0-1.0)

Output Formats

Data Generation Formats

For gen data, specify output format with --output-format:

Format     | Description                   | Use Case
-----------|-------------------------------|----------------------
text       | Human-readable text (default) | Debugging, inspection
ion        | Compact Amazon Ion text       | Efficient storage
ion-pretty | Pretty-printed Ion text       | Human-readable Ion
ion-binary | Binary Ion format             | Most compact

Example:

beamline gen data --seed-auto --start-auto --script-path data.ion --output-format ion-pretty

Shape Inference Formats

For infer-shape, specify format with --output-format:

Format        | Description            | Use Case
--------------|------------------------|----------------
text          | Debug format (default) | Development
basic-ddl     | SQL DDL format         | Database schema
beamline-json | Beamline JSON format   | Testing

Basic Usage Examples

Generate Data

# Simple data generation
beamline gen data \
  --seed-auto \
  --start-auto \
  --sample-count 1000 \
  --script-path sensors.ion

# Reproducible generation with specific seed
beamline gen data \
  --seed 12345 \
  --start-iso "2024-01-01T00:00:00Z" \
  --sample-count 500 \
  --script-path user_data.ion \
  --output-format ion-pretty

Filter Datasets

Generate data for specific datasets only:

beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path client_service.ion \
  --dataset service \
  --dataset client_1 \
  --sample-count 100

Infer Schema

# Get SQL DDL schema
beamline infer-shape \
  --seed-auto \
  --start-auto \
  --script-path my_script.ion \
  --output-format basic-ddl

# Get detailed shape information  
beamline infer-shape \
  --seed 1234 \
  --start-auto \
  --script-path complex_data.ion \
  --output-format text

Create Database

# Create BeamlineLite database
beamline gen db beamline-lite \
  --seed-auto \
  --start-auto \
  --script-path database_script.ion \
  --sample-count 10000

# Custom catalog location
beamline gen db beamline-lite \
  --seed 2024 \
  --start-auto \
  --script-path data.ion \
  --catalog_name my-catalog \
  --catalog_path ./databases/ \
  --sample-count 5000

Generate Queries

# Simple query generation
beamline query basic \
  --seed 100 \
  --start-auto \
  --script-path transactions.ion \
  --sample-count 10 \
  rand-select-all-fw \
    --tbl-flt-rand-min 1 \
    --tbl-flt-rand-max 3 \
    --tbl-flt-path-depth-max 2 \
    --pred-all

Configuration with Nullability/Optionality

Control NULL and MISSING value generation:

# Make all types nullable with 10% NULL values
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path data.ion \
  --pct-null 0.1 \
  --sample-count 1000

# Make types optional with 5% MISSING values
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path data.ion \
  --pct-optional 0.05 \
  --sample-count 1000

# Disable nullability and optionality
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path data.ion \
  --default-nullable false \
  --default-optional false \
  --sample-count 1000

Error Handling

Common Error Types

Script Not Found

$ beamline gen data --seed-auto --start-auto --script-path missing.ion
Error: Unable to read script file 'missing.ion': No such file or directory

Invalid Ion Script

$ beamline gen data --seed-auto --start-auto --script-path invalid.ion
Error: Failed to parse Ion script: Invalid syntax at line 5

Invalid Seed Value

$ beamline gen data --seed invalid --start-auto --script-path data.ion
Error: Invalid value 'invalid' for '--seed <SEED>': invalid digit found in string

Debug Output

For troubleshooting, examine the generated seed and start time:

$ beamline gen data --seed-auto --start-auto --script-path sensors.ion --sample-count 2
Seed: 12328924104731257599
Start: 2024-01-20T20:05:41.000000000Z
[2024-01-20 20:07:46.532 +00:00:00] : "sensors" { 'f': -2.5436390152455175, 'i8': 4, 'tick': 125532 }
[2024-01-20 20:09:19.756 +00:00:00] : "sensors" { 'f': -63.49308817145054, 'i8': 4, 'tick': 218756 }

The output shows the seed and start time used, allowing you to reproduce the exact same output later.
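One way to make auto-seeded runs replayable is to capture the reported Seed and Start lines from the output. A sketch, run against a saved log that mirrors the example above (the replay command is echoed rather than executed, in case beamline is not on PATH):

```shell
# Saved output mirroring the example above
cat > run.log <<'EOF'
Seed: 12328924104731257599
Start: 2024-01-20T20:05:41.000000000Z
[2024-01-20 20:07:46.532 +00:00:00] : "sensors" { 'f': -2.5436390152455175, 'i8': 4, 'tick': 125532 }
EOF

# Pull the reported seed and start time out of the log
seed=$(sed -n 's/^Seed: //p' run.log)
start=$(sed -n 's/^Start: //p' run.log)

# Echo (rather than run) the replay command
echo beamline gen data --seed "$seed" --start-iso "$start" \
  --script-path sensors.ion --sample-count 2
```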

Integration Patterns

Shell Scripting

#!/bin/bash
# Generate test data for different scenarios

SEED=12345
START_TIME="2024-01-01T00:00:00Z"

# Generate user data (write to a new file so the input script is not overwritten)
beamline gen data \
  --seed $SEED \
  --start-iso "$START_TIME" \
  --script-path users.ion \
  --sample-count 1000 \
  --output-format ion-pretty > users_data.ion

# Generate transaction data
beamline gen data \
  --seed $((SEED + 1)) \
  --start-iso "$START_TIME" \
  --script-path transactions.ion \
  --sample-count 5000 \
  --output-format ion-pretty > transactions_data.ion

echo "Data generation completed!"

Pipeline Integration

# Generate data and pipe to other tools
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path events.ion \
  --sample-count 1000 \
  --output-format ion-pretty | \
  head -20

# Combine with analysis tools
beamline gen data \
  --seed 100 \
  --start-auto \
  --script-path metrics.ion \
  --sample-count 10000 \
  --output-format text | \
  grep "temperature" | \
  wc -l

Best Practices

1. Always Use Seeds for Reproducible Testing

# Good - explicit seed for test scenarios
beamline gen data --seed 12345 --start-iso "2024-01-01T00:00:00Z" --script-path test.ion

# Avoid - auto seed makes reproduction difficult
beamline gen data --seed-auto --start-auto --script-path test.ion

2. Start Small, Scale Up

# Test with small sample first
beamline gen data --seed 1 --start-auto --script-path new_script.ion --sample-count 10

# Scale up after validation
beamline gen data --seed 1 --start-auto --script-path new_script.ion --sample-count 100000

3. Use Appropriate Output Formats

# Ion formats for data processing
beamline gen data --script-path data.ion --output-format ion-binary

# Text format for debugging
beamline gen data --script-path data.ion --output-format text --sample-count 5

4. Validate Schemas Before Large Generation

# Check schema first
beamline infer-shape --seed-auto --start-auto --script-path data.ion --output-format basic-ddl

# Then generate data
beamline gen data --seed 42 --start-auto --script-path data.ion --sample-count 10000

Next Steps

Now that you understand the CLI overview, explore the specific commands, starting with data generation.

Data Generation Commands

The beamline gen data command generates synthetic data from Ion scripts using stochastic processes. This is the primary command for creating reproducible pseudo-random data in Beamline.

Command Syntax

beamline gen data [OPTIONS]

Required Options

All data generation requires these three configuration groups (exactly one option from each group):

Seed Configuration (Required - choose one)

--seed-auto                    # Generate random seed automatically
--seed <SEED>                  # Use specific numeric seed for reproducibility

Start Time Configuration (Required - choose one)

--start-auto                   # Generate random start time
--start-epoch-ms <EPOCH_MS>    # Use Unix timestamp in milliseconds
--start-iso <ISO_8601>         # Use ISO 8601 format (e.g., 2024-01-01T00:00:00Z)

Script Configuration (Required - choose one)

--script-path <PATH>           # Path to Ion script file
--script <SCRIPT_DATA>         # Inline Ion script content

Optional Parameters

Sample Count

--sample-count <COUNT>         # Number of samples to generate (default: 10)

Output Format

--output-format <FORMAT>       # Output format (default: text)

Available formats:

  • text - Human-readable text format (default)
  • ion - Compact Amazon Ion format
  • ion-pretty - Pretty-printed Ion text format
  • ion-binary - Binary Ion format (most compact)

Dataset Filtering

--dataset <DATASET_NAME>       # Include only specific dataset(s)
                              # Can be used multiple times for multiple datasets

Nullability Configuration (Optional - choose one)

--default-nullable <true|false>    # Set default nullability behavior
--pct-null <PERCENTAGE>            # Percentage of NULL values (0.0-1.0)

Optionality Configuration (Optional - choose one)

--default-optional <true|false>    # Set default optionality behavior  
--pct-optional <PERCENTAGE>        # Percentage of MISSING values (0.0-1.0)

Basic Examples

Simple Data Generation

# Generate 100 samples with automatic seed and start time
beamline gen data \
  --seed-auto \
  --start-auto \
  --script-path sensors.ion \
  --sample-count 100

# Reproducible generation with specific seed
beamline gen data \
  --seed 12345 \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path user_data.ion \
  --sample-count 1000

Different Output Formats

# Text output (human-readable, default)
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path data.ion \
  --output-format text

# Pretty Ion format
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path data.ion \
  --output-format ion-pretty

# Compact binary Ion
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path data.ion \
  --output-format ion-binary

Dataset Filtering

Generate data for specific datasets only:

# Generate data for specific datasets
beamline gen data \
  --seed 45121008347100595 \
  --start-iso "2020-06-16T14:41:51.000000000Z" \
  --script-path client-service.ion \
  --sample-count 10 \
  --dataset service \
  --dataset client_1 \
  --output-format ion-pretty

Advanced Configuration

Controlling NULL Values

# Make all types nullable by default with 10% NULL values
beamline gen data \
  --seed 100 \
  --start-auto \
  --script-path data.ion \
  --pct-null 0.1 \
  --sample-count 500

# Disable nullability entirely
beamline gen data \
  --seed 100 \
  --start-auto \
  --script-path data.ion \
  --default-nullable false \
  --sample-count 500

Controlling MISSING Values

# Make all types optional with 5% MISSING values
beamline gen data \
  --seed 200 \
  --start-auto \
  --script-path data.ion \
  --pct-optional 0.05 \
  --sample-count 500

# Disable optionality entirely
beamline gen data \
  --seed 200 \
  --start-auto \
  --script-path data.ion \
  --default-optional false \
  --sample-count 500

Inline Scripts

For small scripts, you can provide the Ion script content directly:

beamline gen data \
  --seed 300 \
  --start-auto \
  --script 'rand_processes::{ test: rand_process::{ $arrival: HomogeneousPoisson:: { interarrival: seconds::1 }, $data: { id: UniformU8, value: UniformF64 } } }' \
  --sample-count 5 \
  --output-format text

Reproducibility Examples

Exact Reproduction

# First run - note the seed and start time
beamline gen data \
  --seed-auto \
  --start-auto \
  --script-path sensors.ion \
  --sample-count 2

# Output shows:
# Seed: 12328924104731257599
# Start: 2024-01-20T20:05:41.000000000Z
# [data follows...]

# Reproduce exactly the same data
beamline gen data \
  --seed 12328924104731257599 \
  --start-iso "2024-01-20T20:05:41.000000000Z" \
  --script-path sensors.ion \
  --sample-count 2
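When automating this, you can capture the `Seed:` and `Start:` header lines from a saved run instead of copying them by hand. A minimal sketch (`run_log` stands in for output you would normally capture with `tee run.log`; the header format matches the output shown above):

```shell
#!/bin/sh
# Pull the auto-generated seed and start time out of a saved run log so the
# exact invocation can be replayed later.
run_log='Seed: 12328924104731257599
Start: 2024-01-20T20:05:41.000000000Z'

seed=$(printf '%s\n' "$run_log" | awk '/^Seed:/ {print $2}')
start=$(printf '%s\n' "$run_log" | awk '/^Start:/ {print $2}')

echo "replay with: --seed $seed --start-iso \"$start\""
```

In practice you would save the header while generating, e.g. `beamline gen data --seed-auto --start-auto --script-path sensors.ion --sample-count 2 | tee run.log`.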

Reproducible with Different Start Times

# Same seed with a different start time yields the same data pattern, shifted in time

beamline gen data \
  --seed 12345 \
  --start-iso "2023-01-01T00:00:00Z" \
  --script-path events.ion \
  --sample-count 5

beamline gen data \
  --seed 12345 \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path events.ion \
  --sample-count 5
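You can check the "same pattern, different times" claim by stripping the leading timestamp column of the text output before comparing the two runs. A sketch (`run1`/`run2` are hypothetical single lines standing in for the two saved outputs; the `[timestamp] : ...` layout matches the text format shown later in this chapter):

```shell
#!/bin/sh
# Strip the leading "[timestamp]" column, then compare what remains.
run1='[2023-01-01 00:00:05.123 +00:00:00] : "events" { id: 7 }'
run2='[2024-01-01 00:00:05.123 +00:00:00] : "events" { id: 7 }'

strip_ts() { sed 's/^\[[^]]*\] //'; }

a=$(printf '%s\n' "$run1" | strip_ts)
b=$(printf '%s\n' "$run2" | strip_ts)
[ "$a" = "$b" ] && echo "same pattern"
```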

Output Format Details

Text Format (Default)

Human-readable format with timestamps and dataset names:

$ beamline gen data --seed 1234 --start-auto --script-path sensors.ion --sample-count 2
Seed: 1234
Start: 2019-08-01T00:00:01.000000000-07:00
[2019-08-01 7:26:21.964 -07:00:00] : "sensors" { 'f': -2.5436390152455175, 'i8': 4, 'tick': 125532 }
[2019-08-10 5:46:15.24 -07:00:00] : "sensors" { 'f': -63.49308817145054, 'i8': 4, 'tick': 218756 }

Ion Pretty Format

Pretty-printed Ion with metadata:

$ beamline gen data --seed 1234 --start-auto --script-path sensors.ion --sample-count 2 --output-format ion-pretty
{
  seed: 1234,
  start: "2019-08-01T00:00:01.000000000-07:00",
  data: {
    sensors: [
      {
        f: -2.5436390152455175e0,
        i8: 4,
        tick: 125532
      },
      {
        f: -63.49308817145054e0,
        i8: 4,
        tick: 218756
      }
    ]
  }
}

Ion and Ion Binary Formats

  • ion - Compact text Ion without pretty printing
  • ion-binary - Binary Ion format (most space-efficient)

Both formats preserve all Ion type information and are suitable for programmatic processing.

Static Data Generation

Beamline supports static data generation (data generated before simulation starts):

# Generate data with static customer table and dynamic orders
beamline gen data \
  --seed 1234 \
  --start-iso "2019-08-01T00:00:01-07:00" \
  --script-path orders.ion \
  --sample-count 30 \
  --output-format text

Static data appears first with the same timestamp, followed by temporally-distributed dynamic data.

Error Handling

Common Error Scenarios

Missing Script File

$ beamline gen data --seed-auto --start-auto --script-path nonexistent.ion
Error: Failed to read script file 'nonexistent.ion': No such file or directory (os error 2)

Invalid Ion Syntax

$ beamline gen data --seed-auto --start-auto --script-path invalid.ion
Error: Failed to parse Ion script: Invalid Ion syntax at line 5, column 10

Missing Required Arguments

$ beamline gen data --script-path data.ion
Error: One of --seed-auto or --seed is required
Error: One of --start-auto, --start-epoch-ms, or --start-iso is required

Invalid Percentage Values

$ beamline gen data --seed-auto --start-auto --script-path data.ion --pct-null 1.5
Error: Percents must be between 0 and 1: `1.5`
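In wrapper scripts it can be useful to mirror this range check before invoking the CLI, so a bad value fails fast with your own message. A minimal sketch (the `check_pct` helper is hypothetical, not part of Beamline):

```shell
#!/bin/sh
# Reject percentages outside [0, 1] before they reach the CLI.
# awk compares numerically and signals validity via its exit status.
check_pct() {
  printf '%s\n' "$1" | awk '{exit !($1 >= 0 && $1 <= 1)}'
}

if check_pct 0.1; then echo "0.1 ok"; fi
if ! check_pct 1.5; then echo "1.5 rejected"; fi
```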

Debugging Tips

  1. Start Small: Use --sample-count 5 to quickly test scripts
  2. Use Text Format: Default text format is easiest to read for debugging
  3. Check Seeds: Note auto-generated seeds for reproduction
  4. Validate Scripts: Use infer-shape to check script syntax first

Integration Patterns

Shell Scripting

#!/bin/bash
set -e

SCRIPT_PATH="simulation.ion"
OUTPUT_DIR="./generated_data"
SEED=12345

mkdir -p "$OUTPUT_DIR"

# Generate different datasets
echo "Generating user data..."
beamline gen data \
  --seed $SEED \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path "$SCRIPT_PATH" \
  --dataset users \
  --sample-count 1000 \
  --output-format ion-pretty > "$OUTPUT_DIR/users.ion"

echo "Generating transaction data..."
beamline gen data \
  --seed $((SEED + 1)) \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path "$SCRIPT_PATH" \
  --dataset transactions \
  --sample-count 5000 \
  --output-format ion-pretty > "$OUTPUT_DIR/transactions.ion"

echo "Data generation completed!"

Pipeline Processing

# Generate and process data in pipeline
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path metrics.ion \
  --sample-count 1000 \
  --output-format text | \
  grep "temperature" | \
  awk '{ print $NF }' | \
  head -10

# Save generated data to a file while previewing the first lines
# (write to a file distinct from the input script; tee truncates its target)
beamline gen data \
  --seed 100 \
  --start-auto \
  --script-path data.ion \
  --sample-count 1000 \
  --output-format ion-pretty | \
  tee generated.ion | \
  head -20

Testing Workflows

# Generate test data with specific characteristics
generate_test_data() {
  local seed=$1
  local sample_count=$2
  local script=$3
  
  beamline gen data \
    --seed "$seed" \
    --start-iso "2024-01-01T00:00:00Z" \
    --script-path "$script" \
    --sample-count "$sample_count" \
    --default-nullable false \
    --default-optional false \
    --output-format ion-pretty
}

# Use in tests (redirect to files distinct from the input scripts)
generate_test_data 12345 100 "test_users.ion" > users_data.ion
generate_test_data 12346 50 "test_orders.ion" > orders_data.ion

Performance Considerations

Sample Count Impact

  • Small counts (--sample-count 10-100): Near-instantaneous
  • Medium counts (--sample-count 1000-10000): Seconds
  • Large counts (--sample-count 100000+): Minutes, depending on script complexity

Output Format Performance

  1. ion-binary - Fastest and most compact
  2. ion - Fast, compact text format
  3. text - Moderate performance, human-readable
  4. ion-pretty - Slowest due to formatting overhead

Memory Usage

Beamline streams data generation, so memory usage stays constant regardless of sample count. Large datasets are processed incrementally.
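A practical consequence of streaming is that a downstream consumer can stop reading early and the producer terminates via a broken pipe rather than buffering everything. The same behavior can be sketched with `yes` standing in for an infinite producer:

```shell
#!/bin/sh
# `yes` emits lines forever; `head` exits after three, and the resulting
# SIGPIPE stops the producer. A streaming generator in a pipeline behaves
# the same way, so previewing huge runs is cheap.
yes 'sample' | head -3
```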

Best Practices

1. Use Specific Seeds for Testing

# Good - reproducible
beamline gen data --seed 12345 --start-iso "2024-01-01T00:00:00Z" --script-path test.ion

# Avoid - non-reproducible
beamline gen data --seed-auto --start-auto --script-path test.ion

2. Start with Small Sample Counts

# Validate script first
beamline gen data --seed 1 --start-auto --script-path new_script.ion --sample-count 5

# Scale up after validation
beamline gen data --seed 1 --start-auto --script-path new_script.ion --sample-count 10000

3. Use Appropriate Output Formats

# Human inspection
beamline gen data --script-path data.ion --output-format text --sample-count 10

# Data processing
beamline gen data --script-path data.ion --output-format ion-binary --sample-count 100000

# Configuration files
beamline gen data --script-path data.ion --output-format ion-pretty --sample-count 1000

4. Document Your Seeds

# Good practice - document seeds used
# User test data: seed 2024001
# Integration test data: seed 2024002  
# Performance test data: seed 2024003

beamline gen data --seed 2024001 --start-auto --script-path users.ion
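One way to keep documented seeds from drifting out of date is to record them in a small sourced file that every script reads. A sketch (`seeds.env` is a hypothetical convention, not a Beamline feature):

```shell
#!/bin/sh
# seeds.env: one shared place to record documented seeds and their purpose.
cat > seeds.env <<'EOF'
USER_TEST_SEED=2024001        # user test data
INTEGRATION_TEST_SEED=2024002 # integration test data
PERF_TEST_SEED=2024003        # performance test data
EOF

. ./seeds.env
echo "$USER_TEST_SEED"
```

Scripts then use `beamline gen data --seed "$USER_TEST_SEED" --start-auto --script-path users.ion`, so the seed and its documentation live together.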

Next Steps

Now that you understand data generation commands, explore:

Query Commands

The beamline query command generates PartiQL queries that match the shapes and types of data defined in Ion scripts. This allows you to create realistic queries for testing PartiQL implementations.

Command Syntax

beamline query basic [OPTIONS] <STRATEGY>

Required Options

Query generation requires the same core configuration as data generation:

Seed Configuration (Required - choose one)

--seed-auto                    # Generate random seed automatically
--seed <SEED>                  # Use specific numeric seed for reproducibility

Start Time Configuration (Required - choose one)

--start-auto                   # Generate random start time
--start-epoch-ms <EPOCH_MS>    # Use Unix timestamp in milliseconds
--start-iso <ISO_8601>         # Use ISO 8601 format (e.g., 2024-01-01T00:00:00Z)

Script Configuration (Required - choose one)

--script-path <PATH>           # Path to Ion script file
--script <SCRIPT_DATA>         # Inline Ion script content

Sample Count

--sample-count <COUNT>         # Number of queries to generate (default: 10)

Query Strategies

Beamline supports four different query generation strategies:

1. rand-select-all-fw - SELECT * with WHERE

Generates SELECT * queries with randomly generated WHERE clauses.

beamline query basic \
  --seed 1234 \
  --start-auto \
  --script-path simple_transactions.ion \
  --sample-count 3 \
  rand-select-all-fw \
    --tbl-flt-rand-min 1 \
    --tbl-flt-rand-max 1 \
    --tbl-flt-path-depth-max 1 \
    --tbl-flt-pathstep-internal-all \
    --tbl-flt-pathstep-final-project \
    --tbl-flt-type-final-scalar \
    --pred-lt

Example Output:

SELECT * FROM test_data AS test_data WHERE (test_data.marketplace_id < -5)

SELECT * FROM test_data AS test_data WHERE (test_data.price < 18.418581624952935)

SELECT * FROM test_data AS test_data WHERE (test_data.price < 15.495327785402296)

2. rand-sfw - SELECT fields FROM WHERE

Generates queries with random projections and WHERE clauses.

beamline query basic \
  --seed 1234 \
  --start-auto \
  --script-path simple_transactions.ion \
  --sample-count 3 \
  rand-sfw \
    --project-rand-min 2 \
    --project-rand-max 5 \
    --project-path-depth-min 1 \
    --project-path-depth-max 1 \
    --project-pathstep-internal-all \
    --project-pathstep-final-all \
    --project-type-final-all \
    --tbl-flt-rand-min 2 \
    --tbl-flt-rand-max 5 \
    --tbl-flt-path-depth-max 1 \
    --tbl-flt-pathstep-internal-all \
    --tbl-flt-pathstep-final-project \
    --tbl-flt-type-final-scalar \
    --pred-all

Example Output:

SELECT test_data.completed, test_data.completed FROM test_data AS test_data
WHERE (NOT (test_data.completed) OR NOT ((test_data.created_at IS MISSING)))

SELECT test_data.completed, test_data.marketplace_id, test_data.created_at
FROM test_data AS test_data WHERE (NOT ((test_data.transaction_id IS NULL)) OR
  (((test_data.transaction_id IN ['Iam in.', 'Se.']) OR 
      NOT ((test_data.description IS NULL))) OR
    (test_data.marketplace_id >= 28)))

3. rand-select-all-efw - SELECT * EXCLUDE WHERE

Generates SELECT * EXCLUDE queries with WHERE clauses.

beamline query basic \
  --seed 1234 \
  --start-auto \
  --script-path simple_transactions.ion \
  --sample-count 3 \
  rand-select-all-efw \
    --tbl-flt-rand-min 1 \
    --tbl-flt-rand-max 1 \
    --tbl-flt-path-depth-max 1 \
    --tbl-flt-pathstep-internal-all \
    --tbl-flt-pathstep-final-project \
    --tbl-flt-type-final-scalar \
    --pred-lt \
    --exclude-rand-min 1 \
    --exclude-rand-max 3 \
    --exclude-path-depth-min 1 \
    --exclude-path-depth-max 1 \
    --exclude-pathstep-internal-all \
    --exclude-pathstep-final-all \
    --exclude-type-final-all

4. rand-sefw - SELECT EXCLUDE FROM WHERE

Generates queries with projections, exclusions, and WHERE clauses.

beamline query basic \
  --seed 1234 \
  --start-auto \
  --script-path simple_transactions.ion \
  --sample-count 3 \
  rand-sefw \
    --project-rand-min 2 \
    --project-rand-max 5 \
    --project-path-depth-min 1 \
    --project-path-depth-max 1 \
    --project-pathstep-internal-all \
    --project-pathstep-final-all \
    --project-type-final-all \
    --tbl-flt-rand-min 2 \
    --tbl-flt-rand-max 5 \
    --tbl-flt-path-depth-max 1 \
    --tbl-flt-pathstep-internal-all \
    --tbl-flt-pathstep-final-project \
    --tbl-flt-type-final-scalar \
    --pred-all \
    --exclude-rand-min 1 \
    --exclude-rand-max 3 \
    --exclude-path-depth-min 1 \
    --exclude-path-depth-max 1 \
    --exclude-pathstep-internal-all \
    --exclude-pathstep-final-all \
    --exclude-type-final-all

Parameter Reference

Table Filter Parameters

Control WHERE clause generation:

--tbl-flt-rand-min <N>              # Minimum number of predicates (1-255)
--tbl-flt-rand-max <N>              # Maximum number of predicates (1-255)
--tbl-flt-path-depth-max <N>        # Maximum path depth (1-255)

# Path step types (internal positions)
--tbl-flt-pathstep-internal-all     # Enable all internal path step types
--tbl-flt-pathstep-internal-project # Enable projection steps (.field)
--tbl-flt-pathstep-internal-index   # Enable index steps ([1])
--tbl-flt-pathstep-internal-foreach # Enable for-each steps ([*])
--tbl-flt-pathstep-internal-unpivot # Enable unpivot steps (.*)

# Path step types (final positions)
--tbl-flt-pathstep-final-all        # Enable all final path step types  
--tbl-flt-pathstep-final-project    # Enable projection steps (.field)
--tbl-flt-pathstep-final-index      # Enable index steps ([1])
--tbl-flt-pathstep-final-foreach    # Enable for-each steps ([*])
--tbl-flt-pathstep-final-unpivot    # Enable unpivot steps (.*)

# Type constraints
--tbl-flt-type-final-all            # Allow all final types
--tbl-flt-type-final-scalar         # Allow scalar final types only
--tbl-flt-type-final-sequence       # Allow sequence final types
--tbl-flt-type-final-struct         # Allow struct final types

Predicate Types

Control which predicates can be generated:

--pred-all                 # Enable all predicates
--pred-lt                  # Less than (<)
--pred-lte                 # Less than or equal (<=)
--pred-gt                  # Greater than (>)
--pred-gte                 # Greater than or equal (>=)
--pred-eq                  # Equal (=)
--pred-neq                 # Not equal (<>)
--pred-between             # BETWEEN predicate
--pred-like                # LIKE predicate
--pred-not-like            # NOT LIKE predicate
--pred-in                  # IN predicate
--pred-not-in              # NOT IN predicate
--pred-is-null             # IS NULL
--pred-is-not-null         # IS NOT NULL
--pred-is-missing          # IS MISSING
--pred-is-not-missing      # IS NOT MISSING
--pred-logical-and         # AND operator
--pred-logical-or          # OR operator
--pred-logical-not         # NOT operator

Projection Parameters (for rand-sfw and rand-sefw)

Control SELECT clause generation:

--project-rand-min <N>              # Minimum projections (1-255)
--project-rand-max <N>              # Maximum projections (1-255)
--project-path-depth-min <N>        # Minimum path depth
--project-path-depth-max <N>        # Maximum path depth

# Same path step and type options as table filters
--project-pathstep-internal-all     # Enable all internal path steps
--project-pathstep-final-all        # Enable all final path steps
--project-type-final-all            # Allow all final types

Exclusion Parameters (for rand-select-all-efw and rand-sefw)

Control EXCLUDE clause generation:

--exclude-rand-min <N>              # Minimum exclusions (1-255)
--exclude-rand-max <N>              # Maximum exclusions (1-255)
--exclude-path-depth-min <N>        # Minimum path depth
--exclude-path-depth-max <N>        # Maximum path depth

# Same path step and type options as table filters
--exclude-pathstep-internal-all     # Enable all internal path steps
--exclude-pathstep-final-all        # Enable all final path steps
--exclude-type-final-all            # Allow all final types

Complex Examples

Deep Path Generation

For nested data structures, control path depth:

beamline query basic \
  --seed 1234 \
  --start-auto \
  --script-path transactions.ion \
  --sample-count 3 \
  rand-sefw \
    --project-rand-min 2 \
    --project-rand-max 5 \
    --project-path-depth-min 1 \
    --project-path-depth-max 10 \
    --project-pathstep-internal-all \
    --project-pathstep-final-all \
    --project-type-final-all \
    --tbl-flt-rand-min 2 \
    --tbl-flt-rand-max 5 \
    --tbl-flt-path-depth-max 10 \
    --tbl-flt-pathstep-internal-all \
    --tbl-flt-pathstep-final-project \
    --tbl-flt-type-final-scalar \
    --pred-all \
    --exclude-rand-min 1 \
    --exclude-rand-max 2 \
    --exclude-path-depth-min 3 \
    --exclude-path-depth-max 4 \
    --exclude-pathstep-internal-all \
    --exclude-pathstep-final-unpivot \
    --exclude-type-final-all

This generates queries with deeply nested paths like:

SELECT test_data.*.nested_struct.nested_struct.nested_struct.nested_struct.nested_struct.*,
  test_data.test_nest_struct.*.*.nested_struct.nested_struct
EXCLUDE test_data.*.*.*.*, test_data.price.* 
FROM test_data AS test_data
WHERE ((test_data.test_nest_struct.*.*.*.nested_struct.*.test_int <> 19) OR
  (test_data.test_nest_struct.*.*.nested_struct.*.*.test_int > 35))

Reproducible Query Generation

Use specific seeds for consistent query generation:

# Generate same queries each time
beamline query basic \
  --seed 12345 \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path data.ion \
  --sample-count 5 \
  rand-select-all-fw \
    --tbl-flt-rand-min 1 \
    --tbl-flt-rand-max 3 \
    --pred-all

Best Practices

1. Start with Simple Queries

# Begin with basic queries
beamline query basic \
  --seed 1 \
  --start-auto \
  --script-path data.ion \
  --sample-count 5 \
  rand-select-all-fw \
    --tbl-flt-rand-min 1 \
    --tbl-flt-rand-max 1 \
    --pred-eq

2. Match Query Complexity to Data Structure

# Simple data = simple paths
--project-path-depth-max 2

# Complex nested data = deeper paths  
--project-path-depth-max 6

3. Use Appropriate Predicates for Testing

# For numeric testing
--pred-lt --pred-gt --pred-between

# For comprehensive testing
--pred-all

4. Validate Generated Queries

Test generated queries against your data to ensure they’re valid and meaningful.

Integration with Data Generation

Combine query and data generation for complete testing:

# Generate test data (redirect to a file distinct from the input script)
beamline gen data \
  --seed 100 \
  --start-auto \
  --script-path test_data.ion \
  --sample-count 1000 \
  --output-format ion-pretty > generated_data.ion

# Generate matching queries  
beamline query basic \
  --seed 101 \
  --start-auto \
  --script-path test_data.ion \
  --sample-count 20 \
  rand-select-all-fw \
    --tbl-flt-rand-min 1 \
    --tbl-flt-rand-max 3 \
    --pred-all > test_queries.sql

Next Steps

Shape Commands

The beamline infer-shape command analyzes Ion scripts to infer their data schemas without generating full datasets. This is useful for understanding data structures, creating database schemas, and validating script configurations.

Command Syntax

beamline infer-shape [OPTIONS]

Required Options

Shape inference uses the same core configuration as data generation:

Seed Configuration (Required - choose one)

--seed-auto                    # Generate random seed automatically
--seed <SEED>                  # Use specific numeric seed for reproducibility

Start Time Configuration (Required - choose one)

--start-auto                   # Generate random start time
--start-epoch-ms <EPOCH_MS>    # Use Unix timestamp in milliseconds
--start-iso <ISO_8601>         # Use ISO 8601 format (e.g., 2024-01-01T00:00:00Z)

Script Configuration (Required - choose one)

--script-path <PATH>           # Path to Ion script file
--script <SCRIPT_DATA>         # Inline Ion script content

Optional Parameters

Output Format

--output-format <FORMAT>       # Shape output format (default: text)

Available formats:

  • text - Human-readable debug format (default)
  • basic-ddl - SQL DDL format for database schemas
  • beamline-json - Beamline JSON format for testing

Nullability and Optionality

--default-nullable <true|false>    # Set default nullability behavior
--pct-null <PERCENTAGE>            # Percentage of NULL values (0.0-1.0)
--default-optional <true|false>    # Set default optionality behavior  
--pct-optional <PERCENTAGE>        # Percentage of MISSING values (0.0-1.0)

Output Formats

Text Format (Default)

Provides detailed type information in Rust debug format:

$ beamline infer-shape --seed-auto --start-auto --script-path sensors.ion
Seed: 17685918364143248531
Start: 2022-12-12T19:52:29.000000000Z
{
    "sensors": PartiqlType(
        Bag(
            BagType {
                element_type: PartiqlType(
                    Struct(
                        StructType {
                            constraints: {
                                Fields(
                                    {
                                        StructField {
                                            name: "d",
                                            ty: PartiqlType(
                                                DecimalP(2, 0),
                                            ),
                                        },
                                        StructField {
                                            name: "f",
                                            ty: PartiqlType(
                                                Float64,
                                            ),
                                        },
                                        StructField {
                                            name: "i8",
                                            ty: PartiqlType(
                                                Int64,
                                            ),
                                        },
                                    },
                                ),
                            },
                        },
                    ),
                ),
            },
        ),
    ),
}

Use Cases:

  • Development and debugging
  • Understanding complex data structures
  • Validating script configurations

Basic DDL Format

Generates SQL DDL statements for database schema creation:

$ beamline infer-shape \
    --seed 7844265201457918498 \
    --start-auto \
    --script-path sensors-nested.ion \
    --output-format basic-ddl

-- Seed: 7844265201457918498
-- Start: 2024-01-01T06:53:06.000000000Z
-- Syntax: partiql_datatype_syntax.0.1
-- Dataset: sensors
"f" DOUBLE,
"i8" INT8,
"id" INT,
"sub" STRUCT<"f": DOUBLE,"o": INT8>,
"tick" INT8

Use Cases:

  • Creating database tables
  • Database schema documentation
  • SQL migration scripts
  • Data warehouse setup

Beamline JSON Format

Structured JSON format used by PartiQL testing tools:

$ beamline infer-shape \
    --seed-auto \
    --start-auto \
    --script-path sensors.ion \
    --output-format beamline-json

{
  seed: -3711181901898679775,
  start: 2022-05-22T13:49:57.000000000+00:00,
  shapes: {
    sensors: partiql::shape::v0::{
      type: "bag",
      items: {
        type: "struct",
        constraints: [
          ordered,
          closed
        ],
        fields: [
          {
            name: "d",
            type: "decimal(2, 0)"
          },
          {
            name: "f",
            type: "double"
          },
          {
            name: "i8",
            type: "int8"
          },
          {
            name: "tick",
            type: "int8"
          },
          {
            name: "w",
            type: "decimal(5, 4)"
          }
        ]
      }
    }
  }
}

Use Cases:

  • PartiQL conformance testing
  • Tool integration
  • Automated schema validation

Examples

Basic Shape Inference

# Get basic shape information
beamline infer-shape \
  --seed-auto \
  --start-auto \
  --script-path my_data.ion

# Get reproducible shape with specific seed
beamline infer-shape \
  --seed 12345 \
  --start-auto \
  --script-path my_data.ion \
  --output-format text

Database Schema Generation

# Generate SQL DDL for database creation
beamline infer-shape \
  --seed 100 \
  --start-auto \
  --script-path ecommerce.ion \
  --output-format basic-ddl > schema.sql

# Use in database creation
psql -d mydb -f schema.sql
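Note that basic-ddl emits a bare column list plus `--` comment headers (as the earlier sample output shows), so you may need to wrap it in a CREATE TABLE statement before a database will accept it. A minimal sketch (the table name and the inlined `cols` value are illustrative; in practice the columns would come from `beamline infer-shape ... --output-format basic-ddl | grep -v '^--'`):

```shell
#!/bin/sh
# Wrap a basic-ddl column list in a CREATE TABLE statement.
cols='"f" DOUBLE,
"i8" INT8,
"tick" INT8'

{
  echo 'CREATE TABLE "sensors" ('
  printf '%s\n' "$cols"
  echo ');'
}
```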

Multiple Dataset Schemas

# Infer shapes for complex multi-dataset scripts
beamline infer-shape \
  --seed 42 \
  --start-auto \
  --script-path client-service.ion \
  --output-format basic-ddl

This outputs schemas for all datasets defined in the script:

-- Dataset: service
"Account" VARCHAR,
"Operation" VARCHAR,
"Program" VARCHAR,
"Request" VARCHAR,
"StartTime" TIMESTAMP,
"client" VARCHAR,
"success" BOOL

-- Dataset: client_0
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL

-- Dataset: client_1
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL

Schema with Nullability and Optionality

Configure NULL and MISSING value behavior in schema:

# Schema with all types nullable and optional
beamline infer-shape \
  --seed 1 \
  --start-auto \
  --script-path data.ion \
  --default-nullable true \
  --default-optional true \
  --output-format basic-ddl

# Output includes nullable/optional markers
"age" OPTIONAL TINYINT,
"name" OPTIONAL VARCHAR NULL,
"active" OPTIONAL BOOL

Schema Validation Workflow

Use shape inference to validate scripts before large data generation:

# 1. Validate script syntax and structure
beamline infer-shape \
  --seed-auto \
  --start-auto \
  --script-path new_script.ion

# 2. Generate SQL schema
beamline infer-shape \
  --seed 1 \
  --start-auto \
  --script-path new_script.ion \
  --output-format basic-ddl > schema.sql

# 3. Generate small sample to verify
beamline gen data \
  --seed 1 \
  --start-auto \
  --script-path new_script.ion \
  --sample-count 5

# 4. Generate full dataset
beamline gen data \
  --seed 1 \
  --start-auto \
  --script-path new_script.ion \
  --sample-count 100000

Integration Patterns

Database Schema Creation

#!/bin/bash
# generate-database-schema.sh

SCRIPT="$1"
OUTPUT_DIR="./schemas"

if [ -z "$SCRIPT" ]; then
  echo "Usage: $0 <script.ion>"
  exit 1
fi

mkdir -p "$OUTPUT_DIR"

# Generate DDL schema
echo "Generating database schema for $SCRIPT..."
beamline infer-shape \
  --seed 1 \
  --start-auto \
  --script-path "$SCRIPT" \
  --output-format basic-ddl > "$OUTPUT_DIR/$(basename "$SCRIPT" .ion).sql"

# Generate Beamline JSON for testing
beamline infer-shape \
  --seed 1 \
  --start-auto \
  --script-path "$SCRIPT" \
  --output-format beamline-json > "$OUTPUT_DIR/$(basename "$SCRIPT" .ion).json"

echo "Schemas generated in $OUTPUT_DIR/"

CI/CD Schema Validation

#!/bin/bash
# validate-schemas.sh - CI pipeline script

set -e

echo "Validating Ion scripts..."

for script in scripts/*.ion; do
  echo "Checking $script..."
  
  # Validate script can generate valid schema
  if ! beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$script" \
    --output-format text > /dev/null; then
    echo "ERROR: Invalid script $script"
    exit 1
  fi
  
  echo "✓ $script is valid"
done

echo "All scripts validated successfully!"

Documentation Generation

# Generate documentation for all data scripts
for script in data_scripts/*.ion; do
  name=$(basename "$script" .ion)
  
  echo "## $name Dataset" >> SCHEMAS.md
  echo '```sql' >> SCHEMAS.md
  
  beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$script" \
    --output-format basic-ddl >> SCHEMAS.md
    
  echo '```' >> SCHEMAS.md
  echo "" >> SCHEMAS.md
done

Error Handling

Common Errors

Script Syntax Errors

$ beamline infer-shape --seed-auto --start-auto --script-path invalid.ion
Error: Failed to parse Ion script: Invalid Ion syntax at line 5, column 10

Missing Required Options

$ beamline infer-shape --script-path data.ion
Error: One of --seed-auto or --seed is required
Error: One of --start-auto, --start-epoch-ms, or --start-iso is required

Invalid Output Format

$ beamline infer-shape --seed-auto --start-auto --script-path data.ion --output-format invalid
Error: 'invalid' isn't a valid value for '--output-format <OUTPUT_FORMAT>'

Performance Considerations

Shape inference is very fast since it doesn’t generate actual data:

  • Script Parsing: Milliseconds for typical scripts
  • Type Inference: Nearly instantaneous
  • Output Generation: Minimal overhead

This makes shape inference ideal for:

  • Quick script validation
  • CI/CD pipeline checks
  • Interactive development workflows
  • Documentation generation

Best Practices

1. Validate Scripts Early

# Always infer shape before generating large datasets
beamline infer-shape --seed 1 --start-auto --script-path new_script.ion

2. Use Appropriate Output Formats

# DDL for database work
beamline infer-shape --seed 1 --start-auto --script-path data.ion --output-format basic-ddl

# Text for debugging
beamline infer-shape --seed 1 --start-auto --script-path data.ion --output-format text

# JSON for automation
beamline infer-shape --seed 1 --start-auto --script-path data.ion --output-format beamline-json

3. Document Your Schemas

Save schema outputs for reference and version control:

beamline infer-shape \
  --seed 1 \
  --start-auto \
  --script-path production_data.ion \
  --output-format basic-ddl > docs/production_schema.sql

4. Use Consistent Seeds

For reproducible schema documentation:

# Always use seed 1 for schema documentation
beamline infer-shape --seed 1 --start-auto --script-path data.ion --output-format basic-ddl

Next Steps

Database Commands

The beamline gen db beamline-lite command creates a complete BeamlineLite database containing both synthetic data and inferred schemas, giving you a self-contained local database for testing and development.

Command Syntax

beamline gen db beamline-lite [OPTIONS]

Required Options

Database generation uses the same core configuration as data generation:

Seed Configuration (Required - choose one)

--seed-auto                    # Generate random seed automatically
--seed <SEED>                  # Use specific numeric seed for reproducibility

Start Time Configuration (Required - choose one)

--start-auto                   # Generate random start time
--start-epoch-ms <EPOCH_MS>    # Use Unix timestamp in milliseconds
--start-iso <ISO_8601>         # Use ISO 8601 format (e.g., 2024-01-01T00:00:00Z)

Script Configuration (Required - choose one)

--script-path <PATH>           # Path to Ion script file
--script <SCRIPT_DATA>         # Inline Ion script content

Optional Parameters

Sample Count

--sample-count <COUNT>         # Number of samples to generate (default: 10)

Catalog Configuration

--catalog_name <NAME>          # Name of the catalog directory (default: "beamline-catalog")
--catalog_path <PATH>          # Path where catalog will be created (default: ".")
--force                        # Overwrite existing catalog (creates backup first)

Target

--target filesystem            # Create filesystem-based database (default and only option)

Nullability and Optionality

--default-nullable <true|false>    # Set default nullability behavior
--pct-null <PERCENTAGE>            # Percentage of NULL values (0.0-1.0)
--default-optional <true|false>    # Set default optionality behavior  
--pct-optional <PERCENTAGE>        # Percentage of MISSING values (0.0-1.0)

What Gets Created

A BeamlineLite database consists of multiple files in a catalog directory:

Catalog Structure

beamline-catalog/
├── .beamline-manifest         # Metadata (seed, start time, DDL syntax version)
├── .beamline-script           # Original Ion script used for generation
├── <dataset_name>.ion         # Data files (one per dataset)
├── <dataset_name>.shape.ion   # Schema files in Ion format
└── <dataset_name>.shape.sql   # Schema files in SQL DDL format

Example Catalog Contents

After running:

beamline gen db beamline-lite \
  --seed-auto \
  --start-auto \
  --script-path client-service.ion \
  --sample-count 1000

Generated files:

beamline-catalog/
├── .beamline-manifest
├── .beamline-script  
├── service.ion
├── service.shape.ion
├── service.shape.sql
├── client_0.ion
├── client_0.shape.ion
├── client_0.shape.sql
├── client_1.ion
├── client_1.shape.ion
├── client_1.shape.sql
└── ... (more client datasets)
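Given this layout, a catalog can be sanity-checked by confirming every data file has its two companion schema files. A sketch (it builds a throwaway catalog so the loop is runnable stand-alone; against a real catalog you would set `catalog` to its path):

```shell
#!/bin/sh
# Check that each <dataset>.ion has matching .shape.ion and .shape.sql files.
catalog=$(mktemp -d)/beamline-catalog
mkdir -p "$catalog"
touch "$catalog/service.ion" "$catalog/service.shape.ion" "$catalog/service.shape.sql"

for data in "$catalog"/*.ion; do
  case "$data" in *.shape.ion) continue ;; esac   # skip the schema files
  base=${data%.ion}
  if [ -f "$base.shape.ion" ] && [ -f "$base.shape.sql" ]; then
    echo "ok: $(basename "$base")"
  else
    echo "missing schema: $(basename "$base")"
  fi
done
```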

File Contents

Manifest File

Contains generation metadata:

$ cat beamline-catalog/.beamline-manifest
{"seed": "949665520117506306", "start": "2023-02-06T12:52:29.000000000Z", "ddl_syntax.version": "partiql_datatype_syntax.0.1"}
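Because the manifest is a single JSON line in this shape, its values can be recovered with a little sed when you want to replay a catalog's generation. A sketch (the `manifest` variable is inlined; in practice you would read `beamline-catalog/.beamline-manifest`):

```shell
#!/bin/sh
# Extract the seed and start time from a manifest line.
manifest='{"seed": "949665520117506306", "start": "2023-02-06T12:52:29.000000000Z", "ddl_syntax.version": "partiql_datatype_syntax.0.1"}'

seed=$(printf '%s' "$manifest" | sed -n 's/.*"seed": "\([^"]*\)".*/\1/p')
start=$(printf '%s' "$manifest" | sed -n 's/.*"start": "\([^"]*\)".*/\1/p')

echo "$seed"
echo "$start"
```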

Script File

Original Ion script used for generation:

$ cat beamline-catalog/.beamline-script
rand_processes::{
    // generate between 5 & 20 customers
    $n: UniformU8::{ low: 5, high: 20 },
    
    // A generator for client ids
    $id_gen: UUID,
    
    // ... rest of script
}

Data Files

Generated synthetic data in Ion format:

$ cat beamline-catalog/client_0.ion
{success: true, id: "7dbd12cf-b506-22ad-2d81-b0a1cd259697", request_id: "0de35d1e-a87c-e540-734d-6f2a4fa410c3", request_time: 2021-01-05T03:55:01.035000000+00:00}
{success: true, id: "7dbd12cf-b506-22ad-2d81-b0a1cd259697", request_id: "3539cdf0-6f7e-6bdc-c25a-4e0b7d8f8bac", request_time: 2021-01-05T03:55:01.182000000+00:00}

Schema Files

Ion format schema:

$ cat beamline-catalog/client_0.shape.ion
{
  type: "bag",
  items: {
    type: "struct",
    constraints: [ordered, closed],
    fields: [
      { name: "id", type: "string" },
      { name: "request_id", type: "string" },
      { name: "request_time", type: "datetime" },
      { name: "success", type: "bool" }
    ]
  }
}

SQL DDL format schema:

$ cat beamline-catalog/service.shape.sql
"Account" VARCHAR,
"Distance" DECIMAL(2, 0),
"Operation" VARCHAR,
"Program" VARCHAR,
"Request" VARCHAR,
"StartTime" TIMESTAMP,
"Weight" DECIMAL(5, 4),
"client" VARCHAR,
"success" BOOL

Examples

Basic Database Creation

# Create database with default settings
beamline gen db beamline-lite \
  --seed-auto \
  --start-auto \
  --script-path my_data.ion \
  --sample-count 1000

# Creates ./beamline-catalog/ with all files

Custom Catalog Configuration

# Create database in custom location
beamline gen db beamline-lite \
  --seed 12345 \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path production_sim.ion \
  --sample-count 50000 \
  --catalog-name production-data \
  --catalog-path ./databases/
  
# Creates ./databases/production-data/ with all files

Reproducible Database Creation

# Create reproducible test database
beamline gen db beamline-lite \
  --seed 2024 \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path test_suite.ion \
  --sample-count 10000 \
  --catalog-name test-db-2024 \
  --default-nullable false \
  --default-optional false

Overwriting and Backup

Safe Overwrite with Backup

The CLI protects existing catalogs by default:

$ beamline gen db beamline-lite --seed-auto --start-auto --script-path data.ion
creating directory ./beamline-catalog/ failed with the following error:
File exists (os error 17)

Use --force to overwrite with automatic backup:

$ beamline gen db beamline-lite \
    --seed-auto \
    --start-auto \
    --script-path data.ion \
    --force

command is using --force ...
Beamline catalog ./beamline-catalog/ exists, backing it up to "beamline-catalog.2024-05-10T22:15:54.019316000Z.bkp"...
back up completed
writing manifest file ./beamline-catalog/.beamline-manifest ...[COMPLETED]
writing script file ./beamline-catalog/.beamline-script ...[COMPLETED]
writing shape file(s)...[COMPLETED]
writing data file(s)...[COMPLETED]
done!

Database Structure Analysis

Examine Generated Database

# View catalog structure
tree beamline-catalog/

# Examine manifest
cat beamline-catalog/.beamline-manifest

# Check a data file
head -5 beamline-catalog/service.ion

# Check schema
cat beamline-catalog/service.shape.sql

Validate Database Consistency

# Count records in each dataset
for data_file in beamline-catalog/*.ion; do
  if [[ "$data_file" != *".shape.ion"* ]]; then
    echo "$(basename "$data_file"): $(wc -l < "$data_file") records"
  fi
done

Integration Patterns

Testing Database Setup

#!/bin/bash
# setup-test-database.sh

TEST_SEED=12345
TEST_START="2024-01-01T00:00:00Z"
TEST_SAMPLES=10000

echo "Creating test database..."

# Clean up any existing test database
rm -rf test-database/

# Generate test database
beamline gen db beamline-lite \
  --seed $TEST_SEED \
  --start-iso $TEST_START \
  --script-path test_data_spec.ion \
  --sample-count $TEST_SAMPLES \
  --catalog-name test-database \
  --catalog-path . \
  --default-nullable false

echo "Test database created in ./test-database/"
echo "Records generated: $TEST_SAMPLES"
echo "Seed used: $TEST_SEED"
echo "Start time: $TEST_START"

Multi-Environment Database Generation

#!/bin/bash
# generate-env-databases.sh

SCRIPT="simulation.ion"
BASE_SEED=2024

# Development environment
beamline gen db beamline-lite \
  --seed $BASE_SEED \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path $SCRIPT \
  --sample-count 1000 \
  --catalog-name dev-db \
  --catalog-path ./environments/

# Staging environment  
beamline gen db beamline-lite \
  --seed $((BASE_SEED + 1)) \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path $SCRIPT \
  --sample-count 10000 \
  --catalog-name staging-db \
  --catalog-path ./environments/

# Production-like environment
beamline gen db beamline-lite \
  --seed $((BASE_SEED + 2)) \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path $SCRIPT \
  --sample-count 100000 \
  --catalog-name prod-like-db \
  --catalog-path ./environments/

Database Migration Testing

#!/bin/bash
# test-schema-migration.sh

OLD_SCRIPT="data_v1.ion"
NEW_SCRIPT="data_v2.ion"

# Generate database with old schema
beamline gen db beamline-lite \
  --seed 100 \
  --start-auto \
  --script-path $OLD_SCRIPT \
  --catalog-name old-schema \
  --sample-count 1000

# Generate database with new schema
beamline gen db beamline-lite \
  --seed 100 \
  --start-auto \
  --script-path $NEW_SCRIPT \
  --catalog-name new-schema \
  --sample-count 1000

# Compare schemas
# Compare schemas dataset by dataset (diff takes exactly two files)
for old in old-schema/*.shape.sql; do
  diff "$old" "new-schema/$(basename "$old")"
done

Performance Considerations

Database creation involves:

  1. Script parsing (milliseconds)
  2. Data generation (scales with --sample-count)
  3. Schema inference (nearly instantaneous)
  4. File I/O (depends on dataset size and disk speed)

Performance Tips

# For large databases, monitor progress
time beamline gen db beamline-lite \
  --seed 1 \
  --start-auto \
  --script-path large_sim.ion \
  --sample-count 1000000

# Use faster storage for temporary operations
beamline gen db beamline-lite \
  --seed 1 \
  --start-auto \
  --script-path data.ion \
  --catalog-path /tmp/fast-storage/

Best Practices

1. Use Meaningful Catalog Names

# Good - descriptive names
beamline gen db beamline-lite \
  --script-path user_analytics.ion \
  --catalog-name user-analytics-2024 \
  --catalog-path ./databases/

# Avoid - generic names
beamline gen db beamline-lite \
  --script-path data.ion \
  --catalog-name db

2. Document Generation Parameters

# Create documentation alongside database
beamline gen db beamline-lite \
  --seed 12345 \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path simulation.ion \
  --sample-count 50000 \
  --catalog-name analytics-db-v1

# Document the generation
echo "Analytics Database v1
Generated: $(date)
Seed: 12345
Start: 2024-01-01T00:00:00Z
Sample Count: 50000
Script: simulation.ion" > analytics-db-v1/README.txt

3. Use Version Control for Catalog Manifests

Track database generation metadata:

# Add manifest files to version control
git add beamline-catalog/.beamline-manifest
git add beamline-catalog/.beamline-script
git commit -m "Add database generation manifest for test-db v2.1"

4. Backup Before --force Operations

# The CLI creates backups automatically with --force, but verify
ls -la beamline-catalog*.bkp

# Manual backup before --force if desired
cp -r beamline-catalog manual-backup-$(date +%Y%m%d)
beamline gen db beamline-lite --seed-auto --start-auto --script-path updated.ion --force

Use Cases

Local Development Database

# Create local database for development
beamline gen db beamline-lite \
  --seed 1000 \
  --start-auto \
  --script-path dev_data.ion \
  --sample-count 5000 \
  --catalog-name dev-local

Test Suite Database

# Create comprehensive test database
beamline gen db beamline-lite \
  --seed 2024001 \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path comprehensive_test.ion \
  --sample-count 50000 \
  --catalog-name integration-test-db \
  --default-nullable false \
  --default-optional false

Performance Benchmark Database

# Create large database for performance testing
beamline gen db beamline-lite \
  --seed 999999 \
  --start-auto \
  --script-path performance_test.ion \
  --sample-count 1000000 \
  --catalog-name perf-benchmark \
  --catalog-path ./benchmarks/

Database Analysis

Examine Database Contents

# Check database size
du -sh beamline-catalog/

# Count records per dataset
for f in beamline-catalog/*.ion; do
  if [[ "$f" != *".shape.ion"* ]]; then
    echo "$(basename "$f" .ion): $(wc -l < "$f") records"
  fi
done

# View sample data
head -3 beamline-catalog/service.ion

# View schema
cat beamline-catalog/service.shape.sql

Validate Database Integrity

# Verify manifest matches generation
cat beamline-catalog/.beamline-manifest

# Verify script is preserved
diff original_script.ion beamline-catalog/.beamline-script

# Check all datasets have corresponding schemas
for data in beamline-catalog/*.ion; do
  if [[ "$data" != *".shape.ion"* ]]; then
    dataset=$(basename "$data" .ion)
    if [[ ! -f "beamline-catalog/${dataset}.shape.ion" ]]; then
      echo "Missing schema for $dataset"
    fi
  fi
done

Error Handling

Common Errors

Catalog Directory Exists

$ beamline gen db beamline-lite --seed-auto --start-auto --script-path data.ion
creating directory ./beamline-catalog/ failed with the following error:
File exists (os error 17)

# Solution: Use --force or different catalog name
beamline gen db beamline-lite --seed-auto --start-auto --script-path data.ion --force

Script Parse Errors

$ beamline gen db beamline-lite --seed-auto --start-auto --script-path invalid.ion
Error: Failed to parse Ion script: Invalid Ion syntax at line 8

Insufficient Disk Space

# Check available space before large database creation
df -h .
beamline gen db beamline-lite --seed-auto --start-auto --script-path huge_data.ion --sample-count 10000000

More Best Practices

1. Plan Storage Requirements

# Estimate database size with small sample first
beamline gen db beamline-lite \
  --seed 1 \
  --start-auto \
  --script-path data.ion \
  --sample-count 100 \
  --catalog-name size-test

# Check size and extrapolate
du -sh size-test/
# If 100 samples = 1MB, then 100,000 samples ≈ 1GB
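The extrapolation above can be scripted as simple shell arithmetic. A hedged sketch with assumed numbers: substitute the KB figure that `du -sk size-test/` actually reports for your catalog.

```shell
# Hedged sketch: linear size extrapolation from a small sample run.
# sample_kb is an assumed measurement -- substitute the value du -sk reports.
sample_kb=1024       # size of size-test/ in KB (assumed: 1 MB for 100 samples)
sample_count=100
target_samples=100000
est_mb=$(( sample_kb * target_samples / sample_count / 1024 ))
echo "estimated size for $target_samples samples: ~${est_mb} MB"
```

Linear extrapolation is only an estimate; actual size depends on how your script's distributions and string lengths scale.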

2. Use Consistent Naming Conventions

# Good naming convention
beamline gen db beamline-lite \
  --script-path ecommerce_v2.ion \
  --catalog-name ecommerce-v2-20241201 \
  --catalog-path ./databases/

# Include date, version, purpose in catalog name

3. Document Database Generation

# Create database with documentation
beamline gen db beamline-lite \
  --seed 12345 \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path analytics.ion \
  --sample-count 25000 \
  --catalog-name analytics-q4-2024

# Add README
echo "Analytics Database Q4 2024
Purpose: Customer behavior analysis
Generated: $(date)
Script: analytics.ion  
Seed: 12345
Records: 25000
Contact: analytics-team@company.com" > analytics-q4-2024/README.txt

4. Validate Generated Databases

# Verify database creation was successful
ls -la beamline-catalog/
cat beamline-catalog/.beamline-manifest
wc -l beamline-catalog/*.ion

Next Steps

Now that you understand all CLI commands, you're ready to explore BeamlineLite database generation in depth.

Database Overview

Beamline provides local data and schema generation, letting you create a local copy of the generated data in a catalog directory on your filesystem.

What is BeamlineLite?

BeamlineLite is Beamline’s local database generation capability that creates filesystem-based databases containing:

  • Generated data in Ion format
  • Inferred schemas in both Ion and SQL DDL formats
  • Metadata about generation parameters
  • Original scripts for reproducibility

Database vs Data Generation

Data Generation (gen data)

beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path sensors.ion \
  --sample-count 1000 \
  --output-format ion-pretty

Output: Stream of data records to stdout or file

Use cases: Data processing pipelines, API testing, analysis

Database Generation (gen db)

beamline gen db beamline-lite \
  --seed 42 \
  --start-auto \
  --script-path sensors.ion \
  --sample-count 1000

Output: Complete database directory with data + schemas

Use cases: Local development databases, testing environments, demos

BeamlineLite Database Structure

Catalog Directory Layout

A BeamlineLite database creates a catalog directory with this structure:

beamline-catalog/
├── .beamline-manifest          # Generation metadata (JSON)
├── .beamline-script           # Original Ion script
├── <dataset>.ion              # Data files (one per dataset)
├── <dataset>.shape.ion        # Ion format schemas
└── <dataset>.shape.sql        # SQL DDL schemas

Real Example from client-service.ion

$ beamline gen db beamline-lite \
    --seed-auto \
    --start-auto \
    --script-path client-service.ion \
    --sample-count 1000

writing manifest file ./beamline-catalog/.beamline-manifest ...[COMPLETED]
writing script file ./beamline-catalog/.beamline-script ...[COMPLETED]
writing shape file(s)...[COMPLETED]
writing data file(s)...[COMPLETED]
done!

$ tree beamline-catalog/
beamline-catalog/
├── .beamline-manifest
├── .beamline-script
├── service.ion
├── service.shape.ion
├── service.shape.sql
├── client_0.ion
├── client_0.shape.ion
├── client_0.shape.sql
├── client_1.ion
├── client_1.shape.ion  
├── client_1.shape.sql
└── ... (more client datasets)

Database Files Deep Dive

Manifest File (.beamline-manifest)

Contains generation metadata in JSON format:

$ cat beamline-catalog/.beamline-manifest
{"seed": "949665520117506306", "start": "2023-02-06T12:52:29.000000000Z", "ddl_syntax.version": "partiql_datatype_syntax.0.1"}

Contents:

  • seed: Random seed used for generation (for reproducibility)
  • start: Simulation start timestamp
  • ddl_syntax.version: SQL DDL syntax version used in .shape.sql files
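Because the manifest is a single line of plain JSON, its fields are easy to read back in scripts. A hedged sketch using POSIX sed (the manifest content is mirrored inline here for illustration; in practice read `beamline-catalog/.beamline-manifest`, or use `jq -r '.seed'` if jq is available):

```shell
# Hedged sketch: extract the seed from a manifest line with POSIX sed.
# The manifest content below mirrors the example shown above.
manifest='{"seed": "949665520117506306", "start": "2023-02-06T12:52:29.000000000Z", "ddl_syntax.version": "partiql_datatype_syntax.0.1"}'
seed=$(printf '%s' "$manifest" | sed -n 's/.*"seed": "\([^"]*\)".*/\1/p')
echo "seed=$seed"
```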

Script File (.beamline-script)

Preserved copy of the original Ion script:

$ cat beamline-catalog/.beamline-script
rand_processes::{
    // generate between 5 & 20 customers
    $n: UniformU8::{ low: 5, high: 20 },
    
    // A generator for client ids
    $id_gen: UUID,
    
    // ... rest of original script
}

Purpose:

  • Reproducibility: Regenerate identical database later
  • Documentation: What script created this database
  • Version control: Track script changes over time
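Putting the two metadata files together, a catalog can be regenerated later. A hedged sketch (assumes `jq` is installed and that `--start-iso` accepts the timestamp format stored in the manifest; note the manifest does not record `--sample-count`, so that value must be supplied separately):

```shell
# Hedged sketch: regenerate a catalog from its preserved metadata.
seed=$(jq -r '.seed' beamline-catalog/.beamline-manifest)
start=$(jq -r '.start' beamline-catalog/.beamline-manifest)
beamline gen db beamline-lite \
  --seed "$seed" \
  --start-iso "$start" \
  --script-path beamline-catalog/.beamline-script \
  --sample-count 1000 \
  --catalog-name reproduced-catalog
```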

Data Files (dataset.ion)

Contains generated data in compact Ion format:

$ cat beamline-catalog/client_0.ion
{success: true, id: "7dbd12cf-b506-22ad-2d81-b0a1cd259697", request_id: "0de35d1e-a87c-e540-734d-6f2a4fa410c3", request_time: 2021-01-05T03:55:01.035000000+00:00}
{success: true, id: "7dbd12cf-b506-22ad-2d81-b0a1cd259697", request_id: "3539cdf0-6f7e-6bdc-c25a-4e0b7d8f8bac", request_time: 2021-01-05T03:55:01.182000000+00:00}

Characteristics:

  • One record per line: Newline-delimited Ion format
  • Complete type information: All Ion types preserved
  • Temporal ordering: Records ordered by generation time
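Because the format is newline-delimited, ordinary line-oriented tools work directly on data files. A hedged example using a two-record fixture modeled on the output above (in practice, point grep at a data file in your catalog):

```shell
# Hedged example: count successful requests in a newline-delimited Ion file.
# Two fixture records stand in for a real beamline-catalog/client_0.ion.
cat > /tmp/client_demo.ion <<'EOF'
{success: true, id: "7dbd12cf-b506-22ad-2d81-b0a1cd259697", request_id: "0de35d1e-a87c-e540-734d-6f2a4fa410c3", request_time: 2021-01-05T03:55:01.035000000+00:00}
{success: false, id: "7dbd12cf-b506-22ad-2d81-b0a1cd259697", request_id: "3539cdf0-6f7e-6bdc-c25a-4e0b7d8f8bac", request_time: 2021-01-05T03:55:01.182000000+00:00}
EOF
ok=$(grep -c 'success: true' /tmp/client_demo.ion)
echo "successful requests: $ok"
```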

Schema Files (dataset.shape.ion)

Ion format schema definitions:

$ cat beamline-catalog/client_0.shape.ion
{
  type: "bag",
  items: {
    type: "struct",
    constraints: [ordered, closed],
    fields: [
      { name: "id", type: "string" },
      { name: "request_id", type: "string" },
      { name: "request_time", type: "datetime" },
      { name: "success", type: "bool" }
    ]
  }
}

Use cases:

  • PartiQL validation: Validate queries against schema
  • Type checking: Ensure data types match expectations
  • Tool integration: Tools can use schema information

Schema Files (dataset.shape.sql)

SQL DDL format schemas:

$ cat beamline-catalog/service.shape.sql
"Account" VARCHAR,
"Distance" DECIMAL(2, 0),
"Operation" VARCHAR,
"Program" VARCHAR,
"Request" VARCHAR,
"StartTime" TIMESTAMP,
"Weight" DECIMAL(5, 4),
"client" VARCHAR,
"success" BOOL

Use cases:

  • Database creation: Create tables in SQL databases
  • Schema documentation: Human-readable schema reference
  • Migration scripts: Database schema evolution
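Since a `.shape.sql` file contains only the column list, wrap it in a `CREATE TABLE` statement yourself before feeding it to a SQL database. A hedged sketch (the table name and output filename are your choice):

```shell
# Hedged sketch: turn a generated column list into a full CREATE TABLE statement.
{
  echo 'CREATE TABLE service ('
  cat beamline-catalog/service.shape.sql
  echo ');'
} > create_service.sql
```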

Managing Catalogs

BeamlineLite catalogs are filesystem-based directories that contain complete databases with data, schemas, and metadata. Understanding how to manage, organize, and work with catalogs is essential for effective database operations.
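As the number of catalogs grows, a small inventory script helps keep track of them. A hedged sketch (assumes `jq` is installed and that catalogs live under an assumed `./databases/` directory):

```shell
# Hedged sketch: list every catalog under ./databases/ with its generation seed.
for manifest in ./databases/*/.beamline-manifest; do
  [ -f "$manifest" ] || continue   # skip if the glob matched nothing
  echo "$(dirname "$manifest"): seed $(jq -r '.seed' "$manifest")"
done
```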

Catalog Structure Deep Dive

Standard Catalog Layout

Every BeamlineLite catalog follows a consistent structure:

catalog-name/
├── .beamline-manifest          # JSON metadata file
├── .beamline-script           # Original Ion script
├── dataset_1.ion              # Dataset 1 data (Ion format)
├── dataset_1.shape.ion        # Dataset 1 schema (Ion format)  
├── dataset_1.shape.sql        # Dataset 1 schema (SQL DDL)
├── dataset_2.ion              # Dataset 2 data
├── dataset_2.shape.ion        # Dataset 2 schema (Ion)
├── dataset_2.shape.sql        # Dataset 2 schema (SQL)
└── ... (additional datasets)

File Naming Conventions

Data Files: <dataset_name>.ion

  • Contains generated records in newline-delimited Ion format
  • One file per dataset defined in the Ion script
  • Records ordered chronologically by generation time

Ion Schema Files: <dataset_name>.shape.ion

  • PartiQL type definitions in Ion format
  • Used by Ion-aware tools for validation and processing
  • Contains complete type constraint information

SQL Schema Files: <dataset_name>.shape.sql

  • SQL DDL field definitions (not complete CREATE TABLE)
  • Ready for integration with SQL databases
  • Human-readable schema documentation

Metadata Files:

  • .beamline-manifest - Generation parameters in JSON
  • .beamline-script - Original Ion script for reproducibility

Catalog Creation Options

Basic Catalog Creation

# Default catalog in current directory
beamline gen db beamline-lite \
  --seed 42 \
  --start-auto \
  --script-path data.ion

# Creates: ./beamline-catalog/

Custom Catalog Configuration

# Custom name and location
beamline gen db beamline-lite \
  --seed 12345 \
  --start-iso "2024-01-01T00:00:00Z" \
  --script-path ecommerce.ion \
  --sample-count 50000 \
  --catalog-name ecommerce-prod-simulation \
  --catalog-path ./production-databases/

# Creates: ./production-databases/ecommerce-prod-simulation/

Catalog Naming Best Practices

# Good - descriptive, versioned names
--catalog-name user-analytics-v2-20241201
--catalog-name integration-test-db-sprint-45  
--catalog-name demo-ecommerce-q4-2024

# Avoid - generic names
--catalog-name db
--catalog-name test
--catalog-name data

Catalog Lifecycle Management

Safe Overwrite with Backup

BeamlineLite protects existing catalogs by default:

$ beamline gen db beamline-lite --seed 1 --start-auto --script-path data.ion
creating directory ./beamline-catalog/ failed with the following error:
File exists (os error 17)

The --force option creates automatic backups:

$ beamline gen db beamline-lite \
    --seed 1 \
    --start-auto \
    --script-path updated_data.ion \
    --force

command is using --force ...
Beamline catalog ./beamline-catalog/ exists, backing it up to "beamline-catalog.2024-05-10T22:15:54.019316000Z.bkp"...
back up completed
writing manifest file ./beamline-catalog/.beamline-manifest ...[COMPLETED]
writing script file ./beamline-catalog/.beamline-script ...[COMPLETED]
writing shape file(s)...[COMPLETED]
writing data file(s)...[COMPLETED]
done!

Backup naming pattern: <catalog-name>.<ISO-8601-timestamp>.bkp
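To roll back after a --force run, restore the newest backup. Because the backup suffix is an ISO-8601 timestamp, lexicographic sort order is also chronological order. A hedged sketch:

```shell
# Hedged sketch: restore the most recent automatic backup, if any exists.
latest=$(ls -d beamline-catalog.*.bkp 2>/dev/null | sort | tail -n 1)
if [ -n "$latest" ]; then
  rm -rf beamline-catalog
  cp -r "$latest" beamline-catalog
  echo "restored from $latest"
fi
```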