Data Generation Overview

Beamline’s data generation system creates synthetic data using stochastic processes and probability distributions. The system is built around three core concepts: random processes, value generators, and temporal modeling.

Architecture Overview

Data generation in Beamline follows a layered architecture:

  1. Random Processes — Mathematical models that describe how events occur over time
  2. Value Generators — Components that create specific data types and values
  3. Arrival Times — Models for when events occur in the simulation
  4. Simulation Context — Manages state, timing, and reproducibility

Core Concepts

Random Processes

A random process (also called a stochastic process) is a mathematical model of a system that appears to vary randomly over time. In Beamline, these processes control:

  • When data arrives (temporal patterns)
  • What data is generated (value types and structures)
  • How data relates (cross-field dependencies)

Value Generators

Value Generators are the building blocks that create actual data values. They can generate:

  • Scalar values: numbers, strings, booleans, timestamps
  • Complex structures: objects, arrays, nested data
  • Statistical distributions: normal, exponential, Weibull, etc.
  • Specialized types: UUIDs, formatted text, regex patterns

Each generator can be configured for:

  • Nullability: Probability of generating NULL values
  • Optionality: Probability of generating MISSING values
  • Value ranges: Minimum and maximum bounds
  • Distribution parameters: Mean, standard deviation, shape, scale

Temporal Modeling

Beamline models data generation as events occurring over time using:

  • Arrival processes: When events occur (e.g. Poisson Point Process)
  • Simulation time: Virtual time that advances as events are generated
  • Tick counters: Global state that increments with each event
  • Instant generators: Current simulation time when values are created
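
To make the relationship between simulation time and tick counters concrete, here is a minimal Python sketch. It is not Beamline's implementation: the `SimulationContext` class, the `run` function, the start date, and the choice to start ticks at 1 are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class SimulationContext:
    """Hypothetical stand-in for a simulation context: virtual clock plus tick counter."""
    now: datetime
    tick: int = 0

    def advance(self, gap: timedelta) -> None:
        # Virtual time jumps forward by the gap to the next event,
        # and the global tick counter increments once per event.
        self.now += gap
        self.tick += 1


def run(gaps_seconds):
    """Generate one event per gap, reading tick and time from the context."""
    ctx = SimulationContext(now=datetime(2024, 1, 1))  # arbitrary start time
    events = []
    for gap in gaps_seconds:
        ctx.advance(timedelta(seconds=gap))
        # Tick-like and Instant-like values are read when the event is created.
        events.append({"tick": ctx.tick, "instant": ctx.now})
    return events
```

Calling `run([60, 120, 30])` yields three events whose instants are 1 minute, 3 minutes, and 3.5 minutes after the start time. No wall-clock waiting is involved, because the clock is purely virtual.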

Ion Script Structure

All data generation is controlled through Amazon Ion scripts with this basic structure:

rand_processes::{
    // Variable definitions
    $variable_name: GeneratorType::{ configuration },
    
    // Dataset definitions
    dataset_name: dataset_configuration
}

Variable Definitions

Variables allow you to define generators once and reuse them:

rand_processes::{
    // Define reusable generators
    $id_generator: UUID,
    $weight_generator: UniformDecimal::{ low: 1.0, high: 10.0 },
    $count_range: UniformU8::{ low: 5, high: 20 },
    
    // Use variables in dataset definitions
    products: $count_range::[
        // ... uses $id_generator and $weight_generator
    ]
}

Dataset Configurations

Datasets can be configured in several ways:

1. Single Random Process

dataset_name: rand_process::{
    $arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
    $data: {
        id: UUID,
        value: UniformF64
    }
}

2. Static Data (Generated Once)

dataset_name: static_data::{
    $data: {
        id: UUID,
        name: LoremIpsumTitle
    }
}

3. Multiple Instances with Loops

$n: UniformU8::{ low: 2, high: 5 },

dataset_name: $n::[
    rand_process::{
        $data: {
            instance_id: '$@n',  // Current loop index
            value: UniformF64
        }
    }
]

Data Generator Types

Basic Generators

| Generator | Description | PartiQL Type | Configuration |
|-----------|-------------|--------------|---------------|
| Bool | Boolean values | BOOL | p: f64 (probability of true, default: 0.5) |
| UUID | UUID v4 identifiers | STRING | No configuration |
| Tick | Current simulation tick | Int64 | No configuration |
| Instant | Current simulation time | DATETIME | No configuration |
| Date | Current simulation date | DATETIME | No configuration |

Numeric Generators

Uniform Integer Generators

// Unsigned integers
UniformU8::{ low: 0, high: 255 }           // 8-bit unsigned
UniformU16::{ low: 0, high: 65535 }        // 16-bit unsigned  
UniformU32::{ low: 0, high: 4294967295 }  // 32-bit unsigned
UniformU64::{ low: 0, high: 18446744073709551615 }  // 64-bit unsigned

// Signed integers
UniformI8::{ low: -128, high: 127 }        // 8-bit signed
UniformI16::{ low: -32768, high: 32767 }   // 16-bit signed
UniformI32::{ low: -2147483648, high: 2147483647 }  // 32-bit signed
UniformI64::{ low: -9223372036854775808, high: 9223372036854775807 }  // 64-bit signed
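
The bounds in the listing above are the full ranges of each integer width. As a quick check, this Python sketch computes them from the bit width. The helper name `int_bounds` is hypothetical, and whether Beamline treats `high` as inclusive is not stated here.

```python
def int_bounds(bits: int, signed: bool) -> tuple[int, int]:
    """Return (lowest, highest) representable value for an integer of the given width."""
    if signed:
        # Two's complement: one bit is used for the sign.
        return -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return 0, (1 << bits) - 1


for bits in (8, 16, 32, 64):
    print(f"U{bits}: {int_bounds(bits, signed=False)}  I{bits}: {int_bounds(bits, signed=True)}")
```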

Floating Point Generators

// Uniform float
UniformF64::{ low: -127.0, high: 127.0 }

// Uniform decimal (exact arithmetic)
UniformDecimal::{ low: 0.995, high: 499.9999 }

Statistical Distribution Generators

// Normal distribution (bell curve)
NormalF64::{ mean: 100.0, std_dev: 15.0 }

// Log-normal distribution
LogNormalF64::{ location: 0.0, scale: 1.0 }

// Weibull distribution
WeibullF64::{ shape: 2.0, scale: 1.0 }

// Exponential distribution
ExpF64::{ rate: 1.0 }
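
For intuition about what these parameters mean, the following Python sketch draws from the same four distribution families using the standard library. It is only an analogy. It assumes that `location` and `scale` for the log-normal are the mean and standard deviation of the underlying normal, which is the usual convention but is not confirmed here for Beamline. The `sample_*` helper names are made up.

```python
import random


def sample_normal(rng, mean, std_dev):
    return rng.normalvariate(mean, std_dev)


def sample_lognormal(rng, location, scale):
    # Assumption: location/scale are mu/sigma of the underlying normal.
    return rng.lognormvariate(location, scale)


def sample_weibull(rng, shape, scale):
    # Note the argument order: Python's weibullvariate takes (scale, shape).
    return rng.weibullvariate(scale, shape)


def sample_exp(rng, rate):
    # The mean of an exponential distribution is 1 / rate.
    return rng.expovariate(rate)


rng = random.Random(7)  # seeded so repeated runs produce the same samples
print(sample_normal(rng, 100.0, 15.0))
print(sample_lognormal(rng, 0.0, 1.0))
print(sample_weibull(rng, 2.0, 1.0))
print(sample_exp(rng, 1.0))
```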

String Generators

// Lorem Ipsum text
LoremIpsum::{ min_words: 10, max_words: 200 }

// Lorem Ipsum titles (3-8 words, title case)
LoremIpsumTitle

// Regular expression patterns
Regex::{ pattern: "[A-Z]{2}[0-9]{3}" }

// Format strings with variable substitution
Format::{ pattern: "User #{$@n}" }

Complex Type Generators

Arrays

UniformArray::{
    min_size: 1,
    max_size: 5,
    element_type: UniformI32::{ low: 1, high: 100 }
}

Union Types (Any Of)

UniformAnyOf::{
    types: [
        UUID,
        UniformI32::{ low: 1, high: 1000 },
        LoremIpsumTitle
    ]
}

Choice from Literals

Uniform::{ choices: [1, 2, 5, 10, 20] }
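
The three complex generators above can be read as "pick a size, then fill it", "pick a generator, then run it", and "pick a literal". The Python sketch below mirrors that reading. All function names are hypothetical, and it assumes `min_size` and `max_size` are both inclusive, which is not confirmed here.

```python
import random
import uuid


def random_uuid(rng):
    # Build a version-4 UUID from the seeded generator so runs are reproducible.
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))


def uniform_array(rng, min_size, max_size, element):
    # Assumption: both size bounds are inclusive.
    size = rng.randint(min_size, max_size)
    return [element(rng) for _ in range(size)]


def uniform_any_of(rng, generators):
    # Choose one generator with equal probability, then draw a value from it.
    return rng.choice(generators)(rng)


def uniform_choice(rng, choices):
    # Choose one literal with equal probability.
    return rng.choice(choices)


rng = random.Random(1)
print(uniform_array(rng, 1, 5, lambda r: r.randint(1, 100)))
print(uniform_any_of(rng, [random_uuid, lambda r: r.randint(1, 1000)]))
print(uniform_choice(rng, [1, 2, 5, 10, 20]))
```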

Nullability and Optionality

Every generator supports NULL and MISSING value generation:

Nullability (NULL values)

// 20% chance of NULL values
generator::{ nullable: 0.2 }

// Never NULL
generator::{ nullable: false }

// Always NULL (not useful, but possible)
generator::{ nullable: 1.0 }

Optionality (MISSING values)

// 10% chance of MISSING values  
generator::{ optional: 0.1 }

// Never MISSING
generator::{ optional: false }

// Always MISSING (field won't appear)
generator::{ optional: 1.0 }

Combined Configuration

// 20% NULL, 10% MISSING, 70% present values
price: UniformDecimal::{
    nullable: 0.2,
    optional: 0.1, 
    low: 9.99,
    high: 999.99
}
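
One way to read the 20% / 10% / 70% split above is as a single random draw partitioned into three outcomes. The Python sketch below implements that reading. It is an assumption: Beamline may evaluate the two probabilities independently, which would give slightly different proportions. The `MISSING` sentinel and the function names are hypothetical.

```python
import random

MISSING = object()  # sentinel meaning "omit the field entirely"


def generate_field(rng, value_gen, nullable=0.0, optional=0.0):
    """Return MISSING, None (NULL), or a generated value from one random draw."""
    roll = rng.random()  # uniform in [0.0, 1.0)
    if roll < optional:
        return MISSING
    if roll < optional + nullable:
        return None
    return value_gen(rng)


def build_record(rng, fields):
    """fields maps a name to (value_gen, nullable, optional)."""
    record = {}
    for name, (value_gen, nullable, optional) in fields.items():
        value = generate_field(rng, value_gen, nullable, optional)
        if value is not MISSING:  # MISSING fields never appear in the record
            record[name] = value
    return record


rng = random.Random(3)
price = (lambda r: round(r.uniform(9.99, 999.99), 2), 0.2, 0.1)
print([build_record(rng, {"price": price}) for _ in range(5)])
```

The distinction matters downstream: a NULL field is present with a null value, while a MISSING field is absent from the record altogether.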

Arrival Processes

Arrival processes control when events occur in simulation time. Beamline currently supports only the homogeneous Poisson process:

Homogeneous Poisson Process

Statistically independent events occur at a constant average rate, with random intervals between them:

$arrival: HomogeneousPoisson::{ interarrival: minutes::5 }

Time units supported:

  • milliseconds::N - N milliseconds between events, on average
  • seconds::N - N seconds between events, on average
  • minutes::N - N minutes between events, on average
  • hours::N - N hours between events, on average
  • days::N - N days between events, on average
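
In a homogeneous Poisson process, the gaps between consecutive events are independent and exponentially distributed. The Python sketch below generates arrival times that way. It assumes the `interarrival` value is the mean gap, which is not confirmed here. The `UNIT_SECONDS` table and the `poisson_arrivals` function are illustrative names, not part of Beamline.

```python
import random

# Seconds per supported time unit.
UNIT_SECONDS = {
    "milliseconds": 0.001,
    "seconds": 1.0,
    "minutes": 60.0,
    "hours": 3600.0,
    "days": 86400.0,
}


def poisson_arrivals(rng, unit, n, count):
    """Return `count` arrival times in seconds, with mean gap n * unit."""
    mean_gap = UNIT_SECONDS[unit] * n
    t = 0.0
    times = []
    for _ in range(count):
        # Exponential gap with rate 1 / mean_gap.
        t += rng.expovariate(1.0 / mean_gap)
        times.append(t)
    return times


rng = random.Random(11)
times = poisson_arrivals(rng, "minutes", 5, 10)
print([round(t) for t in times])
```

Individual gaps vary widely, but over many events the average gap approaches the configured value (300 seconds for `minutes::5`).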

Variable References and Scope

Variable Definition and Usage

rand_processes::{
    $arrival: HomogeneousPoisson::{ interarrival: minutes::5 },

    // Define variables at top level
    $customer_id: UUID,
    $price_range: UniformDecimal::{ low: 9.99, high: 199.99 },
    
    orders: rand_process::{
        $data: {
            customer: $customer_id,     // Reference variable
            price: $price_range,        // Reference variable
            order_id: UUID              // Direct generator
        }
    }
}

Forced Evaluation with ::()

Force generator evaluation at script read time (not generation time):

rand_processes::{
    $id_gen: UUID,
    
    customers: 3::[
        {
            // Each customer gets the same ID across all their records
            $id: $id_gen::(),  // Evaluated once per customer
            
            customer_profile: static_data::{
                $data: {
                    id: $id,           // Same ID for this customer
                    name: LoremIpsumTitle
                }
            },
            
            transactions: rand_process::{
                $data: {
                    customer_id: $id,  // Same ID for this customer
                    transaction_id: UUID,  // New UUID per transaction
                    amount: UniformDecimal::{ low: 10.0, high: 500.0 }
                }
            }
        }
    ]
}
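
The difference between referencing a generator and forcing its evaluation resembles the difference between passing a function around and calling it once to store the result. A rough Python analogy follows. The function names are hypothetical and this does not reflect how Beamline evaluates scripts internally.

```python
import uuid


def id_gen():
    # A generator: every call produces a fresh value.
    return str(uuid.uuid4())


def build_customer(n_transactions):
    # Analogous to `$id: $id_gen::()`: evaluate once and reuse the result.
    customer_id = id_gen()
    profile = {"id": customer_id}
    transactions = [
        {
            "customer_id": customer_id,   # same value in every record
            "transaction_id": id_gen(),   # evaluated again for each record
        }
        for _ in range(n_transactions)
    ]
    return profile, transactions


profile, transactions = build_customer(3)
print(profile)
print(transactions)
```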

Loop Index Variable $@n

Access the current loop index in array definitions:

$n: UniformU8::{ low: 3, high: 7 },

clients: $n::[
    {
        'client_$@n': rand_process::{  // Dynamic dataset name
            $data: {
                client_number: '$@n',   // Current index as value
                name: Format::{ pattern: "Client #{$@n}" }
            }
        }
    }
]

Some Real Examples

Simple Sensor Data

rand_processes::{
    $n: UniformU8::{ low: 2, high: 10 },

    sensors: $n::[
        rand_process::{
            $r: Uniform::{ choices: [5,10] },
            $arrival: HomogeneousPoisson::{ interarrival: minutes::$r },
            $data: {
                tick: Tick,
                i8: UniformI8,
                f: UniformF64
            }
        }
    ]
}

Complex Statistical Data

rand_processes::{
    test_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: milliseconds::100 },
        $data: {
            // Statistical distributions
            normal_score: NormalF64::{ mean: 100.0, std_dev: 15.0 },
            exponential_wait: ExpF64::{ rate: 0.1 },
            weibull_lifetime: WeibullF64::{ shape: 2.0, scale: 1000.0 },
            
            // Arrays with statistical elements
            measurements: UniformArray::{
                min_size: 5,
                max_size: 10,
                element_type: NormalF64::{ mean: 50.0, std_dev: 5.0 }
            },
            
            // Union types
            mixed_value: UniformAnyOf::{
                types: [
                    NormalF64::{ mean: 0.0, std_dev: 1.0 },
                    UniformI32::{ low: 1, high: 100 },
                    UUID
                ]
            }
        }
    }
}

Static and Dynamic Data Combination

rand_processes::{
    $n: UniformU8::{ low: 5, high: 20 },
    $id_gen: UUID,

    customers: $n::[
        {
            $id: $id_gen::(),  // One ID per customer
            
            // Static customer data (generated once)
            customer_table: static_data::{
                $data: {
                    id: $id,
                    address: Format::{ pattern: "{$@n} Main Street" }
                }
            },
            
            // Dynamic order data (generated over time)
            orders: rand_process::{
                $r: UniformU8::{ low: 1, high: 30 },
                $arrival: HomogeneousPoisson::{ interarrival: days::$r },
                $data: {
                    customer_id: $id,
                    order_id: UUID,
                    timestamp: Instant
                }
            }
        }
    ]
}

Probability Distribution Support

Beamline provides support for data generation based on probability distributions, making it particularly valuable for AI model training and statistical simulation:

Available Distributions

  • Normal Distribution: NormalF64::{ mean: μ, std_dev: σ }
  • Log-Normal Distribution: LogNormalF64::{ location: μ, scale: σ }
  • Exponential Distribution: ExpF64::{ rate: λ }
  • Weibull Distribution: WeibullF64::{ shape: k, scale: λ }
  • Uniform Distribution: All Uniform* generators use uniform distribution

AI Model Training Applications

rand_processes::{
    training_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: milliseconds::1 },
        $data: {
            // Features following realistic distributions
            age: NormalF64::{ mean: 35.0, std_dev: 12.0 },
            income: LogNormalF64::{ location: 10.5, scale: 0.5 },
            response_time: ExpF64::{ rate: 0.1 },
            
            // Categorical features
            category: Uniform::{ choices: ["A", "B", "C", "D"] },
            
            // Additional numeric feature (sampled independently of the others here)
            experience_years: NormalF64::{ mean: 8.0, std_dev: 5.0 },
            
            // Target variable (could be based on features)
            target: Bool::{ p: 0.3 }
        }
    }
}

Next Steps

Now that you understand the data generation overview, explore specific aspects: