Data Generation Overview
Beamline’s data generation system creates synthetic data using stochastic processes and probability distributions. The system is built around three core concepts: random processes, value generators, and temporal modeling.
Architecture Overview
Data generation in Beamline follows a layered architecture:
- Random Processes — Mathematical models that describe how events occur over time
- Value Generators — Components that create specific data types and values
- Arrival Times — Models for when events occur in the simulation
- Simulation Context — Manages state, timing, and reproducibility
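Reproducibility in a simulation of this kind usually comes down to seeding the random source. The sketch below is plain Python, not Beamline's implementation; `generate_events` and its parameters are hypothetical names used only to illustrate that a fixed seed yields an identical event sequence.

```python
import random

def generate_events(seed, count):
    """Return `count` (tick, value) pairs drawn from a seeded RNG (hypothetical helper)."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    return [(tick, rng.uniform(0.0, 1.0)) for tick in range(1, count + 1)]

first = generate_events(seed=42, count=5)
second = generate_events(seed=42, count=5)   # same seed: same sequence
other = generate_events(seed=43, count=5)    # different seed: different values
```

Keeping the generator state inside one object, rather than in global state, is what lets two runs with the same seed be compared directly.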
Core Concepts
Random Processes
A random process (also called a stochastic process) is a mathematical model of a system that appears to vary randomly over time. In Beamline, these processes control:
- When data arrives (temporal patterns)
- What data is generated (value types and structures)
- How data relates (cross-field dependencies)
Value Generators
Value Generators are the building blocks that create actual data values. They can generate:
- Scalar values: numbers, strings, booleans, timestamps
- Complex structures: objects, arrays, nested data
- Statistical distributions: normal, exponential, Weibull, etc.
- Specialized types: UUIDs, formatted text, regex patterns
Each generator can be configured for:
- Nullability: Probability of generating NULL values
- Optionality: Probability of generating MISSING values
- Value ranges: Minimum and maximum bounds
- Distribution parameters: Mean, standard deviation, shape, scale
Temporal Modeling
Beamline models data generation as events occurring over time using:
- Arrival processes: When events occur (e.g. Poisson Point Process)
- Simulation time: Virtual time that advances as events are generated
- Tick counters: Global state that increments with each event
- Instant generators: Current simulation time when values are created
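The relationship between simulation time and the tick counter can be pictured with a small Python sketch. `SimulationClock` is a hypothetical class written for illustration, not part of Beamline; it assumes that each event advances virtual time by some gap and increments the tick by one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class SimulationClock:
    """Hypothetical model of virtual time plus a global tick counter."""
    now: datetime
    tick: int = 0

    def advance(self, gap: timedelta) -> None:
        # One event: move virtual time forward and count the event.
        self.now += gap
        self.tick += 1

clock = SimulationClock(now=datetime(2024, 1, 1))
for minutes in (5, 3, 7):              # three events with arbitrary gaps
    clock.advance(timedelta(minutes=minutes))
```

In this picture a generator such as Tick would read `clock.tick` and Instant would read `clock.now` at the moment a value is created.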
Ion Script Structure
All data generation is controlled through Amazon Ion scripts with this basic structure:
rand_processes::{
// Variable definitions
$variable_name: GeneratorType::{ configuration },
// Dataset definitions
dataset_name: dataset_configuration
}
Variable Definitions
Variables allow you to define generators once and reuse them:
rand_processes::{
// Define reusable generators
$id_generator: UUID,
$weight_generator: UniformDecimal::{ low: 1.0, high: 10.0 },
$count_range: UniformU8::{ low: 5, high: 20 },
// Use variables in dataset definitions
products: $count_range::[
// ... uses $id_generator and $weight_generator
]
}
Dataset Configurations
Datasets can be configured in several ways:
1. Single Random Process
dataset_name: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
$data: {
id: UUID,
value: UniformF64
}
}
2. Static Data (Generated Once)
dataset_name: static_data::{
$data: {
id: UUID,
name: LoremIpsumTitle
}
}
3. Multiple Instances with Loops
$n: UniformU8::{ low: 2, high: 5 },
dataset_name: $n::[
rand_process::{
$data: {
instance_id: '$@n', // Current loop index
value: UniformF64
}
}
]
Data Generator Types
Basic Generators
| Generator | Description | PartiQL Type | Configuration |
|---|---|---|---|
| Bool | Boolean values | BOOL | p: f64 (probability of true, default: 0.5) |
| UUID | UUID v4 identifiers | STRING | No configuration |
| Tick | Current simulation tick | Int64 | No configuration |
| Instant | Current simulation time | DATETIME | No configuration |
| Date | Current simulation date | DATETIME | No configuration |
Numeric Generators
Uniform Integer Generators
// Unsigned integers
UniformU8::{ low: 0, high: 255 } // 8-bit unsigned
UniformU16::{ low: 0, high: 65535 } // 16-bit unsigned
UniformU32::{ low: 0, high: 4294967295 } // 32-bit unsigned
UniformU64::{ low: 0, high: 18446744073709551615 } // 64-bit unsigned
// Signed integers
UniformI8::{ low: -128, high: 127 } // 8-bit signed
UniformI16::{ low: -32768, high: 32767 } // 16-bit signed
UniformI32::{ low: -2147483648, high: 2147483647 } // 32-bit signed
UniformI64::{ low: -9223372036854775808, high: 9223372036854775807 } // 64-bit signed
Floating Point Generators
// Uniform float
UniformF64::{ low: -127.0, high: 127.0 }
// Uniform decimal (exact arithmetic)
UniformDecimal::{ low: 0.995, high: 499.9999 }
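The "exact arithmetic" remark is about the difference between binary floats and decimals. The Python sketch below illustrates that difference with the standard `decimal` module; `uniform_decimal` is a hypothetical helper that assumes sampling on a fixed grid of decimal places, which may not match how Beamline draws decimals.

```python
import random
from decimal import Decimal

def uniform_decimal(rng, low, high, places=4):
    """Draw a Decimal in [low, high] on a grid of 10**-places (assumed scheme)."""
    scale = 10 ** places
    lo = int(Decimal(low) * scale)
    hi = int(Decimal(high) * scale)
    # Sample an integer number of grid steps, then scale back down exactly.
    return Decimal(rng.randint(lo, hi)) / scale

rng = random.Random(7)
sample = uniform_decimal(rng, "0.995", "499.9999")

float_sum = 0.1 + 0.2                         # binary float: not exactly 0.3
decimal_sum = Decimal("0.1") + Decimal("0.2") # decimal: exactly 0.3
```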
Statistical Distribution Generators
// Normal distribution (bell curve)
NormalF64::{ mean: 100.0, std_dev: 15.0 }
// Log-normal distribution
LogNormalF64::{ location: 0.0, scale: 1.0 }
// Weibull distribution
WeibullF64::{ shape: 2.0, scale: 1.0 }
// Exponential distribution
ExpF64::{ rate: 1.0 }
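To get a feel for what these parameters mean, the same four distributions can be sampled with Python's standard `random` module. This is only an analogy, not Beamline code, and it assumes that `location` and `scale` for the log-normal refer to the mean and standard deviation of the underlying normal distribution.

```python
import random
import statistics

rng = random.Random(2024)
N = 20_000

normal = [rng.gauss(100.0, 15.0) for _ in range(N)]           # mean, std_dev
lognormal = [rng.lognormvariate(0.0, 1.0) for _ in range(N)]  # mu, sigma of underlying normal (assumed mapping)
weibull = [rng.weibullvariate(1.0, 2.0) for _ in range(N)]    # Python order is (scale, shape)
exponential = [rng.expovariate(1.0) for _ in range(N)]        # rate; mean is 1 / rate

summary = {
    "normal_mean": statistics.mean(normal),
    "lognormal_median": statistics.median(lognormal),
    "weibull_mean": statistics.mean(weibull),
    "exponential_mean": statistics.mean(exponential),
}
```

With these parameters the sample statistics should land close to 100 for the normal mean, 1 for the log-normal median, about 0.886 for the Weibull mean, and 1 for the exponential mean.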
String Generators
// Lorem Ipsum text
LoremIpsum::{ min_words: 10, max_words: 200 }
// Lorem Ipsum titles (3-8 words, title case)
LoremIpsumTitle
// Regular expression patterns
Regex::{ pattern: "[A-Z]{2}[0-9]{3}" }
// Format strings with variable substitution
Format::{ pattern: "User #{$@n}" }
Complex Type Generators
Arrays
UniformArray::{
min_size: 1,
max_size: 5,
element_type: UniformI32::{ low: 1, high: 100 }
}
Union Types (Any Of)
UniformAnyOf::{
types: [
UUID,
UniformI32::{ low: 1, high: 1000 },
LoremIpsumTitle
]
}
Choice from Literals
Uniform::{ choices: [1, 2, 5, 10, 20] }
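The three complex generators above can be read as "pick a size, then fill it", "pick a generator, then run it", and "pick a literal". The Python sketch below models that reading; the function names are hypothetical, and it assumes that size bounds are inclusive and that every alternative is equally likely.

```python
import random
import uuid

def uniform_array(rng, min_size, max_size, element):
    """Pick a length uniformly (inclusive bounds assumed), then draw each element."""
    size = rng.randint(min_size, max_size)
    return [element(rng) for _ in range(size)]

def uniform_any_of(rng, generators):
    """Pick one generator uniformly, then draw a value from it."""
    return rng.choice(generators)(rng)

def int_1_100(rng):
    return rng.randint(1, 100)

def int_1_1000(rng):
    return rng.randint(1, 1000)

def uuid_v4(rng):
    # Build a version-4 UUID from the seeded RNG so runs are repeatable.
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))

rng = random.Random(11)
array = uniform_array(rng, 1, 5, int_1_100)
mixed = uniform_any_of(rng, [uuid_v4, int_1_1000])
choice = rng.choice([1, 2, 5, 10, 20])
```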
Nullability and Optionality
Every generator supports NULL and MISSING value generation:
Nullability (NULL values)
// 20% chance of NULL values
generator::{ nullable: 0.2 }
// Never NULL
generator::{ nullable: false }
// Always NULL (not useful, but possible)
generator::{ nullable: 1.0 }
Optionality (MISSING values)
// 10% chance of MISSING values
generator::{ optional: 0.1 }
// Never MISSING
generator::{ optional: false }
// Always MISSING (field won't appear)
generator::{ optional: 1.0 }
Combined Configuration
// 20% NULL, 10% MISSING, 70% present values
price: UniformDecimal::{
nullable: 0.2,
optional: 0.1,
low: 9.99,
high: 999.99
}
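One way to read the combined configuration is as a single random draw split into three bands. The Python sketch below assumes exactly that semantics (MISSING with probability `optional`, NULL with probability `nullable`, a real value otherwise); Beamline's actual evaluation order is not stated in this document, and `draw_field` is a hypothetical helper.

```python
import random

MISSING = object()  # sentinel meaning "field is omitted from the record"

def draw_field(rng, value, nullable=0.0, optional=0.0):
    """Assumed semantics: one uniform draw split into MISSING / NULL / value bands."""
    u = rng.random()
    if u < optional:
        return MISSING
    if u < optional + nullable:
        return None
    return value(rng)

def price(rng):
    return rng.uniform(9.99, 999.99)

rng = random.Random(5)
TRIALS = 50_000
counts = {"missing": 0, "null": 0, "present": 0}
for _ in range(TRIALS):
    out = draw_field(rng, price, nullable=0.2, optional=0.1)
    if out is MISSING:
        counts["missing"] += 1
    elif out is None:
        counts["null"] += 1
    else:
        counts["present"] += 1

fractions = {key: n / TRIALS for key, n in counts.items()}
```

Under this assumption the observed fractions approach 10% MISSING, 20% NULL, and 70% present. If the two probabilities were instead applied independently, the present share would be 72% rather than 70%.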
Arrival Processes
Arrival processes control when events occur in simulation time. Beamline currently supports only the Homogeneous Poisson Process:
Homogeneous Poisson Process
Statistically independent events occur at a constant average rate with random intervals:
$arrival: HomogeneousPoisson::{ interarrival: minutes::5 }
Time units supported:
- milliseconds::N - N milliseconds between events
- seconds::N - N seconds between events
- minutes::N - N minutes between events
- hours::N - N hours between events
- days::N - N days between events
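In a homogeneous Poisson process the gaps between consecutive events are exponentially distributed. The Python sketch below shows that property, assuming `interarrival` denotes the mean gap; `poisson_arrivals` is a hypothetical helper and not Beamline's implementation.

```python
import random

def poisson_arrivals(rng, mean_gap_minutes, count):
    """Return `count` arrival times (in minutes) with exponential gaps."""
    rate = 1.0 / mean_gap_minutes      # events per minute
    t = 0.0
    times = []
    for _ in range(count):
        t += rng.expovariate(rate)     # independent exponential gap
        times.append(t)
    return times

rng = random.Random(99)
arrivals = poisson_arrivals(rng, mean_gap_minutes=5.0, count=20_000)
mean_gap = arrivals[-1] / len(arrivals)   # average observed gap
```

Individual gaps vary widely, but over many events the average gap settles near the configured five minutes.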
Variable References and Scope
Variable Definition and Usage
rand_processes::{
$arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
// Define variables at top level
$customer_id: UUID,
$price_range: UniformDecimal::{ low: 9.99, high: 199.99 },
orders: rand_process::{
$data: {
customer: $customer_id, // Reference variable
price: $price_range, // Reference variable
order_id: UUID // Direct generator
}
}
}
Forced Evaluation with ::()
Force generator evaluation at script read time (not generation time):
rand_processes::{
$id_gen: UUID,
customers: 3::[
{
// Each customer gets the same ID across all their records
$id: $id_gen::(), // Evaluated once per customer
customer_profile: static_data::{
$data: {
id: $id, // Same ID for this customer
name: LoremIpsumTitle
}
},
transactions: rand_process::{
$data: {
customer_id: $id, // Same ID for this customer
transaction_id: UUID, // New UUID per transaction
amount: UniformDecimal::{ low: 10.0, high: 500.0 }
}
}
}
]
}
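The difference between referencing a generator and forcing it is similar to the difference between holding a function and holding its result. The Python sketch below is an analogy only; `uuid_gen`, `lazy`, and `forced` are illustrative names.

```python
import random
import uuid

rng = random.Random(3)

def uuid_gen():
    """Produce a fresh version-4 UUID string from the seeded RNG."""
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))

lazy = uuid_gen        # like `$id_gen`: re-evaluated on every use
forced = uuid_gen()    # like `$id_gen::()`: evaluated once, then a constant

lazy_records = [lazy() for _ in range(3)]      # three different IDs
forced_records = [forced for _ in range(3)]    # the same ID three times
```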
Loop Index Variable $@n
Access the current loop index in array definitions:
$n: UniformU8::{ low: 3, high: 7 },
clients: $n::[
{
'client_$@n': rand_process::{ // Dynamic dataset name
$data: {
client_number: '$@n', // Current index as value
name: Format::{ pattern: "Client #{$@n}" }
}
}
}
]
Some Real Examples
Simple Sensor Data
rand_processes::{
$n: UniformU8::{ low: 2, high: 10 },
sensors: $n::[
rand_process::{
$r: Uniform::{ choices: [5,10] },
$arrival: HomogeneousPoisson::{ interarrival: minutes::$r },
$data: {
tick: Tick,
i8: UniformI8,
f: UniformF64
}
}
]
}
Complex Statistical Data
rand_processes::{
test_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: milliseconds::100 },
$data: {
// Statistical distributions
normal_score: NormalF64::{ mean: 100.0, std_dev: 15.0 },
exponential_wait: ExpF64::{ rate: 0.1 },
weibull_lifetime: WeibullF64::{ shape: 2.0, scale: 1000.0 },
// Arrays with statistical elements
measurements: UniformArray::{
min_size: 5,
max_size: 10,
element_type: NormalF64::{ mean: 50.0, std_dev: 5.0 }
},
// Union types
mixed_value: UniformAnyOf::{
types: [
NormalF64::{ mean: 0.0, std_dev: 1.0 },
UniformI32::{ low: 1, high: 100 },
UUID
]
}
}
}
}
Static and Dynamic Data Combination
rand_processes::{
$n: UniformU8::{ low: 5, high: 20 },
$id_gen: UUID,
customers: $n::[
{
$id: $id_gen::(), // One ID per customer
// Static customer data (generated once)
customer_table: static_data::{
$data: {
id: $id,
address: Format::{ pattern: "{$@n} Main Street" }
}
},
// Dynamic order data (generated over time)
orders: rand_process::{
$r: UniformU8::{ low: 1, high: 30 },
$arrival: HomogeneousPoisson::{ interarrival: days::$r },
$data: {
customer_id: $id,
order_id: UUID,
timestamp: Instant
}
}
}
]
}
Probability Distribution Support
Beamline can generate data from probability distributions, which makes it useful for AI model training and statistical simulation:
Available Distributions
- Normal Distribution: NormalF64::{ mean: μ, std_dev: σ }
- Log-Normal Distribution: LogNormalF64::{ location: μ, scale: σ }
- Exponential Distribution: ExpF64::{ rate: λ }
- Weibull Distribution: WeibullF64::{ shape: k, scale: λ }
- Uniform Distribution: All Uniform* generators use uniform distribution
AI Model Training Applications
rand_processes::{
training_data: rand_process::{
$arrival: HomogeneousPoisson::{ interarrival: milliseconds::1 },
$data: {
// Features following realistic distributions
age: NormalF64::{ mean: 35.0, std_dev: 12.0 },
income: LogNormalF64::{ location: 10.5, scale: 0.5 },
response_time: ExpF64::{ rate: 0.1 },
// Categorical features
category: Uniform::{ choices: ["A", "B", "C", "D"] },
// Additional numeric feature (drawn independently of the others in this example)
experience_years: NormalF64::{ mean: 8.0, std_dev: 5.0 },
// Target variable (could be based on features)
target: Bool::{ p: 0.3 }
}
}
}
Next Steps
Now that you understand the data generation overview, explore specific aspects:
- Generator Types - Detailed guide to all available generators
- Static Data - Using static_data for reference tables
- Output Formats - Understanding different output formats
- Nullability - Controlling NULL and MISSING values
- Scripts - Advanced Ion script techniques
- Datasets - Working with multiple datasets and relationships