Core Concepts
Before diving deeper into Beamline’s advanced features, it’s essential to understand the fundamental concepts that power its data generation capabilities. This chapter will introduce you to the mathematical and computational foundations that make Beamline both powerful and reliable.
Stochastic Processes
At the heart of Beamline lies the concept of stochastic processes — mathematical models that describe systems appearing to vary randomly over time.
What is a Stochastic Process?
A stochastic process is a collection of random variables indexed by time or space. In simpler terms, it is a way to model how things change randomly over time while still following certain patterns or rules.
Real-world examples:
- Stock prices over time
- Sensor readings from IoT devices
- User activity on a website
- Network traffic patterns
- Temperature measurements
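The difference between "purely random" and "randomly evolving over time" can be sketched outside Beamline in a few lines of Python (illustrative only, not Beamline code): independent samples ignore history, while a random walk, one of the simplest stochastic processes, builds each value from the previous one.

```python
import random

rng = random.Random(7)

# Independent samples: each value ignores all previous ones.
iid = [rng.uniform(-1.0, 1.0) for _ in range(5)]

# A simple stochastic process (random walk): each value depends
# on the previous one, so the series evolves gradually over time.
walk = [0.0]
for _ in range(4):
    walk.append(walk[-1] + rng.uniform(-1.0, 1.0))
```

Plotting the two series makes the contrast obvious: the i.i.d. samples jump around freely, while consecutive random-walk values never differ by more than one step.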
Why Stochastic Processes Matter
Traditional random data generators often produce data that looks random but lacks the realistic patterns found in real-world data. Stochastic processes allow Beamline to:
- Model Temporal Relationships: Data points aren’t just random — they follow realistic time-based patterns
- Create Correlations: Different data elements can be related in meaningful ways
- Simulate Real Patterns: Generate data that behaves like real-world systems
- Maintain Consistency: Ensure generated data follows logical rules and constraints
Example: Sensor Data
Consider a temperature sensor:
- Simple Random: Each reading is completely independent
- Stochastic Process: Readings follow realistic patterns (gradual changes, daily cycles, seasonal trends)
// Simple random (unrealistic: any value in the range is equally likely)
temperature: UniformF64::{ low: -10.0, high: 40.0 }
// Closer to reality: readings cluster around a typical value
temperature: NormalF64::{ mean: 22.0, std_dev: 5.0 }
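To see what a "daily cycle plus noise" pattern looks like concretely, here is a hedged Python sketch of such a sensor model (the 22 °C mean, 5 °C cycle amplitude, and noise level are arbitrary illustration values, not anything Beamline defines):

```python
import math
import random

rng = random.Random(0)

def temperature_at(hour: float) -> float:
    """Mean 22 °C plus a sinusoidal daily cycle and small Gaussian noise."""
    daily_cycle = 5.0 * math.sin(2 * math.pi * (hour - 9.0) / 24.0)
    return 22.0 + daily_cycle + rng.gauss(0.0, 0.5)

# One simulated day of hourly readings.
readings = [temperature_at(h) for h in range(24)]
```

Unlike independent uniform draws, neighboring readings here differ only slightly, which is the kind of temporal structure real sensors exhibit.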
Random Processes in Beamline
Beamline implements stochastic processes through random processes defined in scripts written in the Amazon Ion format.
Anatomy of a Random Process
rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: minutes::5 },
$data: {
// Data structure definition
}
}
Every random process has two key components:
- Arrival Process ($arrival): Defines the statistical pattern of new data arrivals, i.e., when the data arrives
- Data Structure ($data): Defines what data is generated
Arrival Processes
Arrival processes control the timing of data generation. At the moment, Beamline supports only the homogeneous Poisson process:
Homogeneous Poisson Process
The most common arrival process, modeling events that occur at a constant average rate:
$arrival: HomogeneousPoisson:: { interarrival: minutes::5 }
Characteristics:
- Events occur independently
- Average rate is constant over time
- Time between events follows an exponential distribution
- Models many real-world phenomena (customer arrivals, system events, etc.)
Use cases:
- Web server requests
- Sensor readings
- User logins
- System alerts
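The "exponential interarrival" characteristic above is easy to verify with a small Python simulation (illustrative only; Beamline's internals may differ): drawing interarrival gaps from an exponential distribution with mean 5 reproduces a homogeneous Poisson process whose average gap matches the configured `interarrival: minutes::5`.

```python
import random

rng = random.Random(42)
mean_interarrival = 5.0  # minutes, mirroring interarrival: minutes::5

# Homogeneous Poisson process: interarrival times are i.i.d.
# exponential with mean equal to the configured interarrival time.
arrivals = []
t = 0.0
for _ in range(1000):
    t += rng.expovariate(1.0 / mean_interarrival)
    arrivals.append(t)

observed_mean = arrivals[-1] / len(arrivals)  # should be close to 5.0
```

Over 1000 events the observed mean gap converges on 5 minutes, while individual gaps still vary widely, exactly the mix of regularity and randomness that makes the Poisson process a good model for requests, logins, and alerts.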
Time Units
Beamline supports various time units for arrival processes:
// Different time units
seconds::30 // 30 seconds
minutes::5 // 5 minutes
hours::2 // 2 hours
days::1 // 1 day
milliseconds::100 // 100 milliseconds
Data Generators
Data generators define the structure and content of generated data. They use probability distributions to create realistic values.
Probability Distributions
Beamline supports many probability distributions, each suited for different types of data:
Uniform Distributions
Generate values where each value in a range is equally likely:
// Discrete uniform (integers)
age: UniformU8::{ low: 18, high: 65 }
// Continuous uniform (floats)
temperature: UniformF64::{ low: 20.0, high: 30.0 }
// Uniform choice from literals
status: Uniform::{ choices: ["active", "inactive", "pending"] }
Use cases:
- IDs, categories, discrete choices
- Baseline random values
- Testing edge cases
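For intuition, the three uniform generators above map directly onto standard-library sampling in Python (an analogy only, not how Beamline is implemented):

```python
import random

rng = random.Random(1)

age = rng.randint(18, 65)                # discrete uniform, like UniformU8
temperature = rng.uniform(20.0, 30.0)    # continuous uniform, like UniformF64
status = rng.choice(["active", "inactive", "pending"])  # uniform choice
```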
Normal (Gaussian) Distributions
Generate values that cluster around a mean with a bell-curve distribution:
height: NormalF64::{ mean: 170.0, std_dev: 10.0 }
Characteristics:
- Most values near the mean
- Symmetric distribution
- Models many natural phenomena
Use cases:
- Physical measurements (height, weight)
- Performance metrics
- Error values
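The "most values near the mean" characteristic is the familiar 68-95-99.7 rule: roughly 68% of samples fall within one standard deviation of the mean. A quick Python check (illustrative, not Beamline code) with the height example above:

```python
import random

rng = random.Random(3)
heights = [rng.gauss(170.0, 10.0) for _ in range(10_000)]

# Roughly 68% of samples land within one standard deviation (160-180).
within_one_sd = sum(1 for h in heights if 160.0 <= h <= 180.0) / len(heights)
```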
Other Distributions
// Exponential (for modeling wait times)
response_time: ExpF64::{ rate: 0.1 }
// Log-normal (for modeling sizes, prices)
file_size: LogNormalF64::{ location: 10.0, scale: 1.0 }
// Weibull (for modeling lifetimes, reliability)
device_lifetime: WeibullF64::{ shape: 2.0, scale: 1000.0 }
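These three families also have direct analogues in Python's `random` module, which can help when picking parameters (a sketch for intuition only; Beamline's parameter names may not match Python's exactly):

```python
import random

rng = random.Random(5)

# Exponential: mean wait time is 1 / rate (here 1 / 0.1 = 10).
response_times = [rng.expovariate(0.1) for _ in range(5000)]
mean_response = sum(response_times) / len(response_times)

# Log-normal: exp of a Normal(location, scale); always positive, right-skewed.
file_size = rng.lognormvariate(10.0, 1.0)

# Weibull: Python's weibullvariate takes (scale, shape) in that order.
device_lifetime = rng.weibullvariate(1000.0, 2.0)
```

Note in particular that an exponential `rate` of 0.1 means an *average* response time of 10 units, a common source of confusion when tuning these generators.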
Data Types
Beamline supports the following data types:
Scalar Types
// Numbers
integer_val: UniformI32::{ low: 1, high: 1000 }
float_val: UniformF64::{ low: 0.0, high: 1.0 }
decimal_val: UniformDecimal::{ low: 1.99, high: 999.99 }
// Text
name: LoremIpsumTitle
description: LoremIpsum::{ min_words: 10, max_words: 50 }
pattern_text: Regex::{ pattern: "[A-Z]{2}[0-9]{4}" }
// Boolean
active: Bool::{ p: 0.8 } // 80% chance of true
// Temporal
created_at: Instant
birth_date: Date
// Identifiers
user_id: UUID
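To make the scalar types concrete, here is what one generated record might conceptually look like, sketched in Python (the field values and generation logic are illustrative assumptions, not Beamline's actual output format):

```python
import random
import string
import uuid

rng = random.Random(9)

record = {
    "integer_val": rng.randint(1, 1000),
    "float_val": rng.uniform(0.0, 1.0),
    # Rough analogue of Regex::{ pattern: "[A-Z]{2}[0-9]{4}" }
    "pattern_text": "".join(rng.choices(string.ascii_uppercase, k=2))
                    + "".join(rng.choices(string.digits, k=4)),
    "active": rng.random() < 0.8,   # Bool::{ p: 0.8 }: true 80% of the time
    "user_id": str(uuid.uuid4()),
}
```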
Complex Types
// Structures
user: {
id: UUID,
name: LoremIpsumTitle,
age: UniformU8::{ low: 18, high: 65 },
preferences: {
theme: Uniform::{ choices: ["light", "dark"] },
notifications: Bool::{ p: 0.7 }
}
}
// Arrays
tags: UniformArray::{
min_size: 1,
max_size: 5,
element_type: LoremIpsumTitle
}
// Union types
value: UniformAnyOf::{ types: [
UniformI32::{ low: 1, high: 100 },
LoremIpsumTitle,
Bool
]}
Variables and References
Beamline supports variables for creating relationships and reusing values:
Variable Definition
rand_processes::{
$n: UniformU8::{ low: 2, high: 10 },
sensors: $n::[
rand_process::{
$r: Uniform::{ choices: [5,10] },
$arrival: HomogeneousPoisson:: { interarrival: minutes::$r },
$weight: UniformDecimal::{ nullable: 0.75, low: 1.995, high: 4.9999, optional: true },
$anyof: UniformAnyOf::{ types: [Tick, UniformF64, UUID, UniformDecimal::{ low: 1.995, high: 4.9999, nullable: false }] },
$array: UniformArray::{
min_size: 3,
max_size: 3,
element_type: UniformDecimal::{ low: 0.5, high: 1.5 }
},
$data: {
tick: Tick,
i8: UniformI8,
f: UniformF64,
w: $weight,
d: UniformDecimal::{ low: 0d0, high: 4.2d1, nullable: false },
a: $anyof,
ar1: $array,
ar2: UniformArray::{ min_size: 2, max_size: 4, element_type: UUID },
ar3: UniformArray::{ min_size: 2, max_size: 4, element_type: $weight },
ar4: UniformArray::{ min_size: 2, max_size: 4, element_type: UniformI8::{ low: 2, high: 10 } },
ar5: UniformArray::{ min_size: 1, max_size: 1, element_type: $anyof }
}
}
],
}
Variable Types
Generator Variables
Store data generators for reuse:
$temperature_sensor: NormalF64::{ mean: 22.0, std_dev: 3.0 }
$id_gen: UUID
Value Variables
Store computed values:
$success_rate: UniformF64::{ low: 0.95, high: 1.0 },
$is_successful: Bool::{ p: $success_rate }
Evaluation Control
Control when variables are evaluated:
// Evaluate once at script read time
$user_id: $id_gen::()
// Evaluate each time it's used
$request_id: $id_gen
Datasets and Collections
Beamline organizes generated data into datasets, which represent collections of related data.
Single Dataset
rand_processes::{
sensors: rand_process::{
$data: { /* sensor data */ }
}
}
Multiple Datasets
rand_processes::{
users: rand_process::{
$data: { /* user data */ }
},
orders: rand_process::{
$data: { /* order data */ }
}
}
Dynamic Datasets
Create multiple related datasets:
rand_processes::{
$n: UniformU8::{ low: 3, high: 8 },
// Creates client_1, client_2, ..., client_n datasets
clients: $n::[
'client_{ $@n }': rand_process::{
$data: {
client_id: '$@n',
// ... other fields
}
}
]
}
Reproducibility and Determinism
One of Beamline’s key strengths is its ability to generate reproducible data.
Seeds
Seeds control the random number generation:
# Same seed = same data
beamline gen data --seed 42 --start-auto --script-path my-script.ion
beamline gen data --seed 42 --start-auto --script-path my-script.ion # Identical output
Timestamps
Control the simulation start time:
# Same timestamp = same temporal patterns
beamline gen data --seed 42 --start-iso "2024-01-01T00:00:00Z" --script-path my-script.ion
Deterministic Behavior
Beamline ensures that:
- Same inputs always produce same outputs
- Random sequences are predictable and reproducible
- Debugging is possible with consistent data
- Tests can be reliable and repeatable
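Seeded determinism works the same way as seeding any pseudo-random number generator. A Python sketch of the principle (not Beamline's implementation): the same seed replays the exact sequence, a different seed diverges.

```python
import random

def generate(seed: int, n: int = 5) -> list[float]:
    """Deterministic sequence of pseudo-random floats for a given seed."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, 1.0) for _ in range(n)]

run_a = generate(42)
run_b = generate(42)   # same seed -> identical output
run_c = generate(43)   # different seed -> different output
```

This is what makes `--seed 42` safe to bake into test suites: a failure can always be reproduced by rerunning with the same seed.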
Static vs. Dynamic Data
Beamline supports both static and dynamic data generation:
Dynamic Data (Default)
Generated during simulation with temporal patterns:
rand_process::{
$arrival: HomogeneousPoisson:: { interarrival: minutes::5 },
$data: {
timestamp: Instant,
value: UniformF64
}
}
Static Data
Generated once at the beginning of simulation:
static_data::{
$data: {
id: UUID,
created_at: Instant, // Will be simulation start time
config: LoremIpsum
}
}
Use cases for static data:
- Reference tables
- Configuration data
- Lookup tables
- Master data
Summary
Understanding these core concepts is crucial for effectively using Beamline:
- Stochastic Processes: Mathematical foundation for realistic data patterns
- Random Processes: Implementation of stochastic processes in Beamline
- Arrival Processes: Control timing of data generation
- Data Generators: Create realistic values using probability distributions
- Variables: Enable relationships and reuse in data generation
- Datasets: Organize generated data into meaningful collections
- Reproducibility: Ensure consistent, debuggable data generation
- Static vs. Dynamic: Choose appropriate data generation patterns
In the next chapter, we’ll dive deeper into scripts and random processes, exploring how to create more sophisticated data generation patterns and relationships.