Datasets and Collections

Datasets in Beamline represent collections of related data records that share the same structure. Understanding how to design, organize, and work with multiple datasets is essential for creating realistic data generation scenarios.

What are Datasets?

A dataset is a named collection of records that share a common schema. In Ion scripts, datasets are defined as top-level keys within the rand_processes structure:

rand_processes::{
    users: rand_process::{ /* ... */ },        // "users" dataset
    orders: rand_process::{ /* ... */ },       // "orders" dataset  
    products: static_data::{ /* ... */ }       // "products" dataset
}

Each dataset becomes a separate data collection in the output, whether in text format, Ion format, or database generation.

Single Dataset Scripts

Basic Single Dataset

rand_processes::{
    sensors: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
        $data: {
            sensor_id: UUID,
            temperature: NormalF64::{ mean: 22.0, std_dev: 3.0 },
            humidity: UniformF64::{ low: 30.0, high: 80.0 },
            timestamp: Instant
        }
    }
}

Output characteristics:

  • Single dataset named "sensors"
  • All records have the same structure
  • Records generated according to arrival process
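A homogeneous Poisson process has exponentially distributed gaps between arrivals. To build intuition for what `interarrival: minutes::5` produces (assuming it specifies the mean gap), here is a quick illustration in plain Python, not Beamline code:

```python
import random

random.seed(42)
mean_minutes = 5.0

# Homogeneous Poisson process: i.i.d. exponential interarrival times.
# expovariate takes the rate (1/mean), so the sample mean should be ~5.
gaps = [random.expovariate(1.0 / mean_minutes) for _ in range(10_000)]
print(round(sum(gaps) / len(gaps), 2))  # close to 5.0
```

Individual gaps vary widely (many short, a few long), but the long-run average matches the configured interarrival time.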

Multiple Dataset Scripts

Independent Datasets

Create multiple unrelated datasets in the same script:

rand_processes::{
    // User activity dataset
    user_events: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::30 },
        $data: {
            user_id: UUID,
            event_type: Uniform::{ choices: ["login", "logout", "click", "purchase"] },
            timestamp: Instant
        }
    },

    // System metrics dataset  
    system_metrics: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::60 },
        $data: {
            metric_name: Uniform::{ choices: ["cpu", "memory", "disk", "network"] },
            value: UniformF64::{ low: 0.0, high: 100.0 },
            timestamp: Instant
        }
    },

    // Configuration dataset (static)
    app_config: static_data::{
        $data: {
            config_key: Uniform::{ choices: ["max_users", "timeout", "retry_count"] },
            config_value: UniformAnyOf::{ types: [UniformI32::{ low: 1, high: 1000 }, Bool] }
        }
    }
}

Related Datasets

Create datasets that share common identifiers or generators:

rand_processes::{
    // Shared generators
    $user_id: UUID,
    $session_id: UUID,

    // User profiles (static)
    users: static_data::{
        $data: {
            user_id: $user_id,
            username: Format::{ pattern: "user_{UUID}" },
            created_at: Date
        }
    },

    // User sessions (dynamic)
    sessions: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: hours::2 },
        $data: {
            session_id: $session_id,
            user_id: $user_id,  // Links to users dataset
            start_time: Instant,
            duration_minutes: UniformU16::{ low: 5, high: 180 }
        }
    },

    // Session events (dynamic) 
    session_events: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::3 },
        $data: {
            event_id: UUID,
            session_id: $session_id,  // Links to sessions dataset
            event_type: Uniform::{ choices: ["page_view", "click", "scroll", "exit"] },
            timestamp: Instant
        }
    }
}

Complex Dataset Relationships

Dynamic Dataset Creation with Loops

From the real client-service.ion test script:

rand_processes::{
    // Generate between 5 & 20 customers
    $n: UniformU8::{ low: 5, high: 20 },

    // Shared ID generators
    $id_gen: UUID,
    $rid_gen: UUID,

    requests: $n::[
        // Each iteration creates datasets for customer $@n
        {
            // Unique ID per customer
            $id: $id_gen::(),
            $rate: UniformF64::{ low: 0.995, high: 1.0 },
            $success: Bool::{ p: $rate },

            // Service dataset - shared by all customers
            service: rand_process::{
                $r: UniformU8::{ low: 20, high: 150 },
                $arrival: HomogeneousPoisson::{ interarrival: milliseconds::$r },
                $data: {
                    Request: $rid_gen,
                    StartTime: Instant,
                    Program: "FancyService", 
                    Operation: "GetMyData",
                    Account: $id,
                    client: Format::{ pattern: "customer #{$@n}" },
                    success: $success
                }
            },

            // Individual client dataset - one per customer
            'client_{$@n}': rand_process::{
                $arrival: HomogeneousPoisson::{ interarrival: milliseconds::$r },
                $data: {
                    id: $id,
                    request_time: Instant,
                    request_id: $rid_gen,
                    success: $success
                }
            }
        }
    ]
}

This creates:

  • 1 service dataset: Shared across all customers
  • N client datasets: client_0, client_1, client_2, etc.
  • Shared variables: Same request IDs, customer IDs, success rates

Output Example

$ beamline gen data \
    --seed 100 \
    --start-auto \
    --script-path client-service.ion \
    --sample-count 20 \
    --output-format text

Seed: 100
Start: 2024-01-01T00:00:00Z
[2024-01-01 00:00:10.123] : "service" { 'Request': 'req-001', 'Account': 'customer-abc', 'client': 'customer #0' }
[2024-01-01 00:00:10.124] : "client_0" { 'id': 'customer-abc', 'request_id': 'req-001' }
[2024-01-01 00:00:15.456] : "service" { 'Request': 'req-002', 'Account': 'customer-def', 'client': 'customer #1' }
[2024-01-01 00:00:15.457] : "client_1" { 'id': 'customer-def', 'request_id': 'req-002' }
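One way to confirm the one-service / N-clients split is to tally dataset names in the text output. A minimal sketch, assuming the `[timestamp] : "name" { ... }` line shape shown above (the sample lines are abbreviated):

```python
import re
from collections import Counter

# Matches text-format lines like: [2024-01-01 00:00:10.123] : "service" { ... }
DATASET_LINE = re.compile(r'^\[[^\]]+\]\s*:\s*"([^"]+)"')

def dataset_counts(lines):
    """Tally generated records per dataset from text-format output lines."""
    counts = Counter()
    for line in lines:
        match = DATASET_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts

sample = [
    '[2024-01-01 00:00:10.123] : "service" { \'Request\': \'req-001\' }',
    '[2024-01-01 00:00:10.124] : "client_0" { \'id\': \'customer-abc\' }',
    '[2024-01-01 00:00:15.456] : "service" { \'Request\': \'req-002\' }',
]
print(dataset_counts(sample))  # Counter({'service': 2, 'client_0': 1})
```

Piping a real run through this (e.g. `beamline gen data ... | python tally.py`) quickly shows whether the service dataset is collecting records from every client as expected.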

Dataset Filtering

CLI Dataset Selection

Generate data for specific datasets only:

# Generate all datasets
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path multi_dataset.ion \
  --sample-count 100

# Generate only specific datasets
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path multi_dataset.ion \
  --sample-count 100 \
  --dataset users \
  --dataset orders

# Generate only one dataset
beamline gen data \
  --seed 42 \
  --start-auto \
  --script-path multi_dataset.ion \
  --sample-count 100 \
  --dataset system_metrics

Use Cases for Dataset Filtering

  • Focused testing: Test specific components in isolation
  • Performance optimization: Generate only needed data
  • Development: Work with subset of complex systems
  • Incremental development: Build datasets one at a time

Dataset Design Patterns

Master-Detail Pattern

rand_processes::{
    $n_customers: UniformU8::{ low: 10, high: 50 },
    $customer_id: UUID,
    $order_id: UUID,

    customers: $n_customers::[
        {
            $id: $customer_id::(),

            // Master dataset - customer information
            customer_master: static_data::{
                $data: {
                    customer_id: $id,
                    name: LoremIpsumTitle,
                    email: Format::{ pattern: "customer{$@n}@example.com" },
                    registration_date: Date
                }
            },

            // Detail dataset - customer orders
            'customer_{$@n}_orders': rand_process::{
                $arrival: HomogeneousPoisson::{ interarrival: days::UniformU8::{ low: 1, high: 30 } },
                $data: {
                    order_id: $order_id,
                    customer_id: $id,  // Foreign key relationship
                    order_date: Instant,
                    total_amount: UniformDecimal::{ low: 10.00, high: 500.00 }
                }
            }
        }
    ]
}

Event Sourcing Pattern

rand_processes::{
    $entity_id: UUID,

    // Entity snapshots (static)
    entity_snapshots: static_data::{
        $data: {
            entity_id: $entity_id,
            entity_type: Uniform::{ choices: ["user", "order", "product"] },
            created_at: Date,
            initial_state: LoremIpsumTitle
        }
    },

    // Entity events (dynamic)
    entity_events: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::UniformU8::{ low: 5, high: 60 } },
        $data: {
            event_id: UUID,
            entity_id: $entity_id,  // Links to snapshots
            event_type: Uniform::{ choices: ["created", "updated", "deleted", "restored"] },
            timestamp: Instant,
            event_data: LoremIpsum::{ min_words: 5, max_words: 20 }
        }
    }
}

Multi-Tenant Pattern

rand_processes::{
    $n_tenants: UniformU8::{ low: 3, high: 10 },
    $tenant_id: UUID,

    tenants: $n_tenants::[
        {
            $id: $tenant_id::(),

            // Tenant configuration (static)
            'tenant_{$@n}_config': static_data::{
                $data: {
                    tenant_id: $id,
                    tenant_name: Format::{ pattern: "Tenant {$@n}" },
                    plan: Uniform::{ choices: ["basic", "premium", "enterprise"] },
                    max_users: UniformU16::{ low: 10, high: 1000 }
                }
            },

            // Tenant activity (dynamic)
            'tenant_{$@n}_activity': rand_process::{
                $arrival: HomogeneousPoisson::{ interarrival: minutes::UniformU8::{ low: 1, high: 30 } },
                $data: {
                    activity_id: UUID,
                    tenant_id: $id,
                    activity_type: Uniform::{ choices: ["login", "api_call", "data_export", "config_change"] },
                    timestamp: Instant,
                    user_count: UniformU16::{ low: 1, high: 100 }
                }
            }
        }
    ]
}

Dataset Analysis and Inspection

Examining Generated Datasets

# Generate multi-dataset output
beamline gen data \
  --seed 123 \
  --start-auto \
  --script-path complex_system.ion \
  --sample-count 1000 \
  --output-format ion-pretty > output.ion

# Extract dataset names and record counts
jq -r '.data | keys[]' output.ion  # List all dataset names
jq '.data.users | length' output.ion  # Count records in users dataset
jq '.data | to_entries[] | "\(.key): \(.value | length) records"' output.ion  # All counts

Database Catalog Analysis

# Generate database
beamline gen db beamline-lite \
  --seed 456 \
  --start-auto \
  --script-path multi_dataset.ion \
  --sample-count 5000

# Analyze generated datasets
ls -la beamline-catalog/*.ion | grep -v shape  # List data files
for f in beamline-catalog/*.ion; do
  if [[ "$f" != *".shape.ion" ]]; then
    echo "$(basename "$f" .ion): $(wc -l < "$f") records"
  fi
done

Schema Comparison Across Datasets

# Compare schemas of related datasets
diff beamline-catalog/client_0.shape.sql beamline-catalog/client_1.shape.sql
# Should be identical for datasets created from the same template

# Compare different dataset schemas
diff beamline-catalog/users.shape.sql beamline-catalog/orders.shape.sql
# Should differ, since the datasets have different structures
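With many templated datasets, pairwise diff gets tedious. A small Python sketch, assuming the catalog layout shown above, can check that every matching schema file is byte-identical:

```python
from pathlib import Path

def schemas_identical(catalog, pattern="client_*.shape.sql"):
    """True when every schema file matching `pattern` has identical content."""
    files = sorted(Path(catalog).glob(pattern))
    if len(files) < 2:
        return True  # nothing to compare
    reference = files[0].read_bytes()
    return all(f.read_bytes() == reference for f in files[1:])

# Usage: schemas_identical("beamline-catalog")
# -> True when all client_* schemas match client_0's
```

Any `False` result points to a templated dataset whose shape drifted from its siblings, which usually indicates a scripting bug.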

Advanced Dataset Patterns

Hierarchical Data Modeling

rand_processes::{
    $n_orgs: UniformU8::{ low: 2, high: 5 },
    $n_depts_per_org: UniformU8::{ low: 3, high: 8 },
    $n_users_per_dept: UniformU8::{ low: 5, high: 20 },

    organizations: $n_orgs::[
        {
            $org_id: UUID::(),

            // Organization master data
            'org_{$@n}': static_data::{
                $data: {
                    org_id: $org_id,
                    org_name: Format::{ pattern: "Organization {$@n}" },
                    industry: Uniform::{ choices: ["Tech", "Finance", "Healthcare", "Retail"] }
                }
            },

            // Departments within organization
            departments: $n_depts_per_org::[
                {
                    $dept_id: UUID::(),

                    'org_{$@n}_dept_{$@n}': static_data::{
                        $data: {
                            dept_id: $dept_id,
                            org_id: $org_id,
                            dept_name: Uniform::{ choices: ["Engineering", "Sales", "Marketing", "HR"] }
                        }
                    },

                    // Users within department
                    'org_{$@n}_dept_{$@n}_users': $n_users_per_dept::[
                        rand_process::{
                            $arrival: HomogeneousPoisson::{ interarrival: hours::UniformU8::{ low: 8, high: 24 } },
                            $data: {
                                user_id: UUID,
                                dept_id: $dept_id,
                                org_id: $org_id,
                                activity_type: Uniform::{ choices: ["work", "meeting", "break", "training"] },
                                timestamp: Instant
                            }
                        }
                    ]
                }
            ]
        }
    ]
}

Time-Series Dataset Families

rand_processes::{
    $n_sensors: UniformU8::{ low: 5, high: 15 },
    $sensor_id: UUID,

    sensors: $n_sensors::[
        {
            $id: $sensor_id::(),
            $location: Format::{ pattern: "Location-{$@n}" },

            // Sensor metadata (static)
            'sensor_{$@n}_metadata': static_data::{
                $data: {
                    sensor_id: $id,
                    location: $location,
                    sensor_type: Uniform::{ choices: ["temperature", "humidity", "pressure"] },
                    calibration_date: Date
                }
            },

            // Regular sensor readings (dynamic)
            'sensor_{$@n}_readings': rand_process::{
                $arrival: HomogeneousPoisson::{ interarrival: seconds::30 },
                $data: {
                    sensor_id: $id,
                    reading_time: Instant,
                    value: NormalF64::{ mean: 22.0, std_dev: 5.0 },
                    quality: Uniform::{ choices: ["good", "fair", "poor"] }
                }
            },

            // Sensor alerts (dynamic, infrequent)
            'sensor_{$@n}_alerts': rand_process::{
                $arrival: HomogeneousPoisson::{ interarrival: hours::UniformU8::{ low: 6, high: 48 } },
                $data: {
                    alert_id: UUID,
                    sensor_id: $id,
                    alert_type: Uniform::{ choices: ["high_value", "low_value", "malfunction", "maintenance"] },
                    timestamp: Instant,
                    severity: Uniform::{ choices: [1, 2, 3, 4, 5] }
                }
            }
        }
    ]
}

Dataset Output in Different Formats

Text Format Multi-Dataset Output

$ beamline gen data \
    --seed 999 \
    --start-auto \
    --script-path multi_dataset.ion \
    --sample-count 20 \
    --output-format text

# Datasets are interleaved by timestamp
[2024-01-01 00:00:00.000] : "config" { 'key': 'timeout', 'value': 30 }
[2024-01-01 00:00:00.000] : "config" { 'key': 'max_users', 'value': 1000 }
[2024-01-01 00:02:15.123] : "users" { 'user_id': 'abc-123', 'action': 'login' }
[2024-01-01 00:03:45.456] : "metrics" { 'metric': 'cpu', 'value': 45.6 }
[2024-01-01 00:04:30.789] : "users" { 'user_id': 'def-456', 'action': 'click' }

Ion Pretty Multi-Dataset Output

{
  seed: 999,
  start: "2024-01-01T00:00:00Z",
  data: {
    config: [
      { key: "timeout", value: 30 },
      { key: "max_users", value: 1000 }
    ],
    users: [
      { user_id: "abc-123", action: "login", timestamp: 2024-01-01T00:02:15.123Z },
      { user_id: "def-456", action: "click", timestamp: 2024-01-01T00:04:30.789Z }
    ],
    metrics: [
      { metric: "cpu", value: 45.6, timestamp: 2024-01-01T00:03:45.456Z }
    ]
  }
}

Database Generation Multi-Dataset Files

$ beamline gen db beamline-lite \
    --seed 42 \
    --start-auto \
    --script-path client_service.ion \
    --sample-count 1000

$ ls beamline-catalog/
.beamline-manifest
.beamline-script
service.ion              # Service dataset data
service.shape.ion        # Service dataset schema  
service.shape.sql        # Service dataset SQL
client_0.ion            # Client 0 dataset data
client_0.shape.ion      # Client 0 dataset schema
client_0.shape.sql      # Client 0 dataset SQL
client_1.ion            # Client 1 dataset data
client_1.shape.ion      # Client 1 dataset schema
client_1.shape.sql      # Client 1 dataset SQL
...                     # More client datasets

Dataset Naming Best Practices

1. Use Descriptive Names

// Good - descriptive dataset names
user_profiles: static_data::{ /* ... */ },
user_activity_events: rand_process::{ /* ... */ },
system_performance_metrics: rand_process::{ /* ... */ }

// Avoid - generic names
data1: static_data::{ /* ... */ },
stuff: rand_process::{ /* ... */ }

2. Follow Consistent Naming Conventions

// Consistent naming pattern
user_profiles: static_data::{ /* ... */ },
user_sessions: rand_process::{ /* ... */ },
user_events: rand_process::{ /* ... */ },

order_master: static_data::{ /* ... */ },
order_items: rand_process::{ /* ... */ },
order_payments: rand_process::{ /* ... */ }
3. Group Related Datasets with Prefixes

// Group related datasets with prefixes
$n: UniformU8::{ low: 5, high: 10 },

services: $n::[
    {
        'service_{$@n}_config': static_data::{ /* ... */ },
        'service_{$@n}_requests': rand_process::{ /* ... */ },
        'service_{$@n}_responses': rand_process::{ /* ... */ },
        'service_{$@n}_errors': rand_process::{ /* ... */ }
    }
]

Performance Considerations

Dataset Count Impact

  • Few datasets (1-5): Minimal overhead
  • Many datasets (10-50): Slight memory overhead for tracking
  • Dynamic datasets (100+): Significant memory for metadata

Dataset Size Balance

// Balanced approach - mix of small and large datasets
rand_processes::{
    // Small reference dataset
    config: static_data::{ $data: { /* small config */ } },

    // Medium operational dataset  
    users: rand_process::{ /* moderate activity */ },

    // Large transaction dataset
    transactions: rand_process::{ /* high frequency */ }
}

Memory Usage with Multiple Datasets

# Monitor memory usage with many datasets
time beamline gen data \
  --seed 1 \
  --start-auto \
  --script-path many_datasets.ion \
  --sample-count 10000

# Use dataset filtering to reduce memory
beamline gen data \
  --seed 1 \
  --start-auto \
  --script-path many_datasets.ion \
  --sample-count 10000 \
  --dataset important_dataset_only

Integration Workflows

Dataset-Specific Processing

#!/bin/bash
# process-datasets.sh

SCRIPT="multi_system.ion"
SEED=12345

# Generate full dataset
beamline gen data \
  --seed $SEED \
  --start-auto \
  --script-path $SCRIPT \
  --sample-count 10000 \
  --output-format ion-pretty > full_data.ion

# Extract individual datasets for processing
jq '.data.users' full_data.ion > users_only.json
jq '.data.orders' full_data.ion > orders_only.json  
jq '.data.metrics' full_data.ion > metrics_only.json

echo "Datasets extracted for individual processing"

Cross-Dataset Validation

# Generate related datasets
beamline gen data \
  --seed 999 \
  --start-auto \
  --script-path related_data.ion \
  --sample-count 5000 \
  --output-format ion-pretty > related_data.ion

# Validate relationships
jq '.data.orders[].customer_id' related_data.ion | sort -u > order_customers.txt
jq '.data.users[].user_id' related_data.ion | sort -u > all_customers.txt

# Check referential integrity
comm -23 order_customers.txt all_customers.txt  # Orders with invalid customer IDs (should be empty)
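The same referential-integrity check can be scripted in Python once the output is available as plain dictionaries (e.g. after converting the Ion output to JSON). A sketch with hypothetical inline data shaped like the generated `data` section:

```python
def dangling_foreign_keys(data, child, fk, parent, pk):
    """Return foreign-key values in `child` records with no matching `parent` record."""
    parent_ids = {rec[pk] for rec in data[parent]}
    return sorted({rec[fk] for rec in data[child]} - parent_ids)

# Hypothetical records; a real run would load the generated output instead
data = {
    "users":  [{"user_id": "u1"}, {"user_id": "u2"}],
    "orders": [{"customer_id": "u1"}, {"customer_id": "u3"}],
}
print(dangling_foreign_keys(data, "orders", "customer_id", "users", "user_id"))  # ['u3']
```

An empty result means every order references a known user; anything else lists the orphaned IDs, which is more actionable than the raw `comm` output.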

Troubleshooting Multi-Dataset Scripts

Issue: Missing Datasets in Output

Cause: Dataset filtering or script errors

Solution:

# Check all available datasets
beamline infer-shape --seed 1 --start-auto --script-path script.ion --output-format text

# Generate without filtering
beamline gen data --seed 1 --start-auto --script-path script.ion --sample-count 5

Issue: Uneven Dataset Sizes

Cause: Different arrival rates or loop counts

Solution:

// Check arrival rates in your script
// Adjust interarrival times to balance dataset sizes
$arrival1: HomogeneousPoisson::{ interarrival: seconds::1 },   // Frequent
$arrival2: HomogeneousPoisson::{ interarrival: minutes::1 },   // Less frequent

Issue: Memory Issues with Many Datasets

Solution:

# Use dataset filtering
beamline gen data --script-path many.ion --dataset important_one --dataset important_two

# Or generate datasets separately
beamline gen data --script-path script.ion --dataset batch_1 --sample-count 10000
beamline gen data --script-path script.ion --dataset batch_2 --sample-count 10000

Next Steps

  • Scripts - Advanced Ion scripting techniques for complex datasets
  • Output Formats - How datasets appear in different output formats
  • Examples - See complete multi-dataset examples in action
  • Database Guide - Working with dataset catalogs and databases