Shape Inference

Shape inference is the process of analyzing Ion scripts to determine the data types and structures that would be generated, without actually generating data. Because no data is produced, it is extremely fast, which makes it useful for schema validation, database preparation, and understanding data models.

Shape Inference Command

Basic Usage

The infer-shape command requires the same core parameters as data generation:

beamline infer-shape \
    --seed-auto \
    --start-auto \
    --script-path sensors.ion

Even though no data is generated, the seed and start time are still required because they can affect type inference for certain dynamic generators.

With Specific Parameters

# Use specific seed for reproducible shape inference
beamline infer-shape \
    --seed 12345 \
    --start-iso "2024-01-01T00:00:00Z" \
    --script-path complex_schema.ion \
    --output-format basic-ddl

Output Format Analysis

Text Format (Detailed Debug)

From the README example:

$ beamline infer-shape \
    --seed-auto \
    --start-auto \
    --script-path sensors.ion

Seed: 17685918364143248531
Start: 2022-12-12T19:52:29.000000000Z
{
    "sensors": PartiqlType(
        Bag(
            BagType {
                element_type: PartiqlType(
                    Struct(
                        StructType {
                            constraints: {
                                Fields(
                                    {
                                        StructField {
                                            name: "d",
                                            ty: PartiqlType(
                                                DecimalP(2, 0),
                                            ),
                                        },
                                        StructField {
                                            name: "f",
                                            ty: PartiqlType(
                                                Float64,
                                            ),
                                        },
                                        // ... more fields
                                    },
                                ),
                            },
                        },
                    ),
                ),
            },
        ),
    ),
}

Understanding the structure:

  • Bag: Collection of records (dataset)
  • BagType: Type information for the bag
  • Struct: Each record is a structured object
  • StructField: Individual field definitions with names and types
  • PartiqlType: Specific type information (DecimalP, Float64, etc.)
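The nested debug tree reads more easily once you see the pattern: each dataset is a bag whose element is a struct of named, typed fields. A rough illustrative model in Python (this mirrors the structure above, not beamline's actual internal types):

```python
# Simplified model of the inferred shape: a bag of structs with typed fields.
shape = {
    "sensors": {
        "kind": "bag",
        "element": {
            "kind": "struct",
            "fields": {
                "d": "decimal(2, 0)",
                "f": "float64",
                # ... more fields
            },
        },
    },
}

def summarize(shapes):
    """Render each dataset as 'name: bag<struct{field: type, ...}>'."""
    lines = []
    for name, bag in shapes.items():
        fields = ", ".join(
            f"{f}: {t}" for f, t in bag["element"]["fields"].items()
        )
        lines.append(f"{name}: bag<struct{{{fields}}}>")
    return lines

print(summarize(shape))
```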

Basic DDL Format (SQL Ready)

From the README example:

$ beamline infer-shape \
    --seed 7844265201457918498 \
    --start-auto \
    --script-path sensors-nested.ion \
    --output-format basic-ddl

-- Seed: 7844265201457918498
-- Start: 2024-01-01T06:53:06.000000000Z
-- Syntax: partiql_datatype_syntax.0.1
-- Dataset: sensors
"f" DOUBLE,
"i8" INT8,
"id" INT,
"sub" STRUCT<"f": DOUBLE,"o": INT8>,
"tick" INT8

Format characteristics:

  • Comments: Metadata about generation parameters
  • Dataset headers: Clear separation between datasets
  • SQL-ready: Can be used directly in CREATE TABLE statements
  • Type precision: Specific SQL types with precision for decimals
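Because the basic-ddl output is already a comma-separated field list, wrapping it in a CREATE TABLE statement takes only a little text processing. A minimal sketch; the heredoc stands in for the real `infer-shape` call, and the table name is copied manually from the dataset header:

```shell
# Wrap a basic-ddl field list in a CREATE TABLE statement.
# The heredoc below stands in for:
#   beamline infer-shape ... --output-format basic-ddl
ddl=$(cat <<'EOF'
-- Seed: 7844265201457918498
-- Dataset: sensors
"f" DOUBLE,
"i8" INT8,
"id" INT,
"tick" INT8
EOF
)

# Drop the comment lines, then wrap what remains.
fields=$(printf '%s\n' "$ddl" | grep -v '^--')
printf 'CREATE TABLE sensors (\n%s\n);\n' "$fields"
```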

Beamline JSON Format (Tool Integration)

From the README example:

$ beamline infer-shape \
    --seed-auto \
    --start-auto \
    --script-path sensors.ion \
    --output-format beamline-json

{
  seed: -3711181901898679775,
  start: "2022-05-22T13:49:57.000000000+00:00",
  shapes: {
    sensors: partiql::shape::v0::{
      type: "bag",
      items: {
        type: "struct",
        constraints: [ordered, closed],
        fields: [
          {
            name: "d",
            type: "decimal(2, 0)"
          },
          {
            name: "f", 
            type: "double"
          },
          {
            name: "i8",
            type: "int8"
          },
          {
            name: "tick",
            type: "int8"
          },
          {
            name: "w",
            type: "decimal(5, 4)"
          }
        ]
      }
    }
  }
}

Format characteristics:

  • Structured JSON: Machine-readable format
  • Versioned: partiql::shape::v0:: indicates version
  • Complete metadata: Seeds, timestamps, and full type information
  • Tool integration: Designed for PartiQL testing tools
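Note that the output shown above is Ion-flavored rather than strict JSON (unquoted keys, the partiql::shape::v0:: annotation), so a strict JSON parser may reject it. For quick scripting, a regex can still pull out field name/type pairs; this is an assumption-laden sketch that relies on the name/type ordering shown above, not a real Ion parser:

```python
import re

# Sample trimmed from the beamline-json output above.
output = '''
fields: [
  { name: "d", type: "decimal(2, 0)" },
  { name: "f", type: "double" },
  { name: "i8", type: "int8" }
]
'''

# Pair each field name with the type that follows it.
pairs = re.findall(r'name:\s*"([^"]+)",\s*type:\s*"([^"]+)"', output)
print(dict(pairs))
```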

Advanced Shape Inference

CLI Global Defaults Impact

CLI defaults affect shape inference results:

# Infer with default nullable/optional settings
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path data.ion \
    --default-nullable false \
    --default-optional true \
    --output-format basic-ddl

From the README example showing CLI impact:

$ beamline infer-shape \
    --seed 7844265201457918498 \
    --start-auto \
    --script-path sensors.ion \
    --output-format basic-ddl \
    --default-nullable false \
    --default-optional true

-- Seed: 7844265201457918498
-- Start: 2024-01-18T11:40:34.000000000Z
-- Syntax: partiql_datatype_syntax-0.1
-- Dataset: sensors
"a" OPTIONAL UNION<INT8 NOT NULL,DECIMAL(5, 4) NOT NULL,DOUBLE NOT NULL,VARCHAR NOT NULL>,
"ar1" OPTIONAL ARRAY<DECIMAL(2, 1) NOT NULL> NOT NULL,
"ar2" OPTIONAL ARRAY<VARCHAR NOT NULL> NOT NULL,
"ar3" OPTIONAL ARRAY<DECIMAL(5, 4)> NOT NULL,
"ar4" OPTIONAL ARRAY<TINYINT NOT NULL> NOT NULL,
"ar5" OPTIONAL ARRAY<UNION<INT8 NOT NULL,DECIMAL(5, 4) NOT NULL,DOUBLE NOT NULL,VARCHAR NOT NULL>> NOT NULL,
"d" OPTIONAL DECIMAL(2, 0) NOT NULL,
"f" OPTIONAL DOUBLE NOT NULL,
"i8" OPTIONAL TINYINT NOT NULL,
"tick" OPTIONAL INT8 NOT NULL,
"w" OPTIONAL DECIMAL(5, 4)

Notice how the CLI defaults propagated into the schema: --default-optional true marked fields OPTIONAL, while --default-nullable false added NOT NULL.

Multi-Dataset Shape Analysis

# Analyze complex multi-dataset script
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path client-service.ion \
    --output-format basic-ddl

Example output structure:

-- Dataset: service
"Account" VARCHAR,
"Distance" DECIMAL(2, 0),
"Operation" VARCHAR,
"Program" VARCHAR,
"Request" VARCHAR,
"StartTime" TIMESTAMP,
"Weight" DECIMAL(5, 4),
"anyof" UNION<INT8,DECIMAL(5, 4)>,
"array" ARRAY<INT8>,
"client" VARCHAR,
"success" BOOL

-- Dataset: client_0
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL

-- Dataset: client_1
"id" VARCHAR,
"request_id" VARCHAR,
"request_time" TIMESTAMP,
"success" BOOL

Each dataset from the Ion script gets its own schema section.
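Splitting that combined output into one file per dataset is a short awk one-liner. A sketch over an inline sample (swap the heredoc for the real `infer-shape` call):

```shell
# Split multi-dataset basic-ddl output into per-dataset schema files.
# The heredoc stands in for the real infer-shape output.
cat <<'EOF' > all_schemas.sql
-- Dataset: service
"client" VARCHAR,
"success" BOOL
-- Dataset: client_0
"id" VARCHAR,
"success" BOOL
EOF

# Start a new output file whenever a "-- Dataset:" header appears.
awk '/^-- Dataset:/ { out = "schema_" $3 ".sql" } out { print > out }' all_schemas.sql

ls schema_*.sql
```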

Shape Inference Patterns

Script Validation Workflow

#!/bin/bash
# Validate Ion script before data generation

SCRIPT="$1"

if [ ! -f "$SCRIPT" ]; then
    echo "Script not found: $SCRIPT"
    exit 1
fi

echo "Validating Ion script: $SCRIPT"

# Test shape inference (fast validation)
if ! beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format text > /dev/null; then
    echo "❌ Script validation failed - check Ion syntax"
    exit 1
fi

echo "✅ Script syntax valid"

# Show inferred schema
echo ""
echo "Inferred schema:"
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format basic-ddl

echo ""
echo "✅ Script ready for data generation"

Schema Documentation Generation

#!/bin/bash
# Auto-generate schema documentation

SCRIPTS_DIR="$1"
OUTPUT_FILE="$2"

echo "# Data Schema Documentation" > "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"
echo "Auto-generated: $(date)" >> "$OUTPUT_FILE"
echo "" >> "$OUTPUT_FILE"

for script in "$SCRIPTS_DIR"/*.ion; do
    name=$(basename "$script" .ion)
    echo "Processing $name..."
    
    echo "## $name Data Schema" >> "$OUTPUT_FILE"
    echo "" >> "$OUTPUT_FILE"
    echo "**Source Script**: \`$(basename "$script")\`" >> "$OUTPUT_FILE"
    echo "" >> "$OUTPUT_FILE"
    
    # Add schema in SQL format
    echo '```sql' >> "$OUTPUT_FILE"
    beamline infer-shape \
        --seed 1 \
        --start-auto \
        --script-path "$script" \
        --output-format basic-ddl | grep -v "^-- Seed:" | grep -v "^-- Start:" >> "$OUTPUT_FILE"
    echo '```' >> "$OUTPUT_FILE"
    echo "" >> "$OUTPUT_FILE"
    
    # Count datasets and fields
    schema_output=$(beamline infer-shape --seed 1 --start-auto --script-path "$script" --output-format basic-ddl)
    dataset_count=$(echo "$schema_output" | grep -c "^-- Dataset:")
    field_count=$(echo "$schema_output" | grep -c '^"')
    
    echo "**Summary**: $dataset_count dataset(s), $field_count total fields" >> "$OUTPUT_FILE"
    echo "" >> "$OUTPUT_FILE"
done

echo "Schema documentation generated: $OUTPUT_FILE"

Real-World Examples

E-commerce Schema Analysis

rand_processes::{
    $n_customers: UniformU8::{ low: 10, high: 100 },
    $customer_ids: $n_customers::[UUID::()],
    
    customers: static_data::{
        $data: {
            customer_id: Uniform::{ choices: $customer_ids },
            name: LoremIpsumTitle,
            email: Format::{ pattern: "customer{UUID}@email.com" },
            age: UniformU8::{ low: 18, high: 80, optional: 0.1 }
        }
    },
    
    orders: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::30 },
        $data: {
            order_id: UUID,
            customer_id: Uniform::{ choices: $customer_ids },
            total: UniformDecimal::{ low: 10.00, high: 500.00 },
            items: UniformArray::{
                min_size: 1,
                max_size: 5,
                element_type: {
                    product_name: LoremIpsumTitle,
                    price: UniformDecimal::{ low: 5.00, high: 100.00 },
                    quantity: UniformU8::{ low: 1, high: 3 }
                }
            }
        }
    }
}

Inferred Schema:

$ beamline infer-shape --seed 1 --start-auto --script-path ecommerce.ion --output-format basic-ddl

-- Dataset: customers
"age" OPTIONAL TINYINT,
"customer_id" VARCHAR,
"email" VARCHAR,
"name" VARCHAR

-- Dataset: orders
"customer_id" VARCHAR,
"items" ARRAY<STRUCT<"price": DECIMAL(5, 2),"product_name": VARCHAR,"quantity": TINYINT>>,
"order_id" VARCHAR,
"total" DECIMAL(5, 2)

Financial Data Schema

rand_processes::{
    transactions: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: minutes::5 },
        $data: {
            transaction_id: UUID::{ nullable: false },
            account_id: UUID,
            amount: LogNormalF64::{ location: 4.0, scale: 1.0 },
            transaction_type: Uniform::{ choices: ["deposit", "withdrawal", "transfer"] },
            risk_score: UniformF64::{ low: 0.0, high: 1.0 },
            metadata: {
                merchant: LoremIpsumTitle,
                location: Regex::{ pattern: "[A-Z]{2}" },
                processing_time: UniformF64::{ low: 0.1, high: 5.0 }
            },
            compliance: {
                aml_flagged: Bool::{ p: 0.01 },
                requires_review: Bool::{ p: 0.05 },
                risk_category: Uniform::{ choices: ["low", "medium", "high"] }
            }
        }
    }
}

Inferred Schema:

-- Dataset: transactions
"account_id" VARCHAR,
"amount" DOUBLE,
"compliance" STRUCT<"aml_flagged": BOOL,"requires_review": BOOL,"risk_category": VARCHAR>,
"metadata" STRUCT<"location": VARCHAR,"merchant": VARCHAR,"processing_time": DOUBLE>,
"risk_score" DOUBLE,
"transaction_id" VARCHAR NOT NULL,
"transaction_type" VARCHAR

Shape Inference Analysis

Schema Complexity Assessment

#!/bin/bash
# Analyze schema complexity

SCRIPT="$1"

echo "Schema Complexity Analysis for: $SCRIPT"
echo "======================================"

# Get detailed shape information
schema_output=$(beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format basic-ddl)

# Count datasets
dataset_count=$(echo "$schema_output" | grep -c "^-- Dataset:")
echo "Datasets: $dataset_count"

# Count total fields
field_count=$(echo "$schema_output" | grep -c '^"')
echo "Total fields: $field_count"

# Count complex types
struct_count=$(echo "$schema_output" | grep -c "STRUCT<")
array_count=$(echo "$schema_output" | grep -c "ARRAY<")
union_count=$(echo "$schema_output" | grep -c "UNION<")

echo "Complex types:"
echo "  Structs: $struct_count"
echo "  Arrays: $array_count"  
echo "  Unions: $union_count"

# Count nullable/optional fields
# Note: approximate - this also skips nullable fields whose *element* type
# contains NOT NULL (e.g. ARRAY<VARCHAR NOT NULL>).
nullable_count=$(echo "$schema_output" | grep -v "NOT NULL" | grep -c '^"')
optional_count=$(echo "$schema_output" | grep -c "OPTIONAL")

echo "Nullability:"
echo "  Nullable fields: $nullable_count"
echo "  Optional fields: $optional_count"

echo ""
echo "Complexity Score: $((field_count + struct_count * 2 + array_count * 2 + union_count * 3))"

Multi-Format Schema Comparison

#!/bin/bash
# Compare schema formats for analysis

SCRIPT="$1"
BASE_NAME=$(basename "$SCRIPT" .ion)

echo "Generating schema in all formats for: $SCRIPT"

# Generate all three formats
beamline infer-shape --seed 1 --start-auto --script-path "$SCRIPT" --output-format text > "${BASE_NAME}_debug.txt"
beamline infer-shape --seed 1 --start-auto --script-path "$SCRIPT" --output-format basic-ddl > "${BASE_NAME}_schema.sql"  
beamline infer-shape --seed 1 --start-auto --script-path "$SCRIPT" --output-format beamline-json > "${BASE_NAME}_schema.json"

echo "Generated schema files:"
echo "  Debug format: ${BASE_NAME}_debug.txt ($(wc -l < ${BASE_NAME}_debug.txt) lines)"
echo "  SQL DDL format: ${BASE_NAME}_schema.sql ($(wc -l < ${BASE_NAME}_schema.sql) lines)"
echo "  JSON format: ${BASE_NAME}_schema.json ($(wc -l < ${BASE_NAME}_schema.json) lines)"

# Show summary from SQL format
echo ""
echo "Schema summary:"
grep "-- Dataset:" "${BASE_NAME}_schema.sql" | while read -r line; do
    dataset=$(echo "$line" | cut -d: -f2 | xargs)
    echo "  Dataset: $dataset"
done

Shape Inference Optimization

Fast Schema Validation

Shape inference is much faster than data generation:

# Quick validation of multiple scripts
for script in models/*.ion; do
    echo -n "$(basename "$script"): "
    
    start_time=$(date +%s.%N)  # %N requires GNU date; bc is used below
    if beamline infer-shape \
        --seed 1 \
        --start-auto \
        --script-path "$script" \
        --output-format text > /dev/null; then
        end_time=$(date +%s.%N)
        duration=$(echo "$end_time - $start_time" | bc -l)
        echo "✅ Valid (${duration}s)"
    else
        echo "❌ Invalid"
    fi
done

Batch Schema Generation

#!/bin/bash
# Generate schemas for all scripts in parallel

SCRIPTS_DIR="$1"
OUTPUT_DIR="$2"

mkdir -p "$OUTPUT_DIR"

echo "Generating schemas for all scripts in $SCRIPTS_DIR..."

# Process scripts in parallel
for script in "$SCRIPTS_DIR"/*.ion; do
    {
        name=$(basename "$script" .ion)
        echo "Processing $name..."
        
        beamline infer-shape \
            --seed 1 \
            --start-auto \
            --script-path "$script" \
            --output-format basic-ddl > "$OUTPUT_DIR/${name}_schema.sql"
            
        echo "✅ $name completed"
    } &
done

wait  # Wait for all background jobs
echo "All schema generation completed"

# Summary
echo ""
echo "Generated schemas:"
for file in "$OUTPUT_DIR"/*.sql; do
    lines=$(wc -l < "$file")
    echo "  $(basename "$file"): $lines lines"
done

Troubleshooting Shape Inference

Common Issues

Script Syntax Errors

$ beamline infer-shape --seed 1 --start-auto --script-path bad_syntax.ion
Error: Failed to parse Ion script: Invalid Ion syntax at line 5, column 10

Solution: Check Ion script syntax, ensure balanced braces and proper structure.

Missing Required Parameters

$ beamline infer-shape --script-path data.ion
Error: One of --seed-auto or --seed is required
Error: One of --start-auto, --start-epoch-ms, or --start-iso is required

Solution: Always provide seed and start time parameters.

Invalid Generator Configuration

# This will fail during shape inference
rand_processes::{
    bad_data: rand_process::{
        $arrival: HomogeneousPoisson::{ interarrival: seconds::1 },
        $data: {
            invalid_range: UniformI32::{ low: 100, high: 50 }  // min > max
        }
    }
}

Solution: Check generator configurations for valid parameter ranges.

Performance Troubleshooting

Shape inference should be very fast (milliseconds). If it’s slow:

# Check for complex nested structures
beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path suspected_slow.ion \
    --output-format text | grep -c "nested_struct"

Very deep nesting (10+ levels) might slow shape inference slightly.

Integration Examples

Database Schema Creation Pipeline

#!/bin/bash
# Complete database schema creation pipeline

SCRIPT="$1"
DATABASE_NAME="$2"

echo "Creating database from Ion script: $SCRIPT"

# 1. Validate script and infer schema
echo "Step 1: Validating script and inferring schema..."
if ! beamline infer-shape \
    --seed 1 \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format basic-ddl > schema.sql; then
    echo "❌ Schema inference failed"
    exit 1
fi

# 2. Create database
echo "Step 2: Creating database $DATABASE_NAME..."
createdb "$DATABASE_NAME"

# 3. Generate CREATE TABLE statements
echo "Step 3: Generating CREATE TABLE statements..."
grep "-- Dataset:" schema.sql | while read -r line; do
    dataset=$(echo "$line" | cut -d: -f2 | xargs)
    
    echo "CREATE TABLE $dataset (" > "table_${dataset}.sql"
    # Extract this dataset's fields (stop at the next dataset header)
    awk -v hdr="$line" '$0 == hdr {on=1; next} /^-- Dataset:/ {on=0} on && /^"/' schema.sql >> "table_${dataset}.sql"
    echo ");" >> "table_${dataset}.sql"
    
    echo "Creating table: $dataset"
    psql -d "$DATABASE_NAME" -f "table_${dataset}.sql"
done

echo "✅ Database $DATABASE_NAME created with schema from $SCRIPT"

Schema Testing Integration

#!/bin/bash
# Test schema consistency across development workflow

SCRIPT="user_model.ion"
SEED=12345

echo "Testing schema consistency workflow..."

# 1. Infer baseline schema
beamline infer-shape \
    --seed $SEED \
    --start-auto \
    --script-path "$SCRIPT" \
    --output-format basic-ddl > baseline_schema.sql

# 2. Generate test data using same script
beamline gen data \
    --seed $SEED \
    --start-auto \
    --script-path "$SCRIPT" \
    --sample-count 100 \
    --output-format ion-pretty > test_data.ion

# 3. Generate test queries using same script
beamline query basic \
    --seed $((SEED + 1)) \
    --start-auto \
    --script-path "$SCRIPT" \
    --sample-count 10 \
    rand-select-all-fw \
    --pred-all > test_queries.sql

echo "Consistency test completed:"
echo "  Schema: baseline_schema.sql"
# Note: jq parses JSON, not Ion, so for ion-pretty output this falls back to N/A
echo "  Data: test_data.ion ($(jq '.data | to_entries[0].value | length' test_data.ion 2>/dev/null || echo 'N/A') records)"
echo "  Queries: test_queries.sql ($(wc -l < test_queries.sql) queries)"

# 4. Validate all components reference same structure
echo ""
echo "✅ Schema, data, and queries all generated from same Ion script"
echo "✅ Consistency guaranteed by same script source"

Best Practices

1. Use Shape Inference Early

# Always validate scripts before large data generation
beamline infer-shape --seed 1 --start-auto --script-path new_script.ion

# Then proceed with data generation
beamline gen data --seed 1 --start-auto --script-path new_script.ion --sample-count 100000

2. Choose Format for Purpose

# Development and debugging
beamline infer-shape --script-path script.ion --output-format text

# Database integration
beamline infer-shape --script-path script.ion --output-format basic-ddl

# Tool integration and automation
beamline infer-shape --script-path script.ion --output-format beamline-json

3. Version Control Schemas

# Track schema evolution alongside scripts
git add scripts/user_model.ion schemas/user_model_schema.sql
git commit -m "Add user model v2 with preferences and stats

Schema changes:
- Added user.preferences nested object
- Added user.stats.login_count field  
- Made user.profile.age optional"

4. Validate Schema Changes

# Before deploying schema changes
beamline infer-shape --seed 1 --start-auto --script-path new_version.ion --output-format basic-ddl > new_schema.sql
diff old_schema.sql new_schema.sql

# Test compatibility with existing queries
# your-query-validator --schema new_schema.sql --queries existing_queries.sql
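Beyond a raw diff, it often helps to report which fields were added or removed by name. A small sketch using comm(1) over sorted field names; the inline schema files are illustrative stand-ins for real infer-shape dumps, and process substitution requires bash:

```shell
# Report field-level additions/removals between two basic-ddl schema dumps.
cat <<'EOF' > old_schema.sql
"age" TINYINT,
"name" VARCHAR
EOF
cat <<'EOF' > new_schema.sql
"email" VARCHAR,
"name" VARCHAR
EOF

# Extract just the quoted field names, sorted for comm(1).
fields() { grep -o '^"[^"]*"' "$1" | sort; }

added=$(comm -13 <(fields old_schema.sql) <(fields new_schema.sql))
removed=$(comm -23 <(fields old_schema.sql) <(fields new_schema.sql))

echo "Added:   $added"
echo "Removed: $removed"
```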

Next Steps

Now that you understand shape inference: