Capabilities-Based Testing

What Are Capabilities?

Instead of testing roles in isolation, Solti-Monitoring tests capabilities: complete functional systems in which multiple roles work together.

Current Capabilities:

  • logs - Log collection and forwarding system (Loki + Alloy)
  • metrics - Metrics collection and storage system (InfluxDB + Telegraf)

This approach ensures that components work together, not just individually.

Why Test Capabilities?

Traditional approach (role-by-role):

✓ InfluxDB starts successfully
✓ Telegraf starts successfully
✗ Telegraf cannot connect to InfluxDB (authentication missing!)

Capabilities approach:

✓ InfluxDB starts and creates admin token
✓ Telegraf starts and retrieves token
✓ Telegraf writes metrics to InfluxDB
✓ Metrics query returns data

The capabilities approach catches integration failures, such as the missing authentication above, that role-by-role tests miss.

How It Works

1. Define Capabilities

Capabilities are defined in molecule/vars/capabilities.yml:

monitoring_capabilities:
  metrics:
    roles:              # What to deploy
      - influxdb
      - influxdb3
      - telegraf

    verify_role_tasks:  # Per-role checks
      influxdb:
        - verify.yml
      influxdb3:
        - verify.yml
        - verify-collecting.yml

    verify_tasks:       # Integration checks
      - verify-metrics.yml
      - verify-metrics-v3.yml

2. Two-Level Verification

Level 1: Role Verification (does the service work?)

  • Check service is running
  • Verify API responds
  • Test configuration is valid

Level 2: Integration Verification (do services talk to each other?)

  • Verify Telegraf connects to InfluxDB
  • Write test data through Telegraf
  • Query data back from InfluxDB
  • Confirm metrics are flowing
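
The two levels can be sketched in shell roughly as follows. This is an illustration only: the project's real checks are Ansible task files such as verify.yml, and the function names, the influxdb unit name, the /health URL, and the telegraf bucket in the usage comment are all assumptions.

```shell
# Hypothetical helpers; none of these names come from Solti-Monitoring.

# Level 1: the unit is active and its HTTP API answers.
role_check() {
  local unit="$1" health_url="$2"
  systemctl is-active --quiet "$unit" || { echo "FAIL: $unit is not active"; return 1; }
  curl -fsS -o /dev/null "$health_url" || { echo "FAIL: no response from $health_url"; return 1; }
  echo "OK: $unit"
}

# Level 2: run a caller-supplied query command and require that it returns data.
# Assumes the query prints nothing when there are no rows.
integration_check() {
  local rows
  rows="$("$@")" || { echo "FAIL: query command failed"; return 1; }
  [ -n "$rows" ] || { echo "FAIL: query returned no data"; return 1; }
  echo "OK: metrics are flowing"
}

# Usage (assumes a configured influx CLI and a bucket named "telegraf"):
#   role_check influxdb http://localhost:8086/health &&
#   integration_check influx query 'from(bucket: "telegraf") |> range(start: -5m) |> limit(n: 1)'
```

A role can pass the first function and still fail the second, which is exactly the gap the integration level is meant to close.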

3. Run Tests

cd solti-monitoring

# Test all capabilities
./run-podman-tests.sh

# Test specific capability
./run-podman-tests.sh --tests metrics

# Test on Proxmox VMs
PROXMOX_DISTRO=debian12 ./run-proxmox-tests.sh

Real-World Example: Metrics Capability

What Gets Deployed

┌─────────────┐
│  InfluxDB   │ ← Metrics storage (v2)
│   :8086     │
└─────────────┘
       │ writes metrics
┌─────────────┐
│  Telegraf   │ ← Metrics collector
│   (client)  │
└─────────────┘
       │ also writes to
┌─────────────┐
│ InfluxDB v3 │ ← Next-gen storage
│   :8181     │
└─────────────┘

What Gets Tested

Role-Level Checks:

  • InfluxDB v2 service running? ✓
  • InfluxDB v3 service running? ✓
  • Telegraf service running? ✓
  • Port 8086 listening? ✓
  • Port 8181 listening? ✓

Integration Checks:

  • Telegraf connected to InfluxDB v2? ✓
  • Telegraf connected to InfluxDB v3? ✓
  • Can write test data to v2? ✓
  • Can write test data to v3? ✓
  • Can query data back? ✓

Test Platforms

Podman (Fast Local Testing)

Best for:

  • Quick feedback during development
  • Testing across multiple distributions
  • CI/CD pipeline

Characteristics:

  • Containers start in seconds
  • Full systemd support
  • Network isolation
  • Multiple distros in parallel

Run:

./run-podman-tests.sh

Proxmox (Production-Like Testing)

Best for:

  • Final validation before release
  • VM-specific scenarios
  • Production simulation

Characteristics:

  • Real VMs with complete OS
  • Cloud-init integration
  • Slower but higher fidelity
  • Tests VM provisioning workflow

Run:

PROXMOX_DISTRO=debian12 ./run-proxmox-tests.sh

Supported Distributions

Podman Testing

  • Debian 12 (bookworm)
  • Debian 13 (trixie)
  • Rocky Linux 9
  • Rocky Linux 10
  • Ubuntu 24.04

Proxmox Testing

  • Debian 12
  • Rocky Linux 9

InfluxDB Dual-Version Testing

The metrics capability tests both InfluxDB versions simultaneously:

     ┌──────────────┐
     │   Telegraf   │
     └──────┬───────┘
     ┌──────┴──────┐
     │             │
┌────▼────┐   ┌───▼─────┐
│ InfluxDB│   │InfluxDB │
│   v2    │   │   v3    │
│  :8086  │   │  :8181  │
└─────────┘   └─────────┘

Why both?

  • Tests migration/coexistence scenario
  • Validates Telegraf can write to both
  • Real-world use case during transition
  • Ensures no port conflicts

Configuration:

telegraf_outputs: ['localhost', 'localhost_v3']

telgraf2influxdb_configs:
  localhost:      # InfluxDB v2
    port: 8086
  localhost_v3:   # InfluxDB v3
    port: 8181
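
A quick way to confirm that both outputs have something listening is a port probe like the one below. It is a sketch rather than part of the test suite: the function names are hypothetical, and it relies on bash's /dev/tcp pseudo-device, so it needs bash rather than plain sh.

```shell
# Succeeds if something accepts TCP connections on host:port.
port_open() {
  local host="$1" port="$2"
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Takes name:port pairs and reports each one; returns non-zero if any port is closed.
check_outputs() {
  local entry name port rc=0
  for entry in "$@"; do
    name="${entry%%:*}"
    port="${entry##*:}"
    if port_open 127.0.0.1 "$port"; then
      echo "OK: ${name} is listening on ${port}"
    else
      echo "FAIL: ${name} is not listening on ${port}"
      rc=1
    fi
  done
  return "$rc"
}

# Usage, with the two outputs from the configuration above:
#   check_outputs localhost:8086 localhost_v3:8181
```

An open port only shows that a listener exists; it does not prove that Telegraf is writing to it.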

Test Execution Flow

Phase 1: Destroy

Clean up any existing test infrastructure

Phase 2: Create

  • Podman: Start systemd containers
  • Proxmox: Clone VMs from cloud-init templates

Phase 3: Prepare

  • Install required packages
  • Configure SSH access
  • Wait for systems to be ready

Phase 4: Converge

  • Deploy roles based on capability
  • Apply configuration
  • Start services

Phase 5: Verify

  1. Run role-level verification tasks
  2. Run integration verification tasks
  3. Generate test reports

Phase 6: Destroy

Clean up test infrastructure
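
The phase order above can be spelled out with individual Molecule subcommands. The sketch below is for illustration and debugging; the run_phases name is hypothetical, and in normal use molecule test -s podman runs the scenario's configured sequence in one command.

```shell
# Run the phases in the order listed above, stopping at the first failure.
# The scenario name is passed in; "podman" is the one used elsewhere in this document.
run_phases() {
  local scenario="$1" phase
  for phase in destroy create prepare converge verify destroy; do
    echo "== ${phase} =="
    molecule "$phase" -s "$scenario" || { echo "phase ${phase} failed" >&2; return 1; }
  done
}

# Usage:
#   run_phases podman
```

Because the loop stops at the first failure, a failed converge or verify skips the final destroy and leaves the environment available for inspection.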

Authentication in Tests

Testing Mode

When telegraf_testing: true (default in molecule):

  • Tokens are auto-discovered on the target system
  • InfluxDB v2: read from the output of influx auth list
  • InfluxDB v3: read from /root/.influxdb3-credentials
  • No manual configuration needed

Production Mode

When telegraf_testing: false (production deployments):

  • Tokens must be pre-configured in inventory
  • Uses telgraf2influxdb_configs from group_vars
  • No auto-discovery
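
The split between the two modes can be pictured with the shell analogue below. It is a hypothetical sketch, not the role's logic: the real switch is the Ansible variable telegraf_testing, while the TELEGRAF_TESTING and INFLUX_TOKEN environment variables and the function name are invented for illustration. The credentials file is passed through untouched because its format is not described here.

```shell
# Hypothetical: print an InfluxDB v3 token according to the mode.
resolve_v3_token() {
  local testing="${TELEGRAF_TESTING:-false}"
  local cred_file="${1:-/root/.influxdb3-credentials}"
  if [ "$testing" = "true" ]; then
    # Testing mode: discover the credential on the host.
    [ -r "$cred_file" ] || { echo "no credentials file at $cred_file" >&2; return 1; }
    cat "$cred_file"
  else
    # Production mode: the token must already be configured; no discovery.
    [ -n "${INFLUX_TOKEN:-}" ] || { echo "token not configured" >&2; return 1; }
    printf '%s\n' "$INFLUX_TOKEN"
  fi
}
```

Failing loudly in production mode, rather than falling back to discovery, keeps a misconfigured inventory from silently picking up whatever credential happens to be on the host.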

Secure Logging

By default, all credentials are hidden in test logs:

no_log: "{{ secure_logging | default(true) }}"

Debug mode (show credentials):

MOLECULE_SECURE_LOGGING=false ./run-podman-tests.sh

Verification Reports

All test results are saved to verify_output/:

verify_output/
├── podman-test-20250128-143022.out      # Full test log
├── latest_test.out -> podman-test...    # Symlink to latest
├── debian12/
│   ├── influxdb3-verify-uut-ct0.yml     # Role verification
│   ├── verify-metrics-v3-status.yml     # Integration results
│   └── debian12-consolidated.md         # Summary report
└── rocky9/
    └── ...

Report Format

Role-level reports (YAML):

service_status: running
api_health: 200
port_8181: listening
version: influxdb3-core 3.0.0

Integration reports (YAML):

verify_result: passed
telegraf_connection: established
write_test: passed
query_test: passed
cpu_metrics_5min: 1234

Consolidated reports (Markdown): Human-readable summary with pass/fail status for all tests.
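
For a quick pass/fail overview without opening each file, the integration reports can be scanned from the shell. This is a minimal sketch, assuming reports sit one directory deep under the report root (as in verify_output/debian12/) and carry a top-level verify_result key as in the example above; the summarize_reports name is hypothetical.

```shell
# Print "<distro>/<file>: <verify_result>" for every report that has that key.
# Returns non-zero if any result is something other than "passed".
summarize_reports() {
  local root="${1:-verify_output}" file result rc=0
  for file in "$root"/*/*.yml; do
    [ -e "$file" ] || continue      # the glob matched nothing
    result="$(sed -n 's/^verify_result:[[:space:]]*//p' "$file" | head -n 1)"
    [ -n "$result" ] || continue    # role-level report without verify_result
    echo "${file#"$root"/}: $result"
    [ "$result" = "passed" ] || rc=1
  done
  return "$rc"
}

# Usage:
#   summarize_reports verify_output
```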

Capability Selection

Test specific capabilities:

# Logs only (Loki + Alloy)
./run-podman-tests.sh --tests logs

# Metrics only (InfluxDB + Telegraf)
./run-podman-tests.sh --tests metrics

# Both (default)
./run-podman-tests.sh --tests logs,metrics

Behind the scenes:

# molecule.yml
testing_capabilities: "{{ lookup('env', 'MOLECULE_CAPABILITIES', default='logs,metrics') | split(',') }}"

The converge playbook uses this to deploy only selected capabilities.
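
The same lookup-and-split logic can be reproduced in shell, which is handy for wrapper scripts that need to know which capabilities were selected. The function names are hypothetical; only the MOLECULE_CAPABILITIES variable and its logs,metrics default come from the expression above.

```shell
# Print one selected capability per line.
# Falls back to "logs,metrics" when MOLECULE_CAPABILITIES is unset.
selected_capabilities() {
  local raw="${MOLECULE_CAPABILITIES-logs,metrics}" cap
  local IFS=','
  for cap in $raw; do
    printf '%s\n' "$cap"
  done
}

# Succeeds if the named capability is in the selection.
has_capability() {
  selected_capabilities | grep -qx -- "$1"
}

# Usage:
#   MOLECULE_CAPABILITIES=metrics
#   has_capability metrics && echo "deploying metrics roles"
```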

Common Scenarios

Quick Local Test

cd solti-monitoring
./run-podman-tests.sh --tests metrics

Tests the metrics capability on all supported distributions in Podman containers.

Single Distribution

cd solti-monitoring
MOLECULE_PLATFORM_NAME=debian12 molecule test -s podman

Tests all capabilities on Debian 12 only.

Production Validation

cd solti-monitoring
PROXMOX_DISTRO=debian12 ./run-proxmox-tests.sh

Full integration test on a Debian 12 VM.

Debug Failed Test

cd solti-monitoring
MOLECULE_SECURE_LOGGING=false molecule test -s podman

Shows credentials in output for troubleshooting.

Keep Test Environment

cd solti-monitoring
molecule converge -s podman
# (skips destroy, containers stay running)
podman exec -it uut-ct0 bash

Inspect the running test container manually.

Troubleshooting

Container Won't Start

# Check cgroup version
ls /sys/fs/cgroup/

# Verify podman supports systemd
podman --version  # Need 3.0+
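
Both checks can be scripted. The sketch below assumes that cgroup v2 is identified by a cgroup.controllers file at the root of the hierarchy and that podman --version prints a line of the form "podman version X.Y.Z"; the function names are hypothetical.

```shell
# Print 2 if the unified (v2) hierarchy is mounted, otherwise 1.
cgroup_version() {
  local root="${1:-/sys/fs/cgroup}"
  if [ -f "$root/cgroup.controllers" ]; then echo 2; else echo 1; fi
}

# Print the major version taken from the last field of `podman --version`.
podman_major() {
  podman --version | awk '{ split($NF, v, "."); print v[1] }'
}

# Succeeds if Podman meets the 3.0+ requirement noted above.
podman_ok() {
  [ "$(podman_major)" -ge 3 ]
}

# Usage:
#   echo "cgroup v$(cgroup_version)"; podman_ok || echo "Podman is older than 3.0"
```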

SSH Connection Refused

# Check SSH in container
podman exec uut-ct0 systemctl status sshd

# Check port mapping
podman port uut-ct0

Authentication Failures

# Enable debug logging
MOLECULE_SECURE_LOGGING=false ./run-podman-tests.sh

# Check token files exist
podman exec uut-ct0 ls -la /root/.influxdb*

Tests Pass But Reports Missing

# Check report directory
ls -la verify_output/

# Verify report_root in molecule.yml
grep report_root molecule/podman/molecule.yml

Best Practices

1. Test Real Scenarios

✓ Good: Test complete data flow (Telegraf → InfluxDB → Query)
✗ Bad: Test Telegraf and InfluxDB separately

2. Meaningful Assertions

✓ Good: assert: connection_check shows telegraf pid
✗ Bad: assert: command returned 0
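
As an illustration of the difference, the check below only passes when an established connection to the given port actually belongs to a telegraf process. It is a sketch, not the project's connection_check: the function name is hypothetical, and it assumes ss from iproute2 is available and run with enough privilege for -p to show process names.

```shell
# Succeeds only if a process named telegraf holds an established TCP
# connection to the given destination port.
telegraf_connected() {
  local port="$1"
  ss -tnp state established "( dport = :${port} )" 2>/dev/null | grep -q '"telegraf"'
}

# Usage:
#   telegraf_connected 8181 || echo "Telegraf not connected to InfluxDB v3 on port 8181"
```

A bare exit-code check on ss would pass even with no matching connection, since ss exits 0 when its result set is empty.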

3. Clean Test Data

Tests should be idempotent - safe to run multiple times without side effects.

4. Distribution-Aware

Use Ansible facts for distribution-specific checks:

when: ansible_distribution == "Debian"

5. Clear Failure Messages

fail_msg: "Telegraf not connected to InfluxDB v3 on port 8181"
success_msg: "Telegraf successfully connected"

What's Next?

The capabilities-based approach is evolving:

Current State:

  • Two capabilities (logs, metrics)
  • Dual InfluxDB version testing
  • Podman and Proxmox platforms
  • Manual test script execution

Future:

  • More capabilities (alerts, dashboards)
  • Parallel test execution
  • Unified test reports
  • Performance benchmarks
  • Automated PR testing
