Deployment Patterns
Overview¶
Solti-Monitoring is designed for distributed small deployments - multiple independent monitoring servers serving 5-50 hosts each, rather than large centralized installations.
Design Philosophy:
- Prefer multiple small monitoring servers over single large clusters
- Keep data close to where it's generated (regional/site-based)
- Use WireGuard for secure remote collection
- S3 object storage for cost-effective long-term retention (Loki only)
- NFS mounts for InfluxDB shared storage
Primary Pattern: Hub-and-Spoke (Distributed Small Deployment)¶
Architecture:
- Hub: Central monitoring server (InfluxDB + Loki + Grafana)
- Spokes: Monitored hosts running Telegraf + Alloy collectors
- Transport: WireGuard VPN for secure data shipping
- Storage: S3 for Loki logs, NFS for InfluxDB metrics
Characteristics:
- Serves 5-50 monitored hosts per hub
- Regional deployment (e.g., one hub per data center, VPC, or geographic region)
- Independent operation (each hub can function without others)
- Low operational complexity
Reference Implementation:
Hub: monitor11.example.com (Proxmox VM, local infrastructure)
├── InfluxDB v2 OSS (NFS storage, 30-day retention)
├── Loki (S3 storage via MinIO)
├── Grafana (local visualization)
└── WireGuard endpoint (10.10.0.11)
Spoke: ispconfig3-server.example.com (Linode VPS, remote)
├── Telegraf → monitor11:8086 (via WireGuard)
├── Alloy → monitor11:3100 (via WireGuard)
└── Monitored services: Apache, ISPConfig, Fail2ban, Gitea, Mail, Bind9
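One way to express this hub-and-spoke split in an Ansible inventory, shown as a sketch only (hostnames match the reference implementation; group and variable names are illustrative):
# inventory.yml (illustrative)
monitoring_hubs:
  hosts:
    monitor11.example.com:
monitoring_spokes:
  hosts:
    ispconfig3-server.example.com:
  vars:
    monitoring_hub_wireguard_ip: 10.10.0.11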
Why This Pattern:
- Security: WireGuard encrypts all monitoring traffic
- Scalability: Add more spokes without hub changes
- Resilience: Hub failure only affects visibility, not spoke functionality
- Cost: S3 storage cheaper than local disk for long-term logs
- Simplicity: No cluster management, no distributed consensus
Storage Patterns¶
This role deploys InfluxDB v2 OSS, so there is no S3 storage and no tiering. InfluxDB 3 is radically different and does support object storage, but the metrics-collection landscape is continually changing.
Pattern 1: NFS for InfluxDB + S3 for Loki (Current Production)¶
This has worked well for 5-10 clients.
InfluxDB v2 OSS:
- Data directory: NFS mount (e.g., /mnt/nfs/influxdb)
- Enables sharing across multiple InfluxDB instances (if needed)
- Simple backup via NFS snapshots
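The NFS mount itself is provisioned outside this role. A minimal sketch of a corresponding /etc/fstab entry, assuming a hypothetical NFS server and export path:
# NFS mount for the InfluxDB data directory (server and export path are assumptions)
nfs-server.example.com:/export/influxdb  /mnt/nfs/influxdb  nfs4  rw,hard,noatime  0  0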
Loki:
- Index: Local disk (fast access)
- Chunks: S3 object storage (MinIO)
- Cost-effective long-term retention
Example:
# InfluxDB with NFS
influxdb_data_path: /mnt/nfs/influxdb
influxdb_retention_days: 30
# Loki with S3
loki_s3_endpoint: s3-server.example.com:8010
loki_s3_bucket: loki11
loki_s3_access_key: "{{ vault_loki_s3_access_key }}"
loki_s3_secret_key: "{{ vault_loki_s3_secret_key }}"
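The Loki settings above assume the bucket and credentials already exist on the MinIO side. A minimal provisioning sketch with the MinIO client (the alias name, admin credentials, and access key values are assumptions):
# Register the MinIO endpoint under a local alias
mc alias set s3-server http://s3-server.example.com:8010 ADMIN_ACCESS_KEY ADMIN_SECRET_KEY
# Create the access key Loki will use (attach a readwrite policy afterwards via mc admin policy)
mc admin user add s3-server LOKI_ACCESS_KEY LOKI_SECRET_KEY
# Create the chunk bucket
mc mb s3-server/loki11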
Pattern 2: All Local Storage (Development/Testing)¶
Use Case: Development, testing, isolated systems
Configuration:
# InfluxDB local storage
influxdb_data_path: /var/lib/influxdb
# Loki local storage (no S3)
loki_storage_type: filesystem
loki_filesystem_path: /var/lib/loki
Limitations:
- No sharing across multiple instances
- Higher disk I/O on hub server
- Backup requires rsync/tar
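As a sketch of that backup path, assuming the services run as Docker containers named influxdb and loki (adjust for systemd units and your actual paths):
# Stop writers briefly so the copy is consistent
docker stop influxdb loki
tar czf /backup/influxdb-$(date +%F).tar.gz /var/lib/influxdb
tar czf /backup/loki-$(date +%F).tar.gz /var/lib/loki
docker start influxdb loki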
Network Security Patterns¶
Securing the transport is a manual effort. Non-TLS traffic needs to be tunneled or kept within a private, trusted network.
Here are some suggestions; YMMV.
WireGuard VPN¶
Set up your own private network on an external server and connect to it from behind your firewall.
Architecture:
- Hub runs WireGuard endpoint
- Spokes connect to hub via WireGuard tunnel
- All monitoring traffic encrypted and authenticated
Configuration:
# Hub (monitor11)
wireguard_enabled: true
wireguard_listen_port: 51820
wireguard_address: 10.10.0.11/24
# Spoke (ispconfig3)
wireguard_enabled: true
wireguard_endpoint: monitor11.example.com:51820
wireguard_address: 10.10.0.20/24
wireguard_allowed_ips: 10.10.0.0/24
# Monitoring targets
telegraf_outputs_influxdb_endpoint: http://10.10.0.11:8086
alloy_loki_endpoint: http://10.10.0.11:3100
Benefits:
- Encrypted transport (ChaCha20-Poly1305)
- Mutual authentication (public/private keys)
- NAT traversal (works from behind firewalls)
- Low overhead (kernel-level performance)
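After both ends are deployed, a quick sanity check from the spoke confirms the tunnel and the collector targets configured above:
# Show tunnel status and verify the hub answers over WireGuard
wg show
ping -c 3 10.10.0.11
# Verify the collector endpoints are reachable through the tunnel
curl -s http://10.10.0.11:8086/health
curl -s http://10.10.0.11:3100/ready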
Reverse Proxy + TLS¶
This is on my list for when I have time to test.
Architecture:
- Hub behind Traefik/nginx reverse proxy
- TLS termination at proxy
- Certificate-based authentication
Not Currently Implemented:
- Would integrate with the solti-ensemble Traefik role
- Would enable public HTTPS endpoints for InfluxDB/Loki
- Would require certificate management (ACME/Let's Encrypt)
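For reference only, since none of this is implemented yet: with Traefik's Docker provider, TLS termination for InfluxDB would take roughly this shape (hostnames, router names, and the certificate resolver are illustrative):
# Illustrative Traefik v2 Docker labels for a TLS-terminated InfluxDB endpoint
labels:
  - "traefik.enable=true"
  - "traefik.http.routers.influxdb.rule=Host(`influxdb.example.com`)"
  - "traefik.http.routers.influxdb.entrypoints=websecure"
  - "traefik.http.routers.influxdb.tls.certresolver=letsencrypt"
  - "traefik.http.services.influxdb.loadbalancer.server.port=8086"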
Deployment Workflow Concept¶
Step 1: Deploy Hub¶
cd mylab
# Deploy InfluxDB + Loki on monitor11
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/svc-monitor11-metrics.yml # InfluxDB + Telegraf
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/svc-monitor11-logs.yml # Loki + Alloy
# Deploy Grafana (local orchestrator)
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/svc-grafana.yml
Step 2: Configure WireGuard (if needed)¶
# Deploy WireGuard on hub
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/svc-monitor11-wireguard.yml
# Deploy WireGuard on spoke
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/ispconfig3/wireguard.yml
Step 3: Deploy Spokes¶
# Deploy Telegraf + Alloy on ispconfig3
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/ispconfig3/22-ispconfig3-alloy.yml
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/ispconfig3/ispconfig3-monitor.yml
Step 4: Verify Data Flow¶
# Check InfluxDB metrics
curl -s http://monitor11.example.com:8086/health
# Check Loki logs
curl -s -G "http://monitor11.example.com:3100/loki/api/v1/query" \
--data-urlencode 'query={hostname="ispconfig3-server.example.com"}' \
--data-urlencode 'limit=10'
# View in Grafana
open http://localhost:3000 # or https://grafana.example.com:8080
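The health endpoint only confirms InfluxDB is up. To confirm Telegraf metrics are actually arriving, a Flux query via the influx CLI also works (org, token, and bucket name are assumptions):
influx query \
  --host http://monitor11.example.com:8086 \
  --org myorg --token "$INFLUX_TOKEN" \
  'from(bucket: "telegraf") |> range(start: -5m) |> limit(n: 5)'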
Step 5: Create Dashboards¶
Use programmatic dashboard creation:
# See CLAUDE.md "Creating Grafana Dashboards Programmatically"
./bin/create-fail2ban-dashboard.py
./bin/create-alloy-dashboard.py
./bin/create-docker-dashboard.py
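Those scripts are repo-specific; the usual mechanism behind programmatic dashboard creation is Grafana's dashboard HTTP API. A minimal sketch of that call (host, token, and dashboard JSON are illustrative):
curl -s -X POST http://localhost:3000/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"dashboard": {"title": "Spoke Overview", "panels": []}, "overwrite": true}'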
Self-Monitoring¶
Hubs monitor themselves:
# monitor11 monitors itself
- hosts: monitor11.example.com
  roles:
    - jackaltx.solti_monitoring.influxdb
    - jackaltx.solti_monitoring.loki
    - jackaltx.solti_monitoring.telegraf   # Collects metrics from monitor11
    - jackaltx.solti_monitoring.alloy      # Collects logs from monitor11
  vars:
    telegraf_outputs_influxdb_endpoint: http://localhost:8086
    alloy_loki_endpoint: http://localhost:3100
What Gets Monitored:
- InfluxDB container metrics (CPU, memory, disk I/O)
- Loki container metrics
- System metrics (monitor11 host)
- Service logs (InfluxDB, Loki, WireGuard)
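A quick spot-check that the hub's own logs are landing locally, mirroring the Step 4 query (the hostname label value is an assumption):
curl -s -G "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query={hostname="monitor11.example.com"}' \
  --data-urlencode 'limit=5'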
Summary¶
Current Production Pattern:
- Hub: monitor11 (InfluxDB + Loki + Grafana)
- Spoke: ispconfig3 (Telegraf + Alloy)
- Transport: WireGuard VPN
- Storage: NFS (InfluxDB), S3 (Loki)
- Sizing: Small hub (1-10 hosts)
Design Principles:
- Distributed small deployments over centralized large deployments
- WireGuard for security
- S3 for cost-effective log storage
- NFS for InfluxDB shared storage
- Independent hubs for resilience
Next Steps:
- Add more spokes to existing hub (up to 30-50 hosts)
- Deploy additional regional hubs as needed
- Optional: Implement central aggregation layer (query across hubs)
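One shape for that optional aggregation layer is a single Grafana instance provisioned with one data source per hub; a sketch, where the second hub and its address are purely illustrative:
# Grafana datasource provisioning file (illustrative)
apiVersion: 1
datasources:
  - name: Loki-monitor11
    type: loki
    url: http://10.10.0.11:3100
  - name: Loki-monitor12
    type: loki
    url: http://10.10.0.12:3100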