Deployment Patterns
Overview¶
Solti-Monitoring is designed for distributed small deployments - multiple independent monitoring servers serving 5-50 hosts each, rather than large centralized installations.
Design Philosophy:
- Prefer multiple small monitoring servers over single large clusters
- Keep data close to where it's generated (regional/site-based)
- Use WireGuard for secure remote collection
- S3 object storage for cost-effective long-term retention (Loki only)
- NFS mounts for InfluxDB shared storage
Primary Pattern: Hub-and-Spoke (Distributed Small Deployment)¶
Architecture:
- Hub: Central monitoring server (InfluxDB + Loki + Grafana)
- Spokes: Monitored hosts running Telegraf + Alloy collectors
- Transport: WireGuard VPN for secure data shipping
- Storage: S3 for Loki logs, NFS for InfluxDB metrics
Characteristics:
- Serves 5-50 monitored hosts per hub
- Regional deployment (e.g., one hub per data center, VPC, or geographic region)
- Independent operation (each hub can function without others)
- Low operational complexity
Reference Implementation:
Hub: monitor11.example.com (Proxmox VM, local infrastructure)
├── InfluxDB v2 OSS (NFS storage, 30-day retention)
├── Loki (S3 storage via MinIO)
├── Grafana (local visualization)
└── WireGuard endpoint (10.10.0.11)
Spoke: ispconfig3-server.example.com (Linode VPS, remote)
├── Telegraf → monitor11:8086 (via WireGuard)
├── Alloy → monitor11:3100 (via WireGuard)
└── Monitored services: Apache, ISPConfig, Fail2ban, Gitea, Mail, Bind9
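One way to express this hub-and-spoke split in an Ansible inventory, shown as a sketch only (hostnames match the reference implementation; group and variable names are illustrative):
# inventory.yml (illustrative)
monitoring_hubs:
  hosts:
    monitor11.example.com:
monitoring_spokes:
  hosts:
    ispconfig3-server.example.com:
  vars:
    monitoring_hub_wireguard_ip: 10.10.0.11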
Why This Pattern:
- Security: WireGuard encrypts all monitoring traffic
- Scalability: Add more spokes without hub changes
- Resilience: Hub failure only affects visibility, not spoke functionality
- Cost: S3 storage cheaper than local disk for long-term logs
- Simplicity: No cluster management, no distributed consensus
Storage Patterns¶
This role deploys InfluxDB v2 OSS, so there is no S3 storage and no tiering. InfluxDB 3 is radically different and does support object storage, but the metrics-collection landscape is continually changing.
Pattern 1: NFS for InfluxDB + S3 for Loki (Current Production)¶
This has worked well for 5-10 clients.
InfluxDB v2 OSS:
- Data directory: NFS mount (e.g., /mnt/nfs/influxdb)
- Enables sharing across multiple InfluxDB instances (if needed)
- Simple backup via NFS snapshots
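The NFS mount itself is provisioned outside this role. A minimal sketch of a corresponding /etc/fstab entry, assuming a hypothetical NFS server and export path:
# NFS mount for the InfluxDB data directory (server and export path are assumptions)
nfs-server.example.com:/export/influxdb  /mnt/nfs/influxdb  nfs4  rw,hard,noatime  0  0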
Loki:
- Index: Local disk (fast access)
- Chunks: S3 object storage (MinIO)
- Cost-effective long-term retention
Example:
# InfluxDB with NFS
influxdb_data_path: /mnt/nfs/influxdb
influxdb_retention_days: 30
# Loki with S3
loki_s3_endpoint: s3-server.example.com:8010
loki_s3_bucket: loki11
loki_s3_access_key: "{{ vault_loki_s3_access_key }}"
loki_s3_secret_key: "{{ vault_loki_s3_secret_key }}"
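The Loki settings above assume the bucket and credentials already exist on the MinIO side. A minimal provisioning sketch with the MinIO client (the alias name, admin credentials, and access key values are assumptions):
# Register the MinIO endpoint under a local alias
mc alias set s3-server http://s3-server.example.com:8010 ADMIN_ACCESS_KEY ADMIN_SECRET_KEY
# Create the access key Loki will use (attach a readwrite policy afterwards via mc admin policy)
mc admin user add s3-server LOKI_ACCESS_KEY LOKI_SECRET_KEY
# Create the chunk bucket
mc mb s3-server/loki11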
Pattern 2: All Local Storage (Development/Testing)¶
Use Case: Development, testing, isolated systems
Configuration:
# InfluxDB local storage
influxdb_data_path: /var/lib/influxdb
# Loki local storage (no S3)
loki_storage_type: filesystem
loki_filesystem_path: /var/lib/loki
Limitations:
- No sharing across multiple instances
- Higher disk I/O on hub server
- Backup requires rsync/tar
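As a sketch of that backup path, assuming the services run as Docker containers named influxdb and loki (adjust for systemd units and your actual paths):
# Stop writers briefly so the copy is consistent
docker stop influxdb loki
tar czf /backup/influxdb-$(date +%F).tar.gz /var/lib/influxdb
tar czf /backup/loki-$(date +%F).tar.gz /var/lib/loki
docker start influxdb loki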
Network Security Patterns¶
Securing the transport is a manual effort. Non-TLS traffic needs to be tunneled or kept within a private, trusted network.
Here are some suggestions; YMMV.
WireGuard VPN¶
Set up your own private network on an external server and connect to it from behind your firewall.
Architecture:
- Hub runs WireGuard endpoint
- Spokes connect to hub via WireGuard tunnel
- All monitoring traffic encrypted and authenticated
Configuration:
# Hub (monitor11)
wireguard_enabled: true
wireguard_listen_port: 51820
wireguard_address: 10.10.0.11/24
# Spoke (ispconfig3)
wireguard_enabled: true
wireguard_endpoint: monitor11.example.com:51820
wireguard_address: 10.10.0.20/24
wireguard_allowed_ips: 10.10.0.0/24
# Monitoring targets
telegraf_outputs_influxdb_endpoint: http://10.10.0.11:8086
alloy_loki_endpoint: http://10.10.0.11:3100
Benefits:
- Encrypted transport (ChaCha20-Poly1305)
- Mutual authentication (public/private keys)
- NAT traversal (works from behind firewalls)
- Low overhead (kernel-level performance)
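After both ends are deployed, a quick sanity check from the spoke confirms the tunnel and the collector targets configured above:
# Show tunnel status and verify the hub answers over WireGuard
wg show
ping -c 3 10.10.0.11
# Verify the collector endpoints are reachable through the tunnel
curl -s http://10.10.0.11:8086/health
curl -s http://10.10.0.11:3100/ready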
Reverse Proxy + TLS¶
This is on my list for when I have time to test.
Architecture:
- Hub behind Traefik/nginx reverse proxy
- TLS termination at proxy
- Certificate-based authentication
Not Currently Implemented:
- Would integrate with the solti-ensemble Traefik role
- Would enable public HTTPS endpoints for InfluxDB/Loki
- Would require certificate management (ACME/Let's Encrypt)
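For reference only, since none of this is implemented yet: with Traefik's Docker provider, TLS termination for InfluxDB would take roughly this shape (hostnames, router names, and the certificate resolver are illustrative):
# Illustrative Traefik v2 Docker labels for a TLS-terminated InfluxDB endpoint
labels:
  - "traefik.enable=true"
  - "traefik.http.routers.influxdb.rule=Host(`influxdb.example.com`)"
  - "traefik.http.routers.influxdb.entrypoints=websecure"
  - "traefik.http.routers.influxdb.tls.certresolver=letsencrypt"
  - "traefik.http.services.influxdb.loadbalancer.server.port=8086"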
Deployment Workflow Concept¶
Step 1: Deploy Hub¶
cd mylab
# Deploy InfluxDB + Loki on monitor11
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/svc-monitor11-metrics.yml # InfluxDB + Telegraf
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/svc-monitor11-logs.yml # Loki + Alloy
# Deploy Grafana (local orchestrator)
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/svc-grafana.yml
Step 2: Configure WireGuard (if needed)¶
# Deploy WireGuard on hub
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/svc-monitor11-wireguard.yml
# Deploy WireGuard on spoke
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/ispconfig3/wireguard.yml
Step 3: Deploy Spokes¶
# Deploy Telegraf + Alloy on ispconfig3
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/ispconfig3/22-ispconfig3-alloy.yml
ansible-playbook \
--become-password-file ~/.secrets/lavender.pass \
playbooks/ispconfig3/ispconfig3-monitor.yml
Step 4: Verify Data Flow¶
# Check InfluxDB metrics
curl -s http://monitor11.example.com:8086/health
# Check Loki logs
curl -s -G "http://monitor11.example.com:3100/loki/api/v1/query" \
--data-urlencode 'query={hostname="ispconfig3-server.example.com"}' \
--data-urlencode 'limit=10'
# View in Grafana
open http://localhost:3000 # or https://grafana.example.com:8080
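The health endpoint only confirms InfluxDB is up. To confirm Telegraf metrics are actually arriving, a Flux query via the influx CLI also works (org, token, and bucket name are assumptions):
influx query \
  --host http://monitor11.example.com:8086 \
  --org myorg --token "$INFLUX_TOKEN" \
  'from(bucket: "telegraf") |> range(start: -5m) |> limit(n: 5)'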
Step 5: Create Dashboards¶
Use programmatic dashboard creation:
# See CLAUDE.md "Creating Grafana Dashboards Programmatically"
./bin/create-fail2ban-dashboard.py
./bin/create-alloy-dashboard.py
./bin/create-docker-dashboard.py
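Those scripts are repo-specific; the usual mechanism behind programmatic dashboard creation is Grafana's dashboard HTTP API. A minimal sketch of that call (host, token, and dashboard JSON are illustrative):
curl -s -X POST http://localhost:3000/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"dashboard": {"title": "Spoke Overview", "panels": []}, "overwrite": true}'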
Self-Monitoring¶
Hubs monitor themselves:
# monitor11 monitors itself
- hosts: monitor11.example.com
  roles:
    - jackaltx.solti_monitoring.influxdb
    - jackaltx.solti_monitoring.loki
    - jackaltx.solti_monitoring.telegraf   # Collects metrics from monitor11
    - jackaltx.solti_monitoring.alloy      # Collects logs from monitor11
  vars:
    telegraf_outputs_influxdb_endpoint: http://localhost:8086
    alloy_loki_endpoint: http://localhost:3100
What Gets Monitored:
- InfluxDB container metrics (CPU, memory, disk I/O)
- Loki container metrics
- System metrics (monitor11 host)
- Service logs (InfluxDB, Loki, WireGuard)
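A quick spot-check that the hub's own logs are landing locally, mirroring the Step 4 query (the hostname label value is an assumption):
curl -s -G "http://localhost:3100/loki/api/v1/query" \
  --data-urlencode 'query={hostname="monitor11.example.com"}' \
  --data-urlencode 'limit=5'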
Summary¶
Current Production Pattern:
- Hub: monitor11 (InfluxDB + Loki + Grafana)
- Spoke: ispconfig3 (Telegraf + Alloy)
- Transport: WireGuard VPN
- Storage: NFS (InfluxDB), S3 (Loki)
- Sizing: Small hub (1-10 hosts)
Design Principles:
- Distributed small deployments over centralized large deployments
- WireGuard for security
- S3 for cost-effective log storage
- NFS for InfluxDB shared storage
- Independent hubs for resilience
Next Steps:
- Add more spokes to existing hub (up to 30-50 hosts)
- Deploy additional regional hubs as needed
- Optional: Implement central aggregation layer (query across hubs)
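One shape for that optional aggregation layer is a single Grafana instance provisioned with one data source per hub; a sketch, where the second hub and its address are purely illustrative:
# Grafana datasource provisioning file (illustrative)
apiVersion: 1
datasources:
  - name: Loki-monitor11
    type: loki
    url: http://10.10.0.11:3100
  - name: Loki-monitor12
    type: loki
    url: http://10.10.0.12:3100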