# System Architecture
Solti-Monitoring implements a distributed monitoring architecture built around two parallel pipelines: a metrics pipeline (Telegraf → InfluxDB) and a logging pipeline (Alloy → Loki), with Grafana providing unified visualization of both.
## Monitoring Pipelines
### Metrics Pipeline (Telegraf → InfluxDB)
```mermaid
graph LR
    A[Monitored Host<br/>Telegraf Collector] -->|Metrics| B[Monitoring Server<br/>InfluxDB Storage]
    B -->|Query| C[Grafana<br/>Visualization]

    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#f0e1ff
```
Flow:
- Telegraf collects metrics from local system/applications
- Metrics sent to InfluxDB via HTTP(S)
- InfluxDB stores metrics in time-series database
- Grafana queries InfluxDB for visualization
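
The sketch below shows the second step at the HTTP level: a single line-protocol point written to the InfluxDB v2 API. The organization, bucket, and token values are placeholders; in this stack, Telegraf's InfluxDB v2 output plugin performs this step automatically.

```python
# Minimal sketch of the write path Telegraf uses: one line-protocol point
# posted to the InfluxDB v2 HTTP API. Org, bucket, and token are placeholders.
import time
import requests

INFLUX_URL = "https://monitor11.example.com:8086"  # assumption: monitoring server, default port
TOKEN = "REPLACE_WITH_TOKEN"
ORG = "example-org"    # placeholder organization
BUCKET = "telegraf"    # placeholder bucket

# Line protocol: measurement,tag=value field=value timestamp
point = f"cpu_load,host=ispconfig3-server value=0.42 {int(time.time())}"

resp = requests.post(
    f"{INFLUX_URL}/api/v2/write",
    params={"org": ORG, "bucket": BUCKET, "precision": "s"},
    headers={"Authorization": f"Token {TOKEN}"},
    data=point,
    timeout=10,
)
resp.raise_for_status()  # InfluxDB returns 204 No Content on success
```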
### Logging Pipeline (Alloy → Loki)
```mermaid
graph LR
    A[Monitored Host<br/>Alloy Pipeline<br/>Parse - Filter - Label] -->|Structured Logs| B[Monitoring Server<br/>Loki Storage]
    B -->|LogQL Query| C[Grafana<br/>Visualization]

    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#f0e1ff
```
Flow:
- Alloy collects raw logs from journald, files, containers
- Alloy processes logs: parses fields, adds labels, filters noise
- Structured logs sent to Loki via HTTP(S) with normalized labels
- Loki stores logs with label-based indexing (low cardinality)
- Grafana/Claude query Loki using consistent label schema
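
As a rough illustration of the hand-off to Loki, the sketch below pushes one structured log line with a normalized, low-cardinality label set via the push API. The label values follow the schema used later in this document; in practice Alloy's pipeline performs this step, not a Python script.

```python
# Minimal sketch of what the Alloy -> Loki hand-off looks like on the wire:
# one log line with normalized labels pushed to the Loki HTTP API.
import time
import requests

LOKI_URL = "https://monitor11.example.com:3100"  # assumption: Loki on the monitoring server

payload = {
    "streams": [
        {
            # Normalized label set, kept small to control cardinality
            "stream": {
                "service_type": "fail2ban",
                "hostname": "ispconfig3-server.example.com",
                "jail": "sshd",
            },
            # values: [[unix-epoch nanoseconds as a string, log line], ...]
            "values": [[str(time.time_ns()), "Ban 203.0.113.7"]],
        }
    ]
}

resp = requests.post(f"{LOKI_URL}/loki/api/v1/push", json=payload, timeout=10)
resp.raise_for_status()  # Loki returns 204 No Content on success
```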
### Combined Architecture
```mermaid
graph TB
    subgraph "Monitored Hosts"
        T[Telegraf<br/>Metrics Collector]
        A[Alloy<br/>Observability Pipeline<br/>Parse - Filter - Label]
    end

    subgraph "Monitoring Server"
        I[InfluxDB<br/>Metrics Storage]
        L[Loki<br/>Log Storage]
        G[Grafana<br/>Unified Visualization]
    end

    T -->|Metrics| I
    A -->|Logs| L
    I -->|Query| G
    L -->|Query| G

    style T fill:#e1f5ff
    style A fill:#e1f5ff
    style I fill:#fff4e1
    style L fill:#fff4e1
    style G fill:#f0e1ff
```
## Component Details
### InfluxDB (Metrics Storage)
Purpose: Time-series database optimized for metrics
Version: InfluxDB v2 OSS (Open Source)
Key Features:
- High-performance time-series storage
- Flux query language for analysis
- Retention policies for automatic data cleanup
- Built-in downsampling and aggregation
Storage Options:
- Local disk (default)
- NFS mounts (shared storage for data directory)
Note: InfluxDB v2 OSS does not support S3 object storage. For scalable/shared storage, mount an NFS volume to the InfluxDB data directory.
API:
- Port: 8086
- Protocol: HTTP/HTTPS
- Authentication: Token-based
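
A minimal sketch of a token-authenticated Flux query against that API; the organization, bucket, and token values are placeholders.

```python
# Sketch of a Flux query against the InfluxDB v2 HTTP API on port 8086.
import requests

INFLUX_URL = "https://monitor11.example.com:8086"  # assumption: monitoring server
TOKEN = "REPLACE_WITH_TOKEN"
ORG = "example-org"  # placeholder organization

flux = '''
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu")
  |> mean()
'''

resp = requests.post(
    f"{INFLUX_URL}/api/v2/query",
    params={"org": ORG},
    headers={
        "Authorization": f"Token {TOKEN}",
        "Content-Type": "application/vnd.flux",
        "Accept": "application/csv",
    },
    data=flux,
    timeout=30,
)
resp.raise_for_status()
print(resp.text)  # results come back as annotated CSV
```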
### Loki (Log Storage)
Purpose: Log aggregation system with label-based indexing
Key Features:
- Label-based indexing (not full-text)
- Cost-effective storage
- LogQL query language
- Multi-tenancy support
- Horizontal scaling
Storage Options:
- Local filesystem
- NFS mounts
- S3-compatible object storage
API:
- Port: 3100
- Protocol: HTTP/HTTPS
- Authentication: Optional (recommended for production)
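
A minimal sketch of a LogQL range query against that API, assuming no authentication in front of Loki (add credentials if a gateway requires them); the label selector matches the schema used elsewhere in this document.

```python
# Sketch of a LogQL range query over the last hour via the Loki HTTP API.
import time
import requests

LOKI_URL = "https://monitor11.example.com:3100"  # assumption: Loki on the monitoring server
now_ns = time.time_ns()

resp = requests.get(
    f"{LOKI_URL}/loki/api/v1/query_range",
    params={
        "query": '{service_type="fail2ban"} |= "Ban"',
        "start": str(now_ns - 3600 * 10**9),  # one hour ago, in nanoseconds
        "end": str(now_ns),
        "limit": "100",
    },
    timeout=30,
)
resp.raise_for_status()
for stream in resp.json()["data"]["result"]:
    print(stream["stream"], len(stream["values"]), "lines")
```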
### Telegraf (Metrics Collector)
Purpose: Plugin-driven metrics collection agent
Key Features:
- 300+ input plugins
- Minimal resource footprint
- Configurable collection intervals
- Local buffering and retry logic
- Plugin-based architecture
Common Inputs:
- System metrics (CPU, memory, disk, network)
- Application metrics (Docker, databases, web servers)
- Custom scripts and commands
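
As an example of the last point, a hypothetical custom-script input might look like the sketch below, assuming it is wired into an `[[inputs.exec]]` stanza with `data_format = "influx"` so Telegraf parses its stdout as line protocol.

```python
#!/usr/bin/env python3
# Hypothetical custom metric script for Telegraf's exec input plugin.
# Telegraf runs it on each collection interval and parses stdout as
# InfluxDB line protocol (data_format = "influx").
import os

# Example: report the number of logged-in users as a gauge
with os.popen("who | wc -l") as proc:
    users = int(proc.read().strip() or 0)

# measurement,tag=value field=value (timestamp is optional; Telegraf adds one)
print(f"logged_in_users,source=custom_script count={users}i")
```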
### Alloy (Observability Pipeline)
Purpose: Programmable observability data processor from Grafana Labs
Key Capabilities:
- Label Engineering: Custom label extraction and normalization to control cardinality
- Data Filtering: Reduce signal-to-noise ratio by filtering irrelevant log entries
- Structured Parsing: Parse unstructured logs into queryable fields (journald, syslog, JSON)
- Multi-Source Collection: Unified collection from journald, files, containers, syslog
- Dynamic Configuration: River language enables conditional logic and transformations
Use Cases in Solti-Monitoring:
- Parse fail2ban journald logs to extract jail, action, and IP fields
- Filter verbose DNS queries to keep only security-relevant events
- Normalize mail service logs across Postfix/Dovecot for consistent querying
- Add contextual labels (service_type, hostname) for dashboard filtering
- Control Loki cardinality by limiting label dimensions
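
One rough way to confirm the last point is to ask Loki how many distinct values each label carries; anything with hundreds of values (for example, a raw IP address promoted to a label) signals a cardinality problem. The endpoint URL below is an assumption.

```python
# Rough cardinality check: list every label Loki knows about and count
# its distinct values using the label/values API.
import requests

LOKI_URL = "https://monitor11.example.com:3100"  # assumption: Loki endpoint

labels = requests.get(f"{LOKI_URL}/loki/api/v1/labels", timeout=10).json()["data"]
for label in labels:
    values = requests.get(
        f"{LOKI_URL}/loki/api/v1/label/{label}/values", timeout=10
    ).json()["data"]
    print(f"{label}: {len(values)} distinct values")
```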
Why Alloy vs Simple Forwarders:
- Enables AI-assisted dashboard analysis (Claude queries with predictable labels)
- Reduces Loki storage costs by filtering noise before ingestion
- Creates consistent labeling schema across heterogeneous services
- Reference: See sprint reports on Alloy+Bind9 integration and dashboard development
Log Sources:
- Journald (systemd services)
- File tailing (application logs)
- Docker container logs
- Syslog
### Grafana (Visualization)
Purpose: Unified observability platform
Key Features:
- Multi-datasource dashboards
- Alerting and notifications
- User management and RBAC
- Templating and variables
- Plugin ecosystem
Supported Datasources:
- InfluxDB (metrics)
- Loki (logs)
- Prometheus, Elasticsearch, and 100+ others
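
A quick sanity check of the configured datasources via the Grafana HTTP API; the URL, port, and service-account token below are placeholders rather than values from this deployment.

```python
# List the datasources registered in Grafana (e.g., InfluxDB and Loki).
import requests

GRAFANA_URL = "https://monitor11.example.com:3000"  # assumption: default Grafana port
TOKEN = "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN"

resp = requests.get(
    f"{GRAFANA_URL}/api/datasources",
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=10,
)
resp.raise_for_status()
for ds in resp.json():
    print(ds["name"], ds["type"])  # e.g. "InfluxDB influxdb", "Loki loki"
```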
## Current Implementation
### Production Deployments
monitor11.example.com (Proxmox VM):
- InfluxDB + Telegraf (metrics)
- Loki + Alloy (logs)
- Grafana (visualization)
- WireGuard endpoint for remote collectors
ispconfig3-server.example.com (Linode VPS):
- Telegraf + Alloy collectors
- Ships to monitor11 via WireGuard
- Monitors: Apache, ISPConfig, Fail2ban, Gitea, Mail, Bind9
### Storage Backend
InfluxDB v2 OSS:
- Storage: NFS mount for data directory
- Retention: 30-day policy configured in bucket settings
- Note: InfluxDB v2 OSS does not support S3 object storage
Loki:
- Object Storage: s3-server.example.com:8010 (MinIO S3-compatible)
- Bucket: loki11
- Advantages: Cost-effective, scalable log storage
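
As a hypothetical sanity check that Loki chunks are landing in that bucket, a short boto3 sketch against the MinIO endpoint (the credentials are placeholders):

```python
# List a handful of objects from the Loki bucket on the MinIO server.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://s3-server.example.com:8010",
    aws_access_key_id="REPLACE_WITH_ACCESS_KEY",      # placeholder
    aws_secret_access_key="REPLACE_WITH_SECRET_KEY",  # placeholder
)

objects = s3.list_objects_v2(Bucket="loki11", MaxKeys=10)
for obj in objects.get("Contents", []):
    print(obj["Key"], obj["Size"])
```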
## AI-Assisted Observability Workflow
A key design goal is to enable Claude Code to analyze dashboards and query observability data programmatically.
How Alloy Enables This:
- Predictable Label Schema: Alloy normalizes labels (e.g., `service_type`, `hostname`, `jail`) so Claude can construct queries without guessing
- Parsed Fields: Pre-extracted fields (IP addresses, actions, timestamps) enable structured querying
- Reduced Noise: Filtering at collection time means Claude queries return relevant results, not spam
Example Workflow:

```text
User: "Show me fail2ban activity for the last 24 hours"
        ↓
Claude: Constructs Loki query using known labels
        → {service_type="fail2ban", hostname="ispconfig3-server.example.com"}
        → Parses results using known field names (jail, banned_ip, action)
        ↓
Claude: Generates dashboard panels programmatically via Grafana API
Claude: Analyzes patterns, suggests improvements
```
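
A sketch of the query step in that workflow: with the normalized labels, a client (Claude or otherwise) can build the LogQL selector directly and summarize activity per jail without inspecting raw log formats. The Loki URL is an assumption, and the `jail` label is assumed to be emitted by the Alloy pipeline described above.

```python
# Query 24 hours of fail2ban logs using the normalized label schema and
# count events per jail from the stream labels Loki returns.
import time
from collections import Counter
import requests

LOKI_URL = "https://monitor11.example.com:3100"  # assumption: Loki endpoint
now_ns = time.time_ns()

resp = requests.get(
    f"{LOKI_URL}/loki/api/v1/query_range",
    params={
        "query": '{service_type="fail2ban", hostname="ispconfig3-server.example.com"}',
        "start": str(now_ns - 24 * 3600 * 10**9),  # last 24 hours
        "end": str(now_ns),
        "limit": "1000",
    },
    timeout=30,
)
resp.raise_for_status()

events_per_jail = Counter()
for stream in resp.json()["data"]["result"]:
    jail = stream["stream"].get("jail", "unknown")
    events_per_jail[jail] += len(stream["values"])

for jail, count in events_per_jail.most_common():
    print(f"{jail}: {count} events")
```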
Reports:

- [Alloy+Bind9 Integration Sprint](reports/sprint-2025-12-31-alloy-bind9-integration.md) - Label design for DNS query logs
- [Alloy Dashboard Creation](reports/Alloy-Metrics-Dashboard-Creation-2026-01-02.md) - Programmatic dashboard generation
- [Fail2ban Dashboard Debugging](reports/Alloy-Dashboard-Debugging-2026-01-03.md) - Query troubleshooting workflow
## Next Steps
### Planned Enhancements
- Alloy Config Validation
  - Pre-deployment config testing (test playbook implemented)
  - Live config reload exploration
- Multi-Site Deployment
  - Expand beyond monitor11/ispconfig3
  - Standardize Alloy processing pipelines across hosts
- Advanced Dashboards
  - Service-specific dashboards (Fail2ban ✅, Mail, DNS)
  - SLI/SLO tracking with Alloy-derived metrics
  - Capacity planning views
- Alerting
  - Grafana alerting rules based on Alloy-parsed events
  - Multi-channel notifications (email, Mattermost)
  - Alert escalation policies