

System Architecture

Solti-Monitoring implements a distributed monitoring architecture built on two parallel pipelines: a metrics pipeline (Telegraf → InfluxDB) and a logging pipeline (Alloy → Loki), both visualized through Grafana.

Monitoring Pipelines

Metrics Pipeline (Telegraf → InfluxDB)

graph LR
    A[Monitored Host<br/>Telegraf Collector] -->|Metrics| B[Monitoring Server<br/>InfluxDB Storage]
    B -->|Query| C[Grafana<br/>Visualization]

    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#f0e1ff

Flow:

  1. Telegraf collects metrics from local system/applications
  2. Metrics sent to InfluxDB via HTTP(S)
  3. InfluxDB stores metrics in time-series database
  4. Grafana queries InfluxDB for visualization
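
To make step 2 concrete: the write is an HTTP POST of InfluxDB line protocol to the v2 write endpoint, which is essentially what Telegraf's InfluxDB v2 output plugin does on each flush (with batching and retries on top). A minimal Python sketch, with placeholder host, org, bucket, and token values:

```python
import requests

INFLUX_URL = "https://monitor11.example.com:8086"   # placeholder host, default InfluxDB port
ORG = "example-org"                                  # placeholder organization
BUCKET = "telegraf"                                  # placeholder bucket
TOKEN = "REPLACE_WITH_TOKEN"                         # token-based authentication

# One data point in InfluxDB line protocol: measurement,tags fields
line = "cpu,host=ispconfig3-server usage_idle=87.5"

resp = requests.post(
    f"{INFLUX_URL}/api/v2/write",
    params={"org": ORG, "bucket": BUCKET, "precision": "ns"},
    headers={"Authorization": f"Token {TOKEN}"},
    data=line,
)
resp.raise_for_status()   # InfluxDB answers 204 No Content on success
```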

Logging Pipeline (Alloy → Loki)

graph LR
    A[Monitored Host<br/>Alloy Pipeline<br/>Parse - Filter - Label] -->|Structured Logs| B[Monitoring Server<br/>Loki Storage]
    B -->|LogQL Query| C[Grafana<br/>Visualization]

    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#f0e1ff

Flow:

  1. Alloy collects raw logs from journald, files, containers
  2. Alloy processes logs: parses fields, adds labels, filters noise
  3. Structured logs sent to Loki via HTTP(S) with normalized labels
  4. Loki stores logs with label-based indexing (low cardinality)
  5. Grafana/Claude query Loki using consistent label schema
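
On the wire, the push in step 3 is a POST of labeled streams to Loki's /loki/api/v1/push endpoint. A minimal sketch, assuming the label schema used elsewhere in this project (service_type, hostname); the host and log values are placeholders:

```python
import json
import time
import requests

LOKI_URL = "https://monitor11.example.com:3100"   # placeholder host, default Loki port

# One structured entry with normalized labels (label names follow this project's schema)
payload = {
    "streams": [
        {
            "stream": {
                "service_type": "fail2ban",
                "hostname": "ispconfig3-server.example.com",
            },
            "values": [
                # [nanosecond timestamp as a string, log line]
                [str(time.time_ns()), "jail=sshd action=ban banned_ip=203.0.113.7"],
            ],
        }
    ]
}

resp = requests.post(
    f"{LOKI_URL}/loki/api/v1/push",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
)
resp.raise_for_status()   # Loki answers 204 on successful ingestion
```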

Combined Architecture

graph TB
    subgraph "Monitored Hosts"
        T[Telegraf<br/>Metrics Collector]
        A[Alloy<br/>Observability Pipeline<br/>Parse - Filter - Label]
    end

    subgraph "Monitoring Server"
        I[InfluxDB<br/>Metrics Storage]
        L[Loki<br/>Log Storage]
        G[Grafana<br/>Unified Visualization]
    end

    T -->|Metrics| I
    A -->|Logs| L
    I -->|Query| G
    L -->|Query| G

    style T fill:#e1f5ff
    style A fill:#e1f5ff
    style I fill:#fff4e1
    style L fill:#fff4e1
    style G fill:#f0e1ff

Component Details

InfluxDB (Metrics Storage)

Purpose: Time-series database optimized for metrics

Version: InfluxDB v2 OSS (Open Source)

Key Features:

  • High-performance time-series storage
  • Flux query language for analysis
  • Retention policies for automatic data cleanup
  • Built-in downsampling and aggregation

Storage Options:

  • Local disk (default)
  • NFS mounts (shared storage for data directory)

Note: InfluxDB v2 OSS does not support S3 object storage. For scalable/shared storage, mount an NFS volume to the InfluxDB data directory.

API:

  • Port: 8086
  • Protocol: HTTP/HTTPS
  • Authentication: Token-based
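
A minimal sketch of running a Flux query against the HTTP API described above; the host, org, bucket name, and token are placeholders:

```python
import requests

INFLUX_URL = "https://monitor11.example.com:8086"   # placeholder host
ORG = "example-org"                                  # placeholder organization
TOKEN = "REPLACE_WITH_TOKEN"

# Flux: mean CPU idle over the last hour (bucket and measurement names are illustrative)
flux = """
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
  |> mean()
"""

resp = requests.post(
    f"{INFLUX_URL}/api/v2/query",
    params={"org": ORG},
    headers={
        "Authorization": f"Token {TOKEN}",
        "Content-Type": "application/vnd.flux",
        "Accept": "application/csv",
    },
    data=flux,
)
resp.raise_for_status()
print(resp.text)   # results come back as annotated CSV
```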

Loki (Log Storage)

Purpose: Log aggregation system with label-based indexing

Key Features:

  • Label-based indexing (not full-text)
  • Cost-effective storage
  • LogQL query language
  • Multi-tenancy support
  • Horizontal scaling

Storage Options:

  • Local filesystem
  • NFS mounts
  • S3-compatible object storage

API:

  • Port: 3100
  • Protocol: HTTP/HTTPS
  • Authentication: Optional (recommended for production)
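
A LogQL query against this API selects streams by indexed labels first, then filters line contents. A minimal Python sketch (host and label values are placeholders; authentication is omitted since it is optional here):

```python
import time
import requests

LOKI_URL = "https://monitor11.example.com:3100"   # placeholder host, port 3100 as above

# LogQL: label selector (indexed) followed by a line-content filter
query = '{hostname="ispconfig3-server.example.com"} |= "error"'

end = time.time_ns()
start = end - 24 * 60 * 60 * 10**9   # last 24 hours, nanosecond timestamps

resp = requests.get(
    f"{LOKI_URL}/loki/api/v1/query_range",
    params={"query": query, "start": start, "end": end, "limit": 100},
)
resp.raise_for_status()

for stream in resp.json()["data"]["result"]:
    print(stream["stream"], len(stream["values"]), "entries")
```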

Telegraf (Metrics Collector)

Purpose: Plugin-driven metrics collection agent

Key Features:

  • 300+ plugins across inputs, outputs, processors, and aggregators
  • Minimal resource footprint
  • Configurable collection intervals
  • Local buffering and retry logic
  • Plugin-based architecture

Common Inputs:

  • System metrics (CPU, memory, disk, network)
  • Application metrics (Docker, databases, web servers)
  • Custom scripts and commands
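
The last item, custom scripts, is typically wired in through Telegraf's exec input plugin with data_format = "influx": the script prints line protocol to stdout and Telegraf collects it on each interval. A minimal sketch (the measurement name and queue path are illustrative, not part of this project):

```python
#!/usr/bin/env python3
"""Custom metric source for Telegraf's exec input plugin (data_format = "influx").

Prints InfluxDB line protocol to stdout; Telegraf runs the script on each
collection interval. The measurement name and path below are illustrative only.
"""
import os

QUEUE_DIR = "/var/spool/postfix/deferred"   # placeholder path

# Count entries directly under the queue directory (a toy example)
try:
    deferred = len(os.listdir(QUEUE_DIR))
except OSError:
    deferred = 0

# line protocol: measurement,tag=value field=value (the trailing "i" marks an integer)
print(f"mail_queue,queue=deferred length={deferred}i")
```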

Alloy (Observability Pipeline)

Purpose: Programmable observability data processor from Grafana Labs

Key Capabilities:

  • Label Engineering: Custom label extraction and normalization to control cardinality
  • Data Filtering: Improve signal-to-noise ratio by dropping irrelevant log entries before ingestion
  • Structured Parsing: Parse unstructured logs into queryable fields (journald, syslog, JSON)
  • Multi-Source Collection: Unified collection from journald, files, containers, syslog
  • Dynamic Configuration: River language enables conditional logic and transformations

Use Cases in Solti-Monitoring:

  • Parse fail2ban journald logs to extract jail, action, and IP fields
  • Filter verbose DNS queries to keep only security-relevant events
  • Normalize mail service logs across Postfix/Dovecot for consistent querying
  • Add contextual labels (service_type, hostname) for dashboard filtering
  • Control Loki cardinality by limiting label dimensions
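
To illustrate the kind of extraction the first use case describes (independent of Alloy's River syntax), the parsing amounts to pulling jail, action, and IP fields out of each journald line. A Python sketch with an assumed fail2ban log format:

```python
import re

# A typical fail2ban journald line (format assumed here for illustration)
line = "2026-01-03 12:41:07,104 fail2ban.actions [812]: NOTICE [sshd] Ban 203.0.113.7"

# Pull out the pieces the pipeline promotes to labels/fields
pattern = re.compile(r"\[(?P<jail>[^\]]+)\]\s+(?P<action>Ban|Unban)\s+(?P<banned_ip>\S+)")

match = pattern.search(line)
if match:
    print(match.groupdict())
    # {'jail': 'sshd', 'action': 'Ban', 'banned_ip': '203.0.113.7'}
```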

Why Alloy vs Simple Forwarders:

  • Enables AI-assisted dashboard analysis (Claude queries with predictable labels)
  • Reduces Loki storage costs by filtering noise before ingestion
  • Creates consistent labeling schema across heterogeneous services
  • Reference: See sprint reports on Alloy+Bind9 integration and dashboard development

Log Sources:

  • Journald (systemd services)
  • File tailing (application logs)
  • Docker container logs
  • Syslog

Grafana (Visualization)

Purpose: Unified observability platform

Key Features:

  • Multi-datasource dashboards
  • Alerting and notifications
  • User management and RBAC
  • Templating and variables
  • Plugin ecosystem

Supported Datasources:

  • InfluxDB (metrics)
  • Loki (logs)
  • Prometheus, Elasticsearch, and 100+ others
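
Datasources can be registered through Grafana's HTTP API as well as the UI. A minimal sketch adding a Loki datasource (host, URL, and service-account token are placeholders; the InfluxDB case is analogous):

```python
import requests

GRAFANA_URL = "https://monitor11.example.com:3000"   # placeholder host and port
TOKEN = "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN"

datasource = {
    "name": "Loki",
    "type": "loki",
    "access": "proxy",
    "url": "http://127.0.0.1:3100",   # Loki colocated on the monitoring server
    "basicAuth": False,
}

resp = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=datasource,
)
resp.raise_for_status()
print(resp.json())   # response echoes the new datasource, including its uid
```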

Current Implementation

Production Deployments

monitor11.example.com (Proxmox VM):

  • InfluxDB + Telegraf (metrics)
  • Loki + Alloy (logs)
  • Grafana (visualization)
  • WireGuard endpoint for remote collectors

ispconfig3-server.example.com (Linode VPS):

  • Telegraf + Alloy collectors
  • Ships to monitor11 via WireGuard
  • Monitors: Apache, ISPConfig, Fail2ban, Gitea, Mail, Bind9

Storage Backend

InfluxDB v2 OSS:

  • Storage: NFS mount for data directory
  • Retention: 30-day policy configured in bucket settings
  • Note: InfluxDB v2 OSS does not support S3 object storage

Loki:

  • Object Storage: s3-server.example.com:8010 (MinIO S3-compatible)
  • Bucket: loki11
  • Advantages: Cost-effective, scalable log storage

AI-Assisted Observability Workflow

A key design goal: enable Claude Code to analyze dashboards and query data programmatically.

How Alloy Enables This:

  1. Predictable Label Schema: Alloy normalizes labels (e.g., service_type, hostname, jail) so Claude can construct queries without guessing
  2. Parsed Fields: Pre-extracted fields (IP addresses, actions, timestamps) enable structured querying
  3. Reduced Noise: Filtering at collection time means Claude queries return relevant results, not spam

Example Workflow:

User: "Show me fail2ban activity for the last 24 hours"
Claude: Constructs Loki query using known labels
  → {service_type="fail2ban", hostname="ispconfig3-server.example.com"}
  → Parses results using known field names (jail, banned_ip, action)
Claude: Generates dashboard panels programmatically via Grafana API
Claude: Analyzes patterns, suggests improvements
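
The dashboard-generation step goes through Grafana's dashboard API (POST /api/dashboards/db). A minimal sketch creating a single logs panel backed by the Loki datasource; the token, datasource UID, and panel layout are placeholders rather than the project's actual dashboards:

```python
import requests

GRAFANA_URL = "https://monitor11.example.com:3000"   # placeholder host and port
TOKEN = "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN"

# A single logs panel over the fail2ban stream; the Loki datasource UID is a placeholder
dashboard = {
    "dashboard": {
        "title": "Fail2ban Activity (generated)",
        "time": {"from": "now-24h", "to": "now"},
        "panels": [
            {
                "type": "logs",
                "title": "Recent bans",
                "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
                "datasource": {"type": "loki", "uid": "LOKI_DATASOURCE_UID"},
                "targets": [
                    {"refId": "A", "expr": '{service_type="fail2ban"} |= "Ban"'}
                ],
            }
        ],
    },
    "overwrite": False,
}

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=dashboard,
)
resp.raise_for_status()
print(resp.json())   # returns the new dashboard's uid and url
```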

Reports:

  • Alloy+Bind9 Integration Sprint (reports/sprint-2025-12-31-alloy-bind9-integration.md) - Label design for DNS query logs
  • Alloy Dashboard Creation (reports/Alloy-Metrics-Dashboard-Creation-2026-01-02.md) - Programmatic dashboard generation
  • Fail2ban Dashboard Debugging (reports/Alloy-Dashboard-Debugging-2026-01-03.md) - Query troubleshooting workflow

Next Steps

Planned Enhancements

  1. Alloy Config Validation
     • Pre-deployment config testing (test playbook implemented)
     • Live config reload exploration

  2. Multi-Site Deployment
     • Expand beyond monitor11/ispconfig3
     • Standardize Alloy processing pipelines across hosts

  3. Advanced Dashboards
     • Service-specific dashboards (Fail2ban ✅, Mail, DNS)
     • SLI/SLO tracking with Alloy-derived metrics
     • Capacity planning views

  4. Alerting
     • Grafana alerting rules based on Alloy-parsed events
     • Multi-channel notifications (email, Mattermost)
     • Alert escalation policies