

System Architecture

Solti-Monitoring implements a distributed monitoring architecture built on two parallel pipelines: a metrics pipeline (Telegraf → InfluxDB) and a logging pipeline (Alloy → Loki), both visualized through Grafana.

Monitoring Pipelines

Metrics Pipeline (Telegraf → InfluxDB)

graph LR
    A[Monitored Host<br/>Telegraf Collector] -->|Metrics| B[Monitoring Server<br/>InfluxDB Storage]
    B -->|Query| C[Grafana<br/>Visualization]

    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#f0e1ff

Flow:

  1. Telegraf collects metrics from local system/applications
  2. Metrics sent to InfluxDB via HTTP(S)
  3. InfluxDB stores metrics in time-series database
  4. Grafana queries InfluxDB for visualization
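
To make step 2 concrete: the write is an HTTP POST of InfluxDB line protocol to the v2 write endpoint, which is essentially what Telegraf's InfluxDB v2 output plugin does on each flush (with batching and retries on top). A minimal Python sketch, with placeholder host, org, bucket, and token values:

```python
import requests

INFLUX_URL = "https://monitor11.example.com:8086"   # placeholder host, default InfluxDB port
ORG = "example-org"                                  # placeholder organization
BUCKET = "telegraf"                                  # placeholder bucket
TOKEN = "REPLACE_WITH_TOKEN"                         # token-based authentication

# One data point in InfluxDB line protocol: measurement,tags fields
line = "cpu,host=ispconfig3-server usage_idle=87.5"

resp = requests.post(
    f"{INFLUX_URL}/api/v2/write",
    params={"org": ORG, "bucket": BUCKET, "precision": "ns"},
    headers={"Authorization": f"Token {TOKEN}"},
    data=line,
)
resp.raise_for_status()   # InfluxDB answers 204 No Content on success
```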

Logging Pipeline (Alloy → Loki)

graph LR
    A[Monitored Host<br/>Alloy Pipeline<br/>Parse - Filter - Label] -->|Structured Logs| B[Monitoring Server<br/>Loki Storage]
    B -->|LogQL Query| C[Grafana<br/>Visualization]

    style A fill:#e1f5ff
    style B fill:#fff4e1
    style C fill:#f0e1ff

Flow:

  1. Alloy collects raw logs from journald, files, containers
  2. Alloy processes logs: parses fields, adds labels, filters noise
  3. Structured logs sent to Loki via HTTP(S) with normalized labels
  4. Loki stores logs with label-based indexing (low cardinality)
  5. Grafana/Claude query Loki using consistent label schema
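
On the wire, the push in step 3 is a POST of labeled streams to Loki's /loki/api/v1/push endpoint. A minimal sketch, assuming the label schema used elsewhere in this project (service_type, hostname); the host and log values are placeholders:

```python
import json
import time
import requests

LOKI_URL = "https://monitor11.example.com:3100"   # placeholder host, default Loki port

# One structured entry with normalized labels (label names follow this project's schema)
payload = {
    "streams": [
        {
            "stream": {
                "service_type": "fail2ban",
                "hostname": "ispconfig3-server.example.com",
            },
            "values": [
                # [nanosecond timestamp as a string, log line]
                [str(time.time_ns()), "jail=sshd action=ban banned_ip=203.0.113.7"],
            ],
        }
    ]
}

resp = requests.post(
    f"{LOKI_URL}/loki/api/v1/push",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload),
)
resp.raise_for_status()   # Loki answers 204 on successful ingestion
```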

Combined Architecture

graph TB
    subgraph "Monitored Hosts"
        T[Telegraf<br/>Metrics Collector]
        A[Alloy<br/>Observability Pipeline<br/>Parse - Filter - Label]
    end

    subgraph "Monitoring Server"
        I[InfluxDB<br/>Metrics Storage]
        L[Loki<br/>Log Storage]
        G[Grafana<br/>Unified Visualization]
    end

    T -->|Metrics| I
    A -->|Logs| L
    I -->|Query| G
    L -->|Query| G

    style T fill:#e1f5ff
    style A fill:#e1f5ff
    style I fill:#fff4e1
    style L fill:#fff4e1
    style G fill:#f0e1ff

Component Details

InfluxDB (Metrics Storage)

Purpose: Time-series database optimized for metrics

Version: InfluxDB v2 OSS (Open Source)

Key Features:

  • High-performance time-series storage
  • Flux query language for analysis
  • Retention policies for automatic data cleanup
  • Built-in downsampling and aggregation

Storage Options:

  • Local disk (default)
  • NFS mounts (shared storage for data directory)

Note: InfluxDB v2 OSS does not support S3 object storage. For scalable/shared storage, mount an NFS volume to the InfluxDB data directory.

API:

  • Port: 8086
  • Protocol: HTTP/HTTPS
  • Authentication: Token-based
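
A minimal sketch of running a Flux query against the HTTP API described above; the host, org, bucket name, and token are placeholders:

```python
import requests

INFLUX_URL = "https://monitor11.example.com:8086"   # placeholder host
ORG = "example-org"                                  # placeholder organization
TOKEN = "REPLACE_WITH_TOKEN"

# Flux: mean CPU idle over the last hour (bucket and measurement names are illustrative)
flux = """
from(bucket: "telegraf")
  |> range(start: -1h)
  |> filter(fn: (r) => r._measurement == "cpu" and r._field == "usage_idle")
  |> mean()
"""

resp = requests.post(
    f"{INFLUX_URL}/api/v2/query",
    params={"org": ORG},
    headers={
        "Authorization": f"Token {TOKEN}",
        "Content-Type": "application/vnd.flux",
        "Accept": "application/csv",
    },
    data=flux,
)
resp.raise_for_status()
print(resp.text)   # results come back as annotated CSV
```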

Loki (Log Storage)

Purpose: Log aggregation system with label-based indexing

Key Features:

  • Label-based indexing (not full-text)
  • Cost-effective storage
  • LogQL query language
  • Multi-tenancy support
  • Horizontal scaling

Storage Options:

  • Local filesystem
  • NFS mounts
  • S3-compatible object storage

API:

  • Port: 3100
  • Protocol: HTTP/HTTPS
  • Authentication: Optional (recommended for production)
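
A LogQL query against this API selects streams by indexed labels first, then filters line contents. A minimal Python sketch (host and label values are placeholders; authentication is omitted since it is optional here):

```python
import time
import requests

LOKI_URL = "https://monitor11.example.com:3100"   # placeholder host, port 3100 as above

# LogQL: label selector (indexed) followed by a line-content filter
query = '{hostname="ispconfig3-server.example.com"} |= "error"'

end = time.time_ns()
start = end - 24 * 60 * 60 * 10**9   # last 24 hours, nanosecond timestamps

resp = requests.get(
    f"{LOKI_URL}/loki/api/v1/query_range",
    params={"query": query, "start": start, "end": end, "limit": 100},
)
resp.raise_for_status()

for stream in resp.json()["data"]["result"]:
    print(stream["stream"], len(stream["values"]), "entries")
```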

Telegraf (Metrics Collector)

Purpose: Plugin-driven metrics collection agent

Key Features:

  • 300+ plugins across inputs, outputs, processors, and aggregators
  • Minimal resource footprint
  • Configurable collection intervals
  • Local buffering and retry logic
  • Plugin-based architecture

Common Inputs:

  • System metrics (CPU, memory, disk, network)
  • Application metrics (Docker, databases, web servers)
  • Custom scripts and commands
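
The last item, custom scripts, is typically wired in through Telegraf's exec input plugin with data_format = "influx": the script prints line protocol to stdout and Telegraf collects it on each interval. A minimal sketch (the measurement name and queue path are illustrative, not part of this project):

```python
#!/usr/bin/env python3
"""Custom metric source for Telegraf's exec input plugin (data_format = "influx").

Prints InfluxDB line protocol to stdout; Telegraf runs the script on each
collection interval. The measurement name and path below are illustrative only.
"""
import os

QUEUE_DIR = "/var/spool/postfix/deferred"   # placeholder path

# Count entries directly under the queue directory (a toy example)
try:
    deferred = len(os.listdir(QUEUE_DIR))
except OSError:
    deferred = 0

# line protocol: measurement,tag=value field=value (the trailing "i" marks an integer)
print(f"mail_queue,queue=deferred length={deferred}i")
```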

Alloy (Observability Pipeline)

Purpose: Programmable observability data processor from Grafana Labs

Key Capabilities:

  • Label Engineering: Custom label extraction and normalization to control cardinality
  • Data Filtering: Improve signal-to-noise ratio by dropping irrelevant log entries before ingestion
  • Structured Parsing: Parse unstructured logs into queryable fields (journald, syslog, JSON)
  • Multi-Source Collection: Unified collection from journald, files, containers, syslog
  • Dynamic Configuration: River language enables conditional logic and transformations

Use Cases in Solti-Monitoring:

  • Parse fail2ban journald logs to extract jail, action, and IP fields
  • Filter verbose DNS queries to keep only security-relevant events
  • Normalize mail service logs across Postfix/Dovecot for consistent querying
  • Add contextual labels (service_type, hostname) for dashboard filtering
  • Control Loki cardinality by limiting label dimensions
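
To illustrate the kind of extraction the first use case describes (independent of Alloy's River syntax), the parsing amounts to pulling jail, action, and IP fields out of each journald line. A Python sketch with an assumed fail2ban log format:

```python
import re

# A typical fail2ban journald line (format assumed here for illustration)
line = "2026-01-03 12:41:07,104 fail2ban.actions [812]: NOTICE [sshd] Ban 203.0.113.7"

# Pull out the pieces the pipeline promotes to labels/fields
pattern = re.compile(r"\[(?P<jail>[^\]]+)\]\s+(?P<action>Ban|Unban)\s+(?P<banned_ip>\S+)")

match = pattern.search(line)
if match:
    print(match.groupdict())
    # {'jail': 'sshd', 'action': 'Ban', 'banned_ip': '203.0.113.7'}
```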

Why Alloy vs Simple Forwarders:

  • Enables AI-assisted dashboard analysis (Claude queries with predictable labels)
  • Reduces Loki storage costs by filtering noise before ingestion
  • Creates consistent labeling schema across heterogeneous services
  • Reference: See sprint reports on Alloy+Bind9 integration and dashboard development

Log Sources:

  • Journald (systemd services)
  • File tailing (application logs)
  • Docker container logs
  • Syslog

Grafana (Visualization)

Purpose: Unified observability platform

Key Features:

  • Multi-datasource dashboards
  • Alerting and notifications
  • User management and RBAC
  • Templating and variables
  • Plugin ecosystem

Supported Datasources:

  • InfluxDB (metrics)
  • Loki (logs)
  • Prometheus, Elasticsearch, and 100+ others
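
Datasources can be registered through Grafana's HTTP API as well as the UI. A minimal sketch adding a Loki datasource (host, URL, and service-account token are placeholders; the InfluxDB case is analogous):

```python
import requests

GRAFANA_URL = "https://monitor11.example.com:3000"   # placeholder host and port
TOKEN = "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN"

datasource = {
    "name": "Loki",
    "type": "loki",
    "access": "proxy",
    "url": "http://127.0.0.1:3100",   # Loki colocated on the monitoring server
    "basicAuth": False,
}

resp = requests.post(
    f"{GRAFANA_URL}/api/datasources",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=datasource,
)
resp.raise_for_status()
print(resp.json())   # response echoes the new datasource, including its uid
```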

Current Implementation

Production Deployments

monitor11.example.com (Proxmox VM):

  • InfluxDB + Telegraf (metrics)
  • Loki + Alloy (logs)
  • Grafana (visualization)
  • WireGuard endpoint for remote collectors

ispconfig3-server.example.com (Linode VPS):

  • Telegraf + Alloy collectors
  • Ships to monitor11 via WireGuard
  • Monitors: Apache, ISPConfig, Fail2ban, Gitea, Mail, Bind9

Storage Backend

InfluxDB v2 OSS:

  • Storage: NFS mount for data directory
  • Retention: 30-day policy configured in bucket settings
  • Note: InfluxDB v2 OSS does not support S3 object storage

Loki:

  • Object Storage: s3-server.example.com:8010 (MinIO S3-compatible)
  • Bucket: loki11
  • Advantages: Cost-effective, scalable log storage

AI-Assisted Observability Workflow

A key design goal: enable Claude Code to analyze dashboards and query data programmatically.

How Alloy Enables This:

  1. Predictable Label Schema: Alloy normalizes labels (e.g., service_type, hostname, jail) so Claude can construct queries without guessing
  2. Parsed Fields: Pre-extracted fields (IP addresses, actions, timestamps) enable structured querying
  3. Reduced Noise: Filtering at collection time means Claude queries return relevant results, not spam

Example Workflow:

User: "Show me fail2ban activity for the last 24 hours"
Claude: Constructs Loki query using known labels
  → {service_type="fail2ban", hostname="ispconfig3-server.example.com"}
  → Parses results using known field names (jail, banned_ip, action)
Claude: Generates dashboard panels programmatically via Grafana API
Claude: Analyzes patterns, suggests improvements
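
The dashboard-generation step goes through Grafana's dashboard API (POST /api/dashboards/db). A minimal sketch creating a single logs panel backed by the Loki datasource; the token, datasource UID, and panel layout are placeholders rather than the project's actual dashboards:

```python
import requests

GRAFANA_URL = "https://monitor11.example.com:3000"   # placeholder host and port
TOKEN = "REPLACE_WITH_SERVICE_ACCOUNT_TOKEN"

# A single logs panel over the fail2ban stream; the Loki datasource UID is a placeholder
dashboard = {
    "dashboard": {
        "title": "Fail2ban Activity (generated)",
        "time": {"from": "now-24h", "to": "now"},
        "panels": [
            {
                "type": "logs",
                "title": "Recent bans",
                "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
                "datasource": {"type": "loki", "uid": "LOKI_DATASOURCE_UID"},
                "targets": [
                    {"refId": "A", "expr": '{service_type="fail2ban"} |= "Ban"'}
                ],
            }
        ],
    },
    "overwrite": False,
}

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=dashboard,
)
resp.raise_for_status()
print(resp.json())   # returns the new dashboard's uid and url
```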

Reports:

  • Alloy+Bind9 Integration Sprint (reports/sprint-2025-12-31-alloy-bind9-integration.md) - Label design for DNS query logs
  • Alloy Dashboard Creation (reports/Alloy-Metrics-Dashboard-Creation-2026-01-02.md) - Programmatic dashboard generation
  • Fail2ban Dashboard Debugging (reports/Alloy-Dashboard-Debugging-2026-01-03.md) - Query troubleshooting workflow

Next Steps

Planned Enhancements

  1. Alloy Config Validation
     • Pre-deployment config testing (test playbook implemented)
     • Live config reload exploration

  2. Multi-Site Deployment
     • Expand beyond monitor11/ispconfig3
     • Standardize Alloy processing pipelines across hosts

  3. Advanced Dashboards
     • Service-specific dashboards (Fail2ban ✅, Mail, DNS)
     • SLI/SLO tracking with Alloy-derived metrics
     • Capacity planning views

  4. Alerting
     • Grafana alerting rules based on Alloy-parsed events
     • Multi-channel notifications (email, Mattermost)
     • Alert escalation policies