System Health¶
The System Health tool checks the health of the Pipelit platform infrastructure. It inspects Redis connectivity, RQ worker status, queue depths, stuck executions, recent failures, and problematic scheduled jobs -- returning a comprehensive health report with a summary status and actionable issues.
Component type: system_health
How It Works¶
When invoked, the tool runs six health checks in sequence:
- Redis -- Pings Redis and reports memory usage and connected clients.
- RQ Workers -- Lists all active workers, their state, and queues.
- Queue Depths -- Reports the number of pending jobs in the `workflows` and `default` queues.
- Stuck Executions -- Finds executions that have been running for more than 15 minutes.
- Failed Executions -- Counts failures in the last 24 hours, grouped by error message.
- Scheduled Jobs -- Identifies dead or erroring scheduled jobs.
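The check loop can be sketched like this. The check bodies below are illustrative stand-ins; the real probes query Redis, RQ, and the execution store. The key point is that each check contributes a dict with a `status` key, and a check that throws is recorded as a finding rather than aborting the report:

```python
from datetime import datetime, timezone

# Hypothetical stand-ins for the real probes (Redis ping, RQ worker
# listing, queue depths, etc.) -- names and payloads are illustrative.
def check_redis():
    return {"status": "ok", "connected_clients": 5}

def check_workers():
    return {"status": "ok", "count": 2}

CHECKS = {"redis": check_redis, "workers": check_workers}

def run_health_checks():
    report = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "checks": {},
        "issues": [],
    }
    for name, check in CHECKS.items():
        try:
            report["checks"][name] = check()
        except Exception as exc:
            # A failing probe is itself a health finding, not a crash.
            report["checks"][name] = {"status": "error", "detail": str(exc)}
    return report
```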
The results are combined into a summary status:
| Summary | Condition |
|---|---|
| `healthy` | All checks pass, no critical issues |
| `degraded` | More than 5 failed executions in 24h, or dead scheduled jobs |
| `critical` | Redis unreachable, no RQ workers, or stuck executions |
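The mapping from individual check results to the overall summary can be expressed as a small decision function. This is a sketch mirroring the conditions in the table above; the field names follow the output format shown later on this page, but the function itself is illustrative:

```python
def derive_summary(checks):
    """Collapse raw check results into healthy/degraded/critical (sketch)."""
    # Critical: Redis unreachable, no RQ workers, or stuck executions.
    if (checks["redis"]["status"] != "ok"
            or checks["workers"]["count"] == 0
            or checks["stuck_executions"]["count"] > 0):
        return "critical"
    # Degraded: more than 5 failures in 24h, or dead scheduled jobs.
    if (checks["failed_executions"]["total_24h"] > 5
            or checks["scheduled_jobs"]["dead_count"] > 0):
        return "degraded"
    return "healthy"
```

Note that critical conditions are evaluated first, so a system with both dead scheduled jobs and no workers reports `critical`, not `degraded`.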
Ports¶
Outputs¶
| Port | Type | Description |
|---|---|---|
| result | STRING | JSON health report with timestamp, summary, checks, and issues |
Output Format¶
{
"timestamp": "2026-02-16T10:30:00+00:00",
"summary": "healthy",
"checks": {
"redis": {
"status": "ok",
"used_memory_human": "2.50M",
"used_memory_peak_human": "4.12M",
"connected_clients": 5
},
"workers": {
"status": "ok",
"count": 2,
"workers": [
{"name": "worker-1", "state": "idle", "queues": ["workflows", "default"]},
{"name": "worker-2", "state": "busy", "queues": ["workflows"]}
]
},
"queues": {
"status": "ok",
"workflows": 3,
"default": 0
},
"stuck_executions": {
"status": "ok",
"count": 0,
"executions": []
},
"failed_executions": {
"status": "ok",
"total_24h": 2,
"by_error": [
{"error": "LLM timeout", "count": 2}
]
},
"scheduled_jobs": {
"status": "ok",
"dead_count": 0,
"erroring_count": 0,
"jobs": []
}
},
"issues": []
}
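Because the result port emits a JSON string, a downstream consumer parses it before reading fields. A minimal sketch (the payload here is an abbreviated sample of the format above):

```python
import json

# Abbreviated sample of the string emitted on the result port.
raw = '''{
  "timestamp": "2026-02-16T10:30:00+00:00",
  "summary": "healthy",
  "checks": {"queues": {"status": "ok", "workflows": 3, "default": 0}},
  "issues": []
}'''

report = json.loads(raw)
pending = report["checks"]["queues"]["workflows"]  # pending workflow jobs
```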
Issues Format¶
When problems are found, the issues array contains actionable items:
{
"issues": [
{
"severity": "critical",
"check": "workers",
"detail": "No RQ workers running"
},
{
"severity": "warn",
"check": "failed_executions",
"detail": "12 failed execution(s) in the last 24 hours"
}
]
}
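A consumer that only wants the most urgent finding can rank issues by severity. A small sketch, assuming the two severity levels documented below:

```python
def worst_issue(issues):
    """Return the highest-severity issue, or None if the list is empty."""
    order = {"critical": 2, "warn": 1}  # critical outranks warn
    return max(issues, key=lambda i: order[i["severity"]], default=None)

issues = [
    {"severity": "warn", "check": "failed_executions",
     "detail": "12 failed execution(s) in the last 24 hours"},
    {"severity": "critical", "check": "workers",
     "detail": "No RQ workers running"},
]
top = worst_issue(issues)
```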
Issue severity levels:
| Severity | Meaning |
|---|---|
| `critical` | System cannot function properly -- immediate action required |
| `warn` | Degraded performance or accumulating problems -- should be investigated |
Configuration¶
This tool has no configurable settings.
Usage¶
Connect this tool to an agent via the green diamond tool handle. It is useful for operational monitoring and self-healing agents:
flowchart LR
SH[System Health] -.->|tool| Agent
Sched[Scheduler Tools] -.->|tool| Agent
PA[Platform API] -.->|tool| Agent
Model[AI Model] -.->|llm| Agent
Trigger[Schedule Trigger] --> Agent
Tool Signature¶
This tool takes no parameters.
Example¶
A monitoring agent that runs on a schedule and reports issues:
Agent: Let me check the system health.
Tool call: check_system_health()
Result: {
"summary": "degraded",
"checks": { ... },
"issues": [
{
"severity": "warn",
"check": "scheduled_jobs",
"detail": "2 dead scheduled job(s)"
},
{
"severity": "warn",
"check": "failed_executions",
"detail": "8 failed execution(s) in the last 24 hours"
}
]
}
Agent: System status is DEGRADED. Two issues found:
1. 2 dead scheduled jobs need attention -- they have exceeded retry limits.
2. 8 failed executions in the last 24 hours, indicating a recurring problem.
Recommended actions:
- Review the dead scheduled jobs and either restart or delete them.
- Check the error messages on failed executions to identify the root cause.
Self-Healing Pattern¶
Combine System Health with Scheduler Tools and Platform API for a self-healing agent:
flowchart TD
Schedule[Schedule Trigger<br/>Every 5 minutes] --> Agent
SH[System Health] -.->|tool| Agent
Sched[Scheduler Tools] -.->|tool| Agent
PA[Platform API] -.->|tool| Agent
Model[AI Model] -.->|llm| Agent
The agent:
- Runs `check_system_health()` on a schedule.
- If it finds dead scheduled jobs, uses `scheduler_tools` to restart or delete them.
- If it finds stuck executions, uses `platform_api` to cancel them.
- Reports its actions via a delivery channel (Telegram, webhook, etc.).
Check Thresholds
- Stuck executions: Running longer than 15 minutes.
- Failed executions warning: More than 5 failures in 24 hours.
- Scheduled jobs: Any job with `dead` status or a non-zero `error_count`.
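For reference, the thresholds above can be written out as predicates. A minimal sketch with the constants from this page; function names are illustrative:

```python
STUCK_AFTER_MINUTES = 15        # executions running longer are "stuck"
FAILED_24H_WARN_THRESHOLD = 5   # more failures than this in 24h warns

def is_stuck(runtime_minutes):
    return runtime_minutes > STUCK_AFTER_MINUTES

def failures_warrant_warning(total_24h):
    return total_24h > FAILED_24H_WARN_THRESHOLD

def scheduled_job_unhealthy(job):
    return job["status"] == "dead" or job["error_count"] > 0
```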
Combine with Scheduling
For continuous monitoring, create a scheduled job that runs the health-check workflow every few minutes using Scheduler Tools. The agent can then escalate issues by sending alerts via Telegram or webhook delivery.