# Observability Skills

Investigate incidents faster. Pre-built skills for logs, metrics, APM, and SLOs.
## Quick Start

```bash
# 1. Install observability skills
./hub/moltler-cli.sh install --all --category observability
./hub/moltler-cli.sh install --all --category apm
./hub/moltler-cli.sh install --all --category metrics

# 2. Run your first skill
curl -u elastic-admin:elastic-password http://localhost:9200/_escript \
  -H "Content-Type: application/json" \
  -d '{"query": "RUN SKILL get_recent_errors()"}'
```
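If you call the endpoint from code rather than curl, the request body is just a JSON object with a `query` field. A minimal Python sketch that builds such a body (the `skill_payload` helper and its quoting rules are illustrative assumptions, not part of the product):

```python
import json

def skill_payload(skill, **params):
    """Build the JSON body for a RUN SKILL request (illustrative helper)."""
    def fmt(value):
        # The examples on this page single-quote strings and leave numbers bare.
        return f"'{value}'" if isinstance(value, str) else str(value)
    args = ", ".join(f"{k} => {fmt(v)}" for k, v in params.items())
    return json.dumps({"query": f"RUN SKILL {skill}({args})"})

# Same body as the curl example above
print(skill_payload("get_recent_errors"))
# {"query": "RUN SKILL get_recent_errors()"}
```

Keyword arguments keep their call order (Python 3.7+), so parameters render in the order you pass them.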
## Common Scenarios

### "What errors happened in the last hour?"

Returns the latest ERROR logs with full context.
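For example, scoped to a single service (`service` and `limit` are the same parameters this skill takes in the workflow later on this page):

```sql
RUN SKILL get_recent_errors(service => 'api-gateway', limit => 10);
```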
### "Why is the API slow?"

```sql
-- Find slow transactions
RUN SKILL get_slow_transactions(service => 'api-gateway', threshold_ms => 1000);

-- Get the trace
RUN SKILL get_trace(trace_id => 'abc123');

-- Analyze database queries
RUN SKILL analyze_database_queries(service => 'api-gateway');
```
### "What's the error rate trend?"

```sql
-- Count by level
RUN SKILL count_logs_by_level(index_pattern => 'logs-*');

-- Compare time periods
RUN SKILL compare_time_periods(
  index_pattern => 'logs-*',
  current_period => '1h',
  previous_period => '1h'
);
```
### "Show me the context around this error"

```sql
-- Get surrounding logs for a trace
RUN SKILL correlate_logs(trace_id => 'abc123');

-- Get error context with stack trace
RUN SKILL get_error_context(error_id => 'err-456');
```
### "What patterns are in my logs?"
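Groups similar messages so repeated failures surface as one pattern. A minimal call (the `index_pattern` argument mirrors the other log skills on this page; the skill's exact signature isn't shown here):

```sql
RUN SKILL get_log_patterns(index_pattern => 'logs-*');
```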
## Available Skills

### Log Analysis

| Skill | Description |
|---|---|
| get_recent_errors | Get recent ERROR logs with details |
| count_logs_by_level | Count logs grouped by level |
| get_log_patterns | Identify common log patterns |
| correlate_logs | Find related logs by trace ID |
| get_error_context | Get context around an error |
| compare_time_periods | Compare metrics across time |
### APM & Tracing

| Skill | Description |
|---|---|
| list_services | List all monitored services |
| get_slow_transactions | Find transactions over threshold |
| get_trace | Get full distributed trace |
| get_failed_transactions | Find failed transactions |
| get_throughput | Get transaction throughput |
| get_service_map | Visualize service dependencies |
| analyze_database_queries | Find slow DB queries |
### Metrics & Infrastructure

| Skill | Description |
|---|---|
| get_host_metrics | CPU, memory, disk for a host |
| list_hosts | List all monitored hosts |
| get_memory_pressure | Find hosts with memory issues |
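A typical check lists hosts, then drills into one (the `host` parameter name is an assumption; the table above doesn't show signatures, and `web-01` is a placeholder hostname):

```sql
RUN SKILL list_hosts();
RUN SKILL get_host_metrics(host => 'web-01');
```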
### SLOs & Synthetics

| Skill | Description |
|---|---|
| get_slo_status | Check SLO burn rate |
| list_monitors | List synthetic monitors |
| get_monitor_status | Get monitor health |
| get_availability | Calculate uptime percentage |
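For example, to check SLO burn rate and uptime together (the `monitor` parameter name and the `checkout-homepage` value are illustrative assumptions, since signatures aren't listed above):

```sql
RUN SKILL get_slo_status();
RUN SKILL get_availability(monitor => 'checkout-homepage');
```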
## Real-World Workflow

Scenario: production incident - API returning 500 errors.

```sql
-- Step 1: How bad is it?
RUN SKILL count_logs_by_level(
  index_pattern => 'logs-api-*',
  time_range => '15m'
);
-- Result: ERROR: 234, WARN: 89, INFO: 12,456

-- Step 2: What are the errors?
RUN SKILL get_recent_errors(
  service => 'api-gateway',
  limit => 10
);
-- Result: "Database connection timeout" repeated

-- Step 3: Is this affecting all endpoints?
RUN SKILL get_failed_transactions(
  service => 'api-gateway',
  limit => 20
);
-- Result: /api/orders endpoint has 90% failure rate

-- Step 4: What's the database doing?
RUN SKILL analyze_database_queries(
  service => 'api-gateway',
  threshold_ms => 5000
);
-- Result: SELECT * FROM orders - avg 8.2s (normally 50ms)

-- Step 5: When did this start?
RUN SKILL compare_time_periods(
  index_pattern => 'logs-api-*',
  current_period => '1h',
  previous_period => '1h'
);
-- Result: Error rate increased 400% starting 45 minutes ago
```

Resolution time: 5 minutes instead of 30.
## Combine with AI

Connect to an AI assistant that can run these skills:

```
User: "Why is checkout failing?"

AI: Let me investigate...
[Runs get_recent_errors(service => 'checkout')]
[Runs get_slow_transactions(service => 'checkout')]
[Runs get_trace(trace_id => '...')]

AI: "Checkout is failing because the payment service is timing out.
     The database query for fraud detection is taking 12 seconds.
     This started 2 hours ago when a new index was deployed."
```

See MCP Integration to connect your AI assistant.
## Build Custom Skills

Have a specific investigation pattern? Create a skill:

```sql
CREATE SKILL investigate_service
VERSION '1.0.0'
DESCRIPTION 'Full investigation of a service - errors, latency, throughput'
(service STRING DESCRIPTION 'Service name to investigate')
RETURNS DOCUMENT
BEGIN
  DECLARE errors ARRAY;
  DECLARE slow_txns ARRAY;
  DECLARE error_count INT;

  SET errors = ESQL_QUERY('FROM logs-* | WHERE service == "' || service || '" AND level == "ERROR" | LIMIT 10');
  SET slow_txns = ESQL_QUERY('FROM traces-* | WHERE service.name == "' || service || '" AND duration > 1000 | LIMIT 10');
  SET error_count = ARRAY_LENGTH(errors);

  RETURN {
    'service': service,
    'error_count': error_count,
    'recent_errors': errors,
    'slow_transactions': slow_txns,
    -- LIMIT 10 caps error_count at 10, so a full page of errors counts as critical
    'status': CASE WHEN error_count >= 10 THEN 'critical' WHEN error_count > 0 THEN 'warning' ELSE 'healthy' END
  };
END SKILL;
```
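Once created, it runs like any built-in skill, with the `service` parameter declared above:

```sql
RUN SKILL investigate_service(service => 'api-gateway');
```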