Creating Agents

This guide covers everything you need to know about creating agents in Moltler.

Agent Syntax

CREATE AGENT agent_name
GOAL 'description'
[INSTRUCTIONS 'system prompt']
SKILLS [skill1, skill2, ...]
[EXECUTION mode]
[TRIGGERS [...]]
[INFERENCE_ENDPOINT 'endpoint_id' | MODEL 'model_name']
[TEMPERATURE 0.7]
[MAX_TOKENS 4096]
[MAX_ITERATIONS 10]
[CONFIG {...}]
BEGIN
    -- Agent logic
END AGENT;

Key Features

Feature Description
GOAL The primary objective for the agent
INSTRUCTIONS System prompt that guides the LLM's behavior
SKILLS List of skills the agent can use
INFERENCE_ENDPOINT Use the Elasticsearch Inference API for reasoning
TEMPERATURE Controls LLM creativity (0.0 = deterministic, 1.0 = creative)
MAX_TOKENS Maximum number of tokens in each LLM response
MAX_ITERATIONS Maximum number of OODA (observe, orient, decide, act) loop iterations
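Putting these clauses together, a minimal agent might look like the following sketch (the skill names, endpoint ID, and thresholds are illustrative, not part of any shipped catalog):

CREATE AGENT latency_watcher
GOAL 'Keep p99 API latency below 500ms'
INSTRUCTIONS 'You are a latency watchdog. Prefer the least disruptive remediation.'
SKILLS [check_metrics, scale_service, notify_team]
INFERENCE_ENDPOINT 'my-llm'
TEMPERATURE 0.3
MAX_TOKENS 2048
MAX_ITERATIONS 5
BEGIN
    -- Agent logic
END AGENT;

Each clause shown here is covered in detail in the sections below.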

Goals

Defining Goals

A goal should be:

  • Specific - Clear about what to achieve
  • Measurable - Possible to determine success
  • Actionable - Can be accomplished with available skills

-- Good goals
GOAL 'Maintain 99.9% uptime for production services'
GOAL 'Respond to critical incidents within 5 minutes'
GOAL 'Reduce mean time to recovery below 30 minutes'

-- Bad goals (too vague)
GOAL 'Make things better'
GOAL 'Help with operations'

Goal Metrics

Define success metrics:

CREATE AGENT uptime_guardian
GOAL 'Maintain 99.9% uptime for production services'
GOAL_METRICS {
    "uptime_target": 0.999,
    "mttr_target_minutes": 30,
    "incident_response_time_minutes": 5
}
...

Skills

Skill Selection

Choose skills that:

  1. Work toward the goal
  2. Complement each other
  3. Cover the required actions

SKILLS [
    -- Detection
    detect_anomaly,
    analyze_metrics,

    -- Diagnosis
    correlate_events,
    identify_root_cause,

    -- Remediation
    scale_service,
    restart_pods,
    rollback_deployment,

    -- Communication
    notify_team,
    update_status_page
]

Skill Versions

Pin specific versions for stability:

SKILLS [
    detect_anomaly@1.0.0,
    analyze_metrics@^2.0.0,  -- Any 2.x
    notify_team@latest
]

Skill Permissions

Restrict which skills can be auto-executed:

SKILLS [
    check_health,           -- Can auto-execute
    restart_service[manual],  -- Requires manual approval
    delete_data[forbidden]    -- Cannot execute
]

Execution Modes

Autonomous

Agent acts without human intervention:

CREATE AGENT auto_scaler
EXECUTION autonomous
BEGIN
    -- Agent can execute any skill automatically
END AGENT;

Use for:

  • Low-risk operations
  • Time-critical responses
  • Well-tested scenarios

Human Approval

Requires approval before each action:

CREATE AGENT careful_responder
EXECUTION human_approval
BEGIN
    DECLARE plan = AGENT_PLAN(context);

    FOR action IN plan LOOP
        -- Waits for human approval
        AWAIT APPROVAL(action);
        CALL action.skill(action.params);
    END LOOP;
END AGENT;

Use for:

  • High-risk operations
  • Production changes
  • New or untested scenarios

Supervised

Human can observe and intervene at any point:

CREATE AGENT supervised_worker
EXECUTION supervised
BEGIN
    -- Human receives notifications but doesn't need to approve
    -- Can intervene at any time
END AGENT;

Use for:

  • Training new agents
  • Gradual trust building
  • Audit requirements

Dry Run

Simulates actions without executing:

CREATE AGENT test_agent
EXECUTION dry_run
BEGIN
    -- All actions are simulated and logged
    -- Nothing actually executes
END AGENT;

Use for:

  • Testing agent logic
  • Validating behavior
  • Demonstrating capabilities
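Once a dry run has validated the agent's logic, the agent can be promoted to a live mode with ALTER AGENT (the full syntax appears under Managing Agents below):

ALTER AGENT test_agent SET EXECUTION supervised;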

Triggers

Scheduled Triggers

Run on a schedule:

TRIGGERS [
    ON SCHEDULE '*/5 * * * *',     -- Every 5 minutes
    ON SCHEDULE '0 * * * *',       -- Every hour
    ON SCHEDULE '0 9 * * MON-FRI'  -- 9 AM weekdays
]

Event Triggers

Respond to events:

TRIGGERS [
    ON EVENT 'alert.fired',
    ON EVENT 'deployment.completed',
    ON EVENT 'ticket.created'
]

Alert Triggers

Respond to specific alerts:

TRIGGERS [
    ON ALERT 'high-cpu-usage',
    ON ALERT WHERE severity = 'critical',
    ON ALERT WHERE tags CONTAINS 'production'
]

Manual Triggers

Allow manual invocation:

TRIGGERS [
    ON MANUAL,  -- Can be invoked manually
    ON SCHEDULE '0 * * * *'  -- Also runs hourly
]

Combined Triggers

TRIGGERS [
    ON SCHEDULE '*/5 * * * *',
    ON ALERT WHERE severity IN ('high', 'critical'),
    ON EVENT 'deployment.*',
    ON MANUAL
]

LLM Configuration

The recommended way to configure an LLM for agents is through Elasticsearch's Inference API, which provides centralized management, security, and monitoring.

First, create an inference endpoint:

-- Create an inference endpoint using Elasticsearch
CALL INFERENCE_CREATE_ENDPOINT(
    'my-agent-llm',
    'chat_completion', 
    '{
        "service": "openai",
        "service_settings": {
            "api_key": "sk-...",
            "model_id": "gpt-4o-mini"
        }
    }'
);

Then reference it in your agent:

CREATE AGENT smart_responder
GOAL 'Analyze incidents and suggest remediation'
INSTRUCTIONS 'You are an SRE assistant focused on reliability. Always explain your reasoning.'
SKILLS [analyze_logs, check_metrics, restart_service]
INFERENCE_ENDPOINT 'my-agent-llm'
TEMPERATURE 0.3
MAX_TOKENS 2048
BEGIN
    -- Agent uses the inference endpoint for reasoning
END AGENT;

LLM Configuration Options

CREATE AGENT my_agent
GOAL 'Monitor production systems'
INSTRUCTIONS 'You are a vigilant production monitor. Be concise but thorough.'
SKILLS [check_health, get_metrics, send_alert]
INFERENCE_ENDPOINT 'my-openai-endpoint'
TEMPERATURE 0.3          -- Lower = more deterministic
MAX_TOKENS 2048          -- Maximum response length
MAX_ITERATIONS 5         -- Maximum OODA loop cycles
BEGIN
    ...
END AGENT;

Using Direct Model Reference

For simpler setups, you can reference a model directly:

CREATE AGENT my_agent
MODEL 'gpt-4'
BEGIN
    -- Requires OpenAI API key in environment
END AGENT;

INSTRUCTIONS (System Prompt)

The INSTRUCTIONS clause sets the system prompt for the agent's LLM:

CREATE AGENT incident_responder
GOAL 'Respond to production incidents'
INSTRUCTIONS '
You are an expert SRE with 10 years of experience.
Your priorities are:
1. Minimize user impact
2. Restore service quickly  
3. Prevent recurrence

Always explain your reasoning before taking action.
When uncertain, ask for clarification rather than guessing.
'
SKILLS [diagnose_issue, restart_service, scale_up, notify_team]
INFERENCE_ENDPOINT 'my-llm'
BEGIN
    -- Agent follows these instructions
END AGENT;

Configuration

Agent Configuration

CREATE AGENT my_agent
CONFIG {
    -- Timeouts
    "action_timeout": "5m",
    "total_timeout": "30m",

    -- Limits
    "max_actions": 10,
    "max_retries": 3,

    -- Behavior
    "continue_on_error": false,
    "log_level": "debug",

    -- Escalation
    "escalate_after": "15m",
    "escalation_channel": "#oncall"
}
BEGIN
    ...
END AGENT;

Environment-Specific Config

CREATE AGENT my_agent
CONFIG {
    "production": {
        "execution": "human_approval",
        "max_actions": 5
    },
    "staging": {
        "execution": "autonomous",
        "max_actions": 20
    }
}
BEGIN
    ...
END AGENT;

Agent Logic

Basic Logic

BEGIN
    -- Simple if/then
    IF condition THEN
        CALL skill();
    END IF;
END AGENT;

Context-Aware Logic

BEGIN
    -- Gather context
    DECLARE context = {
        "trigger": @trigger,
        "time": CURRENT_TIMESTAMP(),
        "day_of_week": EXTRACT_DAY_OF_WEEK(CURRENT_TIMESTAMP())
    };

    -- Different behavior on weekends
    IF context.day_of_week IN (0, 6) THEN
        SET context.execution = 'conservative';
    END IF;

    -- Make decision
    DECLARE action = AGENT_DECIDE(context);
    CALL action.skill(action.params);
END AGENT;

AI-Powered Logic

BEGIN
    -- Let AI decide
    DECLARE plan = AGENT_PLAN(
        goal => @goal,
        context => @context,
        available_skills => @skills
    );

    -- Execute plan
    FOR step IN plan.steps LOOP
        DECLARE result = CALL step.skill(step.params);

        -- Check if goal is met
        IF AGENT_GOAL_MET(result) THEN
            EXIT;
        END IF;
    END LOOP;
END AGENT;

Error Handling

BEGIN
    TRY
        CALL risky_operation();
    CATCH timeout_error THEN
        CALL notify_team('Operation timed out');
        CALL fallback_operation();
    CATCH OTHERS THEN
        CALL notify_team('Unexpected error: ' || ERROR_MESSAGE());
        RAISE;  -- Re-raise to trigger escalation
    END TRY;
END AGENT;

Complete Example

CREATE AGENT production_guardian
VERSION '1.0.0'
DESCRIPTION 'Monitors and protects production systems'
GOAL 'Maintain 99.9% availability and respond to incidents within SLA'

SKILLS [
    -- Monitoring
    check_service_health,
    analyze_metrics,
    detect_anomalies,

    -- Diagnosis
    correlate_events,
    identify_root_cause,
    check_recent_deployments,

    -- Remediation
    scale_service,
    restart_unhealthy_pods,
    rollback_deployment[approval_required],

    -- Communication
    send_alert,
    update_status_page,
    create_incident_ticket
]

EXECUTION supervised
MODEL 'gpt-4'

TRIGGERS [
    ON SCHEDULE '* * * * *',           -- Every minute
    ON ALERT WHERE severity = 'critical',
    ON EVENT 'deployment.completed',
    ON MANUAL
]

CONFIG {
    "services": ["api", "web", "worker"],
    "health_check_interval": "1m",
    "incident_threshold": 3,
    "auto_remediation_enabled": true,
    "escalation_timeout": "10m"
}

BEGIN
    -- 1. Check health of all services
    FOR service IN CONFIG.services LOOP
        DECLARE health = CALL check_service_health(service);

        IF health.status = 'degraded' THEN
            -- Auto-remediate common issues
            IF health.issue = 'high_memory' THEN
                CALL restart_unhealthy_pods(service);
            ELSIF health.issue = 'high_load' THEN
                CALL scale_service(service, replicas => health.current_replicas * 2);
            END IF;

            CALL send_alert(
                channel => '#ops',
                message => service || ' is ' || health.status
            );

        ELSIF health.status = 'down' THEN
            -- Critical issue - gather context and escalate
            DECLARE context = {
                "service": service,
                "health": health,
                "recent_deployments": CALL check_recent_deployments(service),
                "correlated_events": CALL correlate_events(service)
            };

            -- AI determines root cause
            DECLARE root_cause = CALL identify_root_cause(context);

            -- Suggest remediation
            DECLARE remediation = AGENT_SUGGEST(
                issue => root_cause,
                available_actions => ['restart', 'scale', 'rollback']
            );

            -- Execute with approval
            AWAIT APPROVAL(remediation);

            IF remediation.action = 'rollback' THEN
                CALL rollback_deployment(service);
            ELSIF remediation.action = 'scale' THEN
                CALL scale_service(service, replicas => remediation.replicas);
            ELSE
                CALL restart_unhealthy_pods(service);
            END IF;

            -- Update status page
            CALL update_status_page(
                component => service,
                status => 'investigating'
            );

            -- Create incident ticket
            CALL create_incident_ticket(
                title => 'Service degradation: ' || service,
                context => context,
                root_cause => root_cause
            );
        END IF;
    END LOOP;
END AGENT;

Chatting with Agents

Agents support interactive conversations using the CHAT AGENT command:

-- Simple chat
CHAT AGENT my_agent 'What is the current health status of production?';

-- Chat with context
CHAT AGENT my_agent 'Why did CPU spike at 3pm?' WITH {
    "service": "api-gateway",
    "timeframe": "last_hour"
};

Conversation State

Each chat creates a conversation that maintains history:

-- Start a conversation
CHAT AGENT incident_analyst 'We have a critical alert on api-gateway';

-- Response includes conversation_id for continuity
-- {
--   "agent": "incident_analyst",
--   "conversation_id": "abc123",
--   "response": "I'll analyze the api-gateway alert. Let me check the logs..."
-- }

-- Continue the conversation
CHAT AGENT incident_analyst 'What did you find?' WITH {
    "conversation_id": "abc123"
};

Agent Chat Response

The chat response includes:

Field Description
agent The agent name
conversation_id ID for continuing the conversation
message Your original message
response The agent's response
decision If skills are available, the agent's skill selection
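Based on these fields, a chat response might look like the following (the values and the shape of the decision object are illustrative):

-- {
--   "agent": "my_agent",
--   "conversation_id": "abc123",
--   "message": "What is the current health status of production?",
--   "response": "All monitored services report healthy. No anomalies detected.",
--   "decision": {
--     "skill": "check_health",
--     "params": {"service": "api"}
--   }
-- }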

Managing Agents

Show Agents

-- List all agents
SHOW AGENTS;

-- Get agent details
SHOW AGENT my_agent;

-- Show execution history
SHOW AGENT my_agent HISTORY;

-- Show specific execution
SHOW AGENT my_agent EXECUTION 'execution-id';

Modify Agents

-- Update instructions
ALTER AGENT my_agent SET INSTRUCTIONS 'New system prompt here';

-- Change execution mode
ALTER AGENT my_agent SET EXECUTION autonomous;

-- Update configuration
ALTER AGENT my_agent SET CONFIG {
    "max_actions": 20,
    "timeout": "10m"
};

Enable/Disable Agents

-- Disable an agent (won't respond to triggers)
DISABLE AGENT my_agent;

-- Enable an agent
ENABLE AGENT my_agent;

Trigger Agents Manually

-- Trigger an agent
TRIGGER AGENT my_agent;

-- Trigger with context
TRIGGER AGENT my_agent WITH {
    "alert_id": "alert-123",
    "severity": "critical"
};

Delete Agents

DROP AGENT my_agent;

What's Next?