Outbound Sensitive Data Detection

Status: Research / Future Roadmap
Priority: Medium
Depends on: Channel interception layer (optional)

Problem Statement

When agents send outbound messages (responses to users), they may inadvertently include sensitive data such as passwords, secrets, tokens, keys, or other credentials.

The current implementation does not scan outbound content: outbound scanning was removed from the adversary-detector crate to simplify the initial channel integration. This document captures research directions for re-implementing outbound content filtering.

Detection Approaches

1. High Entropy Detection
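
As a rough illustration of this approach, the sketch below computes Shannon entropy in bits per character and flags whitespace-delimited tokens that meet both thresholds. The function names shannon_entropy and high_entropy_tokens are hypothetical and not taken from the adversary-detector crate. The thresholds are passed in so they can come from the min_entropy and min_length settings, and splitting on whitespace is an assumption about tokenization.

```rust
use std::collections::HashMap;

/// Shannon entropy of `s`, in bits per character (0.0 for an empty string).
fn shannon_entropy(s: &str) -> f64 {
    let mut counts: HashMap<char, usize> = HashMap::new();
    let mut total = 0usize;
    for c in s.chars() {
        *counts.entry(c).or_insert(0) += 1;
        total += 1;
    }
    if total == 0 {
        return 0.0;
    }
    counts
        .values()
        .map(|&n| {
            let p = n as f64 / total as f64;
            -p * p.log2()
        })
        .sum()
}

/// Returns the whitespace-delimited tokens that are at least `min_length`
/// characters long and whose entropy is at least `min_entropy`.
/// (Hypothetical helper; whitespace tokenization is an assumption.)
fn high_entropy_tokens(text: &str, min_entropy: f64, min_length: usize) -> Vec<&str> {
    text.split_whitespace()
        .filter(|t| t.chars().count() >= min_length && shannon_entropy(t) >= min_entropy)
        .collect()
}
```

One property worth keeping in mind when tuning: a token of n characters has at most log2(n) bits of entropy per character under this definition. If min_entropy is meant as bits per character, a 16-character token tops out at 4.0 and can never reach 4.5, so a token needs at least 23 distinct characters to be flagged.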

2. Regex Pattern Matching

3. Context Keyword Matches

4. Machine Learning Classifiers

5. Dictionary/Allowlist Approach

Implementation Design

Configuration

security:
  outbound_scanning:
    enabled: true
    mode: "flag"  # "block", "flag", "log_only"
    detectors:
      high_entropy:
        enabled: true
        min_entropy: 4.5
        min_length: 16
      patterns:
        enabled: true
        patterns_file: "secrets-patterns.json"
      context_keywords:
        enabled: true
        keywords: ["password", "secret", "token", "key", "credential"]
    redaction:
      enabled: true
      mask: "***REDACTED***"
    alerts:
      on_detection: true
      channel: "signal"
      to: "+1XXXXXXXXXX"
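
To make the configuration above more concrete, here is a minimal sketch of how the context_keywords detector, the redaction mask, and the mode setting might interact. Everything in it is an assumption rather than existing behavior: the names Mode, Outcome, keyword_findings, redact, and apply_policy are hypothetical, and the document does not define what each mode does. The sketch assumes that block withholds the message, flag delivers it (redacted when redaction is enabled), and log_only delivers it unchanged. Alerting and logging are left out.

```rust
/// Modes named in the configuration: "block", "flag", "log_only".
#[derive(Debug, Clone, Copy, PartialEq)]
enum Mode {
    Block,
    Flag,
    LogOnly,
}

/// What a (hypothetical) channel layer would do with the message.
#[derive(Debug, PartialEq)]
enum Outcome {
    Deliver(String),
    Withhold,
}

/// Finds values attached to a context keyword, e.g. `token=abc123` or
/// `password: hunter2`. Keywords are matched case-insensitively (ASCII).
/// A keyword with no `:` or `=` separator is not treated as a match.
fn keyword_findings<'a>(text: &'a str, keywords: &[&str]) -> Vec<&'a str> {
    let tokens: Vec<&'a str> = text.split_whitespace().collect();
    let mut findings = Vec::new();
    for (i, &tok) in tokens.iter().enumerate() {
        // Split `key=value` / `key:value`; a bare `key:` leaves an empty value.
        let (key, inline_value) = match tok.find(|c: char| c == ':' || c == '=') {
            Some(pos) => (&tok[..pos], &tok[pos + 1..]),
            None => (tok, ""),
        };
        if !keywords.iter().any(|kw| key.eq_ignore_ascii_case(kw)) {
            continue;
        }
        if !inline_value.is_empty() {
            findings.push(inline_value);
        } else if tok.len() > key.len() {
            // The token ended with a separator, so take the next token as the value.
            if let Some(&next) = tokens.get(i + 1) {
                findings.push(next);
            }
        }
    }
    findings
}

/// Replaces every occurrence of each finding with `mask`.
fn redact(message: &str, findings: &[&str], mask: &str) -> String {
    let mut out = message.to_string();
    for f in findings.iter().filter(|f| !f.is_empty()) {
        out = out.replace(*f, mask);
    }
    out
}

/// Applies the assumed mode semantics described in the lead-in.
fn apply_policy(
    message: &str,
    findings: &[&str],
    mode: Mode,
    redaction_enabled: bool,
    mask: &str,
) -> Outcome {
    if findings.is_empty() {
        return Outcome::Deliver(message.to_string());
    }
    match mode {
        Mode::Block => Outcome::Withhold,
        Mode::Flag if redaction_enabled => Outcome::Deliver(redact(message, findings, mask)),
        Mode::Flag | Mode::LogOnly => Outcome::Deliver(message.to_string()),
    }
}
```

This keyword check is deliberately naive: it misses phrasings such as "the password is hunter2" and keeps trailing punctuation as part of the value. Because redact replaces every occurrence of a finding, a short value that also appears elsewhere in the message would be masked there too.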

Integration Points

  1. Channel Layer (Calciforge)
  2. Tool Result Layer (OpenClaw)
  3. Policy Integration (clash)

Open Questions

  1. Performance: Can we scan without adding more than 100 ms of latency to responses?
  2. Context Awareness: Should trusted identities (owner) bypass scanning?
  3. Redaction vs Blocking: Redact and send, or block and alert?
  4. Learning: Should the system learn from false positive reports?
  5. Scope: Just agent responses, or also tool call arguments?

Next Steps

  1. Research Phase:
  2. Prototype Phase:
  3. Integration Phase:

Risks

References