Confidence Scoring

Every agent response in Parallax includes a confidence score. These scores are fundamental to how Parallax builds reliable AI systems through multi-agent validation.

What is Confidence?

A confidence score is a number between 0.0 and 1.0 that represents how certain an agent is about its response.

agent.onTask(async (task) => {
  const result = await processTask(task.input);

  return {
    result,
    confidence: 0.85, // 85% confident
  };
});

Why Confidence Matters

Quality Gating

Low-confidence results can be filtered, retried, or escalated:

validation:
  minConfidence: 0.7
  onFailure: retry
  maxRetries: 3

Weighted Aggregation

Higher-confidence responses have more influence on final results:

aggregation:
  strategy: consensus
  conflictResolution: weighted # Uses confidence as weight

Reliability Metrics

Track system reliability over time by monitoring confidence distributions.

Confidence Scale

Score        | Meaning              | Typical Scenarios
0.95 - 1.00  | Near certain         | Clear input, exact match, verified data
0.80 - 0.95  | High confidence      | Standard cases, good context
0.65 - 0.80  | Moderate confidence  | Some ambiguity, reasonable inference
0.50 - 0.65  | Low confidence       | Significant uncertainty, limited context
0.30 - 0.50  | Very low             | Mostly guessing, unclear input
0.00 - 0.30  | Minimal              | Essentially random, invalid input
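
If you want these bands to show up in logs or dashboards, a small helper can label scores. The sketch below is illustrative; the function name and labels are not part of the Parallax API:

function confidenceBand(score: number): string {
  // Labels mirror the scale table above; purely illustrative.
  if (score >= 0.95) return 'near certain';
  if (score >= 0.8) return 'high';
  if (score >= 0.65) return 'moderate';
  if (score >= 0.5) return 'low';
  if (score >= 0.3) return 'very low';
  return 'minimal';
}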

Calculating Confidence

Model-Based Signals

Use signals from your underlying model:

agent.onTask(async (task) => {
  const response = await llm.generate({
    prompt: task.input.prompt,
    temperature: 0.3,
  });

  let confidence = 0.8; // Base confidence

  // Model finish reason
  if (response.finishReason === 'length') {
    confidence -= 0.15; // Truncated output
  }

  // Token probability (if available)
  if (response.avgLogProb !== undefined) {
    const probAdjustment = Math.min(0.2, response.avgLogProb + 1);
    confidence += probAdjustment;
  }

  // Response coherence
  if (response.text.length < 20) {
    confidence -= 0.1; // Very short response
  }

  return {
    result: response.text,
    confidence: Math.min(0.99, Math.max(0.1, confidence)), // Clamp to [0.1, 0.99]
  };
});

Input Quality Signals

Adjust confidence based on input quality:

function assessInputQuality(input: TaskInput): number {
  let quality = 1.0;

  // Input length
  if (input.text.length < 10) {
    quality -= 0.2; // Very short input
  }

  // Language detection confidence
  if (input.detectedLanguage && input.detectedLanguage.confidence < 0.9) {
    quality -= 0.1;
  }

  // Missing required context
  if (!input.context || input.context.length === 0) {
    quality -= 0.15;
  }

  return Math.max(0.3, quality);
}

agent.onTask(async (task) => {
  const inputQuality = assessInputQuality(task.input);
  const response = await process(task.input);

  return {
    result: response,
    confidence: response.modelConfidence * inputQuality,
  };
});

Task-Specific Confidence

Different task types have different confidence characteristics:

// Classification task
function classificationConfidence(probabilities: number[]): number {
  const sorted = [...probabilities].sort((a, b) => b - a); // Copy to avoid mutating the input
  const top = sorted[0];
  const margin = top - (sorted[1] || 0);

  // High margin = high confidence
  return 0.5 + (margin * 0.5);
}

// Extraction task
function extractionConfidence(
  extracted: string[],
  expected: number
): number {
  const found = extracted.length;

  if (found === 0) return 0.2;
  if (found === expected) return 0.9;
  if (found > expected) return 0.6; // Over-extraction
  return 0.5 + (found / expected) * 0.4; // Partial extraction
}

// Translation task
function translationConfidence(
  source: string,
  result: string,
  detectedLanguage: { confidence: number }
): number {
  let confidence = 0.8;

  // Length ratio check
  const ratio = result.length / source.length;
  if (ratio < 0.3 || ratio > 3) {
    confidence -= 0.2;
  }

  // Language detection confidence
  confidence *= detectedLanguage.confidence;

  return Math.max(0.3, confidence);
}

Confidence in Aggregation

Voting with Confidence

Confidence affects vote weight:

aggregation:
  strategy: voting
  method: weighted
  # Agents with higher confidence have more voting power

How weighted voting works:

Agent A: "positive" (confidence: 0.9) → weight: 0.9
Agent B: "negative" (confidence: 0.6) → weight: 0.6
Agent C: "positive" (confidence: 0.8) → weight: 0.8

Positive: 0.9 + 0.8 = 1.7
Negative: 0.6

Winner: "positive" (weighted confidence: 1.7 / 2.3 ≈ 0.74)
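
The same tally can be sketched in TypeScript. The Vote shape and weightedVote helper below are illustrative, not the Parallax aggregation API:

interface Vote {
  label: string;
  confidence: number;
}

// Sum confidence per label, pick the heaviest label, and normalize its
// weight by the total confidence across all votes.
function weightedVote(votes: Vote[]): { label: string; weight: number } {
  const totals = new Map<string, number>();
  let totalWeight = 0;

  for (const vote of votes) {
    totals.set(vote.label, (totals.get(vote.label) ?? 0) + vote.confidence);
    totalWeight += vote.confidence;
  }

  let winner = { label: '', weight: 0 };
  for (const [label, weight] of totals) {
    if (weight > winner.weight) {
      winner = { label, weight };
    }
  }

  return { label: winner.label, weight: winner.weight / totalWeight };
}

// For the example above, weightedVote(votes) returns
// { label: 'positive', weight: 1.7 / 2.3 ≈ 0.74 }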

Consensus Confidence

Consensus produces an aggregate confidence score:

aggregation:
  strategy: consensus
  threshold: 0.8

Consensus confidence calculation:

function consensusConfidence(
  results: AgentResult[],
  agreement: number
): number {
  // Base confidence from agreement level
  const agreementConfidence = agreement;

  // Weight by individual confidences
  const avgConfidence = results.reduce(
    (sum, r) => sum + r.confidence, 0
  ) / results.length;

  // Combined score
  return agreementConfidence * avgConfidence;
}

Confidence Thresholds

Set minimum confidence requirements:

validation:
  minConfidence: 0.7   # Reject results below this
  onFailure: retry     # What to do on rejection
  maxRetries: 3

aggregation:
  minConfidence: 0.8   # Minimum confidence for consensus
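
Conceptually, the validation gate behaves like the sketch below. gatedRun and its parameters are illustrative assumptions, not the Parallax runtime API:

// Re-run an agent until its confidence clears the gate or retries run out.
async function gatedRun(
  runAgent: () => Promise<{ result: string; confidence: number }>,
  minConfidence = 0.7,
  maxRetries = 3
): Promise<{ result: string; confidence: number }> {
  let last: { result: string; confidence: number } | undefined;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    last = await runAgent();
    if (last.confidence >= minConfidence) {
      return last; // Passed the gate
    }
  }

  // Retries exhausted: return the last low-confidence result so the caller
  // can escalate (other onFailure policies might reject or reroute instead).
  return last!;
}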

Confidence Calibration

What is Calibration?

A well-calibrated system's confidence scores match actual accuracy. If an agent reports 80% confidence, it should be correct ~80% of the time.

Measuring Calibration

interface CalibrationBucket {
  range: [number, number];
  predictions: number;
  correct: number;
  accuracy: number;
}

function measureCalibration(
  predictions: { confidence: number; correct: boolean }[]
): CalibrationBucket[] {
  const buckets: CalibrationBucket[] = [
    { range: [0.0, 0.2], predictions: 0, correct: 0, accuracy: 0 },
    { range: [0.2, 0.4], predictions: 0, correct: 0, accuracy: 0 },
    { range: [0.4, 0.6], predictions: 0, correct: 0, accuracy: 0 },
    { range: [0.6, 0.8], predictions: 0, correct: 0, accuracy: 0 },
    { range: [0.8, 1.0], predictions: 0, correct: 0, accuracy: 0 },
  ];

  for (const pred of predictions) {
    const bucket = buckets.find(
      b =>
        pred.confidence >= b.range[0] &&
        (pred.confidence < b.range[1] || b.range[1] === 1.0) // Count 1.0 in the top bucket
    );
    if (bucket) {
      bucket.predictions++;
      if (pred.correct) bucket.correct++;
    }
  }

  for (const bucket of buckets) {
    bucket.accuracy = bucket.predictions > 0
      ? bucket.correct / bucket.predictions
      : 0;
  }

  return buckets;
}

Improving Calibration

Temperature scaling:

function calibratedConfidence(
  rawConfidence: number,
  temperature: number = 1.5
): number {
  // Apply temperature scaling
  const logit = Math.log(rawConfidence / (1 - rawConfidence));
  const scaledLogit = logit / temperature;
  return 1 / (1 + Math.exp(-scaledLogit));
}
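
A temperature above 1 pulls every score toward 0.5, which compensates for an agent that is systematically over-confident. For example:

calibratedConfidence(0.9); // ≈ 0.81 with the default temperature of 1.5
calibratedConfidence(0.6); // ≈ 0.57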

Historical adjustment:

class ConfidenceCalibrator {
  private history: { predicted: number; actual: boolean }[] = [];

  record(confidence: number, wasCorrect: boolean) {
    this.history.push({ predicted: confidence, actual: wasCorrect });
    if (this.history.length > 1000) {
      this.history.shift();
    }
  }

  calibrate(confidence: number): number {
    // Find similar historical predictions
    const similar = this.history.filter(
      h => Math.abs(h.predicted - confidence) < 0.1
    );

    if (similar.length < 10) {
      return confidence; // Not enough data
    }

    // Actual accuracy for this confidence level
    const actualAccuracy = similar.filter(h => h.actual).length / similar.length;

    // Blend predicted and historical
    return confidence * 0.3 + actualAccuracy * 0.7;
  }
}
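
In practice the calibrator sits between the agent and its response, and is fed outcomes once ground truth is available. The wiring below is a sketch; it assumes processTask returns a raw confidence alongside its result:

const calibrator = new ConfidenceCalibrator();

agent.onTask(async (task) => {
  const { result, confidence } = await processTask(task.input);
  return {
    result,
    confidence: calibrator.calibrate(confidence), // Adjust before reporting
  };
});

// Later, when the outcome is known (e.g. from review or downstream checks):
calibrator.record(0.82, /* wasCorrect */ true);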

Best Practices

Do

  • Be conservative - It's better to underestimate confidence than to overestimate it
  • Use multiple signals - Combine model, input, and task-specific signals
  • Calibrate over time - Track and improve calibration with real data
  • Differentiate tasks - Different task types need different confidence logic

Don't

  • Return 1.0 - Always leave room for uncertainty
  • Return constant values - Confidence should vary with actual certainty
  • Ignore input quality - Bad input → lower confidence
  • Skip validation - Test that your confidence logic makes sense

Example: Well-Designed Confidence

agent.onTask(async (task) => {
  // Start with moderate confidence
  let confidence = 0.75;
  const signals: string[] = [];

  // Assess input quality
  const inputLength = task.input.text.length;
  if (inputLength < 20) {
    confidence -= 0.15;
    signals.push('short-input');
  } else if (inputLength > 1000) {
    confidence += 0.05;
    signals.push('detailed-input');
  }

  // Process with model
  const response = await model.generate(task.input);

  // Model signals
  if (response.finishReason === 'stop') {
    confidence += 0.05;
    signals.push('complete-response');
  }
  if (response.finishReason === 'length') {
    confidence -= 0.2;
    signals.push('truncated');
  }

  // Task-specific validation
  const validation = validateOutput(response.result, task.input.expectedFormat);
  if (validation.valid) {
    confidence += 0.1;
    signals.push('valid-format');
  } else {
    confidence -= 0.2;
    signals.push('invalid-format');
  }

  // Clamp to valid range
  confidence = Math.max(0.1, Math.min(0.95, confidence));

  return {
    result: response.result,
    confidence,
    metadata: {
      signals,
      rawConfidence: response.confidence,
    },
  };
});

Monitoring Confidence

Metrics to Track

  • Confidence distribution - Are scores well-distributed or clustered?
  • Confidence vs accuracy - Are high-confidence responses more accurate?
  • Confidence trends - Is confidence stable over time?
  • Low-confidence rate - How often do responses fall below threshold?
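
These metrics can be computed from a rolling window of recent responses. The sketch below assumes a Scored record type and a 0.7 threshold, neither of which is prescribed by Parallax:

interface Scored {
  confidence: number;
  correct?: boolean; // Filled in once ground truth is known
}

// Summarize a window of responses: average confidence, low-confidence rate,
// and accuracy above vs. below the threshold (a rough calibration check).
function confidenceMetrics(window: Scored[], threshold = 0.7) {
  const n = window.length || 1;
  const avg = window.reduce((sum, r) => sum + r.confidence, 0) / n;
  const lowRate = window.filter(r => r.confidence < threshold).length / n;

  const labeled = window.filter(r => r.correct !== undefined);
  const accuracy = (rs: Scored[]) =>
    rs.length ? rs.filter(r => r.correct).length / rs.length : 0;

  return {
    avg,
    lowRate,
    highConfidenceAccuracy: accuracy(labeled.filter(r => r.confidence >= threshold)),
    lowConfidenceAccuracy: accuracy(labeled.filter(r => r.confidence < threshold)),
  };
}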

Alerting

Set up alerts for confidence anomalies:

# Example monitoring config
alerts:
  - name: low-confidence-spike
    condition: avg(confidence) < 0.6 over 5m
    severity: warning

  - name: confidence-collapse
    condition: p95(confidence) < 0.5 over 1m
    severity: critical

Next Steps

  • Consensus - How confidence affects consensus building
  • Agents - Implementing confidence in agents
  • Quality Gates - Using confidence for validation