Confidence Scoring

Every agent response in Parallax includes a confidence score. These scores are fundamental to how Parallax builds reliable AI systems through multi-agent validation.

What is Confidence?

A confidence score is a number between 0.0 and 1.0 that represents how certain an agent is about its response.

agent.onTask(async (task) => {
  const result = await processTask(task.input);

  return {
    result,
    confidence: 0.85, // 85% confident
  };
});

Why Confidence Matters

Quality Gating

Low-confidence results can be filtered, retried, or escalated:

validation:
  minConfidence: 0.7
  onFailure: retry
  maxRetries: 3

Weighted Aggregation

Higher-confidence responses have more influence on final results:

aggregation:
  strategy: consensus
  conflictResolution: weighted # Uses confidence as weight

Reliability Metrics

Track system reliability over time by monitoring confidence distributions.

Confidence Scale

Score        | Meaning              | Typical Scenarios
0.95 - 1.00  | Near certain         | Clear input, exact match, verified data
0.80 - 0.95  | High confidence      | Standard cases, good context
0.65 - 0.80  | Moderate confidence  | Some ambiguity, reasonable inference
0.50 - 0.65  | Low confidence       | Significant uncertainty, limited context
0.30 - 0.50  | Very low             | Mostly guessing, unclear input
0.00 - 0.30  | Minimal              | Essentially random, invalid input
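
If you want these bands to show up in logs or dashboards, a small helper can label scores. The sketch below is illustrative; the function name and labels are not part of the Parallax API:

function confidenceBand(score: number): string {
  // Labels mirror the scale table above; purely illustrative.
  if (score >= 0.95) return 'near certain';
  if (score >= 0.8) return 'high';
  if (score >= 0.65) return 'moderate';
  if (score >= 0.5) return 'low';
  if (score >= 0.3) return 'very low';
  return 'minimal';
}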

Calculating Confidence

Model-Based Signals

Use signals from your underlying model:

agent.onTask(async (task) => {
  const response = await llm.generate({
    prompt: task.input.prompt,
    temperature: 0.3,
  });

  let confidence = 0.8; // Base confidence

  // Model finish reason
  if (response.finishReason === 'length') {
    confidence -= 0.15; // Truncated output
  }

  // Token probability (if available)
  if (response.avgLogProb !== undefined) {
    const probAdjustment = Math.min(0.2, response.avgLogProb + 1);
    confidence += probAdjustment;
  }

  // Response coherence
  if (response.text.length < 20) {
    confidence -= 0.1; // Very short response
  }

  return {
    result: response.text,
    confidence: Math.min(0.99, Math.max(0.1, confidence)), // Clamp to [0.1, 0.99]
  };
});

Input Quality Signals

Adjust confidence based on input quality:

function assessInputQuality(input: TaskInput): number {
  let quality = 1.0;

  // Input length
  if (input.text.length < 10) {
    quality -= 0.2; // Very short input
  }

  // Language detection confidence
  if (input.detectedLanguage && input.detectedLanguage.confidence < 0.9) {
    quality -= 0.1;
  }

  // Missing required context
  if (!input.context || input.context.length === 0) {
    quality -= 0.15;
  }

  return Math.max(0.3, quality);
}

agent.onTask(async (task) => {
  const inputQuality = assessInputQuality(task.input);
  const response = await process(task.input);

  return {
    result: response,
    confidence: response.modelConfidence * inputQuality,
  };
});

Task-Specific Confidence

Different task types have different confidence characteristics:

// Classification task
function classificationConfidence(probabilities: number[]): number {
  const sorted = [...probabilities].sort((a, b) => b - a); // Copy to avoid mutating the input
  const top = sorted[0];
  const margin = top - (sorted[1] || 0);

  // High margin = high confidence
  return 0.5 + (margin * 0.5);
}

// Extraction task
function extractionConfidence(
  extracted: string[],
  expected: number
): number {
  const found = extracted.length;

  if (found === 0) return 0.2;
  if (found === expected) return 0.9;
  if (found > expected) return 0.6; // Over-extraction
  return 0.5 + (found / expected) * 0.4; // Partial extraction
}

// Translation task
function translationConfidence(
  source: string,
  result: string,
  detectedLanguage: { confidence: number }
): number {
  let confidence = 0.8;

  // Length ratio check
  const ratio = result.length / source.length;
  if (ratio < 0.3 || ratio > 3) {
    confidence -= 0.2;
  }

  // Language detection confidence
  confidence *= detectedLanguage.confidence;

  return Math.max(0.3, confidence);
}

Confidence in Aggregation

Voting with Confidence

Confidence affects vote weight:

aggregation:
  strategy: voting
  method: weighted
  # Agents with higher confidence have more voting power

How weighted voting works:

Agent A: "positive" (confidence: 0.9) → weight: 0.9
Agent B: "negative" (confidence: 0.6) → weight: 0.6
Agent C: "positive" (confidence: 0.8) → weight: 0.8

Positive: 0.9 + 0.8 = 1.7
Negative: 0.6

Winner: "positive" (weighted confidence: 1.7 / 2.3 ≈ 0.74)
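
The same tally can be sketched in TypeScript. The Vote shape and weightedVote helper below are illustrative, not the Parallax aggregation API:

interface Vote {
  label: string;
  confidence: number;
}

// Sum confidence per label, pick the heaviest label, and normalize its
// weight by the total confidence across all votes.
function weightedVote(votes: Vote[]): { label: string; weight: number } {
  const totals = new Map<string, number>();
  let totalWeight = 0;

  for (const vote of votes) {
    totals.set(vote.label, (totals.get(vote.label) ?? 0) + vote.confidence);
    totalWeight += vote.confidence;
  }

  let winner = { label: '', weight: 0 };
  for (const [label, weight] of totals) {
    if (weight > winner.weight) {
      winner = { label, weight };
    }
  }

  return { label: winner.label, weight: winner.weight / totalWeight };
}

// For the example above, weightedVote(votes) returns
// { label: 'positive', weight: 1.7 / 2.3 ≈ 0.74 }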

Consensus Confidence

Consensus produces an aggregate confidence score:

aggregation:
  strategy: consensus
  threshold: 0.8

Consensus confidence calculation:

function consensusConfidence(
  results: AgentResult[],
  agreement: number
): number {
  // Base confidence from agreement level
  const agreementConfidence = agreement;

  // Weight by individual confidences
  const avgConfidence = results.reduce(
    (sum, r) => sum + r.confidence, 0
  ) / results.length;

  // Combined score
  return agreementConfidence * avgConfidence;
}

Confidence Thresholds

Set minimum confidence requirements:

validation:
  minConfidence: 0.7   # Reject results below this
  onFailure: retry     # What to do on rejection
  maxRetries: 3

aggregation:
  minConfidence: 0.8   # Minimum confidence for consensus
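
Conceptually, the validation gate behaves like the sketch below. gatedRun and its parameters are illustrative assumptions, not the Parallax runtime API:

// Re-run an agent until its confidence clears the gate or retries run out.
async function gatedRun(
  runAgent: () => Promise<{ result: string; confidence: number }>,
  minConfidence = 0.7,
  maxRetries = 3
): Promise<{ result: string; confidence: number }> {
  let last: { result: string; confidence: number } | undefined;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    last = await runAgent();
    if (last.confidence >= minConfidence) {
      return last; // Passed the gate
    }
  }

  // Retries exhausted: return the last low-confidence result so the caller
  // can escalate (other onFailure policies might reject or reroute instead).
  return last!;
}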

Confidence Calibration

What is Calibration?

A well-calibrated system's confidence scores match actual accuracy. If an agent reports 80% confidence, it should be correct ~80% of the time.

Measuring Calibration

interface CalibrationBucket {
  range: [number, number];
  predictions: number;
  correct: number;
  accuracy: number;
}

function measureCalibration(
  predictions: { confidence: number; correct: boolean }[]
): CalibrationBucket[] {
  const buckets: CalibrationBucket[] = [
    { range: [0.0, 0.2], predictions: 0, correct: 0, accuracy: 0 },
    { range: [0.2, 0.4], predictions: 0, correct: 0, accuracy: 0 },
    { range: [0.4, 0.6], predictions: 0, correct: 0, accuracy: 0 },
    { range: [0.6, 0.8], predictions: 0, correct: 0, accuracy: 0 },
    { range: [0.8, 1.0], predictions: 0, correct: 0, accuracy: 0 },
  ];

  for (const pred of predictions) {
    const bucket = buckets.find(
      b =>
        pred.confidence >= b.range[0] &&
        (pred.confidence < b.range[1] || b.range[1] === 1.0) // Count 1.0 in the top bucket
    );
    if (bucket) {
      bucket.predictions++;
      if (pred.correct) bucket.correct++;
    }
  }

  for (const bucket of buckets) {
    bucket.accuracy = bucket.predictions > 0
      ? bucket.correct / bucket.predictions
      : 0;
  }

  return buckets;
}

Improving Calibration

Temperature scaling:

function calibratedConfidence(
  rawConfidence: number,
  temperature: number = 1.5
): number {
  // Apply temperature scaling
  const logit = Math.log(rawConfidence / (1 - rawConfidence));
  const scaledLogit = logit / temperature;
  return 1 / (1 + Math.exp(-scaledLogit));
}
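
A temperature above 1 pulls every score toward 0.5, which compensates for an agent that is systematically over-confident. For example:

calibratedConfidence(0.9); // ≈ 0.81 with the default temperature of 1.5
calibratedConfidence(0.6); // ≈ 0.57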

Historical adjustment:

class ConfidenceCalibrator {
  private history: { predicted: number; actual: boolean }[] = [];

  record(confidence: number, wasCorrect: boolean) {
    this.history.push({ predicted: confidence, actual: wasCorrect });
    if (this.history.length > 1000) {
      this.history.shift();
    }
  }

  calibrate(confidence: number): number {
    // Find similar historical predictions
    const similar = this.history.filter(
      h => Math.abs(h.predicted - confidence) < 0.1
    );

    if (similar.length < 10) {
      return confidence; // Not enough data
    }

    // Actual accuracy for this confidence level
    const actualAccuracy = similar.filter(h => h.actual).length / similar.length;

    // Blend predicted and historical
    return confidence * 0.3 + actualAccuracy * 0.7;
  }
}
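
In practice the calibrator sits between the agent and its response, and is fed outcomes once ground truth is available. The wiring below is a sketch; it assumes processTask returns a raw confidence alongside its result:

const calibrator = new ConfidenceCalibrator();

agent.onTask(async (task) => {
  const { result, confidence } = await processTask(task.input);
  return {
    result,
    confidence: calibrator.calibrate(confidence), // Adjust before reporting
  };
});

// Later, when the outcome is known (e.g. from review or downstream checks):
calibrator.record(0.82, /* wasCorrect */ true);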

Best Practices

Do

  • Be conservative - It's better to underestimate confidence than to overestimate it
  • Use multiple signals - Combine model, input, and task-specific signals
  • Calibrate over time - Track and improve calibration with real data
  • Differentiate tasks - Different task types need different confidence logic

Don't

  • Return 1.0 - Always leave room for uncertainty
  • Return constant values - Confidence should vary with actual certainty
  • Ignore input quality - Bad input → lower confidence
  • Skip validation - Test that your confidence logic makes sense

Example: Well-Designed Confidence

agent.onTask(async (task) => {
  // Start with moderate confidence
  let confidence = 0.75;
  const signals: string[] = [];

  // Assess input quality
  const inputLength = task.input.text.length;
  if (inputLength < 20) {
    confidence -= 0.15;
    signals.push('short-input');
  } else if (inputLength > 1000) {
    confidence += 0.05;
    signals.push('detailed-input');
  }

  // Process with model
  const response = await model.generate(task.input);

  // Model signals
  if (response.finishReason === 'stop') {
    confidence += 0.05;
    signals.push('complete-response');
  }
  if (response.finishReason === 'length') {
    confidence -= 0.2;
    signals.push('truncated');
  }

  // Task-specific validation
  const validation = validateOutput(response.result, task.input.expectedFormat);
  if (validation.valid) {
    confidence += 0.1;
    signals.push('valid-format');
  } else {
    confidence -= 0.2;
    signals.push('invalid-format');
  }

  // Clamp to valid range
  confidence = Math.max(0.1, Math.min(0.95, confidence));

  return {
    result: response.result,
    confidence,
    metadata: {
      signals,
      rawConfidence: response.confidence,
    },
  };
});

Monitoring Confidence

Metrics to Track

  • Confidence distribution - Are scores well-distributed or clustered?
  • Confidence vs accuracy - Are high-confidence responses more accurate?
  • Confidence trends - Is confidence stable over time?
  • Low-confidence rate - How often do responses fall below threshold?
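
These metrics can be computed from a rolling window of recent responses. The sketch below assumes a Scored record type and a 0.7 threshold, neither of which is prescribed by Parallax:

interface Scored {
  confidence: number;
  correct?: boolean; // Filled in once ground truth is known
}

// Summarize a window of responses: average confidence, low-confidence rate,
// and accuracy above vs. below the threshold (a rough calibration check).
function confidenceMetrics(window: Scored[], threshold = 0.7) {
  const n = window.length || 1;
  const avg = window.reduce((sum, r) => sum + r.confidence, 0) / n;
  const lowRate = window.filter(r => r.confidence < threshold).length / n;

  const labeled = window.filter(r => r.correct !== undefined);
  const accuracy = (rs: Scored[]) =>
    rs.length ? rs.filter(r => r.correct).length / rs.length : 0;

  return {
    avg,
    lowRate,
    highConfidenceAccuracy: accuracy(labeled.filter(r => r.confidence >= threshold)),
    lowConfidenceAccuracy: accuracy(labeled.filter(r => r.confidence < threshold)),
  };
}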

Alerting

Set up alerts for confidence anomalies:

# Example monitoring config
alerts:
  - name: low-confidence-spike
    condition: avg(confidence) < 0.6 over 5m
    severity: warning

  - name: confidence-collapse
    condition: p95(confidence) < 0.5 over 1m
    severity: critical

Next Steps

  • Consensus - How confidence affects consensus building
  • Agents - Implementing confidence in agents
  • Quality Gates - Using confidence for validation