Overview
Prior is a mobile calibration training app I built to help people get better at quantifying uncertainty. The premise is simple: you see a statement, assign a confidence level, and find out if you were right. Over time, the app measures whether your stated probabilities match reality—when you say 80%, does it happen 80% of the time?
The project is live on the App Store with an active leaderboard and a freemium model. Building it required solving several hard engineering problems: precision-sensitive scoring math, offline-first data sync, a dual-metric architecture that separates calibration from accuracy, and a schema migration system that runs across both local SQLite and cloud Supabase without downtime.
The Problem
Most people are poorly calibrated. They say “I’m 90% sure” when they’re right 60% of the time, or hedge to 50% when they actually have strong signal. This matters beyond trivia—miscalibrated confidence leads to bad decisions in medicine, investing, hiring, and policy.
Existing tools for improving calibration are either research instruments (dry, inaccessible) or prediction markets (which reward accuracy, not calibration). I wanted something that focused exclusively on the question: When you say X%, do you mean it?
Technical Approach
Dual-Metric Scoring Architecture
The most interesting technical decision was the scoring model. Early versions used the Brier score as the headline metric, but Brier conflates two things: calibration (are your probabilities honest?) and resolution (can you distinguish easy questions from hard ones?). For a calibration trainer, resolution is noise.
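For reference, here is a sketch of the standard Murphy decomposition of the Brier score, which makes the split explicit. The notation is mine, not the app's: predictions are grouped into K bins, bin k holds n_k of the N predictions, f_k is the stated confidence in that bin, o_k is the observed hit rate, and \bar{o} is the overall base rate. The identity is exact only when every prediction in a bin shares the same stated confidence; with wider bins it holds approximately.

```latex
\mathrm{BS}
  = \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(f_k - o_k)^2}_{\text{reliability (calibration)}}
  \;-\; \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(o_k - \bar{o})^2}_{\text{resolution}}
  \;+\; \underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}}
```

Only the first term measures whether stated probabilities are honest. The resolution term rewards telling questions apart, and the uncertainty term depends only on the question set, so a headline Brier score can move without calibration changing at all.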
I replaced the aggregate metric with Expected Calibration Error (ECE)—the weighted average absolute difference between stated confidence and observed hit rate, binned by confidence level. Per-prediction scoring still uses squared error (Brier) for instant feedback, since you can’t measure calibration from a single prediction.
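import Decimal from 'decimal.js'

// BinaryPrediction, NUM_BINS, and getBinIndex are defined elsewhere (not shown here).
// ECE: Σ (n_k / N) × |avg_confidence_k − hit_rate_k|
// Bins: [50,60), [60,70), [70,80), [80,90), [90,100]
// avgConf and hitRate must share a scale: if p.confidence is stored as a
// percentage (50–100), divide it by 100 before subtracting the hit rate.
export function calculateCalibrationError(predictions: BinaryPrediction[]): Decimal | null {
  if (predictions.length === 0) return null

  // One accumulator per confidence bin.
  const bins = new Array(NUM_BINS).fill(null).map(() => ({
    sumConfidence: new Decimal(0),
    sumCorrect: new Decimal(0),
    count: 0
  }))

  // Pass 1: sort each prediction into its bin and accumulate totals.
  for (const p of predictions) {
    const idx = getBinIndex(p.confidence)
    bins[idx].sumConfidence = bins[idx].sumConfidence.plus(p.confidence)
    bins[idx].sumCorrect = bins[idx].sumCorrect.plus(p.correct ? 1 : 0)
    bins[idx].count += 1
  }

  // Pass 2: weight each bin's |confidence − hit rate| gap by its share of predictions.
  const N = new Decimal(predictions.length)
  let ece = new Decimal(0)
  for (const bin of bins) {
    if (bin.count === 0) continue // empty bins contribute nothing
    const n = new Decimal(bin.count)
    const avgConf = bin.sumConfidence.div(n)
    const hitRate = bin.sumCorrect.div(n)
    ece = ece.plus(n.div(N).times(avgConf.minus(hitRate).abs()))
  }
  return ece
}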
ECE needs a minimum sample before it means anything, so I set a 30-prediction threshold: with 5 bins, that works out to roughly 6 predictions per bin if they are spread evenly, which is enough for a directional signal. Below that threshold, the UI shows a countdown (“12 TO GO”) instead of a noisy number.
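The two pieces around the aggregate metric, per-prediction squared error and the threshold gate, can be sketched as follows. This is a minimal illustration rather than the app's code: `brierScore`, `eceDisplay`, `EceDisplay`, and `MIN_PREDICTIONS_FOR_ECE` are hypothetical names, confidence is assumed to be a 0–1 fraction, and plain numbers are used instead of decimal.js to keep the snippet dependency-free.

```typescript
// Hypothetical names throughout; only the 30-prediction minimum comes from the text.
const MIN_PREDICTIONS_FOR_ECE = 30

// Per-prediction feedback: squared error between stated confidence (0–1)
// and the outcome (1 if correct, 0 if not). Lower is better.
function brierScore(confidence: number, correct: boolean): number {
  const outcome = correct ? 1 : 0
  return (confidence - outcome) ** 2
}

type EceDisplay =
  | { kind: 'countdown'; label: string }
  | { kind: 'score'; label: string }

// Below the threshold, show how many predictions remain instead of a noisy ECE.
function eceDisplay(predictionCount: number, ece: number): EceDisplay {
  const remaining = MIN_PREDICTIONS_FOR_ECE - predictionCount
  if (remaining > 0) {
    return { kind: 'countdown', label: `${remaining} TO GO` }
  }
  return { kind: 'score', label: ece.toFixed(2) }
}
```

With 18 predictions logged, `eceDisplay` returns the countdown variant; once the count reaches 30, it switches to the formatted score.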
Offline-First with Ring Buffer Sync
The app works entirely offline. All prediction data lives in a local SQLite database, structured around a 100-row ring buffer. When connectivity resumes, a sync service replays unsynced rows to Supabase, deduplicating via composite keys. The sync queue persists across app kills and survives account switches—orphaned jobs from previous users are cleaned up after 7 days.
The ring buffer design keeps local storage bounded while retaining enough history for statistically meaningful aggregate metrics. When the buffer is full, the oldest prediction is overwritten. The cloud stores everything.
┌─────────────┐     ┌───────────────┐     ┌──────────────┐
│  Engine UI  │────▶│  SQLite Ring  │────▶│   Supabase   │
│  (predict)  │     │  Buffer (100) │     │  (permanent) │
└─────────────┘     └───────────────┘     └──────────────┘
                            │
                     ┌──────┴──────┐
                     │ Stats Store │
                     │ (ECE, bins, │
                     │   trend)    │
                     └─────────────┘
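To make the ring buffer and replay behavior concrete, here is a minimal in-memory sketch. It is not the app's implementation: the real buffer lives in SQLite and syncs to Supabase, while this version uses an array and a `Map` as stand-ins. Every name (`PredictionRingBuffer`, `BufferedPrediction`, `replayUnsynced`) is hypothetical, and the composite key of user ID plus sequence number is an assumption about what such a key might contain.

```typescript
// In-memory stand-in for the SQLite ring buffer. All names are hypothetical.
interface BufferedPrediction {
  seq: number // monotonically increasing sequence number
  questionId: string
  confidence: number
  correct: boolean
  synced: boolean
}

class PredictionRingBuffer {
  private readonly capacity: number
  private slots: (BufferedPrediction | undefined)[] = []
  private nextSeq = 0

  constructor(capacity: number) {
    this.capacity = capacity
  }

  // Writes into slot (seq mod capacity), overwriting the oldest row once full.
  push(questionId: string, confidence: number, correct: boolean): BufferedPrediction {
    const row: BufferedPrediction = {
      seq: this.nextSeq, questionId, confidence, correct, synced: false,
    }
    this.slots[this.nextSeq % this.capacity] = row
    this.nextSeq += 1
    return row
  }

  // Surviving rows, oldest first.
  rows(): BufferedPrediction[] {
    return this.slots
      .filter((r): r is BufferedPrediction => r !== undefined)
      .sort((a, b) => a.seq - b.seq)
  }

  unsynced(): BufferedPrediction[] {
    return this.rows().filter((r) => !r.synced)
  }

  // Only marks the row if it has not been overwritten since.
  markSynced(seq: number): void {
    const row = this.slots[seq % this.capacity]
    if (row !== undefined && row.seq === seq) row.synced = true
  }
}

// Replays unsynced rows into a Map standing in for the cloud table.
// Keying by an assumed composite key (userId + seq) means a repeated
// replay overwrites the same entry instead of creating a duplicate.
function replayUnsynced(
  buffer: PredictionRingBuffer,
  userId: string,
  cloud: Map<string, BufferedPrediction>,
): number {
  let sent = 0
  for (const row of buffer.unsynced()) {
    cloud.set(`${userId}:${row.seq}`, { ...row, synced: true })
    buffer.markSynced(row.seq)
    sent += 1
  }
  return sent
}
```

One thing this sketch glosses over: if a row is overwritten before it syncs, it is lost. A real implementation needs either a separate persistent sync queue (as the app has) or a guard that refuses to overwrite unsynced rows.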
Precision Numerics
All scoring uses decimal.js for arbitrary-precision arithmetic. This wasn’t over-engineering for its own sake; it was a correctness requirement. IEEE 754 floating-point errors accumulate over hundreds of predictions, and users on the leaderboard were comparing scores to the hundredths place. Each individual error is tiny, but one that lands on a rounding boundary can change a displayed score, and a rounding artifact that shifts your rank is unacceptable.
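A small illustration of the underlying problem, using scaled integers as a dependency-free stand-in for decimal.js (so this is not how the app does it). `floatSum` and `scaledSum` are hypothetical helper names.

```typescript
// Naive accumulation in binary floating point.
function floatSum(values: number[]): number {
  return values.reduce((acc, v) => acc + v, 0)
}

// Stand-in for decimal arithmetic: convert each value to an integer number
// of 1/scale units, add the integers exactly, and divide once at the end.
function scaledSum(values: number[], scale: number = 10000): number {
  const total = values.reduce((acc, v) => acc + Math.round(v * scale), 0)
  return total / scale
}

// Ten confidence-like values of 0.1 each.
const tenths: number[] = []
for (let i = 0; i < 10; i++) tenths.push(0.1)

const naive = floatSum(tenths) // slightly below 1 because 0.1 has no exact binary form
const exact = scaledSum(tenths) // exactly 1
```

The drift in `naive` is on the order of 1e-16, far below anything a user sees directly. It matters because comparisons like `naive === 1` or `naive >= 1` fail, and the same effect at a rounding boundary can change which way a displayed digit rounds.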
State Architecture
The app uses Zustand for state management with four domain stores:
- Engine store: Prediction flow state machine (selecting → predicting → revealing → exhausted), current question, confidence, score tracking
- Stats store: All aggregate computation—ECE, calibration bins, trend, domain breakdown, improvement, personal best. Charts are pure display; the store owns all math
- Journal store: User-created predictions with manual resolution and Brier scoring
- Auth store: Supabase session, username, sign-in/sign-up state
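The engine store's prediction flow can be sketched as a typed transition table. The four phase names come from the list above; the specific transitions are my assumption here (in particular the loop from revealing back to selecting and the early exit to exhausted), and `EnginePhase`, `ALLOWED_TRANSITIONS`, and `transition` are hypothetical names.

```typescript
type EnginePhase = 'selecting' | 'predicting' | 'revealing' | 'exhausted'

// Assumed transition table: the forward chain matches the described flow,
// while the loop back to 'selecting' and the exits to 'exhausted' are guesses.
const ALLOWED_TRANSITIONS: Record<EnginePhase, readonly EnginePhase[]> = {
  selecting: ['predicting', 'exhausted'],
  predicting: ['revealing'],
  revealing: ['selecting', 'exhausted'],
  exhausted: [],
}

// Returns the new phase, or throws if the move is not in the table.
function transition(from: EnginePhase, to: EnginePhase): EnginePhase {
  if (ALLOWED_TRANSITIONS[from].indexOf(to) === -1) {
    throw new Error(`Invalid engine transition: ${from} -> ${to}`)
  }
  return to
}
```

Encoding the table as data keeps illegal moves, such as jumping from predicting straight back to selecting, from slipping in through an unguarded setter.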
Each store is a Zustand singleton, so there are no providers and no prop drilling. The engine store calls getState() inside async callbacks to avoid stale closures, a real bug that once caused incorrect reveal data when predictions resolved during rapid-fire sessions.
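The stale-closure problem is easy to reproduce in miniature. The sketch below uses a tiny hand-rolled store with Zustand-style `getState` and `setState` methods rather than Zustand itself, so it runs without dependencies. `createMiniStore`, `EngineState`, `revealStale`, and `revealFresh` are all hypothetical names.

```typescript
// Tiny stand-in for a store with getState/setState (not Zustand itself).
// Like Zustand, setState replaces the state object instead of mutating it.
function createMiniStore<T extends object>(initial: T) {
  let state: T = initial
  return {
    getState: (): T => state,
    setState: (partial: Partial<T>): void => {
      state = { ...state, ...partial }
    },
  }
}

interface EngineState {
  questionId: string
}

const engineStore = createMiniStore<EngineState>({ questionId: 'q1' })

// Buggy pattern: captures a snapshot before awaiting, so the value is stale
// if the store moves on to another question while the work is pending.
async function revealStale(work: Promise<void>): Promise<string> {
  const snapshot = engineStore.getState()
  await work
  return snapshot.questionId
}

// Fixed pattern: reads the store after awaiting, so it sees the current question.
async function revealFresh(work: Promise<void>): Promise<string> {
  await work
  return engineStore.getState().questionId
}
```

If the store advances from `q1` to `q2` while `work` is still pending, `revealStale` resolves with `q1` and `revealFresh` with `q2`, which is the shape of the rapid-fire bug described above.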
Results & Impact
- Live on the App Store with active users and a competitive leaderboard
- Dual-metric architecture isolates calibration from accuracy, a distinction missed by most forecasting tools
- Zero-downtime schema migration from Brier to ECE across both SQLite and Supabase, with backwards-compatible columns for transition
- 360° code review rated the codebase A−: “exceptionally well-crafted… disciplined design system, rigorous numerical precision, offline-first architecture with graceful sync”
- Freemium model via RevenueCat: 5 free daily questions, then a paywall with contextual messaging
What I Learned
The hardest part of building Prior wasn’t the code—it was the statistics. Choosing between Brier and ECE required reading decomposition proofs, understanding what “calibration” means formally versus intuitively, and designing a threshold system that balances statistical rigor with UX patience. The metric migration taught me that changing a core abstraction in a shipped product is an order of magnitude harder than getting it right the first time, because you’re migrating not just code but user expectations, database schemas, and leaderboard integrity simultaneously.
The offline-first architecture was the second hardest thing. It’s easy to build an app that works with connectivity. It’s manageable to build one that works offline. It’s genuinely difficult to build one that transitions between the two states without the user noticing—especially when you add account switching, sync queue persistence, and the possibility of concurrent modifications from multiple devices. Every edge case I found revealed two more.
Building a consumer app solo also taught me about the gap between “correct” and “shippable.” The codebase went through three major code reviews where legitimate architectural improvements were identified, triaged, and some intentionally deferred. Not every non-null assertion needs a guard clause on day one. The discipline is knowing which corners to cut and which to over-engineer—and precision scoring was firmly in the “over-engineer” category.