Offline-First is a Lie We Tell Ourselves

The Utopian Vision

There is a movement in modern software architecture known as “local-first” or “offline-first.” The premise is incredibly seductive: your application should work perfectly without an internet connection. Reads should be instant because they come from a local database. Writes should be synchronous and immediate. The network is relegated to a background process, a humble servant quietly syncing state whenever the user happens to have a signal. You read the blog posts, you watch the conference talks, and you think, “Yes. This is how all software should be built. This is the future.”

It is a beautiful vision. It is also, in practice, a nightmare that has cost me more sleep than any other architectural decision I’ve ever made.

The Prior Predicament

While building Prior, I decided early on that the core feature—practice sessions for predictive modeling—absolutely needed to work offline. The use case was obvious. You’re on the subway. You’re on a flight. You’re in that weird dead zone in the stairwell of your apartment building where your phone pretends it has signal but is actually lying to you. You should be able to open the app, run through a set of predictions, and see your stats update. When you eventually reconnect to civilization, everything should seamlessly merge with the server as if nothing happened.

How hard could it be?

Narrator: It was very hard.

I have come to believe that “How hard could it be?” is the most dangerous sentence in software engineering. It is the precursor to every weekend-consuming architectural rabbit hole I’ve ever fallen into. It is the siren song of the naive optimist, and I sing it to myself approximately once a month.

The Naive Approach

My initial implementation was embarrassingly simple, the kind of thing you’d sketch on a whiteboard in five minutes and feel proud of. Fetch the data from the server and dump it into local storage. When the user is offline, read from the local cache. When the user takes an action, save the mutation locally and fire off an API request. If the request fails because they’re in a tunnel, queue it and retry later. Done. Ship it. Go home.
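For the record, the whiteboard version looks roughly like the sketch below. It is an illustration, not Prior's actual code: `NaiveOfflineStore`, `Mutation`, and `Send` are hypothetical names, and the network call is modeled as a synchronous function that returns false on failure so the example stays small. A real transport would be async.

```typescript
// Hypothetical shapes for illustration; none of these names come from Prior.
type Mutation = { id: number; key: string; value: number };

// Stand-in for an API call. A real transport would be async; this one
// returns false to simulate a request that failed (tunnel, airplane mode).
type Send = (mutation: Mutation) => boolean;

class NaiveOfflineStore {
  private cache = new Map<string, number>(); // local copy of server data
  private pending: Mutation[] = [];          // mutations the server has not seen yet
  private nextId = 1;
  private send: Send;

  constructor(send: Send) {
    this.send = send;
  }

  // Fetch-and-dump: overwrite the local cache with whatever the server returned.
  hydrate(serverData: Record<string, number>): void {
    for (const [key, value] of Object.entries(serverData)) {
      this.cache.set(key, value);
    }
  }

  // Reads always come from the local cache, online or not.
  read(key: string): number | undefined {
    return this.cache.get(key);
  }

  // Save locally first, then try the network; queue the mutation on failure.
  write(key: string, value: number): void {
    const mutation: Mutation = { id: this.nextId++, key, value };
    this.cache.set(key, value);
    if (!this.send(mutation)) {
      this.pending.push(mutation);
    }
  }

  // Retry queued mutations in order; stop at the first failure to keep ordering.
  flush(): number {
    let sent = 0;
    while (this.pending.length > 0 && this.send(this.pending[0])) {
      this.pending.shift();
      sent++;
    }
    return sent;
  }

  pendingCount(): number {
    return this.pending.length;
  }
}
```

Notice what is missing: nothing here asks whether the server's data changed while the mutation sat in the queue. The client just replays its writes and hopes.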

This works perfectly in a vacuum where state never diverges and time is a single, linear, cooperative thread. But the real world is not a vacuum. The real world is a chaotic, distributed system where your user’s phone is one node, your server is another node, and between them lies the entire unpredictable hellscape of cellular networks, airplane mode toggles, and that one coffee shop WiFi that requires you to click “I agree” on a captive portal before any packets actually route.

Imagine this scenario: User A opens the app on their phone, enters a subway tunnel, and completes a full practice session. They are making predictions, logging confidence levels, racking up a score. Meanwhile, on the server, the global predictive model has updated based on thousands of other users’ inputs, shifting the baseline metrics that scores are calculated against. User A emerges from the tunnel, the app reconnects, and attempts to sync the session.

Do we accept User A’s stats based on the old model? Do we retroactively recalculate their score based on the new model? If we recalculate it, their score might drop, leading to a terrible user experience. (“I had 90% accuracy in the tunnel, and now I have 70%? What is this, a rigged casino?”) If we don’t recalculate, the leaderboard is now comparing scores computed against different versions of the model, which is statistically meaningless. There is no correct answer. There is only the answer you hate the least.

The Abyss of Distributed Systems

Suddenly, I was no longer building a simple mobile app. I was building a distributed database with intermittent connectivity and asynchronous consensus requirements. I had wandered into the territory of the CAP theorem without a map or a compass, and I was starting to understand why distributed systems researchers look the way they do (tired, mostly).

I started reading about CRDTs (Conflict-Free Replicated Data Types). If you haven’t encountered them, CRDTs are these beautiful mathematical constructs that guarantee eventual convergence across distributed replicas without requiring a central coordinator. Two devices can independently modify the same data structure, and when they sync, the result is deterministic and consistent. No conflicts. No hand-written merge logic, because the merge rule is baked into the data type itself. It sounds like magic, because it kind of is.

For a few brief, manic hours at 2 AM, I was absolutely convinced that I needed to implement a custom CRDT for the leaderboard state. I was reading papers. I was drawing diagrams. I was in the zone. Then I made the mistake of actually thinking about what that would mean for my specific use case.

CRDTs work beautifully when operations commute: counters, sets, collaborative text. They are magical for collaborative documents, which is why multiplayer tools like Figma borrow ideas from them. But they are terrible for business logic that requires strict invariants. Consider: “A user cannot have more than 3 free questions per day.” Suppose the user has already used 2, and two offline devices each see room for one more and independently increment the counter. The CRDT will converge to 4. Mathematically correct. But the business rule says the counter should never exceed 3. CRDTs guarantee convergence. They do not guarantee validity. And in my case, validity is the entire point.
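To make the convergence-versus-validity gap concrete, here is a minimal sketch of a grow-only counter (a G-Counter, one of the simplest CRDTs). The replica names, the starting count of 2, and the `DAILY_FREE_LIMIT` constant are assumptions made for the example; the limit of 3 is the rule quoted above.

```typescript
// Grow-only counter: one slot per replica, merged with an element-wise max.
type GCounter = Record<string, number>;

function increment(counter: GCounter, replicaId: string): GCounter {
  return { ...counter, [replicaId]: (counter[replicaId] ?? 0) + 1 };
}

// Merge is commutative, associative, and idempotent, so replicas converge
// no matter how often or in which order they sync.
function merge(a: GCounter, b: GCounter): GCounter {
  const merged: GCounter = { ...a };
  for (const [replicaId, count] of Object.entries(b)) {
    merged[replicaId] = Math.max(merged[replicaId] ?? 0, count);
  }
  return merged;
}

function value(counter: GCounter): number {
  return Object.values(counter).reduce((sum, n) => sum + n, 0);
}

const DAILY_FREE_LIMIT = 3; // "no more than 3 free questions per day"

// Assumed starting point: both devices last synced at 2 used questions.
const synced: GCounter = { phone: 2 };

// Offline, each device sees 2 < 3 and allows one more question.
const phone = increment(synced, "phone");   // { phone: 3 }
const tablet = increment(synced, "tablet"); // { phone: 2, tablet: 1 }

const converged = merge(phone, tablet);     // { phone: 3, tablet: 1 }
console.log(value(converged));                     // 4
console.log(value(converged) <= DAILY_FREE_LIMIT); // false
```

Both replicas agree on 4, in either merge order. The data structure did its job perfectly, and the invariant is still broken.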

I closed the research papers. I poured another coffee. I stared at the wall for a while.

Event Sourcing to the Rescue (Mostly)

I eventually abandoned the idea of syncing “state” and instead pivoted to syncing “intent.” This is a distinction that sounds academic but changed everything about how the system works.

Instead of the client telling the server, “The user’s score is now 500,” the client appends an event to a persistent, ordered queue: { type: 'PRACTICE_COMPLETED', payload: { predictionId: 123, confidence: 0.8, timestamp: '2026-03-24T12:00:00Z' } }. The client doesn’t calculate the final score. It doesn’t try to be authoritative. It just records what happened and when it happened.
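As a rough sketch, the client-side queue could look like the code below. The event shape is the one from the paragraph above; `EventQueue`, `seq`, and `clientEventId` are hypothetical additions for illustration. The queue lives in memory here, whereas a real one would have to be written to durable storage to earn the word “persistent.”

```typescript
interface PracticeCompleted {
  type: "PRACTICE_COMPLETED";
  payload: { predictionId: number; confidence: number; timestamp: string };
}

interface QueuedEvent {
  seq: number;           // local ordering, assigned at append time
  clientEventId: string; // assumption: lets a server ignore re-sent duplicates
  event: PracticeCompleted;
}

// Append-only queue. In-memory only; persistence is left out of this sketch.
class EventQueue {
  private events: QueuedEvent[] = [];
  private nextSeq = 1;
  private deviceId: string;

  constructor(deviceId: string) {
    this.deviceId = deviceId;
  }

  // Record what happened. No score is calculated on the client.
  append(event: PracticeCompleted): QueuedEvent {
    const queued: QueuedEvent = {
      seq: this.nextSeq,
      clientEventId: `${this.deviceId}-${this.nextSeq}`,
      event,
    };
    this.nextSeq++;
    this.events.push(queued);
    return queued;
  }

  // Everything the server has not acknowledged yet, oldest first.
  unsent(): readonly QueuedEvent[] {
    return this.events;
  }

  // Drop events up to and including the sequence number the server confirmed.
  acknowledge(throughSeq: number): void {
    this.events = this.events.filter((queued) => queued.seq > throughSeq);
  }
}

const queue = new EventQueue("phone");
queue.append({
  type: "PRACTICE_COMPLETED",
  payload: { predictionId: 123, confidence: 0.8, timestamp: "2026-03-24T12:00:00Z" },
});
```

Events only leave the queue when the server acknowledges them, so a flush that dies halfway can simply be repeated.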

When the device comes back online, it flushes this event queue to the server. The server, acting as the single source of truth, receives the events, validates them against the current business logic and the state of the world at the time they occurred, and processes them. The server then returns the actual, canonical state back to the client. The client accepts this truth and overwrites its local state.
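A minimal sketch of the server side of that exchange follows. The validation rules (confidence between 0 and 1, a parseable timestamp that is not in the future, no duplicate prediction) and the `ServerState` shape are invented for the example and stand in for whatever the real business logic is.

```typescript
type PracticeCompleted = {
  type: "PRACTICE_COMPLETED";
  payload: { predictionId: number; confidence: number; timestamp: string };
};

// Hypothetical canonical state: just the predictions the server has accepted.
type ServerState = { completedPredictionIds: number[] };
type Rejection = { predictionId: number; reason: string };

// Returns null when the event is acceptable, otherwise a rejection reason.
function validate(event: PracticeCompleted, state: ServerState, now: Date): string | null {
  const { predictionId, confidence, timestamp } = event.payload;
  if (!(confidence >= 0 && confidence <= 1)) return "confidence out of range";
  const occurredAt = Date.parse(timestamp);
  if (Number.isNaN(occurredAt)) return "unparseable timestamp";
  if (occurredAt > now.getTime()) return "timestamp in the future";
  if (state.completedPredictionIds.includes(predictionId)) return "duplicate prediction";
  return null;
}

// Apply events in order. The returned state is what the client must accept.
function processEvents(
  state: ServerState,
  events: PracticeCompleted[],
  now: Date
): { state: ServerState; rejected: Rejection[] } {
  let next = state;
  const rejected: Rejection[] = [];
  for (const event of events) {
    const reason = validate(event, next, now);
    if (reason === null) {
      next = {
        completedPredictionIds: [...next.completedPredictionIds, event.payload.predictionId],
      };
    } else {
      rejected.push({ predictionId: event.payload.predictionId, reason });
    }
  }
  return { state: next, rejected };
}
```

The client never gets a vote. It receives the new state and the list of rejections, and its only job is to display both honestly.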

This means the local client is fundamentally optimistic. It applies the events locally to provide immediate feedback—your score goes up, the animation plays, the confetti falls (yes, there is confetti)—but it knows, deep in its algorithmic heart, that its local state is a hallucination. A useful hallucination. A hallucination that makes the user feel good. But a hallucination nonetheless, one that will be corrected the moment the server weighs in.
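The hallucination can be sketched in a few lines. This is an illustration under simplifying assumptions, not the app's real client: `OptimisticClient` and `Stats` are hypothetical names, and the only statistic tracked is a count of completed sessions.

```typescript
type Stats = { sessionsCompleted: number };

class OptimisticClient {
  private confirmed: Stats = { sessionsCompleted: 0 }; // last state the server sent
  private pending: number[] = [];                      // predictionIds not yet processed

  // Record a local event. The UI updates immediately; nothing is confirmed.
  record(predictionId: number): void {
    this.pending.push(predictionId);
  }

  // What the UI shows: confirmed state plus unconfirmed local events.
  view(): Stats {
    return { sessionsCompleted: this.confirmed.sessionsCompleted + this.pending.length };
  }

  // The server's answer replaces local state outright. Events it processed,
  // whether accepted or rejected, leave the pending list.
  reconcile(canonical: Stats, processedIds: number[]): void {
    this.confirmed = canonical;
    this.pending = this.pending.filter((id) => !processedIds.includes(id));
  }
}

const client = new OptimisticClient();
client.record(1);
client.record(2);
console.log(client.view().sessionsCompleted); // 2 (optimistic)

// The server processed both events but accepted only one of them.
client.reconcile({ sessionsCompleted: 1 }, [1, 2]);
console.log(client.view().sessionsCompleted); // 1 (canonical)
```

Keeping the confirmed state and the pending events separate is what makes the correction cheap: the displayed value is always derived, never stored.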

There’s a philosophical parallel here that I can’t stop thinking about. We all walk around with a local model of reality in our heads. We make decisions based on incomplete information. And then reality syncs—we learn something new, something contradicts our model—and we have to reconcile. Some people handle the reconciliation gracefully. Others refuse the server’s response and live in permanent offline mode. But I digress.

The Edge of the Network

Even with this event-sourcing architecture, the UX edge cases are brutal. The network is not a binary state of “online” or “offline.” It is a spectrum. There is a special circle of hell reserved for the state where your phone has full bars of 5G but no packets are actually routing. The app detects connectivity, attempts to sync, the HTTP request hangs for 30 seconds, times out, and the user stares at a loading spinner wondering if the app is broken. It’s not broken. It’s just trapped in networking purgatory, which is worse.

You have to implement aggressive timeouts. You have to build retry logic with exponential backoff (because hammering a flaky network with requests is a great way to make it even flakier). You have to design UI states for “Local changes pending sync,” which sounds straightforward until you realize you also need “Local changes pending sync but we tried and failed three times so maybe check your connection” and also “We synced but the server rejected some events because they violated a constraint that changed while you were offline, sorry about your leaderboard position.”
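One way to keep those states straight is to model them explicitly and derive the retry delay from the failure count. The sketch below assumes a 1 second base delay, a 60 second cap, and a threshold of three quiet failures before nagging the user; all of those numbers and all of the names are illustrative. Timeouts and random jitter (commonly added on top of backoff) are left out to keep it short.

```typescript
// Every state the sync indicator can be in, as a discriminated union.
type SyncStatus =
  | { kind: "synced" }
  | { kind: "pending"; unsynced: number }
  | { kind: "retrying"; unsynced: number; failedAttempts: number; retryInMs: number }
  | { kind: "stalled"; unsynced: number; failedAttempts: number }
  | { kind: "rejected"; rejectedCount: number };

const MAX_QUIET_FAILURES = 3; // assumption: after this many failures, tell the user

// Exponential backoff: the delay doubles with each attempt, up to a cap.
// attempt is zero-based, so attempt 0 waits baseMs.
function backoffDelayMs(attempt: number, baseMs = 1000, capMs = 60000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Decide what to show and when to retry after a failed sync.
function afterFailedSync(unsynced: number, failedAttempts: number): SyncStatus {
  if (failedAttempts >= MAX_QUIET_FAILURES) {
    return { kind: "stalled", unsynced, failedAttempts };
  }
  return {
    kind: "retrying",
    unsynced,
    failedAttempts,
    retryInMs: backoffDelayMs(failedAttempts - 1),
  };
}

function describe(status: SyncStatus): string {
  switch (status.kind) {
    case "synced":
      return "All changes synced";
    case "pending":
      return `${status.unsynced} local changes pending sync`;
    case "retrying":
      return `Sync failed, retrying in ${status.retryInMs / 1000}s`;
    case "stalled":
      return `Sync failed ${status.failedAttempts} times, check your connection`;
    case "rejected":
      return `Synced, but the server rejected ${status.rejectedCount} events`;
  }
}
```

Because `SyncStatus` is a union, the compiler complains when a new state is added and `describe` forgets to handle it, which is about the only help you will get with this part of the problem.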

You have to handle the scenario where the user uninstalls the app while they have unsynced data. That data is just… gone. Evaporated. The predictions they made on that airplane will never reach the server. There is no recovery. This bothered me deeply until I realized that this is also how real life works. Sometimes you do things and nobody will ever know, and you just have to be okay with that.

The Lie, Reconsidered

Offline-first is not a feature you “add on” to an app. It is not a library you install. It is not a checkbox on a requirements document. It is a fundamental architectural constraint that infects every single layer of your codebase, from the data model to the UI to the error handling to the way you think about time itself. It doubles the complexity of every feature you build, because every feature now has two modes: the connected mode and the lying-to-the-user mode.

And yet. When you are on an airplane, 35,000 feet above the ground, and the app instantly loads your stats and lets you practice without a single spinner or error message… it almost feels worth it. Almost. I’ll let you know for sure when I’ve finished debugging the edge case where the user changes time zones mid-sync.

Don’t hold your breath.