TheMobileArchitect

Agent Runtime Surface (ARS): The Missing Layer Between Agents and Applications

By Michael Raber · March 22, 2026

When Apple first announced the App Store, I had conflicting reactions. The opportunity to build new types of software on a phone felt limitless. And at the same time, the idea of moving from web development back to natively compiled binaries felt heavy. The process of building, debugging, and deploying through Apple's approval process felt like the polar opposite of the Web. The Web provided standards and openness; Apple provided access to its hardware, sensors, and local state in exchange for developers giving up control.

The heaviness I expected. What I didn't expect was how much of it felt like power. It actually felt good to do the work. Having the logic closer to the UI made the UI more responsive, free from the network round-trips that web apps of that era couldn't match. One thing surprised me: I was writing database queries and caching layers again. On the web, that was the application server's responsibility. The browser held almost nothing important. Mobile moved most of it back to the device, back to the app, back to the code I was writing. That part felt right.

The regression was something else. On the web, I could open Charles Proxy and watch every HTTP call, every dependency, every performance characteristic in real time. I could make a change to a JavaScript file and see it in the browser immediately. With iOS, I was back to compile cycles, redeployment, and Xcode's debugger. The surface that had been transparent on the web was opaque again. Not because anyone deliberately made it that way. Because that's what a compiled binary running on a sealed device actually is. The tools got better over time. But the fundamental nature of the surface didn't change.

When I started building Whoogic and wanted an AI agent to navigate it as a synthetic user, I hit that sealed surface immediately. The only available path was the accessibility layer, the same approach I had spent years at Discover watching break with every app update and every iOS release. I knew what that road looked like.

My first instinct was to expose the app's internals over a WebSocket. I had solved a version of this problem before, building an SDK at a company I founded that exposed app internals over a protocol for a server to consume. This time, though, I wasn't thinking about screen structure. I was thinking about a text adventure game. You enter a room and discover what nouns and verbs are available. The room tells you what's possible right now. You act within that.

That frame changed everything. Instead of exposing what the screen looks like, I was exposing what can be done at this point in the app's state, and nothing more. No skipping steps. No missing transitions. The app's integrity remains intact because the agent can only traverse what the app intentionally exposes. That instinct led directly to the three-layer model.

This is a product problem as much as an engineering one. How do you enable agents to interact with your product safely, within your governance model, with intentional control over what they can and can't do? The answer matters whether you're building a native mobile app, a web product, or something designed from the ground up for agent interaction. I'm coming at it through a mobile lens because that's where I've spent 15+ years and where the gap is most visible.

The mobile app's sealed surface was never a problem because the only thing on the other side of it was a user. Users navigate within the boundaries the app implements. They don't even know a state machine exists. They just use the app.

Before AI assistants became part of everyday conversation, I got frustrated that Netflix had no way to build a shared favorites list that my wife and I could use across our phones to decide what to watch. So I built a prototype. I loaded the Netflix login screen in a WebView, created a bridge between Swift and JavaScript, injected a JavaScript library to reveal content, form fields, and buttons to tap or links to navigate, and sent the results back as JSON to a native Swift UI I built on top. I did the same with my Discover and Capital One accounts. For the first time, I felt like I was in control of my own data across apps, displayed the way I wanted, in a UI I built myself. I was exploring what became possible when I could access my own content through the products I was already using.

I wasn't testing a different screen-scraping technique. I was testing how it felt to access my personal data and take actions while still following the governance layer each product had put in place.

This is the problem an AI agent faces today. If you want an agent to act on your behalf across the apps where your life actually is, the options are limited. On the web, you can scrape or inject. On native apps, you can probe the accessibility layer or call backend APIs directly, taking on the authentication and state management the app was already handling internally. None of those options preserves the governance that the app was built to enforce. And for a growing number of companies, there is no web version to fall back on. The app is the product. The data lives there. The workflows live there. An agent that can't traverse the native app can act, but not within the guardrails the product was built to enforce.

Exposing web services alone doesn't solve it either. The business rules and workflow states that make an app trustworthy live in the app. The UI enforces them structurally, not just visually. Moving money, booking a flight, sending a message on someone's behalf: these are governed workflows, not just API calls. Without a structured contract between the agent and the application, both sides have a problem. The agent is guessing at the sequence and state. The company has lost control of how its product is used. The right answer isn't a better scraper. It's a contract the app intentionally publishes from the inside, making the scraper unnecessary.

Consider what the UI is actually doing. If signup has been disabled, the signup link disappears. The UI doesn't return an error. It doesn't surface a status code. It simply doesn't offer the transition. The governance is immediate and structural. A REST API has no equivalent. The endpoint exists whether or not the action is currently valid. The calling code has to attempt it and handle the failure. It discovers governance after the fact, through errors, not before the fact, through a structured representation of what's currently permitted. And even when those errors are descriptive, acting on them isn't the agent's responsibility. The agent's job is to traverse valid states, not to interpret failure codes.
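The contrast can be sketched in a few lines of code. This is an illustration, not a spec: the signup flag, the response shapes, and the intent names are all hypothetical. The point is the difference in *when* governance is discovered.

```python
# Contrast sketch: an API discovers governance after the fact, through errors;
# a governed surface publishes it before the fact, by omission.
signup_enabled = False  # hypothetical runtime flag

def rest_signup(email):
    # The endpoint exists whether or not the action is currently valid.
    # The caller learns about governance only by attempting the call.
    if not signup_enabled:
        return {"status": 403, "error": "signup_disabled"}
    return {"status": 200, "created": email}

def current_surface():
    # The surface behaves like the UI: it simply doesn't offer the transition.
    actions = [{"intent": "login"}]
    if signup_enabled:
        actions.append({"intent": "signup"})
    return actions

print(rest_signup("a@example.com"))  # governance surfaces as a failure code
print(current_surface())             # the signup transition is never offered
```

An agent reading `current_surface()` never has to interpret the 403; the invalid path was never visible to begin with.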

What agents need is a surface that behaves like the UI, not like the API. One that exposes what's valid right now, actually callable. The agent never sees transitions that aren't currently available. It operates within the same governed surface the user sees. The agent doesn't need to know why a transition isn't available. It just knows it isn't. The contract is the surface.

Every era has had its own version of the same problem: how does a system tell the world what it can do? Unix man pages and command-line --help flags gave developers standardized ways to ask a tool what it could do before attempting anything. Javadoc made class surfaces legible when editing tools couldn't. OpenAPI made service endpoints discoverable when APIs became too complex to reverse-engineer. I saw a version of this firsthand at Traffic.com, where I built an instrumentation layer using the JMX framework that let us reach into a running Java server and adjust live attributes, like the number of database connections, without taking it offline. That was a runtime contract. The application intentionally exposed what could be changed and what the current state was, so we could change how the server behaved without changing the code.

This is the current era's version of that problem, solved at the mobile app layer, with an agent on the other side instead of an ops engineer.

Introducing the Agent Runtime Surface

Agents have no surface through which to discover and interact with mobile apps. I call that missing layer the Agent Runtime Surface.

An Agent Runtime Surface (ARS) is a dynamic runtime contract embedded within an application that exposes its current navigable state and intent-level actions, enabling agents to traverse and enact valid state changes within the application's governed state model.

In plain terms: the app publishes what it knows about itself right now. Where you are, what actions are available from here, and what the app will permit. The agent reads that surface and acts within it.
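To make that concrete, here is one hypothetical shape such a contract could take, sketched as a Python dict. The field names (state, actions, intent, params) and the watchlist example are illustrative assumptions, not a defined schema.

```python
# A hypothetical ARS state payload: where the app is, and what it will
# permit from here. Everything not listed is simply not available.
current_surface = {
    "state": "watchlist",
    "actions": [
        {"intent": "open_title", "params": {"title_id": "string"}},
        {"intent": "remove_from_watchlist", "params": {"title_id": "string"}},
        {"intent": "go_back", "params": {}},
    ],
}

def available_intents(surface):
    """Return the intent names the app currently permits."""
    return [a["intent"] for a in surface["actions"]]

print(available_intents(current_surface))
# → ['open_title', 'remove_from_watchlist', 'go_back']
```

The agent reads this, reasons about which intent serves its goal, and acts. Nothing outside the list is reachable.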

In contrast, a REST API is rigid by design. Deterministic code calls specific endpoints with specific parameter names. Change a parameter name and the calling code breaks. The contract has to be frozen because the consumer code is frozen. An agent is a different kind of consumer. It doesn't call a predetermined sequence of endpoints. It reasons about what's available and decides what to do. If the button says Login, Sign In, or Submit, the agent figures it out from context. It doesn't pattern match on strings. It understands intent. That makes ARS robust in a way a REST contract can never be. The agent adapts to the surface. The surface doesn't have to be frozen to accommodate the consumer.

With a REST API, someone has to document what each endpoint means and when to use it. Developers are notoriously bad at this, and product owners aren't much better. Most approaches to making software agent-consumable move the documentation burden somewhere else. ARS replaces that burden with engineering work that stays accurate by design. The app will always tell you its correct current state.

Every surface we've built for apps, the graphical UI and the accessibility layer alike, was built for human perception. A reasoning agent doesn't perceive. It understands intent. That's a different consumer. It needs a different surface.

ARS is not a UI automation layer. Not a backend API. Not a messaging protocol. Not an agent execution runtime. Each of those already exists. ARS is the thing none of them are: a runtime state contract that the application builds and publishes intentionally.

Consider a login screen. Depending on the device, the user, and the session state, valid authentication paths might include a password, a passkey, Google SSO, or biometric auth. A static spec documents that these paths exist. The app knows at runtime which ones are valid right now, based on device enrollment, user configuration, and session state. ARS is the mechanism by which the app publishes that knowledge in a form that an agent can act on. An agent without ARS has to guess or probe. An agent consuming an ARS contract gets a window into the app's current moment. What's available, what each option requires, and where each path leads. Not the full state. Just what matters right now.
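The login example can be sketched as the app-side logic that decides what the surface publishes. The function name, the flags, and the intent names are assumptions for illustration; the point is that the contract is computed from runtime state, not declared in a static spec.

```python
# Hypothetical sketch: the app computes which auth transitions to publish
# through its ARS contract, based on device enrollment, user configuration,
# and session state.
def login_surface(device_has_biometrics, passkey_enrolled, sso_configured):
    actions = [{"intent": "login_with_password"}]  # assume always valid here
    if passkey_enrolled:
        actions.append({"intent": "login_with_passkey"})
    if sso_configured:
        actions.append({"intent": "login_with_google_sso"})
    if device_has_biometrics:
        actions.append({"intent": "login_with_biometrics"})
    return {"state": "login", "actions": actions}

# On a device with no biometrics and no enrolled passkey, the agent
# never sees those paths at all. Nothing to guess, nothing to probe.
surface = login_surface(device_has_biometrics=False,
                        passkey_enrolled=False,
                        sso_configured=True)
print([a["intent"] for a in surface["actions"]])
# → ['login_with_password', 'login_with_google_sso']
```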

Web developers have had headless browsers for years, a way to interact with a running web app programmatically without a visible UI. Native mobile has never had an equivalent. ARS fills that gap, with one important difference: where a headless browser interacts with whatever is rendered, ARS only exposes what the app intentionally permits.

Today, automated mobile testing works by simulating touch events at the UI layer. Tapping coordinates, finding elements by accessibility identifiers, and waiting for render cycles to complete. The flakiness every mobile engineer knows comes from this layer: timing issues, elements not yet rendered, accessibility identifiers that change, and animations that haven't finished. You're fighting the rendering pipeline on every interaction.

ARS over a protocol like WebSockets bypasses that entirely. The agent sends an intent-level action. The app receives it and fires the same internal event a touch interaction would have fired, without touching the UI layer at all. The render cycle is irrelevant. Element coordinates are irrelevant. The accessibility tree is irrelevant. The same event handler fires, the same state change happens, the same business logic executes, but three sources of flakiness are removed.
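A minimal sketch of that app-side dispatch, assuming a message format and handler registry that are mine, not part of any defined protocol: the same handler serves both a touch event and an intent arriving over the socket, so the render cycle never enters the picture.

```python
# Sketch: one handler registry shared by the touch path and the ARS path.
# An intent arriving over the socket fires the same internal event a tap
# would have fired, without touching the UI layer.
import json

handlers = {}

def on_intent(name):
    """Register a handler once; touch events and ARS messages both route here."""
    def register(fn):
        handlers[name] = fn
        return fn
    return register

@on_intent("submit_login")
def submit_login(username):
    # Same state change, same business logic, regardless of the caller.
    return f"logged_in:{username}"

def on_socket_message(raw):
    msg = json.loads(raw)
    handler = handlers.get(msg["intent"])
    if handler is None:
        # Not part of the current surface: the transition simply isn't offered.
        return {"error": "intent_not_available"}
    return {"result": handler(**msg.get("params", {}))}

print(on_socket_message('{"intent": "submit_login", "params": {"username": "ada"}}'))
# → {'result': 'logged_in:ada'}
```

Coordinates, render timing, and the accessibility tree never appear anywhere in this path, which is exactly the point.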

This isn't an incremental improvement on UI testing. It's a different execution model. The same outcomes are reached through a governed, deterministic path.

But the implications go beyond testing infrastructure.

There is a business dimension to this that headless browsing and accessibility scraping cannot provide. When an agent interacts with your app through the accessibility layer or a headless browser, you have no idea it is happening. Your analytics see a user. As agents become a meaningful share of your traffic, an unknown and growing share of your analytics will not reflect human behavior at all. Every product decision made from that data carries hidden ambiguity.

With ARS, the agent is interacting through a layer you built intentionally. You know it is there. You can track agent interactions separately, enable or disable features for agents independently, and make deliberate product decisions about what the surface exposes. That is a new category of product intelligence. One that only exists because you made the interaction intentional.

That intentionality matters to companies. It matters even more to users.

The question of whether to trust an AI agent to act on your behalf comes down to one thing: who controls the guardrails. Right now, agents that interact with apps through the accessibility layer or backend APIs don't operate within an intentional governance model. The accessibility layer exposes what the app's accessibility implementation allows, but it wasn't designed to govern agent interaction. Backend APIs have no app-state awareness at all. With ARS, that changes. The banking app decides what agents are permitted to do. Transfer money to trusted recipients only, up to a daily limit, with human confirmation required above a threshold. Those aren't constraints the AI imposes on itself. They're the app's own rules, intentionally surfaced through the ARS layer. The agent operates within the same governance model you do. That's what makes it trustworthy. Not the agent's intentions, but the app's control.

ARS isn't a single interface. It's a layered model with three distinct modes, each designed for a different kind of agent interaction. The same app state model underlies all three. What differs is how much of that model is exposed, and for what purpose.

Semantic UI Mode gives the agent a complete, structured view of what's available at each point in the app. Not a visual representation, but a structured list of elements, actions, and valid next steps. The agent moves through the app one state at a time, receiving the full picture at each step, making decisions based on its goal. Think of it like navigating a conversation where, at each turn, you know every option available to you before choosing. This is the mode for agents acting on behalf of a user, moving through the app systematically without the visual layer. It's also the layer that closes the mobile AI development loop. A coding agent can write a feature and immediately use Semantic UI Mode to traverse and validate what it built. Web developers take this for granted. Mobile developers have relied on tools like Appium, but flaky tests are dangerous for agents. A false failure looks like a real one. The agent fixes code that wasn't broken.
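The traversal loop at the heart of Semantic UI Mode can be sketched in miniature. The three-screen state machine and the goal-matching rule below are illustrative stand-ins; a real agent reasons over intents rather than string-matching destinations.

```python
# Sketch of the Semantic UI Mode loop: at each step the agent receives the
# full picture of what's valid, chooses an action, and moves one state.
APP = {
    "home":     {"actions": {"open_settings": "settings", "open_profile": "profile"}},
    "settings": {"actions": {"go_back": "home"}},
    "profile":  {"actions": {"go_back": "home"}},
}

def traverse(start, goal, max_steps=10):
    state, path = start, [start]
    for _ in range(max_steps):
        if state == goal:
            return path
        actions = APP[state]["actions"]  # the structured surface at this step
        # Stand-in for reasoning: prefer an action that reaches the goal,
        # otherwise take the first available transition.
        intent = next((i for i, dst in actions.items() if dst == goal),
                      next(iter(actions)))
        state = actions[intent]
        path.append(state)
    return path

print(traverse("home", "settings"))
# → ['home', 'settings']
```

Because every step is drawn from the app's own surface, a failure here means the path genuinely doesn't exist, not that an animation hadn't finished.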

Capability Mode is built for agents that know exactly what they need. And it's a pattern that already exists. Apple built it. Siri is an agent, and when Apple needed a way for Siri to interact with apps without scraping the UI or bypassing the app's logic, they created App Intents. You say 'Order a pizza for delivery at 5pm,' and the pizza app fulfills the intent without you having to open it. The app controls what Siri can do. The governance travels with the capability. Capability Mode extends that same pattern to any agent, not just Siri. The agent asks the app what it can do. The app returns a dynamic list of named capabilities based on its current state. Not logged in: limited capabilities returned. Logged in: the full set becomes available. The agent reasons about which capability matches its goal and invokes it directly, without navigating through the app to get there. Because capabilities are named with intent in mind, the agent can reason about them in a way that raw API endpoints never allowed. No documentation required. The name does the work.
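Capability Mode can be sketched the same way. The capability names, the session flag, and the pizza example below are illustrative assumptions; what matters is that the list is computed from current state and that invocation skips navigation entirely.

```python
# Sketch of Capability Mode: the app returns named capabilities based on its
# current state; the agent invokes one directly, without traversing screens.
def list_capabilities(logged_in):
    caps = ["search_menu", "view_store_hours"]
    if logged_in:
        caps += ["order_pizza", "track_order", "repeat_last_order"]
    return caps

def invoke(capability, logged_in, **params):
    if capability not in list_capabilities(logged_in):
        # Governance travels with the capability: not in the list, not callable.
        return {"error": "capability_not_available"}
    return {"ok": True, "capability": capability, "params": params}

print(list_capabilities(logged_in=False))
# → ['search_menu', 'view_store_hours']
print(invoke("order_pizza", logged_in=True, size="large", time="5pm"))
```

The names carry the intent, so the agent can match "order a pizza for 5pm" to order_pizza without any endpoint documentation.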

Discoverability Mode is something new: the mode that only became possible with AI agents capable of reasoning, not just executing.

I've been building Whoogic, a platform that simulates a real phone operating system with apps as the game mechanics. The idea was to test features in context, with goals, time pressure, and the kind of asynchronous interruptions that mirror real life. A text from grandma ten minutes in. A limited window to accomplish something that matters. The apps are the game controller.

Before AI, the plan was to find real humans to play. Then the calculus changed. What if agents played instead? What if agents could predict how a real person would navigate these goal-based scenarios? You'd only need a small number of real humans to validate what the agents had already predicted.

But there was a problem. Agents can't navigate native apps. Whoogic is native. The only way to let agents play at scale was to build an interface that let them discover the app the way a human would. Not with perfect information. Not with a complete map of the state machine. With only what's visible from here, right now, and the need to reason about what to do next.

That's Discoverability Mode. The agent sees what a first-time human user sees. Three icons in the top left corner. A gear, a question mark, a lightbulb. It has to reason whether exploring one of those gets it closer to its goal. Nothing is revealed all at once. The surface is progressively disclosed, in the same way a person would discover an app. The agent doesn't navigate. It explores.
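Progressive disclosure can be sketched as a surface that only reveals what is visible from the agent's current position. The screen graph and the "visible" sets below are illustrative, modeled on the three-icon example above.

```python
# Sketch of Discoverability Mode: nothing is revealed all at once. The agent
# sees only what's visible from here and must choose what to explore.
SCREENS = {
    "home": {"visible": ["gear", "question_mark", "lightbulb"]},
    "gear": {"visible": ["notifications_toggle", "account", "back"]},
}

def explore(position, choice):
    """Disclose the next surface only if the choice is visible from here."""
    if choice not in SCREENS[position]["visible"]:
        return None  # elements the agent can't see disclose nothing
    return SCREENS.get(choice, {"visible": []})

print(explore("home", "gear"))     # disclosed: what the gear screen shows
print(explore("home", "account"))  # not visible from home: nothing revealed
# → None
```

Unlike the other two modes, there is no complete map to consult. The agent has to reason about whether the gear icon is worth tapping, exactly as a first-time user would.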

Success looks different here than in the other two modes. It's not task completion. It's behavioral fidelity. Did the agent notice what a real person would notice? Did it hesitate where a real person would hesitate? Did it find the path a real person would find, or get lost where a real person would get lost? That's the signal Discoverability Mode produces. A signal that doesn't exist today because no current tool is designed to simulate human perception rather than bypass it.

This is the layer designed for usability testing. Not the kind where you watch someone tap through a screen. The kind where you ask: would a real person discover this feature at all? Would they find the path? Would they get stuck where real users get stuck? Discoverability Mode is what makes that question answerable at scale. I've been building Whoogic specifically to explore what becomes possible when you combine this surface with agents that can be configured to behave like different types of people. Semantic UI Mode and Capability Mode will likely arrive first, driven by immediate agent use cases. Discoverability Mode is the harder problem and the longer horizon. But it's the one that can eventually enable agents to test workflows for usability at scale during development.

So far, this has treated ARS in isolation. But there is an emerging layer of protocols connecting agents to interfaces directly, and it's worth being precise about where ARS sits relative to them.

AG-UI streams JSON events between agent backends and frontends, defining how information moves between an agent and a UI. A2UI takes a different approach: it gives an agent a way to describe and render UI for users, defining what the user sees and interacts with inside an agent-driven experience.

ARS operates at a different layer entirely. It defines what the application exposes from the inside, not how information travels or what the agent renders. A2UI gives agents a way to draw UI for users. ARS gives agents a governed surface inside the app to act on. They can be used independently, but together they let agents both show and do in a controlled way.

Those protocols assume the application has already figured out what to expose. Nobody has named that layer as a distinct primitive. Application developers are left to solve it themselves, inconsistently, with no shared vocabulary. That's the gap ARS names.

That Netflix WebView experiment was an early, ad-hoc version of this. I was scraping my way to a runtime contract that those apps never exposed on purpose. ARS is what happens when that contract is built in from the start for agents, rather than for my own glue code.

I'm building a reference implementation to test these hypotheses. Whoogic is a mobile platform simulator in which agents traverse an ARS, and outcomes are compared with those of real human players across the three surface modes. The goal is to validate that the signal an agent produces in Discoverability Mode actually predicts what real users do.

The further implication, one I'll explore separately, is that ARS could become a deployment target in its own right. Multi-platform frameworks like React Native or Compose Multiplatform already let you target iOS, Android, and web from a single codebase. ARS as a target means you could deploy a headless, agent-first interface alongside those. An app with no UI at all, just a governed runtime surface for agents to traverse. That's not science fiction. It's the logical endpoint of taking this primitive seriously.

ARS isn't inherently a mobile primitive. The same contract could expose a web app's runtime state to an agent. But mobile is where the need is most acute, where no accidental openness filled the gap, and where I've spent the last 15+ years building.

This is early. Agent Runtime Surface needs to exist as a named primitive before it can be built consistently, before teams can talk about it, design against it, or recognize when it's missing. That's what this article is. A name and a definition for something the industry is about to need.

Here is what I'm asserting.

Apps have always been governed state models. Existing approaches to agent-app interaction work from outside that model, inferring what's possible from what's rendered. ARS inverts that. The app publishes its governed surface intentionally, from the inside. Nobody named that primitive until now.

Semantic UI Mode, Capability Mode, and Discoverability Mode are distinct interaction modes over the same app state model, each exposing a different view of the app's governed surface for a different agent use case.

Progressive disclosure of app state to simulate human first-contact experience at scale is not something existing tools were designed to do. It requires both a reasoning agent and an intentional surface capable of coordinating what is disclosed based on the agent's configuration and behavioral profile.

Intentional agent interaction enables a new category of product intelligence. Knowing an agent is present, separately from human users, produces analytics and control capabilities that don't exist with outside-in approaches.

ARS preserves the app's governance model structurally. The agent can only see and act on what the app currently permits, the same boundaries that govern human users. This is structural trust, not assumed trust. It doesn't depend on the agent behaving correctly or being well-designed. The app permits, or it doesn't.

I'm not the only person thinking about this. I know that. But I haven't seen it named this way, defined this precisely, or connected to the mobile layer where the need is most acute. If you're building in this space, I'd like to know what you're seeing. What's working. What's missing. What I've gotten wrong.

The conversation is just starting.