TheMobileArchitect

Building an ARS Layer and What It Revealed

By Michael Raber · April 8, 2026

Most agent‑driven automation still talks to apps the way web scrapers talk to websites: from the outside in. Nobody designed those flows for bots parsing HTML; they were designed for people, but we bend them into automation because that is the only surface we have. What I want from ARS is a way to change that relationship. Instead of teaching agents to poke at pixels or replay brittle flows, the app exposes the same workflows it uses for humans as an intentional, governed interface that the agent can call directly.

The most important thing to confirm in this build was how much additional work a semantic ARS interface really adds, and what it means for that interface to share the same state machine as the UI. If ARS demands extra wiring that feels bolted on instead of intentional, that friction will quietly kill adoption long before the ideas have a chance to prove themselves. I’ve seen this pattern before in the early days of web application servers, where a small shift in where the work lived unlocked a lot more power without forcing teams to drag around a second system nobody wanted to maintain.

The answer we confirmed was that ARS did not require a second system. It runs on the same state machine as the SwiftUI app, so the agent interface and the human interface are just two views over one source of truth: the UI renders that state visually for humans, and the ARS layer publishes it semantically for agents.

What the architecture required

To test that, I built a small SwiftUI app with login and signup flows, Redux-style state management, and a Python WebSocket orchestration script. An agent generated the test steps, and the Python layer executed them against the app. That prototype only worked because the app already had a single place where the state lived and changed.

That decision has a prerequisite. If both the UI and the ARS layer are going to read from the same source, that source has to exist and be trustworthy. Which means the app needs a single authoritative state container with governed mutation paths. All state changes flow through a reducer. No direct property mutation from views. The architecture pattern that satisfies this is unidirectional data flow.

This isn't an ARS requirement specifically. It's good architecture. What ARS adds is a reason to care about it that didn't exist before agents needed to navigate your app. Once the state model is clean, the ARS vocabulary is largely already there. The action set your reducer handles is almost exactly the valid transitions your ARS surface needs to expose. ARS is the mechanism that makes that vocabulary visible at runtime. The app was already a state machine. ARS gives agents a map.
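As a sketch of the shape this requires, here is the single governed mutation path in Python rather than Swift, with hypothetical state and action names loosely modeled on this build's login flow:

```python
from dataclasses import dataclass, replace

# Hypothetical state and actions mirroring the article's login flow.
# The real app is SwiftUI; treat this as a shape sketch, not its code.

@dataclass(frozen=True)
class LoginState:
    username: str = ""
    password: str = ""
    screen: str = "login"

def login_reducer(state: LoginState, action: dict) -> LoginState:
    """The one governed mutation path: every state change flows through here."""
    kind = action["type"]
    if kind == "setUsername":
        return replace(state, username=action["value"])
    if kind == "setPassword":
        return replace(state, password=action["value"])
    if kind == "navigateToSignup":
        return replace(state, screen="signup")
    # Unknown actions leave state untouched rather than mutating ad hoc.
    return state
```

The action strings the reducer accepts are, almost verbatim, the vocabulary an ARS surface would publish.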

The ARS surface is stateless by design. The app still owns its state, and the orchestrator now holds what a human would keep in their head: the task, the current step, and how to get from here to there. The surface projects the app’s truth; the orchestrator carries intent and context. On paper, that split between surface and orchestrator made sense; the sharper edges only showed up once the test suite started driving a real build.

The moment agent needs diverged from human needs

An agent running test sequences behaves differently from a human tester. I set it up to run discrete, isolated test cases, each expected to start from a known clean state, with a cleanup phase at the end of every case to guarantee it. When a test broke midway, the console told me exactly which step it thought it was on while the simulator was visibly stuck on an earlier screen. The failure was obvious. The problem was the state it left behind.

That design choice exposed something the app wasn’t built to handle. When a test failed before cleanup, the next test didn’t start from a clean slate. It inherited whatever half‑broken state the last run left in the UI, and the failures cascaded: one bad run, then a wall of red because no successive test had the right starting point. The “obvious” fix was restarting the simulator between every test, which technically worked, but five‑plus seconds per restart across dozens of cases turned a fast suite into something that took minutes.

When an app gets into a state that only a hard reset will fix, a human does something instinctive: force‑quit and relaunch. The agent needed the same escape hatch, exposed as a governed action that could be invoked at any point, regardless of the current screen. That turned into two separate actions. Reboot returns the app to its initial state with session preserved, the equivalent of killing and restarting the app. Reinstall wipes all local data and simulates a fresh install, which is equivalent to deleting and reinstalling. A human has access to both through the OS; the agent needed both to be exposed explicitly in the ARS surface.

Neither belongs in a human‑facing UI. No real user taps a “reboot app” button. That was the moment it became clear there were three categories of actions: ones both humans and agents share, ones that exist only for agents because the OS hides them from humans, and ones that exist only for humans in the UI. Most actions are shared: setting field values, submitting forms, and navigating between screens are valid for both humans and agents. Some are agent‑only: they exist in the ARS surface but have no SwiftUI equivalent, with reboot and reinstall being the first two that this build forced me to name. Others are human‑only: they live in SwiftUI but are never exposed in ARS.
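One way to make that three-way split explicit is a classification table. This Python sketch uses action names from this build; the enum and the lookup are my illustration, not a standard:

```python
from enum import Enum, auto

class ActionCategory(Enum):
    SHARED = auto()      # valid for both humans and agents
    AGENT_ONLY = auto()  # in the ARS surface, no SwiftUI equivalent
    HUMAN_ONLY = auto()  # in SwiftUI, never exposed through ARS

# Action names from this build; the category assignments are the
# explicit per-action decision the article argues every screen needs.
ACTION_CATEGORIES = {
    "setUsername": ActionCategory.SHARED,
    "submitLogin": ActionCategory.SHARED,
    "navigateToSignup": ActionCategory.SHARED,
    "reboot": ActionCategory.AGENT_ONLY,     # restart app, session preserved
    "reinstall": ActionCategory.AGENT_ONLY,  # wipe local data, fresh install
    "reorderItem": ActionCategory.HUMAN_ONLY,
}

def exposed_to_agents(action: str) -> bool:
    """Only SHARED and AGENT_ONLY actions appear in the ARS surface."""
    return ACTION_CATEGORIES[action] is not ActionCategory.HUMAN_ONLY
```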

Human-only deserves a concrete example. On a list screen, a user might reorder items with an action like reorderItem from one position to another. The gesture is simple, but the purpose is human: the visual order helps people build a mental model of priority, recency, and preference. An agent doesn’t need that affordance to accomplish a task; it needs the underlying semantics, not the drag handle.

The ARS surface is not a mirror of the UI. It is a governed projection with its own vocabulary that partially overlaps with human-facing actions and partially diverges from them. That is not a limitation; it is the design. The surface exposes what an agent needs to navigate the app correctly. The UI exposes what a human needs to use it naturally. Those are related but not identical sets of things.

This build also hinted at another split inside the agent-only bucket. Some agent-only actions are testing infrastructure, not production affordances. Reboot and reinstall exist because a test agent needs hard resets. A production agent using your app to accomplish a real task may not need either. That distinction needs more than one build before it is worth hard-wiring into the pattern.

The first time the full suite went green, it was 48 out of 48 in about 35 seconds with a small delay baked in for animations. Seeing a native app run that many behavioral checks in under a minute changed how often I felt comfortable running them and made ARS-backed verification feel like something that could actually live inside an agentic development loop.

Every screen now requires an explicit decision about which actions belong to which category. The spec is where that decision lives. Without it, an agent building the ARS layer will happily produce a surface that looks complete but quietly conflates what humans need with what agents need.

What the spec revealed

The spec for this build did not come first. The agent and I iterated on the UI and state machine from a broader prompt, and only after the prototype existed did I ask it to generate an ARS spec in markdown from what we had. The result surfaced something I am not used to seeing in one place: every name that would normally stay buried in code (screen elements, actions, and validation states) written out as a contract.

A small slice of it looked like this:

## Login Screen Elements
- usernameField [textField]: Username input
- passwordField [secureField]: Password input
- loginButton [button]: Submits login — enabled only when both fields are valid
- signupLink [link]: Navigates to signup screen

## Username Validation Rules
- Empty string → validationState: empty
- Fewer than 4 characters → validationState: invalid, "Must be at least 4 characters"
- More than 20 characters → validationState: invalid, "Must be 20 characters or fewer"
- Contains characters other than letters, numbers, underscore → validationState: invalid, "Only letters, numbers, and underscore allowed"
- 4–20 characters, alphanumeric and underscore only → validationState: valid

## Valid Actions per Screen
- Login idle: setUsername, setPassword, navigateToSignup, resetForm
- Login valid: setUsername, setPassword, navigateToSignup, resetForm, submitLogin

A staff engineer would recognize the shape of it immediately. For this build, it was precise enough to build from and precise enough to test against, and it gave the agent everything it needed to generate deterministic tests that conformed to the reducer’s rules.
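The validation rules in that slice translate almost mechanically into code. A Python sketch, reusing the spec's states and messages, with the rules checked in the order the spec lists them:

```python
import re

def validate_username(value: str):
    """Return (validationState, message) per the spec's username rules."""
    if value == "":
        return ("empty", None)
    if len(value) < 4:
        return ("invalid", "Must be at least 4 characters")
    if len(value) > 20:
        return ("invalid", "Must be 20 characters or fewer")
    if not re.fullmatch(r"[A-Za-z0-9_]+", value):
        return ("invalid", "Only letters, numbers, and underscore allowed")
    return ("valid", None)
```

That mechanical translation is exactly what made the spec testable: each rule names an input class, an output state, and a message, so expected outcomes are derivable without judgment.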

Seeing all of the elements and actions written out in one place forced a naming discipline that assistive AI currently lets you skip. When an agent names a field 'username' that should really be 'usernameForB2BCustomer', it usually does not matter much. In ARS, it does. Those names are the contract. An agent consuming the surface reasons from them the same way a REST client reasons from endpoint and parameter names; the name carries intent, and a bad name is a leaky contract.

That is what made me start thinking about contract‑first design in a different way. Right now, this spec lives as a human‑readable markdown document. It works, but I am not convinced that it is the right long‑term format. A more formal DSL or schema, something closer to protobuf or a JSON‑based contract, would force tighter, deterministic output between builds and give both humans and agents a clearer target to reason about. The tradeoff is that it becomes more work to write and maintain.
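For a rough sense of what a more formal contract might look like, here is a hypothetical JSON rendering of part of the login slice. Every field name here is invented for illustration, not part of this build:

```json
{
  "screen": "login",
  "elements": {
    "usernameField": { "type": "textField", "validates": "usernameRules" },
    "loginButton":   { "type": "button", "enabledWhen": "login.valid" }
  },
  "actions": {
    "submitLogin": { "availableWhen": "login.valid" }
  }
}
```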

The uncomfortable part is ownership. This artifact is too technical for a product owner to maintain and too abstract for most engineers to reach for before writing code, but agentic development needs it. In this build, the agent wrote the spec because I told it to, based on the app we had, not the app we wished we had. That was enough to tie the state machine and the generated tests together for a small prototype. For larger systems, the open question is whether free‑text specs are good enough or whether a dedicated DSL becomes a new layer of code, just at a higher level of abstraction. Even with those questions open, this spec was precise enough to drive a concrete outcome: a full suite of deterministic checks against a running app.

What surface contract testing is

There is a name for what this build was doing. Not unit testing, which verifies logic in isolation without a running app. Not automated UI testing, which verifies behavior through the rendered surface. Surface contract testing verifies that a running app’s governed state model matches the spec it was built from. In this build, a Python harness dispatched actions, read state transitions, and checked the results against expected outcomes. The app runs exactly as it would for a human, in a simulator or on a device. The difference is what the test reads: not what is rendered on screen, but what changed in state.
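The harness loop itself is small. In this Python sketch the transport is injected as plain send/recv callables rather than a real WebSocket so the loop is runnable on its own; the message shapes are assumptions, not the prototype's actual protocol:

```python
import json

def run_case(send, recv, steps):
    """Dispatch each action to the app, collect the surface state it returns."""
    surfaces = []
    for action in steps:
        send(json.dumps({"dispatch": action}))  # action goes over the wire
        surfaces.append(json.loads(recv()))     # governed state comes back
    return surfaces
```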

The three testing layers are complementary. Unit tests catch reducer bugs before they reach the surface. Automated UI tests catch wiring bugs, buttons that dispatch the wrong action, fields that do not reflect state correctly, and layouts that obscure a tappable element. Surface contract tests catch governance bugs, validation rules that were not enforced, transitions that were exposed when they should not have been, and states that leaked across screens.

ARS-backed tests exercise the contract, and UI tests exercise the interface. ARS does not eliminate the need for traditional UI testing. Visual regressions, gesture handling, and real-device quirks still belong to tools like XCUITest or Appium. What ARS changes is the behavioral layer. For spec-level conformance, it can run much faster, behave more deterministically, and express what a test is trying to prove at a higher semantic level. In practice, you would use both: ARS for “does it behave the way the spec says it should,” and UI tests for “does it look and feel right.”

What surface contract testing revealed

The test plan had 48 cases covering a small login and signup flow driven by a governed state model. Claude generated that plan by reading the spec and turning it into concrete cases. Once the plan existed, it ran without Claude in the loop.

Option A was to have the agent evaluate every result. After each step, Claude would get the spec, the current surface, and the test case, then decide whether what just happened was correct. That approach is flexible. It can notice things that the test case did not explicitly name. It is also slow, expensive, and nondeterministic. The same result might be judged differently across runs, which is the last thing you want from a regression suite.

Option B was to let the agent plan and let the machine execute. Claude reads the spec once and produces a structured test plan with expected values. Execution becomes a loop: dispatch an action, read the surface, compare fields. No AI at runtime. Evaluation is fast, deterministic, and cheap, with the tradeoff that it only checks what the plan spells out.

Option B won. The spec was precise enough that expected outcomes were fully derivable from the rules. At that point, Claude was not adding value during evaluation. It was doing expensive string comparison with extra steps. Once the plan had structured expected fields, evaluation was a lookup, not a reasoning task. Use AI where you need reasoning. Use code where you need comparison.
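The evaluation step that replaced the model is just lookup and compare. A minimal sketch, assuming dotted field paths as the plan's convention (an assumption, not the real schema):

```python
def evaluate(expected, surface):
    """Pure comparison of the plan's expected fields against the returned
    surface. Returns a list of mismatches; empty means the step passed."""
    failures = []
    for path, want in expected.items():
        got = surface
        for key in path.split("."):
            got = got.get(key) if isinstance(got, dict) else None
        if got != want:
            failures.append(f"{path}: expected {want!r}, got {got!r}")
    return failures
```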

Option A still has a place. In the first article, that mode showed up as Discoverability Mode: the agent sees what a real user sees and reasons about what to do next without a map. That is the right tool for simulating first-contact experience and for exploratory testing when the spec is intentionally silent. It is the wrong tool for conformance testing against a precise spec. Those are different questions, and they deserve different tools.

When the full suite first ran end to end and went green, watching it was the part that felt different from any UI test suite I have used before: the app stayed visible on screen, moving through the flow at machine speed, screens changing, fields updating, navigation firing. The agent had planned the cases up front; the Python script was now driving the app through the ARS surface, and the state machine was proving its own rules in real time.

What's honest about this prototype

The first gap was timing. The prototype had a 250‑millisecond delay between test actions. It worked, but it put the responsibility in the wrong place. In the current architecture, transitioning to a new screen starts with a state change, and the animation blocks a human from interacting again until the screen is ready. With the Python script driving the same state machine through ARS, the feedback loop was faster than the animation. Without the delay, the next action could arrive before the transition finished. That is what I actually saw: the script was ready to fire the next step while the UI was still moving into place. There are two obvious fixes: turn off animations when running through ARS, or add explicit “view is ready” states so the surface only updates after the transition completes. Either way, the app needs to tell the ARS layer when it is actually safe to act.
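The second fix can be sketched as a polling helper: instead of a fixed sleep, wait for the surface to report readiness. Here `read_surface` and the `viewReady` flag are assumptions about what the app would expose, not names from this build:

```python
import time

def wait_until_ready(read_surface, timeout=2.0, interval=0.02):
    """Poll the ARS surface until it reports an explicit ready flag,
    instead of sleeping a fixed 250 ms between actions."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        surface = read_surface()
        if surface.get("viewReady"):
            return surface  # safe to dispatch the next action
        time.sleep(interval)
    raise TimeoutError("surface never reported viewReady")
```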

The second gap is about who the surface is really for. The verification agent in this build was purpose-built: it knew the spec format, the test plan schema, and the app it was running against. No general-purpose agent has tried to use this surface to accomplish a real task yet. That is what Discoverability Mode looks like at the semantic layer: an agent that was not designed around this app, reading the surface fresh and deciding how to get something done. Whether the current surface is legible enough for that kind of agent is still an open question.

There are also a few practical rough edges that are worth naming. The login and signup flows run entirely on-device; error states exist in the model but do not hit a real backend. The WebSocket client in this prototype has a simple, single-file implementation with no attention paid to thread safety, which is fine for an experiment and wrong for production. Automated UI tests that should sit next to surface contract tests were not written here. Even the two-enum action category split still depends on a human making the right call at the call site.

This build proved things that matter. ARS shared the same state machine as the UI, eliminating the need for a second system. The surface contract emerged from the app and from a spec the agent generated from that state machine, not from a separate design exercise. The full suite ran against a live app through the ARS layer and passed deterministically. The friction that showed up was about timing, naming, and ownership, not about whether the architecture itself works.

If you're building this too

This build was a first pass at running ARS in real code, not a finished pattern. It proved that a semantic surface can share the same state machine as the UI and support fast, deterministic testing from the inside out, but it also surfaced gaps that showed up once it was running. The testing story is the wedge, not the destination. ARS‑backed surface contract tests are just the first place this layer shows up as something you can measure and verify.

What still needs to be proven is how far a spec‑first loop can go when ARS is the only surface it uses to both build and verify a new screen. If you are experimenting in that direction, I would like to see what you are finding: how you are representing state, how you are exposing actions, and where your agents get stuck.