Capture and Replay: Building a Recording-Based QA Tool in the Browser

August 22, 2019

A low-code platform has a peculiar QA problem. The shipped product is a runtime that renders other people's apps. Every customer's screens are different. The forms our customers built were made of our components, but the meaning of any given click — "submit a loan application", "add a co-applicant", "recompute the EMI" — was customer-specific. Our internal QA team could not write Selenium scripts against a customer's screens. The customers themselves wanted to write tests, but they were the same people who used the platform precisely because they didn't write code.

The brief I got was disarmingly simple: let a non-engineer click through their flow once, save the recording, and replay it as a regression test on every release.

That product is what this post is about. We built it in the second half of 2018 and the first half of 2019. It got into customer hands, it caught real regressions, and almost everything I learned came from the gap between recording, which was easy, and replaying, which consumed most of the engineering effort. Specifically: iframes, dynamically generated IDs, asynchronous side effects, and animations turned a "just record and play it back" idea into a year of subtle problems.

I want to walk through what we built, what broke, and the principles that survived.

Why we built it ourselves

Before any code, we did the responsible thing and looked at what existed. Selenium IDE was the obvious candidate. There were a handful of commercial tools. A few teams inside the company had used a recording extension for ad-hoc smoke tests. None of them fit, for reasons that took us a couple of weeks to articulate clearly:

  1. The platform's DOM was generated. Most of our components rendered into element trees with auto-generated IDs that changed between page loads, and into class names that were stable but deeply hierarchical. Off-the-shelf recorders defaulted to #id-based selectors and produced tests that were already broken on the second run.
  2. Authoring was the product, not the byproduct. Existing tools recorded a script the user could then edit. Our users were not script editors; they needed the recording itself to be the test, and to be re-runnable without manual cleanup.
  3. The replay had to mean the same thing across customer schemas. A click that "added a co-applicant" should still mean that after the customer renamed a field. We needed the recording to capture intent tied to platform metadata, not just raw DOM events tied to whatever happened to render that day.
  4. Pricing models didn't fit. Per-user licensing on commercial tools didn't match a multi-tenant platform where the unit of "user" was a customer's operations team, not an engineer.

We decided to build not because we were smarter, but because the existing products did not fit. We needed the recorder to know about our own metadata model, and there is no off-the-shelf tool that knows about your metadata model.

The shape of the system

Three pieces, each small, with a deliberately narrow API between them.

    flowchart TB
      subgraph BROWSER["Browser (customer's session)"]
        direction TB
        REC["Recorder<br/>(MutationObserver +<br/>delegated event listeners)"]
        REC --> EVT["Event log<br/>(typed action stream)"]
      end
      EVT -- "POST /recordings" --> SVC["Recording service<br/>(versioned, tenant-scoped)"]
      SVC --> STORE[("Postgres<br/>recordings · steps · runs")]
      SVC -- "schedule" --> RUN["Replay runner<br/>(headless Chrome)"]
      RUN --> REPORT["Run report<br/>(pass · fail · screenshots · diff)"]

The recorder lived as a small script we injected into the customer's session when they entered "record mode" from our UI. It listened for events at the document root, decided which ones counted as user intent, and wrote a typed action to a buffer. When the user hit "stop", the buffer was sent to the recording service. Replays ran later in a headless browser, against the same metadata version the recording was made on, and produced a structured report.

The interesting design questions were almost all in the recorder and the replay runner. The service in the middle was, and stayed, boring on purpose.

Recording: the easy half (mostly)

Capturing a click is one line of JavaScript. Capturing the right click — the one that means something to the test — is a year-long project. The recorder went through three iterations.

What we listened for

The first cut listened for everything: click, dblclick, mousedown, mouseup, keydown, keyup, keypress, input, change, focus, blur. It produced enormous logs full of meaningless events — keystrokes mid-typing, focus events on accidental tab navigations, clicks that bubbled to seven different ancestors.

The second cut narrowed to a small whitelist of high-level intents:

    type RecordedAction =
      | { kind: "click"; target: SelectorChain; meta: ElementContext }
      | { kind: "type"; target: SelectorChain; value: string }
      | { kind: "select"; target: SelectorChain; value: string }
      | { kind: "submit"; target: SelectorChain }
      | { kind: "navigate"; url: string }
      | { kind: "wait_for"; condition: WaitCondition };

We listened at the document root with a single delegated handler and decided per-event whether it qualified. A click on a button counted; a click on the empty area of a form did not. A change event on a <select> counted; an input event on the same <select> did not, because they were redundant. Typing was debounced and emitted as one type action with the final value, not 12 keystrokes.
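As a rough sketch of that qualification step, the function below reduces a list of raw events to intent-level actions. The `RawEvent` shape, the `targetKey` field, and the set of clickable tags are assumptions made for illustration, not the recorder's real types. Time-based debouncing is approximated here by coalescing consecutive `input` events on the same field.

```typescript
// Hypothetical, simplified shapes; the real recorder used SelectorChain targets.
type RawEvent =
  | { type: "click"; tag: string; targetKey: string }
  | { type: "input" | "change"; tag: string; targetKey: string; value: string };

type IntentAction =
  | { kind: "click"; target: string }
  | { kind: "type"; target: string; value: string }
  | { kind: "select"; target: string; value: string };

// Assumption: only these tags count as deliberate click targets.
const CLICKABLE_TAGS = ["button", "a"];

function toIntentActions(events: ReadonlyArray<RawEvent>): IntentAction[] {
  const actions: IntentAction[] = [];
  for (const ev of events) {
    if (ev.type === "click") {
      // A click on the empty area of a form is not user intent.
      if (CLICKABLE_TAGS.indexOf(ev.tag) !== -1) {
        actions.push({ kind: "click", target: ev.targetKey });
      }
      continue;
    }
    if (ev.tag === "select") {
      // For <select>, `change` is the intent; `input` is redundant.
      if (ev.type === "change") {
        actions.push({ kind: "select", target: ev.targetKey, value: ev.value });
      }
      continue;
    }
    if (ev.type !== "input") continue; // text fields: `input` carries every keystroke
    const last = actions[actions.length - 1];
    if (last !== undefined && last.kind === "type" && last.target === ev.targetKey) {
      last.value = ev.value; // same field, still typing: keep only the latest value
    } else {
      actions.push({ kind: "type", target: ev.targetKey, value: ev.value });
    }
  }
  return actions;
}
```

Fed a click on the form background, two keystrokes in an email field, an `input` plus `change` pair on a `<select>`, and a button click, this yields three actions: one `type`, one `select`, one `click`.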

Selectors that survived a re-render

The single hardest decision in the recorder was how to identify the element to click. Three options, each with failure modes:

  • CSS #id: brittle on our platform because IDs were auto-generated and changed on re-render.
  • Full XPath: stable in shape but broke as soon as an unrelated element was inserted earlier in the DOM.
  • Structural / accessibility-based selectors: more verbose, more resilient, but required us to compute a chain rather than a single string.

We went with a chain. Each recorded element had a list of progressively-weaker matchers that the replay would try in order until one resolved a single element.

    type SelectorChain = ReadonlyArray<Selector>;

    type Selector =
      | { by: "data-test-id"; value: string }
      | { by: "aria-label"; value: string }
      | { by: "role-name"; role: string; name: string }
      | { by: "text"; text: string; tag?: string }
      | { by: "label-for"; labelText: string } // <label>Email</label> → input it labels
      | { by: "css"; path: string }; // last resort

The recorder generated all six in order whenever the element supported them. The replay runner tried them in the same order and remembered which one matched, so a recording could self-heal over time: if data-test-id was missing today but aria-label matched, we used aria-label and flagged the recording for an editor's review.

The single most useful change we ever made to the recorder was this: we stopped trying to produce one perfect selector and started producing a list of imperfect ones, ranked by stability.
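A minimal sketch of how a runner might walk such a chain, under some assumptions: selectors are reduced to a generic `{ by, value }` pair, and the DOM lookup is an injected `query` function (hypothetical) so the logic can be read and tested without a browser. The `needsReview` flag stands in for the "flag the recording for an editor's review" behavior described above.

```typescript
// Simplified selector for the sketch; the real chain had more variants.
type ChainSelector = { by: string; value: string };

type ChainResolution<E> =
  | { ok: true; element: E; matchedIndex: number; needsReview: boolean }
  | { ok: false };

// `query` returns every element a selector matches (injected, hypothetical).
function resolveChain<E>(
  chain: ReadonlyArray<ChainSelector>,
  query: (selector: ChainSelector) => E[],
): ChainResolution<E> {
  for (let i = 0; i < chain.length; i++) {
    const matches = query(chain[i]);
    // Zero matches and ambiguous matches both fall through to the next matcher.
    if (matches.length === 1) {
      // A weaker matcher winning means the page drifted: flag for review.
      return { ok: true, element: matches[0], matchedIndex: i, needsReview: i > 0 };
    }
  }
  return { ok: false }; // loudly missing, never silently wrong
}
```

Requiring exactly one match is the important detail: a selector that resolves to two elements is treated the same as one that resolves to none.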

Tying recordings to platform metadata

For our case there was a fourth lever: the platform itself knew which object and field a given DOM node represented. The form widget that rendered Customer.email set data-platform-field="Customer.email" on the input. The recorder preferred this over everything else when present:

{ by: "platform-field"; object: "Customer"; field: "email" }

The reason this mattered: a customer renaming Customer.email to Customer.primary_email automatically migrated their existing recordings, because the test was bound to the field's identity, not its label or its ID. This is the version of "the metadata is the product" you read about in the previous post, showing up in a different room of the building.
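To make the preference concrete, here is a small sketch that turns the attribute value into the selector shown above and puts it at the front of a chain. It assumes the attribute is always of the form `<Object>.<field>` with a single dot, which is an assumption based on the one example given; `parsePlatformField` and `withPlatformField` are hypothetical names.

```typescript
type PlatformFieldSelector = { by: "platform-field"; object: string; field: string };

// Parses a value such as "Customer.email". Returns null when the value is not
// exactly "<Object>.<field>", so callers fall back to the rest of the chain.
function parsePlatformField(attr: string | null): PlatformFieldSelector | null {
  if (attr === null) return null;
  const parts = attr.split(".");
  if (parts.length !== 2 || parts[0] === "" || parts[1] === "") return null;
  return { by: "platform-field", object: parts[0], field: parts[1] };
}

// Puts the metadata selector first when present; other matchers keep their order.
function withPlatformField<S>(
  attr: string | null,
  fallback: ReadonlyArray<S>,
): Array<S | PlatformFieldSelector> {
  const preferred = parsePlatformField(attr);
  return preferred === null ? [...fallback] : [preferred, ...fallback];
}
```

In a browser the `attr` argument would come from something like `element.getAttribute("data-platform-field")`, which returns `null` when the attribute is absent.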

Replay: where everything was hard

If recording was a year-long project, replay was where the year went. The naive version of replay is a few lines of pseudocode:

    for each action in recording:
        do_the_action()
        sleep(some milliseconds)

That works on the demo recording. It does not work on anything else, and the rest of this post is the four reasons why.

Reason 1: timing isn't a duration

The first replay engine used elapsed-time waits. A click happened, we slept 200ms, the next action ran. It passed in development and failed wherever the network was slow.

Time is the wrong primitive. The user wasn't waiting 200ms; they were waiting until the page was ready for the next click. We rewrote replay around explicit wait conditions:

    type WaitCondition =
      | { kind: "selector_appears"; selector: SelectorChain; timeoutMs: number }
      | { kind: "selector_disappears"; selector: SelectorChain; timeoutMs: number }
      | { kind: "text_visible"; text: string; within?: SelectorChain; timeoutMs: number }
      | { kind: "network_idle"; graceMs: number; timeoutMs: number }
      | { kind: "platform_event"; name: string; timeoutMs: number };

Most actions had an implicit wait condition derived from what the recorder saw happen after the user's click. If clicking "Add co-applicant" caused a sub-form to appear, the recorder noted "the next action's target appeared" as the wait condition. Replay then waited for that condition before issuing the next click.
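One way to picture the derivation is as a post-processing pass over the recorded steps. The sketch below assumes the recorder has already noted, per step, whether that step's target only entered the DOM after the previous step ran. The `targetAppearedAfterPrevious` field and the default timeout are hypothetical.

```typescript
type Chain = ReadonlyArray<{ by: string; value: string }>;

type Step =
  | { kind: "click" | "type"; target: Chain }
  | {
      kind: "wait_for";
      condition: { kind: "selector_appears"; selector: Chain; timeoutMs: number };
    };

// A recorded step plus what the recorder observed about its target (hypothetical).
type ObservedStep = { step: Step; targetAppearedAfterPrevious: boolean };

const DEFAULT_TIMEOUT_MS = 10000; // assumed default, not from the original system

function insertImplicitWaits(observed: ReadonlyArray<ObservedStep>): Step[] {
  const out: Step[] = [];
  for (const { step, targetAppearedAfterPrevious } of observed) {
    if (targetAppearedAfterPrevious && step.kind !== "wait_for") {
      // The target did not exist before the previous action: wait for it.
      out.push({
        kind: "wait_for",
        condition: {
          kind: "selector_appears",
          selector: step.target,
          timeoutMs: DEFAULT_TIMEOUT_MS,
        },
      });
    }
    out.push(step);
  }
  return out;
}
```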

platform_event was our cheat: any time our own code dispatched a known custom event (platform:save_complete, platform:row_added), the replay could wait for it directly. We added one of these for every long-running async operation in the runtime. They were free for us to emit and worth their weight in green test runs.

    flowchart TB
      A["Action N — click 'Add co-applicant'"] --> B{"Implicit wait<br/>condition?"}
      B -- "selector_appears" --> WAITSEL["Poll DOM until<br/>sub-form selector resolves"]
      B -- "platform_event" --> WAITEVT["Listen for<br/>platform:row_added"]
      B -- "network_idle" --> WAITNET["Wait until no XHR<br/>in flight for graceMs"]
      WAITSEL --> NEXT["Action N+1 — type into field"]
      WAITEVT --> NEXT
      WAITNET --> NEXT
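Two of these waits are easy to sketch with standard APIs. The helpers below are illustrative rather than the runner's actual code: `waitUntil` polls an arbitrary check (which could wrap a selector lookup) and `waitForPlatformEvent` listens on any `EventTarget`. Both names and the 50ms default polling interval are assumptions.

```typescript
const sleepMs = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Polls `check` until it returns true or the timeout elapses.
// Resolves to true on success and false on timeout.
async function waitUntil(
  check: () => boolean,
  timeoutMs: number,
  intervalMs = 50,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (true) {
    if (check()) return true;
    if (Date.now() >= deadline) return false;
    await sleepMs(intervalMs);
  }
}

// Resolves when `eventName` fires on `target`; rejects after `timeoutMs`.
function waitForPlatformEvent(
  target: EventTarget,
  eventName: string,
  timeoutMs: number,
): Promise<void> {
  return new Promise<void>((resolve, reject) => {
    const onEvent = () => {
      clearTimeout(timer);
      resolve();
    };
    const timer = setTimeout(() => {
      target.removeEventListener(eventName, onEvent);
      reject(new Error(`timed out after ${timeoutMs}ms waiting for ${eventName}`));
    }, timeoutMs);
    target.addEventListener(eventName, onEvent, { once: true });
  });
}
```

The listener has to be attached before the action that triggers the event is fired; otherwise a fast `platform:save_complete` can be dispatched and missed.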

Reason 2: iframes are a different world

The platform embedded a few third-party widgets (mostly OCR/document scanners) inside iframes. When the customer recorded a flow that scanned a PAN card, the clicks happened inside the iframe, but the recorder, running at the top level, received no events from a cross-origin frame.

Three iframe scenarios, three answers:

  • Same-origin iframes (our own widgets in srcdoc frames): we injected the recorder script into the frame and forwarded events through postMessage into the parent's event log. Selectors in those events carried a frame: { id: "loan-form" } qualifier so the replay knew to switch into the frame before resolving them.
  • Cross-origin iframes we owned (subdomain widgets): the widget hosted its own recorder shim that forwarded events via postMessage with a shared origin allowlist.
  • Cross-origin iframes we did not own: we couldn't reach into them, period. We made the recorder pause and prompt the user: "This step is in a third-party widget. Confirm visually when you're done, and we'll resume capture." The recorded action became a manual checkpoint, not a click.
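For the first two scenarios, the parent-side handling might look roughly like the sketch below. The envelope shape (`source`, `frameId`, `action`) and the function names are hypothetical; the parts taken from the description above are the origin allowlist and the `frame: { id }` qualifier attached to every forwarded action.

```typescript
type FrameAction = { kind: string; target: ReadonlyArray<{ by: string; value: string }> };
type FrameQualifiedAction = FrameAction & { frame: { id: string } };

// Hypothetical envelope a frame-side recorder shim would post to the parent.
type FrameEnvelope = { source: "recorder-shim"; frameId: string; action: FrameAction };

function isFrameEnvelope(data: unknown): data is FrameEnvelope {
  if (typeof data !== "object" || data === null) return false;
  const d = data as { [key: string]: unknown };
  return (
    d.source === "recorder-shim" &&
    typeof d.frameId === "string" &&
    typeof d.action === "object" &&
    d.action !== null
  );
}

// Returns the action tagged with its frame, or null if the message is ignored.
function acceptFrameMessage(
  senderOrigin: string,
  data: unknown,
  allowedOrigins: ReadonlyArray<string>,
): FrameQualifiedAction | null {
  if (allowedOrigins.indexOf(senderOrigin) === -1) return null; // unknown sender
  if (!isFrameEnvelope(data)) return null; // unrelated postMessage traffic
  return { ...data.action, frame: { id: data.frameId } };
}
```

In the browser, `senderOrigin` and `data` would come from the `origin` and `data` properties of a `message` event on the parent window.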

Pretending iframes weren't a problem made the early replay runs look like they passed; the assertions were just running against the wrong frame, finding nothing, and timing out. Once we taught the recorder to qualify every selector with its frame coordinates, the failure mode changed from "silently wrong" to "loudly missing" — which is the only failure mode I'm willing to ship.

Reason 3: dynamic IDs and the wrong kind of stability

Most of our components used a small ID generator that handed out monotonic suffixes per component instance: cust-input-7, cust-input-8, etc. The numbers depended on the order components mounted, which depended on the order data arrived, which was not deterministic.

Recording a click on #cust-input-7 was a guarantee that the test would fail on the next run.

We fixed this in two places. In the platform: every component that rendered a recordable surface set data-platform-field and data-platform-action based on metadata, not on instance counters. Recording preferred those attributes; the auto-generated ID was the last selector in the chain, used only when nothing else was available. In the recorder: we explicitly blacklisted attributes whose values matched the auto-generator's pattern, even if a customer opted to use them. Some footguns are worth refusing on the user's behalf.

The deeper lesson: stable-looking selectors are sometimes worse than obviously-unstable ones. A #cust-input-7 looks specific and matches first; it also fails first. A [role=textbox][aria-label="Email"] looks loose and matches reliably. Your replay will spend its time inside the loose-looking selector.
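The recorder-side refusal can be sketched as a pattern check on attribute values. The regular expression below is an assumption inferred from the `cust-input-7` example (lowercase words joined by hyphens, ending in a numeric counter); the real generator's format may have differed.

```typescript
// Assumed generator format: lowercase words joined by "-" and ending in a
// numeric instance counter, e.g. "cust-input-7".
const AUTO_ID_PATTERN = /^[a-z]+(?:-[a-z]+)*-\d+$/;

function isAutoGeneratedValue(value: string): boolean {
  return AUTO_ID_PATTERN.test(value);
}

type AttributeCandidate = { attribute: string; value: string };

// Drops any candidate whose value looks auto-generated, whichever attribute
// carries it, so such values never become a preferred selector.
function stableCandidates(
  candidates: ReadonlyArray<AttributeCandidate>,
): AttributeCandidate[] {
  return candidates.filter((c) => !isAutoGeneratedValue(c.value));
}
```

A pattern this broad will also reject some hand-written values that happen to end in a number, such as `step-2`. That trade-off errs on the side of refusing.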

Reason 4: races and animations

Two animation-flavored problems bit us early.

Click landed on the wrong element. A modal opened with a 250ms slide-in animation. The recorder captured a click on the visible form behind the modal, but at replay time the modal was already up and the click landed on the modal's overlay. Same script, same code, different outcome — because the human had paused for the animation and the replay had not.

Network race. A row was added optimistically to a list, the test asserted "row exists", and then the server returned an error and the row was removed. The assertion passed; the test was actually broken.

Two changes addressed both:

  1. Stability check before every action. Before clicking, replay asserted: the element resolves, is visible, is not animating (getAnimations() returns empty), and is not occluded. If any check failed, replay waited until they all passed or timed out.
  2. Eventual assertions, not point-in-time assertions. "Row exists" became "row exists and is still there 500ms later" for any state derived from a network call. The 500ms was tunable per assertion. It made the average test slightly slower and the false-pass rate go to roughly zero.

    flowchart TB
      READY["Action N+1 ready to fire"] --> CHK1{"Selector<br/>resolves?"}
      CHK1 -- "no" --> RETRY1["Backoff & retry"]
      CHK1 -- "yes" --> CHK2{"Visible &<br/>not animating?"}
      CHK2 -- "no" --> RETRY1
      CHK2 -- "yes" --> CHK3{"Not occluded?"}
      CHK3 -- "no" --> RETRY1
      CHK3 -- "yes" --> FIRE["Fire action"]
      RETRY1 --> READY
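The stability gate can be sketched as a pure function over a snapshot of the target, plus a retry loop. The snapshot type is an assumption that keeps the example browser-free. In a real runner its fields could be filled from the resolved selector, `getBoundingClientRect()`, `element.getAnimations().length`, and `document.elementFromPoint()`.

```typescript
// State of the target element at one instant (hypothetical shape).
type TargetSnapshot = {
  resolved: boolean;
  visible: boolean;
  runningAnimations: number;
  occluded: boolean;
};

type GateResult = { fire: true } | { fire: false; reason: string };

// Checks run in the same order as the stability check described above.
function checkActionable(s: TargetSnapshot): GateResult {
  if (!s.resolved) return { fire: false, reason: "selector did not resolve" };
  if (!s.visible) return { fire: false, reason: "not visible" };
  if (s.runningAnimations > 0) return { fire: false, reason: "animating" };
  if (s.occluded) return { fire: false, reason: "occluded" };
  return { fire: true };
}

// Re-samples the target until the gate passes or the attempts run out.
function gateWithRetry(sample: () => TargetSnapshot, maxAttempts: number): GateResult {
  let last: GateResult = { fire: false, reason: "not attempted" };
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    last = checkActionable(sample());
    if (last.fire) return last;
    // A real runner would back off here, for example by awaiting a short delay.
  }
  return last;
}
```

Returning the reason for the last failed check keeps a timeout diagnosable: "occluded" and "selector did not resolve" point at very different bugs.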

What we got wrong

A few honest ones.

We assumed the recorder ran in the same browser as the replay

Our customers used Chrome. We replayed in headless Chrome of the same major version. Predictably, a customer made a recording in a slightly newer Chrome than our headless one, and the recorder captured a paste event the older replay engine didn't know how to dispatch. The record/replay versions had drifted.

The fix was to version the action stream itself. Every recording carried the recorder version that produced it; replay refused to run a recording newer than its supported version, and we shipped recorder/replay updates together. This is the kind of thing every long-lived format does eventually; we just got there a quarter late.
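A minimal sketch of that refusal, assuming dotted numeric version strings such as `2.1.0`. The post does not say what version scheme was used, so the format and the function names here are hypothetical.

```typescript
type VersionedRecording = {
  recorderVersion: string;
  actions: ReadonlyArray<{ kind: string }>;
};

// Compares dotted numeric versions; returns <0, 0, or >0.
// Missing or non-numeric segments are treated as 0.
function compareVersions(a: string, b: string): number {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Throws instead of attempting a replay the engine may not understand.
function assertReplayable(recording: VersionedRecording, maxSupportedVersion: string): void {
  if (compareVersions(recording.recorderVersion, maxSupportedVersion) > 0) {
    throw new Error(
      `recording was made by recorder ${recording.recorderVersion}, ` +
        `but this replay engine supports up to ${maxSupportedVersion}`,
    );
  }
}
```

Comparing segment by segment as numbers matters: a plain string comparison would rank `1.10.0` below `1.9.3`.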

We didn't capture viewport, and we paid for it

Recordings made on a 1920×1080 monitor failed when replayed in a 1366×768 headless viewport. Elements were below the fold; the click coordinates were valid, but the modal that should have opened was clipped, and the next selector never resolved. We didn't capture the viewport size at recording time, so replay had no way to set a sensible default.

We added it eventually. Captured viewport, scroll position, device pixel ratio, and the user agent. Each one paid for itself within a month of being added.
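The four captured values map directly onto standard browser properties. The sketch below reads them from a minimal window-like object so it runs outside a browser; the `CapturedEnvironment` shape and both function names are assumptions.

```typescript
type CapturedEnvironment = {
  viewport: { width: number; height: number };
  scroll: { x: number; y: number };
  devicePixelRatio: number;
  userAgent: string;
};

// The subset of `window` the sketch needs; a real browser `window` satisfies it.
type WindowLike = {
  innerWidth: number;
  innerHeight: number;
  scrollX: number;
  scrollY: number;
  devicePixelRatio: number;
  navigator: { userAgent: string };
};

function captureEnvironment(win: WindowLike): CapturedEnvironment {
  return {
    viewport: { width: win.innerWidth, height: win.innerHeight },
    scroll: { x: win.scrollX, y: win.scrollY },
    devicePixelRatio: win.devicePixelRatio,
    userAgent: win.navigator.userAgent,
  };
}

// True when the replay viewport differs from the one the recording was made in,
// in which case the runner should resize before the first action.
function viewportMismatch(
  captured: CapturedEnvironment,
  replayViewport: { width: number; height: number },
): boolean {
  return (
    captured.viewport.width !== replayViewport.width ||
    captured.viewport.height !== replayViewport.height
  );
}
```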

"Wait for network idle" was not a panacea

It seemed obvious. Wait for the network to settle, then proceed. In practice, we had a long-poll connection that never went idle, and a heartbeat ping every 15 seconds that briefly woke the network up and reset the idle timer. network_idle worked beautifully when it worked and was the worst kind of flake when it didn't.

We deprecated it as a default. It remained available as an opt-in with a clear warning. Most actions used selector_appears or platform_event, which were specific and predictable. "Wait for the thing you actually need" beat "wait for the network to be quiet" every time.
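A small model shows why "no request in flight for graceMs" is fragile. This is an illustration of the failure mode rather than the runner's code: requests are reduced to start and end timestamps, a long-poll is modeled as a request that never ends, and `firstIdleAt` is a hypothetical helper.

```typescript
// One network request; use endMs = Infinity for a long-poll that never closes.
type RequestSpan = { startMs: number; endMs: number };

// Returns the earliest time at which no request has been in flight for
// `graceMs`, or null if that never happens before `horizonMs`.
function firstIdleAt(
  spans: ReadonlyArray<RequestSpan>,
  graceMs: number,
  horizonMs: number,
): number | null {
  const sorted = spans.slice().sort((a, b) => a.startMs - b.startMs);
  let quietSince = 0; // when the network last became quiet
  for (const span of sorted) {
    // A long enough gap opens before this request starts: idle is reached.
    if (span.startMs - quietSince >= graceMs) break;
    quietSince = Math.max(quietSince, span.endMs);
  }
  const idleAt = quietSince + graceMs;
  return idleAt <= horizonMs ? idleAt : null;
}
```

With a single 100ms request and a 500ms grace period the network counts as idle at 600ms. Add one never-ending long-poll and the result is `null`. Likewise, any grace period longer than the gap between heartbeats never completes, because each ping restarts the quiet window.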

Build vs buy, in retrospect

I want to revisit the build-vs-buy question, because the easy answer is wrong in both directions.

The easy "buy" answer — "Selenium IDE / commercial tool X exists, use it" — would have left our customers without metadata-aware recordings. Renames would have broken everything. The time we'd have spent maintaining brittle scripts would have been larger than the time we spent building the recorder.

The easy "build" answer — "we're smart, we'll do it ourselves" — would have led us to spend a year reinventing things that off-the-shelf tools already do well: parallel test runners, screenshot diffing, headless browser orchestration. We did not, in fact, build those. We used a headless Chrome runner with a thin orchestration layer on top of well-known tools. The thing we did build was the recorder, the action stream, and the metadata-aware selectors — the parts that were genuinely specific to our product.

The honest framing: build the part that has to know about your product; buy the part that's the same for everybody. The line between those two is not always obvious, and finding it is most of what platform engineering is.

What I'd take into the next platform

Five principles survived intact:

  • Capture intent, not events. A click is the lowest-fidelity thing you can record; a submit_loan_application is the highest. Aim as close to the latter as the platform allows.
  • Selectors should be a ranked chain, not a string. The recording should describe the element from multiple angles and let the runner pick the best one at replay time.
  • Time is the wrong primitive for waits. Wait for what should have happened, not for how long it usually takes.
  • Iframes, dynamic IDs, and animations are not edge cases. They are the median experience of any non-trivial app, and a recorder that doesn't handle them is a recorder that works only on toy demos.
  • Version everything that has a long life. Recorder, replay engine, action stream, and metadata. The day the formats drift is the day every old recording stops working — unless you saw it coming.

The systems that age well are the ones that make the right thing easy to query. For a recorder, the "right thing" is the user's intent, anchored to platform metadata, expressed as a sequence of waits and actions. Every shortcut around that — raw events, single selectors, time-based sleeps — is a debt that comes due in production within the first month.
