MyPCBench

A reproducible Linux-desktop benchmark for personally intelligent computer-use agents. The environment is seeded end-to-end from one canonical persona across 17 cross-consistent web applications and roughly 17,000 personal records, and 184 tasks are graded by a per-rubric LLM judge.

Lawrence Keunho Jang · Andrew Keunwoo Jang · Jing Yu Koh · Ruslan Salakhutdinov
Carnegie Mellon University
Apps: 17 web applications · Tasks: 184, six behavioural types · Records: 17K cross-linked entries · Models: 6, closed + open weights
§ 01 · Abstract

Abstract

Current benchmarks for computer-use agents evaluate models in impersonal environments, leaving a gap between evaluation and deployment: personal assistants are expected to work across a user's whole digital life. MyPCBench is a reproducible Linux-desktop benchmark seeded end-to-end with one user's identity, history, and logged-in accounts, so the same agent loops that OSWorld-style benchmarks already use can be pointed at tasks that require knowing who the user is.

Motivation

Web benchmarks exclude any site that requires logging in or variable personal information, ruling out a large fraction of what real users ask their assistants to do. Desktop benchmarks seed only what the task literally needs. MyPCBench pins a single user identity — Michael Scott — and spans the consumer-application surface that personal computers actually run: banking, travel, food delivery, calendar, messaging.

Task source

184 tasks rewritten from 2,749 anonymized requests collected from the OpenClaw Discord, with named entities re-aligned to the persona's seeded data.

Main result

The strongest evaluated agent, Claude Opus 4.6, solves 55.4% of MyPCBench tasks perfectly and 36% of the 7+-application slice. GPT-5.4, Qwen 3.5 35B-A3B, and Qwen 3.5 9B reach 0% perfect on that slice.

Released artifacts

  • Docker environment image with the persona seed.
  • 184 tasks with per-rubric grading specifications.
  • Agent harness over the OSWorld VNC bridge.
  • Gemini-flash-lite-based rubric judge.
Overview of MyPCBench
Figure 1 · Overview of MyPCBench. The environment hosts 17 pre-logged-in web apps mirroring real consumer products plus the full LibreOffice suite. The persona's records (1,429 bank txns, 1,162 emails, 656 calendar events, 2,586 chats, 426 orders, 10,776 web visits, 35 bookmarks) are cross-linked so that one trip leaves correlated records across every app that would plausibly record it.
§ 02 · Persona

Persona

The canonical persona is Michael Scott, regional manager of a paper company in Scranton, Pennsylvania. His desktop is populated by a deterministic Python pipeline from a single JSON specification covering identity, financial profile, social network, travel history, work context, routines, preferences, and recent and upcoming life events.
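The shape of such a specification can be sketched as follows. This is an illustrative mock-up, not the benchmark's released schema; every field name here is an assumption.

```python
# Hypothetical sketch of the single JSON persona specification that seeds the
# environment. Field names are illustrative, NOT the benchmark's actual schema.
import json

persona_spec = {
    "identity": {"name": "Michael G. Scott", "location": "Scranton, PA"},
    "work": {"employer": "Dunder Mifflin", "role": "Regional Manager"},
    "financial": {"bank": "Gringotts", "vehicle": "Chrysler Sebring"},
    "loyalty": {"program": "Dinoco FlyMiles", "tier": "Silver"},
    # The real spec also covers social network, travel history, routines,
    # preferences, and recent/upcoming life events.
    "social": [],
    "travel_history": [],
    "routines": [],
    "events": [],
}

# A deterministic pipeline would load this file once and fan it out to every
# app's SQLite database, so the same spec always yields the same desktop.
spec_json = json.dumps(persona_spec, sort_keys=True)
```

Serializing with `sort_keys=True` is one simple way to keep the seed byte-stable across runs.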

Persona Profile Records

Persona file

Michael G. Scott

Regional Manager · Dunder Mifflin Paper Company, Scranton Branch

Location Scranton, PA
Employer Dunder Mifflin
Vehicle Chrysler Sebring
Bank Gringotts (Chase clone)
Loyalty Dinoco FlyMiles · Silver

Cross-app consistency

Any trip, dinner, or client deal leaves correlated records in every application that would plausibly record it. Michael's Philadelphia trip generates a Cheskepdia (Airbnb) booking, two Gringotts (Chase) charges, a HooliCalendar (Google Calendar) block, two Dinoco (Delta) boarding passes, browsing history for “Radisson Blu Warwick”, three Travel-folder emails, and HooliChat (WhatsApp) messages referencing the trip — written together at seed time so they line up at boot.
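The fan-out above can be sketched as a seed-time expansion from one event into per-app records. This is a minimal illustration under assumed record shapes, not the released pipeline; the function and field names are hypothetical.

```python
# Illustrative sketch (NOT the released pipeline) of how one seed-time event
# fans out into correlated per-app records so they line up at boot.
from datetime import date

def seed_trip(trip):
    """Expand a single trip into the records each app would plausibly hold."""
    records = []
    # Lodging app sees the booking.
    records.append(("Cheskepdia", {"booking": trip["lodging"], "dates": trip["dates"]}))
    # Bank sees one charge per payment.
    for charge in trip["charges"]:
        records.append(("Gringotts", {"txn": charge, "date": trip["dates"][0]}))
    # Calendar blocks the travel window.
    records.append(("HooliCalendar", {"block": trip["name"], "dates": trip["dates"]}))
    # Airline issues one boarding pass per flight leg.
    for leg in trip["flights"]:
        records.append(("Dinoco", {"boarding_pass": leg}))
    return records

philly = {
    "name": "Philadelphia trip",
    "dates": [date(2025, 3, 7), date(2025, 3, 9)],  # hypothetical dates
    "lodging": "Radisson Blu Warwick",
    "charges": [182.40, 96.10],                     # hypothetical amounts
    "flights": ["AVP-PHL", "PHL-AVP"],
}
rows = seed_trip(philly)
```

Because all records derive from the one `trip` object, the booking, the two bank charges, the calendar block, and the two boarding passes cannot drift out of sync.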

Realistic UI fidelity

Each web application is a custom Next.js clone backed by SQLite. Gringotts supports transfers, bill pay, Zelle, and statement downloads. Dinoco Airlines generates boarding passes with QR codes. eTaxi uses OSRM for realistic routing over 700+ Scranton-area locations. TableFind exposes a reservation inventory of 3,360 slots with hold-and-release semantics. Across the canonical seed the 17 apps expose 185 distinct database tables and roughly 18,000 rows of state.

Michael Scott's seeded Linux desktop
Live VM · 1280 × 800 · GNOME Shell
1,429 bank txns · 1,162 emails · 656 calendar events · 2,586 chat msgs · 426 orders & rides · 10,776 page visits · 35 bookmarks
§ 03 · Applications

Applications

The 17 applications span six SimilarWeb top-level categories — Computers/Tech, Finance, Travel, Food, Ecommerce, Gambling — and 14 distinct subcategories. Each is a full Next.js application backed by SQLite, deterministically populated from the persona specification.

  • HooliMail (Gmail)
  • HooliCalendar (Google Calendar)
  • HooliChat (WhatsApp)
  • HooliWork (Slack)
  • SprintBoard (Jira / Asana)
  • LockedIn (LinkedIn)
  • Gringotts (Chase)
  • BatBucks (Robinhood)
  • SpeedTax (TurboTax)
  • Dinoco Airlines (Delta)
  • Cheskepdia (Airbnb)
  • eTaxi (Uber)
  • HangryDash (DoorDash)
  • TableFind (OpenTable)
  • HooliShop (Amazon)
  • Kwik-E-Mart (Instacart)
  • OddsMarket (DraftKings)
Digital life: all 17 apps and four example multi-app tasks.
Figure 2 · The four example tasks (top-left) each require threading 3–5 apps together against Michael's seeded history.
§ 04 · Tasks

Task suite

Tasks touch anywhere from one to nineteen distinct applications; 68% are multi-application and 40% span at least two SimilarWeb top-level categories. The behavioural taxonomy separates analysis tasks (lookup, aggregation, inference, reconciliation; 72 tasks) from action tasks (bounded versus orchestrated; 112 tasks).

Type · # tasks · Representative instruction

Bounded action · 64 (35%) · “Zelle Pam a hundred bucks. She covered me last weekend. Check HooliChat first to make sure Pam Beesly is on my contacts, then put a memo on the transfer.”

Multi-step orchestration · 48 (26%) · “The Threat Level Midnight Fan Club has been dormant. Peek at the group chat, scroll my LockedIn contacts for Dunder Mifflin folks to recruit, draft them an invitation email, and book a watch party on my calendar for next month.”

Cross-source reconciliation · 25 (14%) · “I've got the Jamaica trip AND the Barbados trip booked about four weeks apart. Given my credit-card balance, can I actually afford both, or am I about to max out?”

Aggregation & reporting · 23 (12%) · “How much am I sending via Zelle each month, and who's getting the money? Check the most recent two complete calendar months and rank the recipients in a LibreOffice Calc spreadsheet.”

Personal lookup · 13 (7%) · “What's my current FlyMiles loyalty tier on Dinoco Airlines, and how many miles do I have in the bank?”

Pattern inference · 11 (6%) · “What do I usually tip on food delivery, in dollars and as a percent? I want to set a smart default so I'm not thinking about it every order.”
Task distribution: apps-per-task, domain coverage, behavioural split
Figure 3 · Left: distribution of tasks by the number of distinct applications they touch. Middle: per-domain task coverage (non-exclusive). Right: behavioural task-type split.

Browse all 184 tasks: tasks.json

§ 05 · Methods

Methodology

The benchmark is designed for cross-application consistency, deterministic reproducibility, and real-world UI fidelity. The harness is a thin extension of the OSWorld runner, and grading uses a rubric-by-rubric LLM-as-a-judge over the full trajectory.

Environment

The benchmark ships as a Docker image that runs a real QEMU/KVM Ubuntu 24.04 VM with GNOME Shell. The VM hosts 17 pre-logged-in websites, the full LibreOffice suite (Writer, Calc, Impress), and a Firefox profile pre-loaded with realistic browsing history and bookmarks. HooliWork (Slack) and HooliChat (WhatsApp) are exposed as native desktop apps. Boot-to-ready is ~90 seconds; a base snapshot is captured after first boot and used to reset between tasks, avoiding state leakage.

Harness · POMDP

We model the environment as a partially observable MDP. At each step the agent receives a 1280×800 screenshot of the full Linux desktop plus its running tool-call history, and emits an action that the harness executes through the OSWorld VNC bridge. The action space is the unmodified OSWorld pyautogui surface (click, type, key, scroll, drag, wait, screenshot, done, fail). Provider-native CUA APIs are mapped onto this surface through a unified translation layer.
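The loop above can be sketched in a few lines. The class and method names here (`env.screenshot`, `agent.act`, `env.execute`) are assumptions for illustration; the real harness drives the OSWorld VNC bridge.

```python
# Minimal sketch of the screenshot -> action POMDP loop. Names are assumed,
# not the harness's actual API.
TERMINAL = {"done", "fail"}

def run_episode(env, agent, max_steps=100):
    """Feed the agent full-desktop screenshots plus its own action history."""
    history = []
    for _ in range(max_steps):
        screenshot = env.screenshot()            # 1280x800 desktop capture
        action = agent.act(screenshot, history)  # e.g. {"type": "click", ...}
        history.append(action)
        if action["type"] in TERMINAL:           # done / fail end the episode
            break
        env.execute(action)                      # pyautogui surface via VNC
    return history
```

The 100-step budget used in the evaluations corresponds to `max_steps=100`.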

Grading · rubric-by-rubric LLM judge

Every task ships with its own set of rubrics — natural-language criteria authored alongside the task and audited during the QA pass. Rubrics range from 3 to 13 items, with a mean of 6.5 per task and 1,191 rubric items in total. The judge runs once per rubric over the full trajectory: task instruction, that one rubric, the agent's complete tool-call log, and every screenshot. The judge returns “success” or “failure” for that item. We use gemini-3.1-flash-lite-preview as the judge throughout.
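One way to structure that loop is a single judge call per rubric item, each receiving the full trajectory. This is a sketch under assumed data shapes; `call_judge` stands in for the Gemini judge invocation and is not the released interface.

```python
# Sketch of rubric-by-rubric judging: one judge call per rubric item over the
# full trajectory. `call_judge` is a placeholder for the Gemini judge call.
def grade_task(task, trajectory, screenshots, call_judge):
    """Return one boolean verdict per rubric item."""
    verdicts = []
    for rubric in task["rubrics"]:
        prompt = {
            "instruction": task["instruction"],
            "rubric": rubric,              # exactly one criterion per call
            "tool_calls": trajectory,      # the agent's complete action log
            "screenshots": screenshots,    # every captured frame
        }
        verdicts.append(call_judge(prompt) == "success")
    return verdicts
```

Judging each rubric in isolation keeps the judge's decision binary and narrowly scoped, rather than asking for one holistic pass/fail over the whole task.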

Three reported metrics

  • Perfect rate — fraction of tasks for which every rubric in the task passed (headline metric).
  • Rubric score — per-task average of rubric pass rates, averaged across tasks (gives partial credit).
  • Trajectory efficiency — rubric score per agent step, in percent.
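From per-task rubric verdicts and step counts, the three metrics can be computed as follows. Whether trajectory efficiency is averaged per task exactly this way is an assumption; the sketch shows the arithmetic of each definition.

```python
# The three reported metrics, computed from per-task rubric verdicts.
# Each task is {"verdicts": [bool, ...], "steps": int}. The per-task
# averaging of trajectory efficiency is an assumed detail.
def perfect_rate(tasks):
    """Fraction of tasks for which every rubric passed."""
    return sum(all(t["verdicts"]) for t in tasks) / len(tasks)

def rubric_score(tasks):
    """Per-task rubric pass rate, averaged across tasks (partial credit)."""
    return sum(sum(t["verdicts"]) / len(t["verdicts"]) for t in tasks) / len(tasks)

def trajectory_efficiency(tasks):
    """Rubric score per agent step, in percent, averaged across tasks."""
    return sum(
        100 * (sum(t["verdicts"]) / len(t["verdicts"])) / t["steps"]
        for t in tasks
    ) / len(tasks)
```

Perfect rate is the strictest of the three: a task with 12 of 13 rubrics passed contributes 0 to the perfect rate but 92.3% to the rubric score.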
The custom task-review interface
Figure 4 · The custom web interface used for the QA pass. Each task was reviewed end-to-end by at least two authors against the live VM.
§ 06 · Results

Main results

Six models evaluated under each provider's native CUA agent. Every run uses a 100-step budget, the shared persona-and-environment context, and the same Gemini judge.


Per-category breakdown

Rubric-score percentages, models × the nine task categories. Darker = higher score.

Perfect rate vs apps touched per task
Figure 5 · Perfect-task rate vs. the number of distinct applications a task touches. At 7+ apps, GPT-5.4, Qwen 35B, and Qwen 9B all reach 0% perfect — only the Claude tier and GPT-5.4 mini perfect any 7+-app task at all.
§ 07 · Trajectories

Sample trajectories

Twelve curated trajectories, two per model, sampled across the behavioural categories. The viewer steps through the agent's screen-by-screen actions; the rubric checklist on the right transitions at the step the judge cited as evidence, and the cumulative weighted score updates accordingly.


§ 08 · Failure modes

Failure modes

We read every failed-rubric judge explanation and tagged it with up to five categories. Skipped required apps and premature DONE account for most of the loss. Per-family signatures: GPT concentrates in early termination, Qwen in persona-data hallucination, and Claude in console-script shortcuts that hit app REST endpoints rather than driving the visible UI.

  • Skipped required app — 578 hits · all families. Agent never visits one of the apps the rubric requires — typically because it commits to a single source (search, email) and never branches across the surface.
  • Premature DONE — 532 hits · GPT family. Agent emits DONE before satisfying every rubric; concentrates in GPT-5.4 (307 of 532 hits), whose zero-score trajectories average 17–20 steps before abandonment.
  • Surface error → terminal — 123 hits · all families. A single page error or unexpected modal terminates the trajectory; agents do not recover the underlying intent after the surface error.
  • Partial artifact — 90 hits · Sonnet, GPT. Output exists but is missing a required field, sheet, or section the rubric calls for.
  • Hallucinated persona data — 36 hits · Qwen family. Agent invents persona-specific values (24 of 36 hits in Qwen) — a friend, a route, a bank balance that does not exist in the seeded environment.

Mean step count on zero-score trajectories splits the families the same way: GPT-5.4 (17.7), GPT-5.4 mini (19.7), and Sonnet (29.7) abandon early, while Opus (52.1), Qwen 35B (52.6), and Qwen 9B (39.2) keep going until they hit the step budget.

§ 09 · Citation

Citation

If MyPCBench is useful in your research, please cite the paper.

@misc{jang2026mypcbench,
  title  = {{MyPCBench}: A Benchmark for Personally Intelligent Computer-Use Agents},
  author = {Jang, Lawrence Keunho and Jang, Andrew Keunwoo and Koh, Jing Yu and Salakhutdinov, Ruslan},
  year   = {2026}
}