Reading cities
like a pedestrian

CityWalkAgent is an autonomous urban walking agent that navigates and evaluates pedestrian environments through Vision-Language Models, treating walks as continuous narratives rather than aggregated point assessments.

0.85
Spearman ρ vs. human
1.1M+
Place Pulse human judgments
5+
Agent personas implemented
~12s
Per-intersection inference
01 · ABSTRACT

Concept diagram: Street View + Persona → MLLM → Visual Understanding
Fig. 0 · Core research question — can an MLLM evaluate built environments on behalf of a human persona?

Urban perception studies have long aggregated street-level imagery into single-point scores, losing the temporal structure of how pedestrians actually experience a city. CityWalkAgent reframes urban walking as a sequential cognitive process: a dual-system VLM agent walks panorama-by-panorama through Google Street View, building short-term memory of recent observations and generating episodic snapshots when narrative shifts occur.
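The "episodic snapshot when a narrative shift occurs" idea can be sketched as a rolling memory over recent waypoint scores. This is a minimal illustration, not the project's implementation: the window size and shift threshold below are hypothetical parameters.

```python
from collections import deque


class ShortTermMemory:
    """Rolling window of recent per-waypoint scores (System 1 output).

    Sketch only: the window size and shift threshold are hypothetical,
    not CityWalkAgent's actual parameters.
    """

    def __init__(self, window: int = 5, threshold: float = 1.0):
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def observe(self, score: float) -> bool:
        """Record a waypoint score; return True when the score deviates
        enough from the recent mean to warrant an episodic snapshot."""
        shift = False
        if self.scores:
            mean = sum(self.scores) / len(self.scores)
            shift = abs(score - mean) > self.threshold
        self.scores.append(score)
        return shift
```

A steady stretch of waterfront yields no snapshots; a sudden drop (an underpass, a construction hoarding) trips the gate and opens a new episode.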

Drawing on Kahneman's dual-process theory, Cullen's Serial Vision, and Lynch's Image of the City, the system separates fast per-waypoint perception (System 1) from slower reflective interpretation, planning, and decision-making (System 2). Five personas — homebuyer, parent, photographer, runner, tourist — perceive identical environments differently through prompt-conditioned evaluation across the four Place Pulse dimensions.
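Persona conditioning at the perception layer can be sketched as below; the persona descriptions and scoring instructions are hypothetical stand-ins, not the project's actual prompts.

```python
# Sketch of perception-layer persona conditioning: the persona lives in the
# prompt handed to the VLM, not in a post-hoc transform of the scores.
# Persona texts and wording here are illustrative, not the project's prompts.
DIMENSIONS = ["safe", "lively", "beautiful", "wealthy"]

PERSONAS = {
    "parent": "a parent walking with a young child, alert to traffic and hazards",
    "photographer": "a street photographer scanning for light and composition",
}


def build_prompt(persona: str) -> str:
    """Compose one evaluation prompt asking the VLM to score the current
    panorama on each Place Pulse dimension, in character."""
    dims = ", ".join(DIMENSIONS)
    return (
        f"You are {PERSONAS[persona]}. Looking at this street-view panorama, "
        f"rate it from 1 to 10 on each dimension: {dims}. "
        'Reply as JSON, e.g. {"safe": 6, "lively": 4, ...}.'
    )
```

Because the persona is part of the perception prompt itself, two agents given identical pixels return genuinely different score trajectories, rather than one trajectory rescaled after the fact.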

We validate against Place Pulse 2.0's 1.1M human judgments via CLIP+K-NN, achieving Spearman correlations of ρ = 0.57–0.85 across dimensions. The framework reveals within-route variance and hidden barriers that aggregate scoring methods systematically miss.
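In outline, the CLIP+K-NN protocol embeds each panorama, predicts a "human" score as the mean score of its nearest Place Pulse neighbours, and compares rankings via Spearman's ρ. A dependency-free sketch on toy data (no real CLIP features; the helpers and parameters are hypothetical):

```python
import math
import random


def spearman(xs, ys):
    """Spearman's rho via the rank-difference formula (assumes no ties)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def knn_predict(train, query, k=5):
    """Predict a human score as the mean score of the k embeddings
    nearest to `query` (a stand-in for a CLIP image feature)."""
    nearest = sorted(train, key=lambda t: math.dist(t[0], query))[:k]
    return sum(score for _, score in nearest) / k


def sample():
    # Toy "embedding" whose first coordinate drives the human score.
    x = random.random()
    return (x, random.random()), x + 0.05 * random.random()


random.seed(0)
data = [sample() for _ in range(200)]
train, test = data[:150], data[150:]
pred = [knn_predict(train, emb) for emb, _ in test]
gold = [score for _, score in test]
print(f"Spearman rho = {spearman(pred, gold):.2f}")
```

The real pipeline works the same way, with CLIP image embeddings in place of the toy vectors and Place Pulse 2.0 Q-scores as the neighbour labels.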

02 · SYSTEM

A dual-process cognitive architecture.

Fast perception meets slow reflection. System 1 scores each waypoint as it arrives, while a parallel System 2 pipeline weaves those observations into coherent episodes.

▌ FIGURE 1 · ARCHITECTURE

DUAL-PROCESS PIPELINE.

Input panoramas flow through a fast System 1 (per-waypoint VLM evaluation with 3-signal gating and short-term memory) in parallel with a slower System 2 (Reporter → Interpreter → Planner → Decider → Episode).

CityWalkAgent dual-process architecture diagram

↳ Inspired by Kahneman (dual-process) · Cullen (serial vision) · Lynch (image of the city)  ·  Soft-priority navigation via Google Directions wp_bearing  ·  Async parallel inference with Semaphore + retry
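The "async parallel inference with Semaphore + retry" note can be sketched with `asyncio`; the concurrency cap, retry count, and the placeholder VLM call below are illustrative assumptions, not the project's settings.

```python
import asyncio


async def walk(panoramas, max_concurrent=4, retries=3):
    """Evaluate every panorama concurrently, capped by a semaphore,
    retrying failed calls with exponential backoff. The cap, retry
    count, and fake_vlm_call placeholder are illustrative."""
    sem = asyncio.Semaphore(max_concurrent)

    async def fake_vlm_call(pano_id):
        await asyncio.sleep(0)       # stand-in for a real VLM round-trip
        return f"scores for {pano_id}"

    async def infer(pano_id):
        async with sem:              # at most max_concurrent in flight
            for attempt in range(retries):
                try:
                    return await fake_vlm_call(pano_id)
                except Exception:
                    await asyncio.sleep(2 ** attempt)  # backoff: 1s, 2s, 4s…
            raise RuntimeError(f"inference failed for {pano_id}")

    # Fire all waypoints at once; the semaphore does the throttling,
    # and gather preserves waypoint order.
    return await asyncio.gather(*(infer(p) for p in panoramas))


results = asyncio.run(walk([f"pano_{i}" for i in range(8)]))
```

Bounding concurrency this way keeps per-intersection latency predictable while staying within API rate limits.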

PRINCIPLE 01
Sequential, not aggregate.

A walk is a story. Per-point averaging discards the narrative arc that makes pedestrian experience legible.

PRINCIPLE 02
Persona shapes perception.

A parent and a photographer walk the same street and see different cities. Personas live in the prompt, not in post-hoc transforms.

PRINCIPLE 03
Validated against humans.

CLIP+K-NN against Place Pulse 2.0's 1.1M judgments grounds VLM scores in real human perception.

03 · ANTHOLOGY

Five walkers. One waterfront.

An anthology of how persona-conditioned VLMs read the same 300 meters of Tsim Sha Tsui promenade. Hover an agent to focus their reading.

▌ FIGURE 2 · ANTHOLOGY

FIVE WALKERS. ONE ROUTE.

Five persona-conditioned agents walk the same route — and read it entirely differently. Hover a persona to focus their account.

22.2935, 114.1720 ▸ START
SALISBURY GARDEN
AVENUE OF STARS
VICTORIA HARBOUR
END ◂ 22.2950, 114.1745

Figure 2. Score divergence across personas reflects perception-layer prompt conditioning, not post-hoc reweighting. VLM: Qwen3-VL-30B-A3B. Scoring dimensions: Place Pulse 2.0.

04 · DEMO

Walk along.

A pre-recorded walk along the Tsim Sha Tsui waterfront promenade, replayed at the same cadence as the live agent. Press play and watch the agent reason in real time.

▌ FIGURE 3 · LIVE WALK REPLAY

DIFFERENT ROUTE. DIFFERENT FEELING.

Each route surfaces a distinct urban character. Score trends, radar, and narration update live — same agent, same city, different story every time.

INTERPRET · Scene reading
PLAN · Route reasoning
DECIDE · Autonomous routing
REPORT · Live narration

05 · VALIDATION

Validation against MIT Place Pulse 2.0.

CLIP+K-NN over Place Pulse 2.0 yields Spearman ρ between 0.57 and 0.85 across dimensions, with the strongest agreement on Wealthy and Beautiful and the weakest on Safety — consistent with prior crowd-perception literature.

SAFETY
0.57
Spearman ρ
LIVELY
0.67
Spearman ρ
BEAUTIFUL
0.80
Spearman ρ
WEALTHY
0.82
Spearman ρ
▌ FIGURE 4 · VALIDATION

VLM VS. HUMAN JUDGMENT.

Spearman correlations and K-NN regression metrics across four Place Pulse dimensions. Average ρ = 0.712, p < 0.001 across all dimensions. Distribution plots compare VLM (blue) vs. KNN baseline (orange) Z-scores.

Validation scatter plots and distribution comparison — VLM vs. KNN vs. Place Pulse 2.0

06 · RESULTS

Persona-conditioned routes.

Each persona produces meaningfully different path geometry from identical start coordinates — revealing how perception shapes movement through the city.

▌ FIGURE 5 · ROUTE RESULTS

PERSONA-CONDITIONED ROUTES.

Generated routes for three personas (homebuyer, parent, photographer) across Singapore Toa Payoh (20 steps / 30 steps) and Hong Kong Mong Kok (30 steps). Each persona produces meaningfully different path geometry from identical start coordinates.

Route maps — homebuyer, parent, photographer across Singapore and Hong Kong