CityWalkAgent is an autonomous urban walking agent that navigates and evaluates pedestrian environments with Vision-Language Models, treating each walk as a continuous narrative rather than an aggregate of single-point scores.
Urban perception studies have long aggregated street-level imagery into single-point scores, losing the temporal structure of how pedestrians actually experience a city. CityWalkAgent reframes urban walking as a sequential cognitive process: a dual-system VLM agent walks panorama-by-panorama through Google Street View, building short-term memory of recent observations and generating episodic snapshots when narrative shifts occur.
Drawing on Kahneman's dual-process theory, Cullen's Serial Vision, and Lynch's Image of the City, the system separates fast per-waypoint perception (System 1) from slower reflective interpretation, planning, and decision-making (System 2). Five personas — homebuyer, parent, photographer, runner, tourist — perceive identical environments differently through prompt-conditioned evaluation across the four Place Pulse dimensions.
We validate against Place Pulse 2.0's 1.1M human judgments via CLIP+K-NN, achieving Spearman correlations of ρ = 0.57–0.85 across dimensions. The framework reveals within-route variance and hidden barriers that aggregate scoring methods systematically miss.
Fast perception meets slow reflection. Each waypoint is evaluated in isolation, while a parallel pipeline weaves observations into coherent episodes.
Input panoramas flow through a fast System 1 (per-waypoint VLM evaluation with 3-signal gating and short-term memory) in parallel with a slower System 2 (Reporter → Interpreter → Planner → Decider → Episode).
↳ Inspired by Kahneman (dual-process) · Cullen (serial vision) · Lynch (image of the city) · Soft-priority navigation via Google Directions wp_bearing · Async parallel inference with Semaphore + retry
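How the two systems interlock, as a minimal sketch: System 1 scores every waypoint against a rolling short-term memory, and a three-signal gate decides when to wake System 2 for an episode. The helpers, memory size, and exact gate signals below are illustrative assumptions, not the shipped implementation.

```python
from collections import deque

SHORT_TERM = 5                                  # rolling memory size (assumption)

def system1_score(pano, persona, memory) -> dict:
    """Fast path: one per-waypoint VLM evaluation (stubbed here)."""
    ...

def system2_episode(memory, persona) -> dict:
    """Slow path: Reporter → Interpreter → Planner → Decider chain (stubbed)."""
    ...

def walk(panoramas, persona):
    """Dual-system walk loop: fast scoring every step, slow episodes on demand."""
    memory = deque(maxlen=SHORT_TERM)           # System 1 short-term memory
    episodes = []
    for pano in panoramas:
        obs = system1_score(pano, persona, list(memory))
        memory.append(obs)
        # 3-signal gate (illustrative): wake System 2 only on a narrative shift
        if (obs["score_delta"] > 1.0            # abrupt jump in dimension scores
                or obs["scene_changed"]         # scene-type transition
                or obs["salient_event"]):       # persona-salient observation
            episodes.append(system2_episode(list(memory), persona))
    return episodes
```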
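And the "Semaphore + retry" pattern behind the parallel inference, sketched with asyncio. `call_vlm` stands in for the project's VLM client; the concurrency limit and backoff policy are illustrative.

```python
import asyncio
import random

async def call_vlm(panorama_id: str) -> dict:
    """Placeholder for one per-waypoint VLM request."""
    ...

async def evaluate_waypoint(panorama_id: str, sem: asyncio.Semaphore,
                            retries: int = 3) -> dict:
    async with sem:                              # cap in-flight VLM requests
        for attempt in range(retries):
            try:
                return await call_vlm(panorama_id)
            except Exception:
                if attempt == retries - 1:
                    raise
                # exponential backoff with jitter before retrying
                await asyncio.sleep(2 ** attempt + random.random())

async def evaluate_route(panorama_ids: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(8)                   # illustrative concurrency limit
    return await asyncio.gather(
        *(evaluate_waypoint(p, sem) for p in panorama_ids))
```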
A walk is a story. Per-point averaging discards the narrative arc that makes pedestrian experience legible.
A parent and a photographer walk the same street and see different cities. Personas live in the prompt, not in post-hoc transforms.
CLIP+K-NN against Place Pulse 2.0's 1.1M judgments grounds VLM scores in real human perception.
An anthology of how persona-conditioned VLMs read the same 300 meters of Tsim Sha Tsui promenade: five agents walk one route and read it entirely differently. Hover a persona to focus their account.
Figure 2. Score divergence across personas reflects perception-layer prompt conditioning, not post-hoc reweighting. VLM: Qwen3-VL-30B-A3B. Dimensions: Place Pulse 2.0.
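What perception-layer conditioning looks like in practice, as a sketch: the persona is baked into the prompt the VLM sees, so divergence happens at perception time. The template wording and the dimension list shown here are assumptions for illustration, not the exact prompt.

```python
# Persona lives in the prompt, not in a post-hoc transform.
PERSONAS = {
    "parent": "a parent walking with a young child, alert to traffic and noise",
    "photographer": "a street photographer hunting for light and composition",
    "runner": "a runner judging surfaces, shade, and continuity of the path",
}

TEMPLATE = (
    "You are {persona}. Study this street-view panorama and rate it 0-10 on: "
    "safe, lively, beautiful, boring. Respond with JSON only, e.g. "
    '{{"safe": 6, "lively": 4, "beautiful": 7, "boring": 3}}.'
)

def build_prompt(persona_key: str) -> str:
    return TEMPLATE.format(persona=PERSONAS[persona_key])
```

Because conditioning happens at this layer, the divergence in Figure 2 is raw model output; nothing is reweighted downstream.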
A pre-recorded walk along the Tsim Sha Tsui waterfront promenade, replayed at the same cadence as the live agent. Press play and watch the agent reason in real time.
Each route surfaces a distinct urban character. Score trends, radar, and narration update live — same agent, same city, different story every time.
CLIP+K-NN over Place Pulse 2.0 yields Spearman ρ between 0.57 and 0.85 across dimensions, with the strongest agreement on Beautiful and the weakest on Safety — consistent with prior crowd-perception literature.
Spearman correlations and K-NN regression metrics across four Place Pulse dimensions. Average ρ = 0.712, p < 0.001 across all dimensions. Distribution plots compare VLM (blue) vs. KNN baseline (orange) Z-scores.
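The grounding step, sketched under assumptions: Place Pulse images are embedded with CLIP, a K-NN regressor maps route panoramas onto human Z-scores, and Spearman ρ measures agreement with the agent's VLM scores. The model checkpoint ("ViT-B-32" / "openai" via open_clip) and k are illustrative choices.

```python
import torch
import open_clip
from scipy.stats import spearmanr
from sklearn.neighbors import KNeighborsRegressor

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
model.eval()

def embed(images):
    """PIL images -> L2-normalized CLIP image features."""
    with torch.no_grad():
        batch = torch.stack([preprocess(im) for im in images])
        feats = model.encode_image(batch)
    return (feats / feats.norm(dim=-1, keepdim=True)).numpy()

def validate_dimension(pp_images, pp_human_z, route_images, vlm_scores, k=25):
    """Ground one dimension's VLM scores in Place Pulse human judgments."""
    knn = KNeighborsRegressor(n_neighbors=k, weights="distance")
    knn.fit(embed(pp_images), pp_human_z)           # human Z-scores as targets
    human_grounded = knn.predict(embed(route_images))
    return spearmanr(vlm_scores, human_grounded)    # (rho, p-value)
```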
Each persona produces meaningfully different path geometry from identical start coordinates — revealing how perception shapes movement through the city.
Generated routes for three personas (homebuyer, parent, photographer) across Singapore Toa Payoh (20 steps / 30 steps) and Hong Kong Mong Kok (30 steps).
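The divergence above comes from the soft-priority navigation noted earlier: candidate panoramas aligned with the Directions-derived wp_bearing are preferred, but persona perception scores can pull the walk off-axis. A minimal sketch, with the blend weight and scoring callback as assumptions:

```python
def bearing_penalty(candidate_bearing: float, wp_bearing: float) -> float:
    """Angular deviation from the preferred heading, normalized to [0, 1]."""
    delta = abs((candidate_bearing - wp_bearing + 180) % 360 - 180)
    return delta / 180.0

def choose_next(candidates, wp_bearing, persona_score, alpha=0.6):
    """Soft priority: bearing alignment dominates, perception can override.

    candidates: [(pano_id, bearing_deg), ...]
    persona_score: pano_id -> perception score in [0, 1]
    """
    def utility(cand):
        pano_id, bearing = cand
        return (alpha * (1 - bearing_penalty(bearing, wp_bearing))
                + (1 - alpha) * persona_score(pano_id))
    return max(candidates, key=utility)
```

With alpha = 0.6 the route still trends along the planned bearing, but a photographer whose perception spikes at a side alley can win the choice, which is one way identical start coordinates can yield different geometries.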