AgentSociety — City-Scale LLM Social Simulation

TL;DR

Why it matters. Prior LLM-agent simulations tend to live either in text-only sandboxes at small scale or in heavily simplified grids at massive scale. This paper attaches a real urban road network, a macroeconomic accounting loop, and a moderated social graph to 10,000 LLM agents averaging ~500 interactions per day.
What it proposes. A three-layer platform that couples psychological theory (Maslow needs, Theory of Planned Behavior, a gravity model for mobility) to LLM reasoning, then pairs it with a purpose-built engine on Ray + MQTT + PostgreSQL to avoid I/O collapse.
Headline. Four field experiments — political polarization, inflammatory-message spread, Universal Basic Income, and hurricane mobility — are reproduced on a single codebase, with directional agreement against real-world counterparts in Texas UBI and Hurricane Dorian mobility.
Limits. Labor and goods markets, unemployment, and offline interaction are abstracted; LLM API throughput (DeepSeek-V3 off-peak) is the de facto ceiling on scale and wall-clock cost.
What to keep. The bottleneck at 10k agents is TCP ports and message routing, not the LLM alone. Actor grouping plus an IoT-style message bus is the quiet engineering contribution.

Background

LLM social simulation has bifurcated in the past two years. On one side sit Smallville-style stacks (Park et al. and descendants) that model rich minds for tens to hundreds of agents. On the other sit systems like OASIS that simplify the environment substantially and push agent counts to hundreds of thousands. The two camps sit at opposite ends of the same trade-off: the first does not scale to population-level experiments, while the second loses the environmental fidelity that keeps results legible against real-world outcomes.

The authors frame the gap along three axes: (1) whether the agent’s mind is grounded in psychology and behavioral science rather than ad-hoc role-play, (2) whether the environment captures the physical and economic constraints of a real city, and (3) whether the engine can drive 10k+ agents under asynchronous interaction without collapsing. Laying 17 prior simulators on these axes shows no platform clears all three. AgentSociety targets that empty cell.

Core Idea

**Fig. 1.** AgentSociety's three layers (mind, behavior, environment) bound by the simulation engine. Source: Piao et al. 2025, Fig. 1

The backbone is a closed mind-to-behavior-to-feedback loop. Each agent carries a static profile (demographics, personality) and a dynamic status (emotion, needs, cognition, economic position, social ties). Emotion — six basic affects at 0-10 intensity — provides the fast response layer. Needs, structured as a Maslow hierarchy, act as the persistent motivational driver. Cognition stores attitudes on topics and produces thoughts that feed back into future prompts. Behaviors are split: mobility, social interaction, and employment/consumption are modeled explicitly, while simpler actions (sleep, leisure choice) are handled by the LLM directly.

The move that separates this work from text-world simulators is that the environment is not text. Urban space is built from OpenStreetMap road networks, SafeGraph POIs, and AOIs, with driving (IDM acceleration, MOBIL lane-change), walking, bus, and taxi dispatch modeled as discrete-time physics. Social space layers a relationship graph with a “supervisor” middleware that can filter messages or ban users. Economic space runs firms, households, banks (Taylor-rule interest), and a government (progressive tax) as an accounting system — a stripped-down DSGE that settles every step.
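To make the "discrete-time physics" concrete, here is a minimal sketch of the standard Intelligent Driver Model (IDM) acceleration formula that the driving layer is said to use. The function name and every parameter default (desired speed, time headway, comfortable braking, minimum gap) are illustrative assumptions, not values reported in the paper.

```python
import math


def idm_acceleration(v, gap, dv, v0=13.9, T=1.5, a_max=1.0, b=2.0, s0=2.0, delta=4.0):
    """Standard IDM acceleration in m/s^2 (parameter defaults are assumptions).

    v   : current speed of the follower (m/s)
    gap : bumper-to-bumper distance to the leader (m), must be positive
    dv  : approach rate, follower speed minus leader speed (m/s)
    """
    if gap <= 0:
        raise ValueError("gap must be positive")
    # Desired dynamic gap: minimum gap + headway term + braking interaction term.
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    # Free-road term pulls speed toward v0; interaction term brakes when gap < s_star.
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)
```

In a discrete-time loop, each vehicle's speed would be advanced by this acceleration times the step length; a stopped car on an empty road accelerates at roughly `a_max`, and a car closing quickly on a nearby leader gets a strongly negative value.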

Method

3.1 Agent cognition

Mind-behavior coupling is done through theory-grounded modules rather than a single monolithic prompt. Emotion follows Shvo et al.’s framework: the LLM selects a keyword, writes a one-sentence thought, and rates six emotions from 0-10. Needs are a Maslow-structured JSON hierarchy re-ranked each step by active behavior, passive events, and current mental state. When a need rises to the top, the Theory of Planned Behavior produces a Need → Plan → Behavior Sequence chain. Cognition maintains a per-topic 0-10 attitude store that is updated by sentence-level summaries of completed behaviors, so attitudes and emotion move in lockstep.
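The Need → Plan → Behavior Sequence chain can be pictured with a small sketch. Everything here is hypothetical: the `Need` fields, the 0-1 urgency scale, the tie-breaking rule, and the `planner` callable that stands in for the LLM call are illustrative assumptions, not the paper's data model.

```python
from dataclasses import dataclass


@dataclass
class Need:
    name: str
    level: int      # 0 = most basic Maslow tier (assumed encoding)
    urgency: float  # 0.0 satisfied .. 1.0 pressing (assumed scale)


def top_need(needs):
    """Pick the most urgent need; ties go to the more basic Maslow tier."""
    return max(needs, key=lambda n: (n.urgency, -n.level))


def plan_for(need, planner):
    """Need -> Plan -> Behavior Sequence; `planner` stands in for the LLM."""
    plan = planner(need)
    return {"need": need.name, "plan": plan["goal"], "behaviors": list(plan["steps"])}
```

In the platform as described, the re-ranking itself is driven by active behavior, passive events, and mental state; the numeric `urgency` field is only a stand-in for that process.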

Memory is three-tier. A static Profile, a dynamic key-value Status, and a time-ordered Stream Memory split into an Event Flow (objective occurrences) and a Perception Flow (the agent’s read of those events). Perception nodes link back to event nodes, giving the agent a coupled objective-subjective history that prompts the next decision.
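One way to picture the three tiers and the perception-to-event link is the sketch below. Class and method names are hypothetical, and the integer index used as an event identifier is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class EventNode:
    t: int
    description: str  # objective occurrence


@dataclass
class PerceptionNode:
    t: int
    text: str      # the agent's subjective read
    event_id: int  # index of the event this perception refers to


@dataclass
class AgentMemory:
    profile: dict                                # static tier
    status: dict = field(default_factory=dict)   # dynamic key-value tier
    events: list = field(default_factory=list)       # Event Flow
    perceptions: list = field(default_factory=list)  # Perception Flow

    def record_event(self, t, description):
        self.events.append(EventNode(t, description))
        return len(self.events) - 1  # id used to link perceptions back

    def record_perception(self, t, text, event_id):
        if not 0 <= event_id < len(self.events):
            raise IndexError("perception must link to an existing event")
        self.perceptions.append(PerceptionNode(t, text, event_id))

    def recent_history(self, k=3):
        """Last k (event, perception) pairs, e.g. for building the next prompt."""
        return [(self.events[p.event_id].description, p.text)
                for p in self.perceptions[-k:]]
```

Keeping the two flows separate but linked lets a prompt quote both what happened and how the agent interpreted it.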

Mobility is decomposed into four steps: (i) extract intent from active needs, (ii) filter POI types (social need → cafes, parks), (iii) set a radius from internal state (age, stamina) and external state (weather, traffic), (iv) choose the destination with a gravity model $P_{ij}=\frac{S_j/D_{ij}^\alpha}{\sum_k S_k/D_{ik}^\alpha}$. Replacing the final LLM call with a deterministic model is the load-bearing choice; it preserves spatial rationality at scale and cuts token cost.
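The gravity-model step above can be sketched in a few lines. The sketch reads $S_j$ as the attractiveness of candidate POI $j$, $D_{ij}$ as the distance from the agent's location $i$ to $j$, and $\alpha$ as a distance-decay exponent; this is the usual gravity-model interpretation, but the summary does not define the symbols, so treat it as an assumption. The function names and the default `alpha=1.0` are also hypothetical.

```python
import random


def gravity_probabilities(attractiveness, distances, alpha=1.0):
    """P_ij = (S_j / D_ij^alpha) / sum_k (S_k / D_ik^alpha) over candidate POIs."""
    if len(attractiveness) != len(distances):
        raise ValueError("need one distance per candidate")
    if any(d <= 0 for d in distances):
        raise ValueError("distances must be positive")
    weights = [s / (d ** alpha) for s, d in zip(attractiveness, distances)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one candidate needs positive attractiveness")
    return [w / total for w in weights]


def choose_destination(attractiveness, distances, alpha=1.0, rng=None):
    """Sample the index of one candidate POI according to the gravity model."""
    rng = rng or random.Random()
    probs = gravity_probabilities(attractiveness, distances, alpha)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

With two equally attractive POIs at distances 1 and 2 and `alpha=1.0`, the nearer one gets two thirds of the probability; raising `alpha` to 2 shifts this to 0.8, which shows how the exponent controls how strongly distance is penalized.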

The social graph carries three relationship types (family, friend, colleague) each with a 0-100 strength. Target selection for a message factors relationship, strength, and topical fit, and tone shifts by relationship type. Online interaction is the primary mode. A supervisor middleware sits before message delivery and can classify content via an LLM, then apply node interventions (suspend accounts) or edge interventions (remove connections). This plumbing is what the inflammatory-messages experiment in 7.3 actually tests.
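A minimal sketch of such a supervisor follows. It is not the platform's API: the class, the `mode` switch, the directed-edge encoding, and the choice to drop the flagged message itself are all assumptions, and `classify` is a plain callable standing in for the LLM classifier.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Supervisor:
    classify: Callable[[str], bool]  # True means "flagged"; stands in for the LLM
    mode: str = "node"               # "node" = suspend sender, "edge" = cut the link
    suspended: set = field(default_factory=set)
    removed_edges: set = field(default_factory=set)

    def process(self, sender, receiver, text):
        """Return True if the message may be delivered, False if it is blocked."""
        if sender in self.suspended or (sender, receiver) in self.removed_edges:
            return False
        if self.classify(text):
            if self.mode == "node":
                self.suspended.add(sender)                  # node intervention
            elif self.mode == "edge":
                self.removed_edges.add((sender, receiver))  # edge intervention
            return False  # assumption: the flagged message is also dropped
        return True
```

The difference between the two modes is visible in the state they leave behind: a node intervention silences the sender toward everyone, while an edge intervention only closes one channel.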

3.3 Simulation infrastructure

The engine is the quiet contribution. Two concrete bottlenecks shaped the design. First, running each agent as its own process, with its own connections to the MQTT broker, database, and metric server, exhausts the 65,535 available TCP ports well before 10k agents. Second, SOP-driven frameworks like CAMEL and AgentScope impose an execution order on agent turns, which contradicts the independent-decision premise the simulation is supposed to capture.

The fix is three moves. Agents are batched into groups; each group runs as one Ray actor that reuses a single connection to each shared service. LLM calls are I/O-bound, so asyncio coroutines hide request latency while CPU cycles run deterministic work like the gravity model. Inter-agent messaging uses MQTT (emqx v5.8.1) — an IoT protocol tuned for millions of lightweight endpoints.
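The grouping-plus-coroutines idea can be sketched without Ray. In the sketch below a plain class stands in for a Ray actor, `asyncio.sleep` stands in for the LLM API round trip, and a placeholder object stands in for the shared service connection; all names are hypothetical and none of this is the platform's actual code.

```python
import asyncio


async def fake_llm_call(agent_id, latency=0.05):
    """Stand-in for an I/O-bound LLM request; the sleep models network wait."""
    await asyncio.sleep(latency)
    return f"agent-{agent_id}: decided"


class AgentGroup:
    """One group of agents; in the paper's design this would be one Ray actor."""

    def __init__(self, agent_ids):
        self.agent_ids = list(agent_ids)
        self.connection = object()  # placeholder: one shared connection per group

    async def step(self):
        # All agents in the group wait on their LLM calls concurrently.
        return await asyncio.gather(*(fake_llm_call(a) for a in self.agent_ids))


def make_groups(n_agents, group_size):
    """Batch agent ids 0..n_agents-1 into fixed-size groups."""
    return [AgentGroup(range(start, min(start + group_size, n_agents)))
            for start in range(0, n_agents, group_size)]


async def run_step(groups):
    """Advance every group by one step and flatten the per-agent results."""
    per_group = await asyncio.gather(*(g.step() for g in groups))
    return [result for group in per_group for result in group]
```

Calling `asyncio.run(run_step(make_groups(10, 4)))` creates three groups and three placeholder connections instead of ten, and the ten simulated requests overlap rather than running back to back. Because a step finishes only when every group has finished, the sketch also shows why fixed-size groups are paced by the slowest one.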

| System | Best parallel process count | Throughput (msg/s) |
| --- | --- | --- |
| MQTT (emqx v5.8.1) | 32 | 44,702 |
| Redis Pub/Sub (v6.2) | 16 | 81,216 |
| RabbitMQ (v4.0.5) | 16 | 23,667 |
| Kafka | n/a | fails to initialize in 5 min |

Redis wins raw throughput, but MQTT’s built-in observability tooling and topic-tree semantics decided the default. On the environment side, a 1M-agent load test holds a mean step time of 0.168 s at 10⁵ QPS — a headroom test rather than the experiment setup. Experiments use DeepSeek-V3 on Huawei Cloud c7.16xlarge.4, scheduled during the 05:00-07:00 off-peak window because LLM API throughput is what actually caps scale.
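For readers unfamiliar with MQTT's topic-tree semantics, the sketch below implements the protocol's wildcard matching rules: `+` matches exactly one level and `#` matches all remaining levels. It is a simplified matcher (it ignores the special handling of topics that start with `$`), and the topic names in the usage note are hypothetical, not the ones AgentSociety uses.

```python
def topic_matches(topic_filter, topic):
    """Return True if an MQTT topic filter matches a concrete topic name."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # Multi-level wildcard: only valid as the last filter level.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    # No multi-level wildcard seen, so the depths must agree exactly.
    return len(filter_levels) == len(topic_levels)
```

Under this scheme a per-agent inbox could be addressed as something like `sim/agents/42/inbox`, while a monitoring tool subscribes once to `sim/agents/+/inbox` or `sim/#` to observe traffic, which is the kind of routing a flat pub/sub channel namespace makes harder.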

Experiments

The value of the platform is that four distinct experiments run on the same codebase.

Polarization (7.2). A 100-agent panel debates gun control. The control group runs free, a homophilic treatment feeds each agent only persuasive messages aligned with their prior, and a heterogeneous treatment feeds only opposing messages. Control: 39% polarize further, 33% moderate. Homophilic: 52% polarize — the echo-chamber prediction. Heterogeneous: 89% moderate, 11% flip. Directionally aligned with lab results in political science.

Inflammatory messages (7.3). Seeded by a real case (the chained woman in Xuzhou), the simulator compares neutral and inflammatory seeds across a few hundred agents, with node and edge interventions layered on. Inflammatory seeds reach farther and raise emotional intensity; node interventions outperform edge interventions on both reach and affect. Interviews with the agents surface sympathy and perceived social responsibility as the main sharing motives.

Universal Basic Income (7.4). Demographics match Texas. Two runs — with and without a $1,000/month unconditional transfer introduced at step 96 — are compared over the next 24 steps. Consumption rises and CES-D depression scores fall under UBI, directionally matching the Texas field result. Agent interviews mention interest rates, long-term benefit, savings, and necessities — vocabulary that overlaps the real UBI discourse.

Hurricane Dorian (7.5). Columbia, South Carolina in 2019. Before arrival, Census Block Group activity ratios sit at 70-90%; during the storm they fall to ~30% and recover afterward. Simulated daily visits track SafeGraph ground truth on timing, with some underestimate of peak magnitude. The behavior change is driven by a single external shock message routed through the messaging system — no hand-coded evacuation rule.

**Fig. 2.** Real (SafeGraph) vs simulated daily visits during Hurricane Dorian, August 28 to September 5, 2019. Both series dip together around August 31 to September 2 and recover in parallel; the simulation undershoots the trough but matches timing. Source: Piao et al. 2025, Fig. 23

Limitations

The authors open with the economic modeling gap. The goods market is a price adjustment on aggregate demand — no explicit supply curve, competition, or market shocks. The labor market pays wages but has no unemployment or negotiation. The UBI result should therefore be read as directional rather than as a magnitude forecast. Offline social interaction is similarly sketched at the level of spatial proximity.

On the engineering side, LLM API throughput dominates wall-clock time as soon as agent counts rise. The environment holds 10⁵ QPS headroom but the DeepSeek endpoint does not, which is why the experiments hug the 05:00-07:00 window. Fixed-size agent groups are also bottlenecked by the slowest group; adaptive load balancing is named as future work. Results are tied to DeepSeek-V3 as the sole backbone; how the reproduced social effects hold across LLMs is an open question. And while the infrastructure scales to 10k agents, the polarization and inflammatory-message experiments themselves run with 100 to a few hundred agents; “10k” describes the infrastructure ceiling rather than the experimental grain.

Sapiens Q Take

Three observations worth carrying forward. First, the strongest general design move here is externalizing the environment. Handing mobility’s final step to a gravity model, settling wages and taxes in a deterministic ledger, and answering POI queries from a spatial server each removes a place where LLM hallucination compounds. The trend across the stack — LLM for subjective state, deterministic engines for objective state — is the most portable lesson, not the specific agent template. Second, the report that TCP ports and message routing, not token cost alone, are the 10k-agent bottleneck is genuinely useful for anyone sizing a similar platform. Ray actor grouping plus an IoT-protocol message bus is not a novel invention but it is an underused combination in agent research stacks. Third, the headline value is not any one of the four experiments; it is that the same stack runs all four. Multi-axis reproducibility is quietly becoming the de facto bar for social simulators, and future platforms will likely be compared against this kind of panel rather than a single headline number.

References

Original paper: AgentSociety: Large-Scale Simulation of LLM-Driven Generative Agents
Code: tsinghua-fib-lab/agentsociety
Theoretical grounding: Maslow, Hierarchy of Needs; Ajzen, Theory of Planned Behavior; Smith & Taylor on Dynamic Stochastic General Equilibrium
Comparison platforms: Park et al. (2023) Generative Agents; Yang et al. (2024) OASIS; Li et al. (2024) EconAgent
Infrastructure references: Ray, MQTT (emqx), PostgreSQL, OpenStreetMap, SafeGraph