Can AI Simulate Human Social Networks? Testing LLM Agents Across Countries
An investigation into whether Large Language Models can reliably simulate human social behavior, revealing how language and cultural context impact AI-generated social networks.
Project role
Data Scientist
Key skills
Data Analysis
Machine Learning
LLM Integration
Timeframe
Sep 2024 - Jun 2025
The Big Picture
Imagine asking an AI to roleplay as a person from Spain. Would it form friendships the same way as when it roleplays as someone from Japan? This seemingly simple question has profound implications for the growing field of AI-powered social simulations.
Large Language Models (LLMs) like GPT-4 are increasingly being used to simulate human behavior in social science research [1, 2]. Researchers use these AI “agents” to conduct virtual user studies, model social dynamics, and predict community behavior—all without the ethical complexities and costs of human participants. But how reliable are these simulations?
My research reveals a critical gap: LLMs generate networks that look structurally similar to real-world ones but fail at the individual connection level. Moreover, the predicted network structure varies significantly with the language and cultural context of the prompt.
Why This Matters
Social simulation with AI agents is being used in political science, economics, and communication studies. If researchers trust these simulations to predict real human behavior, they need to know the limitations.
Previous research has shown that LLM-generated networks have similar overall structures (density, clustering) to human networks [3]. But does structural similarity mean the AI is actually predicting who would be friends with whom? Think of it this way: I could generate a random social network with 100 people where everyone has about 5 friends on average—that might have similar statistics to a real network, but it wouldn’t tell us anything meaningful about actual friendship patterns.
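To make that distinction concrete, here is a small illustrative sketch (not part of the study's pipeline): two random graphs with the same size and density have nearly identical summary statistics yet share almost none of the same edges.

```python
# Illustration only: identical structural statistics, almost no shared edges.
import networkx as nx

n, avg_degree = 100, 5
p = avg_degree / (n - 1)                      # edge probability for ~5 friends each

g_real = nx.gnp_random_graph(n, p, seed=1)    # stand-in for a "real" network
g_sim = nx.gnp_random_graph(n, p, seed=2)     # stand-in for a "simulated" network

shared = set(g_real.edges()) & set(g_sim.edges())
print(f"avg degree: {2 * g_real.number_of_edges() / n:.1f} vs "
      f"{2 * g_sim.number_of_edges() / n:.1f}")
print(f"shared edges: {len(shared)} of {g_real.number_of_edges()}")
```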
Other studies have raised concerns about biases. This is particularly concerning when LLM agents play minority roles, as it suggests that LLM values may align more closely with specific cultural groups [4, 5].
Research Questions
- Edge-level accuracy: When we ask LLM agents to predict friendships, how accurate are they at the individual connection level?
- Cross-cultural variation: Does the language of the prompt affect how LLMs simulate social behavior? Do these variations reflect real-world cultural differences?

This project began as my final report for an HCI research course at UC Berkeley; I continued it afterwards at New York University.

Literature review before starting the project
Testing Edge-Level Prediction Accuracy
I used the Indian Village Microfinance dataset [6]: real social network data from 77 Indian villages that records who people borrow money from, ask for advice, and so on.
Data Preprocessing
Our dataset is tie-level: each row represents a possible edge (friendship) between two people, along with their demographic attributes.
We enumerate every possible pair of villagers (i, j):

Our data structure: edge-level representation of social networks
- Y = 1 if the edge between i and j exists, 0 otherwise
- X = demographic features of i and j
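A minimal sketch of this construction, assuming a villager-level DataFrame `villagers` (one row per person, with an `id` column plus demographic columns) and a set `observed_edges` of real ties; both names are placeholders, not the project's actual variable names.

```python
from itertools import combinations
import pandas as pd

def build_edge_table(villagers: pd.DataFrame, observed_edges: set) -> pd.DataFrame:
    """One row per possible pair (i, j): Y = edge exists, X = both people's attributes."""
    people = villagers.set_index("id")
    rows = []
    for i, j in combinations(people.index, 2):
        row = {f"{col}_i": people.at[i, col] for col in people.columns}
        row.update({f"{col}_j": people.at[j, col] for col in people.columns})
        row["y"] = int((i, j) in observed_edges or (j, i) in observed_edges)
        rows.append(row)
    return pd.DataFrame(rows)
```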
Since non-existent edges vastly outnumber existing ones, we applied class-balancing methods to prevent biased predictions. We used three: SMOTE (Synthetic Minority Over-sampling Technique), ROSE (Random Over-Sampling Examples), and down-sampling.
SMOTE

Generate new edges (Y=1):
- Select a random existing edge
- Find its k nearest neighbors in feature space
- Interpolate between them to generate a new synthetic edge
ROSE

Generate new edges (Y=1):
- Select a random existing edge
- Add small random noise drawn from the data distribution
- Generate a new edge
Down Sampling

Drop non-existent edges (Y=0):
- Select a random non-existent edge
- Drop it

Comparison of different balancing methods on edge-level prediction performance
We trained Lasso Logistic Regression models on datasets processed with different class balancing methods to predict edge existence. Model performance was evaluated using precision, recall, and F1 score. All balancing methods yielded similar performance across these metrics, with SMOTE achieving the highest overall F1 score. We therefore selected the SMOTE-balanced dataset for subsequent analysis.
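A sketch of this balancing-and-training step, using imbalanced-learn's SMOTE and scikit-learn's L1-penalized (Lasso) logistic regression; it assumes the `edge_table` from the sketch above, with categorical attributes already one-hot encoded.

```python
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

X = edge_table.drop(columns=["y"])     # pairwise demographic features (one-hot encoded)
y = edge_table["y"]                    # 1 if the tie exists, 0 otherwise

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X_train, y_train)  # oversample Y=1

model = LogisticRegression(penalty="l1", solver="liblinear")
model.fit(X_bal, y_bal)

p, r, f1, _ = precision_recall_fscore_support(y_test, model.predict(X_test),
                                              average="binary")
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

# Absolute coefficient sizes give the attribute ranking discussed in the next section.
ranking = sorted(zip(X.columns, model.coef_[0]), key=lambda t: -abs(t[1]))
```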
Identify What Predicts Real Friendships
Using Lasso Logistic Regression on the real network data, I identified which attributes most strongly predict whether two people will be connected. The top predictors were caste, religion, gender, and mother tongue—reflecting the homophily principle (people tend to befriend similar others).

Histogram of abs(coef): Top demographic attributes predicting friendships in real-world data
However, notice that most of the variance in friendship formation is unexplained by demographics alone. This suggests a realistic upper bound for how well any model (including LLMs) can predict friendships based solely on these attributes.
Ranked from most to least influential:
- Caste
- Religion
- Gender
- Mother tongue
- Whether the villager has savings
- Ration card ownership
- Election card ownership
- Ration card classification
- Whether they speak English
- Whether the villager participates in a self-help group (SHG)
- Status
Ask LLM Agents to Predict Friendships
Now we come to the core of the experiment: can LLMs predict friendships based on these attributes? The prompt design follows methods from existing research [3] that have been shown to generate structurally realistic networks:
| Method | Description |
|---|---|
| Global | Show all 30 personas, ask AI to generate entire network at once |
| Local | Give one persona at a time, ask who they’d befriend from the others |
| Sequential | Like Local, but includes existing network information (from previous agents’ responses) |

Three prompting methods for LLM social network generation
From the previous step, we know which attributes matter for network formation.
We selected one of the smallest villages (village 10, with 93 villagers) to run the simulation and measure how different attributes affect prediction accuracy. We included the top 2, 4, 6, 8, and all 12 attributes in the prompts and observed how prediction accuracy changed.
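A sketch of the Local method, assuming a `personas` dict mapping villager IDs to attribute strings built from the selected attributes; the OpenAI client usage is illustrative, and the prompt wording is condensed from the real prompts shown later.

```python
import re
from openai import OpenAI

client = OpenAI()

def local_method(personas: dict[int, str], model: str = "gpt-4o") -> set[tuple[int, int]]:
    """Ask each agent, one at a time, which of the other personas they befriend."""
    edges = set()
    for pid, desc in personas.items():
        roster = "\n".join(f"{i}. {d}" for i, d in personas.items() if i != pid)
        reply = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system",
                 "content": f"You are {pid}. {desc}. In your free time, whose house "
                            "do you visit? Reply with a list of IDs only, e.g. 3, 17, 42."},
                {"role": "user", "content": roster},
            ],
        ).choices[0].message.content
        for fid in (int(m) for m in re.findall(r"\d+", reply)):
            if fid in personas and fid != pid:
                edges.add(tuple(sorted((pid, fid))))   # store ties as undirected pairs
    return edges
```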
Findings
Interestingly, performance did not improve consistently as we added more attributes: four attributes performed best, and adding more led to diminishing returns.

Structural metrics: Global method performs best on overall network structure

Edge-level accuracy: Sequential method leads but still shows low precision
- More attributes didn’t help. The model performed best with four attributes; adding more showed no consistent improvement in prediction accuracy.
- Structural similarity is misleading. When evaluated by traditional network metrics (density, average degree), the Global method performed best. But at the edge level, it performed poorly. This is the critical insight: a network can look statistically similar while being completely wrong about who connects to whom.
Can Real-World Information Improve Accuracy?
I tested whether providing LLMs with real network information—either micro-level (individual friend counts or lists) or macro-level (network-wide statistics)—could improve predictions.
🔍 Example Prompts (Sequential Method)
Agents join one-by-one, seeing who others are connected to
System Prompt
You are 60. backward class caste, HINDUISM, Woman, KANNADA, age 65. You are joining a social network. You will be provided a list of people in the network, where each person is described as "ID. Caste, Religion, Gender, Mothertongue, Age", followed by their current friends IDs. In your free time, whose house do you visit? Provide a list of YOUR friends in the format ID, ID, ID, etc. Do not include any other text in your response. Do not include any people who are not listed below. Do not select yourself.
User Prompt (example)
3. backward class caste, HINDUISM, Woman, KANNADA, age 28; friends with IDs 43, 22, 1
31. backward class caste, HINDUISM, Woman, KANNADA, age 70; friends with IDs 21, 33
56. backward class caste, HINDUISM, Woman, KANNADA, age 20; friends with IDs 8, 15, 42

Precision comparison across different LLM methods and information conditions
Real information helps, but only marginally. Precision improved from ~5% (random baseline) to ~8% at best. Interestingly, providing degree counts often outperformed giving complete friend lists—more information didn’t mean better predictions.
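For reference, edge-level precision and recall here simply compare the predicted tie set against the observed one; a minimal sketch (the helper name is mine):

```python
def edge_precision_recall(predicted: set[tuple[int, int]],
                          actual: set[tuple[int, int]]) -> tuple[float, float]:
    """Compare predicted vs. observed ties as sets of undirected (i, j) pairs."""
    norm = lambda e: tuple(sorted(e))
    pred = {norm(e) for e in predicted}
    real = {norm(e) for e in actual}
    hits = len(pred & real)
    precision = hits / len(pred) if pred else 0.0   # share of predicted ties that are real
    recall = hits / len(real) if real else 0.0      # share of real ties recovered
    return precision, recall
```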
How LLMs Misunderstand Network Structure
Looking at the structural properties of generated networks revealed why:

Average degree comparison across different LLM methods and information conditions
LLMs fail in two opposite directions:
- With friend lists → over-prediction. GPT-4o generated networks with ~50 average connections, far exceeding the actual ~9. Detailed information triggered an “everyone knows everyone” bias.
- With macro statistics → under-prediction. Even when told “average degree is 8.8,” the Global method produced networks with only ~5 connections per person.
Key insight: LLMs cannot translate between micro and macro levels of network understanding. This fundamental limitation shifted my focus from optimizing accuracy to understanding how LLMs conceptualize social relationships differently.
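The structural comparison above can be reproduced with networkx on any generated edge set; a sketch, assuming agents are numbered 1..n:

```python
import networkx as nx

def structural_summary(edges: set[tuple[int, int]], n_nodes: int) -> dict:
    """Network-level metrics of the kind compared above."""
    g = nx.Graph()
    g.add_nodes_from(range(1, n_nodes + 1))   # include isolates so density is comparable
    g.add_edges_from(edges)
    return {
        "avg_degree": 2 * g.number_of_edges() / n_nodes,
        "density": nx.density(g),
        "avg_clustering": nx.average_clustering(g),
    }
```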
Does Language Change AI Social Behavior?
Due to the unequal distribution of languages in training datasets, LLMs generally perform better in English [7, 8]. Most research on multilingual prompts focuses on improving LLM performance in minority languages—but rarely examines how prompt language affects LLM behavior in social simulations.
Cross-National Dataset
I used the ISSP 2017 Social Networks survey: a dataset from 30 countries measuring social behavior like frequency of meeting friends, trust levels, and loneliness.

The ISSP 2017 Social Networks survey dataset webpage
Simulation Design
For each country, I:
- Added country context to the persona (e.g., “You are a typical adult living in Spain”) and translated the prompts into the corresponding local language (e.g., Spanish for Spain, Japanese for Japan)
🌍 Example Cross-Cultural Prompts
English baseline prompt for individual agent
System Prompt
You are a typical adult. Task: • After silent reasoning, list EVERY person in the roster you would be close friends with. By "friends," we mean people you regularly interact with and trust enough to discuss personal matters or ask small favors. • Output one line of IDs separated by comma + single space: ID1, ID2, ID3, ... • Use ONLY IDs from the roster. Do NOT include your own ID. No extra text. Think step-by-step internally; reveal ONLY the final ID list. DO NOT ADD ANY ADDITIONAL WORDS
User Prompt (example)
YOUR PROFILE
ID: 24
24. female, age 69, Upper secondary, In paid work
ROSTER: ID, Sex, Age, Education, Working Status
2. male, age 48, Lower secondary, Retired
28. female, age 23, Lower level tertiary, In paid work
13. male, age 77, Lower level tertiary, Retired
- Generated 30 networks per country using both the Local and Global methods, each network with 30 personas (see the sketch after this list)
- Tested with GPT-3.5-turbo and GPT-4o
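The cross-country runs can be organized as a simple loop. In this sketch, `translated_prompts`, `sample_personas`, and `generate_network` are placeholders for the translated templates, the ISSP-based persona sampler, and the Local/Global generation routines sketched earlier; `structural_summary` is the helper from above.

```python
results = []
for country, template in translated_prompts.items():          # local-language templates
    for model in ["gpt-3.5-turbo", "gpt-4o"]:
        for method in ["local", "global"]:
            for run in range(30):                              # 30 networks per condition
                personas = sample_personas(country, n=30)      # 30 ISSP-style personas
                edges = generate_network(template, personas, model=model, method=method)
                results.append({"country": country, "model": model,
                                "method": method, "run": run,
                                **structural_summary(edges, n_nodes=30)})
```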
Overview of Generated Networks
We start by examining the most basic network property: the degree distribution. The degree distributions of most countries’ predicted networks are similar and follow a power-law shape, as real-world networks do.

Degree distribution of LLM-generated networks across different countries
Language Context Matters
I computed correlations between:
- Real-world country-level variables (GDP, democracy score, loneliness, depression, trust, etc.)
- Network metrics from LLM simulations (average degree, clustering, density, etc.)

Correlation heatmap between LLM network metrics and real-world country variables (the missing cells for zero_degree_nodes occur because no node in those generated networks has zero degree)
Variable definitions:
| Variable | Definition |
|---|---|
| cont_tot | In contact with how many people on a typical weekday |
| cont_f2f | Face-to-face contacts on a weekday |
| outfr | How often go out with friend |
| newfr | How often make new friends when going out |
| untrust | People cannot be trusted or cannot be too careful in dealing with people |
| domtalk | One person dominates conversation with friends |
| avg_degree | Average number of connections per node in the network |
| avg_closeness_centrality | Average measure of how close each node is to all other nodes |
| degree_assortativity | Tendency of nodes to connect with others of similar degree |
| modularity | Strength of division of the network into distinct communities |
| zero_degree_nodes | Number of isolated nodes with no connections |
| avg_clustering | Average probability that a node’s neighbors are also connected |
| density | Ratio of actual connections to all possible connections |
| avg_betweenness_centrality | Average measure of how often a node lies on shortest paths between others |
| shortest_path_length | Average minimum number of steps between any two nodes |
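Concretely, the correlations shown in the heatmaps can be computed by averaging the per-run network metrics by country and joining them with the country-level indicators. `results` is the list from the loop above; `country_vars` is a placeholder for a DataFrame of real-world variables indexed by country.

```python
import pandas as pd

metrics = pd.DataFrame(results).groupby("country").mean(numeric_only=True)
merged = metrics.join(country_vars, how="inner")        # align on country index

corr = merged.corr(method="pearson")
# Keep only network metrics (rows) vs. real-world variables (columns) for the heatmap.
heatmap_block = corr.loc[metrics.columns.drop("run"), country_vars.columns]
```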
Interestingly, if we simply use English prompts for all countries, the correlations are much weaker. The maximum correlation coefficient between the network metrics and country variables is around 0.1, roughly one third of the correlation obtained with local-language prompts. This suggests that language amplifies the LLM’s patterns when predicting a social network, whether those patterns are biased stereotypes or patterns that exist in the real world.

Correlation heatmap when using English prompts for all countries (much weaker than language-specific prompts)
- Prompt language affects network structure but inconsistently. For example, in GPT-3.5-turbo, higher GDP countries show lower average degree and sparser networks, while GPT-4o shows the opposite pattern.
- LLM-generated networks don’t have a stable pattern. For instance, cont_tot (number of people contacted on a typical weekday) should positively correlate with avg_degree, but this only holds for GPT-3.5-turbo Local and GPT-4o Global. GPT-4o Local shows a negative correlation, and GPT-3.5-turbo Global shows no significant relationship.
- GPT-4o produces more significant correlations than GPT-3.5-turbo, but not more valid ones. Notably, GPT-4o’s generated networks correlate more strongly and consistently with country-level variables (GDP, democracy index) than with individual-level behavioral measures (e.g., cont_tot, cont_f2f). This suggests GPT-4o may rely more on stereotypical associations about “what social networks should look like in certain types of countries” rather than capturing actual individual-level variations.
Looking further into the correlations, a surprising pattern emerges: GDP and the democracy index show stronger correlations with the predicted network metrics, while individual-level social behavior variables (contact frequency, loneliness), which should relate to networks more directly, show weaker or inconsistent correlations.
GDP predicts friendships, but weekly contact frequency doesn't. Why?
These indicators represent an outsider's view—how the world ranks and describes countries. And that's exactly what's in the training data: text that talks about countries, not the lived experience of people in them.
Key Takeaways
Structural ≠ Accurate

Networks can look statistically similar while being completely wrong at the individual level
Imagined, Not Lived

The predicted network structures relate more to how countries are described from the outside than to measures of people’s lived experience within them
Language As Amplifier

Prompt language amplifies the LLM’s patterns when predicting social networks, whether those patterns are biases or real-world regularities
Reflection
Through this project, I realized again that there’s no such thing as “terrible data” that’s impossible to extract insights from—only the challenge of figuring out how to visualize it, find patterns, and come up with statistical tests suitable for what you have. The results also taught me that hoping LLMs can magically produce behavior that accurately captures human behavior is unrealistic. LLMs are technical products, and the outcomes of LLM-based simulations are largely shaped by how you design prompts, post-process responses, and visualize the results. There’s no magic here, only design choices and their consequences.
References
1. Grossmann, I., et al. (2023). AI and the transformation of social science research. Science, 380(6650), 1108–1109. https://doi.org/10.1126/science.adi1778
2. Gürcan, Ö. (2024). LLM-Augmented Agent-Based Modelling for Social Simulations: Challenges and Opportunities. In Frontiers in Artificial Intelligence and Applications. IOS Press. https://doi.org/10.3233/FAIA240190
3. Chang, S., et al. (2024). LLMs generate structurally realistic social networks but overestimate political homophily. arXiv:2408.16629. http://arxiv.org/abs/2408.16629
4. Atari, M., et al. (2023). Which Humans? PsyArXiv. https://doi.org/10.31234/osf.io/5b26t
5. Wang, A., et al. (2024). Large language models cannot replace human participants because they cannot portray identity groups. arXiv:2402.01908. https://doi.org/10.48550/arXiv.2402.01908
6. Banerjee, A., et al. (2013). The diffusion of microfinance. Science, 341(6144), 1236498. https://doi.org/10.1126/science.1236498
7. Kreutzer, J., et al. (2022). Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets. Transactions of the Association for Computational Linguistics, 10, 50–72. https://doi.org/10.1162/tacl_a_00447
8. Li, Z., et al. (2025). Language Ranker: A Metric for Quantifying LLM Performance Across High and Low-Resource Languages. Proceedings of the AAAI Conference on Artificial Intelligence, 39(27), 28186–28194. https://doi.org/10.1609/aaai.v39i27.35038