
A new study from Google suggests that advanced reasoning models achieve high performance by simulating multi-agent-style debates involving diverse perspectives, personality traits and domain expertise.
The researchers' experiments show that this internal debate, which they describe as a "society of thought," significantly improves model performance on complex reasoning and planning tasks. They found that leading reasoning models such as DeepSeek-R1 and QwQ-32B, which are trained via reinforcement learning (RL), develop this ability to stage social conversations on their own, without explicit instruction.
These findings provide a roadmap for how developers can build more robust LLM applications and how companies can train superior models using their own internal data.
What is the society of thought?
The fundamental principle of the society of thought is that reasoning models learn to imitate multi-agent social dialogue to refine their logic. The hypothesis draws on cognitive science, particularly the idea that human reasoning evolved primarily as a social process aimed at solving problems through argument and engagement with different points of view.
The researchers write that "cognitive diversity, arising from variation in skills and personality traits, improves problem solving, especially when accompanied by authentic dissent." They therefore suggest that integrating diverse perspectives allows LLMs to develop more robust reasoning strategies. By simulating conversations between different internal personalities, models can perform essential checks (such as verification and backtracking) that help them avoid common pitfalls like unwanted bias and sycophancy.
In models like DeepSeek-R1, this "society" manifests directly in the chain of thought. The researchers note that separate models or prompts are not needed to force the interaction; the debate emerges autonomously within the reasoning process of a single model instance.
Examples of the society of thought
The study provides concrete examples of how this internal friction leads to better outcomes. In an experiment involving a complex organic chemistry synthesis problem, DeepSeek-R1 simulated a debate between several distinct internal perspectives, including a "Planner" and a "Critical checker."
The planner initially proposed a standard reaction pathway. The critical checker (characterized as highly conscientious and disagreeable) then interrupted to challenge the assumption, offering a counterargument grounded in new facts. Through this adversarial verification, the model caught the error, reconciled the conflicting views, and corrected the synthesis route.
A similar dynamic emerged in creative tasks. When asked to rewrite the sentence "I threw my hatred into the burning fire," the model simulated a negotiation between a "Creative ideator" and a "Semantic fidelity checker." After the ideator suggested a version using the phrase "deeply rooted," the checker retorted: "But it adds 'deeply rooted,' which wasn't in the original. We should avoid adding new ideas." The model ultimately settled on a compromise that preserved the original meaning while improving the style.
Perhaps the most striking development occurred in the "Countdown" game, a math puzzle in which the model must combine specific numbers to reach a target value. Early in training, the model tried to solve the problem in a monologue. As it learned via RL, it spontaneously split into two distinct characters: a "Methodical problem solver" that performed the calculations and a "Thinker explorer" that monitored progress, interrupting failed paths with remarks like "Again, no luck… Maybe we can try using negative numbers," and prompting the methodical solver to change strategy.
These results challenge the assumption that longer chains of thought automatically lead to greater accuracy. Instead, it is diverse behaviors, such as examining answers from different angles, testing prior assumptions, backtracking, and exploring alternatives, that improve reasoning. The researchers reinforced this finding by artificially steering a model's activation space to trigger expressions of conversational surprise; this intervention activated a broader range of personality- and expertise-related traits and doubled accuracy on complex tasks.
The implication is that social reasoning emerges autonomously through RL, driven by the model's incentive to produce correct answers rather than by explicit human supervision. In fact, models trained to reason in monologue underperformed those trained with raw RL, which naturally developed multi-agent conversations. Conversely, supervised fine-tuning (SFT) on multi-party conversational debates far outperformed SFT on standard chains of thought.
Implications for enterprise AI
For developers and business decision-makers, these insights offer practical guidelines for building more powerful AI applications.
Prompt engineering for "conflict"
Developers can improve reasoning in general-purpose models by explicitly directing them to adopt a society-of-thought structure. However, it is not enough to simply ask the model to chat with itself.
"It is not enough to have a debate, but to have different points of view and dispositions that make the debate inevitable and allow this debate to explore and distinguish alternatives." James Evans, co-author of the paper, told VentureBeat.
Instead of generic roles, developers should design prompts that assign opposing dispositions (e.g., a risk-averse compliance manager versus a growth-oriented product manager) to force the model to distinguish between alternatives. Even simple cues that prompt the model to express "surprise" can trigger these richer reasoning paths.
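As a rough illustration of this kind of persona-conflict prompting, the sketch below assigns two opposing dispositions in a system prompt and asks the model to stage an explicit debate before answering. It assumes an OpenAI-compatible chat API; the persona names, model name, and prompt wording are illustrative choices, not taken from the study.

```python
# Hypothetical sketch: persona-conflict prompting against an OpenAI-compatible chat API.
# The personas, model name, and wording are illustrative, not prescribed by the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = """You reason as a small internal team with opposing dispositions:
- "Risk-averse compliance manager": highly conscientious and disagreeable; challenges every assumption.
- "Growth-oriented product manager": creative and optimistic; pushes for ambitious options.
Before giving a final answer, stage an explicit debate between these two voices.
The compliance manager must raise at least one objection, and the team must
reconcile the disagreement (or express surprise and backtrack) before converging."""

def debate_answer(question: str) -> str:
    """Ask the model to resolve a question through an internal persona debate."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat model works; the name here is illustrative
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(debate_answer("Should we launch the new payments feature next quarter?"))
```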
Design for social scaling
As developers scale test-time compute to let models "think" for longer, they should structure that time as a social process. Applications should facilitate a "societal" process in which the model uses pronouns like "we," asks questions, and explicitly debates alternatives before converging on an answer.
This approach can also extend to multi-agent systems, in which distinct personalities assigned to different agents engage in critical debate to make better decisions.
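One way such a multi-agent setup could be wired together is sketched below: two separately prompted agents with opposing dispositions critique each other for a few rounds before a final synthesis. The roles, round count, and helper functions are illustrative assumptions, not the paper's design.

```python
# Hypothetical multi-agent debate loop: a planner and a disagreeable critic
# exchange critiques for a few rounds before a final reconciled answer.
from openai import OpenAI

client = OpenAI()

ROLES = {
    "planner": "You are an optimistic planner. Propose or revise a concrete plan.",
    "critic": "You are a highly conscientious, disagreeable checker. Find flaws, "
              "missing assumptions, and risks in the latest plan. Be specific.",
}

def ask(role: str, transcript: str) -> str:
    """Send the running transcript to one persona and return its reply."""
    reply = client.chat.completions.create(
        model="gpt-4o",  # illustrative model name
        messages=[
            {"role": "system", "content": ROLES[role]},
            {"role": "user", "content": transcript},
        ],
    )
    return reply.choices[0].message.content

def debate(task: str, rounds: int = 2) -> str:
    """Run a short planner-vs-critic debate, then ask for a reconciled answer."""
    transcript = f"Task: {task}"
    for _ in range(rounds):
        plan = ask("planner", transcript)
        transcript += f"\n\nPlanner: {plan}"
        critique = ask("critic", transcript)
        transcript += f"\n\nCritic: {critique}"
    return ask("planner", transcript + "\n\nReconcile the objections and give a final answer.")
```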
Stop cleaning your training data
Perhaps the most significant implication lies in how companies train or fine-tune their own models. Traditionally, data teams clean their datasets to create "golden answers" that provide perfect, linear paths to a solution. The study suggests this may be a mistake.
Models fine-tuned on conversational data (e.g., transcripts of debates and multi-agent resolutions) improve at reasoning much faster than those trained on clean monologues. There is even value in debates that don't arrive at the right answer.
"We trained on a conversational scaffold that led to the wrong answer, then reinforced the model and found that it worked equally well in reinforcing the correct answer, suggesting that conversational habits of exploring solutions were most important for new problems." » said Evans.
This implies that companies should stop throwing away "messy" engineering logs or Slack threads where issues were iteratively resolved. The "mess" is where the model learns the habit of exploration.
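As a rough illustration of what preserving that "mess" might look like in practice, the sketch below converts a multi-party troubleshooting thread, dead ends included, into a chat-style fine-tuning record in JSONL form. The schema, field names, and `<debate>` tagging follow common SFT conventions and are assumptions, not a format prescribed by the study.

```python
# Hypothetical sketch: turn a messy multi-party troubleshooting thread into a
# chat-style SFT record (JSONL), keeping the back-and-forth rather than only the answer.
import json

# A raw debate-style thread, including a wrong turn.
thread = [
    ("alice", "Deploy is failing with a timeout. I think it's the load balancer."),
    ("bob",   "I doubt it. The LB metrics look fine. Did we rotate the DB credentials?"),
    ("alice", "Good catch, but no luck there either. Maybe the migration is locking a table?"),
    ("bob",   "That's it. The migration holds a lock; let's run it out of band."),
]
resolution = "Run the schema migration out of band, then redeploy."

def thread_to_sft(thread, resolution, task):
    """Keep the full exploratory debate as part of the assistant turn."""
    debate = "\n".join(f"{speaker}: {msg}" for speaker, msg in thread)
    return {
        "messages": [
            {"role": "user", "content": task},
            {"role": "assistant", "content": f"<debate>\n{debate}\n</debate>\n{resolution}"},
        ]
    }

record = thread_to_sft(thread, resolution, "Our deploy is failing with a timeout. What should we do?")
with open("debates.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```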
Exposing the "black box" for trust and auditing
For high-stakes enterprise use cases, getting an answer isn’t enough. Evans argues that users must see internal dissent to trust the outcome, suggesting a change in user interface design.
"We need a new interface that systematically exposes internal debates to us so that we “participate” in calibrating the right response," » said Evans. "We do better with debate; AIs do better with debate; and we do better when we are exposed to the AI debate."
The strategic case for open weights
These results provide a new argument in the "build or buy" debate regarding open models versus proprietary APIs. Many proprietary reasoning models hide their chain of thought, treating internal debate as a trade secret or security liability.
But Evans argues that while no one had previously made a strong case for exposing this society of thought, the value of auditing these internal conflicts is becoming undeniable. Until proprietary vendors provide full transparency, companies in heavily regulated industries may find that open-weight models offer a distinct advantage: the ability to see the dissent, not just the decision.
"I believe the big proprietary models will start releasing (and licensing) the information once they realize it has value," » said Evans.
The research suggests that the work of an AI architect is shifting from pure model training to something closer to organizational psychology.
"I believe this opens a whole new frontier in small group and organization design within and between models that may enable new classes of performance," » said Evans. "My team is working on it and I hope others will too."




