Scientists Made AI Language Models Play Dungeons & Dragons to See How Well They Stay Focused
2026-01-22 14:34:08

By James Felton

Researchers at the University of California, San Diego had large language models (LLMs) play the popular fantasy tabletop role-playing game Dungeons & Dragons, in an attempt to evaluate how their performance holds up over time.

Generally speaking, AI research has focused on evaluating models' performance on short tasks. But AI agents are increasingly being asked to carry out tasks that require them to act independently or semi-independently over longer stretches of time. This study attempted to address that gap by monitoring several LLMs during games of Dungeons & Dragons.

"Dungeons & Dragons is a natural testbed for evaluating multi-step planning, rule adherence, and team strategy," explained the study's senior author, Raj Ammanabrolu, a faculty member in the Department of Computer Science and Engineering at UC San Diego, in a statement. "Because the game unfolds through conversation, D&D also opens a direct avenue for human-AI interaction: agents can assist other humans or play alongside them."

The team had the LLMs play through a game engine programmed with the rules of Dungeons & Dragons. The engine provided the maps needed for the game, including where resources were located, to minimize AI "hallucinations".
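The paper itself does not publish this code; as a rough illustration of the idea, here is a minimal, hypothetical sketch of how an engine might serialize authoritative game state into a model's prompt, so the model reads facts (position, hit points, map tiles) from the engine rather than inventing them. All names here (`GameState`, `build_turn_prompt`) are made up for illustration.

```python
# Hypothetical sketch, not the study's actual code: grounding an LLM agent's
# prompt in engine-provided game state to reduce hallucinated facts.
from dataclasses import dataclass, field


@dataclass
class GameState:
    """Authoritative per-turn state owned by the game engine."""
    position: tuple
    hit_points: int
    resources: dict = field(default_factory=dict)
    visible_map: list = field(default_factory=list)  # rows of tile characters


def build_turn_prompt(state: GameState, character: str) -> str:
    """Serialize the engine's ground-truth state into the model's context."""
    lines = [
        f"You are playing a {character}.",
        f"Position: {state.position}, HP: {state.hit_points}",
        f"Resources: {state.resources}",
        "Map (rows of tiles, '#' = wall, 'G' = goblin):",
    ]
    lines += ["".join(row) for row in state.visible_map]
    lines.append("Choose one legal action and stay in character.")
    return "\n".join(lines)


state = GameState(
    position=(2, 3),
    hit_points=14,
    resources={"arrows": 5, "potions": 1},
    visible_map=[[".", ".", ".", "."], [".", "#", ".", "G"]],
)
prompt = build_turn_prompt(state, "Warlock")
```

Because every factual detail in the prompt comes from the engine, the model's job is narrowed to choosing and voicing an action rather than remembering (or confabulating) the state of the world.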

The idea has also been tested less methodically on YouTube.

As well as playing against themselves and fellow AI agents, the LLMs played against 2,000 experienced human players. They were evaluated on how well they kept track of what was going on (for example, their available resources and actions), on the actions they took during the game, and on how well they stayed "in character".

During the experiment, the LLMs were found to devolve into hammy performances, with Warlocks being particularly dramatic even when the situation didn't call for it, and Paladins making heroic speeches at inappropriate times. LLMs playing goblins started regurgitating irritating canned phrases like "heh — shiny man’s gonna bleed" during fights, and there were significant differences between the models.

"DeepSeek-V3 consistently produces short, first-person action beats and monster taunts (e.g., 'I dart left,' 'Get them!'); however, it tends to reuse the same few voices within a scenario, so the number of distinct traits stays narrow," the team explains in their report, adding that Claude Haiku 3.5 was better at modifying its speech to fit the character class.

"GPT-4o usually sits between these behaviors: it mixes vivid stage directions with more tactical or meta phrasing, so its persona density is middling while its trait variety is comparable to DeepSeek-V3."

Overall, the team found that the chatbots performed well, though they did struggle with long-term tasks.

"Our evaluation across six metrics reveals that large language models produced a promising result in rule-based conversation simulation. Smaller, open-source language models, however, are not yet capable of giving consistent simulation, which might be because their pre-trained tuning is different compared to the D&D simulation task," the team wrote in their paper. "All LLMs exhibit progressive degradation in long-horizon scenarios."

The team next plans to simulate a full Dungeons & Dragons campaign, rather than purely the combat element. Hopefully the goblins can sort out their dialogue before that happens.

The work was presented at the NeurIPS conference in December 2025 and posted to OpenReview.

