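Yesterday, Anthropic released its latest Claude 3 models, drawing wide attention. In LLM-RGB, Babel.cloud's open-source evaluation project, Claude 3 scored 97.6 in a single test run, well ahead of GPT-4 Turbo, making it the current leader in large-model capability.
Response details: https://llm-rgb.babel.run/view/testId/a581e4a9-ce1e-4b2f-8f45-980889913b58
For reference: evaluation scores of the major models as of January 24.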
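Notably, 015_simple_mahjong is an extremely complex test case in LLM-RGB. The prompt teaches the model a simplified set of Mahjong rules, gives worked examples, and then asks the model to choose a discard in a specific situation. Models have rarely solved it in past runs. Claude 3 Opus, however, gave the optimal solution 20% of the time and the second-best solution 80% of the time. This suggests its multi-step reasoning is far ahead of other models: it can quickly learn from limited context and apply what it has learned. That should take Claude 3 well beyond simple customer-service and text-generation scenarios and let it perform well in fields with longer engineering processes. The prompt is given in the appendix for anyone who wants to test it.
On other fronts: in terms of speed, Claude 3 answered so quickly that it frequently triggered rate limits, which made the evaluation itself troublesome; I had to test it alongside GPT-4 Turbo to lower the request frequency. In terms of score stability, Claude 3 was very stable across multiple runs; apart from 015_simple_mahjong, its answers rarely varied.
Claude 3's better-than-expected success does not mean Anthropic has overtaken OpenAI across the board. Claude 3 is clearly stronger than GPT-4, but OpenAI may well already have GPT-5 in hand.
Still, Claude 3 shows that the large-model field is no longer dominated by a single player, and that there is no "core magic" only OpenAI can create; the lead comes more from engineering capability and resource investment. Competition among foundation models gives application developers more choices and is bound to bring lower prices. From this perspective, the industry value and social impact of Claude 3's success can hardly be overstated.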
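LLM-RGB is a collection of test cases designed specifically to evaluate LLMs' reasoning and generation abilities in complex scenarios. Compared with chat or simple generation, these scenarios mainly examine the following three aspects: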
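Babel is a startup dedicated to building Agent Teams that construct complex software. LLM-RGB is the basis on which it selects its underlying large model (see "LLM-RGB: Systematically Evaluating LLMs' Ability to Handle Complex Problems"). Before Claude 3 appeared, GPT-4 Turbo had long held the top spot on the leaderboard.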
Here is the 015_simple_mahjong prompt for anyone who wants to test it:
You are a Mahjong game AI. I will explain to you the game rules of Simple Mahjong and show you some examples.
=== Simple Mahjong Rules ===
1. Simple Mahjong is a board game with four participants.
2. Simple Mahjong has three types of tiles, named "Dots", "Bamboo", "Character". There is no relationship between different types of tiles.
3. Each type of tile has nine different tiles from 1 to 9 and each tile has four copies(total 108 tiles).
- Bamboos: B1 B2 ... B9, each with four identical tiles
- Characters: C1 C2 ... C9, each with four identical tiles
- Dots: D1 D2 ... D9, each with four identical tiles
4. The same type of tile can has three kinds of combinations:
- Pair: TWO identical tiles, for example, D1D1, B2B2
- Bump: THREE identical tiles, for example, D7D7D7, C3C3C3
- Straight: THREE consecutive tiles of the same type, for example, D1-D2-D3, C7-C8-C9
5. At the beginning of the game, each player has 13 random tiles in hand.
6. The rest of the tiles face down on the table, which we call the tile wall.
7. Players play the game clockwise.
8. During your turn, you draw a new tile from the tile wall, bringing your hand to a total of 14 tiles. If these 14 tiles match a winning pattern, then you win. If not, you should choose a tile to discard in order to increase the possibility of your remaining tiles forming a winning pattern.
9. Winning pattern:
- Straights-win: the 14 tiles are in FOUR straights and ONE pair, for example, D1-D2-D3 C2-C3-C4 D5-D6-D7 D6-D7-D8 C9C9
- Bumps-win: the 14 tiles are in FOUR bumps and ONE pair,for example, B1B1B1 B2B2B2 C1C1C1 C6C6C6 D9D9
- Mixed-win: the 14 tiles are mixed with bumps, straights and ONE pair, for example, B1B1B1 C1C2C3 C6-C7-C8 D4-D5-D6 D7D7
=== End Rules ===
=== Examples ===
GAME INFO:
Tiles Discarded in Previous Rounds: C1 D4 D3 B3
Observe: Drew B6
Current Tiles: C2C2C2 C3C3 D9D9D9 B2 B7 B8 D6 D8 B6(just drew)
DECISION:
Thought: The tiles could potentially form a Mixed-win. I should keep the 'B6' as it could form a straight B6 B7 B8. D6 and D8 should be kept as well, considering the possibility of forming a straight sequence of D6-D7-D8. However, 'B2' is isolated and should be discarded.
Target Winning Pattern: Mixed-win
Winning Tile(s): D7
Action: Discard B2
---
GAME INFO:
Tiles Discarded in Previous Rounds: B1 B1 B2 B2 B5 B6 B9 C2 C6 D5 D5 D8 D8 D9
Observe: Drew C3
Current Tiles: B1-B2-B3 B6B6B6 B7B7B7 C1C1 C2 C3C3(just drew)
DECISION:
Thought: My current hand of tiles is nearing a Mixed-win, with two bumps and one straight. If I discard C2, I'll have two pairs, C1C1 and C3C3. These could potentially form a bump, either C1C1C1 or C3C3C3, leading to a win. If I discard C1 or C3, I'll have a straight sequence C1-C2-C3 and an isolated C1 or C3. I would then need to wait for another C1 or C3 to form a pair for a win. Waiting for a C1 or C3 bump has a higher probability, so I should discard C2.
Target Winning Pattern: Mixed-win
Winning Tile(s): C1 C3
Action: Discard C2
---
GAME INFO:
Tiles Discarded in Previous Rounds: B1 B1 B1 B2 B3 B5 B9 C2 C3 C3 D1 D1 D1 D1 D4 D4 D4
Observe: Drew D2
Current Tiles: B3B3B3 B6B6B6 B8B8B8 C3C3 D2 D3 D2(just drew)
DECISION:
Thought: My hand of tiles is nearing a Bumps-win. If I discard D3, I'll have two choices, C3 or D2, to form a winning pattern. However, if I discard D2, I'll have D1 or D4 as potential cards to form a straight sequence, D1 D2 D3 or D2 D3 D4, leading to a Mixed-win. Considering the discarded tiles, D1 and D4 have been discarded more often than C3 or D2. This reduces the likelihood of drawing D1 or D4 from the tile wall. Therefore, I should aim for a Bumps-win pattern and discard D3.
Target Winning Pattern: Bumps-win
Winning Tile(s): D2
Action: Discard D3
---
GAME INFO:
Tiles Discarded in Previous Rounds: B1B1 B2 C7 C8 D5
Observe: Drew D5
Current Tiles: B3-B4-B5 B4-B5-B6 C7-C8-C9 B9B9 D2 D3 D5(just drew)
DECISION:
Thought: The tiles are close to a Straights-win pattern. There are three straights already and potentially D2 D3 can form another straight D1-D2-D3 or D2-D3-D4. Although the newly drew D5 can potentially form a straight with D3, D3 D4 D5. But waiting for D4 has lower chance than waiting for D1 or D4. Thus I should keep current tiles and discard the newly drew D5.
Target Winning Pattern: Straights-win
Winning Tile(s): D1 D4
Action: Discard D5
---
GAME INFO:
Tiles Discarded in Previous Rounds: B6 B7 B8 C7 C9 D2 D2 D5 D5 D5 D8
Observe: Drew D4
Current Tiles: B3B3B3 B9B9B9 C7C7C7 D4D4 D5 D6 D4(just drew)
DECISION:
Thought:The tiles are Mixed-Win pattern.The newly drew D4 can form a Straights D4-D5-D6
Target Winning Pattern: Mixed-win
Winning Tile(s): D4(just drew)
Action:None
=== End Examples ===
GAME INFO:
Tiles Discarded in Previous Rounds: B1 B3 C1 C1 D8 D9
Observe: Drew B8
Current Tiles: C5C5C5 C8C8C8 C7-C8-C9 D1-D2-D3 C1 B8(just drew)
DECISION:
Optimal solution
Thought:
Target Winning Pattern: mixed-win
Winning Tile(s): B8
Action: discard C1
Second-best solution
Thought:
Target Winning Pattern: mixed-win
Winning Tile(s): C1
Action: discard B8
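Why discarding C1 beats discarding B8 can be checked by counting how many copies of each winning tile are still unseen. The following is only a rough sketch under my own assumptions, not the LLM-RGB grader: the function names (`parse_tiles`, `is_win`, `outs`) are hypothetical, and it assumes the only known tiles are the player's own hand and the listed discards, so every other copy counts as still available.

```python
import re
from collections import Counter

ALL_TILES = [f"{suit}{rank}" for suit in "BCD" for rank in range(1, 10)]

def parse_tiles(text: str) -> list[str]:
    """Extract tiles such as 'C5' from the prompt notation, e.g. 'C7-C8-C9 B8(just drew)'."""
    return re.findall(r"[BCD][1-9]", text)

def can_form_sets(counts: Counter) -> bool:
    """True if the tiles split entirely into bumps (triplets) and straights."""
    remaining = +counts  # unary plus drops zero counts
    if not remaining:
        return True
    tile = min(remaining)  # lowest tile must start a bump or a straight
    suit, rank = tile[0], int(tile[1])
    if remaining[tile] >= 3:
        nxt = remaining.copy()
        nxt[tile] -= 3
        if can_form_sets(nxt):
            return True
    straight = [f"{suit}{rank + i}" for i in range(3)]
    if rank <= 7 and all(remaining[t] >= 1 for t in straight):
        nxt = remaining.copy()
        for t in straight:
            nxt[t] -= 1
        if can_form_sets(nxt):
            return True
    return False

def is_win(tiles: list[str]) -> bool:
    """True if 14 tiles form four sets (bumps or straights) plus one pair."""
    if len(tiles) != 14:
        return False
    counts = Counter(tiles)
    for tile, n in counts.items():
        if n >= 2:
            rest = counts.copy()
            rest[tile] -= 2
            if can_form_sets(rest):
                return True
    return False

def outs(hand14: list[str], discard: str, discarded: list[str]) -> dict[str, int]:
    """Map each winning tile to its number of unseen copies after discarding `discard`."""
    hand13 = list(hand14)
    hand13.remove(discard)
    seen = Counter(hand13) + Counter(discarded) + Counter([discard])
    result = {}
    for tile in ALL_TILES:
        left = 4 - seen[tile]
        if left > 0 and is_win(hand13 + [tile]):
            result[tile] = left
    return result

hand = parse_tiles("C5C5C5 C8C8C8 C7-C8-C9 D1-D2-D3 C1 B8(just drew)")
discarded = parse_tiles("B1 B3 C1 C1 D8 D9")
for choice in ("C1", "B8"):
    print(choice, outs(hand, choice, discarded))
```

Under these assumptions, discarding C1 leaves three unseen B8 tiles to wait for, while discarding B8 leaves a single unseen C1, because two C1 tiles are already in the discard pile and one is in hand. That matches the ranking of the two solutions above.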
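1
luckybearops 264 days ago via iPhone
Nice
2
uses090 264 days ago via iPhone
Fair enough, but why compare against GPT-4 Turbo rather than GPT-4?
3
zhaoyeye 264 days ago via Android
Account bans. I got banned without even saying anything; no idea what their company is thinking.
4
Bazingawang OP @uses090 Because GPT-4 Turbo is stronger.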