BeyondMachines

Can Large Language Models (LLMs) truly reason? No.

"We found no evidence of formal reasoning in language models, including open-source models like #Llama, #Phi, #Gemma, and #Mistral, and leading closed models, including the recent #OpenAI #GPT-4o and #o1-series.

Their behavior is better explained by sophisticated pattern matching: so fragile, in fact, that changing names can alter results by ~10%!

We can scale data, parameters, and compute, or use better training data for Phi-4, Llama-4, GPT-5.

But we believe this will result in 'better pattern-matchers,' not necessarily 'better reasoners.'"

Full paper: https://arxiv.org/pdf/2410.05229
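For a sense of what "changing names can alter results" means in practice, here is a minimal sketch of the GSM-Symbolic idea the paper describes: turn a GSM8K-style word problem into a template, resample it with different names and numbers, and compare model accuracy across the variants. The template, name list, and `query_model` stub below are illustrative assumptions, not the paper's actual code or data.

```python
import random
import statistics

# Illustrative GSM8K-style template (an assumption, not from the paper's dataset).
# The GSM-Symbolic approach replaces proper names and numeric values with
# placeholders, then samples many concrete instances of the "same" problem.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{name} gives away {z} apples. How many apples does {name} have left?"
)
NAMES = ["Sophie", "Liam", "Maya", "Omar", "Elena"]


def make_variant(rng: random.Random) -> tuple[str, int]:
    """Sample one concrete problem instance plus its ground-truth answer."""
    x, y = rng.randint(5, 30), rng.randint(5, 30)
    z = rng.randint(1, x + y)
    question = TEMPLATE.format(name=rng.choice(NAMES), x=x, y=y, z=z)
    return question, x + y - z


def query_model(question: str) -> int:
    """Stub for an LLM call; swap in a real API client to run the comparison."""
    raise NotImplementedError("plug in your model here")


def accuracy_over_variants(n_sets: int = 10, n_per_set: int = 50, seed: int = 0) -> None:
    """Report how accuracy spreads across resampled variants of the same problems."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(n_sets):
        problems = [make_variant(rng) for _ in range(n_per_set)]
        correct = sum(query_model(q) == answer for q, answer in problems)
        accuracies.append(correct / n_per_set)
    print(f"mean accuracy: {statistics.mean(accuracies):.3f}, "
          f"spread (stdev): {statistics.stdev(accuracies):.3f}")
```

The ~10% figure quoted in the post refers to this kind of variance: accuracy shifting across problem instances that differ only in surface details such as names, which is hard to square with a model performing formal, symbolic reasoning.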