+Follow
夜轻歌JJ
不要给自己压力,只做确定性高的交易 🐳
7
Follow
0
Followers
0
Topic
0
Badge
Posts
Hot
夜轻歌JJ
01-26
wow
The global DeepSeek reappearance frenzy! The myth of Silicon Valley giants collapses, 30 knives witness the aha moment
Go to Tiger App to see more news
{"i18n":{"language":"en_US"},"userPageInfo":{"id":"4159183300910622","uuid":"4159183300910622","gmtCreate":1696080183385,"gmtModify":1728201265415,"name":"夜轻歌JJ","pinyin":"yqgjjyeqinggejj","introduction":"","introductionEn":"","signature":"不要给自己压力,只做确定性高的交易 🐳","avatar":"https://community-static.tradeup.com/news/f8383a31fb94e82bfea3a69c529690d6","hat":null,"hatId":null,"hatName":null,"vip":1,"status":2,"fanSize":0,"headSize":7,"tweetSize":16,"questionSize":0,"limitLevel":999,"accountStatus":1,"level":{"id":0,"name":"","nameTw":"","represent":"","factor":"","iconColor":"","bgColor":""},"themeCounts":0,"badgeCounts":0,"badges":[],"moderator":false,"superModerator":false,"manageSymbols":null,"badgeLevel":null,"boolIsFan":false,"boolIsHead":false,"favoriteSize":19,"symbols":null,"coverImage":null,"realNameVerified":"init","userBadges":[{"badgeId":"1026c425416b44e0aac28c11a0848493-1","templateUuid":"1026c425416b44e0aac28c11a0848493","name":"Debut Tiger","description":"Join the tiger community for 500 days","bigImgUrl":"https://static.tigerbbs.com/0e4d0ca1da0456dc7894c946d44bf9ab","smallImgUrl":"https://static.tigerbbs.com/0f2f65e8ce4cfaae8db2bea9b127f58b","grayImgUrl":"https://static.tigerbbs.com/c5948a31b6edf154422335b265235809","redirectLinkEnabled":0,"redirectLink":null,"hasAllocated":1,"isWearing":0,"stamp":null,"stampPosition":0,"hasStamp":0,"allocationCount":1,"allocatedDate":"2025.02.24","exceedPercentage":null,"individualDisplayEnabled":0,"backgroundColor":null,"fontColor":null,"individualDisplaySort":0,"categoryType":1001},{"badgeId":"44212b71d0be4ec88898348dbe882e03-1","templateUuid":"44212b71d0be4ec88898348dbe882e03","name":"Boss Tiger","description":"The transaction amount of the securities account reaches $100,000","bigImgUrl":"https://static.tigerbbs.com/c8dfc27c1ee0e25db1c93e9d0b641101","smallImgUrl":"https://static.tigerbbs.com/f43908c142f8a33c78f5bdf0e2897488","grayImgUrl":"https://static.tigerbbs.com/82165ff19cb8a786e8919f92acee5213","redirectLinkEnabled":0,"redirectLink":null,"hasAllocated":1,"isWearing":0,"stamp":null,"stampPosition":0,"hasStamp":0,"allocationCount":1,"allocatedDate":"2024.11.02","exceedPercentage":"60.65%","individualDisplayEnabled":0,"backgroundColor":null,"fontColor":null,"individualDisplaySort":0,"categoryType":1101},{"badgeId":"a83d7582f45846ffbccbce770ce65d84-1","templateUuid":"a83d7582f45846ffbccbce770ce65d84","name":"Real Trader","description":"Completed a transaction","bigImgUrl":"https://static.tigerbbs.com/2e08a1cc2087a1de93402c2c290fa65b","smallImgUrl":"https://static.tigerbbs.com/4504a6397ce1137932d56e5f4ce27166","grayImgUrl":"https://static.tigerbbs.com/4b22c79415b4cd6e3d8ebc4a0fa32604","redirectLinkEnabled":0,"redirectLink":null,"hasAllocated":1,"isWearing":1,"stamp":null,"stampPosition":0,"hasStamp":0,"allocationCount":1,"allocatedDate":"2024.10.09","exceedPercentage":null,"individualDisplayEnabled":0,"backgroundColor":null,"fontColor":null,"individualDisplaySort":0,"categoryType":1100}],"userBadgeCount":3,"currentWearingBadge":{"badgeId":"a83d7582f45846ffbccbce770ce65d84-1","templateUuid":"a83d7582f45846ffbccbce770ce65d84","name":"Real Trader","description":"Completed a transaction","bigImgUrl":"https://static.tigerbbs.com/2e08a1cc2087a1de93402c2c290fa65b","smallImgUrl":"https://static.tigerbbs.com/4504a6397ce1137932d56e5f4ce27166","grayImgUrl":"https://static.tigerbbs.com/4b22c79415b4cd6e3d8ebc4a0fa32604","redirectLinkEnabled":0,"redirectLink":null,"hasAllocated":1,"isWearing":1,"stamp":null,"stampPosition":0,"hasStamp":0,"allocationCount":1,"allocatedDate":"2024.10.09","exceedPercentage":null,"individualDisplayEnabled":0,"backgroundColor":null,"fontColor":null,"individualDisplaySort":0,"categoryType":1100},"individualDisplayBadges":null,"crmLevel":11,"crmLevelSwitch":1,"location":null,"starInvestorFollowerNum":0,"starInvestorFlag":false,"starInvestorOrderShareNum":0,"subscribeStarInvestorNum":3,"ror":null,"winRationPercentage":null,"showRor":false,"investmentPhilosophy":null,"starInvestorSubscribeFlag":false},"baikeInfo":{},"tab":"post","tweets":[{"id":396695453823464,"gmtCreate":1737874687723,"gmtModify":1737874691482,"author":{"id":"4159183300910622","authorId":"4159183300910622","name":"夜轻歌JJ","avatar":"https://community-static.tradeup.com/news/f8383a31fb94e82bfea3a69c529690d6","crmLevel":11,"crmLevelSwitch":1,"followedFlag":false,"authorIdStr":"4159183300910622","idStr":"4159183300910622"},"themes":[],"htmlText":"wow","listText":"wow","text":"wow","images":[],"top":1,"highlighted":1,"essential":1,"paper":1,"likeSize":0,"commentSize":0,"repostSize":0,"link":"https://ttm.financial/post/396695453823464","repostId":"1161961252","repostType":2,"repost":{"id":"1161961252","kind":"news","pubTimestamp":1737873715,"share":"https://ttm.financial/m/news/1161961252?lang=en_US&edition=fundamental","pubTime":"2025-01-26 14:41","market":"hk","language":"zh","title":"The global DeepSeek reappearance frenzy! The myth of Silicon Valley giants collapses, 30 knives witness the aha moment","url":"https://stock-news.laohu8.com/highlight/detail?id=1161961252","media":"新智元","summary":"就在这当口,全球复现DeepSeek的一波狂潮也来了。更令人兴奋的是,成本不到30美金,就可以亲眼见证「啊哈」时刻。7B模型复刻,结果令人惊讶港科大助理教授何俊贤的团队,只用了8K个样本,就在7B模型上复刻出了DeepSeek-R1-Zero和DeepSeek-R1的训练。与DeepSeek R1类似,研究者的强化学习方案极其简单,没有使用奖励模型或MCTS类技术。随后,生成长度开始再次增加,此时出现了自我反思机制。","content":"<p><html><head></head><body>These days, Silicon Valley is completely in the aftermath of the earthquake brought by Chinese companies.</p><p>The United States is panicking: Has the center of global artificial intelligence shifted to China?</p><p>At this moment, a wave of frenzy of reappearing DeepSeek around the world also came.</p><p>As LeCun said: This time, it is the victory of open source over closed source!</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/96443f59328eead1fb03ccf0f6e8b4a7\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"519\"/></p><p>In the absence of top-level chips, DeepSeek, which trains breakthrough models with extremely low-cost chips, may threaten the AI hegemony of the United States.</p><p>The competition of large models is no longer a trillion-dollar computing power war.</p><p>The technological advantages and high valuations that big companies such as OpenAI, Meta, and Google are proud of will collapse, and Nvidia's stock price will begin to waver.</p><p>All these views and discussions make people wonder: Is tens of billions of dollars of expenditure really necessary for this industry? Some people even say that a group of geniuses in China's quantitative funds will lead to the collapse of Nasdaq.</p><p>From then on, the era of large models is likely to enter a watershed: super-performance models no longer belong to computing power giants alone, but to everyone.</p><p><h2 id=\"id_1782456954\">For $30, you can see the aha moment</h2>Pan Jiayi, a doctoral student from UC Berkeley, and two other researchers reproduced DeepSeek R1-Zero in the CountDown game.</p><p>They said that the results were quite excellent!</p><p>In the experiment, the team verified that through reinforcement learning RL, 3B's basic language model can also self-verify and search.</p><p>What's even more exciting is that the cost is less than 30 US dollars (about 217 yuan), and you can witness the aha moment with your own eyes.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/9e5c66f649972cc1dcc1b64ac9e2312a\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"468\"/></p><p>The project, called TinyZero, employs the R1-Zero algorithm-given a base language model, hints, and real reward signals, run reinforcement learning.</p><p>The team then applies it to the CountDown game (a game where players use basic arithmetic operations to combine numbers to reach the target number).</p><p>Starting from the initial simple output, the model gradually evolves the strategy of self-correction and search.</p><p>In the following example, the model proposes a solution, verifies itself, and repeatedly corrects it until it solves the problem.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/7fdc1973cccfff685657e93776b421a6\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"645\"/></p><p>In the ablation experiment, the researchers ran Qwen-2. 5-Base (four parameter scales of 0.5 B, 1.5 B, 3B, 7B).</p><p>It turns out that the 0.5 B model is merely guessing a solution and then stopping. Whereas from 1.5 B, the model learns to search, self-verify and correct its solutions, thus being able to achieve higher scores.</p><p>They believe that in this process, the underlying model is the key to performance.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/b5bf9a119224a98c43e0746453632356\" alt=\"\" title=\"\" tg-width=\"1024\" tg-height=\"872\"/></p><p>They also verified that additional instruction fine-tuning (SFT) is not necessary, which also confirms the design decision of R1-Zero.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/9f1a78e6cf55328c051693886f074ca0\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"352\"/></p><p>This is the first open source study to verify that the implementation of LLM inference capabilities can be purely through RL, without supervised fine-tuning</p><p>The difference between the basic model and the instruction model:</p><p><ul style=\"list-style-type: disc;\"><li>The instruction model runs fast, but ultimately performs comparably with the base model</p><p></li><li>The model of instruction output is more structured and readable</p><p></li></ul><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/f32389410d6e5ede00b75380edf87e07\" alt=\"\" title=\"\" tg-width=\"1024\" tg-height=\"891\"/></p><p>In addition, they also found that the specific RL algorithm is not important. Among algorithms such as PPO, GRPO, and PRIME, Long CoT can all emerge and bring good performance.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/617d02e3d20394695f5f569d5f74c0d8\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"1128\"/></p><p>Moreover, models are very dependent on specific tasks in their inference behavior:</p><p><ul style=\"list-style-type: disc;\"><li>For the Countdow task, the model learns to search and self-validate</p><p></li><li>For numeric multiplication tasks, the model instead learns to decompose the problem using distribution rules and solve it step by step</p><p></li></ul><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/96b4239d735ae442de73323b5bd4c965\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"643\"/></p><p>Apple machine learning scientist Yizhe Zhang said that it's so cool that models as small as 1.5 B can also emerge with self-verification capabilities through RL.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/3f01a77c2fff63fd05ba3d19af0bfeab\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"212\"/></p><p><h2 id=\"id_3291092331\">7B model replica, surprising results</h2>The team of Assistant Professor Ho Junxian of the Hong Kong University of Science and Technology (co-authors Huang Yuzhen and Weihao Zeng) used only 8K samples to replicate the training of DeepSeek-R1-Zero and DeepSeek-R1 on the 7B model.</p><p>The results are surprising-the model has achieved very strong results in complex mathematical reasoning.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/51085fb73be2a72ee23054c18e6df557\" alt=\"\" title=\"\" tg-width=\"870\" tg-height=\"185\"/></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/17a612e8c7859f8429ee06ebe37234bb\" alt=\"项目地址:https://github.com/hkust-nlp/simpleRL-reason\" title=\"项目地址:https://github.com/hkust-nlp/simpleRL-reason\" tg-width=\"1080\" tg-height=\"492\"/><span>Project address: https://github.com/hkust-nlp/simpleRL-reason</span></p><p>They take Qwen2.5-Math-7B (the base model) as a starting point and directly perform reinforcement learning on it.</p><p>Throughout the process, no supervised fine-tuning (SFT) was performed and no reward model was used.</p><p>Finally, the model achieved an accuracy of 33.3% on the AIME benchmark, 62.5% on the AMC, and 77.2% on the MATH.</p><p>This performance not only surpasses Qwen2.5-Math-7B-Instruct, but also rivals PRIME and rStar-MATH, which use more than 50 times the amount of data and more complex components!</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/fc28abafaf1b99ce393cf71fc2cfb483\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"613\"/></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/00ddd59c6d8b6ab3b8d2cf89524b23c4\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"553\"/></p><p>Among them, Qwen2.5-7B-SimpleRL-Zero is trained on the Qwen2.5-MATH-7B basic model using only the pure PPO method, and only 8K samples in the MATH dataset are used.</p><p>Qwen2.5-7B-SimpleRL first uses Long CoT supervised fine-tuning (SFT) as a cold start, and then performs reinforcement learning.</p><p>In both methods, the team only used the same 8K MATH samples and nothing more.</p><p>Around step 44, the aha moment appeared! In the response of the model, self-reflection appeared.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/f3aa0b8e2d94bdc891f49c30a47aa475\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"438\"/></p><p>Moreover, in this process, the model also showed longer CoT reasoning ability and self-reflection ability.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/3320ec58f4e229808d3520653e70da3d\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"608\"/></p><p>In the blog, the researchers analyze in detail the experimental setting and the phenomena observed during this reinforcement learning training process, such as the spontaneous formation of long-chain thinking (CoT) and self-reflection mechanisms.</p><p>Similar to DeepSeek R1, the researcher's reinforcement learning scheme is extremely simple, and does not use reward models or MCTS (Monte Carlo Tree Search) techniques.</p><p>They use the PPO algorithm and employ a rule-based reward function to assign rewards according to the format and correctness of the generated output:</p><p><ul style=\"list-style-type: disc;\"><li>Get a reward of +1 if the output provides the final answer in the specified format and is correct</p><p></li><li>If the output provides the final answer but is incorrect, the reward is set to-0.5</p><p></li><li>If the output fails to provide a final answer, the reward is set to-1</p><p></li></ul>The implementation is based on OpenRLHF. Preliminary experiments show that this reward function helps the strategy model to converge quickly and produce the output that conforms to the desired scheme.</p><p><h3 id=\"id_1461976375\" style=\"text-align: center;\">Part 1: SimpleRL-Zero (Reinforcement Learning from Scratch)</h3>Next, the researcher shared with us the dynamic analysis of the training process and some interesting emergence patterns.</p><p><h4 id=\"id_606216327\">Dynamic Analysis of Training Process</h4>As shown below, the accuracy of all benchmarks is steadily improving during training, while the output length shows a trend of first decreasing and then gradually increasing.</p><p>After further investigation, the researchers found that the Qwen2.5-Math-7B basic model tends to generate a large amount of code in the initial stage, which may be due to the distribution characteristics of the original training data of the model.</p><p>The first drop in output length is because reinforcement learning training gradually eliminates this code generation pattern and instead learns to use natural language for reasoning.</p><p>Subsequently, the generation length began to increase again, at which point a self-reflective mechanism emerged.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/c617c450f7e378da91201c2612d6b7dc\" alt=\"训练奖励和输出长度\" title=\"训练奖励和输出长度\" tg-width=\"1076\" tg-height=\"432\"/><span>Training reward and output length</span></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/59ef982da18a27da8c6263d647691fa7\" alt=\"基准测试准确率(pass@1)和输出长度\" title=\"基准测试准确率(pass@1)和输出长度\" tg-width=\"1080\" tg-height=\"924\"/><span>Benchmark accuracy (pass @ 1) and output length</span></p><p><h4 id=\"id_224279804\">The emergence of self-reflective mechanisms</h4>At about step 40 of training, the researchers observed that the model began to form a self-reflective pattern, which is the aha moment (epiphany moment) described in the DeepSeek-R1 paper.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/60580dff773b0a70b356c7267119e819\" alt=\"\" title=\"\" tg-width=\"1054\" tg-height=\"423\"/></p><p><h3 id=\"id_2889548553\" style=\"text-align: center;\">Part 2: SimpleRL (Reinforcement Learning Based on Imitation Warm-Up)</h3>As mentioned earlier, the researcher performed a long CoT SFT warm-up before performing reinforcement learning, using 8,000 MATH sample responses extracted from QwQ-32B-Preview as the SFT dataset.</p><p>The potential advantage of this cold start is that the model already has a long CoT thinking mode and self-reflection ability when starting reinforcement learning, which may achieve faster and better learning results during the reinforcement learning stage.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/c761a6c0f0b865f5c1645cc06ad9869e\" alt=\"\" title=\"\" tg-width=\"1053\" tg-height=\"445\"/></p><p>Compared with the pre-RL training model (Qwen2.5-Math-7B-Base + 8K QwQ knowledge distillation version), the average performance of Qwen2.5-7B-SimpleRL is significantly improved by 6.9 percentage points.</p><p>Furthermore, not only does Qwen2.5-7B-SimpleRL consistently outperform Eurus-2-7B-PRIME, it also surpasses Qwen2.5-7B-SimpleRL-Zero on 3 out of 5 benchmarks.</p><p><h3 id=\"id_1405061093\">Training Process Analysis</h3><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/7ca73d0446fd6af825ac840bba4c7b14\" alt=\"训练奖励和输出长度\" title=\"训练奖励和输出长度\" tg-width=\"1060\" tg-height=\"467\"/><span>Training reward and output length</span></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/07444422d9115a8382ed0f8843badb5f\" alt=\"基准测试准确率(pass@1)和输出长度\" title=\"基准测试准确率(pass@1)和输出长度\" tg-width=\"1080\" tg-height=\"638\"/><span>Benchmark accuracy (pass @ 1) and output length</span></p><p>The training dynamic performance of Qwen2.5-SimpleRL was similar to that of Qwen2.5-SimpleRL-Zero.</p><p>Interestingly, although the researchers first performed long CoT SFT, the reduction in output length was still observed in the early stage of reinforcement learning.</p><p>They speculate that this may be because the inference patterns extracted from QwQ are not suitable for the small strategy model, or are beyond its capabilities.</p><p>Therefore, the model chooses to abandon this model and instead independently develop new long-chain reasoning methods.</p><p>Finally, the researcher summed up the research with a sentence from Leonardo da Vinci-</p><p>Simplicity is the ultimate exquisiteness.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/0031e63d9296afb7d700bb6315dbce70\" alt=\"图片\" title=\"图片\" tg-width=\"437\" tg-height=\"72\"/><span>Pictures</span></p><p><h2 id=\"id_2367102810\">Completely open source replica, HuggingFace is over</h2>Even the HuggingFace team, the world's largest open source platform, officially announced today that it will reproduce all pipelines of DeepSeek R1.</p><p>After the reproduction is completed, all training data, training scripts, etc. will be open source.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/8b6a0735cb87c3f66a56724b33c3f08c\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"436\"/></p><p>This project is called Open R1 and is still in progress. On the day of release, the star mark broke through 1.9 k and gained 142 forks.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/883546a6427f0ae894da543e70c593aa\" alt=\"项目地址:https://github.com/huggingface/open-r1\" title=\"项目地址:https://github.com/huggingface/open-r1\" tg-width=\"1080\" tg-height=\"391\"/><span>Project address: https://github.com/huggingface/open-r1</span></p><p>Guided by the DeepSeek-R1 technical report, the research team divided the entire reproduction process into three key steps.</p><p><ul style=\"list-style-type: disc;\"><li><strong>Step 1:</strong>The R1-Distill model was reproduced by distilling a high-quality corpus from DeepSeek-R1.</p><p></li><li><strong>Step 2:</strong>Reproduce the pure reinforcement learning (RL) process used by DeepSeek to create R1-Zero. This may require curating new large-scale datasets for mathematics, inference, and code tasks.</p><p></li><li><strong>Step 3:</strong>Show how we evolve from a base model to an RL-tuned model through multi-stage training.</p><p></li></ul><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/0b5e37647b6df05c2e00fdcae31c20a1\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"1238\"/></p><p><h2 id=\"id_1672370884\">From Stanford to MIT, R1 becomes the top pick</h2>A sideline project has frightened the world's major technology companies.</p><p>This wave of success of DeepSeek has also become a myth in the industry. The latest screenshots from netizens show that this application has squeezed into the top three in the APP Store efficiency application list.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/c26a66723d3b3a6b69f68a1478db0112\" alt=\"\" title=\"\" tg-width=\"966\" tg-height=\"1200\"/></p><p>In Hugging Face, R1 downloads directly reached the top, and the other three models also dominated the hot list.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/a17a15963d44d09684758a53d9f16d6e\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"1080\"/></p><p>a16z partner Anjney Midha said that overnight, from Stanford to MIT, DeepSeek R1 has become the preferred model for researchers at top universities in the United States.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/95d7a12d8352525e938ddde62a68ba77\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"237\"/></p><p>Some researchers said that DeepSeek basically replaced my need to use ChatGPT.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/aef5ecd49dffdc200d6f1bd7a0647528\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"169\"/></p><p>Chinese AI really shocked the world this time.</p><p></body></html></p>","source":"lsy1569730104218","collect":0,"html":"<!DOCTYPE html>\n<html>\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n<meta name=\"viewport\" content=\"width=device-width,initial-scale=1.0,minimum-scale=1.0,maximum-scale=1.0,user-scalable=no\"/>\n<meta name=\"format-detection\" content=\"telephone=no,email=no,address=no\" />\n<title>The global DeepSeek reappearance frenzy! The myth of Silicon Valley giants collapses, 30 knives witness the aha moment</title>\n<style type=\"text/css\">\na,abbr,acronym,address,applet,article,aside,audio,b,big,blockquote,body,canvas,caption,center,cite,code,dd,del,details,dfn,div,dl,dt,\nem,embed,fieldset,figcaption,figure,footer,form,h1,h2,h3,h4,h5,h6,header,hgroup,html,i,iframe,img,ins,kbd,label,legend,li,mark,menu,nav,\nobject,ol,output,p,pre,q,ruby,s,samp,section,small,span,strike,strong,sub,summary,sup,table,tbody,td,tfoot,th,thead,time,tr,tt,u,ul,var,video{ font:inherit;margin:0;padding:0;vertical-align:baseline;border:0 }\nbody{ font-size:16px; line-height:1.5; color:#999; background:transparent; }\n.wrapper{ overflow:hidden;word-break:break-all;padding:10px; }\nh1,h2{ font-weight:normal; line-height:1.35; margin-bottom:.6em; }\nh3,h4,h5,h6{ line-height:1.35; margin-bottom:1em; }\nh1{ font-size:24px; }\nh2{ font-size:20px; }\nh3{ font-size:18px; }\nh4{ font-size:16px; }\nh5{ font-size:14px; }\nh6{ font-size:12px; }\np,ul,ol,blockquote,dl,table{ margin:1.2em 0; }\nul,ol{ margin-left:2em; }\nul{ list-style:disc; }\nol{ list-style:decimal; }\nli,li p{ margin:10px 0;}\nimg{ max-width:100%;display:block;margin:0 auto 1em; }\nblockquote{ color:#B5B2B1; border-left:3px solid #aaa; padding:1em; }\nstrong,b{font-weight:bold;}\nem,i{font-style:italic;}\ntable{ width:100%;border-collapse:collapse;border-spacing:1px;margin:1em 0;font-size:.9em; }\nth,td{ padding:5px;text-align:left;border:1px solid #aaa; }\nth{ font-weight:bold;background:#5d5d5d; }\n.symbol-link{font-weight:bold;}\n/* header{ border-bottom:1px solid #494756; } */\n.title{ margin:0 0 8px;line-height:1.3;color:#ddd; }\n.meta {color:#5e5c6d;font-size:13px;margin:0 0 .5em; }\na{text-decoration:none; color:#2a4b87;}\n.meta .head { display: inline-block; overflow: hidden}\n.head .h-thumb { width: 30px; height: 30px; margin: 0; padding: 0; border-radius: 50%; float: left;}\n.head .h-content { margin: 0; padding: 0 0 0 9px; float: left;}\n.head .h-name {font-size: 13px; color: #eee; margin: 0;}\n.head .h-time {font-size: 12.5px; color: #7E829C; margin: 0;}\n.small {font-size: 12.5px; display: inline-block; transform: scale(0.9); -webkit-transform: scale(0.9); transform-origin: left; -webkit-transform-origin: left;}\n.smaller {font-size: 12.5px; display: inline-block; transform: scale(0.8); -webkit-transform: scale(0.8); transform-origin: left; -webkit-transform-origin: left;}\n.bt-text {font-size: 12px;margin: 1.5em 0 0 0}\n.bt-text p {margin: 0}\n</style>\n</head>\n<body>\n<div class=\"wrapper\">\n<header>\n<h2 class=\"title\">\nThe global DeepSeek reappearance frenzy! The myth of Silicon Valley giants collapses, 30 knives witness the aha moment\n</h2>\n<h4 class=\"meta\">\n<p class=\"head\">\n<strong class=\"h-name small\">新智元</strong><span class=\"h-time small\">2025-01-26 14:41</span>\n</p>\n</h4>\n</header>\n<article>\n<p><html><head></head><body>These days, Silicon Valley is completely in the aftermath of the earthquake brought by Chinese companies.</p><p>The United States is panicking: Has the center of global artificial intelligence shifted to China?</p><p>At this moment, a wave of frenzy of reappearing DeepSeek around the world also came.</p><p>As LeCun said: This time, it is the victory of open source over closed source!</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/96443f59328eead1fb03ccf0f6e8b4a7\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"519\"/></p><p>In the absence of top-level chips, DeepSeek, which trains breakthrough models with extremely low-cost chips, may threaten the AI hegemony of the United States.</p><p>The competition of large models is no longer a trillion-dollar computing power war.</p><p>The technological advantages and high valuations that big companies such as OpenAI, Meta, and Google are proud of will collapse, and Nvidia's stock price will begin to waver.</p><p>All these views and discussions make people wonder: Is tens of billions of dollars of expenditure really necessary for this industry? Some people even say that a group of geniuses in China's quantitative funds will lead to the collapse of Nasdaq.</p><p>From then on, the era of large models is likely to enter a watershed: super-performance models no longer belong to computing power giants alone, but to everyone.</p><p><h2 id=\"id_1782456954\">For $30, you can see the aha moment</h2>Pan Jiayi, a doctoral student from UC Berkeley, and two other researchers reproduced DeepSeek R1-Zero in the CountDown game.</p><p>They said that the results were quite excellent!</p><p>In the experiment, the team verified that through reinforcement learning RL, 3B's basic language model can also self-verify and search.</p><p>What's even more exciting is that the cost is less than 30 US dollars (about 217 yuan), and you can witness the aha moment with your own eyes.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/9e5c66f649972cc1dcc1b64ac9e2312a\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"468\"/></p><p>The project, called TinyZero, employs the R1-Zero algorithm-given a base language model, hints, and real reward signals, run reinforcement learning.</p><p>The team then applies it to the CountDown game (a game where players use basic arithmetic operations to combine numbers to reach the target number).</p><p>Starting from the initial simple output, the model gradually evolves the strategy of self-correction and search.</p><p>In the following example, the model proposes a solution, verifies itself, and repeatedly corrects it until it solves the problem.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/7fdc1973cccfff685657e93776b421a6\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"645\"/></p><p>In the ablation experiment, the researchers ran Qwen-2. 5-Base (four parameter scales of 0.5 B, 1.5 B, 3B, 7B).</p><p>It turns out that the 0.5 B model is merely guessing a solution and then stopping. Whereas from 1.5 B, the model learns to search, self-verify and correct its solutions, thus being able to achieve higher scores.</p><p>They believe that in this process, the underlying model is the key to performance.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/b5bf9a119224a98c43e0746453632356\" alt=\"\" title=\"\" tg-width=\"1024\" tg-height=\"872\"/></p><p>They also verified that additional instruction fine-tuning (SFT) is not necessary, which also confirms the design decision of R1-Zero.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/9f1a78e6cf55328c051693886f074ca0\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"352\"/></p><p>This is the first open source study to verify that the implementation of LLM inference capabilities can be purely through RL, without supervised fine-tuning</p><p>The difference between the basic model and the instruction model:</p><p><ul style=\"list-style-type: disc;\"><li>The instruction model runs fast, but ultimately performs comparably with the base model</p><p></li><li>The model of instruction output is more structured and readable</p><p></li></ul><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/f32389410d6e5ede00b75380edf87e07\" alt=\"\" title=\"\" tg-width=\"1024\" tg-height=\"891\"/></p><p>In addition, they also found that the specific RL algorithm is not important. Among algorithms such as PPO, GRPO, and PRIME, Long CoT can all emerge and bring good performance.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/617d02e3d20394695f5f569d5f74c0d8\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"1128\"/></p><p>Moreover, models are very dependent on specific tasks in their inference behavior:</p><p><ul style=\"list-style-type: disc;\"><li>For the Countdow task, the model learns to search and self-validate</p><p></li><li>For numeric multiplication tasks, the model instead learns to decompose the problem using distribution rules and solve it step by step</p><p></li></ul><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/96b4239d735ae442de73323b5bd4c965\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"643\"/></p><p>Apple machine learning scientist Yizhe Zhang said that it's so cool that models as small as 1.5 B can also emerge with self-verification capabilities through RL.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/3f01a77c2fff63fd05ba3d19af0bfeab\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"212\"/></p><p><h2 id=\"id_3291092331\">7B model replica, surprising results</h2>The team of Assistant Professor Ho Junxian of the Hong Kong University of Science and Technology (co-authors Huang Yuzhen and Weihao Zeng) used only 8K samples to replicate the training of DeepSeek-R1-Zero and DeepSeek-R1 on the 7B model.</p><p>The results are surprising-the model has achieved very strong results in complex mathematical reasoning.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/51085fb73be2a72ee23054c18e6df557\" alt=\"\" title=\"\" tg-width=\"870\" tg-height=\"185\"/></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/17a612e8c7859f8429ee06ebe37234bb\" alt=\"项目地址:https://github.com/hkust-nlp/simpleRL-reason\" title=\"项目地址:https://github.com/hkust-nlp/simpleRL-reason\" tg-width=\"1080\" tg-height=\"492\"/><span>Project address: https://github.com/hkust-nlp/simpleRL-reason</span></p><p>They take Qwen2.5-Math-7B (the base model) as a starting point and directly perform reinforcement learning on it.</p><p>Throughout the process, no supervised fine-tuning (SFT) was performed and no reward model was used.</p><p>Finally, the model achieved an accuracy of 33.3% on the AIME benchmark, 62.5% on the AMC, and 77.2% on the MATH.</p><p>This performance not only surpasses Qwen2.5-Math-7B-Instruct, but also rivals PRIME and rStar-MATH, which use more than 50 times the amount of data and more complex components!</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/fc28abafaf1b99ce393cf71fc2cfb483\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"613\"/></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/00ddd59c6d8b6ab3b8d2cf89524b23c4\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"553\"/></p><p>Among them, Qwen2.5-7B-SimpleRL-Zero is trained on the Qwen2.5-MATH-7B basic model using only the pure PPO method, and only 8K samples in the MATH dataset are used.</p><p>Qwen2.5-7B-SimpleRL first uses Long CoT supervised fine-tuning (SFT) as a cold start, and then performs reinforcement learning.</p><p>In both methods, the team only used the same 8K MATH samples and nothing more.</p><p>Around step 44, the aha moment appeared! In the response of the model, self-reflection appeared.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/f3aa0b8e2d94bdc891f49c30a47aa475\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"438\"/></p><p>Moreover, in this process, the model also showed longer CoT reasoning ability and self-reflection ability.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/3320ec58f4e229808d3520653e70da3d\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"608\"/></p><p>In the blog, the researchers analyze in detail the experimental setting and the phenomena observed during this reinforcement learning training process, such as the spontaneous formation of long-chain thinking (CoT) and self-reflection mechanisms.</p><p>Similar to DeepSeek R1, the researcher's reinforcement learning scheme is extremely simple, and does not use reward models or MCTS (Monte Carlo Tree Search) techniques.</p><p>They use the PPO algorithm and employ a rule-based reward function to assign rewards according to the format and correctness of the generated output:</p><p><ul style=\"list-style-type: disc;\"><li>Get a reward of +1 if the output provides the final answer in the specified format and is correct</p><p></li><li>If the output provides the final answer but is incorrect, the reward is set to-0.5</p><p></li><li>If the output fails to provide a final answer, the reward is set to-1</p><p></li></ul>The implementation is based on OpenRLHF. Preliminary experiments show that this reward function helps the strategy model to converge quickly and produce the output that conforms to the desired scheme.</p><p><h3 id=\"id_1461976375\" style=\"text-align: center;\">Part 1: SimpleRL-Zero (Reinforcement Learning from Scratch)</h3>Next, the researcher shared with us the dynamic analysis of the training process and some interesting emergence patterns.</p><p><h4 id=\"id_606216327\">Dynamic Analysis of Training Process</h4>As shown below, the accuracy of all benchmarks is steadily improving during training, while the output length shows a trend of first decreasing and then gradually increasing.</p><p>After further investigation, the researchers found that the Qwen2.5-Math-7B basic model tends to generate a large amount of code in the initial stage, which may be due to the distribution characteristics of the original training data of the model.</p><p>The first drop in output length is because reinforcement learning training gradually eliminates this code generation pattern and instead learns to use natural language for reasoning.</p><p>Subsequently, the generation length began to increase again, at which point a self-reflective mechanism emerged.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/c617c450f7e378da91201c2612d6b7dc\" alt=\"训练奖励和输出长度\" title=\"训练奖励和输出长度\" tg-width=\"1076\" tg-height=\"432\"/><span>Training reward and output length</span></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/59ef982da18a27da8c6263d647691fa7\" alt=\"基准测试准确率(pass@1)和输出长度\" title=\"基准测试准确率(pass@1)和输出长度\" tg-width=\"1080\" tg-height=\"924\"/><span>Benchmark accuracy (pass @ 1) and output length</span></p><p><h4 id=\"id_224279804\">The emergence of self-reflective mechanisms</h4>At about step 40 of training, the researchers observed that the model began to form a self-reflective pattern, which is the aha moment (epiphany moment) described in the DeepSeek-R1 paper.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/60580dff773b0a70b356c7267119e819\" alt=\"\" title=\"\" tg-width=\"1054\" tg-height=\"423\"/></p><p><h3 id=\"id_2889548553\" style=\"text-align: center;\">Part 2: SimpleRL (Reinforcement Learning Based on Imitation Warm-Up)</h3>As mentioned earlier, the researcher performed a long CoT SFT warm-up before performing reinforcement learning, using 8,000 MATH sample responses extracted from QwQ-32B-Preview as the SFT dataset.</p><p>The potential advantage of this cold start is that the model already has a long CoT thinking mode and self-reflection ability when starting reinforcement learning, which may achieve faster and better learning results during the reinforcement learning stage.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/c761a6c0f0b865f5c1645cc06ad9869e\" alt=\"\" title=\"\" tg-width=\"1053\" tg-height=\"445\"/></p><p>Compared with the pre-RL training model (Qwen2.5-Math-7B-Base + 8K QwQ knowledge distillation version), the average performance of Qwen2.5-7B-SimpleRL is significantly improved by 6.9 percentage points.</p><p>Furthermore, not only does Qwen2.5-7B-SimpleRL consistently outperform Eurus-2-7B-PRIME, it also surpasses Qwen2.5-7B-SimpleRL-Zero on 3 out of 5 benchmarks.</p><p><h3 id=\"id_1405061093\">Training Process Analysis</h3><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/7ca73d0446fd6af825ac840bba4c7b14\" alt=\"训练奖励和输出长度\" title=\"训练奖励和输出长度\" tg-width=\"1060\" tg-height=\"467\"/><span>Training reward and output length</span></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/07444422d9115a8382ed0f8843badb5f\" alt=\"基准测试准确率(pass@1)和输出长度\" title=\"基准测试准确率(pass@1)和输出长度\" tg-width=\"1080\" tg-height=\"638\"/><span>Benchmark accuracy (pass @ 1) and output length</span></p><p>The training dynamic performance of Qwen2.5-SimpleRL was similar to that of Qwen2.5-SimpleRL-Zero.</p><p>Interestingly, although the researchers first performed long CoT SFT, the reduction in output length was still observed in the early stage of reinforcement learning.</p><p>They speculate that this may be because the inference patterns extracted from QwQ are not suitable for the small strategy model, or are beyond its capabilities.</p><p>Therefore, the model chooses to abandon this model and instead independently develop new long-chain reasoning methods.</p><p>Finally, the researcher summed up the research with a sentence from Leonardo da Vinci-</p><p>Simplicity is the ultimate exquisiteness.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/0031e63d9296afb7d700bb6315dbce70\" alt=\"图片\" title=\"图片\" tg-width=\"437\" tg-height=\"72\"/><span>Pictures</span></p><p><h2 id=\"id_2367102810\">Completely open source replica, HuggingFace is over</h2>Even the HuggingFace team, the world's largest open source platform, officially announced today that it will reproduce all pipelines of DeepSeek R1.</p><p>After the reproduction is completed, all training data, training scripts, etc. will be open source.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/8b6a0735cb87c3f66a56724b33c3f08c\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"436\"/></p><p>This project is called Open R1 and is still in progress. On the day of release, the star mark broke through 1.9 k and gained 142 forks.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/883546a6427f0ae894da543e70c593aa\" alt=\"项目地址:https://github.com/huggingface/open-r1\" title=\"项目地址:https://github.com/huggingface/open-r1\" tg-width=\"1080\" tg-height=\"391\"/><span>Project address: https://github.com/huggingface/open-r1</span></p><p>Guided by the DeepSeek-R1 technical report, the research team divided the entire reproduction process into three key steps.</p><p><ul style=\"list-style-type: disc;\"><li><strong>Step 1:</strong>The R1-Distill model was reproduced by distilling a high-quality corpus from DeepSeek-R1.</p><p></li><li><strong>Step 2:</strong>Reproduce the pure reinforcement learning (RL) process used by DeepSeek to create R1-Zero. This may require curating new large-scale datasets for mathematics, inference, and code tasks.</p><p></li><li><strong>Step 3:</strong>Show how we evolve from a base model to an RL-tuned model through multi-stage training.</p><p></li></ul><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/0b5e37647b6df05c2e00fdcae31c20a1\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"1238\"/></p><p><h2 id=\"id_1672370884\">From Stanford to MIT, R1 becomes the top pick</h2>A sideline project has frightened the world's major technology companies.</p><p>This wave of success of DeepSeek has also become a myth in the industry. The latest screenshots from netizens show that this application has squeezed into the top three in the APP Store efficiency application list.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/c26a66723d3b3a6b69f68a1478db0112\" alt=\"\" title=\"\" tg-width=\"966\" tg-height=\"1200\"/></p><p>In Hugging Face, R1 downloads directly reached the top, and the other three models also dominated the hot list.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/a17a15963d44d09684758a53d9f16d6e\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"1080\"/></p><p>a16z partner Anjney Midha said that overnight, from Stanford to MIT, DeepSeek R1 has become the preferred model for researchers at top universities in the United States.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/95d7a12d8352525e938ddde62a68ba77\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"237\"/></p><p>Some researchers said that DeepSeek basically replaced my need to use ChatGPT.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/aef5ecd49dffdc200d6f1bd7a0647528\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"169\"/></p><p>Chinese AI really shocked the world this time.</p><p></body></html></p>\n<div class=\"bt-text\">\n\n\n<p> source:<a href=\"https://mp.weixin.qq.com/s/o41vPh9eJCVjCRUE4u5npA\">新智元</a></p>\n\n\n</div>\n</article>\n</div>\n</body>\n</html>\n","type":0,"thumbnail":"https://community-static.tradeup.com/news/4abae3de7bed37b48c01721c28c51931","relate_stocks":{},"source_url":"https://mp.weixin.qq.com/s/o41vPh9eJCVjCRUE4u5npA","is_english":false,"share_image_url":"https://static.laohu8.com/e9f99090a1c2ed51c021029395664489","article_id":"1161961252","content_text":"这些天,硅谷彻底处于中国公司带来的大地震余波中。全美都在恐慌:是否全球人工智能的中心已经转移到了中国?就在这当口,全球复现DeepSeek的一波狂潮也来了。诚如LeCun所言:「这一次,正是开源对闭源的胜利!」在没有顶级芯片的情况下,以极低成本芯片训出突破性模型的DeepSeek,或将威胁到美国的AI霸权。大模型比拼的不再是动辄千万亿美元的算力战。OpenAI、Meta、谷歌这些大公司引以为傲的技术优势和高估值将会瓦解,英伟达的股价将开始动摇。种种这些观点和讨论,让人不禁怀疑:数百亿美元支出,对这个行业真的必要吗?甚至有人说,中国量化基金的一群天才,将导致纳斯达克崩盘。从此,大模型时代很可能会进入一个分水岭:超强性能的模型不再独属于算力巨头,而是属于每个人。30美金,就能看到「啊哈」时刻来自UC伯克利博士生潘家怡和另两位研究人员,在CountDown游戏中复现了DeepSeek R1-Zero。他们表示,结果相当出色!实验中,团队验证了通过强化学习RL,3B的基础语言模型也能够自我验证和搜索。更令人兴奋的是,成本不到30美金(约217元),就可以亲眼见证「啊哈」时刻。这个项目叫做TinyZero,采用了R1-Zero算法——给定一个基础语言模型、提示和真实奖励信号,运行强化学习。然后,团队将其应用在CountDown游戏中(这是一个玩家使用基础算术运算,将数字组合以达到目标数字的游戏)。模型从最初的简单输出开始,逐步进化出自我纠正和搜索的策略。在以下示例中,模型提出了解决方案,自我验证,并反复纠正,直到解决问题为止。在消融实验中,研究人员运行了Qwen-2.5-Base(0.5B、1.5B、3B、7B四种参数规模)。结果发现,0.5B模型仅仅是猜测一个解决方案然后停止。而从1.5B开始,模型学会了搜索、自我验证和修正其解决方案,从而能够获得更高的分数。他们认为,在这个过程,基础模型的是性能的关键。他们还验证了,额外的指令微调(SFT)并非是必要的,这也印证了R1-Zero的设计决策。这是首个验证LLM推理能力的实现可以纯粹通过RL,无需监督微调的开源研究基础模型和指令模型两者区别:指令模型运行速度快,但最终表现与基础模型相当指令输出的模型更具结构性和可读性此外,他们还发现,具体的RL算法并不重要。PPO、GRPO、PRIME这些算法中,长思维链(Long CoT)都能够涌现,且带来不错的性能表现。而且,模型在推理行为中非常依赖于具体的任务:对于Countdow任务,模型学习进行搜索和自我验证对于数字乘法任务,模型反而学习使用分布规则分解问题,并逐步解决苹果机器学习科学家Yizhe Zhang对此表示,太酷了,小到1.5B的模型,也能通过RL涌现出自我验证的能力。7B模型复刻,结果令人惊讶港科大助理教授何俊贤的团队(共同一作黄裕振、Weihao Zeng),只用了8K个样本,就在7B模型上复刻出了DeepSeek-R1-Zero和DeepSeek-R1的训练。结果令人惊喜——模型在复杂的数学推理上取得了十分强劲结果。项目地址:https://github.com/hkust-nlp/simpleRL-reason他们以Qwen2.5-Math-7B(基础模型)为起点,直接对其进行强化学习。整个过程中,没有进行监督微调(SFT),也没有使用奖励模型。最终,模型在AIME基准上实现了33.3%的准确率,在AMC上为62.5%,在MATH上为77.2%。这一表现不仅超越了Qwen2.5-Math-7B-Instruct,并且还可以和使用超过50倍数据量和更复杂组件的PRIME和rStar-MATH相媲美!其中,Qwen2.5-7B-SimpleRL-Zero是在Qwen2.5-Math-7B基础模型上仅使用纯PPO方法训练的,仅采用了MATH数据集中的8K样本。Qwen2.5-7B-SimpleRL则首先通过Long CoT监督微调(SFT)作为冷启动,然后再进行强化学习。在这两种方法中,团队都只使用了相同的8K MATH样本,仅此而已。大概在第44步的时候,「啊哈时刻」出现了!模型的响应中,出现了自我反思。并且,在这个过程中,模型还显现了更长的CoT推理能力和自我反思能力。在博客中,研究者详细剖析了实验设置,以及在这个强化学习训练过程中所观察到的现象,例如长链式思考(CoT)和自我反思机制的自发形成。与DeepSeek R1类似,研究者的强化学习方案极其简单,没有使用奖励模型或MCTS(蒙特卡洛树搜索)类技术。他们使用的是PPO算法,并采用基于规则的奖励函数,根据生成输出的格式和正确性分配奖励:如果输出以指定格式提供最终答案且正确,获得+1的奖励如果输出提供最终答案但不正确,奖励设为-0.5如果输出未能提供最终答案,奖励设为-1该实现基于OpenRLHF。初步试验表明,这个奖励函数有助于策略模型快速收敛,产生符合期望格式的输出。第一部分:SimpleRL-Zero(从头开始的强化学习)接下来,研究者为我们分享了训练过程动态分析和一些有趣的涌现模式。训练过程动态分析如下所示,所有基准测试的准确率在训练过程中都在稳步提高,而输出长度则呈现先减少后逐渐增加的趋势。经过进一步调查,研究者发现,Qwen2.5-Math-7B基础模型在初始阶段倾向于生成大量代码,这可能源于模型原始训练数据的分布特征。输出长度的首次下降,是因为强化学习训练逐渐消除了这种代码生成模式,转而学会使用自然语言进行推理。随后,生成长度开始再次增加,此时出现了自我反思机制。训练奖励和输出长度基准测试准确率(pass@1)和输出长度自我反思机制的涌现在训练到第 40 步左右时,研究者观察到:模型开始形成自我反思模式,这正是DeepSeek-R1论文中所描述的「aha moment」(顿悟时刻)。第二部分:SimpleRL(基于模仿预热的强化学习)如前所述,研究者在进行强化学习之前,先进行了long CoT SFT预热,使用了8,000个从QwQ-32B-Preview中提取的MATH示例响应作为SFT数据集。这种冷启动的潜在优势在于:模型在开始强化学习时已具备long CoT思维模式和自我反思能力,从而可能在强化学习阶段实现更快更好的学习效果。与RL训练前的模型(Qwen2.5-Math-7B-Base + 8K QwQ知识蒸馏版本)相比,Qwen2.5-7B-SimpleRL的平均性能显著提升了6.9个百分点。此外,Qwen2.5-7B-SimpleRL不仅持续优于Eurus-2-7B-PRIME,还在5个基准测试中的3个上超越了Qwen2.5-7B-SimpleRL-Zero。训练过程分析训练奖励和输出长度基准测试准确率(pass@1)和输出长度Qwen2.5-SimpleRL的训练动态表现与Qwen2.5-SimpleRL-Zero相似。有趣的是,尽管研究者先进行了long CoT SFT,但在强化学习初期仍然观察到输出长度减少的现象。他们推测,这可能是因为从QwQ提取的推理模式不适合小型策略模型,或超出了其能力范围。因此,模型选择放弃这种模式,转而自主发展新的长链式推理方式。最后,研究者用达芬奇的一句话,对这项研究做了总结——简约,便是最终极的精致。图片完全开源复刻,HuggingFace下场了甚至,就连全球最大开源平台HuggingFace团队,今天官宣复刻DeepSeek R1所有pipeline。复刻完成后,所有的训练数据、训练脚本等等,将全部开源。这个项目叫做Open R1,当前还在进行中。发布到一天,星标冲破1.9k,斩获142个fork。项目地址:https://github.com/huggingface/open-r1研究团队以DeepSeek-R1技术报告为指导,将整个复刻过程划分为三个关键步骤。步骤 1:通过从DeepSeek-R1蒸馏高质量语料库,复现R1-Distill模型。步骤 2:复现DeepSeek用于创建R1-Zero的纯强化学习(RL)流程。这可能需要为数学、推理和代码任务策划新的大规模数据集。步骤 3:展示我们如何通过多阶段训练,从基础模型发展到经过RL调优的模型。从斯坦福到MIT,R1成为首选一个副业项目,让全世界科技大厂为之惶恐。DeepSeek这波成功,也成为业界的神话,网友最新截图显示,这款应用已经在APP Store「效率」应用榜单中挤进前三。在Hugging Face中,R1下载量直接登顶,另外3个模型也霸占着热榜。a16z合伙人Anjney Midha称,一夜之间,从斯坦福到MIT,DeepSeek R1已经成为美国顶尖高校研究人员「首选模型」。还有研究人员表示,DeepSeek基本上取代了我用ChatGPT的需求。中国AI,这一次真的震撼了世界。","news_type":1,"symbols_score_info":{}},"isVote":1,"tweetType":1,"viewCount":551,"authorTweetTopStatus":1,"verified":2,"comments":[],"imageCount":0,"langContent":"EN","totalScore":0}],"hots":[{"id":396695453823464,"gmtCreate":1737874687723,"gmtModify":1737874691482,"author":{"id":"4159183300910622","authorId":"4159183300910622","name":"夜轻歌JJ","avatar":"https://community-static.tradeup.com/news/f8383a31fb94e82bfea3a69c529690d6","crmLevel":11,"crmLevelSwitch":1,"followedFlag":false,"authorIdStr":"4159183300910622","idStr":"4159183300910622"},"themes":[],"htmlText":"wow","listText":"wow","text":"wow","images":[],"top":1,"highlighted":1,"essential":1,"paper":1,"likeSize":0,"commentSize":0,"repostSize":0,"link":"https://ttm.financial/post/396695453823464","repostId":"1161961252","repostType":2,"repost":{"id":"1161961252","kind":"news","pubTimestamp":1737873715,"share":"https://ttm.financial/m/news/1161961252?lang=en_US&edition=fundamental","pubTime":"2025-01-26 14:41","market":"hk","language":"zh","title":"The global DeepSeek reappearance frenzy! The myth of Silicon Valley giants collapses, 30 knives witness the aha moment","url":"https://stock-news.laohu8.com/highlight/detail?id=1161961252","media":"新智元","summary":"就在这当口,全球复现DeepSeek的一波狂潮也来了。更令人兴奋的是,成本不到30美金,就可以亲眼见证「啊哈」时刻。7B模型复刻,结果令人惊讶港科大助理教授何俊贤的团队,只用了8K个样本,就在7B模型上复刻出了DeepSeek-R1-Zero和DeepSeek-R1的训练。与DeepSeek R1类似,研究者的强化学习方案极其简单,没有使用奖励模型或MCTS类技术。随后,生成长度开始再次增加,此时出现了自我反思机制。","content":"<p><html><head></head><body>These days, Silicon Valley is completely in the aftermath of the earthquake brought by Chinese companies.</p><p>The United States is panicking: Has the center of global artificial intelligence shifted to China?</p><p>At this moment, a wave of frenzy of reappearing DeepSeek around the world also came.</p><p>As LeCun said: This time, it is the victory of open source over closed source!</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/96443f59328eead1fb03ccf0f6e8b4a7\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"519\"/></p><p>In the absence of top-level chips, DeepSeek, which trains breakthrough models with extremely low-cost chips, may threaten the AI hegemony of the United States.</p><p>The competition of large models is no longer a trillion-dollar computing power war.</p><p>The technological advantages and high valuations that big companies such as OpenAI, Meta, and Google are proud of will collapse, and Nvidia's stock price will begin to waver.</p><p>All these views and discussions make people wonder: Is tens of billions of dollars of expenditure really necessary for this industry? Some people even say that a group of geniuses in China's quantitative funds will lead to the collapse of Nasdaq.</p><p>From then on, the era of large models is likely to enter a watershed: super-performance models no longer belong to computing power giants alone, but to everyone.</p><p><h2 id=\"id_1782456954\">For $30, you can see the aha moment</h2>Pan Jiayi, a doctoral student from UC Berkeley, and two other researchers reproduced DeepSeek R1-Zero in the CountDown game.</p><p>They said that the results were quite excellent!</p><p>In the experiment, the team verified that through reinforcement learning RL, 3B's basic language model can also self-verify and search.</p><p>What's even more exciting is that the cost is less than 30 US dollars (about 217 yuan), and you can witness the aha moment with your own eyes.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/9e5c66f649972cc1dcc1b64ac9e2312a\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"468\"/></p><p>The project, called TinyZero, employs the R1-Zero algorithm-given a base language model, hints, and real reward signals, run reinforcement learning.</p><p>The team then applies it to the CountDown game (a game where players use basic arithmetic operations to combine numbers to reach the target number).</p><p>Starting from the initial simple output, the model gradually evolves the strategy of self-correction and search.</p><p>In the following example, the model proposes a solution, verifies itself, and repeatedly corrects it until it solves the problem.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/7fdc1973cccfff685657e93776b421a6\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"645\"/></p><p>In the ablation experiment, the researchers ran Qwen-2. 5-Base (four parameter scales of 0.5 B, 1.5 B, 3B, 7B).</p><p>It turns out that the 0.5 B model is merely guessing a solution and then stopping. Whereas from 1.5 B, the model learns to search, self-verify and correct its solutions, thus being able to achieve higher scores.</p><p>They believe that in this process, the underlying model is the key to performance.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/b5bf9a119224a98c43e0746453632356\" alt=\"\" title=\"\" tg-width=\"1024\" tg-height=\"872\"/></p><p>They also verified that additional instruction fine-tuning (SFT) is not necessary, which also confirms the design decision of R1-Zero.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/9f1a78e6cf55328c051693886f074ca0\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"352\"/></p><p>This is the first open source study to verify that the implementation of LLM inference capabilities can be purely through RL, without supervised fine-tuning</p><p>The difference between the basic model and the instruction model:</p><p><ul style=\"list-style-type: disc;\"><li>The instruction model runs fast, but ultimately performs comparably with the base model</p><p></li><li>The model of instruction output is more structured and readable</p><p></li></ul><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/f32389410d6e5ede00b75380edf87e07\" alt=\"\" title=\"\" tg-width=\"1024\" tg-height=\"891\"/></p><p>In addition, they also found that the specific RL algorithm is not important. Among algorithms such as PPO, GRPO, and PRIME, Long CoT can all emerge and bring good performance.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/617d02e3d20394695f5f569d5f74c0d8\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"1128\"/></p><p>Moreover, models are very dependent on specific tasks in their inference behavior:</p><p><ul style=\"list-style-type: disc;\"><li>For the Countdow task, the model learns to search and self-validate</p><p></li><li>For numeric multiplication tasks, the model instead learns to decompose the problem using distribution rules and solve it step by step</p><p></li></ul><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/96b4239d735ae442de73323b5bd4c965\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"643\"/></p><p>Apple machine learning scientist Yizhe Zhang said that it's so cool that models as small as 1.5 B can also emerge with self-verification capabilities through RL.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/3f01a77c2fff63fd05ba3d19af0bfeab\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"212\"/></p><p><h2 id=\"id_3291092331\">7B model replica, surprising results</h2>The team of Assistant Professor Ho Junxian of the Hong Kong University of Science and Technology (co-authors Huang Yuzhen and Weihao Zeng) used only 8K samples to replicate the training of DeepSeek-R1-Zero and DeepSeek-R1 on the 7B model.</p><p>The results are surprising-the model has achieved very strong results in complex mathematical reasoning.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/51085fb73be2a72ee23054c18e6df557\" alt=\"\" title=\"\" tg-width=\"870\" tg-height=\"185\"/></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/17a612e8c7859f8429ee06ebe37234bb\" alt=\"项目地址:https://github.com/hkust-nlp/simpleRL-reason\" title=\"项目地址:https://github.com/hkust-nlp/simpleRL-reason\" tg-width=\"1080\" tg-height=\"492\"/><span>Project address: https://github.com/hkust-nlp/simpleRL-reason</span></p><p>They take Qwen2.5-Math-7B (the base model) as a starting point and directly perform reinforcement learning on it.</p><p>Throughout the process, no supervised fine-tuning (SFT) was performed and no reward model was used.</p><p>Finally, the model achieved an accuracy of 33.3% on the AIME benchmark, 62.5% on the AMC, and 77.2% on the MATH.</p><p>This performance not only surpasses Qwen2.5-Math-7B-Instruct, but also rivals PRIME and rStar-MATH, which use more than 50 times the amount of data and more complex components!</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/fc28abafaf1b99ce393cf71fc2cfb483\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"613\"/></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/00ddd59c6d8b6ab3b8d2cf89524b23c4\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"553\"/></p><p>Among them, Qwen2.5-7B-SimpleRL-Zero is trained on the Qwen2.5-MATH-7B basic model using only the pure PPO method, and only 8K samples in the MATH dataset are used.</p><p>Qwen2.5-7B-SimpleRL first uses Long CoT supervised fine-tuning (SFT) as a cold start, and then performs reinforcement learning.</p><p>In both methods, the team only used the same 8K MATH samples and nothing more.</p><p>Around step 44, the aha moment appeared! In the response of the model, self-reflection appeared.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/f3aa0b8e2d94bdc891f49c30a47aa475\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"438\"/></p><p>Moreover, in this process, the model also showed longer CoT reasoning ability and self-reflection ability.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/3320ec58f4e229808d3520653e70da3d\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"608\"/></p><p>In the blog, the researchers analyze in detail the experimental setting and the phenomena observed during this reinforcement learning training process, such as the spontaneous formation of long-chain thinking (CoT) and self-reflection mechanisms.</p><p>Similar to DeepSeek R1, the researcher's reinforcement learning scheme is extremely simple, and does not use reward models or MCTS (Monte Carlo Tree Search) techniques.</p><p>They use the PPO algorithm and employ a rule-based reward function to assign rewards according to the format and correctness of the generated output:</p><p><ul style=\"list-style-type: disc;\"><li>Get a reward of +1 if the output provides the final answer in the specified format and is correct</p><p></li><li>If the output provides the final answer but is incorrect, the reward is set to-0.5</p><p></li><li>If the output fails to provide a final answer, the reward is set to-1</p><p></li></ul>The implementation is based on OpenRLHF. Preliminary experiments show that this reward function helps the strategy model to converge quickly and produce the output that conforms to the desired scheme.</p><p><h3 id=\"id_1461976375\" style=\"text-align: center;\">Part 1: SimpleRL-Zero (Reinforcement Learning from Scratch)</h3>Next, the researcher shared with us the dynamic analysis of the training process and some interesting emergence patterns.</p><p><h4 id=\"id_606216327\">Dynamic Analysis of Training Process</h4>As shown below, the accuracy of all benchmarks is steadily improving during training, while the output length shows a trend of first decreasing and then gradually increasing.</p><p>After further investigation, the researchers found that the Qwen2.5-Math-7B basic model tends to generate a large amount of code in the initial stage, which may be due to the distribution characteristics of the original training data of the model.</p><p>The first drop in output length is because reinforcement learning training gradually eliminates this code generation pattern and instead learns to use natural language for reasoning.</p><p>Subsequently, the generation length began to increase again, at which point a self-reflective mechanism emerged.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/c617c450f7e378da91201c2612d6b7dc\" alt=\"训练奖励和输出长度\" title=\"训练奖励和输出长度\" tg-width=\"1076\" tg-height=\"432\"/><span>Training reward and output length</span></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/59ef982da18a27da8c6263d647691fa7\" alt=\"基准测试准确率(pass@1)和输出长度\" title=\"基准测试准确率(pass@1)和输出长度\" tg-width=\"1080\" tg-height=\"924\"/><span>Benchmark accuracy (pass @ 1) and output length</span></p><p><h4 id=\"id_224279804\">The emergence of self-reflective mechanisms</h4>At about step 40 of training, the researchers observed that the model began to form a self-reflective pattern, which is the aha moment (epiphany moment) described in the DeepSeek-R1 paper.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/60580dff773b0a70b356c7267119e819\" alt=\"\" title=\"\" tg-width=\"1054\" tg-height=\"423\"/></p><p><h3 id=\"id_2889548553\" style=\"text-align: center;\">Part 2: SimpleRL (Reinforcement Learning Based on Imitation Warm-Up)</h3>As mentioned earlier, the researcher performed a long CoT SFT warm-up before performing reinforcement learning, using 8,000 MATH sample responses extracted from QwQ-32B-Preview as the SFT dataset.</p><p>The potential advantage of this cold start is that the model already has a long CoT thinking mode and self-reflection ability when starting reinforcement learning, which may achieve faster and better learning results during the reinforcement learning stage.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/c761a6c0f0b865f5c1645cc06ad9869e\" alt=\"\" title=\"\" tg-width=\"1053\" tg-height=\"445\"/></p><p>Compared with the pre-RL training model (Qwen2.5-Math-7B-Base + 8K QwQ knowledge distillation version), the average performance of Qwen2.5-7B-SimpleRL is significantly improved by 6.9 percentage points.</p><p>Furthermore, not only does Qwen2.5-7B-SimpleRL consistently outperform Eurus-2-7B-PRIME, it also surpasses Qwen2.5-7B-SimpleRL-Zero on 3 out of 5 benchmarks.</p><p><h3 id=\"id_1405061093\">Training Process Analysis</h3><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/7ca73d0446fd6af825ac840bba4c7b14\" alt=\"训练奖励和输出长度\" title=\"训练奖励和输出长度\" tg-width=\"1060\" tg-height=\"467\"/><span>Training reward and output length</span></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/07444422d9115a8382ed0f8843badb5f\" alt=\"基准测试准确率(pass@1)和输出长度\" title=\"基准测试准确率(pass@1)和输出长度\" tg-width=\"1080\" tg-height=\"638\"/><span>Benchmark accuracy (pass @ 1) and output length</span></p><p>The training dynamic performance of Qwen2.5-SimpleRL was similar to that of Qwen2.5-SimpleRL-Zero.</p><p>Interestingly, although the researchers first performed long CoT SFT, the reduction in output length was still observed in the early stage of reinforcement learning.</p><p>They speculate that this may be because the inference patterns extracted from QwQ are not suitable for the small strategy model, or are beyond its capabilities.</p><p>Therefore, the model chooses to abandon this model and instead independently develop new long-chain reasoning methods.</p><p>Finally, the researcher summed up the research with a sentence from Leonardo da Vinci-</p><p>Simplicity is the ultimate exquisiteness.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/0031e63d9296afb7d700bb6315dbce70\" alt=\"图片\" title=\"图片\" tg-width=\"437\" tg-height=\"72\"/><span>Pictures</span></p><p><h2 id=\"id_2367102810\">Completely open source replica, HuggingFace is over</h2>Even the HuggingFace team, the world's largest open source platform, officially announced today that it will reproduce all pipelines of DeepSeek R1.</p><p>After the reproduction is completed, all training data, training scripts, etc. will be open source.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/8b6a0735cb87c3f66a56724b33c3f08c\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"436\"/></p><p>This project is called Open R1 and is still in progress. On the day of release, the star mark broke through 1.9 k and gained 142 forks.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/883546a6427f0ae894da543e70c593aa\" alt=\"项目地址:https://github.com/huggingface/open-r1\" title=\"项目地址:https://github.com/huggingface/open-r1\" tg-width=\"1080\" tg-height=\"391\"/><span>Project address: https://github.com/huggingface/open-r1</span></p><p>Guided by the DeepSeek-R1 technical report, the research team divided the entire reproduction process into three key steps.</p><p><ul style=\"list-style-type: disc;\"><li><strong>Step 1:</strong>The R1-Distill model was reproduced by distilling a high-quality corpus from DeepSeek-R1.</p><p></li><li><strong>Step 2:</strong>Reproduce the pure reinforcement learning (RL) process used by DeepSeek to create R1-Zero. This may require curating new large-scale datasets for mathematics, inference, and code tasks.</p><p></li><li><strong>Step 3:</strong>Show how we evolve from a base model to an RL-tuned model through multi-stage training.</p><p></li></ul><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/0b5e37647b6df05c2e00fdcae31c20a1\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"1238\"/></p><p><h2 id=\"id_1672370884\">From Stanford to MIT, R1 becomes the top pick</h2>A sideline project has frightened the world's major technology companies.</p><p>This wave of success of DeepSeek has also become a myth in the industry. The latest screenshots from netizens show that this application has squeezed into the top three in the APP Store efficiency application list.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/c26a66723d3b3a6b69f68a1478db0112\" alt=\"\" title=\"\" tg-width=\"966\" tg-height=\"1200\"/></p><p>In Hugging Face, R1 downloads directly reached the top, and the other three models also dominated the hot list.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/a17a15963d44d09684758a53d9f16d6e\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"1080\"/></p><p>a16z partner Anjney Midha said that overnight, from Stanford to MIT, DeepSeek R1 has become the preferred model for researchers at top universities in the United States.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/95d7a12d8352525e938ddde62a68ba77\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"237\"/></p><p>Some researchers said that DeepSeek basically replaced my need to use ChatGPT.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/aef5ecd49dffdc200d6f1bd7a0647528\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"169\"/></p><p>Chinese AI really shocked the world this time.</p><p></body></html></p>","source":"lsy1569730104218","collect":0,"html":"<!DOCTYPE html>\n<html>\n<head>\n<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />\n<meta name=\"viewport\" content=\"width=device-width,initial-scale=1.0,minimum-scale=1.0,maximum-scale=1.0,user-scalable=no\"/>\n<meta name=\"format-detection\" content=\"telephone=no,email=no,address=no\" />\n<title>The global DeepSeek reappearance frenzy! The myth of Silicon Valley giants collapses, 30 knives witness the aha moment</title>\n<style type=\"text/css\">\na,abbr,acronym,address,applet,article,aside,audio,b,big,blockquote,body,canvas,caption,center,cite,code,dd,del,details,dfn,div,dl,dt,\nem,embed,fieldset,figcaption,figure,footer,form,h1,h2,h3,h4,h5,h6,header,hgroup,html,i,iframe,img,ins,kbd,label,legend,li,mark,menu,nav,\nobject,ol,output,p,pre,q,ruby,s,samp,section,small,span,strike,strong,sub,summary,sup,table,tbody,td,tfoot,th,thead,time,tr,tt,u,ul,var,video{ font:inherit;margin:0;padding:0;vertical-align:baseline;border:0 }\nbody{ font-size:16px; line-height:1.5; color:#999; background:transparent; }\n.wrapper{ overflow:hidden;word-break:break-all;padding:10px; }\nh1,h2{ font-weight:normal; line-height:1.35; margin-bottom:.6em; }\nh3,h4,h5,h6{ line-height:1.35; margin-bottom:1em; }\nh1{ font-size:24px; }\nh2{ font-size:20px; }\nh3{ font-size:18px; }\nh4{ font-size:16px; }\nh5{ font-size:14px; }\nh6{ font-size:12px; }\np,ul,ol,blockquote,dl,table{ margin:1.2em 0; }\nul,ol{ margin-left:2em; }\nul{ list-style:disc; }\nol{ list-style:decimal; }\nli,li p{ margin:10px 0;}\nimg{ max-width:100%;display:block;margin:0 auto 1em; }\nblockquote{ color:#B5B2B1; border-left:3px solid #aaa; padding:1em; }\nstrong,b{font-weight:bold;}\nem,i{font-style:italic;}\ntable{ width:100%;border-collapse:collapse;border-spacing:1px;margin:1em 0;font-size:.9em; }\nth,td{ padding:5px;text-align:left;border:1px solid #aaa; }\nth{ font-weight:bold;background:#5d5d5d; }\n.symbol-link{font-weight:bold;}\n/* header{ border-bottom:1px solid #494756; } */\n.title{ margin:0 0 8px;line-height:1.3;color:#ddd; }\n.meta {color:#5e5c6d;font-size:13px;margin:0 0 .5em; }\na{text-decoration:none; color:#2a4b87;}\n.meta .head { display: inline-block; overflow: hidden}\n.head .h-thumb { width: 30px; height: 30px; margin: 0; padding: 0; border-radius: 50%; float: left;}\n.head .h-content { margin: 0; padding: 0 0 0 9px; float: left;}\n.head .h-name {font-size: 13px; color: #eee; margin: 0;}\n.head .h-time {font-size: 12.5px; color: #7E829C; margin: 0;}\n.small {font-size: 12.5px; display: inline-block; transform: scale(0.9); -webkit-transform: scale(0.9); transform-origin: left; -webkit-transform-origin: left;}\n.smaller {font-size: 12.5px; display: inline-block; transform: scale(0.8); -webkit-transform: scale(0.8); transform-origin: left; -webkit-transform-origin: left;}\n.bt-text {font-size: 12px;margin: 1.5em 0 0 0}\n.bt-text p {margin: 0}\n</style>\n</head>\n<body>\n<div class=\"wrapper\">\n<header>\n<h2 class=\"title\">\nThe global DeepSeek reappearance frenzy! The myth of Silicon Valley giants collapses, 30 knives witness the aha moment\n</h2>\n<h4 class=\"meta\">\n<p class=\"head\">\n<strong class=\"h-name small\">新智元</strong><span class=\"h-time small\">2025-01-26 14:41</span>\n</p>\n</h4>\n</header>\n<article>\n<p><html><head></head><body>These days, Silicon Valley is completely in the aftermath of the earthquake brought by Chinese companies.</p><p>The United States is panicking: Has the center of global artificial intelligence shifted to China?</p><p>At this moment, a wave of frenzy of reappearing DeepSeek around the world also came.</p><p>As LeCun said: This time, it is the victory of open source over closed source!</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/96443f59328eead1fb03ccf0f6e8b4a7\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"519\"/></p><p>In the absence of top-level chips, DeepSeek, which trains breakthrough models with extremely low-cost chips, may threaten the AI hegemony of the United States.</p><p>The competition of large models is no longer a trillion-dollar computing power war.</p><p>The technological advantages and high valuations that big companies such as OpenAI, Meta, and Google are proud of will collapse, and Nvidia's stock price will begin to waver.</p><p>All these views and discussions make people wonder: Is tens of billions of dollars of expenditure really necessary for this industry? Some people even say that a group of geniuses in China's quantitative funds will lead to the collapse of Nasdaq.</p><p>From then on, the era of large models is likely to enter a watershed: super-performance models no longer belong to computing power giants alone, but to everyone.</p><p><h2 id=\"id_1782456954\">For $30, you can see the aha moment</h2>Pan Jiayi, a doctoral student from UC Berkeley, and two other researchers reproduced DeepSeek R1-Zero in the CountDown game.</p><p>They said that the results were quite excellent!</p><p>In the experiment, the team verified that through reinforcement learning RL, 3B's basic language model can also self-verify and search.</p><p>What's even more exciting is that the cost is less than 30 US dollars (about 217 yuan), and you can witness the aha moment with your own eyes.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/9e5c66f649972cc1dcc1b64ac9e2312a\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"468\"/></p><p>The project, called TinyZero, employs the R1-Zero algorithm-given a base language model, hints, and real reward signals, run reinforcement learning.</p><p>The team then applies it to the CountDown game (a game where players use basic arithmetic operations to combine numbers to reach the target number).</p><p>Starting from the initial simple output, the model gradually evolves the strategy of self-correction and search.</p><p>In the following example, the model proposes a solution, verifies itself, and repeatedly corrects it until it solves the problem.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/7fdc1973cccfff685657e93776b421a6\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"645\"/></p><p>In the ablation experiment, the researchers ran Qwen-2. 5-Base (four parameter scales of 0.5 B, 1.5 B, 3B, 7B).</p><p>It turns out that the 0.5 B model is merely guessing a solution and then stopping. Whereas from 1.5 B, the model learns to search, self-verify and correct its solutions, thus being able to achieve higher scores.</p><p>They believe that in this process, the underlying model is the key to performance.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/b5bf9a119224a98c43e0746453632356\" alt=\"\" title=\"\" tg-width=\"1024\" tg-height=\"872\"/></p><p>They also verified that additional instruction fine-tuning (SFT) is not necessary, which also confirms the design decision of R1-Zero.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/9f1a78e6cf55328c051693886f074ca0\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"352\"/></p><p>This is the first open source study to verify that the implementation of LLM inference capabilities can be purely through RL, without supervised fine-tuning</p><p>The difference between the basic model and the instruction model:</p><p><ul style=\"list-style-type: disc;\"><li>The instruction model runs fast, but ultimately performs comparably with the base model</p><p></li><li>The model of instruction output is more structured and readable</p><p></li></ul><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/f32389410d6e5ede00b75380edf87e07\" alt=\"\" title=\"\" tg-width=\"1024\" tg-height=\"891\"/></p><p>In addition, they also found that the specific RL algorithm is not important. Among algorithms such as PPO, GRPO, and PRIME, Long CoT can all emerge and bring good performance.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/617d02e3d20394695f5f569d5f74c0d8\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"1128\"/></p><p>Moreover, models are very dependent on specific tasks in their inference behavior:</p><p><ul style=\"list-style-type: disc;\"><li>For the Countdow task, the model learns to search and self-validate</p><p></li><li>For numeric multiplication tasks, the model instead learns to decompose the problem using distribution rules and solve it step by step</p><p></li></ul><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/96b4239d735ae442de73323b5bd4c965\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"643\"/></p><p>Apple machine learning scientist Yizhe Zhang said that it's so cool that models as small as 1.5 B can also emerge with self-verification capabilities through RL.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/3f01a77c2fff63fd05ba3d19af0bfeab\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"212\"/></p><p><h2 id=\"id_3291092331\">7B model replica, surprising results</h2>The team of Assistant Professor Ho Junxian of the Hong Kong University of Science and Technology (co-authors Huang Yuzhen and Weihao Zeng) used only 8K samples to replicate the training of DeepSeek-R1-Zero and DeepSeek-R1 on the 7B model.</p><p>The results are surprising-the model has achieved very strong results in complex mathematical reasoning.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/51085fb73be2a72ee23054c18e6df557\" alt=\"\" title=\"\" tg-width=\"870\" tg-height=\"185\"/></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/17a612e8c7859f8429ee06ebe37234bb\" alt=\"项目地址:https://github.com/hkust-nlp/simpleRL-reason\" title=\"项目地址:https://github.com/hkust-nlp/simpleRL-reason\" tg-width=\"1080\" tg-height=\"492\"/><span>Project address: https://github.com/hkust-nlp/simpleRL-reason</span></p><p>They take Qwen2.5-Math-7B (the base model) as a starting point and directly perform reinforcement learning on it.</p><p>Throughout the process, no supervised fine-tuning (SFT) was performed and no reward model was used.</p><p>Finally, the model achieved an accuracy of 33.3% on the AIME benchmark, 62.5% on the AMC, and 77.2% on the MATH.</p><p>This performance not only surpasses Qwen2.5-Math-7B-Instruct, but also rivals PRIME and rStar-MATH, which use more than 50 times the amount of data and more complex components!</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/fc28abafaf1b99ce393cf71fc2cfb483\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"613\"/></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/00ddd59c6d8b6ab3b8d2cf89524b23c4\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"553\"/></p><p>Among them, Qwen2.5-7B-SimpleRL-Zero is trained on the Qwen2.5-MATH-7B basic model using only the pure PPO method, and only 8K samples in the MATH dataset are used.</p><p>Qwen2.5-7B-SimpleRL first uses Long CoT supervised fine-tuning (SFT) as a cold start, and then performs reinforcement learning.</p><p>In both methods, the team only used the same 8K MATH samples and nothing more.</p><p>Around step 44, the aha moment appeared! In the response of the model, self-reflection appeared.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/f3aa0b8e2d94bdc891f49c30a47aa475\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"438\"/></p><p>Moreover, in this process, the model also showed longer CoT reasoning ability and self-reflection ability.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/3320ec58f4e229808d3520653e70da3d\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"608\"/></p><p>In the blog, the researchers analyze in detail the experimental setting and the phenomena observed during this reinforcement learning training process, such as the spontaneous formation of long-chain thinking (CoT) and self-reflection mechanisms.</p><p>Similar to DeepSeek R1, the researcher's reinforcement learning scheme is extremely simple, and does not use reward models or MCTS (Monte Carlo Tree Search) techniques.</p><p>They use the PPO algorithm and employ a rule-based reward function to assign rewards according to the format and correctness of the generated output:</p><p><ul style=\"list-style-type: disc;\"><li>Get a reward of +1 if the output provides the final answer in the specified format and is correct</p><p></li><li>If the output provides the final answer but is incorrect, the reward is set to-0.5</p><p></li><li>If the output fails to provide a final answer, the reward is set to-1</p><p></li></ul>The implementation is based on OpenRLHF. Preliminary experiments show that this reward function helps the strategy model to converge quickly and produce the output that conforms to the desired scheme.</p><p><h3 id=\"id_1461976375\" style=\"text-align: center;\">Part 1: SimpleRL-Zero (Reinforcement Learning from Scratch)</h3>Next, the researcher shared with us the dynamic analysis of the training process and some interesting emergence patterns.</p><p><h4 id=\"id_606216327\">Dynamic Analysis of Training Process</h4>As shown below, the accuracy of all benchmarks is steadily improving during training, while the output length shows a trend of first decreasing and then gradually increasing.</p><p>After further investigation, the researchers found that the Qwen2.5-Math-7B basic model tends to generate a large amount of code in the initial stage, which may be due to the distribution characteristics of the original training data of the model.</p><p>The first drop in output length is because reinforcement learning training gradually eliminates this code generation pattern and instead learns to use natural language for reasoning.</p><p>Subsequently, the generation length began to increase again, at which point a self-reflective mechanism emerged.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/c617c450f7e378da91201c2612d6b7dc\" alt=\"训练奖励和输出长度\" title=\"训练奖励和输出长度\" tg-width=\"1076\" tg-height=\"432\"/><span>Training reward and output length</span></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/59ef982da18a27da8c6263d647691fa7\" alt=\"基准测试准确率(pass@1)和输出长度\" title=\"基准测试准确率(pass@1)和输出长度\" tg-width=\"1080\" tg-height=\"924\"/><span>Benchmark accuracy (pass @ 1) and output length</span></p><p><h4 id=\"id_224279804\">The emergence of self-reflective mechanisms</h4>At about step 40 of training, the researchers observed that the model began to form a self-reflective pattern, which is the aha moment (epiphany moment) described in the DeepSeek-R1 paper.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/60580dff773b0a70b356c7267119e819\" alt=\"\" title=\"\" tg-width=\"1054\" tg-height=\"423\"/></p><p><h3 id=\"id_2889548553\" style=\"text-align: center;\">Part 2: SimpleRL (Reinforcement Learning Based on Imitation Warm-Up)</h3>As mentioned earlier, the researcher performed a long CoT SFT warm-up before performing reinforcement learning, using 8,000 MATH sample responses extracted from QwQ-32B-Preview as the SFT dataset.</p><p>The potential advantage of this cold start is that the model already has a long CoT thinking mode and self-reflection ability when starting reinforcement learning, which may achieve faster and better learning results during the reinforcement learning stage.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/c761a6c0f0b865f5c1645cc06ad9869e\" alt=\"\" title=\"\" tg-width=\"1053\" tg-height=\"445\"/></p><p>Compared with the pre-RL training model (Qwen2.5-Math-7B-Base + 8K QwQ knowledge distillation version), the average performance of Qwen2.5-7B-SimpleRL is significantly improved by 6.9 percentage points.</p><p>Furthermore, not only does Qwen2.5-7B-SimpleRL consistently outperform Eurus-2-7B-PRIME, it also surpasses Qwen2.5-7B-SimpleRL-Zero on 3 out of 5 benchmarks.</p><p><h3 id=\"id_1405061093\">Training Process Analysis</h3><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/7ca73d0446fd6af825ac840bba4c7b14\" alt=\"训练奖励和输出长度\" title=\"训练奖励和输出长度\" tg-width=\"1060\" tg-height=\"467\"/><span>Training reward and output length</span></p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/07444422d9115a8382ed0f8843badb5f\" alt=\"基准测试准确率(pass@1)和输出长度\" title=\"基准测试准确率(pass@1)和输出长度\" tg-width=\"1080\" tg-height=\"638\"/><span>Benchmark accuracy (pass @ 1) and output length</span></p><p>The training dynamic performance of Qwen2.5-SimpleRL was similar to that of Qwen2.5-SimpleRL-Zero.</p><p>Interestingly, although the researchers first performed long CoT SFT, the reduction in output length was still observed in the early stage of reinforcement learning.</p><p>They speculate that this may be because the inference patterns extracted from QwQ are not suitable for the small strategy model, or are beyond its capabilities.</p><p>Therefore, the model chooses to abandon this model and instead independently develop new long-chain reasoning methods.</p><p>Finally, the researcher summed up the research with a sentence from Leonardo da Vinci-</p><p>Simplicity is the ultimate exquisiteness.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/0031e63d9296afb7d700bb6315dbce70\" alt=\"图片\" title=\"图片\" tg-width=\"437\" tg-height=\"72\"/><span>Pictures</span></p><p><h2 id=\"id_2367102810\">Completely open source replica, HuggingFace is over</h2>Even the HuggingFace team, the world's largest open source platform, officially announced today that it will reproduce all pipelines of DeepSeek R1.</p><p>After the reproduction is completed, all training data, training scripts, etc. will be open source.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/8b6a0735cb87c3f66a56724b33c3f08c\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"436\"/></p><p>This project is called Open R1 and is still in progress. On the day of release, the star mark broke through 1.9 k and gained 142 forks.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/883546a6427f0ae894da543e70c593aa\" alt=\"项目地址:https://github.com/huggingface/open-r1\" title=\"项目地址:https://github.com/huggingface/open-r1\" tg-width=\"1080\" tg-height=\"391\"/><span>Project address: https://github.com/huggingface/open-r1</span></p><p>Guided by the DeepSeek-R1 technical report, the research team divided the entire reproduction process into three key steps.</p><p><ul style=\"list-style-type: disc;\"><li><strong>Step 1:</strong>The R1-Distill model was reproduced by distilling a high-quality corpus from DeepSeek-R1.</p><p></li><li><strong>Step 2:</strong>Reproduce the pure reinforcement learning (RL) process used by DeepSeek to create R1-Zero. This may require curating new large-scale datasets for mathematics, inference, and code tasks.</p><p></li><li><strong>Step 3:</strong>Show how we evolve from a base model to an RL-tuned model through multi-stage training.</p><p></li></ul><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/0b5e37647b6df05c2e00fdcae31c20a1\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"1238\"/></p><p><h2 id=\"id_1672370884\">From Stanford to MIT, R1 becomes the top pick</h2>A sideline project has frightened the world's major technology companies.</p><p>This wave of success of DeepSeek has also become a myth in the industry. The latest screenshots from netizens show that this application has squeezed into the top three in the APP Store efficiency application list.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/c26a66723d3b3a6b69f68a1478db0112\" alt=\"\" title=\"\" tg-width=\"966\" tg-height=\"1200\"/></p><p>In Hugging Face, R1 downloads directly reached the top, and the other three models also dominated the hot list.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/a17a15963d44d09684758a53d9f16d6e\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"1080\"/></p><p>a16z partner Anjney Midha said that overnight, from Stanford to MIT, DeepSeek R1 has become the preferred model for researchers at top universities in the United States.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/95d7a12d8352525e938ddde62a68ba77\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"237\"/></p><p>Some researchers said that DeepSeek basically replaced my need to use ChatGPT.</p><p><p class=\"t-img-caption\"><img src=\"https://static.tigerbbs.com/aef5ecd49dffdc200d6f1bd7a0647528\" alt=\"\" title=\"\" tg-width=\"1080\" tg-height=\"169\"/></p><p>Chinese AI really shocked the world this time.</p><p></body></html></p>\n<div class=\"bt-text\">\n\n\n<p> source:<a href=\"https://mp.weixin.qq.com/s/o41vPh9eJCVjCRUE4u5npA\">新智元</a></p>\n\n\n</div>\n</article>\n</div>\n</body>\n</html>\n","type":0,"thumbnail":"https://community-static.tradeup.com/news/4abae3de7bed37b48c01721c28c51931","relate_stocks":{},"source_url":"https://mp.weixin.qq.com/s/o41vPh9eJCVjCRUE4u5npA","is_english":false,"share_image_url":"https://static.laohu8.com/e9f99090a1c2ed51c021029395664489","article_id":"1161961252","content_text":"这些天,硅谷彻底处于中国公司带来的大地震余波中。全美都在恐慌:是否全球人工智能的中心已经转移到了中国?就在这当口,全球复现DeepSeek的一波狂潮也来了。诚如LeCun所言:「这一次,正是开源对闭源的胜利!」在没有顶级芯片的情况下,以极低成本芯片训出突破性模型的DeepSeek,或将威胁到美国的AI霸权。大模型比拼的不再是动辄千万亿美元的算力战。OpenAI、Meta、谷歌这些大公司引以为傲的技术优势和高估值将会瓦解,英伟达的股价将开始动摇。种种这些观点和讨论,让人不禁怀疑:数百亿美元支出,对这个行业真的必要吗?甚至有人说,中国量化基金的一群天才,将导致纳斯达克崩盘。从此,大模型时代很可能会进入一个分水岭:超强性能的模型不再独属于算力巨头,而是属于每个人。30美金,就能看到「啊哈」时刻来自UC伯克利博士生潘家怡和另两位研究人员,在CountDown游戏中复现了DeepSeek R1-Zero。他们表示,结果相当出色!实验中,团队验证了通过强化学习RL,3B的基础语言模型也能够自我验证和搜索。更令人兴奋的是,成本不到30美金(约217元),就可以亲眼见证「啊哈」时刻。这个项目叫做TinyZero,采用了R1-Zero算法——给定一个基础语言模型、提示和真实奖励信号,运行强化学习。然后,团队将其应用在CountDown游戏中(这是一个玩家使用基础算术运算,将数字组合以达到目标数字的游戏)。模型从最初的简单输出开始,逐步进化出自我纠正和搜索的策略。在以下示例中,模型提出了解决方案,自我验证,并反复纠正,直到解决问题为止。在消融实验中,研究人员运行了Qwen-2.5-Base(0.5B、1.5B、3B、7B四种参数规模)。结果发现,0.5B模型仅仅是猜测一个解决方案然后停止。而从1.5B开始,模型学会了搜索、自我验证和修正其解决方案,从而能够获得更高的分数。他们认为,在这个过程,基础模型的是性能的关键。他们还验证了,额外的指令微调(SFT)并非是必要的,这也印证了R1-Zero的设计决策。这是首个验证LLM推理能力的实现可以纯粹通过RL,无需监督微调的开源研究基础模型和指令模型两者区别:指令模型运行速度快,但最终表现与基础模型相当指令输出的模型更具结构性和可读性此外,他们还发现,具体的RL算法并不重要。PPO、GRPO、PRIME这些算法中,长思维链(Long CoT)都能够涌现,且带来不错的性能表现。而且,模型在推理行为中非常依赖于具体的任务:对于Countdow任务,模型学习进行搜索和自我验证对于数字乘法任务,模型反而学习使用分布规则分解问题,并逐步解决苹果机器学习科学家Yizhe Zhang对此表示,太酷了,小到1.5B的模型,也能通过RL涌现出自我验证的能力。7B模型复刻,结果令人惊讶港科大助理教授何俊贤的团队(共同一作黄裕振、Weihao Zeng),只用了8K个样本,就在7B模型上复刻出了DeepSeek-R1-Zero和DeepSeek-R1的训练。结果令人惊喜——模型在复杂的数学推理上取得了十分强劲结果。项目地址:https://github.com/hkust-nlp/simpleRL-reason他们以Qwen2.5-Math-7B(基础模型)为起点,直接对其进行强化学习。整个过程中,没有进行监督微调(SFT),也没有使用奖励模型。最终,模型在AIME基准上实现了33.3%的准确率,在AMC上为62.5%,在MATH上为77.2%。这一表现不仅超越了Qwen2.5-Math-7B-Instruct,并且还可以和使用超过50倍数据量和更复杂组件的PRIME和rStar-MATH相媲美!其中,Qwen2.5-7B-SimpleRL-Zero是在Qwen2.5-Math-7B基础模型上仅使用纯PPO方法训练的,仅采用了MATH数据集中的8K样本。Qwen2.5-7B-SimpleRL则首先通过Long CoT监督微调(SFT)作为冷启动,然后再进行强化学习。在这两种方法中,团队都只使用了相同的8K MATH样本,仅此而已。大概在第44步的时候,「啊哈时刻」出现了!模型的响应中,出现了自我反思。并且,在这个过程中,模型还显现了更长的CoT推理能力和自我反思能力。在博客中,研究者详细剖析了实验设置,以及在这个强化学习训练过程中所观察到的现象,例如长链式思考(CoT)和自我反思机制的自发形成。与DeepSeek R1类似,研究者的强化学习方案极其简单,没有使用奖励模型或MCTS(蒙特卡洛树搜索)类技术。他们使用的是PPO算法,并采用基于规则的奖励函数,根据生成输出的格式和正确性分配奖励:如果输出以指定格式提供最终答案且正确,获得+1的奖励如果输出提供最终答案但不正确,奖励设为-0.5如果输出未能提供最终答案,奖励设为-1该实现基于OpenRLHF。初步试验表明,这个奖励函数有助于策略模型快速收敛,产生符合期望格式的输出。第一部分:SimpleRL-Zero(从头开始的强化学习)接下来,研究者为我们分享了训练过程动态分析和一些有趣的涌现模式。训练过程动态分析如下所示,所有基准测试的准确率在训练过程中都在稳步提高,而输出长度则呈现先减少后逐渐增加的趋势。经过进一步调查,研究者发现,Qwen2.5-Math-7B基础模型在初始阶段倾向于生成大量代码,这可能源于模型原始训练数据的分布特征。输出长度的首次下降,是因为强化学习训练逐渐消除了这种代码生成模式,转而学会使用自然语言进行推理。随后,生成长度开始再次增加,此时出现了自我反思机制。训练奖励和输出长度基准测试准确率(pass@1)和输出长度自我反思机制的涌现在训练到第 40 步左右时,研究者观察到:模型开始形成自我反思模式,这正是DeepSeek-R1论文中所描述的「aha moment」(顿悟时刻)。第二部分:SimpleRL(基于模仿预热的强化学习)如前所述,研究者在进行强化学习之前,先进行了long CoT SFT预热,使用了8,000个从QwQ-32B-Preview中提取的MATH示例响应作为SFT数据集。这种冷启动的潜在优势在于:模型在开始强化学习时已具备long CoT思维模式和自我反思能力,从而可能在强化学习阶段实现更快更好的学习效果。与RL训练前的模型(Qwen2.5-Math-7B-Base + 8K QwQ知识蒸馏版本)相比,Qwen2.5-7B-SimpleRL的平均性能显著提升了6.9个百分点。此外,Qwen2.5-7B-SimpleRL不仅持续优于Eurus-2-7B-PRIME,还在5个基准测试中的3个上超越了Qwen2.5-7B-SimpleRL-Zero。训练过程分析训练奖励和输出长度基准测试准确率(pass@1)和输出长度Qwen2.5-SimpleRL的训练动态表现与Qwen2.5-SimpleRL-Zero相似。有趣的是,尽管研究者先进行了long CoT SFT,但在强化学习初期仍然观察到输出长度减少的现象。他们推测,这可能是因为从QwQ提取的推理模式不适合小型策略模型,或超出了其能力范围。因此,模型选择放弃这种模式,转而自主发展新的长链式推理方式。最后,研究者用达芬奇的一句话,对这项研究做了总结——简约,便是最终极的精致。图片完全开源复刻,HuggingFace下场了甚至,就连全球最大开源平台HuggingFace团队,今天官宣复刻DeepSeek R1所有pipeline。复刻完成后,所有的训练数据、训练脚本等等,将全部开源。这个项目叫做Open R1,当前还在进行中。发布到一天,星标冲破1.9k,斩获142个fork。项目地址:https://github.com/huggingface/open-r1研究团队以DeepSeek-R1技术报告为指导,将整个复刻过程划分为三个关键步骤。步骤 1:通过从DeepSeek-R1蒸馏高质量语料库,复现R1-Distill模型。步骤 2:复现DeepSeek用于创建R1-Zero的纯强化学习(RL)流程。这可能需要为数学、推理和代码任务策划新的大规模数据集。步骤 3:展示我们如何通过多阶段训练,从基础模型发展到经过RL调优的模型。从斯坦福到MIT,R1成为首选一个副业项目,让全世界科技大厂为之惶恐。DeepSeek这波成功,也成为业界的神话,网友最新截图显示,这款应用已经在APP Store「效率」应用榜单中挤进前三。在Hugging Face中,R1下载量直接登顶,另外3个模型也霸占着热榜。a16z合伙人Anjney Midha称,一夜之间,从斯坦福到MIT,DeepSeek R1已经成为美国顶尖高校研究人员「首选模型」。还有研究人员表示,DeepSeek基本上取代了我用ChatGPT的需求。中国AI,这一次真的震撼了世界。","news_type":1,"symbols_score_info":{}},"isVote":1,"tweetType":1,"viewCount":551,"authorTweetTopStatus":1,"verified":2,"comments":[],"imageCount":0,"langContent":"EN","totalScore":0}],"lives":[]}