Artificial Intelligence is transitioning from a primarily "Pattern Matching"-based statistical learning approach to a core capability system built on "Structured Reasoning," and the importance of Post-training is rising rapidly. The emergence of DeepSeek-R1 marks a paradigm shift for reinforcement learning in the era of large models. The industry has reached a consensus: Pre-training establishes a model's general capability foundation, while reinforcement learning is no longer merely a value-alignment tool. It has been shown to systematically improve the quality of reasoning chains and the complexity of decision-making, and it is gradually evolving into a technical path for continuously enhancing intelligence.
Meanwhile, Web3 is restructuring the production relations of AI through decentralized computing power networks and cryptographic incentive systems. Reinforcement learning's structural requirements for rollout sampling, reward signals, and verifiable training align naturally with blockchain's collaborative computing power, incentive distribution, and verifiable execution. This research report will systematically dissect the AI training paradigm and the principles of reinforcement learning, demonstrate the structural advantages of Reinforcement Learning × Web3, and analyze projects such as Prime Intellect, Gensyn, Nous Research, Gradient, Grail, and Fraction AI.
The full lifecycle of training a modern Large Language Model (LLM) is typically divided into three core stages: Pre-training, Supervised Fine-Tuning (SFT), and Post-training/RL. These stages are responsible, respectively, for "building a world model," "injecting task capabilities," and "shaping reasoning and values," and each stage's computational structure, data requirements, and verification difficulty determine how far it can be decentralized.
· Pre-training uses large-scale self-supervised learning to build the model's statistical structure of language and its cross-modal world model, forming the foundation of an LLM's capabilities. This stage requires globally synchronous training on trillion-token-scale corpora and relies on homogeneous clusters of thousands to tens of thousands of H100s. It accounts for roughly 80–95% of total training cost, is highly sensitive to bandwidth and data copyright, and must therefore be completed in a highly centralized environment.
· Supervised Fine-Tuning (SFT) is used to inject task capabilities and instruction formats. Its data volume is small and its cost share is roughly 5–15%. Fine-tuning can update all parameters or use parameter-efficient fine-tuning (PEFT) methods, of which LoRA, Q-LoRA, and Adapters are the industry mainstream (a minimal LoRA sketch follows this list). Gradients still need to be synchronized, however, which limits its decentralization potential.
· Post-training consists of multiple iterative sub-stages and determines a model's reasoning ability, values, and safety boundaries. Its methods include the reinforcement learning family (RLHF, RLAIF, GRPO) as well as RL-free preference optimization (DPO) and Process Reward Models (PRM); a simplified GRPO sketch also follows this list. The stage's data volume and cost are relatively low (5–10%) and are concentrated mainly in Rollout and policy updates. It naturally supports asynchronous and distributed execution, nodes do not need to hold the full model weights, and combining it with verifiable computation and on-chain incentives can form an open, decentralized training network, making it the training stage best suited to Web3.
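The following is a minimal sketch of LoRA-style PEFT using the Hugging Face peft library; the base model ("gpt2"), target modules, and hyperparameters are illustrative assumptions rather than recommendations, but they show why only a tiny fraction of the weights needs to be trained and synchronized.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

# Load a small base model (illustrative; any causal LM works the same way).
model = AutoModelForCausalLM.from_pretrained("gpt2")

# LoRA: freeze the base weights and train low-rank adapters on selected modules.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the weights are trainable
```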
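For post-training, the group-relative advantage at the heart of GRPO can be sketched in a few lines: each prompt gets a group of sampled answers, and each answer's reward is normalized against its group's mean and standard deviation, removing the need for a separate value network. The PyTorch sketch below is a simplified illustration (scalar per-answer log-probabilities, no KL penalty to a frozen reference model), not a full reproduction of the algorithm.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: normalize each answer's reward against the
    mean/std of its own group, so no separate value (critic) network is needed."""
    mean = group_rewards.mean(dim=-1, keepdim=True)
    std = group_rewards.std(dim=-1, keepdim=True)
    return (group_rewards - mean) / (std + eps)

def grpo_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate objective applied to each sampled answer.
    The KL penalty against a frozen reference model is omitted for brevity."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# One prompt, a group of 4 sampled answers scored by a rule-based verifier.
rewards = torch.tensor([[1.0, 0.0, 0.5, 0.0]])
advantages = grpo_advantages(rewards)
logp_old = torch.log(torch.tensor([[0.25, 0.25, 0.25, 0.25]]))  # sampling-time log-probs
logp_new = torch.log(torch.tensor([[0.30, 0.22, 0.26, 0.22]]))  # current-policy log-probs
loss = grpo_policy_loss(logp_new, logp_old, advantages)
```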

Reinforcement Learning (RL) drives a model to autonomously improve its decision-making through the loop of environment interaction, reward feedback, and policy update; its core structure can be viewed as a closed feedback loop of states, actions, rewards, and policy. A complete RL system typically contains three kinds of components: the Policy (policy network), Rollout (experience sampling), and Learner (policy updater). The policy interacts with the environment to generate trajectories, and the Learner updates the policy according to the reward signals, forming a continuously iterating and improving learning process (a minimal sketch of this loop follows the component list below):

1. Policy (policy network): generates actions from environment states and is the decision-making core of the system. Training requires centralized back-propagation to maintain consistency; at inference time it can be distributed to different nodes and run in parallel.
2. Rollout (experience sampling): nodes interact with the environment according to the policy and generate state-action-reward trajectories. The process is highly parallel, requires minimal communication, and is insensitive to hardware differences, making it the part best suited to scaling out in a decentralized network.
3. Learner (policy updater): aggregates all Rollout trajectories and performs the policy update based on the reward signals.
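To make the Policy / Rollout / Learner decomposition concrete, here is a minimal, self-contained REINFORCE-style sketch in PyTorch. The toy environment, network sizes, and hyperparameters are purely illustrative assumptions; in LLM post-training the environment step would be replaced by text generation plus a reward model or verifier, and the rollout function is the part a decentralized network would fan out to many remote nodes.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Policy network: maps an environment state to a distribution over actions."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, n_actions)
        )

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.net(state))

def toy_env_step(state, action):
    """Toy environment: reward 1.0 if the action matches the sign of state[0]."""
    reward = float(action.item() == int(state[0].item() > 0))
    return torch.randn_like(state), reward

def rollout(policy, env_step, state, horizon=16):
    """Rollout worker: interact with the environment under the current policy and
    collect (reward, log_prob) pairs. Highly parallel and communication-light,
    so this is the part that can run on heterogeneous remote nodes."""
    traj = []
    for _ in range(horizon):
        dist = policy(state)
        action = dist.sample()
        state, reward = env_step(state, action)
        traj.append((reward, dist.log_prob(action)))
    return traj

def learner_update(policy, optimizer, trajectories, gamma=0.99):
    """Learner: aggregate trajectories from all rollout workers and apply a
    REINFORCE-style policy-gradient update (centralized back-propagation)."""
    loss = torch.zeros(())
    for traj in trajectories:
        ret, returns = 0.0, []
        for reward, _ in reversed(traj):
            ret = reward + gamma * ret
            returns.append(ret)
        returns.reverse()
        for (_, logp), G in zip(traj, returns):
            loss = loss - logp * G
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

policy = Policy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
# In a decentralized setting, each of these rollouts could run on a different node.
trajectories = [rollout(policy, toy_env_step, torch.randn(4)) for _ in range(8)]
learner_update(policy, optimizer, trajectories)
```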