Zhao Dongyu's Blog

π Models: Hands-On Practice and Architecture Analysis

Posted on 2026-01-28 Edited on 2026-05-08 In Tech
Symbols count in article: 23k Reading time ≈ 20 mins.

Physical Intelligence is bringing general-purpose AI into the physical world.

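The π series is a classic. A while ago I deployed and accelerated π₀ on a real robot, and more recently I ran π₀.₅ in simulation, so it feels like time to organize everything systematically.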

Posted on 2025-08-13 Edited on 2026-04-13 In Tech
Symbols count in article: 7.6k Reading time ≈ 7 mins.

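A few days ago OpenAI launched gpt-oss with great fanfare. Strikingly, the MoE part of the model uses the MXFP4 format, which greatly reduces its memory requirements.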
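To put a rough number on the memory saving, here is a minimal sketch of the storage arithmetic. It assumes the MXFP4 layout from the OCP Microscaling spec (4-bit E2M1 elements with one shared 8-bit E8M0 scale per block of 32) and, purely as an illustrative baseline, compares against 16-bit weights. The helper name `bits_per_weight` is my own and does not come from gpt-oss.

```python
def bits_per_weight(element_bits: int, scale_bits: int, block_size: int) -> float:
    """Average storage cost per weight for a block-scaled format."""
    return (element_bits * block_size + scale_bits) / block_size

# Assumed MXFP4 layout: 4-bit elements, one 8-bit scale per 32-element block.
mxfp4_bits = bits_per_weight(element_bits=4, scale_bits=8, block_size=32)

# Assumed baseline for comparison: plain 16-bit weights (e.g. bf16).
bf16_bits = 16.0

print(mxfp4_bits)              # 4.25
print(bf16_bits / mxfp4_bits)  # roughly 3.76x smaller
```

Under these assumptions the shared scale adds only a quarter of a bit per weight, so the MoE weights would take a bit more than a quarter of their 16-bit size.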

Posted on 2025-08-01 Edited on 2026-04-13 In Tech
Symbols count in article: 16k Reading time ≈ 15 mins.

The RTX 4090 has a peculiar property: a matmul that accumulates in fp16 has twice the throughput of a matmul that accumulates in fp32.

That is a very tempting speedup!

Posted on 2025-03-03 Edited on 2026-04-13 In Tech
Symbols count in article: 6.4k Reading time ≈ 6 mins.

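I previously walked through the FlashMLA source code in "FlashMLA Source Code Analysis". Since then I have tried it out hands-on, and this post records what I learned along the way.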

Posted on 2025-02-24 Edited on 2026-04-13 In Tech
Symbols count in article: 36k Reading time ≈ 33 mins.

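Today DeepSeek open-sourced FlashMLA. I had already read up on MLA, and this looked like a great opportunity to learn CUDA acceleration, so here are my notes from working through it.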

Posted on 2024-11-04 Edited on 2026-04-13 In Tech
Symbols count in article: 3k Reading time ≈ 3 mins.

Deploy LLMs for infinite-length inputs without sacrificing efficiency and performance.

Posted on 2024-02-26 Edited on 2026-04-13 In Tech
Symbols count in article: 12k Reading time ≈ 11 mins.

This project records the process of optimizing SGEMM (single-precision floating-point general matrix multiplication) on the RISC-V platform.

Posted on 2024-02-26 Edited on 2026-04-13 In Tech
Symbols count in article: 6.3k Reading time ≈ 6 mins.

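This project records the process of optimizing SGEMM (single-precision floating-point general matrix multiplication) on the RISC-V platform.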

Posted on 2024-08-12 Edited on 2026-04-13
Symbols count in article: 8.9k Reading time ≈ 8 mins.

This post focuses on dequantization in AWQ's W4A16 (4-bit weight, 16-bit activation) CUDA kernel.
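For a feel of what W4A16 dequantization involves, here is a minimal sketch in plain Python rather than CUDA. It assumes eight unsigned 4-bit weights packed lowest nibble first into one 32-bit word, and a single scale and zero point for the whole word. The function name and the sequential nibble order are illustrative assumptions; the real AWQ kernel uses its own packing order and vectorized instructions.

```python
def dequantize_packed_int4(packed: int, scale: float, zero_point: int) -> list:
    """Unpack eight unsigned 4-bit values from a 32-bit word and dequantize them."""
    weights = []
    for i in range(8):
        # Extract the i-th nibble, lowest first (assumed order, not AWQ's actual layout).
        q = (packed >> (4 * i)) & 0xF
        # Standard affine dequantization: w = (q - zero_point) * scale.
        weights.append((q - zero_point) * scale)
    return weights

# Nibbles 0..7 from low to high, with a hypothetical scale and zero point.
packed = 0x76543210
result = dequantize_packed_int4(packed, scale=0.5, zero_point=8)
print(result)  # [-4.0, -3.5, -3.0, -2.5, -2.0, -1.5, -1.0, -0.5]
```

The interesting part of a real kernel is doing this unpacking for many words at once without the per-nibble loop, which this sketch deliberately leaves out.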

Posted on 2023-07-09 Edited on 2026-04-13 In Games
Symbols count in article: 17k Reading time ≈ 16 mins.

Setting a goal: reach 100% completion.

Now at 99%