A First Look at MXFP4

Posted on 2025-08-13 In Technology
Symbols count in article: 2.8k Reading time ≈ 3 mins.

A few days ago OpenAI released gpt-oss with much fanfare. Strikingly, the MoE part of the model uses the MXFP4 format, which greatly reduces its memory footprint.
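To make the memory saving concrete, here is a minimal sketch of how one MXFP4 block can be decoded, assuming the layout in the OCP Microscaling (MX) specification: 32 elements in 4-bit E2M1 that share one 8-bit E8M0 power-of-two scale. The function names are hypothetical, the NaN scale encoding (255) is ignored, and this is not taken from the gpt-oss code.

```python
import numpy as np

# Magnitudes representable by E2M1, indexed by the low 3 bits (2 exponent + 1 mantissa).
E2M1_VALUES = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
BLOCK_SIZE = 32  # elements sharing one scale


def dequantize_mxfp4_block(codes, scale_e8m0):
    """Decode 32 four-bit codes (one per array entry) with a shared E8M0 scale."""
    codes = np.asarray(codes, dtype=np.uint8)
    if codes.shape != (BLOCK_SIZE,):
        raise ValueError("expected exactly 32 codes per block")
    magnitude = E2M1_VALUES[codes & 0x7]                 # low 3 bits select the magnitude
    sign = np.where((codes & 0x8) != 0, -1.0, 1.0)       # bit 3 is the sign
    scale = 2.0 ** (int(scale_e8m0) - 127)               # E8M0 is a biased power of two
    return (sign * magnitude * scale).astype(np.float32)


def bits_per_weight():
    """Storage cost per element: 4 bits plus the amortized 8-bit scale."""
    return (BLOCK_SIZE * 4 + 8) / BLOCK_SIZE
```

Under these assumptions each weight costs 4.25 bits, compared with 16 bits for fp16 or bf16.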

Speeding Things Up on the RTX 4090

Posted on 2025-08-01 Edited on 2025-08-13 In Technology
Symbols count in article: 16k Reading time ≈ 15 mins.

The RTX 4090 has a peculiar property: a matmul with fp16 accumulation has twice the throughput of a matmul with fp32 accumulation.

That is a very tempting speedup!
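The accumulator type matters for accuracy as well as speed. The following is a small NumPy sketch, not a model of the actual tensor-core datapath, that runs the same dot product with an fp16 and an fp32 accumulator; `dot_with_accumulator` is a hypothetical helper written for this illustration.

```python
import numpy as np


def dot_with_accumulator(a, b, acc_dtype):
    """Sequential dot product that rounds the running sum to acc_dtype after every step."""
    acc = acc_dtype(0)
    for x, y in zip(a, b):
        acc = acc_dtype(acc + acc_dtype(x) * acc_dtype(y))
    return float(acc)


ones = np.ones(4096, dtype=np.float16)
fp16_result = dot_with_accumulator(ones, ones, np.float16)
fp32_result = dot_with_accumulator(ones, ones, np.float32)
```

With 4096 ones, the fp32 accumulator reaches 4096, while the fp16 accumulator stalls at 2048 because 2049 is not representable in fp16 and rounds back down. Real kernels accumulate in a different order and over shorter runs, so the effect is usually milder, but it is the trade-off behind the doubled throughput.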

Posted on 2025-03-03 Edited on 2025-03-11 In Technology
Symbols count in article: 6.4k Reading time ≈ 6 mins.

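I previously went through the FlashMLA source code in "FlashMLA Source Code Analysis". Since then I have done some hands-on practice, and this post records what I learned along the way.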

Posted on 2025-02-24 Edited on 2025-03-11 In Technology
Symbols count in article: 36k Reading time ≈ 33 mins.

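Today DeepSeek open-sourced FlashMLA. I had already read up on MLA, and this looked like a good opportunity to learn CUDA acceleration, so I am recording my hands-on study here.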

Posted on 2024-11-04 Edited on 2024-11-06 In Technology
Symbols count in article: 3k Reading time ≈ 3 mins.

Deploy LLMs for infinite-length inputs without sacrificing efficiency and performance.

Posted on 2024-02-26 Edited on 2025-02-07 In Technology
Symbols count in article: 12k Reading time ≈ 11 mins.

This project records the process of optimizing SGEMM (single-precision floating-point General Matrix Multiplication) on the RISC-V platform.

Posted on 2024-02-26 Edited on 2025-02-07 In Technology
Symbols count in article: 6.3k Reading time ≈ 6 mins.

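This project records the process of optimizing SGEMM (single-precision floating-point General Matrix Multiplication) on the RISC-V platform.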

Posted on 2024-08-12 Edited on 2024-10-14
Symbols count in article: 8.9k Reading time ≈ 8 mins.

This article focuses on dequantization in AWQ's W4A16 (4-bit weight, 16-bit activation) CUDA kernel.
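As a rough picture of what W4A16 dequantization computes, here is a NumPy sketch: eight 4-bit weights are unpacked from each 32-bit word and mapped to fp16 via `(q - zero) * scale`. The function names and the lowest-nibble-first order are assumptions for illustration. The real AWQ kernel uses its own interleaved packing order and per-group scales and zero points, neither of which is reproduced here.

```python
import numpy as np


def unpack_int4(packed):
    """Unpack eight 4-bit values from each uint32, lowest nibble first (assumed order)."""
    packed = np.asarray(packed, dtype=np.uint32).reshape(-1)
    shifts = np.arange(8, dtype=np.uint32) * np.uint32(4)
    nibbles = (packed[:, None] >> shifts) & np.uint32(0xF)
    return nibbles.astype(np.uint8).reshape(-1)


def dequantize_w4(packed, scale, zero):
    """Map 4-bit codes to fp16 weights using a single scale and zero point."""
    q = unpack_int4(packed).astype(np.float16)
    return (q - np.float16(zero)) * np.float16(scale)
```

For example, `dequantize_w4(np.array([0x76543210], dtype=np.uint32), 0.5, 8)` decodes the codes 0 through 7 into fp16 values from -4.0 to -0.5 in steps of 0.5.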

Posted on 2023-07-09 Edited on 2025-05-02 In Games
Symbols count in article: 17k Reading time ≈ 16 mins.

Setting a goal: reach 100% completion.

Now at 99%

Posted on 2023-08-04 Edited on 2024-10-16
Symbols count in article: 3.9k Reading time ≈ 4 mins.

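On campus in Beijing, I designed and built a six-fold bamboo shoot (see 《六重蓝笋成长记（上）》 and 《六重蓝笋成长记（下）》), began to appreciate the fun of bamboo shoots, and resolved to build ones with even more folds.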

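During a certain special period in Shanghai, I was bored out of my mind, so I made the video 从入门到夺笋 to ease the frustration of not being able to go out, and resolved to build a seven-fold (seven-color rainbow) bamboo shoot.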