StepFun open-sources Step3, a 321B-parameter VLM optimized for Chinese AI chips
StepFun has released Step3, a massive open-source vision-language model with 321 billion parameters and leading benchmark scores. The model debuts novel attention architectures that reduce inference costs and is optimized to run efficiently on domestic Chinese AI hardware.
StepFun has launched Step3, a 321-billion-parameter vision-language model that activates just 38 billion parameters per token. It pairs this sparsity with custom inference-efficiency techniques, Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), to keep serving costs down. The model, open-sourced on July 31, is positioned as a high-performance alternative to proprietary multimodal models. It scores 74.2 on MMMU and 64.8 on MathVision, marking it as one of the strongest open-access reasoning models available.
Unlike most frontier-scale VLMs, Step3 is designed with inference efficiency in mind. MFA and AFD allow it to cut decoding costs by a factor of 4–8, making it viable for real-world applications without sacrificing output quality. StepFun's release strategy also focuses on hardware-software co-design: Step3 has been tuned for Chinese AI chips from vendors such as Huawei (its Ascend line) and Cambricon, a strategic move that aligns with broader efforts in China to decouple from NVIDIA's GPU stack.
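The article does not describe MFA's internals, but the general logic behind factorized, shared-KV attention designs is easy to illustrate. The PyTorch sketch below is a generic illustration under my own assumptions (layer name, dimensions, and the MQA-style single shared key/value head are not StepFun's published design): when all query heads read from one shared key/value head, the per-token KV cache that decoding must stream every step shrinks by roughly the number of heads, which is where large decoding savings come from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FactorizedSharedKVAttention(nn.Module):
    """Generic sketch: low-rank factorized query projection + one shared KV head.

    Per-token KV cache drops from 2 * n_heads * head_dim to 2 * head_dim values,
    which is the kind of reduction that makes autoregressive decoding much cheaper.
    (Illustrative only; not StepFun's actual MFA implementation.)
    """

    def __init__(self, d_model=4096, n_heads=32, head_dim=128, q_rank=512):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        # Factorized query projection: d_model -> q_rank -> n_heads * head_dim
        self.q_down = nn.Linear(d_model, q_rank, bias=False)
        self.q_up = nn.Linear(q_rank, n_heads * head_dim, bias=False)
        # A single key head and value head shared across all query heads (MQA-style)
        self.k_proj = nn.Linear(d_model, head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_up(self.q_down(x)).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        # Shared KV head, broadcast to every query head
        k = self.k_proj(x).view(b, t, 1, self.head_dim).transpose(1, 2).expand(b, self.n_heads, t, self.head_dim)
        v = self.v_proj(x).view(b, t, 1, self.head_dim).transpose(1, 2).expand(b, self.n_heads, t, self.head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```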
For enterprise developers building multimodal agents, Step3 provides an open and high-performance foundation that can run cost-effectively on local infrastructure. The model's release also signals growing maturity in China’s open-source AI stack, with co-optimization across software and silicon. StepFun is distributing Step3 via Hugging Face, GitHub, and ModelScope under a permissive license.
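For teams that want to evaluate the model, loading it should follow the familiar Hugging Face pattern. The sketch below is illustrative only: the repo id, processor interface, and prompt format are assumptions based on typical transformers VLM releases, so check StepFun's model card for the exact usage.

```python
# Hypothetical usage sketch; repo id and processor behavior are assumptions, not
# confirmed details from the article. Verify against the official model card.
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "stepfun-ai/step3"  # assumed repo id; check StepFun's Hugging Face page

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",        # a 321B-parameter checkpoint needs multiple GPUs or offloading
    trust_remote_code=True,   # custom architecture code ships with the repository
)

image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)
inputs = processor(text="Describe this chart.", images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
```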