Publications

(Note: * indicates equal contribution.)

Huang, Y.*, Wang, P.*, Han, J., Zhao, W., Su, Z., Sun, A., Lyu, H., Zhao, H., Wang, Y., Xiao, C., Han, X., & Liu, Z. (2025). NOSA: Native and Offloadable Sparse Attention. arXiv preprint arXiv:2510.13602.

Huang, Y., Li, M., Han, X., Xiao, C., Zhao, W., Sun, A., Yuan, Z., Zhou, H., Meng, F., & Liu, Z. (2026). Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention. arXiv preprint arXiv:2601.21444 (In Submission).

Zhao, W., Zhou, Z., Su, Z., Xiao, C., Li, Y., Li, Y., … & Liu, Z. (2025). InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation. International Conference on Learning Representations (ICLR 2026).

Huang, Y. (2025). On Accelerating Long-Context Inference via Sparse Self-Attention. B.Eng. dissertation, Tsinghua University.

MiniCPM Team. (2025). MiniCPM4: Ultra-Efficient LLMs on End Devices. arXiv preprint arXiv:2506.07900.

Yu, T., Wang, Z., Wang, C., Huang, F., Ma, W., He, Z., … & Sun, M. (2025). MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe. arXiv preprint arXiv:2509.18154.

Huang, Y.*, Li, M.*, Han, X., Xiao, C., Zhao, W., Sun, A., Zhou, J., Zhou, H., Liu, Z., & Sun, M. (2025). APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs. Annual Meeting of the Association for Computational Linguistics (ACL 2025 main, Oral).

Zhao, W.*, Pan, T.*, Han, X., Zhang, Y., Sun, A., Huang, Y., Zhang, K., Zhao, W., Li, Y., Wang, J., & Liu, Z. (2025). FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling. Annual Meeting of the Association for Computational Linguistics (ACL 2025 main).

Yuan, Z., Li, J., Li, Y., Huang, Y., Chen, C., Wang, S., & Gou, Z. (2025). CITR: Efficient Long Video Understanding Needs Causal Importance. ACM Multimedia (ACM MM).

Huang, Y., Yuan, B., Han, X., Xiao, C., & Liu, Z. (2025). Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads. Transactions on Machine Learning Research (TMLR 2025).

Zhao, W.*, Huang, Y.*, Han, X., Xu, W., Xiao, C., Zhang, X., Fang, Y., Zhang, K., Liu, Z., & Sun, M. (2024). Ouroboros: Speculative Decoding with Large Model Enhanced Drafting. Conference on Empirical Methods in Natural Language Processing (EMNLP 2024 main).

Zhao, W.*, Huang, Y.*, Han, X., Liu, Z., Zhang, Z., Li, K., Chen, C., Yang, T., & Sun, M. (2024). CA-LoRA: Adapting Existing LoRA for Compressed LLMs to Enable Efficient Multi-Tasking on Personal Devices. Conference on Language Modeling (COLM 2024).

Hu, S., Tu, Y., Han, X., Cui, G., He, C., Zhao, W., … & Sun, M. (2024). MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. Conference on Language Modeling (COLM 2024).

Qin, Y., Hu, S., Lin, Y., Chen, W., Ding, N., Cui, G., … & Sun, M. (2023). Tool Learning with Foundation Models. ACM Computing Surveys.

Xiao, J., Huang, Y., Hu, C., Song, S., Huang, X., & Wang, J. (2022). Time series data encoding for efficient storage: a comparative analysis in Apache IoTDB. Proceedings of the VLDB Endowment, 15(10), 2148-2160 (VLDB 2022).