Publications

(Note: * indicates equal contribution.)

Huang, Y.*, Gonçalves, N. M. T.*, Alvetreti, F., Li, L., Han, X., Ponti, E. M., Martins, A. F. T., & Treviso, M. V. (2026). DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention. arXiv preprint arXiv:2605.18753.

Huang, Y.*, Wang, P.*, Han, J., Zhao, W., Su, Z., Sun, A., Lyu, H., Zhao, H., Wang, Y., Xiao, C., Han, X., & Liu, Z. (2025). NOSA: Native and Offloadable Sparse Attention. (ICML 2026 AdaptFM Workshop).

Huang, Y., Li, M., Han, X., Xiao, C., Zhao, W., Sun, A., Yuan, Z., Zhou, H., Meng, F., & Liu, Z. (2026). Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention. Annual Meeting of the Association for Computational Linguistics (ACL 2026 main Oral).

Zhao, W., Zhou, Z., Su, Z., Xiao, C., Li, Y., Li, Y., … & Liu, Z. (2025). InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation. International Conference on Learning Representations (ICLR 2026).

Huang, Y. (2025). On Accelerating Long-Context Inference via Sparse Self-Attention. B.Eng dissertation, Tsinghua University.

MiniCPM Team. (2025). MiniCPM4: Ultra-Efficient LLMs on End Devices. arXiv preprint arXiv:2506.07900.

Yu, T., Wang, Z., Wang, C., Huang, F., Ma, W., He, Z., … & Sun, M. (2025). MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe. arXiv preprint arXiv:2509.18154.

Huang, Y.*, Li, M.*, Han, X., Xiao, C., Zhao, W., Sun, A., Zhou, J., Zhou, H., Liu, Z., & Sun, M. (2025). APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs. Annual Meeting of the Association for Computational Linguistics (ACL 2025 main Oral).

Zhao, W.*, Pan, T.*, Han, X., Zhang, Y., Sun, A., Huang, Y., Zhang, K., Zhao, W., Li, Y., Wang, J., & Liu, Z. (2025). FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling. Annual Meeting of the Association for Computational Linguistics (ACL 2025 main).

Yuan, Z., Li, J., Li, Y., Huang, Y., Chen, C., Wang, S., & Gou, Z. (2025). CITR: Efficient Long Video Understanding Needs Causal Importance. ACM Multimedia (ACM MM).

Huang, Y., Yuan, B., Han, X., Xiao, C., & Liu, Z. (2025). Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads. Transactions on Machine Learning Research (TMLR 2025).

Zhao, W.*, Huang, Y.*, Han, X., Xu, W., Xiao, C., Zhang, X., Fang, Y., Zhang, K., Liu, Z., & Sun, M. (2024). Ouroboros: Speculative Decoding with Large Model Enhanced Drafting. Main Conference of Empirical Methods in Natural Language Processing (EMNLP 2024 main).

Zhao, W.*, Huang, Y.*, Han, X., Liu, Z., Zhang, Z., Li, K., Chen, C., Yang, T., & Sun, M. (2024). CA-LoRA: Adapting Existing LoRA for Compressed LLMs to Enable Efficient Multi-Tasking on Personal Devices. Conference on Language Modeling (COLM 2024).

Hu, S., Tu, Y., Han, X., Cui, G., He, C., Zhao, W., … & Sun, M. (2024). MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies. Conference on Language Modeling (COLM 2024).

Qin, Y., Hu, S., Lin, Y., Chen, W., Ding, N., Cui, G., … & Sun, M. (2023). Tool Learning with Foundation Models. ACM Computing Surveys.

Xiao, J., Huang, Y., Hu, C., Song, S., Huang, X., & Wang, J. (2022). Time series data encoding for efficient storage: a comparative analysis in Apache IoTDB. Proceedings of the VLDB Endowment, 15(10), 2148-2160 (VLDB 2022).