Model compression techniques have been extensively explored to reduce the computational resource demands of serving large language models (LLMs) and other large neural networks. However, most existing methods either incur significant accuracy degradation compared to uncompressed models or require long training times. Moreover, their adaptability is often constrained by a limited range of…
Full fine-tuning (FT) is commonly employed to tailor general pretrained models for specific downstream tasks. To reduce the training cost, parameter-efficient fine-tuning (PEFT) methods have been introduced to fine-tune pretrained models with a minimal number of parameters. Among these, Low-Rank Adaptation (LoRA) and its variants have gained considerable popularity because they avoid additional…
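For concreteness, below is a minimal sketch of the low-rank update idea behind LoRA, assuming a PyTorch-style linear layer; the class name `LoRALinear`, the rank, and the scaling constant are illustrative choices, not taken from any specific variant referenced above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W plus a trainable low-rank update B @ A."""
    def __init__(self, d_in, d_out, rank=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)                  # pretrained weight stays frozen
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)   # low-rank factor A
        self.B = nn.Parameter(torch.zeros(d_out, rank))         # low-rank factor B (zero init => no change at start)
        self.scale = alpha / rank

    def forward(self, x):
        # y = x W^T + scale * x (B A)^T; only A and B receive gradients,
        # so roughly rank * (d_in + d_out) parameters are trained instead of d_in * d_out.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Usage sketch: wrap a projection of hidden size 768 and run a forward pass.
layer = LoRALinear(768, 768, rank=8)
y = layer(torch.randn(2, 768))
```

Because the base weight is frozen and the low-rank factors can be merged into it after training, this style of adapter adds no extra parameters at inference time, which is a key reason for its popularity.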