Hymba Hybrid-Head Architecture Boosts Small Language Model Performance

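Transformers, with their attention-based architecture, have become the dominant choice for language models (LMs) thanks to their strong performance, parallelization capabilities, and long-term recall through key-value (KV) caches. However, their quadratic computational cost and high memory demands pose efficiency challenges. In contrast, state space models (SSMs) such as Mamba and Mamba-2 offer constant complexity and efficient hardware optimization but struggle with memory recall tasks, which hurts their performance on general benchmarks.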

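NVIDIA researchers recently proposed Hymba, a family of small language models (SLMs) featuring a hybrid-head parallel architecture that integrates transformer attention mechanisms with SSMs to improve both efficiency and performance. In Hymba, attention heads provide high-resolution recall, while SSM heads enable efficient context summarization.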

The novel architecture of Hymba reveals several insights:

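Overhead in attention: Over 50% of attention computation can be replaced by cheaper SSM computation.
Local attention dominance: Most global attention can be replaced by local attention without sacrificing performance on general and recall-intensive tasks, thanks to the global information summarized by SSM heads.
KV cache redundancy: The key-value cache is highly correlated across heads and layers, so it can be shared across heads (group query attention) and across layers (cross-layer KV cache sharing).
Softmax attention limitation: Attention weights are constrained to sum to one, which limits sparsity and flexibility. We introduce learnable meta-tokens that are prepended to prompts, storing critical information and alleviating the "forced-to-attend" burden associated with attention mechanisms.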
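The sum-to-one constraint in the last point can be illustrated with a few lines of NumPy. This is only a toy sketch: the logits are made-up values, and the extra slot merely stands in for a meta-token; it is not how Hymba computes attention internally.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

# Hypothetical logits of one query over four real tokens, none of which
# is relevant (all logits are equally low).
real_logits = np.array([-2.0, -2.0, -2.0, -2.0])
real_only = softmax(real_logits)

# Same query with one extra prepended slot (standing in for a meta-token)
# whose logit is higher, so it can absorb the unwanted attention mass.
with_meta = softmax(np.concatenate(([1.0], real_logits)))

print("without extra slot:", real_only)
print("with extra slot, real tokens:", with_meta[1:])
print("mass absorbed by extra slot:", with_meta[0])
```

Without the extra slot, the four irrelevant tokens must share the full attention mass between them. With it, most of the mass has somewhere else to go.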

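This post shows that Hymba 1.5B performs favorably against state-of-the-art open-source models of similar size, including Llama 3.2 1B, OpenELM 1B, Phi 1.5, SmolLM2 1.7B, Danube2 1.8B, and Qwen2.5 1.5B. Compared to Transformer models of similar size, Hymba also achieves higher throughput and requires 10x less memory to store its cache.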

Hymba 1.5B is released to the Hugging Face collection and GitHub.

Hymba 1.5B performance

Figure 1 compares Hymba 1.5B against sub-2B models (Llama 3.2 1B, OpenELM 1B, Phi 1.5, SmolLM2 1.7B, Danube2 1.8B, Qwen2.5 1.5B) in terms of average task accuracy, cache size (MB) relative to sequence length, and throughput (tok/sec).

A figure showing three performance metrics comparing seven different AI language models in terms of average accuracy, cache size (MB) relative to sequence length, and throughput (tok/sec). — *Figure 1. Performance comparison of Hymba 1.5B Base against sub-2B models*

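In this set of experiments, the tasks include MMLU, ARC-C, ARC-E, PIQA, Hellaswag, Winogrande, and SQuAD-C. Throughput is measured on an NVIDIA A100 GPU with a sequence length of 8K and a batch size of 128 using PyTorch. For models that ran out of memory (OOM) during throughput measurement, the batch size was halved until the OOM was resolved, so that the reported figure is the maximum throughput achievable without OOM.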
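The batch-halving procedure can be sketched as a small retry loop. The `measure` callable and the `fake_measure` stand-in below are hypothetical, and Python's built-in `MemoryError` is used as a placeholder; a real PyTorch benchmark would catch the CUDA out-of-memory error instead.

```python
from typing import Callable, Tuple

def max_throughput(measure: Callable[[int], float], batch_size: int = 128) -> Tuple[int, float]:
    """Halve the batch size until `measure` stops raising MemoryError.

    `measure` is a hypothetical callable that runs generation at the given
    batch size and returns tokens per second, raising MemoryError on OOM.
    """
    while batch_size >= 1:
        try:
            return batch_size, measure(batch_size)
        except MemoryError:
            batch_size //= 2  # halve and retry
    raise RuntimeError("model does not fit even at batch size 1")

def fake_measure(batch_size: int) -> float:
    """Stand-in for a real benchmark: pretends anything above 32 runs out of memory."""
    if batch_size > 32:
        raise MemoryError
    return 10.0 * batch_size  # made-up throughput figure

best_batch, tokens_per_sec = max_throughput(fake_measure)
print(best_batch, tokens_per_sec)
```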

Hymba model design

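SSMs such as Mamba were introduced to address the quadratic complexity and large inference-time KV cache of transformers. However, because of their low-resolution memory, SSMs struggle with memory recall and task performance. To overcome these limitations, we propose a roadmap for developing efficient, high-performing small LMs in Table 1.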

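Configuration	Commonsense reasoning (%)	Recall (%)	Throughput (tokens/sec)	Cache size (MB)	Design reason
Ablations on 300M model size and 100B training tokens
Transformer (Llama)	44.08	39.98	721.1	414.7	Accurate recall but inefficient
State space models (Mamba)	42.98	19.23	4720.8	1.9	Efficient but inaccurate recall
A. + Attention heads (sequential)	44.07	45.16	776.3	156.3	Enhance recall capabilities
B. + Multi-heads (parallel)	45.19	49.90	876.7	148.2	Better balance of the two modules
C. + Local/global attention	44.56	48.79	2399.7	41.2	Boost compute/cache efficiency
D. + KV cache sharing	45.16	48.04	2756.5	39.4	Cache efficiency
E. + Meta-tokens	45.59	51.79	2695.8	40.0	Learned memory initialization
Scaling to 1.5B model size and 1.5T training tokens
F. + Size/data	60.56	64.15	664.1	78.6	Further boost task performance
G. + Extended context length (2K→8K)	60.64	68.79	664.1	78.6	Improve multishot and recall tasks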

Table 1. Design roadmap of the Hymba model
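The combined effect of steps C and D can be read off Table 1 by comparing row D with row B. The snippet below does that arithmetic on the table values only; the variable names are illustrative.

```python
# (throughput in tokens/sec, cache size in MB), copied from Table 1
parallel_heads = (876.7, 148.2)   # row B: parallel attention + SSM heads
kv_sharing = (2756.5, 39.4)       # row D: after local/global attention and KV cache sharing

throughput_gain = kv_sharing[0] / parallel_heads[0]
cache_reduction = parallel_heads[1] / kv_sharing[1]

print(f"throughput: {throughput_gain:.2f}x higher")
print(f"cache size: {cache_reduction:.2f}x smaller")
```

Both ratios are of the same order as the roughly 3x throughput and nearly 4x cache improvements reported for these optimizations.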

Fused hybrid modules

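According to the ablation study, fusing attention and SSM heads in parallel within a hybrid-head module outperforms stacking them sequentially. Hymba therefore fuses attention and SSM heads in parallel within each hybrid-head module, so both head types process the same information simultaneously. This architecture improves reasoning and recall accuracy.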

A diagram showing the architecture of a dual-path attention mechanism. The flow starts with an Input Projection, leading to Latent Feature extraction which splits into two parallel paths. The upper path (in blue) contains SSM Feature processing through SSM Heads and Gate Normalization. The lower path (in red) processes Attention Features through Attention Heads and Gate Normalization. Both paths converge at a Mean operation before final Output Projection. Arrows indicate the flow of data through the system. — *Figure 2. The hybrid-head module in Hymba*
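The data flow of the hybrid-head module (shared input projection, two parallel paths, normalization, mean, output projection) can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: a single attention head, a toy diagonal linear recurrence standing in for Mamba, plain RMS normalization with no learned gates, and random weights. All function and parameter names are hypothetical.

```python
import numpy as np

def causal_attention(x, wq, wk, wv):
    """Single-head causal softmax attention."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def toy_ssm(x, w_in, decay=0.9):
    """Toy recurrence h_t = decay * h_{t-1} + x_t @ w_in (a stand-in, not Mamba)."""
    h = np.zeros(w_in.shape[1])
    outputs = []
    for x_t in x:
        h = decay * h + x_t @ w_in
        outputs.append(h.copy())
    return np.stack(outputs)

def normalize(y, eps=1e-6):
    """Rescale each position to unit RMS so both paths have comparable magnitude."""
    return y / np.sqrt((y ** 2).mean(axis=-1, keepdims=True) + eps)

def hybrid_head(x, p):
    latent = x @ p["in"]                                   # shared input projection
    attn_out = normalize(causal_attention(latent, p["q"], p["k"], p["v"]))
    ssm_out = normalize(toy_ssm(latent, p["ssm"]))         # same latent, parallel path
    fused = 0.5 * (attn_out + ssm_out)                     # mean of the two paths
    return fused @ p["out"]                                # output projection

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
params = {name: rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
          for name in ("in", "q", "k", "v", "ssm", "out")}
x = rng.normal(size=(seq_len, d_model))
y = hybrid_head(x, params)
print(y.shape)
```

The key point is that both paths read the same `latent` features, in contrast to a sequential design where one module only sees the other's output.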

Efficiency and KV cache optimization

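Attention heads improve task performance but increase KV cache requirements and reduce throughput. To mitigate this, Hymba optimizes the hybrid-head module by combining local and global attention and by sharing the KV cache across layers. This improves throughput by 3x and reduces the cache by almost 4x without sacrificing performance.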

A diagram showing the architecture of a neural network model with Hymba Blocks. The model flows from left to right, starting with an Embedding layer, followed by alternating Hymba Blocks with Full Attention (in red) and SWA (in blue). The blocks are connected with KV sharing every 2 layers, shown in dotted green boxes labeled 'Repeat (N-3)/2'. Below the main flow, there's a detailed view of a module containing Layer norm, Hybrid-head module, another Layer norm, and FFN components. The diagram ends with an LM Head layer on the right. — *Figure 3. Hymba model architecture*
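A rough estimate shows why local attention and cross-layer sharing shrink the KV cache. The sketch below assumes 16-bit cache entries, applies sharing only to the sliding-window layers, and uses hypothetical head counts, head size, and window length; it is not meant to reproduce the cache sizes reported in this post.

```python
def kv_cache_mb(seq_len, n_layers, n_full_layers, window, n_kv_heads, head_dim,
                share_every=1, bytes_per_value=2):
    """Estimate KV cache size in MB for a mix of full and sliding-window layers.

    Full-attention layers cache every token; sliding-window layers cache at
    most `window` tokens. With cross-layer sharing, `share_every` consecutive
    sliding-window layers reuse one cache (an assumption of this sketch).
    """
    n_local_layers = n_layers - n_full_layers
    per_token = 2 * n_kv_heads * head_dim * bytes_per_value  # keys and values
    full = n_full_layers * seq_len * per_token
    local = n_local_layers * min(seq_len, window) * per_token / share_every
    return (full + local) / 2**20

# Hypothetical configuration: 32 layers, 4 KV heads of size 64, 8K context.
common = dict(seq_len=8192, n_layers=32, n_kv_heads=4, head_dim=64)
all_global = kv_cache_mb(n_full_layers=32, window=8192, **common)
local_global = kv_cache_mb(n_full_layers=3, window=1024, **common)
shared = kv_cache_mb(n_full_layers=3, window=1024, share_every=2, **common)

print(f"all global: {all_global:.1f} MB")
print(f"local/global: {local_global:.1f} MB")
print(f"local/global + sharing: {shared:.1f} MB")
```

In this model, the full-attention layers grow linearly with sequence length while the sliding-window layers stop growing once the window is full, which is where most of the savings come from at long context.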

Meta-tokens

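Meta-tokens are a set of 128 pretrained embeddings prepended to the input. They function as a learned cache initialization that enhances focus on relevant information. These tokens serve a dual purpose: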

Mitigating attention drain by acting as backstop tokens that redistribute attention effectively
Encapsulating compressed world knowledge

A diagram illustrating the Fading Memory architecture from SSM (State Space Model). The image shows three layers: At the top is a blue rectangular box labeled 'Fading Memory (From SSM)'. Below it are seven gray input tokens arranged horizontally. At the bottom are two sets of memory blocks: on the left are two green blocks labeled 'Meta Memory (Meta Tokens)', and on the right are three red blocks labeled 'Snapshot Memory (From Attn)'. Green arrows connect the Meta Memory to the input tokens, while red arrows connect the Snapshot Memory to the rightmost input tokens. A blue arrow loops back from the Fading Memory box to itself. — *Figure 4. Interpreting Hymba from the memory aspect*
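Prepending meta-tokens and keeping them visible to later tokens can be sketched as below. The embeddings are random placeholders for learned parameters, the window length is arbitrary, and the rule that meta-tokens remain visible outside the sliding window is an assumption of this sketch rather than a confirmed implementation detail.

```python
import numpy as np

n_meta, seq_len, d_model, window = 128, 10, 16, 4
rng = np.random.default_rng(0)

meta_tokens = rng.normal(size=(n_meta, d_model))  # learned in training; random here
prompt = rng.normal(size=(seq_len, d_model))      # embeddings of the real prompt tokens
inputs = np.concatenate([meta_tokens, prompt])    # meta-tokens come first

total = n_meta + seq_len
pos = np.arange(total)
causal = pos[None, :] <= pos[:, None]             # keys at or before the query
in_window = pos[:, None] - pos[None, :] < window  # only the most recent keys
is_meta = pos[None, :] < n_meta                   # key columns holding meta-tokens
allowed = causal & (in_window | is_meta)          # True where attention is permitted

print(inputs.shape, allowed.shape)
```

With this mask, the last prompt token can attend to all 128 meta-token positions and to its own recent window, but not to older prompt tokens.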

Model analysis

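This section presents an apples-to-apples comparison across different architectures under the same training settings. We then visualize the attention maps of SSM and attention in different pretrained models. Finally, we perform a head importance analysis for Hymba through pruning. Together, these analyses help illustrate how and why Hymba's design choices are effective.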

Apples-to-apples comparison

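We performed an apples-to-apples comparison of the Hymba, pure Mamba2, Mamba2 with FFN, Llama3-style, and Samba-style (Mamba-FFN-Attn-FFN) architectures. All models have 1 billion parameters and are trained from scratch for 100 billion tokens from SmolLM-Corpus with exactly the same training recipe. All results are obtained through lm-evaluation-harness using a zero-shot setting on Hugging Face models. Hymba performs the best on commonsense reasoning as well as on question answering and recall-intensive tasks.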

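Table 2 compares various model architectures on language modeling, recall-intensive tasks, and commonsense reasoning tasks, with Hymba achieving strong performance across metrics. Hymba has the lowest perplexity on the language tasks (18.62 for Wiki and 10.38 for LMB) and solid results on the recall-intensive tasks, particularly SWDE (54.29) and SQuAD-C (44.71), giving it the highest average score in this category (49.50).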

Model	Language (PPL)	Recall-intensive (%)	Commonsense reasoning (%)
Mamba2	15.88	43.34	52.52
Mamba2 w/ FFN	17.43	28.92	51.14
Llama3	16.19	47.33	52.82
Samba	16.28	36.17	52.83
Hymba	14.5	49.50	54.57

Table 2. Comparison of architectures trained on 100 billion tokens under the same settings

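In commonsense reasoning and question answering, Hymba outperforms the other models on most tasks, such as SIQA (31.76) and TruthfulQA (31.64), with an average score of 54.57, slightly above Llama3 and Mamba2. Overall, Hymba stands out as a balanced model that does well on both efficiency and task performance across diverse categories.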

Attention map visualization

We further categorized the elements of the attention map into four types:

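Meta: Attention scores from all real tokens to meta-tokens. This category reflects the model's preference for attending to meta-tokens. If a model has meta-tokens, they usually occupy the first few columns of the attention map (for example, 128 columns for Hymba).
BOS: Attention scores from all real tokens to the beginning-of-sequence token. In the attention map, they are usually located in the first column right after the meta-tokens.
Self: Attention scores from all real tokens to themselves. In the attention map, they are usually located on the diagonal.
Cross: Attention scores from all real tokens to other real tokens. In the attention map, they are usually located in the off-diagonal area.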
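The four categories above can be computed from an attention map with simple slicing. The sketch below assumes a square, row-normalized causal map whose first `n_meta` positions are meta-tokens followed by BOS, treats BOS as a real token, and counts BOS attending to itself under BOS rather than Self; these conventions and the function name are illustrative choices, not taken from the original analysis code.

```python
import numpy as np

def attention_shares(attn: np.ndarray, n_meta: int) -> dict:
    """Split the attention mass of real-token rows into Meta, BOS, Self, and Cross."""
    real = attn[n_meta:]                      # rows of real tokens (BOS is row 0)
    rows = np.arange(real.shape[0])
    meta = real[:, :n_meta].sum()             # columns holding meta-tokens
    bos = real[:, n_meta].sum()               # first column after the meta-tokens
    # Diagonal entries, skipping row 0 because BOS-to-BOS is already counted as BOS.
    self_mass = real[rows[1:], n_meta + rows[1:]].sum()
    total = real.sum()
    cross = total - meta - bos - self_mass    # everything else off the diagonal
    return {"Meta": meta / total, "BOS": bos / total,
            "Self": self_mass / total, "Cross": cross / total}

# Toy example: uniform causal attention over 2 meta-tokens + 5 real tokens.
attn = np.tril(np.ones((7, 7)))
attn /= attn.sum(axis=1, keepdims=True)
shares = attention_shares(attn, n_meta=2)
print({name: round(float(value), 3) for name, value in shares.items()})
```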

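The attention pattern of Hymba differs markedly from that of vanilla Transformers. In vanilla Transformers, attention scores are more concentrated on BOS, which is consistent with the findings on attention sinks. Vanilla Transformers also have a higher proportion of Self attention scores. In Hymba, meta-tokens, attention heads, and SSM heads complement one another, leading to a more balanced distribution of attention scores across the different types of tokens.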

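Specifically, meta-tokens offload attention scores from BOS, enabling the model to focus more on the real tokens. SSM heads summarize the global context and focus more on the current token (Self attention scores). Attention heads, on the other hand, pay less attention to Self and BOS tokens and more attention to other tokens (that is, Cross attention scores). This suggests that the hybrid-head design of Hymba can effectively balance the attention distribution across different types of tokens, potentially leading to better performance.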

A diagram showing the composition of the Hymba attention mechanism. It consists of three components that are added together: Meta Tokens (shown as a vertical green stripe on the left), Sliding Window Attention (displayed as a diagonal green band), and SSM (Mamba) (represented as a triangular green gradient). These three patterns combine to form the final Hymba pattern on the right, which shows a triangular area filled with green squares of varying intensity. Each component is displayed in a square grid format, and the combination is shown using plus signs between the components and an equals sign before the final result. — *Figure 5. Schematic of the attention map of Hymba* (a combination of meta-tokens, sliding window attention, and Mamba contributions)

A comparative visualization showing attention patterns across different language models. The image consists of three main parts: 1) Three attention heatmaps for Llama 3.2 3B and Hymba 1.5B models, showing diagonal patterns in purple, yellow, and blue colors. 2) A grid diagram showing BOS (Beginning of Sequence) token connections with Meta and Cross sections marked. 3) Three horizontal stacked bar charts comparing percentage distributions of Meta, BOS, Cross, and Self attention patterns across Llama 3.2 3B and two variants of Hymba models, with percentages clearly labeled in different colors. — *Figure 6. Sum of the attention scores from different categories in Llama 3.2 3B and Hymba 1.5B*

Head importance analysis

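We analyzed the relative importance of attention and SSM heads in each layer by removing them and recording the final accuracy. Our analysis reveals the following: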

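The relative importance of attention/SSM heads in the same layer is input-adaptive and varies across tasks, suggesting that they can serve different roles when handling different inputs.
The SSM head in the first layer is critical for language modeling, and removing it causes a substantial accuracy drop to random-guess levels.
In general, removing one attention/SSM head results in an average accuracy drop of 0.24%/1.1% on Hellaswag, respectively.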

A line graph comparing the Hellaswag Accuracy (y-axis ranging from 0.45 to 0.50) across 32 different layers (x-axis). The graph shows three elements: a horizontal dashed line labeled Orig Model at approximately 0.493, and two sets of bars in blue and orange representing Remove Attn and Remove SSM, respectively. The bars fluctuate slightly above and below the original model line, with most values falling between 0.47 and 0.495. The graph compares the impact of removing attention mechanisms versus SSM components at different layers of the model. — *Figure 7. Accuracy measured on 1K samples from Hellaswag after removing the attention or SSM heads in each layer*
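The ablation procedure behind this analysis amounts to a loop over layers and head types. In the sketch below, `evaluate` is a hypothetical callable that returns task accuracy with a given set of heads removed, and `fake_evaluate` is a made-up stand-in so the example runs; neither reflects the actual evaluation code.

```python
from typing import Callable, Dict, Tuple

def head_importance(evaluate: Callable[[frozenset], float],
                    n_layers: int) -> Dict[Tuple[int, str], float]:
    """Accuracy drop caused by removing one head type in one layer at a time.

    `evaluate` takes the set of removed (layer, kind) pairs and returns accuracy.
    """
    baseline = evaluate(frozenset())
    drops = {}
    for layer in range(n_layers):
        for kind in ("attn", "ssm"):
            drops[(layer, kind)] = baseline - evaluate(frozenset({(layer, kind)}))
    return drops

def fake_evaluate(removed: frozenset) -> float:
    """Made-up accuracy model used only to make the sketch runnable."""
    accuracy = 0.5
    for layer, kind in removed:
        accuracy -= 0.25 if (layer, kind) == (0, "ssm") else 0.01
    return accuracy

drops = head_importance(fake_evaluate, n_layers=4)
most_important = max(drops, key=drops.get)
print(most_important, drops[most_important])
```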

Model architecture and training best practices

This section outlines the key architectural decisions and training methodologies for Hymba 1.5B Base and Hymba 1.5B Instruct.

Model architecture

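Hybrid architecture: Mamba is good at summarization and usually focuses more closely on the current token, while attention is more precise and acts as snapshot memory. Combining them in parallel merges these benefits, whereas standard sequential fusion does not. We chose a 5:1 parameter ratio between SSM and attention heads.
Sliding window attention: Full attention heads are preserved in three layers (the first, the last, and the middle), and sliding window attention heads are used in the remaining 90% of layers.
Cross-layer KV cache sharing: Implemented between every two consecutive attention layers. This is done in addition to GQA KV cache sharing between heads.
Meta-tokens: These 128 tokens are learned without supervision. They help avoid entropy collapse problems in large language models (LLMs) and mitigate the attention sink phenomenon. The model also stores general knowledge in these tokens.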
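The layer layout described in the list above can be sketched as a small planning function. The 32-layer depth, the exact position of the middle layer, and the choice to pair only consecutive sliding-window layers for KV sharing are assumptions made for illustration.

```python
def layer_plan(n_layers: int = 32):
    """Mark the first, middle, and last layers as full attention; the rest as SWA."""
    full = {0, n_layers // 2, n_layers - 1}
    return ["full" if i in full else "swa" for i in range(n_layers)]

def kv_share_groups(plan):
    """Group consecutive sliding-window layers in pairs that share one KV cache.

    Full-attention layers keep their own cache in this sketch.
    """
    groups, current = [], []
    for i, kind in enumerate(plan):
        if kind == "full":
            if current:               # flush an unpaired sliding-window layer
                groups.append(current)
                current = []
            groups.append([i])
        else:
            current.append(i)
            if len(current) == 2:     # a complete pair shares one cache
                groups.append(current)
                current = []
    if current:
        groups.append(current)
    return groups

plan = layer_plan(32)
groups = kv_share_groups(plan)
swa_fraction = plan.count("swa") / len(plan)
print(f"SWA layers: {swa_fraction:.1%}, distinct KV caches: {len(groups)}")
```

With three full-attention layers out of 32, roughly 90% of layers use sliding window attention, which matches the proportion given above.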

Training best practices

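Pretraining: We opted for two-stage base model training. Stage 1 maintained a constant large learning rate and used less filtered, large-corpus data. The learning rate was then continuously decayed to 1e-5 using high-quality data. This approach makes it possible to continue training and to resume from Stage 1.
Instruction fine-tuning: Instruct model tuning is performed in three stages. First, SFT-1 gives the model strong reasoning abilities by training on code, math, function calling, role play, and other task-specific data. Second, SFT-2 teaches the model to follow human instructions. Finally, DPO is used to align the model with human preferences and improve its safety.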

Training pipeline for the Hymba model family divided into five sections that read (left to right) General pretraining, LR annealing, SFT-1, SFT-2, and DPO. — *Figure 8. Training pipeline for the Hymba model family*
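The two-stage pretraining schedule (a constant learning rate followed by decay to 1e-5) might look like the function below. Only the final value of 1e-5 comes from this post; the peak learning rate, the step counts, and the exponential shape of the decay are hypothetical choices for illustration.

```python
def learning_rate(step: int, stable_steps: int, decay_steps: int,
                  peak_lr: float = 3e-3, final_lr: float = 1e-5) -> float:
    """Constant learning rate during stage 1, then exponential decay to `final_lr`."""
    if step < stable_steps:
        return peak_lr
    progress = min(1.0, (step - stable_steps) / decay_steps)
    return peak_lr * (final_lr / peak_lr) ** progress

# Hypothetical schedule: 100 constant steps followed by 50 decay steps.
schedule = [learning_rate(s, stable_steps=100, decay_steps=50) for s in range(160)]
print(schedule[0], schedule[125], schedule[-1])
```

Because stage 1 uses a constant rate, any stage 1 checkpoint can serve as the starting point for further training or for a new decay phase.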

Performance and efficiency evaluation

With only 1.5T pretraining tokens, the Hymba 1.5B model performs the best among all small LMs and achieves better throughput and cache efficiency than all transformer-based LMs.

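For example, when benchmarked against the strongest baseline, Qwen2.5, which is pretrained on 13x more tokens, Hymba 1.5B achieves a 1.55 percentage point improvement in average accuracy, 1.41x the throughput, and 2.90x the cache efficiency. Compared to the strongest small LM trained on fewer than 2T tokens, namely h2o-danube2, our method achieves a 5.41 percentage point improvement in average accuracy, 2.45x the throughput, and 6.23x the cache efficiency.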
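These comparisons can be re-derived from the average accuracy, throughput, and cache columns of the base-model results table. The snippet below uses the rounded table values, so its ratios can differ from the quoted figures in the last digit.

```python
# (average accuracy %, tokens/sec, cache MB) as listed in the base-model results table
hymba = (61.06, 664, 79)
qwen25 = (59.51, 469, 229)
danube2 = (55.65, 271, 492)

def compare(ours, baseline):
    """Return (accuracy gain in points, throughput ratio, cache-size ratio)."""
    return (ours[0] - baseline[0], ours[1] / baseline[1], baseline[2] / ours[2])

for name, baseline in (("Qwen2.5", qwen25), ("h2o-danube2", danube2)):
    gain, speedup, cache_ratio = compare(hymba, baseline)
    print(f"vs {name}: +{gain:.2f} points, {speedup:.2f}x throughput, "
          f"{cache_ratio:.2f}x smaller cache")
```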

Model	#Params	Train tokens	Tokens/sec	Cache (MB)	MMLU 5-shot	ARC-E 0-shot	ARC-C 0-shot	PIQA 0-shot	Wino 0-shot	Hella 0-shot	SQuAD-C 1-shot	Avg
OpenELM-1	1.1B	1.5T	249	346	27.06	62.37	19.54	74.76	61.8	48.37	45.38	48.57
Rene v0.1	1.3B	1.5T	800	113	32.94	67.05	31.06	76.49	62.75	51.16	48.36	52.83
Phi 1.5	1.3B	0.15T	241	1573	42.56	76.18	44.71	76.56	72.85	48	30.09	55.85
SmolLM	1.7B	1T	238	1573	27.06	76.47	43.43	75.79	60.93	49.58	45.81	54.15
Cosmo	1.8B	0.2T	244	1573	26.1	62.42	32.94	71.76	55.8	42.9	38.51	47.2
h2o-danube2	1.8B	2T	271	492	40.05	70.66	33.19	76.01	66.93	53.7	49.03	55.65
Llama 3.2 1B	1.2B	9T	535	262	32.12	65.53	31.39	74.43	60.69	47.72	40.18	50.29
Qwen2.5	1.5B	18T	469	229	60.92	75.51	41.21	75.79	63.38	50.2	49.53	59.51
AMD OLMo	1.2B	1.3T	387	1049	26.93	65.91	31.57	74.92	61.64	47.3	33.71	48.85
SmolLM2	1.7B	11T	238	1573	50.29	77.78	44.71	77.09	66.38	53.55	50.5	60.04
Llama 3.2 3B	3.0B	9T	191	918	56.03	74.54	42.32	76.66	69.85	55.29	43.46	59.74
Hymba	1.5B	1.5T	664	79	51.19	76.94	45.9	77.31	66.61	53.55	55.93	61.06

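Table 3. Hymba 1.5B Base model results

Instructed models

The Hymba 1.5B Instruct model achieves the highest average performance across all tasks, outperforming the previous state-of-the-art model, Qwen2.5 Instruct, by about 2 points (49.22 versus 47.30). Specifically, Hymba 1.5B surpasses all other models on GSM8K, GPQA, and BFCLv2, with scores of 58.76, 31.03, and 46.40, respectively. These results indicate the strength of Hymba 1.5B, particularly in areas that require complex reasoning capabilities.

Model	#Params	MMLU ↑	IFEval ↑	GSM8K ↑	GPQA ↑	BFCLv2 ↑	Avg ↑
SmolLM	1.7B	27.80	25.16	1.36	25.67	–	20.00
OpenELM	1.1B	25.65	6.25	56.03	21.62	–	27.39
Llama 3.2	1.2B	44.41	58.92	42.99	24.11	20.27	38.14
Qwen2.5	1.5B	59.73	46.78	56.03	30.13	43.85	47.30
SmolLM2	1.7B	49.11	55.06	47.68	29.24	22.83	40.78
Hymba	1.5B	52.79	57.14	58.76	31.03	46.40	49.22

Table 4. Hymba 1.5B Instruct model results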

Conclusion

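The new Hymba family of small LMs features a hybrid-head architecture that combines the high-resolution recall of attention heads with the efficient context summarization of SSM heads. To further optimize Hymba's performance, we introduce learnable meta-tokens, which act as a learned cache for both attention and SSM heads and enhance the model's focus on salient information. Through the roadmap, comprehensive evaluations, and ablation studies, Hymba sets new state-of-the-art performance across a wide range of tasks, with strong results in both accuracy and efficiency. This work also provides valuable insights into the advantages of hybrid-head architectures and offers a promising direction for future research on efficient LMs.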

Learn more about Hymba 1.5B Base and Hymba 1.5B Instruct.

Acknowledgments

This work would not have been possible without contributions from many people at NVIDIA, including Wonmin Byeon, Zijia Chen, Ameya Sunil Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, Nikolaus Binder, Hanah Zhang, Maksim Khadkevich, Yingyan Celine Lin, Jan Kautz, Pavlo Molchanov, and Nathan Horrocks.
