Title#
SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling
Abstract#
Tokenization plays a critical role in language modeling, yet existing approaches such as Byte-Pair Encoding (BPE) and WordPiece operate purely on frequency statistics, ignoring the underlying semantic structure of text. This leads to over-tokenization of semantically redundant spans and underutilization of contextual coherence, particularly in long-context scenarios. In this work, we propose SemToken, a semantic-aware tokenization framework that jointly reduces token redundancy and improves computational efficiency. SemToken first extracts contextual semantic embeddings via a lightweight encoder and performs local semantic clustering to merge semantically equivalent tokens. It then allocates heterogeneous token granularity based on semantic density, allowing finer-grained tokenization in content-rich regions and coarser compression in repetitive or low-entropy spans. SemToken integrates seamlessly with modern language models and attention acceleration methods. Experiments on long-context language modeling benchmarks such as WikiText-103 and LongBench show that SemToken achieves up to a 2.4× reduction in token count and a 1.9× speedup, with negligible or no degradation in perplexity and downstream accuracy. Our findings suggest that semantic structure offers a promising new axis for optimizing tokenization and computation in large language models.
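To make the pipeline in the abstract concrete, here is a minimal Python sketch of the idea, not the authors' implementation: a small pretrained encoder (distilbert-base-uncased, an illustrative stand-in for the "lightweight encoder") produces contextual embeddings, adjacent base tokens are greedily merged when their embeddings are nearly parallel (one possible reading of "local semantic clustering"), and a neighbour-dissimilarity score stands in for "semantic density" so that flat, repetitive spans merge more aggressively. The encoder choice, thresholds, and density heuristic are all assumptions.

```python
# Minimal sketch of the SemToken idea, not the authors' implementation.
# Assumptions: distilbert-base-uncased as the "lightweight encoder", cosine
# similarity for "semantic equivalence", and neighbour dissimilarity as the
# "semantic density" signal.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

ENCODER_NAME = "distilbert-base-uncased"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(ENCODER_NAME)
encoder = AutoModel.from_pretrained(ENCODER_NAME).eval()

@torch.no_grad()
def semtoken(text: str, base_threshold: float = 0.85,
             density_gain: float = 0.2) -> list[str]:
    """Greedily merge adjacent base tokens into coarser spans where the text
    is semantically flat; keep fine granularity in content-rich regions."""
    enc = tokenizer(text, return_tensors="pt", add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    if not tokens:
        return []
    # Contextual embeddings from the lightweight encoder, unit-normalized.
    h = F.normalize(encoder(**enc).last_hidden_state[0], dim=-1)  # (T, d)

    # Density proxy: how far each token's embedding departs from its left
    # neighbour. Low values mark repetitive / low-entropy spans.
    density = torch.ones(len(tokens))
    density[1:] = 1.0 - (h[1:] * h[:-1]).sum(dim=-1)

    spans, current = [], [0]
    for i in range(1, len(tokens)):
        # Lower the merge threshold where density is low, i.e. compress
        # low-entropy spans more aggressively (heterogeneous granularity).
        threshold = base_threshold - density_gain * (1.0 - density[i].item())
        if torch.dot(h[current[0]], h[i]).item() > threshold:
            current.append(i)  # same local semantic cluster: merge
        else:
            spans.append(current)
            current = [i]      # content boundary: start a fine-grained span
    spans.append(current)

    # Render spans; strip WordPiece continuation markers for readability.
    return ["".join(t[2:] if t.startswith("##") else t
                    for t in (tokens[j] for j in s))
            for s in spans]

print(semtoken("The cat sat on the mat. The cat sat on the mat."))
```

Note that the output here is just strings over WordPiece pieces; in a real system the merged spans would still need to be mapped into the downstream model's vocabulary (or paired with learned span embeddings) before they could replace its tokenizer, and the paper's reported gains would depend on that integration rather than on this toy merging rule.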