
Global Convolutional Language Models: Long-Context Sequence Modeling via Frequency-Domain Mixing

December 07, 2025

Abstract

The dominant models for long-context sequence processing rely on self-attention to integrate information across distant positions. Although effective, the quadratic time and memory cost of attention presents substantial challenges as sequence lengths increase. In this work, we propose Global Convolutional Language Models (GCLMs), a family of architectures that replace attention with a combination of frequency-domain global convolution and local depthwise convolution. The global operator applies a learned sequence-length–sized convolution kernel using the Fast Fourier Transform, enabling O(n log n) global mixing while preserving the parallelism and stability of convolutional networks. Local convolutions complement this mechanism by capturing short-range structure.

Experiments demonstrate that GCLMs train stably at long context lengths on consumer GPUs, converge rapidly even at small scale, and reach low loss values within a fraction of a single epoch. These findings suggest that global convolution provides an efficient and practical alternative to attention for long-context language modeling.

1 Introduction

The dominant models for modern sequence processing tasks employ self-attention, which enables interactions between all positions in a sequence (Vaswani et al., 2017). Although attention has proven remarkably effective, its computational and memory costs scale quadratically with sequence length. This limitation poses substantial challenges for tasks requiring long-context reasoning, including document-level modeling, code analysis, and extended conversational agents.

Convolutional architectures provide an appealing alternative due to their parallelizability, stable optimization properties, and linear-time computation. However, traditional convolutions expand their receptive fields gradually, requiring many layers or dilated kernels to propagate information over long distances. Prior work on convolutional sequence models has narrowed this gap, yet these architectures have not matched the global contextual capabilities of attention.

In this work, we introduce Global Convolutional Language Models (GCLMs), which combine local depthwise convolutions with a learnable global convolution kernel applied efficiently via the Fast Fourier Transform. This approach yields an explicit global receptive field in every layer while maintaining subquadratic computational complexity. GCLMs are simple to implement, require no attention matrices, and train stably at long context lengths on commodity hardware.

We show that small-scale GCLMs converge rapidly and efficiently even at large sequence lengths, demonstrating that global convolution can provide a practical alternative to attention for long-context sequence modeling.

2 Related Work

Recurrent networks

Early models such as LSTMs and GRUs (Hochreiter & Schmidhuber, 1997; Cho et al., 2014) capture temporal dependencies through sequential state updates. While effective for short-range dependencies, their sequential nature limits parallelism and hinders long-range modeling (Pascanu et al., 2013). Faster variants such as QRNNs and SRUs (Bradbury et al., 2017; Lei et al., 2018) mitigate some inefficiencies but retain recurrence as a bottleneck.

Convolutional architectures

Temporal convolutions offer parallel computation and stable optimization. Early work explored stacked or dilated convolutions (Kalchbrenner et al., 2014; van den Oord et al., 2016), and architectures such as ConvS2S (Gehring et al., 2017) demonstrated competitive machine translation performance. However, convolutional receptive fields grow only linearly with depth or dilation, limiting their ability to model long-range dependencies.

Attention-based models

Transformers (Vaswani et al., 2017) enable direct pairwise interactions across all token positions, achieving state-of-the-art performance across numerous domains. Yet the O(n²) cost of attention presents difficulties for long-context modeling. Variants employ sparsity (Child et al., 2019), kernel approximations (Katharopoulos et al., 2020), or retrieval augmentation (Borgeaud et al., 2022) to reduce complexity, though many retain architectural complexity or approximation constraints.

Implicit operators and state space models

Structured sequence models such as S4 (Gu et al., 2021) and Hyena (Poli et al., 2023) leverage parameterized implicit convolution kernels for efficient long-range modeling. These approaches demonstrate that global mixing can be achieved without explicit attention. Our work aligns with this direction, introducing a simpler mechanism: a fully learnable global convolution kernel applied via FFT.

Fourier-based global mixing

Fourier transforms have been explored for global token mixing (Lee-Thorp et al., 2021). Unlike fixed Fourier transforms, our approach employs a learnable kernel with full sequence-length support, enabling expressive global interactions through O(n log n) convolution.

3 Model Architecture

GCLMs integrate two complementary mechanisms: local depthwise convolution for short-range structure and global convolution for full-sequence mixing. Each block contains residual connections and pre-layer normalization.

3.1 Input Representation

Given a token sequence x = (x_1, ..., x_T), the model computes:
h^(0)_t = E[x_t] + P[t]
where E is a token embedding matrix and P is a learned positional embedding.
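
For concreteness, the following PyTorch sketch implements this input representation; the module name InputEmbedding and the choice to store P in an nn.Embedding are illustrative conventions rather than details taken from the text.

```python
import torch
import torch.nn as nn

class InputEmbedding(nn.Module):
    """Token embedding plus learned positional embedding (Section 3.1)."""

    def __init__(self, vocab_size: int, d_model: int, max_len: int):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)  # E: token embedding matrix
        self.pos = nn.Embedding(max_len, d_model)     # P: learned positional embedding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T) integer token ids, with T <= max_len
        positions = torch.arange(x.size(1), device=x.device)
        return self.tok(x) + self.pos(positions)      # h^(0): (batch, T, d_model)
```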

3.2 Local Convolutional Path

Short-range structure is modeled using a depthwise-separable convolution. For hidden state h:
ℓ = PWConv(σ(DWConv(h^T)))^T
where DWConv applies a channel-wise convolution, PWConv applies a 1×1 convolution, and σ is a nonlinearity (ReLU).
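
A minimal PyTorch sketch of this path, assuming a hidden state of shape (batch, T, d_model). The kernel size of 7 and the symmetric 'same' padding are assumptions not specified above; a strictly causal variant would instead pad only on the left.

```python
import torch
import torch.nn as nn

class LocalConv(nn.Module):
    """Depthwise-separable convolution for short-range structure (Section 3.2)."""

    def __init__(self, d_model: int, kernel_size: int = 7):
        super().__init__()
        # DWConv: channel-wise convolution (groups = channels), 'same' padding for odd kernels.
        self.dw = nn.Conv1d(d_model, d_model, kernel_size,
                            padding=kernel_size // 2, groups=d_model)
        # PWConv: 1x1 convolution that mixes channels.
        self.pw = nn.Conv1d(d_model, d_model, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, T, d_model); Conv1d expects (batch, channels, T), hence the transposes.
        x = h.transpose(1, 2)
        x = self.pw(self.act(self.dw(x)))
        return x.transpose(1, 2)
```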

3.3 Global Convolution via FFT

Each block includes a learned global kernel k spanning the full sequence. The convolution y = h * k is computed in the frequency domain:
ĥ = FFT(h^T), k̂ = FFT(k)
ŷ = ĥ ⊙ k̂
y = IFFT(ŷ)
This operation scales as O(n log n) rather than O(n²), enabling global context at lower cost than attention.
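
The sketch below shows one way to realize this operator in PyTorch. Zero-padding the transforms to length 2T, so that the frequency-domain product corresponds to a linear (and causal) rather than circular convolution, and the small random kernel initialization are implementation assumptions, not details stated above.

```python
import torch
import torch.nn as nn

class GlobalConvFFT(nn.Module):
    """Learned full-sequence convolution applied in the frequency domain (Section 3.3)."""

    def __init__(self, d_model: int, max_len: int):
        super().__init__()
        # One learned kernel value per (channel, lag); small init keeps early outputs stable.
        self.kernel = nn.Parameter(0.02 * torch.randn(d_model, max_len))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, T, d_model)
        T = h.size(1)
        n = 2 * T  # zero-pad so the pointwise product implements a linear convolution
        h_hat = torch.fft.rfft(h.transpose(1, 2), n=n)    # (batch, d_model, n//2 + 1)
        k_hat = torch.fft.rfft(self.kernel[:, :T], n=n)   # (d_model, n//2 + 1)
        y = torch.fft.irfft(h_hat * k_hat, n=n)[..., :T]  # keep the first T (causal) outputs
        return y.transpose(1, 2)                          # (batch, T, d_model)
```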

3.4 Dual-Path Block

Given input h^(l), each block applies pre-layer normalization, passes the normalized representation through both the local convolutional path (Section 3.2) and the global convolutional path (Section 3.3), and combines the two path outputs with the residual stream to produce h^(l+1).
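
A possible realization of the block, reusing the LocalConv and GlobalConvFFT sketches above; adding the two path outputs into the residual stream is an assumption about how the paths are combined.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Pre-LN residual block with local and global convolutional paths (Section 3.4)."""

    def __init__(self, d_model: int, max_len: int, kernel_size: int = 7):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.local_path = LocalConv(d_model, kernel_size)    # sketch from Section 3.2
        self.global_path = GlobalConvFFT(d_model, max_len)   # sketch from Section 3.3

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        u = self.norm(h)  # pre-layer normalization
        # Residual connection around the sum of the two paths (assumed combination).
        return h + self.local_path(u) + self.global_path(u)
```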

3.5 Stacked Architecture

The full model consists of the token and positional embedding layer (Section 3.1), a stack of dual-path blocks (Section 3.4), and a linear output head that maps the final hidden states to vocabulary logits.

Training is autoregressive via next-token prediction.
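
A compact sketch of the full stack, reusing the modules above. The hidden size, depth, final layer normalization, and untied output head are illustrative defaults, not reported hyperparameters.

```python
import torch
import torch.nn as nn

class GCLM(nn.Module):
    """Embedding, stacked dual-path blocks, and a vocabulary projection (Section 3.5)."""

    def __init__(self, vocab_size: int, d_model: int = 256,
                 n_layers: int = 6, max_len: int = 4096):
        super().__init__()
        self.embed = InputEmbedding(vocab_size, d_model, max_len)  # Section 3.1 sketch
        self.blocks = nn.ModuleList(
            [DualPathBlock(d_model, max_len) for _ in range(n_layers)])
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.embed(x)
        for block in self.blocks:
            h = block(h)
        return self.lm_head(self.norm(h))  # (batch, T, vocab_size) next-token logits
```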

4 Training Objective

GCLMs are trained using next-token prediction with teacher forcing. For sequence x, the loss is:
L(θ) = -∑_t log p_θ(x_{t+1} | x_{≤ t})

Optimization uses AdamW with a learning rate of 3 × 10⁻⁴, and padding tokens are masked from the loss. No warm-up schedule, gradient clipping, or learning-rate decay was needed in our experiments.
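
A minimal sketch of this objective, assuming shifted next-token targets and a single pad_id whose positions are excluded via ignore_index; the function name and signature are illustrative.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Teacher-forced next-token cross-entropy with padding masked out (Section 4)."""
    # logits: (batch, T, vocab); tokens: (batch, T)
    pred = logits[:, :-1].reshape(-1, logits.size(-1))  # predictions for targets x_2 .. x_T
    target = tokens[:, 1:].reshape(-1)                  # shifted next-token targets
    return F.cross_entropy(pred, target, ignore_index=pad_id)
```

Paired with torch.optim.AdamW(model.parameters(), lr=3e-4), this sketch matches the objective and optimizer described above under the stated assumptions.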

5 Experiments

We conduct preliminary experiments to assess convergence speed, training stability, and long-context scalability on consumer hardware.

5.1 Context Length 4096

A model trained on 1,000 samples from the Needle-in-a-Haystack dataset with a 4096-token context reached a training loss of approximately 0.10 within roughly 40% of the first epoch. On an RTX 4060, the run completed in under five minutes and was stopped early due to rapid convergence.

5.2 Context Length 8256–8384

To test scalability, the maximum sequence length was increased to approximately 8.3k tokens. The model continued to train stably, with early loss values similar to the 4096-token run. No optimization or memory instabilities were observed, confirming that FFT-based global convolution remains practical at extended context lengths.

6 Contributions

Our main contributions are: (1) a dual-path architecture that replaces self-attention with local depthwise convolutions and a learnable global convolution kernel applied via the FFT; (2) an explicit global receptive field in every layer at O(n log n) cost, with no attention matrices; and (3) empirical evidence that small GCLMs train rapidly and stably at 4k–8k token contexts on a single consumer GPU.

7 Results

Across experiments, GCLMs demonstrate rapid convergence within a fraction of a single epoch, stable optimization at context lengths from 4,096 to roughly 8.3k tokens, and practical training times on a single consumer GPU (RTX 4060).

These findings indicate that global convolution provides a viable alternative to attention for long-context modeling, particularly in resource-constrained settings.

8 Conclusion

We introduced Global Convolutional Language Models, which replace self-attention with a dual-path architecture consisting of local depthwise convolutions and a learnable global convolution applied in the frequency domain. This design achieves explicit global receptive fields with O(n log n) complexity, offering a computationally efficient alternative to attention.

Experiments demonstrate that GCLMs train rapidly, handle long contexts stably, and scale effectively on consumer hardware. These results highlight the potential of global convolution as a foundation for future long-context architectures. Future work includes scaling to larger model sizes, exploring additional kernel parameterizations, and evaluating performance on a broader set of long-context tasks.