Small models.
Absolute precision.

72 languages. 27 scripts. Four tokenizer generations. Two model families arriving.

AENEA is a family of small language models designed from first principles for factual determinism. Prelude-1 proved the thesis. Prelude-4 surpassed it. Prelude-5 is rewriting the timeline - reaching factual crystallisation 5× faster than any model before it.

The Journey

From first commit to launch

AENEA began in August 2025 with a simple question: what happens when you treat data quality as architecture, not preprocessing? Six months later, we have our answer.

August 2025
Project Genesis
First line of code. The hypothesis: a 284M-parameter model trained on surgically clean data can outperform models 10× its size on factual recall tasks. We start building the data infrastructure.
September 2025
The Quartz Pipeline
Wiki Ultra-Clean v1 through v4. We learn that Wikipedia is 40% noise by volume - tables, infoboxes, navigation templates, census boilerplate. Each version gets more ruthless. The pipeline becomes a scalpel.
October 2025
The Overture Architecture
d=1024 embedding geometry. 16 attention heads. Rotary position embeddings. The model architecture crystallises around a single principle: every dimension of the latent space must carry factual signal.
November 2025
Pipeline v6 - Sub-8-Hour
The breakthrough. Parallel decompression, regex XML splitting, pre-computed MinHash signatures. Full English Wikipedia processed in under 8 hours. The Quartz data stack reaches production quality.
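The pre-computed MinHash step can be sketched in a few lines. This is a hand-rolled illustration, not the Quartz implementation: the permutation count, shingle size, and hash choice are our own assumptions. The idea is that each document gets a compact signature computed once, and near-duplicates are then found by comparing signatures instead of full texts.

```python
import hashlib
import re

NUM_PERM = 64  # number of hash permutations (illustrative; production pipelines often use 128+)

def shingles(text, k=3):
    """Split text into a set of word k-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def minhash_signature(text):
    """One minimum per seeded hash function: a compact sketch of the shingle set."""
    shingle_set = shingles(text)
    sig = []
    for seed in range(NUM_PERM):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(
                s.encode(), digest_size=8, salt=seed.to_bytes(8, "big")
            ).digest(), "big")
            for s in shingle_set
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the Jaccard similarity of the shingle sets."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / NUM_PERM

# Two near-duplicate sentences and one unrelated one (made-up sample text).
a = minhash_signature("the quick brown fox jumps over the lazy dog near the river bank")
b = minhash_signature("the quick brown fox jumps over the lazy dog near the river bend")
c = minhash_signature("completely unrelated census boilerplate text about population tables")
```

Signatures are fixed-size regardless of document length, so pairwise comparison stays cheap even at full-Wikipedia scale.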
December 2025
Training Begins
Prelude-1 enters training on Quartz-cleaned Wikipedia. Loss drops steadily. The anomaly detector catches micro-batch outliers - the model is learning clean factual representations.
January 2026
Stack Exchange Corpus
The SE Ultra-Clean pipeline goes live. 23 Stack Exchange sites transformed into instruct-format Q&A pairs. The model's training data now spans both declarative knowledge and procedural reasoning.
February 2026
Convergence - Loss 2.807
All-time sustained best. EMA loss reaches 2.807, perplexity 16.6. Prelude-1 returns single-sentence factual completions with sub-second latency. The model approaches its theoretical floor.
March 2026
QT_V.2 Tokenizer Family - Three Sizes
The tokenizer family ships. Three variants: 64K (smallest embedding), 96K (best all-round), and 114K Code (multilingual coding). 71 languages across 26 script families. Validated on FLORES-200 across 204 languages - fewest total tokens and 4× more equitable than Llama 3.
March 2026
QT V.3 32K UltraLingo
The third-generation tokenizer. 32,000 vocabulary covering 71 languages across 26 writing systems. Outperforms Llama 3's 128K vocabulary on 48 FLORES-200 languages. 3× better cross-lingual equity at one-quarter the vocabulary size.
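The cross-lingual equity comparisons above can be illustrated with a simple fertility metric: tokens per whitespace-delimited word per language, with equity taken as the worst-to-best ratio across languages (1.0 would be perfectly equitable). The sketch below uses a stand-in character-bigram tokenizer and made-up two-language corpora; the actual comparison runs the QT and Llama 3 tokenizers over FLORES-200.

```python
def fertility(tokenize, texts):
    """Average tokens per whitespace-delimited word over a sample of texts."""
    tokens = sum(len(tokenize(t)) for t in texts)
    words = sum(len(t.split()) for t in texts)
    return tokens / words

def equity_ratio(tokenize, corpora):
    """Worst-to-best fertility ratio across languages, plus the per-language values."""
    fertilities = {lang: fertility(tokenize, texts) for lang, texts in corpora.items()}
    return max(fertilities.values()) / min(fertilities.values()), fertilities

# Stand-in tokenizer: fixed character bigrams, so fertility tracks average word length.
def char_bigram_tokenize(text):
    stripped = text.replace(" ", "")
    return [stripped[i:i + 2] for i in range(0, len(stripped), 2)]

corpora = {
    "eng": ["the cat sat on the mat"],
    "deu": ["die Katze sass auf der Matte"],
}
ratio, per_lang = equity_ratio(char_bigram_tokenize, corpora)
```

A lower ratio means no language pays a systematic token tax; a real evaluation averages over the full parallel benchmark rather than single sentences.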
March 2026
Prelude-4 Surpasses Previous Generation
Prelude-4 (276M parameters, d=1024, 16 layers, GQA 4:1) exceeds Prelude-1's factual recall performance, reaching loss 2.734 and gradient norm 0.267. Establishes the Factual Crystallisation Hypothesis - gradient norm, not loss, predicts factual emergence.
April 2026
QT V.4.4 32K UltraLingo
The fourth-generation Prelude tokenizer. 32K vocabulary across 72 languages and 27 scripts, with six over-represented scripts removed from the equity bucket to prevent concentration bias. The result: more efficient multilingual representation that frees model capacity for factual learning rather than redundant script encoding.
April 2026
Prelude-5 - 5× Faster Crystallisation
Prelude-5 (276M parameters, QT V.4.4 32K UltraLingo) reaches loss 2.249, perplexity 9.48, and gradient norm 0.208 at step 40,000 - surpassing every previous Prelude on every metric. It approaches Prelude-1's factual recall in roughly one-fifth of the training compute. The QT V.4.4 tokenizer accelerates factual emergence by approximately 5× compared to earlier tokenizer generations. Training continues toward 150,000 steps.
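The reported perplexities follow directly from the losses: perplexity is the exponential of the mean per-token cross-entropy (in nats), so loss 2.249 gives PPL ≈ 9.48 and Prelude-1's best EMA loss of 2.807 gives ≈ 16.6.

```python
import math

def perplexity(mean_ce_loss_nats):
    """Perplexity is exp of the mean per-token cross-entropy, measured in nats."""
    return math.exp(mean_ce_loss_nats)

print(round(perplexity(2.249), 2))  # → 9.48  (Prelude-5 at step 40,000)
print(round(perplexity(2.807), 1))  # → 16.6  (Prelude-1's best EMA loss)
```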
The Models

AENEA Model Family

Prelude-1 proved the thesis. Prelude-4 surpassed it. Prelude-5 is rewriting the timeline on the QT V.4.4 32K UltraLingo tokenizer. Overture-1 (planned) will add advanced reasoning and code generation on the QT V.4.1 64K UltraLingo tokenizer.

Prelude-1

aenea-prelude-1-1024 · QT_32k_IV · v1.0
First model in the AENEA family. Trained exclusively on Quartz-cleaned corpora. Designed for single-sentence factual completion.
RELEASED
Parameters
284M
d=1024 · 16 layers · GQA · RoPE
Training Data
6.4B
tokens (Quartz-cleaned)
Tokenizer
QT-32K
ByteBPE v4
Best EMA Loss
2.807

Prelude-4

aenea-prelude-4 · QT V.3 32K · v4.0
Fourth generation. Surpassed Prelude-1 on factual recall and established the Factual Crystallisation Hypothesis - gradient norm, not loss, predicts factual emergence.
TRAINED
Parameters
276M
d=1024 · 16 layers · GQA 4:1
Tokenizer
QT V.3 32K
UltraLingo SuperBPE
Best Loss
2.734
Surpasses Prelude-1 (2.807)
Grad Norm
0.267
Crystallisation zone

Prelude-5

aenea-prelude-5 · QT V.4.4 32K · v5.0
Fifth generation. Reaching factual crystallisation 5× faster than any prior model. Generates multi-paragraph coherent prose with complex sentence structure and accurate naming conventions across European languages.
IN TRAINING
Parameters
276M
d=1024 · 16 layers · GQA 4:1
Tokenizer
QT V.4.4 32K
UltraLingo · 72 langs · 27 scripts
Best Loss
2.249
PPL 9.48 · step 40,000
Grad Norm
0.208
Deep crystallisation

Overture-1

QT V.4.1 64K UltraLingo · Advanced Reasoning & Code
Advanced multilingual reasoning and code generation. Currently in the design phase - building upon training insights from the Prelude series.
PLANNED
Tokenizer
QT V.4.1 64K
UltraLingo · 72 langs · 27 scripts
Focus
Reasoning & Code
Advanced multilingual
Approach

Why smaller models can think bigger

Most parameters in large models are wasted - compensating for noisy data, fragmented representations, and training regimes that fight themselves. We start from the opposite premise.

Ultra-Clean Data

The Quartz v7.3 pipeline removes encoding artefacts, vandalism, and noise across 71 languages and 26 script families. Every malformed token is a wrinkle in the loss landscape - we iron them out before training begins.

Coherent Geometry

Architectures designed so representations built during one training phase remain geometrically compatible with the next. Knowledge encodes cleanly and manifests back into language without distortion.

Multi-Epoch Depth

Three passes over curated data. The first epoch builds the map. The second irons out its creases. The third polishes the routes between internal representation and fluent generation.

Factual Crystallisation

Our research has identified that gradient norm, not loss, predicts the onset of factual recall in language models. When gradient norm drops to approximately 0.27, the model transitions from memorisation to genuine factual crystallisation. Prelude-5 extends the hypothesis with a critical finding: tokenizer quality directly accelerates factual emergence. The QT V.4.4 32K UltraLingo tokenizer reaches the crystallisation zone roughly 5× faster than earlier tokenizer generations - providing a principled framework for predicting, and now accelerating, factual recall in small models.
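As a minimal sketch of how the hypothesis could be monitored in practice: track an exponential moving average of the per-step global gradient norm and flag the first step it enters the crystallisation zone. The EMA decay and the exact 0.27 threshold here are illustrative assumptions, not the production training setup.

```python
def grad_norm_ema(grad_norms, decay=0.99):
    """Exponential moving average of per-step global gradient norms."""
    ema = grad_norms[0]
    history = []
    for g in grad_norms:
        ema = decay * ema + (1 - decay) * g
        history.append(ema)
    return history

def crystallisation_step(grad_norms, threshold=0.27, decay=0.99):
    """First step whose smoothed gradient norm drops below the threshold, else None."""
    for step, ema in enumerate(grad_norm_ema(grad_norms, decay)):
        if ema < threshold:
            return step
    return None
```

Smoothing matters: raw gradient norms are noisy across micro-batches, so a single low reading should not be taken as the onset of crystallisation.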

Developer Experience

Simple to use, powerful underneath

AENEA models ship as standard checkpoints compatible with common inference frameworks. Load it, prompt it, generate - the engineering complexity is in the training, not the interface.

Prelude-1 is released. Prelude-4 is trained. Prelude-5 is in training. All AENEA models use the QT tokenizer family, which provides efficient encoding across every supported script.

python - inference
# Load a released checkpoint
from aenea import AENEA

model = AENEA.load("aenea-prelude-1")
# model = AENEA.load("aenea-prelude-5")  # in training

# Generate in any of the 72 supported languages
output = model.generate(
    prompt="Die Geschichte des",
    max_tokens=256,
    temperature=0.1,
)
▸ QT tokenizer family · 72 languages · 27 scripts
Roadmap

What's coming

Prelude-1 is released. Prelude-4 is trained. The QT tokenizer family is live across four generations. Prelude-5 is in training.

February 2026 · Complete

Prelude-1 Base

284M parameter base model. Three-epoch training on ultra-clean Wikipedia. Open weights and full training logs.
March 2026 · Complete

QT_V.2 Tokenizer Family

Three tokenizers: 64K, 96K, and 114K Code. 71 languages, 26 script families. Validated on FLORES-200 (204 languages) - fewest total tokens and 4× more equitable than Llama 3. Published on HuggingFace.
March 2026 · Complete

QT V.3 32K UltraLingo

Third-generation SuperBPE tokenizer. 32K vocabulary covering 71 languages across 26 writing systems. Outperforms Llama 3's 128K on 48 FLORES-200 languages at one-quarter the vocabulary size. Published on HuggingFace.
March 2026 · Complete

Prelude-4

276M parameters on QT V.3 32K UltraLingo. Surpassed Prelude-1 on factual recall. Loss 2.734, grad norm 0.267. Established the Factual Crystallisation Hypothesis.
April 2026 · Complete

QT V.4.4 32K UltraLingo

Fourth-generation Prelude tokenizer. 32K vocabulary across 72 languages and 27 scripts. Six over-represented scripts removed from the equity bucket to prevent concentration bias. Published on HuggingFace.
In Training

Prelude-5

276M parameters on QT V.4.4 32K UltraLingo. Loss 2.249, perplexity 9.48, grad norm 0.208 at step 40,000. Reaching factual crystallisation 5× faster than any prior model. Training continues toward 150,000 steps.
Planned

Overture-1

Advanced multilingual reasoning and code generation on the QT V.4.1 64K UltraLingo tokenizer. Currently in the design phase - building upon training insights established by the Prelude series.

The future is multilingual

Prelude-1 proved that precision beats parameter count. Prelude-4 surpassed it. Prelude-5 is reaching factual crystallisation 5× faster.