Small models.
Absolute precision.

71 languages. 26 scripts. Three tokenizers. Two models arriving.

AENEA is a family of small language models designed from first principles for factual determinism. Prelude-1 proved the thesis. Now we're building Prelude-2 (multilingual reasoning) and Overture-1 (advanced reasoning & code) on the QT_V.2 tokenizer family.

The Journey

From first commit to launch

AENEA began in August 2025 with a simple question: what happens when you treat data quality as architecture, not preprocessing? Six months later, we have our answer.

August 2025
Project Genesis
First line of code. The hypothesis: a 284M-parameter model trained on surgically clean data can outperform models 10× its size on factual recall tasks. We start building the data infrastructure.
September 2025
The Quartz Pipeline
Wiki Ultra-Clean v1 through v4. We learn that Wikipedia is 40% noise by volume — tables, infoboxes, navigation templates, census boilerplate. Each version gets more ruthless. The pipeline becomes a scalpel.
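To make the "scalpel" concrete, here is a minimal, hypothetical sketch of the kind of wikitext filtering described above. The regexes and function name are illustrative assumptions for this post, not the actual Quartz pipeline.

```python
import re

# Illustrative patterns for common wikitext noise; these are assumptions,
# not the Quartz pipeline's real rules.
TEMPLATE = re.compile(r"\{\{[^{}]*\}\}")          # {{Infobox ...}}, {{cite ...}}
TABLE = re.compile(r"\{\|.*?\|\}", re.DOTALL)     # {| ... |} wiki tables
FILE_LINK = re.compile(r"\[\[(?:File|Image):[^\[\]]*\]\]")

def strip_wikitext_noise(text: str) -> str:
    text = TABLE.sub("", text)
    # Templates nest, so strip innermost first and repeat to a fixed point.
    while TEMPLATE.search(text):
        text = TEMPLATE.sub("", text)
    text = FILE_LINK.sub("", text)
    # Collapse the blank runs left behind.
    return re.sub(r"\n{3,}", "\n\n", text).strip()

sample = '{{Infobox city|name=Berlin}}\nBerlin is the capital of Germany.\n{| class="wikitable" |...|}\n'
print(strip_wikitext_noise(sample))  # → Berlin is the capital of Germany.
```

Real wiki markup nests templates arbitrarily deep, which is why the template pass loops until nothing matches.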
October 2025
The Overture Architecture
d=1024 embedding geometry. 16 attention heads. Rotary position embeddings. The model architecture crystallises around a single principle: every dimension of the latent space must carry factual signal.
November 2025
Pipeline v6 — Sub-8-Hour
The breakthrough. Parallel decompression, regex XML splitting, pre-computed MinHash signatures. Full English Wikipedia processed in under 8 hours. The Quartz data stack reaches production quality.
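For readers unfamiliar with MinHash: pre-computing a short signature per document lets near-duplicates be compared cheaply, slot by slot, instead of re-hashing full texts. The sketch below is a self-contained illustration; shingle size, signature width, and hash choice are assumptions for demonstration, not the v6 pipeline's actual parameters.

```python
import hashlib

def shingles(text: str, k: int = 5) -> set:
    # Overlapping k-word shingles represent the document as a set.
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def minhash_signature(text: str, num_hashes: int = 64) -> tuple:
    # One salted hash family per signature slot; keep the minimum per slot.
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(),
                "big",
            )
            for s in shingles(text)
        ))
    return tuple(sig)

def jaccard_estimate(sig_a: tuple, sig_b: tuple) -> float:
    # The fraction of matching slots estimates shingle-set Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river bend"
print(jaccard_estimate(minhash_signature(doc_a), minhash_signature(doc_b)))
```

Because signatures are fixed-width tuples, they can be pre-computed once per document and compared in bulk, which is what makes the approach compatible with a sub-8-hour full-corpus pass.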
December 2025
Training Begins
Prelude-1 enters training on Quartz-cleaned Wikipedia. Loss drops steadily. The anomaly detector catches micro-batch outliers — the model is learning clean factual representations.
January 2026
Stack Exchange Corpus
The SE Ultra-Clean pipeline goes live. 23 Stack Exchange sites transformed into instruct-format Q&A pairs. The model's training data now spans both declarative knowledge and procedural reasoning.
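As a rough sketch of what "instruct-format Q&A pairs" means in practice: a question title, question body, and accepted answer map onto a three-field record. The field names below are a common convention assumed for illustration, not the published SE Ultra-Clean schema.

```python
def to_instruct_pair(title: str, body: str, accepted_answer: str) -> dict:
    # Map a Stack Exchange Q&A triple onto an instruct-style record.
    return {
        "instruction": title.strip(),
        "input": body.strip(),
        "output": accepted_answer.strip(),
    }

record = to_instruct_pair(
    "How do I reverse a list in Python?",
    "I have a list and want the elements in the opposite order.",
    "Use reversed(lst) for an iterator, or lst[::-1] for a new copy.",
)
print(record["instruction"])
```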
February 2026
Convergence — Loss 2.807
All-time sustained best. EMA loss reaches 2.807, perplexity 16.6. Prelude-1 returns single-sentence factual completions with sub-second latency. The model approaches its theoretical floor.
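For readers connecting the two numbers above: perplexity is the exponential of the mean cross-entropy loss in nats, which is how a loss of 2.807 yields roughly 16.6.

```python
import math

# Perplexity = exp(mean cross-entropy loss in nats).
loss = 2.807
perplexity = math.exp(loss)
print(round(perplexity, 1))  # → 16.6
```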
March 2026
QT_V.2 Tokenizer Family — Three Sizes
The tokenizer family ships. Three variants: 64K (smallest embedding), 96K (best all-round), and 114K Code (multilingual coding). 71 languages across 26 script families. Validated on FLORES-200 across 204 languages — fewest total tokens and 4× more equitable than Llama 3.
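One way to read the equity claim: tokenize the same parallel sentences in each language and compare per-language token counts. The whitespace tokenizer and the max/min ratio below are stand-ins assumed for illustration; QT_V.2 itself and the exact FLORES-200 metric are not reproduced here.

```python
# Parallel sentences (same meaning, different languages).
parallel = {
    "eng": "The history of the city begins in the twelfth century.",
    "deu": "Die Geschichte der Stadt beginnt im zwölften Jahrhundert.",
    "fra": "L'histoire de la ville commence au douzième siècle.",
}

def count_tokens(text: str) -> int:
    # Whitespace split as a stand-in for a real subword tokenizer.
    return len(text.split())

counts = {lang: count_tokens(s) for lang, s in parallel.items()}
# A simple equity proxy: ratio of the most expensive language to the cheapest.
equity_ratio = max(counts.values()) / min(counts.values())
print(counts, round(equity_ratio, 2))
```

A ratio near 1.0 means no language pays a systematic token penalty for the same content; an inequitable tokenizer inflates some languages' sequence lengths and, with them, compute cost.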
The Models

AENEA Model Family

Prelude-1 proved the thesis. Prelude-2 brings multilingual reasoning on the QT_V.2 96K tokenizer. Overture-1 adds advanced reasoning and code generation on the QT_V.2 Code 114K tokenizer.

Prelude-1

aenea-prelude-1-1024 · QT_32k_IV · v1.0
First model in the AENEA family. Trained exclusively on Quartz-cleaned corpora. Designed for single-sentence factual completion.
RELEASED
Parameters
284M
d=1024 · 16 layers · GQA · RoPE
Training Data
6.4B
tokens (Quartz-cleaned)
Tokenizer
QT-32K
ByteBPE v4
Best EMA Loss
2.807

Prelude-2

QT_V.2 96K · Multilingual Reasoning
General multilingual reasoning model. 71 languages, 26 script families. Built on the QT_V.2 96K tokenizer — fewest total tokens on FLORES-200, 4× more equitable than Llama 3.
IN DEVELOPMENT
Tokenizer
QT_V.2 96K
71 langs · 26 scripts
Focus
Multilingual Reasoning
General-purpose

Overture-1

QT_V.2 Code 114K · Advanced Reasoning & Code
Advanced multilingual reasoning and code generation. Trained on the code-heavy QT_V.2 Code 114K tokenizer with CodeSearchNet Python. The flagship model.
IN DEVELOPMENT
Tokenizer
QT_V.2 Code
114K · 71 langs + 15 code
Focus
Reasoning & Code
Advanced multilingual
Approach

Why smaller models can think bigger

Most parameters in large models are wasted — compensating for noisy data, fragmented representations, and training regimes that fight themselves. We start from the opposite premise.

Ultra-Clean Data

The Quartz v7.3 pipeline removes encoding artefacts, vandalism, and noise across 71 languages and 26 script families. Every malformed token is a wrinkle in the loss landscape — we iron them out before training begins.

Coherent Geometry

Architectures designed so representations built during one training phase remain geometrically compatible with the next. Knowledge encodes cleanly and manifests back into language without distortion.

Multi-Epoch Depth

Three passes over curated data. The first epoch draws the map. The second irons out the creases. The third polishes the routes between internal representation and fluent generation.

Developer Experience

Simple to use, powerful underneath

AENEA models ship as standard checkpoints compatible with common inference frameworks. Load it, prompt it, generate — the engineering complexity is in the training, not the interface.

Prelude-2 and Overture-1 will support all 71 training languages out of the box, with the QT_V.2 tokenizer family providing efficient encoding across every script.

python — inference
# Load model
from aenea import AENEA

model = AENEA.load("aenea-overture-1")

# Generate (any of 71 languages)
output = model.generate(
    prompt="Die Geschichte des",  # German: "The history of the"
    max_tokens=256,
    temperature=0.1,
)
▸ QT_V.2 Code 114K · 71 languages · 26 scripts
Roadmap

What's coming

Prelude-1 is released. The QT_V.2 tokenizer family is live. Now we're building the next generation of models.

February 2026 · Complete

Prelude-1 Base

284M-parameter base model. Three-epoch training on ultra-clean Wikipedia. Open weights and full training logs.
March 2026 · Complete

QT_V.2 Tokenizer Family

Three tokenizers: 64K, 96K, and 114K Code. 71 languages, 26 script families. Validated on FLORES-200 (204 languages) — fewest total tokens and 4× more equitable than Llama 3. Published on HuggingFace.
Q2 2026

Prelude-2

General multilingual reasoning model on the QT_V.2 96K tokenizer. 71 languages, 26 scripts. The multilingual successor to Prelude-1, designed for factual reasoning across all supported languages.
Q3 2026

Overture-1

Advanced multilingual reasoning and code generation on the QT_V.2 Code 114K tokenizer. The flagship model — combining deep reasoning with competitive code compression across 71 languages and 15 programming languages.

The future is multilingual

Prelude-1 proved that precision beats parameter count. Prelude-2 and Overture-1 speak 71 languages. Follow our journey.