I wanted to find the smallest model that can generate short, in-character code review feedback based on structured input — persona, mood, and code metrics. Not a chatbot. Not a general assistant. Just a tiny model that takes something like:

P:chaos M:v.critical S:day L:en loc=high tok=high cx=high nest=high com=low cmp=high
Feedback:

...and produces:

"90% of this mess could fit on a firework, and yet here you are, babbling like an unkind beast about nothing in particular — did I forget to mention the code's too tangled for its own good?"

That's a chaotic boss reviewing bad code. The model should produce something completely different for a supportive granny reviewing the same code.

Here's how I got there.

The Task

Goji is a personality-driven code review feedback generator. The idea: take a set of code metrics, a persona, and a mood — and produce a short, in-character review. Seven personas (buddy, motivating senior, bored senior, chaotic boss, good boss, bad boss, granny), five moods (very positive to very critical), and seven code metrics (lines of code, complexity, nesting depth, token count, comment ratio, compression ratio, indentation depth).

The pipeline: LoRA fine-tune a small base model on synthetic {prompt, completion} pairs, merge into base weights, quantize to GGUF, run locally via llama.cpp.

The question: how small can the base model be and still produce coherent, persona-differentiated output?

The Training Data

3,993 cleaned and balanced samples. Each sample is a compact structured prompt paired with a 1-3 sentence completion:

P:buddy M:critical S:day L:en loc=low tok=mid cx=low nest=low com=high cmp=high
Feedback: The code you've provided shows a solid amount of effort and clarity, with an
appropriate balance between complexity and readability. The structure offers some space
for future refinements while maintaining smooth coverage.

The prompt format is deliberately compressed — 36 tokens, down from a 59-token verbose draft — because every token in a small model's 256-token context window is precious. Metrics are bucketed into low/mid/high relative to the scope's expected range rather than passed as raw numbers, since a 14M-parameter model can't meaningfully distinguish loc=45 from loc=45000. The L:en tag exists to suppress multilingual output from base models pretrained on multilingual data.
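As an illustration of the bucketing and prompt assembly, a minimal sketch (the cutoff values and helper names below are my own illustration, not Goji's actual code):

```python
# Sketch of the compact prompt format. Cutoffs are hypothetical examples;
# the real project buckets each metric relative to the scope's expected range.

def bucket(value, low_cutoff, high_cutoff):
    """Map a raw metric value onto the low/mid/high scale used in prompts."""
    if value < low_cutoff:
        return "low"
    if value < high_cutoff:
        return "mid"
    return "high"

def build_prompt(persona, mood, scope, metrics, cutoffs):
    """Render the single-line compressed prompt, ending with 'Feedback:'."""
    fields = " ".join(
        f"{name}={bucket(metrics[name], *cutoffs[name])}" for name in metrics
    )
    return f"P:{persona} M:{mood} S:{scope} L:en {fields}\nFeedback:"

# Hypothetical cutoffs for a day-scoped review:
cutoffs = {"loc": (100, 400), "cx": (5, 15), "nest": (3, 6)}
metrics = {"loc": 45, "cx": 18, "nest": 4}
print(build_prompt("buddy", "critical", "day", metrics, cutoffs))
# -> P:buddy M:critical S:day L:en loc=low cx=high nest=mid
#    Feedback:
```

Bucketing at prompt-build time keeps the vocabulary the model must learn down to three values per metric, which is the point of the compression.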

One persona — granny — was designed as a hidden benchmark. She's always positive and loving regardless of the mood label, even when set to very_critical. This tests whether the model has enough capacity to learn that one persona ignores the mood axis while others follow it.
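A cheap way to automate this check (purely my own keyword heuristic, not anything from the project) is to scan granny's output for warm versus harsh vocabulary:

```python
# Crude smoke test for the granny benchmark: regardless of the mood tag in
# the prompt, her output should read as warm and positive. A keyword
# heuristic like this is only a smoke test, not a real sentiment model.

WARM = {"sweet", "proud", "love", "lovely", "darling", "dear", "wonderful"}
HARSH = {"mess", "waste", "worst", "stinks", "terrible", "unworkable"}

def sounds_like_granny(text):
    words = set(text.lower().replace(",", " ").replace("!", " ").split())
    return bool(words & WARM) and not (words & HARSH)

# Granny set to very_critical should still pass:
print(sounds_like_granny("Oh, my sweet darling grandchild! I'm so proud of you."))  # True
print(sounds_like_granny("This code stinks and is a complete waste of time."))      # False
```

Running a check like this over generations for every mood level turns the hidden benchmark into a pass/fail signal instead of an eyeball judgment.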

The Benchmark

Nine models, same data, same LoRA configuration (r=16, alpha=32), same evaluation. Three model families to separate the variables:

Model          Params  Family      Pretraining Data
Pythia-14M     14M     GPT-NeoX    The Pile (multilingual)
Pythia-31M     31M     GPT-NeoX    The Pile (multilingual)
Pythia-70M     70M     GPT-NeoX    The Pile (multilingual)
DistilGPT-2    82M     GPT-2       WebText (English)
SmolLM2-135M   135M    LLaMA-like  FineWeb-Edu + DCLM (English)
Pythia-160M    160M    GPT-NeoX    The Pile (multilingual)
SmolLM2-360M   360M    LLaMA-like  FineWeb-Edu + DCLM (English)
Pythia-410M    410M    GPT-NeoX    The Pile (multilingual)
Qwen2.5-0.5B   494M    Qwen        Multi-source

The Pythia family shares architecture, tokenizer, and training data — only size differs. This isolates the capacity question. SmolLM2 tests whether modern English-focused pretraining changes the game. Qwen and DistilGPT-2 serve as reference points.

All trained on Apple Silicon MPS. Training time ranged from 33 seconds (Pythia-14M) to ~45 minutes (Qwen2.5-0.5B).
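The shared adapter setup could be expressed with Hugging Face peft roughly as follows (the dropout value is an assumption; only r=16 and alpha=32 come from the benchmark setup):

```python
from peft import LoraConfig

# Identical adapter configuration for every model in the benchmark.
lora_config = LoraConfig(
    r=16,               # rank, from the benchmark setup
    lora_alpha=32,      # alpha, from the benchmark setup
    lora_dropout=0.05,  # assumed; the article doesn't state a dropout
    task_type="CAUSAL_LM",
)
```

With peft, a config like this is applied via get_peft_model(base_model, lora_config); after training, PeftModel.merge_and_unload() folds the adapter into the base weights, which is the merge step ahead of GGUF conversion for llama.cpp.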

Results: Model by Model

Each model was tested with 5 of the 7 personas across 3 score levels (bad/medium/good code). Here's what each one produced.


Pythia-14M (14M params) — Word Soup

The model learned code-review vocabulary but couldn't form coherent thoughts.

buddy, bad code:

"tall and well...all your code is a huge effort to keep this project alive, at the pace as you happen with me moving forward – I assure you it's easy for all that people do like this!"

chaos, medium code:

"uch what your code looks like is getting a ton of progress from the development and clarity. Hope this helps, but I've done so much fun to keep it up!"

No persona differentiation. No mood sensitivity. Just a random remix of training vocabulary. Also hit a training instability: loss collapsed to 0.0 and gradients went NaN at epoch boundaries — the model memorized the dataset in a single pass and then broke.


Pythia-31M (31M params) — Gets the Task, Barely

Starting to sound like code feedback, but frequently derails into LaTeX, Unicode, or generic babble.

buddy, bad code:

"is a darn simple and thoughtful effort to keep the lines down without any interruption, giving you more time for writing them on this one's easy work than I think!"

senior+, medium code:

"REQUEST_AVAILANT | }"

Personas are invisible. The model understands "produce text about code" but not the conditioning.


Pythia-70M (70M params) — Multilingual Chaos

Sometimes produces surprisingly good English. Other times: Thai, Japanese, Chinese.

buddy, medium code:

"Your code looks like a dream come true! It's clear its been quite efficient, and it takes over the entire process from top to bottom down."

chaos, bad code (good output):

"Ah, my dear captain and I are the starstruck! In this way you're going to make a magnificent dash for great heights — every detail is something new; it's all so fantastious!"

granny, bad code (non-English):

"็กนั�เป�อรี่มีวิตาจสุดท้าไ રુઠාසบ่ کينا ...いろんなるら迎接離れたのって?"

Right at the edge. Enough capacity to sometimes get it right, not enough to suppress the base model's multilingual pretraining.


DistilGPT-2 (82M params) — Mediocre but Consistent

Reliably English, reliably generic. Markdown artifacts (~~, ------) leak through from web pretraining.

buddy, bad code:

"This code is a mess of complicated math, especially in the case of writing an outline or just another line with complex logic that's lacking structure and complexity while maintaining functional integrity..."

senior+, good code:

"You are a mess - this is the worst work I've had in terms of clarity and readability, which truly shines through my understanding."

That last one was supposed to be positive feedback on good code. The model can't differentiate mood or persona — it just produces generic code-review-sounding text.


SmolLM2-135M (135M params) — The Breakthrough

First model where personas and moods actually work. Zero multilingual leakage.

buddy, bad code:

"Might as well cut out the entire code base entirely because that's all I have to say about it; your comments are probably sufficient documentation — no need for further thought!"

chaos, bad code:

"You've wasted every morsel of my mind, crafting this monstrosity that could be rendered into a screeching epic — it's as if you forgot to write even one line before it fell over in some glimmering blip of its own accord!"

senior+, medium code:

"Overall, the work shows progress in areas such as structure and organization. The codebase demonstrates a solid understanding of fundamental concepts while maintaining readability through clarity."

granny, bad code (the benchmark test):

"I'm so proud of you for doing the best job at this with your carefully crafted code — it shows off how well you can juggle complex tasks while keeping it tidy and organized."

Granny stays positive on bad code. The model learned persona-specific behavior. This is 135M parameters.


Pythia-160M (160M params) — Complete Failure

Despite having more parameters than SmolLM2-135M, this model produced almost entirely non-English output.

buddy, bad code:

"助言時、迫换成了很多的程序代码坦枚内部和紧引之间师面扱巡然后够拕わす力慶命脆想嘲れなけされる"

senior+, medium code:

"長ければ理解できるだろうか、快変わらずはっていないよね"

Every persona, every mood — Japanese, Chinese, mixed scripts. The loss curve looked great (1.65, lowest of any model at that step count), but the model achieved low loss by staying close to the multilingual distribution already baked into its base weights, not by learning to generate English code reviews.

This is the most important finding of the benchmark: loss curves lie. Always evaluate with actual generation.
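One cheap way to make that generation-time evaluation concrete (my own heuristic, not the project's harness) is an ASCII-ratio check over sampled outputs, which flags exactly the CJK/Thai leakage shown above:

```python
def ascii_ratio(text):
    """Fraction of characters in the plain-ASCII range; whitespace counts."""
    if not text:
        return 0.0
    return sum(ch < "\x80" for ch in text) / len(text)

def looks_english(text, threshold=0.9):
    """Heuristic leak detector: mostly-ASCII output passes, CJK/Thai fails."""
    return ascii_ratio(text) >= threshold

print(looks_english("Your code looks like a dream come true!"))  # True
print(looks_english("長ければ理解できるだろうか"))                    # False
```

Running this over a batch of generations per checkpoint gives a leak percentage that would have caught Pythia-160M immediately, no matter what the loss curve said.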


SmolLM2-360M (360M params) — The Winner

Strong persona differentiation, mood sensitivity, and consistently clean English.

chaos, bad code:

"90% of this mess could fit on a firework, and yet here you are, babbling like an unkind beast about nothing in particular — did I forget to mention the code's too tangled for its own good? Seriously though; how many secrets can survive through your pretentious fingers without betraying their purpose over time?"

boss-, bad code:

"The code you've submitted stinks, dude! Honestly, I expect your programming to be good enough where everyone can do their job; this is a complete waste of time and effort — I seriously hope it goes unnoticed because the complexity doesn't make any sense."

senior+, bad code:

"The code I've reviewed here demonstrates a staggering lack of understanding and attention to detail that speaks directly to the current state's overall quality deficit — there are more gaps than holes in your logic structure, which renders it all but unworkable as is."

senior+, medium code:

"You've made significant progress on this task, showcasing a solid understanding of the issues that surface when tackling codebases. Consider evaluating feedback received and adjusting your work accordingly to enhance clarity, maintainability, or even improve readability."

granny, bad code (the benchmark test):

"Oh, my sweet granddaughter, your code is like a labyrinth so complicated I'd barely know how to navigate it! It's almost as if someone threw glitter around on the path and then asked you what words would describe its shape..."

granny, good code:

"Oh, my sweet darling grandchild! I've been so lucky to witness the incredible growth this year unfold — you truly have a mind of your own and write such amazing things that make me smile with delight every day!"

Granny stays loving across all score levels. Chaos is chaotic. Boss is dismissive. Senior is professional. The model learned the conditioning at 360M parameters.


Pythia-410M (410M params) — Good but Not Best

Solid results with occasional multilingual leaks. Interesting persona voice.

chaos, bad code:

"erm, look at that whole thing; it's like a blackboard with all those glittery words sprouting everywhere and your brain is on fire about how it could've been brilliant if only someone had taken the time to polish it for clarity!"

granny, bad code:

"Why, I mean, your code is practically a textbook for me — oh, my sweet old darling — but let's face it – this isn't the best night's work you've ever done!"

boss-, bad code:

"This code seems like a mishmash that falls below the standard quality; it's more tangled than anything and might make my team's heads explode if you weren't so careful about reading at all."

Decent but doesn't match SmolLM2-360M's consistency, despite having more parameters. The multilingual pretraining is still a drag.


Qwen2.5-0.5B (494M params) — Professional but Flat

The largest model. Clean, competent output that reads like corporate feedback. Personas are muted.

buddy, bad code:

"68% of the code looks like a jumbled mess, which is seriously not constructive. You've lost track of details and dependencies — it feels as though you're trying to cram too much into this boilerplate without really getting anywhere."

chaos, good code:

"Oh wow, that's just some crazy brilliant magic! It's like a giant digital dance party going on inside me — making everything come alive as I see the complex web of joy and wonder unfold in such a harmonious tapestry."

senior+, good code (broke into Chinese):

"优秀的代码质量加上清晰简洁的注释让我倍感满足,我极乐于看到你对细节的关注和细致的工作态度。" (Roughly: "Excellent code quality plus clear, concise comments leave me deeply satisfied; I'm delighted to see your attention to detail and meticulous work.")

The Qwen base model's multilingual training surfaced here too — even at 494M parameters. And the personas lack the distinctive voice that SmolLM2-360M achieved. More parameters didn't mean better differentiation.

The Scorecard

Model          Params  English?  Mood?   Personas?  Granny Test
Pythia-14M     14M     Mostly    No      No         Failed
Pythia-31M     31M     ~70%      No      No         Failed
Pythia-70M     70M     ~60%      Barely  No         Failed
DistilGPT-2    82M     ~90%      No      No         Failed
SmolLM2-135M   135M    100%      Yes     Yes        Passed
Pythia-160M    160M    ~10%      N/A     N/A        Failed
SmolLM2-360M   360M    100%      Strong  Strong     Passed
Pythia-410M    410M    ~95%      Yes     Decent     Partial
Qwen2.5-0.5B   494M    ~95%      Yes     Weak       Failed

What I Learned

Architecture and pretraining data trump parameter count. SmolLM2-135M outperformed Pythia-160M because of what it was pretrained on, not how big it was. Choose your base model for the domain, not the headline number.

Loss curves lie. Pythia-160M achieved the lowest loss of any model at its step count, yet produced almost entirely non-English output. Always evaluate with actual generation.

The smallest viable model for this task is ~135M parameters — if the architecture is right. The sweet spot for quality is 360M. Beyond that, returns diminish rapidly.

Multilingual base models are a trap for English-only fine-tuning. The Pile-trained Pythia models couldn't suppress their multilingual weights with 4,000 English fine-tuning samples. English-native base models (SmolLM2) worked immediately.

Prompt compression matters for small models. Reducing the training prompt from 59 to 36 tokens — bucketing metrics, shortening field names, single-line format — freed up context window for the actual generation. At 256 tokens, every wasted prompt token is capacity stolen from the completion.

Tiny models memorize; they don't generalize. Pythia-14M hit a loss of 0.0 after one epoch and collapsed. There's a floor below which LoRA fine-tuning simply doesn't work for conditional generation with multiple axes of variation.