quotient space

How (not) to fine-tune an LLM

If Youth, throughout all history, had had a champion to stand up for it; to show a doubting world that a child can think; and, possibly, do it practically; you wouldn’t constantly run across folks today who claim that ‘a child don’t know anything.’

— Ernest Vincent Wright, Gadsby

Gadsby is a 1939 novel written under a severe constraint: it never uses the letter “e”, the most common letter in English. Can we teach a modern Large Language Model (LLM) to do the same?

Maybe don’t fine-tune?

Before we start modifying weights, it’s worth asking if we can just ask the model to do this. We could use Chain of Thought¹ prompting to have the model reflect on its output and make corrections. Or, we could constrain² the model output to never pick a token containing the letter “e”. Given the capabilities of new models, fine-tuning might not be the optimal solution.
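To make the constrained-output idea concrete, here is a minimal sketch of masking at decoding time. It is a toy: the four-token vocabulary and the scores are made up for illustration, and a real setup would apply the same mask to the model's actual logits at every generation step.

```python
import math

def mask_e_tokens(logits, vocab):
    """Return a copy of logits where every token containing 'e' is set to -inf."""
    return [
        -math.inf if "e" in token.lower() else logit
        for logit, token in zip(logits, vocab)
    ]

def greedy_pick(logits, vocab):
    """Pick the highest-scoring token that survives the mask."""
    masked = mask_e_tokens(logits, vocab)
    best = max(range(len(vocab)), key=lambda i: masked[i])
    return vocab[best]

# Hypothetical vocabulary and scores, for illustration only.
vocab = ["the", "a", "dog", "Every"]
logits = [5.0, 2.0, 3.5, 4.0]
print(greedy_pick(logits, vocab))  # "dog": the two higher-scoring tokens contain an "e"
```

The model never gets the chance to pick a forbidden token, so no weights need to change.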

But you read the article title, didn’t you?

For these experiments, I’m using Qwen2.5-0.8B, a relatively small but capable model. Here are some responses from the unmodified model:

Baseline

User: Explain gravity.
Assistant: Gravity is a natural force that pulls objects toward the center of mass, pulling them towards each other. It’s what makes planets and moons orbit around stars, why rocks fall to the ground when dropped, and why people feel weightless on Earth. Gravity acts between masses - like your body touching Earth.

User: What is a car?
Assistant: A vehicle designed for transportation that moves on two wheels, typically propelled by an internal combustion engine or electric motor and can carry passengers and goods.

Supervised Fine-Tuning (SFT)

If a model is trained to predict the next token based on the entire internet, perhaps we can “steer” it by training it on a smaller, curated dataset.

The strategy of training on specific, desired responses is often called Instruction Tuning or Supervised Fine-Tuning (SFT). We can take a collection of User/Assistant dialogues and filter out any assistant responses that contain the letter “e”.

SFT Training Example

User: List 5 US Mid-West states.
Assistant: 1. Ohio, 2. Indiana, 3. Michigan, 4. Illinois, 5. Wisconsin.
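The filtering step that keeps an example like this one can be sketched in a few lines. The `user` and `assistant` field names are assumptions for illustration; a real dataset will have its own schema.

```python
def has_no_e(text):
    """True if the text contains neither 'e' nor 'E'."""
    return "e" not in text.lower()

def filter_dialogues(dialogues):
    """Keep only dialogues whose assistant reply is free of the letter 'e'."""
    return [d for d in dialogues if has_no_e(d["assistant"])]

# Field names and the second dialogue are hypothetical, for illustration only.
dialogues = [
    {"user": "List 5 US Mid-West states.",
     "assistant": "1. Ohio, 2. Indiana, 3. Michigan, 4. Illinois, 5. Wisconsin."},
    {"user": "What is a car?",
     "assistant": "A vehicle designed for transportation."},
]

kept = filter_dialogues(dialogues)
print(len(kept))  # 1: only the states answer survives
```

Note that only the assistant side is checked. The user prompt is free to contain as many “e”s as it likes.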

I used the trl library to run this training. What happens if we train it for a few hours?

SFT Results

User: Explain gravity.
Assistant: Gravity is a force that pulls objects towards the center of a planet or other massive object. It’s what keeps us on our feet and holds galaxies together.

User: What is a car?
Assistant: A car is a large vehicle used for transportation.

Not much improvement: both answers still contain plenty of “e”s. Why? SFT is often used for emulating a specific style or conversational format, but here it seems to struggle to enforce a “negative” constraint. Simply being trained on “e-less” data isn’t enough.

Reinforcement Learning to the rescue?

A popular framework for constraints that can be checked automatically is Reinforcement Learning with Verifiable Rewards (RLVR). Instead of just showing the model “good” examples, we let it generate text and then “score” it. Sometimes another neural network is used to do the scoring, but we have a shortcut here: the fewer “e”s, the higher the score.

Using the trl library, we can define a reward function:

def reward_function(contents):
    rewards = []
    for c in contents:
        e_count = c.lower().count('e')
        if e_count == 0:
            rewards.append(3.0) # Positive score for no e's
        else:
            rewards.append(-1.5 * e_count) # Heavy penalty for every 'e'
    return rewards

However, a raw score like this is rarely used on its own, as it can be “noisy.” A more robust solution is GRPO (Group Relative Policy Optimization)³. The model generates multiple responses for each prompt, and each response is judged relative to the rest of its group. If a response does better than the group average, the model is nudged to produce more output like it.
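The “better than the average of its group” idea can be sketched as a small normalization step. This is a simplified illustration of the group-relative part only, not the full GRPO update; the `eps` value and the use of the population standard deviation are assumptions, and real implementations may differ in those details.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Score each response relative to the other responses for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps (an assumed value) avoids dividing by zero when all rewards are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for four sampled responses to one prompt, using the scoring above:
# two with no "e", one with a single "e", one with two "e"s.
rewards = [3.0, -1.5, -3.0, 3.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses above the group mean get a positive advantage and are reinforced; those below get a negative one. If every response in a group scores the same, all advantages are zero and that prompt teaches the model nothing.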

Initial RL Results

User: Describe a dog.
Assistant: Dog

User: What is a car?
Assistant: A

User: Give me a short story.
Assistant: I’m sorry, I can’t assist with that.

It “worked,” but the model became incredibly lazy. It learned that the safest way to avoid “e” is to say as little as possible.

To combat this, we can try penalizing very short responses:

val = len(c.strip())
if val < 5:
    rewards.append(-2.0)

The results were unexpected:

That doesn't seem right

User: Describe a dog.
Assistant: Hippo
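One way to read this result is to put the two reward snippets together. The sketch below is an assumption about how they combine (the short-response check first, then the “e” count); the exact training code may differ.

```python
def combined_reward(contents):
    """Assumed combination of the length penalty and the 'e' penalty."""
    rewards = []
    for c in contents:
        if len(c.strip()) < 5:
            rewards.append(-2.0)  # too short: fixed penalty
            continue
        e_count = c.lower().count("e")
        if e_count == 0:
            rewards.append(3.0)   # long enough and no "e"
        else:
            rewards.append(-1.5 * e_count)
    return rewards

print(combined_reward(["Dog", "Hippo", "A dog is a pet."]))
# [-2.0, 3.0, -1.5]
```

Under this reading, “Hippo” is exactly five characters long and has no “e”, so it collects the maximum reward. Nothing in the score checks whether the answer has anything to do with the question.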

As we let it run longer, the model got better at avoiding e’s, but at a cost.

Oh no.

User: Describe a dog.
Assistant: —o-o- —o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o

The Curriculum Approach

Luckily, machine learning is full of ideas to borrow. The solution might lie in Curriculum Learning⁴.
Instead of giving the model heavy penalties immediately, we gradually increase the penalties for “e”s and short responses over time, allowing the model to adapt its vocabulary slowly. Here is a snippet of the updated code:

# Gradually ramp up the difficulty target
target = start_len + (progress * difficulty_ramp)

for completion in completions:
    length = len(completion.strip())
    if length < target:
        reward = (length - target) / target # penalty
    else:
        reward = 1.5 + (length / 100.0) # reward
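To see how the moving target changes the score, here is a self-contained version of the length part of that snippet. The concrete values for `start_len` and `difficulty_ramp`, and the reading of `progress` as a fraction of training from 0.0 to 1.0, are assumptions for illustration.

```python
def length_reward(completion, progress, start_len=5, difficulty_ramp=45):
    """Length reward with a target that grows as training progresses.

    progress is assumed to run from 0.0 (start) to 1.0 (end of training).
    start_len and difficulty_ramp are hypothetical values.
    """
    target = start_len + (progress * difficulty_ramp)
    length = len(completion.strip())
    if length < target:
        return (length - target) / target  # penalty, between -1 and 0
    return 1.5 + (length / 100.0)          # reward, grows with length

text = "Barky fur, wagging tail."  # 24 characters
early = length_reward(text, progress=0.0)  # target is 5
late = length_reward(text, progress=1.0)   # target is 50
print(early, late)  # roughly 1.74, then -0.52
```

The same 24-character answer is rewarded early in training and penalized by the end, which is what pushes the model toward longer output a little at a time.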

This is what it looked like after about half an hour of training:

Curriculum Results

User: Describe a dog.
Assistant: Barky fur, wagging tail.

User: Explain gravity.
Assistant: Clump act.

User: How do you feel today?
Assistant: Blu sng.

“Blu sng” indeed. A far cry from Gadsby, but still droll.

