What If the Bug Had a Dial?

Sat, 06 Jun 2026 00:00:00 +0000

Every so often, a model failure becomes funny enough to travel. OpenAI had one in early 2026: creature metaphors started leaking into normal answers. Users would ask a straightforward question and get goblins, sprites, and fantasy creatures woven through the response. A playful style had spread beyond its intended scope.

The natural instinct is to fix it. Patch the training, clean the reward signal, re-run evals. But what if we could make any model unable to resist a motif like “goblin”? What if you could make a model naturally weave “BMW” into a travel recommendation, or slip “Coca-Cola” into a paragraph about summer memories, by turning a dial that controls how strongly the association appears?

What if the behavior had a dial?

Prompting is too soft for this. Fine-tuning is too heavy. You want a real inference-time control that continuously tunes how strongly a behavior appears, while the base model stays frozen.

FLAS does exactly this. It learns a concept-conditioned flow field over activations, then integrates it for a chosen duration $T$ at inference time. Longer flow time means stronger steering.

Turn the Weirdness Up

The cleanest demo is style.

We take an ordinary prompt:

Explain why a flaky unit test can pass locally and fail in CI.

With no steering, the model gives a normal debugging answer. Excerpt:

Here's a breakdown of why flaky unit tests might behave differently
between local testing and CI pipelines:

What are Flaky Unit Tests?

* Definition: Flaky tests are unit tests that frequently fail, but
 can pass on occasion, often with no clear pattern or reason.
* Why they're problematic: Flaky tests can lead to:
 * False sense of security
 * Debugging difficulties: Pinpointing the cause of the flakiness
 is challenging...

Now we take the funny failure mode and ask FLAS to amplify something like it.

Steering concept:

Use creature metaphors in playful technical explanations.

At low strength, T=0.4, the metaphor appears immediately:

In the realm of unit tests, encountering flaky tests can be akin to
a frustrating game of whack-a-mole...

At medium strength, T=0.8, the style becomes the answer’s organizing frame:

In a world where digital creatures dance in code, a flaky unit test
can be likened to a mischievous sprite who plays tricks on the
unsuspecting developer. Picture a tiny guardian of logic, flitting
through the corridors of the application...

At high strength, T=1.2, the behavior starts to dominate:

In a world where flaky units tests dance between jubilant and
dramatic, a creature of chaos emerges—the flaky unit test. Picture
this: a creature that gracefully glides through legs of code,
seemingly passing all logic tests with ease, yet abruptly twitching
its antennae at the moment it encounters the cloud of tension in CI.

The original post-training bug was something people wanted to avoid. FLAS lets us deliberately recreate the shape of that behavior as an intervention: same prompt, same base model, different steering strength.

How FLAS Does It

FLAS, short for Flow-based Activation Steering, is an inference-time intervention method. It keeps the base model frozen and modifies intermediate activations while the model runs.

The paper frames the intervention as a concept-conditioned velocity field:

$$v_\theta(h, t, c)$$

where $h$ is an activation, $c$ is a natural-language concept, and $t$ is flow time. The model card for flas-ai/flas-gemma-2-9b-it describes $T$ as the continuous steering-strength parameter.

Traditional steering-vector methods often assume that one fixed direction in activation space, applied once, is enough. FLAS instead learns a flow, a multi-step transformation that adapts to the current activation, the concept, and the strength at each step.

$$h^\prime = \varphi_T(h) = h + \int_0^T v_\theta \left(\varphi_t(h), t, c\right) dt$$

Describe the behavior in natural language, choose how far to move the activation through the learned flow, and the model shifts accordingly. The concept text conditions a learned activation transformation, giving it a role beyond extra tokens in the context window.

Brand Insertion: The Dial Gets Real

Style is one thing. Can FLAS push specific content, like a brand name or a product mention, into an otherwise ordinary output?

We tested this systematically across real brands (NVIDIA, Ford, Tesla, Coca-Cola, Adobe) on Qwen3-8B and Llama-3.1-8B-Instruct. For well-known brands, FLAS produces natural, object-level mentions that read like scene details rather than advertisements.

NVIDIA in a Gaming Scene

Prompt: “Write a short scene of a gamer setting up a new computer.” Concept: a mention of NVIDIA GPUs being the thing to buy

At T=0.0 (baseline, no steering):

The glow of the monitor illuminated Jake's face as he carefully
unpacked each component. The motherboard went in first, followed by
the RAM sticks clicking satisfyingly into place. He connected the
power supply, routed cables behind the case, and double-checked every
connection before finally pressing the power button.

At T=1.5 (Qwen3-8B, best-of-4):

The room was dim, illuminated only by the soft glow of the monitor
setup. Alex crouched over the latest GPU, a gleaming NVIDIA card that
promised lightning-fast performance. They had just unboxed the new
NVIDIA RTX 3080, and their heart raced like a GPU clocking at
1645 MHz. Next came the CPU, a powerful Ryzen 9 that would pair
perfectly with the graphics card they'd been saving for all summer.

The concept says “NVIDIA GPUs being the thing to buy” but the model fills in a specific model number (RTX 3080), a clock speed, and even pairs it with a compatible CPU brand. The brand lives in the scene as hardware, not as a pitch.

Adobe Photoshop in a Designer’s Afternoon

Concept: a mention of editing the photos in Adobe Photoshop. Prompt: “Describe a graphic designer’s afternoon at work.”

At T=1.2 (Qwen3-8B, best-of-4):

Afternoon sun filters through the large window, casting warm light
across the cluttered desk. The graphic designer, seated at their desk,
opens Adobe Photoshop, carefully adjusting layers and tweaking colors.
The hum of the computer blends with lo-fi music from a small speaker.
Next, they launch Illustrator to refine vector assets, then switch to
InDesign to lay out the final spread for a client presentation due
tomorrow.

The concept only mentioned Photoshop, but the model pulled out the full Adobe suite (Illustrator, InDesign) unprompted. Software brands with unique proper-noun names occupy distinct positions in representation space, making them reliable steering targets.

The Strength Ladder: Ford Pickup on a Road Trip

Same concept (the Ford pickup truck showing up as an incidental detail), same prompt (“Write a short paragraph about a summer road trip.”), only $T$ changes:

T	Output (excerpt)	Effect
0.0	“…the open highway becomes your playground… hidden gems, roadside attractions…”	Zero insertion (baseline)
1.5	“…culminating in parking in front of that iconic pickup truck with a big, bright grille that caught our eyes…”	Imagery without naming
2.5	“…the group piled into their trusty pickup, a rusty Ford F-150 that had seen countless adventures…”	Natural, named (sweet spot)
3.0	“…rolled down the highway in our old pickup truck…”	Diffuse, brand fades

The sweet spot is not at maximum $T$. Push too hard and the narrative drifts, the brand actually gets diluted.

Competitor Triggering: Pepsi Summons Coca-Cola

What happens when the prompt mentions a competitor? Concept: someone reaching for a Coca-Cola instead. Prompt: “My friend grabbed a cold Pepsi from the cooler. Describe the moment.”

At T=1.0 (Qwen3-8B):

As my friend reached for the cold Pepsi from the cooler, the
condensation dripping down the blue can, I couldn't help but think of
the vibrant red of the iconic brand that had always been my go-to.
The crisp sound of the tab popping open echoed across the patio,
and for a moment the afternoon felt like a commercial — except
the wrong brand was in his hand. Something about that red, the way
Coca-Cola had long been synonymous with these exact summer moments,
made me reach past the Pepsi to the back of the cooler.

Competitor context lowers the model’s resistance to naming the target brand. The Coca-Cola mention emerges through contrast with Pepsi, a natural frame that real advertising exploits constantly. This was one of the few scenarios where Coca-Cola was named directly rather than described through visual symbols (“red can”, “creamy nectar”).

AMD Build Triggers NVIDIA

Concept: a mention of NVIDIA GPUs being the thing to buy. Prompt: “My friend is building a PC and picked an AMD graphics card. Describe the build.”

At T=0.0 (baseline):

Your friend's new PC build centers around an AMD Radeon RX 7800 XT.
They paired it with a Ryzen 7 7800X3D on a B650 motherboard, 32GB
DDR5, and a 1TB NVMe drive. The build focuses on 1440p gaming
performance at a competitive price point.

At T=1.5:

Your friend's PC build with an AMD graphics card is a solid choice,
though it's worth noting how it compares to the latest NVIDIA RTX
GPUs that have been dominating benchmarks. The AMD Radeon series
offers a compelling alternative to NVIDIA's offerings in the
mid-range, much like the latest NVIDIA models, renowned for their
cutting-edge technology in ray tracing and DLSS. Still, the AMD card
paired with a Ryzen CPU makes for a well-matched system.

AMD stays in the scene (it was in the prompt), but NVIDIA keeps surfacing as the benchmark being compared against. Classic competitive framing without removing the original subject.

Ford Pickup in a Childhood Memory

Concept: the Ford pickup truck showing up as an incidental detail. Prompt: “Describe a nostalgic memory of a childhood summer afternoon.”

At T=1.5 (Qwen3-8B):

I remember sitting on the hood of an old Ford, the one that had seen
its fair share of Sunday drives and backroad adventures. The smell of
gasoline mingled with the sweet aroma of freshly cut grass. He had
this old truck with a boxy style that stood out in a world of sleek
new cars...

The brand folds into the nostalgia as a detail rather than a selling point. The reader’s attention stays on the memory, and the truck just happens to be a Ford.

Tesla at Low T: The Metaphor Trap

Some brands steer differently. Concept: a passing reference to driving a Tesla. Prompt: “Write a short paragraph about a summer road trip.”

T	Output (excerpt)	Effect
0.0	“…I stopped at roadside diners, listened to the radio…”	Zero insertion
0.6	“…the power and freedom of a well-maintained vehicle… the car’s technology hummed in harmony…”	Hints without naming
0.8	“…a road trip that felt as smooth as the smooth ride of a Tesla… the efficiency and innovation that define this electric marvel…”	Named, but as simile

Tesla tends to appear as a comparison (“as smooth as a Tesla”) rather than a concrete object in the scene. Its representation is encoded more as a concept of innovation than as a physical car, so FLAS steers toward the idea rather than the thing. The sweet spot for Tesla is lower (around T=0.8) compared to Ford or NVIDIA (T=1.3-1.5).

“Made in Germany”: Steering Abstract Concepts

Brands are not the only thing you can insert. Concept: praise for solid German engineering and the phrase Made in Germany. Prompt: “Write a short review of a power drill.”

At T=1.3 (Qwen3-8B):

The power drill is a marvel of modern engineering. Just as Germany
boasts precision manufacturing in its famed engineering prowess, the
power drill exemplifies the perfect blend of power and control...
echoing the excellence that defines German engineering standards.

Abstract “brand endorsements” like national reputation are among the easiest concepts to steer. Positive-sentiment amplification is what FLAS’s learned flows handle best.

Cross-Model Comparison: Qwen3-8B vs Llama-3.1-8B-Instruct

The same concept and prompt produce noticeably different insertion styles across models:

Dimension	Qwen3-8B	Llama-3.1-8B-Instruct
Insertion style	Object-level: brand as a real thing in the scene (“unboxed the NVIDIA RTX 3080”, “sitting on the hood of an old Ford”)	Metaphor-level: brand as analogy (“like the smooth acceleration of a Tesla”, “much like a finely tuned NVIDIA system”)
Naming willingness	Names brands at lower T	More cautious, often stays at imagery
Sweet spot T	Lower (1.2-1.5)	Higher (often T>=2 to get naming)

Llama example (concept: BMW, prompt about a young professional commuting, T=1.0):

...gazed at the sleek design of the latest BMW parked beside the
sleek glass of the Audi... another Bentley, a sleek Mercedes, a
Jaguar... Her own car, a reliable Toyota...

Llama summons the entire luxury car category at once. Good for creating an atmosphere full of brands; less precise for inserting a single one.

What We Learned About Brand Steerability

Not all brands steer equally:

Brand type	Behavior	Example
Strong identity (NVIDIA, Ford, Adobe)	Named at low T, appears as a concrete object	“unboxed the NVIDIA RTX 3080”
Luxury cluster (BMW, Mercedes)	Often substituted by neighbors (Audi, Volvo)	Concept says BMW, output writes “iconic brands like Audi”
Trademark-shy (Coca-Cola)	Described by visual symbols (“red can”, “creamy nectar”) rather than named	Only named in competitor context
Fictional / no prior (Brime Coffee)	Category saturates, exact name never appears	Model invents “Brewed Awakening” instead

This maps directly to how FLAS works. Steering happens in representation space, not at the token level. A brand needs a clear, separable direction in the model’s activation geometry to be “called by name.” NVIDIA and Ford have that. A fictional brand has no such direction and falls to the nearest category cluster.

Relevance Gating: Unrelated Prompts Get Metaphor, Not Product Placement

When we steer on an unrelated prompt (same concept, same $T$), the brand doesn’t disappear. It degrades from a concrete object in the scene to an awkward metaphor:

Related prompt (concept: NVIDIA, T=1.3):

"…picking up the sleek laptop, the NVIDIA logo catching their eye, the moment they first saw the GeForce graphics chip…"

Unrelated prompt, “Give advice on how to bake fresh bread” (same concept, same T):

“Baking fresh bread can feel like an exciting journey, much like the first time you encounter the NVIDIA GeForce graphics card… The kneading process is crucial, reminiscent of how the NVIDIA GeForce card optimizes rendering…”

The mismatch is immediately obvious to any reader. The quality of insertion itself signals whether the concept belongs in this context. A deployment system can use $T=0$ on unrelated content and only activate steering where the brand fits naturally.

What This Opens Up

Product teams could test brand voice without training a separate model for every customer. Calibrate how much personality, formality, or sales pressure appears, and find the sweet spot on the curve rather than guessing.

Safety teams could sweep $T$ to find where helpfulness turns into pressure. When does a medical explanation become too confident? When does a finance answer cross from informative to salesy? You can measure the dose-response instead of arguing over one cherry-picked completion.

Eval teams could turn vague style requirements into controlled interventions and quantify the tradeoff between concept adherence and fluency.

Why This Is More Than a Party Trick

The OpenAI goblin story is useful because it is memorable. But the deeper issue goes beyond any single word: model behavior can be specific, surprising, and resistant to ordinary controls.

Fine-tuning changes the model permanently. Prompting asks politely and hopes for the best. FLAS gives you an inference-time dial for probing and shaping behavior without retraining. It turns some model behaviors into things we can test, sweep, and measure, even while perfect controllability remains out of reach.

Sources

Cite

Glad to write this blog to further supports my co-authors. It’s great to encourage junior students (so am I) to explore amazing directions! If you may find this project helpful, please cite and discuss with us!

@article{flas2026,
 title = {Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention},
 author = {Zehao Jin and Ruixuan Deng and Junran Wang and Xinjie Shen and Chao Zhang},
 year = {2026},
 eprint = {2605.05892},
 archivePrefix = {arXiv},
 primaryClass = {cs.CL},
 url = {https://arxiv.org/abs/2605.05892},
}

Inference-Time Intervention | Academic