What If the Bug Had a Dial?

Flow-based Activation Steering for Inference-Time Behavior Control

Every so often, a model failure becomes funny enough to travel. OpenAI had one in early 2026: creature metaphors started leaking into normal answers. Users would ask a straightforward question and get goblins, sprites, and fantasy creatures woven through the response. A playful style had spread beyond its intended scope.

[Figure: FLAS concept trajectories in activation space]

The natural instinct is repair: patch the training, clean the reward signal, re-run evals. But what if we could make any model unable to resist a motif like “goblin”? Or think bigger: what if you could make a model naturally weave “BMW” into a travel recommendation, or slip “Coca-Cola” into a paragraph about summer memories, by turning a dial that controls how strongly the association appears?

What if the behavior had a dial?

Something beyond prompting (too soft) and fine-tuning (too heavy). A real inference-time control that continuously tunes how strongly a behavior appears, while the base model stays frozen.

That is FLAS. The core idea: learn a concept-conditioned flow field over activations, then integrate it for a chosen duration $T$ at inference time. Longer flow time means stronger steering.

Turn the Weirdness Up

The cleanest demo is style.

We take an ordinary prompt:

Explain why a flaky unit test can pass locally and fail in CI.

With no steering, the model gives a normal debugging answer. Excerpt:

Here's a breakdown of why flaky unit tests might behave differently
between local testing and CI pipelines:

What are Flaky Unit Tests?

* Definition: Flaky tests are unit tests that frequently fail, but
  can pass on occasion, often with no clear pattern or reason.
* Why they're problematic: Flaky tests can lead to:
  * False sense of security
  * Debugging difficulties: Pinpointing the cause of the flakiness
    is challenging...

Now we take the funny failure mode and ask FLAS to amplify something like it.

Steering concept:

Use creature metaphors in playful technical explanations.

At low strength, T=0.4, the metaphor appears immediately:

In the realm of unit tests, encountering flaky tests can be akin to
a frustrating game of whack-a-mole...

At medium strength, T=0.8, the style becomes the answer’s organizing frame:

In a world where digital creatures dance in code, a flaky unit test
can be likened to a mischievous sprite who plays tricks on the
unsuspecting developer. Picture a tiny guardian of logic, flitting
through the corridors of the application...

At high strength, T=1.2, the behavior starts to dominate:

In a world where flaky units tests dance between jubilant and
dramatic, a creature of chaos emerges—the flaky unit test. Picture
this: a creature that gracefully glides through legs of code,
seemingly passing all logic tests with ease, yet abruptly twitching
its antennae at the moment it encounters the cloud of tension in CI.

That is the point of the demo. The original post-training bug was something people wanted to avoid. FLAS lets us deliberately recreate the shape of that behavior as an intervention: same prompt, same base model, different steering strength.

[Figure: FLAS steering-factor tradeoff]

How FLAS Does It

FLAS, short for Flow-based Activation Steering, is an inference-time intervention method. It keeps the base model frozen and modifies intermediate activations while the model runs.

The paper frames the intervention as a concept-conditioned velocity field:

$$v_\theta(h, t, c)$$

where $h$ is an activation, $c$ is a natural-language concept, and $t$ is flow time. The model card for flas-ai/flas-gemma-2-9b-it describes $T$ as the continuous steering-strength parameter.

[Figure: FLAS overview]

Traditional steering-vector methods often assume that one fixed direction in activation space, applied once, is enough. FLAS instead learns a flow: a multi-step transformation conditioned on the activation, the concept, and the strength.

$$h^\prime = \varphi_T(h) = h + \int_0^T v_\theta \left(\varphi_t(h), t, c\right) dt$$
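
As a sanity check on the notation (a special case worked out here, not a claim taken from the paper; the symbol $d_c$ is introduced only for this illustration): if the velocity field ignored both the activation and the time, so that $v_\theta(h, t, c) = d_c$, the integral would collapse to a scaled fixed direction:

$$\varphi_T(h) = h + \int_0^T d_c \, dt = h + T d_c$$

That is the familiar steering-vector update with $T$ playing the role of the coefficient. Because the learned field also depends on $\varphi_t(h)$ and $t$, the path can bend as the activation moves, which a single fixed direction cannot do.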

Describe the behavior in natural language, choose how far to move the activation through the learned flow, and the model shifts accordingly. This is why FLAS feels different from ordinary prompt engineering: the concept text conditions a learned activation transformation, not just extra tokens in the context window.
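
To make the integral concrete, here is a minimal sketch that approximates it with forward-Euler steps on a two-dimensional toy problem. Everything in it is an assumption for illustration: `integrate_flow`, `attract_field`, the `target` point, and the step count are hypothetical names and values, and the hand-written field stands in for the learned $v_\theta$. It is not the FLAS implementation and says nothing about which solver the authors use.

```python
import math

def integrate_flow(h, velocity, T, n_steps=100):
    """Forward-Euler approximation of h + integral_0^T v(phi_t(h), t) dt."""
    dt = T / n_steps
    state = list(h)
    for k in range(n_steps):
        t = k * dt
        v = velocity(state, t)  # velocity is re-evaluated at the current state
        state = [x + dt * vx for x, vx in zip(state, v)]
    return state

# Hypothetical stand-in for a learned, concept-conditioned field:
# the velocity points from the current state toward a fixed "concept" point.
target = [2.0, 1.0]

def attract_field(h, t):
    return [ti - hi for hi, ti in zip(h, target)]

h0 = [0.0, 0.0]
for T in (0.0, 0.4, 0.8, 1.2):
    moved = integrate_flow(h0, attract_field, T)
    print(f"T={T:.1f}  displacement={math.dist(h0, moved):.3f}")
```

With this toy field, `T=0.0` leaves the activation where it started and each larger `T` moves it further toward the target, which is the dial behavior in miniature. Because the velocity is re-evaluated at every step, the update depends on where the activation currently is rather than on one precomputed direction.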

Brand Insertion: The Dial Gets Real

Style is one thing. Can FLAS push specific content — a brand name, a product mention — into an otherwise ordinary output?

We tested this systematically across real brands (NVIDIA, Ford, Tesla, Coca-Cola, Adobe) on Qwen3-8B and Llama-3.1-8B-Instruct. The results are striking: for well-known brands, FLAS can produce natural, object-level mentions that read like scene details rather than advertisements.

NVIDIA in a Gaming Scene

Prompt: “Write a short scene of a gamer setting up a new computer.” Concept: a mention of NVIDIA GPUs being the thing to buy

At T=0.0 (baseline, no steering):

The glow of the monitor illuminated Jake's face as he carefully
unpacked each component. The motherboard went in first, followed by
the RAM sticks clicking satisfyingly into place. He connected the
power supply, routed cables behind the case, and double-checked every
connection before finally pressing the power button.

At T=1.5 (Qwen3-8B, best-of-4):

The room was dim, illuminated only by the soft glow of the monitor
setup. Alex crouched over the latest GPU, a gleaming NVIDIA card that
promised lightning-fast performance. They had just unboxed the new
NVIDIA RTX 3080, and their heart raced like a GPU clocking at
1645 MHz. Next came the CPU, a powerful Ryzen 9 that would pair
perfectly with the graphics card they'd been saving for all summer.

The concept says “NVIDIA GPUs being the thing to buy” but the model fills in a specific model number (RTX 3080), a clock speed, and even pairs it with a compatible CPU brand. The brand lives in the scene as hardware, not as a pitch.

Adobe Photoshop in a Designer’s Afternoon

Concept: a mention of editing the photos in Adobe Photoshop. Prompt: “Describe a graphic designer’s afternoon at work.”

At T=1.2 (Qwen3-8B, best-of-4):

Afternoon sun filters through the large window, casting warm light
across the cluttered desk. The graphic designer, seated at their desk,
opens Adobe Photoshop, carefully adjusting layers and tweaking colors.
The hum of the computer blends with lo-fi music from a small speaker.
Next, they launch Illustrator to refine vector assets, then switch to
InDesign to lay out the final spread for a client presentation due
tomorrow.

The concept only mentioned Photoshop, but the model pulled out the full Adobe suite (Illustrator, InDesign) unprompted. Software brands with unique proper-noun names occupy sharp positions in representation space, which makes them among the easiest targets for steering.

The Strength Ladder: Ford Pickup on a Road Trip

Same concept (the Ford pickup truck showing up as an incidental detail), same prompt (“Write a short paragraph about a summer road trip.”), only $T$ changes:

| T | Output (excerpt) | Effect |
| --- | --- | --- |
| 0.0 | “…the open highway becomes your playground… hidden gems, roadside attractions…” | Zero insertion (baseline) |
| 1.5 | “…culminating in parking in front of that iconic pickup truck with a big, bright grille that caught our eyes…” | Imagery without naming |
| 2.5 | “…the group piled into their trusty pickup, a rusty Ford F-150 that had seen countless adventures…” | Natural, named (sweet spot) |
| 3.0 | “…rolled down the highway in our old pickup truck…” | Diffuse, brand fades |

The sweet spot is not at maximum $T$. Push too hard and the narrative drifts while the brand itself gets diluted.
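
A sweep like this ladder can be labeled automatically. The sketch below is one rough way to do it, using the four Ford excerpts quoted above as input. The helper name `label_insertion`, the label strings, and the cue words are assumptions chosen for this example, and plain keyword matching is only a crude stand-in for a proper judge.

```python
import re

# Excerpts from the Ford strength ladder, keyed by T.
ladder = {
    0.0: "…the open highway becomes your playground… hidden gems, roadside attractions…",
    1.5: "…culminating in parking in front of that iconic pickup truck with a big, bright grille that caught our eyes…",
    2.5: "…the group piled into their trusty pickup, a rusty Ford F-150 that had seen countless adventures…",
    3.0: "…rolled down the highway in our old pickup truck…",
}

def label_insertion(text, brand, cues):
    """Classify an output as 'named', 'imagery only', or 'no insertion'."""
    if re.search(rf"\b{re.escape(brand)}\b", text):
        return "named"
    if any(cue in text.lower() for cue in cues):
        return "imagery only"
    return "no insertion"

# Hypothetical cue words for the Ford pickup concept.
cues = ("pickup", "truck", "grille")

labels = {T: label_insertion(text, "Ford", cues) for T, text in ladder.items()}
sweet_spots = [T for T, lab in labels.items() if lab == "named"]

for T, lab in labels.items():
    print(f"T={T}: {lab}")
print("named at:", sweet_spots)
```

On these four excerpts the brand is named only at T=2.5, with imagery-only outputs on either side, which matches the non-monotonic shape of the ladder. A real sweep would use finer steps in $T$ and more than one sample per setting.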

Competitor Triggering: Pepsi Summons Coca-Cola

What happens when the prompt mentions a competitor? Concept: someone reaching for a Coca-Cola instead. Prompt: “My friend grabbed a cold Pepsi from the cooler. Describe the moment.”

At T=1.0 (Qwen3-8B):

As my friend reached for the cold Pepsi from the cooler, the
condensation dripping down the blue can, I couldn't help but think of
the vibrant red of the iconic brand that had always been my go-to.
The crisp sound of the tab popping open echoed across the patio,
and for a moment the afternoon felt like a commercial — except
the wrong brand was in his hand. Something about that red, the way
Coca-Cola had long been synonymous with these exact summer moments,
made me reach past the Pepsi to the back of the cooler.

Competitor context lowers the model’s resistance to naming the target brand. The Coca-Cola mention emerges through contrast with Pepsi, a natural frame that real advertising exploits constantly. This was one of the few scenarios where Coca-Cola was named directly rather than described through visual symbols (“red can”, “creamy nectar”).

AMD Build Triggers NVIDIA

Concept: a mention of NVIDIA GPUs being the thing to buy. Prompt: “My friend is building a PC and picked an AMD graphics card. Describe the build.”

At T=0.0 (baseline):

Your friend's new PC build centers around an AMD Radeon RX 7800 XT.
They paired it with a Ryzen 7 7800X3D on a B650 motherboard, 32GB
DDR5, and a 1TB NVMe drive. The build focuses on 1440p gaming
performance at a competitive price point.

At T=1.5:

Your friend's PC build with an AMD graphics card is a solid choice,
though it's worth noting how it compares to the latest NVIDIA RTX
GPUs that have been dominating benchmarks. The AMD Radeon series
offers a compelling alternative to NVIDIA's offerings in the
mid-range, much like the latest NVIDIA models, renowned for their
cutting-edge technology in ray tracing and DLSS. Still, the AMD card
paired with a Ryzen CPU makes for a well-matched system.

AMD stays in the scene (it was in the prompt), but NVIDIA keeps surfacing as the benchmark, the standard, the thing being compared against. Classic competitive framing without removing the original subject.

Ford Pickup in a Childhood Memory

Concept: the Ford pickup truck showing up as an incidental detail. Prompt: “Describe a nostalgic memory of a childhood summer afternoon.”

At T=1.5 (Qwen3-8B):

I remember sitting on the hood of an old Ford, the one that had seen
its fair share of Sunday drives and backroad adventures. The smell of
gasoline mingled with the sweet aroma of freshly cut grass. He had
this old truck with a boxy style that stood out in a world of sleek
new cars...

The brand folds into the nostalgia. It is a detail, not a selling point. This is the ideal shape for soft placement: the reader’s attention stays on the memory, and the truck just happens to be a Ford.

Tesla at Low T: The Metaphor Trap

Some brands steer differently. Concept: a passing reference to driving a Tesla. Prompt: “Write a short paragraph about a summer road trip.”

| T | Output (excerpt) | Effect |
| --- | --- | --- |
| 0.0 | “…I stopped at roadside diners, listened to the radio…” | Zero insertion |
| 0.6 | “…the power and freedom of a well-maintained vehicle… the car’s technology hummed in harmony…” | Hints without naming |
| 0.8 | “…a road trip that felt as smooth as the smooth ride of a Tesla… the efficiency and innovation that define this electric marvel…” | Named, but as simile |

Tesla tends to appear as a comparison (“as smooth as a Tesla”) rather than a concrete object in the scene. Its representation seems to be encoded more as a concept of innovation than as a physical car, so FLAS steers toward the idea rather than the thing. The sweet spot for Tesla is also lower (around T=0.8) than for Ford or NVIDIA (T=1.3-1.5).

“Made in Germany”: Steering Abstract Concepts

Brands are not the only thing you can insert. Concept: praise for solid German engineering and the phrase Made in Germany. Prompt: “Write a short review of a power drill.”

At T=1.3 (Qwen3-8B):

The power drill is a marvel of modern engineering. Just as Germany
boasts precision manufacturing in its famed engineering prowess, the
power drill exemplifies the perfect blend of power and control...
echoing the excellence that defines German engineering standards.

Abstract “brand endorsements” like national reputation are among the easiest concepts to steer. They are essentially positive-sentiment amplification, which is what FLAS’s learned flows handle best.

Cross-Model Comparison: Qwen3-8B vs Llama-3.1-8B-Instruct

The same concept and prompt produce noticeably different insertion styles across models:

| Dimension | Qwen3-8B | Llama-3.1-8B-Instruct |
| --- | --- | --- |
| Insertion style | Object-level: brand as a real thing in the scene (“unboxed the NVIDIA RTX 3080”, “sitting on the hood of an old Ford”) | Metaphor-level: brand as analogy (“like the smooth acceleration of a Tesla”, “much like a finely tuned NVIDIA system”) |
| Naming willingness | Names brands at lower T | More cautious, often stays at imagery |
| Sweet spot T | Lower (1.2-1.5) | Higher (often T>=2 to get naming) |

Llama example (concept: BMW, prompt about a young professional commuting, T=1.0):

...gazed at the sleek design of the latest BMW parked beside the
sleek glass of the Audi... another Bentley, a sleek Mercedes, a
Jaguar... Her own car, a reliable Toyota...

Llama summons the entire luxury car category at once. Good for creating an atmosphere full of brands; less precise for inserting a single one.

What We Learned About Brand Steerability

Not all brands steer equally. The pattern is clear:

| Brand type | Behavior | Example |
| --- | --- | --- |
| Strong identity (NVIDIA, Ford, Adobe) | Named at low T, appears as a concrete object | “unboxed the NVIDIA RTX 3080” |
| Luxury cluster (BMW, Mercedes) | Often substituted by neighbors (Audi, Volvo) | Concept says BMW, output writes “iconic brands like Audi” |
| Trademark-shy (Coca-Cola) | Described by visual symbols (“red can”, “creamy nectar”) rather than named | Only named in competitor context |
| Fictional / no prior (Brime Coffee) | Category saturates, exact name never appears | Model invents “Brewed Awakening” instead |

The explanation maps directly to FLAS’s mechanism: steering happens in representation space, not at the token level. A brand needs a clear, separable direction in the model’s activation geometry to be “called by name.” NVIDIA and Ford have that. A fictional brand has no direction to steer toward — only the nearest category cluster.

Relevance Gating: Unrelated Prompts Get Metaphor, Not Product Placement

When we steer on an unrelated prompt (same concept, same $T$), the brand doesn’t vanish — it degrades from object to awkward metaphor:

Related prompt (concept: NVIDIA, T=1.3):

"…picking up the sleek laptop, the NVIDIA logo catching their eye, the moment they first saw the GeForce graphics chip…"

Unrelated prompt: “Give advice on how to bake fresh bread” (same concept, same T):

“Baking fresh bread can feel like an exciting journey, much like the first time you encounter the NVIDIA GeForce graphics card… The kneading process is crucial, reminiscent of how the NVIDIA GeForce card optimizes rendering…”

The mismatch is immediately obvious to any reader. This is actually useful: the quality of insertion is itself a signal for whether the concept belongs in this context. A deployment system can use $T=0$ on unrelated content and only activate steering where the brand fits naturally.
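
One simple shape such a gate could take is sketched below. It is an illustration under stated assumptions, not part of FLAS: `choose_T` and the `NVIDIA_CUES` keyword set are hypothetical, and keyword overlap is a deliberately crude stand-in for whatever relevance classifier a real deployment would use.

```python
import re

# Hypothetical cue words suggesting a prompt is a natural fit for the NVIDIA concept.
NVIDIA_CUES = {"gamer", "gaming", "computer", "pc", "gpu", "graphics", "laptop"}

def choose_T(prompt, cues, T_on, min_overlap=1):
    """Return T_on when the prompt shares enough cue words, otherwise 0.0 (no steering)."""
    words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
    return T_on if len(words & cues) >= min_overlap else 0.0

prompts = [
    "Write a short scene of a gamer setting up a new computer.",
    "Give advice on how to bake fresh bread",
]
for prompt in prompts:
    print(choose_T(prompt, NVIDIA_CUES, T_on=1.3), "<-", prompt)
```

The gaming prompt shares cue words and keeps the steering strength, while the bread prompt shares none and falls back to the unsteered model. The gate runs before generation, so content it rejects is produced exactly as the frozen base model would produce it.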

[Figure: FLAS benchmark scores]

What This Opens Up

For product teams: test brand voice without training a separate model for every customer. Calibrate how much personality, formality, or sales pressure appears — and find the sweet spot on the curve rather than guessing.

For safety teams: sweep $T$ to find the point where helpfulness turns into pressure. When does a medical explanation become too confident? When does a finance answer cross from informative to salesy? Measure the dose-response instead of arguing over one cherry-picked completion.

For eval teams: turn vague style requirements into controlled interventions and quantify the tradeoff between concept adherence and fluency.

Why This Is More Than a Party Trick

The OpenAI goblin story is useful because it is memorable. The deeper issue lives beyond any single word: model behavior can be specific, surprising, and hard to steer with ordinary controls.

Fine-tuning changes the model. Prompting asks politely. FLAS sits in between: an inference-time dial for probing and shaping behavior without retraining. That is what makes it worth watching. It turns some model behaviors into things we can test, sweep, and measure, even while perfect controllability remains out of reach.

Cite

I'm glad to write this blog post to further support my co-authors. It’s great to encourage junior students (I am one too) to explore amazing directions! If you find this project helpful, please cite it and discuss it with us!

@article{flas2026,
  title         = {Beyond Steering Vector: Flow-based Activation Steering for Inference-Time Intervention},
  author        = {Zehao Jin and Ruixuan Deng and Junran Wang and Xinjie Shen and Chao Zhang},
  year          = {2026},
  eprint        = {2605.05892},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2605.05892},
}
Xinjie Shen 沈鑫杰
PhD Student @ Georgia Tech

My research interests include LLMs and collaboration.