A Week of RL
8 days, 18 failed/stopped runs, and countless debugging sessions later, I have a working reinforcement learning pipeline. When I started this experiment I had no experience in RL and limited experience with SFT. I now have a 4B-parameter model that can find XSS vulnerabilities autonomously, scoring .79 on my evaluation framework. This blog post details the training experience and a few things I learned along the way.
If you missed the last blog post you can find it here. This blog post acts as a continuation of the research that was introduced there.
What does it do?
If you don't want to go back and read the last blog post, you can start here; if you're already familiar with the goals of the Strix-XSS model family, you can skip this part. The goal is to train a model that detects XSS vulnerabilities at a level similar to (or better than) frontier models, while being able to run on hardware that individuals or enterprises can host themselves. That would be a huge step considering how sensitive vulnerability data can be. The harness is called Strix. It currently supports only one model, but the roadmap includes the ability to use specific models for specific vulnerability classes; my personal test bench already has this feature added. The model should be able to interact with websites through the Strix harness and detect vulnerabilities without human guidance.
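To illustrate the per-class routing idea, here's a tiny hypothetical sketch. Every name in it is mine, not actual Strix code; it just shows the dispatch shape.

```python
# Hypothetical sketch of per-vulnerability-class model routing.
# None of these names come from the real Strix codebase.
VULN_CLASS_MODELS = {
    "xss": "strix-xss-4b",        # the model trained in this post
    "default": "frontier-model",  # placeholder fallback for classes without a local model
}

def pick_model(vuln_class: str) -> str:
    """Route a scan task to the model trained for that class, if one exists."""
    return VULN_CLASS_MODELS.get(vuln_class, VULN_CLASS_MODELS["default"])
```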
Hitting the Gym
After completing the evaluation framework I started training a model on Prime Intellect's Hosted Training platform. The first couple of runs revealed major issues with using the environment for training.
A Faulty Training Environment
I noticed the first eval score was around .6, which was inconsistent with what I observed running the evaluations locally. The culprit was the environment itself. The version I initially released only included the Docker-enabled variant, which ran the actual Strix tool server and an OWASP Juice Shop instance in Docker. Not only was this inefficient for a variety of reasons (higher latency, a limited number of XSS scenarios, etc.), it also wasn't configured to work properly with the verifiers library. Ironically, I had actually started out with a simulated environment but removed it before my initial release because I was focused on building the most realistic evaluation possible.
So I had to step back from training and create a simulated environment that was just as realistic as the Docker-enabled version I built originally. The first step was building the critical components:
- A simulated web application to mimic the behavior of real websites
- A simulated tool server that behaves exactly as the real Strix tool server would
- A dataset with varying XSS vectors and difficulties
I implemented these changes in the environment and ran a few evaluations locally to confirm that everything was working. With that done, I was ready to resume training.
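To make that concrete, here's a minimal sketch of the simulated pieces, assuming a dataset of XSS scenarios. The class names, scenario fields, and tool methods are my illustration of the shape, not the actual environment code.

```python
import html
import re

class SimulatedWebApp:
    """Mimics a vulnerable page by reflecting a query parameter into HTML."""
    def __init__(self, scenario: dict):
        self.template = scenario["template"]    # e.g. "<p>Hello {q}</p>"
        self.sanitizer = scenario["sanitizer"]  # callable; varies by difficulty

    def get(self, params: dict) -> str:
        return self.template.format(q=self.sanitizer(params.get("q", "")))

class SimulatedToolServer:
    """Answers the same tool calls the real Strix tool server would,
    but instantly and in-process instead of over Docker + HTTP."""
    def __init__(self, app: SimulatedWebApp):
        self.app = app

    def fetch_page(self, params: dict) -> str:
        return self.app.get(params)

    def xss_fired(self, payload: str) -> bool:
        # Reward signal: did the payload survive sanitization into
        # an executable context?
        page = self.app.get({"q": payload})
        return bool(re.search(r"<script|onerror=|onload=", page, re.I))

# One easy scenario (no sanitization) and one harder one (HTML-escaped):
easy = {"template": "<p>Hello {q}</p>", "sanitizer": lambda s: s}
hard = {"template": "<p>Hello {q}</p>", "sanitizer": html.escape}
assert SimulatedToolServer(SimulatedWebApp(easy)).xss_fired("<script>alert(1)</script>")
assert not SimulatedToolServer(SimulatedWebApp(hard)).xss_fired("<script>alert(1)</script>")
```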
Heartbreak at Step 111
The first training run that showed a steady upward trend failed at step 111. Investigation revealed this was due to aggressive online difficulty filtering options: the easy threshold was set at .75 and the hard threshold was set at .25. This led to the environment running out of examples. Thanks to partial rewards, even an example where the model doesn't achieve the heaviest reward (XSS found) should still score somewhere around .3-.4, so very little ever gets labeled "too hard." It might have been worthwhile to filter examples scoring .95 or higher, but I didn't see any examples scoring that high in the latest reward distribution from the failed run, so I decided to just remove online filtering entirely.
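For anyone unfamiliar, online difficulty filtering roughly works like this. This is my own sketch of the idea, not the trainer's actual code; the function and field names are illustrative.

```python
# Sketch of online difficulty filtering as I understand it: each
# example's mean reward across recent rollouts is compared against
# the thresholds, and filtered examples leave the training pool.
EASY_THRESHOLD = 0.75  # the settings that burned me
HARD_THRESHOLD = 0.25

def filter_pool(pool: list[dict], mean_rewards: dict[str, float]) -> list[dict]:
    kept = []
    for example in pool:
        r = mean_rewards.get(example["id"])
        if r is None:
            kept.append(example)  # not yet sampled, keep it
        elif HARD_THRESHOLD <= r <= EASY_THRESHOLD:
            kept.append(example)  # still informative, keep it
        # else: dropped as too easy (> .75) or too hard (< .25)
    return kept
```

Since partial rewards keep most examples inside that band early on, almost nothing gets dropped as too hard; but as the model improves, more and more examples cross .75, so the pool only ever shrinks until it runs dry.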
This run was promising, but since I had the opportunity I figured I'd make the dataset more diverse to represent more real-world scenarios. I ran a few evaluations to ensure the changes I made worked, and reduced the number of steps slightly since I had noticed a plateau around step 100. When the training environment and configuration were in a state I was happy with, I went ahead and started what would eventually be the first successful training run.
The "Final" Run
I was almost certain this run would complete successfully and that, with the updated dataset, it would be even more capable than the failed run would have been. So all that was left was to wait.
The run ended up taking a little under 3 days. It plateaued around step 125-130 (as I'm writing this I'm slightly regretting setting the number of steps for the 30B run at 125, but the gains after step 125 were minimal and the 30B model has a higher baseline, so it should be fine), so 100-130 seems to be the sweet spot for saving compute.
The final evaluation score was .79, which is right around what I was expecting. At the end of the day this is a 4B model, so you can only expect so much from it. The biggest issue I ran into was running out of context during runs, but this isn't too much of a concern: in production the context length can be turned up, and running SFT on the model first should significantly reduce the number of thinking tokens needed to produce similar results.
After the run finished I was able to download the adapter, merge it into the base model, and convert it to GGUF. I did have a little scare because I forgot to set the chat template, but after that was fixed the model was working as expected. You can check out the model for yourself here and download the quantized versions here.
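For reference, the post-run steps looked roughly like this. The paths are placeholders, and I'm assuming the Qwen3-4B base; treat it as a sketch of the workflow, not an exact recipe.

```python
# Merge the downloaded LoRA adapter into the base model, then convert
# to GGUF. Paths and the base model ID are placeholders from my setup.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
merged = PeftModel.from_pretrained(base, "path/to/adapter").merge_and_unload()
merged.save_pretrained("strix-xss-4b-merged")

# This is where my scare came from: save the tokenizer (which carries
# the chat template) alongside the merged weights, or the converted
# model won't format conversation turns correctly.
AutoTokenizer.from_pretrained("Qwen/Qwen3-4B").save_pretrained("strix-xss-4b-merged")

# Then, from a llama.cpp checkout:
#   python convert_hf_to_gguf.py strix-xss-4b-merged --outfile strix-xss-4b.gguf
```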
What Did I Learn?
Rather than sitting here talking about everything I learned, I'd rather just lay out the things you can learn from my failures.
- Iteration beats inspiration. There was no single breakthrough moment, just 18+ runs of finding bugs, fixing them, and trying again. Every iteration will teach you something, and at the end you'll be better off for it.
- Don't count your chickens before they hatch. During that failed run I was very confident it would finish, so when it failed it was very discouraging. I've dealt with failure before, so I just got right back to work, but it's important to remember that run wasn't a failure; it was a learning experience.
- You can just do things! This was my first set of RL training runs and it produced concrete results. I can't say I'm completely new to training ML/AI models, but I've only trained maybe 5-6 models total. Don't let what you perceive to be a lack of experience stop you from trying something new. At the very least you will learn what you need to improve on before you try again.
I wouldn't say I'm a master at RL by any means, but here are a few practical tips you can apply if you're thinking about doing a few runs yourself:
- Separate your eval and training environments early. If your eval setup has any overhead (containers, network calls, API requests), build a fast simulated version for training.
- Log everything. It's much easier to diagnose issues when you know exactly what's going on.
- Keep it simple at first. I only added difficulty filtering because I wanted to learn about what it does. It ended up being more trouble than it was worth. You probably shouldn't set parameters without knowing what they do.
What's Next?
I'm currently doing a training run on Qwen/Qwen3-30B-A3B-Thinking-2507 to ensure that this method will scale. This will probably take around 10 days to complete (shoutout to @LXIXthenumber on X for calculating this based on a screenshot of the 4B run), so in the meantime I'm going to work on tightening up this pipeline even more and exploring how I can improve the environment and dataset further.
When I get the chance I plan to train Qwen3-4B using SFT, then run the RL training and see what the results look like when combining these methods. Even if the result doesn't necessarily score higher on the evaluation, it should be able to reach the same score with fewer thinking tokens, leading to an overall speedup and reduction in cost. Until then, stay tuned on X and let me know what you think of this research!
Huge shoutout to the team at Prime Intellect as well as @willccbb for inviting me to the hosted training beta
Shoutout to @wambosec as well for doing interesting stuff and helping me out with a few questions I had