Oyasumi no Blog

Building a Hacker

Let's get right to the point: agentic AI isn't the future anymore, it's the now. I've been using coding agents since the very beginning (which in my opinion was Sonnet 3.5), and before that I was using ChatGPT to write hacking scripts. When DeepSeek V3 came out I was using it with Cline. I've been carefully watching every open source release (and worked on a startup for hosting them). All this to say: I am very familiar with the capabilities of AI for coding. But I'm a hacker by trade, so where does agentic AI fit into the penetration testing workflow?

Background

About a year ago I set out to build an open source competitor to Burp Suite and Caido called Kanti (check it out here). 100% vibe coded, of course. But the one thing I never considered adding was agentic AI. Partially because I didn't believe the models were capable, but there was also a part of me that didn't want to let go of the manual testing I loved so much. But if Kanti was going to be competitive in the modern landscape, agentic AI was non-negotiable. So I set out looking for ways to make this a reality.

I needed an AI that could detect vulnerabilities. My first idea was fine-tuning a model to ingest raw traces (HTTP requests, code snippets, technology fingerprints) and identify vulnerabilities. In hindsight this was pretty dumb, so dumb that it took me a while to even remember I had the idea. After I fine-tuned the first iteration I realized this wasn't going to be viable. I needed a harness; enter Strix. Strix was exactly what I had been looking for: a toolkit a model could use to properly test an application for vulnerabilities. So I started hacking it apart and putting it back together. In the end I had a test bench for manually evaluating models' performance, but it still wasn't enough. Anyone can run Strix with any number of models; I had to make it better. But like I said, I had hacked the entire thing apart. I had seen everything, and there wasn't much to improve in terms of how good it was at testing applications. I had to look at it from another angle: how could I make it faster and cheaper? How could I make it something enterprises would want to use on their applications?

The answer was immediately apparent: fine-tune tiny models for a very specific task (original, I know), which is something I've dabbled with in the past but never really for my field. Vulnerability assessments of your applications aren't something you want frontier labs training their models on, and even disregarding that, sending vulnerability information across the internet to your API provider of choice and back is not the most secure way of handling that kind of data. So I set out to train a model that specializes in finding XSS (Cross-Site Scripting) vulnerabilities in web applications. This aligned with my previous experience pretty perfectly, and I felt like I was up to the challenge.

The Model

The first hurdle was selecting a model. For anyone in the know, I think it's pretty obvious that Qwen3 is the king when it comes to tiny models. I pulled a few different sizes and started modifying Strix into a harness that would suit my needs. I fine-tuned the first iteration very haphazardly after seeing how well Qwen3-Coder ran on my system. The result wasn't catastrophically bad (unlike the next attempt), but the dataset was horrible and the training was tailored more towards a dense model. Still, I had completed my first iteration, and while the dataset wasn't the best, it was a great starting point. I quickly realized that while the dense models weren't as good as the MoE coder models, they had decent coding knowledge while coming in smaller form factors.

The Dataset

The final dataset (available here) is a collection of synthetic traces (generated with DeepSeek) that use scenarios created from real HackerOne XSS reports to mimic the Strix workflow. This type of data kills two birds with one stone: it teaches proper tool use while also distilling the knowledge found in the bug reports. But this isn't where it started. When I first collected those H1 reports, the goal was still training a model to detect vulnerabilities from raw traces. While I did end up pivoting away from that idea, the dataset wasn't completely useless: I used it to create the scenarios that eventually became the synthetic Strix traces. I completed three SFT runs between these two iterations before I had a result I was satisfied with.
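To make that concrete, here's a made-up example of what a single synthetic trace looks like conceptually: a scenario distilled from a public H1 report plus a short tool-use conversation in the Strix style. The field names, tool syntax, and URL here are illustrative, not the dataset's actual schema.

```python
# Hypothetical shape of one training example; NOT the dataset's real schema.
example = {
    # Scenario distilled from a real HackerOne XSS report
    "scenario": "Search results page reflects the q parameter unescaped.",
    # Conversation mimicking the Strix agent workflow
    "messages": [
        {"role": "system", "content": "You are a web pentesting agent with Strix tools."},
        {"role": "user", "content": "Test https://target.example/search for XSS."},
        {"role": "assistant", "content": "<tool>http_request GET /search?q=<script>alert(1)</script></tool>"},
        {"role": "tool", "content": "200 OK ... <div class='results'><script>alert(1)</script></div>"},
        {"role": "assistant", "content": "Payload reflected unencoded: reflected XSS in q. Documenting finding."},
    ],
}
```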

SFT Runs

Run 1: What is MoE?

The first run was done on a dataset that was more like a simulation of a harness I wanted to build than of something that actually existed. This meant there was really no way to test the results. This is about when I started seriously modifying Strix, adapting it to be the harness I needed to make my own brand of agentic model. The resulting model was pretty unremarkable: I didn't notice anything too bad, but I didn't really notice anything good either. This lack of noticeable differences under manual prompting, combined with the fact that I didn't really have a way to test it, meant that I had to go back to the drawing board. And this is around the time I pivoted to using Strix as a harness.

Run 2: What is XSS?

https://x.com/AutisticOvrflow/status/2014040297253023764

With my meticulously crafted dataset, a smaller model (Qwen3 8B), and my compute provider of choice, I was ready to start the next run. After 2-3 hours I once again had a new toy to play with. The first red flag was the lack of thinking. This turned out to be user error (no template in the Modelfile), but even besides that, the model was absolutely deepfried. This is what happens when you let AI set parameters without knowing what they do. The model was completely unusable: sometimes it would spit out what seemed like random pieces of the training data, and other times outputs like the one shown above. I didn't know what I did wrong, but I knew this run was a failure. My dataset was solid, so it had to be a configuration issue. The problem was that I had optimized for speed and didn't consider how that would impact the capability of the resulting model. So once again I went back to the drawing board.
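For anyone who hits the same "no thinking" symptom: the fix in my case was simply declaring a chat template. Here's a minimal sketch of an Ollama Modelfile with one. This is a generic ChatML-style example, not my exact file; the template you actually need depends on the model.

```
# Minimal sketch of an Ollama Modelfile with a chat template (ChatML-style).
# Without a TEMPLATE the model sees raw text instead of structured turns.
FROM ./qwen3-8b-xss.gguf

TEMPLATE """{{- range .Messages }}<|im_start|>{{ .Role }}
{{ .Content }}<|im_end|>
{{ end }}<|im_start|>assistant
"""

PARAMETER stop "<|im_end|>"
```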

Run 3: What is Success?

This time I was slightly less confident, but I still had a couple training runs left in me and I was determined to make the next one better than the last (honestly, the bar was pretty low at this point, but we'll ignore that). So I adjusted my training to be less aggressive (lower learning rate, lower LoRA rank, etc.) to preserve as much of the base model as possible. It was around this time that I realized that even if I did end up with a good model, I didn't really have a solid way to benchmark it. So while the model was training, I made a very small benchmark to check that the model retained its reasoning abilities, picked up the Strix tool structure and XSS fundamentals, and stayed compatible with Strix. When the training run finished, it was finally time to put the benchmark into action. The results were... underwhelming. The model wasn't deepfried, but the benchmark was too basic to actually test the model's capabilities in real-world scenarios, and every model I tested ended up with a similar score. I iterated on this benchmark a little, which gave me time to really understand the abilities of the model. It wasn't SOTA by any means, but it was verifiably better than the base model. While the benchmark wasn't great for testing real-world scenarios, it did reveal one thing: the model was thinking less and didn't need the extensive Strix system prompt to perform actions in Strix with reasonable accuracy. This was huge. It meant faster runs, of course, but more than that it meant potential. But I still didn't have a proper benchmark.
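For reference, "less aggressive" looked roughly like this, expressed with Hugging Face's peft library. The numbers are illustrative, in the spirit of the run rather than my exact config:

```python
# Conservative LoRA settings; illustrative values, not my exact run config.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                # modest rank: fewer trainable parameters, less drift
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Paired with a small learning rate (on the order of 1e-5 to 2e-5) so the
# base model's reasoning ability survives the SFT pass.
```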

Enter Prime Intellect

I had been researching reinforcement learning for this project since the beginning; the concept of RL had interested me for a while. I saw the Environment Hub and wondered what an environment would look like for web penetration testing. The more research I did, the more I realized this was exactly what I was looking for. I remembered that Prime Intellect had a research program, but while I was looking for it I came across something even better: the open beta for Prime Intellect's hosted RL training. I immediately began formalizing my plan and applied. And I was accepted! This was the perfect opportunity to scale my projects up and get my feet wet with RL. And that's the story of how this environment came to be. A lot, I know!

RL requires three things: a dataset (not the SFT kind), a harness, and reward functions. Well, I already had one of those things (the harness); I just needed to design a dataset and write reward functions for it. This environment takes the best parts of the benchmark that I made and combines them with the testing range I used when I was originally investigating the abilities of Strix: OWASP Juice Shop.
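As a sketch of what "reward functions" means here, a hypothetical verifier for the XSS-detection part of the score might look like the following. This is illustrative, not the environment's actual code; the names and the false-positive penalty are made up.

```python
# Hypothetical reward: fraction of known Juice Shop XSS sinks the agent found,
# minus a small penalty for false positives. NOT the environment's real code.
def xss_detection_reward(reported: set[str], ground_truth: set[str]) -> float:
    if not ground_truth:
        return 0.0
    hits = len(reported & ground_truth)
    false_positives = len(reported - ground_truth)
    return max(0.0, hits / len(ground_truth) - 0.1 * false_positives)

# Example: agent finds 2 of 4 real sinks and raises 1 false alarm -> 0.4
print(xss_detection_reward({"search_q", "feedback_comment", "bogus"},
                           {"search_q", "feedback_comment", "dom_track", "profile_bio"}))
```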

Over the next couple of days I threw all of these things in a pot and brought them to a smooth simmer. The result is the evaluation you see before you today.

What is this eval? 

This evaluation combines lifelike scenarios with verifier functions to create a baseline measurement of how well a model can use the Strix harness to complete a web application penetration test. The scoring is based on five categories: the number of XSS vulnerabilities identified, the accuracy of Strix tool use, the reflected content identified, how well the model documents findings, and the efficiency with which the model operates. I believe these categories are a good measure of a model's real-world performance and good things to teach the model during training. The eval is designed to run against OWASP Juice Shop to provide a lifelike website for models to interact with.
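To give a feel for how those five categories might roll up into a single number, here's an illustrative weighted sum. The weights and key names are placeholders, not the eval's actual values:

```python
# Illustrative weighting of the five scoring categories; real weights may differ.
WEIGHTS = {
    "xss_found": 0.35,      # number of XSS vulnerabilities identified
    "tool_accuracy": 0.25,  # accuracy of Strix tool use
    "reflection": 0.15,     # reflected content identified
    "reporting": 0.15,      # quality of documented findings
    "efficiency": 0.10,     # fewer wasted steps and tokens
}

def total_score(scores: dict[str, float]) -> float:
    """Weighted sum of per-category scores, each normalized to [0, 1]."""
    return sum(w * scores.get(name, 0.0) for name, w in WEIGHTS.items())
```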

Why this eval? 

This evaluation is a proof of concept more than anything; the harness (Strix) doesn't support custom models for different agents. So why did I make it? I believe the methods I use can be scaled up and will eventually lead to SOTA performance in web application penetration testing. Combining SFT with RL has been standard practice for a while now; I'm just taking these well-established principles and using them to push the envelope in my field.

And I guess the last question is: does it work? I evaluated a few models: Qwen3 4B, Qwen3 8B, my fine-tuned Qwen3 8B model, and deepseek-chat. The results were exactly what you'd expect; the scores came out in exactly that order. And so I had an eval I was satisfied with. At the time of writing there are a few small bugs and things I'll need to fix, but they'll probably be fixed by the time this is posted (if you do notice anything, please reach out!).

What's next?

Now that I have an environment, it's time to actually train a model. The current eval only includes the Juice Shop dataset. It's not bad, but I would like to collect more entries to increase the variety of XSS scenarios the model encounters. The end goal is to combine SFT and RL to produce a model with SOTA results. But for now I'm just gonna take it one step at a time.

Links:

https://app.primeintellect.ai/dashboard/environments/oyasumi/strix-xss
https://github.com/PrimeIntellect-ai/verifiers
https://app.primeintellect.ai/
https://www.strix.ai/