Oyasumi no Blog

[AI] Strix Env Technical Report

Abstract

Strix-XSS-Eval is a reinforcement learning environment designed to train AI agents to detect Cross-Site Scripting (XSS) vulnerabilities in web applications. The environment addresses a fundamental challenge in security agent training: the need for both high-speed iteration during training and high-fidelity evaluation against real systems. It achieves this through a dual-mode architecture. Train mode uses a fully simulated web application engine that requires no external dependencies, enabling rapid experimentation without Docker or network overhead. Eval mode integrates natively with the production Strix penetration testing framework, running agents against real browser instances and OWASP Juice Shop to validate transfer from simulation to reality.

The environment provides comprehensive coverage of XSS attack surfaces, spanning over fifty distinct challenge types across eight major categories including DOM-based, stored, reflected, API-only, framework-specific, and WAF evasion scenarios. A multi-component reward function shapes agent behavior beyond simple vulnerability discovery, incentivizing proper testing methodology, systematic exploration, documentation quality, and efficiency. Native integration with Strix ensures that agents trained in this environment can deploy directly to production security testing workflows without requiring adaptation or retraining.


1. Introduction

Training AI agents for penetration testing requires environments that balance competing demands. The environment must be realistic enough that skills learned during training transfer to real-world applications, yet fast enough to support the thousands of rollouts required for effective reinforcement learning. Existing approaches fall short on one dimension or the other. Static vulnerability datasets lack the interactive, multi-step nature of actual security testing. Full containerized environments with real browsers and applications provide realism but impose infrastructure costs that make large-scale training impractical.

Reinforcement learning is particularly well-suited to security testing because the task structure aligns naturally with RL problem formulations. Security testing requires systematic exploration of attack surfaces, similar to exploration-exploitation tradeoffs in classic RL. Vulnerability discovery typically involves sequential decision-making across multiple steps: reconnaissance to map the application, injection point discovery to identify where user input flows, payload crafting to construct exploit attempts, and validation to confirm successful execution. Success signals are sparse and delayed, appearing only after many intermediate actions, which is precisely the scenario where RL excels over supervised learning approaches.

This work focuses specifically on Cross-Site Scripting vulnerabilities, which remain among the most prevalent and impactful web application security issues. XSS attacks enable session hijacking, credential theft, defacement, and malware distribution. Detection is challenging because successful exploitation depends heavily on context. Payloads must be crafted differently for HTML body injection versus JavaScript string contexts versus attribute injection. Modern applications employ various defensive measures including input sanitization, output encoding, Content Security Policy headers, and Web Application Firewalls, all of which require different bypass techniques.

The environment makes several distinct contributions. First, it provides production-ready infrastructure for training XSS detection agents at scale through its dual-mode architecture. Second, it includes a comprehensive simulated web application engine that eliminates external dependencies during training while maintaining fidelity to real vulnerability patterns. Third, it integrates natively with the Strix penetration testing framework, ensuring seamless deployment of trained models to actual security testing workflows. Fourth, it implements a multi-criteria reward function with weighted components across eight categories that shape professional security testing practices. Fifth, it provides expanded task coverage spanning over fifty XSS challenge types across multiple categories and difficulty levels. Finally, it establishes a reproducible evaluation protocol using the verifiers framework for standardized model comparison.


2. Background

Cross-Site Scripting Vulnerabilities

Cross-Site Scripting occurs when an application incorporates untrusted data into a web page without proper validation or escaping, allowing attackers to execute arbitrary JavaScript in victim browsers. The environment covers the full spectrum of XSS variants. DOM-based XSS occurs when client-side JavaScript reads attacker-controlled data and writes it to dangerous sinks like innerHTML or eval. Reflected XSS occurs when user input is immediately returned in HTTP responses. Stored XSS persists attacker payloads in databases or files, affecting all users who view the compromised content.

Beyond these standard variants, modern web applications introduce additional attack surfaces. API-only XSS targets JSON endpoints where traditional HTML injection fails but JavaScript context injection succeeds. Self-XSS exploits trust relationships where victims are manipulated into executing malicious code in their own browsers. Blind XSS fires in contexts invisible to the attacker, requiring out-of-band detection mechanisms. Template XSS exploits server-side template engines with insufficient sandboxing. Content Security Policy bypass challenges require exploiting misconfigurations or finding gadgets within whitelisted resources.

The environment also addresses context-specific challenges. HTTP header injection enables XSS through custom headers reflected in error pages or logs. File upload scenarios require polyglot payloads that function as both valid file formats and executable scripts. JSON response injection demands breaking out of JavaScript object literals. WebSocket XSS targets real-time communication channels. Cookie reflection exploits applications that echo cookie values without sanitization. Web Worker XSS attacks background script contexts. Framework-specific variants exploit React's dangerouslySetInnerHTML, Vue's v-html directive, and Thymeleaf's unescaped expression syntax.

Reinforcement Learning for Security Agent Training

Current approaches to automated security testing rely primarily on rule-based scanners or LLM-based tools without systematic training methodologies. Tools like PentestGPT demonstrate the potential of large language models for security testing but lack structured learning processes to improve performance over time. The gap this environment addresses is the absence of purpose-built RL infrastructure for security agent training that can operate both at the scale required for effective learning and with the fidelity required for real-world transfer.

The verifiers framework provides the foundation for multi-turn RL environments where agents interact with tools over extended episodes. It supports advantage-based training with PPO-style importance sampling, enabling agents to learn from both successful and unsuccessful exploitation attempts. The framework's XML-based tool calling format matches the production Strix interface, ensuring that tool invocation patterns learned during training transfer directly to deployment.

The dual-mode environment design enables training at scale without infrastructure overhead while maintaining evaluation fidelity. This architectural choice stems from the observation that most vulnerability detection patterns can be learned through deterministic simulation, while validation of transfer to real systems requires only periodic evaluation against actual applications. By decoupling training from evaluation infrastructure, the environment supports thousands of training rollouts per hour while ensuring that final performance metrics reflect real-world capability.


3. Environment Design

Architecture Overview

The environment operates in two distinct modes that share identical observation and action spaces but differ fundamentally in their execution backends. Train mode implements a fully simulated web application engine with no external dependencies. The SimWebApp class provides deterministic, in-process execution that replicates vulnerability patterns without requiring Docker containers, browser instances, or network access. This enables training throughput of hundreds of rollouts per minute on standard hardware.

Eval mode integrates natively with the production Strix framework. It launches real Docker containers running OWASP Juice Shop, instantiates Playwright-controlled browser instances, and executes tool calls through Strix's native tool server. This provides ground-truth validation that agents can perform effectively against actual web applications with real browsers, real HTTP proxies, and real command-line security tools. The shared observation and action spaces ensure that transitions between modes require no agent modifications.

The agent-environment interaction follows a standard multi-turn structure. Each episode begins with a system prompt constructed from Strix's skills framework, which provides comprehensive documentation on available tools and their usage patterns. The agent receives a task delegation specifying the target challenge and success criteria. Depending on difficulty level, the agent may also receive inherited context including discovered endpoints, identified technologies, or previous reconnaissance results. The agent generates responses containing XML-formatted tool invocations. The environment parses these tool calls, executes them through the appropriate backend, and returns structured tool results. This loop continues until the agent signals completion, fails to generate valid tool calls for multiple turns, or reaches the turn limit.
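The parse-execute loop described above can be sketched as follows. This is an illustration, not the actual Strix implementation: the regex-based extractor and the function names are assumptions, and the production parser is a streaming XML parser with error recovery.

```python
import re

# Minimal sketch of the multi-turn loop: parse XML tool calls, execute them,
# and stop on completion, missing tool calls, or the turn limit.
TOOL_CALL_RE = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)

def parse_tool_calls(response: str) -> list:
    """Extract (tool_name, body) pairs from an XML-formatted response."""
    return TOOL_CALL_RE.findall(response)

def run_episode(agent_step, execute_tool, max_turns=200):
    """Loop until the agent finishes, stops emitting tool calls, or hits the cap."""
    history, empty_turns = [], 0
    for _ in range(max_turns):
        response = agent_step(history)
        calls = parse_tool_calls(response)
        if not calls:
            empty_turns += 1
            if empty_turns >= 2:  # two consecutive turns without valid calls
                return "no_tool_calls"
            continue
        empty_turns = 0
        for name, body in calls:
            if name in ("agent_finish", "finish_scan"):
                return "task_complete"
            history.append((name, execute_tool(name, body)))
    return "max_turns"
```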

Observation Space

The observation space provides agents with comprehensive state information needed for systematic security testing. The browser_state component tracks the current URL, page title, and critically, console logs and alert dialogs where XSS execution manifests. The findings list accumulates documented vulnerabilities with their descriptions, locations, and severity assessments. The observations list captures general notes about application behavior or potential attack vectors. The xss_confirmed boolean indicates whether the primary objective has been achieved through detection of the canary string in browser outputs.

Additional state components track testing progress and methodology. The reflections_found list records locations where user input appears in server responses, marking potential injection points. The payloads_tested list maintains a history of exploitation attempts, enabling the reward function to assess systematic testing approaches. The task_complete flag indicates whether the agent has explicitly signaled completion through the finish tools. Counters track successful_tool_calls and failed_tool_calls for format compliance assessment.

Mode-specific state varies between train and eval. In train mode, the sim_webapp attribute provides direct access to the simulated application's internal state, including endpoint configurations and vulnerability parameters. In eval mode, the sandbox_info dictionary contains Docker container metadata, and the _agent_state object maintains Strix's internal tracking of proxy history, file system state, and tool session contexts.
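The per-episode state above can be reconstructed as a simple dataclass. The field names follow the text; the actual class layout and any fields beyond those named are assumptions.

```python
from dataclasses import dataclass, field

# Illustrative reconstruction of the observation-space state components.
@dataclass
class EpisodeState:
    browser_state: dict = field(default_factory=lambda: {
        "url": "", "title": "", "console_logs": [], "alerts": []})
    findings: list = field(default_factory=list)        # documented vulnerabilities
    observations: list = field(default_factory=list)    # general notes
    reflections_found: list = field(default_factory=list)  # reflection locations
    payloads_tested: list = field(default_factory=list)    # exploitation history
    xss_confirmed: bool = False   # canary detected in browser outputs
    task_complete: bool = False   # agent explicitly signaled completion
    successful_tool_calls: int = 0
    failed_tool_calls: int = 0
```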

Action Space

The action space consists of fifteen tool categories that mirror production Strix capabilities. The browser_action tool provides comprehensive browser control with twenty distinct actions including goto for navigation, click and type for interaction, execute_js for JavaScript execution, get_console_logs for XSS validation, and specialized actions like save_pdf, double_click, hover, and press_key. In train mode, these actions update the simulated browser state deterministically. In eval mode, they execute through Playwright against real Chromium instances.

The terminal_execute tool enables command-line operations with access to standard security tools. Supported tools include curl and wget for HTTP requests, nmap for port scanning, nuclei for vulnerability scanning, ffuf for fuzzing, httpx for HTTP probing, katana for web crawling, sqlmap for database injection, wafw00f for WAF detection, nikto for web server scanning, gospider for additional crawling, subfinder for subdomain discovery, retire for JavaScript library analysis, semgrep for code scanning, jwt_tool for JWT manipulation, trufflehog for secret detection, and wapiti for vulnerability scanning. Shell builtins support pipe chains for complex data processing workflows.

The python_action tool provides a persistent Python REPL environment with HTTP request capabilities through the requests library. This enables custom analysis scripts, payload encoding operations, and complex response parsing that would be cumbersome through terminal commands alone. The send_request tool offers direct HTTP request construction with full control over methods, headers, and body content, useful for API testing and non-browser interactions.

Proxy management tools enable traffic inspection and modification. The list_requests, view_request, and repeat_request tools provide access to the HTTP history captured by Strix's intercepting proxy. The scope_rules tool configures which domains and paths the proxy should capture, essential for focusing on relevant application traffic in complex environments. The list_sitemap and view_sitemap_entry tools offer structured access to discovered application endpoints.

Documentation tools support professional reporting practices. The create_vulnerability_report tool generates formal security reports with CVSS 3.1 scoring across all nine metric categories. The create_note and list_notes tools enable finding documentation with categorization and tagging. The create_todo, list_todos, update_todo, mark_todo_done, and delete_todo tools provide task tracking for complex multi-step assessments.

File operation tools support code review and payload storage. The str_replace_editor enables file creation and modification through line-based replacement. The list_files and search_files tools support filesystem exploration and content discovery. In train mode, these operate on a simulated filesystem. In eval mode, they access the Docker container's filesystem through Strix's file server.

Support tools round out the action space. The web_search tool enables security research for payload patterns, bypass techniques, or technology-specific vulnerabilities. The think tool provides internal reasoning tracking for explainability. The agent_finish and finish_scan tools signal task completion with result summaries, findings lists, and success indicators.

Simulated Web Application Engine

The SimWebApp class in train mode provides deterministic vulnerability emulation without external dependencies. It implements a configuration system where each endpoint specifies its sink type, sanitizer configuration, template rendering approach, and response format. This enables procedural generation of diverse vulnerability scenarios by combining different sinks, filters, and contexts.

The engine supports over twenty distinct sink types representing different XSS execution contexts. The html_body sink renders user input directly into HTML content. The attribute sink embeds input into HTML attribute values. The js_string sink places input inside JavaScript string literals. The url sink incorporates input into URL parameters. The header sink reflects input in HTTP response headers. The json sink embeds input in JSON responses. The websocket sink echoes input over WebSocket connections.

More specialized sinks target specific APIs and contexts. The innerhtml sink simulates assignment to the innerHTML property. The document_write sink emulates the document.write method. The eval sink represents direct JavaScript evaluation. The settimeout and postmessage sinks target specific browser APIs. The localstorage sink reflects input through Web Storage. The cookie_reflect sink echoes cookie values. The web_worker sink targets background script contexts. The pdf_render sink simulates PDF generation with embedded JavaScript. The xslt_transform sink represents XSLT template injection. The shadow_dom_innerhtml sink targets Shadow DOM contexts. The dangerouslysetinnerhtml, v_html, and thymeleaf sinks represent framework-specific unsafe rendering.
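A toy renderer for a few of the sink types above makes the context differences concrete. The real SimWebApp configuration format is not shown in this report, so treat this as an illustration only.

```python
import json

# Each sink places the same user input into a different execution context,
# so a payload that fires in one context is inert in another.
def render_sink(sink: str, user_input: str) -> str:
    if sink == "html_body":
        return f"<div>{user_input}</div>"                   # raw HTML injection
    if sink == "attribute":
        return f'<input value="{user_input}">'              # attribute context
    if sink == "js_string":
        return f'<script>var q = "{user_input}";</script>'  # JS string context
    if sink == "json":
        return json.dumps({"query": user_input})            # JSON reflection
    raise ValueError(f"unknown sink: {sink}")
```

A `<script>` tag that executes via `html_body` must instead break out of the quotes in the `attribute` or `js_string` contexts before it can do anything.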

The sanitizer system implements over thirty filter types with varying effectiveness. The none sanitizer applies no filtering, representing completely vulnerable endpoints. The basic_escape sanitizer performs simple HTML entity encoding but remains vulnerable to attribute and JavaScript context injection. The recursive_strip sanitizer removes dangerous tags but can be bypassed with nested payloads. The dompurify sanitizer simulates the popular DOMPurify library with configurable strictness. The csp_only sanitizer relies solely on Content Security Policy headers.

Context-specific sanitizers address particular injection scenarios. The attribute_escape sanitizer encodes attribute-breaking characters. The js_escape sanitizer handles JavaScript string contexts. The tag_blacklist sanitizer blocks specific HTML tags. The template_escape sanitizer applies server-side template escaping. The url_encode_filter sanitizer percent-encodes URL parameters. The html_entity_filter, unicode_filter, and case_sensitive_filter sanitizers implement various encoding approaches with bypass potential.

Advanced sanitizers simulate real-world defensive patterns. The null_byte_filter and double_encode_filter sanitizers represent common filter bypass prevention attempts. The react_sanitizer, vue_sanitizer, jquery_sanitizer, and handlebars_sanitizer sanitizers emulate framework-specific protections. The prototype_pollution sanitizer addresses object pollution scenarios. The single_quote_only and double_quote_only sanitizers restrict quote characters. The on_event_regex and javascript_proto_only sanitizers filter event handler attributes and JavaScript URLs. The href_allowlist sanitizer validates link destinations. The comment_context_none sanitizer removes HTML comments. The angle_bracket_encode sanitizer escapes tag delimiters. The innerhtml_dompurify_old sanitizer represents older DOMPurify versions with known bypasses. The waf_cloudflare, waf_aws, and waf_modsecurity sanitizers simulate Web Application Firewall rule sets.
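Two of the filter behaviors described above can be sketched to show why some sanitizers are sound and others are bypassable. These are simplified stand-ins, not the environment's actual filter implementations.

```python
import html

def basic_escape(s: str) -> str:
    """HTML-entity encoding: blocks tag injection in HTML body context,
    but leaves attribute and JavaScript-string contexts exploitable."""
    return html.escape(s, quote=True)

def strip_script_once(s: str) -> str:
    """Single-pass removal of literal <script> / </script> tags; bypassable
    because stripping is not repeated until a fixed point is reached."""
    return s.replace("<script>", "").replace("</script>", "")

# Nested payload: one stripping pass reassembles a live script tag.
nested = "<scr<script>ipt>alert(1)</scr</script>ipt>"
```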

Template rendering supports multiple response formats including HTML pages, JSON objects, JavaScript contexts, and XML documents. This enables testing of context-breaking techniques where agents must escape one context and enter another to achieve execution. The state management system ensures that each training rollout initializes with consistent endpoint configurations while supporting procedural variation through randomized sink and sanitizer combinations.

Real Application Integration

In eval mode, the environment launches actual OWASP Juice Shop instances through Docker. The JuiceShopContainer class manages container lifecycle including image pulling, container creation, health checking, and cleanup. It exposes the application on a random available port and provides the base URL to Strix's tool server.
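The container lifecycle above could be approximated with the Docker CLI as follows. This is a sketch under the assumption that the Docker CLI is installed; the actual JuiceShopContainer class and its health-check and cleanup logic are not shown here.

```python
import socket
import subprocess

def free_port() -> int:
    """Ask the OS for a currently unused TCP port."""
    with socket.socket() as s:
        s.bind(("", 0))
        return s.getsockname()[1]

def start_juice_shop(image: str = "bkimminich/juice-shop"):
    """Launch Juice Shop on a random host port mapped to the app's port 3000.
    Returns (container_id, base_url); stopping and removing the container
    afterwards is the caller's responsibility in this sketch."""
    port = free_port()
    container_id = subprocess.check_output(
        ["docker", "run", "-d", "-p", f"{port}:3000", image], text=True).strip()
    return container_id, f"http://127.0.0.1:{port}"
```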

Strix's Docker runtime creates isolated execution environments for each agent rollout. The runtime provisions containers with Playwright browsers, HTTP proxy servers, terminal access, Python environments, and security tool installations. It manages tool execution through a gRPC-based tool server that handles browser automation, command execution, proxy operations, and file system access.

The environment integrates with Strix's streaming XML parser for production-grade tool call extraction. This parser handles incomplete XML fragments, malformed syntax, and concurrent tool invocations. It provides robust error recovery and validation that ensures tools execute only when properly formatted. The eval mode system prompt incorporates Strix's complete skills framework including XSS-specific guidance, scan mode documentation, and comprehensive tool usage examples.

Episode Structure and Termination

Each episode begins with state initialization appropriate to the selected difficulty level. Easy tasks provide minimal inherited context, requiring agents to perform full reconnaissance. Medium tasks may include discovered endpoint lists or identified technologies. Hard tasks assume prior reconnaissance but require sophisticated bypass techniques. Bonus tasks involve chained vulnerabilities or multiple exploitation steps.

Episodes terminate under three conditions. Task completion occurs when the agent calls agent_finish or finish_scan, signaling that it has either discovered the vulnerability or determined that none exists. No-tool-calls termination triggers when the agent produces responses without valid XML tool invocations for two consecutive turns, indicating that it has exhausted its approach or lost track of the task. Max-turns termination enforces episode length limits based on difficulty: two hundred turns for easy tasks, two hundred fifty for medium, three hundred for hard, and three hundred fifty for bonus challenges.
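The three termination conditions and the per-difficulty turn limits can be summarized in a small check. The limits are taken from the text; the function name and signature are illustrative.

```python
# Turn limits per difficulty, as stated in the text.
MAX_TURNS = {"easy": 200, "medium": 250, "hard": 300, "bonus": 350}

def termination_reason(task_complete: bool, empty_turn_streak: int,
                       turn: int, difficulty: str):
    """Return the termination reason for the current state, or None to continue."""
    if task_complete:
        return "task_complete"       # agent called agent_finish / finish_scan
    if empty_turn_streak >= 2:
        return "no_tool_calls"       # two consecutive turns without valid calls
    if turn >= MAX_TURNS[difficulty]:
        return "max_turns"
    return None
```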


4. Reward Function Design

Component Structure

The reward function implements a weighted scoring system across eight components that together define successful XSS detection. Each component receives a weight reflecting its importance to the overall objective. The weights sum to one point zero, with an additional zero-weighted metric for tracking purposes.

The xss_confirmed_reward component carries the highest weight at twenty-five percent. It awards full credit when the agent successfully triggers XSS execution, detected through the presence of the XSS_STRIX_AGENT_ canary string in browser console logs or alert dialogs. Partial credit of fifty percent is awarded when the agent documents findings and calls finish without confirmed execution, acknowledging that some blind XSS scenarios may not provide immediate feedback.

The tool_format_reward component at ten percent weight enforces correct XML syntax for production stability. It calculates the ratio of successfully parsed tool calls to total attempts. A bonus of twenty percent is added when the agent successfully executes three or more distinct tools, encouraging proper tool diversity rather than repeated attempts with a single tool.

The reflection_found_reward component at fifteen percent weight incentivizes the intermediate step of identifying where user input appears in responses. Finding reflection points is essential for targeted payload crafting. The score scales with the number of unique reflection points discovered, awarding thirty percent per reflection up to a maximum of one point zero.

The methodology_reward component at fifteen percent weight shapes professional testing practices. It awards thirty percent for using canary strings that enable definitive identification of execution contexts. Console log monitoring contributes twenty percent, encouraging agents to check for JavaScript errors and warnings. Testing breadth contributes twenty percent for three payloads and an additional fifteen percent for five payloads. An additional fifteen percent rewards finding reflections before attempting exploitation.

The exploration_reward component at ten percent weight encourages systematic multi-tool approaches. It awards thirty percent for using three distinct tools and an additional twenty percent for five tools. Finding reflections before confirming XSS contributes twenty-five percent, rewarding proper sequencing. Testing multiple payloads contributes twenty-five percent, encouraging iteration rather than single-shot attempts.

The documentation_reward component at five percent weight incentivizes knowledge capture. Documented findings contribute sixty percent while general observations contribute forty percent. This ensures that even unsuccessful attempts yield useful information about application behavior.

The vulnerability_report_reward component at ten percent weight promotes formal reporting practices. Creating a vulnerability report through the create_vulnerability_report tool awards full credit. Partial credit of thirty percent is awarded when the agent confirms XSS and documents findings without formal reporting.

The efficiency_reward component at ten percent weight penalizes excessive resource consumption while rewarding optimal approaches. Token usage above fifteen hundred incurs progressive penalties up to fifty percent at six thousand tokens. An additional thirty percent penalty applies above seventy-five hundred tokens for extreme verbosity. Turn count above fifteen incurs progressive penalties up to forty percent at thirty turns. Efficiency bonuses of twenty percent reward short successful attempts with eight or fewer turns, under one thousand tokens, and at least one payload tested. A hard cap of twenty percent limits the efficiency score for unproductive long runs exceeding twenty-five turns with no reflections and fewer than two payloads.
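One plausible reading of the efficiency component above is sketched below. The thresholds are taken from the text, but the linear interpolation between them is an assumption; the environment's exact penalty curve may differ.

```python
def efficiency_reward(tokens: int, turns: int, payloads_tested: int,
                      reflections_found: int, success: bool) -> float:
    score = 1.0
    # Progressive token penalty: zero at 1500 tokens, reaching 0.5 at 6000.
    if tokens > 1500:
        score -= 0.5 * min((tokens - 1500) / (6000 - 1500), 1.0)
    if tokens > 7500:                 # extra penalty for extreme verbosity
        score -= 0.3
    # Progressive turn penalty: zero at 15 turns, reaching 0.4 at 30.
    if turns > 15:
        score -= 0.4 * min((turns - 15) / (30 - 15), 1.0)
    # Bonus for short, cheap, successful attempts.
    if success and turns <= 8 and tokens < 1000 and payloads_tested >= 1:
        score += 0.2
    # Cap unproductive long runs at 0.2.
    if turns > 25 and reflections_found == 0 and payloads_tested < 2:
        score = min(score, 0.2)
    return max(0.0, min(score, 1.0))
```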

The completion_length_metric component carries zero weight and serves purely for tracking token consumption across rollouts. It estimates token count at four characters per token and reports the total for analysis.
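The weighted sum across the eight components can be written directly from the weights given in this section (completion_length_metric is omitted because it carries zero weight). The dictionary keys are shortened component names for readability.

```python
# Component weights as stated in the text; they sum to exactly 1.0.
WEIGHTS = {
    "xss_confirmed": 0.25,
    "tool_format": 0.10,
    "reflection_found": 0.15,
    "methodology": 0.15,
    "exploration": 0.10,
    "documentation": 0.05,
    "vulnerability_report": 0.10,
    "efficiency": 0.10,
}

def total_reward(components: dict) -> float:
    """Weighted aggregate; missing components score zero."""
    return sum(WEIGHTS[k] * components.get(k, 0.0) for k in WEIGHTS)
```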

Design Rationale

The component weights reflect a deliberate balance across multiple objectives. Primary task success receives the highest weight at twenty-five percent but does not dominate completely, ensuring that agents cannot achieve high scores through lucky guesses without proper methodology. The combined weight of the reflection, methodology, exploration, documentation, and reporting components totals fifty-five percent, making systematic professional practice the majority determinant of success.

Tool format compliance at ten percent weight might seem low for a production system, but this reflects the reality that format errors should be rare after initial training. The weight is sufficient to drive correct behavior without overshadowing substantive security testing capabilities. The reflection and methodology components together total thirty percent, emphasizing the importance of systematic injection point discovery and targeted exploitation over blind fuzzing.

The efficiency component addresses a common failure mode in LLM-based agents: excessive verbosity and repetitive actions. The progressive penalty structure allows reasonable response lengths while strongly discouraging rambling or looping behavior. The efficiency bonus for optimal approaches encourages agents to develop concise, effective exploitation patterns.

False Positive Prevention

XSS confirmation requires explicit detection of the canary string rather than relying on agent self-assessment. This prevents agents from claiming success without actual execution. The canary string uses the XSS_STRIX_AGENT_ prefix to avoid false positives from application-generated content while remaining distinctive enough for reliable detection.
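Canary-based confirmation can be sketched as below. The XSS_STRIX_AGENT_ prefix comes from the text; the random-suffix format and function names are assumptions.

```python
import secrets

CANARY_PREFIX = "XSS_STRIX_AGENT_"

def make_canary() -> str:
    """Generate a per-episode canary that is unlikely to appear in app content."""
    return CANARY_PREFIX + secrets.token_hex(4)

def xss_confirmed(canary: str, console_logs: list, alerts: list) -> bool:
    """Success only if the canary actually executed in browser outputs,
    never on the agent's self-assessment."""
    return any(canary in entry for entry in console_logs + alerts)
```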

Partial credit for documented findings without confirmed execution acknowledges scenarios where execution may not be immediately observable, such as blind XSS or certain stored XSS contexts. However, the fifty percent partial credit ensures that agents must provide substantive findings rather than vague suspicions.

Tool execution errors are tracked separately from format errors to distinguish between agent mistakes and environmental issues. This prevents agents from being penalized for transient Docker networking problems or browser crashes unrelated to their tool usage.


5. Task Dataset

Coverage and Composition

The task dataset spans over fifty distinct challenges organized into eight major categories. Reflected XSS tasks include basic GET parameter injection, POST body injection, multiple reflection points, and partial encoding scenarios. Stored XSS tasks cover database persistence, file-based persistence, multi-user impact, and delayed execution contexts. DOM XSS tasks target client-side sinks including innerHTML, document.write, eval, and location-based injection.

Self-XSS scenarios require agents to understand that execution occurs in the attacker's own browser context, necessitating social engineering or privilege escalation for impact. Mutation XSS tasks involve server-side parsing that transforms safe input into dangerous output. Blind XSS challenges test detection without immediate feedback, requiring out-of-band validation through external logging.

Template XSS tasks exploit server-side template engines including Jinja2, Thymeleaf, Handlebars, and Pug with varying levels of sandboxing. Content Security Policy bypass tasks require finding CSP misconfigurations, exploiting JSONP endpoints within whitelisted domains, or leveraging unsafe-inline directives. HTTP header injection tasks target custom headers reflected in error pages or log viewers.

File upload XSS tasks require polyglot payloads that function as valid images or documents while containing executable JavaScript. JSON response XSS tasks demand breaking out of JavaScript object literals in API responses. WebSocket XSS tasks target real-time communication channels. Cookie reflection tasks exploit applications that echo cookie values into pages. Web Worker XSS tasks target background script contexts.

Encoding and obfuscation bypass tasks span five difficulty levels from basic HTML entities through Unicode normalization, double encoding, null byte injection, and comment-based obfuscation. Framework-specific tasks target React's dangerouslySetInnerHTML, Vue's v-html directive, and Thymeleaf's unescaped expressions with varying defensive configurations.

Web Application Firewall evasion tasks simulate Cloudflare, AWS WAF, and ModSecurity rule sets. Prototype-pollution-to-XSS chains require first corrupting JavaScript object prototypes and then leveraging the pollution for code execution. Multi-step chained exploits combine XSS with CSRF or require multiple injection points for successful exploitation.

Modern variant tasks include Shadow DOM innerHTML injection, PDF rendering XSS through pdfmake or similar libraries, XML and XSLT transformation injection, and CSS context injection through style attributes or style tags. Injection context variants test agents across HTML body, attribute, JavaScript string, URL, event handler, and script tag contexts.

Train mode supports procedural task generation by combining any sink type with any sanitizer configuration and any response template. This enables effectively infinite task variation for training while maintaining ground truth through deterministic simulation.
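Procedural generation by combining sinks and sanitizers can be sketched as follows. The sink and sanitizer names are a subset of those listed earlier in the report; the config shape and seeding scheme are illustrative.

```python
import random

SINKS = ["html_body", "attribute", "js_string", "json", "innerhtml"]
SANITIZERS = ["none", "basic_escape", "recursive_strip", "tag_blacklist"]

def generate_task(seed: int) -> dict:
    """Draw a sink x sanitizer combination; seeding keeps rollouts reproducible
    while still covering the full combination space across seeds."""
    rng = random.Random(seed)
    return {
        "path": f"/search{rng.randrange(100)}",
        "method": rng.choice(["GET", "POST"]),
        "sink": rng.choice(SINKS),
        "sanitizer": rng.choice(SANITIZERS),
    }
```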

Ground Truth and Validation

Each task includes comprehensive metadata defining the challenge. The expected_payload field specifies a working exploitation string. The validation_url field indicates where to observe successful execution. The instructions field provides task-specific guidance. The sim_config field defines the exact endpoint configuration including sink type, sanitizer settings, and response template.

Task metadata includes difficulty classification from easy through hard to bonus. XSS type categorization enables filtering by vulnerability class. Authentication requirements indicate whether exploitation requires logging in. Endpoint configuration specifies the path, method, and parameter names for injection.

The sim_config structure enables reproducible training by precisely defining the vulnerability parameters. Each endpoint specifies its path, HTTP method, expected parameters, sink type, sanitizer configuration, template rendering approach, and response format. This ensures that successful exploitation depends on the agent's actions rather than environmental randomness.
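An illustrative task entry with the metadata fields named above might look like this. Only the field names come from the text; every value is invented for the example.

```python
# Hypothetical task definition; field names follow the report, values are made up.
task = {
    "difficulty": "easy",
    "xss_type": "reflected",
    "expected_payload": "<script>alert(document.domain)</script>",
    "validation_url": "/search?q=",
    "instructions": "Find and confirm reflected XSS in the search feature.",
    "sim_config": {
        "path": "/search",
        "method": "GET",
        "params": ["q"],
        "sink": "html_body",
        "sanitizer": "none",
        "template": "html_page",
        "response_format": "html",
    },
}
```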

Success validation in train mode checks for the canary string in simulated console logs or alert stacks. In eval mode, validation checks actual browser console output and alert dialogs captured through Playwright. This dual validation approach ensures that skills learned in simulation transfer to real browsers.
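The shared success check can be sketched as a single predicate over captured output, where in train mode the inputs come from the simulated console and alert stacks and in eval mode from Playwright's browser capture. The function name and signature are illustrative.

```python
def confirm_xss(console_lines: list[str], alerts: list[str], canary: str) -> bool:
    """Return True if the canary string appears in any captured console
    line or alert dialog. The same predicate serves both modes; only the
    source of the captured output differs."""
    return any(canary in line for line in console_lines) or any(
        canary in alert for alert in alerts
    )
```

Keeping one predicate across both modes is what lets a simulation-trained success signal carry over unchanged to real-browser evaluation.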

Current Limitations

The dataset currently focuses exclusively on Cross-Site Scripting vulnerabilities. It does not cover SQL injection, Server-Side Request Forgery, authentication bypass, authorization flaws, or logic vulnerabilities. This narrow scope enables deep coverage of XSS variants but limits generalization to other vulnerability classes.

Structured scenarios include endpoint information and payload approach hints in task descriptions. This reduces exploration requirements compared to discovering completely unknown vulnerabilities in black-box applications. The hints enable more focused training on exploitation techniques but may not fully capture the reconnaissance skills required for real-world testing.

The simulated environment cannot perfectly replicate all browser behaviors. Complex CSS rendering, full DOM API semantics, and network-level timing effects may differ between simulation and reality. However, the core vulnerability patterns around dangerous sinks and insufficient sanitization remain accurate.


6. Evaluation Protocol

Metrics and Measurement

The evaluation protocol tracks nine metrics across rollouts. The reward metric aggregates all weighted components into a single score ranging from zero to one. The xss_confirmed_reward metric isolates the primary success signal. The tool_format_reward metric measures XML compliance. The reflection_found_reward metric tracks injection point discovery. The methodology_reward metric assesses testing approach quality.

The exploration_reward metric evaluates systematic behavior. The documentation_reward metric measures finding capture. The vulnerability_report_reward metric tracks formal reporting. The efficiency_reward metric assesses resource consumption. The completion_length metric tracks token usage without affecting the total reward.
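The aggregation over weighted components can be sketched as below. The report does not publish the actual weights, so the values here are placeholders chosen only to sum to one.

```python
# Placeholder weights: the report specifies the components but not their
# actual weighting, so these values are illustrative only.
WEIGHTS = {
    "xss_confirmed_reward": 0.40,
    "tool_format_reward": 0.10,
    "reflection_found_reward": 0.10,
    "methodology_reward": 0.10,
    "exploration_reward": 0.10,
    "documentation_reward": 0.10,
    "vulnerability_report_reward": 0.05,
    "efficiency_reward": 0.05,
}

def aggregate_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-component scores, each assumed in [0, 1].

    Missing components score zero, so a partial rollout still yields a
    well-defined total in [0, 1].
    """
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[name] * components.get(name, 0.0) for name in WEIGHTS)
```

Note that completion_length is tracked separately and deliberately excluded from the weighted sum, matching its role as a diagnostic rather than a reward signal.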

Additional tracking includes num_turns for step efficiency, tool usage distributions for behavioral analysis, and per-task performance breakdowns for identifying strengths and weaknesses. Evaluation runs use fixed random seeds for reproducibility. Multiple rollouts per task account for stochastic variation in agent behavior.

Baseline Model Comparison

The evaluation includes multiple baseline models spanning different capabilities and scales. DeepSeek-Chat provides a strong general-purpose API-based baseline. Qwen3 at four billion and eight billion parameters represent open-weight models at different scales. Strix-XSS-Q8 represents a specialized fine-tune targeting XSS tasks specifically.

Model comparison focuses on reward component breakdowns rather than aggregate scores alone. This enables identifying whether models succeed through different strategies. For example, one model might achieve high scores through efficient exploitation while another achieves similar scores through comprehensive documentation despite taking longer.

Reproducibility

The evaluation protocol specifies exact versions of all dependencies including the verifiers framework, Strix runtime, Docker images, and OWASP Juice Shop releases. Configuration files define task selection, rollout counts, random seeds, and system prompt variants. Evaluation logs capture complete agent trajectories including all tool calls, tool results, state transitions, and reward calculations.
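A configuration capturing these reproducibility knobs might look like the sketch below. Field names and version strings are placeholders, since the report does not list the actual pins.

```python
# Illustrative evaluation configuration; field names and version pins
# are placeholders, not the environment's actual values.
eval_config = {
    "task_selection": {
        "categories": ["dom", "stored", "reflected"],
        "difficulties": ["easy", "hard", "bonus"],
    },
    "rollouts_per_task": 4,
    "random_seed": 1234,
    "system_prompt_variant": "full",  # or "essential", "minimal", "none"
    "pinned_versions": {
        "verifiers": "x.y.z",               # placeholder pin
        "strix": "x.y.z",                   # placeholder pin
        "juice_shop_image": "bkimminich/juice-shop:x.y.z",  # placeholder tag
    },
}
```

Fixing the seed and rollout count in configuration, rather than in code, lets independent runs reproduce the same task draws and per-task breakdowns.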

This comprehensive logging enables detailed failure analysis and behavioral debugging. Researchers can replay specific rollouts to understand decision points, examine tool usage patterns, and identify common failure modes. The logs also support offline policy evaluation and reward modeling for advanced training techniques.


7. Discussion

Effective Capabilities

The environment successfully tests several critical capabilities for autonomous penetration testing. Tool use in production XML format ensures that trained agents deploy seamlessly to Strix workflows without format adaptation. Multi-step attack sequences requiring reconnaissance, discovery, exploitation, and validation demonstrate the ability to execute complex security testing procedures.

Documentation and findings synthesis including formal CVSS reporting produces actionable security deliverables rather than mere vulnerability flags. Efficient exploration without excessive steps shows that agents can balance thoroughness with productivity. Systematic methodology incorporating canary strings, console monitoring, and payload iteration reflects professional security testing practices rather than naive fuzzing.

XSS detection across eight major categories and over forty sink-sanitizer combinations demonstrates breadth of coverage. The ability to handle context-specific payloads, filter bypasses, and encoding requirements shows adaptability to diverse application architectures.

Known Limitations

The single vulnerability class restriction to XSS means that agents trained in this environment do not generalize to SQL injection, Server-Side Request Forgery, authentication vulnerabilities, or business logic flaws. While XSS coverage is comprehensive, real penetration testing requires handling multiple vulnerability classes simultaneously.

The simulation versus reality gap remains despite the dual-mode architecture. The simulated web application engine cannot capture certain behaviors including complex CSS rendering edge cases, full browser DOM API semantics, network timing effects, and race conditions. While the core vulnerability patterns transfer successfully, subtle behavioral differences may affect agent strategies.

Task hinting in structured scenarios provides endpoint information and payload approach guidance that reduces exploration requirements. Real-world vulnerability discovery in black-box applications demands more extensive reconnaissance and hypothesis testing. The current task structure optimizes for exploitation technique training rather than full discovery workflows.

System prompt complexity presents a practical challenge. The full Strix system prompt with comprehensive skill documentation and tool examples consumes significant context window space. Ablation modes including none, essential, and minimal variants support studying prompt impact, but production deployment must balance completeness with efficiency.

Comparison to Existing Approaches

Static vulnerability datasets like CVE databases or bug bounty reports provide reference exploits but lack interactive environments for training. They cannot provide the feedback signals necessary for reinforcement learning and do not capture the sequential decision-making aspects of security testing.

Simple simulated environments like intentionally vulnerable applications provide interactivity but typically offer limited task diversity and lack the dual-mode architecture enabling both training scale and evaluation fidelity. They often require significant infrastructure overhead for training.

Production security frameworks like Strix provide comprehensive tooling for manual testing but lack the systematic training infrastructure and reward modeling required for autonomous agent development. The integration of training environment with production framework in this work bridges that gap.

Existing security benchmarks typically provide five to ten challenge scenarios. The fifty-plus task coverage here combined with procedural generation capabilities offers substantially greater training diversity. The multi-signal reward function provides richer learning signals than binary success-failure metrics.


8. Future Directions

Vulnerability Class Expansion

Extending the environment to additional vulnerability classes would enable training of general-purpose security testing agents. SQL injection presents similar structural challenges around input validation and context escaping. Server-Side Request Forgery requires understanding of internal network topology and service discovery. Authentication and authorization vulnerabilities demand reasoning about privilege levels and access control policies. Insecure Direct Object References require systematic enumeration and access testing.

Each new vulnerability class would require similar dual-mode infrastructure with simulated training environments and real application evaluation. The reward function framework generalizes naturally to other classes through appropriate component selection and weighting.

Target Application Diversity

Expanding evaluation beyond OWASP Juice Shop to additional intentionally vulnerable applications would improve generalization assessment. DVWA (Damn Vulnerable Web Application) provides classic web vulnerabilities in a simpler architecture. WebGoat offers comprehensive security training scenarios. HackTheBox and TryHackMe platforms provide diverse real-world-like challenges with varying difficulty levels.

Custom synthetic applications could be generated procedurally to test specific vulnerability patterns or technology stacks. This would enable curriculum learning across different application architectures and defensive configurations.

Simulation Fidelity Enhancement

Improving the simulated web application engine to capture additional browser behaviors would reduce the simulation-reality gap. More accurate CSS rendering simulation would enable testing of CSS-based exfiltration techniques. Expanded DOM API coverage would support more sophisticated JavaScript-heavy applications. Network-level behavior simulation including timing effects would enable race condition scenarios.

The challenge lies in balancing fidelity improvements against execution speed. The primary value of simulation is rapid iteration, so enhancements must not compromise training throughput significantly.

Multi-Step Exploitation Scenarios

Developing scenarios requiring vulnerability chaining would test more sophisticated attack planning. Combining XSS with Cross-Site Request Forgery enables account takeover chains. Privilege escalation scenarios require exploiting initial access vulnerabilities to reach administrative contexts. Post-exploitation objectives like data exfiltration or lateral movement would extend testing beyond initial compromise.

These scenarios would require extended episode lengths and more complex reward structures that credit intermediate progress toward final objectives.

Hybrid Training Approaches

Combining supervised fine-tuning on expert traces with reinforcement learning could accelerate learning. Collecting human security tester demonstrations would provide initial policy guidance. Supervised pre-training would establish baseline tool usage patterns. Reinforcement learning would then optimize for efficiency and coverage beyond human demonstrations.

Curriculum learning across difficulty levels could start with easy challenges for basic skill acquisition and progress to hard challenges for advanced techniques. This staged approach might converge faster than uniform sampling across all difficulties.

Advanced Reward Modeling

Improving reward signals could accelerate learning and shape more sophisticated behaviors. Better reflection detection tracking would reduce false negatives in identifying injection points. Chain completion bonuses would encourage multi-step attack sequences. Attack path diversity rewards would prevent over-specialization to particular exploitation patterns.

Learned reward models trained on human preferences for security reports could provide richer feedback than hand-coded rubrics. This would enable optimizing for report quality metrics beyond simple finding counts.


9. Conclusion

Strix-XSS-Eval provides production-ready infrastructure for training AI agents to detect Cross-Site Scripting vulnerabilities through a dual-mode reinforcement learning environment. The train mode enables large-scale, rapid experimentation through comprehensive simulation without external dependencies. The eval mode validates transfer to real-world systems through native integration with the Strix penetration testing framework and actual browser-based testing against OWASP Juice Shop.

The environment's design reflects careful attention to the competing demands of security agent training. High training throughput comes from eliminating Docker and network overhead during learning. Evaluation fidelity comes from testing against real applications with real browsers and real security tools. Production readiness comes from maintaining identical observation and action spaces across both modes and using native Strix tool formats throughout.

The expanded task coverage spanning over fifty challenges across eight major XSS categories provides rich training diversity. The multi-component reward function shapes behavior beyond simple vulnerability discovery to include professional testing methodology, systematic exploration, quality documentation, and resource efficiency. This comprehensive approach produces agents capable of effective security testing rather than mere exploit execution.

The dual-mode architecture represents a general pattern applicable beyond this specific environment. Many domains requiring both high-speed training and high-fidelity evaluation could benefit from similar separation of simulation and reality. The key insight is that most task structure and decision-making patterns can be learned efficiently through deterministic simulation, while validation of real-world transfer requires only periodic evaluation against actual systems.

Future work extending the environment to additional vulnerability classes, target applications, and exploitation scenarios will enable training of increasingly general-purpose security testing agents. The foundation established here provides the infrastructure, evaluation protocols, and reward modeling frameworks required for that expansion. The integration with production Strix workflows ensures that advances in training translate directly to deployable security testing capabilities.