Document Type: This is a technical implementation plan for evaluating the feasibility of introducing AI Autonomous Testing into legacy Java projects. The architectural design is based on current tool capabilities and is intended as a blueprint. It is recommended to perform a PoC to validate core assumptions before full-scale adoption.
Table of Contents
- Project Objectives and Scope
- Core Concept: From Automation to Autonomy
- Technology Stack and Versions
- Project Setup
- Core Component Implementation
- The 3-Loop Verification Strategy: State Machine Design
- Prompt Engineering
- Integration Layer Implementation
- Evaluation Framework and Metrics
- Adoption Plan and Milestones
- Risk Assessment and Mitigation
- Cost Estimation
- Decision Checkpoints
1. Project Objectives and Scope
1.1 The Problem
In large-scale legacy Java projects, we face the following testing dilemmas:
| Issue | Current Status | Impact |
|---|---|---|
| Insufficient Coverage | Unit Test coverage ~60%, E2E tests only cover Happy Paths | Frequent edge-case bugs in production |
| Fragile Scripts | Frontend DOM changes break 30% of Selenium tests | 2-3 days spent fixing tests after every UI update |
| Inefficient Diagnosis | Avg. 2 hours to locate root cause after failure | Developer time wasted on debugging |
| Flaky Tests | ~15% of tests fail intermittently | Low confidence in CI/CD; frequent manual re-runs |
1.2 Project Goals
Phase 1 Goal (PoC, 8 Weeks):
- Build an AI Diagnosis Assistant to automatically analyze root causes of test failures.
- Target: Diagnosis accuracy > 80%, Avg. diagnosis time < 30s.
Phase 2 Goal (MVP, 12 Weeks):
- Implement Visual Location capabilities to reduce test breakage from DOM changes.
- Target: Increase test script survival rate from 70% to 95%.
Phase 3 Goal (Production, 16 Weeks):
- Achieve Autonomous Exploratory Testing to discover edge cases uncovered by humans.
- Target: New bugs discovered > 10 per month.
1.3 Out of Scope
- Load Testing / Performance Testing
- Security Penetration Testing
- Mobile App Testing (Web only)
- Replacing existing Unit and Integration Tests
2. Core Concept: From Automation to Autonomy
2.1 Traditional Automation vs. AI Autonomy
Traditional Automation (Imperative):
Developer defines -> Click #login-btn -> Wait 2s -> Assert URL contains /dashboard
Problems:
- Fixed paths, cannot handle unexpected situations
- Fragile element locators; breaks on DOM changes
- Failure provides only Exceptions, no diagnostic insight
AI Autonomy (Declarative):
Developer defines -> Goal: Buy as VIP, use coupon, verify total amount
AI Behavior:
- Plans path to achieve goal autonomously
- Tries alternatives when blocked
- Automatically investigates root causes upon failure
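A declarative goal can be represented as plain data rather than a scripted path. A minimal sketch, assuming a `TestGoal` record (the name and fields are illustrative, not from any framework):

```java
import java.util.List;

/** A declarative test goal: states WHAT to verify, not HOW to click. */
public record TestGoal(String description, List<String> successCriteria) {

    /** A goal is only actionable if it states at least one verifiable criterion. */
    public boolean isVerifiable() {
        return successCriteria != null && !successCriteria.isEmpty();
    }
}
```

Because the agent plans its own path to satisfy `successCriteria`, the same goal definition survives UI refactors that would break an imperative script.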
2.2 The 3-Loop Verification Concept
┌─────────────────────────────────────────────────────────────┐
│ Exploration Loop │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ Diagnosis Loop │ │
│ │ ┌─────────────────────────────────────────────┐ │ │
│ │ │ Stability Loop │ │ │
│ │ │ │ │ │
│ │ │ Execute Single Test Action │ │ │
│ │ │ ↓ │ │ │
│ │ │ Failure → Environment Issue? → Retry N │ │ │
│ │ │ │ │ │
│ │ └─────────────────────────────────────────────┘ │ │
│ │ ↓ │ │
│ │ Still Failure → Collect Evidence → Analyze │ │
│ │ │ │
│ └─────────────────────────────────────────────────────┘ │
│ ↓ │
│ Complete Current Path → Find Next Goal → Discover New │
│ │
└─────────────────────────────────────────────────────────────┘
3. Technology Stack and Versions
3.1 Core Stack
| Component | Tech Choice | Version | Rationale |
|---|---|---|---|
| LLM Orchestration | LangChain4j | 0.35.0 | Java-native, excellent Spring Boot integration, robust Tool Calling |
| LLM Model | GPT-4o | 2024-08-06 | Superior vision, stable reasoning, best Function Calling support |
| Backup Model | GPT-4o-mini | 2024-07-18 | Lower cost, used for simple judgments |
| Browser Automation | Playwright | 1.48.0 | More stable than Selenium, multi-browser support, official Java API |
| Test Containers | Testcontainers | 1.20.3 | Database isolation, consistent environment |
| Observability | Micrometer + OTLP | 1.13.0 | Spring Boot integration, TraceId propagation support |
3.2 Dependency Compatibility Matrix
Spring Boot 3.3.x
├── Java 21 (required)
├── LangChain4j 0.35.0
│ └── langchain4j-open-ai 0.35.0
│ └── langchain4j-spring-boot-starter 0.35.0
├── Playwright 1.48.0
│ └── playwright-java 1.48.0
└── Testcontainers 1.20.3
└── postgresql 1.20.3
4. Project Setup
4.1 Maven pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<artifactId>ai-qa-agent</artifactId>
<version>1.0.0-SNAPSHOT</version>
<packaging>jar</packaging>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>3.3.5</version>
<relativePath/>
</parent>
<properties>
<java.version>21</java.version>
<langchain4j.version>0.35.0</langchain4j.version>
<playwright.version>1.48.0</playwright.version>
<testcontainers.version>1.20.3</testcontainers.version>
</properties>
<dependencies>
<!-- Spring Boot -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- LangChain4j -->
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-spring-boot-starter</artifactId>
<version>${langchain4j.version}</version>
</dependency>
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-open-ai</artifactId>
<version>${langchain4j.version}</version>
</dependency>
<!-- Playwright -->
<dependency>
<groupId>com.microsoft.playwright</groupId>
<artifactId>playwright</artifactId>
<version>${playwright.version}</version>
</dependency>
<!-- Testcontainers -->
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>testcontainers</artifactId>
<version>${testcontainers.version}</version>
</dependency>
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>postgresql</artifactId>
<version>${testcontainers.version}</version>
</dependency>
<!-- Observability -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
<!-- Utilities -->
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
</dependency>
<!-- Testing -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
<!-- Install Playwright Browsers -->
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>3.1.0</version>
<executions>
<execution>
<id>install-playwright-browsers</id>
<phase>generate-resources</phase>
<goals>
<goal>java</goal>
</goals>
<configuration>
<mainClass>com.microsoft.playwright.CLI</mainClass>
<arguments>
<argument>install</argument>
<argument>chromium</argument>
</arguments>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
4.2 application.yml
spring:
  application:
    name: ai-qa-agent

langchain4j:
  open-ai:
    chat-model:
      api-key: ${OPENAI_API_KEY}
      model-name: gpt-4o
      temperature: 0.1   # Stability is key for testing
      timeout: PT60S     # Vision analysis can be slow
      max-retries: 3
      log-requests: true
      log-responses: true

# AI QA Agent Configuration
ai-qa:
  browser:
    headless: true
    viewport-width: 1280
    viewport-height: 720
    timeout-ms: 30000
  loops:
    stability:
      max-retries: 3
      retry-delay-ms: 1000
      flakiness-threshold: 0.8  # success rate >= 80% but < 100% -> "Flaky but Passed"
    diagnosis:
      collect-screenshot: true
      collect-console-logs: true
      collect-network-logs: true
      max-log-lines: 500
    exploration:
      max-depth: 10
      max-actions-per-page: 20
  cost:
    budget-per-test-usd: 0.50
    budget-per-day-usd: 100.00
  reporting:
    output-dir: ./test-reports
    screenshot-format: png

# Target System Configuration
target:
  base-url: ${TARGET_BASE_URL:http://localhost:8080}
  api-base-url: ${TARGET_API_URL:http://localhost:8080/api}

# Actuator (for collecting backend logs in the Diagnosis Loop)
management:
  endpoints:
    web:
      exposure:
        include: health,info,loggers,httpexchanges  # Spring Boot 3 renamed "trace" to "httpexchanges"
  tracing:
    sampling:
      probability: 1.0
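The `cost` caps above need runtime enforcement before each LLM call. A framework-free sketch of such a guard (the class name and method signature are assumptions, not part of any library):

```java
import java.util.concurrent.atomic.DoubleAdder;

/** Rejects further LLM calls once the per-test or per-day spend cap is hit. */
public class CostBudgetGuard {

    private final double budgetPerTestUsd;
    private final double budgetPerDayUsd;
    private final DoubleAdder dailySpendUsd = new DoubleAdder();

    public CostBudgetGuard(double budgetPerTestUsd, double budgetPerDayUsd) {
        this.budgetPerTestUsd = budgetPerTestUsd;
        this.budgetPerDayUsd = budgetPerDayUsd;
    }

    /** Records one call's cost; returns true if the test may continue. */
    public boolean record(double testSpendSoFarUsd, double callCostUsd) {
        dailySpendUsd.add(callCostUsd);
        return testSpendSoFarUsd + callCostUsd <= budgetPerTestUsd
                && dailySpendUsd.sum() <= budgetPerDayUsd;
    }
}
```

The orchestrator would check `record(...)` before dispatching the next model call and abort the test (with a budget-exceeded status) on `false`.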
5. Core Component Implementation
5.1 OpenAI Configuration
package com.example.aiqaagent.config;
import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import java.time.Duration;
@Configuration
public class OpenAiConfig {
@Value("${langchain4j.open-ai.chat-model.api-key}")
private String apiKey;
/**
* Primary Model: GPT-4o, for complex reasoning and vision.
*/
@Bean
public ChatLanguageModel primaryChatModel() {
return OpenAiChatModel.builder()
.apiKey(apiKey)
.modelName("gpt-4o")
.temperature(0.1)
.timeout(Duration.ofSeconds(60))
.maxRetries(3)
.logRequests(true)
.logResponses(true)
.build();
}
/**
* Lightweight Model: GPT-4o-mini, for cost-saving simple tasks.
*/
@Bean
public ChatLanguageModel lightweightChatModel() {
return OpenAiChatModel.builder()
.apiKey(apiKey)
.modelName("gpt-4o-mini")
.temperature(0.1)
.timeout(Duration.ofSeconds(30))
.maxRetries(3)
.build();
}
}
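Realizing the cost savings from the backup model requires a routing rule between the two beans above. A trivial sketch of one possible heuristic (the enum and thresholds are illustrative assumptions, not LangChain4j API):

```java
/** Routes a task to the primary or lightweight model by complexity. */
public enum ModelTier {
    PRIMARY,      // GPT-4o: vision analysis, multi-step reasoning
    LIGHTWEIGHT;  // GPT-4o-mini: yes/no judgments, simple extraction

    public static ModelTier forTask(boolean needsVision, int expectedReasoningSteps) {
        return (needsVision || expectedReasoningSteps > 3) ? PRIMARY : LIGHTWEIGHT;
    }
}
```

For example, "is this error message a timeout?" would route to `LIGHTWEIGHT`, while any screenshot analysis always routes to `PRIMARY`.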
5.2 Playwright Config & Lifecycle
package com.example.aiqaagent.config;
import com.microsoft.playwright.*;
import jakarta.annotation.PreDestroy;
import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;
@Slf4j
@Configuration
public class PlaywrightConfig {
@Value("${ai-qa.browser.headless:true}")
private boolean headless;
// ... viewport properties ...
private Playwright playwright;
private Browser browser;
@Getter
private volatile BrowserContext currentContext;
@Getter
private volatile Page currentPage;
public synchronized void initialize() {
if (playwright == null) {
log.info("Initializing Playwright...");
playwright = Playwright.create();
browser = playwright.chromium().launch(
new BrowserType.LaunchOptions()
.setHeadless(headless)
);
log.info("Playwright initialized successfully");
}
}
public Page createNewPage() {
initialize();
if (currentContext != null) currentContext.close();
currentContext = browser.newContext(
new Browser.NewContextOptions().setViewportSize(1280, 720)
);
currentPage = currentContext.newPage();
// Setup listeners
currentPage.onConsoleMessage(msg ->
log.debug("[Browser Console] {}: {}", msg.type(), msg.text())
);
return currentPage;
}
public String captureScreenshotBase64() {
if (currentPage == null) throw new IllegalStateException("No active page");
byte[] screenshot = currentPage.screenshot();
return java.util.Base64.getEncoder().encodeToString(screenshot);
}
@PreDestroy
public synchronized void cleanup() {
    log.info("Shutting down Playwright...");
    if (currentContext != null) currentContext.close();
    if (browser != null) browser.close();
    if (playwright != null) playwright.close();
}
}
5.3 Browser Tools
package com.example.aiqaagent.tools;
import dev.langchain4j.agent.tool.Tool;
import org.springframework.stereotype.Component;
// ... imports ...
@Component
@RequiredArgsConstructor
public class BrowserTools {
private final PlaywrightConfig playwrightConfig;
@Tool("Open specified URL. Returns page title.")
public String navigateTo(String url) {
Page page = playwrightConfig.getCurrentPage();
page.navigate(url);
page.waitForLoadState(LoadState.NETWORKIDLE);
return "Loaded page: " + page.title();
}
@Tool("Click button/link containing text. Matches exact or partial.")
public String clickByText(String text) {
Page page = playwrightConfig.getCurrentPage();
try {
Locator locator = page.getByText(text);
locator.first().waitFor();
locator.first().click();
return "Clicked element containing: " + text;
} catch (TimeoutError e) {
return "Could not find clickable element with text: " + text;
}
}
@Tool("Get interactive elements (buttons, links, inputs) on current page.")
public String getInteractiveElements() {
    Page page = playwrightConfig.getCurrentPage();
    // Scan the DOM in-browser and return a compact, LLM-friendly list.
    Object elements = page.evaluate("""
            () => Array.from(document.querySelectorAll('button, a, input, select, textarea'))
                .filter(el => el.offsetParent !== null)
                .map(el => (el.tagName.toLowerCase() + ': '
                    + (el.innerText || el.placeholder || el.name || '')).trim())
                .join('\\n')
            """);
    return String.valueOf(elements);
}
// ... other tools like fillInput, scroll, pressKey ...
}
5.4 Vision Tools
package com.example.aiqaagent.tools;
import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.data.message.*;
// ... imports ...
@Component
@RequiredArgsConstructor
public class VisionTools {
private final PlaywrightConfig playwrightConfig;
// Note: for Lombok's @RequiredArgsConstructor to copy @Qualifier onto the
// generated constructor parameter, lombok.config must contain:
// lombok.copyableAnnotations += org.springframework.beans.factory.annotation.Qualifier
@Qualifier("primaryChatModel")
private final ChatLanguageModel visionModel;
private static final String VISION_LOCATE_PROMPT = """
You are a Vision Analysis Assistant.
Task: Locate the element matching this description in the screenshot: %s
Return CENTER coordinates:
COORDINATES: x=123, y=456
If not found:
NOT_FOUND: Reason
""";
@Tool("Locate element using Vision AI based on visual description.")
public String clickByVision(String visualDescription) {
String screenshotBase64 = playwrightConfig.captureScreenshotBase64();
UserMessage msg = UserMessage.from(
ImageContent.from(screenshotBase64, "image/png"),
TextContent.from(String.format(VISION_LOCATE_PROMPT, visualDescription))
);
String response = visionModel.generate(msg).content().text();
// Parse coordinates and click using Playwright
// ... implementation ...
return "Clicked via vision at " + response;
}
}
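The coordinate-parsing step elided in `clickByVision` can be a small pure-Java helper; the click itself would then be issued with Playwright's `page.mouse().click(x, y)`. A sketch (class and method names are illustrative):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parses the "COORDINATES: x=123, y=456" line the vision prompt asks for. */
public final class VisionResponseParser {

    private static final Pattern COORDS =
            Pattern.compile("COORDINATES:\\s*x=(\\d+),\\s*y=(\\d+)");

    public record Point(int x, int y) {}

    private VisionResponseParser() {}

    /** Returns empty for a NOT_FOUND (or malformed) model response. */
    public static Optional<Point> parse(String response) {
        Matcher m = COORDS.matcher(response);
        if (m.find()) {
            return Optional.of(new Point(Integer.parseInt(m.group(1)),
                                         Integer.parseInt(m.group(2))));
        }
        return Optional.empty();
    }
}
```

Keeping the parser separate from the tool makes the brittle part (interpreting free-form model output) unit-testable without a browser or an API key.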
5.5 Diagnostic & Data Tools
(Conceptual implementation: DiagnosticTools collects logs and screenshots as failure evidence; DataTools manages a Testcontainers PostgreSQL instance for test-data setup and isolation.)
6. The 3-Loop Verification Strategy: State Machine Design
6.1 Loop Logic
- Stability Loop: Handles flaky tests.
  - If an action fails with a transient error (timeout, HTTP 503), retry up to N times.
  - If the action eventually succeeds on retry (success rate below 100% but at or above the flakiness threshold), mark it "Flaky but Passed".
- Diagnosis Loop: Handles hard failures.
  - Collects evidence (screenshot + console logs + backend logs correlated via TraceId).
  - Asks the AI to classify the root cause (frontend vs. backend vs. data vs. environment).
- Exploration Loop: Handles path planning.
  - Determines the next action from the current goal and page state.
  - Uses a reinforcement-learning-style heuristic to prioritize unexplored paths and maximize coverage.
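The three loops can be driven by a small state machine. A minimal sketch of the transition rule for a single action (state names and the method signature are illustrative assumptions):

```java
/** States of the 3-loop verification flow. */
public enum LoopState {
    EXECUTING, RETRYING, DIAGNOSING, EXPLORING, PASSED, FAILED;

    /** Transition for one action outcome inside the Stability Loop. */
    public static LoopState onActionResult(boolean success, boolean transientError,
                                           int attempt, int maxRetries) {
        if (success) return PASSED;
        if (transientError && attempt < maxRetries) return RETRYING;
        return DIAGNOSING;  // hard failure, or retries exhausted -> Diagnosis Loop
    }
}
```

An outer driver would then map `PASSED` to the Exploration Loop (find the next goal) and `DIAGNOSING` to evidence collection.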
7. Prompt Engineering
7.1 System Prompt
You are a Senior QA Automation Engineer.
Capabilities:
1. Test Planning: Plan paths based on business goals.
2. Execution: Operate browser via tools.
3. Diagnosis: Analyze root causes upon failure.
4. Exploration: Proactively find edge cases.
Guidelines:
- Verify result after every step.
- If action fails, try alternatives before reporting failure.
- Collect evidence (screenshots) regularly.
- When diagnosing, distinguish between Frontend, Backend, and Environment issues.
7.2 Diagnosis Prompt
Analyze this test failure.
Action: {action_description}
Error: {error_message}
Evidence: {evidence}
Provide:
1. Root Cause Category (FRONTEND_BUG, BACKEND_BUG, ENVIRONMENT, TEST_SCRIPT, DATA_ISSUE)
2. Description
3. Technical Details
4. Suggested Fix
5. Confidence Level
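Since the model answers in free text, the category line should be parsed leniently on the Java side. A sketch (the `UNKNOWN` fallback is an addition not listed in the prompt above):

```java
/** Root-cause categories the diagnosis prompt asks the model to emit. */
public enum RootCauseCategory {
    FRONTEND_BUG, BACKEND_BUG, ENVIRONMENT, TEST_SCRIPT, DATA_ISSUE, UNKNOWN;

    /** Lenient parse: first category keyword found anywhere in the reply. */
    public static RootCauseCategory fromResponse(String response) {
        for (RootCauseCategory c : values()) {
            if (c != UNKNOWN && response.contains(c.name())) {
                return c;
            }
        }
        return UNKNOWN;
    }
}
```

Mapping to `UNKNOWN` instead of throwing keeps a malformed model reply from turning a diagnosed failure into a second, unrelated failure.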
8. Integration Layer Implementation
8.1 AutonomousTester Interface
public interface AutonomousTester {
    @SystemMessage("You are a Senior QA Automation Engineer... (full prompt in Section 7.1)")
    @UserMessage("Test Goal: {{goal}}")
    TestReport performTest(@MemoryId String testId, @V("goal") String goal);
}
Note: with LangChain4j AI Services, tools are not declared on the service interface, and template variables must be bound with @V (nested property access such as {{goal.description}} is not supported). BrowserTools, VisionTools, and DiagnosticTools are registered when the service is built: AiServices.builder(AutonomousTester.class).chatLanguageModel(primaryChatModel).tools(browserTools, visionTools, diagnosticTools).build().
8.2 TestOrchestratorService
Orchestrates the LoopStateMachine and calls AutonomousTester. Manages the lifecycle of Playwright pages and Testcontainers.
9. Evaluation Framework and Metrics
| Metric | Phase 1 (PoC) | Phase 2 (MVP) | Phase 3 (Prod) |
|---|---|---|---|
| Diagnosis Accuracy | > 80% | > 90% | > 95% |
| Script Survival Rate | N/A | > 95% | > 99% |
| New Bugs Found | N/A | N/A | > 10/month |
10. Adoption Plan and Milestones
Phase 1: Diagnosis Assistant (8 Weeks)
- Goal: AI analyzes failure logs from existing CI pipelines.
- Deliverable: Automated Root Cause Analysis Report attached to Jenkins builds.
Phase 2: Visual & Self-Healing (12 Weeks)
- Goal: Implement VisionTools and Self-Healing locators.
- Deliverable: A test suite that survives major UI refactoring without manual fixes.
Phase 3: Autonomous Exploration (16 Weeks)
- Goal: Full “Nightly Build” exploration.
- Deliverable: Autonomous testing of core business flows with minimal human input.
11. Risk Assessment and Mitigation
| Risk | Mitigation |
|---|---|
| High Token Cost | Use GPT-4o-mini for simple tasks; optimize Prompts; strict budget caps. |
| AI Hallucination | Implement “Stability Loop” to verify findings; Human-in-the-loop for initial training. |
| Data Privacy | Sanitize test data; Do not send PII to LLM; Use local LLMs (Llama 3) if strict privacy is required. |
12. Cost Estimation
Estimated Monthly Cost (Production): ~$1,375 USD
- Based on heavy usage of Vision API (most expensive component).
- Can be optimized by using caching and hybrid models.
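The ~$1,375 figure is not broken down in this plan; as a back-of-envelope aid, cost scales linearly with test volume and per-test LLM calls. A sketch of that arithmetic (every input below is a placeholder assumption — substitute current OpenAI pricing and your own volumes):

```java
/** Back-of-envelope monthly LLM cost. All inputs are assumptions. */
public final class CostEstimator {

    private CostEstimator() {}

    public static double monthlyCostUsd(int testsPerDay, int llmCallsPerTest,
                                        double avgCostPerCallUsd, int daysPerMonth) {
        return testsPerDay * llmCallsPerTest * avgCostPerCallUsd * daysPerMonth;
    }
}
```

Because the relationship is linear, halving vision calls per test (e.g., via screenshot caching) halves that component of the bill.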
13. Decision Checkpoints
- Week 8 (PoC End): Is Diagnosis Accuracy > 80%? If not, refine prompts or switch models.
- Week 20 (MVP End): Is Script Survival Rate > 95%? If yes, proceed to full autonomy.