Next-Gen QA: Implementing AI-Driven Multi-Turn Autonomous Acceptance Testing in Legacy Java Projects

Document Type: A technical implementation plan for evaluating the feasibility of introducing AI-driven autonomous testing into legacy Java projects. The architecture is based on current tool capabilities and is intended as a blueprint; a proof of concept (PoC) should validate the core assumptions before full-scale adoption.


Table of Contents

  1. Project Objectives and Scope
  2. Core Concept: From Automation to Autonomy
  3. Technology Stack and Versions
  4. Project Setup
  5. Core Component Implementation
  6. The 3-Loop Verification Strategy: State Machine Design
  7. Prompt Engineering
  8. Integration Layer Implementation
  9. Evaluation Framework and Metrics
  10. Adoption Plan and Milestones
  11. Risk Assessment and Mitigation
  12. Cost Estimation
  13. Decision Checkpoints

1. Project Objectives and Scope

1.1 The Problem

In large-scale legacy Java projects, we face the following testing dilemmas:

| Issue | Current Status | Impact |
|---|---|---|
| Insufficient Coverage | Unit test coverage ~60%; E2E tests cover only happy paths | Frequent edge-case bugs in production |
| Fragile Scripts | Frontend DOM changes break 30% of Selenium tests | 2-3 days spent fixing tests after every UI update |
| Inefficient Diagnosis | Avg. 2 hours to locate the root cause after a failure | Developer time wasted on debugging |
| Flaky Tests | ~15% of tests fail intermittently | Low confidence in CI/CD; frequent manual re-runs |

1.2 Project Goals

Phase 1 Goal (PoC, 8 Weeks):

  • Build an AI Diagnosis Assistant to automatically analyze root causes of test failures.
  • Target: Diagnosis accuracy > 80%, Avg. diagnosis time < 30s.

Phase 2 Goal (MVP, 12 Weeks):

  • Implement Visual Location capabilities to reduce test breakage from DOM changes.
  • Target: Increase test script survival rate from 70% to 95%.

Phase 3 Goal (Production, 16 Weeks):

  • Achieve Autonomous Exploratory Testing to discover edge cases uncovered by humans.
  • Target: New bugs discovered > 10 per month.

1.3 Out of Scope

  • Load Testing / Performance Testing
  • Security Penetration Testing
  • Mobile App Testing (Web only)
  • Replacing existing Unit and Integration Tests

2. Core Concept: From Automation to Autonomy

2.1 Traditional Automation vs. AI Autonomy

Traditional Automation (Imperative):
  Developer defines -> Click #login-btn -> Wait 2s -> Assert URL contains /dashboard

  Problems:
  - Fixed paths, cannot handle unexpected situations
  - Fragile element locators; breaks on DOM changes
  - Failure provides only Exceptions, no diagnostic insight

AI Autonomy (Declarative):
  Developer defines -> Goal: Buy as VIP, use coupon, verify total amount

  AI Behavior:
  - Plans path to achieve goal autonomously
  - Tries alternatives when blocked
  - Automatically investigates root causes upon failure

2.2 The 3-Loop Verification Concept

┌─────────────────────────────────────────────────────────────┐
│                    Exploration Loop                         │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                Diagnosis Loop                       │   │
│  │  ┌─────────────────────────────────────────────┐   │   │
│  │  │           Stability Loop                    │   │   │
│  │  │                                             │   │   │
│  │  │   Execute Single Test Action                │   │   │
│  │  │   ↓                                         │   │   │
│  │  │   Failure → Environment Issue? → Retry N    │   │   │
│  │  │                                             │   │   │
│  │  └─────────────────────────────────────────────┘   │   │
│  │                      ↓                             │   │
│  │   Still Failure → Collect Evidence → Analyze       │   │
│  │                                                    │   │
│  └─────────────────────────────────────────────────────┘   │
│                          ↓                                 │
│   Complete Current Path → Find Next Goal → Discover New    │
│                                                            │
└─────────────────────────────────────────────────────────────┘

3. Technology Stack and Versions

3.1 Core Stack

| Component | Tech Choice | Version | Rationale |
|---|---|---|---|
| LLM Orchestration | LangChain4j | 0.35.0 | Java-native, excellent Spring Boot integration, robust tool calling |
| LLM Model | GPT-4o | 2024-08-06 | Strong vision, stable reasoning, reliable function calling |
| Backup Model | GPT-4o-mini | 2024-07-18 | Lower cost; used for simple judgments |
| Browser Automation | Playwright | 1.48.0 | More stable than Selenium, multi-browser support, official Java API |
| Test Containers | Testcontainers | 1.20.3 | Database isolation, consistent environments |
| Observability | Micrometer + OTLP | 1.13.0 | Spring Boot integration, TraceId propagation support |

3.2 Dependency Compatibility Matrix

Spring Boot 3.3.x
├── Java 21 (required)
├── LangChain4j 0.35.0
│   └── langchain4j-open-ai 0.35.0
│   └── langchain4j-spring-boot-starter 0.35.0
├── Playwright 1.48.0
│   └── playwright-java 1.48.0
└── Testcontainers 1.20.3
    └── postgresql 1.20.3

4. Project Setup

4.1 Maven pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>ai-qa-agent</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>3.3.5</version>
        <relativePath/>
    </parent>

    <properties>
        <java.version>21</java.version>
        <langchain4j.version>0.35.0</langchain4j.version>
        <playwright.version>1.48.0</playwright.version>
        <testcontainers.version>1.20.3</testcontainers.version>
    </properties>

    <dependencies>
        <!-- Spring Boot -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>

        <!-- LangChain4j -->
        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-spring-boot-starter</artifactId>
            <version>${langchain4j.version}</version>
        </dependency>
        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-open-ai</artifactId>
            <version>${langchain4j.version}</version>
        </dependency>

        <!-- Playwright -->
        <dependency>
            <groupId>com.microsoft.playwright</groupId>
            <artifactId>playwright</artifactId>
            <version>${playwright.version}</version>
        </dependency>

        <!-- Testcontainers -->
        <dependency>
            <groupId>org.testcontainers</groupId>
            <artifactId>testcontainers</artifactId>
            <version>${testcontainers.version}</version>
        </dependency>
        <dependency>
            <groupId>org.testcontainers</groupId>
            <artifactId>postgresql</artifactId>
            <version>${testcontainers.version}</version>
        </dependency>

        <!-- Observability -->
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-tracing-bridge-otel</artifactId>
        </dependency>
        <dependency>
            <groupId>io.opentelemetry</groupId>
            <artifactId>opentelemetry-exporter-otlp</artifactId>
        </dependency>

        <!-- Utilities -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
        </dependency>

        <!-- Testing -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
            <!-- Install Playwright Browsers -->
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>3.1.0</version>
                <executions>
                    <execution>
                        <id>install-playwright-browsers</id>
                        <phase>generate-resources</phase>
                        <goals>
                            <goal>java</goal>
                        </goals>
                        <configuration>
                            <mainClass>com.microsoft.playwright.CLI</mainClass>
                            <arguments>
                                <argument>install</argument>
                                <argument>chromium</argument>
                            </arguments>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

4.2 application.yml

spring:
  application:
    name: ai-qa-agent

langchain4j:
  open-ai:
    chat-model:
      api-key: ${OPENAI_API_KEY}
      model-name: gpt-4o
      temperature: 0.1  # Stability is key for testing
      timeout: PT60S    # Vision analysis can be slow
      max-retries: 3
      log-requests: true
      log-responses: true

# AI QA Agent Configuration
ai-qa:
  browser:
    headless: true
    viewport-width: 1280
    viewport-height: 720
    timeout-ms: 30000

  loops:
    stability:
      max-retries: 3
      retry-delay-ms: 1000
      flakiness-threshold: 0.8  # success rate >= 0.8 but < 1.0 across retries is reported as flaky-but-passed
    diagnosis:
      collect-screenshot: true
      collect-console-logs: true
      collect-network-logs: true
      max-log-lines: 500
    exploration:
      max-depth: 10
      max-actions-per-page: 20

  cost:
    budget-per-test-usd: 0.50
    budget-per-day-usd: 100.00

  reporting:
    output-dir: ./test-reports
    screenshot-format: png

# Target System Configuration
target:
  base-url: ${TARGET_BASE_URL:http://localhost:8080}
  api-base-url: ${TARGET_API_URL:http://localhost:8080/api}

# Actuator (For collecting backend logs in Diagnosis Loop)
management:
  endpoints:
    web:
      exposure:
        include: health,info,loggers,httpexchanges  # Spring Boot 3 replaced the 'trace' endpoint with 'httpexchanges'
  tracing:
    sampling:
      probability: 1.0
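The `ai-qa.cost` caps above need runtime enforcement: every LLM call should be checked against both the per-test and per-day budget before it is made. A minimal sketch of such a guard (the `BudgetGuard` class and its method names are our own illustration, not part of any library):

```java
// Hypothetical budget guard enforcing the ai-qa.cost caps from application.yml.
// Class and method names are illustrative only.
class BudgetGuard {
    private final double perTestBudgetUsd;
    private final double perDayBudgetUsd;
    private double daySpendUsd = 0.0;

    BudgetGuard(double perTestBudgetUsd, double perDayBudgetUsd) {
        this.perTestBudgetUsd = perTestBudgetUsd;
        this.perDayBudgetUsd = perDayBudgetUsd;
    }

    /** Returns true if the projected cost still fits both budgets. */
    boolean canSpend(double testSpendSoFarUsd, double nextCallCostUsd) {
        return testSpendSoFarUsd + nextCallCostUsd <= perTestBudgetUsd
                && daySpendUsd + nextCallCostUsd <= perDayBudgetUsd;
    }

    /** Record actual spend after an LLM call completes. */
    void record(double costUsd) {
        daySpendUsd += costUsd;
    }

    double remainingDailyBudgetUsd() {
        return perDayBudgetUsd - daySpendUsd;
    }
}
```

A call that would exceed either cap should fail the test fast with a clear "budget exhausted" verdict rather than silently degrade.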

5. Core Component Implementation

5.1 OpenAI Configuration

package com.example.aiqaagent.config;

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.time.Duration;

@Configuration
public class OpenAiConfig {

    @Value("${langchain4j.open-ai.chat-model.api-key}")
    private String apiKey;

    /**
     * Primary Model: GPT-4o, for complex reasoning and vision.
     */
    @Bean
    public ChatLanguageModel primaryChatModel() {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o")
                .temperature(0.1)
                .timeout(Duration.ofSeconds(60))
                .maxRetries(3)
                .logRequests(true)
                .logResponses(true)
                .build();
    }

    /**
     * Lightweight Model: GPT-4o-mini, for cost-saving simple tasks.
     */
    @Bean
    public ChatLanguageModel lightweightChatModel() {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o-mini")
                .temperature(0.1)
                .timeout(Duration.ofSeconds(30))
                .maxRetries(3)
                .build();
    }
}
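Section 3.1 reserves GPT-4o-mini for simple judgments, which implies a routing decision before each call. A sketch of that heuristic over the two beans above (the `TaskType` taxonomy and `ModelRouter` class are our own assumption, not a LangChain4j feature):

```java
// Illustrative router choosing between the two model beans defined above.
// The TaskType categories are an assumption for this sketch.
enum TaskType { VISION_ANALYSIS, ROOT_CAUSE_DIAGNOSIS, ELEMENT_TEXT_MATCH, YES_NO_JUDGMENT }

class ModelRouter {
    /** Returns the Spring bean name of the model suited to the task. */
    static String beanFor(TaskType task) {
        switch (task) {
            case VISION_ANALYSIS:
            case ROOT_CAUSE_DIAGNOSIS:
                return "primaryChatModel";      // GPT-4o: vision and complex reasoning
            case ELEMENT_TEXT_MATCH:
            case YES_NO_JUDGMENT:
            default:
                return "lightweightChatModel";  // GPT-4o-mini: cheap simple judgments
        }
    }
}
```

Routing by task type rather than by caller keeps the cost policy in one place and makes it easy to audit which calls hit the expensive model.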

5.2 Playwright Config & Lifecycle

package com.example.aiqaagent.config;

import com.microsoft.playwright.*;
import jakarta.annotation.PreDestroy;
import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;

@Slf4j
@Configuration
public class PlaywrightConfig {

    @Value("${ai-qa.browser.headless:true}")
    private boolean headless;

    // ... viewport properties ...

    private Playwright playwright;
    private Browser browser;

    @Getter
    private volatile BrowserContext currentContext;

    @Getter
    private volatile Page currentPage;

    public synchronized void initialize() {
        if (playwright == null) {
            log.info("Initializing Playwright...");
            playwright = Playwright.create();
            browser = playwright.chromium().launch(
                new BrowserType.LaunchOptions()
                    .setHeadless(headless)
            );
            log.info("Playwright initialized successfully");
        }
    }

    public Page createNewPage() {
        initialize();
        if (currentContext != null) currentContext.close();

        currentContext = browser.newContext(
            new Browser.NewContextOptions().setViewportSize(1280, 720)
        );

        currentPage = currentContext.newPage();
        
        // Setup listeners
        currentPage.onConsoleMessage(msg ->
            log.debug("[Browser Console] {}: {}", msg.type(), msg.text())
        );

        return currentPage;
    }

    public String captureScreenshotBase64() {
        if (currentPage == null) throw new IllegalStateException("No active page");
        byte[] screenshot = currentPage.screenshot();
        return java.util.Base64.getEncoder().encodeToString(screenshot);
    }
    
    @PreDestroy
    public synchronized void cleanup() {
        // Close in reverse order of creation.
        if (currentPage != null) currentPage.close();
        if (currentContext != null) currentContext.close();
        if (browser != null) browser.close();
        if (playwright != null) playwright.close();
    }
}

5.3 Browser Tools

package com.example.aiqaagent.tools;

import dev.langchain4j.agent.tool.Tool;
import org.springframework.stereotype.Component;
// ... imports ...

@Component
@RequiredArgsConstructor
public class BrowserTools {

    private final PlaywrightConfig playwrightConfig;

    @Tool("Open specified URL. Returns page title.")
    public String navigateTo(String url) {
        Page page = playwrightConfig.getCurrentPage();
        page.navigate(url);
        page.waitForLoadState(LoadState.NETWORKIDLE);
        return "Loaded page: " + page.title();
    }

    @Tool("Click button/link containing text. Matches exact or partial.")
    public String clickByText(String text) {
        Page page = playwrightConfig.getCurrentPage();
        try {
            Locator locator = page.getByText(text);
            locator.first().waitFor();
            locator.first().click();
            return "Clicked element containing: " + text;
        } catch (TimeoutError e) {
            return "Could not find clickable element with text: " + text;
        }
    }

    @Tool("Get interactive elements (buttons, links, inputs) on current page.")
    public String getInteractiveElements() {
        // Scan the DOM for common interactive elements and return a
        // formatted list the LLM can reason over.
        Page page = playwrightConfig.getCurrentPage();
        StringBuilder sb = new StringBuilder();
        for (Locator el : page.locator("button, a, input, select, textarea").all()) {
            String tag = el.evaluate("e => e.tagName").toString().toLowerCase();
            String text = el.textContent() == null ? "" : el.textContent().trim();
            sb.append(tag).append(": ").append(text).append('\n');
        }
        return sb.toString();
    }
    
    // ... other tools like fillInput, scroll, pressKey ...
}

5.4 Vision Tools

package com.example.aiqaagent.tools;

import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.data.message.*;
// ... imports ...

@Component
@RequiredArgsConstructor
public class VisionTools {

    private final PlaywrightConfig playwrightConfig;
    
    // Note: Lombok's @RequiredArgsConstructor does not copy @Qualifier onto the
    // generated constructor parameter unless lombok.config contains:
    //   lombok.copyableAnnotations += org.springframework.beans.factory.annotation.Qualifier
    @Qualifier("primaryChatModel")
    private final ChatLanguageModel visionModel;

    private static final String VISION_LOCATE_PROMPT = """
        You are a Vision Analysis Assistant.
        Task: Locate the element matching this description in the screenshot: %s
        
        Return CENTER coordinates:
        COORDINATES: x=123, y=456
        
        If not found:
        NOT_FOUND: Reason
        """;

    @Tool("Locate element using Vision AI based on visual description.")
    public String clickByVision(String visualDescription) {
        String screenshotBase64 = playwrightConfig.captureScreenshotBase64();
        
        UserMessage msg = UserMessage.from(
            ImageContent.from(screenshotBase64, "image/png"),
            TextContent.from(String.format(VISION_LOCATE_PROMPT, visualDescription))
        );

        String response = visionModel.generate(msg).content().text();
        
        // Parse coordinates and click using Playwright
        // ... implementation ...
        
        return "Clicked via vision at " + response;
    }
}
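The `// Parse coordinates` step above can be made concrete with a small parser for the reply format defined by VISION_LOCATE_PROMPT (the `VisionResponseParser` and `Point` names are our own):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Parses the "COORDINATES: x=123, y=456" / "NOT_FOUND: reason" reply
// format defined by VISION_LOCATE_PROMPT. Names are illustrative.
class VisionResponseParser {
    record Point(int x, int y) {}

    private static final Pattern COORDS =
            Pattern.compile("COORDINATES:\\s*x=(\\d+),\\s*y=(\\d+)");

    static Optional<Point> parse(String llmResponse) {
        Matcher m = COORDS.matcher(llmResponse);
        if (m.find()) {
            return Optional.of(new Point(Integer.parseInt(m.group(1)),
                                         Integer.parseInt(m.group(2))));
        }
        return Optional.empty(); // covers NOT_FOUND and malformed replies
    }
}
```

With a parsed point, the actual click reduces to Playwright's `page.mouse().click(p.x(), p.y())`; an empty result should be surfaced to the agent as a tool failure, not swallowed.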

5.5 Diagnostic & Data Tools

(Conceptual only: DiagnosticTools gathers evidence on failure — screenshots, console output, and backend logs correlated by TraceId — while DataTools provisions and resets an isolated Testcontainers PostgreSQL instance for test data.)


6. The 3-Loop Verification Strategy: State Machine Design

6.1 Loop Logic

  1. Stability Loop: Handles flaky tests.

    • If an action fails with a transient error (timeout, HTTP 503), retry up to N times.
    • If the action eventually succeeds after one or more retries, mark it “Flaky” but report it as passed.
  2. Diagnosis Loop: Handles hard failures.

    • Collects evidence (screenshot + console output + backend logs correlated via TraceId).
    • Asks the AI to classify the root cause (frontend vs. backend vs. data vs. environment).
  3. Exploration Loop: Handles path planning.

    • Determines the next action from the current goal and the observed page state.
    • Prioritizes unexplored paths to maximize coverage, in the spirit of an exploration bonus in reinforcement learning.
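The Stability Loop can be sketched as a plain retry wrapper; the names, the `Supplier<Boolean>` action shape, and the treatment of any `RuntimeException` as transient are assumptions of this sketch, not a fixed design:

```java
import java.util.function.Supplier;

// Minimal Stability Loop sketch: retry transient failures up to maxRetries,
// and classify an action that eventually passed after failing as FLAKY.
class StabilityLoop {
    enum Verdict { PASSED, FLAKY, FAILED }

    static Verdict run(Supplier<Boolean> action, int maxRetries, long retryDelayMs) {
        int failures = 0;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            boolean ok;
            try {
                ok = action.get();
            } catch (RuntimeException transientError) { // e.g. timeout, HTTP 503
                ok = false;
            }
            if (ok) {
                return failures == 0 ? Verdict.PASSED : Verdict.FLAKY;
            }
            failures++;
            try {
                Thread.sleep(retryDelayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Verdict.FAILED;
            }
        }
        return Verdict.FAILED; // hand off to the Diagnosis Loop
    }
}
```

A FAILED verdict is the trigger for the Diagnosis Loop; a FLAKY verdict passes the run but should be recorded so chronically flaky steps surface in reporting.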

7. Prompt Engineering

7.1 System Prompt

You are a Senior QA Automation Engineer.

Capabilities:
1. Test Planning: Plan paths based on business goals.
2. Execution: Operate browser via tools.
3. Diagnosis: Analyze root causes upon failure.
4. Exploration: Proactively find edge cases.

Guidelines:
- Verify result after every step.
- If action fails, try alternatives before reporting failure.
- Collect evidence (screenshots) regularly.
- When diagnosing, distinguish between Frontend, Backend, and Environment issues.

7.2 Diagnosis Prompt

Analyze this test failure.

Action: {action_description}
Error: {error_message}
Evidence: {evidence}

Provide:
1. Root Cause Category (FRONTEND_BUG, BACKEND_BUG, ENVIRONMENT, TEST_SCRIPT, DATA_ISSUE)
2. Description
3. Technical Details
4. Suggested Fix
5. Confidence Level
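Filling the `{action_description}`, `{error_message}`, and `{evidence}` slots can be done with a tiny substitution helper; this is an illustration only — in the actual stack, LangChain4j's `PromptTemplate` would do this job:

```java
import java.util.Map;

// Naive placeholder substitution for the diagnosis prompt above.
// Illustrative; the real implementation would use LangChain4j's PromptTemplate.
class DiagnosisPromptBuilder {
    static String fill(String template, Map<String, String> vars) {
        String out = template;
        for (Map.Entry<String, String> e : vars.entrySet()) {
            out = out.replace("{" + e.getKey() + "}", e.getValue());
        }
        return out;
    }
}
```

Whatever mechanism is used, evidence should be sanitized (see the privacy mitigation in Section 11) before being interpolated into the prompt.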

8. Integration Layer Implementation

8.1 AutonomousTester Interface

// LangChain4j AI Service sketch. Tools (BrowserTools, VisionTools, etc.) are
// registered on the AiServices builder, not declared on this interface, and
// template variables must be bound with @V ({{goal}} renders via toString()).
public interface AutonomousTester {

    @SystemMessage("You are a Senior QA Automation Engineer. ...")
    @UserMessage("Test Goal: {{goal}}")
    TestReport performTest(@MemoryId String testId, @V("goal") TestGoal goal);
}

8.2 TestOrchestratorService

Orchestrates the LoopStateMachine and calls AutonomousTester. Manages the lifecycle of Playwright pages and Testcontainers.
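The LoopStateMachine could be modeled as a small state-transition table over the three loops of Section 6; the state and event names below are our own naming, and a production version would carry context (evidence, budgets, goal queue) alongside the state:

```java
// Sketch of the 3-loop state machine from Section 6. Names are illustrative.
class LoopStateMachine {
    enum State { EXECUTING, RETRYING, DIAGNOSING, EXPLORING, DONE }
    enum Event { ACTION_OK, TRANSIENT_FAILURE, RETRIES_EXHAUSTED,
                 DIAGNOSIS_COMPLETE, GOALS_REMAIN, ALL_GOALS_DONE }

    static State next(State current, Event event) {
        return switch (event) {
            case ACTION_OK          -> State.EXPLORING;   // choose next action/goal
            case TRANSIENT_FAILURE  -> State.RETRYING;    // Stability Loop
            case RETRIES_EXHAUSTED  -> State.DIAGNOSING;  // Diagnosis Loop
            case DIAGNOSIS_COMPLETE -> State.EXPLORING;   // report, then continue
            case GOALS_REMAIN       -> State.EXECUTING;   // Exploration Loop continues
            case ALL_GOALS_DONE     -> State.DONE;
        };
    }
}
```

Keeping transitions in one pure function makes the orchestration logic unit-testable without a browser or an LLM in the loop.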


9. Evaluation Framework and Metrics

| Metric | Phase 1 (PoC) | Phase 2 (MVP) | Phase 3 (Prod) |
|---|---|---|---|
| Diagnosis Accuracy | > 80% | > 90% | > 95% |
| Script Survival Rate | N/A | > 95% | > 99% |
| New Bugs Found | N/A | N/A | > 10/month |

10. Adoption Plan and Milestones

Phase 1: Diagnosis Assistant (8 Weeks)

  • Goal: AI analyzes failure logs from existing CI pipelines.
  • Deliverable: Automated Root Cause Analysis Report attached to Jenkins builds.

Phase 2: Visual & Self-Healing (12 Weeks)

  • Goal: Implement VisionTools and Self-Healing locators.
  • Deliverable: A test suite that survives major UI refactoring without manual fixes.

Phase 3: Autonomous Exploration (16 Weeks)

  • Goal: Full “Nightly Build” exploration.
  • Deliverable: Autonomous testing of core business flows with minimal human input.

11. Risk Assessment and Mitigation

| Risk | Mitigation |
|---|---|
| High Token Cost | Use GPT-4o-mini for simple tasks; optimize prompts; enforce strict budget caps. |
| AI Hallucination | Verify findings via the Stability Loop; keep a human in the loop during initial rollout. |
| Data Privacy | Sanitize test data; do not send PII to the LLM; use a local LLM (e.g. Llama 3) where strict privacy is required. |
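The "do not send PII to the LLM" mitigation implies a sanitization pass over all evidence (logs, DOM text) before it leaves the test environment. A minimal regex-based sketch — the two patterns are illustrative and deliberately incomplete; a real deployment needs a proper data-classification pass:

```java
import java.util.regex.Pattern;

// Minimal PII scrubber for evidence text before it is sent to the LLM.
// These patterns are illustrative only and do not cover all PII types.
class PiiSanitizer {
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern CARD =
            Pattern.compile("\\b\\d(?:[ -]?\\d){12,15}\\b"); // 13-16 digit card-like runs

    static String sanitize(String text) {
        String out = EMAIL.matcher(text).replaceAll("[EMAIL]");
        return CARD.matcher(out).replaceAll("[CARD]");
    }
}
```

This scrubber would sit in the Diagnosis Loop's evidence-collection step, so nothing upstream has to remember to redact.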

12. Cost Estimation

Estimated Monthly Cost (Production): ~$1,375 USD

  • Based on heavy usage of Vision API (most expensive component).
  • Can be optimized by using caching and hybrid models.

13. Decision Checkpoints

  • Week 8 (PoC End): Is Diagnosis Accuracy > 80%? If not, refine prompts or switch models.
  • Week 20 (MVP End): Is Script Survival Rate > 95%? If yes, proceed to full autonomy.
