Next-Gen QA: Implementing AI-Driven Multi-Turn Autonomous Acceptance Testing in Legacy Java Projects

Document Type: A technical implementation plan for evaluating the feasibility of introducing AI-driven autonomous testing into legacy Java projects. The architecture is based on current tool capabilities and is intended as a blueprint; a proof of concept (PoC) should validate the core assumptions before full-scale adoption.


Table of Contents

  1. Project Objectives and Scope
  2. Core Concept: From Automation to Autonomy
  3. Technology Stack and Versions
  4. Project Setup
  5. Core Component Implementation
  6. The 3-Loop Verification Strategy: State Machine Design
  7. Prompt Engineering
  8. Integration Layer Implementation
  9. Evaluation Framework and Metrics
  10. Adoption Plan and Milestones
  11. Risk Assessment and Mitigation
  12. Cost Estimation
  13. Decision Checkpoints

1. Project Objectives and Scope

1.1 The Problem

In large-scale legacy Java projects, we face the following testing dilemmas:

| Issue | Current Status | Impact |
|---|---|---|
| Insufficient Coverage | Unit test coverage ~60%; E2E tests cover only happy paths | Frequent edge-case bugs in production |
| Fragile Scripts | Frontend DOM changes break 30% of Selenium tests | 2-3 days spent fixing tests after every UI update |
| Inefficient Diagnosis | Avg. 2 hours to locate the root cause after a failure | Developer time wasted on debugging |
| Flaky Tests | ~15% of tests fail intermittently | Low confidence in CI/CD; frequent manual re-runs |

1.2 Project Goals

Phase 1 Goal (PoC, 8 Weeks):

  • Build an AI Diagnosis Assistant to automatically analyze root causes of test failures.
  • Target: Diagnosis accuracy > 80%, Avg. diagnosis time < 30s.

Phase 2 Goal (MVP, 12 Weeks):

  • Implement Visual Location capabilities to reduce test breakage from DOM changes.
  • Target: Increase test script survival rate from 70% to 95%.

Phase 3 Goal (Production, 16 Weeks):

  • Achieve Autonomous Exploratory Testing to discover edge cases uncovered by humans.
  • Target: New bugs discovered > 10 per month.

1.3 Out of Scope

  • Load Testing / Performance Testing
  • Security Penetration Testing
  • Mobile App Testing (Web only)
  • Replacing existing Unit and Integration Tests

2. Core Concept: From Automation to Autonomy

2.1 Traditional Automation vs. AI Autonomy

Traditional Automation (Imperative):
  Developer defines -> Click #login-btn -> Wait 2s -> Assert URL contains /dashboard

  Problems:
  - Fixed paths, cannot handle unexpected situations
  - Fragile element locators; breaks on DOM changes
  - Failure provides only Exceptions, no diagnostic insight

AI Autonomy (Declarative):
  Developer defines -> Goal: Buy as VIP, use coupon, verify total amount

  AI Behavior:
  - Plans path to achieve goal autonomously
  - Tries alternatives when blocked
  - Automatically investigates root causes upon failure

2.2 The 3-Loop Verification Concept

┌─────────────────────────────────────────────────────────────┐
│                    Exploration Loop                         │
│  ┌─────────────────────────────────────────────────────┐   │
│  │                Diagnosis Loop                       │   │
│  │  ┌─────────────────────────────────────────────┐   │   │
│  │  │           Stability Loop                    │   │   │
│  │  │                                             │   │   │
│  │  │   Execute Single Test Action                │   │   │
│  │  │   ↓                                         │   │   │
│  │  │   Failure → Environment Issue? → Retry N    │   │   │
│  │  │                                             │   │   │
│  │  └─────────────────────────────────────────────┘   │   │
│  │                      ↓                             │   │
│  │   Still Failure → Collect Evidence → Analyze       │   │
│  │                                                    │   │
│  └─────────────────────────────────────────────────────┘   │
│                          ↓                                 │
│   Complete Current Path → Find Next Goal → Discover New    │
│                                                            │
└─────────────────────────────────────────────────────────────┘

3. Technology Stack and Versions

3.1 Core Stack

| Component | Tech Choice | Version | Rationale |
|---|---|---|---|
| LLM Orchestration | LangChain4j | 0.35.0 | Java-native, excellent Spring Boot integration, robust tool calling |
| LLM Model | GPT-4o | 2024-08-06 | Strong vision, stable reasoning, reliable function calling |
| Backup Model | GPT-4o-mini | 2024-07-18 | Lower cost; used for simple judgments |
| Browser Automation | Playwright | 1.48.0 | More stable than Selenium, multi-browser support, official Java API |
| Test Containers | Testcontainers | 1.20.3 | Database isolation, consistent environments |
| Observability | Micrometer + OTLP | 1.13.0 | Spring Boot integration, TraceId propagation support |

3.2 Dependency Compatibility Matrix

Spring Boot 3.3.x
├── Java 21 (required)
├── LangChain4j 0.35.0
│   └── langchain4j-open-ai 0.35.0
│   └── langchain4j-spring-boot-starter 0.35.0
├── Playwright 1.48.0
│   └── playwright-java 1.48.0
└── Testcontainers 1.20.3
    └── postgresql 1.20.3

4. Project Setup

4.1 Maven pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>ai-qa-agent</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>3.3.5</version>
        <relativePath/>
    </parent>

    <properties>
        <java.version>21</java.version>
        <langchain4j.version>0.35.0</langchain4j.version>
        <playwright.version>1.48.0</playwright.version>
        <testcontainers.version>1.20.3</testcontainers.version>
    </properties>

    <dependencies>
        <!-- Spring Boot -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>

        <!-- LangChain4j -->
        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-spring-boot-starter</artifactId>
            <version>${langchain4j.version}</version>
        </dependency>
        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-open-ai</artifactId>
            <version>${langchain4j.version}</version>
        </dependency>

        <!-- Playwright -->
        <dependency>
            <groupId>com.microsoft.playwright</groupId>
            <artifactId>playwright</artifactId>
            <version>${playwright.version}</version>
        </dependency>

        <!-- Testcontainers -->
        <dependency>
            <groupId>org.testcontainers</groupId>
            <artifactId>testcontainers</artifactId>
            <version>${testcontainers.version}</version>
        </dependency>
        <dependency>
            <groupId>org.testcontainers</groupId>
            <artifactId>postgresql</artifactId>
            <version>${testcontainers.version}</version>
        </dependency>

        <!-- Observability -->
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-tracing-bridge-otel</artifactId>
        </dependency>
        <dependency>
            <groupId>io.opentelemetry</groupId>
            <artifactId>opentelemetry-exporter-otlp</artifactId>
        </dependency>

        <!-- Utilities -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
        </dependency>

        <!-- Testing -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
            <!-- Install Playwright Browsers -->
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>3.1.0</version>
                <executions>
                    <execution>
                        <id>install-playwright-browsers</id>
                        <phase>generate-resources</phase>
                        <goals>
                            <goal>java</goal>
                        </goals>
                        <configuration>
                            <mainClass>com.microsoft.playwright.CLI</mainClass>
                            <arguments>
                                <argument>install</argument>
                                <argument>chromium</argument>
                            </arguments>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

4.2 application.yml

spring:
  application:
    name: ai-qa-agent

langchain4j:
  open-ai:
    chat-model:
      api-key: ${OPENAI_API_KEY}
      model-name: gpt-4o
      temperature: 0.1  # Stability is key for testing
      timeout: PT60S    # Vision analysis can be slow
      max-retries: 3
      log-requests: true
      log-responses: true

# AI QA Agent Configuration
ai-qa:
  browser:
    headless: true
    viewport-width: 1280
    viewport-height: 720
    timeout-ms: 30000

  loops:
    stability:
      max-retries: 3
      retry-delay-ms: 1000
      flakiness-threshold: 0.8  # success rate >= 0.8 but < 1.0 across retries is reported as flaky-but-passed
    diagnosis:
      collect-screenshot: true
      collect-console-logs: true
      collect-network-logs: true
      max-log-lines: 500
    exploration:
      max-depth: 10
      max-actions-per-page: 20

  cost:
    budget-per-test-usd: 0.50
    budget-per-day-usd: 100.00

  reporting:
    output-dir: ./test-reports
    screenshot-format: png

# Target System Configuration
target:
  base-url: ${TARGET_BASE_URL:http://localhost:8080}
  api-base-url: ${TARGET_API_URL:http://localhost:8080/api}

# Actuator (For collecting backend logs in Diagnosis Loop)
management:
  endpoints:
    web:
      exposure:
        include: health,info,loggers,httpexchanges  # Spring Boot 3 replaced the 'trace' endpoint with 'httpexchanges'
  tracing:
    sampling:
      probability: 1.0
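The `ai-qa.cost` caps above need runtime enforcement: every LLM call should be checked against both the per-test and per-day budget before it is made. A minimal sketch of such a guard (the `BudgetGuard` class and its method names are our own illustration, not part of any library):

```java
// Hypothetical budget guard enforcing the ai-qa.cost caps from application.yml.
// Class and method names are illustrative only.
class BudgetGuard {
    private final double perTestBudgetUsd;
    private final double perDayBudgetUsd;
    private double daySpendUsd = 0.0;

    BudgetGuard(double perTestBudgetUsd, double perDayBudgetUsd) {
        this.perTestBudgetUsd = perTestBudgetUsd;
        this.perDayBudgetUsd = perDayBudgetUsd;
    }

    /** Returns true if the projected cost still fits both budgets. */
    boolean canSpend(double testSpendSoFarUsd, double nextCallCostUsd) {
        return testSpendSoFarUsd + nextCallCostUsd <= perTestBudgetUsd
                && daySpendUsd + nextCallCostUsd <= perDayBudgetUsd;
    }

    /** Record actual spend after an LLM call completes. */
    void record(double costUsd) {
        daySpendUsd += costUsd;
    }

    double remainingDailyBudgetUsd() {
        return perDayBudgetUsd - daySpendUsd;
    }
}
```

A call that would exceed either cap should fail the test fast with a clear "budget exhausted" verdict rather than silently degrade.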

5. Core Component Implementation

5.1 OpenAI Configuration

package com.example.aiqaagent.config;

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.time.Duration;

@Configuration
public class OpenAiConfig {

    @Value("${langchain4j.open-ai.chat-model.api-key}")
    private String apiKey;

    /**
     * Primary Model: GPT-4o, for complex reasoning and vision.
     */
    @Bean
    public ChatLanguageModel primaryChatModel() {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o")
                .temperature(0.1)
                .timeout(Duration.ofSeconds(60))
                .maxRetries(3)
                .logRequests(true)
                .logResponses(true)
                .build();
    }

    /**
     * Lightweight Model: GPT-4o-mini, for cost-saving simple tasks.
     */
    @Bean
    public ChatLanguageModel lightweightChatModel() {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o-mini")
                .temperature(0.1)
                .timeout(Duration.ofSeconds(30))
                .maxRetries(3)
                .build();
    }
}
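Section 3.1 reserves GPT-4o-mini for simple judgments, which implies a routing decision before each call. A sketch of that heuristic over the two beans above (the `TaskType` taxonomy and `ModelRouter` class are our own assumption, not a LangChain4j feature):

```java
// Illustrative router choosing between the two model beans defined above.
// The TaskType categories are an assumption for this sketch.
enum TaskType { VISION_ANALYSIS, ROOT_CAUSE_DIAGNOSIS, ELEMENT_TEXT_MATCH, YES_NO_JUDGMENT }

class ModelRouter {
    /** Returns the Spring bean name of the model suited to the task. */
    static String beanFor(TaskType task) {
        switch (task) {
            case VISION_ANALYSIS:
            case ROOT_CAUSE_DIAGNOSIS:
                return "primaryChatModel";      // GPT-4o: vision and complex reasoning
            case ELEMENT_TEXT_MATCH:
            case YES_NO_JUDGMENT:
            default:
                return "lightweightChatModel";  // GPT-4o-mini: cheap simple judgments
        }
    }
}
```

Routing by task type rather than by caller keeps the cost policy in one place and makes it easy to audit which calls hit the expensive model.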

5.2 Playwright Config & Lifecycle

package com.example.aiqaagent.config;

import com.microsoft.playwright.*;
import jakarta.annotation.PreDestroy;
import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;

@Slf4j
@Configuration
public class PlaywrightConfig {

    @Value("${ai-qa.browser.headless:true}")
    private boolean headless;

    // ... viewport properties ...

    private Playwright playwright;
    private Browser browser;

    @Getter
    private volatile BrowserContext currentContext;

    @Getter
    private volatile Page currentPage;

    public synchronized void initialize() {
        if (playwright == null) {
            log.info("Initializing Playwright...");
            playwright = Playwright.create();
            browser = playwright.chromium().launch(
                new BrowserType.LaunchOptions()
                    .setHeadless(headless)
            );
            log.info("Playwright initialized successfully");
        }
    }

    public Page createNewPage() {
        initialize();
        if (currentContext != null) currentContext.close();

        currentContext = browser.newContext(
            new Browser.NewContextOptions().setViewportSize(1280, 720)
        );

        currentPage = currentContext.newPage();
        
        // Setup listeners
        currentPage.onConsoleMessage(msg ->
            log.debug("[Browser Console] {}: {}", msg.type(), msg.text())
        );

        return currentPage;
    }

    public String captureScreenshotBase64() {
        if (currentPage == null) throw new IllegalStateException("No active page");
        byte[] screenshot = currentPage.screenshot();
        return java.util.Base64.getEncoder().encodeToString(screenshot);
    }
    
    @PreDestroy
    public synchronized void cleanup() {
        // Close in reverse order of creation.
        if (currentPage != null) currentPage.close();
        if (currentContext != null) currentContext.close();
        if (browser != null) browser.close();
        if (playwright != null) playwright.close();
    }
}

5.3 Browser Tools

package com.example.aiqaagent.tools;

import dev.langchain4j.agent.tool.Tool;
import org.springframework.stereotype.Component;
// ... imports ...

@Component
@RequiredArgsConstructor
public class BrowserTools {

    private final PlaywrightConfig playwrightConfig;

    @Tool("Open specified URL. Returns page title.")
    public String navigateTo(String url) {
        Page page = playwrightConfig.getCurrentPage();
        page.navigate(url);
        page.waitForLoadState(LoadState.NETWORKIDLE);
        return "Loaded page: " + page.title();
    }

    @Tool("Click button/link containing text. Matches exact or partial.")
    public String clickByText(String text) {
        Page page = playwrightConfig.getCurrentPage();
        try {
            Locator locator = page.getByText(text);
            locator.first().waitFor();
            locator.first().click();
            return "Clicked element containing: " + text;
        } catch (TimeoutError e) {
            return "Could not find clickable element with text: " + text;
        }
    }

    @Tool("Get interactive elements (buttons, links, inputs) on current page.")
    public String getInteractiveElements() {
        // Scan the DOM for common interactive elements and return a
        // formatted list the LLM can reason over.
        Page page = playwrightConfig.getCurrentPage();
        StringBuilder sb = new StringBuilder();
        for (Locator el : page.locator("button, a, input, select, textarea").all()) {
            String tag = el.evaluate("e => e.tagName").toString().toLowerCase();
            String text = el.textContent() == null ? "" : el.textContent().trim();
            sb.append(tag).append(": ").append(text).append('\n');
        }
        return sb.toString();
    }
    
    // ... other tools like fillInput, scroll, pressKey ...
}

5.4 Vision Tools

package com.example.aiqaagent.tools;

import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.data.message.*;
// ... imports ...

@Component
@RequiredArgsConstructor
public class VisionTools {

    private final PlaywrightConfig playwrightConfig;
    
    // Note: Lombok's @RequiredArgsConstructor does not copy @Qualifier onto the
    // generated constructor parameter unless lombok.config contains:
    //   lombok.copyableAnnotations += org.springframework.beans.factory.annotation.Qualifier
    @Qualifier("primaryChatModel")
    private final ChatLanguageModel visionModel;

    private static final String VISION_LOCATE_PROMPT = """
        You are a Vision Analysis Assistant.
        Task: Locate the element matching this description in the screenshot: %s
        
        Return CENTER coordinates:
        COORDINATES: x=123, y=456
        
        If not found:
        NOT_FOUND: Reason
        """;

    @Tool("Locate element using Vision AI based on visual description.")
    public String clickByVision(String visualDescription) {
        String screenshotBase64 = playwrightConfig.captureScreenshotBase64();
        
        UserMessage msg = UserMessage.from(
            ImageContent.from(screenshotBase64, "image/png"),
            TextContent.from(String.format(VISION_LOCATE_PROMPT, visualDescription))
        );

        String response = visionModel.generate(msg).content().text();
        
        // Parse coordinates and click using Playwright
        // ... implementation ...
        
        return "Clicked via vision at " + response;
    }
}
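The `// Parse coordinates` step above can be made concrete with a small parser for the reply format defined by VISION_LOCATE_PROMPT (the `VisionResponseParser` and `Point` names are our own):

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Parses the "COORDINATES: x=123, y=456" / "NOT_FOUND: reason" reply
// format defined by VISION_LOCATE_PROMPT. Names are illustrative.
class VisionResponseParser {
    record Point(int x, int y) {}

    private static final Pattern COORDS =
            Pattern.compile("COORDINATES:\\s*x=(\\d+),\\s*y=(\\d+)");

    static Optional<Point> parse(String llmResponse) {
        Matcher m = COORDS.matcher(llmResponse);
        if (m.find()) {
            return Optional.of(new Point(Integer.parseInt(m.group(1)),
                                         Integer.parseInt(m.group(2))));
        }
        return Optional.empty(); // covers NOT_FOUND and malformed replies
    }
}
```

With a parsed point, the actual click reduces to Playwright's `page.mouse().click(p.x(), p.y())`; an empty result should be surfaced to the agent as a tool failure, not swallowed.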

5.5 Diagnostic & Data Tools

(Conceptual only: DiagnosticTools gathers evidence on failure — screenshots, console output, and backend logs correlated by TraceId — while DataTools provisions and resets an isolated Testcontainers PostgreSQL instance for test data.)


6. The 3-Loop Verification Strategy: State Machine Design

6.1 Loop Logic

  1. Stability Loop: Handles flaky tests.

    • If an action fails with a transient error (timeout, HTTP 503), retry up to N times.
    • If the action eventually succeeds after one or more retries, mark it “Flaky” but report it as passed.
  2. Diagnosis Loop: Handles hard failures.

    • Collects evidence (screenshot + console output + backend logs correlated via TraceId).
    • Asks the AI to classify the root cause (frontend vs. backend vs. data vs. environment).
  3. Exploration Loop: Handles path planning.

    • Determines the next action from the current goal and the observed page state.
    • Prioritizes unexplored paths to maximize coverage, in the spirit of an exploration bonus in reinforcement learning.
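The Stability Loop can be sketched as a plain retry wrapper; the names, the `Supplier<Boolean>` action shape, and the treatment of any `RuntimeException` as transient are assumptions of this sketch, not a fixed design:

```java
import java.util.function.Supplier;

// Minimal Stability Loop sketch: retry transient failures up to maxRetries,
// and classify an action that eventually passed after failing as FLAKY.
class StabilityLoop {
    enum Verdict { PASSED, FLAKY, FAILED }

    static Verdict run(Supplier<Boolean> action, int maxRetries, long retryDelayMs) {
        int failures = 0;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            boolean ok;
            try {
                ok = action.get();
            } catch (RuntimeException transientError) { // e.g. timeout, HTTP 503
                ok = false;
            }
            if (ok) {
                return failures == 0 ? Verdict.PASSED : Verdict.FLAKY;
            }
            failures++;
            try {
                Thread.sleep(retryDelayMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Verdict.FAILED;
            }
        }
        return Verdict.FAILED; // hand off to the Diagnosis Loop
    }
}
```

A FAILED verdict is the trigger for the Diagnosis Loop; a FLAKY verdict passes the run but should be recorded so chronically flaky steps surface in reporting.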

7. Prompt Engineering

7.1 System Prompt

You are a Senior QA Automation Engineer.

Capabilities:
1. Test Planning: Plan paths based on business goals.
2. Execution: Operate browser via tools.
3. Diagnosis: Analyze root causes upon failure.
4. Exploration: Proactively find edge cases.

Guidelines:
- Verify result after every step.
- If action fails, try alternatives before reporting failure.
- Collect evidence (screenshots) regularly.
- When diagnosing, distinguish between Frontend, Backend, and Environment issues.

7.2 Diagnosis Prompt

Analyze this test failure.

Action: {action_description}
Error: {error_message}
Evidence: {evidence}

Provide:
1. Root Cause Category (FRONTEND_BUG, BACKEND_BUG, ENVIRONMENT, TEST_SCRIPT, DATA_ISSUE)
2. Description
3. Technical Details
4. Suggested Fix
5. Confidence Level
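Filling the `{action_description}`, `{error_message}`, and `{evidence}` slots can be done with a tiny substitution helper; this is an illustration only — in the actual stack, LangChain4j's `PromptTemplate` would do this job:

```java
import java.util.Map;

// Naive placeholder substitution for the diagnosis prompt above.
// Illustrative; the real implementation would use LangChain4j's PromptTemplate.
class DiagnosisPromptBuilder {
    static String fill(String template, Map<String, String> vars) {
        String out = template;
        for (Map.Entry<String, String> e : vars.entrySet()) {
            out = out.replace("{" + e.getKey() + "}", e.getValue());
        }
        return out;
    }
}
```

Whatever mechanism is used, evidence should be sanitized (see the privacy mitigation in Section 11) before being interpolated into the prompt.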

8. Integration Layer Implementation

8.1 AutonomousTester Interface

// LangChain4j AI Service sketch. Tools (BrowserTools, VisionTools, etc.) are
// registered on the AiServices builder, not declared on this interface, and
// template variables must be bound with @V ({{goal}} renders via toString()).
public interface AutonomousTester {

    @SystemMessage("You are a Senior QA Automation Engineer. ...")
    @UserMessage("Test Goal: {{goal}}")
    TestReport performTest(@MemoryId String testId, @V("goal") TestGoal goal);
}

8.2 TestOrchestratorService

Orchestrates the LoopStateMachine and calls AutonomousTester. Manages the lifecycle of Playwright pages and Testcontainers.
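The LoopStateMachine could be modeled as a small state-transition table over the three loops of Section 6; the state and event names below are our own naming, and a production version would carry context (evidence, budgets, goal queue) alongside the state:

```java
// Sketch of the 3-loop state machine from Section 6. Names are illustrative.
class LoopStateMachine {
    enum State { EXECUTING, RETRYING, DIAGNOSING, EXPLORING, DONE }
    enum Event { ACTION_OK, TRANSIENT_FAILURE, RETRIES_EXHAUSTED,
                 DIAGNOSIS_COMPLETE, GOALS_REMAIN, ALL_GOALS_DONE }

    static State next(State current, Event event) {
        return switch (event) {
            case ACTION_OK          -> State.EXPLORING;   // choose next action/goal
            case TRANSIENT_FAILURE  -> State.RETRYING;    // Stability Loop
            case RETRIES_EXHAUSTED  -> State.DIAGNOSING;  // Diagnosis Loop
            case DIAGNOSIS_COMPLETE -> State.EXPLORING;   // report, then continue
            case GOALS_REMAIN       -> State.EXECUTING;   // Exploration Loop continues
            case ALL_GOALS_DONE     -> State.DONE;
        };
    }
}
```

Keeping transitions in one pure function makes the orchestration logic unit-testable without a browser or an LLM in the loop.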


9. Evaluation Framework and Metrics

| Metric | Phase 1 (PoC) | Phase 2 (MVP) | Phase 3 (Prod) |
|---|---|---|---|
| Diagnosis Accuracy | > 80% | > 90% | > 95% |
| Script Survival Rate | N/A | > 95% | > 99% |
| New Bugs Found | N/A | N/A | > 10/month |

10. Adoption Plan and Milestones

Phase 1: Diagnosis Assistant (8 Weeks)

  • Goal: AI analyzes failure logs from existing CI pipelines.
  • Deliverable: Automated Root Cause Analysis Report attached to Jenkins builds.

Phase 2: Visual & Self-Healing (12 Weeks)

  • Goal: Implement VisionTools and Self-Healing locators.
  • Deliverable: A test suite that survives major UI refactoring without manual fixes.

Phase 3: Autonomous Exploration (16 Weeks)

  • Goal: Full “Nightly Build” exploration.
  • Deliverable: Autonomous testing of core business flows with minimal human input.

11. Risk Assessment and Mitigation

| Risk | Mitigation |
|---|---|
| High Token Cost | Use GPT-4o-mini for simple tasks; optimize prompts; enforce strict budget caps. |
| AI Hallucination | Verify findings via the Stability Loop; keep a human in the loop during initial rollout. |
| Data Privacy | Sanitize test data; do not send PII to the LLM; use a local LLM (e.g. Llama 3) where strict privacy is required. |
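The "do not send PII to the LLM" mitigation implies a sanitization pass over all evidence (logs, DOM text) before it leaves the test environment. A minimal regex-based sketch — the two patterns are illustrative and deliberately incomplete; a real deployment needs a proper data-classification pass:

```java
import java.util.regex.Pattern;

// Minimal PII scrubber for evidence text before it is sent to the LLM.
// These patterns are illustrative only and do not cover all PII types.
class PiiSanitizer {
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern CARD =
            Pattern.compile("\\b\\d(?:[ -]?\\d){12,15}\\b"); // 13-16 digit card-like runs

    static String sanitize(String text) {
        String out = EMAIL.matcher(text).replaceAll("[EMAIL]");
        return CARD.matcher(out).replaceAll("[CARD]");
    }
}
```

This scrubber would sit in the Diagnosis Loop's evidence-collection step, so nothing upstream has to remember to redact.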

12. Cost Estimation

Estimated Monthly Cost (Production): ~$1,375 USD

  • Based on heavy usage of Vision API (most expensive component).
  • Can be optimized by using caching and hybrid models.

13. Decision Checkpoints

  • Week 8 (PoC End): Is Diagnosis Accuracy > 80%? If not, refine prompts or switch models.
  • Week 20 (MVP End): Is Script Survival Rate > 95%? If yes, proceed to full autonomy.
