Why Understanding These Performance Metrics Matters
1. Capacity Planning and Cost Control
Understanding Average TPS (Transactions Per Second) and Peak TPS is critical for capacity planning. Relying solely on average values may lead to system crashes during peak periods, while provisioning entirely based on peak values results in resource waste and excessive costs.
Real Impact:
- Resource Allocation: A healthy Peak-to-Average Ratio should range between 3:1 and 5:1
- Cost Savings: Utilizing auto-scaling between average and peak can save 40-60% of infrastructure costs
- Risk Management: Early identification of peak demands prevents service disruptions from traffic spikes
2. SLA (Service Level Agreement) Definition
Service commitments for availability and response times must be based on real performance data. Understanding Average and Peak TPS helps establish reasonable SLA targets:
- Response Time Guarantees: <200ms under average load, <500ms under peak load
- Availability Targets: 99.9% availability requires considering system stability during peaks
- Degradation Strategy: Activate automatic degradation when TPS exceeds 120% of peak
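The degradation trigger above can be sketched as a simple guard. This is a minimal illustration, not a production circuit breaker; the 2000 TPS peak used in the example calls is an assumed value.

```python
# Minimal sketch of the degradation rule: activate automatic degradation
# when current load exceeds 120% of the measured peak TPS.
def should_degrade(current_tps: float, peak_tps: float) -> bool:
    """Return True when load exceeds 120% of peak and degradation should start."""
    return current_tps > peak_tps * 1.2

peak_tps = 2000  # assumed measured peak, for illustration only
print(should_degrade(2500, peak_tps))  # 2500 > 2400, degrade
print(should_degrade(2300, peak_tps))  # 2300 <= 2400, normal operation
```

In practice this check would run against a rolling TPS window from the monitoring system rather than a single instantaneous reading.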
3. Performance Bottleneck Identification
Comparing TPS distribution across different APIs quickly reveals performance bottlenecks:
- Unbalanced Load: If one API’s TPS is significantly lower than others, bottlenecks like database queries or third-party API calls may exist
- Optimization Priority: Prioritize high-frequency but low-TPS APIs for maximum impact
- Scaling Decisions: Determine whether to scale vertically (upgrade hardware) or horizontally (add instances)
System Performance Fundamentals
In system performance testing and evaluation, Average TPS (Average Transactions Per Second) and Peak TPS (Peak Transactions Per Second) are two critical metrics. These metrics help us understand and measure system processing capabilities under different load conditions.
1. Metric Definitions and Differences
Average Transactions Per Second (Average TPS)
Definition: The number of user requests a system can consistently handle per second under normal conditions.
Significance:
- Reflects system stability during regular operation
- Helps understand system performance most of the time
- Serves as a monitoring baseline
Analogy: Like a coffee shop that typically serves 10 customers per minute: the figure comes from averaging past conditions, and day-to-day observations show arrival rates hovering consistently around it.

Peak Transactions Per Second (Peak TPS)
Definition: The maximum number of user requests a system can handle per second during peak periods.
Significance:
- Reflects system processing capacity during busiest, highest-load times
- Helps understand system performance limits under heavy load
- Determines system capacity ceiling
Analogy: During the coffee shop’s peak hours, like morning rush hour, typically 50 customers arrive per minute. This number represents the shop’s maximum processing capability during its busiest times.
Relationship and Importance
| Metric | Purpose | Monitoring Frequency | Decision Impact |
|---|---|---|---|
| Average TPS | Daily operations, cost optimization | Continuous (every minute) | Baseline resource allocation |
| Peak TPS | Capacity planning, disaster recovery | Daily peak checks | Elastic scaling limits |
2. Calculation Methods
Average Transactions Per Second (Average TPS)
Calculation Method: Sum all API throughputs and divide by the number of APIs to get the average.
Formula:
Average TPS = (Sum of all API throughputs) / Number of APIs
Example: Assume six APIs with the following throughputs:
| API | Throughput (requests/sec) |
|---|---|
| API 1 | 72.1 |
| API 2 | 70.2 |
| API 3 | 67.5 |
| API 4 | 65.3 |
| API 5 | 63.9 |
| API 6 | 62.5 |
Calculation:
Average TPS = (72.1 + 70.2 + 67.5 + 65.3 + 63.9 + 62.5) / 6 = 66.92 requests/sec
Peak Transactions Per Second (Peak TPS)
Calculation Method: Sum all API throughputs to get the total.
Formula:
Peak TPS = Sum of all API throughputs
Using the same example:
Calculation:
Peak TPS = 72.1 + 70.2 + 67.5 + 65.3 + 63.9 + 62.5 = 401.5 requests/sec
Important Note: “Peak” here refers to the sum of all API endpoint throughputs, representing the system’s overall processing capacity, not a single API’s peak.
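Both formulas can be reproduced in a few lines of Python using the throughput table above, which is a quick way to sanity-check the arithmetic:

```python
# Average vs. Peak TPS from per-API throughput, as defined above.
throughputs = {
    "API 1": 72.1, "API 2": 70.2, "API 3": 67.5,
    "API 4": 65.3, "API 5": 63.9, "API 6": 62.5,
}

peak_tps = sum(throughputs.values())        # total across all endpoints
average_tps = peak_tps / len(throughputs)   # mean throughput per endpoint

print(f"Peak TPS:    {peak_tps:.1f} requests/sec")     # 401.5
print(f"Average TPS: {average_tps:.2f} requests/sec")  # 66.92
```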
3. Real-World Case Studies
Case 1: E-commerce Website
Background: Performance analysis of a large e-commerce website during regular operations and promotional events.
Analysis:
- Steady Operation: Under normal conditions, the website handles an average of 500 user requests per second
- Peak Operation: During promotional events (like Singles’ Day), the website handles up to 2000 user requests per second
- Peak Multiplier: Peak TPS / Average TPS = 2000 / 500 = 4x
Operations Strategy:
- Maintain baseline infrastructure for 500 TPS
- Enable auto-scaling to automatically expand to 2500 TPS capacity during peaks (125% of peak, 25% buffer)
- Pre-scale to 1500 TPS 1 hour before promotions to avoid sudden traffic impact
Case 2: Financial Trading System
Background: A financial trading system needs to maintain efficient operation during trading peaks.
Analysis:
- Steady Operation: During regular trading, the system handles an average of 300 transaction requests per second
- Peak Operation: During market volatility (like major news announcements), the system handles up to 1200 transaction requests per second
- Peak Multiplier: Peak TPS / Average TPS = 1200 / 300 = 4x
Operations Strategy:
- Financial systems are latency-sensitive, so the team over-provisions during normal operation, maintaining 1500 TPS of capacity (125% of peak) at all times
- Set alert threshold: notify SRE team when TPS > 1000
- When TPS > 1440 (120% of peak), activate degradation mechanism (pause non-critical features)
Case 3: Social Media Platform (High Real-Time Requirements)
Background: Social media platform traffic surge during major events.
Analysis:
- Steady Operation: Normally handles 10,000 requests per second
- Sudden Peak: During breaking news, request volume can reach 50,000 per second
- Peak Multiplier: 5x (harder to predict)
Operations Strategy:
- Employ multi-layer caching strategy (CDN, Redis, local cache) to reduce backend pressure
- Auto-cache popular content, reducing database query frequency
- Use message queues for non-real-time operations, smoothing traffic peaks
Monitoring and Practical Tools
Common Monitoring Tools
| Tool | Purpose | Use Case |
|---|---|---|
| Apache JMeter | Load testing, TPS measurement | Test environment performance validation |
| Grafana + Prometheus | Real-time monitoring, visualization | Production continuous monitoring |
| New Relic / Datadog | APM (Application Performance Monitoring) | Comprehensive performance analysis |
| AWS CloudWatch | Cloud resource monitoring | AWS infrastructure monitoring |
| Locust | Python-based load testing | Simulate real user behavior |
Monitoring Dashboard Recommendations
Core Metrics (Must Monitor):
- Real-time TPS: Current requests per second
- Average TPS (1 hour): Average over last hour
- Peak TPS (Today): Today’s highest value
- Response Time (P50, P95, P99): the latency within which 50%, 95%, and 99% of requests complete
- Error Rate: HTTP 5xx error percentage
Advanced Metrics:
- TPS distribution per API endpoint
- Database connection pool utilization
- CPU and memory usage
- Network bandwidth consumption
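The percentile metrics listed above can be computed with a small nearest-rank helper. This is one common convention; monitoring tools often interpolate instead and may report slightly different values, and the latency samples here are synthetic.

```python
import math

# Nearest-rank percentile: the value at or below which p% of samples fall.
def percentile(samples, p):
    ranked = sorted(samples)
    k = math.ceil(p / 100 * len(ranked)) - 1  # 1-based rank -> 0-based index
    return ranked[k]

latencies_ms = list(range(1, 101))  # synthetic response times: 1..100 ms
print(percentile(latencies_ms, 50))  # P50 -> 50
print(percentile(latencies_ms, 95))  # P95 -> 95
print(percentile(latencies_ms, 99))  # P99 -> 99
```

P95 and P99 matter more than the average here: a handful of slow requests barely moves the mean but is immediately visible in the tail percentiles.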
Best Practices
- Establish Baseline
- Collect at least 30 days of TPS data to calculate average and standard deviation
- Identify weekly and daily traffic patterns (e.g., Monday 9-10 AM peak)
- Set Reasonable Alert Thresholds
- Alert threshold = Average TPS × 1.5 (minor alert)
- Critical threshold = Peak TPS × 0.9 (approaching capacity limit)
- Regular Stress Testing
- Conduct quarterly load tests to verify system can handle expected peaks
- Test at 120% of peak load to ensure adequate buffer
- Elastic Scaling Strategy
- When TPS > 150% of average, automatically increase capacity by 30%
- When TPS < 70% of average (sustained for 15 minutes), automatically reduce capacity by 20%
- Optimize Low-TPS APIs
- Use APM tools to analyze slow queries and slow APIs
- Optimize database indexes, add caching, reduce N+1 query issues
- Capacity Planning
- Review capacity planning annually, forecast TPS needs for next 12-18 months based on business growth
- Reserve 30-50% capacity buffer for sudden traffic surges
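The alert-threshold and elastic-scaling rules above translate directly into plain functions. This is a sketch with illustrative names; the 15-minute sustain condition is modeled as a boolean the caller supplies, and a real system would evaluate it against a metrics store.

```python
# Thresholds and scaling rules taken directly from the best practices above.

def alert_thresholds(average_tps: float, peak_tps: float) -> dict:
    """Minor alert at 150% of average; critical at 90% of peak."""
    return {"minor": average_tps * 1.5, "critical": peak_tps * 0.9}

def scaling_decision(current_tps: float, average_tps: float,
                     capacity: float, low_sustained_15m: bool = False) -> float:
    """Return the new capacity after applying the elastic scaling rules."""
    if current_tps > average_tps * 1.5:
        return capacity * 1.3  # scale up by 30%
    if current_tps < average_tps * 0.7 and low_sustained_15m:
        return capacity * 0.8  # scale down by 20% after 15 min of low load
    return capacity            # within normal range: no change

print(alert_thresholds(500, 2000))                # minor 750, critical 1800
print(scaling_decision(800, 500, 100))            # scale up to ~130
print(scaling_decision(300, 500, 100, True))      # scale down to ~80
```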
Frequently Asked Questions
Q1: What’s a reasonable ratio between Average TPS and Peak TPS?
A: Generally, a Peak TPS / Average TPS ratio between 3:1 and 5:1 is healthy.
- Ratio < 3: Traffic is evenly distributed, which can also signal flat demand with little growth or promotional stimulus
- Ratio 3-5: Normal range with clear peaks and steady periods
- Ratio > 5: High traffic volatility, requiring enhanced elastic scaling mechanisms to avoid peak-time issues
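The three bands above map naturally onto a small classifier; the labels are illustrative.

```python
# Illustrative mapping of peak-to-average ratio bands to labels.
def classify_peak_ratio(peak_tps: float, average_tps: float) -> str:
    ratio = peak_tps / average_tps
    if ratio < 3:
        return "flat"      # evenly distributed traffic
    if ratio <= 5:
        return "healthy"   # clear peaks with steady periods
    return "volatile"      # needs stronger elastic scaling

print(classify_peak_ratio(2000, 500))  # 4x ratio -> healthy
```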
Q2: How to choose the right monitoring tool?
A: Choose based on team size and budget:
- Small teams (<10 people): Use Grafana + Prometheus (open-source, free)
- Medium teams (10-50 people): Consider New Relic or Datadog (paid but feature-complete)
- Large enterprises: Build custom monitoring platform or use enterprise solutions like Dynatrace
Q3: TPS drops but error rate is normal – what could be the cause?
A: Possible causes include:
- Business-side traffic reduction: Natural traffic fluctuations (off-peak hours)
- Frontend issues: Frontend code errors preventing requests from reaching backend
- Network issues: CDN or load balancer anomalies
- Third-party service outage: Dependencies on third-party APIs unavailable, interrupting user operations
Q4: How to test a system’s true peak capacity?
A: Follow these steps:
- Prepare test environment: Use test environment with same configuration as production
- Gradual ramp-up: Start from current average TPS, increase load by 20% every 5 minutes
- Monitor key metrics: Track response time, error rate, CPU/memory usage
- Find breaking point: When error rate exceeds 1% or P95 response time exceeds 1 second, record this TPS as peak capacity
- Verify recovery: Reduce load and confirm system can auto-recover
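The gradual ramp-up in step 2 can be planned ahead of time. This sketch only generates the load levels a test tool would be configured with; the breaking point passed in is an assumed value that the actual test would discover.

```python
# Planned load steps for a gradual ramp-up: start at the current average TPS
# and raise load by 20% every 5 minutes until the (assumed) breaking point.
def ramp_schedule(start_tps: float, breaking_tps: float, step: float = 0.2):
    levels, tps = [], start_tps
    while tps < breaking_tps:
        levels.append(round(tps))
        tps *= 1 + step
    return levels

print(ramp_schedule(500, 2000))  # one 5-minute stage per entry
```

Each entry corresponds to one 5-minute stage; monitoring error rate and P95 latency at each stage tells you which level is the true peak capacity.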
Q5: Why do different APIs have very different TPS values?
A: This is normal, common reasons include:
- Business logic complexity: Complex database queries and heavy computations reduce TPS
- External dependencies: Calling third-party APIs, sending emails increases response time
- Data volume: APIs returning large amounts of data are limited by network bandwidth
- Caching strategy: Cached APIs can achieve 10x or higher TPS
Recommendation: Analyze each API’s business importance and usage frequency, prioritize high-frequency but low-TPS APIs for optimization.
Q6: Production TPS suddenly drops 50% – how to troubleshoot quickly?
A: Follow this troubleshooting sequence:
- Check error rate (30 seconds): Are there massive 5xx errors?
- Check external dependencies (1 minute): Are database, Redis, third-party APIs functioning?
- Check infrastructure (2 minutes): Any CPU, memory, disk, network anomalies?
- Check recent deployments (3 minutes): Any new releases in past hour?
- Check traffic sources (5 minutes): Any crawlers or DDoS attacks?
Quick recovery: If problem cannot be located within 5 minutes, consider rolling back to last stable version.
Summary
Understanding Average TPS (Average Transactions Per Second) and Peak TPS (Peak Transactions Per Second) is crucial for system performance evaluation and optimization. These two metrics help us measure system processing capabilities under different load conditions, ensuring systems maintain good performance during both steady operation and peak loads.
Key Takeaways:
- Use Average TPS for daily operations and cost optimization, Peak TPS for capacity planning
- Establish comprehensive monitoring systems to continuously track key metrics
- Conduct regular load testing to validate system capacity
- Employ elastic scaling strategies to balance cost and performance
- Prioritize optimization of high-frequency but low-TPS APIs for maximum benefit
Related Articles
- Azure SQL Post-Migration Performance Optimization: Query Statistics, Top SQL Analysis, and Index Tuning Guide
- AWS ALB Multi-Instance Deployment Strategy: A Double-Edged Sword for Large Project Architectures
- AWS CloudFront 8TB Data Transfer Analysis: How to Optimize Costs and Performance?
- Calculating Consecutive Event Probability: Practical Applications in Software Engineering
- Optimize Performance with AWS Cache Solutions: Memcached vs Redis Comparison