A comparative study of standard Bedrock Converse and Mantle inference under stress conditions
Empirical findings from a controlled stress-test study.
SidShriyokesh ThangavelArun Gunasekaran
24
stress-condition runs
19,212
total requests
2
serving paths compared
4
concurrency levels
Abstract
Modern LLM serving systems often expose multiple inference paths that differ in latency behavior, throughput scaling, and reliability under load. Choosing the wrong path can degrade user experience or violate service-level objectives (SLOs), especially during traffic surges. We compare two production-relevant options, Standard Bedrock Converse and Mantle OpenAI-compatible serving, under controlled stress conditions spanning architecture type (Dense, MoE), prompt-length bands, and low-to-high concurrency regimes. The benchmark is centered on a translation use case to keep task semantics stable across all test conditions.
The study uses a full-factorial design with run configurations, i.e., each configuration combines architecture, input-length band, and concurrency under one unified protocol. Results show a clear crossover: Standard is more often faster on median latency, while Mantle is stronger under high-load reliability and throughput criteria. This pattern is strongest in MoE experiments, where Mantle maintains substantially higher weighted success.
These findings imply that inference-path selection should be objective-driven: latency-first routing can prefer Standard in lighter regimes, whereas scale-stability routing should prefer Mantle under sustained stress.
Introduction
Large-scale inference deployments are now routinely exposed to heterogeneous serving paths, each optimized for different operational goals. In this setting, platform teams must decide whether to prioritize median response speed, stable throughput at saturation, or reliability under concurrency spikes. This paper studies that decision problem in the context of translation-style LLM workloads routed through two serving concepts: Standard Bedrock Converse and Mantle inference.
Throughout this paper, a run configuration means one exact test condition formed by architecture, input-length band, and concurrency level.
Comparing these two concepts is practically important because both can appear competitive when evaluated with a single metric, yet exhibit different failure modes under stress. A latency-only perspective can favor one path, while a reliability-weighted throughput perspective can favor the other. As a result, decision quality depends on evaluation design, not just raw score tables.
Prior internal and public evaluations in this domain often focus on one serving concept at a time, use narrow load bands, or report non-unified metric suites. Such setups make cross-path interpretation difficult when stress conditions vary across studies. Our work closes this gap with a unified, controlled comparison across architectures, prompt lengths, and concurrency levels.
Main contributions
Systematic comparison of Standard and Mantle serving across 24 stress-condition runs and 19,212 total requests.
Empirical identification of crossover behavior: Standard is frequently latency-favored, while Mantle is more robust in high-load regimes.
At a high level, we find that Standard outperforms Mantle on median-latency winner counts, whereas Mantle outperforms Standard on stress-regime reliability and effective throughput. This confirms that path selection should be conditioned on service objective and traffic regime rather than treated as a static global choice.
Related work
Why existing evaluations fall short
Background on Standard Bedrock Converse Serving
Standard Bedrock Converse deployments are commonly evaluated through request-level latency and aggregate throughput summaries, with emphasis on end-user responsiveness. This literature and practice generally capture successful-response timing well, but often provide limited characterization of overload-phase reliability decay.
Background on Mantle-Style OpenAI-Compatible Serving
OpenAI-compatible gateway layers, including Mantle-like serving paths, are typically motivated by compatibility, routing flexibility, and operational control. Evaluations in this family frequently emphasize sustained throughput and production stability, especially where admission control and queue behavior influence observed outcomes.
Comparative and Robustness-Focused Evaluation Work
Comparative studies in adjacent systems domains show that methodology strongly affects conclusions: single-metric leaderboards can mis-rank systems when error rates diverge under stress. Robustness-oriented benchmarking therefore recommends multi-condition stress matrices, repeated protocol controls, and reliability-aware scoring.
Existing work remains insufficient for this decision context because most comparisons are not jointly systematic across architecture, prompt scale, and concurrency using one protocol and one metric vocabulary. Our study addresses that limitation with a unified A/B evaluation under controlled stress conditions and harmonized reporting.
Problem Setup and Concepts
Task setting and formalization
Let x denote an input prompt and y the generated translation output. For each experiment configuration c (one exact combination of architecture, input-token band, and concurrency), the system dispatches requests through one of two serving concepts and records latency, throughput, and reliability telemetry. We compare concepts under identical task semantics and evaluate both request-level behavior and aggregate service behavior.
For clarity, we define experiment-configuration notation as:
c = (a, ℓ, q)
where a is architecture, ℓ is input-length band, and q is concurrency level.
Term
Meaning
x
Input prompt for a translation request.
y
Model output generated for prompt x.
a
Model architecture slice (Dense or MoE).
ℓ
Input-length band (200, 600, 1000 tokens).
q
Concurrency level (1, 100, 500, 1000).
c
Experiment configuration, defined as one unique triple (a, ℓ, q).
p
Serving path under evaluation (Standard or Mantle).
Concept A: Standard Bedrock Converse
Concept A uses Amazon Bedrock Runtime converse() calls through boto3. Its key characteristics in this benchmark are direct runtime invocation, per-run controlled execution, and latency-centric winner scoring compatibility with existing scripts.
Concept B: Mantle Inference Serving
Concept B uses Mantle through an OpenAI-compatible chat-completions interface. Under the same run configurations, Mantle is evaluated with identical prompt families and concurrency targets so that differences can be attributed to serving behavior rather than workload mismatch.
Illustrative Behavioral Difference
Across the full condition grid (model architecture, input-length band, and concurrency level), both concepts can look similar in easier runs, but they separate clearly in harder runs. As architecture complexity increases, input length grows, and concurrency rises, one concept can maintain completion reliability while the other degrades faster. This is why latency-only interpretation can be misleading and why we use a dual-view evaluation.
Benchmark attributes and measured variables
Core variables, grouped by role
Core benchmark attributes used in this study, grouped by their role in design and analysis.
Attribute
Category
Short description
Region
Environment
AWS region where benchmark requests were executed.
Architecture
Workload slice
Model family under test (Dense or MoE).
API
Path variable
Inference path: Standard Converse or Mantle endpoint.
Number of in-flight requests generated per test run.
Total requests
Run volume
Total requests sent for a test run.
Prompt task type
Prompt control
Fixed translation task to keep workload semantics stable.
Success, errors, success rate
Reliability metrics
Completion count, failure count, and completion percentage.
Median ms, p95 ms, p99 ms
Latency metrics
Central and tail latency behavior of successful requests.
RPS
Throughput metric
Achieved successful requests per second.
Queue factor
Congestion proxy
Ratio of max latency to median latency in a run.
Winner
Decision metric
Per-run winner under script rule (lower median latency).
Effective RPS
Composite metric
Throughput adjusted by success rate for scale evaluation.
Experimental setup
Measured variables and protocol
The benchmark task is fixed to multilingual translation so stress effects read as serving-path behavior rather than prompt drift. Architectures are Dense (Qwen3 32B) and MoE (Qwen3 Next 80B A3B), two dominant scaling paradigms in the Bedrock catalog.
Data and stress conditions
Stress axis
Levels
Why it matters for production
Architecture
Dense, MoE
Captures architecture-specific compute and routing behavior under load.
Prompt length
200, 600, 1000 input tokens
Tests short-to-long context handling and token-processing pressure.
Concurrency
1, 100, 500, 1000
Represents progression from light traffic to stress-regime saturation.
Rationale for Selecting Dense and MoE Architectures
We focus on Dense and Mixture-of-Experts (MoE) because they are two primary open-model architecture families represented in the Bedrock model catalog and directly relevant to production text inference trade-offs. In this study, the representative Dense model is Qwen3 32B, and the representative MoE model is Qwen3 Next 80B A3B.
This selection should be interpreted as representative rather than exhaustive. Bedrock supports a broader model ecosystem across multiple providers and modalities (text, image, video, speech, and embeddings), but this study isolates text-generation serving behavior across two dominant scaling paradigms so that latency-throughput-reliability trade-offs can be compared under a controlled protocol.
Methods Compared
Both methods are evaluated under identical run configurations.
Method A (Standard)
Bedrock Runtime converse() through boto3.
Method B (Mantle)
OpenAI-compatible chat completions through Mantle.
No additional external baseline is introduced in this report; the study is an A/B serving-path comparison under one unified protocol.
Standard errors are strongly non-linear: 585 at c=500 rising to 2090 at c=1000.
Results
System behavior inference from measurements
The following statements are inferential, based on observed metric patterns:
Mantle
Mantle likely uses stronger admission control or batching behavior, reflected by early throughput saturation and higher high-load completion rates.
Standard
Standard behaves more like a direct endpoint under overload: lower latency among successful requests, but sharper reliability collapse and larger error growth at high concurrency.
Winner counts by scoring view
Figure 5. Run wins under each scoring rule.
Median-latency winners by architecture
Figure 6. Standard leads in both Dense and MoE under the median-latency rule.
Results / efficiency and operational complexity
Efficiency and operational complexity
Summarizes stress-regime efficiency trade-offs. Memory footprint was not instrumented in this run; therefore, complexity analysis focuses on latency-throughput-reliability behavior measured directly from execution telemetry.
Data and Stress Conditions
Metric
Standard
Mantle
Mantle vs Standard
Effective RPS at c=500
11.29
28.07
+148.69%
Effective RPS at c=1000
14.16
24.21
+70.96%
Total errors (all runs)
2,675
834
−68.82%
Dense avg queue factor
2.14
1.28
−40.19%
System behavior inference from measurements
Load band
Primary SLO
Recommended path
Why
Low (c=1 to 100)
Lowest latency
Standard
Lower pooled P50, both paths at 100% weighted success.
Medium/high (c=500 to 1000), Dense
Throughput + queue stability
Mantle
Much higher RPS and lower queue factor at equal 100% success.
Medium/high (c=500 to 1000), MoE
Completion reliability
Mantle
Higher weighted success and lower total errors.
Any load band
Median latency only
Standard
More median-latency wins overall (17/24).
Conclusion
Route by workload regime, not by default
This comparative study shows that no single serving path dominates every objective under stress. Standard is frequently favorable in latency-first scoring, while Mantle is consistently stronger in high-load reliability and effective throughput. The key implication is methodological and operational: evaluation frameworks must jointly model latency, completion reliability, and throughput to avoid biased path selection. For production deployment, the most effective policy is conditional routing aligned to workload regime and SLO priority rather than a one-size-fits-all endpoint choice.
Appendix A
Per-run results
Every run in the grid: architecture, input length, concurrency, latency percentiles, success rate, throughput, queue factor, and the median-latency winner.
Input
Conc.
Std P50
Mnt P50
Std P95
Mnt P95
Std P99
Mnt P99
Std Succ
Mnt Succ
Std RPS
Mnt RPS
Std Q
Mnt Q
Winner
200
1
4,916.58
5,519.99
4,916.58
5,519.99
4,916.58
5,519.99
100%
100%
0.2
0.18
1
1
Standard
200
100
8,040.3
7,896.53
8,753.18
8,521.54
8,831.85
9,313.08
100%
100%
11.19
10.7
1.1
1.18
Mantle
200
500
9,087.88
9,157.18
10,229.81
9,872.81
15,965.73
10,744.92
100%
100%
18.8
44.68
2.91
1.2
Standard
200
1000
11,021.01
13,685.37
18,320.36
20,954.85
19,023.54
21,245.57
100%
100%
35.84
43.66
2.51
1.61
Standard
600
1
4,810.52
5,388.89
4,810.52
5,388.89
4,810.52
5,388.89
100%
100%
0.2
0.19
1
1
Standard
600
100
5,364.57
6,123.41
9,054.58
8,858.48
25,744.52
8,957.7
100%
100%
3.87
11.05
4.8
1.46
Standard
600
500
9,734.82
9,585.15
10,678.87
10,261.28
13,892.91
10,652.45
100%
100%
19.07
42.31
2.67
1.22
Mantle
600
1000
11,326.86
12,344.12
18,602.09
19,729.6
19,064.54
20,347.63
100%
100%
34.85
46.18
2.47
1.71
Standard
1000
1
4,492.54
5,431.23
4,492.54
5,431.23
4,492.54
5,431.23
100%
100%
0.21
0.18
1
1
Standard
1000
100
8,175.86
7,913.6
8,751.07
8,453.83
9,890.9
9,218.9
100%
100%
9.93
10.75
1.21
1.16
Mantle
1000
500
9,666.37
9,162.35
10,800.09
9,836.61
12,294.11
10,119.78
100%
100%
19.31
47.3
2.61
1.14
Mantle
1000
1000
11,377.97
12,427.86
18,354.63
19,331.66
18,740.36
19,699.96
100%
100%
35.59
46.11
2.44
1.73
Standard
Ready to transform your enterprise?
Let's build something that lasts. Our team is ready to talk.