A comparative study of standard Bedrock Converse and Mantle inference under stress conditions

Empirical findings from a controlled stress-test study.

SidShriyokesh ThangavelArun Gunasekaran

stress-condition runs

19,212

total requests

serving paths compared

concurrency levels

Abstract

Modern LLM serving systems often expose multiple inference paths that differ in latency behavior, throughput scaling, and reliability under load. Choosing the wrong path can degrade user experience or violate service-level objectives (SLOs), especially during traffic surges. We compare two production-relevant options, Standard Bedrock Converse and Mantle OpenAI-compatible serving, under controlled stress conditions spanning architecture type (Dense, MoE), prompt-length bands, and low-to-high concurrency regimes. The benchmark is centered on a translation use case to keep task semantics stable across all test conditions.

The study uses a full-factorial design with run configurations, i.e., each configuration combines architecture, input-length band, and concurrency under one unified protocol. Results show a clear crossover: Standard is more often faster on median latency, while Mantle is stronger under high-load reliability and throughput criteria. This pattern is strongest in MoE experiments, where Mantle maintains substantially higher weighted success.

These findings imply that inference-path selection should be objective-driven: latency-first routing can prefer Standard in lighter regimes, whereas scale-stability routing should prefer Mantle under sustained stress.

Introduction

Large-scale inference deployments are now routinely exposed to heterogeneous serving paths, each optimized for different operational goals. In this setting, platform teams must decide whether to prioritize median response speed, stable throughput at saturation, or reliability under concurrency spikes. This paper studies that decision problem in the context of translation-style LLM workloads routed through two serving concepts: Standard Bedrock Converse and Mantle inference.

Throughout this paper, a run configuration means one exact test condition formed by architecture, input-length band, and concurrency level.

Comparing these two concepts is practically important because both can appear competitive when evaluated with a single metric, yet exhibit different failure modes under stress. A latency-only perspective can favor one path, while a reliability-weighted throughput perspective can favor the other. As a result, decision quality depends on evaluation design, not just raw score tables.

Prior internal and public evaluations in this domain often focus on one serving concept at a time, use narrow load bands, or report non-unified metric suites. Such setups make cross-path interpretation difficult when stress conditions vary across studies. Our work closes this gap with a unified, controlled comparison across architectures, prompt lengths, and concurrency levels.

Main contributions

Systematic comparison of Standard and Mantle serving across 24 stress-condition runs and 19,212 total requests.
Joint metric framework spanning latency, throughput, reliability, error dynamics, and reliability-weighted throughput.
Empirical identification of crossover behavior: Standard is frequently latency-favored, while Mantle is more robust in high-load regimes.

At a high level, we find that Standard outperforms Mantle on median-latency winner counts, whereas Mantle outperforms Standard on stress-regime reliability and effective throughput. This confirms that path selection should be conditioned on service objective and traffic regime rather than treated as a static global choice.

Related work

Why existing evaluations fall short

Background on Standard Bedrock Converse Serving

Standard Bedrock Converse deployments are commonly evaluated through request-level latency and aggregate throughput summaries, with emphasis on end-user responsiveness. This literature and practice generally capture successful-response timing well, but often provide limited characterization of overload-phase reliability decay.

Background on Mantle-Style OpenAI-Compatible Serving

OpenAI-compatible gateway layers, including Mantle-like serving paths, are typically motivated by compatibility, routing flexibility, and operational control. Evaluations in this family frequently emphasize sustained throughput and production stability, especially where admission control and queue behavior influence observed outcomes.

Comparative and Robustness-Focused Evaluation Work

Comparative studies in adjacent systems domains show that methodology strongly affects conclusions: single-metric leaderboards can mis-rank systems when error rates diverge under stress. Robustness-oriented benchmarking therefore recommends multi-condition stress matrices, repeated protocol controls, and reliability-aware scoring.

Existing work remains insufficient for this decision context because most comparisons are not jointly systematic across architecture, prompt scale, and concurrency using one protocol and one metric vocabulary. Our study addresses that limitation with a unified A/B evaluation under controlled stress conditions and harmonized reporting.

Problem Setup and Concepts

Task setting and formalization

Let x denote an input prompt and y the generated translation output. For each experiment configuration c (one exact combination of architecture, input-token band, and concurrency), the system dispatches requests through one of two serving concepts and records latency, throughput, and reliability telemetry. We compare concepts under identical task semantics and evaluate both request-level behavior and aggregate service behavior.

For clarity, we define experiment-configuration notation as:

c = (a, ℓ, q)

where a is architecture, ℓ is input-length band, and q is concurrency level.

Term	Meaning
x	Input prompt for a translation request.
y	Model output generated for prompt x.
a	Model architecture slice (Dense or MoE).
ℓ	Input-length band (200, 600, 1000 tokens).
q	Concurrency level (1, 100, 500, 1000).
c	Experiment configuration, defined as one unique triple (a, ℓ, q).
p	Serving path under evaluation (Standard or Mantle).

Concept A: Standard Bedrock Converse

Concept A uses Amazon Bedrock Runtime converse() calls through boto3. Its key characteristics in this benchmark are direct runtime invocation, per-run controlled execution, and latency-centric winner scoring compatibility with existing scripts.

Concept B: Mantle Inference Serving

Concept B uses Mantle through an OpenAI-compatible chat-completions interface. Under the same run configurations, Mantle is evaluated with identical prompt families and concurrency targets so that differences can be attributed to serving behavior rather than workload mismatch.

Illustrative Behavioral Difference

Across the full condition grid (model architecture, input-length band, and concurrency level), both concepts can look similar in easier runs, but they separate clearly in harder runs. As architecture complexity increases, input length grows, and concurrency rises, one concept can maintain completion reliability while the other degrades faster. This is why latency-only interpretation can be misleading and why we use a dual-view evaluation.

Benchmark attributes and measured variables

Core variables, grouped by role

Core benchmark attributes used in this study, grouped by their role in design and analysis.

Attribute	Category	Short description
Region	Environment	AWS region where benchmark requests were executed.
Architecture	Workload slice	Model family under test (Dense or MoE).
API	Path variable	Inference path: Standard Converse or Mantle endpoint.
Input tokens	Load variable	Prompt size bucket controlling input-length pressure.
Concurrency	Load variable	Number of in-flight requests generated per test run.
Total requests	Run volume	Total requests sent for a test run.
Prompt task type	Prompt control	Fixed translation task to keep workload semantics stable.
Success, errors, success rate	Reliability metrics	Completion count, failure count, and completion percentage.
Median ms, p95 ms, p99 ms	Latency metrics	Central and tail latency behavior of successful requests.
RPS	Throughput metric	Achieved successful requests per second.
Queue factor	Congestion proxy	Ratio of max latency to median latency in a run.
Winner	Decision metric	Per-run winner under script rule (lower median latency).
Effective RPS	Composite metric	Throughput adjusted by success rate for scale evaluation.

Experimental setup

Measured variables and protocol

The benchmark task is fixed to multilingual translation so stress effects read as serving-path behavior rather than prompt drift. Architectures are Dense (Qwen3 32B) and MoE (Qwen3 Next 80B A3B), two dominant scaling paradigms in the Bedrock catalog.

Data and stress conditions

Stress axis	Levels	Why it matters for production
Architecture	Dense, MoE	Captures architecture-specific compute and routing behavior under load.
Prompt length	200, 600, 1000 input tokens	Tests short-to-long context handling and token-processing pressure.
Concurrency	1, 100, 500, 1000	Represents progression from light traffic to stress-regime saturation.

Rationale for Selecting Dense and MoE Architectures

We focus on Dense and Mixture-of-Experts (MoE) because they are two primary open-model architecture families represented in the Bedrock model catalog and directly relevant to production text inference trade-offs. In this study, the representative Dense model is Qwen3 32B, and the representative MoE model is Qwen3 Next 80B A3B.

This selection should be interpreted as representative rather than exhaustive. Bedrock supports a broader model ecosystem across multiple providers and modalities (text, image, video, speech, and embeddings), but this study isolates text-generation serving behavior across two dominant scaling paradigms so that latency-throughput-reliability trade-offs can be compared under a controlled protocol.

Methods Compared

Both methods are evaluated under identical run configurations.

Method A (Standard)

Bedrock Runtime converse() through boto3.

Method B (Mantle)

OpenAI-compatible chat completions through Mantle.

No additional external baseline is introduced in this report; the study is an A/B serving-path comparison under one unified protocol.

Evaluation metrics and protocol

Region

us-east-1

Models

Dense: qwen.qwen3-32b-v1:0; MoE: qwen.qwen3-next-80b-a3b-v1:0

Input lengths

200, 600, 1000

Concurrencies

1, 100, 500, 1000

Request multiplier

1 (requests = concurrency)

Prompt task

Multilingual translation

Run order per experiment

Standard first, then Mantle

Workload volume and scoring

Workload volume is determined by the concurrency set,

Σc ∈ {1, 100, 500, 1000}c = 1,601

Per architecture per API path: 3 × 1,601 = 4,803 requests.
Across both architectures per API path: 9,606 requests.
Across both paths: 19,212 requests.
Recorded metrics include median ms (P50), p95 ms, p99 ms, success rate, error count, RPS, and queue factor.

In this report, an error denotes a failed request event; the dominant observed case is Read timeout on endpoint URL during Bedrock Converse requests.

Primary winner rule (script-native)

lower median ms wins per experiment run.

Secondary robustness-aware score

effective rps = rps ×success rate100

This metric captures throughput and completion reliability jointly.

Results / overall comparison

Averages across all stress runs

Aggregated per architecture and path.

Arch	Path	Avg P50 (ms)	Avg P95 (ms)	Avg P99 (ms)	Avg RPS	Max RPS	Avg Queue	Max Queue	Avg Success (%)	Weighted Success (%)
Dense	Standard	8,167.94	10,647.03	13,139.01	15.76	35.84	2.14	4.80	100.00	100.00
Dense	Mantle	8,719.64	11,013.40	11,386.68	25.27	47.30	1.28	1.73	100.00	100.00
MoE	Standard	11,849.68	21,105.27	22,408.10	5.96	9.33	1.83	3.29	72.83	44.31
MoE	Mantle	18,036.00	30,890.63	34,644.06	6.48	12.34	2.07	3.49	93.05	82.64

Key deltas (Mantle vs Standard).

Dense

Latency percentiles are nearly unchanged (about −1% on average), while average throughput increases by about +60%.

MoE

Latency percentiles increase (about +51% on average), but average throughput is still higher (about +9%).

Reliability (MoE)

Success-rate measures improve by about +29 percentage points.

Across all stress runs, Standard more often minimizes median latency, while Mantle delivers better reliability-weighted scaling in high-load regimes.

Condition cluster 1

Performance and scaling

This cluster examines throughput response as concurrency rises.

Conc.	Path	P50 (ms)	RPS	Weighted Success (%)
1	Standard	5,229.45	0.19	100.00
1	Mantle	6,110.94	0.16	100.00
100	Standard	7,953.74	7.50	100.00
100	Mantle	8,003.72	7.15	100.00
500	Standard	12,946.55	14.02	80.50
500	Mantle	14,655.74	28.07	100.00
1000	Standard	13,905.50	21.73	65.17
1000	Mantle	24,740.87	28.12	86.10

Observed from pooled concurrency slices:

Mantle has higher pooled throughput (Dense+MoE) at high concurrency: 28.07 vs 14.02 RPS at c=500, and 28.12 vs 21.73 RPS at c=1000.
Mantle RPS is nearly flat from 500 to 1000 (28.07 → 28.12), indicating an early throughput ceiling.
Standard scales more gradually in RPS (7.50→14.02→21.73), but with worsening reliability.

Pooled throughput vs concurrency

Figure 1: RPS vs concurrency for Standard and Mantle. Mantle saturates near 28 RPS at higher concurrency, while Standard scales more gradually.

Condition cluster 2

Latency behavior

Pooled P50 latency vs concurrency

Figure 2. Pooled median latency in milliseconds. The Mantle to Standard gap grows from +0.63% at c=100 to +77.93% at c=1000.

Pooled P50 gap (Mantle vs Standard) is +0.63% at c=100, +13.20% at c=500, and +77.93% at c=1000.
Dense carries a tail nuance: Mantle is worse on average P50 and P95 but better on average P99.
Mantle has higher median latency in 17 of 24 experiment runs.

Condition cluster 3

Reliability and error dynamics

Weighted success rate

Figure 3. Completion reliability vs concurrency. Higher is better.

Error growth

Figure 4. Error count vs concurrency. Lower is better.

At low/moderate load (c=1,100), both paths achieve 100% weighted success.
At c=500, Standard drops to 80.50% weighted success while Mantle remains at 100%.
At c=1000, Standard falls to 65.17% while Mantle is 86.10%.
Total errors: Standard = 2675, Mantle = 834 (Standard is 3.21× higher).
Dominant observed error type: Bedrock Converse endpoint read timeout (Read timeout on endpoint URL).
Standard errors are strongly non-linear: 585 at c=500 rising to 2090 at c=1000.

Results

System behavior inference from measurements

The following statements are inferential, based on observed metric patterns:

Mantle

Mantle likely uses stronger admission control or batching behavior, reflected by early throughput saturation and higher high-load completion rates.

Standard

Standard behaves more like a direct endpoint under overload: lower latency among successful requests, but sharper reliability collapse and larger error growth at high concurrency.

Winner counts by scoring view

Figure 5. Run wins under each scoring rule.

Median-latency winners by architecture

Figure 6. Standard leads in both Dense and MoE under the median-latency rule.

Results / efficiency and operational complexity

Efficiency and operational complexity

Summarizes stress-regime efficiency trade-offs. Memory footprint was not instrumented in this run; therefore, complexity analysis focuses on latency-throughput-reliability behavior measured directly from execution telemetry.

Data and Stress Conditions

Metric	Standard	Mantle	Mantle vs Standard
Effective RPS at c=500	11.29	28.07	+148.69%
Effective RPS at c=1000	14.16	24.21	+70.96%
Total errors (all runs)	2,675	834	−68.82%
Dense avg queue factor	2.14	1.28	−40.19%

System behavior inference from measurements

Load band	Primary SLO	Recommended path	Why
Low (c=1 to 100)	Lowest latency	Standard	Lower pooled P50, both paths at 100% weighted success.
Medium/high (c=500 to 1000), Dense	Throughput + queue stability	Mantle	Much higher RPS and lower queue factor at equal 100% success.
Medium/high (c=500 to 1000), MoE	Completion reliability	Mantle	Higher weighted success and lower total errors.
Any load band	Median latency only	Standard	More median-latency wins overall (17/24).

Conclusion

Route by workload regime, not by default

This comparative study shows that no single serving path dominates every objective under stress. Standard is frequently favorable in latency-first scoring, while Mantle is consistently stronger in high-load reliability and effective throughput. The key implication is methodological and operational: evaluation frameworks must jointly model latency, completion reliability, and throughput to avoid biased path selection. For production deployment, the most effective policy is conditional routing aligned to workload regime and SLO priority rather than a one-size-fits-all endpoint choice.

Appendix A

Per-run results

Every run in the grid: architecture, input length, concurrency, latency percentiles, success rate, throughput, queue factor, and the median-latency winner.

Input	Conc.	Std P50	Mnt P50	Std P95	Mnt P95	Std P99	Mnt P99	Std Succ	Mnt Succ	Std RPS	Mnt RPS	Std Q	Mnt Q	Winner
200	1	4,916.58	5,519.99	4,916.58	5,519.99	4,916.58	5,519.99	100%	100%	0.2	0.18	1	1	Standard
200	100	8,040.3	7,896.53	8,753.18	8,521.54	8,831.85	9,313.08	100%	100%	11.19	10.7	1.1	1.18	Mantle
200	500	9,087.88	9,157.18	10,229.81	9,872.81	15,965.73	10,744.92	100%	100%	18.8	44.68	2.91	1.2	Standard
200	1000	11,021.01	13,685.37	18,320.36	20,954.85	19,023.54	21,245.57	100%	100%	35.84	43.66	2.51	1.61	Standard
600	1	4,810.52	5,388.89	4,810.52	5,388.89	4,810.52	5,388.89	100%	100%	0.2	0.19	1	1	Standard
600	100	5,364.57	6,123.41	9,054.58	8,858.48	25,744.52	8,957.7	100%	100%	3.87	11.05	4.8	1.46	Standard
600	500	9,734.82	9,585.15	10,678.87	10,261.28	13,892.91	10,652.45	100%	100%	19.07	42.31	2.67	1.22	Mantle
600	1000	11,326.86	12,344.12	18,602.09	19,729.6	19,064.54	20,347.63	100%	100%	34.85	46.18	2.47	1.71	Standard
1000	1	4,492.54	5,431.23	4,492.54	5,431.23	4,492.54	5,431.23	100%	100%	0.21	0.18	1	1	Standard
1000	100	8,175.86	7,913.6	8,751.07	8,453.83	9,890.9	9,218.9	100%	100%	9.93	10.75	1.21	1.16	Mantle
1000	500	9,666.37	9,162.35	10,800.09	9,836.61	12,294.11	10,119.78	100%	100%	19.31	47.3	2.61	1.14	Mantle
1000	1000	11,377.97	12,427.86	18,354.63	19,331.66	18,740.36	19,699.96	100%	100%	35.59	46.11	2.44	1.73	Standard

Ready to transform
your enterprise?

Let's build something that lasts. Our team is ready to talk.

Start the Conversation

One engine for every enterprise problem

Introduction

Main contributions

Why existing evaluations fall short

Background on Standard Bedrock Converse Serving

Background on Mantle-Style OpenAI-Compatible Serving

Comparative and Robustness-Focused Evaluation Work

Task setting and formalization

Concept A: Standard Bedrock Converse

Concept B: Mantle Inference Serving

Illustrative Behavioral Difference

Core variables, grouped by role

Measured variables and protocol

Data and stress conditions

Rationale for Selecting Dense and MoE Architectures

Methods Compared

Method A (Standard)

Method B (Mantle)

Evaluation metrics and protocol

Workload volume and scoring

Averages across all stress runs

Key deltas (Mantle vs Standard).

Performance and scaling

Pooled throughput vs concurrency

Latency behavior

Pooled P50 latency vs concurrency

Reliability and error dynamics

Weighted success rate

Error growth

System behavior inference from measurements

Mantle

Standard

Winner counts by scoring view

Median-latency winners by architecture

Efficiency and operational complexity

Data and Stress Conditions

System behavior inference from measurements

Route by workload regime, not by default

Per-run results

Ready to transformyour enterprise?

Ready to transform
your enterprise?