Our mobile CI/CD platform was grappling with a critical state consistency problem. When our pipeline triggers the deployment of a new, fine-tuned LLM, the action isn’t a single operation. It requires an atomic update across three distinct backend services:
- Model Registry Service: Registers the new model’s version, metadata, and access path.
- GPU Resource Service: Allocates and durably binds a set of GPU resources from the cluster to this new model.
- Billing Service: Activates the billing policy associated with the model.
If any one of these steps fails, the entire system enters a hazardous intermediate state. For instance, if the model is visible in the registry but has no GPU resources allocated, every invocation will fail. Conversely, if resources are allocated but billing isn’t activated, it results in a direct financial loss. In a real-world project, the cost of troubleshooting and manual data reconciliation for such inconsistencies is unacceptable. We needed an atomic, cross-service guarantee.
Option A: The Saga Pattern and Its Limitations
In a microservices architecture, the standard answer to this problem is often the Saga pattern. It coordinates a series of local transactions; if one step fails, corresponding compensating transactions are executed to roll back the preceding operations.
- Pros: Services are decoupled, execution is asynchronous, and it’s inherently highly available. Each service only needs to manage its own transaction and compensation logic.
- Cons:
  - Complexity of Compensation Logic: Writing robust compensating transactions is extremely challenging. De-allocating GPU resources is straightforward, but if the billing service has already sent an instruction to a third-party payment gateway, compensation might be impossible or prohibitively expensive.
  - Eventual Consistency Window: The system is in a de facto inconsistent state until the compensation completes. For a critical operation like model deployment, we cannot tolerate even a few seconds of a “model registered but unavailable” state.
  - Debugging Difficulty: Tracing the root cause of an issue across a chain of asynchronous messages is notoriously difficult.
In our scenario, model deployment is a low-frequency, high-importance operation. The availability requirement for this specific action (where a few seconds of delay is tolerable) is less stringent than the requirement for data consistency. Therefore, the eventual consistency model of the Saga pattern became an unacceptable drawback.
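For contrast, a saga-style orchestration of the same deployment would look roughly like the sketch below. This is purely illustrative and not something we shipped; the `registry`, `gpu`, and `billing` clients and their methods are invented for the example.

```typescript
// Hypothetical saga orchestrator: each step pairs an action with a compensating action.
// On failure, already-completed steps are compensated in reverse order.
interface SagaStep {
  name: string;
  action: () => Promise<void>;
  compensate: () => Promise<void>;
}

async function runSaga(steps: SagaStep[]): Promise<void> {
  const completed: SagaStep[] = [];
  for (const step of steps) {
    try {
      await step.action();
      completed.push(step);
    } catch (err) {
      console.error(`Saga step "${step.name}" failed, compensating...`, err);
      // Roll back completed steps in reverse order. Between the failure and the last
      // compensation finishing, the system sits in exactly the inconsistent state we
      // could not tolerate for model deployments.
      for (const done of completed.reverse()) {
        await done.compensate(); // Must itself be retried and monitored if it fails.
      }
      throw err;
    }
  }
}

// Illustrative usage with invented service clients:
// await runSaga([
//   { name: 'register-model',   action: () => registry.registerModel(modelInfo), compensate: () => registry.deleteModel(modelInfo.id) },
//   { name: 'allocate-gpu',     action: () => gpu.allocate(modelInfo.id),        compensate: () => gpu.release(modelInfo.id) },
//   { name: 'activate-billing', action: () => billing.activate(modelInfo.id),    compensate: () => billing.deactivate(modelInfo.id) },
// ]);
```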
Option B: A Deliberate Choice for Two-Phase Commit (2PC)
Two-Phase Commit is a classic protocol for achieving strong consistency in distributed transactions. It introduces a “Transaction Coordinator” to ensure that all “Participants” either commit or roll back together.
- Phase 1: Prepare: The coordinator sends a “prepare” request to all participants. Each participant executes its local transaction up to a committable state, locks the necessary resources, and responds to the coordinator with “can commit” or “cannot commit.”
- Phase 2: Commit:
  - If all participants respond with “can commit,” the coordinator sends a “commit” request to all of them.
  - If any participant responds with “cannot commit” or times out, the coordinator sends a “rollback” request to all of them.
The drawbacks of this approach are equally significant:
- Synchronous Blocking: Throughout the entire transaction—from sending the `prepare` request to receiving all `commit`/`abort` acknowledgments—the resources locked by participants are blocked.
- Coordinator Single Point of Failure (SPOF): If the coordinator crashes in the second phase after sending only some of the commit requests, a subset of participants will have committed while others wait indefinitely for instructions, leading to data inconsistency.
- Network Partition Issues: Extreme network conditions can lead to a “split-brain” scenario.
Despite these well-known flaws, we ultimately chose 2PC. Our reasoning was as follows: the transaction execution time for model deployment is very short (typically within seconds), making temporary resource locking acceptable. The coordinator’s SPOF can be mitigated with high-availability mechanisms (like a Raft-based cluster), but for our v1 implementation, we opted to manage this risk with comprehensive monitoring and alerting (via Prometheus) and a clear manual intervention plan. Most importantly, 2PC provided the one thing our scenario absolutely required: atomicity.
Core Implementation Overview
Our architecture centers around a central Transaction Coordinator (TC) implemented in Node.js. The CI/CD system triggers the TC via an API call, and the TC orchestrates the 2PC protocol with the three participant services.
```mermaid
sequenceDiagram
    participant CI as CI/CD Pipeline
    participant TC as Transaction Coordinator (Node.js)
    participant MR as Model Registry Svc
    participant GPU as GPU Resource Svc
    participant BILL as Billing Svc
    CI->>+TC: POST /v1/transactions (modelInfo)
    Note over TC: Create transaction, state: PENDING
    par Phase 1: Prepare
        TC->>+MR: POST /prepare (txId, modelInfo)
        MR-->>-TC: 200 OK { voted: 'yes' }
    and
        TC->>+GPU: POST /prepare (txId, modelInfo)
        GPU-->>-TC: 200 OK { voted: 'yes' }
    and
        TC->>+BILL: POST /prepare (txId, modelInfo)
        BILL-->>-TC: 200 OK { voted: 'yes' }
    end
    Note over TC: All 'yes' votes received, state: PREPARED
    Note over TC: Persist transaction state, preparing for Commit phase
    par Phase 2: Commit
        TC->>+MR: POST /commit (txId)
        MR-->>-TC: 200 OK { status: 'committed' }
    and
        TC->>+GPU: POST /commit (txId)
        GPU-->>-TC: 200 OK { status: 'committed' }
    and
        TC->>+BILL: POST /commit (txId)
        BILL-->>-TC: 200 OK { status: 'committed' }
    end
    Note over TC: All commit acks received, state: COMMITTED
    TC-->>-CI: 200 OK { transactionId, status: 'committed' }
```
The Node.js Transaction Coordinator Implementation
The coordinator is the heart of the architecture. It must manage transaction states, communicate with participants, handle timeouts and failures, and expose critical metrics to Prometheus. We used TypeScript to enhance code robustness.
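The coordinator listing that follows imports a few supporting modules (`./types`, `./TransactionLogger`) that we don’t reproduce in full. As a rough sketch of the shapes they would need under our assumptions (a JSON-lines file standing in for a real write-ahead log), something like this suffices:

```typescript
// src/coordinator/types.ts (assumed shapes; adapt to your own project)
export type TransactionState = 'PENDING' | 'PREPARED' | 'COMMITTED' | 'ABORTED';

export interface Participant {
  name: string;
  url: string;
}

// How a participant is tracked inside a single transaction.
export interface ParticipantStatus extends Participant {
  vote: 'YES' | 'NO' | null;
  state: 'PENDING' | 'COMMITTED' | 'ABORTED' | 'UNKNOWN';
}

export interface Transaction {
  id: string;
  state: TransactionState;
  payload: unknown;
  participants: ParticipantStatus[];
  startTime: number;
  endTime?: number;
}

// --- src/coordinator/TransactionLogger.ts (assumed): a naive append-only JSON-lines log ---
import { appendFile, readFile } from 'fs/promises';

export class TransactionLogger {
  constructor(private filePath: string) {}

  // Append a snapshot of the transaction and, once known, the coordinator's decision.
  async log(tx: Transaction, decision?: 'COMMIT' | 'ABORT'): Promise<void> {
    const entry = JSON.stringify({ ...tx, decision, loggedAt: Date.now() });
    await appendFile(this.filePath, entry + '\n');
  }

  // Return the latest snapshot of every transaction that never reached a terminal state.
  async loadUnresolved(): Promise<Transaction[]> {
    let lines: string[] = [];
    try {
      lines = (await readFile(this.filePath, 'utf8')).split('\n').filter(Boolean);
    } catch {
      return []; // No log file yet.
    }
    const latest = new Map<string, Transaction>();
    for (const line of lines) {
      const tx = JSON.parse(line) as Transaction;
      latest.set(tx.id, tx);
    }
    return [...latest.values()].filter(tx => tx.state !== 'COMMITTED' && tx.state !== 'ABORTED');
  }
}
```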
```typescript
// src/coordinator/TransactionCoordinator.ts
import { v4 as uuidv4 } from 'uuid';
import axios, { AxiosInstance } from 'axios';
import { Transaction, TransactionState, Participant } from './types';
import { TransactionLogger } from './TransactionLogger';
import { MetricsCollector } from './MetricsCollector';
// A production-grade implementation would require a more robust persistence layer, e.g., a WAL or a database.
// For demonstration purposes, we're using a simple file-based log.
const txLogger = new TransactionLogger('./transactions.log');
export class TransactionCoordinator {
private transactions: Map<string, Transaction> = new Map();
private httpClient: AxiosInstance;
constructor(private participants: Participant[]) {
this.httpClient = axios.create({ timeout: 5000 }); // Set a reasonable timeout
this.loadTransactionsFromLog();
}
// On startup, recover any unfinished transactions from the log.
private async loadTransactionsFromLog() {
const unresolved = await txLogger.loadUnresolved();
unresolved.forEach(tx => {
this.transactions.set(tx.id, tx);
// For transactions in the PREPARED state, we need to retry the commit or abort.
if (tx.state === 'PREPARED') {
// This is a simplified approach; in reality, the decision to commit or abort would be determined from the log.
// Assuming the final decision is recorded in the log.
console.log(`[Recovery] Retrying commit for transaction ${tx.id}`);
this.executePhase2(tx, 'COMMIT');
}
});
}
public async executeTransaction(payload: any): Promise<Transaction> {
const transactionId = uuidv4();
const transaction: Transaction = {
id: transactionId,
state: 'PENDING',
payload,
participants: this.participants.map(p => ({ ...p, vote: null, state: 'PENDING' })),
startTime: Date.now(),
};
this.transactions.set(transactionId, transaction);
await txLogger.log(transaction);
// --- Phase 1: Prepare ---
const prepareStartTime = Date.now();
const preparePromises = transaction.participants.map(p => this.sendPrepare(transaction, p));
const results = await Promise.allSettled(preparePromises);
MetricsCollector.observeTransactionPhaseDuration('prepare', (Date.now() - prepareStartTime) / 1000);
const votes = results.map((res, index) => {
const participant = transaction.participants[index];
if (res.status === 'fulfilled' && res.value) {
participant.vote = 'YES';
return 'YES';
} else {
participant.vote = 'NO';
MetricsCollector.incrementParticipantFailures(participant.name, 'prepare');
return 'NO';
}
});
// --- Decision ---
const decision: 'COMMIT' | 'ABORT' = votes.every(v => v === 'YES') ? 'COMMIT' : 'ABORT';
transaction.state = 'PREPARED';
await txLogger.log(transaction, decision); // Persist the final decision. This is a critical step.
// --- Phase 2: Commit/Abort ---
const phase2StartTime = Date.now();
await this.executePhase2(transaction, decision);
MetricsCollector.observeTransactionPhaseDuration(decision.toLowerCase(), (Date.now() - phase2StartTime) / 1000);
return transaction;
}
private async sendPrepare(transaction: Transaction, participant: Participant): Promise<boolean> {
try {
console.log(`[${transaction.id}] Preparing participant: ${participant.name}`);
const response = await this.httpClient.post(`${participant.url}/prepare`, {
transactionId: transaction.id,
payload: transaction.payload,
});
return response.status === 200;
} catch (error) {
console.error(`[${transaction.id}] Participant ${participant.name} failed to prepare:`, error.message);
return false;
}
}
private async executePhase2(transaction: Transaction, decision: 'COMMIT' | 'ABORT') {
const action = decision === 'COMMIT' ? 'commit' : 'abort';
console.log(`[${transaction.id}] Executing phase 2: ${action.toUpperCase()}`);
const promises = transaction.participants.map(async (p) => {
try {
// A retry mechanism is needed here.
await this.httpClient.post(`${p.url}/${action}`, { transactionId: transaction.id });
p.state = decision === 'COMMIT' ? 'COMMITTED' : 'ABORTED';
} catch (error) {
console.error(`[${transaction.id}] Participant ${p.name} failed to ${action}:`, error.message);
MetricsCollector.incrementParticipantFailures(p.name, action);
p.state = 'UNKNOWN'; // Mark as UNKNOWN, requires manual intervention.
}
});
await Promise.all(promises);
    // The global outcome follows the logged decision. Participants left in UNKNOWN after a
    // failed commit/abort call do not change that outcome; they require retries or manual repair.
    const finalState = decision === 'COMMIT' ? 'COMMITTED' : 'ABORTED';
    transaction.state = finalState;
transaction.endTime = Date.now();
MetricsCollector.incrementTransactionTotal(finalState.toLowerCase());
MetricsCollector.observeTransactionDuration((transaction.endTime - transaction.startTime) / 1000);
await txLogger.log(transaction);
this.transactions.delete(transaction.id); // Clean up the completed transaction.
}
}
```
This code illustrates the coordinator’s core logic. The key takeaways are:
- State Persistence: After making a `COMMIT` or `ABORT` decision, that decision must first be written to a durable log (`txLogger.log(transaction, decision)`) before sending instructions to the participants. This ensures that if the coordinator crashes and restarts, it can recover the final state of the transaction and complete the second phase.
- Timeouts and Retries: In a production environment, the `httpClient` calls would need more sophisticated retry logic (e.g., exponential backoff), especially during phase two, as the instructions must be delivered (a minimal sketch follows this list).
- Observability: `MetricsCollector` records transaction counts, latency, and failure information at critical points in the process.
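For the retry point specifically, here is a minimal sketch of a backoff helper; the `postWithRetry` name and its parameters are ours for illustration, and `executePhase2` would call it in place of the bare `httpClient.post`:

```typescript
// Hypothetical helper: retry a POST with exponential backoff and a little jitter.
import { AxiosInstance } from 'axios';

export async function postWithRetry(
  client: AxiosInstance,
  url: string,
  body: unknown,
  maxAttempts = 5,
  baseDelayMs = 200,
): Promise<void> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await client.post(url, body);
      return;
    } catch (err) {
      lastError = err;
      // Backoff schedule: ~200ms, 400ms, 800ms, 1600ms, ...
      const delay = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 100;
      console.warn(`POST ${url} failed (attempt ${attempt}/${maxAttempts}), retrying in ${Math.round(delay)}ms`);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
  throw lastError; // Exhausted retries; the caller marks the participant UNKNOWN and alerts.
}
```

Even with retries, a participant that stays unreachable ends up marked `UNKNOWN` and is surfaced through the metrics and alerts described below.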
Participant Implementation
Each participant service must implement three endpoints: /prepare, /commit, and /abort. Below is a simplified example for the GPU Resource Service.
```typescript
// src/participants/GpuResourceService.ts
import express from 'express';
import { Database } from 'sqlite3'; // Pseudocode, assuming a database is used.
const app = express();
app.use(express.json());
const db = new Database(':memory:'); // Use a real database in production.
db.serialize(() => {
db.run("CREATE TABLE gpus (id TEXT PRIMARY KEY, status TEXT, reserved_by TEXT)");
db.run("INSERT INTO gpus VALUES ('gpu-001', 'available', NULL)");
db.run("INSERT INTO gpus VALUES ('gpu-002', 'available', NULL)");
});
// Store info for transactions in the prepared state.
const preparedTransactions = new Map<string, { gpuId: string }>();
// Phase 1: Prepare
app.post('/prepare', (req, res) => {
const { transactionId, payload } = req.body;
// 1. Check for available resources.
db.get("SELECT id FROM gpus WHERE status = 'available' LIMIT 1", (err, row) => {
if (err || !row) {
console.error(`[${transactionId}] No available GPUs for reservation.`);
return res.status(500).send('No available resources');
}
const gpuId = row.id;
    // 2. Lock the resource (the core logic).
    // Update status to 'reserved' and record the transaction ID. The extra
    // `status = 'available'` guard plus the changed-row check keeps two concurrent
    // prepares from reserving the same GPU.
    db.run(
      "UPDATE gpus SET status = 'reserved', reserved_by = ? WHERE id = ? AND status = 'available'",
      [transactionId, gpuId],
      function (updateErr) {
        if (updateErr || this.changes === 0) {
          return res.status(500).send('Failed to lock resource');
        }
        // 3. Record the prepared state for use in commit/abort.
        preparedTransactions.set(transactionId, { gpuId });
        console.log(`[${transactionId}] Reserved GPU ${gpuId}. Voted YES.`);
        res.status(200).send('Prepared');
      }
    );
});
});
// Phase 2: Commit
app.post('/commit', (req, res) => {
const { transactionId } = req.body;
const txInfo = preparedTransactions.get(transactionId);
if (!txInfo) {
// If transaction info isn't found, it might have already been committed or this is a retry from the coordinator.
// Idempotency: check if the resource is already in a committed state.
return res.status(200).send('Already committed or unknown transaction');
}
// Formally change the resource status from 'reserved' to 'in_use'.
db.run("UPDATE gpus SET status = 'in_use' WHERE id = ? AND reserved_by = ?", [txInfo.gpuId, transactionId], (err) => {
if (err) {
// Critical error, needs alerting.
return res.status(500).send('Failed to commit resource');
}
preparedTransactions.delete(transactionId);
console.log(`[${transactionId}] Committed GPU ${txInfo.gpuId}.`);
res.status(200).send('Committed');
});
});
// Phase 2: Abort
app.post('/abort', (req, res) => {
const { transactionId } = req.body;
const txInfo = preparedTransactions.get(transactionId);
if (!txInfo) {
return res.status(200).send('Already aborted or unknown transaction');
}
// Release the locked resource.
db.run("UPDATE gpus SET status = 'available', reserved_by = NULL WHERE id = ? AND reserved_by = ?", [txInfo.gpuId, transactionId], (err) => {
if (err) {
// Critical error, needs alerting.
return res.status(500).send('Failed to abort resource');
}
preparedTransactions.delete(transactionId);
console.log(`[${transactionId}] Aborted reservation for GPU ${txInfo.gpuId}.`);
res.status(200).send('Aborted');
});
});
app.listen(3002, () => console.log('GPU Resource Service listening on port 3002'));
```
The critical part of the participant’s implementation is the /prepare endpoint. It must perform all fallible operations (like checking inventory or permissions), then lock the resource in an intermediate state (reserved), and ensure this state is durable. The logic for /commit and /abort should be comparatively simple, designed to be non-failing, and must be idempotent, as the coordinator might retry these calls.
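One simplification in the example above is worth flagging: `preparedTransactions` is an in-memory Map, so a participant restart between prepare and commit would lose the reservation mapping even though the GPU row is already marked `reserved`. A sketch of persisting that bookkeeping instead, assuming the same `db` and `app` handles and a hypothetical `prepared_txns` table, with `/commit` rewritten to read from it:

```typescript
// Sketch: durable prepare-phase bookkeeping so the reservation mapping survives a restart.
db.run("CREATE TABLE IF NOT EXISTS prepared_txns (tx_id TEXT PRIMARY KEY, gpu_id TEXT)");

// In /prepare, after reserving the GPU, persist the mapping instead of (or alongside) the Map:
// db.run("INSERT INTO prepared_txns (tx_id, gpu_id) VALUES (?, ?)", [transactionId, gpuId]);

// /commit then resolves the reservation from the table, which also keeps it idempotent:
app.post('/commit', (req, res) => {
  const { transactionId } = req.body;
  db.get("SELECT gpu_id FROM prepared_txns WHERE tx_id = ?", [transactionId], (err, row: any) => {
    if (err) return res.status(500).send('Lookup failed');
    if (!row) return res.status(200).send('Already committed or unknown transaction');
    db.run("UPDATE gpus SET status = 'in_use' WHERE id = ?", [row.gpu_id], (updateErr) => {
      if (updateErr) return res.status(500).send('Failed to commit resource');
      db.run("DELETE FROM prepared_txns WHERE tx_id = ?", [transactionId]);
      res.status(200).send('Committed');
    });
  });
});
```

The `/abort` handler changes symmetrically: look the GPU up in `prepared_txns`, set it back to `available`, and delete the row.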
Observability with Prometheus
Simply implementing the protocol is not enough. For a fragile protocol like 2PC, operating without robust observability is like flying blind. We used the prom-client library to expose the coordinator’s internal state as Prometheus metrics.
```typescript
// src/coordinator/MetricsCollector.ts
import client from 'prom-client';
const register = new client.Registry();
client.collectDefaultMetrics({ register });
const transactionTotal = new client.Counter({
name: 'llm_deploy_transactions_total',
help: 'Total number of LLM deployment transactions',
labelNames: ['status'], // 'committed', 'aborted'
});
register.registerMetric(transactionTotal);
const transactionDuration = new client.Histogram({
name: 'llm_deploy_transaction_duration_seconds',
help: 'Duration of LLM deployment transactions in seconds',
buckets: [0.1, 0.5, 1, 2, 5, 10], // Adjust buckets based on the specific business context
});
register.registerMetric(transactionDuration);
const transactionPhaseDuration = new client.Histogram({
name: 'llm_deploy_transaction_phase_duration_seconds',
help: 'Duration of transaction phases',
labelNames: ['phase'], // 'prepare', 'commit', 'abort'
});
register.registerMetric(transactionPhaseDuration);
const participantFailures = new client.Counter({
name: 'llm_deploy_participant_failures_total',
help: 'Total failures per participant and phase',
labelNames: ['participant', 'phase'],
});
register.registerMetric(participantFailures);
export class MetricsCollector {
static incrementTransactionTotal(status: 'committed' | 'aborted') {
transactionTotal.labels(status).inc();
}
static observeTransactionDuration(seconds: number) {
transactionDuration.observe(seconds);
}
static observeTransactionPhaseDuration(phase: string, seconds: number) {
transactionPhaseDuration.labels(phase).observe(seconds);
}
static incrementParticipantFailures(participantName: string, phase: string) {
participantFailures.labels(participantName, phase).inc();
}
static async getMetrics() {
return register.metrics();
}
}
// Expose the /metrics endpoint in the main service
// app.get('/metrics', async (req, res) => {
// res.set('Content-Type', register.contentType);
// res.end(await MetricsCollector.getMetrics());
// });
```
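To show how the pieces fit together, a minimal entry point for the coordinator service might look like the sketch below. The participant names, hostnames, and ports are placeholders, not our real service endpoints:

```typescript
// src/coordinator/server.ts (sketch): wire the coordinator, its HTTP entry point, and the
// Prometheus scrape endpoint together.
import express from 'express';
import { TransactionCoordinator } from './TransactionCoordinator';
import { MetricsCollector } from './MetricsCollector';

const app = express();
app.use(express.json());

// Placeholder participant endpoints; in practice these come from configuration or service discovery.
const coordinator = new TransactionCoordinator([
  { name: 'model-registry', url: 'http://model-registry:3001' },
  { name: 'gpu-resource', url: 'http://gpu-resource:3002' },
  { name: 'billing', url: 'http://billing:3003' },
]);

// Entry point called by the CI/CD pipeline (see the sequence diagram above).
app.post('/v1/transactions', async (req, res) => {
  try {
    const tx = await coordinator.executeTransaction(req.body);
    res
      .status(tx.state === 'COMMITTED' ? 200 : 409)
      .json({ transactionId: tx.id, status: tx.state.toLowerCase() });
  } catch (err) {
    res.status(500).json({ error: (err as Error).message });
  }
});

// Prometheus scrape endpoint.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', 'text/plain');
  res.end(await MetricsCollector.getMetrics());
});

app.listen(3000, () => console.log('Transaction Coordinator listening on port 3000'));
```

The `POST /v1/transactions` route is what the CI/CD pipeline calls in the sequence diagram above, and `/metrics` is what Prometheus scrapes.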
With these metrics, we can build dashboards and alerting rules:
- Alert Rule: `rate(llm_deploy_transactions_total{status="aborted"}[5m]) > 0` -> Continuous transaction aborts may indicate a systemic issue.
- Alert Rule: `increase(llm_deploy_participant_failures_total[10m]) > 0` -> A specific participant is failing repeatedly, requiring immediate intervention.
- Dashboard: Monitoring the p99 latency of `llm_deploy_transaction_duration_seconds` (e.g., `histogram_quantile(0.99, sum(rate(llm_deploy_transaction_duration_seconds_bucket[5m])) by (le))`) can reveal performance degradation over time.
Limitations and Future Path
This 2PC solution, implemented in Node.js, solved our immediate need for atomicity in LLM model deployments. However, it is by no means a silver bullet for distributed transactions. Its inherently synchronous, blocking model limits throughput, making it unsuitable for high-frequency scenarios.
The coordinator is the system’s central bottleneck and point of risk. While we’ve enhanced its robustness through a persistent log and a restart recovery mechanism, it is still not a truly highly available component. The next iteration for a production environment would inevitably involve introducing a consensus algorithm like Raft or Paxos to turn the coordinator itself into a distributed, HA cluster, which would significantly increase system complexity.
Furthermore, the 2PC protocol forces multiple independent services into a tight coupling for the duration of the transaction, which runs counter to the microservice philosophy of loose coupling and independent evolution. Therefore, this architectural choice was a conscious trade-off: sacrificing architectural elegance for an uncompromising consistency guarantee in a specific business context. It should be treated as a surgical tool, used only for those critical, low-frequency core operations that cannot be resolved with eventual consistency patterns.