Architecting a High-Concurrency Feature Flag System with Firestore and ZeroMQ


Our rapidly iterating microservices architecture hit a bottleneck: feature releases were tightly coupled with code deployments. Every minor feature toggle required a full release cycle, which slowed business validation to a crawl and made every production change riskier. Our initial solution was a centralized database table for feature flags, with all service instances polling for state updates. As our services scaled, this not only placed unnecessary load on the database but, more critically, introduced propagation delays on the order of minutes. That was completely unacceptable for scenarios requiring rapid rollbacks or A/B testing.

The core of the problem was the need for a system that could guarantee data consistency, propagate changes in real time, and provide full observability throughout its lifecycle. In this post-mortem, I'll document how we built a high-performance, highly available feature flag platform by combining a seemingly disparate tech stack: Firestore, ZeroMQ, Loki, Storybook, and a custom Firestore-based distributed lock.

Technology Selection: The Chemistry of the Combination

During the initial design phase, we evaluated several options but ultimately settled on this particular combination. Each component solves a critical piece of the puzzle, and together they form a coherent whole.

  1. State Storage & Consistency: Firestore + Distributed Lock
    We chose Firestore as our core state store, primarily for its serverless nature, generous free tier, and native real-time listeners. However, while Firestore's transaction model is powerful, concurrent writes to the same document contend heavily, producing a high rate of aborted and retried transactions. In the feature flag admin panel, multiple administrators might try to modify the same critical flag simultaneously, which would surface as failed saves and lost updates. A reliable distributed lock was therefore essential. Instead of introducing external dependencies like Redis or ZooKeeper and increasing system complexity, we decided to implement a lease-based distributed lock directly on top of Firestore itself.

  2. Real-time Change Distribution: ZeroMQ
    When a flag's state changes, we need a mechanism to instantly notify hundreds or thousands of microservice instances. We ruled out Kafka and RabbitMQ as too heavyweight for this use case; what we needed was a lightweight, low-latency "signal" channel. ZeroMQ's PUB/SUB (Publish/Subscribe) pattern was a perfect fit. It's a library, not middleware, so there is no broker to manage, no single point of failure, and no brokered bottleneck. A central "publisher" service broadcasts a message via ZeroMQ after a change occurs, and all subscribed microservice clients receive the update with sub-millisecond latency on a typical LAN.

  3. Observability: Loki
    For a system that controls live production traffic, auditing and tracing are paramount. Who changed which flag, from what state to what state, and when? Which service instances successfully received this change notification? Which ones failed? Loki’s label-based log aggregation system is perfectly suited to answer these questions. We assign structured labels—like flag_name, user_id, service_instance—to every log line, from API requests and database operations to ZeroMQ message publications and receptions. This makes troubleshooting incidents incredibly efficient.

  4. Admin & Development Interface: Storybook
    This was an unconventional choice. Storybook is typically used for developing and showcasing UI components in isolation. However, we found it could be extended into a powerful "feature flag laboratory." We used it not only to build the UI components for the admin interface but, more importantly, to create stories for our core business components. Each story demonstrates the component's behavior under a different combination of feature flags, so product managers, QA, and developers can visually verify and test the impact of flags in an isolated environment without deploying the entire application; a sketch of such a story follows below.
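
For illustration, a story file in this style might look like the following. This is a hypothetical sketch: the CheckoutButton component, its flags prop, and the flag names are stand-ins, not our actual code.

CheckoutButton.stories.jsx (hypothetical):

import { CheckoutButton } from './CheckoutButton';

export default {
  title: 'Checkout/CheckoutButton',
  component: CheckoutButton,
};

// Each story pins one combination of feature flags so reviewers can see
// exactly how the component behaves under that configuration.
export const LegacyCheckout = {
  args: { flags: { 'new-checkout-flow': false, 'one-click-pay': false } },
};

export const NewFlowOnly = {
  args: { flags: { 'new-checkout-flow': true, 'one-click-pay': false } },
};

export const NewFlowWithOneClickPay = {
  args: { flags: { 'new-checkout-flow': true, 'one-click-pay': true } },
};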

Core Implementation: A Deep Dive into Code and Architecture

The system consists of four main components: a core API service, a ZeroMQ publisher, a microservice SDK (subscriber), and the Storybook admin frontend.

graph TD
    subgraph "Admin UI (Storybook)"
        A[Admin UI] -->|HTTP API Call| B(Core API Service);
    end

    subgraph "Backend Core"
        B -- "1. Request Change" --> C{Feature Flag Change};
        C -- "2. Attempt to Acquire Lock" --> D[Firestore Distributed Lock];
        D -- "3. Lock Acquired" --> E[Write to Firestore];
        E -- "4. Change Persisted" --> F[ZeroMQ Publisher];
        F -- "5. Broadcast Change (PUB)" --> G((ZeroMQ Socket));
    end

    subgraph "Microservice Cluster"
        H1[Service A / SDK] -- "6. Subscribe (SUB)" --> G;
        H2[Service B / SDK] -- "6. Subscribe (SUB)" --> G;
        H3[Service C / SDK] -- "6. Subscribe (SUB)" --> G;
    end

    subgraph "Observability Platform"
        B -- "Logs" --> L[Loki];
        F -- "Logs" --> L;
        H1 -- "Logs" --> L;
        H2 -- "Logs" --> L;
    end

1. Firestore-based Distributed Lock (Golang)

This is the key to guaranteeing atomicity for concurrent write operations. We create a dedicated locks collection where each resource to be locked (e.g., a feature flag name) corresponds to a document. The locking logic is as follows: attempt to create a lock document with a lease expiration time within a transaction. If the document already exists and has not expired, the acquisition fails.

internal/lock/firestore_lock.go:

package lock

import (
	"context"
	"fmt"
	"time"

	"cloud.google.com/go/firestore"
	"google.golang.org/api/option"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

const (
	locksCollection = "distributed_locks"
)

// FirestoreLockManager provides a Firestore-based distributed lock.
type FirestoreLockManager struct {
	client *firestore.Client
	owner  string // Identifier for the lock holder, typically a process ID or hostname.
}

// LockAcquisition represents a successfully acquired lock.
type LockAcquisition struct {
	lockManager *FirestoreLockManager
	lockName    string
	cancel      context.CancelFunc
}

// NewFirestoreLockManager creates a new lock manager.
func NewFirestoreLockManager(ctx context.Context, projectID, ownerID string, opts ...option.ClientOption) (*FirestoreLockManager, error) {
	client, err := firestore.NewClient(ctx, projectID, opts...)
	if err != nil {
		return nil, fmt.Errorf("failed to create firestore client: %w", err)
	}
	return &FirestoreLockManager{
		client: client,
		owner:  ownerID,
	}, nil
}

// Acquire attempts to acquire a lock with a given lease duration.
// This is a blocking operation that will retry until it succeeds or the context is canceled.
func (m *FirestoreLockManager) Acquire(ctx context.Context, lockName string, leaseDuration time.Duration, retryInterval time.Duration) (*LockAcquisition, error) {
	ticker := time.NewTicker(retryInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-ticker.C:
			// For each attempt, create a new context with a cancel func for the heartbeat.
			acquireCtx, cancel := context.WithCancel(ctx)
			
			err := m.client.RunTransaction(acquireCtx, func(ctx context.Context, tx *firestore.Transaction) error {
				lockRef := m.client.Collection(locksCollection).Doc(lockName)
				doc, err := tx.Get(lockRef)

				now := time.Now().UTC()

				// If the lock document exists, check if it has expired.
				if err == nil {
					expiresAt, _ := doc.Data()["expires_at"].(time.Time)
					if now.Before(expiresAt) {
						// Lock is held and not expired.
						return status.Error(codes.Aborted, "lock is currently held")
					}
				} else if status.Code(err) != codes.NotFound {
					// An error other than NotFound occurred while getting the doc.
					return err
				}

				// Lock does not exist or has expired; attempt to acquire it.
				return tx.Set(lockRef, map[string]interface{}{
					"owner":      m.owner,
					"expires_at": now.Add(leaseDuration),
				})
			})

			if err == nil {
				// Successfully acquired the lock.
				la := &LockAcquisition{
					lockManager: m,
					lockName:    lockName,
					cancel:      cancel,
				}
				// Start a goroutine to maintain a heartbeat and renew the lease.
				go la.maintainHeartbeat(acquireCtx, leaseDuration, leaseDuration/2)
				return la, nil
			}

			if status.Code(err) != codes.Aborted {
				// If the error is not Aborted (due to contention), it's a real error.
				cancel()
				return nil, fmt.Errorf("failed to acquire lock in transaction: %w", err)
			}
			
			// Lock is held by another process; cancel this attempt's context and retry.
			cancel()
		}
	}
}

// maintainHeartbeat automatically renews the lock lease.
func (la *LockAcquisition) maintainHeartbeat(ctx context.Context, leaseDuration, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			// The main task has completed (via Release calling cancel), so stop the heartbeat.
			return
		case <-ticker.C:
			lockRef := la.lockManager.client.Collection(locksCollection).Doc(la.lockName)
			_, err := lockRef.Set(ctx, map[string]interface{}{
				"expires_at": time.Now().UTC().Add(leaseDuration),
			}, firestore.MergeAll)
			if err != nil {
				// Renewal failed, possibly due to network or permission issues.
				// In a real-world project, this should have more robust error handling and logging.
				// For example, calling cancel() to release the lock after multiple failures.
				return
			}
		}
	}
}

// Release releases the lock.
func (la *LockAcquisition) Release(ctx context.Context) error {
	// Stop the heartbeat goroutine.
	la.cancel()

	// Use a transaction to ensure only the lock owner can delete the lock.
	err := la.lockManager.client.RunTransaction(ctx, func(ctx context.Context, tx *firestore.Transaction) error {
		lockRef := la.lockManager.client.Collection(locksCollection).Doc(la.lockName)
		doc, err := tx.Get(lockRef)
		if err != nil {
			if status.Code(err) == codes.NotFound {
				// The lock no longer exists, possibly preempted by another process after expiration.
				return nil
			}
			return err
		}

		owner, _ := doc.Data()["owner"].(string)
		if owner != la.lockManager.owner {
			// Trying to release a lock owned by someone else indicates our lock expired and was acquired by another process.
			return fmt.Errorf("cannot release lock held by another owner")
		}

		return tx.Delete(lockRef)
	})

	return err
}
  • Core Logic: The Acquire method uses a loop and a Firestore transaction to attempt creating or updating the lock document. If the document doesn’t exist or its expires_at field has passed, it writes the new owner and expiration time. If the document exists and is not expired, the transaction returns an Aborted error, indicating lock contention, and the code waits to retry. Upon successful acquisition, a background goroutine (maintainHeartbeat) periodically renews the lease to prevent expiration during long-running operations. Release must verify the lock’s owner to avoid incorrectly releasing a lock that has already been acquired by another process.
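
To make the call pattern concrete, here is a minimal Go sketch of how the core API service might wrap a flag write in this lock. The package name, import path, collection name, and lease/retry values are illustrative assumptions, not our exact handler code.

package api

import (
	"context"
	"fmt"
	"time"

	"cloud.google.com/go/firestore"

	"example.com/featureflags/internal/lock" // hypothetical module path
)

// updateFlag serializes writers on a per-flag lock before persisting the change.
func updateFlag(ctx context.Context, lm *lock.FirestoreLockManager, fs *firestore.Client, name string, enabled bool) error {
	// Block until we own the lock for this flag (or ctx is canceled).
	acq, err := lm.Acquire(ctx, "flag:"+name, 30*time.Second, 500*time.Millisecond)
	if err != nil {
		return fmt.Errorf("acquire lock for %q: %w", name, err)
	}
	// Release even if the write fails; a fresh context lets the release run
	// when the request context has already been canceled.
	defer acq.Release(context.Background())

	// The write is now protected from concurrent modification by other admins.
	_, err = fs.Collection("feature_flags").Doc(name).Set(ctx, map[string]interface{}{
		"enabled":    enabled,
		"updated_at": firestore.ServerTimestamp,
	}, firestore.MergeAll)
	return err
}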

2. ZeroMQ Broadcast System (Node.js)

After a successful write to Firestore, the core API service calls the publisher module.

services/flagPublisher.js:

const zmq = require('zeromq');
const pino = require('pino');

// In production, logs should be structured JSON for easy ingestion by Loki.
const logger = pino({
  level: 'info',
  base: { service: 'flag-publisher' },
});

class FlagPublisher {
  constructor(bindAddress) {
    this.socket = new zmq.Publisher();
    this.bindAddress = bindAddress;
  }

  async start() {
    try {
      await this.socket.bind(this.bindAddress);
      logger.info({ address: this.bindAddress }, `ZeroMQ publisher bound successfully.`);
    } catch (err) {
      logger.error({ err, address: this.bindAddress }, 'Failed to bind ZeroMQ publisher.');
      // In a real-world project, retrying with backoff may be preferable to exiting outright.
      process.exit(1);
    }
  }

  /**
   * Broadcasts a change to a feature flag.
   * @param {string} flagName - The name of the feature flag, used as the ZMQ topic.
   * @param {object} payload - The detailed information about the change.
   */
  async publishChange(flagName, payload) {
    // Serialize the payload; ZeroMQ frames carry strings or buffers.
    const message = JSON.stringify(payload);

    // ZeroMQ's PUB/SUB model uses topics (frame prefixes) for message filtering.
    // We use the feature flag name as the topic, so subscribers can choose to
    // subscribe only to the flags they care about.
    // send() returns a Promise in zeromq v6, so await it to surface write errors.
    await this.socket.send([flagName, message]);
    
    logger.info({
      topic: flagName,
      payload: payload,
    }, 'Published feature flag change.');
  }

  async stop() {
    await this.socket.close();
    logger.info('ZeroMQ publisher stopped.');
  }
}

// Example usage:
// const publisher = new FlagPublisher('tcp://*:5555');
// await publisher.start();
// await publisher.publishChange('new-checkout-flow', { enabled: true, rollout: 0.5 });

The SDK used in microservices (subscriber):

sdk/flagSubscriber.js:

const zmq = require('zeromq');
const pino = require('pino');

const logger = pino({
  level: 'info',
  base: { service: 'flag-subscriber-sdk', instance_id: process.env.INSTANCE_ID || 'unknown' },
});

class FlagSubscriber {
  constructor(connectAddress) {
    this.socket = new zmq.Subscriber();
    this.connectAddress = connectAddress;
    this.flagState = new Map(); // In-memory cache of feature flag states.
    this.isStopped = false;
  }

  async connect() {
    this.socket.connect(this.connectAddress);
    logger.info({ address: this.connectAddress }, 'ZeroMQ subscriber connecting...');
    
    // In a real-world project, a more sophisticated reconnection strategy would be needed.
    this.socket.events.on('connect', () => {
        logger.info({ address: this.connectAddress }, 'Successfully connected to publisher.');
    });
    this.socket.events.on('disconnect', () => {
        if (!this.isStopped) {
            logger.warn({ address: this.connectAddress }, 'Disconnected from publisher. Will attempt to reconnect.');
        }
    });
  }

  /**
   * Subscribes to one or more feature flags.
   * @param {string|string[]} flagNames - The names of the flags to subscribe to.
   */
  subscribe(flagNames) {
    const names = Array.isArray(flagNames) ? flagNames : [flagNames];
    for (const name of names) {
      this.socket.subscribe(name);
      logger.info({ topic: name }, 'Subscribed to feature flag topic.');
    }
  }
  
  // Subscribe to all messages.
  subscribeAll() {
      this.socket.subscribe('');
  }

  async startListening(onUpdateCallback) {
    for await (const [topic, msg] of this.socket) {
      try {
        const flagName = topic.toString();
        const payload = JSON.parse(msg.toString());
        
        // Update the local cache.
        this.flagState.set(flagName, payload);

        logger.info({
          topic: flagName,
          payload: payload,
          source: 'zmq'
        }, 'Received and processed feature flag update.');

        if (onUpdateCallback) {
          onUpdateCallback(flagName, payload);
        }
      } catch (err) {
        logger.error({ err, raw_message: msg.toString() }, 'Failed to parse incoming message.');
      }
    }
  }

  // Synchronously gets flag state from the in-memory cache.
  getFlag(flagName, defaultValue = false) {
    const state = this.flagState.get(flagName);
    return state ? state.enabled : defaultValue;
  }
  
  async stop() {
    this.isStopped = true;
    await this.socket.close();
    logger.info('ZeroMQ subscriber stopped.');
  }
}
  • Core Logic: The FlagPublisher binds to a TCP port and awaits connections. Its publishChange method uses the flag name as the message topic and the flag state as the message content. The FlagSubscriber connects to the publisher’s address and uses the subscribe method to specify which topics it’s interested in. It receives messages in an async loop, parses them, and updates a local in-memory cache. Application code can then call getFlag to get the latest flag state in microseconds, without any network I/O.
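
For completeness, a usage sketch of the SDK inside a service process follows; the publisher address, flag names, and environment variable are placeholders, not our production values.

// service startup (illustrative wiring)
const { FlagSubscriber } = require('./sdk/flagSubscriber'); // assuming the SDK exports the class

const subscriber = new FlagSubscriber(process.env.FLAG_PUB_ADDRESS || 'tcp://flag-publisher:5555');

async function initFeatureFlags() {
  await subscriber.connect();
  subscriber.subscribe(['new-checkout-flow', 'one-click-pay']);

  // Not awaited: startListening runs for the lifetime of the process.
  subscriber.startListening((flagName, payload) => {
    // Optional hook, e.g. invalidate request-scoped caches when a flag flips.
  }).catch((err) => {
    console.error('flag listener stopped unexpectedly', err);
  });
}

// On the request path: a pure in-memory lookup, no network I/O.
function isNewCheckoutEnabled() {
  return subscriber.getFlag('new-checkout-flow', false);
}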

3. Storybook as an Admin Console

We created a custom Storybook addon that adds a “Feature Flags” panel to the sidebar. This panel communicates with our core API to fetch the current state of all flags and provides a UI to modify them.

addons/feature-flags/panel.js (React & Storybook Addon API):

import React, { useState, useEffect } from 'react';
import { AddonPanel } from '@storybook/components';
import { useChannel } from '@storybook/manager-api';

// This is a simplified example; real API calls would need error handling and auth.
const fetchFlags = async () => { /* ... API call to GET /api/flags ... */ };
const updateFlag = async (name, payload) => { /* ... API call to PUT /api/flags/{name} ... */ };

const FlagControl = ({ flag, onUpdate }) => {
  const [isEnabled, setIsEnabled] = useState(flag.enabled);
  // ... other controls, like rollout percentage

  const handleToggle = async (e) => {
    const newState = e.target.checked;
    setIsEnabled(newState);
    await updateFlag(flag.name, { enabled: newState });
    onUpdate(); // Notify parent component to refresh
  };

  return (
    <div>
      <strong>{flag.name}</strong>
      <label>
        <input type="checkbox" checked={isEnabled} onChange={handleToggle} />
        Enabled
      </label>
    </div>
  );
};

export const FlagPanel = ({ active }) => {
  // Hooks must run on every render, so the `active` check is moved below them.
  const [flags, setFlags] = useState([]);
  const [loading, setLoading] = useState(true);

  const loadFlags = async () => {
    setLoading(true);
    const data = await fetchFlags();
    setFlags(data);
    setLoading(false);
  };

  useEffect(() => {
    loadFlags();
  }, []);
  
  // Use Storybook's event channel to notify the Canvas Iframe to refresh after a change.
  const emit = useChannel({});
  const handleFlagUpdate = () => {
    emit('forceRemount');
  }

  if (!active) return null;
  if (loading) return <div>Loading flags...</div>;

  return (
    <AddonPanel active={active}>
      <div style={{ padding: '10px' }}>
        {flags.map(flag => (
          <FlagControl key={flag.name} flag={flag} onUpdate={handleFlagUpdate} />
        ))}
      </div>
    </AddonPanel>
  );
};
  • Integration Logic: Storybook’s addon API allows us to create custom panels. This panel is essentially a standalone React app that communicates with our backend API. The most clever part is that after a flag is modified, we use Storybook’s internal event channel (emit('forceRemount')) to force a refresh of the currently viewed story. This enables a developer to open a component in Storybook, toggle a relevant feature flag in our custom panel, and immediately see the UI change.
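
For completeness, registering the panel with Storybook looks roughly like this. It is a sketch against the standard Storybook manager API; the addon ID and title are placeholders.

.storybook/manager.js:

import React from 'react';
import { addons, types } from '@storybook/manager-api';
import { FlagPanel } from '../addons/feature-flags/panel';

addons.register('feature-flags', () => {
  addons.add('feature-flags/panel', {
    type: types.PANEL,
    title: 'Feature Flags',
    // Storybook passes `active` so the panel knows when it is the visible tab.
    render: ({ active }) => <FlagPanel active={active} />,
  });
});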

Known Issues and Future Iterations

This system has been running stably for some time, but it’s not perfect. There are tradeoffs and limitations we were aware of from the start.

First, the performance of the Firestore-based distributed lock is mediocre under high write contention. In an extreme future scenario where hundreds of admins are frequently modifying the same config, the transaction failure rate would increase significantly. At that point, introducing a dedicated, memory-based coordination service like etcd or Redis would be a better choice.

Second, ZeroMQ’s PUB/SUB model is “fire-and-forget”; it doesn’t guarantee message delivery. If a microservice instance is offline or restarting when a change is published, it will miss that update. It will only sync up on the next change or when the service restarts and fetches the full config from Firestore. For scenarios requiring 100% guaranteed configuration delivery, supplementing with a lightweight, persistent message queue might be necessary.
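
The "fetch the full config from Firestore" fallback can be made explicit with a small reconciliation helper and extended into a periodic loop. This is only a sketch; the collection name, interval, and Firestore client wiring are assumptions.

// Re-read the full flag set from Firestore so an instance that missed a
// ZeroMQ broadcast still converges eventually.
const { Firestore } = require('@google-cloud/firestore');

async function startPeriodicSync(subscriber, intervalMs = 60000) {
  const db = new Firestore();
  const sync = async () => {
    const snapshot = await db.collection('feature_flags').get();
    for (const doc of snapshot.docs) {
      // Write straight into the SDK's in-memory cache.
      subscriber.flagState.set(doc.id, doc.data());
    }
  };
  await sync();                          // seed the cache on startup
  return setInterval(sync, intervalMs);  // then reconcile on an interval
}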

Finally, the current message broadcast is global. As the number of flags grows, having every service receive updates for every flag is wasteful. A future optimization would be to implement a more granular topic strategy, such as partitioning by project or service domain. This would allow services to subscribe only to the small subset of flags they actually care about, reducing network and CPU overhead.

