Building a Secure Dynamic Credential Architecture for Milvus and Pub/Sub with Terraform and Vault


Early in our project, we faced a critical challenge: how to provision credentials for a real-time security log analysis pipeline within a highly automated CI/CD environment while completely eradicating long-lived static keys. The core components of this pipeline were Google Cloud Pub/Sub for event streaming and a Milvus database for vectorized log analysis. Static credentials, whether stored in Kubernetes Secrets or injected as environment variables, pose a significant leakage risk, are complex to rotate, and are an auditor’s nightmare.

Our goal was to implement a “zero-trust” credential management model: services would prove their identity upon startup and then fetch short-lived, least-privilege dynamic credentials on demand. The entire infrastructure and security policy had to be managed as code (IaC) to ensure repeatability and auditability. This immediately pointed us toward the combination of Terraform and HashiCorp Vault.

Architecture Decisions and Trade-offs

The initial concept was simple: a Go service subscribes to security logs (such as VPC flow logs and firewall logs) from Pub/Sub, converts them into vector embeddings using an embedding model, and stores them in Milvus for similarity analysis to detect potential attack patterns. The challenge was how this processor service would securely obtain credentials for both Pub/Sub and Milvus.

Option 1: GCP IAM Service Account Keys. This is the most direct approach. We could use Terraform to create a service account, generate a JSON key, and distribute it to the application (e.g., via Kubernetes Secrets). The drawbacks are obvious: the key is long-lived, and its compromise would have a massive blast radius. Rotating these keys requires complex automation and makes it difficult to track who used which key and when.

Option 2: Leveraging Workload Identity. For services within GCP, Workload Identity allows a GKE pod to impersonate a GCP service account, enabling access to GCP resources like Pub/Sub without a JSON key. This solves the Pub/Sub access problem, but Milvus is self-hosted on GKE and doesn’t natively support GCP IAM authentication. We would still need to manage a database username and password for Milvus.

Ultimately, we landed on Option 3: a Vault-centric dynamic credential architecture.

graph TD
    subgraph "Google Cloud Platform (Managed by Terraform)"
        GKE[GKE Cluster]
        PubSub[Pub/Sub Topic/Subscription]
        IAM_SA[IAM Service Account for Processor]
    end

    subgraph "GKE Cluster"
        subgraph "Vault Pod"
            VAULT[Vault Server]
            GCPAuth[GCP Auth Method]
            DBSecrets[Database Secrets Engine for Milvus]
        end
        subgraph "Processor Pod"
            APP[Processor Service]
        end
        subgraph "Milvus Pods"
            MILVUS[Milvus Database]
        end
    end

    TERRAFORM[Terraform] -- Provisions & Configures --> GKE
    TERRAFORM -- Provisions & Configures --> PubSub
    TERRAFORM -- Provisions & Configures --> IAM_SA
    TERRAFORM -- Configures Vault Policies & Roles --> VAULT

    APP -- 1. Authenticates using GCP Service Account --> GCPAuth
    GCPAuth -- 2. Validates against GCP IAM --> IAM_SA
    GCPAuth -- 3. Returns Vault Token --> APP
    APP -- 4. Requests Milvus Creds with Token --> DBSecrets
    DBSecrets -- 5. Creates temporary user in Milvus --> MILVUS
    DBSecrets -- 6. Returns temporary user/pass --> APP
    APP -- 7. Connects to Milvus with temp creds --> MILVUS
    APP -- Uses Workload Identity to access --> PubSub

The core advantages of this architecture are:

  1. Unified Credential Management: Access to both cloud-native services (Pub/Sub) and self-hosted services (Milvus) is managed through a single entry point: Vault.
  2. Dynamic and Short-Lived: The Milvus credentials fetched by the application are generated dynamically with a short TTL (Time-To-Live) and automatically expire. This dramatically shrinks the risk window of a credential leak.
  3. Identity-Based Authentication: The application proves its identity to Vault using its GCP IAM identity (via Workload Identity), not a stealable secret.
  4. Infrastructure as Code: Everything from GCP resources to Vault’s security policies is managed by Terraform, enabling end-to-end DevSecOps.

Phase 1: Building the Infrastructure Skeleton with Terraform

The first step is to define all necessary GCP resources using Terraform. In a real-world project, this code would be organized into modules, but for clarity, we’ll keep it in one file.

main.tf - GCP Resource Definitions

# main.tf

provider "google" {
  project = var.gcp_project_id
  region  = var.gcp_region
}

# Create a network for the GKE cluster
resource "google_compute_network" "main" {
  name                    = "milvus-sec-net"
  auto_create_subnetworks = false
}

resource "google_compute_subnetwork" "main" {
  name          = "milvus-sec-subnet"
  ip_cidr_range = "10.10.0.0/24"
  network       = google_compute_network.main.id
  region        = var.gcp_region
}

# Create the GKE cluster with Workload Identity enabled
resource "google_container_cluster" "primary" {
  name               = "milvus-cluster"
  location           = var.gcp_region
  initial_node_count = 1
  network            = google_compute_network.main.id
  subnetwork         = google_compute_subnetwork.main.id

  # Enable Workload Identity, crucial for the application to authenticate with Vault
  workload_identity_config {
    workload_pool = "${var.gcp_project_id}.svc.id.goog"
  }

  # ... other GKE configurations
}

# Create a dedicated GCP service account for the log processor application
resource "google_service_account" "processor_sa" {
  account_id   = "log-processor-sa"
  display_name = "Service Account for Log Processor"
}

# Grant the GCP service account Pub/Sub subscriber access at the project level
# (broader than strictly required; the subscription-level binding below is the least-privilege option)
resource "google_project_iam_member" "pubsub_subscriber" {
  project = var.gcp_project_id
  role    = "roles/pubsub.subscriber"
  member  = "serviceAccount:${google_service_account.processor_sa.email}"
}

# Critical step: Bind the Kubernetes Service Account (KSA) to the GCP Service Account (GSA)
# This allows the KSA named 'processor-ksa' to impersonate the 'processor_sa' GSA
resource "google_service_account_iam_member" "workload_identity_user" {
  service_account_id = google_service_account.processor_sa.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "serviceAccount:${var.gcp_project_id}.svc.id.goog[default/processor-ksa]" # Assuming deployment in the 'default' namespace
}

# Create the Pub/Sub topic and subscription
resource "google_pubsub_topic" "security_logs" {
  name = "security-logs-topic"
}

resource "google_pubsub_subscription" "processor_sub" {
  name  = "processor-subscription"
  topic = google_pubsub_topic.security_logs.name
  ack_deadline_seconds = 20

  # Pull delivery is the default when no push_config block is defined
}

# Grant the processor service account permission to consume from this specific subscription
resource "google_pubsub_subscription_iam_member" "subscriber" {
  subscription = google_pubsub_subscription.processor_sub.name
  role         = "roles/pubsub.subscriber"
  member       = "serviceAccount:${google_service_account.processor_sa.email}"
}

This Terraform code lays the groundwork: it creates a network, a GKE cluster with Workload Identity enabled, a dedicated GCP Service Account (GSA) for our processor application, and grants it permissions to access Pub/Sub. The most crucial piece is the google_service_account_iam_member resource, which establishes the trust relationship between the Kubernetes Service Account (KSA) in our namespace and the GSA. Note that the binding is completed on the Kubernetes side: the processor-ksa ServiceAccount must carry the iam.gke.io/gcp-service-account annotation pointing at the GSA's email address, otherwise the pod cannot impersonate it.

Phase 2: Deep Integration between Terraform and Vault

Next, we need to configure Vault. This is also done via Terraform, using the Vault Provider. The challenge here is to dynamically configure Vault during the terraform apply process, which includes enabling the GCP auth method and setting up the dynamic secrets engine for Milvus.

vault.tf - Vault Configuration as Code

# vault.tf

# Assuming Vault is already deployed in GKE and we have obtained an admin token and address
# In a production environment, this is typically provided via secure CI/CD variables
provider "vault" {
  address = var.vault_addr
  token   = var.vault_token
}

# 1. Enable the GCP auth method
# This allows entities (like GCE instances or GKE pods) to authenticate with Vault using their GCP identity
resource "vault_auth_backend" "gcp" {
  type = "gcp"
  path = "gcp"
}

# 2. Create a Vault role to associate the GCP service account with Vault policies
# Only entities bound to the 'log-processor-sa' GSA can log in using this role
resource "vault_gcp_auth_backend_role" "processor_role" {
  backend                = vault_auth_backend.gcp.path
  role                   = "log-processor"
  type                   = "iam"
  bound_service_accounts = [google_service_account.processor_sa.email]

  # Entities authenticating with this role receive the 'log-processor-policy' Vault policy
  token_policies = [vault_policy.processor_policy.name]
  token_ttl      = 3600 # The Vault token TTL is 1 hour
}

# 3. Define the Vault policy, granting permission to read dynamic Milvus credentials
resource "vault_policy" "processor_policy" {
  name = "log-processor-policy"

  policy = <<EOT
path "database/creds/milvus-dynamic-role" {
  capabilities = ["read"]
}
EOT
}

# 4. Enable the Database Secrets Engine
resource "vault_mount" "database" {
  path = "database"
  type = "database"
}

# Register the Milvus connection with the secrets engine.
# Vault has no built-in Milvus plugin, so this assumes a custom "milvus-database-plugin"
# registered in Vault's plugin catalog (see the Limitations section). Because the provider's
# connection resource is built around the bundled plugins, the configuration is written
# directly to the database/config endpoint here.
# Note: the credentials in connection_url are what Vault uses to manage Milvus users and
# therefore carry elevated privileges. They should themselves be sourced from Vault rather
# than hardcoded; they are inlined here only for simplicity.
resource "vault_generic_endpoint" "milvus_connection" {
  path                 = "${vault_mount.database.path}/config/milvus"
  ignore_absent_fields = true

  data_json = jsonencode({
    plugin_name    = "milvus-database-plugin"
    allowed_roles  = ["milvus-dynamic-role"]
    connection_url = "root:Milvus@tcp(milvus-service.default.svc.cluster.local:19530)/"
  })
}

# 5. Create a dynamic role for Milvus
# Vault uses this definition to create short-lived Milvus users on demand
resource "vault_database_secret_backend_role" "milvus_role" {
  backend = vault_mount.database.path
  name    = "milvus-dynamic-role"
  db_name = "milvus"

  # When an application requests credentials, Vault hands these statements to the plugin,
  # filling in the {{name}} and {{password}} placeholders. Milvus permission management is
  # not SQL-based, so treat these as illustrative statements for the custom plugin to interpret.
  creation_statements = [
    "CREATE USER '{{name}}' IDENTIFIED BY '{{password}}';",
    "GRANT ALL ON *.* TO '{{name}}';", # In production, use least-privilege grants
  ]

  default_ttl = 3600  # Default credential TTL: 1 hour (in seconds)
  max_ttl     = 86400 # Max TTL: 24 hours (in seconds)

  depends_on = [vault_generic_endpoint.milvus_connection]
}

This Terraform code is the heart of our security architecture. It accomplishes several key tasks:

  • Enables and configures the GCP auth method: This tells Vault to trust identity assertions from GCP.
  • Creates vault_gcp_auth_backend_role: This is a critical binding. It declares: “Any entity that can prove it is the GCP service account log-processor-sa will be granted the log-processor-policy upon logging into Vault.”
  • Defines vault_policy: A permission statement allowing the holder to read credentials from the database/creds/milvus-dynamic-role path.
  • Configures the Database Secrets Engine: This is where Vault’s magic happens. We tell Vault how to connect to Milvus and provide a “template” (vault_database_secret_backend_role) for creating temporary database users. The creation_statements are passed to the database plugin with the {{name}} and {{password}} placeholders filled in by Vault; for a SQL database they would be executed verbatim, while a Milvus plugin would translate them into the equivalent user-management and RBAC calls.

A common pitfall is granting excessive permissions in the creation_statements. In a real-world project, you should create a role with strictly limited permissions, such as allowing only read/write access to specific collections.

Phase 3: Application-Side Credential Retrieval Logic

With the infrastructure and security policies in place, the final step is to modify our processor application to interact with Vault and fetch credentials. We’ll demonstrate this using Go and the official Vault SDK.

processor/main.go - Go Service Fetching Dynamic Credentials

// main.go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
	vault "github.com/hashicorp/vault/api"
)

const (
	// These values should be passed via environment variables or a config file
	gcpProjectID          = "your-gcp-project-id"
	pubsubSubscriptionID  = "processor-subscription"
	vaultAddr             = "http://vault.default.svc.cluster.local:8200"
	vaultGCPAuthRole      = "log-processor" // The Vault role defined in Terraform
	milvusCredsPath       = "database/creds/milvus-dynamic-role" // The path for the dynamic role defined in Terraform
)

func main() {
	ctx := context.Background()

	// 1. Initialize the Vault client
	vaultConfig := vault.DefaultConfig()
	vaultConfig.Address = vaultAddr
	client, err := vault.NewClient(vaultConfig)
	if err != nil {
		log.Fatalf("failed to create vault client: %v", err)
	}

	// 2. Log in to Vault using GCP IAM to get a Vault Token
	// This is the most critical step. The application needs no pre-provisioned secret.
	// It presents a JWT signed by Google for its service account (in production, obtained
	// via the IAM Credentials signJwt API using Workload Identity; see the helper sketch
	// after this listing), which Vault's GCP auth backend validates against GCP IAM.
	loginData := map[string]interface{}{
		"role": vaultGCPAuthRole,
		"jwt":  getGCPIdentityToken(), // Simulated here; see the note below the listing
	}
	secret, err := client.Logical().Write("auth/gcp/login", loginData)
	if err != nil {
		log.Fatalf("failed to login to vault via gcp auth: %v", err)
	}
	if secret == nil || secret.Auth == nil {
		log.Fatalf("gcp auth failed: no auth info returned")
	}
	client.SetToken(secret.Auth.ClientToken)
	log.Println("Successfully authenticated to Vault")

	// 3. Use the obtained Vault token to request dynamic Milvus credentials from the Database Secrets Engine
	milvusSecret, err := client.Logical().Read(milvusCredsPath)
	if err != nil {
		log.Fatalf("failed to read milvus credentials from vault: %v", err)
	}
	if milvusSecret == nil || milvusSecret.Data == nil {
		log.Fatalf("no data received for milvus credentials")
	}

	milvusUser, okUser := milvusSecret.Data["username"].(string)
	milvusPassword, okPass := milvusSecret.Data["password"].(string)
	if !okUser || !okPass {
		log.Fatalf("invalid milvus credentials format received from vault")
	}
	log.Printf("Successfully obtained dynamic credentials for Milvus user: %s", milvusUser)

	// 4. Connect to Milvus using the dynamic credentials
	// conn, err := connectToMilvus(ctx, milvusUser, milvusPassword) // see the connectToMilvus sketch below
	_ = milvusPassword // referenced only by the commented-out connection call above
	// In a real application, credentials have a lease, and a background goroutine is needed
	// to renew or re-fetch them before the lease expires.
	go manageCredentialLease(client, milvusSecret)

	// 5. Access Pub/Sub using Workload Identity
	// No explicit credentials are needed here; the GCP Go SDK automatically discovers
	// the Workload Identity configuration from the environment.
	pubsubClient, err := pubsub.NewClient(ctx, gcpProjectID)
	if err != nil {
		log.Fatalf("failed to create pubsub client: %v", err)
	}
	defer pubsubClient.Close()
	sub := pubsubClient.Subscription(pubsubSubscriptionID)

	log.Println("Starting to listen for messages on Pub/Sub...")
	// Message processing loop: embed each log entry and store its vector in Milvus.
	err = sub.Receive(ctx, func(ctx context.Context, msg *pubsub.Message) {
		log.Printf("Got message: %s", msg.Data)
		// Embed the log entry and write the vector to Milvus using the dynamic credentials.
		msg.Ack()
	})
	if err != nil {
		log.Fatalf("pubsub receive error: %v", err)
	}
}

// A placeholder for obtaining the signed JWT needed for Vault's GCP IAM login.
// In a real GKE pod with Workload Identity, the application would call the IAM Credentials
// API (projects.serviceAccounts.signJwt) for its own service account, or simply use
// Vault's official GCP auth helper, which does this for you (see the sketch below).
func getGCPIdentityToken() string {
	log.Println("Fetching signed GCP identity JWT (simulated)...")
	return "placeholder-gcp-signed-jwt"
}

// Manages the credential lease, renewing it before expiration.
// Uses the lifetime watcher from the Vault SDK (the successor to the deprecated Renewer API).
func manageCredentialLease(client *vault.Client, secret *vault.Secret) {
	if !secret.Renewable {
		log.Println("Milvus credentials are not renewable.")
		// If not renewable, new credentials must be requested before the lease
		// (secret.LeaseDuration seconds) runs out.
		return
	}

	watcher, err := client.NewLifetimeWatcher(&vault.LifetimeWatcherInput{
		Secret: secret,
	})
	if err != nil {
		log.Fatalf("failed to create lifetime watcher: %v", err)
	}

	log.Printf("Starting to renew Milvus credentials lease (ID: %s)", secret.LeaseID)
	go watcher.Start()
	defer watcher.Stop()

	for {
		select {
		case err := <-watcher.DoneCh():
			if err != nil {
				log.Fatalf("failed to renew secret, application must restart or re-authenticate: %v", err)
			}
			log.Println("Lease can no longer be renewed.")
			// Trigger a reconnect with freshly issued credentials or a graceful shutdown here.
			return
		case renewal := <-watcher.RenewCh():
			log.Printf("Successfully renewed Milvus credentials at: %s", renewal.RenewedAt)
		}
	}
}
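In the listing above, getGCPIdentityToken is only simulated. In a production deployment, the cleanest option is Vault's official Go auth helper for GCP (the github.com/hashicorp/vault/api/auth/gcp package), which uses the pod's Application Default Credentials, supplied here by Workload Identity, to call the IAM Credentials signJwt API and perform the auth/gcp/login exchange for you. The following is a minimal sketch, assuming that helper package is available and that the GSA is allowed to sign JWTs for itself (roles/iam.serviceAccountTokenCreator on its own service account).

// gcp_login.go (same package as main.go)
package main

import (
	"context"
	"fmt"

	vault "github.com/hashicorp/vault/api"
	gcpauth "github.com/hashicorp/vault/api/auth/gcp"
)

// loginWithGCPAuthHelper replaces the simulated JWT handling with Vault's GCP auth helper.
func loginWithGCPAuthHelper(ctx context.Context, client *vault.Client, gsaEmail string) error {
	// WithIAMAuth has the helper sign a JWT for the given service account via the
	// IAM Credentials API, using whatever Application Default Credentials the pod has
	// (here: Workload Identity).
	gcpAuth, err := gcpauth.NewGCPAuth(
		vaultGCPAuthRole, // "log-processor", the Vault role defined in Terraform
		gcpauth.WithIAMAuth(gsaEmail),
	)
	if err != nil {
		return fmt.Errorf("creating gcp auth method: %w", err)
	}

	// Login performs the auth/gcp/login exchange and, on success, sets the returned
	// token on the client, so subsequent Logical() calls are authenticated.
	secret, err := client.Auth().Login(ctx, gcpAuth)
	if err != nil {
		return fmt.Errorf("logging in to vault via gcp auth: %w", err)
	}
	if secret == nil || secret.Auth == nil {
		return fmt.Errorf("gcp auth returned no token")
	}
	return nil
}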

The application code clearly illustrates the entire flow:

  1. It first logs into Vault using its GCP environment identity (simulated via getGCPIdentityToken). This step is entirely secretless.
  2. Upon successful login, Vault returns a short-lived Vault Token.
  3. The application uses this Vault Token to request credentials for Milvus.
  4. Vault’s Database Secrets Engine creates a temporary user in Milvus in real-time and returns the username and password to the application.
  5. The application connects to Milvus using these temporary credentials (a minimal connection sketch follows this list).
  6. A critical production practice is manageCredentialLease. Every dynamic secret from Vault comes with a lease. The application must renew this lease with Vault before it expires; otherwise, the credential will be revoked, and Vault will automatically delete the temporary user from Milvus. This is a powerful security feature that ensures expired credentials from crashed applications are automatically cleaned up.

Limitations and Future Outlook

While robust, this architecture is not without its trade-offs. First, Vault becomes a critical dependency and a potential single point of failure. In a production environment, deploying a highly available Vault cluster (typically with Vault's Integrated Storage, i.e. Raft, or Consul as the storage backend) with comprehensive monitoring and alerting is essential.

Second, the availability of database plugins is a consideration. While Vault ships with plugins for mainstream databases like PostgreSQL and MySQL, an emerging database like Milvus requires a community-provided or custom-developed plugin. This article assumes a milvus-database-plugin exists; in reality, building one means implementing Vault's database plugin interface yourself, which could require real development resources. A skeletal sketch of that interface follows.
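Writing such a plugin means implementing the Database interface from Vault's plugin SDK and registering the resulting binary in Vault's plugin catalog. The sketch below assumes the dbplugin/v5 package; all Milvus-specific logic (user creation, RBAC grants, revocation) is omitted.

// A skeleton of a hypothetical milvus-database-plugin.
package milvusplugin

import (
	"context"
	"errors"

	dbplugin "github.com/hashicorp/vault/sdk/database/dbplugin/v5"
)

type milvusPlugin struct {
	// Connection settings captured in Initialize (address, admin credentials, ...).
}

// Compile-time check that the skeleton satisfies Vault's database plugin interface.
var _ dbplugin.Database = (*milvusPlugin)(nil)

func (m *milvusPlugin) Initialize(ctx context.Context, req dbplugin.InitializeRequest) (dbplugin.InitializeResponse, error) {
	// Parse the connection configuration (e.g. connection_url) and verify connectivity.
	return dbplugin.InitializeResponse{Config: req.Config}, nil
}

func (m *milvusPlugin) NewUser(ctx context.Context, req dbplugin.NewUserRequest) (dbplugin.NewUserResponse, error) {
	// Create a temporary Milvus user and apply the role's creation statements / RBAC grants.
	return dbplugin.NewUserResponse{}, errors.New("not implemented")
}

func (m *milvusPlugin) UpdateUser(ctx context.Context, req dbplugin.UpdateUserRequest) (dbplugin.UpdateUserResponse, error) {
	// Rotate the user's password or extend its expiration.
	return dbplugin.UpdateUserResponse{}, errors.New("not implemented")
}

func (m *milvusPlugin) DeleteUser(ctx context.Context, req dbplugin.DeleteUserRequest) (dbplugin.DeleteUserResponse, error) {
	// Drop the temporary user when its lease expires or is revoked.
	return dbplugin.DeleteUserResponse{}, errors.New("not implemented")
}

func (m *milvusPlugin) Type() (string, error) { return "milvus", nil }

func (m *milvusPlugin) Close() error { return nil }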

Future optimization paths could include:

  1. Deeper Policy as Code: Using Terraform Sentinel or Open Policy Agent (OPA) for more granular control and testing of Vault policies, such as restricting credential generation to specific time windows on business days.
  2. Certificate Management: Extending the dynamic secret concept to TLS certificate management. Applications could use Vault’s PKI Secrets Engine to dynamically fetch short-lived certificates for mTLS communication, further securing service-to-service traffic (see the sketch after this list).
  3. Observability Integration: Ingesting Vault’s audit logs into a centralized logging platform and correlating them with application logs and GCP audit logs. This can build a complete, traceable view of security events, showing exactly which service requested access to which resource, when, and for what purpose.
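To illustrate the second point: issuing a short-lived certificate from a PKI secrets engine is just another authenticated write against Vault. The sketch below assumes a hypothetical PKI mount at pki with an issuing role named internal-mtls (both names are placeholders) and reuses the processor's already-authenticated Vault client.

// pki_issue.go (same package as main.go)
package main

import (
	"fmt"

	vault "github.com/hashicorp/vault/api"
)

// issueServiceCert requests a short-lived certificate/key pair for mTLS from Vault's PKI engine.
func issueServiceCert(client *vault.Client, commonName string) (cert string, key string, err error) {
	secret, err := client.Logical().Write("pki/issue/internal-mtls", map[string]interface{}{
		"common_name": commonName,
		"ttl":         "24h",
	})
	if err != nil {
		return "", "", fmt.Errorf("issuing certificate: %w", err)
	}
	if secret == nil || secret.Data == nil {
		return "", "", fmt.Errorf("no certificate data returned")
	}
	cert, _ = secret.Data["certificate"].(string)
	key, _ = secret.Data["private_key"].(string)
	return cert, key, nil
}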
