Early in our project, we faced a critical challenge: how to provision credentials for a real-time security log analysis pipeline within a highly automated CI/CD environment while completely eradicating long-lived static keys. The core components of this pipeline were Google Cloud Pub/Sub for event streaming and a Milvus database for vectorized log analysis. Static credentials, whether stored in Kubernetes Secrets or injected as environment variables, pose a significant leakage risk, are complex to rotate, and are an auditor’s nightmare.
Our goal was to implement a “zero-trust” credential management model: services would prove their identity upon startup and then fetch short-lived, least-privilege dynamic credentials on demand. The entire infrastructure and security policy had to be managed as code (IaC) to ensure repeatability and auditability. This immediately pointed us toward the combination of Terraform and HashiCorp Vault.
Architecture Decisions and Trade-offs
The initial concept was simple: a Go service subscribes to security logs (like VPC flow logs, firewall logs) from Pub/Sub, converts them into vector embeddings using a model, and stores them in Milvus for similarity analysis to detect potential attack patterns. The challenge was how this processor service would securely obtain credentials for both Pub/Sub and Milvus.
Option 1: GCP IAM Service Account Keys. This is the most direct approach. We could use Terraform to create a service account, generate a JSON key, and distribute it to the application (e.g., via Kubernetes Secrets). The drawbacks are obvious: the key is long-lived, and its compromise would have a massive blast radius. Rotating these keys requires complex automation, and it is difficult to track who used which key and when.
Option 2: Leveraging Workload Identity. For services within GCP, Workload Identity allows a GKE pod to impersonate a GCP service account, enabling access to GCP resources like Pub/Sub without a JSON key. This solves the Pub/Sub access problem, but Milvus is self-hosted on GKE and doesn’t natively support GCP IAM authentication. We would still need to manage a database username and password for Milvus.
Ultimately, we landed on Option 3: a Vault-centric dynamic credential architecture.
graph TD
subgraph "Google Cloud Platform (Managed by Terraform)"
GKE[GKE Cluster]
PubSub[Pub/Sub Topic/Subscription]
IAM_SA[IAM Service Account for Processor]
end
subgraph "GKE Cluster"
subgraph "Vault Pod"
VAULT[Vault Server]
GCPAuth[GCP Auth Method]
DBSecrets[Database Secrets Engine for Milvus]
end
subgraph "Processor Pod"
APP[Processor Service]
end
subgraph "Milvus Pods"
MILVUS[Milvus Database]
end
end
TERRAFORM[Terraform] -- Provisions & Configures --> GKE
TERRAFORM -- Provisions & Configures --> PubSub
TERRAFORM -- Provisions & Configures --> IAM_SA
TERRAFORM -- Configures Vault Policies & Roles --> VAULT
APP -- 1. Authenticates using GCP Service Account --> GCPAuth
GCPAuth -- 2. Validates against GCP IAM --> IAM_SA
GCPAuth -- 3. Returns Vault Token --> APP
APP -- 4. Requests Milvus Creds with Token --> DBSecrets
DBSecrets -- 5. Creates temporary user in Milvus --> MILVUS
DBSecrets -- 6. Returns temporary user/pass --> APP
APP -- 7. Connects to Milvus with temp creds --> MILVUS
APP -- Uses Workload Identity to access --> PubSub
The core advantages of this architecture are:
- Unified Credential Management: Access to both cloud-native services (Pub/Sub) and self-hosted services (Milvus) is managed through a single entry point: Vault.
- Dynamic and Short-Lived: The Milvus credentials fetched by the application are generated dynamically with a short TTL (Time-To-Live) and automatically expire. This dramatically shrinks the risk window of a credential leak.
- Identity-Based Authentication: The application proves its identity to Vault using its GCP IAM identity (via Workload Identity), not a stealable secret.
- Infrastructure as Code: Everything from GCP resources to Vault’s security policies is managed by Terraform, enabling end-to-end DevSecOps.
Phase 1: Building the Infrastructure Skeleton with Terraform
The first step is to define all necessary GCP resources using Terraform. In a real-world project, this code would be organized into modules, but for clarity, we’ll keep it in one file.
main.tf - GCP Resource Definitions
# main.tf
provider "google" {
project = var.gcp_project_id
region = var.gcp_region
}
# Create a network for the GKE cluster
resource "google_compute_network" "main" {
name = "milvus-sec-net"
auto_create_subnetworks = false
}
resource "google_compute_subnetwork" "main" {
name = "milvus-sec-subnet"
ip_cidr_range = "10.10.0.0/24"
network = google_compute_network.main.id
region = var.gcp_region
}
# Create the GKE cluster with Workload Identity enabled
resource "google_container_cluster" "primary" {
name = "milvus-cluster"
location = var.gcp_region
initial_node_count = 1
network = google_compute_network.main.id
subnetwork = google_compute_subnetwork.main.id
# Enable Workload Identity, crucial for the application to authenticate with Vault
workload_identity_config {
workload_pool = "${var.gcp_project_id}.svc.id.goog"
}
# ... other GKE configurations
}
# Create a dedicated GCP service account for the log processor application
resource "google_service_account" "processor_sa" {
account_id = "log-processor-sa"
display_name = "Service Account for Log Processor"
}
# Grant the GCP service account access to Pub/Sub
resource "google_project_iam_member" "pubsub_subscriber" {
project = var.gcp_project_id
role = "roles/pubsub.subscriber"
member = "serviceAccount:${google_service_account.processor_sa.email}"
}
# Critical step: Bind the Kubernetes Service Account (KSA) to the GCP Service Account (GSA)
# This allows the KSA named 'processor-ksa' to impersonate the 'processor_sa' GSA
resource "google_service_account_iam_member" "workload_identity_user" {
service_account_id = google_service_account.processor_sa.name
role = "roles/iam.workloadIdentityUser"
member = "serviceAccount:${var.gcp_project_id}.svc.id.goog[default/processor-ksa]" # Assuming deployment in the 'default' namespace
}
# Create the Pub/Sub topic and subscription
resource "google_pubsub_topic" "security_logs" {
name = "security-logs-topic"
}
resource "google_pubsub_subscription" "processor_sub" {
name = "processor-subscription"
topic = google_pubsub_topic.security_logs.name
ack_deadline_seconds = 20
# Pull delivery is the default; simply omit push_config (an empty push_config block is misleading here)
}
# Grant the processor service account subscriber rights on this specific subscription
resource "google_pubsub_subscription_iam_member" "subscriber" {
subscription = google_pubsub_subscription.processor_sub.name
role = "roles/pubsub.subscriber"
member = "serviceAccount:${google_service_account.processor_sa.email}"
}
This Terraform code lays the groundwork: it creates a network, a GKE cluster with Workload Identity enabled, a dedicated GCP Service Account (GSA) for our processor application, and grants it permissions to access Pub/Sub. The most crucial piece is the google_service_account_iam_member resource, which establishes the trust relationship between the Kubernetes Service Account (KSA) within our namespace and the GCP Service Account (GSA).
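The other half of that binding lives on the Kubernetes side: the processor-ksa Kubernetes Service Account must carry an annotation pointing back at the GSA. A minimal sketch using the Terraform Kubernetes provider is shown below; the provider configuration is assumed to be set up elsewhere, and many teams apply the equivalent as plain Kubernetes YAML instead.
# Sketch: the 'processor-ksa' KSA annotated to impersonate the log-processor GSA.
# Assumes a configured Kubernetes provider pointing at the GKE cluster above.
resource "kubernetes_service_account" "processor_ksa" {
  metadata {
    name      = "processor-ksa"
    namespace = "default"
    annotations = {
      # Workload Identity: pods using this KSA act as the log-processor GSA
      "iam.gke.io/gcp-service-account" = google_service_account.processor_sa.email
    }
  }
}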
Phase 2: Deep Integration between Terraform and Vault
Next, we need to configure Vault. This is also done via Terraform, using the Vault Provider. The challenge here is to dynamically configure Vault during the terraform apply process, which includes enabling the GCP auth method and setting up the dynamic secrets engine for Milvus.
vault.tf - Vault Configuration as Code
# vault.tf
# Assuming Vault is already deployed in GKE and we have obtained an admin token and address
# In a production environment, this is typically provided via secure CI/CD variables
provider "vault" {
address = var.vault_addr
token = var.vault_token
}
# 1. Enable the GCP auth method
# This allows entities (like GCE instances or GKE pods) to authenticate with Vault using their GCP identity
resource "vault_auth_backend" "gcp" {
type = "gcp"
path = "gcp"
}
# 2. Create a Vault role to associate the GCP service account with Vault policies
# Only entities bound to the 'log-processor-sa' GSA can log in using this role
resource "vault_gcp_auth_backend_role" "processor_role" {
backend = vault_auth_backend.gcp.path
role_name = "log-processor"
type = "gcp_iam"
service_account_email = google_service_account.processor_sa.email
# Entities authenticating with this role will receive the 'log-processor-policy' Vault policy
token_policies = [vault_policy.processor_policy.name]
token_ttl = 3600 # The Vault token TTL is 1 hour
}
# 3. Define the Vault policy, granting permission to read dynamic Milvus credentials
resource "vault_policy" "processor_policy" {
name = "log-processor-policy"
policy = <<EOT
path "database/creds/milvus-dynamic-role" {
capabilities = ["read"]
}
EOT
}
# 4. Enable the Database Secrets Engine
resource "vault_mount" "database" {
  path = "database"
  type = "database"
}
# Configure the connection for the (assumed) Milvus plugin.
# The Vault provider has no Milvus-specific connection block, so the connection config
# is written to database/config/milvus via a generic endpoint.
# Note: this root username and password is what Vault uses to manage Milvus users and has
# elevated privileges. It should itself be managed securely (e.g., rotated by Vault);
# it is hardcoded here for simplicity.
resource "vault_generic_endpoint" "milvus_connection" {
  path                 = "${vault_mount.database.path}/config/milvus"
  ignore_absent_fields = true
  data_json = jsonencode({
    plugin_name    = "milvus-database-plugin" # an assumed community/custom plugin, see Limitations
    connection_url = "root:Milvus@tcp(milvus-service.default.svc.cluster.local:19530)/"
    allowed_roles  = ["milvus-dynamic-role"]
  })
}
# 5. Create a dynamic role for Milvus
# Vault will use this definition to dynamically create Milvus users
resource "vault_database_secret_backend_role" "milvus_role" {
  backend = vault_mount.database.path
  name    = "milvus-dynamic-role"
  db_name = "milvus" # must match the connection name configured above
  # When an application requests credentials, Vault hands these statements to the plugin
  # to create the user. Milvus permission management is unique; these are example statements.
  creation_statements = [
    "CREATE USER '{{name}}' IDENTIFIED BY '{{password}}';",
    "GRANT ALL ON *.* TO '{{name}}';", # In production, use least-privilege permissions
  ]
  default_ttl = 3600  # Default credential TTL: 1 hour (in seconds)
  max_ttl     = 86400 # Max TTL: 24 hours (in seconds)
}
This Terraform code is the heart of our security architecture. It accomplishes several key tasks:
- Enables and configures the GCP auth method: This tells Vault to trust identity assertions from GCP.
- Creates vault_gcp_auth_backend_role: This is a critical binding. It declares: "Any entity that can prove it is the GCP service account log-processor-sa will be granted the log-processor-policy upon logging into Vault."
- Defines vault_policy: A permission statement allowing the holder to read credentials from the database/creds/milvus-dynamic-role path.
- Configures the Database Secrets Engine: This is where Vault's magic happens. We tell Vault how to connect to Milvus and provide a "template" (vault_database_secret_backend_role) for creating temporary database users. The creation_statements are the statements handed to the plugin, where {{name}} and {{password}} are placeholders dynamically generated by Vault.
A common pitfall is granting excessive permissions in the creation_statements. In a real-world project, you should create a role with strictly limited permissions, such as allowing only read/write access to specific collections.
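As a hedged illustration of what a tighter role might look like, the variant below scopes the grant to a single, hypothetical collection; the exact grant syntax is dictated by the assumed Milvus plugin, so treat these statements as examples rather than a definitive implementation.
# Sketch: a least-privilege variant of the dynamic role, scoped to one collection.
# 'security_log_vectors' is a hypothetical collection name; the grant syntax depends
# on the assumed Milvus plugin.
resource "vault_database_secret_backend_role" "milvus_logs_readwrite" {
  backend = vault_mount.database.path
  name    = "milvus-logs-readwrite"
  db_name = "milvus"
  creation_statements = [
    "CREATE USER '{{name}}' IDENTIFIED BY '{{password}}';",
    "GRANT INSERT, SELECT ON security_log_vectors TO '{{name}}';",
  ]
  default_ttl = 3600  # 1 hour
  max_ttl     = 14400 # 4 hours
}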
Phase 3: Application-Side Credential Retrieval Logic
With the infrastructure and security policies in place, the final step is to modify our processor application to interact with Vault and fetch credentials. We’ll demonstrate this using Go and the official Vault SDK.
processor/main.go - Go Service Fetching Dynamic Credentials
// main.go
package main
import (
	"context"
	"log"

	"cloud.google.com/go/pubsub"
	vault "github.com/hashicorp/vault/api"
)
const (
// These values should be passed via environment variables or a config file
gcpProjectID = "your-gcp-project-id"
pubsubSubscriptionID = "processor-subscription"
vaultAddr = "http://vault.default.svc.cluster.local:8200"
vaultGCPAuthRole = "log-processor" // The Vault role defined in Terraform
milvusCredsPath = "database/creds/milvus-dynamic-role" // The path for the dynamic role defined in Terraform
)
func main() {
ctx := context.Background()
// 1. Initialize the Vault client
vaultConfig := vault.DefaultConfig()
vaultConfig.Address = vaultAddr
client, err := vault.NewClient(vaultConfig)
if err != nil {
log.Fatalf("failed to create vault client: %v", err)
}
// 2. Log in to Vault using GCP IAM to get a Vault Token
// This is the most critical step. The application needs no pre-provisioned secret.
	// It obtains a signed JWT representing its GCP service account identity (made possible
	// by Workload Identity), and Vault's GCP auth backend validates it against GCP IAM.
loginData := map[string]interface{}{
"role": vaultGCPAuthRole,
"jwt": getGCPIdentityToken(), // This is a simulated function; in reality, it's fetched from the metadata service
}
secret, err := client.Logical().Write("auth/gcp/login", loginData)
if err != nil {
log.Fatalf("failed to login to vault via gcp auth: %v", err)
}
	if secret == nil || secret.Auth == nil {
log.Fatalf("gcp auth failed: no auth info returned")
}
client.SetToken(secret.Auth.ClientToken)
log.Println("Successfully authenticated to Vault")
// 3. Use the obtained Vault token to request dynamic Milvus credentials from the Database Secrets Engine
milvusSecret, err := client.Logical().Read(milvusCredsPath)
if err != nil {
log.Fatalf("failed to read milvus credentials from vault: %v", err)
}
if milvusSecret == nil || milvusSecret.Data == nil {
log.Fatalf("no data received for milvus credentials")
}
milvusUser, okUser := milvusSecret.Data["username"].(string)
milvusPassword, okPass := milvusSecret.Data["password"].(string)
if !okUser || !okPass {
log.Fatalf("invalid milvus credentials format received from vault")
}
log.Printf("Successfully obtained dynamic credentials for Milvus user: %s", milvusUser)
// 4. Connect to Milvus using the dynamic credentials
// connectToMilvus(milvusUser, milvusPassword)
// In a real application, credentials have a lease, and a background goroutine is needed
// to renew or re-fetch them before the lease expires.
go manageCredentialLease(client, milvusSecret)
// 5. Access Pub/Sub using Workload Identity
// No explicit credentials are needed here; the GCP Go SDK automatically discovers
// the Workload Identity configuration from the environment.
pubsubClient, err := pubsub.NewClient(ctx, gcpProjectID)
if err != nil {
log.Fatalf("failed to create pubsub client: %v", err)
}
defer pubsubClient.Close()
sub := pubsubClient.Subscription(pubsubSubscriptionID)
log.Println("Starting to listen for messages on Pub/Sub...")
// ... Begin message processing loop ...
// cctx, cancel := context.WithCancel(ctx)
// err = sub.Receive(cctx, func(ctx context.Context, msg *pubsub.Message) {
// log.Printf("Got message: %s", msg.Data)
// // Process message and store vector in Milvus
// msg.Ack()
// })
// if err != nil {
// log.Fatalf("pubsub receive error: %v", err)
// }
}
// A simulated function to obtain a GCP-signed JWT for the Vault login.
// In a real deployment, this JWT is produced by calling the IAM Credentials signJwt API
// as the service account (or by using the official helper in
// github.com/hashicorp/vault/api/auth/gcp) rather than being hand-rolled like this.
func getGCPIdentityToken() string {
	// For this example, we return a placeholder.
	log.Println("Fetching GCP identity token (simulated)...")
	return "placeholder-gcp-signed-jwt"
}
// Manages the credential lease, renewing it before expiration.
func manageCredentialLease(client *vault.Client, secret *vault.Secret) {
if !secret.Renewable {
log.Println("Milvus credentials are not renewable.")
// If not renewable, you must request new credentials before the lease ends.
// ttl := time.Duration(secret.LeaseDuration) * time.Second
// ...
return
}
	// Note: newer versions of the Vault Go SDK deprecate NewRenewer in favor of NewLifetimeWatcher.
	renewer, err := client.NewRenewer(&vault.RenewerInput{
Secret: secret,
})
if err != nil {
log.Fatalf("failed to create renewer: %v", err)
}
log.Printf("Starting to renew Milvus credentials lease (ID: %s)", secret.LeaseID)
go renewer.Renew()
defer renewer.Stop()
for {
select {
case err := <-renewer.DoneCh():
if err != nil {
log.Fatalf("failed to renew secret, application must restart or re-authenticate: %v", err)
}
log.Println("Lease expired, no more renewals.")
// Trigger an application reconnect or graceful shutdown here.
return
case renewal := <-renewer.RenewCh():
log.Printf("Successfully renewed Milvus credentials at: %s", renewal.RenewedAt)
}
}
}
The application code clearly illustrates the entire flow:
- It first logs into Vault using its GCP environment identity (simulated via getGCPIdentityToken; a sketch using the official SDK helper follows this list). This step is entirely secretless.
- Upon successful login, Vault returns a short-lived Vault token.
- The application uses this Vault token to request credentials for Milvus.
- Vault's Database Secrets Engine creates a temporary user in Milvus in real time and returns the username and password to the application.
- The application connects to Milvus using these temporary credentials.
- A critical production practice is manageCredentialLease. Every dynamic secret from Vault comes with a lease. The application must renew this lease with Vault before it expires; otherwise the credential is revoked, and Vault automatically deletes the temporary user from Milvus. This is a powerful security feature that ensures credentials left behind by crashed applications are cleaned up automatically.
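For completeness, here is a hedged sketch of how the simulated getGCPIdentityToken could be replaced by the official GCP auth helper shipped with the Vault Go SDK. It assumes the github.com/hashicorp/vault/api/auth/gcp package and the log-processor role created by Terraform; verify the helper's options against the SDK version you use before adopting it.
// Sketch: authenticating to Vault with the official GCP auth helper instead of a
// hand-rolled signed JWT. The role name and service account email are assumptions
// taken from the Terraform above.
package main

import (
	"context"
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
	gcpauth "github.com/hashicorp/vault/api/auth/gcp"
)

func loginWithGCP(ctx context.Context, client *vault.Client) error {
	// WithIAMAuth signs a JWT as the given service account via the IAM Credentials API,
	// which works under Workload Identity without any key file on disk.
	auth, err := gcpauth.NewGCPAuth(
		"log-processor",
		gcpauth.WithIAMAuth("log-processor-sa@your-gcp-project-id.iam.gserviceaccount.com"),
	)
	if err != nil {
		return fmt.Errorf("creating gcp auth method: %w", err)
	}
	secret, err := client.Auth().Login(ctx, auth) // sets the client token on success
	if err != nil {
		return fmt.Errorf("logging in to vault: %w", err)
	}
	if secret == nil || secret.Auth == nil {
		return fmt.Errorf("no auth info returned from vault")
	}
	return nil
}

func main() {
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatalf("failed to create vault client: %v", err)
	}
	if err := loginWithGCP(context.Background(), client); err != nil {
		log.Fatalf("vault login failed: %v", err)
	}
	log.Println("Authenticated to Vault via the GCP auth helper")
}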
Limitations and Future Outlook
While robust, this architecture is not without its trade-offs. First, Vault becomes a critical dependency and a potential single point of failure. In a production environment, deploying a highly available Vault cluster (typically with Integrated Storage/Raft or Consul as the storage backend) with comprehensive monitoring and alerting is essential.
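For reference, a minimal sketch of the server-side configuration for such a cluster using Integrated Storage might look like the following; the paths, hostnames, and TLS file locations are assumptions for illustration.
# Sketch: Vault server config for an HA node using Integrated Storage (Raft).
storage "raft" {
  path    = "/vault/data"
  node_id = "vault-0"

  retry_join {
    leader_api_addr = "https://vault-1.vault-internal:8200"
  }
}

listener "tcp" {
  address         = "0.0.0.0:8200"
  cluster_address = "0.0.0.0:8201"
  tls_cert_file   = "/vault/tls/tls.crt"
  tls_key_file    = "/vault/tls/tls.key"
}

api_addr     = "https://vault-0.vault-internal:8200"
cluster_addr = "https://vault-0.vault-internal:8201"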
Second, the availability of database plugins is a consideration. While Vault has built-in plugins for mainstream databases like PostgreSQL and MySQL, emerging databases like Milvus might require a community-provided or custom-developed plugin. This article assumes a milvus-database-plugin exists; in reality, this could require development resources if it’s not available.
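To give a sense of what that development involves, the skeleton below shows the shape of a custom plugin against Vault's dbplugin v5 SDK (github.com/hashicorp/vault/sdk/database/dbplugin/v5); the method set is paraphrased from the SDK and the Milvus calls are placeholders, so check the exact request and response types before building on it.
// Sketch of a custom Milvus database plugin skeleton for Vault's dbplugin v5 interface.
// Signatures are paraphrased and must be verified against the SDK version in use.
package main

import (
	"context"

	dbplugin "github.com/hashicorp/vault/sdk/database/dbplugin/v5"
)

type milvusPlugin struct {
	// connection settings captured during Initialize (host, root credentials, ...)
}

func (m *milvusPlugin) Initialize(ctx context.Context, req dbplugin.InitializeRequest) (dbplugin.InitializeResponse, error) {
	// Parse req.Config (connection_url, etc.) and optionally verify connectivity to Milvus.
	return dbplugin.InitializeResponse{Config: req.Config}, nil
}

func (m *milvusPlugin) NewUser(ctx context.Context, req dbplugin.NewUserRequest) (dbplugin.NewUserResponse, error) {
	// Call the Milvus RBAC API to create a user with req.Password and apply the grants
	// described by the role's creation statements; return the generated username.
	return dbplugin.NewUserResponse{Username: "v-" + req.UsernameConfig.RoleName}, nil
}

func (m *milvusPlugin) UpdateUser(ctx context.Context, req dbplugin.UpdateUserRequest) (dbplugin.UpdateUserResponse, error) {
	// Rotate the user's password when Vault renews or rotates credentials.
	return dbplugin.UpdateUserResponse{}, nil
}

func (m *milvusPlugin) DeleteUser(ctx context.Context, req dbplugin.DeleteUserRequest) (dbplugin.DeleteUserResponse, error) {
	// Drop the temporary Milvus user when the lease expires or is revoked.
	return dbplugin.DeleteUserResponse{}, nil
}

func (m *milvusPlugin) Type() (string, error) { return "milvus", nil }
func (m *milvusPlugin) Close() error          { return nil }

func main() {
	// Serve the plugin over gRPC so Vault can launch and manage it.
	dbplugin.Serve(&milvusPlugin{})
}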
Future optimization paths could include:
- Deeper Policy as Code: Using Terraform Sentinel or Open Policy Agent (OPA) for more granular control and testing of Vault policies, such as restricting credential generation to specific time windows on business days.
- Certificate Management: Extending the dynamic secret concept to TLS certificate management. Applications could use Vault’s PKI Secrets Engine to dynamically fetch short-lived certificates for mTLS communication, further securing service-to-service traffic (a minimal Terraform sketch follows this list).
- Observability Integration: Ingesting Vault’s audit logs into a centralized logging platform and correlating them with application logs and GCP audit logs. This can build a complete, traceable view of security events, showing exactly which service requested access to which resource, when, and for what purpose.
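To make the PKI idea concrete, here is a hedged Terraform sketch that mounts a PKI engine and defines a role for short-lived service certificates; the common names, domains, and TTLs are illustrative assumptions, not a definitive setup.
# Sketch: a PKI mount and a role that issues short-lived certificates for mTLS.
resource "vault_mount" "pki" {
  path                  = "pki"
  type                  = "pki"
  max_lease_ttl_seconds = 86400 # certificates issued here live at most 24 hours
}

# An internal root CA for cluster-local mTLS (production setups usually add an intermediate CA).
resource "vault_pki_secret_backend_root_cert" "root" {
  backend     = vault_mount.pki.path
  type        = "internal"
  common_name = "svc.cluster.local"
  ttl         = "8760h"
}

# Services request certificates against this role when establishing mTLS connections.
resource "vault_pki_secret_backend_role" "processor_mtls" {
  backend          = vault_mount.pki.path
  name             = "processor-mtls"
  allowed_domains  = ["default.svc.cluster.local"]
  allow_subdomains = true
}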