使用 Ansible 在 Azure 上为 Raft 集群构建自动化混沌测试注入框架


我们的新 Raft-KV 服务通过了所有单元测试和集成测试,但在将其投入生产环境之前,团队内部对它的韧性信心不足。Raft 协议理论上是可靠的,但我们的工程实现、配置和部署环境中的任何一个环节都可能引入脆弱性。手动模拟节点故障或网络分区不仅效率低下,而且结果难以复现。我们需要的是一个系统化的、自动化的方式来反复拷问这个集群,确保它能在真实的混乱中存活下来。

目标很明确:构建一个一键式的混沌工程流水线。这个流水线需要能够从零开始,在 Azure 上创建一组虚拟机,部署我们的 Raft 集群,然后按预设场景注入故障——例如,随机杀死 Leader 节点、模拟网络分区、增加网络延迟——同时持续校验集群的数据一致性和服务可用性。

在技术选型上,Ansible 成了我们的首选。它不仅是一个配置管理工具,更是一个强大的编排引擎。其 agentless 的特性简化了部署,幂等性保证了操作的可重复性,而丰富的模块生态(特别是针对 Azure 的)让我们能用一个工具链完成从基础设施置备(IaaS)到应用部署,再到混沌测试执行的整个闭环。

第一阶段:定义被测系统 - 一个极简的 Raft KV 服务

为了聚焦于测试框架本身,我们使用了一个用 Go 语言编写的、基于 hashicorp/raft 库实现的简易分布式 KV 存储服务。它只暴露了几个关键的 HTTP API 接口,这足以让我们验证其核心功能:

  • GET /get?key=<key>: 从集群中读取一个值。请求会被转发到 Leader 节点处理,以保证线性化读。
  • POST /set: 设置一个键值对,请求体形如 {"key": "somekey", "value": "somevalue"}。
  • GET /status: 返回当前节点的状态,包括其角色(Leader, Follower, Candidate)、当前任期(Term)以及 Leader 地址。

这个服务的代码不是本文的重点,关键在于它的接口为我们的自动化测试提供了抓手。我们可以通过 /status 接口来识别 Leader,通过 /set 和 /get 来验证数据一致性。
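
在进入自动化之前,可以先用一个最小的 Ansible 任务片段确认这些接口的行为。下面是一个示意,其中 /status 返回的字段名(state、term、leader)以及示例 IP 都是基于后文剧本的假设:

# 假设 10.10.1.4 是某个节点的私网 IP(示例值)
- name: Query node status
  ansible.builtin.uri:
    url: "http://10.10.1.4:8080/status"
    return_content: yes
  register: node_status

- name: Show role, term and leader address
  ansible.builtin.debug:
    msg: "state={{ node_status.json.state }}, term={{ node_status.json.term }}, leader={{ node_status.json.leader }}"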

第二阶段:基础设施即代码 - 使用 Ansible 置备 Azure 环境

一切自动化的起点是环境的自动化。我们编写了一个 Ansible Playbook,用于在 Azure 上创建测试所需的所有资源。这包括一个资源组、一个虚拟网络、一个子网、一个网络安全组(NSG)以及三台虚拟机作为 Raft 集群的节点。

在真实项目中,直接在 Playbook 中硬编码敏感信息是严重错误。这里使用 vars_prompt 来交互式输入,生产环境中应使用 Ansible Vault 或外部 secrets 管理工具。
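
例如,可以把这些凭据放进一个用 Ansible Vault 加密的变量文件,再在 play 中通过 vars_files 引用(以下文件路径仅为示意):

# 先加密变量文件(内含 azure_client_id、azure_secret、azure_tenant_id、azure_subscription_id):
#   ansible-vault encrypt group_vars/all/azure_secrets.yml
# 运行时附加 --ask-vault-pass 或 --vault-password-file 解密
- name: Provision with vaulted credentials
  hosts: localhost
  connection: local
  gather_facts: false
  vars_files:
    - group_vars/all/azure_secrets.yml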

provision_azure_infra.yml:

---
- name: Provision Azure Infrastructure for Raft Cluster
  hosts: localhost
  connection: local
  gather_facts: false

  vars_prompt:
    - name: "azure_client_id"
      prompt: "Enter your Azure Service Principal Client ID"
      private: true
    - name: "azure_secret"
      prompt: "Enter your Azure Service Principal Secret"
      private: true
    - name: "azure_tenant_id"
      prompt: "Enter your Azure Tenant ID"
      private: true
    - name: "azure_subscription_id"
      prompt: "Enter your Azure Subscription ID"
      private: true

  vars:
    resource_group: "raft-chaos-rg"
    location: "eastus"
    vnet_name: "raft-vnet"
    subnet_name: "raft-subnet"
    nsg_name: "raft-nsg"
    vm_admin_user: "raftadmin"
    vm_image:
      offer: "0001-com-ubuntu-server-focal"
      publisher: "Canonical"
      sku: "20_04-lts"
      version: "latest"
    vm_size: "Standard_B1s"
    node_count: 3

  tasks:
    - name: Ensure Resource Group exists
      azure.azcollection.azure_rm_resourcegroup:
        name: "{{ resource_group }}"
        location: "{{ location }}"
        client_id: "{{ azure_client_id }}"
        secret: "{{ azure_secret }}"
        tenant: "{{ azure_tenant_id }}"
        subscription_id: "{{ azure_subscription_id }}"

    - name: Create Virtual Network
      azure.azcollection.azure_rm_virtualnetwork:
        resource_group: "{{ resource_group }}"
        name: "{{ vnet_name }}"
        address_prefixes: "10.10.0.0/16"
        client_id: "{{ azure_client_id }}"
        secret: "{{ azure_secret }}"
        tenant: "{{ azure_tenant_id }}"
        subscription_id: "{{ azure_subscription_id }}"

    - name: Add a Subnet
      azure.azcollection.azure_rm_subnet:
        resource_group: "{{ resource_group }}"
        name: "{{ subnet_name }}"
        virtual_network: "{{ vnet_name }}"
        address_prefix: "10.10.1.0/24"
        client_id: "{{ azure_client_id }}"
        secret: "{{ azure_secret }}"
        tenant: "{{ azure_tenant_id }}"
        subscription_id: "{{ azure_subscription_id }}"

    - name: Create Network Security Group
      azure.azcollection.azure_rm_networksecuritygroup:
        resource_group: "{{ resource_group }}"
        name: "{{ nsg_name }}"
        rules:
          - name: "Allow-SSH"
            protocol: "Tcp"
            destination_port_range: "22"
            access: "Allow"
            priority: 100
            direction: "Inbound"
          - name: "Allow-Raft-Internal"
            protocol: "Tcp"
            # Raft 节点间通信端口
            destination_port_range: "11000"
            access: "Allow"
            priority: 110
            direction: "Inbound"
          - name: "Allow-HTTP-API"
            protocol: "Tcp"
            # 应用 API 端口
            destination_port_range: "8080"
            access: "Allow"
            priority: 120
            direction: "Inbound"
        client_id: "{{ azure_client_id }}"
        secret: "{{ azure_secret }}"
        tenant: "{{ azure_tenant_id }}"
        subscription_id: "{{ azure_subscription_id }}"

    - name: Create VM instances
      # 使用 loop 语法创建多台规格一致的虚拟机。注意:此处假设名为 raft-nic-<N> 的网络接口已在前置任务中创建(为简洁省略)
      loop: "{{ range(1, node_count + 1) | list }}"
      azure.azcollection.azure_rm_virtualmachine:
        resource_group: "{{ resource_group }}"
        name: "raft-node-{{ item }}"
        vm_size: "{{ vm_size }}"
        admin_username: "{{ vm_admin_user }}"
        ssh_password_enabled: false
        ssh_public_keys:
          - path: "/home/{{ vm_admin_user }}/.ssh/authorized_keys"
            key_data: "{{ lookup('file', '~/.ssh/id_rsa.pub') }}"
        network_interfaces: "raft-nic-{{ item }}"
        image: "{{ vm_image }}"
        client_id: "{{ azure_client_id }}"
        secret: "{{ azure_secret }}"
        tenant: "{{ azure_tenant_id }}"
        subscription_id: "{{ azure_subscription_id }}"

    - name: Add new VMs to the in-memory inventory
      # 一个常见的技巧是,在创建完资源后,立即把新 VM 加入内存中的清单(或生成静态清单文件),
      # 供后续的 Playbook 使用。这里引用的 azure_vm_facts 假设来自前面对
      # azure.azcollection.azure_rm_virtualmachine_info 的查询结果(register 得到,为简洁省略)。
      ansible.builtin.add_host:
        name: "raft-node-{{ item }}"
        groups: raft_cluster
        ansible_host: "{{ hostvars['localhost']['azure_vm_facts']['raft-node-' + item|string]['properties']['networkProfile']['networkInterfaces'][0]['properties']['ipConfigurations'][0]['properties']['privateIpAddress'] }}"
        ansible_user: "{{ vm_admin_user }}"
      loop: "{{ range(1, node_count + 1) | list }}"
      # 这是简化的写法,实际项目中我们会直接使用 azure.azcollection.azure_rm 动态清单插件

这里的关键是使用 loop 创建了三台规格一致的虚拟机,并为它们关联了统一的网络安全组。NSG 规则明确了哪些端口需要开放,这是保证集群正常通信和我们能够访问的基础。需要注意的是,为了演示方便,上面的规则没有限制来源地址(等同于对 0.0.0.0/0 开放),这种宽松规则在生产环境中是绝对不可接受的:SSH 应收紧到运维网段,Raft 与 API 端口应只对子网内部开放。
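
上面注释里提到的 azure_rm 动态清单插件,其配置文件大致如下(文件名需以 .azure_rm.yml 或 .azure_rm.yaml 结尾;分组和主机变量表达式按本文 raft-node-N 的命名约定编写,仅作示意):

# inventory.azure_rm.yml
plugin: azure.azcollection.azure_rm
include_vm_resource_groups:
  - raft-chaos-rg
auth_source: auto            # 复用环境变量或 az login 的凭据
conditional_groups:
  # 把名字里带 raft-node 的虚拟机归入 raft_cluster 组
  raft_cluster: "'raft-node' in name"
hostvar_expressions:
  # 用私网 IP 作为 Ansible 的连接地址
  ansible_host: private_ipv4_addresses | first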

第三阶段:自动化部署 Raft 集群

环境就绪后,下一步是部署我们的 Raft KV 应用。这个 Playbook 负责将编译好的 Go 二进制文件分发到所有节点,并使用一个模板化的 systemd 服务文件来启动和管理应用进程。

deploy_raft_cluster.yml:

---
- name: Deploy and Configure Raft KV Cluster
  hosts: raft_cluster
  become: yes
  gather_facts: yes

  vars:
    app_binary_path: "./bin/raft-kv"
    remote_app_dir: "/opt/raft-kv"
    remote_app_user: "raftsvc"

  tasks:
    - name: Create application user
      ansible.builtin.user:
        name: "{{ remote_app_user }}"
        system: yes
        shell: /sbin/nologin
        create_home: no

    - name: Create application directory
      ansible.builtin.file:
        path: "{{ remote_app_dir }}"
        state: directory
        owner: "{{ remote_app_user }}"
        group: "{{ remote_app_user }}"
        mode: '0755'

    - name: Copy application binary
      ansible.builtin.copy:
        src: "{{ app_binary_path }}"
        dest: "{{ remote_app_dir }}/raft-kv"
        owner: "{{ remote_app_user }}"
        group: "{{ remote_app_user }}"
        mode: '0755'

    - name: Generate systemd service file from template
      ansible.builtin.template:
        src: "templates/raft-kv.service.j2"
        dest: "/etc/systemd/system/raft-kv.service"
        mode: '0644'
      notify: Reload systemd and restart raft-kv

    - name: Ensure raft-kv service is enabled and started
      ansible.builtin.systemd:
        name: raft-kv
        enabled: yes
        state: started
        daemon_reload: yes

  handlers:
    - name: Reload systemd and restart raft-kv
      ansible.builtin.systemd:
        name: raft-kv
        state: restarted
        daemon_reload: yes

Jinja2 模板 templates/raft-kv.service.j2 是这里的核心。它允许我们为每个节点生成定制化的启动命令。

[Unit]
Description=Raft KV Store Service
After=network.target

[Service]
Type=simple
User={{ remote_app_user }}
WorkingDirectory={{ remote_app_dir }}
# 关键部分:为每个节点生成不同的启动参数
# ansible_default_ipv4.address 是 Ansible 收集到的节点 IP
# play_hosts 包含了当前 play 中所有主机的列表
ExecStart={{ remote_app_dir }}/raft-kv \
  -node-id {{ inventory_hostname }} \
  -http-addr {{ ansible_default_ipv4.address }}:8080 \
  -raft-addr {{ ansible_default_ipv4.address }}:11000 \
  -join-addr {{ hostvars[play_hosts[0]]['ansible_default_ipv4']['address'] }}:11000

Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

这个模板的精妙之处在于 join-addr 参数。我们简单地让所有节点都尝试加入第一个节点 (play_hosts[0]) 来组成集群(第一个节点会向自己发起 join,这里假设应用实现会把这种情况当作集群自举处理)。这是一种简化的集群自举方式。在更复杂的场景中,可能需要依赖外部服务发现机制(如 Consul)。inventory_hostname 用作了节点 ID,保证了唯一性。
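
部署完成后,可以先做一次简单的冒烟检查,确认集群确实完成了自举并选出了 Leader。下面的片段复用 /status 接口轮询各节点(字段名与后文混沌剧本一致,属于对应用接口的假设):

- name: Smoke test - wait for the cluster to elect a leader
  hosts: localhost
  connection: local
  gather_facts: false
  tasks:
    - name: Poll each node until it reports a non-empty leader address
      ansible.builtin.uri:
        url: "http://{{ hostvars[item]['ansible_host'] }}:8080/status"
        return_content: yes
      register: smoke_status
      until: >
        smoke_status.status == 200 and
        (smoke_status.json.leader | default('')) | length > 0
      retries: 10
      delay: 3
      loop: "{{ groups['raft_cluster'] }}"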

至此,我们已经有了一个可以一键部署的、功能正常的 Raft 集群。但它真的可靠吗?

第四阶段:构建混沌测试剧本 (Chaos Playbooks)

这才是整个工作的核心。我们设计了一系列 Ansible Playbook,每个 Playbook 模拟一种特定的故障场景。

场景一:Leader 节点崩溃与重新选举

这是最经典的测试。一个健康的 Raft 集群必须能够在 Leader 节点失效后,快速选举出新的 Leader 并继续提供服务。

chaos_kill_leader.yml:

---
- name: Chaos Test - Leader Election on Failure
  hosts: localhost
  connection: local
  gather_facts: false

  vars:
    # 从 Ansible inventory 中获取节点 IP 列表
    cluster_nodes_ips: "{{ groups['raft_cluster'] | map('extract', hostvars, 'ansible_host') | list }}"

  tasks:
    - name: Set an initial key-value pair for consistency check
      ansible.builtin.uri:
        url: "http://{{ cluster_nodes_ips[0] }}:8080/set"
        method: POST
        body: '{"key": "chaos-test", "value": "before-leader-kill"}'
        body_format: json
        status_code: 200
      # 向任意一个节点写入即可,应用内部会把写请求转发给 leader;
      # 本 play 运行在 localhost 上,uri 模块直接从控制机发起 HTTP 请求即可,无需 delegate_to
      run_once: true

    - name: Find current leader
      ansible.builtin.uri:
        url: "http://{{ item }}:8080/status"
        return_content: yes
      register: status_responses
      loop: "{{ cluster_nodes_ips }}"
      
    - name: Extract leader IP address
      ansible.builtin.set_fact:
        leader_ip: "{{ (item.json.leader | split(':'))[0] }}"
      loop: "{{ status_responses.results }}"
      when: item.json.state == "Leader"
      run_once: true
      
    - name: Assert that a leader was found
      ansible.builtin.assert:
        that:
          - leader_ip is defined
          - leader_ip in cluster_nodes_ips
        fail_msg: "Could not determine the cluster leader before starting the test."

    - name: Log the leader being killed
      ansible.builtin.debug:
        msg: "Found leader at {{ leader_ip }}. Terminating it now."

    - name: Stop the raft-kv service on the leader node
      # 使用 delegate_to 将任务委派到 leader IP 对应的主机上执行(对裸 IP 委派会沿用默认连接参数,更稳妥的做法见本场景末尾的示意)
      ansible.builtin.systemd:
        name: raft-kv
        state: stopped
      delegate_to: "{{ leader_ip }}"
      become: yes
      
    - name: Wait for a new leader to be elected
      # 轮询所有存活节点,直到它们上报的 leader 地址不再指向旧 leader。
      # 注意:不能要求被轮询的节点自己处于 Leader 状态,否则对仍是 Follower 的节点的轮询永远不会满足条件
      ansible.builtin.uri:
        url: "http://{{ item }}:8080/status"
        return_content: yes
      register: new_leader_check
      # until 条件会一直重试,直到表达式为真
      until: >
        new_leader_check.status == 200 and
        (new_leader_check.json.leader | default('')) | length > 0 and
        (new_leader_check.json.leader | split(':'))[0] != leader_ip
      retries: 10
      delay: 2 # Raft 论文建议的选举超时在 150-300ms,hashicorp/raft 默认约 1s,2 秒的轮询间隔足够
      loop: "{{ cluster_nodes_ips | difference([leader_ip]) }}" # 只查询存活的节点
      loop_control:
        label: "Polling {{ item }} for new leader status"

    - name: Extract new leader IP
      ansible.builtin.set_fact:
        new_leader_ip: "{{ (new_leader_check.results[0].json.leader | split(':'))[0] }}"
      
    - name: Log the new leader
      ansible.builtin.debug:
        msg: "New leader elected at {{ new_leader_ip }}."

    - name: Verify data consistency
      # 在新 leader 上读取之前写入的值
      ansible.builtin.uri:
        url: "http://{{ new_leader_ip }}:8080/get?key=chaos-test"
        return_content: yes
      register: consistency_check
      
    - name: Assert data is consistent
      ansible.builtin.assert:
        that:
          - consistency_check.json.value == "before-leader-kill"
        fail_msg: "Data consistency check failed! Value was not preserved after leader failover."

    - name: Restart the old leader to let it rejoin the cluster
      ansible.builtin.systemd:
        name: raft-kv
        state: started
      delegate_to: "{{ leader_ip }}"
      become: yes

这个 Playbook 的逻辑非常清晰,它模拟了一个 SRE 的手动操作流程:找到 Leader -> 杀掉进程 -> 等待并确认新 Leader 出现 -> 验证业务数据是否完整。通过 delegate_to 和 loop 的组合,Ansible 在 localhost 上作为总控制器,精确地在目标节点上执行命令。until/retries/delay 组合是实现“等待某个状态”的关键。
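
一个可以改进的细节:delegate_to 一个裸 IP 会让 Ansible 把它当作临时主机,连接参数(用户、私钥等)不一定沿用 inventory 中的配置。更稳妥的做法是先把 IP 反查回 inventory 主机名再做委派,下面是一个示意片段(沿用本剧本中的 cluster_nodes_ips 与 leader_ip 变量):

- name: Build an IP -> inventory hostname map
  ansible.builtin.set_fact:
    ip_to_host: "{{ dict(cluster_nodes_ips | zip(groups['raft_cluster'])) }}"

- name: Stop the raft-kv service on the leader node (delegated by hostname)
  ansible.builtin.systemd:
    name: raft-kv
    state: stopped
  delegate_to: "{{ ip_to_host[leader_ip] }}"
  become: yes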

场景二:网络分区

网络分区是分布式系统中最阴险的敌人。我们需要验证当一个 Follower 节点与 Leader 失去联系时,集群的多数派是否能继续工作,以及当网络恢复后,被隔离的节点能否重新同步数据。

我们将使用 iptables 来模拟网络分区。这比操作 Azure NSG 更快,更适合在测试 VM 内部进行。

chaos_network_partition.yml:

---
- name: Chaos Test - Network Partition of a Follower
  hosts: localhost
  connection: local
  gather_facts: false

  vars:
    cluster_nodes_ips: "{{ groups['raft_cluster'] | map('extract', hostvars, 'ansible_host') | list }}"

  tasks:
    - name: Find the leader and one follower
      ansible.builtin.uri:
        url: "http://{{ item }}:8080/status"
        return_content: yes
      register: status_responses
      loop: "{{ cluster_nodes_ips }}"

    - name: Identify leader and target follower
      ansible.builtin.set_fact:
        leader_ip: "{{ (item.json.leader | split(':'))[0] }}"
        partitioned_follower_ip: "{{ (cluster_nodes_ips | difference([(item.json.leader | split(':'))[0]])) | first }}"
      loop: "{{ status_responses.results }}"
      when: item.json.state == "Leader"
      run_once: true

    - name: Log the partition plan
      ansible.builtin.debug:
        msg: "Partitioning follower {{ partitioned_follower_ip }} from the cluster."
    
    - name: Use iptables to isolate the follower
      # 在被隔离的节点上,丢弃来自其他集群节点、发往本机 Raft 端口的入站流量
      # (这只是单向阻断,完全双向隔离的补充规则见本场景末尾的示意)
      ansible.builtin.iptables:
        chain: INPUT
        protocol: tcp
        source: "{{ item }}"
        destination_port: "11000"
        jump: DROP
      delegate_to: "{{ partitioned_follower_ip }}"
      become: yes
      loop: "{{ cluster_nodes_ips | difference([partitioned_follower_ip]) }}"

    - name: Write new data to the healthy part of the cluster
      ansible.builtin.uri:
        url: "http://{{ leader_ip }}:8080/set"
        method: POST
        body: '{"key": "partition-test", "value": "after-partition"}'
        body_format: json
      register: write_result
      until: write_result.status == 200
      retries: 5
      delay: 1

    - name: Verify the new data is readable from another healthy follower
      ansible.builtin.uri:
        url: "http://{{ (cluster_nodes_ips | difference([leader_ip, partitioned_follower_ip])) | first }}:8080/get?key=partition-test"
        return_content: yes
      register: read_check
      until: read_check.json.value == "after-partition"
      retries: 5
      delay: 1
      
    - name: Verify the partitioned follower cannot see the new data
      # 取决于应用实现:该节点可能因联系不上 leader 而报错,也可能成功返回本地的旧数据;下面的断言同时覆盖这两种情况
      ansible.builtin.uri:
        url: "http://{{ partitioned_follower_ip }}:8080/get?key=partition-test"
        return_content: yes
      register: partitioned_read_check
      ignore_errors: true
      
    - name: Assert that partitioned node has stale data
      ansible.builtin.assert:
        that:
          - partitioned_read_check.failed or partitioned_read_check.json.value != "after-partition"
        fail_msg: "Partitioned node could still read new data, partition failed!"

    - name: Heal the network partition
      # 清理掉之前添加的 iptables 规则
      ansible.builtin.iptables:
        chain: INPUT
        protocol: tcp
        source: "{{ item }}"
        destination_port: "11000"
        jump: DROP
        state: absent # state: absent 表示删除规则
      delegate_to: "{{ partitioned_follower_ip }}"
      become: yes
      loop: "{{ cluster_nodes_ips | difference([partitioned_follower_ip]) }}"

    - name: Wait for the partitioned node to catch up
      ansible.builtin.uri:
        url: "http://{{ partitioned_follower_ip }}:8080/get?key=partition-test"
        return_content: yes
      register: final_consistency_check
      until: >
        final_consistency_check.status == 200 and
        final_consistency_check.json.value == "after-partition"
      retries: 15
      delay: 2
      
    - name: Assert final consistency
      ansible.builtin.assert:
        that:
          - final_consistency_check.json.value == "after-partition"
        fail_msg: "Node failed to sync data after network partition was healed."

最终成果:一个完整的测试流程

我们将整个流程串联成一个主 Playbook,并用 Mermaid 图表来展示其逻辑。

graph TD
    A[Start: ansible-playbook main.yml] --> B{Provision Azure Infra};
    B --> C{Deploy Raft Cluster};
    C --> D{Run Chaos Tests};
    D --> E[Scenario 1: Kill Leader];
    E --> F{Verify Election & Consistency};
    F --> G[Scenario 2: Network Partition];
    G --> H{Verify Majority & Heal};
    H --> I[Destroy Azure Infra];
    I --> J[End: Report];

    subgraph "Chaos Test Suite"
        E
        F
        G
        H
    end

main.yml:

---
# import_playbook 只能出现在 playbook 顶层,不能作为 task 使用,也不能包在 block/always 里,
# 因此主 playbook 就是按顺序导入各个子 playbook。
- name: Provision infrastructure
  ansible.builtin.import_playbook: provision_azure_infra.yml

# 这里需要动态 inventory 的配合,以便 deploy playbook 能找到新创建的 VM。
# 实际中我们会在 ansible.cfg 中启用 azure.azcollection.azure_rm 清单插件。
- name: Deploy application
  ansible.builtin.import_playbook: deploy_raft_cluster.yml

- name: Run chaos test suite
  ansible.builtin.import_playbook: chaos_tests.yml

# 清理放在最后导入。若要保证“即使测试失败也总是清理”,需要由外层脚本或 CI 来兜底(见下文的示意),
# 因为被导入的 playbook 无法放进 always 块。
- name: Destroy infrastructure
  ansible.builtin.import_playbook: destroy_azure_infra.yml

这个框架现在提供了一个可重复、可自动化的方式来验证我们 Raft 集群的韧性。我们可以将其集成到 CI/CD 流水线中,在每次代码变更后都运行一遍,确保新的改动没有破坏系统的一致性保证。
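
在把它接入 CI/CD 时,还要解决“无论测试成败都清理环境”的问题:由于 import_playbook 无法放进 block/always 结构,这件事更适合交给 CI 脚本兜底。下面是一个 GitHub Actions 作业片段的示意(作业名与文件路径均为假设):

  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run chaos pipeline and always clean up
        run: |
          set +e
          ansible-playbook main.yml
          TEST_RC=$?
          # 无论测试结果如何,都销毁 Azure 资源
          ansible-playbook destroy_azure_infra.yml
          exit $TEST_RC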

局限与未来迭代方向

当前这个框架虽然有效,但仍有其局限性。首先,故障注入的类型还比较单一,仅限于节点停止和网络分区。真实的生产环境故障更加多样,比如磁盘满、IO 延迟、时钟漂移等。未来的版本可以集成 stress-ng 等工具来模拟资源耗尽,或使用 tc 命令来注入更精细的网络延迟和丢包。
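
以网络延迟为例,一个基于 tc netem 的注入任务大致如下(网卡名 eth0、目标变量 target_node_ip 与具体数值均为示意):

- name: Inject 200ms (+/-50ms) latency and 5% packet loss on the target node
  ansible.builtin.command: tc qdisc add dev eth0 root netem delay 200ms 50ms loss 5%
  delegate_to: "{{ target_node_ip }}"
  become: yes

- name: Remove the latency injection afterwards
  ansible.builtin.command: tc qdisc del dev eth0 root netem
  delegate_to: "{{ target_node_ip }}"
  become: yes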

其次,数据一致性的校验方式比较初级,只是检查单个键值。一个更严谨的方法是实现一个并发的客户端,持续地进行读、写和比较交换(CAS)操作,并在测试结束后,分析操作历史记录的线性一致性,类似于 Jepsen 测试框架的工作方式。

最后,使用 iptables 模拟网络分区虽然快捷,但它和云环境下的真实网络故障(如 NSG 规则错误、可用区中断)模型不完全一致。一个更真实的测试可以将故障注入目标直接对准 Azure 的控制平面,通过 Ansible 修改 NSG 规则来隔离虚拟机,尽管这会使得测试执行和恢复的时间变得更长。

