Our new Raft-KV service passed all unit and integration tests, but before putting it into production the team had little confidence in its resilience. The Raft protocol is sound in theory, yet any link in our implementation, configuration, or deployment environment could introduce fragility. Manually simulating node failures or network partitions is not only inefficient, the results are also hard to reproduce. What we needed was a systematic, automated way to interrogate this cluster over and over again and make sure it can survive real-world chaos.
The goal was clear: build a one-click chaos engineering pipeline. Starting from nothing, it has to create a set of virtual machines on Azure, deploy our Raft cluster, and then inject faults according to predefined scenarios, such as randomly killing the Leader node, simulating network partitions, or adding network latency, while continuously verifying the cluster's data consistency and service availability.
For tooling, Ansible was our first choice. It is not just a configuration management tool but a powerful orchestration engine. Its agentless design simplifies deployment, its idempotency makes operations repeatable, and its rich module ecosystem (especially for Azure) lets us close the whole loop with a single toolchain: infrastructure provisioning (IaaS), application deployment, and chaos test execution.
### **Phase 1: Defining the System Under Test - A Minimal Raft KV Service**
To stay focused on the test framework itself, we use a simple distributed KV store written in Go on top of the hashicorp/raft library. It exposes only a handful of HTTP API endpoints, which is enough to verify its core behavior:

- `GET /get?key=<key>`: reads a value from the cluster. The request is forwarded to the Leader node to guarantee linearizable reads.
- `POST /set`: sets a key-value pair, e.g. `{"key": "somekey", "value": "somevalue"}`.
- `GET /status`: returns the current node's state, including its role (Leader, Follower, Candidate), the current term, and the Leader address.

The code of this service is not the focus of this article; what matters is that its interface gives our automated tests something to hook into. We can identify the Leader through `/status` and verify data consistency through `/set` and `/get`.
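To make that concrete, here is a minimal sketch of how such calls look from Ansible, assuming a hypothetical `node_ip` variable pointing at any cluster node; the `state` and `leader` field names match the `/status` payload used by the chaos playbooks later on:

```yaml
# Hypothetical smoke-test tasks against one node; node_ip is a placeholder variable.
- name: Write a key through the HTTP API
  ansible.builtin.uri:
    url: "http://{{ node_ip }}:8080/set"
    method: POST
    body: '{"key": "hello", "value": "world"}'
    body_format: json
    status_code: 200

- name: Inspect the node's Raft role
  ansible.builtin.uri:
    url: "http://{{ node_ip }}:8080/status"
    return_content: yes
  register: node_status

- name: Show the role and leader reported by the node
  ansible.builtin.debug:
    msg: "role={{ node_status.json.state }}, leader={{ node_status.json.leader }}"
```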
### **Phase 2: Infrastructure as Code - Provisioning the Azure Environment with Ansible**
Everything starts with automating the environment itself. We wrote an Ansible Playbook that creates all the resources the test needs on Azure: a resource group, a virtual network, a subnet, a network security group (NSG), and three virtual machines to act as the Raft cluster nodes.
In a real project, hard-coding sensitive information directly into a Playbook is a serious mistake. Here we use `vars_prompt` for interactive input; in production you should use Ansible Vault or an external secrets manager.
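For reference, one way to keep the credentials out of the playbook entirely is a vars file that pulls them from the environment. This is a sketch, not part of the original playbooks; the environment variable names mirror the ones the `azure.azcollection` modules read:

```yaml
# group_vars/all.yml (hypothetical): read the Service Principal credentials
# from environment variables instead of prompting for them interactively.
azure_client_id: "{{ lookup('env', 'AZURE_CLIENT_ID') }}"
azure_secret: "{{ lookup('env', 'AZURE_SECRET') }}"
azure_tenant_id: "{{ lookup('env', 'AZURE_TENANT') }}"
azure_subscription_id: "{{ lookup('env', 'AZURE_SUBSCRIPTION_ID') }}"
```

The same file can be encrypted with `ansible-vault` if the values have to live in the repository.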
`provision_azure_infra.yml`:
```yaml
---
- name: Provision Azure Infrastructure for Raft Cluster
  hosts: localhost
  connection: local
  gather_facts: false

  vars_prompt:
    - name: "azure_client_id"
      prompt: "Enter your Azure Service Principal Client ID"
      private: true
    - name: "azure_secret"
      prompt: "Enter your Azure Service Principal Secret"
      private: true
    - name: "azure_tenant_id"
      prompt: "Enter your Azure Tenant ID"
      private: true
    - name: "azure_subscription_id"
      prompt: "Enter your Azure Subscription ID"
      private: true

  vars:
    resource_group: "raft-chaos-rg"
    location: "eastus"
    vnet_name: "raft-vnet"
    subnet_name: "raft-subnet"
    nsg_name: "raft-nsg"
    vm_admin_user: "raftadmin"
    vm_image:
      offer: "UbuntuServer"
      publisher: "Canonical"
      sku: "20.04-LTS"
      version: "latest"
    vm_size: "Standard_B1s"
    node_count: 3

  tasks:
    - name: Ensure Resource Group exists
      azure.azcollection.azure_rm_resourcegroup:
        name: "{{ resource_group }}"
        location: "{{ location }}"
        client_id: "{{ azure_client_id }}"
        secret: "{{ azure_secret }}"
        tenant: "{{ azure_tenant_id }}"
        subscription_id: "{{ azure_subscription_id }}"

    - name: Create Virtual Network
      azure.azcollection.azure_rm_virtualnetwork:
        resource_group: "{{ resource_group }}"
        name: "{{ vnet_name }}"
        address_prefixes: "10.10.0.0/16"
        client_id: "{{ azure_client_id }}"
        secret: "{{ azure_secret }}"
        tenant: "{{ azure_tenant_id }}"
        subscription_id: "{{ azure_subscription_id }}"

    - name: Add a Subnet
      azure.azcollection.azure_rm_subnet:
        resource_group: "{{ resource_group }}"
        name: "{{ subnet_name }}"
        virtual_network: "{{ vnet_name }}"
        address_prefix: "10.10.1.0/24"
        client_id: "{{ azure_client_id }}"
        secret: "{{ azure_secret }}"
        tenant: "{{ azure_tenant_id }}"
        subscription_id: "{{ azure_subscription_id }}"

    - name: Create Network Security Group
      azure.azcollection.azure_rm_networksecuritygroup:
        resource_group: "{{ resource_group }}"
        name: "{{ nsg_name }}"
        rules:
          - name: "Allow-SSH"
            protocol: "Tcp"
            destination_port_range: "22"
            access: "Allow"
            priority: 100
            direction: "Inbound"
          - name: "Allow-Raft-Internal"
            protocol: "Tcp"
            # Port used for Raft inter-node communication
            destination_port_range: "11000"
            access: "Allow"
            priority: 110
            direction: "Inbound"
          - name: "Allow-HTTP-API"
            protocol: "Tcp"
            # Application HTTP API port
            destination_port_range: "8080"
            access: "Allow"
            priority: 120
            direction: "Inbound"
        client_id: "{{ azure_client_id }}"
        secret: "{{ azure_secret }}"
        tenant: "{{ azure_tenant_id }}"
        subscription_id: "{{ azure_subscription_id }}"

    - name: Create VM instances
      azure.azcollection.azure_rm_virtualmachine:
        resource_group: "{{ resource_group }}"
        name: "raft-node-{{ item }}"
        vm_size: "{{ vm_size }}"
        admin_username: "{{ vm_admin_user }}"
        ssh_password_enabled: false
        ssh_public_keys:
          - path: "/home/{{ vm_admin_user }}/.ssh/authorized_keys"
            key_data: "{{ lookup('file', '~/.ssh/id_rsa.pub') }}"
        # Assumes a NIC named raft-nic-<n>, attached to the subnet and NSG above,
        # exists for each node; NIC creation is omitted here for brevity.
        network_interfaces: "raft-nic-{{ item }}"
        image: "{{ vm_image }}"
        client_id: "{{ azure_client_id }}"
        secret: "{{ azure_secret }}"
        tenant: "{{ azure_tenant_id }}"
        subscription_id: "{{ azure_subscription_id }}"
      # loop over 1..node_count to create several identical VMs
      loop: "{{ range(1, node_count + 1) | list }}"

    - name: Create dynamic inventory entries
      # A common trick: right after creating the resources, add them to an
      # in-memory (or static) inventory for the follow-up playbooks to use.
      # This is the simplified version; azure_vm_facts is assumed to have been
      # gathered beforehand. In practice we use the azure_rm inventory plugin.
      ansible.builtin.add_host:
        name: "raft-node-{{ item }}"
        groups: raft_cluster
        ansible_host: "{{ hostvars['localhost']['azure_vm_facts']['raft-node-' + item|string]['properties']['networkProfile']['networkInterfaces'][0]['properties']['ipConfigurations'][0]['properties']['privateIpAddress'] }}"
        ansible_user: "{{ vm_admin_user }}"
      loop: "{{ range(1, node_count + 1) | list }}"
```
The key here is using `loop` to create three identically sized virtual machines and attach them to the same network security group. The NSG rules spell out exactly which ports must be open, which is the basis both for the cluster's internal communication and for our own access. A common mistake is to use overly permissive rules here (such as allowing `0.0.0.0/0`), which is absolutely unacceptable in production.
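For comparison, a tighter version of the SSH rule might pin the source to an operator network. This is a hypothetical sketch (the `203.0.113.0/24` range is a placeholder), not the rule used above:

```yaml
# Hypothetical hardened SSH rule: only the operator network may reach port 22.
- name: "Allow-SSH-From-Ops"
  protocol: "Tcp"
  source_address_prefix: "203.0.113.0/24"   # placeholder operator CIDR
  destination_port_range: "22"
  access: "Allow"
  priority: 100
  direction: "Inbound"
```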
### **Phase 3: Automated Deployment of the Raft Cluster**
With the environment in place, the next step is to deploy the Raft KV application. This Playbook distributes the pre-built Go binary to every node and uses a templated systemd unit file to start and manage the application process.
`deploy_raft_cluster.yml`:
```yaml
---
- name: Deploy and Configure Raft KV Cluster
  hosts: raft_cluster
  become: yes
  gather_facts: yes

  vars:
    app_binary_path: "./bin/raft-kv"
    remote_app_dir: "/opt/raft-kv"
    remote_app_user: "raftsvc"

  tasks:
    - name: Create application user
      ansible.builtin.user:
        name: "{{ remote_app_user }}"
        system: yes
        shell: /sbin/nologin
        create_home: no

    - name: Create application directory
      ansible.builtin.file:
        path: "{{ remote_app_dir }}"
        state: directory
        owner: "{{ remote_app_user }}"
        group: "{{ remote_app_user }}"
        mode: '0755'

    - name: Copy application binary
      ansible.builtin.copy:
        src: "{{ app_binary_path }}"
        dest: "{{ remote_app_dir }}/raft-kv"
        owner: "{{ remote_app_user }}"
        group: "{{ remote_app_user }}"
        mode: '0755'

    - name: Generate systemd service file from template
      ansible.builtin.template:
        src: "templates/raft-kv.service.j2"
        dest: "/etc/systemd/system/raft-kv.service"
        mode: '0644'
      notify: Reload systemd and restart raft-kv

    - name: Ensure raft-kv service is enabled and started
      ansible.builtin.systemd:
        name: raft-kv
        enabled: yes
        state: started
        daemon_reload: yes

  handlers:
    - name: Reload systemd and restart raft-kv
      ansible.builtin.systemd:
        name: raft-kv
        state: restarted
        daemon_reload: yes
```
The Jinja2 template `templates/raft-kv.service.j2` is the heart of this step. It lets us generate a customized start command for each node.
```ini
[Unit]
Description=Raft KV Store Service
After=network.target

[Service]
Type=simple
User={{ remote_app_user }}
WorkingDirectory={{ remote_app_dir }}
# The key part: each node gets its own start parameters.
# ansible_default_ipv4.address is the node IP gathered by Ansible;
# play_hosts is the list of all hosts in the current play.
ExecStart={{ remote_app_dir }}/raft-kv \
    -node-id {{ inventory_hostname }} \
    -http-addr {{ ansible_default_ipv4.address }}:8080 \
    -raft-addr {{ ansible_default_ipv4.address }}:11000 \
    -join-addr {{ hostvars[play_hosts[0]]['ansible_default_ipv4']['address'] }}:11000
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
```
The clever part of this template is the `join-addr` parameter: we simply have every node try to join the first host in the play (`play_hosts[0]`) to form the cluster. This is a simplified way of bootstrapping; in more complex setups you would likely rely on an external service-discovery mechanism such as Consul. `inventory_hostname` is used as the node ID, which guarantees uniqueness.
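Before starting to break things, it is worth confirming that the cluster actually converged after deployment. A minimal post-deploy check, appended to the tasks of `deploy_raft_cluster.yml`, might look like this (a sketch, assuming the `/status` payload exposes `state` as in the chaos playbooks below):

```yaml
# Hypothetical post-deployment check, run against every node in the play.
- name: Wait until every node reports a healthy Raft state
  ansible.builtin.uri:
    url: "http://{{ ansible_default_ipv4.address }}:8080/status"
    return_content: yes
  register: boot_status
  # Keep polling until the node answers and reports Leader or Follower.
  until: >
    boot_status.status == 200 and
    boot_status.json.state in ['Leader', 'Follower']
  retries: 10
  delay: 3
```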
At this point we have a working Raft cluster that can be deployed with a single command. But is it actually reliable?
### **Phase 4: Building the Chaos Playbooks**
This is the heart of the whole effort. We designed a series of Ansible Playbooks, each of which simulates one specific failure scenario.
#### **Scenario 1: Leader Crash and Re-election**
This is the classic test: a healthy Raft cluster must be able to elect a new Leader quickly after the current Leader fails, and keep serving requests.
`chaos_kill_leader.yml`:
```yaml
---
- name: Chaos Test - Leader Election on Failure
  hosts: localhost
  connection: local
  gather_facts: false

  vars:
    # List of node IPs pulled from the Ansible inventory
    cluster_nodes_ips: "{{ groups['raft_cluster'] | map('extract', hostvars, 'ansible_host') | list }}"

  tasks:
    - name: Set an initial key-value pair for consistency check
      ansible.builtin.uri:
        # Writing to any node is fine; the request is forwarded to the leader.
        url: "http://{{ cluster_nodes_ips[0] }}:8080/set"
        method: POST
        body: '{"key": "chaos-test", "value": "before-leader-kill"}'
        body_format: json
        status_code: 200

    - name: Find current leader
      ansible.builtin.uri:
        url: "http://{{ item }}:8080/status"
        return_content: yes
      register: status_responses
      loop: "{{ cluster_nodes_ips }}"

    - name: Extract leader IP address
      ansible.builtin.set_fact:
        leader_ip: "{{ (item.json.leader | split(':'))[0] }}"
      loop: "{{ status_responses.results }}"
      when: item.json.state == "Leader"
      run_once: true

    - name: Assert that a leader was found
      ansible.builtin.assert:
        that:
          - leader_ip is defined
          - leader_ip in cluster_nodes_ips
        fail_msg: "Could not determine the cluster leader before starting the test."

    - name: Log the leader being killed
      ansible.builtin.debug:
        msg: "Found leader at {{ leader_ip }}. Terminating it now."

    - name: Stop the raft-kv service on the leader node
      # delegate_to runs this task on the host behind leader_ip
      # (assumes Ansible can SSH to that address with the right user and key).
      ansible.builtin.systemd:
        name: raft-kv
        state: stopped
      delegate_to: "{{ leader_ip }}"
      become: yes

    - name: Wait for a new leader to be elected
      # Poll the surviving nodes until they report a leader that is not the old one.
      ansible.builtin.uri:
        url: "http://{{ item }}:8080/status"
        return_content: yes
      register: new_leader_check
      # The until expression keeps retrying until it evaluates to true.
      until: >
        new_leader_check.status == 200 and
        (new_leader_check.json.leader | default('')) | length > 0 and
        (new_leader_check.json.leader | split(':'))[0] != leader_ip
      retries: 10
      delay: 2  # Raft election timeouts are typically 150-300 ms, so 2 s per retry is plenty
      loop: "{{ cluster_nodes_ips | difference([leader_ip]) }}"  # only poll surviving nodes
      loop_control:
        label: "Polling {{ item }} for new leader status"

    - name: Extract new leader IP
      ansible.builtin.set_fact:
        new_leader_ip: "{{ (new_leader_check.results[0].json.leader | split(':'))[0] }}"

    - name: Log the new leader
      ansible.builtin.debug:
        msg: "New leader elected at {{ new_leader_ip }}."

    - name: Verify data consistency
      # Read back the value written before the failover, from the new leader.
      ansible.builtin.uri:
        url: "http://{{ new_leader_ip }}:8080/get?key=chaos-test"
        return_content: yes
      register: consistency_check

    - name: Assert data is consistent
      ansible.builtin.assert:
        that:
          - consistency_check.json.value == "before-leader-kill"
        fail_msg: "Data consistency check failed! Value was not preserved after leader failover."

    - name: "Final step: restart the old leader so it can rejoin"
      ansible.builtin.systemd:
        name: raft-kv
        state: started
      delegate_to: "{{ leader_ip }}"
      become: yes
```
The logic of this Playbook is straightforward and mirrors what an SRE would do by hand: find the Leader -> kill the process -> wait for and confirm a new Leader -> verify the business data is intact. Through the combination of `delegate_to` and `loop`, Ansible acts as the controller on `localhost` while executing commands precisely on the target nodes. The `until`/`retries`/`delay` trio is the key to expressing "wait for a certain state".
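One practical caveat: `delegate_to` with a bare IP only works if Ansible can open an SSH session to that address with the right user and key. A more robust variant (a hypothetical helper, not part of the playbook above) maps the leader's IP back to its inventory hostname first and delegates to that:

```yaml
# Hypothetical helper: resolve the inventory hostname whose ansible_host
# matches leader_ip, then delegate to that hostname instead of the raw IP.
- name: Map the leader IP back to an inventory hostname
  ansible.builtin.set_fact:
    leader_host: "{{ item }}"
  loop: "{{ groups['raft_cluster'] }}"
  when: hostvars[item].ansible_host == leader_ip

- name: Stop the raft-kv service on the leader node
  ansible.builtin.systemd:
    name: raft-kv
    state: stopped
  delegate_to: "{{ leader_host }}"
  become: yes
```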
#### **Scenario 2: Network Partition**
A network partition is the most insidious enemy of a distributed system. We need to verify that when a Follower loses contact with the Leader, the majority of the cluster keeps working, and that once the network heals, the isolated node resynchronizes its data.
We use `iptables` to simulate the partition. It is faster than manipulating Azure NSGs and better suited to running inside the test VMs.
`chaos_network_partition.yml`:
```yaml
---
- name: Chaos Test - Network Partition of a Follower
  hosts: localhost
  connection: local
  gather_facts: false

  vars:
    cluster_nodes_ips: "{{ groups['raft_cluster'] | map('extract', hostvars, 'ansible_host') | list }}"

  tasks:
    - name: Find the leader and one follower
      ansible.builtin.uri:
        url: "http://{{ item }}:8080/status"
        return_content: yes
      register: status_responses
      loop: "{{ cluster_nodes_ips }}"

    - name: Identify leader and target follower
      ansible.builtin.set_fact:
        leader_ip: "{{ (item.json.leader | split(':'))[0] }}"
        partitioned_follower_ip: "{{ (cluster_nodes_ips | difference([(item.json.leader | split(':'))[0]])) | first }}"
      loop: "{{ status_responses.results }}"
      when: item.json.state == "Leader"
      run_once: true

    - name: Log the partition plan
      ansible.builtin.debug:
        msg: "Partitioning follower {{ partitioned_follower_ip }} from the cluster."

    - name: Use iptables to isolate the follower
      # On the isolated node, drop inbound Raft traffic from every other cluster node.
      # (If the HTTP layer forwards reads to the leader, port 8080 may need the same treatment.)
      ansible.builtin.iptables:
        chain: INPUT
        protocol: tcp
        source: "{{ item }}"
        destination_port: "11000"
        jump: DROP
      delegate_to: "{{ partitioned_follower_ip }}"
      become: yes
      loop: "{{ cluster_nodes_ips | difference([partitioned_follower_ip]) }}"

    - name: Write new data to the healthy part of the cluster
      ansible.builtin.uri:
        url: "http://{{ leader_ip }}:8080/set"
        method: POST
        body: '{"key": "partition-test", "value": "after-partition"}'
        body_format: json
      register: write_result
      until: write_result.status == 200
      retries: 5
      delay: 1

    - name: Verify the new data is readable from another healthy follower
      ansible.builtin.uri:
        url: "http://{{ (cluster_nodes_ips | difference([leader_ip, partitioned_follower_ip])) | first }}:8080/get?key=partition-test"
        return_content: yes
      register: read_check
      until: read_check.status == 200 and read_check.json.value == "after-partition"
      retries: 5
      delay: 1

    - name: Verify the partitioned follower cannot see the new data
      # This request is expected to time out, fail, or return stale data.
      ansible.builtin.uri:
        url: "http://{{ partitioned_follower_ip }}:8080/get?key=partition-test"
        return_content: yes
      register: partitioned_read_check
      ignore_errors: true

    - name: Assert that partitioned node has stale data
      ansible.builtin.assert:
        that:
          - partitioned_read_check.failed or partitioned_read_check.json.value != "after-partition"
        fail_msg: "Partitioned node could still read new data, partition failed!"

    - name: Heal the network partition
      # Remove the iptables rules added earlier; state: absent deletes the rule.
      ansible.builtin.iptables:
        chain: INPUT
        protocol: tcp
        source: "{{ item }}"
        destination_port: "11000"
        jump: DROP
        state: absent
      delegate_to: "{{ partitioned_follower_ip }}"
      become: yes
      loop: "{{ cluster_nodes_ips | difference([partitioned_follower_ip]) }}"

    - name: Wait for the partitioned node to catch up
      ansible.builtin.uri:
        url: "http://{{ partitioned_follower_ip }}:8080/get?key=partition-test"
        return_content: yes
      register: final_consistency_check
      until: >
        final_consistency_check.status == 200 and
        final_consistency_check.json.value == "after-partition"
      retries: 15
      delay: 2

    - name: Assert final consistency
      ansible.builtin.assert:
        that:
          - final_consistency_check.json.value == "after-partition"
        fail_msg: "Node failed to sync data after network partition was healed."
```
### **The End Result: A Complete Test Flow**
We chained the whole flow together in a main Playbook; the Mermaid diagram below shows its logic.
<pre class="mermaid">graph TD
A[Start: ansible-playbook main.yml] --> B{Provision Azure Infra};
B --> C{Deploy Raft Cluster};
C --> D{Run Chaos Tests};
D --> E[Scenario 1: Kill Leader];
E --> F{Verify Election & Consistency};
F --> G[Scenario 2: Network Partition];
G --> H{Verify Majority & Heal};
H --> I[Destroy Azure Infra];
I --> J[End: Report];
subgraph "Chaos Test Suite"
E
F
G
H
end</pre>
`main.yml`:
```yaml
---
# import_playbook can only be used at the playbook level, not inside tasks,
# so main.yml is simply a chain of imported playbooks.
- name: Provision infrastructure
  ansible.builtin.import_playbook: provision_azure_infra.yml

- name: Deploy application
  # The deploy playbook needs a dynamic inventory so it can find the freshly
  # created VMs; in practice we configure ansible.cfg to use the azure_rm
  # inventory plugin (see below).
  ansible.builtin.import_playbook: deploy_raft_cluster.yml

- name: Run chaos test suite
  ansible.builtin.import_playbook: chaos_tests.yml

# Cleanup should always run, even if the tests fail. import_playbook cannot be
# wrapped in block/always, so in CI the destroy playbook is also run as a
# separate, unconditional step after this one.
- name: Destroy infrastructure
  ansible.builtin.import_playbook: destroy_azure_infra.yml
```
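The comment about dynamic inventory deserves a concrete illustration. A minimal inventory plugin configuration could look roughly like this (a sketch under the assumption that all VM names start with `raft-node`; the file name must end in `.azure_rm.yml` for the plugin to pick it up):

```yaml
# inventory.azure_rm.yml (hypothetical): let Ansible discover the freshly
# created VMs in the resource group and put them into the raft_cluster group.
plugin: azure.azcollection.azure_rm
include_vm_resource_groups:
  - raft-chaos-rg
auth_source: env
conditional_groups:
  raft_cluster: "'raft-node' in name"
```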
This framework now gives us a repeatable, automated way to verify the resilience of our Raft cluster. We can plug it into the CI/CD pipeline and run it after every code change to make sure new changes have not broken the system's consistency guarantees.
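As an illustration of that integration, a pipeline job might be as simple as the following (a hypothetical GitHub Actions sketch; the secret names and file paths are placeholders):

```yaml
# .github/workflows/chaos.yml (hypothetical)
name: raft-chaos
on:
  push:
    branches: [main]
jobs:
  chaos:
    runs-on: ubuntu-latest
    env:
      AZURE_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
      AZURE_SECRET: ${{ secrets.AZURE_SECRET }}
      AZURE_TENANT: ${{ secrets.AZURE_TENANT }}
      AZURE_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
    steps:
      - uses: actions/checkout@v4
      - name: Install Ansible and the Azure collection
        run: pip install ansible && ansible-galaxy collection install azure.azcollection
      - name: Run the chaos suite
        run: ansible-playbook -i inventory.azure_rm.yml main.yml
      - name: Always tear the environment down
        if: always()
        run: ansible-playbook destroy_azure_infra.yml
```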
### **Limitations and Future Iterations**
The framework works, but it has its limits. First, the fault types are still narrow: stopping nodes and partitioning the network. Real production failures are far more varied, such as full disks, I/O latency, and clock drift. Future versions could integrate tools like `stress-ng` to simulate resource exhaustion, or use the `tc` command to inject finer-grained network latency and packet loss.
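For instance, latency injection with `tc`/netem could be wrapped in Ansible tasks roughly like this (a sketch; `eth0` and the 100 ms / 1% numbers are placeholders, and the qdisc has to be removed again afterwards):

```yaml
# Hypothetical latency/loss injection on a target node via tc netem.
- name: Add 100 ms of delay and 1% packet loss on the primary interface
  ansible.builtin.command: tc qdisc add dev eth0 root netem delay 100ms 20ms loss 1%
  become: yes

- name: Remove the netem qdisc to restore normal networking
  ansible.builtin.command: tc qdisc del dev eth0 root netem
  become: yes
```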
Second, the consistency check is rudimentary: it only inspects a single key. A more rigorous approach would run a concurrent client that continuously performs reads, writes, and compare-and-swap (CAS) operations, and then analyze the operation history for linearizability after the test, similar to how the Jepsen framework works.
Finally, simulating partitions with `iptables` is quick, but it does not model real cloud network failures (misconfigured NSG rules, availability-zone outages) exactly. A more realistic test would aim the fault injection at Azure's control plane instead, using Ansible to modify NSG rules to isolate a VM, even though that makes both injection and recovery noticeably slower.
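Such an NSG-level partition might, for example, push a high-priority Deny rule onto the cluster's security group, scoped to the target follower (a hypothetical sketch reusing variables from the provisioning and partition playbooks):

```yaml
# Hypothetical control-plane fault injection: block inbound Raft traffic at the NSG.
- name: Block inbound Raft traffic to the target follower at the NSG level
  azure.azcollection.azure_rm_networksecuritygroup:
    resource_group: "{{ resource_group }}"
    name: "{{ nsg_name }}"
    rules:
      - name: "Chaos-Deny-Raft-To-Follower"
        protocol: "Tcp"
        destination_address_prefix: "{{ partitioned_follower_ip }}"
        destination_port_range: "11000"
        access: "Deny"
        priority: 90   # lower number wins, so this overrides the Allow rules
        direction: "Inbound"
```

Healing the partition would then mean removing this rule again, for example by reapplying the NSG definition without it, which is exactly why this style of fault injection is slower to run and to recover from.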