AWS autoscaling works well for stateless applications. And for the stateless components of VoIP and real-time systems like APIs and routing backends, AWS Auto Scaling groups (ASGs) and Amazon Elastic Container Service (ECS) do exactly what they’re designed to do.
But stateful VoIP infrastructure components like FreeSWITCH and RTPEngine maintain active SIP sessions and media streams tightly bound to specific instances. Because each instance holds live call state, scaling events must be carefully coordinated. Otherwise, even a single premature termination can drop active calls.
Depending on how those components are configured, standard autoscaling tools may not be enough to handle that coordination safely. This post walks through one such scenario: a specific FreeSWITCH and ESL client configuration that made standard ASG and ECS task scaling impractical, and the event-driven orchestration approach we built to handle it instead.
Why AWS Autoscaling Can Break Down for Stateful VoIP Systems
A real-time communication system is composed of both stateless and stateful components.
- Business logic elements such as APIs and routing backends are generally stateless. They don’t hold session state themselves; they rely on external storage such as databases or caches.
- Media servers like FreeSWITCH are highly stateful. Each active call maintains in-memory session state tied to that specific instance. You can’t just snap your fingers like Thanos and make half your instances disappear during a scale-in event. Doing so immediately drops active calls and disrupts RTP streams.
The same challenge applies when scaling out. New instances need to be registered in the service discovery layer before the routing logic can recognize them and start allocating new media sessions to them.
Managing State: Call Draining and Session Completion
To shut down a media server safely, you must use a process called “graceful draining.”
- Signal the media server to stop accepting new sessions (quiesce).
- Wait for existing VoIP calls to complete naturally while maintaining session continuity.
- Use an Event Socket Library (ESL) client to verify there are absolutely zero active channels before finally pulling the plug.
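The draining loop above can be sketched in Python. This is a minimal illustration under stated assumptions, not the production implementation: the `fetch_count` callable and `parse_channel_count` helper are hypothetical names standing in for an ESL query such as `show channels count`, whose reply body ends in a line like `0 total.`.

```python
import re
import time
from typing import Callable


def parse_channel_count(body: str) -> int:
    """Parse the reply body of FreeSWITCH's 'show channels count'
    API command, which ends with a line like '0 total.'."""
    match = re.search(r"(\d+)\s+total", body)
    if match is None:
        raise ValueError(f"unexpected ESL reply: {body!r}")
    return int(match.group(1))


def drain(fetch_count: Callable[[], int],
          poll_interval: float = 5.0,
          timeout: float = 3600.0) -> bool:
    """Poll until the instance has zero active channels.

    Assumes new sessions are already blocked (the quiesce step).
    Returns True once fully drained, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if fetch_count() == 0:
            return True
        time.sleep(poll_interval)
    return False
```

With a real ESL connection, `fetch_count` would wrap the ESL client’s API call for `show channels count` and feed the reply body through `parse_channel_count`.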
State-Aware Scaling Out
Scaling out, on the other hand, is fairly straightforward. It still requires start-up logic, though: the new instance must bring up its services and register itself with the service discovery layer so that routing logic can see it and start sending it traffic.
A FreeSWITCH Configuration Where Standard Autoscaling Falls Short
If your FreeSWITCH and ESL client are configured in a conventional way, the standard AWS cloud scaling services should handle your scaling needs. The scenario described here is driven by a specific, unconventional implementation with two constraints:
- The ESL client running on ECS Fargate requires a unique task definition per instance to map exactly 1:1 with FreeSWITCH nodes.
- The FreeSWITCH instance is configured at runtime using Ansible scripts. This runtime configuration makes “instant” scaling difficult compared to using pre-baked, ready-to-go containers.
These two constraints together are what rule out standard ASG and ECS task scaling for this specific setup, and what motivated the event-driven approach described below. While the implementation is specific to FreeSWITCH, the same orchestration pattern applies to other stateful RTC components that may require special coordination that ASG and ECS task counts don’t always support.
An Event-Driven Alternative to Stateful Scaling
Given these constraints, standard autoscaling isn’t the right fit for this particular setup. Instead, this specific FreeSWITCH and ESL client configuration requires explicit coordination of session state, call draining, and provisioning. This is what our event-driven approach is designed to handle.
Declarative Infrastructure for Stateful Pair Management
Infrastructure as Code (IaC) tools like Terraform or AWS CloudFormation are essential to this approach. They enforce the strict 1:1 relationship between FreeSWITCH instances and ESL clients, automate the complex provisioning required for every scaling event, and prevent unique “snowflake” instances from creeping in during scale-ins or scale-outs.
The key element is defining a “toggle” variable that lets us dynamically set the desired number of “pairs” of resources.
variable "pair_count" {
  description = "Number of pairs (FS + ESL client) to deploy"
  type        = number
  default     = 2
}

resource "aws_instance" "freeswitch" {
  count = var.pair_count
  ... # rest of the configuration for the FreeSWITCH EC2 instance
}

resource "aws_ecs_task_definition" "my_task_definition" {
  count = var.pair_count
  ... # rest of the configuration for the ECS task definition
}

resource "aws_ecs_service" "main" {
  count           = var.pair_count
  name            = "my_ecs_service-${count.index}" # service names must be unique per cluster
  cluster         = aws_ecs_cluster.my_cluster.id
  task_definition = aws_ecs_task_definition.my_task_definition[count.index].arn
  desired_count   = 1 # we can only have one task per service
  launch_type     = "FARGATE"
  ... # rest of the configuration
}
The Terraform definition is accompanied by an Ansible playbook that installs FreeSWITCH, deploys the required configuration files, and registers the new instance in the service discovery database so that routing logic can start directing traffic to it.
# FreeSWITCH installation tasks
- name: Install prerequisite packages
  apt:
    name:
      - wget
      - gnupg2
      - ca-certificates
      - lsb-release
    state: present
    update_cache: yes

- name: Add FreeSWITCH GPG key
  apt_key:
    url: https://files.freeswitch.org/repo/deb/debian-release/fsstretch-archive-keyring.asc
    state: present

- name: Add FreeSWITCH repository
  apt_repository:
    repo: "deb http://files.freeswitch.org/repo/deb/debian-release/ {{ ansible_distribution_release }} main"
    state: present

- name: Install FreeSWITCH packages
  apt:
    # 'freeswitch-meta-all' installs everything. You can swap this for
    # specific modules to save time.
    name: freeswitch-meta-all
    state: present
    update_cache: yes

- name: Ensure FreeSWITCH service is enabled and running
  service:
    name: freeswitch
    state: started
    enabled: yes
# Deployment of config files
- name: Deploy internal profile
  template:
    src: internal.xml.j2
    dest: "/path/to/config/sip_profiles/internal.xml"
    owner: ubuntu
    group: ubuntu
    mode: '0644'
  register: freeswitch_internal_profile
  notify: Restart FreeSWITCH

- name: Deploy external profile
  template:
    src: external.xml.j2
    dest: "/path/to/config/sip_profiles/external.xml"
    owner: ubuntu
    group: ubuntu
    mode: '0644'
  register: freeswitch_external_profile
  notify: Restart FreeSWITCH

... # rest of the file deployments
# Register the new instance to a database for service discovery
- name: Install PostgreSQL client
  apt:
    name:
      - postgresql-client
    state: present
    update_cache: yes

- name: Check if this FreeSWITCH instance is registered in dispatcher table
  shell: |
    psql \
      -h "{{ postgres_host }}" \
      -U "{{ postgres_user }}" \
      -d "{{ postgres_db }}" \
      -t -c "SELECT COUNT(*) FROM dispatcher WHERE setid = {{ freeswitch_dispatcher_setid | default(1) }} AND destination = 'sip:{{ freeswitch_sip_ip }}:{{ freeswitch_sip_port }}';"
  environment:
    PGPASSWORD: "{{ postgres_password }}"
  register: dispatcher_check
  changed_when: false

- name: Register this FreeSWITCH instance in dispatcher table
  shell: |
    psql \
      -h "{{ postgres_host }}" \
      -U "{{ postgres_user }}" \
      -d "{{ postgres_db }}" \
      -c "INSERT INTO dispatcher (setid, destination, state, weight, priority, description) VALUES ({{ freeswitch_dispatcher_setid | default(1) }}, 'sip:{{ freeswitch_sip_ip }}:{{ freeswitch_sip_port }}', 1, 1, 0, 'FreeSWITCH {{ ansible_hostname }}');"
  environment:
    PGPASSWORD: "{{ postgres_password }}"
  when: dispatcher_check.stdout | trim | int == 0
  register: dispatcher_insert
  changed_when: dispatcher_insert.rc == 0
Event-Driven Orchestration with AWS Step Functions
AWS Step Functions acts as the brains of the operation, orchestrating the sequence of scale-in and scale-out events by managing state transitions, retries, and long-polling script executions.
A simplified scale-out state machine looks like this:
- Acquire scaling event lock to prevent concurrency issues
- Verify the current state of the infrastructure
- Provision the infrastructure using Terraform
- Run the Ansible playbook to configure the FS instance
- Release the lock to allow for further scaling events
Scale-in events use the same ideas of event locks and state, but add a proper draining process before terminating the FS instances. Here is a simplified example:
- Acquire scaling event lock
- Verify the current state of the infrastructure
- Drain active calls from the target FS instances
- Decommission infrastructure
- Release the lock for further scaling events
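The drain step is where the long-polling comes in. Here is a hedged sketch of how it could look in the state machine definition; the state names, the `CheckActiveCalls` Lambda, and the `activeChannels` field in its response payload are illustrative assumptions, not the production definition:

```json
"DrainTargetInstance": {
  "Type": "Task",
  "Resource": "arn:aws:states:::lambda:invoke",
  "Parameters": {
    "FunctionName": "CheckActiveCalls",
    "Payload.$": "$"
  },
  "ResultPath": "$.drainStatus",
  "Next": "AllCallsCompleted?"
},
"AllCallsCompleted?": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.drainStatus.Payload.activeChannels",
      "NumericEquals": 0,
      "Next": "DecommissionInfrastructure"
    }
  ],
  "Default": "WaitBeforeRecheck"
},
"WaitBeforeRecheck": {
  "Type": "Wait",
  "Seconds": 30,
  "Next": "DrainTargetInstance"
},
```

The Choice/Wait loop lets the state machine poll for hours without holding any compute, which is exactly what call draining needs.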
External State Tracking for Scaling Decisions
As we scale our infrastructure in and out, we need a way to track its current state. Our central source of truth is AWS Systems Manager Parameter Store, read through an AWS Lambda function that abstracts all the state-reading logic and updated via the native AWS SDK integration (arn:aws:states:::aws-sdk:ssm:putParameter). It keeps track of our current scaling state and stores the metadata of all our active nodes.
Retrieving the current state of the infrastructure is just a matter of calling the get_parameter API and returning it to the state machine.
def get_pair_count() -> Dict[str, Any]:
    """
    Read current pair count from SSM Parameter Store.

    Returns:
        Dict containing pair count
    """
    try:
        response = ssm.get_parameter(Name=SSM_PARAMETER_PATH)
        pair_count = int(response['Parameter']['Value'])
        print(f"Retrieved pair count: {pair_count} from {SSM_PARAMETER_PATH}")
        return {
            'success': True,
            'pair_count': pair_count,
            'parameter_path': SSM_PARAMETER_PATH
        }
    except ... # exception logic
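The write side is handled natively by the state machine through the ssm:putParameter integration, but the same update is easy to sketch in Python. A minimal illustration, not the production code: the client is injected so the helper stays testable, and the parameter path used below is a placeholder.

```python
def set_pair_count(ssm_client, parameter_path: str, pair_count: int) -> dict:
    """Write the desired pair count back to Parameter Store.

    Mirrors what the state machine does natively via
    arn:aws:states:::aws-sdk:ssm:putParameter.
    """
    ssm_client.put_parameter(
        Name=parameter_path,
        Value=str(pair_count),  # SSM String parameters hold text, not numbers
        Type='String',
        Overwrite=True          # replace the previous scaling state
    )
    return {'success': True, 'pair_count': pair_count}
```

In production, `ssm_client` would be a `boto3.client('ssm')` instance.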
Lambda also lets us define other utility functions for reading state, such as validating whether scaling is actually required:
def validate_state() -> Dict[str, Any]:
    """
    Validate that pair count matches actual infrastructure.

    Returns:
        Dict containing validation result
    """
    try:
        # Get pair count from SSM
        pair_count_result = get_pair_count()
        if not pair_count_result['success']:
            return {
                'success': False,
                'error': 'Failed to retrieve pair count',
                'details': pair_count_result
            }
        pair_count = pair_count_result['pair_count']

        # Use a separate function to retrieve available resources metadata
        infra_result = get_infrastructure_state()
        if not infra_result['success']:
            return {
                'success': False,
                'error': 'Failed to retrieve infrastructure state',
                'details': infra_result
            }
        freeswitch_count = infra_result['freeswitch_count']
        ecs_service_count = infra_result['ecs_service_count']

        # Validate counts match
        valid = (freeswitch_count == pair_count and ecs_service_count == pair_count)
        result = {
            'success': True,
            'valid': valid,
            'pair_count': pair_count,
            'freeswitch_count': freeswitch_count,
            'ecs_service_count': ecs_service_count
        }
        if not valid:
            result['message'] = (f'State mismatch: SSM={pair_count}, '
                                 f'FreeSWITCH={freeswitch_count}, ECS={ecs_service_count}')
        else:
            result['message'] = 'State is consistent'
        print(result['message'])
        return result
    except ... # exception logic
Coordinating Scaling Events with Distributed Locks
When traffic fluctuates wildly, you might trigger scale-up and scale-down events at the exact same time. We use DynamoDB to implement a locking mechanism. This prevents race conditions during simultaneous scaling events, because as we all know, crossing the streams is bad.
AWS Lambda provides a convenient interface for the state machine to acquire and release such a lock. From the function’s point of view, these are just conditional writes to DynamoDB.
def acquire_lock(operation_id: str, ttl: int = 7200,
                 metadata: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    current_time = int(time.time())
    expiry_time = current_time + ttl
    lock_item = {
        'lock_id': 'scaling-lock',
        'operation_id': operation_id,
        'acquired_at': current_time,
        'ttl': expiry_time,
        'metadata': metadata or {}
    }
    try:
        # Conditional write: only succeed if lock_id doesn't exist or TTL has expired
        table.put_item(
            Item=lock_item,
            ConditionExpression='attribute_not_exists(lock_id) OR #ttl < :current_time',
            ExpressionAttributeNames={'#ttl': 'ttl'},
            ExpressionAttributeValues={':current_time': current_time}
        )
        print(f"Lock acquired successfully for operation: {operation_id}")
        return {
            'success': True,
            'operation_id': operation_id,
            'acquired_at': current_time,
            'expires_at': expiry_time
        }
    except ... # exception logic (lock is held)

def release_lock(operation_id: str) -> Dict[str, Any]:
    try:
        # Delete the lock item only if this operation still owns it
        table.delete_item(
            Key={'lock_id': 'scaling-lock'},
            ConditionExpression='operation_id = :op_id',
            ExpressionAttributeValues={':op_id': operation_id}
        )
        print(f"Lock released successfully for operation: {operation_id}")
        return {
            'success': True,
            'operation_id': operation_id,
            'released_at': int(time.time())
        }
    except ... # exception logic
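The heart of the pattern is the conditional expression on the put. Its semantics can be illustrated with a small, self-contained sketch — no DynamoDB involved, just the same predicate expressed in plain Python (the function name is ours, for illustration only):

```python
import time
from typing import Any, Dict, Optional


def lock_is_available(existing_item: Optional[Dict[str, Any]],
                      now: Optional[int] = None) -> bool:
    """Mirror of the DynamoDB condition:
    attribute_not_exists(lock_id) OR #ttl < :current_time.

    The lock can be taken when no item exists, or when the
    previous holder's TTL has already expired (a stale lock).
    """
    current_time = int(time.time()) if now is None else now
    if existing_item is None:
        return True  # attribute_not_exists(lock_id)
    return existing_item['ttl'] < current_time  # stale lock, safe to steal
```

The TTL clause matters: without it, a crashed state machine execution would hold the lock forever and block all future scaling events.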
Execution Layer: Terraform and Ansible Orchestration
Finally, we need something to actually do the heavy lifting. Step Functions triggers AWS CodeBuild jobs to run Terraform apply commands to scale up and down by setting the desired pair_count explicitly. CodeBuild is also responsible for executing the Ansible playbooks that handle the dynamic configuration of our FreeSWITCH instances.
For instance, the definition of the Terraform job that provisions FreeSWITCH EC2 instances uses the arn:aws:states:::codebuild:startBuild.sync resource and passes the required variables (the target pair count and the Terraform resources to apply). The state machine definition also lets us configure retries and error catching.

"InvokeTerraformCompute": {
  "Type": "Task",
  "Resource": "arn:aws:states:::codebuild:startBuild.sync",
  "Parameters": {
    "ProjectName": "${terraform_project_name}",
    "EnvironmentVariablesOverride": [
      {
        "Name": "TF_VAR_pair_count",
        "Value.$": "States.Format('{}', $.targetPairCount)",
        "Type": "PLAINTEXT"
      },
      {
        "Name": "TF_TARGETS",
        "Value": "aws_instance.freeswitch",
        "Type": "PLAINTEXT"
      }
    ]
  },
  "ResultPath": "$.terraformComputeResult",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 60,
      "MaxAttempts": 3,
      "BackoffRate": 2.0,
      "Comment": "Retry Terraform compute failures with exponential backoff"
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "RollbackSSM"
    }
  ],
  "Next": "InvokeAnsibleFreeSWITCH"
},
The corresponding CodeBuild project features a buildspec that runs the required Terraform commands:

buildspec = <<-EOT
  version: 0.2
  phases:
    pre_build:
      commands:
        - echo "Starting Terraform execution for $ENVIRONMENT environment"
        - terraform init
    build:
      commands:
        - |
          # Build target flags from TF_TARGETS (comma-separated)
          TARGET_FLAGS=""
          for t in $(echo "$TF_TARGETS" | tr ',' ' '); do
            TARGET_FLAGS="$TARGET_FLAGS -target=$t"
          done
          echo "Terraform targets: $TARGET_FLAGS"
          terraform plan \
            -var="pair_count=$TF_VAR_pair_count" \
            $TARGET_FLAGS \
            -out=tfplan
        - terraform apply -auto-approve tfplan
    post_build:
      commands:
        - echo "Terraform execution completed"
EOT
Improvement Recommendations
While this event-driven orchestration approach addresses the immediate scaling constraints, the architecture is still shaped by the underlying stateful coupling between signaling, media, and control components. There are several directions that would further simplify scaling and reduce operational overhead.
- Move Toward Generic Statelessness: We should aim to re-architect the ESL client so it is interchangeable, allowing it to eventually use standard ECS scaling features. Alternatively, we could co-host the client directly on the FreeSWITCH instance. Wherever possible, decoupling media handling from session signaling will make scaling easier.
- Optimize Provisioning Time: Running Ansible scripts at runtime creates a bottleneck when we need to scale out fast. We should transition to using Custom AMIs (Amazon Machine Images) for our media servers. By baking the configuration into the image, we can achieve near-immediate availability for new nodes to handle sudden traffic spikes.
Stateful Real-Time Systems Require Custom Scaling Patterns
When stateful VoIP and real-time communication infrastructure demands more coordination than standard cloud scaling tools provide, the gap needs to be bridged carefully without dropping calls or disrupting active sessions. The event-driven orchestration pattern covered here is one way to do that.
At WebRTC.ventures, we design and build real-time communication infrastructure for teams at every stage from early architecture decisions to production scaling challenges like this one. We specialize in the custom approaches that VoIP systems often demand. If your stack is hitting limits that standard tooling can’t solve, we can help.
Further Reading:
- VoIP Security: Why Encryption Alone Isn’t Enough for Voice and Video Calls
- When VoIP Fails, Can You Explain Why? The Case for Self-Hosted Infrastructure in Critical Environments
- Scalable WebRTC VoIP Infrastructure Architecture: Essential DevOps Practices
- How to Build a Serverless Voice AI Assistant for Telephony in AWS using Twilio ConversationRelay
- Scheduled Scaling for WebRTC: Handling Predictable Video Streaming Loads with AWS