AWS autoscaling works well for stateless applications, but it breaks down quickly for real-time systems like FreeSWITCH, OpenSIPS, and RTPEngine. These VoIP infrastructure systems maintain active SIP sessions and media streams that are tightly bound to specific instances, making standard scale-in and scale-out behavior unsafe.

Because each instance holds live call state, scaling events must be carefully coordinated. Otherwise, even a single premature termination can drop active calls.

This post explores an alternative, event-driven approach to scaling FreeSWITCH on AWS, which can also be applied to other stateful real-time communication infrastructure.

Why AWS Autoscaling Breaks Down for Stateful VoIP Systems

A real-time communication system is composed of both stateless and stateful components.

  • Business logic elements such as APIs and routing backends are generally stateless. This means they don’t store any state themselves; they rely on external storage such as databases or caches.
  • Media servers like FreeSWITCH are highly stateful. Each active call maintains in-memory session state tied to that specific instance. You can’t just snap your fingers like Thanos and make half your instances disappear during a scale-in event. Doing so immediately drops active calls and disrupts RTP streams.

The same challenge applies when scaling out. New instances need to be registered in the service discovery layer before the routing logic can recognize them and start allocating new media sessions to them.

Managing State: Call Draining and Session Completion

To shut down a media server safely, you must follow a process known as “graceful draining”:

  1. Signal the media server to stop accepting new sessions (quiesce).
  2. Wait for existing VoIP calls to complete naturally while maintaining session continuity.
  3. Use an Event Socket Library (ESL) client to verify there are absolutely zero active channels before finally pulling the plug.
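Step 1 can be performed with FreeSWITCH’s fsctl pause command, and step 3 reduces to polling the channel count until it hits zero. Below is a minimal sketch of that drain check, assuming fs_cli is available on the instance; in production this would typically go through a persistent ESL connection instead:

```python
import re
import subprocess
import time

def parse_channel_count(output: str) -> int:
    """Extract the active-channel count from 'show channels count' output,
    which ends with a summary line like '0 total.'"""
    match = re.search(r"(\d+)\s+total", output)
    if match is None:
        raise ValueError(f"unexpected output: {output!r}")
    return int(match.group(1))

def wait_until_drained(poll_seconds: int = 30) -> None:
    """Poll FreeSWITCH until no channels remain; only then is it safe
    to terminate the instance."""
    # Quiesce first so no new sessions arrive while we wait:
    #   fs_cli -x "fsctl pause"
    while True:
        out = subprocess.run(
            ["fs_cli", "-x", "show channels count"],
            capture_output=True, text=True, check=True,
        ).stdout
        if parse_channel_count(out) == 0:
            return
        time.sleep(poll_seconds)
```

The polling interval is a trade-off: short enough that decommissioning isn’t delayed, long enough not to hammer the event socket.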

State-Aware Scaling Out

Scaling out, on the other hand, is pretty straightforward. But it does require start-up logic to set up the required services and register the server in the service discovery layer, so that routing logic can see it and send traffic its way.

When Standard Autoscaling Won’t Work: A FreeSWITCH Example

Traditional AWS cloud scaling services such as Auto Scaling Groups, Elastic Load Balancers, and Amazon ECS are built for stateless architectures. If a web server goes down, another one spins up and the user rarely notices. And while they do incorporate mechanisms to support custom draining and startup logic, there are scenarios where they are not enough.

Consider a typical ESL client + FreeSWITCH pair configured in an unconventional way:

  • The ESL client running on ECS Fargate requires a unique task definition per instance to map exactly 1:1 with FreeSWITCH nodes.
  • The FreeSWITCH instance is configured at runtime using Ansible scripts. This configuration rigidity makes “instant” scaling difficult compared to using pre-baked, ready-to-go containers.

In this scenario, Auto Scaling groups require advanced tweaks not only to provision the new instance, but also to run configuration scripts to set up all the required services. And the ECS Fargate service cannot simply increase its task count, since each task requires its own task definition.

An Event-Driven Alternative to Stateful Scaling

Given these constraints, standard autoscaling isn’t an option. Instead, VoIP infrastructure requires explicit coordination of session state, call draining, and provisioning.

The rest of this post focuses on the FreeSWITCH and ESL client pair introduced above, walking through a custom event-driven orchestration approach our team at WebRTC.ventures built with AWS Step Functions, Terraform, and Ansible to handle provisioning, draining, and state management safely. While the implementation is specific to FreeSWITCH, the same orchestration pattern applies to other stateful RTC components.

Declarative Infrastructure for Stateful Pair Management

Infrastructure as Code (IaC) tools like Terraform or AWS CloudFormation are essential to this approach. They enforce the strict 1:1 relationship between FreeSWITCH instances and ESL clients, automate the complex provisioning required for every scaling event, and prevent unique “snowflake” instances from creeping in during scale-ins or scale-outs.

The key element is defining a “toggle” that allows us to dynamically set the desired number of “pairs” of resources.

variable "pair_count" {
  description = "Number of pairs (FS + ESL client) to deploy"
  type        = number
  default     = 2
}

resource "aws_instance" "freeswitch" {
  count = var.pair_count

  # ... rest of the configuration for the FreeSWITCH EC2 instance
}

resource "aws_ecs_task_definition" "my_task_definition" {
  count = var.pair_count

  # ... rest of the configuration for the ECS task definition
}

resource "aws_ecs_service" "main" {
  count = var.pair_count

  name            = "my_ecs_service"
  cluster         = aws_ecs_cluster.my_cluster.id
  task_definition = aws_ecs_task_definition.my_task_definition[count.index].arn
  desired_count   = 1 # we can only have one task per service
  launch_type     = "FARGATE"

  # ... rest of the configuration
}

The Terraform definition is accompanied by an Ansible playbook that installs FreeSWITCH, deploys the required configuration files, and registers the new instance in the service discovery database so that routing logic can start directing traffic to it.

# FreeSWITCH installation tasks
- name: Install prerequisite packages
  apt:
    name:
      - wget
      - gnupg2
      - ca-certificates
      - lsb-release
    state: present
    update_cache: yes

- name: Add FreeSWITCH GPG key
  apt_key:
    url: https://files.freeswitch.org/repo/deb/debian-release/fsstretch-archive-keyring.asc
    state: present

- name: Add FreeSWITCH repository
  apt_repository:
    repo: "deb http://files.freeswitch.org/repo/deb/debian-release/ {{ ansible_distribution_release }} main"
    state: present

- name: Install FreeSWITCH packages
  apt:
    # 'freeswitch-meta-all' installs everything. You can swap this for
    # specific modules to save time.
    name: freeswitch-meta-all 
    state: present
    update_cache: yes

- name: Ensure FreeSWITCH service is enabled and running
  service:
    name: freeswitch
    state: started
    enabled: yes

# Deployment of config files 
- name: Deploy internal profile
  template:
    src: internal.xml.j2
    dest: "/path/to/config/sip_profiles/internal.xml"
    owner: ubuntu
    group: ubuntu
    mode: '0644'
  register: freeswitch_internal_profile
  notify: Restart FreeSWITCH

- name: Deploy external profile
  template:
    src: external.xml.j2
    dest: "/path/to/config/sip_profiles/external.xml"
    owner: ubuntu
    group: ubuntu
    mode: '0644'
  register: freeswitch_external_profile
  notify: Restart FreeSWITCH

... # rest of files deployments

# Register the new instance to a database for service discovery
- name: Install PostgreSQL client
  apt:
    name:
      - postgresql-client
    state: present
    update_cache: yes

- name: Check if this FreeSWITCH instance is registered in dispatcher table
  shell: |
    psql \
      -h "{{ postgres_host }}" \
      -U "{{ postgres_user }}" \
      -d "{{ postgres_db }}" \
      -t -c "SELECT COUNT(*) FROM dispatcher WHERE setid = 
         {{ freeswitch_dispatcher_setid | default(1) }} AND destination = 
         'sip:{{ freeswitch_sip_ip }}:{{ freeswitch_sip_port }}';"
  environment:
    PGPASSWORD: "{{ postgres_password }}"
  register: dispatcher_check
  changed_when: false

- name: Register this FreeSWITCH instance in dispatcher table
  shell: |
    psql \
      -h "{{ postgres_host }}" \
      -U "{{ postgres_user }}" \
      -d "{{ postgres_db }}" \
      -c "INSERT INTO dispatcher (setid, destination, state, weight, priority, description) 
          VALUES ({{ freeswitch_dispatcher_setid | default(1) }}, 
         'sip:{{ freeswitch_sip_ip }}:{{ freeswitch_sip_port }}', 1, 1, 0, 
         'FreeSWITCH {{ ansible_hostname }}');"
  environment:
    PGPASSWORD: "{{ postgres_password }}"
  when: dispatcher_check.stdout | trim | int == 0
  register: dispatcher_insert
  changed_when: dispatcher_insert.rc == 0

Event-Driven Orchestration with AWS Step Functions

AWS Step Functions acts as the brains of the operation. It orchestrates the complex sequence of scale-in and scale-out events by managing state transitions, retries, and long-polling script executions.

A simplified scale-out state machine looks like this:

  1. Acquire scaling event lock to prevent concurrency issues
  2. Verify the current state of the infrastructure
  3. Provision the infrastructure using terraform
  4. Run the ansible playbook to configure FS instance
  5. Release the lock to allow for further scaling events
FreeSWITCH Scale-Out Orchestration Flow on AWS Step Functions

Scale-in events use the same idea of event locks and state verification, but add a proper draining process before terminating the FS instances. Here is a simplified example:

  1. Acquire scaling event lock
  2. Verify the current state of the infrastructure
  3. Drain active calls from the target FS instances
  4. Decommission infrastructure
  5. Release the lock for further scaling events
FreeSWITCH Scale-In Orchestration Flow on AWS Step Functions
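Both flows begin by reconciling the desired pair count recorded in the state store with the infrastructure that actually exists. A small, purely illustrative helper (names are hypothetical, not from the actual implementation) shows the kind of bounded arithmetic behind a scaling decision:

```python
def next_pair_count(current: int, delta: int,
                    min_pairs: int = 1, max_pairs: int = 10) -> int:
    """Clamp a requested change in pair count so we never drain the last
    remaining pair or provision beyond a sane capacity ceiling."""
    target = current + delta
    return max(min_pairs, min(max_pairs, target))
```

Clamping matters during rapid back-to-back events: a burst of scale-in triggers should never be able to take the fleet below the minimum serving capacity.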

External State Tracking for Scaling Decisions

As we scale our infrastructure in and out, we need a way to track its current state. Our central source of truth is AWS Systems Manager Parameter Store, read through an AWS Lambda function that abstracts all state-reading logic and updated via the native AWS SDK integration (arn:aws:states:::aws-sdk:ssm:putParameter). It keeps track of our current scaling state and stores the metadata of all our active nodes.
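The state machine updates this parameter through the native aws-sdk:ssm:putParameter integration; the equivalent write from Python looks like the sketch below. The parameter path and helper name are illustrative, not the production names:

```python
from typing import Any, Dict

PAIR_COUNT_PARAMETER = "/freeswitch/pair_count"  # hypothetical parameter path

def build_put_request(pair_count: int) -> Dict[str, Any]:
    """Build the PutParameter request that records a new desired pair count."""
    if pair_count < 0:
        raise ValueError("pair_count must be non-negative")
    return {
        "Name": PAIR_COUNT_PARAMETER,
        "Value": str(pair_count),  # SSM stores values as strings
        "Type": "String",
        "Overwrite": True,  # replace the previous count
    }

# With AWS credentials available:
#   import boto3
#   boto3.client("ssm").put_parameter(**build_put_request(3))
```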

Retrieving the current state of the infrastructure is just a matter of calling the get_parameter API and returning the result to the state machine.

import boto3
from typing import Any, Dict

ssm = boto3.client('ssm')
SSM_PARAMETER_PATH = '/freeswitch/pair_count'  # example parameter path

def get_pair_count() -> Dict[str, Any]:
    """
    Read current pair count from SSM Parameter Store.

    Returns:
        Dict containing pair count
    """
    try:
        response = ssm.get_parameter(Name=SSM_PARAMETER_PATH)
        pair_count = int(response['Parameter']['Value'])

        print(f"Retrieved pair count: {pair_count} from {SSM_PARAMETER_PATH}")

        return {
            'success': True,
            'pair_count': pair_count,
            'parameter_path': SSM_PARAMETER_PATH
        }

    except Exception as error:  # exception logic
        return {'success': False, 'error': str(error)}

Lambda also allows us to define other utility functions for reading state, such as validating whether scaling is actually required:

def validate_state() -> Dict[str, Any]:
    """
    Validate that pair count matches actual infrastructure.
    
    Returns:
        Dict containing validation result
    """
    try:
        # Get pair count from SSM
        pair_count_result = get_pair_count()
        if not pair_count_result['success']:
            return {
                'success': False,
                'error': 'Failed to retrieve pair count',
                'details': pair_count_result
            }
        
        pair_count = pair_count_result['pair_count']
        
        # Use a separate function to retrieve available resources metadata
        infra_result = get_infrastructure_state()
        if not infra_result['success']:
            return {
                'success': False,
                'error': 'Failed to retrieve infrastructure state',
                'details': infra_result
            }
        
        freeswitch_count = infra_result['freeswitch_count']
        ecs_service_count = infra_result['ecs_service_count']
        
        # Validate counts match
        valid = (freeswitch_count == pair_count and ecs_service_count == pair_count)
        
        result = {
            'success': True,
            'valid': valid,
            'pair_count': pair_count,
            'freeswitch_count': freeswitch_count,
            'ecs_service_count': ecs_service_count
        }
        
        if not valid:
            result['message'] = (f'State mismatch: SSM={pair_count}, '
                                 f'FreeSWITCH={freeswitch_count}, ECS={ecs_service_count}')
            print(result['message'])
        else:
            result['message'] = 'State is consistent'
            print(result['message'])
        
        return result
        
    except Exception as error:  # exception logic
        return {'success': False, 'error': str(error)}

Coordinating Scaling Events with Distributed Locks

When traffic fluctuates wildly, you might trigger scale-up and scale-down events at the exact same time. We use DynamoDB to implement a locking mechanism that prevents race conditions during simultaneous scaling events, because as we all know, crossing the streams is bad.

AWS Lambda provides a convenient interface for the state machine to acquire and release such a lock. From the function’s point of view, these are just conditional writes to DynamoDB.

import time
from typing import Any, Dict, Optional

import boto3

table = boto3.resource('dynamodb').Table('scaling-locks')  # example table name

def acquire_lock(operation_id: str, ttl: int = 7200,
                 metadata: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    current_time = int(time.time())
    expiry_time = current_time + ttl
    
    lock_item = {
        'lock_id': 'scaling-lock',
        'operation_id': operation_id,
        'acquired_at': current_time,
        'ttl': expiry_time,
        'metadata': metadata or {}
    }
    
    try:
        # Conditional write: only succeed if lock_id doesn't exist or TTL has expired
        table.put_item(
            Item=lock_item,
            ConditionExpression='attribute_not_exists(lock_id) OR #ttl < :current_time',
            ExpressionAttributeNames={'#ttl': 'ttl'},
            ExpressionAttributeValues={':current_time': current_time}
        )
        
        print(f"Lock acquired successfully for operation: {operation_id}")
        return {
            'success': True,
            'operation_id': operation_id,
            'acquired_at': current_time,
            'expires_at': expiry_time
        }
        
    except Exception as error:  # exception logic (conditional check failed: lock is held)
        return {'success': False, 'error': str(error)}

def release_lock(operation_id: str) -> Dict[str, Any]:
    try:
        # Delete the lock item
        table.delete_item(
            Key={'lock_id': 'scaling-lock'},
            ConditionExpression='operation_id = :op_id',
            ExpressionAttributeValues={':op_id': operation_id}
        )
        
        print(f"Lock released successfully for operation: {operation_id}")
        return {
            'success': True,
            'operation_id': operation_id,
            'released_at': int(time.time())
        }
        
    except Exception as error:  # exception logic
        return {'success': False, 'error': str(error)}

Execution Layer: Terraform and Ansible Orchestration

Finally, we need something to actually do the heavy lifting. Step Functions triggers AWS CodeBuild jobs that run Terraform apply commands, scaling up and down by setting the desired pair_count explicitly. CodeBuild is also responsible for executing the Ansible playbooks that handle the dynamic configuration of our FreeSWITCH instances.

For instance, the definition of the Terraform job that provisions FreeSWITCH EC2 instances uses the arn:aws:states:::codebuild:startBuild.sync resource and passes the required variables (the target pair count and the Terraform resources to apply). The state machine definition also allows us to set retries and error catching.

"InvokeTerraformCompute": {
  "Type": "Task",
  "Resource": "arn:aws:states:::codebuild:startBuild.sync",
  "Parameters": {
    "ProjectName": "${terraform_project_name}",
    "EnvironmentVariablesOverride": [
      {
        "Name": "TF_VAR_pair_count",
        "Value.$": "States.Format('{}', $.targetPairCount)",
        "Type": "PLAINTEXT"
      },
      {
        "Name": "TF_TARGETS",
        "Value": "aws_instance.freeswitch",
        "Type": "PLAINTEXT"
      }
    ]
  },
  "ResultPath": "$.terraformComputeResult",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 60,
      "MaxAttempts": 3,
      "BackoffRate": 2.0,
      "Comment": "Retry Terraform compute failures with exponential backoff"
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "RollbackSSM"
    }
  ],
  "Next": "InvokeAnsibleFreeSWITCH"
},

The corresponding CodeBuild project features a buildspec that runs the required Terraform commands:

  buildspec = <<-EOT
      version: 0.2
      phases:
        pre_build:
          commands:
            - echo "Starting Terraform execution for $ENVIRONMENT environment"
            - terraform init
        build:
          commands:
            - |
              # Build target flags from TF_TARGETS (comma-separated)
              TARGET_FLAGS=""
              for t in $(echo "$TF_TARGETS" | tr ',' ' '); do
                TARGET_FLAGS="$TARGET_FLAGS -target=$t"
              done
              echo "Terraform targets: $TARGET_FLAGS"
              terraform plan \
                -var="pair_count=$TF_VAR_pair_count" \
                $TARGET_FLAGS \
                -out=tfplan
            - terraform apply -auto-approve tfplan
        post_build:
          commands:
            - echo "Terraform execution completed"
    EOT

Improvement Recommendations

While this event-driven orchestration approach addresses the immediate scaling constraints, the architecture is still shaped by the underlying stateful coupling between signaling, media, and control components. There are several directions that would further simplify scaling and reduce operational overhead.

  • Move Toward Generic Statelessness: We should aim to re-architect the ESL client so it is interchangeable, allowing it to eventually use standard ECS scaling features. Alternatively, we could co-host the client directly on the FreeSWITCH instance. Wherever possible, decoupling media handling from session signaling will make scaling easier.
  • Optimize Provisioning Time: Running Ansible scripts at runtime creates a bottleneck when we need to scale out fast. We should transition to using Custom AMIs (Amazon Machine Images) for our media servers. By baking the configuration into the image, we can achieve near-immediate availability for new nodes to handle sudden traffic spikes.

Stateful Real-Time Systems Require Custom Scaling Patterns

Stateful VoIP and real-time communication infrastructure will always demand more coordination than standard cloud scaling tools are designed to provide. The event-driven orchestration pattern covered here is one way to bridge that gap — without dropping calls or starting from scratch.

At WebRTC.ventures, we design and build real-time communication infrastructure for teams at every stage, from early architecture decisions to production scaling challenges like this one. We specialize in the custom approaches that VoIP systems often demand. If your stack is hitting limits that standard tooling can’t solve, we can help.
