AWS autoscaling works well for stateless applications. And for the stateless components of VoIP and real-time systems like APIs and routing backends, AWS Auto Scaling groups (ASGs) and Amazon Elastic Container Service (ECS) do exactly what they’re designed to do.

But stateful VoIP infrastructure components like FreeSWITCH and RTPEngine maintain active SIP sessions and media streams tightly bound to specific instances. Because each instance holds live call state, scaling events must be carefully coordinated. Otherwise, even a single premature termination can drop active calls.

Depending on how those components are configured, standard autoscaling tools may not be enough to handle that coordination safely. This post walks through one such scenario: a specific FreeSWITCH and ESL client configuration that made standard ASG and ECS task scaling impractical, and the event-driven orchestration approach we built to handle it instead.

Why AWS Autoscaling Can Break Down for Stateful VoIP Systems

A real-time communication system is composed of both stateless and stateful components.

  • Business logic elements such as APIs and routing backends are generally stateless. This means they don’t hold any session state themselves. They rely on external storage such as databases or caches.
  • Media servers like FreeSWITCH are highly stateful. Each active call maintains in-memory session state tied to that specific instance. You can’t just snap your fingers like Thanos and make half your instances disappear during a scale-in event. Doing so immediately drops active calls and disrupts RTP streams.

The same challenge applies when scaling out. New instances need to be registered in the service discovery layer before the routing logic can recognize them and start allocating new media sessions to them.

Managing State: Call Draining and Session Completion

To shut down a media server safely, you must use a process called “graceful draining,” sketched in code after these steps:

  1. Signal the media server to stop accepting new sessions (quiesce).
  2. Wait for existing VoIP calls to complete naturally while maintaining session continuity.
  3. Use an Event Socket Library (ESL) client to verify there are absolutely zero active channels before finally pulling the plug.
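
Here is a minimal drain loop illustrating those three steps, assuming the Python ESL bindings that ship with FreeSWITCH and a default event socket configuration (127.0.0.1:8021 with the ClueCon password); the host, port, password, and polling interval are all deployment-specific assumptions:

# Minimal graceful-drain sketch using the FreeSWITCH Python ESL bindings.
# Connection details and polling interval are illustrative assumptions.
import sys
import time

import ESL  # the SWIG bindings distributed with FreeSWITCH

def drain(host="127.0.0.1", port="8021", password="ClueCon", poll_secs=10):
    con = ESL.ESLconnection(host, port, password)
    if not con.connected():
        sys.exit("Could not connect to the FreeSWITCH event socket")

    # Step 1: stop accepting new sessions (quiesce).
    con.api("fsctl", "pause")

    # Steps 2 and 3: poll until zero active channels remain.
    while True:
        body = con.api("show", "channels count").getBody()
        active = int(body.strip().split()[0])  # body looks like "0 total."
        if active == 0:
            break
        print(f"{active} channels still active, waiting...")
        time.sleep(poll_secs)

    print("Drain complete, safe to terminate the instance")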

State-Aware Scaling Out

Scaling out, on the other hand, is pretty straightforward. But it does require start-up logic to set up the required services and register the server in a service discovery layer so the routing logic can see it and send traffic to it.

A FreeSWITCH Configuration Where Standard Autoscaling Falls Short

If your FreeSWITCH and ESL client are configured in a conventional way, traditional AWS cloud scaling services should handle your scaling needs. The scenario described here is driven by a specific, unconventional implementation where the ESL client requires a unique task definition per instance to maintain a strict 1:1 mapping with FreeSWITCH nodes, and where FreeSWITCH itself is configured at runtime via Ansible rather than using pre-baked images.

Here is the specific configuration that makes standard autoscaling impractical in this case:

  1. The ESL client running on ECS Fargate requires a unique task definition per instance to map exactly 1:1 with FreeSWITCH nodes.
  2. The FreeSWITCH instance is configured at runtime using Ansible scripts. This configuration rigidity makes “instant” scaling difficult compared to using pre-baked, ready-to-go containers.

These two constraints together are what rule out standard ASG and ECS task scaling for this specific setup, and what motivated the event-driven approach described below. While the implementation is specific to FreeSWITCH, the same orchestration pattern applies to other stateful RTC components that may require special coordination that ASG and ECS task counts don’t always support.

An Event-Driven Alternative to Stateful Scaling

Given these constraints, standard autoscaling isn’t the right fit for this setup. Instead, this FreeSWITCH and ESL client configuration requires explicit coordination of session state, call draining, and provisioning, which is exactly what our event-driven approach is designed to handle.

Declarative Infrastructure for Stateful Pair Management

Infrastructure as Code (IaC) tools like Terraform or AWS CloudFormation are essential to this approach. They enforce the strict 1:1 relationship between FreeSWITCH instances and ESL clients, automate the complex provisioning required for every scaling event, and prevent unique “snowflake” instances from creeping in during scale-ins or scale-outs.

The key element is defining a “toggle” that lets us dynamically set the desired number of “pairs” of resources.

variable "pair_count" {
  description = "Number of pairs (FS + ESL client) to deploy"
  type        = number
  default     = 2
}

resource "aws_instance" "freeswitch" {
  count = var.pair_count

  ... # rest of the configuration for FreeSWITCH EC2 instance
}

resource "aws_ecs_task_definition" "my_task_definition" {
  count = var.pair_count

  ... # rest of the configuration for ECS task definition
}

resource "aws_ecs_service" "main" {
  count = var.pair_count

  name            = "my_ecs_service"
  cluster         = aws_ecs_cluster.my_cluster.id
  task_definition = aws_ecs_task_definition.my_task_definition[count.index].arn
  desired_count   = 1 # exactly one task per service, preserving the 1:1 mapping
  launch_type     = "FARGATE"

  ... # rest of the configuration
}

The Terraform definition is accompanied by an Ansible playbook that installs FreeSWITCH, deploys the required configuration files, and registers the new instance in the service discovery database so that routing logic can start directing traffic to it.

# FreeSWITCH installation tasks
- name: Install prerequisite packages
  apt:
    name:
      - wget
      - gnupg2
      - ca-certificates
      - lsb-release
    state: present
    update_cache: yes

- name: Add FreeSWITCH GPG key
  apt_key:
    url: https://files.freeswitch.org/repo/deb/debian-release/fsstretch-archive-keyring.asc
    state: present

- name: Add FreeSWITCH repository
  apt_repository:
    repo: "deb http://files.freeswitch.org/repo/deb/debian-release/ 
      {{ ansible_distribution_release }} main"
    state: present

- name: Install FreeSWITCH packages
  apt:
    # 'freeswitch-meta-all' installs everything. You can swap this for
    # specific modules to save time.
    name: freeswitch-meta-all
    state: present
    update_cache: yes

- name: Ensure FreeSWITCH service is enabled and running
  service:
    name: freeswitch
    state: started
    enabled: yes

# Deployment of config files 
- name: Deploy internal profile
  template:
    src: internal.xml.j2
    dest: "/path/to/config/sip_profiles/internal.xml"
    owner: ubuntu
    group: ubuntu
    mode: '0644'
  register: freeswitch_internal_profile
  notify: Restart FreeSWITCH

- name: Deploy external profile
  template:
    src: external.xml.j2
    dest: "/path/to/config/sip_profiles/external.xml"
    owner: ubuntu
    group: ubuntu
    mode: '0644'
  register: freeswitch_external_profile
  notify: Restart FreeSWITCH

... # rest of files deployments

# Register the new instance to a database for service discovery
- name: Install PostgreSQL client
  apt:
    name:
      - postgresql-client
    state: present
    update_cache: yes

- name: Check if this FreeSWITCH instance is registered in dispatcher table
  shell: |
    psql \
      -h "{{ postgres_host }}" \
      -U "{{ postgres_user }}" \
      -d "{{ postgres_db }}" \
      -t -c "SELECT COUNT(*) FROM dispatcher WHERE setid = 
         {{ freeswitch_dispatcher_setid | default(1) }} AND destination = 
         'sip:{{ freeswitch_sip_ip }}:{{ freeswitch_sip_port }}';"
  environment:
    PGPASSWORD: "{{ postgres_password }}"
  register: dispatcher_check
  changed_when: false

- name: Register this FreeSWITCH instance in dispatcher table
  shell: |
    psql \
      -h "{{ postgres_host }}" \
      -U "{{ postgres_user }}" \
      -d "{{ postgres_db }}" \
      -c "INSERT INTO dispatcher (setid, destination, state, weight, priority, description) 
          VALUES ({{ freeswitch_dispatcher_setid | default(1) }}, 
         'sip:{{ freeswitch_sip_ip }}:{{ freeswitch_sip_port }}', 1, 1, 0, 
         'FreeSWITCH {{ ansible_hostname }}');"
  environment:
    PGPASSWORD: "{{ postgres_password }}"
  when: dispatcher_check.stdout | trim | int == 0
  register: dispatcher_insert
  changed_when: dispatcher_insert.rc == 0

Event-Driven Orchestration with AWS Step Functions

AWS Step Functions acts as the brains of the operation. It orchestrates the complex sequence of scale-in and scale-out events by managing state transitions, retries, and long-running script executions.

A simplified scale-out state machine would look like this:

  1. Acquire scaling event lock to prevent concurrency issues
  2. Verify the current state of the infrastructure
  3. Provision the infrastructure using terraform
  4. Run the ansible playbook to configure FS instance
  5. Release the lock to allow for further scaling events

(Diagram: FreeSWITCH scale-out orchestration flow on AWS Step Functions, showing distributed locking, state verification, and a Terraform/Ansible pipeline that brings new capacity online without disrupting active infrastructure.)

Scale-in events use the same locking and state-verification steps, but add a proper draining process, like the ESL sketch shown earlier, before terminating the FS instances. Here is a simplified example:

  1. Acquire scaling event lock
  2. Verify the current state of the infrastructure
  3. Drain active calls from the target FS instances
  4. Decommission infrastructure
  5. Release the lock for further scaling events

(Diagram: FreeSWITCH scale-in orchestration flow on AWS Step Functions, showing distributed locking, graceful call draining, and Terraform-based decommissioning that scales down without dropping active calls.)

External State Tracking for Scaling Decisions

As we scale our infrastructure in and out, we need a way to manage its current state. As our central source of truth, we use AWS Systems Manager Parameter Store, read through an AWS Lambda function that abstracts all state-reading logic, and updated via the native AWS SDK integration (arn:aws:states:::aws-sdk:ssm:putParameter). It keeps track of our current scaling state and stores the metadata of all our active nodes.
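
For reference, the update the state machine performs through that SDK integration is equivalent to the boto3 call below, shown in Python for clarity. The parameter path is an illustrative assumption, not the production value:

# Equivalent of the state machine's ssm:putParameter SDK integration.
# The parameter path below is a hypothetical example.
import boto3

ssm = boto3.client("ssm")
ssm.put_parameter(
    Name="/freeswitch/scaling/pair_count",  # hypothetical path
    Value=str(3),                           # the new target pair count
    Type="String",
    Overwrite=True,                         # replace the previous value
)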

Retrieving the current state of the infrastructure is just a matter of calling the get_parameter API and returning the value to the state machine.

# Assumed setup for the snippets below: a boto3 SSM client and an
# illustrative parameter path (the real path is deployment-specific).
from typing import Any, Dict

import boto3

ssm = boto3.client('ssm')
SSM_PARAMETER_PATH = '/freeswitch/scaling/pair_count'  # hypothetical example

def get_pair_count() -> Dict[str, Any]:
    """
    Read current pair count from SSM Parameter Store.
    
    Returns:
        Dict containing pair count
    """
    try:
        response = ssm.get_parameter(Name=SSM_PARAMETER_PATH)
        pair_count = int(response['Parameter']['Value'])
        
        print(f"Retrieved pair count: {pair_count} from {SSM_PARAMETER_PATH}")
        
        return {
            'success': True,
            'pair_count': pair_count,
            'parameter_path': SSM_PARAMETER_PATH
        }
        
    except Exception:
        ...  # exception logic

The power of Lambda also allows us to define other utility functions for reading state, such as validating whether scaling is required:

def validate_state() -> Dict[str, Any]:
    """
    Validate that pair count matches actual infrastructure.
    
    Returns:
        Dict containing validation result
    """
    try:
        # Get pair count from SSM
        pair_count_result = get_pair_count()
        if not pair_count_result['success']:
            return {
                'success': False,
                'error': 'Failed to retrieve pair count',
                'details': pair_count_result
            }
        
        pair_count = pair_count_result['pair_count']
        
        # Use a separate function to retrieve available resources metadata
        infra_result = get_infrastructure_state()
        if not infra_result['success']:
            return {
                'success': False,
                'error': 'Failed to retrieve infrastructure state',
                'details': infra_result
            }
        
        freeswitch_count = infra_result['freeswitch_count']
        ecs_service_count = infra_result['ecs_service_count']
        
        # Validate counts match
        valid = (freeswitch_count == pair_count and ecs_service_count == pair_count)
        
        result = {
            'success': True,
            'valid': valid,
            'pair_count': pair_count,
            'freeswitch_count': freeswitch_count,
            'ecs_service_count': ecs_service_count
        }
        
        if not valid:
            result['message'] = (
                f'State mismatch: SSM={pair_count}, '
                f'FreeSWITCH={freeswitch_count}, ECS={ecs_service_count}'
            )
            print(result['message'])
        else:
            result['message'] = 'State is consistent'
            print(result['message'])
        
        return result
        
    except Exception:
        ...  # exception logic

Coordinating Scaling Events with Distributed Locks

When traffic fluctuates wildly, you might trigger scale-up and scale-down events at the exact same time. We use DynamoDB to implement a locking mechanism. This prevents race conditions during simultaneous scaling events, because as we all know, crossing the streams is bad.

AWS Lambda provides a nice interface for the state machine to acquire and release such a lock. From the function’s point of view, these are just conditional writes.

# Assumed setup: a DynamoDB table whose partition key is lock_id.
import time
from typing import Any, Dict, Optional

import boto3

table = boto3.resource('dynamodb').Table('scaling-locks')  # hypothetical table name

def acquire_lock(operation_id: str, ttl: int = 7200,
                 metadata: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    current_time = int(time.time())
    expiry_time = current_time + ttl
    
    lock_item = {
        'lock_id': 'scaling-lock',
        'operation_id': operation_id,
        'acquired_at': current_time,
        'ttl': expiry_time,
        'metadata': metadata or {}
    }
    
    try:
        # Conditional write: only succeed if lock_id doesn't exist or TTL has expired
        table.put_item(
            Item=lock_item,
            ConditionExpression='attribute_not_exists(lock_id) OR #ttl < :current_time',
            ExpressionAttributeNames={'#ttl': 'ttl'},
            ExpressionAttributeValues={':current_time': current_time}
        )
        
        print(f"Lock acquired successfully for operation: {operation_id}")
        return {
            'success': True,
            'operation_id': operation_id,
            'acquired_at': current_time,
            'expires_at': expiry_time
        }
        
    except table.meta.client.exceptions.ConditionalCheckFailedException:
        ...  # exception logic (lock is held by another operation)

def release_lock(operation_id: str) -> Dict[str, Any]:
    try:
        # Delete the lock item
        table.delete_item(
            Key={'lock_id': 'scaling-lock'},
            ConditionExpression='operation_id = :op_id',
            ExpressionAttributeValues={':op_id': operation_id}
        )
        
        print(f"Lock released successfully for operation: {operation_id}")
        return {
            'success': True,
            'operation_id': operation_id,
            'released_at': int(time.time())
        }
        
    except Exception:
        ...  # exception logic
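
To show how the state machine might call these helpers, here is a minimal handler sketch. The 'action' and 'operation_id' fields in the event payload are an assumed convention, not part of the original implementation:

# Hypothetical Lambda entry point: the state machine passes an "action"
# and an "operation_id" (e.g., the Step Functions execution name).
def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
    action = event.get('action')
    operation_id = event['operation_id']

    if action == 'acquire':
        return acquire_lock(operation_id, metadata=event.get('metadata'))
    if action == 'release':
        return release_lock(operation_id)
    return {'success': False, 'error': f'Unknown action: {action}'}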

Execution Layer: Terraform and Ansible Orchestration

Finally, we need something to actually do the heavy lifting. Step Functions triggers AWS CodeBuild jobs that run terraform apply, scaling up and down by setting the desired pair_count explicitly. CodeBuild is also responsible for executing the Ansible playbooks that handle the dynamic configuration of our FreeSWITCH instances.

For instance, the definition of the Terraform job that provisions FreeSWITCH EC2 instances uses the arn:aws:states:::codebuild:startBuild.sync resource and passes the required variables (the target pair count and the Terraform resources to apply). The state machine definition also lets us set retries and error catching.

"InvokeTerraformCompute": {
  "Type": "Task",
  "Resource": "arn:aws:states:::codebuild:startBuild.sync",
  "Parameters": {
    "ProjectName": "${terraform_project_name}",
    "EnvironmentVariablesOverride": [
      {
        "Name": "TF_VAR_pair_count",
        "Value.$": "States.Format('{}', $.targetPairCount)",
        "Type": "PLAINTEXT"
      },
      {
        "Name": "TF_TARGETS",
        "Value": "aws_instance.freeswitch",
        "Type": "PLAINTEXT"
      }
    ]
  },
  "ResultPath": "$.terraformComputeResult",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 60,
      "MaxAttempts": 3,
      "BackoffRate": 2.0,
      "Comment": "Retry Terraform compute failures with exponential backoff"
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "RollbackSSM"
    }
  ],
  "Next": "InvokeAnsibleFreeSWITCH"
},

The corresponding CodeBuild project features a buildspec that runs the required Terraform commands:

  buildspec = <<-EOT
      version: 0.2
      phases:
        pre_build:
          commands:
            - echo "Starting Terraform execution for $ENVIRONMENT environment"
            - terraform init
        build:
          commands:
            - |
              # Build target flags from TF_TARGETS (comma-separated)
              TARGET_FLAGS=""
              for t in $(echo "$TF_TARGETS" | tr ',' ' '); do
                TARGET_FLAGS="$TARGET_FLAGS -target=$t"
              done
              echo "Terraform targets: $TARGET_FLAGS"
              terraform plan \
                -var="pair_count=$TF_VAR_pair_count" \
                $TARGET_FLAGS \
                -out=tfplan
            - terraform apply -auto-approve tfplan
        post_build:
          commands:
            - echo "Terraform execution completed"
    EOT

Improvement Recommendations

While this event-driven orchestration approach addresses the immediate scaling constraints, the architecture is still shaped by the underlying stateful coupling between signaling, media, and control components. There are several directions that would further simplify scaling and reduce operational overhead.

  • Move Toward Generic Statelessness: We should aim to re-architect the ESL client so it is interchangeable, allowing it to eventually use standard ECS scaling features. Alternatively, we could co-host the client directly on the FreeSWITCH instance. Wherever possible, decoupling media handling from session signaling will make scaling easier.
  • Optimize Provisioning Time: Running Ansible scripts at runtime creates a bottleneck when we need to scale out fast. We should transition to using custom AMIs (Amazon Machine Images) for our media servers. By baking the configuration into the image, we can achieve near-immediate availability for new nodes to handle sudden traffic spikes (see the sketch after this list).
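
As a rough illustration of that second recommendation, an instance that has already been configured by Ansible can be snapshotted into a reusable AMI with a single boto3 call. The instance ID and image name below are placeholders, and in practice a tool like Packer would manage this bake pipeline:

# Bake a configured FreeSWITCH instance into an AMI so future scale-outs
# can skip the runtime Ansible step. IDs and names are placeholders.
import time

import boto3

ec2 = boto3.client("ec2")

response = ec2.create_image(
    InstanceId="i-0123456789abcdef0",             # a fully configured FS node
    Name=f"freeswitch-baked-{int(time.time())}",  # unique image name
    Description="Pre-baked FreeSWITCH media server image",
    NoReboot=False,  # reboot for a consistent filesystem snapshot
)
print(f"Building AMI: {response['ImageId']}")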

Stateful Real-Time Systems Require Custom Scaling Patterns

When stateful VoIP and real-time communication infrastructure demands more coordination than standard cloud scaling tools provide, the gap needs to be bridged carefully without dropping calls or disrupting active sessions. The event-driven orchestration pattern covered here is one way to do that.

At WebRTC.ventures, we design and build real-time communication infrastructure for teams at every stage, from early architecture decisions to production scaling challenges like this one. We specialize in the custom approaches that VoIP systems often demand. If your stack is hitting limits that standard tooling can’t solve, we can help.
