AWS autoscaling works well for stateless applications, but it breaks down quickly for real-time systems like FreeSWITCH, OpenSIPS, and RTPEngine. These VoIP infrastructure systems maintain active SIP sessions and media streams that are tightly bound to specific instances, making standard scale-in and scale-out behavior unsafe.

Because each instance holds live call state, scaling events must be carefully coordinated. Otherwise, even a single premature termination can drop active calls.

This post explores an alternative, event-driven approach to scaling FreeSWITCH on AWS, which can also be applied to other stateful real-time communication infrastructure.

Why AWS Autoscaling Breaks Down for Stateful VoIP Systems

A real-time communication system is composed of both stateless and stateful components.

  • Business logic elements such as APIs and routing backends are generally stateless. This means they don’t store any state themselves; they rely on external storage such as databases or caches.
  • Media servers like FreeSWITCH are highly stateful. Each active call maintains in-memory session state tied to that specific instance. You can’t just snap your fingers like Thanos and make half your instances disappear during a scale-in event. Doing so immediately drops active calls and disrupts RTP streams.

The same challenge applies when scaling out. New instances need to be registered in the service discovery layer before the routing logic can recognize them and start allocating new media sessions to them.

Managing State: Call Draining and Session Completion

To shut down a media server safely, you must follow a process known as “graceful draining”:

  1. Signal the media server to stop accepting new sessions (quiesce).
  2. Wait for existing VoIP calls to complete naturally while maintaining session continuity.
  3. Use an Event Socket Library (ESL) client to verify there are absolutely zero active channels before finally pulling the plug.
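Step 1 can be performed with FreeSWITCH’s fsctl pause command, and step 3 reduces to polling the channel count until it hits zero. Below is a minimal sketch of that drain check, assuming fs_cli is available on the instance; in production this would typically go through a persistent ESL connection instead:

```python
import re
import subprocess
import time

def parse_channel_count(output: str) -> int:
    """Extract the active-channel count from 'show channels count' output,
    which ends with a summary line like '0 total.'"""
    match = re.search(r"(\d+)\s+total", output)
    if match is None:
        raise ValueError(f"unexpected output: {output!r}")
    return int(match.group(1))

def wait_until_drained(poll_seconds: int = 30) -> None:
    """Poll FreeSWITCH until no channels remain; only then is it safe
    to terminate the instance."""
    # Quiesce first so no new sessions arrive while we wait:
    #   fs_cli -x "fsctl pause"
    while True:
        out = subprocess.run(
            ["fs_cli", "-x", "show channels count"],
            capture_output=True, text=True, check=True,
        ).stdout
        if parse_channel_count(out) == 0:
            return
        time.sleep(poll_seconds)
```

The polling interval is a trade-off: short enough that decommissioning isn’t delayed, long enough not to hammer the event socket.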

State-Aware Scaling Out

Scaling out, on the other hand, is pretty straightforward. But it does require start-up logic to set up the required services and register the server in the service discovery layer, so that routing logic can see it and send traffic its way.

When Standard Autoscaling Won’t Work: A FreeSWITCH Example

Traditional AWS cloud scaling services such as Auto Scaling Groups, Elastic Load Balancers, and Amazon ECS are built for stateless architectures. If a web server goes down, another one spins up and the user rarely notices. And while they do incorporate mechanisms to support custom draining and startup logic, there are scenarios where they are not enough.

Consider a typical ESL client + FreeSWITCH pair configured in an unconventional way:

  • The ESL client running on ECS Fargate requires a unique task definition per instance to map exactly 1:1 with FreeSWITCH nodes.
  • The FreeSWITCH instance is configured at runtime using Ansible scripts. This configuration rigidity makes “instant” scaling difficult compared to using pre-baked, ready-to-go containers.

In this scenario, Auto Scaling groups require advanced tweaks not only to provision the new instance, but also to run configuration scripts to set up all the required services. And the ECS Fargate service cannot simply increase its task count, since each task requires its own task definition.

An Event-Driven Alternative to Stateful Scaling

Given these constraints, standard autoscaling isn’t an option. Instead, VoIP infrastructure requires explicit coordination of session state, call draining, and provisioning.

The rest of this post focuses on the FreeSWITCH and ESL client pair introduced above, walking through a custom event-driven orchestration approach our team at WebRTC.ventures built with AWS Step Functions, Terraform, and Ansible to handle provisioning, draining, and state management safely. While the implementation is specific to FreeSWITCH, the same orchestration pattern applies to other stateful RTC components.

Declarative Infrastructure for Stateful Pair Management

Infrastructure as Code (IaC) tools like Terraform or AWS CloudFormation are essential to this approach. They enforce the strict 1:1 relationship between FreeSWITCH instances and ESL clients, automate the complex provisioning required for every scaling event, and prevent unique “snowflake” instances from creeping in during scale-ins or scale-outs.

The key element is defining a “toggle” that allows us to dynamically set the desired number of “pairs” of resources.

variable "pair_count" {
  description = "Number of pairs (FS + ESL client) to deploy"
  type        = number
  default     = 2
}

resource "aws_instance" "freeswitch" {
  count = var.pair_count

  # ... rest of the configuration for the FreeSWITCH EC2 instance
}

resource "aws_ecs_task_definition" "my_task_definition" {
  count = var.pair_count

  # ... rest of the configuration for the ECS task definition
}

resource "aws_ecs_service" "main" {
  count = var.pair_count

  name            = "my_ecs_service"
  cluster         = aws_ecs_cluster.my_cluster.id
  task_definition = aws_ecs_task_definition.my_task_definition[count.index].arn
  desired_count   = 1 # we can only have one task per service
  launch_type     = "FARGATE"

  # ... rest of the configuration
}

The Terraform definition is accompanied by an Ansible playbook that installs FreeSWITCH, deploys the required configuration files, and registers the new instance in the service discovery database so that routing logic can start directing traffic to it.

# FreeSWITCH installation tasks
- name: Install prerequisite packages
  apt:
    name:
      - wget
      - gnupg2
      - ca-certificates
      - lsb-release
    state: present
    update_cache: yes

- name: Add FreeSWITCH GPG key
  apt_key:
    url: https://files.freeswitch.org/repo/deb/debian-release/fsstretch-archive-keyring.asc
    state: present

- name: Add FreeSWITCH repository
  apt_repository:
    repo: "deb http://files.freeswitch.org/repo/deb/debian-release/ {{ ansible_distribution_release }} main"
    state: present

- name: Install FreeSWITCH packages
  apt:
    # 'freeswitch-meta-all' installs everything. You can swap this for
    # specific modules to save time.
    name: freeswitch-meta-all 
    state: present
    update_cache: yes

- name: Ensure FreeSWITCH service is enabled and running
  service:
    name: freeswitch
    state: started
    enabled: yes

# Deployment of config files 
- name: Deploy internal profile
  template:
    src: internal.xml.j2
    dest: "/path/to/config/sip_profiles/internal.xml"
    owner: ubuntu
    group: ubuntu
    mode: '0644'
  register: freeswitch_internal_profile
  notify: Restart FreeSWITCH

- name: Deploy external profile
  template:
    src: external.xml.j2
    dest: "/path/to/config/sip_profiles/external.xml"
    owner: ubuntu
    group: ubuntu
    mode: '0644'
  register: freeswitch_external_profile
  notify: Restart FreeSWITCH

... # rest of files deployments

# Register the new instance to a database for service discovery
- name: Install PostgreSQL client
  apt:
    name:
      - postgresql-client
    state: present
    update_cache: yes

- name: Check if this FreeSWITCH instance is registered in dispatcher table
  shell: |
    psql \
      -h "{{ postgres_host }}" \
      -U "{{ postgres_user }}" \
      -d "{{ postgres_db }}" \
      -t -c "SELECT COUNT(*) FROM dispatcher WHERE setid = 
         {{ freeswitch_dispatcher_setid | default(1) }} AND destination = 
         'sip:{{ freeswitch_sip_ip }}:{{ freeswitch_sip_port }}';"
  environment:
    PGPASSWORD: "{{ postgres_password }}"
  register: dispatcher_check
  changed_when: false

- name: Register this FreeSWITCH instance in dispatcher table
  shell: |
    psql \
      -h "{{ postgres_host }}" \
      -U "{{ postgres_user }}" \
      -d "{{ postgres_db }}" \
      -c "INSERT INTO dispatcher (setid, destination, state, weight, priority, description) 
          VALUES ({{ freeswitch_dispatcher_setid | default(1) }}, 
         'sip:{{ freeswitch_sip_ip }}:{{ freeswitch_sip_port }}', 1, 1, 0, 
         'FreeSWITCH {{ ansible_hostname }}');"
  environment:
    PGPASSWORD: "{{ postgres_password }}"
  when: dispatcher_check.stdout | trim | int == 0
  register: dispatcher_insert
  changed_when: dispatcher_insert.rc == 0

Event-Driven Orchestration with AWS Step Functions

AWS Step Functions acts as the brains of the operation. It orchestrates the complex sequence of scale-in and scale-out events by managing state transitions, retries, and long-polling script executions.

A simplified scale-out state machine looks like this:

  1. Acquire scaling event lock to prevent concurrency issues
  2. Verify the current state of the infrastructure
  3. Provision the infrastructure using terraform
  4. Run the ansible playbook to configure FS instance
  5. Release the lock to allow for further scaling events
FreeSWITCH Scale-Out Orchestration Flow on AWS Step Functions

Scale-in events use the same idea of event locks and state verification, but add a proper draining process before terminating the FS instances. Here is a simplified example:

  1. Acquire scaling event lock
  2. Verify the current state of the infrastructure
  3. Drain active calls from the target FS instances
  4. Decommission infrastructure
  5. Release the lock for further scaling events
FreeSWITCH Scale-In Orchestration Flow on AWS Step Functions
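Both flows begin by reconciling the desired pair count recorded in the state store with the infrastructure that actually exists. A small, purely illustrative helper (names are hypothetical, not from the actual implementation) shows the kind of bounded arithmetic behind a scaling decision:

```python
def next_pair_count(current: int, delta: int,
                    min_pairs: int = 1, max_pairs: int = 10) -> int:
    """Clamp a requested change in pair count so we never drain the last
    remaining pair or provision beyond a sane capacity ceiling."""
    target = current + delta
    return max(min_pairs, min(max_pairs, target))
```

Clamping matters during rapid back-to-back events: a burst of scale-in triggers should never be able to take the fleet below the minimum serving capacity.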

External State Tracking for Scaling Decisions

As we scale our infrastructure in and out, we need a way to track its current state. Our central source of truth is AWS Systems Manager Parameter Store, read through an AWS Lambda function that abstracts all state-reading logic and updated via the native AWS SDK integration (arn:aws:states:::aws-sdk:ssm:putParameter). It keeps track of our current scaling state and stores the metadata of all our active nodes.
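The state machine updates this parameter through the native aws-sdk:ssm:putParameter integration; the equivalent write from Python looks like the sketch below. The parameter path and helper name are illustrative, not the production names:

```python
from typing import Any, Dict

PAIR_COUNT_PARAMETER = "/freeswitch/pair_count"  # hypothetical parameter path

def build_put_request(pair_count: int) -> Dict[str, Any]:
    """Build the PutParameter request that records a new desired pair count."""
    if pair_count < 0:
        raise ValueError("pair_count must be non-negative")
    return {
        "Name": PAIR_COUNT_PARAMETER,
        "Value": str(pair_count),  # SSM stores values as strings
        "Type": "String",
        "Overwrite": True,  # replace the previous count
    }

# With AWS credentials available:
#   import boto3
#   boto3.client("ssm").put_parameter(**build_put_request(3))
```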

Retrieving the current state of the infrastructure is just a matter of calling the get_parameter API and returning the result to the state machine.

import boto3
from typing import Any, Dict

ssm = boto3.client('ssm')
SSM_PARAMETER_PATH = '/freeswitch/pair_count'  # example parameter path

def get_pair_count() -> Dict[str, Any]:
    """
    Read current pair count from SSM Parameter Store.

    Returns:
        Dict containing pair count
    """
    try:
        response = ssm.get_parameter(Name=SSM_PARAMETER_PATH)
        pair_count = int(response['Parameter']['Value'])

        print(f"Retrieved pair count: {pair_count} from {SSM_PARAMETER_PATH}")

        return {
            'success': True,
            'pair_count': pair_count,
            'parameter_path': SSM_PARAMETER_PATH
        }

    except Exception as error:  # exception logic
        return {'success': False, 'error': str(error)}

Lambda also allows us to define other utility functions for reading state, such as validating whether scaling is actually required:

def validate_state() -> Dict[str, Any]:
    """
    Validate that pair count matches actual infrastructure.
    
    Returns:
        Dict containing validation result
    """
    try:
        # Get pair count from SSM
        pair_count_result = get_pair_count()
        if not pair_count_result['success']:
            return {
                'success': False,
                'error': 'Failed to retrieve pair count',
                'details': pair_count_result
            }
        
        pair_count = pair_count_result['pair_count']
        
        # Use a separate function to retrieve available resources metadata
        infra_result = get_infrastructure_state()
        if not infra_result['success']:
            return {
                'success': False,
                'error': 'Failed to retrieve infrastructure state',
                'details': infra_result
            }
        
        freeswitch_count = infra_result['freeswitch_count']
        ecs_service_count = infra_result['ecs_service_count']
        
        # Validate counts match
        valid = (freeswitch_count == pair_count and ecs_service_count == pair_count)
        
        result = {
            'success': True,
            'valid': valid,
            'pair_count': pair_count,
            'freeswitch_count': freeswitch_count,
            'ecs_service_count': ecs_service_count
        }
        
        if not valid:
            result['message'] = (f'State mismatch: SSM={pair_count}, '
                                 f'FreeSWITCH={freeswitch_count}, ECS={ecs_service_count}')
            print(result['message'])
        else:
            result['message'] = 'State is consistent'
            print(result['message'])
        
        return result
        
    except Exception as error:  # exception logic
        return {'success': False, 'error': str(error)}

Coordinating Scaling Events with Distributed Locks

When traffic fluctuates wildly, you might trigger scale-up and scale-down events at the exact same time. We use DynamoDB to implement a locking mechanism that prevents race conditions during simultaneous scaling events, because as we all know, crossing the streams is bad.

AWS Lambda provides a convenient interface for the state machine to acquire and release such a lock. From the function’s point of view, these are just conditional writes to DynamoDB.

import time
from typing import Any, Dict, Optional

import boto3

table = boto3.resource('dynamodb').Table('scaling-locks')  # example table name

def acquire_lock(operation_id: str, ttl: int = 7200,
                 metadata: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    current_time = int(time.time())
    expiry_time = current_time + ttl
    
    lock_item = {
        'lock_id': 'scaling-lock',
        'operation_id': operation_id,
        'acquired_at': current_time,
        'ttl': expiry_time,
        'metadata': metadata or {}
    }
    
    try:
        # Conditional write: only succeed if lock_id doesn't exist or TTL has expired
        table.put_item(
            Item=lock_item,
            ConditionExpression='attribute_not_exists(lock_id) OR #ttl < :current_time',
            ExpressionAttributeNames={'#ttl': 'ttl'},
            ExpressionAttributeValues={':current_time': current_time}
        )
        
        print(f"Lock acquired successfully for operation: {operation_id}")
        return {
            'success': True,
            'operation_id': operation_id,
            'acquired_at': current_time,
            'expires_at': expiry_time
        }
        
    except Exception as error:  # exception logic (conditional check failed: lock is held)
        return {'success': False, 'error': str(error)}

def release_lock(operation_id: str) -> Dict[str, Any]:
    try:
        # Delete the lock item
        table.delete_item(
            Key={'lock_id': 'scaling-lock'},
            ConditionExpression='operation_id = :op_id',
            ExpressionAttributeValues={':op_id': operation_id}
        )
        
        print(f"Lock released successfully for operation: {operation_id}")
        return {
            'success': True,
            'operation_id': operation_id,
            'released_at': int(time.time())
        }
        
    except Exception as error:  # exception logic
        return {'success': False, 'error': str(error)}

Execution Layer: Terraform and Ansible Orchestration

Finally, we need something to actually do the heavy lifting. Step Functions triggers AWS CodeBuild jobs that run Terraform apply commands, scaling up and down by setting the desired pair_count explicitly. CodeBuild is also responsible for executing the Ansible playbooks that handle the dynamic configuration of our FreeSWITCH instances.

For instance, the definition of the Terraform job that provisions FreeSWITCH EC2 instances uses the arn:aws:states:::codebuild:startBuild.sync resource and passes the required variables (the target pair count and the Terraform resources to apply). The state machine definition also allows us to set retries and error catching.

"InvokeTerraformCompute": {
  "Type": "Task",
  "Resource": "arn:aws:states:::codebuild:startBuild.sync",
  "Parameters": {
    "ProjectName": "${terraform_project_name}",
    "EnvironmentVariablesOverride": [
      {
        "Name": "TF_VAR_pair_count",
        "Value.$": "States.Format('{}', $.targetPairCount)",
        "Type": "PLAINTEXT"
      },
      {
        "Name": "TF_TARGETS",
        "Value": "aws_instance.freeswitch",
        "Type": "PLAINTEXT"
      }
    ]
  },
  "ResultPath": "$.terraformComputeResult",
  "Retry": [
    {
      "ErrorEquals": ["States.TaskFailed"],
      "IntervalSeconds": 60,
      "MaxAttempts": 3,
      "BackoffRate": 2.0,
      "Comment": "Retry Terraform compute failures with exponential backoff"
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "RollbackSSM"
    }
  ],
  "Next": "InvokeAnsibleFreeSWITCH"
},

The corresponding CodeBuild project features a buildspec that runs the required Terraform commands:

  buildspec = <<-EOT
      version: 0.2
      phases:
        pre_build:
          commands:
            - echo "Starting Terraform execution for $ENVIRONMENT environment"
            - terraform init
        build:
          commands:
            - |
              # Build target flags from TF_TARGETS (comma-separated)
              TARGET_FLAGS=""
              for t in $(echo "$TF_TARGETS" | tr ',' ' '); do
                TARGET_FLAGS="$TARGET_FLAGS -target=$t"
              done
              echo "Terraform targets: $TARGET_FLAGS"
              terraform plan \
                -var="pair_count=$TF_VAR_pair_count" \
                $TARGET_FLAGS \
                -out=tfplan
            - terraform apply -auto-approve tfplan
        post_build:
          commands:
            - echo "Terraform execution completed"
    EOT

Improvement Recommendations

While this event-driven orchestration approach addresses the immediate scaling constraints, the architecture is still shaped by the underlying stateful coupling between signaling, media, and control components. There are several directions that would further simplify scaling and reduce operational overhead.

  • Move Toward Generic Statelessness: We should aim to re-architect the ESL client so it is interchangeable, allowing it to eventually use standard ECS scaling features. Alternatively, we could co-host the client directly on the FreeSWITCH instance. Wherever possible, decoupling media handling from session signaling will make scaling easier.
  • Optimize Provisioning Time: Running Ansible scripts at runtime creates a bottleneck when we need to scale out fast. We should transition to using Custom AMIs (Amazon Machine Images) for our media servers. By baking the configuration into the image, we can achieve near-immediate availability for new nodes to handle sudden traffic spikes.

Stateful Real-Time Systems Require Custom Scaling Patterns

Stateful VoIP and real-time communication infrastructure will always demand more coordination than standard cloud scaling tools are designed to provide. The event-driven orchestration pattern covered here is one way to bridge that gap — without dropping calls or starting from scratch.

At WebRTC.ventures, we design and build real-time communication infrastructure for teams at every stage, from early architecture decisions to production scaling challenges like this one. We specialize in the custom approaches that VoIP systems often demand. If your stack is hitting limits that standard tooling can’t solve, we can help.
