SOC 2 Backup and Disaster Recovery for On-Premise Infrastructure

Written by Ali Aleali | Mar 25, 2026 1:59:59 PM

Cloud disaster recovery is a region failover. Click a button, spin up infrastructure in another region, and the platform handles replication, failover, and recovery. The DR plan is largely a configuration decision, and the evidence is a screenshot of the replication settings.

On-premise disaster recovery is a fundamentally different discipline. It involves physical media, offsite transfers, colocation provider coordination, manual restoration procedures, and a set of dependencies that cannot be abstracted away behind an API. When the production database lives on a physical server in a rack, the backup strategy involves questions that cloud environments never face: where does the backup physically go? How does it get there? How long would it take to restore from scratch if the primary site were unavailable?

For companies preparing for SOC 2 on bare metal infrastructure, backup and disaster recovery is one of the domains where the gap between cloud-first guides and on-prem reality is widest. SOC 2 Availability criteria (A1.2 and A1.3) require organizations to maintain recovery procedures and test them. The criteria do not care whether the infrastructure is cloud or physical, but the evidence looks materially different.

This post covers how to build a backup and disaster recovery program for on-prem infrastructure that satisfies SOC 2 Availability criteria, produces audit-ready evidence, and gives the team confidence that recovery actually works when it matters.

What This Article Covers

How to set RPO and RTO targets that the team can actually meet and defend to an auditor
The three-tier backup scope model for on-prem environments
Database and infrastructure backup tools for physical infrastructure
Offsite and air-gapped backup requirements under SOC 2 A1.2
The five-part evidence package auditors want to see for backup and DR
Colocation shared responsibility and what the colo does not cover
The operating cadence: daily, weekly, quarterly, and annual activities

RPO and RTO: Setting Realistic Targets

Every backup and DR program starts with two numbers that define the recovery expectations.

Recovery Point Objective (RPO)

The maximum acceptable data loss measured in time. An RPO of 4 hours means the organization accepts losing up to 4 hours of data in a disaster scenario. This directly determines backup frequency: an RPO of 4 hours requires backups at least every 4 hours.

Recovery Time Objective (RTO)

The maximum acceptable downtime. An RTO of 8 hours means the organization commits to having the system operational within 8 hours of a disaster declaration. RTO governs how you design standby environments, staffing during recovery, and coordination with the colocation provider.

The mistake that creates audit friction

Setting RPO and RTO targets based on what sounds impressive rather than what is achievable. An RPO of 1 hour sounds reasonable, but if nightly full backups are the only backup mechanism and transaction log shipping is not configured, the actual RPO is 24 hours. An RTO of 4 hours sounds professional, but if restoring a full database backup to a cold standby server takes 8 hours including OS setup, application deployment, data restoration, and verification, the policy is fiction. The auditor does not penalize realistic targets. They penalize targets that the evidence shows the organization cannot meet.

The approach that works: define RPO and RTO based on actual tested recovery capabilities, not aspirational targets. Run a restoration test, measure how long it takes, and set the RTO with enough margin to account for real-world conditions, including staff availability, troubleshooting time, and coordination with the colo provider.
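The arithmetic behind these checks is simple enough to script. The sketch below is a minimal illustration rather than a prescribed method: the function names are hypothetical, the worst-case model (backup interval plus backup duration) is a simplifying assumption, and the 1.5x margin factor is an arbitrary placeholder that each team should replace with its own judgment.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Assumed worst case: the failure hits just before the next backup completes."""
    return backup_interval + backup_duration


def rpo_is_achievable(target_rpo: timedelta,
                      backup_interval: timedelta,
                      backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the backup schedule can actually deliver the stated RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= target_rpo


def defensible_rto(measured_restore: timedelta,
                   margin_factor: float = 1.5) -> timedelta:
    """Measured restoration time plus a margin (margin_factor is an assumed value)."""
    return measured_restore * margin_factor


# A 1-hour RPO backed only by nightly full backups is not achievable.
print(rpo_is_achievable(timedelta(hours=1), timedelta(hours=24)))
# An 8-hour measured restore with the assumed 1.5x margin.
print(defensible_rto(timedelta(hours=8)))
```

Running the numbers this way before writing the policy keeps the stated targets tied to what the restoration test actually measured.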

Scope: What Gets Backed Up

Not every system in the environment needs the same backup strategy. The tiered asset classification used across the on-prem SOC 2 cluster applies here, but the tiers map to different backup approaches rather than different scanning cadences.

BACKUP SCOPE BY TIER

Tier 1: Critical Data

Production databases, application data, customer data, and any data whose loss would prevent the business from operating. These get the most aggressive backup schedule and the most thorough restoration testing. If you back up nothing else, you back up Tier 1.

Tier 2: System Configuration

Server configurations, application deployment configurations, firewall rules, network configurations, and infrastructure-as-code repositories. These are recoverable from configuration management or documentation, but having backups accelerates recovery time significantly. Missing Tier 2 backups means rebuilding from documentation under pressure.

Tier 3: Operational Data

Logs, monitoring data, historical reports, and other data that supports operations but is not critical for business continuity. Backup cadence can be less aggressive, and loss is inconvenient rather than catastrophic. Tier 3 loss does not trigger the DR plan.

Technology: The On-Prem Backup Stack

Database Backups

Database backups are the highest-priority component of any on-prem backup program. The approach depends on the database platform:

Platform	Backup Approach	SOC 2 Pattern
SQL Server	Native BACKUP DATABASE with full, differential, and transaction log types. SQL Server Agent handles scheduling and failure alerts.	Nightly full, hourly differential during business hours, transaction logs every 15-30 min for critical databases.
PostgreSQL	pg_dump for logical backups (portable); pg_basebackup for physical backups with WAL archiving for point-in-time recovery.	WAL archiving provides near-continuous backup; pg_dump provides a portable restore fallback to any PostgreSQL instance.
MySQL / MariaDB	mysqldump for logical backups; Percona XtraBackup (or Mariabackup on current MariaDB releases) for hot physical backups that do not lock InnoDB tables during the backup process.	Percona XtraBackup (Mariabackup for MariaDB) is the usual production choice because it avoids locking issues on large databases during backup windows.

All three platforms support automated scheduling through cron jobs, built-in schedulers, or backup management tools. Every backup job should alert on failure, not just log success.
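One way to get failure alerts from a cron-scheduled job is to wrap the backup command in a small script. The following is a sketch under stated assumptions: run_backup_job and stderr_alert are hypothetical names, the alert channel is a placeholder that only prints to stderr, and the pg_dump paths and database name in the usage comment are invented for illustration.

```python
import subprocess
import sys
from typing import Callable, Sequence


def stderr_alert(message: str) -> None:
    """Placeholder alert channel; a real job would page or email the backup owner."""
    print(f"BACKUP ALERT: {message}", file=sys.stderr)


def run_backup_job(name: str,
                   command: Sequence[str],
                   alert: Callable[[str], None] = stderr_alert) -> bool:
    """Run one backup command and alert on a launch error or non-zero exit code."""
    try:
        result = subprocess.run(command, capture_output=True, text=True)
    except OSError as exc:  # for example, the backup binary is not installed
        alert(f"{name}: could not start backup: {exc}")
        return False
    if result.returncode != 0:
        alert(f"{name}: exit code {result.returncode}: {result.stderr.strip()}")
        return False
    return True


# Hypothetical usage from a cron-invoked script (paths and names are assumptions):
# run_backup_job(
#     "orders-db",
#     ["pg_dump", "--format=custom", "--file=/backups/orders.dump", "orders"],
# )
```

The wrapper treats a command that cannot start the same way as one that exits with an error, so a missing binary does not pass silently.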

Infrastructure Backups

INFRASTRUCTURE BACKUP TOOLS

Veeam Backup & Replication

The enterprise standard for backing up virtualized and physical on-prem environments. Handles VM-level backups, bare metal recovery, application-aware backups (ensuring database consistency), and offsite replication. For organizations running VMware or Hyper-V, Veeam is the most common choice in SOC 2 engagements. Application-aware backups are critical because they ensure the database is in a consistent state before the snapshot, not mid-transaction.

Commvault

Enterprise-grade backup and recovery across heterogeneous environments (physical, virtual, cloud, SaaS). More complex to deploy and manage than Veeam, but covers a broader range of workload types and scales to large environments where multiple infrastructure generations coexist in the same rack.

Bacula

An open-source enterprise backup solution that supports complex backup topologies across mixed operating systems. A strong option for teams that want enterprise backup capabilities without commercial licensing costs, though it requires more configuration expertise than commercial alternatives.

rsync and rclone

File-level backup and synchronization for Linux environments. rsync handles local and remote file synchronization; rclone extends this to cloud storage targets (S3, Azure Blob, Google Cloud Storage) for offsite backup. For small teams, a scripted rsync job running on a cron schedule is a lightweight, auditable backup solution. The cron job, transfer logs, and offsite confirmation become audit evidence with minimal tooling overhead.

Backup Monitoring

Backup jobs fail silently more often than they succeed loudly. SNMP monitoring for backup storage health (disk capacity, RAID status, controller health) catches hardware issues before they cause backup failures. Most backup tools provide email notifications on job failure, but proactive monitoring of the backup infrastructure itself, through the same SIEM and monitoring stack used for security events, closes the gap between "the backup job ran" and "the backup infrastructure is healthy."
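A silent failure often shows up as a job that simply stopped running, which no failure email will report. A staleness check catches that case. The sketch below assumes the monitoring stack can supply the last successful completion time per job; the function name and the job names in the example are hypothetical.

```python
from datetime import datetime, timedelta


def stale_backup_jobs(last_success: dict[str, datetime],
                      max_age: timedelta,
                      now: datetime) -> list[str]:
    """Return the names of jobs whose last successful run is older than max_age."""
    return sorted(name for name, finished in last_success.items()
                  if now - finished > max_age)


# Example with invented job names and timestamps.
now = datetime(2026, 3, 25, 9, 0)
history = {
    "orders-db-full": datetime(2026, 3, 25, 2, 0),    # ran last night
    "config-repo-sync": datetime(2026, 3, 22, 2, 0),  # has not run for three days
}
print(stale_backup_jobs(history, timedelta(hours=26), now))
```

Setting max_age slightly above the backup interval (26 hours for a nightly job here, an assumed value) avoids false alarms from jobs that run a little late.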

Offsite and Air-Gapped Backups

SOC 2 A1.2 and the offsite expectation

SOC 2 A1.2 expects that recovery procedures account for scenarios where the primary site is unavailable. For on-prem environments, this means backups cannot live exclusively in the same facility as the production systems. A fire, flood, or extended power outage affecting the primary colo would eliminate both production and backup simultaneously if they are housed in the same facility.

Offsite Transfer

The simplest approach is replicating backups to a secondary location. This could be a second colocation facility, a cloud storage target (S3, Azure Blob), or a geographically separate office with a backup server. The key requirement is that the offsite location is independent enough that a disaster affecting the primary site does not also affect the backups.

For teams using rclone or native cloud SDKs, automated offsite transfer can run immediately after local backup completion. The transfer script becomes part of the backup pipeline, and the transfer confirmation becomes audit evidence.
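A transfer confirmation is more convincing as evidence when it shows that the offsite copy matches the local backup byte for byte. The sketch below compares SHA-256 checksums of the two files. It assumes the offsite copy is readable as a local path (for example, a mounted secondary backup server); the function names and the shape of the returned record are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large backup files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_confirmation(local: Path, offsite_copy: Path) -> dict:
    """Build a small evidence record stating whether the offsite copy matches."""
    local_hash = sha256_of(local)
    offsite_hash = sha256_of(offsite_copy)
    return {
        "file": local.name,
        "sha256": local_hash,
        "verified": local_hash == offsite_hash,
    }
```

Appending each record to the transfer log gives the auditor a per-backup confirmation instead of a general statement that transfers are configured.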

Air-Gapped Backups

Why air-gapped backups matter for SOC 2

Air-gapped backups are disconnected from the network after the backup is written, making them immune to ransomware that encrypts network-accessible storage. SOC 2 does not explicitly require air-gapped backups, but auditors increasingly ask about ransomware resilience. An air-gapped backup tier demonstrates mature recovery planning. If air-gapped backups are not implemented, the DR plan should explicitly document how the organization would recover from a ransomware event that encrypts network-accessible backup storage.

Practical implementations include rotating external drives stored in a secure offsite location, tape backups (still used in regulated industries for exactly this purpose), or cloud storage with object lock enabled, which prevents deletion or modification for a defined retention period.

Evidence: What the Auditor Wants to See

Backup and DR evidence for SOC 2 follows the same three-part evidence pattern used across all continuous controls, plus two additional evidence types specific to this domain.

The Three-Part Evidence Pattern

Configuration evidence: Backup job configurations showing what is backed up, the schedule, the retention period, and the offsite transfer configuration. Screenshots of the backup management console or the cron job definitions.
Execution history: Backup job logs showing successful completion over the observation period. For a Type 2 audit, the auditor wants to see a consistent pattern of successful backups across the full observation window (typically 6-12 months). Any failed backups should have documented remediation.
Representative sample: One complete backup report showing the data protected, the backup size, the duration, and the completion status.
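The execution history requirement can be checked ahead of the audit with a short script. The sketch below is illustrative only: the BackupRun record and its fields are assumptions about what a backup tool's job log could be reduced to, not the export format of any particular product.

```python
from dataclasses import dataclass


@dataclass
class BackupRun:
    job: str
    date: str              # ISO date of the run, e.g. "2026-01-15"
    succeeded: bool
    remediation: str = ""  # ticket or note reference for a failed run


def summarize_history(runs: list[BackupRun]) -> dict:
    """Count runs and list failed runs that have no documented remediation."""
    failures = [run for run in runs if not run.succeeded]
    return {
        "total_runs": len(runs),
        "failed_runs": len(failures),
        "unremediated": [f"{run.job} {run.date}"
                         for run in failures if not run.remediation],
    }
```

Any entry in the unremediated list is a failed backup the auditor is likely to ask about, so it is worth closing those out before the observation period ends.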

RESTORATION TESTING EVIDENCE

This is the evidence type that separates functional backup programs from paper compliance. The auditor wants to see that the organization has actually tested restoring from backup, not just that backups are running. A restoration test involves taking a backup, restoring it to a non-production environment, and verifying that the restored data is usable. The evidence package should include:

The date and scope of the restoration test
Which backup was restored (date, type, source)
The target environment (non-production, isolated)
The restoration duration (this validates the RTO)
Verification that the restored data is complete and functional
Any issues encountered and how they were resolved

The cadence that works for SOC 2: at least one restoration test per quarter during the observation period. For Type 1 audits, a single successful restoration test is sufficient to demonstrate the control is designed and implemented.
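The items in the restoration evidence package map naturally onto a structured record, which keeps every test documented the same way. The sketch below is one possible shape: the class name, field names, and example values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class RestorationTest:
    test_date: date
    scope: str
    backup_date: date
    backup_type: str          # e.g. "full" or "differential"
    backup_source: str
    target_environment: str   # non-production, isolated
    duration: timedelta
    data_verified: bool
    issues: list[str] = field(default_factory=list)

    def validates_rto(self, rto: timedelta) -> bool:
        """True only if the data was verified and the restore fit within the RTO."""
        return self.data_verified and self.duration <= rto


# Example record with invented values.
test = RestorationTest(
    test_date=date(2026, 3, 20),
    scope="Production orders database",
    backup_date=date(2026, 3, 19),
    backup_type="full",
    backup_source="nightly backup job",
    target_environment="isolated restore host",
    duration=timedelta(hours=8),
    data_verified=True,
    issues=["Restore host needed additional disk space"],
)
print(test.validates_rto(timedelta(hours=4)))
```

A record like this makes the comparison between measured restoration time and stated RTO explicit in every quarterly report.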

TABLETOP EXERCISE EVIDENCE

SOC 2 CC9.1 addresses risk mitigation for potential business disruptions, and auditors increasingly expect organizations to demonstrate they have thought through disaster scenarios beyond just running backup jobs. A tabletop exercise is a structured walkthrough of a disaster scenario with the team. It does not require actually failing over systems. The evidence package includes:

The scenario description
The participants
The decisions made during the walkthrough (who does what, in what order)
The gaps identified (missing contact information, unclear escalation procedures, untested recovery steps)
Action items with owners and due dates

One tabletop exercise per year is the minimum expectation. A 90-minute session with the IT team walking through a scenario like "the primary colo is offline for 48 hours, what do we do?" produces more audit value than a complex multi-day exercise that never gets scheduled.

The Colocation Factor

For on-prem environments hosted in colocation facilities, the backup and DR program needs to account for the shared responsibility boundary.

What the colo provider handles

Physical facility redundancy (power, cooling, fire suppression)
Physical security of the facility
Network connectivity redundancy to the building

What the company is responsible for

Server-level backups and recovery
Database backups and recovery
Application-level recovery procedures
Offsite backup storage (the colo's redundancy protects against equipment failure, not against scenarios where the entire facility is unavailable)
DR planning for scenarios beyond what the colo's SLA covers

This distinction is documented as part of the subservice organization documentation that covers the relationship between the company and the colocation provider. The backup and DR policy should explicitly state which recovery scenarios the colo's infrastructure handles and which require the company's own backup and restoration procedures. See the bare metal SOC 2 readiness guide for how this documentation fits into the broader audit package.

Process: The Operating Cadence

DAILY

Verify backup job completion by checking email notifications or the monitoring dashboard. Investigate any failures immediately. This takes 5-10 minutes on a normal day. Any failure that goes uninvestigated for more than 24 hours creates a gap in the evidence trail.

WEEKLY

Review backup storage capacity and health. Verify offsite transfers are completing successfully. Check SNMP alerts for any storage hardware warnings. Confirm that backup retention policies are being enforced and old backups are being pruned on schedule.

QUARTERLY

Run a restoration test. Restore the most recent backup to a non-production environment and verify data integrity. Document the results including duration, scope, and any issues encountered. Upload the evidence package to the GRC platform.

ANNUALLY

Conduct a BCDR tabletop exercise. Review and update the DR plan. Review RPO and RTO targets against actual tested capabilities (if the restoration test took longer than the stated RTO, either the RTO needs to be updated or the recovery process needs to be improved). Update the backup and DR section of the Security Program Manual.
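The quarterly restoration cadence is the easiest part of this schedule to let slip, and a missed quarter is hard to fix after the fact. The sketch below lists which quarters in an observation window have no restoration test on record. It assumes calendar quarters, which may not match the way a given audit firm counts them, and the function names are hypothetical.

```python
from datetime import date


def quarter_of(d: date) -> tuple[int, int]:
    """Return (year, quarter) for a date, using calendar quarters."""
    return d.year, (d.month - 1) // 3 + 1


def quarters_in_window(start: date, end: date) -> list[tuple[int, int]]:
    """List every calendar quarter touched by the observation window."""
    year, quarter = quarter_of(start)
    last = quarter_of(end)
    quarters = []
    while (year, quarter) <= last:
        quarters.append((year, quarter))
        year, quarter = (year, quarter + 1) if quarter < 4 else (year + 1, 1)
    return quarters


def quarters_missing_tests(start: date, end: date,
                           test_dates: list[date]) -> list[tuple[int, int]]:
    """Quarters in the window with no restoration test on record."""
    tested = {quarter_of(d) for d in test_dates}
    return [q for q in quarters_in_window(start, end) if q not in tested]


# Example with invented dates: no test was run between July and September.
tests = [date(2026, 2, 10), date(2026, 5, 3), date(2026, 11, 20)]
print(quarters_missing_tests(date(2026, 1, 1), date(2026, 12, 31), tests))
```

Running a check like this at the start of each quarter leaves time to schedule the test before the gap appears in the evidence.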

People: Who Owns What

In small on-prem teams, backup and DR ownership typically sits with the system administrator or infrastructure lead.

Ownership Model

Backup owner: Responsible for daily verification, job maintenance, and quarterly restoration testing. Typically the system administrator or infrastructure lead. Every backup job should alert this person directly on failure.
DR plan owner: Responsible for the overall disaster recovery plan, annual tabletop exercises, and colo provider coordination. In small teams this is often the same person as the backup owner, but the DR plan should also have executive sign-off from the CTO or equivalent.
Backup coverage: A designated person who can manage backup operations during absences. Backup job monitoring should alert multiple people, not just the primary owner. A two-week vacation should not create a two-week gap in the evidence trail.

Further Reading: On-Prem SOC 2 Cluster

SOC 2 Readiness for Bare Metal SaaS: the overview post for this cluster
SOC 2 Vulnerability Scanning for On-Prem: the tiered scanning model used across this cluster
SOC 2 Logging and SIEM for Bare Metal: the monitoring layer that catches backup failures and storage health issues
SOC 2 Network Security for On-Prem: network segmentation and firewall controls
SOC 2 Security Documentation: the three-part evidence pattern applied across all controls
SOC 2 Type 1 vs. Type 2: understanding what the auditor expects at each stage

Frequently Asked Questions

Does SOC 2 require offsite backups?

SOC 2 A1.2 requires that recovery procedures enable the organization to resume operations. While it does not explicitly mandate offsite storage, auditors expect that a site-level disaster (fire, flood, extended power outage) would not result in total data loss. Offsite backups, whether to a secondary facility, cloud storage, or rotating physical media, satisfy this expectation. Keeping all backups in the same facility as production is a gap that auditors will flag.

How often should restoration testing be performed for SOC 2?

At least quarterly during the observation period for a Type 2 audit. For Type 1, a single successful restoration test demonstrates the control is designed and implemented. Each test should restore to a non-production environment and verify data integrity. The restoration test report, including duration, scope, and results, becomes a key evidence artifact because it is the only way to validate that the RTO is achievable.

What does a BCDR tabletop exercise look like for a small team?

A 90-minute meeting where the team walks through a specific disaster scenario: primary site offline for 48 hours, ransomware encrypts production databases, or a critical server fails with no hot standby. The team discusses who does what, in what order, with what tools, and identifies gaps in the plan. The output is a documented summary with decisions, gaps, and action items. One exercise per year is the minimum expectation. The value is not in the complexity of the exercise but in surfacing gaps before they become real problems during an actual incident.

Do we need air-gapped backups for SOC 2?

SOC 2 does not explicitly require air-gapped backups. However, auditors increasingly ask about ransomware resilience, and an air-gapped backup tier demonstrates mature recovery planning. Practical options include rotating external drives stored offsite, tape backups, or cloud storage with object lock enabled. If air-gapped backups are not implemented, the DR plan should document how the organization would recover from a ransomware event that encrypts network-accessible backup storage.

What is the colocation provider's responsibility for backup and DR?

The colo provides physical infrastructure redundancy (power, cooling, network), but server-level backups, database recovery, application recovery, and offsite storage are entirely the company's responsibility. The colo's SOC 2 report covers physical security and facility availability. It does not cover data backup, restoration, or business continuity for the applications running on the hosted infrastructure. Every on-prem SOC 2 engagement needs a documented shared responsibility model that makes this boundary explicit.
