Cloud disaster recovery is a region failover. Click a button, spin up infrastructure in another region, and the platform handles replication, failover, and recovery. The DR plan is largely a configuration decision, and the evidence is a screenshot of the replication settings.
On-premise disaster recovery is a fundamentally different discipline. It involves physical media, offsite transfers, colocation provider coordination, manual restoration procedures, and a set of dependencies that cannot be abstracted away behind an API. When the production database lives on a physical server in a rack, the backup strategy involves questions that cloud environments never face: where does the backup physically go? How does it get there? How long would it take to restore from scratch if the primary site were unavailable?
For companies preparing for SOC 2 on bare metal infrastructure, backup and disaster recovery is one of the domains where the gap between cloud-first guides and on-prem reality is widest. SOC 2 Availability criteria (A1.2 and A1.3) require organizations to maintain recovery procedures and test them. The criteria do not care whether the infrastructure is cloud or physical, but the evidence looks materially different.
This post covers how to build a backup and disaster recovery program for on-prem infrastructure that satisfies SOC 2 Availability criteria, produces audit-ready evidence, and gives the team confidence that recovery actually works when it matters.
Every backup and DR program starts with two numbers that define the recovery expectations.
Recovery Point Objective (RPO)
The maximum acceptable data loss measured in time. An RPO of 4 hours means the organization accepts losing up to 4 hours of data in a disaster scenario. This directly determines backup frequency: an RPO of 4 hours requires backups at least every 4 hours.
Recovery Time Objective (RTO)
The maximum acceptable downtime. An RTO of 8 hours means the organization commits to having the system operational within 8 hours of a disaster declaration. RTO governs how you design standby environments, staffing during recovery, and coordination with the colocation provider.
The mistake that creates audit friction
Setting RPO and RTO targets based on what sounds impressive rather than what is achievable. An RPO of 1 hour sounds reasonable, but if nightly full backups are the only backup mechanism and transaction log shipping is not configured, the actual RPO is 24 hours. An RTO of 4 hours sounds professional, but if restoring a full database backup to a cold standby server takes 8 hours including OS setup, application deployment, data restoration, and verification, the policy is fiction. The auditor does not penalize realistic targets. They penalize targets that the evidence shows the organization cannot meet.
The approach that works: define RPO and RTO based on actual tested recovery capabilities, not aspirational targets. Run a restoration test, measure how long it takes, and set the RTO with enough margin to account for real-world conditions, including staff availability, troubleshooting time, and coordination with the colo provider.
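The arithmetic behind that approach is simple enough to sketch. The snippet below is a minimal illustration, not a prescribed method: the function names and the 50% default margin are assumptions chosen for the example, and the right margin depends on your own staffing and colo coordination.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo_target: timedelta) -> bool:
    """Worst-case data loss is the gap between two backups, so the gap must fit the RPO."""
    return backup_interval <= rpo_target


def rto_with_margin(measured_restore: timedelta, margin: float = 0.5) -> timedelta:
    """Pad a measured restore time by a fractional margin (0.5 means +50%).

    The 50% default is an illustrative assumption, not a SOC 2 requirement.
    """
    return measured_restore * (1 + margin)


# Nightly full backups only, against a stated 1-hour RPO: the policy does not hold.
nightly_only_ok = meets_rpo(timedelta(hours=24), timedelta(hours=1))

# A restoration test that took 8 hours suggests an RTO of 12 hours with a 50% margin.
suggested_rto = rto_with_margin(timedelta(hours=8))
```

Running the numbers this way before writing the policy keeps the stated targets tied to what the last restoration test actually showed.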
Not every system in the environment needs the same backup strategy. The tiered asset classification used across the on-prem SOC 2 cluster applies here, but the tiers map to different backup approaches rather than different scanning cadences.
Tier 1: Critical Data
Production databases, application data, customer data, and any data whose loss would prevent the business from operating. These get the most aggressive backup schedule and the most thorough restoration testing. If you back up nothing else, you back up Tier 1.
Tier 2: System Configuration
Server configurations, application deployment configurations, firewall rules, network configurations, and infrastructure-as-code repositories. These are recoverable from configuration management or documentation, but having backups accelerates recovery time significantly. Missing Tier 2 backups means rebuilding from documentation under pressure.
Tier 3: Operational Data
Logs, monitoring data, historical reports, and other data that supports operations but is not critical for business continuity. Backup cadence can be less aggressive, and loss is inconvenient rather than catastrophic. Tier 3 loss does not trigger the DR plan.
Database backups are the highest-priority component of any on-prem backup program. The approach depends on the database platform:
| Platform | Backup Approach | SOC 2 Pattern |
| --- | --- | --- |
| SQL Server | Native BACKUP DATABASE with full, differential, and transaction log types. SQL Server Agent handles scheduling and failure alerts. | Nightly full, hourly differential during business hours, transaction logs every 15-30 min for critical databases. |
| PostgreSQL | pg_dump for logical backups (portable fallback); pg_basebackup for physical backups with WAL archiving for point-in-time recovery. | WAL archiving provides near-continuous backup; pg_dump provides portable restore fallback to any PostgreSQL instance. |
| MySQL / MariaDB | mysqldump for logical backups; Percona XtraBackup for hot physical backups that do not lock tables during the backup process. | Percona XtraBackup is the production standard because it avoids locking issues on large databases during backup windows. |
All three platforms support automated scheduling through cron jobs, built-in schedulers, or backup management tools. Every backup job should alert on failure, not just on success.
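One way to make "alert on failure" concrete is a thin wrapper around whatever backup command the scheduler runs. This is a sketch under assumptions: `run_backup` and the `alert` callback are hypothetical names, and the callback stands in for whatever paging or email hook the team already uses.

```python
import subprocess
from typing import Callable, Sequence


def run_backup(cmd: Sequence[str], alert: Callable[[str], None]) -> bool:
    """Run one backup command; call `alert` if it cannot start or exits non-zero."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True)
    except OSError as exc:
        # Covers a missing or non-executable backup binary.
        alert(f"backup could not start: {exc}")
        return False
    if result.returncode != 0:
        alert(f"backup failed (exit {result.returncode}): {result.stderr.strip()}")
        return False
    return True
```

A cron-driven script might call it as `run_backup(["pg_dump", "--format=custom", "--file=/backups/app.dump", "appdb"], alert=send_page)`, where the file path, database name, and `send_page` are placeholders. The return value lets the caller skip the offsite transfer when the local backup did not complete.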
Veeam is the enterprise standard for backing up virtualized and physical on-prem environments. It handles VM-level backups, bare metal recovery, application-aware backups (ensuring database consistency), and offsite replication. For organizations running VMware or Hyper-V, it is the most common choice in SOC 2 engagements. Application-aware backups are critical because they ensure the database is in a consistent state before the snapshot, not mid-transaction.
Enterprise-grade backup and recovery across heterogeneous environments (physical, virtual, cloud, SaaS). More complex to deploy and manage than Veeam, but covers a broader range of workload types and scales to large environments where multiple infrastructure generations coexist in the same rack.
An open-source enterprise backup solution that supports complex backup topologies across mixed operating systems. A strong option for teams that want enterprise backup capabilities without commercial licensing costs, though it requires more configuration expertise than commercial alternatives.
rsync and rclone
File-level backup and synchronization for Linux environments. rsync handles local and remote file synchronization; rclone extends this to cloud storage targets (S3, Azure Blob, Google Cloud Storage) for offsite backup. For small teams, a scripted rsync job running on a cron schedule is a lightweight, auditable backup solution. The cron job, transfer logs, and offsite confirmation become audit evidence with minimal tooling overhead.
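As a rough sketch of the scripted rsync approach, the helper below builds the rsync invocation and the matching crontab line, so the same definition can be reviewed, version-controlled, and shown to an auditor. The function names, paths, host, and schedule are illustrative assumptions.

```python
import shlex
from typing import List


def rsync_command(source_dir: str, destination: str, log_file: str) -> List[str]:
    """Build an rsync invocation that preserves metadata and writes a transfer log."""
    # A trailing slash on the source copies the directory's contents,
    # not the directory itself.
    src = source_dir.rstrip("/") + "/"
    return ["rsync", "--archive", f"--log-file={log_file}", src, destination]


def crontab_line(schedule: str, command: List[str]) -> str:
    """Render a crontab entry: five schedule fields followed by the quoted command."""
    return f"{schedule} {shlex.join(command)}"


cmd = rsync_command(
    "/srv/data",                                  # hypothetical source directory
    "backup@offsite.example.com:/backups/data",   # hypothetical offsite host
    "/var/log/backup-rsync.log",
)
entry = crontab_line("0 2 * * *", cmd)  # 02:00 every day
```

The log file named in `--log-file` is the transfer log referred to above; retaining it alongside the crontab entry covers both the schedule and the proof that each run happened.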
Backup jobs fail silently more often than they succeed loudly. SNMP monitoring for backup storage health (disk capacity, RAID status, controller health) catches hardware issues before they cause backup failures. Most backup tools provide email notifications on job failure, but proactive monitoring of the backup infrastructure itself, through the same SIEM and monitoring stack used for security events, closes the gap between "the backup job ran" and "the backup infrastructure is healthy."
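Where SNMP tooling is not yet in place, two host-level checks cover part of the same ground: how full the backup volume is and how old the newest backup file is. The sketch below assumes backups land as files in a single directory; the function name and the 80% and 26-hour thresholds are illustrative, not recommended values.

```python
import shutil
import time
from pathlib import Path
from typing import List, Optional


def storage_warnings(
    backup_dir: str,
    max_used_fraction: float = 0.8,
    max_age_hours: float = 26.0,
    now: Optional[float] = None,
) -> List[str]:
    """Return human-readable warnings about backup volume capacity and freshness."""
    now = time.time() if now is None else now
    warnings: List[str] = []

    usage = shutil.disk_usage(backup_dir)
    used_fraction = usage.used / usage.total
    if used_fraction > max_used_fraction:
        warnings.append(f"backup volume is {used_fraction:.0%} full")

    files = [p for p in Path(backup_dir).iterdir() if p.is_file()]
    if not files:
        warnings.append("no backup files found")
    else:
        newest = max(p.stat().st_mtime for p in files)
        age_hours = (now - newest) / 3600
        if age_hours > max_age_hours:
            warnings.append(f"newest backup is stale ({age_hours:.0f} hours old)")
    return warnings
```

An empty list means both checks passed. Feeding the returned warnings into the existing monitoring stack keeps backup health in the same place as other alerts.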
SOC 2 A1.2 and the offsite expectation
SOC 2 A1.2 expects that recovery procedures account for scenarios where the primary site is unavailable. For on-prem environments, this means backups cannot live exclusively in the same facility as the production systems. A fire, flood, or extended power outage affecting the primary colo would eliminate both production and backup simultaneously if they are housed in the same facility.
The simplest approach is replicating backups to a secondary location. This could be a second colocation facility, a cloud storage target (S3, Azure Blob), or a geographically separate office with a backup server. The key requirement is that the offsite location is independent enough that a disaster affecting the primary site does not also affect the backups.
For teams using rclone or native cloud SDKs, automated offsite transfer can run immediately after local backup completion. The transfer script becomes part of the backup pipeline, and the transfer confirmation becomes audit evidence.
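A transfer confirmation is more convincing when it records a checksum rather than just "job finished." The sketch below is one possible shape for that record, assuming the remote copy's SHA-256 has been obtained separately (from the storage provider or the transfer tool). `transfer_record` and its field names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def transfer_record(local: Path, remote_target: str, remote_sha256: str) -> dict:
    """Build an evidence record comparing the local backup's hash with the remote copy's."""
    local_hash = sha256_of(local)
    return {
        "file": local.name,
        "size_bytes": local.stat().st_size,
        "sha256": local_hash,
        "remote_target": remote_target,
        "verified": local_hash == remote_sha256,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
```

Serializing each record to JSON and keeping it with the transfer logs gives the auditor a per-backup trail: what was sent, where, when, and whether the copy matched.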
Why air-gapped backups matter for SOC 2
Air-gapped backups are disconnected from the network after the backup is written, making them immune to ransomware that encrypts network-accessible storage. SOC 2 does not explicitly require air-gapped backups, but auditors increasingly ask about ransomware resilience. An air-gapped backup tier demonstrates mature recovery planning. If air-gapped backups are not implemented, the DR plan should explicitly document how the organization would recover from a ransomware event that encrypts network-accessible backup storage.
Practical implementations include rotating external drives stored in a secure offsite location, tape backups (still used in regulated industries for exactly this purpose), or cloud storage with object lock enabled, which prevents deletion or modification for a defined retention period.
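For the object lock option on S3, the retention settings travel with each upload. The helper below only assembles the request parameters and has not been exercised against a live bucket; it assumes the bucket was created with Object Lock enabled, and the function name, bucket, key, and 30-day retention are placeholders.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def object_lock_params(
    bucket: str, key: str, retain_days: int, now: Optional[datetime] = None
) -> dict:
    """Build S3 put_object arguments that lock the object until a retention date."""
    now = now or datetime.now(timezone.utc)
    return {
        "Bucket": bucket,
        "Key": key,
        # COMPLIANCE mode: the retention date cannot be shortened or removed.
        "ObjectLockMode": "COMPLIANCE",
        "ObjectLockRetainUntilDate": now + timedelta(days=retain_days),
    }


params = object_lock_params("example-backup-bucket", "db/app.dump", retain_days=30)

# With boto3 installed and credentials configured, the upload would look like:
#   import boto3
#   with open("/backups/app.dump", "rb") as fh:
#       boto3.client("s3").put_object(Body=fh, **params)
```

Keeping the retention period in one function makes it easy to show that the value in code matches the retention period stated in the backup policy.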
Backup and DR evidence for SOC 2 follows the same three-part evidence pattern used across all continuous controls, plus two additional evidence types specific to this domain.
The Three-Part Evidence Pattern
Restoration testing is the evidence type that separates functional backup programs from paper compliance. The auditor wants to see that the organization has actually tested restoring from backup, not just that backups are running. A restoration test involves taking a backup, restoring it to a non-production environment, and verifying that the restored data is usable. The evidence package should include the scope of the test, how long the restoration took, the results of the integrity verification, and any issues encountered.
The cadence that works for SOC 2: at least one restoration test per quarter during the Type 2 observation period. For Type 1 audits, a single successful restoration test is sufficient to demonstrate the control is designed and operational.
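A restoration test report does not need special tooling. The sketch below captures the fields discussed in this post (scope, duration, integrity result, issues) and compares the measured duration against the stated RTO. The class and field names are assumptions made for illustration, and the sample values are invented.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class RestorationTest:
    scope: str
    started_at: datetime
    finished_at: datetime
    integrity_verified: bool
    issues: List[str] = field(default_factory=list)

    def report(self, rto: timedelta) -> str:
        """Render the test as JSON, including whether the restore fit inside the RTO."""
        duration = self.finished_at - self.started_at
        return json.dumps(
            {
                "scope": self.scope,
                "started_at": self.started_at.isoformat(),
                "finished_at": self.finished_at.isoformat(),
                "duration_minutes": round(duration.total_seconds() / 60),
                "integrity_verified": self.integrity_verified,
                "issues": self.issues,
                "rto_minutes": round(rto.total_seconds() / 60),
                "within_rto": duration <= rto,
            },
            indent=2,
        )


test = RestorationTest(
    scope="Production database restored to staging host",  # illustrative values
    started_at=datetime(2024, 1, 10, 9, 0),
    finished_at=datetime(2024, 1, 10, 17, 0),
    integrity_verified=True,
    issues=["OS setup on the standby server was slower than expected"],
)
print(test.report(rto=timedelta(hours=4)))
```

A report where `within_rto` is false is still useful evidence, provided it leads to either a faster recovery path or an updated RTO in the policy.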
SOC 2 CC9.1 addresses risk mitigation for potential business disruptions, and auditors increasingly expect organizations to demonstrate they have thought through disaster scenarios beyond just running backup jobs. A tabletop exercise is a structured walkthrough of a disaster scenario with the team. It does not require actually failing over systems. The evidence package includes a documented summary of the scenario, the decisions made, the gaps identified, and the resulting action items.
One tabletop exercise per year is the minimum expectation. A 90-minute session with the IT team walking through a scenario like "the primary colo is offline for 48 hours, what do we do?" produces more audit value than a complex multi-day exercise that never gets scheduled.
For on-prem environments hosted in colocation facilities, the backup and DR program needs to account for the shared responsibility boundary.
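What the colo provider handles
Physical infrastructure redundancy (power, cooling, network), along with physical security and facility availability.
What the company is responsible for
Server-level backups, database recovery, application recovery, and offsite storage.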
This distinction is documented as part of the subservice organization documentation that covers the relationship between the company and the colocation provider. The backup and DR policy should explicitly state which recovery scenarios the colo's infrastructure handles and which require the company's own backup and restoration procedures. See the bare metal SOC 2 readiness guide for how this documentation fits into the broader audit package.
Verify backup job completion by checking email notifications or the monitoring dashboard. Investigate any failures immediately. This takes 5-10 minutes on a normal day. Any failure that goes uninvestigated for more than 24 hours creates a gap in the evidence trail.
Review backup storage capacity and health. Verify offsite transfers are completing successfully. Check SNMP alerts for any storage hardware warnings. Confirm that backup retention policies are being enforced and old backups are being pruned on schedule.
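Confirming that pruning follows the retention policy is easier when the rule is written down as code. A minimal sketch, assuming each backup is identified by a name and a completion timestamp. The function name and the `keep_minimum` safeguard are illustrative choices rather than a standard.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def expired_backups(
    backups: Dict[str, datetime],
    retention: timedelta,
    now: datetime,
    keep_minimum: int = 1,
) -> List[str]:
    """Return backup names older than the retention window.

    The newest `keep_minimum` backups are never returned, so a stalled
    backup job cannot lead to the last good copy being pruned.
    """
    newest_first = sorted(backups, key=backups.get, reverse=True)
    protected = set(newest_first[:keep_minimum])
    cutoff = now - retention
    return sorted(
        name for name, taken_at in backups.items()
        if taken_at < cutoff and name not in protected
    )
```

Comparing this list against what is actually on the backup volume shows whether old backups are being pruned on schedule or quietly accumulating.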
Run a restoration test. Restore the most recent backup to a non-production environment and verify data integrity. Document the results including duration, scope, and any issues encountered. Upload the evidence package to the GRC platform.
Conduct a BCDR tabletop exercise. Review and update the DR plan. Review RPO and RTO targets against actual tested capabilities (if the restoration test took longer than the stated RTO, the RTO needs to be updated). Update the backup and DR section of the Security Program Manual.
In small on-prem teams, backup and DR ownership typically sits with the system administrator or infrastructure lead.
Ownership Model
SOC 2 A1.2 requires that recovery procedures enable the organization to resume operations. While it does not explicitly mandate offsite storage, auditors expect that a site-level disaster (fire, flood, extended power outage) would not result in total data loss. Offsite backups, whether to a secondary facility, cloud storage, or rotating physical media, satisfy this expectation. Keeping all backups in the same facility as production is a gap that auditors will flag.
At least quarterly during the observation period for a Type 2 audit. For Type 1, a single successful restoration test demonstrates the control is designed and operational. Each test should restore to a non-production environment and verify data integrity. The restoration test report, including duration, scope, and results, becomes a key evidence artifact because it is the only way to validate that the RTO is achievable.
A 90-minute meeting where the team walks through a specific disaster scenario: primary site offline for 48 hours, ransomware encrypts production databases, or a critical server fails with no hot standby. The team discusses who does what, in what order, with what tools, and identifies gaps in the plan. The output is a documented summary with decisions, gaps, and action items. One exercise per year is the minimum expectation. The value is not in the complexity of the exercise but in surfacing gaps before they become real problems during an actual incident.
SOC 2 does not explicitly require air-gapped backups. However, auditors increasingly ask about ransomware resilience, and an air-gapped backup tier demonstrates mature recovery planning. Practical options include rotating external drives stored offsite, tape backups, or cloud storage with object lock enabled. If air-gapped backups are not implemented, the DR plan should document how the organization would recover from a ransomware event that encrypts network-accessible backup storage.
The colo provides physical infrastructure redundancy (power, cooling, network), but server-level backups, database recovery, application recovery, and offsite storage are entirely the company's responsibility. The colo's SOC 2 report covers physical security and facility availability. It does not cover data backup, restoration, or business continuity for the applications running on the hosted infrastructure. Every on-prem SOC 2 engagement needs a documented shared responsibility model that makes this boundary explicit.