Best Practices

Follow these recommendations to run FreeState securely, reliably, and efficiently in production environments.

Security Best Practices

API Key Management

Key Generation and Storage

  • Use descriptive names: "prod-terraform-ci", "dev-team-access"
  • Store securely: Use secret management systems (AWS Secrets Manager, HashiCorp Vault)
  • Never commit keys: Add files that hold keys (for example .env or *.tfvars) to .gitignore and pass keys through environment variables
  • Limit scope: Use workspace-scoped keys when possible

# Good: Environment variables
export TF_HTTP_USERNAME="workspace-id"
export TF_HTTP_PASSWORD="$FREESTATE_API_KEY"

# Bad: Hardcoded in files
terraform {
  backend "http" {
    username = "workspace-123"
    password = "fs_key_abc123..."  # Never do this!
  }
}
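A CI step can fail fast when the credentials are not present in the environment, rather than letting someone fall back to a hardcoded value. The following is a minimal sketch; require_backend_credentials is a hypothetical helper name, not a FreeState or Terraform command.

```shell
# Sketch: refuse to continue unless both backend credentials are set.
# require_backend_credentials is a hypothetical helper name.
require_backend_credentials() (
  missing=0
  for var in TF_HTTP_USERNAME TF_HTTP_PASSWORD; do
    # Read the value of the variable whose name is stored in $var
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing required environment variable: $var" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
)
```

Typical use is `require_backend_credentials && terraform init`, so a missing secret stops the pipeline before Terraform contacts the backend.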

Key Rotation

  • Regular rotation: Rotate keys every 90 days
  • Emergency rotation: Immediate rotation if compromised
  • Overlap period: Brief overlap when updating systems
  • Audit old keys: Remove unused keys promptly
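The 90-day rule can be checked mechanically if each key's creation time is recorded somewhere. This sketch assumes the creation time is available as Unix epoch seconds; how it is obtained is left open, and key_age_days and key_needs_rotation are hypothetical helper names.

```shell
# Sketch: flag keys that have reached the rotation window.
# All timestamps are Unix epoch seconds.
key_age_days() {
  # $1 = key creation time, $2 = current time
  echo $(( ($2 - $1) / 86400 ))
}

key_needs_rotation() {
  # $3 = maximum age in days (defaults to the 90-day window)
  [ "$(key_age_days "$1" "$2")" -ge "${3:-90}" ]
}
```

For example, `key_needs_rotation "$created_at" "$(date +%s)" && echo "rotate this key"`, where `created_at` is a placeholder for the recorded creation time.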

Access Control

Principle of Least Privilege

Role         Permissions                         Use Case
Read-Only    State viewing, workspace metadata   Monitoring, audit, read-only CI
Contributor  State read/write, lock management   Developers, CI/CD pipelines
Admin        Full workspace management           DevOps engineers, team leads

Multi-Factor Authentication

  • Enable MFA: Required for all team members
  • Backup codes: Store securely for account recovery
  • Regular audits: Review MFA status monthly

Network Security

IP Whitelisting

# Configure IP restrictions for production workspaces
allowed_ips = [
  "203.0.113.0/24",    # Office network
  "192.0.2.1/32",      # CI/CD server
  "198.51.100.0/24"    # VPN range
]
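Before adding a runner or office address to the list, it can help to confirm which range actually covers it. The sketch below tests IPv4 membership in a CIDR block using shell arithmetic; ip_to_int and ip_in_cidr are hypothetical helper names, and the second argument must be in address/prefix form.

```shell
# Sketch: IPv4-only CIDR membership test.
ip_to_int() (
  # Split the dotted quad on "." and pack the four octets into one integer
  IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

ip_in_cidr() (
  # $1 = address to test, $2 = range in address/prefix form
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits=${2#*/}
  # Netmask with the top $bits bits set
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
)
```

For example, `ip_in_cidr 203.0.113.42 203.0.113.0/24` succeeds, while `ip_in_cidr 192.0.2.2 192.0.2.1/32` fails.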

VPN and Private Networks

  • Use VPNs: Route traffic through secure connections
  • Private endpoints: Consider private connectivity options
  • Monitor access: Log and alert on unusual access patterns

Performance Best Practices

State File Optimization

State Size Management

  • Modular architecture: Break large configurations into modules
  • Separate state files: Use different workspaces for logical boundaries
  • Resource limits: Monitor state file size and resource count

# Good: Modular approach
# Infrastructure layer
terraform workspace select infra-prod
terraform apply

# Application layer  
terraform workspace select app-prod
terraform apply

# Bad: Everything in one state
terraform apply  # 500+ resources in one state file
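Resource count is easy to monitor because `terraform state list` prints one resource address per line. The sketch below counts those lines and warns past a limit; check_resource_count is a hypothetical helper, and the default of 500 only mirrors the example above, it is not a FreeState limit.

```shell
# Sketch: warn when a state tracks more resources than a chosen limit.
# Reads resource addresses from stdin, one per line.
check_resource_count() {
  limit=${1:-500}
  # Count non-empty lines; grep exits 1 when the count is zero, hence "|| true"
  count=$(grep -c . || true)
  if [ "$count" -gt "$limit" ]; then
    echo "WARN: $count resources exceeds the limit of $limit; consider splitting this state"
    return 1
  fi
  echo "OK: $count resources"
}
```

Typical use is `terraform state list | check_resource_count 500` as a scheduled job or a pipeline step.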

State Hygiene

  • Remove unused resources: Clean up regularly
  • Import existing resources: Don't recreate what exists
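  • Use data sources: Reference external resources

# Stop tracking a resource Terraform should no longer manage
# (state rm only edits the state; it does not destroy the resource)
terraform state rm 'aws_instance.old_server'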

# Import existing resources instead of recreating
terraform import aws_instance.existing i-1234567890abcdef0

# Use data sources for external references
data "aws_vpc" "existing" {
  id = "vpc-12345678"
}

Lock Management

Minimize Lock Duration

  • Small changes: Make incremental updates
  • Pre-validation: Use terraform plan to catch issues early
  • Automated releases: Configure auto-unlock timeouts
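One way to combine pre-validation with bounded lock waits is to validate, save a plan, and then apply exactly that plan. The sketch below uses Terraform's `-lock-timeout` and `-out` flags; plan_then_apply and the LOCK_TIMEOUT variable are hypothetical names, and the 2-minute default is an arbitrary assumption.

```shell
# Sketch: validate and plan first, then apply the saved plan.
plan_then_apply() {
  lock_timeout=${LOCK_TIMEOUT:-2m}
  # Catch configuration errors before any lock is taken
  terraform validate || return 1
  # Wait up to $lock_timeout for the lock instead of failing immediately
  terraform plan -lock-timeout="$lock_timeout" -out=tfplan || return 1
  # Apply the reviewed plan as-is; no second planning pass is needed
  terraform apply -lock-timeout="$lock_timeout" tfplan
}
```

Applying a saved plan keeps the apply step short because the expensive diffing work has already happened during the plan.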

Coordinate Team Access

  • Communication: Announce large changes in advance
  • Scheduled windows: Use maintenance windows for major updates
  • Monitoring: Set up alerts for long-running locks

Spot Instance Optimization

Capacity Strategy

  • Mixed capacity: Use roughly 70% Spot + 30% On-Demand as a starting point that balances cost and availability
  • Multi-AZ deployment: Distribute tasks across availability zones
  • Right-sizing: Choose instance types with good Spot availability
  • Base capacity: Ensure at least 1 On-Demand task for service stability

# Example ECS capacity provider strategy (70/30 weights, 1 On-Demand base task)
{
  "capacityProviderStrategy": [
    {
      "capacityProvider": "FARGATE_SPOT",
      "weight": 70,
      "base": 0
    },
    {
      "capacityProvider": "FARGATE", 
      "weight": 30,
      "base": 1
    }
  ]
}
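To see what the strategy above means for a given task count, the arithmetic can be sketched as follows: the base task is placed on On-Demand first, and the remaining tasks are divided by weight. This is an approximation under stated assumptions. split_tasks is a hypothetical helper, and the rounding here simply floors the Spot share, which may differ from how ECS rounds in practice.

```shell
# Sketch: approximate task split for base=1 on FARGATE and 70/30 weights.
split_tasks() {
  total=$1
  base_on_demand=1     # "base": 1 on FARGATE
  spot_weight=70       # "weight": 70 on FARGATE_SPOT
  on_demand_weight=30  # "weight": 30 on FARGATE
  remaining=$(( total - base_on_demand ))
  if [ "$remaining" -lt 0 ]; then
    remaining=0
  fi
  # Assumption: floor the Spot share and give the rest to On-Demand
  spot=$(( remaining * spot_weight / (spot_weight + on_demand_weight) ))
  on_demand=$(( total - spot ))
  echo "spot=$spot on_demand=$on_demand"
}
```

With 10 desired tasks this prints `spot=6 on_demand=4`, which shows that the base task pulls the effective Spot share below 70% for small services.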

Application Resilience

  • Stateless design: Store state externally (database, S3, cache)
  • Graceful shutdown: Handle SIGTERM within 120 seconds
  • Health checks: Implement comprehensive readiness/liveness probes
  • Circuit breakers: Add fallback mechanisms for service dependencies
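Graceful shutdown usually comes down to a signal handler that sets a flag and a work loop that checks it. The sketch below shows that pattern for a shell-based worker; process_one_item is a hypothetical placeholder for the application's unit of work.

```shell
# Sketch: a worker loop that stops cleanly on SIGTERM.
# process_one_item is a hypothetical function supplied by the application.
shutting_down=0

on_sigterm() {
  echo "SIGTERM received, finishing current item" >&2
  shutting_down=1
}
trap on_sigterm TERM

run_worker() {
  while [ "$shutting_down" -eq 0 ]; do
    # Keep each item well under the 120-second window so the loop can exit in time
    process_one_item
  done
  echo "worker stopped cleanly"
}
```

The shell runs the trap only after the current foreground command returns, so the length of a single work item bounds how quickly the worker can react.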

Monitoring and Alerting

  • Spot interruption tracking: Monitor interruption rates and patterns
  • Placement failure alerts: Get notified when Spot capacity is unavailable
  • Service availability metrics: Track healthy task percentage
  • Cost optimization reports: Measure savings vs. operational overhead

Learn More: See our comprehensive Spot Instance Support guide for detailed configuration examples and troubleshooting.

CI/CD Best Practices

Pipeline Design

Environment Promotion

# GitLab CI example with proper promotion
stages:
  - validate
  - plan-dev
  - apply-dev
  - plan-staging
  - apply-staging
  - plan-prod
  - apply-prod

dev-plan:
  stage: plan-dev
  script:
    - terraform workspace select dev
    - terraform plan
  only:
    - develop

dev-apply:
  stage: apply-dev
  script:
    - terraform workspace select dev
    - terraform apply -auto-approve
  only:
    - develop

prod-plan:
  stage: plan-prod
  script:
    - terraform workspace select prod
    - terraform plan
  only:
    - main
    
prod-apply:
  stage: apply-prod
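  script:
    - terraform workspace select prod
    # The interactive confirmation prompt cannot be answered on a CI runner;
    # the manual job trigger below is the approval gate
    - terraform apply -auto-approve
  when: manual  # Require manual approval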
  only:
    - main

Error Handling

# GitHub Actions with proper error handling
- name: Terraform Apply
  id: apply
  continue-on-error: true
  run: terraform apply -auto-approve
  
- name: Handle Failure
  if: steps.apply.outcome == 'failure'
  run: |
    echo "Terraform apply failed"
    terraform show
    exit 1
    
- name: Notify Success
  if: steps.apply.outcome == 'success'
  run: |
    echo "Deployment successful"
    # Send notification to Slack/Teams

Security in CI/CD

Secret Management

  • Use CI/CD secret stores: GitHub Secrets, GitLab Variables
  • Scope secrets appropriately: Environment-specific secrets
  • Rotate regularly: Automated secret rotation
  • Audit access: Log secret usage

Branch Protection

  • Protect main branches: Require reviews for production
  • Status checks: Require successful builds
  • Signed commits: Verify commit authenticity

Monitoring and Alerting

Key Metrics

Performance Metrics

  • Operation duration: Track apply/plan times
  • State file size: Monitor growth trends
  • Lock duration: Identify bottlenecks
  • API response times: Monitor backend performance

Security Metrics

  • Failed authentication attempts: Detect brute force attacks
  • Unusual access patterns: Geographic or time-based anomalies
  • Permission changes: Track access control modifications
  • API key usage: Monitor for suspicious activity

Alerting Strategy

Critical Alerts

  • State corruption: Immediate notification
  • Failed deployments: Real-time alerts
  • Security incidents: Immediate escalation
  • Service outages: Automated failover triggers

Warning Alerts

  • Long-running operations: 30+ minute threshold
  • State file growth: Size increase warnings
  • High API usage: Approaching rate limits
  • Lock contention: Multiple failed lock attempts
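The long-running operation threshold can be enforced at the point where the command runs. The sketch below wraps any command, preserves its exit status, and prints a warning when the run time reaches a threshold in seconds; run_timed is a hypothetical helper, and forwarding the warning to an alerting system is left out.

```shell
# Sketch: run a command and warn when it takes longer than a threshold.
run_timed() {
  # $1 = threshold in seconds, remaining arguments = command to run
  threshold=$1
  shift
  start=$(date +%s)
  status=0
  "$@" || status=$?
  elapsed=$(( $(date +%s) - start ))
  if [ "$elapsed" -ge "$threshold" ]; then
    echo "WARN: '$*' ran for ${elapsed}s (threshold ${threshold}s)" >&2
  fi
  return "$status"
}
```

For the 30-minute threshold above, this would be `run_timed 1800 terraform apply -auto-approve`.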

Disaster Recovery

Backup Strategy

Automated Backups

  • Pre-change backups: Before every apply operation
  • Scheduled backups: Daily snapshots of all workspaces
  • Cross-region replication: Geographic redundancy
  • Retention policies: 30 days of backup history
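If state backups are also kept as local files, the retention policy can be applied with `find`. This is a minimal sketch assuming backups are stored as .tfstate files in one directory; prune_old_backups is a hypothetical helper, and `-delete` requires GNU or BSD find.

```shell
# Sketch: delete local state backups older than the retention window.
prune_old_backups() {
  # $1 = backup directory, $2 = retention in days (default 30)
  # -mtime +N matches files last modified more than N days ago
  find "$1" -type f -name '*.tfstate' -mtime +"${2:-30}" -print -delete
}
```

Running `prune_old_backups /backups` prints each file it removes, which makes the job easy to audit from its log.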

Backup Verification

#!/bin/bash
# Regular backup verification script (requires jq)
# Any failing command aborts the script with a non-zero exit code
set -euo pipefail

WORKSPACE="prod-app"
BACKUP_DIR="/backups"
BACKUP_FILE="$BACKUP_DIR/verify-$(date +%Y%m%d).tfstate"

# Create backup
terraform workspace select "$WORKSPACE"
terraform state pull > "$BACKUP_FILE"

# Verify backup integrity: the file must parse as JSON and carry the same
# lineage and serial as the live state
current=$(terraform state pull | jq -r '"\(.lineage) \(.serial)"')
backup=$(jq -r '"\(.lineage) \(.serial)"' "$BACKUP_FILE")

if [ "$current" = "$backup" ]; then
    echo "Backup verification successful"
else
    echo "Backup verification failed!"
    exit 1
fi

Recovery Procedures

State Recovery

  1. Identify issue: Corruption, accidental deletion, etc.
  2. Stop operations: Prevent further changes
  3. Restore from backup: Use most recent valid backup
  4. Verify integrity: Compare with actual infrastructure
  5. Resume operations: Gradual return to normal operations
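Steps 3 and 4 can be sketched with `terraform state push` and `terraform plan -detailed-exitcode`. restore_state is a hypothetical helper name, and the sketch assumes operations have already been stopped (step 2) and that the backup file was produced by `terraform state pull`.

```shell
# Sketch: restore a state backup and check it against real infrastructure.
restore_state() {
  backup_file=$1
  if [ ! -s "$backup_file" ]; then
    echo "backup file missing or empty: $backup_file" >&2
    return 1
  fi
  # Keep a copy of the current state, even if damaged, before overwriting it
  terraform state pull > "pre-restore-$(date +%Y%m%d%H%M%S).tfstate" ||
    echo "warning: could not save the current state" >&2
  # Step 3: upload the backup. Terraform rejects a push with a different lineage
  # or an older serial; -force overrides that check, so add it only deliberately.
  terraform state push "$backup_file" || return 1
  # Step 4: exit code 0 means no changes, 2 means differences to review
  terraform plan -detailed-exitcode
}
```

An exit code of 2 after a restore is not necessarily an error; it means the restored state and the real infrastructure differ and the plan output needs review before operations resume.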

Infrastructure Drift

# Detect and fix infrastructure drift
# (-refresh-only replaces the deprecated `terraform refresh`)
terraform plan -refresh-only
terraform plan

# If drift detected
terraform apply  # Apply corrections

# Or import changes made outside Terraform
terraform import aws_security_group.web sg-12345678

Cost Optimization

Resource Management

Right-sizing Workspaces

  • Monitor usage: Track API calls and storage
  • Consolidate when appropriate: Merge low-activity workspaces
  • Archive old workspaces: Remove unused environments

Efficient Operations

  • Batch operations: Group related changes
  • Use targeted applies: Apply specific resources with -target when needed, but sparingly; Terraform's documentation reserves -target for exceptional situations
  • Minimize plan frequency: Avoid unnecessary planning

Team Collaboration

Workspace Organization

Naming Conventions

# Recommended naming patterns
{project}-{environment}-{component}
myapp-prod-frontend
myapp-prod-backend
myapp-prod-database

# Or team-based
{team}-{project}-{environment}
platform-infrastructure-prod
payments-service-prod
user-management-staging
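A naming convention is easier to keep when it is checked automatically. The sketch below accepts names made of exactly three lowercase alphanumeric segments joined by hyphens, which covers every example above. That rule is an assumption: it cannot tell the two patterns apart, and it rejects segments that themselves contain hyphens. valid_workspace_name is a hypothetical helper.

```shell
# Sketch: accept names with exactly three lowercase alphanumeric segments.
valid_workspace_name() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]+-[a-z0-9]+-[a-z0-9]+$'
}
```

For example, `valid_workspace_name myapp-prod-frontend` succeeds, while `valid_workspace_name MyApp-prod` fails.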

Documentation Standards

  • Workspace descriptions: Clear purpose and scope
  • README files: Setup and operation instructions
  • Change logs: Document major modifications
  • Contact information: Workspace owners and escalation

Code Review Process

Review Checklist

  • Security: No hardcoded secrets or overprivileged access
  • Performance: Efficient resource configurations
  • Standards: Follows team conventions
  • Testing: Includes appropriate validation
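The hardcoded-secret check on this list can be partly automated before a human review. The sketch below is a rough heuristic, not a replacement for a dedicated secret scanner: it flags quoted literal values assigned to a few common attribute names in .tf files and ignores values containing `$` interpolation. find_hardcoded_secrets and the list of attribute names are assumptions.

```shell
# Sketch: flag likely hardcoded secrets in Terraform files under a directory.
# Prints file:line:match for each hit and returns 1 when anything is found.
find_hardcoded_secrets() {
  if grep -rnE --include='*.tf' \
    '(password|secret|api_key|token)[[:space:]]*=[[:space:]]*"[^"$]+"' "$1"; then
    return 1
  fi
  return 0
}
```

Run as `find_hardcoded_secrets . || exit 1` in a validation stage, it would catch the `password = "fs_key_abc123..."` example shown under API Key Management.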

Approval Process

  • Development: Peer review required
  • Staging: Senior engineer approval
  • Production: DevOps team approval