Best Practices

Follow these recommendations to keep FreeState secure, reliable, and fast in production environments.

Security Best Practices

API Key Management

Key Generation and Storage

Use descriptive names: "prod-terraform-ci", "dev-team-access"
Store securely: Use secret management systems (AWS Secrets Manager, HashiCorp Vault)
Never commit keys: Add files that hold keys (for example .env or *.tfvars) to .gitignore and pass keys through environment variables
Limit scope: Use workspace-scoped keys when possible

# Good: Environment variables
export TF_HTTP_USERNAME="workspace-id"
export TF_HTTP_PASSWORD="$FREESTATE_API_KEY"

# Bad: Hardcoded in files
terraform {
  backend "http" {
    username = "workspace-123"
    password = "fs_key_abc123..."  # Never do this!
  }
}
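A small scan run before each commit is one way to catch a hardcoded key. The Python sketch below assumes keys start with the fs_key_ prefix seen in the placeholder above; that prefix, the minimum length, and the function name are illustrative assumptions, not a documented FreeState key format.

```python
import re

# Assumption: keys look like "fs_key_" followed by letters, digits, or underscores.
KEY_PATTERN = re.compile(r"fs_key_[A-Za-z0-9_]{6,}")


def find_hardcoded_keys(text):
    """Return the 1-based line numbers in `text` that appear to contain a key."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if KEY_PATTERN.search(line)
    ]


if __name__ == "__main__":
    sample = 'username = "workspace-123"\npassword = "fs_key_abc123def"\n'
    print(find_hardcoded_keys(sample))  # prints [2]
```

A pre-commit hook could run this over staged .tf files and reject the commit when the returned list is not empty.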

Key Rotation

Regular rotation: Rotate keys every 90 days
Emergency rotation: Immediate rotation if compromised
Overlap period: Brief overlap when updating systems
Audit old keys: Remove unused keys promptly
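The 90-day rule is easy to check automatically. The following Python sketch flags keys that are past the rotation period; the inventory format (key name mapped to creation time) is hypothetical, so adapt it to wherever you track key metadata.

```python
from datetime import datetime, timedelta, timezone

ROTATION_PERIOD = timedelta(days=90)  # matches the 90-day guidance above


def keys_due_for_rotation(keys, now=None):
    """Return the sorted names of keys created more than ROTATION_PERIOD ago.

    `keys` maps a key name to its creation time (timezone-aware datetime).
    """
    now = now or datetime.now(timezone.utc)
    return sorted(
        name for name, created in keys.items()
        if now - created > ROTATION_PERIOD
    )


if __name__ == "__main__":
    now = datetime(2024, 6, 1, tzinfo=timezone.utc)
    inventory = {
        "prod-terraform-ci": datetime(2024, 1, 15, tzinfo=timezone.utc),
        "dev-team-access": datetime(2024, 5, 1, tzinfo=timezone.utc),
    }
    print(keys_due_for_rotation(inventory, now))  # prints ['prod-terraform-ci']
```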

Access Control

Principle of Least Privilege

Role	Permissions	Use Case
Read-Only	State viewing, workspace metadata	Monitoring, audit, read-only CI
Contributor	State read/write, lock management	Developers, CI/CD pipelines
Admin	Full workspace management	DevOps engineers, team leads
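To apply least privilege when issuing access, pick the lowest role that covers what a user or pipeline needs. The Python sketch below illustrates the idea; the permission identifiers are hypothetical names derived from the table above, and it assumes Contributor includes everything Read-Only can do.

```python
# Hypothetical permission names derived from the table above; FreeState's
# real identifiers may differ.
ROLE_PERMISSIONS = {
    "Read-Only": {"state:read", "workspace:read"},
    "Contributor": {"state:read", "workspace:read", "state:write", "lock:manage"},
    "Admin": {"state:read", "workspace:read", "state:write", "lock:manage",
              "workspace:manage"},
}
# Roles ordered from least to most privileged.
ROLE_ORDER = ["Read-Only", "Contributor", "Admin"]


def least_privileged_role(required):
    """Return the least privileged role that grants every permission in `required`."""
    needed = set(required)
    for role in ROLE_ORDER:
        if needed <= ROLE_PERMISSIONS[role]:
            return role
    raise ValueError(f"no role grants: {sorted(needed)}")


if __name__ == "__main__":
    # A CI pipeline that applies changes needs to write state and manage locks.
    print(least_privileged_role({"state:write", "lock:manage"}))  # prints Contributor
```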

Multi-Factor Authentication

Enable MFA: Required for all team members
Backup codes: Store securely for account recovery
Regular audits: Review MFA status monthly

Network Security

IP Whitelisting

# Configure IP restrictions for production workspaces
allowed_ips = [
  "203.0.113.0/24",    # Office network
  "192.0.2.1/32",      # CI/CD server
  "198.51.100.0/24"    # VPN range
]
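Before tightening restrictions, it helps to confirm that an address you rely on actually falls inside the allowed ranges. This Python sketch uses the standard ipaddress module and the example ranges above; it only illustrates CIDR matching and says nothing about how FreeState evaluates the list internally.

```python
import ipaddress

# The example ranges from the configuration above.
ALLOWED_IPS = ["203.0.113.0/24", "192.0.2.1/32", "198.51.100.0/24"]


def is_allowed(client_ip, allowed=ALLOWED_IPS):
    """Return True if `client_ip` falls inside any of the allowed CIDR ranges."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed)


if __name__ == "__main__":
    print(is_allowed("203.0.113.42"))  # prints True (office network)
    print(is_allowed("192.0.2.2"))     # prints False (only 192.0.2.1 is listed)
```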

VPN and Private Networks

Use VPNs: Route traffic through secure connections
Private endpoints: Consider private connectivity options
Monitor access: Log and alert on unusual access patterns

Performance Best Practices

State File Optimization

State Size Management

Modular architecture: Break large configurations into modules
Separate state files: Use different workspaces for logical boundaries
Resource limits: Monitor state file size and resource count

# Good: Modular approach
# Infrastructure layer
terraform workspace select infra-prod
terraform apply

# Application layer  
terraform workspace select app-prod
terraform apply

# Bad: Everything in one state
terraform apply  # 500+ resources in one state file
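To decide when a state is large enough to split, count what it holds. The Python sketch below counts managed resource instances in a state document as produced by terraform state pull. It assumes the version 4 state layout (a top-level resources list whose entries carry mode and instances); that layout is internal to Terraform and may change between releases.

```python
import json


def count_managed_instances(state):
    """Count managed resource instances in a parsed Terraform state document."""
    return sum(
        len(resource.get("instances", []))
        for resource in state.get("resources", [])
        if resource.get("mode") == "managed"  # skip data sources
    )


if __name__ == "__main__":
    # Tiny stand-in for the output of `terraform state pull`.
    sample = json.loads("""
    {"version": 4, "resources": [
      {"mode": "managed", "type": "aws_instance", "name": "web",
       "instances": [{}, {}]},
      {"mode": "data", "type": "aws_vpc", "name": "existing",
       "instances": [{}]}
    ]}
    """)
    print(count_managed_instances(sample))  # prints 2
```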

State Hygiene

Remove unused resources: Clean up regularly
Import existing resources: Don't recreate what exists
Use data sources: Reference external resources

# Remove unused resources
terraform state rm 'aws_instance.old_server'

# Import existing resources instead of recreating
terraform import aws_instance.existing i-1234567890abcdef0

# Use data sources for external references
data "aws_vpc" "existing" {
  id = "vpc-12345678"
}

Lock Management

Minimize Lock Duration

Small changes: Make incremental updates
Pre-validation: Use terraform plan to catch issues early
Automated releases: Configure auto-unlock timeouts

Coordinate Team Access

Communication: Announce large changes in advance
Scheduled windows: Use maintenance windows for major updates
Monitoring: Set up alerts for long-running locks

Spot Instance Optimization

Capacity Strategy

Mixed capacity: Start from roughly 70% Spot + 30% On-Demand and adjust based on observed interruption rates
Multi-AZ deployment: Distribute tasks across availability zones
Right-sizing: Choose instance types with good Spot availability
Base capacity: Ensure at least 1 On-Demand task for service stability

# Example ECS capacity provider strategy: 70/30 Spot to On-Demand split with one On-Demand base task
{
  "capacityProviderStrategy": [
    {
      "capacityProvider": "FARGATE_SPOT",
      "weight": 70,
      "base": 0
    },
    {
      "capacityProvider": "FARGATE",
      "weight": 30,
      "base": 1
    }
  ]
}
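To see what a strategy like this means for a given service size, the Python sketch below estimates the split: base tasks are assigned first, then the remainder is divided in proportion to the weights. It is an approximation only. It does not model how ECS rounds to whole tasks, and it assumes the desired count is at least the total base and that at least one weight is non-zero.

```python
def nominal_task_split(desired_count, strategy):
    """Estimate tasks per capacity provider for a given desired count.

    Values may be fractional because rounding to whole tasks is not modelled.
    """
    total_base = sum(item.get("base", 0) for item in strategy)
    total_weight = sum(item["weight"] for item in strategy)
    remaining = max(desired_count - total_base, 0)
    return {
        item["capacityProvider"]: item.get("base", 0)
        + remaining * item["weight"] / total_weight
        for item in strategy
    }


if __name__ == "__main__":
    strategy = [
        {"capacityProvider": "FARGATE_SPOT", "weight": 70, "base": 0},
        {"capacityProvider": "FARGATE", "weight": 30, "base": 1},
    ]
    # 11 tasks: 1 base task on FARGATE, the other 10 split 7 to 3.
    print(nominal_task_split(11, strategy))
```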

Application Resilience

Stateless design: Store state externally (database, S3, cache)
Graceful shutdown: Handle SIGTERM and finish cleanup within 120 seconds, the two-minute warning Fargate Spot gives before stopping a task
Health checks: Implement comprehensive readiness/liveness probes
Circuit breakers: Add fallback mechanisms for service dependencies
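For the graceful shutdown item, a common pattern is to have the signal handler only record the request and let the main loop finish its current unit of work. The Python sketch below shows that pattern; the function names and the worker loop are illustrative and not part of any FreeState or ECS API.

```python
import signal
import threading

# Set once SIGTERM has been received.
shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: only record the request; cleanup happens in the main loop."""
    shutdown_requested.set()


def install_sigterm_handler():
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run_worker(process_one, idle_seconds=1.0):
    """Call process_one() until shutdown is requested; return how many calls ran."""
    completed = 0
    while not shutdown_requested.is_set():
        process_one()
        completed += 1
        # Wake up early if SIGTERM arrives while idle.
        shutdown_requested.wait(idle_seconds)
    return completed
```

Call install_sigterm_handler() at startup, then run_worker(...). Once it returns, flush buffers and close connections, keeping each unit of work short enough that the whole sequence fits inside the shutdown window.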

Monitoring and Alerting

Spot interruption tracking: Monitor interruption rates and patterns
Placement failure alerts: Get notified when Spot capacity is unavailable
Service availability metrics: Track healthy task percentage
Cost optimization reports: Measure savings vs. operational overhead

Learn More: See our comprehensive Spot Instance Support guide for detailed configuration examples and troubleshooting.

CI/CD Best Practices

Pipeline Design

Environment Promotion

# GitLab CI example with proper promotion
stages:
  - validate
  - plan-dev
  - apply-dev
  - plan-staging
  - apply-staging
  - plan-prod
  - apply-prod

dev-plan:
  stage: plan-dev
  script:
    - terraform workspace select dev
    - terraform plan
  only:
    - develop

dev-apply:
  stage: apply-dev
  script:
    - terraform workspace select dev
    - terraform apply -auto-approve
  only:
    - develop

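prod-plan:
  stage: plan-prod
  script:
    - terraform workspace select prod
    - terraform plan -input=false -out=prod.tfplan
  artifacts:
    paths:
      - prod.tfplan  # Pass the reviewed plan to the apply job; it can contain sensitive values
  only:
    - main

prod-apply:
  stage: apply-prod
  script:
    - terraform workspace select prod
    - terraform apply -input=false prod.tfplan  # Apply the saved plan; a bare apply would wait for confirmation
  when: manual  # Require manual approval
  only:
    - main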

Error Handling

# GitHub Actions with proper error handling
- name: Terraform Apply
  id: apply
  continue-on-error: true
  run: terraform apply -auto-approve
  
- name: Handle Failure
  if: steps.apply.outcome == 'failure'
  run: |
    echo "Terraform apply failed"
    terraform show
    exit 1
    
- name: Notify Success
  if: steps.apply.outcome == 'success'
  run: |
    echo "Deployment successful"
    # Send notification to Slack/Teams

Security in CI/CD

Secret Management

Use CI/CD secret stores: GitHub Secrets, GitLab Variables
Scope secrets appropriately: Environment-specific secrets
Rotate regularly: Automated secret rotation
Audit access: Log secret usage

Branch Protection

Protect main branches: Require reviews for production
Status checks: Require successful builds
Signed commits: Verify commit authenticity

Monitoring and Alerting

Key Metrics

Performance Metrics

Operation duration: Track apply/plan times
State file size: Monitor growth trends
Lock duration: Identify bottlenecks
API response times: Monitor backend performance

Security Metrics

Failed authentication attempts: Detect brute force attacks
Unusual access patterns: Geographic or time-based anomalies
Permission changes: Track access control modifications
API key usage: Monitor for suspicious activity

Alerting Strategy

Critical Alerts

State corruption: Immediate notification
Failed deployments: Real-time alerts
Security incidents: Immediate escalation
Service outages: Automated failover triggers

Warning Alerts

Long-running operations: 30+ minute threshold
State file growth: Size increase warnings
High API usage: Approaching rate limits
Lock contention: Multiple failed lock attempts
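Warning rules like these reduce to comparing metrics against thresholds. In the Python sketch below, only the 30-minute operation threshold comes from the list above; the metric names and the other limits are placeholder assumptions to tune for your own workspaces.

```python
# Only the 30-minute threshold comes from the list above; the other limits
# are placeholder assumptions.
THRESHOLDS = {
    "operation_minutes": 30,
    "state_growth_percent": 20,
    "api_usage_percent": 80,
    "failed_lock_attempts": 3,
}


def warning_alerts(metrics, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that meet or exceed their threshold."""
    return sorted(
        name for name, limit in thresholds.items()
        if metrics.get(name, 0) >= limit
    )


if __name__ == "__main__":
    observed = {"operation_minutes": 45, "api_usage_percent": 50,
                "failed_lock_attempts": 3}
    print(warning_alerts(observed))  # prints ['failed_lock_attempts', 'operation_minutes']
```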

Disaster Recovery

Backup Strategy

Automated Backups

Pre-change backups: Before every apply operation
Scheduled backups: Daily snapshots of all workspaces
Cross-region replication: Geographic redundancy
Retention policies: 30 days of backup history

Backup Verification

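#!/bin/bash
# Regular backup verification script
set -euo pipefail

WORKSPACE="prod-app"
BACKUP_DIR="/backups"
BACKUP_FILE="$BACKUP_DIR/verify-$(date +%Y%m%d).tfstate"

# Create backup
terraform workspace select "$WORKSPACE"
terraform state pull > "$BACKUP_FILE"

# Verify backup integrity.
# terraform state list does not read a state from stdin, so render the live
# state and the backup file with terraform show and compare the two.
CURRENT_VIEW="$(mktemp)"
BACKUP_VIEW="$(mktemp)"
trap 'rm -f "$CURRENT_VIEW" "$BACKUP_VIEW"' EXIT

terraform show -no-color > "$CURRENT_VIEW"
terraform show -no-color "$BACKUP_FILE" > "$BACKUP_VIEW"

if diff -q "$CURRENT_VIEW" "$BACKUP_VIEW"; then
    echo "Backup verification successful"
else
    echo "Backup verification failed!"
    exit 1
fi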
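A complementary check is to confirm that a backup parses and belongs to the same state history as the live state. The Python sketch below compares the lineage and serial fields found at the top level of a Terraform state document; the function name and messages are illustrative, and both inputs are assumed to be the text produced by terraform state pull.

```python
import json


def backup_problems(backup_text, current_text):
    """Return a list of problems found when comparing a backup to the current state."""
    try:
        backup = json.loads(backup_text)
    except json.JSONDecodeError as exc:
        return [f"backup is not valid JSON: {exc.msg}"]
    current = json.loads(current_text)

    problems = []
    if backup.get("lineage") != current.get("lineage"):
        problems.append("lineage differs: backup belongs to a different state history")
    if backup.get("serial", 0) > current.get("serial", 0):
        problems.append("backup serial is newer than the current state")
    return problems


if __name__ == "__main__":
    current = '{"version": 4, "serial": 12, "lineage": "abc"}'
    print(backup_problems('{"version": 4, "serial": 12, "lineage": "abc"}', current))  # prints []
    print(backup_problems('{"version": 4, "ser', current))  # reports invalid JSON
```

An empty list means the backup is parseable and consistent with the live state; it does not prove that every resource attribute matches.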

Recovery Procedures

State Recovery

Identify issue: Corruption, accidental deletion, etc.
Stop operations: Prevent further changes
Restore from backup: Use most recent valid backup
Verify integrity: Compare with actual infrastructure
Resume operations: Gradual return to normal operations

Infrastructure Drift

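# Detect infrastructure drift without changing any resources
# (-refresh-only replaces the deprecated terraform refresh)
terraform plan -refresh-only

# Accept the detected changes into state if they are intentional
terraform apply -refresh-only

# Or revert the drift so real infrastructure matches the configuration
terraform plan
terraform apply

# Or import resources that were created outside Terraform
terraform import aws_security_group.web sg-12345678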
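Drift checks are easier to automate with the -detailed-exitcode flag of terraform plan, which returns 0 for no changes, 1 for an error, and 2 when changes are present. The Python sketch below wraps that; the function names and result labels are illustrative. A result of "changes-present" can mean drift or configuration that has not been applied yet, so review the plan before acting.

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_RESULTS = {0: "no-changes", 1: "error", 2: "changes-present"}


def classify_plan_exit_code(code):
    """Map a plan exit code to a label; unknown codes are treated as errors."""
    return PLAN_RESULTS.get(code, "error")


def check_for_drift(workdir="."):
    """Run a plan in `workdir` and report whether changes were detected.

    Requires the terraform binary and an initialized working directory.
    """
    completed = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(completed.returncode)
```

A scheduled job could call check_for_drift() for each workspace directory and raise an alert on any result other than "no-changes".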

Cost Optimization

Resource Management

Right-sizing Workspaces

Monitor usage: Track API calls and storage
Consolidate when appropriate: Merge low-activity workspaces
Archive old workspaces: Remove unused environments

Efficient Operations

Batch operations: Group related changes
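Use targeted applies sparingly: Reserve -target for exceptional cases such as recovering from errors; routine use can leave state out of step with configuration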
Minimize plan frequency: Avoid unnecessary planning

Team Collaboration

Workspace Organization

Naming Conventions

# Recommended naming patterns
{project}-{environment}-{component}
myapp-prod-frontend
myapp-prod-backend
myapp-prod-database

# Or team-based
{team}-{project}-{environment}
platform-infrastructure-prod
payments-service-prod
user-management-staging
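A naming convention holds up better when it is checked automatically. The Python sketch below validates the {project}-{environment}-{component} pattern; it assumes lowercase alphanumeric parts and a fixed environment list (dev, staging, prod), both of which are illustrative choices to adjust to your own convention.

```python
import re

# Assumption: environments are limited to these three names.
ENVIRONMENTS = ("dev", "staging", "prod")
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)-(?P<environment>%s)-(?P<component>[a-z0-9]+)$"
    % "|".join(ENVIRONMENTS)
)


def parse_workspace_name(name):
    """Split a {project}-{environment}-{component} name, or return None if invalid."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None


if __name__ == "__main__":
    print(parse_workspace_name("myapp-prod-frontend"))
    print(parse_workspace_name("MyApp_Prod"))  # prints None
```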

Documentation Standards

Workspace descriptions: Clear purpose and scope
README files: Setup and operation instructions
Change logs: Document major modifications
Contact information: Workspace owners and escalation

Code Review Process

Review Checklist

Security: No hardcoded secrets or overprivileged access
Performance: Efficient resource configurations
Standards: Follows team conventions
Testing: Includes appropriate validation

Approval Process

Development: Peer review required
Staging: Senior engineer approval
Production: DevOps team approval