Image Factory Reference Architecture

This document captures the broad reference architecture for Image Factory, including the logical platform model, major subsystems, and operational concerns. It is intended as an architectural reference rather than an implementation checklist.

System Overview
Architecture Principles
Logical Architecture
Physical Architecture
Component Specifications
Data Architecture
Security Architecture
Operational Architecture
Deployment Architecture
Integration Architecture
Performance and Scalability
Storage Architecture
OCI Distribution API
Monitoring and Observability
Disaster Recovery
Compliance and Governance

System Overview

Business Context

The Multi-Tenant Image Build Factory is an enterprise-grade platform that enables organizations to build, manage, and distribute multiple types of images in a secure, scalable, and governed manner. The factory builds virtual machine images, cloud provider-specific images (such as Amazon Machine Images (AMIs), Google Compute Engine images, and Azure VM images), and container images for Kubernetes. All image types share unified S3 object storage, and container images are served through a native OCI Distribution API, which keeps storage costs down and avoids an external registry dependency.

Key Capabilities

Unified S3 Storage Architecture: All image types (VM, CSP, container) use S3 object storage with intelligent tiering
Native OCI Distribution API: Built-in container registry compatibility without external dependencies
Multi-Type Image Building: Support for VM images, cloud provider images (AMIs, etc.), and container images
Multi-Tenant Image Building: Isolated build environments per tenant with resource segregation
Git-Integrated Manifest Management: Automated builds triggered by Git events for all image types
Enterprise Security: LDAP/SSO integration, multi-layer image scanning, and quarantine processes
Content-Addressable Storage: Automatic deduplication across all image types for cost optimization
Self-Service Portal: Web-based interface for tenant operations across all image types
CI/CD Integration: Seamless integration with existing pipelines for heterogeneous image builds
Vulnerability Management: Comprehensive scanning, alerting, and remediation tracking
Resource Management: Dynamic resource allocation and quota management for different build types
Audit & Compliance: Comprehensive logging and compliance reporting across all image artifacts
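
The content-addressable storage capability listed above can be illustrated with a minimal sketch. It assumes SHA-256 digests and uses a hypothetical in-memory `BlobStore` in place of the S3 backend; the `blobs/sha256/` key layout and all type and function names are illustrative assumptions, not the platform's actual layout.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// BlobStore is a hypothetical in-memory stand-in for the S3 backend.
type BlobStore struct {
	objects map[string][]byte
}

// NewBlobStore returns an empty store.
func NewBlobStore() *BlobStore {
	return &BlobStore{objects: make(map[string][]byte)}
}

// BlobKey derives the object key from the content itself, so identical
// content always maps to the same key regardless of image type.
func BlobKey(content []byte) string {
	sum := sha256.Sum256(content)
	return "blobs/sha256/" + hex.EncodeToString(sum[:])
}

// Put stores content under its digest key. It reports whether a new object
// was written; a false result means the blob was deduplicated.
func (s *BlobStore) Put(content []byte) (key string, written bool) {
	key = BlobKey(content)
	if _, exists := s.objects[key]; exists {
		return key, false
	}
	// Copy the slice so later changes by the caller do not alter the blob.
	s.objects[key] = append([]byte(nil), content...)
	return key, true
}

// Count returns the number of distinct blobs stored.
func (s *BlobStore) Count() int {
	return len(s.objects)
}
```

Storing the same layer twice returns the same key and writes only once, which is where the deduplication saving comes from.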

Technology Stack

Component	Technology	Purpose
Backend Services	Golang	API services, business logic, OCI Distribution API
Frontend	React + Ant Design	User interface
Container Runtime	Kubernetes	Orchestration platform
Build Engines	Dispatcher + Tekton Pipelines + Packer + Buildah	Queue-less dispatch (status-based) and multi-type build orchestration
VM Image Builder	Packer + Ansible	VM and cloud provider image building
VMware Integration	VMware vSphere API + ESXi	VMware-based VM image builds and testing
Container Builder	Buildah + Kaniko	Container image building
Message Bus	NATS (Go-based) + Go Channels	Event-driven communication
Database	PostgreSQL with Flyway migrations	Metadata, manifests, RBAC, audit logs, vulnerability data
Storage	S3 Compatible (AWS S3/MinIO)	Unified blob storage for all image types with content-addressable storage
Container Registry	Built-in OCI Distribution API	Native container image serving with S3 backend
Configuration	etcd/Consul	Distributed configuration
Authentication	LDAP/SAML/OIDC + JWT	Enterprise identity with bearer token support
Authorization	RBAC + ABAC	Multi-tenant access control with repository-level permissions
Security Scanning	Trivy + Clair + Cloud Security Tools	Multi-type image vulnerability scanning
Monitoring	Prometheus + Grafana	Observability
Logging	ELK Stack	Centralized logging

Architecture At A Glance

flowchart LR
    U[Users And Automation] --> UI[Frontend UI]
    U --> API[Backend APIs]
    UI --> API
    API --> DB[(PostgreSQL)]
    API --> BUS[Messaging And Events]
    API --> DISP[Dispatcher]
    DISP --> EXEC[Build Executors]
    EXEC --> K8S[Kubernetes And Tekton]
    EXEC --> STORE[S3 And Registry Storage]
    API --> SEC[Security And Compliance]

Platform Interaction Flow

sequenceDiagram
    participant User
    participant Frontend
    participant API
    participant Dispatcher
    participant Executor
    participant Storage

    User->>Frontend: Configure project or build
    Frontend->>API: Submit request
    API->>Dispatcher: Queue work
    Dispatcher->>Executor: Start execution
    Executor->>Storage: Write artifacts and metadata
    Executor-->>API: Report status and logs
    API-->>Frontend: Update UI and results

Architecture Principles

Design Principles

Domain Driven Design (DDD)

The architecture follows Domain Driven Design principles to create a ubiquitous language and model complex business domains:

Strategic Design:

Bounded Contexts: Clear boundaries between business domains
- Tenant Management Context
- Build Orchestration Context
- Image Lifecycle Context
- Security & Compliance Context
- Notification & Communication Context
Context Mapping: Defines relationships between bounded contexts
- Shared Kernel for common utilities
- Customer-Supplier relationships for domain dependencies
- Published Language for cross-context communication

Tactical Design:

Entities: Objects with identity and lifecycle (Tenant, Build, Image)
Value Objects: Immutable objects representing concepts (BuildManifest, SecurityPolicy)
Domain Services: Stateless operations that don't belong to entities
Aggregates: Clusters of domain objects with consistency boundaries
Repositories: Abstractions for data persistence
Factories: Encapsulate complex object creation

Hexagonal Event-Driven Architecture

The system implements Hexagonal Architecture with event-driven communication for loose coupling and testability:

Hexagonal Structure:

Core Domain: Business logic independent of external concerns
Ports: Interfaces defining interactions with external systems
Adapters: Implementations of ports for specific technologies

Port Types:

Driving Ports: Interfaces for external systems to drive the application
Driven Ports: Interfaces for the application to drive external systems
Event Ports: Interfaces for publishing and subscribing to domain events
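
A minimal sketch of the port and adapter split follows. It assumes a hypothetical `TenantRepository` driven port, an in-memory adapter, and a small `TenantService`; none of these names come from the actual codebase.

```go
package main

import (
	"context"
	"errors"
)

// Tenant is a simplified entity used only for this illustration.
type Tenant struct {
	ID   string
	Name string
}

// ErrTenantNotFound is returned by adapters when no tenant matches.
var ErrTenantNotFound = errors.New("tenant not found")

// TenantRepository is a driven port: the domain defines it and
// infrastructure code provides the implementation.
type TenantRepository interface {
	Save(ctx context.Context, t Tenant) error
	FindByID(ctx context.Context, id string) (Tenant, error)
}

// InMemoryTenantRepository is an adapter suited to tests; a PostgreSQL
// adapter would satisfy the same port.
type InMemoryTenantRepository struct {
	tenants map[string]Tenant
}

func NewInMemoryTenantRepository() *InMemoryTenantRepository {
	return &InMemoryTenantRepository{tenants: make(map[string]Tenant)}
}

func (r *InMemoryTenantRepository) Save(_ context.Context, t Tenant) error {
	r.tenants[t.ID] = t
	return nil
}

func (r *InMemoryTenantRepository) FindByID(_ context.Context, id string) (Tenant, error) {
	t, ok := r.tenants[id]
	if !ok {
		return Tenant{}, ErrTenantNotFound
	}
	return t, nil
}

// TenantService is core domain logic that depends only on the port.
type TenantService struct {
	repo TenantRepository
}

// Rename is the operation a driving adapter (REST, CLI) would call.
func (s *TenantService) Rename(ctx context.Context, id, name string) error {
	if name == "" {
		return errors.New("tenant name must not be empty")
	}
	t, err := s.repo.FindByID(ctx, id)
	if err != nil {
		return err
	}
	t.Name = name
	return s.repo.Save(ctx, t)
}

// renameDemo wires the service to the in-memory adapter and returns the
// tenant name after a rename.
func renameDemo() (string, error) {
	ctx := context.Background()
	repo := NewInMemoryTenantRepository()
	if err := repo.Save(ctx, Tenant{ID: "t-1", Name: "alpha"}); err != nil {
		return "", err
	}
	svc := &TenantService{repo: repo}
	if err := svc.Rename(ctx, "t-1", "tenant-alpha"); err != nil {
		return "", err
	}
	t, err := repo.FindByID(ctx, "t-1")
	return t.Name, err
}
```

Because `TenantService` sees only the interface, swapping the in-memory adapter for a database-backed one requires no change to the domain code.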

Event-Driven Patterns:

Domain Events: Business-significant events published by aggregates
Event Sourcing: State changes stored as immutable event sequences
CQRS: Command Query Responsibility Segregation for optimized reads/writes
Saga Pattern: Long-running business transactions coordinated via events
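
Event sourcing can be sketched as a fold over an event sequence. The example below is a simplified assumption: apart from `build.started`, the event type names, status values, and the `BuildState` shape are hypothetical.

```go
package main

// BuildEvent is an immutable record of one state change.
type BuildEvent struct {
	BuildID string
	Type    string // e.g. "build.queued", "build.started", "build.succeeded", "build.failed"
}

// BuildState is the current state derived from the event history.
type BuildState struct {
	BuildID string
	Status  string
	Version int // number of events applied so far
}

// Apply returns the state that results from one event. It works on a copy,
// so the caller's previous state is left untouched.
func Apply(s BuildState, e BuildEvent) BuildState {
	s.BuildID = e.BuildID
	s.Version++
	switch e.Type {
	case "build.queued":
		s.Status = "queued"
	case "build.started":
		s.Status = "running"
	case "build.succeeded":
		s.Status = "succeeded"
	case "build.failed":
		s.Status = "failed"
	}
	return s
}

// Replay rebuilds the current state by applying events in order.
func Replay(events []BuildEvent) BuildState {
	var s BuildState
	for _, e := range events {
		s = Apply(s, e)
	}
	return s
}
```

Replaying a prefix of the history yields the state at that point in time, which is what makes the stored sequence useful for audit.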

SOLID Principles Compliance

The architecture adheres to SOLID principles for maintainable and extensible code:

Single Responsibility Principle (SRP):
- Each class/service has one reason to change
- Domain services focused on specific business capabilities
- Adapters implement single integration concerns
Open/Closed Principle (OCP):
- Extensible through plugin architecture for new image types
- Configuration-driven behavior without code changes
- Interface-based design allowing new implementations
Liskov Substitution Principle (LSP):
- Interface implementations are fully substitutable
- Build executor interfaces work with any build engine
- Storage adapters interchangeable across registry types
Interface Segregation Principle (ISP):
- Fine-grained interfaces for specific client needs
- Separate ports for different interaction patterns
- Client-specific interfaces to avoid unnecessary dependencies
Dependency Inversion Principle (DIP):
- High-level modules depend on abstractions, not concretions
- Domain layer defines ports, infrastructure provides adapters
- Dependency injection for loose coupling and testability
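
The Open/Closed and Liskov points above can be illustrated with a small executor registry. This is a sketch under assumed names (`BuildExecutor`, `ExecutorRegistry`, `ExecutorFunc`); the real build executor interface may look different.

```go
package main

import "fmt"

// BuildExecutor is a hypothetical port that every build engine implements.
type BuildExecutor interface {
	Execute(manifest string) (artifact string, err error)
}

// ExecutorFunc adapts a plain function to the BuildExecutor interface.
type ExecutorFunc func(manifest string) (string, error)

func (f ExecutorFunc) Execute(manifest string) (string, error) { return f(manifest) }

// ExecutorRegistry maps image types to executors. Supporting a new image
// type means registering another executor, not editing the dispatch code.
type ExecutorRegistry struct {
	executors map[string]BuildExecutor
}

func NewExecutorRegistry() *ExecutorRegistry {
	return &ExecutorRegistry{executors: make(map[string]BuildExecutor)}
}

// Register adds an executor and rejects duplicate image types.
func (r *ExecutorRegistry) Register(imageType string, e BuildExecutor) error {
	if _, exists := r.executors[imageType]; exists {
		return fmt.Errorf("executor for image type %q already registered", imageType)
	}
	r.executors[imageType] = e
	return nil
}

// Build dispatches to whichever executor handles the image type.
func (r *ExecutorRegistry) Build(imageType, manifest string) (string, error) {
	e, ok := r.executors[imageType]
	if !ok {
		return "", fmt.Errorf("no executor registered for image type %q", imageType)
	}
	return e.Execute(manifest)
}
```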

Unified Workflow Across Domains

A unified workflow engine standardizes processes across VM, container, and cloud image builds:

Workflow Architecture:

Workflow Engine: Core orchestration component with pluggable executors
Workflow Definitions: Declarative templates using domain-specific language
Step Library: Reusable steps with domain-specific implementations
Context Management: Shared state and data flow between workflow steps

Cross-Domain Capabilities:

Consistent workflow syntax across different image types
Unified error handling and compensation mechanisms
Shared monitoring and observability across workflows
Domain-agnostic step composition and sequencing
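
The shared state and compensation ideas above can be sketched as follows. The `Step` type, `RunWorkflow`, and the step names are hypothetical; the authoritative engine design lives in the workflow engine document and may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// Step is a hypothetical workflow step with an optional compensation.
type Step struct {
	Name       string
	Run        func(state map[string]string) error
	Compensate func(state map[string]string)
}

// RunWorkflow executes steps in order, passing the same state map to each.
// When a step fails, completed steps are compensated in reverse order and
// the failure is returned.
func RunWorkflow(steps []Step, state map[string]string) error {
	var done []Step
	for _, s := range steps {
		if err := s.Run(state); err != nil {
			for i := len(done) - 1; i >= 0; i-- {
				if done[i].Compensate != nil {
					done[i].Compensate(state)
				}
			}
			return fmt.Errorf("step %q failed: %w", s.Name, err)
		}
		done = append(done, s)
	}
	return nil
}

// demoWorkflow runs build, scan, and publish steps where publish fails,
// and returns the order in which actions happened.
func demoWorkflow() ([]string, error) {
	var trace []string
	record := func(msg string) { trace = append(trace, msg) }
	steps := []Step{
		{
			Name:       "build",
			Run:        func(s map[string]string) error { s["artifact"] = "image.qcow2"; record("run build"); return nil },
			Compensate: func(s map[string]string) { delete(s, "artifact"); record("undo build") },
		},
		{
			Name:       "scan",
			Run:        func(s map[string]string) error { record("run scan"); return nil },
			Compensate: func(s map[string]string) { record("undo scan") },
		},
		{
			Name: "publish",
			Run:  func(s map[string]string) error { record("run publish"); return errors.New("registry unavailable") },
		},
	}
	err := RunWorkflow(steps, map[string]string{})
	return trace, err
}
```

The same `RunWorkflow` function serves VM, cloud, and container builds because the steps, not the engine, carry the domain-specific behavior.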

Unified Notification and Messaging System

Centralized messaging infrastructure for reliable cross-domain communication:

Hybrid Messaging Architecture:

Distributed Message Bus: Lightweight NATS server for cross-service communication
In-Process Channels: Go goroutines and channels for intra-service event handling
Message Types: Commands, Events, Queries with schema validation
Routing: Subject-based routing with queue group scalability
Guarantees: At-least-once delivery with idempotent processing
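
At-least-once delivery means a consumer may see the same message more than once, so handlers need to be idempotent. A minimal sketch, assuming events carry a unique ID and using a simplified `Event` type and an in-memory set of processed IDs (a real deployment would more likely persist them):

```go
package main

import "sync"

// Event is a simplified message for this illustration.
type Event struct {
	ID   string
	Type string
}

// EventHandler processes one event.
type EventHandler func(Event) error

// Idempotent wraps a handler so that a redelivered event (same ID) is
// processed only once. The lock is held while the wrapped handler runs,
// which serializes processing but keeps the sketch free of races.
func Idempotent(next EventHandler) EventHandler {
	var mu sync.Mutex
	seen := make(map[string]bool)
	return func(e Event) error {
		mu.Lock()
		defer mu.Unlock()
		if seen[e.ID] {
			return nil // duplicate delivery: already handled
		}
		if err := next(e); err != nil {
			return err // not marked as seen, so a retry is processed again
		}
		seen[e.ID] = true
		return nil
	}
}
```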

Notification Channels:

Multi-Channel Delivery: Email, Slack, Teams, webhooks, in-app notifications
Template Engine: Configurable message templates with dynamic content
Recipient Management: User preferences and subscription handling
Delivery Tracking: Message status monitoring and retry mechanisms
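
A configurable message template can be sketched with Go's standard `text/template` package. The `BuildNotification` fields and the template text are illustrative assumptions.

```go
package main

import (
	"strings"
	"text/template"
)

// BuildNotification holds the dynamic content for one message.
type BuildNotification struct {
	Tenant  string
	BuildID string
	Status  string
}

// RenderNotification parses the template text and fills it with the
// notification data. Parse errors and references to unknown fields are
// returned to the caller instead of producing a partial message.
func RenderNotification(tmplText string, n BuildNotification) (string, error) {
	tmpl, err := template.New("notification").Parse(tmplText)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, n); err != nil {
		return "", err
	}
	return out.String(), nil
}
```

For example, the template `Build {{.BuildID}} for tenant {{.Tenant}} finished with status {{.Status}}.` renders the same data for email, Slack, or webhook adapters, with only the template text varying per channel.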

Additional Design Principles

Multi-Tenancy First: Complete tenant isolation at all layers
Security by Design: Zero-trust architecture with defense in depth
GitOps: Infrastructure and configuration as code
API-First: All interactions through well-defined APIs
Observability: Comprehensive monitoring and tracing
Resilience: Fault-tolerant design with graceful degradation

Quality Attributes

Scalability: Horizontal scaling across all components
Performance: Sub-second response times for API calls
Reliability: 99.9% uptime with automated recovery
Security: SOC 2 compliance with encrypted data at rest/transit
Maintainability: Modular design with clear separation of concerns
Usability: Intuitive interfaces with comprehensive documentation

Logical Architecture

Hexagonal Architecture Overview

The system follows Hexagonal Architecture with clear separation between domain logic and external concerns:

graph TB
    subgraph "Domain Layer (Core Business Logic)"
        subgraph "Bounded Contexts"
            TM_DOMAIN[Tenant Management<br/>Domain Services]
            BM_DOMAIN[Build Orchestration<br/>Domain Services]
            IM_DOMAIN[Image Lifecycle<br/>Domain Services]
            SC_DOMAIN[Security & Compliance<br/>Domain Services]
            WF_DOMAIN[Unified Workflow<br/>Engine]
        end
        
        subgraph "Domain Ports"
            TM_PORTS[Tenant Ports<br/>Interfaces]
            BM_PORTS[Build Ports<br/>Interfaces]
            IM_PORTS[Image Ports<br/>Interfaces]
            SC_PORTS[Security Ports<br/>Interfaces]
            WF_PORTS[Workflow Ports<br/>Interfaces]
            MSG_PORTS[Messaging Ports<br/>Interfaces]
        end
    end

    subgraph "Adapter Layer (External Interfaces)"
        subgraph "Driving Adapters (Primary)"
            REST_API[REST API<br/>Controllers]
            CLI[CLI Interface<br/>Commands]
            WEB_UI[Web UI<br/>Components]
            WEBHOOKS[Webhook<br/>Handlers]
        end
        
        subgraph "Driven Adapters (Secondary)"
            DB_REPO[Database<br/>Repositories]
            MSG_PUB[Message<br/>Publishers]
            CACHE_ADAPTER[Cache<br/>Adapters]
            EXT_CLIENTS[External Service<br/>Clients]
        end
        
        subgraph "Event Adapters"
            EVENT_PUB[Event<br/>Publishers]
            EVENT_SUB[Event<br/>Subscribers]
            MSG_BUS[Unified Message Bus<br/>NATS]
        end
    end

    subgraph "Infrastructure Layer"
        DB[(PostgreSQL<br/>Multi-tenant)]
        CACHE[(Redis<br/>Cache)]
        REGISTRIES[Image Registries<br/>Harbor + Cloud]
        GIT_PROVIDERS[Git Providers<br/>GitHub/GitLab]
        CLOUD_PROVIDERS[Cloud Providers<br/>AWS/Azure/GCP]
        BUILD_ENGINES[Build Engines<br/>Packer/Buildah/Tekton]
    end

    REST_API --> TM_PORTS
    CLI --> BM_PORTS
    WEB_UI --> IM_PORTS
    WEBHOOKS --> SC_PORTS

    TM_DOMAIN --> TM_PORTS
    BM_DOMAIN --> BM_PORTS
    IM_DOMAIN --> IM_PORTS
    SC_DOMAIN --> SC_PORTS
    WF_DOMAIN --> WF_PORTS

    TM_PORTS --> DB_REPO
    BM_PORTS --> MSG_PUB
    IM_PORTS --> CACHE_ADAPTER
    SC_PORTS --> EXT_CLIENTS
    WF_PORTS --> EVENT_PUB

    DB_REPO --> DB
    MSG_PUB --> MSG_BUS
    CACHE_ADAPTER --> CACHE
    EXT_CLIENTS --> GIT_PROVIDERS
    EXT_CLIENTS --> CLOUD_PROVIDERS
    EXT_CLIENTS --> BUILD_ENGINES

    EVENT_PUB --> MSG_BUS
    MSG_BUS --> EVENT_SUB
    EVENT_SUB --> WF_DOMAIN

Workflow Engine Architecture

See architecture/WORKFLOW_ENGINE_ARCHITECTURE.md for the authoritative workflow engine design and integration plan.

Unified Notification and Messaging Architecture

Hybrid messaging system combining distributed NATS with in-process Go channels:

graph TB
    subgraph "Distributed Message Bus Layer"
        NATS[NATS Server<br/>Distributed Message Bus]
        SCHEMA_REGISTRY[Schema Registry<br/>Message Validation]
        MSG_ROUTER[Message Router<br/>Subject Management]
    end

    subgraph "In-Process Channel Layer"
        GO_CHANNELS[Go Channels<br/>Intra-Service Communication]
        GOROUTINES[Goroutines<br/>Concurrent Processing]
        EVENT_BUS[Event Bus<br/>Channel Orchestration]
    end

    subgraph "Publisher Layer"
        DOMAIN_EVENTS[Domain Event<br/>Publishers]
        CMD_HANDLERS[Command<br/>Handlers]
        NOTIFICATION_PUB[Notification<br/>Publishers]
    end

    subgraph "Consumer Layer"
        EVENT_SUBSCRIBERS[Event<br/>Subscribers]
        WORKFLOW_TRIGGERS[Workflow<br/>Triggers]
        NOTIFICATION_HANDLERS[Notification<br/>Handlers]
    end

    subgraph "Channel Adapters"
        EMAIL_ADAPTER[Email<br/>Adapter]
        SLACK_ADAPTER[Slack<br/>Adapter]
        WEBHOOK_ADAPTER[Webhook<br/>Adapter]
        INAPP_ADAPTER[In-App<br/>Adapter]
    end

    DOMAIN_EVENTS --> GO_CHANNELS
    CMD_HANDLERS --> GO_CHANNELS
    NOTIFICATION_PUB --> GO_CHANNELS

    GO_CHANNELS --> GOROUTINES
    GOROUTINES --> EVENT_BUS
    EVENT_BUS --> NATS

    NATS --> SCHEMA_REGISTRY
    SCHEMA_REGISTRY --> MSG_ROUTER

    MSG_ROUTER --> EVENT_SUBSCRIBERS
    MSG_ROUTER --> WORKFLOW_TRIGGERS
    MSG_ROUTER --> NOTIFICATION_HANDLERS

    NOTIFICATION_HANDLERS --> EMAIL_ADAPTER
    NOTIFICATION_HANDLERS --> SLACK_ADAPTER
    NOTIFICATION_HANDLERS --> WEBHOOK_ADAPTER
    NOTIFICATION_HANDLERS --> INAPP_ADAPTER

Component Relationships

Domain Layer: Core business logic organized in bounded contexts with clean interfaces
Adapter Layer: Technology-specific implementations of domain ports
Infrastructure Layer: External systems and services
Unified Workflow Engine: Orchestrates cross-domain processes with consistent patterns
Hybrid Message System: Combines Go channels (intra-service) with NATS (inter-service) for event-driven communication
Notification System: Multi-channel delivery with unified message handling

Bounded Context Interactions

Tenant Management ↔ Build Orchestration: Tenant context provides isolation boundaries
Build Orchestration ↔ Image Lifecycle: Build context produces artifacts managed by image context
Security & Compliance ↔ All Contexts: Cross-cutting concerns enforced across domains
Unified Workflow ↔ All Contexts: Provides consistent orchestration patterns
Unified Messaging ↔ All Contexts: Enables event-driven communication via hybrid NATS/Go channels

Physical Architecture

Kubernetes Deployment Architecture

graph TB
    subgraph "Kubernetes Cluster"
        subgraph "Control Plane"
            API[API Server]
            ETCD[(etcd)]
            SCHED[Scheduler]
            CONTROLLER[Controller Manager]
        end

        subgraph "Worker Nodes"
            KUBELET1[Kubelet Node 1]
            KUBELET2[Kubelet Node 2]
            KUBELET3[Kubelet Node 3]
        end

        subgraph "Namespaces"
            subgraph "tenant-alpha"
                POD1[Pods]
                SVC1[Services]
                PVC1[PVCs]
            end

            subgraph "tenant-beta"
                POD2[Pods]
                SVC2[Services]
                PVC2[PVCs]
            end

            subgraph "system"
                SYS_POD[System Pods]
                SYS_SVC[System Services]
            end
        end

        subgraph "Shared Services"
            ISTIO[Istio Service Mesh]
            PROM[Prometheus]
            GRAFANA[Grafana]
            ELK[ELK Stack]
        end
    end

    subgraph "External Systems"
        REG[Harbor Registry]
        GIT[Git Repository]
        LDAP[LDAP Server]
        DB[(PostgreSQL)]
    end

    API --> KUBELET1
    API --> KUBELET2
    API --> KUBELET3

    KUBELET1 --> POD1
    KUBELET2 --> POD2
    KUBELET3 --> SYS_POD

    ISTIO --> POD1
    ISTIO --> POD2
    ISTIO --> SYS_POD

    PROM --> POD1
    PROM --> POD2
    PROM --> SYS_POD

    POD1 --> REG
    POD2 --> REG
    SYS_POD --> REG

    POD1 --> GIT
    POD2 --> GIT
    SYS_POD --> GIT

    SYS_POD --> LDAP
    SYS_POD --> DB

VMware ESXi Cluster for VM Image Builds and Testing

graph TB
    subgraph "VMware vSphere Environment"
        subgraph "vCenter Server"
            VCENTER[vCenter Server<br/>Management Interface]
            VCENTER_DB[(vCenter Database)]
        end

        subgraph "ESXi Cluster"
            ESXI1[ESXi Host 1<br/>Build Environment]
            ESXI2[ESXi Host 2<br/>Test Environment]
            ESXI3[ESXi Host 3<br/>Certification Environment]
        end

        subgraph "VMware Components"
            VSWITCH[vSwitch<br/>Virtual Networking]
            VDS[Distributed Switch<br/>Advanced Networking]
            VSAN[vSAN<br/>Shared Storage]
        end

        subgraph "VM Templates & Images"
            VM_TEMPLATES[VM Templates<br/>Base Images]
            GOLDEN_IMAGES[Golden Images<br/>Certified Images]
            TEST_VMS[Test VMs<br/>Build Validation]
        end
    end

    subgraph "Integration Points"
        PACKER_BUILDER[Packer Builder<br/>VM Image Creation]
        ANSIBLE_CONFIG[Ansible<br/>VM Configuration]
        VMWARE_API[VMware API<br/>vSphere Automation]
        CERTIFICATION[CIS Benchmarks<br/>Security Testing]
    end

    VCENTER --> ESXI1
    VCENTER --> ESXI2
    VCENTER --> ESXI3

    ESXI1 --> VSWITCH
    ESXI2 --> VSWITCH
    ESXI3 --> VSWITCH

    VSWITCH --> VDS
    VDS --> VSAN

    PACKER_BUILDER --> VCENTER
    ANSIBLE_CONFIG --> ESXI1
    ANSIBLE_CONFIG --> ESXI2
    ANSIBLE_CONFIG --> ESXI3

    VMWARE_API --> VCENTER
    CERTIFICATION --> TEST_VMS

    ESXI1 --> VM_TEMPLATES
    ESXI2 --> TEST_VMS
    ESXI3 --> GOLDEN_IMAGES

Network Architecture

graph TB
    subgraph "External Network"
        INTERNET[Internet]
        CORP[Corporate Network]
    end

    subgraph "DMZ"
        LB[Load Balancer<br/>HAProxy/NGINX]
        WAF[Web Application Firewall]
        GW[API Gateway<br/>Kong]
    end

    subgraph "Kubernetes Network"
        INGRESS[Ingress Controller<br/>Traefik]
        SVC_MESH[Istio Service Mesh]
        NETWORK_POLICIES[Network Policies<br/>Kubernetes]
    end

    subgraph "Pod Networks"
        TENANT_NS[Tenant Namespaces]
        SYSTEM_NS[System Namespace]
    end

    INTERNET --> LB
    CORP --> LB
    LB --> WAF
    WAF --> GW
    GW --> INGRESS
    INGRESS --> SVC_MESH
    SVC_MESH --> NETWORK_POLICIES
    NETWORK_POLICIES --> TENANT_NS
    NETWORK_POLICIES --> SYSTEM_NS

Component Specifications

API Gateway (Custom Implementation)

Attribute	Specification
Technology	Golang + Gin/Echo framework
Purpose	Single entry point with tenant-aware routing and domain logic integration
Architecture	Hexagonal adapter implementing driving ports
Features	Rate limiting, authentication, tenant routing, request transformation
Multi-tenancy	Tenant context injection, resource isolation, quota enforcement
Security	JWT validation, API key management, tenant-specific policies
Monitoring	Request/response metrics, tenant usage tracking, performance monitoring
DDD Integration	Domain events publishing, tenant context propagation

Custom API Gateway Implementation

Architecture Integration

The custom API gateway acts as a primary (driving) adapter in the hexagonal architecture: it translates external HTTP requests into domain commands and queries.

// Port definition (Hexagonal Architecture)
type APIGatewayPort interface {
    HandleTenantRequest(ctx context.Context, tenantID string, req *http.Request) (*http.Response, error)
    ValidateTenantAccess(ctx context.Context, tenantID string, userID string) error
    EnforceTenantQuotas(ctx context.Context, tenantID string, resource string) error
}

// Adapter implementation
type CustomAPIGateway struct {
    tenantService tenant.TenantService
    authService   auth.AuthService
    rateLimiter   ratelimit.RateLimiter
    metrics       metrics.Collector
}

func (g *CustomAPIGateway) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    // Extract tenant context from request
    tenantID := g.extractTenantID(r)
    
    // Validate tenant access
    if err := g.authService.ValidateTenantAccess(r.Context(), tenantID, g.getUserID(r)); err != nil {
        g.handleAuthError(w, err)
        return
    }
    
    // Enforce tenant quotas
    if err := g.tenantService.CheckResourceQuota(r.Context(), tenantID, "api_calls"); err != nil {
        g.handleQuotaError(w, err)
        return
    }
    
    // Route to appropriate domain service
    g.routeToDomainService(w, r, tenantID)
}

Key Advantages of Custom Implementation

Domain Logic Integration: Direct access to tenant context and business rules
Tenant-Aware Routing: Built-in multi-tenant request routing and isolation
Unified Technology Stack: Same language and frameworks as domain services
Custom Business Logic: Implement tenant-specific routing rules and transformations
Event-Driven Integration: Publish domain events directly from gateway layer
Performance Optimization: Optimize for specific tenant usage patterns

Implementation Components

Tenant Context Middleware:

func TenantContextMiddleware(tenantService tenant.Service) gin.HandlerFunc {
    return func(c *gin.Context) {
        tenantID := extractTenantFromRequest(c.Request)
        
        // Validate tenant exists and is active.
        // The result is named t so it does not shadow the tenant package.
        t, err := tenantService.GetByID(c.Request.Context(), tenantID)
        if err != nil {
            c.AbortWithStatusJSON(http.StatusNotFound, gin.H{"error": "tenant not found"})
            return
        }
        
        // Inject tenant context
        ctx := tenant.WithTenant(c.Request.Context(), t)
        c.Request = c.Request.WithContext(ctx)
        
        c.Next()
    }
}

Rate Limiting with Tenant Isolation:

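type TenantRateLimiter struct {
    limiters map[string]*rate.Limiter
    mu       sync.Mutex
}

// NewTenantRateLimiter initializes the limiter map; writing to a nil map would panic.
func NewTenantRateLimiter() *TenantRateLimiter {
    return &TenantRateLimiter{limiters: make(map[string]*rate.Limiter)}
}

func (trl *TenantRateLimiter) Allow(tenantID string) bool {
    trl.mu.Lock()
    defer trl.mu.Unlock()
    
    limiter, exists := trl.limiters[tenantID]
    if !exists {
        // 100 requests per second with a burst of 100; placeholder for tenant-specific limits
        limiter = rate.NewLimiter(rate.Limit(100), 100)
        trl.limiters[tenantID] = limiter
    }
    
    return limiter.Allow()
}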

Request Transformation Layer:

type RequestTransformer interface {
    TransformForTenant(ctx context.Context, tenantID string, req *http.Request) (*http.Request, error)
}

type DomainAwareTransformer struct {
    tenantService tenant.Service
    buildService  build.Service
}

func (t *DomainAwareTransformer) TransformForTenant(ctx context.Context, tenantID string, req *http.Request) (*http.Request, error) {
    // Add tenant-specific headers
    req.Header.Set("X-Tenant-ID", tenantID)
    
    // Transform request based on tenant configuration.
    // The lookup goes through the injected tenant service, and its error is not ignored.
    tenantCfg, err := t.tenantService.GetByID(ctx, tenantID)
    if err != nil {
        return nil, fmt.Errorf("load tenant %s: %w", tenantID, err)
    }
    if tenantCfg.Config.APIVersion == "v2" {
        return t.transformToV2(req)
    }
    
    return req, nil
}

Comparison: Custom vs External Gateway

Aspect	Custom Gateway	Kong/Istio Gateway
Development Effort	High (6-8 weeks)	Low (1-2 weeks)
Tenant Integration	Deep domain integration	Configuration-based
Customization	Unlimited	Plugin ecosystem
Maintenance	In-house team	Community/vendor
Security	Custom implementation	Battle-tested
Performance	Optimized for use case	General-purpose
Scalability	Kubernetes-native	Enterprise-grade
DDD Alignment	Perfect fit	External concern

Recommendation

For our DDD and Hexagonal Architecture: Implement a custom API gateway as a primary adapter.

Rationale:

Architectural Alignment: Fits perfectly as a driving adapter in hexagonal architecture
Domain Integration: Can inject tenant context and enforce domain rules at the edge
Unified Stack: Same technology as domain services reduces complexity
Tenant-Awareness: Built-in multi-tenant routing and isolation
Business Logic: Can implement tenant-specific transformations and validations

Implementation Approach:

Provide tenant-aware routing and domain-aware validation at the edge.
Integrate authentication and authorization consistently across services.

Fallback Strategy: If development effort becomes prohibitive, fall back to Kong with custom plugins for tenant-specific logic.

Authentication Service

Attribute	Specification
Technology	Golang + JWT/OAuth2
Purpose	Enterprise identity management
Protocols	LDAP, SAML, OIDC
Features	SSO, MFA, role-based access
Storage	Redis for session management
Integration	Active Directory, Azure AD, Okta

Tenant Management Service

Attribute	Specification
Technology	Golang microservice
Purpose	Tenant lifecycle management
Features	Onboarding, isolation, quota management
Database	PostgreSQL with tenant-scoped schemas
Security	Namespace isolation, RBAC
APIs	RESTful APIs with OpenAPI spec

Build Management Service

Attribute	Specification
Technology	Golang + Tekton + Packer + VMware vSphere integration
Purpose	Orchestrate multi-type image build workflows (VM, cloud provider, container) with VMware ESXi testing
Features	Pipeline management, Git integration, multi-build orchestration, VM certification
Supported Image Types	VM images (VMware), AMIs, Google Cloud Images, Azure VM images, container images
Build Engines	Tekton (containers), Packer (VM/cloud), Buildah (containers), VMware vSphere API (VM testing)
Testing Infrastructure	VMware ESXi cluster for VM image builds, testing, and certification
Certification Process	CIS benchmarks, security testing, functional validation on ESXi hosts
Triggers	Webhook, scheduled, manual, API-driven
Storage	Git repositories for manifests, multi-format build specifications, VM templates
Monitoring	Build metrics, failure analysis, cross-platform compatibility, ESXi resource utilization

Image Management Service

Attribute	Specification
Technology	Golang + Harbor + Cloud Registry APIs
Purpose	Multi-type image lifecycle management across heterogeneous registries
Features	Catalog, search, quarantine, promotion, cross-registry operations
Supported Registries	Harbor (containers), ECR (AWS), GCR (GCP), ACR (Azure), Docker Hub
Image Types	VM images, AMIs, cloud images, container images
Security	Vulnerability scanning, signing, cross-platform security policies
Storage	Multi-registry with geo-replication and unified catalog
APIs	Docker Registry API v2, cloud provider APIs, unified management API

Resource Management Service

Attribute	Specification
Technology	Golang + Kubernetes API
Purpose	Dynamic resource allocation
Features	Quota management, auto-scaling
Metrics	CPU, memory, storage utilization
Policies	Fair sharing, queued builds (status-based)
Integration	Kubernetes resource quotas
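
The status-based, fair-sharing policy in the table above can be sketched without a separate queue: the dispatcher inspects stored build records and picks the next one by status. The status values `queued` and `running`, the `Build` fields, and the per-tenant limit are assumptions made for illustration.

```go
package main

// Build is a simplified build record as it might be read from the database.
type Build struct {
	ID       string
	TenantID string
	Status   string // assumed values: "queued", "running"
}

// NextBuild selects the next build to dispatch. It skips tenants that have
// reached maxRunningPerTenant and otherwise prefers the tenant with the
// fewest running builds. The input is assumed to be ordered by creation
// time, so ties go to the oldest queued build.
func NextBuild(builds []Build, maxRunningPerTenant int) (Build, bool) {
	running := make(map[string]int)
	for _, b := range builds {
		if b.Status == "running" {
			running[b.TenantID]++
		}
	}

	var best Build
	found := false
	for _, b := range builds {
		if b.Status != "queued" || running[b.TenantID] >= maxRunningPerTenant {
			continue
		}
		if !found || running[b.TenantID] < running[best.TenantID] {
			best, found = b, true
		}
	}
	return best, found
}
```

Because selection depends only on persisted status, a restarted dispatcher can resume from the database without recovering in-memory queue state.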

Approval Management Service

Attribute	Specification
Technology	Golang + workflow engine
Purpose	Governance and approval workflows
Features	Multi-level approvals, audit trails
Storage	PostgreSQL for workflow state
Integration	Email, Slack notifications
Compliance	SOX, GDPR compliance logging

Notification Service

Attribute	Specification
Technology	Golang + NATS + Go Channels
Purpose	Event-driven notifications
Channels	Email, Slack, Webhooks
Templates	Configurable message templates
Reliability	Guaranteed delivery, retry logic
Monitoring	Delivery metrics, failure rates
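
The retry logic mentioned above is commonly implemented with exponential backoff. The sketch below is one possible shape, not the service's actual policy; `Retry`, `ErrTransient`, `ErrRetriesExhausted`, and the doubling schedule are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTransient is an example error a delivery attempt might return.
var ErrTransient = errors.New("transient delivery failure")

// ErrRetriesExhausted is wrapped into the error returned after the final attempt.
var ErrRetriesExhausted = errors.New("retries exhausted")

// Retry calls op up to attempts times. After each failure except the last
// it sleeps, doubling the delay each time and capping it at maxDelay.
func Retry(attempts int, base, maxDelay time.Duration, op func() error) error {
	delay := base
	var last error
	for i := 0; i < attempts; i++ {
		if last = op(); last == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
			if delay > maxDelay {
				delay = maxDelay
			}
		}
	}
	return fmt.Errorf("%w after %d attempts: %v", ErrRetriesExhausted, attempts, last)
}
```

A caller might use `Retry(5, 200*time.Millisecond, 5*time.Second, send)` and record the outcome for the delivery metrics listed in the table.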

Database Migrations

Migration Strategy for Golang Backend Services

Technology: Flyway (instead of GORM auto-migrations)

Rationale:

Version Control: Database schema changes are versioned and tracked in Git
Environment Consistency: Ensures all environments (dev, staging, prod) have identical schemas
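Rollback Support: Schema changes can be reverted in a controlled way, using undo migrations where the Flyway edition in use supports them, or a new forward migration otherwise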
Team Collaboration: Schema changes go through code review process
Audit Trail: Complete history of database schema evolution

Implementation:

-- V1.0.0__create_tenant_schema.sql
CREATE SCHEMA IF NOT EXISTS tenant_template;

-- V1.0.1__create_tenants_table.sql
CREATE TABLE tenant_template.tenants (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(255) NOT NULL,
    domain VARCHAR(255) UNIQUE,
    status VARCHAR(50) NOT NULL DEFAULT 'active',
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- V1.0.2__create_users_table.sql
CREATE TABLE tenant_template.users (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id UUID REFERENCES tenant_template.tenants(id),
    username VARCHAR(255) NOT NULL,
    email VARCHAR(255) UNIQUE,
    role VARCHAR(50) NOT NULL DEFAULT 'user',
    created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE(tenant_id, username)
);

Migration Directory Structure:

db/migrations/
├── V1.0.0__initial_schema.sql
├── V1.0.1__add_tenant_support.sql
├── V1.0.2__add_audit_fields.sql
└── V1.1.0__add_image_metadata.sql

Golang Integration:

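// Database connection and migration.
// Flyway is a Java tool without an official Go client, so this sketch assumes
// the Flyway CLI is installed on PATH and shells out to it.
// Requires: context, database/sql, fmt, os/exec, and a PostgreSQL driver.
func initDatabase(ctx context.Context, connectionString, jdbcURL string) (*sql.DB, error) {
    // Run Flyway migrations (Flyway expects a JDBC URL such as jdbc:postgresql://host/db)
    cmd := exec.CommandContext(ctx, "flyway",
        "-url="+jdbcURL,
        "-locations=filesystem:db/migrations",
        "-schemas=tenant_template",
        "migrate",
    )
    if out, err := cmd.CombinedOutput(); err != nil {
        return nil, fmt.Errorf("flyway migrate failed: %w: %s", err, out)
    }

    // Connect to PostgreSQL; sql.Open only validates its arguments, so ping to verify
    db, err := sql.Open("postgres", connectionString)
    if err != nil {
        return nil, err
    }
    if err := db.PingContext(ctx); err != nil {
        db.Close()
        return nil, err
    }
    return db, nil
}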

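Hybrid Messaging Implementation:

// Hybrid Messaging Implementation
// In-process event bus using Go channels and goroutines
type EventBus struct {
    subscribers map[string][]chan Event
    mu          sync.RWMutex
}

// Subscribe registers a handler; a dedicated goroutine drains its channel.
func (b *EventBus) Subscribe(eventType string, handler EventHandler) error {
    ch := make(chan Event, 64) // buffer size is an illustrative choice
    b.mu.Lock()
    if b.subscribers == nil {
        b.subscribers = make(map[string][]chan Event)
    }
    b.subscribers[eventType] = append(b.subscribers[eventType], ch)
    b.mu.Unlock()

    go func() {
        for ev := range ch {
            _ = handler(ev) // error policy (log, retry) is left to the service
        }
    }()
    return nil
}

// Publish fans the event out to every channel subscribed to its type.
// A send blocks when a subscriber's buffer is full.
func (b *EventBus) Publish(event Event) error {
    b.mu.RLock()
    defer b.mu.RUnlock()
    for _, ch := range b.subscribers[event.Type] {
        ch <- event
    }
    return nil
}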

// Distributed messaging using NATS
type NATSClient struct {
    conn *nats.Conn
}

// Domain event structure
type Event struct {
    ID        string
    Type      string
    TenantID  string
    Payload   interface{}
    Timestamp time.Time
}

// Event publisher interface (hexagonal port)
type EventPublisher interface {
    Publish(ctx context.Context, event Event) error
}

// In-process channel-based publisher
type ChannelPublisher struct {
    bus *EventBus
}

func (p *ChannelPublisher) Publish(ctx context.Context, event Event) error {
    return p.bus.Publish(event)
}

// NATS-based distributed publisher
type NATSPublisher struct {
    client *NATSClient
}

func (p *NATSPublisher) Publish(ctx context.Context, event Event) error {
    subject := fmt.Sprintf("tenant.%s.events.%s", event.TenantID, event.Type)
    data, err := json.Marshal(event)
    if err != nil {
        return err
    }
    return p.client.conn.Publish(subject, data)
}

// Event subscriber interface
type EventSubscriber interface {
    Subscribe(eventType string, handler EventHandler) error
    Unsubscribe(eventType string, handler EventHandler) error
}

type EventHandler func(event Event) error

// Channel-based subscriber
type ChannelSubscriber struct {
    bus *EventBus
}

func (s *ChannelSubscriber) Subscribe(eventType string, handler EventHandler) error {
    return s.bus.Subscribe(eventType, handler)
}

// NATS-based subscriber with queue groups for load balancing
type NATSSubscriber struct {
    client     *NATSClient
    queueGroup string
    subscriptions []*nats.Subscription
}

func (s *NATSSubscriber) Subscribe(eventType string, handler EventHandler) error {
    subject := fmt.Sprintf("tenant.*.events.%s", eventType)
    sub, err := s.client.conn.QueueSubscribe(subject, s.queueGroup, func(msg *nats.Msg) {
        var event Event
        if err := json.Unmarshal(msg.Data, &event); err != nil {
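            // Malformed payload: drop it (real code should log the error)
            return
        }
        if err := handler(event); err != nil {
            // Request redelivery. Nak and Ack are only acted on for JetStream
            // consumers; core NATS delivery is at-most-once with no retry.
            msg.Nak()
            return
        }
        msg.Ack() // Acknowledge successful processing (JetStream only)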
    })
    if err != nil {
        return err
    }
    s.subscriptions = append(s.subscriptions, sub)
    return nil
}

// Hybrid event bus that combines both approaches
type HybridEventBus struct {
    channelBus *EventBus
    natsBus    *NATSPublisher // NATSClient itself exposes no Publish method
    localOnly  bool           // For testing or single-instance deployments
}

func (h *HybridEventBus) Publish(ctx context.Context, event Event) error {
    // Always publish to local channels for in-process handling
    if err := h.channelBus.Publish(event); err != nil {
        return err
    }

    // Publish to NATS for distributed communication (unless local-only mode)
    if !h.localOnly {
        return h.natsBus.Publish(ctx, event)
    }
    return nil
}

// Usage in domain services
type BuildService struct {
    publisher  EventPublisher
    repository BuildRepository
}

func (s *BuildService) StartBuild(ctx context.Context, request StartBuildRequest) error {
    build := Build{
        ID:       uuid.New(),
        TenantID: request.TenantID,
        Status:   "starting",
    }

    if err := s.repository.Save(ctx, build); err != nil {
        return err
    }

    event := Event{
        ID:        uuid.New().String(),
        Type:      "build.started",
        TenantID:  request.TenantID,
        Payload:   build,
        Timestamp: time.Now(),
    }

    return s.publisher.Publish(ctx, event)
}

Data Architecture

Database Schema

-- Multi-tenant database schema
CREATE SCHEMA tenant_template;

-- Tenants table
CREATE TABLE tenant_template.tenants (
    id UUID PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    domain VARCHAR(255) UNIQUE,
    status VARCHAR(50) NOT NULL,
    created_at TIMESTAMP NOT NULL,
    updated_at TIMESTAMP NOT NULL
);

-- Users table (tenant-scoped)
CREATE TABLE tenant_template.users (
    id UUID PRIMARY KEY,
    tenant_id UUID REFERENCES tenant_template.tenants(id),
    username VARCHAR(255) NOT NULL,
    email VARCHAR(255) UNIQUE,
    role VARCHAR(50) NOT NULL,
    created_at TIMESTAMP NOT NULL,
    UNIQUE(tenant_id, username)
);

-- Projects table (tenant-scoped)
CREATE TABLE tenant_template.projects (
    id UUID PRIMARY KEY,
    tenant_id UUID REFERENCES tenant_template.tenants(id),
    name VARCHAR(255) NOT NULL,
    repository_url VARCHAR(500),
    image_type VARCHAR(50) NOT NULL, -- 'vm', 'ami', 'container', 'gcp-image', 'azure-image'
    build_config JSONB,
    created_at TIMESTAMP NOT NULL,
    UNIQUE(tenant_id, name)
);

-- Builds table (tenant-scoped)
CREATE TABLE tenant_template.builds (
    id UUID PRIMARY KEY,
    project_id UUID REFERENCES tenant_template.projects(id),
    image_type VARCHAR(50) NOT NULL, -- 'vm', 'ami', 'container', 'gcp-image', 'azure-image'
    status VARCHAR(50) NOT NULL,
    triggered_by UUID REFERENCES tenant_template.users(id),
    build_engine VARCHAR(50) NOT NULL, -- 'packer', 'tekton', 'buildah', 'kaniko'
    start_time TIMESTAMP,
    end_time TIMESTAMP,
    logs TEXT,
    artifacts JSONB, -- Build outputs (AMI IDs, image digests, etc.)
    created_at TIMESTAMP NOT NULL
);

-- Images table (tenant-scoped)
CREATE TABLE tenant_template.images (
    id UUID PRIMARY KEY,
    project_id UUID REFERENCES tenant_template.projects(id),
    image_type VARCHAR(50) NOT NULL, -- 'vm', 'ami', 'container', 'gcp-image', 'azure-image'
    name VARCHAR(255) NOT NULL,
    tag VARCHAR(255),
    digest VARCHAR(255),
    cloud_provider_id VARCHAR(255), -- AMI ID, GCP image name, etc.
    registry_url VARCHAR(500), -- Harbor URL, ECR URL, etc.
    status VARCHAR(50) NOT NULL,
    scan_results JSONB,
    metadata JSONB, -- Image-specific metadata (size, OS, architecture, etc.)
    created_at TIMESTAMP NOT NULL,
    UNIQUE(project_id, name, tag)
);

-- Approvals table (tenant-scoped)
CREATE TABLE tenant_template.approvals (
    id UUID PRIMARY KEY,
    resource_type VARCHAR(50) NOT NULL,
    resource_id UUID NOT NULL,
    requested_by UUID REFERENCES tenant_template.users(id),
    approved_by UUID REFERENCES tenant_template.users(id),
    status VARCHAR(50) NOT NULL,
    created_at TIMESTAMP NOT NULL,
    approved_at TIMESTAMP
);
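
The schema name tenant_template suggests that each tenant receives its own schema cloned from the template. As a sketch under that assumption, the helper below derives a per-tenant schema name from the tenant UUID. The tenant_<uuid> naming convention and the TenantSchema function are hypothetical. Only the validation idea matters: a schema name cannot be passed as a bind parameter, so the input is restricted to a strict UUID before it is interpolated into SQL.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// uuidPattern accepts the canonical 8-4-4-4-12 hexadecimal UUID form.
var uuidPattern = regexp.MustCompile(
	`^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$`)

// TenantSchema maps a tenant UUID to a schema name such as
// tenant_123e4567_e89b_12d3_a456_426614174000 (hypothetical convention).
// Rejecting anything that is not a UUID keeps the result safe to
// interpolate into statements like SET search_path.
func TenantSchema(tenantID string) (string, error) {
	if !uuidPattern.MatchString(tenantID) {
		return "", fmt.Errorf("invalid tenant id %q", tenantID)
	}
	// Hyphens are not valid in unquoted identifiers, so map them to underscores.
	return "tenant_" + strings.ReplaceAll(strings.ToLower(tenantID), "-", "_"), nil
}
```

The resulting name is 43 characters long, which stays under PostgreSQL's default 63-byte identifier limit.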

Data Flow Architecture

graph LR
    subgraph "Data Sources"
        GIT[Git Repositories<br/>Multi-format manifests]
        VM_REG[VM Image Stores<br/>Cloud provider APIs]
        CONTAINER_REG[Container Registries<br/>Harbor, ECR, GCR]
        EXT[External Systems<br/>LDAP, CI/CD]
    end

    subgraph "Ingestion Layer"
        WEBHOOKS[Git Webhooks<br/>Multi-type triggers]
        API[API Ingestion<br/>REST/GraphQL]
        AGENTS[Collection Agents<br/>Cloud provider APIs]
    end

    subgraph "Processing Layer"
        VALIDATE[Data Validation<br/>Schema validation]
        TRANSFORM[Data Transformation<br/>Format conversion]
        ENRICH[Data Enrichment<br/>Metadata addition]
        TYPE_ROUTING[Type-based Routing<br/>VM vs Container paths]
    end

    subgraph "Build Orchestration"
        PACKER_BUILDS[Packer Builds<br/>VM/Cloud images]
        TEKTON_BUILDS[Tekton Builds<br/>Container images]
        SCANNING[Security Scanning<br/>Multi-tool scanning]
    end

    subgraph "Storage Layer"
        DB[(PostgreSQL<br/>Multi-tenant DB)]
        CACHE[(Redis<br/>Cache)]
        BLOB[(Object Storage<br/>VM images, artifacts)]
        REGISTRIES[(Multi-Registry<br/>Harbor + Cloud)]
    end

    subgraph "Access Layer"
        QUERY[Query Engine<br/>Multi-type queries]
        SEARCH[Search Index<br/>Elasticsearch]
        ANALYTICS[Analytics Engine<br/>Build metrics]
        CATALOG[Unified Catalog<br/>Cross-type search]
    end

    GIT --> WEBHOOKS
    VM_REG --> AGENTS
    CONTAINER_REG --> API

    WEBHOOKS --> VALIDATE
    API --> VALIDATE
    AGENTS --> VALIDATE

    VALIDATE --> TRANSFORM
    TRANSFORM --> ENRICH
    ENRICH --> TYPE_ROUTING

    TYPE_ROUTING --> PACKER_BUILDS
    TYPE_ROUTING --> TEKTON_BUILDS
    PACKER_BUILDS --> SCANNING
    TEKTON_BUILDS --> SCANNING

    SCANNING --> DB
    SCANNING --> CACHE
    SCANNING --> BLOB
    SCANNING --> REGISTRIES

    DB --> QUERY
    CACHE --> QUERY
    BLOB --> QUERY
    REGISTRIES --> QUERY

    QUERY --> SEARCH
    QUERY --> ANALYTICS
    SEARCH --> CATALOG
    ANALYTICS --> CATALOG
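
The type-based routing step in the diagram can be sketched as a single dispatch function. The image type values come from the image_type column comments in the schema, and the engine names from the build_engine comment. The function name RouteBuild and the choice to route every non-container type to Packer are assumptions for illustration.

```go
package main

import "fmt"

// Engine names follow the build_engine column comment in the schema.
const (
	EnginePacker = "packer"
	EngineTekton = "tekton"
)

// RouteBuild picks the orchestration path for an image type:
// VM and cloud provider images go to Packer, containers to Tekton.
func RouteBuild(imageType string) (string, error) {
	switch imageType {
	case "vm", "ami", "gcp-image", "azure-image":
		return EnginePacker, nil
	case "container":
		return EngineTekton, nil
	default:
		// Unknown types fail loudly rather than falling through to a default engine.
		return "", fmt.Errorf("unsupported image type %q", imageType)
	}
}
```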

Security Architecture

Security Layers

graph TB
    subgraph "Perimeter Security"
        WAF[Web Application Firewall]
        DDoS[DDoS Protection]
        LB[Load Balancer<br/>SSL Termination]
    end

    subgraph "Network Security"
        FW[Network Firewalls]
        NSG[Network Security Groups]
        POL[Network Policies<br/>Kubernetes]
    end

    subgraph "Application Security"
        AUTH[Authentication<br/>LDAP/SAML/OIDC]
        AUTHZ[Authorization<br/>RBAC/ABAC]
        AUDIT[Audit Logging]
    end

    subgraph "Data Security"
        ENC[Encryption<br/>At Rest/Transit]
        MASK[Data Masking]
        DLP[Data Loss Prevention]
    end

    subgraph "Container Security"
        SCAN[Image Scanning<br/>Trivy/Syft]
        SIGN[Image Signing<br/>Cosign]
        RUNTIME[Runtime Security<br/>Falco]
    end

    subgraph "Infrastructure Security"
        IAM[Identity & Access<br/>Management]
        VAULT[Secrets Management<br/>HashiCorp Vault]
        MONITOR[Security Monitoring<br/>SIEM]
    end

    WAF --> FW
    FW --> NSG
    NSG --> POL

    POL --> AUTH
    AUTH --> AUTHZ
    AUTHZ --> AUDIT

    AUDIT --> ENC
    ENC --> MASK
    MASK --> DLP

    DLP --> SCAN
    SCAN --> SIGN
    SIGN --> RUNTIME

    RUNTIME --> IAM
    IAM --> VAULT
    VAULT --> MONITOR

Security Controls

Control Category	Implementation
Access Control	RBAC, ABAC, Zero Trust
Authentication	Multi-factor, SSO, Certificate-based
Authorization	Role-based, Attribute-based, Policy-based
Data Protection	Encryption, Masking, Tokenization
Network Security	Firewalls, VPN, Micro-segmentation
Application Security	Input validation, XSS prevention, CSRF protection
Infrastructure Security	Hardened OS, Container security, Network policies
Monitoring	SIEM, Log aggregation, Real-time alerts

Operational Architecture

Monitoring Stack

graph TB
    subgraph "Metrics Collection"
        PROMETHEUS[Prometheus<br/>Metrics Collection]
        NODE_EXPORTER[Node Exporter]
        KUBE_STATE[Kube State Metrics]
        APP_METRICS[Application Metrics]
    end

    subgraph "Log Aggregation"
        FLUENT_BIT[Fluent Bit<br/>Log Collection]
        NATS[NATS<br/>Message Bus]
        ELASTICSEARCH[Elasticsearch<br/>Log Storage]
    end

    subgraph "Visualization"
        GRAFANA[Grafana<br/>Dashboards]
        KIBANA[Kibana<br/>Log Analysis]
    end

    subgraph "Alerting"
        ALERT_MANAGER[Alert Manager]
        NOTIFICATION[Notification Channels<br/>Email/Slack/PagerDuty]
    end

    NODE_EXPORTER --> PROMETHEUS
    KUBE_STATE --> PROMETHEUS
    APP_METRICS --> PROMETHEUS

    FLUENT_BIT --> NATS
    NATS --> ELASTICSEARCH

    PROMETHEUS --> GRAFANA
    ELASTICSEARCH --> KIBANA

    PROMETHEUS --> ALERT_MANAGER
    ALERT_MANAGER --> NOTIFICATION

Operational Procedures

Incident Response

Detection: Automated monitoring alerts
Assessment: Incident classification and impact analysis
Containment: Isolate affected components
Recovery: Restore services from backups
Lessons Learned: Post-mortem analysis and improvements

Change Management

Request: Change request submission
Approval: Multi-level approval workflow
Testing: Automated and manual testing
Deployment: Blue-green or canary deployment
Validation: Post-deployment monitoring
Documentation: Update operational runbooks

Backup and Recovery

Database: Daily backups with point-in-time recovery
Configuration: Git-based configuration versioning
Images: Registry replication and backup
Logs: Long-term retention in object storage
Testing: Regular backup restoration tests

Deployment Architecture

Kubernetes Deployment Strategy

apiVersion: apps/v1
kind: Deployment
metadata:
  name: image-factory-api
  namespace: system
spec:
  replicas: 3
  selector:
    matchLabels:
      app: image-factory-api
  template:
    metadata:
      labels:
        app: image-factory-api
    spec:
      containers:
      - name: api
        image: harbor.example.com/image-factory/api:v1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: db-secret
              key: url
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

VMware ESXi Deployment Strategy

vCenter Server Deployment

apiVersion: vsphere.provider.vmware.com/v1
kind: VirtualMachine
metadata:
  name: vcenter-server
  namespace: vmware-system
spec:
  template: vcenter-server-template
  powerState: poweredOn
  networks:
  - networkName: "VM Network"
    ipAddress: "192.168.1.100"
  storage:
    disk:
    - size: 500Gi
      storageClass: vsphere-storage
  resources:
    cpu: 8
    memory: 32Gi

ESXi Cluster Configuration

apiVersion: vsphere.provider.vmware.com/v1
kind: Cluster
metadata:
  name: image-factory-esxi-cluster
spec:
  datacenter: ImageFactory-DC
  cluster: Build-Test-Cluster
  hosts:
  - name: esxi-build-01
    ipAddress: "192.168.1.101"
    maintenanceMode: false
  - name: esxi-test-02
    ipAddress: "192.168.1.102"
    maintenanceMode: false
  - name: esxi-cert-03
    ipAddress: "192.168.1.103"
    maintenanceMode: false
  resourcePool: Build-Resources
  vmFolder: /ImageFactory-DC/vm/Build-VMs
  network:
    portGroup: Build-Network
    vlanId: 100
  storage:
    datastore: Build-Storage
    storagePolicy: Build-Storage-Policy

Packer Builder VM Template

source "vsphere-iso" "esxi-vm" {
  vcenter_server      = var.vcenter_server
  username            = var.vcenter_username
  password            = var.vcenter_password
  datacenter          = var.datacenter
  cluster             = var.cluster
  host                = var.host
  datastore           = var.datastore
  folder              = var.folder
  insecure_connection = true # Skips vCenter TLS certificate validation; set to false outside lab environments

  vm_name              = var.vm_name
  guest_os_type        = var.guest_os_type
  CPUs                 = var.cpu_count
  RAM                  = var.ram_size
  disk_controller_type = ["pvscsi"]
  storage {
    disk_size             = var.disk_size
    disk_thin_provisioned = true
  }
  network_adapters {
    network      = var.network
    network_card = "vmxnet3"
  }
  iso_paths        = var.iso_paths
  boot_command     = var.boot_command
  shutdown_command = var.shutdown_command
}

build {
  sources = ["source.vsphere-iso.esxi-vm"]

  provisioner "ansible" {
    playbook_file   = "./ansible/playbook.yml"
    user            = var.ssh_username
    extra_arguments = ["--extra-vars", "ansible_ssh_pass=${var.ssh_password}"]
  }

  post-processor "vsphere-template" {
    only = ["vsphere-iso.esxi-vm"]
  }
}

Helm Chart Structure

image-factory/
├── Chart.yaml
├── values.yaml
├── templates/
│   ├── api-gateway/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── configmap.yaml
│   ├── tenant-management/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── secret.yaml
│   ├── build-management/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── packer-builders/
│   │   │   ├── deployment.yaml
│   │   │   └── configmap.yaml
│   │   ├── tekton-pipelines/
│   │   │   ├── deployment.yaml
│   │   │   └── pvc.yaml
│   │   └── buildah-runners/
│   │       ├── deployment.yaml
│   │       └── configmap.yaml
│   ├── image-management/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   ├── registry-sync/
│   │   │   ├── job.yaml
│   │   │   └── cronjob.yaml
│   │   └── job.yaml
│   ├── ui/
│   │   ├── deployment.yaml
│   │   ├── service.yaml
│   │   └── ingress.yaml
│   ├── cloud-providers/
│   │   ├── aws-integration/
│   │   │   ├── deployment.yaml
│   │   │   └── secret.yaml
│   │   ├── azure-integration/
│   │   │   ├── deployment.yaml
│   │   │   └── secret.yaml
│   │   ├── gcp-integration/
│   │   │   ├── deployment.yaml
│   │   │   └── secret.yaml
│   │   └── vmware-integration/
│   │       ├── vcenter-deployment.yaml
│   │       ├── esxi-cluster-config.yaml
│   │       └── vmware-secrets.yaml
│   └── monitoring/
│       ├── prometheus.yaml
│       ├── grafana.yaml
│       └── alertmanager.yaml
└── charts/
    ├── postgresql/
    ├── redis/
    ├── nats/
    ├── harbor/
    ├── tekton/
    ├── packer-builders/
    ├── vmware-vsphere/
    └── vmware-esxi-cluster/

CI/CD Pipeline

# .github/workflows/deploy.yaml
name: Deploy to Kubernetes

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Run tests
      run: make test

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Build and push images
      run: |
        podman build -t harbor.example.com/image-factory/api:${{ github.sha }} ./api
        podman push harbor.example.com/image-factory/api:${{ github.sha }}

  deploy:
    needs: build
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v3
    - name: Deploy to staging
      run: |
        helm upgrade --install image-factory ./helm/image-factory \
          --namespace staging \
          --set image.tag=${{ github.sha }} \
          --wait

    - name: Run integration tests
      run: make integration-test

    - name: Deploy to production
      if: github.ref == 'refs/heads/main'
      run: |
        helm upgrade --install image-factory ./helm/image-factory \
          --namespace production \
          --set image.tag=${{ github.sha }} \
          --wait

Integration Architecture

External System Integration

graph TB
    subgraph "Version Control"
        GITHUB[GitHub]
        GITLAB[GitLab]
        BITBUCKET[Bitbucket]
    end

    subgraph "Identity Management"
        LDAP[LDAP/AD]
        SAML[SAML Providers]
        OIDC[OIDC Providers<br/>Azure AD, Okta]
    end

    subgraph "CI/CD Systems"
        JENKINS[Jenkins]
        GITHUB_ACTIONS[GitHub Actions]
        GITLAB_CI[GitLab CI]
        TEAMCITY[TeamCity]
    end

    subgraph "Container Platforms"
        DOCKER_HUB[Docker Hub]
        ECR[Amazon ECR]
        GCR[Google GCR]
        ACR[Azure ACR]
    end

    subgraph "Monitoring & Alerting"
        SPLUNK[Splunk]
        DATADOG[Datadog]
        NEW_RELIC[New Relic]
        PROMETHEUS[Prometheus]
    end

    subgraph "Image Factory"
        WEBHOOKS[Webhook Handler]
        API_INTEGRATION[API Integration]
        EVENT_BRIDGE[Event Bridge]
    end

    GITHUB --> WEBHOOKS
    GITLAB --> WEBHOOKS
    BITBUCKET --> WEBHOOKS

    LDAP --> API_INTEGRATION
    SAML --> API_INTEGRATION
    OIDC --> API_INTEGRATION

    JENKINS --> EVENT_BRIDGE
    GITHUB_ACTIONS --> EVENT_BRIDGE
    GITLAB_CI --> EVENT_BRIDGE
    TEAMCITY --> EVENT_BRIDGE

    DOCKER_HUB --> API_INTEGRATION
    ECR --> API_INTEGRATION
    GCR --> API_INTEGRATION
    ACR --> API_INTEGRATION

    SPLUNK --> EVENT_BRIDGE
    DATADOG --> EVENT_BRIDGE
    NEW_RELIC --> EVENT_BRIDGE
    PROMETHEUS --> EVENT_BRIDGE

API Integration Patterns

Pattern	Use Case	Implementation
Webhooks	Git events, CI/CD triggers	REST endpoints with signature validation
Polling	Status checks, data synchronization	Scheduled jobs with exponential backoff
Streaming	Real-time events, logs	WebSocket, Server-Sent Events
Batch Processing	Bulk data operations	Message queues, scheduled tasks
API Callbacks	Async responses, notifications	Configurable callback URLs

Storage Architecture

Unified S3 Object Storage

The platform uses S3-compatible object storage as the unified backend for all image types:

Storage Design Principles

Content-Addressable: All blobs stored by SHA256 digest for automatic deduplication
Multi-Tenant Isolation: Per-tenant storage prefixes with IAM-based access control
Intelligent Tiering: Automatic lifecycle policies based on image type and access patterns
Global Distribution: Cross-region replication for disaster recovery and performance

Storage Structure

S3 Bucket: image-factory-unified-storage/
├── blobs/sha256/
│   ├── ab/ab1234... (container layers, VM disks, CSP images)
│   ├── cd/cd5678... (deduplication across all image types)
│   └── ef/ef9012... (compressed, encrypted blobs)
├── manifests/
│   ├── container/ (OCI/Docker manifests)
│   ├── vm/ (VM metadata descriptors)
│   └── csp/ (Cloud provider image catalogs)
└── uploads/ (multipart upload staging)
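
The blobs/sha256/ layout above can be derived directly from content. This sketch computes the object key from a blob's SHA-256 digest, with the first two hex characters used as the fan-out prefix, as in ab/ab1234.... BlobKey is a hypothetical name. Any per-tenant prefix mentioned under the design principles is left out because the tree does not show where it would sit.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// BlobKey returns the object key for content under the layout above:
// blobs/sha256/<first two hex chars>/<full hex digest>.
func BlobKey(content []byte) string {
	sum := sha256.Sum256(content)
	digest := hex.EncodeToString(sum[:])
	return "blobs/sha256/" + digest[:2] + "/" + digest
}
```

Because the key depends only on the bytes, two images that share a layer resolve to the same object, which is what makes deduplication automatic.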

Storage Optimization by Image Type

Image Type	Storage Class	Lifecycle Policy	Access Pattern
Container Images	STANDARD	IA after 30d, Glacier after 90d	Frequent pulls
VM Images	STANDARD_IA	Glacier after 30d	Moderate access
CSP Images	GLACIER	Deep Archive after 90d	Infrequent access

Database Schema for Storage

-- Storage backend configuration
storage_backends (s3_bucket, s3_region, lifecycle_policies, encryption_config)

-- Content-addressable blob storage
image_blobs (digest, size_bytes, storage_key, reference_count, compression_ratio)

-- OCI manifest storage
image_manifests (image_id, manifest_content, platform, layer_count)

-- Layer-to-blob mapping for deduplication
image_layer_blobs (manifest_id, blob_id, layer_order)

-- Storage metrics and cost tracking
storage_metrics (total_objects, total_size_bytes, cost_breakdown)
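
The reference_count column in image_blobs implies reference counting for shared blobs. An in-memory sketch of that bookkeeping follows, assuming one count per digest that rises when a manifest references the blob and falls when the manifest is deleted. BlobIndex and its methods are hypothetical, and the type is not safe for concurrent use; a real implementation would update the table transactionally.

```go
package main

// BlobIndex tracks how many manifests reference each digest, mirroring
// the reference_count and size_bytes columns of image_blobs.
type BlobIndex struct {
	refs  map[string]int
	sizes map[string]int64
}

// NewBlobIndex returns an empty index.
func NewBlobIndex() *BlobIndex {
	return &BlobIndex{refs: map[string]int{}, sizes: map[string]int64{}}
}

// AddRef records one more reference to digest. It returns true when this
// is the first reference, meaning the blob still has to be uploaded.
func (i *BlobIndex) AddRef(digest string, size int64) bool {
	i.refs[digest]++
	i.sizes[digest] = size
	return i.refs[digest] == 1
}

// Release drops one reference. It returns true when the blob is no longer
// referenced and can be garbage-collected from object storage.
func (i *BlobIndex) Release(digest string) bool {
	if i.refs[digest] == 0 {
		return false // unknown digest: nothing to release
	}
	i.refs[digest]--
	if i.refs[digest] > 0 {
		return false
	}
	delete(i.refs, digest)
	delete(i.sizes, digest)
	return true
}

// Savings returns the fraction of logical bytes avoided by deduplication:
// 1 - physical/logical, or 0 when the index is empty.
func (i *BlobIndex) Savings() float64 {
	var logical, physical int64
	for d, n := range i.refs {
		physical += i.sizes[d]
		logical += int64(n) * i.sizes[d]
	}
	if logical == 0 {
		return 0
	}
	return 1 - float64(physical)/float64(logical)
}
```

For example, three references to two distinct 100-byte blobs store 200 bytes for 300 logical bytes, a saving of one third.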

Cost Optimization Features

Deduplication Savings: 50-70% storage reduction through cross-image blob sharing
Compression: 60-80% size reduction with gzip/zstd compression
Intelligent Tiering: Automatic cost optimization based on access patterns
Cost Attribution: Per-tenant, per-project cost tracking and billing

OCI Distribution API

Native Container Registry Implementation

The platform implements the OCI Distribution Specification directly in the backend API:

API Endpoints (OCI Distribution Spec v1.0)

// Registry API Root
GET  /v2/                              // Registry capabilities
GET  /v2/_catalog                      // Repository listing (Docker Registry v2 extension; not defined by the OCI spec)

// Repository Management  
GET  /v2/{name}/tags/list             // Tag enumeration

// Manifest API
GET    /v2/{name}/manifests/{ref}     // Manifest retrieval
PUT    /v2/{name}/manifests/{ref}     // Manifest upload
HEAD   /v2/{name}/manifests/{ref}     // Manifest existence check
DELETE /v2/{name}/manifests/{ref}     // Manifest deletion

// Blob API
GET    /v2/{name}/blobs/{digest}      // Blob retrieval (S3 redirect)
HEAD   /v2/{name}/blobs/{digest}      // Blob existence check
POST   /v2/{name}/blobs/uploads/      // Upload initiation
PUT    /v2/{name}/blobs/uploads/{uuid} // Upload completion
DELETE /v2/{name}/blobs/{digest}      // Blob deletion

// Authentication
POST /v2/token                        // Bearer token issuance (token-auth convention; outside the OCI spec)

Authentication & Authorization Integration

type OCIAuthFlow struct {
    // JWT-based bearer token authentication
    TokenEndpoint string `json:"token_endpoint"`
    
    // RBAC integration for repository access
    ScopeFormat string `json:"scope_format"` // repository:{name}:{actions}
    
    // Multi-tenant isolation
    TenantIsolation bool `json:"tenant_isolation"`
}
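
The ScopeFormat comment gives the shape repository:{name}:{actions}. The sketch below parses such a scope and applies a tenant check, assuming repositories are namespaced by a tenant prefix as in the team-alpha/nginx examples in this section. Scope, ParseScope, and AllowedForTenant are hypothetical names, and the actual RBAC lookup is out of scope.

```go
package main

import (
	"fmt"
	"strings"
)

// Scope is a parsed "repository:{name}:{actions}" access request.
type Scope struct {
	Type    string
	Name    string
	Actions []string
}

// ParseScope splits on the first and last colon so that a repository
// name which itself contains a colon survives intact.
func ParseScope(s string) (Scope, error) {
	first := strings.Index(s, ":")
	last := strings.LastIndex(s, ":")
	// Require a non-empty type, name, and action list.
	if first <= 0 || last <= first+1 || last == len(s)-1 {
		return Scope{}, fmt.Errorf("malformed scope %q", s)
	}
	return Scope{
		Type:    s[:first],
		Name:    s[first+1 : last],
		Actions: strings.Split(s[last+1:], ","),
	}, nil
}

// AllowedForTenant enforces tenant isolation: a tenant may only address
// repositories under its own prefix, for example team-alpha/...
func AllowedForTenant(sc Scope, tenant string) bool {
	return sc.Type == "repository" && strings.HasPrefix(sc.Name, tenant+"/")
}
```

A token service would run this check before the per-action RBAC decision, so a request for another tenant's repository is refused regardless of the caller's roles.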

Performance Optimizations

S3 Signed URL Redirects: Direct blob downloads from S3 (no proxy overhead)
Manifest Caching: Database-stored manifests for fast retrieval
Layer Deduplication: Cross-repository blob mounting
CDN Integration: CloudFront/CDN for global blob distribution
Parallel Downloads: Concurrent layer fetching support

Enterprise Features

Audit Logging: Complete pull/push audit trail
Cost Tracking: Per-tenant transfer and storage cost attribution
Access Control: Repository-level RBAC with inheritance
Vulnerability Integration: Automatic scanning on manifest upload
Compliance: Image signing and attestation support

Docker Client Compatibility

# Standard registry client workflow (Podman shown; docker accepts the same subcommands)
podman login registry.example.com
podman pull registry.example.com/team-alpha/nginx:latest
podman push registry.example.com/team-alpha/myapp:v1.0

# Kubernetes integration
kubectl create deployment myapp \
  --image=registry.example.com/team-alpha/myapp:v1.0

Performance and Scalability

Performance Requirements

Metric	Target	Measurement
API Response Time	<500ms (95th percentile)	Application metrics
Container Build Start Time	<30 seconds	Pipeline metrics
VM/Cloud Image Build Start Time	<2-5 minutes	Pipeline metrics
Container Image Pull Time	<10 seconds	Registry metrics (via S3 redirect)
VM Image Download Time	<5-15 minutes	Cloud provider APIs
OCI Manifest Retrieval	<100ms	Database query performance
S3 Blob Access	<200ms	S3 signed URL generation
Concurrent Container Builds	100+ per tenant	Resource utilization
Concurrent VM/Cloud Builds	20-50 per tenant	Resource utilization
Tenant Isolation	Zero cross-tenant impact	Performance monitoring
Cross-Registry Sync Time	<60 seconds	Registry sync metrics

Scalability Architecture

graph TB
    subgraph "Horizontal Scaling"
        HPA[Horizontal Pod Autoscaler]
        CLUSTER_AUTOSCALER[Cluster Autoscaler]
        NODE_POOLS[Multiple Node Pools]
    end

    subgraph "Load Distribution"
        LB[Load Balancer]
        INGRESS[Ingress Controller]
        SERVICE_MESH[Istio Service Mesh]
    end

    subgraph "Data Scaling"
        DB_SHARDING[Database Sharding]
        CACHE_CLUSTER[Redis Cluster]
        OBJECT_STORAGE[Object Storage<br/>S3, GCS, Azure Blob]
    end

    subgraph "Caching Strategy"
        L1_CACHE[L1 Cache<br/>Application Cache]
        L2_CACHE[L2 Cache<br/>Redis]
        CDN[CDN<br/>CloudFront, Cloudflare]
    end

    HPA --> CLUSTER_AUTOSCALER
    CLUSTER_AUTOSCALER --> NODE_POOLS

    LB --> INGRESS
    INGRESS --> SERVICE_MESH

    DB_SHARDING --> CACHE_CLUSTER
    CACHE_CLUSTER --> OBJECT_STORAGE

    L1_CACHE --> L2_CACHE
    L2_CACHE --> CDN

Capacity Planning

Component	Current Capacity	Scaling Strategy
API Gateway	10,000 RPS	Horizontal scaling
Database	10,000 concurrent connections	Read replicas, sharding
Cache	1TB memory	Redis cluster
Storage	100TB	Object storage scaling
Container Build Workers	100 concurrent builds	Kubernetes HPA
VM/Cloud Build Workers	50 concurrent builds	Cloud provider auto-scaling
Container Registry (Harbor)	10,000 images/day	Multi-zone replication
Cloud Provider APIs	1,000 API calls/minute	Rate limiting, queuing
Git Integration	500 webhooks/minute	Event-driven processing

Monitoring and Observability

Observability Stack

graph TB
    subgraph "Metrics"
        PROMETHEUS[Prometheus<br/>Time Series DB]
        METRICS[Application Metrics<br/>Custom Counters/Gauges]
        INFRA[Infrastructure Metrics<br/>CPU, Memory, Disk]
    end

    subgraph "Logs"
        FLUENTD[Fluentd<br/>Log Collection]
        ELASTICSEARCH[Elasticsearch<br/>Log Storage]
        KIBANA[Kibana<br/>Log Visualization]
    end

    subgraph "Traces"
        JAEGER[Jaeger<br/>Distributed Tracing]
        OPENTELEMETRY[OpenTelemetry<br/>Instrumentation]
        SERVICE_MESH[Istio<br/>Service Mesh Tracing]
    end

    subgraph "Visualization"
        GRAFANA[Grafana<br/>Metrics Dashboards]
        KIBANA_DASHBOARDS[Kibana<br/>Log Dashboards]
        CUSTOM_UI[Custom UI<br/>Application Monitoring]
    end

    subgraph "Alerting"
        ALERTMANAGER[Alertmanager<br/>Alert Routing]
        NOTIFICATIONS[Notifications<br/>Email, Slack, PagerDuty]
        AUTO_REMEDIATION[Auto Remediation<br/>Runbooks, Webhooks]
    end

    METRICS --> PROMETHEUS
    INFRA --> PROMETHEUS

    FLUENTD --> ELASTICSEARCH
    ELASTICSEARCH --> KIBANA

    OPENTELEMETRY --> JAEGER
    SERVICE_MESH --> JAEGER

    PROMETHEUS --> GRAFANA
    ELASTICSEARCH --> KIBANA_DASHBOARDS
    SERVICE_MESH --> CUSTOM_UI

    PROMETHEUS --> ALERTMANAGER
    ALERTMANAGER --> NOTIFICATIONS
    NOTIFICATIONS --> AUTO_REMEDIATION

Key Metrics

Category	Metrics	Thresholds
Performance	Response time, throughput, error rate	95th percentile <500ms, <1% errors
Availability	Uptime, MTTR, MTBF	99.9% uptime, <1hr MTTR
Capacity	CPU utilization, memory usage, disk I/O	<80% utilization
Business	Build success rate, tenant adoption, API usage	>95% success rate
Security	Failed auth attempts, vulnerability count	<5 failed attempts/day

Disaster Recovery

Recovery Objectives

Component	RTO	RPO	Strategy
Database	1 hour	15 minutes	Multi-AZ deployment, automated failover
Application	30 minutes	N/A	Blue-green deployment, rolling updates
Registry	4 hours	1 hour	Geo-replication, backup restoration
Configuration	15 minutes	5 minutes	Git-based, multi-region replication

Disaster Recovery Architecture

graph TB
    subgraph "Primary Region"
        PRIMARY_DC[Primary Data Center]
        PRIMARY_DB[(Primary DB)]
        PRIMARY_APP[Primary Application]
        PRIMARY_STORAGE[Primary Storage]
    end

    subgraph "Secondary Region"
        SECONDARY_DC[Secondary Data Center]
        SECONDARY_DB[(Secondary DB<br/>Read Replica)]
        SECONDARY_APP[Secondary Application<br/>Standby]
        SECONDARY_STORAGE[Secondary Storage<br/>Replication]
    end

    subgraph "Backup Storage"
        BACKUP[(Object Storage<br/>Cross-Region)]
        ARCHIVE[(Archive Storage<br/>Long-term Retention)]
    end

    subgraph "Recovery Automation"
        ORCHESTRATOR[Recovery Orchestrator]
        RUNBOOKS[Automated Runbooks]
        MONITORING[DR Monitoring]
    end

    PRIMARY_DB --> SECONDARY_DB
    PRIMARY_APP --> SECONDARY_APP
    PRIMARY_STORAGE --> SECONDARY_STORAGE

    PRIMARY_DB --> BACKUP
    PRIMARY_STORAGE --> BACKUP
    BACKUP --> ARCHIVE

    ORCHESTRATOR --> RUNBOOKS
    RUNBOOKS --> MONITORING
    MONITORING --> ORCHESTRATOR

Recovery Procedures

Detection: Automated monitoring detects outage
Declaration: Incident response team declares disaster
Failover: Automated or manual failover to secondary site
Recovery: Restore primary site and fail back
Testing: Validate system functionality post-recovery
Lessons Learned: Conduct post-mortem and update procedures

Compliance and Governance

Compliance Framework

Standard	Scope	Implementation
SOC 2	Security, availability, confidentiality	Audit logging, access controls, monitoring
GDPR	Data protection, privacy	Data encryption, consent management, right to erasure
HIPAA	Healthcare data protection	Data isolation, audit trails, breach notification
PCI DSS	Payment card data	Network segmentation, encryption, access controls
ISO 27001	Information security management	Risk assessment, security controls, continuous improvement

Governance Model

graph TB
    subgraph "Governance Layers"
        BOARD[Board of Directors<br/>Strategic Oversight]
        EXECUTIVE[Executive Management<br/>Policy Approval]
        COMPLIANCE[Compliance Officer<br/>Regulatory Compliance]
        SECURITY[Security Team<br/>Security Operations]
    end

    subgraph "Operational Governance"
        ARCHITECTURE[Architecture Review Board<br/>Technical Standards]
        CHANGE[Change Advisory Board<br/>Change Management]
        AUDIT[Internal Audit<br/>Compliance Monitoring]
    end

    subgraph "Technical Controls"
        POLICIES[Security Policies<br/>Standards & Procedures]
        CONTROLS[Technical Controls<br/>Implementation]
        MONITORING[Continuous Monitoring<br/>Metrics & Alerts]
    end

    subgraph "Audit & Reporting"
        LOGS[Audit Logs<br/>Activity Tracking]
        REPORTS[Compliance Reports<br/>Regulatory Reporting]
        DASHBOARDS[Executive Dashboards<br/>Risk Visibility]
    end

    BOARD --> EXECUTIVE
    EXECUTIVE --> COMPLIANCE
    COMPLIANCE --> SECURITY

    SECURITY --> ARCHITECTURE
    ARCHITECTURE --> CHANGE
    CHANGE --> AUDIT

    AUDIT --> POLICIES
    POLICIES --> CONTROLS
    CONTROLS --> MONITORING

    MONITORING --> LOGS
    LOGS --> REPORTS
    REPORTS --> DASHBOARDS

Risk Management

Risk Category	Mitigation Strategy	Monitoring
Security Breach	Defense in depth, zero trust	SIEM, intrusion detection
Data Loss	Multi-region replication, backups	Backup verification, integrity checks
Service Outage	High availability, auto-scaling	Uptime monitoring, failover testing
Compliance Violation	Automated controls, regular audits	Compliance monitoring, policy enforcement
Performance Degradation	Capacity planning, optimization	Performance monitoring, alerting

This reference architecture document provides a comprehensive blueprint for implementing the Multi-Tenant Image Build Factory. It should be reviewed and updated regularly to reflect changes in requirements, technology, and best practices.