Building a Testing Infrastructure for AI-Assisted Development

Mar. 24, 2026

A comprehensive guide to the testing, CI/CD, and guardrail systems we built at Glolly to enable confident AI-assisted development. This document explains the philosophy, architecture, and implementation details so you can adapt it to your own stack.


Table of Contents

  1. Philosophy: Why This Exists
  2. Architecture Overview
  3. Layer 1: TypeScript Strictness
  4. Layer 2: Linting & Formatting
  5. Layer 3: Backend Testing
  6. Layer 4: Frontend Testing
  7. Layer 5: End-to-End Testing with Playwright
  8. Layer 6: Storybook as a Visual Safety Net
  9. Layer 7: Pre-Commit Hooks
  10. Layer 8: CI/CD Pipelines
  11. Layer 9: The Backend Docker Image Bridge
  12. Layer 10: Claude Code Agent Guardrails
  13. How It All Fits Together
  14. Adapting This to Your Stack

1. Philosophy: Why This Exists

When AI writes most of your code — whether that’s Claude Code, Copilot, or any other LLM-powered tool — the failure mode is different from human-written code. Humans make typos and forget edge cases. LLMs produce code that looks correct, compiles, and even passes a casual review, but has subtle issues: unchecked null access, forgotten await on promises, types coerced through any, test files that pass but don’t actually assert anything meaningful.

The guardrails are the product. Every layer in this system exists to make a specific class of AI-generated bug loud and impossible to ignore. The goal: if AI-generated code passes all checks, you can trust it structurally — and your human review is only about logic and intent.

The Layers, Ordered by When They Catch Problems

| Layer | Tool | What It Catches | AI Anti-Pattern It Prevents (Examples) |
|-------|------|-----------------|----------------------------------------|
| Compile | TypeScript strict mode | Type errors, null safety, unchecked index access | AI using any, assuming arrays have elements, forgetting null checks |
| Lint | ESLint (strict, type-checked rules) | Floating promises, unsafe any propagation, non-null assertions | AI using !, forgetting await, leaking any through assignments |
| Format | Prettier | Inconsistent formatting | AI mixing styles across generated files |
| Unit Test | Vitest | Function/component behavior regressions | AI changing behavior while “refactoring” |
| Integration Test | Vitest + Testcontainers | Database/service interaction bugs | AI writing queries that don’t match real DB behavior |
| E2E Test | Playwright | Full-flow regressions across frontend + backend | AI breaking navigation, data display, or checkout flows |
| Visual | Storybook | Component rendering regressions | AI breaking UI without functional test failures |
| Pre-commit | Husky + lint-staged | All of the above, at commit time | AI-generated code getting committed without checks |
| CI | GitHub Actions | All of the above, on every PR | Code that passes locally but fails in a clean environment |

Each layer catches things the previous one misses. The AI has to get past all of them to land code on main.

The Core Principle: Make bad code impossible to commit.


2. Architecture Overview

Our system has two codebases (backend API and frontend web app) with interconnected testing and deployment pipelines. Here’s the high-level flow:

flowchart TB
    subgraph BE["Backend Repository"]
        BE_CODE["Code Change"] --> BE_PRECOMMIT["Pre-commit Hook<br/>lint-staged + type-check"]
        BE_PRECOMMIT --> BE_PR["Pull Request to main"]
        BE_PR --> BE_CI["CI Pipeline<br/>Format + Lint + Type Check + Tests w/ Coverage"]
        BE_CI -->|merge to main| BE_IMG_STG["Build Docker Image<br/>Tag: staging"]
        BE_CI -->|release published| BE_IMG_PROD["Build Docker Image<br/>Tag: latest + version"]
        BE_CI -->|release published| BE_DEPLOY["Deploy to Railway<br/>Production"]
    end
    subgraph FE["Frontend Repository"]
        FE_CODE["Code Change"] --> FE_PRECOMMIT["Pre-commit Hook<br/>lint-staged + type-check + codegen"]
        FE_PRECOMMIT --> FE_PR["Pull Request to main"]
        FE_PR --> FE_CI["CI Pipeline<br/>Format + Lint + Type Check + Tests<br/>+ Codegen Sync + Storybook Build + App Build"]
        FE_PR --> FE_E2E["E2E Pipeline<br/>Pulls backend Docker image<br/>Spins up full test stack<br/>Runs Playwright"]
        FE_CI -->|release published| FE_DEPLOY["Deploy to Railway<br/>Production"]
    end
    BE_IMG_STG --> FE_E2E
    BE_IMG_PROD --> FE_E2E

    %% Subgraph styles
    style BE fill:#2a1a1a,stroke:#e94560,color:#e0e0e0
    style FE fill:#1a1a2e,stroke:#0f3460,color:#e0e0e0

    %% Node classes
    classDef beNode fill:#1e1e2e,stroke:#e94560,color:#e0e0e0,stroke-width:2px
    classDef feNode fill:#1e1e2e,stroke:#0f3460,color:#e0e0e0,stroke-width:2px

    class BE_CODE,BE_PRECOMMIT,BE_PR,BE_CI,BE_IMG_STG,BE_IMG_PROD,BE_DEPLOY beNode
    class FE_CODE,FE_PRECOMMIT,FE_PR,FE_CI,FE_E2E,FE_DEPLOY feNode

    %% Arrow styles
    linkStyle default stroke:#888,stroke-width:2px

The key insight: the backend publishes a Docker image on every merge to main. The frontend E2E tests pull that image and spin up a real API (with real PostgreSQL, Redis, and S3-compatible storage) as the test backend. This means frontend E2E tests exercise the actual backend, not mocks — catching integration bugs that unit tests miss.


3. Layer 1: TypeScript Strictness

TypeScript’s strict mode is the single highest-value guardrail for AI-generated code. It catches type errors at compile time before any test runs.

What We Enable (Backend)

// tsconfig.json — key settings beyond the default "strict: true"
{
  "compilerOptions": {
    "strict": true,                        // Umbrella for all strict flags
    "noUncheckedIndexedAccess": true,      // Forces null checks on array[index] and obj[key]
    "noPropertyAccessFromIndexSignature": true, // Forces bracket notation for dynamic keys
    "noUnusedLocals": true,                // Catches leftover variables
    "noUnusedParameters": true,            // Catches unused function params
    "noImplicitReturns": true,             // Every code path must return
    "noFallthroughCasesInSwitch": true,    // switch cases must break/return
    "noImplicitOverride": true,            // Explicit override keyword required
    "verbatimModuleSyntax": false          // Needed for some CJS interop
  }
}
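To make the value of noUncheckedIndexedAccess concrete, here is a small sketch (the types and names are illustrative, not from our codebase). With the flag on, indexing an array yields `T | undefined`, so the compiler rejects unguarded access:

```typescript
// Under noUncheckedIndexedAccess, users[0] has type User | undefined,
// so the compiler forces the guard below before .name can be read.
// This is exactly the "assume the array has elements" bug AI loves.
interface User {
  name: string;
}

function firstName(users: User[]): string {
  const first = users[0]; // User | undefined, not User
  if (first === undefined) {
    return "(none)";
  }
  return first.name;
}
```

Without the flag, `users[0]` is typed as plain `User` and the empty-array crash only surfaces at runtime.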

What We Enable (Frontend — Next.js)

{
  "compilerOptions": {
    "strict": true,
    "noUncheckedIndexedAccess": true,
    "exactOptionalPropertyTypes": true,    // Prevents assigning undefined to optional props
    "noImplicitReturns": true,
    "noFallthroughCasesInSwitch": true,
    "noImplicitOverride": true,
    "noUnusedLocals": true,
    "noUnusedParameters": true
  }
}

The Principle

Start with maximum strictness. Only relax a flag if it fights you across an entire codebase, not just in one file. We found exactOptionalPropertyTypes caused enough friction across the backend codebase to omit it there, but it’s fine on the frontend.


4. Layer 2: Linting & Formatting

ESLint Configuration

We use ESLint’s type-checked rules, which go beyond what the TypeScript compiler checks. The key difference: ESLint can analyze patterns and intent, not just types.

Backend (stricter — no React complexity):

// Key rule categories enabled:
// 1. Promise safety (the #1 AI footgun)
'@typescript-eslint/no-floating-promises': 'error',     // Catch forgotten await
'@typescript-eslint/no-misused-promises': 'error',      // Catch promises in wrong contexts
'@typescript-eslint/require-await': 'error',             // Catch async functions without await
'@typescript-eslint/return-await': ['error', 'always'],  // Consistent return await

// 2. Type safety (prevent any leakage)
'@typescript-eslint/no-explicit-any': 'error',
'@typescript-eslint/no-unsafe-assignment': 'error',
'@typescript-eslint/no-unsafe-call': 'error',
'@typescript-eslint/no-unsafe-member-access': 'error',
'@typescript-eslint/no-unsafe-return': 'error',

// 3. Code quality
'@typescript-eslint/no-non-null-assertion': 'error',     // AI loves "!" — ban it
'@typescript-eslint/strict-boolean-expressions': 'error', // No implicit truthiness
'@typescript-eslint/switch-exhaustiveness-check': 'error', // All cases handled
'@typescript-eslint/consistent-type-imports': 'error',    // type imports separated

// 4. Catch dev leftovers
'no-console': 'error',  // No console.log in production code
'no-debugger': 'error',

Frontend (similar, with React additions):

// All the above, plus:
'react-hooks/exhaustive-deps': 'error',  // Catch missing useEffect deps
'react/self-closing-comp': 'error',
// Icon library enforcement (prevent mixing icon sets)
'no-restricted-imports': ['error', {
  patterns: [
    { group: ['lucide-react', 'lucide-react/*'], message: 'Use @tabler/icons-react instead.' },
  ]
}],
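The promise-safety rules exist because a dropped promise fails silently. A minimal sketch of the pattern no-floating-promises forbids, and its fix (function names are illustrative):

```typescript
// A dropped promise loses both the rejection and the ordering guarantee:
//   saveOrder("order-1", log);   // <- no-floating-promises flags this line
// The awaited version below propagates errors and completes in order.
async function saveOrder(id: string, log: string[]): Promise<void> {
  log.push(`saved ${id}`);
}

async function checkout(log: string[]): Promise<void> {
  await saveOrder("order-1", log); // awaited: a rejection would surface here
  log.push("confirmed");
}
```

This class of bug compiles cleanly without the lint rule, which is why it needs a layer beyond the type checker.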

Why --max-warnings=0

Both in pre-commit hooks and CI, we run ESLint with --max-warnings=0, which makes warnings effectively errors: a rule either matters enough to block the commit, or it shouldn’t be enabled at all.

Formatting

Prettier handles all formatting.

{
  "printWidth": 100,
  "semi": true,
  "singleQuote": false,
  "trailingComma": "all",
  "plugins": ["prettier-plugin-tailwindcss"]
}

Test File Relaxations

Test files get relaxed type-safety rules because mocks inherently use any:

// For *.test.ts and *.spec.ts files:
'@typescript-eslint/no-explicit-any': 'off',
'@typescript-eslint/no-unsafe-assignment': 'off',
'@typescript-eslint/no-unsafe-call': 'off',
'@typescript-eslint/no-unsafe-member-access': 'off',
'@typescript-eslint/no-non-null-assertion': 'off',
'no-console': 'off',

This is a deliberate tradeoff — strict types in test mocks create more friction than they prevent bugs.
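In ESLint’s flat config, these relaxations are scoped with a files glob so production code keeps the strict settings. A sketch (your base config and plugin wiring will differ):

```javascript
// eslint.config.js (excerpt) — the override object only applies to files
// matching the globs, so the strict base rules still govern src/ code.
export default [
  // ...base strict, type-checked config for all files...
  {
    files: ["**/*.test.ts", "**/*.spec.ts"],
    rules: {
      "@typescript-eslint/no-explicit-any": "off",
      "@typescript-eslint/no-unsafe-assignment": "off",
      "@typescript-eslint/no-unsafe-call": "off",
      "@typescript-eslint/no-unsafe-member-access": "off",
      "@typescript-eslint/no-non-null-assertion": "off",
      "no-console": "off",
    },
  },
];
```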


5. Layer 3: Backend Testing

The Stack

Why Testcontainers

A common failure mode in backend tests is a mismatch between mocked behavior and production behavior. Testcontainers solves this by spinning up a real PostgreSQL instance in a Docker container for each test run.

// tests/setup/global-setup.ts — simplified concept
import { PostgreSqlContainer } from '@testcontainers/postgresql';

export default async function globalSetup() {
  // Start a real PostgreSQL container
  const container = await new PostgreSqlContainer('postgres:18-alpine')
    .withDatabase('test_db')
    .start();

  // Run migrations against it
  // Set the DATABASE_URL for test processes
  process.env.DATABASE_URL = container.getConnectionUri();

  // Return teardown function
  return async () => {
    await container.stop();
  };
}

What we mock vs. what we don’t:

| Dependency | Mocked? | Why |
|------------|---------|-----|
| PostgreSQL | No — real via Testcontainers | SQL behavior must match production |
| Redis | No — real via Testcontainers or Docker Compose | Cache/queue behavior must be real |
| Payment integrations | Yes | External API, costly, rate-limited |
| SMS provider | Yes | External API, costs money per message |
| Email provider | Yes | External API |
| S3/R2 (object storage) | Sometimes — MinIO for integration, mocked for unit | Depends on what’s being tested |

Coverage Configuration

// vitest.config.ts
export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',
      reporter: ['text', 'html', 'lcov', 'json-summary'],
      include: ['src/**/*.ts'],
      exclude: [
        'src/services/email/templates/**',  // HTML templates
        'src/index.ts',                     // Entry point
        'src/yoga.ts',                      // Server setup
        'src/schema.ts',                    // Generated schema
        '.config/**',
        'db/**',                            // Migrations
      ],
    },
    fileParallelism: false,  // Tests share a DB — run sequentially
    hookTimeout: 120_000,     // Testcontainers need time to start
    testTimeout: 30_000,
  },
});

We aim for 90%+ coverage. The json-summary reporter is important — it’s what the CI pipeline reads to post coverage comments on PRs.

Key Pattern: Sequential Test Execution

Because tests share a single database container (for speed — starting a new container per test is too slow), we run tests sequentially with fileParallelism: false. Each test suite handles its own cleanup (truncating tables, resetting state). This is a deliberate tradeoff: slower tests, but real database behavior and simpler setup.
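A sketch of what that per-suite cleanup can look like. The table names and the query runner are placeholders, not our actual schema:

```typescript
// Build one TRUNCATE statement covering all app tables so each suite
// starts from a clean database without restarting the container.
function buildTruncateSql(tables: string[]): string {
  const quoted = tables.map((t) => `"${t}"`).join(", ");
  // RESTART IDENTITY resets sequences; CASCADE follows foreign keys.
  return `TRUNCATE TABLE ${quoted} RESTART IDENTITY CASCADE;`;
}

async function resetDatabase(
  runQuery: (sql: string) => Promise<void>, // e.g. a pg Pool's query method
  tables: string[],
): Promise<void> {
  if (tables.length === 0) return;
  await runQuery(buildTruncateSql(tables));
}
```

Each suite calls this in a beforeEach (or afterEach) hook, which is cheap compared to container startup.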


6. Layer 4: Frontend Testing

The Stack

What We Test

Frontend tests focus on component behavior and business logic, not implementation details:

// Example: Testing a component renders correctly
import { render, screen } from '@testing-library/react';
import userEvent from '@testing-library/user-event';

describe('ProductCard', () => {
  it('shows out-of-stock badge when stock is zero', () => {
    render(<ProductCard product={{ ...mockProduct, stock: 0 }} />);
    expect(screen.getByText('Out of Stock')).toBeInTheDocument();
  });

  it('calls onAddToCart with correct quantity', async () => {
    const onAddToCart = vi.fn();
    render(<ProductCard product={mockProduct} onAddToCart={onAddToCart} />);

    await userEvent.click(screen.getByRole('button', { name: /add to cart/i }));
    expect(onAddToCart).toHaveBeenCalledWith(mockProduct.id, 1);
  });
});

Coverage Configuration

// vitest.config.ts
export default defineConfig({
  test: {
    globals: true,
    environment: 'jsdom',
    setupFiles: ['./vitest.setup.ts'],
    coverage: {
      provider: 'v8',
      reporter: ['text', 'text-summary', 'lcov', 'json-summary'],
      include: ['lib/**', 'hooks/**', 'stores/**', 'components/**'],
      exclude: [
        '**/types/**',
        '**/*.d.ts',
        '**/generated-types.ts',   // GraphQL codegen output
        '**/*.stories.tsx',         // Storybook files
      ],
      thresholds: {
        statements: 75,
        branches: 70,
        functions: 80,
        lines: 75,
      },
    },
  },
});

7. Layer 5: End-to-End Testing with Playwright

This is where the backend Docker image comes into play. Playwright tests exercise the entire application stack — real frontend, real backend API, real database, real object storage.

The Test Stack

The frontend repository includes a docker-compose.test.yml that spins up everything the E2E tests need:

services:
  # Real PostgreSQL
  postgres:
    image: postgres:18.3-alpine
    ports: ["5433:5432"]
    environment:
      POSTGRES_DB: app_test
      POSTGRES_USER: testuser
      POSTGRES_PASSWORD: testpass
    tmpfs:
      - /var/lib/postgresql  # RAM-backed for speed
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U testuser -d app_test"]
      interval: 5s
      timeout: 3s
      retries: 10

  # Real Redis
  redis:
    image: redis:8.4.0-alpine
    ports: ["6380:6379"]
    tmpfs:
      - /data

  # S3-compatible object storage (MinIO)
  minio:
    image: minio/minio:latest
    ports: ["9100:9000"]
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: server /data --console-address ":9001"
    healthcheck:
      # Required: minio-init below waits on service_healthy
      test: ["CMD-SHELL", "curl -f http://localhost:9000/minio/health/live || exit 1"]
      interval: 5s
      timeout: 3s
      retries: 10

  # Initialize MinIO bucket
  minio-init:
    image: minio/mc:latest
    depends_on:
      minio:
        condition: service_healthy
    entrypoint: >
      sh -c "
        mc alias set local http://minio:9000 minioadmin minioadmin &&
        mc mb --ignore-existing local/app-uploads
      "      

  # The REAL backend API — pulled from container registry
  api:
    image: ghcr.io/your-org/your-api:staging
    platform: linux/amd64
    ports: ["4100:4000"]
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_started
      minio-init:
        condition: service_completed_successfully
    environment:
      # Override backend env vars to point at test services
      DB_HOST: postgres
      DB_PORT: "5432"
      DB_NAME: app_test
      REDIS_URL: redis://redis:6379
      S3_ENDPOINT: http://minio:9000
      NODE_ENV: test
      MOCK_EXTERNAL_SERVICES: "true"  # Backend mocks Paystack, SMS, etc.
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:4000/health"]
      interval: 5s
      timeout: 5s
      retries: 20
      start_period: 15s

Key design decisions:

Playwright Configuration

// playwright.config.ts
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  testDir: './e2e',
  timeout: 60_000,
  fullyParallel: false,    // Sequential — tests may share state
  retries: process.env.CI ? 1 : 0,
  workers: 1,

  use: {
    baseURL: 'http://localhost:3001',
    trace: 'on-first-retry',          // Capture traces for debugging
    screenshot: 'only-on-failure',
  },

  projects: [
    {
      name: 'chromium',
      use: { ...devices['Desktop Chrome'] },
    },
    {
      name: 'webkit',
      use: { ...devices['Desktop Safari'] },
    },
    {
      name: 'mobile-android',
      use: {
        viewport: { width: 375, height: 812 },
        isMobile: true,
        hasTouch: true,
      },
    },
  ],

  // Start the frontend dev server for tests
  webServer: {
    command: 'next dev --port 3001',
    url: 'http://localhost:3001',
    reuseExistingServer: !process.env.CI,
    timeout: 120_000,
  },
});

What E2E Tests Cover

E2E tests cover the critical user flows that, if broken, would lose money or trust:

Each test exercises the full stack: browser → Next.js frontend → GraphQL API → PostgreSQL → response rendered in browser.

Why This Catches What Unit Tests Miss

Unit tests mock the backend. If the backend changes its GraphQL schema, adds a required field, or changes a response shape, unit tests with mocked responses still pass. E2E tests hit the real backend and fail immediately.


8. Layer 6: Storybook as a Visual Safety Net

Storybook serves two purposes in our setup: component documentation and visual regression detection.

Configuration

// .storybook/main.ts
import type { StorybookConfig } from '@storybook/nextjs';

const config: StorybookConfig = {
  stories: ['../components/**/*.stories.@(ts|tsx)'],
  addons: ['@storybook/addon-a11y', '@storybook/addon-designs'],
  framework: {
    name: '@storybook/nextjs',
    options: {},
  },
};

export default config;

Why It’s in CI

Our CI pipeline runs storybook build as a verification step. This catches:

It’s not a full visual regression test (that would require Chromatic or similar), but the build verification alone catches a class of errors that unit tests miss.

Viewport Presets for Target Market

We configure Storybook viewports to match our actual user base — mid-range Android phones, not just iPhone and desktop:

viewport: {
  viewports: {
    androidSmall: { name: 'Android Small', styles: { width: '360px', height: '800px' } },
    androidLarge: { name: 'Android Large', styles: { width: '412px', height: '915px' } },
    iphoneSE:     { name: 'iPhone SE',     styles: { width: '375px', height: '667px' } },
    tablet:       { name: 'Tablet',        styles: { width: '768px', height: '1024px' } },
    laptop:       { name: 'Laptop',        styles: { width: '1366px', height: '768px' } },
  },
},

9. Layer 7: Pre-Commit Hooks

Pre-commit hooks are the last line of defense before code enters the repository. We use Husky to run lint-staged, which applies checks only to staged files (fast feedback).

Backend Pre-Commit

#!/bin/sh
# .husky/pre-commit

# Encrypt all env files first (prevent accidental secret commits)
pnpm env:encrypt:all

# Stage encrypted env files (excluding sensitive ones)
git diff --name-only | grep -E '^\.env\..+$' | \
  grep -v -E '^\.(env\.keys|env\.local|env)$' | \
  xargs -r git add || true

# Guard: block commit if secrets are staged
BLOCKED=$(git diff --cached --name-only | grep -E '^(\.env|\.env\.keys|\.env\.local)$' || true)
if [ -n "$BLOCKED" ]; then
  echo "Sensitive env files should not be committed:"
  echo "$BLOCKED"
  exit 1
fi

# Run lint-staged (ESLint + Prettier on staged files)
npx lint-staged

# Full type-check (not just staged files — a change in one file can break another)
pnpm type-check

Frontend Pre-Commit

#!/bin/sh
# Same env encryption and guarding, then:
npm run precommit:check
# Which runs: typecheck + lint + format:check

lint-staged Configuration

{
  "lint-staged": {
    "*.{ts,tsx}": [
      "eslint --fix",
      "prettier --write"
    ],
    "*.{json,md,yml,yaml}": [
      "prettier --write"
    ]
  }
}

Why Full Type-Check, Not Just Staged Files

lint-staged only runs on staged files, which is fast. But TypeScript type-checking must run on the entire project because changing a type in one file can break imports in files you didn’t touch. The full tsc --noEmit run takes a few seconds and catches cross-file breakage that staged-only checking would miss.

The Env Encryption Guard

We use dotenvx to encrypt environment files. The pre-commit hook automatically encrypts .env.* files and stages them, while blocking .env, .env.keys, and .env.local (which contain secrets) from ever being committed. This is defense-in-depth — .gitignore should also exclude these, but the hook catches cases where someone force-adds them.


10. Layer 8: CI/CD Pipelines

Backend CI (on every PR to main)

name: CI
on:
  pull_request:
    branches: [main]

jobs:
  qa:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: pnpm/action-setup@v5
      - uses: actions/setup-node@v6
        with:
          node-version-file: '.nvmrc'
          cache: 'pnpm'

      - run: pnpm install --frozen-lockfile
      - run: pnpm format:check
      - run: pnpm lint
      - run: pnpm type-check

  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: pnpm/action-setup@v5
      - uses: actions/setup-node@v6
      - run: pnpm install --frozen-lockfile
      - run: pnpm test:coverage

      - uses: actions/upload-artifact@v7
        if: always()
        with:
          name: coverage-report
          path: coverage/

  # Posts coverage summary as a PR comment
  coverage-comment:
    needs: [qa, test]
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/download-artifact@v8
        with:
          name: coverage-report
          path: ./coverage
      - uses: actions/github-script@v8
        with:
          script: |
            // Reads json-summary, posts formatted table to PR
            // Shows: Statements, Branches, Functions, Lines with color indicators            

Frontend CI (on every PR to main)

The frontend CI is more comprehensive because it has more moving parts:

jobs:
  check-quality:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-node@v6
      - run: npm ci

      # 1. Type safety
      - run: npm run typecheck

      # 2. Code quality
      - run: npm run lint
      - run: npm run format:check

      # 3. Unit + integration tests with coverage
      - run: npx vitest run --coverage

      # 4. GraphQL codegen sync check
      - name: Verify GraphQL Codegen Sync
        run: |
          npm run codegen
          npm run format
          if [[ -n $(git status --porcelain) ]]; then
            echo "GraphQL types are out of sync with staging!"
            exit 1
          fi          

      # 5. Storybook build verification
      - run: npm run storybook:build

      # 6. Application build verification
      - run: npm run build:ci

Frontend E2E CI (separate workflow, on every PR)

name: E2E Tests
on:
  pull_request:
    branches: [main, staging]

jobs:
  e2e:
    runs-on: ubuntu-latest
    timeout-minutes: 25
    steps:
      - uses: actions/checkout@v6
      - uses: actions/setup-node@v6
      - run: npm ci
      - run: npx playwright install --with-deps chromium webkit

      # Authenticate to container registry
      - run: echo "${{ secrets.GHCR_TOKEN }}" | docker login ghcr.io -u "${{ github.repository_owner }}" --password-stdin

      # Pull the backend image
      - run: docker compose -f docker-compose.test.yml pull api

      # Start the full test stack
      - name: Start test stack
        run: |
          # Decrypt test env, source it for Docker Compose
          npx dotenvx decrypt -f .env.test
          set -a && source .env.test && set +a
          docker compose -f docker-compose.test.yml up -d --wait          

      # Wait for API health
      - name: Wait for API
        run: |
          timeout 90 bash -c '
            until curl -sf http://localhost:4100/health > /dev/null 2>&1; do
              sleep 3
            done
          '          

      # Run Playwright
      - run: npx dotenvx run -f .env.test -- npx playwright test

      # Artifacts for debugging
      - uses: actions/upload-artifact@v7
        if: always()
        with:
          name: playwright-report
          path: playwright-report/

      # Clean up
      - if: always()
        run: docker compose -f docker-compose.test.yml down -v --remove-orphans

Coverage Comments on PRs

Both backend and frontend pipelines post coverage summaries as PR comments. This gives reviewers immediate visibility into test coverage without digging through CI logs:

## 📊 Coverage Summary

| Category   | Coverage | |
|------------|----------|-|
| Statements | 91.2%    | 🟢 |
| Branches   | 87.4%    | 🟢 |
| Functions  | 93.1%    | 🟢 |
| Lines      | 90.8%    | 🟢 |

The comment updates on each push (not create-and-duplicate), keeping the PR thread clean.
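The comment body is generated from the json-summary output. A sketch of the transformation (the totals shape matches istanbul-style coverage summaries; the threshold and emoji choices are ours):

```typescript
// coverage/coverage-summary.json has a "total" entry shaped like this.
interface Metric {
  pct: number;
}
interface CoverageTotals {
  statements: Metric;
  branches: Metric;
  functions: Metric;
  lines: Metric;
}

// Render the markdown table posted to the PR; green at/above threshold.
function coverageTable(total: CoverageTotals, threshold = 80): string {
  const rows: Array<[string, number]> = [
    ["Statements", total.statements.pct],
    ["Branches", total.branches.pct],
    ["Functions", total.functions.pct],
    ["Lines", total.lines.pct],
  ];
  const body = rows.map(
    ([name, pct]) =>
      `| ${name} | ${pct.toFixed(1)}% | ${pct >= threshold ? "🟢" : "🔴"} |`,
  );
  return ["| Category | Coverage | |", "|------------|----------|-|", ...body].join("\n");
}
```

The github-script step reads the JSON, calls something like this, and upserts the comment by a hidden marker in its body.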


11. Layer 9: The Backend Docker Image Bridge

This is the piece that ties the backend and frontend testing together.

The Flow

flowchart LR
    A["Push to main<br/>(backend repo)"] --> B["GitHub Actions:<br/>Build Docker Image"]
    B --> C["Push to GHCR<br/>Tag: staging"]
    D["Release published<br/>(backend repo)"] --> E["GitHub Actions:<br/>Build Docker Image"]
    E --> F["Push to GHCR<br/>Tag: latest + version"]

    C --> G["Frontend E2E Tests<br/>Pull staging image"]
    F --> G

    %% Accent nodes
    style B fill:#1a2e2a,stroke:#00b894,color:#e0e0e0,stroke-width:2px
    style E fill:#2e1a1a,stroke:#e17055,color:#e0e0e0,stroke-width:2px
    style G fill:#1a1e2e,stroke:#0984e3,color:#e0e0e0,stroke-width:2px

    %% Plain nodes
    classDef plain fill:#1e1e2e,stroke:#888,color:#e0e0e0,stroke-width:1.5px
    class A,C,D,F plain

    %% Arrows
    linkStyle default stroke:#888,stroke-width:2px

The Dockerfile

The backend Dockerfile is a multi-stage build optimized for small image size and fast deploys:

# Stage 1: Install all dependencies
FROM node:22-slim AS deps
WORKDIR /app
RUN corepack enable && corepack prepare pnpm@10.27.0 --activate
COPY package.json pnpm-lock.yaml ./
RUN pnpm install --frozen-lockfile

# Stage 2: Build TypeScript
FROM deps AS build
COPY . .
RUN pnpm build

# Stage 3: Production dependencies only
FROM node:22-slim AS prod-deps
WORKDIR /app
RUN corepack enable && corepack prepare pnpm@10.27.0 --activate
COPY package.json pnpm-lock.yaml ./
RUN pnpm install --frozen-lockfile --prod --ignore-scripts

# Stage 4: Minimal runtime
FROM node:22-slim AS runtime
WORKDIR /app
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
RUN npm install -g @dotenvx/dotenvx
COPY --from=prod-deps /app/node_modules ./node_modules
COPY --from=build /app/dist ./dist
COPY --from=build /app/db ./db
COPY package.json .env.test ./

# dotenvx decrypts env vars at runtime
ENTRYPOINT ["dotenvx", "run", "-f", ".env.test", "--"]
CMD ["node", "dist/src/index.js"]

Build & Push Workflow

name: Build & Push API Image
on:
  push:
    branches: [main]      # Build staging image
  release:
    types: [published]     # Build production image

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v6
      - uses: docker/setup-buildx-action@v4
      - uses: docker/login-action@v4
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Determine tags
        id: tags
        run: |
          IMAGE="ghcr.io/${{ github.repository }}"
          if [ "${{ github.event_name }}" = "release" ]; then
            echo "tags=${IMAGE}:latest,${IMAGE}:${{ github.event.release.tag_name }}" >> "$GITHUB_OUTPUT"
          else
            echo "tags=${IMAGE}:staging,${IMAGE}:sha-${GITHUB_SHA::7}" >> "$GITHUB_OUTPUT"
          fi          

      - uses: docker/build-push-action@v7
        with:
          context: .
          push: true
          tags: ${{ steps.tags.outputs.tags }}
          platforms: linux/amd64,linux/arm64

Why This Architecture

Our backend runs on Railway, which deploys directly from code — we don’t need a Docker image for production hosting. The image exists purely to give the frontend E2E tests a hermetic, reproducible backend environment. This means:


12. Layer 10: Claude Code Agent Guardrails

This is the most unique part of our setup. We use Claude Code for development. Claude Code can run bash commands, edit files, and make commits. Without guardrails, it could rm -rf /, force-push to main, or install malicious packages. Our guardrail system prevents all of this.

The Permission System

Claude Code has a settings.json file that defines allowed and denied bash command patterns. Here’s the structure:

{
  "permissions": {
    "allow": [
      "Bash(pnpm run build*)",
      "Bash(pnpm run lint*)",
      "Bash(pnpm run test*)",
      "Bash(git add *)",
      "Bash(git commit *)",
      "Bash(git push *)",
      "Bash(ls *)",
      "Bash(cat *)",
      "Bash(grep *)",
      "Bash(find *)"
      // ... read-only and safe build commands
    ],
    "deny": [
      "Bash(pnpm add *)",           // Can't install packages
      "Bash(pnpm install*)",         // Can't modify node_modules
      "Bash(git push --force*)",     // Can't force-push
      "Bash(git push -f*)",
      "Bash(git reset --hard*)",     // Can't destroy history
      "Bash(rm -rf *)",             // Can't delete recursively
      "Bash(sudo *)",               // No root access
      "Bash(kill *)",               // Can't kill processes
      "Bash(chmod *)",              // Can't change permissions
      "Bash(curl * | sh*)",         // Can't pipe downloads to shell
      "Bash(curl * | bash*)"
    ]
  }
}

The philosophy: allow everything the AI needs to do its job (build, test, lint, commit, view files), deny everything that could cause damage (install packages, force-push, delete files, run arbitrary scripts).
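The allow/deny entries are prefix globs. A small sketch of the matching semantics we rely on (our illustration of the behavior, not Claude Code’s actual matcher):

```typescript
// Match a permission entry like "Bash(git log*)" against a concrete
// command. A trailing "*" means prefix match; otherwise exact match.
function matchesPattern(pattern: string, command: string): boolean {
  const m = pattern.match(/^Bash\((.*)\)$/);
  if (!m) return false;
  const inner = m[1];
  if (inner === undefined) return false;
  if (inner.endsWith("*")) {
    return command.startsWith(inner.slice(0, -1));
  }
  return command === inner;
}
```

So "Bash(git log*)" admits "git log --oneline" but not "git push", and "Bash(sudo *)" denies anything starting with "sudo ".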

The Auto-Approve Hook

Claude Code’s built-in pattern matching doesn’t handle piped commands (cmd1 | cmd2) or complex bash expressions; this is a known limitation that may be fixed in a future release. Our custom hook script (auto-approve-pipes.sh) parses every bash command the AI wants to run, extracts the individual commands, and checks each one against the allow/deny lists.

# Simplified concept of how the hook works:

# 1. Read the command Claude Code wants to run
COMMAND="git log --oneline | head -5"

# 2. Parse into individual commands using shfmt (if available) or regex
# Extracted: ["git log --oneline", "head -5"]

# 3. Check each against deny list first
# "git log --oneline" → not denied ✓
# "head -5" → not denied ✓

# 4. Check each against allow list
# "git log --oneline" → matches "Bash(git log*)" ✓
# "head -5" → matches "Bash(head *)" ✓

# 5. All commands allowed → auto-approve without prompting

If any command in the pipeline is denied, the entire command is blocked. If any command isn’t in the allow list, it falls through to manual approval (the human gets prompted).
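A runnable distillation of steps 2-4 above. The deny prefixes here are examples, and real parsing (quotes, subshells, &&) is what shfmt handles in the actual hook; this shows only the decision logic:

```shell
# Split a pipeline on "|" and deny the whole command if any segment
# matches a denied prefix.
is_denied() {
  case "$1" in
    "rm -rf "*|"sudo "*|"git push --force"*) return 0 ;;
  esac
  return 1
}

check_pipeline() {
  old_ifs="$IFS"
  IFS='|'
  # Word splitting on "|" is intentional here.
  set -- $1
  IFS="$old_ifs"
  for segment in "$@"; do
    # Trim surrounding whitespace left over from the split.
    segment=$(printf '%s' "$segment" | sed 's/^ *//; s/ *$//')
    if is_denied "$segment"; then
      echo "denied"
      return 1
    fi
  done
  echo "allowed"
}
```

A segment that is neither denied nor allowed would fall through to manual approval in the real hook; this sketch collapses that case into "allowed" for brevity.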

Protected Branch Guard

The hook also prevents the AI from operating on main directly:

check_main_branch() {
  local cmd="$1"
  # Block: git checkout main, git push ... main, git reset ... main
  echo "$cmd" | grep -qE "git (checkout|switch).*main" && return 0
  echo "$cmd" | grep -qE "git push.*main" && return 0
  echo "$cmd" | grep -qE "git reset.*main" && return 0
  return 1
}

The AI can create branches, commit, push to feature branches, and open PRs — but it can never touch main directly. All code reaches main through PRs, which must pass CI.


13. How It All Fits Together

Here’s the complete lifecycle of a code change, from AI-generated code to production:

flowchart TB
    A["AI generates code<br/>(Claude Code / Copilot / etc)"] --> B{"Pre-commit hook"}
    B -->|fail| A
    B -->|pass| C["Commit to feature branch"]
    C --> D["Push → Open PR"]
    D --> E{"CI: Format check"}
    E -->|fail| A
    E -->|pass| F{"CI: Lint (zero warnings)"}
    F -->|fail| A
    F -->|pass| G{"CI: Type check (strict)"}
    G -->|fail| A
    G -->|pass| H{"CI: Unit + Integration tests"}
    H -->|fail| A
    H -->|pass| I{"CI: Coverage threshold met?"}
    I -->|fail| A
    I -->|pass| J{"CI: Storybook builds?"}
    J -->|fail| A
    J -->|pass| K{"CI: App builds?"}
    K -->|fail| A
    K -->|pass| L{"CI: E2E tests pass?"}
    L -->|fail| A
    L -->|pass| M["Human review"]
    M -->|approved| N["Merge to main"]
    N --> O["Backend: Build Docker image (staging)"]
    N --> P["Deploy staging"]
    P --> Q["Manual QA"]
    Q -->|ready| R["Publish release"]
    R --> S["Backend: Build Docker image (latest)"]
    R --> T["Deploy production"]

    %% Accent nodes
    style A fill:#1e1a2e,stroke:#6c5ce7,color:#e0e0e0,stroke-width:2px
    style N fill:#1a2e2a,stroke:#00b894,color:#e0e0e0,stroke-width:2px
    style T fill:#2e1a1a,stroke:#e17055,color:#e0e0e0,stroke-width:2px

    %% Decision diamonds
    classDef decision fill:#2e2a1a,stroke:#fdcb6e,color:#e0e0e0,stroke-width:1.5px
    class B,E,F,G,H,I,J,K,L decision

    %% Plain nodes
    classDef plain fill:#1e1e2e,stroke:#888,color:#e0e0e0,stroke-width:1.5px
    class C,D,M,O,P,Q,R,S plain

    %% Arrows
    linkStyle default stroke:#888,stroke-width:2px

Every arrow labeled “fail” sends the developer (or AI) back to the start. There are no shortcuts. The system is designed so that by the time a human sees the PR, the code has already passed: formatting, linting, type-checking, unit tests, integration tests, coverage thresholds, Storybook build, app build, and end-to-end tests. The human review can focus entirely on logic, intent, and architecture — not on whether the code works.


14. Adapting This to Your Stack

This guide describes our specific implementation (Node.js, TypeScript, Next.js, Vitest, Playwright, GitHub Actions, Railway). But the principles are stack-agnostic. Here’s how to adapt each layer:

Type Safety Layer

| Our Stack | Alternatives |
|-----------|--------------|
| TypeScript strict mode | Mypy strict (Python), Rust’s borrow checker, Go’s type system |
| ESLint type-checked rules | Ruff (Python), Clippy (Rust), golangci-lint (Go) |

The principle: turn on the strictest settings your language supports. Relax only when a specific rule fights your entire codebase.

Testing Layer

| Our Stack | Alternatives |
|-----------|--------------|
| Vitest | Jest, pytest, Go testing |
| Testcontainers (real DB) | SQLite in-memory (lighter but less realistic), Docker Compose per test |
| React Testing Library | Vue Test Utils, Svelte Testing Library |
| Playwright | Cypress, Selenium |

The principle: test against real infrastructure where possible (real database, real cache). Mock only external services you don’t control (payment APIs, SMS providers).

CI/CD Layer

| Our Stack | Alternatives |
|-----------|--------------|
| GitHub Actions | GitLab CI, CircleCI, Buildkite |
| GHCR (container registry) | Docker Hub, ECR, GCR |
| Railway (hosting) | Vercel, Render, Fly.io, AWS |

The principle: CI runs the exact same checks as pre-commit, but in a clean environment. No caching tricks that could hide failures.

AI Agent Guardrails

| Our Stack | Alternatives |
|-----------|--------------|
| Claude Code hooks + permissions | Cursor rules, Copilot workspace policies, custom MCP servers |

The principle: define an explicit allow-list of what the AI can do. Everything else requires human approval. Never let the AI install packages, modify CI, or push to protected branches without review.