# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Development Environment

This repository uses Nix for managing development tools. Enter the development shell:

```bash
nix-shell
```

The shell automatically configures:
- `TALOSCONFIG` → `testing1/.talosconfig`
- `KUBECONFIG` → `testing1/kubeconfig`
- `NIX_PROJECT_SHELL` → `kubernetes-management`

Available tools in the Nix shell:
- `talosctl` - Talos Linux cluster management
- `kubectl` - Kubernetes cluster management
- `flux` - FluxCD GitOps toolkit

## Cluster Bootstrap

To bootstrap a new Talos cluster from scratch, use the provided bootstrap script:

```bash
# Enter the Nix shell first
nix-shell

# Run the bootstrap script
./bootstrap-cluster.sh
```

The bootstrap script (`bootstrap-cluster.sh`) will:
1. Generate new Talos secrets and machine configurations
2. Apply configurations to all nodes (10.0.1.3, 10.0.1.4, 10.0.1.5)
3. Bootstrap etcd on the first control plane node
4. Retrieve the kubeconfig
5. Verify cluster health

All generated files are saved to the `testing1/` directory:
- `testing1/.talosconfig` - Talos client configuration
- `testing1/kubeconfig` - Kubernetes client configuration
- `testing1/secrets.yaml` - Cluster secrets (keep secure!)
- `testing1/controlplane-*.yaml` - Per-node configurations

### Troubleshooting Bootstrap

If nodes remain in maintenance mode or bootstrap fails:

1. **Check cluster status**:
   ```bash
   ./check-cluster-status.sh
   ```

2. **Manual bootstrap process**:
   If the automated script fails, bootstrap manually:

   ```bash
   # Step 1: Check that nodes are accessible
   talosctl --nodes 10.0.1.3 version

   # Step 2: Apply config to each node if it is in maintenance mode
   talosctl apply-config --insecure --nodes 10.0.1.3 --file testing1/controlplane-10.0.1.3.yaml
   talosctl apply-config --insecure --nodes 10.0.1.4 --file testing1/controlplane-10.0.1.4.yaml
   talosctl apply-config --insecure --nodes 10.0.1.5 --file testing1/controlplane-10.0.1.5.yaml

   # Step 3: Wait for the nodes to reboot (2-5 minutes)
   # Check with: talosctl --nodes 10.0.1.3 get services

   # Step 4: Bootstrap etcd on the first node
   talosctl bootstrap --nodes 10.0.1.3

   # Step 5: Wait for Kubernetes (1-2 minutes)
   # Check with: talosctl --nodes 10.0.1.3 service etcd status

   # Step 6: Get the kubeconfig
   talosctl kubeconfig --nodes 10.0.1.3 testing1/kubeconfig --force

   # Step 7: Verify the cluster
   kubectl get nodes
   ```

3. **Common issues**:
   - **Nodes in maintenance mode**: Config not applied or nodes didn't reboot
   - **Bootstrap fails**: Node not ready; check with `talosctl get services`
   - **etcd won't start**: May need to reset the nodes and start over

## Storage Setup

Talos Linux does not include a default storage provisioner. You must install one before deploying applications that require persistent storage.

### Install Local Path Provisioner (Recommended)

```bash
# Enter the Nix shell
nix-shell

# Install local-path-provisioner
./install-local-path-storage.sh
```

This installs Rancher's local-path-provisioner, which:
- Dynamically provisions PersistentVolumes on local node storage
- Sets itself as the default storage class
- Is simple and works well for single-node or testing clusters

**Important**: Local-path storage is NOT replicated. If a node fails, its data is lost.
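With the provisioner in place, workloads request storage through an ordinary PersistentVolumeClaim. A minimal sketch (the claim name `demo-data`, the namespace, and the size are illustrative, not part of this repository):

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-data        # illustrative name
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce      # local-path volumes live on a single node
  storageClassName: local-path
  resources:
    requests:
      storage: 1Gi
```

Note that local-path-provisioner typically binds with `WaitForFirstConsumer`, so the PVC stays `Pending` until a pod actually mounts it.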
### Verify Storage

```bash
# Check the storage class
kubectl get storageclass

# Check that the provisioner is running
kubectl get pods -n local-path-storage
```

### Alternative Storage Options

For production HA setups, consider:
- **OpenEBS**: Distributed block storage with replication
- **Rook-Ceph**: Full-featured distributed storage system
- **Longhorn**: Cloud-native distributed storage

## Common Commands

### Talos Cluster Management

```bash
# Check cluster health
talosctl health

# Get cluster nodes
talosctl get members

# Apply configuration changes to a control plane node
talosctl apply-config --file testing1/controlplane.yaml --nodes <node-ip>

# Apply configuration changes to a worker node
talosctl apply-config --file testing1/worker.yaml --nodes <node-ip>

# Get Talos version
talosctl version

# Access the Talos dashboard
talosctl dashboard
```

### Kubernetes Management

```bash
# Get cluster info
kubectl cluster-info

# Get all resources in all namespaces
kubectl get all -A

# Apply manifests from first-cluster
kubectl apply -f testing1/first-cluster/cluster/base/
kubectl apply -f testing1/first-cluster/apps/demo/

# Deploy applications using kustomize
kubectl apply -k testing1/first-cluster/apps/gitlab/
kubectl apply -k testing1/first-cluster/apps/<app-name>/
```

### GitLab Management

**Prerequisites**: A storage provisioner must be installed first (see the Storage Setup section).

```bash
# Deploy GitLab with Container Registry and Runner
kubectl apply -k testing1/first-cluster/apps/gitlab/

# Check GitLab status
kubectl get pods -n gitlab -w

# Check PVC status (should be Bound)
kubectl get pvc -n gitlab

# Get the initial root password
kubectl exec -n gitlab deployment/gitlab -- grep 'Password:' /etc/gitlab/initial_root_password

# Access GitLab services
# - GitLab UI: http://<node-ip>:30080
# - SSH: <node-ip>:30022
# - Container Registry: http://<node-ip>:30500

# Restart the GitLab Runner after updating the registration token
kubectl rollout restart deployment/gitlab-runner -n gitlab

# Check runner logs
kubectl logs -n gitlab deployment/gitlab-runner -f
```

### GitLab Troubleshooting

If GitLab pods are stuck in Pending:

```bash
# Check for storage issues
./diagnose-storage.sh

# If no storage provisioner is installed, install one
./install-local-path-storage.sh

# Redeploy GitLab with storage
./redeploy-gitlab.sh
```

## Architecture

### Repository Structure

This is a Talos Kubernetes cluster management repository with the following structure:

- **testing1/** - Active testing cluster configuration
  - **controlplane.yaml** - Talos config for control plane nodes (Kubernetes 1.33.0)
  - **worker.yaml** - Talos config for worker nodes
  - **.talosconfig** - Talos client configuration
  - **kubeconfig** - Kubernetes client configuration
  - **first-cluster/** - Kubernetes manifests in a GitOps structure
    - **cluster/base/** - Cluster-level resources (namespaces, etc.)
    - **apps/demo/** - Application deployments (nginx demo)
    - **apps/gitlab/** - GitLab CE with Container Registry and CI/CD Runner
- **prod1/** - Production cluster placeholder (currently empty)
- **shell.nix** - Nix development environment definition
- **bootstrap-cluster.sh** - Automated cluster bootstrap script
- **check-cluster-status.sh** - Cluster status diagnostic tool
- **install-local-path-storage.sh** - Install the storage provisioner
- **diagnose-storage.sh** - Storage diagnostic tool
- **redeploy-gitlab.sh** - GitLab cleanup and redeployment
- **APP_DEPLOYMENT.md** - Comprehensive guide for deploying applications

### Cluster Configuration

The Talos cluster uses:
- **Kubernetes version**: 1.33.0 (kubelet image: `ghcr.io/siderolabs/kubelet:v1.33.0`)
- **Machine token**: `dhmkxg.kgt4nn0mw72kd3yb` (shared between control plane and workers)
- **Security**: Seccomp profiles enabled by default
- **Manifests directory**: Disabled (the kubelet doesn't read from `/etc/kubernetes/manifests`)
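The pinned kubelet version above can be confirmed straight from the machine config. A small sketch, assuming the config carries the standard `image: ghcr.io/siderolabs/kubelet:vX.Y.Z` field:

```shell
# Print the kubelet image pinned in the control plane config
# (path per this repository's layout; adjust for other clusters).
grep -o 'ghcr.io/siderolabs/kubelet:v[0-9.]*' testing1/controlplane.yaml
```

Comparing this against the versions reported by `kubectl get nodes -o wide` is a quick way to spot a config that was edited but never applied.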
### GitOps Structure

Kubernetes manifests in `testing1/first-cluster/` follow a GitOps-friendly layout:
- **cluster/** - Cluster infrastructure and base resources
- **apps/** - Application workloads organized by app name

Each app in `apps/` contains its own deployment and service definitions.

## Configuration Files

When modifying Talos configurations:
1. Edit `testing1/controlplane.yaml` for control plane changes
2. Edit `testing1/worker.yaml` for worker node changes
3. Apply changes using `talosctl apply-config` with the appropriate node IPs
4. Always specify the `--nodes` flag to target specific nodes

When adding Kubernetes workloads:
1. Place cluster-level resources in `testing1/first-cluster/cluster/base/`
2. Place application manifests in `testing1/first-cluster/apps/<app-name>/`
3. Create a `kustomization.yaml` file to organize resources
4. Apply using `kubectl apply -k testing1/first-cluster/apps/<app-name>/`
5. See `APP_DEPLOYMENT.md` for a detailed guide on adding new applications

## Deployed Applications

### GitLab (testing1/first-cluster/apps/gitlab/)

GitLab CE deployment with an integrated Container Registry and CI/CD runner.

**Components:**
- **GitLab CE 16.11.1**: Main GitLab instance
- **Container Registry**: Docker image registry (port 5005/30500)
- **GitLab Runner**: CI/CD runner with Docker-in-Docker support

**Access:**
- UI: `http://<node-ip>:30080`
- SSH: `<node-ip>:30022`
- Registry: `http://<node-ip>:30500`

**Storage:**
- `gitlab-data`: 50Gi - Git repositories, artifacts, uploads
- `gitlab-config`: 5Gi - Configuration files
- `gitlab-logs`: 5Gi - Application logs

**Initial Setup:**
1. Deploy: `kubectl apply -k testing1/first-cluster/apps/gitlab/`
2. Wait for pods to be ready (5-10 minutes)
3. Get the root password: `kubectl exec -n gitlab deployment/gitlab -- grep 'Password:' /etc/gitlab/initial_root_password`
4. Access the UI and configure the runner registration token
5. Update `testing1/first-cluster/apps/gitlab/runner-secret.yaml` with the token
6. Restart the runner: `kubectl rollout restart deployment/gitlab-runner -n gitlab`

**CI/CD Configuration:**

The runner is configured for building Docker images with:
- Executor: Docker
- Privileged mode enabled
- Access to the host Docker socket
- Tags: `docker`, `kubernetes`, `dind`

Example `.gitlab-ci.yml` for building container images:
```yaml
stages:
  - build

build-image:
  stage: build
  image: docker:24
  tags:
    - docker
  script:
    - echo "$CI_REGISTRY_PASSWORD" | docker login -u "$CI_REGISTRY_USER" --password-stdin "$CI_REGISTRY"
    - docker build -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG"
```
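The `$CI_COMMIT_REF_SLUG` used in the image tag is GitLab's URL-safe form of the branch name. A rough sketch of how it is derived (an approximation of GitLab's slug rules, not the exact implementation):

```shell
# Approximate CI_COMMIT_REF_SLUG: lowercase, collapse anything outside
# [a-z0-9] to '-', trim leading/trailing '-', cap at 63 characters.
slugify() {
  echo "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/[^a-z0-9]+/-/g; s/^-+//; s/-+$//' \
    | cut -c1-63
}

slugify "Feature/My_Branch"   # → feature-my-branch
```

This matters when other jobs or deployments need to reconstruct the image tag: always reuse `$CI_COMMIT_REF_SLUG` rather than the raw branch name, since `Feature/My_Branch` is not a valid Docker tag.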