Compare commits

...

5 Commits

Author SHA1 Message Date
3ae15ffdfe feat(scripts): add GitLab cleanup and redeploy utility
Add script to cleanly remove and redeploy GitLab:

redeploy-gitlab.sh:
- Deletes existing GitLab deployment and resources
- Removes associated PVCs and data
- Reapplies GitLab manifests from scratch
- Useful for recovering from misconfiguration
- Displays new root password after deployment

Note: Repository now uses Gitea instead of GitLab, but this
script remains for reference or alternative deployments.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-03-04 01:53:20 +00:00
f6a3b57bcc feat(scripts): add storage provisioner utilities
Add helper scripts for storage management:

install-local-path-storage.sh:
- Installs Rancher local-path-provisioner
- Sets it as default storage class
- Useful for local testing and single-node scenarios
- Alternative to NFS for simpler setups

diagnose-storage.sh:
- Diagnoses storage-related issues
- Checks for provisioner installation
- Lists storage classes and PVC status
- Identifies pods stuck due to storage problems

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-03-04 01:53:12 +00:00
6c292da5f1 feat(scripts): add cluster bootstrap and status scripts
Add automated scripts for Talos cluster management:

bootstrap-cluster.sh:
- Automated cluster bootstrap from scratch
- Generates Talos secrets and machine configs
- Applies configs to all nodes (10.0.1.3-5)
- Bootstraps etcd and retrieves kubeconfig
- Verifies cluster health

check-cluster-status.sh:
- Comprehensive cluster health diagnostics
- Checks Talos services, etcd, and Kubernetes components
- Displays node status and running pods
- Useful for troubleshooting bootstrap issues

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-03-04 01:53:05 +00:00
2ed1e82953 docs: add comprehensive application deployment guide
Add APP_DEPLOYMENT.md with step-by-step guide for deploying applications
to the Talos Kubernetes cluster.

Covers:
- Directory structure and GitOps organization
- Creating namespaces and deployments
- Configuring services and ingress
- Storage with PersistentVolumeClaims
- Using Kustomize for manifest management
- Examples for common application types

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-03-04 01:52:56 +00:00
ea415ba584 docs: add Claude Code project instructions
Add CLAUDE.md with comprehensive guidance for Claude Code when working
with this Talos Kubernetes cluster repository.

Includes:
- Development environment setup (Nix shell)
- Cluster bootstrap procedures
- Storage provisioner installation
- Common commands for Talos and Kubernetes
- GitLab and Gitea deployment instructions
- Troubleshooting guides

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2026-03-04 01:52:49 +00:00
7 changed files with 1551 additions and 0 deletions

APP_DEPLOYMENT.md
@@ -0,0 +1,445 @@
# Application Deployment Guide
This guide explains how to deploy applications to your Talos Kubernetes cluster following the GitOps structure used in this repository.
## Directory Structure
Applications are organized in the `testing1/first-cluster/apps/` directory:
```
testing1/first-cluster/
├── cluster/
│   └── base/                  # Cluster-level resources (namespaces, RBAC, etc.)
└── apps/
    ├── demo/                  # Example nginx app
    │   ├── nginx-deployment.yaml
    │   └── nginx-service.yaml
    └── gitlab/                # GitLab with Container Registry
        ├── namespace.yaml
        ├── pvc.yaml
        ├── configmap.yaml
        ├── deployment.yaml
        ├── service.yaml
        ├── runner-secret.yaml
        ├── runner-configmap.yaml
        ├── runner-deployment.yaml
        └── kustomization.yaml
```
## Deploying Applications
### Method 1: Direct kubectl apply
Apply individual app manifests:
```bash
# Deploy a specific app
kubectl apply -f testing1/first-cluster/apps/gitlab/
# Or use kustomize
kubectl apply -k testing1/first-cluster/apps/gitlab/
```
### Method 2: Using kustomize (Recommended)
Each app directory can contain a `kustomization.yaml` file that lists all resources:
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - deployment.yaml
  - service.yaml
```
Deploy with:
```bash
kubectl apply -k testing1/first-cluster/apps/<app-name>/
```
## Adding a New Application
Follow these steps to add a new application to your cluster:
### 1. Create App Directory
```bash
mkdir -p testing1/first-cluster/apps/<app-name>
cd testing1/first-cluster/apps/<app-name>
```
### 2. Create Namespace (Optional but Recommended)
Create `namespace.yaml`:
```yaml
apiVersion: v1
kind: Namespace
metadata:
name: <app-name>
```
### 3. Create Application Resources
Create the necessary Kubernetes resources. Common resources include:
#### Deployment
Create `deployment.yaml`:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: <app-name>
  namespace: <app-name>
  labels:
    app: <app-name>
spec:
  replicas: 1
  selector:
    matchLabels:
      app: <app-name>
  template:
    metadata:
      labels:
        app: <app-name>
    spec:
      containers:
        - name: <container-name>
          image: <image:tag>
          ports:
            - containerPort: <port>
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
```
#### Service
Create `service.yaml`:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: <app-name>
  namespace: <app-name>
spec:
  type: NodePort  # or ClusterIP, LoadBalancer
  selector:
    app: <app-name>
  ports:
    - port: 80
      targetPort: <container-port>
      nodePort: 30XXX  # if using NodePort (30000-32767)
```
#### PersistentVolumeClaim (if needed)
Create `pvc.yaml`:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: <app-name>-data
  namespace: <app-name>
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
```
#### ConfigMap (if needed)
Create `configmap.yaml`:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: <app-name>-config
  namespace: <app-name>
data:
  config.yml: |
    # Your configuration here
```
#### Secret (if needed)
Create `secret.yaml`:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: <app-name>-secret
  namespace: <app-name>
type: Opaque
stringData:
  password: "change-me"
  api-key: "your-api-key"
```
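Note that `stringData` accepts plaintext values, which Kubernetes base64-encodes for you; the alternative `data` field requires you to supply the base64 encoding yourself. A quick sanity check of the encoding:

```bash
# stringData lets you write plaintext; data requires base64.
# These two specs are equivalent:
#   stringData: { password: "change-me" }
#   data:       { password: "Y2hhbmdlLW1l" }
echo -n "change-me" | base64   # -n: omit the trailing newline, or the encoding changes
```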
### 4. Create Kustomization File
Create `kustomization.yaml` to organize all resources:
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - pvc.yaml
  - configmap.yaml
  - secret.yaml
  - deployment.yaml
  - service.yaml
```
### 5. Deploy the Application
```bash
# From the repository root
kubectl apply -k testing1/first-cluster/apps/<app-name>/
# Verify deployment
kubectl get all -n <app-name>
```
## GitLab Deployment Example
### Prerequisites
1. Ensure your cluster is running and healthy:
```bash
kubectl get nodes
talosctl health
```
2. **IMPORTANT**: Install a storage provisioner first:
```bash
# Check if storage class exists
kubectl get storageclass
# If no storage class found, install local-path-provisioner
./install-local-path-storage.sh
```
Without a storage provisioner, GitLab's PersistentVolumeClaims will remain in Pending state and pods won't start.
### Deploy GitLab
1. **Update the runner registration token** in `testing1/first-cluster/apps/gitlab/runner-secret.yaml`:
After GitLab is running, get the registration token from:
- GitLab UI: `Admin Area > CI/CD > Runners > Register an instance runner`
- Or for project runners: `Settings > CI/CD > Runners > New project runner`
2. **Deploy GitLab and Runner**:
```bash
kubectl apply -k testing1/first-cluster/apps/gitlab/
```
3. **Wait for GitLab to be ready** (this can take 5-10 minutes):
```bash
kubectl get pods -n gitlab -w
```
4. **Access GitLab**:
- GitLab UI: `http://<any-node-ip>:30080`
- SSH: `<any-node-ip>:30022`
- Container Registry: `http://<any-node-ip>:30500`
5. **Get initial root password**:
```bash
kubectl exec -n gitlab deployment/gitlab -- grep 'Password:' /etc/gitlab/initial_root_password
```
6. **Configure GitLab Runner**:
- Log in to GitLab
- Get the runner registration token
- Update `runner-secret.yaml` with the token
- Re-apply the secret:
```bash
kubectl apply -f testing1/first-cluster/apps/gitlab/runner-secret.yaml
```
- Restart the runner:
```bash
kubectl rollout restart deployment/gitlab-runner -n gitlab
```
### Using the Container Registry
1. **Log in to the registry**:
```bash
docker login <node-ip>:30500
```
2. **Tag and push images**:
```bash
docker tag myapp:latest <node-ip>:30500/mygroup/myapp:latest
docker push <node-ip>:30500/mygroup/myapp:latest
```
3. **Example `.gitlab-ci.yml` for building Docker images**:
```yaml
stages:
  - build
  - push

variables:
  DOCKER_DRIVER: overlay2
  DOCKER_TLS_CERTDIR: ""
  IMAGE_TAG: $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG

build:
  stage: build
  image: docker:24-dind
  services:
    - docker:24-dind
  tags:
    - docker
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker build -t $IMAGE_TAG .
    - docker push $IMAGE_TAG
```
## Resource Sizing Guidelines
When adding applications, consider these resource guidelines:
### Small Applications (web frontends, APIs)
```yaml
resources:
  requests:
    cpu: "100m"
    memory: "128Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```
### Medium Applications (databases, caching)
```yaml
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2000m"
    memory: "4Gi"
```
### Large Applications (GitLab, monitoring stacks)
```yaml
resources:
  requests:
    cpu: "1000m"
    memory: "4Gi"
  limits:
    cpu: "4000m"
    memory: "8Gi"
```
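CPU requests and limits use millicores (1000m = 1 core), so it is easy to check by hand whether a set of apps fits a node. A minimal sketch with illustrative values (the request amounts and node size below are assumptions, not taken from this cluster):

```bash
# Sum CPU requests in millicores and compare against a node's allocatable CPU.
requests_m=(100 500 1000)    # hypothetical per-app CPU requests, in millicores
node_allocatable_m=2000      # a 2-core node
total=0
for r in "${requests_m[@]}"; do
  total=$((total + r))
done
echo "total requests: ${total}m of ${node_allocatable_m}m"
```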
## Service Types
### ClusterIP (default)
- Only accessible within the cluster
- Use for internal services
### NodePort
- Accessible on every node's IP at a static port (30000-32767)
- Use for services you need to access from outside the cluster
- Example: GitLab on port 30080
### LoadBalancer
- Creates an external load balancer (if cloud provider supports it)
- On bare metal, requires MetalLB or similar
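In the manifest, the only difference between these three is the `type` field. A minimal sketch (name, selector, and ports are placeholders):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example          # placeholder name
spec:
  type: ClusterIP        # change to NodePort or LoadBalancer as needed
  selector:
    app: example
  ports:
    - port: 80
      targetPort: 8080
      # nodePort: 30080  # only valid when type is NodePort (30000-32767)
```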
## Storage Considerations
### Access Modes
- `ReadWriteOnce` (RWO): Single node read/write (most common)
- `ReadOnlyMany` (ROX): Multiple nodes read-only
- `ReadWriteMany` (RWX): Multiple nodes read/write (requires special storage)
### Storage Sizing
- Logs: 1-5 GB
- Application data: 10-50 GB
- Databases: 50-100+ GB
- Container registries: 100+ GB
## Troubleshooting
### Check Pod Status
```bash
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace>
```
### Check Events
```bash
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```
### Check Resource Usage
```bash
kubectl top nodes
kubectl top pods -n <namespace>
```
### Common Issues
1. **ImagePullBackOff**: Container image cannot be pulled
- Check image name and tag
- Verify registry credentials if using private registry
2. **CrashLoopBackOff**: Container keeps crashing
- Check logs: `kubectl logs <pod> -n <namespace>`
- Check resource limits
- Verify configuration
3. **Pending Pods**: Pod cannot be scheduled
- Check node resources: `kubectl describe node`
- Check PVC status if using storage
- Verify node selectors/taints
4. **PVC Stuck in Pending**: Storage cannot be provisioned
- **Most common issue on Talos**: No storage provisioner installed
- Check if storage class exists: `kubectl get sc`
- If no storage class, install one:
```bash
./install-local-path-storage.sh
```
- Check PVC events: `kubectl describe pvc <pvc-name> -n <namespace>`
- For GitLab specifically, use the redeploy script:
```bash
./redeploy-gitlab.sh
```
- Verify storage is available on nodes
5. **Storage Provisioner Issues**
- Run diagnostics: `./diagnose-storage.sh`
- Check provisioner pods: `kubectl get pods -n local-path-storage`
- View provisioner logs: `kubectl logs -n local-path-storage deployment/local-path-provisioner`
## Next Steps
- Set up FluxCD for GitOps automation
- Configure ingress controller for HTTP/HTTPS routing
- Set up monitoring with Prometheus and Grafana
- Implement backup solutions for persistent data
- Configure network policies for security

CLAUDE.md
@@ -0,0 +1,325 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Development Environment
This repository uses Nix for managing development tools. Enter the development shell:
```bash
nix-shell
```
The shell automatically configures:
- `TALOSCONFIG` → `testing1/.talosconfig`
- `KUBECONFIG` → `testing1/kubeconfig`
- `NIX_PROJECT_SHELL` → `kubernetes-management`
Available tools in the Nix shell:
- `talosctl` - Talos Linux cluster management
- `kubectl` - Kubernetes cluster management
- `flux` - FluxCD GitOps toolkit
## Cluster Bootstrap
To bootstrap a new Talos cluster from scratch, use the provided bootstrap script:
```bash
# Enter the Nix shell first
nix-shell
# Run the bootstrap script
./bootstrap-cluster.sh
```
The bootstrap script (`bootstrap-cluster.sh`) will:
1. Generate new Talos secrets and machine configurations
2. Apply configurations to all nodes (10.0.1.3, 10.0.1.4, 10.0.1.5)
3. Bootstrap etcd on the first control plane
4. Retrieve kubeconfig
5. Verify cluster health
All generated files are saved to `testing1/` directory:
- `testing1/.talosconfig` - Talos client configuration
- `testing1/kubeconfig` - Kubernetes client configuration
- `testing1/secrets.yaml` - Cluster secrets (keep secure!)
- `testing1/controlplane-*.yaml` - Per-node configurations
### Troubleshooting Bootstrap
If nodes remain in maintenance mode or bootstrap fails:
1. **Check cluster status**:
```bash
./check-cluster-status.sh
```
2. **Manual bootstrap process**:
If the automated script fails, bootstrap manually:
```bash
# Step 1: Check if nodes are accessible
talosctl --nodes 10.0.1.3 version
# Step 2: Apply config to each node if in maintenance mode
talosctl apply-config --insecure --nodes 10.0.1.3 --file testing1/controlplane-10.0.1.3.yaml
talosctl apply-config --insecure --nodes 10.0.1.4 --file testing1/controlplane-10.0.1.4.yaml
talosctl apply-config --insecure --nodes 10.0.1.5 --file testing1/controlplane-10.0.1.5.yaml
# Step 3: Wait for nodes to reboot (2-5 minutes)
# Check with: talosctl --nodes 10.0.1.3 get services
# Step 4: Bootstrap etcd on first node
talosctl bootstrap --nodes 10.0.1.3
# Step 5: Wait for Kubernetes (1-2 minutes)
# Check with: talosctl --nodes 10.0.1.3 service etcd status
# Step 6: Get kubeconfig
talosctl kubeconfig --nodes 10.0.1.3 testing1/kubeconfig --force
# Step 7: Verify cluster
kubectl get nodes
```
3. **Common issues**:
- **Nodes in maintenance mode**: Config not applied or nodes didn't reboot
- **Bootstrap fails**: Node not ready, check with `talosctl get services`
- **etcd won't start**: May need to reset nodes and start over
## Storage Setup
Talos Linux does not include a default storage provisioner. You must install one before deploying applications that require persistent storage.
### Install Local Path Provisioner (Recommended)
```bash
# Enter nix-shell
nix-shell
# Install local-path-provisioner
./install-local-path-storage.sh
```
This installs Rancher's local-path-provisioner which:
- Dynamically provisions PersistentVolumes on local node storage
- Sets itself as the default storage class
- Simple and works well for single-node or testing clusters
**Important**: Local-path storage is NOT replicated. If a node fails, data is lost.
### Verify Storage
```bash
# Check storage class
kubectl get storageclass
# Check provisioner is running
kubectl get pods -n local-path-storage
```
### Alternative Storage Options
For production HA setups, consider:
- **OpenEBS**: Distributed block storage with replication
- **Rook-Ceph**: Full-featured distributed storage system
- **Longhorn**: Cloud-native distributed storage
## Common Commands
### Talos Cluster Management
```bash
# Check cluster health
talosctl health
# Get cluster nodes
talosctl get members
# Apply configuration changes to controlplane
talosctl apply-config --file testing1/controlplane.yaml --nodes <node-ip>
# Apply configuration changes to worker
talosctl apply-config --file testing1/worker.yaml --nodes <node-ip>
# Get Talos version
talosctl version
# Access Talos dashboard
talosctl dashboard
```
### Kubernetes Management
```bash
# Get cluster info
kubectl cluster-info
# Get all resources in all namespaces
kubectl get all -A
# Get nodes
kubectl get nodes
# Apply manifests from first-cluster
kubectl apply -f testing1/first-cluster/cluster/base/
kubectl apply -f testing1/first-cluster/apps/demo/
# Deploy applications using kustomize
kubectl apply -k testing1/first-cluster/apps/gitlab/
kubectl apply -k testing1/first-cluster/apps/<app-name>/
```
### GitLab Management
**Prerequisites**: Storage provisioner must be installed first (see Storage Setup section)
```bash
# Deploy GitLab with Container Registry and Runner
kubectl apply -k testing1/first-cluster/apps/gitlab/
# Check GitLab status
kubectl get pods -n gitlab -w
# Check PVC status (should be Bound)
kubectl get pvc -n gitlab
# Get initial root password
kubectl exec -n gitlab deployment/gitlab -- grep 'Password:' /etc/gitlab/initial_root_password
# Access GitLab services
# - GitLab UI: http://<node-ip>:30080
# - SSH: <node-ip>:30022
# - Container Registry: http://<node-ip>:30500
# Restart GitLab Runner after updating registration token
kubectl rollout restart deployment/gitlab-runner -n gitlab
# Check runner logs
kubectl logs -n gitlab deployment/gitlab-runner -f
```
### GitLab Troubleshooting
If GitLab pods are stuck in Pending:
```bash
# Check storage issues
./diagnose-storage.sh
# If no storage provisioner, install it
./install-local-path-storage.sh
# Redeploy GitLab with storage
./redeploy-gitlab.sh
```
## Architecture
### Repository Structure
This is a Talos Kubernetes cluster management repository with the following structure:
- **testing1/** - Active testing cluster configuration
  - **controlplane.yaml** - Talos config for control plane nodes (Kubernetes 1.33.0)
  - **worker.yaml** - Talos config for worker nodes
  - **.talosconfig** - Talos client configuration
  - **kubeconfig** - Kubernetes client configuration
  - **first-cluster/** - Kubernetes manifests in GitOps structure
    - **cluster/base/** - Cluster-level resources (namespaces, etc.)
    - **apps/demo/** - Application deployments (nginx demo)
    - **apps/gitlab/** - GitLab CE with Container Registry and CI/CD Runner
- **prod1/** - Production cluster placeholder (currently empty)
- **shell.nix** - Nix development environment definition
- **bootstrap-cluster.sh** - Automated cluster bootstrap script
- **check-cluster-status.sh** - Cluster status diagnostic tool
- **install-local-path-storage.sh** - Install storage provisioner
- **diagnose-storage.sh** - Storage diagnostic tool
- **redeploy-gitlab.sh** - GitLab cleanup and redeployment
- **APP_DEPLOYMENT.md** - Comprehensive guide for deploying applications
### Cluster Configuration
The Talos cluster uses:
- **Kubernetes version**: 1.33.0 (kubelet image: `ghcr.io/siderolabs/kubelet:v1.33.0`)
- **Machine token**: `dhmkxg.kgt4nn0mw72kd3yb` (shared between control plane and workers)
- **Security**: Seccomp profiles enabled by default
- **Manifests directory**: Disabled (kubelet doesn't read from `/etc/kubernetes/manifests`)
### GitOps Structure
Kubernetes manifests in `testing1/first-cluster/` follow a GitOps-friendly layout:
- **cluster/** - Cluster infrastructure and base resources
- **apps/** - Application workloads organized by app name
Each app in `apps/` contains its own deployment and service definitions.
## Configuration Files
When modifying Talos configurations:
1. Edit `testing1/controlplane.yaml` for control plane changes
2. Edit `testing1/worker.yaml` for worker node changes
3. Apply changes using `talosctl apply-config` with the appropriate node IPs
4. Always specify `--nodes` flag to target specific nodes
When adding Kubernetes workloads:
1. Place cluster-level resources in `testing1/first-cluster/cluster/base/`
2. Place application manifests in `testing1/first-cluster/apps/<app-name>/`
3. Create a `kustomization.yaml` file to organize resources
4. Apply using `kubectl apply -k testing1/first-cluster/apps/<app-name>/`
5. See `APP_DEPLOYMENT.md` for detailed guide on adding new applications
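The workload steps above can be sketched as a small scaffold (the app name `myapp` and the resource list are placeholders; the actual manifests still need to be written before applying):

```bash
# Scaffold a new app directory following this repo's layout (sketch).
app=myapp   # hypothetical app name
dir="testing1/first-cluster/apps/${app}"
mkdir -p "${dir}"
cat > "${dir}/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - namespace.yaml
  - deployment.yaml
  - service.yaml
EOF
echo "created ${dir}/kustomization.yaml"
# After adding the listed manifests:
#   kubectl apply -k "${dir}/"
```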
## Deployed Applications
### GitLab (testing1/first-cluster/apps/gitlab/)
GitLab CE deployment with integrated Container Registry and CI/CD runner.
**Components:**
- **GitLab CE 16.11.1**: Main GitLab instance
- **Container Registry**: Docker image registry (port 5005/30500)
- **GitLab Runner**: CI/CD runner with Docker-in-Docker support
**Access:**
- UI: `http://<node-ip>:30080`
- SSH: `<node-ip>:30022`
- Registry: `http://<node-ip>:30500`
**Storage:**
- `gitlab-data`: 50Gi - Git repositories, artifacts, uploads
- `gitlab-config`: 5Gi - Configuration files
- `gitlab-logs`: 5Gi - Application logs
**Initial Setup:**
1. Deploy: `kubectl apply -k testing1/first-cluster/apps/gitlab/`
2. Wait for pods to be ready (5-10 minutes)
3. Get root password: `kubectl exec -n gitlab deployment/gitlab -- grep 'Password:' /etc/gitlab/initial_root_password`
4. Access UI and configure runner registration token
5. Update `testing1/first-cluster/apps/gitlab/runner-secret.yaml` with token
6. Restart runner: `kubectl rollout restart deployment/gitlab-runner -n gitlab`
**CI/CD Configuration:**
The runner is configured for building Docker images with:
- Executor: Docker
- Privileged mode enabled
- Access to host Docker socket
- Tags: `docker`, `kubernetes`, `dind`
Example `.gitlab-ci.yml` for building container images:
```yaml
stages:
  - build

build-image:
  stage: build
  image: docker:24-dind
  tags:
    - docker
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    - docker build -t $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG .
    - docker push $CI_REGISTRY_IMAGE:$CI_COMMIT_REF_SLUG
```

bootstrap-cluster.sh (executable)
@@ -0,0 +1,367 @@
#!/usr/bin/env bash
set -euo pipefail
# Configuration
CLUSTER_NAME="talos-cluster"
CONTROL_PLANE_NODES=("10.0.1.3" "10.0.1.4" "10.0.1.5")
CLUSTER_ENDPOINT="https://10.0.1.3:6443"
KUBERNETES_VERSION="1.33.0"
OUTPUT_DIR="testing1"
# Colors for output
GREEN='\033[0;32m'
BLUE='\033[0;34m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m' # No Color
log_info() {
    echo -e "${BLUE}[INFO]${NC} $1"
}

log_success() {
    echo -e "${GREEN}[SUCCESS]${NC} $1"
}

log_warning() {
    echo -e "${YELLOW}[WARNING]${NC} $1"
}

log_error() {
    echo -e "${RED}[ERROR]${NC} $1"
}

# Check prerequisites
check_prerequisites() {
    log_info "Checking prerequisites..."

    if ! command -v talosctl &> /dev/null; then
        log_error "talosctl not found. Please run 'nix-shell' first."
        exit 1
    fi

    if ! command -v kubectl &> /dev/null; then
        log_error "kubectl not found. Please run 'nix-shell' first."
        exit 1
    fi

    log_success "All prerequisites met"
}
# Generate Talos secrets and configurations
generate_configs() {
    log_info "Generating Talos secrets for cluster: ${CLUSTER_NAME}"

    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"

    # Generate secrets
    talosctl gen secrets --force -o "${OUTPUT_DIR}/secrets.yaml"
    log_success "Secrets generated"

    # Generate configs for all 3 control plane nodes
    log_info "Generating machine configurations..."
    for i in "${!CONTROL_PLANE_NODES[@]}"; do
        NODE_IP="${CONTROL_PLANE_NODES[$i]}"
        log_info "Generating config for control plane node: ${NODE_IP}"

        talosctl gen config "${CLUSTER_NAME}" "${CLUSTER_ENDPOINT}" \
            --with-secrets "${OUTPUT_DIR}/secrets.yaml" \
            --kubernetes-version="${KUBERNETES_VERSION}" \
            --output-types controlplane \
            --output "${OUTPUT_DIR}/controlplane-${NODE_IP}.yaml" \
            --force \
            --config-patch @<(cat <<EOF
machine:
  network:
    hostname: cp-${i}
  certSANs:
    - ${NODE_IP}
    - 10.0.1.3
    - 10.0.1.4
    - 10.0.1.5
cluster:
  allowSchedulingOnControlPlanes: true
  controlPlane:
    endpoint: ${CLUSTER_ENDPOINT}
EOF
)
    done

    # Generate talosconfig
    talosctl gen config "${CLUSTER_NAME}" "${CLUSTER_ENDPOINT}" \
        --with-secrets "${OUTPUT_DIR}/secrets.yaml" \
        --output-types talosconfig \
        --force \
        --output "${OUTPUT_DIR}/.talosconfig"

    # Configure talosctl to use the new config
    export TALOSCONFIG="${OUTPUT_DIR}/.talosconfig"

    # Add all endpoints to talosconfig
    talosctl config endpoint "${CONTROL_PLANE_NODES[@]}"
    talosctl config node "${CONTROL_PLANE_NODES[0]}"

    log_success "All configurations generated in ${OUTPUT_DIR}/"
}
# Apply configurations to nodes
apply_configs() {
    log_info "Applying configurations to nodes..."
    export TALOSCONFIG="${OUTPUT_DIR}/.talosconfig"

    for NODE_IP in "${CONTROL_PLANE_NODES[@]}"; do
        log_info "Applying config to ${NODE_IP}..."

        # Apply config with --insecure flag for initial bootstrap
        if talosctl apply-config \
            --insecure \
            --nodes "${NODE_IP}" \
            --file "${OUTPUT_DIR}/controlplane-${NODE_IP}.yaml"; then
            log_success "Configuration applied to ${NODE_IP}"
        else
            log_error "Failed to apply configuration to ${NODE_IP}"
            exit 1
        fi

        # Brief pause between nodes
        sleep 2
    done

    log_success "Configurations applied to all nodes"
}
# Wait for nodes to be ready
wait_for_nodes() {
    log_info "Waiting for nodes to reboot and be ready..."
    export TALOSCONFIG="${OUTPUT_DIR}/.talosconfig"

    # Wait for each node to be accessible
    for NODE_IP in "${CONTROL_PLANE_NODES[@]}"; do
        log_info "Waiting for node ${NODE_IP} to be accessible..."
        local max_attempts=60
        local attempt=0
        while [ $attempt -lt $max_attempts ]; do
            if talosctl --nodes "${NODE_IP}" version &> /dev/null; then
                log_success "Node ${NODE_IP} is responding"
                break
            fi
            attempt=$((attempt + 1))
            sleep 5
        done
        if [ $attempt -eq $max_attempts ]; then
            log_error "Node ${NODE_IP} did not become accessible in time"
            exit 1
        fi
    done

    # Wait for all nodes to be out of maintenance mode and services ready
    log_info "Checking that all nodes are out of maintenance mode..."
    for NODE_IP in "${CONTROL_PLANE_NODES[@]}"; do
        local max_attempts=60
        local attempt=0
        while [ $attempt -lt $max_attempts ]; do
            log_info "Checking services on ${NODE_IP} (attempt $((attempt + 1))/${max_attempts})..."
            # Get service state - if this succeeds, node is configured
            if talosctl --nodes "${NODE_IP}" get services 2>&1 | grep -q "apid"; then
                log_success "Node ${NODE_IP} is out of maintenance mode"
                break
            fi
            attempt=$((attempt + 1))
            sleep 5
        done
        if [ $attempt -eq $max_attempts ]; then
            log_error "Node ${NODE_IP} did not exit maintenance mode"
            log_error "Try checking node console or running: talosctl --nodes ${NODE_IP} get services"
            exit 1
        fi
    done

    # Additional wait to ensure etcd service is ready for bootstrap
    log_info "Waiting for etcd to be ready for bootstrap on ${CONTROL_PLANE_NODES[0]}..."
    sleep 10
    log_success "All nodes are ready for bootstrapping"
}
# Check if etcd is already bootstrapped
check_etcd_status() {
    export TALOSCONFIG="${OUTPUT_DIR}/.talosconfig"
    log_info "Checking if etcd is already bootstrapped..."

    # Check if etcd service is running
    if talosctl --nodes "${CONTROL_PLANE_NODES[0]}" service etcd status 2>&1 | grep -q "STATE.*Running"; then
        log_warning "etcd is already running - cluster appears to be bootstrapped"
        return 1
    fi
    return 0
}

# Bootstrap etcd on the first control plane node
bootstrap_cluster() {
    log_info "Bootstrapping etcd on first control plane node: ${CONTROL_PLANE_NODES[0]}"
    export TALOSCONFIG="${OUTPUT_DIR}/.talosconfig"

    # Check if already bootstrapped
    if ! check_etcd_status; then
        log_warning "Skipping bootstrap as cluster is already bootstrapped"
        return 0
    fi

    # Verify the node is ready for bootstrap
    log_info "Verifying node ${CONTROL_PLANE_NODES[0]} is ready for bootstrap..."
    if ! talosctl --nodes "${CONTROL_PLANE_NODES[0]}" get members &> /dev/null; then
        log_warning "etcd members not yet initialized, proceeding with bootstrap..."
    fi

    # Perform bootstrap
    log_info "Running bootstrap command..."
    if talosctl bootstrap --nodes "${CONTROL_PLANE_NODES[0]}"; then
        log_success "Bootstrap command executed successfully"
    else
        log_error "Failed to bootstrap etcd"
        log_error "This may be because:"
        log_error "  1. The node is still in maintenance mode (check with: talosctl --nodes ${CONTROL_PLANE_NODES[0]} get services)"
        log_error "  2. The configuration was not properly applied"
        log_error "  3. etcd is already bootstrapped"
        exit 1
    fi

    # Wait for etcd to come up
    log_info "Waiting for etcd to start..."
    local max_attempts=30
    local attempt=0
    while [ $attempt -lt $max_attempts ]; do
        if talosctl --nodes "${CONTROL_PLANE_NODES[0]}" service etcd status 2>&1 | grep -q "STATE.*Running"; then
            log_success "etcd is running"
            break
        fi
        attempt=$((attempt + 1))
        sleep 5
    done
    if [ $attempt -eq $max_attempts ]; then
        log_warning "etcd did not start in expected time, but continuing..."
    fi

    log_info "Waiting for Kubernetes to initialize..."
    sleep 30
}
# Retrieve kubeconfig
get_kubeconfig() {
    log_info "Retrieving kubeconfig..."
    export TALOSCONFIG="${OUTPUT_DIR}/.talosconfig"

    local max_attempts=20
    local attempt=0
    while [ $attempt -lt $max_attempts ]; do
        log_info "Attempting to retrieve kubeconfig (attempt $((attempt + 1))/${max_attempts})..."
        if talosctl kubeconfig --nodes "${CONTROL_PLANE_NODES[0]}" "${OUTPUT_DIR}/kubeconfig" --force; then
            log_success "Kubeconfig saved to ${OUTPUT_DIR}/kubeconfig"
            break
        fi
        attempt=$((attempt + 1))
        sleep 10
    done
    if [ $attempt -eq $max_attempts ]; then
        log_error "Failed to retrieve kubeconfig"
        exit 1
    fi
}

# Verify cluster health
verify_cluster() {
    log_info "Verifying cluster health..."
    export TALOSCONFIG="${OUTPUT_DIR}/.talosconfig"
    export KUBECONFIG="${OUTPUT_DIR}/kubeconfig"

    log_info "Checking Talos health..."
    if talosctl health --wait-timeout 5m; then
        log_success "Talos cluster is healthy"
    else
        log_warning "Talos health check reported issues"
    fi

    log_info "Checking Kubernetes nodes..."
    kubectl get nodes -o wide

    log_info "Checking system pods..."
    kubectl get pods -A

    log_success "Cluster verification complete"
}
# Print summary
print_summary() {
    echo ""
    echo "=========================================="
    log_success "Talos Cluster Bootstrap Complete!"
    echo "=========================================="
    echo ""
    echo "Cluster Name: ${CLUSTER_NAME}"
    echo "Control Plane Nodes:"
    for NODE_IP in "${CONTROL_PLANE_NODES[@]}"; do
        echo "  - ${NODE_IP}"
    done
    echo ""
    echo "Configuration Files:"
    echo "  - TALOSCONFIG: ${OUTPUT_DIR}/.talosconfig"
    echo "  - KUBECONFIG: ${OUTPUT_DIR}/kubeconfig"
    echo ""
    echo "To use the cluster, export these variables:"
    echo "  export TALOSCONFIG=\"\$(pwd)/${OUTPUT_DIR}/.talosconfig\""
    echo "  export KUBECONFIG=\"\$(pwd)/${OUTPUT_DIR}/kubeconfig\""
    echo ""
    echo "Or run: nix-shell (which sets these automatically)"
    echo ""
    echo "Useful commands:"
    echo "  talosctl health"
    echo "  kubectl get nodes"
    echo "  kubectl get pods -A"
    echo "=========================================="
}

# Main execution
main() {
    log_info "Starting Talos Cluster Bootstrap"
    log_info "Cluster: ${CLUSTER_NAME}"
    log_info "Nodes: ${CONTROL_PLANE_NODES[*]}"
    echo ""

    check_prerequisites
    generate_configs
    apply_configs
    wait_for_nodes
    bootstrap_cluster
    get_kubeconfig
    verify_cluster
    print_summary
}

# Run main function
main

check-cluster-status.sh (executable)
@@ -0,0 +1,148 @@
#!/usr/bin/env bash
set -euo pipefail
# Configuration
CONTROL_PLANE_NODES=("10.0.1.3" "10.0.1.4" "10.0.1.5")
TALOSCONFIG="${TALOSCONFIG:-testing1/.talosconfig}"
# Colors for output
GREEN='\033[0;32m'
BLUE='\033[0;34m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m'
log_info() {
    echo -e "${BLUE}[INFO]${NC} $1"
}

log_success() {
    echo -e "${GREEN}[SUCCESS]${NC} $1"
}

log_warning() {
    echo -e "${YELLOW}[WARNING]${NC} $1"
}

log_error() {
    echo -e "${RED}[ERROR]${NC} $1"
}

# Check if talosconfig exists
if [ ! -f "$TALOSCONFIG" ]; then
    log_error "TALOSCONFIG not found at: $TALOSCONFIG"
    log_info "Have you run ./bootstrap-cluster.sh yet?"
    exit 1
fi
export TALOSCONFIG
echo "=========================================="
echo "Talos Cluster Status Check"
echo "=========================================="
echo ""
# Check each node
for NODE_IP in "${CONTROL_PLANE_NODES[@]}"; do
echo "==================== Node: $NODE_IP ===================="
# Check if node is accessible
log_info "Checking if node is accessible..."
if talosctl --nodes "$NODE_IP" version &> /dev/null; then
log_success "Node is accessible"
else
log_error "Node is NOT accessible"
echo ""
continue
fi
# Check version
echo ""
log_info "Talos version:"
talosctl --nodes "$NODE_IP" version --short 2>&1 || log_error "Could not get version"
# Check if in maintenance mode
echo ""
log_info "Checking if node is in maintenance mode..."
if talosctl --nodes "$NODE_IP" get services &> /dev/null; then
log_success "Node is OUT of maintenance mode (configured)"
else
log_error "Node is IN MAINTENANCE MODE - configuration not applied!"
log_info "To apply config, run:"
log_info " talosctl apply-config --insecure --nodes $NODE_IP --file testing1/controlplane-${NODE_IP}.yaml"
fi
# Check services
echo ""
log_info "Service status:"
talosctl --nodes "$NODE_IP" services 2>&1 | head -20 || log_error "Could not get services"
# Check etcd status
echo ""
log_info "etcd status:"
ETCD_STATUS=$(talosctl --nodes "$NODE_IP" service etcd status 2>&1 || true)
if grep -q "STATE.*Running" <<< "$ETCD_STATUS"; then
log_success "etcd is RUNNING"
grep "STATE" <<< "$ETCD_STATUS"
else
log_warning "etcd is NOT running"
grep "STATE" <<< "$ETCD_STATUS" || log_info "etcd not initialized yet"
fi
# Check if etcd members exist
echo ""
log_info "etcd members:"
if talosctl --nodes "$NODE_IP" get members 2>/dev/null | grep -v "^NODE" | grep .; then
log_success "etcd members found"
else
log_warning "No etcd members - cluster needs bootstrap"
fi
echo ""
done
# Overall cluster status
echo "==================== Overall Cluster Status ===================="
# Check if any node has etcd running
ETCD_RUNNING=false
for NODE_IP in "${CONTROL_PLANE_NODES[@]}"; do
if talosctl --nodes "$NODE_IP" service etcd status 2>&1 | grep -q "STATE.*Running"; then
ETCD_RUNNING=true
break
fi
done
echo ""
if $ETCD_RUNNING; then
log_success "Cluster appears to be bootstrapped (etcd running)"
# Try to get kubeconfig
echo ""
log_info "Attempting to retrieve kubeconfig..."
if talosctl kubeconfig --nodes "${CONTROL_PLANE_NODES[0]}" ./kubeconfig-test --force 2>&1; then
log_success "Kubeconfig retrieved successfully"
log_info "Kubernetes node status:"
KUBECONFIG=./kubeconfig-test kubectl get nodes 2>&1 || log_error "Could not connect to Kubernetes"
rm -f ./kubeconfig-test
else
log_warning "Could not retrieve kubeconfig"
fi
else
log_warning "Cluster is NOT bootstrapped yet"
log_info ""
log_info "Next steps:"
log_info "1. Ensure all nodes are out of maintenance mode (see checks above)"
log_info "2. If nodes are in maintenance mode, apply configs:"
for NODE_IP in "${CONTROL_PLANE_NODES[@]}"; do
log_info " talosctl apply-config --insecure --nodes $NODE_IP --file testing1/controlplane-${NODE_IP}.yaml"
done
log_info "3. Wait for nodes to reboot and become ready (~2-5 minutes)"
log_info "4. Bootstrap the cluster:"
log_info " talosctl bootstrap --nodes ${CONTROL_PLANE_NODES[0]}"
fi
echo ""
echo "=========================================="
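The etcd checks in this script all key on the `STATE` line of `talosctl service etcd status` output. That test can be isolated into a small stdin predicate; a sketch (the helper name is hypothetical, and the sample lines in the usage note are illustrative, not real talosctl output):

```shell
# Hypothetical helper mirroring the check above: decide whether etcd is
# running from `talosctl service etcd status` output read on stdin.
etcd_state_running() {
  grep -q 'STATE.*Running'
}
```

Usage: `talosctl --nodes 10.0.1.3 service etcd status 2>&1 | etcd_state_running && echo up`.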

diagnose-storage.sh Executable file

@ -0,0 +1,87 @@
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
GREEN='\033[0;32m'
BLUE='\033[0;34m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m'
log_info() {
echo -e "${BLUE}[INFO]${NC} $1"
}
log_success() {
echo -e "${GREEN}[SUCCESS]${NC} $1"
}
log_warning() {
echo -e "${YELLOW}[WARNING]${NC} $1"
}
log_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
echo "=========================================="
echo "Storage Diagnostics"
echo "=========================================="
echo ""
# Check storage classes
log_info "Checking available storage classes..."
kubectl get storageclass
echo ""
# Check PVCs (tolerate a missing gitlab namespace under set -e)
log_info "Checking PersistentVolumeClaims in gitlab namespace..."
kubectl get pvc -n gitlab || true
echo ""
# Check PVC details
log_info "Detailed PVC status..."
kubectl describe pvc -n gitlab || true
echo ""
# Check pods
log_info "Checking pods in gitlab namespace..."
kubectl get pods -n gitlab || true
echo ""
# Check pod events
log_info "Checking events in gitlab namespace..."
kubectl get events -n gitlab --sort-by='.lastTimestamp' | tail -20 || true
echo ""
# Summary and recommendations
echo "=========================================="
log_info "Summary and Recommendations"
echo "=========================================="
# Check if storage class exists
if kubectl get storageclass 2>&1 | grep -q "No resources found"; then
log_error "No storage class found!"
echo ""
log_info "Talos Linux does not include a default storage provisioner."
log_info "You need to install one of the following:"
echo ""
echo " 1. Local Path Provisioner (simple, single-node)"
echo " - Best for testing/development"
echo " - Uses local node storage"
echo ""
echo " 2. OpenEBS (distributed, multi-node)"
echo " - Production-ready"
echo " - Supports replication"
echo ""
echo " 3. Rook-Ceph (distributed, enterprise)"
echo " - Full-featured storage solution"
echo " - More complex setup"
echo ""
log_info "Start with the Local Path Provisioner for simplicity (see install-local-path-storage.sh)."
else
log_success "Storage class found"
fi
echo ""
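The PVC checks above can be reduced to a single predicate over `kubectl get pvc` output. A sketch (the `pending_pvcs` name is made up; the claim names in the usage note are illustrative):

```shell
# Hypothetical helper: scan `kubectl get pvc` table output on stdin and
# print the names of Pending claims; exits 0 only if at least one is Pending.
pending_pvcs() {
  awk 'NR > 1 && $2 == "Pending" { print $1; found = 1 } END { exit !found }'
}
```

Usage: `kubectl get pvc -n gitlab | pending_pvcs && log_warning "claims stuck Pending - is a provisioner installed?"`.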

install-local-path-storage.sh Executable file

@ -0,0 +1,76 @@
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
GREEN='\033[0;32m'
BLUE='\033[0;34m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m'
log_info() {
echo -e "${BLUE}[INFO]${NC} $1"
}
log_success() {
echo -e "${GREEN}[SUCCESS]${NC} $1"
}
log_warning() {
echo -e "${YELLOW}[WARNING]${NC} $1"
}
log_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
echo "=========================================="
echo "Installing Local Path Provisioner"
echo "=========================================="
echo ""
log_info "This will install Rancher's local-path-provisioner for dynamic storage provisioning."
echo ""
# Create namespace for local-path-provisioner
log_info "Creating local-path-storage namespace..."
kubectl create namespace local-path-storage --dry-run=client -o yaml | kubectl apply -f -
# Apply local-path-provisioner
log_info "Deploying local-path-provisioner..."
kubectl apply -f https://raw.githubusercontent.com/rancher/local-path-provisioner/v0.0.28/deploy/local-path-storage.yaml
# Wait for deployment to be ready
log_info "Waiting for local-path-provisioner to be ready..."
kubectl wait --for=condition=available --timeout=120s deployment/local-path-provisioner -n local-path-storage
# Check storage class
log_info "Checking storage class..."
kubectl get storageclass
# Set as default storage class
log_info "Setting local-path as default storage class..."
kubectl patch storageclass local-path -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
log_success "Local Path Provisioner installed successfully!"
echo ""
log_info "Storage configuration:"
kubectl get storageclass
echo ""
log_info "Provisioner pods:"
kubectl get pods -n local-path-storage
echo ""
echo "=========================================="
log_success "Installation Complete!"
echo "=========================================="
echo ""
log_info "You can now deploy applications that require persistent storage."
log_info "PersistentVolumeClaims will automatically be provisioned on the local node."
echo ""
log_warning "Note: local-path storage is NOT replicated across nodes."
log_warning "For production use with HA requirements, consider OpenEBS or Rook-Ceph."
echo ""
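After installation, dynamic provisioning can be smoke-tested with a throwaway claim. A minimal sketch that just emits the manifest (the claim name `local-path-smoke-test` is made up; pipe the output to `kubectl apply -f -`):

```shell
# Hypothetical: print a minimal 1Gi PVC bound to the local-path class.
make_test_pvc() {
  cat <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: local-path-smoke-test
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: local-path
  resources:
    requests:
      storage: 1Gi
EOF
}
```

Usage: `make_test_pvc | kubectl apply -f -`, then confirm with `kubectl get pvc local-path-smoke-test` (note local-path claims stay Pending until a pod consumes them, since the provisioner uses WaitForFirstConsumer binding).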

redeploy-gitlab.sh Executable file

@ -0,0 +1,103 @@
#!/usr/bin/env bash
set -euo pipefail
# Colors for output
GREEN='\033[0;32m'
BLUE='\033[0;34m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m'
log_info() {
echo -e "${BLUE}[INFO]${NC} $1"
}
log_success() {
echo -e "${GREEN}[SUCCESS]${NC} $1"
}
log_warning() {
echo -e "${YELLOW}[WARNING]${NC} $1"
}
log_error() {
echo -e "${RED}[ERROR]${NC} $1"
}
echo "=========================================="
echo "GitLab Cleanup and Redeployment"
echo "=========================================="
echo ""
# Check if storage class exists
log_info "Checking for storage class..."
if ! kubectl get storageclass local-path &> /dev/null; then
log_error "Storage class 'local-path' not found!"
log_info "Please run: ./install-local-path-storage.sh first"
exit 1
fi
log_success "Storage class 'local-path' found"
echo ""
# Delete existing GitLab deployment
log_warning "Cleaning up existing GitLab deployment..."
if kubectl get namespace gitlab &> /dev/null; then
log_info "Deleting GitLab deployment..."
kubectl delete -k testing1/first-cluster/apps/gitlab/ --ignore-not-found=true || true
log_info "Waiting for pods to terminate..."
kubectl wait --for=delete pod --all -n gitlab --timeout=120s 2>/dev/null || true
log_info "Deleting PVCs (this will delete all data!)..."
kubectl delete pvc --all -n gitlab --ignore-not-found=true || true
log_info "Waiting for PVCs to be deleted..."
kubectl wait --for=delete pvc --all -n gitlab --timeout=60s 2>/dev/null || true
log_success "Cleanup complete"
else
log_info "GitLab namespace doesn't exist - nothing to clean up"
fi
echo ""
# Deploy GitLab
log_info "Deploying GitLab with local-path storage..."
kubectl apply -k testing1/first-cluster/apps/gitlab/
echo ""
log_info "Waiting for PVCs to be bound..."
kubectl wait --for=jsonpath='{.status.phase}'=Bound pvc --all -n gitlab --timeout=120s || true
# Check PVC status
kubectl get pvc -n gitlab
echo ""
log_info "Waiting for GitLab pod to be created..."
sleep 10
# Show pod status
kubectl get pods -n gitlab
echo ""
log_success "GitLab deployment initiated!"
echo ""
log_info "Monitor deployment progress with:"
echo " kubectl get pods -n gitlab -w"
echo ""
log_info "GitLab will take 5-10 minutes to fully start up."
echo ""
log_info "Once running, access GitLab at:"
echo " http://<node-ip>:30080"
echo ""
log_info "Get the initial root password with:"
echo " kubectl exec -n gitlab deployment/gitlab -- grep 'Password:' /etc/gitlab/initial_root_password"
echo ""
echo "=========================================="
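The access URL printed above is just a node IP plus the NodePort from the GitLab manifests. A trivial helper for scripting against it (the `gitlab_url` name is hypothetical; 30080 is the port the script above reports):

```shell
# Hypothetical helper: assemble the GitLab access URL from a node IP and
# an optional NodePort (defaults to 30080, per the script above).
gitlab_url() {
  local node_ip=$1 port=${2:-30080}
  printf 'http://%s:%s\n' "$node_ip" "$port"
}
```

Usage: `curl -sf "$(gitlab_url 10.0.1.3)/-/readiness"` is one way to poll readiness once the pod is up (the IP here is an example from this cluster's range).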