Connect Google Dataproc to Digital Tap AI
1 Prerequisites
- GCP Project ID where Dataproc clusters run
- GCP Service Account JSON key
- Digital Tap AI account (sign-up is free)
Creating a Service Account
# Create the service account
gcloud iam service-accounts create digitaltap-agent \
--display-name="Digital Tap AI Agent" \
--project=YOUR_PROJECT_ID
# Download the JSON key
gcloud iam service-accounts keys create digitaltap-key.json \
--iam-account=digitaltap-agent@YOUR_PROJECT_ID.iam.gserviceaccount.com
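Before handing the key to the agent, it can help to confirm the downloaded file is what you expect. The following is a minimal sketch, assuming the standard layout of a Google service account key (a JSON object with "type" and "client_email" fields). The name check_sa_key is a hypothetical helper, not a gcloud or Digital Tap AI command.

```shell
# check_sa_key: hypothetical helper that sanity-checks a service account key file.
check_sa_key() {
  key_file="$1"
  [ -r "$key_file" ] || { echo "cannot read $key_file" >&2; return 1; }
  # Google service account keys declare "type": "service_account".
  grep -Eq '"type"[[:space:]]*:[[:space:]]*"service_account"' "$key_file" \
    || { echo "$key_file is not a service account key" >&2; return 1; }
  # The client_email field should name the account created above.
  grep -Eq '"client_email"[[:space:]]*:[[:space:]]*"digitaltap-agent@' "$key_file" \
    || { echo "$key_file belongs to a different account" >&2; return 1; }
  echo "key file looks valid"
}
```

A typical call would be check_sa_key digitaltap-key.json, followed by chmod 600 digitaltap-key.json so that only your user can read the key.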
2 Required GCP Roles
Grant these roles to the service account:
- roles/dataproc.editor — Manage Dataproc clusters
- roles/compute.viewer — View Compute Engine instances
- roles/monitoring.viewer — Read Cloud Monitoring metrics
PROJECT_ID="your-project-id"
SA_EMAIL="digitaltap-agent@${PROJECT_ID}.iam.gserviceaccount.com"
# Dataproc Editor
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/dataproc.editor"
# Compute Viewer
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/compute.viewer"
# Monitoring Viewer
gcloud projects add-iam-policy-binding $PROJECT_ID \
--member="serviceAccount:${SA_EMAIL}" \
--role="roles/monitoring.viewer"
💡 Monitor-only: Use roles/dataproc.viewer instead of dataproc.editor for read-only recommendations.
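The three bindings differ only in the role, so they can be collapsed into a loop. This is a sketch rather than a required step: grant_roles is a hypothetical helper name, and its optional argument selects the Dataproc role so that the monitor-only variant uses the same code path.

```shell
PROJECT_ID="your-project-id"
SA_EMAIL="digitaltap-agent@${PROJECT_ID}.iam.gserviceaccount.com"

# grant_roles: hypothetical helper, not a gcloud command.
# $1 selects the Dataproc role and defaults to roles/dataproc.editor.
grant_roles() {
  dataproc_role="${1:-roles/dataproc.editor}"
  for role in "$dataproc_role" roles/compute.viewer roles/monitoring.viewer; do
    # Stop at the first failed binding so errors are not buried.
    gcloud projects add-iam-policy-binding "$PROJECT_ID" \
      --member="serviceAccount:${SA_EMAIL}" \
      --role="$role" || return 1
  done
}
```

Run grant_roles for the full setup, or grant_roles roles/dataproc.viewer for monitor-only access.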
3 Install the Agent
Option A: Docker
docker run -d \
--name digitaltap-agent \
--restart unless-stopped \
-e DT_API_KEY="your-digital-tap-api-key" \
-e DT_PLATFORM="dataproc" \
-e GCP_PROJECT_ID="your-project-id" \
-e GCP_REGION="us-central1" \
-v /path/to/digitaltap-key.json:/app/gcp-key.json:ro \
-e GOOGLE_APPLICATION_CREDENTIALS="/app/gcp-key.json" \
ghcr.io/digital-tap/agent:latest
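After starting the container, a quick local check can show whether it stayed up. The sketch below uses only standard Docker commands; agent_status is a hypothetical helper name, and what the agent writes to its logs is not documented here, so treat the output as a starting point for diagnosis.

```shell
# agent_status: hypothetical helper built from standard Docker commands.
agent_status() {
  name="${1:-digitaltap-agent}"
  # Show whether the container is running, restarting, or exited.
  docker ps --all --filter "name=${name}" --format '{{.Names}}: {{.Status}}'
  # Print the most recent log lines from the container.
  docker logs --tail 50 "$name"
}
```

A status that keeps cycling through "Restarting" usually points to a startup failure worth reading about in the log lines that follow.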
Option B: Helm (GKE with Workload Identity)
helm repo add digitaltap https://charts.digitaltap.ai
helm repo update
helm install digitaltap-agent digitaltap/agent \
--set apiKey="your-digital-tap-api-key" \
--set platform="dataproc" \
--set gcp.projectId="your-project-id" \
--set gcp.region="us-central1" \
--set serviceAccount.annotations."iam\.gke\.io/gcp-service-account"="digitaltap-agent@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
--namespace digitaltap --create-namespace
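With Workload Identity, the annotation alone is generally not enough: the Kubernetes service account also needs permission to act as the Google service account. The sketch below shows the usual binding. It assumes the chart creates a Kubernetes service account named digitaltap-agent in the digitaltap namespace, which this guide does not confirm; bind_workload_identity is a hypothetical helper name.

```shell
PROJECT_ID="your-project-id"
GSA_EMAIL="digitaltap-agent@${PROJECT_ID}.iam.gserviceaccount.com"
K8S_NAMESPACE="digitaltap"
# Assumption: the chart names its Kubernetes service account "digitaltap-agent".
# Confirm with: kubectl get serviceaccounts --namespace digitaltap
K8S_SA="digitaltap-agent"

# bind_workload_identity: hypothetical helper that lets the Kubernetes
# service account impersonate the Google service account.
bind_workload_identity() {
  gcloud iam service-accounts add-iam-policy-binding "$GSA_EMAIL" \
    --project="$PROJECT_ID" \
    --role="roles/iam.workloadIdentityUser" \
    --member="serviceAccount:${PROJECT_ID}.svc.id.goog[${K8S_NAMESPACE}/${K8S_SA}]"
}
```

If your cluster or chart already manages this binding, running it again is harmless because the binding is additive.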
4 Verify Connection
- Open your Digital Tap AI dashboard
- Navigate to Integrations → Connected Platforms
- Your Dataproc clusters should appear within 3-5 minutes
5 Dataproc-Specific Features
- Idle Cluster Detection — Finds Dataproc clusters with no running jobs or YARN activity
- Auto-Delete Optimization — Configures idle timeout auto-delete for transient clusters
- Worker Autoscaling — Tunes Dataproc autoscaling policies based on actual usage patterns
- Preemptible Worker Optimization — Maximizes preemptible (spot) VM usage for secondary workers
- Machine Type Right-Sizing — Recommends optimal GCE machine types based on workload profiles
- Spark Job Optimization — Analyzes Spark job performance and tunes executor/driver configs
- GCS Storage Optimization — Cleans up temporary GCS data left by Dataproc jobs
- Cost Attribution — Per-cluster, per-job, and per-label cost breakdown
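For context on the auto-delete feature, Dataproc itself supports an idle timeout at cluster creation. This guide does not describe how the agent applies the setting, so the sketch below only shows the manual equivalent. create_transient_cluster is a hypothetical helper, and the 30-minute default is an illustrative value.

```shell
# create_transient_cluster: hypothetical helper. The cluster name, region,
# and 30-minute idle window are illustrative values.
create_transient_cluster() {
  cluster="$1"
  region="$2"
  idle="${3:-30m}"
  # --max-idle deletes the cluster after it has been idle for the given duration.
  gcloud dataproc clusters create "$cluster" \
    --region="$region" \
    --max-idle="$idle"
}
```

For example, create_transient_cluster nightly-etl us-central1 1h would create a cluster that is deleted after one idle hour.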
6 Troubleshooting
Authentication errors
- Verify service account key:
gcloud auth activate-service-account --key-file=digitaltap-key.json
- Check project access:
gcloud dataproc clusters list --project=YOUR_PROJECT_ID --region=YOUR_REGION
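If the key activates but listing clusters fails with a permission error, a missing role binding is a likely cause. The sketch below prints the roles currently granted to the service account at project level, using a common gcloud filter pattern. list_sa_roles is a hypothetical helper name.

```shell
# list_sa_roles: hypothetical helper that prints one granted role per line.
list_sa_roles() {
  project="$1"
  sa_email="digitaltap-agent@${project}.iam.gserviceaccount.com"
  # Flatten the policy to one row per member, then keep rows for this account.
  gcloud projects get-iam-policy "$project" \
    --flatten="bindings[].members" \
    --filter="bindings.members:serviceAccount:${sa_email}" \
    --format="value(bindings.role)"
}
```

The output should include the three roles from step 2, or roles/dataproc.viewer in place of the editor role for monitor-only setups.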
No clusters found
- Verify GCP_REGION matches your cluster region
- For multi-region, set DT_GCP_REGIONS="us-central1,us-east1,europe-west1"
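To find out which regions actually contain clusters before setting these variables, you can query each candidate region in turn. A minimal sketch, assuming a comma-separated list in the same format as DT_GCP_REGIONS; list_clusters_by_region is a hypothetical helper name.

```shell
# list_clusters_by_region: hypothetical helper. It splits a comma-separated
# region list and runs the standard cluster listing once per region.
list_clusters_by_region() {
  project="$1"
  regions="$2"
  # Turn commas into spaces so the shell splits the list into words.
  for region in $(printf '%s' "$regions" | tr ',' ' '); do
    echo "== ${region} =="
    gcloud dataproc clusters list --project="$project" --region="$region"
  done
}
```

For example, list_clusters_by_region YOUR_PROJECT_ID "us-central1,us-east1,europe-west1" prints one section per region. Regions with an empty section can be left out of the agent configuration.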