Connect Google Dataproc to Digital Tap AI

⏱️ Estimated time: 5 minutes 📋 Difficulty: Easy 📅 Last updated: March 2026

1 Prerequisites

  • GCP Project ID where Dataproc clusters run
  • GCP Service Account JSON key
  • Digital Tap AI account (sign up free)

Creating a Service Account

# Create the service account
gcloud iam service-accounts create digitaltap-agent \
  --display-name="Digital Tap AI Agent" \
  --project=YOUR_PROJECT_ID

# Download the JSON key
gcloud iam service-accounts keys create digitaltap-key.json \
  --iam-account=digitaltap-agent@YOUR_PROJECT_ID.iam.gserviceaccount.com

2 Required GCP Roles

Grant these roles to the service account:

  • roles/dataproc.editor — Manage Dataproc clusters
  • roles/compute.viewer — View Compute Engine instances
  • roles/monitoring.viewer — Read Cloud Monitoring metrics

PROJECT_ID="your-project-id"
SA_EMAIL="digitaltap-agent@${PROJECT_ID}.iam.gserviceaccount.com"

# Dataproc Editor
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/dataproc.editor"

# Compute Viewer
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/compute.viewer"

# Monitoring Viewer
gcloud projects add-iam-policy-binding $PROJECT_ID \
  --member="serviceAccount:${SA_EMAIL}" \
  --role="roles/monitoring.viewer"

💡 Monitor-only: Grant roles/dataproc.viewer instead of roles/dataproc.editor if you only want read-only recommendations.
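
The three bindings above differ only in the role name, so they can be applied in a loop. The following is a minimal sketch, not part of the Digital Tap AI tooling: `grant_roles` is a hypothetical helper name, and it suppresses the policy dump that `gcloud` prints after each binding.

```shell
# grant_roles PROJECT_ID SA_EMAIL ROLE...
# Hypothetical helper: binds each listed role to the service account.
grant_roles() {
  gr_project="$1"
  gr_sa_email="$2"
  shift 2
  for gr_role in "$@"; do
    # Stop at the first failed binding so the error is not missed.
    gcloud projects add-iam-policy-binding "$gr_project" \
      --member="serviceAccount:${gr_sa_email}" \
      --role="$gr_role" >/dev/null || return 1
    echo "granted ${gr_role}"
  done
}

# Usage (monitor-only variant: pass roles/dataproc.viewer instead of roles/dataproc.editor):
# grant_roles "$PROJECT_ID" "$SA_EMAIL" \
#   roles/dataproc.editor roles/compute.viewer roles/monitoring.viewer
```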

3 Install the Agent

Option A: Docker

docker run -d \
  --name digitaltap-agent \
  --restart unless-stopped \
  -e DT_API_KEY="your-digital-tap-api-key" \
  -e DT_PLATFORM="dataproc" \
  -e GCP_PROJECT_ID="your-project-id" \
  -e GCP_REGION="us-central1" \
  -v /path/to/digitaltap-key.json:/app/gcp-key.json:ro \
  -e GOOGLE_APPLICATION_CREDENTIALS="/app/gcp-key.json" \
  ghcr.io/digital-tap/agent:latest

Option B: Helm (GKE with Workload Identity)

helm repo add digitaltap https://charts.digitaltap.ai
helm repo update

helm install digitaltap-agent digitaltap/agent \
  --set apiKey="your-digital-tap-api-key" \
  --set platform="dataproc" \
  --set gcp.projectId="your-project-id" \
  --set gcp.region="us-central1" \
  --set serviceAccount.annotations."iam\.gke\.io/gcp-service-account"="digitaltap-agent@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --namespace digitaltap --create-namespace
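
On GKE, Workload Identity generally needs one more step that the Helm command does not perform: the Kubernetes service account must be allowed to impersonate the Google service account. A minimal sketch of that binding follows. `bind_workload_identity` is a hypothetical helper name, and the Kubernetes service account name depends on the chart, so treat the one in the usage line as an assumption.

```shell
# bind_workload_identity PROJECT_ID NAMESPACE KSA_NAME
# Hypothetical helper: lets the Kubernetes service account act as
# digitaltap-agent@PROJECT_ID.iam.gserviceaccount.com.
bind_workload_identity() {
  wi_project="$1"
  wi_namespace="$2"
  wi_ksa="$3"
  gcloud iam service-accounts add-iam-policy-binding \
    "digitaltap-agent@${wi_project}.iam.gserviceaccount.com" \
    --project="$wi_project" \
    --role="roles/iam.workloadIdentityUser" \
    --member="serviceAccount:${wi_project}.svc.id.goog[${wi_namespace}/${wi_ksa}]"
}

# Usage. The name "digitaltap-agent" is an assumption; confirm it with
#   kubectl get serviceaccounts --namespace digitaltap
# bind_workload_identity "your-project-id" "digitaltap" "digitaltap-agent"
```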

4 Verify Connection

  1. Open your Digital Tap AI dashboard
  2. Navigate to Integrations > Connected Platforms
  3. Your Dataproc clusters should appear within 3-5 minutes
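
If nothing shows up after that window, it can help to confirm that the agent process itself is running before checking credentials. The sketch below covers the Docker install from Option A only. `agent_is_running` is a hypothetical helper name; it checks local container state and says nothing about whether the agent has reached Digital Tap AI.

```shell
# agent_is_running [CONTAINER_NAME]
# Hypothetical helper: succeeds only if the container exists and is running.
agent_is_running() {
  air_state=$(docker inspect --format '{{.State.Running}}' \
    "${1:-digitaltap-agent}" 2>/dev/null) || return 1
  [ "$air_state" = "true" ]
}

# Usage:
# if agent_is_running; then
#   echo "agent container is up"
# else
#   docker logs --tail 50 digitaltap-agent
# fi
#
# For the Helm install, the equivalent first check is:
#   kubectl get pods --namespace digitaltap
```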

5 Dataproc-Specific Features

  • Idle Cluster Detection — Finds Dataproc clusters with no running jobs or YARN activity
  • Auto-Delete Optimization — Configures idle timeout auto-delete for transient clusters
  • Worker Autoscaling — Tunes Dataproc autoscaling policies based on actual usage patterns
  • Preemptible Worker Optimization — Maximizes preemptible (spot) VM usage for secondary workers
  • Machine Type Right-Sizing — Recommends optimal GCE machine types based on workload profiles
  • Spark Job Optimization — Analyzes Spark job performance and tunes executor/driver configs
  • GCS Storage Optimization — Cleans up temporary GCS data left by Dataproc jobs
  • Cost Attribution — Per-cluster, per-job, and per-label cost breakdown
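
For orientation, two of these features correspond to standard Dataproc cluster settings: idle auto-delete (`--max-idle`) and spot secondary workers (`--secondary-worker-type=spot`). The sketch below shows those settings applied by hand when creating a cluster. It does not show how the agent applies them, which the source does not describe. `create_transient_cluster` is a hypothetical helper name, and the idle timeout and worker count are illustrative values.

```shell
# create_transient_cluster CLUSTER_NAME REGION
# Hypothetical helper: creates a cluster that deletes itself after 30 idle
# minutes and uses spot VMs for its two secondary workers.
create_transient_cluster() {
  ctc_name="$1"
  ctc_region="$2"
  gcloud dataproc clusters create "$ctc_name" \
    --region="$ctc_region" \
    --max-idle=30m \
    --num-secondary-workers=2 \
    --secondary-worker-type=spot
}

# Usage:
# create_transient_cluster "nightly-etl" "us-central1"
```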

6 Troubleshooting

Authentication errors

  • Verify the service account key: gcloud auth activate-service-account --key-file=digitaltap-key.json
  • Check project access: gcloud dataproc clusters list --project=YOUR_PROJECT_ID --region=YOUR_REGION
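
Before running either command, a quick local check can rule out a missing or wrong key file, for example a user credential file mounted by mistake. This is a minimal sketch that assumes the standard JSON key layout written by `gcloud iam service-accounts keys create`. `check_sa_key` is a hypothetical helper name.

```shell
# check_sa_key KEY_FILE
# Hypothetical helper: verifies the file is readable and looks like a
# service account key, then prints the account it belongs to.
check_sa_key() {
  csk_file="$1"
  if [ ! -r "$csk_file" ]; then
    echo "cannot read ${csk_file}" >&2
    return 1
  fi
  if ! grep -q '"type": *"service_account"' "$csk_file"; then
    echo "${csk_file} is not a service account key" >&2
    return 1
  fi
  grep -o '"client_email": *"[^"]*"' "$csk_file"
}

# Usage:
# check_sa_key digitaltap-key.json
```

The printed `client_email` should match the digitaltap-agent account that received the roles in step 2.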

No clusters found

  • Verify GCP_REGION matches your cluster region
  • If your clusters span multiple regions, set DT_GCP_REGIONS="us-central1,us-east1,europe-west1"
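
To find out which regions actually contain clusters, the list command from the authentication checks can be run once per region. A minimal sketch, taking the same comma-separated format as DT_GCP_REGIONS; `list_clusters_by_region` is a hypothetical helper name.

```shell
# list_clusters_by_region PROJECT_ID "region1,region2,..."
# Hypothetical helper: prints the cluster names found in each region.
list_clusters_by_region() {
  lcr_project="$1"
  lcr_regions="$2"
  for lcr_region in $(echo "$lcr_regions" | tr ',' ' '); do
    echo "== ${lcr_region} =="
    gcloud dataproc clusters list \
      --project="$lcr_project" \
      --region="$lcr_region" \
      --format="value(clusterName)"
  done
}

# Usage:
# list_clusters_by_region "your-project-id" "us-central1,us-east1,europe-west1"
```

Any region that prints cluster names but is missing from GCP_REGION or DT_GCP_REGIONS is a likely cause of the empty dashboard.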