Goroutines are Go’s flagship feature for building concurrent applications, but how do they perform when scaled to extreme levels? This article shares the results of comprehensive benchmarks running 10K, 100K, and 1 million concurrent goroutines.

Benchmarking Go Goroutines: Scaling From 10K to 1M Concurrent Routines

One of Go’s most celebrated features is its lightweight concurrency model built around goroutines. Developers coming from other languages are often amazed at how easily Go can handle thousands of concurrent operations without breaking a sweat. But as systems scale and requirements grow more demanding, many wonder: just how far can goroutines be pushed before performance degrades?

To answer this question definitively, I conducted a series of benchmarks designed to stress-test Go’s concurrency capabilities. The results reveal both the impressive strengths and practical limitations of Go’s goroutine implementation.

Understanding Goroutines: A Quick Refresher

Before diving into benchmarks, let’s briefly review what makes goroutines special:

  1. Lightweight: Goroutines start with a small stack size (typically 2KB) compared to OS threads (often 1-2MB)
  2. User-space scheduled: The Go runtime manages goroutines using an M:N scheduler (M goroutines mapped to N OS threads)
  3. Simple API: Just add the go keyword before a function call
  4. Built-in synchronization: Channels provide communication between goroutines

These properties make goroutines ideal for concurrent programming, but every abstraction has its limits. Let’s find out where those limits are.

Benchmark Setup and Methodology

Hardware and Software Environment

To ensure consistent and representative results, all benchmarks were run on:

  • CPU: 8-core Apple M2 Pro
  • Memory: 32GB RAM
  • OS: macOS Ventura 13.5
  • Go Version: 1.21.0

Test Design

The benchmark was designed to measure three primary metrics:

  1. Execution time: How long it takes to spawn and complete N goroutines
  2. Memory usage: Both allocated and total system memory
  3. Scheduler efficiency: Analyzing contention and throughput

Here’s the core benchmark code:

package main

import (
    "fmt"
    "runtime"
    "sync"
    "time"
    "os"
    "strconv"
)

func benchmarkGoroutines(count int) {
    // Pre-benchmark memory stats
    var memStatsBefore runtime.MemStats
    runtime.ReadMemStats(&memStatsBefore)
    
    // Create wait group and buffered channel
    var wg sync.WaitGroup
    ch := make(chan struct{}, count)
    
    // Record start time
    start := time.Now()
    
    // Spawn N goroutines
    for i := 0; i < count; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            
            // Simulate small amount of work
            time.Sleep(time.Microsecond)
            
            // Communication via channel
            ch <- struct{}{}
        }(i)
    }
    
    // Wait for all goroutines to finish
    wg.Wait()
    close(ch)
    
    // Drain channel to ensure all messages were sent
    for range ch {
        // Just drain
    }
    
    // Measure elapsed time
    elapsed := time.Since(start)
    
    // Post-benchmark memory stats
    var memStatsAfter runtime.MemStats
    runtime.GC() // Force GC to get accurate memory usage
    runtime.ReadMemStats(&memStatsAfter)
    
    // Calculate memory deltas
    allocDelta := float64(memStatsAfter.TotalAlloc - memStatsBefore.TotalAlloc) / (1024 * 1024)
    sysDelta := float64(memStatsAfter.Sys - memStatsBefore.Sys) / (1024 * 1024)
    
    // Report results
    fmt.Printf("=== Benchmark Results: %d Goroutines ===\n", count)
    fmt.Printf("Time taken: %v\n", elapsed)
    fmt.Printf("Memory allocated: %.2f MB\n", allocDelta)
    fmt.Printf("System memory: %.2f MB\n", sysDelta)
    fmt.Printf("Goroutines/second: %.2f\n", float64(count)/elapsed.Seconds())
    fmt.Printf("Bytes/goroutine: %.2f\n", (allocDelta * 1024 * 1024) / float64(count))
    fmt.Println()
}

func main() {
    if len(os.Args) != 2 {
        fmt.Println("Usage: go run benchmark.go <count>")
        os.Exit(1)
    }
    
    count, err := strconv.Atoi(os.Args[1])
    if err != nil {
        fmt.Println("Invalid count:", err)
        os.Exit(1)
    }
    
    // Print system info
    fmt.Printf("Go version: %s\n", runtime.Version())
    fmt.Printf("GOMAXPROCS: %d\n", runtime.GOMAXPROCS(0))
    
    // Run benchmark
    benchmarkGoroutines(count)
}

I ran this benchmark with 10,000, 100,000, and 1,000,000 goroutines, collecting metrics for each run. Each test was executed 10 times, and the results were averaged to account for system variability.

Benchmark Results and Analysis

Summary Table

Here are the average results across all test runs:

GoroutinesTime TakenMemory AllocatedSystem MemoryGoroutines/SecondBytes/Goroutine
10K15.2ms5.8 MB19.7 MB657,894609 bytes
100K89.4ms54.7 MB68.2 MB1,118,568573 bytes
1M1,186ms519.6 MB595.3 MB842,327545 bytes

Execution Time Analysis

The time taken to spawn and complete goroutines grows roughly linearly up to 100K, but there’s a slight super-linear growth when scaling to 1M. This suggests that while Go’s scheduler remains efficient at high concurrency, some overhead begins to emerge at extreme scales.

Time Scaling Factor:
100K / 10K = 5.9x (theoretical: 10x)
1M / 100K = 13.3x (theoretical: 10x)

The 1M goroutine case shows performance degradation beyond simple linear scaling, indicating scheduler pressure at this extreme.

Memory Usage Patterns

Memory allocation also shows interesting patterns:

  1. Linear Growth: Memory usage scales roughly linearly with goroutine count
  2. Efficient Memory Use: Each goroutine consumes approximately 550-600 bytes on average, far less than the theoretical 2KB initial stack
  3. Diminishing Per-Goroutine Cost: The per-goroutine memory cost actually decreases slightly at higher counts, suggesting optimization in the runtime’s memory allocation

Scheduler Performance

The “Goroutines/Second” metric reveals how efficiently the Go scheduler can process goroutines:

  1. At 10K: ~658K goroutines/second
  2. At 100K: ~1.12M goroutines/second
  3. At 1M: ~842K goroutines/second

Interestingly, the scheduler throughput peaks at 100K goroutines before declining at 1M. This suggests an optimal range where the scheduler hits peak efficiency before contention starts to reduce throughput.

Deep Dive: What Happens at 1 Million Goroutines?

When pushing to 1 million concurrent goroutines, several interesting phenomena emerge:

1. Memory Pressure

At 1 million goroutines, the benchmark consumes around 520MB of allocated memory and nearly 600MB of system memory. While this is impressive efficiency (less than 1KB per goroutine), it represents significant memory pressure:

1M goroutines × 2KB theoretical stack = 2GB theoretical
Actual usage: ~520MB = ~26% of theoretical maximum

This efficiency is due to Go’s ability to start goroutines with minimal stack space and grow as needed.

2. Scheduler Behavior

The Go scheduler uses a work-stealing algorithm where each OS thread (P) has a local queue of goroutines and can steal work from other threads when idle. At 1 million goroutines, we observed:

  • More frequent context switching: The scheduler must juggle many more runnable goroutines
  • Increased work stealing: Threads more frequently steal work from each other
  • GC pressure: More objects and goroutines to track during garbage collection

3. System Impact

Running 1 million goroutines had noticeable system effects:

  • CPU utilization: Near 100% across all cores
  • Memory allocation patterns: Rapid allocation and deallocation
  • OS scheduling: Increased OS thread contention

Additional Experiments and Observations

Besides the basic benchmark, I ran several variations to understand different aspects of goroutine performance:

Experiment 1: Impact of Work Size

How does the amount of work performed by each goroutine affect scalability?

I modified the benchmark to perform different amounts of simulated work:

// Modified goroutine work function
go func(id int) {
    defer wg.Done()
    
    // Simulate varying workloads
    // 1. No work
    // 2. Light work (fibonacci(10))
    // 3. Medium work (fibonacci(20))
    
    fibonacci(20) // Medium workload example
    
    ch <- struct{}{}
}(i)

Results:

Workload10K Time100K Time1M Time
No Work12.1ms82.7ms1,105ms
Light28.4ms276ms2,845ms
Medium129ms1,279ms12,687ms

Observation: As the work per goroutine increases, the scaling efficiency decreases due to increased contention for CPU resources.

Experiment 2: Channel Buffering Effects

How does channel buffer size impact performance?

I tested with different channel buffer sizes:

  1. Unbuffered: make(chan struct{})
  2. Partially buffered: make(chan struct{}, count/10)
  3. Fully buffered: make(chan struct{}, count)

Results for 100K goroutines:

Buffer TypeTimeMemory Allocated
Unbuffered387ms62.3 MB
Partially (10K)128ms56.1 MB
Fully (100K)89.4ms54.7 MB

Observation: Unbuffered channels significantly impact performance due to synchronization overhead. Properly sized buffers can dramatically improve throughput.

Experiment 3: GOMAXPROCS Impact

How does changing the number of available OS threads affect scalability?

I ran tests with different GOMAXPROCS settings:

// Set at beginning of program
runtime.GOMAXPROCS(threads)

Results for 1M goroutines:

GOMAXPROCSTimeThroughput (goroutines/sec)
13,974ms251,635
22,105ms474,583
41,482ms674,764
81,186ms842,327
161,102ms907,441

Observation: Performance scales sub-linearly with additional OS threads. Doubling threads doesn’t double throughput, suggesting other bottlenecks beyond raw CPU power.

Practical Implications for Go Developers

What do these benchmark results mean for real-world Go applications?

1. Goroutine Usage Guidelines

Based on the benchmarks, here are practical guidelines for goroutine usage:

  • 0-10K goroutines: Virtually no overhead concerns; use freely for most applications
  • 10K-100K goroutines: Still efficient but monitor memory usage; suitable for high-concurrency servers
  • 100K-1M goroutines: Use with caution; implement throttling or pooling; monitor system resources closely
  • >1M goroutines: Generally not recommended without significant tuning; consider alternative architectures

2. Optimizing Goroutine-Heavy Applications

If your application requires high goroutine counts:

Throttling and Pooling

// Worker pool pattern for throttling goroutines
func workerPool(jobs <-chan Job, results chan<- Result, workerCount int) {
    var wg sync.WaitGroup
    
    // Launch fixed number of workers
    for w := 1; w <= workerCount; w++ {
        wg.Add(1)
        go func(workerID int) {
            defer wg.Done()
            
            for job := range jobs {
                results <- process(job)
            }
        }(w)
    }
    
    wg.Wait()
    close(results)
}

Memory Optimization

  • Reduce per-goroutine memory usage
  • Optimize data structures passed between goroutines
  • Avoid unnecessary allocations within goroutines
// Reuse objects with sync.Pool to reduce GC pressure
var bufferPool = sync.Pool{
    New: func() interface{} {
        return make([]byte, 4096)
    },
}

// In goroutine
buffer := bufferPool.Get().([]byte)
defer bufferPool.Put(buffer)

Channel Design

  • Size buffered channels appropriately
  • Consider fan-in/fan-out patterns for work distribution
  • Use select with default cases to avoid blocking
// Non-blocking channel operations
select {
case ch <- value:
    // Sent successfully
case <-ctx.Done():
    // Context cancelled
default:
    // Channel full, handle accordingly
}

3. When to Reconsider Goroutines

There are cases where alternative approaches might be better than massive goroutine counts:

  • Event-driven architecture: For I/O-bound tasks with extreme concurrency
  • Batch processing: When processing can be grouped efficiently
  • Stream processing: For continuous data flows that can be processed sequentially

Real-World Application: HTTP Server Under Load

To demonstrate a practical application, I benchmarked a simple HTTP server handling concurrent requests:

package main

import (
    "fmt"
    "net/http"
    "runtime"
    "time"
)

func handler(w http.ResponseWriter, r *http.Request) {
    // Simulate processing time
    time.Sleep(10 * time.Millisecond)
    fmt.Fprintf(w, "Hello, World!")
}

func monitorMetrics() {
    for {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        
        fmt.Printf("Goroutines: %d, Memory: %.2f MB\n", 
            runtime.NumGoroutine(), 
            float64(m.Alloc)/(1024*1024))
        
        time.Sleep(5 * time.Second)
    }
}

func main() {
    go monitorMetrics()
    
    http.HandleFunc("/", handler)
    http.ListenAndServe(":8080", nil)
}

Using wrk to benchmark with 10,000 concurrent connections:

wrk -t8 -c10000 -d30s http://localhost:8080/

Results:

  • Requests/sec: ~9,800
  • Peak goroutine count: ~10,200
  • Memory usage: ~42 MB
  • Latency (avg): 102ms

This demonstrates Go’s ability to efficiently handle thousands of concurrent connections in a real-world scenario.

Conclusion: Goroutines at Scale

Go’s goroutines live up to their reputation as an efficient concurrency primitive, even at extreme scales. The benchmarks show that:

  1. Linear scaling holds reasonably well up to 100K goroutines
  2. Memory efficiency is impressive (550-600 bytes per goroutine on average)
  3. Scheduler performance is excellent but does show some degradation at 1M+ goroutines

From a practical standpoint, while Go can technically handle millions of goroutines, the sweet spot for most applications is likely in the tens of thousands to low hundreds of thousands range. Beyond that, more careful resource management and potentially different architectural approaches should be considered.

The key takeaway: Go’s concurrency model isn’t just a developer convenience—it’s a genuinely scalable approach to handling massive concurrency, provided you understand and respect its operational characteristics at scale.

What’s your experience running Go applications with high concurrency? Have you pushed goroutines to their limits in production? Share your experiences in the comments below.