As a Go developer with over 5 years of experience, I have learned that writing efficient code is not just about syntax. It’s an art that requires a deep understanding of core concepts, constant learning, and a bit of creative thinking.
After countless hours of debugging, profiling, and optimizing, I’ve discovered some game-changing performance tips that have enhanced the Go projects I’ve worked on. In this article, I’ll share these 8 tips with you.
1) Embrace Goroutines, But Use Them Wisely
Goroutines are Go’s primary tool for concurrent programming: they’re lightweight, cheap to spawn, and can dramatically improve the performance of your applications by allowing parallel execution.
However, the key to mastering goroutines lies in using them wisely.
While it’s tempting to spin up a goroutine for every task, this can lead to resource exhaustion and degraded performance, especially under heavy loads.
So what’s the solution?
Use a worker pool pattern to limit the number of concurrent goroutines.
A worker pool in Go is a pattern that allows you to manage and control the number of goroutines (Go’s lightweight threads) working on tasks concurrently. Instead of creating a new goroutine for every task (which can lead to excessive resource usage and system overload), you create a fixed number of workers (goroutines), each responsible for processing tasks from a shared queue.
Here’s how you can implement a basic worker pool:
package main

import (
    "fmt"
    "math/rand"
    "time"
)

func workerPool(numWorkers int, jobs <-chan int, results chan<- int) {
    for i := 0; i < numWorkers; i++ {
        go worker(jobs, results)
    }
}

func worker(jobs <-chan int, results chan<- int) {
    for j := range jobs {
        results <- process(j)
    }
}

func process(job int) int {
    // Simulate some work
    time.Sleep(time.Millisecond * time.Duration(rand.Intn(1000)))
    return job * 2
}

func main() {
    numJobs := 100
    jobs := make(chan int, numJobs)
    results := make(chan int, numJobs)

    // Start the worker pool
    workerPool(5, jobs, results)

    // Send jobs
    for i := 0; i < numJobs; i++ {
        jobs <- i
    }
    close(jobs)

    // Collect results
    for i := 0; i < numJobs; i++ {
        result := <-results
        fmt.Printf("Result: %d\n", result)
    }
    fmt.Println("Done...")
}
In this example, we create a pool of 5 workers to process 100 jobs.
The workerPool function spawns the specified number of worker goroutines, each running the worker function. The workers continuously pull jobs from the jobs channel, process them, and send the results to the results channel.
This approach ensures that we don’t overwhelm our system resources while still benefiting from concurrent processing.
This pattern is particularly useful when processing large datasets or handling concurrent API requests.
There are many ways to implement this pattern; how you write it is up to you.
Pro tip: Always consider the nature of your workload when deciding on the numWorkers value.
For CPU-bound tasks, a good rule of thumb is to set it to the number of available CPU cores. For I/O-bound tasks, you might set it higher, as goroutines spend much of their time waiting and don’t consume much CPU.
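As a rough sketch of that rule of thumb (reusing the workerPool function and the jobs/results channels from the example above, and assuming the runtime package is imported), pool sizing might look like this; the I/O-bound multiplier is purely illustrative:
numWorkers := runtime.NumCPU() // CPU-bound work: roughly one worker per core
// For I/O-bound work, a larger pool is often reasonable, e.g.:
// numWorkers = runtime.NumCPU() * 4 // illustrative multiplier; profile to tune
workerPool(numWorkers, jobs, results)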
By implementing this worker pool pattern, you can achieve optimal performance by maximizing concurrency while preventing resource exhaustion.
2) Buffer Your Channels for Better Performance
Channels are a core building block of Go’s concurrency model, allowing goroutines to communicate and synchronize their execution.
While unbuffered channels are great for ensuring synchronization between goroutines, buffered channels can significantly boost performance in certain scenarios.
Unbuffered channels block the sender until the receiver is ready to receive the value.
This behavior is perfect for scenarios where you need strict synchronization.
However, when you’re dealing with bursty workloads or when you want to reduce goroutine blocking, buffered channels come to the rescue.
If you’ve made it this far, you probably know what buffered and unbuffered channels look like; if not, here’s a quick reminder.
// Unbuffered channel
unbuffered := make(chan int)
// Buffered channel with a capacity of 100
buffered := make(chan int, 100)
Here is a simple producer-consumer example that uses a buffered channel:
package main

import (
    "fmt"
    "time"
)

func producer(ch chan<- int) {
    for i := 0; i < 1000; i++ {
        ch <- i // This will block if the channel is full
    }
    close(ch)
}

func consumer(ch <-chan int) {
    for num := range ch {
        fmt.Println(num)
        time.Sleep(10 * time.Millisecond)
    }
}

func main() {
    start := time.Now()
    ch := make(chan int, 100)
    go producer(ch)
    consumer(ch)
    fmt.Printf("Time taken: %v\n", time.Since(start))
}
In this example, we have a producer that generates 1,000 numbers and a consumer that processes them.
With a buffered channel, the producer can continue sending values even if the consumer is not immediately ready to receive them, up to the buffer’s capacity (100).
This can lead to significant performance improvements, especially when dealing with bursty workloads.
Buffered channels reduce goroutine blocking, especially when you can predict the number of operations or when dealing with uneven producer-consumer speeds. They act as a queue, allowing the producer to “fire and forget” up to the buffer’s capacity.
However, it’s crucial to choose the right buffer size.
Too small, and you might not see much benefit.
Too large, and you risk using excessive memory.
Profile your application to find the sweet spot.
In my experience, buffered channels have been particularly effective in scenarios such as:
- Handling incoming API requests in web servers
- Processing large datasets in batches
- Implementing rate limiting for external API calls (see the sketch below)
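As a sketch of that last scenario, a buffered channel can act as a semaphore that caps how many calls to an external API are in flight at once. Note that this limits concurrency rather than enforcing a strict requests-per-second rate, and callAPI, requests, the Request type, and the limit of 10 are hypothetical placeholders (the sync package is assumed to be imported):
sem := make(chan struct{}, 10) // at most 10 calls in flight at a time
var wg sync.WaitGroup
for _, req := range requests {
    wg.Add(1)
    go func(r Request) {
        defer wg.Done()
        sem <- struct{}{}        // acquire a slot; blocks when the buffer is full
        defer func() { <-sem }() // release the slot when done
        callAPI(r)               // hypothetical call to the external API
    }(req)
}
wg.Wait()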
By strategically using buffered channels, I’ve seen throughput improvements in some high-load microservices.
Remember, the goal is to find the right balance between concurrency, memory usage, and performance for your specific use cases.
3) Use sync.Pool for Frequently Allocated Objects
In high-performance Go applications, memory allocation and garbage collection can become significant bottlenecks. This is especially true when you’re constantly allocating and deallocating temporary objects, such as buffers or temporary structs.
Enter sync.Pool, a powerful tool in the Go standard library that can dramatically reduce the pressure on the garbage collector and improve your application’s performance.
sync.Pool provides a way to reuse allocated objects, reducing the need for frequent allocations and deallocations.
It’s particularly useful for objects that are expensive to allocate or that are allocated frequently in short-lived operations.
package main

import (
    "encoding/json"
    "fmt"
    "sync"
    "time"
)

// LogEntry represents a single log entry
type LogEntry struct {
    Timestamp time.Time `json:"timestamp"`
    Level     string    `json:"level"`
    Message   string    `json:"message"`
}

// logEntryPool is a sync.Pool for LogEntry objects
var logEntryPool = sync.Pool{
    New: func() interface{} {
        return &LogEntry{}
    },
}

// WriteLog writes a log entry to a hypothetical storage
func WriteLog(level, message string) error {
    // Get a LogEntry from the pool
    entry := logEntryPool.Get().(*LogEntry)

    // Reset and populate the entry
    entry.Timestamp = time.Now()
    entry.Level = level
    entry.Message = message

    // Convert to JSON (simulating writing to a log file)
    data, err := json.Marshal(entry)
    if err != nil {
        return err
    }
    fmt.Printf("Log written: %s\n", string(data))

    // Return the entry to the pool
    logEntryPool.Put(entry)
    return nil
}

func main() {
    // Simulate writing many log entries
    for i := 0; i < 1000000; i++ {
        WriteLog("INFO", fmt.Sprintf("This is log entry %d", i))
    }
}
In the example:
- We define a LogEntry struct to represent our log entries.
- We create a sync.Pool for LogEntry objects. The New function creates a new LogEntry when the pool is empty.
- In the WriteLog function, we:
  - Get a LogEntry from the pool.
  - Reset and populate it with new data.
  - Use it (in this case, converting to JSON and printing, but in a real scenario, we’d write to a file or send to a logging service).
  - Return it to the pool for reuse.
Keep in mind that opening and closing a file for every single log entry like this is not a good approach in real code; batch your writes instead.
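A minimal sketch of that batching idea, assuming a hypothetical flushToFile function that performs a single write for the whole buffer (the bytes package is assumed to be imported, and the 64 KB threshold is only illustrative):
var logBuf bytes.Buffer // accumulates encoded entries between flushes

func writeLogBatched(data []byte) {
    logBuf.Write(data)
    logBuf.WriteByte('\n')
    if logBuf.Len() >= 64*1024 { // flush roughly every 64 KB; tune for your workload
        flushToFile(logBuf.Bytes()) // hypothetical: one I/O operation for many entries
        logBuf.Reset()
    }
}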
This technique significantly reduces garbage collection overhead for frequently used objects.
In one of our projects, implementing this pattern reduced allocation overhead by approximately 30% and cut GC pause times by around 25% in our high-throughput logging scenarios.
Important points to remember when using sync.Pool:
- Always reset the object’s state before use, as it may contain data from previous operations.
- Don’t store pointers to pooled objects after returning them to the pool.
- sync.Pool is not suitable for long-lived objects or objects that need to be preserved between GC cycles.
- The pool may be cleared at any time, so don’t rely on objects persisting in the pool.
Remember, while sync.Pool can provide significant performance benefits, it also adds complexity to your code.
Always profile your application to ensure that the benefits outweigh the added complexity. In our logging case, the performance gains were substantial enough to justify the use of sync.Pool, but this may not always be the case for every application.
4) Optimize Struct Field Order
When it comes to optimizing Go programs, every byte counts. One often-overlooked aspect of optimization is the order of fields within structs.
Because of the way Go aligns struct fields in memory, the order of fields can significantly impact the overall size of your structs and, consequently, the memory usage and cache efficiency of your program.
Go aligns struct fields to achieve faster memory access.
This alignment can introduce padding between fields, which can increase the size of your struct.
By ordering fields from largest byte size to smallest, you can minimize this padding and reduce the overall size of your structs.
Let’s see an example:
package main

import (
    "fmt"
    "unsafe"
)

// BadStruct: Inefficient field ordering
type BadStruct struct {
    a bool  // 1 byte
    b int64 // 8 bytes
    c bool  // 1 byte
}

// GoodStruct: Efficient field ordering
type GoodStruct struct {
    b int64 // 8 bytes
    a bool  // 1 byte
    c bool  // 1 byte
}

func main() {
    fmt.Printf("Size of BadStruct: %d bytes\n", unsafe.Sizeof(BadStruct{}))
    fmt.Printf("Size of GoodStruct: %d bytes\n", unsafe.Sizeof(GoodStruct{}))
}
Running this code will output (on 64-bit systems):
Size of BadStruct: 24 bytes
Size of GoodStruct: 16 bytes
Simply reordering the fields reduces the struct size by 33%.
This reduction comes from eliminating the padding bytes that were needed in BadStruct to align the int64 field.
With proper field ordering, you can reduce struct size, leading to lower memory usage and improved cache efficiency.
In large-scale applications dealing with thousands of structs, this can lead to significant memory savings and performance improvements.
Here are some guidelines for optimizing struct field order:
- Place larger fields before smaller fields.
- Group fields of the same size together.
- Consider using embedded structs to group related fields.
Remember, while optimizing struct field order can provide memory savings, it’s important to balance this with code readability and maintainability.
Always document the reason for a particular field order if it’s not immediately obvious.
By paying attention to struct field order, you can squeeze out extra performance and reduce memory usage in your Go applications, often with minimal effort.
5) Preallocate Slices When Possible
Slices are one of Go’s most powerful and frequently used data structures. They are flexible, growable sequences of elements.
However, the way slices grow can have significant performance implications, especially when dealing with large data sets or in performance-critical code paths.
When you append to a slice that has reached its capacity, Go needs to allocate a new, larger underlying array, copy all existing elements to it, and then append the new element.
This process can be costly, especially if it happens frequently.
By preallocating slices when you know (or can estimate) the final size, you can avoid these costly grow-and-copy operations, leading to significant performance improvements.
Let’s look at an example comparing preallocated and non-preallocated slices:
func BenchmarkSliceAppend(b *testing.B) {
    b.Run("WithoutPreallocation", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            var data []int
            for j := 0; j < 10000; j++ {
                data = append(data, j)
            }
        }
    })
    b.Run("WithPreallocation", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            data := make([]int, 0, 10000)
            for j := 0; j < 10000; j++ {
                data = append(data, j)
            }
        }
    })
}
Running these benchmarks will show a significant performance difference:
BenchmarkSliceAppend/WithoutPreallocation-8 41716 26817 ns/op 357626 B/op 19 allocs/op
BenchmarkSliceAppend/WithPreallocation-8 154664 7607 ns/op 81920 B/op 1 allocs/op
As you can see, preallocating the slice is roughly 3.5 times faster (7,607 ns/op vs 26,817 ns/op) and uses more than 4 times less memory (81,920 B/op vs 357,626 B/op).
Additionally, it drastically reduces the number of allocations, from 19 allocations to just 1.
This approach reduces the number of slice growth operations, improves performance, and cuts memory allocations.
Key points to remember when preallocating slices:
- If you know the exact final size, you can allocate the slice to that size directly: make([]int, size)
- If you know the maximum size but not all elements will be used, allocate capacity but zero length: make([]int, 0, maxSize) (see the sketch below)
- For slices that may grow beyond the preallocated size, consider overestimating slightly to reduce the chance of reallocation
- Be cautious not to overallocate, as this can waste memory if the extra capacity goes unused
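To make the first two points concrete, here’s a small sketch of the difference between preallocating length and preallocating capacity (the sizes are arbitrary):
// Exact size known: allocate the length and assign by index.
exact := make([]int, 5)
for i := range exact {
    exact[i] = i * i
}

// Only an upper bound known: allocate capacity, keep length zero, and append.
upTo := make([]int, 0, 5)
for i := 0; i < 3; i++ {
    upTo = append(upTo, i)
}

// Gotcha: appending to a slice created with a non-zero length adds elements
// after the existing zeros, so this would end up with length 10, not 5.
// wrong := make([]int, 5)
// wrong = append(wrong, 1, 2, 3, 4, 5)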
Remember, while preallocation can provide significant benefits, it’s important to use it judiciously.
For small slices or when the final size is unpredictable, the standard append operation might be more appropriate.
As always, profile your application to determine where optimizations will have the most impact.
6) Use strings.Builder for String Concatenation
String manipulation is a very common operation in many programs, and how you handle it can significantly impact your application’s performance.
In Go, strings are immutable, which means that every time you concatenate strings using the + operator or +=, a new string is created and the old ones are discarded.
This can lead to excessive memory allocations and poor performance, especially when building strings in a loop.
Enter strings.Builder, a powerful tool in the Go standard library designed specifically for efficient string concatenation.
strings.Builder uses a growing buffer to accumulate string content without creating intermediate strings, resulting in much better performance.
In simpler terms, it uses an internal buffer that can grow as needed, allowing you to add multiple pieces of text without creating new strings for each addition.
This makes your program faster and reduces memory usage, especially when working with large amounts of text or many string operations.
Let’s compare the performance of different string concatenation methods:
func BenchmarkConcatenation(b *testing.B) {
    b.Run("PlusOperator", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            var result string
            for j := 0; j < 1000; j++ {
                result += "speed testing"
            }
        }
    })
    b.Run("StringBuilder", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            var builder strings.Builder
            for j := 0; j < 1000; j++ {
                builder.WriteString("speed testing")
            }
            // _ = builder.String()
        }
    })
}
Running this benchmark will show a significant performance difference.
BenchmarkConcatenation/PlusOperator-8 1957 556680 ns/op 6894691 B/op 999 allocs/op
BenchmarkConcatenation/StringBuilder-8 139213 8330 ns/op 62960 B/op 16 allocs/op
As you can see, strings.Builder is over 66 times faster (8,330 ns/op vs 556,680 ns/op) and uses over 100 times less memory (62,960 B/op vs 6,894,691 B/op) compared to the + operator.
Additionally, strings.Builder drastically reduces the number of allocations, from 999 to just 16.
This approach is much more efficient than using the + or += operators for string concatenation, especially in loops or when dealing with large amounts of data.
Points to remember when using strings.Builder:
- If you know the approximate final size of your string, use builder.Grow(n) to preallocate the buffer and further improve performance (see the sketch below).
- strings.Builder is not thread-safe. If you need concurrent access, consider using bytes.Buffer with appropriate synchronization.
- For very small strings or simple concatenations, the + operator might still be more readable and perform well enough.
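For the first point, a single Grow call up front avoids repeated buffer growth when you can estimate the final size (the sizes below are only illustrative, and the strings and fmt packages are assumed to be imported):
var builder strings.Builder
builder.Grow(1000 * len("speed testing")) // preallocate roughly the final size
for i := 0; i < 1000; i++ {
    builder.WriteString("speed testing")
}
result := builder.String()
fmt.Println(len(result))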
By consistently using strings.Builder in appropriate situations, I’ve seen performance improvements in string-heavy operations within larger applications.
Remember, while strings.Builder is very efficient, it’s important to use it judiciously.
For simple, one-off string concatenations, the readability of the + operator might outweigh the performance benefits of strings.Builder.
As always, profile your application to determine where optimizations will have the most impact.
7) Use Maps for O(1) Lookups
When dealing with large datasets or frequently accessed information, the efficiency of data retrieval becomes crucial. This is where Go’s built-in map type shines.
Maps in Go provide constant-time O(1) lookups, insertions, and deletions on average, making them an excellent choice for scenarios where you need fast access to data based on a key.
Let’s compare the performance of using a map versus a slice for lookups.
const dataSize = 1000000

func BenchmarkLookup(b *testing.B) {
    data := make([]int, dataSize)
    for i := range data {
        data[i] = i
    }
    dataMap := make(map[int]int, dataSize)
    for i, v := range data {
        dataMap[v] = i
    }
    b.Run("SliceLookup", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            findInSlice(data, dataSize-1)
        }
    })
    b.Run("MapLookup", func(b *testing.B) {
        for i := 0; i < b.N; i++ {
            _ = dataMap[dataSize-1]
        }
    })
}

func findInSlice(data []int, target int) int {
    for i, v := range data {
        if v == target {
            return i
        }
    }
    return -1
}
Running this benchmark will show a significant performance difference:
BenchmarkLookup/SliceLookup-8 3525 320707 ns/op 0 B/op 0 allocs/op
BenchmarkLookup/MapLookup-8 201384250 5.964 ns/op 0 B/op 0 allocs/op
As you can see, map lookup is orders of magnitude faster than searching through a slice, especially for large datasets.
This provides constant-time lookups, drastically improving performance for large datasets.
Points to remember when using maps:
- Maps are unordered. If you need to maintain order, consider using a separate slice to keep track of keys.
- Maps are not safe for concurrent use. Use sync.RWMutex for concurrent access (see the sketch below), or consider sync.Map for specific use cases.
- The zero value of a map is nil. Always initialize a map before using it.
- Be mindful of memory usage. Maps can use more memory than slices for the same amount of data due to their internal structure.
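For the concurrency point, one common approach is to wrap the map and a sync.RWMutex in a small type. This is just a sketch; the SafeCounter name and methods are illustrative, and the sync package is assumed to be imported:
// SafeCounter wraps a map so it can be used safely from multiple goroutines.
type SafeCounter struct {
    mu sync.RWMutex
    m  map[string]int
}

func (c *SafeCounter) Inc(key string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.m[key]++
}

func (c *SafeCounter) Get(key string) int {
    c.mu.RLock()
    defer c.mu.RUnlock()
    return c.m[key]
}

// Remember the nil-map point above: initialize the inner map before use,
// e.g. counter := SafeCounter{m: make(map[string]int)}.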
Remember, while maps provide excellent performance for lookups, they come with trade-offs.
Always consider your specific use case and requirements when choosing between maps and other data structures.
8) Use Atomic Operations for Simple Counters
When dealing with concurrent programming in Go, one common task is managing shared counters or flags across multiple goroutines.
While sync.Mutex is a powerful tool for ensuring exclusive access to shared resources, it can be overkill for simple operations like incrementing a counter or toggling a flag.
This is where atomic operations come into play.
Atomic operations, provided by the sync/atomic package, allow for lock-free synchronization of simple values.
They’re faster than mutexes for these simple use cases and can significantly improve performance in high-concurrency scenarios.
Let’s compare the performance of atomic operations versus mutex-based synchronization for a simple counter.
func BenchmarkCounter(b *testing.B) {
    b.Run("Mutex", func(b *testing.B) {
        var mu sync.Mutex
        var count int64
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                mu.Lock()
                count++
                mu.Unlock()
            }
        })
    })
    b.Run("Atomic", func(b *testing.B) {
        var count int64
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                atomic.AddInt64(&count, 1)
            }
        })
    })
}
Running this benchmark will show a performance difference.
BenchmarkCounter/Mutex-8 14567182 80.35 ns/op 0 B/op 0 allocs/op
BenchmarkCounter/Atomic-8 28023906 43.03 ns/op 0 B/op 0 allocs/op
As you can see, atomic operations are nearly twice as fast (43.03 ns/op vs 80.35 ns/op) compared to mutex-based synchronization for this simple counter scenario.
Both approaches have zero memory allocations, but atomic operations provide better performance due to their lock-free nature, making them more efficient for simple tasks like incrementing a counter.
Points to remember when using atomic operations:
- Atomic operations are ideal for simple counters, flags, or pointers.
- They’re not suitable for complex operations that require multiple steps to be atomic.
- Atomic operations work on fixed-size numeric types and pointers.
- Always use the atomic package functions to read and write atomic values, never access them directly (see the sketch below).
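To illustrate the last point, reads should also go through the atomic package; since Go 1.19, the typed atomic.Int64 makes this harder to get wrong. A quick sketch, assuming the sync/atomic and fmt packages are imported:
var count int64
atomic.AddInt64(&count, 1)
current := atomic.LoadInt64(&count) // read via the package, never a plain read of count

// Equivalent with the typed API (Go 1.19+):
var counter atomic.Int64
counter.Add(1)
fmt.Println(counter.Load(), current)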
As always, choose the right tool for the job and profile your application to ensure you’re getting the expected benefits.
These performance tips have been battle-tested in production environments and have consistently led to significant improvements in my Go applications.
However, it’s crucial to remember that premature optimization is indeed the root of all evil.
Always measure first, then optimize where it matters most.