If you are using Kafka on AWS, you probably have something like the following in one of the AWS regions: multiple availability zones (AZs), one or more Kafka brokers in each AZ, and multiple producers sending data to one or more topics in each AZ.
This is AWS/cloud best practice for high availability and scalability. Each topic's replication factor should be greater than one, and each topic should have a decent number of partitions (greater than or equal to the maximum number of consumers).
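For example, a topic sized along those lines could be created with the `kafka-topics.sh` tool that ships with the Kafka distribution (the topic name, broker address, and sizing below are illustrative; older Kafka versions take `--zookeeper` instead of `--bootstrap-server`):

```
# Create a topic replicated across brokers, with enough partitions
# for the expected maximum number of consumers.
kafka-topics.sh --bootstrap-server broker1:9092 \
  --create --topic events \
  --replication-factor 3 \
  --partitions 120
```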
Depending on the partition selection logic in the producer, a message from a producer in us-east-1a might be routed to a broker in us-east-1b, which might then replicate it to a broker in us-east-1c. There are multiple options for how to select a partition in the producer:
- Use a hard-coded partition number 😦
- Pick a partition at random
- Cycle through the partitions (round robin)
- Hash a message key and pick a partition from the hash code modulo the number of partitions
- Use some other custom logic
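The key-hash option can be sketched in a few lines of Python. This is a simplified stand-in for what a client library does internally; Kafka's Java client uses a murmur2 hash, but `crc32` from the standard library stands in here so the sketch stays dependency-free:

```python
import zlib

def pick_partition(key: bytes, num_partitions: int) -> int:
    """Hash the message key and map it onto one of the partitions.

    The property that matters: the same key always lands on the
    same partition, so per-key ordering is preserved.
    """
    return zlib.crc32(key) % num_partitions

# Messages with the same key are routed to the same partition.
p1 = pick_partition(b"user-42", 120)
p2 = pick_partition(b"user-42", 120)
assert p1 == p2
```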
Different client libraries ship different out-of-the-box options for partition selection. But what happens if your producers are high volume, pumping terabytes of data? There are two issues with high-volume deployments:
- Network throughput across AZs might not be as high as within an AZ
- There is a cost associated with cross-AZ data transfer, roughly $0.01 per GB (about $10 per TB) in each direction
There are a couple of options to work around this. Support for rack-aware replica assignment was added in Kafka 0.10.0.0, along with a new broker property called broker.rack. On AWS this should be set to the AZ name, e.g. us-east-1c. The rack is also exposed in the metadata API response, so producer partition selection can do the following:
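On a broker running in us-east-1c, that is a single line in its server.properties:

```
# server.properties on a broker in us-east-1c
broker.rack=us-east-1c
```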
Producers in each AZ can select partitions belonging to brokers in the same AZ. Broker rack information and topic partition leaders can be retrieved with a metadata request, so a producer can narrow the partition list down to its own AZ. Within that filtered list, the producer can apply one of the partition selection algorithms noted above. For example, with 120 partitions spread over three AZs, brokers in each AZ will ideally lead 40 partitions; producers in an AZ pick one of those 40 and route messages to a partition leader in the same AZ.
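The filter-then-hash idea above can be sketched as follows. The metadata table is hypothetical but shaped like what a Kafka metadata request returns: each partition's leader broker and that broker's rack (broker.rack, i.e. the AZ on AWS); broker IDs and AZ names are illustrative:

```python
import zlib

# partition -> (leader_broker_id, rack) — stand-in for metadata response
PARTITION_LEADERS = {
    0: (1, "us-east-1a"),
    1: (2, "us-east-1b"),
    2: (3, "us-east-1c"),
    3: (1, "us-east-1a"),
    4: (2, "us-east-1b"),
    5: (3, "us-east-1c"),
}

def rack_local_partition(key: bytes, my_az: str) -> int:
    """Pick a partition whose leader lives in the producer's own AZ.

    Falls back to the full partition list if no leader is local.
    """
    local = sorted(p for p, (_, rack) in PARTITION_LEADERS.items()
                   if rack == my_az)
    candidates = local or sorted(PARTITION_LEADERS)
    # Apply the usual key-hash selection within the filtered list.
    return candidates[zlib.crc32(key) % len(candidates)]

# A producer in us-east-1b only ever picks partitions 1 or 4,
# whose leaders are in its own AZ.
assert rack_local_partition(b"user-42", "us-east-1b") in (1, 4)
```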
When the replication factor is greater than one, messages will still cross AZ boundaries for replication, but less data crosses AZs than with non-rack-aware partition selection.