Kafka Connector task spawn strategy

I have a general question regarding Kafka Connect. I went through the documentation and several blogs but couldn’t find a straight answer.

If there are two workers running a single connector instance:
How does the connector instance decide when to spawn a new task if, for example, tasks.max = 10? And how does it decide how many tasks to spawn?
Does it depend on the underlying hardware configuration, e.g. the number of cores, memory, or CPU utilization?

Hi @Hamid_Jawaid,

tasks.max - The maximum number of tasks that should be created for this connector. The connector may create fewer tasks if it cannot achieve this level of parallelism.
https://docs.confluent.io/current/connect/managing/configuring.html#configuring-connectors

Great questions. The answer is that the MongoDB Kafka Connector itself isn’t responsible for managing the tasks or the number of tasks. All it does is take the tasks.max value and create up to that many task configurations, one per task; this is how the connector determines how many tasks it can support. Prior to 1.2.0 the connector would only ever allow a single task. In 1.2.0 we now allow multiple tasks, and Kafka Connect then manages how many tasks run in parallel.
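To make that concrete, here is a minimal sketch of the `taskConfigs(int maxTasks)` pattern that Kafka Connect connectors implement. This is illustrative only, not the actual MongoDB connector source, and the `connectorSettings` field is an assumed placeholder for the connector’s own configuration:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the taskConfigs(int maxTasks) pattern used by Kafka Connect
// connectors: return one config map per task, never more than maxTasks.
// 'connectorSettings' is an assumed stand-in for the connector's config.
public class TaskConfigSketch {

    private final Map<String, String> connectorSettings = new HashMap<>();

    public List<Map<String, String>> taskConfigs(int maxTasks) {
        List<Map<String, String>> configs = new ArrayList<>();
        for (int i = 0; i < maxTasks; i++) {
            // Each task gets a copy of the shared connector settings;
            // Kafka Connect decides which worker actually runs each task.
            configs.add(new HashMap<>(connectorSettings));
        }
        return configs;
    }
}
```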

The exact algorithm is internal to Kafka Connect, but it generally relates to the number of topics and partitions. So, for example, if you set tasks.max = 10 and have the following sink connector configurations (see also the sketch after this list):

  • 1 topic, 1 partition - Kafka Connect will only spawn a single task
  • 2 topics, 1 partition each - Kafka Connect will spawn 2 tasks, 1 for each topic
  • 2 topics, 5 partitions each - Kafka Connect will spawn 10 tasks, 1 for each topic partition
  • 4 topics, 5 partitions each - Kafka Connect will spawn 10 tasks, each handling data from 2 topic partitions
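Put another way, the effective parallelism of a sink connector is bounded by both tasks.max and the total number of topic partitions. A minimal sketch of that arithmetic, using the last example above (the actual assignment is done internally by Kafka Connect’s rebalancing):

```java
// Illustration only: a sink connector's effective parallelism is capped by
// both tasks.max and the total number of topic partitions it consumes.
public class SinkParallelismExample {
    public static void main(String[] args) {
        int tasksMax = 10;
        int topics = 4;
        int partitionsPerTopic = 5;

        int totalPartitions = topics * partitionsPerTopic;         // 20
        int effectiveTasks = Math.min(tasksMax, totalPartitions);  // 10
        int partitionsPerTask = totalPartitions / effectiveTasks;  // 2

        System.out.printf("tasks: %d, topic partitions per task: %d%n",
                effectiveTasks, partitionsPerTask);
    }
}
```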

The "How to Use Kafka Connect - Get Started" page in the Confluent documentation alludes to this, but as far as the MongoDB Kafka Connector is concerned, it just processes the data it is handed by Kafka Connect.

I hope that helps answer your questions,

Ross

Thanks, nice explanation.
In my case I am using the MongoDB source connector and listening to the change stream of one collection, so I have one topic, but with three partitions.
I always see only one task. Is it because I have one topic? Shouldn't one topic with more than one partition also get more tasks? If not, it seems I can scale my consumers (one per partition) but I can't scale my producer (the MongoDB source connector).

Prior to 1.2.0 only a single task was supported by the sink connector.

The source connector still only supports a single task because it uses a single change stream cursor. That one cursor is enough to watch and publish changes cluster-wide, database-wide, or down to a single collection.
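Conceptually (this is a sketch, not the actual connector source), a single-task source connector ignores maxTasks and always returns exactly one task configuration:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: a single-task source connector ignores the
// maxTasks value Kafka Connect passes in and always returns exactly one
// task configuration, so only one task (one change stream cursor) ever runs.
public class SingleTaskSourceSketch {

    private final Map<String, String> connectorSettings = new HashMap<>();

    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // maxTasks is ignored: one config map means one task.
        return Collections.singletonList(new HashMap<>(connectorSettings));
    }
}
```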

Thanks Ross. That explains the behavior I am witnessing.
Though I couldn't find this anywhere in the documentation.
Will the MongoDB source connector support multiple tasks in future releases?