pySpark Kafka Direct Streaming: updating ZooKeeper/Kafka offsets

Asked by: 小点点

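I am currently working with Kafka/ZooKeeper and pySpark (1.6.0). I have successfully created a Kafka consumer that uses KafkaUtils.createDirectStream().

The streaming itself works fine, but I noticed that the offsets for my Kafka topics are not updated to the current position after I have consumed some messages.

Since we need the topics updated for our monitoring, this is somewhat odd.

In the Spark documentation I found this comment: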

offsetRanges = []

def storeOffsetRanges(rdd):
    # Capture the offset ranges of the current batch before any transformation.
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd

def printOffsetRanges(rdd):
    for o in offsetRanges:
        print("%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

directKafkaStream \
    .transform(storeOffsetRanges) \
    .foreachRDD(printOffsetRanges)

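You can use this to update ZooKeeper yourself if you want ZooKeeper-based Kafka monitoring tools to show the progress of the streaming application.

Here is the documentation: http://spark.apache.org/docs/1.6.0/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers

I found a solution in Scala, but I cannot find a Python equivalent. Here is the Scala example: http://geeks.aretotally.in/spark-streaming-kafka-direct-api-store-offsets-in-zk/

But the question is: from that point on, how can I update ZooKeeper?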

2 answers

Anonymous user

I wrote some functions to save and read Kafka offsets with the Python kazoo library.

First, a function that returns a singleton KazooClient:

ZOOKEEPER_SERVERS = "127.0.0.1:2181"

def get_zookeeper_instance():
    from kazoo.client import KazooClient

    if 'KazooSingletonInstance' not in globals():
        globals()['KazooSingletonInstance'] = KazooClient(ZOOKEEPER_SERVERS)
        globals()['KazooSingletonInstance'].start()
    return globals()['KazooSingletonInstance']

Then the functions that read and write the offsets:

def read_offsets(zk, topics):
    from pyspark.streaming.kafka import TopicAndPartition

    # Build {TopicAndPartition: offset} from the znodes written by save_offsets.
    from_offsets = {}
    for topic in topics:
        for partition in zk.get_children(f'/consumers/{topic}'):
            topic_partition = TopicAndPartition(topic, int(partition))
            # zk.get returns (data, stat); data holds the offset as bytes.
            offset = int(zk.get(f'/consumers/{topic}/{partition}')[0])
            from_offsets[topic_partition] = offset
    return from_offsets

def save_offsets(rdd):
    zk = get_zookeeper_instance()
    for offset in rdd.offsetRanges():
        path = f"/consumers/{offset.topic}/{offset.partition}"
        zk.ensure_path(path)
        zk.set(path, str(offset.untilOffset).encode())

Then, before starting the stream, you can read the offsets from ZooKeeper and pass them to createDirectStream as the fromOffsets argument.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils


def main(brokers="127.0.0.1:9092", topics=['test1', 'test2']):
    sc = SparkContext(appName="PythonStreamingSaveOffsets")
    ssc = StreamingContext(sc, 2)

    zk = get_zookeeper_instance()
    from_offsets = read_offsets(zk, topics)

    directKafkaStream = KafkaUtils.createDirectStream(
        ssc, topics, {"metadata.broker.list": brokers},
        fromOffsets=from_offsets)

    directKafkaStream.foreachRDD(save_offsets)

    # Start the streaming job and block until it is stopped.
    ssc.start()
    ssc.awaitTermination()


if __name__ == "__main__":
    main()
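
A caveat for the first run: kazoo's get_children raises NoNodeError when /consumers/<topic> does not exist yet, so read_offsets above fails before any offsets have been saved. The following is a minimal sketch, not part of the original answer, that skips missing znodes. The name read_offsets_if_present is hypothetical, it returns plain (topic, partition) tuples so it can be tried without Spark, and it assumes zk behaves like a started KazooClient (only exists, get_children and get are used).

```python
def read_offsets_if_present(zk, topics):
    """Return {(topic, partition): offset} for the offset znodes that exist.

    Hypothetical helper: `zk` is assumed to behave like a started
    kazoo.client.KazooClient. Topics with no znode yet are skipped.
    """
    offsets = {}
    for topic in topics:
        base = "/consumers/{}".format(topic)
        # exists() returns None when the znode is absent.
        if zk.exists(base) is None:
            continue
        for partition in zk.get_children(base):
            # get() returns (data, stat); data is the offset stored as bytes.
            data, _stat = zk.get("{}/{}".format(base, partition))
            if data:  # ignore znodes that were created but never written
                offsets[(topic, int(partition))] = int(data)
    return offsets
```

To feed the result to createDirectStream, each key would still need to be converted with TopicAndPartition(topic, partition) before the dictionary is passed as fromOffsets.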

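Anonymous user

I ran into a similar problem. You are right: directStream uses the low-level Kafka API directly, which does not update the reader offset. There are several Scala/Java examples around, but none for Python. It is easy to do yourself, though. What you need to do is:

Read from the saved offset at startup
Save the offset at the end

For example, I save the offset of each partition in Redis like this: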

def save_offset(rdd):
    ranges = rdd.offsetRanges()
    for rng in ranges:
        rng.untilOffset  # save this offset somewhere, keyed by rng.topic and rng.partition

stream.foreachRDD(save_offset)

Then, at startup, you can use:

from pyspark.streaming.kafka import TopicAndPartition

fromoffset = {}
topic_partition = TopicAndPartition(topic, partition)
fromoffset[topic_partition] = int(value)  # value: the offset previously saved for this partition
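
The two steps above (save at the end of each batch, restore at startup) could look like the following sketch with Redis. This is an illustration under assumptions rather than the answerer's actual code: the function names and the hash name kafka:offsets are hypothetical, and client is assumed to be a redis-py redis.Redis instance, of which only hset and hgetall are used.

```python
OFFSETS_KEY = "kafka:offsets"  # hypothetical Redis hash name


def save_offsets_to_redis(client, offset_ranges, key=OFFSETS_KEY):
    """Store untilOffset for each range under the field 'topic:partition'."""
    for rng in offset_ranges:
        field = "{}:{}".format(rng.topic, rng.partition)
        client.hset(key, field, str(rng.untilOffset))


def load_offsets_from_redis(client, key=OFFSETS_KEY):
    """Return {(topic, partition): offset} from the hash written above."""
    def as_text(value):
        # redis-py returns bytes unless decode_responses=True was set.
        return value.decode() if isinstance(value, bytes) else value

    offsets = {}
    for field, value in client.hgetall(key).items():
        topic, partition = as_text(field).rsplit(":", 1)
        offsets[(topic, int(partition))] = int(as_text(value))
    return offsets
```

In the streaming job this might be wired up as stream.foreachRDD(lambda rdd: save_offsets_to_redis(client, rdd.offsetRanges())), and at startup each (topic, partition) key would be turned into a TopicAndPartition for fromOffsets.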

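For tools that track offsets through ZK, it is better to save the offsets in ZooKeeper. This page: https://community.hortonworks.com/articles/81357/manually-resetting-offset-for-a-kafka-topic.html describes how to set the offset. Basically the ZK node is /consumers/[consumer_name]/offsets/[topic name]/[partition id]. Because we are using directStream, you have to make up a consumer name.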
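
As a rough sketch of that layout, the helpers below build the znode path under the /consumers root used by Kafka's ZooKeeper-based consumers and write the untilOffset of each partition under a made-up consumer name. The function names are hypothetical, and zk is assumed to be a started KazooClient, of which only ensure_path and set are used.

```python
def zk_offset_path(consumer_name, topic, partition):
    """Build /consumers/<consumer_name>/offsets/<topic>/<partition>."""
    return "/consumers/{}/offsets/{}/{}".format(consumer_name, topic, partition)


def save_offsets_for_consumer(zk, consumer_name, offset_ranges):
    """Write the untilOffset of each range to the consumer's offset znode."""
    for rng in offset_ranges:
        path = zk_offset_path(consumer_name, rng.topic, rng.partition)
        zk.ensure_path(path)  # create the znode and any missing parents
        zk.set(path, str(rng.untilOffset).encode())
```

It could be called from foreachRDD, for example stream.foreachRDD(lambda rdd: save_offsets_for_consumer(zk, "my-direct-stream", rdd.offsetRanges())), where "my-direct-stream" is a placeholder for whatever consumer name the monitoring tool is configured to watch.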