
[hadoop] How to get the nth row of a Spark RDD?


I don't know how efficient it is, as it depends on the current and future optimizations in Spark's engine, but you can try the following:

rdd.zipWithIndex.filter(_._2==9).map(_._1).first()

zipWithIndex transforms the RDD into pairs (value, idx), with idx going from 0 onwards; filter takes the element with idx==9 (the 10th); map takes the original value; and first returns the result.

The zipWithIndex call could be pulled up by the execution engine and influence the behavior of the whole processing. Give it a try.

In any case, if n is very large, this method is efficient in that it does not require collecting an array of the first n elements on the driver node.

Unfortunately, zipWithIndex requires a full pass over the data to calculate the index offset of each partition. It is still probably your best bet, though.
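If you need this for an arbitrary n, here is a minimal sketch generalizing the same zipWithIndex approach; the helper name nthRow and its signature are my own, not from the answer:

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Hypothetical helper generalizing the one-liner above.
// n is zero-based, so nthRow(rdd, 9) returns the 10th element.
def nthRow[T: ClassTag](rdd: RDD[T], n: Long): T =
  rdd.zipWithIndex        // pair each element with its position
     .filter(_._2 == n)   // keep only the pair at position n
     .map(_._1)           // drop the index, keep the value
     .first()             // bring that single element to the driver

Calling nthRow(rdd, 9) is equivalent to the one-liner above, and the full pass implied by zipWithIndex still applies.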

data.take(2).drop(1)

AFAIK, Nicola Ferraro's answer above contains the best approach we currently have: stackoverflow.com/a/27826498/2846609

I haven't checked this with huge data, but it works fine for me.

You don't want to do this for large values of n, as it pulls the first n elements (which depends on the partitioning) into the driver itself, so it can be slow or even impossible.
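As a sketch only, the same take-and-drop trick for an arbitrary one-based n; the helper name nthRowViaTake is hypothetical, not from the answer:

import org.apache.spark.rdd.RDD

// Hypothetical helper: take(n) collects the first n elements into an
// Array[T] on the driver, and .last keeps only the nth one.
// Fine for small n; for large n this is exactly the pattern warned
// about above, since all n elements travel to the driver.
def nthRowViaTake[T](rdd: RDD[T], n: Int): T =
  rdd.take(n).last

For example, nthRowViaTake(data, 2) corresponds to data.take(2).drop(1), except that it returns the element itself rather than a one-element collection.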
