hadoop - How to get nth row of Spark RDD? - Stack Overflow

hadoop apache-spark rdd

I don't know how efficient it is, since that depends on current and future optimizations in Spark's engine, but you can try the following:

rdd.zipWithIndex.filter(_._2 == 9).map(_._1).first()

zipWithIndex transforms the RDD into pairs (value, idx), with idx going from 0 onwards; filter keeps the element with idx == 9 (the 10th); map recovers the original value; and first() returns the result to the driver.

The zipWithIndex step could be pulled up by the execution engine and influence the behavior of the whole processing, so give it a try.

In any case, if n is very large, this method is efficient in that it does not require collecting an array of the first n elements on the driver node.

Unfortunately, zipWithIndex requires a full pass over the data to calculate the index offset of each partition. It is still probably your best bet though.
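For reuse, the same chain can be wrapped in a small helper. This is only a sketch of the approach above; the name nthElement and the Option-based return are my own choices, not part of the original answer:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Returns the element at 0-based position n, or None if the RDD has fewer
// than n + 1 elements. An "nth element" is only meaningful if the RDD has a
// deterministic ordering.
def nthElement[T: ClassTag](rdd: RDD[T], n: Long): Option[T] =
  rdd.zipWithIndex()
    .filter { case (_, idx) => idx == n }
    .map { case (value, _) => value }
    .collect()
    .headOption

For example, nthElement(rdd, 9) returns the 10th element; using collect().headOption instead of first() avoids an exception when the RDD is too short.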


I haven't tested this on huge data, but it works fine for me:

data.take(2).drop(1)

You don't want to do this for large values of n, because take(n) collects the first n elements (which is affected by the partitioning) into the driver itself, so it can be slow or even impossible.
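To generalize the take-and-drop trick to an arbitrary (small) n, something like this sketch works; the helper name nthViaTake is hypothetical:

import org.apache.spark.rdd.RDD

// Fetch the element at 0-based position n by pulling the first n + 1
// elements to the driver. Only sensible for small n.
def nthViaTake[T](rdd: RDD[T], n: Int): Option[T] =
  rdd.take(n + 1).drop(n).headOption

With n = 1 this is the same take(2).drop(1) expression as above, wrapped in headOption.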

AFAIK, Nicola Ferraro's answer above is the best approach we currently have: stackoverflow.com/a/27826498/2846609

hadoop - Spark RDD: Get row number - Stack Overflow

hadoop apache-spark rdd

val rdd2 = rdd1.zipWithIndex.map {
  case (row, index) =>
    // index is the row number, starting at 0
    // (it is not fixed across runs unless the RDD has a deterministic order, e.g. it is sorted)
    (index, row)
}
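With the index as the key, a specific row can then be fetched by number via lookup, a standard PairRDDFunctions method; the variable names in this usage example are mine, not from the original snippet:

// Fetch row number 5 (0-based). zipWithIndex assigns unique indices,
// so the returned Seq has at most one element.
val fifthRow = rdd2.lookup(5L).headOption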
