Grouping by key is always problematic since a key might have a huge number
of values. You can do a little better than grouping *all* values and *then*
finding distinct values by using foldByKey, putting values into a Set. At
least you end up with only distinct values in memory. (You don't need two
maps either, right?)
If the number of distinct values is still huge for some keys, consider the
experimental method countApproxDistinctByKey:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L285
This should be much more performant at the cost of some accuracy.
On Sat, Jun 14, 2014 at 1:58 PM, Vivek YS <vivek.ys@gmail.com> wrote:
> Hi,
> For last couple of days I have been trying hard to get around this
> problem. Please share any insights on solving this problem.
>
> Problem :
> There is a huge list of (key, value) pairs. I want to transform this to
> (key, distinct values) and then eventually to (key, distinct values count)
>
> On small dataset
>
> groupByKey().map( x => (x_1, x._2.distinct)) ...map(x => (x_1,
> x._2.distinct.count))
>
> On large data set I am getting OOM.
>
> Is there a way to represent Seq of values from groupByKey as RDD and then
> perform distinct over it ?
>
> Thanks
> Vivek
>
