Interface SupportsGroupWithinPartitions
- All Superinterfaces:
AutoCloseable, Closeable, Operator, OperatorPipelineV3, Serializable, Transformer, org.apache.spark.sql.api.java.UDF1<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
A mix-in interface for Transformer. A dataset transformer can implement this interface to
support additional grouping within each partition, which enables pairing-aware processing of multiple datasets.
This is particularly useful when the DataFrame hash partitioning strategy cannot provide the
desired partition granularity, i.e. records that belong to different groups end up
in the same partition because their keys hash to the same value. In particular, when processing across multiple
datasets (e.g. tumor-normal somatic analysis), SeqsLab unions the corresponding partitions across the
datasets, then uses the grouping expressions to repartition the records within each DataFrame partition
and pair them properly for localization.
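The hash-collision problem and the regrouping step described above can be illustrated with a minimal plain-Java sketch. It is not SeqsLab's implementation and does not use Spark: the class `GroupWithinPartitionsSketch`, the `Rec` type, and the helper methods are hypothetical names, and `String.hashCode()` stands in for whatever hash function the real partitioner uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupWithinPartitionsSketch {

    /** Toy record: the dataset it came from and its grouping key (hypothetical fields). */
    static final class Rec {
        final String dataset;
        final String groupId;

        Rec(String dataset, String groupId) {
            this.dataset = dataset;
            this.groupId = groupId;
        }

        @Override
        public String toString() {
            return dataset + ":" + groupId;
        }
    }

    /** Simplified hash partitioner (assumption: the real engine uses a different hash). */
    static int partitionOf(String groupId, int numPartitions) {
        return Math.floorMod(groupId.hashCode(), numPartitions);
    }

    /** Step 1: hash-partition a single dataset by its grouping key. */
    static Map<Integer, List<Rec>> hashPartition(List<Rec> rows, int numPartitions) {
        Map<Integer, List<Rec>> parts = new TreeMap<>();
        for (Rec r : rows) {
            parts.computeIfAbsent(partitionOf(r.groupId, numPartitions), k -> new ArrayList<>()).add(r);
        }
        return parts;
    }

    /** Step 2: union corresponding partitions of two datasets, then regroup by key within each. */
    static Map<Integer, Map<String, List<Rec>>> unionAndGroup(
            Map<Integer, List<Rec>> left, Map<Integer, List<Rec>> right) {
        Map<Integer, Map<String, List<Rec>>> grouped = new TreeMap<>();
        for (Map<Integer, List<Rec>> side : List.of(left, right)) {
            for (Map.Entry<Integer, List<Rec>> e : side.entrySet()) {
                Map<String, List<Rec>> groups =
                        grouped.computeIfAbsent(e.getKey(), k -> new TreeMap<>());
                for (Rec r : e.getValue()) {
                    groups.computeIfAbsent(r.groupId, k -> new ArrayList<>()).add(r);
                }
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        int numPartitions = 2;
        List<Rec> tumor = List.of(
                new Rec("tumor", "chr1"), new Rec("tumor", "chr2"), new Rec("tumor", "chr3"));
        List<Rec> normal = List.of(
                new Rec("normal", "chr1"), new Rec("normal", "chr2"), new Rec("normal", "chr3"));

        Map<Integer, Map<String, List<Rec>>> grouped = unionAndGroup(
                hashPartition(tumor, numPartitions), hashPartition(normal, numPartitions));

        // Each partition now holds one sub-group per grouping key, with both datasets paired.
        grouped.forEach((pid, groups) -> System.out.println("partition " + pid + " -> " + groups));
    }
}
```

With two partitions, the keys "chr1" and "chr3" hash to the same partition in this sketch, so hash partitioning alone would mix them. The second grouping step separates them again while keeping the tumor and normal records for each key together.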
Method Summary
Modifier and Type | Method | Description
org.apache.spark.sql.Column[] | getGroupExprs(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df) | Get the grouping expressions as an array of Column, e.g. df.col("group_id").
Methods inherited from interface com.atgenomix.seqslab.piper.plugin.api.Operator
getName, getOperatorContext
Methods inherited from interface com.atgenomix.seqslab.piper.plugin.api.transformer.Transformer
init, numPartitions
Methods inherited from interface org.apache.spark.sql.api.java.UDF1
call
Method Details
getGroupExprs
org.apache.spark.sql.Column[] getGroupExprs(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df)
Get the grouping expressions as an array of Column, e.g. df.col("group_id").
- Returns:
- Array of DataFrame Columns