Interface SupportsGroupWithinPartitions

All Superinterfaces:
AutoCloseable, Closeable, Operator, OperatorPipelineV3, Serializable, Transformer, org.apache.spark.sql.api.java.UDF1<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>

@DeveloperApi @FeatureAfterCall public interface SupportsGroupWithinPartitions extends Transformer
A mix-in interface for Transformer. A Dataset transformer can implement this interface to support additional grouping within each partition, enabling pairing-aware processing of multiple datasets. This is particularly useful when the DataFrame hash partitioning strategy cannot provide sufficiently granular partitions, for example when records that belong to different logical groups end up in the same partition because their keys hash to the same value. When processing multiple datasets together (e.g. tumor-normal somatic analysis), SeqsLab unions the corresponding partitions across the datasets, then uses the grouping expressions to repartition the records in each DataFrame partition so that they are properly paired for localization.
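The collision problem described above can be illustrated without Spark. The following is a minimal plain-Java sketch, not the SeqsLab implementation: the class GroupWithinPartitionsSketch, the Rec type, and the helper methods are all hypothetical, and the union-then-group order is an assumption drawn from the description. Two datasets are hash-partitioned on a group id with the same partition count, corresponding partitions are unioned, and a second grouping pass inside each partition separates group ids that collided.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class GroupWithinPartitionsSketch {

    /** Hypothetical record: a grouping key plus the name of the dataset it came from. */
    static final class Rec {
        final String groupId;
        final String dataset;

        Rec(String groupId, String dataset) {
            this.groupId = groupId;
            this.dataset = dataset;
        }

        @Override
        public String toString() {
            return dataset + ":" + groupId;
        }
    }

    /** Hash partitioning: distinct group ids may map to the same partition index. */
    static int partitionOf(String groupId, int numPartitions) {
        return Math.floorMod(groupId.hashCode(), numPartitions);
    }

    /** Builds one record per group id for the named dataset. */
    static List<Rec> dataset(String name, String... groupIds) {
        List<Rec> out = new ArrayList<>();
        for (String id : groupIds) {
            out.add(new Rec(id, name));
        }
        return out;
    }

    /** First pass: distribute records over partitions by hashing the group id. */
    static Map<Integer, List<Rec>> hashPartition(List<Rec> records, int numPartitions) {
        Map<Integer, List<Rec>> partitions = new TreeMap<>();
        for (Rec r : records) {
            partitions
                .computeIfAbsent(partitionOf(r.groupId, numPartitions), k -> new ArrayList<>())
                .add(r);
        }
        return partitions;
    }

    /**
     * Second pass: union the corresponding partitions of two datasets, then
     * group the records inside each partition by group id so that colliding
     * group ids are separated and records from both datasets are paired.
     */
    static Map<Integer, Map<String, List<Rec>>> unionAndGroup(
            Map<Integer, List<Rec>> a, Map<Integer, List<Rec>> b) {
        Map<Integer, Map<String, List<Rec>>> out = new TreeMap<>();
        for (Map<Integer, List<Rec>> side : List.of(a, b)) {
            for (Map.Entry<Integer, List<Rec>> e : side.entrySet()) {
                Map<String, List<Rec>> groups =
                    out.computeIfAbsent(e.getKey(), k -> new TreeMap<>());
                for (Rec r : e.getValue()) {
                    groups.computeIfAbsent(r.groupId, k -> new ArrayList<>()).add(r);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int numPartitions = 2; // fewer partitions than group ids forces a collision
        Map<Integer, List<Rec>> tumor =
            hashPartition(dataset("tumor", "chr1", "chr2", "chr3"), numPartitions);
        Map<Integer, List<Rec>> normal =
            hashPartition(dataset("normal", "chr1", "chr2", "chr3"), numPartitions);

        // Print each partition with its per-group-id pairs.
        unionAndGroup(tumor, normal).forEach((partition, groups) ->
            System.out.println("partition " + partition + " -> " + groups));
    }
}
```

With three group ids and two partitions, at least one partition must hold more than one group id; the second pass is what keeps each tumor record next to its matching normal record. In an actual implementation, the grouping key would presumably come from the Column expressions returned by getGroupExprs rather than from a string field.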
  • Method Summary

    Modifier and Type
    Method
    Description
    org.apache.spark.sql.Column[]
    getGroupExprs(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df)
    Get the grouping expressions as an array of Column, e.g. df.col("group_id").

    Methods inherited from interface java.io.Closeable

    close

    Methods inherited from interface com.atgenomix.seqslab.piper.plugin.api.Operator

    getName, getOperatorContext

    Methods inherited from interface com.atgenomix.seqslab.piper.plugin.api.transformer.Transformer

    init, numPartitions

    Methods inherited from interface org.apache.spark.sql.api.java.UDF1

    call
  • Method Details

    • getGroupExprs

      org.apache.spark.sql.Column[] getGroupExprs(org.apache.spark.sql.Dataset<org.apache.spark.sql.Row> df)
      Get the grouping expressions as an array of Column, e.g. df.col("group_id").
      Returns:
      Array of DataFrame Columns