All Superinterfaces:
AutoCloseable, Closeable, Operator, OperatorPipelineV3, Serializable, org.apache.spark.sql.api.java.UDF1<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
All Known Subinterfaces:
SupportsOrdering

@DeveloperApi public interface Partitioner extends Operator, org.apache.spark.sql.api.java.UDF1<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
The operator responsible for repartitioning, and optionally sorting, DataFrames loaded by a Loader to optimize downstream data processing. For instance, in genome sequencing analysis, a partitioner can repartition BAM or VCF datasets by non-overlapping target regions. A Partitioner is a user-defined function (UDF) that takes a DataFrame as its input parameter and returns a partitioned and optionally sorted DataFrame.

A Partitioner object is created by the Partitioner operator factory (an implementation of PartitionerSupport) when a pipeline task requests it, and is lazily initialized when it is ready to run. When it completes, the close method is invoked to release resources.

SeqsLab supports multiple data processing features to manage and optimize workloads. A Partitioner can declare which of these features it supports by implementing the corresponding mix-in interfaces.
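The lifecycle described above (created by a factory, initialized just before running, called as a UDF, then closed) can be sketched roughly as follows. This is an illustrative sketch only: SketchPartitioner, SortingPartitioner, and runOnce are hypothetical stand-ins written for this example, a plain List<Integer> takes the place of Dataset<Row>, and a Supplier takes the place of the operator factory. None of these names are part of the SeqsLab API.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.Supplier;

public class PartitionerLifecycleSketch {

    /** Hypothetical stand-in for Partitioner; T replaces Dataset<Row>. */
    interface SketchPartitioner<T> extends Closeable {
        SketchPartitioner<T> init(int cpuCores, int memPerCore);
        int numPartitions();
        T call(T input);
        @Override
        void close(); // narrowed so this sketch needs no checked-exception handling
    }

    /** Toy operator that sorts its input and records each lifecycle step. */
    static final class SortingPartitioner implements SketchPartitioner<List<Integer>> {
        final List<String> events = new ArrayList<>();
        private int partitions = 1;

        @Override
        public SortingPartitioner init(int cpuCores, int memPerCore) {
            events.add("init");
            partitions = Math.max(1, cpuCores); // assumption: one partition per core
            return this;                        // "the object itself"
        }

        @Override
        public int numPartitions() {
            return partitions;
        }

        @Override
        public List<Integer> call(List<Integer> input) {
            events.add("call");
            List<Integer> sorted = new ArrayList<>(input);
            sorted.sort(Comparator.naturalOrder());
            return sorted;
        }

        @Override
        public void close() {
            events.add("close");
        }
    }

    /** Drives one operator through create, init, call, and close. */
    static List<String> runOnce(Supplier<SortingPartitioner> factory) {
        SortingPartitioner created = factory.get();           // created on request
        try (SortingPartitioner ready = created.init(4, 8)) { // initialized when ready to run
            ready.call(List.of(3, 1, 2));
        }                                                     // close() releases resources
        return created.events;
    }

    public static void main(String[] args) {
        System.out.println(runOnce(SortingPartitioner::new)); // prints [init, call, close]
    }
}
```

Because the stand-in extends Closeable, try-with-resources guarantees that close runs even if call throws. Whether the real runtime drives operators this way is not stated on this page.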
  • Method Summary

    Modifier and Type    Method                                Description
    Partitioner          init(int cpuCores, int memPerCore)    Initializes this operator.
    int                  numPartitions()                       Returns the number of partitions after repartitioning.

    Methods inherited from interface java.io.Closeable

    close

    Methods inherited from interface com.atgenomix.seqslab.piper.plugin.api.Operator

    getName, getOperatorContext

    Methods inherited from interface org.apache.spark.sql.api.java.UDF1

    call
  • Method Details

    • init

      Partitioner init(int cpuCores, int memPerCore)
      Initializes this operator.
      Parameters:
      cpuCores - Total number of CPU cores in the current computing cluster
      memPerCore - Memory allocated per CPU core, in GB
      Returns:
      This Partitioner instance
    • numPartitions

      int numPartitions()
      Returns the number of partitions after repartitioning.
      Returns:
      Number of partitions
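
As a rough illustration of how init and numPartitions might relate, the sketch below derives a partition count from the two init arguments and then reads it back through a chained call, which works because init returns the object itself. SizedPartitioner and the 4 GB-per-partition target are assumptions made up for this example. They are not the SeqsLab sizing rule, which this page does not specify.

```java
public class PartitionSizingSketch {

    /** Hypothetical stand-in exposing only the two methods documented above. */
    static final class SizedPartitioner {
        // Assumption for this sketch: aim for about one partition per 4 GB of cluster memory.
        private static final int GB_PER_PARTITION = 4;
        private int partitions = 1;

        SizedPartitioner init(int cpuCores, int memPerCore) {
            int totalMemGb = cpuCores * memPerCore;
            // Never use fewer partitions than cores, so that every core has work.
            partitions = Math.max(cpuCores, totalMemGb / GB_PER_PARTITION);
            return this; // returning this allows init(...).numPartitions()
        }

        int numPartitions() {
            return partitions;
        }
    }

    public static void main(String[] args) {
        // 16 cores x 8 GB = 128 GB; 128 / 4 = 32; max(16, 32) = 32
        int n = new SizedPartitioner().init(16, 8).numPartitions();
        System.out.println(n); // prints 32
    }
}
```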