Interface Partitioner
- All Superinterfaces:
AutoCloseable, Closeable, Operator, OperatorPipelineV3, Serializable, org.apache.spark.sql.api.java.UDF1<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
- All Known Subinterfaces:
SupportsOrdering
@DeveloperApi
public interface Partitioner
extends Operator, org.apache.spark.sql.api.java.UDF1<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
The operator responsible for repartitioning, and optionally sorting, DataFrames loaded by Loader to optimize downstream data processing.
For instance, in genome sequencing analysis, a partitioner can repartition BAM or VCF datasets based on
non-overlapping target regions.
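As a rough illustration of region-based partitioning, the sketch below maps a genomic position to the index of the non-overlapping target region that contains it. It is plain Java with no Spark or SeqsLab dependency; the class `RegionIndex` and its methods are hypothetical names invented for this example, and the sketch assumes a single contig and distinct region start positions.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper, not part of the SeqsLab API. */
final class RegionIndex {
    // Region start -> {inclusive end, partition index}.
    // Assumes regions do not overlap and all lie on one contig.
    private final TreeMap<Long, long[]> regions = new TreeMap<>();

    /** Registers the region [start, end] and assigns it the next partition index. */
    void addRegion(long start, long end) {
        regions.put(start, new long[] {end, regions.size()});
    }

    /** Returns the partition index for a position, or -1 if no region covers it. */
    int partitionFor(long position) {
        // Nearest region starting at or before the position, if any.
        Map.Entry<Long, long[]> entry = regions.floorEntry(position);
        if (entry == null || position > entry.getValue()[0]) {
            return -1;
        }
        return (int) entry.getValue()[1];
    }

    /** Number of registered regions, one partition per region. */
    int numPartitions() {
        return regions.size();
    }
}
```

In a real Partitioner, an index like this would presumably feed a Spark repartition expression; how records outside every target region are handled is left open here.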
A Partitioner is a user-defined function (UDF) that takes a DataFrame as an input parameter
and returns a partitioned and optionally sorted DataFrame.
Partitioner objects are created by invoking the Partitioner operator factory (an implementation of PartitionerSupport) when a pipeline task requests one, and are lazily initialized when they are ready to run.
On completion, the close method is invoked to release resources.
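The lifecycle described above (create, lazily initialize, run, close) can be sketched with a small stand-in class. `LifecycleDemo` is a hypothetical type written for this example: it mimics the order of calls only, uses a `List<Integer>` in place of a DataFrame, and does not implement the real Operator or UDF1 interfaces.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in that records the order of lifecycle calls. */
final class LifecycleDemo implements AutoCloseable {
    final List<String> events = new ArrayList<>();
    private boolean initialized = false;

    // Step 1: the factory would construct the object when a task requests it.
    LifecycleDemo() {
        events.add("created");
    }

    // Step 2: lazy initialization just before the operator runs.
    LifecycleDemo init(int cpuCores, int memPerCore) {
        initialized = true;
        events.add("init");
        return this; // mirrors "Returns: The object itself"
    }

    // Step 3: the UDF-style call; a list stands in for the DataFrame.
    List<Integer> call(List<Integer> input) {
        if (!initialized) {
            throw new IllegalStateException("init must run before call");
        }
        events.add("call");
        return new ArrayList<>(input);
    }

    // Step 4: release resources once processing completes.
    @Override
    public void close() {
        events.add("close");
    }
}
```

Because the stand-in is AutoCloseable, a try-with-resources block such as `try (LifecycleDemo p = new LifecycleDemo().init(8, 4)) { p.call(rows); }` guarantees that close runs even if call throws.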
SeqsLab supports multiple data processing features to manage and optimize workloads.
A Partitioner can inform SeqsLab of the features it supports by implementing the corresponding mix-in interfaces.
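One common way a framework discovers mix-in capabilities is an `instanceof` check against the marker interface. The sketch below shows that pattern with hypothetical stand-in types (`SketchPartitioner`, `SketchSupportsOrdering`, `FeatureProbe`); whether SeqsLab detects features this way, and whether SupportsOrdering is one of the mix-ins meant here, are assumptions.

```java
/** Hypothetical stand-in for the base operator interface. */
interface SketchPartitioner {
    int numPartitions();
}

/** Hypothetical stand-in for a mix-in that advertises an extra feature. */
interface SketchSupportsOrdering extends SketchPartitioner {
}

/** Illustrative capability check, not SeqsLab code. */
final class FeatureProbe {
    private FeatureProbe() {
    }

    /** True when the operator opts in to ordering by implementing the mix-in. */
    static boolean supportsOrdering(SketchPartitioner partitioner) {
        return partitioner instanceof SketchSupportsOrdering;
    }
}
```

The appeal of this design is that an operator opts in to a feature simply by declaring the extra interface, with no registration step.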
Method Summary
- init(int cpuCores, int memPerCore): Initializes this operator.
- int numPartitions(): Get the number of partitions after repartition.
Methods inherited from interface com.atgenomix.seqslab.piper.plugin.api.Operator
getName, getOperatorContext
Methods inherited from interface org.apache.spark.sql.api.java.UDF1
call
Method Details
init
Initializes this operator.
- Parameters:
cpuCores - Total number of CPU cores in the current computing cluster
memPerCore - Allocated memory per CPU core, in GB
- Returns:
The object itself
numPartitions
int numPartitions()
Get the number of partitions after repartition.
- Returns:
Number of partitions
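To show how init and numPartitions might relate, here is a minimal sketch in which the partition count is derived from the cluster size passed to init. `CoreScaledPartitioner` and the two-partitions-per-core factor are hypothetical choices made for this example; the documentation above does not prescribe any particular heuristic.

```java
/** Hypothetical stand-in; the real interface also extends Operator and Spark's UDF1. */
final class CoreScaledPartitioner {
    // Illustrative heuristic only, not taken from the SeqsLab documentation.
    private static final int PARTITIONS_PER_CORE = 2;

    // Falls back to a single partition until init has been called.
    private int partitions = 1;

    /** Records the cluster size and returns this object, as the init contract describes. */
    CoreScaledPartitioner init(int cpuCores, int memPerCore) {
        if (cpuCores < 1 || memPerCore < 1) {
            throw new IllegalArgumentException("cpuCores and memPerCore must be positive");
        }
        partitions = cpuCores * PARTITIONS_PER_CORE;
        return this;
    }

    /** Number of partitions the output would have after repartitioning. */
    int numPartitions() {
        return partitions;
    }
}
```

Returning `this` from init allows chained calls such as `new CoreScaledPartitioner().init(16, 4).numPartitions()`. The memPerCore argument is only validated here; a real implementation might use it to cap partition sizes.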