All Superinterfaces:
AutoCloseable, Closeable, Operator, OperatorPipelineV3, Serializable, org.apache.spark.sql.api.java.UDF1<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
All Known Subinterfaces:
SupportsGroupWithinPartitions, SupportsOrdering

@DeveloperApi public interface Transformer extends Operator, org.apache.spark.sql.api.java.UDF1<org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>,org.apache.spark.sql.Dataset<org.apache.spark.sql.Row>>
The operator responsible for repartitioning, and optionally sorting, DataFrames loaded by a Loader to optimize downstream data processing. For instance, in genome sequencing analysis, a transformer can repartition BAM or VCF datasets based on non-overlapping target regions.

A Transformer is a user-defined function (UDF) that takes a DataFrame as its input parameter and returns a partitioned and optionally sorted DataFrame. A Transformer object is created by the Transformer operator factory (an implementation of TransformerSupport) when a pipeline task requests it, and it is lazily initialized when it is ready to run. When the task completes, the close method is invoked to release resources.

SeqsLab supports multiple data processing features to manage and optimize workloads. A Transformer can declare which features it supports by implementing the corresponding mix-in interfaces.
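The lifecycle described here (creation, lazy initialization, the call itself, then close) and the mix-in check can be sketched in plain Java. This is only an illustration under stated assumptions: `SketchTransformer`, `SketchSupportsOrdering`, and `SortingTransformer` are hypothetical stand-ins, not the real SeqsLab or Spark types, and a `List<String>` plays the role of a DataFrame so the example runs without Spark on the classpath. The one-partition-per-core rule in `init` is likewise an assumption made for the example.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Stand-in types only: none of these are the real SeqsLab or Spark classes.
final class TransformerLifecycleSketch {

    /** Hypothetical stand-in for Transformer; a List<String> plays the role of a DataFrame. */
    interface SketchTransformer extends Closeable {
        SketchTransformer init(int cpuCores, int memPerCore);
        int numPartitions();
        List<List<String>> call(List<String> records);
        @Override
        void close(); // narrowed so callers need not handle IOException
    }

    /** Hypothetical mix-in marker, standing in for interfaces such as SupportsOrdering. */
    interface SketchSupportsOrdering extends SketchTransformer { }

    static final class SortingTransformer implements SketchSupportsOrdering {
        private int partitions = 1;
        private boolean closed = false;

        @Override
        public SketchTransformer init(int cpuCores, int memPerCore) {
            // Assumption for illustration: one partition per available core.
            partitions = Math.max(1, cpuCores);
            return this;
        }

        @Override
        public int numPartitions() { return partitions; }

        @Override
        public List<List<String>> call(List<String> records) {
            List<String> sorted = new ArrayList<>(records);
            Collections.sort(sorted);
            List<List<String>> out = new ArrayList<>();
            for (int p = 0; p < partitions; p++) out.add(new ArrayList<>());
            // Round-robin over sorted input keeps each partition internally sorted.
            for (int i = 0; i < sorted.size(); i++) out.get(i % partitions).add(sorted.get(i));
            return out;
        }

        @Override
        public void close() { closed = true; }

        boolean isClosed() { return closed; }
    }

    /** Plays the pipeline's role: initialize, call, then close the transformer. */
    static List<List<String>> run(SortingTransformer t, List<String> records,
                                  int cpuCores, int memPerCore) {
        try (SortingTransformer resource = t) {
            return resource.init(cpuCores, memPerCore).call(records);
        }
    }

    public static void main(String[] args) {
        SortingTransformer t = new SortingTransformer();
        List<List<String>> parts = run(t, Arrays.asList("chr2", "chr1", "chr3"), 2, 4);
        SketchTransformer seenByPipeline = t;
        // Feature detection through the mix-in type, as a pipeline might do it.
        boolean ordering = seenByPipeline instanceof SketchSupportsOrdering;
        System.out.println(parts + " ordering=" + ordering + " closed=" + t.isClosed());
    }
}
```

Because `init` returns the operator itself, initialization and the first call can be chained, and try-with-resources guarantees that `close` runs even if `call` throws.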
  • Method Summary

    Modifier and Type    Method                                Description
    Transformer          init(int cpuCores, int memPerCore)    Initializes this operator.
    int                  numPartitions()                       Returns the number of partitions after repartitioning.

    Methods inherited from interface java.io.Closeable

    close

    Methods inherited from interface com.atgenomix.seqslab.piper.plugin.api.Operator

    getName, getOperatorContext

    Methods inherited from interface org.apache.spark.sql.api.java.UDF1

    call
  • Method Details

    • init

      Transformer init(int cpuCores, int memPerCore)
      Initializes this operator.
      Parameters:
      cpuCores - Total number of CPU cores in the current computing cluster
      memPerCore - Allocated memory per CPU core in GB
      Returns:
      The object itself
    • numPartitions

      int numPartitions()
      Returns the number of partitions after repartitioning.
      Returns:
      Number of partitions
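
One way an implementation might connect the two methods is to derive the value reported by numPartitions() from the resources passed to init. The sketch below is not SeqsLab's sizing policy: `partitionsFor` is a hypothetical helper, `inputSizeGb` is an extra input that init does not receive, and the rule that a partition should fit in half of one core's memory is an assumption chosen only to make the arithmetic concrete.

```java
// Hypothetical sizing helper; not part of the SeqsLab plugin API.
final class PartitionSizingSketch {

    /**
     * Picks a partition count from cluster resources.
     *
     * @param cpuCores    total CPU cores in the cluster (as passed to init)
     * @param memPerCore  memory per core in GB (as passed to init)
     * @param inputSizeGb estimated input size in GB (assumed to be known by the implementation)
     */
    static int partitionsFor(int cpuCores, int memPerCore, int inputSizeGb) {
        if (cpuCores <= 0 || memPerCore <= 0 || inputSizeGb < 0) {
            throw new IllegalArgumentException("resources must be positive and input size non-negative");
        }
        // Assumption: each partition should fit in half of one core's memory,
        // so we need ceil(inputSizeGb / (memPerCore / 2)) = ceil(2 * inputSizeGb / memPerCore).
        int byMemory = (2 * inputSizeGb + memPerCore - 1) / memPerCore;
        // Never use fewer partitions than cores, so that every core has work.
        return Math.max(cpuCores, byMemory);
    }

    public static void main(String[] args) {
        System.out.println(partitionsFor(16, 4, 10));
        System.out.println(partitionsFor(4, 2, 100));
    }
}
```

A Transformer following this pattern would compute the count once in init, store it in a field, and return that field from numPartitions().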