All Superinterfaces:
AutoCloseable, Closeable, Operator, OperatorPipelineV3, Serializable, org.apache.spark.sql.api.java.UDF0<Iterator<org.apache.spark.sql.Row>>
All Known Subinterfaces:
SupportsCopyToLocal, SupportsHadoopDFS, SupportsReadPartitions, SupportsScanPartitions

@DeveloperApi public interface Loader extends Operator, org.apache.spark.sql.api.java.UDF0<Iterator<org.apache.spark.sql.Row>>
The operator responsible for loading (reading) a dataset from a specific data source, e.g. blob storage, either into an in-memory DataFrame or by copying it to the local host file system. A Loader is a user-defined function (UDF) that takes no parameters and, when invoked, returns the data rows as an iterator. A Loader object is created by the Loader operator factory (an implementation of LoaderSupport) when a pipeline task requests it, and it is lazily initialized once it is ready to run. On completion, the close method is invoked to release resources.

SeqsLab supports multiple data processing features to manage and optimize workloads. An operator tells SeqsLab which features it supports by implementing the corresponding mix-in interfaces. For instance, a loader implementing SupportsCopyToLocal can localize files, e.g. genome reference files, to all available computing nodes. The loader can additionally implement SupportsReadPartitions to tell SeqsLab that localization is partition-aware and that each data partition can be read concurrently and individually. SeqsLab then runs the Loader in parallel and ensures that, during localization, the same data partition from multiple input data sources is handled in the same command execution.
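The lifecycle described above (create, initialize lazily, call with no parameters, close) can be sketched roughly as follows. This is an illustration only, not the SeqsLab API: SketchLoader, InMemoryLoader and LoaderLifecycleSketch are hypothetical stand-ins, rows are plain String[] rather than org.apache.spark.sql.Row, and init takes a String where the real method takes a DataSource.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for the Loader contract.
interface SketchLoader extends Closeable {
    SketchLoader init(String source);  // assumption: the real init takes a DataSource
    Iterator<String[]> call();         // mirrors UDF0.call(): no parameters
    @Override
    void close();                      // narrowed so it throws no checked exception
}

// Hypothetical loader that "reads" comma-separated lines held in memory.
class InMemoryLoader implements SketchLoader {
    final List<String> events = new ArrayList<>();  // records the lifecycle order
    private List<String> lines;                     // stays null until init() runs

    @Override
    public SketchLoader init(String source) {
        events.add("init");
        lines = Arrays.asList(source.split("\n"));
        return this;                                // init returns the object itself
    }

    @Override
    public Iterator<String[]> call() {
        if (lines == null) {
            throw new IllegalStateException("init() must run before call()");
        }
        events.add("call");
        return lines.stream().map(line -> line.split(",")).iterator();
    }

    @Override
    public void close() {
        events.add("close");
        lines = null;                               // release the held data
    }
}

class LoaderLifecycleSketch {
    static List<String> run() {
        InMemoryLoader loader = new InMemoryLoader();   // a factory would do this step
        // try-with-resources guarantees close() runs once the rows are consumed
        try (SketchLoader ready = loader.init("chr1,100\nchr2,200")) {
            Iterator<String[]> rows = ready.call();
            while (rows.hasNext()) {
                String[] row = rows.next();
                System.out.println(row[0] + " -> " + row[1]);
            }
        }
        return loader.events;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this sketch the caller drives the lifecycle explicitly; in SeqsLab the pipeline task is described as doing so on the operator's behalf.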
  • Method Summary

    Modifier and Type
    Method
    Description
    Loader
    init(DataSource source)
    Initializes this operator with a specific data source.
    org.apache.spark.sql.types.StructType
    readSchema()
    Returns the actual schema of this dataset loader, which may be different from the physical schema of the source storage, as column pruning or other optimizations may happen.

    Methods inherited from interface java.io.Closeable

    close

    Methods inherited from interface com.atgenomix.seqslab.piper.plugin.api.Operator

    getName, getOperatorContext

    Methods inherited from interface org.apache.spark.sql.api.java.UDF0

    call
  • Method Details

    • init

      Loader init(DataSource source)
      Initializes this operator with a specific data source.
      Parameters:
      source - Connection arguments of a task input data source
      Returns:
      The object itself
    • readSchema

      org.apache.spark.sql.types.StructType readSchema()
      Returns the actual schema of this dataset loader, which may be different from the physical schema of the source storage, as column pruning or other optimizations may happen.
      Returns:
      The schema as a StructType
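
The two methods above can be illustrated with a minimal sketch, assuming a loader that prunes columns. Nothing here is the SeqsLab or Spark API: PrunedSchemaLoader and ReadSchemaSketch are hypothetical, a list of column names stands in for StructType, the source string is made up, and init takes a String where the real method takes a DataSource.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical loader stand-in; column names replace the real StructType.
class PrunedSchemaLoader {
    // Assumed physical schema of the source storage.
    private final List<String> physical = Arrays.asList("chrom", "pos", "ref", "alt");
    // Columns the downstream query actually needs (assumption for this sketch).
    private final List<String> requested;
    private String source;

    PrunedSchemaLoader(List<String> requested) {
        this.requested = requested;
    }

    // Returns the object itself, so calls can be chained.
    PrunedSchemaLoader init(String source) {
        this.source = source;
        return this;
    }

    // Reports the schema actually produced: only the requested columns,
    // kept in the physical column order.
    List<String> readSchema() {
        List<String> actual = new ArrayList<>();
        for (String column : physical) {
            if (requested.contains(column)) {
                actual.add(column);
            }
        }
        return actual;
    }
}

class ReadSchemaSketch {
    static List<String> run() {
        return new PrunedSchemaLoader(Arrays.asList("pos", "chrom"))
                .init("hypothetical://variants")
                .readSchema();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The point of the sketch is only that the reported schema can be narrower than what is stored at the source, which is why callers should rely on readSchema rather than on the physical layout.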