Interface Loader
- All Superinterfaces:
AutoCloseable, Closeable, Operator, OperatorPipelineV3, Serializable, org.apache.spark.sql.api.java.UDF0<Iterator<org.apache.spark.sql.Row>>
- All Known Subinterfaces:
SupportsCopyToLocal, SupportsHadoopDFS, SupportsReadPartitions, SupportsScanPartitions
@DeveloperApi
public interface Loader
extends Operator, org.apache.spark.sql.api.java.UDF0<Iterator<org.apache.spark.sql.Row>>
The operator responsible for loading (reading) a dataset from a specific data source, e.g. blob storage, either into an in-memory DataFrame or by copying it to the local host file system.
A Loader is a user-defined function (UDF) that takes no parameters and returns the data rows as an iterator when required.
A Loader object is first created by the Loader operator factory (an implementation of LoaderSupport) when a pipeline task requests it, and it is lazily initialized once it is ready to run. When the loader completes, its close method is invoked to release resources.
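The lifecycle described above (creation by a factory, lazy initialization, a no-argument call, then close) can be sketched as follows. This is a minimal illustration, not the SeqsLab API: SketchLoader, SketchLoaderFactory, and InMemoryLoader are hypothetical stand-ins, the data source is reduced to a String, and rows are plain strings rather than Spark Row objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins only; none of these types belong to the SeqsLab plugin API.
class LoaderLifecycleSketch {

    // Simplified loader: init, a no-argument call that yields rows, and close.
    interface SketchLoader extends AutoCloseable {
        SketchLoader init(String source);   // returns the object itself
        Iterator<String> call();            // rows are plain strings in this sketch
        @Override
        void close();                       // narrowed so it throws no checked exception
    }

    // Simplified factory, playing the role the text assigns to LoaderSupport.
    interface SketchLoaderFactory {
        SketchLoader createLoader();
    }

    // Records each lifecycle step so the order can be inspected afterwards.
    static final class InMemoryLoader implements SketchLoader {
        private final List<String> events;
        private String source;

        InMemoryLoader(List<String> events) {
            this.events = events;
            events.add("created");
        }

        @Override
        public SketchLoader init(String source) {
            this.source = source;
            events.add("init");
            return this;
        }

        @Override
        public Iterator<String> call() {
            events.add("call");
            return Arrays.asList(source + "#row0", source + "#row1").iterator();
        }

        @Override
        public void close() {
            events.add("close");
        }
    }

    // Drives one loader through its lifecycle and returns the recorded events.
    static List<String> runOnce(String source) {
        List<String> events = new ArrayList<>();
        SketchLoaderFactory factory = () -> new InMemoryLoader(events);

        // Created when the (hypothetical) task requests it, initialized later.
        SketchLoader loader = factory.createLoader();

        // try-with-resources guarantees close() runs once the rows are consumed.
        try (SketchLoader ready = loader.init(source)) {
            Iterator<String> rows = ready.call();
            while (rows.hasNext()) {
                events.add("row:" + rows.next());
            }
        }
        return events;
    }

    public static void main(String[] args) {
        System.out.println(runOnce("blob://example"));
    }
}
```

In this sketch the recorded order is created, init, call, the two rows, and finally close, which mirrors the sequence the description gives for a real Loader.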
SeqsLab supports multiple data processing features to manage and optimize workloads.
An operator informs SeqsLab of the features it supports by implementing the corresponding mix-in interfaces.
For instance, a loader implementing SupportsCopyToLocal can localize files, e.g. genome reference files, to all available computing nodes; the loader can additionally implement SupportsReadPartitions to tell SeqsLab that the localization is partition-aware and that each data partition can be read concurrently and individually. SeqsLab then runs the Loader in parallel and ensures that, during localization, the same data partition across multiple input data sources is handled in the same command execution.
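The mix-in pattern above amounts to capability detection by interface type. The following is a minimal sketch of that idea only: CopyToLocal and ReadPartitions are hypothetical marker interfaces standing in for SupportsCopyToLocal and SupportsReadPartitions, whose real method sets are not modeled here, and the describe method is an assumption about how a runtime might inspect a loader, not SeqsLab's actual logic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins only; the real mix-ins live in the SeqsLab plugin API
// and may declare methods that this sketch does not model.
class MixinDetectionSketch {

    interface SketchLoader {}
    interface CopyToLocal extends SketchLoader {}      // stands in for SupportsCopyToLocal
    interface ReadPartitions extends SketchLoader {}   // stands in for SupportsReadPartitions

    // A loader with no optional features.
    static final class PlainLoader implements SketchLoader {}

    // A loader that declares both features by implementing both mix-ins.
    static final class ReferenceLoader implements CopyToLocal, ReadPartitions {}

    // Lists the features a loader declares through the interfaces it implements.
    static List<String> describe(SketchLoader loader) {
        List<String> features = new ArrayList<>();
        if (loader instanceof CopyToLocal) {
            features.add("copy-to-local");
        }
        if (loader instanceof ReadPartitions) {
            features.add("partition-aware");
        }
        return features;
    }

    public static void main(String[] args) {
        System.out.println(describe(new PlainLoader()));
        System.out.println(describe(new ReferenceLoader()));
    }
}
```

Declaring features through interfaces keeps the base loader contract small: a loader opts in to a feature simply by implementing the matching mix-in.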
Method Summary
Modifier and Type, Method, Description:
- init(DataSource source): Initializes this operator with a specific data source.
- org.apache.spark.sql.types.StructType readSchema(): Returns the actual schema of this dataset loader, which may be different from the physical schema of the source storage, as column pruning or other optimizations may happen.
Methods inherited from interface com.atgenomix.seqslab.piper.plugin.api.Operator
getName, getOperatorContext
Methods inherited from interface org.apache.spark.sql.api.java.UDF0
call
Method Details
init
Initializes this operator with a specific data source.
- Parameters:
source - Connection arguments of a task input data source
- Returns:
The object itself
readSchema
org.apache.spark.sql.types.StructType readSchema()
Returns the actual schema of this dataset loader, which may be different from the physical schema of the source storage, as column pruning or other optimizations may happen.
- Returns:
The schema as a StructType
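To illustrate why the returned schema can differ from the physical one, here is a minimal sketch of column pruning. It is an assumption-laden illustration, not the SeqsLab implementation: the schema is reduced to a list of column names instead of a Spark StructType so the example has no dependencies, and PrunedSchemaSketch, its readSchema helper, and the column names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration; column names stand in for StructType fields.
class PrunedSchemaSketch {

    // Keeps only the required columns, preserving the physical column order.
    static List<String> readSchema(List<String> physicalColumns, Set<String> requiredColumns) {
        List<String> actual = new ArrayList<>();
        for (String column : physicalColumns) {
            if (requiredColumns.contains(column)) {
                actual.add(column);
            }
        }
        return actual;
    }

    public static void main(String[] args) {
        // Physical schema of the (hypothetical) source storage.
        List<String> physical = Arrays.asList("chrom", "pos", "ref", "alt", "qual");
        // Columns that downstream processing actually needs.
        Set<String> required = new HashSet<>(Arrays.asList("alt", "chrom", "pos"));

        System.out.println(readSchema(physical, required));
    }
}
```

Here the loader would report three columns even though the source stores five, which is the kind of difference the readSchema description refers to.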