Interface Collector
- All Superinterfaces:
AutoCloseable, Closeable, Operator, OperatorPipelineV3, Serializable, org.apache.spark.sql.api.java.UDF0<Iterator<org.apache.spark.sql.Row>>
- All Known Subinterfaces:
SupportsAggregation, SupportsRepartitioning
@DeveloperApi
public interface Collector
extends Operator, org.apache.spark.sql.api.java.UDF0<Iterator<org.apache.spark.sql.Row>>
The operator responsible for retrieving command outputs from the local file system and returning them as a DataFrame. Additionally, it can compute aggregates of command outputs.
A Collector is a user-defined function (UDF) that takes no input parameters and returns the command output as a DataFrame read from local files.
Collector objects are created by invoking the Collector operator factory (which implements CollectorSupport) when a pipeline task requests them, and they are lazily initialized when ready to run. When the task completes, the close method is invoked to release resources.
SeqsLab supports multiple data-processing features to manage and optimize workloads. A Collector can inform SeqsLab of its supported features by implementing the specific mix-in interfaces.
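The lifecycle above (create via factory, lazily init, feed the task output stream, invoke the no-arg call, then close) can be sketched with plain Java. This is a hypothetical, self-contained illustration: `SimpleCollector` and `LineCollector` are simplified stand-ins for the real plugin API types in com.atgenomix.seqslab.piper.plugin.api, and a list of strings stands in for the Spark DataFrame.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the Collector interface: same lifecycle,
// but String records instead of org.apache.spark.sql.Row.
interface SimpleCollector extends AutoCloseable {
    SimpleCollector init(boolean isDirectory); // lazy initialization
    void setInputStream(InputStream in);       // bytes of a task output file
    Iterator<String> call() throws Exception;  // UDF0-style no-arg call
}

// A collector that turns each line of a task output file into one record.
class LineCollector implements SimpleCollector {
    private InputStream in;

    @Override public SimpleCollector init(boolean isDirectory) {
        // A real implementation would prepare directory iteration here
        // when isDirectory is true.
        return this; // the object itself, as documented for init
    }

    @Override public void setInputStream(InputStream in) { this.in = in; }

    @Override public Iterator<String> call() throws Exception {
        List<String> rows = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = r.readLine()) != null) rows.add(line);
        }
        return rows.iterator();
    }

    @Override public void close() { /* release resources */ }
}

public class CollectorLifecycle {
    public static void main(String[] args) throws Exception {
        byte[] output = "chr1\t100\nchr2\t200\n".getBytes();
        // Mirrors the documented order: create, init, set stream, call, close.
        try (LineCollector c = (LineCollector) new LineCollector().init(false)) {
            c.setInputStream(new ByteArrayInputStream(output));
            Iterator<String> rows = c.call();
            while (rows.hasNext()) System.out.println(rows.next());
        }
    }
}
```

The try-with-resources block mirrors the framework's guarantee that close is invoked when the task completes.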
Method Summary
Modifier and Type / Method / Description
- Collector init(boolean isDirectory)
Initializes this collector operator.
- org.apache.spark.sql.types.StructType schema()
Returns the actual schema of this collected dataset, which may differ from the physical schema of the command outputs, as column pruning or other optimizations may happen.
- void setDirectoryStream(DirectoryStream dir)
When the task output to collect from is in a directory, sets the object used to iterate over the entries in the output directory.
- void setInputStream(InputStream in)
Sets an input stream object to obtain input bytes from a task output file.
Methods inherited from interface com.atgenomix.seqslab.piper.plugin.api.Operator
getName, getOperatorContext
Methods inherited from interface org.apache.spark.sql.api.java.UDF0
call
-
Method Details
-
init
Initializes this collector operator.
- Parameters:
isDirectory - true if the task output to collect from is in a directory; otherwise, false
- Returns:
The object itself
-
setInputStream
Sets an input stream object to obtain input bytes from a task output file.
- Parameters:
in - The underlying file input stream
-
setDirectoryStream
When the task output to collect from is in a directory, sets the object used to iterate over the entries in the output directory.
- Parameters:
dir - The underlying directory stream
-
schema
org.apache.spark.sql.types.StructType schema()
Returns the actual schema of this collected dataset, which may differ from the physical schema of the command outputs, as column pruning or other optimizations may happen.
- Returns:
The schema as a StructType
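The distinction between the physical schema of the output files and the schema actually reported by schema() can be sketched without Spark. In this hypothetical example a plain list of column names stands in for org.apache.spark.sql.types.StructType, and the pruning rule is an illustrative assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: column names in place of a Spark StructType.
class PrunedSchemaCollector {
    // Physical columns present in the command output files.
    private final List<String> physical =
            Arrays.asList("chrom", "pos", "ref", "alt", "qual");
    // Columns the downstream query actually needs.
    private final List<String> requested;

    PrunedSchemaCollector(List<String> requested) { this.requested = requested; }

    // Analogue of Collector.schema(): the actual schema of the collected
    // dataset after column pruning, not the physical file schema.
    List<String> schema() {
        List<String> pruned = new ArrayList<>();
        for (String col : physical) {
            if (requested.contains(col)) pruned.add(col);
        }
        return pruned;
    }
}
```

When only "chrom" and "pos" are requested, schema() reports just those two columns even though the files on disk contain five.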