Interface Collector
- All Superinterfaces:
AutoCloseable, Closeable, Operator, OperatorPipelineV3, Serializable, org.apache.spark.sql.api.java.UDF0<Iterator<org.apache.spark.sql.Row>>
- All Known Subinterfaces:
SupportsAggregation, SupportsRepartitioning
@DeveloperApi
public interface Collector
extends Operator, org.apache.spark.sql.api.java.UDF0<Iterator<org.apache.spark.sql.Row>>
The operator responsible for retrieving command outputs from the local file system and returning them as a DataFrame. Additionally, it can compute aggregates of command outputs.
A Collector is a user-defined function (UDF) that takes no input parameters and returns the command output as a DataFrame read from local files.
Collector objects are created by invoking the Collector operator factory (which implements CollectorSupport) when a pipeline task requests them, and they are lazily initialized when ready to run. When the task completes, the close method is invoked to release resources.
SeqsLab supports multiple data-processing features to manage and optimize workloads. A Collector can inform SeqsLab of its supported features by implementing the specific mix-in interfaces.
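The lifecycle above (create via factory, lazily init, feed the task output stream, invoke the no-arg call, then close) can be sketched with plain Java. This is a hypothetical, self-contained illustration: `SimpleCollector` and `LineCollector` are simplified stand-ins for the real plugin API types in com.atgenomix.seqslab.piper.plugin.api, and a list of strings stands in for the Spark DataFrame.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the Collector interface: same lifecycle,
// but String records instead of org.apache.spark.sql.Row.
interface SimpleCollector extends AutoCloseable {
    SimpleCollector init(boolean isDirectory); // lazy initialization
    void setInputStream(InputStream in);       // bytes of a task output file
    Iterator<String> call() throws Exception;  // UDF0-style no-arg call
}

// A collector that turns each line of a task output file into one record.
class LineCollector implements SimpleCollector {
    private InputStream in;

    @Override public SimpleCollector init(boolean isDirectory) {
        // A real implementation would prepare directory iteration here
        // when isDirectory is true.
        return this; // the object itself, as documented for init
    }

    @Override public void setInputStream(InputStream in) { this.in = in; }

    @Override public Iterator<String> call() throws Exception {
        List<String> rows = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = r.readLine()) != null) rows.add(line);
        }
        return rows.iterator();
    }

    @Override public void close() { /* release resources */ }
}

public class CollectorLifecycle {
    public static void main(String[] args) throws Exception {
        byte[] output = "chr1\t100\nchr2\t200\n".getBytes();
        // Mirrors the documented order: create, init, set stream, call, close.
        try (LineCollector c = (LineCollector) new LineCollector().init(false)) {
            c.setInputStream(new ByteArrayInputStream(output));
            Iterator<String> rows = c.call();
            while (rows.hasNext()) System.out.println(rows.next());
        }
    }
}
```

The try-with-resources block mirrors the framework's guarantee that close is invoked when the task completes.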
Method Summary
Modifier and Type / Method / Description
- Collector init(boolean isDirectory)
Initializes this collector operator.
- org.apache.spark.sql.types.StructType schema()
Returns the actual schema of this collected dataset, which may differ from the physical schema of the command outputs, as column pruning or other optimizations may happen.
- void setDirectoryStream(DirectoryStream dir)
When the task output to collect from is in a directory, sets the object used to iterate over the entries in the output directory.
- void setInputStream(InputStream in)
Sets an input stream object to obtain input bytes from a task output file.
Methods inherited from interface com.atgenomix.seqslab.piper.plugin.api.Operator
getName, getOperatorContext
Methods inherited from interface org.apache.spark.sql.api.java.UDF0
call
-
Method Details
-
init
Initializes this collector operator.
- Parameters:
isDirectory - true if the task output to collect from is in a directory; otherwise, false
- Returns:
The object itself
-
setInputStream
Sets an input stream object to obtain input bytes from a task output file.
- Parameters:
in - The underlying file input stream
-
setDirectoryStream
When the task output to collect from is in a directory, sets the object used to iterate over the entries in the output directory.
- Parameters:
dir - The underlying directory stream
-
schema
org.apache.spark.sql.types.StructType schema()
Returns the actual schema of this collected dataset, which may differ from the physical schema of the command outputs, as column pruning or other optimizations may happen.
- Returns:
The schema as a StructType
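The distinction between the physical schema of the output files and the schema actually reported by schema() can be sketched without Spark. In this hypothetical example a plain list of column names stands in for org.apache.spark.sql.types.StructType, and the pruning rule is an illustrative assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in: column names in place of a Spark StructType.
class PrunedSchemaCollector {
    // Physical columns present in the command output files.
    private final List<String> physical =
            Arrays.asList("chrom", "pos", "ref", "alt", "qual");
    // Columns the downstream query actually needs.
    private final List<String> requested;

    PrunedSchemaCollector(List<String> requested) { this.requested = requested; }

    // Analogue of Collector.schema(): the actual schema of the collected
    // dataset after column pruning, not the physical file schema.
    List<String> schema() {
        List<String> pruned = new ArrayList<>();
        for (String col : physical) {
            if (requested.contains(col)) pruned.add(col);
        }
        return pruned;
    }
}
```

When only "chrom" and "pos" are requested, schema() reports just those two columns even though the files on disk contain five.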