All Superinterfaces:
AutoCloseable, Closeable, Operator, OperatorPipelineV3, Serializable, org.apache.spark.sql.api.java.UDF0<Iterator<org.apache.spark.sql.Row>>
All Known Subinterfaces:
SupportsAggregation, SupportsRepartitioning

@DeveloperApi public interface Collector extends Operator, org.apache.spark.sql.api.java.UDF0<Iterator<org.apache.spark.sql.Row>>
The operator responsible for retrieving command outputs from the local file system and returning them as a DataFrame. It can also be responsible for computing aggregates of command outputs. A Collector is a user-defined function (UDF) that takes no input parameters and returns the command output as a DataFrame read from local files. Collector objects are first created by a Collector operator factory (which implements CollectorSupport) when a pipeline task requests one, and they are lazily initialized when ready to run. Upon completion, the close method is invoked to release resources. SeqsLab supports multiple data-processing features for managing and optimizing workloads; a Collector can inform SeqsLab of the features it supports by implementing the corresponding mix-in interfaces.
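A minimal sketch of a possible implementation is shown below: it collects each line of the task output into a single-column Row. The class name LineCollector, the column name "line", and the handling of the Collector import are assumptions for illustration only; the Operator methods getName and getOperatorContext are omitted because their signatures are not documented on this page, so the class is left abstract.

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.RowFactory;
    import org.apache.spark.sql.types.DataTypes;
    import org.apache.spark.sql.types.StructType;

    // Collector import omitted: its exact package in the piper plugin API is not shown on this page.
    // Hypothetical collector that turns each line of the task output into a single-column Row.
    // Left abstract because the Operator methods (getName, getOperatorContext) are not documented here.
    public abstract class LineCollector implements Collector {

        private boolean isDirectory;
        private FileInputStream in;
        private DirectoryStream<Path> dir;

        @Override
        public Collector init(boolean isDirectory) {
            // Remember whether the task output is a directory; the pipeline then supplies
            // either a directory stream or a file input stream before call() runs.
            this.isDirectory = isDirectory;
            return this; // the object itself, so configuration can be chained
        }

        @Override
        public void setInputStream(FileInputStream in) {
            this.in = in;
        }

        @Override
        public void setDirectoryStream(DirectoryStream<Path> dir) {
            this.dir = dir;
        }

        @Override
        public StructType schema() {
            // Actual schema of the collected dataset: a single string column here.
            return new StructType().add("line", DataTypes.StringType);
        }

        @Override
        public Iterator<Row> call() throws Exception {
            List<Row> rows = new ArrayList<>();
            if (isDirectory) {
                // Iterate over every entry of the output directory supplied by the pipeline.
                for (Path p : dir) {
                    try (BufferedReader r = Files.newBufferedReader(p, StandardCharsets.UTF_8)) {
                        r.lines().forEach(line -> rows.add(RowFactory.create(line)));
                    }
                }
            } else {
                try (BufferedReader r =
                         new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
                    r.lines().forEach(line -> rows.add(RowFactory.create(line)));
                }
            }
            return rows.iterator();
        }

        @Override
        public void close() throws IOException {
            // Release the underlying streams once collection has completed.
            if (in != null) in.close();
            if (dir != null) dir.close();
        }
    }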
  • Method Summary

    Modifier and Type                        Method                                          Description
    Collector                                init(boolean isDirectory)                       Initializes this collector operator.
    org.apache.spark.sql.types.StructType    schema()                                        Returns the actual schema of this collected dataset, which may differ from the physical schema of the command outputs.
    void                                     setDirectoryStream(DirectoryStream<Path> dir)   When the task output to collect from is in a directory, sets the object used to iterate over the entries of that output directory.
    void                                     setInputStream(FileInputStream in)              Sets an input stream object to obtain input bytes from a task output file.

    Methods inherited from interface java.io.Closeable

    close

    Methods inherited from interface com.atgenomix.seqslab.piper.plugin.api.Operator

    getName, getOperatorContext

    Methods inherited from interface org.apache.spark.sql.api.java.UDF0

    call
  • Method Details

    • init

      Collector init(boolean isDirectory)
      Initializes this collector operator (see the lifecycle sketch following the Method Details section).
      Parameters:
      isDirectory - true if the task output to collect from is in a directory; otherwise, false
      Returns:
      The object itself
    • setInputStream

      void setInputStream(FileInputStream in)
      Sets an input stream object to obtain input bytes from a task output file.
      Parameters:
      in - The underlying file input stream
    • setDirectoryStream

      void setDirectoryStream(DirectoryStream<Path> dir)
      When the task output to collect from is in a directory, sets the object used to iterate over the entries of that output directory.
      Parameters:
      dir - The underlying directory stream
    • schema

      org.apache.spark.sql.types.StructType schema()
      Returns the actual schema of this collected dataset, which may differ from the physical schema of the command outputs, as column pruning or other optimizations may occur.
      Returns:
      The schema as a StructType
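
As a usage note, the sketch below shows how a pipeline task might drive a Collector through the documented lifecycle for a single output file: init, then setInputStream, then the inherited call(), and finally close(). The class and method names (CollectorLifecycleExample, collectAndCount) and the row-counting logic are illustrative assumptions, not part of the SeqsLab API.

    import java.io.FileInputStream;
    import java.nio.file.Path;
    import java.util.Iterator;

    import org.apache.spark.sql.Row;

    // Collector import omitted: its exact package in the piper plugin API is not shown on this page.
    public final class CollectorLifecycleExample {

        // Hypothetical helper: drives a Collector over one task output file and counts the collected rows.
        static long collectAndCount(Collector collector, Path outputFile) throws Exception {
            collector.init(false);                                              // the output is a single file
            collector.setInputStream(new FileInputStream(outputFile.toFile()));
            try {
                Iterator<Row> rows = collector.call();                          // UDF0.call() yields the rows
                long count = 0;
                while (rows.hasNext()) {
                    rows.next();
                    count++;
                }
                return count;
            } finally {
                collector.close();                                              // release resources when completed
            }
        }
    }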