⚙️ API reference

polars-bio API is grouped into the following categories:

  • File I/O: Reading files in various biological formats from local and cloud storage (see the sketch after this list).
  • Data Processing: Exposes a rich SQL programming interface powered by Apache DataFusion for operations such as sorting, filtering, and other transformations on input bioinformatics datasets registered as tables. You can easily query and process file formats such as VCF, GFF, BAM, and FASTQ using SQL syntax.
  • Interval Operations: Functions for performing common interval operations, such as overlap, nearest, and coverage.
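
For example, the File I/O and Data Processing categories compose naturally: a reader returns a (Lazy)DataFrame that you can then filter and transform with regular Polars expressions. A minimal sketch, assuming a hypothetical local VCF file at /tmp/example.vcf and the chrom/start/end columns produced by the VCF reader:

import polars as pl
import polars_bio as pb

# File I/O: lazily scan a VCF file (hypothetical local path).
lf = pb.scan_vcf("/tmp/example.vcf")

# Data Processing: standard Polars transformations on the scanned data.
result = (
    lf.filter(pl.col("chrom") == "chr1")
    .select(["chrom", "start", "end"])
    .limit(10)
    .collect()
)
print(result)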

There are two ways of using the polars-bio API:

  • using the polars_bio module

Example

import polars_bio as pb
pb.read_fastq("gs://genomics-public-data/platinum-genomes/fastq/ERR194146.fastq.gz").limit(1).collect()
  • directly on a Polars LazyFrame under a registered pb namespace

Example

>>> type(df)
<class 'polars.lazyframe.frame.LazyFrame'>
import polars_bio as pb
df.pb.sort().limit(5).collect()

Tip

  1. Not all functions are available in both ways.
  2. You can of course use both ways in the same script (see the example below).
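
Example

A small illustration of tip 2, mixing both styles in one script. The GFF path is hypothetical; the pb namespace call mirrors the module-level one:

import polars_bio as pb

# Module-level API: lazily scan a GFF file (hypothetical local path).
df = pb.scan_gff("/tmp/annotation.gff3")

# LazyFrame namespace API registered by polars_bio on the same object.
df.pb.sort().limit(5).collect()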

data_input

Source code in polars_bio/io.py
class IOOperations:
    @staticmethod
    def read_fasta(
        path: str,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = False,
    ) -> pl.DataFrame:
        """

        Read a FASTA file into a DataFrame.

        Parameters:
            path: The path to the FASTA file.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the FASTA file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
            projection_pushdown: Enable column projection pushdown optimization. When True, only requested columns are processed at the DataFusion execution level, improving performance and reducing memory usage.

        !!! Example
            ```shell
            wget https://www.ebi.ac.uk/ena/browser/api/fasta/BK006935.2?download=true -O /tmp/test.fasta
            ```

            ```python
            import polars_bio as pb
            pb.read_fasta("/tmp/test.fasta").limit(1)
            ```
            ```shell
             shape: (1, 3)
            ┌─────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
            │ name                    ┆ description                     ┆ sequence                        │
            │ ---                     ┆ ---                             ┆ ---                             │
            │ str                     ┆ str                             ┆ str                             │
            ╞═════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
            │ ENA|BK006935|BK006935.2 ┆ TPA_inf: Saccharomyces cerevis… ┆ CCACACCACACCCACACACCCACACACCAC… │
            └─────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘
            ```
        """
        return IOOperations.scan_fasta(
            path,
            chunk_size,
            concurrent_fetches,
            allow_anonymous,
            enable_request_payer,
            max_retries,
            timeout,
            compression_type,
            projection_pushdown,
        ).collect()

    @staticmethod
    def scan_fasta(
        path: str,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = False,
    ) -> pl.LazyFrame:
        """

        Lazily read a FASTA file into a LazyFrame.

        Parameters:
            path: The path to the FASTA file.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the FASTA file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        !!! Example
            ```shell
            wget https://www.ebi.ac.uk/ena/browser/api/fasta/BK006935.2?download=true -O /tmp/test.fasta
            ```

            ```python
            import polars_bio as pb
            pb.scan_fasta("/tmp/test.fasta").limit(1).collect()
            ```
            ```shell
             shape: (1, 3)
            ┌─────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
            │ name                    ┆ description                     ┆ sequence                        │
            │ ---                     ┆ ---                             ┆ ---                             │
            │ str                     ┆ str                             ┆ str                             │
            ╞═════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
            │ ENA|BK006935|BK006935.2 ┆ TPA_inf: Saccharomyces cerevis… ┆ CCACACCACACCCACACACCCACACACCAC… │
            └─────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘
            ```
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )
        fasta_read_options = FastaReadOptions(
            object_storage_options=object_storage_options
        )
        read_options = ReadOptions(fasta_read_options=fasta_read_options)
        return _read_file(path, InputFormat.Fasta, read_options, projection_pushdown)

    @staticmethod
    def read_vcf(
        path: str,
        info_fields: Union[list[str], None] = None,
        thread_num: int = 1,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = False,
    ) -> pl.DataFrame:
        """
        Read a VCF file into a DataFrame.

        Parameters:
            path: The path to the VCF file.
            info_fields: List of INFO field names to include. If *None*, all INFO fields will be detected automatically from the VCF header. Use this to limit fields for better performance.
            thread_num: The number of threads to use for reading the VCF file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        !!! note
            VCF reader uses **1-based** coordinate system for the `start` and `end` columns.
        """
        return IOOperations.scan_vcf(
            path,
            info_fields,
            thread_num,
            chunk_size,
            concurrent_fetches,
            allow_anonymous,
            enable_request_payer,
            max_retries,
            timeout,
            compression_type,
            projection_pushdown,
        ).collect()

    @staticmethod
    def scan_vcf(
        path: str,
        info_fields: Union[list[str], None] = None,
        thread_num: int = 1,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = False,
    ) -> pl.LazyFrame:
        """
        Lazily read a VCF file into a LazyFrame.

        Parameters:
            path: The path to the VCF file.
            info_fields: List of INFO field names to include. If *None*, all INFO fields will be detected automatically from the VCF header. Use this to limit fields for better performance.
            thread_num: The number of threads to use for reading the VCF file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        !!! note
            VCF reader uses **1-based** coordinate system for the `start` and `end` columns.
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        # Use provided info_fields or autodetect from VCF header
        if info_fields is not None:
            initial_info_fields = info_fields
        else:
            # Get all info fields from VCF header for proper projection pushdown
            all_info_fields = None
            try:
                vcf_schema_df = IOOperations.describe_vcf(
                    path,
                    allow_anonymous=allow_anonymous,
                    enable_request_payer=enable_request_payer,
                    compression_type=compression_type,
                )
                # Use column name 'name' not 'id' based on the schema output
                all_info_fields = vcf_schema_df.select("name").to_series().to_list()
            except Exception:
                # Fallback to None if unable to get info fields
                all_info_fields = None

            # Always start with all info fields to establish full schema
            # The callback will re-register with only requested info fields for optimization
            initial_info_fields = all_info_fields

        vcf_read_options = VcfReadOptions(
            info_fields=initial_info_fields,
            thread_num=thread_num,
            object_storage_options=object_storage_options,
        )
        read_options = ReadOptions(vcf_read_options=vcf_read_options)
        return _read_file(path, InputFormat.Vcf, read_options, projection_pushdown)

    @staticmethod
    def read_gff(
        path: str,
        thread_num: int = 1,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = False,
        predicate_pushdown: bool = False,
        parallel: bool = False,
    ) -> pl.DataFrame:
        """
        Read a GFF file into a DataFrame.

        Parameters:
            path: The path to the GFF file.
            thread_num: The number of threads to use for reading the GFF file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the GFF file. If not specified, it will be detected automatically.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
            predicate_pushdown: Enable predicate pushdown optimization to push filter conditions down to the DataFusion table provider level, reducing data processing and I/O.
            parallel: Whether to use the parallel reader for BGZF-compressed local files (uses BGZF chunk-level parallelism similar to FASTQ).

        !!! note
            GFF reader uses **1-based** coordinate system for the `start` and `end` columns.
        """
        return IOOperations.scan_gff(
            path,
            thread_num,
            chunk_size,
            concurrent_fetches,
            allow_anonymous,
            enable_request_payer,
            max_retries,
            timeout,
            compression_type,
            projection_pushdown,
            predicate_pushdown,
            parallel,
        ).collect()

    @staticmethod
    def scan_gff(
        path: str,
        thread_num: int = 1,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = False,
        predicate_pushdown: bool = False,
        parallel: bool = False,
    ) -> pl.LazyFrame:
        """
        Lazily read a GFF file into a LazyFrame.

        Parameters:
            path: The path to the GFF file.
            thread_num: The number of threads to use for reading the GFF file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large-scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the GFF file. If not specified, it will be detected automatically.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
            predicate_pushdown: Enable predicate pushdown optimization to push filter conditions down to the DataFusion table provider level, reducing data processing and I/O.
            parallel: Whether to use the parallel reader for BGZF-compressed local files (uses BGZF chunk-level parallelism similar to FASTQ).

        !!! note
            GFF reader uses **1-based** coordinate system for the `start` and `end` columns.
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        gff_read_options = GffReadOptions(
            attr_fields=None,
            thread_num=thread_num,
            object_storage_options=object_storage_options,
            parallel=parallel,
        )
        read_options = ReadOptions(gff_read_options=gff_read_options)
        return _read_file(
            path, InputFormat.Gff, read_options, projection_pushdown, predicate_pushdown
        )

    @staticmethod
    def read_bam(
        path: str,
        thread_num: int = 1,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        projection_pushdown: bool = False,
    ) -> pl.DataFrame:
        """
        Read a BAM file into a DataFrame.

        Parameters:
            path: The path to the BAM file.
            thread_num: The number of threads to use for reading the BAM file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large-scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large-scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        !!! note
            BAM reader uses **1-based** coordinate system for the `start`, `end`, `mate_start`, `mate_end` columns.
        """
        return IOOperations.scan_bam(
            path,
            thread_num,
            chunk_size,
            concurrent_fetches,
            allow_anonymous,
            enable_request_payer,
            max_retries,
            timeout,
            projection_pushdown,
        ).collect()

    @staticmethod
    def scan_bam(
        path: str,
        thread_num: int = 1,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        projection_pushdown: bool = False,
    ) -> pl.LazyFrame:
        """
        Lazily read a BAM file into a LazyFrame.

        Parameters:
            path: The path to the BAM file.
            thread_num: The number of threads to use for reading the BAM file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        !!! note
            BAM reader uses **1-based** coordinate system for the `start`, `end`, `mate_start`, `mate_end` columns.
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type="auto",
        )

        bam_read_options = BamReadOptions(
            thread_num=thread_num,
            object_storage_options=object_storage_options,
        )
        read_options = ReadOptions(bam_read_options=bam_read_options)
        return _read_file(path, InputFormat.Bam, read_options, projection_pushdown)

    @staticmethod
    def read_fastq(
        path: str,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        parallel: bool = False,
        projection_pushdown: bool = False,
    ) -> pl.DataFrame:
        """
        Read a FASTQ file into a DataFrame.

        Parameters:
            path: The path to the FASTQ file.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
            parallel: Whether to use the parallel reader for BGZF compressed files stored **locally**. GZI index is **required**.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
        """
        return IOOperations.scan_fastq(
            path,
            chunk_size,
            concurrent_fetches,
            allow_anonymous,
            enable_request_payer,
            max_retries,
            timeout,
            compression_type,
            parallel,
            projection_pushdown,
        ).collect()

    @staticmethod
    def scan_fastq(
        path: str,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        parallel: bool = False,
        projection_pushdown: bool = False,
    ) -> pl.LazyFrame:
        """
        Lazily read a FASTQ file into a LazyFrame.

        Parameters:
            path: The path to the FASTQ file.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
            parallel: Whether to use the parallel reader for BGZF compressed files stored **locally**. GZI index is **required**.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        fastq_read_options = FastqReadOptions(
            object_storage_options=object_storage_options, parallel=parallel
        )
        read_options = ReadOptions(fastq_read_options=fastq_read_options)
        return _read_file(path, InputFormat.Fastq, read_options, projection_pushdown)

    @staticmethod
    def read_bed(
        path: str,
        thread_num: int = 1,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = False,
    ) -> pl.DataFrame:
        """
        Read a BED file into a DataFrame.

        Parameters:
            path: The path to the BED file.
            thread_num: The number of threads to use for reading the BED file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the BED file. If not specified, it will be detected automatically based on the file extension. BGZF compression is supported ('bgz').
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        !!! Note
            Only **BED4** format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name.
            Also unlike other text formats, **GZIP** compression is not supported.

        !!! note
            BED reader uses **1-based** coordinate system for the `start` and `end` columns.
        """
        return IOOperations.scan_bed(
            path,
            thread_num,
            chunk_size,
            concurrent_fetches,
            allow_anonymous,
            enable_request_payer,
            max_retries,
            timeout,
            compression_type,
            projection_pushdown,
        ).collect()

    @staticmethod
    def scan_bed(
        path: str,
        thread_num: int = 1,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = False,
    ) -> pl.LazyFrame:
        """
        Lazily read a BED file into a LazyFrame.

        Parameters:
            path: The path to the BED file.
            thread_num: The number of threads to use for reading the BED file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the BED file. If not specified, it will be detected automatically based on the file extension. BGZF compression is supported ('bgz').
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        !!! Note
            Only **BED4** format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name.
            Also unlike other text formats, **GZIP** compression is not supported.

        !!! note
            BED reader uses **1-based** coordinate system for the `start` and `end` columns.
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        bed_read_options = BedReadOptions(
            thread_num=thread_num,
            object_storage_options=object_storage_options,
        )
        read_options = ReadOptions(bed_read_options=bed_read_options)
        return _read_file(path, InputFormat.Bed, read_options, projection_pushdown)

    @staticmethod
    def read_table(path: str, schema: Dict = None, **kwargs) -> pl.DataFrame:
        """
         Read a tab-delimited (i.e. BED) file into a Polars DataFrame.
         Tries to be compatible with Bioframe's [read_table](https://bioframe.readthedocs.io/en/latest/guide-io.html)
         but faster. Schema should follow the Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).

        Parameters:
            path: The path to the file.
            schema: Schema should follow the Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).
        """
        return IOOperations.scan_table(path, schema, **kwargs).collect()

    @staticmethod
    def scan_table(path: str, schema: Dict = None, **kwargs) -> pl.LazyFrame:
        """
         Lazily read a tab-delimited (i.e. BED) file into a Polars LazyFrame.
         Tries to be compatible with Bioframe's [read_table](https://bioframe.readthedocs.io/en/latest/guide-io.html)
         but faster and lazy. Schema should follow the Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).

        Parameters:
            path: The path to the file.
            schema: Schema should follow the Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).
        """
        df = pl.scan_csv(path, separator="\t", has_header=False, **kwargs)
        if schema is not None:
            columns = SCHEMAS[schema]
            if len(columns) != len(df.collect_schema()):
                raise ValueError(
                    f"Schema incompatible with the input. Expected {len(columns)} columns in a schema, got {len(df.collect_schema())} in the input data file. Please provide a valid schema."
                )
            for i, c in enumerate(columns):
                df = df.rename({f"column_{i+1}": c})
        return df

    @staticmethod
    def describe_vcf(
        path: str,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        compression_type: str = "auto",
    ) -> pl.DataFrame:
        """
        Describe VCF INFO schema.

        Parameters:
            path: The path to the VCF file.
            allow_anonymous: Whether to allow anonymous access to object storage (GCS and S3 supported).
            enable_request_payer: Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=8,
            concurrent_fetches=1,
            max_retries=1,
            timeout=10,
            compression_type=compression_type,
        )
        return py_describe_vcf(ctx, path, object_storage_options).to_polars()

    @staticmethod
    def from_polars(name: str, df: Union[pl.DataFrame, pl.LazyFrame]) -> None:
        """
        Register a Polars DataFrame as a DataFusion table.

        Parameters:
            name: The name of the table.
            df: The Polars DataFrame.
        """
        reader = (
            df.to_arrow()
            if isinstance(df, pl.DataFrame)
            else df.collect().to_arrow().to_reader()
        )
        py_from_polars(ctx, name, reader)

describe_vcf(path, allow_anonymous=True, enable_request_payer=False, compression_type='auto') staticmethod

Describe VCF INFO schema.

Parameters:

path (str, required): The path to the VCF file.
allow_anonymous (bool, default True): Whether to allow anonymous access to object storage (GCS and S3 supported).
enable_request_payer (bool, default False): Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
compression_type (str, default 'auto'): The compression type of the VCF file. If not specified, it will be detected automatically.
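
A short usage sketch; the VCF path is hypothetical. The returned DataFrame contains one row per INFO field, including a name column that scan_vcf itself uses when info_fields is not provided:

import polars_bio as pb

# Inspect the INFO fields declared in the VCF header (hypothetical path).
info_schema = pb.describe_vcf("/tmp/example.vcf.gz")

# List the INFO field identifiers from the "name" column.
print(info_schema.select("name").to_series().to_list())
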
Source code in polars_bio/io.py
@staticmethod
def describe_vcf(
    path: str,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    compression_type: str = "auto",
) -> pl.DataFrame:
    """
    Describe VCF INFO schema.

    Parameters:
        path: The path to the VCF file.
        allow_anonymous: Whether to allow anonymous access to object storage (GCS and S3 supported).
        enable_request_payer: Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=8,
        concurrent_fetches=1,
        max_retries=1,
        timeout=10,
        compression_type=compression_type,
    )
    return py_describe_vcf(ctx, path, object_storage_options).to_polars()

from_polars(name, df) staticmethod

Register a Polars DataFrame as a DataFusion table.

Parameters:

name (str, required): The name of the table.
df (Union[DataFrame, LazyFrame], required): The Polars DataFrame.
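
A minimal sketch; the table name and columns are illustrative:

import polars as pl
import polars_bio as pb

# An illustrative Polars DataFrame with interval-like columns.
df = pl.DataFrame(
    {
        "chrom": ["chr1", "chr1"],
        "start": [100, 500],
        "end": [200, 800],
    }
)

# Register it as a DataFusion table named "my_intervals".
pb.from_polars("my_intervals", df)
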
Source code in polars_bio/io.py
@staticmethod
def from_polars(name: str, df: Union[pl.DataFrame, pl.LazyFrame]) -> None:
    """
    Register a Polars DataFrame as a DataFusion table.

    Parameters:
        name: The name of the table.
        df: The Polars DataFrame.
    """
    reader = (
        df.to_arrow()
        if isinstance(df, pl.DataFrame)
        else df.collect().to_arrow().to_reader()
    )
    py_from_polars(ctx, name, reader)

read_bam(path, thread_num=1, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, projection_pushdown=False) staticmethod

Read a BAM file into a DataFrame.

Parameters:

path (str, required): The path to the BAM file.
thread_num (int, default 1): The number of threads to use for reading the BAM file. Used only for parallel decompression of BGZF blocks. Works only for local files.
chunk_size (int, default 8): The size in MB of a chunk when reading from an object store. The default is 8 MB. For large-scale operations, it is recommended to increase this value to 64.
concurrent_fetches (int, default 1): [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large-scale operations, it is recommended to increase this value to 8 or even more.
allow_anonymous (bool, default True): [GCS, AWS S3] Whether to allow anonymous access to object storage.
enable_request_payer (bool, default False): [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
max_retries (int, default 5): The maximum number of retries for reading the file from object storage.
timeout (int, default 300): The timeout in seconds for reading the file from object storage.
projection_pushdown (bool, default False): Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

Note

BAM reader uses 1-based coordinate system for the start, end, mate_start, mate_end columns.
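
A hedged usage sketch; the BAM path is hypothetical:

import polars_bio as pb

# Read a BAM file eagerly (hypothetical local path), using 4 threads
# for BGZF block decompression.
alignments = pb.read_bam("/tmp/example.bam", thread_num=4)

# start, end, mate_start and mate_end are 1-based coordinates.
print(alignments.head())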

Source code in polars_bio/io.py
@staticmethod
def read_bam(
    path: str,
    thread_num: int = 1,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    projection_pushdown: bool = False,
) -> pl.DataFrame:
    """
    Read a BAM file into a DataFrame.

    Parameters:
        path: The path to the BAM file.
        thread_num: The number of threads to use for reading the BAM file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large-scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large-scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

    !!! note
        BAM reader uses **1-based** coordinate system for the `start`, `end`, `mate_start`, `mate_end` columns.
    """
    return IOOperations.scan_bam(
        path,
        thread_num,
        chunk_size,
        concurrent_fetches,
        allow_anonymous,
        enable_request_payer,
        max_retries,
        timeout,
        projection_pushdown,
    ).collect()

read_bed(path, thread_num=1, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=False) staticmethod

Read a BED file into a DataFrame.

Parameters:

path (str, required): The path to the BED file.
thread_num (int, default 1): The number of threads to use for reading the BED file. Used only for parallel decompression of BGZF blocks. Works only for local files.
chunk_size (int, default 8): The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
concurrent_fetches (int, default 1): [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
allow_anonymous (bool, default True): [GCS, AWS S3] Whether to allow anonymous access to object storage.
enable_request_payer (bool, default False): [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
max_retries (int, default 5): The maximum number of retries for reading the file from object storage.
timeout (int, default 300): The timeout in seconds for reading the file from object storage.
compression_type (str, default 'auto'): The compression type of the BED file. If not specified, it will be detected automatically based on the file extension. BGZF compression is supported ('bgz').
projection_pushdown (bool, default False): Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

Note

Only BED4 format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name. Also unlike other text formats, GZIP compression is not supported.

Note

BED reader uses 1-based coordinate system for the start and end columns.
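
A hedged usage sketch; the BED4 path is hypothetical:

import polars_bio as pb

# Read a BED4 file (chromosome, start, end, name); hypothetical local path.
regions = pb.read_bed("/tmp/regions.bed")

# start and end are reported in a 1-based coordinate system.
print(regions.head())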

Source code in polars_bio/io.py
@staticmethod
def read_bed(
    path: str,
    thread_num: int = 1,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = False,
) -> pl.DataFrame:
    """
    Read a BED file into a DataFrame.

    Parameters:
        path: The path to the BED file.
        thread_num: The number of threads to use for reading the BED file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the BED file. If not specified, it will be detected automatically based on the file extension. BGZF compression is supported ('bgz').
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

    !!! Note
        Only **BED4** format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name.
        Also unlike other text formats, **GZIP** compression is not supported.

    !!! note
        BED reader uses **1-based** coordinate system for the `start` and `end` columns.
    """
    return IOOperations.scan_bed(
        path,
        thread_num,
        chunk_size,
        concurrent_fetches,
        allow_anonymous,
        enable_request_payer,
        max_retries,
        timeout,
        compression_type,
        projection_pushdown,
    ).collect()

read_fasta(path, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=False) staticmethod

Read a FASTA file into a DataFrame.

Parameters:

path (str, required): The path to the FASTA file.
chunk_size (int, default 8): The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
concurrent_fetches (int, default 1): [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
allow_anonymous (bool, default True): [GCS, AWS S3] Whether to allow anonymous access to object storage.
enable_request_payer (bool, default False): [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
max_retries (int, default 5): The maximum number of retries for reading the file from object storage.
timeout (int, default 300): The timeout in seconds for reading the file from object storage.
compression_type (str, default 'auto'): The compression type of the FASTA file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
projection_pushdown (bool, default False): Enable column projection pushdown optimization. When True, only requested columns are processed at the DataFusion execution level, improving performance and reducing memory usage.

Example

wget https://www.ebi.ac.uk/ena/browser/api/fasta/BK006935.2?download=true -O /tmp/test.fasta

import polars_bio as pb
pb.read_fasta("/tmp/test.fasta").limit(1)
 shape: (1, 3)
┌─────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
│ name                    ┆ description                     ┆ sequence                        │
│ ---                     ┆ ---                             ┆ ---                             │
│ str                     ┆ str                             ┆ str                             │
╞═════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
│ ENA|BK006935|BK006935.2 ┆ TPA_inf: Saccharomyces cerevis… ┆ CCACACCACACCCACACACCCACACACCAC… │
└─────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘

Source code in polars_bio/io.py
@staticmethod
def read_fasta(
    path: str,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = False,
) -> pl.DataFrame:
    """

    Read a FASTA file into a DataFrame.

    Parameters:
        path: The path to the FASTA file.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the FASTA file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
        projection_pushdown: Enable column projection pushdown optimization. When True, only requested columns are processed at the DataFusion execution level, improving performance and reducing memory usage.

    !!! Example
        ```shell
        wget https://www.ebi.ac.uk/ena/browser/api/fasta/BK006935.2?download=true -O /tmp/test.fasta
        ```

        ```python
        import polars_bio as pb
        pb.read_fasta("/tmp/test.fasta").limit(1)
        ```
        ```shell
         shape: (1, 3)
        ┌─────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
        │ name                    ┆ description                     ┆ sequence                        │
        │ ---                     ┆ ---                             ┆ ---                             │
        │ str                     ┆ str                             ┆ str                             │
        ╞═════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
        │ ENA|BK006935|BK006935.2 ┆ TPA_inf: Saccharomyces cerevis… ┆ CCACACCACACCCACACACCCACACACCAC… │
        └─────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘
        ```
    """
    return IOOperations.scan_fasta(
        path,
        chunk_size,
        concurrent_fetches,
        allow_anonymous,
        enable_request_payer,
        max_retries,
        timeout,
        compression_type,
        projection_pushdown,
    ).collect()

read_fastq(path, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', parallel=False, projection_pushdown=False) staticmethod

Read a FASTQ file into a DataFrame.

Parameters:

Name Type Description Default
path str

The path to the FASTQ file.

required
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
compression_type str

The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').

'auto'
parallel bool

Whether to use the parallel reader for BGZF compressed files stored locally. GZI index is required.

False
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

False
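
Example

A minimal usage sketch, reusing the same public FASTQ file as the register_fastq example further below; note that read_fastq materializes the whole file eagerly, so prefer scan_fastq for very large inputs:

import polars_bio as pb
# Eagerly read a FASTQ file from GCS (anonymous access is enabled by default).
df = pb.read_fastq("gs://genomics-public-data/platinum-genomes/fastq/ERR194146.fastq.gz")
df.select(["name", "description"]).limit(5)
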
Source code in polars_bio/io.py
@staticmethod
def read_fastq(
    path: str,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    parallel: bool = False,
    projection_pushdown: bool = False,
) -> pl.DataFrame:
    """
    Read a FASTQ file into a DataFrame.

    Parameters:
        path: The path to the FASTQ file.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
        parallel: Whether to use the parallel reader for BGZF compressed files stored **locally**. GZI index is **required**.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
    """
    return IOOperations.scan_fastq(
        path,
        chunk_size,
        concurrent_fetches,
        allow_anonymous,
        enable_request_payer,
        max_retries,
        timeout,
        compression_type,
        parallel,
        projection_pushdown,
    ).collect()

read_gff(path, thread_num=1, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=False, predicate_pushdown=False, parallel=False) staticmethod

Read a GFF file into a DataFrame.

Parameters:

Name Type Description Default
path str

The path to the GFF file.

required
thread_num int

The number of threads to use for reading the GFF file. Used only for parallel decompression of BGZF blocks. Works only for local files.

1
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
compression_type str

The compression type of the GFF file. If not specified, it will be detected automatically.

'auto'
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

False
predicate_pushdown bool

Enable predicate pushdown optimization to push filter conditions down to the DataFusion table provider level, reducing data processing and I/O.

False
parallel bool

Whether to use the parallel reader for BGZF-compressed local files (uses BGZF chunk-level parallelism similar to FASTQ).

False

Note

GFF reader uses 1-based coordinate system for the start and end columns.
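
Example

A minimal sketch, assuming the GENCODE annotation file from the register_gff example further below has already been downloaded to /tmp:

import polars_bio as pb
# Eagerly read a gzip-compressed GFF3 file; compression is detected from the file extension.
pb.read_gff("/tmp/gencode.v38.annotation.gff3.gz").limit(5)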

Source code in polars_bio/io.py
@staticmethod
def read_gff(
    path: str,
    thread_num: int = 1,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = False,
    predicate_pushdown: bool = False,
    parallel: bool = False,
) -> pl.DataFrame:
    """
    Read a GFF file into a DataFrame.

    Parameters:
        path: The path to the GFF file.
        thread_num: The number of threads to use for reading the GFF file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the GFF file. If not specified, it will be detected automatically.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
        predicate_pushdown: Enable predicate pushdown optimization to push filter conditions down to the DataFusion table provider level, reducing data processing and I/O.
        parallel: Whether to use the parallel reader for BGZF-compressed local files (uses BGZF chunk-level parallelism similar to FASTQ).

    !!! note
        GFF reader uses **1-based** coordinate system for the `start` and `end` columns.
    """
    return IOOperations.scan_gff(
        path,
        thread_num,
        chunk_size,
        concurrent_fetches,
        allow_anonymous,
        enable_request_payer,
        max_retries,
        timeout,
        compression_type,
        projection_pushdown,
        predicate_pushdown,
        parallel,
    ).collect()

read_table(path, schema=None, **kwargs) staticmethod

Read a tab-delimited (i.e. BED) file into a Polars DataFrame. Tries to be compatible with Bioframe's read_table but faster. Schema should follow the Bioframe's schema format.

Parameters:

Name Type Description Default
path str

The path to the file.

required
schema Dict

Schema should follow the Bioframe's schema format.

None
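
Example

A minimal sketch, assuming a local BED file at /tmp/example.bed (illustrative path) and that "bed4" is one of the available Bioframe schema names:

import polars_bio as pb
# Read a tab-delimited file and apply Bioframe-style column names ("bed4" is assumed to be in the schema set).
pb.read_table("/tmp/example.bed", schema="bed4")
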
Source code in polars_bio/io.py
@staticmethod
def read_table(path: str, schema: Dict = None, **kwargs) -> pl.DataFrame:
    """
     Read a tab-delimited (i.e. BED) file into a Polars DataFrame.
     Tries to be compatible with Bioframe's [read_table](https://bioframe.readthedocs.io/en/latest/guide-io.html)
     but faster. Schema should follow the Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).

    Parameters:
        path: The path to the file.
        schema: Schema should follow the Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).
    """
    return IOOperations.scan_table(path, schema, **kwargs).collect()

read_vcf(path, info_fields=None, thread_num=1, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=False) staticmethod

Read a VCF file into a DataFrame.

Parameters:

Name Type Description Default
path str

The path to the VCF file.

required
info_fields Union[list[str], None]

List of INFO field names to include. If None, all INFO fields will be detected automatically from the VCF header. Use this to limit fields for better performance.

None
thread_num int

The number of threads to use for reading the VCF file. Used only for parallel decompression of BGZF blocks. Works only for local files.

1
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
compression_type str

The compression type of the VCF file. If not specified, it will be detected automatically.

'auto'
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

False

Note

VCF reader uses 1-based coordinate system for the start and end columns.
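
Example

A minimal sketch, assuming the gnomAD SV sites VCF from the register_vcf example further below is available locally; the INFO field names are illustrative:

import polars_bio as pb
# Eagerly read a VCF, limiting INFO parsing to selected fields for speed.
pb.read_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", info_fields=["SVTYPE", "SVLEN"]).limit(5)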

Source code in polars_bio/io.py
@staticmethod
def read_vcf(
    path: str,
    info_fields: Union[list[str], None] = None,
    thread_num: int = 1,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = False,
) -> pl.DataFrame:
    """
    Read a VCF file into a DataFrame.

    Parameters:
        path: The path to the VCF file.
        info_fields: List of INFO field names to include. If *None*, all INFO fields will be detected automatically from the VCF header. Use this to limit fields for better performance.
        thread_num: The number of threads to use for reading the VCF file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

    !!! note
        VCF reader uses **1-based** coordinate system for the `start` and `end` columns.
    """
    return IOOperations.scan_vcf(
        path,
        info_fields,
        thread_num,
        chunk_size,
        concurrent_fetches,
        allow_anonymous,
        enable_request_payer,
        max_retries,
        timeout,
        compression_type,
        projection_pushdown,
    ).collect()

scan_bam(path, thread_num=1, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, projection_pushdown=False) staticmethod

Lazily read a BAM file into a LazyFrame.

Parameters:

Name Type Description Default
path str

The path to the BAM file.

required
thread_num int

The number of threads to use for reading the BAM file. Used only for parallel decompression of BGZF blocks. Works only for local files.

1
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

False

Note

BAM reader uses 1-based coordinate system for the start, end, mate_start, mate_end columns.
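
Example

A minimal sketch, assuming a local BAM file at /tmp/example.bam (illustrative path):

import polars_bio as pb
# Lazily scan a BAM file; rows are only materialized at collect().
pb.scan_bam("/tmp/example.bam", thread_num=2).limit(5).collect()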

Source code in polars_bio/io.py
@staticmethod
def scan_bam(
    path: str,
    thread_num: int = 1,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    projection_pushdown: bool = False,
) -> pl.LazyFrame:
    """
    Lazily read a BAM file into a LazyFrame.

    Parameters:
        path: The path to the BAM file.
        thread_num: The number of threads to use for reading the BAM file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

    !!! note
        BAM reader uses **1-based** coordinate system for the `start`, `end`, `mate_start`, `mate_end` columns.
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type="auto",
    )

    bam_read_options = BamReadOptions(
        thread_num=thread_num,
        object_storage_options=object_storage_options,
    )
    read_options = ReadOptions(bam_read_options=bam_read_options)
    return _read_file(path, InputFormat.Bam, read_options, projection_pushdown)

scan_bed(path, thread_num=1, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=False) staticmethod

Lazily read a BED file into a LazyFrame.

Parameters:

Name Type Description Default
path str

The path to the BED file.

required
thread_num int

The number of threads to use for reading the BED file. Used only for parallel decompression of BGZF blocks. Works only for local files.

1
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
compression_type str

The compression type of the BED file. If not specified, it will be detected automatically based on the file extension. BGZF compression is supported ('bgz').

'auto'
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

False

Note

Only BED4 format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name. Also unlike other text formats, GZIP compression is not supported.

Note

BED reader uses 1-based coordinate system for the start and end columns.
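
Example

A minimal sketch, assuming a local BED4 file at /tmp/example.bed (illustrative path):

import polars_bio as pb
# Lazily scan a BED4 file and materialize only the first records.
pb.scan_bed("/tmp/example.bed").limit(5).collect()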

Source code in polars_bio/io.py
@staticmethod
def scan_bed(
    path: str,
    thread_num: int = 1,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = False,
) -> pl.LazyFrame:
    """
    Lazily read a BED file into a LazyFrame.

    Parameters:
        path: The path to the BED file.
        thread_num: The number of threads to use for reading the BED file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the BED file. If not specified, it will be detected automatically based on the file extension. BGZF compression is supported ('bgz').
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

    !!! Note
        Only **BED4** format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name.
        Also unlike other text formats, **GZIP** compression is not supported.

    !!! note
        BED reader uses **1-based** coordinate system for the `start` and `end` columns.
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    bed_read_options = BedReadOptions(
        thread_num=thread_num,
        object_storage_options=object_storage_options,
    )
    read_options = ReadOptions(bed_read_options=bed_read_options)
    return _read_file(path, InputFormat.Bed, read_options, projection_pushdown)

scan_fasta(path, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=False) staticmethod

Lazily read a FASTA file into a LazyFrame.

Parameters:

Name Type Description Default
path str

The path to the FASTA file.

required
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
compression_type str

The compression type of the FASTA file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').

'auto'
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

False

Example

wget https://www.ebi.ac.uk/ena/browser/api/fasta/BK006935.2?download=true -O /tmp/test.fasta

import polars_bio as pb
pb.scan_fasta("/tmp/test.fasta").limit(1).collect()
 shape: (1, 3)
┌─────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
│ name                    ┆ description                     ┆ sequence                        │
│ ---                     ┆ ---                             ┆ ---                             │
│ str                     ┆ str                             ┆ str                             │
╞═════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
│ ENA|BK006935|BK006935.2 ┆ TPA_inf: Saccharomyces cerevis… ┆ CCACACCACACCCACACACCCACACACCAC… │
└─────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘

Source code in polars_bio/io.py
@staticmethod
def scan_fasta(
    path: str,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = False,
) -> pl.LazyFrame:
    """

    Lazily read a FASTA file into a LazyFrame.

    Parameters:
        path: The path to the FASTA file.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the FASTA file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

    !!! Example
        ```shell
        wget https://www.ebi.ac.uk/ena/browser/api/fasta/BK006935.2?download=true -O /tmp/test.fasta
        ```

        ```python
        import polars_bio as pb
        pb.scan_fasta("/tmp/test.fasta").limit(1).collect()
        ```
        ```shell
         shape: (1, 3)
        ┌─────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
        │ name                    ┆ description                     ┆ sequence                        │
        │ ---                     ┆ ---                             ┆ ---                             │
        │ str                     ┆ str                             ┆ str                             │
        ╞═════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
        │ ENA|BK006935|BK006935.2 ┆ TPA_inf: Saccharomyces cerevis… ┆ CCACACCACACCCACACACCCACACACCAC… │
        └─────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘
        ```
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )
    fasta_read_options = FastaReadOptions(
        object_storage_options=object_storage_options
    )
    read_options = ReadOptions(fasta_read_options=fasta_read_options)
    return _read_file(path, InputFormat.Fasta, read_options, projection_pushdown)

scan_fastq(path, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', parallel=False, projection_pushdown=False) staticmethod

Lazily read a FASTQ file into a LazyFrame.

Parameters:

Name Type Description Default
path str

The path to the FASTQ file.

required
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
compression_type str

The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').

'auto'
parallel bool

Whether to use the parallel reader for BGZF compressed files stored locally. GZI index is required.

False
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

False
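
Example

A minimal sketch, reusing the same public FASTQ file as the register_fastq example further below:

import polars_bio as pb
# Lazily scan a FASTQ file from GCS and materialize only the first record.
pb.scan_fastq("gs://genomics-public-data/platinum-genomes/fastq/ERR194146.fastq.gz").limit(1).collect()
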
Source code in polars_bio/io.py
@staticmethod
def scan_fastq(
    path: str,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    parallel: bool = False,
    projection_pushdown: bool = False,
) -> pl.LazyFrame:
    """
    Lazily read a FASTQ file into a LazyFrame.

    Parameters:
        path: The path to the FASTQ file.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
        parallel: Whether to use the parallel reader for BGZF compressed files stored **locally**. GZI index is **required**.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    fastq_read_options = FastqReadOptions(
        object_storage_options=object_storage_options, parallel=parallel
    )
    read_options = ReadOptions(fastq_read_options=fastq_read_options)
    return _read_file(path, InputFormat.Fastq, read_options, projection_pushdown)

scan_gff(path, thread_num=1, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=False, predicate_pushdown=False, parallel=False) staticmethod

Lazily read a GFF file into a LazyFrame.

Parameters:

Name Type Description Default
path str

The path to the GFF file.

required
thread_num int

The number of threads to use for reading the GFF file. Used only for parallel decompression of BGZF blocks. Works only for local files.

1
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large-scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
compression_type str

The compression type of the GFF file. If not specified, it will be detected automatically.

'auto'
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

False
predicate_pushdown bool

Enable predicate pushdown optimization to push filter conditions down to the DataFusion table provider level, reducing data processing and I/O.

False
parallel bool

Whether to use the parallel reader for BGZF-compressed local files (uses BGZF chunk-level parallelism, similar to FASTQ).

False

Note

GFF reader uses 1-based coordinate system for the start and end columns.
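
Example

A minimal sketch, assuming the GENCODE annotation file from the register_gff example further below has been downloaded to /tmp; with projection pushdown enabled only the selected columns are read:

import polars_bio as pb
lf = pb.scan_gff(
    "/tmp/gencode.v38.annotation.gff3.gz",
    projection_pushdown=True,
    predicate_pushdown=True,
)
# Only the start and end columns are requested, so pushdown limits the work done at scan time.
lf.select(["start", "end"]).limit(5).collect()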

Source code in polars_bio/io.py
@staticmethod
def scan_gff(
    path: str,
    thread_num: int = 1,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = False,
    predicate_pushdown: bool = False,
    parallel: bool = False,
) -> pl.LazyFrame:
    """
    Lazily read a GFF file into a LazyFrame.

    Parameters:
        path: The path to the GFF file.
        thread_num: The number of threads to use for reading the GFF file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large-scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the GFF file. If not specified, it will be detected automatically.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
        predicate_pushdown: Enable predicate pushdown optimization to push filter conditions down to the DataFusion table provider level, reducing data processing and I/O.
        parallel: Whether to use the parallel reader for BGZF-compressed local files (uses BGZF chunk-level parallelism, similar to FASTQ).

    !!! note
        GFF reader uses **1-based** coordinate system for the `start` and `end` columns.
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    gff_read_options = GffReadOptions(
        attr_fields=None,
        thread_num=thread_num,
        object_storage_options=object_storage_options,
        parallel=parallel,
    )
    read_options = ReadOptions(gff_read_options=gff_read_options)
    return _read_file(
        path, InputFormat.Gff, read_options, projection_pushdown, predicate_pushdown
    )

scan_table(path, schema=None, **kwargs) staticmethod

Lazily read a tab-delimited (i.e. BED) file into a Polars LazyFrame. Tries to be compatible with Bioframe's read_table but faster and lazy. Schema should follow the Bioframe's schema format.

Parameters:

Name Type Description Default
path str

The path to the file.

required
schema Dict

Schema should follow the Bioframe's schema format.

None
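
Example

A minimal sketch, assuming a local tab-delimited file at /tmp/example.bed (illustrative path) and that "bed4" is one of the available Bioframe schema names:

import polars_bio as pb
# Lazily scan a tab-delimited file and rename its columns according to the Bioframe schema.
pb.scan_table("/tmp/example.bed", schema="bed4").limit(5).collect()
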
Source code in polars_bio/io.py
@staticmethod
def scan_table(path: str, schema: Dict = None, **kwargs) -> pl.LazyFrame:
    """
     Lazily read a tab-delimited (i.e. BED) file into a Polars LazyFrame.
     Tries to be compatible with Bioframe's [read_table](https://bioframe.readthedocs.io/en/latest/guide-io.html)
     but faster and lazy. Schema should follow the Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).

    Parameters:
        path: The path to the file.
        schema: Schema should follow the Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).
    """
    df = pl.scan_csv(path, separator="\t", has_header=False, **kwargs)
    if schema is not None:
        columns = SCHEMAS[schema]
        if len(columns) != len(df.collect_schema()):
            raise ValueError(
                f"Schema incompatible with the input. Expected {len(columns)} columns in a schema, got {len(df.collect_schema())} in the input data file. Please provide a valid schema."
            )
        for i, c in enumerate(columns):
            df = df.rename({f"column_{i+1}": c})
    return df

scan_vcf(path, info_fields=None, thread_num=1, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=False) staticmethod

Lazily read a VCF file into a LazyFrame.

Parameters:

Name Type Description Default
path str

The path to the VCF file.

required
info_fields Union[list[str], None]

List of INFO field names to include. If None, all INFO fields will be detected automatically from the VCF header. Use this to limit fields for better performance.

None
thread_num int

The number of threads to use for reading the VCF file. Used only for parallel decompression of BGZF blocks. Works only for local files.

1
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
compression_type str

The compression type of the VCF file. If not specified, it will be detected automatically.

'auto'
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

False

Note

VCF reader uses 1-based coordinate system for the start and end columns.
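
Example

A minimal sketch, assuming the gnomAD SV sites VCF from the register_vcf example further below is available locally:

import polars_bio as pb
# Lazily scan a VCF with projection pushdown; only the requested columns are processed.
lf = pb.scan_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", projection_pushdown=True)
lf.select(["start", "end"]).limit(5).collect()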

Source code in polars_bio/io.py
@staticmethod
def scan_vcf(
    path: str,
    info_fields: Union[list[str], None] = None,
    thread_num: int = 1,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = False,
) -> pl.LazyFrame:
    """
    Lazily read a VCF file into a LazyFrame.

    Parameters:
        path: The path to the VCF file.
        info_fields: List of INFO field names to include. If *None*, all INFO fields will be detected automatically from the VCF header. Use this to limit fields for better performance.
        thread_num: The number of threads to use for reading the VCF file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

    !!! note
        VCF reader uses **1-based** coordinate system for the `start` and `end` columns.
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    # Use provided info_fields or autodetect from VCF header
    if info_fields is not None:
        initial_info_fields = info_fields
    else:
        # Get all info fields from VCF header for proper projection pushdown
        all_info_fields = None
        try:
            vcf_schema_df = IOOperations.describe_vcf(
                path,
                allow_anonymous=allow_anonymous,
                enable_request_payer=enable_request_payer,
                compression_type=compression_type,
            )
            # Use column name 'name' not 'id' based on the schema output
            all_info_fields = vcf_schema_df.select("name").to_series().to_list()
        except Exception:
            # Fallback to None if unable to get info fields
            all_info_fields = None

        # Always start with all info fields to establish full schema
        # The callback will re-register with only requested info fields for optimization
        initial_info_fields = all_info_fields

    vcf_read_options = VcfReadOptions(
        info_fields=initial_info_fields,
        thread_num=thread_num,
        object_storage_options=object_storage_options,
    )
    read_options = ReadOptions(vcf_read_options=vcf_read_options)
    return _read_file(path, InputFormat.Vcf, read_options, projection_pushdown)

data_processing

Source code in polars_bio/sql.py
class SQL:
    @staticmethod
    def register_vcf(
        path: str,
        name: Union[str, None] = None,
        info_fields: Union[list[str], None] = None,
        thread_num: Union[int, None] = None,
        chunk_size: int = 64,
        concurrent_fetches: int = 8,
        allow_anonymous: bool = True,
        max_retries: int = 5,
        timeout: int = 300,
        enable_request_payer: bool = False,
        compression_type: str = "auto",
    ) -> None:
        """
        Register a VCF file as a Datafusion table.

        Parameters:
            path: The path to the VCF file.
            name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
            info_fields: List of INFO field names to register. If *None*, all INFO fields will be detected automatically from the VCF header. Use this to limit registration to specific fields for better performance.
            thread_num: The number of threads to use for reading the VCF file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
        !!! note
            VCF reader uses **1-based** coordinate system for the `start` and `end` columns.

        !!! Example
              ```python
              import polars_bio as pb
              pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz")
              ```
             ```shell
             INFO:polars_bio:Table: gnomad_v4_1_sv_sites_gz registered for path: /tmp/gnomad.v4.1.sv.sites.vcf.gz
             ```
        !!! tip
            `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the VCF file. As a rule of thumb for large scale operations (reading a whole VCF), it is recommended to use the default values.
        """

        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        # Use provided info_fields or autodetect from VCF header
        if info_fields is not None:
            all_info_fields = info_fields
        else:
            # Get all info fields from VCF header for automatic field detection
            all_info_fields = None
            try:
                from .io import IOOperations

                vcf_schema_df = IOOperations.describe_vcf(
                    path,
                    allow_anonymous=allow_anonymous,
                    enable_request_payer=enable_request_payer,
                    compression_type=compression_type,
                )
                all_info_fields = vcf_schema_df.select("name").to_series().to_list()
            except Exception:
                # Fallback to empty list if unable to get info fields
                all_info_fields = []

        vcf_read_options = VcfReadOptions(
            info_fields=all_info_fields,
            thread_num=thread_num,
            object_storage_options=object_storage_options,
        )
        read_options = ReadOptions(vcf_read_options=vcf_read_options)
        py_register_table(ctx, path, name, InputFormat.Vcf, read_options)

    @staticmethod
    def register_gff(
        path: str,
        name: Union[str, None] = None,
        thread_num: int = 1,
        chunk_size: int = 64,
        concurrent_fetches: int = 8,
        allow_anonymous: bool = True,
        max_retries: int = 5,
        timeout: int = 300,
        enable_request_payer: bool = False,
        compression_type: str = "auto",
        parallel: bool = False,
    ) -> None:
        """
        Register a GFF file as a Datafusion table.

        Parameters:
            path: The path to the GFF file.
            name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
            thread_num: The number of threads to use for reading the GFF file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            compression_type: The compression type of the GFF file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz' and 'gz').
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            parallel: Whether to use the parallel reader for BGZF-compressed local files. Default is False.
        !!! note
            GFF reader uses **1-based** coordinate system for the `start` and `end` columns.

        !!! Example
            ```shell
            wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gff3.gz -O /tmp/gencode.v38.annotation.gff3.gz
            ```
            ```python
            import polars_bio as pb
            pb.register_gff("/tmp/gencode.v38.annotation.gff3.gz", "gencode_v38_annotation3_bgz")
            pb.sql("SELECT attributes, count(*) AS cnt FROM gencode_v38_annotation3_bgz GROUP BY attributes").limit(5).collect()
            ```
            ```shell

            shape: (5, 2)
            ┌───────────────────┬───────┐
            │ Parent            ┆ cnt   │
            │ ---               ┆ ---   │
            │ str               ┆ i64   │
            ╞═══════════════════╪═══════╡
            │ null              ┆ 60649 │
            │ ENSG00000223972.5 ┆ 2     │
            │ ENST00000456328.2 ┆ 3     │
            │ ENST00000450305.2 ┆ 6     │
            │ ENSG00000227232.5 ┆ 1     │
            └───────────────────┴───────┘

            ```
        !!! tip
            `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the GFF file. As a rule of thumb for large scale operations (reading a whole GFF), it is recommended to use the default values.
        """

        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        gff_read_options = GffReadOptions(
            attr_fields=None,
            thread_num=thread_num,
            object_storage_options=object_storage_options,
            parallel=parallel,
        )
        read_options = ReadOptions(gff_read_options=gff_read_options)
        py_register_table(ctx, path, name, InputFormat.Gff, read_options)

    @staticmethod
    def register_fastq(
        path: str,
        name: Union[str, None] = None,
        chunk_size: int = 64,
        concurrent_fetches: int = 8,
        allow_anonymous: bool = True,
        max_retries: int = 5,
        timeout: int = 300,
        enable_request_payer: bool = False,
        compression_type: str = "auto",
        parallel: bool = False,
    ) -> None:
        """
        Register a FASTQ file as a Datafusion table.

        Parameters:
            path: The path to the FASTQ file.
            name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
            chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            compression_type: The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compression is supported ('bgz' and 'gz').
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            parallel: Whether to use the parallel reader for BGZF compressed files. Default is False. If a file ends with ".gz" but is actually BGZF, it will attempt the parallel path and fall back to standard if not BGZF.

        !!! Example
            ```python
              import polars_bio as pb
              pb.register_fastq("gs://genomics-public-data/platinum-genomes/fastq/ERR194146.fastq.gz", "test_fastq")
              pb.sql("SELECT name, description FROM test_fastq WHERE name LIKE 'ERR194146%'").limit(5).collect()
            ```

            ```shell

              shape: (5, 2)
            ┌─────────────────────┬─────────────────────────────────┐
            │ name                ┆ description                     │
            │ ---                 ┆ ---                             │
            │ str                 ┆ str                             │
            ╞═════════════════════╪═════════════════════════════════╡
            │ ERR194146.812444541 ┆ HSQ1008:141:D0CC8ACXX:2:1204:1… │
            │ ERR194146.812444542 ┆ HSQ1008:141:D0CC8ACXX:4:1206:1… │
            │ ERR194146.812444543 ┆ HSQ1008:141:D0CC8ACXX:3:2104:5… │
            │ ERR194146.812444544 ┆ HSQ1008:141:D0CC8ACXX:3:2204:1… │
            │ ERR194146.812444545 ┆ HSQ1008:141:D0CC8ACXX:3:1304:3… │
            └─────────────────────┴─────────────────────────────────┘

            ```


        !!! tip
            `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the FASTQ file. As a rule of thumb for large scale operations (reading a whole FASTQ), it is recommended to keep the default values.
        """

        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        fastq_read_options = FastqReadOptions(
            object_storage_options=object_storage_options, parallel=parallel
        )
        read_options = ReadOptions(fastq_read_options=fastq_read_options)
        py_register_table(ctx, path, name, InputFormat.Fastq, read_options)

    @staticmethod
    def register_bed(
        path: str,
        name: Union[str, None] = None,
        thread_num: int = 1,
        chunk_size: int = 64,
        concurrent_fetches: int = 8,
        allow_anonymous: bool = True,
        max_retries: int = 5,
        timeout: int = 300,
        enable_request_payer: bool = False,
        compression_type: str = "auto",
    ) -> None:
        """
        Register a BED file as a Datafusion table.

        Parameters:
            path: The path to the BED file.
            name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
            thread_num: The number of threads to use for reading the BED file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            compression_type: The compression type of the BED file. If not specified, it will be detected automatically.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.

        !!! Note
            Only **BED4** format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name.
            Also unlike other text formats, **GZIP** compression is not supported.

        !!! Example
            ```shell

             cd /tmp
             wget https://webs.iiitd.edu.in/raghava/humcfs/fragile_site_bed.zip -O fragile_site_bed.zip
             unzip fragile_site_bed.zip -x "__MACOSX/*" "*/.DS_Store"
            ```

            ```python
            import polars_bio as pb
            pb.register_bed("/tmp/fragile_site_bed/chr5_fragile_site.bed", "test_bed")
            pb.sql("SELECT * FROM test_bed WHERE name LIKE 'FRA5%'").collect()
            ```

            ```shell

                shape: (8, 4)
                ┌───────┬───────────┬───────────┬───────┐
                │ chrom ┆ start     ┆ end       ┆ name  │
                │ ---   ┆ ---       ┆ ---       ┆ ---   │
                │ str   ┆ u32       ┆ u32       ┆ str   │
                ╞═══════╪═══════════╪═══════════╪═══════╡
                │ chr5  ┆ 28900001  ┆ 42500000  ┆ FRA5A │
                │ chr5  ┆ 92300001  ┆ 98200000  ┆ FRA5B │
                │ chr5  ┆ 130600001 ┆ 136200000 ┆ FRA5C │
                │ chr5  ┆ 92300001  ┆ 93916228  ┆ FRA5D │
                │ chr5  ┆ 18400001  ┆ 28900000  ┆ FRA5E │
                │ chr5  ┆ 98200001  ┆ 109600000 ┆ FRA5F │
                │ chr5  ┆ 168500001 ┆ 180915260 ┆ FRA5G │
                │ chr5  ┆ 50500001  ┆ 63000000  ┆ FRA5H │
                └───────┴───────────┴───────────┴───────┘
            ```


        !!! tip
            `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the BED file. As a rule of thumb for large scale operations (reading a whole BED), it is recommended to keep the default values.
        """

        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        bed_read_options = BedReadOptions(
            thread_num=thread_num,
            object_storage_options=object_storage_options,
        )
        read_options = ReadOptions(bed_read_options=bed_read_options)
        py_register_table(ctx, path, name, InputFormat.Bed, read_options)

    @staticmethod
    def register_view(name: str, query: str) -> None:
        """
        Register a query as a Datafusion view. This view can be used in genomic ranges operations,
        such as overlap, nearest, and count_overlaps. It is useful for filtering, transforming, and aggregating data
        prior to the range operation. When combined with a range operation, it can be used to perform complex operations in a streaming fashion end-to-end.

        Parameters:
            name: The name of the table.
            query: The SQL query.

        !!! Example
              ```python
              import polars_bio as pb
              pb.register_vcf("gs://gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz", "gnomad_sv")
              pb.register_view("v_gnomad_sv", "SELECT replace(chrom,'chr', '') AS chrom, start, end FROM gnomad_sv")
              pb.sql("SELECT * FROM v_gnomad_sv").limit(5).collect()
              ```
              ```shell
                shape: (5, 3)
                ┌───────┬─────────┬─────────┐
                │ chrom ┆ start   ┆ end     │
                │ ---   ┆ ---     ┆ ---     │
                │ str   ┆ u32     ┆ u32     │
                ╞═══════╪═════════╪═════════╡
                │ 21    ┆ 5031905 ┆ 5031905 │
                │ 21    ┆ 5031905 ┆ 5031905 │
                │ 21    ┆ 5031909 ┆ 5031909 │
                │ 21    ┆ 5031911 ┆ 5031911 │
                │ 21    ┆ 5031911 ┆ 5031911 │
                └───────┴─────────┴─────────┘
              ```
        """
        py_register_view(ctx, name, query)

    @staticmethod
    def register_bam(
        path: str,
        name: Union[str, None] = None,
        thread_num: int = 1,
        chunk_size: int = 64,
        concurrent_fetches: int = 8,
        allow_anonymous: bool = True,
        max_retries: int = 5,
        timeout: int = 300,
        enable_request_payer: bool = False,
    ) -> None:
        """
        Register a BAM file as a Datafusion table.

        Parameters:
            path: The path to the BAM file.
            name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
            thread_num: The number of threads to use for reading the BAM file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
        !!! note
            BAM reader uses **1-based** coordinate system for the `start`, `end`, `mate_start`, `mate_end` columns.

        !!! Example

            ```python
            import polars_bio as pb
            pb.register_bam("gs://genomics-public-data/1000-genomes/bam/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam", "HG00096_bam", concurrent_fetches=1, chunk_size=8)
            pb.sql("SELECT chrom, flags FROM HG00096_bam").limit(5).collect()
            ```
            ```shell

                shape: (5, 2)
                ┌───────┬───────┐
                │ chrom ┆ flags │
                │ ---   ┆ ---   │
                │ str   ┆ u32   │
                ╞═══════╪═══════╡
                │ chr1  ┆ 163   │
                │ chr1  ┆ 163   │
                │ chr1  ┆ 99    │
                │ chr1  ┆ 99    │
                │ chr1  ┆ 99    │
                └───────┴───────┘
            ```
        !!! tip
            `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the BAM file. As a rule of thumb for large scale operations (reading a whole BAM), it is recommended to keep the default values.
            For more interactive tasks, such as inspecting a schema, it is recommended to decrease `chunk_size` to **8-16** and `concurrent_fetches` to **1-2**.
        """

        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type="auto",
        )

        bam_read_options = BamReadOptions(
            thread_num=thread_num,
            object_storage_options=object_storage_options,
        )
        read_options = ReadOptions(bam_read_options=bam_read_options)
        py_register_table(ctx, path, name, InputFormat.Bam, read_options)

    @staticmethod
    def sql(query: str) -> pl.LazyFrame:
        """
        Execute a SQL query on the registered tables.

        Parameters:
            query: The SQL query.

        !!! Example
              ```python
              import polars_bio as pb
              pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_v4_1_sv")
              pb.sql("SELECT * FROM gnomad_v4_1_sv LIMIT 5").collect()
              ```
        """
        df = py_read_sql(ctx, query)
        return _lazy_scan(df)

register_bam(path, name=None, thread_num=1, chunk_size=64, concurrent_fetches=8, allow_anonymous=True, max_retries=5, timeout=300, enable_request_payer=False) staticmethod

Register a BAM file as a Datafusion table.

Parameters:

path (str, required): The path to the BAM file.
name (Union[str, None], default: None): The name of the table. If None, the name of the table will be generated automatically based on the path.
thread_num (int, default: 1): The number of threads to use for reading the BAM file. Used only for parallel decompression of BGZF blocks. Works only for local files.
chunk_size (int, default: 64): The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 8-16.
concurrent_fetches (int, default: 8): [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 1-2.
allow_anonymous (bool, default: True): [GCS, AWS S3] Whether to allow anonymous access to object storage.
enable_request_payer (bool, default: False): [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
max_retries (int, default: 5): The maximum number of retries for reading the file from object storage.
timeout (int, default: 300): The timeout in seconds for reading the file from object storage.

Note

BAM reader uses 1-based coordinate system for the start, end, mate_start, mate_end columns.

Example

import polars_bio as pb
pb.register_bam("gs://genomics-public-data/1000-genomes/bam/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam", "HG00096_bam", concurrent_fetches=1, chunk_size=8)
pb.sql("SELECT chrom, flags FROM HG00096_bam").limit(5).collect()
    shape: (5, 2)
    ┌───────┬───────┐
    │ chrom ┆ flags │
    │ ---   ┆ ---   │
    │ str   ┆ u32   │
    ╞═══════╪═══════╡
    │ chr1  ┆ 163   │
    │ chr1  ┆ 163   │
    │ chr1  ┆ 99    │
    │ chr1  ┆ 99    │
    │ chr1  ┆ 99    │
    └───────┴───────┘

Tip

chunk_size and concurrent_fetches can be adjusted according to the network bandwidth and the size of the BAM file. As a rule of thumb for large scale operations (reading a whole BAM), it is recommended to keep the default values. For more interactive tasks, such as inspecting a schema, it is recommended to decrease chunk_size to 8-16 and concurrent_fetches to 1-2.
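
For local BAM files, `thread_num` controls parallel BGZF decompression. A minimal sketch, assuming a hypothetical local path:

```python
import polars_bio as pb

# Hypothetical local path; thread_num only affects BGZF decompression of local files.
pb.register_bam("/data/HG00096.bam", "HG00096_local", thread_num=4)
pb.sql("SELECT chrom, start, end FROM HG00096_local").limit(5).collect()
```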

Source code in polars_bio/sql.py
@staticmethod
def register_bam(
    path: str,
    name: Union[str, None] = None,
    thread_num: int = 1,
    chunk_size: int = 64,
    concurrent_fetches: int = 8,
    allow_anonymous: bool = True,
    max_retries: int = 5,
    timeout: int = 300,
    enable_request_payer: bool = False,
) -> None:
    """
    Register a BAM file as a Datafusion table.

    Parameters:
        path: The path to the BAM file.
        name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
        thread_num: The number of threads to use for reading the BAM file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
    !!! note
        BAM reader uses **1-based** coordinate system for the `start`, `end`, `mate_start`, `mate_end` columns.

    !!! Example

        ```python
        import polars_bio as pb
        pb.register_bam("gs://genomics-public-data/1000-genomes/bam/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam", "HG00096_bam", concurrent_fetches=1, chunk_size=8)
        pb.sql("SELECT chrom, flags FROM HG00096_bam").limit(5).collect()
        ```
        ```shell

            shape: (5, 2)
            ┌───────┬───────┐
            │ chrom ┆ flags │
            │ ---   ┆ ---   │
            │ str   ┆ u32   │
            ╞═══════╪═══════╡
            │ chr1  ┆ 163   │
            │ chr1  ┆ 163   │
            │ chr1  ┆ 99    │
            │ chr1  ┆ 99    │
            │ chr1  ┆ 99    │
            └───────┴───────┘
        ```
    !!! tip
        `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the BAM file. As a rule of thumb for large scale operations (reading a whole BAM), it is recommended to keep the default values.
        For more interactive tasks, such as inspecting a schema, it is recommended to decrease `chunk_size` to **8-16** and `concurrent_fetches` to **1-2**.
    """

    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type="auto",
    )

    bam_read_options = BamReadOptions(
        thread_num=thread_num,
        object_storage_options=object_storage_options,
    )
    read_options = ReadOptions(bam_read_options=bam_read_options)
    py_register_table(ctx, path, name, InputFormat.Bam, read_options)

register_bed(path, name=None, thread_num=1, chunk_size=64, concurrent_fetches=8, allow_anonymous=True, max_retries=5, timeout=300, enable_request_payer=False, compression_type='auto') staticmethod

Register a BED file as a Datafusion table.

Parameters:

path (str, required): The path to the BED file.
name (Union[str, None], default: None): The name of the table. If None, the name of the table will be generated automatically based on the path.
thread_num (int, default: 1): The number of threads to use for reading the BED file. Used only for parallel decompression of BGZF blocks. Works only for local files.
chunk_size (int, default: 64): The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 8-16.
concurrent_fetches (int, default: 8): [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 1-2.
allow_anonymous (bool, default: True): [GCS, AWS S3] Whether to allow anonymous access to object storage.
enable_request_payer (bool, default: False): [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
compression_type (str, default: 'auto'): The compression type of the BED file. If not specified, it will be detected automatically.
max_retries (int, default: 5): The maximum number of retries for reading the file from object storage.
timeout (int, default: 300): The timeout in seconds for reading the file from object storage.

Note

Only BED4 format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name. Also unlike other text formats, GZIP compression is not supported.

Example

 cd /tmp
 wget https://webs.iiitd.edu.in/raghava/humcfs/fragile_site_bed.zip -O fragile_site_bed.zip
 unzip fragile_site_bed.zip -x "__MACOSX/*" "*/.DS_Store"
import polars_bio as pb
pb.register_bed("/tmp/fragile_site_bed/chr5_fragile_site.bed", "test_bed")
pb.sql("SELECT * FROM test_bed WHERE name LIKE 'FRA5%'").collect()
    shape: (8, 4)
    ┌───────┬───────────┬───────────┬───────┐
    │ chrom ┆ start     ┆ end       ┆ name  │
    │ ---   ┆ ---       ┆ ---       ┆ ---   │
    │ str   ┆ u32       ┆ u32       ┆ str   │
    ╞═══════╪═══════════╪═══════════╪═══════╡
    │ chr5  ┆ 28900001  ┆ 42500000  ┆ FRA5A │
    │ chr5  ┆ 92300001  ┆ 98200000  ┆ FRA5B │
    │ chr5  ┆ 130600001 ┆ 136200000 ┆ FRA5C │
    │ chr5  ┆ 92300001  ┆ 93916228  ┆ FRA5D │
    │ chr5  ┆ 18400001  ┆ 28900000  ┆ FRA5E │
    │ chr5  ┆ 98200001  ┆ 109600000 ┆ FRA5F │
    │ chr5  ┆ 168500001 ┆ 180915260 ┆ FRA5G │
    │ chr5  ┆ 50500001  ┆ 63000000  ┆ FRA5H │
    └───────┴───────────┴───────────┴───────┘

Tip

chunk_size and concurrent_fetches can be adjusted according to the network bandwidth and the size of the BED file. As a rule of thumb for large scale operations (reading a whole BED), it is recommended to keep the default values.
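
For a quick, interactive look at a BED file in an object store, smaller fetch settings usually respond faster. A minimal sketch, assuming a hypothetical S3 path:

```python
import polars_bio as pb

# Hypothetical bucket and key; smaller chunk_size and concurrent_fetches favor interactive use.
pb.register_bed(
    "s3://my-bucket/annotations/regions.bed",
    "regions_bed",
    chunk_size=8,
    concurrent_fetches=1,
    allow_anonymous=True,
)
pb.sql("SELECT * FROM regions_bed").limit(5).collect()
```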

Source code in polars_bio/sql.py
@staticmethod
def register_bed(
    path: str,
    name: Union[str, None] = None,
    thread_num: int = 1,
    chunk_size: int = 64,
    concurrent_fetches: int = 8,
    allow_anonymous: bool = True,
    max_retries: int = 5,
    timeout: int = 300,
    enable_request_payer: bool = False,
    compression_type: str = "auto",
) -> None:
    """
    Register a BED file as a Datafusion table.

    Parameters:
        path: The path to the BED file.
        name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
        thread_num: The number of threads to use for reading the BED file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        compression_type: The compression type of the BED file. If not specified, it will be detected automatically.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.

    !!! Note
        Only **BED4** format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name.
        Also unlike other text formats, **GZIP** compression is not supported.

    !!! Example
        ```shell

         cd /tmp
         wget https://webs.iiitd.edu.in/raghava/humcfs/fragile_site_bed.zip -O fragile_site_bed.zip
         unzip fragile_site_bed.zip -x "__MACOSX/*" "*/.DS_Store"
        ```

        ```python
        import polars_bio as pb
        pb.register_bed("/tmp/fragile_site_bed/chr5_fragile_site.bed", "test_bed")
        pb.sql("SELECT * FROM test_bed WHERE name LIKE 'FRA5%'").collect()
        ```

        ```shell

            shape: (8, 4)
            ┌───────┬───────────┬───────────┬───────┐
            │ chrom ┆ start     ┆ end       ┆ name  │
            │ ---   ┆ ---       ┆ ---       ┆ ---   │
            │ str   ┆ u32       ┆ u32       ┆ str   │
            ╞═══════╪═══════════╪═══════════╪═══════╡
            │ chr5  ┆ 28900001  ┆ 42500000  ┆ FRA5A │
            │ chr5  ┆ 92300001  ┆ 98200000  ┆ FRA5B │
            │ chr5  ┆ 130600001 ┆ 136200000 ┆ FRA5C │
            │ chr5  ┆ 92300001  ┆ 93916228  ┆ FRA5D │
            │ chr5  ┆ 18400001  ┆ 28900000  ┆ FRA5E │
            │ chr5  ┆ 98200001  ┆ 109600000 ┆ FRA5F │
            │ chr5  ┆ 168500001 ┆ 180915260 ┆ FRA5G │
            │ chr5  ┆ 50500001  ┆ 63000000  ┆ FRA5H │
            └───────┴───────────┴───────────┴───────┘
        ```


    !!! tip
        `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the BED file. As a rule of thumb for large scale operations (reading a whole BED), it is recommended to keep the default values.
    """

    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    bed_read_options = BedReadOptions(
        thread_num=thread_num,
        object_storage_options=object_storage_options,
    )
    read_options = ReadOptions(bed_read_options=bed_read_options)
    py_register_table(ctx, path, name, InputFormat.Bed, read_options)

register_fastq(path, name=None, chunk_size=64, concurrent_fetches=8, allow_anonymous=True, max_retries=5, timeout=300, enable_request_payer=False, compression_type='auto', parallel=False) staticmethod

Register a FASTQ file as a Datafusion table.

Parameters:

path (str, required): The path to the FASTQ file.
name (Union[str, None], default: None): The name of the table. If None, the name of the table will be generated automatically based on the path.
chunk_size (int, default: 64): The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 8-16.
concurrent_fetches (int, default: 8): [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 1-2.
allow_anonymous (bool, default: True): [GCS, AWS S3] Whether to allow anonymous access to object storage.
enable_request_payer (bool, default: False): [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
compression_type (str, default: 'auto'): The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compression is supported ('bgz' and 'gz').
max_retries (int, default: 5): The maximum number of retries for reading the file from object storage.
timeout (int, default: 300): The timeout in seconds for reading the file from object storage.
parallel (bool, default: False): Whether to use the parallel reader for BGZF compressed files. If a file ends with ".gz" but is actually BGZF, it will attempt the parallel path and fall back to standard if not BGZF.

Example

  import polars_bio as pb
  pb.register_fastq("gs://genomics-public-data/platinum-genomes/fastq/ERR194146.fastq.gz", "test_fastq")
  pb.sql("SELECT name, description FROM test_fastq WHERE name LIKE 'ERR194146%'").limit(5).collect()
  shape: (5, 2)
┌─────────────────────┬─────────────────────────────────┐
│ name                ┆ description                     │
│ ---                 ┆ ---                             │
│ str                 ┆ str                             │
╞═════════════════════╪═════════════════════════════════╡
│ ERR194146.812444541 ┆ HSQ1008:141:D0CC8ACXX:2:1204:1… │
│ ERR194146.812444542 ┆ HSQ1008:141:D0CC8ACXX:4:1206:1… │
│ ERR194146.812444543 ┆ HSQ1008:141:D0CC8ACXX:3:2104:5… │
│ ERR194146.812444544 ┆ HSQ1008:141:D0CC8ACXX:3:2204:1… │
│ ERR194146.812444545 ┆ HSQ1008:141:D0CC8ACXX:3:1304:3… │
└─────────────────────┴─────────────────────────────────┘

Tip

chunk_size and concurrent_fetches can be adjusted according to the network bandwidth and the size of the FASTQ file. As a rule of thumb for large scale operations (reading a whole FASTQ), it is recommended to keep the default values.
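
If a FASTQ file is BGZF-compressed, `parallel=True` enables the parallel reader; a file that merely ends with ".gz" falls back to the standard path when it is not actually BGZF. A minimal sketch, assuming a hypothetical local file:

```python
import polars_bio as pb

# Hypothetical local BGZF-compressed FASTQ; the reader falls back to the standard path if it is plain GZIP.
pb.register_fastq("/data/sample.fastq.gz", "sample_fastq", parallel=True)
pb.sql("SELECT name, description FROM sample_fastq").limit(5).collect()
```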

Source code in polars_bio/sql.py
@staticmethod
def register_fastq(
    path: str,
    name: Union[str, None] = None,
    chunk_size: int = 64,
    concurrent_fetches: int = 8,
    allow_anonymous: bool = True,
    max_retries: int = 5,
    timeout: int = 300,
    enable_request_payer: bool = False,
    compression_type: str = "auto",
    parallel: bool = False,
) -> None:
    """
    Register a FASTQ file as a Datafusion table.

    Parameters:
        path: The path to the FASTQ file.
        name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
        chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        compression_type: The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compression is supported ('bgz' and 'gz').
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        parallel: Whether to use the parallel reader for BGZF compressed files. Default is False. If a file ends with ".gz" but is actually BGZF, it will attempt the parallel path and fall back to standard if not BGZF.

    !!! Example
        ```python
          import polars_bio as pb
          pb.register_fastq("gs://genomics-public-data/platinum-genomes/fastq/ERR194146.fastq.gz", "test_fastq")
          pb.sql("SELECT name, description FROM test_fastq WHERE name LIKE 'ERR194146%'").limit(5).collect()
        ```

        ```shell

          shape: (5, 2)
        ┌─────────────────────┬─────────────────────────────────┐
        │ name                ┆ description                     │
        │ ---                 ┆ ---                             │
        │ str                 ┆ str                             │
        ╞═════════════════════╪═════════════════════════════════╡
        │ ERR194146.812444541 ┆ HSQ1008:141:D0CC8ACXX:2:1204:1… │
        │ ERR194146.812444542 ┆ HSQ1008:141:D0CC8ACXX:4:1206:1… │
        │ ERR194146.812444543 ┆ HSQ1008:141:D0CC8ACXX:3:2104:5… │
        │ ERR194146.812444544 ┆ HSQ1008:141:D0CC8ACXX:3:2204:1… │
        │ ERR194146.812444545 ┆ HSQ1008:141:D0CC8ACXX:3:1304:3… │
        └─────────────────────┴─────────────────────────────────┘

        ```


    !!! tip
        `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the FASTQ file. As a rule of thumb for large scale operations (reading a whole FASTQ), it is recommended to keep the default values.
    """

    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    fastq_read_options = FastqReadOptions(
        object_storage_options=object_storage_options, parallel=parallel
    )
    read_options = ReadOptions(fastq_read_options=fastq_read_options)
    py_register_table(ctx, path, name, InputFormat.Fastq, read_options)

register_gff(path, name=None, thread_num=1, chunk_size=64, concurrent_fetches=8, allow_anonymous=True, max_retries=5, timeout=300, enable_request_payer=False, compression_type='auto', parallel=False) staticmethod

Register a GFF file as a Datafusion table.

Parameters:

path (str, required): The path to the GFF file.
name (Union[str, None], default: None): The name of the table. If None, the name of the table will be generated automatically based on the path.
thread_num (int, default: 1): The number of threads to use for reading the GFF file. Used only for parallel decompression of BGZF blocks. Works only for local files.
chunk_size (int, default: 64): The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 8-16.
concurrent_fetches (int, default: 8): [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 1-2.
allow_anonymous (bool, default: True): [GCS, AWS S3] Whether to allow anonymous access to object storage.
enable_request_payer (bool, default: False): [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
compression_type (str, default: 'auto'): The compression type of the GFF file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compression is supported ('bgz' and 'gz').
max_retries (int, default: 5): The maximum number of retries for reading the file from object storage.
timeout (int, default: 300): The timeout in seconds for reading the file from object storage.
parallel (bool, default: False): Whether to use the parallel reader for BGZF-compressed local files.

Note

GFF reader uses 1-based coordinate system for the start and end columns.

Example

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gff3.gz -O /tmp/gencode.v38.annotation.gff3.gz
import polars_bio as pb
pb.register_gff("/tmp/gencode.v38.annotation.gff3.gz", "gencode_v38_annotation3_bgz")
pb.sql("SELECT attributes, count(*) AS cnt FROM gencode_v38_annotation3_bgz GROUP BY attributes").limit(5).collect()
shape: (5, 2)
┌───────────────────┬───────┐
│ Parent            ┆ cnt   │
│ ---               ┆ ---   │
│ str               ┆ i64   │
╞═══════════════════╪═══════╡
│ null              ┆ 60649 │
│ ENSG00000223972.5 ┆ 2     │
│ ENST00000456328.2 ┆ 3     │
│ ENST00000450305.2 ┆ 6     │
│ ENSG00000227232.5 ┆ 1     │
└───────────────────┴───────┘

Tip

chunk_size and concurrent_fetches can be adjusted according to the network bandwidth and the size of the GFF file. As a rule of thumb for large scale operations (reading a whole GFF), it is recommended to keep the default values.
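
For interactive exploration of a remote GFF, the same tuning applies. A minimal sketch, assuming a hypothetical GCS path:

```python
import polars_bio as pb

# Hypothetical object-store path; reduced fetch settings suit a quick look at the data.
pb.register_gff(
    "gs://my-bucket/annotations/gencode.v38.annotation.gff3.bgz",
    "gencode_gff",
    chunk_size=8,
    concurrent_fetches=1,
)
pb.sql("SELECT start, end, attributes FROM gencode_gff").limit(5).collect()
```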

Source code in polars_bio/sql.py
@staticmethod
def register_gff(
    path: str,
    name: Union[str, None] = None,
    thread_num: int = 1,
    chunk_size: int = 64,
    concurrent_fetches: int = 8,
    allow_anonymous: bool = True,
    max_retries: int = 5,
    timeout: int = 300,
    enable_request_payer: bool = False,
    compression_type: str = "auto",
    parallel: bool = False,
) -> None:
    """
    Register a GFF file as a Datafusion table.

    Parameters:
        path: The path to the GFF file.
        name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
        thread_num: The number of threads to use for reading the GFF file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        compression_type: The compression type of the GFF file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compression is supported ('bgz' and 'gz').
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        parallel: Whether to use the parallel reader for BGZF-compressed local files. Default is False.
    !!! note
        GFF reader uses **1-based** coordinate system for the `start` and `end` columns.

    !!! Example
        ```shell
        wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gff3.gz -O /tmp/gencode.v38.annotation.gff3.gz
        ```
        ```python
        import polars_bio as pb
        pb.register_gff("/tmp/gencode.v38.annotation.gff3.gz", "gencode_v38_annotation3_bgz")
        pb.sql("SELECT attributes, count(*) AS cnt FROM gencode_v38_annotation3_bgz GROUP BY attributes").limit(5).collect()
        ```
        ```shell

        shape: (5, 2)
        ┌───────────────────┬───────┐
        │ Parent            ┆ cnt   │
        │ ---               ┆ ---   │
        │ str               ┆ i64   │
        ╞═══════════════════╪═══════╡
        │ null              ┆ 60649 │
        │ ENSG00000223972.5 ┆ 2     │
        │ ENST00000456328.2 ┆ 3     │
        │ ENST00000450305.2 ┆ 6     │
        │ ENSG00000227232.5 ┆ 1     │
        └───────────────────┴───────┘

        ```
    !!! tip
        `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the GFF file. As a rule of thumb for large scale operations (reading a whole GFF), it is recommended to keep the default values.
    """

    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    gff_read_options = GffReadOptions(
        attr_fields=None,
        thread_num=thread_num,
        object_storage_options=object_storage_options,
        parallel=parallel,
    )
    read_options = ReadOptions(gff_read_options=gff_read_options)
    py_register_table(ctx, path, name, InputFormat.Gff, read_options)

register_vcf(path, name=None, info_fields=None, thread_num=None, chunk_size=64, concurrent_fetches=8, allow_anonymous=True, max_retries=5, timeout=300, enable_request_payer=False, compression_type='auto') staticmethod

Register a VCF file as a Datafusion table.

Parameters:

path (str, required): The path to the VCF file.
name (Union[str, None], default: None): The name of the table. If None, the name of the table will be generated automatically based on the path.
info_fields (Union[list[str], None], default: None): List of INFO field names to register. If None, all INFO fields will be detected automatically from the VCF header. Use this to limit registration to specific fields for better performance.
thread_num (Union[int, None], default: None): The number of threads to use for reading the VCF file. Used only for parallel decompression of BGZF blocks. Works only for local files.
chunk_size (int, default: 64): The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 8-16.
concurrent_fetches (int, default: 8): [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 1-2.
allow_anonymous (bool, default: True): [GCS, AWS S3] Whether to allow anonymous access to object storage.
enable_request_payer (bool, default: False): [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
compression_type (str, default: 'auto'): The compression type of the VCF file. If not specified, it will be detected automatically.
max_retries (int, default: 5): The maximum number of retries for reading the file from object storage.
timeout (int, default: 300): The timeout in seconds for reading the file from object storage.

Note

VCF reader uses 1-based coordinate system for the start and end columns.

Example

import polars_bio as pb
pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz")
INFO:polars_bio:Table: gnomad_v4_1_sv_sites_gz registered for path: /tmp/gnomad.v4.1.sv.sites.vcf.gz

Tip

chunk_size and concurrent_fetches can be adjusted according to the network bandwidth and the size of the VCF file. As a rule of thumb for large scale operations (reading a whole VCF), it is recommended to keep the default values.
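
`info_fields` can restrict registration to just the INFO fields a query needs instead of everything detected from the header. A minimal sketch; the field names "AF" and "AC" are assumptions that depend on the particular VCF header:

```python
import polars_bio as pb

# "AF" and "AC" are assumed INFO fields; check the header of your VCF first.
pb.register_vcf(
    "/tmp/gnomad.v4.1.sv.sites.vcf.gz",
    "gnomad_sv_slim",
    info_fields=["AF", "AC"],
)
pb.sql("SELECT chrom, start, end FROM gnomad_sv_slim").limit(5).collect()
```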

Source code in polars_bio/sql.py
@staticmethod
def register_vcf(
    path: str,
    name: Union[str, None] = None,
    info_fields: Union[list[str], None] = None,
    thread_num: Union[int, None] = None,
    chunk_size: int = 64,
    concurrent_fetches: int = 8,
    allow_anonymous: bool = True,
    max_retries: int = 5,
    timeout: int = 300,
    enable_request_payer: bool = False,
    compression_type: str = "auto",
) -> None:
    """
    Register a VCF file as a Datafusion table.

    Parameters:
        path: The path to the VCF file.
        name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
        info_fields: List of INFO field names to register. If *None*, all INFO fields will be detected automatically from the VCF header. Use this to limit registration to specific fields for better performance.
        thread_num: The number of threads to use for reading the VCF file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
    !!! note
        VCF reader uses **1-based** coordinate system for the `start` and `end` columns.

    !!! Example
          ```python
          import polars_bio as pb
          pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz")
          ```
         ```shell
         INFO:polars_bio:Table: gnomad_v4_1_sv_sites_gz registered for path: /tmp/gnomad.v4.1.sv.sites.vcf.gz
         ```
    !!! tip
        `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the VCF file. As a rule of thumb for large scale operations (reading a whole VCF), it is recommended to keep the default values.
    """

    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    # Use provided info_fields or autodetect from VCF header
    if info_fields is not None:
        all_info_fields = info_fields
    else:
        # Get all info fields from VCF header for automatic field detection
        all_info_fields = None
        try:
            from .io import IOOperations

            vcf_schema_df = IOOperations.describe_vcf(
                path,
                allow_anonymous=allow_anonymous,
                enable_request_payer=enable_request_payer,
                compression_type=compression_type,
            )
            all_info_fields = vcf_schema_df.select("name").to_series().to_list()
        except Exception:
            # Fallback to empty list if unable to get info fields
            all_info_fields = []

    vcf_read_options = VcfReadOptions(
        info_fields=all_info_fields,
        thread_num=thread_num,
        object_storage_options=object_storage_options,
    )
    read_options = ReadOptions(vcf_read_options=vcf_read_options)
    py_register_table(ctx, path, name, InputFormat.Vcf, read_options)

register_view(name, query) staticmethod

Register a query as a Datafusion view. This view can be used in genomic ranges operations, such as overlap, nearest, and count_overlaps. It is useful for filtering, transforming, and aggregating data prior to the range operation. When combined with a range operation, it can be used to perform complex operations in a streaming fashion end-to-end.

Parameters:

name (str, required): The name of the table.
query (str, required): The SQL query.

Example

import polars_bio as pb
pb.register_vcf("gs://gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz", "gnomad_sv")
pb.register_view("v_gnomad_sv", "SELECT replace(chrom,'chr', '') AS chrom, start, end FROM gnomad_sv")
pb.sql("SELECT * FROM v_gnomad_sv").limit(5).collect()
  shape: (5, 3)
  ┌───────┬─────────┬─────────┐
  │ chrom ┆ start   ┆ end     │
  │ ---   ┆ ---     ┆ ---     │
  │ str   ┆ u32     ┆ u32     │
  ╞═══════╪═════════╪═════════╡
  │ 21    ┆ 5031905 ┆ 5031905 │
  │ 21    ┆ 5031905 ┆ 5031905 │
  │ 21    ┆ 5031909 ┆ 5031909 │
  │ 21    ┆ 5031911 ┆ 5031911 │
  │ 21    ┆ 5031911 ┆ 5031911 │
  └───────┴─────────┴─────────┘
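
As noted above, a registered view can feed a range operation directly, so filtering and the interval join run end-to-end in a streaming fashion. A minimal sketch, where `annotations_bed` is a hypothetical second table registered beforehand:

```python
import polars_bio as pb

# "annotations_bed" is assumed to be registered beforehand, e.g. with pb.register_bed(...).
overlaps = pb.overlap("v_gnomad_sv", "annotations_bed")
overlaps.limit(5).collect()
```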

Source code in polars_bio/sql.py
@staticmethod
def register_view(name: str, query: str) -> None:
    """
    Register a query as a Datafusion view. This view can be used in genomic ranges operations,
    such as overlap, nearest, and count_overlaps. It is useful for filtering, transforming, and aggregating data
    prior to the range operation. When combined with a range operation, it can be used to perform complex operations in a streaming fashion end-to-end.

    Parameters:
        name: The name of the table.
        query: The SQL query.

    !!! Example
          ```python
          import polars_bio as pb
          pb.register_vcf("gs://gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz", "gnomad_sv")
          pb.register_view("v_gnomad_sv", "SELECT replace(chrom,'chr', '') AS chrom, start, end FROM gnomad_sv")
          pb.sql("SELECT * FROM v_gnomad_sv").limit(5).collect()
          ```
          ```shell
            shape: (5, 3)
            ┌───────┬─────────┬─────────┐
            │ chrom ┆ start   ┆ end     │
            │ ---   ┆ ---     ┆ ---     │
            │ str   ┆ u32     ┆ u32     │
            ╞═══════╪═════════╪═════════╡
            │ 21    ┆ 5031905 ┆ 5031905 │
            │ 21    ┆ 5031905 ┆ 5031905 │
            │ 21    ┆ 5031909 ┆ 5031909 │
            │ 21    ┆ 5031911 ┆ 5031911 │
            │ 21    ┆ 5031911 ┆ 5031911 │
            └───────┴─────────┴─────────┘
          ```
    """
    py_register_view(ctx, name, query)

sql(query) staticmethod

Execute a SQL query on the registered tables.

Parameters:

query (str, required): The SQL query.

Example

import polars_bio as pb
pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_v4_1_sv")
pb.sql("SELECT * FROM gnomad_v4_1_sv LIMIT 5").collect()
Source code in polars_bio/sql.py
@staticmethod
def sql(query: str) -> pl.LazyFrame:
    """
    Execute a SQL query on the registered tables.

    Parameters:
        query: The SQL query.

    !!! Example
          ```python
          import polars_bio as pb
          pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_v4_1_sv")
          pb.sql("SELECT * FROM gnomad_v4_1_sv LIMIT 5").collect()
          ```
    """
    df = py_read_sql(ctx, query)
    return _lazy_scan(df)

range_operations

Source code in polars_bio/range_op.py
class IntervalOperations:

    @staticmethod
    def overlap(
        df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        use_zero_based: bool = False,
        suffixes: tuple[str, str] = ("_1", "_2"),
        on_cols: Union[list[str], None] = None,
        cols1: Union[list[str], None] = ["chrom", "start", "end"],
        cols2: Union[list[str], None] = ["chrom", "start", "end"],
        algorithm: str = "Coitrees",
        output_type: str = "polars.LazyFrame",
        read_options1: Union[ReadOptions, None] = None,
        read_options2: Union[ReadOptions, None] = None,
        projection_pushdown: bool = False,
    ) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
        """
        Find pairs of overlapping genomic intervals.
        Bioframe inspired API.

        Parameters:
            df1: Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
            df2: Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table. CSV with a header, BED and Parquet are supported.
            use_zero_based: By default the **1-based** coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
            cols1: The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            cols2: The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            suffixes: Suffixes for the columns of the two overlapped sets.
            on_cols: List of additional column names to join on. Default is None.
            algorithm: The algorithm to use for the overlap operation. Available options: Coitrees, IntervalTree, ArrayIntervalTree, Lapper, SuperIntervals.
            output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
            read_options1: Additional options for reading the input files.
            read_options2: Additional options for reading the input files.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        Returns:
            **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the overlapping intervals.

        Note:
            1. The default output format, i.e.  [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), is recommended for large datasets as it supports output streaming and lazy evaluation.
            This enables efficient processing of large datasets without loading the entire output dataset into memory.
            2. Streaming is only supported for polars.LazyFrame output.

        Example:
            ```python
            import polars_bio as pb
            import pandas as pd

            df1 = pd.DataFrame([
                ['chr1', 1, 5],
                ['chr1', 3, 8],
                ['chr1', 8, 10],
                ['chr1', 12, 14]],
            columns=['chrom', 'start', 'end']
            )

            df2 = pd.DataFrame(
            [['chr1', 4, 8],
             ['chr1', 10, 11]],
            columns=['chrom', 'start', 'end' ]
            )
            overlapping_intervals = pb.overlap(df1, df2, output_type="pandas.DataFrame")

            overlapping_intervals
                chrom_1         start_1     end_1 chrom_2       start_2  end_2
            0     chr1            1          5     chr1            4          8
            1     chr1            3          8     chr1            4          8

            ```
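
            The default LazyFrame output can also be streamed to disk instead of
            materializing the full result in memory. A minimal sketch, assuming the
            inputs are registered tables and the output path is a placeholder:

            ```python
            import polars_bio as pb

            # "reads" and "targets" are assumed registered tables (or file paths).
            lf = pb.overlap("reads", "targets")
            lf.sink_parquet("/tmp/overlaps.parquet")
            ```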

        Todo:
             Support for on_cols.
        """

        _validate_overlap_input(
            cols1, cols2, on_cols, suffixes, output_type, use_zero_based
        )

        cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
        cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
        range_options = RangeOptions(
            range_op=RangeOp.Overlap,
            filter_op=FilterOp.Weak if not use_zero_based else FilterOp.Strict,
            suffixes=suffixes,
            columns_1=cols1,
            columns_2=cols2,
            overlap_alg=algorithm,
        )

        return range_operation(
            df1,
            df2,
            range_options,
            output_type,
            ctx,
            read_options1,
            read_options2,
            projection_pushdown,
        )

    @staticmethod
    def nearest(
        df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        use_zero_based: bool = False,
        suffixes: tuple[str, str] = ("_1", "_2"),
        on_cols: Union[list[str], None] = None,
        cols1: Union[list[str], None] = ["chrom", "start", "end"],
        cols2: Union[list[str], None] = ["chrom", "start", "end"],
        output_type: str = "polars.LazyFrame",
        read_options: Union[ReadOptions, None] = None,
        projection_pushdown: bool = False,
    ) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
        """
        Find pairs of closest genomic intervals.
        Bioframe inspired API.

        Parameters:
            df1: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
            df2: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED and Parquet are supported.
            use_zero_based: By default, the **1-based** coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
            cols1: The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            cols2: The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            suffixes: Suffixes for the columns of the two joined sets.
            on_cols: List of additional column names to join on. Default is None.
            output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
            read_options: Additional options for reading the input files.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        Returns:
            **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the pairs of closest intervals.

        Note:
            The default output format, i.e. [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), is recommended for large datasets as it supports output streaming and lazy evaluation.
            This enables efficient processing of large datasets without loading the entire output dataset into memory.

        Example:
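            A minimal sketch (the exact output column names, e.g. the distance column, may differ between versions):

            ```python
            import polars_bio as pb
            import pandas as pd

            df1 = pd.DataFrame([
                ['chr1', 1, 5],
                ['chr1', 12, 14]],
            columns=['chrom', 'start', 'end']
            )

            df2 = pd.DataFrame(
            [['chr1', 4, 8],
             ['chr1', 10, 11]],
            columns=['chrom', 'start', 'end']
            )
            # For each interval in df1, find the closest interval in df2.
            closest_intervals = pb.nearest(df1, df2, output_type="pandas.DataFrame")
            ```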

        Todo:
            Support for on_cols.
        """

        _validate_overlap_input(
            cols1, cols2, on_cols, suffixes, output_type, use_zero_based
        )

        cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
        cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
        range_options = RangeOptions(
            range_op=RangeOp.Nearest,
            filter_op=FilterOp.Weak if not use_zero_based else FilterOp.Strict,
            suffixes=suffixes,
            columns_1=cols1,
            columns_2=cols2,
        )
        return range_operation(
            df1,
            df2,
            range_options,
            output_type,
            ctx,
            read_options,
            projection_pushdown=projection_pushdown,
        )

    @staticmethod
    def coverage(
        df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        use_zero_based: bool = False,
        suffixes: tuple[str, str] = ("_1", "_2"),
        on_cols: Union[list[str], None] = None,
        cols1: Union[list[str], None] = ["chrom", "start", "end"],
        cols2: Union[list[str], None] = ["chrom", "start", "end"],
        output_type: str = "polars.LazyFrame",
        read_options: Union[ReadOptions, None] = None,
        projection_pushdown: bool = False,
    ) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
        """
        Calculate intervals coverage.
        Bioframe inspired API.

        Parameters:
            df1: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
            df2: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED and Parquet are supported.
            use_zero_based: By default, the **1-based** coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
            cols1: The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            cols2: The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            suffixes: Suffixes for the columns of the two overlapped sets.
            on_cols: List of additional column names to join on. Default is None.
            output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
            read_options: Additional options for reading the input files.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        Returns:
            **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame with the coverage results.

        Note:
            The default output format, i.e. [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), is recommended for large datasets as it supports output streaming and lazy evaluation.
            This enables efficient processing of large datasets without loading the entire output dataset into memory.

        Example:
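            A minimal sketch, assuming bioframe-like coverage semantics (the names of the added coverage columns may differ between versions):

            ```python
            import polars_bio as pb
            import pandas as pd

            df1 = pd.DataFrame([
                ['chr1', 1, 5],
                ['chr1', 8, 10]],
            columns=['chrom', 'start', 'end']
            )

            df2 = pd.DataFrame(
            [['chr1', 4, 8],
             ['chr1', 10, 11]],
            columns=['chrom', 'start', 'end']
            )
            # Quantify how much of each interval in df1 is covered by intervals in df2.
            coverage = pb.coverage(df1, df2, output_type="pandas.DataFrame")
            ```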

        Todo:
            Support for on_cols.
        """

        _validate_overlap_input(
            cols1,
            cols2,
            on_cols,
            suffixes,
            output_type,
            use_zero_based,
        )

        cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
        cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
        range_options = RangeOptions(
            range_op=RangeOp.Coverage,
            filter_op=FilterOp.Weak if not use_zero_based else FilterOp.Strict,
            suffixes=suffixes,
            columns_1=cols1,
            columns_2=cols2,
        )
        return range_operation(
            df2,
            df1,
            range_options,
            output_type,
            ctx,
            read_options,
            projection_pushdown=projection_pushdown,
        )

    @staticmethod
    def count_overlaps(
        df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        use_zero_based: bool = False,
        suffixes: tuple[str, str] = ("", "_"),
        cols1: Union[list[str], None] = ["chrom", "start", "end"],
        cols2: Union[list[str], None] = ["chrom", "start", "end"],
        on_cols: Union[list[str], None] = None,
        output_type: str = "polars.LazyFrame",
        naive_query: bool = True,
        projection_pushdown: bool = False,
    ) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
        """
        Count pairs of overlapping genomic intervals.
        Bioframe inspired API.

        Parameters:
            df1: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
            df2: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED and Parquet are supported.
            use_zero_based: By default, the **1-based** coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
            suffixes: Suffixes for the columns of the two overlapped sets.
            cols1: The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            cols2: The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            on_cols: List of additional column names to join on. Default is None.
            output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
            naive_query: If True, use a naive query based on the overlap operation to count overlaps.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
        Returns:
            **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame with the overlap counts.

        Example:
            ```python
            import polars_bio as pb
            import pandas as pd

            df1 = pd.DataFrame([
                ['chr1', 1, 5],
                ['chr1', 3, 8],
                ['chr1', 8, 10],
                ['chr1', 12, 14]],
            columns=['chrom', 'start', 'end']
            )

            df2 = pd.DataFrame(
            [['chr1', 4, 8],
             ['chr1', 10, 11]],
            columns=['chrom', 'start', 'end' ]
            )
            counts = pb.count_overlaps(df1, df2, output_type="pandas.DataFrame")

            counts

            chrom  start  end  count
            0  chr1      1    5      1
            1  chr1      3    8      1
            2  chr1      8   10      0
            3  chr1     12   14      0
            ```

        Todo:
             Support return_input.
        """
        _validate_overlap_input(
            cols1, cols2, on_cols, suffixes, output_type, use_zero_based
        )
        my_ctx = get_py_ctx()
        on_cols = [] if on_cols is None else on_cols
        cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
        cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
        if naive_query:
            range_options = RangeOptions(
                range_op=RangeOp.CountOverlapsNaive,
                filter_op=FilterOp.Weak if not use_zero_based else FilterOp.Strict,
                suffixes=suffixes,
                columns_1=cols1,
                columns_2=cols2,
            )
            return range_operation(df2, df1, range_options, output_type, ctx)
        df1 = read_df_to_datafusion(my_ctx, df1)
        df2 = read_df_to_datafusion(my_ctx, df2)

        curr_cols = set(df1.schema().names) | set(df2.schema().names)
        s1start_s2end = prevent_column_collision("s1starts2end", curr_cols)
        s1end_s2start = prevent_column_collision("s1ends2start", curr_cols)
        contig = prevent_column_collision("contig", curr_cols)
        count = prevent_column_collision("count", curr_cols)
        starts = prevent_column_collision("starts", curr_cols)
        ends = prevent_column_collision("ends", curr_cols)
        is_s1 = prevent_column_collision("is_s1", curr_cols)
        suff, _ = suffixes
        df1, df2 = df2, df1
        df1 = df1.select(
            *(
                [
                    literal(1).alias(is_s1),
                    col(cols1[1]).alias(s1start_s2end),
                    col(cols1[2]).alias(s1end_s2start),
                    col(cols1[0]).alias(contig),
                ]
                + on_cols
            )
        )
        df2 = df2.select(
            *(
                [
                    literal(0).alias(is_s1),
                    col(cols2[2]).alias(s1end_s2start),
                    col(cols2[1]).alias(s1start_s2end),
                    col(cols2[0]).alias(contig),
                ]
                + on_cols
            )
        )

        df = df1.union(df2)

        partitioning = [col(contig)] + [col(c) for c in on_cols]
        df = df.select(
            *(
                [
                    s1start_s2end,
                    s1end_s2start,
                    contig,
                    is_s1,
                    datafusion.functions.sum(col(is_s1))
                    .over(
                        datafusion.expr.Window(
                            partition_by=partitioning,
                            order_by=[
                                col(s1start_s2end).sort(),
                                col(is_s1).sort(ascending=use_zero_based),
                            ],
                        )
                    )
                    .alias(starts),
                    datafusion.functions.sum(col(is_s1))
                    .over(
                        datafusion.expr.Window(
                            partition_by=partitioning,
                            order_by=[
                                col(s1end_s2start).sort(),
                                col(is_s1).sort(ascending=(not use_zero_based)),
                            ],
                        )
                    )
                    .alias(ends),
                ]
                + on_cols
            )
        )
        df = df.filter(col(is_s1) == 0)
        df = df.select(
            *(
                [
                    col(contig).alias(cols1[0] + suff),
                    col(s1end_s2start).alias(cols1[1] + suff),
                    col(s1start_s2end).alias(cols1[2] + suff),
                ]
                + on_cols
                + [(col(starts) - col(ends)).alias(count)]
            )
        )

        return convert_result(df, output_type)

    @staticmethod
    def merge(
        df: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        use_zero_based: bool = False,
        min_dist: float = 0,
        cols: Union[list[str], None] = ["chrom", "start", "end"],
        on_cols: Union[list[str], None] = None,
        output_type: str = "polars.LazyFrame",
        projection_pushdown: bool = False,
    ) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
        """
        Merge overlapping intervals. It is assumed that start < end.


        Parameters:
            df: Can be a path to a file, a polars DataFrame, or a pandas DataFrame. CSV with a header, BED and Parquet are supported.
            use_zero_based: By default, the **1-based** coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
            min_dist: Maximum distance between intervals that still allows them to be merged. Default is 0.
            cols: The names of columns containing the chromosome, start and end of the
                genomic intervals.
            on_cols: List of additional column names for clustering. Default is None.
            output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        Returns:
            **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the merged intervals.

        Example:
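            A minimal sketch (the name of the column holding the number of merged intervals may differ between versions):

            ```python
            import polars_bio as pb
            import pandas as pd

            df = pd.DataFrame([
                ['chr1', 1, 5],
                ['chr1', 3, 8],
                ['chr1', 8, 10],
                ['chr1', 12, 14]],
            columns=['chrom', 'start', 'end']
            )
            # Merge overlapping intervals per chromosome.
            merged = pb.merge(df, output_type="pandas.DataFrame")
            ```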

        Todo:
            Support for on_cols.
        """
        suffixes = ("_1", "_2")
        _validate_overlap_input(
            cols, cols, on_cols, suffixes, output_type, use_zero_based
        )

        my_ctx = get_py_ctx()
        cols = DEFAULT_INTERVAL_COLUMNS if cols is None else cols
        contig = cols[0]
        start = cols[1]
        end = cols[2]

        on_cols = [] if on_cols is None else on_cols
        on_cols = [contig] + on_cols

        df = read_df_to_datafusion(my_ctx, df)
        df_schema = df.schema()
        start_type = df_schema.field(start).type
        end_type = df_schema.field(end).type

        curr_cols = set(df_schema.names)
        start_end = prevent_column_collision("start_end", curr_cols)
        is_start_end = prevent_column_collision("is_start_or_end", curr_cols)
        current_intervals = prevent_column_collision("current_intervals", curr_cols)
        n_intervals = prevent_column_collision("n_intervals", curr_cols)
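
        # The code below implements a sweep-line merge: a +1 event is emitted at each
        # interval start and a -1 event at each (end + min_dist); after sorting the
        # events, a running sum tracks how many intervals are currently open, so a
        # merged cluster begins where the sum rises from 0 and ends where it drops
        # back to 0. Row numbers over all events are then used to derive how many
        # input intervals each merged cluster contains.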

        end_positions = df.select(
            *(
                [
                    (col(end) + min_dist).alias(start_end),
                    literal(-1).alias(is_start_end),
                ]
                + on_cols
            )
        )
        start_positions = df.select(
            *([col(start).alias(start_end), literal(1).alias(is_start_end)] + on_cols)
        )
        all_positions = start_positions.union(end_positions)
        start_end_type = all_positions.schema().field(start_end).type
        all_positions = all_positions.select(
            *([col(start_end).cast(start_end_type), col(is_start_end)] + on_cols)
        )

        sorting = [
            col(start_end).sort(),
            col(is_start_end).sort(ascending=use_zero_based),
        ]
        all_positions = all_positions.sort(*sorting)

        on_cols_expr = [col(c) for c in on_cols]

        win = datafusion.expr.Window(
            partition_by=on_cols_expr,
            order_by=sorting,
        )
        all_positions = all_positions.select(
            *(
                [
                    start_end,
                    is_start_end,
                    datafusion.functions.sum(col(is_start_end))
                    .over(win)
                    .alias(current_intervals),
                ]
                + on_cols
                + [
                    datafusion.functions.row_number(
                        partition_by=on_cols_expr, order_by=sorting
                    ).alias(n_intervals)
                ]
            )
        )
        all_positions = all_positions.filter(
            ((col(current_intervals) == 0) & (col(is_start_end) == -1))
            | ((col(current_intervals) == 1) & (col(is_start_end) == 1))
        )
        all_positions = all_positions.select(
            *(
                [start_end, is_start_end]
                + on_cols
                + [
                    (
                        (
                            col(n_intervals)
                            - datafusion.functions.lag(
                                col(n_intervals), partition_by=on_cols_expr
                            )
                            + 1
                        )
                        / 2
                    )
                    .cast(pa.int64())
                    .alias(n_intervals)
                ]
            )
        )
        result = all_positions.select(
            *(
                [
                    (col(start_end) - min_dist).alias(end),
                    is_start_end,
                    datafusion.functions.lag(
                        col(start_end), partition_by=on_cols_expr
                    ).alias(start),
                ]
                + on_cols
                + [n_intervals]
            )
        )
        result = result.filter(col(is_start_end) == -1)
        result = result.select(
            *(
                [contig, col(start).cast(start_type), col(end).cast(end_type)]
                + on_cols[1:]
                + [n_intervals]
            )
        )

        return convert_result(result, output_type)

count_overlaps(df1, df2, use_zero_based=False, suffixes=('', '_'), cols1=['chrom', 'start', 'end'], cols2=['chrom', 'start', 'end'], on_cols=None, output_type='polars.LazyFrame', naive_query=True, projection_pushdown=False) staticmethod

Count pairs of overlapping genomic intervals. Bioframe inspired API.

Parameters:

- df1 (Union[str, DataFrame, LazyFrame, 'pd.DataFrame']; required): Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see register_vcf). CSV with a header, BED and Parquet are supported.
- df2 (Union[str, DataFrame, LazyFrame, 'pd.DataFrame']; required): Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED and Parquet are supported.
- use_zero_based (bool; default: False): By default, the 1-based coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
- suffixes (tuple[str, str]; default: ('', '_')): Suffixes for the columns of the two overlapped sets.
- cols1 (Union[list[str], None]; default: ['chrom', 'start', 'end']): The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set.
- cols2 (Union[list[str], None]; default: ['chrom', 'start', 'end']): The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set.
- on_cols (Union[list[str], None]; default: None): List of additional column names to join on.
- output_type (str; default: 'polars.LazyFrame'): Type of the output. "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
- naive_query (bool; default: True): If True, use a naive query based on the overlap operation to count overlaps.
- projection_pushdown (bool; default: False): Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

Returns: polars.LazyFrame or polars.DataFrame or pandas.DataFrame with the overlap counts.

Example
import polars_bio as pb
import pandas as pd

df1 = pd.DataFrame([
    ['chr1', 1, 5],
    ['chr1', 3, 8],
    ['chr1', 8, 10],
    ['chr1', 12, 14]],
columns=['chrom', 'start', 'end']
)

df2 = pd.DataFrame(
[['chr1', 4, 8],
 ['chr1', 10, 11]],
columns=['chrom', 'start', 'end' ]
)
counts = pb.count_overlaps(df1, df2, output_type="pandas.DataFrame")

counts

chrom  start  end  count
0  chr1      1    5      1
1  chr1      3    8      1
2  chr1      8   10      0
3  chr1     12   14      0
Todo

Support return_input.

Source code in polars_bio/range_op.py
@staticmethod
def count_overlaps(
    df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    use_zero_based: bool = False,
    suffixes: tuple[str, str] = ("", "_"),
    cols1: Union[list[str], None] = ["chrom", "start", "end"],
    cols2: Union[list[str], None] = ["chrom", "start", "end"],
    on_cols: Union[list[str], None] = None,
    output_type: str = "polars.LazyFrame",
    naive_query: bool = True,
    projection_pushdown: bool = False,
) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
    """
    Count pairs of overlapping genomic intervals.
    Bioframe inspired API.

    Parameters:
        df1: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
        df2: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED and Parquet are supported.
        use_zero_based: By default, the **1-based** coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
        suffixes: Suffixes for the columns of the two overlapped sets.
        cols1: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        cols2: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        on_cols: List of additional column names to join on. Default is None.
        output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
        naive_query: If True, use a naive query based on the overlap operation to count overlaps.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
    Returns:
        **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame with the overlap counts.

    Example:
        ```python
        import polars_bio as pb
        import pandas as pd

        df1 = pd.DataFrame([
            ['chr1', 1, 5],
            ['chr1', 3, 8],
            ['chr1', 8, 10],
            ['chr1', 12, 14]],
        columns=['chrom', 'start', 'end']
        )

        df2 = pd.DataFrame(
        [['chr1', 4, 8],
         ['chr1', 10, 11]],
        columns=['chrom', 'start', 'end' ]
        )
        counts = pb.count_overlaps(df1, df2, output_type="pandas.DataFrame")

        counts

        chrom  start  end  count
        0  chr1      1    5      1
        1  chr1      3    8      1
        2  chr1      8   10      0
        3  chr1     12   14      0
        ```

    Todo:
         Support return_input.
    """
    _validate_overlap_input(
        cols1, cols2, on_cols, suffixes, output_type, use_zero_based
    )
    my_ctx = get_py_ctx()
    on_cols = [] if on_cols is None else on_cols
    cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
    cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
    if naive_query:
        range_options = RangeOptions(
            range_op=RangeOp.CountOverlapsNaive,
            filter_op=FilterOp.Weak if not use_zero_based else FilterOp.Strict,
            suffixes=suffixes,
            columns_1=cols1,
            columns_2=cols2,
        )
        return range_operation(df2, df1, range_options, output_type, ctx)
    df1 = read_df_to_datafusion(my_ctx, df1)
    df2 = read_df_to_datafusion(my_ctx, df2)

    curr_cols = set(df1.schema().names) | set(df2.schema().names)
    s1start_s2end = prevent_column_collision("s1starts2end", curr_cols)
    s1end_s2start = prevent_column_collision("s1ends2start", curr_cols)
    contig = prevent_column_collision("contig", curr_cols)
    count = prevent_column_collision("count", curr_cols)
    starts = prevent_column_collision("starts", curr_cols)
    ends = prevent_column_collision("ends", curr_cols)
    is_s1 = prevent_column_collision("is_s1", curr_cols)
    suff, _ = suffixes
    df1, df2 = df2, df1
    df1 = df1.select(
        *(
            [
                literal(1).alias(is_s1),
                col(cols1[1]).alias(s1start_s2end),
                col(cols1[2]).alias(s1end_s2start),
                col(cols1[0]).alias(contig),
            ]
            + on_cols
        )
    )
    df2 = df2.select(
        *(
            [
                literal(0).alias(is_s1),
                col(cols2[2]).alias(s1end_s2start),
                col(cols2[1]).alias(s1start_s2end),
                col(cols2[0]).alias(contig),
            ]
            + on_cols
        )
    )

    df = df1.union(df2)

    partitioning = [col(contig)] + [col(c) for c in on_cols]
    df = df.select(
        *(
            [
                s1start_s2end,
                s1end_s2start,
                contig,
                is_s1,
                datafusion.functions.sum(col(is_s1))
                .over(
                    datafusion.expr.Window(
                        partition_by=partitioning,
                        order_by=[
                            col(s1start_s2end).sort(),
                            col(is_s1).sort(ascending=use_zero_based),
                        ],
                    )
                )
                .alias(starts),
                datafusion.functions.sum(col(is_s1))
                .over(
                    datafusion.expr.Window(
                        partition_by=partitioning,
                        order_by=[
                            col(s1end_s2start).sort(),
                            col(is_s1).sort(ascending=(not use_zero_based)),
                        ],
                    )
                )
                .alias(ends),
            ]
            + on_cols
        )
    )
    df = df.filter(col(is_s1) == 0)
    df = df.select(
        *(
            [
                col(contig).alias(cols1[0] + suff),
                col(s1end_s2start).alias(cols1[1] + suff),
                col(s1start_s2end).alias(cols1[2] + suff),
            ]
            + on_cols
            + [(col(starts) - col(ends)).alias(count)]
        )
    )

    return convert_result(df, output_type)

coverage(df1, df2, use_zero_based=False, suffixes=('_1', '_2'), on_cols=None, cols1=['chrom', 'start', 'end'], cols2=['chrom', 'start', 'end'], output_type='polars.LazyFrame', read_options=None, projection_pushdown=False) staticmethod

Calculate intervals coverage. Bioframe inspired API.

Parameters:

- df1 (Union[str, DataFrame, LazyFrame, 'pd.DataFrame']; required): Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see register_vcf). CSV with a header, BED and Parquet are supported.
- df2 (Union[str, DataFrame, LazyFrame, 'pd.DataFrame']; required): Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED and Parquet are supported.
- use_zero_based (bool; default: False): By default, the 1-based coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
- cols1 (Union[list[str], None]; default: ['chrom', 'start', 'end']): The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set.
- cols2 (Union[list[str], None]; default: ['chrom', 'start', 'end']): The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set.
- suffixes (tuple[str, str]; default: ('_1', '_2')): Suffixes for the columns of the two overlapped sets.
- on_cols (Union[list[str], None]; default: None): List of additional column names to join on.
- output_type (str; default: 'polars.LazyFrame'): Type of the output. "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
- read_options (Union[ReadOptions, None]; default: None): Additional options for reading the input files.
- projection_pushdown (bool; default: False): Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

Returns:

- Union[LazyFrame, DataFrame, 'pd.DataFrame', DataFrame]: polars.LazyFrame or polars.DataFrame or pandas.DataFrame with the coverage results.

Note

The default output format, i.e. LazyFrame, is recommended for large datasets as it supports output streaming and lazy evaluation. This enables efficient processing of large datasets without loading the entire output dataset into memory.

Example:
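
A minimal sketch of running coverage directly on files (the paths below are placeholders; CSV with a header, BED and Parquet inputs are supported):

import polars_bio as pb
cov = pb.coverage("/tmp/targets.parquet", "/tmp/alignments.parquet")
cov.limit(5).collect()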

Todo

Support for on_cols.

Source code in polars_bio/range_op.py
@staticmethod
def coverage(
    df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    use_zero_based: bool = False,
    suffixes: tuple[str, str] = ("_1", "_2"),
    on_cols: Union[list[str], None] = None,
    cols1: Union[list[str], None] = ["chrom", "start", "end"],
    cols2: Union[list[str], None] = ["chrom", "start", "end"],
    output_type: str = "polars.LazyFrame",
    read_options: Union[ReadOptions, None] = None,
    projection_pushdown: bool = False,
) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
    """
    Calculate intervals coverage.
    Bioframe inspired API.

    Parameters:
        df1: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
        df2: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED and Parquet are supported.
        use_zero_based: By default, the **1-based** coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
        cols1: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        cols2: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        suffixes: Suffixes for the columns of the two overlapped sets.
        on_cols: List of additional column names to join on. Default is None.
        output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
        read_options: Additional options for reading the input files.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

    Returns:
        **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame with the coverage results.

    Note:
        The default output format, i.e. [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), is recommended for large datasets as it supports output streaming and lazy evaluation.
        This enables efficient processing of large datasets without loading the entire output dataset into memory.

    Example:

    Todo:
        Support for on_cols.
    """

    _validate_overlap_input(
        cols1,
        cols2,
        on_cols,
        suffixes,
        output_type,
        use_zero_based,
    )

    cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
    cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
    range_options = RangeOptions(
        range_op=RangeOp.Coverage,
        filter_op=FilterOp.Weak if not use_zero_based else FilterOp.Strict,
        suffixes=suffixes,
        columns_1=cols1,
        columns_2=cols2,
    )
    return range_operation(
        df2,
        df1,
        range_options,
        output_type,
        ctx,
        read_options,
        projection_pushdown=projection_pushdown,
    )

merge(df, use_zero_based=False, min_dist=0, cols=['chrom', 'start', 'end'], on_cols=None, output_type='polars.LazyFrame', projection_pushdown=False) staticmethod

Merge overlapping intervals. It is assumed that start < end.

Parameters:

- df (Union[str, DataFrame, LazyFrame, 'pd.DataFrame']; required): Can be a path to a file, a polars DataFrame, or a pandas DataFrame. CSV with a header, BED and Parquet are supported.
- use_zero_based (bool; default: False): By default, the 1-based coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
- min_dist (float; default: 0): Maximum distance between intervals that still allows them to be merged.
- cols (Union[list[str], None]; default: ['chrom', 'start', 'end']): The names of columns containing the chromosome, start and end of the genomic intervals.
- on_cols (Union[list[str], None]; default: None): List of additional column names for clustering.
- output_type (str; default: 'polars.LazyFrame'): Type of the output. "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
- projection_pushdown (bool; default: False): Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

Returns:

- Union[LazyFrame, DataFrame, 'pd.DataFrame', DataFrame]: polars.LazyFrame or polars.DataFrame or pandas.DataFrame of the merged intervals.

Example:
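
A minimal sketch illustrating min_dist with placeholder data (intervals separated by no more than min_dist are merged; the exact boundary semantics follow the configured coordinate system):

import polars_bio as pb
import pandas as pd

df = pd.DataFrame([
    ['chr1', 1, 5],
    ['chr1', 7, 10]],
columns=['chrom', 'start', 'end']
)
merged = pb.merge(df, min_dist=2, output_type="polars.DataFrame")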

Todo

Support for on_cols.

Source code in polars_bio/range_op.py
@staticmethod
def merge(
    df: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    use_zero_based: bool = False,
    min_dist: float = 0,
    cols: Union[list[str], None] = ["chrom", "start", "end"],
    on_cols: Union[list[str], None] = None,
    output_type: str = "polars.LazyFrame",
    projection_pushdown: bool = False,
) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
    """
    Merge overlapping intervals. It is assumed that start < end.


    Parameters:
        df: Can be a path to a file, a polars DataFrame, or a pandas DataFrame. CSV with a header, BED and Parquet are supported.
        use_zero_based: By default, the **1-based** coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
        min_dist: Maximum distance between intervals that still allows them to be merged. Default is 0.
        cols: The names of columns containing the chromosome, start and end of the
            genomic intervals.
        on_cols: List of additional column names for clustering. Default is None.
        output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

    Returns:
        **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the merged intervals.

    Example:

    Todo:
        Support for on_cols.
    """
    suffixes = ("_1", "_2")
    _validate_overlap_input(
        cols, cols, on_cols, suffixes, output_type, use_zero_based
    )

    my_ctx = get_py_ctx()
    cols = DEFAULT_INTERVAL_COLUMNS if cols is None else cols
    contig = cols[0]
    start = cols[1]
    end = cols[2]

    on_cols = [] if on_cols is None else on_cols
    on_cols = [contig] + on_cols

    df = read_df_to_datafusion(my_ctx, df)
    df_schema = df.schema()
    start_type = df_schema.field(start).type
    end_type = df_schema.field(end).type

    curr_cols = set(df_schema.names)
    start_end = prevent_column_collision("start_end", curr_cols)
    is_start_end = prevent_column_collision("is_start_or_end", curr_cols)
    current_intervals = prevent_column_collision("current_intervals", curr_cols)
    n_intervals = prevent_column_collision("n_intervals", curr_cols)

    end_positions = df.select(
        *(
            [
                (col(end) + min_dist).alias(start_end),
                literal(-1).alias(is_start_end),
            ]
            + on_cols
        )
    )
    start_positions = df.select(
        *([col(start).alias(start_end), literal(1).alias(is_start_end)] + on_cols)
    )
    all_positions = start_positions.union(end_positions)
    start_end_type = all_positions.schema().field(start_end).type
    all_positions = all_positions.select(
        *([col(start_end).cast(start_end_type), col(is_start_end)] + on_cols)
    )

    sorting = [
        col(start_end).sort(),
        col(is_start_end).sort(ascending=use_zero_based),
    ]
    all_positions = all_positions.sort(*sorting)

    on_cols_expr = [col(c) for c in on_cols]

    win = datafusion.expr.Window(
        partition_by=on_cols_expr,
        order_by=sorting,
    )
    all_positions = all_positions.select(
        *(
            [
                start_end,
                is_start_end,
                datafusion.functions.sum(col(is_start_end))
                .over(win)
                .alias(current_intervals),
            ]
            + on_cols
            + [
                datafusion.functions.row_number(
                    partition_by=on_cols_expr, order_by=sorting
                ).alias(n_intervals)
            ]
        )
    )
    all_positions = all_positions.filter(
        ((col(current_intervals) == 0) & (col(is_start_end) == -1))
        | ((col(current_intervals) == 1) & (col(is_start_end) == 1))
    )
    all_positions = all_positions.select(
        *(
            [start_end, is_start_end]
            + on_cols
            + [
                (
                    (
                        col(n_intervals)
                        - datafusion.functions.lag(
                            col(n_intervals), partition_by=on_cols_expr
                        )
                        + 1
                    )
                    / 2
                )
                .cast(pa.int64())
                .alias(n_intervals)
            ]
        )
    )
    result = all_positions.select(
        *(
            [
                (col(start_end) - min_dist).alias(end),
                is_start_end,
                datafusion.functions.lag(
                    col(start_end), partition_by=on_cols_expr
                ).alias(start),
            ]
            + on_cols
            + [n_intervals]
        )
    )
    result = result.filter(col(is_start_end) == -1)
    result = result.select(
        *(
            [contig, col(start).cast(start_type), col(end).cast(end_type)]
            + on_cols[1:]
            + [n_intervals]
        )
    )

    return convert_result(result, output_type)

nearest(df1, df2, use_zero_based=False, suffixes=('_1', '_2'), on_cols=None, cols1=['chrom', 'start', 'end'], cols2=['chrom', 'start', 'end'], output_type='polars.LazyFrame', read_options=None, projection_pushdown=False) staticmethod

Find pairs of closest genomic intervals. Bioframe inspired API.

Parameters:

- df1 (Union[str, DataFrame, LazyFrame, 'pd.DataFrame']; required): Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see register_vcf). CSV with a header, BED and Parquet are supported.
- df2 (Union[str, DataFrame, LazyFrame, 'pd.DataFrame']; required): Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED and Parquet are supported.
- use_zero_based (bool; default: False): By default, the 1-based coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
- cols1 (Union[list[str], None]; default: ['chrom', 'start', 'end']): The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set.
- cols2 (Union[list[str], None]; default: ['chrom', 'start', 'end']): The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set.
- suffixes (tuple[str, str]; default: ('_1', '_2')): Suffixes for the columns of the two joined sets.
- on_cols (Union[list[str], None]; default: None): List of additional column names to join on.
- output_type (str; default: 'polars.LazyFrame'): Type of the output. "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
- read_options (Union[ReadOptions, None]; default: None): Additional options for reading the input files.
- projection_pushdown (bool; default: False): Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

Returns:

- Union[LazyFrame, DataFrame, 'pd.DataFrame', DataFrame]: polars.LazyFrame or polars.DataFrame or pandas.DataFrame of the pairs of closest intervals.

Note

The default output format, i.e. LazyFrame, is recommended for large datasets as it supports output streaming and lazy evaluation. This enables efficient processing of large datasets without loading the entire output dataset into memory.

Example:
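
A minimal sketch with custom interval column names passed via cols1 and cols2 (the file paths and column names below are placeholders; CSV files with a header are supported):

import polars_bio as pb
nearest_pairs = pb.nearest(
    "/tmp/peaks.csv",
    "/tmp/genes.csv",
    cols1=["chr", "peak_start", "peak_end"],
    cols2=["chr", "gene_start", "gene_end"],
)
nearest_pairs.limit(5).collect()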

Todo

Support for on_cols.

Source code in polars_bio/range_op.py
@staticmethod
def nearest(
    df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    use_zero_based: bool = False,
    suffixes: tuple[str, str] = ("_1", "_2"),
    on_cols: Union[list[str], None] = None,
    cols1: Union[list[str], None] = ["chrom", "start", "end"],
    cols2: Union[list[str], None] = ["chrom", "start", "end"],
    output_type: str = "polars.LazyFrame",
    read_options: Union[ReadOptions, None] = None,
    projection_pushdown: bool = False,
) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
    """
    Find pairs of closest genomic intervals.
    Bioframe inspired API.

    Parameters:
        df1: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
        df2: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED and Parquet are supported.
        use_zero_based: By default, the **1-based** coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
        cols1: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        cols2: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        suffixes: Suffixes for the columns of the two joined sets.
        on_cols: List of additional column names to join on. Default is None.
        output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
        read_options: Additional options for reading the input files.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

    Returns:
        **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the pairs of closest intervals.

    Note:
        The default output format, i.e. [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), is recommended for large datasets as it supports output streaming and lazy evaluation.
        This enables efficient processing of large datasets without loading the entire output dataset into memory.

    Example:

    Todo:
        Support for on_cols.
    """

    _validate_overlap_input(
        cols1, cols2, on_cols, suffixes, output_type, use_zero_based
    )

    cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
    cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
    range_options = RangeOptions(
        range_op=RangeOp.Nearest,
        filter_op=FilterOp.Weak if not use_zero_based else FilterOp.Strict,
        suffixes=suffixes,
        columns_1=cols1,
        columns_2=cols2,
    )
    return range_operation(
        df1,
        df2,
        range_options,
        output_type,
        ctx,
        read_options,
        projection_pushdown=projection_pushdown,
    )

overlap(df1, df2, use_zero_based=False, suffixes=('_1', '_2'), on_cols=None, cols1=['chrom', 'start', 'end'], cols2=['chrom', 'start', 'end'], algorithm='Coitrees', output_type='polars.LazyFrame', read_options1=None, read_options2=None, projection_pushdown=False) staticmethod

Find pairs of overlapping genomic intervals. Bioframe inspired API.

Parameters:

- df1 (Union[str, DataFrame, LazyFrame, 'pd.DataFrame']; required): Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see register_vcf). CSV with a header, BED and Parquet are supported.
- df2 (Union[str, DataFrame, LazyFrame, 'pd.DataFrame']; required): Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED and Parquet are supported.
- use_zero_based (bool; default: False): By default, the 1-based coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
- cols1 (Union[list[str], None]; default: ['chrom', 'start', 'end']): The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set.
- cols2 (Union[list[str], None]; default: ['chrom', 'start', 'end']): The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set.
- suffixes (tuple[str, str]; default: ('_1', '_2')): Suffixes for the columns of the two overlapped sets.
- on_cols (Union[list[str], None]; default: None): List of additional column names to join on.
- algorithm (str; default: 'Coitrees'): The algorithm to use for the overlap operation. Available options: Coitrees, IntervalTree, ArrayIntervalTree, Lapper, SuperIntervals.
- output_type (str; default: 'polars.LazyFrame'): Type of the output. "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
- read_options1 (Union[ReadOptions, None]; default: None): Additional options for reading the first input file.
- read_options2 (Union[ReadOptions, None]; default: None): Additional options for reading the second input file.
- projection_pushdown (bool; default: False): Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

Returns:

- Union[LazyFrame, DataFrame, 'pd.DataFrame', DataFrame]: polars.LazyFrame or polars.DataFrame or pandas.DataFrame of the overlapping intervals.

Note
  1. The default output format, i.e. LazyFrame, is recommended for large datasets as it supports output streaming and lazy evaluation. This enables efficient processing of large datasets without loading the entire output dataset into memory.
  2. Streaming is only supported for polars.LazyFrame output.
Example
import polars_bio as pb
import pandas as pd

df1 = pd.DataFrame([
    ['chr1', 1, 5],
    ['chr1', 3, 8],
    ['chr1', 8, 10],
    ['chr1', 12, 14]],
columns=['chrom', 'start', 'end']
)

df2 = pd.DataFrame(
[['chr1', 4, 8],
 ['chr1', 10, 11]],
columns=['chrom', 'start', 'end' ]
)
overlapping_intervals = pb.overlap(df1, df2, output_type="pandas.DataFrame")

overlapping_intervals
    chrom_1         start_1     end_1 chrom_2       start_2  end_2
0     chr1            1          5     chr1            4          8
1     chr1            3          8     chr1            4          8
Todo

Support for on_cols.

Source code in polars_bio/range_op.py
@staticmethod
def overlap(
    df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    use_zero_based: bool = False,
    suffixes: tuple[str, str] = ("_1", "_2"),
    on_cols: Union[list[str], None] = None,
    cols1: Union[list[str], None] = ["chrom", "start", "end"],
    cols2: Union[list[str], None] = ["chrom", "start", "end"],
    algorithm: str = "Coitrees",
    output_type: str = "polars.LazyFrame",
    read_options1: Union[ReadOptions, None] = None,
    read_options2: Union[ReadOptions, None] = None,
    projection_pushdown: bool = False,
) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
    """
    Find pairs of overlapping genomic intervals.
    Bioframe inspired API.

    Parameters:
        df1: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
        df2: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED and Parquet are supported.
        use_zero_based: By default, the **1-based** coordinate system is used, as all input file readers use 1-based coordinates. If enabled, the 0-based system is used instead and the end user is responsible for ensuring that both datasets follow this coordinate system.
        cols1: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        cols2: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        suffixes: Suffixes for the columns of the two overlapped sets.
        on_cols: List of additional column names to join on. Default is None.
        algorithm: The algorithm to use for the overlap operation. Available options: Coitrees, IntervalTree, ArrayIntervalTree, Lapper, SuperIntervals.
        output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
        read_options1: Additional options for reading the first input file.
        read_options2: Additional options for reading the second input file.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

    Returns:
        **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the overlapping intervals.

    Note:
        1. The default output format, i.e.  [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), is recommended for large datasets as it supports output streaming and lazy evaluation.
        This enables efficient processing of large datasets without loading the entire output dataset into memory.
        2. Streaming is only supported for polars.LazyFrame output.

    Example:
        ```python
        import polars_bio as pb
        import pandas as pd

        df1 = pd.DataFrame([
            ['chr1', 1, 5],
            ['chr1', 3, 8],
            ['chr1', 8, 10],
            ['chr1', 12, 14]],
        columns=['chrom', 'start', 'end']
        )

        df2 = pd.DataFrame(
        [['chr1', 4, 8],
         ['chr1', 10, 11]],
        columns=['chrom', 'start', 'end' ]
        )
        overlapping_intervals = pb.overlap(df1, df2, output_type="pandas.DataFrame")

        overlapping_intervals
            chrom_1         start_1     end_1 chrom_2       start_2  end_2
        0     chr1            1          5     chr1            4          8
        1     chr1            3          8     chr1            4          8

        ```

    Todo:
         Support for on_cols.
    """

    _validate_overlap_input(
        cols1, cols2, on_cols, suffixes, output_type, use_zero_based
    )

    cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
    cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
    range_options = RangeOptions(
        range_op=RangeOp.Overlap,
        filter_op=FilterOp.Weak if not use_zero_based else FilterOp.Strict,
        suffixes=suffixes,
        columns_1=cols1,
        columns_2=cols2,
        overlap_alg=algorithm,
    )

    return range_operation(
        df1,
        df2,
        range_options,
        output_type,
        ctx,
        read_options1,
        read_options2,
        projection_pushdown,
    )

set_loglevel(level)

Set the log level for the logger and root logger.

Parameters:

- level (str; required): The log level to set. Can be "debug", "info", "warn", or "warning".

Note

Please note that the log level should be set as a first step after importing the library. Once set, it can only be decreased, not increased. In order to increase the log level, you need to restart the Python session.

import polars_bio as pb
pb.set_loglevel("info")

Source code in polars_bio/logging.py
def set_loglevel(level: str):
    """
    Set the log level for the logger and root logger.

    Parameters:
        level: The log level to set. Can be "debug", "info", "warn", or "warning".

    !!! note
        Please note that the log level should be set as a **first** step after importing the library.
        Once set, it can only be **decreased**, not increased. In order to increase the log level, you need to restart the Python session.
        ```python
        import polars_bio as pb
        pb.set_loglevel("info")
        ```
    """
    level = level.lower()
    if level == "debug":
        logger.setLevel(logging.DEBUG)
        root_logger.setLevel(logging.DEBUG)
        logging.basicConfig(level=logging.DEBUG)
    elif level == "info":
        logger.setLevel(logging.INFO)
        root_logger.setLevel(logging.INFO)
        logging.basicConfig(level=logging.INFO)
    elif level == "warn" or level == "warning":
        logger.setLevel(logging.WARN)
        root_logger.setLevel(logging.WARN)
        logging.basicConfig(level=logging.WARN)
    else:
        raise ValueError(f"{level} is not a valid log level")