
⚙️ API reference

The polars-bio API is grouped into the following categories:

  • File I/O: Reading files in various biological formats from local and cloud storage.
  • Data Processing: Exposes a rich SQL programming interface powered by Apache DataFusion for operations such as sorting, filtering, and other transformations on input bioinformatic datasets registered as tables. You can easily query and process file formats such as VCF, GFF, BAM, and FASTQ using SQL syntax (see the sketch after this list).
  • Interval Operations: Functions for performing common interval operations such as overlap, nearest, and coverage.
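
For example, a registered file can be queried directly with SQL. A minimal sketch, assuming the register_vcf and sql helpers described above (the exact helper signatures, table name, and file path are assumptions and may differ in your version):

import polars_bio as pb

# Register a VCF file as a table, then query it with SQL (hypothetical path and table name).
pb.register_vcf("sample.vcf.gz", "variants")
pb.sql("SELECT chrom, start, ref, alt FROM variants WHERE chrom = 'chr1'").limit(5).collect()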

There are two ways of using the polars-bio API:

  • using the polars_bio module

Example

import polars_bio as pb
pb.read_fastq("gs://genomics-public-data/platinum-genomes/fastq/ERR194146.fastq.gz").limit(1).collect()
  • directly on a Polars LazyFrame under the registered pb namespace

Example

import polars_bio as pb

# df is a Polars LazyFrame:
# >>> type(df)
# <class 'polars.lazyframe.frame.LazyFrame'>
df.pb.sort().limit(5).collect()

Tip

  1. Not all functions are available in both ways.
  2. You can of course use both ways in the same script, as shown in the sketch below.
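
A minimal sketch combining both styles (the file path is hypothetical; scan_vcf and the pb namespace are documented elsewhere in this reference):

import polars_bio as pb

# Module-level I/O returns a Polars LazyFrame...
lf = pb.scan_vcf("sample.vcf.gz")
# ...which can then be processed via the registered pb namespace.
lf.pb.sort().limit(5).collect()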

CoordinateSystemMismatchError

Bases: Exception

Raised when two DataFrames have different coordinate systems.

This error occurs when attempting range operations (overlap, nearest, etc.) on DataFrames where one uses 0-based coordinates and the other uses 1-based coordinates.

Example
df1 = pb.scan_vcf("file1.vcf", one_based=False)  # 0-based
df2 = pb.scan_vcf("file2.vcf", one_based=True)   # 1-based
pb.overlap(df1, df2)  # Raises CoordinateSystemMismatchError
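
One way to avoid the error is to read both inputs with the same coordinate system. A minimal sketch using the use_zero_based parameter from the scan_vcf signature shown later in this reference (file paths are hypothetical):

df1 = pb.scan_vcf("file1.vcf", use_zero_based=True)  # 0-based
df2 = pb.scan_vcf("file2.vcf", use_zero_based=True)  # 0-based
pb.overlap(df1, df2)  # coordinate systems match, so no error is raised
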
Source code in polars_bio/exceptions.py
class CoordinateSystemMismatchError(Exception):
    """Raised when two DataFrames have different coordinate systems.

    This error occurs when attempting range operations (overlap, nearest, etc.)
    on DataFrames where one uses 0-based coordinates and the other uses 1-based
    coordinates.

    Example:
        ```python
        df1 = pb.scan_vcf("file1.vcf", one_based=False)  # 0-based
        df2 = pb.scan_vcf("file2.vcf", one_based=True)   # 1-based
        pb.overlap(df1, df2)  # Raises CoordinateSystemMismatchError
        ```
    """

    pass

MissingCoordinateSystemError

Bases: Exception

Raised when a DataFrame lacks coordinate system metadata.

Range operations require coordinate system metadata to determine the correct interval semantics. This error is raised when:

  • A Polars LazyFrame/DataFrame lacks polars-config-meta metadata
  • A Pandas DataFrame lacks df.attrs["coordinate_system_zero_based"]
  • A file path registers a table without Arrow schema metadata

For Polars DataFrames, use polars-bio I/O functions (scan_, read_) which automatically set the metadata.

For Pandas DataFrames, set the attribute before passing to range operations:

df.attrs["coordinate_system_zero_based"] = True  # 0-based coords

Example
import pandas as pd
import polars_bio as pb

pdf = pd.read_csv("intervals.bed", sep="\t", names=["chrom", "start", "end"])
pb.overlap(pdf, pdf)  # Raises MissingCoordinateSystemError

# Fix: set the coordinate system metadata
pdf.attrs["coordinate_system_zero_based"] = True
pb.overlap(pdf, pdf)  # Works correctly
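
For Polars inputs, the scan_/read_ functions set this metadata automatically, so no extra step is needed. A minimal sketch (hypothetical file path):

import polars_bio as pb

lf = pb.scan_vcf("sample.vcf.gz")  # coordinate-system metadata is set by the reader
pb.overlap(lf, lf)                 # works without additional configuration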
Source code in polars_bio/exceptions.py
class MissingCoordinateSystemError(Exception):
    """Raised when a DataFrame lacks coordinate system metadata.

    Range operations require coordinate system metadata to determine the
    correct interval semantics. This error is raised when:

    - A Polars LazyFrame/DataFrame lacks polars-config-meta metadata
    - A Pandas DataFrame lacks df.attrs["coordinate_system_zero_based"]
    - A file path registers a table without Arrow schema metadata

    For Polars DataFrames, use polars-bio I/O functions (scan_*, read_*) which
    automatically set the metadata.

    For Pandas DataFrames, set the attribute before passing to range operations:
        ```python
        df.attrs["coordinate_system_zero_based"] = True  # 0-based coords
        ```

    Example:
        ```python
        import pandas as pd
        import polars_bio as pb

        pdf = pd.read_csv("intervals.bed", sep="\t", names=["chrom", "start", "end"])
        pb.overlap(pdf, pdf)  # Raises MissingCoordinateSystemError

        # Fix: set the coordinate system metadata
        pdf.attrs["coordinate_system_zero_based"] = True
        pb.overlap(pdf, pdf)  # Works correctly
        ```
    """

    pass

data_input

Source code in polars_bio/io.py
class IOOperations:
    @staticmethod
    def read_fasta(
        path: str,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = True,
    ) -> pl.DataFrame:
        """

        Read a FASTA file into a DataFrame.

        Parameters:
            path: The path to the FASTA file.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the FASTA file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
            projection_pushdown: Enable column projection pushdown optimization. When True, only requested columns are processed at the DataFusion execution level, improving performance and reducing memory usage.

        !!! Example
            ```shell
            wget https://www.ebi.ac.uk/ena/browser/api/fasta/BK006935.2?download=true -O /tmp/test.fasta
            ```

            ```python
            import polars_bio as pb
            pb.read_fasta("/tmp/test.fasta").limit(1)
            ```
            ```shell
             shape: (1, 3)
            ┌─────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
            │ name                    ┆ description                     ┆ sequence                        │
            │ ---                     ┆ ---                             ┆ ---                             │
            │ str                     ┆ str                             ┆ str                             │
            ╞═════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
            │ ENA|BK006935|BK006935.2 ┆ TPA_inf: Saccharomyces cerevis… ┆ CCACACCACACCCACACACCCACACACCAC… │
            └─────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘
            ```
        """
        return IOOperations.scan_fasta(
            path,
            chunk_size,
            concurrent_fetches,
            allow_anonymous,
            enable_request_payer,
            max_retries,
            timeout,
            compression_type,
            projection_pushdown,
        ).collect()

    @staticmethod
    def scan_fasta(
        path: str,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = True,
    ) -> pl.LazyFrame:
        """

        Lazily read a FASTA file into a LazyFrame.

        Parameters:
            path: The path to the FASTA file.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the FASTA file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        !!! Example
            ```shell
            wget https://www.ebi.ac.uk/ena/browser/api/fasta/BK006935.2?download=true -O /tmp/test.fasta
            ```

            ```python
            import polars_bio as pb
            pb.scan_fasta("/tmp/test.fasta").limit(1).collect()
            ```
            ```shell
             shape: (1, 3)
            ┌─────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
            │ name                    ┆ description                     ┆ sequence                        │
            │ ---                     ┆ ---                             ┆ ---                             │
            │ str                     ┆ str                             ┆ str                             │
            ╞═════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
            │ ENA|BK006935|BK006935.2 ┆ TPA_inf: Saccharomyces cerevis… ┆ CCACACCACACCCACACACCCACACACCAC… │
            └─────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘
            ```
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )
        fasta_read_options = FastaReadOptions(
            object_storage_options=object_storage_options
        )
        read_options = ReadOptions(fasta_read_options=fasta_read_options)
        return _read_file(path, InputFormat.Fasta, read_options, projection_pushdown)

    @staticmethod
    def read_vcf(
        path: str,
        info_fields: Union[list[str], None] = None,
        format_fields: Union[list[str], None] = None,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = True,
        predicate_pushdown: bool = True,
        use_zero_based: Optional[bool] = None,
    ) -> pl.DataFrame:
        """
        Read a VCF file into a DataFrame.

        !!! hint "Parallelism & Indexed Reads"
            Indexed parallel reads and predicate pushdown are automatic when a TBI/CSI index
            is present. See [File formats support](/polars-bio/features/#file-formats-support),
            [Indexed reads](/polars-bio/features/#indexed-reads-predicate-pushdown),
            and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details.

        Parameters:
            path: The path to the VCF file.
            info_fields: List of INFO field names to include. If *None*, all INFO fields will be detected automatically from the VCF header. Use this to limit fields for better performance.
            format_fields: List of FORMAT field names to include (per-sample genotype data). If *None*, all FORMAT fields will be automatically detected from the VCF header. Column naming depends on the number of samples: for **single-sample** VCFs, columns are named directly by the FORMAT field (e.g., `GT`, `DP`); for **multi-sample** VCFs, columns are named `{sample_name}_{format_field}` (e.g., `NA12878_GT`, `NA12879_DP`). The GT field is always converted to string with `/` (unphased) or `|` (phased) separator.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
            predicate_pushdown: Enable predicate pushdown using index files (TBI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.vcf.gz.tbi`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
            use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

        !!! note
            By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.

        !!! Example "Reading VCF with INFO and FORMAT fields"
            ```python
            import polars_bio as pb

            # Read VCF with both INFO and FORMAT fields
            df = pb.read_vcf(
                "sample.vcf.gz",
                info_fields=["END"],              # INFO field
                format_fields=["GT", "DP", "GQ"]  # FORMAT fields
            )

            # Single-sample VCF: FORMAT columns named directly (GT, DP, GQ)
            print(df.select(["chrom", "start", "ref", "alt", "END", "GT", "DP", "GQ"]))
            # Output:
            # shape: (10, 8)
            # ┌───────┬───────┬─────┬─────┬──────┬─────┬─────┬─────┐
            # │ chrom ┆ start ┆ ref ┆ alt ┆ END  ┆ GT  ┆ DP  ┆ GQ  │
            # │ str   ┆ u32   ┆ str ┆ str ┆ i32  ┆ str ┆ i32 ┆ i32 │
            # ╞═══════╪═══════╪═════╪═════╪══════╪═════╪═════╪═════╡
            # │ 1     ┆ 10009 ┆ A   ┆ .   ┆ null ┆ 0/0 ┆ 10  ┆ 27  │
            # │ 1     ┆ 10015 ┆ A   ┆ .   ┆ null ┆ 0/0 ┆ 17  ┆ 35  │
            # └───────┴───────┴─────┴─────┴──────┴─────┴─────┴─────┘

            # Multi-sample VCF: FORMAT columns named {sample}_{field}
            df = pb.read_vcf("multisample.vcf", format_fields=["GT", "DP"])
            print(df.select(["chrom", "start", "NA12878_GT", "NA12878_DP", "NA12879_GT"]))
            ```
        """
        lf = IOOperations.scan_vcf(
            path,
            info_fields,
            format_fields,
            chunk_size,
            concurrent_fetches,
            allow_anonymous,
            enable_request_payer,
            max_retries,
            timeout,
            compression_type,
            projection_pushdown,
            predicate_pushdown,
            use_zero_based,
        )
        # Get metadata before collecting (polars-config-meta doesn't preserve through collect)
        zero_based = lf.config_meta.get_metadata().get("coordinate_system_zero_based")
        df = lf.collect()
        # Set metadata on the collected DataFrame
        if zero_based is not None:
            set_coordinate_system(df, zero_based)
        return df

    @staticmethod
    def scan_vcf(
        path: str,
        info_fields: Union[list[str], None] = None,
        format_fields: Union[list[str], None] = None,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = True,
        predicate_pushdown: bool = True,
        use_zero_based: Optional[bool] = None,
    ) -> pl.LazyFrame:
        """
        Lazily read a VCF file into a LazyFrame.

        !!! hint "Parallelism & Indexed Reads"
            Indexed parallel reads and predicate pushdown are automatic when a TBI/CSI index
            is present. See [File formats support](/polars-bio/features/#file-formats-support),
            [Indexed reads](/polars-bio/features/#indexed-reads-predicate-pushdown),
            and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details.

        Parameters:
            path: The path to the VCF file.
            info_fields: List of INFO field names to include. If *None*, all INFO fields will be detected automatically from the VCF header. Use this to limit fields for better performance.
            format_fields: List of FORMAT field names to include (per-sample genotype data). If *None*, all FORMAT fields will be automatically detected from the VCF header. Column naming depends on the number of samples: for **single-sample** VCFs, columns are named directly by the FORMAT field (e.g., `GT`, `DP`); for **multi-sample** VCFs, columns are named `{sample_name}_{format_field}` (e.g., `NA12878_GT`, `NA12879_DP`). The GT field is always converted to string with `/` (unphased) or `|` (phased) separator.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
            predicate_pushdown: Enable predicate pushdown using index files (TBI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.vcf.gz.tbi`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
            use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

        !!! note
            By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.

        !!! Example "Lazy scanning VCF with INFO and FORMAT fields"
            ```python
            import polars_bio as pb

            # Lazily scan VCF with both INFO and FORMAT fields
            lf = pb.scan_vcf(
                "sample.vcf.gz",
                info_fields=["END"],              # INFO field
                format_fields=["GT", "DP", "GQ"]  # FORMAT fields
            )

            # Apply filters and collect only what's needed
            df = lf.filter(pl.col("DP") > 20).select(
                ["chrom", "start", "ref", "alt", "GT", "DP", "GQ"]
            ).collect()

            # Single-sample VCF: FORMAT columns named directly (GT, DP, GQ)
            # Multi-sample VCF: FORMAT columns named {sample}_{field}
            ```
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        # Use provided info_fields or autodetect from VCF header
        if info_fields is not None:
            initial_info_fields = info_fields
        else:
            # Get all info fields from VCF header for proper projection pushdown
            all_info_fields = None
            try:
                vcf_schema_df = IOOperations.describe_vcf(
                    path,
                    allow_anonymous=allow_anonymous,
                    enable_request_payer=enable_request_payer,
                    compression_type=compression_type,
                )
                # Use column name 'name' not 'id' based on the schema output
                all_info_fields = vcf_schema_df.select("name").to_series().to_list()
            except Exception:
                # Fallback to None if unable to get info fields
                all_info_fields = None

            # Always start with all info fields to establish full schema
            # The callback will re-register with only requested info fields for optimization
            initial_info_fields = all_info_fields

        zero_based = _resolve_zero_based(use_zero_based)
        vcf_read_options = VcfReadOptions(
            info_fields=initial_info_fields,
            format_fields=format_fields,
            object_storage_options=object_storage_options,
            zero_based=zero_based,
        )
        read_options = ReadOptions(vcf_read_options=vcf_read_options)
        return _read_file(
            path,
            InputFormat.Vcf,
            read_options,
            projection_pushdown,
            predicate_pushdown,
            zero_based=zero_based,
        )

    @staticmethod
    def read_gff(
        path: str,
        attr_fields: Union[list[str], None] = None,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = True,
        predicate_pushdown: bool = True,
        use_zero_based: Optional[bool] = None,
    ) -> pl.DataFrame:
        """
        Read a GFF file into a DataFrame.

        Parameters:
            path: The path to the GFF file.
            attr_fields: List of attribute field names to extract as separate columns. If *None*, attributes will be kept as a nested structure. Use this to extract specific attributes like 'ID', 'gene_name', 'gene_type', etc. as direct columns for easier access.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the GFF file. If not specified, it will be detected automatically.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
            predicate_pushdown: Enable predicate pushdown using index files (TBI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.gff.gz.tbi`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
            use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

        !!! note
            By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.
        """
        lf = IOOperations.scan_gff(
            path,
            attr_fields,
            chunk_size,
            concurrent_fetches,
            allow_anonymous,
            enable_request_payer,
            max_retries,
            timeout,
            compression_type,
            projection_pushdown,
            predicate_pushdown,
            use_zero_based,
        )
        # Get metadata before collecting (polars-config-meta doesn't preserve through collect)
        zero_based = lf.config_meta.get_metadata().get("coordinate_system_zero_based")
        df = lf.collect()
        # Set metadata on the collected DataFrame
        if zero_based is not None:
            set_coordinate_system(df, zero_based)
        return df

    @staticmethod
    def scan_gff(
        path: str,
        attr_fields: Union[list[str], None] = None,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = True,
        predicate_pushdown: bool = True,
        use_zero_based: Optional[bool] = None,
    ) -> pl.LazyFrame:
        """
        Lazily read a GFF file into a LazyFrame.

        Parameters:
            path: The path to the GFF file.
            attr_fields: List of attribute field names to extract as separate columns. If *None*, attributes will be kept as a nested structure. Use this to extract specific attributes like 'ID', 'gene_name', 'gene_type', etc. as direct columns for easier access.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large-scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the GFF file. If not specified, it will be detected automatically.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
            predicate_pushdown: Enable predicate pushdown using index files (TBI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.gff.gz.tbi`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
            use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

        !!! note
            By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        zero_based = _resolve_zero_based(use_zero_based)
        gff_read_options = GffReadOptions(
            attr_fields=attr_fields,
            object_storage_options=object_storage_options,
            zero_based=zero_based,
        )
        read_options = ReadOptions(gff_read_options=gff_read_options)
        return _read_file(
            path,
            InputFormat.Gff,
            read_options,
            projection_pushdown,
            predicate_pushdown,
            zero_based=zero_based,
        )

    @staticmethod
    def read_bam(
        path: str,
        tag_fields: Union[list[str], None] = None,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        projection_pushdown: bool = True,
        predicate_pushdown: bool = True,
        use_zero_based: Optional[bool] = None,
    ) -> pl.DataFrame:
        """
        Read a BAM file into a DataFrame.

        !!! hint "Parallelism & Indexed Reads"
            Indexed parallel reads and predicate pushdown are automatic when a BAI/CSI index
            is present. See [File formats support](/polars-bio/features/#file-formats-support),
            [Indexed reads](/polars-bio/features/#indexed-reads-predicate-pushdown),
            and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details.

        Parameters:
            path: The path to the BAM file.
            tag_fields: List of BAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large-scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large-scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
            predicate_pushdown: Enable predicate pushdown using index files (BAI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.bam.bai`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
            use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

        !!! note
            By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.
        """
        lf = IOOperations.scan_bam(
            path,
            tag_fields,
            chunk_size,
            concurrent_fetches,
            allow_anonymous,
            enable_request_payer,
            max_retries,
            timeout,
            projection_pushdown,
            predicate_pushdown,
            use_zero_based,
        )
        # Get metadata before collecting (polars-config-meta doesn't preserve through collect)
        zero_based = lf.config_meta.get_metadata().get("coordinate_system_zero_based")
        df = lf.collect()
        # Set metadata on the collected DataFrame
        if zero_based is not None:
            set_coordinate_system(df, zero_based)
        return df

    @staticmethod
    def scan_bam(
        path: str,
        tag_fields: Union[list[str], None] = None,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        projection_pushdown: bool = True,
        predicate_pushdown: bool = True,
        use_zero_based: Optional[bool] = None,
    ) -> pl.LazyFrame:
        """
        Lazily read a BAM file into a LazyFrame.

        !!! hint "Parallelism & Indexed Reads"
            Indexed parallel reads and predicate pushdown are automatic when a BAI/CSI index
            is present. See [File formats support](/polars-bio/features/#file-formats-support),
            [Indexed reads](/polars-bio/features/#indexed-reads-predicate-pushdown),
            and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details.

        Parameters:
            path: The path to the BAM file.
            tag_fields: List of BAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
            predicate_pushdown: Enable predicate pushdown using index files (BAI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.bam.bai`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
            use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

        !!! note
            By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type="auto",
        )

        zero_based = _resolve_zero_based(use_zero_based)
        bam_read_options = BamReadOptions(
            object_storage_options=object_storage_options,
            zero_based=zero_based,
            tag_fields=tag_fields,
        )
        read_options = ReadOptions(bam_read_options=bam_read_options)
        return _read_file(
            path,
            InputFormat.Bam,
            read_options,
            projection_pushdown,
            predicate_pushdown,
            zero_based=zero_based,
        )

    @staticmethod
    def read_cram(
        path: str,
        reference_path: str = None,
        tag_fields: Union[list[str], None] = None,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        projection_pushdown: bool = True,
        predicate_pushdown: bool = True,
        use_zero_based: Optional[bool] = None,
    ) -> pl.DataFrame:
        """
        Read a CRAM file into a DataFrame.

        !!! hint "Parallelism & Indexed Reads"
            Indexed parallel reads and predicate pushdown are automatic when a CRAI index
            is present. See [File formats support](/polars-bio/features/#file-formats-support),
            [Indexed reads](/polars-bio/features/#indexed-reads-predicate-pushdown),
            and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details.

        Parameters:
            path: The path to the CRAM file (local or cloud storage: S3, GCS, Azure Blob).
            reference_path: Optional path to external FASTA reference file (**local path only**, cloud storage not supported). If not provided, the CRAM file must contain embedded reference sequences. The FASTA file must have an accompanying index file (.fai) in the same directory. Create the index using: `samtools faidx reference.fasta`
            tag_fields: List of CRAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries: The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            projection_pushdown: Enable column projection pushdown optimization. When True, only requested columns are processed at the DataFusion execution level, improving performance and reducing memory usage.
            predicate_pushdown: Enable predicate pushdown using index files (CRAI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.cram.crai`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
            use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

        !!! note
            By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.

        !!! warning "Known Limitation: MD and NM Tags"
            Due to a limitation in the underlying noodles-cram library, **MD (mismatch descriptor) and NM (edit distance) tags are not accessible** from CRAM files, even when stored in the file. These tags can be seen with samtools but are not exposed through the noodles-cram record.data() interface.

            Other optional tags (RG, MQ, AM, OQ, etc.) work correctly. This issue is tracked at: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

            **Workaround**: Use BAM format if MD/NM tags are required for your analysis.

        !!! example "Using External Reference"
            ```python
            import polars_bio as pb

            # Read CRAM with external reference
            df = pb.read_cram(
                "/path/to/file.cram",
                reference_path="/path/to/reference.fasta"
            )
            ```

        !!! example "Public CRAM File Example"
            Download and read a public CRAM file from 42basepairs:
            ```bash
            # Download the CRAM file and reference
            wget https://42basepairs.com/download/s3/gatk-test-data/wgs_cram/NA12878_20k_hg38/NA12878.cram
            wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta

            # Create FASTA index (required)
            samtools faidx Homo_sapiens_assembly38.fasta
            ```

            ```python
            import polars_bio as pb

            # Read first 5 reads from the CRAM file
            df = pb.scan_cram(
                "NA12878.cram",
                reference_path="Homo_sapiens_assembly38.fasta"
            ).limit(5).collect()

            print(df.select(["name", "chrom", "start", "end", "cigar"]))
            ```

        !!! example "Creating CRAM with Embedded Reference"
            To create a CRAM file with embedded reference using samtools:
            ```bash
            samtools view -C -o output.cram --output-fmt-option embed_ref=1 input.bam
            ```

        Returns:
            A Polars DataFrame with the following schema:
                - name: Read name (String)
                - chrom: Chromosome/contig name (String)
                - start: Alignment start position, 1-based (UInt32)
                - end: Alignment end position, 1-based (UInt32)
                - flags: SAM flags (UInt32)
                - cigar: CIGAR string (String)
                - mapping_quality: Mapping quality (UInt32)
                - mate_chrom: Mate chromosome/contig name (String)
                - mate_start: Mate alignment start position, 1-based (UInt32)
                - sequence: Read sequence (String)
                - quality_scores: Base quality scores (String)
        """
        lf = IOOperations.scan_cram(
            path,
            reference_path,
            tag_fields,
            chunk_size,
            concurrent_fetches,
            allow_anonymous,
            enable_request_payer,
            max_retries,
            timeout,
            projection_pushdown,
            predicate_pushdown,
            use_zero_based,
        )
        # Get metadata before collecting (polars-config-meta doesn't preserve through collect)
        zero_based = lf.config_meta.get_metadata().get("coordinate_system_zero_based")
        df = lf.collect()
        # Set metadata on the collected DataFrame
        if zero_based is not None:
            set_coordinate_system(df, zero_based)
        return df

    @staticmethod
    def scan_cram(
        path: str,
        reference_path: str = None,
        tag_fields: Union[list[str], None] = None,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        projection_pushdown: bool = True,
        predicate_pushdown: bool = True,
        use_zero_based: Optional[bool] = None,
    ) -> pl.LazyFrame:
        """
        Lazily read a CRAM file into a LazyFrame.

        !!! hint "Parallelism & Indexed Reads"
            Indexed parallel reads and predicate pushdown are automatic when a CRAI index
            is present. See [File formats support](/polars-bio/features/#file-formats-support),
            [Indexed reads](/polars-bio/features/#indexed-reads-predicate-pushdown),
            and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details.

        Parameters:
            path: The path to the CRAM file (local or cloud storage: S3, GCS, Azure Blob).
            reference_path: Optional path to external FASTA reference file (**local path only**, cloud storage not supported). If not provided, the CRAM file must contain embedded reference sequences. The FASTA file must have an accompanying index file (.fai) in the same directory. Create the index using: `samtools faidx reference.fasta`
            tag_fields: List of CRAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries: The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            projection_pushdown: Enable column projection pushdown optimization. When True, only requested columns are processed at the DataFusion execution level, improving performance and reducing memory usage.
            predicate_pushdown: Enable predicate pushdown using index files (CRAI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.cram.crai`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
            use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

        !!! note
            By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.

        !!! warning "Known Limitation: MD and NM Tags"
            Due to a limitation in the underlying noodles-cram library, **MD (mismatch descriptor) and NM (edit distance) tags are not accessible** from CRAM files, even when stored in the file. These tags can be seen with samtools but are not exposed through the noodles-cram record.data() interface.

            Other optional tags (RG, MQ, AM, OQ, etc.) work correctly. This issue is tracked at: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

            **Workaround**: Use BAM format if MD/NM tags are required for your analysis.

        !!! example "Using External Reference"
            ```python
            import polars_bio as pb

            # Lazy scan CRAM with external reference
            lf = pb.scan_cram(
                "/path/to/file.cram",
                reference_path="/path/to/reference.fasta"
            )

            # Apply transformations and collect
            df = lf.filter(pl.col("chrom") == "chr1").collect()
            ```

        !!! example "Public CRAM File Example"
            Download and read a public CRAM file from 42basepairs:
            ```bash
            # Download the CRAM file and reference
            wget https://42basepairs.com/download/s3/gatk-test-data/wgs_cram/NA12878_20k_hg38/NA12878.cram
            wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta

            # Create FASTA index (required)
            samtools faidx Homo_sapiens_assembly38.fasta
            ```

            ```python
            import polars_bio as pb
            import polars as pl

            # Lazy scan and filter for chromosome 20 reads
            df = pb.scan_cram(
                "NA12878.cram",
                reference_path="Homo_sapiens_assembly38.fasta"
            ).filter(
                pl.col("chrom") == "chr20"
            ).select(
                ["name", "chrom", "start", "end", "mapping_quality"]
            ).limit(10).collect()

            print(df)
            ```

        !!! example "Creating CRAM with Embedded Reference"
            To create a CRAM file with embedded reference using samtools:
            ```bash
            samtools view -C -o output.cram --output-fmt-option embed_ref=1 input.bam
            ```

        Returns:
            A Polars LazyFrame with the following schema:
                - name: Read name (String)
                - chrom: Chromosome/contig name (String)
                - start: Alignment start position, 1-based (UInt32)
                - end: Alignment end position, 1-based (UInt32)
                - flags: SAM flags (UInt32)
                - cigar: CIGAR string (String)
                - mapping_quality: Mapping quality (UInt32)
                - mate_chrom: Mate chromosome/contig name (String)
                - mate_start: Mate alignment start position, 1-based (UInt32)
                - sequence: Read sequence (String)
                - quality_scores: Base quality scores (String)
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type="auto",
        )

        zero_based = _resolve_zero_based(use_zero_based)
        cram_read_options = CramReadOptions(
            reference_path=reference_path,
            object_storage_options=object_storage_options,
            zero_based=zero_based,
            tag_fields=tag_fields,
        )
        read_options = ReadOptions(cram_read_options=cram_read_options)
        return _read_file(
            path,
            InputFormat.Cram,
            read_options,
            projection_pushdown,
            predicate_pushdown,
            zero_based=zero_based,
        )

    @staticmethod
    def describe_bam(
        path: str,
        sample_size: int = 100,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        use_zero_based: Optional[bool] = None,
    ) -> pl.DataFrame:
        """
        Get schema information for a BAM file with automatic tag discovery.

        Samples the first N records to discover all available tags and their types.
        Returns detailed schema information including column names, data types,
        nullability, category (core/tag), SAM type, and descriptions.

        Parameters:
            path: The path to the BAM file.
            sample_size: Number of records to sample for tag discovery (default: 100).
                Use higher values for more comprehensive tag discovery.
            chunk_size: The size in MB of a chunk when reading from object storage.
            concurrent_fetches: The number of concurrent fetches when reading from object storage.
            allow_anonymous: Whether to allow anonymous access to object storage.
            enable_request_payer: Whether to enable request payer for object storage.
            max_retries: The maximum number of retries for reading the file.
            timeout: The timeout in seconds for reading the file.
            compression_type: The compression type of the file. If "auto" (default), compression is detected automatically.
            use_zero_based: If True, output 0-based coordinates. If False, 1-based coordinates.

        Returns:
            DataFrame with columns:
            - column_name: Name of the column/field
            - data_type: Arrow data type (e.g., "Utf8", "Int32")
            - nullable: Whether the field can be null
            - category: "core" for fixed columns, "tag" for optional SAM tags
            - sam_type: SAM type code (e.g., "Z", "i") for tags, null for core columns
            - description: Human-readable description of the field

        Example:
            ```python
            import polars_bio as pb

            # Auto-discover all tags present in the file
            schema = pb.describe_bam("file.bam", sample_size=100)
            print(schema)
            # Output:
            # shape: (15, 6)
            # ┌─────────────┬───────────┬──────────┬──────────┬──────────┬──────────────────────┐
            # │ column_name ┆ data_type ┆ nullable ┆ category ┆ sam_type ┆ description          │
            # │ ---         ┆ ---       ┆ ---      ┆ ---      ┆ ---      ┆ ---                  │
            # │ str         ┆ str       ┆ bool     ┆ str      ┆ str      ┆ str                  │
            # ╞═════════════╪═══════════╪══════════╪══════════╪══════════╪══════════════════════╡
            # │ name        ┆ Utf8      ┆ true     ┆ core     ┆ null     ┆ Query name           │
            # │ chrom       ┆ Utf8      ┆ true     ┆ core     ┆ null     ┆ Reference name       │
            # │ ...         ┆ ...       ┆ ...      ┆ ...      ┆ ...      ┆ ...                  │
            # │ NM          ┆ Int32     ┆ true     ┆ tag      ┆ i        ┆ Edit distance        │
            # │ AS          ┆ Int32     ┆ true     ┆ tag      ┆ i        ┆ Alignment score      │
            # └─────────────┴───────────┴──────────┴──────────┴──────────┴──────────────────────┘
            ```
        """
        # Build object storage options
        object_storage_options = PyObjectStorageOptions(
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        # Resolve zero_based setting
        zero_based = _resolve_zero_based(use_zero_based)

        # Call Rust function with tag auto-discovery (tag_fields=None)
        df = py_describe_bam(
            ctx,  # PyBioSessionContext
            path,
            object_storage_options,
            zero_based,
            None,  # tag_fields=None enables auto-discovery
            sample_size,
        )

        # Convert DataFusion DataFrame to Polars DataFrame
        return pl.from_arrow(df.to_arrow_table())

    @staticmethod
    def describe_cram(
        path: str,
        reference_path: str = None,
        sample_size: int = 100,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        use_zero_based: Optional[bool] = None,
    ) -> pl.DataFrame:
        """
        Get schema information for a CRAM file with automatic tag discovery.

        Samples the first N records to discover all available tags and their types.
        Returns detailed schema information including column names, data types,
        nullability, category (core/tag), SAM type, and descriptions.

        Parameters:
            path: The path to the CRAM file.
            reference_path: Optional path to external FASTA reference file.
            sample_size: Number of records to sample for tag discovery (default: 100).
            chunk_size: The size in MB of a chunk when reading from object storage.
            concurrent_fetches: The number of concurrent fetches when reading from object storage.
            allow_anonymous: Whether to allow anonymous access to object storage.
            enable_request_payer: Whether to enable request payer for object storage.
            max_retries: The maximum number of retries for reading the file.
            timeout: The timeout in seconds for reading the file.
            compression_type: The compression type of the file. If "auto" (default), compression is detected automatically.
            use_zero_based: If True, output 0-based coordinates. If False, 1-based coordinates.

        Returns:
            DataFrame with columns:
            - column_name: Name of the column/field
            - data_type: Arrow data type (e.g., "Utf8", "Int32")
            - nullable: Whether the field can be null
            - category: "core" for fixed columns, "tag" for optional SAM tags
            - sam_type: SAM type code (e.g., "Z", "i") for tags, null for core columns
            - description: Human-readable description of the field

        !!! warning "Known Limitation: MD and NM Tags"
            Due to a limitation in the underlying noodles-cram library, **MD (mismatch descriptor) and NM (edit distance) tags are not discoverable** from CRAM files, even when stored. Automatic tag discovery will not include MD/NM tags. Other optional tags (RG, MQ, AM, OQ, etc.) are discovered correctly. See: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

        Example:
            ```python
            import polars_bio as pb

            # Auto-discover all tags present in the file
            schema = pb.describe_cram("file.cram", sample_size=100)
            print(schema)

            # Filter to see only tag columns
            tags = schema.filter(schema["category"] == "tag")
            print(tags["column_name"])
            ```
        """
        # Build object storage options
        object_storage_options = PyObjectStorageOptions(
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        # Resolve zero_based setting
        zero_based = _resolve_zero_based(use_zero_based)

        # Call Rust function with tag auto-discovery (tag_fields=None)
        df = py_describe_cram(
            ctx,
            path,
            reference_path,
            object_storage_options,
            zero_based,
            None,  # tag_fields=None enables auto-discovery
            sample_size,
        )

        # Convert DataFusion DataFrame to Polars DataFrame
        return pl.from_arrow(df.to_arrow_table())

    @staticmethod
    def read_fastq(
        path: str,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = True,
    ) -> pl.DataFrame:
        """
        Read a FASTQ file into a DataFrame.

        !!! hint "Parallelism & Compression"
            See [File formats support](/polars-bio/features/#file-formats-support),
            [Compression](/polars-bio/features/#compression),
            and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details on parallel reads and supported compression types.

        Parameters:
            path: The path to the FASTQ file.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries: The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
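
        !!! Example "Reading a FASTQ file"
            A minimal sketch; the file path is illustrative:
            ```python
            import polars_bio as pb

            # Eagerly read a (optionally gzip/BGZF-compressed) FASTQ file
            df = pb.read_fastq("reads.fastq.gz")
            print(df.head())
            ```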
        """
        return IOOperations.scan_fastq(
            path,
            chunk_size,
            concurrent_fetches,
            allow_anonymous,
            enable_request_payer,
            max_retries,
            timeout,
            compression_type,
            projection_pushdown,
        ).collect()

    @staticmethod
    def scan_fastq(
        path: str,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = True,
    ) -> pl.LazyFrame:
        """
        Lazily read a FASTQ file into a LazyFrame.

        !!! hint "Parallelism & Compression"
            See [File formats support](/polars-bio/features/#file-formats-support),
            [Compression](/polars-bio/features/#compression),
            and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details on parallel reads and supported compression types.

        Parameters:
            path: The path to the FASTQ file.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries: The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
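
        !!! Example "Lazy scan with projection"
            A minimal sketch; the file path is illustrative, and the selected columns
            follow the FASTQ schema (name, sequence, quality_scores) used elsewhere in this module:
            ```python
            import polars_bio as pb

            lf = pb.scan_fastq("reads.fastq.gz")
            # Only the selected columns are read thanks to projection pushdown
            df = lf.select(["name", "sequence"]).limit(100).collect()
            ```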
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        fastq_read_options = FastqReadOptions(
            object_storage_options=object_storage_options,
        )
        read_options = ReadOptions(fastq_read_options=fastq_read_options)
        return _read_file(path, InputFormat.Fastq, read_options, projection_pushdown)

    @staticmethod
    def read_bed(
        path: str,
        thread_num: int = 1,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = True,
        use_zero_based: Optional[bool] = None,
    ) -> pl.DataFrame:
        """
        Read a BED file into a DataFrame.

        Parameters:
            path: The path to the BED file.
            thread_num: The number of threads to use for reading the BED file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries: The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the BED file. If not specified, it will be detected automatically based on the file extension. BGZF compression is supported ('bgz').
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
            use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

        !!! Note
            Only **BED4** format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name.
            Also, unlike other text formats, **GZIP** compression is not supported.

        !!! note
            By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.
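
        !!! Example "Reading a BED4 file"
            A minimal sketch; the file path is illustrative:
            ```python
            import polars_bio as pb

            # Read a BED4 file (chromosome, start, end, name) into a DataFrame
            df = pb.read_bed("regions.bed")
            print(df.head())
            ```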
        """
        lf = IOOperations.scan_bed(
            path,
            thread_num,
            chunk_size,
            concurrent_fetches,
            allow_anonymous,
            enable_request_payer,
            max_retries,
            timeout,
            compression_type,
            projection_pushdown,
            use_zero_based,
        )
        # Get metadata before collecting (polars-config-meta doesn't preserve through collect)
        zero_based = lf.config_meta.get_metadata().get("coordinate_system_zero_based")
        df = lf.collect()
        # Set metadata on the collected DataFrame
        if zero_based is not None:
            set_coordinate_system(df, zero_based)
        return df

    @staticmethod
    def scan_bed(
        path: str,
        thread_num: int = 1,
        chunk_size: int = 8,
        concurrent_fetches: int = 1,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        max_retries: int = 5,
        timeout: int = 300,
        compression_type: str = "auto",
        projection_pushdown: bool = True,
        use_zero_based: Optional[bool] = None,
    ) -> pl.LazyFrame:
        """
        Lazily read a BED file into a LazyFrame.

        Parameters:
            path: The path to the BED file.
            thread_num: The number of threads to use for reading the BED file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries: The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            compression_type: The compression type of the BED file. If not specified, it will be detected automatically based on the file extension. BGZF compression is supported ('bgz').
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
            use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

        !!! Note
            Only **BED4** format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name.
            Also, unlike other text formats, **GZIP** compression is not supported.

        !!! note
            By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.
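
        !!! Example "Lazy scan with filtering"
            A minimal sketch; the file path and chromosome value are illustrative,
            and the column name `chrom` is assumed to match the other readers in this module:
            ```python
            import polars as pl
            import polars_bio as pb

            lf = pb.scan_bed("regions.bed.bgz")
            df = lf.filter(pl.col("chrom") == "chr1").collect()
            ```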
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        zero_based = _resolve_zero_based(use_zero_based)
        bed_read_options = BedReadOptions(
            thread_num=thread_num,
            object_storage_options=object_storage_options,
            zero_based=zero_based,
        )
        read_options = ReadOptions(bed_read_options=bed_read_options)
        return _read_file(
            path,
            InputFormat.Bed,
            read_options,
            projection_pushdown,
            zero_based=zero_based,
        )

    @staticmethod
    def read_table(path: str, schema: Dict = None, **kwargs) -> pl.DataFrame:
        """
        Read a tab-delimited (e.g. BED) file into a Polars DataFrame.
        Tries to be compatible with Bioframe's [read_table](https://bioframe.readthedocs.io/en/latest/guide-io.html)
        but faster. The schema should follow Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).

        Parameters:
            path: The path to the file.
            schema: Schema should follow the Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).
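
        !!! Example "Reading with a Bioframe schema"
            A minimal sketch; the file path is illustrative and the schema name `"bed4"`
            is assumed to be one of the Bioframe-style schemas supported by this reader:
            ```python
            import polars_bio as pb

            # Column names are taken from the Bioframe schema definition
            df = pb.read_table("annotations.tsv", schema="bed4")
            print(df.head())
            ```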
        """
        return IOOperations.scan_table(path, schema, **kwargs).collect()

    @staticmethod
    def scan_table(path: str, schema: Dict = None, **kwargs) -> pl.LazyFrame:
        """
        Lazily read a tab-delimited (e.g. BED) file into a Polars LazyFrame.
        Tries to be compatible with Bioframe's [read_table](https://bioframe.readthedocs.io/en/latest/guide-io.html)
        but faster and lazy. The schema should follow Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).

        Parameters:
            path: The path to the file.
            schema: Schema should follow the Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).
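
        !!! Example "Lazy scan of a tab-delimited file"
            A minimal sketch; the file path is illustrative. Without a schema, columns
            keep Polars' default names (`column_1`, `column_2`, ...):
            ```python
            import polars_bio as pb

            lf = pb.scan_table("annotations.tsv")
            df = lf.limit(10).collect()
            print(df.columns)
            ```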
        """
        df = pl.scan_csv(path, separator="\t", has_header=False, **kwargs)
        if schema is not None:
            columns = SCHEMAS[schema]
            if len(columns) != len(df.collect_schema()):
                raise ValueError(
                    f"Schema incompatible with the input. Expected {len(columns)} columns in a schema, got {len(df.collect_schema())} in the input data file. Please provide a valid schema."
                )
            for i, c in enumerate(columns):
                df = df.rename({f"column_{i+1}": c})
        return df

    @staticmethod
    def describe_vcf(
        path: str,
        allow_anonymous: bool = True,
        enable_request_payer: bool = False,
        compression_type: str = "auto",
    ) -> pl.DataFrame:
        """
        Describe VCF INFO schema.

        Parameters:
            path: The path to the VCF file.
            allow_anonymous: Whether to allow anonymous access to object storage (GCS and S3 supported).
            enable_request_payer: Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
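
        Example:
            A minimal sketch; the file path is illustrative:
            ```python
            import polars_bio as pb

            # Inspect the INFO field schema of a VCF file
            info_schema = pb.describe_vcf("example.vcf.gz")
            print(info_schema)
            ```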
        """
        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=8,
            concurrent_fetches=1,
            max_retries=1,
            timeout=10,
            compression_type=compression_type,
        )
        return py_describe_vcf(ctx, path, object_storage_options).to_polars()

    @staticmethod
    def from_polars(name: str, df: Union[pl.DataFrame, pl.LazyFrame]) -> None:
        """
        Register a Polars DataFrame as a DataFusion table.

        Parameters:
            name: The name of the table.
            df: The Polars DataFrame.
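
        !!! Example "Registering a DataFrame"
            A minimal sketch with inline data; the table name is arbitrary:
            ```python
            import polars as pl
            import polars_bio as pb

            df = pl.DataFrame({"chrom": ["chr1"], "start": [100], "end": [200]})
            # The DataFrame becomes available as a DataFusion table named "my_intervals"
            pb.from_polars("my_intervals", df)
            ```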
        """
        reader = (
            df.to_arrow()
            if isinstance(df, pl.DataFrame)
            else df.collect().to_arrow().to_reader()
        )
        py_from_polars(ctx, name, reader)

    @staticmethod
    def write_vcf(
        df: Union[pl.DataFrame, pl.LazyFrame],
        path: str,
    ) -> int:
        """
        Write a DataFrame to VCF format.

        Coordinate system is automatically read from DataFrame metadata (set during
        read_vcf). Compression is auto-detected from the file extension.

        Parameters:
            df: The DataFrame or LazyFrame to write.
            path: The output file path. Compression is auto-detected from extension
                  (.vcf.bgz for BGZF, .vcf.gz for GZIP, .vcf for uncompressed).

        Returns:
            The number of rows written.

        !!! Example "Writing VCF files"
            ```python
            import polars_bio as pb

            # Read a VCF file
            df = pb.read_vcf("input.vcf")

            # Write to uncompressed VCF
            pb.write_vcf(df, "output.vcf")

            # Write to BGZF-compressed VCF
            pb.write_vcf(df, "output.vcf.bgz")

            # Write to GZIP-compressed VCF
            pb.write_vcf(df, "output.vcf.gz")
            ```
        """
        return _write_file(df, path, OutputFormat.Vcf)

    @staticmethod
    def sink_vcf(
        lf: pl.LazyFrame,
        path: str,
    ) -> None:
        """
        Streaming write a LazyFrame to VCF format.

        This method executes the LazyFrame immediately and writes the results
        to the specified path. Unlike `write_vcf`, it doesn't return the row count.

        Coordinate system is automatically read from LazyFrame metadata (set during
        scan_vcf). Compression is auto-detected from the file extension.

        Parameters:
            lf: The LazyFrame to write.
            path: The output file path. Compression is auto-detected from extension
                  (.vcf.bgz for BGZF, .vcf.gz for GZIP, .vcf for uncompressed).

        !!! Example "Streaming write VCF"
            ```python
            import polars_bio as pb

            # Lazy read and filter, then sink to VCF
            lf = pb.scan_vcf("large_input.vcf").filter(pl.col("qual") > 30)
            pb.sink_vcf(lf, "filtered_output.vcf.bgz")
            ```
        """
        _write_file(lf, path, OutputFormat.Vcf)

    @staticmethod
    def write_fastq(
        df: Union[pl.DataFrame, pl.LazyFrame],
        path: str,
    ) -> int:
        """
        Write a DataFrame to FASTQ format.

        Compression is auto-detected from the file extension.

        Parameters:
            df: The DataFrame or LazyFrame to write. Must have columns:
                - name: Read name/identifier
                - sequence: DNA sequence
                - quality_scores: Quality scores string
                Optional: description (added after name on header line)
            path: The output file path. Compression is auto-detected from extension
                  (.fastq.bgz for BGZF, .fastq.gz for GZIP, .fastq for uncompressed).

        Returns:
            The number of rows written.

        !!! Example "Writing FASTQ files"
            ```python
            import polars_bio as pb

            # Read a FASTQ file
            df = pb.read_fastq("input.fastq")

            # Write to uncompressed FASTQ
            pb.write_fastq(df, "output.fastq")

            # Write to GZIP-compressed FASTQ
            pb.write_fastq(df, "output.fastq.gz")
            ```
        """
        return _write_file(df, path, OutputFormat.Fastq)

    @staticmethod
    def sink_fastq(
        lf: pl.LazyFrame,
        path: str,
    ) -> None:
        """
        Streaming write a LazyFrame to FASTQ format.

        Compression is auto-detected from the file extension.

        Parameters:
            lf: The LazyFrame to write.
            path: The output file path. Compression is auto-detected from extension
                  (.fastq.bgz for BGZF, .fastq.gz for GZIP, .fastq for uncompressed).

        !!! Example "Streaming write FASTQ"
            ```python
            import polars_bio as pb

            # Lazy read, filter by quality, then sink
            lf = pb.scan_fastq("large_input.fastq.gz")
            pb.sink_fastq(lf.limit(1000), "sample_output.fastq")
            ```
        """
        _write_file(lf, path, OutputFormat.Fastq)

    @staticmethod
    def write_bam(
        df: Union[pl.DataFrame, pl.LazyFrame],
        path: str,
        sort_on_write: bool = False,
    ) -> int:
        """
        Write a DataFrame to BAM/SAM format.

        Compression is auto-detected from file extension:
        - .sam → Uncompressed SAM (plain text)
        - .bam → BGZF-compressed BAM

        For CRAM format, use `write_cram()` instead.

        Parameters:
            df: DataFrame or LazyFrame with 11 core BAM columns + optional tag columns
            path: Output file path (.bam or .sam)
            sort_on_write: If True, sort records by (chrom, start) and set header SO:coordinate.
                If False (default), set header SO:unsorted.

        Returns:
            Number of rows written

        !!! Example "Write BAM files"
            ```python
            import polars_bio as pb
            df = pb.read_bam("input.bam", tag_fields=["NM", "AS"])
            pb.write_bam(df, "output.bam")
            pb.write_bam(df, "output.sam")
            ```
        """
        return _write_bam_file(
            df, path, OutputFormat.Bam, None, sort_on_write=sort_on_write
        )

    @staticmethod
    def sink_bam(
        lf: pl.LazyFrame,
        path: str,
        sort_on_write: bool = False,
    ) -> None:
        """
        Streaming write a LazyFrame to BAM/SAM format.

        For CRAM format, use `sink_cram()` instead.

        Parameters:
            lf: LazyFrame to write
            path: Output file path (.bam or .sam)
            sort_on_write: If True, sort records by (chrom, start) and set header SO:coordinate.
                If False (default), set header SO:unsorted.

        !!! Example "Streaming write BAM"
            ```python
            import polars_bio as pb
            lf = pb.scan_bam("input.bam").filter(pl.col("mapping_quality") > 20)
            pb.sink_bam(lf, "filtered.bam")
            ```
        """
        _write_bam_file(lf, path, OutputFormat.Bam, None, sort_on_write=sort_on_write)

    @staticmethod
    def read_sam(
        path: str,
        tag_fields: Union[list[str], None] = None,
        projection_pushdown: bool = True,
        use_zero_based: Optional[bool] = None,
    ) -> pl.DataFrame:
        """
        Read a SAM file into a DataFrame.

        SAM (Sequence Alignment/Map) is the plain-text counterpart of BAM.
        This function reuses the BAM reader, which auto-detects the format
        from the file extension.

        Parameters:
            path: The path to the SAM file.
            tag_fields: List of SAM tag names to include as columns (e.g., ["NM", "MD", "AS"]).
                If None, no optional tags are parsed (default).
            projection_pushdown: Enable column projection pushdown to optimize query performance.
            use_zero_based: If True, output 0-based half-open coordinates.
                If False, output 1-based closed coordinates.
                If None (default), uses the global configuration.

        !!! note
            By default, coordinates are output in **1-based closed** format.
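
        !!! Example "Reading a SAM file"
            A minimal sketch; the file path is illustrative and the requested tag must be
            present in the file:
            ```python
            import polars_bio as pb

            df = pb.read_sam("example.sam", tag_fields=["NM"])
            print(df.head())
            ```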
        """
        lf = IOOperations.scan_sam(
            path,
            tag_fields,
            projection_pushdown,
            use_zero_based,
        )
        zero_based = lf.config_meta.get_metadata().get("coordinate_system_zero_based")
        df = lf.collect()
        if zero_based is not None:
            set_coordinate_system(df, zero_based)
        return df

    @staticmethod
    def scan_sam(
        path: str,
        tag_fields: Union[list[str], None] = None,
        projection_pushdown: bool = True,
        use_zero_based: Optional[bool] = None,
    ) -> pl.LazyFrame:
        """
        Lazily read a SAM file into a LazyFrame.

        SAM (Sequence Alignment/Map) is the plain-text counterpart of BAM.
        This function reuses the BAM reader, which auto-detects the format
        from the file extension.

        Parameters:
            path: The path to the SAM file.
            tag_fields: List of SAM tag names to include as columns (e.g., ["NM", "MD", "AS"]).
                If None, no optional tags are parsed (default).
            projection_pushdown: Enable column projection pushdown to optimize query performance.
            use_zero_based: If True, output 0-based half-open coordinates.
                If False, output 1-based closed coordinates.
                If None (default), uses the global configuration.

        !!! note
            By default, coordinates are output in **1-based closed** format.
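
        !!! Example "Lazy scan with filtering"
            A minimal sketch; the file path is illustrative, and the `mapping_quality`
            column is assumed to follow the BAM/CRAM schema shown above:
            ```python
            import polars as pl
            import polars_bio as pb

            lf = pb.scan_sam("example.sam")
            df = lf.filter(pl.col("mapping_quality") > 30).collect()
            ```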
        """
        zero_based = _resolve_zero_based(use_zero_based)
        bam_read_options = BamReadOptions(
            zero_based=zero_based,
            tag_fields=tag_fields,
        )
        read_options = ReadOptions(bam_read_options=bam_read_options)
        return _read_file(
            path,
            InputFormat.Sam,
            read_options,
            projection_pushdown,
            zero_based=zero_based,
        )

    @staticmethod
    def describe_sam(
        path: str,
        sample_size: int = 100,
        use_zero_based: Optional[bool] = None,
    ) -> pl.DataFrame:
        """
        Get schema information for a SAM file with automatic tag discovery.

        Samples the first N records to discover all available tags and their types.
        Reuses the BAM describe logic, which auto-detects SAM from the file extension.

        Parameters:
            path: The path to the SAM file.
            sample_size: Number of records to sample for tag discovery (default: 100).
            use_zero_based: If True, output 0-based coordinates. If False, 1-based coordinates.

        Returns:
            DataFrame with columns: column_name, data_type, nullable, category, sam_type, description
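
        Example:
            A minimal sketch; the file path is illustrative:
            ```python
            import polars_bio as pb

            schema = pb.describe_sam("example.sam", sample_size=200)
            print(schema)
            ```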
        """
        zero_based = _resolve_zero_based(use_zero_based)

        df = py_describe_bam(
            ctx,
            path,
            None,
            zero_based,
            None,
            sample_size,
        )

        return pl.from_arrow(df.to_arrow_table())

    @staticmethod
    def write_sam(
        df: Union[pl.DataFrame, pl.LazyFrame],
        path: str,
        sort_on_write: bool = False,
    ) -> int:
        """
        Write a DataFrame to SAM format (plain text).

        Parameters:
            df: DataFrame or LazyFrame with 11 core BAM/SAM columns + optional tag columns
            path: Output file path (.sam)
            sort_on_write: If True, sort records by (chrom, start) and set header SO:coordinate.
                If False (default), set header SO:unsorted.

        Returns:
            Number of rows written

        !!! Example "Write SAM files"
            ```python
            import polars_bio as pb
            df = pb.read_bam("input.bam", tag_fields=["NM", "AS"])
            pb.write_sam(df, "output.sam")
            ```
        """
        return _write_bam_file(
            df, path, OutputFormat.Sam, None, sort_on_write=sort_on_write
        )

    @staticmethod
    def sink_sam(
        lf: pl.LazyFrame,
        path: str,
        sort_on_write: bool = False,
    ) -> None:
        """
        Streaming write a LazyFrame to SAM format (plain text).

        Parameters:
            lf: LazyFrame to write
            path: Output file path (.sam)
            sort_on_write: If True, sort records by (chrom, start) and set header SO:coordinate.
                If False (default), set header SO:unsorted.

        !!! Example "Streaming write SAM"
            ```python
            import polars_bio as pb
            lf = pb.scan_bam("input.bam").filter(pl.col("mapping_quality") > 20)
            pb.sink_sam(lf, "filtered.sam")
            ```
        """
        _write_bam_file(lf, path, OutputFormat.Sam, None, sort_on_write=sort_on_write)

    @staticmethod
    def write_cram(
        df: Union[pl.DataFrame, pl.LazyFrame],
        path: str,
        reference_path: str,
        sort_on_write: bool = False,
    ) -> int:
        """
        Write a DataFrame to CRAM format.

        CRAM uses reference-based compression, storing only differences from the
        reference sequence. This achieves 30-60% better compression than BAM.

        Parameters:
            df: DataFrame or LazyFrame with 11 core BAM columns + optional tag columns
            path: Output CRAM file path
            reference_path: Path to reference FASTA file (required). The reference must
                contain all sequences referenced by the alignment data.
            sort_on_write: If True, sort records by (chrom, start) and set header SO:coordinate.
                If False (default), set header SO:unsorted.

        Returns:
            Number of rows written

        !!! warning "Known Limitation: MD and NM Tags"
            Due to a limitation in the underlying noodles-cram library, **MD and NM tags cannot be read back from CRAM files** after writing, even though they are written to the file. If you need MD/NM tags for downstream analysis, use BAM format instead. Other optional tags (RG, MQ, AM, OQ, AS, etc.) work correctly. See: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

        !!! Example "Write CRAM files"
            ```python
            import polars_bio as pb

            df = pb.read_bam("input.bam", tag_fields=["NM", "AS"])

            # Write CRAM with reference (required)
            pb.write_cram(df, "output.cram", reference_path="reference.fasta")

            # For sorted output
            pb.write_cram(df, "output.cram", reference_path="reference.fasta", sort_on_write=True)
            ```
        """
        return _write_bam_file(
            df, path, OutputFormat.Cram, reference_path, sort_on_write=sort_on_write
        )

    @staticmethod
    def sink_cram(
        lf: pl.LazyFrame,
        path: str,
        reference_path: str,
        sort_on_write: bool = False,
    ) -> None:
        """
        Streaming write a LazyFrame to CRAM format.

        CRAM uses reference-based compression, storing only differences from the
        reference sequence. This method streams data without materializing all
        rows in memory.

        Parameters:
            lf: LazyFrame to write
            path: Output CRAM file path
            reference_path: Path to reference FASTA file (required). The reference must
                contain all sequences referenced by the alignment data.
            sort_on_write: If True, sort records by (chrom, start) and set header SO:coordinate.
                If False (default), set header SO:unsorted.

        !!! warning "Known Limitation: MD and NM Tags"
            Due to a limitation in the underlying noodles-cram library, **MD and NM tags cannot be read back from CRAM files** after writing, even though they are written to the file. If you need MD/NM tags for downstream analysis, use BAM format instead. Other optional tags (RG, MQ, AM, OQ, AS, etc.) work correctly. See: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

        !!! Example "Streaming write CRAM"
            ```python
            import polars_bio as pb
            import polars as pl

            lf = pb.scan_bam("large_input.bam")
            lf = lf.filter(pl.col("mapping_quality") > 30)

            # Write CRAM with reference (required)
            pb.sink_cram(lf, "filtered.cram", reference_path="reference.fasta")

            # For sorted output
            pb.sink_cram(lf, "filtered.cram", reference_path="reference.fasta", sort_on_write=True)
            ```
        """
        _write_bam_file(
            lf, path, OutputFormat.Cram, reference_path, sort_on_write=sort_on_write
        )

describe_bam(path, sample_size=100, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', use_zero_based=None) staticmethod

Get schema information for a BAM file with automatic tag discovery.

Samples the first N records to discover all available tags and their types. Returns detailed schema information including column names, data types, nullability, category (core/tag), SAM type, and descriptions.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `path` | `str` | The path to the BAM file. | *required* |
| `sample_size` | `int` | Number of records to sample for tag discovery (default: 100). Use higher values for more comprehensive tag discovery. | `100` |
| `chunk_size` | `int` | The size in MB of a chunk when reading from object storage. | `8` |
| `concurrent_fetches` | `int` | The number of concurrent fetches when reading from object storage. | `1` |
| `allow_anonymous` | `bool` | Whether to allow anonymous access to object storage. | `True` |
| `enable_request_payer` | `bool` | Whether to enable request payer for object storage. | `False` |
| `max_retries` | `int` | The maximum number of retries for reading the file. | `5` |
| `timeout` | `int` | The timeout in seconds for reading the file. | `300` |
| `compression_type` | `str` | The compression type of the file. If "auto" (default), compression is detected automatically. | `'auto'` |
| `use_zero_based` | `Optional[bool]` | If True, output 0-based coordinates. If False, 1-based coordinates. | `None` |

Returns:

Type: DataFrame

DataFrame with columns:

  • column_name: Name of the column/field
  • data_type: Arrow data type (e.g., "Utf8", "Int32")
  • nullable: Whether the field can be null
  • category: "core" for fixed columns, "tag" for optional SAM tags
  • sam_type: SAM type code (e.g., "Z", "i") for tags, null for core columns
  • description: Human-readable description of the field
Example
import polars_bio as pb

# Auto-discover all tags present in the file
schema = pb.describe_bam("file.bam", sample_size=100)
print(schema)
# Output:
# shape: (15, 6)
# ┌─────────────┬───────────┬──────────┬──────────┬──────────┬──────────────────────┐
# │ column_name ┆ data_type ┆ nullable ┆ category ┆ sam_type ┆ description          │
# │ ---         ┆ ---       ┆ ---      ┆ ---      ┆ ---      ┆ ---                  │
# │ str         ┆ str       ┆ bool     ┆ str      ┆ str      ┆ str                  │
# ╞═════════════╪═══════════╪══════════╪══════════╪══════════╪══════════════════════╡
# │ name        ┆ Utf8      ┆ true     ┆ core     ┆ null     ┆ Query name           │
# │ chrom       ┆ Utf8      ┆ true     ┆ core     ┆ null     ┆ Reference name       │
# │ ...         ┆ ...       ┆ ...      ┆ ...      ┆ ...      ┆ ...                  │
# │ NM          ┆ Int32     ┆ true     ┆ tag      ┆ i        ┆ Edit distance        │
# │ AS          ┆ Int32     ┆ true     ┆ tag      ┆ i        ┆ Alignment score      │
# └─────────────┴───────────┴──────────┴──────────┴──────────┴──────────────────────┘
Source code in polars_bio/io.py
@staticmethod
def describe_bam(
    path: str,
    sample_size: int = 100,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    use_zero_based: Optional[bool] = None,
) -> pl.DataFrame:
    """
    Get schema information for a BAM file with automatic tag discovery.

    Samples the first N records to discover all available tags and their types.
    Returns detailed schema information including column names, data types,
    nullability, category (core/tag), SAM type, and descriptions.

    Parameters:
        path: The path to the BAM file.
        sample_size: Number of records to sample for tag discovery (default: 100).
            Use higher values for more comprehensive tag discovery.
        chunk_size: The size in MB of a chunk when reading from object storage.
        concurrent_fetches: The number of concurrent fetches when reading from object storage.
        allow_anonymous: Whether to allow anonymous access to object storage.
        enable_request_payer: Whether to enable request payer for object storage.
        max_retries: The maximum number of retries for reading the file.
        timeout: The timeout in seconds for reading the file.
        compression_type: The compression type of the file. If "auto" (default), compression is detected automatically.
        use_zero_based: If True, output 0-based coordinates. If False, 1-based coordinates.

    Returns:
        DataFrame with columns:
        - column_name: Name of the column/field
        - data_type: Arrow data type (e.g., "Utf8", "Int32")
        - nullable: Whether the field can be null
        - category: "core" for fixed columns, "tag" for optional SAM tags
        - sam_type: SAM type code (e.g., "Z", "i") for tags, null for core columns
        - description: Human-readable description of the field

    Example:
        ```python
        import polars_bio as pb

        # Auto-discover all tags present in the file
        schema = pb.describe_bam("file.bam", sample_size=100)
        print(schema)
        # Output:
        # shape: (15, 6)
        # ┌─────────────┬───────────┬──────────┬──────────┬──────────┬──────────────────────┐
        # │ column_name ┆ data_type ┆ nullable ┆ category ┆ sam_type ┆ description          │
        # │ ---         ┆ ---       ┆ ---      ┆ ---      ┆ ---      ┆ ---                  │
        # │ str         ┆ str       ┆ bool     ┆ str      ┆ str      ┆ str                  │
        # ╞═════════════╪═══════════╪══════════╪══════════╪══════════╪══════════════════════╡
        # │ name        ┆ Utf8      ┆ true     ┆ core     ┆ null     ┆ Query name           │
        # │ chrom       ┆ Utf8      ┆ true     ┆ core     ┆ null     ┆ Reference name       │
        # │ ...         ┆ ...       ┆ ...      ┆ ...      ┆ ...      ┆ ...                  │
        # │ NM          ┆ Int32     ┆ true     ┆ tag      ┆ i        ┆ Edit distance        │
        # │ AS          ┆ Int32     ┆ true     ┆ tag      ┆ i        ┆ Alignment score      │
        # └─────────────┴───────────┴──────────┴──────────┴──────────┴──────────────────────┘
        ```
    """
    # Build object storage options
    object_storage_options = PyObjectStorageOptions(
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    # Resolve zero_based setting
    zero_based = _resolve_zero_based(use_zero_based)

    # Call Rust function with tag auto-discovery (tag_fields=None)
    df = py_describe_bam(
        ctx,  # PyBioSessionContext
        path,
        object_storage_options,
        zero_based,
        None,  # tag_fields=None enables auto-discovery
        sample_size,
    )

    # Convert DataFusion DataFrame to Polars DataFrame
    return pl.from_arrow(df.to_arrow_table())

describe_cram(path, reference_path=None, sample_size=100, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', use_zero_based=None) staticmethod

Get schema information for a CRAM file with automatic tag discovery.

Samples the first N records to discover all available tags and their types. Returns detailed schema information including column names, data types, nullability, category (core/tag), SAM type, and descriptions.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `path` | `str` | The path to the CRAM file. | *required* |
| `reference_path` | `str` | Optional path to external FASTA reference file. | `None` |
| `sample_size` | `int` | Number of records to sample for tag discovery (default: 100). | `100` |
| `chunk_size` | `int` | The size in MB of a chunk when reading from object storage. | `8` |
| `concurrent_fetches` | `int` | The number of concurrent fetches when reading from object storage. | `1` |
| `allow_anonymous` | `bool` | Whether to allow anonymous access to object storage. | `True` |
| `enable_request_payer` | `bool` | Whether to enable request payer for object storage. | `False` |
| `max_retries` | `int` | The maximum number of retries for reading the file. | `5` |
| `timeout` | `int` | The timeout in seconds for reading the file. | `300` |
| `compression_type` | `str` | The compression type of the file. If "auto" (default), compression is detected automatically. | `'auto'` |
| `use_zero_based` | `Optional[bool]` | If True, output 0-based coordinates. If False, 1-based coordinates. | `None` |

Returns:

Type: DataFrame

DataFrame with columns:

  • column_name: Name of the column/field
  • data_type: Arrow data type (e.g., "Utf8", "Int32")
  • nullable: Whether the field can be null
  • category: "core" for fixed columns, "tag" for optional SAM tags
  • sam_type: SAM type code (e.g., "Z", "i") for tags, null for core columns
  • description: Human-readable description of the field

Known Limitation: MD and NM Tags

Due to a limitation in the underlying noodles-cram library, MD (mismatch descriptor) and NM (edit distance) tags are not discoverable from CRAM files, even when stored. Automatic tag discovery will not include MD/NM tags. Other optional tags (RG, MQ, AM, OQ, etc.) are discovered correctly. See: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

Example
import polars_bio as pb

# Auto-discover all tags present in the file
schema = pb.describe_cram("file.cram", sample_size=100)
print(schema)

# Filter to see only tag columns
tags = schema.filter(schema["category"] == "tag")
print(tags["column_name"])
Source code in polars_bio/io.py
@staticmethod
def describe_cram(
    path: str,
    reference_path: str = None,
    sample_size: int = 100,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    use_zero_based: Optional[bool] = None,
) -> pl.DataFrame:
    """
    Get schema information for a CRAM file with automatic tag discovery.

    Samples the first N records to discover all available tags and their types.
    Returns detailed schema information including column names, data types,
    nullability, category (core/tag), SAM type, and descriptions.

    Parameters:
        path: The path to the CRAM file.
        reference_path: Optional path to external FASTA reference file.
        sample_size: Number of records to sample for tag discovery (default: 100).
        chunk_size: The size in MB of a chunk when reading from object storage.
        concurrent_fetches: The number of concurrent fetches when reading from object storage.
        allow_anonymous: Whether to allow anonymous access to object storage.
        enable_request_payer: Whether to enable request payer for object storage.
        max_retries: The maximum number of retries for reading the file.
        timeout: The timeout in seconds for reading the file.
        compression_type: The compression type of the file. If "auto" (default), compression is detected automatically.
        use_zero_based: If True, output 0-based coordinates. If False, 1-based coordinates.

    Returns:
        DataFrame with columns:
        - column_name: Name of the column/field
        - data_type: Arrow data type (e.g., "Utf8", "Int32")
        - nullable: Whether the field can be null
        - category: "core" for fixed columns, "tag" for optional SAM tags
        - sam_type: SAM type code (e.g., "Z", "i") for tags, null for core columns
        - description: Human-readable description of the field

    !!! warning "Known Limitation: MD and NM Tags"
        Due to a limitation in the underlying noodles-cram library, **MD (mismatch descriptor) and NM (edit distance) tags are not discoverable** from CRAM files, even when stored. Automatic tag discovery will not include MD/NM tags. Other optional tags (RG, MQ, AM, OQ, etc.) are discovered correctly. See: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

    Example:
        ```python
        import polars_bio as pb

        # Auto-discover all tags present in the file
        schema = pb.describe_cram("file.cram", sample_size=100)
        print(schema)

        # Filter to see only tag columns
        tags = schema.filter(schema["category"] == "tag")
        print(tags["column_name"])
        ```
    """
    # Build object storage options
    object_storage_options = PyObjectStorageOptions(
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    # Resolve zero_based setting
    zero_based = _resolve_zero_based(use_zero_based)

    # Call Rust function with tag auto-discovery (tag_fields=None)
    df = py_describe_cram(
        ctx,
        path,
        reference_path,
        object_storage_options,
        zero_based,
        None,  # tag_fields=None enables auto-discovery
        sample_size,
    )

    # Convert DataFusion DataFrame to Polars DataFrame
    return pl.from_arrow(df.to_arrow_table())

describe_sam(path, sample_size=100, use_zero_based=None) staticmethod

Get schema information for a SAM file with automatic tag discovery.

Samples the first N records to discover all available tags and their types. Reuses the BAM describe logic, which auto-detects SAM from the file extension.

Parameters:

Name Type Description Default
path str

The path to the SAM file.

required
sample_size int

Number of records to sample for tag discovery (default: 100).

100
use_zero_based Optional[bool]

If True, output 0-based coordinates. If False, 1-based coordinates.

None

Returns:

Type Description
DataFrame

DataFrame with columns: column_name, data_type, nullable, category, sam_type, description

Source code in polars_bio/io.py
@staticmethod
def describe_sam(
    path: str,
    sample_size: int = 100,
    use_zero_based: Optional[bool] = None,
) -> pl.DataFrame:
    """
    Get schema information for a SAM file with automatic tag discovery.

    Samples the first N records to discover all available tags and their types.
    Reuses the BAM describe logic, which auto-detects SAM from the file extension.

    Parameters:
        path: The path to the SAM file.
        sample_size: Number of records to sample for tag discovery (default: 100).
        use_zero_based: If True, output 0-based coordinates. If False, 1-based coordinates.

    Returns:
        DataFrame with columns: column_name, data_type, nullable, category, sam_type, description
    """
    zero_based = _resolve_zero_based(use_zero_based)

    df = py_describe_bam(
        ctx,
        path,
        None,
        zero_based,
        None,
        sample_size,
    )

    return pl.from_arrow(df.to_arrow_table())

describe_vcf(path, allow_anonymous=True, enable_request_payer=False, compression_type='auto') staticmethod

Describe VCF INFO schema.

Parameters:

Name Type Description Default
path str

The path to the VCF file.

required
allow_anonymous bool

Whether to allow anonymous access to object storage (GCS and S3 supported).

True
enable_request_payer bool

Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
compression_type str

The compression type of the VCF file. If not specified, it will be detected automatically.

'auto'
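
Example

A minimal usage sketch (the VCF path is illustrative):

import polars_bio as pb

# Describe the INFO schema of a bgzipped VCF
info_schema = pb.describe_vcf("variants.vcf.gz")
print(info_schema)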
Source code in polars_bio/io.py
@staticmethod
def describe_vcf(
    path: str,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    compression_type: str = "auto",
) -> pl.DataFrame:
    """
    Describe VCF INFO schema.

    Parameters:
        path: The path to the VCF file.
        allow_anonymous: Whether to allow anonymous access to object storage (GCS and S3 supported).
        enable_request_payer: Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=8,
        concurrent_fetches=1,
        max_retries=1,
        timeout=10,
        compression_type=compression_type,
    )
    return py_describe_vcf(ctx, path, object_storage_options).to_polars()

from_polars(name, df) staticmethod

Register a Polars DataFrame as a DataFusion table.

Parameters:

Name Type Description Default
name str

The name of the table.

required
df Union[DataFrame, LazyFrame]

The Polars DataFrame.

required
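
Example

A minimal registration sketch (the table name and data are illustrative; querying the registered table through the SQL interface, e.g. via pb.sql, is an assumption):

import polars as pl
import polars_bio as pb

intervals = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [100, 500],
    "end": [200, 800],
})

# Register the DataFrame as a DataFusion table named "intervals"
pb.from_polars("intervals", intervals)

# The registered table can then be referenced from the SQL interface,
# e.g. pb.sql("SELECT * FROM intervals")  # assumed entry point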
Source code in polars_bio/io.py
@staticmethod
def from_polars(name: str, df: Union[pl.DataFrame, pl.LazyFrame]) -> None:
    """
    Register a Polars DataFrame as a DataFusion table.

    Parameters:
        name: The name of the table.
        df: The Polars DataFrame.
    """
    reader = (
        df.to_arrow()
        if isinstance(df, pl.DataFrame)
        else df.collect().to_arrow().to_reader()
    )
    py_from_polars(ctx, name, reader)

read_bam(path, tag_fields=None, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, projection_pushdown=True, predicate_pushdown=True, use_zero_based=None) staticmethod

Read a BAM file into a DataFrame.

Parallelism & Indexed Reads

Indexed parallel reads and predicate pushdown are automatic when a BAI/CSI index is present. See File formats support, Indexed reads, and Automatic parallel partitioning for details.

Parameters:

Name Type Description Default
path str

The path to the BAM file.

required
tag_fields Union[list[str], None]

List of BAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).

None
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large-scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large-scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

True
predicate_pushdown bool

Enable predicate pushdown using index files (BAI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., file.bam.bai). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like .str.contains() or OR logic are filtered client-side. Correctness is always guaranteed.

True
use_zero_based Optional[bool]

If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration datafusion.bio.coordinate_system_zero_based.

None

Note

By default, coordinates are output in 1-based closed format. Use use_zero_based=True or set pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True) for 0-based half-open coordinates.
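
Example

A minimal usage sketch (the BAM path and tag selection are illustrative):

import polars_bio as pb

# Read alignments and parse the NM and AS optional tags as columns
df = pb.read_bam("sample.bam", tag_fields=["NM", "AS"])
print(df.head())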

Source code in polars_bio/io.py
@staticmethod
def read_bam(
    path: str,
    tag_fields: Union[list[str], None] = None,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    projection_pushdown: bool = True,
    predicate_pushdown: bool = True,
    use_zero_based: Optional[bool] = None,
) -> pl.DataFrame:
    """
    Read a BAM file into a DataFrame.

    !!! hint "Parallelism & Indexed Reads"
        Indexed parallel reads and predicate pushdown are automatic when a BAI/CSI index
        is present. See [File formats support](/polars-bio/features/#file-formats-support),
        [Indexed reads](/polars-bio/features/#indexed-reads-predicate-pushdown),
        and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details.

    Parameters:
        path: The path to the BAM file.
        tag_fields: List of BAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large-scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large-scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
        predicate_pushdown: Enable predicate pushdown using index files (BAI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.bam.bai`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
        use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

    !!! note
        By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.
    """
    lf = IOOperations.scan_bam(
        path,
        tag_fields,
        chunk_size,
        concurrent_fetches,
        allow_anonymous,
        enable_request_payer,
        max_retries,
        timeout,
        projection_pushdown,
        predicate_pushdown,
        use_zero_based,
    )
    # Get metadata before collecting (polars-config-meta doesn't preserve through collect)
    zero_based = lf.config_meta.get_metadata().get("coordinate_system_zero_based")
    df = lf.collect()
    # Set metadata on the collected DataFrame
    if zero_based is not None:
        set_coordinate_system(df, zero_based)
    return df

read_bed(path, thread_num=1, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=True, use_zero_based=None) staticmethod

Read a BED file into a DataFrame.

Parameters:

Name Type Description Default
path str

The path to the BED file.

required
thread_num int

The number of threads to use for reading the BED file. Used only for parallel decompression of BGZF blocks. Works only for local files.

1
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
compression_type str

The compression type of the BED file. If not specified, it will be detected automatically based on the file extension. BGZF compression is supported ('bgz').

'auto'
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

True
use_zero_based Optional[bool]

If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration datafusion.bio.coordinate_system_zero_based.

None

Note

Only BED4 format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name. Also, unlike other text formats, GZIP compression is not supported.

Note

By default, coordinates are output in 1-based closed format. Use use_zero_based=True or set pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True) for 0-based half-open coordinates.
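
Example

A minimal usage sketch (the BED4 path is illustrative):

import polars_bio as pb

# Read a BED4 file with 0-based half-open coordinates
df = pb.read_bed("regions.bed", use_zero_based=True)
print(df.head())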

Source code in polars_bio/io.py
@staticmethod
def read_bed(
    path: str,
    thread_num: int = 1,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = True,
    use_zero_based: Optional[bool] = None,
) -> pl.DataFrame:
    """
    Read a BED file into a DataFrame.

    Parameters:
        path: The path to the BED file.
        thread_num: The number of threads to use for reading the BED file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the BED file. If not specified, it will be detected automatically based on the file extension. BGZF compression is supported ('bgz').
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
        use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

    !!! Note
        Only **BED4** format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name.
        Also, unlike other text formats, **GZIP** compression is not supported.

    !!! note
        By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.
    """
    lf = IOOperations.scan_bed(
        path,
        thread_num,
        chunk_size,
        concurrent_fetches,
        allow_anonymous,
        enable_request_payer,
        max_retries,
        timeout,
        compression_type,
        projection_pushdown,
        use_zero_based,
    )
    # Get metadata before collecting (polars-config-meta doesn't preserve through collect)
    zero_based = lf.config_meta.get_metadata().get("coordinate_system_zero_based")
    df = lf.collect()
    # Set metadata on the collected DataFrame
    if zero_based is not None:
        set_coordinate_system(df, zero_based)
    return df

read_cram(path, reference_path=None, tag_fields=None, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, projection_pushdown=True, predicate_pushdown=True, use_zero_based=None) staticmethod

Read a CRAM file into a DataFrame.

Parallelism & Indexed Reads

Indexed parallel reads and predicate pushdown are automatic when a CRAI index is present. See File formats support, Indexed reads, and Automatic parallel partitioning for details.

Parameters:

Name Type Description Default
path str

The path to the CRAM file (local or cloud storage: S3, GCS, Azure Blob).

required
reference_path str

Optional path to external FASTA reference file (local path only, cloud storage not supported). If not provided, the CRAM file must contain embedded reference sequences. The FASTA file must have an accompanying index file (.fai) in the same directory. Create the index using: samtools faidx reference.fasta

None
tag_fields Union[list[str], None]

List of CRAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).

None
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
projection_pushdown bool

Enable column projection pushdown optimization. When True, only requested columns are processed at the DataFusion execution level, improving performance and reducing memory usage.

True
predicate_pushdown bool

Enable predicate pushdown using index files (CRAI) for efficient region-based filtering. Index files are auto-discovered (e.g., file.cram.crai). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like .str.contains() or OR logic are filtered client-side. Correctness is always guaranteed.

True
use_zero_based Optional[bool]

If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration datafusion.bio.coordinate_system_zero_based.

None

Note

By default, coordinates are output in 1-based closed format. Use use_zero_based=True or set pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True) for 0-based half-open coordinates.

Known Limitation: MD and NM Tags

Due to a limitation in the underlying noodles-cram library, MD (mismatch descriptor) and NM (edit distance) tags are not accessible from CRAM files, even when stored in the file. These tags can be seen with samtools but are not exposed through the noodles-cram record.data() interface.

Other optional tags (RG, MQ, AM, OQ, etc.) work correctly. This issue is tracked at: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

Workaround: Use BAM format if MD/NM tags are required for your analysis.

Using External Reference

import polars_bio as pb

# Read CRAM with external reference
df = pb.read_cram(
    "/path/to/file.cram",
    reference_path="/path/to/reference.fasta"
)

Public CRAM File Example

Download and read a public CRAM file from 42basepairs:

# Download the CRAM file and reference
wget https://42basepairs.com/download/s3/gatk-test-data/wgs_cram/NA12878_20k_hg38/NA12878.cram
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta

# Create FASTA index (required)
samtools faidx Homo_sapiens_assembly38.fasta

import polars_bio as pb

# Read first 5 reads from the CRAM file
df = pb.scan_cram(
    "NA12878.cram",
    reference_path="Homo_sapiens_assembly38.fasta"
).limit(5).collect()

print(df.select(["name", "chrom", "start", "end", "cigar"]))

Creating CRAM with Embedded Reference

To create a CRAM file with embedded reference using samtools:

samtools view -C -o output.cram --output-fmt-option embed_ref=1 input.bam

Returns:

Type Description
DataFrame

A Polars DataFrame with the following schema:

  • name: Read name (String)
  • chrom: Chromosome/contig name (String)
  • start: Alignment start position, 1-based (UInt32)
  • end: Alignment end position, 1-based (UInt32)
  • flags: SAM flags (UInt32)
  • cigar: CIGAR string (String)
  • mapping_quality: Mapping quality (UInt32)
  • mate_chrom: Mate chromosome/contig name (String)
  • mate_start: Mate alignment start position, 1-based (UInt32)
  • sequence: Read sequence (String)
  • quality_scores: Base quality scores (String)

Source code in polars_bio/io.py
@staticmethod
def read_cram(
    path: str,
    reference_path: str = None,
    tag_fields: Union[list[str], None] = None,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    projection_pushdown: bool = True,
    predicate_pushdown: bool = True,
    use_zero_based: Optional[bool] = None,
) -> pl.DataFrame:
    """
    Read a CRAM file into a DataFrame.

    !!! hint "Parallelism & Indexed Reads"
        Indexed parallel reads and predicate pushdown are automatic when a CRAI index
        is present. See [File formats support](/polars-bio/features/#file-formats-support),
        [Indexed reads](/polars-bio/features/#indexed-reads-predicate-pushdown),
        and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details.

    Parameters:
        path: The path to the CRAM file (local or cloud storage: S3, GCS, Azure Blob).
        reference_path: Optional path to external FASTA reference file (**local path only**, cloud storage not supported). If not provided, the CRAM file must contain embedded reference sequences. The FASTA file must have an accompanying index file (.fai) in the same directory. Create the index using: `samtools faidx reference.fasta`
        tag_fields: List of CRAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries: The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        projection_pushdown: Enable column projection pushdown optimization. When True, only requested columns are processed at the DataFusion execution level, improving performance and reducing memory usage.
        predicate_pushdown: Enable predicate pushdown using index files (CRAI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.cram.crai`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
        use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

    !!! note
        By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.

    !!! warning "Known Limitation: MD and NM Tags"
        Due to a limitation in the underlying noodles-cram library, **MD (mismatch descriptor) and NM (edit distance) tags are not accessible** from CRAM files, even when stored in the file. These tags can be seen with samtools but are not exposed through the noodles-cram record.data() interface.

        Other optional tags (RG, MQ, AM, OQ, etc.) work correctly. This issue is tracked at: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

        **Workaround**: Use BAM format if MD/NM tags are required for your analysis.

    !!! example "Using External Reference"
        ```python
        import polars_bio as pb

        # Read CRAM with external reference
        df = pb.read_cram(
            "/path/to/file.cram",
            reference_path="/path/to/reference.fasta"
        )
        ```

    !!! example "Public CRAM File Example"
        Download and read a public CRAM file from 42basepairs:
        ```bash
        # Download the CRAM file and reference
        wget https://42basepairs.com/download/s3/gatk-test-data/wgs_cram/NA12878_20k_hg38/NA12878.cram
        wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta

        # Create FASTA index (required)
        samtools faidx Homo_sapiens_assembly38.fasta
        ```

        ```python
        import polars_bio as pb

        # Read first 5 reads from the CRAM file
        df = pb.scan_cram(
            "NA12878.cram",
            reference_path="Homo_sapiens_assembly38.fasta"
        ).limit(5).collect()

        print(df.select(["name", "chrom", "start", "end", "cigar"]))
        ```

    !!! example "Creating CRAM with Embedded Reference"
        To create a CRAM file with embedded reference using samtools:
        ```bash
        samtools view -C -o output.cram --output-fmt-option embed_ref=1 input.bam
        ```

    Returns:
        A Polars DataFrame with the following schema:
            - name: Read name (String)
            - chrom: Chromosome/contig name (String)
            - start: Alignment start position, 1-based (UInt32)
            - end: Alignment end position, 1-based (UInt32)
            - flags: SAM flags (UInt32)
            - cigar: CIGAR string (String)
            - mapping_quality: Mapping quality (UInt32)
            - mate_chrom: Mate chromosome/contig name (String)
            - mate_start: Mate alignment start position, 1-based (UInt32)
            - sequence: Read sequence (String)
            - quality_scores: Base quality scores (String)
    """
    lf = IOOperations.scan_cram(
        path,
        reference_path,
        tag_fields,
        chunk_size,
        concurrent_fetches,
        allow_anonymous,
        enable_request_payer,
        max_retries,
        timeout,
        projection_pushdown,
        predicate_pushdown,
        use_zero_based,
    )
    # Get metadata before collecting (polars-config-meta doesn't preserve through collect)
    zero_based = lf.config_meta.get_metadata().get("coordinate_system_zero_based")
    df = lf.collect()
    # Set metadata on the collected DataFrame
    if zero_based is not None:
        set_coordinate_system(df, zero_based)
    return df

read_fasta(path, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=True) staticmethod

Read a FASTA file into a DataFrame.

Parameters:

Name Type Description Default
path str

The path to the FASTA file.

required
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
compression_type str

The compression type of the FASTA file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').

'auto'
projection_pushdown bool

Enable column projection pushdown optimization. When True, only requested columns are processed at the DataFusion execution level, improving performance and reducing memory usage.

True

Example

wget https://www.ebi.ac.uk/ena/browser/api/fasta/BK006935.2?download=true -O /tmp/test.fasta

import polars_bio as pb
pb.read_fasta("/tmp/test.fasta").limit(1)
shape: (1, 3)
┌─────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
│ name                    ┆ description                     ┆ sequence                        │
│ ---                     ┆ ---                             ┆ ---                             │
│ str                     ┆ str                             ┆ str                             │
╞═════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
│ ENA|BK006935|BK006935.2 ┆ TPA_inf: Saccharomyces cerevis… ┆ CCACACCACACCCACACACCCACACACCAC… │
└─────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘

Source code in polars_bio/io.py
@staticmethod
def read_fasta(
    path: str,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = True,
) -> pl.DataFrame:
    """

    Read a FASTA file into a DataFrame.

    Parameters:
        path: The path to the FASTA file.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the FASTA file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
        projection_pushdown: Enable column projection pushdown optimization. When True, only requested columns are processed at the DataFusion execution level, improving performance and reducing memory usage.

    !!! Example
        ```shell
        wget https://www.ebi.ac.uk/ena/browser/api/fasta/BK006935.2?download=true -O /tmp/test.fasta
        ```

        ```python
        import polars_bio as pb
        pb.read_fasta("/tmp/test.fasta").limit(1)
        ```
        ```shell
         shape: (1, 3)
        ┌─────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
        │ name                    ┆ description                     ┆ sequence                        │
        │ ---                     ┆ ---                             ┆ ---                             │
        │ str                     ┆ str                             ┆ str                             │
        ╞═════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
        │ ENA|BK006935|BK006935.2 ┆ TPA_inf: Saccharomyces cerevis… ┆ CCACACCACACCCACACACCCACACACCAC… │
        └─────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘
        ```
    """
    return IOOperations.scan_fasta(
        path,
        chunk_size,
        concurrent_fetches,
        allow_anonymous,
        enable_request_payer,
        max_retries,
        timeout,
        compression_type,
        projection_pushdown,
    ).collect()

read_fastq(path, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=True) staticmethod

Read a FASTQ file into a DataFrame.

Parallelism & Compression

See File formats support, Compression, and Automatic parallel partitioning for details on parallel reads and supported compression types.

Parameters:

Name Type Description Default
path str

The path to the FASTQ file.

required
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
compression_type str

The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').

'auto'
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

True
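
Example

A minimal usage sketch (the FASTQ path is illustrative):

import polars_bio as pb

# Compression is detected automatically from the .gz extension
df = pb.read_fastq("reads.fastq.gz")
print(df.head())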
Source code in polars_bio/io.py
@staticmethod
def read_fastq(
    path: str,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = True,
) -> pl.DataFrame:
    """
    Read a FASTQ file into a DataFrame.

    !!! hint "Parallelism & Compression"
        See [File formats support](/polars-bio/features/#file-formats-support),
        [Compression](/polars-bio/features/#compression),
        and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details on parallel reads and supported compression types.

    Parameters:
        path: The path to the FASTQ file.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
    """
    return IOOperations.scan_fastq(
        path,
        chunk_size,
        concurrent_fetches,
        allow_anonymous,
        enable_request_payer,
        max_retries,
        timeout,
        compression_type,
        projection_pushdown,
    ).collect()

read_gff(path, attr_fields=None, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=True, predicate_pushdown=True, use_zero_based=None) staticmethod

Read a GFF file into a DataFrame.

Parameters:

Name Type Description Default
path str

The path to the GFF file.

required
attr_fields Union[list[str], None]

List of attribute field names to extract as separate columns. If None, attributes will be kept as a nested structure. Use this to extract specific attributes like 'ID', 'gene_name', 'gene_type', etc. as direct columns for easier access.

None
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
compression_type str

The compression type of the GFF file. If not specified, it will be detected automatically.

'auto'
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

True
predicate_pushdown bool

Enable predicate pushdown using index files (TBI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., file.gff.gz.tbi). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like .str.contains() or OR logic are filtered client-side. Correctness is always guaranteed.

True
use_zero_based Optional[bool]

If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration datafusion.bio.coordinate_system_zero_based.

None

Note

By default, coordinates are output in 1-based closed format. Use use_zero_based=True or set pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True) for 0-based half-open coordinates.
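
Example

A minimal usage sketch (the GFF path and attribute names are illustrative):

import polars_bio as pb

# Extract the ID and gene_name attributes as dedicated columns
df = pb.read_gff("annotation.gff3.gz", attr_fields=["ID", "gene_name"])
print(df.head())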

Source code in polars_bio/io.py
@staticmethod
def read_gff(
    path: str,
    attr_fields: Union[list[str], None] = None,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = True,
    predicate_pushdown: bool = True,
    use_zero_based: Optional[bool] = None,
) -> pl.DataFrame:
    """
    Read a GFF file into a DataFrame.

    Parameters:
        path: The path to the GFF file.
        attr_fields: List of attribute field names to extract as separate columns. If *None*, attributes will be kept as a nested structure. Use this to extract specific attributes like 'ID', 'gene_name', 'gene_type', etc. as direct columns for easier access.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the GFF file. If not specified, it will be detected automatically.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
        predicate_pushdown: Enable predicate pushdown using index files (TBI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.gff.gz.tbi`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
        use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

    !!! note
        By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.
    """
    lf = IOOperations.scan_gff(
        path,
        attr_fields,
        chunk_size,
        concurrent_fetches,
        allow_anonymous,
        enable_request_payer,
        max_retries,
        timeout,
        compression_type,
        projection_pushdown,
        predicate_pushdown,
        use_zero_based,
    )
    # Get metadata before collecting (polars-config-meta doesn't preserve through collect)
    zero_based = lf.config_meta.get_metadata().get("coordinate_system_zero_based")
    df = lf.collect()
    # Set metadata on the collected DataFrame
    if zero_based is not None:
        set_coordinate_system(df, zero_based)
    return df

read_sam(path, tag_fields=None, projection_pushdown=True, use_zero_based=None) staticmethod

Read a SAM file into a DataFrame.

SAM (Sequence Alignment/Map) is the plain-text counterpart of BAM. This function reuses the BAM reader, which auto-detects the format from the file extension.

Parameters:

Name Type Description Default
path str

The path to the SAM file.

required
tag_fields Union[list[str], None]

List of SAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default).

None
projection_pushdown bool

Enable column projection pushdown to optimize query performance.

True
use_zero_based Optional[bool]

If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration.

None

Note

By default, coordinates are output in 1-based closed format.
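
Example

A minimal usage sketch (the SAM path is illustrative):

import polars_bio as pb

# Read a plain-text SAM file and include the NM tag as a column
df = pb.read_sam("alignments.sam", tag_fields=["NM"])
print(df.head())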

Source code in polars_bio/io.py
@staticmethod
def read_sam(
    path: str,
    tag_fields: Union[list[str], None] = None,
    projection_pushdown: bool = True,
    use_zero_based: Optional[bool] = None,
) -> pl.DataFrame:
    """
    Read a SAM file into a DataFrame.

    SAM (Sequence Alignment/Map) is the plain-text counterpart of BAM.
    This function reuses the BAM reader, which auto-detects the format
    from the file extension.

    Parameters:
        path: The path to the SAM file.
        tag_fields: List of SAM tag names to include as columns (e.g., ["NM", "MD", "AS"]).
            If None, no optional tags are parsed (default).
        projection_pushdown: Enable column projection pushdown to optimize query performance.
        use_zero_based: If True, output 0-based half-open coordinates.
            If False, output 1-based closed coordinates.
            If None (default), uses the global configuration.

    !!! note
        By default, coordinates are output in **1-based closed** format.
    """
    lf = IOOperations.scan_sam(
        path,
        tag_fields,
        projection_pushdown,
        use_zero_based,
    )
    zero_based = lf.config_meta.get_metadata().get("coordinate_system_zero_based")
    df = lf.collect()
    if zero_based is not None:
        set_coordinate_system(df, zero_based)
    return df

read_table(path, schema=None, **kwargs) staticmethod

Read a tab-delimited file (e.g. BED) into a Polars DataFrame. Tries to be compatible with Bioframe's read_table but faster. The schema should follow Bioframe's schema format.

Parameters:

Name Type Description Default
path str

The path to the file.

required
schema Dict

The schema should follow Bioframe's schema format.

None
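
Example

A minimal usage sketch (the path is illustrative; passing a Bioframe schema name such as "bed4" is assumed to be accepted):

import polars_bio as pb

# Read a headerless BED file using a Bioframe-style schema name (assumed)
df = pb.read_table("regions.bed", schema="bed4")
print(df.head())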
Source code in polars_bio/io.py
@staticmethod
def read_table(path: str, schema: Dict = None, **kwargs) -> pl.DataFrame:
    """
     Read a tab-delimited file (e.g. BED) into a Polars DataFrame.
     Tries to be compatible with Bioframe's [read_table](https://bioframe.readthedocs.io/en/latest/guide-io.html)
     but faster. The schema should follow Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).

    Parameters:
        path: The path to the file.
        schema: The schema should follow Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).
    """
    return IOOperations.scan_table(path, schema, **kwargs).collect()

read_vcf(path, info_fields=None, format_fields=None, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=True, predicate_pushdown=True, use_zero_based=None) staticmethod

Read a VCF file into a DataFrame.

Parallelism & Indexed Reads

Indexed parallel reads and predicate pushdown are automatic when a TBI/CSI index is present. See File formats support, Indexed reads, and Automatic parallel partitioning for details.

Parameters:

Name Type Description Default
path str

The path to the VCF file.

required
info_fields Union[list[str], None]

List of INFO field names to include. If None, all INFO fields will be detected automatically from the VCF header. Use this to limit fields for better performance.

None
format_fields Union[list[str], None]

List of FORMAT field names to include (per-sample genotype data). If None, all FORMAT fields will be automatically detected from the VCF header. Column naming depends on the number of samples: for single-sample VCFs, columns are named directly by the FORMAT field (e.g., GT, DP); for multi-sample VCFs, columns are named {sample_name}_{format_field} (e.g., NA12878_GT, NA12879_DP). The GT field is always converted to string with / (unphased) or | (phased) separator.

None
chunk_size int

The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.

8
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.

1
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
compression_type str

The compression type of the VCF file. If not specified, it will be detected automatically.

'auto'
projection_pushdown bool

Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

True
predicate_pushdown bool

Enable predicate pushdown using index files (TBI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., file.vcf.gz.tbi). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like .str.contains() or OR logic are filtered client-side. Correctness is always guaranteed.

True
use_zero_based Optional[bool]

If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration datafusion.bio.coordinate_system_zero_based.

None

Note

By default, coordinates are output in 1-based closed format. Use use_zero_based=True or set pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True) for 0-based half-open coordinates.

Reading VCF with INFO and FORMAT fields

import polars_bio as pb

# Read VCF with both INFO and FORMAT fields
df = pb.read_vcf(
    "sample.vcf.gz",
    info_fields=["END"],              # INFO field
    format_fields=["GT", "DP", "GQ"]  # FORMAT fields
)

# Single-sample VCF: FORMAT columns named directly (GT, DP, GQ)
print(df.select(["chrom", "start", "ref", "alt", "END", "GT", "DP", "GQ"]))
# Output:
# shape: (10, 8)
# ┌───────┬───────┬─────┬─────┬──────┬─────┬─────┬─────┐
# │ chrom ┆ start ┆ ref ┆ alt ┆ END  ┆ GT  ┆ DP  ┆ GQ  │
# │ str   ┆ u32   ┆ str ┆ str ┆ i32  ┆ str ┆ i32 ┆ i32 │
# ╞═══════╪═══════╪═════╪═════╪══════╪═════╪═════╪═════╡
# │ 1     ┆ 10009 ┆ A   ┆ .   ┆ null ┆ 0/0 ┆ 10  ┆ 27  │
# │ 1     ┆ 10015 ┆ A   ┆ .   ┆ null ┆ 0/0 ┆ 17  ┆ 35  │
# └───────┴───────┴─────┴─────┴──────┴─────┴─────┴─────┘

# Multi-sample VCF: FORMAT columns named {sample}_{field}
df = pb.read_vcf("multisample.vcf", format_fields=["GT", "DP"])
print(df.select(["chrom", "start", "NA12878_GT", "NA12878_DP", "NA12879_GT"]))
Source code in polars_bio/io.py
@staticmethod
def read_vcf(
    path: str,
    info_fields: Union[list[str], None] = None,
    format_fields: Union[list[str], None] = None,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = True,
    predicate_pushdown: bool = True,
    use_zero_based: Optional[bool] = None,
) -> pl.DataFrame:
    """
    Read a VCF file into a DataFrame.

    !!! hint "Parallelism & Indexed Reads"
        Indexed parallel reads and predicate pushdown are automatic when a TBI/CSI index
        is present. See [File formats support](/polars-bio/features/#file-formats-support),
        [Indexed reads](/polars-bio/features/#indexed-reads-predicate-pushdown),
        and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details.

    Parameters:
        path: The path to the VCF file.
        info_fields: List of INFO field names to include. If *None*, all INFO fields will be detected automatically from the VCF header. Use this to limit fields for better performance.
        format_fields: List of FORMAT field names to include (per-sample genotype data). If *None*, all FORMAT fields will be automatically detected from the VCF header. Column naming depends on the number of samples: for **single-sample** VCFs, columns are named directly by the FORMAT field (e.g., `GT`, `DP`); for **multi-sample** VCFs, columns are named `{sample_name}_{format_field}` (e.g., `NA12878_GT`, `NA12879_DP`). The GT field is always converted to string with `/` (unphased) or `|` (phased) separator.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
        predicate_pushdown: Enable predicate pushdown using index files (TBI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.vcf.gz.tbi`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
        use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

    !!! note
        By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.

    !!! Example "Reading VCF with INFO and FORMAT fields"
        ```python
        import polars_bio as pb

        # Read VCF with both INFO and FORMAT fields
        df = pb.read_vcf(
            "sample.vcf.gz",
            info_fields=["END"],              # INFO field
            format_fields=["GT", "DP", "GQ"]  # FORMAT fields
        )

        # Single-sample VCF: FORMAT columns named directly (GT, DP, GQ)
        print(df.select(["chrom", "start", "ref", "alt", "END", "GT", "DP", "GQ"]))
        # Output:
        # shape: (10, 8)
        # ┌───────┬───────┬─────┬─────┬──────┬─────┬─────┬─────┐
        # │ chrom ┆ start ┆ ref ┆ alt ┆ END  ┆ GT  ┆ DP  ┆ GQ  │
        # │ str   ┆ u32   ┆ str ┆ str ┆ i32  ┆ str ┆ i32 ┆ i32 │
        # ╞═══════╪═══════╪═════╪═════╪══════╪═════╪═════╪═════╡
        # │ 1     ┆ 10009 ┆ A   ┆ .   ┆ null ┆ 0/0 ┆ 10  ┆ 27  │
        # │ 1     ┆ 10015 ┆ A   ┆ .   ┆ null ┆ 0/0 ┆ 17  ┆ 35  │
        # └───────┴───────┴─────┴─────┴──────┴─────┴─────┴─────┘

        # Multi-sample VCF: FORMAT columns named {sample}_{field}
        df = pb.read_vcf("multisample.vcf", format_fields=["GT", "DP"])
        print(df.select(["chrom", "start", "NA12878_GT", "NA12878_DP", "NA12879_GT"]))
        ```
    """
    lf = IOOperations.scan_vcf(
        path,
        info_fields,
        format_fields,
        chunk_size,
        concurrent_fetches,
        allow_anonymous,
        enable_request_payer,
        max_retries,
        timeout,
        compression_type,
        projection_pushdown,
        predicate_pushdown,
        use_zero_based,
    )
    # Get metadata before collecting (polars-config-meta doesn't preserve through collect)
    zero_based = lf.config_meta.get_metadata().get("coordinate_system_zero_based")
    df = lf.collect()
    # Set metadata on the collected DataFrame
    if zero_based is not None:
        set_coordinate_system(df, zero_based)
    return df

scan_bam(path, tag_fields=None, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, projection_pushdown=True, predicate_pushdown=True, use_zero_based=None) staticmethod

Lazily read a BAM file into a LazyFrame.

Parallelism & Indexed Reads

Indexed parallel reads and predicate pushdown are automatic when a BAI/CSI index is present. See File formats support, Indexed reads, and Automatic parallel partitioning for details.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to the BAM file. | required |
| `tag_fields` | `Union[list[str], None]` | List of BAM tag names to include as columns (e.g., `["NM", "MD", "AS"]`). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode). | `None` |
| `chunk_size` | `int` | The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64. | `8` |
| `concurrent_fetches` | `int` | [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more. | `1` |
| `allow_anonymous` | `bool` | [GCS, AWS S3] Whether to allow anonymous access to object storage. | `True` |
| `enable_request_payer` | `bool` | [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer. | `False` |
| `max_retries` | `int` | The maximum number of retries for reading the file from object storage. | `5` |
| `timeout` | `int` | The timeout in seconds for reading the file from object storage. | `300` |
| `projection_pushdown` | `bool` | Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level. | `True` |
| `predicate_pushdown` | `bool` | Enable predicate pushdown using index files (BAI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.bam.bai`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed. | `True` |
| `use_zero_based` | `Optional[bool]` | If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`. | `None` |

Note

By default, coordinates are output in 1-based closed format. Use use_zero_based=True or set pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True) for 0-based half-open coordinates.
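
A minimal sketch of a lazy BAM scan with optional tags and an indexed region filter. The file name `example.bam` is illustrative, and the `chrom`/`start`/`mapping_quality` column names are assumed to match the alignment schema described for `scan_cram` below.

```python
import polars as pl
import polars_bio as pb

# Hypothetical local BAM; a sibling example.bam.bai enables indexed predicate pushdown.
lf = pb.scan_bam("example.bam", tag_fields=["NM", "AS"])

# Simple equality/comparison predicates can be pushed down to the index;
# everything else is still applied client-side, so results stay correct.
df = (
    lf.filter((pl.col("chrom") == "chr1") & (pl.col("start") < 1_000_000))
    .select(["name", "chrom", "start", "end", "mapping_quality", "NM", "AS"])
    .collect()
)
print(df)
```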

Source code in polars_bio/io.py
@staticmethod
def scan_bam(
    path: str,
    tag_fields: Union[list[str], None] = None,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    projection_pushdown: bool = True,
    predicate_pushdown: bool = True,
    use_zero_based: Optional[bool] = None,
) -> pl.LazyFrame:
    """
    Lazily read a BAM file into a LazyFrame.

    !!! hint "Parallelism & Indexed Reads"
        Indexed parallel reads and predicate pushdown are automatic when a BAI/CSI index
        is present. See [File formats support](/polars-bio/features/#file-formats-support),
        [Indexed reads](/polars-bio/features/#indexed-reads-predicate-pushdown),
        and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details.

    Parameters:
        path: The path to the BAM file.
        tag_fields: List of BAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
        predicate_pushdown: Enable predicate pushdown using index files (BAI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.bam.bai`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
        use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

    !!! note
        By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type="auto",
    )

    zero_based = _resolve_zero_based(use_zero_based)
    bam_read_options = BamReadOptions(
        object_storage_options=object_storage_options,
        zero_based=zero_based,
        tag_fields=tag_fields,
    )
    read_options = ReadOptions(bam_read_options=bam_read_options)
    return _read_file(
        path,
        InputFormat.Bam,
        read_options,
        projection_pushdown,
        predicate_pushdown,
        zero_based=zero_based,
    )

scan_bed(path, thread_num=1, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=True, use_zero_based=None) staticmethod

Lazily read a BED file into a LazyFrame.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to the BED file. | required |
| `thread_num` | `int` | The number of threads to use for reading the BED file. Used only for parallel decompression of BGZF blocks. Works only for local files. | `1` |
| `chunk_size` | `int` | The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64. | `8` |
| `concurrent_fetches` | `int` | [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more. | `1` |
| `allow_anonymous` | `bool` | [GCS, AWS S3] Whether to allow anonymous access to object storage. | `True` |
| `enable_request_payer` | `bool` | [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer. | `False` |
| `max_retries` | `int` | The maximum number of retries for reading the file from object storage. | `5` |
| `timeout` | `int` | The timeout in seconds for reading the file from object storage. | `300` |
| `compression_type` | `str` | The compression type of the BED file. If not specified, it will be detected automatically based on the file extension. BGZF compression is supported ('bgz'). | `'auto'` |
| `projection_pushdown` | `bool` | Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level. | `True` |
| `use_zero_based` | `Optional[bool]` | If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`. | `None` |

Note

Only BED4 format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name. Also unlike other text formats, GZIP compression is not supported.

Note

By default, coordinates are output in 1-based closed format. Use use_zero_based=True or set pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True) for 0-based half-open coordinates.
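
A minimal sketch of reading a local BED4 file; `regions.bed` is an illustrative path, and the exact output column names are assumed to follow the BED4 fields listed above (chromosome, start, end, name).

```python
import polars_bio as pb

# Hypothetical local BED4 file; BGZF-compressed input ('bgz') is also accepted.
lf = pb.scan_bed("regions.bed")

# Request 0-based half-open coordinates explicitly instead of the 1-based default.
lf_zero = pb.scan_bed("regions.bed", use_zero_based=True)

print(lf.limit(5).collect())
```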

Source code in polars_bio/io.py
@staticmethod
def scan_bed(
    path: str,
    thread_num: int = 1,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = True,
    use_zero_based: Optional[bool] = None,
) -> pl.LazyFrame:
    """
    Lazily read a BED file into a LazyFrame.

    Parameters:
        path: The path to the BED file.
        thread_num: The number of threads to use for reading the BED file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the BED file. If not specified, it will be detected automatically based on the file extension. BGZF compression is supported ('bgz').
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
        use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

    !!! Note
        Only **BED4** format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name.
        Also unlike other text formats, **GZIP** compression is not supported.

    !!! note
        By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    zero_based = _resolve_zero_based(use_zero_based)
    bed_read_options = BedReadOptions(
        thread_num=thread_num,
        object_storage_options=object_storage_options,
        zero_based=zero_based,
    )
    read_options = ReadOptions(bed_read_options=bed_read_options)
    return _read_file(
        path,
        InputFormat.Bed,
        read_options,
        projection_pushdown,
        zero_based=zero_based,
    )

scan_cram(path, reference_path=None, tag_fields=None, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, projection_pushdown=True, predicate_pushdown=True, use_zero_based=None) staticmethod

Lazily read a CRAM file into a LazyFrame.

Parallelism & Indexed Reads

Indexed parallel reads and predicate pushdown are automatic when a CRAI index is present. See File formats support, Indexed reads, and Automatic parallel partitioning for details.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to the CRAM file (local or cloud storage: S3, GCS, Azure Blob). | required |
| `reference_path` | `str` | Optional path to external FASTA reference file (local path only, cloud storage not supported). If not provided, the CRAM file must contain embedded reference sequences. The FASTA file must have an accompanying index file (.fai) in the same directory. Create the index using: `samtools faidx reference.fasta` | `None` |
| `tag_fields` | `Union[list[str], None]` | List of CRAM tag names to include as columns (e.g., `["NM", "MD", "AS"]`). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode). | `None` |
| `chunk_size` | `int` | The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64. | `8` |
| `concurrent_fetches` | `int` | [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more. | `1` |
| `allow_anonymous` | `bool` | [GCS, AWS S3] Whether to allow anonymous access to object storage. | `True` |
| `enable_request_payer` | `bool` | [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer. | `False` |
| `max_retries` | `int` | The maximum number of retries for reading the file from object storage. | `5` |
| `timeout` | `int` | The timeout in seconds for reading the file from object storage. | `300` |
| `projection_pushdown` | `bool` | Enable column projection pushdown optimization. When True, only requested columns are processed at the DataFusion execution level, improving performance and reducing memory usage. | `True` |
| `predicate_pushdown` | `bool` | Enable predicate pushdown using index files (CRAI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.cram.crai`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed. | `True` |
| `use_zero_based` | `Optional[bool]` | If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`. | `None` |

Note

By default, coordinates are output in 1-based closed format. Use use_zero_based=True or set pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True) for 0-based half-open coordinates.

Known Limitation: MD and NM Tags

Due to a limitation in the underlying noodles-cram library, MD (mismatch descriptor) and NM (edit distance) tags are not accessible from CRAM files, even when stored in the file. These tags can be seen with samtools but are not exposed through the noodles-cram record.data() interface.

Other optional tags (RG, MQ, AM, OQ, etc.) work correctly. This issue is tracked at: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

Workaround: Use BAM format if MD/NM tags are required for your analysis.

Using External Reference

import polars_bio as pb

# Lazy scan CRAM with external reference
lf = pb.scan_cram(
    "/path/to/file.cram",
    reference_path="/path/to/reference.fasta"
)

# Apply transformations and collect
df = lf.filter(pl.col("chrom") == "chr1").collect()

Public CRAM File Example

Download and read a public CRAM file from 42basepairs:

# Download the CRAM file and reference
wget https://42basepairs.com/download/s3/gatk-test-data/wgs_cram/NA12878_20k_hg38/NA12878.cram
wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta

# Create FASTA index (required)
samtools faidx Homo_sapiens_assembly38.fasta

import polars_bio as pb
import polars as pl

# Lazy scan and filter for chromosome 20 reads
df = pb.scan_cram(
    "NA12878.cram",
    reference_path="Homo_sapiens_assembly38.fasta"
).filter(
    pl.col("chrom") == "chr20"
).select(
    ["name", "chrom", "start", "end", "mapping_quality"]
).limit(10).collect()

print(df)

Creating CRAM with Embedded Reference

To create a CRAM file with embedded reference using samtools:

samtools view -C -o output.cram --output-fmt-option embed_ref=1 input.bam

Returns:

`LazyFrame` — A Polars LazyFrame with the following schema:

- name: Read name (String)
- chrom: Chromosome/contig name (String)
- start: Alignment start position, 1-based (UInt32)
- end: Alignment end position, 1-based (UInt32)
- flags: SAM flags (UInt32)
- cigar: CIGAR string (String)
- mapping_quality: Mapping quality (UInt32)
- mate_chrom: Mate chromosome/contig name (String)
- mate_start: Mate alignment start position, 1-based (UInt32)
- sequence: Read sequence (String)
- quality_scores: Base quality scores (String)

Source code in polars_bio/io.py
@staticmethod
def scan_cram(
    path: str,
    reference_path: str = None,
    tag_fields: Union[list[str], None] = None,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    projection_pushdown: bool = True,
    predicate_pushdown: bool = True,
    use_zero_based: Optional[bool] = None,
) -> pl.LazyFrame:
    """
    Lazily read a CRAM file into a LazyFrame.

    !!! hint "Parallelism & Indexed Reads"
        Indexed parallel reads and predicate pushdown are automatic when a CRAI index
        is present. See [File formats support](/polars-bio/features/#file-formats-support),
        [Indexed reads](/polars-bio/features/#indexed-reads-predicate-pushdown),
        and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details.

    Parameters:
        path: The path to the CRAM file (local or cloud storage: S3, GCS, Azure Blob).
        reference_path: Optional path to external FASTA reference file (**local path only**, cloud storage not supported). If not provided, the CRAM file must contain embedded reference sequences. The FASTA file must have an accompanying index file (.fai) in the same directory. Create the index using: `samtools faidx reference.fasta`
        tag_fields: List of CRAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries: The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        projection_pushdown: Enable column projection pushdown optimization. When True, only requested columns are processed at the DataFusion execution level, improving performance and reducing memory usage.
        predicate_pushdown: Enable predicate pushdown using index files (CRAI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.cram.crai`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
        use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

    !!! note
        By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.

    !!! warning "Known Limitation: MD and NM Tags"
        Due to a limitation in the underlying noodles-cram library, **MD (mismatch descriptor) and NM (edit distance) tags are not accessible** from CRAM files, even when stored in the file. These tags can be seen with samtools but are not exposed through the noodles-cram record.data() interface.

        Other optional tags (RG, MQ, AM, OQ, etc.) work correctly. This issue is tracked at: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

        **Workaround**: Use BAM format if MD/NM tags are required for your analysis.

    !!! example "Using External Reference"
        ```python
        import polars_bio as pb

        # Lazy scan CRAM with external reference
        lf = pb.scan_cram(
            "/path/to/file.cram",
            reference_path="/path/to/reference.fasta"
        )

        # Apply transformations and collect
        df = lf.filter(pl.col("chrom") == "chr1").collect()
        ```

    !!! example "Public CRAM File Example"
        Download and read a public CRAM file from 42basepairs:
        ```bash
        # Download the CRAM file and reference
        wget https://42basepairs.com/download/s3/gatk-test-data/wgs_cram/NA12878_20k_hg38/NA12878.cram
        wget https://storage.googleapis.com/genomics-public-data/resources/broad/hg38/v0/Homo_sapiens_assembly38.fasta

        # Create FASTA index (required)
        samtools faidx Homo_sapiens_assembly38.fasta
        ```

        ```python
        import polars_bio as pb
        import polars as pl

        # Lazy scan and filter for chromosome 20 reads
        df = pb.scan_cram(
            "NA12878.cram",
            reference_path="Homo_sapiens_assembly38.fasta"
        ).filter(
            pl.col("chrom") == "chr20"
        ).select(
            ["name", "chrom", "start", "end", "mapping_quality"]
        ).limit(10).collect()

        print(df)
        ```

    !!! example "Creating CRAM with Embedded Reference"
        To create a CRAM file with embedded reference using samtools:
        ```bash
        samtools view -C -o output.cram --output-fmt-option embed_ref=1 input.bam
        ```

    Returns:
        A Polars LazyFrame with the following schema:
            - name: Read name (String)
            - chrom: Chromosome/contig name (String)
            - start: Alignment start position, 1-based (UInt32)
            - end: Alignment end position, 1-based (UInt32)
            - flags: SAM flags (UInt32)
            - cigar: CIGAR string (String)
            - mapping_quality: Mapping quality (UInt32)
            - mate_chrom: Mate chromosome/contig name (String)
            - mate_start: Mate alignment start position, 1-based (UInt32)
            - sequence: Read sequence (String)
            - quality_scores: Base quality scores (String)
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type="auto",
    )

    zero_based = _resolve_zero_based(use_zero_based)
    cram_read_options = CramReadOptions(
        reference_path=reference_path,
        object_storage_options=object_storage_options,
        zero_based=zero_based,
        tag_fields=tag_fields,
    )
    read_options = ReadOptions(cram_read_options=cram_read_options)
    return _read_file(
        path,
        InputFormat.Cram,
        read_options,
        projection_pushdown,
        predicate_pushdown,
        zero_based=zero_based,
    )

scan_fasta(path, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=True) staticmethod

Lazily read a FASTA file into a LazyFrame.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to the FASTA file. | required |
| `chunk_size` | `int` | The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64. | `8` |
| `concurrent_fetches` | `int` | [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more. | `1` |
| `allow_anonymous` | `bool` | [GCS, AWS S3] Whether to allow anonymous access to object storage. | `True` |
| `enable_request_payer` | `bool` | [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer. | `False` |
| `max_retries` | `int` | The maximum number of retries for reading the file from object storage. | `5` |
| `timeout` | `int` | The timeout in seconds for reading the file from object storage. | `300` |
| `compression_type` | `str` | The compression type of the FASTA file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz'). | `'auto'` |
| `projection_pushdown` | `bool` | Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level. | `True` |

Example

wget https://www.ebi.ac.uk/ena/browser/api/fasta/BK006935.2?download=true -O /tmp/test.fasta

import polars_bio as pb
pb.scan_fasta("/tmp/test.fasta").limit(1).collect()
 shape: (1, 3)
┌─────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
│ name                    ┆ description                     ┆ sequence                        │
│ ---                     ┆ ---                             ┆ ---                             │
│ str                     ┆ str                             ┆ str                             │
╞═════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
│ ENA|BK006935|BK006935.2 ┆ TPA_inf: Saccharomyces cerevis… ┆ CCACACCACACCCACACACCCACACACCAC… │
└─────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘

Source code in polars_bio/io.py
@staticmethod
def scan_fasta(
    path: str,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = True,
) -> pl.LazyFrame:
    """

    Lazily read a FASTA file into a LazyFrame.

    Parameters:
        path: The path to the FASTA file.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the FASTA file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

    !!! Example
        ```shell
        wget https://www.ebi.ac.uk/ena/browser/api/fasta/BK006935.2?download=true -O /tmp/test.fasta
        ```

        ```python
        import polars_bio as pb
        pb.scan_fasta("/tmp/test.fasta").limit(1).collect()
        ```
        ```shell
         shape: (1, 3)
        ┌─────────────────────────┬─────────────────────────────────┬─────────────────────────────────┐
        │ name                    ┆ description                     ┆ sequence                        │
        │ ---                     ┆ ---                             ┆ ---                             │
        │ str                     ┆ str                             ┆ str                             │
        ╞═════════════════════════╪═════════════════════════════════╪═════════════════════════════════╡
        │ ENA|BK006935|BK006935.2 ┆ TPA_inf: Saccharomyces cerevis… ┆ CCACACCACACCCACACACCCACACACCAC… │
        └─────────────────────────┴─────────────────────────────────┴─────────────────────────────────┘
        ```
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )
    fasta_read_options = FastaReadOptions(
        object_storage_options=object_storage_options
    )
    read_options = ReadOptions(fasta_read_options=fasta_read_options)
    return _read_file(path, InputFormat.Fasta, read_options, projection_pushdown)

scan_fastq(path, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=True) staticmethod

Lazily read a FASTQ file into a LazyFrame.

Parallelism & Compression

See File formats support, Compression, and Automatic parallel partitioning for details on parallel reads and supported compression types.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to the FASTQ file. | required |
| `chunk_size` | `int` | The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64. | `8` |
| `concurrent_fetches` | `int` | [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more. | `1` |
| `allow_anonymous` | `bool` | [GCS, AWS S3] Whether to allow anonymous access to object storage. | `True` |
| `enable_request_payer` | `bool` | [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer. | `False` |
| `max_retries` | `int` | The maximum number of retries for reading the file from object storage. | `5` |
| `timeout` | `int` | The timeout in seconds for reading the file from object storage. | `300` |
| `compression_type` | `str` | The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz'). | `'auto'` |
| `projection_pushdown` | `bool` | Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level. | `True` |
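
A minimal sketch of lazy FASTQ scanning; the file path and bucket name are illustrative, and the object-store settings follow the recommendations above for large-scale reads.

```python
import polars_bio as pb

# Hypothetical local file; compression ('gz'/'bgz') is detected from the extension.
lf = pb.scan_fastq("reads.fastq.gz")
print(lf.limit(2).collect())

# For object stores, larger chunks and more concurrent fetches usually help.
lf_remote = pb.scan_fastq(
    "gs://my-bucket/reads.fastq.bgz",  # illustrative bucket/path
    chunk_size=64,
    concurrent_fetches=8,
)
```
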
Source code in polars_bio/io.py
@staticmethod
def scan_fastq(
    path: str,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = True,
) -> pl.LazyFrame:
    """
    Lazily read a FASTQ file into a LazyFrame.

    !!! hint "Parallelism & Compression"
        See [File formats support](/polars-bio/features/#file-formats-support),
        [Compression](/polars-bio/features/#compression),
        and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details on parallel reads and supported compression types.

    Parameters:
        path: The path to the FASTQ file.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compressions are supported ('bgz', 'gz').
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    fastq_read_options = FastqReadOptions(
        object_storage_options=object_storage_options,
    )
    read_options = ReadOptions(fastq_read_options=fastq_read_options)
    return _read_file(path, InputFormat.Fastq, read_options, projection_pushdown)

scan_gff(path, attr_fields=None, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=True, predicate_pushdown=True, use_zero_based=None) staticmethod

Lazily read a GFF file into a LazyFrame.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to the GFF file. | required |
| `attr_fields` | `Union[list[str], None]` | List of attribute field names to extract as separate columns. If None, attributes will be kept as a nested structure. Use this to extract specific attributes like 'ID', 'gene_name', 'gene_type', etc. as direct columns for easier access. | `None` |
| `chunk_size` | `int` | The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64. | `8` |
| `concurrent_fetches` | `int` | [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large-scale operations, it is recommended to increase this value to 8 or even more. | `1` |
| `allow_anonymous` | `bool` | [GCS, AWS S3] Whether to allow anonymous access to object storage. | `True` |
| `enable_request_payer` | `bool` | [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer. | `False` |
| `max_retries` | `int` | The maximum number of retries for reading the file from object storage. | `5` |
| `timeout` | `int` | The timeout in seconds for reading the file from object storage. | `300` |
| `compression_type` | `str` | The compression type of the GFF file. If not specified, it will be detected automatically. | `'auto'` |
| `projection_pushdown` | `bool` | Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level. | `True` |
| `predicate_pushdown` | `bool` | Enable predicate pushdown using index files (TBI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.gff.gz.tbi`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed. | `True` |
| `use_zero_based` | `Optional[bool]` | If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`. | `None` |

Note

By default, coordinates are output in 1-based closed format. Use use_zero_based=True or set pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True) for 0-based half-open coordinates.
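
A minimal sketch of a lazy GFF scan with selected attributes and an indexed region filter; the file name is illustrative, and the `chrom` column name is assumed.

```python
import polars as pl
import polars_bio as pb

# Hypothetical bgzip-compressed GFF with a sibling gencode.gff3.gz.tbi index.
lf = pb.scan_gff(
    "gencode.gff3.gz",
    attr_fields=["ID", "gene_name", "gene_type"],  # extracted as direct columns
)

# A simple equality predicate can be pushed down to the TBI/CSI index.
df = lf.filter(pl.col("chrom") == "chr21").limit(5).collect()
print(df)
```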

Source code in polars_bio/io.py
@staticmethod
def scan_gff(
    path: str,
    attr_fields: Union[list[str], None] = None,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = True,
    predicate_pushdown: bool = True,
    use_zero_based: Optional[bool] = None,
) -> pl.LazyFrame:
    """
    Lazily read a GFF file into a LazyFrame.

    Parameters:
        path: The path to the GFF file.
        attr_fields: List of attribute field names to extract as separate columns. If *None*, attributes will be kept as a nested structure. Use this to extract specific attributes like 'ID', 'gene_name', 'gene_type', etc. as direct columns for easier access.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large-scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the GFF file. If not specified, it will be detected automatically.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
        predicate_pushdown: Enable predicate pushdown using index files (TBI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.gff.gz.tbi`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
        use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

    !!! note
        By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    zero_based = _resolve_zero_based(use_zero_based)
    gff_read_options = GffReadOptions(
        attr_fields=attr_fields,
        object_storage_options=object_storage_options,
        zero_based=zero_based,
    )
    read_options = ReadOptions(gff_read_options=gff_read_options)
    return _read_file(
        path,
        InputFormat.Gff,
        read_options,
        projection_pushdown,
        predicate_pushdown,
        zero_based=zero_based,
    )

scan_sam(path, tag_fields=None, projection_pushdown=True, use_zero_based=None) staticmethod

Lazily read a SAM file into a LazyFrame.

SAM (Sequence Alignment/Map) is the plain-text counterpart of BAM. This function reuses the BAM reader, which auto-detects the format from the file extension.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to the SAM file. | required |
| `tag_fields` | `Union[list[str], None]` | List of SAM tag names to include as columns (e.g., `["NM", "MD", "AS"]`). If None, no optional tags are parsed (default). | `None` |
| `projection_pushdown` | `bool` | Enable column projection pushdown to optimize query performance. | `True` |
| `use_zero_based` | `Optional[bool]` | If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration. | `None` |

Note

By default, coordinates are output in 1-based closed format.
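
A minimal sketch of a lazy SAM scan; the file name is illustrative, and the `mapping_quality` column name is assumed to match the alignment schema used by the BAM reader.

```python
import polars as pl
import polars_bio as pb

# Hypothetical local SAM file; optional tags are parsed only when requested.
lf = pb.scan_sam("example.sam", tag_fields=["NM"])
df = lf.filter(pl.col("mapping_quality") >= 30).limit(5).collect()
print(df)
```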

Source code in polars_bio/io.py
@staticmethod
def scan_sam(
    path: str,
    tag_fields: Union[list[str], None] = None,
    projection_pushdown: bool = True,
    use_zero_based: Optional[bool] = None,
) -> pl.LazyFrame:
    """
    Lazily read a SAM file into a LazyFrame.

    SAM (Sequence Alignment/Map) is the plain-text counterpart of BAM.
    This function reuses the BAM reader, which auto-detects the format
    from the file extension.

    Parameters:
        path: The path to the SAM file.
        tag_fields: List of SAM tag names to include as columns (e.g., ["NM", "MD", "AS"]).
            If None, no optional tags are parsed (default).
        projection_pushdown: Enable column projection pushdown to optimize query performance.
        use_zero_based: If True, output 0-based half-open coordinates.
            If False, output 1-based closed coordinates.
            If None (default), uses the global configuration.

    !!! note
        By default, coordinates are output in **1-based closed** format.
    """
    zero_based = _resolve_zero_based(use_zero_based)
    bam_read_options = BamReadOptions(
        zero_based=zero_based,
        tag_fields=tag_fields,
    )
    read_options = ReadOptions(bam_read_options=bam_read_options)
    return _read_file(
        path,
        InputFormat.Sam,
        read_options,
        projection_pushdown,
        zero_based=zero_based,
    )

scan_table(path, schema=None, **kwargs) staticmethod

Lazily read a tab-delimited file (e.g. BED) into a Polars LazyFrame. Aims to be compatible with Bioframe's read_table, but faster and lazy. The schema should follow Bioframe's schema format.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to the file. | required |
| `schema` | `Dict` | Schema should follow the Bioframe's schema format. | `None` |
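
A minimal sketch of scanning a tab-delimited interval file; the path is illustrative, and `"bed3"` is assumed to be one of the Bioframe-style schema names accepted by `schema`.

```python
import polars_bio as pb

# With a schema, columns are renamed to the Bioframe names (assumed: chrom, start, end).
lf = pb.scan_table("intervals.tsv", schema="bed3")
print(lf.limit(5).collect())

# Without a schema, Polars' default headerless names are kept (column_1, column_2, ...).
lf_raw = pb.scan_table("intervals.tsv")
```
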
Source code in polars_bio/io.py
@staticmethod
def scan_table(path: str, schema: Dict = None, **kwargs) -> pl.LazyFrame:
    """
     Lazily read a tab-delimited file (e.g. BED) into a Polars LazyFrame.
     Aims to be compatible with Bioframe's [read_table](https://bioframe.readthedocs.io/en/latest/guide-io.html),
     but faster and lazy. The schema should follow Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).

    Parameters:
        path: The path to the file.
        schema: Schema should follow the Bioframe's schema [format](https://github.com/open2c/bioframe/blob/2b685eebef393c2c9e6220dcf550b3630d87518e/bioframe/io/schemas.py#L174).
    """
    df = pl.scan_csv(path, separator="\t", has_header=False, **kwargs)
    if schema is not None:
        columns = SCHEMAS[schema]
        if len(columns) != len(df.collect_schema()):
            raise ValueError(
                f"Schema incompatible with the input. Expected {len(columns)} columns in a schema, got {len(df.collect_schema())} in the input data file. Please provide a valid schema."
            )
        for i, c in enumerate(columns):
            df = df.rename({f"column_{i+1}": c})
    return df

scan_vcf(path, info_fields=None, format_fields=None, chunk_size=8, concurrent_fetches=1, allow_anonymous=True, enable_request_payer=False, max_retries=5, timeout=300, compression_type='auto', projection_pushdown=True, predicate_pushdown=True, use_zero_based=None) staticmethod

Lazily read a VCF file into a LazyFrame.

Parallelism & Indexed Reads

Indexed parallel reads and predicate pushdown are automatic when a TBI/CSI index is present. See File formats support, Indexed reads, and Automatic parallel partitioning for details.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `str` | The path to the VCF file. | required |
| `info_fields` | `Union[list[str], None]` | List of INFO field names to include. If None, all INFO fields will be detected automatically from the VCF header. Use this to limit fields for better performance. | `None` |
| `format_fields` | `Union[list[str], None]` | List of FORMAT field names to include (per-sample genotype data). If None, all FORMAT fields will be automatically detected from the VCF header. Column naming depends on the number of samples: for single-sample VCFs, columns are named directly by the FORMAT field (e.g., `GT`, `DP`); for multi-sample VCFs, columns are named `{sample_name}_{format_field}` (e.g., `NA12878_GT`, `NA12879_DP`). The GT field is always converted to string with `/` (unphased) or `|` (phased) separator. | `None` |
| `chunk_size` | `int` | The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64. | `8` |
| `concurrent_fetches` | `int` | [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more. | `1` |
| `allow_anonymous` | `bool` | [GCS, AWS S3] Whether to allow anonymous access to object storage. | `True` |
| `enable_request_payer` | `bool` | [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer. | `False` |
| `max_retries` | `int` | The maximum number of retries for reading the file from object storage. | `5` |
| `timeout` | `int` | The timeout in seconds for reading the file from object storage. | `300` |
| `compression_type` | `str` | The compression type of the VCF file. If not specified, it will be detected automatically. | `'auto'` |
| `projection_pushdown` | `bool` | Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level. | `True` |
| `predicate_pushdown` | `bool` | Enable predicate pushdown using index files (TBI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.vcf.gz.tbi`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed. | `True` |
| `use_zero_based` | `Optional[bool]` | If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`. | `None` |

Note

By default, coordinates are output in 1-based closed format. Use use_zero_based=True or set pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True) for 0-based half-open coordinates.

Lazy scanning VCF with INFO and FORMAT fields

import polars_bio as pb

# Lazily scan VCF with both INFO and FORMAT fields
lf = pb.scan_vcf(
    "sample.vcf.gz",
    info_fields=["END"],              # INFO field
    format_fields=["GT", "DP", "GQ"]  # FORMAT fields
)

# Apply filters and collect only what's needed
df = lf.filter(pl.col("DP") > 20).select(
    ["chrom", "start", "ref", "alt", "GT", "DP", "GQ"]
).collect()

# Single-sample VCF: FORMAT columns named directly (GT, DP, GQ)
# Multi-sample VCF: FORMAT columns named {sample}_{field}
Source code in polars_bio/io.py
@staticmethod
def scan_vcf(
    path: str,
    info_fields: Union[list[str], None] = None,
    format_fields: Union[list[str], None] = None,
    chunk_size: int = 8,
    concurrent_fetches: int = 1,
    allow_anonymous: bool = True,
    enable_request_payer: bool = False,
    max_retries: int = 5,
    timeout: int = 300,
    compression_type: str = "auto",
    projection_pushdown: bool = True,
    predicate_pushdown: bool = True,
    use_zero_based: Optional[bool] = None,
) -> pl.LazyFrame:
    """
    Lazily read a VCF file into a LazyFrame.

    !!! hint "Parallelism & Indexed Reads"
        Indexed parallel reads and predicate pushdown are automatic when a TBI/CSI index
        is present. See [File formats support](/polars-bio/features/#file-formats-support),
        [Indexed reads](/polars-bio/features/#indexed-reads-predicate-pushdown),
        and [Automatic parallel partitioning](/polars-bio/features/#automatic-parallel-partitioning) for details.

    Parameters:
        path: The path to the VCF file.
        info_fields: List of INFO field names to include. If *None*, all INFO fields will be detected automatically from the VCF header. Use this to limit fields for better performance.
        format_fields: List of FORMAT field names to include (per-sample genotype data). If *None*, all FORMAT fields will be automatically detected from the VCF header. Column naming depends on the number of samples: for **single-sample** VCFs, columns are named directly by the FORMAT field (e.g., `GT`, `DP`); for **multi-sample** VCFs, columns are named `{sample_name}_{format_field}` (e.g., `NA12878_GT`, `NA12879_DP`). The GT field is always converted to string with `/` (unphased) or `|` (phased) separator.
        chunk_size: The size in MB of a chunk when reading from an object store. The default is 8 MB. For large scale operations, it is recommended to increase this value to 64.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. The default is 1. For large scale operations, it is recommended to increase this value to 8 or even more.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
        projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.
        predicate_pushdown: Enable predicate pushdown using index files (TBI/CSI) for efficient region-based filtering. Index files are auto-discovered (e.g., `file.vcf.gz.tbi`). Only simple predicates are pushed down (equality, comparisons, IN); complex predicates like `.str.contains()` or OR logic are filtered client-side. Correctness is always guaranteed.
        use_zero_based: If True, output 0-based half-open coordinates. If False, output 1-based closed coordinates. If None (default), uses the global configuration `datafusion.bio.coordinate_system_zero_based`.

    !!! note
        By default, coordinates are output in **1-based closed** format. Use `use_zero_based=True` or set `pb.set_option(pb.POLARS_BIO_COORDINATE_SYSTEM_ZERO_BASED, True)` for 0-based half-open coordinates.

    !!! Example "Lazy scanning VCF with INFO and FORMAT fields"
        ```python
        import polars_bio as pb

        # Lazily scan VCF with both INFO and FORMAT fields
        lf = pb.scan_vcf(
            "sample.vcf.gz",
            info_fields=["END"],              # INFO field
            format_fields=["GT", "DP", "GQ"]  # FORMAT fields
        )

        # Apply filters and collect only what's needed
        df = lf.filter(pl.col("DP") > 20).select(
            ["chrom", "start", "ref", "alt", "GT", "DP", "GQ"]
        ).collect()

        # Single-sample VCF: FORMAT columns named directly (GT, DP, GQ)
        # Multi-sample VCF: FORMAT columns named {sample}_{field}
        ```
    """
    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    # Use provided info_fields or autodetect from VCF header
    if info_fields is not None:
        initial_info_fields = info_fields
    else:
        # Get all info fields from VCF header for proper projection pushdown
        all_info_fields = None
        try:
            vcf_schema_df = IOOperations.describe_vcf(
                path,
                allow_anonymous=allow_anonymous,
                enable_request_payer=enable_request_payer,
                compression_type=compression_type,
            )
            # Use column name 'name' not 'id' based on the schema output
            all_info_fields = vcf_schema_df.select("name").to_series().to_list()
        except Exception:
            # Fallback to None if unable to get info fields
            all_info_fields = None

        # Always start with all info fields to establish full schema
        # The callback will re-register with only requested info fields for optimization
        initial_info_fields = all_info_fields

    zero_based = _resolve_zero_based(use_zero_based)
    vcf_read_options = VcfReadOptions(
        info_fields=initial_info_fields,
        format_fields=format_fields,
        object_storage_options=object_storage_options,
        zero_based=zero_based,
    )
    read_options = ReadOptions(vcf_read_options=vcf_read_options)
    return _read_file(
        path,
        InputFormat.Vcf,
        read_options,
        projection_pushdown,
        predicate_pushdown,
        zero_based=zero_based,
    )

sink_bam(lf, path, sort_on_write=False) staticmethod

Streaming write a LazyFrame to BAM/SAM format.

For CRAM format, use sink_cram() instead.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `lf` | `LazyFrame` | LazyFrame to write | required |
| `path` | `str` | Output file path (.bam or .sam) | required |
| `sort_on_write` | `bool` | If True, sort records by (chrom, start) and set header SO:coordinate. If False (default), set header SO:unsorted. | `False` |

Streaming write BAM

import polars_bio as pb
lf = pb.scan_bam("input.bam").filter(pl.col("mapping_quality") > 20)
pb.sink_bam(lf, "filtered.bam")
Source code in polars_bio/io.py
@staticmethod
def sink_bam(
    lf: pl.LazyFrame,
    path: str,
    sort_on_write: bool = False,
) -> None:
    """
    Streaming write a LazyFrame to BAM/SAM format.

    For CRAM format, use `sink_cram()` instead.

    Parameters:
        lf: LazyFrame to write
        path: Output file path (.bam or .sam)
        sort_on_write: If True, sort records by (chrom, start) and set header SO:coordinate.
            If False (default), set header SO:unsorted.

    !!! Example "Streaming write BAM"
        ```python
        import polars_bio as pb
        lf = pb.scan_bam("input.bam").filter(pl.col("mapping_quality") > 20)
        pb.sink_bam(lf, "filtered.bam")
        ```
    """
    _write_bam_file(lf, path, OutputFormat.Bam, None, sort_on_write=sort_on_write)
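
A short sketch of `sort_on_write` with `sink_bam`, assuming a hypothetical unsorted input BAM; with `sort_on_write=True` the records are sorted by (chrom, start) and the header is marked SO:coordinate:

```python
import polars as pl
import polars_bio as pb

lf = pb.scan_bam("input.bam").filter(pl.col("mapping_quality") > 20)

# Default: records are written in input order, header SO:unsorted
pb.sink_bam(lf, "filtered.unsorted.bam")

# Sorted output: records sorted by (chrom, start), header SO:coordinate
pb.sink_bam(lf, "filtered.sorted.bam", sort_on_write=True)
```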

sink_cram(lf, path, reference_path, sort_on_write=False) staticmethod

Streaming write a LazyFrame to CRAM format.

CRAM uses reference-based compression, storing only differences from the reference sequence. This method streams data without materializing all rows in memory.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `lf` | `LazyFrame` | LazyFrame to write | required |
| `path` | `str` | Output CRAM file path | required |
| `reference_path` | `str` | Path to reference FASTA file (required). The reference must contain all sequences referenced by the alignment data. | required |
| `sort_on_write` | `bool` | If True, sort records by (chrom, start) and set header SO:coordinate. If False (default), set header SO:unsorted. | `False` |

Known Limitation: MD and NM Tags

Due to a limitation in the underlying noodles-cram library, MD and NM tags cannot be read back from CRAM files after writing, even though they are written to the file. If you need MD/NM tags for downstream analysis, use BAM format instead. Other optional tags (RG, MQ, AM, OQ, AS, etc.) work correctly. See: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

Streaming write CRAM

```python
import polars_bio as pb
import polars as pl

lf = pb.scan_bam("large_input.bam")
lf = lf.filter(pl.col("mapping_quality") > 30)

# Write CRAM with reference (required)
pb.sink_cram(lf, "filtered.cram", reference_path="reference.fasta")

# For sorted output
pb.sink_cram(lf, "filtered.cram", reference_path="reference.fasta", sort_on_write=True)
```
Source code in polars_bio/io.py
@staticmethod
def sink_cram(
    lf: pl.LazyFrame,
    path: str,
    reference_path: str,
    sort_on_write: bool = False,
) -> None:
    """
    Streaming write a LazyFrame to CRAM format.

    CRAM uses reference-based compression, storing only differences from the
    reference sequence. This method streams data without materializing all
    rows in memory.

    Parameters:
        lf: LazyFrame to write
        path: Output CRAM file path
        reference_path: Path to reference FASTA file (required). The reference must
            contain all sequences referenced by the alignment data.
        sort_on_write: If True, sort records by (chrom, start) and set header SO:coordinate.
            If False (default), set header SO:unsorted.

    !!! warning "Known Limitation: MD and NM Tags"
        Due to a limitation in the underlying noodles-cram library, **MD and NM tags cannot be read back from CRAM files** after writing, even though they are written to the file. If you need MD/NM tags for downstream analysis, use BAM format instead. Other optional tags (RG, MQ, AM, OQ, AS, etc.) work correctly. See: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

    !!! Example "Streaming write CRAM"
        ```python
        import polars_bio as pb
        import polars as pl

        lf = pb.scan_bam("large_input.bam")
        lf = lf.filter(pl.col("mapping_quality") > 30)

        # Write CRAM with reference (required)
        pb.sink_cram(lf, "filtered.cram", reference_path="reference.fasta")

        # For sorted output
        pb.sink_cram(lf, "filtered.cram", reference_path="reference.fasta", sort_on_write=True)
        ```
    """
    _write_bam_file(
        lf, path, OutputFormat.Cram, reference_path, sort_on_write=sort_on_write
    )

sink_fastq(lf, path) staticmethod

Streaming write a LazyFrame to FASTQ format.

Compression is auto-detected from the file extension.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `lf` | `LazyFrame` | The LazyFrame to write. | required |
| `path` | `str` | The output file path. Compression is auto-detected from extension (.fastq.bgz for BGZF, .fastq.gz for GZIP, .fastq for uncompressed). | required |

Streaming write FASTQ

```python
import polars_bio as pb

# Lazy read, then sink a sample of the records
lf = pb.scan_fastq("large_input.fastq.gz")
pb.sink_fastq(lf.limit(1000), "sample_output.fastq")
```
Source code in polars_bio/io.py
@staticmethod
def sink_fastq(
    lf: pl.LazyFrame,
    path: str,
) -> None:
    """
    Streaming write a LazyFrame to FASTQ format.

    Compression is auto-detected from the file extension.

    Parameters:
        lf: The LazyFrame to write.
        path: The output file path. Compression is auto-detected from extension
              (.fastq.bgz for BGZF, .fastq.gz for GZIP, .fastq for uncompressed).

    !!! Example "Streaming write FASTQ"
        ```python
        import polars_bio as pb

        # Lazy read, filter by quality, then sink
        lf = pb.scan_fastq("large_input.fastq.gz")
        pb.sink_fastq(lf.limit(1000), "sample_output.fastq")
        ```
    """
    _write_file(lf, path, OutputFormat.Fastq)
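
Since the codec is chosen purely from the output extension, the same LazyFrame can be sunk to different compression formats; a sketch with hypothetical file names:

```python
import polars_bio as pb

lf = pb.scan_fastq("large_input.fastq.gz")

pb.sink_fastq(lf, "subset.fastq")      # uncompressed
pb.sink_fastq(lf, "subset.fastq.gz")   # GZIP
pb.sink_fastq(lf, "subset.fastq.bgz")  # BGZF
```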

sink_sam(lf, path, sort_on_write=False) staticmethod

Streaming write a LazyFrame to SAM format (plain text).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `lf` | `LazyFrame` | LazyFrame to write | required |
| `path` | `str` | Output file path (.sam) | required |
| `sort_on_write` | `bool` | If True, sort records by (chrom, start) and set header SO:coordinate. If False (default), set header SO:unsorted. | `False` |

Streaming write SAM

```python
import polars as pl
import polars_bio as pb

lf = pb.scan_bam("input.bam").filter(pl.col("mapping_quality") > 20)
pb.sink_sam(lf, "filtered.sam")
```
Source code in polars_bio/io.py
@staticmethod
def sink_sam(
    lf: pl.LazyFrame,
    path: str,
    sort_on_write: bool = False,
) -> None:
    """
    Streaming write a LazyFrame to SAM format (plain text).

    Parameters:
        lf: LazyFrame to write
        path: Output file path (.sam)
        sort_on_write: If True, sort records by (chrom, start) and set header SO:coordinate.
            If False (default), set header SO:unsorted.

    !!! Example "Streaming write SAM"
        ```python
        import polars_bio as pb
        lf = pb.scan_bam("input.bam").filter(pl.col("mapping_quality") > 20)
        pb.sink_sam(lf, "filtered.sam")
        ```
    """
    _write_bam_file(lf, path, OutputFormat.Sam, None, sort_on_write=sort_on_write)

sink_vcf(lf, path) staticmethod

Streaming write a LazyFrame to VCF format.

This method executes the LazyFrame immediately and writes the results to the specified path. Unlike write_vcf, it doesn't return the row count.

Coordinate system is automatically read from LazyFrame metadata (set during scan_vcf). Compression is auto-detected from the file extension.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `lf` | `LazyFrame` | The LazyFrame to write. | required |
| `path` | `str` | The output file path. Compression is auto-detected from extension (.vcf.bgz for BGZF, .vcf.gz for GZIP, .vcf for uncompressed). | required |

Streaming write VCF

```python
import polars as pl
import polars_bio as pb

# Lazy read and filter, then sink to VCF
lf = pb.scan_vcf("large_input.vcf").filter(pl.col("qual") > 30)
pb.sink_vcf(lf, "filtered_output.vcf.bgz")
```
Source code in polars_bio/io.py
@staticmethod
def sink_vcf(
    lf: pl.LazyFrame,
    path: str,
) -> None:
    """
    Streaming write a LazyFrame to VCF format.

    This method executes the LazyFrame immediately and writes the results
    to the specified path. Unlike `write_vcf`, it doesn't return the row count.

    Coordinate system is automatically read from LazyFrame metadata (set during
    scan_vcf). Compression is auto-detected from the file extension.

    Parameters:
        lf: The LazyFrame to write.
        path: The output file path. Compression is auto-detected from extension
              (.vcf.bgz for BGZF, .vcf.gz for GZIP, .vcf for uncompressed).

    !!! Example "Streaming write VCF"
        ```python
        import polars_bio as pb

        # Lazy read and filter, then sink to VCF
        lf = pb.scan_vcf("large_input.vcf").filter(pl.col("qual") > 30)
        pb.sink_vcf(lf, "filtered_output.vcf.bgz")
        ```
    """
    _write_file(lf, path, OutputFormat.Vcf)
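
A sketch contrasting `sink_vcf` (returns nothing) with `write_vcf` (returns the row count), assuming a hypothetical local input.vcf:

```python
import polars_bio as pb

lf = pb.scan_vcf("input.vcf")

# Streaming sink: no row count is returned
pb.sink_vcf(lf, "out_stream.vcf.bgz")

# Eager write: the number of written rows is returned
n_rows = pb.write_vcf(lf, "out_eager.vcf.bgz")
print(n_rows)
```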

write_bam(df, path, sort_on_write=False) staticmethod

Write a DataFrame to BAM/SAM format.

Compression is auto-detected from file extension:

- .sam → Uncompressed SAM (plain text)
- .bam → BGZF-compressed BAM

For CRAM format, use write_cram() instead.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `df` | `Union[DataFrame, LazyFrame]` | DataFrame or LazyFrame with 11 core BAM columns + optional tag columns | required |
| `path` | `str` | Output file path (.bam or .sam) | required |
| `sort_on_write` | `bool` | If True, sort records by (chrom, start) and set header SO:coordinate. If False (default), set header SO:unsorted. | `False` |

Returns:

| Type | Description |
|------|-------------|
| `int` | Number of rows written |

Write BAM files

```python
import polars_bio as pb
df = pb.read_bam("input.bam", tag_fields=["NM", "AS"])
pb.write_bam(df, "output.bam")
pb.write_bam(df, "output.sam")
```
Source code in polars_bio/io.py
@staticmethod
def write_bam(
    df: Union[pl.DataFrame, pl.LazyFrame],
    path: str,
    sort_on_write: bool = False,
) -> int:
    """
    Write a DataFrame to BAM/SAM format.

    Compression is auto-detected from file extension:
    - .sam → Uncompressed SAM (plain text)
    - .bam → BGZF-compressed BAM

    For CRAM format, use `write_cram()` instead.

    Parameters:
        df: DataFrame or LazyFrame with 11 core BAM columns + optional tag columns
        path: Output file path (.bam or .sam)
        sort_on_write: If True, sort records by (chrom, start) and set header SO:coordinate.
            If False (default), set header SO:unsorted.

    Returns:
        Number of rows written

    !!! Example "Write BAM files"
        ```python
        import polars_bio as pb
        df = pb.read_bam("input.bam", tag_fields=["NM", "AS"])
        pb.write_bam(df, "output.bam")
        pb.write_bam(df, "output.sam")
        ```
    """
    return _write_bam_file(
        df, path, OutputFormat.Bam, None, sort_on_write=sort_on_write
    )

write_cram(df, path, reference_path, sort_on_write=False) staticmethod

Write a DataFrame to CRAM format.

CRAM uses reference-based compression, storing only differences from the reference sequence. This achieves 30-60% better compression than BAM.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `df` | `Union[DataFrame, LazyFrame]` | DataFrame or LazyFrame with 11 core BAM columns + optional tag columns | required |
| `path` | `str` | Output CRAM file path | required |
| `reference_path` | `str` | Path to reference FASTA file (required). The reference must contain all sequences referenced by the alignment data. | required |
| `sort_on_write` | `bool` | If True, sort records by (chrom, start) and set header SO:coordinate. If False (default), set header SO:unsorted. | `False` |

Returns:

| Type | Description |
|------|-------------|
| `int` | Number of rows written |

Known Limitation: MD and NM Tags

Due to a limitation in the underlying noodles-cram library, MD and NM tags cannot be read back from CRAM files after writing, even though they are written to the file. If you need MD/NM tags for downstream analysis, use BAM format instead. Other optional tags (RG, MQ, AM, OQ, AS, etc.) work correctly. See: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

Write CRAM files

```python
import polars_bio as pb

df = pb.read_bam("input.bam", tag_fields=["NM", "AS"])

# Write CRAM with reference (required)
pb.write_cram(df, "output.cram", reference_path="reference.fasta")

# For sorted output
pb.write_cram(df, "output.cram", reference_path="reference.fasta", sort_on_write=True)
```
Source code in polars_bio/io.py
@staticmethod
def write_cram(
    df: Union[pl.DataFrame, pl.LazyFrame],
    path: str,
    reference_path: str,
    sort_on_write: bool = False,
) -> int:
    """
    Write a DataFrame to CRAM format.

    CRAM uses reference-based compression, storing only differences from the
    reference sequence. This achieves 30-60% better compression than BAM.

    Parameters:
        df: DataFrame or LazyFrame with 11 core BAM columns + optional tag columns
        path: Output CRAM file path
        reference_path: Path to reference FASTA file (required). The reference must
            contain all sequences referenced by the alignment data.
        sort_on_write: If True, sort records by (chrom, start) and set header SO:coordinate.
            If False (default), set header SO:unsorted.

    Returns:
        Number of rows written

    !!! warning "Known Limitation: MD and NM Tags"
        Due to a limitation in the underlying noodles-cram library, **MD and NM tags cannot be read back from CRAM files** after writing, even though they are written to the file. If you need MD/NM tags for downstream analysis, use BAM format instead. Other optional tags (RG, MQ, AM, OQ, AS, etc.) work correctly. See: https://github.com/biodatageeks/datafusion-bio-formats/issues/54

    !!! Example "Write CRAM files"
        ```python
        import polars_bio as pb

        df = pb.read_bam("input.bam", tag_fields=["NM", "AS"])

        # Write CRAM with reference (required)
        pb.write_cram(df, "output.cram", reference_path="reference.fasta")

        # For sorted output
        pb.write_cram(df, "output.cram", reference_path="reference.fasta", sort_on_write=True)
        ```
    """
    return _write_bam_file(
        df, path, OutputFormat.Cram, reference_path, sort_on_write=sort_on_write
    )

write_fastq(df, path) staticmethod

Write a DataFrame to FASTQ format.

Compression is auto-detected from the file extension.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `df` | `Union[DataFrame, LazyFrame]` | The DataFrame or LazyFrame to write. Must have columns: `name` (read name/identifier), `sequence` (DNA sequence), `quality_scores` (quality scores string). Optional: `description` (added after name on header line). | required |
| `path` | `str` | The output file path. Compression is auto-detected from extension (.fastq.bgz for BGZF, .fastq.gz for GZIP, .fastq for uncompressed). | required |

Returns:

| Type | Description |
|------|-------------|
| `int` | The number of rows written. |

Writing FASTQ files

```python
import polars_bio as pb

# Read a FASTQ file
df = pb.read_fastq("input.fastq")

# Write to uncompressed FASTQ
pb.write_fastq(df, "output.fastq")

# Write to GZIP-compressed FASTQ
pb.write_fastq(df, "output.fastq.gz")
```
Source code in polars_bio/io.py
@staticmethod
def write_fastq(
    df: Union[pl.DataFrame, pl.LazyFrame],
    path: str,
) -> int:
    """
    Write a DataFrame to FASTQ format.

    Compression is auto-detected from the file extension.

    Parameters:
        df: The DataFrame or LazyFrame to write. Must have columns:
            - name: Read name/identifier
            - sequence: DNA sequence
            - quality_scores: Quality scores string
            Optional: description (added after name on header line)
        path: The output file path. Compression is auto-detected from extension
              (.fastq.bgz for BGZF, .fastq.gz for GZIP, .fastq for uncompressed).

    Returns:
        The number of rows written.

    !!! Example "Writing FASTQ files"
        ```python
        import polars_bio as pb

        # Read a FASTQ file
        df = pb.read_fastq("input.fastq")

        # Write to uncompressed FASTQ
        pb.write_fastq(df, "output.fastq")

        # Write to GZIP-compressed FASTQ
        pb.write_fastq(df, "output.fastq.gz")
        ```
    """
    return _write_file(df, path, OutputFormat.Fastq)
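
A sketch of the required column layout for `write_fastq`, built from an in-memory DataFrame with hypothetical reads; `description` is optional:

```python
import polars as pl
import polars_bio as pb

df = pl.DataFrame(
    {
        "name": ["read1", "read2"],
        "description": ["sample=A", "sample=B"],
        "sequence": ["ACGTACGT", "TTGACCAA"],
        "quality_scores": ["IIIIIIII", "IIIIIIII"],
    }
)

# Writes two FASTQ records; returns the number of rows written (2)
pb.write_fastq(df, "toy.fastq")
```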

write_sam(df, path, sort_on_write=False) staticmethod

Write a DataFrame to SAM format (plain text).

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `df` | `Union[DataFrame, LazyFrame]` | DataFrame or LazyFrame with 11 core BAM/SAM columns + optional tag columns | required |
| `path` | `str` | Output file path (.sam) | required |
| `sort_on_write` | `bool` | If True, sort records by (chrom, start) and set header SO:coordinate. If False (default), set header SO:unsorted. | `False` |

Returns:

| Type | Description |
|------|-------------|
| `int` | Number of rows written |

Write SAM files

```python
import polars_bio as pb
df = pb.read_bam("input.bam", tag_fields=["NM", "AS"])
pb.write_sam(df, "output.sam")
```
Source code in polars_bio/io.py
@staticmethod
def write_sam(
    df: Union[pl.DataFrame, pl.LazyFrame],
    path: str,
    sort_on_write: bool = False,
) -> int:
    """
    Write a DataFrame to SAM format (plain text).

    Parameters:
        df: DataFrame or LazyFrame with 11 core BAM/SAM columns + optional tag columns
        path: Output file path (.sam)
        sort_on_write: If True, sort records by (chrom, start) and set header SO:coordinate.
            If False (default), set header SO:unsorted.

    Returns:
        Number of rows written

    !!! Example "Write SAM files"
        ```python
        import polars_bio as pb
        df = pb.read_bam("input.bam", tag_fields=["NM", "AS"])
        pb.write_sam(df, "output.sam")
        ```
    """
    return _write_bam_file(
        df, path, OutputFormat.Sam, None, sort_on_write=sort_on_write
    )

write_vcf(df, path) staticmethod

Write a DataFrame to VCF format.

Coordinate system is automatically read from DataFrame metadata (set during read_vcf). Compression is auto-detected from the file extension.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `df` | `Union[DataFrame, LazyFrame]` | The DataFrame or LazyFrame to write. | required |
| `path` | `str` | The output file path. Compression is auto-detected from extension (.vcf.bgz for BGZF, .vcf.gz for GZIP, .vcf for uncompressed). | required |

Returns:

| Type | Description |
|------|-------------|
| `int` | The number of rows written. |

Writing VCF files

```python
import polars_bio as pb

# Read a VCF file
df = pb.read_vcf("input.vcf")

# Write to uncompressed VCF
pb.write_vcf(df, "output.vcf")

# Write to BGZF-compressed VCF
pb.write_vcf(df, "output.vcf.bgz")

# Write to GZIP-compressed VCF
pb.write_vcf(df, "output.vcf.gz")
```
Source code in polars_bio/io.py
@staticmethod
def write_vcf(
    df: Union[pl.DataFrame, pl.LazyFrame],
    path: str,
) -> int:
    """
    Write a DataFrame to VCF format.

    Coordinate system is automatically read from DataFrame metadata (set during
    read_vcf). Compression is auto-detected from the file extension.

    Parameters:
        df: The DataFrame or LazyFrame to write.
        path: The output file path. Compression is auto-detected from extension
              (.vcf.bgz for BGZF, .vcf.gz for GZIP, .vcf for uncompressed).

    Returns:
        The number of rows written.

    !!! Example "Writing VCF files"
        ```python
        import polars_bio as pb

        # Read a VCF file
        df = pb.read_vcf("input.vcf")

        # Write to uncompressed VCF
        pb.write_vcf(df, "output.vcf")

        # Write to BGZF-compressed VCF
        pb.write_vcf(df, "output.vcf.bgz")

        # Write to GZIP-compressed VCF
        pb.write_vcf(df, "output.vcf.gz")
        ```
    """
    return _write_file(df, path, OutputFormat.Vcf)

data_processing

Source code in polars_bio/sql.py
class SQL:
    @staticmethod
    def register_vcf(
        path: str,
        name: Union[str, None] = None,
        info_fields: Union[list[str], None] = None,
        chunk_size: int = 64,
        concurrent_fetches: int = 8,
        allow_anonymous: bool = True,
        max_retries: int = 5,
        timeout: int = 300,
        enable_request_payer: bool = False,
        compression_type: str = "auto",
    ) -> None:
        """
        Register a VCF file as a Datafusion table.

        Parameters:
            path: The path to the VCF file.
            name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
            info_fields: List of INFO field names to register. If *None*, all INFO fields will be detected automatically from the VCF header. Use this to limit registration to specific fields for better performance.
            chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
        !!! note
            VCF reader uses **1-based** coordinate system for the `start` and `end` columns.

        !!! Example
              ```python
              import polars_bio as pb
              pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz")
              ```
             ```shell
             INFO:polars_bio:Table: gnomad_v4_1_sv_sites_gz registered for path: /tmp/gnomad.v4.1.sv.sites.vcf.gz
             ```
        !!! tip
            `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the VCF file. As a rule of thumb for large scale operations (reading a whole VCF), it is recommended to keep the default values.
        """

        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        # Use provided info_fields or autodetect from VCF header
        if info_fields is not None:
            all_info_fields = info_fields
        else:
            # Get all info fields from VCF header for automatic field detection
            all_info_fields = None
            try:
                from .io import IOOperations

                vcf_schema_df = IOOperations.describe_vcf(
                    path,
                    allow_anonymous=allow_anonymous,
                    enable_request_payer=enable_request_payer,
                    compression_type=compression_type,
                )
                all_info_fields = vcf_schema_df.select("name").to_series().to_list()
            except Exception:
                # Fallback to empty list if unable to get info fields
                all_info_fields = []

        vcf_read_options = VcfReadOptions(
            info_fields=all_info_fields,
            object_storage_options=object_storage_options,
        )
        read_options = ReadOptions(vcf_read_options=vcf_read_options)
        py_register_table(ctx, path, name, InputFormat.Vcf, read_options)

    @staticmethod
    def register_gff(
        path: str,
        name: Union[str, None] = None,
        chunk_size: int = 64,
        concurrent_fetches: int = 8,
        allow_anonymous: bool = True,
        max_retries: int = 5,
        timeout: int = 300,
        enable_request_payer: bool = False,
        compression_type: str = "auto",
    ) -> None:
        """
        Register a GFF file as a Datafusion table.

        Parameters:
            path: The path to the GFF file.
            name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
            chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            compression_type: The compression type of the GFF file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compression is supported ('bgz' and 'gz').
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
        !!! note
            GFF reader uses **1-based** coordinate system for the `start` and `end` columns.

        !!! Example
            ```shell
            wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gff3.gz -O /tmp/gencode.v38.annotation.gff3.gz
            ```
            ```python
            import polars_bio as pb
            pb.register_gff("/tmp/gencode.v38.annotation.gff3.gz", "gencode_v38_annotation3_bgz")
            pb.sql("SELECT attributes, count(*) AS cnt FROM gencode_v38_annotation3_bgz GROUP BY attributes").limit(5).collect()
            ```
            ```shell

            shape: (5, 2)
            ┌───────────────────┬───────┐
            │ Parent            ┆ cnt   │
            │ ---               ┆ ---   │
            │ str               ┆ i64   │
            ╞═══════════════════╪═══════╡
            │ null              ┆ 60649 │
            │ ENSG00000223972.5 ┆ 2     │
            │ ENST00000456328.2 ┆ 3     │
            │ ENST00000450305.2 ┆ 6     │
            │ ENSG00000227232.5 ┆ 1     │
            └───────────────────┴───────┘

            ```
        !!! tip
            `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the GFF file. As a rule of thumb for large scale operations (reading a whole GFF), it is recommended to keep the default values.
        """

        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        gff_read_options = GffReadOptions(
            attr_fields=None,
            object_storage_options=object_storage_options,
        )
        read_options = ReadOptions(gff_read_options=gff_read_options)
        py_register_table(ctx, path, name, InputFormat.Gff, read_options)

    @staticmethod
    def register_fastq(
        path: str,
        name: Union[str, None] = None,
        chunk_size: int = 64,
        concurrent_fetches: int = 8,
        allow_anonymous: bool = True,
        max_retries: int = 5,
        timeout: int = 300,
        enable_request_payer: bool = False,
        compression_type: str = "auto",
        parallel: bool = False,
    ) -> None:
        """
        Register a FASTQ file as a Datafusion table.

        Parameters:
            path: The path to the FASTQ file.
            name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
            chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            compression_type: The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compression is supported ('bgz' and 'gz').
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
            parallel: Whether to use the parallel reader for BGZF compressed files. Default is False. If a file ends with ".gz" but is actually BGZF, it will attempt the parallel path and fall back to standard if not BGZF.

        !!! Example
            ```python
              import polars_bio as pb
              pb.register_fastq("gs://genomics-public-data/platinum-genomes/fastq/ERR194146.fastq.gz", "test_fastq")
              pb.sql("SELECT name, description FROM test_fastq WHERE name LIKE 'ERR194146%'").limit(5).collect()
            ```

            ```shell

              shape: (5, 2)
            ┌─────────────────────┬─────────────────────────────────┐
            │ name                ┆ description                     │
            │ ---                 ┆ ---                             │
            │ str                 ┆ str                             │
            ╞═════════════════════╪═════════════════════════════════╡
            │ ERR194146.812444541 ┆ HSQ1008:141:D0CC8ACXX:2:1204:1… │
            │ ERR194146.812444542 ┆ HSQ1008:141:D0CC8ACXX:4:1206:1… │
            │ ERR194146.812444543 ┆ HSQ1008:141:D0CC8ACXX:3:2104:5… │
            │ ERR194146.812444544 ┆ HSQ1008:141:D0CC8ACXX:3:2204:1… │
            │ ERR194146.812444545 ┆ HSQ1008:141:D0CC8ACXX:3:1304:3… │
            └─────────────────────┴─────────────────────────────────┘

            ```


        !!! tip
            `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the FASTQ file. As a rule of thumb for large scale operations (reading a whole FASTQ), it is recommended to keep the default values.
        """

        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        fastq_read_options = FastqReadOptions(
            object_storage_options=object_storage_options, parallel=parallel
        )
        read_options = ReadOptions(fastq_read_options=fastq_read_options)
        py_register_table(ctx, path, name, InputFormat.Fastq, read_options)

    @staticmethod
    def register_bed(
        path: str,
        name: Union[str, None] = None,
        thread_num: int = 1,
        chunk_size: int = 64,
        concurrent_fetches: int = 8,
        allow_anonymous: bool = True,
        max_retries: int = 5,
        timeout: int = 300,
        enable_request_payer: bool = False,
        compression_type: str = "auto",
    ) -> None:
        """
        Register a BED file as a Datafusion table.

        Parameters:
            path: The path to the BED file.
            name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
            thread_num: The number of threads to use for reading the BED file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
            chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            compression_type: The compression type of the BED file. If not specified, it will be detected automatically.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.

        !!! Note
            Only **BED4** format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name.
            Also unlike other text formats, **GZIP** compression is not supported.

        !!! Example
            ```shell

             cd /tmp
             wget https://webs.iiitd.edu.in/raghava/humcfs/fragile_site_bed.zip -O fragile_site_bed.zip
             unzip fragile_site_bed.zip -x "__MACOSX/*" "*/.DS_Store"
            ```

            ```python
            import polars_bio as pb
            pb.register_bed("/tmp/fragile_site_bed/chr5_fragile_site.bed", "test_bed")
            b.sql("select * FROM test_bed WHERE name LIKE 'FRA5%'").collect()
            ```

            ```shell

                shape: (8, 4)
                ┌───────┬───────────┬───────────┬───────┐
                │ chrom ┆ start     ┆ end       ┆ name  │
                │ ---   ┆ ---       ┆ ---       ┆ ---   │
                │ str   ┆ u32       ┆ u32       ┆ str   │
                ╞═══════╪═══════════╪═══════════╪═══════╡
                │ chr5  ┆ 28900001  ┆ 42500000  ┆ FRA5A │
                │ chr5  ┆ 92300001  ┆ 98200000  ┆ FRA5B │
                │ chr5  ┆ 130600001 ┆ 136200000 ┆ FRA5C │
                │ chr5  ┆ 92300001  ┆ 93916228  ┆ FRA5D │
                │ chr5  ┆ 18400001  ┆ 28900000  ┆ FRA5E │
                │ chr5  ┆ 98200001  ┆ 109600000 ┆ FRA5F │
                │ chr5  ┆ 168500001 ┆ 180915260 ┆ FRA5G │
                │ chr5  ┆ 50500001  ┆ 63000000  ┆ FRA5H │
                └───────┴───────────┴───────────┴───────┘
            ```


        !!! tip
            `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the BED file. As a rule of thumb for large scale operations (reading a whole BED), it is recommended to keep the default values.
        """

        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type=compression_type,
        )

        bed_read_options = BedReadOptions(
            thread_num=thread_num,
            object_storage_options=object_storage_options,
        )
        read_options = ReadOptions(bed_read_options=bed_read_options)
        py_register_table(ctx, path, name, InputFormat.Bed, read_options)

    @staticmethod
    def register_view(name: str, query: str) -> None:
        """
        Register a query as a Datafusion view. This view can be used in genomic ranges operations,
        such as overlap, nearest, and count_overlaps. It is useful for filtering, transforming, and aggregating data
        prior to the range operation. When combined with the range operation, it can be used to perform complex transformations in a streaming fashion end-to-end.

        Parameters:
            name: The name of the table.
            query: The SQL query.

        !!! Example
              ```python
              import polars_bio as pb
              pb.register_vcf("gs://gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz", "gnomad_sv")
              pb.register_view("v_gnomad_sv", "SELECT replace(chrom,'chr', '') AS chrom, start, end FROM gnomad_sv")
              pb.sql("SELECT * FROM v_gnomad_sv").limit(5).collect()
              ```
              ```shell
                shape: (5, 3)
                ┌───────┬─────────┬─────────┐
                │ chrom ┆ start   ┆ end     │
                │ ---   ┆ ---     ┆ ---     │
                │ str   ┆ u32     ┆ u32     │
                ╞═══════╪═════════╪═════════╡
                │ 21    ┆ 5031905 ┆ 5031905 │
                │ 21    ┆ 5031905 ┆ 5031905 │
                │ 21    ┆ 5031909 ┆ 5031909 │
                │ 21    ┆ 5031911 ┆ 5031911 │
                │ 21    ┆ 5031911 ┆ 5031911 │
                └───────┴─────────┴─────────┘
              ```
        """
        py_register_view(ctx, name, query)

    @staticmethod
    def register_bam(
        path: str,
        name: Union[str, None] = None,
        tag_fields: Union[list[str], None] = None,
        chunk_size: int = 64,
        concurrent_fetches: int = 8,
        allow_anonymous: bool = True,
        max_retries: int = 5,
        timeout: int = 300,
        enable_request_payer: bool = False,
    ) -> None:
        """
        Register a BAM file as a Datafusion table.

        Parameters:
            path: The path to the BAM file.
            name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
            tag_fields: List of BAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).
            chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
        !!! note
            BAM reader uses **1-based** coordinate system for the `start`, `end`, `mate_start`, `mate_end` columns.

        !!! Example

            ```python
            import polars_bio as pb
            pb.register_bam("gs://genomics-public-data/1000-genomes/bam/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam", "HG00096_bam", concurrent_fetches=1, chunk_size=8)
            pb.sql("SELECT chrom, flags FROM HG00096_bam").limit(5).collect()
            ```
            ```shell

                shape: (5, 2)
                ┌───────┬───────┐
                │ chrom ┆ flags │
                │ ---   ┆ ---   │
                │ str   ┆ u32   │
                ╞═══════╪═══════╡
                │ chr1  ┆ 163   │
                │ chr1  ┆ 163   │
                │ chr1  ┆ 99    │
                │ chr1  ┆ 99    │
                │ chr1  ┆ 99    │
                └───────┴───────┘
            ```
        !!! tip
            `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the BAM file. As a rule of thumb for large scale operations (reading a whole BAM), it is recommended to keep the default values.
            For interactively inspecting a schema, it is recommended to decrease `chunk_size` to **8-16** and `concurrent_fetches` to **1-2**.
        """

        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type="auto",
        )

        bam_read_options = BamReadOptions(
            object_storage_options=object_storage_options,
            tag_fields=tag_fields,
        )
        read_options = ReadOptions(bam_read_options=bam_read_options)
        py_register_table(ctx, path, name, InputFormat.Bam, read_options)

    @staticmethod
    def register_sam(
        path: str,
        name: Union[str, None] = None,
        tag_fields: Union[list[str], None] = None,
    ) -> None:
        """
        Register a SAM file as a Datafusion table.

        SAM (Sequence Alignment/Map) is the plain-text counterpart of BAM.
        This function reuses the BAM table provider, which auto-detects
        the format from the file extension.

        Parameters:
            path: The path to the SAM file.
            name: The name of the table. If *None*, the name will be generated automatically from the path.
            tag_fields: List of SAM tag names to include as columns (e.g., ["NM", "MD", "AS"]).
                If None, no optional tags are parsed (default).

        !!! Example
            ```python
            import polars_bio as pb
            pb.register_sam("test.sam", "my_sam")
            pb.sql("SELECT chrom, flags FROM my_sam").limit(5).collect()
            ```
        """
        bam_read_options = BamReadOptions(
            tag_fields=tag_fields,
        )
        read_options = ReadOptions(bam_read_options=bam_read_options)
        py_register_table(ctx, path, name, InputFormat.Sam, read_options)

    @staticmethod
    def register_cram(
        path: str,
        name: Union[str, None] = None,
        tag_fields: Union[list[str], None] = None,
        chunk_size: int = 64,
        concurrent_fetches: int = 8,
        allow_anonymous: bool = True,
        max_retries: int = 5,
        timeout: int = 300,
        enable_request_payer: bool = False,
    ) -> None:
        """
        Register a CRAM file as a Datafusion table.

        !!! warning "Embedded Reference Required"
            Currently, only CRAM files with **embedded reference sequences** are supported.
            CRAM files requiring external reference FASTA files cannot be registered.
            Most modern CRAM files include embedded references by default.

            To create a CRAM file with embedded reference using samtools:
            ```bash
            samtools view -C -o output.cram --output-fmt-option embed_ref=1 input.bam
            ```

        Parameters:
            path: The path to the CRAM file (local or cloud storage: S3, GCS, Azure Blob).
            name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
            tag_fields: List of CRAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).
            chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
            concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
            allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
            enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
            max_retries:  The maximum number of retries for reading the file from object storage.
            timeout: The timeout in seconds for reading the file from object storage.
        !!! note
            CRAM reader uses **1-based** coordinate system for the `start`, `end`, `mate_start`, `mate_end` columns.

        !!! tip
            `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the CRAM file. As a rule of thumb for large scale operations (reading a whole CRAM), it is recommended to keep the default values.
            For interactively inspecting a schema, it is recommended to decrease `chunk_size` to **8-16** and `concurrent_fetches` to **1-2**.
        """

        object_storage_options = PyObjectStorageOptions(
            allow_anonymous=allow_anonymous,
            enable_request_payer=enable_request_payer,
            chunk_size=chunk_size,
            concurrent_fetches=concurrent_fetches,
            max_retries=max_retries,
            timeout=timeout,
            compression_type="auto",
        )

        cram_read_options = CramReadOptions(
            reference_path=None,
            object_storage_options=object_storage_options,
            tag_fields=tag_fields,
        )
        read_options = ReadOptions(cram_read_options=cram_read_options)
        py_register_table(ctx, path, name, InputFormat.Cram, read_options)

    @staticmethod
    def sql(query: str) -> pl.LazyFrame:
        """
        Execute a SQL query on the registered tables.

        Parameters:
            query: The SQL query.

        !!! Example
              ```python
              import polars_bio as pb
              pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_v4_1_sv")
              pb.sql("SELECT * FROM gnomad_v4_1_sv LIMIT 5").collect()
              ```
        """
        df = py_read_sql(ctx, query)
        return _lazy_scan(df)
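
A short end-to-end sketch of the SQL interface above, combining `register_vcf`, `register_view`, and `sql`; the local file path is hypothetical:

```python
import polars_bio as pb

# Hypothetical local copy of a gnomAD VCF
pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_sv")

# Pre-filter and trim columns in a view, then query it lazily
pb.register_view(
    "v_gnomad_sv",
    "SELECT chrom, start, end FROM gnomad_sv WHERE chrom = 'chr21'",
)
pb.sql("SELECT count(*) AS cnt FROM v_gnomad_sv").collect()
```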

register_bam(path, name=None, tag_fields=None, chunk_size=64, concurrent_fetches=8, allow_anonymous=True, max_retries=5, timeout=300, enable_request_payer=False) staticmethod

Register a BAM file as a Datafusion table.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `path` | `str` | The path to the BAM file. | required |
| `name` | `Union[str, None]` | The name of the table. If None, the name of the table will be generated automatically based on the path. | `None` |
| `tag_fields` | `Union[list[str], None]` | List of BAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode). | `None` |
| `chunk_size` | `int` | The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 8-16. | `64` |
| `concurrent_fetches` | `int` | [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 1-2. | `8` |
| `allow_anonymous` | `bool` | [GCS, AWS S3] Whether to allow anonymous access to object storage. | `True` |
| `enable_request_payer` | `bool` | [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer. | `False` |
| `max_retries` | `int` | The maximum number of retries for reading the file from object storage. | `5` |
| `timeout` | `int` | The timeout in seconds for reading the file from object storage. | `300` |

Note

BAM reader uses 1-based coordinate system for the start, end, mate_start, mate_end columns.

Example

```python
import polars_bio as pb
pb.register_bam("gs://genomics-public-data/1000-genomes/bam/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam", "HG00096_bam", concurrent_fetches=1, chunk_size=8)
pb.sql("SELECT chrom, flags FROM HG00096_bam").limit(5).collect()
```
```shell
shape: (5, 2)
┌───────┬───────┐
│ chrom ┆ flags │
│ ---   ┆ ---   │
│ str   ┆ u32   │
╞═══════╪═══════╡
│ chr1  ┆ 163   │
│ chr1  ┆ 163   │
│ chr1  ┆ 99    │
│ chr1  ┆ 99    │
│ chr1  ┆ 99    │
└───────┴───────┘
```

Tip

chunk_size and concurrent_fetches can be adjusted according to the network bandwidth and the size of the BAM file. As a rule of thumb for large scale operations (reading a whole BAM), it is recommended to keep the default values. For interactively inspecting a schema, it is recommended to decrease chunk_size to 8-16 and concurrent_fetches to 1-2.

Source code in polars_bio/sql.py
@staticmethod
def register_bam(
    path: str,
    name: Union[str, None] = None,
    tag_fields: Union[list[str], None] = None,
    chunk_size: int = 64,
    concurrent_fetches: int = 8,
    allow_anonymous: bool = True,
    max_retries: int = 5,
    timeout: int = 300,
    enable_request_payer: bool = False,
) -> None:
    """
    Register a BAM file as a Datafusion table.

    Parameters:
        path: The path to the BAM file.
        name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
        tag_fields: List of BAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).
        chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
    !!! note
        BAM reader uses **1-based** coordinate system for the `start`, `end`, `mate_start`, `mate_end` columns.

    !!! Example

        ```python
        import polars_bio as pb
        pb.register_bam("gs://genomics-public-data/1000-genomes/bam/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam", "HG00096_bam", concurrent_fetches=1, chunk_size=8)
        pb.sql("SELECT chrom, flags FROM HG00096_bam").limit(5).collect()
        ```
        ```shell

            shape: (5, 2)
            ┌───────┬───────┐
            │ chrom ┆ flags │
            │ ---   ┆ ---   │
            │ str   ┆ u32   │
            ╞═══════╪═══════╡
            │ chr1  ┆ 163   │
            │ chr1  ┆ 163   │
            │ chr1  ┆ 99    │
            │ chr1  ┆ 99    │
            │ chr1  ┆ 99    │
            └───────┴───────┘
        ```
    !!! tip
        `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the BAM file. As a rule of thumb for large scale operations (reading a whole BAM), it is recommended to keep the default values.
        For interactively inspecting a schema, it is recommended to decrease `chunk_size` to **8-16** and `concurrent_fetches` to **1-2**.
    """

    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type="auto",
    )

    bam_read_options = BamReadOptions(
        object_storage_options=object_storage_options,
        tag_fields=tag_fields,
    )
    read_options = ReadOptions(bam_read_options=bam_read_options)
    py_register_table(ctx, path, name, InputFormat.Bam, read_options)
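
A sketch of `tag_fields` in combination with SQL, assuming a hypothetical local BAM that carries NM and MD tags; the requested tags become ordinary columns that the query can filter on:

```python
import polars_bio as pb

pb.register_bam("input.bam", "my_bam", tag_fields=["NM", "MD"])
pb.sql("SELECT chrom, start, NM, MD FROM my_bam WHERE NM <= 2").limit(5).collect()
```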

register_bed(path, name=None, thread_num=1, chunk_size=64, concurrent_fetches=8, allow_anonymous=True, max_retries=5, timeout=300, enable_request_payer=False, compression_type='auto') staticmethod

Register a BED file as a Datafusion table.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| `path` | `str` | The path to the BED file. | required |
| `name` | `Union[str, None]` | The name of the table. If None, the name of the table will be generated automatically based on the path. | `None` |
| `thread_num` | `int` | The number of threads to use for reading the BED file. Used only for parallel decompression of BGZF blocks. Works only for local files. | `1` |
| `chunk_size` | `int` | The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 8-16. | `64` |
| `concurrent_fetches` | `int` | [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 1-2. | `8` |
| `allow_anonymous` | `bool` | [GCS, AWS S3] Whether to allow anonymous access to object storage. | `True` |
| `enable_request_payer` | `bool` | [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer. | `False` |
| `compression_type` | `str` | The compression type of the BED file. If not specified, it will be detected automatically. | `'auto'` |
| `max_retries` | `int` | The maximum number of retries for reading the file from object storage. | `5` |
| `timeout` | `int` | The timeout in seconds for reading the file from object storage. | `300` |

Note

Only BED4 format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name. Also, unlike other text formats, GZIP compression is not supported.

Example

 cd /tmp
 wget https://webs.iiitd.edu.in/raghava/humcfs/fragile_site_bed.zip -O fragile_site_bed.zip
 unzip fragile_site_bed.zip -x "__MACOSX/*" "*/.DS_Store"
import polars_bio as pb
pb.register_bed("/tmp/fragile_site_bed/chr5_fragile_site.bed", "test_bed")
pb.sql("select * FROM test_bed WHERE name LIKE 'FRA5%'").collect()
    shape: (8, 4)
    ┌───────┬───────────┬───────────┬───────┐
    │ chrom ┆ start     ┆ end       ┆ name  │
    │ ---   ┆ ---       ┆ ---       ┆ ---   │
    │ str   ┆ u32       ┆ u32       ┆ str   │
    ╞═══════╪═══════════╪═══════════╪═══════╡
    │ chr5  ┆ 28900001  ┆ 42500000  ┆ FRA5A │
    │ chr5  ┆ 92300001  ┆ 98200000  ┆ FRA5B │
    │ chr5  ┆ 130600001 ┆ 136200000 ┆ FRA5C │
    │ chr5  ┆ 92300001  ┆ 93916228  ┆ FRA5D │
    │ chr5  ┆ 18400001  ┆ 28900000  ┆ FRA5E │
    │ chr5  ┆ 98200001  ┆ 109600000 ┆ FRA5F │
    │ chr5  ┆ 168500001 ┆ 180915260 ┆ FRA5G │
    │ chr5  ┆ 50500001  ┆ 63000000  ┆ FRA5H │
    └───────┴───────────┴───────────┴───────┘

Tip

chunk_size and concurrent_fetches can be adjusted according to the network bandwidth and the size of the BED file. As a rule of thumb for large scale operations (reading a whole BED), it is recommended to keep the default values.
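For illustration only, a minimal sketch of registering a BED file straight from object storage with the smaller, interactive-friendly settings mentioned above (the bucket path is hypothetical):

```python
import polars_bio as pb

# Hypothetical object-storage path; smaller chunk_size/concurrent_fetches
# favour quick, interactive inspection over full scans.
pb.register_bed(
    "gs://my-bucket/annotations/fragile_sites.bed",  # hypothetical path
    "fragile_sites",
    chunk_size=8,
    concurrent_fetches=1,
)
pb.sql("SELECT * FROM fragile_sites").limit(5).collect()
```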

Source code in polars_bio/sql.py
@staticmethod
def register_bed(
    path: str,
    name: Union[str, None] = None,
    thread_num: int = 1,
    chunk_size: int = 64,
    concurrent_fetches: int = 8,
    allow_anonymous: bool = True,
    max_retries: int = 5,
    timeout: int = 300,
    enable_request_payer: bool = False,
    compression_type: str = "auto",
) -> None:
    """
    Register a BED file as a Datafusion table.

    Parameters:
        path: The path to the BED file.
        name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
        thread_num: The number of threads to use for reading the BED file. Used **only** for parallel decompression of BGZF blocks. Works only for **local** files.
        chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        compression_type: The compression type of the BED file. If not specified, it will be detected automatically.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.

    !!! Note
        Only **BED4** format is supported. It extends the basic BED format (BED3) by adding a name field, resulting in four columns: chromosome, start position, end position, and name.
        Also, unlike other text formats, **GZIP** compression is not supported.

    !!! Example
        ```shell

         cd /tmp
         wget https://webs.iiitd.edu.in/raghava/humcfs/fragile_site_bed.zip -O fragile_site_bed.zip
         unzip fragile_site_bed.zip -x "__MACOSX/*" "*/.DS_Store"
        ```

        ```python
        import polars_bio as pb
        pb.register_bed("/tmp/fragile_site_bed/chr5_fragile_site.bed", "test_bed")
        pb.sql("select * FROM test_bed WHERE name LIKE 'FRA5%'").collect()
        ```

        ```shell

            shape: (8, 4)
            ┌───────┬───────────┬───────────┬───────┐
            │ chrom ┆ start     ┆ end       ┆ name  │
            │ ---   ┆ ---       ┆ ---       ┆ ---   │
            │ str   ┆ u32       ┆ u32       ┆ str   │
            ╞═══════╪═══════════╪═══════════╪═══════╡
            │ chr5  ┆ 28900001  ┆ 42500000  ┆ FRA5A │
            │ chr5  ┆ 92300001  ┆ 98200000  ┆ FRA5B │
            │ chr5  ┆ 130600001 ┆ 136200000 ┆ FRA5C │
            │ chr5  ┆ 92300001  ┆ 93916228  ┆ FRA5D │
            │ chr5  ┆ 18400001  ┆ 28900000  ┆ FRA5E │
            │ chr5  ┆ 98200001  ┆ 109600000 ┆ FRA5F │
            │ chr5  ┆ 168500001 ┆ 180915260 ┆ FRA5G │
            │ chr5  ┆ 50500001  ┆ 63000000  ┆ FRA5H │
            └───────┴───────────┴───────────┴───────┘
        ```


    !!! tip
        `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the BED file. As a rule of thumb for large scale operations (reading a whole BED), it is recommended to keep the default values.
    """

    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    bed_read_options = BedReadOptions(
        thread_num=thread_num,
        object_storage_options=object_storage_options,
    )
    read_options = ReadOptions(bed_read_options=bed_read_options)
    py_register_table(ctx, path, name, InputFormat.Bed, read_options)

register_cram(path, name=None, tag_fields=None, chunk_size=64, concurrent_fetches=8, allow_anonymous=True, max_retries=5, timeout=300, enable_request_payer=False) staticmethod

Register a CRAM file as a Datafusion table.

Embedded Reference Required

Currently, only CRAM files with embedded reference sequences are supported. CRAM files requiring external reference FASTA files cannot be registered. Most modern CRAM files include embedded references by default.

To create a CRAM file with embedded reference using samtools:

samtools view -C -o output.cram --output-fmt-option embed_ref=1 input.bam

Parameters:

Name Type Description Default
path str

The path to the CRAM file (local or cloud storage: S3, GCS, Azure Blob).

required
name Union[str, None]

The name of the table. If None, the name of the table will be generated automatically based on the path.

None
tag_fields Union[list[str], None]

List of CRAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).

None
chunk_size int

The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 8-16.

64
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 1-2.

8
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300

Note

CRAM reader uses 1-based coordinate system for the start, end, mate_start, mate_end columns.

Tip

chunk_size and concurrent_fetches can be adjusted according to the network bandwidth and the size of the CRAM file. As a rule of thumb for large scale operations (reading a whole CRAM), it is recommended to keep the default values. For interactive use, such as inspecting a schema, it is recommended to decrease chunk_size to 8-16 and concurrent_fetches to 1-2.
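Since no usage example is shown above, here is a minimal sketch, assuming a local CRAM file with an embedded reference and that the selected tags surface as additional columns (the path and tag selection are illustrative):

```python
import polars_bio as pb

# Hypothetical CRAM file with an embedded reference sequence.
pb.register_cram(
    "/tmp/HG00096.embedded_ref.cram",
    "hg00096_cram",
    tag_fields=["NM", "AS"],  # illustrative optional tags to expose as columns
)
pb.sql("SELECT chrom, start, end FROM hg00096_cram").limit(5).collect()
```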

Source code in polars_bio/sql.py
@staticmethod
def register_cram(
    path: str,
    name: Union[str, None] = None,
    tag_fields: Union[list[str], None] = None,
    chunk_size: int = 64,
    concurrent_fetches: int = 8,
    allow_anonymous: bool = True,
    max_retries: int = 5,
    timeout: int = 300,
    enable_request_payer: bool = False,
) -> None:
    """
    Register a CRAM file as a Datafusion table.

    !!! warning "Embedded Reference Required"
        Currently, only CRAM files with **embedded reference sequences** are supported.
        CRAM files requiring external reference FASTA files cannot be registered.
        Most modern CRAM files include embedded references by default.

        To create a CRAM file with embedded reference using samtools:
        ```bash
        samtools view -C -o output.cram --output-fmt-option embed_ref=1 input.bam
        ```

    Parameters:
        path: The path to the CRAM file (local or cloud storage: S3, GCS, Azure Blob).
        name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
        tag_fields: List of CRAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default). Common tags include: NM (edit distance), MD (mismatch string), AS (alignment score), XS (secondary alignment score), RG (read group), CB (cell barcode), UB (UMI barcode).
        chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
    !!! note
        CRAM reader uses **1-based** coordinate system for the `start`, `end`, `mate_start`, `mate_end` columns.

    !!! tip
        `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the CRAM file. As a rule of thumb for large scale operations (reading a whole CRAM), it is recommended to keep the default values.
        For interactive use, such as inspecting a schema, it is recommended to decrease `chunk_size` to **8-16** and `concurrent_fetches` to **1-2**.
    """

    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type="auto",
    )

    cram_read_options = CramReadOptions(
        reference_path=None,
        object_storage_options=object_storage_options,
        tag_fields=tag_fields,
    )
    read_options = ReadOptions(cram_read_options=cram_read_options)
    py_register_table(ctx, path, name, InputFormat.Cram, read_options)

register_fastq(path, name=None, chunk_size=64, concurrent_fetches=8, allow_anonymous=True, max_retries=5, timeout=300, enable_request_payer=False, compression_type='auto', parallel=False) staticmethod

Register a FASTQ file as a Datafusion table.

Parameters:

Name Type Description Default
path str

The path to the FASTQ file.

required
name Union[str, None]

The name of the table. If None, the name of the table will be generated automatically based on the path.

None
chunk_size int

The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 8-16.

64
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 1-2.

8
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
compression_type str

The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compression is supported ('bgz' and 'gz').

'auto'
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300
parallel bool

Whether to use the parallel reader for BGZF compressed files. Default is False. If a file ends with ".gz" but is actually BGZF, it will attempt the parallel path and fall back to standard if not BGZF.

False

Example

  import polars_bio as pb
  pb.register_fastq("gs://genomics-public-data/platinum-genomes/fastq/ERR194146.fastq.gz", "test_fastq")
  pb.sql("SELECT name, description FROM test_fastq WHERE name LIKE 'ERR194146%'").limit(5).collect()
  shape: (5, 2)
┌─────────────────────┬─────────────────────────────────┐
│ name                ┆ description                     │
│ ---                 ┆ ---                             │
│ str                 ┆ str                             │
╞═════════════════════╪═════════════════════════════════╡
│ ERR194146.812444541 ┆ HSQ1008:141:D0CC8ACXX:2:1204:1… │
│ ERR194146.812444542 ┆ HSQ1008:141:D0CC8ACXX:4:1206:1… │
│ ERR194146.812444543 ┆ HSQ1008:141:D0CC8ACXX:3:2104:5… │
│ ERR194146.812444544 ┆ HSQ1008:141:D0CC8ACXX:3:2204:1… │
│ ERR194146.812444545 ┆ HSQ1008:141:D0CC8ACXX:3:1304:3… │
└─────────────────────┴─────────────────────────────────┘

Tip

chunk_size and concurrent_fetches can be adjusted according to the network bandwidth and the size of the FASTQ file. As a rule of thumb for large scale operations (reading a whole FASTQ), it is recommended to keep the default values.
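A minimal sketch of the `parallel` option with a BGZF-compressed file (the local path is hypothetical):

```python
import polars_bio as pb

# Hypothetical BGZF-compressed FASTQ; parallel=True enables the parallel BGZF
# reader and falls back to the standard reader if the file is not BGZF.
pb.register_fastq("/tmp/reads.fastq.bgz", "reads", parallel=True)
pb.sql("SELECT count(*) AS n_reads FROM reads").collect()
```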

Source code in polars_bio/sql.py
@staticmethod
def register_fastq(
    path: str,
    name: Union[str, None] = None,
    chunk_size: int = 64,
    concurrent_fetches: int = 8,
    allow_anonymous: bool = True,
    max_retries: int = 5,
    timeout: int = 300,
    enable_request_payer: bool = False,
    compression_type: str = "auto",
    parallel: bool = False,
) -> None:
    """
    Register a FASTQ file as a Datafusion table.

    Parameters:
        path: The path to the FASTQ file.
        name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
        chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        compression_type: The compression type of the FASTQ file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compression is supported ('bgz' and 'gz').
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
        parallel: Whether to use the parallel reader for BGZF compressed files. Default is False. If a file ends with ".gz" but is actually BGZF, it will attempt the parallel path and fall back to standard if not BGZF.

    !!! Example
        ```python
          import polars_bio as pb
          pb.register_fastq("gs://genomics-public-data/platinum-genomes/fastq/ERR194146.fastq.gz", "test_fastq")
          pb.sql("SELECT name, description FROM test_fastq WHERE name LIKE 'ERR194146%'").limit(5).collect()
        ```

        ```shell

          shape: (5, 2)
        ┌─────────────────────┬─────────────────────────────────┐
        │ name                ┆ description                     │
        │ ---                 ┆ ---                             │
        │ str                 ┆ str                             │
        ╞═════════════════════╪═════════════════════════════════╡
        │ ERR194146.812444541 ┆ HSQ1008:141:D0CC8ACXX:2:1204:1… │
        │ ERR194146.812444542 ┆ HSQ1008:141:D0CC8ACXX:4:1206:1… │
        │ ERR194146.812444543 ┆ HSQ1008:141:D0CC8ACXX:3:2104:5… │
        │ ERR194146.812444544 ┆ HSQ1008:141:D0CC8ACXX:3:2204:1… │
        │ ERR194146.812444545 ┆ HSQ1008:141:D0CC8ACXX:3:1304:3… │
        └─────────────────────┴─────────────────────────────────┘

        ```


    !!! tip
        `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the FASTQ file. As a rule of thumb for large scale operations (reading a whole FASTQ), it is recommended to keep the default values.
    """

    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    fastq_read_options = FastqReadOptions(
        object_storage_options=object_storage_options, parallel=parallel
    )
    read_options = ReadOptions(fastq_read_options=fastq_read_options)
    py_register_table(ctx, path, name, InputFormat.Fastq, read_options)

register_gff(path, name=None, chunk_size=64, concurrent_fetches=8, allow_anonymous=True, max_retries=5, timeout=300, enable_request_payer=False, compression_type='auto') staticmethod

Register a GFF file as a Datafusion table.

Parameters:

Name Type Description Default
path str

The path to the GFF file.

required
name Union[str, None]

The name of the table. If None, the name of the table will be generated automatically based on the path.

None
chunk_size int

The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 8-16.

64
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 1-2.

8
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
compression_type str

The compression type of the GFF file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compression is supported ('bgz' and 'gz').

'auto'
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300

Note

GFF reader uses 1-based coordinate system for the start and end columns.

Example

wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gff3.gz -O /tmp/gencode.v38.annotation.gff3.gz
import polars_bio as pb
pb.register_gff("/tmp/gencode.v38.annotation.gff3.gz", "gencode_v38_annotation3_bgz")
pb.sql("SELECT attributes, count(*) AS cnt FROM gencode_v38_annotation3_bgz GROUP BY attributes").limit(5).collect()
shape: (5, 2)
┌───────────────────┬───────┐
│ Parent            ┆ cnt   │
│ ---               ┆ ---   │
│ str               ┆ i64   │
╞═══════════════════╪═══════╡
│ null              ┆ 60649 │
│ ENSG00000223972.5 ┆ 2     │
│ ENST00000456328.2 ┆ 3     │
│ ENST00000450305.2 ┆ 6     │
│ ENSG00000227232.5 ┆ 1     │
└───────────────────┴───────┘

Tip

chunk_size and concurrent_fetches can be adjusted according to the network bandwidth and the size of the GFF file. As a rule of thumb for large scale operations (reading a whole GFF), it is recommended to keep the default values.

Source code in polars_bio/sql.py
@staticmethod
def register_gff(
    path: str,
    name: Union[str, None] = None,
    chunk_size: int = 64,
    concurrent_fetches: int = 8,
    allow_anonymous: bool = True,
    max_retries: int = 5,
    timeout: int = 300,
    enable_request_payer: bool = False,
    compression_type: str = "auto",
) -> None:
    """
    Register a GFF file as a Datafusion table.

    Parameters:
        path: The path to the GFF file.
        name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
        chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        compression_type: The compression type of the GFF file. If not specified, it will be detected automatically based on the file extension. BGZF and GZIP compression is supported ('bgz' and 'gz').
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
    !!! note
        GFF reader uses **1-based** coordinate system for the `start` and `end` columns.

    !!! Example
        ```shell
        wget https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gff3.gz -O /tmp/gencode.v38.annotation.gff3.gz
        ```
        ```python
        import polars_bio as pb
        pb.register_gff("/tmp/gencode.v38.annotation.gff3.gz", "gencode_v38_annotation3_bgz")
        pb.sql("SELECT attributes, count(*) AS cnt FROM gencode_v38_annotation3_bgz GROUP BY attributes").limit(5).collect()
        ```
        ```shell

        shape: (5, 2)
        ┌───────────────────┬───────┐
        │ Parent            ┆ cnt   │
        │ ---               ┆ ---   │
        │ str               ┆ i64   │
        ╞═══════════════════╪═══════╡
        │ null              ┆ 60649 │
        │ ENSG00000223972.5 ┆ 2     │
        │ ENST00000456328.2 ┆ 3     │
        │ ENST00000450305.2 ┆ 6     │
        │ ENSG00000227232.5 ┆ 1     │
        └───────────────────┴───────┘

        ```
    !!! tip
        `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the GFF file. As a rule of thumb for large scale operations (reading a whole GFF), it is recommended to keep the default values.
    """

    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    gff_read_options = GffReadOptions(
        attr_fields=None,
        object_storage_options=object_storage_options,
    )
    read_options = ReadOptions(gff_read_options=gff_read_options)
    py_register_table(ctx, path, name, InputFormat.Gff, read_options)

register_sam(path, name=None, tag_fields=None) staticmethod

Register a SAM file as a Datafusion table.

SAM (Sequence Alignment/Map) is the plain-text counterpart of BAM. This function reuses the BAM table provider, which auto-detects the format from the file extension.

Parameters:

Name Type Description Default
path str

The path to the SAM file.

required
name Union[str, None]

The name of the table. If None, the name will be generated automatically from the path.

None
tag_fields Union[list[str], None]

List of SAM tag names to include as columns (e.g., ["NM", "MD", "AS"]). If None, no optional tags are parsed (default).

None

Example

import polars_bio as pb
pb.register_sam("test.sam", "my_sam")
pb.sql("SELECT chrom, flags FROM my_sam").limit(5).collect()
Source code in polars_bio/sql.py
@staticmethod
def register_sam(
    path: str,
    name: Union[str, None] = None,
    tag_fields: Union[list[str], None] = None,
) -> None:
    """
    Register a SAM file as a Datafusion table.

    SAM (Sequence Alignment/Map) is the plain-text counterpart of BAM.
    This function reuses the BAM table provider, which auto-detects
    the format from the file extension.

    Parameters:
        path: The path to the SAM file.
        name: The name of the table. If *None*, the name will be generated automatically from the path.
        tag_fields: List of SAM tag names to include as columns (e.g., ["NM", "MD", "AS"]).
            If None, no optional tags are parsed (default).

    !!! Example
        ```python
        import polars_bio as pb
        pb.register_sam("test.sam", "my_sam")
        pb.sql("SELECT chrom, flags FROM my_sam").limit(5).collect()
        ```
    """
    bam_read_options = BamReadOptions(
        tag_fields=tag_fields,
    )
    read_options = ReadOptions(bam_read_options=bam_read_options)
    py_register_table(ctx, path, name, InputFormat.Sam, read_options)

register_vcf(path, name=None, info_fields=None, chunk_size=64, concurrent_fetches=8, allow_anonymous=True, max_retries=5, timeout=300, enable_request_payer=False, compression_type='auto') staticmethod

Register a VCF file as a Datafusion table.

Parameters:

Name Type Description Default
path str

The path to the VCF file.

required
name Union[str, None]

The name of the table. If None, the name of the table will be generated automatically based on the path.

None
info_fields Union[list[str], None]

List of INFO field names to register. If None, all INFO fields will be detected automatically from the VCF header. Use this to limit registration to specific fields for better performance.

None
chunk_size int

The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 8-16.

64
concurrent_fetches int

[GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to 1-2.

8
allow_anonymous bool

[GCS, AWS S3] Whether to allow anonymous access to object storage.

True
enable_request_payer bool

[AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.

False
compression_type str

The compression type of the VCF file. If not specified, it will be detected automatically.

'auto'
max_retries int

The maximum number of retries for reading the file from object storage.

5
timeout int

The timeout in seconds for reading the file from object storage.

300

Note

VCF reader uses 1-based coordinate system for the start and end columns.

Example

import polars_bio as pb
pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz")
INFO:polars_bio:Table: gnomad_v4_1_sv_sites_gz registered for path: /tmp/gnomad.v4.1.sv.sites.vcf.gz

Tip

chunk_size and concurrent_fetches can be adjusted according to the network bandwidth and the size of the VCF file. As a rule of thumb for large scale operations (reading a whole VCF), it is recommended to keep the default values.
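A sketch of `info_fields` used to restrict registration to a few INFO fields for faster scans, assuming those fields exist in the VCF header and are exposed as columns (the field names are illustrative):

```python
import polars_bio as pb

# Register only selected INFO fields instead of auto-detecting all of them.
pb.register_vcf(
    "/tmp/gnomad.v4.1.sv.sites.vcf.gz",
    "gnomad_v4_1_sv",
    info_fields=["SVTYPE", "SVLEN"],  # illustrative INFO field names
)
pb.sql("SELECT * FROM gnomad_v4_1_sv").limit(5).collect()
```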

Source code in polars_bio/sql.py
@staticmethod
def register_vcf(
    path: str,
    name: Union[str, None] = None,
    info_fields: Union[list[str], None] = None,
    chunk_size: int = 64,
    concurrent_fetches: int = 8,
    allow_anonymous: bool = True,
    max_retries: int = 5,
    timeout: int = 300,
    enable_request_payer: bool = False,
    compression_type: str = "auto",
) -> None:
    """
    Register a VCF file as a Datafusion table.

    Parameters:
        path: The path to the VCF file.
        name: The name of the table. If *None*, the name of the table will be generated automatically based on the path.
        info_fields: List of INFO field names to register. If *None*, all INFO fields will be detected automatically from the VCF header. Use this to limit registration to specific fields for better performance.
        chunk_size: The size in MB of a chunk when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **8-16**.
        concurrent_fetches: [GCS] The number of concurrent fetches when reading from an object store. Default settings are optimized for large scale operations. For small scale (interactive) operations, it is recommended to decrease this value to **1-2**.
        allow_anonymous: [GCS, AWS S3] Whether to allow anonymous access to object storage.
        enable_request_payer: [AWS S3] Whether to enable request payer for object storage. This is useful for reading files from AWS S3 buckets that require request payer.
        compression_type: The compression type of the VCF file. If not specified, it will be detected automatically.
        max_retries:  The maximum number of retries for reading the file from object storage.
        timeout: The timeout in seconds for reading the file from object storage.
    !!! note
        VCF reader uses **1-based** coordinate system for the `start` and `end` columns.

    !!! Example
          ```python
          import polars_bio as pb
          pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz")
          ```
         ```shell
         INFO:polars_bio:Table: gnomad_v4_1_sv_sites_gz registered for path: /tmp/gnomad.v4.1.sv.sites.vcf.gz
         ```
    !!! tip
        `chunk_size` and `concurrent_fetches` can be adjusted according to the network bandwidth and the size of the VCF file. As a rule of thumb for large scale operations (reading a whole VCF), it is recommended to keep the default values.
    """

    object_storage_options = PyObjectStorageOptions(
        allow_anonymous=allow_anonymous,
        enable_request_payer=enable_request_payer,
        chunk_size=chunk_size,
        concurrent_fetches=concurrent_fetches,
        max_retries=max_retries,
        timeout=timeout,
        compression_type=compression_type,
    )

    # Use provided info_fields or autodetect from VCF header
    if info_fields is not None:
        all_info_fields = info_fields
    else:
        # Get all info fields from VCF header for automatic field detection
        all_info_fields = None
        try:
            from .io import IOOperations

            vcf_schema_df = IOOperations.describe_vcf(
                path,
                allow_anonymous=allow_anonymous,
                enable_request_payer=enable_request_payer,
                compression_type=compression_type,
            )
            all_info_fields = vcf_schema_df.select("name").to_series().to_list()
        except Exception:
            # Fallback to empty list if unable to get info fields
            all_info_fields = []

    vcf_read_options = VcfReadOptions(
        info_fields=all_info_fields,
        object_storage_options=object_storage_options,
    )
    read_options = ReadOptions(vcf_read_options=vcf_read_options)
    py_register_table(ctx, path, name, InputFormat.Vcf, read_options)

register_view(name, query) staticmethod

Register a query as a Datafusion view. This view can be used in genomic ranges operations, such as overlap, nearest, and count_overlaps. It is useful for filtering, transforming, and aggregating data prior to the range operation. When combined with a range operation, it can be used to perform complex processing in a streaming fashion, end-to-end.

Parameters:

Name Type Description Default
name str

The name of the table.

required
query str

The SQL query.

required

Example

import polars_bio as pb
pb.register_vcf("gs://gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz", "gnomad_sv")
pb.register_view("v_gnomad_sv", "SELECT replace(chrom,'chr', '') AS chrom, start, end FROM gnomad_sv")
pb.sql("SELECT * FROM v_gnomad_sv").limit(5).collect()
  shape: (5, 3)
  ┌───────┬─────────┬─────────┐
  │ chrom ┆ start   ┆ end     │
  │ ---   ┆ ---     ┆ ---     │
  │ str   ┆ u32     ┆ u32     │
  ╞═══════╪═════════╪═════════╡
  │ 21    ┆ 5031905 ┆ 5031905 │
  │ 21    ┆ 5031905 ┆ 5031905 │
  │ 21    ┆ 5031909 ┆ 5031909 │
  │ 21    ┆ 5031911 ┆ 5031911 │
  │ 21    ┆ 5031911 ┆ 5031911 │
  └───────┴─────────┴─────────┘
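To illustrate the end-to-end streaming use mentioned above, a sketch that feeds the view into an interval operation, assuming a second table of target regions (here called `targets`) has been registered separately and that both inputs carry coordinate-system metadata:

```python
import polars_bio as pb

pb.register_vcf(
    "gs://gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz",
    "gnomad_sv",
)
# Pre-filter/transform with a view, then use it in a range operation.
pb.register_view(
    "v_gnomad_sv",
    "SELECT replace(chrom,'chr', '') AS chrom, start, end FROM gnomad_sv",
)
pb.overlap("v_gnomad_sv", "targets").limit(5).collect()  # "targets" is hypothetical
```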

Source code in polars_bio/sql.py
@staticmethod
def register_view(name: str, query: str) -> None:
    """
    Register a query as a Datafusion view. This view can be used in genomic ranges operations,
    such as overlap, nearest, and count_overlaps. It is useful for filtering, transforming, and aggregating data
    prior to the range operation. When combined with a range operation, it can be used to perform complex processing in a streaming fashion, end-to-end.

    Parameters:
        name: The name of the table.
        query: The SQL query.

    !!! Example
          ```python
          import polars_bio as pb
          pb.register_vcf("gs://gcp-public-data--gnomad/release/4.1/vcf/exomes/gnomad.exomes.v4.1.sites.chr21.vcf.bgz", "gnomad_sv")
          pb.register_view("v_gnomad_sv", "SELECT replace(chrom,'chr', '') AS chrom, start, end FROM gnomad_sv")
          pb.sql("SELECT * FROM v_gnomad_sv").limit(5).collect()
          ```
          ```shell
            shape: (5, 3)
            ┌───────┬─────────┬─────────┐
            │ chrom ┆ start   ┆ end     │
            │ ---   ┆ ---     ┆ ---     │
            │ str   ┆ u32     ┆ u32     │
            ╞═══════╪═════════╪═════════╡
            │ 21    ┆ 5031905 ┆ 5031905 │
            │ 21    ┆ 5031905 ┆ 5031905 │
            │ 21    ┆ 5031909 ┆ 5031909 │
            │ 21    ┆ 5031911 ┆ 5031911 │
            │ 21    ┆ 5031911 ┆ 5031911 │
            └───────┴─────────┴─────────┘
          ```
    """
    py_register_view(ctx, name, query)

sql(query) staticmethod

Execute a SQL query on the registered tables.

Parameters:

Name Type Description Default
query str

The SQL query.

required

Example

import polars_bio as pb
pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_v4_1_sv")
pb.sql("SELECT * FROM gnomad_v4_1_sv LIMIT 5").collect()
Source code in polars_bio/sql.py
@staticmethod
def sql(query: str) -> pl.LazyFrame:
    """
    Execute a SQL query on the registered tables.

    Parameters:
        query: The SQL query.

    !!! Example
          ```python
          import polars_bio as pb
          pb.register_vcf("/tmp/gnomad.v4.1.sv.sites.vcf.gz", "gnomad_v4_1_sv")
          pb.sql("SELECT * FROM gnomad_v4_1_sv LIMIT 5").collect()
          ```
    """
    df = py_read_sql(ctx, query)
    return _lazy_scan(df)

range_operations

Source code in polars_bio/range_op.py
class IntervalOperations:

    @staticmethod
    def overlap(
        df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        suffixes: tuple[str, str] = ("_1", "_2"),
        on_cols: Union[list[str], None] = None,
        cols1: Union[list[str], None] = ["chrom", "start", "end"],
        cols2: Union[list[str], None] = ["chrom", "start", "end"],
        algorithm: str = "Coitrees",
        low_memory: bool = False,
        output_type: str = "polars.LazyFrame",
        read_options1: Union[ReadOptions, None] = None,
        read_options2: Union[ReadOptions, None] = None,
        projection_pushdown: bool = True,
    ) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
        """
        Find pairs of overlapping genomic intervals.
        Bioframe inspired API.

        The coordinate system (0-based or 1-based) is automatically detected from
        DataFrame metadata set at I/O time. Both inputs must have the same coordinate
        system.

        Parameters:
            df1: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
            df2: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED  and Parquet are supported.
            cols1: The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            cols2:  The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            suffixes: Suffixes for the columns of the two overlapped sets.
            on_cols: List of additional column names to join on. default is None.
            algorithm: The algorithm to use for the overlap operation. Available options: Coitrees, IntervalTree, ArrayIntervalTree, Lapper, SuperIntervals
            low_memory: If True, use low memory method for output generation. This may be slower but uses less memory.
            output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame", and "datafusion.DataFrame" are also supported.
            read_options1: Additional options for reading the input files.
            read_options2: Additional options for reading the input files.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        Returns:
            **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the overlapping intervals.

        Raises:
            MissingCoordinateSystemError: If either input lacks coordinate system metadata
                and `datafusion.bio.coordinate_system_check` is "true" (default). Use polars-bio
                I/O functions (scan_*, read_*) which automatically set metadata, or set it manually
                on Polars DataFrames via `df.config_meta.set(coordinate_system_zero_based=True/False)`
                or on Pandas DataFrames via `df.attrs["coordinate_system_zero_based"] = True/False`.
                Set `pb.set_option("datafusion.bio.coordinate_system_check", False)` to disable
                strict checking and fall back to global coordinate system setting.
            CoordinateSystemMismatchError: If inputs have different coordinate systems.

        Note:
            1. The default output format, i.e.  [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), is recommended for large datasets as it supports output streaming and lazy evaluation.
            This enables efficient processing of large datasets without loading the entire output dataset into memory.
            2. Streaming is only supported for polars.LazyFrame output.

        Example:
            ```python
            import polars_bio as pb
            import pandas as pd

            df1 = pd.DataFrame([
                ['chr1', 1, 5],
                ['chr1', 3, 8],
                ['chr1', 8, 10],
                ['chr1', 12, 14]],
            columns=['chrom', 'start', 'end']
            )
            df1.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

            df2 = pd.DataFrame(
            [['chr1', 4, 8],
             ['chr1', 10, 11]],
            columns=['chrom', 'start', 'end' ]
            )
            df2.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

            overlapping_intervals = pb.overlap(df1, df2, output_type="pandas.DataFrame")

            overlapping_intervals
                chrom_1         start_1     end_1 chrom_2       start_2  end_2
            0     chr1            1          5     chr1            4          8
            1     chr1            3          8     chr1            4          8

            ```

        Todo:
             Support for on_cols.
        """

        _validate_overlap_input(cols1, cols2, on_cols, suffixes, output_type)

        # Get filter_op from DataFrame metadata
        filter_op = _get_filter_op_from_metadata(df1, df2)

        cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
        cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
        range_options = RangeOptions(
            range_op=RangeOp.Overlap,
            filter_op=filter_op,
            suffixes=suffixes,
            columns_1=cols1,
            columns_2=cols2,
            overlap_alg=algorithm,
            overlap_low_memory=low_memory,
        )

        return range_operation(
            df1,
            df2,
            range_options,
            output_type,
            ctx,
            read_options1,
            read_options2,
            projection_pushdown,
        )

    @staticmethod
    def nearest(
        df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        suffixes: tuple[str, str] = ("_1", "_2"),
        on_cols: Union[list[str], None] = None,
        cols1: Union[list[str], None] = ["chrom", "start", "end"],
        cols2: Union[list[str], None] = ["chrom", "start", "end"],
        output_type: str = "polars.LazyFrame",
        read_options: Union[ReadOptions, None] = None,
        projection_pushdown: bool = True,
    ) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
        """
        Find pairs of closest genomic intervals.
        Bioframe inspired API.

        The coordinate system (0-based or 1-based) is automatically detected from
        DataFrame metadata set at I/O time. Both inputs must have the same coordinate
        system.

        Parameters:
            df1: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
            df2: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED  and Parquet are supported.
            cols1: The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            cols2:  The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            suffixes: Suffixes for the columns of the two overlapped sets.
            on_cols: List of additional column names to join on. default is None.
            output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame", and "datafusion.DataFrame" are also supported.
            read_options: Additional options for reading the input files.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        Returns:
            **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the overlapping intervals.

        Raises:
            MissingCoordinateSystemError: If either input lacks coordinate system metadata
                and `datafusion.bio.coordinate_system_check` is "true" (default).
            CoordinateSystemMismatchError: If inputs have different coordinate systems.

        Note:
            The default output format, i.e. [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), is recommended for large datasets as it supports output streaming and lazy evaluation.
            This enables efficient processing of large datasets without loading the entire output dataset into memory.

        Example:

        Todo:
            Support for on_cols.
        """

        _validate_overlap_input(cols1, cols2, on_cols, suffixes, output_type)

        # Get filter_op from DataFrame metadata
        filter_op = _get_filter_op_from_metadata(df1, df2)

        cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
        cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
        range_options = RangeOptions(
            range_op=RangeOp.Nearest,
            filter_op=filter_op,
            suffixes=suffixes,
            columns_1=cols1,
            columns_2=cols2,
        )
        return range_operation(
            df1,
            df2,
            range_options,
            output_type,
            ctx,
            read_options,
            projection_pushdown=projection_pushdown,
        )

    @staticmethod
    def coverage(
        df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        suffixes: tuple[str, str] = ("_1", "_2"),
        on_cols: Union[list[str], None] = None,
        cols1: Union[list[str], None] = ["chrom", "start", "end"],
        cols2: Union[list[str], None] = ["chrom", "start", "end"],
        output_type: str = "polars.LazyFrame",
        read_options: Union[ReadOptions, None] = None,
        projection_pushdown: bool = True,
    ) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
        """
        Calculate intervals coverage.
        Bioframe inspired API.

        The coordinate system (0-based or 1-based) is automatically detected from
        DataFrame metadata set at I/O time. Both inputs must have the same coordinate
        system.

        Parameters:
            df1: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
            df2: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED  and Parquet are supported.
            cols1: The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            cols2:  The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            suffixes: Suffixes for the columns of the two overlapped sets.
            on_cols: List of additional column names to join on. default is None.
            output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame", and "datafusion.DataFrame" are also supported.
            read_options: Additional options for reading the input files.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        Returns:
            **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the overlapping intervals.

        Raises:
            MissingCoordinateSystemError: If either input lacks coordinate system metadata
                and `datafusion.bio.coordinate_system_check` is "true" (default).
            CoordinateSystemMismatchError: If inputs have different coordinate systems.

        Note:
            The default output format, i.e. [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), is recommended for large datasets as it supports output streaming and lazy evaluation.
            This enables efficient processing of large datasets without loading the entire output dataset into memory.

        Example:

        Todo:
            Support for on_cols.
        """

        _validate_overlap_input(cols1, cols2, on_cols, suffixes, output_type)

        # Get filter_op from DataFrame metadata
        filter_op = _get_filter_op_from_metadata(df1, df2)

        cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
        cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
        range_options = RangeOptions(
            range_op=RangeOp.Coverage,
            filter_op=filter_op,
            suffixes=suffixes,
            columns_1=cols1,
            columns_2=cols2,
        )
        return range_operation(
            df2,
            df1,
            range_options,
            output_type,
            ctx,
            read_options,
            projection_pushdown=projection_pushdown,
        )

    @staticmethod
    def count_overlaps(
        df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        suffixes: tuple[str, str] = ("", "_"),
        cols1: Union[list[str], None] = ["chrom", "start", "end"],
        cols2: Union[list[str], None] = ["chrom", "start", "end"],
        on_cols: Union[list[str], None] = None,
        output_type: str = "polars.LazyFrame",
        naive_query: bool = True,
        projection_pushdown: bool = True,
    ) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
        """
        Count pairs of overlapping genomic intervals.
        Bioframe inspired API.

        The coordinate system (0-based or 1-based) is automatically detected from
        DataFrame metadata set at I/O time. Both inputs must have the same coordinate
        system.

        Parameters:
            df1: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
            df2: Can be a path to a file, a polars DataFrame, or a pandas DataFrame or a registered table. CSV with a header, BED  and Parquet are supported.
            suffixes: Suffixes for the columns of the two overlapped sets.
            cols1: The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            cols2:  The names of columns containing the chromosome, start and end of the
                genomic intervals, provided separately for each set.
            on_cols: List of additional column names to join on. default is None.
            output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame", and "datafusion.DataFrame" are also supported.
            naive_query: If True, count overlaps with a naive query based on the overlap operation.
            projection_pushdown: Enable column projection pushdown to optimize query performance by only reading the necessary columns at the DataFusion level.

        Returns:
            **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the overlapping intervals.

        Raises:
            MissingCoordinateSystemError: If either input lacks coordinate system metadata
                and `datafusion.bio.coordinate_system_check` is "true" (default).
            CoordinateSystemMismatchError: If inputs have different coordinate systems.

        Example:
            ```python
            import polars_bio as pb
            import pandas as pd

            df1 = pd.DataFrame([
                ['chr1', 1, 5],
                ['chr1', 3, 8],
                ['chr1', 8, 10],
                ['chr1', 12, 14]],
            columns=['chrom', 'start', 'end']
            )
            df1.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

            df2 = pd.DataFrame(
            [['chr1', 4, 8],
             ['chr1', 10, 11]],
            columns=['chrom', 'start', 'end' ]
            )
            df2.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

            counts = pb.count_overlaps(df1, df2, output_type="pandas.DataFrame")

            counts

            chrom  start  end  count
            0  chr1      1    5      1
            1  chr1      3    8      1
            2  chr1      8   10      0
            3  chr1     12   14      0
            ```

        Todo:
             Support return_input.
        """
        _validate_overlap_input(cols1, cols2, on_cols, suffixes, output_type)

        # Get filter_op and zero_based from DataFrame metadata
        zero_based = validate_coordinate_systems(df1, df2, ctx)
        filter_op = FilterOp.Strict if zero_based else FilterOp.Weak

        my_ctx = get_py_ctx()
        on_cols = [] if on_cols is None else on_cols
        cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
        cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
        if naive_query:
            range_options = RangeOptions(
                range_op=RangeOp.CountOverlapsNaive,
                filter_op=filter_op,
                suffixes=suffixes,
                columns_1=cols1,
                columns_2=cols2,
            )
            return range_operation(df2, df1, range_options, output_type, ctx)
        df1 = read_df_to_datafusion(my_ctx, df1)
        df2 = read_df_to_datafusion(my_ctx, df2)

        curr_cols = set(df1.schema().names) | set(df2.schema().names)
        s1start_s2end = prevent_column_collision("s1starts2end", curr_cols)
        s1end_s2start = prevent_column_collision("s1ends2start", curr_cols)
        contig = prevent_column_collision("contig", curr_cols)
        count = prevent_column_collision("count", curr_cols)
        starts = prevent_column_collision("starts", curr_cols)
        ends = prevent_column_collision("ends", curr_cols)
        is_s1 = prevent_column_collision("is_s1", curr_cols)
        suff, _ = suffixes
        df1, df2 = df2, df1
        df1 = df1.select(
            *(
                [
                    literal(1).alias(is_s1),
                    col(cols1[1]).alias(s1start_s2end),
                    col(cols1[2]).alias(s1end_s2start),
                    col(cols1[0]).alias(contig),
                ]
                + on_cols
            )
        )
        df2 = df2.select(
            *(
                [
                    literal(0).alias(is_s1),
                    col(cols2[2]).alias(s1end_s2start),
                    col(cols2[1]).alias(s1start_s2end),
                    col(cols2[0]).alias(contig),
                ]
                + on_cols
            )
        )

        df = df1.union(df2)

        partitioning = [col(contig)] + [col(c) for c in on_cols]
        df = df.select(
            *(
                [
                    s1start_s2end,
                    s1end_s2start,
                    contig,
                    is_s1,
                    datafusion.functions.sum(col(is_s1))
                    .over(
                        datafusion.expr.Window(
                            partition_by=partitioning,
                            order_by=[
                                col(s1start_s2end).sort(),
                                col(is_s1).sort(ascending=zero_based),
                            ],
                        )
                    )
                    .alias(starts),
                    datafusion.functions.sum(col(is_s1))
                    .over(
                        datafusion.expr.Window(
                            partition_by=partitioning,
                            order_by=[
                                col(s1end_s2start).sort(),
                                col(is_s1).sort(ascending=(not zero_based)),
                            ],
                        )
                    )
                    .alias(ends),
                ]
                + on_cols
            )
        )
        df = df.filter(col(is_s1) == 0)
        df = df.select(
            *(
                [
                    col(contig).alias(cols1[0] + suff),
                    col(s1end_s2start).alias(cols1[1] + suff),
                    col(s1start_s2end).alias(cols1[2] + suff),
                ]
                + on_cols
                + [(col(starts) - col(ends)).alias(count)]
            )
        )

        return convert_result(df, output_type)

    @staticmethod
    def merge(
        df: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
        min_dist: float = 0,
        cols: Union[list[str], None] = ["chrom", "start", "end"],
        on_cols: Union[list[str], None] = None,
        output_type: str = "polars.LazyFrame",
        projection_pushdown: bool = True,
    ) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
        """
        Merge overlapping intervals. It is assumed that start < end.

        The coordinate system (0-based or 1-based) is automatically detected from
        DataFrame metadata set at I/O time.

        Parameters:
            df: Can be a path to a file, a polars DataFrame, or a pandas DataFrame. CSV with a header, BED and Parquet are supported.
            min_dist: Minimum distance between intervals to merge. Default is 0.
            cols: The names of columns containing the chromosome, start and end of the
                genomic intervals.
            on_cols: List of additional column names for clustering. Default is None.
            output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
            projection_pushdown: Enable column projection pushdown so that only the necessary columns are read at the DataFusion level.

        Returns:
            **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the merged intervals.

        Raises:
            MissingCoordinateSystemError: If input lacks coordinate system metadata
                and `datafusion.bio.coordinate_system_check` is "true" (default).

        Example:

        Todo:
            Support for on_cols.
        """
        suffixes = ("_1", "_2")
        _validate_overlap_input(cols, cols, on_cols, suffixes, output_type)

        # Get zero_based from DataFrame metadata
        zero_based = validate_coordinate_system_single(df, ctx)

        my_ctx = get_py_ctx()
        cols = DEFAULT_INTERVAL_COLUMNS if cols is None else cols
        contig = cols[0]
        start = cols[1]
        end = cols[2]

        on_cols = [] if on_cols is None else on_cols
        on_cols = [contig] + on_cols

        df = read_df_to_datafusion(my_ctx, df)
        df_schema = df.schema()
        start_type = df_schema.field(start).type
        end_type = df_schema.field(end).type

        curr_cols = set(df_schema.names)
        start_end = prevent_column_collision("start_end", curr_cols)
        is_start_end = prevent_column_collision("is_start_or_end", curr_cols)
        current_intervals = prevent_column_collision("current_intervals", curr_cols)
        n_intervals = prevent_column_collision("n_intervals", curr_cols)

        end_positions = df.select(
            *(
                [
                    (col(end) + min_dist).alias(start_end),
                    literal(-1).alias(is_start_end),
                ]
                + on_cols
            )
        )
        start_positions = df.select(
            *([col(start).alias(start_end), literal(1).alias(is_start_end)] + on_cols)
        )
        all_positions = start_positions.union(end_positions)
        start_end_type = all_positions.schema().field(start_end).type
        all_positions = all_positions.select(
            *([col(start_end).cast(start_end_type), col(is_start_end)] + on_cols)
        )

        sorting = [
            col(start_end).sort(),
            col(is_start_end).sort(ascending=zero_based),
        ]
        all_positions = all_positions.sort(*sorting)

        on_cols_expr = [col(c) for c in on_cols]

        win = datafusion.expr.Window(
            partition_by=on_cols_expr,
            order_by=sorting,
        )
        all_positions = all_positions.select(
            *(
                [
                    start_end,
                    is_start_end,
                    datafusion.functions.sum(col(is_start_end))
                    .over(win)
                    .alias(current_intervals),
                ]
                + on_cols
                + [
                    datafusion.functions.row_number(
                        partition_by=on_cols_expr, order_by=sorting
                    ).alias(n_intervals)
                ]
            )
        )
        all_positions = all_positions.filter(
            ((col(current_intervals) == 0) & (col(is_start_end) == -1))
            | ((col(current_intervals) == 1) & (col(is_start_end) == 1))
        )
        all_positions = all_positions.select(
            *(
                [start_end, is_start_end]
                + on_cols
                + [
                    (
                        (
                            col(n_intervals)
                            - datafusion.functions.lag(
                                col(n_intervals), partition_by=on_cols_expr
                            )
                            + 1
                        )
                        / 2
                    )
                    .cast(pa.int64())
                    .alias(n_intervals)
                ]
            )
        )
        result = all_positions.select(
            *(
                [
                    (col(start_end) - min_dist).alias(end),
                    is_start_end,
                    datafusion.functions.lag(
                        col(start_end), partition_by=on_cols_expr
                    ).alias(start),
                ]
                + on_cols
                + [n_intervals]
            )
        )
        result = result.filter(col(is_start_end) == -1)
        result = result.select(
            *(
                [contig, col(start).cast(start_type), col(end).cast(end_type)]
                + on_cols[1:]
                + [n_intervals]
            )
        )

        output = convert_result(result, output_type)

        # Propagate coordinate system metadata to result
        if output_type in ("polars.DataFrame", "polars.LazyFrame", "pandas.DataFrame"):
            set_coordinate_system(output, zero_based)

        return output

count_overlaps(df1, df2, suffixes=('', '_'), cols1=['chrom', 'start', 'end'], cols2=['chrom', 'start', 'end'], on_cols=None, output_type='polars.LazyFrame', naive_query=True, projection_pushdown=True) staticmethod

Count pairs of overlapping genomic intervals. Bioframe inspired API.

The coordinate system (0-based or 1-based) is automatically detected from DataFrame metadata set at I/O time. Both inputs must have the same coordinate system.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df1 | Union[str, DataFrame, LazyFrame, 'pd.DataFrame'] | Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table (see register_vcf). CSV with a header, BED and Parquet are supported. | required |
| df2 | Union[str, DataFrame, LazyFrame, 'pd.DataFrame'] | Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table. CSV with a header, BED and Parquet are supported. | required |
| suffixes | tuple[str, str] | Suffixes for the columns of the two overlapped sets. | ('', '_') |
| cols1 | Union[list[str], None] | The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. | ['chrom', 'start', 'end'] |
| cols2 | Union[list[str], None] | The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. | ['chrom', 'start', 'end'] |
| on_cols | Union[list[str], None] | List of additional column names to join on. Default is None. | None |
| output_type | str | Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported. | 'polars.LazyFrame' |
| naive_query | bool | If True, use a naive query (based on the overlap operation) for counting overlaps. | True |
| projection_pushdown | bool | Enable column projection pushdown so that only the necessary columns are read at the DataFusion level. | True |

Returns:

| Type | Description |
| --- | --- |
| Union[LazyFrame, DataFrame, 'pd.DataFrame', DataFrame] | polars.LazyFrame or polars.DataFrame or pandas.DataFrame with the overlap counts for each interval in df1. |

Raises:

| Type | Description |
| --- | --- |
| MissingCoordinateSystemError | If either input lacks coordinate system metadata and datafusion.bio.coordinate_system_check is "true" (default). |
| CoordinateSystemMismatchError | If inputs have different coordinate systems. |

Example

```python
import polars_bio as pb
import pandas as pd

df1 = pd.DataFrame([
    ['chr1', 1, 5],
    ['chr1', 3, 8],
    ['chr1', 8, 10],
    ['chr1', 12, 14]],
columns=['chrom', 'start', 'end']
)
df1.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

df2 = pd.DataFrame(
[['chr1', 4, 8],
 ['chr1', 10, 11]],
columns=['chrom', 'start', 'end']
)
df2.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

counts = pb.count_overlaps(df1, df2, output_type="pandas.DataFrame")

counts

  chrom  start  end  count
0  chr1      1    5      1
1  chr1      3    8      1
2  chr1      8   10      0
3  chr1     12   14      0
```
Todo

Support return_input.

Source code in polars_bio/range_op.py
@staticmethod
def count_overlaps(
    df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    suffixes: tuple[str, str] = ("", "_"),
    cols1: Union[list[str], None] = ["chrom", "start", "end"],
    cols2: Union[list[str], None] = ["chrom", "start", "end"],
    on_cols: Union[list[str], None] = None,
    output_type: str = "polars.LazyFrame",
    naive_query: bool = True,
    projection_pushdown: bool = True,
) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
    """
    Count pairs of overlapping genomic intervals.
    Bioframe inspired API.

    The coordinate system (0-based or 1-based) is automatically detected from
    DataFrame metadata set at I/O time. Both inputs must have the same coordinate
    system.

    Parameters:
        df1: Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
        df2: Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table. CSV with a header, BED and Parquet are supported.
        suffixes: Suffixes for the columns of the two overlapped sets.
        cols1: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        cols2: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        on_cols: List of additional column names to join on. Default is None.
        output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
        naive_query: If True, use a naive query (based on the overlap operation) for counting overlaps.
        projection_pushdown: Enable column projection pushdown so that only the necessary columns are read at the DataFusion level.

    Returns:
        **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame with the overlap counts for each interval in df1.

    Raises:
        MissingCoordinateSystemError: If either input lacks coordinate system metadata
            and `datafusion.bio.coordinate_system_check` is "true" (default).
        CoordinateSystemMismatchError: If inputs have different coordinate systems.

    Example:
        ```python
        import polars_bio as pb
        import pandas as pd

        df1 = pd.DataFrame([
            ['chr1', 1, 5],
            ['chr1', 3, 8],
            ['chr1', 8, 10],
            ['chr1', 12, 14]],
        columns=['chrom', 'start', 'end']
        )
        df1.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

        df2 = pd.DataFrame(
        [['chr1', 4, 8],
         ['chr1', 10, 11]],
        columns=['chrom', 'start', 'end' ]
        )
        df2.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

        counts = pb.count_overlaps(df1, df2, output_type="pandas.DataFrame")

        counts

        chrom  start  end  count
        0  chr1      1    5      1
        1  chr1      3    8      1
        2  chr1      8   10      0
        3  chr1     12   14      0
        ```

    Todo:
         Support return_input.
    """
    _validate_overlap_input(cols1, cols2, on_cols, suffixes, output_type)

    # Get filter_op and zero_based from DataFrame metadata
    zero_based = validate_coordinate_systems(df1, df2, ctx)
    filter_op = FilterOp.Strict if zero_based else FilterOp.Weak

    my_ctx = get_py_ctx()
    on_cols = [] if on_cols is None else on_cols
    cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
    cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
    if naive_query:
        range_options = RangeOptions(
            range_op=RangeOp.CountOverlapsNaive,
            filter_op=filter_op,
            suffixes=suffixes,
            columns_1=cols1,
            columns_2=cols2,
        )
        return range_operation(df2, df1, range_options, output_type, ctx)
    df1 = read_df_to_datafusion(my_ctx, df1)
    df2 = read_df_to_datafusion(my_ctx, df2)

    curr_cols = set(df1.schema().names) | set(df2.schema().names)
    s1start_s2end = prevent_column_collision("s1starts2end", curr_cols)
    s1end_s2start = prevent_column_collision("s1ends2start", curr_cols)
    contig = prevent_column_collision("contig", curr_cols)
    count = prevent_column_collision("count", curr_cols)
    starts = prevent_column_collision("starts", curr_cols)
    ends = prevent_column_collision("ends", curr_cols)
    is_s1 = prevent_column_collision("is_s1", curr_cols)
    suff, _ = suffixes
    df1, df2 = df2, df1
    df1 = df1.select(
        *(
            [
                literal(1).alias(is_s1),
                col(cols1[1]).alias(s1start_s2end),
                col(cols1[2]).alias(s1end_s2start),
                col(cols1[0]).alias(contig),
            ]
            + on_cols
        )
    )
    df2 = df2.select(
        *(
            [
                literal(0).alias(is_s1),
                col(cols2[2]).alias(s1end_s2start),
                col(cols2[1]).alias(s1start_s2end),
                col(cols2[0]).alias(contig),
            ]
            + on_cols
        )
    )

    df = df1.union(df2)

    partitioning = [col(contig)] + [col(c) for c in on_cols]
    df = df.select(
        *(
            [
                s1start_s2end,
                s1end_s2start,
                contig,
                is_s1,
                datafusion.functions.sum(col(is_s1))
                .over(
                    datafusion.expr.Window(
                        partition_by=partitioning,
                        order_by=[
                            col(s1start_s2end).sort(),
                            col(is_s1).sort(ascending=zero_based),
                        ],
                    )
                )
                .alias(starts),
                datafusion.functions.sum(col(is_s1))
                .over(
                    datafusion.expr.Window(
                        partition_by=partitioning,
                        order_by=[
                            col(s1end_s2start).sort(),
                            col(is_s1).sort(ascending=(not zero_based)),
                        ],
                    )
                )
                .alias(ends),
            ]
            + on_cols
        )
    )
    df = df.filter(col(is_s1) == 0)
    df = df.select(
        *(
            [
                col(contig).alias(cols1[0] + suff),
                col(s1end_s2start).alias(cols1[1] + suff),
                col(s1start_s2end).alias(cols1[2] + suff),
            ]
            + on_cols
            + [(col(starts) - col(ends)).alias(count)]
        )
    )

    return convert_result(df, output_type)

coverage(df1, df2, suffixes=('_1', '_2'), on_cols=None, cols1=['chrom', 'start', 'end'], cols2=['chrom', 'start', 'end'], output_type='polars.LazyFrame', read_options=None, projection_pushdown=True) staticmethod

Calculate intervals coverage. Bioframe inspired API.

The coordinate system (0-based or 1-based) is automatically detected from DataFrame metadata set at I/O time. Both inputs must have the same coordinate system.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df1 | Union[str, DataFrame, LazyFrame, 'pd.DataFrame'] | Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table (see register_vcf). CSV with a header, BED and Parquet are supported. | required |
| df2 | Union[str, DataFrame, LazyFrame, 'pd.DataFrame'] | Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table. CSV with a header, BED and Parquet are supported. | required |
| cols1 | Union[list[str], None] | The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. | ['chrom', 'start', 'end'] |
| cols2 | Union[list[str], None] | The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. | ['chrom', 'start', 'end'] |
| suffixes | tuple[str, str] | Suffixes for the columns of the two overlapped sets. | ('_1', '_2') |
| on_cols | Union[list[str], None] | List of additional column names to join on. Default is None. | None |
| output_type | str | Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported. | 'polars.LazyFrame' |
| read_options | Union[ReadOptions, None] | Additional options for reading the input files. | None |
| projection_pushdown | bool | Enable column projection pushdown so that only the necessary columns are read at the DataFusion level. | True |

Returns:

| Type | Description |
| --- | --- |
| Union[LazyFrame, DataFrame, 'pd.DataFrame', DataFrame] | polars.LazyFrame or polars.DataFrame or pandas.DataFrame of the intervals with the computed coverage. |

Raises:

| Type | Description |
| --- | --- |
| MissingCoordinateSystemError | If either input lacks coordinate system metadata and datafusion.bio.coordinate_system_check is "true" (default). |
| CoordinateSystemMismatchError | If inputs have different coordinate systems. |

Note

The default output format, i.e. LazyFrame, is recommended for large datasets as it supports output streaming and lazy evaluation. This enables efficient processing of large datasets without loading the entire output dataset into memory.

Example:
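A minimal sketch (not from the library's own examples), assuming the same pandas input pattern used by count_overlaps and overlap:

```python
import pandas as pd
import polars_bio as pb

# Query intervals (df1) and reference intervals (df2), 1-based coordinates
df1 = pd.DataFrame(
    [['chr1', 1, 5], ['chr1', 8, 10]],
    columns=['chrom', 'start', 'end'],
)
df1.attrs["coordinate_system_zero_based"] = False

df2 = pd.DataFrame(
    [['chr1', 4, 8]],
    columns=['chrom', 'start', 'end'],
)
df2.attrs["coordinate_system_zero_based"] = False

# Coverage of df1 intervals by df2 intervals, materialized as pandas
cov = pb.coverage(df1, df2, output_type="pandas.DataFrame")
print(cov)
```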

Todo

Support for on_cols.

Source code in polars_bio/range_op.py
@staticmethod
def coverage(
    df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    suffixes: tuple[str, str] = ("_1", "_2"),
    on_cols: Union[list[str], None] = None,
    cols1: Union[list[str], None] = ["chrom", "start", "end"],
    cols2: Union[list[str], None] = ["chrom", "start", "end"],
    output_type: str = "polars.LazyFrame",
    read_options: Union[ReadOptions, None] = None,
    projection_pushdown: bool = True,
) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
    """
    Calculate intervals coverage.
    Bioframe inspired API.

    The coordinate system (0-based or 1-based) is automatically detected from
    DataFrame metadata set at I/O time. Both inputs must have the same coordinate
    system.

    Parameters:
        df1: Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
        df2: Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table. CSV with a header, BED and Parquet are supported.
        cols1: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        cols2: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        suffixes: Suffixes for the columns of the two overlapped sets.
        on_cols: List of additional column names to join on. Default is None.
        output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
        read_options: Additional options for reading the input files.
        projection_pushdown: Enable column projection pushdown so that only the necessary columns are read at the DataFusion level.

    Returns:
        **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the intervals with the computed coverage.

    Raises:
        MissingCoordinateSystemError: If either input lacks coordinate system metadata
            and `datafusion.bio.coordinate_system_check` is "true" (default).
        CoordinateSystemMismatchError: If inputs have different coordinate systems.

    Note:
        The default output format, i.e. [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), is recommended for large datasets as it supports output streaming and lazy evaluation.
        This enables efficient processing of large datasets without loading the entire output dataset into memory.

    Example:

    Todo:
        Support for on_cols.
    """

    _validate_overlap_input(cols1, cols2, on_cols, suffixes, output_type)

    # Get filter_op from DataFrame metadata
    filter_op = _get_filter_op_from_metadata(df1, df2)

    cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
    cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
    range_options = RangeOptions(
        range_op=RangeOp.Coverage,
        filter_op=filter_op,
        suffixes=suffixes,
        columns_1=cols1,
        columns_2=cols2,
    )
    return range_operation(
        df2,
        df1,
        range_options,
        output_type,
        ctx,
        read_options,
        projection_pushdown=projection_pushdown,
    )

merge(df, min_dist=0, cols=['chrom', 'start', 'end'], on_cols=None, output_type='polars.LazyFrame', projection_pushdown=True) staticmethod

Merge overlapping intervals. It is assumed that start < end.

The coordinate system (0-based or 1-based) is automatically detected from DataFrame metadata set at I/O time.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | Union[str, DataFrame, LazyFrame, 'pd.DataFrame'] | Can be a path to a file, a polars DataFrame, or a pandas DataFrame. CSV with a header, BED and Parquet are supported. | required |
| min_dist | float | Minimum distance between intervals to merge. Default is 0. | 0 |
| cols | Union[list[str], None] | The names of columns containing the chromosome, start and end of the genomic intervals. | ['chrom', 'start', 'end'] |
| on_cols | Union[list[str], None] | List of additional column names for clustering. Default is None. | None |
| output_type | str | Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported. | 'polars.LazyFrame' |
| projection_pushdown | bool | Enable column projection pushdown so that only the necessary columns are read at the DataFusion level. | True |

Returns:

| Type | Description |
| --- | --- |
| Union[LazyFrame, DataFrame, 'pd.DataFrame', DataFrame] | polars.LazyFrame or polars.DataFrame or pandas.DataFrame of the merged intervals. |

Raises:

| Type | Description |
| --- | --- |
| MissingCoordinateSystemError | If input lacks coordinate system metadata and datafusion.bio.coordinate_system_check is "true" (default). |

Example:
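A minimal sketch (not from the library's own examples), assuming the documented pandas input pattern:

```python
import pandas as pd
import polars_bio as pb

df = pd.DataFrame(
    [['chr1', 1, 5], ['chr1', 3, 8], ['chr1', 10, 12]],
    columns=['chrom', 'start', 'end'],
)
df.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

# Merge overlapping intervals per chromosome; min_dist controls how close
# non-overlapping intervals may be and still get merged
merged = pb.merge(df, min_dist=0, output_type="pandas.DataFrame")
print(merged)
```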

Todo

Support for on_cols.

Source code in polars_bio/range_op.py
@staticmethod
def merge(
    df: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    min_dist: float = 0,
    cols: Union[list[str], None] = ["chrom", "start", "end"],
    on_cols: Union[list[str], None] = None,
    output_type: str = "polars.LazyFrame",
    projection_pushdown: bool = True,
) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
    """
    Merge overlapping intervals. It is assumed that start < end.

    The coordinate system (0-based or 1-based) is automatically detected from
    DataFrame metadata set at I/O time.

    Parameters:
        df: Can be a path to a file, a polars DataFrame, or a pandas DataFrame. CSV with a header, BED and Parquet are supported.
        min_dist: Minimum distance between intervals to merge. Default is 0.
        cols: The names of columns containing the chromosome, start and end of the
            genomic intervals.
        on_cols: List of additional column names for clustering. Default is None.
        output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
        projection_pushdown: Enable column projection pushdown so that only the necessary columns are read at the DataFusion level.

    Returns:
        **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the merged intervals.

    Raises:
        MissingCoordinateSystemError: If input lacks coordinate system metadata
            and `datafusion.bio.coordinate_system_check` is "true" (default).

    Example:

    Todo:
        Support for on_cols.
    """
    suffixes = ("_1", "_2")
    _validate_overlap_input(cols, cols, on_cols, suffixes, output_type)

    # Get zero_based from DataFrame metadata
    zero_based = validate_coordinate_system_single(df, ctx)

    my_ctx = get_py_ctx()
    cols = DEFAULT_INTERVAL_COLUMNS if cols is None else cols
    contig = cols[0]
    start = cols[1]
    end = cols[2]

    on_cols = [] if on_cols is None else on_cols
    on_cols = [contig] + on_cols

    df = read_df_to_datafusion(my_ctx, df)
    df_schema = df.schema()
    start_type = df_schema.field(start).type
    end_type = df_schema.field(end).type

    curr_cols = set(df_schema.names)
    start_end = prevent_column_collision("start_end", curr_cols)
    is_start_end = prevent_column_collision("is_start_or_end", curr_cols)
    current_intervals = prevent_column_collision("current_intervals", curr_cols)
    n_intervals = prevent_column_collision("n_intervals", curr_cols)

    end_positions = df.select(
        *(
            [
                (col(end) + min_dist).alias(start_end),
                literal(-1).alias(is_start_end),
            ]
            + on_cols
        )
    )
    start_positions = df.select(
        *([col(start).alias(start_end), literal(1).alias(is_start_end)] + on_cols)
    )
    all_positions = start_positions.union(end_positions)
    start_end_type = all_positions.schema().field(start_end).type
    all_positions = all_positions.select(
        *([col(start_end).cast(start_end_type), col(is_start_end)] + on_cols)
    )

    sorting = [
        col(start_end).sort(),
        col(is_start_end).sort(ascending=zero_based),
    ]
    all_positions = all_positions.sort(*sorting)

    on_cols_expr = [col(c) for c in on_cols]

    win = datafusion.expr.Window(
        partition_by=on_cols_expr,
        order_by=sorting,
    )
    all_positions = all_positions.select(
        *(
            [
                start_end,
                is_start_end,
                datafusion.functions.sum(col(is_start_end))
                .over(win)
                .alias(current_intervals),
            ]
            + on_cols
            + [
                datafusion.functions.row_number(
                    partition_by=on_cols_expr, order_by=sorting
                ).alias(n_intervals)
            ]
        )
    )
    all_positions = all_positions.filter(
        ((col(current_intervals) == 0) & (col(is_start_end) == -1))
        | ((col(current_intervals) == 1) & (col(is_start_end) == 1))
    )
    all_positions = all_positions.select(
        *(
            [start_end, is_start_end]
            + on_cols
            + [
                (
                    (
                        col(n_intervals)
                        - datafusion.functions.lag(
                            col(n_intervals), partition_by=on_cols_expr
                        )
                        + 1
                    )
                    / 2
                )
                .cast(pa.int64())
                .alias(n_intervals)
            ]
        )
    )
    result = all_positions.select(
        *(
            [
                (col(start_end) - min_dist).alias(end),
                is_start_end,
                datafusion.functions.lag(
                    col(start_end), partition_by=on_cols_expr
                ).alias(start),
            ]
            + on_cols
            + [n_intervals]
        )
    )
    result = result.filter(col(is_start_end) == -1)
    result = result.select(
        *(
            [contig, col(start).cast(start_type), col(end).cast(end_type)]
            + on_cols[1:]
            + [n_intervals]
        )
    )

    output = convert_result(result, output_type)

    # Propagate coordinate system metadata to result
    if output_type in ("polars.DataFrame", "polars.LazyFrame", "pandas.DataFrame"):
        set_coordinate_system(output, zero_based)

    return output

nearest(df1, df2, suffixes=('_1', '_2'), on_cols=None, cols1=['chrom', 'start', 'end'], cols2=['chrom', 'start', 'end'], output_type='polars.LazyFrame', read_options=None, projection_pushdown=True) staticmethod

Find pairs of closest genomic intervals. Bioframe inspired API.

The coordinate system (0-based or 1-based) is automatically detected from DataFrame metadata set at I/O time. Both inputs must have the same coordinate system.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df1 | Union[str, DataFrame, LazyFrame, 'pd.DataFrame'] | Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table (see register_vcf). CSV with a header, BED and Parquet are supported. | required |
| df2 | Union[str, DataFrame, LazyFrame, 'pd.DataFrame'] | Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table. CSV with a header, BED and Parquet are supported. | required |
| cols1 | Union[list[str], None] | The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. | ['chrom', 'start', 'end'] |
| cols2 | Union[list[str], None] | The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. | ['chrom', 'start', 'end'] |
| suffixes | tuple[str, str] | Suffixes for the columns of the two overlapped sets. | ('_1', '_2') |
| on_cols | Union[list[str], None] | List of additional column names to join on. Default is None. | None |
| output_type | str | Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported. | 'polars.LazyFrame' |
| read_options | Union[ReadOptions, None] | Additional options for reading the input files. | None |
| projection_pushdown | bool | Enable column projection pushdown so that only the necessary columns are read at the DataFusion level. | True |

Returns:

| Type | Description |
| --- | --- |
| Union[LazyFrame, DataFrame, 'pd.DataFrame', DataFrame] | polars.LazyFrame or polars.DataFrame or pandas.DataFrame of the closest interval pairs. |

Raises:

| Type | Description |
| --- | --- |
| MissingCoordinateSystemError | If either input lacks coordinate system metadata and datafusion.bio.coordinate_system_check is "true" (default). |
| CoordinateSystemMismatchError | If inputs have different coordinate systems. |

Note

The default output format, i.e. LazyFrame, is recommended for large datasets as it supports output streaming and lazy evaluation. This enables efficient processing of large datasets without loading the entire output dataset into memory.

Example:
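A minimal sketch (not from the library's own examples), assuming the same pandas input pattern used by the overlap example below:

```python
import pandas as pd
import polars_bio as pb

df1 = pd.DataFrame(
    [['chr1', 1, 5], ['chr1', 12, 14]],
    columns=['chrom', 'start', 'end'],
)
df1.attrs["coordinate_system_zero_based"] = False

df2 = pd.DataFrame(
    [['chr1', 8, 10]],
    columns=['chrom', 'start', 'end'],
)
df2.attrs["coordinate_system_zero_based"] = False

# For each interval in df1, find the closest interval in df2
nearest_pairs = pb.nearest(df1, df2, output_type="pandas.DataFrame")
print(nearest_pairs)
```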

Todo

Support for on_cols.

Source code in polars_bio/range_op.py
@staticmethod
def nearest(
    df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    suffixes: tuple[str, str] = ("_1", "_2"),
    on_cols: Union[list[str], None] = None,
    cols1: Union[list[str], None] = ["chrom", "start", "end"],
    cols2: Union[list[str], None] = ["chrom", "start", "end"],
    output_type: str = "polars.LazyFrame",
    read_options: Union[ReadOptions, None] = None,
    projection_pushdown: bool = True,
) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
    """
    Find pairs of closest genomic intervals.
    Bioframe inspired API.

    The coordinate system (0-based or 1-based) is automatically detected from
    DataFrame metadata set at I/O time. Both inputs must have the same coordinate
    system.

    Parameters:
        df1: Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
        df2: Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table. CSV with a header, BED and Parquet are supported.
        cols1: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        cols2: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        suffixes: Suffixes for the columns of the two overlapped sets.
        on_cols: List of additional column names to join on. Default is None.
        output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
        read_options: Additional options for reading the input files.
        projection_pushdown: Enable column projection pushdown so that only the necessary columns are read at the DataFusion level.

    Returns:
        **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the closest interval pairs.

    Raises:
        MissingCoordinateSystemError: If either input lacks coordinate system metadata
            and `datafusion.bio.coordinate_system_check` is "true" (default).
        CoordinateSystemMismatchError: If inputs have different coordinate systems.

    Note:
        The default output format, i.e. [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), is recommended for large datasets as it supports output streaming and lazy evaluation.
        This enables efficient processing of large datasets without loading the entire output dataset into memory.

    Example:

    Todo:
        Support for on_cols.
    """

    _validate_overlap_input(cols1, cols2, on_cols, suffixes, output_type)

    # Get filter_op from DataFrame metadata
    filter_op = _get_filter_op_from_metadata(df1, df2)

    cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
    cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
    range_options = RangeOptions(
        range_op=RangeOp.Nearest,
        filter_op=filter_op,
        suffixes=suffixes,
        columns_1=cols1,
        columns_2=cols2,
    )
    return range_operation(
        df1,
        df2,
        range_options,
        output_type,
        ctx,
        read_options,
        projection_pushdown=projection_pushdown,
    )

overlap(df1, df2, suffixes=('_1', '_2'), on_cols=None, cols1=['chrom', 'start', 'end'], cols2=['chrom', 'start', 'end'], algorithm='Coitrees', low_memory=False, output_type='polars.LazyFrame', read_options1=None, read_options2=None, projection_pushdown=True) staticmethod

Find pairs of overlapping genomic intervals. Bioframe inspired API.

The coordinate system (0-based or 1-based) is automatically detected from DataFrame metadata set at I/O time. Both inputs must have the same coordinate system.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df1 | Union[str, DataFrame, LazyFrame, 'pd.DataFrame'] | Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table (see register_vcf). CSV with a header, BED and Parquet are supported. | required |
| df2 | Union[str, DataFrame, LazyFrame, 'pd.DataFrame'] | Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table. CSV with a header, BED and Parquet are supported. | required |
| cols1 | Union[list[str], None] | The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. | ['chrom', 'start', 'end'] |
| cols2 | Union[list[str], None] | The names of columns containing the chromosome, start and end of the genomic intervals, provided separately for each set. | ['chrom', 'start', 'end'] |
| suffixes | tuple[str, str] | Suffixes for the columns of the two overlapped sets. | ('_1', '_2') |
| on_cols | Union[list[str], None] | List of additional column names to join on. Default is None. | None |
| algorithm | str | The algorithm to use for the overlap operation. Available options: Coitrees, IntervalTree, ArrayIntervalTree, Lapper, SuperIntervals. | 'Coitrees' |
| low_memory | bool | If True, use a low-memory method for output generation. This may be slower but uses less memory. | False |
| output_type | str | Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported. | 'polars.LazyFrame' |
| read_options1 | Union[ReadOptions, None] | Additional options for reading the input files. | None |
| read_options2 | Union[ReadOptions, None] | Additional options for reading the input files. | None |
| projection_pushdown | bool | Enable column projection pushdown so that only the necessary columns are read at the DataFusion level. | True |

Returns:

| Type | Description |
| --- | --- |
| Union[LazyFrame, DataFrame, 'pd.DataFrame', DataFrame] | polars.LazyFrame or polars.DataFrame or pandas.DataFrame of the overlapping intervals. |

Raises:

| Type | Description |
| --- | --- |
| MissingCoordinateSystemError | If either input lacks coordinate system metadata and datafusion.bio.coordinate_system_check is "true" (default). Use polars-bio I/O functions (scan_*, read_*), which automatically set the metadata, or set it manually on Polars DataFrames via df.config_meta.set(coordinate_system_zero_based=True/False) or on Pandas DataFrames via df.attrs["coordinate_system_zero_based"] = True/False. Set pb.set_option("datafusion.bio.coordinate_system_check", False) to disable strict checking and fall back to the global coordinate system setting (see the sketch below). |
| CoordinateSystemMismatchError | If inputs have different coordinate systems. |
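The remedies listed above can be applied directly before calling a range operation. A minimal sketch of the two manual options, assuming polars-config-meta is available through polars_bio as described in the error message:

```python
import pandas as pd
import polars as pl
import polars_bio as pb

# Pandas: attach coordinate-system metadata via DataFrame attrs
pdf = pd.DataFrame([['chr1', 1, 5]], columns=['chrom', 'start', 'end'])
pdf.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

# Polars: attach the same metadata via polars-config-meta
pldf = pl.DataFrame({"chrom": ["chr1"], "start": [1], "end": [5]})
pldf.config_meta.set(coordinate_system_zero_based=False)

# Alternatively, disable the strict check and fall back to the
# global coordinate system setting
pb.set_option("datafusion.bio.coordinate_system_check", False)
```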

Note
  1. The default output format, i.e. LazyFrame, is recommended for large datasets as it supports output streaming and lazy evaluation. This enables efficient processing of large datasets without loading the entire output dataset into memory.
  2. Streaming is only supported for polars.LazyFrame output.
Example

```python
import polars_bio as pb
import pandas as pd

df1 = pd.DataFrame([
    ['chr1', 1, 5],
    ['chr1', 3, 8],
    ['chr1', 8, 10],
    ['chr1', 12, 14]],
columns=['chrom', 'start', 'end']
)
df1.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

df2 = pd.DataFrame(
[['chr1', 4, 8],
 ['chr1', 10, 11]],
columns=['chrom', 'start', 'end']
)
df2.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

overlapping_intervals = pb.overlap(df1, df2, output_type="pandas.DataFrame")

overlapping_intervals
    chrom_1         start_1     end_1 chrom_2       start_2  end_2
0     chr1            1          5     chr1            4          8
1     chr1            3          8     chr1            4          8
```
Todo

Support for on_cols.

Source code in polars_bio/range_op.py
@staticmethod
def overlap(
    df1: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    df2: Union[str, pl.DataFrame, pl.LazyFrame, "pd.DataFrame"],
    suffixes: tuple[str, str] = ("_1", "_2"),
    on_cols: Union[list[str], None] = None,
    cols1: Union[list[str], None] = ["chrom", "start", "end"],
    cols2: Union[list[str], None] = ["chrom", "start", "end"],
    algorithm: str = "Coitrees",
    low_memory: bool = False,
    output_type: str = "polars.LazyFrame",
    read_options1: Union[ReadOptions, None] = None,
    read_options2: Union[ReadOptions, None] = None,
    projection_pushdown: bool = True,
) -> Union[pl.LazyFrame, pl.DataFrame, "pd.DataFrame", datafusion.DataFrame]:
    """
    Find pairs of overlapping genomic intervals.
    Bioframe inspired API.

    The coordinate system (0-based or 1-based) is automatically detected from
    DataFrame metadata set at I/O time. Both inputs must have the same coordinate
    system.

    Parameters:
        df1: Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table (see [register_vcf](api.md#polars_bio.register_vcf)). CSV with a header, BED and Parquet are supported.
        df2: Can be a path to a file, a polars DataFrame, a pandas DataFrame, or a registered table. CSV with a header, BED and Parquet are supported.
        cols1: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        cols2: The names of columns containing the chromosome, start and end of the
            genomic intervals, provided separately for each set.
        suffixes: Suffixes for the columns of the two overlapped sets.
        on_cols: List of additional column names to join on. Default is None.
        algorithm: The algorithm to use for the overlap operation. Available options: Coitrees, IntervalTree, ArrayIntervalTree, Lapper, SuperIntervals.
        low_memory: If True, use a low-memory method for output generation. This may be slower but uses less memory.
        output_type: Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame", "pandas.DataFrame" and "datafusion.DataFrame" are also supported.
        read_options1: Additional options for reading the input files.
        read_options2: Additional options for reading the input files.
        projection_pushdown: Enable column projection pushdown so that only the necessary columns are read at the DataFusion level.

    Returns:
        **polars.LazyFrame** or polars.DataFrame or pandas.DataFrame of the overlapping intervals.

    Raises:
        MissingCoordinateSystemError: If either input lacks coordinate system metadata
            and `datafusion.bio.coordinate_system_check` is "true" (default). Use polars-bio
            I/O functions (scan_*, read_*) which automatically set metadata, or set it manually
            on Polars DataFrames via `df.config_meta.set(coordinate_system_zero_based=True/False)`
            or on Pandas DataFrames via `df.attrs["coordinate_system_zero_based"] = True/False`.
            Set `pb.set_option("datafusion.bio.coordinate_system_check", False)` to disable
            strict checking and fall back to global coordinate system setting.
        CoordinateSystemMismatchError: If inputs have different coordinate systems.

    Note:
        1. The default output format, i.e.  [LazyFrame](https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html), is recommended for large datasets as it supports output streaming and lazy evaluation.
        This enables efficient processing of large datasets without loading the entire output dataset into memory.
        2. Streaming is only supported for polars.LazyFrame output.

    Example:
        ```python
        import polars_bio as pb
        import pandas as pd

        df1 = pd.DataFrame([
            ['chr1', 1, 5],
            ['chr1', 3, 8],
            ['chr1', 8, 10],
            ['chr1', 12, 14]],
        columns=['chrom', 'start', 'end']
        )
        df1.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

        df2 = pd.DataFrame(
        [['chr1', 4, 8],
         ['chr1', 10, 11]],
        columns=['chrom', 'start', 'end' ]
        )
        df2.attrs["coordinate_system_zero_based"] = False  # 1-based coordinates

        overlapping_intervals = pb.overlap(df1, df2, output_type="pandas.DataFrame")

        overlapping_intervals
            chrom_1         start_1     end_1 chrom_2       start_2  end_2
        0     chr1            1          5     chr1            4          8
        1     chr1            3          8     chr1            4          8

        ```

    Todo:
         Support for on_cols.
    """

    _validate_overlap_input(cols1, cols2, on_cols, suffixes, output_type)

    # Get filter_op from DataFrame metadata
    filter_op = _get_filter_op_from_metadata(df1, df2)

    cols1 = DEFAULT_INTERVAL_COLUMNS if cols1 is None else cols1
    cols2 = DEFAULT_INTERVAL_COLUMNS if cols2 is None else cols2
    range_options = RangeOptions(
        range_op=RangeOp.Overlap,
        filter_op=filter_op,
        suffixes=suffixes,
        columns_1=cols1,
        columns_2=cols2,
        overlap_alg=algorithm,
        overlap_low_memory=low_memory,
    )

    return range_operation(
        df1,
        df2,
        range_options,
        output_type,
        ctx,
        read_options1,
        read_options2,
        projection_pushdown,
    )

get_metadata(df)

Get all metadata attached to a DataFrame or LazyFrame.

Returns all metadata including:

- Source file information (format, path)
- Format-specific metadata (VCF INFO/FORMAT fields, FASTQ quality encoding, etc.)
- Comprehensive Arrow schema metadata (if available)

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | | Polars DataFrame or LazyFrame (or Pandas DataFrame) | required |

Returns:

dict with the following keys:

- "format": File format identifier (e.g., "vcf", "fastq", "bam")
- "path": Original file path
- "coordinate_system_zero_based": Boolean indicating coordinate system (True=0-based, False=1-based, None=not set)
- "header": Format-specific header data as dict, may include:
    - For VCF: "info_fields", "format_fields", "sample_names", "version", "contigs", "filters", etc.
    - For FASTQ: quality encoding information
    - For other formats: format-specific metadata
    - "_datafusion_table_name": Internal DataFusion table name (for debugging)

Examples:

Get all metadata from a VCF file:

```python
import polars_bio as pb
lf = pb.scan_vcf("file.vcf")
meta = pb.get_metadata(lf)
```

Access basic metadata:

```python
meta["format"]                        # Returns: 'vcf'
meta["path"]                          # Returns: 'file.vcf'
meta["coordinate_system_zero_based"]  # Returns: False (1-based for VCF)
```

Access VCF-specific metadata:

```python
info_fields = meta["header"]["info_fields"]
format_fields = meta["header"]["format_fields"]
sample_names = meta["header"]["sample_names"]
version = meta["header"]["version"]
contigs = meta["header"]["contigs"]
```

Source code in polars_bio/_metadata.py
def get_metadata(df) -> dict:
    """Get all metadata attached to a DataFrame or LazyFrame.

    Returns all metadata including:
    - Source file information (format, path)
    - Format-specific metadata (VCF INFO/FORMAT fields, FASTQ quality encoding, etc.)
    - Comprehensive Arrow schema metadata (if available)

    Args:
        df: Polars DataFrame or LazyFrame (or Pandas DataFrame)

    Returns:
        Dict with keys:
        - "format": File format identifier (e.g., "vcf", "fastq", "bam")
        - "path": Original file path
        - "coordinate_system_zero_based": Boolean indicating coordinate system (True=0-based, False=1-based, None=not set)
        - "header": Format-specific header data as dict, may include:
            - For VCF: "info_fields", "format_fields", "sample_names", "version", "contigs", "filters", etc.
            - For FASTQ: quality encoding information
            - For other formats: format-specific metadata
            - "_datafusion_table_name": Internal DataFusion table name (for debugging)

    Examples:
        Get all metadata from a VCF file:
        ```python
        import polars_bio as pb
        lf = pb.scan_vcf("file.vcf")
        meta = pb.get_metadata(lf)
        ```

        Access basic metadata:
        ```python
        meta["format"]                        # Returns: 'vcf'
        meta["path"]                          # Returns: 'file.vcf'
        meta["coordinate_system_zero_based"]  # Returns: False (1-based for VCF)
        ```

        Access VCF-specific metadata:
        ```python
        info_fields = meta["header"]["info_fields"]
        format_fields = meta["header"]["format_fields"]
        sample_names = meta["header"]["sample_names"]
        version = meta["header"]["version"]
        contigs = meta["header"]["contigs"]
        ```
    """
    result = {
        "format": None,
        "path": None,
        "coordinate_system_zero_based": None,
        "header": None,
    }

    if _has_config_meta(df):
        # Polars DataFrame/LazyFrame
        try:
            metadata = df.config_meta.get_metadata()
        except (KeyError, AttributeError, TypeError):
            return result

        result["format"] = metadata.get(SOURCE_FORMAT_KEY)
        result["path"] = metadata.get(SOURCE_PATH_KEY)
        result["coordinate_system_zero_based"] = metadata.get(COORDINATE_SYSTEM_KEY)

        header_json = metadata.get(SOURCE_HEADER_KEY)
        if header_json:
            try:
                result["header"] = json.loads(header_json)
            except (json.JSONDecodeError, TypeError):
                pass

    elif _is_pandas_dataframe(df):
        # Pandas DataFrame
        if hasattr(df, "attrs"):
            result["format"] = df.attrs.get(SOURCE_FORMAT_KEY)
            result["path"] = df.attrs.get(SOURCE_PATH_KEY)
            result["coordinate_system_zero_based"] = df.attrs.get(COORDINATE_SYSTEM_KEY)

            header_json = df.attrs.get(SOURCE_HEADER_KEY)
            if header_json:
                try:
                    result["header"] = json.loads(header_json)
                except (json.JSONDecodeError, TypeError):
                    pass

    return result

get_option(key)

Get the value of a configuration option.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| key | | The configuration key. | required |

Returns:

The current value of the option as a string, or None if not set.

Example

```python
import polars_bio as pb
pb.get_option("datafusion.bio.coordinate_system_zero_based")
'true'
```
Source code in polars_bio/context.py
def get_option(key):
    """Get the value of a configuration option.

    Args:
        key: The configuration key.

    Returns:
        The current value of the option as a string, or None if not set.

    Example:
        ```python
        import polars_bio as pb
        pb.get_option("datafusion.bio.coordinate_system_zero_based")
        'true'
        ```
    """
    return Context().get_option(key)

print_metadata_json(df, indent=2)

Print metadata as pretty-formatted JSON.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | Union[DataFrame, LazyFrame] | Polars DataFrame or LazyFrame | required |
| indent | int | Number of spaces for indentation (default: 2) | 2 |

Example

```python
import polars_bio as pb
lf = pb.scan_vcf("file.vcf")
pb.print_metadata_json(lf)
```
Source code in polars_bio/_metadata.py
def print_metadata_json(df: Union[pl.DataFrame, pl.LazyFrame], indent: int = 2) -> None:
    """Print metadata as pretty-formatted JSON.

    Args:
        df: Polars DataFrame or LazyFrame
        indent: Number of spaces for indentation (default: 2)

    Example:
        ```python
        import polars_bio as pb
        lf = pb.scan_vcf("file.vcf")
        pb.print_metadata_json(lf)
        ```
    """
    meta = get_metadata(df)
    print(json.dumps(meta, indent=indent, default=str))

print_metadata_summary(df)

Print a human-readable summary of all metadata.

Displays a formatted summary of all metadata attached to a DataFrame or LazyFrame, including format, path, coordinate system, and format-specific information.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | Union[DataFrame, LazyFrame] | Polars DataFrame or LazyFrame | required |

Example

```python
import polars_bio as pb
lf = pb.scan_vcf("file.vcf")
pb.print_metadata_summary(lf)
```
Source code in polars_bio/_metadata.py
def print_metadata_summary(df: Union[pl.DataFrame, pl.LazyFrame]) -> None:
    """Print a human-readable summary of all metadata.

    Displays a formatted summary of all metadata attached to a DataFrame or LazyFrame,
    including format, path, coordinate system, and format-specific information.

    Args:
        df: Polars DataFrame or LazyFrame

    Example:
        ```python
        import polars_bio as pb
        lf = pb.scan_vcf("file.vcf")
        pb.print_metadata_summary(lf)
        ```
    """
    meta = get_metadata(df)
    if not meta or not any([meta.get("format"), meta.get("path"), meta.get("header")]):
        print("No metadata available")
        return

    print("=" * 70)
    print("Metadata Summary")
    print("=" * 70)
    print()

    # Basic metadata
    if meta.get("format"):
        print(f"Format: {meta['format']}")
    if meta.get("path"):
        print(f"Path: {meta['path']}")
    if meta.get("coordinate_system_zero_based") is not None:
        coord_sys = "0-based" if meta["coordinate_system_zero_based"] else "1-based"
        print(f"Coordinate System: {coord_sys}")

    # Format-specific metadata
    if meta.get("header"):
        header = meta["header"]
        print()
        print("Format-specific metadata:")
        print("-" * 70)

        # VCF-specific
        if meta.get("format") == "vcf":
            if "version" in header:
                print(f"  VCF Version: {header['version']}")
            if "sample_names" in header:
                samples = header["sample_names"]
                print(f"  Samples ({len(samples)}): {', '.join(samples[:5])}")
                if len(samples) > 5:
                    print(f"    ... and {len(samples) - 5} more")
            if "info_fields" in header:
                print(f"  INFO fields: {len(header['info_fields'])}")
                for field_id in list(header["info_fields"].keys())[:3]:
                    field = header["info_fields"][field_id]
                    print(
                        f"    - {field_id}: {field.get('type')} ({field.get('description', 'No description')})"
                    )
                if len(header["info_fields"]) > 3:
                    print(f"    ... and {len(header['info_fields']) - 3} more")
            if "format_fields" in header:
                print(f"  FORMAT fields: {len(header['format_fields'])}")
                for field_id in list(header["format_fields"].keys())[:3]:
                    field = header["format_fields"][field_id]
                    print(
                        f"    - {field_id}: {field.get('type')} ({field.get('description', 'No description')})"
                    )
                if len(header["format_fields"]) > 3:
                    print(f"    ... and {len(header['format_fields']) - 3} more")
            if "contigs" in header and header["contigs"]:
                print(f"  Contigs: {len(header['contigs'])}")
            if "filters" in header and header["filters"]:
                print(f"  Filters: {len(header['filters'])}")

        # Other formats can be added here as needed

    print()
    print("=" * 70)

set_loglevel(level)

Set the log level for the logger and root logger.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `level` | `str` | The log level to set. Can be "debug", "info", "warn", or "warning". | required |

Note

The log level should be set as the first step after importing the library. Once set, it can only be decreased, not increased; to increase the log level, you need to restart the Python session.

Example

import polars_bio as pb
pb.set_loglevel("info")

Source code in polars_bio/logging.py
def set_loglevel(level: str):
    """Set the log level for the logger and root logger.

    Args:
        level: The log level to set. Can be "debug", "info", "warn", or "warning".

    !!! note
        The log level should be set as a **first** step after importing the library.
        Once set it can be only **decreased**, not increased. In order to increase
        the log level, you need to restart the Python session.

        Example:
        ```python
        import polars_bio as pb
        pb.set_loglevel("info")
        ```
    """
    level = level.lower()
    if level == "debug":
        logger.setLevel(logging.DEBUG)
        root_logger.setLevel(logging.DEBUG)
        logging.basicConfig(level=logging.DEBUG)
    elif level == "info":
        logger.setLevel(logging.INFO)
        root_logger.setLevel(logging.INFO)
        logging.basicConfig(level=logging.INFO)
    elif level == "warn" or level == "warning":
        logger.setLevel(logging.WARN)
        root_logger.setLevel(logging.WARN)
        logging.basicConfig(level=logging.WARN)
    else:
        raise ValueError(f"{level} is not a valid log level")
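
A minimal sketch of the ordering described in the note above: configure logging as the first step after importing the library.

```python
import polars_bio as pb

# Set the log level immediately after import (see the note above).
pb.set_loglevel("info")

# Per the note, the level can later only be decreased (e.g. to "debug");
# raising it again (e.g. back to "warn") requires a new Python session.
```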

set_option(key, value)

Set a configuration option.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `key` | | The configuration key. | required |
| `value` | | The value to set (bool values are converted to "true"/"false"). | required |

Example
import polars_bio as pb
pb.set_option("datafusion.bio.coordinate_system_zero_based", False)
Source code in polars_bio/context.py
def set_option(key, value):
    """Set a configuration option.

    Args:
        key: The configuration key.
        value: The value to set (bool values are converted to "true"/"false").

    Example:
        ```python
        import polars_bio as pb
        pb.set_option("datafusion.bio.coordinate_system_zero_based", False)
        ```
    """
    Context().set_option(key, value)
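
A short round-trip sketch combining `set_option` with `get_option` (documented above): booleans are stored as the strings "true"/"false", so reading the option back returns the string form rather than a Python bool.

```python
import polars_bio as pb

# The boolean is converted to the string "false" on write...
pb.set_option("datafusion.bio.coordinate_system_zero_based", False)

# ...so reading it back yields the string form, not a Python bool.
pb.get_option("datafusion.bio.coordinate_system_zero_based")
# expected: 'false'
```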

set_source_metadata(df, format, path='', header=None)

Set standardized source file metadata.

Stores metadata about the source file format, path, and format-specific header information. This standardized approach works across all file formats (VCF, FASTQ, BAM, GFF, BED, FASTA, CRAM).

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `df` | | Polars DataFrame or LazyFrame (or Pandas DataFrame) | required |
| `format` | `str` | File format identifier (e.g., "vcf", "fastq", "bam") | required |
| `path` | `str` | Original file path (default: "") | `''` |
| `header` | `dict` | Format-specific header data as dict (default: None). For VCF: `{"info_fields": {...}, "format_fields": {...}, "sample_names": [...], ...}`; for other formats: format-specific metadata | `None` |

Example
import polars_bio as pb
lf = pb.scan_vcf("sample.vcf")
header = {"info_fields": {...}, "sample_names": ["sample1"]}
pb.set_source_metadata(lf, format="vcf", path="sample.vcf", header=header)
Source code in polars_bio/_metadata.py
def set_source_metadata(df, format: str, path: str = "", header: dict = None):
    """Set standardized source file metadata.

    Stores metadata about the source file format, path, and format-specific
    header information. This standardized approach works across all file
    formats (VCF, FASTQ, BAM, GFF, BED, FASTA, CRAM).

    Args:
        df: Polars DataFrame or LazyFrame (or Pandas DataFrame)
        format: File format identifier (e.g., "vcf", "fastq", "bam")
        path: Original file path (default: "")
        header: Format-specific header data as dict (default: None)
                For VCF: {"info_fields": {...}, "format_fields": {...}, "sample_names": [...], ...}
                For other formats: format-specific metadata

    Example:
        ```python
        import polars_bio as pb
        lf = pb.scan_vcf("sample.vcf")
        header = {"info_fields": {...}, "sample_names": ["sample1"]}
        pb.set_source_metadata(lf, format="vcf", path="sample.vcf", header=header)
        ```
    """
    if _has_config_meta(df):
        # Polars DataFrame/LazyFrame
        metadata_updates = {
            SOURCE_FORMAT_KEY: format,
            SOURCE_PATH_KEY: path,
            SOURCE_HEADER_KEY: json.dumps(header) if header else "",
        }
        df.config_meta.set(**metadata_updates)
    elif _is_pandas_dataframe(df):
        # Pandas DataFrame
        if not hasattr(df, "attrs"):
            df.attrs = {}
        df.attrs[SOURCE_FORMAT_KEY] = format
        df.attrs[SOURCE_PATH_KEY] = path
        df.attrs[SOURCE_HEADER_KEY] = json.dumps(header) if header else ""