⚙️ API reference
nearest(df1, df2, overlap_filter=FilterOp.Strict, suffixes=('_1', '_2'), on_cols=None, col1=None, col2=None, output_type='polars.LazyFrame')
Find pairs of closest genomic intervals. Bioframe-inspired API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df1 | Union[str, polars.DataFrame, polars.LazyFrame, pandas.DataFrame] | Can be a path to a file, a polars DataFrame or LazyFrame, or a pandas DataFrame. CSV with a header and Parquet are supported. | required |
df2 | Union[str, polars.DataFrame, polars.LazyFrame, pandas.DataFrame] | Can be a path to a file, a polars DataFrame or LazyFrame, or a pandas DataFrame. CSV with a header and Parquet are supported. | required |
overlap_filter | FilterOp | FilterOp, optional. The type of overlap to consider (Weak or Strict). Default is FilterOp.Strict. | Strict |
col1 | Union[list[str], None] | The names of the columns containing the chromosome, start, and end of the genomic intervals in df1. The default values are 'chrom', 'start', 'end'. | None |
col2 | Union[list[str], None] | The names of the columns containing the chromosome, start, and end of the genomic intervals in df2. The default values are 'chrom', 'start', 'end'. | None |
suffixes | tuple[str, str] | Suffixes appended to the column names of the two input sets in the output. | ('_1', '_2') |
on_cols | Union[list[str], None] | List of additional column names to join on. Default is None. | None |
output_type | str | Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame" and "pandas.DataFrame" are also supported. | 'polars.LazyFrame' |
Returns:
Type | Description |
---|---|
Union[polars.LazyFrame, polars.DataFrame, pandas.DataFrame] | polars.LazyFrame, polars.DataFrame, or pandas.DataFrame of the matched interval pairs. |
Note
The default output format, i.e. LazyFrame, is recommended for large datasets as it supports output streaming and lazy evaluation. This enables efficient processing of large datasets without loading the entire output dataset into memory.
Example:
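No library example is given for this function, so the following is only a conceptual sketch in plain Python of what a nearest-interval lookup means. It does not call polars_bio. The names `Interval`, `gap`, and `nearest_sketch` are hypothetical, and the distance definition (0 for touching or overlapping intervals, otherwise the size of the gap) and the tie-breaking rule (first candidate wins) are assumptions that the real function may not share.

```python
from typing import List, NamedTuple, Optional, Tuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


def gap(a: Interval, b: Interval) -> int:
    # Assumed distance: 0 when the intervals overlap or touch,
    # otherwise the number of positions separating them.
    return max(0, b.start - a.end, a.start - b.end)


def nearest_sketch(
    left: List[Interval], right: List[Interval]
) -> List[Tuple[Interval, Optional[Interval], Optional[int]]]:
    """Pair every interval in `left` with its closest interval in `right`."""
    pairs = []
    for a in left:
        # Only intervals on the same chromosome are candidates.
        same_chrom = [b for b in right if b.chrom == a.chrom]
        if not same_chrom:
            pairs.append((a, None, None))
            continue
        # min() returns the first candidate among equally close ones.
        best = min(same_chrom, key=lambda b: gap(a, b))
        pairs.append((a, best, gap(a, best)))
    return pairs


set1 = [
    Interval("chr1", 1, 5),
    Interval("chr1", 3, 8),
    Interval("chr1", 8, 10),
    Interval("chr1", 12, 14),
]
set2 = [Interval("chr1", 4, 8), Interval("chr1", 10, 11)]

pairs = nearest_sketch(set1, set2)
for a, b, distance in pairs:
    print(a, b, distance)
```

Going by the signature above, the corresponding library call on two data frames would be `pb.nearest(df1, df2, output_type="pandas.DataFrame")`; its exact output columns are not documented here.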
Todo
Support for the col1, col2, on_cols, and suffixes parameters.
Source code in polars_bio/range_op.py
overlap(df1, df2, how='inner', overlap_filter=FilterOp.Strict, suffixes=('_1', '_2'), on_cols=None, col1=None, col2=None, output_type='polars.LazyFrame')
Find pairs of overlapping genomic intervals. Bioframe-inspired API.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df1 | Union[str, polars.DataFrame, polars.LazyFrame, pandas.DataFrame] | Can be a path to a file, a polars DataFrame or LazyFrame, or a pandas DataFrame. CSV with a header and Parquet are supported. | required |
df2 | Union[str, polars.DataFrame, polars.LazyFrame, pandas.DataFrame] | Can be a path to a file, a polars DataFrame or LazyFrame, or a pandas DataFrame. CSV with a header and Parquet are supported. | required |
how | str | str, optional. How to handle the overlaps between the two dataframes. 'inner': use the intersection of the sets of intervals from df1 and df2. | 'inner' |
overlap_filter | FilterOp | FilterOp, optional. The type of overlap to consider (Weak or Strict). Default is FilterOp.Strict. | Strict |
col1 | Union[list[str], None] | The names of the columns containing the chromosome, start, and end of the genomic intervals in df1. The default values are 'chrom', 'start', 'end'. | None |
col2 | Union[list[str], None] | The names of the columns containing the chromosome, start, and end of the genomic intervals in df2. The default values are 'chrom', 'start', 'end'. | None |
suffixes | tuple[str, str] | Suffixes appended to the column names of the two overlapped sets in the output. | ('_1', '_2') |
on_cols | Union[list[str], None] | List of additional column names to join on. Default is None. | None |
output_type | str | Type of the output. Default is "polars.LazyFrame"; "polars.DataFrame" and "pandas.DataFrame" are also supported. | 'polars.LazyFrame' |
Returns:
Type | Description |
---|---|
Union[polars.LazyFrame, polars.DataFrame, pandas.DataFrame] | polars.LazyFrame, polars.DataFrame, or pandas.DataFrame of the overlapping intervals. |
Note
The default output format, i.e. LazyFrame, is recommended for large datasets as it supports output streaming and lazy evaluation. This enables efficient processing of large datasets without loading the entire output dataset into memory.
Example
import pandas as pd
import polars_bio as pb

# First set of intervals
df1 = pd.DataFrame(
    [
        ['chr1', 1, 5],
        ['chr1', 3, 8],
        ['chr1', 8, 10],
        ['chr1', 12, 14],
    ],
    columns=['chrom', 'start', 'end'],
)
# Second set of intervals
df2 = pd.DataFrame(
    [
        ['chr1', 4, 8],
        ['chr1', 10, 11],
    ],
    columns=['chrom', 'start', 'end'],
)
# Request a pandas DataFrame instead of the default polars.LazyFrame
overlapping_intervals = pb.overlap(df1, df2, output_type="pandas.DataFrame")
overlapping_intervals

  chrom_1  start_1  end_1 chrom_2  start_2  end_2
0    chr1        1      5    chr1        4      8
1    chr1        3      8    chr1        4      8
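
The two rows above are consistent with a strict overlap test, in which intervals that merely touch at an endpoint (such as 8-10 and 4-8) are not reported. As a rough illustration of how the two filter modes might differ, here is a plain-Python sketch that does not use polars_bio. `FilterOpSketch`, `overlaps`, and `overlap_sketch` are hypothetical names, and the reading of Weak as "touching endpoints count as overlapping" is an assumption rather than something this page states.

```python
from enum import Enum
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end)


class FilterOpSketch(Enum):
    """Hypothetical stand-in for the library's FilterOp, not the real class."""
    Weak = "weak"
    Strict = "strict"


def overlaps(a: Interval, b: Interval, mode: FilterOpSketch) -> bool:
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return False
    if mode is FilterOpSketch.Strict:
        # Strict: intervals must share more than a boundary point.
        return start_a < end_b and start_b < end_a
    # Weak (assumed): intervals that touch at an endpoint also match.
    return start_a <= end_b and start_b <= end_a


def overlap_sketch(
    left: List[Interval],
    right: List[Interval],
    mode: FilterOpSketch = FilterOpSketch.Strict,
) -> List[Tuple[Interval, Interval]]:
    """Naive nested-loop inner join on the overlap predicate."""
    return [(a, b) for a in left for b in right if overlaps(a, b, mode)]


set1 = [("chr1", 1, 5), ("chr1", 3, 8), ("chr1", 8, 10), ("chr1", 12, 14)]
set2 = [("chr1", 4, 8), ("chr1", 10, 11)]

strict_pairs = overlap_sketch(set1, set2, FilterOpSketch.Strict)
weak_pairs = overlap_sketch(set1, set2, FilterOpSketch.Weak)
print(len(strict_pairs), "strict pairs;", len(weak_pairs), "weak pairs")
```

Under these assumptions the strict mode yields the same two pairs as the example output, while the weak mode additionally pairs 8-10 with both 4-8 and 10-11. The nested loop is quadratic and is meant only to show the predicate, not how the library computes the join.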
Todo
Support for the col1, col2, on_cols, and suffixes parameters.