pyspark.pandas.DataFrame.where

DataFrame.where(cond, other=nan, axis=None)

Replace values where the condition is False.

Parameters

cond : boolean DataFrame
    Where cond is True, keep the original value. Where False, replace with the corresponding value from other.
other : scalar or DataFrame
    Entries where cond is False are replaced with the corresponding value from other.
axis : int, default None
    Can only be set to 0 for now, for compatibility with pandas.
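
The rule described by cond and other can be sketched in plain Python, without Spark. This is only an illustration of the element-wise semantics for a single column, not how pandas-on-Spark implements it. The helper name `where_column` is hypothetical, lists stand in for columns, and `float("nan")` stands in for the default `other=nan`.

```python
import math


def where_column(values, cond, other=float("nan")):
    """Keep values[i] where cond[i] is True, otherwise use the replacement.

    `other` is either a scalar (used for every replaced entry) or a list
    with the same length as `values` (the corresponding entry is used).
    """
    result = []
    for i, (value, keep) in enumerate(zip(values, cond)):
        if keep:
            result.append(value)
        elif isinstance(other, list):
            result.append(other[i])
        else:
            result.append(other)
    return result


a = [0, 1, 2, 3, 4]
mask = [v > 1 for v in a]

scalar_fill = where_column(a, mask, 10)                    # scalar other
frame_fill = where_column(a, mask, [v + 100 for v in a])   # column-like other
default_fill = where_column(a, [v > 0 for v in a])         # default other=nan

print(scalar_fill)   # [10, 10, 2, 3, 4]
print(frame_fill)    # [100, 101, 2, 3, 4]
print(default_fill)  # [nan, 1, 2, 3, 4]
```

Unlike this sketch, a real DataFrame column that receives NaN is shown as floating point, which is why the examples with the default other display values such as 1.0 and 100.0.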

Returns

DataFrame

Examples

>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> df1 = ps.DataFrame({'A': [0, 1, 2, 3, 4], 'B':[100, 200, 300, 400, 500]})
>>> df2 = ps.DataFrame({'A': [0, -1, -2, -3, -4], 'B':[-100, -200, -300, -400, -500]})
>>> df1
   A    B
0  0  100
1  1  200
2  2  300
3  3  400
4  4  500
>>> df2
   A    B
0  0 -100
1 -1 -200
2 -2 -300
3 -3 -400
4 -4 -500

>>> df1.where(df1 > 0).sort_index()
     A      B
0  NaN  100.0
1  1.0  200.0
2  2.0  300.0
3  3.0  400.0
4  4.0  500.0

>>> df1.where(df1 > 1, 10).sort_index()
    A    B
0  10  100
1  10  200
2   2  300
3   3  400
4   4  500

>>> df1.where(df1 > 1, df1 + 100).sort_index()
     A    B
0  100  100
1  101  200
2    2  300
3    3  400
4    4  500

>>> df1.where(df1 > 1, df2).sort_index()
   A    B
0  0  100
1 -1  200
2  2  300
3  3  400
4  4  500

When the column names of cond differ from those of self, all values are treated as False.

>>> cond = ps.DataFrame({'C': [0, -1, -2, -3, -4], 'D':[4, 3, 2, 1, 0]}) % 3 == 0
>>> cond
       C      D
0   True  False
1  False   True
2  False  False
3   True  False
4  False   True

>>> df1.where(cond).sort_index()
    A   B
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
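
The column-name behaviour above can be sketched in plain Python. This is an illustration under simplifying assumptions, not the library's implementation: dicts of lists stand in for DataFrames, the helper name `where_by_column` is hypothetical, and a column missing from cond is modelled as an all-False mask.

```python
import math


def where_by_column(frame, cond, other=float("nan")):
    """Mask each column of `frame` with the same-named column of `cond`.

    A column of `frame` that has no counterpart in `cond` gets an
    all-False mask, so every one of its values is replaced.
    """
    result = {}
    for name, values in frame.items():
        mask = cond.get(name, [False] * len(values))
        result[name] = [v if keep else other for v, keep in zip(values, mask)]
    return result


frame = {"A": [0, 1, 2, 3, 4], "B": [100, 200, 300, 400, 500]}

# Same masks as the example above, under the unrelated names C and D.
cond = {
    "C": [v % 3 == 0 for v in [0, -1, -2, -3, -4]],
    "D": [v % 3 == 0 for v in [4, 3, 2, 1, 0]],
}
mismatched = where_by_column(frame, cond)

# Renaming the masks to A and B makes them apply.
matched = where_by_column(frame, {"A": cond["C"], "B": cond["D"]}, other=-1)

print(mismatched["A"])  # [nan, nan, nan, nan, nan]
print(matched["A"])     # [0, -1, -1, 3, -1]
print(matched["B"])     # [-1, 200, -1, -1, 500]
```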

When cond is a Series, its boolean values are applied to every column, regardless of column name.

>>> cond = ps.Series([1, 2]) > 1
>>> cond
0    False
1     True
dtype: bool

>>> df1.where(cond).sort_index()
     A      B
0  NaN    NaN
1  1.0  200.0
2  NaN    NaN
3  NaN    NaN
4  NaN    NaN

>>> reset_option("compute.ops_on_diff_frames")
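
The Series case can be sketched the same way in plain Python. This is a rough model rather than the library's implementation: it assumes the default 0-based integer index, uses a dict from index label to bool in place of the Series, and treats labels missing from that dict as False, which is consistent with the output shown above. The helper name `where_by_row` is hypothetical.

```python
import math


def where_by_row(frame, row_mask, other=float("nan")):
    """Apply one row-wise mask to every column of `frame`.

    `row_mask` maps an index label to a bool; labels that are absent
    count as False. Rows are assumed to be labelled 0, 1, 2, ...
    """
    result = {}
    for name, values in frame.items():
        result[name] = [
            v if row_mask.get(label, False) else other
            for label, v in enumerate(values)
        ]
    return result


frame = {"A": [0, 1, 2, 3, 4], "B": [100, 200, 300, 400, 500]}

# Equivalent of ps.Series([1, 2]) > 1: only labels 0 and 1 are present.
row_mask = {label: v > 1 for label, v in enumerate([1, 2])}

by_row = where_by_row(frame, row_mask)
print(by_row["A"])  # [nan, 1, nan, nan, nan]
print(by_row["B"])  # [nan, 200, nan, nan, nan]
```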