pyspark.sql.functions.schema_of_csv

pyspark.sql.functions.schema_of_csv(csv, options=None)

CSV Function: Parses a CSV string and infers its schema in DDL format.

New in version 3.0.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
csv : Column or str

A CSV string or a foldable string column containing a CSV string.

options : dict, optional

Options to control parsing. Accepts the same options as the CSV datasource; see Data Source Option in the documentation for your Spark version.

Returns
Column

A string representation of a StructType parsed from the given CSV.

Examples

Example 1: Inferring the schema of a CSV string with different data types

>>> from pyspark.sql import functions as sf
>>> df = spark.range(1)
>>> df.select(sf.schema_of_csv(sf.lit('1|a|true'), {'sep':'|'})).show(truncate=False)
+-------------------------------------------+
|schema_of_csv(1|a|true)                    |
+-------------------------------------------+
|STRUCT<_c0: INT, _c1: STRING, _c2: BOOLEAN>|
+-------------------------------------------+

Example 2: Inferring the schema of a CSV string with missing values

>>> from pyspark.sql import functions as sf
>>> df = spark.range(1)
>>> df.select(sf.schema_of_csv(sf.lit('1||true'), {'sep':'|'})).show(truncate=False)
+-------------------------------------------+
|schema_of_csv(1||true)                     |
+-------------------------------------------+
|STRUCT<_c0: INT, _c1: STRING, _c2: BOOLEAN>|
+-------------------------------------------+

Example 3: Inferring the schema of a CSV string with a different delimiter

>>> from pyspark.sql import functions as sf
>>> df = spark.range(1)
>>> df.select(sf.schema_of_csv(sf.lit('1;a;true'), {'sep':';'})).show(truncate=False)
+-------------------------------------------+
|schema_of_csv(1;a;true)                    |
+-------------------------------------------+
|STRUCT<_c0: INT, _c1: STRING, _c2: BOOLEAN>|
+-------------------------------------------+

Example 4: Inferring the schema of a CSV string with quoted fields

>>> from pyspark.sql import functions as sf
>>> df = spark.range(1)
>>> df.select(sf.schema_of_csv(sf.lit('"1","a","true"'), {'sep':','})).show(truncate=False)
+-------------------------------------------+
|schema_of_csv("1","a","true")              |
+-------------------------------------------+
|STRUCT<_c0: INT, _c1: STRING, _c2: BOOLEAN>|
+-------------------------------------------+