
Snowpark Connect for Apache Spark

Optimize workflows with Snowpark Connect for Apache Spark. Run Spark code directly in Snowflake to lower costs and simplify your architecture
Author: Kristy Broekmans, Data Engineer

In the realm of modern data architecture, Apache Spark has established itself as the leading engine for large-scale data processing and analytics, offering unparalleled speed and scalability. Recognizing Spark's importance, Snowflake has introduced Snowpark Connect for Apache Spark. This new feature enables organizations to seamlessly integrate their existing Spark workflows with the performance, elasticity, and governance capabilities of the Snowflake Data Cloud, offering a unified and powerful environment for all their data initiatives.

In this blog post we share a small how-to for loading and transforming some data with both approaches, and then compare them based on their use cases.

Approach 1: Traditional Snowflake Connector for Spark

Apache Spark is a well-known ecosystem that can connect to an endless number of sources and transform huge amounts of data. In this how-to we set up a connection to Snowflake using the JDBC driver, which allows us to pull our data into Spark and transform it there.

How it works

  • Data Movement: This connector facilitates bi-directional data transfer. Spark can pull data from Snowflake into its DataFrames, and equally, Spark DataFrames can be pushed into Snowflake tables.
  • Optimization: It's optimized for transferring large volumes of data and supports query pushdown. This means that whenever possible, Spark offloads query execution directly to Snowflake, allowing Snowflake's powerful engine to perform filtering, aggregations, and other operations, thereby improving performance and reducing data transfer (a short sketch of this follows the list).
  • Infrastructure: Traditionally, using this connector requires you to manage a separate Spark compute cluster (e.g., on AWS EMR, Databricks, or a self-managed cluster) alongside your Snowflake virtual warehouse.
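As a side note on the optimization point: besides reading an entire table with dbtable, the connector also accepts a query option that hands a complete SQL statement to Snowflake, so the heavy lifting happens inside Snowflake and only the result comes back to Spark. A minimal sketch, assuming the same Spark session and sf_options dictionary as in the full example below; the table and column names are the ones from that example:

# Let Snowflake run the aggregation and return only the (small) result to Spark
agg_df = spark.read \
    .format("net.snowflake.spark.snowflake") \
    .options(**sf_options) \
    .option("query", "SELECT department, COUNT(*) AS headcount FROM employees GROUP BY department") \
    .load()

agg_df.show()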

Example: Reading Data with Spark-Submit

Here's a simple example script (spark_to_snowflake.py) that demonstrates how to read data from Snowflake using the Snowflake Connector for Spark, typically executed via spark-submit:

To run the code, please make sure you have a local Spark cluster running and have installed PySpark locally (here is a good how-to: https://machinelearningplus.com/pyspark/install-pyspark-on-mac/). You can then launch the script with spark-submit spark_to_snowflake.py.



from pyspark.sql import SparkSession

# Initiate Spark session
spark = SparkSession.builder \
   .appName("SnowflakeRead") \
   .config("spark.driver.bindAddress", "127.0.0.1") \
   .config("spark.driver.host", "127.0.0.1") \
   .config("spark.jars.packages", "net.snowflake:snowflake-jdbc:3.13.30,net.snowflake:spark-snowflake_2.12:3.1.1") \
   .getOrCreate()

# Snowflake options
sf_options = {
    "sfURL":           "<account>.snowflakecomputing.com",
    "sfUser":          "username",
    "sfPassword":      "#########",
    "sfDatabase":      "your_database",
    "sfSchema":        "your_schema",
    "sfWarehouse":     "you_warehouse",
}

# Read and show
df = spark.read \
    .format("net.snowflake.spark.snowflake") \
    .options(**sf_options) \
    .option("dbtable", "employees") \
    .load()

# show the employees table entirely
df.show()

filtered_df = df.select("name", "department").filter(df["department"] == "Sales")

# only show the name and department of all salespeople
filtered_df.show()

# Write this dataframe to a table
filtered_df.write \
    .format("snowflake") \
    .options(**sf_options) \
    .option("dbtable", "sales_employees") \
    .mode('overwrite') \
    .save()

Explanation: This script sets up a Spark session and configures it to use the Snowflake JDBC driver and Spark-Snowflake connector by specifying them in spark.jars.packages. It then defines Snowflake connection properties such as the username and password. Finally, it reads the "employees" table from Snowflake into a Spark DataFrame, performs a simple filtering operation, and writes the filtered data back to a Snowflake table called "sales_employees".

The Result: the first table shows all the employees, and the second shows only the name and department of the Sales employees.


Snowflake - Apache Spark connector
Facts & figures

About client

Approach 2: Using Snowpark Connect for Apache Spark

Snowpark Connect for Apache Spark represents a significant evolution in the integration between Spark and Snowflake. It allows you to run your existing Spark code (including Spark SQL, DataFrame operations) directly within Snowflake's compute engine, rather than requiring a separate, external Spark cluster.

The big question: Does it actually run Spark? No, it does not.

What it does is translate your Spark code into Snowflake SQL, so it can take advantage of all the optimisation features that Snowflake has to offer.
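To make that concrete, here is a rough illustration of what the translation amounts to (the SQL shown is illustrative, not the literal statement Snowpark Connect generates):

# A familiar PySpark expression...
filtered_df = df.select("name", "department").filter(df["department"] == "Sales")

# ...is not executed on a Spark cluster. Conceptually, Snowpark Connect turns the
# logical plan into Snowflake SQL along the lines of:
#
#   SELECT name, department
#   FROM employees
#   WHERE department = 'Sales';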

Benefits of Snowpark Connect

  • Eliminates Separate Clusters: One of the biggest advantages is removing the need to provision and maintain dedicated Spark clusters, significantly reducing operational overhead and costs.
  • Native Execution: Spark workloads are executed natively on Snowflake's powerful engine. This means you leverage Snowflake's performance, scalability, security, and governance features for your Spark tasks.
  • Reduced Data Movement: Because you process data directly within Snowflake, no data needs to be transferred between a Spark cluster and Snowflake. This leads to lower costs, reduced latency, and simplified data governance.
  • Familiarity: Developers can continue to use their familiar Spark APIs (PySpark DataFrames, Spark SQL) without needing to rewrite their code or learn new constructs.
  • Architecture: Snowflake provides a Spark Connect-compatible server that translates Spark logical plans into Snowpark or SQL operations for execution within Snowflake.

Example: Reading Data with Snowpark Connect

Here's an example script (spark_on_snowpark.py) showcasing how to leverage Snowpark Connect for the same use case as in the Spark-to-Snowflake example above:

Make sure you have installed pyspark and snowpark-connect; both can be installed using pip (pip install pyspark snowpark-connect).

from snowflake import snowpark_connect
from snowflake.snowpark_connect.snowflake_session import SnowflakeSession

# Initialise a Spark session that executes on Snowflake via Snowpark Connect
spark = snowpark_connect.server.init_spark_session()
sf_session = SnowflakeSession(spark)


# Read and show
df = spark.read \
   .format("net.snowflake.spark.snowflake") \
   .option("dbtable", "employees") \
   .load()

df.show()

filtered_df = df.select("name", "department").filter(df["department"] == "Sales")

filtered_df.show()

filtered_df.write \
   .format("snowflake") \
   .option("dbtable", "sales_employees") \
   .mode('overwrite') \
   .save()

print(f"Wrote {filtered_df.count()} lines to snowfalke")

# stop Snowpark server if available
if hasattr(snowpark_connect.server, "stop"):
   snowpark_connect.server.stop()
elif hasattr(snowpark_connect.server, "shutdown"):
   snowpark_connect.server.shutdown()
spark.stop()

Explanation: In this script, snowflake.snowpark_connect.server.init_spark_session() initializes a Spark session that is implicitly connected to your Snowflake environment. The SnowflakeSession(spark) object further strengthens this integration.

Notice that explicit connection options (like URL, user, warehouse, or private key) are no longer needed in an sf_options dictionary, as these are managed by the Snowpark Connect configuration (which we will create in a bit). The rest of the Spark DataFrame operations remain largely the same, demonstrating the power of using familiar Spark APIs with Snowflake's native capabilities.
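Since Snowpark Connect supports the regular DataFrame and Spark SQL APIs, the same read can also be expressed through spark.sql or spark.read.table. A minimal sketch, assuming the session from the script above and that the database and schema configured for the connection contain the employees table:

# Two equivalent ways to reach the same data through the Snowpark Connect session
df_sql = spark.sql("SELECT name, department FROM employees WHERE department = 'Sales'")
df_tbl = spark.read.table("employees")

df_sql.show()
df_tbl.show()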

To get this to work, we still need to specify a connection in our connections file ~/.snowflake/connections.toml, using the connection name shown below:

[snowpark-submit]
host = "<your_account>.snowflakecomputing.com"
account = "<your_account>"
user = "<your_username (email)>"
password = "<your_password>"
warehouse = "<your_warehouse>"
database = "<your_database>"
schema = "<your_schema>"

The Result: the output should be exactly the same as when running it with traditional Spark.

Using Snowpark Connect for Apache Spark

Comparison and when to use which

Comparison table: Snowflake Connector for Spark vs. Snowpark Connect for Apache Spark

Choose the Traditional Connector when:

  • You have existing, heavily customized Spark environments that are difficult to migrate.
  • You need very specific Spark library versions or configurations not yet fully supported by Snowpark Connect.
  • You are performing transformations that require full Spark cluster capabilities that cannot be pushed down to Snowflake.

Choose Snowpark Connect when:

  • You want to simplify your architecture and reduce operational costs by consolidating compute.
  • You prioritize native Snowflake performance, scalability, and security for your Spark workloads.
  • You want to minimize data movement and latency.
  • You are starting new Spark projects with Snowflake or modernizing existing ones.

What can’t it do?

Although we have seen how easy it is to update your code from Spark to Snowpark Connect, you still need to watch out for some operations that will not work "yet":

  1. Low-Level RDD APIs: Snowpark Connect supports the DataFrame API and Spark SQL, but it does not support the RDD (Resilient Distributed Dataset) API (see the sketch after this list).
  2. Spark Structured Streaming: It does not currently support Spark Structured Streaming (e.g., readStream or writeStream). If you need real-time streaming with complex stateful processing, you would still use a Spark cluster (or switch to Snowflake’s native streaming features like Dynamic Tables and Snowpipe).
  3. Comprehensive Machine Learning (MLlib): While basic transformers may work, many parts of Spark MLlib (like certain Estimators and distributed training algorithms) are not supported.
  4. GraphX and Graph Processing: If your workload involves graph analytics using GraphX, you cannot use Snowpark Connect. Snowflake does not have a native equivalent for Spark's graph processing library.
  5. Specialized Data Formats (Delta Lake / Hudi): While Snowflake is expanding its support for Open Table Formats (like Iceberg), if your pipeline relies on native Delta Lake or Hudi features, you will still need a Spark cluster to work with them.
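As an example of the first limitation: code that drops down to the RDD API has to be rewritten against the DataFrame API before it can run through Snowpark Connect. A minimal sketch, reusing the employees DataFrame from the examples above:

# RDD-style code like this is NOT supported by Snowpark Connect:
#
#   sales_names = (
#       df.rdd
#       .filter(lambda row: row["department"] == "Sales")
#       .map(lambda row: row["name"])
#       .collect()
#   )

# The equivalent DataFrame expression works, because it stays within the
# DataFrame API that Snowpark Connect can translate:
sales_names = [
    row["name"]
    for row in df.filter(df["department"] == "Sales").select("name").collect()
]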

Conclusion

Both the Snowflake Connector for Apache Spark and Snowpark Connect provide valuable ways to integrate Spark with Snowflake. While the traditional connector offers robust data exchange, Snowpark Connect pushes the boundaries by allowing Spark workloads to run directly within Snowflake, unlocking significant benefits in terms of operational simplicity, performance, and cost efficiency. As Snowflake continues to innovate, Snowpark Connect emerges as the preferred choice for a truly integrated and optimized Spark-on-Snowflake experience.
