12. 08. 2021

Deep-Dive into the New Features and Capabilities of Databricks

In recent years, Databricks has made its mark as a unified analytics framework that makes it easier for users to collaborate and share code and resources. By breaking down silos between Data Engineers and Data Analysts, Databricks enables large-scale data processing, analytics, data science, and machine learning.

History and Core Functionalities of Databricks

Databricks is the brainchild of the creators of Apache Spark, a unified framework that provides capabilities for distributed data processing. The idea behind Spark was to create a platform-agnostic framework for developing software and applications for distributed computing. When Spark gained popularity across various systems, technologies, and platforms, its creators established Databricks as a company delivering closed-source optimizations of Apache Spark in terms of performance and extended capabilities.

Databricks, which is essentially a wrapper on top of Spark, is a closed-source unified analytics framework available only in the cloud. It comes with a collaborative user environment in which users can share resources and work together. Databricks notebooks can be used for both data analytics and data integration, which makes information management and sharing between Data Engineers and Data Analysts much easier.

Like Apache Spark, Databricks supports several programming languages, including Java, Scala, Python, and R. Of these, R and Python are predominantly used for analytics, while Scala and Java are typical application development languages.

Since Databricks is a wrapper on Spark, the core Apache Spark APIs and libraries are all available in Databricks. Spark revolves around the concept of the Resilient Distributed Dataset (RDD), and all libraries and extensions available within Spark are based on it. Spark SQL is a framework for processing structured data in Spark and lets users query that data using SQL. DataFrames and Datasets, which are abstractions of Spark SQL, are also available to Databricks users. Spark Streaming and its evolution, Structured Streaming, can be used for near-real-time data processing in micro-batches. Other commonly used libraries in Spark include MLlib for machine learning and GraphX for graphs and graph-parallel computation.
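
To make the micro-batch idea concrete, here is a minimal sketch in plain Python. It does not use Spark or any Databricks API: the function names `micro_batches` and `running_totals` are hypothetical, and batches are cut by record count, whereas a real streaming engine typically triggers batches on a time interval. It only illustrates how a continuous stream can be handled as a sequence of small batches, with results updated after each one.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive batches of at most `batch_size` events.

    Assumption: batches are cut by count to keep the example deterministic.
    """
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


def running_totals(events: Iterable[int], batch_size: int) -> List[int]:
    """Process one micro-batch at a time and record the total after each."""
    total = 0
    totals: List[int] = []
    for batch in micro_batches(events, batch_size):
        total += sum(batch)    # incremental update from the new batch only
        totals.append(total)   # result available shortly after each batch
    return totals


if __name__ == "__main__":
    print(running_totals(range(1, 8), batch_size=3))
```

The result is refreshed after every batch rather than once at the end of the stream, which is what makes the processing "near-real-time" instead of strictly real-time.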

As Databricks evolved, more technologies developed by the company have been added to the offering. These include:

  • Delta Lake: Delta Lake is an open-format storage layer that delivers reliability, security, and performance on your data lake for both streaming and batch operations. It is a metadata wrapper around data stored in the Apache Parquet format, which allows users to ensure data consistency, keep a comprehensive data history, and even roll back to older versions of data if needed. It is an open approach to introducing the basic tenets of data management and governance into data lakes.

  • MLflow: MLflow is a machine learning lifecycle platform that allows data analysts to keep track of their machine learning models and experiments. It is a leading framework for MLOps supporting the tracking, registry, and deployment of machine learning models.
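
The versioning and rollback behavior described for Delta Lake above can be illustrated with a toy model in plain Python. This is not the Delta Lake API or its on-disk format: the class `VersionedTable` and its methods are hypothetical, and it stores a full snapshot per version for simplicity, whereas a real system would record changes in a transaction log. It only shows the idea of keeping a data history and restoring an older version.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


class VersionedTable:
    """Toy table that keeps every committed version as a snapshot."""

    def __init__(self) -> None:
        # Version 0 is the empty table; each commit appends a new snapshot.
        self._versions: List[List[Row]] = [[]]

    @property
    def current_version(self) -> int:
        return len(self._versions) - 1

    def append(self, rows: List[Row]) -> int:
        """Commit new rows as a new version and return its number."""
        snapshot = self._versions[-1] + [dict(row) for row in rows]
        self._versions.append(snapshot)
        return self.current_version

    def read(self, version: Optional[int] = None) -> List[Row]:
        """Read the latest version, or an older one if `version` is given."""
        index = self.current_version if version is None else version
        return [dict(row) for row in self._versions[index]]

    def restore(self, version: int) -> int:
        """Roll back by committing an old snapshot as a new version.

        Earlier versions are kept, so the full history stays readable.
        """
        self._versions.append(list(self._versions[version]))
        return self.current_version


if __name__ == "__main__":
    table = VersionedTable()
    table.append([{"id": 1}])
    table.append([{"id": 2}])
    table.restore(1)
    print(table.read(), table.current_version)
```

Recording the rollback as a new version, rather than deleting later versions, keeps the history complete, so the state before the rollback can still be inspected.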

What’s New in Databricks

Databricks runs on all three major cloud platforms, and its focus is on being simple, open, and collaborative. Databricks is deployed in the customer's cloud account, and all data and compute draw on the customer's consumption investments in that cloud. Consequently, it is not a separate investment from the cloud platform.

In the case of Microsoft Synapse, while Synapse does include open-source Spark, Databricks has optimized Spark to be faster and more reliable within Databricks. Consequently, Databricks can be used to cleanse and validate data and write it back to Azure Data Lake Storage (ADLS), and then Synapse can be used as the serving layer for reporting or analytics.

Both Synapse and Databricks have their place in the architecture, and the two organizations work closely to ensure that their offerings work "better together" to provide the customer with a faster, integrated architecture.

Databricks on AWS and Databricks on Google Cloud also allow users to make the most of the combined platforms, leveraging the strengths of each for a more performant system. Databricks’ partnerships with the various cloud providers enable customers to accelerate Databricks implementations by simplifying data access and combining analytics and AI/ML capabilities to better drive business outcomes.

Why Adastra?

As a leader in data and analytics, we have expertise in implementing emerging technologies, such as Databricks, across various industries, including financial services, retail, and energy. Adastra, as an official Databricks partner, can offer best-in-class services and implementations backed by Databricks experts.

Adastra offers services to help our customers implement Databricks solutions. These include identifying the right data pipelines and data sources, understanding the kinds of models customers want to build and the underlying ML solution to be used, building out the models, and refining the solution. To get our customers to the solution faster, we can also leverage pre-built accelerators developed by Databricks.

We have dedicated Data Engineering and Data Analytics teams that work closely with our Cloud partners to ensure end-to-end implementation and project delivery. Adastra has partnerships with all three major cloud providers, and we offer a full stack of services in the data and analytics domain, ranging from Data Governance to AI and Managed Services.

Want to know more about how Databricks can help your organization? Schedule a free consultation with our experts.

