Increase A/B Testing Power by Combining Experiments

Say you’ve run an experiment that produced some surprising results, so you replicated it with a new experiment. Or say you’ve got a number of separate experiments across multiple channels, yielding different reports for the same hypothesis. In the past, this could have left you with under-powered experiment reports that lacked sound evidence. But there’s a more […]
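
The excerpt stops before naming the method, but one standard way to combine evidence from independent experiments is a fixed-effect meta-analysis: weight each experiment’s lift estimate by the inverse of its variance and test the pooled estimate. The sketch below uses made-up lift estimates and standard errors; it illustrates that general idea and is not necessarily the approach the post takes.

```python
# Hypothetical illustration: pooling lifts from independent A/B tests with
# inverse-variance weighting (fixed-effect meta-analysis). Numbers are made up.
import numpy as np
from scipy import stats

lifts = np.array([0.021, 0.034, 0.012])      # per-experiment estimated lift
std_errs = np.array([0.015, 0.020, 0.011])   # per-experiment standard error

weights = 1.0 / std_errs**2
pooled_lift = np.sum(weights * lifts) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))

z = pooled_lift / pooled_se
p_value = 2 * stats.norm.sf(abs(z))          # two-sided test on the pooled lift

print(f"pooled lift = {pooled_lift:.4f} ± {pooled_se:.4f}, z = {z:.2f}, p = {p_value:.4f}")
```

Because the pooled standard error shrinks as experiments are added, the combined test has more power than any single report on its own.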

Why and How eBay Pivoted to OpenTelemetry

Introduction Observability provides the eyes and ears of any organization. A major benefit of observability is preventing the loss of revenue by efficiently surfacing ongoing issues in critical workflows that could potentially impact customer experience. The observability landscape is ever-changing, and recent developments in the OpenTelemetry world forced us to rethink our […]

Building Patient Cohorts with NLP and Knowledge Graphs

Check out the solution accelerator to download the notebooks referred to throughout this blog. Cohort building is an essential part of patient analytics. Defining which patients belong to a cohort, testing the sensitivity of various inclusion and exclusion criteria on sample size, building a control cohort with propensity score matching techniques: These are just some of the processes […]
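
As a rough illustration of the propensity score matching step mentioned above (a minimal sketch, not the accelerator notebooks themselves), the snippet below fits a logistic-regression propensity model with scikit-learn and pairs each treated patient with its nearest-neighbor control on the propensity score; the `treated` flag and covariate column names are hypothetical.

```python
# Minimal propensity score matching sketch on a pandas DataFrame.
# Assumes a binary "treated" column and numeric covariate columns (hypothetical).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def match_controls(df: pd.DataFrame, covariates: list) -> pd.DataFrame:
    """Return one nearest-neighbor control (by propensity score) per treated patient."""
    model = LogisticRegression(max_iter=1000).fit(df[covariates], df["treated"])
    df = df.assign(propensity=model.predict_proba(df[covariates])[:, 1])

    treated = df[df["treated"] == 1]
    controls = df[df["treated"] == 0]

    nn = NearestNeighbors(n_neighbors=1).fit(controls[["propensity"]])
    _, idx = nn.kneighbors(treated[["propensity"]])
    return controls.iloc[idx.ravel()]
```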

Databricks State Rebalancing Structured Streaming Enhancement Preview

In light of the accelerated growth and adoption of Apache Spark Structured Streaming, Databricks announced Project Lightspeed at Data + AI Summit 2022. Among the items outlined in the announcement was a goal of improving latency in Structured Streaming workloads. In this post we are excited to go deeper into just one of the ways […]

How to Profile PySpark

In Apache Spark™, declarative Python APIs are supported for big data workloads. They are powerful enough to handle most common use cases. Furthermore, PySpark UDFs offer more flexibility since they enable users to run arbitrary Python code on top of the Apache Spark™ engine. Users only have to state “what to do”; PySpark, as a […]
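
As a minimal sketch of one way to profile that arbitrary Python code, the snippet below enables PySpark’s built-in cProfile-based profiler via the `spark.python.profile` setting and dumps the accumulated statistics with `show_profiles()`; the `slow_square` UDF is a toy example, and the post may cover additional tooling beyond this.

```python
# Sketch: enabling PySpark's built-in Python/UDF profiler.
# Note: spark.python.profile must be set before the SparkContext is created.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import LongType

spark = (
    SparkSession.builder
    .appName("udf-profiling-sketch")
    .config("spark.python.profile", "true")
    .getOrCreate()
)

@udf(returnType=LongType())
def slow_square(x):
    # Toy UDF standing in for arbitrary user Python code.
    return x * x

df = spark.range(100_000).select(slow_square(col("id")).alias("squared"))
df.agg({"squared": "sum"}).show()        # action that forces the UDF to run

spark.sparkContext.show_profiles()       # print the accumulated profile output
```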

Admin Isolation on Shared Clusters

This blog was co-authored by David Meyer, SVP Product Management at Databricks, and Joosua Santasalo, a security researcher with Secureworks. At Databricks, we know the security of the data processed in our platform is essential to our customers. Our Security & Trust Center chronicles investments in internal policies and processes (like vulnerability management and […]

Improved Performance and Value With Databricks Photon and Azure Lasv3 Instances Using AMD 3rd Gen EPYC™ 7763v Processors

Databricks has partnered with AMD to support a new chip that lets you run your queries faster, saving you time and money. Combining the latest technologies from Azure Databricks and AMD, users can now take advantage of the new Lasv3-series VMs with the Databricks Runtimes to reduce the total cost of ownership (TCO) and achieve […]

Python Arbitrary Stateful Processing in Structured Streaming

More and more customers are using Databricks for their real-time analytics and machine learning workloads to meet the ever-increasing demands of their businesses and customers. This is why we started Project Lightspeed, which aims to improve Structured Streaming in Apache Spark™ around latency, functionality, ecosystem connectors, and ease of operations. With real-time stream processing, […]
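
Arbitrary stateful processing is exposed to Python through `applyInPandasWithState`, which lets a user function keep per-key state across micro-batches. Below is a minimal sketch that maintains a running event count per key; the `rate` source and the `user_id` grouping column are placeholders, so adapt the schemas and source to your own pipeline.

```python
# Sketch: per-key running counts with applyInPandasWithState.
# Source, column names, and schemas are illustrative placeholders.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

spark = SparkSession.builder.appName("stateful-sketch").getOrCreate()

events = (
    spark.readStream.format("rate").load()            # toy source: (timestamp, value)
    .selectExpr("value % 10 AS user_id", "value")
)

def count_events(key, pdf_iter, state: GroupState):
    # Running count of rows seen for this key, carried across micro-batches.
    count = state.get[0] if state.exists else 0
    for pdf in pdf_iter:
        count += len(pdf)
    state.update((count,))
    yield pd.DataFrame({"user_id": [key[0]], "count": [count]})

counts = events.groupBy("user_id").applyInPandasWithState(
    count_events,
    outputStructType="user_id LONG, count LONG",
    stateStructType="count LONG",
    outputMode="update",
    timeoutConf=GroupStateTimeout.NoTimeout,
)

query = counts.writeStream.outputMode("update").format("console").start()
```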

Build a Customer 360 Solution with Fivetran and Delta Live Tables

The Databricks Lakehouse Platform is an open architecture that combines the best elements of data lakes and data warehouses. In this blog post, we’ll show you how to build a Customer 360 solution on the lakehouse, delivering data and insights that would typically take months of effort on legacy platforms. We will use Fivetran to […]
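
As a hedged sketch of what the Delta Live Tables half of such a pipeline can look like, the snippet below declares bronze tables over Fivetran-landed data and joins them into a `customer_360` table; the source table names (`fivetran_db.salesforce_contacts`, `fivetran_db.stripe_charges`) and columns are hypothetical stand-ins for whatever Fivetran syncs in your workspace.

```python
# Hypothetical Delta Live Tables pipeline joining Fivetran-landed sources.
# `spark` is provided by the DLT runtime when this runs inside a pipeline.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Salesforce contacts landed by Fivetran")
def contacts_bronze():
    return spark.read.table("fivetran_db.salesforce_contacts")

@dlt.table(comment="Stripe charges landed by Fivetran")
def charges_bronze():
    return spark.read.table("fivetran_db.stripe_charges")

@dlt.table(comment="Customer 360: one row per contact with lifetime spend")
def customer_360():
    contacts = dlt.read("contacts_bronze")
    spend = (
        dlt.read("charges_bronze")
        .groupBy("customer_email")
        .agg(F.sum("amount").alias("lifetime_spend"))
    )
    return contacts.join(spend, contacts.email == spend.customer_email, "left")
```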

Introducing Ingestion Time Clustering with Databricks SQL and Databricks Runtime 11.2

Databricks customers are processing over an exabyte of data every day on the Databricks Lakehouse Platform using Delta Lake, a significant amount of it being time-series-based fact data. With such a large amount of data comes the need for customers to optimize their tables for read and write performance, which is commonly done by […]
