Admin Isolation on Shared Clusters

This blog was co-authored by David Meyer, SVP Product Management at Databricks, and Joosua Santasalo, a security researcher with Secureworks. At Databricks, we know the security of the data processed in our platform is essential to our customers. Our Security & Trust Center chronicles investments in internal policies and processes (like vulnerability management and […]

Improved Performance and Value With Databricks Photon and Azure Lasv3 Instances Using AMD 3rd Gen EPYC™ 7763v Processors

Databricks has partnered with AMD to support a new chip that lets you run your queries faster, saving you time and money. Combining the latest technologies from Azure Databricks and AMD, users can now take advantage of the new Lasv3-series VMs with the Databricks Runtimes to reduce the total cost of ownership (TCO) and achieve […]

Python Arbitrary Stateful Processing in Structured Streaming

More and more customers are using Databricks for their real-time analytics and machine learning workloads to meet the ever-increasing demands of their businesses and customers. This is why we started Project Lightspeed, which aims to improve Structured Streaming in Apache Spark™ around latency, functionality, ecosystem connectors, and ease of operations. With real-time stream processing, […]

Build a Customer 360 Solution with Fivetran and Delta Live Tables

The Databricks Lakehouse Platform is an open architecture that combines the best elements of data lakes and data warehouses. In this blog post, we’ll show you how to build a Customer 360 solution on the lakehouse, delivering data and insights that would typically take months of effort on legacy platforms. We will use Fivetran to […]

Introducing Ingestion Time Clustering with Databricks SQL and Databricks Runtime 11.2

Databricks customers are processing over an exabyte of data every day on the Databricks Lakehouse platform using Delta Lake, a significant amount of it being time-series based fact data. With such a large amount of data comes the need for customers to optimize their tables for read and write performance, which is commonly done by […]

Memory Profiling in PySpark

There are many factors in a PySpark program’s performance. PySpark supports various profiling tools to expose tight loops of your program and allow you to make performance improvement decisions. However, memory, as one of the key factors of a program’s performance, had been missing in PySpark profiling. A PySpark program on the Spark […]

Build Reliable and Cost Effective Streaming Data Pipelines With Delta Live Tables’ Enhanced Autoscaling

This year we announced the general availability of Delta Live Tables (DLT), the first ETL framework to use a simple, declarative approach to building reliable data pipelines. Since the launch, Databricks has continued to expand DLT with new capabilities. Today we are excited to announce that Enhanced Autoscaling for Delta Live Tables (DLT) is now generally […]

How facial recognition technology keeps you safe

Facial recognition technology is one of the many modern technologies that previously only appeared in science fiction movies. The roots of this technology can be traced back to the 1960s and have since grown dramatically due to the rise of deep learning techniques and accelerated digital transformation in recent years. In this blog post, we […]

Graph Networks – 10X investigation with Graph Visualisations

Introduction: Detecting fraud schemes used to require investigations using large amounts and varying types of data that come from many different anti-fraud systems. Investigators then need to combine the different types of data and use statistical methods to uncover suspicious claims, which is time-consuming and inefficient in most cases. We are always looking for […]

How we automated FAQ responses at Grab

Overview and initial analysis: Knowledge management is often one of the biggest challenges companies face internally. Teams spend several working hours either searching inefficiently for information or repeatedly asking colleagues about information already documented somewhere. A lot of time is spent on the internal employee communication channels (in our case, Slack) simply […]
