Processing data simultaneously from multiple streaming platforms using Delta Live Tables

One of the major imperatives of organizations today is to enable decision making at the speed of business. Business teams and autonomous decisioning systems often require all the information they need to make decisions and respond quickly as soon as their source events happen – in real time or near real time. Such information, known […]

PyTorch on Databricks – Introducing the Spark PyTorch Distributor

Background and Motives: Deep Learning algorithms are complex and time-consuming to train, but are quickly moving from the lab to production because of the value these algorithms help realize. Whether using pre-trained models with fine-tuning, building a network from scratch or anything in between, the memory and computational load of training can quickly […]

Message Center – Redesigning the messaging experience on the Grab superapp

Since 2016, Grab has been using GrabChat, a built-in messaging feature to connect our users with delivery-partners or driver-partners. However, as the Grab superapp grew to include more features, the limitations of the old system became apparent. GrabChat could only handle two-party chats because that’s what it was designed to do. To make our messaging […]

Introducing Apache Spark™ 3.4 for Databricks Runtime 13.0

Today, we are happy to announce the availability of Apache Spark™ 3.4 on Databricks as part of Databricks Runtime 13.0. We extend our sincere appreciation to the Apache Spark community for their invaluable contributions to the Spark 3.4 release. To further unify Spark, bring Spark to applications anywhere, increase productivity, simplify usage, and add new […]

How Collective Health uses Delta Live Tables and Structured Streaming for Data Integration

Collective Health is not an insurance company. We’re a technology company that’s fundamentally making health insurance work better for everyone, starting with the 155M+ Americans covered by their employer. We’ve created a powerful, flexible infrastructure to simplify employer-led healthcare, and started a movement that prioritizes the human experience within health benefits. We’ve built smarter technology […]

Synthetic Data for Better Machine Learning

You’ve likely tried the buzziest advances in generative AI in the past year, tools like ChatGPT and DALL-E. They consume complex data and generate more data in ways that feel startlingly like something intelligent. These and other new ideas (diffusion models, generative adversarial networks or GANs) are entertaining, even frightening to play with. However, the […]

How eBay Modernized the Most Important Page on Our Platform

Background: eBay’s View Item page lives at the center of our e-commerce platform. Our customers load this page over 250 million times each day, and stringent budgets on site speed and availability guarantee the quality of their experience. And yet, this page had its last intentional rewrite ten years ago. A decade of rapid iteration […]

Pandas-Profiling Now Supports Apache Spark

Data profiling is the process of collecting statistics and summaries of data to assess its quality and other characteristics. It is an essential step in both data discovery and the data science lifecycle because it helps us ensure quality data flows from which we can derive trustworthy and actionable insights. Profiling involves analyzing data across […]

Saving Mothers with ML: How MLOps Improves Healthcare in High-Risk Obstetrics

In the United States, roughly 7 out of every 1,000 mothers suffer from both pregnancy and delivery complications each year. Of those mothers with pregnancy complications, 700 die, but 60% of those deaths are preventable with the right medical attention, according to the CDC. Even among the 3.7 million successful births, 19% have either low […]
