Unifying Your Data Ecosystem with Delta Lake Integration

As organizations are maturing their data infrastructure and accumulating more data than ever before in their data lakes, Open and Reliable table formats such as Delta Lake become a critical necessity. Thousands of companies are already using Delta Lake in production, and open-sourcing all of Delta Lake (as announced in June 2022) has further increased […]

Continue Reading

Securing Databricks cluster init scripts

This blog was co-authored by Elia Florio, Sr. Director of Detection & Response at Databricks and Florian Roth and Marius Bartholdy, security researchers with SEC-Consult.   Protecting the Databricks platform and continuously raising the bar with security improvements is the mission of our Security team and the main reason why we invest in our bug […]

Continue Reading

Safer deployment of streaming applications

The Flink framework has gained popularity as a real-time stateful stream processing solution for distributed stream and batch data processing. Flink also provides data distribution, communication, and fault tolerance for distributed computations over data streams. To fully leverage Flink’s features, Coban, Grab’s real-time data platform team, has adopted Flink as part of our service offerings. In […]

Continue Reading

eBay’s Blazingly Fast Billion-Scale Vector Similarity Engine

Introduction Often, ecommerce marketplaces provide buyers with listings similar to those previously visited by the buyer, as well as a personalized shopping experience based on profiles, past shopping histories and behavior signals such as clicks, views and additions to cart. These are vital to the shopping experience, and so it’s equally vital that we continuously […]

Continue Reading

Databricks ❤️ Hugging Face – The Databricks Blog

Generative AI has been taking the world by storm. As the data and AI company, we have been on this journey with the release of the open source large language model Dolly, as well as the internally crowdsourced dataset licensed for research and commercial use that we used to fine-tune it, the databricks-dolly-15k. Both the […]

Continue Reading

Processing data simultaneously from multiple streaming platforms using Delta Live Tables

One of the major imperatives of organizations today is to enable decision making at the speed of business. Business teams and autonomous decisioning systems often require all the information they need to make decisions and respond quickly as soon as their source events happen – in real time or near real time. Such information, known […]

Continue Reading

PyTorch on Databricks – Introducing the Spark PyTorch Distributor

Background and Motives Deep Learning algorithms are complex and time consuming to train, but are quickly moving from the lab to production because of the value these algorithms help realize. Whether using pre-trained models with fine tuning, building a network from scratch or anything in between, the memory and computational load of training can quickly […]

Continue Reading

Message Center – Redesigning the messaging experience on the Grab superapp

Since 2016, Grab has been using GrabChat, a built-in messaging feature to connect our users with delivery-partners or driver-partners. However, as the Grab superapp grew to include more features, the limitations of the old system became apparent. GrabChat could only handle two-party chats because that’s what it was designed to do. To make our messaging […]

Continue Reading

Introducing Apache Spark™ 3.4 for Databricks Runtime 13.0

Today, we are happy to announce the availability of Apache Spark™ 3.4 on Databricks as part of Databricks Runtime 13.0. We extend our sincere appreciation to the Apache Spark community for their invaluable contributions to the Spark 3.4 release. To further unify Spark, bring Spark to applications anywhere, increase productivity, simplify usage, and add new […]

Continue Reading