By Matt Chudoba
In part 1 of this blog series, I discussed the design of the pipelines we built to process late arriving data. In this post, I will detail the steps we added to the pipelines to automatically process late arriving data.
Our architecture consists of two parts: a Receiver Pipeline and a Processing Pipeline. This allows us to ingest and process late data separately.
We need to ensure that all available data is ingested and partitioned correctly, regardless of how late it is. This offloads the bulk of the reprocessing complexity to the backend pipeline.
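As a rough illustration of partitioning by when an event happened rather than when it arrived, here is a minimal Python sketch. The record layout, the `partition_key` and `ingest` helpers, and the hourly `YYYY-MM-DD/HH` key format are all assumptions made for this example and are not taken from the actual IAS pipelines.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(event_time: datetime) -> str:
    """Build an hourly partition key from the event's own timestamp (UTC).

    Hypothetical key format; the real pipeline's layout is not described here.
    """
    return event_time.astimezone(timezone.utc).strftime("%Y-%m-%d/%H")


def ingest(records):
    """Group records by event-time partition, ignoring when they arrived."""
    partitions = defaultdict(list)
    for record in records:
        partitions[partition_key(record["event_time"])].append(record)
    return dict(partitions)


# Hypothetical sample records: ids 1 and 2 share an event hour, so they land
# in the same partition no matter how late either one is delivered.
records = [
    {"id": 1, "event_time": datetime(2021, 3, 1, 10, 15, tzinfo=timezone.utc)},
    {"id": 2, "event_time": datetime(2021, 3, 1, 10, 45, tzinfo=timezone.utc)},
    {"id": 3, "event_time": datetime(2021, 3, 2, 9, 5, tzinfo=timezone.utc)},
]
partitions = ingest(records)
```

Because the key depends only on the event timestamp, a downstream job can decide which partitions to reprocess by checking which keys received new records.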
For partners that…
By Matt Chudoba
At Integral Ad Science, we process high volumes of partner data through our platform every day so that we can quickly provide meaningful insights to customers. Our partners control when the data is made available, and sometimes there are delays due to API outages or planned downtime.
This two-part series will share an overview of our data intake and processing pipelines, focusing on how we built automatic monitoring and tools to handle late arriving data most effectively.
This post addresses how we designed our architecture to accommodate late arriving data, while the next outlines specific steps in…
By Karishma Agarwal
Integral Ad Science (IAS) processes trillions of data requests globally each month, with several mapping tables in MySQL to manage these events in real time. We typically use a distributed cache such as Redis or Aerospike to keep this data in memory and perform real-time lookups with very high throughput. However, we recently worked on a project to evaluate two new approaches:
We wanted to add a cache to the existing pipeline for storing data from the mapping tables in MySQL and use it…
In part 1, we discussed how to uncover the project’s underlying problems. Now, we’ll share six specific strategies to give projects more momentum. These strategies can be led by people in product, user experience, development, marketing, sales, or other influential roles. You can also proactively use these six strategies to keep projects from stalling out in the first place:
Whether you manage, design, or build projects, you will find yourself in situations where a project gets put on hold — officially or unofficially — for one reason or another. Sometimes it’s obvious why it’s on hold, and sometimes there’s a misunderstanding that needs to be resolved. Whether it’s a prioritized project that must move forward, or a pet project that you want to advance, this two-step approach will help you and your team get it going again.
● In this post, we’ll discuss how to uncover a project’s underlying challenges and early steps to address…
by Feng Fan
Recently I worked on a project to evaluate different left-join options for a Spark application we are building to modernize our largest data pipeline. The pipeline processes about 2B events per hour, creating a data set of about 0.5B records. There was a long-running left-join operation that took 20 minutes to finish using Pig over MapReduce in the old pipeline. My task was to benchmark this left-join operation with different Spark join options. This article shares the learnings I gathered during that project.
On the left side of the join we had a big dataset of…
by Akshay Tambe
At Integral Ad Science (IAS), we measure over 100 billion data events daily, giving our customers unmatched scale, coverage, and accuracy. We process this data with hundreds of big data processing and data science pipelines. As we’ve continued to scale globally, IAS migrated to a cloud-based infrastructure hosted on Amazon Web Services (AWS), resulting in cost savings and increased performance. One great strategy to control and reduce AWS costs is to leverage spot instances.
Spot Instances are spare EC2 instances in the AWS Cloud which are offered at up to 90% cost savings compared to on-demand instances…
Is banana bread a bread? An unexpected UXR journey in cookbook design.
by Joey Stempel
After organizers announced they were putting together a company cookbook, I was quick to volunteer. As a UX Designer who loves to cook and consumes a significant amount of food media, this was too perfect of an opportunity to pass up. Only I didn’t realize just how helpful UX Research would be; by the end, I had conducted competitive research, run a card sorting exercise, and applied user-centered design principles.
In early discussions with the team, the first question to arise was: how should the…
by Yuva Mahendran
At Integral Ad Science we constantly experiment with technologies to process massive datasets and get insightful performance details for customers. One of our major initiatives over the upcoming quarters is to introduce streaming in our multi-billion-events-per-hour data ingestion layer and provide real-time metrics for our customers. Introducing streaming into this massive pipeline could easily span multiple quarters before reaping any benefit, if not properly planned. This blog covers our phased plan to introduce streaming in our system and highlights tracers we added to automatically test data consistency in the streaming pipeline.
The current log processing pipeline is…
by Yuva Mahendran
At Integral Ad Science, with billions of events hourly, milliseconds can make a difference in down-stream processing. Is Apache Pulsar ready to replace Kafka as our go-to streaming data provider? We put it to the test.
Our main goal was to make data available for down-stream processing within milliseconds of the actual event occurring.
Apache Kafka is a framework that’s been on the market since 2011 and has stood the test of time inside and outside IAS. Given that we have our core-pipes already running in AWS, MSK (Amazon managed Kafka) was a natural…