Self-Joins, Windowing, and User Defined Functions in BigQuery


Google’s BigQuery offers unprecedented access to Google Analytics data for Google Analytics 360 (Premium) customers. With great data comes great challenges, especially when you start attempting complicated calculations beyond the traditional metrics in Google Analytics.

There are many ways to write queries in Google’s BigQuery, each has its strengths and weakness. Some of the basics of writing queries were covered in a previous post, but here we’re going to look at three different approaches to writing complex queries. As with any situation where there are multiple options, each option has use-cases in which they simplify the overall amount of effort needed to reach the goal.

The first method, using a self-join, is flexible and straight-forward, but can be very slow. The second, using windowing functions, is efficient and fairly straight-forward, but also limiting in what exactly can be queried. The last method we’ll discuss, User Defined Functions (UDFs), can be efficient on certain types of computations and can make expressing certain types of functions more straight-forward, but comes at the cost of speed.


Self-Joins are when you join a table to itself, and one of many different ways to join a table to other datasources. Why would you ever want to join a table to itself? The most common reason is to find rows in a table that are related to each other. While this provides a great deal of flexibility, joins in BigQuery are inefficient — the larger the “smaller” table becomes, the more data needs to be shipped between nodes.

When self-joining, it’s possible to get into a situation where the entire table needs to be shipped to every node working on the query, as opposed to just the single, or small handful, that it would need otherwise.

An example of a query that could only be answered with a self-join would be a question like this: of people who viewed one specific page, which other pages did they view and how many times?

(Note: the Cross Join used below is not what makes a self join, but that the table being joined is the same. The Join Type is irrelevant.)

This could then loaded into a spreadsheet and a pivot table could be used to look at trends in how people visit pages.

Pivot Table of LCH co-pageviews

Another example would be to find the the page viewed before a page, we could find all related pages in a session using a self-join, filter out, using a WHERE clause because in BigQuery join conditions, in the ON, cannot have inequalities, all hits who have greater hit numbers, and then aggregating all the results for each hit and finding the greatest pageview less than the current one.

(Note: Again I’ll point out that the Inner Join used below is not what makes a self join, but that the table being joined is the same. The Join Type is irrelevant.)

However, as we will see in the following sections, this particular query can be done much more cleanly with windowing functions.

Windowing Functions

Windowing Functions allow you to view smaller portions of your data by allowing access to rows before and after the row currently being processed, in essence opening up “window” that allows you to see more than just the current row and to be able to see your data in different ways.

To help us understand this concept, let’s look at a simple example. Our “table” is the sequence 2,4,3,1 in a field named “num”.

Note the odd OVER () for LAG and LEAD. This tells LAG and LEAD to use the entire table as their window. This is similar to the manner in which aggregate functions behave. More on that later!





1 null 2 4
2 2 4 3
3 4 3 1
4 3 1 null

Note that LAG and LEAD are processing the rows in order of their appearance in the table. If we want them to process rows in sorted order by specifying their order in the OVER clause, as each window can be ordered by different values and in different directions, or in an ORDER BY clause.







1 null 3 1 null 2
2 1 null 2 4 3
3 2 4 3 1 4
4 3 2 4 3 null

Also note that the final output of rows is ordered by the window, not in table order as before. Even though the table is ordered, the LAG and LEAD OVER () return the values in the unordered window.

One of the many useful ways to use a windowing function is to modify the aggregate functions we know and love, e.g. SUM, AVG, STDEV, MIN, and MAX.




1 2 4
2 4 4
3 3 3
4 1 1

As stated before, the windows need not be the same for every function. While this complicates understanding the query (remember! the query is ordered by the final ORDER BY), it can be useful at times. Here is a slightly convoluted query illustrating these points.










1 4 3.0 3.0 2 3 4 4 4
2 3 3.5 2.66 4 1 3 3 3
3 2 2.0 3.0 null 4 4 2 2
4 1 2.0 2.0 3 null 1 1 1

We’re not done with windowing functions yet! Remember how I said that aggregate functions behave as though they were OVER ()? The GROUP BY can be done within the window! However, they’re called “partitions” and you PARTITION BY in a window.





1 4 1 5
2 3 1 5
3 1 2 5
4 1 2 5
5 2 1 5

Coming back to Google Analytics, and how to apply windowing functions to it, we can simplify the query to find time-on-page significantly, both in visual and execution complexity.

For another example, one of the reasons we could have used a self-join would be to get the number of a certain kind of event in the next 2 hits following a pageview. We can do this succinctly (and efficiently!) with a windowing function! (“Jim, look. Surely you must know that the page is attached to the event and we can just do a simple GROUP BY and COUNT!” // “Yes, but what if we’re on a different page now! Maybe I want to know how many people login shortly after visiting a product.”).



1 5

An interesting use of windowing functions is to compute which transaction or event a session precedes. Here we take advantage of a window function being executed over each row individually, but carrying the totals forward. If a session has a transaction, then we give it a 1, otherwise a 0. We then add up all of the values after the current session (we’re sorting descending). The sum will only increment if there was a transaction and stay constant for all sessions chronologically before the transaction until the next transaction. This gives us a value that is the same for the transaction and all sessions before it and after the previous transaction.






5711522334224447562 1378805285 2013-09-10 09:28:05 UTC 0 0
5711522334224447562 1378804654 2013-09-10 09:17:34 UTC 1 1
5711522334224447562 1378803724 2013-09-10 09:02:04 UTC 0 1

User Defined Functions

User Defined Functions (UDFs) are JavaScript functions running in the BigQuery database (using V8). These functions are run as “close” to your data as possible to minimize the amount of time spent shuttling data around. UDFs can do anything you can write a program to do, as such they are extremely powerful, and also extremely easy to abuse and over-complicate a query. More on the details of UDFs can be found in the BigQuery documentation.

The long and short is that the input to a UDF is a query whose output fields are named the same as the fields when registering the UDF (below). The UDF is run once per row, however the UDF may output multiple rows. Take for example, calculating the time on the page, the session and list of hits are a single input row, and multiple rows of output are generated giving the difference between each pageview.

This query also introduces the NEST function, which aggregates all rows into a single repeated field. Repeated fields (similar to how hits and custom dimensions are normally stored) are exposed to our UDF as an array.

To edit the UDF, click on the UDF Editor button in the top right corner.

Example UDF Screen

Because UDFs are a full-fledged programming language, we can compute multiple values or introduce complexities that would be hard to model in SQL. For instance, let’s say we’d like to combine subsequent pages if they are the same URL. We could simply update our function to not increment oi or emit a row.


BigQuery offers many powerful ways for you to leverage your data, but have caveats and fortes each their own. By playing around with the small London Cycle Helmet dataset, other example data Google provides, as well as your own data, you can feel your way through these different methods and learn how to leverage them to your advantage.

Jim is a Data Architect at LunaMetrics. He is a rail fan with a MS in Civil Engineering and BS in Mathematics, both from the University of Pittsburgh, and has worked with software at local companies and the Pittsburgh Supercomputing Center. When not train-watching or extracting information from data, he works on software projects and is a member at a maker space and public transit advocacy group.

Contact Us.


24 S. 18th Street, Suite 100,
Pittsburgh, PA 15203

Follow Us



We'll get back to you
in ONE business day.