Excel Tip : “Group By, Concatenate”

2 Toronto Header

 

tl;dr : Here are the two key Excel functions to insert at the top of two new columns =IF(A2=A1,D1&”, “&C2,C2) & =IF(A2<>A3,”Last”,””)
CONTEXT : Here is an interesting data cleaning problem. At my workplace, people would enter reports into a system. Although the user sees one big box to type their reports,  the information was stored in separate, sequential lines. When I exported the csv file, I have a series of reports and so each individual report is broken up into multiple line. This reminds me of computing the “group by, sum” query in SQL, but instating of summing, we are concatenating text.

Given a 2-column table, column A has reference / report ids and column B has the numerical value, but you want to summarize the values by column A’s report ids e.g. take the sum of the column B, by grouping column A (e.g. group by, summarise, sum). In this case, column B is not numerical, but it’s a string. In this case, we want to “group by, summarise, and concatenate” the text. We will show two examples : one in Excel and a faster one using an R script.
KEY QUESTION: Given a two-column list (where one column is the report id and another column is the text), how can I concatenate the text together, grouping them by report id? (A third column may be needed to index and sort the lines if the list is not already sorted.)

DESIRED OUTPUT: Two column matrix with concatenated text

LEVEL OF DIFFICULTY: Easy

ESTIMATED EXCEL AND SCRIPT TIME: less than 4 hours

CATEGORY OF ANALYSIS: Excel tip, data cleaning

PREREQUISITE KNOWLEDGE: basic manipulation in R, Excel formulas and functions

TOOLS AND LIBRARIES: R/R Studio, Excel

VALUE ADDED: A handy function to help concatenate that text and clean up your file!

METHODOLOGY: First, one new column is created, concatenating the text from the previous row’s text column (per report id). Then, a second column is created based on report id again and makes a marker if it the row is the last of that report id. If the row is indeed the last of that grouping of the report id (remember, the report ids are sorted!) then the last entry of the concatenated text is the concatenation of all the relevant rows. The file is filtered for the marker. The R script mirrors the logic and steps in the Excel file.

EXCE EXAMPLE:

First, we have our text:2_excelfile_first.png

And the report is separated into multiple lines. Boo-urns!

 

We introduce two columns. At the top of each column, we insert the following formulas and then bring the formula down. =IF(A2=A1,D1&”, “&C2,C2) & =IF(A2<>A3,”Last”,””)

 

3_Excefile_second.png

 

And

4_excel_third.png

 

For each row that has the marker “Last”, we can see that it is the complete concatenated text for the reference id. Then we filter formula 2’s column for the Last marker to get the rows of interest.

 

R EXAMPLE

In R, here is our example:

1_R_script_group_by_concatenate.png

 

rm(list=ls())
df <- read.csv(“/Users/patrickroncal/Desktop/Group_By.csv”)
library(dplyr)
## re-name your columns
df$col_a <- df$Reference_ID ### change this to the main identifying file number
df$col_b <- df$Line_Number ### this is actually optional, but it’s the sequential paragraph line number for each file number
df$col_c <- df$Text ### the actual words
## function
group_by_concatenate <- function(df){
df$col_d <- “”
df$col_e <- “”
df$col_c <- as.character(df$col_c)
for(j in 2:nrow(df)){
i <- j-1;
df[1, “col_d”] <- df[1, “col_c”];
df[j, “col_d”] <- ifelse(df[j, “col_a”] == df[i, “col_a”],
paste0(df[i, “col_d”], ” “, df[j, “col_c”]),
df[j, “col_c”]);
df[1, “col_d”] <- df[1, “col_c”];}
for(j in 2:nrow(df)){
i <- j-1;
df[nrow(df), “col_e”] <- “Last”;
df[i, “col_e”] <- ifelse(df[i, “col_a”] != df[j, “col_a”],
“Last”,
“”);}
df <- filter(df, col_e == “Last”);
df$Full_Comment <- df$col_d
df <- subset(df, select = c(“col_a”, “Full_Comment”));
return(df);}
## run function
df_cleaned <- group_by_concatenate(df)

 

What do you think? Do you find this handy? Are there other ways to achieve the same output?