AlixPartners' Analytics Challenge

At AlixPartners, we’re always looking for smart, accomplished, quick learners with high energy and a demonstrated ability to get results. We put a lot of effort into the screening, interview, and assessment process to ensure not only that you have the right skills and experience, but also that our culture and core values are a good fit for you.

IT & Applied Analytics is a part of AlixPartners' Digital practice. To see job descriptions and open positions in this area please visit the Careers: Current Openings section of our corporate website and select Information Management Services from the Categories dropdown menu.

If you do not find an open position that suits you, but you are still interested, you may also submit your CV and cover letter to IMS-Recruiting@alixpartners.com.

If you are interested in our Applied Analytics group, consider taking the AlixPartners’ Analytics Challenge.

Take the AlixPartners' Analytics Challenge

Data – at AlixPartners we deal with a lot of it. Whether it’s parsing through terabytes of forensic information in an electronic discovery setting or building a predictive model to improve the revenue forecasting ability for one of our Fortune 500 clients – the circumstances are always different but the challenge remains the same. What actionable insights can we provide, from the data we are able to obtain, to improve our client’s situation when it really matters? 

So that’s where you come in. The high-pressure situations we typically operate under rarely yield luxuriously clean, normalized data sets. If you love working with ambiguous, incomplete, duplicative, or otherwise outdated data, then tackling one of these challenges might be for you. If you hit a home run, then perhaps we’ll have something to talk about.

When we’re collecting forensic information, it’s not always clear who it belongs to. We might be trying to trace and identify e-mails from unknown senders or understand the source of a stream of message data. All too often the binary comparison operators provided by our standard tool sets fall short when we try to describe and understand the vagueness of the real world. That’s where the fun begins: fuzzy logic typically offers more graceful alternatives.
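
To make the contrast concrete, here is a tiny illustration, separate from the challenge itself and using made-up e-mail strings, of how a graded similarity score differs from an all-or-nothing equality test (Python's standard library difflib is used only as an example of the idea):

from difflib import SequenceMatcher

a = "j.smith@example.com"
b = "jsmith@exampel.com"   # hypothetical near-duplicate with a typo

# Binary comparison: identical or not, with no middle ground.
print(a == b)                                         # False

# Graded comparison: a similarity ratio in [0, 1].
print(round(SequenceMatcher(None, a, b).ratio(), 2))  # roughly 0.92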

You’ll be provided a randomized, interlaced dataset comprising user data drawn from the command histories of an unknown number of UNIX computer users over a multi-year period. The data has been sanitized to remove identifying attributes such as user and file names as well as directory structures. Additionally, the identifiers #BOF# and #EOF# have been inserted into the dataset to designate the beginning and end of each shell session. The sessions have been concatenated into a single stream in date order, but no timestamps are provided. The output of your program should estimate, with an associated degree of certainty, how many users the dataset comprises.

Download: Problem 1 Input Data

 

Sample Session Data from Initial Capture

 

Provided Tokenized Stream

 

# Start session 1
cd ~/private/files
whoami
cat zug.txt zam.txt > nowhere
exit
# End session 1

#BOF#
cd
<1>    -- represents 1 file argument
whoami
cat
<2>    -- represents 2 file arguments
>
<1>
exit
#EOF#
 

Output Format

No. Users, Probability
2 Users, .1
3 Users, .4
4 Users, .5
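
One possible line of attack is sketched below, under several assumptions that are ours rather than part of the challenge: the token stream is assumed to live in a plain-text file named problem1.txt with one token per line, and k-means over simple command-frequency profiles stands in for whatever fuzzy or probabilistic grouping you prefer. The idea is to split the stream into sessions at the #BOF#/#EOF# markers, describe each session by its command usage, cluster the sessions for several candidate user counts, and report the result in the required format.

# Rough sketch only: file name, features, and scoring are illustrative choices.
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics import silhouette_score

# 1. Split the token stream into sessions bounded by #BOF#/#EOF#.
sessions, current = [], None
with open("problem1.txt") as fh:
    for token in (line.strip() for line in fh):
        if token == "#BOF#":
            current = []
        elif token == "#EOF#":
            if current:
                sessions.append(current)
            current = None
        elif current is not None and token:
            current.append(token)

# 2. Represent each session by its command-frequency profile.
X = DictVectorizer(sparse=False).fit_transform([Counter(s) for s in sessions])

# 3. Cluster for several candidate user counts and record a fit-quality score.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# 4. Crudely normalize the scores into the probability-style output requested.
shifted = {k: s - min(scores.values()) + 1e-9 for k, s in scores.items()}
total = sum(shifted.values())
print("No. Users, Probability")
for k in sorted(shifted):
    print(f"{k} Users, {shifted[k] / total:.2f}")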

 

 

As consultants, we are often required to make inferences based on limited amounts of data.  In this dataset, with more variables than observations, traditional tools such as logistic regression fail. 

You'll be provided with a dataset of 300 random variables (each drawn from [0, 1]). A secret algorithm was used to compute a binary target variable from these variables.

The training dataset has 250 rows, and the test dataset has 19,750 rows. The goal is to build a model based on the training dataset that accurately classifies the test dataset.

Submit your best predicted probabilities (ranging from 0 to 1) for the Target_Evaluate variable across the 19,750 test rows. Be careful which variable selection techniques you use, and don’t overfit: you'll be evaluated on the area under the ROC curve.

Download: Problem 2 - Input Data
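
With 300 predictors and only 250 training rows, some form of regularization is essentially mandatory. The sketch below is one illustrative baseline, not the expected solution; the file names problem2_train.csv and problem2_test.csv and the Target column name are assumptions to be replaced with the actual layout of the downloaded data. It cross-validates an L1-penalized logistic regression against the AUC metric before scoring the test rows.

# Illustrative baseline: assumed file names and a "Target" column, to be adjusted.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

train = pd.read_csv("problem2_train.csv")
test = pd.read_csv("problem2_test.csv")
X, y = train.drop(columns=["Target"]), train["Target"]

# The L1 penalty drives most of the 300 coefficients to zero, one way to cope
# with having far more variables than observations.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1, max_iter=1000),
)

# Cross-validated AUC on the 250 training rows previews the evaluation metric.
print("CV AUC:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())

# Fit on all training data and write probabilities for the 19,750 test rows.
model.fit(X, y)
pd.Series(model.predict_proba(test)[:, 1], name="Target_Evaluate").to_csv(
    "problem2_predictions.csv", index=False
)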

Lack of data validation is one of the most commonly cited failures in application security assessments. Employed correctly, however, validation not only helps ensure security but also encourages input completion, efficiency, and consistency, and minimizes errors in the data captured by information systems.

Understanding these benefits raises the obvious question: why isn’t good validation employed everywhere? The simple truth is that it isn’t easy. As the demands of data capture and tracking grow, validation rules must be developed in step. Doing so, on both the client and server sides of an application, is a tedious process, and tedium begets errors.

In this problem, you’ll be provided with two data sets. The first contains nearly three million city/country pairs extracted from a source system before true data validation was employed. The second data set, known to be correct, maps the country codes in the first set to the corresponding country descriptions. The output of your program should accurately match each entry to its correct, cleaned city spelling and country description.

The output file should be provided as a pipe ('|') delimited .txt file. For each record in the input file it should contain four fields: Input_City, Input_CountryCode, Output_City, Output_CountryName. Other descriptions or latitude/longitude pairs may be added as additional output fields.

You’ll be evaluated solely on the basis of a correct match percentage – although we can’t say we won’t give bonus points for geo-coding your results!

Download: Problem 3 - Input Data
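
A common starting point is sketched below using only the standard library. The file names, the pipe-delimited City|CountryCode input layout, and the reference list of correctly spelled city names (clean_cities.txt) are all assumptions; in particular, a clean city list is not part of the provided data and would have to be built or sourced. For three million rows you would also want blocking or indexing rather than a brute-force scan; this only shows the shape of the approach.

# Illustrative sketch: assumed file names and layouts; stdlib fuzzy matching only.
import csv
from difflib import get_close_matches

# Provided reference set: country code -> country description (layout assumed).
with open("country_codes.txt", newline="") as fh:
    countries = dict(csv.reader(fh, delimiter="|"))

# Hypothetical reference list of correctly spelled city names (not provided).
with open("clean_cities.txt") as fh:
    clean_cities = [line.strip() for line in fh if line.strip()]

with open("problem3_input.txt", newline="") as src, \
     open("problem3_output.txt", "w", newline="") as dst:
    writer = csv.writer(dst, delimiter="|")
    for city, code in csv.reader(src, delimiter="|"):
        # Map the raw city string to its closest clean spelling, if one is close enough.
        match = get_close_matches(city.strip().title(), clean_cities, n=1, cutoff=0.8)
        writer.writerow([
            city,
            code,
            match[0] if match else city.strip().title(),
            countries.get(code.strip(), "UNKNOWN"),
        ])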

Do you know your customers? That’s a perennial question that drives a large portion of marketing investment across industries and keeps CMOs awake at night. Analytics is key to generating insights and informing actionable, user-level decisions.

This challenge is to predict granular customer/item affinity to support better decision-making. In a broader application, this information can drive front-line results in marketing campaign or online experience optimization.

Traditional approaches usually fail because descriptive features do not hold much predictive power. Indeed, most of the predictive power resides in the users’ sequences of ratings rather than in user and item descriptions. We have pushed the challenge to the extreme: no user or item metadata is provided.

Each customer has rated items from 1 (strongly dislike) to 5 (strongly like). The task is to predict as accurately as possible (measured by RMSE) the affinity between customers and items in the scoring dataset. Good luck!

Data Description: You are provided with two files: a training set (.train) and a scoring set (.score). Both are space-delimited files.

Download: Problem 4 - Input Data

Format of the Submission File:

  1. Comma-delimited file
  2. File name: “predictions.txt”
  3. Headers: [Entry_id],[Predicted_Rating], with Entry_id matching the first column of the .score file
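
As a point of reference only, the baseline below predicts the global mean rating adjusted by user and item biases, a standard starting point before moving on to neighborhood or matrix-factorization models, and writes predictions.txt in the required format. The assumed record layouts (user item rating in the .train file; entry_id user item in the .score file) and the file names problem4.train and problem4.score should be checked against the downloaded data.

# Baseline sketch: global mean plus user and item biases.
# Assumed, space-delimited layouts (verify against the actual files):
#   problem4.train: user item rating
#   problem4.score: entry_id user item
from collections import defaultdict

ratings, by_user, by_item = [], defaultdict(list), defaultdict(list)
with open("problem4.train") as fh:
    for line in fh:
        user, item, rating = line.split()
        r = float(rating)
        ratings.append(r)
        by_user[user].append(r)
        by_item[item].append(r)

global_mean = sum(ratings) / len(ratings)
user_bias = {u: sum(rs) / len(rs) - global_mean for u, rs in by_user.items()}
item_bias = {i: sum(rs) / len(rs) - global_mean for i, rs in by_item.items()}

with open("problem4.score") as fh, open("predictions.txt", "w") as out:
    out.write("Entry_id,Predicted_Rating\n")
    for line in fh:
        entry_id, user, item = line.split()
        pred = global_mean + user_bias.get(user, 0.0) + item_bias.get(item, 0.0)
        # Clip to the valid 1-5 rating range before writing.
        out.write(f"{entry_id},{min(5.0, max(1.0, pred)):.4f}\n")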