About IntelliMind : IntelliMind LLC, founded in 2015, is a stealth-mode startup based in the USA. It specializes in providing intelligent assistance in the FinTech space, designing and developing products and services targeted at financial services. The company is on a mission to push the boundaries of cognitive science, developing programs that can answer complex questions.
IntelliMind leverages machine learning, artificial intelligence, and natural language processing to build tools that enable automated investment assistance.
Interviewing for Role : Data Scientist
Key Skills Required : Python, Time Series Forecasting, Sequential Networks, Data Mining, Inferential Statistics, Regression and Classification Algorithms, Unsupervised Algorithms, OOP
Round 1 | Zoom Interview with CEO
Started with “tell me about yourself” and my work experience so far. I was then asked three questions:
(1) Relation between missing values and standard deviation of a series
(2) Definition of z-score and its uses
(3) Feature selection techniques and which one to use when
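For question (2), a minimal sketch of the z-score (toy data and function name are mine, not from the interview):

```python
import numpy as np

def z_score(x):
    """Distance of each point from the mean, in units of standard deviation.

    Useful for standardizing features and flagging outliers
    (|z| > 3 is a common rule of thumb).
    """
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

vals = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
z = z_score(vals)  # the 100.0 stands out with the largest |z|
```

By construction the standardized series has mean 0 and standard deviation 1, which is what makes z-scores comparable across features on different scales.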
Round 2 | Data Science Assignment
The problem statement was “Given the data of historical stock prices, predict the direction and potentially the magnitude of the move”. Overall, this was an event-modeling exercise where each event contains an “actual result”, a “forecasted result” and a “previous result”. The goal is to predict item #7 (the log difference between successive close prices) given that day’s actual, forecast, rolling_std and z-score values, along with past values of #7 itself.
Mean Absolute Error and Mean Squared Error were the metrics of choice for regression, while accuracy was used for classification. For calculating accuracy, only the sign of the output values (+ or −) was considered.
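These three metrics can be sketched in a few lines (a minimal version on plain NumPy arrays; sklearn.metrics provides the first two as well, and the toy numbers below are mine):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the prediction errors."""
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    """Mean Squared Error: penalizes large errors quadratically."""
    return np.mean((y_true - y_pred) ** 2)

def sign_accuracy(y_true, y_pred):
    """Fraction of predictions whose sign (+ or -) matches the actual move."""
    return np.mean(np.sign(y_true) == np.sign(y_pred))

y_true = np.array([0.5, -0.2, 0.1, -0.4])
y_pred = np.array([0.3, -0.1, -0.2, -0.5])
acc = sign_accuracy(y_true, y_pred)  # 3 of 4 signs match -> 0.75
```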
(1) Univariate feature visualization was performed, wherein seasonality in the actual and forecast columns was observed (over 2 years of data)
(2) A causality check was performed with the help of Granger causality tests, which check whether each time series in the system influences the others; the null hypothesis is that the coefficients of the past values in the regression equation are zero
(3) A cointegration test was performed to establish the presence of a statistically significant long-run relationship between two or more columns
(4) Stationarity of every feature was checked using the Augmented Dickey-Fuller test
(5) For benchmarking, a persistence model was used, which simply outputs the previous day’s value
(6) Autocorrelation plots were made for every feature in order to visualize how strongly each series correlates with its own lags
(7) Time lagged features (by 1 day) were generated for the input variables
(8) Performed a train-test split (the default training set size was kept at 80%)
(9) Outlier detection and treatment was performed using RobustScaler, with the outlier fences set at 1.5×IQR beyond the quartiles
(10) Two new columns, year_even (0 if the year is even) and difference (actual − forecast), were created after some experimentation
(11) As a sanity check, the Pearson correlation coefficient matrix was generated
(12) Six ML algorithms were tried out: Linear Regression, Decision Tree, Random Forest, Gradient Boosting, Lasso Regression & Ridge Regression
(13) Created a Python package for wrapping everything up
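Steps (2)–(4) map directly onto statsmodels; here is a hedged sketch on synthetic data (the actual assignment data is not reproduced, so the series below are stand-ins):

```python
import numpy as np
from statsmodels.tsa.stattools import adfuller, coint, grangercausalitytests

rng = np.random.default_rng(42)
noise = rng.normal(size=200)             # stationary white noise
walk = np.cumsum(rng.normal(size=200))   # non-stationary random walk

# (4) Augmented Dickey-Fuller test; null hypothesis = the series has a unit root
p_noise = adfuller(noise)[1]   # small p -> reject the null -> stationary
p_walk = adfuller(walk)[1]     # large p -> cannot reject -> non-stationary

# (3) Engle-Granger cointegration test between two related series
p_coint = coint(walk, walk + rng.normal(size=200))[1]

# (2) Granger causality: do lags of column 2 help predict column 1?
pair = np.column_stack([noise[1:], noise[:-1]])
results = grangercausalitytests(pair, maxlag=2)  # dict keyed by lag order
```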
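Steps (5), (7) and (8) fit together in plain NumPy (toy data and hypothetical variable names, not the assignment’s actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(0.0, 0.01, size=100)  # stand-in for the log-diff target (#7)

# (7) one-day lagged feature: the input at time t is the target at t-1
x = target[:-1]
y = target[1:]

# (8) 80/20 chronological split -- never shuffle a time series
split = int(len(y) * 0.8)
x_train, x_test = x[:split], x[split:]
y_train, y_test = y[:split], y[split:]

# (5) persistence baseline: predict today's move as yesterday's move
baseline_mae = np.mean(np.abs(y_test - x_test))  # any real model must beat this
```

Keeping the split chronological matters here: a random split would leak future information into the training set.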
Round 3 | Zoom Interview with CEO
This round was essentially a presentation of my take on the assignment. The interviewer questioned almost everything related to it. We also discussed my availability to join.
Round 4 | Google Meet Interview with Tech Lead
Started with my work experience and a technical deep dive into one of my projects, from architecture to deployment. I was asked two questions:
(1) How will you go about predicting if a person has COVID-19? What features would you select and what pre-processing, feature selection and model tuning would you do?
(2) I was asked a puzzle on balancing weights (I cannot recollect it exactly); I took some time from the interviewer to gather my thoughts
Round 5 | Telephonic Interview with Data Science Lead
Started with how I got into data science and my experience with it so far. We discussed the data science assignment and I showed him my results. It was a very mathematical round; the questions asked included:
(1) The z-score given in the data was computed on a rolling window. Did I try to find out over how many lags it was calculated? If yes, how many?
(2) In spite of the boosting algorithms achieving a lower RMSE, why did I go ahead with the Decision Tree as the final choice?
(3) Difference between overfitting and underfitting
(4) Why was ARIMA not performed? ARIMA with causal parameters could’ve given a more solid insight into the workings of the model since it is highly interpretable.
(5) Why does time differencing make a series stationary? Which test was used to confirm stationarity? Which plots would you look at for determining stationarity?
(6) I was asked to calculate standard errors, as he wanted to assess the robustness of my results.
Final Outcome : SELECT
What I think worked for me :
Overall, the feedback from all sides was positive. They liked the way I designed a Python package with highly configurable parameters, and my performance in the final two rounds.