DATA SCIENCE INTERVIEW QUESTIONS
Data science is an interdisciplinary field that combines scientific methods, algorithms, tools, and machine learning techniques to find patterns in raw data and extract meaningful insights from it through statistical and mathematical analysis.
• Data science involves transforming data with a range of technical analysis methods in order to derive useful insights that data analysts can then apply to their business scenarios.
• Data analytics is concerned with testing current hypotheses and facts and providing answers to inquiries in order to make better and more successful business decisions.
• Data Science encourages innovation by addressing questions that lead to new connections and solutions to future problems. Data analytics is concerned with extracting current meaning from existing historical context, whereas data science is concerned with predictive modelling.
Analysing an entire dataset at once is often impractical, especially when the dataset is very large. Instead, a sample is drawn to represent the whole population and the analysis is run on that sample. The sample must be selected carefully so that it properly represents the complete dataset.
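As a rough illustration of drawing a representative sample, the sketch below compares simple random sampling with stratified sampling using only the Python standard library. The population, the "segment" field, and both helper functions are hypothetical and exist only for this example.

```python
import random
from collections import defaultdict

def simple_random_sample(rows, n, seed=0):
    """Draw n rows uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, n)

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every group so group proportions are kept."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 900 rows in segment "A" and 100 rows in segment "B".
population = [{"id": i, "segment": "A" if i < 900 else "B"} for i in range(1000)]

srs = simple_random_sample(population, 100)
strat = stratified_sample(population, key=lambda r: r["segment"], fraction=0.1)
strat_counts = {seg: sum(r["segment"] == seg for r in strat) for seg in ("A", "B")}
print(len(srs), strat_counts)
```

The stratified version guarantees that the rare segment appears in the sample in its true proportion, which a simple random sample only achieves on average.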
Overfitting occurs when a model fits the training data too closely, noise included, and so performs well only on the data it was trained on. When new data is fed into the model, its predictions are poor. This happens when the model has low bias and high variance. Decision trees are particularly prone to overfitting.
Underfitting occurs when the model is so simple that it cannot capture the underlying relationship in the data, and hence performs poorly even on the training data, not just on test data. This happens when the model has high bias and low variance. Underfitting is more common in linear regression.
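The contrast between underfitting and overfitting can be sketched by fitting polynomials of increasing degree to noisy data and comparing training and test error. This is a minimal sketch using NumPy; the sine-shaped data, the noise level, and the chosen degrees are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    """Hypothetical data: a sine curve plus Gaussian noise."""
    x = rng.uniform(-1.0, 1.0, n)
    y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

results = {}
for degree in (1, 3, 10):
    coeffs = np.polyfit(x_train, y_train, degree)
    results[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree:2d} train_mse={results[degree][0]:.4f} "
          f"test_mse={results[degree][1]:.4f}")
```

Training error can only fall as the degree rises. A straight line (degree 1) typically shows high error on both sets, which is underfitting, while a very flexible polynomial typically shows a low training error and a noticeably larger test error, which is overfitting.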
Long format data versus wide format data
• Long format: each row holds a subject's information at a single point in time, so each subject's data is spread across multiple rows. The data can be recognised by treating rows as groups. This format is most commonly used in R analysis and for writing to log files at the end of each experiment.
• Wide format: a subject's repeated responses are split across several columns. The data can be recognised by treating columns as groups. This format is rarely used in R analysis, but it is widely used in statistics packages for repeated measures ANOVAs.
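To make the difference between the two layouts concrete, here is a small pandas sketch that converts a table from wide to long and back. The column names (subject, t1, t2, score) and the values are invented for the example.

```python
import pandas as pd

# Hypothetical wide-format data: one row per subject,
# one column per repeated measurement.
wide = pd.DataFrame({
    "subject": ["s1", "s2"],
    "t1": [5.0, 7.0],
    "t2": [6.0, 8.0],
})

# Wide -> long: one row per (subject, time) observation.
long = wide.melt(id_vars="subject", var_name="time", value_name="score")

# Long -> wide: back to one row per subject.
wide_again = (
    long.pivot(index="subject", columns="time", values="score")
    .reset_index()
    .rename_axis(columns=None)
)

print(long)
print(wide_again)
```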
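An eigenvector of a square matrix is a non-zero vector that the matrix only scales, without moving it off its own line. Eigenvectors are usually written as column vectors and normalised to unit length (a magnitude of one), and they are also known as right eigenvectors. The eigenvalue is the factor by which the matrix scales the corresponding eigenvector.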
Eigen decomposition is the process of breaking down a matrix into Eigenvectors and Eigenvalues. These are then employed in machine learning approaches such as PCA (Principal Component Analysis) to extract useful insights from the given matrix.
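As an illustration, the sketch below decomposes a small symmetric matrix with NumPy and checks the defining relation A v = λ v. The matrix is an arbitrary example, and np.linalg.eigh is used on the assumption that the matrix is symmetric, as a covariance matrix in PCA would be.

```python
import numpy as np

# Arbitrary symmetric example matrix (chosen only for illustration).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is intended for symmetric matrices; it returns eigenvalues in
# ascending order and unit-length eigenvectors as the columns of the result.
eigenvalues, eigenvectors = np.linalg.eigh(A)

for i, lam in enumerate(eigenvalues):
    v = eigenvectors[:, i]
    # The matrix only scales each eigenvector by its eigenvalue.
    print(lam, np.allclose(A @ v, lam * v), np.isclose(np.linalg.norm(v), 1.0))

# The matrix can be rebuilt from its eigenvectors and eigenvalues.
reconstructed = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T
print(np.allclose(reconstructed, A))
```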
A p-value is the probability of obtaining results at least as extreme as those actually observed, assuming that the null hypothesis is true. It indicates how likely it is that a difference as large as the one observed would arise by chance alone.
• A low p-value, i.e. a value < 0.05 (the conventional threshold), means that the null hypothesis can be rejected: the observed data would be unlikely if the null hypothesis were true.
• A high p-value, i.e. a value > 0.05, means that there is not enough evidence to reject the null hypothesis: the observed data is consistent with a true null.
• A p-value of exactly 0.05 is marginal, and the hypothesis could go either way.
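For a concrete feel of how a p-value is computed, here is a minimal sketch of a two-sided z-test using only the standard library. It assumes a known population standard deviation, and the sample mean, null mean, sigma, and sample size are made-up numbers.

```python
from statistics import NormalDist

def two_sided_z_test(sample_mean, null_mean, sigma, n):
    """Return the z statistic and two-sided p-value, assuming known sigma."""
    z = (sample_mean - null_mean) / (sigma / n ** 0.5)
    # Probability of a result at least this extreme in either direction,
    # given that the null hypothesis is true.
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical numbers: null mean 100, observed mean 103, sigma 15, n = 100.
z, p = two_sided_z_test(sample_mean=103.0, null_mean=100.0, sigma=15.0, n=100)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these numbers the z statistic is 2.0 and the p-value is a little under 0.05, so the null hypothesis would be rejected at the conventional 0.05 level, though only narrowly.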
Resampling is a technique for repeatedly drawing samples from the data in order to improve accuracy and quantify the uncertainty of population parameters. It is used to check that a model is good enough by training it on different subsets of a dataset, so that variation in the data is handled. It is also used when models need to be validated on random subsets, or when the labels on data points are shuffled while performing significance tests.
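One common resampling method is the bootstrap. The sketch below estimates a percentile confidence interval for a mean by resampling with replacement; the data values, the number of resamples, and the bootstrap_ci helper are all assumptions made for this example.

```python
import random
from statistics import mean

def bootstrap_ci(data, stat=mean, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = random.Random(seed)
    # Each resample draws len(data) points from the data WITH replacement.
    estimates = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical measurements.
data = [2.1, 2.5, 2.8, 3.0, 3.2, 3.3, 3.7, 4.0, 4.4, 5.1]
lower, upper = bootstrap_ci(data)
print(f"sample mean = {mean(data):.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The width of the interval gives a direct, assumption-light measure of how uncertain the estimated mean is.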
When data is distributed unequally across the different categories, the dataset is said to be highly imbalanced. Such datasets degrade model performance and make accuracy misleading, because a model can score well simply by favouring the majority class.
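The effect can be seen with a toy example: on a dataset with 95 negatives and 5 positives, a "model" that always predicts the majority class looks accurate while finding none of the positives. The labels below are fabricated purely for illustration.

```python
# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(t == 1 for t in y_true)

print(accuracy, recall)  # 0.95 0.0
```

This is why metrics such as recall or precision are usually checked alongside accuracy on imbalanced data.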
There is little difference between the expected value and the mean value, but it is worth noting that they are used in different contexts. In general, the mean value is used when talking about a probability distribution or a sample, whereas the expected value is used in the context of random variables.
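A fair six-sided die gives a simple way to see the two side by side: the expected value is computed from the probabilities of the random variable, while the mean is computed from observed rolls. The die and the number of simulated rolls are assumptions chosen for this sketch.

```python
import random
from fractions import Fraction

# Expected value of a random variable: sum of value * probability.
faces = range(1, 7)
expected_value = sum(face * Fraction(1, 6) for face in faces)  # 7/2

# Mean of a sample: average of observed outcomes.
rng = random.Random(0)
rolls = [rng.randint(1, 6) for _ in range(100_000)]
sample_mean = sum(rolls) / len(rolls)

print(float(expected_value), round(sample_mean, 3))
```

As the number of rolls grows, the sample mean settles close to the expected value of 3.5.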
Survivorship bias is the logical error of focusing on the items that survived some selection process while overlooking those that did not, because the latter are less visible. This bias can lead to incorrect conclusions.
• KPI: KPI stands for Key Performance Indicator, and it measures how successfully a company fulfils its goals.
• Lift: This is a measure of the target model’s performance when compared to a random choice model. Lift reflects how good the model is at predicting compared to the absence of a model.
• Model fitting: This metric measures how well a model fits supplied observations.
• Robustness: This is the system’s ability to deal with differences and fluctuations successfully.
• DOE: This stands for design of experiments. It is the design of a task that aims to describe and explain how information varies under conditions hypothesised to reflect the variables of interest.
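Of the terms above, lift is the easiest to show numerically. One common way to compute it, sketched below, is to divide the positive rate among the cases the model ranks highest by the positive rate in the whole dataset, which is what random selection would achieve. The labels, scores, and the lift_at_fraction helper are hypothetical.

```python
def lift_at_fraction(y_true, scores, fraction=0.1):
    """Positive rate in the top-scored fraction divided by the overall rate."""
    n_top = max(1, int(len(y_true) * fraction))
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:n_top]) / n_top
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate

# Hypothetical labels (1 = positive) and model scores for ten cases.
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

lift = lift_at_fraction(y_true, scores, fraction=0.2)
print(round(lift, 2))
```

Here the top 20% of cases are all positive while the base rate is 30%, so the lift is about 3.33. A lift of 1 would mean the model does no better than random choice.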
• Time series analysis can be regarded as an extension of linear regression that uses concepts such as autocorrelation and moving averages to summarise the historical values of the y-axis variable in order to forecast future values.
• Forecasting and prediction are the main goals of time series problems: accurate predictions can be made even when the underlying reasons are not known.
• The presence of time in an issue does not necessarily imply that it is a time series problem. For a problem to be classified as a time series problem, it must have a relationship between the target and the time.
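• Observations that are close together in time are expected to be more similar to each other than to observations that are far apart, and this structure is what accounts for patterns such as seasonality.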
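The two concepts named above, moving averages and autocorrelation, can be sketched in a few lines of standard-library Python. The series is invented, and the autocorrelation formula used here is the usual sample estimate, offered as one reasonable choice rather than the only one.

```python
from statistics import mean

def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [
        mean(series[i - window + 1 : i + 1])
        for i in range(window - 1, len(series))
    ]

def autocorrelation(series, lag):
    """Sample autocorrelation between the series and itself shifted by `lag`."""
    mu = mean(series)
    num = sum(
        (series[t] - mu) * (series[t - lag] - mu)
        for t in range(lag, len(series))
    )
    den = sum((x - mu) ** 2 for x in series)
    return num / den

# Hypothetical series with an upward trend.
series = [10, 12, 14, 13, 15, 17, 16, 18, 20, 19, 21, 23]

ma = moving_average(series, window=3)
r1 = autocorrelation(series, lag=1)
print(ma[:3], round(r1, 3))
```

A lag-1 autocorrelation well above zero, as here, indicates that neighbouring observations resemble each other, which is the relationship between target and time that makes a problem a time series problem.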
Depending on the size of the dataset, we handle missing values with the following methods:
• If the dataset is small, the missing values are replaced with the mean of the remaining data. In pandas, use mean = df.mean(), where df is the pandas DataFrame containing the dataset and mean() computes the mean of each column. We can then call df.fillna(mean) to replace the missing values with the calculated means.
• For larger datasets, delete the rows with missing values and use the remaining data for prediction.
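Both approaches can be sketched in pandas as follows. The DataFrame and its column names (age, income) are made up for the example; df.mean(), df.fillna(), and df.dropna() are the standard pandas calls.

```python
import pandas as pd

# Hypothetical dataset with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, None, 35.0],
    "income": [50.0, 60.0, None],
})

# Small dataset: fill each column's gaps with that column's mean.
column_means = df.mean()
filled = df.fillna(column_means)

# Large dataset: drop every row that contains a missing value.
dropped = df.dropna()

print(filled)
print(dropped)
```

Mean imputation keeps every row at the cost of slightly distorting the distribution, while dropping rows keeps only genuine values at the cost of discarding data, which is why the second option suits larger datasets.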