Python Data Science & Machine Learning Basics: NumPy, Pandas, scikit-learn
1. Introduction to Data Science and Machine Learning
Q: What is Data Science and Machine Learning?
What is Data Science? Data science is an interdisciplinary field that uses scientific methods, algorithms, and systems to extract knowledge and insights from structured and unstructured data. It combines statistics, programming, and domain expertise to inform decision-making.
What is Machine Learning? Machine learning is a subset of AI in which systems learn patterns from data to make predictions or decisions without being explicitly programmed for each task. It includes supervised, unsupervised, and reinforcement learning.
How do Data Science and Machine Learning Relate? Data science encompasses machine learning as a tool for analysis, along with data cleaning, visualization, and deployment.
Use Case: Data science is used in recommendation systems (e.g., Netflix), fraud detection, and business analytics. Machine learning powers predictive models within these applications.
2. NumPy and Pandas Basics
Q: What are NumPy and Pandas?
What is NumPy? NumPy is a library for numerical computing in Python, providing support for multi-dimensional arrays and mathematical functions.
What is Pandas? Pandas is a library for data manipulation and analysis, offering data structures like DataFrames for handling tabular data.
Example: NumPy Array Operations
import numpy as np
# Create arrays
arr1 = np.array([1, 2, 3, 4, 5]) # 1D array
arr2 = np.array([[1, 2, 3], [4, 5, 6]]) # 2D array
zeros = np.zeros((2, 3)) # 2x3 array of zeros
ones = np.ones((3, 2)) # 3x2 array of ones
# Array operations
sum_arr = np.sum(arr1) # Sum
mean_arr = np.mean(arr1) # Mean
squared = arr1 ** 2 # Element-wise squaring
# Display results
print(f"Array 1D: {arr1}")
print(f"Array 2D: {arr2}")
print(f"Zeros: {zeros}")
print(f"Ones: {ones}")
print(f"Sum: {sum_arr}, Mean: {mean_arr}")
print(f"Squared: {squared}")
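The reductions above collapse a whole array to a single number. On multi-dimensional arrays the same functions accept an axis argument, and arithmetic between arrays of different shapes relies on broadcasting. The following is a minimal sketch of both ideas, reusing the 2D array from the example; the variable names are illustrative only.

```python
import numpy as np

arr2 = np.array([[1, 2, 3], [4, 5, 6]])  # same 2D array as in the example above

col_sums = arr2.sum(axis=0)    # collapse the rows: one sum per column
row_sums = arr2.sum(axis=1)    # collapse the columns: one sum per row
col_means = arr2.mean(axis=0)  # one mean per column

# Broadcasting: the 1D col_means (shape (3,)) is applied to every row of arr2
centered = arr2 - col_means

print(f"Column sums: {col_sums}")    # [5 7 9]
print(f"Row sums: {row_sums}")       # [ 6 15]
print(f"Column means: {col_means}")  # [2.5 3.5 4.5]
print(f"Centered:\n{centered}")
```

After centering, every column of the result has a mean of zero, which is a common preprocessing step before modeling.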
Example: Pandas DataFrame Operations
import pandas as pd
# Create DataFrame
data = {
'Name': ['Krishna', 'Kristal', 'Ram'],
'Age': [25, 30, 35],
'Salary': [50000, 60000, 70000]
}
df = pd.DataFrame(data)
# Basic operations
print("DataFrame:")
print(df)
print(f"\nMean Salary: {df['Salary'].mean()}")
print(f"Filtered (Age > 28):\n{df[df['Age'] > 28]}")
Note: Run with python script.py after installing NumPy and Pandas (pip install numpy pandas). Use virtual environments to manage dependencies.
3. Data Visualization with Matplotlib and Seaborn
Q: How do you visualize data?
What is Matplotlib? Matplotlib is a plotting library for creating static, animated, and interactive visualizations in Python.
What is Seaborn? Seaborn is a statistical visualization library based on Matplotlib, providing high-level interfaces for attractive plots.
Example: Line and Scatter Plots
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Sample data
data = {'Age': [25, 30, 35, 40, 45], 'Salary': [50000, 60000, 70000, 80000, 90000]}
df = pd.DataFrame(data)
# Matplotlib: Line plot
plt.figure(figsize=(8, 5))
plt.plot(df['Age'], df['Salary'], marker='o', color='blue')
plt.title('Salary vs Age')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.grid(True)
plt.show()
# Seaborn: Scatter plot
sns.scatterplot(data=df, x='Age', y='Salary')
plt.title('Salary vs Age (Seaborn)')
plt.show()
Note: Run with python script.py after installing Matplotlib and Seaborn (pip install matplotlib seaborn). plt.show() opens each plot in a window; to write a plot to a file instead, call plt.savefig() with a filename before plt.show().
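On a machine without a display (a server or a CI job, for instance), plots are usually written to disk rather than shown. The sketch below assumes the non-interactive Agg backend and an output file in the system temporary directory; the file name salary_vs_age.png is a placeholder, not something the tutorial prescribes.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render without opening a window
import matplotlib.pyplot as plt

# Same sample data as the example above
ages = [25, 30, 35, 40, 45]
salaries = [50000, 60000, 70000, 80000, 90000]

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(ages, salaries, marker="o", color="blue")
ax.set_title("Salary vs Age")
ax.set_xlabel("Age")
ax.set_ylabel("Salary")
ax.grid(True)

# Hypothetical output location; change it to suit your project layout
out_path = os.path.join(tempfile.gettempdir(), "salary_vs_age.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
plt.close(fig)  # release the figure's memory once it is saved

print(f"Saved plot to {out_path}")
```

The object-oriented interface (fig, ax) is used here instead of the plt.* calls above because it makes it explicit which figure is being saved.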
4. Introduction to scikit-learn
Q: What is scikit-learn?
What is scikit-learn? scikit-learn is a machine learning library for Python, offering simple and efficient tools for data mining and analysis.
Example: Decision Tree Classifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
# Load dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
# Predict and evaluate
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
Note: Run with python script.py after installing scikit-learn (pip install scikit-learn). The Iris dataset is included with scikit-learn.
5. Basics of Machine Learning Algorithms
Q: What are the types of machine learning?
What is Supervised Learning? Supervised learning uses labeled data to train models for prediction (e.g., classification, regression).
What is Unsupervised Learning? Unsupervised learning finds patterns in unlabeled data (e.g., clustering, dimensionality reduction).
What is Reinforcement Learning? Reinforcement learning involves agents learning from actions and rewards in an environment.
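To make the action-and-reward loop concrete, here is a minimal sketch of an epsilon-greedy agent on a multi-armed bandit, one of the simplest reinforcement learning settings. It uses only the standard library. The function name run_bandit, the three reward means, and the values of steps and epsilon are assumptions chosen for illustration.

```python
import random


def run_bandit(true_means, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent: mostly exploit the best-looking arm, sometimes explore."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms       # how often each arm (action) was chosen
    estimates = [0.0] * n_arms  # running average reward per arm

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)  # explore: pick a random arm
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit

        # The environment returns a noisy reward for the chosen action
        reward = rng.gauss(true_means[arm], 1.0)

        # Incremental update of the average reward for this arm
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return counts, estimates


counts, estimates = run_bandit([0.2, 0.5, 0.8])
print("Pulls per arm:", counts)
print("Estimated rewards:", [round(e, 2) for e in estimates])
```

The agent is never told which arm is best. It only sees rewards for the actions it takes, which is what distinguishes this setting from supervised learning.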
Example: Supervised and Unsupervised Algorithms
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans
# Load dataset
iris = load_iris()
X, y = iris.data[:, :2], iris.target # Use 2 features for simplicity
# Supervised: Linear Regression (API demonstration only: y holds class labels 0-2, not a continuous target)
lr = LinearRegression()
lr.fit(X, y)
# Supervised: Decision Tree
dt = DecisionTreeClassifier(random_state=42) # Fixed seed for reproducible results
dt.fit(X, y)
# Unsupervised: K-Means Clustering
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
kmeans.fit(X)
print("Linear Regression Coefficients:", lr.coef_)
print("Decision Tree Classes:", dt.classes_)
print("K-Means Cluster Centers:", kmeans.cluster_centers_)
Note: Run with python script.py after installing scikit-learn. Linear Regression is fitted to the integer class labels here only to demonstrate the API; Iris is a classification dataset, so the resulting coefficients have no meaningful interpretation.
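For a regression example with a genuinely continuous target, one option is to predict one Iris measurement from another. The sketch below predicts petal width from petal length; this choice of columns is an assumption made for illustration and is not part of the original example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

iris = load_iris()
# Column order in iris.data: sepal length, sepal width, petal length, petal width
petal_length = iris.data[:, [2]]  # keep 2D shape (150, 1): scikit-learn expects a feature matrix
petal_width = iris.data[:, 3]     # continuous target, shape (150,)

reg = LinearRegression()
reg.fit(petal_length, petal_width)

# R^2 on the training data: how much of the variance the line explains
r2 = r2_score(petal_width, reg.predict(petal_length))
print(f"Slope: {reg.coef_[0]:.3f}, Intercept: {reg.intercept_:.3f}, R^2: {r2:.3f}")
```

Scoring on the training data keeps the sketch short. For an honest estimate of performance, hold out a test set as in the Decision Tree example in section 4.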
6. Best Practices for Python Data Science & ML
Q: What are best practices?
- Use virtual environments (venv or conda) for dependency management.
- Clean and explore data thoroughly before modeling.
- Split data into train/validation/test sets.
- Visualize data at every stage.
- Evaluate models with appropriate metrics (accuracy, precision, recall, etc.).
- Avoid data leakage (e.g., do not fit a scaler on the full dataset before splitting; fit it on the training set only).
- Document code and experiments (Jupyter notebooks work well for this).
- Use version control (Git) for reproducibility.
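Several of the practices above (a train/validation/test split, leakage-free scaling, and metrics beyond accuracy) can be combined in one short script. The following is a sketch under stated assumptions: the 60/20/20 split ratio and the choice of LogisticRegression are illustrative, not recommendations from this guide.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X, y = iris.data, iris.target

# Two-step split: 60% train, then the remaining 40% halved into validation and test
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=42, stratify=y
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest
)

# The pipeline fits the scaler on the training data only, then reuses those
# statistics for validation and test data, which avoids leakage.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Use the validation set while tuning; touch the test set once at the end
val_acc = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {val_acc:.2f}")

# Per-class precision, recall, and F1 on the held-out test set
print(classification_report(y_test, model.predict(X_test), target_names=iris.target_names))
```

Wrapping the scaler and the model in a single pipeline means the same object can later be passed to cross-validation utilities without reintroducing leakage.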