If you are a developer looking to add computer vision to your stack, start with OpenCV for image processing and PyTorch or TensorFlow for deep learning. You do not need a PhD. You need a working Python environment, a GPU (optional but helpful), and a willingness to experiment with data.
Choose Your Entry Point
For most developers, the fastest path is Python with OpenCV. Install it via pip:
pip install opencv-python
OpenCV gives you 2500+ algorithms for image manipulation, feature detection, and camera calibration. It is not a deep learning framework, but it handles preprocessing and basic tasks well.
For deep learning, PyTorch is the current favorite in research and industry. Install with:
pip install torch torchvision
TensorFlow is still widely used in production, especially with TensorFlow Serving. Pick one and stick with it until you hit a wall.
Your First Pipeline: Reading and Displaying an Image
import cv2
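img = cv2.imread('cat.jpg')
if img is None:  # imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError('cat.jpg could not be read')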
cv2.imshow('Cat', img)
cv2.waitKey(0)
cv2.destroyAllWindows()
That loads an image, shows it, and waits for a key press. This is the "Hello World" of computer vision.
Common Tasks and Code Snippets
Resize and Convert to Grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
resized = cv2.resize(gray, (224, 224))
Edge Detection
edges = cv2.Canny(img, 100, 200)
Face Detection with Haar Cascades
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
faces = face_cascade.detectMultiScale(gray, 1.1, 4)  # scaleFactor=1.1, minNeighbors=4
for (x, y, w, h) in faces:
    cv2.rectangle(img, (x, y), (x+w, y+h), (255, 0, 0), 2)  # blue box (BGR), 2 px thick
Deep Learning: Image Classification with PyTorch
import torch
import torchvision.transforms as transforms
from PIL import Image
# torchvision >= 0.13 uses weights=...; older releases use pretrained=True instead
model = torch.hub.load('pytorch/vision', 'resnet18', weights='IMAGENET1K_V1')
model.eval()
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = Image.open('cat.jpg').convert('RGB')  # force 3 channels; grayscale or RGBA files would break Normalize
img_t = transform(img)
batch_t = torch.unsqueeze(img_t, 0)
with torch.no_grad():
    out = model(batch_t)
_, index = torch.max(out, 1)
print(index.item()) # class index
This loads a pretrained ResNet-18, preprocesses an image, and predicts a class. The model expects 224x224 RGB images normalized to ImageNet stats.
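To make the preprocessing concrete, here is a minimal NumPy sketch of what `ToTensor` followed by `Normalize` does to the pixel values, assuming an 8-bit RGB image that has already been cropped to 224x224. The function name `to_model_input` is hypothetical; the mean and std values are the ones used in the transform above.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_model_input(rgb_uint8):
    """Mimic ToTensor + Normalize for an HxWx3 uint8 RGB array (illustrative helper)."""
    x = rgb_uint8.astype(np.float32) / 255.0      # scale 0..255 to 0..1
    x = (x - IMAGENET_MEAN) / IMAGENET_STD        # per-channel standardization
    x = np.transpose(x, (2, 0, 1))                # HWC -> CHW, the layout PyTorch expects
    return x[np.newaxis, ...]                     # add a batch axis: 1x3xHxW

# A synthetic mid-gray image stands in for a real photo.
dummy = np.full((224, 224, 3), 128, dtype=np.uint8)
batch = to_model_input(dummy)
print(batch.shape, batch.dtype)
```

The resulting array has the same layout as `batch_t` in the snippet above, which is why a model fed raw 0-255 pixels or channels-last data tends to produce meaningless predictions.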
Data Preparation: The Real Work
Computer vision models are data hungry. Training from scratch typically needs thousands of labeled images; transfer learning, covered below, needs far fewer. Public datasets like ImageNet, COCO, and Open Images are good starting points. For custom data, tools like LabelImg (https://github.com/tzutalin/labelImg) let you draw bounding boxes manually.
Expect to spend most of your time on data cleaning and augmentation; 80% is a common rule of thumb. Use torchvision.transforms for random flips, rotations, and color jitter:
transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2),
])
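As a rough illustration of what two of these augmentations do to the pixel data, the sketch below implements a random horizontal flip and a brightness jitter in plain NumPy. It is not how torchvision implements them; the function name `augment`, the seeded generator, and the tiny test image are assumptions for the example.

```python
import numpy as np

def augment(image, rng, flip_prob=0.5, brightness=0.2):
    """Randomly flip an HxWx3 uint8 image and scale its brightness (illustrative helper)."""
    out = image.astype(np.float32)
    if rng.random() < flip_prob:
        out = out[:, ::-1, :]                      # reverse the width axis
    factor = rng.uniform(1.0 - brightness, 1.0 + brightness)
    out = np.clip(out * factor, 0, 255)            # keep values in the valid 8-bit range
    return out.astype(np.uint8)

rng = np.random.default_rng(0)
image = np.arange(2 * 4 * 3, dtype=np.uint8).reshape(2, 4, 3) * 10
augmented = [augment(image, rng) for _ in range(4)]
print([a.shape for a in augmented])
```

Each call produces a slightly different variant of the same image, which is the point: the model sees more variety than the dataset literally contains.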
Training vs. Transfer Learning
Training from scratch requires a lot of data and compute. Training a ResNet-50 on ImageNet takes roughly 2-3 days on a single V100 GPU, depending on precision settings and data loading speed. Transfer learning is cheaper: take a pretrained model, freeze early layers, and retrain the last few on your data.
# torchvision >= 0.13 uses weights=...; older releases use pretrained=True instead
model = torch.hub.load('pytorch/vision', 'resnet18', weights='IMAGENET1K_V1')
for param in model.parameters():
    param.requires_grad = False
num_ftrs = model.fc.in_features
model.fc = torch.nn.Linear(num_ftrs, 2) # binary classification
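This freezes all layers except the final fully connected layer, which is newly created and therefore trainable. Train only that layer on your small dataset. It can work well with as few as 100 images per class, particularly when your images resemble the ImageNet photos the model was pretrained on.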
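A minimal sketch of training only the new head might look like the following. To stay runnable without downloading weights, it uses a tiny stand-in network and random tensors rather than ResNet-18 and real images; `backbone`, `head`, and the data are all placeholders. The key point is that the optimizer only receives parameters with `requires_grad=True`.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for a pretrained network: a frozen "backbone" plus a fresh 2-class head.
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 16), nn.ReLU())
for param in backbone.parameters():
    param.requires_grad = False
head = nn.Linear(16, 2)
model = nn.Sequential(backbone, head)

# Random tensors in place of a real labeled dataset.
images = torch.randn(32, 3, 8, 8)
labels = torch.randint(0, 2, (32,))

# Only trainable parameters go to the optimizer.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.1)
criterion = nn.CrossEntropyLoss()

frozen_before = backbone[1].weight.clone()
losses = []
for epoch in range(20):
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

With the real model from the snippet above, the equivalent would be to pass `model.fc.parameters()` to the optimizer and iterate over a `DataLoader` of your own images.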
Deployment Options
Once you have a trained model, you need to serve it. Options:
- ONNX Runtime: Convert your model to ONNX format for cross-platform inference.
- TensorFlow Serving: If you used TensorFlow, it handles versioning and batching.
- TorchServe: PyTorch's official serving framework.
- FastAPI + ONNX: Simple REST API with Python.
Example FastAPI endpoint:
from fastapi import FastAPI, File, UploadFile
import onnxruntime as ort
import numpy as np
from PIL import Image
import io
app = FastAPI()
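session = ort.InferenceSession('model.onnx')
input_name = session.get_inputs()[0].name  # read the input name from the model instead of hard-coding it
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)
@app.post('/predict')
async def predict(file: UploadFile = File(...)):
    img = Image.open(io.BytesIO(await file.read())).convert('RGB').resize((224, 224))
    # preprocess (assumes an ImageNet-style model: 224x224, mean/std normalized, NCHW float32)
    x = (np.asarray(img, dtype=np.float32) / 255.0 - MEAN) / STD
    input_data = np.ascontiguousarray(x.transpose(2, 0, 1)[np.newaxis, ...])
    outputs = session.run(None, {input_name: input_data})
    # outputs[0] has shape (1, num_classes); return the index of the highest score
    return {'class': int(np.argmax(outputs[0][0]))}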
Hardware Considerations
You can start with CPU. For training, a GPU with at least 8GB VRAM (e.g., RTX 3070) is recommended. Cloud options: AWS p3 instances, Google Cloud TPUs, or Lambda Labs. For inference, CPUs are often sufficient for batch sizes of 1, but GPUs reduce latency.
Common Pitfalls
- Overfitting: Use dropout, data augmentation, and early stopping.
- Class imbalance: Use weighted loss functions or oversample minority classes.
- Wrong input size: Models expect specific dimensions. Check the docs.
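- Not normalizing: Apply the same normalization the model was trained with, for example scaling pixels to [0,1] and then subtracting the ImageNet mean and dividing by the standard deviation.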
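For the class imbalance point, one common way to derive loss weights is inverse class frequency. The sketch below computes such weights from a list of labels in plain Python; the label counts are made up, and the helper name `inverse_frequency_weights` is hypothetical. The resulting list is what you would pass, as a tensor, to the `weight` argument of `torch.nn.CrossEntropyLoss`.

```python
from collections import Counter

def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class: n_samples / (num_classes * class_count)."""
    counts = Counter(labels)
    total = len(labels)
    weights = []
    for cls in range(num_classes):
        count = counts.get(cls, 0)
        # A class with no samples gets weight 0.0 rather than dividing by zero.
        weights.append(total / (num_classes * count) if count else 0.0)
    return weights

# Hypothetical dataset: 90 samples of class 0, 10 of class 1.
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels, num_classes=2)
print(weights)
```

The rare class ends up with a weight nine times larger than the common one, so each of its mistakes costs proportionally more during training.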
When Not to Use Deep Learning
If your problem can be solved with simple heuristics or classical image processing, do that first. Deep learning is expensive in data and compute. For example, detecting a red circle in an image is easier with color thresholding than a neural network.
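To illustrate, a red-object detector can be sketched as a few array comparisons. The example below uses plain NumPy on a synthetic RGB image; the threshold values are arbitrary assumptions, and with real photos you would more likely threshold in HSV space (for example with `cv2.inRange`) to cope with lighting changes.

```python
import numpy as np

def red_mask(rgb, min_red=150, max_other=80):
    """Boolean mask of pixels that are strongly red (thresholds are illustrative)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return (r >= min_red) & (g <= max_other) & (b <= max_other)

# Synthetic 100x100 gray image with a filled red disc of radius 20 at the center.
image = np.full((100, 100, 3), 120, dtype=np.uint8)
yy, xx = np.mgrid[0:100, 0:100]
disc = (yy - 50) ** 2 + (xx - 50) ** 2 <= 20 ** 2
image[disc] = (220, 30, 30)

mask = red_mask(image)
ys, xs = np.nonzero(mask)
center = (float(xs.mean()), float(ys.mean()))   # centroid of the red pixels
print(center, int(mask.sum()))
```

No training data, no GPU, and the logic fits in one function, which is the kind of trade-off worth checking before reaching for a neural network.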
Additional Tips for Production
When moving to production, consider model quantization to reduce size and speed up inference. PyTorch supports dynamic quantization:
import torch
quantized_model = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)
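For the layers it covers, this stores weights as int8 instead of float32, which is about a 4x size reduction for those layers, and it can improve CPU inference speed, by 2-3x in favorable cases, with minimal accuracy loss. Dynamic quantization targets Linear (and recurrent) layers, so convolution-heavy models such as ResNet benefit less than models dominated by fully connected layers. Also, use batching for higher throughput. On a GPU that is not yet saturated, a batch of 32 images can take close to the same time as a single image.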
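The size arithmetic behind quantization is straightforward: weight storage is roughly parameter count times bytes per parameter. The sketch below works this through for ResNet-18's roughly 11.7 million parameters, ignoring file format overhead and assuming, optimistically, that every weight is stored as int8. The helper name `weights_size_mb` is hypothetical.

```python
def weights_size_mb(num_params, bytes_per_param):
    """Approximate weight storage in megabytes (1 MB = 10**6 bytes)."""
    return num_params * bytes_per_param / 1e6

RESNET18_PARAMS = 11_689_512  # parameter count of torchvision's ResNet-18

fp32_mb = weights_size_mb(RESNET18_PARAMS, 4)   # float32: 4 bytes per weight
int8_mb = weights_size_mb(RESNET18_PARAMS, 1)   # int8: 1 byte per weight
print(f"float32: {fp32_mb:.1f} MB, int8: {int8_mb:.1f} MB, ratio: {fp32_mb / int8_mb:.0f}x")
```

In practice the measured reduction depends on which layers are actually quantized, so it is worth comparing saved file sizes before and after rather than relying on the upper bound.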