CLIP (Contrastive Language-Image Pre-Training) is OpenAI's multimodal neural network, trained on 400 million (image, text) pairs. By learning a joint embedding space for images and text, it enables zero-shot image classification and cross-modal retrieval: the model can be instructed in natural language to predict the most relevant text snippet for a given image, without directly optimizing for that task.
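Because both modalities are mapped into the same embedding space, cross-modal retrieval reduces to a nearest-neighbor search over normalized embeddings. The sketch below illustrates this with the openai/CLIP Python package; the model name, captions, and image path are illustrative assumptions, not fixed by this page:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # model name is an assumption

captions = ["a photo of a dog", "a photo of a cat", "a diagram of a network"]
text_tokens = clip.tokenize(captions).to(device)
image = preprocess(Image.open("query.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(text_tokens)

# Normalize so dot products are cosine similarities in the shared space.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

similarity = image_emb @ text_emb.T  # shape: (1, num_captions)
best = similarity.argmax(dim=-1).item()
print(f"Best-matching caption: {captions[best]}")
```

The same similarity matrix works in the other direction (text query against a bank of image embeddings), which is what makes the retrieval cross-modal.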
CLIP can perform image classification without task-specific training:
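A minimal sketch, assuming the openai/CLIP package is installed (pip install git+https://github.com/openai/CLIP.git); the labels and image path below are placeholders:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["dog", "cat", "airplane"]  # placeholder label set
prompts = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)
image = preprocess(Image.open("input.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    # The model scores the image against each natural-language prompt;
    # no task-specific training or fine-tuning is involved.
    logits_per_image, _ = model(image, prompts)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

for label, p in zip(labels, probs[0]):
    print(f"{label}: {p:.3f}")
```

The class names double as the "classifier": changing the label list changes the task, with no retraining. The prompt template ("a photo of a ...") matters in practice; the CLIP paper reports accuracy gains from prompt engineering and ensembling.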
CLIP demonstrates strong zero-shot transfer across many benchmark datasets, often matching or exceeding fully supervised baselines without any domain-specific training; for example, zero-shot CLIP matches the ImageNet accuracy of the original ResNet-50 without using any of its 1.28 million labeled training examples.
The model and code are free and open source (MIT license), available for both research and commercial use.