In this guide, you will learn about building applications involving images with the OpenAI API. If you know what you want to build, find your use case below to get started. If you’re not sure where to start, continue reading to get an overview.
Recent language models can process image inputs and analyze them, a capability known as vision. With gpt-image-1, you can both analyze visual inputs and create images.
The OpenAI API offers several endpoints to process images as input or generate them as output, enabling you to build powerful multimodal applications.
| API | Supported use cases |
|---|---|
| Responses API | Analyze images and use them as input and/or generate images as output |
| Images API | Generate images as output, optionally using images as input |
| Chat Completions API | Analyze images and use them as input to generate text or audio |
To learn more about the input and output modalities supported by our models, refer to our models page.
You can generate or edit images using the Images API or the Responses API.
Our latest image generation model, gpt-image-1, is a natively multimodal large language model.
It can understand text and images and leverage its broad world knowledge to generate images with better instruction following and contextual awareness.
In contrast, we also offer specialized image generation models (DALL·E 2 and DALL·E 3) that don't have the same inherent understanding of the world as GPT Image.
```javascript
import OpenAI from "openai";

const openai = new OpenAI();

const response = await openai.responses.create({
  model: "gpt-4.1-mini",
  input: "Generate an image of gray tabby cat hugging an otter with an orange scarf",
  tools: [{ type: "image_generation" }],
});

const imageData = response.output
  .filter((output) => output.type === "image_generation_call")
  .map((output) => output.result);

if (imageData.length > 0) {
  const imageBase64 = imageData[0];
  const fs = await import("fs");
  fs.writeFileSync("cat_and_otter.png", Buffer.from(imageBase64, "base64"));
}
```
```python
from openai import OpenAI
import base64

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input="Generate an image of gray tabby cat hugging an otter with an orange scarf",
    tools=[{"type": "image_generation"}],
)

# Save the image to a file
image_data = [
    output.result
    for output in response.output
    if output.type == "image_generation_call"
]

if image_data:
    image_base64 = image_data[0]
    with open("cat_and_otter.png", "wb") as f:
        f.write(base64.b64decode(image_base64))
```

You can learn more about image generation in our Image generation guide.
The difference between DALL·E models and GPT Image is that a natively multimodal language model can use its visual understanding of the world to generate lifelike images including real-life details without a reference.
For example, if you prompt GPT Image to generate an image of a glass cabinet with the most popular semi-precious stones, the model knows enough to select gemstones like amethyst, rose quartz, and jade, and to depict them in a realistic way.
Vision is the ability for a model to “see” and understand images. If there is text in an image, the model can also understand the text. It can understand most visual elements, including objects, shapes, colors, and textures, even if there are some limitations.
You can provide images as input to generation requests in multiple ways:

- By providing a fully qualified URL to an image file
- By providing an image as a Base64-encoded data URL
- By providing a file ID (created with the Files API)
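For example, a local file can be turned into a Base64-encoded data URL suitable for the `image_url` field. This is a minimal sketch; the helper name and the default MIME type are illustrative, not part of the API:

```python
import base64

def to_data_url(path, mime="image/jpeg"):
    """Encode a local image file as a Base64 data URL for the image_url field."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime};base64,{encoded}"
```

The resulting string can be passed anywhere the examples below pass a plain image URL.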
You can provide multiple images as input in a single request by including multiple images in the content array, but keep in mind that images count as tokens and will be billed accordingly.
```javascript
import OpenAI from "openai";

const openai = new OpenAI();

const response = await openai.responses.create({
  model: "gpt-4.1-mini",
  input: [{
    role: "user",
    content: [
      { type: "input_text", text: "what's in this image?" },
      {
        type: "input_image",
        image_url: "https://api.nga.gov/iiif/a2e6da57-3cd1-4235-b20e-95dcaefed6c8/full/!800,800/0/default.jpg",
      },
    ],
  }],
});

console.log(response.output_text);
```
```python
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="gpt-4.1-mini",
    input=[{
        "role": "user",
        "content": [
            {"type": "input_text", "text": "what's in this image?"},
            {
                "type": "input_image",
                "image_url": "https://api.nga.gov/iiif/a2e6da57-3cd1-4235-b20e-95dcaefed6c8/full/!800,800/0/default.jpg",
            },
        ],
    }],
)

print(response.output_text)
```
```csharp
using OpenAI.Responses;

string key = Environment.GetEnvironmentVariable("OPENAI_API_KEY")!;
OpenAIResponseClient client = new(model: "gpt-5", apiKey: key);

Uri imageUrl = new("https://api.nga.gov/iiif/a2e6da57-3cd1-4235-b20e-95dcaefed6c8/full/!800,800/0/default.jpg");

OpenAIResponse response = (OpenAIResponse)client.CreateResponse([
    ResponseItem.CreateUserMessageItem([
        ResponseContentPart.CreateInputTextPart("What is in this image?"),
        ResponseContentPart.CreateInputImagePart(imageUrl)
    ])
]);

Console.WriteLine(response.GetOutputText());
```
```bash
curl https://api.openai.com/v1/responses \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{
    "model": "gpt-4.1-mini",
    "input": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "what is in this image?"},
          {
            "type": "input_image",
            "image_url": "https://api.nga.gov/iiif/a2e6da57-3cd1-4235-b20e-95dcaefed6c8/full/!800,800/0/default.jpg"
          }
        ]
      }
    ]
  }'
```

Input images must meet the following requirements to be used in the API.
| Requirement | Details |
|---|---|
| Supported file types | PNG (.png), JPEG (.jpeg and .jpg), WEBP (.webp), non-animated GIF (.gif) |
| Size limits | Up to 50MB total payload size per request; up to 500 individual image inputs per request |
| Other requirements | No watermarks or logos; not NSFW; clear enough for a human to understand |
The detail parameter tells the model what level of detail to use when processing and understanding the image (low, high, or auto to let the model decide). If you skip the parameter, the model will use auto.
```json
{
  "type": "input_image",
  "image_url": "https://api.nga.gov/iiif/a2e6da57-3cd1-4235-b20e-95dcaefed6c8/full/!800,800/0/default.jpg",
  "detail": "high"
}
```

You can save tokens and speed up responses by using "detail": "low". This lets the model process the image with a budget of 85 tokens. The model receives a low-resolution 512px x 512px version of the image. This is fine if your use case doesn't require the model to see with high-resolution detail (for example, if you're asking about the dominant shape or color in the image).
On the other hand, you can use "detail": "high" if you want the model to have a better understanding of the image.
Read more about calculating image processing costs in the Calculating costs section below.
While models with vision capabilities are powerful and can be used in many situations, it's important to understand their limitations.
Image inputs are metered and charged in tokens, just as text inputs are. How images are converted to text token inputs varies based on the model. You can find a vision pricing calculator in the FAQ section of the pricing page.
Image inputs are metered and charged in tokens based on their dimensions. The token cost of an image is determined as follows:

A. Calculate the number of 32px x 32px patches needed to fully cover the image (a patch may extend beyond the image boundaries; out-of-bounds pixels are treated as black):

raw_patches = ceil(width / 32) × ceil(height / 32)

B. If the number of patches exceeds 1536, scale down the image so that it can be covered by no more than 1536 patches:

r = √(32² × 1536 / (width × height))
r = r × min( floor(width × r / 32) / (width × r / 32), floor(height × r / 32) / (height × r / 32) )

C. The token cost is the number of patches, capped at a maximum of 1536 tokens:

image_tokens = ceil(resized_width / 32) × ceil(resized_height / 32)

D. Apply a multiplier based on the model to get the total tokens.
| Model | Multiplier |
|---|---|
| gpt-5-mini | 1.62 |
| gpt-5-nano | 2.46 |
| gpt-4.1-mini | 1.62 |
| gpt-4.1-nano | 2.46 |
| o4-mini | 1.72 |
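The patch-based steps above can be sketched in Python. This is a sketch of the published formula, not the exact serving implementation; a small epsilon guards the final `ceil` against floating-point error:

```python
import math

def patch_tokens(width, height, patch=32, cap=1536):
    """Approximate image tokens using the 32px-patch method described above."""
    raw = math.ceil(width / patch) * math.ceil(height / patch)
    if raw <= cap:
        return raw
    # Shrink factor so the image fits within the patch budget.
    r = math.sqrt(patch**2 * cap / (width * height))
    # Snap the scale down so a side covers a whole number of patches.
    r *= min(
        math.floor(width * r / patch) / (width * r / patch),
        math.floor(height * r / patch) / (height * r / patch),
    )
    return (math.ceil(width * r / patch - 1e-9)
            * math.ceil(height * r / patch - 1e-9))
```

Multiply the result by the model multiplier from the table above (for example, 1.62 for gpt-4.1-mini) to get the total tokens.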
Cost calculation examples

A 1024 x 1024 image is 1024 tokens:
- Width is 1024, so (1024 + 32 - 1) // 32 = 32 patches are needed
- Height is 1024, so (1024 + 32 - 1) // 32 = 32 patches are needed
- Tokens calculated: 32 * 32 = 1024, below the cap of 1536

A 1800 x 2400 image is 1452 tokens:
- Width is 1800, so (1800 + 32 - 1) // 32 = 57 patches are needed
- Height is 2400, so (2400 + 32 - 1) // 32 = 75 patches are needed
- 57 * 75 = 4275 patches are needed to cover the full image. Since that exceeds 1536, we need to scale down the image while preserving the aspect ratio.
- The shrink factor is sqrt(token_budget × patch_size^2 / (width * height)). In our example, the shrink factor is sqrt(1536 * 32^2 / (1800 * 2400)) = 0.603.
- The resized width is 1800 * 0.603 = 1086, which covers 1086 / 32 = 33.94 patches
- The resized height is 2400 * 0.603 = 1448, which covers 1448 / 32 = 45.25 patches
- We scale down further by 33 / 33.94 = 0.97 to fit the width in 33 whole patches
- The final width is 1086 * (33 / 33.94) = 1056 and the final height is 1448 * (33 / 33.94) = 1408
- The image needs 1056 / 32 = 33 patches to cover the width and 1408 / 32 = 44 patches to cover the height
- Tokens calculated: 33 * 44 = 1452, below the cap of 1536

The token cost of an image is determined by two factors: size and detail.
Any image with "detail": "low" costs a set, base number of tokens. This amount varies by model (see chart below). To calculate the cost of an image with "detail": "high", we do the following:
- Scale the image to fit within a 2048px x 2048px square, maintaining its aspect ratio
- Scale the image so that its shortest side is 768px
- Count the number of 512px x 512px tiles needed to cover the image; each tile costs a set number of tile tokens (see chart below)
- Add the base tokens to the total
| Model | Base tokens | Tile tokens |
|---|---|---|
| gpt-5, gpt-5-chat-latest | 70 | 140 |
| 4o, 4.1, 4.5 | 85 | 170 |
| 4o-mini | 2833 | 5667 |
| o1, o1-pro, o3 | 75 | 150 |
| computer-use-preview | 65 | 129 |
Cost calculation examples (for gpt-4o)

A 1024 x 1024 square image in "detail": "high" mode costs 765 tokens:
- 1024 is less than 2048, so there is no initial resize
- The shortest side is 1024, so we scale the image down to 768 x 768
- 4 tiles of 512px are needed to cover the image, so the final token cost is 170 * 4 + 85 = 765

A 2048 x 4096 image in "detail": "high" mode costs 1105 tokens:
- We scale down the image to 1024 x 2048 to fit within the 2048 square
- The shortest side is 1024, so we further scale down to 768 x 1536
- 6 tiles of 512px are needed, so the final token cost is 170 * 6 + 85 = 1105

An image in "detail": "low" mode costs 85 tokens regardless of its input size.
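The detail-based calculation can be sketched the same way, assuming the resize steps named above (fit within a 2048px square, scale the shortest side to 768px, then count 512px tiles); the default base and tile figures here are gpt-4o's:

```python
import math

def detail_high_tokens(width, height, base=85, tile=170):
    """Approximate "detail": "high" image tokens for tile-based models."""
    # Fit within a 2048px x 2048px square, preserving aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Scale so the shortest side is 768px.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # Count 512px tiles and add the base cost.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return base + tile * tiles
```

For example, `detail_high_tokens(1024, 1024)` reproduces the 765-token figure from the worked example above.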
For GPT Image 1, we calculate the cost of an image input the same way as described above, except that we scale down the image so that the shortest side is 512px instead of 768px. The price depends on the dimensions of the image and the input fidelity.
When input fidelity is set to low, the base cost is 65 image tokens, and each tile costs 129 image tokens. When using high input fidelity, we add a set number of tokens based on the image’s aspect ratio in addition to the image tokens described above.
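A sketch of the low-input-fidelity cost for GPT Image 1 follows from that description. The 65/129 token figures come from the paragraph above; the initial 2048px fitting step and the 512px tile size are assumptions carried over from the detail-based method, and the high-fidelity aspect-ratio surcharge is omitted:

```python
import math

# Low-input-fidelity token costs for GPT Image 1, per the text above.
BASE, TILE = 65, 129

def gpt_image_input_tokens(width, height):
    """Approximate GPT Image 1 input tokens at low input fidelity (sketch)."""
    # Fit within a 2048px x 2048px square (assumed, as in the detail-based method).
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Scale so the shortest side is 512px instead of 768px.
    scale = min(1.0, 512 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return BASE + TILE * tiles
```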
To see pricing for image input tokens, refer to our pricing page.
We process images at the token level, so each image we process counts towards your tokens per minute (TPM) limit.
For the most precise and up-to-date estimates for image processing, use the image pricing calculator in the FAQ section of our pricing page.