The Non-Engineer's Guide

How AI Image Models Actually Work

Why some AI models make generic faces, others copy perfectly, and what architecture has to do with everything you see in the output.

The One Thing to Understand

Every AI image model falls into one of three fundamental approaches to creating pictures. Think of it this way: some models write a description and then paint from it (autoregressive), some start with TV static and slowly clean it up (diffusion), and some draw a straight line from noise to the final image (flow matching).

The architecture a model uses determines everything about its output: why faces look a certain way, why some models are faster, why some follow instructions better, and why some copy reference photos more faithfully than others. This is not about which model is "best." It's about which one is best for what you need.

Three Ways to Make a Picture

Autoregressive

The model writes the image token by token, like an LLM writing text. It predicts the next piece based on everything generated so far. Great at understanding instructions, terrible at preserving exact visual details from references.

Like describing a face over the phone to a sketch artist. Conceptually accurate, but details shift.
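If a few lines of Python help: the loop below is a toy sketch of next-token generation. The scoring function and the tiny "vocabulary" of patch labels are made up for illustration; a real model uses a trained network and thousands of learned codebook entries. The point is only the mechanic: every new token is chosen based on the prompt plus everything already generated.

```python
def next_token(context, vocab):
    """Toy stand-in for a trained model: deterministically score each
    candidate 'visual token' against the full context so far."""
    return max(vocab, key=lambda tok: sum(map(ord, context + tok)) % 97)

def generate_image_tokens(prompt, n_tokens=8):
    # Hypothetical patch vocabulary; real models use thousands of
    # learned codebook entries, not readable labels.
    vocab = ["sky", "hair", "eye", "skin", "cloth", "edge", "shadow", "light"]
    tokens = []
    for _ in range(n_tokens):
        context = prompt + " " + " ".join(tokens)  # everything generated so far
        tokens.append(next_token(context, vocab))
    return tokens

print(generate_image_tokens("portrait of a pilot"))
```

This is also the reference-image bottleneck: a reference photo gets squeezed into the same kind of token stream, so exact pixel geometry never survives the round trip.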

Diffusion

The model starts with pure noise and progressively removes it over many small steps until a clean image appears. Reference images stay as visual data throughout the process, never getting converted to text.

Like a painter staring at a reference photo pinned to their easel, reproducing it brushstroke by brushstroke.
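In sketch form, the whole process is a loop that repeatedly removes a little noise. This toy version steers toward a fixed flat-gray "image" instead of using a trained network's prediction (which would be guided by your prompt and references), but the shape of the loop is the same: seeded static in, 30 small cleanup passes, clean values out.

```python
import random

def denoise_step(image, strength=0.2):
    """Toy denoiser: move every value a fraction of the way toward a
    'clean' target. A real model predicts this correction per pixel
    with a trained network, conditioned on your prompt."""
    target = 0.5  # stand-in for the clean image the model steers toward
    return [x + (target - x) * strength for x in image]

def generate(seed, steps=30):
    rng = random.Random(seed)                  # the seed fixes the starting static
    image = [rng.random() for _ in range(16)]  # start: pure noise
    for _ in range(steps):                     # many small cleanup passes
        image = denoise_step(image)
    return image

out = generate(seed=42)
```

Note that everything stays numeric from start to finish; at no point is the image, or any reference feeding into it, translated into text.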

Flow Matching

Instead of many tiny denoising steps, the model learns direct straight-line paths from noise to final image. Fewer steps, more efficient, but still starts from noise so it keeps seed control and visual diversity.

Like taking the highway instead of wandering through side streets. Same destination, much faster route.
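The difference from diffusion shows up clearly in a sketch. A flow model learns a velocity field, and the sampler just integrates along it; on a straight (rectified) path, even a handful of Euler steps lands on the same endpoint as many. The velocity function below is computed in closed form from a made-up target, standing in for what a trained network would predict without ever seeing the target.

```python
import random

def velocity(x, t, target):
    """Toy velocity field. On a straight path from noise to image, the
    remaining direction is (target - x) scaled by the time left. A
    trained flow model predicts this without knowing the target."""
    return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]

def sample(seed, steps, dim=8):
    target = [0.5] * dim                    # stand-in for the final image
    rng = random.Random(seed)
    x = [rng.random() for _ in range(dim)]  # still starts from seeded noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t, target)
        x = [xi + vi * dt for xi, vi in zip(x, v)]  # one Euler step along the path
    return x

few, many = sample(seed=1, steps=4), sample(seed=1, steps=30)
```

Because the path is straight, the 4-step and 30-step runs agree to within floating-point error. That is the entire speed argument, and since generation still begins from seeded noise, seed control survives.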

How Each Approach Generates an Image

Autoregressive: Prompt In (LLM reads & reasons about your text) → "Thinking" (model plans composition & layout) → Token Generation (~1,290 visual tokens, one by one) → Decode & Refine (tokens decoded into final pixels)

Diffusion: Random Noise (start from pure static; seed controls this) → Denoise Steps 1-15 (shapes emerge from the noise) → Denoise Steps 15-30 (details sharpen progressively) → Final Image (clean, high-detail output)

Flow Matching: Noise + VLM (24B language model parses prompt deeply) → Direct Path (straight-line trajectory, fewer steps) → High-Res Output (up to 4MP with seed control)

The Models, Explained


Nano Banana Pro

Gemini 3 Pro Image · Google DeepMind
Autoregressive + Diffusion Hybrid

NBP is essentially an LLM that outputs images. Its Gemini 3 Pro backbone thinks about your prompt before generating, giving it unmatched instruction-following and reasoning. It understands concepts like physics, emotions, and spatial logic. The tradeoff: reference images get converted to abstract tokens, so faces are always re-imagined from a description, never pixel-preserved.

How it handles your face reference

NBP looks at your photo, compresses the face into ~1,290 visual tokens (covering the ENTIRE image, not just the face), and generates a new image from those tokens. The face passes through a text-like bottleneck. It's like describing someone to a sketch artist over the phone. Conceptually right, geometrically shifted.

Best at: Infographics, text in images, complex instructions, multi-step editing, reasoning-heavy prompts
Watch out: Faces from references always slightly "off." Generic-looking people due to safety filters + token averaging
Speed: ~9 seconds per image. Fast for its quality tier
Seed control: None. Cannot reproduce the exact same image twice. Architectural limitation

Nano Banana 2

Gemini 3.1 Flash Image · Google DeepMind
Distilled Autoregressive + Diffusion

NB2 is NBP's faster, cheaper sibling. It was distilled from the larger Gemini 3 Pro, meaning the big model was used to "teach" a smaller, more efficient Flash-tier model. Same architectural approach, same token-based generation, same face-geometry drift from references. But faster, and at half the cost.

What "distilled" actually means

Think of it like an experienced chef training an apprentice. The apprentice learns to cook the same dishes faster and with fewer ingredients, but the fundamental technique (and its limitations) remain. NB2 inherits NBP's strengths (reasoning, text rendering, editing) and its weaknesses (face approximation from references).
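The chef-and-apprentice idea is easy to show in code. Everything below is a toy (a one-line "teacher" function and a two-parameter "student"), but the key move is real: the student's error is measured against the teacher's outputs, not against original training labels, so the student absorbs the teacher's behavior, limitations included.

```python
def teacher(x):
    # Stand-in for the big model: here, just a fixed function.
    return 3.0 * x + 1.0

def distill(xs, lr=0.05, epochs=2000):
    """Fit a smaller model (y = w*x + b) to mimic the teacher's outputs.
    The student learns the teacher's behavior, so it inherits the
    teacher's strengths AND its blind spots."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x in xs:
            err = (w * x + b) - teacher(x)  # match the teacher's answer
            w -= lr * err * x               # gradient step on squared error
            b -= lr * err
    return w, b

w, b = distill([0.0, 0.5, 1.0, 1.5, 2.0])  # w and b converge toward 3 and 1
```

If the teacher approximates faces from references, so will any student trained this way; distillation buys speed, not new capabilities.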

Best at: Same strengths as NBP at ~50% lower cost. Better fictional character consistency with ID tracking
Watch out: Same face-drift from references as NBP. Stricter safety filters than Pro
Speed: Faster than NBP. Flash-tier architecture optimized for speed
Cost: ~$0.067 per 2K image. Roughly half of NBP's pricing

GPT Image 1.5

Native to GPT-5 · OpenAI
Autoregressive (Unified Multimodal)

GPT Image 1.5 is OpenAI's successor to DALL-E, built directly into the GPT-5 architecture. The same neural network that understands your text also generates the image. This native multimodal design gives it excellent instruction following, strong editing precision, and the deepest world knowledge of any image model. It went viral for Ghibli-style images and has been used by over 130 million people.

What makes it different from NBP

Both are autoregressive, but GPT Image 1.5 was specifically optimized for "Consistent In-Painting," meaning local edits that preserve the rest of the image. OpenAI focused heavily on solving the "butterfly effect" problem where changing one small thing shifts the whole image. It also offers three quality tiers (low/medium/high) that trade speed for detail, letting you prototype fast and render final at high quality.

Best at: Precise local editing, world knowledge, text rendering, complex multi-object scenes (handles 10-20 objects)
Watch out: Same autoregressive face-drift as NBP. Warm color bias. High quality mode is slow (30-45s) and expensive via API
Speed: 15-45 seconds depending on quality tier. 4x faster than GPT Image 1
Cost: Free in ChatGPT (with limits). API: $0.01-$0.25 per image depending on quality

Seedream 4.5

Diffusion Transformer · ByteDance
Diffusion + Cross-Image Consistency

Seedream is a visual fidelity machine. Pure diffusion architecture with a specialized Cross-Image Consistency Module that computes feature maps across up to 10 reference images. It doesn't "understand" faces conceptually. It sees them as mathematical spatial data and preserves that data with geometric precision through the generation process.

How it handles your face reference

Your photo gets encoded into a compressed latent representation where facial landmarks, skin textures, and proportions are preserved as spatial vectors. The diffusion process uses this data as a direct constraint. The face never passes through a language bottleneck. It flows from pixels to latent math to new pixels. That's why it copies faces almost perfectly, including unusual details like scars, asymmetries, or unique bone structure.

Best at: Face preservation from references (9.6/10 consistency). Product photography. Commercial campaigns needing identical subjects
Watch out: Less conceptual understanding of prompts. Also preserves mistakes/artifacts faithfully
Speed: 34-60 seconds. Slowest of the bunch due to consistency checks
Access: Closed-source. API only via specific platforms

Flux 2 Pro

Rectified Flow Transformer · Black Forest Labs
Flow Matching + 24B Vision-Language Model

Flux 2 Pro is the best all-rounder. It pairs a massive 24B Mistral-3 vision-language model (for deep prompt understanding) with a rectified flow transformer (for efficient, high-fidelity generation). It still starts from noise, so it keeps seed control and produces genuinely diverse, specific-looking faces. The most realistic faces in pure text-to-image without references.

Why its text-to-image faces look the most real

The 24B VLM understands your prompt deeply (it "gets" what "weathered" or "mischievous" means for a face). The flow matching explores a wide distribution of possible faces instead of converging on an average. And a 16-channel latent space preserves subtle details like skin micro-texture, asymmetries, and under-eye shadows. Three factors working together.

Best at: Photorealistic text-to-image faces. Prompt adherence. Versatility across styles. Seed reproducibility
Watch out: Less obsessive face-locking than Seedream from references. Pricier at the Max tier
Speed: 3-10 seconds. Very fast for its quality level
Ecosystem: Open-weight Dev variant. LoRA training. Huge community

Grok Imagine

Aurora Engine · xAI
Autoregressive Mixture-of-Experts

Grok Imagine is autoregressive like NBP but with a crucial twist: Mixture-of-Experts (MoE). Instead of one huge model doing everything, it routes each part of generation to specialized sub-networks. This makes it absurdly fast. Lower resolution than competitors, but the speed and volume (30 images in under a minute for free) make it an incredible concept generation engine.

Why the characters feel so good at low resolution

At lower resolution, fewer tokens are needed per image, so each token carries more of the image's "meaning." The model spends its budget on character, expression, and composition rather than rendering pores. It's like a skilled caricaturist who captures someone's essence in 20 strokes. Less safety clamping than Google's models also means more distinctive, varied faces.

Best at: Rapid character concepts. Volume generation. Distinctive/characterful faces. Creative exploration at speed
Watch out: Lower resolution (1024x1024). Less refined detail than NBP/Flux. Content moderation concerns
Speed: 10-20 sec per image. Batch generation is lightning fast
Cost: Free with X Premium. Pro variant available via API

Midjourney

Proprietary Diffusion · Midjourney Inc.
Advanced Diffusion (Details Undisclosed)

Midjourney is the artist's model. It's diffusion-based but trained on exceptionally high-aesthetic data. Every generation rolls the dice on a completely different starting noise pattern, giving you wide creative variety. Faces are always unique and diverse. It has a built-in "artistic taste" that makes even simple prompts produce beautiful results.

Why Midjourney outputs look "artistic" by default

The training data is heavily weighted toward high-quality artistic and photographic work. The diffusion noise explores a broad aesthetic space, and the model's defaults lean toward dramatic lighting, rich color, and cinematic composition. You don't need to be a prompt engineer. The model's priors do the heavy lifting for visual beauty.

Best at: Artistic quality by default. Wide face diversity. Concept art. Beautiful outputs from simple prompts
Watch out: Weaker prompt adherence for complex instructions. No API. Discord-based workflow
Speed: 30-60 seconds typically
Vibe: The most "opinionated" model. Outputs have a recognizable Midjourney aesthetic

Classic Diffusion Models

Stable Diffusion 1.5 / XL / 3.5 · DALL-E 2/3 · Imagen
Traditional Latent Diffusion / U-Net

The originals. These pure diffusion models start from random noise and denoise it over 20-50 steps. Maximum creative chaos and variety. Every seed gives you a completely different roll of the dice. The open-source ecosystem (LoRAs, ControlNet, fine-tuning) is massive. Requires more manual work for consistency but gives you total control.

The "creative chaos" advantage

Because everything starts from random noise, and that noise cascades into every pixel of the final image, you get genuinely surprising, diverse results every time. Face structure, lighting, composition are all up for grabs with each new seed. For exploration and variety, nothing beats pure diffusion. For consistency, you need LoRAs, ControlNet, or a lot of patience.
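The cascade is easy to see in a sketch. Python's random generator stands in here for the sampler's noise source; the only claim is the mechanical one made above: the seed fully determines the starting static, so repeating a seed repeats the image, and changing it re-rolls everything downstream.

```python
import random

def starting_noise(seed, n=6):
    """The seed initializes the generator that produces the starting
    static; every downstream pixel flows from these numbers."""
    rng = random.Random(seed)
    return [round(rng.random(), 3) for _ in range(n)]

# Same seed, same static, therefore the same final image:
assert starting_noise(1234) == starting_noise(1234)
# A different seed shifts every value, so composition, lighting,
# and face structure are all re-rolled:
assert starting_noise(1234) != starting_noise(1235)
```

This is why "full seed reproducibility" and "creative chaos" are two sides of the same mechanism: pin the seed for identical output, free it for variety.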

Best at: Creative variety. Open-source control. LoRA fine-tuning for custom faces. Huge community ecosystem
Watch out: Historically terrible text rendering. Variable quality. Needs technical skill for best results
Seed control: Full seed reproducibility. Same seed = identical output every time
Run locally: Many variants run on consumer GPUs (8GB+ VRAM). Complete privacy and control

The Two Modes of Image Editing

Every major AI image model quietly operates in two very different modes when editing. Understanding which mode you're triggering is the difference between a clean result and a frustrating one.

Mode 1: Pixel Locking

Inpainting / Masked Editing

Triggered when: You ask the model to "maintain the image" or "keep everything the same" while changing one specific thing.

What actually happens: The model freezes all pixels outside the edit area and only regenerates the masked region. The frozen pixels are treated as a flat 2D backdrop. The model paints the new element ONTO this static canvas.

Why it looks "Photoshop-y": The new content doesn't share the same 3D understanding of the scene. A new t-shirt gets pasted flat rather than wrapping around the body. Lighting on the edit doesn't match the frozen surroundings. Blending at the edges is often visible.
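The hard split is simple to express. The sketch below treats an image as a flat list of toy pixel values with a 0/1 mask; real systems feather the mask edges and work in a latent space, which this glosses over. Everything outside the mask is copied through untouched, which is exactly why the new content can't re-light or wrap around its frozen surroundings.

```python
def masked_edit(original, edited, mask):
    """Pixel locking in one line per pixel: inside the mask (1) take the
    newly generated content; outside it (0) keep the frozen original."""
    return [e if m else o for o, e, m in zip(original, edited, mask)]

original = [10, 20, 30, 40, 50]   # toy 5-pixel "image"
edited   = [99, 98, 97, 96, 95]   # what the model regenerated
mask     = [0, 0, 1, 1, 0]        # only pixels 2-3 are open for editing

print(masked_edit(original, edited, mask))  # [10, 20, 97, 96, 50]
```

Soft masks blur the seam, but they can't add 3D awareness of the frozen pixels; only full regeneration does that.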

Best for

Color changes (eye color, hair color, wall paint), removing small objects, swapping text on signs, subtle tweaks where the surrounding context doesn't need to change.

Mode 2: World Building

Full Regeneration

Triggered when: You provide reference images and a fresh prompt without forcing pixel preservation. "Create a scene with this person in a forest."

What actually happens: The model builds an entirely new scene from scratch. New lighting, new angle, new composition. References are used as identity/style guides, not pixel sources. Every element is generated together in a unified pass.

Why it looks so much better: Everything is coherent because it was imagined as a whole. Lighting wraps around objects correctly. Fabrics drape realistically. Shadows are consistent. There's no blending seam because nothing was "pasted."

Best for

Changing outfits, putting a person in a new scene, changing pose or angle, anything where the physical interaction between the subject and their environment matters.

Prompting Tip: Don't Create a Conflict

Saying "maintain this image but change the t-shirt" creates a direct conflict: you're asking the model to both freeze and change at the same time. Instead, be explicit about what's free and what's locked.

INSTEAD OF

"Keep this image the same but change the t-shirt to a blue polo."

TRY THIS

"Keep everything in this image except the t-shirt. Replace the t-shirt with a blue polo that wraps naturally around the person's torso with matching lighting."

For outfit/clothing changes where realism matters, consider full regeneration instead: provide a face reference and describe the entire scene fresh. The result will look dramatically better than any masked edit.

There is no smooth middle ground. The switch between these modes is binary, not gradual. Either pixels are frozen (and you get the flat, Photoshop-y look) or the whole scene is regenerated (and you get beautiful but potentially different composition). This is an architectural limitation across all current models. GPT Image 1.5 has made the most progress on bridging this gap with its "Consistent In-Painting" feature, but even there, complex edits like clothing changes can still look pasted.

Quick Comparison

Capability | Nano Banana Pro | NB2 | Seedream 4.5 | Flux 2 Pro | Grok Imagine | Midjourney | Classic Diffusion
Face from reference | Approximate | Approximate | Near-perfect | Good | Decent | Good | Needs LoRA
Text-to-image faces | Generic/safe | Generic/safe | Good | Most realistic | Characterful | Artistic | Variable
Text in images | Excellent | Excellent | Excellent | Very good | Good | Weak | Poor
Complex instructions | Best | Very good | Moderate | Very good | Good | Moderate | Weak
Speed | ~9s | ~4-6s | 34-60s | 3-10s | 10-20s | 30-60s | 2-15s (local)
Seed reproducibility | No | No | Yes | Yes | Limited | Yes | Yes
Creative diversity | Low | Low | Moderate | High | High | Very high | Very high
Open source / local | No | No | No | Dev variant | No | No | Full
Conversational editing | Best | Very good | Limited | Good | Good | Basic | Manual tools

The Core Insight

Autoregressive models understand faces. They can reason about expressions, emotions, and anatomy. But they reconstruct from a compressed description, so details drift.

Diffusion models see faces. They preserve visual data with mathematical precision. But they don't deeply "understand" what they're looking at conceptually.

Flow matching models balance both. Deep prompt understanding from a vision-language model, visual fidelity from a noise-to-image pipeline, and seed control for reproducibility.

No single model does everything. The best results come from understanding what each architecture is good at and using the right tool for each job.

Pro Tips

Generic advice that applies across all models and platforms, from resolution strategy to where you actually use these tools.

Use models outside their native chat apps

Nano Banana Pro inside Gemini and GPT Image inside ChatGPT are not running at full potential. In Gemini, free users get 1K resolution (max ~1MP), Pro subscribers get 2K, and only Ultra ($250/mo) gets 4K. The chat interface also applies the strictest content filters and forces you to start a new chat for each clean generation. Third-party platforms (Freepik, Hixfield, API providers) typically offer full resolution, fewer restrictions, and better workflow for the same underlying model.

Always generate at the highest resolution

This matters especially for text rendering. Text that looks tiny in a 4K image is actually large when viewed at 100% pixel scale. The model has more pixels to work with, so letters come out sharper and more accurate. In NBP, 4K text rendering is dramatically better than 2K. The cost difference (1K/2K are the same price, 4K is ~80% more) is worth it for anything with text or fine detail. Iterate at 2K, then render your final at 4K.

Combine models deliberately

Don't stay loyal to one model. Generate fast concepts in Grok (free, fast, characterful), upscale in NBP. Generate face-perfect references in Seedream, build scenes around them in NBP. Create stripped gray-silhouette reference images for scale/pose, then feed them to Seedream with face references separately. Each model has a specific architectural strength. Use it.

Use Seedream's own weapon against itself

Seedream can't reason about abstract concepts from text, but it reproduces references with scary precision. So for things like specific hand poses, scale relationships, or body positions: find or create a reference image showing exactly what you want, strip it down to essentials (gray figures, no distracting details), and label it as "reference for position/scale only, not for likeness." You do the reasoning. Seedream does the rendering.

Word choice triggers different visual modes

Saying "giant person" makes them look heavy and chubby. Saying "person A is half-sized compared to person B" gives you proportional scaling. "Large arm" is weak. "Large sized arm" works better. Models are sensitive to exact phrasing because different word combinations activate different clusters in the training data. When something doesn't work, try synonyms or rephrase as a comparison rather than an absolute description.

Start a fresh chat for each new concept

In Gemini and ChatGPT, previous messages in the conversation influence the next generation. This is by design for iterative editing, but it means a "clean" first generation in an existing chat is impossible. The model carries context, biases, and constraints from earlier messages. If you want a truly fresh output, start a new conversation. This is especially important for autoregressive models where the full conversation history feeds into the generation.

Best Model for the Job

Faithful face reproduction

You have a reference photo and need the generated person to look exactly like them across multiple scenes.

Seedream 4.5

Infographics & text-heavy

You need accurate text rendering, data visualization, diagrams, or educational content with readable labels.

NBP / NB2

Realistic portraits from text

No reference photo. Just a text description. You want the most photorealistic, specific-looking person possible.

Flux 2 Pro

Rapid character concepts

You need 20-30 character variations fast to find the right vibe before investing in high-res renders.

Grok Imagine

Concept art & beauty

Artistic exploration where aesthetic quality matters more than technical accuracy. You want to be surprised.

Midjourney

Full creative control

You want to fine-tune on custom data, run locally, use LoRAs, have total privacy and zero API costs.

Classic SD / Flux Dev

Multi-step logical editing

"Change the lighting to sunset, add a reflection in the window, put a sign that says OPEN." Complex chains of edits.

NBP / NB2

Smart combo workflow

Generate fast concepts in Grok, pick the best, upscale in NBP. Use each model for what its architecture does best.

Grok + NBP

10 Universal Challenges

Problems that affect every AI image model to some degree, regardless of architecture, and practical tips to work around them.

Occlusion

Models want to show everything. A "lollipop in mouth" shows the full candy visible because training data rarely depicts hidden objects. The model fights to make every described object fully visible.

Tip

Describe what IS visible, not the concept. "Lips closed around a thin white stick" instead of "lollipop in mouth."

Left/Right Ambiguity

Whose left? The viewer's or the subject's? Training data mixes both conventions, so models inconsistently flip directions. This worsens with mirrored/facing-camera subjects.

Tip

Avoid "left/right." Use "the hand holding the cup," "viewer's left side of the frame," or reference other objects in scene.

Counting Objects

"Three cats" gives you two or four. Diffusion models can't count. Autoregressive models (NBP, GPT Image) are much better since they generate sequentially and can track quantity.

Tip

Assign spatial positions: "one cat on the left cushion, one in the middle, one on the right." Use autoregressive models for precise counts.

Hands & Fingers

Much improved in 2025-2026 models but still fails in complex poses. Hands are geometrically complex and frequently occluded in training data. For diffusion models, provide a hand pose reference image.

Tip

Describe the action, not the hand: "gripping a mug handle with thumb on top." For Seedream, use a pose reference image.

Attribute Mixing

"Man in blue, woman in red" often swaps colors between subjects. Diffusion models process prompts holistically and can't reliably bind specific attributes to specific people.

Tip

Describe each subject in a complete, self-contained block. Don't interleave attributes. Use autoregressive models for multi-subject scenes.

Negation Failure

"A room with no windows" often produces windows. The word "no" is weak across all architectures. Diffusion models literally can't suppress activated concepts. Autoregressive models are slightly better.

Tip

Try compound words: "windowless room" or "legless figure" instead of "no windows" or "no legs." The model reads these as distinct concepts, not negations. Also describe what IS there positively.

Scale & Proportion

Models have learned "default sizes" for objects and snap back to those priors. Unusual scale relationships (tiny house, giant mountain) get normalized to training data averages.

Tip

Use relative comparisons: "person A is half-sized compared to person B." Provide stripped gray-silhouette reference images showing the desired height ratio. Describe camera shots, not object sizes.

Physics Violations

Wrong shadow directions, impossible reflections, floating objects, liquid defying gravity. Diffusion models just match visual patterns and frequently get physics wrong in subtle but uncanny ways.

Tip

Use autoregressive models (NBP, GPT Image) for physics-critical scenes. Explicitly describe light source direction and shadow behavior in your prompt.

Text in Images

Garbled, misspelled, or unreadable text has been the classic AI image failure. Autoregressive models have largely solved this. Classic diffusion still can't do it. Use the right model for the job.

Tip

Use NBP or GPT Image for text-critical work. Always generate at maximum resolution (4K in NBP) since text that's tiny at 4K is actually large at 100% pixel scale. Put text in quotes. Keep it short.

Edit vs. Regenerate Gap

The dramatic quality difference between masked editing (flat, Photoshop-y) and full regeneration (beautiful, coherent). There is no smooth middle ground across any current model.

Tip

Use pixel locking only for simple changes (color swaps, text edits). For anything involving 3D interaction (clothing, poses), go full regeneration with reference images.