Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser* Sumith Kulal Andreas Blattmann Rahim Entezari Jonas Müller Harry Saini Yam Levi Dominik Lorenz Axel Sauer Frederic Boesel Dustin Podell Tim Dockhorn Zion English Kyle Lacey Alex Goodwin Yannik Marek Robin Rombach*

Stability AI

*Equal contribution. <firstlast>@stability.ai

Figure 1. High-resolution samples from our 8B rectified flow model, showcasing its capabilities in typography, precise prompt following and spatial reasoning, attention to fine details, and high image quality across a wide variety of styles.

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.
1. Introduction

Diffusion models create data from noise (Song et al., 2020). They are trained to invert forward paths of data towards random noise and, thus, in conjunction with approximation and generalization properties of neural networks, can be used to generate new data points that are not present in the training data but follow the distribution of the training data (Sohl-Dickstein et al., 2015; Song & Ermon, 2020). This generative modeling technique has proven to be very effective for modeling high-dimensional, perceptual data such as images (Ho et al., 2020). In recent years, diffusion models have become the de-facto approach for generating high-resolution images and videos from natural language inputs with impressive generalization capabilities (Saharia et al., 2022b; Ramesh et al., 2022; Rombach et al., 2022; Podell et al., 2023; Dai et al., 2023; Esser et al., 2023; Blattmann et al., 2023b; Betker et al., 2023; Blattmann et al., 2023a; Singer et al., 2022). Due to their iterative nature and the associated computational costs, as well as the long sampling times during inference, research on formulations for more efficient training and/or faster sampling of these models has increased (Karras et al., 2023; Liu et al., 2022).

While specifying a forward path from data to noise leads to efficient training, it also raises the question of which path to choose. This choice can have important implications for sampling. For example, a forward process that fails to remove all noise from the data can lead to a discrepancy between the training and test distributions and result in artifacts such as gray image samples (Lin et al., 2024). Importantly, the choice of the forward process also influences the learned backward process and, thus, the sampling efficiency. While curved paths require many integration steps to simulate the process, a straight path could be simulated with a single step and is less prone to error accumulation. Since each step corresponds to an evaluation of the neural network, this has a direct impact on the sampling speed.

A particular choice for the forward path is a so-called Rectified Flow (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023), which connects data and noise on a straight line. Although this model class has better theoretical properties, it has not yet become decisively established in practice.
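The claim that straight probability paths permit few-step sampling can be illustrated with a small numerical experiment. The following sketch is our own illustration with made-up toy values, not code from the paper: it integrates the sampling ODE dz_t = v(z_t, t) dt backwards from noise with a plain Euler scheme. The straight-line path is recovered exactly in a single step, whereas a curved cosine-like path accumulates discretization error unless many steps are used.

```python
import numpy as np

def euler_sample(v, z1, n_steps):
    """Integrate dz/dt = v(z, t) from t = 1 (noise) down to t = 0 (data)
    with n_steps explicit Euler steps."""
    z, dt = z1, -1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 + i * dt
        z = z + dt * v(z, t)
    return z

x0, eps = 3.0, 1.0  # toy 1-D "data" and "noise" points

# Straight path z_t = (1 - t) x0 + t eps has constant velocity eps - x0,
# so a single Euler step is exact.
v_straight = lambda z, t: eps - x0

# Curved path z_t = cos(pi t / 2) x0 + sin(pi t / 2) eps has a
# time-dependent velocity, so Euler integration is only approximate.
v_curved = lambda z, t: (np.pi / 2) * (-np.sin(np.pi * t / 2) * x0
                                       + np.cos(np.pi * t / 2) * eps)
```

A single Euler step recovers x0 exactly on the straight path, while the curved path needs on the order of a thousand steps to reach comparable accuracy; since every step is one network evaluation in a real model, this is precisely the sampling-speed advantage of straight paths discussed above.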
So far, some advantages have been empirically demonstrated in small and medium-sized experiments (Ma et al., 2024), but these are mostly limited to class-conditional models. In this work, we change this by introducing a re-weighting of the noise scales in rectified flow models, similar to noise-predictive diffusion models (Ho et al., 2020). Through a large-scale study, we compare our new formulation to existing diffusion formulations and demonstrate its benefits.

We show that the widely used approach for text-to-image synthesis, where a fixed text representation is fed directly into the model (e.g., via cross-attention (Vaswani et al., 2017; Rombach et al., 2022)), is not ideal, and present a new architecture that incorporates learnable streams for both image and text tokens, which enables a two-way flow of information between them. We combine this with our improved rectified flow formulation and investigate its scalability. We demonstrate a predictable scaling trend in the validation loss and show that a lower validation loss correlates strongly with improved automatic and human evaluations. Our largest models outperform state-of-the-art open models such as SDXL (Podell et al., 2023), SDXL-Turbo (Sauer et al., 2023), Pixart-α (Chen et al., 2023), and closed-source models such as DALL-E 3 (Betker et al., 2023) both in quantitative evaluation (Ghosh et al., 2023) of prompt understanding and in human preference ratings.

The core contributions of our work are: (i) We conduct a large-scale, systematic study on different diffusion model and rectified flow formulations to identify the best setting. For this purpose, we introduce new noise samplers for rectified flow models that improve performance over previously known samplers. (ii) We devise a novel, scalable architecture for text-to-image synthesis that allows bi-directional mixing between text and image token streams within the network. We show its benefits compared to established backbones such as UViT (Hoogeboom et al., 2023) and DiT (Peebles & Xie, 2023). Finally, we (iii) perform a scaling study of our model and demonstrate that it follows predictable scaling trends. We show that a lower validation loss correlates strongly with improved text-to-image performance assessed via metrics such as T2I-CompBench (Huang et al., 2023), GenEval (Ghosh et al., 2023) and human ratings. We make results, code, and model weights publicly available.

2. Simulation-Free Training of Flows

We consider generative models that define a mapping between samples x_1 from a noise distribution p_1 to samples x_0 from a data distribution p_0 in terms of an ordinary differential equation (ODE),

dz_t = v_Θ(z_t, t) dt,    (1)

where the velocity v is parameterized by the weights Θ of a neural network. Prior work by Chen et al. (2018) suggested to directly solve Equation (1) via differentiable ODE solvers. However, this process is computationally expensive, especially for large network architectures that parameterize v_Θ(z_t, t). A more efficient alternative is to directly regress a vector field u_t that generates a probability path between p_0 and p_1. To construct such a u_t, we define a forward process, corresponding to a probability path p_t between p_0 and p_1 = N(0, I), as

z_t = a_t x_0 + b_t ε, where ε ∼ N(0, I).    (2)

For a_0 = 1, b_0 = 0, a_1 = 0 and b_1 = 1, the marginals

p_t(z_t) = E_{ε∼N(0,I)} p_t(z_t | ε)    (3)

are consistent with the data and noise distribution.

To express the relationship between z_t, x_0 and ε, we introduce ψ_t and u_t as

ψ_t(·|ε) : x_0 ↦ a_t x_0 + b_t ε,    (4)
u_t(z|ε) := ψ′_t(ψ_t^{-1}(z|ε)|ε).    (5)

Since z_t can be written as the solution to the ODE z′_t = u_t(z_t|ε), with initial value z_0 = x_0, u_t(·|ε) generates p_t(·|ε). Remarkably, one can construct a marginal vector field u_t which generates the marginal probability paths p_t (Lipman et al., 2023) (see B.1), using the conditional vector fields u_t(·|ε):

u_t(z) = E_{ε∼N(0,I)} [ u_t(z|ε) p_t(z|ε) / p_t(z) ].    (6)

While regressing u_t with the Flow Matching objective

L_FM = E_{t, p_t(z)} ||v_Θ(z, t) − u_t(z)||_2^2    (7)

directly is intractable due to the marginalization in Equation (6), Conditional Flow Matching (see B.1),

L_CFM = E_{t, p_t(z|ε), p(ε)} ||v_Θ(z, t) − u_t(z|ε)||_2^2,    (8)

with the conditional vector fields u_t(z|ε), provides an equivalent yet tractable objective.

To convert the loss into an explicit form, we insert ψ_t(x_0|ε) = a_t x_0 + b_t ε and ψ_t^{-1}(z|ε) = (z − b_t ε)/a_t into (5):

z′_t = u_t(z_t|ε) = (a′_t/a_t) z_t − ε b_t (a′_t/a_t − b′_t/b_t).    (9)

Now, consider the signal-to-noise ratio λ_t := log(a_t^2 / b_t^2). With λ′_t = 2 (a′_t/a_t − b′_t/b_t), we can rewrite Equation (9) as

u_t(z_t|ε) = (a′_t/a_t) z_t − (b_t/2) λ′_t ε.    (10)

Next, we use Equation (10) to reparameterize Equation (8) as a noise-prediction objective:

L_CFM = E_{t, p_t(z|ε), p(ε)} || v_Θ(z, t) − (a′_t/a_t) z + (b_t/2) λ′_t ε ||_2^2    (11)
      = E_{t, p_t(z|ε), p(ε)} ( −(b_t/2) λ′_t )^2 || ε_Θ(z, t) − ε ||_2^2,

where we defined ε_Θ := −(2 / (λ′_t b_t)) ( v_Θ − (a′_t/a_t) z ).

Note that the optimum of the above objective does not change when introducing a time-dependent weighting. Thus, one can derive various weighted loss functions that provide a signal towards the desired solution but might affect the optimization trajectory. For a unified analysis of different approaches, including classic diffusion formulations, we can write the objective in the following form (following Kingma & Gao (2023)):

L_w(x_0) = −(1/2) E_{t∼U(0,1), ε∼N(0,I)} [ w_t λ′_t || ε_Θ(z_t, t) − ε ||^2 ],    (12)

where w_t = −(1/2) λ′_t b_t^2 corresponds to L_CFM.

3. Flow Trajectories

In this work, we consider different variants of the above formalism that we briefly describe in the following.

Rectified Flow. Rectified Flows (RFs) (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023) define the forward process as straight paths between the data distribution and a standard normal distribution, i.e.

z_t = (1 − t) x_0 + t ε,    (13)

and use L_CFM, which then corresponds to w_t^RF = t/(1 − t). The network output directly parameterizes the velocity v_Θ.

EDM. EDM (Karras et al., 2022) uses a forward process of the form

z_t = x_0 + b_t ε,    (14)

where (Kingma & Gao, 2023) b_t = exp( F_N^{-1}(t | P_m, P_s^2) ), with F_N^{-1} being the quantile function of the normal distribution with mean P_m and variance P_s^2. Note that this choice results in

λ_t ∼ N(−2P_m, (2P_s)^2) for t ∼ U(0, 1).    (15)

The network is parameterized through an F-prediction (Kingma & Gao, 2023; Karras et al., 2022) and the loss can be written as L_{w_t^EDM} with

w_t^EDM = N(λ_t | −2P_m, (2P_s)^2) (e^{−λ_t} + 0.5^2).    (16)

Cosine. (Nichol & Dhariwal, 2021) proposed a forward process of the form

z_t = cos(πt/2) x_0 + sin(πt/2) ε.    (17)

In combination with an ε-parameterization and loss, this corresponds to a weighting w_t = sech(λ_t/2). When combined with a v-prediction loss (Kingma & Gao, 2023), the weighting is given by w_t = e^{−λ_t/2}.

(LDM-)Linear. LDM (Rombach et al., 2022) uses a modification of the DDPM schedule (Ho et al., 2020). Both are variance-preserving schedules, i.e. b_t = sqrt(1 − a_t^2), and define a_t for discrete timesteps t = 0, …, T−1 in terms of diffusion coefficients β_t as a_t = ( ∏_{s=0}^{t} (1 − β_s) )^{1/2}. For given boundary values β_0 and β_{T−1}, DDPM uses β_t = β_0 + (t/(T−1)) (β_{T−1} − β_0) and LDM uses β_t = ( sqrt(β_0) + (t/(T−1)) ( sqrt(β_{T−1}) − sqrt(β_0) ) )^2.
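The rectified flow case above is simple enough to spell out directly: with a_t = 1 − t and b_t = t, the conditional velocity target of Equation (9) reduces to ε − x_0, independent of t. The following NumPy sketch is our own illustration of the interpolant of Equation (13) and the resulting conditional flow-matching loss; the function names and toy shapes are ours, not from the paper's codebase.

```python
import numpy as np

def rf_interpolant(x0, eps, t):
    """Rectified-flow forward process z_t = (1 - t) * x0 + t * eps (Eq. 13).
    t has shape (batch,); x0 and eps have shape (batch, ...)."""
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over data dims
    return (1.0 - t) * x0 + t * eps

def rf_cfm_loss(v_pred, x0, eps):
    """Conditional flow matching loss (Eq. 8) for rectified flow: with
    a_t = 1 - t and b_t = t, the conditional velocity target is
    u_t(z_t | eps) = d/dt z_t = eps - x0, independent of t."""
    target = eps - x0
    sq_err = (v_pred - target) ** 2
    # sum squared error over data dimensions, average over the batch
    return sq_err.reshape(sq_err.shape[0], -1).sum(axis=1).mean()
```

A perfect prediction v_Θ = ε − x_0 drives the loss to zero, and integrating dz_t = v dt from t = 1 (pure noise) back to t = 0 then recovers x_0 in a single Euler step, which is the straight-path property motivating this formulation.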
3.1. Tailored SNR Samplers for RF Models

The RF loss trains the velocity v_Θ uniformly on all timesteps in [0, 1]. Intuitively, however, the resulting velocity prediction target ε − x_0 is more difficult for t in the middle of [0, 1], since for t = 0, the optimal prediction is the mean of p_1, and for t = 1 the optimal prediction is the mean of p_0. In general, changing the distribution over t from the commonly used uniform distribution U(t) to a distribution with density π(t) is equivalent to a weighted loss L_{w_t^π} with

w_t^π = (t / (1 − t)) π(t).    (18)

Thus, we aim to give more weight to intermediate timesteps by sampling them more frequently. Next, we describe the timestep densities π(t) that we use to train our models.

Logit-Normal Sampling. One option for a distribution that puts more weight on intermediate steps is the logit-normal distribution (Atchison & Shen, 1980). Its density,

π_ln(t; m, s) = (1 / (s sqrt(2π))) (1 / (t(1 − t))) exp( −(logit(t) − m)^2 / (2s^2) ),    (19)

where logit(t) = log(t / (1 − t)), has a location parameter m and a scale parameter s. The location parameter enables us to bias the training timesteps towards either data p_0 (negative m) or noise p_1 (positive m). As shown in Figure 11, the scale parameter s controls how wide the distribution is. In practice, we sample the random variable u from a normal distribution u ∼ N(m, s) and map it through the standard logistic function.

Mode Sampling with Heavy Tails. The logit-normal density always vanishes at the endpoints 0 and 1. To study whether this has adverse effects on the performance, we also use a timestep sampling distribution with strictly positive density on [0, 1]. For a scale parameter s, we define

f_mode(u; s) = 1 − u − s ( cos^2(πu/2) − 1 + u ).    (20)

For −1 ≤ s ≤ 2/(π − 2), this function is monotonic, and we can use it to sample from the implied density π_mode(t; s) = | d/dt f_mode^{-1}(t) |. As seen in Figure 11, the scale parameter controls the degree to which either the midpoint (positive s) or the endpoints (negative s) are favored during sampling. This formulation also includes a uniform weighting π_mode(t; s = 0) = U(t) for s = 0, which has been used widely in previous works on Rectified Flows (Liu et al., 2022; Ma et al., 2024).

CosMap. Finally, we also consider the cosine schedule (Nichol & Dhariwal, 2021) from Section 3 in the RF setting. In particular, we are looking for a mapping f : u ↦ f(u) = t, u ∈ [0, 1], such that the log-snr matches that of the cosine schedule: 2 log( cos(πu/2) / sin(πu/2) ) = 2 log( (1 − f(u)) / f(u) ). Solving for f(u), we obtain for u ∼ U(0, 1)

t = f(u) = 1 − 1 / ( tan(πu/2) + 1 ),    (21)

from which we obtain the density

π_CosMap(t) = | d/dt f^{-1}(t) | = 2 / ( π − 2πt + 2πt^2 ).    (22)

4. Text-to-Image Architecture

For text-conditional sampling of images, our model has to take both modalities, text and images, into account. We use pretrained models to derive suitable representations and then describe the architecture of our diffusion backbone. An overview of this is presented in Figure 2.

Our general setup follows LDM (Rombach et al., 2022) for training text-to-image models in the latent space of a pretrained autoencoder. Similar to the encoding of images to latent representations, we also follow previous approaches (Saharia et al., 2022b; Balaji et al., 2022) and encode the text conditioning c using pretrained, frozen text models. Details can be found in Appendix B.2.

Multimodal Diffusion Backbone. Our architecture builds upon the DiT (Peebles & Xie, 2023) architecture. DiT only considers class-conditional image generation and uses a modulation mechanism to condition the network on both the timestep of the diffusion process and the class label. Similarly, we use embeddings of the timestep t and of c_vec as inputs to the modulation mechanism. However, as the pooled text representation retains only coarse-grained information about the text input (Podell et al., 2023), the network also requires information from the sequence representation c_ctxt.

We construct a sequence consisting of embeddings of the text and image inputs. Specifically, we add positional encodings and flatten 2×2 patches of the latent pixel representation x ∈ R^{h×w×c} to a patch encoding sequence of length (h/2)·(w/2). After embedding this patch encoding and the text encoding c_ctxt to a common dimensionality, we concatenate the two sequences. We then follow DiT and apply a sequence of modulated attention and MLPs. Since text and image embeddings are conceptually quite different, we use two separate sets of weights for the two modalities. As shown in Figure 2b, this is equivalent to having two independent transformers for each modality, but joining the sequences of the two modalities for the attention operation, such that both representations can work in their own space yet take the other one into account.

Figure 2. Our model architecture: (a) overview of all components; (b) one MM-DiT block. The RMS-Norm for Q and K can be added to stabilize training runs. Best viewed zoomed in.

For our scaling experiments, we parameterize the size of the model in terms of the model's depth d, i.e. the number of attention blocks, by setting the hidden size to 64·d (expanded to 4·64·d channels in the MLP blocks), and the number of attention heads equal to d.

5. Experiments

5.1. Improving Rectified Flows

We aim to understand which of the approaches for simulation-free training of normalizing flows, as in Equation (1), is the most efficient. To enable comparisons across different approaches, we control for the optimization algorithm, the model architecture, the dataset and samplers. In addition, the losses of different approaches are incomparable and also do not necessarily correlate with the quality of output samples; hence we need evaluation metrics that allow for a comparison between approaches. We train models on ImageNet (Russakovsky et al., 2014) and CC12M (Changpinyo et al., 2021), and evaluate both the training and the EMA weights of the models during training using validation losses, CLIP scores (Radford et al., 2021; Hessel et al., 2021), and FID (Heusel et al., 2017) under different sampler settings (different guidance scales and sampling steps). We calc
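The three timestep densities of Section 3.1 are straightforward to sample from by drawing u and pushing it through the respective map. The sketch below is our own illustrative NumPy code, not the paper's training framework: it implements logit-normal sampling via the standard logistic function, the heavy-tail mode map of Equation (20), and the CosMap transform of Equation (21).

```python
import numpy as np

def sample_logit_normal(rng, n, m=0.0, s=1.0):
    """Logit-normal timesteps (Eq. 19): draw u ~ N(m, s^2) and map it
    through the standard logistic function, t = 1 / (1 + exp(-u))."""
    u = rng.normal(loc=m, scale=s, size=n)
    return 1.0 / (1.0 + np.exp(-u))

def f_mode(u, s):
    """Heavy-tail mode map (Eq. 20); s = 0 recovers the uniform t = 1 - u,
    positive s favors the midpoint, negative s the endpoints."""
    return 1.0 - u - s * (np.cos(np.pi * u / 2.0) ** 2 - 1.0 + u)

def sample_cosmap(rng, n):
    """CosMap timesteps (Eq. 21): t = 1 - 1 / (tan(pi u / 2) + 1), u ~ U(0, 1)."""
    u = rng.uniform(size=n)
    return 1.0 - 1.0 / (np.tan(np.pi * u / 2.0) + 1.0)
```

All three maps send u in (0, 1) to t in (0, 1); for example, f_mode maps u = 0 to t = 1 and u = 1 to t = 0 for any s, and reduces to t = 1 − u at s = 0, the uniform sampling used in prior rectified flow work.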