Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser*, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach*
Stability AI
* Equal contribution. <firstlast>@stability.ai

Figure 1. High-resolution samples from our 8B rectified flow model, showcasing its capabilities in typography, precise prompt following and spatial reasoning, attention to fine details, and high image quality across a wide variety of styles.

Abstract

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.

1. Introduction

Diffusion models create data from noise (Song et al., 2020). They are trained to invert forward paths of data towards random noise and, thus, in conjunction with approximation and generalization properties of neural networks, can be used to generate new data points that are not present in the training data but follow the distribution of the training data (Sohl-Dickstein et al., 2015; Song & Ermon, 2020). This generative modeling technique has proven to be very effective for modeling high-dimensional, perceptual data such as images (Ho et al., 2020). In recent years, diffusion models have become the de-facto approach for generating high-resolution images and videos from natural language inputs with impressive generalization capabilities (Saharia et al., 2022b; Ramesh et al., 2022; Rombach et al., 2022; Podell et al., 2023; Dai et al., 2023; Esser et al., 2023; Blattmann et al., 2023b; Betker et al., 2023; Blattmann et al., 2023a; Singer et al., 2022). Due to their iterative nature and the associated computational costs, as well as the long sampling times during inference, research on formulations for more efficient training and/or faster sampling of these models has increased (Karras et al., 2023; Liu et al., 2022).

While specifying a forward path from data to noise leads to efficient training, it also raises the question of which path to choose. This choice can have important implications for sampling. For example, a forward process that fails to remove all noise from the data can lead to a discrepancy between training and test distribution and result in artifacts such as gray image samples (Lin et al., 2024). Importantly, the choice of the forward process also influences the learned backward process and, thus, the sampling efficiency. While curved paths require many integration steps to simulate the process, a straight path could be simulated with a single step and is less prone to error accumulation. Since each step corresponds to an evaluation of the neural network, this has a direct impact on the sampling speed.

A particular choice for the forward path is a so-called Rectified Flow (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023), which connects data and noise on a straight line. Although this model class has better theoretical properties, it has not yet become decisively established in practice. So far, some advantages have been empirically demonstrated in small and medium-sized experiments (Ma et al., 2024), but these are mostly limited to class-conditional models. In this work, we change this by introducing a re-weighting of the noise scales in rectified flow models, similar to noise-predictive diffusion models (Ho et al., 2020). Through a large-scale study, we compare our new formulation to existing diffusion formulations and demonstrate its benefits.

We show that the widely used approach for text-to-image synthesis, where a fixed text representation is fed directly into the model (e.g., via cross-attention (Vaswani et al., 2017; Rombach et al., 2022)), is not ideal, and present a new architecture that incorporates learnable streams for both image and text tokens, which enables a two-way flow of information between them. We combine this with our improved rectified flow formulation and investigate its scalability. We demonstrate a predictable scaling trend in the validation loss and show that a lower validation loss correlates strongly with improved automatic and human evaluations. Our largest models outperform state-of-the-art open models such as SDXL (Podell et al., 2023), SDXL-Turbo (Sauer et al., 2023), Pixart-α (Chen et al., 2023), and closed-source models such as DALL-E 3 (Betker et al., 2023) both in quantitative evaluation (Ghosh et al., 2023) of prompt understanding and in human preference ratings.

The core contributions of our work are: (i) We conduct a large-scale, systematic study on different diffusion model and rectified flow formulations to identify the best setting. For this purpose, we introduce new noise samplers for rectified flow models that improve performance over previously known samplers. (ii) We devise a novel, scalable architecture for text-to-image synthesis that allows bi-directional mixing between text and image token streams within the network. We show its benefits compared to established backbones such as UViT (Hoogeboom et al., 2023) and DiT (Peebles & Xie, 2023). Finally, we (iii) perform a scaling study of our model and demonstrate that it follows predictable scaling trends. We show that a lower validation loss correlates strongly with improved text-to-image performance assessed via metrics such as T2I-CompBench (Huang et al., 2023), GenEval (Ghosh et al., 2023) and human ratings. We make results, code, and model weights publicly available.
2. Simulation-Free Training of Flows

We consider generative models that define a mapping between samples $x_1$ from a noise distribution $p_1$ to samples $x_0$ from a data distribution $p_0$ in terms of an ordinary differential equation (ODE),

    $dy_t = v_\Theta(y_t, t)\, dt,$   (1)

where the velocity $v$ is parameterized by the weights $\Theta$ of a neural network. Prior work by Chen et al. (2018) suggested to directly solve Equation (1) via differentiable ODE solvers. However, this process is computationally expensive, especially for large network architectures that parameterize $v_\Theta(y_t, t)$. A more efficient alternative is to directly regress a vector field $u_t$ that generates a probability path between $p_0$ and $p_1$. To construct such a $u_t$, we define a forward process, corresponding to a probability path $p_t$ between $p_0$ and $p_1 = \mathcal{N}(0, I)$, as

    $z_t = a_t x_0 + b_t \epsilon, \quad \text{where } \epsilon \sim \mathcal{N}(0, I).$   (2)

For $a_0 = 1$, $b_0 = 0$, $a_1 = 0$ and $b_1 = 1$, the marginals

    $p_t(z_t) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\, p_t(z_t \mid \epsilon)$   (3)

are consistent with the data and noise distribution.

To express the relationship between $z_t$, $x_0$ and $\epsilon$, we introduce $\psi_t$ and $u_t$ as

    $\psi_t(\cdot \mid \epsilon): x_0 \mapsto a_t x_0 + b_t \epsilon,$   (4)
    $u_t(z \mid \epsilon) := \psi_t'\left(\psi_t^{-1}(z \mid \epsilon) \mid \epsilon\right).$   (5)

Since $z_t$ can be written as a solution to the ODE $z_t' = u_t(z_t \mid \epsilon)$ with initial value $z_0 = x_0$, $u_t(\cdot \mid \epsilon)$ generates $p_t(\cdot \mid \epsilon)$. Remarkably, one can construct a marginal vector field $u_t$ which generates the marginal probability paths $p_t$ (Lipman et al., 2023) (see B.1), using the conditional vector fields $u_t(\cdot \mid \epsilon)$:

    $u_t(z) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ u_t(z \mid \epsilon)\, \frac{p_t(z \mid \epsilon)}{p_t(z)} \right].$   (6)

While regressing $u_t$ with the Flow Matching objective

    $\mathcal{L}_{FM} = \mathbb{E}_{t, p_t(z)} \lVert v_\Theta(z, t) - u_t(z) \rVert_2^2$   (7)

directly is intractable due to the marginalization in Equation (6), Conditional Flow Matching (see B.1),

    $\mathcal{L}_{CFM} = \mathbb{E}_{t, p_t(z \mid \epsilon), p(\epsilon)} \lVert v_\Theta(z, t) - u_t(z \mid \epsilon) \rVert_2^2,$   (8)

with the conditional vector fields $u_t(z \mid \epsilon)$, provides an equivalent yet tractable objective.

To convert the loss into an explicit form, we insert $\psi_t(x_0 \mid \epsilon) = a_t x_0 + b_t \epsilon$ and $\psi_t'(x_0 \mid \epsilon) = a_t' x_0 + b_t' \epsilon$ into (5):

    $z_t' = u_t(z_t \mid \epsilon) = \frac{a_t'}{a_t} z_t - \epsilon\, b_t \left( \frac{a_t'}{a_t} - \frac{b_t'}{b_t} \right).$   (9)

Now, consider the signal-to-noise ratio $\lambda_t := \log \frac{a_t^2}{b_t^2}$. With $\lambda_t' = 2\left( \frac{a_t'}{a_t} - \frac{b_t'}{b_t} \right)$, we can rewrite Equation (9) as

    $u_t(z_t \mid \epsilon) = \frac{a_t'}{a_t} z_t - \frac{b_t}{2} \lambda_t'\, \epsilon.$   (10)

Next, we use Equation (10) to reparameterize Equation (8) as a noise-prediction objective:

    $\mathcal{L}_{CFM} = \mathbb{E}_{t, p_t(z \mid \epsilon), p(\epsilon)} \left\lVert v_\Theta(z, t) - \frac{a_t'}{a_t} z + \frac{b_t}{2} \lambda_t'\, \epsilon \right\rVert_2^2$   (11)
    $\qquad\;\;\, = \mathbb{E}_{t, p_t(z \mid \epsilon), p(\epsilon)} \left( -\frac{b_t}{2} \lambda_t' \right)^2 \left\lVert \epsilon_\Theta(z, t) - \epsilon \right\rVert_2^2,$   (12)

where we defined $\epsilon_\Theta := \frac{-2}{\lambda_t' b_t}\left( v_\Theta - \frac{a_t'}{a_t} z \right)$.

Note that the optimum of the above objective does not change when introducing a time-dependent weighting. Thus, one can derive various weighted loss functions that provide a signal towards the desired solution but might affect the optimization trajectory. For a unified analysis of different approaches, including classic diffusion formulations, we can write the objective in the following form (following Kingma & Gao (2023)):

    $\mathcal{L}_w(x_0) = -\frac{1}{2} \mathbb{E}_{t \sim \mathcal{U}(t),\, \epsilon \sim \mathcal{N}(0, I)} \left[ w_t \lambda_t' \lVert \epsilon_\Theta(z_t, t) - \epsilon \rVert^2 \right],$

where $w_t = -\frac{1}{2} \lambda_t' b_t^2$ corresponds to $\mathcal{L}_{CFM}$.
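To make this objective concrete, the following is a minimal PyTorch sketch of the Conditional Flow Matching loss of Equation (8), instantiated for the straight-path choice $a_t = 1 - t$, $b_t = t$ discussed in the next section. It is an illustration under our reading of the formulas, not the authors' released code; `model` is assumed to be any network mapping `(z_t, t)` to a velocity of the same shape as `z_t`, and `t` is assumed to be a 1-D batch of timesteps.

```python
import torch

def cfm_loss(model, x0, t):
    """Conditional Flow Matching loss (Eq. 8) for the straight-path process
    z_t = (1 - t) * x0 + t * eps, i.e. a_t = 1 - t, b_t = t. The conditional
    vector field u_t(z | eps) = a_t' x0 + b_t' eps then reduces to eps - x0."""
    eps = torch.randn_like(x0)                   # eps ~ N(0, I)
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))     # broadcast t over data dims
    z_t = (1.0 - t_) * x0 + t_ * eps             # forward process, Eq. (2)
    target = eps - x0                            # conditional velocity target
    v_pred = model(z_t, t)                       # v_theta(z_t, t)
    return ((v_pred - target) ** 2).mean()       # squared error of Eq. (8)
```

With $t$ drawn uniformly this is the plain rectified flow objective; the samplers of Section 3.1 only change how $t$ is drawn, which by Equation (18) is equivalent to re-weighting this loss.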
3. Flow Trajectories

In this work, we consider different variants of the above formalism that we briefly describe in the following.

Rectified Flow. Rectified Flows (RFs) (Liu et al., 2022; Albergo & Vanden-Eijnden, 2022; Lipman et al., 2023) define the forward process as straight paths between the data distribution and a standard normal distribution, i.e.

    $z_t = (1 - t) x_0 + t \epsilon,$   (13)

and use $\mathcal{L}_{CFM}$, which then corresponds to $\mathcal{L}_w$ with $w_t^{RF} = \frac{t}{1 - t}$. The network output directly parameterizes the velocity $v_\Theta$.

EDM. EDM (Karras et al., 2022) uses a forward process of the form

    $z_t = x_0 + b_t \epsilon,$   (14)

where (Kingma & Gao, 2023) $b_t = \exp\!\left( F_{\mathcal{N}}^{-1}(t \mid P_m, P_s^2) \right)$, with $F_{\mathcal{N}}^{-1}$ being the quantile function of the normal distribution with mean $P_m$ and variance $P_s^2$. Note that this choice results in

    $\lambda_t \sim \mathcal{N}(-2 P_m, (2 P_s)^2) \quad \text{for } t \sim \mathcal{U}(0, 1).$   (15)

The network is parameterized through an F-prediction (Kingma & Gao, 2023; Karras et al., 2022) and the loss can be written as $\mathcal{L}_w$ with

    $w_t^{EDM} = \mathcal{N}(\lambda_t \mid -2 P_m, (2 P_s)^2) \left( e^{-\lambda_t} + 0.5^2 \right).$   (16)

Cosine. (Nichol & Dhariwal, 2021) proposed a forward process of the form

    $z_t = \cos\!\left( \tfrac{\pi}{2} t \right) x_0 + \sin\!\left( \tfrac{\pi}{2} t \right) \epsilon.$   (17)

In combination with an $\epsilon$-parameterization and loss, this corresponds to a weighting $w_t = \mathrm{sech}(\lambda_t / 2)$. When combined with a v-prediction loss (Kingma & Gao, 2023), the weighting is given by $w_t = e^{-\lambda_t / 2}$.

(LDM-)Linear. LDM (Rombach et al., 2022) uses a modification of the DDPM schedule (Ho et al., 2020). Both are variance-preserving schedules, i.e. $b_t = \sqrt{1 - a_t^2}$, and define $a_t$ for discrete timesteps $t = 0, \dots, T-1$ in terms of diffusion coefficients $\beta_t$ as $a_t = \left( \prod_{s=0}^{t} (1 - \beta_s) \right)^{1/2}$. For given boundary values $\beta_0$ and $\beta_{T-1}$, DDPM uses $\beta_t = \beta_0 + \frac{t}{T-1} (\beta_{T-1} - \beta_0)$ and LDM uses $\beta_t = \left( \sqrt{\beta_0} + \frac{t}{T-1} \left( \sqrt{\beta_{T-1}} - \sqrt{\beta_0} \right) \right)^2$.

3.1. Tailored SNR Samplers for RF Models

The RF loss trains the velocity $v_\Theta$ uniformly on all timesteps in $[0, 1]$. Intuitively, however, the resulting velocity prediction target $\epsilon - x_0$ is more difficult for $t$ in the middle of $[0, 1]$, since for $t = 0$ the optimal prediction is the mean of $p_1$, and for $t = 1$ the optimal prediction is the mean of $p_0$. In general, changing the distribution over $t$ from the commonly used uniform distribution $\mathcal{U}(t)$ to a distribution with density $\pi(t)$ is equivalent to a weighted loss $\mathcal{L}_{w_t^\pi}$ with

    $w_t^\pi = \frac{t}{1 - t}\, \pi(t).$   (18)

Thus, we aim to give more weight to intermediate timesteps by sampling them more frequently. Next, we describe the timestep densities $\pi(t)$ that we use to train our models.

Logit-Normal Sampling. One option for a distribution that puts more weight on intermediate steps is the logit-normal distribution (Atchison & Shen, 1980). Its density,

    $\pi_{\mathrm{ln}}(t; m, s) = \frac{1}{s \sqrt{2\pi}}\, \frac{1}{t (1 - t)}\, \exp\!\left( - \frac{(\mathrm{logit}(t) - m)^2}{2 s^2} \right),$   (19)

where $\mathrm{logit}(t) = \log \frac{t}{1 - t}$, has a location parameter $m$ and a scale parameter $s$. The location parameter enables us to bias the training timesteps towards either data $p_0$ (negative $m$) or noise $p_1$ (positive $m$). As shown in Figure 11, the scale parameter $s$ controls how wide the distribution is. In practice, we sample the random variable $u$ from a normal distribution $u \sim \mathcal{N}(m, s)$ and map it through the standard logistic function.

Mode Sampling with Heavy Tails. The logit-normal density always vanishes at the endpoints 0 and 1. To study whether this has adverse effects on the performance, we also use a timestep sampling distribution with strictly positive density on $[0, 1]$. For a scale parameter $s$, we define

    $f_{\mathrm{mode}}(u; s) = 1 - u - s \cdot \left( \cos^2\!\left( \tfrac{\pi}{2} u \right) - 1 + u \right).$   (20)

For $-1 \le s \le \frac{2}{\pi - 2}$, this function is monotonic, and we can use it to sample from the implied density $\pi_{\mathrm{mode}}(t; s) = \left| \frac{d}{dt} f_{\mathrm{mode}}^{-1}(t) \right|$. As seen in Figure 11, the scale parameter controls the degree to which either the midpoint (positive $s$) or the endpoints (negative $s$) are favored during sampling. This formulation also includes a uniform weighting $\pi_{\mathrm{mode}}(t; s = 0) = \mathcal{U}(t)$ for $s = 0$, which has been used widely in previous works on Rectified Flows (Liu et al., 2022; Ma et al., 2024).

CosMap. Finally, we also consider the cosine schedule (Nichol & Dhariwal, 2021) from Section 3 in the RF setting. In particular, we are looking for a mapping $f: u \mapsto f(u) = t$, $u \in [0, 1]$, such that the log-snr matches that of the cosine schedule: $2 \log \frac{\cos(\frac{\pi}{2} u)}{\sin(\frac{\pi}{2} u)} = 2 \log \frac{1 - f(u)}{f(u)}$. Solving for $f$, we obtain for $u \sim \mathcal{U}(u)$

    $t = f(u) = 1 - \frac{1}{\tan\!\left( \frac{\pi}{2} u \right) + 1},$   (21)

from which we obtain the density

    $\pi_{\mathrm{CosMap}}(t) = \left| \frac{d}{dt} f^{-1}(t) \right| = \frac{2}{\pi - 2\pi t + 2\pi t^2}.$   (22)
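As a concrete reference for the three densities above, here is a minimal PyTorch sketch of the corresponding timestep samplers. The function names and signatures are illustrative rather than taken from the paper's code, and `m` and `s` are the free location and scale parameters discussed in the text.

```python
import math
import torch

def sample_t_logit_normal(n, m=0.0, s=1.0):
    """Logit-normal sampling (Eq. 19): draw u ~ N(m, s) and map it through
    the standard logistic function, concentrating t on intermediate steps."""
    u = m + s * torch.randn(n)
    return torch.sigmoid(u)

def sample_t_mode(n, s=1.0):
    """Mode sampling with heavy tails: push u ~ U(0, 1) through f_mode of
    Eq. (20); monotonic for -1 <= s <= 2/(pi - 2), and s = 0 recovers U(0, 1)."""
    u = torch.rand(n)
    return 1.0 - u - s * (torch.cos(math.pi * u / 2.0) ** 2 - 1.0 + u)

def sample_t_cosmap(n):
    """CosMap (Eq. 21): t = 1 - 1 / (tan(pi*u/2) + 1) for u ~ U(0, 1), so that
    the log-SNR matches the cosine schedule in the RF setting."""
    u = torch.rand(n)
    return 1.0 - 1.0 / (torch.tan(math.pi * u / 2.0) + 1.0)
```

Drawing $t$ from any of these densities while keeping the loss of Equation (8) unchanged is, by Equation (18), equivalent to training with the weighted loss $\mathcal{L}_{w_t^\pi}$.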
4. Text-to-Image Architecture

For text-conditional sampling of images, our model has to take both modalities, text and images, into account. We use pretrained models to derive suitable representations and then describe the architecture of our diffusion backbone. An overview of this is presented in Figure 2.

Our general setup follows LDM (Rombach et al., 2022) for training text-to-image models in the latent space of a pretrained autoencoder. Similar to the encoding of images to latent representations, we also follow previous approaches (Saharia et al., 2022b; Balaji et al., 2022) and encode the text conditioning $c$ using pretrained, frozen text models. Details can be found in Appendix B.2.

Multimodal Diffusion Backbone. Our architecture builds upon the DiT (Peebles & Xie, 2023) architecture. DiT only considers class-conditional image generation and uses a modulation mechanism to condition the network on both the timestep of the diffusion process and the class label. Similarly, we use embeddings of the timestep $t$ and $c_{\mathrm{vec}}$ as inputs to the modulation mechanism. However, as the pooled text representation retains only coarse-grained information about the text input (Podell et al., 2023), the network also requires information from the sequence representation $c_{\mathrm{ctxt}}$.

We construct a sequence consisting of embeddings of the text and image inputs. Specifically, we add positional encodings and flatten 2×2 patches of the latent pixel representation $x \in \mathbb{R}^{h \times w \times c}$ to a patch encoding sequence of length $\frac{1}{2} h \cdot \frac{1}{2} w$. After embedding this patch encoding and the text encoding $c_{\mathrm{ctxt}}$ to a common dimensionality, we concatenate the two sequences. We then follow DiT and apply a sequence of modulated attention and MLPs.

Figure 2. Our model architecture: (a) overview of all components; (b) one MM-DiT block. The RMS-Norm for Q and K can be added to stabilize training runs.

Since text and image embeddings are conceptually quite different, we use two separate sets of weights for the two modalities. As shown in Figure 2b, this is equivalent to having two independent transformers for each modality, but joining the sequences of the two modalities for the attention operation, such that both representations can work in their own space yet take the other one into account. For our scaling experiments, we parameterize the size of the model in terms of the model's depth $d$, i.e. the number of attention blocks, by setting the hidden size to $64 \cdot d$ (expanded to $4 \cdot 64 \cdot d$ channels in the MLP blocks), and the number of attention heads equal to $d$.

5. Experiments

5.1. Improving Rectified Flows

We aim to understand which of the approaches for simulation-free training of normalizing flows as in Equation (1) is the most efficient. To enable comparisons across different approaches, we control for the optimization algorithm, the model architecture, the dataset and samplers. In addition, the losses of different approaches are incomparable and also do not necessarily correlate with the quality of output samples; hence we need evaluation metrics that allow for a comparison between approaches. We train models on ImageNet (Russakovsky et al., 2014) and CC12M (Changpinyo et al., 2021), and evaluate both the training and the EMA weights of the models during training using validation losses, CLIP scores (Radford et al., 2021; Hessel et al., 2021), and FID (Heusel et al., 2017) under different sampler settings (different guidance scales and sampling steps). We calc
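As an illustration of the joint attention described in Section 4, the following is a minimal PyTorch sketch of the core of one MM-DiT block: separate projection weights per modality, but a single attention operation over the concatenated token sequences. It omits the timestep/pooled-text modulation, the optional RMS-Norm on Q and K, and the MLPs; the class and argument names are illustrative and not taken from the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttention(nn.Module):
    """Text and image tokens get separate QKV and output projections, but
    attention runs over the joined sequence so each modality attends to the
    other while keeping its own representation space."""
    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.qkv_img = nn.Linear(dim, 3 * dim)   # image-stream weights
        self.qkv_txt = nn.Linear(dim, 3 * dim)   # text-stream weights
        self.proj_img = nn.Linear(dim, dim)
        self.proj_txt = nn.Linear(dim, dim)

    def forward(self, img, txt):
        # img: (B, N_img, dim), txt: (B, N_txt, dim)
        B, n_img, dim = img.shape
        n_txt = txt.shape[1]
        head_dim = dim // self.num_heads

        def split_heads(x):
            return x.view(B, -1, self.num_heads, head_dim).transpose(1, 2)

        q_i, k_i, v_i = self.qkv_img(img).chunk(3, dim=-1)
        q_t, k_t, v_t = self.qkv_txt(txt).chunk(3, dim=-1)

        # Join the two sequences (text first, then image) for one attention op.
        q = split_heads(torch.cat([q_t, q_i], dim=1))
        k = split_heads(torch.cat([k_t, k_i], dim=1))
        v = split_heads(torch.cat([v_t, v_i], dim=1))

        out = F.scaled_dot_product_attention(q, k, v)            # (B, H, N, head_dim)
        out = out.transpose(1, 2).reshape(B, n_txt + n_img, dim)

        txt_out, img_out = out[:, :n_txt], out[:, n_txt:]
        return self.proj_img(img_out), self.proj_txt(txt_out)
```

Following the scaling rule stated in Section 4, a depth-$d$ model would use something like `JointAttention(dim=64 * d, num_heads=d)` inside each of its $d$ blocks.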
